On Tue, 18 Dec 2018 at 12:45, Tim Jaacks tim.jaacks@garz-fricke.com wrote:
Hello everyone,
I have written a backup script for my LAVA instance. While testing the restore process I stumbled upon issues. Are there any dependencies between the master and workers concerning backups? When the master crashes, but the worker does not, is it safe to restore the master only and keep the worker as it is? Or do I have to keep master and worker backups in sync and always restore both at the same time?
ZMQ buffers messages for a while (exactly how long depends on message volumes). However, if the buffer does fill up, simply restarting the service is enough.
So it is safe to start lava-master and lava-slave in either order; if there is a long gap between them, the lava-slave service might need to be restarted once the master is up.
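To illustrate why the start order is forgiving, here is a plain pyzmq sketch (not LAVA code, and the port is only an example): a socket can connect and send before its peer has bound, and the message waits in the local outgoing queue until the peer appears.

    # Plain pyzmq sketch (not LAVA code; the port is only an example).
    # A socket can connect and send before its peer binds; the message is
    # held in the local outgoing queue until the peer appears.
    import time
    import zmq

    ctx = zmq.Context()

    # "worker" side: connect and send while nothing is listening yet
    worker = ctx.socket(zmq.DEALER)
    worker.connect("tcp://127.0.0.1:5999")
    worker.send_multipart([b"HELLO"])      # queued locally, no peer bound yet

    time.sleep(1)                          # the "master" starts late

    # "master" side: bind afterwards and still receive the queued message
    master = ctx.socket(zmq.ROUTER)
    master.bind("tcp://127.0.0.1:5999")
    routing_id, msg = master.recv_multipart()
    print(msg)                             # b'HELLO'

If the master stays down longer than the queues can cover, restarting lava-slave once the master is back is all that is needed.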
Restoring my master as described in the LAVA docs generally works. The web interface is back online, all the jobs and devices are in consistent states.
Restoring the worker is relatively easy, according to the docs. I installed the LAVA packages in their previous versions on a fresh (virtual) machine, restored /etc/lava-dispatcher/lava-slave and /etc/lava-coordinator/lava-coordinator.conf. The worker has status "online" in the LAVA web interface afterwards, so the communication seems to work.
However, starting a multinode job does not work. The job log says:
Check that the lava-coordinator is running (wherever you installed it) and that it is configured on the worker. The coordinator can support multiple instances, but admins often install lava-coordinator alongside the lava-server package on the master and configure the workers to use the coordinator on the relevant master.
https://master.lavasoftware.org/static/docs/v2/first-installation.html#index...
https://master.lavasoftware.org/static/docs/v2/simple-admin.html#checking-fo...
Not every instance uses MultiNode, so this is an extra part of the backup/restore process for your lab.
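As a quick check from the worker, you can verify that the coordinator port is reachable. This is just a minimal Python sketch; the hostname and port are taken from your job log below (lava-server-vm:3079) and may differ in your lab, so adjust them to whatever /etc/lava-coordinator/lava-coordinator.conf says.

    # Minimal reachability check from the worker to the coordinator.
    # Host/port are taken from the job log below (lava-server-vm:3079);
    # adjust to match your lava-coordinator configuration.
    import socket

    COORDINATOR_HOST = "lava-server-vm"
    COORDINATOR_PORT = 3079

    try:
        with socket.create_connection((COORDINATOR_HOST, COORDINATOR_PORT), timeout=5):
            print("coordinator port is reachable")
    except OSError as exc:
        print(f"cannot reach coordinator: {exc}")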
lava-dispatcher, installed at version: 2018.5.post1-2~bpo9+1
start: 0 validate
Start time: 2018-12-18 12:25:14.335215+00:00 (UTC)
This MultiNode test job contains top level actions, in order, of: deploy, boot, test, finalize
lxc, installed at version: 1:2.0.7-2+deb9u2
validate duration: 0.01
case: validate
case_id: 112
definition: lava
result: pass
Initialising group b6eb846d-689f-40c5-b193-8afce41883ee
Connecting to LAVA Coordinator on lava-server-vm:3079 timeout=90 seconds.
This comes out in a loop, until the job times out.
The lava-slave logfile says:
2018-12-18 12:27:15,114 INFO master => START(12)
2018-12-18 12:27:15,117 INFO [12] Starting job [...]
2018-12-18 12:27:15,124 DEBUG [12] dispatch:
2018-12-18 12:27:15,124 DEBUG [12] env : {'overrides': {'LC_ALL': 'C.UTF-8', 'LANG': 'C', 'PATH': '/usr/local/bin:/usr/local/sbin:/bin:/usr/bin:/usr/sbin:/sbin'}, 'purge': True}
2018-12-18 12:27:15,124 DEBUG [12] env-dut :
2018-12-18 12:27:15,129 ERROR [EXIT] 'NoneType' object has no attribute 'send_start_ok'
2018-12-18 12:27:15,129 ERROR 'NoneType' object has no attribute 'send_start_ok'
It is the "job = jobs.create()" call in lava-slave's handle_start() routine which fails. Obviously there is a separate database on the worker (of which I did not know until now), which fails to be filled with values. Does this database have to be backup'ed and restored? What is the purpose of this database? Is there anything I need to know about it concerning backups?
The SQLite database on the worker is just to retain state so that the lava-slave service can be restarted without affecting running test jobs.
It should not be restored - the previous state of the worker needs to be cleared when doing a restore.
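If it helps, here is a rough sketch of what clearing the previous state can look like during a restore. The database path below is an assumption on my side (hypothetical); locate the actual SQLite file your lava-slave packaging uses before running anything like this.

    # Rough sketch only: stop the service, drop the stale worker state,
    # then start again.  The path below is hypothetical -- locate the
    # actual SQLite file your lava-slave packaging uses.
    import pathlib
    import subprocess

    STATE_DB = pathlib.Path("/var/lib/lava/dispatcher/slave/db.sqlite3")  # hypothetical path

    subprocess.run(["systemctl", "stop", "lava-slave"], check=True)
    if STATE_DB.exists():
        STATE_DB.unlink()   # previous worker state must not survive the restore
    subprocess.run(["systemctl", "start", "lava-slave"], check=True)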