Hello everyone,
I have written a backup script for my LAVA instance. While testing the restore process I stumbled upon issues. Are there any dependencies between the master and workers concerning backups? When the master crashes, but the worker does not, is it safe to restore the master only and keep the worker as it is? Or do I have to keep master and worker backups in sync and always restore both at the same time?
Restoring my master as described in the LAVA docs generally works. The web interface is back online, all the jobs and devices are in consistent states.
Restoring the worker is relatively easy, according to the docs. I installed the LAVA packages in their previous versions on a fresh (virtual) machine, restored /etc/lava-dispatcher/lava-slave and /etc/lava-coordinator/lava-coordinator.conf. The worker has status "online" in the LAVA web interface afterwards, so the communication seems to work.
However, starting a multinode job does not work. The job log says:
lava-dispatcher, installed at version: 2018.5.post1-2~bpo9+1 start: 0 validate Start time: 2018-12-18 12:25:14.335215+00:00 (UTC) This MultiNode test job contains top level actions, in order, of: deploy, boot, test, finalize lxc, installed at version: 1:2.0.7-2+deb9u2 validate duration: 0.01 case: validate case_id: 112 definition: lava result: pass Initialising group b6eb846d-689f-40c5-b193-8afce41883ee Connecting to LAVA Coordinator on lava-server-vm:3079 timeout=90 seconds.
This comes out in a loop, until the job times out.
The lava-slave logfile says:
2018-12-18 12:27:15,114 INFO master => START(12) 2018-12-18 12:27:15,117 INFO [12] Starting job [...] 2018-12-18 12:27:15,124 DEBUG [12] dispatch: 2018-12-18 12:27:15,124 DEBUG [12] env : {'overrides': {'LC_ALL': 'C.UTF-8', 'LANG': 'C', 'PATH': '/usr/local/bin:/usr/local/sbin:/bin:/usr/bin:/usr/sbin:/sbin'}, 'purge': True} 2018-12-18 12:27:15,124 DEBUG [12] env-dut : 2018-12-18 12:27:15,129 ERROR [EXIT] 'NoneType' object has no attribute 'send_start_ok' 2018-12-18 12:27:15,129 ERROR 'NoneType' object has no attribute 'send_start_ok'
It is the "job = jobs.create()" call in lava-slave's handle_start() routine which fails. Obviously there is a separate database on the worker (of which I did not know until now), which fails to be filled with values. Does this database have to be backup'ed and restored? What is the purpose of this database? Is there anything I need to know about it concerning backups?
Mit freundlichen Grüßen / Best regards Tim Jaacks DEVELOPMENT ENGINEER Garz & Fricke GmbH Tempowerkring 2 21079 Hamburg Direct: +49 40 791 899 - 55 Fax: +49 40 791899 - 39 tim.jaacks@garz-fricke.com www.garz-fricke.com WE MAKE IT YOURS!
Sitz der Gesellschaft: D-21079 Hamburg Registergericht: Amtsgericht Hamburg, HRB 60514 Geschäftsführer: Matthias Fricke, Manfred Garz, Marc-Michael Braun