Hello everyone,
I have written a backup script for my LAVA instance. While testing the restore process I stumbled upon issues. Are there any dependencies between the master and workers concerning backups? When the master crashes, but the worker does not, is it safe to restore the master only and keep the worker as it is? Or do I have to keep master and worker backups in sync and always restore both at the same time?
Restoring my master as described in the LAVA docs generally works. The web interface is back online, all the jobs and devices are in consistent states.
Restoring the worker is relatively easy, according to the docs. I installed the LAVA packages in their previous versions on a fresh (virtual) machine, restored /etc/lava-dispatcher/lava-slave and /etc/lava-coordinator/lava-coordinator.conf. The worker has status "online" in the LAVA web interface afterwards, so the communication seems to work.
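For context, the worker-related part of my backup script boils down to archiving exactly those two files (a trimmed-down sketch in Python; the destination directory is just an example and the file list only contains what I restore above):

    import tarfile
    from datetime import datetime

    # Worker-side configuration files mentioned above; extend as needed.
    WORKER_CONFIG = [
        "/etc/lava-dispatcher/lava-slave",
        "/etc/lava-coordinator/lava-coordinator.conf",
    ]

    def backup_worker_config(dest_dir="/var/backups"):
        """Archive the worker configuration into a timestamped tarball."""
        name = "%s/lava-worker-config-%s.tar.gz" % (
            dest_dir, datetime.now().strftime("%Y%m%d-%H%M%S"))
        with tarfile.open(name, "w:gz") as tar:
            for path in WORKER_CONFIG:
                tar.add(path)
        return name

    if __name__ == "__main__":
        print(backup_worker_config())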
However, starting a multinode job does not work. The job log says:
lava-dispatcher, installed at version: 2018.5.post1-2~bpo9+1
start: 0 validate
Start time: 2018-12-18 12:25:14.335215+00:00 (UTC)
This MultiNode test job contains top level actions, in order, of: deploy, boot, test, finalize
lxc, installed at version: 1:2.0.7-2+deb9u2
validate duration: 0.01
case: validate
case_id: 112
definition: lava
result: pass
Initialising group b6eb846d-689f-40c5-b193-8afce41883ee
Connecting to LAVA Coordinator on lava-server-vm:3079 timeout=90 seconds.
This output repeats in a loop until the job times out.
The lava-slave log file says:
2018-12-18 12:27:15,114 INFO master => START(12)
2018-12-18 12:27:15,117 INFO [12] Starting job
[...]
2018-12-18 12:27:15,124 DEBUG [12] dispatch:
2018-12-18 12:27:15,124 DEBUG [12] env : {'overrides': {'LC_ALL': 'C.UTF-8', 'LANG': 'C', 'PATH': '/usr/local/bin:/usr/local/sbin:/bin:/usr/bin:/usr/sbin:/sbin'}, 'purge': True}
2018-12-18 12:27:15,124 DEBUG [12] env-dut :
2018-12-18 12:27:15,129 ERROR [EXIT] 'NoneType' object has no attribute 'send_start_ok'
2018-12-18 12:27:15,129 ERROR 'NoneType' object has no attribute 'send_start_ok'
It is the "job = jobs.create()" call in lava-slave's handle_start() routine that fails. Apparently there is a separate database on the worker (which I did not know about until now) that is not being populated. Does this database have to be backed up and restored? What is the purpose of this database? Is there anything I need to know about it concerning backups?
Mit freundlichen Grüßen / Best regards
Tim Jaacks
DEVELOPMENT ENGINEER
Garz & Fricke GmbH
Tempowerkring 2
21079 Hamburg
Direct: +49 40 791 899 - 55
Fax: +49 40 791 899 - 39
tim.jaacks@garz-fricke.com
www.garz-fricke.com
WE MAKE IT YOURS!
Registered office: D-21079 Hamburg
Register court: Amtsgericht Hamburg, HRB 60514
Managing directors: Matthias Fricke, Manfred Garz, Marc-Michael Braun
On Tue, 18 Dec 2018 at 12:45, Tim Jaacks tim.jaacks@garz-fricke.com wrote:
Hello everyone,
I have written a backup script for my LAVA instance. While testing the restore process I stumbled upon issues. Are there any dependencies between the master and workers concerning backups? When the master crashes, but the worker does not, is it safe to restore the master only and keep the worker as it is? Or do I have to keep master and worker backups in sync and always restore both at the same time?
ZMQ buffers messages for a little time (exactly how long depends on message volumes). However, if the buffer does fill, just restarting the service will be fine.
So it is safe to start lava-master and lava-slave in either order; if there is a long latency, the lava-slave service might need to be restarted once the master is up.
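If you want to check the link from the worker before restarting, a plain TCP connect is usually enough as a first test (a sketch; hostname and port are assumptions, use whatever master address is configured in /etc/lava-dispatcher/lava-slave on the worker, 5556 being the usual default):

    import socket

    # Hostname and port are assumptions -- use the master address configured
    # in /etc/lava-dispatcher/lava-slave on the worker.
    try:
        with socket.create_connection(("lava-server-vm", 5556), timeout=5):
            print("master ZMQ port reachable")
    except OSError as exc:
        print("cannot reach master: %s" % exc)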
Restoring my master as described in the LAVA docs generally works. The web interface is back online, all the jobs and devices are in consistent states.
Restoring the worker is relatively easy, according to the docs. I installed the LAVA packages in their previous versions on a fresh (virtual) machine, restored /etc/lava-dispatcher/lava-slave and /etc/lava-coordinator/lava-coordinator.conf. The worker has status "online" in the LAVA web interface afterwards, so the communication seems to work.
However, starting a multinode job does not work. The job log says:
Check that the lava-coordinator is running (wherever you installed it) and is configured on the worker. The coordinator is capable of supporting multiple instances but often admins will install a lava-coordinator alongside the lava-server package on the master and configure workers to use the coordinator on the relevant master.
https://master.lavasoftware.org/static/docs/v2/first-installation.html#index...
https://master.lavasoftware.org/static/docs/v2/simple-admin.html#checking-fo...
Not every instance uses MultiNode, so this is an extra part of the backup/restore process for your lab.
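To see which coordinator a worker will actually talk to, you can simply dump its configuration on that worker (a sketch; I'm assuming the usual JSON format of the file, with keys such as coordinator_hostname, port, blocksize and poll_delay):

    import json

    # Print the coordinator settings this worker will use.
    with open("/etc/lava-coordinator/lava-coordinator.conf") as f:
        for key, value in sorted(json.load(f).items()):
            print("%s = %s" % (key, value))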
lava-dispatcher, installed at version: 2018.5.post1-2~bpo9+1
start: 0 validate
Start time: 2018-12-18 12:25:14.335215+00:00 (UTC)
This MultiNode test job contains top level actions, in order, of: deploy, boot, test, finalize
lxc, installed at version: 1:2.0.7-2+deb9u2
validate duration: 0.01
case: validate
case_id: 112
definition: lava
result: pass
Initialising group b6eb846d-689f-40c5-b193-8afce41883ee
Connecting to LAVA Coordinator on lava-server-vm:3079 timeout=90 seconds.
This output repeats in a loop until the job times out.
The lava-slave log file says:
2018-12-18 12:27:15,114 INFO master => START(12)
2018-12-18 12:27:15,117 INFO [12] Starting job
[...]
2018-12-18 12:27:15,124 DEBUG [12] dispatch:
2018-12-18 12:27:15,124 DEBUG [12] env : {'overrides': {'LC_ALL': 'C.UTF-8', 'LANG': 'C', 'PATH': '/usr/local/bin:/usr/local/sbin:/bin:/usr/bin:/usr/sbin:/sbin'}, 'purge': True}
2018-12-18 12:27:15,124 DEBUG [12] env-dut :
2018-12-18 12:27:15,129 ERROR [EXIT] 'NoneType' object has no attribute 'send_start_ok'
2018-12-18 12:27:15,129 ERROR 'NoneType' object has no attribute 'send_start_ok'
It is the "job = jobs.create()" call in lava-slave's handle_start() routine that fails. Apparently there is a separate database on the worker (which I did not know about until now) that is not being populated. Does this database have to be backed up and restored? What is the purpose of this database? Is there anything I need to know about it concerning backups?
The SQLite database on the worker is just to retain state so that the lava-slave service can be restarted without affecting running test jobs.
It should not be restored - the previous state of the worker needs to be cleared when doing a restore.
-----Original Message-----
From: Neil Williams neil.williams@linaro.org
Sent: Tuesday, 18 December 2018 14:20
To: Tim Jaacks tim.jaacks@garz-fricke.com
Cc: lava-users@lists.lavasoftware.org
Subject: Re: [Lava-users] Dependencies between master and worker when restoring from backups
On Tue, 18 Dec 2018 at 12:45, Tim Jaacks tim.jaacks@garz-fricke.com wrote:
Hello everyone,
I have written a backup script for my LAVA instance. While testing the restore process I stumbled upon issues. Are there any dependencies between the master and workers concerning backups? When the master crashes, but the worker does not, is it safe to restore the master only and keep the worker as it is? Or do I have to keep master and worker backups in sync and always restore both at the same time?
ZMQ buffers messages for a little time (exactly how long depends on message volumes). However, if the buffer does fill, just restarting the service will be fine.
So it is safe to start lava-master and lava-slave in either order; if there is a long latency, the lava-slave service might need to be restarted once the master is up.
That is good to know, thank you.
Restoring my master as described in the LAVA docs generally works. The web interface is back online, all the jobs and devices are in consistent states.
Restoring the worker is relatively easy, according to the docs. I installed the LAVA packages in their previous versions on a fresh (virtual) machine, restored /etc/lava-dispatcher/lava-slave and /etc/lava-coordinator/lava-coordinator.conf. The worker has status "online" in the LAVA web interface afterwards, so the communication seems to work.
However, starting a multinode job does not work. The job log says:
Check that the lava-coordinator is running (wherever you installed it) and is configured on the worker. The coordinator is capable of supporting multiple instances but often admins will install a lava-coordinator alongside the lava-server package on the master and configure workers to use the coordinator on the relevant master.
https://master.lavasoftware.org/static/docs/v2/first-installation.html#index...
https://master.lavasoftware.org/static/docs/v2/simple-admin.html#checking-fo...
Not every instance uses MultiNode, so this is an extra part of the backup/restore process for your lab.
I have installed the lava-coordinator on the server and it is running. The log says:
2018-12-18 12:27:13,618 lava_wait: {u'blocksize': 4096, u'client_name': u'12', u'hostname': u'lava-worker-vm', u'request': u'lava_wait', u'nodeID': u'12', u'group_name': u'b6eb846d-689f-40c5-b193-8afce41883ee', u'host': u'lava-server-vm', u'role': u'node_b', u'messageID': u'node_a_info', u'poll_delay': 3, u'port': 3079}
2018-12-18 12:27:13,618 MessageID node_a_info not yet seen for 12
2018-12-18 12:27:13,618 Ready to accept new connections
2018-12-18 12:27:15,570 Group complete, starting tests
2018-12-18 12:27:15,570 Ready to accept new connections
2018-12-18 12:27:15,639 clear Group Data: 2 of 2
2018-12-18 12:27:15,640 Clearing group data for b6eb846d-689f-40c5-b193-8afce41883ee
2018-12-18 12:27:15,640 Ready to accept new connections
Can you tell what goes wrong here? Are there any parts of the lava-coordinator which need to be backed up and restored additionally?
lava-dispatcher, installed at version: 2018.5.post1-2~bpo9+1 start: 0 validate Start time: 2018-12-18 12:25:14.335215+00:00 (UTC) This MultiNode test job contains top level actions, in order, of: deploy, boot, test, finalize lxc, installed at version: 1:2.0.7-2+deb9u2 validate duration: 0.01 case: validate case_id: 112 definition: lava result: pass Initialising group b6eb846d-689f-40c5-b193-8afce41883ee Connecting to LAVA Coordinator on lava-server-vm:3079 timeout=90 seconds.
This comes out in a loop, until the job times out.
The lava-slave logfile says:
2018-12-18 12:27:15,114 INFO master => START(12) 2018-12-18 12:27:15,117 INFO [12] Starting job [...] 2018-12-18 12:27:15,124 DEBUG [12] dispatch: 2018-12-18 12:27:15,124 DEBUG [12] env : {'overrides': {'LC_ALL': 'C.UTF-8', 'LANG': 'C', 'PATH': '/usr/local/bin:/usr/local/sbin:/bin:/usr/bin:/usr/sbin:/sbin'}, 'purge': True} 2018-12-18 12:27:15,124 DEBUG [12] env-dut : 2018-12-18 12:27:15,129 ERROR [EXIT] 'NoneType' object has no attribute 'send_start_ok' 2018-12-18 12:27:15,129 ERROR 'NoneType' object has no attribute 'send_start_ok'
It is the "job = jobs.create()" call in lava-slave's handle_start() routine which fails. Obviously there is a separate database on the worker (of which I did not know until now), which fails to be filled with values. Does this database have to be backup'ed and restored? What is the purpose of this database? Is there anything I need to know about it concerning backups?
The SQLite database on the worker is just to retain state so that the lava-slave service can be restarted without affecting running test jobs.
It should not be restored - the previous state of the worker needs to be cleared when doing a restore.
Thanks, that is also good to know. If I get it right this means: when a LAVA master breaks down and I have to restore it from a backup, I MUST NOT leave the worker as it is, but instead reset it to a clean state (i.e. a fresh install). Is this correct?
On Tue, 18 Dec 2018 at 13:35, Tim Jaacks tim.jaacks@garz-fricke.com wrote:
-----Original Message-----
From: Neil Williams neil.williams@linaro.org
Sent: Tuesday, 18 December 2018 14:20
To: Tim Jaacks tim.jaacks@garz-fricke.com
Cc: lava-users@lists.lavasoftware.org
Subject: Re: [Lava-users] Dependencies between master and worker when restoring from backups
On Tue, 18 Dec 2018 at 12:45, Tim Jaacks tim.jaacks@garz-fricke.com wrote:
Hello everyone,
I have written a backup script for my LAVA instance. While testing the restore process I stumbled upon issues. Are there any dependencies between the master and workers concerning backups? When the master crashes, but the worker does not, is it safe to restore the master only and keep the worker as it is? Or do I have to keep master and worker backups in sync and always restore both at the same time?
ZMQ buffers messages for a little time (exactly how long depends on message volumes). However, if the buffer does fill, just restarting the service will be fine.
So it is safe to start lava-master and lava-slave in either order; if there is a long latency, the lava-slave service might need to be restarted once the master is up.
That is good to know, thank you.
Restoring my master as described in the LAVA docs generally works. The web interface is back online, all the jobs and devices are in consistent states.
Restoring the worker is relatively easy, according to the docs. I installed the LAVA packages in their previous versions on a fresh (virtual) machine, restored /etc/lava-dispatcher/lava-slave and /etc/lava-coordinator/lava-coordinator.conf. The worker has status "online" in the LAVA web interface afterwards, so the communication seems to work.
However, starting a multinode job does not work. The job log says:
Check that the lava-coordinator is running (wherever you installed it) and is configured on the worker. The coordinator is capable of supporting multiple instances but often admins will install a lava-coordinator alongside the lava-server package on the master and configure workers to use the coordinator on the relevant master.
https://master.lavasoftware.org/static/docs/v2/first-installation.html#index...
https://master.lavasoftware.org/static/docs/v2/simple-admin.html#checking-fo...
Not every instance uses MultiNode, so this is an extra part of the backup/restore process for your lab.
I have installed the lava-coordinator on the server and it is running. The log says:
2018-12-18 12:27:13,618 lava_wait: {u'blocksize': 4096, u'client_name': u'12', u'hostname': u'lava-worker-vm', u'request': u'lava_wait', u'nodeID': u'12', u'group_name': u'b6eb846d-689f-40c5-b193-8afce41883ee', u'host': u'lava-server-vm', u'role': u'node_b', u'messageID': u'node_a_info', u'poll_delay': 3, u'port': 3079}
2018-12-18 12:27:13,618 MessageID node_a_info not yet seen for 12
2018-12-18 12:27:13,618 Ready to accept new connections
2018-12-18 12:27:15,570 Group complete, starting tests
2018-12-18 12:27:15,570 Ready to accept new connections
2018-12-18 12:27:15,639 clear Group Data: 2 of 2
2018-12-18 12:27:15,640 Clearing group data for b6eb846d-689f-40c5-b193-8afce41883ee
2018-12-18 12:27:15,640 Ready to accept new connections
Can you tell what goes wrong here?
https://master.lavasoftware.org/static/docs/v2/simple-admin.html#checking-fo...
Could be a network issue between the worker and the coordinator?
The status.py script described in the documentation will pick up the settings from /etc/lava-coordinator/lava-coordinator.conf on the worker, so you can copy just that script onto the worker and check operability. You'll see the results of the test in the lava-coordinator logs too.
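If copying status.py across isn't convenient straight away, a first-order check is whether the worker can open a TCP connection to the configured coordinator at all (only a sketch; the real status.py test from the docs exercises a full group transaction, which is the check that actually matters):

    import json
    import socket

    # Use the same settings status.py would pick up on the worker.
    with open("/etc/lava-coordinator/lava-coordinator.conf") as f:
        conf = json.load(f)

    # The key name is an assumption -- adjust if your file differs.
    host = conf.get("coordinator_hostname", "localhost")
    port = int(conf.get("port", 3079))

    try:
        with socket.create_connection((host, port), timeout=10):
            print("TCP connection to %s:%d succeeded" % (host, port))
    except OSError as exc:
        print("cannot reach coordinator at %s:%d: %s" % (host, port, exc))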
BTW: this will be improved, eventually. We have plans to move the coordinator inside the master and let it use the existing ZMQ support instead of needing its own configuration. However, that isn't likely to get into the next release at the moment. https://git.lavasoftware.org/lava/lava/issues/45 and https://git.lavasoftware.org/lava/lava/issues/44
Are there any parts of the lava-coordinator which need to be backed up and restored additionally?
Only /etc/lava-coordinator/lava-coordinator.conf on the worker.
lava-dispatcher, installed at version: 2018.5.post1-2~bpo9+1
start: 0 validate
Start time: 2018-12-18 12:25:14.335215+00:00 (UTC)
This MultiNode test job contains top level actions, in order, of: deploy, boot, test, finalize
lxc, installed at version: 1:2.0.7-2+deb9u2
validate duration: 0.01
case: validate
case_id: 112
definition: lava
result: pass
Initialising group b6eb846d-689f-40c5-b193-8afce41883ee
Connecting to LAVA Coordinator on lava-server-vm:3079 timeout=90 seconds.
This output repeats in a loop until the job times out.
The lava-slave log file says:
2018-12-18 12:27:15,114 INFO master => START(12)
2018-12-18 12:27:15,117 INFO [12] Starting job
[...]
2018-12-18 12:27:15,124 DEBUG [12] dispatch:
2018-12-18 12:27:15,124 DEBUG [12] env : {'overrides': {'LC_ALL': 'C.UTF-8', 'LANG': 'C', 'PATH': '/usr/local/bin:/usr/local/sbin:/bin:/usr/bin:/usr/sbin:/sbin'}, 'purge': True}
2018-12-18 12:27:15,124 DEBUG [12] env-dut :
2018-12-18 12:27:15,129 ERROR [EXIT] 'NoneType' object has no attribute 'send_start_ok'
2018-12-18 12:27:15,129 ERROR 'NoneType' object has no attribute 'send_start_ok'
It is the "job = jobs.create()" call in lava-slave's handle_start() routine that fails. Apparently there is a separate database on the worker (which I did not know about until now) that is not being populated. Does this database have to be backed up and restored? What is the purpose of this database? Is there anything I need to know about it concerning backups?
The SQLite database on the worker is just to retain state so that the lava-slave service can be restarted without affecting running test jobs.
It should not be restored - the previous state of the worker needs to be cleared when doing a restore.
Thanks, that is also good to know. If I get it right this means: when a LAVA master breaks down and I have to restore it from a backup, I MUST NOT leave the worker as it is, but instead reset it to a clean state (i.e. a fresh install). Is this correct?
-----Original Message-----
From: Neil Williams neil.williams@linaro.org
Sent: Tuesday, 18 December 2018 14:41
To: Tim Jaacks tim.jaacks@garz-fricke.com
Cc: lava-users@lists.lavasoftware.org
Subject: Re: [Lava-users] Dependencies between master and worker when restoring from backups
On Tue, 18 Dec 2018 at 13:35, Tim Jaacks tim.jaacks@garz-fricke.com wrote:
-----Original Message-----
From: Neil Williams neil.williams@linaro.org
Sent: Tuesday, 18 December 2018 14:20
To: Tim Jaacks tim.jaacks@garz-fricke.com
Cc: lava-users@lists.lavasoftware.org
Subject: Re: [Lava-users] Dependencies between master and worker when restoring from backups
On Tue, 18 Dec 2018 at 12:45, Tim Jaacks tim.jaacks@garz-fricke.com wrote:
Hello everyone,
I have written a backup script for my LAVA instance. While testing the restore process I stumbled upon issues. Are there any dependencies between the master and workers concerning backups? When the master crashes, but the worker does not, is it safe to restore the master only and keep the worker as it is? Or do I have to keep master and worker backups in sync and always restore both at the same time?
ZMQ buffers messages for a little time (exactly how long depends on message volumes). However, if the buffer does fill, just restarting the service will be fine.
So it is safe to start lava-master and lava-slave in either order; if there is a long latency, the lava-slave service might need to be restarted once the master is up.
That is good to know, thank you.
Restoring my master as described in the LAVA docs generally works. The web interface is back online, all the jobs and devices are in consistent states.
Restoring the worker is relatively easy, according to the docs. I installed the LAVA packages in their previous versions on a fresh (virtual) machine, restored /etc/lava-dispatcher/lava-slave and /etc/lava-coordinator/lava-coordinator.conf. The worker has status "online" in the LAVA web interface afterwards, so the communication seems to work.
However, starting a multinode job does not work. The job log says:
Check that the lava-coordinator is running (wherever you installed it) and is configured on the worker. The coordinator is capable of supporting multiple instances but often admins will install a lava-coordinator alongside the lava-server package on the master and configure workers to use the coordinator on the relevant master.
https://master.lavasoftware.org/static/docs/v2/first-installation.html#index-2
https://master.lavasoftware.org/static/docs/v2/simple-admin.html#checking-for-multinode-issues
Not every instance uses MultiNode, so this is an extra part of the backup/restore process for your lab.
I have installed the lava-coordinator on the server and it is running. The log says:
2018-12-18 12:27:13,618 lava_wait: {u'blocksize': 4096, u'client_name': u'12', u'hostname': u'lava-worker-vm', u'request': u'lava_wait', u'nodeID': u'12', u'group_name': u'b6eb846d-689f-40c5-b193-8afce41883ee', u'host': u'lava-server-vm', u'role': u'node_b', u'messageID': u'node_a_info', u'poll_delay': 3, u'port': 3079}
2018-12-18 12:27:13,618 MessageID node_a_info not yet seen for 12
2018-12-18 12:27:13,618 Ready to accept new connections
2018-12-18 12:27:15,570 Group complete, starting tests
2018-12-18 12:27:15,570 Ready to accept new connections
2018-12-18 12:27:15,639 clear Group Data: 2 of 2
2018-12-18 12:27:15,640 Clearing group data for b6eb846d-689f-40c5-b193-8afce41883ee
2018-12-18 12:27:15,640 Ready to accept new connections
Can you tell what goes wrong here?
https://master.lavasoftware.org/static/docs/v2/simple-admin.html#checking-fo...
Could be a network issue between the worker and the coordinator?
The status.py script described in the documentation will pick up the settings from /etc/lava-coordinator/lava-coordinator.conf on the worker, so you can copy just that script onto the worker and check operability. You'll see the results of the test in the lava-coordinator logs too.
BTW: this will be improved, eventually. We have plans to move the coordinator inside the master and let it use the existing ZMQ support instead of needing its own configuration. However, that isn't likely to get into the next release at the moment. https://git.lavasoftware.org/lava/lava/issues/45 and https://git.lavasoftware.org/lava/lava/issues/44
Are there any parts of the lava-coordinator which need to be backed up and restored additionally?
Only /etc/lava-coordinator/lava-coordinator.conf on the worker.
Thanks for the info, Neil. A network issue seems unlikely, but I will investigate this further.
Can you quickly take a look at the very bottom of this discussion and tell me if I got you right with your last line?
lava-dispatcher, installed at version: 2018.5.post1-2~bpo9+1
start: 0 validate
Start time: 2018-12-18 12:25:14.335215+00:00 (UTC)
This MultiNode test job contains top level actions, in order, of: deploy, boot, test, finalize
lxc, installed at version: 1:2.0.7-2+deb9u2
validate duration: 0.01
case: validate
case_id: 112
definition: lava
result: pass
Initialising group b6eb846d-689f-40c5-b193-8afce41883ee
Connecting to LAVA Coordinator on lava-server-vm:3079 timeout=90 seconds.
This output repeats in a loop until the job times out.
The lava-slave log file says:
2018-12-18 12:27:15,114 INFO master => START(12)
2018-12-18 12:27:15,117 INFO [12] Starting job
[...]
2018-12-18 12:27:15,124 DEBUG [12] dispatch:
2018-12-18 12:27:15,124 DEBUG [12] env : {'overrides': {'LC_ALL': 'C.UTF-8', 'LANG': 'C', 'PATH': '/usr/local/bin:/usr/local/sbin:/bin:/usr/bin:/usr/sbin:/sbin'}, 'purge': True}
2018-12-18 12:27:15,124 DEBUG [12] env-dut :
2018-12-18 12:27:15,129 ERROR [EXIT] 'NoneType' object has no attribute 'send_start_ok'
2018-12-18 12:27:15,129 ERROR 'NoneType' object has no attribute 'send_start_ok'
It is the "job = jobs.create()" call in lava-slave's handle_start() routine that fails. Apparently there is a separate database on the worker (which I did not know about until now) that is not being populated. Does this database have to be backed up and restored? What is the purpose of this database? Is there anything I need to know about it concerning backups?
The SQLite database on the worker is just to retain state so that the lava-slave service can be restarted without affecting running test jobs.
It should not be restored - the previous state of the worker needs to be cleared when doing a restore.
Thanks, that is also good to know. If I get it right this means: when a LAVA master breaks down and I have to restore it from a backup, I MUST NOT leave the worker as it is, but instead reset it to a clean state (i.e. a fresh install). Is this correct?
Did I get this right?
On Tue, 18 Dec 2018 at 13:47, Tim Jaacks tim.jaacks@garz-fricke.com wrote:
-----Original Message-----
From: Neil Williams neil.williams@linaro.org
Sent: Tuesday, 18 December 2018 14:41
To: Tim Jaacks tim.jaacks@garz-fricke.com
Cc: lava-users@lists.lavasoftware.org
Subject: Re: [Lava-users] Dependencies between master and worker when restoring from backups
On Tue, 18 Dec 2018 at 13:35, Tim Jaacks tim.jaacks@garz-fricke.com wrote:
-----Original Message-----
From: Neil Williams neil.williams@linaro.org
Sent: Tuesday, 18 December 2018 14:20
To: Tim Jaacks tim.jaacks@garz-fricke.com
Cc: lava-users@lists.lavasoftware.org
Subject: Re: [Lava-users] Dependencies between master and worker when restoring from backups
On Tue, 18 Dec 2018 at 12:45, Tim Jaacks tim.jaacks@garz-fricke.com wrote:
Hello everyone,
I have written a backup script for my LAVA instance. While testing the restore process I stumbled upon issues. Are there any dependencies between the master and workers concerning backups? When the master crashes, but the worker does not, is it safe to restore the master only and keep the worker as it is? Or do I have to keep master and worker backups in sync and always restore both at the same time?
ZMQ buffers messages for a little time (exactly how long depends on message volumes). However, if the buffer does fill, just restarting the service will be fine.
So it is safe to start lava-master and lava-slave in either order; if there is a long latency, the lava-slave service might need to be restarted once the master is up.
That is good to know, thank you.
Restoring my master as described in the LAVA docs generally works. The web interface is back online, all the jobs and devices are in consistent states.
Restoring the worker is relatively easy, according to the docs. I installed the LAVA packages in their previous versions on a fresh (virtual) machine, restored /etc/lava-dispatcher/lava-slave and /etc/lava-coordinator/lava-coordinator.conf. The worker has status "online" in the LAVA web interface afterwards, so the communication seems to work.
However, starting a multinode job does not work. The job log says:
Check that the lava-coordinator is running (wherever you installed it) and is configured on the worker. The coordinator is capable of supporting multiple instances but often admins will install a lava-coordinator alongside the lava-server package on the master and configure workers to use the coordinator on the relevant master.
https://master.lavasoftware.org/static/docs/v2/first-installation.html#index-2
https://master.lavasoftware.org/static/docs/v2/simple-admin.html#checking-for-multinode-issues
Not every instance uses MultiNode, so this is an extra part of the backup/restore process for your lab.
I have installed the lava-coordinator on the server and it is running. The log says:
2018-12-18 12:27:13,618 lava_wait: {u'blocksize': 4096, u'client_name': u'12', u'hostname': u'lava-worker-vm', u'request': u'lava_wait', u'nodeID': u'12', u'group_name': u'b6eb846d-689f-40c5-b193-8afce41883ee', u'host': u'lava-server-vm', u'role': u'node_b', u'messageID': u'node_a_info', u'poll_delay': 3, u'port': 3079}
2018-12-18 12:27:13,618 MessageID node_a_info not yet seen for 12
2018-12-18 12:27:13,618 Ready to accept new connections
2018-12-18 12:27:15,570 Group complete, starting tests
2018-12-18 12:27:15,570 Ready to accept new connections
2018-12-18 12:27:15,639 clear Group Data: 2 of 2
2018-12-18 12:27:15,640 Clearing group data for b6eb846d-689f-40c5-b193-8afce41883ee
2018-12-18 12:27:15,640 Ready to accept new connections
Can you tell what goes wrong here?
https://master.lavasoftware.org/static/docs/v2/simple-admin.html#checking-fo...
Could be a network issue between the worker and the coordinator?
The status.py script described in the documentation will pick up the settings from /etc/lava-coordinator/lava-coordinator.conf on the worker, so you can copy just that script onto the worker and check operability. You'll see the results of the test in the lava-coordinator logs too.
BTW: this will be improved, eventually. We have plans to move the coordinator inside the master and let it use the existing ZMQ support instead of needing its own configuration. However, that isn't likely to get into the next release at the moment. https://git.lavasoftware.org/lava/lava/issues/45 and https://git.lavasoftware.org/lava/lava/issues/44
Are there any parts of the lava-coordinator which need to be backed up and restored additionally?
Only /etc/lava-coordinator/lava-coordinator.conf on the worker.
Thanks for the info, Neil. A network issue seems unlikely, but I will investigate this further.
Can you quickly take a look at the very bottom of this discussion and tell me if I got you right with your last line?
lava-dispatcher, installed at version: 2018.5.post1-2~bpo9+1
start: 0 validate
Start time: 2018-12-18 12:25:14.335215+00:00 (UTC)
This MultiNode test job contains top level actions, in order, of: deploy, boot, test, finalize
lxc, installed at version: 1:2.0.7-2+deb9u2
validate duration: 0.01
case: validate
case_id: 112
definition: lava
result: pass
Initialising group b6eb846d-689f-40c5-b193-8afce41883ee
Connecting to LAVA Coordinator on lava-server-vm:3079 timeout=90 seconds.
This output repeats in a loop until the job times out.
The lava-slave log file says:
2018-12-18 12:27:15,114 INFO master => START(12)
2018-12-18 12:27:15,117 INFO [12] Starting job
[...]
2018-12-18 12:27:15,124 DEBUG [12] dispatch:
2018-12-18 12:27:15,124 DEBUG [12] env : {'overrides': {'LC_ALL': 'C.UTF-8', 'LANG': 'C', 'PATH': '/usr/local/bin:/usr/local/sbin:/bin:/usr/bin:/usr/sbin:/sbin'}, 'purge': True}
2018-12-18 12:27:15,124 DEBUG [12] env-dut :
2018-12-18 12:27:15,129 ERROR [EXIT] 'NoneType' object has no attribute 'send_start_ok'
2018-12-18 12:27:15,129 ERROR 'NoneType' object has no attribute 'send_start_ok'
It is the "job = jobs.create()" call in lava-slave's handle_start() routine that fails. Apparently there is a separate database on the worker (which I did not know about until now) that is not being populated. Does this database have to be backed up and restored? What is the purpose of this database? Is there anything I need to know about it concerning backups?
The SQLite database on the worker is just to retain state so that the lava-slave service can be restarted without affecting running test jobs.
It should not be restored - the previous state of the worker needs to be cleared when doing a restore.
i.e. the internal state within the lava-slave. Whatever jobs were running will no longer be running when the restore is complete. There's no need to restore the state database.
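If you want to be sure that no stale state survives a restore, clearing it explicitly looks roughly like this (a sketch; the location of the state database differs between lava-dispatcher versions, so treat the path below as a placeholder and check your own installation):

    import os
    import subprocess

    # Placeholder path -- locate the actual state database of your
    # lava-dispatcher version before using this.
    STATE_DB = "/var/lib/lava/dispatcher/slave/db"

    # Stop lava-slave, drop its job-state database, start it again.
    subprocess.check_call(["systemctl", "stop", "lava-slave"])
    if os.path.exists(STATE_DB):
        os.remove(STATE_DB)
    subprocess.check_call(["systemctl", "start", "lava-slave"])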
Thanks, that is also good to know. If I get it right this means: when a LAVA master breaks down and I have to restore it from a backup, I MUST NOT leave the worker as it is, but instead reset it to a clean state (i.e. a fresh install). Is this correct?
Did I get this right?
Not a fresh install, no. If the worker is not affected by the breakage, then the things you'll need to do are:
* make sure the worker is running a suitable version of lava-dispatcher for the version of the master which is being restored (this matters particularly if you've got local changes; see the sketch below)
* restart the lava-slave service once the master is up and running.
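A quick way to compare the versions, since you are on Debian packages (a sketch using dpkg-query; run it on the worker and compare the result with the lava-server version installed on the master):

    import subprocess

    def installed_version(package):
        """Return the installed Debian package version, or None if not installed."""
        try:
            out = subprocess.check_output(
                ["dpkg-query", "-W", "-f=${Version}", package])
        except subprocess.CalledProcessError:
            return None
        return out.decode().strip() or None

    # Run this on the worker; compare with lava-server on the master.
    print("lava-dispatcher:", installed_version("lava-dispatcher"))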
--
Neil Williams
neil.williams@linaro.org http://www.linux.codehelp.co.uk/