Re: [Lava-users] job stuck after disconnection

19 Oct 2018


Hello,
the second known reason for a job to be stuck is that lava-logs itself is
stuck.
Yesterday I found a bug in the callbacks that can block lava-logs forever:
the callback HTTP request is missing a timeout.
If the remote server is taking forever to answer (I was playing with netcat
as a remote server and netcat was not answering anything), then lava-logs
(the process that is sending the notifications) will wait forever.
A patch is available here:
https://git.lavasoftware.org/lava/lava/merge_requests/113
As you are using a callback in the given job, that might be the cause.
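To illustrate the missing timeout, here is a minimal sketch using only the Python standard library. It is not the actual patch from the merge request above: the function name send_callback, the use of GET, and the 5-second default are assumptions made for the example.

```python
import http.client
from urllib.parse import urlsplit


def send_callback(url, timeout=5.0):
    """Send a GET callback to a plain-http URL.

    Returns the HTTP status code, or None if the remote end is unreachable,
    answers with garbage, or does not answer within `timeout` seconds.
    (Hypothetical helper for illustration only.)
    """
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    # The timeout applies to the connection attempt and to every blocking
    # read, so a silent server can no longer block the caller forever.
    conn = http.client.HTTPConnection(parts.hostname, parts.port or 80,
                                      timeout=timeout)
    try:
        conn.request("GET", target)
        return conn.getresponse().status
    except (OSError, http.client.HTTPException):
        # Timeouts and connection errors are OSError subclasses.
        return None
    finally:
        conn.close()
```

Without the timeout argument, the read in getresponse() would block for as long as the remote server stays silent, which matches the netcat experiment described above.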
Rgds
On Tue, 9 Oct 2018 at 10:57, Remi Duraffort remi.duraffort@linaro.org
wrote:
...
Hello Corentin,
from what I can see in the job logs, the lava-run process was not killed
cleanly, as the last log lines are missing (lines like
https://validation.linaro.org/scheduler/job/1894511#results_33286343).
Even if lava-run crashes, the last line should be sent.
So was the server running lava-run restarted? Do you know what happened to
the lava-run process?
To understand what happened there, a job cycle is:
lava-master => lava-slave: START
lava-slave => lava-master: START_OK when lava-run is started
lava-run => lava-logs: send the logs
When the job is about to finish, lava-run logs the last results (the lava.job
result, with pass or fail).
When lava-logs receives that log line, it can mark the TestJob as finished
and record the job health (canceled, success or failure).
At the same time, lava-slave notices that lava-run has finished and sends
an END message to lava-master.
But lava-master won't do anything until lava-logs has marked the TestJob
as finished, because the logs haven't all been received yet.
In your case, the last log line is missing, so lava-logs can't mark the
job as finished. Meanwhile, lava-master and lava-slave are both
waiting for lava-logs.
That's why I added a "fail" button that can force this transition when
(for some reason, such as a server crash) the last log line is never going
to be sent.
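The bookkeeping described above can be sketched as a toy model. This is not LAVA code: the class ToyJob, its method names, the textual log line format, and the health recorded by the forced transition are all assumptions made for illustration.

```python
class ToyJob:
    """Toy model of when lava-master may close a job (illustration only)."""

    def __init__(self):
        self.end_received = False  # lava-slave => lava-master: END
        self.finished = False      # set once the final log line is seen
        self.health = None

    def on_end_from_slave(self):
        # lava-slave noticed that lava-run exited.
        self.end_received = True

    def on_log_line(self, line):
        # Only the final "lava.job" result line finishes the job.
        # The "lava.job result: <pass|fail>" format is hypothetical.
        prefix = "lava.job result: "
        if line.startswith(prefix):
            result = line[len(prefix):]
            self.finished = True
            self.health = "success" if result == "pass" else "failure"

    def force_fail(self):
        # Manual escape hatch for when the last log line will never arrive.
        # Recording "failure" here is an assumption.
        self.finished = True
        self.health = "failure"

    def master_can_close(self):
        # END alone is not enough: the job must also be marked finished.
        return self.end_received and self.finished


# A job whose last log line was lost stays open until it is forced.
job = ToyJob()
job.on_log_line("some ordinary log output")
job.on_end_from_slave()
stuck = not job.master_can_close()
job.force_fail()
```

In this model, stuck is True after the END message because the final result line never arrived; only force_fail() lets master_can_close() return True.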
Rgds.
On Wed, 26 Sep 2018 at 09:43, LABBE Corentin clabbe@baylibre.com
wrote:
...
On Tue, Sep 25, 2018 at 09:03:03AM +0100, Neil Williams wrote:
...
On Tue, 25 Sep 2018 at 08:56, Corentin Labbe clabbe@baylibre.com
wrote:
...
...
Hello
We got a job (number 332) stuck in running state.
After 23h of inaction, the only way to stop it was to cancel+fail it.
According to the logs, a brief disconnection happened between the slave and
master.
The slave seems to try to update the final status of the job, but the
master "ignores" it.
Which Debian package version(s) of lava-dispatcher are on the slave and of
lava-server on the master?
On slave:
ii  lava-common      2018.7-1+stretch  all    Linaro Automated Validation Architecture common
ii  lava-dispatcher  2018.7-1+stretch  amd64  Linaro Automated Validation Architecture dispatcher
On master:
ii  lava              2018.7-1+stretch  all    Linaro Automated Validation Architecture metapackage
ii  lava-common       2018.7-1+stretch  all    Linaro Automated Validation Architecture common
ii  lava-coordinator  0.1.7-1           all    LAVA Coordinator daemon
ii  lava-dev          2018.7-1+stretch  all    Linaro Automated Validation Architecture developer support
ii  lava-dispatcher   2018.7-1+stretch  amd64  Linaro Automated Validation Architecture dispatcher
ii  lava-server       2018.7-1+stretch  all    Linaro Automated Validation Architecture server
ii  lava-server-doc   2018.7-1+stretch  all    Linaro Automated Validation Architecture documentation
...
Can you attach (rather than inline) the test job log file (output.yaml)?
The job ended very quickly, looks like a validate error.
This is the job output
https://lava.automotivelinux.org/scheduler/job/332
...
It also looks like both master and slave went offline at the same time.
Can you confirm that master and slave are both running in timezone UTC & if
both have ntp installed?
Yes I confirm

Lava-users mailing list
Lava-users@lists.linaro.org
https://lists.linaro.org/mailman/listinfo/lava-users
--
Rémi Duraffort
LAVA Team
