Hello,
The second known reason for a job to get stuck is that lava-logs itself is blocked.
I found a bug yesterday in the callbacks that can block lava-logs forever: the callback HTTP request has no timeout. If the remote server takes forever to answer (I was testing with netcat as the remote server, and netcat never answered anything), then lava-logs (the process that sends the notifications) will wait forever. A patch is available here: https://git.lavasoftware.org/lava/lava/merge_requests/113
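To illustrate the failure mode, here is a minimal sketch in plain Python (standard library only). It is not the code from the merge request, and the names send_callback and silent_server are hypothetical. It shows a callback request that gives up after a timeout instead of waiting forever on a server that accepts the connection but never answers, as netcat did in my test.

```python
import json
import socket
import urllib.request


def send_callback(url, payload, timeout=5.0):
    """POST payload as JSON (hypothetical helper, not LAVA's API).

    Returns True on a 2xx response, False on any error or timeout.
    """
    data = json.dumps(payload).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        # Without timeout=..., this call can block forever on a silent server.
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:
        # URLError, HTTPError and socket timeouts all derive from OSError.
        return False


def silent_server():
    """Listen on a free local port and never answer (stand-in for netcat)."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))
    server.listen(1)
    return server
```

With a silent server listening locally, send_callback("http://127.0.0.1:PORT/", {...}, timeout=0.5) returns False after roughly half a second, so the calling process can carry on with other work.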
As you are using a callback in the given job, that might be a reason.
Rgds
On Tue, 9 Oct 2018 at 10:57, Remi Duraffort remi.duraffort@linaro.org wrote:
Hello Corentin,
From what I can see in the job logs, the lava-run process was not killed cleanly, as the last lines of the logs are missing (like https://validation.linaro.org/scheduler/job/1894511#results_33286343). Even if lava-run crashes, the last line should be sent. So was the server running lava-run restarted? Do you know what happened to the lava-run process?
To understand what happened there, a job cycle is:
  lava-master => lava-slave: START
  lava-slave => lava-master: START_OK when lava-run is started
  lava-run => lava-logs: send the logs
When the job is about to finish, lava-run logs the last results (lava.job result with pass or fail).
When lava-logs receives such a log line, it can mark the TestJob as finished and record the job health (canceled, success or failure).
At the same time, lava-slave notices that lava-run has finished and sends an END message to lava-master. But lava-master won't do anything until lava-logs has marked the TestJob as finished, because the logs have not all been received yet.
In your case, the last line of the log is missing, so lava-logs can't mark the job as finished. At the same time, lava-master and lava-slave are both waiting for lava-logs. That's why I added a "fail" button that can force this transition when (for some reason, like a server crash) the last line of the log is never going to be sent.
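The job cycle and the stuck case described above can be modelled with a small sketch. This is an illustration only, not LAVA's implementation: the Job class, the function names and the dictionary keys used for a log line ("definition", "case", "result") are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Job:
    end_received: bool = False  # lava-slave => lava-master: END
    finished: bool = False      # set once the final lava.job result is seen
    health: str = "unknown"


def on_log_line(job, line):
    """lava-logs side: only the final lava.job result marks the job finished."""
    if line.get("definition") == "lava" and line.get("case") == "job":
        job.finished = True
        job.health = "success" if line.get("result") == "pass" else "failure"


def on_end_message(job):
    """lava-master side: record END, which alone does not close the job."""
    job.end_received = True


def can_close(job):
    """The job leaves the running state only when both conditions hold."""
    return job.end_received and job.finished


def force_fail(job):
    """Stand-in for the "fail" button: force the transition by hand."""
    job.finished = True
    job.health = "failure"
```

If END arrives but the final log line never does, can_close stays False indefinitely, which is the stuck state; force_fail is the manual way out.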
Rgds.
On Wed, 26 Sep 2018 at 09:43, LABBE Corentin clabbe@baylibre.com wrote:
On Tue, Sep 25, 2018 at 09:03:03AM +0100, Neil Williams wrote:
On Tue, 25 Sep 2018 at 08:56, Corentin Labbe clabbe@baylibre.com wrote:
Hello
We got a job (number 332) stuck in the running state. After 23h of inaction, the only way to stop it was to cancel+fail it. According to the logs, a small disconnection happened between the slave and the master. The slave seems to have tried to update the final status of the job, but the master "ignored" it.
Which Debian package version(s) of lava-dispatcher are installed on the slave, and of lava-server on the master?
On slave:
ii  lava-common       2018.7-1+stretch  all    Linaro Automated Validation Architecture common
ii  lava-dispatcher   2018.7-1+stretch  amd64  Linaro Automated Validation Architecture dispatcher
On master:
ii  lava              2018.7-1+stretch  all    Linaro Automated Validation Architecture metapackage
ii  lava-common       2018.7-1+stretch  all    Linaro Automated Validation Architecture common
ii  lava-coordinator  0.1.7-1           all    LAVA Coordinator daemon
ii  lava-dev          2018.7-1+stretch  all    Linaro Automated Validation Architecture developer support
ii  lava-dispatcher   2018.7-1+stretch  amd64  Linaro Automated Validation Architecture dispatcher
ii  lava-server       2018.7-1+stretch  all    Linaro Automated Validation Architecture server
ii  lava-server-doc   2018.7-1+stretch  all    Linaro Automated Validation Architecture documentation
Can you attach (rather than inline) the test job log file (output.yaml)? The job ended very quickly, looks like a validate error.
This is the job output https://lava.automotivelinux.org/scheduler/job/332
It also looks like both master and slave went offline at the same time. Can you confirm that master and slave are both running in timezone UTC & if both have ntp installed?
Yes I confirm
Lava-users mailing list Lava-users@lists.linaro.org https://lists.linaro.org/mailman/listinfo/lava-users
-- Rémi Duraffort LAVA Team