Hello
We got a job (number 332) stuck in running state. After 23h of inaction, the only way to stop it was to cancel+fail it. According to the logs, a small disconnection happen between the slave and master. The slave seems to try to update the final status of the job but the master "ignore" it.
Regards
This is the slave logs:
2018-09-21 21:45:57,197 INFO [332] Starting job2018-09-21 21:45:57,226 DEBUG [332] device : actions: 2018-09-21 21:45:57,227 DEBUG [332] env-dut : 2018-09-21 21:45:57,241 DEBUG START_OK(332) => master 2018-09-21 21:46:17,260 DEBUG PING => master (last message 20s ago) 2018-09-21 21:46:17,453 DEBUG master => PONG(20) 2018-09-21 21:46:37,475 DEBUG PING => master (last message 20s ago) 2018-09-21 21:46:37,684 DEBUG master => PONG(20) 2018-09-21 21:46:57,708 DEBUG PING => master (last message 20s ago) 2018-09-21 21:47:03,603 DEBUG master => PONG(20) 2018-09-21 21:47:23,625 DEBUG PING => master (last message 20s ago) 2018-09-21 21:47:41,199 DEBUG master => PONG(20) 2018-09-21 21:48:01,222 DEBUG PING => master (last message 20s ago) 2018-09-21 21:48:19,372 DEBUG master => PONG(20) ## 2018-09-21 21:48:39,395 DEBUG PING => master (last message 20s ago) ## 2018-09-21 21:48:59,418 DEBUG PING => master (last message 40s ago) ## 2018-09-21 21:49:19,441 DEBUG PING => master (last message 60s ago) 2018-09-21 21:49:34,458 INFO [332] Job END ## 2018-09-21 21:49:34,487 DEBUG END(332) => master 2018-09-21 21:49:39,496 WARNING Master goes OFFLINE 2018-09-21 21:49:39,496 DEBUG PING => master (last message 80s ago) 2018-09-21 21:49:39,501 INFO [332] Job END (resend) 2018-09-21 21:49:39,515 DEBUG END(332) => master 2018-09-21 21:49:44,521 INFO [332] Job END (resend) 2018-09-21 21:49:44,538 DEBUG END(332) => master 2018-09-21 21:49:49,544 INFO [332] Job END (resend) 2018-09-21 21:49:49,559 DEBUG END(332) => master 2018-09-21 21:49:54,565 INFO [332] Job END (resend) 2018-09-21 21:49:54,588 DEBUG END(332) => master 2018-09-21 21:49:59,594 DEBUG PING => master (last message 100s ago) 2018-09-21 21:49:59,594 INFO [332] Job END (resend) 2018-09-21 21:49:59,612 DEBUG END(332) => master 2018-09-21 21:50:04,618 INFO [332] Job END (resend) 2018-09-21 21:50:04,639 DEBUG END(332) => master 2018-09-21 21:50:09,645 INFO [332] Job END (resend) 2018-09-21 21:50:09,659 DEBUG END(332) => master 2018-09-21 21:50:14,665 INFO [332] Job END (resend) 2018-09-21 21:50:14,685 DEBUG END(332) => master 2018-09-21 21:50:19,690 DEBUG PING => master (last message 120s ago) 2018-09-21 21:50:19,691 INFO [332] Job END (resend) 2018-09-21 21:50:19,708 DEBUG END(332) => master 2018-09-21 21:50:24,714 INFO [332] Job END (resend) 2018-09-21 21:50:24,735 DEBUG END(332) => master 2018-09-21 21:50:29,741 INFO [332] Job END (resend) 2018-09-21 21:50:29,755 DEBUG END(332) => master 2018-09-21 21:50:34,761 INFO [332] Job END (resend) 2018-09-21 21:50:34,776 DEBUG END(332) => master 2018-09-21 21:50:39,781 DEBUG PING => master (last message 140s ago) 2018-09-21 21:50:39,782 INFO [332] Job END (resend) 2018-09-21 21:50:39,800 DEBUG END(332) => master 2018-09-21 21:50:44,806 INFO [332] Job END (resend) 2018-09-21 21:50:44,823 DEBUG END(332) => master 2018-09-21 21:50:49,829 INFO [332] Job END (resend) 2018-09-21 21:50:49,850 DEBUG END(332) => master 2018-09-21 21:50:54,856 INFO [332] Job END (resend) 2018-09-21 21:50:54,876 DEBUG END(332) => master 2018-09-21 21:50:59,882 DEBUG PING => master (last message 160s ago) 2018-09-21 21:50:59,883 INFO [332] Job END (resend) 2018-09-21 21:50:59,901 DEBUG END(332) => master 2018-09-21 21:51:04,907 INFO [332] Job END (resend) 2018-09-21 21:51:04,920 DEBUG END(332) => master 2018-09-21 21:51:09,926 INFO [332] Job END (resend) 2018-09-21 21:51:09,939 DEBUG END(332) => master 2018-09-21 21:51:14,945 INFO [332] Job END (resend) 2018-09-21 21:51:14,967 DEBUG END(332) => master 2018-09-21 21:51:19,973 DEBUG PING => master (last message 180s ago) 2018-09-21 21:51:19,974 INFO [332] Job END (resend) 2018-09-21 21:51:19,994 DEBUG END(332) => master 2018-09-21 21:51:25,000 INFO [332] Job END (resend) 2018-09-21 21:51:25,020 DEBUG END(332) => master 2018-09-21 21:51:30,026 INFO [332] Job END (resend) 2018-09-21 21:51:30,047 DEBUG END(332) => master 2018-09-21 21:51:35,053 INFO [332] Job END (resend) 2018-09-21 21:51:35,071 DEBUG END(332) => master 2018-09-21 21:51:40,077 DEBUG PING => master (last message 200s ago) 2018-09-21 21:51:40,077 INFO [332] Job END (resend) 2018-09-21 21:51:40,095 DEBUG END(332) => master 2018-09-21 21:51:45,101 INFO [332] Job END (resend) 2018-09-21 21:51:45,123 DEBUG END(332) => master 2018-09-21 21:51:50,129 INFO [332] Job END (resend) 2018-09-21 21:51:50,149 DEBUG END(332) => master 2018-09-21 21:51:55,154 INFO [332] Job END (resend) 2018-09-21 21:51:55,175 DEBUG END(332) => master 2018-09-21 21:51:58,639 DEBUG master => PONG(20) 2018-09-21 21:51:58,640 INFO Master is ONLINE 2018-09-21 21:51:58,831 DEBUG master => PONG(20) 2018-09-21 21:51:58,833 DEBUG master => PONG(20) 2018-09-21 21:52:00,053 INFO master => END_OK(332) 2018-09-21 21:52:00,054 DEBUG [332] Removing /var/lib/lava/dispatcher/slave/tmp/332 2018-09-21 21:52:00,062 DEBUG master => PONG(20) 2018-09-21 21:52:00,062 INFO master => END_OK(332) 2018-09-21 21:52:00,063 DEBUG [332] Unknown job 2018-09-21 21:52:00,063 DEBUG [332] Removing /var/lib/lava/dispatcher/slave/tmp/332 2018-09-21 21:52:00,242 INFO master => END_OK(332) 2018-09-21 21:52:00,242 DEBUG [332] Unknown job 2018-09-21 21:52:00,243 DEBUG [332] Removing /var/lib/lava/dispatcher/slave/tmp/332 2018-09-21 21:52:00,243 INFO master => END_OK(332) 2018-09-21 21:52:00,244 DEBUG [332] Unknown job 2018-09-21 21:52:00,244 DEBUG [332] Removing /var/lib/lava/dispatcher/slave/tmp/332 2018-09-21 21:52:00,244 INFO master => END_OK(332) 2018-09-21 21:52:00,245 DEBUG [332] Unknown job 2018-09-21 21:52:00,245 DEBUG [332] Removing /var/lib/lava/dispatcher/slave/tmp/332 2018-09-21 21:52:00,246 DEBUG master => PONG(20) 2018-09-21 21:52:00,246 INFO master => END_OK(332) 2018-09-21 21:52:00,246 DEBUG [332] Unknown job 2018-09-21 21:52:00,246 DEBUG [332] Removing /var/lib/lava/dispatcher/slave/tmp/332 2018-09-21 21:52:00,381 INFO master => END_OK(332) 2018-09-21 21:52:00,382 DEBUG [332] Unknown job 2018-09-21 21:52:00,382 DEBUG [332] Removing /var/lib/lava/dispatcher/slave/tmp/332 2018-09-21 21:52:00,436 INFO master => END_OK(332) 2018-09-21 21:52:00,436 DEBUG [332] Unknown job 2018-09-21 21:52:00,436 DEBUG [332] Removing /var/lib/lava/dispatcher/slave/tmp/332 2018-09-21 21:52:00,592 INFO master => END_OK(332) 2018-09-21 21:52:00,593 DEBUG [332] Unknown job 2018-09-21 21:52:00,593 DEBUG [332] Removing /var/lib/lava/dispatcher/slave/tmp/332 2018-09-21 21:52:00,610 DEBUG master => PONG(20) 2018-09-21 21:52:00,770 INFO master => END_OK(332) 2018-09-21 21:52:00,771 DEBUG [332] Unknown job 2018-09-21 21:52:00,771 DEBUG [332] Removing /var/lib/lava/dispatcher/slave/tmp/332 2018-09-21 21:52:00,892 INFO master => END_OK(332) 2018-09-21 21:52:00,892 DEBUG [332] Unknown job 2018-09-21 21:52:00,892 DEBUG [332] Removing /var/lib/lava/dispatcher/slave/tmp/332 2018-09-21 21:52:00,976 INFO master => END_OK(332) 2018-09-21 21:52:00,977 DEBUG [332] Unknown job 2018-09-21 21:52:00,977 DEBUG [332] Removing /var/lib/lava/dispatcher/slave/tmp/332 2018-09-21 21:52:01,156 INFO master => END_OK(332) 2018-09-21 21:52:01,157 DEBUG [332] Unknown job 2018-09-21 21:52:01,157 DEBUG [332] Removing /var/lib/lava/dispatcher/slave/tmp/332 2018-09-21 21:52:01,164 DEBUG master => PONG(20) 2018-09-21 21:52:01,170 INFO master => END_OK(332) 2018-09-21 21:52:01,171 DEBUG [332] Unknown job 2018-09-21 21:52:01,171 DEBUG [332] Removing /var/lib/lava/dispatcher/slave/tmp/332 2018-09-21 21:52:01,353 INFO master => END_OK(332) 2018-09-21 21:52:01,353 DEBUG [332] Unknown job 2018-09-21 21:52:01,353 DEBUG [332] Removing /var/lib/lava/dispatcher/slave/tmp/332 2018-09-21 21:52:01,493 INFO master => END_OK(332) 2018-09-21 21:52:01,493 DEBUG [332] Unknown job 2018-09-21 21:52:01,493 DEBUG [332] Removing /var/lib/lava/dispatcher/slave/tmp/332 2018-09-21 21:52:01,550 INFO master => END_OK(332) 2018-09-21 21:52:01,551 DEBUG [332] Unknown job 2018-09-21 21:52:01,551 DEBUG [332] Removing /var/lib/lava/dispatcher/slave/tmp/332 2018-09-21 21:52:01,552 DEBUG master => PONG(20) 2018-09-21 21:52:01,700 INFO master => END_OK(332) 2018-09-21 21:52:01,700 DEBUG [332] Unknown job 2018-09-21 21:52:01,700 DEBUG [332] Removing /var/lib/lava/dispatcher/slave/tmp/332 2018-09-21 21:52:01,745 INFO master => END_OK(332) 2018-09-21 21:52:01,745 DEBUG [332] Unknown job 2018-09-21 21:52:01,745 DEBUG [332] Removing /var/lib/lava/dispatcher/slave/tmp/332 2018-09-21 21:52:01,819 INFO master => END_OK(332) 2018-09-21 21:52:01,819 DEBUG [332] Unknown job 2018-09-21 21:52:01,820 DEBUG [332] Removing /var/lib/lava/dispatcher/slave/tmp/332 2018-09-21 21:52:01,936 INFO master => END_OK(332) 2018-09-21 21:52:01,936 DEBUG [332] Unknown job 2018-09-21 21:52:01,936 DEBUG [332] Removing /var/lib/lava/dispatcher/slave/tmp/332 2018-09-21 21:52:01,938 DEBUG master => PONG(20) 2018-09-21 21:52:02,048 INFO master => END_OK(332) 2018-09-21 21:52:02,048 DEBUG [332] Unknown job 2018-09-21 21:52:02,049 DEBUG [332] Removing /var/lib/lava/dispatcher/slave/tmp/332 2018-09-21 21:52:02,138 INFO master => END_OK(332) 2018-09-21 21:52:02,138 DEBUG [332] Unknown job 2018-09-21 21:52:02,138 DEBUG [332] Removing /var/lib/lava/dispatcher/slave/tmp/332 2018-09-21 21:52:22,160 DEBUG PING => master (last message 20s ago) 2018-09-21 21:52:37,178 INFO master => END_OK(332) 2018-09-21 21:52:37,179 DEBUG [332] Unknown job 2018-09-21 21:52:37,179 DEBUG [332] Removing /var/lib/lava/dispatcher/slave/tmp/332 2018-09-21 21:52:37,180 INFO master => END_OK(332) 2018-09-21 21:52:37,180 DEBUG [332] Unknown job 2018-09-21 21:52:37,180 DEBUG [332] Removing /var/lib/lava/dispatcher/slave/tmp/332 2018-09-21 21:52:37,181 DEBUG master => PONG(20) 2018-09-21 21:52:37,181 INFO master => END_OK(332) 2018-09-21 21:52:37,181 DEBUG [332] Unknown job 2018-09-21 21:52:37,182 DEBUG [332] Removing /var/lib/lava/dispatcher/slave/tmp/332 2018-09-21 21:52:37,374 DEBUG master => PONG(20) 2018-09-21 21:52:57,396 DEBUG PING => master (last message 20s ago) 2018-09-21 21:53:17,418 DEBUG PING => master (last message 40s ago) 2018-09-21 21:53:21,421 DEBUG master => PONG(20) 2018-09-21 21:53:21,610 DEBUG master => PONG(20) 2018-09-21 21:53:41,632 DEBUG PING => master (last message 20s ago) 2018-09-21 21:54:01,653 DEBUG PING => master (last message 40s ago) 2018-09-21 21:54:05,695 DEBUG master => PONG(20) 2018-09-21 21:54:05,893 DEBUG master => PONG(20) 2018-09-21 21:54:25,911 DEBUG PING => master (last message 20s ago) 2018-09-21 21:54:45,933 DEBUG PING => master (last message 40s ago) 2018-09-21 21:54:51,497 DEBUG master => PONG(20) 2018-09-21 21:54:51,692 DEBUG master => PONG(20) 2018-09-21 21:55:11,713 DEBUG PING => master (last message 20s ago) 2018-09-21 21:55:31,733 DEBUG PING => master (last message 40s ago) 2018-09-21 21:55:38,958 DEBUG master => PONG(20) 2018-09-21 21:55:39,147 DEBUG master => PONG(20) 2018-09-21 21:55:59,169 DEBUG PING => master (last message 20s ago) 2018-09-21 21:56:19,191 DEBUG PING => master (last message 40s ago) 2018-09-21 21:56:23,724 DEBUG master => PONG(20) 2018-09-21 21:56:23,927 DEBUG master => PONG(20) 2018-09-21 21:56:43,950 DEBUG PING => master (last message 20s ago) 2018-09-21 21:56:44,137 DEBUG master => PONG(20) 2018-09-21 21:57:04,159 DEBUG PING => master (last message 20s ago) 2018-09-21 21:57:04,351 DEBUG master => PONG(20)
And this is the master logs: 2018-09-21 21:45:56,427 INFO scheduling jobs: 2018-09-21 21:45:56,428 DEBUG - bcm2837-rpi-3-b-32 2018-09-21 21:45:56,453 DEBUG -> rpi3-corelab-01 (Idle, Good) 2018-09-21 21:45:56,453 DEBUG |--> [332] scheduling 2018-09-21 21:45:56,459 DEBUG - dra7-evm 2018-09-21 21:45:56,464 DEBUG - qemu 2018-09-21 21:45:56,660 INFO [332] START => agl-core-lab-1 (rpi3-corelab-01) 2018-09-21 21:45:57,341 INFO [332] agl-core-lab-1 => START_OK 2018-09-21 21:45:57,773 DEBUG lava-logs => PING(20) 2018-09-21 21:46:16,711 INFO scheduling health checks: 2018-09-21 21:46:16,724 INFO scheduling jobs: 2018-09-21 21:46:16,725 DEBUG - dra7-evm 2018-09-21 21:46:16,729 DEBUG - qemu 2018-09-21 21:46:17,351 DEBUG agl-core-lab-1 => PING(20) 2018-09-21 21:46:18,980 DEBUG lava-logs => PING(20) 2018-09-21 21:46:36,744 INFO scheduling health checks: 2018-09-21 21:46:36,757 INFO scheduling jobs: 2018-09-21 21:46:36,758 DEBUG - dra7-evm 2018-09-21 21:46:36,762 DEBUG - qemu 2018-09-21 21:46:37,566 DEBUG agl-core-lab-1 => PING(20) 2018-09-21 21:46:48,466 DEBUG lava-logs => PING(20) 2018-09-21 21:46:56,776 INFO scheduling health checks: 2018-09-21 21:46:56,790 INFO scheduling jobs: 2018-09-21 21:46:56,791 DEBUG - dra7-evm 2018-09-21 21:46:56,795 DEBUG - qemu 2018-09-21 21:46:57,797 DEBUG agl-core-lab-1 => PING(20) 2018-09-21 21:47:08,482 DEBUG lava-logs => PING(20) 2018-09-21 21:47:16,809 INFO scheduling health checks: 2018-09-21 21:47:16,822 INFO scheduling jobs: 2018-09-21 21:47:16,823 DEBUG - dra7-evm 2018-09-21 21:47:16,827 DEBUG - qemu 2018-09-21 21:47:23,718 DEBUG agl-core-lab-1 => PING(20) 2018-09-21 21:47:31,521 DEBUG lava-logs => PING(20) 2018-09-21 21:47:36,841 INFO scheduling health checks: 2018-09-21 21:47:36,854 INFO scheduling jobs: 2018-09-21 21:47:36,855 DEBUG - dra7-evm 2018-09-21 21:47:36,859 DEBUG - qemu 2018-09-21 21:47:54,275 DEBUG lava-logs => PING(20) 2018-09-21 21:47:56,872 INFO scheduling health checks: 2018-09-21 21:47:56,885 INFO scheduling jobs: 2018-09-21 21:47:56,886 DEBUG - dra7-evm 2018-09-21 21:47:56,891 DEBUG - qemu 2018-09-21 21:48:01,321 DEBUG agl-core-lab-1 => PING(20) 2018-09-21 21:48:14,292 DEBUG lava-logs => PING(20) 2018-09-21 21:48:16,903 INFO scheduling health checks: 2018-09-21 21:48:16,917 INFO scheduling jobs: 2018-09-21 21:48:16,918 DEBUG - dra7-evm 2018-09-21 21:48:16,922 DEBUG - qemu 2018-09-21 21:48:35,995 DEBUG lava-logs => PING(20) 2018-09-21 21:48:36,933 INFO scheduling health checks: 2018-09-21 21:48:36,946 INFO scheduling jobs: 2018-09-21 21:48:36,947 DEBUG - dra7-evm 2018-09-21 21:48:36,951 DEBUG - qemu 2018-09-21 21:48:39,486 DEBUG agl-core-lab-1 => PING(20) 2018-09-21 21:48:56,011 DEBUG lava-logs => PING(20) 2018-09-21 21:48:56,962 INFO scheduling health checks: 2018-09-21 21:48:56,975 INFO scheduling jobs: 2018-09-21 21:48:56,976 DEBUG - dra7-evm 2018-09-21 21:48:56,981 DEBUG - qemu 2018-09-21 21:49:16,995 INFO scheduling health checks: 2018-09-21 21:49:17,009 INFO scheduling jobs: 2018-09-21 21:49:17,010 DEBUG - dra7-evm 2018-09-21 21:49:17,014 DEBUG - qemu 2018-09-21 21:49:21,406 DEBUG lava-logs => PING(20) 2018-09-21 21:49:37,026 INFO scheduling health checks: 2018-09-21 21:49:37,040 INFO scheduling jobs: 2018-09-21 21:49:37,041 DEBUG - dra7-evm 2018-09-21 21:49:37,046 DEBUG - qemu 2018-09-21 21:49:41,419 DEBUG lava-logs => PING(20) 2018-09-21 21:49:52,363 ERROR [STATE] Dispatcher <agl-core-lab-1> goes OFFLINE 2018-09-21 21:49:57,059 INFO scheduling health checks: 2018-09-21 21:49:57,067 INFO scheduling jobs: 2018-09-21 21:50:01,440 DEBUG lava-logs => PING(20) 2018-09-21 21:50:17,078 INFO scheduling health checks: 2018-09-21 21:50:17,086 INFO scheduling jobs: 2018-09-21 21:50:21,462 DEBUG lava-logs => PING(20) 2018-09-21 21:50:37,097 INFO scheduling health checks: 2018-09-21 21:50:37,105 INFO scheduling jobs: 2018-09-21 21:50:41,484 DEBUG lava-logs => PING(20) 2018-09-21 21:50:57,117 INFO scheduling health checks: 2018-09-21 21:50:57,125 INFO scheduling jobs: 2018-09-21 21:51:01,502 DEBUG lava-logs => PING(20) 2018-09-21 21:51:17,135 INFO scheduling health checks: 2018-09-21 21:51:17,144 INFO scheduling jobs: 2018-09-21 21:51:21,524 DEBUG lava-logs => PING(20) 2018-09-21 21:51:37,154 INFO scheduling health checks: 2018-09-21 21:51:37,162 INFO scheduling jobs: 2018-09-21 21:51:41,546 DEBUG lava-logs => PING(20) 2018-09-21 21:51:57,170 INFO scheduling health checks: 2018-09-21 21:51:57,178 INFO scheduling jobs: 2018-09-21 21:51:58,734 DEBUG agl-core-lab-1 => PING(20) 2018-09-21 21:51:58,737 DEBUG agl-core-lab-1 => PING(20) 2018-09-21 21:51:59,125 INFO [332] agl-core-lab-1 => END 2018-09-21 21:51:59,954 DEBUG get_type_name: skipping finalize 2018-09-21 21:51:59,954 DEBUG type_name failed for finalize metatype finalize 2018-09-21 21:51:59,954 DEBUG get_type_name: skipping finalize 2018-09-21 21:51:59,954 DEBUG type_name failed for finalize metatype finalize 2018-09-21 21:51:59,954 DEBUG get_type_name: skipping finalize 2018-09-21 21:51:59,954 DEBUG type_name failed for finalize metatype finalize 2018-09-21 21:51:59,960 DEBUG agl-core-lab-1 => PING(20) 2018-09-21 21:51:59,964 INFO [332] agl-core-lab-1 => END (duplicated), skipping 2018-09-21 21:51:59,968 INFO [332] agl-core-lab-1 => END (duplicated), skipping 2018-09-21 21:51:59,971 INFO [332] agl-core-lab-1 => END (duplicated), skipping 2018-09-21 21:51:59,974 INFO [332] agl-core-lab-1 => END (duplicated), skipping 2018-09-21 21:51:59,977 DEBUG agl-core-lab-1 => PING(20) 2018-09-21 21:52:00,093 INFO [332] agl-core-lab-1 => END (duplicated), skipping 2018-09-21 21:52:00,287 INFO [332] agl-core-lab-1 => END (duplicated), skipping 2018-09-21 21:52:00,341 INFO [332] agl-core-lab-1 => END (duplicated), skipping 2018-09-21 21:52:00,493 INFO [332] agl-core-lab-1 => END (duplicated), skipping 2018-09-21 21:52:00,496 DEBUG agl-core-lab-1 => PING(20) 2018-09-21 21:52:00,674 INFO [332] agl-core-lab-1 => END (duplicated), skipping 2018-09-21 21:52:00,795 INFO [332] agl-core-lab-1 => END (duplicated), skipping 2018-09-21 21:52:00,881 INFO [332] agl-core-lab-1 => END (duplicated), skipping 2018-09-21 21:52:01,061 INFO [332] agl-core-lab-1 => END (duplicated), skipping 2018-09-21 21:52:01,063 DEBUG agl-core-lab-1 => PING(20) 2018-09-21 21:52:01,075 INFO [332] agl-core-lab-1 => END (duplicated), skipping 2018-09-21 21:52:01,256 INFO [332] agl-core-lab-1 => END (duplicated), skipping 2018-09-21 21:52:01,397 INFO [332] agl-core-lab-1 => END (duplicated), skipping 2018-09-21 21:52:01,453 INFO [332] agl-core-lab-1 => END (duplicated), skipping 2018-09-21 21:52:01,456 DEBUG agl-core-lab-1 => PING(20) 2018-09-21 21:52:01,568 DEBUG lava-logs => PING(20) 2018-09-21 21:52:01,603 INFO [332] agl-core-lab-1 => END (duplicated), skipping 2018-09-21 21:52:01,646 INFO [332] agl-core-lab-1 => END (duplicated), skipping 2018-09-21 21:52:01,722 INFO [332] agl-core-lab-1 => END (duplicated), skipping 2018-09-21 21:52:01,840 INFO [332] agl-core-lab-1 => END (duplicated), skipping 2018-09-21 21:52:01,842 DEBUG agl-core-lab-1 => PING(20) 2018-09-21 21:52:01,884 INFO [332] agl-core-lab-1 => END (duplicated), skipping 2018-09-21 21:52:01,985 INFO [332] agl-core-lab-1 => END (duplicated), skipping 2018-09-21 21:52:02,040 INFO [332] agl-core-lab-1 => END (duplicated), skipping 2018-09-21 21:52:02,106 INFO [332] agl-core-lab-1 => END (duplicated), skipping 2018-09-21 21:52:02,109 DEBUG agl-core-lab-1 => PING(20) 2018-09-21 21:52:02,220 INFO [332] agl-core-lab-1 => END (duplicated), skipping 2018-09-21 21:52:17,187 INFO scheduling health checks: 2018-09-21 21:52:17,201 INFO scheduling jobs: 2018-09-21 21:52:17,202 DEBUG - dra7-evm 2018-09-21 21:52:17,207 DEBUG - qemu 2018-09-21 21:52:21,585 DEBUG lava-logs => PING(20)
On Tue, 25 Sep 2018 at 08:56, Corentin Labbe clabbe@baylibre.com wrote:
Hello
We got a job (number 332) stuck in running state. After 23h of inaction, the only way to stop it was to cancel+fail it. According to the logs, a small disconnection happen between the slave and master. The slave seems to try to update the final status of the job but the master "ignore" it.
What Debian package version(s) of lava-dispatcher on the slave and lava-server on the master?
Can you attach (rather than inline) the test job log file (output.yaml)? The job ended very quickly, looks like a validate error.
It also looks like both master and slave went offline at the same time. Can you confirm that master and slave are both running in timezone UTC & if both have ntp installed?
Regards
This is the slave logs:
2018-09-21 21:45:57,197 INFO [332] Starting job2018-09-21 21:45:57,226 DEBUG [332] device : actions: 2018-09-21 21:45:57,227 DEBUG [332] env-dut : 2018-09-21 21:45:57,241 DEBUG START_OK(332) => master 2018-09-21 21:46:17,260 DEBUG PING => master (last message 20s ago) 2018-09-21 21:46:17,453 DEBUG master => PONG(20) 2018-09-21 21:46:37,475 DEBUG PING => master (last message 20s ago) 2018-09-21 21:46:37,684 DEBUG master => PONG(20) 2018-09-21 21:46:57,708 DEBUG PING => master (last message 20s ago) 2018-09-21 21:47:03,603 DEBUG master => PONG(20) 2018-09-21 21:47:23,625 DEBUG PING => master (last message 20s ago) 2018-09-21 21:47:41,199 DEBUG master => PONG(20) 2018-09-21 21:48:01,222 DEBUG PING => master (last message 20s ago) 2018-09-21 21:48:19,372 DEBUG master => PONG(20) ## 2018-09-21 21:48:39,395 DEBUG PING => master (last message 20s ago) ## 2018-09-21 21:48:59,418 DEBUG PING => master (last message 40s ago) ## 2018-09-21 21:49:19,441 DEBUG PING => master (last message 60s ago) 2018-09-21 21:49:34,458 INFO [332] Job END ## 2018-09-21 21:49:34,487 DEBUG END(332) => master 2018-09-21 21:49:39,496 WARNING Master goes OFFLINE 2018-09-21 21:49:39,496 DEBUG PING => master (last message 80s ago) 2018-09-21 21:49:39,501 INFO [332] Job END (resend) 2018-09-21 21:49:39,515 DEBUG END(332) => master 2018-09-21 21:49:44,521 INFO [332] Job END (resend) 2018-09-21 21:49:44,538 DEBUG END(332) => master 2018-09-21 21:49:49,544 INFO [332] Job END (resend) 2018-09-21 21:49:49,559 DEBUG END(332) => master 2018-09-21 21:49:54,565 INFO [332] Job END (resend) 2018-09-21 21:49:54,588 DEBUG END(332) => master 2018-09-21 21:49:59,594 DEBUG PING => master (last message 100s ago) 2018-09-21 21:49:59,594 INFO [332] Job END (resend) 2018-09-21 21:49:59,612 DEBUG END(332) => master 2018-09-21 21:50:04,618 INFO [332] Job END (resend) 2018-09-21 21:50:04,639 DEBUG END(332) => master 2018-09-21 21:50:09,645 INFO [332] Job END (resend) 2018-09-21 21:50:09,659 DEBUG END(332) => master 2018-09-21 21:50:14,665 INFO [332] Job END (resend) 2018-09-21 21:50:14,685 DEBUG END(332) => master 2018-09-21 21:50:19,690 DEBUG PING => master (last message 120s ago) 2018-09-21 21:50:19,691 INFO [332] Job END (resend) 2018-09-21 21:50:19,708 DEBUG END(332) => master 2018-09-21 21:50:24,714 INFO [332] Job END (resend) 2018-09-21 21:50:24,735 DEBUG END(332) => master 2018-09-21 21:50:29,741 INFO [332] Job END (resend) 2018-09-21 21:50:29,755 DEBUG END(332) => master 2018-09-21 21:50:34,761 INFO [332] Job END (resend) 2018-09-21 21:50:34,776 DEBUG END(332) => master 2018-09-21 21:50:39,781 DEBUG PING => master (last message 140s ago) 2018-09-21 21:50:39,782 INFO [332] Job END (resend) 2018-09-21 21:50:39,800 DEBUG END(332) => master 2018-09-21 21:50:44,806 INFO [332] Job END (resend) 2018-09-21 21:50:44,823 DEBUG END(332) => master 2018-09-21 21:50:49,829 INFO [332] Job END (resend) 2018-09-21 21:50:49,850 DEBUG END(332) => master 2018-09-21 21:50:54,856 INFO [332] Job END (resend) 2018-09-21 21:50:54,876 DEBUG END(332) => master 2018-09-21 21:50:59,882 DEBUG PING => master (last message 160s ago) 2018-09-21 21:50:59,883 INFO [332] Job END (resend) 2018-09-21 21:50:59,901 DEBUG END(332) => master 2018-09-21 21:51:04,907 INFO [332] Job END (resend) 2018-09-21 21:51:04,920 DEBUG END(332) => master 2018-09-21 21:51:09,926 INFO [332] Job END (resend) 2018-09-21 21:51:09,939 DEBUG END(332) => master 2018-09-21 21:51:14,945 INFO [332] Job END (resend) 2018-09-21 21:51:14,967 DEBUG END(332) => master 2018-09-21 21:51:19,973 DEBUG PING => master (last message 180s ago) 2018-09-21 21:51:19,974 INFO [332] Job END (resend) 2018-09-21 21:51:19,994 DEBUG END(332) => master 2018-09-21 21:51:25,000 INFO [332] Job END (resend) 2018-09-21 21:51:25,020 DEBUG END(332) => master 2018-09-21 21:51:30,026 INFO [332] Job END (resend) 2018-09-21 21:51:30,047 DEBUG END(332) => master 2018-09-21 21:51:35,053 INFO [332] Job END (resend) 2018-09-21 21:51:35,071 DEBUG END(332) => master 2018-09-21 21:51:40,077 DEBUG PING => master (last message 200s ago) 2018-09-21 21:51:40,077 INFO [332] Job END (resend) 2018-09-21 21:51:40,095 DEBUG END(332) => master 2018-09-21 21:51:45,101 INFO [332] Job END (resend) 2018-09-21 21:51:45,123 DEBUG END(332) => master 2018-09-21 21:51:50,129 INFO [332] Job END (resend) 2018-09-21 21:51:50,149 DEBUG END(332) => master 2018-09-21 21:51:55,154 INFO [332] Job END (resend) 2018-09-21 21:51:55,175 DEBUG END(332) => master 2018-09-21 21:51:58,639 DEBUG master => PONG(20) 2018-09-21 21:51:58,640 INFO Master is ONLINE 2018-09-21 21:51:58,831 DEBUG master => PONG(20) 2018-09-21 21:51:58,833 DEBUG master => PONG(20) 2018-09-21 21:52:00,053 INFO master => END_OK(332) 2018-09-21 21:52:00,054 DEBUG [332] Removing /var/lib/lava/dispatcher/slave/tmp/332 2018-09-21 21:52:00,062 DEBUG master => PONG(20) 2018-09-21 21:52:00,062 INFO master => END_OK(332) 2018-09-21 21:52:00,063 DEBUG [332] Unknown job 2018-09-21 21:52:00,063 DEBUG [332] Removing /var/lib/lava/dispatcher/slave/tmp/332 2018-09-21 21:52:00,242 INFO master => END_OK(332) 2018-09-21 21:52:00,242 DEBUG [332] Unknown job 2018-09-21 21:52:00,243 DEBUG [332] Removing /var/lib/lava/dispatcher/slave/tmp/332 2018-09-21 21:52:00,243 INFO master => END_OK(332) 2018-09-21 21:52:00,244 DEBUG [332] Unknown job 2018-09-21 21:52:00,244 DEBUG [332] Removing /var/lib/lava/dispatcher/slave/tmp/332 2018-09-21 21:52:00,244 INFO master => END_OK(332) 2018-09-21 21:52:00,245 DEBUG [332] Unknown job 2018-09-21 21:52:00,245 DEBUG [332] Removing /var/lib/lava/dispatcher/slave/tmp/332 2018-09-21 21:52:00,246 DEBUG master => PONG(20) 2018-09-21 21:52:00,246 INFO master => END_OK(332) 2018-09-21 21:52:00,246 DEBUG [332] Unknown job 2018-09-21 21:52:00,246 DEBUG [332] Removing /var/lib/lava/dispatcher/slave/tmp/332 2018-09-21 21:52:00,381 INFO master => END_OK(332) 2018-09-21 21:52:00,382 DEBUG [332] Unknown job 2018-09-21 21:52:00,382 DEBUG [332] Removing /var/lib/lava/dispatcher/slave/tmp/332 2018-09-21 21:52:00,436 INFO master => END_OK(332) 2018-09-21 21:52:00,436 DEBUG [332] Unknown job 2018-09-21 21:52:00,436 DEBUG [332] Removing /var/lib/lava/dispatcher/slave/tmp/332 2018-09-21 21:52:00,592 INFO master => END_OK(332) 2018-09-21 21:52:00,593 DEBUG [332] Unknown job 2018-09-21 21:52:00,593 DEBUG [332] Removing /var/lib/lava/dispatcher/slave/tmp/332 2018-09-21 21:52:00,610 DEBUG master => PONG(20) 2018-09-21 21:52:00,770 INFO master => END_OK(332) 2018-09-21 21:52:00,771 DEBUG [332] Unknown job 2018-09-21 21:52:00,771 DEBUG [332] Removing /var/lib/lava/dispatcher/slave/tmp/332 2018-09-21 21:52:00,892 INFO master => END_OK(332) 2018-09-21 21:52:00,892 DEBUG [332] Unknown job 2018-09-21 21:52:00,892 DEBUG [332] Removing /var/lib/lava/dispatcher/slave/tmp/332 2018-09-21 21:52:00,976 INFO master => END_OK(332) 2018-09-21 21:52:00,977 DEBUG [332] Unknown job 2018-09-21 21:52:00,977 DEBUG [332] Removing /var/lib/lava/dispatcher/slave/tmp/332 2018-09-21 21:52:01,156 INFO master => END_OK(332) 2018-09-21 21:52:01,157 DEBUG [332] Unknown job 2018-09-21 21:52:01,157 DEBUG [332] Removing /var/lib/lava/dispatcher/slave/tmp/332 2018-09-21 21:52:01,164 DEBUG master => PONG(20) 2018-09-21 21:52:01,170 INFO master => END_OK(332) 2018-09-21 21:52:01,171 DEBUG [332] Unknown job 2018-09-21 21:52:01,171 DEBUG [332] Removing /var/lib/lava/dispatcher/slave/tmp/332 2018-09-21 21:52:01,353 INFO master => END_OK(332) 2018-09-21 21:52:01,353 DEBUG [332] Unknown job 2018-09-21 21:52:01,353 DEBUG [332] Removing /var/lib/lava/dispatcher/slave/tmp/332 2018-09-21 21:52:01,493 INFO master => END_OK(332) 2018-09-21 21:52:01,493 DEBUG [332] Unknown job 2018-09-21 21:52:01,493 DEBUG [332] Removing /var/lib/lava/dispatcher/slave/tmp/332 2018-09-21 21:52:01,550 INFO master => END_OK(332) 2018-09-21 21:52:01,551 DEBUG [332] Unknown job 2018-09-21 21:52:01,551 DEBUG [332] Removing /var/lib/lava/dispatcher/slave/tmp/332 2018-09-21 21:52:01,552 DEBUG master => PONG(20) 2018-09-21 21:52:01,700 INFO master => END_OK(332) 2018-09-21 21:52:01,700 DEBUG [332] Unknown job 2018-09-21 21:52:01,700 DEBUG [332] Removing /var/lib/lava/dispatcher/slave/tmp/332 2018-09-21 21:52:01,745 INFO master => END_OK(332) 2018-09-21 21:52:01,745 DEBUG [332] Unknown job 2018-09-21 21:52:01,745 DEBUG [332] Removing /var/lib/lava/dispatcher/slave/tmp/332 2018-09-21 21:52:01,819 INFO master => END_OK(332) 2018-09-21 21:52:01,819 DEBUG [332] Unknown job 2018-09-21 21:52:01,820 DEBUG [332] Removing /var/lib/lava/dispatcher/slave/tmp/332 2018-09-21 21:52:01,936 INFO master => END_OK(332) 2018-09-21 21:52:01,936 DEBUG [332] Unknown job 2018-09-21 21:52:01,936 DEBUG [332] Removing /var/lib/lava/dispatcher/slave/tmp/332 2018-09-21 21:52:01,938 DEBUG master => PONG(20) 2018-09-21 21:52:02,048 INFO master => END_OK(332) 2018-09-21 21:52:02,048 DEBUG [332] Unknown job 2018-09-21 21:52:02,049 DEBUG [332] Removing /var/lib/lava/dispatcher/slave/tmp/332 2018-09-21 21:52:02,138 INFO master => END_OK(332) 2018-09-21 21:52:02,138 DEBUG [332] Unknown job 2018-09-21 21:52:02,138 DEBUG [332] Removing /var/lib/lava/dispatcher/slave/tmp/332 2018-09-21 21:52:22,160 DEBUG PING => master (last message 20s ago) 2018-09-21 21:52:37,178 INFO master => END_OK(332) 2018-09-21 21:52:37,179 DEBUG [332] Unknown job 2018-09-21 21:52:37,179 DEBUG [332] Removing /var/lib/lava/dispatcher/slave/tmp/332 2018-09-21 21:52:37,180 INFO master => END_OK(332) 2018-09-21 21:52:37,180 DEBUG [332] Unknown job 2018-09-21 21:52:37,180 DEBUG [332] Removing /var/lib/lava/dispatcher/slave/tmp/332 2018-09-21 21:52:37,181 DEBUG master => PONG(20) 2018-09-21 21:52:37,181 INFO master => END_OK(332) 2018-09-21 21:52:37,181 DEBUG [332] Unknown job 2018-09-21 21:52:37,182 DEBUG [332] Removing /var/lib/lava/dispatcher/slave/tmp/332 2018-09-21 21:52:37,374 DEBUG master => PONG(20) 2018-09-21 21:52:57,396 DEBUG PING => master (last message 20s ago) 2018-09-21 21:53:17,418 DEBUG PING => master (last message 40s ago) 2018-09-21 21:53:21,421 DEBUG master => PONG(20) 2018-09-21 21:53:21,610 DEBUG master => PONG(20) 2018-09-21 21:53:41,632 DEBUG PING => master (last message 20s ago) 2018-09-21 21:54:01,653 DEBUG PING => master (last message 40s ago) 2018-09-21 21:54:05,695 DEBUG master => PONG(20) 2018-09-21 21:54:05,893 DEBUG master => PONG(20) 2018-09-21 21:54:25,911 DEBUG PING => master (last message 20s ago) 2018-09-21 21:54:45,933 DEBUG PING => master (last message 40s ago) 2018-09-21 21:54:51,497 DEBUG master => PONG(20) 2018-09-21 21:54:51,692 DEBUG master => PONG(20) 2018-09-21 21:55:11,713 DEBUG PING => master (last message 20s ago) 2018-09-21 21:55:31,733 DEBUG PING => master (last message 40s ago) 2018-09-21 21:55:38,958 DEBUG master => PONG(20) 2018-09-21 21:55:39,147 DEBUG master => PONG(20) 2018-09-21 21:55:59,169 DEBUG PING => master (last message 20s ago) 2018-09-21 21:56:19,191 DEBUG PING => master (last message 40s ago) 2018-09-21 21:56:23,724 DEBUG master => PONG(20) 2018-09-21 21:56:23,927 DEBUG master => PONG(20) 2018-09-21 21:56:43,950 DEBUG PING => master (last message 20s ago) 2018-09-21 21:56:44,137 DEBUG master => PONG(20) 2018-09-21 21:57:04,159 DEBUG PING => master (last message 20s ago) 2018-09-21 21:57:04,351 DEBUG master => PONG(20)
And this is the master logs: 2018-09-21 21:45:56,427 INFO scheduling jobs: 2018-09-21 21:45:56,428 DEBUG - bcm2837-rpi-3-b-32 2018-09-21 21:45:56,453 DEBUG -> rpi3-corelab-01 (Idle, Good) 2018-09-21 21:45:56,453 DEBUG |--> [332] scheduling 2018-09-21 21:45:56,459 DEBUG - dra7-evm 2018-09-21 21:45:56,464 DEBUG - qemu 2018-09-21 21:45:56,660 INFO [332] START => agl-core-lab-1 (rpi3-corelab-01) 2018-09-21 21:45:57,341 INFO [332] agl-core-lab-1 => START_OK 2018-09-21 21:45:57,773 DEBUG lava-logs => PING(20) 2018-09-21 21:46:16,711 INFO scheduling health checks: 2018-09-21 21:46:16,724 INFO scheduling jobs: 2018-09-21 21:46:16,725 DEBUG - dra7-evm 2018-09-21 21:46:16,729 DEBUG - qemu 2018-09-21 21:46:17,351 DEBUG agl-core-lab-1 => PING(20) 2018-09-21 21:46:18,980 DEBUG lava-logs => PING(20) 2018-09-21 21:46:36,744 INFO scheduling health checks: 2018-09-21 21:46:36,757 INFO scheduling jobs: 2018-09-21 21:46:36,758 DEBUG - dra7-evm 2018-09-21 21:46:36,762 DEBUG - qemu 2018-09-21 21:46:37,566 DEBUG agl-core-lab-1 => PING(20) 2018-09-21 21:46:48,466 DEBUG lava-logs => PING(20) 2018-09-21 21:46:56,776 INFO scheduling health checks: 2018-09-21 21:46:56,790 INFO scheduling jobs: 2018-09-21 21:46:56,791 DEBUG - dra7-evm 2018-09-21 21:46:56,795 DEBUG - qemu 2018-09-21 21:46:57,797 DEBUG agl-core-lab-1 => PING(20) 2018-09-21 21:47:08,482 DEBUG lava-logs => PING(20) 2018-09-21 21:47:16,809 INFO scheduling health checks: 2018-09-21 21:47:16,822 INFO scheduling jobs: 2018-09-21 21:47:16,823 DEBUG - dra7-evm 2018-09-21 21:47:16,827 DEBUG - qemu 2018-09-21 21:47:23,718 DEBUG agl-core-lab-1 => PING(20) 2018-09-21 21:47:31,521 DEBUG lava-logs => PING(20) 2018-09-21 21:47:36,841 INFO scheduling health checks: 2018-09-21 21:47:36,854 INFO scheduling jobs: 2018-09-21 21:47:36,855 DEBUG - dra7-evm 2018-09-21 21:47:36,859 DEBUG - qemu 2018-09-21 21:47:54,275 DEBUG lava-logs => PING(20) 2018-09-21 21:47:56,872 INFO scheduling health checks: 2018-09-21 21:47:56,885 INFO scheduling jobs: 2018-09-21 21:47:56,886 DEBUG - dra7-evm 2018-09-21 21:47:56,891 DEBUG - qemu 2018-09-21 21:48:01,321 DEBUG agl-core-lab-1 => PING(20) 2018-09-21 21:48:14,292 DEBUG lava-logs => PING(20) 2018-09-21 21:48:16,903 INFO scheduling health checks: 2018-09-21 21:48:16,917 INFO scheduling jobs: 2018-09-21 21:48:16,918 DEBUG - dra7-evm 2018-09-21 21:48:16,922 DEBUG - qemu 2018-09-21 21:48:35,995 DEBUG lava-logs => PING(20) 2018-09-21 21:48:36,933 INFO scheduling health checks: 2018-09-21 21:48:36,946 INFO scheduling jobs: 2018-09-21 21:48:36,947 DEBUG - dra7-evm 2018-09-21 21:48:36,951 DEBUG - qemu 2018-09-21 21:48:39,486 DEBUG agl-core-lab-1 => PING(20) 2018-09-21 21:48:56,011 DEBUG lava-logs => PING(20) 2018-09-21 21:48:56,962 INFO scheduling health checks: 2018-09-21 21:48:56,975 INFO scheduling jobs: 2018-09-21 21:48:56,976 DEBUG - dra7-evm 2018-09-21 21:48:56,981 DEBUG - qemu 2018-09-21 21:49:16,995 INFO scheduling health checks: 2018-09-21 21:49:17,009 INFO scheduling jobs: 2018-09-21 21:49:17,010 DEBUG - dra7-evm 2018-09-21 21:49:17,014 DEBUG - qemu 2018-09-21 21:49:21,406 DEBUG lava-logs => PING(20) 2018-09-21 21:49:37,026 INFO scheduling health checks: 2018-09-21 21:49:37,040 INFO scheduling jobs: 2018-09-21 21:49:37,041 DEBUG - dra7-evm 2018-09-21 21:49:37,046 DEBUG - qemu 2018-09-21 21:49:41,419 DEBUG lava-logs => PING(20) 2018-09-21 21:49:52,363 ERROR [STATE] Dispatcher <agl-core-lab-1> goes OFFLINE 2018-09-21 21:49:57,059 INFO scheduling health checks: 2018-09-21 21:49:57,067 INFO scheduling jobs: 2018-09-21 21:50:01,440 DEBUG lava-logs => PING(20) 2018-09-21 21:50:17,078 INFO scheduling health checks: 2018-09-21 21:50:17,086 INFO scheduling jobs: 2018-09-21 21:50:21,462 DEBUG lava-logs => PING(20) 2018-09-21 21:50:37,097 INFO scheduling health checks: 2018-09-21 21:50:37,105 INFO scheduling jobs: 2018-09-21 21:50:41,484 DEBUG lava-logs => PING(20) 2018-09-21 21:50:57,117 INFO scheduling health checks: 2018-09-21 21:50:57,125 INFO scheduling jobs: 2018-09-21 21:51:01,502 DEBUG lava-logs => PING(20) 2018-09-21 21:51:17,135 INFO scheduling health checks: 2018-09-21 21:51:17,144 INFO scheduling jobs: 2018-09-21 21:51:21,524 DEBUG lava-logs => PING(20) 2018-09-21 21:51:37,154 INFO scheduling health checks: 2018-09-21 21:51:37,162 INFO scheduling jobs: 2018-09-21 21:51:41,546 DEBUG lava-logs => PING(20) 2018-09-21 21:51:57,170 INFO scheduling health checks: 2018-09-21 21:51:57,178 INFO scheduling jobs: 2018-09-21 21:51:58,734 DEBUG agl-core-lab-1 => PING(20) 2018-09-21 21:51:58,737 DEBUG agl-core-lab-1 => PING(20) 2018-09-21 21:51:59,125 INFO [332] agl-core-lab-1 => END 2018-09-21 21:51:59,954 DEBUG get_type_name: skipping finalize 2018-09-21 21:51:59,954 DEBUG type_name failed for finalize metatype finalize 2018-09-21 21:51:59,954 DEBUG get_type_name: skipping finalize 2018-09-21 21:51:59,954 DEBUG type_name failed for finalize metatype finalize 2018-09-21 21:51:59,954 DEBUG get_type_name: skipping finalize 2018-09-21 21:51:59,954 DEBUG type_name failed for finalize metatype finalize 2018-09-21 21:51:59,960 DEBUG agl-core-lab-1 => PING(20) 2018-09-21 21:51:59,964 INFO [332] agl-core-lab-1 => END (duplicated), skipping 2018-09-21 21:51:59,968 INFO [332] agl-core-lab-1 => END (duplicated), skipping 2018-09-21 21:51:59,971 INFO [332] agl-core-lab-1 => END (duplicated), skipping 2018-09-21 21:51:59,974 INFO [332] agl-core-lab-1 => END (duplicated), skipping 2018-09-21 21:51:59,977 DEBUG agl-core-lab-1 => PING(20) 2018-09-21 21:52:00,093 INFO [332] agl-core-lab-1 => END (duplicated), skipping 2018-09-21 21:52:00,287 INFO [332] agl-core-lab-1 => END (duplicated), skipping 2018-09-21 21:52:00,341 INFO [332] agl-core-lab-1 => END (duplicated), skipping 2018-09-21 21:52:00,493 INFO [332] agl-core-lab-1 => END (duplicated), skipping 2018-09-21 21:52:00,496 DEBUG agl-core-lab-1 => PING(20) 2018-09-21 21:52:00,674 INFO [332] agl-core-lab-1 => END (duplicated), skipping 2018-09-21 21:52:00,795 INFO [332] agl-core-lab-1 => END (duplicated), skipping 2018-09-21 21:52:00,881 INFO [332] agl-core-lab-1 => END (duplicated), skipping 2018-09-21 21:52:01,061 INFO [332] agl-core-lab-1 => END (duplicated), skipping 2018-09-21 21:52:01,063 DEBUG agl-core-lab-1 => PING(20) 2018-09-21 21:52:01,075 INFO [332] agl-core-lab-1 => END (duplicated), skipping 2018-09-21 21:52:01,256 INFO [332] agl-core-lab-1 => END (duplicated), skipping 2018-09-21 21:52:01,397 INFO [332] agl-core-lab-1 => END (duplicated), skipping 2018-09-21 21:52:01,453 INFO [332] agl-core-lab-1 => END (duplicated), skipping 2018-09-21 21:52:01,456 DEBUG agl-core-lab-1 => PING(20) 2018-09-21 21:52:01,568 DEBUG lava-logs => PING(20) 2018-09-21 21:52:01,603 INFO [332] agl-core-lab-1 => END (duplicated), skipping 2018-09-21 21:52:01,646 INFO [332] agl-core-lab-1 => END (duplicated), skipping 2018-09-21 21:52:01,722 INFO [332] agl-core-lab-1 => END (duplicated), skipping 2018-09-21 21:52:01,840 INFO [332] agl-core-lab-1 => END (duplicated), skipping 2018-09-21 21:52:01,842 DEBUG agl-core-lab-1 => PING(20) 2018-09-21 21:52:01,884 INFO [332] agl-core-lab-1 => END (duplicated), skipping 2018-09-21 21:52:01,985 INFO [332] agl-core-lab-1 => END (duplicated), skipping 2018-09-21 21:52:02,040 INFO [332] agl-core-lab-1 => END (duplicated), skipping 2018-09-21 21:52:02,106 INFO [332] agl-core-lab-1 => END (duplicated), skipping 2018-09-21 21:52:02,109 DEBUG agl-core-lab-1 => PING(20) 2018-09-21 21:52:02,220 INFO [332] agl-core-lab-1 => END (duplicated), skipping 2018-09-21 21:52:17,187 INFO scheduling health checks: 2018-09-21 21:52:17,201 INFO scheduling jobs: 2018-09-21 21:52:17,202 DEBUG - dra7-evm 2018-09-21 21:52:17,207 DEBUG - qemu 2018-09-21 21:52:21,585 DEBUG lava-logs => PING(20)
Lava-users mailing list Lava-users@lists.linaro.org https://lists.linaro.org/mailman/listinfo/lava-users
On Tue, Sep 25, 2018 at 09:03:03AM +0100, Neil Williams wrote:
On Tue, 25 Sep 2018 at 08:56, Corentin Labbe clabbe@baylibre.com wrote:
Hello
We got a job (number 332) stuck in running state. After 23h of inaction, the only way to stop it was to cancel+fail it. According to the logs, a small disconnection happen between the slave and master. The slave seems to try to update the final status of the job but the master "ignore" it.
What Debian package version(s) of lava-dispatcher on the slave and lava-server on the master?
On slave: ii lava-common 2018.7-1+stretch all Linaro Automated Validation Architecture common ii lava-dispatcher 2018.7-1+stretch amd64 Linaro Automated Validation Architecture dispatcher
On master: ii lava 2018.7-1+stretch all Linaro Automated Validation Architecture metapackage ii lava-common 2018.7-1+stretch all Linaro Automated Validation Architecture common ii lava-coordinator 0.1.7-1 all LAVA Coordinator daemon ii lava-dev 2018.7-1+stretch all Linaro Automated Validation Architecture developer support ii lava-dispatcher 2018.7-1+stretch amd64 Linaro Automated Validation Architecture dispatcher ii lava-server 2018.7-1+stretch all Linaro Automated Validation Architecture server ii lava-server-doc 2018.7-1+stretch all Linaro Automated Validation Architecture documentation
Can you attach (rather than inline) the test job log file (output.yaml)? The job ended very quickly, looks like a validate error.
This is the job output https://lava.automotivelinux.org/scheduler/job/332
It also looks like both master and slave went offline at the same time. Can you confirm that master and slave are both running in timezone UTC & if both have ntp installed?
Yes I confirm
Hello Corentin,
for what I can see in the job logs, the lava-run process was not killed cleanly as the last lines of logs are missing (like https://validation.linaro.org/scheduler/job/1894511#results_33286343). Even if lava-run is crashing the last line should be sent. So was the server running lava-run restarted? Do you know what happened to the lava-run process?
To understand what happened there, a job cycle is: lava-master => lava-slave: START lava-slave => lava-master: START_OK when lava-run is started lava-run => lava-logs: send the logs When the job is about to finish, lava-run logs the last results (lava.job result with pass or fail) When lava-logs receive such log line, it can mark a TestJob as finished and record the job health (canceled, success or failure) At the same time, lava-slave does notice that lava-run finishes and send an END message to lava-master. But lava-master won't do anything until lava-logs has marked the TestJob as finished. Because the logs hasn't been received yet.
In your case, the last line of log if missing, so lava-logs can't mark the job has finished. At the same time lava-master and lava-slave are both waiting for lava-logs. That's why I added a "fail" button that can force this transition when (for some reasons like a server crash) the last line of log is not going to be sent.
Rgds.
Le mer. 26 sept. 2018 à 09:43, LABBE Corentin clabbe@baylibre.com a écrit :
On Tue, Sep 25, 2018 at 09:03:03AM +0100, Neil Williams wrote:
On Tue, 25 Sep 2018 at 08:56, Corentin Labbe clabbe@baylibre.com
wrote:
Hello
We got a job (number 332) stuck in running state. After 23h of inaction, the only way to stop it was to cancel+fail it. According to the logs, a small disconnection happen between the slave
and
master. The slave seems to try to update the final status of the job but the master "ignore" it.
What Debian package version(s) of lava-dispatcher on the slave and lava-server on the master?
On slave: ii lava-common 2018.7-1+stretch all Linaro Automated Validation Architecture common ii lava-dispatcher 2018.7-1+stretch amd64 Linaro Automated Validation Architecture dispatcher
On master: ii lava 2018.7-1+stretch all Linaro Automated Validation Architecture metapackage ii lava-common 2018.7-1+stretch all Linaro Automated Validation Architecture common ii lava-coordinator 0.1.7-1 all LAVA Coordinator daemon ii lava-dev 2018.7-1+stretch all Linaro Automated Validation Architecture developer support ii lava-dispatcher 2018.7-1+stretch amd64 Linaro Automated Validation Architecture dispatcher ii lava-server 2018.7-1+stretch all Linaro Automated Validation Architecture server ii lava-server-doc 2018.7-1+stretch all Linaro Automated Validation Architecture documentation
Can you attach (rather than inline) the test job log file (output.yaml)? The job ended very quickly, looks like a validate error.
This is the job output https://lava.automotivelinux.org/scheduler/job/332
It also looks like both master and slave went offline at the same time.
Can
you confirm that master and slave are both running in timezone UTC & if both have ntp installed?
Yes I confirm
Lava-users mailing list Lava-users@lists.linaro.org https://lists.linaro.org/mailman/listinfo/lava-users
Hello,
the second known reason for a job to be stuck is if lava-logs is stuck for some reason.
I found a bug yesterday in the callbacks that can stuck lava-logs forever. In fact, a timeout is missing in the callback http request. If the remote server is taking forever to answer (I was playing with netcat as a remote server and netcat was not answering anything), then lava-logs (the process that is sending the notifications) will wait forever. A patch is available here : https://git.lavasoftware.org/lava/lava/merge_requests/113
As you are using a callback in the given job, that might be a reason.
Rgds
Le mar. 9 oct. 2018 à 10:57, Remi Duraffort remi.duraffort@linaro.org a écrit :
Hello Corentin,
for what I can see in the job logs, the lava-run process was not killed cleanly as the last lines of logs are missing (like https://validation.linaro.org/scheduler/job/1894511#results_33286343). Even if lava-run is crashing the last line should be sent. So was the server running lava-run restarted? Do you know what happened to the lava-run process?
To understand what happened there, a job cycle is: lava-master => lava-slave: START lava-slave => lava-master: START_OK when lava-run is started lava-run => lava-logs: send the logs When the job is about to finish, lava-run logs the last results (lava.job result with pass or fail) When lava-logs receive such log line, it can mark a TestJob as finished and record the job health (canceled, success or failure) At the same time, lava-slave does notice that lava-run finishes and send an END message to lava-master. But lava-master won't do anything until lava-logs has marked the TestJob as finished. Because the logs hasn't been received yet.
In your case, the last line of log if missing, so lava-logs can't mark the job has finished. At the same time lava-master and lava-slave are both waiting for lava-logs. That's why I added a "fail" button that can force this transition when (for some reasons like a server crash) the last line of log is not going to be sent.
Rgds.
Le mer. 26 sept. 2018 à 09:43, LABBE Corentin clabbe@baylibre.com a écrit :
On Tue, Sep 25, 2018 at 09:03:03AM +0100, Neil Williams wrote:
On Tue, 25 Sep 2018 at 08:56, Corentin Labbe clabbe@baylibre.com
wrote:
Hello
We got a job (number 332) stuck in running state. After 23h of inaction, the only way to stop it was to cancel+fail it. According to the logs, a small disconnection happen between the slave
and
master. The slave seems to try to update the final status of the job but the master "ignore" it.
What Debian package version(s) of lava-dispatcher on the slave and lava-server on the master?
On slave: ii lava-common 2018.7-1+stretch all Linaro Automated Validation Architecture common ii lava-dispatcher 2018.7-1+stretch amd64 Linaro Automated Validation Architecture dispatcher
On master: ii lava 2018.7-1+stretch all Linaro Automated Validation Architecture metapackage ii lava-common 2018.7-1+stretch all Linaro Automated Validation Architecture common ii lava-coordinator 0.1.7-1 all LAVA Coordinator daemon ii lava-dev 2018.7-1+stretch all Linaro Automated Validation Architecture developer support ii lava-dispatcher 2018.7-1+stretch amd64 Linaro Automated Validation Architecture dispatcher ii lava-server 2018.7-1+stretch all Linaro Automated Validation Architecture server ii lava-server-doc 2018.7-1+stretch all Linaro Automated Validation Architecture documentation
Can you attach (rather than inline) the test job log file (output.yaml)? The job ended very quickly, looks like a validate error.
This is the job output https://lava.automotivelinux.org/scheduler/job/332
It also looks like both master and slave went offline at the same time.
Can
you confirm that master and slave are both running in timezone UTC & if both have ntp installed?
Yes I confirm
Lava-users mailing list Lava-users@lists.linaro.org https://lists.linaro.org/mailman/listinfo/lava-users
-- Rémi Duraffort LAVA Team
lava-users@lists.lavasoftware.org