worker/server: requeue unresponsive jobs
If a job is unresponsive the worker has most likely crashed or been shut down and the in-progress job been lost. Instead of failing these jobs, requeue them up to two times. Once a job is lost a third time it fails. This avoids infinite loops. This is implemented by extending FinishJob to RequeuOrFinish job. It takes a max number of requeues as an argument, and if that is 0, it has the same behavior as FinishJob used to have. If the maximum number of requeues has not yet been reached, then the running job is returned to pending state to be picked up again.
This commit is contained in:
parent
d02f666a4b
commit
626530818d
8 changed files with 216 additions and 61 deletions
|
|
@ -481,11 +481,29 @@ jq '.customizations.packages = [ "jesuisunpaquetquinexistepas" ]' "$REQUEST_FILE
|
|||
sendCompose "$REQUEST_FILE2"
|
||||
waitForState "failure"
|
||||
|
||||
# crashed/stopped/killed worker should result in a failed state
|
||||
# crashed/stopped/killed worker should result in the job being retried
|
||||
sendCompose "$REQUEST_FILE"
|
||||
waitForState "building"
|
||||
sudo systemctl stop "osbuild-remote-worker@*"
|
||||
waitForState "failure"
|
||||
RETRIED=0
|
||||
for RETRY in {1..10}; do
|
||||
ROWS=$(sudo ${CONTAINER_RUNTIME} exec "${DB_CONTAINER_NAME}" psql -U postgres -d osbuildcomposer -c \
|
||||
"SELECT retries FROM jobs WHERE id = '$COMPOSE_ID' AND retries = 1")
|
||||
if grep -q "1 row" <<< "$ROWS"; then
|
||||
RETRIED=1
|
||||
break
|
||||
else
|
||||
echo "Waiting until job is retried ($RETRY/10)"
|
||||
sleep 30
|
||||
fi
|
||||
done
|
||||
if [ "$RETRIED" != 1 ]; then
|
||||
echo "Job $COMPOSE_ID wasn't retried after killing the worker"
|
||||
exit 1
|
||||
fi
|
||||
# remove the job from the queue so the worker doesn't pick it up again
|
||||
sudo ${CONTAINER_RUNTIME} exec "${DB_CONTAINER_NAME}" psql -U postgres -d osbuildcomposer -c \
|
||||
"DELETE FROM jobs WHERE id = '$COMPOSE_ID'"
|
||||
sudo systemctl start "osbuild-remote-worker@localhost:8700.service"
|
||||
|
||||
# full integration case
|
||||
|
|
|
|||
Loading…
Add table
Add a link
Reference in a new issue