worker/server: requeue unresponsive jobs
If a job becomes unresponsive, the worker has most likely crashed or been shut down, and the in-progress job has been lost. Instead of failing such jobs immediately, requeue them up to two times; once a job is lost a third time, it fails. This avoids infinite requeue loops. The change is implemented by extending FinishJob into RequeueOrFinishJob, which takes the maximum number of requeues as an argument. With a maximum of 0 it behaves exactly as FinishJob used to; otherwise, as long as the maximum has not been reached, the running job is returned to the pending state to be picked up again.
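The decision logic described above can be sketched in Go. This is a minimal, self-contained illustration of the requeue-or-fail behavior, not the real dbjobqueue implementation; the `job` struct and `requeueOrFinish` names are assumptions made for this example.

```go
package main

import (
	"errors"
	"fmt"
)

// job is a simplified model of the queue state relevant to requeueing
// (hypothetical; the real implementation stores this in PostgreSQL).
type job struct {
	id      int
	retries int
	pending bool
	failed  bool
}

// requeueOrFinish sketches the behavior described in the commit message:
// with maxRetries == 0 a lost job fails immediately (the old FinishJob
// behavior for unresponsive jobs); otherwise the job is returned to the
// pending state until it has been requeued maxRetries times, after which
// it fails.
func requeueOrFinish(j *job, maxRetries int) error {
	if j.retries < maxRetries {
		j.retries++
		j.pending = true // returned to pending, picked up again by a worker
		return nil
	}
	j.failed = true
	return errors.New("job exceeded maximum number of requeues")
}

func main() {
	j := &job{id: 1}
	// With maxRetries = 2 the job is requeued twice; the third loss fails it.
	for attempt := 1; attempt <= 3; attempt++ {
		err := requeueOrFinish(j, 2)
		fmt.Printf("attempt %d: retries=%d failed=%v err=%v\n",
			attempt, j.retries, j.failed, err)
	}
}
```

A max-retries argument of 0 makes the function degenerate to the old FinishJob behavior, which is why the existing call sites can be migrated without changing semantics.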
Parent commit: d02f666a4b
Commit: 626530818d
8 changed files with 216 additions and 61 deletions
pkg/jobqueue/dbjobqueue/schemas/006_retry_count.sql (new file, 16 lines)
@@ -0,0 +1,16 @@
-- add the retries column
ALTER TABLE jobs
ADD COLUMN retries BIGINT DEFAULT 0;

-- We added a column, thus we have to recreate the view.
CREATE OR REPLACE VIEW ready_jobs AS
SELECT *
FROM jobs
WHERE started_at IS NULL
  AND canceled = FALSE
  AND id NOT IN (
    SELECT job_id
    FROM job_dependencies JOIN jobs ON dependency_id = id
    WHERE finished_at IS NULL
  )
ORDER BY queued_at ASC