Commit graph

173 commits

Author SHA1 Message Date
Sanne Raymaekers
c4360a67f5 jobqueue: handle escaped null bytes in postgres
Postgres doesn't accept `\u0000` in the jsonb datatype. Switch to the
json datatype which is larger and slower, but accepts escaped null
bytes.

As we don't actually query or index the result jsonb directly, the
impact of this should be minimal.

See: https://www.postgresql.org/docs/current/datatype-json.html
2025-07-25 13:10:10 +02:00
Brian C. Lane
c3394ae589 jobqueuetest: Add job delete tests
This adds tests for retrieving all root jobs, and deleting jobs
and unused dependencies. These tests are run against the fsjobqueue for
unit testing, and against dbjobqueue for integration testing.

Resolves: RHEL-60120
2025-06-05 10:32:56 +02:00
Brian C. Lane
5cddc4223d dbjobqueue: Add AllRootJobIDs implementation
Related: RHEL-60120
2025-06-05 10:32:56 +02:00
Brian C. Lane
d8285a0b74 jobqueue: Add DeleteJob function
This allows jobs to be deleted from the database.
Currently only implemented by fsjobqueue. The function for
dbjobqueue currently returns nil.

This will remove all the job files used by the root job UUID as long as
no other job depends on them. ie. It starts at the top, and moves down
the dependency tree until it finds a job that is also used by another
job, removes the job to be deleted from its dependants list, and moves
back up the tree only deleting jobs with empty dependants lists.

Related: RHEL-60120
2025-06-05 10:32:56 +02:00
Brian C. Lane
87c0462a33 jobqueue: Add AllRootJobIDs function to jobqueue
This lists the root job UUIDs (the jobs with no dependants).
Currently only implemented by fsjobqueue. The function for
dbjobqueue currently returns nil.

Related: RHEL-60120
2025-02-03 17:27:31 -08:00
Florian Schüller
ca3f0a190f internal/jobqueue/jobqueuetest/jobqueuetest: fix DB tests
I got confused as the jobqueue interface is asymmetric.
It expects an object and returns a json.RawMessage
and when handing over to postgres this is abstracted
away by postgres
2024-11-19 13:55:38 +01:00
Florian Schüller
d3e3474fb7 internal/worker/server: return an error on depsolve timeout HMS-2989
Fixes the special case that if no worker is available and we
generate an internal timeout and cancel the depsolve including all
followup jobs, no error was propagated.
2024-11-19 13:55:38 +01:00
Sanne Raymaekers
056b3c5ea6 jobqueue: return if a job was requeued or not 2024-11-07 17:18:48 +01:00
Florian Schüller
ece16307c6 jobqueuetest: avoid warning and provide a valid JSON
Not needed for the test but just generates a useless warning
2024-11-06 15:16:42 +01:00
Florian Schüller
00d3f07d08 Makefile: implement make db-tests
enables the option to run the DB tests locally
that are executed in the github actions
2024-11-06 15:16:42 +01:00
Sanne Raymaekers
1b4935c325 jobqueue: add channel to workers
Stores the channel alongside the worker.
2024-04-19 14:32:07 +02:00
Sanne Raymaekers
ac854b7cc8 pkg/jobqueue: add arch to worker 2023-12-14 21:25:32 +01:00
Sanne Raymaekers
d784075d31 jobqueue: add ability to track workers 2023-12-06 17:22:36 +01:00
Brian C. Lane
aca748bc14 Don't Panic in getComposeStatus and skip invalid jobs in fsjobqueue New
This handles corrupt job json files by skipping them. They still exist,
and errors are logged, but the system keeps working.

If one or more of the json files in /var/lib/osbuild-composer/jobs/
becomes corrupt they can stop the osbuild-composer service from
starting, or stop commands like 'composer-cli compose status' from
working because they quit on the first error and miss any job that
aren't broken.
2023-11-20 13:34:40 +01:00
Ondřej Budai
cac9327b44 update to go 1.19
UBI and the oldest support Fedora (37) now all have go 1.19, so we are
cleared to switch.

gofmt now reformats comments in certain cases, so that explains the formatting
changes in this commit.
See https://go.dev/doc/go1.19#go-doc

Signed-off-by: Ondřej Budai <ondrej@budai.cz>
2023-07-21 19:18:00 +02:00
Tom Gundersen
626530818d worker/server: requeue unresponsive jobs
If a job is unresponsive the worker has most likely crashed or been shut
down and the in-progress job been lost.

Instead of failing these jobs, requeue them up to two times. Once a job is lost
a third time it fails. This avoids infinite loops.

This is implemented by extending FinishJob to RequeuOrFinish job. It takes a
max number of requeues as an argument, and if that is 0, it has the same
behavior as FinishJob used to have.

If the maximum number of requeues has not yet been reached, then the running
job is returned to pending state to be picked up again.
2022-11-02 15:26:00 +01:00
Sanne Raymaekers
0fe3f1b2ae jobqueue: Query job dependents 2022-08-30 16:14:52 +02:00
Simon de Vlieger
78ae275c61 jobqueue: store an expiry date
This introduces an expiry date (default: 14 days from insert date) and
adjust the service-maintenance script to delete jobs that are older than
the expiration date.
2022-07-13 17:26:04 +02:00
Sanne Raymaekers
03b57f002c jobqueue: Move jobqueue out of internal 2022-07-04 15:37:28 +02:00
Sanne Raymaekers
d9bd19404d osbuild-service-maintenance: Move maintenance queries out of jobqueue 2022-07-04 15:37:28 +02:00
Sanne Raymaekers
ff408aa68f osbuild-service-maintenance: Vacuum tables
Call vacuum analyze after each chunk of updates, and dump vacuum stats
at the beginning and end of the db cleanup.

Nulling results can increase size on disk, but calling vacuum analyze
will free up space within the table (not on disk) and reuse the space
for new inserts and updates.
2022-06-08 21:12:46 +02:00
Sanne Raymaekers
8bfc6c9961 dbjobqueue: Filter maintenance queries based on results
Jobs that already had their results nulled, shouldn't be included in the
maintenance job.
2022-06-08 21:12:46 +02:00
Chloe Kaubisch
873798514b prometheus: add tenant label
Include a tenant label for all prometheus metrics. Modify
jobstatus function in the worker accordingly to return channel
so it can be passed to prometheus.
2022-06-07 16:35:03 +02:00
Sanne Raymaekers
9b119fa4cf osbuild-service-maintenance: Delete results from select jobs
Instead of deleting records, delete the results from the manifest and
depsolve jobs. This redacts sensitive data which the manifest can
contain, and this conserves space.
2022-06-03 14:38:53 +02:00
Sanne Raymaekers
9bff4a4f0f dbjobqueue: Alter foreign key constraints
When deleting rows from the job table, make sure the delete is cascaded
to the dependencies and heartbeat tables.
2022-06-02 18:45:24 +02:00
Eng Zer Jun
00ea3eb285 test: use T.TempDir to create temporary test directory
The directory created by `T.TempDir` is automatically removed when the
test and all its subtests complete.

Reference: https://pkg.go.dev/testing#T.TempDir
Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
2022-04-05 09:27:43 +02:00
Ondřej Budai
e9ce9370c6 dbjoqbqueue: actually use the transaction object when a tx is created
Transactions are tied to a connection so this is actually not a functional
change. Nevertheless, I think it's nice to explicitly state that we are
using a transaction.

Signed-off-by: Ondřej Budai <ondrej@budai.cz>
2022-03-22 17:49:22 +01:00
Ondřej Budai
187eb188da dbjoqbqueue: wait for listener to become ready before returning from New
Otherwise, there might be an already waiting dequeuer and if something is
enqueued before `sqlListen` is called, we will lost this notification.

Also, a small log message was added when shutting down the listener.

Signed-off-by: Ondřej Budai <ondrej@budai.cz>
2022-03-22 17:49:22 +01:00
Ondřej Budai
c4c7f44fcb dbjobqueue: reimplement the jobqueue to use only one listening connection
Previously, all dequeuers (goroutines waiting for a job to be dequeued) were
listening for new messages on postgres channel jobs (LISTEN jobs). This didn't
scale well as each dequeuer required to have its own DB connection and the
number of DB connections is hard-limited in the pool's config.

I changed the logic to work somewhat differently: dbjobqueue.New() now spawns
a goroutine that listens on the postgres channel. If there's a new message,
the goroutine just wakes up all dequeuers using a standard go channel.
Go channels are cheap so this should scale much better.

A test was added that confirms that 100 dequeuers are not a big deal now. This
test failed when I tried to run on it on the previous commit. I tried even 1000
locally and it was still fine.

Signed-off-by: Ondřej Budai <ondrej@budai.cz>
2022-03-11 16:04:52 +01:00
Ondřej Budai
c8dbe0de74 dbjobqueue: remove unused variables from Dequeue
Removing queued_at and started_at is pretty straightforward, it wasn't needed.
Removing token might seem concerning but basically we were just pulling
the same value from DB as we were pushing there. I think there's no value in
doing that.

Signed-off-by: Ondřej Budai <ondrej@budai.cz>
2022-03-11 16:04:52 +01:00
Ondřej Budai
2765d2d9a8 jobqueuetest: add a test for multiple channels
Signed-off-by: Ondřej Budai <ondrej@budai.cz>
2022-03-08 12:07:00 +01:00
Ondřej Budai
32080e6202 jobqueuetest: modify testArgs to test also channels
jobqueue.Job must return the channel specified in jobqueue.Enqueue during
the whole lifecycle of the given job.

Signed-off-by: Ondřej Budai <ondrej@budai.cz>
2022-03-08 12:07:00 +01:00
Ondřej Budai
4c31b04a65 jobqueuetest: add channel arg to the pushTestJob helper
Signed-off-by: Ondřej Budai <ondrej@budai.cz>
2022-03-08 12:07:00 +01:00
Ondřej Budai
7bfcee36f8 jobqueue: introduce the concept of channels
Channels are a concept similar to job types. Callers must specify a channel
name when queueing a new job. A list of channels is also specified when
dequeueing a job. The dequeued job's channel will always be from one of the
specified channel. Of course, the job types are also respected. The dequeued
job will also always be from one of the specified type.

Currently, all calls to jobqueue were changed so all queue operations use
an empty channel name and all dequeue operations use a list containing
an empty channel.

Thus, this is a non-functional change.

Signed-off-by: Ondřej Budai <ondrej@budai.cz>
2022-03-08 12:07:00 +01:00
Ondřej Budai
9c80a17ee5 fsjobqueue: refactor to allow dequeuing by multiple criteria
Previous implementation of fsjobqueue is amazing but it has its drawbacks:
- dequeueing can be done only based on a job type
- it's limited to 100 jobs per a job type

As we soon want to be able to dequeue also by another criteria (job channel),
we need to refactor the queue.

The new implementation is more naive but also more flexible. It basically
works like the dbjobqueue - dequeueing goroutines listen for newly added
jobs. When that happens, a signal is sent to all of them and they all inspect
all pending jobs and dequeue ones that match their needs. Ones that don't find
a suitable job, are waiting for the next signal.

This is certainly slower implementation as every time a new job is added into
the queue, all dequeueing goroutines will have to iterate over all
pending jobs. I think that's fine because fsjobqueue is not recommended
to use for composer instances with heavy load.

Signed-off-by: Ondřej Budai <ondrej@budai.cz>
2022-02-16 17:14:36 +01:00
Gianluca Zuccarelli
6c4caec022 metrics: move metrics to worker server
For simplicity, the collection of the job metrics
was carried out in the job queue. This was only
being done in the dbqueue and not in the fsqueue.
This pr refactors the metric collection and moves
the job metrics to the worker server, by adding a
wrapper function to enqueueing jobs so that the
metrics only have to be recorded in one place when
queueing a job.
2022-02-03 23:40:42 +00:00
Gianluca Zuccarelli
bce12b7bea metrics: extract metric collection
Refactor the current metric collection to make use
of re-usable functions, since some of the same queries
are repeated. This will also make it easier to move
the collection of metrics from the job queue.
2022-02-03 23:40:42 +00:00
Tom Gundersen
b32ab36e1d worker/server: typesafe Job and JobStatus
Replace Job() and JobStatus() with typesafe versions, and introduce JobType()
for the rare instances where we don't know the type up front.

Additionally, catch a few more error cases:
 - if OSBuildResult is nil, then we failed to invoke osbuild
 - make sure the same JobResult handling is done for osbuild-koji, as for osbuild
2022-02-01 20:28:40 +00:00
Ondřej Budai
ab3990b90a dbjobqueue: fix FinishJob not returning an error if already finished
Reported by covscan

Signed-off-by: Ondřej Budai <ondrej@budai.cz>
2021-12-18 00:14:07 +00:00
Gianluca Zuccarelli
1a709eda5c metrics: add initial job metrics
Add job metrics to track the number of
pending/running jobs, the duration of
the jobs and how long the jobs spent in
the job queue.
2021-12-08 21:49:43 +00:00
sanne
c43ad2b22a osbuild-service-maintenance: Clean up expired images 2021-12-03 00:14:09 +00:00
Ondřej Budai
14b29ae98a dbjobqueue: don't log when context's deadline was exceeded
This happens rather often as we limit the request job timeout to 20s on the
service.

Signed-off-by: Ondřej Budai <ondrej@budai.cz>
2021-11-25 08:20:22 +01:00
Diaa Sami
df73b835c3 jobqueue: improve logging
Add job ID where it's missing
2021-11-16 19:16:34 +01:00
Diaa Sami
9c6438c8f4 jobqueue: include dependent job IDs when logging 2021-11-16 19:16:34 +01:00
Ondřej Budai
d3a3dbafed jobqueue: add DequeueByID
We will soon need to dequeue a job using its ID. This commit adds ability
to do that to the Jobqueue interface. As always, the fsjobqueue implementation
is slightly naive but it should fine for the usecases that it's designed for.

Signed-off-by: Ondřej Budai <ondrej@budai.cz>
2021-11-14 10:17:03 +01:00
Ondřej Budai
2ecc48727f fsjobqueue: factor out finished deps check
Signed-off-by: Ondřej Budai <ondrej@budai.cz>
2021-11-14 10:17:03 +01:00
Ondřej Budai
5f4db72777 fsjobqueue: do not delete empty channels
Previously, we deleted empty channels when a job was dequeued. This is
completely wrong because there still might be some clients waiting for
a job. This commit removes the cleanup and adds a regression test.

Note that this has the potential to leak memory if we ever use a lot of
job types. Currently, we have just handful of them, so this is fine.

Signed-off-by: Ondřej Budai <ondrej@budai.cz>
2021-11-14 10:17:03 +01:00
Diaa Sami
751ef84fc1 jobqueue: Better logging
Use appropriate logging levels for different statements
Log important jobqueue events with relevant information
2021-10-19 10:03:39 +01:00
sanne
d25ae71fef worker: Configurable timeout for RequestJob
This is backwards compatible, as long as the timeout is 0 (never
timeout), which is the default.

In case of the dbjobqueue the underlying timeout is due to
context.Canceled, context.DeadlineExceeded, or net.Error with Timeout()
true. For the fsjobqueue only the first two are considered.
2021-10-19 00:12:18 +01:00
sanne
9fab5def90 dbjobqueue: Reduce error noise in rollback check
If the transaction is already closed don't log the rollback failure as
an error, it means it was successfully committed.
2021-08-20 15:42:57 +02:00