Only add the arch label for osbuild job types, as the finish metrics
behave similarly. Having arch labels on dequeue metrics for any other
job type (but not on the finish metrics) would produce weird results.
We have switched how 5xx errors are being recorded
internally and we are now recording all failures
for all endpoints. As a result, a dedicated metric
only for compose failures is no longer required.
Openshift overrides the `service` label for
all metrics in the cluster. Update the label
from `service` to `subsystem` for the status
metrics query. This helps us differentiate
between requests from composer and the worker
server.
Create a custom middleware function
to measure 5xx requests for all composer
& worker routes and not just the `/composer`
endpoint. The result is a prometheus metric
that contains info on the request status code,
path & method.
A helper function has been added to clean the
dynamic parameters in the path routes to reduce
metric cardinality
Add a helper function to register the same metrics
for both the worker and composer - the only difference
being the subsystem name. The function checks if the
metric has already been registered and, if so, returns
the already registered metric.
Add the architecture label to build jobs
which will enable filtering and monitoring
build jobs by architecture. Build job results
contain the `arch` field in the results struct,
this is then used to pass to the metrics, where
there is a value, otherwise it is set to an
empty string.
Include a tenant label for all prometheus metrics. Modify
jobstatus function in the worker accordingly to return channel
so it can be passed to prometheus.
We are interested in the time it takes from a job could be dequeued
until it is, but if a job has dependencies that are not yet finished, it
cannot be dequeued.
Change the logic to measure the time since the last dependency was
dequeued rather than when the job was queued.
The purpose of this metric is to have an alert fire in case we have too
few workers processing jobs.
Currently the job metrics are namespaced with the composer
subsystem, i.e. `composer_worker`. Since we plan to split
the components to their own namespaces in app interface,
the worker subsystem should be split too.
This commit introduces the collection of error
metrics since it is now possible to differentiate
between internal errors and user input errors.
Additionally, the error status is reported for
job duration metrics.
Refactor the current metric collection to make use
of re-usable functions, since some of the same queries
are repeated. This will also make it easier to move
the collection of metrics from the job queue.
The change between the 32s bucket and the 64s bucket is too drastic
for measuring the duration of depsolve jobs. At present, 90% of the
depsolve jobs have a duration inbetween 32s and 64s, making the 32s
bucket too sensitive and the 64s bucket not sensitive enough.