We now have a proper rhel-10.0 distribution, and this alias is clashing
with it, so we are seeing the following message in production:
failed to configure distro aliases: invalid aliases: ["alias 'rhel-10.0' masks an existing distro"]
Let's fix it by removing the alias; it's obviously not needed anymore.
Set the RHEL release names without a minor version to point to the
latest GA release. Set 'rhel-10.0' to the latest RHEL-9 minor release
in development, so that one can start building RHEL-10 images without
referencing RHEL-9.
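For illustration, the resulting alias map might look something like this
(a sketch only; the config key and the exact minor versions are hypothetical):
```
# hypothetical alias map: alias name -> real distro name
distro_aliases:
  rhel-8: rhel-8.10      # name without a minor version -> latest GA release
  rhel-9: rhel-9.4       # name without a minor version -> latest GA release
  rhel-10.0: rhel-9.5    # latest in-development RHEL-9 minor, until a real RHEL-10 exists
```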
Signed-off-by: Tomáš Hozza <thozza@redhat.com>
The fluentd sidecar had the same request/limit as the service container,
and the migrate init-container had fluentd's request/limit. It should
be the other way round; see the sketch below.
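Roughly, the corrected shape (container names and values are illustrative,
not the real template):
```
containers:
  - name: composer             # service container keeps the service-sized values
    resources:
      requests: {cpu: "500m", memory: "512Mi"}
      limits: {cpu: "1", memory: "1Gi"}
  - name: fluentd-sidecar      # sidecar gets the smaller fluentd-sized values
    resources:
      requests: {cpu: "50m", memory: "128Mi"}
      limits: {cpu: "100m", memory: "256Mi"}
initContainers:
  - name: composer-migrate     # migrate gets the values the sidecar had
    resources:
      requests: {cpu: "500m", memory: "512Mi"}
      limits: {cpu: "1", memory: "1Gi"}
```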
The job won't run if it doesn't get scheduled within 30 minutes. This
prevents the job from running multiple times in a row to catch up on
missed runs when it couldn't be scheduled, for instance due to resource
limits.
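In CronJob terms this is startingDeadlineSeconds (assuming that is the
field being set here):
```
apiVersion: batch/v1
kind: CronJob
spec:
  # count a run as missed if it cannot start within 30 minutes of its
  # scheduled time, rather than starting it late
  startingDeadlineSeconds: 1800
```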
If the secret does not exist, the deployment fails during the validation
stage:
```
[ERROR] [openshift_base.py:_validate_resources_used_exist] - [Deployment/composer] Secret db does not exist
```
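The fix is presumably to make sure a Secret with the referenced name
exists in the namespace; a minimal sketch (the keys are hypothetical):
```
apiVersion: v1
kind: Secret
metadata:
  name: db                # the name Deployment/composer references
stringData:
  db.host: composer-db    # hypothetical keys and values
  db.user: composer
```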
This value is set in the worker config. In future it might also be
passed through the API to upload into target accounts, but it should
never be set in composer.
Linter output:
Specify anti-affinity in your pod specification to ensure that the
orchestrator attempts to schedule replicas on different nodes. Using
podAntiAffinity, specify a labelSelector that matches pods for the
deployment, and set the topologyKey to kubernetes.io/hostname.
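Concretely, something like this in the pod template (the label is
hypothetical and must match the deployment's pods):
```
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: composer              # hypothetical label
          topologyKey: kubernetes.io/hostname
```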
Similar to DRY_RUN, these values should be overridden in app-interface
per namespace. At some point the maintenance specific to the CRC tenant
(AWS and GCP maintenance) should run in the workers namespace rather
than the composer namespace; that needs this per-namespace granularity.
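In app-interface this would look roughly like per-target parameters in
the saas file (paths and parameter names are hypothetical):
```
resourceTemplates:
  - name: composer
    targets:
      - namespace:
          $ref: /services/composer/namespaces/production.yml   # hypothetical path
        parameters:
          DRY_RUN: "false"
          AWS_MAINTENANCE_ENABLED: "true"    # hypothetical parameter names
          GCP_MAINTENANCE_ENABLED: "true"
```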
We don't have permissions to write to /run when running on OpenShift,
so let's just use /tmp and change the filename to prevent any conflicts.
Signed-off-by: Ondřej Budai <ondrej@budai.cz>
Currently, liveness and readiness are treated the same. However, their
behaviour at shutdown is meant to differ: when a service is not ready,
no new connections are made to it; when a service is not live, it can be
cleaned up.
By considering our service live if and only if it listens to HTTP requests,
we lose the opportunity to clean up after we stop listening for new requests.
Leave readiness probes as they are, and instead use a file in the filesystem to
indicate when the service is live. It is created before composer is spawned and
deleted once composer exits.
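A sketch of the pattern (the file path, binary path, and probe details
are illustrative):
```
containers:
  - name: composer
    command: ["/bin/sh", "-c"]
    args:
      # create the liveness file, run composer, delete the file once it exits
      - |
        touch /tmp/composer-live
        /usr/libexec/osbuild-composer/osbuild-composer
        rm -f /tmp/composer-live
    livenessProbe:
      exec:
        command: ["test", "-f", "/tmp/composer-live"]
    readinessProbe:            # readiness stays as it is, HTTP-based
      httpGet:
        path: /status          # hypothetical endpoint
        port: 8080
```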
Because of the way the GCP secrets are stored for the workers, and how
the mapping from Vault to OpenShift works (it's not possible to map a
multi-key secret into a single JSON file), a bit of juggling is required
to get the GCP credentials into the right format.
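For example, the juggling could be an init step that reassembles the
single JSON file from the individual secret keys (paths and key names
are hypothetical):
```
initContainers:
  - name: assemble-gcp-credentials
    command: ["/bin/sh", "-c"]
    args:
      # rebuild the single JSON credentials file from per-key secret mounts
      - |
        cat > /var/tmp/gcp/credentials.json << EOF
        {"type": "service_account",
         "project_id": "$(cat /secrets/gcp/project_id)",
         "private_key": "$(cat /secrets/gcp/private_key)"}
        EOF
```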
When backed by a DB, composer has no need of a queue directory.
This also addresses "Error moving artifacts for job" logging noise.
Signed-off-by: sanne <sanne.raymaekers@gmail.com>
We actually need at least 2 * 16 = 32 connections (each of the 16
workers waits for two jobs). Let's bump the maximum connection count
even more.
Signed-off-by: Ondřej Budai <ondrej@budai.cz>
By default, pgxpool.Pool has 4 connections (or the number of CPUs, if
higher). Currently we have 3 replicas, which means at most 3 * 4 = 12 DB
connections.
The dequeue operation is actually blocking: while a worker is waiting for
a job, one connection is blocked. My theory is that with 16 workers we
just don't have enough connections, which causes all sorts of weird
slowdowns.
This commit bumps the number of connections per replica to 10, so we
should be at 30 connections in total.
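For reference, a minimal Go sketch of how such a pool limit is set with
pgxpool (the connection string is illustrative; composer's actual DB
setup differs):
```
package main

import (
	"context"
	"log"

	"github.com/jackc/pgx/v4/pgxpool"
)

func main() {
	// pool_max_conns overrides pgxpool's default of max(4, number of CPUs)
	config, err := pgxpool.ParseConfig("postgres://composer@db:5432/composer?pool_max_conns=10")
	if err != nil {
		log.Fatal(err)
	}
	config.MaxConns = 10 // equivalent to the DSN parameter above

	pool, err := pgxpool.ConnectConfig(context.Background(), config)
	if err != nil {
		log.Fatal(err)
	}
	defer pool.Close()
}
```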
Signed-off-by: Ondřej Budai <ondrej@budai.cz>