Update the RHEL-9 Cloud Access images used for our workers from 9.0 to
9.6, which is the latest GA. We do upgrade all packages in our Ansible
playbook, but that is just a waste of resources when we can use the
latest GA images.
Signed-off-by: Tomáš Hozza <thozza@redhat.com>
Update Fedora workers from EOL F40 to F42.
Remove workarounds that should no longer be needed (the related Packer
upstream issue has been closed).
Signed-off-by: Tomáš Hozza <thozza@redhat.com>
The `cloud-init.target` in 9.6 has `After=multi-user.target` in its unit
config. The worker initialization service was set to run before
`multi-user.target`, but after `cloud-final.service`. This created an
impossible ordering cycle, and systemd just disabled the initialization
service.
So this changes the ordering from
`multi-user.target -> worker-*.service -> cloud-final.service -> multi-user.target`
to
`cloud-init.target -> worker-*.service -> cloud-final.service -> multi-user.target`,
resolving the loop.
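A minimal sketch of what the new ordering could look like as unit
directives; the unit name and the exact directives here are illustrative,
not the actual file:

    # worker-init.service (illustrative name)
    [Unit]
    # run only after cloud-final.service has finished...
    After=cloud-final.service
    # ...and before cloud-init.target instead of before multi-user.target,
    # which is what previously closed the cycle
    Before=cloud-init.target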
Imagine this scenario: the packer job is run and an AMI gets created.
We configure our deployment to use this AMI. Then, someone retries the
packer job. Since we have force_deregister=true, this will not only
create a new AMI, but also remove the old one (because it has the same
name). Thus, our deployment breaks, because the source AMI no longer
exists. This means that the ASG cannot replace any broken instances, and
the secure instance feature stops working entirely because it cannot
spawn new secure instances (they "inherit" the AMI ID from their
parents).
Let's remove force_deregister=true, so the AMI never gets replaced.
This might cause some pipelines to start failing because they rerun the
packer job for the same commit (currently the GA pipeline). Let's fix
those instead; rerunning the packer job is just confusing.
If this causes some unexpected issues, we can always resort to using
unique AMI names (by appending a timestamp to the name), but having
multiple AMIs with different names and the same tags would cause our
terraform configuration to be reapplied every time there's a rerun,
which is also not great.
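For illustration, a sketch of the relevant amazon-ebs source settings
after this change; the HCL layout, the source name and the
`var.commit_sha` variable are assumptions, not our actual packer config:

    source "amazon-ebs" "worker" {
      # The AMI name stays tied to the commit; without force_deregister a
      # rerun for the same commit now fails instead of replacing the AMI.
      ami_name = "worker-${var.commit_sha}"

      # force_deregister defaults to false, so dropping the
      # `force_deregister = true` line is all that is needed.
    }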
There were errors using the latest oc 4.17 version:
    error: failed to read input object (not a Template?): unable to decode
    "templates/openshift/composer.yml": no kind "Template" is registered for
    version "v1" in scheme "k8s.io/kubectl/pkg/scheme/scheme.go:28"
Define the `rhel-10` distro alias in the OpenShift template. Even though
the same alias is defined in the default configuration, I think it is
good to also include it in the template so that we don't forget about it
in the future.
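For illustration only: a distro alias maps a generic name onto a concrete
distribution. The sketch below conveys the intent, but the key names and
syntax of the real composer configuration may differ:

    # illustrative sketch, not the real config format
    distro_aliases:
      rhel-10: rhel-10.0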
Signed-off-by: Tomáš Hozza <thozza@redhat.com>
It's misleading since it counts the number of workers that have
registered with the current composer pods; it doesn't actually keep
track of the active workers.
Remove it and keep the worker-api stats as a proxy for active workers.
Ansible on fedora 40 seems broken: the default python 3.12 interpreter
doesn't work, and while 3.10 works, the dnf module then breaks.
Use 3.10 and stop using the dnf module.
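A minimal sketch of the idea, assuming the interpreter is pinned via an
inventory/group variable; the variable placement and the package name are
illustrative, not our actual playbook:

    # inventory / group_vars: pin the interpreter on the affected hosts
    ansible_python_interpreter: /usr/bin/python3.10

    # replace dnf-module tasks with a plain command invocation
    - name: Install worker packages
      ansible.builtin.command: dnf install -y osbuild-composer-worker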
Fedora 38 is EOL, and packit no longer builds rpms for it.
The current python3.12 + ansible 2.12 combination, which is the default
on fedora 40, doesn't work, so switch to python3.9.
With the rpmcopy or rpmrepo_osbuild tags, the `Install worker rpm` stage
got skipped on RHEL and CI. Invert the tag logic and use `--tags`
instead of `--skip-tags`.
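Roughly, the invocation changes from excluding tags to selecting them;
the playbook name and tag values below are illustrative:

    # before: excluding tags, which ended up skipping `Install worker rpm`
    ansible-playbook worker.yml --skip-tags "rpmrepo_osbuild"

    # after: explicitly selecting the tags that should run
    ansible-playbook worker.yml --tags "rpmcopy"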
We could deploy this job for both composer and each tenant's workers
that are present in app-intf. Then we can remove the maintenance bits from
the composer template.
This reverts commit 484c82ce55.
The AWS SDK fails to get the instance identity document when the proxy
is configured. The proxy will need to be configured explicitly for the
depsolve job and the osbuild (sources) job.
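A minimal sketch of what "configured explicitly" can mean: scope the
proxy to a dedicated HTTP client instead of setting it process-wide, so
the AWS SDK's instance metadata (IMDS) calls keep a direct connection.
This is illustrative Go using only the standard library, not the actual
worker wiring:

    package proxying

    import (
        "net/http"
        "net/url"
    )

    // newProxiedHTTPClient returns an HTTP client that routes its requests
    // through the given proxy. Handing a client like this only to the
    // depsolve and osbuild (sources) jobs keeps the proxy away from the
    // AWS SDK and its instance identity document lookups.
    func newProxiedHTTPClient(proxy string) (*http.Client, error) {
        proxyURL, err := url.Parse(proxy)
        if err != nil {
            return nil, err
        }
        return &http.Client{
            Transport: &http.Transport{Proxy: http.ProxyURL(proxyURL)},
        }, nil
    }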
We now have a proper rhel-10.0 distribution and this alias is clashing
with it, so we are seeing the following message in production:
    failed to configure distro aliases: invalid aliases: ["alias 'rhel-10.0' masks an existing distro"]
Let's fix it by removing the alias; it's obviously not needed anymore.