Add a safeguard to ensure secure instances without valid
parent instances are terminated, as they are unnecessary to retain.
Typically, the parent does not exist if the secure instance is
older than 2 hours, but this check provides additional validation.
HMS-3632
Now and then there are leftover secure instances, probably when worker
instances get terminated during builds, this is possible in ASGs. 2
hours as a cutoff should be enough, since the build times out after 60
minutes, and fetching the output archive after 30 minutes, so that
leaves 30 minutes for booting and connection.
Fedora 35 support was dropped, so we can update to a newer Go.
Stable RHEL 8 and 9 and Fedora 36 ships Go 1.18, so let's switch to it.
"//go:build" directives are now apparently enforced by go fmt, so that's why
there were added.
Also, all the github actions were adjusted to use Go 1.18.
Signed-off-by: Ondřej Budai <ondrej@budai.cz>
If a job is unresponsive the worker has most likely crashed or been shut
down and the in-progress job been lost.
Instead of failing these jobs, requeue them up to two times. Once a job is lost
a third time it fails. This avoids infinite loops.
This is implemented by extending FinishJob to RequeuOrFinish job. It takes a
max number of requeues as an argument, and if that is 0, it has the same
behavior as FinishJob used to have.
If the maximum number of requeues has not yet been reached, then the running
job is returned to pending state to be picked up again.
This introduces an expiry date (default: 14 days from insert date) and
adjust the service-maintenance script to delete jobs that are older than
the expiration date.
Define supported job type names as constants and use them in all places,
instead of string literals.
There are multiple benefits of this approach. Using constants removed
the room for typos in the string literals. One can use autocompletion in
IDE for job types. Using constant makes it easier to find all references
where it is used and thus all places that are handling a specific job
type.
Call vacuum analyze after each chunk of updates, and dump vacuum stats
at the beginning and end of the db cleanup.
Nulling results can increase size on disk, but calling vacuum analyze
will free up space within the table (not on disk) and reuse the space
for new inserts and updates.
Instead of deleting records, delete the results from the manifest and
depsolve jobs. This redacts sensitive data which the manifest can
contain, and this conserves space.
Stage and production share the GCP account. To avoid trying to delete
each GCP image twice, the maintenance script needs the ability to
selectively disable certain parts based on the config.
The internal GCP package used `pkg.go.dev/google.golang.org/api` [1] to
interact with Compute Engine API. Modify the package to use the new and
idiomatic `pkg.go.dev/cloud.google.com/go` [2] library for interacting
with the Compute Engine API. The new library have been already used to
interact with the Cloudbuild and Storage APIs. The new library was not
used for Compute Engine since the beginning, because at that time, it
didn't support Compute Engine.
Update go.mod and vendored packages.
[1] https://github.com/googleapis/google-api-go-client
[2] https://github.com/googleapis/google-cloud-go
Signed-off-by: Tomas Hozza <thozza@redhat.com>
Because of the way the gcp secrets are stored for the workers, and how
the mapping from vault to openshift works (unable to map a multiple key
secret into a single json file), there's a bit of juggling required to
get the gcp credentials in the right format.