Commit graph

3223 commits

Author SHA1 Message Date
Sanne Raymaekers
38b799f162 cloud/awscloud: exclude really old instance types
RHEL 10 (nightly) builds fail on stage with "Fatal glibc error: CPU does
not support x86-64-v3", this is most likely due to very old instance
types not supporting a specific instruction set.
2024-11-29 15:42:27 +01:00
Ondřej Budai
64ff0e3dad awscloud: add very verbose logging to createFleet creation
We still see this error sometimes:

Unable to start secure instance: Unable to create fleet: InsufficientInstanceCapacity: There is no Spot capacity available that matches your request

This is awkward because the message mentions that there is no spot
capacity, even though the current code should retry on
InsufficientInstanceCapacity. I also confirmed this by searching for
the retries log messages: there are none in the logs.

We need a bigger hammer. Let's log everything that happens in the
createFleet method in order to have better understanding why the
retry logic isn't triggered. We should probably move most of the newly
added logs to the debug level, but let's delay that until we have
more insight into what's happening.
2024-11-26 16:12:09 +01:00
Sanne Raymaekers
54ffc08814 awscloud/secure-instance: pass on fleet information on error
By surfacing the output even in case of an error, the fleet ID and
instance ID can be extracted if present. Thus the instance can be
terminated before its dependencies are deleted.
2024-11-26 12:52:12 +01:00
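The idea in commit 54ffc08814 could be sketched in Go roughly as follows: the output is inspected even when an error is returned, so a launched instance can still be terminated. The `fleetOutput` type, the `cleanupAfterCreate` helper and the example IDs are hypothetical and are not taken from the awscloud package.

```go
package main

import (
	"errors"
	"fmt"
)

// errCreate simulates a failed create-fleet call.
var errCreate = errors.New("create fleet failed")

// fleetOutput is a hypothetical, simplified result of a create-fleet call.
type fleetOutput struct {
	FleetID    string
	InstanceID string
}

// cleanupAfterCreate inspects the output even when err is non-nil and returns
// the IDs of instances that must be terminated before their dependencies
// (for example a security group) can be deleted.
func cleanupAfterCreate(out *fleetOutput, err error) []string {
	if err == nil {
		return nil // success: nothing to clean up
	}
	if out != nil && out.InstanceID != "" {
		return []string{out.InstanceID}
	}
	return nil
}

func main() {
	// A failure that nevertheless launched an instance.
	out := &fleetOutput{FleetID: "fleet-example", InstanceID: "i-example"}
	fmt.Println(cleanupAfterCreate(out, errCreate))
}
```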
Sanne Raymaekers
7a166cd356 awscloud/secure-instance: log error code comparisons
We're seeing some behaviour where create fleet is not retried and
subsequently the SI cleanup fails due to the security group already
being tied to an existing instance. There is no error that an instance
was launched anyway.
2024-11-26 12:52:12 +01:00
Florian Schüller
446e8448e3 awscloud/secure-instance: retry for 10 minutes
retry for 10 x 60sec. and don't log retries twice
2024-11-22 12:19:32 +01:00
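A bounded retry such as the "10 x 60sec" in commit 446e8448e3 might look like the sketch below. The `retry` helper and `errTransient` are illustrative names, not the project's code, and the demo uses a short delay in place of 60 seconds.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errTransient stands in for a temporary failure.
var errTransient = errors.New("transient failure")

// retry calls fn up to attempts times and sleeps for delay between failed
// attempts. It returns nil on the first success, otherwise the last error.
func retry(attempts int, delay time.Duration, fn func() error) error {
	var err error
	for i := 1; i <= attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
		if i < attempts {
			// Log each retry once, here, and not again in the caller.
			fmt.Printf("attempt %d/%d failed: %v\n", i, attempts, err)
			time.Sleep(delay)
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", attempts, err)
}

func main() {
	calls := 0
	// 10 attempts, 60 seconds apart, would be retry(10, 60*time.Second, ...).
	err := retry(10, 10*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return errTransient
		}
		return nil
	})
	fmt.Println(calls, err)
}
```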
Florian Schüller
4ec8894244 awscloud/secure-instance: retry on error in terminated waiter
terminated waiter sometimes responded
with "waiter state transitioned to Failure"
where we want to retry waiting for the termination
2024-11-22 12:19:32 +01:00
Sanne Raymaekers
8fd36225be cloudapi/v2: support HyperV generation in Azure upload options
2024-11-21 11:22:20 +01:00
Sanne Raymaekers
fb3e1b0701 internal/upload/azure: support different hyper v generations
When registering an image, users should be able to choose their hyper V
gen, as gen1 is quite outdated by now.
2024-11-21 11:22:20 +01:00
Sanne Raymaekers
d2f50a4224 internal/target: add Azure image HyperV generation
2024-11-21 11:22:20 +01:00
Florian Schüller
b5c71cd7e2 awscloud/secure-instance: enrich logging with secure instance id
we'll log it as a direct URL to the console for easier tracing
2024-11-19 17:26:23 +01:00
Florian Schüller
992f876da0 cloudapi/v2/server: rephrase error message
2024-11-19 13:55:38 +01:00
Florian Schüller
02778b5361 cloudapi/v2/server: assure order of fail-calls
by using a slice instead of a map, the
order of the SetFailed calls is maintained
2024-11-19 13:55:38 +01:00
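The reasoning in commit 02778b5361 rests on a Go language property: iteration order over a map is unspecified, while a slice is always iterated in index order. A small sketch, with a hypothetical `failInOrder` helper and made-up job names:

```go
package main

import "fmt"

// failInOrder calls setFailed once per ID, in slice order.
// Ranging over a map here would give no ordering guarantee.
func failInOrder(ids []string, setFailed func(id string)) {
	for _, id := range ids {
		setFailed(id)
	}
}

func main() {
	// Illustrative job names only.
	ids := []string{"depsolve", "manifest", "osbuild"}
	var called []string
	failInOrder(ids, func(id string) { called = append(called, id) })
	fmt.Println(called) // [depsolve manifest osbuild]
}
```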
Florian Schüller
ca3f0a190f internal/jobqueue/jobqueuetest/jobqueuetest: fix DB tests
I got confused as the jobqueue interface is asymmetric.
It expects an object and returns a json.RawMessage
and when handing over to postgres this is abstracted
away by postgres
2024-11-19 13:55:38 +01:00
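The asymmetry mentioned in commit ca3f0a190f (an object goes in, a json.RawMessage comes out) can be illustrated with a toy in-memory queue. `memQueue` and `testArgs` are hypothetical and much simpler than the real jobqueue interface.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// memQueue is a hypothetical in-memory stand-in for a job queue.
type memQueue struct{ jobs []json.RawMessage }

// Enqueue accepts any Go value and stores its JSON encoding.
func (q *memQueue) Enqueue(args interface{}) error {
	raw, err := json.Marshal(args)
	if err != nil {
		return err
	}
	q.jobs = append(q.jobs, raw)
	return nil
}

// Dequeue returns the raw JSON; decoding is left to the caller.
func (q *memQueue) Dequeue() (json.RawMessage, bool) {
	if len(q.jobs) == 0 {
		return nil, false
	}
	raw := q.jobs[0]
	q.jobs = q.jobs[1:]
	return raw, true
}

// testArgs is an illustrative job argument type.
type testArgs struct {
	Name string `json:"name"`
}

func main() {
	q := &memQueue{}
	if err := q.Enqueue(testArgs{Name: "depsolve"}); err != nil {
		panic(err)
	}
	raw, _ := q.Dequeue()
	fmt.Println(string(raw)) // {"name":"depsolve"}

	var out testArgs
	if err := json.Unmarshal(raw, &out); err != nil {
		panic(err)
	}
	fmt.Println(out.Name) // depsolve
}
```

A test that compares what it enqueued with what it dequeued therefore has to marshal the expected value, or unmarshal the result, before comparing.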
Florian Schüller
2f4d7d3140 internal/cloudapi/v2/server: remove osbuild job explicitly set "failed"
osbuild job is a dependency of the resolve and manifest jobs, so
leaving its state alone and letting it fail as a dependency is also fine
2024-11-19 13:55:38 +01:00
Florian Schüller
d3e3474fb7 internal/worker/server: return an error on depsolve timeout HMS-2989
Fixes the special case that if no worker is available and we
generate an internal timeout and cancel the depsolve including all
followup jobs, no error was propagated.
2024-11-19 13:55:38 +01:00
Sanne Raymaekers
2eb3c9f44c worker/server: add tests for job heartbeats
2024-11-07 17:18:48 +01:00
Sanne Raymaekers
14bd8d38ca worker/server: add basic tests for Pending / Running job metrics
2024-11-07 17:18:48 +01:00
Sanne Raymaekers
a971f9340b worker/server: update metrics on requeue
When requeuing a job, the next worker requesting the job would decrement
the pending counter, but the pending counter only ever got incremented once,
when the job was first enqueued. Thus make sure to increment the pending
counter when a job is requeued.
2024-11-07 17:18:48 +01:00
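The counter imbalance described in commit a971f9340b can be reproduced with a toy gauge. `pendingGauge` and its methods are hypothetical stand-ins for the real metrics code.

```go
package main

import "fmt"

// pendingGauge is a hypothetical stand-in for the pending-jobs metric.
type pendingGauge struct{ value int }

func (g *pendingGauge) onEnqueue() { g.value++ } // job enters the queue
func (g *pendingGauge) onDequeue() { g.value-- } // a worker picks the job up
func (g *pendingGauge) onRequeue() { g.value++ } // job returns to pending

func main() {
	g := &pendingGauge{}
	g.onEnqueue()
	g.onDequeue()
	g.onRequeue() // without this increment the next dequeue would reach -1
	g.onDequeue()
	fmt.Println(g.value) // 0
}
```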
Sanne Raymaekers
056b3c5ea6 jobqueue: return if a job was requeued or not
2024-11-07 17:18:48 +01:00
Lukas Zapletal
64f479092d osbuild-worker: use the new ostree resolver API
2024-11-07 16:17:56 +01:00
Florian Schüller
ece16307c6 jobqueuetest: avoid warning and provide a valid JSON
Not needed for the test but just generates a useless warning
2024-11-06 15:16:42 +01:00
Florian Schüller
00d3f07d08 Makefile: implement make db-tests
enables the option to run the DB tests locally
that are executed in the github actions
2024-11-06 15:16:42 +01:00
Sanne Raymaekers
aeba9d5a68 cloud/awscloud: don't specify max spot price
The current spot price could be limiting the available instance pool
significantly. ARM instances specifically are experiencing a lot of
capacity errors.
2024-10-30 15:41:09 +01:00
Sanne Raymaekers
4afcd8c3fd cloud/awscloud: fix another nilpointer in maintenance functions
2024-10-25 17:46:49 +02:00
Sanne Raymaekers
4f90a757dc cloud/awscloud: fix retrying to create secure instances
Set the correct target capacity specification type; just setting the
spot options to nil doesn't result in an on demand instance.
2024-10-24 20:25:48 +02:00
Sanne Raymaekers
6ccfc7f818 cloud/awscloud: fix nil pointer dereference in maintenance fns
The maintenance pod is crashing when describing the images by tag, most
likely something else is failing.
2024-10-24 12:05:42 +02:00
Sanne Raymaekers
d5912259a0 cloud/awscloud: rework create fleet retry logic
The current path sometimes launches two instances, which is problematic
because the rest of the secure instance code expects exactly one
instance. A security group could be attached to both instances, and
would block the worker from launching any more SIs, as it tries to
delete the old security group first, which is still held by one of the
surplus SIs which didn't get terminated.

Only retry if:
- the error code is "UnfulfillableCapacity" or "InsufficientInstanceCapacity";
- there wasn't an instance launched anyway.

If either of these checks fails, do not try to launch another one, and
just fail the job.
2024-10-24 10:29:26 +02:00
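The retry rule stated in commit d5912259a0 can be written as a single predicate. This is a sketch under the assumption that the error code is available as a string and the number of launched instances is known; `shouldRetry` is a hypothetical name.

```go
package main

import "fmt"

// shouldRetry reports whether another create-fleet attempt is allowed:
// only for the two capacity error codes, and only if no instance was
// launched despite the error.
func shouldRetry(errCode string, launchedInstances int) bool {
	switch errCode {
	case "UnfulfillableCapacity", "InsufficientInstanceCapacity":
		return launchedInstances == 0
	default:
		return false
	}
}

func main() {
	fmt.Println(shouldRetry("InsufficientInstanceCapacity", 0)) // true
	fmt.Println(shouldRetry("InsufficientInstanceCapacity", 1)) // false
	fmt.Println(shouldRetry("SomeOtherError", 0))               // false
}
```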
Sanne Raymaekers
1c7a276d6f cloud/aws: add maintenance functions for secure instance cleanup
2024-10-23 10:32:57 +02:00
Sanne Raymaekers
8fc91d1c6d cloud/aws: move maintenance calls to separate file
2024-10-23 10:32:57 +02:00
Achilleas Koutsou
66c2c31a1c blueprint: add kickstart contents to conversion test
The option was added in f5c6cdd9cf but a
value was never added to the conversion test.
2024-10-22 22:08:39 +02:00
Achilleas Koutsou
654a6ad8f5 blueprint: enable the anaconda modules customization
This has been available since v0.74.0 of osbuild/images but was never
connected to the frontend blueprint.

See https://github.com/osbuild/images/pull/799
2024-10-22 22:08:39 +02:00
Sanne Raymaekers
5eb8227bf3 cloud/awscloud: retry CreateFleet regardless of the error code
The errors returned by create fleet are not entirely clear. It seems it
also returns `InsufficientInstanceCapacity` in addition to
`UnfulfillableCapacity`. Let's just retry three times regardless of the
create fleet error, that way there's no need to chase error codes which
aren't clearly defined.
2024-10-15 16:04:19 +02:00
Sanne Raymaekers
905df418aa cloud/aws: add a third secure instance fallback across AZs
In case the on demand option failed as well, retry one more time across
availability zones. This significantly increases the pool of available
instances, but increases network related costs, as transferring data
between AZs is not free.
2024-10-07 15:56:07 +02:00
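The fallback chain implied by commit 905df418aa (a first attempt, then on demand, then across availability zones) might be structured like the sketch below. The `option` type, `launchWithFallbacks`, the option names and the instance ID are all illustrative assumptions.

```go
package main

import (
	"errors"
	"fmt"
)

// errNoCapacity simulates a capacity failure.
var errNoCapacity = errors.New("no capacity")

// option is one way of launching an instance; names are illustrative.
type option struct {
	name   string
	launch func() (string, error)
}

// launchWithFallbacks tries each option in order and returns the instance ID
// together with the name of the first option that succeeded.
func launchWithFallbacks(opts []option) (string, string, error) {
	if len(opts) == 0 {
		return "", "", errors.New("no launch options given")
	}
	var lastErr error
	for _, o := range opts {
		id, err := o.launch()
		if err == nil {
			return id, o.name, nil
		}
		lastErr = err
	}
	return "", "", fmt.Errorf("all options failed: %w", lastErr)
}

func main() {
	fail := func() (string, error) { return "", errNoCapacity }
	ok := func() (string, error) { return "i-example", nil }
	id, used, err := launchWithFallbacks([]option{
		{"spot", fail},
		{"on-demand", fail},
		{"across-azs", ok}, // widest pool, but cross-AZ traffic costs more
	})
	fmt.Println(id, used, err)
}
```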
Lukas Zapletal
65d5f48847 cloud: fixed typo UnfulfillableCapacity
2024-09-26 18:09:45 +02:00
Tomáš Hozza
efc251fa02 CloudAPI: test /sboms endpoint for Koji composes
Extend the API unit test for Koji composes, to verify that the newly
added /sboms endpoint works correctly.

Signed-off-by: Tomáš Hozza <thozza@redhat.com>
2024-09-20 17:02:09 +02:00
Tomáš Hozza
cf79bf677b CloudAPI: test /sboms endpoint for regular composes
Extend the unit test for regular (non-Koji) composes, to verify that
the newly added /sboms endpoint works correctly.

Signed-off-by: Tomáš Hozza <thozza@redhat.com>
2024-09-20 17:02:09 +02:00
Tomáš Hozza
6e8f0418a6 CloudAPI: add new /composes/{id}/sboms endpoint
Add a new /sboms API endpoint, for getting SBOM documents for a given
compose ID. The endpoint returns an array of SBOM documents for each
image built as part of the compose. For each image, there is an SBOM
document for each osbuild pipeline that installs RPM packages. This is
usually one 'buildroot' and one 'image' pipeline.

Signed-off-by: Tomáš Hozza <thozza@redhat.com>
2024-09-20 17:02:09 +02:00
Tomáš Hozza
102d06774c CloudAPI: extend manifestJobResultsFromJobDeps() to also return JobInfo
Extend the `manifestJobResultsFromJobDeps()` function to also return the
manifest `JobInfo`. This will be useful to inspect the job dependencies
and eliminate the need to add a specialized function for getting only
the `JobInfo`.

Signed-off-by: Tomáš Hozza <thozza@redhat.com>
2024-09-20 17:02:09 +02:00
Tomáš Hozza
1c7462b275 Worker/koji-finalize: import uploaded SBOM documents
If the Koji target result contains information about any uploaded SBOM
documents, import them to Koji as part of the finalize task.

Signed-off-by: Tomáš Hozza <thozza@redhat.com>
2024-09-20 17:02:09 +02:00
Tomáš Hozza
c109265abb Target/koji: extend the result struct with SBOM docs
Extend the Koji target result struct with an optional slice for uploaded
SBOM documents.

Signed-off-by: Tomáš Hozza <thozza@redhat.com>
2024-09-20 17:02:09 +02:00
Tomáš Hozza
4ae35a0ed9 Worker/osbuild: depend on depsolve job for Koji composes
For Koji composes, all files are uploaded to Koji as part of the osbuild
job (specifically as part of handling the Koji target). So in order to
be able to upload SBOM documents to Koji as part of Koji compose, the
osbuild job needs to be able to access the depsolve job result, which
contains the SBOM documents. For this, the osbuild job must depend on
the depsolve job.

For Koji composes, make sure that osbuild job depends on the depsolve
job and set the DepsolveDynArgsIdx.

Signed-off-by: Tomáš Hozza <thozza@redhat.com>
2024-09-20 17:02:09 +02:00
Tomáš Hozza
f8d231d024 CloudAPI: request SBOM documents in depsolve jobs
Signed-off-by: Tomáš Hozza <thozza@redhat.com>
2024-09-20 17:02:09 +02:00
Tomáš Hozza
4779e90e17 Worker/depsolve: add support for SBOM
Add support to the `DepsolveJob` for requesting SBOM documents and
returning the results from the job.

Signed-off-by: Tomáš Hozza <thozza@redhat.com>
2024-09-20 17:02:09 +02:00
Tomáš Hozza
0628ac9131 Worker/json: remove redundant comment
Signed-off-by: Tomáš Hozza <thozza@redhat.com>
2024-09-20 17:02:09 +02:00
Tomáš Hozza
7bdd036395 Update osbuild/images to v0.88.0
Adjust all places that call `Solver.Depsolve()`, to cope with the changes
that enabled SBOM support.

Fix loading of testing repositories in the CloudAPI unit tests.

Signed-off-by: Tomáš Hozza <thozza@redhat.com>
2024-09-20 17:02:09 +02:00
Achilleas Koutsou
4248564a55 cloudapi: update expected image type test for gcp-rhui
gce-rhui is now gone from RHEL 9 [1] and the old name simply aliases to
gce.  gcp-rhui in the cloudapi now resolves to 'gce' in RHEL 9 and
'gce-rhui' in RHEL 8.

[1] https://github.com/osbuild/images/pull/857
2024-09-17 23:33:44 +02:00
Achilleas Koutsou
ec01c6908b blueprint: sshkey to users in images blueprint conversion
The sshkey customization in osbuild/images has been dropped.  In
osbuild-composer we maintain it for backwards compatibility, converting
each to a user customization, which is a superset of the sshkey.
2024-09-17 23:33:44 +02:00
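The conversion described in commit ec01c6908b, where each sshkey entry becomes a user entry, could be sketched as follows. The two struct types are simplified assumptions for illustration and may not match the real blueprint definitions.

```go
package main

import "fmt"

// SSHKeyCustomization is a simplified, assumed shape of the legacy entry.
type SSHKeyCustomization struct {
	User string
	Key  string
}

// UserCustomization is a simplified, assumed shape of the user entry;
// a real one would carry more optional fields.
type UserCustomization struct {
	Name string
	Key  *string
}

// sshKeysToUsers converts every sshkey entry into a user entry that
// carries only the name and the key.
func sshKeysToUsers(keys []SSHKeyCustomization) []UserCustomization {
	users := make([]UserCustomization, 0, len(keys))
	for _, k := range keys {
		key := k.Key // copy so each user points at its own string
		users = append(users, UserCustomization{Name: k.User, Key: &key})
	}
	return users
}

func main() {
	users := sshKeysToUsers([]SSHKeyCustomization{
		{User: "admin", Key: "ssh-ed25519 AAAA-example"},
	})
	fmt.Println(users[0].Name, *users[0].Key)
}
```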
Michael Vogt
3df26ed79c osbuild-worker: fix "crashing" on worker registration issues
When the osbuild worker cannot register itself with the server
on startup the worker will "crash". This is inconsistent with the
existing behavior in `workerHeartbeat()` which deals with connectivity
or other server issues gracefully and retries periodically.

To unify the behavior this commit changes the behavior and only
issues a `logrus.Warnf` instead of the previous `Fatalf` when
the registration fails.

Co-authored-by: Florian Schüller <florian.schueller@redhat.com>
2024-09-10 16:19:47 +02:00
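The behavior change in commit 3df26ed79c, warning and trying again later rather than exiting, might look like this sketch. It uses the standard `log` package instead of logrus, and `tryRegister`, `errUnreachable` and the worker ID are hypothetical.

```go
package main

import (
	"errors"
	"log"
)

// errUnreachable simulates a server that cannot be reached.
var errUnreachable = errors.New("server unreachable")

// tryRegister attempts registration once. On failure it logs a warning and
// reports false instead of exiting, so a periodic caller can try again.
func tryRegister(register func() (string, error)) (string, bool) {
	id, err := register()
	if err != nil {
		log.Printf("warning: worker registration failed: %v", err)
		return "", false
	}
	return id, true
}

func main() {
	calls := 0
	register := func() (string, error) {
		calls++
		if calls == 1 {
			return "", errUnreachable
		}
		return "worker-example", nil
	}
	// Stand-in for a periodic loop: try again on the next tick.
	for {
		if id, ok := tryRegister(register); ok {
			log.Printf("registered as %s after %d attempts", id, calls)
			break
		}
	}
}
```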
Sanne Raymaekers
d6031ae87a upload/azure: turn off public access on storage accounts
Users might have compliance policies on their azure accounts which
forbid public access on storage accounts.
2024-09-09 12:52:14 +02:00
Florian Schüller
bb53f4833f internal/worker/client.go: refactor reading worker ID
Adds a helper function to the worker client instead of
redeclaring the same inline function.
2024-09-06 12:43:05 +02:00