Add a safeguard that terminates secure instances without a valid parent
instance, as there is no reason to retain them.
Typically, the parent does not exist if the secure instance is
older than 2 hours, but this check provides additional validation.
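A minimal sketch of one plausible combination of the two signals, with
hypothetical names (not the actual worker code):
```go
// A secure instance whose parent is gone is unnecessary to retain; the 2h
// age comparison mirrors the heuristic mentioned above.
func shouldTerminate(parentExists bool, launchTime time.Time) bool {
	return !parentExists && time.Since(launchTime) > 2*time.Hour
}
```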
HMS-3632
Previously, the `OSTree` property in the Weldr API `ComposeRequest`
struct was not a pointer to the `ostree.ImageOptions` type. As a result,
it was initialized to an empty struct, even if not set in the client API
call.
Consequently, the `OSTree` property in `distro.ImageOptions` was never
`nil` when initializing the osbuild manifest. However, after
a change in `osbuild/images` [0], providing OSTree options for
non-OSTree image types is no longer considered valid. This caused a
failure to submit a new compose for any non-OSTree image type.
Change the `OSTree` property in Weldr `ComposeRequest` to be a pointer
and mark it as optional.
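A simplified sketch of the change (the real struct has more fields; the
ones shown here are illustrative):
```go
type ComposeRequest struct {
	BlueprintName string `json:"blueprint_name"`
	ComposeType   string `json:"compose_type"`
	// Before: `OSTree ostree.ImageOptions` -- the zero value was passed
	// along even when the client never set it.
	// After: a pointer that stays nil when omitted from the request.
	OSTree *ostree.ImageOptions `json:"ostree,omitempty"`
}
```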
[0] https://github.com/osbuild/images/pull/1071
Signed-off-by: Tomáš Hozza <thozza@redhat.com>
Even in the case of errors, as long as create fleet returns an instance,
attempt to use it.
In some cases AWS returns `InsufficientInstanceCapacity` but still
creates an instance:
```
msg="Won't retry CreateFleet with OnDemand instance, retry: false, errors: InsufficientInstanceCapacity: There is no Spot capacity available that matches your request.; Already launched instance ([i-...]), aborting create fleet"
msg="doCreateFleetRetry: returning retry: false, msg: [InsufficientInstanceCapacity: There is no Spot capacity available that matches your request. Already launched instance ([i-...]), aborting create fleet]"
msg="doCreateFleetRetry: cancelling retry, instance already exists: [i-...]"
msg="doCreateFleetRetry: setting retry to true"
msg="Checking to retry fleet create on error InsufficientInstanceCapacity (msg: There is no Spot capacity available that matches your request.)"
```
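A hedged sketch of the idea, using aws-sdk-go-v2's `CreateFleetOutput`
shape (the surrounding control flow is illustrative, not the actual
worker code):
```go
output, err := ec2Client.CreateFleet(ctx, input)
if err != nil && (output == nil || len(output.Instances) == 0) {
	// a hard failure with nothing launched; give up
	return nil, fmt.Errorf("Unable to create fleet: %w", err)
}
if len(output.Instances) > 0 {
	// an instance was launched despite any reported errors
	// (output.Errors may still contain InsufficientInstanceCapacity);
	// attempt to use it instead of aborting
}
```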
RHEL 10 (nightly) builds fail on stage with "Fatal glibc error: CPU does
not support x86-64-v3". This is most likely due to very old instance
types not supporting the x86-64-v3 instruction set.
We still see this error sometimes:
```
Unable to start secure instance: Unable to create fleet: InsufficientInstanceCapacity: There is no Spot capacity available that matches your request
```
This is awkward because the message mentions that there is no spot
capacity, even though the current code should retry on
InsufficientInstanceCapacity. I also confirmed this by searching for
the retry log messages: there are none in the logs.
We need a bigger hammer. Let's log everything that happens in the
createFleet method in order to get a better understanding of why the
retry logic isn't triggered. We should probably move most of the newly
added logs to the debug level, but let's delay that until we have
more insight into what's happening.
By surfacing the output even in case of an error, the fleet ID and
instance ID can be extracted if present. Thus the instance can be
terminated before its dependencies are deleted.
We're seeing some behaviour where create fleet is not retried and
subsequently the SI cleanup fails due to the security group already
being tied to an existing instance. There is no error indicating that an
instance was launched anyway.
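A sketch of the extraction, with assumed names: even on error, any
launched instance IDs can be pulled out of the create fleet output so
the instance can be terminated before the security group and other
dependencies are deleted.
```go
// launchedInstanceIDs is illustrative; output.Instances follows the
// aws-sdk-go-v2 CreateFleetOutput shape.
func launchedInstanceIDs(output *ec2.CreateFleetOutput) []string {
	var ids []string
	if output == nil {
		return ids
	}
	for _, instance := range output.Instances {
		ids = append(ids, instance.InstanceIds...)
	}
	return ids
}
```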
I got confused because the jobqueue interface is asymmetric: it expects
an object when enqueuing but returns a `json.RawMessage` when dequeuing.
When handing over to postgres, this asymmetry is abstracted away by the
postgres implementation.
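A simplified sketch of the asymmetry (the real interface has more
parameters and return values):
```go
type JobQueue interface {
	// Enqueue accepts an arbitrary object and marshals it internally ...
	Enqueue(jobType string, args interface{}, channel string) (uuid.UUID, error)
	// ... while Dequeue hands raw JSON back for the caller to unmarshal.
	Dequeue(ctx context.Context, jobTypes, channels []string) (uuid.UUID, json.RawMessage, error)
}
```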
Fixes the special case where, if no worker is available and we generate
an internal timeout and cancel the depsolve including all follow-up
jobs, no error was propagated.
When requeuing a job, the next worker requesting the job would decrement
the pending counter, but the pending counter only ever got incremented
once, when the job was first enqueued. Thus make sure to increment the
pending counter when a job is requeued.
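A minimal in-memory sketch of the fix, with hypothetical names (the
actual queue implementation differs):
```go
// Every successful dequeue decrements `pending`, so a requeue must increment
// it again; otherwise each retry decrements the counter one extra time.
func (q *queue) RequeueJob(id uuid.UUID) error {
	q.mu.Lock()
	defer q.mu.Unlock()
	job, ok := q.jobs[id]
	if !ok {
		return ErrNotExist
	}
	job.started = time.Time{} // make the job eligible for dequeuing again
	q.pending[job.channel]++  // the fix: restore the pending count
	return nil
}
```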
The current spot price could be limiting the available instance pool
significantly. ARM instances specifically are experiencing a lot of
capacity errors.
The current path sometimes launches two instances, which is problematic
because the rest of the secure instance code expects exactly one
instance. A security group could be attached to both instances, and
would block the worker from launching any more SIs, as it tries to
delete the old security group first, which is still held by one of the
surplus SIs that didn't get terminated.
Only retry if:
- the error code is "UnfulfillableCapacity" or "InsufficientInstanceCapacity";
- no instance was launched anyway.
If either of these checks fails, do not try to launch another instance;
just fail the job (see the sketch below).
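A hedged sketch of the two checks; the `Errors`/`Instances` fields
follow aws-sdk-go-v2's `CreateFleetOutput`, the rest is illustrative:
```go
func shouldRetryCreateFleet(output *ec2.CreateFleetOutput) bool {
	// never launch another instance if one already exists
	if output == nil || len(output.Instances) > 0 {
		return false
	}
	for _, fleetErr := range output.Errors {
		code := aws.ToString(fleetErr.ErrorCode)
		if code == "UnfulfillableCapacity" || code == "InsufficientInstanceCapacity" {
			return true
		}
	}
	return false
}
```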
The errors returned by create fleet are not entirely clear. It seems it
also returns `InsufficientInstanceCapacity` in addition to
`UnfulfillableCapacity`. Let's just retry three times regardless of the
create fleet error; that way there's no need to chase error codes that
aren't clearly defined.
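A minimal sketch of the simplified strategy; the attempt count comes
from the text above, everything else is illustrative:
```go
var output *ec2.CreateFleetOutput
var err error
for attempt := 0; attempt < 3; attempt++ {
	output, err = ec2Client.CreateFleet(ctx, input)
	// stop as soon as an instance exists; otherwise retry regardless of
	// which error code create fleet reported
	if err == nil && output != nil && len(output.Instances) > 0 {
		break
	}
}
```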
In case the on-demand option fails as well, retry one more time across
availability zones. This significantly increases the pool of available
instances, but increases network related costs, as transferring data
between AZs is not free.
Extend the API unit test for Koji composes, to verify that the newly
added /sboms endpoint works correctly.
Signed-off-by: Tomáš Hozza <thozza@redhat.com>
Extend the unit test for regular (non-Koji) composes, to verify that
the newly added /sboms endpoint works correctly.
Signed-off-by: Tomáš Hozza <thozza@redhat.com>
Add a new /sboms API endpoint, for getting SBOM documents for a given
compose ID. The endpoint returns an array of SBOM documents for each
image built as part of the compose. For each image, there is an SBOM
document for each osbuild pipeline that installs RPM packages. This is
usually one 'buildroot' and one 'image' pipeline.
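A hypothetical sketch of the response shape implied by the text; the
actual field names in the API may differ:
```go
type ComposeSBOMs struct {
	Items []ImageSBOM `json:"items"`
}

type ImageSBOM struct {
	ImageName    string          `json:"image_name"`
	PipelineName string          `json:"pipeline_name"` // usually 'buildroot' or 'image'
	Sbom         json.RawMessage `json:"sbom"`
}
```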
Signed-off-by: Tomáš Hozza <thozza@redhat.com>
Extend the `manifestJobResultsFromJobDeps()` function to also return the
manifest `JobInfo`. This will be useful to inspect the job dependencies
and eliminate the need to add a specialized function for getting only
the `JobInfo`.
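A signature-level sketch of the change (exact types are assumptions
based on the text):
```go
// Before: only the manifest job result was returned.
//   manifestJobResultsFromJobDeps(w, deps) (*worker.ManifestJobByIDResult, error)
// After: the manifest JobInfo is returned as well, so callers can inspect
// the job dependencies directly.
//   manifestJobResultsFromJobDeps(w, deps) (*worker.JobInfo, *worker.ManifestJobByIDResult, error)
```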
Signed-off-by: Tomáš Hozza <thozza@redhat.com>
If the Koji target result contains information about any uploaded SBOM
documents, import them to Koji as part of the finalize task.
Signed-off-by: Tomáš Hozza <thozza@redhat.com>
For Koji composes, all files are uploaded to Koji as part of the osbuild
job (specifically as part of handling the Koji target). So in order to
be able to upload SBOM documents to Koji as part of a Koji compose, the
osbuild job needs to be able to access the depsolve job result, which
contains the SBOM documents. For this, the osbuild job must depend on
the depsolve job.
For Koji composes, make sure that the osbuild job depends on the
depsolve job and set the `DepsolveDynArgsIdx`.
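An illustrative sketch only; the variable and field names are
assumptions based on the text:
```go
// Enqueue the osbuild job with the depsolve job as an extra dependency and
// record which dependency carries the depsolve result.
dependencies := []uuid.UUID{manifestJobID, depsolveJobID}
depsolveIdx := 1 // position of depsolveJobID in dependencies
osbuildJob.DepsolveDynArgsIdx = &depsolveIdx
```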
Signed-off-by: Tomáš Hozza <thozza@redhat.com>