Commit graph

104 commits

Author SHA1 Message Date
Tomáš Hozza
a75f2b95fd pkg/awscloud/test: delete unused code
Signed-off-by: Tomáš Hozza <thozza@redhat.com>
2025-08-12 13:15:43 +02:00
Tomáš Hozza
caa7816121 internal/awscloud: remove S3 client from AWS struct
This functionality is not needed any more, since it is provided by the
osbuild/images version of AWS.

Signed-off-by: Tomáš Hozza <thozza@redhat.com>
2025-08-12 13:15:43 +02:00
Tomáš Hozza
4f0ae84add internal/awscloud: remove S3 manager from AWS struct
This functionality is not needed any more, since it is provided by the
osbuild/images version of AWS.

Signed-off-by: Tomáš Hozza <thozza@redhat.com>
2025-08-12 13:15:43 +02:00
Tomáš Hozza
0b5bfa044f internal/awscloud: remove S3 presign client from AWS struct
This functionality is not needed any more, since it is provided by the
osbuild/images version of AWS.

Signed-off-by: Tomáš Hozza <thozza@redhat.com>
2025-08-12 13:15:43 +02:00
Tomáš Hozza
7645f60f27 internal/awscloud: use AWS.Regions() from osbuild/images
Signed-off-by: Tomáš Hozza <thozza@redhat.com>
2025-08-12 13:15:43 +02:00
Tomáš Hozza
5d73c22bb5 internal/awscloud: use AWS.ShareImage() from osbuild/images
Signed-off-by: Tomáš Hozza <thozza@redhat.com>
2025-08-12 13:15:43 +02:00
Tomáš Hozza
a3937e99ce internal/awscloud: use AWS.Register() from osbuild/images
Signed-off-by: Tomáš Hozza <thozza@redhat.com>
2025-08-12 13:15:43 +02:00
Tomáš Hozza
27bae770e9 internal/awscloud: use AWS.S3ObjectPresignedURL() from osbuild/images
Signed-off-by: Tomáš Hozza <thozza@redhat.com>
2025-08-12 13:15:43 +02:00
Tomáš Hozza
a73820b4a8 internal/awscloud: use AWS.MarkS3ObjectAsPublic() from osbuild/images
Signed-off-by: Tomáš Hozza <thozza@redhat.com>
2025-08-12 13:15:43 +02:00
Tomáš Hozza
13ba870fc6 internal/awscloud: use AWS.Upload() from osbuild/images
Signed-off-by: Tomáš Hozza <thozza@redhat.com>
2025-08-12 13:15:43 +02:00
Tomáš Hozza
4595339774 internal/awscloud: delete unused method
Signed-off-by: Tomáš Hozza <thozza@redhat.com>
2025-08-12 13:15:43 +02:00
Tomáš Hozza
d594005f25 internal/awscloud: start embedding awscloud.AWS from osbuild/images
Start embedding the awscloud.AWS from osbuild/images in
osbuild-composer's version of awscloud.AWS. The idea is to remove all
methods from osbuild-composer implementation, which are used for
uploading and registering images in AWS. The rest that it related to
service maintenance or to running secure instances, will be kept in
osbuild-composer, since these are specific to the project.

Signed-off-by: Tomáš Hozza <thozza@redhat.com>
2025-08-12 13:15:43 +02:00
Beñat Gartzia Arruabarrena
d5a77ffcb5 internal/cloud/gcp/compute: Add TDX_CAPABLE guest OS feature
Latest RHEL images (from 9.6 on) should be able to run as TDX guests.
CentOS guests also fully support it at the moment.

See: https://issues.redhat.com/browse/COS-3111
See: https://github.com/coreos/coreos-assembler/pull/4006
See: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/5979
2025-02-27 13:33:22 +01:00
Florian Schüller
65b7ee65b2 osbuild-service-maintenance: implement removal of launch templates
Launch templates of instances that are terminated should be removed.
HMS-3632
2024-12-10 11:43:51 +01:00
Florian Schüller
a96ea533c0 osbuild-service-maintenance: implement removal of security groups
Security groups of instances that are terminated should be removed.
HMS-3632
2024-12-10 11:43:51 +01:00
Florian Schüller
7ebe266d3c osbuild-service-maintenance: implement removal on invalid parent
Add a safeguard to ensure secure instances without valid
parent instances are terminated, as they are unnecessary to retain.
Typically, the parent does not exist if the secure instance is
older than 2 hours, but this check provides additional validation.
HMS-3632
2024-12-10 11:43:51 +01:00
Sanne Raymaekers
f6feb7675b cloud/awscloud: use any instance create fleet returns
Even in case of errors, as long as create fleet returns an instance,
attempt to use it.

In some cases AWS returns `InsufficientInstanceCapacity` but still
creates an instance:
```
msg="Won't retry CreateFleet with OnDemand instance, retry: false, errors: InsufficientInstanceCapacity: There is no Spot capacity available that matches your request.; Already launched instance ([i-...]), aborting create fleet"
msg="doCreateFleetRetry: returning retry: false, msg: [InsufficientInstanceCapacity: There is no Spot capacity available that matches your request. Already launched instance ([i-...]), aborting create fleet]"
msg="doCreateFleetRetry: cancelling retry, instance already exists: [i-...]"
msg="doCreateFleetRetry: setting retry to true"
msg="Checking to retry fleet create on error InsufficientInstanceCapacity (msg: There is no Spot capacity available that matches your request.)"
```
2024-12-03 14:00:12 +01:00
Sanne Raymaekers
779053d910 cloud/awscloud: give secure instances a name
That way you can just enter the parent instance id into the search bar
and get both the worker and its executor.
2024-12-03 11:56:52 +01:00
Sanne Raymaekers
38b799f162 cloud/awscloud: exclude really old instance types
RHEL 10 (nightly) builds fail on stage with "Fatal glibc error: CPU does
not support x86-64-v3", this is most likely due to very old instance
types not supporting a specific instruction set.
2024-11-29 15:42:27 +01:00
Ondřej Budai
64ff0e3dad awscloud: add very verbose logging to createFleet creation
We still see this error sometimes:

Unable to start secure instance: Unable to create fleet: InsufficientInstanceCapacity: There is no Spot capacity available that matches your request

This is awkward because the message mentions that there is no spot
capacity, even though the current code should retry on
InsufficientInstanceCapacity. I also confirmed this by searching for
the retries log messages: there are none in the logs.

We need a bigger hammer. Let's log everything that happens in the
createFleet method in order to have better understanding why the
retry logic isn't triggered. We should probably move most of the newly
added logs to the debug level, but let's delay that until we have
more insight into what's happening.
2024-11-26 16:12:09 +01:00
Sanne Raymaekers
54ffc08814 awscloud/secure-instance: pass on fleet information on error
By surfacing the output even in case of an error, the fleet ID and
instance ID can be extracted if present. Thus the instance can be
terminated before its dependencies are deleted.
2024-11-26 12:52:12 +01:00
Sanne Raymaekers
7a166cd356 awscloud/secure-instance: log error code comparisons
We're seeing some behaviour where create fleet is not retried and
subsequently the SI cleanup fails due to the security group already
being tied to an existing instance. There is no error that an instance
was launched anyway.
2024-11-26 12:52:12 +01:00
Florian Schüller
446e8448e3 awscloud/secure-instance: retry for 10 minutes
retry for 10 x 60sec. and don't log retries twice
2024-11-22 12:19:32 +01:00
Florian Schüller
4ec8894244 awscloud/secure-instance: retry on error in terminated waiter
terminated waiter sometimes responded
with "waiter state transitioned to Failure"
where we want to retry waiting for the termination
2024-11-22 12:19:32 +01:00
Florian Schüller
b5c71cd7e2 awscloud/secure-instance: enrich logging with secure instance id
we'll log as direct URL to the console for easier tracing
2024-11-19 17:26:23 +01:00
Sanne Raymaekers
aeba9d5a68 cloud/awscloud: don't specify max spot price
The current spot price could be limiting the available instance pool
significantly. ARM instances specifically are experiencing a lot of
capacity errors.
2024-10-30 15:41:09 +01:00
Sanne Raymaekers
4afcd8c3fd cloud/awscloud: fix another nilpointer in maintenance functions 2024-10-25 17:46:49 +02:00
Sanne Raymaekers
4f90a757dc cloud/awscloud: fix retrying to create secure instances
Set the correct target capacity specification type, just setting the
spot options to nil doesn't result in an on demand instance.
2024-10-24 20:25:48 +02:00
Sanne Raymaekers
6ccfc7f818 cloud/awscloud: fix nil pointer dereference in maintenance fns
The maintenance pod is crashing when describing the images by tag, most
likely something else is failing.
2024-10-24 12:05:42 +02:00
Sanne Raymaekers
d5912259a0 cloud/awscloud: rework create fleet retry logic
The current path sometimes launches two instances, which is problematic
because the rest of the secure instance code expects exactly one
instance. A security group could be attached to both instances, and
would block the worker from launching any more SIs, as it tries to
delete the old security group first, which is still held by one of the
surplus SIs which didn't get terminated.

Only retry if:
- on "UnfulfillableCapacity" or "InsufficientInstanceCapacity" error codes;
- there wasn't an instance launched anyway.

If either of these checks fail, do not try to launch another one, and
just fail the job.
2024-10-24 10:29:26 +02:00
Sanne Raymaekers
1c7a276d6f cloud/aws: add maintenance functions for secure instance cleanup 2024-10-23 10:32:57 +02:00
Sanne Raymaekers
8fc91d1c6d cloud/aws: move maintenance calls to separate file 2024-10-23 10:32:57 +02:00
Sanne Raymaekers
5eb8227bf3 cloud/awscloud: retry CreateFleet regardless of the error code
The errors returned by create fleet are not entirely clear. It seems it
also returns `InsufficientInstanceCapacity` in addition to
`UnfulfillableCapacity`. Let's just retry three times regardless of the
create fleet error, that way there's no need to chase error codes which
aren't clearly defined.
2024-10-15 16:04:19 +02:00
Sanne Raymaekers
905df418aa cloud/aws: add a third secure instance fallback across AZs
In case the on demand option failed as well, retry one more time across
availability zones. This significantly increases the pool of available
instances, but increases network related costs, as transferring data
between AZs is not free.
2024-10-07 15:56:07 +02:00
Lukas Zapletal
65d5f48847 cloud: fixed typo UnfulfillableCapacity 2024-09-26 18:09:45 +02:00
Tomáš Hozza
e53d03fe16 compute/gcp: add guest OS features for el10/c10s
Signed-off-by: Tomáš Hozza <thozza@redhat.com>
2024-08-23 13:10:53 +02:00
Sanne Raymaekers
2624516f1a osbuild-worker: use aws sdk v2 for asg scale-in protection 2024-08-20 15:32:40 +02:00
Sanne Raymaekers
c90b92f666 cloud/awscloud: test failures when running a secure instance 2024-08-20 15:32:40 +02:00
Sanne Raymaekers
acc415a676 cloud/awscloud: test terminating a secure instance 2024-08-20 15:32:40 +02:00
Sanne Raymaekers
e30fba38fc cloud/awscloud: handle nil describe output when creating LTs/SGs 2024-08-20 15:32:40 +02:00
Sanne Raymaekers
fa3b203178 internal/boot: adapt to aws sdk v2 2024-08-20 15:32:40 +02:00
Sanne Raymaekers
16c9a7be88 cloud/awscloud: add tests for ec2 operations 2024-08-20 15:32:40 +02:00
Sanne Raymaekers
810e9133e8 cloud/awscloud: switch ec2 to v2 sdk 2024-08-20 15:32:40 +02:00
Sanne Raymaekers
8d158f6031 cloud/awscloud: switch s3 to v2 sdk 2024-08-20 15:32:40 +02:00
Sanne Raymaekers
dc389eaa71 cloud/awscloud: fix nil pointer dereference
When the cleanup function gets called, there's a chance the Instnace
field isn't populated yet, so store the instance ID separately and wait
for it to be terminated in case it's present.

The error would produce the following trace:
```
goroutine 1 [running]:
...
main.(*OSBuildJobImpl).Run.func1()
    osbuild/osbuild-composer/cmd/osbuild-worker/jobimpl-osbuild.go:404 +0xc5
panic({0x55e2a76a1e40?, 0x55e2a906d2f0?})
    /usr/lib/golang/src/runtime/panic.go:920 +0x270
github.com/osbuild/osbuild-composer/internal/cloud/awscloud.(*AWS).deleteFleetIfExists(0xc000faa840, 0xc0012718c0)
    osbuild/osbuild-composer/internal/cloud/awscloud/secure-instance.go:441 +0x175
github.com/osbuild/osbuild-composer/internal/cloud/awscloud.(*AWS).TerminateSecureInstance(0x55e2a90825e0?, 0x2?)
    osbuild/osbuild-composer/internal/cloud/awscloud/secure-instance.go:192 +0x1d
github.com/osbuild/osbuild-composer/internal/cloud/awscloud.(*AWS).RunSecureInstance.func1()
    osbuild/osbuild-composer/internal/cloud/awscloud/secure-instance.go:75 +0x69
github.com/osbuild/osbuild-composer/internal/cloud/awscloud.(*AWS).RunSecureInstance(0xc000faa840, {0xc000afeade, 0x10}, {0x0, 0x0}, {0x0, 0x0}, {0xc001120f30, 0x24})
    osbuild/osbuild-composer/internal/cloud/awscloud/secure-instance.go:169 +0x12a7
...
```
2024-08-01 10:58:08 +02:00
Sanne Raymaekers
547f74e7db awscloud: try to terminate previous secure instance
In case the previous executor SI belonging to the worker did not get
shut down properly, attempt to do it again when starting a new one,
otherwise replacing the SG or LT will not work.
2024-06-28 15:33:08 +02:00
Sanne Raymaekers
791ec07bc2 internal/awscloud: fix cloud-init userdata for secure instance
The conditional only checked if the cloudwatch group was set, and if it
wasn't, the hostname variable wouldn't be set either. So the executor
would try to look for a hostname but not find any.
2024-06-26 10:56:57 +02:00
Sanne Raymaekers
2a621521a8 osbuildexecutor/aws.ec2: set hostname of executor via cloud-init
This way much more of the journal will be captured under the new
hostname.
2024-06-25 10:58:10 +02:00
Sanne Raymaekers
ae4467ab0d internal/awscloud: retry CreateFleet
When receiving the "UnfillableCapacity" error from CreateFleet, retry
the request with an OnDemand instance.
2024-06-24 12:50:37 +02:00
Sanne Raymaekers
2e31ea50aa cloud/awscloud: use instance requirements when creating secure instance 2024-06-14 10:59:58 +02:00