Commit graph

79 commits

Author SHA1 Message Date
Sanne Raymaekers
aeba9d5a68 cloud/awscloud: don't specify max spot price
The current spot price could be limiting the available instance pool
significantly. ARM instances specifically are experiencing a lot of
capacity errors.
2024-10-30 15:41:09 +01:00
Sanne Raymaekers
4afcd8c3fd cloud/awscloud: fix another nilpointer in maintenance functions 2024-10-25 17:46:49 +02:00
Sanne Raymaekers
4f90a757dc cloud/awscloud: fix retrying to create secure instances
Set the correct target capacity specification type, just setting the
spot options to nil doesn't result in an on demand instance.
2024-10-24 20:25:48 +02:00
Sanne Raymaekers
6ccfc7f818 cloud/awscloud: fix nil pointer dereference in maintenance fns
The maintenance pod is crashing when describing the images by tag, most
likely something else is failing.
2024-10-24 12:05:42 +02:00
Sanne Raymaekers
d5912259a0 cloud/awscloud: rework create fleet retry logic
The current path sometimes launches two instances, which is problematic
because the rest of the secure instance code expects exactly one
instance. A security group could be attached to both instances, and
would block the worker from launching any more SIs, as it tries to
delete the old security group first, which is still held by one of the
surplus SIs which didn't get terminated.

Only retry if:
- on "UnfulfillableCapacity" or "InsufficientInstanceCapacity" error codes;
- there wasn't an instance launched anyway.

If either of these checks fail, do not try to launch another one, and
just fail the job.
2024-10-24 10:29:26 +02:00
Sanne Raymaekers
1c7a276d6f cloud/aws: add maintenance functions for secure instance cleanup 2024-10-23 10:32:57 +02:00
Sanne Raymaekers
8fc91d1c6d cloud/aws: move maintenance calls to separate file 2024-10-23 10:32:57 +02:00
Sanne Raymaekers
5eb8227bf3 cloud/awscloud: retry CreateFleet regardless of the error code
The errors returned by create fleet are not entirely clear. It seems it
also returns `InsufficientInstanceCapacity` in addition to
`UnfulfillableCapacity`. Let's just retry three times regardless of the
create fleet error, that way there's no need to chase error codes which
aren't clearly defined.
2024-10-15 16:04:19 +02:00
Sanne Raymaekers
905df418aa cloud/aws: add a third secure instance fallback across AZs
In case the on demand option failed as well, retry one more time across
availability zones. This significantly increases the pool of available
instances, but increases network related costs, as transferring data
between AZs is not free.
2024-10-07 15:56:07 +02:00
Lukas Zapletal
65d5f48847 cloud: fixed typo UnfulfillableCapacity 2024-09-26 18:09:45 +02:00
Tomáš Hozza
e53d03fe16 compute/gcp: add guest OS features for el10/c10s
Signed-off-by: Tomáš Hozza <thozza@redhat.com>
2024-08-23 13:10:53 +02:00
Sanne Raymaekers
2624516f1a osbuild-worker: use aws sdk v2 for asg scale-in protection 2024-08-20 15:32:40 +02:00
Sanne Raymaekers
c90b92f666 cloud/awscloud: test failures when running a secure instance 2024-08-20 15:32:40 +02:00
Sanne Raymaekers
acc415a676 cloud/awscloud: test terminating a secure instance 2024-08-20 15:32:40 +02:00
Sanne Raymaekers
e30fba38fc cloud/awscloud: handle nil describe output when creating LTs/SGs 2024-08-20 15:32:40 +02:00
Sanne Raymaekers
fa3b203178 internal/boot: adapt to aws sdk v2 2024-08-20 15:32:40 +02:00
Sanne Raymaekers
16c9a7be88 cloud/awscloud: add tests for ec2 operations 2024-08-20 15:32:40 +02:00
Sanne Raymaekers
810e9133e8 cloud/awscloud: switch ec2 to v2 sdk 2024-08-20 15:32:40 +02:00
Sanne Raymaekers
8d158f6031 cloud/awscloud: switch s3 to v2 sdk 2024-08-20 15:32:40 +02:00
Sanne Raymaekers
dc389eaa71 cloud/awscloud: fix nil pointer dereference
When the cleanup function gets called, there's a chance the Instnace
field isn't populated yet, so store the instance ID separately and wait
for it to be terminated in case it's present.

The error would produce the following trace:
```
goroutine 1 [running]:
...
main.(*OSBuildJobImpl).Run.func1()
    osbuild/osbuild-composer/cmd/osbuild-worker/jobimpl-osbuild.go:404 +0xc5
panic({0x55e2a76a1e40?, 0x55e2a906d2f0?})
    /usr/lib/golang/src/runtime/panic.go:920 +0x270
github.com/osbuild/osbuild-composer/internal/cloud/awscloud.(*AWS).deleteFleetIfExists(0xc000faa840, 0xc0012718c0)
    osbuild/osbuild-composer/internal/cloud/awscloud/secure-instance.go:441 +0x175
github.com/osbuild/osbuild-composer/internal/cloud/awscloud.(*AWS).TerminateSecureInstance(0x55e2a90825e0?, 0x2?)
    osbuild/osbuild-composer/internal/cloud/awscloud/secure-instance.go:192 +0x1d
github.com/osbuild/osbuild-composer/internal/cloud/awscloud.(*AWS).RunSecureInstance.func1()
    osbuild/osbuild-composer/internal/cloud/awscloud/secure-instance.go:75 +0x69
github.com/osbuild/osbuild-composer/internal/cloud/awscloud.(*AWS).RunSecureInstance(0xc000faa840, {0xc000afeade, 0x10}, {0x0, 0x0}, {0x0, 0x0}, {0xc001120f30, 0x24})
    osbuild/osbuild-composer/internal/cloud/awscloud/secure-instance.go:169 +0x12a7
...
```
2024-08-01 10:58:08 +02:00
Sanne Raymaekers
547f74e7db awscloud: try to terminate previous secure instance
In case the previous executor SI belonging to the worker did not get
shut down properly, attempt to do it again when starting a new one,
otherwise replacing the SG or LT will not work.
2024-06-28 15:33:08 +02:00
Sanne Raymaekers
791ec07bc2 internal/awscloud: fix cloud-init userdata for secure instance
The conditional only checked if the cloudwatch group was set, and if it
wasn't, the hostname variable wouldn't be set either. So the executor
would try to look for a hostname but not find any.
2024-06-26 10:56:57 +02:00
Sanne Raymaekers
2a621521a8 osbuildexecutor/aws.ec2: set hostname of executor via cloud-init
This way much more of the journal will be captured under the new
hostname.
2024-06-25 10:58:10 +02:00
Sanne Raymaekers
ae4467ab0d internal/awscloud: retry CreateFleet
When receiving the "UnfillableCapacity" error from CreateFleet, retry
the request with an OnDemand instance.
2024-06-24 12:50:37 +02:00
Sanne Raymaekers
2e31ea50aa cloud/awscloud: use instance requirements when creating secure instance 2024-06-14 10:59:58 +02:00
Sanne Raymaekers
314ed4b527 cloud/awscloud: allow internet access on secure instance again
The executor is timing out and there are no logs. This will require some
further work. Remove the restriction for now.
2024-03-20 14:58:25 +01:00
Sanne Raymaekers
79b5b736e9 cloud/awscloud: restrict network egress for secure instance
The security instance should no longer have any internet access.
2024-03-19 17:07:30 +01:00
Tomáš Hozza
e7743f17ec Worker: allow configuring executor CloudWatch group
We need the ability to use different CloudWatch group for the
osbuild-executor on Fedora workers in staging and production
environment.

Extend the worker confguration to allow configuring the CloudWatch group
name used by the osbuild-executor. Extend the secure instance code to
instruct cloud-init via user data to create /tmp/cloud_init_vars file
with the CloudWatch group name in the osbuild-executor instance, to make
it possible for the executor to configure its logging differently based
on the value.

Cover new changes by unit tests.

Signed-off-by: Tomáš Hozza <thozza@redhat.com>
2024-03-08 13:13:44 +01:00
Sanne Raymaekers
040eec4089 osbuild-worker: allow adding key to aws.ec2 executor
This is useful during testing to set up the executor machine.
2024-03-01 19:20:51 +01:00
Amelia Crate
b3bb851863 Tag rhel 9.2+ with SEV_LIVE_MIGRATABLE_V2
SEV-SNP capable kernels containing commit ac3f9c9f are compatible.
SEV_LIVE_MIGRATABLE indicated compatibility with an older version of SEV live migration, without ac3f9c9f.
See: https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?id=ac3f9c9f1b37edaa7d1a9b908bc79d843955a1a2
2024-02-22 15:45:39 +01:00
Sanne Raymaekers
5025ec31d3 cloud/awscloud: describe security groups using filters
Using the group names option only works for the default VPC, the workers
are not running in the default VPC. For non-default VPCs filters should
be used.
2024-02-20 15:23:52 +01:00
Sanne Raymaekers
7fce482baa cloud/awscloud: create secure instance in the same subnet
This reduces network costs as transferring data between AZs is not free.
2024-02-16 15:21:20 +01:00
Sanne Raymaekers
ee6b198b0a cloud/awscloud: remove restricting egress rule from SG
The machine still needs to be able to fetch sources, so just keep the
default 0.0.0.0/0 rule.
2024-02-15 14:23:18 +01:00
Sanne Raymaekers
8e6717fa1b cloud/awscloud: take instance type from host
InstanceRequirements is very flakey, the create fleet request fails
almost consistently with the same error.

To continue with testing use a fixed instance type for now. As a
followup we can expand the instance type selection logic or figure out
what was wrong with the InstanceRequirements.
2024-02-14 18:15:25 +01:00
Sanne Raymaekers
8a1d66a0bd cloud/awscloud: max 4 overrides are allowed when creating a fleet
```
InvalidParameterValue: Your request contains more than the maximum allowed number of InstanceRequirements (4)
```
2024-02-14 15:24:42 +01:00
Sanne Raymaekers
7fd150b938 cloud/awscloud: specify subnets when creating secure instance
For non-default VPCs, AWS needs the subnets it can launch the instance
in, otherwise it will try to launch the instance in the default VPC,
even if the supplied security groups are attached to a non-default VPC.

Furthermore there can only be 1 subnet specified per availability zone,
so query the subnets in the VPC of the host (as the instance needs to be
launched in the same network), and pick 1 of the VPC's subnets per AZ.
2024-02-14 13:45:52 +01:00
Sanne Raymaekers
a2fb1bfc61 cloud/awscloud: add userdata to secure instance
This way the `worker-initialization.service` knows to spin up the
builder instead of the worker.
2024-02-14 09:54:11 +01:00
Sanne Raymaekers
3db88960c2 cloud/awscloud: add ability to run a secure instance to awscloud
This instance can only contact the host, and requires this host to be
running on AWS itself with the appropriate IAM role.
2024-02-14 09:54:11 +01:00
Sanne Raymaekers
05a45ed233 cloud/awscloud: add ec2metadata client 2024-02-14 09:54:11 +01:00
Tomáš Hozza
8ba1976b02 internal/cloud/gcp/compute: keep legacy Guest OS Features for el9.0
The SEV-SNP support was added since RHEL-9.1, so we need to keep the
original Guest OS Feature set when importing RHEL-9.0 images to GCP.

Signed-off-by: Tomáš Hozza <thozza@redhat.com>
2023-08-21 16:57:33 +02:00
Timothée Ravier
4173e5d768 internal/cloud/gcp/compute: Add SEV_SNP_CAPABLE Guest OS Feature
See: https://github.com/coreos/coreos-assembler/pull/3547
See: https://cloud.google.com/blog/products/identity-security/rsa-snp-vm-more-confidential
See: https://issues.redhat.com/browse/COS-2343
2023-08-21 16:57:33 +02:00
Ondřej Budai
cac9327b44 update to go 1.19
UBI and the oldest support Fedora (37) now all have go 1.19, so we are
cleared to switch.

gofmt now reformats comments in certain cases, so that explains the formatting
changes in this commit.
See https://go.dev/doc/go1.19#go-doc

Signed-off-by: Ondřej Budai <ondrej@budai.cz>
2023-07-21 19:18:00 +02:00
Tomáš Hozza
0292725ce4 internal/GCP: remove all remaining uses of cloudbuild
Some uses of `cloudbuild` GCP API have been left in our internal cloud
API implementation for GCP. We do not use `cloudbuild` to import GCE
images into GCP any more.

Do not request the `cloudbuild` authentication scope when getting new
GCP client.

Update vendored packages accordingly.

Signed-off-by: Tomáš Hozza <thozza@redhat.com>
2023-05-24 19:28:06 +02:00
dependabot[bot]
60e55b5ed3 build(deps): bump cloud.google.com/go/compute from 1.10.0 to 1.19.3
Bumps [cloud.google.com/go/compute](https://github.com/googleapis/google-cloud-go) from 1.10.0 to 1.19.3.
- [Release notes](https://github.com/googleapis/google-cloud-go/releases)
- [Changelog](https://github.com/googleapis/google-cloud-go/blob/main/documentai/CHANGES.md)
- [Commits](https://github.com/googleapis/google-cloud-go/compare/kms/v1.10.0...compute/v1.19.3)

---
updated-dependencies:
- dependency-name: cloud.google.com/go/compute
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Migrated to the new version by following
https://github.com/googleapis/google-cloud-go/blob/main/migration.md

Co-authored-by: Tomáš Hozza <thozza@redhat.com>
Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: Tomáš Hozza <thozza@redhat.com>
2023-05-22 11:51:42 +02:00
Tomáš Hozza
e13f0a1ae2 AWS: allow specifying the AMI boot mode when registering the image
When the AMI is being registered from a snapshot, the caller can
optionally specify the boot mode of the AMI. If no boot mode is
specified, then the default behavior is to use the boot type of the
instance that is launched from the AMI.

The default behavior (no boot type specified) is preserved after this
change.

Signed-off-by: Tomáš Hozza <thozza@redhat.com>
2023-05-19 13:24:39 +02:00
Brian C. Lane
7a4bb863dd Update deprecated io/ioutil functions
ioutil has been deprecated since go 1.16, this fixes all of the
deprecated functions we are using:

ioutil.ReadFile -> os.ReadFile
ioutil.ReadAll -> io.ReadAll
ioutil.WriteFile -> os.WriteFile
ioutil.TempFile -> os.CreateTemp
ioutil.TempDir -> os.MkdirTemp

All of the above are a simple name change, the function arguments and
results are exactly the same as before.

ioutil.ReadDir -> os.ReadDir

now returns a os.DirEntry but the IsDir and Name functions work the
same. The difference is that the FileInfo must be retrieved with the
Info() function which can also return an error.

These were identified by running:
golangci-lint run --build-tags=integration ./...
2023-03-07 09:22:23 -08:00
Sanne Raymaekers
2e3dd16220 osbuild-service-maintenance: clean up all regions
Since we started cloning images to different regions, the maintenance
script should clean up all of these regions.
2023-01-25 14:20:51 +01:00
Ondřej Budai
b997142db0 common: merge all *ToPtr methods to one generic ToPtr
After introducing Go 1.18 to a project, it's required by law to convert at
least one method to a generic one.

Everyone hates IntToPtr, StringToPtr, BoolToPtr and Uint64ToPtr, so let's
convert them to the ultimate generic ToPtr one.

Signed-off-by: Ondřej Budai <ondrej@budai.cz>
2023-01-09 14:03:18 +01:00
Colin Walters
a3a733a638 gcp: Cross-reference to coreos-assembler code
At the moment we have duplicate logic here; ideally of course
we consolidate (since both codebases are Go, perhaps we could
create a tiny little Go library for "RHEL GCP stuff"?) but
for now let's just cross-link for awareness.
2022-11-30 11:13:31 +01:00
Ondřej Budai
0e6c132ee6 awscloud: add option to mark S3 object as public
By setting the object's ACL to "public-read", anyone can download the object
even without authenticating with AWS.

The osbuild-upload-generic-s3 command got a new -public argument that
uses this new feature.

Signed-off-by: Ondřej Budai <ondrej@budai.cz>
2022-09-19 22:56:36 +02:00