Add a safeguard to ensure secure instances without valid
parent instances are terminated, as they are unnecessary to retain.
Typically, the parent does not exist if the secure instance is
older than 2 hours, but this check provides additional validation.
HMS-3632
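A rough sketch of what such a check could look like, assuming the aws-sdk-go-v2 EC2 client (`github.com/aws/aws-sdk-go-v2/service/ec2` and its `types` package); the helper name and the way the parent ID is obtained are illustrative, not the actual implementation:
```
import (
	"context"
	"errors"
	"time"

	"github.com/aws/aws-sdk-go-v2/service/ec2"
	ec2types "github.com/aws/aws-sdk-go-v2/service/ec2/types"
	"github.com/aws/smithy-go"
)

// parentIsGone reports whether a secure instance is older than two hours and
// its parent instance can no longer be found, i.e. it is safe to terminate.
func parentIsGone(ctx context.Context, client *ec2.Client, si ec2types.Instance, parentID string) (bool, error) {
	if si.LaunchTime == nil || time.Since(*si.LaunchTime) < 2*time.Hour {
		return false, nil
	}
	out, err := client.DescribeInstances(ctx, &ec2.DescribeInstancesInput{
		InstanceIds: []string{parentID},
	})
	if err != nil {
		var apiErr smithy.APIError
		if errors.As(err, &apiErr) && apiErr.ErrorCode() == "InvalidInstanceID.NotFound" {
			return true, nil // the parent no longer exists
		}
		return false, err
	}
	return len(out.Reservations) == 0, nil
}
```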
Even in case of errors, as long as create fleet returns an instance,
attempt to use it.
In some cases AWS returns `InsufficientInstanceCapacity` but still
creates an instance:
```
msg="Won't retry CreateFleet with OnDemand instance, retry: false, errors: InsufficientInstanceCapacity: There is no Spot capacity available that matches your request.; Already launched instance ([i-...]), aborting create fleet"
msg="doCreateFleetRetry: returning retry: false, msg: [InsufficientInstanceCapacity: There is no Spot capacity available that matches your request. Already launched instance ([i-...]), aborting create fleet]"
msg="doCreateFleetRetry: cancelling retry, instance already exists: [i-...]"
msg="doCreateFleetRetry: setting retry to true"
msg="Checking to retry fleet create on error InsufficientInstanceCapacity (msg: There is no Spot capacity available that matches your request.)"
```
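As a rough sketch of the intent (a fragment inside the fleet-creation path, assuming the aws-sdk-go-v2 EC2 client and its `aws` helper package; `client` and `input` are illustrative names):
```
// With an "instant" fleet, capacity problems are reported in output.Errors,
// while output.Instances may still contain an instance that was launched
// anyway. Prefer using that instance over failing or retrying.
output, err := client.CreateFleet(ctx, input)
if err != nil {
	return "", err
}
for _, fleetInstance := range output.Instances {
	if len(fleetInstance.InstanceIds) > 0 {
		return fleetInstance.InstanceIds[0], nil
	}
}
if len(output.Errors) > 0 {
	return "", fmt.Errorf("unable to create fleet: %s", aws.ToString(output.Errors[0].ErrorMessage))
}
return "", fmt.Errorf("unable to create fleet: no instance and no error returned")
```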
RHEL 10 (nightly) builds fail on stage with "Fatal glibc error: CPU does
not support x86-64-v3". This is most likely caused by very old instance
types that do not support the required instruction set.
We still see this error sometimes:
Unable to start secure instance: Unable to create fleet: InsufficientInstanceCapacity: There is no Spot capacity available that matches your request
This is awkward because the message mentions that there is no spot
capacity, even though the current code should retry on
InsufficientInstanceCapacity. I also confirmed this by searching the
logs for the retry messages: there are none.
We need a bigger hammer. Let's log everything that happens in the
createFleet method in order to get a better understanding of why the
retry logic isn't triggered. We should probably move most of the newly
added logs to the debug level, but let's delay that until we have
more insight into what's happening.
By surfacing the output even in case of an error, the fleet ID and
instance ID can be extracted if present. Thus the instance can be
terminated before its dependencies are deleted.
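A hedged sketch of how the surfaced output could be used; `a.createFleet` and `a.ec2` are illustrative names, not necessarily the real method and field, and logrus is assumed for logging:
```
// createFleet is assumed to return the CreateFleet output even alongside an
// error, so a stray instance can be terminated before its security group and
// launch template are deleted.
output, err := a.createFleet(ctx, input)
if err != nil {
	if output != nil {
		for _, fleetInstance := range output.Instances {
			if len(fleetInstance.InstanceIds) == 0 {
				continue
			}
			// Terminate the instance that was launched anyway, otherwise the
			// security group cleanup would fail with a dependency error.
			if _, termErr := a.ec2.TerminateInstances(ctx, &ec2.TerminateInstancesInput{
				InstanceIds: fleetInstance.InstanceIds,
			}); termErr != nil {
				logrus.Errorf("failed to terminate instance launched by a failed fleet: %v", termErr)
			}
		}
	}
	return err
}
```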
We're seeing some behaviour where create fleet is not retried and
subsequently the SI cleanup fails due to the security group already
being tied to an existing instance. There is no error indicating that
an instance was launched anyway.
The current spot price could be limiting the available instance pool
significantly. ARM instances specifically are experiencing a lot of
capacity errors.
The current path sometimes launches two instances, which is problematic
because the rest of the secure instance code expects exactly one
instance. A security group could end up attached to both instances,
blocking the worker from launching any more SIs: it tries to delete the
old security group first, but the group is still held by the surplus SI
that never got terminated.
Only retry if:
- the error code is "UnfulfillableCapacity" or "InsufficientInstanceCapacity";
- no instance was launched anyway.
If either of these checks fails, do not try to launch another instance,
and just fail the job.
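A sketch of that decision, roughly along the lines of the `doCreateFleetRetry` messages in the logs above; the function name and exact plumbing here are illustrative (aws-sdk-go-v2 types assumed):
```
// shouldRetryCreateFleet returns true only when the fleet reported a
// capacity-related error code and no instance was launched anyway.
func shouldRetryCreateFleet(output *ec2.CreateFleetOutput) bool {
	if output == nil {
		return false
	}
	for _, fleetInstance := range output.Instances {
		if len(fleetInstance.InstanceIds) > 0 {
			// An instance already exists; retrying would launch a second one
			// that keeps holding the security group. Fail the job instead.
			return false
		}
	}
	for _, fleetErr := range output.Errors {
		switch aws.ToString(fleetErr.ErrorCode) {
		case "UnfulfillableCapacity", "InsufficientInstanceCapacity":
			return true
		}
	}
	return false
}
```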
The errors returned by create fleet are not entirely clear. It seems to
return `InsufficientInstanceCapacity` in addition to
`UnfulfillableCapacity`. Let's just retry three times regardless of the
create fleet error; that way there's no need to chase error codes which
aren't clearly defined.
In case the on-demand option failed as well, retry one more time across
availability zones. This significantly increases the pool of available
instances, but also increases network-related costs, as transferring
data between AZs is not free.
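A sketch of the resulting retry ladder; `fleetHasInstance`, `spotInput`, `onDemandInput` and `crossAZInput` are assumed names, not the actual code:
```
// Try spot capacity up to three times regardless of the reported error code,
// then fall back to on-demand, and as a last resort widen the request to
// subnets in all availability zones (cross-AZ traffic is billed).
for attempt := 0; attempt < 3; attempt++ {
	if output, err := client.CreateFleet(ctx, spotInput); err == nil && fleetHasInstance(output) {
		return output, nil
	}
}
if output, err := client.CreateFleet(ctx, onDemandInput); err == nil && fleetHasInstance(output) {
	return output, nil
}
return client.CreateFleet(ctx, crossAZInput)
```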
When the cleanup function gets called, there's a chance the Instance
field isn't populated yet, so store the instance ID separately and wait
for it to be terminated in case it's present.
The error would produce the following trace:
```
goroutine 1 [running]:
...
main.(*OSBuildJobImpl).Run.func1()
osbuild/osbuild-composer/cmd/osbuild-worker/jobimpl-osbuild.go:404 +0xc5
panic({0x55e2a76a1e40?, 0x55e2a906d2f0?})
/usr/lib/golang/src/runtime/panic.go:920 +0x270
github.com/osbuild/osbuild-composer/internal/cloud/awscloud.(*AWS).deleteFleetIfExists(0xc000faa840, 0xc0012718c0)
osbuild/osbuild-composer/internal/cloud/awscloud/secure-instance.go:441 +0x175
github.com/osbuild/osbuild-composer/internal/cloud/awscloud.(*AWS).TerminateSecureInstance(0x55e2a90825e0?, 0x2?)
osbuild/osbuild-composer/internal/cloud/awscloud/secure-instance.go:192 +0x1d
github.com/osbuild/osbuild-composer/internal/cloud/awscloud.(*AWS).RunSecureInstance.func1()
osbuild/osbuild-composer/internal/cloud/awscloud/secure-instance.go:75 +0x69
github.com/osbuild/osbuild-composer/internal/cloud/awscloud.(*AWS).RunSecureInstance(0xc000faa840, {0xc000afeade, 0x10}, {0x0, 0x0}, {0x0, 0x0}, {0xc001120f30, 0x24})
osbuild/osbuild-composer/internal/cloud/awscloud/secure-instance.go:169 +0x12a7
...
```
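A minimal sketch of the cleanup guard, assuming the aws-sdk-go-v2 EC2 client and its instance-terminated waiter; `si.instanceID` and `a.ec2` are illustrative names:
```
// The deferred cleanup can run before si.Instance is populated, which is what
// triggered the nil dereference above. Terminate by the separately stored ID
// and wait for the instance to go away before deleting its dependencies.
if si.instanceID != "" {
	if _, err := a.ec2.TerminateInstances(ctx, &ec2.TerminateInstancesInput{
		InstanceIds: []string{si.instanceID},
	}); err != nil {
		return err
	}
	waiter := ec2.NewInstanceTerminatedWaiter(a.ec2)
	if err := waiter.Wait(ctx, &ec2.DescribeInstancesInput{
		InstanceIds: []string{si.instanceID},
	}, 5*time.Minute); err != nil {
		return err
	}
}
```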
In case the previous executor SI belonging to the worker did not get
shut down properly, attempt to do so again when starting a new one;
otherwise replacing the SG or LT will not work.
The conditional only checked whether the CloudWatch group was set, and
if it wasn't, the hostname variable wouldn't be set either. So the
executor would try to look for a hostname but not find any.
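Roughly, the fix amounts to something like the following sketch; the variable names and the exact shape of the generated variables are assumptions, not the actual code:
```
// Resolve the hostname unconditionally; only the CloudWatch group itself is
// optional. Previously both lived inside the same conditional, so an empty
// group also meant an empty hostname.
hostname, err := os.Hostname()
if err != nil {
	return err
}
cloudInitVars := fmt.Sprintf("WORKER_HOSTNAME=%s\n", hostname)
if cloudWatchGroup != "" {
	cloudInitVars += fmt.Sprintf("CLOUDWATCH_GROUP=%s\n", cloudWatchGroup)
}
```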
We need the ability to use a different CloudWatch group for the
osbuild-executor on Fedora workers in the staging and production
environments.
Extend the worker configuration to allow configuring the CloudWatch
group name used by the osbuild-executor. Extend the secure instance code
to instruct cloud-init via user data to create a /tmp/cloud_init_vars
file with the CloudWatch group name on the osbuild-executor instance, so
the executor can configure its logging differently based on the value.
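A sketch of what the generated user data could look like; the variable name inside /tmp/cloud_init_vars and the exact cloud-config layout are illustrative, not necessarily the keys the worker actually writes:
```
// executorUserData builds the cloud-config passed to the secure instance.
// cloud-init writes /tmp/cloud_init_vars, which the osbuild-executor can
// source to pick the CloudWatch group for its logging.
func executorUserData(cloudWatchGroup string) string {
	return fmt.Sprintf(`#cloud-config
write_files:
  - path: /tmp/cloud_init_vars
    permissions: "0644"
    content: |
      OSBUILD_EXECUTOR_CLOUDWATCH_GROUP=%s
`, cloudWatchGroup)
}
```
The string would still need to be base64-encoded before being placed into the launch template's user data.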
Cover new changes by unit tests.
Signed-off-by: Tomáš Hozza <thozza@redhat.com>
Using the group names option only works for the default VPC, but the
workers are not running in the default VPC. For non-default VPCs,
filters should be used.
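Assuming the lookup goes through DescribeSecurityGroups with the aws-sdk-go-v2 client, the filter-based variant looks roughly like this (`groupName` and `vpcID` are illustrative):
```
// GroupNames only matches security groups in the default VPC; for a
// non-default VPC, filter on group-name and vpc-id instead.
out, err := client.DescribeSecurityGroups(ctx, &ec2.DescribeSecurityGroupsInput{
	Filters: []ec2types.Filter{
		{Name: aws.String("group-name"), Values: []string{groupName}},
		{Name: aws.String("vpc-id"), Values: []string{vpcID}},
	},
})
```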
InstanceRequirements is very flaky; the create fleet request fails
almost consistently with the same error.
To continue with testing, use a fixed instance type for now. As a
follow-up, we can expand the instance type selection logic or figure out
what was wrong with the InstanceRequirements.
For non-default VPCs, AWS needs the subnets it can launch the instance
in; otherwise it will try to launch the instance in the default VPC,
even if the supplied security groups are attached to a non-default VPC.
Furthermore, only one subnet can be specified per availability zone, so
query the subnets in the host's VPC (the instance needs to be launched
in the same network) and pick one of the VPC's subnets per AZ.
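A sketch of that selection with the aws-sdk-go-v2 EC2 client; how `vpcID` is obtained (for example, from the host's instance metadata) is outside this fragment:
```
// Query all subnets of the host's VPC and keep at most one per availability
// zone, since only one subnet per AZ may be specified in the fleet request.
out, err := client.DescribeSubnets(ctx, &ec2.DescribeSubnetsInput{
	Filters: []ec2types.Filter{
		{Name: aws.String("vpc-id"), Values: []string{vpcID}},
	},
})
if err != nil {
	return nil, err
}
subnetPerAZ := map[string]string{}
for _, subnet := range out.Subnets {
	az := aws.ToString(subnet.AvailabilityZone)
	if _, ok := subnetPerAZ[az]; !ok {
		subnetPerAZ[az] = aws.ToString(subnet.SubnetId)
	}
}
```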