Add a safeguard to ensure secure instances without valid
parent instances are terminated, as they are unnecessary to retain.
Typically, the parent does not exist if the secure instance is
older than 2 hours, but this check provides additional validation.
HMS-3632
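A rough sketch of what such a check could look like, assuming the aws-sdk-go-v2 EC2 client (`github.com/aws/aws-sdk-go-v2/service/ec2` and its `types` package); the helper name and the way the parent ID is obtained are illustrative, not the actual implementation:
```
import (
	"context"
	"errors"
	"time"

	"github.com/aws/aws-sdk-go-v2/service/ec2"
	ec2types "github.com/aws/aws-sdk-go-v2/service/ec2/types"
	"github.com/aws/smithy-go"
)

// parentIsGone reports whether a secure instance is older than two hours and
// its parent instance can no longer be found, i.e. it is safe to terminate.
func parentIsGone(ctx context.Context, client *ec2.Client, si ec2types.Instance, parentID string) (bool, error) {
	if si.LaunchTime == nil || time.Since(*si.LaunchTime) < 2*time.Hour {
		return false, nil
	}
	out, err := client.DescribeInstances(ctx, &ec2.DescribeInstancesInput{
		InstanceIds: []string{parentID},
	})
	if err != nil {
		var apiErr smithy.APIError
		if errors.As(err, &apiErr) && apiErr.ErrorCode() == "InvalidInstanceID.NotFound" {
			return true, nil // the parent no longer exists
		}
		return false, err
	}
	return len(out.Reservations) == 0, nil
}
```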
Even in case of errors, as long as create fleet returns an instance,
attempt to use it.
In some cases AWS returns `InsufficientInstanceCapacity` but still
creates an instance:
```
msg="Won't retry CreateFleet with OnDemand instance, retry: false, errors: InsufficientInstanceCapacity: There is no Spot capacity available that matches your request.; Already launched instance ([i-...]), aborting create fleet"
msg="doCreateFleetRetry: returning retry: false, msg: [InsufficientInstanceCapacity: There is no Spot capacity available that matches your request. Already launched instance ([i-...]), aborting create fleet]"
msg="doCreateFleetRetry: cancelling retry, instance already exists: [i-...]"
msg="doCreateFleetRetry: setting retry to true"
msg="Checking to retry fleet create on error InsufficientInstanceCapacity (msg: There is no Spot capacity available that matches your request.)"
```
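As a rough sketch of the intent (a fragment inside the fleet-creation path, assuming the aws-sdk-go-v2 EC2 client and its `aws` helper package; `client` and `input` are illustrative names):
```
// With an "instant" fleet, capacity problems are reported in output.Errors,
// while output.Instances may still contain an instance that was launched
// anyway. Prefer using that instance over failing or retrying.
output, err := client.CreateFleet(ctx, input)
if err != nil {
	return "", err
}
for _, fleetInstance := range output.Instances {
	if len(fleetInstance.InstanceIds) > 0 {
		return fleetInstance.InstanceIds[0], nil
	}
}
if len(output.Errors) > 0 {
	return "", fmt.Errorf("unable to create fleet: %s", aws.ToString(output.Errors[0].ErrorMessage))
}
return "", fmt.Errorf("unable to create fleet: no instance and no error returned")
```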
RHEL 10 (nightly) builds fail on stage with "Fatal glibc error: CPU does
not support x86-64-v3". This is most likely caused by very old instance
types that do not support the required instruction set.
We still see this error sometimes:
Unable to start secure instance: Unable to create fleet: InsufficientInstanceCapacity: There is no Spot capacity available that matches your request
This is awkward because the message mentions that there is no spot
capacity, even though the current code should retry on
InsufficientInstanceCapacity. I also confirmed this by searching the
logs for the retry messages: there are none.
We need a bigger hammer. Let's log everything that happens in the
createFleet method in order to get a better understanding of why the
retry logic isn't triggered. We should probably move most of the newly
added logs to the debug level, but let's delay that until we have
more insight into what's happening.
By surfacing the output even in case of an error, the fleet ID and
instance ID can be extracted if present. Thus the instance can be
terminated before its dependencies are deleted.
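A hedged sketch of how the surfaced output could be used; `a.createFleet` and `a.ec2` are illustrative names, not necessarily the real method and field, and logrus is assumed for logging:
```
// createFleet is assumed to return the CreateFleet output even alongside an
// error, so a stray instance can be terminated before its security group and
// launch template are deleted.
output, err := a.createFleet(ctx, input)
if err != nil {
	if output != nil {
		for _, fleetInstance := range output.Instances {
			if len(fleetInstance.InstanceIds) == 0 {
				continue
			}
			// Terminate the instance that was launched anyway, otherwise the
			// security group cleanup would fail with a dependency error.
			if _, termErr := a.ec2.TerminateInstances(ctx, &ec2.TerminateInstancesInput{
				InstanceIds: fleetInstance.InstanceIds,
			}); termErr != nil {
				logrus.Errorf("failed to terminate instance launched by a failed fleet: %v", termErr)
			}
		}
	}
	return err
}
```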
We're seeing some behaviour where create fleet is not retried and
subsequently the SI cleanup fails due to the security group already
being tied to an existing instance. There is no error indicating that
an instance was launched anyway.
The current spot price could be limiting the available instance pool
significantly. ARM instances specifically are experiencing a lot of
capacity errors.
The current path sometimes launches two instances, which is problematic
because the rest of the secure instance code expects exactly one
instance. A security group could end up attached to both instances,
blocking the worker from launching any more SIs: it tries to delete the
old security group first, but the group is still held by the surplus SI
that never got terminated.
Only retry if:
- the error code is "UnfulfillableCapacity" or "InsufficientInstanceCapacity";
- no instance was launched anyway.
If either of these checks fails, do not try to launch another instance,
and just fail the job.
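A sketch of that decision, roughly along the lines of the `doCreateFleetRetry` messages in the logs above; the function name and exact plumbing here are illustrative (aws-sdk-go-v2 types assumed):
```
// shouldRetryCreateFleet returns true only when the fleet reported a
// capacity-related error code and no instance was launched anyway.
func shouldRetryCreateFleet(output *ec2.CreateFleetOutput) bool {
	if output == nil {
		return false
	}
	for _, fleetInstance := range output.Instances {
		if len(fleetInstance.InstanceIds) > 0 {
			// An instance already exists; retrying would launch a second one
			// that keeps holding the security group. Fail the job instead.
			return false
		}
	}
	for _, fleetErr := range output.Errors {
		switch aws.ToString(fleetErr.ErrorCode) {
		case "UnfulfillableCapacity", "InsufficientInstanceCapacity":
			return true
		}
	}
	return false
}
```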
The errors returned by create fleet are not entirely clear. It seems to
return `InsufficientInstanceCapacity` in addition to
`UnfulfillableCapacity`. Let's just retry three times regardless of the
create fleet error; that way there's no need to chase error codes which
aren't clearly defined.
In case the on-demand option failed as well, retry one more time across
availability zones. This significantly increases the pool of available
instances, but also increases network-related costs, as transferring
data between AZs is not free.
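A sketch of the resulting retry ladder; `fleetHasInstance`, `spotInput`, `onDemandInput` and `crossAZInput` are assumed names, not the actual code:
```
// Try spot capacity up to three times regardless of the reported error code,
// then fall back to on-demand, and as a last resort widen the request to
// subnets in all availability zones (cross-AZ traffic is billed).
for attempt := 0; attempt < 3; attempt++ {
	if output, err := client.CreateFleet(ctx, spotInput); err == nil && fleetHasInstance(output) {
		return output, nil
	}
}
if output, err := client.CreateFleet(ctx, onDemandInput); err == nil && fleetHasInstance(output) {
	return output, nil
}
return client.CreateFleet(ctx, crossAZInput)
```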
When the cleanup function gets called, there's a chance the Instance
field isn't populated yet, so store the instance ID separately and wait
for it to be terminated in case it's present.
The error would produce the following trace:
```
goroutine 1 [running]:
...
main.(*OSBuildJobImpl).Run.func1()
osbuild/osbuild-composer/cmd/osbuild-worker/jobimpl-osbuild.go:404 +0xc5
panic({0x55e2a76a1e40?, 0x55e2a906d2f0?})
/usr/lib/golang/src/runtime/panic.go:920 +0x270
github.com/osbuild/osbuild-composer/internal/cloud/awscloud.(*AWS).deleteFleetIfExists(0xc000faa840, 0xc0012718c0)
osbuild/osbuild-composer/internal/cloud/awscloud/secure-instance.go:441 +0x175
github.com/osbuild/osbuild-composer/internal/cloud/awscloud.(*AWS).TerminateSecureInstance(0x55e2a90825e0?, 0x2?)
osbuild/osbuild-composer/internal/cloud/awscloud/secure-instance.go:192 +0x1d
github.com/osbuild/osbuild-composer/internal/cloud/awscloud.(*AWS).RunSecureInstance.func1()
osbuild/osbuild-composer/internal/cloud/awscloud/secure-instance.go:75 +0x69
github.com/osbuild/osbuild-composer/internal/cloud/awscloud.(*AWS).RunSecureInstance(0xc000faa840, {0xc000afeade, 0x10}, {0x0, 0x0}, {0x0, 0x0}, {0xc001120f30, 0x24})
osbuild/osbuild-composer/internal/cloud/awscloud/secure-instance.go:169 +0x12a7
...
```
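A minimal sketch of the cleanup guard, assuming the aws-sdk-go-v2 EC2 client and its instance-terminated waiter; `si.instanceID` and `a.ec2` are illustrative names:
```
// The deferred cleanup can run before si.Instance is populated, which is what
// triggered the nil dereference above. Terminate by the separately stored ID
// and wait for the instance to go away before deleting its dependencies.
if si.instanceID != "" {
	if _, err := a.ec2.TerminateInstances(ctx, &ec2.TerminateInstancesInput{
		InstanceIds: []string{si.instanceID},
	}); err != nil {
		return err
	}
	waiter := ec2.NewInstanceTerminatedWaiter(a.ec2)
	if err := waiter.Wait(ctx, &ec2.DescribeInstancesInput{
		InstanceIds: []string{si.instanceID},
	}, 5*time.Minute); err != nil {
		return err
	}
}
```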
In case the previous executor SI belonging to the worker did not get
shut down properly, attempt to do so again when starting a new one;
otherwise replacing the SG or LT will not work.
The conditional only checked whether the CloudWatch group was set, and
if it wasn't, the hostname variable wouldn't be set either. So the
executor would try to look for a hostname but not find any.
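Roughly, the fix amounts to something like the following sketch; the variable names and the exact shape of the generated variables are assumptions, not the actual code:
```
// Resolve the hostname unconditionally; only the CloudWatch group itself is
// optional. Previously both lived inside the same conditional, so an empty
// group also meant an empty hostname.
hostname, err := os.Hostname()
if err != nil {
	return err
}
cloudInitVars := fmt.Sprintf("WORKER_HOSTNAME=%s\n", hostname)
if cloudWatchGroup != "" {
	cloudInitVars += fmt.Sprintf("CLOUDWATCH_GROUP=%s\n", cloudWatchGroup)
}
```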
We need the ability to use a different CloudWatch group for the
osbuild-executor on Fedora workers in the staging and production
environments.
Extend the worker configuration to allow configuring the CloudWatch
group name used by the osbuild-executor. Extend the secure instance code
to instruct cloud-init via user data to create a /tmp/cloud_init_vars
file with the CloudWatch group name on the osbuild-executor instance, so
the executor can configure its logging differently based on the value.
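A sketch of what the generated user data could look like; the variable name inside /tmp/cloud_init_vars and the exact cloud-config layout are illustrative, not necessarily the keys the worker actually writes:
```
// executorUserData builds the cloud-config passed to the secure instance.
// cloud-init writes /tmp/cloud_init_vars, which the osbuild-executor can
// source to pick the CloudWatch group for its logging.
func executorUserData(cloudWatchGroup string) string {
	return fmt.Sprintf(`#cloud-config
write_files:
  - path: /tmp/cloud_init_vars
    permissions: "0644"
    content: |
      OSBUILD_EXECUTOR_CLOUDWATCH_GROUP=%s
`, cloudWatchGroup)
}
```
The string would still need to be base64-encoded before being placed into the launch template's user data.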
Cover new changes by unit tests.
Signed-off-by: Tomáš Hozza <thozza@redhat.com>
Using the group names option only works for the default VPC, but the
workers are not running in the default VPC. For non-default VPCs,
filters should be used.
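Assuming the lookup goes through DescribeSecurityGroups with the aws-sdk-go-v2 client, the filter-based variant looks roughly like this (`groupName` and `vpcID` are illustrative):
```
// GroupNames only matches security groups in the default VPC; for a
// non-default VPC, filter on group-name and vpc-id instead.
out, err := client.DescribeSecurityGroups(ctx, &ec2.DescribeSecurityGroupsInput{
	Filters: []ec2types.Filter{
		{Name: aws.String("group-name"), Values: []string{groupName}},
		{Name: aws.String("vpc-id"), Values: []string{vpcID}},
	},
})
```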
InstanceRequirements is very flaky; the create fleet request fails
almost consistently with the same error.
To continue with testing, use a fixed instance type for now. As a
follow-up, we can expand the instance type selection logic or figure out
what was wrong with the InstanceRequirements.
For non-default VPCs, AWS needs the subnets it can launch the instance
in; otherwise it will try to launch the instance in the default VPC,
even if the supplied security groups are attached to a non-default VPC.
Furthermore, only one subnet can be specified per availability zone, so
query the subnets in the host's VPC (the instance needs to be launched
in the same network) and pick one of the VPC's subnets per AZ.
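A sketch of that selection with the aws-sdk-go-v2 EC2 client; how `vpcID` is obtained (for example, from the host's instance metadata) is outside this fragment:
```
// Query all subnets of the host's VPC and keep at most one per availability
// zone, since only one subnet per AZ may be specified in the fleet request.
out, err := client.DescribeSubnets(ctx, &ec2.DescribeSubnetsInput{
	Filters: []ec2types.Filter{
		{Name: aws.String("vpc-id"), Values: []string{vpcID}},
	},
})
if err != nil {
	return nil, err
}
subnetPerAZ := map[string]string{}
for _, subnet := range out.Subnets {
	az := aws.ToString(subnet.AvailabilityZone)
	if _, ok := subnetPerAZ[az]; !ok {
		subnetPerAZ[az] = aws.ToString(subnet.SubnetId)
	}
}
```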