Prior to this commit, ${{ github.event.workflow_run.head_branch }} was
expanded inside the bash script, so a malicious actor could inject an
arbitrary shell command. Since this action has access to a token with
write permissions, such an actor could easily steal that token.
This commit moves the expansion into an env block, where the injection
cannot happen. This is the preferred approach according to the GitHub
docs:
https://docs.github.com/en/actions/security-guides/security-hardening-for-github-actions#using-an-intermediate-environment-variable
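A minimal sketch of the pattern described in the linked docs; the step name
and script below are illustrative, not the actual workflow content:
```yaml
# Before: the expression was expanded directly into the script, so a crafted
# branch name became part of the shell command line.
#   - run: echo "head branch is ${{ github.event.workflow_run.head_branch }}"

# After: the value is passed through an intermediate environment variable;
# the shell only ever sees $HEAD_BRANCH, never the attacker-controlled text.
- name: Print head branch
  env:
    HEAD_BRANCH: ${{ github.event.workflow_run.head_branch }}
  run: echo "head branch is $HEAD_BRANCH"
```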
Even in the case of errors, as long as create fleet returns an instance,
attempt to use it.
In some cases AWS returns `InsufficientInstanceCapacity` but still
creates an instance:
```
msg="Won't retry CreateFleet with OnDemand instance, retry: false, errors: InsufficientInstanceCapacity: There is no Spot capacity available that matches your request.; Already launched instance ([i-...]), aborting create fleet"
msg="doCreateFleetRetry: returning retry: false, msg: [InsufficientInstanceCapacity: There is no Spot capacity available that matches your request. Already launched instance ([i-...]), aborting create fleet]"
msg="doCreateFleetRetry: cancelling retry, instance already exists: [i-...]"
msg="doCreateFleetRetry: setting retry to true"
msg="Checking to retry fleet create on error InsufficientInstanceCapacity (msg: There is no Spot capacity available that matches your request.)"
```
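A minimal sketch of that behaviour, assuming the AWS SDK for Go v2; the
wrapper function and its name are hypothetical and only illustrate the
decision, not the actual code:
```go
package example

import (
	"context"
	"fmt"

	"github.com/aws/aws-sdk-go-v2/service/ec2"
)

// createFleetInstance calls CreateFleet and, even when AWS reports errors
// such as InsufficientInstanceCapacity, uses the instance if one was
// launched anyway instead of retrying and leaking it.
func createFleetInstance(ctx context.Context, client *ec2.Client, input *ec2.CreateFleetInput) (string, error) {
	output, err := client.CreateFleet(ctx, input)
	if output != nil && len(output.Instances) > 0 && len(output.Instances[0].InstanceIds) > 0 {
		// An instance exists despite the reported errors: use it.
		return output.Instances[0].InstanceIds[0], nil
	}
	if err != nil {
		return "", fmt.Errorf("unable to create fleet: %w", err)
	}
	return "", fmt.Errorf("create fleet returned no instances: %v", output.Errors)
}
```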
RHEL 10 (nightly) builds fail on stage with "Fatal glibc error: CPU does
not support x86-64-v3". This is most likely caused by very old instance
types that do not support the x86-64-v3 instruction set level required by
RHEL 10.
We still see this error sometimes:
Unable to start secure instance: Unable to create fleet: InsufficientInstanceCapacity: There is no Spot capacity available that matches your request
This is awkward because the message mentions that there is no spot
capacity, even though the current code should retry on
InsufficientInstanceCapacity. I also confirmed this by searching for
the retry log messages: there are none in the logs.
We need a bigger hammer. Let's log everything that happens in the
createFleet method in order to get a better understanding of why the
retry logic isn't triggered. We should probably move most of the newly
added logs to the debug level, but let's delay that until we have
more insight into what's happening.
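As a sketch, the extra logging could look roughly like this.
doCreateFleetRetry appears in the logs quoted earlier; shouldRetry and the
exact messages are placeholders, not the real implementation:
```go
package example

import (
	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/ec2"
	"github.com/sirupsen/logrus"
)

// shouldRetry stands in for the existing retry decision logic.
func shouldRetry(output *ec2.CreateFleetOutput, err error) bool { return false }

// doCreateFleetRetry logs every input it sees and the decision it makes,
// so the worker logs show why a retry was or was not triggered.
func doCreateFleetRetry(output *ec2.CreateFleetOutput, err error) bool {
	logrus.Infof("doCreateFleetRetry: called with error: %v", err)
	if output != nil {
		logrus.Infof("doCreateFleetRetry: fleet id: %s, errors: %v, instances: %v",
			aws.ToString(output.FleetId), output.Errors, output.Instances)
	}
	retry := shouldRetry(output, err)
	logrus.Infof("doCreateFleetRetry: returning retry: %v", retry)
	return retry
}
```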
By surfacing the output even in the case of an error, the fleet ID and
instance ID can be extracted if present. The instance can then be
terminated before its dependencies are deleted.
We're seeing some behaviour where create fleet is not retried and the
secure instance (SI) cleanup subsequently fails because the security group
is already tied to an existing instance. There is no error indicating that
an instance was launched anyway.
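A minimal sketch of the cleanup ordering this enables, assuming the AWS SDK
for Go v2; cleanupSecureInstance and its parameters are illustrative names,
not the actual implementation:
```go
package example

import (
	"context"
	"fmt"
	"time"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/ec2"
)

// cleanupSecureInstance terminates an instance that CreateFleet managed to
// launch before deleting the security group it depends on; deleting the
// group first fails while the instance still references it.
func cleanupSecureInstance(ctx context.Context, client *ec2.Client, instanceID, securityGroupID string) error {
	if instanceID != "" {
		if _, err := client.TerminateInstances(ctx, &ec2.TerminateInstancesInput{
			InstanceIds: []string{instanceID},
		}); err != nil {
			return fmt.Errorf("unable to terminate instance %s: %w", instanceID, err)
		}
		// Wait until the instance is gone; otherwise the security group is
		// still in use and the deletion below is rejected.
		waiter := ec2.NewInstanceTerminatedWaiter(client)
		if err := waiter.Wait(ctx, &ec2.DescribeInstancesInput{
			InstanceIds: []string{instanceID},
		}, 5*time.Minute); err != nil {
			return fmt.Errorf("waiting for instance termination: %w", err)
		}
	}
	_, err := client.DeleteSecurityGroup(ctx, &ec2.DeleteSecurityGroupInput{
		GroupId: aws.String(securityGroupID),
	})
	return err
}
```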
This is a bit of an RFC commit: I noticed that when we discussed a crash
from the worker, we looked at individual messages from syslog/journald for
the stacktrace details. I was wondering whether having a more direct crash
report would be more useful. We can of course also add more logrus
features to flag those with tags like "crash" or something (I did not do
that in this PR, I don't know much about the operational side, sorry).
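As a sketch of what a more direct crash report with a "crash" tag could look
like (this is not what the PR does, just an illustration):
```go
package example

import (
	"runtime/debug"

	"github.com/sirupsen/logrus"
)

// reportCrash turns a panic into a single structured log entry carrying the
// full stack trace, instead of a traceback split across many
// syslog/journald messages.
func reportCrash() {
	if r := recover(); r != nil {
		logrus.WithFields(logrus.Fields{
			"crash": true,
			"stack": string(debug.Stack()),
		}).Errorf("worker crashed: %v", r)
		panic(r) // re-panic so the process still terminates abnormally
	}
}

// Usage: defer reportCrash() at the top of the worker's job handling loop.
```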
I got confused because the jobqueue interface is asymmetric: it expects
an object when enqueuing but returns a json.RawMessage when dequeuing, and
when handing over to postgres this is abstracted away by the postgres
implementation.
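Reduced to the relevant method shapes, the asymmetry looks roughly like this
(the real interface has more parameters and methods; this is only an
illustration):
```go
package example

import (
	"context"
	"encoding/json"

	"github.com/google/uuid"
)

// JobQueue sketches the asymmetry: Enqueue accepts the job arguments as an
// arbitrary Go value and marshals them internally, while Dequeue hands the
// stored arguments back as raw JSON and leaves unmarshaling to the caller.
type JobQueue interface {
	Enqueue(jobType string, args interface{}, channel string) (uuid.UUID, error)
	Dequeue(ctx context.Context, jobTypes []string) (uuid.UUID, json.RawMessage, error)
}
```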
Fixes the special case where, if no worker is available and we generate
an internal timeout and cancel the depsolve including all follow-up jobs,
no error was propagated.
When requeuing a job, the next worker requesting the job would decrement
the pending counter, but the pending counter only ever got incremented
once, when the job was first enqueued. Thus make sure to increment the
pending counter when a job is requeued.
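A minimal sketch of the bookkeeping described above, with illustrative names
and an in-memory counter (the actual queue keeps this state elsewhere):
```go
package example

import "github.com/google/uuid"

// queue is a toy stand-in for the job queue's pending bookkeeping.
type queue struct {
	pending int         // number of jobs currently available to workers
	jobs    []uuid.UUID // jobs waiting to be picked up
}

// enqueue makes a brand-new job available and counts it as pending.
func (q *queue) enqueue(id uuid.UUID) {
	q.pending++
	q.jobs = append(q.jobs, id)
}

// requeue makes a job available again after a worker gave it up. Before the
// fix, the job was put back without incrementing pending, so the next
// dequeue consumed a slot that had only ever been granted once.
func (q *queue) requeue(id uuid.UUID) {
	q.pending++
	q.jobs = append(q.jobs, id)
}

// dequeue hands a job to a worker and consumes one pending slot.
func (q *queue) dequeue() (uuid.UUID, bool) {
	if q.pending == 0 || len(q.jobs) == 0 {
		return uuid.Nil, false
	}
	q.pending--
	id := q.jobs[0]
	q.jobs = q.jobs[1:]
	return id, true
}
```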