This commit allows excluding ownership preservation from an object
export. This is required to fix the issue that on macOS a
podman-based workflow cannot export objects while preserving
ownership.
Originally this was a `no_preserve: Optional[List[str]] = None`
parameter, to be super flexible in what we pass to `cp`, but then I
felt like YAGNI - if we need more we can trivially change this
(internal) API again :)
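For illustration, a hedged sketch of what the simplified knob could
look like when driving `cp`; the parameter name and the exact flags
are assumptions, not the real export code:

```python
import subprocess


def export_tree(src: str, dst: str, no_preserve_owner: bool = False) -> None:
    # Copy the object tree, optionally telling cp not to preserve
    # ownership (useful e.g. in a rootless/podman setup).
    cmd = ["cp", "--archive"]
    if no_preserve_owner:
        cmd.append("--no-preserve=ownership")
    cmd += [src + "/.", dst]
    subprocess.run(cmd, check=True)
```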
This commit removes some unnecessary custom tmpdir() fixtures
and uses the pytest built-in tmp_path instead.
Some custom tmpdir fixtures are left in place as they configure
the tmp location to be under `/var/tmp`, which is not trivial to
do with pytest's `tmp_path`. I am not sure whether there is a deep
reason for using /var/tmp. I assume it's to ensure that the tests
run on a real FS and not on a potential tmpfs, but I don't have
the full background, so I didn't want to change anything.
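For reference, a minimal sketch of what such a remaining fixture can
look like (the fixture name and prefix are illustrative):

```python
import tempfile

import pytest


@pytest.fixture
def tmpdir():
    # Keep the temporary directory under /var/tmp, which is typically a
    # real filesystem rather than a tmpfs; pytest's tmp_path cannot be
    # redirected there as easily.
    with tempfile.TemporaryDirectory(dir="/var/tmp", prefix="test-") as tmp:
        yield tmp
```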
Instead of using `Path.stat`, use `os.stat`, since the former only
gained the `follow_symlinks` argument in Python 3.10 but we still
need to support Python 3.6 for RHEL 7 and 8.
Additionally, reduce the precision by converting timestamps to
integers to avoid false negatives due to floating point arithmetic.
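A short sketch of the resulting pattern (the helper name is made up
for illustration):

```python
import os


def mtime_of(path: str) -> int:
    # os.stat has supported follow_symlinks since Python 3.3; truncating
    # to an integer avoids comparing floating point mtimes directly.
    return int(os.stat(path, follow_symlinks=False).st_mtime)
```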
Add a new `source_epoch` attribute that, if set, will lead to all
mtimes that are newer than or equal to the creation date being
clamped to the specified `source_epoch` time when the object is
finalized.
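A hedged sketch of what such a clamping pass could look like at
finalization time; the helper name and the exact traversal are
assumptions, not the real implementation:

```python
import os


def clamp_mtimes(root: str, created: float, source_epoch: int) -> None:
    # Reset every entry whose mtime is at or after the object's creation
    # time to the configured source_epoch.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if int(os.lstat(path).st_mtime) >= int(created):
                os.utime(path, (source_epoch, source_epoch),
                         follow_symlinks=False)
```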
Port the existing object store tests from `unittest` to `pytest`.
Allow all tests that can run without root privileges to do so. No
functional change to the tests themselves.
Integrate the recently added file system cache `FsCache` into our
object store `ObjectStore`. NB: this changes its semantics:
previously a call to `ObjectStore.commit` resulted in the object
being in the cache (I/O errors aside). But `FsCache.store`, which
is now the backing store for objects, will only commit objects if
there is enough space left. Thus we cannot rely on objects being
present for reading after a call to `FsCache.store`. To cope with
this we now always copy the object into the cache, even in the case
where we previously moved it: when `commit` is called with an
`object_id` matching `Object.id`, which is the case when `commit`
is called for the last stage in the pipeline. We could keep this
optimization, but then we would have to special-case it and not
call `commit` in these cases until after we have exported all
objects; or in other words, after we are sure we will never read
from any committed object again. The extra complexity does not
seem worth it for the little gain of the optimization.
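For illustration, a rough sketch of the changed flow; the `FsCache`
and `ObjectStore` signatures used here are assumptions, only the
copy-not-move semantics are the point:

```python
import shutil


def commit(cache, obj, object_id: str) -> None:
    # The tree is always *copied* into the cache, never moved: the
    # backing cache may refuse the object when it runs out of space, and
    # callers must not assume the committed object is readable later.
    obj.finalize()

    def write(destination: str) -> None:
        shutil.copytree(obj.tree, destination, symlinks=True)

    cache.store(object_id, write)  # assumed callback-style API
```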
Convert all the tests to the new semantics and also remove a lot
of them that make no sense under this new paradigm.
Add a new command line option `--cache-max-size` which will set
the maximum size of the cache, if specified.
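A minimal sketch of the wiring for the new option; the size parsing
and the units accepted by the real CLI are not shown here:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--cache-max-size", metavar="SIZE", default=None,
                    help="maximum size of the cache (default: unlimited)")
args = parser.parse_args()

if args.cache_max_size is not None:
    # hand the parsed size to the cache, e.g. via a `maximum_size`
    # setting (name assumed)
    pass
```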
In the `store_server` test, pass the store to `enter_context`,
instead of the `stack`; the latter is an interesting form of
recursion, and totally not what we want.
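A small illustration of the difference, with a stand-in context
manager in place of the store:

```python
import contextlib
import tempfile

with contextlib.ExitStack() as stack:
    # Correct: register the store itself so it is cleaned up on exit ...
    store = stack.enter_context(tempfile.TemporaryDirectory())
    # ... instead of the previous, accidentally recursive form:
    # stack.enter_context(stack)
```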
Integrate the new `Metadata` object as `meta` property on `Object`.
Use it to actually store metadata after a successful stage run.
A new class `PathAdapter` is introduced, which is in turn used to
expose the base path of `Object` as `os.PathLike` so it can be
passed as the path to `Metadata`. The advantage is that any changes
to the base path in `Object` will automatically be picked up by
`Metadata`; the prominent, and currently only, case where this
happens in `Object` is `store_tree`.
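A hedged sketch of the adapter idea; the constructor arguments are
assumptions, not the real signature:

```python
import os


class PathAdapter:
    """Expose an attribute of another object as os.PathLike, resolved
    lazily so that later changes to that attribute are picked up."""

    def __init__(self, obj, attr: str) -> None:
        self._obj = obj
        self._attr = attr

    def __fspath__(self) -> str:
        # Looked up on every use, e.g. after `store_tree` changed the
        # base path of the Object.
        return os.fspath(getattr(self._obj, self._attr))
```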
Implement a new class, nested inside `Object`, to read and write
metadata. It is indexed by a key and individual pieces of meta-
data are stored in separate files. Empty files are not created.
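A rough sketch of the read/write behaviour described above; the file
format and method names are assumptions:

```python
import json
import os


class Metadata:
    def __init__(self, path) -> None:
        self._path = path  # anything os.PathLike works here

    def write(self, key: str, data) -> None:
        if not data:
            return  # empty metadata does not create a file
        os.makedirs(self._path, exist_ok=True)
        with open(os.path.join(self._path, key), "w", encoding="utf-8") as f:
            json.dump(data, f)

    def read(self, key: str):
        try:
            with open(os.path.join(self._path, key), encoding="utf-8") as f:
                return json.load(f)
        except FileNotFoundError:
            return None
```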
Instead of storing the (tree) data directly at the root of the
object-specific directory, move it into a `data/tree` subfolder.
This prepares for two things:
1) the `tree` folder will allow us to add another folder next to
it to store metadata.
2) storing both `tree` and the future metadata folder in a
common subfolder prepares for the future integration
with the new caching layer (`FsCache`).
The `Object.{read,write}` methods were introduced to implement
copy-on-write support. Calling `write` would trigger the copy if
the object had a `base`. Additionally, a level of indirection was
introduced via bind mounts, which allowed hiding the actual path
of the object in the store and making sure that `read` really
returned a read-only path.
Support for copy-on-write was recently removed[1], and with it the
need for the `read` and `write` methods. We lose the benefits of
the indirection, but they are not really needed: the path to the
object is not really hidden since one can always use the
`resolve_ref` method to obtain the actual store object path. The
read-only property of build trees is ensured via read-only bind
mounts in the build root.
Instead of using `read` and `write`, `Object` now has a new `tree`
property that is the path to the object's tree and also implements
`__fspath__`, so it behaves like an `os.PathLike` object and can
thus transparently be used in many places, e.g. `os.path.join` or
`pathlib.Path`.
[1] 5346025031
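A minimal illustration of the `os.PathLike` behaviour; the class name
and directory layout below are stand-ins, not the real code:

```python
import os
import pathlib


class TreePath:
    """Stand-in for what the `tree` property could return."""

    def __init__(self, base: str) -> None:
        self._base = base

    def __fspath__(self) -> str:
        return os.path.join(self._base, "data", "tree")


tree = TreePath("/tmp/store/object-1234")  # path is made up
print(os.path.join(tree, "etc"))           # .../data/tree/etc
print(pathlib.Path(tree) / "etc")          # same, via pathlib
```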
If the object's id does not match the one supplied for the commit,
we create a clone. Otherwise we store the tree.
The code path is arranged in a way that we always go through
`Object.store_tree`, so we always call `Object.finalize`, as
preparation for the future, where we might actually do something
meaningful in the finalizer, like reset the *times or count the
tree size.
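A rough sketch of that control flow; the helper names and signatures
are assumptions:

```python
def commit(obj, object_id: str) -> None:
    # Clone when the supplied id differs from the object's own id, but
    # always funnel through store_tree so that finalize() gets called in
    # every case.
    if object_id != obj.id:
        obj = obj.clone(object_id)  # assumed helper
    obj.store_tree()                # this path ensures Object.finalize()
```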
Remove copy-on-write support from `objectstore.Object`. The main
reason for introducing copy-on-write was to save an additional
copy in the non-DAG pipeline model[1]. With the introduction of
the DAG pipeline model and the explicit `--export` option, we can
achieve the same result without the complexity of copy-on-write
semantics.
[1] See commit 39213b7, part of the 3b7c87d5..42a365d1 changeset.
There is little use in sharing the store between tests; quite the
opposite: all tests expect a clean store and some currently set
that up themselves. Create a fresh store for each test.
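A possible fixture shape for this; the import path and the
`ObjectStore` constructor signature are assumed, not verified:

```python
import pytest

from osbuild import objectstore  # import path assumed


@pytest.fixture
def store(tmp_path):
    # Every test gets its own, empty store in a fresh directory.
    return objectstore.ObjectStore(tmp_path)
```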
The idea of this test case was to check that two identical trees are
only stored once, via their treesum in the object store; but this
functionality was removed in commit e97f6ef34 and, instead of
treesums, random UUIDs are now used. As a result there is no
de-duplication anymore -- the subject of the test. So remove the
test.
The treesum of a filesystem tree is the content hash of all its
files, its directory structure and file metadata.
By storing trees by their treesum we avoid storing duplicates of
identical trees, at the cost of computing the hashes for every
commit to the store.
This has limited benefit as the likelihood of two trees being
identical is slim, in particular when we already have the ability
to cache based on pipeline/stage ID (i.e., we can avoid rebuilding
trees if the pipelines that built them were the same).
Drop the concept of a treesum entirely, even though I very much
liked the idea in theory...
Signed-off-by: Tom Gundersen <teg@jklm.no>
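For context, a rough illustration of the treesum idea being dropped;
the real treesum covered more metadata and used a defined canonical
encoding, so this is only a sketch:

```python
import hashlib
import os


def treesum(root: str) -> str:
    # Hash relative paths, a bit of metadata and file contents of the
    # whole tree into a single digest.
    h = hashlib.sha256()
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            h.update(os.path.relpath(path, root).encode())
            h.update(f"{st.st_mode} {st.st_uid} {st.st_gid}".encode())
            if os.path.isfile(path) and not os.path.islink(path):
                with open(path, "rb") as f:
                    h.update(f.read())
    return h.hexdigest()
```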
Add the ability to read only a sub-tree of a tree via `Object.read_at`.
Expose the functionality via `Store{Server,Client}.read_tree_at`.
Extend the tests to check this new functionality.
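A hedged usage sketch; the parameter order of `read_tree_at` is an
assumption, not the documented API:

```python
def export_boot(client, object_id: str, destination: str) -> None:
    # Expose only the /boot sub-tree of the stored object at
    # `destination`, instead of reading the whole tree.
    client.read_tree_at(object_id, destination, "/boot")
```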
Instead of using string interpolation and concatenation to build
file system paths, use `os.path.join` or the `pathlib.Path`
constructor directly, which can take path segments.
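For example:

```python
import os
import pathlib

object_id = "1234"

# Instead of e.g. "/var/cache/store/" + object_id + "/tree":
p1 = os.path.join("/var/cache/store", object_id, "tree")
p2 = pathlib.Path("/var/cache/store", object_id, "tree")
assert str(p2) == p1
```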
Move `test_objectstore` into the module-level tests. This allows us to
run it as part of `make test-module`.
Make sure to properly guard it as a root-only module.
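One way such a guard can look at module level (the project may use a
different helper for this):

```python
import os

import pytest

# Skip the whole module unless running as root.
pytestmark = pytest.mark.skipif(os.getuid() != 0, reason="requires root")
```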