Ever since Ansible Collections got introduced,
collection install has had to somehow figure out the whole dependency tree that
it's supposed to download and install. The code we had was rather entangled.
But things are going to change starting with ansible-core 2.11.
And here's how.
One of the items planned for ansible-core v2.11 was improving the
ansible-galaxy collection CLI 💻. The first
thing needed was making it possible to upgrade collections when using the
install subcommand without requiring
--force-with-deps. This is something
people have been wanting for quite a while but it
wasn't possible for roles.
Then, we also wanted to introduce an additional
collection install [ -U | --upgrade ] option. And we also considered
working on a new
ansible-galaxy collection remove subcommand but
never had time to complete this stretch goal[1].
Another thing on our radar was caching HTTP responses from the Galaxy API so that the dependency resolution process could become dramatically faster.
Jordan, Sloane and I formed a feature team to work on this. We
decided to split the subtasks, one per person, to spread the
load. Jordan was to work on caching, Sloane was assigned to the
--upgrade task and I was supposed to work on updating the bare
install command so that it would not require
--force-with-deps when there's a need to update the
already installed collections.
I was almost unfamiliar with this part of ansible-core so I needed to
get familiar with it, starting by exploring my colleagues' pointers
on which functions would likely need updates. What could
possibly go wrong? Well, as I was going deeper and deeper down the
rabbit hole, I realized that there was a lot of complexity in the
existing code and we basically had a rather simplistic dependency
resolver that looked like a yarn of leaky abstractions 🤯. It was hard
to reason about what strategies it followed to get all the transitive
dependencies for collections requested to be installed or downloaded.
At the same time, I remembered that there is this other prominent
project in the Python ecosystem, pip, that recently got a fresh out
of the oven dependency resolver: resolvelib. It's a third-party library
that pip bundles but it's also freely available to
install separately. My buddy Pradyun has been involved with this effort (the
pip one) for about four years so I had somebody I could ask dumb
questions about dependency resolution :)
And so the idea to replace the dependency resolver was born. Instead of
patching a few places in the old code here and there, I thought: why
not overengineer this task and refactor the whole thing, or rather, improve
the maintainability of the subpackage dedicated to managing collections?
I must say that my enthusiasm to
break all the things, or rather, refine a whole
bunch of already working code was met with a lot of suspicion within the
broader Ansible Core Engineering team, at first. This additionally meant
introducing a new runtime dependency, something that we almost never
do. We now have a good mechanism to help OS packagers seamlessly bundle
runtime dependencies, though[2].
I was faced with a challenge: I knew that the idea was good and now I had to convince others that it was not as crazy as it might seem.
I switched into research mode, looked into what interfaces and hooks resolvelib requires and came up with a tiny 225 LoC long proof-of-concept. I even wired the demo into GitHub Actions CI/CD so people could see the result instantly. After that, when folks saw how easy it was to connect resolvelib and delegate the resolution correctness responsibility to it, the team agreed that this refactoring would be useful and that we should proceed.
Meanwhile Jordan was working on his caching task. So while I was busy figuring out where to stick resolvelib into our spaghetti, Jordan submitted the API HTTP request caching PR and it got merged without any problems.
The resolver replacement work was so fundamental that it turned out to block virtually everything else related to our ansible-galaxy CLI UX improvements. This was no longer just my task. Yes, I was making most of the design decisions for the new architecture but I got an enormous amount of help getting this to the finish line. And I enjoyed this collaboration so much!
It wasn't just throwing old code away and adding the new one in place. One of our main objectives was to keep the behavior as close as possible to what the old code did. We identified a lot of redundant tests that could be removed and rewrote some of the unit tests into integration tests. We also identified a ton of gaps in the test coverage, which we filled in with many new tests (yaaay! 🙌). Sloane also did a lot of manual behavior verification and testing 👏.
I mentioned earlier that resolvelib was easy to integrate and even linked that extremely short PoC. This creates an illusion that it could be a "5-minute patch" but it totally wasn't.
resolvelib requires one to implement an interface they call "provider" with the following hooks:
identify(requirement_or_candidate): returns a unique identifier for the package (FQCN in our case)
get_preference(resolution, candidates, information): makes a sort key determining the "importance" of a certain requirement
find_matches(requirements): returns all candidates matching the given requirements
is_satisfied_by(requirement, candidate): double-checks the correctness of the candidates the resolver chooses
get_dependencies(candidate): retrieves all the direct requirements that the given candidate has
This doesn't look too complicated, does it? That's because resolvelib really doesn't care what your requirements and candidates are, as long as you keep interfacing with it via the same data structures.
This also means that the resolver doesn't know where to get the info about the requirements and the candidates beyond the data you provide to it by implementing these hooks.
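To make the shape of such a provider more tangible, here is a minimal sketch that implements the five hooks with the signatures listed above. It deliberately does not import resolvelib, and it is not the ansible-core provider: the Requirement and Candidate tuples, the FAKE_INDEX mapping and the ToyCollectionProvider class are all made up for illustration, and versions are matched against an explicit set instead of real version specifiers. Other resolvelib releases may use different hook signatures.

```python
from collections import namedtuple

# Hypothetical data model (an assumption, not ansible-core's): a requirement is
# an FQCN plus a set of accepted versions; a candidate is one concrete version.
Requirement = namedtuple("Requirement", "fqcn allowed_versions")
Candidate = namedtuple("Candidate", "fqcn version")

# Made-up in-memory index standing in for Galaxy servers, Git repos and so on.
# It maps (fqcn, version) to the direct requirements of that candidate.
FAKE_INDEX = {
    ("ns.app", "1.0.0"): [Requirement("ns.lib", frozenset({"1.0.0", "1.1.0"}))],
    ("ns.lib", "1.0.0"): [],
    ("ns.lib", "1.1.0"): [],
}


class ToyCollectionProvider:
    """Toy provider exposing the five hooks described in the text."""

    def identify(self, requirement_or_candidate):
        # Requirements and candidates of the same collection share an FQCN.
        return requirement_or_candidate.fqcn

    def get_preference(self, resolution, candidates, information):
        # Sort key: requirements with fewer candidates get resolved first.
        return len(candidates)

    def find_matches(self, requirements):
        # All requirements passed in one call refer to the same identifier.
        fqcn = requirements[0].fqcn
        matches = [
            Candidate(name, version)
            for name, version in FAKE_INDEX
            if name == fqcn
            and all(
                self.is_satisfied_by(req, Candidate(name, version))
                for req in requirements
            )
        ]
        # Newest first; plain string comparison is enough for this toy data.
        return sorted(matches, key=lambda cand: cand.version, reverse=True)

    def is_satisfied_by(self, requirement, candidate):
        return (
            candidate.fqcn == requirement.fqcn
            and candidate.version in requirement.allowed_versions
        )

    def get_dependencies(self, candidate):
        return FAKE_INDEX[(candidate.fqcn, candidate.version)]


# Walk one step of the tree by hand, the way a resolver would call the hooks.
provider = ToyCollectionProvider()
root = Requirement("ns.app", frozenset({"1.0.0"}))
app = provider.find_matches([root])[0]
lib_req = provider.get_dependencies(app)[0]
lib = provider.find_matches([lib_req])[0]
print(provider.identify(app), app.version)
print(provider.identify(lib), lib.version)
```

Everything the hooks know comes from FAKE_INDEX, which is the point: the resolver only ever sees what the provider hands back, so a real provider has to fetch that data from somewhere itself.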
So we needed to implement talking to the Galaxy API, taking into account more than one Galaxy-like server as a source for retrieving collections. We also needed to take into account artifacts not provided by Galaxy, like direct URLs to tarballs or Git repos, or local files and folders.
This all could easily increase the complexity so I introduced the concepts of a concrete artifacts manager, and a facade for talking to multiple Galaxy APIs and other metadata sources (including the artifacts manager). The artifacts manager is responsible for downloading and caching the artifacts (if they are not local) as well as retrieving (and caching) their metadata. It also has an alternative constructor that can clean up the cache directory upon exit. Both objects are initialized once (at the beginning) and are passed to the consumers via dependency injection.
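The arrangement described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the ansible-core code: the ArtifactsManager, MultiSourceFacade and FakeServer classes, their method names and the example.invalid URL are all hypothetical, and the "download" is a callable injected by the caller so that the sketch runs without network access.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path


class ArtifactsManager:
    """Hypothetical sketch: fetches each remote artifact once and caches it."""

    def __init__(self, cache_dir):
        self.cache_dir = Path(cache_dir)
        self._artifact_paths = {}  # source -> local cached path

    @classmethod
    @contextmanager
    def under_tmpdir(cls):
        """Alternative constructor: the temporary cache dir is removed on exit."""
        tmp = tempfile.mkdtemp(prefix="collections-cache-")
        try:
            yield cls(tmp)
        finally:
            shutil.rmtree(tmp, ignore_errors=True)

    def get_artifact_path(self, source, fetch):
        """Return a local path for source, calling fetch(source) at most once."""
        if source not in self._artifact_paths:
            name = f"artifact-{len(self._artifact_paths)}.tar.gz"
            target = self.cache_dir / name
            target.write_bytes(fetch(source))
            self._artifact_paths[source] = target
        return self._artifact_paths[source]


class MultiSourceFacade:
    """Hypothetical facade over several Galaxy-like servers and the manager."""

    def __init__(self, servers, artifacts_manager):
        self._servers = servers
        self._artifacts = artifacts_manager  # injected, not created here

    def get_versions(self, fqcn):
        # Merge the versions known to every configured server.
        versions = set()
        for server in self._servers:
            versions.update(server.get_versions(fqcn))
        return versions

    def get_artifact_path(self, source, fetch):
        return self._artifacts.get_artifact_path(source, fetch)


class FakeServer:
    """Stand-in for a Galaxy-like API client (made up for this sketch)."""

    def __init__(self, index):
        self._index = index

    def get_versions(self, fqcn):
        return self._index.get(fqcn, [])


fetch_calls = []


def fake_fetch(source):
    fetch_calls.append(source)
    return b"tarball bytes"


url = "https://example.invalid/ns-lib-1.1.0.tar.gz"
with ArtifactsManager.under_tmpdir() as manager:
    servers = [FakeServer({"ns.lib": ["1.0.0"]}), FakeServer({"ns.lib": ["1.1.0"]})]
    facade = MultiSourceFacade(servers, manager)
    versions = facade.get_versions("ns.lib")
    first = facade.get_artifact_path(url, fake_fetch)
    second = facade.get_artifact_path(url, fake_fetch)  # served from the cache
    cache_dir = manager.cache_dir
cache_removed = not cache_dir.exists()
```

Because both objects are created once and handed to whoever needs them, the consumers never have to know how many servers are configured or where the cache lives.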
Most of the packaging ecosystems are rather simple. They have packages with the content of one "atom" inside the artifact. Ansible Collections are mostly like that but there are additional cases which make everything substantially more complex. One of the primary use-cases that differ is SCM-based collections: they may have one collection in the root of the repository but also in a certain (user-defined) subdirectory. Moreover, SCM targets may have multiple collections inside the same repository (in a namespace subdir that can also be nested as defined by the repo creators). To solve this, we mark Git targets as "virtual collections" during the dependency resolution. The artifacts manager downloads them into a temporary directory and marks that directory as the single dependency of such a "virtual Git collection". If there are subdirs, we do the same "virtual collection" trick with them (except unpacked dirs don't need to be copied into the cache, the manager just holds their real paths in memory). These "virtual collections" are very helpful during the resolution and are skipped at the install step (after the resolution is complete).
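The "virtual collection" idea can be sketched like this. It is a simplified model and not the actual implementation: the Candidate dataclass, the is_virtual flag, the helper functions, the example.invalid URL and the paths are assumptions made for illustration. It only covers the multi-collection case, where both the Git target and its checkout directory act as virtual nodes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    # Hypothetical candidate model; is_virtual marks nodes that exist only
    # to carry dependencies through the resolution.
    fqcn: str
    version: str
    src: str
    is_virtual: bool = False


def build_virtual_tree(repo_url, checkout_dir, collections_in_repo):
    """Model a Git target as virtual nodes pointing at the real collections.

    collections_in_repo maps a subdirectory to an (fqcn, version) pair.
    Returns the root candidate and a candidate -> dependencies mapping.
    """
    git_target = Candidate(repo_url, "*", repo_url, is_virtual=True)
    checkout = Candidate(checkout_dir, "*", checkout_dir, is_virtual=True)
    # The checkout directory is the single dependency of the Git target.
    deps = {git_target: [checkout], checkout: []}
    for subdir, (fqcn, version) in collections_in_repo.items():
        # Real collections keep their on-disk path; nothing is copied.
        real = Candidate(fqcn, version, f"{checkout_dir}/{subdir}")
        deps[checkout].append(real)
        deps[real] = []
    return git_target, deps


def walk(root, deps):
    """Depth-first list of every candidate reachable from root."""
    seen, stack = [], [root]
    while stack:
        cand = stack.pop()
        if cand in seen:
            continue
        seen.append(cand)
        stack.extend(reversed(deps[cand]))
    return seen


def installable(resolved):
    """Virtual candidates help during resolution but are never installed."""
    return [cand for cand in resolved if not cand.is_virtual]


root, deps = build_virtual_tree(
    "git+https://example.invalid/repo.git",
    "/tmp/checkout",
    {"ns/one": ("ns.one", "1.0.0"), "ns/two": ("ns.two", "2.0.0")},
)
resolved = walk(root, deps)
to_install = installable(resolved)
print([cand.fqcn for cand in to_install])
```

The virtual nodes give the resolver something to hang the dependency edges on, and filtering on the flag afterwards keeps them out of the install step.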
Well, that's about it. After 3–4 months of experimentation, development, testing, polishing and reviews, and just days before the feature freeze, the feature is in devel!
Based on the refactoring, Sloane was able to complete her work on the
--upgrade option and it got merged too.
If you are an end-user who uses the
[download|install|list|verify] subcommands, please make sure to tell us
how well we managed to mix refactoring with the feature development this
time. Hopefully, we've squashed all the bugs already 🤞 but if we missed
anything, let us know! 🖖