Skip to main content

advancements in alchemiscale v0.6

·494 words·3 mins·
Author
David L. Dotson
Lead Developer // Datryllic LLC

In February of this year we released alchemiscale v0.6.0, and followed this up with v0.6.1 and v0.6.2. We wanted to take a moment to highlight these releases, and the improvements they bring for users!

v0.6.0
#

As a major release, v0.6.0 introduced the concept of Task restart policies, allowing users to automate restarts for Task failures based on the types of errors they are encountering. When running on distributed, heterogeneous compute, Tasks may encounter a variety of errors unrelated to the calculation being performed. These could be temporary issues due to resource oversubscription from colocated jobs (the so-called “noisy neighbor problem”), filesystem issues, GPU driver mismatches, or other problems that are often outside of a user’s control.

Users can now set a restart policy for the Tasks of an AlchemicalNetwork with:

from alchemiscale import AlchemiscaleClient, ScopedKey

asc: AlchemiscaleClient   # an existing AlchemiscaleClient instance
an_sk: ScopedKey          # an AlchemicalNetwork ScopedKey

# rerun any `Task` that failed with a `RunTimeError`
# or matched `MemoryError` at most 5 times
asc.add_task_restart_patterns(
        an_sk,
        [r"RuntimeError: .+",
        r"MemoryError: Unable to allocate \d+ GiB"],
        5
    )

More details on usage can be found in the User Guide.

This release also substantially reduced the size of result objects pulled by users via the AlchemiscaleClient. We now use the KeyedChain representation for all ProtocolDAGResult objects produced by compute services, and these are now also compressed-at-rest on creation using zstd.

Finally, we added user-configurable on-disk caching to the AlchemiscaleClient. For repeated calls to e.g. AlchemiscaleClient.get_network_results(), this reduces the need to pull down the same ProtocolDAGResults over and over, keeping the most recently-requested ProtocolDAGResults on a user’s local disk for retrieval instead. The cache will only keep objects up to a size limit (default 1 GiB), and this is configurable by the user.

v0.6.1
#

Release v0.6.1 was a bugfix release, fixing a broken codepath in the compute API for resolving task restarts for failed ProtocolDAGResults. This was a critical bug, in which failed Tasks caused compute API failures. We added additional tests to catch cases like this in the future before release.

v0.6.2
#

Release v0.6.2 was an incremental release, offering some usability improvements based on user feedback. Disk caching can now be disabled for users of the AlchemiscaleClient and for compute services, which can be especially helpful if the cache is causing issues on network filesystems (such as on HPC resources). Users can also now set all required arguments for the AlchemiscaleClient via environment variables, namely ALCHEMISCALE_URL, ALCHEMISCALE_ID, and ALCHEMISCALE_KEY. Not only is this convenient: it also reduces the likelihood of accidently saving an API key in a Jupyter notebook.

what’s next?
#

We are currently hard at work on the major new features coming in v0.7.0, so keep a look out for this release in the coming months!

If you are interested in trying alchemiscale out, or if you already have and want to offer ideas for improvement, please reach out! Posting in our Discussions forum is the best way to get started!