Archive DICOMs

As the “true” raw data, DICOMs are rarely (re)accessed and hardly ever need to change; they do, however, need to be stored somewhere. Tracking DICOMs in DataLad datasets enables dependency tracking for their conversion to NIfTI. Even so, it is good to keep DataLad optional, i.e. to allow both DataLad and non-DataLad access to the data.

Historical precedent: ICF at FZJ

The following solution has been proposed for the Imaging Core Facility at FZJ:

  • DICOMs are packed into tar files (tarballs) 1
  • the tarballs are placed on a web server (intranet only), organized by project (HTTP Basic Authentication for access management)
  • DataLad datasets record availability via archivist and uncurl special remotes, which translates to:
    • a file is available from a tarball (archivist special remote)
    • a tarball is available from a given URL, pointing to the web server (uncurl special remote)2.
  • Only the Git repository (no annex) is stored by the consuming institute; the ICF web server is the permanent DICOM storage.
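The tarball normalization mentioned in footnote 1 can be sketched in Python. This is an illustrative sketch, not the actual inm-icf-utilities implementation: fixing the member order, timestamps, and ownership makes packing a pure function of the file contents, so re-packing the same data yields an identical checksum.

```python
import hashlib
import io
import tarfile

def pack_normalized(files: dict[str, bytes]) -> bytes:
    """Pack files into an uncompressed tarball with normalized metadata,
    so that packing the same content always yields identical bytes."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name in sorted(files):       # fixed member order
            data = files[name]
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            info.mtime = 0               # normalized timestamp
            info.uid = info.gid = 0      # normalized ownership
            info.uname = info.gname = ""
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()

# Hypothetical file names and contents, standing in for real DICOMs
dicoms = {
    "study/series1/001.dcm": b"...pixel data...",
    "study/series1/002.dcm": b"...pixel data...",
}
a = hashlib.sha256(pack_normalized(dicoms)).hexdigest()
b = hashlib.sha256(pack_normalized(dicoms)).hexdigest()
assert a == b  # re-packing does not change the tarball checksum
```

Because the checksum depends only on content, a consumer can verify at any time that a re-generated tarball is identical to the one recorded in the dataset.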

The system is documented at https://inm-icf-utilities.readthedocs.io/en/latest/, and the implementation of the tarball and dataset generation tools is at https://github.com/psychoinformatics-de/inm-icf-utilities.
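The configurability mentioned in footnote 2 can be sketched as a simple URL rewrite. The real uncurl special remote has its own configuration syntax (documented in datalad-next); the base URLs below are hypothetical, and the regex stands in for what uncurl derives from its recorded identifiers and templates:

```python
import re

# Hypothetical old and new storage locations
OLD_BASE = "https://icf.example.org/dicom"
NEW_BASE = "https://storage.example.org/archive"

def rewrite(url: str) -> str:
    """Map a recorded tarball URL onto a new base URL.

    Illustrates the idea behind uncurl: access URLs can be adapted
    via configuration alone, without rewriting dataset history.
    """
    return re.sub(re.escape(OLD_BASE), NEW_BASE, url, count=1)

print(rewrite("https://icf.example.org/dicom/project1/visit_a.tar"))
# -> https://storage.example.org/archive/project1/visit_a.tar
```

The key design point is that the dataset records *that* a tarball exists and what it contains; *where* it currently lives is a matter of (re-)configuration.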

TRR reimplementation

One of the TRR sites indicated intent to use a Forgejo instance for DICOM storage. A particular challenge is an inode limit on the underlying storage system. For this reason, the following adaptation of the ICF system has been proposed:

  • a dataset is generated upfront, and the DICOM tarball is stored with the dataset in Forgejo (as an annexed file)
  • the archivist special remote (file in tarball) is kept, to avoid consuming thousands of inodes for individual files (Git packs its repository into a few files, so the tarball adds only one more).
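Conceptually, what the archivist special remote provides on retrieval can be sketched as: given a tarball and a member path, extract the member’s bytes and verify them against a recorded checksum (as in checksum-based annex keys). The names, layout, and helper functions below are illustrative, not the real archivist protocol:

```python
import hashlib
import io
import tarfile

def build_tarball() -> bytes:
    """Build a minimal in-memory tarball with one hypothetical member."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        data = b"DICM..."  # stand-in for DICOM bytes
        info = tarfile.TarInfo("visit_a/001.dcm")
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()

def get_member(tarball: bytes, path: str, expected_sha256: str) -> bytes:
    """Retrieve one member from the tarball and verify its checksum.

    Many individual files can be served this way from a single
    tarball, i.e. from a single inode on the storage system.
    """
    with tarfile.open(fileobj=io.BytesIO(tarball)) as tar:
        data = tar.extractfile(path).read()
    assert hashlib.sha256(data).hexdigest() == expected_sha256
    return data

tarball = build_tarball()
digest = hashlib.sha256(b"DICM...").hexdigest()
assert get_member(tarball, "visit_a/001.dcm", digest) == b"DICM..."
```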

A proof of principle for dataset generation (using rewritten ICF code) has been proposed at https://hub.trr379.de/q02/dicom-utilities. See the README for more detailed explanations (and the commit messages for even more detail).


  1. timestamps are normalized to ensure re-packing the same data does not change tarball checksums ↩︎

  2. uncurl is chosen because it allows rewriting URL patterns through configuration alone, e.g., should the base URL change ↩︎