Subsections of Data management
Identifiers
Identifiers are an essential component of the TRR379 research data management (RDM) approach. This is reflected in the visible organization of information on the consortium website, but also in the schemas that define the structure of metadata on TRR379 outputs.
Many systems for identifying particular types of entities have been developed. A well-known example is DOI for digital objects, most commonly used for publications. However, many others exist, like ROR for research organizations, or Cognitive Atlas for concepts, tasks, and phenotypes related to human cognition.
RDM in the TRR379 aims to employ and align with existing systems as much as possible to maximize interoperability with other efforts and solutions. However, no particular identifier system is required or exclusively adopted by TRR379.
Instead, anything and everything that is relevant for TRR379 has an identifier in a TRR379-specific namespace.
Identifier persistence
TRR379 RDM relies heavily on persistent identifiers. More or less anything and everything has, and must have, a persistent identifier. This key constraint makes it possible for multiple actors to contribute metadata on arbitrary aspects collaboratively and simultaneously, without having to wait and query for finished metadata records on associated entities.
TRR379 identifier namespace
TRR379 uses URIs as identifiers that map onto the structure of the main consortium website.
For example, the full TRR379 identifier for the spokesperson Ute Habel is https://trr379.de/contributors/ute-habel.
In this URI, https://trr379.de
is the unique TRR379-specific namespace prefix, contributors/ute-habel
is the TRR379-specific identifier for Ute Habel (where contributors
is a sub-namespace for agents that in some way contribute to the consortium).
Even though Ute Habel can also identified by the ORCID
0000-0003-0703-7722
, via the quasi-standard identifier system for researchers, this alternative identifier is considered an optional, alternative identifier rather than a requirement for TRR379 RDM.
The reasons for this approach are simplicity and flexibility.
An identifier in TRR379 RDM is a simple text label in a self-managed namespace. This self-managed namespace can cover any and all entity types that require identification within TRR379. In many cases, an identifier directly maps to a page on the main consortium website. This is a simple strategy to document the nature of any entity. It also establishes the main website as a central, straightforward instrument for communicating and deduplicating identifiers in a distributed research consortium.
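Because a TRR379 identifier is a plain HTTPS URI, it can in many cases simply be dereferenced like any other web address; assuming the corresponding page exists on the consortium website, a plain request should return it (illustration only):
# a TRR379 identifier is a resolvable URI; fetching it should yield the
# corresponding page on the consortium website
curl -I https://trr379.de/contributors/ute-habel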
Alignment with other identifiers
Even though any relevant entity can receive a TRR379-specific identifier with the approach described above, the utility of these identifiers is limited to TRR379-specific procedures and activities. However, a TRR379 metadata record on a research site (e.g., https://trr379.de/sites/aachen) can be annotated with any alternative identifier for the same entity (e.g., https://ror.org/04xfq0f34). This makes it possible to combine the benefits of a self-governed, project-specific identifier namespace with the superior discoverability and interoperability of established identification systems for particular entities.
Identifiers for particular entities
The additional documentation linked below provides more information on particular identifiers used by TRR379.
Subsections of Identifiers
Participants
This page is a more in-depth description of the rationale behind the SOP for participant identifiers used by TRR379.
Q01 participant identifiers
Q01 is the central recruitment project. Any participant included in the core TRR379 dataset is registered with Q01 and receives an identifier. This identifier is unique within TRR379 and stable across the entire lifespan of TRR379.
The dataset acquired by TRR379 is longitudinal in nature. Therefore participants need to be reliably identified and re-identified for follow-up visits. Because participants are not expected to remember their TRR379 identifier, it is necessary to store personal data on a participant for the time of their participation in data acquisition activities.
To avoid needlessly widespread distribution of this personal data, participant registration and personal data retention are done only at the site where a person participates in TRR379 data acquisitions. Each site:
- issues TRR379-specific participant identifiers that are unique and valid throughout the runtime of TRR379
- uses secure systems for this purpose, for example, existing patient information systems
- is responsible for linking all relevant information that is required for reporting and data analysis within TRR379 to the issued Q01 identifier, so that all data can be identified and delivered upon request (e.g., link to brain imaging facility subject ID).
The site-issued identifiers have a unique, site-specific prefix (e.g., a letter like A for Aachen), such that each site can self-organize its own identifier namespace without having to synchronize with all other sites to avoid duplication.
The identifiers must not have any other information encoded in them.
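As a minimal sketch of what this implies (purely hypothetical; sites actually issue and store identifiers with their own secure systems, such as patient information systems, and all names and paths below are made up), a compliant identifier is nothing more than a site prefix followed by an otherwise meaningless serial number:
# hypothetical sketch, not the official issuing procedure:
# draw the next sequential identifier for site prefix "A" from a counter file
SITE_PREFIX=A
COUNTER_FILE=/secure/location/q01_id_counter   # hypothetical path
NEXT=$(( $(cat "$COUNTER_FILE") + 1 ))
printf '%s%05d\n' "$SITE_PREFIX" "$NEXT"       # e.g., "A00042"
echo "$NEXT" > "$COUNTER_FILE"
An identifier like "A-f-1987" would not be compliant, because it encodes participant information (sex, year of birth) in the identifier itself.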
Responsible use and anonymization of identifiers
The TRR379 participant identifiers, as described above, are pseudonymous. Using only these TRR379-specific identifiers in any TRR379-specific communication and implementations is advised for compliance with the GDPR principles of necessity and proportionality of personal data handling. This includes, for example, data analysis scripts that can be expected to become part of a more widely accessible documentation or publication.
Any TRR379 site that issues identifiers is responsible for strictly separating personal data used for (re-)identifying a participant, such as health insurance ID, government ID card numbers, or name and date of birth. This information is linked to TRR379-specific identifiers in a dedicated mapping table. Access to this table is limited to specifically authorized personnel.
When a participant withdraws, or when a study’s data acquisition is completed, the mapping of the TRR379 identifier to personal identifying information is destroyed by removing the associated record (row) from the mapping table. At this point, the TRR379 identifier itself can be considered anonymous. Consequently, occurrences of such identifiers in any published or otherwise shared records or computer scripts need not be redacted.
The validity of the statement above critically depends on the identifier-issuing sites maintaining a strictly separate, confidential mapping of identifiers to personal identifying information, and not encoding participant-specific information into the identifier itself.
Participant identifiers in A/B/C projects
For each project or study that is covered by its own ethics documentation and approval, separate and dedicated participant identifiers are used that are different from a Q01-identifier for a person. This is done to enable such projects to fulfill their individual requirements regarding responsible use of personal data. In particular, it enables any individual project to share and publish data without enabling trivial, undesired, and unauthorized cross-referencing of data on an individual person, acquired in different studies.
These project-specific identifiers are managed and issued in the same way as described above.
- A project requests a project-specific identifier from the local site representative of Q01, by presenting personal identifying information.
- This information is matched to any existing Q01-identifier, and a project-specific identifier is created and/or reported.
- Any created project-specific identifier is linked to the Q01-identifier, using the same technical systems and procedures that also link other identifiers (e.g., patient information system record identifier).
Importantly, the mapping of the Q01-identifier and a project-specific identifier is typically not shared with the requesting project. This is done to prevent accidental and undesired co-occurrence of the two different identifiers in a way that enables unauthorized agents to reconstruct an identifier mapping that violates the boundaries of specific research ethics.
Special-purpose identifiers
Sometimes it is necessary to generate participant identifiers that are not compliant with the procedures and properties described above. For example, an external service provider may require particular information to be encoded in an identifier (e.g., sex, age, date of acquisition).
If this is the case, an additional identifier must be generated for that specific purpose. Its use must be limited in time and it must not be reused for other purposes.
Identifier generation and linkage to the standard Q01-participant identifiers is done using the procedure described for project-specific identifiers above.
Metadata editing
Metadata service
Subsections of Metadata service
Deployment
The metadata service is a small application built on FastAPI. It can be deployed in a virtual environment managed by Hatch, running under an unprivileged user account. This scenario is described here. However, any other deployment approach suitable for Python-based applications may work just as well.
Required software
The only software required outside the virtual environment (and the web server) is pipx, which is used to deploy hatch for a user; no administrator privileges are needed otherwise.
sudo apt install pipx --no-install-recommends
User account setup
Here we set up a dedicated user dumpthing to run the service. However, the service could also run under any other (existing) user account.
# new user, prohibit login, disable password
sudo adduser dumpthing --disabled-password --disabled-login
# allow this user to run processes while not logged in
sudo loginctl enable-linger dumpthing
# allow this user to execute systemd commands interactively.
# this needs XDG_RUNTIME_DIR defined.
# the demo below is for ZSH
sudo -u dumpthing -s
cd
echo 'export XDG_RUNTIME_DIR="/run/user/$UID"' >> ~/.zshrc
# put `hatch` in the PATH for convenience
echo 'export PATH="/home/dumpthing/.local/bin:$PATH"' >> ~/.zshrc
Service environment setup
Everything in this section is done under the target user account. Use something like sudo -u dumpthing -s to enter it.
The following commands perform the initial setup, which provides an installation of the dump-things-service to query and enrich the TRR379 knowledge base.
# install `hatch` to run the service in a virtual environment
pipx install hatch
# obtain the source code for the service
git clone https://hub.trr379.de/q02/dump-things-service.git
# obtain the dataset with the (curated) metadata to be served
# by the service
git clone https://hub.trr379.de/q02/trr379-knowledge.git curated_metadata
# set up a directory for receiving metadata submissions
# each subdirectory in it must match a "token" that needs to be
# presented to the service to make it accept a record posting.
mkdir token_realms
# the service expects a particular data organization.
# we opt to create a dedicated root directory for it,
# and symlink all necessary components into it
mkdir server_root
ln -s ../curated_metadata/metadata server_root/global_store
ln -s ../token_realms server_root/token_stores
# now we can test-launch the service
hatch run fastapi:run --port 17345 /home/dumpthing/server_root
If the service comes up with no error, we can ctrl-c it.
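While the test instance is still running, a quick request from a second shell can confirm that it responds; this assumes the service keeps FastAPI's default interactive documentation endpoint enabled:
# optional sanity check against the test instance
curl -sI http://localhost:17345/docs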
Service management with systemd
We use systemd for managing the service process, its launch, and logging. This makes it largely unnecessary to interact with hatch directly, and allows for treating the user-space service like any other service on the system.
The following service unit specification is all that is needed.
mkdir -p .config/systemd/user/
cat << EOT > .config/systemd/user/dumpthing.service
[Unit]
Description=DumpThing service (hatch environment)
Wants=network-online.target
After=network-online.target
[Service]
Type=simple
WorkingDirectory=/home/dumpthing/dump-things-service
ExecStart=/home/dumpthing/.local/bin/hatch run fastapi:run --port 17345 /home/dumpthing/server_root
[Install]
WantedBy=default.target
EOT
With this setup in place, we can control the service via systemd.
# launch the service
systemctl --user start dumpthing
# configure systemd to auto-launch the service in case of a
# system reboot
systemctl --user enable dumpthing.service
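The usual systemd tooling can also be used to inspect the running service and its logs:
# check the service state and follow its log output
systemctl --user status dumpthing
journalctl --user -u dumpthing.service -f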
Web server setup
Here we use caddy as a reverse proxy to expose the service via https at metadata.trr379.de.
# append the following configuration to the caddy config
cat << EOT >> /etc/caddy/Caddyfile
# dumpthings service endpoints
metadata.trr379.de {
reverse_proxy localhost:17345
}
EOT
A matching DNS setup must be configured separately.
Afterwards we can reload the web server configuration and have it expose the service.
# reload the webserver config to enable the reverse proxy setup
# (only necessary once)
sudo systemctl reload caddy
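Assuming DNS is in place, a quick request can confirm that the reverse proxy forwards to the service:
# verify that the service is reachable at its public address
curl -I https://metadata.trr379.de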
Updates and curation
Whenever there are updates to the to-be-served curated metadata, the setup described here only requires the equivalent of a git pull to fetch these updates from the “knowledge” repository.
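For example, run as the service user:
# fetch the latest curated metadata from the "knowledge" repository
git -C /home/dumpthing/curated_metadata pull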
When records are submitted, they end up in the directory matching the token that was used for submission. Until such records are integrated with the curated metadata in global_store, they are only available for service requests that use that particular token.
An independent workflow must be used to perform this curation (acceptance, correction, rejection) of submitted records.
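As a hypothetical sketch of one such curation step (the actual workflow, record format, and directory layout may differ), accepting a submitted record could amount to moving it from its token directory into the curated metadata dataset and committing the change:
# hypothetical curation sketch; token name and record path are placeholders
TOKEN=example-token
RECORD=some-record.yaml
mv "token_realms/$TOKEN/$RECORD" curated_metadata/metadata/
git -C curated_metadata add metadata/
git -C curated_metadata commit -m "Accept submitted record"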
NeuroBagel cohort data discovery
Subsections of NeuroBagel cohort data discovery
Deployment
NeuroBagel is a collection of containerized services that can be deployed in a variety of ways. This page describes a deployment using podman and podman-compose that is confirmed to be working on a machine with a basic Debian 12 installation.
For other installation methods, please refer to the NeuroBagel documentation.
The following instructions set up a “full-stack” NeuroBagel deployment. It contains all relevant components:
- query front-end
- federation API
- node API
- graph database
This setup is suitable for a self-contained deployment, such as the central TRR379 node. Other deployments may only need a subset of these services.
On the target machine, NeuroBagel services will run “rootless”. This means they operate under a dedicated user account with minimal privileges.
Required software
Only podman and its compose feature are needed. They can be installed via the system package manager.
sudo apt install podman podman-compose
User setup
We create a dedicated user neurobagel on the target machine. NeuroBagel will be deployed under this user account, and all software and data will be stored in its HOME directory.
# new user, prohibit login, disable password
sudo adduser neurobagel --disabled-password --disabled-login
# allow this user to run processes while not logged in
sudo loginctl enable-linger neurobagel
# allow this user to execute systemd commands interactively.
# this needs XDG_RUNTIME_DIR defined.
# the demo below is for ZSH
sudo -u neurobagel -s
cd
echo 'export XDG_RUNTIME_DIR="/run/user/$UID"' >> ~/.zshrc
exit
Configure NeuroBagel
In the HOME directory of the neurobagel user, we create the complete runtime environment for the service. All configuration is obtained from a Git repository.
# become the `neurobagel` user
sudo -u neurobagel -s
cd
# fetch the setup
git clone https://hub.trr379.de/q02/neurobagel-recipes recipes
# create the runtime directory
mkdir -p run/data/
mkdir -p run/secrets/
# copy over the demo data for testing (can be removed later)
cp recipes/data/* run/data
# generate passwords (using `pwgen` here, but could be any)
pwgen 20 1 > run/secrets/NB_GRAPH_ADMIN_PASSWORD.txt
pwgen 20 1 > run/secrets/NB_GRAPH_PASSWORD.txt
# configure the address of the "local" NeuroBagel
# node to query
cat << EOT > recipes/local_nb_nodes.json
[
{
"NodeName": "TRR379 central node",
"ApiURL": "https://nb-cnode.trr379.de"
}
]
EOT
Web server setup
NeuroBagel comprises a set of services that run on local ports routed to the respective containers. Here we use caddy as a reverse proxy to expose the necessary services via https at their canonical locations.
# append the following configuration to the caddy config
cat << EOT >> /etc/caddy/Caddyfile
# neurobagel query tool
nb-query.trr379.de {
reverse_proxy localhost:13700
}
# neurobagel apis
# graph db api (port 13701) is not exposed
nb-cnode.trr379.de {
reverse_proxy localhost:13702
}
nb-federation.trr379.de {
reverse_proxy localhost:13703
}
EOT
A matching DNS setup must be configured separately.
Manage NeuroBagel with systemd
We use systemd for managing the NeuroBagel service processes, the launch, and logging. This makes it largely unnecessary to interact with podman directly, and allows for treating the containerized NeuroBagel like any other system service.
The following service unit specification is all that is needed. With more recent versions of podman and podman-compose, better setups are possible. However, this one works with the stock versions that come with Debian 12 (podman 4.3.1 and podman-compose 1.0.3) and requires no custom installations.
mkdir -p .config/systemd/user/
cat << EOT > .config/systemd/user/neurobagel.service
[Unit]
Description=NeuroBagel rootless pod (podman-compose)
Wants=network-online.target
After=network-online.target
[Service]
Type=simple
WorkingDirectory=/home/neurobagel/recipes
EnvironmentFile=/home/neurobagel/recipes/trr379.env
ExecStart=/usr/bin/podman-compose -f ./docker-compose.yml up
ExecStop=/usr/bin/podman-compose -f ./docker-compose.yml down
[Install]
WantedBy=default.target
EOT
Launch
With this setup in place, we can launch NeuroBagel.
# reload the webserver config to enable the reverse proxy setup
# (only necessary once)
sudo systemctl reload caddy
# launch the service (pulls all images, loads data, etc).
# after a few minutes the service should be up
systemctl --user start neurobagel
# configure systemd to auto-launch NeuroBagel in case of a
# system reboot
systemctl --user enable neurobagel.service
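Afterwards, the deployment can be inspected with the usual tooling (run as the neurobagel user); assuming DNS and the reverse proxy are in place, the query front-end should also respond at its public address:
# check the service state and the running containers
systemctl --user status neurobagel
podman ps
# verify that the query tool is reachable through the reverse proxy
curl -I https://nb-query.trr379.de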