RDM Infrastructure

[This is a draft under discussion]

The planned RDM infrastructure is a federation of interoperable site infrastructures. The key design principle is that no primary data are aggregated to a central infrastructure.

We aim to establish an infrastructure that is suitable for use within the TRR379, but not limited to this scope. Once deployed, the associated services are usable beyond the scope of TRR379.

The following schema sketches the planned infrastructure. Components that hold (primary) data are depicted in yellow. Components that hold (mostly or exclusively) metadata are shown in blue. The direction of information flow is indicated by arrows, exchange of data by solid arrows, and metadata-only exchange by dotted arrows. Infrastructures that are only accessible to authorized agents are labeled with a “lock” symbol. Further details on individual components are provided below.

graph TB;
    subgraph Central services
        C1[("Collaboration portal<br>hub.trr379.de<br> 🔐")]:::meta
        C2{{"Data search<br>query.trr379.de<br> 🔐"}}:::meta
        C3(Main website<br>www.trr379.de):::meta
        C4("Data catalog<br>data.trr379.de"):::meta
    end
    subgraph "Aachen 🔐"
        A1[(hub)]:::data
        A1a[(lab1)]:::data
        A1b[(lab2)]:::data
        A2{{query}}:::meta
    end
    subgraph "Frankfurt 🔐"
        F1[(hub)]:::data
        F2{{query}}:::meta
    end
    subgraph "Heidelberg 🔐"
        H1[(hub)]:::data
        H1a[(ZI-hub)]:::data
        H1b[(lab1)]:::data
        H2{{query}}:::meta
    end
    A1 -.-> C1
    A1a -.-> A1
    A1b <---> A1
    A1 -.-> A2
    A2 <-.-> C2
    F1 -.-> C1
    F2 <-.-> C2
    F1 -.-> F2
    C3 <-.-> C1
    C3 -.-> C2
    H1 -.-> C1
    H1a <-.-> C1
    H1a <-.-> H1
    H1b <---> H1
    H1 -.-> H2
    H2 <-.-> C2
    C1 -.-> C2
    C1 -.-> C4
    C3 -.-> C4
    H1b <---> A1a
    %% node links to actual services
    click C1 href "https://hub.trr379.de"
    click C3 href "https://www.trr379.de"
    %% classes to distinguish data and metadata nodes
    classDef data fill:#ffa200,color:#000
    classDef meta fill:#5C99C8,color:#000
    %% invisible link purely for manipulating the grouping
    C2 ~~~ A1b
    C4 ~~~ A1b
    C2 ~~~ F1
    C2 ~~~ H1b

Central services

All central services are metadata-focused. No primary data acquired at participating sites are aggregated into central databases or storage.

Collaboration portal (hub.trr379.de)

This is the main hub for collecting actionable links to all TRR379 resources and information. The software solution for this hub is Forgejo-aneksajo. It is a free and open-source software package, and the direct service counterpart of DataLad, the main tool proposed for implementing reproducible research workflows in TRR379 labs.

The hub will store DataLad datasets that reference all TRR379 data resources without hosting any actual data. Instead, the DataLad datasets point to the individual institutional data stores, or to community data repositories when and where data have been published.
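As a minimal sketch of this "reference, do not host" principle, the following Python snippet models a hub-side entry that records only where a file lives and how to verify it after retrieval. The class, function, and field names as well as the example location are hypothetical and chosen for illustration; they are not part of DataLad or the hub software, which track this information through their own mechanisms.

```python
from dataclasses import dataclass
import hashlib


@dataclass(frozen=True)
class FileReference:
    """Hypothetical hub-side record: a pointer to data, never the data itself."""
    path: str        # path of the file within the dataset
    location: str    # URL of the institutional store or community repository
    sha256: str      # checksum used to verify content after retrieval
    size_bytes: int  # size of the referenced content


def make_reference(path: str, location: str, content: bytes) -> FileReference:
    """Build a reference at the site that holds the content.

    Only the checksum and size are recorded; `content` itself is not stored.
    """
    digest = hashlib.sha256(content).hexdigest()
    return FileReference(path, location, digest, len(content))


def verify(ref: FileReference, retrieved: bytes) -> bool:
    """Check that content fetched from `ref.location` matches the reference."""
    return (len(retrieved) == ref.size_bytes
            and hashlib.sha256(retrieved).hexdigest() == ref.sha256)
```

An authorized member would resolve `location` against the institutional store, fetch the content there, and call `verify` on the result; the hub itself never sees the bytes.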

The hub is also a place to deposit (shared) computational environments, and implementations of (shared) data processing pipelines, software publications, and source code repositories under a uniform TRR379 umbrella.

A test site is deployed at https://hub.trr379.de and is being evaluated.

Main website

See this page for a description of the main website. Importantly, the website renders essential metadata for the TRR379 (contributors, roles, publications, projects, research topics, etc.). It provides a unique URI for each such entity, to be used as an identifier in all TRR379 (meta)data resources.
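To illustrate what using these URIs as identifiers could look like, the sketch below checks a metadata record for identifier values that are not URIs on the main website. It assumes that entity URIs live under the host www.trr379.de; the record layout, the field names `project` and `contributors`, and the URI paths are made up for this example.

```python
from urllib.parse import urlparse

# Assumption: entity URIs are served by the main website's host.
TRR379_HOST = "www.trr379.de"


def is_trr379_uri(identifier: str) -> bool:
    """Return True if `identifier` is an https URI with a path on the main website."""
    parts = urlparse(identifier)
    return (parts.scheme == "https"
            and parts.netloc == TRR379_HOST
            and parts.path not in ("", "/"))


def unresolved_identifiers(record: dict) -> list[str]:
    """List identifier values in a record that are not TRR379 URIs."""
    ids = list(record.get("contributors", []))
    if "project" in record:
        ids.append(record["project"])
    return [i for i in ids if not is_trr379_uri(i)]


# Hypothetical record; the URI paths are invented for illustration.
record = {
    "title": "Example dataset",
    "project": "https://www.trr379.de/projects/example-project",
    "contributors": [
        "https://www.trr379.de/contributors/example-person",
        "Jane Doe",  # free-text name instead of a URI
    ],
}

print(unresolved_identifiers(record))
```

Referencing entities by URI rather than by free-text name keeps records from different sites comparable, because two records about the same person or project carry the same identifier.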

The website is programmatically generated from a repository hosted on the TRR379 hub, to enable contributions by all TRR379 members.

Data catalog

The TRR379 data catalog is a website dedicated to providing a uniform (read-only) view on the TRR’s data resources. It is rendered programmatically by DataLad Catalog from metadata on TRR379 data resources hosted on the TRR379 hub.

This site is indexed by specialized search engines like Google’s dataset search and is a key enabler for the general findability of TRR379 resources.

Data search

This is a federated data discovery service that is tailored to the cohort dataset acquired by TRR379 as a whole. It will enable the discovery of individual data records matching a given set of criteria, regardless of the contributing TRR379 site.

The service is federated. Each site runs its own instance and has sole authority over which metadata are shared with other TRR379 sites. Only these metadata properties will be accessible to TRR379 at large, while more detailed metadata records can be used for in-house queries.
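The division between shared and in-house metadata can be sketched as follows. This is a simplified illustration under stated assumptions, not the NeuroBagel API: the `SiteNode` class, its methods, and the `federated_search` function are hypothetical names, and records are plain dictionaries matched by equality.

```python
from dataclasses import dataclass, field


@dataclass
class SiteNode:
    """Hypothetical stand-in for one site's query instance."""
    name: str
    records: list[dict]  # full metadata records, kept at the site
    shared_properties: set[str] = field(default_factory=set)  # site-controlled allowlist

    def query_local(self, criteria: dict) -> list[dict]:
        """In-house query: may match on and return every property."""
        return [r for r in self.records
                if all(r.get(k) == v for k, v in criteria.items())]

    def query_federated(self, criteria: dict) -> list[dict]:
        """Query from another site: only shared properties are visible."""
        if not set(criteria) <= self.shared_properties:
            return []  # criteria on unshared properties are not answered
        return [{k: v for k, v in r.items() if k in self.shared_properties}
                for r in self.query_local(criteria)]


def federated_search(sites: list[SiteNode], criteria: dict) -> list[dict]:
    """Send the query to every site and merge the filtered answers."""
    return [dict(hit, site=s.name)
            for s in sites
            for hit in s.query_federated(criteria)]
```

Filtering happens inside each site's instance, so the decision about what to expose stays with the site. A query on a property a site has not shared returns nothing from that site, which also avoids leaking information through the mere presence of a match.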

The proposed solution for the query service is a version of NeuroBagel adapted to the data nature and needs of TRR379.

Site infrastructure

Sites are free to implement any RDM solution, as long as the resulting infrastructure provides

(programmatically) queryable metadata of a previously agreed upon nature
(programmatically) accessible data to any authorized members of TRR379

with the aim to enable reproducible research from primary data to published results within and across TRR379.

Q02 supports sites with software solutions that facilitate interoperability within TRR379. This includes the local deployment of the software systems used to run the central services.

We aim for individual sites to run their own data hubs, using the same software solution as the central https://hub.trr379.de. In contrast to the central hub, these institutional hubs can directly use the storage features of Forgejo-aneksajo and host arbitrary amounts of data. Sites can communicate data availability to the rest of TRR379 using a federation protocol.

Plans
