Design
The DM: -
Has an API
Provides concepts such as projects, datasets, files, applications and jobs
Expects users to be authenticated against a Keycloak server
Authorises activities using the services of an Account Server
The DM is exposed using a REST API definition: -
That can be used from outside the deployment cluster and is defined in the file
app/openapi/openapi.yaml
The OpenAPI file is well documented so please refer to it directly for the current up-to-date API endpoint and content definitions.
The API
The API is used by the Data Manager UI to allow users to manage Datasets, Projects, Applications and Jobs. It can be used by anyone with a suitable Keycloak role.
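As a purely illustrative sketch (not taken from the OpenAPI file), a caller might obtain a Keycloak access token and present it to the API as a bearer token. The realm, client id and the /project path below are assumptions; consult app/openapi/openapi.yaml for the actual endpoints: -

```python
import requests

# Illustrative values - the Keycloak realm, client id and API host are assumptions.
KEYCLOAK_TOKEN_URL = (
    "https://example.com/auth/realms/my-realm/protocol/openid-connect/token"
)
API_URL = "https://example.com/data-manager-api"

# Obtain an access token (resource-owner password grant, for illustration only).
token = requests.post(
    KEYCLOAK_TOKEN_URL,
    data={
        "grant_type": "password",
        "client_id": "data-manager-api",
        "username": "alice",
        "password": "secret",
    },
).json()["access_token"]

# Call the API with the bearer token. The '/project' path is an assumption;
# see app/openapi/openapi.yaml for the real endpoint definitions.
projects = requests.get(
    f"{API_URL}/project",
    headers={"Authorization": f"Bearer {token}"},
).json()
print(projects)
```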
- Datasets
Datasets and their associated metadata can be versioned and stored in the DM and shared with other users and projects. Datasets cannot be used or edited directly and instead either need to be downloaded or added to a project.
- Metadata
Metadata is textual material that can be versioned and belongs to a dataset and/or a specific dataset version.
- Projects
Projects are spaces where datasets (and files) can be stored and processed. Resultant files can be downloaded or converted into a new dataset (or dataset version). Projects can be shared with other users.
See Projects
- Files
Files are objects in a project. Files can be managed (i.e. are a copy of a corresponding dataset) or unmanaged (i.e. are files uploaded to a project directly or are the result of application or job activity).
- Users
Users can be added to datasets and projects as editors, which gives them the ability to modify the dataset or run applications or jobs in the corresponding project.
- Applications
Applications are generally long-running heavyweight instances that a user runs in a project space, usually providing their own service via a URL.
A typical application would be a Jupyter notebook.
Applications are made available within the data manager using a Kubernetes operator deployed by the DM administrator.
- Jobs
Jobs are generally short-lived instances that a user runs in order to process files in the Project.
Jobs are made available within the data manager using Job Definitions deployed by the DM administrator.
Jobs are executed using a Kubernetes operator deployed by the DM administrator.
- Instances
An Instance is the term for the run-time manifestation of an Application or Job.
- Tasks
Tasks are used to represent asynchronous API operations and the creation and removal of instances (Applications and Jobs).
The Containers (Pods)
The DM is realised using a number of independent Containers (Pods).
Apart from 3rd-party containers (like the PostgreSQL database and message
broker) all the Pods are formed from a single Data Manager container image,
whose behaviour is defined at run-time in the docker-entrypoint.sh
using the environment variable IMAGE_ROLE (sketched after the container list below).
The Data Manager image is used for the following Containers (described below): -
API
CTW
MON
PBC
KEW
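The single-image arrangement can be pictured with the following sketch. The real dispatch happens in docker-entrypoint.sh (a shell script); the role values and the commands they map to here are illustrative only: -

```python
import os

# Illustrative role-to-process mapping; the actual commands are defined in
# docker-entrypoint.sh, not here.
ROLE_COMMANDS = {
    "api": "gunicorn app:app",        # REST API (Flask)
    "ctw": "celery -A app worker",    # Celery Task Worker
    "mon": "python -m app.monitor",   # Monitor
    "pbc": "python -m app.pbc",       # Protocol Buffer Consumer
    "kew": "python -m app.kew",       # Kubernetes Event Watcher
}

role = os.environ.get("IMAGE_ROLE", "api").lower()
print(f"IMAGE_ROLE={role!r} -> would run: {ROLE_COMMANDS[role]}")
```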
Other images are used for the remaining containers: -
CMB
Database
Celery Message Broker (CMB)
The CMB
container is an instance of RabbitMQ, and is used as the
backend for Celery task execution and as the message bus for
internally-generated Data Manager messages whose payloads are based on
Protocol Buffers (v3).
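As a minimal sketch (hostnames, credentials and the task are assumptions), a Celery application wired to the CMB as its broker, with the Database as a result backend, might look like this: -

```python
from celery import Celery

# Broker (CMB/RabbitMQ) and result backend (Database/PostgreSQL) URLs are
# illustrative - the real values are injected by the deployment.
celery_app = Celery(
    "data-manager",
    broker="amqp://dm:password@cmb:5672//",
    backend="db+postgresql://dm:password@database:5432/datamanager",
)

@celery_app.task
def process_dataset_upload(dataset_uuid: str) -> str:
    # A hypothetical long-running task of the kind the API dispatches.
    return f"processed {dataset_uuid}"
```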
Database
The Database
container is an instance of a customised PostgreSQL server,
one that’s tuned to handle molecules. The database is used to store the
state of all the Data Manager objects and the molecules present in all the
Datasets.
API
The API
container runs as a Flask application that exposes the OpenAPI on the path
data-manager-api, with an interactive Swagger service at data-manager-api/api.
The API is the REST first responder, servicing all the REST requests. REST actions are either short, and handled synchronously within the API container, or time-consuming (like handling the upload of a Dataset file), in which case they are dispatched asynchronously to a worker container to run as a Task.
Asynchronous tasks are managed by Celery, which uses the CMB and Database
containers for communication and persistence.
All REST endpoints that are asynchronous return the Task (task_id) so the
caller can inspect the task status using the /task/{task_id} endpoint.
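A hedged sketch of the asynchronous pattern follows. The upload path, payload and the Task response fields are assumptions (check app/openapi/openapi.yaml); only the /task/{task_id} polling endpoint is taken from the text above: -

```python
import time
import requests

API_URL = "https://example.com/data-manager-api"          # illustrative
HEADERS = {"Authorization": "Bearer <access-token>"}      # illustrative

# Start an asynchronous operation (the path and payload are assumptions).
with open("molecules.sdf", "rb") as payload:
    response = requests.put(
        f"{API_URL}/dataset", headers=HEADERS, files={"file": payload}
    )
task_id = response.json()["task_id"]

# Poll the Task until it has finished ("done" is an assumed response field).
while True:
    task = requests.get(f"{API_URL}/task/{task_id}", headers=HEADERS).json()
    if task.get("done"):
        break
    time.sleep(2)
print(task)
```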
The API has root access to the Dataset and Project volumes.
Celery Task Worker (CTW)
The CTW
is a container that runs as a Celery Worker. The number of
Pods that are deployed will depend on the expected workload, each Pod
sharing the asynchronous tasks as they are received.
The CTW has root access to the Dataset and Project volumes.
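In effect each CTW Pod runs a Celery worker attached to the CMB; a minimal sketch (the broker URL and concurrency are illustrative) is: -

```python
from celery import Celery

# Each CTW Pod effectively runs a worker like this, consuming asynchronous
# tasks from the CMB. The broker URL and concurrency are illustrative.
celery_app = Celery("data-manager", broker="amqp://dm:password@cmb:5672//")

if __name__ == "__main__":
    celery_app.worker_main(["worker", "--loglevel=INFO", "--concurrency=4"])
```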
Monitor (MON)
The MON
is a container that runs a number of background tasks managed
by an APScheduler. The monitor is responsible for a number of roles: -
Collecting stats on project volume usage (for billing purposes)
Dispatching accumulated charges to the Account Server
Deleting projects
Pruning the API log
Miscellaneous housekeeping
The MON has root access to the Dataset and Project volumes.
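A minimal sketch of the scheduling pattern, assuming APScheduler's BlockingScheduler; the job functions, names and intervals are illustrative, not the actual MON tasks: -

```python
from apscheduler.schedulers.blocking import BlockingScheduler

def collect_volume_stats():
    """Collect project volume usage (e.g. for billing) - illustrative."""

def prune_api_log():
    """Remove old API log records - illustrative."""

scheduler = BlockingScheduler()
scheduler.add_job(collect_volume_stats, "interval", minutes=15)
scheduler.add_job(prune_api_log, "cron", hour=2)
scheduler.start()
```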
Kubernetes Event Watcher (KEW)
The KEW
container handles Kubernetes-generated events and is designed
(primarily) to watch the logs of Instances (Applications, but mainly Jobs)
in order to: -
Collect Event messages and store them against the corresponding Task
Collect Coin messages and create a Charge, which will be sent to the Account Server for billing by the MON container
Translate Instance terminations to a PodMessage() sent via the CMB to the PBC (described next)
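A minimal sketch of the kind of watch loop the KEW runs, using the official Kubernetes Python client; the namespace and label selector are assumptions: -

```python
from kubernetes import client, config, watch

config.load_incluster_config()
core = client.CoreV1Api()

# Watch Instance Pods; the namespace and label selector are illustrative.
w = watch.Watch()
for event in w.stream(core.list_namespaced_pod,
                      namespace="data-manager",
                      label_selector="purpose=instance"):
    pod = event["object"]
    if pod.status.phase in ("Succeeded", "Failed"):
        # The real KEW would emit a PodMessage() onto the CMB for the PBC here.
        print(f"{pod.metadata.name} terminated ({pod.status.phase})")
```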
Protocol Buffer Consumer (PBC)
The PBC
is the early form of our internal distributed message bus and is
used today to handle PodMessage() messages dispatched by the KEW.
The primary role of the PBC is to translate Kubernetes Pod (Instance)
termination phases (detected and transmitted by the KEW in the form of
protocol-buffer messages) into an appropriate Task State.
When an Instance ends the Task State will be one of SUCCESS or FAILED,
based on the Pod’s exit code.
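The translation itself is essentially a mapping from Pod outcome to Task state; a minimal sketch (the exact message fields are assumptions) is: -

```python
def task_state_from_exit_code(exit_code: int) -> str:
    """Map an Instance's Pod exit code to a final Task state."""
    return "SUCCESS" if exit_code == 0 else "FAILED"

# For example:
assert task_state_from_exit_code(0) == "SUCCESS"
assert task_state_from_exit_code(137) == "FAILED"
```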
Volumes
The DM relies on volumes to store files.
Dataset
The dataset volume, typically elastic (EFS), is used as a global store for Dataset versions and their associated metadata. Users don't have direct access to this volume; instead it is used as a persistent store and source of data that can be deposited in a Project as a File.
Datasets can be created by the user by uploading an external file or from a file they create in a Project.
To create datasets users need an Account Server Dataset Storage Subscription.
The user pays for dataset storage using Coins available in the corresponding Account Server Dataset Storage Subscription.
Project
The project volume, typically elastic (EFS), is used to manage projects created by users. Projects are places where users can upload Files (directly or from a pre-existing Dataset version) and run Applications and Jobs to process the project data.
To create projects users need an Account Server Project Tier Subscription.
The user pays for project storage and the execution of Jobs and Applications using Coins available in the corresponding Account Server Project Tier Subscription.