Design

[Diagram: im-dm-diagrams.002.png]

The DM: -

  • Has an API

  • Provides concepts such as projects, datasets, files, applications and jobs

  • Expects users to be authenticated against a Keycloak server

  • Authorises activities using the services of an Account Server

The DM is exposed using a REST API definition: -

  • That can be used from outside the deployment cluster and is defined in the file app/openapi/openapi.yaml

The OpenAPI file is well documented so please refer to it directly for the current up-to-date API endpoint and content definitions.

The API

The API is used by the Data Manager UI to allow users to manage Datasets, Projects, Applications and Jobs. It can be used by anyone with a suitable Keycloak role.
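
As a minimal sketch (in Python), a caller first obtains an access token from Keycloak and then presents it as a bearer token on each request. The Keycloak URL, realm, client and credentials, and the /project path, are illustrative assumptions; the real endpoint and payload definitions live in app/openapi/openapi.yaml.

  import requests

  # Obtain an access token from Keycloak (OAuth2 password grant).
  # The server URL, realm, client and credentials are illustrative values.
  token_response = requests.post(
      "https://example.org/auth/realms/my-realm/protocol/openid-connect/token",
      data={
          "grant_type": "password",
          "client_id": "data-manager-api",
          "username": "alice",
          "password": "secret",
      },
  )
  access_token = token_response.json()["access_token"]

  # Call a Data Manager endpoint with the bearer token.
  # '/project' is a hypothetical path - consult app/openapi/openapi.yaml
  # for the actual endpoints.
  response = requests.get(
      "https://example.org/data-manager-api/project",
      headers={"Authorization": f"Bearer {access_token}"},
  )
  print(response.status_code, response.json())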

Datasets

Datasets and their associated metadata can be versioned and stored in the DM and shared with other users and projects. Datasets cannot be used or edited directly; instead they either need to be downloaded or added to a project.

Metadata

Metadata is textual material that can be versioned and belongs to a dataset and/or a specific dataset version.

Projects

Projects are spaces where datasets (and files) can be stored and processed. Resultant files can be downloaded or converted into a new dataset (or dataset version). Projects can be shared with other users.

See Projects

Files

Files are objects in a project. Files can be managed (i.e. are a copy of a corresponding dataset) or unmanaged (i.e. are files uploaded to a project directly or are the result of application or job activity).

Users

Users can be added to datasets and projects as editors, which gives them the ability to modify the dataset or run applications or jobs in the corresponding project.

Applications

Applications are generally long-running heavyweight instances that a user runs in a project space, usually providing their own service via a URL.

A typical application would be a Jupyter notebook.

Applications are made available within the data manager using a Kubernetes operator deployed by the DM administrator.

Jobs

Jobs are generally short-lived instances that a user runs in order to process files in the Project.

Jobs are made available within the data manager using Job Definitions deployed by the DM administrator.

Jobs are executed using a Kubernetes operator deployed by the DM administrator.

Instances

An Instance is the term for the run-time manifestation of an Application or Job.

Tasks

Tasks are used to represent asynchronous API operations and the creation and removal of instances (Applications and Jobs).

The Containers (Pods)

The DM is realised using a number of independent Containers (Pods). Apart from 3rd-party containers (like the PostgreSQL database and message broker) all the Pods are formed from a single Data Manager container image, whose behaviour is defined at run-time in the docker-entrypoint.sh using the environment variable IMAGE_ROLE.

The Data Manager image is used for the following Containers (described below): -

  • API

  • CTW

  • MON

  • PBC

  • KEW

Other images are used for the remaining containers: -

  • CMB

  • Database

Celery Message Broker (CMB)

The CMB container is an instance of RabbitMQ, and is used as the backend for Celery task execution and as the message bus for internally-generated Data Manager messages whose payloads are based on Protocol Buffers (v3).

Database

The Database container is an instance of a customised PostgreSQL server, one that’s tuned to handle molecules. The database is used to store the state of all the Data Manager objects and molecules present in all the Datasets that are present.

API

The API container runs a Flask application that exposes the OpenAPI-defined REST API on the path data-manager-api, with an interactive Swagger service at data-manager-api/api.
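
One common way of serving an OpenAPI definition from a Flask application is via Connexion. The sketch below is an assumption made for illustration only, not necessarily how the API container is actually wired:

  # Illustrative only: serve the OpenAPI definition with Flask via Connexion.
  import connexion

  app = connexion.FlaskApp(__name__, specification_dir="app/openapi")

  # Expose the API under the 'data-manager-api' base path. Connexion also
  # serves an interactive UI beneath the base path by default.
  app.add_api("openapi.yaml", base_path="/data-manager-api")

  if __name__ == "__main__":
      app.run(port=8080)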

The API is the REST first responder, servicing all REST requests. Actions are either short and handled synchronously within the API container or, if they are time-consuming operations (like handling the upload of a Dataset file), dispatched asynchronously to a worker container to run as a Task.

Asynchronous tasks are managed by Celery, which uses the CMB and Database containers for communication and persistence.

All REST endpoints that are asynchronous return the Task (task_id) so the caller can inspect the task status using the /task/{task_id} endpoint.
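
A caller might poll such a Task as sketched below. The task_id value and the /task/{task_id} endpoint are described above, but the done field is an assumed name for the completion flag; see the OpenAPI file for the real response schema.

  import time

  import requests

  def wait_for_task(api_base, token, task_id, poll_seconds=5):
      """Poll /task/{task_id} until the Task reports completion."""
      headers = {"Authorization": f"Bearer {token}"}
      while True:
          task = requests.get(f"{api_base}/task/{task_id}",
                              headers=headers).json()
          if task.get("done"):  # assumed completion flag
              return task
          time.sleep(poll_seconds)

  # Usage (illustrative values):
  # result = wait_for_task("https://example.org/data-manager-api",
  #                        access_token, task_id)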

The API has root access to the Dataset and Project volumes.

Celery Task Worker (CTW)

The CTW is a container that runs as a Celery Worker. The number of Pods that are deployed will depend on the expected workload, each Pod sharing the asynchronous tasks as they are received.

The CTW has root access to the Dataset and Project volumes.
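
As a rough illustration of the dispatch path, the sketch below defines a Celery application that uses the CMB (RabbitMQ) as its broker and the Database (PostgreSQL) as its result backend, together with a hypothetical task that the API container could dispatch and a CTW Pod would then execute. The URLs, task name and worker command are illustrative assumptions.

  from celery import Celery

  # Broker (the CMB) and result backend (the Database) URLs are illustrative;
  # real values are deployment-specific.
  celery_app = Celery(
      "data-manager",
      broker="amqp://user:password@cmb:5672//",
      backend="db+postgresql://user:password@database/datamanager",
  )

  @celery_app.task
  def process_dataset_upload(dataset_path):
      """A hypothetical long-running task, e.g. handling a Dataset upload."""
      ...

  # The API container dispatches the work and returns the task id to the
  # caller; CTW Pods (started with something like 'celery -A tasks worker')
  # consume and execute the task.
  # result = process_dataset_upload.delay("/incoming/upload.sdf")
  # print(result.id)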

Monitor (MON)

The MON is a container that runs a number of background tasks managed by an APScheduler. The monitor is responsible for a number of roles: -

  • Collecting stats on project volume usage (for billing purposes)

  • Dispatching accumulated charges to the Account Server

  • Deleting projects

  • Pruning the API log

  • Miscellaneous housekeeping

The MON has root access to the Dataset and Project volumes.
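
A minimal APScheduler sketch of how such background roles might be scheduled is shown below; the job functions and intervals are illustrative, not the MON's actual configuration.

  from apscheduler.schedulers.blocking import BlockingScheduler

  scheduler = BlockingScheduler()

  # Hypothetical housekeeping jobs; the real MON roles are listed above.
  def collect_volume_stats():
      ...  # collect project volume usage for billing

  def prune_api_log():
      ...  # remove old API log records

  scheduler.add_job(collect_volume_stats, "interval", minutes=15)
  scheduler.add_job(prune_api_log, "cron", hour=2)

  scheduler.start()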

Kubernetes Event Watcher (KEW)

The KEW container handles Kubernetes-generated events and is designed (primarily) to watch the logs of Instances (Applications, but mainly Jobs) in order to: -

  • Collect Event messages and store them against the corresponding Task

  • Collect Coin messages and create a Charge, which will be sent to the Account Server for billing by the MON container.

  • Translate Instance terminations to a PodMessage() sent via the CMB to the PBC (described next).
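
A sketch of how a watcher like the KEW might observe Pod events with the official Kubernetes Python client is shown below; the namespace and the handling logic are illustrative assumptions, not the DM's actual implementation.

  from kubernetes import client, config, watch

  # Running inside the cluster, so use the in-cluster configuration.
  config.load_incluster_config()
  core = client.CoreV1Api()

  w = watch.Watch()
  for event in w.stream(core.list_namespaced_pod, namespace="data-manager"):
      pod = event["object"]
      phase = pod.status.phase
      # On termination a PodMessage would be published to the CMB for the
      # PBC to consume (omitted here).
      if phase in ("Succeeded", "Failed"):
          print(pod.metadata.name, phase)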

Protocol Buffer Consumer (PBC)

The PBC is an early form of our internal distributed message bus and is used today to handle PodMessage() messages dispatched by the KEW.

The primary role of the PBC is to translate Kubernetes Pod (Instance) termination phases (detected and transmitted by the KEW in the form of protocol-buffer messages) into an appropriate Task State.

When an Instance ends the Task State will be one of SUCCESS or FAILED based on the Pod’s exit code.
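
A minimal sketch of that mapping, using the state names described above:

  def task_state_from_exit_code(exit_code: int) -> str:
      """Map a Pod's exit code to the resulting Task State."""
      return "SUCCESS" if exit_code == 0 else "FAILED"

  assert task_state_from_exit_code(0) == "SUCCESS"
  assert task_state_from_exit_code(1) == "FAILED"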

Volumes

The DM relies on volumes to store files.

Dataset

The dataset volume, typically elastic (EFS), is used as a global store for Dataset versions and their associated meta-data. Users don't have direct access to this volume; instead it is used as a persistent store and source of data that can be deposited in a Project as a File.

Datasets can be created by the user by uploading an external file or from a file they create in a Project.

To create datasets users need an Account Server Dataset Storage Subscription.

The user pays for dataset storage using Coins available in the corresponding Account Server Dataset Storage Subscription.

Project

The project volume, typically elastic (EFS), is used to manage projects created by users. Projects are places where users can upload Files (directly or from a pre-existing Dataset version) and run Applications and Jobs to process the project data.

To create projects users need an Account Server Project Tier Subscription.

The user pays for project storage and the execution of Jobs and Applications using Coins available in the corresponding Account Server Project Tier Subscription.