Creating new Jobs

This guide describes the process for creating new tools, and how to allow them to be run in Squonk2 as Jobs.

Tools are command line utilities that typically read some input and write some output. To run as a job in Squonk2 the tool needs to be packaged up into a container image and given a job definition.

Creating a repository

Our data-manager-job-template repository is a GitHub template repository that contains a functional foundation that you can fork in order to quickly create your own jobs. It contains documentation, a dockerfile, a simple working job implementation, some unit tests and employs GitHub’s continuous integration (Actions) so that your container image can be built and tagged automatically.

Take a look and read GitHub’s notes relating to template repositories.

You can also inspect some of our active job repositories to see more comprehensive examples.

Creating a base environment

We assume here that you are creating your tool with Python. You don’t have to use Python. Our squonk2-cdk repository contains Squonk2 jobs that are written in Java.

The easiest way to run your tools is to use an existing conda environment or, if none is appropriate, to create a new one. See the various environment-*.yaml files in our virtual-screening repository for some existing examples. Its environment-im-prep.yaml file, for instance, defines a conda environment called im-vs-prep that contains RDKit, OpenBabel and some other tools.

If you don’t understand Conda environment files you can read all about them in Conda’s Managing environments documentation.

Armed with a suitable environment file, create your conda environment with something like this: -

$ conda env create -f my-environment.yaml

If the environment file sets the environment name to my-env you can activate your environment with something like this: -

$ conda activate my-env

Creating the Python module

If you are using Python then we strongly suggest you use Python 3. Technically it doesn’t matter which version you use, as your job will run in its own isolated container environment in the Data Manager, but remember that Python 2 has been retired and is being removed from many distributions. That means it will not be improved anymore, even if someone finds a security problem in it.

We suggest you follow the patterns used in the modules in our virtual-screening repository, namely: -

  1. Use argparse for handling the command line options.

  2. Follow the conventions used in the virtual-screening modules for the naming of the command line options.

  3. Provide a Python function in your module that allows the function of the tool to be used from another module, e.g. have the main entrypoint parse the command line arguments, prepare anything that is necessary and then call a function that does the real work. That same function should be callable from another Python module or script.

  4. Keep the module relatively simple and concise. If your tool needs to do one thing followed by another thing then probably you should create two separate tools.

  5. Consider creating utility modules that you can reuse, e.g. like those in our rdkit_utils.py.

  6. Log information to STDOUT. See the Logging section below for details of a logging strategy that’s displayed to the user via the UI.
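
As a minimal sketch of points 1 to 3, the module below keeps the command line handling in main() and the real work in a separate, importable function. The tool itself (copying non-blank lines) and the names count_records and build_parser are made up for illustration, and the option names simply mirror the --input and --output style used in the job definition later in this guide: -

```python
import argparse
from pathlib import Path


def count_records(input_path: Path, output_path: Path) -> int:
    """Hypothetical 'real work': copy the non-blank lines and return how many were written.

    This function knows nothing about the command line, so another
    module or script can import and call it directly.
    """
    lines = [line for line in input_path.read_text().splitlines() if line.strip()]
    output_path.write_text("".join(f"{line}\n" for line in lines))
    return len(lines)


def build_parser() -> argparse.ArgumentParser:
    """Define the command line options in one place."""
    parser = argparse.ArgumentParser(description="Copy non-blank records (example tool)")
    parser.add_argument("-i", "--input", required=True, type=Path, help="Input file")
    parser.add_argument("-o", "--output", required=True, type=Path, help="Output file")
    return parser


def main(argv=None) -> int:
    """Entrypoint: parse the options, then delegate to the reusable function."""
    args = build_parser().parse_args(argv)
    count = count_records(args.input, args.output)
    print(f"Wrote {count} records to {args.output}")
    return 0
```

In a real module you would call main() from the usual if __name__ == "__main__" guard. Passing argv explicitly, as main() allows here, also makes the entrypoint easy to unit test.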

Output file permissions

Squonk Data Manager is strongly opinionated about file ownership and permissions. Files that are created by your job are expected to: -

  • Be owned by the user ID of the person who executed the job

  • Have group write permissions (the group effectively being all the editors of your project)

The first of these should happen automatically as all jobs run in a container as the appropriate user and group ID. The second will probably NOT happen automatically because most operating systems do not automatically allow users in the same group to have write-access to the files. So you need to do one of the following: -

  1. Update your job to do a chmod 664 operation (or equivalent) on all the files your Job creates. This ensures they are writeable by users in the same group. You may not want to do this, but it is an option.

  2. Add fix-permissions: true to the <job-name>.image section of your job’s definition. This will instruct the Data Manager to update the permissions of all the output files that are defined in the job’s variables.outputs section. The Data Manager does this once the job has completed successfully.

We plan to add other mechanisms in the future.

The Jote job tester should identify if the permissions are not correct.
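
If you choose the first option and your tool is written in Python, the chmod 664 step might look something like the sketch below. The function name make_group_writable is hypothetical, and you would pass it the paths of the files your own job has written: -

```python
import os
import stat


def make_group_writable(paths):
    """Set each file's mode to rw-rw-r--, the equivalent of `chmod 664`."""
    mode = (
        stat.S_IRUSR | stat.S_IWUSR      # owner: read + write
        | stat.S_IRGRP | stat.S_IWGRP    # group: read + write
        | stat.S_IROTH                   # others: read only
    )
    for path in paths:
        os.chmod(path, mode)
```

Call it once, after all the output files have been written and closed.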

Logging

If you want your tool to be a Squonk2 job then you should pay attention to logging Data Manager Event and Cost messages.

Events

Event messages are lines in STDOUT conforming to a particular pattern that the Data Manager job executor looks for and reports as Events that are shown in the job execution UI. For instance, the Data Manager considers STDOUT lines that look like this as an Event: -

2022-02-03T16:39:27+00:00 # INFO -EVENT- Hello World!

All the text after the -EVENT- field is considered the message to be reported. Not everything that is logged needs to be a Data Manager Event. You might want to log some more verbose log messages that won’t get reported through the Squonk2 Data Manager UI. Typically you only want a small number of significant (salient) events to be reported.

If your job runs for a long time you might want to emit an Event every few minutes to a) reassure the user the job is doing something and b) report progress.
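
If you want to see what is involved in producing such a line by hand, here is a small sketch. It simply copies the layout of the example line above (an ISO 8601 UTC timestamp, "# INFO", the -EVENT- field and then the message). The function names are hypothetical and this is not the Data Manager’s own code: -

```python
from datetime import datetime, timezone


def format_event(message, now=None):
    """Build one Event line in the layout of the example above."""
    now = now or datetime.now(timezone.utc)
    # Drop the microseconds so the timestamp looks like 2022-02-03T16:39:27+00:00
    timestamp = now.replace(microsecond=0).isoformat()
    return f"{timestamp} # INFO -EVENT- {message}"


def emit_event(message):
    """Write the Event to STDOUT, flushing so it is seen promptly."""
    print(format_event(message), flush=True)
```

A long-running job could call emit_event() every few minutes, or every N records, to report its progress.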

Billing (Costs)

Cost messages are lines in STDOUT conforming to a particular pattern that the Data Manager job executor looks for and uses to charge the user for the work done. Ultimately this could result in the user incurring charges and you being paid for the use of your tool, but you should write cost messages even if you do not want to monetise your work. You could assign a zero cost to your job, which at least allows usage of your job to be recorded.

A Cost message is written to STDOUT and looks like this: -

2022-02-03T16:40:16+00:00 # INFO -COST- 5.7 1

The two numbers after the -COST- field are: -

  1. The cost in some arbitrary units e.g. number of molecules processed, or whatever is an appropriate measure of the ‘cost’. The cost format is expected to be a Python Decimal. The value can be a whole number or a decimal.

  2. The sequence number of the cost event, this message being the first (1).

Note

Avoid the temptation to use floating point numbers for costing. Floating point representation suffers from well-documented precision issues. Instead use a language type that is more suitable for currency values (like Python’s Decimal).

Warning

The sequence number must be unique and an increasing value. The Data Manager uses the sequence number to de-duplicate cost lines. If your cost lines do not have a unique sequence number they may be interpreted more than once and an over-billing error will occur.

With cost messages it is usually best to log the cost event once processing is finished. In this case just one message needs to be logged. However, if your tool is likely to run for a long time it is better to log cost messages at regular intervals e.g. every 5 minutes, or every 100,000 molecules processed.
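
The sketch below shows one way to satisfy the Note and the Warning above when writing cost lines by hand: the cost is held as a Decimal and every line gets the next sequence number. The CostLogger class is hypothetical, and the line layout is copied from the example cost message shown earlier: -

```python
from datetime import datetime, timezone
from decimal import Decimal


class CostLogger:
    """Hypothetical helper that numbers cost lines 1, 2, 3, ... so each one is unique."""

    def __init__(self):
        self._sequence = 0

    def format_cost(self, cost, incremental=False, now=None):
        # Refuse floats so binary floating point never reaches the cost value.
        if isinstance(cost, float):
            raise TypeError("Pass the cost as a str, int or Decimal, not a float")
        value = Decimal(cost)
        self._sequence += 1
        now = now or datetime.now(timezone.utc)
        timestamp = now.replace(microsecond=0).isoformat()
        sign = "+" if incremental else ""
        return f"{timestamp} # INFO -COST- {sign}{value} {self._sequence}"

    def emit_cost(self, cost, incremental=False):
        """Write the cost line to STDOUT."""
        print(self.format_cost(cost, incremental=incremental), flush=True)
```

Use a single CostLogger instance for the whole run of the job, otherwise the sequence numbers will restart and lines may be treated as duplicates.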

The cost value (the first number) can either be incremental or absolute. Consider these two sets of messages: -

2022-02-03T16:40:16+00:00 # INFO -COST- 5.7 1
2022-02-03T16:40:16+00:00 # INFO -COST- 8.3 2

2022-02-03T16:40:16+00:00 # INFO -COST- +5.7 1
2022-02-03T16:40:16+00:00 # INFO -COST- +8.3 2

The first pair is absolute, the second incremental. The difference is that in the second pair each cost is prefixed with a +. In the first pair the final cost value is 8.3, in the second it is 14.0 (5.7 + 8.3). Using absolute costs is probably easier and better, but there are times when you might want to use incremental costs.
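
To make the arithmetic concrete, the sketch below totals a set of cost lines the way this section describes: an absolute value replaces the running total, a value prefixed with + is added to it, and a repeated sequence number is ignored. It is our illustration of the rules, with a hypothetical total_cost function, and not the Data Manager’s implementation: -

```python
from decimal import Decimal


def total_cost(lines):
    """Illustrative only: total the -COST- lines, ignoring repeated sequence numbers."""
    total = Decimal("0")
    seen = set()
    for line in lines:
        # Everything after the -COST- field is "<cost> <sequence number>"
        _, _, payload = line.partition("-COST-")
        value, sequence = payload.split()
        if sequence in seen:
            continue  # a duplicated line must not be charged twice
        seen.add(sequence)
        if value.startswith("+"):
            total += Decimal(value)  # incremental: add to the running total
        else:
            total = Decimal(value)   # absolute: replace the running total
    return total


absolute = [
    "2022-02-03T16:40:16+00:00 # INFO -COST- 5.7 1",
    "2022-02-03T16:40:16+00:00 # INFO -COST- 8.3 2",
]
incremental = [
    "2022-02-03T16:40:16+00:00 # INFO -COST- +5.7 1",
    "2022-02-03T16:40:16+00:00 # INFO -COST- +8.3 2",
]
print(total_cost(absolute), total_cost(incremental))
```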

If this all sounds a bit complex and you’re thinking of ignoring this then don’t! We have created a simple library in our PyPI data-manager-job-utilities package. It makes logging these messages from Python very simple. Modules in the virtual-screening repository contain lots of examples for how to use it.

For instance, the Virtual Screening minimize.py module emits event messages like this: -

DmLog.emit_event("Force field could not be set up for molecule", count)

and emits cost messages at regular intervals like this: -

if success % 10000 == 0:
    DmLog.emit_cost(success)

Creating the Dockerfile

To be runnable as a Squonk2 job, you must build a Docker container image that allows your tool to be run using the command that you will define in the job definition (see the next section).

If you are using our data-manager-job-template then you’ll already be set up with a Dockerfile and corresponding container image. Even if you’ve got your own repository, or maybe it’s a repository of legacy code, your best course of action is to move it into a fork of our template repository.

Ideally your repository should be responsible for one container image but if your Jobs need distinctly different container environments you can have more than one Dockerfile.

Look at the Dockerfile-* files in the virtual-screening repo as examples. You can even use a container image that contains conda as your base image and conda install the same packages as your conda environment (see Dockerfile-prep as an example).

Use continuous integration (GitHub Actions or GitLab CI) to build your container images automatically, cleanly and consistently. Do not rely on manually pushing the Docker image to a container repository.

Unless there’s good reason not to, you should use a public container registry such as DockerHub, but you can also use a private registry if necessary. If you do, contact us about how to specify the pull secrets that the Squonk2 Data Manager will need in order to pull your image.

Job definitions, collections and manifests

To get your tool to run in Squonk2 as a job you need to write a job definition (a YAML file). The job definition describes the job, the container image it’s in, its command-line parameters, the files it creates and also defines tests that can be run to verify its behaviour (more on that later). We’ll go through the details using examples shortly.

When jobs are executed they are uniquely identified using a collection name, the name of the job and the version of the job. All this information is provided by the job definition file.

A job definition file defines a collection of jobs. The collection is essentially a way of grouping similar jobs together, providing a namespace for your jobs. All jobs in a given job definition file belong to the same collection. The collection can be used in multiple job definition files. How you split your jobs between files is up to you. If you have a small number of jobs you might put them in one file. You might use multiple files if you want to define a large number of jobs and you want to limit the size of individual files for readability or maintenance. Alternatively you might use multiple files if you want to separate them for architectural reasons or you want to deploy one set to one customer and another set to another.

To avoid job-name clashes with jobs in collections from 3rd parties the collection should contain something that is unique to you or your company. We prefix most of our collections with im-. This way we can safely create a min-max job that will not be confused with a min-max from another collection.

Collection and job names are limited to 80 characters.

The manifest file

The job manifest is simple: it is a YAML file that lists all the job definition files for a deployment. When you deploy jobs to the Data Manager you provide a manifest file.

Manifest files live in your repository’s data-manager directory. The default file is called manifest.yaml but you can call yours whatever you want, and you can have more than one.

A typical manifest file looks like this: -

---
kind: DataManagerManifest
kind-version: '2021.1'

job-definition-files:
- virtual-screening.yaml
- rdkit.yaml
- xchem.yaml

This one identifies 3 job definition YAML files, all of which must also live in the data-manager directory.

The job definition file

The job definition YAML file is more complex. It’s best understood by looking at an example, for instance our rdkit.yaml defines a number of jobs that use RDKit.

The file can define one or more jobs. The file and job have these key sections: -

collection: rdkit

This is a top level property and all jobs in this file belong to this collection. The collection is supposed to define the source of the job. Jobs in multiple files can belong to the same collection.

All following sections are job specific.

category: comp chem

This defines the category of the job. A category is a functional description of the type of job e.g. “comp chem” and is supposed to reflect a classification that makes sense to a user. Jobs from multiple collections can belong to the same category. Please don’t create new categories without a good reason. A list of current categories can be found at JobCategories.

The collection and the category can be used for filtering jobs in the Squonk2 UI.

doc-url: rdkit/similarity-screen.md

The doc-url is an optional property that defines where user documentation for the job can be found.

We recommend you follow this pattern for documenting your jobs.

  • Create a /data-manager/docs/<category> directory in your repo where <category> is the name of the category your job belongs to (see above)

  • Create a file named <job-name>.md where <job-name> is the name of your job (the top level key for your job in the YAML file). If you create multiple versions of your job keep the documentation for all versions in that one file (e.g. have a change log)

  • Populate that file with Markdown documentation for your job (or something else that will be handled nicely by your repository’s web interface), preferably following the patterns you can see in the Virtual-screening repo.

If you follow this pattern then you do not need to define the doc-url property in your job definition, it will be generated automatically using your collection and job names.

If instead you want your docs to be in a file that does not follow that pattern then you can (e.g. if you have related jobs that are best documented in a single file) define a path that is relative to /data-manager/docs (see above where rdkit/similarity-screen.md is documentation for multiple jobs). And if you have documentation that resides elsewhere you can provide the fully qualified URL to that page as the value for the doc-url property.

The virtual-screening docs are at https://github.com/InformaticsMatters/virtual-screening/tree/main/data-manager/docs

image:
  name: informaticsmatters/vs-prep
  tag: 'latest'
  project-directory: /data
  working-directory: /data

This defines the container image your repository builds. It also defines the path within the container where the Data Manager project will be mounted, and the directory that will be used as the execution directory for the job.

command: >-
  /code/max_min_picker.py --input '{{ inputFile }}'
  {% if seeds is defined %}--seeds{% for file in seeds %} '{{ file }}'{% endfor %}{% endif %}
  --output '{{ outputFile }}'
  --count {{ count }}
  {% if threshold is defined %}--threshold {{ threshold }}{% endif %}
  --interval 10000

This defines the command that is executed. It uses Jinja templating to fill in the values of the inputs and options that we’ll see next. The filled-in template is used as the command that is executed when the job is run in the Squonk2 Data Manager in the Kubernetes cluster. Think of it as being the <command> bit when running with docker: -

docker run -it yourorg/yourcontainer <command>

The Data Manager uses jinja2 v3.0

Note

To guard against injection attacks, put substituted strings in single quotes so that a value supplied by the user cannot be interpreted as extra command-line arguments.
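
The sketch below illustrates why the quotes matter. It uses Python’s shlex.split(), which follows POSIX shell word-splitting rules, as a stand-in for however the rendered command is eventually split into arguments (that equivalence is an assumption on our part). The hostile value is made up: -

```python
import shlex

# A hostile (hypothetical) value that a user might supply for inputFile.
value = "mols.smi --output /etc/passwd"

unquoted = f"/code/max_min_picker.py --input {value}"
quoted = f"/code/max_min_picker.py --input '{value}'"

# Without quotes the value is split up and smuggles in an extra --output option.
unquoted_args = shlex.split(unquoted)
# With quotes the whole value stays a single (harmless, if invalid) file name.
quoted_args = shlex.split(quoted)

print(unquoted_args)
print(quoted_args)
```

Single quotes alone do not help if the value itself contains a single quote, which is one reason to validate string options with a restrictive pattern as well.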

inputs:
  type: object
  required:
  - inputFile
  properties:
    inputFile:
      title: Molecules to pick from
      mime-types:
      - squonk/x-smiles
      type: file
    seeds:
      title: Molecules that are already picked
      mime-types:
      - squonk/x-smiles
      type: file
      multiple: true

This defines the files that are the inputs to your tool. In this case there are two inputs, with seeds being optional.

outputs:
  type: object
  properties:
    outputFile:
      title: Output file
      mime-types:
      - chemical/x-csv
      creates: '{{ outputFile }}'
      type: file

This defines the outputs of your tool - the files that are created.

options:
  type: object
  required:
  - count
  properties:
    outputFile:
      title: Output file name
      type: string
      pattern: "^[A-Za-z0-9_/\\.\\-]+$"
      default: diverse.smi
    count:
      title: Number of molecules to pick
      type: integer
      minimum: 1
    threshold:
      title: Similarity threshold
      type: number
      minimum: 0
      maximum: 1

This defines the user definable options for your job. This uses JSON schema notation with the user interface for the job executor in Squonk2 being automatically generated from this. Try to include as much validation as possible (especially with string options) to prevent hacking attempts.
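
When writing a pattern for a string option it can help to try it out on a few values first. This sketch checks the outputFile pattern from the example above using Python’s re module. The helper name is hypothetical and the candidate values are invented: -

```python
import re

# The outputFile pattern from the options above, after YAML has unescaped the backslashes.
OUTPUT_FILE_PATTERN = re.compile(r"^[A-Za-z0-9_/\.\-]+$")


def is_valid_output_name(name):
    """Return True if the name contains only the characters the pattern allows."""
    return OUTPUT_FILE_PATTERN.search(name) is not None


for candidate in ["diverse.smi", "results/out-1.smi", "out.smi; rm -rf /", "it's.smi"]:
    print(f"{candidate!r}: {is_valid_output_name(candidate)}")
```

Note that this is only an approximation of JSON schema pattern matching: in Python the $ anchor also matches just before a trailing newline, so treat the result as a quick sanity check rather than a guarantee.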

This is not an exhaustive list of the sections, but covers the key aspects. Look at the other job definitions in the virtual-screening repository for more details, or contact us if you need more info.

Validating with the job tester

Writing the job definition is tricky and can be subject to silly typo or formatting errors. To assist with this we have created the im-jote PyPI package. Jote is our JOb TEster.

Jote lets you test a job definition. It can:

  • Validate the YAML (and our YAML formatting rules are quite strict)

  • Perform some basic sanity checks on your repository

  • Execute the tests defined in the job definition using Docker in a way that is very similar to how it will execute in Kubernetes so that you can test that it runs correctly

As an example, here is a test definition. It lives inside the job definition in the job definition YAML file. It comes from the same RDKit job example used above which includes additional comments explaining the meaning of the elements: -

tests:
  simple-execution:
    inputs:
      inputFile: data/mols.smi
    options:
      outputFile: diverse.smi
      count: 100
    checks:
      exitCode: 0
      outputs:
      - name: diverse.smi
        checks:
        - exists: true
        - lineCount: 100

You will notice that each test defines some inputs and options that are needed by the job and defines some basic checks on the outputs that should be created by running the job. The inputs and options are used to generate the command (see above) and then that command is run in docker to generate the outputs which are then checked for validity.
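
To give a feel for what the exists and lineCount checks in this test are asserting, here is a rough Python equivalent. It is our own illustration with a hypothetical function name, and we are assuming that lineCount simply counts the lines in the file. It is not Jote’s implementation: -

```python
from pathlib import Path


def check_output(path, expected_line_count):
    """Return True if the file exists and has exactly the expected number of lines."""
    path = Path(path)
    if not path.is_file():
        return False  # the 'exists: true' check fails
    with path.open() as handle:
        return sum(1 for _ in handle) == expected_line_count
```

For the test above the equivalent call would be check_output("diverse.smi", 100), run in the directory where the job wrote its output.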

Jote is run against a particular manifest file, and you can restrict it to jobs from a particular collection or even a particular job. When jote is installed just use jote --help to see all the options.

For instance, to run the test we have been looking at in our virtual-screening repository we run jote selecting the manifest file, collection and job name, and you’ll see a response a little like this: -

$ jote -m manifest-virtual-screening.yaml -c rdkit -j max-min-picker
# Using manifest "manifest-virtual-screening.yaml"
# Found 10 tests
# Limiting to Collection "rdkit"
# Limiting to Job "max-min-picker"
  ---
+ collection=rdkit job=max-min-picker test=simple-execution
> run-level=Undefined
> image=informaticsmatters/vs-prep:latest
> command="/code/max_min_picker.py --input 'mols.smi'  --output 'diverse.smi' --count 100  --interval 10000"
# Creating test environment...
# docker-compose (1.29.2, build unknown)
# Created
# path=/data/github/im/virtual-screening/data-manager/jote/rdkit.max-min-picker.simple-execution
input_files=['data/mols.smi']
# Copying inputs (from "${PWD}/data")...
# + data/mols.smi
# Copied
# Executing the test ("docker-compose up")...
# Executed (exit code 0)
# Checking...
# - diverse.smi
#   exists (True) [OK]
#   lineCount (100) [OK]
# Checked
# Deleting the test...
# Deleted
  ---
Done (OK) passed=1 skipped=0 ignored=0 failed=0

Deploying to Squonk2

Job definitions are loaded into the Squonk2 Data Manager using the manifest file mentioned above. This allows us to provide a level of granularity of types of jobs.

You will need to contact the administrator of your Squonk2 instance who will use the /admin/job-manifest API endpoint to load your manifest. Also, if your manifest or the job definitions it includes changes, the administrator can use the /admin/job-manifest/load endpoint to reload the job manifests and the corresponding jobs.

Creating a Nextflow module

If you want to be able to incorporate your tool into a Nextflow workflow then the best approach is to create a Nextflow module using Nextflow’s DSL2. This allows your module to be easily incorporated into a workflow that uses other modules that we have created. This way your module can be used in multiple workflows.

Example modules can be found under nf-processes in our virtual-screening repository. Take a look at those and at the Nextflow documentation. Once you have your module you can look at the *.nf files in the same directory for examples of how to build a complete workflow out of modules.

To make your workflow executable in Squonk2 you need to create a job definition as described above. Again, look at the existing examples for inspiration.

Version Control

In order to ensure that jobs you deploy produce consistent results you must be strict with your version control strategy. This section provides “best practice” advice for how to manage your job definitions and container images.

There are three levels of version control.

The manifest URL

Jobs running in Squonk2 are identified by a combination of the collection, job, and version. All of this information is gathered from your Job definition files when an administrator loads them into Squonk2 using a URL for the Job Manifest. This is the first level of version control.

To ensure consistency with job execution this URL must be version controlled. We recommend that you use a specific TAG or RELEASE in your repository to identify the version of your job manifest. By using a TAG (or RELEASE) in your URL you will ensure that the same manifest file is always used.

The job definition

The Job Definition file is where the collection, jobs and job versions are defined.

If you change anything in the Job Definition file that might affect a Job’s execution (or even its documentation) you must alter the version number you have assigned to the Job. This is the second level of version control.

You might change values in a specific Job Definition that do not affect the underlying container image version. For example, you may be adding extra documentation or exposing additional (existing) commands or parameters. Any of these could be considered as altering the behaviour of the Job. Anything you do that can affect the Job execution must result in a new Job version number.

You can exclude adding new tests or test data, these on their own will not affect the Job’s behaviour.

When you change a Job version number (and have committed the changes, and the CI process has been successful) you must then apply a new repository TAG as described above. An administrator will then need to reload the Job Manifest in Squonk2 using the URL to your new tagged Job Manifest.

When the Data Manager reloads the definitions it will create new Job records for each new combination of collection, job and version.
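
One way to picture this is to treat each Job record as being keyed by its (collection, job, version) combination, as in the sketch below. The JobKey type, the records_to_create function and the version strings are all invented for the example, and this is a model of the behaviour described here rather than the Data Manager’s code: -

```python
from typing import NamedTuple


class JobKey(NamedTuple):
    collection: str
    job: str
    version: str


def records_to_create(existing, loaded):
    """Return the keys in the reloaded definitions that have no Job record yet."""
    return sorted(set(loaded) - set(existing))


# Version strings here are made up for the example.
existing = {JobKey("rdkit", "max-min-picker", "1.0.0")}
loaded = [
    JobKey("rdkit", "max-min-picker", "1.0.0"),  # unchanged: no new record
    JobKey("rdkit", "max-min-picker", "1.1.0"),  # new version: new record
]
print(records_to_create(existing, loaded))
```

This is why forgetting to change the version number matters: a changed definition with an unchanged key does not result in a new Job record.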

The Job container image

Finally, if you modify the container image in a way that might affect the Job execution you must publish a new container image with a new TAG. This is the third level of versioning.

When you change a container image version number you obviously modify its version value in the Job Definition file. Once you do this you should follow the steps above, e.g. commit the Job Definition, wait for CI success and then apply a new TAG (or RELEASE) to the repository.

Once this is all complete an administrator will need to load the Job definitions through the REST API as before.

When the Data Manager reloads the definitions it will create new Job records for each new combination of collection, job and version.

Can I delete old versions?

The simple answer is no. This is to ensure that a Job loaded into Squonk2 that someone might have used is always available to them in the future. This may change in future versions of Squonk2.

Can I rename the repository?

You can. Changing the repository name will require you to reload the job definitions using a new URL (and its new TAG or RELEASE value). If you have not changed the collection, job or job version numbers in the associated Job Manifest files any Jobs loaded into Squonk2 using the old repository will not change - if you had 10 Jobs before you will still have 10 Jobs, they will simply be controlled from a new Manifest URL.

You will need to ask the admin user to remove the Job Manifest record for the old repository.

Can I change the collection name?

You can, and you will need to follow the versioning strategy described above. Remember that any Jobs loaded using the old collection name will still be present in Squonk2.

Can I change a job name?

You can, and you will need to follow the versioning strategy described above. Remember that the original Job will still be present in Squonk2.

Replacing Jobs

You might want to replace a Job with another Job, while keeping the original Job for historical reasons. Rather than introducing a new version for the Job you may have consolidated its behaviour into a new Job implementation, or you may have introduced a new Job that duplicates the behaviour of an existing Job.

You can use the replaces property in the Job Definition file to indicate that a Job replaces others. The following example shows how the max-min-picker Job (which is part of the rdkit collection) replaces the max-picker and min-picker Jobs: -

---
kind: DataManagerJobDefinition
kind-version: '2021.1'
name: Virtual screening tools using RDKit
collection: rdkit

jobs:
  max-min-picker:
    replaces:
    - collection: rdkit
      job: max-picker
    - collection: rdkit
      job: min-picker