OBASK Architecture
The OBASK pipeline comprises four servers/services and four data pipelines:
- Pipeline servers:
  - Triplestore
  - Solr + preconfigured Solr core
  - Search API (`ontology-search`)
  - Neo4J production instance (`obask-kb`)
- Pipeline data pipelines:
  - Data collection (`pipeline-collectdata`)
  - Triplestore ingestion (`pipeline-updatetriplestore`)
  - Data transformation and dumps for production instances (`pipeline-dumps`)
  - KG production instance ingestion (`pipeline-updateprod`)
Server and data pipelines are combined into four general sub-pipelines, which are configured as Docker Compose services. This documentation describes all four sub-pipelines in detail, including the role each individual server and data pipeline plays.
Sub-pipeline: Deploy triplestore
- Summary: This pipeline deploys an empty triplestore, collects all relevant ontologies, and pre-processes and loads the collected data into the triplestore.
- Components:
  - Triplestore
  - `pipeline-collectdata` (data collection and preprocessing pipeline for all resources)
  - `pipeline-updatetriplestore` (loading collected data into the triplestore)
- Dependents: `pipeline-dumps`
Service: Triplestore
- Image: eclipse/rdf4j-workbench:3.7.7
- Summary: The triplestore is currently a plain, default deployment of rdf4j-server. We use a simple in-memory store, which is configured here. The container is maintained elsewhere (see the Docker Hub page of the image for details).
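For orientation, here is a minimal sketch of how a client could query the deployed rdf4j-server over its standard per-repository SPARQL endpoint; the host, port, and repository name (`obask`) are illustrative assumptions, not part of the pipeline configuration:

```python
import requests

# Assumed local deployment of rdf4j-server; adjust host/port/repository to your setup.
RDF4J_ENDPOINT = "http://localhost:8080/rdf4j-server/repositories/obask"

query = """
SELECT (COUNT(*) AS ?triples)
WHERE { ?s ?p ?o }
"""

# rdf4j-server exposes a standard SPARQL protocol endpoint for each repository.
response = requests.get(
    RDF4J_ENDPOINT,
    params={"query": query},
    headers={"Accept": "application/sparql-results+json"},
)
response.raise_for_status()
print(response.json()["results"]["bindings"][0]["triples"]["value"])
```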
Data pipeline: pipeline-collectdata
- Image: ghcr.io/obasktools/pipeline-collectdata
- Git: https://github.com/OBASKTools/pipeline-collectdata
- Dockerfile
- Summary: This container encapsulates a process that downloads a number of source ontologies and applies a number of ROBOT-based pre-processing steps, in particular extracting modules/slices of external ontologies, running consistency checks, and serialising as ttl for quicker ingest into the triplestore. It also contains the data embargo pipeline and has some provisions for SHACL validation.
Detailed notes on pipeline-collectdata
- The process is encoded here. It performs the following steps (sketched in code after this list):
  - Downloading external ontologies.
  - Removing embargoed data. The technique applied here is based on using ROBOT query and encoding the embargo logic as SPARQL queries (combined with `ROBOT remove`).
  - Ontologies in the project configuration are imported in their entirety.
  - Ontologies in the slice configuration are sliced. The slice corresponds to a BOTTOM module whose seed signature combines the signatures of all ontologies in the fullontologies section with the signature of the KB.
  - All ontologies are converted to Turtle format.
  - All ontologies ready to be imported into the triplestore are gzipped.
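As an illustration only, here is a minimal sketch of the kind of steps described above, assuming a `robot` CLI on the PATH and hypothetical file names (`embargo.ru`, `terms.txt`); it is not the actual pipeline script:

```python
import gzip
import shutil
import subprocess
import urllib.request

# Hypothetical input; the real pipeline reads its sources from configuration.
ONTOLOGY_URL = "http://purl.obolibrary.org/obo/go.owl"

# 1. Download an external ontology.
urllib.request.urlretrieve(ONTOLOGY_URL, "go.owl")

# 2. Remove embargoed data via a SPARQL UPDATE encoding the embargo logic (hypothetical embargo.ru).
subprocess.run(["robot", "query", "--input", "go.owl",
                "--update", "embargo.ru", "--output", "go_filtered.owl"], check=True)

# 3. Extract a bottom (BOT) module for the slice, using a seed term file (hypothetical terms.txt).
subprocess.run(["robot", "extract", "--method", "BOT", "--input", "go_filtered.owl",
                "--term-file", "terms.txt", "--output", "go_slice.owl"], check=True)

# 4. Convert to Turtle for faster triplestore ingest.
subprocess.run(["robot", "convert", "--input", "go_slice.owl",
                "--output", "go_slice.ttl"], check=True)

# 5. Gzip the Turtle file for hand-off to pipeline-updatetriplestore.
with open("go_slice.ttl", "rb") as src, gzip.open("go_slice.ttl.gz", "wb") as dst:
    shutil.copyfileobj(src, dst)
```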
Data pipeline: pipeline-updatetriplestore
- Image: ghcr.io/obasktools/pipeline-updatetriplestore
- Dockerfile
- Git: https://github.com/OBASKTools/pipeline-updatetriplestore
- Summary: This container encapsulates a process that (1) sets up the triplestore and (2) loads all of the ttl files generated by pipeline-collectdata into the triplestore. The image contains the configuration details of the triplestore, such as the choice of triplestore engine.
Detailed notes on pipeline-updatetriplestore:
- The process loads the ontologies and data collected in the previous step into the triplestore.
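A minimal sketch of that loading step, assuming an already-created rdf4j repository named `obask` and gzipped Turtle files from `pipeline-collectdata` in the current directory (endpoint and names are assumptions):

```python
import glob
import gzip

import requests

# Assumed rdf4j-server endpoint and repository; the real values come from the image configuration.
STATEMENTS_URL = "http://localhost:8080/rdf4j-server/repositories/obask/statements"

for path in sorted(glob.glob("*.ttl.gz")):
    # Decompress and POST each Turtle file; POST appends to the repository
    # (PUT would replace its contents).
    with gzip.open(path, "rb") as fh:
        data = fh.read()
    response = requests.post(
        STATEMENTS_URL,
        data=data,
        headers={"Content-Type": "text/turtle"},
    )
    response.raise_for_status()
    print(f"Loaded {path}")
```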
Sub-pipeline: Data transformation and dumps for production instances (pipeline-dumps)
- Summary: This pipeline transforms the knowledge graph in the triplestore into various custom data dumps used by downstream services such as the production knowledge graph (neo4j) instance and solr.
- Depends on: Triplestore
- Dependents: obask-kb, solr, pipeline-updatesolr
Data pipeline: pipeline-dumps
- Image: ghcr.io/obasktools/pipeline-dumps
- Git: https://github.com/OBASKTools/pipeline-dumps
- Summary: The dumps pipeline accesses the triplestore to obtain data dumps, which it then munges, transforms, and enriches for various downstream purposes such as ingestion into the production knowledge graph (obask-kb) and Solr ingestion.
- Dockerfile
Detailed notes on pipeline-dumps
- The process performs the following steps (all encoded in the Makefile):
  - Build dump for `obask-kb` (Neo4j tabular data structure)
  - Build dump for `solr` (special JSON file, created using Python; see the sketch after this list)
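To make the Solr dump step concrete, here is a hedged sketch of querying the triplestore and writing a JSON file of documents; the endpoint, query, and field names are illustrative assumptions, not the pipeline's actual Makefile targets:

```python
import json

import requests

# Assumed triplestore endpoint (see the triplestore service above).
RDF4J_ENDPOINT = "http://localhost:8080/rdf4j-server/repositories/obask"

# Illustrative query: pull IRIs and labels to turn into Solr documents.
QUERY = """
SELECT ?entity ?label WHERE {
  ?entity <http://www.w3.org/2000/01/rdf-schema#label> ?label .
}
"""

response = requests.get(
    RDF4J_ENDPOINT,
    params={"query": QUERY},
    headers={"Accept": "application/sparql-results+json"},
)
response.raise_for_status()

# Flatten the SPARQL result bindings into JSON documents that Solr can index.
docs = [
    {"id": row["entity"]["value"], "label": row["label"]["value"]}
    for row in response.json()["results"]["bindings"]
]

with open("solr_dump.json", "w") as fh:
    json.dump(docs, fh, indent=2)
```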
Sub-pipeline: Knowledge Graph (obask-kb)
- Summary: This pipeline deploys the production instance of the Knowledge Graph (neo4j database) and loads all the relevant data.
- Depends on: pipeline-dumps
- Dependents: None
Service: obask-kb
- Image: ghcr.io/obasktools/obask-kb
- Git: https://github.com/OBASKTools/obask-kb
- Dockerfile
- Summary: Deploys an empty, configured instance of a Neo4J database with the neo2owl plugin, APOC and GDS tools.
Data pipeline: pipeline-updateprod
- Image: ghcr.io/obasktools/pipeline-updateprod
- Git: https://github.com/OBASKTools/pipeline-updateprod
- Dockerfile
- Summary: The pipeline-updateprod container currently takes an ontology (from the integration layer) and loads it into the Neo4J production instance (obask-kb) using the neo2owl plugin. Process:
  - Loading the ontology using the `neo4j2owl:owl2Import()` procedure
  - Setting a number of indices (see detailed notes below).
Detailed notes about pipeline-updateprod
- You can add further pipeline post-processing steps, such as indices, by editing this file. Note that this file can be used to set arbitrary post-processing Cypher queries, not just indices (contrary to what the file name suggests). Essentially, all listed Cypher queries are executed in order right after the production database (PDB) import is completed (see the sketch below).
- The possible configuration settings for the `neo4j2owl:owl2Import()` procedure are described here. The configuration is stored here.
Sub-pipeline: Ontology Search (ontology-search)
- Summary: This pipeline deploys the production Solr instance, indexes the relevant data into it, and deploys the ontology search API (`ontology-search`) on top of it.
- Depends on: pipeline-dumps
- Dependents: None
Service: Solr
- Image: solr:8.11
- Summary: Deploys an empty, configured instance of Solr.
Data pipeline: pipeline-updatesolr
- Image: ghcr.io/obasktools/pipeline-updatesolr
- Git: https://github.com/OBASKTools/pipeline-updatesolr
- Dockerfile
- Summary: The pipeline-updatesolr container initializes the Solr schema and indexes the JSON data generated by pipeline-dumps into Solr (a hedged sketch of this step follows).
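A minimal sketch of that indexing step, assuming a local Solr with a core named `obask` and the dump file name used in the earlier sketch (core name, host, and file name are assumptions):

```python
import json

import requests

# Assumed Solr core; the actual preconfigured core name comes from the OBASK setup.
SOLR_UPDATE_URL = "http://localhost:8983/solr/obask/update"

# Load the JSON documents produced by pipeline-dumps (assumed file name).
with open("solr_dump.json") as fh:
    docs = json.load(fh)

# Index the documents and commit so they become searchable immediately.
response = requests.post(
    SOLR_UPDATE_URL,
    params={"commit": "true"},
    json=docs,
)
response.raise_for_status()
print(f"Indexed {len(docs)} documents")
```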
Service: ontology-search
- Image: ghcr.io/obasktools/ontology-search
- Git: https://github.com/OBASKTools/ontology-search
- Dockerfile
- Summary: Provides ontology search APIs.