OBASK Architecture
The OBASK pipeline comprises four servers/services and four data pipelines:
- Pipeline servers:
  - Triplestore
  - Solr + preconfigured Solr core
  - Search API (`ontology-search`)
  - Neo4J production instance (`obask-kb`)
- Pipeline data pipelines:
  - Data collection (`pipeline-collectdata`)
  - Triplestore ingestion (`pipeline-updatetriplestore`)
  - Data transformation and dumps for production instances (`pipeline-dumps`)
  - KG production instance ingestion (`pipeline-updateprod`)
Server and data pipelines are combined into four general sub-pipelines, which are configured as Docker Compose services. This documentation describes all four sub-pipelines in detail, including the role each individual server and data pipeline plays.
Sub-pipeline: Deploy triplestore
- Summary: This pipeline deploys an empty triplestore, collects all relevant ontologies, and pre-processes and loads the collected data into the triplestore.
- Components:
  - Triplestore
  - `pipeline-collectdata` (data collection and preprocessing pipeline for all resources)
  - `pipeline-updatetriplestore` (loading collected data into the triplestore)
- Dependents: `pipeline-dumps`
Service: Triplestore
- Image: eclipse/rdf4j-workbench:3.7.7
- Summary: The triplestore is currently a plain, default deployment of rdf4j-server. We use a simple in-memory store, which is configured here. The container is maintained elsewhere (see the Docker Hub page of the image for details).
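For orientation, here is a minimal sketch of how a client could query the deployed rdf4j-server over its standard per-repository SPARQL endpoint; the host, port, and repository name (`obask`) are illustrative assumptions, not part of the pipeline configuration:

```python
import requests

# Assumed local deployment of rdf4j-server; adjust host/port/repository to your setup.
RDF4J_ENDPOINT = "http://localhost:8080/rdf4j-server/repositories/obask"

query = """
SELECT (COUNT(*) AS ?triples)
WHERE { ?s ?p ?o }
"""

# rdf4j-server exposes a standard SPARQL protocol endpoint for each repository.
response = requests.get(
    RDF4J_ENDPOINT,
    params={"query": query},
    headers={"Accept": "application/sparql-results+json"},
)
response.raise_for_status()
print(response.json()["results"]["bindings"][0]["triples"]["value"])
```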
Data pipeline: pipeline-collectdata
- Image: ghcr.io/obasktools/pipeline-collectdata
- Git: https://github.com/OBASKTools/pipeline-collectdata
- Dockerfile
- Summary: This container encapsulates a process that downloads a number of source ontologies and applies a number of ROBOT-based pre-processing steps, in particular extracting modules/slices of external ontologies, running consistency checks, and serialising as ttl for quicker ingest into the triplestore. It also contains the data embargo pipeline and has some provisions for SHACL validation.
Detailed notes on pipeline-collectdata
- The process is encoded here. It performs the following steps (sketched in code after this list):
  - Downloading external ontologies.
  - Removing embargoed data. The technique applied here is based on using ROBOT query and encoding the embargo logic as SPARQL queries (combined with `ROBOT remove`).
  - Ontologies in the project configuration are imported in their entirety.
  - Ontologies in the slice configuration are sliced. The slice corresponds to a BOTTOM module whose seed signature combines the signatures of all ontologies in the fullontologies section with the signature of the KB.
  - All ontologies are converted to Turtle format.
  - All ontologies ready to be imported into the triplestore are gzipped.
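As an illustration only, here is a minimal sketch of the kind of steps described above, assuming a `robot` CLI on the PATH and hypothetical file names (`embargo.ru`, `terms.txt`); it is not the actual pipeline script:

```python
import gzip
import shutil
import subprocess
import urllib.request

# Hypothetical input; the real pipeline reads its sources from configuration.
ONTOLOGY_URL = "http://purl.obolibrary.org/obo/go.owl"

# 1. Download an external ontology.
urllib.request.urlretrieve(ONTOLOGY_URL, "go.owl")

# 2. Remove embargoed data via a SPARQL UPDATE encoding the embargo logic (hypothetical embargo.ru).
subprocess.run(["robot", "query", "--input", "go.owl",
                "--update", "embargo.ru", "--output", "go_filtered.owl"], check=True)

# 3. Extract a bottom (BOT) module for the slice, using a seed term file (hypothetical terms.txt).
subprocess.run(["robot", "extract", "--method", "BOT", "--input", "go_filtered.owl",
                "--term-file", "terms.txt", "--output", "go_slice.owl"], check=True)

# 4. Convert to Turtle for faster triplestore ingest.
subprocess.run(["robot", "convert", "--input", "go_slice.owl",
                "--output", "go_slice.ttl"], check=True)

# 5. Gzip the Turtle file for hand-off to pipeline-updatetriplestore.
with open("go_slice.ttl", "rb") as src, gzip.open("go_slice.ttl.gz", "wb") as dst:
    shutil.copyfileobj(src, dst)
```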
Data pipeline: pipeline-updatetriplestore
- Image: ghcr.io/obasktools/pipeline-updatetriplestore
- Dockerfile
- Git: https://github.com/OBASKTools/pipeline-updatetriplestore
- Summary: This container encapsulates a process that (1) sets up the triplestore and (2) loads all of the ttl files generated by pipeline-collectdata into the triplestore. The image contains the configuration details of the triplestore, such as the choice of triplestore engine.
Detailed notes on pipeline-updatetriplestore:
- The process loads the ontologies and data collected in the previous step into the triplestore.
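A minimal sketch of that loading step, assuming an already-created rdf4j repository named `obask` and gzipped Turtle files from `pipeline-collectdata` in the current directory (endpoint and names are assumptions):

```python
import glob
import gzip

import requests

# Assumed rdf4j-server endpoint and repository; the real values come from the image configuration.
STATEMENTS_URL = "http://localhost:8080/rdf4j-server/repositories/obask/statements"

for path in sorted(glob.glob("*.ttl.gz")):
    # Decompress and POST each Turtle file; POST appends to the repository
    # (PUT would replace its contents).
    with gzip.open(path, "rb") as fh:
        data = fh.read()
    response = requests.post(
        STATEMENTS_URL,
        data=data,
        headers={"Content-Type": "text/turtle"},
    )
    response.raise_for_status()
    print(f"Loaded {path}")
```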
Sub-pipeline: Data transformation and dumps for production instances (pipeline-dumps)
- Summary: This pipeline transforms the knowledge graph in the triplestore into various custom data dumps used by downstream services such as the production knowledge graph (neo4j) instance and solr.
- Depends on: Triplestore
- Dependents: obask-kb, solr, pipeline-updatesolr
Data pipeline: pipeline-dumps
- Image: ghcr.io/obasktools/pipeline-dumps
- Git: https://github.com/OBASKTools/pipeline-dumps
- Summary: The dumps pipeline accesses the triplestore to obtain data dumps, which it then munges, transforms, and enriches for various downstream purposes such as ingestion into the production knowledge graph (obask-kb) and Solr ingestion.
- Dockerfile
Detailed notes on pipeline-dumps
- The process performs the following steps (all encoded in the Makefile):
  - Build dump for `obask-kb` (Neo4j tabular data structure)
  - Build dump for `solr` (special JSON file, created using Python; see the sketch after this list)
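To make the Solr dump step concrete, here is a hedged sketch of querying the triplestore and writing a JSON file of documents; the endpoint, query, and field names are illustrative assumptions, not the pipeline's actual Makefile targets:

```python
import json

import requests

# Assumed triplestore endpoint (see the triplestore service above).
RDF4J_ENDPOINT = "http://localhost:8080/rdf4j-server/repositories/obask"

# Illustrative query: pull IRIs and labels to turn into Solr documents.
QUERY = """
SELECT ?entity ?label WHERE {
  ?entity <http://www.w3.org/2000/01/rdf-schema#label> ?label .
}
"""

response = requests.get(
    RDF4J_ENDPOINT,
    params={"query": QUERY},
    headers={"Accept": "application/sparql-results+json"},
)
response.raise_for_status()

# Flatten the SPARQL result bindings into JSON documents that Solr can index.
docs = [
    {"id": row["entity"]["value"], "label": row["label"]["value"]}
    for row in response.json()["results"]["bindings"]
]

with open("solr_dump.json", "w") as fh:
    json.dump(docs, fh, indent=2)
```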
Sub-pipeline: Knowledge Graph (obask-kb)
- Summary: This pipeline deploys the production instance of the Knowledge Graph (neo4j database) and loads all the relevant data.
- Depends on: pipeline-dumps
- Dependents: None
Service: obask-kb
- Image: ghcr.io/obasktools/obask-kb
- Git: https://github.com/OBASKTools/obask-kb
- Dockerfile
- Summary: Deploys an empty, configured instance of a Neo4J database with the neo2owl plugin, APOC and GDS tools.
Data pipeline: pipeline-updateprod
- Image: ghcr.io/obasktools/pipeline-updateprod
- Git: https://github.com/OBASKTools/pipeline-updateprod
- Dockerfile
- Summary: The pipeline-updateprod container currently takes an ontology (from the integration layer) and loads it into the Neo4J production instance (obask-kb) using the neo2owl plugin. Process:
  - Loading the ontology using the `neo4j2owl:owl2Import()` procedure
  - Setting a number of indices (see detailed notes below).
Detailed notes about pipeline-updateprod
- You can add further pipeline post-processing steps, such as indices, by editing this file. Note that this file can be used to set arbitrary post-processing Cypher queries, not just indices (contrary to what the file name suggests). Essentially, all listed Cypher queries are executed in order right after the production database (PDB) import is completed (see the sketch below).
- The possible configuration settings for the `neo4j2owl:owl2Import()` procedure are described here. The configuration is stored here.
Sub-pipeline: Ontology Search (ontology-search)
- Summary: This pipeline deploys the production Solr instance, indexes the relevant data into it, and deploys the ontology search API (`ontology-search`) on top of it.
- Depends on: pipeline-dumps
- Dependents: None
Service: Solr
- Image: solr:8.11
- Summary: Deploys an empty, configured instance of Solr.
Data pipeline: pipeline-updatesolr
- Image: ghcr.io/obasktools/pipeline-updatesolr
- Git: https://github.com/OBASKTools/pipeline-updatesolr
- Dockerfile
- Summary: The pipeline-updatesolr container initializes the Solr schema and indexes the JSON data generated by pipeline-dumps into Solr (a hedged sketch of this step follows).
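A minimal sketch of that indexing step, assuming a local Solr with a core named `obask` and the dump file name used in the earlier sketch (core name, host, and file name are assumptions):

```python
import json

import requests

# Assumed Solr core; the actual preconfigured core name comes from the OBASK setup.
SOLR_UPDATE_URL = "http://localhost:8983/solr/obask/update"

# Load the JSON documents produced by pipeline-dumps (assumed file name).
with open("solr_dump.json") as fh:
    docs = json.load(fh)

# Index the documents and commit so they become searchable immediately.
response = requests.post(
    SOLR_UPDATE_URL,
    params={"commit": "true"},
    json=docs,
)
response.raise_for_status()
print(f"Indexed {len(docs)} documents")
```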
Service: ontology-search
- Image: ghcr.io/obasktools/ontology-search
- Git: https://github.com/OBASKTools/ontology-search
- Dockerfile
- Summary: Provides ontology search APIs.