Bio2Vec

Bio-Ontology Research Group · KAUST

Machine learning that reasons with biomedical knowledge.

We build neuro-symbolic AI for biology and medicine: methods and software that combine deep learning with the logical structure of ontologies and knowledge graphs. From embedding description logics into vector spaces to predicting protein function, phenotypes, and disease-causing variants — this is our open-source toolbox.

36 open-source tools
6 live web services
7 research themes

Library & foundations

Our work centres on one idea: ontologies and knowledge graphs are background knowledge that machine-learning models can learn from. mOWL packages the methods below behind a single API, and our tutorials teach the underlying techniques.

mOWL

A Python library for machine learning with ontologies. mOWL maps ontology classes, relations, and instances into vector spaces while preserving the logical axioms. It unifies graph-based, syntactic, and model-theoretic embeddings behind one API and gives direct access to the OWL API and automated reasoning from Python.

Machine Learning with Ontologies

Companion code and notebooks for our review of semantic similarity and ontology-based machine learning, reproducing benchmark experiments across semantic similarity, Onto2Vec/OPA2Vec, graph embeddings, and EL Embeddings.

Ontology Tutorial

Our hands-on teaching materials on ontologies, automated reasoning, semantic similarity, and combining ontologies with deep learning, developed for courses and summer schools.

Geometric ontology embeddings

These methods build vector spaces that are themselves approximate models of a description-logic theory, so geometry reflects logical entailment. They are the core of our neuro-symbolic research.

EL Embeddings

Geometric embeddings for the description logic EL++ that act as approximate models of the ontology: classes become n-balls and relations TransE-style translations, so subsumption, conjunction, and disjointness are enforced geometrically.
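
To make the geometric reading concrete, here is a minimal sketch of ball-based penalties in the spirit of EL Embeddings. The function names, the margin parameter, and the exact hinge form are assumptions for illustration, not the project's code: a class is a (centre, radius) pair, a relation is a translation vector, and each penalty is zero exactly when the geometric condition holds.

```python
import math

def subsumption_loss(c_center, c_radius, d_center, d_radius, margin=0.0):
    """Zero when ball C lies inside ball D, i.e. when C is subsumed by D."""
    return max(0.0, math.dist(c_center, d_center) + c_radius - d_radius - margin)

def disjointness_loss(c_center, c_radius, d_center, d_radius, margin=0.0):
    """Zero when balls C and D do not overlap."""
    return max(0.0, c_radius + d_radius - math.dist(c_center, d_center) + margin)

def existential_loss(c_center, c_radius, relation, d_center, d_radius, margin=0.0):
    """Zero when C, translated by the relation vector, lies inside D.

    This is one way to read "C is subsumed by (some r D)" with a
    TransE-style translation; it is an illustrative assumption.
    """
    translated = [c + r for c, r in zip(c_center, relation)]
    return max(0.0, math.dist(translated, d_center) + c_radius - d_radius - margin)

if __name__ == "__main__":
    # A unit ball at the origin sits inside a radius-2 ball at the origin.
    print(subsumption_loss((0.0, 0.0), 1.0, (0.0, 0.0), 2.0))
    # Two unit balls whose centres are one unit apart overlap.
    print(disjointness_loss((1.0, 0.0), 1.0, (0.0, 0.0), 1.0))
```

In training, penalties like these would be summed over the axioms of the ontology and minimised by gradient descent; the sketch only shows how each axiom type maps to a geometric condition.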

ELBE / EL2Box

Box-shaped EL++ embeddings. Concepts are represented as axis-parallel boxes, and the intersection of two such boxes is again a box (or empty), so the representation is closed under intersection in a way that ball-based embeddings are not.
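
The closure property of boxes is easy to verify directly. The sketch below is illustrative only: the Box type and the function names are hypothetical and not taken from the ELBE or EL2Box code. It represents a box by its lower and upper corners and computes the intersection coordinate by coordinate.

```python
from typing import Optional, Tuple

# A box is a pair (lower corner, upper corner) of equal-length tuples.
Box = Tuple[Tuple[float, ...], Tuple[float, ...]]

def intersect(a: Box, b: Box) -> Optional[Box]:
    """Return the intersection of two axis-parallel boxes, or None if empty."""
    lower = tuple(max(x, y) for x, y in zip(a[0], b[0]))
    upper = tuple(min(x, y) for x, y in zip(a[1], b[1]))
    if any(lo > hi for lo, hi in zip(lower, upper)):
        return None  # the boxes are separated along at least one axis
    return lower, upper

def contains(outer: Box, inner: Box) -> bool:
    """True when `inner` lies entirely inside `outer` (inner is subsumed)."""
    return all(o <= i for o, i in zip(outer[0], inner[0])) and all(
        i <= o for o, i in zip(outer[1], inner[1])
    )

if __name__ == "__main__":
    a = ((0.0, 0.0), (2.0, 2.0))
    b = ((1.0, 1.0), (3.0, 3.0))
    both = intersect(a, b)
    # The intersection is itself a box and is contained in both operands.
    print(both, contains(a, both), contains(b, both))
```

The intersection of two balls, by contrast, is a lens shape rather than a ball, which is why a ball-based embedding can only approximate conjunction.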

catE

Lattice-preserving embeddings for the more expressive logic ALC, which supports full negation and universal restrictions. A category-theoretic construction materialises the ontology's concept lattice and embeds it in a way that preserves its order.

DELE

Deductive EL++ embeddings: the ontology's deductive closure is folded into training and evaluation, with negative sampling that avoids treating entailed axioms as negatives.
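
The idea of not treating entailed axioms as negatives can be sketched as follows. This is a simplified illustration under loud assumptions: axioms are restricted to plain subclass pairs, a naive transitive closure stands in for a real reasoner, and all function names are hypothetical rather than DELE's own.

```python
import random

def subsumption_closure(axioms):
    """Naive transitive closure over (sub, sup) pairs; a stand-in for a reasoner."""
    closure = set(axioms)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure

def sample_negatives(positive, classes, entailed, k, rng):
    """Corrupt the superclass of `positive`, skipping anything that is entailed.

    `entailed` is assumed to contain the asserted axioms as well as the
    inferred ones, so the positive itself is never returned as a negative.
    """
    sub, _ = positive
    candidates = [c for c in classes if c != sub and (sub, c) not in entailed]
    return [(sub, c) for c in rng.sample(candidates, min(k, len(candidates)))]

if __name__ == "__main__":
    asserted = {("A", "B"), ("B", "C")}
    entailed = subsumption_closure(asserted)  # also contains ("A", "C")
    rng = random.Random(0)
    # ("A", "C") is entailed, so only ("A", "D") can be drawn as a negative.
    print(sample_negatives(("A", "B"), ["A", "B", "C", "D"], entailed, 5, rng))
```

Without the filter, a sampler that corrupts ("A", "B") at random could emit ("A", "C") as a negative and push the model away from a true consequence of the ontology.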

geometric_embeddings

Enhances geometric EL++ embeddings with negative sampling and deductive-closure filtering, and exposes biases in how knowledge-base-completion benchmarks are framed.

GeometrE

Fully geometric multi-hop reasoning on knowledge graphs: every logical operation is a geometric transformation rather than a learned neural operator, with a transitive loss that preserves transitive relations.

Graph- & corpus-based embeddings

Our earliest embedding methods turn logical axioms and RDF graphs into text corpora or labelled graphs that representation-learning methods can consume. This is the lineage that began with Onto2Vec and the “2vec” family.

Onto2Vec

Learns joint embeddings of ontology classes and annotated entities by treating logical axioms and their deductive closure as sentences for a Word2Vec model — the first method to apply representation learning to arbitrary OWL axioms.

OPA2Vec

Extends Onto2Vec by adding the informal content of ontologies — labels, definitions, synonyms — and an optional literature-pretrained language model, yielding richer vectors for similarity-based prediction.

DL2Vec

Converts description-logic axioms into a labelled graph and learns embeddings by random walks; combining phenotype, function, and anatomy ontologies, it links candidate genes to diseases.

Walking RDF and OWL

Neuro-symbolic representation learning over RDF knowledge graphs and OWL ontologies: reason to the deductive closure, then run edge-labelled random walks and Word2Vec. The seed method behind much of our later work.
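
The walk-generation step can be sketched in plain Python. This is an illustration under stated assumptions, not the project's implementation: the triples are taken to be the already-materialised deductive closure, the function names are hypothetical, and the Word2Vec step that would consume the walks is not shown.

```python
import random
from collections import defaultdict

def build_adjacency(triples):
    """Index (subject, predicate, object) triples by subject."""
    adjacency = defaultdict(list)
    for s, p, o in triples:
        adjacency[s].append((p, o))
    return adjacency

def random_walk(adjacency, start, length, rng):
    """Walk up to `length` edges, emitting node and edge labels alternately."""
    walk = [start]
    node = start
    for _ in range(length):
        edges = adjacency.get(node)
        if not edges:
            break  # dead end: no outgoing edges from this node
        predicate, node = rng.choice(edges)
        walk.extend([predicate, node])
    return walk

def generate_corpus(triples, walks_per_node, length, seed=0):
    """Produce one token sequence per walk, for every subject node."""
    rng = random.Random(seed)
    adjacency = build_adjacency(triples)
    return [
        random_walk(adjacency, node, length, rng)
        for node in sorted(adjacency)
        for _ in range(walks_per_node)
    ]

if __name__ == "__main__":
    triples = [("gene1", "has_function", "f1"), ("f1", "subclass_of", "f0")]
    for sentence in generate_corpus(triples, walks_per_node=1, length=2):
        print(" ".join(sentence))
```

Because edge labels appear as tokens in each sequence, a skip-gram model trained on the corpus sees relations as well as entities in every context window.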

Onto2Graph

Infers graph structures from OWL ontologies using automated reasoning, turning complex axioms into edges over the deductive closure for downstream graph analysis.

ontology-graph-projections

A systematic study of how different graph projections of ontologies (Onto2Graph, OWL2Vec*, RDF) shape the embeddings learned from them and their ability to infer axioms.

vec2SPARQL

Integrates SPARQL querying with vector-space operations, so a single query can mix graph patterns with embedding similarity and machine-learning functions.

Protein function prediction — the DeepGO family

A decade of ontology-aware models for predicting Gene Ontology functions from protein sequence, each generation tightening the link between deep learning and the logical structure of GO.

DeepGO

Predicts Gene Ontology functions from protein sequence and interaction networks with a deep, ontology-aware classifier whose output layer mirrors the GO hierarchy.
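
One simple way to make predictions respect a class hierarchy is to lift every score to the ancestors of its class, so that a parent never scores lower than any of its descendants. The sketch below shows this as a post-processing step. It is an assumption for illustration and not a description of DeepGO's actual output layer; the function name and the placeholder class names are hypothetical.

```python
def propagate_scores(scores, parents):
    """Raise each ancestor's score to at least the score of its descendants.

    `scores` maps class name to predicted score; `parents` maps class name
    to a list of direct superclasses (the hierarchy may be a DAG).
    """
    result = dict(scores)
    for term, score in scores.items():
        stack = list(parents.get(term, ()))
        seen = set()
        while stack:
            ancestor = stack.pop()
            if ancestor in seen:
                continue  # already visited via another path in the DAG
            seen.add(ancestor)
            result[ancestor] = max(result.get(ancestor, 0.0), score)
            stack.extend(parents.get(ancestor, ()))
    return result

if __name__ == "__main__":
    parents = {"leaf": ["mid"], "mid": ["root"]}
    scores = {"root": 0.1, "mid": 0.3, "leaf": 0.9}
    # A confident leaf prediction lifts both of its ancestors.
    print(propagate_scores(scores, parents))
```

A model whose output layer encodes the hierarchy applies a constraint of this kind during training rather than afterwards, so the network learns scores that are consistent with the ontology from the start.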

DeepGOPlus

Sequence-only function prediction that combines a deep convolutional network over the sequence with homology-based annotation transfer, with strong performance in CAFA evaluations.

DeepGOZero

Zero-shot function prediction: GO classes are grounded in their logical definitions via model-theoretic EL Embeddings, so functions with no training examples can still be predicted.

DeepGO-SE

Frames function prediction as approximate semantic entailment over GO: protein language-model embeddings are evaluated in many approximate models of the GO theory, and the resulting truth values are aggregated.

DeepGOMeta

DeepGO for microbial communities — retrained on prokaryotes, archaea, and phages and paired with a metagenomics pipeline for functional profiling.

PU-GO

Reformulates function prediction as positive-unlabelled ranking, deriving class priors from the GO hierarchy so undiscovered annotations are not penalised as negatives.

GO-Agent

An LLM agent that predicts protein function as multi-step reasoning, cross-referencing sequence models, homology, literature, and GO axioms to refine and explain its predictions.

Genomic context

Predicts bacterial protein function from genomic context alone, pre-training a BERT model over genomes treated as sequences of protein-cluster tokens.

Phenotype-based gene & variant prioritization

Connecting patient phenotypes to genomes. These tools reason and learn over cross-species phenotype ontologies to rank the variants and genes behind genetic disease.

PhenomeNET-VP

Prioritizes causative variants in exomes and genomes by combining molecular pathogenicity with phenotype similarity computed by reasoning over the PhenomeNET cross-species phenotype ontology. The family includes PVP, DeepPVP, and OligoPVP.

DeepPheno

Predicts the abnormal phenotypes resulting from single-gene loss of function with an ontology-aware hierarchical classifier over the Human Phenotype Ontology.

DeepSVP

Prioritizes structural and copy-number variants by relating affected genes to patient phenotypes through ontology embeddings of function, expression, and anatomy.

EmbedPVP

Prioritizes coding variants through neuro-symbolic, knowledge-enhanced learning, combining pathogenicity scores with phenotype, function, and anatomy knowledge across a choice of embedding methods.

STARVar

Ranks candidate variants from free-text patient symptoms — not only HPO codes — by combining literature text-mining with genomic evidence.

INDIGENA

Inductive disease-gene prediction: learns graph embeddings of individual phenotypes and aggregates them on the fly, generalising to unseen diseases where transductive methods cannot.

predCAN

Predicts cancer driver genes from biological background knowledge — cellular, functional, and knockout phenotypes embedded with OPA2Vec — rather than mutation frequency.

SMUDGE

Semantic disease-gene embeddings: builds vector representations of gene and disease phenotypes and propagates them to unannotated genes over an interaction network.

Drug discovery & molecular interactions

Embedding biomedical knowledge graphs together with sequence and text to predict interactions among drugs, targets, diseases, and pathogens.

multi-drug-embedding

Predicts drug targets and indications by jointly embedding a biomedical knowledge graph and the published literature, combining structured and textual evidence.

DeepViral

Predicts virus-host protein interactions from sequence together with infectious-disease phenotypes and protein functions grounded in ontologies.

Knowledge representation & ontology quality

Neuro-symbolic methods are only as good as the ontologies beneath them. These tools keep large biomedical ontologies tractable to reason over and free of hidden contradictions.

OntoFunc

An EL++-compatible representation pattern for biological functions that keeps large-scale reasoning over functions tractable, with tooling for function-based ontology analysis.

UNMIREOT

Detects, explains, and semi-automatically repairs hidden contradictions that surface when biomedical ontologies are combined — finding that a handful of axioms cause widespread incoherence across the OBO Foundry.

Live services & endpoints

Several of our methods run as hosted web services and public APIs that you can use directly, without installing anything.