Core Functionality

Graph-Based Bayesian Illumination (GB-BI) is an open-source software library that aims to make state-of-the-art, quality-diversity optimisation techniques infused with Bayesian optimisation easily accessible. We provide a modular codebase, novel benchmarks, and extensive documentation. In this section of the documentation, we discuss the core functionality of GB-BI in terms of fitness functions, molecular representations, acquisition functions, physicochemical descriptors, and structural filters. For practical considerations and specific configuration file settings, please see the tutorials.

Fitness Functions

GB-BI provides five classes of fitness functions out-of-the-box: fingerprint-based rediscovery, descriptor-based rediscovery, and SAS-modulated docking scores. These fitness functions can and have been used as benchmark tools to probe the efficiency of generative models but also have direct practical applications. Additional fitness functions can easily be added to the codebase.

Task

Description

Fingerprint Rediscovery

A lightweight task focused on molecule rediscovery where the fitness of a molecule is the Tanimoto similarity to the target molecule, based on their respective extended-connectivity fingerprints. Implementation based on Gaucamol, but applicable to generic targets.

Descriptor Rediscovery

An alternative molecule rediscovery task, with intermediate computational expense, where the fitness of a generated molecule is defined as the conformer-aggregated similarity to the target molecule. Conformer similarity is based on either USRCAT or Zernike descriptors.

Guacamol Benchmarks

These tasks optimise molecules to score highly on the GuacaMol task provided by the TDC oracle: molecular properties, molecular similarity, drug rediscovery, isomers, MPOs, median molecule and a few others.

Organic Photovoltaics

These tasks focus on the design of small organic donor molecules with optimal power conversion efficiency, based on single point GFN2-xTB calculations distilled through an autoML model provided by the Tartarus benchmarking suite.

SAS-Modulated Docking Scores

A computationally intensive task, utilizing docking methods which evaluate the theoretical affinity between a small molecule and a target protein. To avoid pure exploitation of the docking method, the scores are modulated by the synthetic accessibility of the small molecule.

Representations

GB-BI supports several molecular representations that are based on bit vectors or strings. These representations are used for the surrogate models using the Tanimoto kernel from GAUCHE. The string-based representations are turned into a bag-of-characters before being used in the kernel. Note that several of these vector representations are currently not natively supported by GAUCHE.

Representation

Description

ECFP

Extended-Connectivity Fingerprints (ECFP) are circular topological fingerprints that represent the presence of particular substructures.

FCFP

Functional-Class Fingerprints (FCFP) are circular topological fingerprints that represent the presence of particular pharmacophoric properties.

RDFP

RDKit-specific fingerprints (RDFP) are inspired by public descriptions of the Daylight fingerprints, but differ significantly in practical implementation.

APFP

Atom pair fingerprints (APFP) encode all unique triplets of atomic number, number of heavy atom neighbours, aromaticity, and chirality in a vector format.

TTFP

Topological torsion fingerprints (TTFP) encode the long-range relationships captured in atom pair fingerprints through information on the torsion angles.

SMILES

The simplified molecular-input line-entry system (SMILES) is a widely used line notation for describing a small molecule in terms of short ASCII strings.

SELFIES

Self-referencing embedded strings (SELFIES) are an alternative line notation for a small molecule, designed to be used in arbitrary machine learning models.

Acquisition Functions

Acquisition functions are heuristics employed to evaluate the potential of candidate molecules based on their predicted fitness value and the associated uncertainty of a surrogate fitness model (i.e. the Gaussian process). A large literature exists on the topic of acquisition functions and their design. GB-BI supports several of the most well-known and often used acquisition functions.

Acquisition Function

Description

Mean

The posterior mean (mean) is simply the direct fitness value as predicted by the surrogate fitness model.

UCB

The upper confidence bound (UCB) balances exploration and exploitation based on a confidence boundary derived from the surrogate fitness model.

EI

The expected improvement (EI) considers both the probability of improving on the current solutions and the magnitude of the predicted improvement.

logEI

A numerically stable variant of the logarithm of the expected improvement (logEI), which was recently introduced to alleviate the vanishing gradient problems.

Physicochemical Archive

Users choose their own features of interest and define relevant ranges of variation to construct a feature space. If, for instance, a user wants to find medicinally relevant molecules in chemical space, they could construct a feature space based on physicochemical properties like lipophilicity and molecular mass. The chosen ranges in which to explore these features can be used to specify a desired subset of chemical space in which to generate new molecules. GB-BI supports all descriptors from a selection of common RDKit modules.

Module

Description

AllChem

Includes a variety of functions for molecular operations and calculations.

Crippen

Contains methods for calculating logP and molar refractivity.

Lipinski

Implements rules and functions related to Lipinski’s rule of five for druglikeness.

Descriptors

Provides a comprehensive set of molecular descriptors.

rdMolDescriptors

Contains methods for calculating complicated molecular descriptors.

Structural Filters

To rule out unwanted and potentially toxic molecules, we use functional group knowledge from the ChEMBL database and a combination of ADME property calculations. We remove undesirable compounds before they enter the evaluation step of the algorithm. Removing these compounds at an early stage makes the algorithm more efficient, increases the predictive value of the final outcome, and significantly decreases overall processing time. Specifically, we filter out molecules that contain macrocycles, fail at Veber’s rule, or raise structural alerts.

Rule Set

Number of Alerts

Description

BMS

180

Alerts derived from Bristol-Myers Squibb, encompassing a broad range of concerns.

Dundee

105

Alerts identified by researchers at the University of Dundee.

Glaxo

55

Alerts from GlaxoSmithKline, focusing on known problematic groups.

Inpharmatica

91

Alerts from Inpharmatica Ltd, emphasizing computational toxicology findings.

LINT

57

Alerts from the LINT project, targeting specific structural liabilities.

MLSMR

116

Alerts from the Molecular Libraries Screening Center Network (MLSCN) repository.

PAINS

479

Pan-Assay INterference compoundS (PAINS) alerts known to interfere in assays.

SureChEMBL

166

Alerts derived from SureChEMBL, focusing on patent-related structural issues.