Small Molecule Rediscovery¶
In this notebook, we will show how to explore the rediscovery capabilities of GB-BI for small molecules.
Welcome to our notebook on using GB-GI to rediscover small molecules using the Tanimoto similarity and fingerprints! Because GB-BI is a quality-diversity method, the end result of such an optimisation is a collection of high-scoring molecules (i.e. highly similar to the target molecule) with a diversity of physico-chemical properties. So, if you’re eager to learn how to use GB-GI’s rediscovery capabilities or you want to know how to use different types of fingerprints, you’re in the right place. For rediscovery based on descriptors and conformer aggregate similarity, see the dedicated tutorial.
Running GB-BI¶
The most basic usage of GB-BI is to simply run the illuminate.py file, without any extra configurations specficied in the command line. In this case, GB-BI will call on the config.yaml
file in ../Argenomic-GP/configuration folder by making use of Hydra (an open-source Python framework that handle a hierarchical configuration file which can be easily be adapted through the command line.).
[ ]:
python GB-BI.py
While initially the configuration file seem complex or even intimidating, each of the individual components are very easy to understand and adapt. Later on in this tutorial, we will go into detail for each of the components, but for now we just want to highlight a few key aspects.
[4]:
---
controller:
max_generations: 100
max_fitness_calls: 1000
archive:
name: archive_150_4_25000
size: 150
accuracy: 25000
descriptor:
properties:
- Descriptors.ExactMolWt
- Descriptors.MolLogP
- Descriptors.TPSA
- Crippen.MolMR
ranges:
- - 225
- 555
- - -0.5
- 5.5
- - 0
- 140
- - 40
- 130
fitness:
type: Fingerprint
target: "O=C1NC(=O)SC1Cc4ccc(OCC3(Oc2c(c(c(O)c(c2CC3)C)C)C)C)cc4"
representation: ECFP4
arbiter:
rules:
- Glaxo
generator:
batch_size: 40
initial_size: 40
mutation_data: data/smarts/mutation_collection.tsv
initial_data: data/smiles/guacamol_intitial_rediscovery_troglitazone.smi
surrogate:
type: Fingerprint
representation: ECFP4
acquisition:
type: Mean
This standard config file specifies GB_BI run for a maximum of 100 generations or 1000 fitness function calls with an archive containing 150 niches. This is also the maximum amount of molecules that could potentially make up the evolutionary population at any given generation. The archive is spanned by 4 descriptors (Descriptors.ExactMolWt
, Descriptors.MolLogP
, Descriptors.TPSA
, Crippen.MolMR
) an limited to corresponding ranges indicated in the configuration file. The fitness
function is the Tanimoto similarity (standard for all fingerprint representations) applied to ECFP4 fingerprints. The target, Troglitazone, is specified by its SMILES representation. Finally, the molecular representation (ECFP4 fingerprints
) for the surrogate model and the acquistion function (Posterior Mean
) are also specified.
Interpreting the Output of a GB-BI Optimisation Run¶
In the same way we rely on HYDRA to take care of configurations files, we also use it to autmatically generate date and time-stamped output folders. For GB-BI, these folders contain a series of archive_i.csv
files where i
is the generation at the of which the archive was printed to file.
[25]:
import pandas as pd
archive_0 = pd.read_csv('./Example_Output/archive_100.csv')[['acquisition_value', 'fitness', 'niche_index', 'predicted_fitness', 'predicted_uncertainty', 'smiles']]
archive_0.head(10)
[25]:
acquisition_value | fitness | niche_index | predicted_fitness | predicted_uncertainty | smiles | |
---|---|---|---|---|---|---|
0 | 0.425027 | 0.448980 | 3 | 0.425027 | 0.000435 | O=C1NC(=O)C(Cc2ccc(OCC=C(O)Br)cc2)S1 |
1 | 0.365707 | 0.377358 | 4 | 0.365707 | 0.000271 | Cc1c(C)c2c(c(C)c1O)CCC(C)(CN1CCN(C)C1)O2 |
2 | 0.423895 | 0.427273 | 5 | 0.423895 | 0.000272 | CN1CCN(C(O)COc2ccc(CC3SC(=O)NC3=O)cc2)CC1 |
3 | 0.537205 | 0.552083 | 6 | 0.537205 | 0.000205 | Cc1c(C)c2c(c(C)c1O)CCC(C)(Cc1ccc(Br)cc1)O2 |
4 | 0.361091 | 0.326316 | 7 | 0.361091 | 0.000935 | O=C1NC(=O)C(Cc2ccccc2)S1 |
5 | 0.436628 | 0.440000 | 8 | 0.436628 | 0.000139 | CN(Br)N(Br)COc1ccc(CC2SC(=O)NC2=O)cc1 |
6 | 0.649736 | 0.652632 | 10 | 0.649736 | 0.000634 | Cc1c(C)c2c(c(C)c1O)CCC(C)(CONCC1SC(=O)NC1=O)O2 |
7 | 0.208211 | 0.215686 | 11 | 0.208211 | 0.000423 | O=C1NC(=O)C(COCC(O)OBr)S1 |
8 | 0.713624 | 0.765957 | 14 | 0.713624 | 0.000880 | Cc1c(C)c2c(c(C)c1O)CCC(C)(c1ccc(CC3SC(=O)NC3=O... |
9 | 0.800924 | 1.000000 | 18 | 0.800924 | 0.000857 | Cc1c(C)c2c(c(C)c1O)CCC(C)(COc1ccc(CC3SC(=O)NC3... |
In addition, the folder also contains a molecules.csv
file and a statistics.csv
file. These files respectively comprise of all molecules (and their properties) that have been evalauted by the fitness function and the overall statistics of the archive and the GB-BI run at each generation.
[24]:
import pandas as pd
statistics = pd.read_csv('./Example_Output/statistics.csv')
statistics.head(10)
[24]:
generation | maximum fitness | mean fitness | quality diversity score | coverage | function calls | max_err | mse | mae | |
---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0.418803 | 0.377648 | 3.398836 | 6.000000 | 24 | NaN | NaN | NaN |
1 | 1 | 0.495935 | 0.326890 | 9.806713 | 20.000000 | 54 | 0.189796 | 0.009939 | 0.078342 |
2 | 2 | 0.495935 | 0.337603 | 12.153720 | 24.000000 | 85 | 0.076884 | 0.001355 | 0.029830 |
3 | 3 | 0.495935 | 0.360824 | 14.432969 | 26.666667 | 119 | 0.060353 | 0.000754 | 0.021815 |
4 | 4 | 0.564815 | 0.375560 | 15.773541 | 28.000000 | 155 | 0.106445 | 0.001486 | 0.028396 |
5 | 5 | 0.592593 | 0.386288 | 17.382972 | 30.000000 | 190 | 0.089415 | 0.001411 | 0.030500 |
6 | 6 | 0.791667 | 0.397842 | 19.096398 | 32.000000 | 226 | 0.225988 | 0.002762 | 0.032465 |
7 | 7 | 0.817204 | 0.402366 | 20.520662 | 34.000000 | 263 | 0.093953 | 0.000831 | 0.019793 |
8 | 8 | 0.817204 | 0.415433 | 21.602498 | 34.666667 | 301 | 0.091979 | 0.000716 | 0.018666 |
9 | 9 | 1.000000 | 0.429936 | 22.356681 | 34.666667 | 338 | 0.199076 | 0.001700 | 0.022910 |
Overwriting the Configuration File from the Command Line¶
All the settings specified in the standard configuration file can easily be overwritten in the command line. For example, we can change the target to Tiotixene
and change the acquistion function to the expected improvement
(EI).
[ ]:
python GB-BI.py fitness.target="O=S(=O)(N(C)C)c2cc1C(\c3c(Sc1cc2)cccc3)=C/CCN4CCN(C)CC4" acquisition.type=EI
Thanks to the capabilities of HYDRA, we can also easily set up a multirun of the GB-BI algorithm which sweeps throuhg the different combinations of the configurations specified in the command line. For instance, here we start a multirun with ECFP4
and FCFP4
as molecular representations of the surrogate model and using the posterior mean
(Mean) and the expected improvement
(EI) as acquisition functions.
[ ]:
python GB-BI.py hydra.mode=MULTIRUN surrogate.representation=ECFP4,FCFP4 acquisition.type=Mean,EI
Components of the Configuration File¶
As mentioned before, we will discuss each component of the configuration file. We will go through the different options for each of these components one by one and discuss potential pitfalls. The different components of the configuration file correspond to diffferent classes in the code.
Controller: This class controls the basic overall settings of a GB-BI run. In the configuration file, you can set the maximum amount of generations and the maximum amount of fitness calls alloted to the optimsation process. The GB-BI run will end when either has been reached.
[13]:
controller:
max_generations: 100
max_fitness_calls: 1000
Archive: This class controls the quality-diversity archive at the core of GB-BI. The amount of niches is set in this part of the configuration file, but also two more technical aspects. The archive makes use of a centroidal Voronoi tessellation which needs to be sampled at the beginning of each run (with a certain accuracy). To avoid unncessary re-sampling the centroidal Voronoi tessellation can be stored in a file, the name of which is set here. note that the name should be changed if you change the accuracy, dimensionality, or size of the archive. The dimensionality is set in the next part of the configuration file.
[ ]:
archive:
name: archive_150_4_25000
size: 150
accuracy: 25000
Descriptor: This class controls the amount and type of physicochemical properties that are used to describe a molecule and place it in its niche. The amounts of descriptors sets the dimensionality of the archive. When choosing ranges, be careful to make sure your target molecule lies within part of chemical space covered by the archive.
[ ]:
descriptor:
properties:
- Descriptors.ExactMolWt
- Descriptors.MolLogP
- Descriptors.TPSA
- Crippen.MolMR
ranges:
- - 225
- 555
- - -0.5
- 5.5
- - 0
- 140
- - 40
- 130
Fitness: This class controls the type of fitness function that is used during the GB-BI run. For fingerprint-based rediscovery, the options for the representation are ECFP4
, ECFP6
, FCFP4
, and FCFP6
in line with the representations supported by the Gaucamol rediscovery benchmarks.
[ ]:
fitness:
type: Fingerprint
target: "O=C1NC(=O)SC1Cc4ccc(OCC3(Oc2c(c(c(O)c(c2CC3)C)C)C)C)cc4"
representation: ECFP4
Arbiter: This class controls the type of structural filters that are used during the GB-BI run. The options for the rules are Glaxo
, Dundee
, BMS
, PAINS
, SureChEMBL
, MLSMR
, Inpharmatica
, and LINT
. The SMARTS in these rules are determined by Pat Walters, and more information on them can be found in the original scripts. When choosing rules, be careful to make sure your target molecule adheres to the chosen structural filters.
[ ]:
arbiter:
rules:
- Glaxo
Note that structural filters can be combined by adding them to configuration file. The combined options for the rules will be read in as a list.
[ ]:
arbiter:
rules:
- Glaxo
- Dundee
- BMS
Generator: This class controls the crossovers and mutations in the GB-BI run. The initial size, which is randomly sampled from the inital data (i.e. a column of smiles), can be set in this part of the configuration file together with the batch size for the rest of the optimisation run. The path to the file with the mutation data (SMARTS) can also be changed here.
[ ]:
generator:
batch_size: 40
initial_size: 40
mutation_data: data/smarts/mutation_collection.tsv
initial_data: data/smiles/guacamol_intitial_rediscovery_troglitazone.smi
Surrogate: This class controls the molecular representation used in surrogate fitness model of the GB-BI run. There are two available types: Fingerprint
and String
. For Fingerprint
, the representation options are ECFP4
, ECFP6
, FCFP4
, FCFP6
, RDFP
, APFP
, andTTFP
.
[ ]:
surrogate:
type: Fingerprint
representation: ECFP4
For String
, the representation options are Smiles
and Selfies
. As these strings will be represented as a bag-of-characters in the Gaussian process, a maximum ngram max_ngram
also needs to be defined in the configuration file.
[ ]:
surrogate:
type: String
representation: Selfies
max_ngram: 5
Acquisition: This class controls the type of acquisition function used in the GB-BI run. The options for the type are the posterioir mean Mean
, the upper confidence bound UCB
, the expected improvement EI
, and a numerically stable implementation of the logarithm of the expected improvement logEI
.
[ ]:
acquisition:
type: Mean
For the the upper confidence bound UCB
, an additional hyperparameter beta
which balances exploitation and exploration needs to be defined as well.
[ ]:
acquisition:
type: UCB
beta: 0.2
References¶
Griffiths, Ryan-Rhys, et al. “Gauche: A library for Gaussian processes in chemistry.” Advances in Neural Information Processing Systems 36 (2024).
Verhellen, Jonas, and Jeriek Van den Abeele. “Illuminating elite patches of chemical space.” Chemical science 11.42 (2020): 11485-11491.
Brown, Nathan, et al. “GuacaMol: benchmarking models for de novo molecular design.” Journal of chemical information and modeling 59.3 (2019): 1096-1108.
Jensen, Jan H. “A graph-based genetic algorithm and generative model/Monte Carlo tree search for the exploration of chemical space.” Chemical science 10.12 (2019): 3567-3572.