Small Molecule Rediscovery

In this notebook, we will show how to explore the rediscovery capabilities of GB-BI for small molecules.

Welcome to our notebook on using GB-GI to rediscover small molecules using the Tanimoto similarity and fingerprints! Because GB-BI is a quality-diversity method, the end result of such an optimisation is a collection of high-scoring molecules (i.e. highly similar to the target molecule) with a diversity of physico-chemical properties. So, if you’re eager to learn how to use GB-GI’s rediscovery capabilities or you want to know how to use different types of fingerprints, you’re in the right place. For rediscovery based on descriptors and conformer aggregate similarity, see the dedicated tutorial.

Running GB-BI

The most basic usage of GB-BI is to simply run the illuminate.py file, without any extra configurations specficied in the command line. In this case, GB-BI will call on the config.yaml file in ../Argenomic-GP/configuration folder by making use of Hydra (an open-source Python framework that handle a hierarchical configuration file which can be easily be adapted through the command line.).

[ ]:
python GB-BI.py

While initially the configuration file seem complex or even intimidating, each of the individual components are very easy to understand and adapt. Later on in this tutorial, we will go into detail for each of the components, but for now we just want to highlight a few key aspects.

[4]:
---
controller:
  max_generations: 100
  max_fitness_calls: 1000
archive:
  name: archive_150_4_25000
  size: 150
  accuracy: 25000
descriptor:
  properties:
  - Descriptors.ExactMolWt
  - Descriptors.MolLogP
  - Descriptors.TPSA
  - Crippen.MolMR
  ranges:
  - - 225
    - 555
  - - -0.5
    - 5.5
  - - 0
    - 140
  - - 40
    - 130
fitness:
  type: Fingerprint
  target: "O=C1NC(=O)SC1Cc4ccc(OCC3(Oc2c(c(c(O)c(c2CC3)C)C)C)C)cc4"
  representation: ECFP4
arbiter:
  rules:
  - Glaxo
generator:
  batch_size: 40
  initial_size: 40
  mutation_data: data/smarts/mutation_collection.tsv
  initial_data: data/smiles/guacamol_intitial_rediscovery_troglitazone.smi
surrogate:
  type: Fingerprint
  representation: ECFP4
acquisition:
  type: Mean

This standard config file specifies GB_BI run for a maximum of 100 generations or 1000 fitness function calls with an archive containing 150 niches. This is also the maximum amount of molecules that could potentially make up the evolutionary population at any given generation. The archive is spanned by 4 descriptors (Descriptors.ExactMolWt, Descriptors.MolLogP, Descriptors.TPSA, Crippen.MolMR) an limited to corresponding ranges indicated in the configuration file. The fitness function is the Tanimoto similarity (standard for all fingerprint representations) applied to ECFP4 fingerprints. The target, Troglitazone, is specified by its SMILES representation. Finally, the molecular representation (ECFP4 fingerprints) for the surrogate model and the acquistion function (Posterior Mean) are also specified.

Interpreting the Output of a GB-BI Optimisation Run

In the same way we rely on HYDRA to take care of configurations files, we also use it to autmatically generate date and time-stamped output folders. For GB-BI, these folders contain a series of archive_i.csv files where i is the generation at the of which the archive was printed to file.

[25]:
import pandas as pd
archive_0 = pd.read_csv('./Example_Output/archive_100.csv')[['acquisition_value', 'fitness', 'niche_index', 'predicted_fitness', 'predicted_uncertainty',   'smiles']]
archive_0.head(10)
[25]:
acquisition_value fitness niche_index predicted_fitness predicted_uncertainty smiles
0 0.425027 0.448980 3 0.425027 0.000435 O=C1NC(=O)C(Cc2ccc(OCC=C(O)Br)cc2)S1
1 0.365707 0.377358 4 0.365707 0.000271 Cc1c(C)c2c(c(C)c1O)CCC(C)(CN1CCN(C)C1)O2
2 0.423895 0.427273 5 0.423895 0.000272 CN1CCN(C(O)COc2ccc(CC3SC(=O)NC3=O)cc2)CC1
3 0.537205 0.552083 6 0.537205 0.000205 Cc1c(C)c2c(c(C)c1O)CCC(C)(Cc1ccc(Br)cc1)O2
4 0.361091 0.326316 7 0.361091 0.000935 O=C1NC(=O)C(Cc2ccccc2)S1
5 0.436628 0.440000 8 0.436628 0.000139 CN(Br)N(Br)COc1ccc(CC2SC(=O)NC2=O)cc1
6 0.649736 0.652632 10 0.649736 0.000634 Cc1c(C)c2c(c(C)c1O)CCC(C)(CONCC1SC(=O)NC1=O)O2
7 0.208211 0.215686 11 0.208211 0.000423 O=C1NC(=O)C(COCC(O)OBr)S1
8 0.713624 0.765957 14 0.713624 0.000880 Cc1c(C)c2c(c(C)c1O)CCC(C)(c1ccc(CC3SC(=O)NC3=O...
9 0.800924 1.000000 18 0.800924 0.000857 Cc1c(C)c2c(c(C)c1O)CCC(C)(COc1ccc(CC3SC(=O)NC3...

In addition, the folder also contains a molecules.csv file and a statistics.csv file. These files respectively comprise of all molecules (and their properties) that have been evalauted by the fitness function and the overall statistics of the archive and the GB-BI run at each generation.

[24]:
import pandas as pd
statistics = pd.read_csv('./Example_Output/statistics.csv')
statistics.head(10)
[24]:
generation maximum fitness mean fitness quality diversity score coverage function calls max_err mse mae
0 0 0.418803 0.377648 3.398836 6.000000 24 NaN NaN NaN
1 1 0.495935 0.326890 9.806713 20.000000 54 0.189796 0.009939 0.078342
2 2 0.495935 0.337603 12.153720 24.000000 85 0.076884 0.001355 0.029830
3 3 0.495935 0.360824 14.432969 26.666667 119 0.060353 0.000754 0.021815
4 4 0.564815 0.375560 15.773541 28.000000 155 0.106445 0.001486 0.028396
5 5 0.592593 0.386288 17.382972 30.000000 190 0.089415 0.001411 0.030500
6 6 0.791667 0.397842 19.096398 32.000000 226 0.225988 0.002762 0.032465
7 7 0.817204 0.402366 20.520662 34.000000 263 0.093953 0.000831 0.019793
8 8 0.817204 0.415433 21.602498 34.666667 301 0.091979 0.000716 0.018666
9 9 1.000000 0.429936 22.356681 34.666667 338 0.199076 0.001700 0.022910

Overwriting the Configuration File from the Command Line

All the settings specified in the standard configuration file can easily be overwritten in the command line. For example, we can change the target to Tiotixene and change the acquistion function to the expected improvement (EI).

[ ]:
python GB-BI.py fitness.target="O=S(=O)(N(C)C)c2cc1C(\c3c(Sc1cc2)cccc3)=C/CCN4CCN(C)CC4" acquisition.type=EI

Thanks to the capabilities of HYDRA, we can also easily set up a multirun of the GB-BI algorithm which sweeps throuhg the different combinations of the configurations specified in the command line. For instance, here we start a multirun with ECFP4 and FCFP4 as molecular representations of the surrogate model and using the posterior mean (Mean) and the expected improvement (EI) as acquisition functions.

[ ]:
python GB-BI.py hydra.mode=MULTIRUN surrogate.representation=ECFP4,FCFP4 acquisition.type=Mean,EI

Components of the Configuration File

As mentioned before, we will discuss each component of the configuration file. We will go through the different options for each of these components one by one and discuss potential pitfalls. The different components of the configuration file correspond to diffferent classes in the code.

Controller: This class controls the basic overall settings of a GB-BI run. In the configuration file, you can set the maximum amount of generations and the maximum amount of fitness calls alloted to the optimsation process. The GB-BI run will end when either has been reached.

[13]:
controller:
  max_generations: 100
  max_fitness_calls: 1000

Archive: This class controls the quality-diversity archive at the core of GB-BI. The amount of niches is set in this part of the configuration file, but also two more technical aspects. The archive makes use of a centroidal Voronoi tessellation which needs to be sampled at the beginning of each run (with a certain accuracy). To avoid unncessary re-sampling the centroidal Voronoi tessellation can be stored in a file, the name of which is set here. note that the name should be changed if you change the accuracy, dimensionality, or size of the archive. The dimensionality is set in the next part of the configuration file.

[ ]:
archive:
  name: archive_150_4_25000
  size: 150
  accuracy: 25000

Descriptor: This class controls the amount and type of physicochemical properties that are used to describe a molecule and place it in its niche. The amounts of descriptors sets the dimensionality of the archive. When choosing ranges, be careful to make sure your target molecule lies within part of chemical space covered by the archive.

[ ]:
descriptor:
  properties:
  - Descriptors.ExactMolWt
  - Descriptors.MolLogP
  - Descriptors.TPSA
  - Crippen.MolMR
  ranges:
  - - 225
    - 555
  - - -0.5
    - 5.5
  - - 0
    - 140
  - - 40
    - 130

Fitness: This class controls the type of fitness function that is used during the GB-BI run. For fingerprint-based rediscovery, the options for the representation are ECFP4, ECFP6, FCFP4, and FCFP6 in line with the representations supported by the Gaucamol rediscovery benchmarks.

[ ]:
fitness:
  type: Fingerprint
  target: "O=C1NC(=O)SC1Cc4ccc(OCC3(Oc2c(c(c(O)c(c2CC3)C)C)C)C)cc4"
  representation: ECFP4

Arbiter: This class controls the type of structural filters that are used during the GB-BI run. The options for the rules are Glaxo, Dundee, BMS, PAINS, SureChEMBL, MLSMR, Inpharmatica, and LINT. The SMARTS in these rules are determined by Pat Walters, and more information on them can be found in the original scripts. When choosing rules, be careful to make sure your target molecule adheres to the chosen structural filters.

[ ]:
arbiter:
  rules:
  - Glaxo

Note that structural filters can be combined by adding them to configuration file. The combined options for the rules will be read in as a list.

[ ]:
arbiter:
  rules:
  - Glaxo
  - Dundee
  - BMS

Generator: This class controls the crossovers and mutations in the GB-BI run. The initial size, which is randomly sampled from the inital data (i.e. a column of smiles), can be set in this part of the configuration file together with the batch size for the rest of the optimisation run. The path to the file with the mutation data (SMARTS) can also be changed here.

[ ]:
generator:
  batch_size: 40
  initial_size: 40
  mutation_data: data/smarts/mutation_collection.tsv
  initial_data: data/smiles/guacamol_intitial_rediscovery_troglitazone.smi

Surrogate: This class controls the molecular representation used in surrogate fitness model of the GB-BI run. There are two available types: Fingerprint and String. For Fingerprint, the representation options are ECFP4, ECFP6, FCFP4, FCFP6, RDFP, APFP, andTTFP.

[ ]:
surrogate:
  type: Fingerprint
  representation: ECFP4

For String, the representation options are Smiles and Selfies. As these strings will be represented as a bag-of-characters in the Gaussian process, a maximum ngram max_ngram also needs to be defined in the configuration file.

[ ]:
surrogate:
  type: String
  representation: Selfies
  max_ngram: 5

Acquisition: This class controls the type of acquisition function used in the GB-BI run. The options for the type are the posterioir mean Mean, the upper confidence bound UCB, the expected improvement EI, and a numerically stable implementation of the logarithm of the expected improvement logEI.

[ ]:
acquisition:
  type: Mean

For the the upper confidence bound UCB, an additional hyperparameter beta which balances exploitation and exploration needs to be defined as well.

[ ]:
acquisition:
  type: UCB
  beta: 0.2

References

  • Griffiths, Ryan-Rhys, et al. “Gauche: A library for Gaussian processes in chemistry.” Advances in Neural Information Processing Systems 36 (2024).

  • Verhellen, Jonas, and Jeriek Van den Abeele. “Illuminating elite patches of chemical space.” Chemical science 11.42 (2020): 11485-11491.

  • Brown, Nathan, et al. “GuacaMol: benchmarking models for de novo molecular design.” Journal of chemical information and modeling 59.3 (2019): 1096-1108.

  • Jensen, Jan H. “A graph-based genetic algorithm and generative model/Monte Carlo tree search for the exploration of chemical space.” Chemical science 10.12 (2019): 3567-3572.