Adapting and Extending GB-GI¶
In this notebook, we will show how to implement novel fitness functions, representations and acquisition functions.
Welcome to our notebook on adapting and extending GB-GI! Here, we’ll be introducing a fresh perspectives on the GB-GI codebase by implementing alternative fitness functions, molecular representations, and acquisition functions. Throughout this notebook, we’ll focus on the practical aspects of adapting the GB-BI code, providing concrete examples through the creation of new classes for these key components. So, if you’re eager to learn how to enhance GB-GI’s capabilities or want to know how adapt the code for your own purposes, you’re in the right place.
Defining a New Fitness Function¶
The most easily adaptable component of GB-BI’s internal functionalities is the fitness function. In the ../argenomic/functions/fitness.py file, you can find several fitness functions including the those used in the paper. Below, we show an Abstract_Fitness
class, which highlights how all of the fitness function classes are designed. Essentially, only the fitness_function
method needs to be implemented to capture the fitness function you want to implement. Optionally this might include
creating helper functions and a more involved use of the config file.
[ ]:
class Abstract_Fitness:
"""
A strategy class for calculating the fitness of a molecule.
"""
def __init__(self, config) -> None:
self.config = config
return None
def __call__(self, molecule) -> None:
"""
Updates the fitness value of a molecule.
"""
molecule.fitness = self.fitness_function(molecule)
return molecule
@abstractmethod
def fitness_function(self, molecule) -> float:
raise NotImplementedError
For example, we will be implementing the benchmark objective for the design of organic photovoltaics from Tartarus benchmark suite. We load the power conversion efficiency class pce
from the Tartarus library, based on the Scharber model, and apply it to the SMILES of molecules presented to the Power_Conversion_Fitness
class. Note that the fitness function includes a penalty based on the synthetic accessibility score (sas).
References: - Nigam, AkshatKumar, et al. Tartarus: A benchmarking platform for realistic and practical inverse molecular design. Advances in Neural Information Processing Systems 36 (2024). - Alharbi, Fahhad H., et al. An efficient descriptor model for designing materials for solar cells. npj Computational Materials 1.1 (2015): 1-9.
[ ]:
from tartarus import pce
class Power_Conversion_Fitness:
"""
A strategy class for calculating the fitness of a molecule.
"""
def __init__(self, config) -> None:
self.config = config
return None
def __call__(self, molecule) -> None:
"""
Updates the fitness value of a molecule.
"""
molecule.fitness = self.fitness_function(molecule)
return molecule
def fitness_function(self, molecule) -> float:
dipole, hl_gap, lumo, obj, pce_1, pce_2, sas = pce.get_properties(molecule.smiles)
return (pce_1 - sas)
Finally, don’t forget to add the newly designed fitness function class to the Fitness
class in the ../argenomic/mechanism.py file, as shown below, to make it available in the configuration file.
[ ]:
from argenomic.functions.fitness import Power_Conversion_Fitness
class Fitness:
@staticmethod
def __new__(self, config):
match config.type:
case "Fingerprint":
return Fingerprint_Fitness(config)
case "USRCAT":
return USRCAT_Fitness(config)
case "Zernike":
return Zernike_Fitness(config)
case "PCE":
return Power_Conversion_Fitness(config)
case _:
raise ValueError(f"{config.type} is not a supported fitness function type.")
Defining a New Molecular Representation¶
New molecular representations can readily be added to GB-BI. The process is somewhat more involved than adding a new fitness function, due to the large variety of potential molecular representations and the peculiarities of their original implementations. In the ../argenomic/functions/surrogate.py file, you find the GP_Surrogate
class which contains all the necessary functionality to apply the Tanimoto kernel from GAUCHE to the representations that are being calculated in the
calculate_encodings
method. Because in some cases (e.g. bag-of-words, SELFIES) the representations need to be determined over the combined list of novel and previously seen molecules, there is a separate add_to_prior_data
method which adds the novel molecules and their fitness values to the memory of the class and re-calculates the encodings.
[ ]:
class Abstract_Surrogate(GP_Surrogate):
"""
A strategy class for calculating the fitness of a molecule.
"""
def __init__(self, config):
super().__init__(config)
return None
@abstractmethod
def add_to_prior_data(self, molecules):
raise NotImplementedError
@abstractmethod
def calculate_encodings(self, molecules):
raise NotImplementedError
As an example, we will implement the Avalon fingerprint as a representation for the surrogate GP model. Note that for use in the Tanimoto kernel it is important to return Numpy array versions of the fingerprint vectors from the calculate_encodings
method. Because there are no complications in calculating fingerprint representations, the add_to_prior_data
method simply adds the encoding and the fitness of the new molecules to the self.encodings
and the self.fitnesses
variables.
[ ]:
from rdkit.Avalon import pyAvalonTools
class Avalon_Surrogate(GP_Surrogate):
def __init__(self, config):
super().__init__(config)
self.bits = self.config.bits
def add_to_prior_data(self, molecules):
"""
Updates the prior data for the surrogate model with new molecules and their fitness values.
"""
if self.encodings is not None and self.fitnesses is not None:
self.encodings = np.append(self.encodings, self.calculate_encodings(molecules), axis=0)
self.fitnesses = np.append(self.fitnesses, np.array([molecule.fitness for molecule in molecules]), axis=None)
else:
self.encodings = self.calculate_encodings(molecules)
self.fitnesses = np.array([molecule.fitness for molecule in molecules])
return None
def calculate_encodings(self, molecules):
molecular_graphs = [Chem.MolFromSmiles(Chem.CanonSmiles(molecule.smiles)) for molecule in molecules]
return np.array([pyAvalonTools.GetAvalonFP(molecular_graph, self.bits) for molecular_graph in molecular_graphs]).astype(np.float64)
Once again, at the end of this process, it’s necessary to add the newly designed surrogate function as an option in the Surrogate
class in the ../argenomic/mechanism.py file to make it available in the configuration file.
[ ]:
from argenomic.functions.surrogate import Avalon_Surrogate
class Surrogate:
@staticmethod
def __new__(self, config):
match config.type:
case "String":
return String_Surrogate(config)
case "Fingerprint":
return Fingerprint_Surrogate(config)
case "Fingerprint":
return Avalon_Surrogate(config)
case _:
raise ValueError(f"{config.type} is not a supported surrogate function type.")
Defining a New Acquisition Function¶
A new type of acquisition function can be added to GB-BI in a manner highly similar to adding a new molecular representation. In the ../argenomic/functions/acquisition.py file, you can find the BO_Acquisition
parent class that encapsulates all the necessary logic to apply Bayesian optimisation to the quality-diversity archive of GB-BI. A novel acquisition function is hence simply made by creating a class that inherits from this class and implements calculate_acquisition_value
method.
Note that the parent class has direct a link to the archive, so the current fitness value of the molecule in the relevant niche can be accessed if necessary.
[15]:
class Abstract_Acquisition(BO_Acquisition):
"""
A strategy class for the posterior mean of a list of molecules.
"""
@abstractmethod
def calculate_acquisition_value(self, molecules) -> None:
raise NotImplementedError
To show how this process works, we implement the probability of improvement as a novel acquisition function. First, we inherit from the BO_Acquisition class and then we fill in the calculate_acquisition_value method with the relevant logic. Note the archive is directly accessed to read-out the fitness of the current occupant of the niche the candidate molecule is assigned to. Empty niches have a fitness function value equal to zero.
[ ]:
class Probability_Of_Improvement(BO_Acquisition):
"""
A strategy class for the probability of improvement of a list of molecules.
"""
def calculate_acquisition_value(self, molecules) -> None:
"""
Updates the acquisition value for a list of molecules.
"""
current_fitnesses = [self.archive.elites[molecule.niche_index].fitness for molecule in molecules]
for molecule, current_fitness in zip(molecules, current_fitnesses):
Z = (molecule.predicted_fitness - current_fitness) / molecule.predicted_uncertainty
molecule.acquisition_value = norm.cdf(Z)
return molecules
Again, for one final time, it is important to remember to add the new acquisition function class to the ../argenomic/mechanism.py file and the Acquisition
factory class as shown here.
[ ]:
from argenomic.functions.surrogate import Probability_of_Improvement
class Acquisition:
@staticmethod
def __new__(self, config):
match config.type:
case 'Mean':
return Posterior_Mean(config)
case 'UCB':
return Upper_Confidence_Bound(config)
case 'EI':
return Expected_Improvement(config)
case 'logEI':
return Log_Expected_Improvement(config)
case 'PI':
return Probability_Of_Improvement(config)
case _:
raise ValueError(f"{config.type} is not a supported acquisition function type.")