Adapting and Extending GB-GI

In this notebook, we will show how to implement novel fitness functions, representations and acquisition functions.

Welcome to our notebook on adapting and extending GB-GI! Here, we’ll be introducing a fresh perspectives on the GB-GI codebase by implementing alternative fitness functions, molecular representations, and acquisition functions. Throughout this notebook, we’ll focus on the practical aspects of adapting the GB-BI code, providing concrete examples through the creation of new classes for these key components. So, if you’re eager to learn how to enhance GB-GI’s capabilities or want to know how adapt the code for your own purposes, you’re in the right place.

Defining a New Fitness Function

The most easily adaptable component of GB-BI’s internal functionalities is the fitness function. In the ../argenomic/functions/fitness.py file, you can find several fitness functions including the those used in the paper. Below, we show an Abstract_Fitness class, which highlights how all of the fitness function classes are designed. Essentially, only the fitness_function method needs to be implemented to capture the fitness function you want to implement. Optionally this might include creating helper functions and a more involved use of the config file.

[ ]:
class Abstract_Fitness:
    """
    A strategy class for calculating the fitness of a molecule.
    """
    def __init__(self, config) -> None:
        self.config = config
        return None

    def __call__(self, molecule) -> None:
        """
        Updates the fitness value of a molecule.
        """
        molecule.fitness = self.fitness_function(molecule)
        return molecule

    @abstractmethod
    def fitness_function(self, molecule) -> float:
        raise NotImplementedError

For example, we will be implementing the benchmark objective for the design of organic photovoltaics from Tartarus benchmark suite. We load the power conversion efficiency class pce from the Tartarus library, based on the Scharber model, and apply it to the SMILES of molecules presented to the Power_Conversion_Fitness class. Note that the fitness function includes a penalty based on the synthetic accessibility score (sas).

References: - Nigam, AkshatKumar, et al. Tartarus: A benchmarking platform for realistic and practical inverse molecular design. Advances in Neural Information Processing Systems 36 (2024). - Alharbi, Fahhad H., et al. An efficient descriptor model for designing materials for solar cells. npj Computational Materials 1.1 (2015): 1-9.

[ ]:
from tartarus import pce

class Power_Conversion_Fitness:
    """
    A strategy class for calculating the fitness of a molecule.
    """
    def __init__(self, config) -> None:
        self.config = config
        return None

    def __call__(self, molecule) -> None:
        """
        Updates the fitness value of a molecule.
        """
        molecule.fitness = self.fitness_function(molecule)
        return molecule

    def fitness_function(self, molecule) -> float:
        dipole, hl_gap, lumo, obj, pce_1, pce_2, sas = pce.get_properties(molecule.smiles)
        return (pce_1 - sas)

Finally, don’t forget to add the newly designed fitness function class to the Fitness class in the ../argenomic/mechanism.py file, as shown below, to make it available in the configuration file.

[ ]:
from argenomic.functions.fitness import Power_Conversion_Fitness

class Fitness:
    @staticmethod
    def __new__(self, config):
        match config.type:
            case "Fingerprint":
                return Fingerprint_Fitness(config)
            case "USRCAT":
                return USRCAT_Fitness(config)
            case "Zernike":
                return Zernike_Fitness(config)
            case "PCE":
                return Power_Conversion_Fitness(config)
            case _:
                raise ValueError(f"{config.type} is not a supported fitness function type.")

Defining a New Molecular Representation

New molecular representations can readily be added to GB-BI. The process is somewhat more involved than adding a new fitness function, due to the large variety of potential molecular representations and the peculiarities of their original implementations. In the ../argenomic/functions/surrogate.py file, you find the GP_Surrogate class which contains all the necessary functionality to apply the Tanimoto kernel from GAUCHE to the representations that are being calculated in the calculate_encodings method. Because in some cases (e.g. bag-of-words, SELFIES) the representations need to be determined over the combined list of novel and previously seen molecules, there is a separate add_to_prior_data method which adds the novel molecules and their fitness values to the memory of the class and re-calculates the encodings.

[ ]:
class Abstract_Surrogate(GP_Surrogate):
    """
    A strategy class for calculating the fitness of a molecule.
    """
    def __init__(self, config):
        super().__init__(config)
        return None

    @abstractmethod
    def add_to_prior_data(self, molecules):
        raise NotImplementedError

    @abstractmethod
    def calculate_encodings(self, molecules):
        raise NotImplementedError

As an example, we will implement the Avalon fingerprint as a representation for the surrogate GP model. Note that for use in the Tanimoto kernel it is important to return Numpy array versions of the fingerprint vectors from the calculate_encodings method. Because there are no complications in calculating fingerprint representations, the add_to_prior_data method simply adds the encoding and the fitness of the new molecules to the self.encodings and the self.fitnesses variables.

References: - Griffiths, Ryan-Rhys, et al. “Gauche: A library for Gaussian processes in chemistry.” Advances in Neural Information Processing Systems 36 (2024).
- Gedeck, Peter, Bernhard Rohde, and Christian Bartels. “QSAR− how good is it in practice? Comparison of descriptor sets on an unbiased cross section of corporate data sets.” Journal of chemical information and modeling 46.5 (2006): 1924-1936.
[ ]:
from rdkit.Avalon import pyAvalonTools

class Avalon_Surrogate(GP_Surrogate):
    def __init__(self, config):
        super().__init__(config)
        self.bits = self.config.bits

    def add_to_prior_data(self, molecules):
        """
        Updates the prior data for the surrogate model with new molecules and their fitness values.
        """
        if self.encodings is not None and self.fitnesses is not None:
            self.encodings = np.append(self.encodings, self.calculate_encodings(molecules), axis=0)
            self.fitnesses = np.append(self.fitnesses, np.array([molecule.fitness for molecule in molecules]), axis=None)
        else:
            self.encodings = self.calculate_encodings(molecules)
            self.fitnesses = np.array([molecule.fitness for molecule in molecules])
        return None

    def calculate_encodings(self, molecules):
        molecular_graphs = [Chem.MolFromSmiles(Chem.CanonSmiles(molecule.smiles)) for molecule in molecules]
        return np.array([pyAvalonTools.GetAvalonFP(molecular_graph, self.bits) for molecular_graph in molecular_graphs]).astype(np.float64)

Once again, at the end of this process, it’s necessary to add the newly designed surrogate function as an option in the Surrogate class in the ../argenomic/mechanism.py file to make it available in the configuration file.

[ ]:
from argenomic.functions.surrogate import Avalon_Surrogate

class Surrogate:
    @staticmethod
    def __new__(self, config):
        match config.type:
            case "String":
                return String_Surrogate(config)
            case "Fingerprint":
                return Fingerprint_Surrogate(config)
            case "Fingerprint":
                return Avalon_Surrogate(config)
            case _:
                raise ValueError(f"{config.type} is not a supported surrogate function type.")

Defining a New Acquisition Function

A new type of acquisition function can be added to GB-BI in a manner highly similar to adding a new molecular representation. In the ../argenomic/functions/acquisition.py file, you can find the BO_Acquisition parent class that encapsulates all the necessary logic to apply Bayesian optimisation to the quality-diversity archive of GB-BI. A novel acquisition function is hence simply made by creating a class that inherits from this class and implements calculate_acquisition_value method. Note that the parent class has direct a link to the archive, so the current fitness value of the molecule in the relevant niche can be accessed if necessary.

[15]:
class Abstract_Acquisition(BO_Acquisition):
    """
    A strategy class for the posterior mean of a list of molecules.
    """
    @abstractmethod
    def calculate_acquisition_value(self, molecules) -> None:
        raise NotImplementedError

To show how this process works, we implement the probability of improvement as a novel acquisition function. First, we inherit from the BO_Acquisition class and then we fill in the calculate_acquisition_value method with the relevant logic. Note the archive is directly accessed to read-out the fitness of the current occupant of the niche the candidate molecule is assigned to. Empty niches have a fitness function value equal to zero.

References: - Verhellen, Jonas, and Jeriek Van den Abeele. “Illuminating elite patches of chemical space.” Chemical science 11.42 (2020): 11485-11491.
- Kushner, Harold J. “A new method of locating the maximum point of an arbitrary multipeak curve in the presence of noise.” (1964): 97-106.
[ ]:
class Probability_Of_Improvement(BO_Acquisition):
    """
    A strategy class for the probability of improvement of a list of molecules.
    """
    def calculate_acquisition_value(self, molecules) -> None:
        """
        Updates the acquisition value for a list of molecules.
        """
        current_fitnesses = [self.archive.elites[molecule.niche_index].fitness for molecule in molecules]
        for molecule, current_fitness in zip(molecules, current_fitnesses):
            Z = (molecule.predicted_fitness - current_fitness) / molecule.predicted_uncertainty
            molecule.acquisition_value = norm.cdf(Z)
        return molecules

Again, for one final time, it is important to remember to add the new acquisition function class to the ../argenomic/mechanism.py file and the Acquisition factory class as shown here.

[ ]:
from argenomic.functions.surrogate import Probability_of_Improvement

class Acquisition:
    @staticmethod
    def __new__(self, config):
        match config.type:
            case 'Mean':
                return Posterior_Mean(config)
            case 'UCB':
                return Upper_Confidence_Bound(config)
            case 'EI':
                return Expected_Improvement(config)
            case 'logEI':
                return Log_Expected_Improvement(config)
            case 'PI':
                return Probability_Of_Improvement(config)
            case _:
                raise ValueError(f"{config.type} is not a supported acquisition function type.")