Leveraging high-throughput molecular simulations and machine learning for the design of chemical mixtures


Formulation dataset: miscible solvents

Figure 1A shows the workflow for selecting formulation examples given a miscibility table in which 81 solvents were tabulated against 25 solvents. We first extracted miscibility tables from the CRC handbook19 to identify pairs of industrially relevant solvents that are miscible with one another. Figure 1A shows an example of binary mixtures selected using the miscibility entries for acetone, benzene, and 1,2-ethanediol. In this example, acetone and benzene are considered a formulation since they are miscible, whereas benzene and 1,2-ethanediol are not considered a formulation since they are immiscible. One limitation of miscibility tables is that miscibility is measured with equal volumes of the two liquids, which does not indicate whether the mixture remains miscible at other compositions. We therefore tested whether varying the composition of binary solvent mixtures introduces any immiscibility, and we observed that the majority of the mixtures remain miscible in MD simulations (see Supplementary Figure 1 and Supplementary Figure 2). For an N-component system, we assumed that if every solvent pair is miscible with one another, then the entire N-component system is miscible and is considered a viable formulation. Figure 1B shows the number of possible unique formulations as the number of components increases up to six. We chose to study up to five components, which yields a total of 19,238 unique formulations. By using experimentally derived miscibility tables, we designed a large formulation dataset of miscible solvent mixtures; homogeneous solutions are important in a variety of materials science applications such as battery electrolytes, chemical reactivity, and consumer packaged goods.
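The pairwise-to-N-component selection rule described above can be sketched as a clique search over a miscibility graph: an N-component formulation is viable only if every pair of its solvents is miscible. The miscibility pairs and solvent names below are illustrative, not the CRC data.

```python
from itertools import combinations

# Toy pairwise miscibility table (illustrative, not the CRC handbook data):
# acetone/benzene are miscible; benzene/1,2-ethanediol are not.
MISCIBLE_PAIRS = {
    frozenset({"acetone", "benzene"}),
    frozenset({"acetone", "1,2-ethanediol"}),
    frozenset({"acetone", "ethanol"}),
    frozenset({"benzene", "ethanol"}),
    frozenset({"ethanol", "1,2-ethanediol"}),
}
SOLVENTS = sorted({s for pair in MISCIBLE_PAIRS for s in pair})

def is_viable_formulation(solvents):
    """An N-component system is viable iff every solvent pair is miscible."""
    return all(frozenset(p) in MISCIBLE_PAIRS for p in combinations(solvents, 2))

def enumerate_formulations(max_components):
    """All unique miscible formulations with 1..max_components solvents."""
    found = []
    for n in range(1, max_components + 1):
        for combo in combinations(SOLVENTS, n):
            if is_viable_formulation(combo):
                found.append(combo)
    return found
```

With this toy table, every single solvent counts as a formulation, five of the six solvent pairs are viable, and any combination containing the immiscible benzene/1,2-ethanediol pair is rejected.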

Fig. 1: Formulation dataset generated from experimental miscibility tables.
figure 1

A Example of three solvents from miscibility tables extracted from ref. 19. Pairs of solvents labeled “miscible” were used to generate a formulation dataset. A total of 81 solvents were tabulated against 25 solvents for miscibility. B Number of unique formulations and cumulative number of unique formulations possible as a function of the number of components, using the miscibility table described in (A). The cumulative number of formulations is the cumulative sum of formulations from 1 to N components.

Using the 19,238 unique formulations of up to 5 components, we further varied the composition of binary and ternary systems as summarized in Table 1. For binary mixtures, each component was varied across 20%, 40%, 50%, 60%, and 80%. For ternary mixtures, we selected 60% of one component and 20% of each of the other components, as well as equal-ratio mixtures. Given the large number of possible composition variations for quaternary and quinternary mixtures, only equal-ratio systems were studied here. In sum, a total of 30,142 formulation examples were studied in this work, spanning pure component systems (N = 1) to quinternary systems (N = 5).
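The composition grids described above can be sketched as follows; this is a minimal illustration of the binary and ternary variations in Table 1, with quaternary and quinternary systems restricted to equal ratios.

```python
def binary_compositions():
    # Each component is varied over 20/40/50/60/80 wt%; the pair sums to 100.
    fracs = [20, 40, 50, 60, 80]
    return [(a, 100 - a) for a in fracs]

def ternary_compositions():
    # One component at 60 wt% with the others at 20 wt%, plus the equal-ratio mix.
    comps = [(60, 20, 20), (20, 60, 20), (20, 20, 60)]
    comps.append((100 / 3,) * 3)
    return comps

def equal_ratio(n_components):
    # Quaternary and quinternary systems use only equal ratios.
    return (100 / n_components,) * n_components
```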

Table 1 Summary of formulations studied in this work as a function of number of components

Generating formulation dataset with molecular simulations

We validated whether simulation-derived properties can accurately capture experimental trends for industrially relevant solvents. We extracted three MD descriptors from the last 10 ns of the production MD simulation (see the “Methods” section for simulation and descriptor details): (1) packing density, (2) heat of vaporization (ΔHvap), and (3) enthalpy of mixing (ΔHm). Density measures how tightly the molecules of a mixture are packed. ΔHvap is the amount of heat needed to convert some fraction of the liquid into vapor. While ΔHvap of mixtures is challenging to measure experimentally20, ΔHvap has been observed to correlate with temperature-dependent viscosities of pure liquids in both MD simulations21 and experiments22. Therefore, ΔHvap is an informative property that may correlate with other material properties. ΔHm is a fundamental thermodynamic property of liquid mixtures that measures the energy released or absorbed upon mixing pure components into a single phase in equilibrium. We treat these three MD descriptors as relevant formulation labels applicable to materials science applications. For example, density is an important property for battery applications since it dictates the battery weight and charge mobility; ΔHvap effectively measures the cohesion energy of a liquid and has previously been observed to correlate with viscosity21; and ΔHm is important for process design, dictating properties such as solubility and phase stability.
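For concreteness, the two mixture labels can be sketched with their standard definitions, ΔHvap ≈ Ugas − Uliquid + RT and ΔHm = Hmix − Σi xi Hi. The energies below are illustrative placeholders, not simulation outputs; the exact estimators used in this work are given in the “Methods” section.

```python
R_KCAL = 1.987204e-3  # gas constant in kcal/(mol·K)

def heat_of_vaporization(u_gas, u_liquid, temperature=298.15):
    """Standard MD estimate: ΔHvap = U_gas - U_liquid + RT (kcal/mol).
    u_gas and u_liquid are per-molecule potential energies (illustrative)."""
    return u_gas - u_liquid + R_KCAL * temperature

def enthalpy_of_mixing(h_mix, pure_enthalpies, mole_fractions):
    """ΔHm = H_mix - sum_i x_i * H_i (all per mole of mixture).
    By construction, a pure component has ΔHm = 0."""
    assert abs(sum(mole_fractions) - 1.0) < 1e-9
    ideal = sum(x * h for x, h in zip(mole_fractions, pure_enthalpies))
    return h_mix - ideal
```

Note that because ΔHm is defined relative to the pure components, it vanishes identically for any single-component system, which is why binary mixtures show the widest ΔHm range later in Fig. 3C.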

Figure 2A shows an example in which acetone and benzene are equally weighted and simulated with MD to compute formulation properties. The simulation snapshot in Fig. 2A shows a well-mixed system of acetone and benzene, consistent with the experimental miscibility table in Fig. 1A. Figure 2B shows the coefficient of determination (R2) between simulation-derived and experimental properties for density, ΔHvap, and ΔHm. For all formulation properties, we observe good agreement between simulation-derived and experimental values with R2 ≥ 0.84. Figure 2C–E shows the parity plots between simulation-derived and experimental properties. For density (Fig. 2C), we compared the packing density of eleven pure solvents and observe strong agreement with experiments, with an R2 of 0.98 and a root-mean-squared error (RMSE) of ~15.4 kg/m3. Similarly, we observe a strong correlation between MD simulations and experiments for ΔHvap when comparing 34 pure solvents (Fig. 2D), with an R2 of 0.97 and an RMSE of 3.4 kcal/mol. Density and heat of vaporization are expected to be well captured by MD simulations since the OPLS4 force field is parameterized to accurately predict these properties5; hence, the results in Fig. 2C, D are consistent with the literature showing that these two properties are accurately predicted with MD simulations5,6,7. On the other hand, ΔHm is not used to parameterize the OPLS4 force field, but ΔHm has shown good agreement between experiments and MD simulations for a variety of solvents, such as nonpolar-nonpolar mixtures (e.g., benzene and cyclohexane) and nonpolar-polar mixtures (e.g., benzene and ethanol)23. Figure 2E shows that simulation-derived ΔHm captures experimental trends for 54 binary mixture examples using the simulation protocol in this work. Given that the simulation-derived properties correlate with experiments for density, ΔHvap, and ΔHm, we conclude that MD simulations can accurately capture formulation properties for the solvent systems studied in this work.
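The parity-plot metrics used throughout, R2 and RMSE, can be computed as sketched below; the work itself presumably relies on standard library implementations.

```python
import math

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

def rmse(actual, predicted):
    """Root-mean-squared error between actual and predicted values."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)
```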

Fig. 2: Generating formulation labels using classical molecular dynamics (MD) simulations and validating them against experiments.

A Workflow to compute formulation properties by adding a 50 wt% acetone and 50 wt% benzene mixture into an MD simulation. Formulation properties are computed using the last 10 ns of a production MD run. B Coefficient of determination (R2) between MD simulation and experimental values for density, heat of vaporization (ΔHvap), and enthalpy of mixing (ΔHm). N denotes the number of data points used for each validation. C Simulation-derived versus experimental density for eleven pure solvent examples. D Simulation-derived versus experimental ΔHvap for 34 pure component examples. Experimental densities and ΔHvap were taken from the CRC handbook19. E Simulated versus experimental enthalpy of mixing for 54 binary mixture examples. Experimental enthalpy of mixing values were extracted from ref. 23. All scatter plots report the coefficient of determination (R2) and root-mean-squared error (RMSE) between simulated and experimental values in the lower right corner. A diagonal gray dashed line is shown as a visual guide. The examples used to compare the formulation labels between MD simulations and experiments are tabulated in Supplementary Table 1.

Since MD simulations can accurately capture experimental trends, we then used MD simulations to generate a large formulation dataset that is useful for benchmarking formulation-property relationships. Using the miscibility table to identify miscible solvent systems ranging from pure component systems (N = 1) to quinternary systems (N = 5) as described in Fig. 1, we performed 30,142 MD simulations and extracted the density, ΔHvap, and ΔHm from the production simulations. Figure 3 shows box and whisker plots of density, ΔHvap, and ΔHm computed from MD simulations as a function of the number of components. Figure 3A, B shows that as the number of components increases, the distributions of density and ΔHvap become narrower compared to pure component systems (N = 1). These results show that pure component systems span a larger range of properties than their mixtures, and that solvent mixtures can be used to fine-tune properties to highly specific values that are not attainable with pure components alone. Similar to density and ΔHvap, Fig. 3C shows that increasing the number of components results in narrower ranges of ΔHm. However, ΔHm differs from the other two properties in that pure component systems have ΔHm = 0, because ΔHm of a mixture is defined relative to its corresponding pure components. Hence, binary systems (N = 2) have the largest range of ΔHm values. Since ΔHm is a relative mixture property, it may be challenging to predict with formulation-property relationships, as the model must learn differences between the mixture and its individual components. We use the 30,142 formulations with the three property labels from MD simulations to evaluate whether the formulation-property approaches can create accurate models.

Fig. 3: Distribution of the formulation labels from classical molecular dynamics simulations.

Box and whisker plot between formulation labels versus number of components are shown for A density, B heat of vaporization (ΔHvap), and C enthalpy of mixing (ΔHm). Gray grid lines are shown as visual guides.

Formulation-property relationships

All formulation-property relationships were built using the DeepAutoQSAR framework, Schrödinger’s automated molecular property prediction engine24,25. In DeepAutoQSAR, feature and model hyperparameter selection are iteratively improved by Bayesian optimization based on the model performance in the previous training cycle. This work extends the DeepAutoQSAR workflow to encode formulations as inputs, where multiple molecules with their compositions are provided as input rather than a single molecule. We focused on formulation-property relationships with the following ideal characteristics: (1) composition must be accounted for in the model, such that variations in composition affect property predictions; (2) models must be permutationally invariant, such that changing the order of the input molecules and compositions does not change the prediction output; and (3) models must be flexible to the number of components, such that a model trained on binary mixtures can be used to predict ternary mixtures, quaternary mixtures, and so on. These model characteristics are important for designing formulations because composition is crucial to a formulation, ingredients can be inputted in a random order, and the inclusion or removal of particular ingredients is commonly evaluated to measure the impact of individual ingredients on formulation properties.

Figure 4 summarizes three different approaches for developing formulation-property relationships that satisfy the characteristics of an ideal model. Figure 4A shows the formulation descriptor aggregation (FDA) approach, where individual molecules are featurized, weighted by their corresponding compositions, then aggregated using statistical metrics such as the mean, standard deviation, minimum, maximum, and median. These aggregated features are treated as formulation descriptors, which are then passed as inputs to ML models to predict formulation properties. By aggregating with statistical operations, the formulation descriptor captures the distribution of molecular properties across individual ingredients, which is useful for property prediction. The FDA approach is analogous to matminer featurizers that perform statistical operations, such as averaging and standard deviation, to characterize inorganic materials by aggregating features from individual atomic types26.
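A minimal sketch of the FDA aggregation step, assuming each ingredient is already featurized into a fixed-length descriptor vector; the vectors and weights below are illustrative. Because the aggregation statistics are symmetric in their arguments, the resulting formulation descriptor is invariant to ingredient order, satisfying requirement (2) above.

```python
from statistics import mean, median, pstdev

def fda_features(descriptor_vectors, weights):
    """Formulation descriptor aggregation (sketch): weight each ingredient's
    descriptor vector by its composition, then aggregate element-wise with
    mean/std/min/max/median. The result must not depend on ingredient order."""
    weighted = [[w * d for d in vec] for w, vec in zip(weights, descriptor_vectors)]
    features = []
    for column in zip(*weighted):  # iterate over descriptor dimensions
        features += [mean(column), pstdev(column),
                     min(column), max(column), median(column)]
    return features
```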

Fig. 4: Schematic of formulation-property relationship approaches.

A Formulation descriptor aggregation approach (FDA) where two molecules are featurized to generate molecular descriptors that are compositionally weighted, then aggregated by computing the mean, standard deviation (std), minimum (min), maximum (max), and median values. These aggregated features are considered as formulation descriptors that are passed into machine learning algorithms to predict formulation properties. B Formulation descriptor Set2Set approach (FDS2S) where two molecules are featurized to generate molecular descriptors that are compositionally weighted, then these descriptors are aggregated using a Set2Set algorithm, and finally the aggregated descriptors are passed into a fully connected neural network to predict formulation properties. C Formulation graph approach (FG) where two molecules are represented as graphs (G) consisting of atoms as nodes (V) and bonds as edges (E). For each molecule, 75 atomic features and composition are used in the node vector. Graph convolutions and update operations are performed, followed by a readout layer and a fully connected neural network to predict formulation properties.

Figure 4B shows a descriptor-based approach similar to FDA, but instead of aggregating with statistical operations, the compositionally weighted descriptors are passed into a Set2Set algorithm27 to create a formulation descriptor vector (FDS2S). The Set2Set operator uses a combination of long short-term memory (LSTM) networks to process sequential data and the softmax function as an attention layer to aggregate the arrays from multiple molecules into a single array27. Set2Set outputs the same array even when the order of the input arrays is changed, thus satisfying the permutation-invariance requirement of an ideal formulation-property model. The final array from the Set2Set layer is then passed to a fully connected layer to predict the formulation property. The usefulness of Set2Set as a way to aggregate information has been demonstrated in several previous works, such as the aggregation of reactant or product information to predict bond dissociation energies28 or hydrolysis energies29.
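The full Set2Set operator involves an LSTM-driven iterative attention loop; the sketch below illustrates only the permutation-invariant softmax-attention readout at its core, with a fixed (illustrative) query vector standing in for the learned LSTM state.

```python
import math

def attention_pool(vectors, query):
    """Softmax-attention aggregation of a set of vectors (simplified; the
    LSTM update of Set2Set is omitted). The output does not depend on the
    order in which the input vectors are supplied."""
    scores = [sum(q * v for q, v in zip(query, vec)) for vec in vectors]
    m = max(scores)                       # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]       # softmax attention weights
    dim = len(vectors[0])
    return [sum(w * vec[i] for w, vec in zip(weights, vectors))
            for i in range(dim)]
```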

Figure 4C shows a graph-based representation approach (FG), where atoms are nodes and bonds are edges. Each node vector consists of 75 atomic features and the composition of the ingredient. For each node, graph convolution operators aggregate information from the neighboring nodes and output a new atomic vector based on message passing across the molecular graph. The final learned atomic features are passed to a readout layer, whose output is fed into a fully connected neural network to predict the formulation property. Previous work has shown success in using graph-based representations for predicting the viscosity of binary mixtures14 and the battery performance of electrolyte systems15.

For descriptor-based approaches (i.e., FDA and FDS2S), four distinct molecular featurization approaches were evaluated: (1) 200 RDKit descriptors; (2) Morgan fingerprints with a size of ~500–2060 and a radius of ~2–4; (3) 167-bit MACCS keys, which are 2D structure fingerprints commonly used to measure molecular similarity or for virtual screening30; and (4) 132 matminer descriptors. Featurization with RDKit descriptors, Morgan fingerprints, and MACCS keys was implemented using the rdkit package (Version 2023.9.5)31, whereas matminer descriptors were implemented using the matminer package (Version 0.9.0)26. All features were preprocessed with the following procedure: (1) constant features with a variance of zero were removed; (2) correlated features with Pearson’s r greater than or equal to 0.90 were removed; and (3) features were standardized by subtracting the mean and dividing by the standard deviation. For the FDA approach, the 200 RDKit descriptors were omitted as a featurizer because of their poor generalizability to new formulations for specific data splits, which is likely because these descriptors are molecular size dependent (e.g., molecular weight). For the FG approach, 75 atomic features were used to featurize each of the heavy atoms. Atomic featurizations include one-hot encodings of atomic number, implicit valence, formal charge, atomic degree, number of radical electrons, hybridization, and aromaticity25. The composition of the molecule was added as the 76th atomic feature on all nodes. Node features were preprocessed by removing correlated features with Pearson’s r greater than or equal to 0.90, and non-binary features were standardized by subtracting the mean and dividing by the standard deviation.
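The three-step feature preprocessing described above can be sketched as follows; here each column is one feature evaluated over the dataset. This sketch applies the correlation filter to |r|, and the workflow's exact convention may differ.

```python
from statistics import mean, pstdev

def pearson_r(x, y):
    """Pearson correlation coefficient between two feature columns."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x) ** 0.5
    vy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (vx * vy)

def preprocess(columns):
    """(1) drop zero-variance columns, (2) drop columns with |r| >= 0.90
    against an earlier kept column, (3) standardize to zero mean, unit std."""
    kept = []
    for col in columns:
        if pstdev(col) == 0:
            continue  # step 1: constant feature
        if any(abs(pearson_r(col, k)) >= 0.90 for k in kept):
            continue  # step 2: correlated feature
        kept.append(col)
    return [[(v - mean(c)) / pstdev(c) for v in c] for c in kept]
```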

For the FDA approach, four ML algorithms were tested: elastic net, support vector regression, extreme gradient boosting (XGBoost)32, and a fully connected neural network. For the FDS2S approach, only a model with the Set2Set layer27 and a fully connected neural network was used. For the FG approach, ten graph-based approaches were evaluated: Graph Convolutional Network (GCN)33, the PyTorch version of GCN (TorchGraphConv)34, TopK35, GraphSAGE36, Graph Isomorphism Network (GIN)37, Self-Attention Graph Pooling (SAGPool)38, EdgePool39, GlobalAttention37, Set2Set27, and SortPool40. These GNN models differ mainly in how they aggregate information, building on successes in the previous literature37,39. Elastic net and support vector regression were implemented using the scikit-learn package (Version 1.2.1)41. XGBoost was implemented with the xgboost package (Version 1.7.4)32. Fully connected neural networks and graph-based models were trained with PyTorch (Version 2.0.1)42. The details of each ML algorithm and its hyperparameters are summarized in ref. 25.

Performance of formulation-property models

We next evaluate the performance of the different formulation-property approaches (Fig. 4) in predicting the three formulation properties extracted from MD simulations (Fig. 3). The performance of each formulation-property approach is measured using a learning curve, where a machine learning algorithm is iteratively trained with incrementally increasing training sizes to determine its prediction accuracy on a left-out test set as a function of training set size. An ideal formulation-property model should accurately predict formulation properties at both small (~100 examples) and large (>1000 examples) dataset sizes, especially since many formulation datasets are data limited. For example, a recent study had fewer than 200 experimentally available electrolyte formulations with which to evaluate machine learning approaches for predicting battery charging efficiencies15, which makes benchmarking data-driven approaches for formulations challenging. By using MD simulations to generate formulation labels, we can rigorously analyze the performance of formulation-property relationships at both small and large dataset sizes, which is useful for identifying formulation-property approaches that are accurate over a broad range of training sizes.
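The learning-curve protocol can be sketched with a toy one-dimensional surrogate model in place of the actual DeepAutoQSAR models; the least-squares linear fit below is purely illustrative.

```python
def linear_fit(xs, ys):
    """Least-squares slope and intercept for a 1-D toy surrogate model."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def learning_curve(x_train, y_train, x_test, y_test, sizes):
    """Refit on growing training subsets; record test-set R^2 per size."""
    def r2(actual, pred):
        m = sum(actual) / len(actual)
        ss_res = sum((a - p) ** 2 for a, p in zip(actual, pred))
        ss_tot = sum((a - m) ** 2 for a in actual)
        return 1.0 - ss_res / ss_tot
    curve = []
    for n in sizes:
        slope, intercept = linear_fit(x_train[:n], y_train[:n])
        preds = [slope * x + intercept for x in x_test]
        curve.append((n, r2(y_test, preds)))
    return curve
```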

Figure 5A–C shows the learning curve performance of FDA, FDS2S, and FG models when predicting density, ΔHvap, and ΔHm. Each learning curve shows the test set R2 as a function of training set size. When the target property is density (Fig. 5A), all formulation-property models achieve a test set R2 of ~0.90 when >500 training examples are available, which demonstrates that the formulation-property models can accurately predict density with relatively small dataset sizes. When the training size is less than 100, FDS2S models outperform the FDA and FG approaches in predicting the test set density. Of the three target properties, density is the easiest for formulation-property models to predict, which may be due to its generally monotonic behavior as a function of composition for most binary mixtures23. Figure 5B shows that formulation-property models can also accurately capture ΔHvap, with a test set R2 ≥ 0.80 when >500 training examples are available. Interestingly, FG models struggle to predict ΔHvap when the training size is less than 200, whereas descriptor-based models (FDA and FDS2S) achieve a test set R2 ≥ 0.60 in this limited-data regime. The poor prediction accuracy of the FG models is likely due to the poor representations that graph convolutional neural networks learn when limited data are available: pre-defined descriptors that better represent the material at the small data scale have been shown to outperform graph-based models, since graph models that automatically learn molecular representations through convolutional operations require a sufficient amount of training data to obtain informative molecular features9,21. Similar to density, FDS2S outperforms the other models in predicting ΔHvap across all training sizes. Figure 5C shows that formulation-property models generally struggle to predict ΔHm until the training size reaches at least ~5000 examples, at which point they achieve a test set R2 ≥ 0.80. FDS2S performs best in predicting ΔHm for the majority of the training sizes. At large training sizes, FDS2S and FG models outperform FDA models, which highlights the strength of deep neural networks and learned representations at the large data scale when predicting complex properties. ΔHm is a property of a mixture relative to its pure components, which adds to the complexity of creating accurate formulation-property relationships, since differences between the mixture and its pure components are not explicitly defined in the formulation-property relationships. One possible way to improve predictions of ΔHm is to encode descriptor differences between multiple species, analogous to taking differences between reactant and product feature spaces to improve the prediction of bond dissociation energies28 or hydrolysis energies29, which is a subject of future work.

Fig. 5: Performance of formulation-property relationships.

Learning curve showing the left-out test set coefficient of determination (R2) as a function of training size when formulation-property models are trained to predict A density, B heat of vaporization (ΔHvap), and C enthalpy of mixing (ΔHm). The average test set R2 of three independent runs is shown, and the uncertainty is estimated by computing the standard deviation of the individual runs. Dashed black line is drawn at test set R2 of 1 as a visual guide. The parity plots between predicted and actual values of the test set when FDS2S models are trained with 90% of the data (27,127 examples) are shown for D density, E ΔHvap, and F ΔHm. For parity plots, the test set R2 and root-mean-squared error performance is shown in the bottom right and a dashed black diagonal line is drawn as a visual guide.

Given that FDS2S demonstrated a high test set R2 for all formulation properties at both small and large training sizes, we further analyzed the performance of FDS2S on the test set. Figure 5D–F shows the parity plots between predicted and actual values of density, ΔHvap, and ΔHm for the left-out test set when FDS2S models are trained with 90% of the data (i.e., a training size of 27,127). Figure 5D, E shows that formulation-property models can accurately predict density and ΔHvap for new formulation examples with a test set R2 close to unity. Furthermore, Fig. 5F shows that properties like ΔHm, which are challenging to predict, can also be accurately predicted, with a test set R2 of 0.96 when a large number of data points are available. The results in Fig. 5 demonstrate that the FDS2S approach achieves high accuracy in predicting all three formulation properties, and that it ranks above the FDA and FG approaches in consistently creating accurate formulation-property models for both small and large dataset sizes. To the best of our knowledge, the FDS2S approach to creating accurate formulation-property models has not yet been reported in the literature, and the results in Fig. 5 suggest that FDS2S is a promising approach that leverages the strengths of traditional descriptor-based approaches (e.g., FDA) at the small data scale and of graph-based approaches (e.g., FG) at the large data scale to create accurate formulation-property models regardless of dataset size.

Feature importance using formulation-property models

Since the machine learning models achieved a high test set accuracy (R2 ≥ 0.90) when trained with 90% of the data, we next sought to identify the top features that were most useful for predicting density, ΔHvap, and ΔHm. Of the formulation-property approaches shown in Fig. 4, the FDA approach is the most straightforward for feature importance analysis because predefined descriptors are easier to interpret than graph-based representations. The FDA approach performs similarly to the FDS2S and FG approaches at 90% of the training data (see the training size of 27,127 in Fig. 5A–C); hence, we expect the top molecular descriptors relevant to formulation properties from the FDA approach to be similar to those of the FDS2S and FG approaches. We selected the SHapley Additive exPlanations (SHAP) approach to analyze the top features of the FDA models because SHAP is model agnostic, which enables the evaluation of feature importance across different machine learning algorithms, and it has been observed to capture relevant top features in previous works21,43,44,45,46 (see the “Methods” section for details on how SHAP is computed).
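The Mean SHAP quantity reported in Fig. 6 can be sketched as a signed column average of a (samples × features) SHAP value matrix, with features ranked by magnitude; the matrix and feature names below are illustrative, not actual SHAP output.

```python
def mean_shap(shap_matrix):
    """Signed mean SHAP per feature: average each column of a
    (samples x features) SHAP value matrix. The sign indicates whether the
    feature pushes predictions up or down on average."""
    n = len(shap_matrix)
    n_feat = len(shap_matrix[0])
    return [sum(row[j] for row in shap_matrix) / n for j in range(n_feat)]

def top_features(shap_matrix, names, k=3):
    """Rank features by the magnitude of their mean SHAP value, keeping
    the sign so positive/negative contributions remain distinguishable."""
    means = mean_shap(shap_matrix)
    ranked = sorted(zip(names, means), key=lambda nm: abs(nm[1]), reverse=True)
    return ranked[:k]
```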

Figure 6 shows the top three descriptors identified by the SHAP approach for FDA models trained to predict density, ΔHvap, and ΔHm; example structures of individual solvent ingredients are highlighted to the right of each descriptor. Figure 6A shows that MACCS key features were the most relevant for accurate predictions of density. The mean MACCS keys 160 and 114 contribute negatively to density, meaning that removing low molecular weight methyl and ethyl groups leads to an increase in density. Conversely, the mean MACCS key 107 contributes positively to density, meaning that including high atomic weight halogen elements leads to an increase in density.

Fig. 6: Feature importance from FDA models.

Top three most important features, measured as the average magnitude of the SHapley Additive exPlanations (SHAP) values (i.e., Mean SHAP), are shown for FDA models trained with 90% of the 30,142 formulation examples to predict A density, B heat of vaporization (ΔHvap), and C enthalpy of mixing (ΔHm). A positive Mean SHAP indicates that the descriptor contributes positively to the formulation property, whereas a negative Mean SHAP indicates the converse. The average Mean SHAP of the three models of an ensemble is reported, and the uncertainty is estimated by computing the standard deviation of the Mean SHAP values. For descriptors, the prefixes “mean” and “std” indicate that the compositionally weighted descriptors of individual ingredients were aggregated with average and standard deviation operations, respectively. For MACCS keys descriptors, the index of the MACCS key is shown as the right-most value (e.g., mean-MACCS_107 denotes the 107th MACCS key). For Morgan fingerprint descriptors, the index and total length of the bit-fingerprint are shown as the two right-most values (e.g., mean-MorganFingerprint_46_952 denotes a Morgan fingerprint index of 46 with a fingerprint size of 952). SMARTS patterns for MACCS keys and Morgan fingerprints, together with example structures with the matched patterns highlighted in red, are illustrated to the right of the SHAP plots.

Figure 6B shows that Morgan fingerprints were the most useful features for accurately predicting ΔHvap. The means of the top Morgan fingerprints are all positively correlated with ΔHvap, namely through the inclusion of benzene rings, hydroxyl groups, and methylene units. ΔHvap is related to the cohesion energy of a solution; hence, favorable interaction energies between molecules in a mixture typically lead to high ΔHvap values. The inclusion of benzene rings may lead to π-π stacking, which is well known to be a favorable interaction47. Furthermore, the inclusion of hydroxyl groups leads to favorable hydrogen bonding, and the inclusion of long methylene chains can lead to favorable nonpolar interactions48. Interestingly, the Morgan fingerprint of index 46 with fingerprint size 952 (mean-MorganFingerprint_46_952) shows a bit collision between benzene and hydroxyl groups, where the bit-fingerprint is set to unity for multiple atomic environments. While bit collisions lead to a loss of information about distinct atomic environments, the importance of hydroxyl groups is reiterated in mean-MorganFingerprint_536_1050, which suggests that bit collisions did not significantly impact the interpretability of the top features. In sum, ΔHvap can be increased by including ingredients with benzene rings, hydroxyl groups, or methylene units.

Similar to ΔHvap, Fig. 6C shows that Morgan fingerprints were the top features relevant to predicting ΔHm. Interestingly, all top features relevant to ΔHm correspond to nitrogen-containing compounds, and they all contribute negatively to ΔHm. Previous literature has reported that mixtures with nitrogen-containing compounds, such as diethylamine and ethanol, have negative ΔHm values with increasing diethylamine content23, which is consistent with the top features in Fig. 6C. Therefore, ΔHm can potentially be tuned by including or removing ingredients with nitrogen-containing groups. The results in Fig. 6 demonstrate that the top features related to a property can be extracted from formulation-property models, which can be used to fine-tune the selection of ingredients that satisfy a desired property criterion.

Active learning using formulation-property models

While formulation-property models are highly accurate with a large amount of data and can subsequently be used to extract important features relevant to a property, formulation design is often performed at the small data scale (~100 examples). Hence, we next evaluated whether formulation-property models are useful for identifying top candidates at the small data scale, starting from 100 examples, using an active learning approach. A typical active learning approach uses a surrogate model (i.e., a machine learning model) trained on a small subset of data to predict over a large pool of candidates; then, based on the model predictions, the best candidates are suggested for evaluation in the next experiment. After the best candidates are evaluated, they are added to the training data, and the loop is repeated for a set number of iterations or until the desired property criterion is reached. The selection of the best candidates from the machine learning predictions is determined by the acquisition function. We evaluate four acquisition functions: expected improvement, greedy, most uncertain, and random (see the “Methods” section for details).
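The four acquisition functions can be sketched as follows, assuming the surrogate model supplies a predictive mean and uncertainty for each candidate; this is a sketch, and the exact implementations used in this work are given in the “Methods” section.

```python
import math
import random

def acquire(candidates, predictions, uncertainties, best_so_far, strategy, rng=None):
    """Pick the index of the next candidate to evaluate (maximization)."""
    idx = range(len(candidates))
    if strategy == "random":
        return (rng or random).randrange(len(candidates))
    if strategy == "greedy":
        return max(idx, key=lambda i: predictions[i])        # highest predicted value
    if strategy == "most_uncertain":
        return max(idx, key=lambda i: uncertainties[i])      # highest model uncertainty
    if strategy == "expected_improvement":
        # Closed-form EI for a Gaussian predictive distribution.
        def ei(i):
            mu, sigma = predictions[i], uncertainties[i]
            if sigma == 0:
                return max(0.0, mu - best_so_far)
            z = (mu - best_so_far) / sigma
            cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))
            pdf = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
            return (mu - best_so_far) * cdf + sigma * pdf
        return max(idx, key=ei)
    raise ValueError(strategy)
```

Note the trade-off encoded here: greedy exploits the current predictions, most uncertain explores unfamiliar candidates, and expected improvement balances both through the predictive mean and uncertainty.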

Figure 7 shows the performance of formulation-property models in an active learning framework for identifying formulations with the highest density, ΔHvap, and ΔHm. Figure 7A–C shows the R2 performance of formulation-property models on a 10% left-out test set as a function of training size when using the four distinct acquisition functions. Figure 7D–F shows the percentage of formulations within the top 5% of density, ΔHvap, or ΔHm that were selected into the training set during the active learning iterations. For density as a target property, Fig. 7A shows that all acquisition functions result in a test set R2 of ~0.90 when the formulation-property model has fewer than 500 examples. The greedy acquisition function has a lower test set R2 than the other acquisition functions, suggesting that it results in models that are not as generalizable as those from random selection. However, even though the greedy acquisition function results in less accurate models, Fig. 7D shows that it captures close to 90% of the top 5% density values once the training size reaches ~1500 examples. Conversely, the expected improvement and most uncertain acquisition functions only achieve ~20% of the top density candidates at the same training size. The random acquisition function performs the worst, as expected, with less than 5% of the top density values identified. At 2000 examples, the greedy acquisition function identified formulations with the highest density values at a 14-fold higher rate than random selection.

Fig. 7: Active learning using formulation-property models.

Left-out test set coefficient of determination (R2) as a function of training size when training formulation-property models to maximize A density, B heat of vaporization (ΔHvap), and C enthalpy of mixing (ΔHm) using an active learning approach with expected improvement, greedy, most uncertain, and random acquisition functions. 10% of the 30,142 formulation examples were randomly selected as the left-out test set such that the test set contains unique formulations that are unseen in the training data pool. The percentage of formulations that are within the top 5% of the target property as a function of training size is shown for (D) density, (E) ΔHvap, and (F) ΔHm for the same acquisition functions used in (A–C). The reported R2 and top 5% values are averages over three active learning runs with different random seeds, and the uncertainty of R2 is the standard deviation across seeds.

Similar to density as a target property, Fig. 7B shows that all acquisition functions result in ΔHvap models that achieve a test set R2 of ~0.90 when the training set contains 500 examples, and the greedy acquisition function generally has a lower test set R2 than the other acquisition functions. Interestingly, Fig. 7E shows that the greedy, expected improvement, and most uncertain acquisition functions perform similarly in identifying formulations with the top 5% of ΔHvap. At a training size of 2000, ~15% of the top ΔHvap values are identified by all acquisition functions other than random selection; the latter only identified ~5% of formulations with the top ΔHvap values. Irrespective of whether the expected improvement, greedy, or most uncertain acquisition function is chosen, we observe that formulation-property models identify formulations with high ΔHvap values 2−3 times faster than random selection.

Figure 5 demonstrated that ΔHm was the most challenging of the three properties for formulation-property models to predict. Figure 7C shows that varying the acquisition function does not dramatically improve the generalizability of formulation-property models for ΔHm; the most uncertain acquisition function achieved the highest test set R2 of ~0.80 at a training size of 2000 examples, slightly higher than the random acquisition function. The greedy acquisition function struggled to create a generalizable ΔHm model and achieved a test set R2 of ~0.40 for all training sizes. Figure 7F shows that the most uncertain acquisition function performed best in identifying the formulations with the highest 5% of ΔHm values, followed by the expected improvement and greedy acquisition functions. Interestingly, the most uncertain acquisition function is not explicitly geared toward finding the maximum ΔHm, unlike the expected improvement and greedy acquisition functions, yet it outperformed both by choosing candidates with the highest prediction uncertainty. The results in Fig. 7F show that even though formulation-property models may not accurately predict ΔHm at the small data scale, prediction uncertainties of ΔHm can be useful for identifying formulation candidates that lie outside the domain of the training data and may exhibit extreme ΔHm values. The most uncertain acquisition function is 2−3 times more likely than random selection to choose formulation candidates with high ΔHm values.

Figure 7 demonstrates that formulation-property models are useful for identifying the next formulation candidates compared to random selection, irrespective of the acquisition function used. The choice of acquisition function for an active learning workflow is highly dependent on the target property and how it relates to the underlying formulation structure. For properties that are simpler to predict, with test set R2 close to 0.90, such as density or ΔHvap, the greedy or expected improvement acquisition function generally performs well in identifying formulations with high property values. Conversely, for difficult-to-predict properties, such as ΔHm, the most uncertain and expected improvement acquisition functions, which account for prediction uncertainty, are better at identifying formulations that may lie outside of the training domain and represent the extrema of property values. Overall, formulation-property relationships can serve as a powerful approach to rapidly screen formulations even with limited data, provide insight into important ingredient characteristics relevant to a target property through feature importance analysis, and suggest the next best candidates in an active learning workflow to iteratively identify formulations satisfying a property criterion.

Experimental case studies with formulation-property models

Given that formulation-property models can accurately predict properties of miscible solvents computed from MD simulations, we next sought to evaluate whether these machine learning models can accurately predict experimentally derived formulation properties from the literature, which would demonstrate that these formulation-property models can be applied practically in a real-world setting. Table 2 lists three case study examples from recent literature that have formulation datasets consisting of chemical ingredients, compositions, and a target property. For each experimental dataset, Table 2 tabulates the target property and its description, additional input features for the model, and the materials science application area. Additional information about the curated datasets, such as the number of formulation examples, the number of unique molecules/formulations, and the minimum/maximum number of ingredients, is also tabulated. After training formulation-property relationships with 90% of the data and evaluating model performance on the remaining 10% test set across five trials with out-of-sample data splitting (see the “Methods” section for details), the average test set R2, RMSE, and mean absolute error (MAE) are reported for each dataset in Table 2. The top featurizer and model identified by Bayesian optimization after 30 model training iterations are also reported, where the percentage of occurrences that a featurizer or model appeared among the top three models for an ensemble is reported in parentheses. While Table 2 is not a comprehensive list of all possible formulation datasets in the literature, these case study examples serve as a starting point to rigorously benchmark machine learning models for accurately predicting formulation properties from experiments across different material science applications. For each case study, Fig. 8 shows an example material system and a parity plot between predicted and actual values when using a formulation machine learning model that is trained with a random, out-of-sample 90:10 train:test split. We next describe how the formulation-property models can be applied to the three case study examples highlighted in Table 2 and Fig. 8.
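The evaluation protocol described above, averaging test metrics over five random 90:10 out-of-sample splits, can be sketched as follows. `fit_predict` is a placeholder for any model wrapper, and the metric helpers follow the standard definitions of R2, RMSE, and MAE.

```python
import random
import statistics

def r2_score(y_true, y_pred):
    """Coefficient of determination."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def rmse(y_true, y_pred):
    """Root-mean-squared error."""
    return (sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)) ** 0.5

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def evaluate_over_splits(X, y, fit_predict, n_trials=5, test_frac=0.1, seed=0):
    """Average test metrics over `n_trials` random out-of-sample splits.
    `fit_predict(train_X, train_y, test_X)` returns test-set predictions."""
    rng = random.Random(seed)
    scores = {"r2": [], "rmse": [], "mae": []}
    for _ in range(n_trials):
        idx = list(range(len(X)))
        rng.shuffle(idx)
        n_test = max(1, int(len(X) * test_frac))
        test, train = idx[:n_test], idx[n_test:]
        preds = fit_predict([X[i] for i in train], [y[i] for i in train],
                            [X[i] for i in test])
        y_test = [y[i] for i in test]
        scores["r2"].append(r2_score(y_test, preds))
        scores["rmse"].append(rmse(y_test, preds))
        scores["mae"].append(mae(y_test, preds))
    return {k: statistics.mean(v) for k, v in scores.items()}
```

Averaging over multiple random splits, rather than reporting a single pre-specified split, gives the spread in test set R2 reported alongside the means in Table 2.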

Table 2: Summary of literature datasets used to benchmark formulation-property relationships
Fig. 8: Performance of formulation machine learning models on experimental datasets.

A Example of binary solvent mixtures with compositions in mole percent and temperature that is passed into a formulation machine learning model to predict the log viscosity in centipoise using a dataset extracted from ref. 14. B Example of a drug in single or binary solvent mixtures with compositions in mole percent and temperature that is passed into a formulation machine learning model to predict the drug solubility in grams drug/100 grams solution using a dataset extracted from ref. 54. The compositions are re-normalized to sum to 100% in the machine learning models. C Example of a mixture of ternary hydrocarbon solvents with compositions in volume percent that is passed into a formulation machine learning model to predict the motor octane number using a dataset extracted from ref. 8. For all datasets, a parity plot between predicted and actual target values is shown using the best-performing formulation machine learning model for a 90:10 train:test out-of-sample split. The number of examples (N), the train/test set coefficient of determination (R2), and the train/test set root-mean-squared error (RMSE) are shown in the bottom right of the plots. The histograms above and to the right of the parity plots show the normalized distribution of points for both train and test sets.
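The re-normalization mentioned in the caption (panel B) amounts to rescaling the ingredient fractions so they sum to 100%; a minimal sketch:

```python
def renormalize(compositions):
    """Rescale ingredient fractions (any positive units) to sum to 100%."""
    total = sum(compositions)
    return [100.0 * c / total for c in compositions]
```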

In case study 1, viscosity is an important material property that dictates how a fluid flows when an external force is applied14,21,49, which is crucial for battery and energy storage applications since it dictates electrolyte solution performance in lithium-ion batteries50,51. In previous work, graph-based machine learning models were found to accurately predict temperature-dependent viscosities of single or binary solvent mixtures, achieving a test set RMSE of 0.08 using an 80:20 out-of-sample train:test split14. For this work, the dataset of 39,077 temperature-dependent viscosities was extracted from ref. 14 and curated by: (1) removing duplicated pure component examples (4691 rows); and (2) removing examples with extreme log viscosities, less than −1 (8 rows) or greater than 3 (4 rows). The curated dataset has a total of 34,374 examples for training machine learning models. In the original work14, special care was taken to ensure that water was part of the training set since the viscosity of water exhibits unusual behavior; in this work, we do not make the same restriction for water given that the models performed well across multiple random, out-of-sample train/test splits on the curated dataset.
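The two curation steps above can be expressed as a simple filter. The row schema used here (`components`, `temperature`, `log_visc`) is hypothetical, standing in for however the extracted dataset is actually stored.

```python
def curate_viscosity_rows(rows):
    """Apply the two curation steps: (1) drop duplicated pure-component
    entries, keyed on (molecule, temperature); (2) drop log-viscosity
    extrema outside [-1, 3]. `rows` is a list of dicts with keys
    'components' (tuple of molecule identifiers), 'temperature',
    and 'log_visc'."""
    seen = set()
    curated = []
    for row in rows:
        if len(row["components"]) == 1:
            key = (row["components"][0], row["temperature"])
            if key in seen:
                continue  # duplicated pure-component example
            seen.add(key)
        if not (-1 <= row["log_visc"] <= 3):
            continue  # log-viscosity outlier
        curated.append(row)
    return curated
```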

Figure 8A shows an example of a binary mixture and temperature that is inputted into a formulation machine learning model to predict log viscosities. The parity plot between predicted versus actual log viscosities in Fig. 8A shows that the formulation machine learning model can accurately predict log viscosities with the majority of the points on the diagonal y = x line and test set R2 and RMSE of 0.97 and 0.06, respectively. Table 2 shows that formulation-property models can achieve an average test set R2 of 0.96 and RMSE of 0.08 across five random, out-of-sample train/test splits, which demonstrates the robustness of our machine learning approach. Interestingly, Table 2 shows that graph-based approaches are the best for predicting log viscosities, which is consistent with the original work that observed graph-based models achieved a similar test set RMSE of ~0.0814. The results in Fig. 8A and Table 2 demonstrate that formulation-property models can accurately predict temperature-dependent log viscosities of single or binary mixtures, which means that this model could be useful for energy storage, chemical processes, and many other material applications.

In case study 2, drug solubility in solvent mixtures is an important parameter for pharmaceutical applications, where solubility is measured as the concentration of a dissolved drug in a solvent system at which adding more drug does not increase its concentration in solution52. An accurate solubility model can be used to vary possible solvent structures, compositions, and temperatures to tailor drug solubility for specific needs, such as drug purification or co-dissolving other necessary materials besides the drug (i.e., excipients)53. In previous work54, a large dataset of ~27,000 drug solubilities in binary mixtures at temperatures between ~250 and 370 K was extracted from the literature, and tree-based models like XGBoost32 were found to outperform neural network, linear, and other models in predicting drug solubility using drug and solvent RDKit and MACCS Keys features as inputs. For this work, the dataset from ref. 54 was extracted with 27,166 examples of drug solubilities consisting of 125 drug molecules and 44 unique solvents; the dataset was curated to ensure no duplicate drug-solvent-temperature combinations.
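The deduplication step for this dataset can be sketched similarly; the field names (`drug`, `solvents`, `temperature`) are illustrative, and the solvent list is sorted so that the same mixture listed in a different order counts as a duplicate.

```python
def dedupe_solubility_rows(rows):
    """Keep only the first occurrence of each
    (drug, solvent system, temperature) combination."""
    seen = set()
    unique_rows = []
    for row in rows:
        key = (row["drug"], tuple(sorted(row["solvents"])), row["temperature"])
        if key not in seen:
            seen.add(key)
            unique_rows.append(row)
    return unique_rows
```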

Figure 8B shows an example of a drug, binary solvent mixture, and temperature that is inputted into a formulation machine learning model to predict drug solubility (log S). The parity plot between predicted and actual drug solubility shows that formulation machine learning models can accurately predict drug solubilities with a test set R2 of 0.96 and RMSE of 0.25 log S (g/100 g) for a random, out-of-sample 90:10 train:test split. Table 2 shows that the models have an average test set R2 of 0.88, RMSE of 0.36 log S (g/100 g), and MAE of 0.24 log S (g/100 g) across five train/test splits, which is comparable to previous literature that observed a test set RMSE of ~0.33 log S (g/100 g) and MAE of ~0.22 log S (g/100 g) using an XGBoost model54 and lower than MAE values often found in the solubility literature, between 0.4 and 0.5 log S (g/100 g)55,56. Interestingly, Table 2 shows that FDS2S and FG models perform best in predicting log S, which is distinct from the XGBoost model identified as the top-performing model in previous work54 and demonstrates that deep learning models could improve prediction accuracy for this dataset. Furthermore, we do not explicitly delineate which molecule is a drug or a solvent in formulation machine learning models, but the neural networks are still able to accurately predict drug solubility from the non-obvious connections between the input molecules. These results demonstrate that formulation machine learning models can accurately predict solubility in mixture systems, which could be useful for a variety of materials applications like pharmaceutical formulations.

In case study 3, engine performance is dictated by the quality of the fuel mixture, where the fuel may lead to “knocking” sounds in an engine due to auto-ignition during compression, a tendency quantified by the motor octane number (MON)8. Since fuels consist of mixtures of hydrocarbons, a machine learning model that can predict MON would be useful for the petroleum industry to improve engine performance by optimizing the fuel mixture. Previous work8 focused on predicting MON using a linear mixing operator that combines information outputted from a recurrent neural network (namely, a long short-term memory cell57) operating on SMILES with Mordred descriptors consisting of more than 1800 2D and 3D descriptors58. Their approach achieved a fairly high test set R2 of 0.92 and low MAE of 2.8 for predicting MON using both pure component and mixture hydrocarbon blends8. We extracted the dataset from ref. 8 containing 772 examples with MON values consisting of pure (366 examples) and blend (356 examples) hydrocarbons.

Figure 8C shows an example of a ternary system of hydrocarbons that is inputted into a formulation machine learning model to predict MON. The parity plot between predicted and actual MON shows that the majority of the fuel examples are well-captured, with a test set R2 of 0.79 and RMSE of 6.58. The test set performance of formulation machine learning models is poorer when predicting MON as compared to the other case studies because of the complexity of the fuels dataset, which contains either pure component systems or blends of up to 121 ingredients. Inclusion of a larger dataset or more relevant descriptors may be a potential avenue to improve model accuracy; for example, the original work included the yield sooting index of pure hydrocarbons as an input into a neural network model8; however, we did not include the yield sooting index as an input to the machine learning model because it is a measured property that is not readily available for new systems. Table 2 reveals that across five random, out-of-sample 90:10 train:test splits, formulation machine learning models achieve an average test set R2 of 0.79, RMSE of 8.17, and MAE of 5.29. The performance of these models is poorer than previous work, which achieved an MAE of 2.8 using a single, pre-specified train/test split8; however, the models in the previous work were customized to predict fuel properties, such as through the inclusion of the yield sooting index of pure hydrocarbons, and are not easily transferable to other formulation datasets. The large fluctuations in the test set R2 of ±0.11 suggest that model performance varies depending on the train/test split, which should diminish with larger dataset sizes, as seen in the previous two case studies that have more than ~27 K examples.
Of the formulation-property approaches, Table 2 shows that Set2Set and XGBoost models performed the best in predicting MON as compared to graph-based approaches, likely because descriptor-based approaches perform better for small datasets. In sum, formulation machine learning models can be used to predict fuel performance and are scalable in the number of components, handling up to 121 different ingredients. These three case studies demonstrate the capability of the formulation-property relationships in Fig. 4 to accurately predict a wide range of experimental material properties, showcasing their applicability to experimental datasets for diverse materials beyond the simulation-derived dataset of miscible solvent mixtures (Figs. 3 and 5).
