Leveraging high-throughput molecular simulations and machine learning for the design of chemical mixtures


Formulation dataset: miscible solvents

Figure 1A shows the workflow for selecting formulation examples given a miscibility table in which 81 solvents were tabulated against 25 solvents. We first extracted miscibility tables from the CRC handbook19 to identify pairs of industrially relevant solvents that are miscible with one another. Figure 1A shows an example of binary mixtures selected using the miscibility entries for acetone, benzene, and 1,2-ethanediol. In this example, acetone and benzene are considered a formulation since they are miscible, whereas benzene and 1,2-ethanediol are not considered a formulation since they are immiscible. One limitation of miscibility tables is that miscibility is measured with equal volumes of the two liquids, which does not indicate whether the mixture remains miscible at other compositions. We therefore tested whether varying the composition of binary solvent mixtures introduces any immiscibility, and we observed that the majority of the mixtures remain miscible in MD simulations (see Supplementary Figure 1 and Supplementary Figure 2). For an N-component system, we assumed that if every solvent pair is miscible with one another, then the entire N-component system is miscible and is considered a viable formulation. Figure 1B shows the number of possible unique formulations as the number of components increases up to six. We chose to study up to five components, which yields a total of 19,238 unique formulations. By using experimentally derived miscibility tables, we designed a large formulation dataset of miscible solvent mixtures; homogeneous solutions are important in a variety of materials science applications such as battery electrolytes, chemical reactivity, and consumer packaged goods.
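The pairwise-to-N-component selection rule described above can be sketched as a clique search over a miscibility graph: an N-component formulation is viable only if every pair of its solvents is miscible. The miscibility pairs and solvent names below are illustrative, not the CRC data.

```python
from itertools import combinations

# Toy pairwise miscibility table (illustrative, not the CRC handbook data):
# acetone/benzene are miscible; benzene/1,2-ethanediol are not.
MISCIBLE_PAIRS = {
    frozenset({"acetone", "benzene"}),
    frozenset({"acetone", "1,2-ethanediol"}),
    frozenset({"acetone", "ethanol"}),
    frozenset({"benzene", "ethanol"}),
    frozenset({"ethanol", "1,2-ethanediol"}),
}
SOLVENTS = sorted({s for pair in MISCIBLE_PAIRS for s in pair})

def is_viable_formulation(solvents):
    """An N-component system is viable iff every solvent pair is miscible."""
    return all(frozenset(p) in MISCIBLE_PAIRS for p in combinations(solvents, 2))

def enumerate_formulations(max_components):
    """All unique miscible formulations with 1..max_components solvents."""
    found = []
    for n in range(1, max_components + 1):
        for combo in combinations(SOLVENTS, n):
            if is_viable_formulation(combo):
                found.append(combo)
    return found
```

With this toy table, every single solvent counts as a formulation, five of the six solvent pairs are viable, and any combination containing the immiscible benzene/1,2-ethanediol pair is rejected.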

Fig. 1: Formulation dataset generated from experimental miscibility tables.
figure 1

A Example of three solvents from miscibility tables extracted from ref. 19. Pairs of solvents labeled “miscible” were used to generate a formulation dataset. A total of 81 solvents were tabulated against 25 solvents for miscibility. B Number of unique formulations and cumulative number of unique formulations possible as a function of the number of components, using the miscibility table described in (A). The cumulative number of formulations is the cumulative sum of formulations from 1 to N components.

Using the 19,238 unique formulations of up to 5 components, we further varied the composition of binary and ternary systems as summarized in Table 1. For binary mixtures, each component was varied across 20%, 40%, 50%, 60%, and 80%. For ternary mixtures, we selected 60% of one component and 20% of each of the other components, as well as equal-ratio mixtures. Given the large number of possible composition variations for quaternary and quinternary mixtures, only equal-ratio systems were studied here. In sum, a total of 30,142 formulation examples were studied in this work, spanning pure component systems (N = 1) to quinternary systems (N = 5).
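The composition grids described above can be sketched as follows; this is a minimal illustration of the binary and ternary variations in Table 1, with quaternary and quinternary systems restricted to equal ratios.

```python
def binary_compositions():
    # Each component is varied over 20/40/50/60/80 wt%; the pair sums to 100.
    fracs = [20, 40, 50, 60, 80]
    return [(a, 100 - a) for a in fracs]

def ternary_compositions():
    # One component at 60 wt% with the others at 20 wt%, plus the equal-ratio mix.
    comps = [(60, 20, 20), (20, 60, 20), (20, 20, 60)]
    comps.append((100 / 3,) * 3)
    return comps

def equal_ratio(n_components):
    # Quaternary and quinternary systems use only equal ratios.
    return (100 / n_components,) * n_components
```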

Table 1 Summary of formulations studied in this work as a function of number of components

Generating formulation dataset with molecular simulations

We validated whether simulation-derived properties can accurately capture experimental trends for industrially relevant solvents. We extracted three MD descriptors from the last 10 ns of the production MD simulation (see the “Methods” section for simulation and descriptor details): (1) packing density, (2) heat of vaporization (ΔHvap), and (3) enthalpy of mixing (ΔHm). Density measures how tightly the molecules of a mixture are packed. ΔHvap is the amount of heat needed to convert some fraction of the liquid into vapor. While ΔHvap of mixtures is challenging to measure experimentally20, ΔHvap has been observed to correlate with temperature-dependent viscosities of pure liquids in both MD simulations21 and experiments22. Therefore, ΔHvap is an informative property that may correlate with other material properties. ΔHm is a fundamental thermodynamic property of liquid mixtures that measures the energy released or absorbed upon mixing pure components into a single phase in equilibrium. We treat these three MD descriptors as relevant formulation labels applicable to materials science applications. For example, density is an important property for battery applications since it dictates the battery weight and charge mobility; ΔHvap effectively measures the cohesion energy of a liquid and has previously been observed to correlate with viscosity21; and ΔHm is important for process design, dictating properties such as solubility and phase stability.
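For concreteness, the two mixture labels can be sketched with their standard definitions, ΔHvap ≈ Ugas − Uliquid + RT and ΔHm = Hmix − Σi xi Hi. The energies below are illustrative placeholders, not simulation outputs; the exact estimators used in this work are given in the “Methods” section.

```python
R_KCAL = 1.987204e-3  # gas constant in kcal/(mol·K)

def heat_of_vaporization(u_gas, u_liquid, temperature=298.15):
    """Standard MD estimate: ΔHvap = U_gas - U_liquid + RT (kcal/mol).
    u_gas and u_liquid are per-molecule potential energies (illustrative)."""
    return u_gas - u_liquid + R_KCAL * temperature

def enthalpy_of_mixing(h_mix, pure_enthalpies, mole_fractions):
    """ΔHm = H_mix - sum_i x_i * H_i (all per mole of mixture).
    By construction, a pure component has ΔHm = 0."""
    assert abs(sum(mole_fractions) - 1.0) < 1e-9
    ideal = sum(x * h for x, h in zip(mole_fractions, pure_enthalpies))
    return h_mix - ideal
```

Note that because ΔHm is defined relative to the pure components, it vanishes identically for any single-component system, which is why binary mixtures show the widest ΔHm range later in Fig. 3C.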

Figure 2A shows an example in which acetone and benzene are equally weighted and simulated with MD to compute formulation properties. The simulation snapshot in Fig. 2A shows a well-mixed system of acetone and benzene, consistent with the experimental miscibility table in Fig. 1A. Figure 2B shows the coefficient of determination (R2) between simulation-derived and experimental properties for density, ΔHvap, and ΔHm. For all formulation properties, we observe good agreement between simulation-derived and experimental values with R2 ≥ 0.84. Figure 2C–E shows the parity plots between simulation-derived and experimental properties. For density (Fig. 2C), we compared the packing density of eleven pure solvents and observe strong agreement with experiments, with an R2 of 0.98 and a root-mean-squared error (RMSE) of ~15.4 kg/m3. Similarly, we observe a strong correlation between MD simulations and experiments for ΔHvap when comparing 34 pure solvents (Fig. 2D), with an R2 of 0.97 and an RMSE of 3.4 kcal/mol. Density and heat of vaporization are expected to be well captured by MD simulations since the OPLS4 force field is parameterized to accurately predict these properties5; hence, the results in Fig. 2C, D are consistent with the literature showing that these two properties are accurately predicted with MD simulations5,6,7. On the other hand, ΔHm is not used to parameterize the OPLS4 force field, but ΔHm has shown good agreement between experiments and MD simulations for a variety of solvents, such as nonpolar-nonpolar mixtures (e.g., benzene and cyclohexane) and nonpolar-polar mixtures (e.g., benzene and ethanol)23. Figure 2E shows that simulation-derived ΔHm captures experimental trends for 54 binary mixture examples using the simulation protocol in this work. Given that the simulation-derived properties correlate with experiments for density, ΔHvap, and ΔHm, we conclude that MD simulations can accurately capture formulation properties for the solvent systems studied in this work.
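The parity-plot metrics used throughout, R2 and RMSE, can be computed as sketched below; the work itself presumably relies on standard library implementations.

```python
import math

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

def rmse(actual, predicted):
    """Root-mean-squared error between actual and predicted values."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)
```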

Fig. 2: Generating formulation labels using classical molecular dynamics (MD) simulations and validating them against experiments.

A Workflow to compute formulation properties by adding a 50 wt% acetone and 50 wt% benzene mixture into an MD simulation. Formulation properties are computed using the last 10 ns of a production MD run. B Coefficient of determination (R2) between MD simulation and experimental values for density, heat of vaporization (ΔHvap), and enthalpy of mixing (ΔHm). N denotes the number of data points used for each validation. C Simulation-derived versus experimental density for eleven pure solvent examples. D Simulation-derived versus experimental ΔHvap for 34 pure component examples. Experimental densities and ΔHvap were taken from the CRC handbook19. E Simulated versus experimental enthalpy of mixing for 54 binary mixture examples. Experimental enthalpy of mixing values were extracted from ref. 23. All scatter plots report the coefficient of determination (R2) and root-mean-squared error (RMSE) between simulated and experimental values in the lower right corner. A diagonal gray dashed line is shown as a visual guide. The examples used to compare the formulation labels between MD simulations and experiments are tabulated in Supplementary Table 1.

Since MD simulations can accurately capture experimental trends, we then used MD simulations to generate a large formulation dataset that is useful for benchmarking formulation-property relationships. Using the miscibility table to identify miscible solvent systems ranging from pure component systems (N = 1) to quinternary systems (N = 5) as described in Fig. 1, we performed 30,142 MD simulations and extracted the density, ΔHvap, and ΔHm from the production simulations. Figure 3 shows box and whisker plots of density, ΔHvap, and ΔHm computed from MD simulations as a function of the number of components. Figure 3A, B shows that as the number of components increases, the distributions of density and ΔHvap become narrower compared to pure component systems (N = 1). These results show that pure component systems span a larger range of properties than their mixtures, and that solvent mixtures can be used to fine-tune properties to highly specific values that are not attainable with pure components alone. Similar to density and ΔHvap, Fig. 3C shows that increasing the number of components results in narrower ranges of ΔHm. However, ΔHm differs from the other two properties in that pure component systems have ΔHm = 0, because ΔHm of a mixture is defined relative to its corresponding pure components. Hence, binary systems (N = 2) have the largest range of ΔHm values. Since ΔHm is a relative mixture property, it may be challenging to predict with formulation-property relationships, as the model must learn differences between the mixture and its individual components. We use the 30,142 formulations with the three property labels from MD simulations to evaluate whether the formulation-property approaches can create accurate models.

Fig. 3: Distribution of the formulation labels from classical molecular dynamics simulations.

Box and whisker plot between formulation labels versus number of components are shown for A density, B heat of vaporization (ΔHvap), and C enthalpy of mixing (ΔHm). Gray grid lines are shown as visual guides.

Formulation-property relationships

All formulation-property relationships were built using the DeepAutoQSAR framework, Schrödinger’s automated molecular property prediction engine24,25. In DeepAutoQSAR, feature and model hyperparameter selection are iteratively improved by Bayesian optimization based on the model performance in the previous training cycle. This work extends the DeepAutoQSAR workflow to encode formulations as inputs, where multiple molecules with their compositions are provided as input rather than a single molecule. We focused on formulation-property relationships with the following ideal characteristics: (1) composition must be accounted for in the model, such that variations in composition affect property predictions; (2) models must be permutationally invariant, such that changing the order of the input molecules and compositions does not change the prediction output; and (3) models must be flexible to the number of components, such that a model trained on binary mixtures can be used to predict ternary mixtures, quaternary mixtures, and so on. These model characteristics are important for designing formulations because composition is crucial to a formulation, ingredients can be inputted in a random order, and the inclusion or removal of particular ingredients is commonly evaluated to measure the impact of individual ingredients on formulation properties.

Figure 4 summarizes three different approaches for developing formulation-property relationships that satisfy the characteristics of an ideal model. Figure 4A shows the formulation descriptor aggregation (FDA) approach, where individual molecules are featurized, weighted by their corresponding compositions, then aggregated using statistical metrics such as the mean, standard deviation, minimum, maximum, and median. These aggregated features are treated as formulation descriptors, which are then passed as inputs to ML models to predict formulation properties. By aggregating with statistical operations, the formulation descriptor captures the distribution of molecular properties across individual ingredients, which is useful for property prediction. The FDA approach is analogous to matminer featurizers that perform statistical operations, such as averaging and standard deviation, to characterize inorganic materials by aggregating features from individual atomic types26.
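A minimal sketch of the FDA aggregation step, assuming each ingredient is already featurized into a fixed-length descriptor vector; the vectors and weights below are illustrative. Because the aggregation statistics are symmetric in their arguments, the resulting formulation descriptor is invariant to ingredient order, satisfying requirement (2) above.

```python
from statistics import mean, median, pstdev

def fda_features(descriptor_vectors, weights):
    """Formulation descriptor aggregation (sketch): weight each ingredient's
    descriptor vector by its composition, then aggregate element-wise with
    mean/std/min/max/median. The result must not depend on ingredient order."""
    weighted = [[w * d for d in vec] for w, vec in zip(weights, descriptor_vectors)]
    features = []
    for column in zip(*weighted):  # iterate over descriptor dimensions
        features += [mean(column), pstdev(column),
                     min(column), max(column), median(column)]
    return features
```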

Fig. 4: Schematic of formulation-property relationship approaches.

A Formulation descriptor aggregation approach (FDA) where two molecules are featurized to generate molecular descriptors that are compositionally weighted, then aggregated by computing the mean, standard deviation (std), minimum (min), maximum (max), and median values. These aggregated features are considered as formulation descriptors that are passed into machine learning algorithms to predict formulation properties. B Formulation descriptor Set2Set approach (FDS2S) where two molecules are featurized to generate molecular descriptors that are compositionally weighted, then these descriptors are aggregated using a Set2Set algorithm, and finally the aggregated descriptors are passed into a fully connected neural network to predict formulation properties. C Formulation graph approach (FG) where two molecules are represented as graphs (G) consisting of atoms as nodes (V) and bonds as edges (E). For each molecule, 75 atomic features and composition are used in the node vector. Graph convolutions and update operations are performed, followed by a readout layer and a fully connected neural network to predict formulation properties.

Figure 4B shows a descriptor-based approach similar to FDA, but instead of aggregating with statistical operations, the compositionally weighted descriptors are passed into a Set2Set algorithm27 to create a formulation descriptor vector (FDS2S). The Set2Set operator uses a combination of long short-term memory (LSTM) networks to process sequential data and the softmax function as an attention layer to aggregate the arrays from multiple molecules into a single array27. Set2Set outputs the same array even when the order of the input arrays is changed, thus satisfying the permutation-invariance requirement of an ideal formulation-property model. The final array from the Set2Set layer is then passed to a fully connected layer to predict the formulation property. The usefulness of Set2Set as a way to aggregate information has been demonstrated in several previous works, such as the aggregation of reactant or product information to predict bond dissociation energies28 or hydrolysis energies29.
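The full Set2Set operator involves an LSTM-driven iterative attention loop; the sketch below illustrates only the permutation-invariant softmax-attention readout at its core, with a fixed (illustrative) query vector standing in for the learned LSTM state.

```python
import math

def attention_pool(vectors, query):
    """Softmax-attention aggregation of a set of vectors (simplified; the
    LSTM update of Set2Set is omitted). The output does not depend on the
    order in which the input vectors are supplied."""
    scores = [sum(q * v for q, v in zip(query, vec)) for vec in vectors]
    m = max(scores)                       # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]       # softmax attention weights
    dim = len(vectors[0])
    return [sum(w * vec[i] for w, vec in zip(weights, vectors))
            for i in range(dim)]
```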

Figure 4C shows a graph-based representation approach (FG), where atoms are nodes and bonds are edges. Each node vector consists of 75 atomic features and the composition of the ingredient. For each node, graph convolution operators aggregate information from the neighboring nodes and output a new atomic vector based on message passing across the molecular graph. The final learned atomic features are passed to a readout layer, whose output is fed into a fully connected neural network to predict the formulation property. Previous work has shown success in using graph-based representations for predicting the viscosity of binary mixtures14 and the battery performance of electrolyte systems15.

For descriptor-based approaches (i.e., FDA and FDS2S), four distinct molecular featurization approaches were evaluated: (1) 200 RDKit descriptors; (2) Morgan fingerprints with a size of ~500–2060 and a radius of ~2–4; (3) 167-bit MACCS keys, which are 2D structure fingerprints commonly used to measure molecular similarity or for virtual screening30; and (4) 132 matminer descriptors. Featurization with RDKit descriptors, Morgan fingerprints, and MACCS keys was implemented using the rdkit package (Version 2023.9.5)31, whereas matminer descriptors were implemented using the matminer package (Version 0.9.0)26. All features were preprocessed with the following procedure: (1) constant features with a variance of zero were removed; (2) correlated features with Pearson’s r greater than or equal to 0.90 were removed; and (3) features were standardized by subtracting the mean and dividing by the standard deviation. For the FDA approach, the 200 RDKit descriptors were omitted as a featurizer because of their poor generalizability to new formulations for specific data splits, which is likely because these descriptors are molecular size dependent (e.g., molecular weight). For the FG approach, 75 atomic features were used to featurize each of the heavy atoms. Atomic featurizations include one-hot encodings of atomic number, implicit valence, formal charge, atomic degree, number of radical electrons, hybridization, and aromaticity25. The composition of the molecule was added as the 76th atomic feature on all nodes. Node features were preprocessed by removing correlated features with Pearson’s r greater than or equal to 0.90, and non-binary features were standardized by subtracting the mean and dividing by the standard deviation.
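The three-step feature preprocessing described above can be sketched as follows; here each column is one feature evaluated over the dataset. This sketch applies the correlation filter to |r|, and the workflow's exact convention may differ.

```python
from statistics import mean, pstdev

def pearson_r(x, y):
    """Pearson correlation coefficient between two feature columns."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x) ** 0.5
    vy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (vx * vy)

def preprocess(columns):
    """(1) drop zero-variance columns, (2) drop columns with |r| >= 0.90
    against an earlier kept column, (3) standardize to zero mean, unit std."""
    kept = []
    for col in columns:
        if pstdev(col) == 0:
            continue  # step 1: constant feature
        if any(abs(pearson_r(col, k)) >= 0.90 for k in kept):
            continue  # step 2: correlated feature
        kept.append(col)
    return [[(v - mean(c)) / pstdev(c) for v in c] for c in kept]
```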

For the FDA approach, four ML algorithms were tested: elastic net, support vector regression, extreme gradient boosting (XGBoost)32, and a fully connected neural network. For the FDS2S approach, only a model with the Set2Set layer27 and a fully connected neural network was used. For the FG approach, ten graph-based approaches were evaluated: Graph Convolutional Network (GCN)33, the PyTorch version of GCN (TorchGraphConv)34, TopK35, GraphSAGE36, Graph Isomorphism Network (GIN)37, Self-Attention Graph Pooling (SAGPool)38, EdgePool39, GlobalAttention37, Set2Set27, and SortPool40. These GNN models differ mainly in how they aggregate information, building on successes in the previous literature37,39. Elastic net and support vector regression were implemented using the scikit-learn package (Version 1.2.1)41. XGBoost was implemented with the xgboost package (Version 1.7.4)32. Fully connected neural networks and graph-based models were trained with PyTorch (Version 2.0.1)42. The details of each ML algorithm and its hyperparameters are summarized in ref. 25.

Performance of formulation-property models

We next evaluate the performance of the different formulation-property approaches (Fig. 4) in predicting the three formulation properties extracted from MD simulations (Fig. 3). The performance of each formulation-property approach is measured using a learning curve, where a machine learning algorithm is iteratively trained with incrementally increasing training sizes to determine its prediction accuracy on a left-out test set as a function of training set size. An ideal formulation-property model should accurately predict formulation properties at both small (~100 examples) and large (>1000 examples) dataset sizes, especially since many formulation datasets are data limited. For example, a recent study had fewer than 200 experimentally available electrolyte formulations with which to evaluate machine learning approaches for predicting battery charging efficiencies15, which makes benchmarking data-driven approaches for formulations challenging. By using MD simulations to generate formulation labels, we can rigorously analyze the performance of formulation-property relationships at both small and large dataset sizes, which is useful for identifying formulation-property approaches that are accurate over a broad range of training sizes.
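The learning-curve protocol can be sketched with a toy one-dimensional surrogate model in place of the actual DeepAutoQSAR models; the least-squares linear fit below is purely illustrative.

```python
def linear_fit(xs, ys):
    """Least-squares slope and intercept for a 1-D toy surrogate model."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def learning_curve(x_train, y_train, x_test, y_test, sizes):
    """Refit on growing training subsets; record test-set R^2 per size."""
    def r2(actual, pred):
        m = sum(actual) / len(actual)
        ss_res = sum((a - p) ** 2 for a, p in zip(actual, pred))
        ss_tot = sum((a - m) ** 2 for a in actual)
        return 1.0 - ss_res / ss_tot
    curve = []
    for n in sizes:
        slope, intercept = linear_fit(x_train[:n], y_train[:n])
        preds = [slope * x + intercept for x in x_test]
        curve.append((n, r2(y_test, preds)))
    return curve
```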

Figure 5A–C shows the learning curve performance of FDA, FDS2S, and FG models when predicting density, ΔHvap, and ΔHm. Each learning curve shows the test set R2 as a function of training set size. When the target property is density (Fig. 5A), all formulation-property models achieve a test set R2 of ~0.90 when >500 training examples are available, which demonstrates that the formulation-property models can accurately predict density with relatively small dataset sizes. When the training size is less than 100, FDS2S models outperform the FDA and FG approaches in predicting the test set density. Of the three target properties, density is the easiest for formulation-property models to predict, which may be due to its generally monotonic behavior as a function of composition for most binary mixtures23. Figure 5B shows that formulation-property models can also accurately capture ΔHvap, with a test set R2 ≥ 0.80 when >500 training examples are available. Interestingly, FG models struggle to predict ΔHvap when the training size is less than 200, whereas descriptor-based models (FDA and FDS2S) achieve a test set R2 ≥ 0.60 in this limited-data regime. The poor prediction accuracy of the FG models is likely due to the poor representations that graph convolutional neural networks learn when limited data are available: pre-defined descriptors that better represent the material at the small data scale have been shown to outperform graph-based models, since graph models that automatically learn molecular representations through convolutional operations require a sufficient amount of training data to obtain informative molecular features9,21. Similar to density, FDS2S outperforms the other models in predicting ΔHvap across all training sizes. Figure 5C shows that formulation-property models generally struggle to predict ΔHm until the training size reaches at least ~5000 examples, at which point they achieve a test set R2 ≥ 0.80. FDS2S performs best in predicting ΔHm for the majority of the training sizes. At large training sizes, FDS2S and FG models outperform FDA models, which highlights the strength of deep neural networks and learned representations at the large data scale when predicting complex properties. ΔHm is a property of a mixture relative to its pure components, which adds to the complexity of creating accurate formulation-property relationships, since differences between the mixture and its pure components are not explicitly defined in the formulation-property relationships. One possible way to improve predictions of ΔHm is to encode descriptor differences between multiple species, analogous to taking differences between reactant and product feature spaces to improve the prediction of bond dissociation energies28 or hydrolysis energies29, which is a subject of future work.

Fig. 5: Performance of formulation-property relationships.

Learning curve showing the left-out test set coefficient of determination (R2) as a function of training size when formulation-property models are trained to predict A density, B heat of vaporization (ΔHvap), and C enthalpy of mixing (ΔHm). The average test set R2 of three independent runs is shown, and the uncertainty is estimated by computing the standard deviation of the individual runs. Dashed black line is drawn at test set R2 of 1 as a visual guide. The parity plots between predicted and actual values of the test set when FDS2S models are trained with 90% of the data (27,127 examples) are shown for D density, E ΔHvap, and F ΔHm. For parity plots, the test set R2 and root-mean-squared error performance is shown in the bottom right and a dashed black diagonal line is drawn as a visual guide.

Given that FDS2S demonstrated a high test set R2 for all formulation properties at both small and large training sizes, we further analyzed the performance of FDS2S on the test set. Figure 5D–F shows the parity plots between predicted and actual values of density, ΔHvap, and ΔHm for the left-out test set when FDS2S models are trained with 90% of the data (i.e., a training size of 27,127). Figure 5D, E shows that formulation-property models can accurately predict density and ΔHvap for new formulation examples with a test set R2 close to unity. Furthermore, Fig. 5F shows that properties like ΔHm, which are challenging to predict, can also be accurately predicted, with a test set R2 of 0.96 when a large number of data points are available. The results in Fig. 5 demonstrate that the FDS2S approach achieves high accuracy in predicting all three formulation properties, and that it ranks above the FDA and FG approaches in consistently creating accurate formulation-property models for both small and large dataset sizes. To the best of our knowledge, the FDS2S approach to creating accurate formulation-property models has not yet been reported in the literature, and the results in Fig. 5 suggest that FDS2S is a promising approach that leverages the strengths of traditional descriptor-based approaches (e.g., FDA) at the small data scale and of graph-based approaches (e.g., FG) at the large data scale to create accurate formulation-property models regardless of dataset size.

Feature importance using formulation-property models

Since the machine learning models achieved a high test set accuracy (R2 ≥ 0.90) when trained with 90% of the data, we next sought to identify the top features that were most useful for predicting density, ΔHvap, and ΔHm. Of the formulation-property approaches shown in Fig. 4, the FDA approach is the most straightforward for feature importance analysis because predefined descriptors are easier to interpret than graph-based representations. The FDA approach performs similarly to the FDS2S and FG approaches at 90% of the training data (see the training size of 27,127 in Fig. 5A–C); hence, we expect the top molecular descriptors relevant to formulation properties from the FDA approach to be similar to those of the FDS2S and FG approaches. We selected the SHapley Additive exPlanations (SHAP) approach to analyze the top features of the FDA models because SHAP is model agnostic, which enables the evaluation of feature importance across different machine learning algorithms, and it has been observed to capture relevant top features in previous works21,43,44,45,46 (see the “Methods” section for details on how SHAP is computed).
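The Mean SHAP quantity reported in Fig. 6 can be sketched as a signed column average of a (samples × features) SHAP value matrix, with features ranked by magnitude; the matrix and feature names below are illustrative, not actual SHAP output.

```python
def mean_shap(shap_matrix):
    """Signed mean SHAP per feature: average each column of a
    (samples x features) SHAP value matrix. The sign indicates whether the
    feature pushes predictions up or down on average."""
    n = len(shap_matrix)
    n_feat = len(shap_matrix[0])
    return [sum(row[j] for row in shap_matrix) / n for j in range(n_feat)]

def top_features(shap_matrix, names, k=3):
    """Rank features by the magnitude of their mean SHAP value, keeping
    the sign so positive/negative contributions remain distinguishable."""
    means = mean_shap(shap_matrix)
    ranked = sorted(zip(names, means), key=lambda nm: abs(nm[1]), reverse=True)
    return ranked[:k]
```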

Figure 6 shows the top three descriptors identified by the SHAP approach for FDA models trained to predict density, ΔHvap, and ΔHm; example structures of individual solvent ingredients are highlighted to the right of each descriptor. Figure 6A shows that MACCS key features were the most relevant for accurate predictions of density. The mean MACCS keys 160 and 114 contribute negatively to density, meaning that removing low molecular weight methyl and ethyl groups leads to an increase in density. Conversely, the mean MACCS key 107 contributes positively to density, meaning that including high atomic weight halogen elements leads to an increase in density.

Fig. 6: Feature importance from FDA models.

Top three most important features, measured as the average magnitude of the SHapley Additive exPlanations (SHAP) values (i.e., Mean SHAP), are shown for FDA models trained with 90% of the 30,142 formulation examples to predict A density, B heat of vaporization (ΔHvap), and C enthalpy of mixing (ΔHm). A positive Mean SHAP indicates that the descriptor contributes positively to the formulation property, whereas a negative Mean SHAP indicates the converse. The average Mean SHAP of the three models of an ensemble is reported, and the uncertainty is estimated by computing the standard deviation of the Mean SHAP values. For descriptors, the prefixes “mean” and “std” indicate that the compositionally weighted descriptors of individual ingredients were aggregated with average and standard deviation operations, respectively. For MACCS keys descriptors, the index of the MACCS key is shown as the right-most value (e.g., mean-MACCS_107 denotes the 107th MACCS key). For Morgan fingerprint descriptors, the index and total length of the bit-fingerprint are shown as the two right-most values (e.g., mean-MorganFingerprint_46_952 denotes a Morgan fingerprint index of 46 with a fingerprint size of 952). SMARTS patterns for MACCS keys and Morgan fingerprints, together with example structures with the matched patterns highlighted in red, are illustrated to the right of the SHAP plots.

Figure 6B shows that Morgan fingerprints were the most useful features for accurately predicting ΔHvap. The means of the top Morgan fingerprints are all positively correlated with ΔHvap, namely through the inclusion of benzene rings, hydroxyl groups, and methylene units. ΔHvap is related to the cohesion energy of a solution; hence, favorable interaction energies between molecules in a mixture typically lead to high ΔHvap values. The inclusion of benzene rings may lead to π-π stacking, which is well known to be a favorable interaction47. Furthermore, the inclusion of hydroxyl groups leads to favorable hydrogen bonding, and the inclusion of long methylene chains can lead to favorable nonpolar interactions48. Interestingly, the Morgan fingerprint of index 46 with fingerprint size 952 (mean-MorganFingerprint_46_952) shows a bit collision between benzene and hydroxyl groups, where the bit-fingerprint is set to unity for multiple atomic environments. While bit collisions lead to a loss of information about distinct atomic environments, the importance of hydroxyl groups is reiterated in mean-MorganFingerprint_536_1050, which suggests that bit collisions did not significantly impact the interpretability of the top features. In sum, ΔHvap can be increased by including ingredients with benzene rings, hydroxyl groups, or methylene units.

Similar to ΔHvap, Fig. 6C shows that Morgan fingerprints were the top features relevant to predicting ΔHm. Interestingly, all top features relevant to ΔHm correspond to nitrogen-containing compounds, and they all contribute negatively to ΔHm. Previous literature has reported that mixtures with nitrogen-containing compounds, such as diethylamine and ethanol, have negative ΔHm values with increasing diethylamine content23, which is consistent with the top features in Fig. 6C. Therefore, ΔHm can potentially be tuned by including or removing ingredients with nitrogen-containing groups. The results in Fig. 6 demonstrate that the top features related to a property can be extracted from formulation-property models, which can be used to fine-tune the selection of ingredients that satisfy a desired property criterion.

Active learning using formulation-property models

While formulation-property models are highly accurate with a large amount of data and can subsequently be used to extract important features relevant to a property, formulation design is often performed at the small data scale (~100 examples). Hence, we next evaluated whether formulation-property models are useful for identifying top candidates at the small data scale, starting from 100 examples, using an active learning approach. A typical active learning approach uses a surrogate model (i.e., a machine learning model) trained on a small subset of data to predict over a large pool of candidates; then, based on the model predictions, the best candidates are suggested for evaluation in the next experiment. After the best candidates are evaluated, they are added to the training data, and the loop is repeated for a set number of iterations or until the desired property criterion is reached. The selection of the best candidates from the machine learning predictions is determined by the acquisition function. We evaluate four acquisition functions: expected improvement, greedy, most uncertain, and random (see the “Methods” section for details).
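The four acquisition functions can be sketched as follows, assuming the surrogate model supplies a predictive mean and uncertainty for each candidate; this is a sketch, and the exact implementations used in this work are given in the “Methods” section.

```python
import math
import random

def acquire(candidates, predictions, uncertainties, best_so_far, strategy, rng=None):
    """Pick the index of the next candidate to evaluate (maximization)."""
    idx = range(len(candidates))
    if strategy == "random":
        return (rng or random).randrange(len(candidates))
    if strategy == "greedy":
        return max(idx, key=lambda i: predictions[i])        # highest predicted value
    if strategy == "most_uncertain":
        return max(idx, key=lambda i: uncertainties[i])      # highest model uncertainty
    if strategy == "expected_improvement":
        # Closed-form EI for a Gaussian predictive distribution.
        def ei(i):
            mu, sigma = predictions[i], uncertainties[i]
            if sigma == 0:
                return max(0.0, mu - best_so_far)
            z = (mu - best_so_far) / sigma
            cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))
            pdf = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
            return (mu - best_so_far) * cdf + sigma * pdf
        return max(idx, key=ei)
    raise ValueError(strategy)
```

Note the trade-off encoded here: greedy exploits the current predictions, most uncertain explores unfamiliar candidates, and expected improvement balances both through the predictive mean and uncertainty.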

Figure 7 shows the performance of formulation-property models in an active learning framework for identifying formulations with the highest density, ΔHvap, and ΔHm. Figure 7A–C shows the R2 performance of formulation-property models on a 10% left-out test set as a function of training size when using the four distinct acquisition functions. Figure 7D–F shows the percentage of formulations within the top 5% of density, ΔHvap, or ΔHm that were selected into the training set during the active learning iterations. For density as a target property, Fig. 7A shows that all acquisition functions result in a test set R2 of ~0.90 when the formulation-property model has fewer than 500 examples. The greedy acquisition function has a lower test set R2 than the other acquisition functions, suggesting that it results in models that are not as generalizable as those from random selection. However, even though the greedy acquisition function results in less accurate models, Fig. 7D shows that it captures close to 90% of the top 5% density values once the training size reaches ~1500 examples. Conversely, the expected improvement and most uncertain acquisition functions only achieve ~20% of the top density candidates at the same training size. The random acquisition function performs the worst, as expected, with less than 5% of the top density values identified. At 2000 examples, the greedy acquisition function identified formulations with the highest density values at a 14-fold higher rate than random selection.

Fig. 7: Active learning using formulation-property models.

Left-out test set coefficient of determination (R2) as a function of training size when training formulation-property models to maximize A density, B heat of vaporization (ΔHvap), and C enthalpy of mixing (ΔHm) using an active learning approach with expected improvement, greedy, most uncertain, and random acquisition functions. 10% of the 30,142 formulation examples were randomly selected as the left-out test set such that the test set contains unique formulations that are unseen in the training data pool. The percentage of formulations that are within the top 5% of the target property as a function of training size is shown for (D) density, (E) ΔHvap, and (F) ΔHm for the same acquisition functions used in (A–C). The reported R2 and top 5% values are averages over three active learning runs with different random seeds, and the uncertainty of R2 is the standard deviation across seeds.

Similar to density as a target property, Fig. 7B shows that all acquisition functions result in ΔHvap models that achieve a test set R2 of ~0.90 when the training set contains 500 examples, and the greedy acquisition function generally has a lower test set R2 than the other acquisition functions. Interestingly, Fig. 7E shows that the greedy, expected improvement, and most uncertain acquisition functions perform similarly in identifying formulations with the top 5% of ΔHvap. At a training size of 2000, ~15% of the top ΔHvap values are identified by all acquisition functions other than random selection; the latter only identified ~5% of formulations with the top ΔHvap values. Irrespective of whether the expected improvement, greedy, or most uncertain acquisition function is chosen, we observe that formulation-property models identify formulations with high ΔHvap values 2−3 times faster than random selection.

Figure 5 demonstrated that ΔHm was the most challenging of the three properties for formulation-property models to predict. Figure 7C shows that varying the acquisition function does not dramatically improve the generalizability of formulation-property models for ΔHm; the most uncertain acquisition function achieved the highest test set R2 of ~0.80 at a training size of 2000 examples, slightly higher than the random acquisition function. The greedy acquisition function struggled to create a generalizable ΔHm model and achieved a test set R2 of ~0.40 for all training sizes. Figure 7F shows that the most uncertain acquisition function performed best in identifying the formulations with the highest 5% of ΔHm values, followed by the expected improvement and greedy acquisition functions. Interestingly, the most uncertain acquisition function is not explicitly geared toward finding the maximum ΔHm, unlike the expected improvement and greedy acquisition functions, yet it outperformed both by choosing candidates with the highest prediction uncertainty. The results in Fig. 7F show that even though formulation-property models may not accurately predict ΔHm at the small data scale, prediction uncertainties of ΔHm can be useful for identifying formulation candidates that lie outside the domain of the training data and may exhibit extreme ΔHm values. The most uncertain acquisition function is 2−3 times more likely than random selection to choose formulation candidates with high ΔHm values.

Figure 7 demonstrates that formulation-property models are useful for identifying the next formulation candidates compared to random selection, irrespective of the acquisition function used. The choice of acquisition function for an active learning workflow is highly dependent on the target property and how it relates to the underlying formulation structure. For properties that are simpler to predict, with test set R2 close to 0.90, such as density or ΔHvap, the greedy or expected improvement acquisition function generally performs well in identifying formulations with high property values. Conversely, for difficult-to-predict properties, such as ΔHm, the most uncertain and expected improvement acquisition functions, which account for prediction uncertainty, are better at identifying formulations that may lie outside of the training domain and represent the extrema of property values. Overall, formulation-property relationships can serve as a powerful approach to rapidly screen formulations even with limited data, provide insight into important ingredient characteristics relevant to a target property through feature importance analysis, and suggest the next best candidates in an active learning workflow to iteratively identify formulations satisfying a property criterion.

Experimental case studies with formulation-property models

Given that formulation-property models can accurately predict properties of miscible solvents computed from MD simulations, we next sought to evaluate whether these machine learning models can accurately predict experimentally derived formulation properties from the literature, which would demonstrate that these formulation-property models can be applied practically in a real-world setting. Table 2 lists three case study examples from recent literature that have formulation datasets consisting of chemical ingredients, compositions, and a target property. For each experimental dataset, Table 2 tabulates the target property and its description, additional input features for the model, and the materials science application area. Additional information about the curated datasets, such as the number of formulation examples, the number of unique molecules/formulations, and the minimum/maximum number of ingredients, is also tabulated. After training formulation-property relationships with 90% of the data and evaluating model performance on the remaining 10% test set across five trials with out-of-sample data splitting (see the “Methods” section for details), the average test set R2, RMSE, and mean absolute error (MAE) are reported for each dataset in Table 2. The top featurizer and model identified by Bayesian optimization after 30 model training iterations are also reported, where the percentage of occurrences that a featurizer or model appeared among the top three models for an ensemble is reported in parentheses. While Table 2 is not a comprehensive list of all possible formulation datasets in the literature, these case study examples serve as a starting point to rigorously benchmark machine learning models for accurately predicting formulation properties from experiments across different material science applications. For each case study, Fig. 8 shows an example material system and a parity plot between predicted and actual values when using a formulation machine learning model that is trained with a random, out-of-sample 90:10 train:test split. We next describe how the formulation-property models can be applied to the three case study examples highlighted in Table 2 and Fig. 8.
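The evaluation protocol described above, averaging test metrics over five random 90:10 out-of-sample splits, can be sketched as follows. `fit_predict` is a placeholder for any model wrapper, and the metric helpers follow the standard definitions of R2, RMSE, and MAE.

```python
import random
import statistics

def r2_score(y_true, y_pred):
    """Coefficient of determination."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def rmse(y_true, y_pred):
    """Root-mean-squared error."""
    return (sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)) ** 0.5

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def evaluate_over_splits(X, y, fit_predict, n_trials=5, test_frac=0.1, seed=0):
    """Average test metrics over `n_trials` random out-of-sample splits.
    `fit_predict(train_X, train_y, test_X)` returns test-set predictions."""
    rng = random.Random(seed)
    scores = {"r2": [], "rmse": [], "mae": []}
    for _ in range(n_trials):
        idx = list(range(len(X)))
        rng.shuffle(idx)
        n_test = max(1, int(len(X) * test_frac))
        test, train = idx[:n_test], idx[n_test:]
        preds = fit_predict([X[i] for i in train], [y[i] for i in train],
                            [X[i] for i in test])
        y_test = [y[i] for i in test]
        scores["r2"].append(r2_score(y_test, preds))
        scores["rmse"].append(rmse(y_test, preds))
        scores["mae"].append(mae(y_test, preds))
    return {k: statistics.mean(v) for k, v in scores.items()}
```

Averaging over multiple random splits, rather than reporting a single pre-specified split, gives the spread in test set R2 reported alongside the means in Table 2.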

Table 2: Summary of literature datasets used to benchmark formulation-property relationships
Fig. 8: Performance of formulation machine learning models on experimental datasets.

A Example of binary solvent mixtures with compositions in mole percent and temperature that is passed into a formulation machine learning model to predict the log viscosity in centipoise using a dataset extracted from ref. 14. B Example of a drug in single or binary solvent mixtures with compositions in mole percent and temperature that is passed into a formulation machine learning model to predict the drug solubility in grams drug/100 grams solution using a dataset extracted from ref. 54. The compositions are re-normalized to sum to 100% in the machine learning models. C Example of a mixture of ternary hydrocarbon solvents with compositions in volume percent that is passed into a formulation machine learning model to predict the motor octane number using a dataset extracted from ref. 8. For all datasets, a parity plot between predicted and actual target values is shown using the best-performing formulation machine learning model for a 90:10 train:test out-of-sample split. The number of examples (N), the train/test set coefficient of determination (R2), and the train/test set root-mean-squared error (RMSE) are shown in the bottom right of the plots. The histograms above and to the right of the parity plots show the normalized distribution of points for both train and test sets.
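The re-normalization mentioned in the caption (panel B) amounts to rescaling the ingredient fractions so they sum to 100%; a minimal sketch:

```python
def renormalize(compositions):
    """Rescale ingredient fractions (any positive units) to sum to 100%."""
    total = sum(compositions)
    return [100.0 * c / total for c in compositions]
```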

In case study 1, viscosity is an important material property that dictates how a fluid flows when an external force is applied14,21,49, which is crucial for battery and energy storage applications since it dictates electrolyte solution performance in lithium-ion batteries50,51. In previous work, graph-based machine learning models were found to accurately predict temperature-dependent viscosities of single or binary solvent mixtures, achieving a test set RMSE of 0.08 using an 80:20 out-of-sample train:test split14. For this work, the dataset of 39,077 temperature-dependent viscosities was extracted from ref. 14 and curated by: (1) removing duplicated pure component examples (4691 rows); and (2) removing examples with extreme log viscosities, less than −1 (8 rows) or greater than 3 (4 rows). The curated dataset has a total of 34,374 examples for training machine learning models. In the original work14, special care was taken to ensure that water was part of the training set since the viscosity of water exhibits unusual behavior; in this work, we do not make the same restriction for water given that the models performed well across multiple random, out-of-sample train/test splits on the curated dataset.
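The two curation steps above can be expressed as a simple filter. The row schema used here (`components`, `temperature`, `log_visc`) is hypothetical, standing in for however the extracted dataset is actually stored.

```python
def curate_viscosity_rows(rows):
    """Apply the two curation steps: (1) drop duplicated pure-component
    entries, keyed on (molecule, temperature); (2) drop log-viscosity
    extrema outside [-1, 3]. `rows` is a list of dicts with keys
    'components' (tuple of molecule identifiers), 'temperature',
    and 'log_visc'."""
    seen = set()
    curated = []
    for row in rows:
        if len(row["components"]) == 1:
            key = (row["components"][0], row["temperature"])
            if key in seen:
                continue  # duplicated pure-component example
            seen.add(key)
        if not (-1 <= row["log_visc"] <= 3):
            continue  # log-viscosity outlier
        curated.append(row)
    return curated
```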

Figure 8A shows an example of a binary mixture and temperature that is inputted into a formulation machine learning model to predict log viscosities. The parity plot between predicted versus actual log viscosities in Fig. 8A shows that the formulation machine learning model can accurately predict log viscosities with the majority of the points on the diagonal y = x line and test set R2 and RMSE of 0.97 and 0.06, respectively. Table 2 shows that formulation-property models can achieve an average test set R2 of 0.96 and RMSE of 0.08 across five random, out-of-sample train/test splits, which demonstrates the robustness of our machine learning approach. Interestingly, Table 2 shows that graph-based approaches are the best for predicting log viscosities, which is consistent with the original work that observed graph-based models achieved a similar test set RMSE of ~0.0814. The results in Fig. 8A and Table 2 demonstrate that formulation-property models can accurately predict temperature-dependent log viscosities of single or binary mixtures, which means that this model could be useful for energy storage, chemical processes, and many other material applications.

In case study 2, drug solubility in solvent mixtures is an important parameter for pharmaceutical applications, where solubility is measured as the concentration of a dissolved drug in a solvent system at which adding more drug does not increase its concentration in solution52. An accurate solubility model can be used to vary possible solvent structures, compositions, and temperatures to tailor drug solubility for specific needs, such as drug purification or co-dissolving other necessary materials besides the drug (i.e., excipients)53. In previous work54, a large dataset of ~27,000 drug solubilities in binary mixtures at temperatures between ~250 and 370 K was extracted from the literature, and tree-based models like XGBoost32 were found to outperform neural network, linear, and other models in predicting drug solubility using drug and solvent RDKit and MACCS Keys features as inputs. For this work, the dataset from ref. 54 was extracted with 27,166 examples of drug solubilities consisting of 125 drug molecules and 44 unique solvents; the dataset was curated to ensure no duplicate drug-solvent-temperature combinations.
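The deduplication step for this dataset can be sketched similarly; the field names (`drug`, `solvents`, `temperature`) are illustrative, and the solvent list is sorted so that the same mixture listed in a different order counts as a duplicate.

```python
def dedupe_solubility_rows(rows):
    """Keep only the first occurrence of each
    (drug, solvent system, temperature) combination."""
    seen = set()
    unique_rows = []
    for row in rows:
        key = (row["drug"], tuple(sorted(row["solvents"])), row["temperature"])
        if key not in seen:
            seen.add(key)
            unique_rows.append(row)
    return unique_rows
```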

Figure 8B shows an example of a drug, binary solvent mixture, and temperature that is inputted into a formulation machine learning model to predict drug solubility (log S). The parity plot between predicted and actual drug solubility shows that formulation machine learning models can accurately predict drug solubilities with a test set R2 of 0.96 and RMSE of 0.25 log S (g/100 g) for a random, out-of-sample 90:10 train:test split. Table 2 shows that the models have an average test set R2 of 0.88, RMSE of 0.36 log S (g/100 g), and MAE of 0.24 log S (g/100 g) across five train/test splits, which is comparable to previous literature that observed a test set RMSE of ~0.33 log S (g/100 g) and MAE of ~0.22 log S (g/100 g) using an XGBoost model54 and lower than MAE values often found in the solubility literature, between 0.4 and 0.5 log S (g/100 g)55,56. Interestingly, Table 2 shows that FDS2S and FG models perform best in predicting log S, which is distinct from the XGBoost model identified as the top-performing model in previous work54 and demonstrates that deep learning models could improve prediction accuracy for this dataset. Furthermore, we do not explicitly delineate which molecule is a drug or a solvent in formulation machine learning models, but the neural networks are still able to accurately predict drug solubility from the non-obvious connections between the input molecules. These results demonstrate that formulation machine learning models can accurately predict solubility in mixture systems, which could be useful for a variety of materials applications like pharmaceutical formulations.

In case study 3, engine performance is dictated by the quality of the fuel mixture, where the fuel may lead to “knocking” sounds in an engine due to auto-ignition during compression, a tendency quantified by the motor octane number (MON)8. Since fuels consist of mixtures of hydrocarbons, a machine learning model that can predict MON would be useful for the petroleum industry to improve engine performance by optimizing the fuel mixture. Previous work8 focused on predicting MON using a linear mixing operator that combines information outputted from a recurrent neural network (namely, a long short-term memory cell57) operating on SMILES with Mordred descriptors consisting of more than 1800 2D and 3D descriptors58. Their approach achieved a fairly high test set R2 of 0.92 and low MAE of 2.8 for predicting MON using both pure component and mixture hydrocarbon blends8. We extracted the dataset from ref. 8 containing 772 examples with MON values consisting of pure (366 examples) and blend (356 examples) hydrocarbons.

Figure 8C shows an example of a ternary system of hydrocarbons that is inputted into a formulation machine learning model to predict MON. The parity plot between predicted and actual MON shows that the majority of the fuel examples are well-captured, with a test set R2 of 0.79 and RMSE of 6.58. The test set performance of formulation machine learning models is poorer when predicting MON as compared to the other case studies because of the complexity of the fuels dataset, which contains either pure component systems or blends of up to 121 ingredients. Inclusion of a larger dataset or more relevant descriptors may be a potential avenue to improve model accuracy; for example, the original work included the yield sooting index of pure hydrocarbons as an input into a neural network model8; however, we did not include the yield sooting index as an input to the machine learning model because it is a measured property that is not readily available for new systems. Table 2 reveals that across five random, out-of-sample 90:10 train:test splits, formulation machine learning models achieve an average test set R2 of 0.79, RMSE of 8.17, and MAE of 5.29. The performance of these models is poorer than previous work, which achieved an MAE of 2.8 using a single, pre-specified train/test split8; however, the models in the previous work were customized to predict fuel properties, such as through the inclusion of the yield sooting index of pure hydrocarbons, and are not easily transferable to other formulation datasets. The large fluctuations in the test set R2 of ±0.11 suggest that model performance varies depending on the train/test split, which should diminish with larger dataset sizes, as seen in the previous two case studies that have more than ~27 K examples.
Of the formulation-property approaches, Table 2 shows that Set2Set and XGBoost models performed the best in predicting MON as compared to graph-based approaches, likely because descriptor-based approaches perform better for small datasets. In sum, formulation machine learning models can be used to predict fuel performance and are scalable in the number of components, handling up to 121 different ingredients. These three case studies demonstrate the capability of the formulation-property relationships in Fig. 4 to accurately predict a wide range of experimental material properties, showcasing their applicability to experimental datasets for diverse materials beyond the simulation-derived dataset of miscible solvent mixtures (Figs. 3 and 5).
