Machine learning designs new GCGR/GLP-1R dual agonists with enhanced biological potency
Potency assays
Peptide potencies for cAMP accumulation were experimentally determined for the activation of both hGCGR and hGLP-1R expressed in Chinese hamster ovary (CHO) cells for a set of 125 unique peptide sequence variants, following methods described previously16,35. In brief, stable CHO cell lines expressing human and mouse GLP-1R were generated in-house using standard methods, as previously described16. CHO cells expressing either human GLP-1R or human GCGR were dispensed in assay buffer (Hanks' balanced salt solution containing 0.1% BSA (Sigma-Aldrich) and 0.5 mM IBMX (Sigma-Aldrich)) into 384-well assay plates containing dilutions of test peptides. After 30 min of incubation, cAMP levels were measured using the cAMP dynamic 2 HTRF kit (Cisbio) following the manufacturer's recommendations. Fluorescence emissions at 665 nm and 620 nm following excitation at 320 nm were detected using an Envision reader (PerkinElmer), and the data were transformed to % Delta F, as described in the manufacturer's guidelines, before EC50 determination. All in vitro cell-based assay data are presented as the mean of n ≥ 3 independent experiments, and all individual EC50 measurements were within threefold of the geometric mean. The native peptide reference standard potency was within threefold of the historical geometric mean for all assays.
Datasets
The GPCR-binding peptides considered in this work exclusively comprise naturally occurring amino acids, so the models are not able to capture the effect of any chemical modifications of residues. The initial set of sequences was aligned using MAFFT version 7 (ref. 36) to reveal regularities in amino-acid occurrences across positions. We reasoned that sequence alignment might help structure the data, thereby increasing the predictive power of the neural-network models. The aligned sequences were truncated to L = 30 amino acids, and redundant sequences were removed. The final set of sequences used in this study comprised N = 125 unique peptide sequences tested against the human GCGR and GLP-1R receptors. Within this dataset, 122 records were C-terminally amidated. The sequences were subsequently encoded using a one-hot representation and used to train various regression models.
Data encoding
To encode the amino acid at each sequence position we used a one-hot (binary) representation. Here, we considered 21 categories: 20 amino acids and the gap symbol '-', introduced by alignment. Because nearly all peptides used in these studies (122/125) were C-terminally amidated, we did not introduce an additional parameter to encode this feature. In this approach, each peptide sequence of length L is converted to a binary matrix S of size 21 × L whose entries indicate the presence of amino acid \(A_i\) at a given sequence position, such that \(S_{ab}=1\) if \(a=i\) and 0 otherwise, for all \(b\in \{1,\ldots ,L\}\). The binary matrix is then reshaped into a vector: \(S_{21\times L}\to v_{1\times 21L}\). The alignment process ensures that L = 30 for all peptides, such that each sequence is represented by the binary vector \(v\in {\mathbb{R}}^{1\times 630}\).
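A minimal sketch of this encoding is shown below, assuming a fixed alphabet ordering of the 20 amino acids plus the gap symbol; the function and variable names are illustrative and not taken from the original code.

```python
import numpy as np

# 20 standard amino acids plus the alignment gap symbol '-' (21 categories); the ordering is an assumption
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"
AA_INDEX = {aa: i for i, aa in enumerate(ALPHABET)}

def one_hot_encode(aligned_seq: str, L: int = 30) -> np.ndarray:
    """Convert an aligned peptide of length L into a flat binary vector of size 21 * L."""
    assert len(aligned_seq) == L, "sequences must be aligned/truncated to a fixed length"
    S = np.zeros((len(ALPHABET), L), dtype=np.float32)   # 21 x L binary matrix
    for b, aa in enumerate(aligned_seq):
        S[AA_INDEX[aa], b] = 1.0
    return S.reshape(1, -1)                              # reshape to 1 x 630

# Example: a gapped 30-residue alignment row (human glucagon padded with one gap)
v = one_hot_encode("HSQGTFTSDYSKYLDSRRAQDFVQWLMNT-")
print(v.shape)  # (1, 630)
```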
Evaluation metrics
We employed the following commonly used regression metrics to evaluate the prediction accuracy of the models developed in this work.
1. Root-mean-square error (r.m.s.e.): \(\mathrm{r.m.s.e.}=\sqrt{\frac{1}{N}\sum_{i=1}^{N}{(y_{i}-{\hat{y}}_{i})}^{2}}\)

2. Mean absolute error (m.a.e.): \(\mathrm{m.a.e.}=\frac{1}{N}\sum_{i=1}^{N}|y_{i}-{\hat{y}}_{i}|\)

3. Coefficient of determination (R2): \(R^{2}=1-\frac{\sum_{i=1}^{N}{(y_{i}-{\hat{y}}_{i})}^{2}}{\sum_{i=1}^{N}{(y_{i}-\bar{y})}^{2}}\)
Notation: \(y_i\), true value of the target for the ith sample; \({\hat{y}}_{i}\), predicted value of the target for the ith sample; \(\bar{y}\), average value of the target; N, number of examples in the batch.
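The three metrics can be computed directly from their definitions; the short numpy sketch below is illustrative only.

```python
import numpy as np

def regression_metrics(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    """Compute r.m.s.e., m.a.e. and R^2 as defined above."""
    residuals = y_true - y_pred
    rmse = np.sqrt(np.mean(residuals ** 2))
    mae = np.mean(np.abs(residuals))
    r2 = 1.0 - np.sum(residuals ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    return {"rmse": rmse, "mae": mae, "r2": r2}
```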
Neural-network model
We used the Keras/TensorFlow functional API to build the deep network model37. The first Conv1D layer in our model has 256 filters, a kernel window of three amino acids, without padding, and an L2 regularization penalty on the kernel, with weight = 0.01. The layer uses ReLU activation. We next added batch normalization and a MaxPool1D operation with stride 2 and used Dropout = 0.5. The second Conv1D layer contains 512 filters and the same configuration of parameters as the first layer, with an additional L2 regularization penalty on the bias term, with weight = 0.01. The layer is activated with ReLU, followed by batch normalization, a MaxPool1D operation with stride 2, and Dropout = 0.5. The third convolutional layer has 128 filters; here the padding preserves the shape of the input, and the kernel as well as the bias are regularized with L2. This layer is followed by a MaxPool1D operation with stride 2. Next, the output from the convolutional layers is flattened and two dense layers terminate the network. The first dense layer comprises 256 units, and the second layer has 64 units. Both layers are ReLU-activated. The final two dense layers, each with a single unit, convert the model output to the predictions; these layers have no (linear) activation.
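A minimal Keras functional-API sketch consistent with this description is given below. It assumes the one-hot matrix is fed with shape (L, 21) rather than as the flattened vector, a pooling window equal to the stride of 2, and output heads named "gcgr" and "glp1r"; these details, and anything not stated in the text, are assumptions rather than the authors' exact implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers, Model

def build_model(L=30, n_symbols=21):
    inp = layers.Input(shape=(L, n_symbols))                       # one-hot encoded peptide

    x = layers.Conv1D(256, 3, padding="valid", activation="relu",
                      kernel_regularizer=regularizers.l2(0.01))(inp)
    x = layers.BatchNormalization()(x)
    x = layers.MaxPool1D(pool_size=2, strides=2)(x)
    x = layers.Dropout(0.5)(x)

    x = layers.Conv1D(512, 3, padding="valid", activation="relu",
                      kernel_regularizer=regularizers.l2(0.01),
                      bias_regularizer=regularizers.l2(0.01))(x)
    x = layers.BatchNormalization()(x)
    x = layers.MaxPool1D(pool_size=2, strides=2)(x)
    x = layers.Dropout(0.5)(x)

    x = layers.Conv1D(128, 3, padding="same", activation="relu",
                      kernel_regularizer=regularizers.l2(0.01),
                      bias_regularizer=regularizers.l2(0.01))(x)
    x = layers.MaxPool1D(pool_size=2, strides=2)(x)

    x = layers.Flatten()(x)
    x = layers.Dense(256, activation="relu")(x)
    x = layers.Dense(64, activation="relu")(x)

    out_gcgr = layers.Dense(1, name="gcgr")(x)                     # linear output heads
    out_glp1r = layers.Dense(1, name="glp1r")(x)
    return Model(inp, [out_gcgr, out_glp1r])
```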
Network ensemble
We constructed a neural-network ensemble model where the final prediction is given by the average of the individual predictions made by M = 12 separate copies of the model. Different copies of the model, trained on the same data, differ in their predictions due to factors such as the random initialization of the network parameters. Ensembling predictions over several copies of the model has the effect of mitigating some of this randomness and reducing model variance. The resulting ensemble prediction is given by the average of the ensemble element predictions.
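A sketch of the ensembling step is shown below, continuing the build_model sketch above; the training data arrays are random placeholders standing in for the encoded peptides and measured log10 EC50 values.

```python
import numpy as np
import tensorflow as tf

M = 12  # number of ensemble members

# Illustrative placeholder data (in practice: the 120 one-hot encoded training peptides and their potencies)
x_train = np.random.rand(120, 30, 21).astype("float32")
y_gcgr = np.random.randn(120).astype("float32")
y_glp1r = np.random.randn(120).astype("float32")
x_test = np.random.rand(10, 30, 21).astype("float32")

# Train M copies of the same architecture; copies differ only through random initialization
ensemble = []
for seed in range(M):
    tf.keras.utils.set_random_seed(seed)
    model = build_model()                        # architecture sketched above
    model.compile(optimizer="adam", loss="mse")
    model.fit(x_train, [y_gcgr, y_glp1r], epochs=10, batch_size=25, verbose=0)
    ensemble.append(model)

# The ensemble prediction is the average over the member predictions
member_preds = np.array([np.hstack(m.predict(x_test, verbose=0)) for m in ensemble])  # (M, n_test, 2)
ensemble_pred = member_preds.mean(axis=0)                                             # (n_test, 2)
```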
Model training and hyperparameter tuning
To adjust capacity and select non-trainable model parameters, the available data were used for performance validation. Initially, the dataset of 125 examples was divided into three subsets: 105 training sequences, ten sequences for validation and ten held-out sequences for final model performance evaluation (unseen during training). We performed ten sixfold cross-validation splits with different random seeds, obtaining 60 (test-set size) × 10 = 600 data points (errors) in total for each model. Retraining on different data splits allowed us to take into account the variance resulting from training on different data, in addition to the variance that arises due to random model initialization.
For each baseline model, we used the sklearn grid search (GridSearchCV) to find the set of hyperparameters that provide the best cross-validation performance (listed in Supplementary Table 2). Parameters for which the optimal value differs between tasks are marked with a double value v1/v2 in the respective column of Supplementary Table 2, where v1 is the optimal parameter value for the GCGR task, and v2 is the optimal parameter value for the GLP-1R task. For the neural networks, various configurations of layers, unit numbers and regularization were tried, and we selected the model that gave the best performance on the validation set.
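The grid search over baseline hyperparameters follows the standard sklearn pattern sketched below; the parameter grid, regressor and placeholder arrays are illustrative, and the actual search ranges are those listed in Supplementary Table 2.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.ensemble import RandomForestRegressor

X_train = np.random.rand(105, 630)      # placeholder: one-hot encoded training peptides
y_train_gcgr = np.random.randn(105)     # placeholder: log10 EC50 values for the GCGR task

# Illustrative grid; the actual hyperparameter ranges are given in Supplementary Table 2
param_grid = {"n_estimators": [100, 300, 500], "max_depth": [None, 5, 10]}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=KFold(n_splits=6, shuffle=True, random_state=0),
)
search.fit(X_train, y_train_gcgr)
best_rf = search.best_estimator_
```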
In addition, to prevent overfitting of the neural networks, we monitored performance using early stopping: training was terminated when the loss on the validation set stopped improving for a selected number of epochs (the patience). Here, we used the EarlyStopping monitor implemented in the Keras callbacks module38. Deep models with 120 training examples (final models) were trained for up to 1,500 epochs, monitoring the validation loss with a patience of 100 epochs. Each batch for the gradient step contained 25 samples. The deep models with 105 training examples used for validation were trained for up to 1,500 epochs, monitoring the validation loss with a patience of 75 epochs, and 20 examples per batch (each epoch had five parameter updates). Model training is illustrated in Supplementary Fig. 2.
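The early-stopping setup for the final models could look as follows, reusing the placeholder arrays from the ensemble sketch above; the validation arrays and the restore_best_weights option are assumptions not stated in the text.

```python
import numpy as np
from tensorflow.keras.callbacks import EarlyStopping

x_val = np.random.rand(10, 30, 21).astype("float32")   # placeholder: 10 validation sequences
y_gcgr_val = np.random.randn(10).astype("float32")
y_glp1r_val = np.random.randn(10).astype("float32")

# Stop when the validation loss has not improved for `patience` epochs
early_stop = EarlyStopping(monitor="val_loss", patience=100, restore_best_weights=True)

model = build_model()
model.compile(optimizer="adam", loss="mse")
model.fit(
    x_train, [y_gcgr, y_glp1r],
    validation_data=(x_val, [y_gcgr_val, y_glp1r_val]),
    epochs=1500,
    batch_size=25,
    callbacks=[early_stop],
)
```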
Baseline models
All baseline regressors in Table 1 were implemented using the sklearn Python module39. To confirm that the ML models do not simply learn the underlying potency distributions or amino-acid sequence compositions, we trained control ensembles of multi-task neural networks using the process described above, in which we (1) shuffled each peptide sequence used to train the models and (2) shuffled the measured potencies between training examples. The resulting control models make much larger prediction errors; the results are shown in Supplementary Fig. 4 and summarized in Supplementary Table 4. Finally, we implemented a simple nearest-neighbours approach in which the potency of a held-out test sequence is predicted by the measured potency of its nearest neighbour in the training data. For each test sequence we used the pairwise2 Biopython module with the BLOSUM62 matrix to score alignments with every training sequence; in the case of multiple equidistant training sequences, the average potency was reported. Results across sixfold cross-validation are summarized in Supplementary Table 4 and show that this approach is outperformed by the ML models described above.
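A sketch of this nearest-neighbour baseline is given below using Biopython's pairwise2 with the BLOSUM62 matrix (pairwise2 is deprecated in recent Biopython releases but still available). Stripping alignment gaps before scoring and using no gap penalties (globaldx) are assumptions, as the text does not specify them.

```python
import numpy as np
from Bio import pairwise2
from Bio.Align import substitution_matrices

blosum62 = substitution_matrices.load("BLOSUM62")

def nearest_neighbour_prediction(test_seq, train_seqs, train_potencies):
    """Predict the potency of test_seq as the potency of its highest-scoring training neighbour(s)."""
    clean = lambda s: s.replace("-", "")                     # gaps removed before scoring (assumption)
    scores = np.array([
        pairwise2.align.globaldx(clean(test_seq), clean(s), blosum62, score_only=True)
        for s in train_seqs
    ])
    best = scores == scores.max()
    # If several training sequences are equidistant, report their average potency
    return float(np.mean(np.asarray(train_potencies)[best]))
```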
We used a t-test (two-sided) to test whether the differences in model performance were significant between the ensemble of multi-task neural networks and the other models. Distributions of 600 prediction errors (squared difference between the true and predicted potency for each test sequence) obtained for each model for the GCGR and GLP-1R tasks are shown in Supplementary Table 3. For each pair of models we test the null hypothesis that the two independent populations of error samples have the same average values (we do not assume equal variances). Supplementary Table 3 shows that at a confidence level of 0.05, the multi-task neural-network ensemble performs significantly better in all cases for the GLP-1R task, whereas the performance differences are insignificant in all except one case for the GCGR task.
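The described comparison corresponds to a two-sided Welch's t-test (no equal-variance assumption); a minimal sketch with placeholder error arrays is shown below.

```python
import numpy as np
from scipy import stats

# Placeholder arrays of 600 squared prediction errors per model
errors_multitask = np.random.rand(600)
errors_baseline = np.random.rand(600)

# Two-sided t-test without assuming equal variances (Welch's t-test)
t_stat, p_value = stats.ttest_ind(errors_multitask, errors_baseline, equal_var=False)
significant = p_value < 0.05
```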
Multi-task training
Multi-task learning aims to improve generalization and increase prediction accuracy by learning objectives for several target variables from shared representations28. The basic idea is that by training all tasks through shared hidden layers, each task benefits from the presence of the others, which act as regularizers and make the model less sensitive to the specifics of any single target28,40. Because the hidden layers are shared representations, the model uses the same weights for each task; the effective number of training examples is therefore increased, and overfitting on each task separately is reduced by optimizing the model with respect to the average data noise40.
We used the Keras deep-learning framework with the TensorFlow back-end37 to build our model. The model consists of an input layer followed by three 1D convolutional layers, three pooling layers and two dense layers at the bottom of the model, connected to two final units that convert the output to real-valued predictions. The overall objective is the weighted average of the loss for each of the two individual tasks:
$$L_{\mathrm{total}}=\mathop{\sum }\limits_{i=1}^{k=2}{\alpha }_{i}{L}_{i}$$
(1)
where Li is the loss function of the ith task, αi is the corresponding weight and k denotes the number of tasks. We set α1 = α2 = 0.5 so that each loss contributes with equal weight to the overall loss. We use the mean squared error (m.s.e.) as the loss for each task, \(\mathrm{m.s.e.}=\frac{1}{n}\sum_{j=1}^{n}{(y_{j}-{\hat{y}}_{j})}^{2}\), where n is the number of training examples per batch.
Our multi-task neural-network model shares all internal hidden layers between the tasks. Two output units return the predicted potencies, \({\hat{y}}_{1}\) and \({\hat{y}}_{2}\). The convolutional layers at the top of the model are designed to encode the peptide representations. We use a kernel with a window size of three amino acids and a stride of 1. Each convolutional layer is followed by a max-pooling layer with a stride of 2. We use batch normalization41 and Dropout42 for regularization. Each convolutional and dense layer is activated with the ReLU43 activation. We trained the model with the optimization objective given in equation (1) using the Adam optimizer44. The final network was trained on an equal number of training examples for both tasks, N = 120.
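The weighted multi-task objective in equation (1) can be expressed in Keras via per-output loss weights, as sketched below; this reuses the two-headed build_model sketch and the placeholder arrays introduced earlier, with the output names "gcgr" and "glp1r" being assumptions.

```python
# Weighted multi-task objective (equation (1)): L_total = 0.5 * mse_GCGR + 0.5 * mse_GLP-1R
model = build_model()
model.compile(
    optimizer="adam",
    loss={"gcgr": "mse", "glp1r": "mse"},
    loss_weights={"gcgr": 0.5, "glp1r": 0.5},
)
model.fit(
    x_train,
    {"gcgr": y_gcgr, "glp1r": y_glp1r},   # potencies for both tasks, N = 120 examples each
    validation_data=(x_val, {"gcgr": y_gcgr_val, "glp1r": y_glp1r_val}),
    epochs=1500,
    batch_size=25,
    callbacks=[early_stop],
)
```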
Model-guided ligand design
Our goal was to design peptide sequences with the following properties:
- Highly active against both receptors:

$$\mathrm{Activity}=\left\{\begin{array}{l}{\log }_{10}{\mathrm{EC}}_{50}^{\mathrm{GCGR}}\,[\mathrm{M}] < -11.5\\ {\log }_{10}{\mathrm{EC}}_{50}^{\mathrm{GLP-1R}}\,[\mathrm{M}] < -11.5\\ {\mathrm{EC}}_{50}^{\mathrm{GCGR}}/{\mathrm{EC}}_{50}^{\mathrm{GLP-1R}}\approx 1\end{array}\right.$$
(2)
- Selectively active towards GCGR:

$$\mathrm{Activity}=\left\{\begin{array}{l}{\log }_{10}{\mathrm{EC}}_{50}^{\mathrm{GCGR}}\,[\mathrm{M}] < -11\\ {\log }_{10}{\mathrm{EC}}_{50}^{\mathrm{GLP-1R}}\,[\mathrm{M}] > -9\\ {\mathrm{EC}}_{50}^{\mathrm{GCGR}}/{\mathrm{EC}}_{50}^{\mathrm{GLP-1R}}\approx 100\end{array}\right.$$
(3)
- Selectively active towards GLP-1R:

$$\mathrm{Activity}=\left\{\begin{array}{l}{\log }_{10}{\mathrm{EC}}_{50}^{\mathrm{GCGR}}\,[\mathrm{M}] > -9\\ {\log }_{10}{\mathrm{EC}}_{50}^{\mathrm{GLP-1R}}\,[\mathrm{M}] < -11.5\\ {\mathrm{EC}}_{50}^{\mathrm{GLP-1R}}/{\mathrm{EC}}_{50}^{\mathrm{GCGR}}\approx 100\end{array}\right.$$
(4)
We use model-guided directed evolution, an optimization strategy that attempts to solve the optimization problem by imitating the natural evolutionary process. In each successive generation (iteration), a change in the sequence is proposed, followed by the evaluation of a fitness function (here, potency predicted by the ensemble of multi-task neural networks) and the best solutions are progressed to the next generation. This process repeats until a satisfactory solution is reached. In this work we assume that the ensemble of multi-task convolutional neural networks makes reliable predictions up to three mutation steps from the closest training-set analogue sequence.
We first generated all single-step mutations from each training-set sequence in the three groups of interest, removing any duplicates within the generated set, and any overlaps with the training set. Because each sequence in the initial alignment has a length of 30 amino acids and each position can be mutated to one of 19 amino acids (20 if the position is gapped), this gives 570 single-step mutants in the first generation for each sequence in the training set, that is, 71,304 sequences, reducing to 69,639 sequences after removing duplicates. We then used each model to select the 50 best sequences for each of the three target designs defined above, and selected the ten most diverse sequences as starting points for a second round of optimization. Note that in the first generation for the multi-task CNN only five candidate dual agonists were found and used as parents for the second generation. For the second generation we repeated the process described for the first generation, and from the 50 best sequences for each group, we selected five diverse sequences as parents for the third generation. The entire process was then repeated for a final generation, taking the 50 sequences with the best predicted potencies within each of the three groups, considering GCGR for group one.
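A schematic sketch of this model-guided directed evolution loop is given below; the scoring function and diversity selection (score_fn, select_diverse) are simplified placeholders standing in for the ensemble predictions and the diversity/filtering criteria described in the text, not the authors' exact procedure.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"   # gapped positions may additionally mutate to any amino acid

def single_step_mutants(seq):
    """All sequences that differ from seq at exactly one aligned position."""
    mutants = set()
    for pos, aa in enumerate(seq):
        for new_aa in ALPHABET:
            if new_aa != aa:
                mutants.add(seq[:pos] + new_aa + seq[pos + 1:])
    return mutants

def directed_evolution(parents, score_fn, select_diverse, n_generations=3, top_k=50):
    """Greedy model-guided search: mutate, score with the model ensemble, keep the best, repeat."""
    for generation in range(n_generations):
        candidates = set()
        for parent in parents:
            candidates |= single_step_mutants(parent)
        candidates -= set(parents)                               # remove overlaps with the parents
        ranked = sorted(candidates, key=score_fn, reverse=True)  # score_fn: predicted fitness
        best = ranked[:top_k]
        parents = select_diverse(best)                           # a diverse subset seeds the next round
    return best
```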
We identified six biophysical properties that can be predicted from a sequence using the ProtParam module from the Biopython package45: (1) the isoelectric point, (2) GRAVY (grand average of hydropathy)46, (3) the instability index47, (4) aromaticity, (5) the molar extinction coefficient and (6) molecular weight. We compared the predicted value for each designed peptide with the predicted properties of peptides in the training set within the same potency group. We ranked the 50 best sequences in each group by computing the number of features whose values are within one standard deviation of the mean calculated for the corresponding group of training-set sequences. As the last step of filtering, we predicted the secondary structure of each final candidate using PSIPRED48 to confirm that the selected sequences are helical peptides. Using this ranking, we selected five final samples in each potency category: four from the third generation and one from the first generation of mutants. We prioritized designed sequences with the smallest (first generation) and largest (third generation) distance from the training set. Sequences selected with the ensemble of multi-task neural-network models, experimentally tested in this study and discussed in the main text, are listed in Table 2 (Supplementary Tables 7–11 provide additional details).
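The six properties map directly onto Biopython's ProteinAnalysis methods; a minimal sketch is shown below (removing gaps before analysis is an assumption, and the reduced-cysteine value of the extinction coefficient is chosen here for illustration).

```python
from Bio.SeqUtils.ProtParam import ProteinAnalysis

def biophysical_features(seq: str) -> dict:
    """Six sequence-derived properties used to rank candidate designs (alignment gaps removed first)."""
    pa = ProteinAnalysis(seq.replace("-", ""))
    return {
        "isoelectric_point": pa.isoelectric_point(),
        "gravy": pa.gravy(),
        "instability_index": pa.instability_index(),
        "aromaticity": pa.aromaticity(),
        "extinction_coefficient": pa.molar_extinction_coefficient()[0],  # reduced cysteines
        "molecular_weight": pa.molecular_weight(),
    }

print(biophysical_features("HSQGTFTSDYSKYLDSRRAQDFVQWLMNT"))  # human glucagon
```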
To examine the similarity of peptides predicted by different models within each potency profile, we used PCA, considering the 500 one-hot-encoded sequences generated across all five compared models, the wild-type peptides (hGCG and hGLP-1) and their single-step mutants (551 for hGCG and 570 for hGLP-1), such that the projection was computed for an array of size 1,621 × 21L. The projected data are shown in Supplementary Fig. 7. The selected final sequences listed in Supplementary Table 6 were analysed in terms of the total number of mutations from wild type and the predicted potencies. The predictions made with the different models are consistent, as evidenced by the low standard deviations (<0.5) from the average prediction computed across the models.
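A minimal PCA sketch of this projection using sklearn is shown below; the encoded array is a random placeholder with the stated dimensions.

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder array: 1,621 one-hot encoded sequences of length 21 * L (L = 30)
encoded = np.random.randint(0, 2, size=(1621, 630)).astype(float)

# Project into two dimensions to compare designs from the different models
pca = PCA(n_components=2)
projection = pca.fit_transform(encoded)
print(projection.shape)  # (1621, 2)
```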
To evaluate the information content generated by the sequence design process, we calculated the entropy across each set of designed sequences, and the relative entropy (Kullback–Leibler divergence, KL) between the distribution of amino acids at each sequence position estimated for the model-designed samples, and the training data. KL divergence is equal to zero if and only if the distribution of amino acids across the designed samples matches exactly the respective distribution of amino acids estimated from the training-set sequences. The relative entropy between two discrete distributions s(x) and t(x) is given by
$$\mathrm{KL}(s\parallel t)=-\mathop{\sum }\limits_{i=1}^{21}s(x_{i}){\log }_{21}\frac{t(x_{i})}{s(x_{i})}$$
(5)
where xi is one of the 21 symbols at a selected alignment position. We also measured the dependence between model-generated samples and the training data using mutual information (MI). Given two alignment columns A and B, each with discrete distributions of amino acids, their MI can be calculated as
$$I(A;\,B)=\mathop{\sum }\limits_{i=1}^{21}\mathop{\sum }\limits_{j=1}^{21}p(x_{i},\,y_{j}){\log }_{21}\frac{p(x_{i},\,y_{j})}{p(x_{i})p(\,y_{j})}=\mathrm{KL}\left(p(a,b)\parallel p(a)p(b)\right)$$
(6)
where xi is one of the 21 symbols at position A, and yj is one of the 21 symbols at position B. The MI describes the reduction in uncertainty about the amino acid at a given position in the generated samples when we are told the amino acid at the same position in the training data; the higher the value, the more dependent the two variables.
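The per-position entropy and relative entropy (equation (5)) can be computed from the column frequencies as sketched below; the small pseudocount used to avoid division by zero is an assumption, and the example columns are illustrative.

```python
import numpy as np

ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"   # 20 amino acids plus the gap symbol

def column_distribution(column, pseudocount=1e-9):
    """Frequencies of the 21 symbols in one alignment column (a list of characters)."""
    counts = np.array([column.count(a) for a in ALPHABET], dtype=float) + pseudocount
    return counts / counts.sum()

def entropy(p):
    """Shannon entropy of a per-position distribution, log base 21 (ranges from 0 to 1)."""
    return float(-np.sum(p * np.log(p) / np.log(21)))

def kl_divergence(s, t):
    """Relative entropy between designed (s) and training (t) distributions at one position, equation (5)."""
    return float(np.sum(s * np.log(s / t) / np.log(21)))

# Example: column j of the designed set versus column j of the training alignment
s = column_distribution(list("HHHHHHHHHY"))
t = column_distribution(list("HHHHHHHHHH"))
print(entropy(s), kl_divergence(s, t))
```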
Predicted properties of natural homologues
As described in the main text, we used our multi-task neural-network ensemble model to make predictions for natural GCG and GLP-1 peptide orthologues found in various organisms, identified by searching the NCBI RefSeq49 database with BLASTp for non-redundant proglucagon sequences across diverse phylogenetic groups. In vertebrates, the pre-proglucagon polypeptide is a product of the GCG gene, which encodes four peptide hormones: glucagon, glucagon-like peptide-1, glucagon-like peptide-2 and oxyntomodulin50. In humans, pre-proglucagon has 180 amino acids and is cleaved to yield proglucagon (158 amino acids), which lacks the N-terminal signal sequence. Proglucagon is subsequently processed by the prohormone convertases 1, 2 and 3 (ref. 50) to produce, among other products, the 29-amino-acid GCG (human positions 53–81, PCSK2) and the 30-amino-acid GLP-1(7–36) (positions 98–127, PCSK1), which are the focus of this work.
We identified 450 initial records, which we aligned using MAFFT version 7 (ref. 36) with default parameters to construct a multiple sequence alignment (MSA). We removed duplicated sequence isoforms, leaving a single representative for each species, as well as columns with low occupancy (fraction of amino acids f < 30%), leaving 294 unique samples, such that the final MSA contained 294 rows (species) and 179 columns (positions). The MSA regions corresponding to the human GCG sequence (positions 53–81) and the human GLP-1 sequence (positions 98–127) were extracted, yielding two sets of corresponding homologues. Species that lacked either a GCG or a GLP-1 sequence in the alignment were removed, yielding two final peptide sequence sets, each comprising 288 orthologous sequences. The list of species and NCBI accession numbers, as well as the corresponding peptide sequences, are provided in Supplementary Tables 15 and 16.
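A simple sketch of the column-occupancy filter and region extraction is given below; the slice indices passed to extract_region are illustrative, since in practice they are read off the human reference row of the alignment.

```python
import numpy as np

def filter_low_occupancy(msa_rows, min_fraction=0.30):
    """Drop alignment columns in which fewer than 30% of sequences carry an amino acid (not a gap)."""
    msa = np.array([list(row) for row in msa_rows])
    occupancy = (msa != "-").mean(axis=0)
    kept = occupancy >= min_fraction
    return ["".join(row) for row in msa[:, kept]], kept

def extract_region(filtered_rows, start, end):
    """Extract the aligned region corresponding to, for example, human GCG or GLP-1."""
    return [row[start:end] for row in filtered_rows]
```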
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this Article.