Machine learning designs new GCGR/GLP1R dual agonists with enhanced biological potency
Potency assays
Peptide potencies for cAMP accumulation were experimentally determined for the activation of both hGCGR and hGLP1R expressed in Chinese hamster ovary (CHO) cells for a set of 125 unique peptide sequence variants, following methods described previously^{16,35}. In brief stable CHO cell lines expressing human and mouse GLP1R were generated inhouse using standard methods, as previously described^{16}. CHO cells expressing either human GLP1R or human GCGR were dispensed in assay buffer (Hanks balanced salt solution containing 0.1% BSA (SigmaAldrich) and 0.5 mM IBMX (SigmaAldrich)) in 384well assay plates containing dilutions of test peptides. After 30 min of incubation, cAMP levels were measured using the cAMP dynamic 2 HTRF kit (Cisbo) following the manufacturer’s recommendations. Fluorescence emissions at 665 nm and 620 nm following excitation at 320 nm were detected using an Envision reader (Perkin Elmer), and the data were transformed to % Delta F, as described in the manufacturer’s guidelines, before EC_{50} determination. All in vitro cellbased assay data are presented as the mean of n ≥ 3 independent experiments, and all individual EC_{50} measurements were within threefold of the geometric mean. The native peptide reference standard potency was within threefold of the historical geometric mean for all assays.
Datasets
The GPCRbinding peptides considered in this work exclusively comprise naturally occurring amino acids, so the models are not able to capture the effect of any chemical modifications of residues. The initial set of sequences was aligned using MAFFT version 7^{36} to reveal regularities in aminoacid occurrences across positions. We reasoned that sequence alignment might help structure the data, thereby increasing the predictive power of the neuralnetwork models. The aligned sequences were truncated to L = 30 amino acids, and redundant sequences were removed. The final set of sequences used in this study comprised N = 125 unique peptide sequences tested against human GPCR and GLP1R receptors. Within this dataset, 122 records were Cterminally amidated. The sequences were subsequently encoded using a onehot representation and used to train various regression models.
Data encoding
To encode the amino acid at each sequence position we used a onehot (binary) representation. Here, we considered 21 categories: 20 amino acids and the gap symbol ‘’, introduced by alignment. Because nearly all peptides used in these studies (122/125) were Cterminally amidated, we did not introduce an additional parameter to encode this feature. In this approach, each peptide sequence of length L is converted to a binary matrix S of size 21 × L, the entries of which indicate the presence of an amino acid A_{i} at the given sequence site, such that S_{ab} = 1 if a = i and 0 elsewhere, ∀_{b ∈ 1, …, L}. The binary matrix is then reshaped into a vector: S^{21 × L} → v^{1 × 21L}. The alignment process ensures that L = 30 for all peptides, such that each sequence is represented by the binary vector \(v\in \mathbbR^1\times 630\).
Evaluation metrics
We employed the following commonly used regression metrics to evaluate the prediction accuracy of the models developed in this work.

1.
Rootmeansquare error (r.m.s.e.): \(\rmr.m.s.e.=\sqrt\frac1N\mathop\sum \nolimits_i = 1^N(\;y_i\haty)^2\)

2.
Mean absolute error (m.a.e.): \(\rmm.a.e.=\)

3.
Coefficient of determination (R^{2}): \(R^2=1\frac\mathop\sum \nolimits_i = 1^N(\;y_i\haty)^2\mathop\sum \nolimits_i = 1^N(\;y_i\bary)^2\)
Notation: y_{i}, true value of the target for the ith sample; \(\haty\), predicted value of the target for the ith sample; \(\bary\), average value of the target; N, number of examples in the batch.
Neuralnetwork model
We used the Keras/Tensorflow functional API to build the deep network model^{37}. The first Conv1D layer in our model has 256 filters, a kernel window of three amino acids, without padding, and an L2 regularization penalty on the kernel, with weight = 0.01. The layer uses ReLU activation. We next added batch normalization and a MaxPool1D operation with stride 2 and used Dropout = 0.5. The second Conv1D layer contains 512 filters and the same configuration of parameters as the first layer, with an additional L2 regularization penalty on the bias term, with weight = 0.01. The layer is activated with ReLU, followed by batch normalization, MaxPool1D operation with stride 2, and Dropout = 0.5. The third convolutional layer has 128 filters; here the padding preserves the shape of the input, and the kernel as well as bias are regularized with L2. This layer is followed by MaxPool1D operation with stride 2. Next, the output from convolutional layers is flattened and two dense layers terminate the network. The first dense layer comprises 256 units, and the second layer has 64 units. Both layers are ReLUactivated. The final two dense layers with a single unit convert the model output to the prediction. These layers are not activated.
Network ensemble
We constructed a neuralnetwork ensemble model where the final prediction is given by the average of the individual predictions made by M = 12 separate copies of the model. Different copies of the model, trained on the same data, differ in their predictions due to factors such as the random initialization of the network parameters. Ensembling predictions over several copies of the model has the effect of mitigating some of this randomness and reducing model variance. The resulting ensemble prediction is given by the average of the ensemble element predictions.
Model training and hyperparameter tuning
To adjust capacity and select nontrainable model parameters, the available data were used for performance validation. Initially, the dataset of 125 examples was divided into three subsets: 105 training sequences, ten sequences for validation and ten heldout sequences for final model performance evaluation (unseen during training). We performed ten sixfold crossvalidations splits with different seeds to split the data, obtaining 60 (test set size) × 10 = 600 data points (errors) in total for each model. Retraining on different data splits allowed us to take into account the variance resulting from training on different data, in addition to the variance that arises due to the random model initialization.
For each baseline model, we used the sklearn grid search (GridSearchCV) to find the set of hyperparameters that provide the best crossvalidation performance (listed in Supplementary Table 2). Parameters for which the optimal value differs between tasks are marked with a double value v1/v2 in the respective column of Supplementary Table 2, where v1 is the optimal parameter value for the GCGR task, and v2 is the optimal parameter value for the GLP1R task. For the neural networks, various configurations of layers, unit numbers and regularization were tried, and we selected the model that gave the best performance on the validation set.
In addition, to prevent overfitting of the neural networks, we monitored performance using early stopping. Training was terminated when the optimization loss reported on the validation set goes up after a selected number of parameter updates. Here, we use the Early Stopping monitor implemented in the Keras callback module^{38}. Deep models with 120 training examples (final models) were trained for up to 1,500 epochs, monitoring the validation loss, with the patience of 100 epochs. Each batch for the gradient step contained 25 samples. The deep models with 105 training examples used for validation were trained for up to 1,500 epochs, monitoring the validation loss with the patience of 75 epochs, and 20 examples per batch (each epoch had five parameter updates). Model training is illustrated in Supplementary Fig. 2.
Baseline models
All baseline regressors in Table 1 were implemented using the sklearn Python module^{39}. To confirm that the ML models do not simply learn the underlying potency distributions or aminoacid sequence compositions, we trained control ensembles of multitask neural networks using the process described above, where we (1) shuffled each peptide sequence used to train the models and (2) shuffled the measured potencies between training examples. The resulting control models make much larger prediction errors; the results are shown in Supplementary Fig. 4 and summarized in Supplementary Table 4. Finally, we implemented a simple nearestneighbours approach in which the predicted potency for a heldout test sequence is predicted by the measured potency of the nearest neighbour in the training data. For each test sequence we used the pairwise2 BioPython module with the BLOSUM62 matrix to score alignments with every training sequence; in the case of multiple equidistant training sequences, the average potency was reported. Results across sixfold crossvalidation are summarized in Supplementary Table 4 and show that this approach is outperformed by the ML models described above.
We used a ttest (twosided) to test whether the differences in model performance were significant between the ensemble of multitask neural networks and the other models. Distributions of 600 prediction errors (squared difference between the true and predicted potency for each test sequence) obtained for each model for the GCGR and GLP1R tasks are shown in Supplementary Table 3. For each pair of models we test the null hypothesis that the two independent populations of error samples have the same average values (we do not assume equal variances). Supplementary Table 3 shows that at a confidence level of 0.05, the multitask neuralnetwork ensemble performs significantly better in all cases for the GLP1R task, whereas the performance differences are insignificant in all except one case for the GCGR task.
Multitask training
Multitask learning aims to improve generalization and increase prediction accuracy by learning objectives for several target variables from shared representations^{28}. The basic idea is that by training all tasks using shared hidden layers, each task benefits from the presence of the others, which act as regularizers, making the model less sensitive to the specificity of a single target^{28,40}. This is because the shared layers are shared representations—the model uses the same weights on each task. The effective number of training examples is therefore increased, and overfitting on each task separately is reduced by optimizing the model with respect to the average data noise^{40}.
We used the Kereas deeplearning framework to build our model ( using the TensorFlow backend^{37}. The model consists of eight fully connected layers, comprising the input layer, followed by three 1D convolutional layers, three pooling layers and two dense layers at the bottom of the model, connected to two final units that convert the output to realvalued predictions. The overall objective is the weighted average of the loss for each of the two individual tasks:
$$L_\rmtotal=\mathop\sum \limits_i=1^k=2\alpha _iL_i$$
(1)
where L_{i} is the loss function of the ith task, α_{i} is the corresponding weight and k denotes the number of tasks. We set α_{1} = α_{2} = 0.5 so that each loss contributes with equal weight to the overall loss. We use the meansquarederror (m.s.e.) as the loss for each task, \(\rmm.s.e.=\frac1n\mathop\sum \nolimits_j = 1^n(\;y_j\haty)^2\), where n is the number of training examples per batch.
Our multitask neuralnetwork model shares all internal hidden layers between the tasks. Two output units return the predicted potencies, \(\haty_1\) and \(\haty_2\). The convolutional layers at the top of the model are designed to encode the peptide representations. We use a kernel with a window size of three amino acids and stride equal to 1. Each convolutional layer is followed by a max pooling layer, with stride equal to 2. We use batch normalization^{41} and Dropout^{42} for regularization. Each convolutional and dense layer is activated with ReLU^{43} activation. We trained the model with an optimization objective as given in equation (1) using the Adam optimizer^{44}. The final network was trained on an equal number of training examples for both tasks, N = 120.
Modelguided ligand design
Our goal was to design peptide sequences with the following properties:

Highly active against both receptors:
$$\beginarrayr\rmActivity=\left\{\beginarrayl\log _10\rmEC_50^\rmGCGR[M\,] < 11.5\quad \\ \log _10\rmEC_50^\rmGLP1R[M\,] < 11.5\quad \\ \rmEC_50^\rmGCGR/\rmEC_50^\rmGLP1R\approx 1\quad \endarray\right.\endarray$$
(2)

Selectively active towards GCGR:
$$\beginarrayr\rmActivity=\left\{\beginarrayl\log _10\rmEC_50^\rmGCGR[M\,] < 11\quad \\ \log _10\rmEC_50^\rmGLP1R[M\,] > 9\quad \\ \rmEC_50^\rmGCGR/\rmEC_50^\rmGLP1R\approx 100\quad \endarray\right.\endarray$$
(3)

Selectively active towards GLP1R:
$$\beginarrayr\rmActivity=\left\{\beginarrayl\log _10{\rmEC}_50^\rmGCGR[M\,] > 9\quad \\ \log _10{{{\rmEC}}}_50^\rmGLP1R[M\,] < 11.5\quad \\ {{{\rmEC}}}_50^\rmGLP1R/{{{\rmEC}}}_50^\rmGCGR\approx 100\quad \endarray\right.\endarray$$
(4)
We use modelguided directed evolution, an optimization strategy that attempts to solve the optimization problem by imitating the natural evolutionary process. In each successive generation (iteration), a change in the sequence is proposed, followed by the evaluation of a fitness function (here, potency predicted by the ensemble of multitask neural networks) and the best solutions are progressed to the next generation. This process repeats until a satisfactory solution is reached. In this work we assume that the ensemble of multitask convolutional neural networks makes reliable predictions up to three mutation steps from the closest trainingset analogue sequence.
We first generated all singlestep mutations from each trainingset sequence in the three groups of interest, removing any duplicates within the generated set, and any overlaps with the training set. Because each sequence in the initial alignment has a length of 30 amino acids and each position can be mutated to one of 19 amino acids (20 if the position is gapped), this gives 570 singlestep mutants in the first generation for each sequence in the training set, that is, 71,304 sequences, reducing to 69,639 sequences after removing duplicates. We then used each model to select the 50 best sequences for each of the three target designs defined above, and selected the ten most diverse sequences as starting points for a second round of optimization. Note that in the first generation for the multitask CNN only five candidate dual agonists were found and used as parents for the second generation. For the second generation we repeated the process described for the first generation, and from the 50 best sequences for each group, we selected five diverse sequences as parents for the third generation. The entire process was then repeated for a final generation, taking the 50 sequences with the best predicted potencies within each of the three groups, considering GCGR for group one.
We identified six biophysical properties that can be predicted from a sequence using the ProtParam module ( from the biopython Python package^{45}: (1) the isoelectric point in neutral pH, (2) GRAVY (grand average of hydropathy)^{46}, (3) the instability index^{47}, (4) aromaticity, (5) the molar extinction coefficient and (6) molecular weight. We compared the predicted value for each designed peptide with the predicted properties of peptides in the training set within the same potency group. We ranked the 50 best sequences in each group by computing the number of features whose values are within one standard deviation of the mean calculated for the corresponding group of trainingset sequences. As the last step of filtering, we predicted the secondary structure for each final candidate using PSIPRED^{48} ( to confirm that the selected sequences are helical peptides. Using this ranking, we selected five final samples in each potency category—four from the third generation, and one from the first generation of mutants. We prioritized designed sequences with the smallest (first generation) and largest (third generation) distance from the training set. Sequences selected with the ensemble of multitask neuralnetwork models experimentally tested in this study and discussed in the main text are listed in Table 2 (Supplementary Tables 7–11 provide additional details).
To examine the similarity of peptides predicted by different models within each potency profile, we used PCA, considering the 500 onehotencoded sequences generated across all five compared models, the wildtype peptides—hGCG and hGLP1—and their singlestep mutants (551 hGCG and 570 hGLP1), such that the projection was computed for an array [1,621 × 21L]. The projected data are shown in Supplementary Fig. 7. The selected final sequences listed in Supplementary Table 6 were analysed in terms of the total number of mutations from wildtype and predicted potencies. The predictions made with different models are consistent, as evidenced by the low values of standard deviation (<0.5) from the average prediction computed across the models.
To evaluate the information content generated by the sequence design process, we calculated the entropy across each set of designed sequences, and the relative entropy (Kullback–Leibler divergence, KL) between the distribution of amino acids at each sequence position estimated for the modeldesigned samples, and the training data. KL divergence is equal to zero if and only if the distribution of amino acids across the designed samples matches exactly the respective distribution of amino acids estimated from the trainingset sequences. The relative entropy between two discrete distributions s(x) and t(x) is given by
$$ =\mathop\sum \limits_i=1^21s(x_i)\log _21\fract(x_i)s(x_i)$$
(5)
where x_{i} is one of the 21 symbols at a selected position j. We also measured the dependence between modelgenerated samples and the training data using mutual information (MI). Given two alignment columns A and B, each with discrete distributions of amino acids, their MI can be calculated as
$$I(A;\,B)=\mathop\sum \limits_i^K=21\mathop\sum \limits_j^L=21p(x_i,\,y_j)\log _21\fracp(x_i,\,y_j)p(x_i)p(\;y_j)=KL(\,p(a,\,b)$$
(6)
where x_{i} is one of the 21 symbols at position A, and y_{j} is one of the 21 symbols at position B. The MI describes the reduction of uncertainty about the amino acid at position i in our generated samples when we are told what the amino acid at position i in the training data is. The higher the value, the more dependent the variables.
Predicted properties of natural homologues
As described in the main text, we used our multitask neuralnetwork ensemble model to make predictions for natural GCG and GLP1 peptide orthologues that are found in various organisms, identified using BLASTp to search the NCBI RefSeq^{49} database to identify nonredundant proglucagon sequences from various organisms across diverse phylogenetic groups. In vertebrates, the preproglucagon polypeptide is a product of the GCG gene, which encodes four peptide hormones: glucagon, glucagonlike peptide1, glucagonlike peptide2 and oxyntomodulin^{50}. In humans, preproglucagon has 180 amino acids and is cleaved to yield proglucagon (158 amino acids), which lacks the Nterminal signalling sequence. Proglucagon is subsequently processed by the prohormone convertases 1, 2 and 3^{50} to produce, among other products, the 29aminoacidlong GCG (in human, positions 53–81, PSCK2) and the 30aminoacidlong GLP1_{7−36} (positions 98–127, PSCK1), which are the focus of this work.
We identified 450 initial records, which we aligned using MAFFT version 7^{36} with default parameters to construct a multiple sequence alignment (MSA). We also removed duplicated sequence isoforms, leaving a single representative for each species. Columns with low occupancy (f < 30% amino acids) were also removed, leaving 294 unique samples, such that the final MSA contained 294 rows (species) and 179 columns (positions). MSA regions corresponding to the human GCG sequence (positions 53–81) and the GLP1 human sequence (positions 98–127) were extracted, yielding two sets of corresponding homologues. Species that lacked either a GCG or GLP1 sequence in the alignment were further removed to yield two final peptide sequence sets, each comprising 288 orthologous sequences. The list of species and NCBI accession numbers, as well as the corresponding peptide sequences, are provided in Supplementary Tables 15 and 16.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this Article.
link