Admissions in the age of AI: detecting AI-generated application materials in higher education


This section outlines our research approach and the machine learning models and techniques employed in this study. Two traditional machine learning models, Naïve Bayes and Logistic Regression, serve as our baselines. Their performance is compared against two transformer-based models, BERT and DistilBERT, which, unlike the baselines, model the sequential and contextual dependencies in the text data. The code for these models is publicly available33, and all four models can be executed on text input via the online tools described in the “Interactive detection tools” section.

Approach and experimental design

As discussed in the introduction, we decompose the AI-content detection problem into two classification tasks: distinguishing human-authored from AI-generated text, and human-authored from AI-revised text. The rationale for this separation is twofold. First, some universities may permit AI-revised application materials but prohibit AI-generated ones, since AI-revised documents can be viewed as originating from humans, with AI tools merely correcting grammar or refining writing style; it is therefore necessary to differentiate between these two types of AI-crafted content. Second, while the problem could be framed as a single multi-class classifier over the three document types, that approach is more challenging because of similarities between AI-generated and AI-revised text, which blur the class boundaries. Moreover, evaluating multi-class classifiers is less straightforward, as metrics such as precision, recall, and F1-score are defined most naturally for binary classifiers.

For each classification task, we conducted two experiments to evaluate the effectiveness of domain-specific models and their cross-domain generalizability. In the first experiment (results detailed in Table 2), we trained the models exclusively on educational data (i.e., LORs and SOIs). The training and test datasets were created through a random 4:1 split. Model performance was assessed using five metrics: overall accuracy, recall, specificity, precision, and F1-score. In addition to evaluating the models on the combined test data (i.e., SOI+LOR), we analyzed their performance on the individual document types (LOR and SOI) and on 12,000 balanced, cross-domain examples from the GPT-wiki-intro dataset.
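For concreteness, the sketch below shows how such a 4:1 split and the five metrics could be computed with scikit-learn. It is an illustration rather than the study’s released code: the texts and labels variables and the label encoding (1 = AI-crafted, 0 = human-authored) are assumptions.

```python
# Illustrative sketch: random 4:1 train/test split and the five reported
# metrics, using scikit-learn. Not the authors' exact code.
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# texts, labels: the LOR/SOI corpus and its binary labels, assumed loaded
# (1 = AI-crafted, 0 = human-authored).
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels)

def report(y_true, y_pred):
    """Return the five evaluation metrics used in the experiments."""
    return {
        "accuracy":    accuracy_score(y_true, y_pred),
        "recall":      recall_score(y_true, y_pred),               # sensitivity
        "specificity": recall_score(y_true, y_pred, pos_label=0),  # recall of the negative class
        "precision":   precision_score(y_true, y_pred),
        "f1":          f1_score(y_true, y_pred),
    }
```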

In the second experiment, we replicated the same procedure but augmented the training data with a disjoint set of 48,000 balanced instances from the GPT-wiki-intro dataset. Models trained with this mixed-domain dataset (i.e., LORs+SOIs+Wiki data) showed substantial improvements on the Wiki dataset with minimal impact on the educational data. This outcome reinforces our hypothesis that developing an AI-content detector within a specific domain is feasible, even with limited data resources. The findings from the second experiment are presented in Table 3.

Machine learning algorithms

This section briefly introduces the machine learning models used in this study. We selected BERT and DistilBERT for their broad adoption and proven reliability in NLP tasks; although newer transformer models exist, these two are sufficient for our task and offer both computational efficiency and accessibility. The same rationale applies to our choice of baseline models: Naïve Bayes (NB) and Logistic Regression (LR) have both demonstrated effectiveness in detecting AI-generated content.

Naïve Bayes

Naïve Bayes (NB)10 is a probabilistic classification algorithm built upon Bayes’ theorem. It relies on the “naïve” assumption that the features \(x_1, x_2, \dots, x_n\) are conditionally independent given the class label \(y\). Mathematically,

\(P(x_1, x_2, \dots, x_n \mid y) = \prod_{i=1}^{n} P(x_i \mid y)\)

While this assumption may not hold in all real-world scenarios, NB often serves as a strong baseline for text classification tasks. NB uses Bayes’ theorem to compute the probability of each class given the observed features and then assigns unlabelled data to the class \(\hat{y}\) with the highest posterior probability, i.e.,

\(\hat{y} = \arg\max_{y} \left( P(y) \cdot \prod_{i=1}^{n} P(x_i \mid y) \right)\)

In the data preprocessing phase for the NB model, we applied Term Frequency-Inverse Document Frequency (TF-IDF)34 to vectorize the text input. TF-IDF transforms raw text data into numerical features by considering two key factors: the frequency of a term within a document (Term Frequency) and its significance across the entire dataset (Inverse Document Frequency). This method allows the model to prioritize highly discriminative terms for classification tasks by capturing the relative importance of words while reducing the influence of common terms.
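A minimal sketch of this TF-IDF + NB baseline is shown below, using scikit-learn; the vectorizer settings are illustrative assumptions rather than the study’s exact configuration.

```python
# Minimal TF-IDF + Multinomial Naive Bayes pipeline (illustrative settings).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

nb_clf = make_pipeline(
    TfidfVectorizer(lowercase=True, stop_words="english"),  # TF x IDF weighting
    MultinomialNB(),  # class posteriors under the conditional-independence assumption
)
nb_clf.fit(X_train, y_train)     # X_train: raw text, y_train: binary labels
y_pred = nb_clf.predict(X_test)  # argmax over class posteriors
```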

Logistic regression

Logistic Regression (LR)11 is a widely used algorithm in machine learning and statistics. It uses the sigmoid function, \(\sigma(z) = \frac{1}{1+e^{-z}}\), to model the relationship between the input features and the probability of belonging to the positive class (class 1). The input \(z\) to the sigmoid function is a linear combination of the independent variables \(x_1, x_2, \dots, x_n\), i.e.,

\(z = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_n x_n\)

where \(w_0, w_1, w_2, \dots, w_n\) are the model parameters.

The LR model generates predictions for new data by computing the conditional probability of the positive class given the observed input features, i.e., \(P(y=1 \mid x_1, x_2, \dots, x_n)\). If this probability is greater than or equal to a predetermined threshold (typically 0.5), the model predicts class 1; otherwise, it predicts class 0. Logistic Regression is valued for its simplicity and interpretability, but its assumption of a linear relationship between the features and the log-odds of the target variable does not hold in all cases. We applied the same TF-IDF technique used for Naïve Bayes to prepare the training data for the LR model.
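The snippet below sketches the same pipeline with Logistic Regression and makes the 0.5 decision threshold explicit; again, the hyperparameters are illustrative assumptions.

```python
# Logistic Regression on TF-IDF features with an explicit 0.5 threshold.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

lr_clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
lr_clf.fit(X_train, y_train)

# P(y=1 | x) is the sigmoid of z = w0 + w1*x1 + ... + wn*xn.
proba = lr_clf.predict_proba(X_test)[:, 1]
y_pred = (proba >= 0.5).astype(int)  # class 1 if the probability meets the threshold
```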

BERT

Bidirectional Encoder Representations from Transformers (BERT)12 is among the most notable pre-trained language models in the NLP domain. The model’s innovation lies in its ability to capture the bidirectional context of words in a sentence, enabling it to comprehend the intricacies of language, including nuances, word meanings, and context. As a result, BERT surpasses unidirectional models such as RNNs and LSTMs on a wide range of NLP tasks, including sentiment analysis, question answering, language translation, and text summarization.

BERT’s architecture is built upon the Transformer model35, which introduced the concept of self-attention mechanisms. These mechanisms enable BERT to assign varying levels of importance to different words in a sentence, facilitating the extraction of essential information and context. BERT’s pre-training involves two key tasks: masked language modeling and next sentence prediction. In the former, BERT learns to predict missing words in a sentence, forcing it to understand the relationships between words in context. In the latter, BERT learns to determine whether a pair of sentences logically follows one another, enhancing its grasp of document-level context.

One of BERT’s key strengths is that it can be fine-tuned for specific NLP tasks with relatively small amounts of task-specific data. This adaptability has made BERT a go-to choice for researchers and developers in various applications36,37,38,39. In this study, we fine-tuned the pre-trained BERT-base-uncased model with a 55% dropout rate applied to the final layer; this dropout level was selected empirically.
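The sketch below illustrates one way to fine-tune BERT-base-uncased with the 55% classifier dropout using the Hugging Face transformers library. The training hyperparameters (epochs, batch size) and the train_ds dataset object (with a text column and integer labels) are assumptions for illustration, not the paper’s reported setup.

```python
# Fine-tuning sketch: bert-base-uncased with 0.55 dropout on the classifier head.
from transformers import (AutoTokenizer, BertForSequenceClassification,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2,             # human-authored vs. AI-crafted
    classifier_dropout=0.55,  # dropout on the final classification layer
)

def tokenize(batch):
    # Fixed-length padding keeps the default data collator happy.
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=512)

args = TrainingArguments(output_dir="bert-detector",
                         num_train_epochs=3,              # assumed
                         per_device_train_batch_size=16)  # assumed
trainer = Trainer(model=model, args=args,
                  train_dataset=train_ds.map(tokenize, batched=True))
trainer.train()
```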

DistilBERT

DistilBERT13 is a variant of the BERT model12, designed to be more compact and computationally efficient while retaining comparable performance. DistilBERT is built on the same transformer architecture as BERT, using a stack of transformer encoder layers to process and encode input text. The output is subsequently used for various downstream NLP tasks, such as text classification, sentiment analysis, and named entity recognition.

The main innovation in DistilBERT is knowledge distillation, in which a smaller “distilled” model is trained to mimic the behavior of a larger, pre-trained model. DistilBERT is trained to reproduce BERT’s outputs and achieves its compactness and efficiency by using roughly 40% fewer parameters than BERT, making it faster to train and to classify examples while retaining much of BERT’s performance. This makes it a preferred option when computational resources are constrained. As with the BERT model, we trained our DistilBERT model with a 55% dropout rate applied to the final hidden layer.
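Under the same assumptions as the BERT sketch above, the DistilBERT variant differs only in the checkpoint and the name of the dropout parameter (seq_classif_dropout in DistilBertConfig):

```python
# DistilBERT counterpart of the BERT fine-tuning sketch above.
from transformers import DistilBertForSequenceClassification

model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=2,
    seq_classif_dropout=0.55,  # dropout before the final classification layer
)
```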

Table 2 Performance of models trained exclusively on application (SOI and LOR) data.
