Machine Learning
Executive Summary
In this section, we leverage machine learning to predict community judgments on two advice-seeking subreddits, AmItheAsshole (AITA) and AmIOverreacting (AIO). Using structured feedback labels such as YTA/NTA and OOR/NOR as prediction targets, we trained models on the topical, sentiment, and engagement features underlying user responses. Our goal is to determine whether community judgments can be accurately predicted from post metadata and derived features, offering insights into the dynamics of online discourse.
To achieve this, we built and evaluated XGBoost and Random Forest models for each subreddit, incorporating key features identified during EDA, including post topics, sentiment scores, and engagement metrics (e.g., post scores and comment counts). These models help us understand not only the accuracy of predictions but also the relative importance of different features in shaping community outcomes.
Key findings reveal that both models perform well in predicting judgments, with Random Forest performing better within AmIOverreacting and XGBoost performing better within AmItheAsshole. Feature importance analysis highlights that post score and the number of comments were among the strongest predictors across both subreddits. For AIO, the enhanced labeling system we developed plays a critical role in generating reliable training data, enabling the model to accurately differentiate between “Overreacting” and “Not Overreacting” posts.
This work demonstrates the power of machine learning in analyzing large-scale social media data, offering practical applications for community moderation and understanding societal trends in judgment and advice-seeking behaviors. The results provide a strong foundation for further exploration of how these platforms shape and reflect public discourse.
Data Preparation and Feature Engineering
To ensure our models effectively predict community judgments on AmItheAsshole (AITA) and AmIOverreacting (AIO), we conducted rigorous data preparation and feature engineering. This process was tailored to capture the key characteristics of each subreddit and ensure high-quality inputs for the machine learning models.
Data Cleaning and Preprocessing
- AITA Labels: The AITA subreddit includes structured feedback labels (YTA, NTA, ESH, etc.) directly from user comments. These labels were cleaned and aggregated to assign a single label per post, representing the majority vote.
- AIO Labels: Since AIO does not always use predefined labels, we developed a robust labeling system to classify posts as “Overreacting,” “Not Overreacting,” or “Unclear.” The system analyzes the top 10 ranked comments for each post, identified by their engagement scores (e.g., upvotes). By focusing on the most relevant and visible comments, we ensured that the labeling reflected the general sentiment of the community rather than outlier opinions.
- To classify the posts, we implemented keyword matching on the comment text using predefined patterns for each label category. For instance, terms such as “valid reaction” or “not overreacting” were associated with the “Not Overreacting” label, while phrases like “blown out of proportion” or “overreacting” indicated the “Overreacting” label. Regex patterns guarded against partial matches and misclassifications (e.g., distinguishing “not overreacting” from “overreacting”); a minimal sketch of this pipeline follows the list.
- Once individual comment labels were generated, a majority vote system was applied across the top 10 comments to determine the final label for each post. If “Not Overreacting” comments outnumbered “Overreacting” comments, the post was labeled as “Not Overreacting,” and vice versa. Posts where neither label achieved a clear majority, or where labels were ambiguous, were classified as “Unclear.”
- This method ensured that labels were derived systematically and consistently, capturing the consensus of the AIO community while minimizing noise from outlier comments. By doing so, we created reliable training data for the machine learning models, enabling accurate predictions of community judgments.
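To make the pipeline concrete, the sketch below shows comment-level keyword matching and the post-level majority vote. It is a minimal illustration: the pattern lists are abbreviated assumptions, and the function names are ours, not the exact implementation.

```python
import re
from collections import Counter

# Hypothetical keyword patterns for each label; the real lists were larger.
# The negative lookbehind keeps "not overreacting" from matching OVERREACTING.
NOT_OVERREACTING = [r"\bnot overreacting\b", r"\bvalid reaction\b"]
OVERREACTING = [r"(?<!not )\boverreacting\b", r"\bblown out of proportion\b"]

def label_comment(text: str) -> str:
    """Label a single comment via regex keyword matching."""
    text = text.lower()
    if any(re.search(p, text) for p in NOT_OVERREACTING):
        return "Not Overreacting"
    if any(re.search(p, text) for p in OVERREACTING):
        return "Overreacting"
    return "Unclear"

def label_post(top_comments: list[str]) -> str:
    """Majority vote over the top-ranked comments (top 10 by score)."""
    votes = Counter(label_comment(c) for c in top_comments)
    n_nor, n_oor = votes["Not Overreacting"], votes["Overreacting"]
    if n_nor > n_oor:
        return "Not Overreacting"
    if n_oor > n_nor:
        return "Overreacting"
    return "Unclear"  # tie or no decisive keywords
```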
Feature Engineering
Key features for the models were derived from both post content and metadata (a pipeline sketch follows this list):
- Post Sentiment Scores: Sentiment scores were calculated using a pre-trained sentiment analysis model, quantifying the emotional tone of each post.
- Post Topics: Topics were extracted using Non-Negative Matrix Factorization (NMF), as detailed in the NLP section. Each post was assigned a dominant topic to incorporate thematic insights.
- Engagement Metrics: Features such as post score, number of comments, and time since posting were included to capture user engagement levels.
- Reddit Metadata: Additional features, such as the day and hour of posting, were included based on trends observed in the EDA.
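The sketch below illustrates how these features could be assembled with common open-source tools. VADER stands in for the unspecified pre-trained sentiment model, and the column names (selftext, score, num_comments, created_utc) are assumptions based on standard Reddit API fields.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

def engineer_features(posts: pd.DataFrame, n_topics: int = 10) -> pd.DataFrame:
    """Derive sentiment, topic, engagement, and temporal features per post."""
    # Sentiment: compound score in [-1, 1] from a pre-trained lexicon model.
    analyzer = SentimentIntensityAnalyzer()
    posts["sentiment"] = posts["selftext"].apply(
        lambda t: analyzer.polarity_scores(t)["compound"]
    )

    # Dominant NMF topic per post, mirroring the NLP section.
    tfidf = TfidfVectorizer(max_features=5000, stop_words="english")
    doc_topic = NMF(n_components=n_topics, random_state=42).fit_transform(
        tfidf.fit_transform(posts["selftext"])
    )
    posts["dominant_topic"] = doc_topic.argmax(axis=1)

    # Engagement and temporal metadata (score and num_comments come raw
    # from the Reddit data; only the time features need deriving).
    created = pd.to_datetime(posts["created_utc"], unit="s", utc=True)
    posts["hours_since_post"] = (
        pd.Timestamp.now(tz="UTC") - created
    ).dt.total_seconds() / 3600
    posts["post_hour"] = created.dt.hour
    posts["post_day"] = created.dt.dayofweek
    return posts
```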
Data Splits
The datasets for both subreddits were split into training (80%) and test (20%) sets. This division enables robust evaluation of the models on unseen data and helps detect overfitting.
Addressing Imbalanced Data
In both subreddits, labels such as ESH in AITA and “Unclear” in AIO were either ambiguous or underrepresented. Rather than attempting to model these minority classes, we chose to remove them from the dataset. This decision ensured that the machine learning models could focus on clear and well-defined labels, such as YTA and NTA in AITA, and “Overreacting” and “Not Overreacting” in AIO.
By eliminating these ambiguous labels, we streamlined the dataset, improving the clarity and reliability of the training process. This allowed us to focus on patterns and features relevant to the dominant community judgments without introducing noise or complexity from poorly defined categories.
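Building on the hypothetical posts frame from the feature sketch above, the filtering and 80/20 split might look as follows. The label column name and the use of stratification are our assumptions; the AITA pipeline is analogous with YTA/NTA.

```python
from sklearn.model_selection import train_test_split

# Keep only the two well-defined classes, dropping "Unclear" posts.
clear = posts[posts["label"].isin(["Overreacting", "Not Overreacting"])]

FEATURES = ["score", "num_comments", "sentiment", "dominant_topic",
            "hours_since_post", "post_hour", "post_day"]
X, y = clear[FEATURES], clear["label"]

# 80/20 split; stratifying preserves the class ratio in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```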
Summary
This data preparation pipeline ensures that the input features capture the thematic, sentimental, and engagement-related nuances of each subreddit. By incorporating sentiment analysis, topic modeling, and metadata, the models are equipped to make informed predictions about community judgments.
AIO Models
Model Overview
To predict community judgments on the AmIOverreacting (AIO) subreddit, we employed two machine learning models: Random Forest and XGBoost. These models were selected for their ability to handle structured data effectively and their robust performance in classification tasks. Both models leveraged features such as post sentiment scores, topics, engagement metrics, and metadata, as outlined in the feature engineering section.
The models aimed to classify posts into two categories: Overreacting and Not Overreacting. The enhanced labeling system discussed earlier provided reliable training data for this task.
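A minimal training sketch, reusing the split from the data-preparation section; the hyperparameters shown are illustrative defaults, not the tuned values.

```python
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

# Random Forest accepts string labels directly.
rf = RandomForestClassifier(n_estimators=300, random_state=42)
rf.fit(X_train, y_train)

# XGBoost expects integer classes, so map the labels to 0/1 first.
label_map = {"Not Overreacting": 0, "Overreacting": 1}
xgb = XGBClassifier(n_estimators=300, eval_metric="logloss", random_state=42)
xgb.fit(X_train, y_train.map(label_map))
```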
Model Performance
The performance of each model was evaluated using standard classification metrics, including precision, recall, F1-score, and accuracy. Below are the classification reports for both models, which highlight their respective strengths and weaknesses.
Random Forest Classification Report:
|                  | precision | recall | f1-score | support |
|------------------|-----------|--------|----------|---------|
| Not Overreacting | 0.67      | 0.91   | 0.77     | 172     |
| Overreacting     | 0.78      | 0.43   | 0.55     | 134     |
| accuracy         |           |        | 0.70     | 306     |
| macro avg        | 0.73      | 0.67   | 0.66     | 306     |
| weighted avg     | 0.72      | 0.70   | 0.67     | 306     |
XGBoost Classification Report:
|                  | precision | recall | f1-score | support |
|------------------|-----------|--------|----------|---------|
| Not Overreacting | 0.69      | 0.80   | 0.74     | 172     |
| Overreacting     | 0.68      | 0.54   | 0.60     | 134     |
| accuracy         |           |        | 0.69     | 306     |
| macro avg        | 0.68      | 0.67   | 0.67     | 306     |
| weighted avg     | 0.69      | 0.69   | 0.68     | 306     |
Random Forest Kappa Score: 0.350
XGBoost Kappa Score: 0.348
Both kappa scores indicate fair agreement beyond chance, consistent with the moderate accuracies reported above.
From the classification reports:
- The Random Forest model achieved higher overall accuracy (70%) compared to XGBoost (69%).
- Random Forest demonstrated strong performance in identifying “Not Overreacting” posts, with a recall of 0.91 and an F1-score of 0.77.
- XGBoost performed more evenly across both labels, though its recall for “Overreacting” posts remained modest at 0.54 (versus 0.43 for Random Forest); a reproduction sketch for these metrics follows.
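These metrics, along with the confusion matrices discussed next, come from scikit-learn’s standard utilities. A minimal reproduction sketch for the Random Forest model, assuming the rf model and split from the earlier sketches (the XGBoost evaluation is analogous after mapping predictions back to label names):

```python
from sklearn.metrics import (classification_report, cohen_kappa_score,
                             confusion_matrix)

y_pred = rf.predict(X_test)
print(classification_report(y_test, y_pred))        # per-class P/R/F1
print("Kappa:", cohen_kappa_score(y_test, y_pred))  # chance-corrected agreement
print(confusion_matrix(y_test, y_pred))             # rows = true, cols = predicted
```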
Confusion Matrices
To further analyze the models’ predictions, confusion matrices were generated for both Random Forest and XGBoost. These matrices provide insights into how well each model correctly classified the labels and where misclassifications occurred.
Key observations:
- Random Forest excelled in identifying “Not Overreacting” posts, with only 16 misclassifications out of 172 true examples.
- XGBoost, while slightly less accurate, demonstrated a more balanced approach, with fewer extreme discrepancies between the two labels.
ROC Curve Analysis
The Receiver Operating Characteristic (ROC) curve evaluates the trade-off between the true positive rate (sensitivity) and false positive rate for both models. The AUC (Area Under the Curve) scores provide a single-number summary of model performance; a computation sketch follows the insights below.
Insights from the ROC analysis:
- The Random Forest model achieved a higher AUC score (0.72) compared to XGBoost (0.68), indicating superior overall performance.
- Both models outperformed random guessing, demonstrating their ability to capture meaningful patterns in the data.
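A sketch of the AUC computation, treating “Overreacting” as the positive class (our choice for illustration):

```python
from sklearn.metrics import roc_auc_score

# Predicted probability of the positive class from the fitted forest.
pos_idx = list(rf.classes_).index("Overreacting")
proba = rf.predict_proba(X_test)[:, pos_idx]

auc = roc_auc_score(y_test == "Overreacting", proba)
print(f"Random Forest AUC: {auc:.2f}")
```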
Feature Importance
Feature importance analysis revealed that post score and number of comments were among the strongest predictors for both models. Sentiment scores and post topics also played significant roles, highlighting the importance of combining content, engagement, and metadata features in predicting community judgments.
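The rankings can be read directly off the fitted models; a minimal sketch using the impurity-based importances, with feature names following the earlier sketches:

```python
import pandas as pd

# Both models expose feature_importances_; shown here for the forest.
importances = pd.Series(rf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False))
```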
Summary
The AIO models demonstrate the feasibility of predicting community judgments based on a combination of textual and engagement-related features. While the Random Forest model showed slightly better performance overall, both models provide valuable insights into the factors driving user responses on the AIO subreddit. These findings lay the groundwork for future applications, such as improving content moderation or understanding community dynamics on advice-seeking platforms.
AITA Models
Model Overview
For the AmItheAsshole (AITA) subreddit, we employed Random Forest and XGBoost models to classify posts based on community judgments. Unlike AIO, AITA includes multiple structured feedback labels: YTA (“You’re the A-hole”), NTA (“Not the A-hole”), ESH (“Everyone Sucks Here”), NAH (“No A-holes Here”), and INFO (“Not Enough Info”). However, due to significant imbalances in the distribution of these labels, we focused on the dominant classes, YTA and NTA, for our predictive modeling.
Model Performance
The performance of both models was evaluated using standard classification metrics. Below are the classification reports for Random Forest and XGBoost:
Random Forest Classification Report:
|              | precision | recall | f1-score | support |
|--------------|-----------|--------|----------|---------|
| YTA          | 0.72      | 0.52   | 0.60     | 7387    |
| NTA          | 0.59      | 0.84   | 0.69     | 7012    |
| accuracy     |           |        | 0.63     | 15352   |
| macro avg    | 0.26      | 0.27   | 0.26     | 15352   |
| weighted avg | 0.61      | 0.63   | 0.60     | 15352   |
XGBoost Classification Report:
|              | precision | recall | f1-score | support |
|--------------|-----------|--------|----------|---------|
| YTA          | 0.71      | 0.62   | 0.66     | 7387    |
| NTA          | 0.64      | 0.81   | 0.71     | 7012    |
| accuracy     |           |        | 0.67     | 15352   |
| macro avg    | 0.27      | 0.29   | 0.27     | 15352   |
| weighted avg | 0.63      | 0.67   | 0.64     | 15352   |
Random Forest Kappa Score: 0.313
XGBoost Kappa Score: 0.376
Key observations:
- XGBoost outperformed Random Forest with an accuracy of 67% compared to 63%.
- XGBoost demonstrated higher precision for NTA and higher recall for YTA, reflecting more balanced performance.
- The support totals (15,352 versus the 14,399 YTA and NTA posts) and the depressed macro averages suggest that a small number of minority-label posts remained in the test set when these reports were generated.
Confusion Matrices
The confusion matrices below provide a detailed look at the distribution of predictions for both models:
Key observations:
- Random Forest had difficulty predicting YTA, misclassifying a relatively high number of true YTA posts as NTA.
- XGBoost improved on this by capturing more correct YTA classifications while maintaining strong performance on NTA predictions.
ROC Curve Analysis
The ROC curves highlight the models’ ability to balance sensitivity and specificity:
Insights from the ROC analysis:
- XGBoost achieved a higher AUC score (0.67) compared to Random Forest (0.63), indicating better overall performance.
- Both models performed well above the random baseline, demonstrating their ability to leverage meaningful patterns in the data.
Feature Importance
Feature importance analysis revealed that post score, number of comments, and sentiment score were the most significant predictors across both models. This aligns with earlier findings in EDA and NLP, where these features were identified as key drivers of community engagement.
Summary
The AITA models underscore the potential for machine learning to predict community judgments in multi-class settings. While XGBoost emerged as the better-performing model, both models highlighted key features influencing user responses. Future work could focus on handling underrepresented labels and extending the binary setup to classification across all five verdict labels.
Conclusion
Our project demonstrates the feasibility and value of leveraging machine learning to analyze and predict community judgments across advice-seeking platforms. Through a combination of EDA, NLP, and ML, we have uncovered key insights into user behaviors, thematic trends, and judgment patterns in these online communities.
Key Findings:
- AIO Results:
- Random Forest performed slightly better in predicting “Overreacting” and “Not Overreacting” judgments, with an accuracy of 70%.
- Post score and comment count emerged as the most influential predictors, emphasizing the role of engagement metrics.
- AITA Results:
- XGBoost outperformed Random Forest, achieving 67% accuracy on the binary YTA/NTA task.
- Post score, comment count, and sentiment score were the most influential predictors.
Next Steps:
- Model Improvements:
- Experiment with advanced NLP techniques, such as transformers, to capture deeper contextual relationships in post text.
- Address label imbalance by incorporating oversampling techniques, class weighting, or alternative loss functions (a minimal sketch follows this list).
- Expanded Analysis:
- Extend the dataset to include additional subreddits and external advice-seeking platforms for broader generalization.
- Analyze temporal trends in judgment patterns to identify shifts in societal norms.
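For the label-imbalance item above, class weighting is one lightweight alternative to oversampling; a minimal sketch, reusing names from the earlier AIO sketches:

```python
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

# Reweight classes inversely to their frequency instead of resampling.
rf_balanced = RandomForestClassifier(
    n_estimators=300, class_weight="balanced", random_state=42
)

# XGBoost's equivalent knob: scale_pos_weight = n_negative / n_positive.
y_int = y_train.map(label_map)
ratio = (y_int == 0).sum() / (y_int == 1).sum()
xgb_weighted = XGBClassifier(
    n_estimators=300, scale_pos_weight=ratio,
    eval_metric="logloss", random_state=42
)
```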
By bridging traditional and modern advice-seeking platforms, this project lays the groundwork for understanding the evolving dynamics of digital discourse and its implications for societal trends.