EDA

Executive Summary

Our analysis explores how users engage with advice-seeking platforms, focusing on Reddit subreddits AskReddit, AmItheAsshole, and AmIOverreacting, and comparing them to the iconic Dear Abby advice column. The results highlight significant differences in platform dynamics, engagement levels, and audience behaviors. While AskReddit dominates in terms of post volume and engagement due to its broad appeal, AmItheAsshole provides structured feedback on ethical dilemmas, making it a hub for direct and decisive user opinions. AmIOverreacting, though smaller and newer, shows strong engagement relative to its size and employs a structured feedback system similar to AmItheAsshole, enabling users to receive validation or critique on their emotional responses.

The data also reveals that posts on Reddit are far more frequent than those in Dear Abby, with individual subreddits surpassing the monthly activity of the column at its peak. Interestingly, while shorter posts dominate AskReddit, the subreddit still achieves the highest virality rate, showing that concise content can resonate deeply with audiences. Conversely, AmItheAsshole posts are often longer and more detailed, reflecting the complexity of the moral dilemmas discussed. The anonymous nature of Reddit fosters candid conversations, including a notable percentage of explicit (NSFW) content, particularly on AskReddit and AmIOverreacting.

These findings provide a comprehensive understanding of how users engage with these platforms, revealing distinct behavioral patterns and thematic trends. This analysis lays the groundwork for deeper investigations, including topic modeling, sentiment analysis, and predicting community judgments, to further uncover the motivations and themes that drive engagement in online advice-seeking communities.

Data Ingestion and Cleaning

Reddit Data Retrieval

We began by retrieving the data, which had been collected via the Reddit API, from our professor’s Amazon S3 bucket. We analyzed the dataset using PySpark on AWS SageMaker in JupyterLab. The data included posts and comments from three subreddits: AmItheAsshole (AITA), AskReddit, and AmIOverreacting (AIOR).

Posts Data Cleaning

Initial Dataset

The initial dataset for posts contained over 3 million entries, including some of Reddit’s largest subreddits: AITA and AskReddit. Given the size and scope of the data, we established cleaning protocols to ensure meaningful and robust analysis.

Cleaning Steps

  1. Removing Empty Text: Posts with removed or missing content (selftext = ‘[removed]’) were excluded to focus only on posts with substantial text data.
  2. Filtering by Engagement: Since our subreddits focus on seeking feedback, we applied comment thresholds to retain posts with significant engagement.
  • AITA: Posts with fewer than 25 comments were removed.
  • AIOR: Posts with fewer than 15 comments were removed.
  • AskReddit: Posts with fewer than 50 comments were removed.

These thresholds were chosen based on the summary statistics (mean, median, and quantiles) of comment counts per post, ensuring a representative sample of well-engaged posts.
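The two cleaning steps above can be sketched in plain Python. This is only an illustration of the filtering logic, not the PySpark code used in the analysis. The field names `subreddit` and `num_comments` are assumed (only `selftext` is named in the text), and treating missing text as `None` is also an assumption.

```python
# Per-subreddit minimum comment counts, as stated in the cleaning steps.
MIN_COMMENTS = {"AmItheAsshole": 25, "AmIOverreacting": 15, "AskReddit": 50}

def keep_post(post):
    """Return True if a post survives both cleaning steps."""
    text = post.get("selftext")
    # Step 1: drop posts whose text is missing (assumed None) or was removed.
    if text is None or text == "[removed]":
        return False
    # Step 2: drop posts with fewer comments than the subreddit's threshold.
    return post["num_comments"] >= MIN_COMMENTS[post["subreddit"]]

# Toy records (hypothetical, for illustration only).
posts = [
    {"id": "a1", "subreddit": "AmItheAsshole", "selftext": "AITA for ...", "num_comments": 25},
    {"id": "a2", "subreddit": "AmItheAsshole", "selftext": "[removed]", "num_comments": 900},
    {"id": "b1", "subreddit": "AmIOverreacting", "selftext": "AIO ...", "num_comments": 14},
    {"id": "c1", "subreddit": "AskReddit", "selftext": "What is ...", "num_comments": 120},
]

cleaned_posts = [p for p in posts if keep_post(p)]
print([p["id"] for p in cleaned_posts])
```

A post with exactly 25 comments in AITA is kept, since only posts with fewer than 25 comments were removed.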

Final Dataset

After cleaning, the dataset was reduced to approximately 80,000 posts across the three subreddits, retaining high-quality and meaningful entries for analysis.

Comments Data Cleaning

Initial Dataset

The comments dataset initially contained over 76 million entries, representing a vast array of user interactions on the three subreddits.

Cleaning Steps

  1. Removing Empty Text: Comments with removed or missing content (body = ‘[removed]’) were excluded to focus on meaningful user responses.
  2. Joining with Posts: To ensure all comments were tied to a relevant post, we joined the comments dataframe with the cleaned posts dataframe, removing orphaned comments.
  3. Filtering for Immediate Feedback: Reddit is a platform known for its fast-paced interactions, especially in these subreddits. To capture this dynamic, we:
  • Created a top_level_comment variable (binary) to isolate comments directly responding to the original post.
  • Attempted to analyze deeper comment threads recursively (using methods like graphs/nodes in Pandas and PySpark), but computational limitations made this infeasible due to the dataset’s size.
  4. Time-Based Filtering: A time_since_post variable was calculated, measuring the time in hours between a post’s creation and its comments. Only comments made within 24 hours of a post were retained, reflecting the immediacy of feedback typical in these subreddits.
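A minimal plain-Python sketch of the comment filters, under stated assumptions, might look like the following. The field names `parent_id`, `link_id`, and `created_utc` are assumed to follow Reddit's usual conventions (only `body`, `top_level_comment`, and `time_since_post` are named in the text). In Reddit's naming scheme a `parent_id` starting with `t3_` points at a post, so it marks a top-level comment. Keeping only top-level comments is inferred from the description of the final dataset.

```python
HOURS_WINDOW = 24  # comments later than this after the post are dropped

def clean_comments(comments, post_created):
    """Apply the four comment filters.

    post_created maps a cleaned post's full name (e.g. "t3_p1") to its
    creation time in epoch seconds; comments on other posts are orphans.
    """
    kept = []
    for c in comments:
        if c["body"] == "[removed]":          # 1. removed text
            continue
        if c["link_id"] not in post_created:  # 2. no matching cleaned post
            continue
        # 3. "t3_" prefix means the parent is the post itself.
        top_level_comment = c["parent_id"].startswith("t3_")
        # 4. hours elapsed between post creation and comment creation.
        time_since_post = (c["created_utc"] - post_created[c["link_id"]]) / 3600
        if top_level_comment and 0 <= time_since_post <= HOURS_WINDOW:
            kept.append({**c, "top_level_comment": top_level_comment,
                         "time_since_post": time_since_post})
    return kept

# Toy data (hypothetical).
T0 = 1_700_000_000
post_created = {"t3_p1": T0}
comments = [
    {"id": "c1", "link_id": "t3_p1", "parent_id": "t3_p1", "created_utc": T0 + 3600, "body": "NTA"},
    {"id": "c2", "link_id": "t3_p1", "parent_id": "t1_c1", "created_utc": T0 + 7200, "body": "Agreed"},
    {"id": "c3", "link_id": "t3_p1", "parent_id": "t3_p1", "created_utc": T0 + 25 * 3600, "body": "YTA"},
    {"id": "c4", "link_id": "t3_p1", "parent_id": "t3_p1", "created_utc": T0 + 60, "body": "[removed]"},
    {"id": "c5", "link_id": "t3_zz", "parent_id": "t3_zz", "created_utc": T0 + 60, "body": "ESH"},
]

kept = clean_comments(comments, post_created)
print([c["id"] for c in kept])
```

Only `c1` passes here: `c2` is a reply to another comment, `c3` arrives after 25 hours, `c4` was removed, and `c5` belongs to a post that is not in the cleaned set.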

Final Dataset

The final comments dataset was reduced to approximately 19 million comments, representing meaningful, top-level feedback tied to posts with significant engagement. Each cleaning decision was made after carefully examining the data and guided by the unique dynamics of Reddit’s feedback mechanisms.

External Data: Dear Abby

To complement our Reddit analysis, we incorporated the Dear Abby dataset, sourced from Kaggle and featured in The Pudding’s essay, 30 Years of American Anxieties. This dataset includes 20,000 questions submitted to the advice column from 1985 to 2017, offering a historical perspective on public concerns.

Data Preparation

  1. Removed incomplete entries and standardized text formatting.
  2. Retained submissions within the 1985–2017 time frame.

This dataset serves as a baseline for comparing traditional advice-seeking behavior with Reddit’s dynamic, real-time interactions, helping to explore how platform design and anonymity shape public concerns.

Comparing Engagement: Reddit vs. Dear Abby

Reddit Activity Across Subreddits

Reddit’s immense popularity as a modern advice-seeking platform is evident in the posting trends across its subreddits. The bar chart below illustrates monthly post frequencies for AskReddit, AmItheAsshole, and AmIOverreacting from June 2023 to July 2024.

figure: Number of posts per subreddit for each month
subreddit_post min_date max_date
0 AmIOverreacting 11/10/23 11:42 7/31/24 23:52
1 AmItheAsshole 6/1/23 0:03 7/31/24 23:50
2 AskReddit 6/1/23 0:10 7/31/24 23:55

Key Observations:

  1. Dominance of AskReddit: AskReddit consistently receives the highest number of posts, reflecting its broad appeal and diverse range of topics.
  2. Steady Activity in AmItheAsshole: AmItheAsshole exhibits robust activity, indicating strong engagement with moral and ethical dilemmas.
  3. Emergence of AmIOverreacting: AmIOverreacting shows relatively lower activity, as it is a newer subreddit established in late 2023. Its posting frequency has gradually increased, reflecting its growing user base.
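Monthly counts like those behind the bar chart can be produced by bucketing each post's timestamp into a year-month key. The sketch below is a plain-Python illustration, not the code used for the figure. It assumes each post carries a `subreddit` name and a `created_utc` epoch timestamp, and the sample rows are hypothetical.

```python
from collections import Counter
from datetime import datetime, timezone

def monthly_post_counts(posts):
    """Count posts per (subreddit, 'YYYY-MM') pair, using UTC timestamps."""
    counts = Counter()
    for p in posts:
        created = datetime.fromtimestamp(p["created_utc"], tz=timezone.utc)
        counts[(p["subreddit"], created.strftime("%Y-%m"))] += 1
    return counts

def epoch(year, month, day):
    """Helper for building toy timestamps (midnight UTC)."""
    return datetime(year, month, day, tzinfo=timezone.utc).timestamp()

# Toy data (hypothetical).
posts = [
    {"subreddit": "AskReddit", "created_utc": epoch(2023, 6, 1)},
    {"subreddit": "AskReddit", "created_utc": epoch(2023, 6, 15)},
    {"subreddit": "AskReddit", "created_utc": epoch(2023, 7, 2)},
    {"subreddit": "AmIOverreacting", "created_utc": epoch(2023, 11, 10)},
]

counts = monthly_post_counts(posts)
for (subreddit, month), n in sorted(counts.items()):
    print(subreddit, month, n)
```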

Takeaway:

Monthly activity on Reddit far exceeds that of Dear Abby, with individual subreddits, like AskReddit and AmItheAsshole, more than doubling the average number of monthly posts Dear Abby received.
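As a rough check on the size of this gap, the monthly averages can be recomputed from figures stated elsewhere in this report. The sketch assumes that 1985 to 2017 covers 33 full calendar years and that the post totals from the virality table are spread evenly over the 14 months from June 2023 to July 2024. Those totals are cleaned counts (after the engagement filter), so they understate raw posting volume.

```python
# Dear Abby: 20,000 questions over 1985-2017 (assumed 33 full calendar years).
DEAR_ABBY_QUESTIONS = 20_000
DEAR_ABBY_MONTHS = (2017 - 1985 + 1) * 12
dear_abby_per_month = DEAR_ABBY_QUESTIONS / DEAR_ABBY_MONTHS

# Reddit: cleaned post totals from the virality table, June 2023 to July 2024.
REDDIT_MONTHS = 14
reddit_totals = {"AskReddit": 42_296, "AmItheAsshole": 33_836}

# Ratio of each subreddit's monthly average to Dear Abby's monthly average.
ratios = {}
for name, total in reddit_totals.items():
    per_month = total / REDDIT_MONTHS
    ratios[name] = per_month / dear_abby_per_month
    print(f"{name}: {per_month:,.0f} posts/month, "
          f"{ratios[name]:.0f}x the Dear Abby average of {dear_abby_per_month:.1f}")
```

Even on these filtered counts, both subreddits come out dozens of times above the Dear Abby monthly average, so "more than doubling" is a conservative description.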

Summary: Reddit vs. Dear Abby

Both platforms highlight user-generated advice-seeking content, but the scale and dynamics differ drastically:

  • Reddit’s Reach: Individual subreddits like AskReddit and AmItheAsshole receive far more posts per month than Dear Abby at its peak, reflecting Reddit’s status as a dominant modern platform.
  • Dear Abby’s Longevity: Despite its smaller scale, Dear Abby provides a longitudinal view of public concerns over three decades, showcasing shifts in communication style and platform usage.
  • Platform Evolution: AmIOverreacting demonstrates the growth of niche communities on Reddit, highlighting its adaptability to emerging user needs, while Dear Abby reflects a more stable, traditional approach.

This comparison underscores Reddit’s expansive role in advice-seeking today, surpassing traditional platforms in scale and immediacy.

Exploring Post Virality Across Subreddits

To understand what drives post virality, we created a new variable, viral, identifying posts that ranked in the top 10% for both post score and number of comments across all subreddits. The table below summarizes viral post statistics for AmItheAsshole, AmIOverreacting, and AskReddit:

subreddit members total_posts viral_posts viral_posts_percentage viral_posts_per_10k_members
0 AmItheAsshole 21946374 33836 1434 4.238 0.653
1 AmIOverreacting 615668 3470 95 2.738 1.543
2 AskReddit 49108885 42296 3274 7.741 0.667

Key Insights:

  • AskReddit has the largest community (49M members) and highest total posts (42,296), resulting in the most viral posts (3,274) and the highest viral percentage (7.74%).
  • Despite being the smallest subreddit, AmIOverreacting achieves 1.54 viral posts per 10k members, more than double the rate of AskReddit (0.67) and AmItheAsshole (0.65).
  • AmItheAsshole balances high engagement and a large user base, producing 1,434 viral posts (4.24% of its total posts).
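The two derived columns in the table follow directly from the counts. The short sketch below recomputes them from the table's own numbers; the function name is hypothetical.

```python
# Counts copied from the virality table above.
subreddits = {
    "AmItheAsshole": {"members": 21_946_374, "total_posts": 33_836, "viral_posts": 1_434},
    "AmIOverreacting": {"members": 615_668, "total_posts": 3_470, "viral_posts": 95},
    "AskReddit": {"members": 49_108_885, "total_posts": 42_296, "viral_posts": 3_274},
}

def virality_metrics(row):
    """Return (viral_posts_percentage, viral_posts_per_10k_members)."""
    pct = 100 * row["viral_posts"] / row["total_posts"]
    per_10k = 10_000 * row["viral_posts"] / row["members"]
    return round(pct, 3), round(per_10k, 3)

for name, row in subreddits.items():
    pct, per_10k = virality_metrics(row)
    print(f"{name}: {pct}% of posts viral, {per_10k} viral posts per 10k members")
```

The two metrics normalize by different denominators (posts versus members), which is why AmIOverreacting can rank last on one and first on the other.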

This analysis reveals differing engagement dynamics across subreddits. While AskReddit’s sheer size drives raw viral numbers, AmIOverreacting demonstrates exceptional virality relative to its small, focused community, suggesting that smaller subreddits can outperform larger ones in terms of engagement efficiency.
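One way to build a flag like viral is to compute a top-10% cutoff for score and for comment count over all posts pooled together, then require a post to clear both. The sketch below uses a simple rank-based cutoff. The exact percentile method used in the analysis is not stated, so this is an assumption, as are the field names `score` and `num_comments`.

```python
def top_decile_cutoff(values):
    """Smallest value still inside the top 10% when ranked descending."""
    ordered = sorted(values, reverse=True)
    k = max(1, len(ordered) // 10)  # number of items in the top decile
    return ordered[k - 1]

def flag_viral(posts):
    """Add a boolean 'viral' key: top 10% on BOTH score and comments."""
    score_cut = top_decile_cutoff([p["score"] for p in posts])
    comment_cut = top_decile_cutoff([p["num_comments"] for p in posts])
    return [
        {**p, "viral": p["score"] >= score_cut and p["num_comments"] >= comment_cut}
        for p in posts
    ]

# Toy data (hypothetical): 20 posts with score = 10 * id.
posts = [{"id": i, "score": 10 * i, "num_comments": i} for i in range(1, 21)]
posts[19]["num_comments"] = 500  # id 20: top score and top comments
posts[18]["num_comments"] = 5    # id 19: top score but few comments
posts[0]["num_comments"] = 400   # id 1: many comments but low score

viral_ids = [p["id"] for p in flag_viral(posts) if p["viral"]]
print(viral_ids)
```

Because both conditions must hold, the share of viral posts is at most 10% (barring ties at the cutoff) and is usually lower, consistent with the 2.7% to 7.7% range in the table.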

Distribution of Post Lengths by Subreddit

The histograms below illustrate the distribution of post lengths (measured in number of characters) across the subreddits AmIOverreacting, AmItheAsshole, and AskReddit. These distributions provide insights into the varying dynamics and content styles of each community.

Distribution of Selftext Length: AmIOverreacting

Distribution of Selftext Length: AmItheAsshole

Distribution of Selftext Length: AskReddit

Key Insights:

  1. AmIOverreacting:
  • The distribution is skewed right, with most posts under 3,000 characters.
  • This reflects its focus on concise, emotional queries, often aimed at validating users’ feelings.
  • Lower post counts compared to the other subreddits can be attributed to its status as a newer community.
  2. AmItheAsshole:
  • Posts exhibit a wider range of lengths, with a noticeable peak around 2,000 characters.
  • Users often provide detailed narratives to contextualize their moral dilemmas, which aligns with the subreddit’s focus on ethical judgment.
  3. AskReddit:
  • A bimodal distribution emerges, with one peak at very short posts (under 100 characters) and another, less noticeable peak at longer, detailed posts.
  • This reflects the diverse nature of AskReddit, accommodating both concise questions and extensive discussions.
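The length distributions discussed above come down to measuring each post's selftext in characters and counting posts per bin. A minimal sketch follows; the 500-character bin width is an arbitrary assumption, and the sample texts are hypothetical.

```python
from collections import Counter

def length_histogram(texts, bin_width=500):
    """Count posts per character-length bin; keys are each bin's lower edge."""
    return Counter((len(t) // bin_width) * bin_width for t in texts)

# Toy selftexts (hypothetical): two very short posts and three longer ones.
texts = ["a" * 80, "b" * 90, "c" * 1200, "d" * 2100, "e" * 2400]

hist = length_histogram(texts)
for lower_edge in sorted(hist):
    print(f"{lower_edge}-{lower_edge + 499} chars: {hist[lower_edge]} posts")
```

A bimodal shape like the one described for AskReddit would show up as two separated bins with high counts and sparser bins between them.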

Popularity and Content Characteristics:

  • Despite AskReddit’s shorter posts, it maintains the highest percentage of viral posts among the subreddits, indicating that post popularity is not tied to length.
  • A closer examination of content reveals differing engagement dynamics. For example:
    • AmItheAsshole posts often revolve around introspective queries about personal responsibility, leading to more reserved and non-explicit content.
    • AmIOverreacting may include more emotionally charged or explicit posts, as users seek validation for situations they believe justify their reactions.

NSFW Content Analysis

A breakdown of NSFW (Not Safe for Work) content further highlights these differences:

subreddit_post total_posts nsfw_posts nsfw_percentage
0 AmItheAsshole 33836 235 0.694527
1 AmIOverreacting 3470 202 5.821326
2 AskReddit 42296 5536 13.088708
  • AskReddit has the highest proportion of NSFW posts (about 13.1%), reflecting its broad range of content topics.
  • AmItheAsshole maintains the lowest NSFW percentage (about 0.7%), aligning with its focus on introspective and ethical discussions.
  • AmIOverreacting falls in between (about 5.8%), likely driven by emotionally charged scenarios where users seek validation for potentially controversial reactions.
  • The anonymous nature of Reddit encourages users to share sensitive or explicit topics more freely, particularly in communities like AskReddit and AmIOverreacting.
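A per-subreddit NSFW share like the one tabulated above can be obtained by counting flagged posts within each group. The sketch assumes the NSFW marker is stored in a boolean field named `over_18`, as in Reddit's post metadata. The field name is not given in this report, and the sample rows are hypothetical.

```python
from collections import defaultdict

def nsfw_share(posts):
    """Return {subreddit: percentage of posts flagged NSFW}."""
    totals = defaultdict(int)
    nsfw = defaultdict(int)
    for p in posts:
        totals[p["subreddit"]] += 1
        nsfw[p["subreddit"]] += bool(p["over_18"])  # True counts as 1
    return {name: 100 * nsfw[name] / totals[name] for name in totals}

# Toy data (hypothetical).
posts = [
    {"subreddit": "AskReddit", "over_18": True},
    {"subreddit": "AskReddit", "over_18": False},
    {"subreddit": "AskReddit", "over_18": False},
    {"subreddit": "AskReddit", "over_18": False},
    {"subreddit": "AmItheAsshole", "over_18": False},
    {"subreddit": "AmItheAsshole", "over_18": False},
]

shares = nsfw_share(posts)
print(shares)
```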

Breaking Down Feedback: Responses on AmItheAsshole

In contrast to AskReddit, the AmItheAsshole subreddit stands out for its structured feedback system, where comments are labeled with specific acronyms to indicate judgments on the original post. These labels range from agreement to disagreement, providing clarity and categorization in user responses:

  • YTA: You’re the Asshole
  • YWBTA: You Would Be the Asshole
  • NTA: Not the Asshole
  • YWNBTA: You Would Not Be the Asshole
  • ESH: Everyone Sucks Here
  • NAH: No A-holes Here
  • INFO: Not Enough Info

Response Patterns

The radar chart below visualizes the distribution of these labels in responses to posts on AmItheAsshole.

figure: Radar chart of the responses to posts in AmItheAsshole

Key Insights:

  • The most frequent responses are YTA and NTA, indicating that audiences on this subreddit are highly decisive and assertive in their opinions.
  • This structured feedback system likely contributes to the subreddit’s popularity, as users seeking clarity or validation for their actions receive direct, categorical judgments.
  • The diversity of response labels, ranging from agreement to neutrality or disagreement, allows for nuanced discussions and reflects the community’s engagement with complex moral dilemmas.
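Label counts of the kind plotted in the radar chart can be obtained by scanning each comment for one of the seven acronyms. The sketch below is one possible approach, not necessarily the one used here. It takes the first label found in each comment and matches case-sensitively so that an ordinary lowercase word such as "info" is not counted. The sample comments are hypothetical.

```python
import re
from collections import Counter

# The seven judgment labels listed above.
LABELS = ("YWNBTA", "YWBTA", "YTA", "NTA", "ESH", "NAH", "INFO")
# Whole-word, case-sensitive match on any label.
LABEL_RE = re.compile(r"\b(" + "|".join(LABELS) + r")\b")

def first_label(comment_body):
    """Return the first judgment label in a comment, or None if absent."""
    match = LABEL_RE.search(comment_body)
    return match.group(1) if match else None

# Toy comments (hypothetical).
comments = [
    "NTA. Your house, your rules.",
    "YTA, and you know it.",
    "Honestly ESH here",
    "INFO: how old is your brother?",
    "That sounds rough.",
    "NTA - he had it coming",
]

label_counts = Counter(
    label for label in map(first_label, comments) if label is not None
)
print(label_counts.most_common())
```

Comments that quote a label while arguing against it would be miscounted by this rule, so the resulting counts are best read as approximate.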

Takeaway:

The structured nature of feedback on AmItheAsshole fosters a unique environment for moral and ethical discussions, making it a go-to platform for users seeking unfiltered and assertive opinions on controversial topics.

A similar labeling system is present in AmIOverreacting, which will be explored further in our machine learning section.

Viral Posts Examples

The table below highlights the top ten viral posts from each subreddit, offering a glimpse into the types of dilemmas and discussions that resonate most with their respective communities. This selection provides valuable context for understanding subreddit themes, which will be explored in greater depth through topic modeling on the NLP page.

subreddit_post post_title score_post rank
0 AskReddit Now that Reddit are killing 3rd party apps on July 1st what are great alternatives to Reddit? 77955 1
1 AskReddit You get pushed into 2030 for 10 minutes and you get ONE google search,What you looking for? 45430 2
2 AskReddit Medical professionals of Reddit, have you ever had a patient so lacking in common sense you wondered how they made it this far. If so, what is your story? 41171 3
3 AskReddit Parents who tried their best to raise their kids to be good humans but they turned out to be jerks, what do you wish you did differently? 37878 4
4 AskReddit What is something you used to think people were over exaggerating about until you experienced it yourself? 34682 5
5 AskReddit People who work at super fancy hotels, what kind of stuff happens that management doesn’t want people to know about? 32125 6
6 AskReddit People who work for the super wealthy, what stuff have you seen? 31837 7
7 AskReddit What's the most disturbing piece of audio there is? 31521 8
8 AskReddit Wedding photographers of Reddit, what was your "they're not gonna last long" moment? 31171 9
9 AskReddit If you suddenly had "fuck you" money what would be the first thing you did? 29507 10
10 AmIOverreacting My bf murdered my entire family, i’m thinking of ending it. Should I? 34031 1
11 AmIOverreacting AIO: I’m upset my gf referred to me as her “friend” 24716 2
12 AmIOverreacting I (35/M) told my wife (32/F) I want a divorce after she implied I am sexually abusing our daughter (4/F). AIO? 22922 3
13 AmIOverreacting My husband told me why he cheated on me 20170 4
14 AmIOverreacting My husband won't let me take more than two showers a week. I told him I need him to stop or I'm moving out for a while. 19742 5
15 AmIOverreacting AIO? My 23M boyfriend held me 19F underwater during a bath to prove a point and I’m still shaken 18465 6
16 AmIOverreacting My daughter is having an affair with the married neighbor. I told her she needs to move out of my house 16627 7
17 AmIOverreacting AIO my girlfriend won't stop swapping out my real groceries with small versions of the items 15684 8
18 AmIOverreacting My wife had an affair years ago. I just found out she is talking to the man again and I want to divorce. 15041 9
19 AmIOverreacting My wife continually goes out for the night and doesn’t come home. Am I overreacting? 14053 10
20 AmItheAsshole AITA for spending my son's university fund on a trip to Europe to drink beer like I always threatened instead of giving it to his step brother after he passed away. 35820 1
21 AmItheAsshole AITA for begging my girlfriend to uphold a sexist tradition just so she can make a good first impression? 29865 2
22 AmItheAsshole AITA for leaving my sister’s wedding early because she kept my husband out of pictures? 29805 3
23 AmItheAsshole AITA for disposing my pads in my boyfriend’s bathroom? 28512 4
24 AmItheAsshole WIBTA if I go on vacation instead of my brothers wedding? 28339 5
25 AmItheAsshole AITA for getting high so my relatives don't try and pawn their children on me? 26587 6
26 AmItheAsshole AITA for screaming at my husband and his sister to get out of my kitchen? 25337 7
27 AmItheAsshole AITA for leaving my husband at home, while I spend the week at my brothers, because of how he “buys” groceries? 25198 8
28 AmItheAsshole AITA for denying an older woman shelter from a storm? 25198 8
29 AmItheAsshole AITA for “kidnapping” my baby, causing my husband to have a panic attack 24504 10

Conclusion

Our exploratory data analysis reveals distinct patterns in user engagement, content dynamics, and subreddit-specific behaviors across AskReddit, AmItheAsshole, and AmIOverreacting. While AskReddit dominates in scale and reach, AmItheAsshole thrives on structured feedback for moral dilemmas, and AmIOverreacting demonstrates the unique appeal of niche communities. These insights provide a foundation for further analysis, including topic modeling and sentiment analysis, which will deepen our understanding of the themes, tone, and judgments that drive these platforms.