The Hot Hand: Unraveling the Myth

Introduction to the Hot Hand Phenomenon:

In the dynamic realm of sports, the concepts of a “hot hand” or being “in the zone” have become ubiquitous, often used to describe a player’s extraordinary performance streak. Originating in basketball, the term suggests that a player on a successful streak is more likely to continue their success. This belief, rooted in the increased confidence of the player, extends beyond basketball, resonating in baseball and other fields. Despite its prevalence, statistical evidence supporting the existence of the hot hand is scarce, with numerous studies suggesting it is a fallacy. This project embarks on an exploration within the sports domain to scrutinize the mystery of streaks.

Diverse Dataset Exploration:

Throughout the semester, I undertook a comprehensive exploration of the hot hand phenomenon, employing an array of datasets with varying complexities. The NCAA basketball data, scraped from the Villanova Men’s Basketball team in R, offered insights into shot data. The baseball data, sourced from FanGraphs and scraped in R, included both individual player data and detailed pitch-by-pitch information. Additionally, news text data was gathered using an API in Python. Each dataset’s unique cleaning process laid the foundation for an in-depth Exploratory Data Analysis (EDA), allowing for the refinement of hypotheses and the delineation of the investigative path.

EDA Signals and Insights:

Promising signals emerged during the EDA phase, particularly in individual player baseball data and the NCAA shot data. The former hinted at leveraging past hard-hit data to predict future performance, showcasing potential autocorrelation or seasonality effects. This finding holds promise for refining future models. The NCAA shot data, while lacking strong correlations individually, exhibited a descending trend in lag variables, suggesting the immediate prior shot’s influence on the current outcome.

Model Performance and Limitations:

The Naive Bayes model, designed to predict made or missed shots, failed to outperform random guessing, aligning with the null hypothesis that the hot hand does not exist. Further stages of analysis required the addition of numerical feature variables for the NCAA data, including shot value, score differential, and shooter field goal percentage.

While I was successful in reducing dimensions (4 variables accounted for over 75% of overall variance), the various clustering methods (KMeans, DBSCAN, and Hierarchical Clustering) failed to conclusively establish the presence of discernible data clusters in the shot data. Decision trees and random forests, though outperforming the Naive Bayes classifier, primarily relied on shot value rather than the lag variable, reaffirming its lower predictive power.

Looking Ahead:

I recognize the limitations of predominantly relying on NCAA shot data to confirm the hot hand fallacy. I plan to revisit this topic at a later date, armed with a deeper understanding of time series data and an expanded arsenal of machine learning algorithms. Future analyses may explore this topic using alternative definitions of success, such as launch speeds and angles in baseball. For now, my findings align with prior research, reinforcing the notion that the hot hand is indeed a fallacy.

Extra Joke

What do you get when you cross a pirate with a data scientist?

Someone who specializes in Rrrr.

Thank you for taking the time to explore my project, and kudos for enduring all the humor along the way!