#load in relevant libraries
library(tidyverse)
Billy McGloin
Eric Dapkus
Nolan Penoyer
Caitlyn Dovel
November 29, 2023
The interplay between political leanings and quality of life in the United States forms a complex tapestry. This study aims to dissect this interplay, focusing on how a state’s political orientations impact various facets of citizen well-being. This project not only seeks to highlight the correlations between political inclinations and quality of life indicators but also strives to understand the causative relationships within. The primary research question guiding this study is: How do state political leanings influence the quality of life of their residents?
Choosing where to live is one of the most important decisions in a person’s life (2010). Our study aims to inform that decision by ascertaining whether the political leanings of states, specifically Democratic or Republican affiliations, influence the quality of life experienced by their citizens. We approach this investigation through a comprehensive examination of key indicators such as income, crime rates, health outcomes, and education.
Income levels often serve as a foundational determinant of an individual’s standard of living. The political leanings of a state influence economic policies, tax structures, and resource allocation, subsequently impacting income distribution among its residents. While relative income may not improve the quality of someone’s everyday experience, it has been shown to improve how we evaluate our lives (2010).
Likewise, crime rates, another critical aspect of our investigation, are closely linked to personal safety and community well-being. The political landscape may shape law enforcement policies, social programs, community engagement, and program funding, all of which can have direct implications for public safety and the overall quality of life. Past studies have shown that specific crimes can have a large and lasting impact on our overall well-being (2008).
Health outcomes, which strongly affect our quality of life (2000), constitute an additional pillar of our study. Political orientations may influence healthcare policies, accessibility to medical services, and public health initiatives, thereby contributing to variations in health outcomes among state residents.
The educational landscape, shaped by political decisions on funding, curriculum, and access, can significantly impact aspects of our lives. Access to education can lead to social benefits, increase analytical skills, improve intellectual capital, and change the constraints and opportunities of our lives (2011).
In synthesizing these indicators, our study strives to bridge the gap between political leanings and the lived experiences of residents. By focusing on these concrete aspects rather than delving into technical details, we aim to provide a practical foundation for understanding the potential impact of political affiliations on the quality of life in different states. As we proceed to analyze and interpret the data, our findings hold the promise of offering actionable insights for individuals making residential decisions and policymakers shaping the future of their communities.
National voting data was collected from the Federal Election Commission (FEC), which hosts election and voting information for every federal election since 1982. For our analysis we used the 2020 data, which is available in Portable Document Format (PDF) and Excel (.xlsx) formats. We used the .xlsx version and converted it to a Comma-Separated Values (CSV) file. The data we used for our project can be found on tab 9, 2020 Pres General Results, in the Excel file (FEC). This data is used to determine the states’ political leanings. (2020)
FEC | Election and voting information
Population data was taken from the United States Census Bureau. To match the voting data we used the State Population Totals and Components of Change: 2020-2022. The data is provided in a CSV file that is easy to import into the data cleaning workflow. This data was used to normalize other data we gathered. (2022)
State Population Totals: 2020-2022 (census.gov)
Income data was also obtained from the U.S. Census Bureau, which collects and maintains numerous sets of economic and income data for the United States. We used Table H-8, median household income by state, as the information was the cleanest for our use case. Table H-8 is an Excel document with numerous header rows that do not contain any data. (2022)
Historical Income Tables: Households (census.gov)
We scraped crime data from the FBI’s Crime Data Explorer (CDE). The CDE permits querying the number of crimes by state, year, and type of crime (property crime, larceny, etc.). For our analysis, we retrieved crime data from 2020-2022 for each state, constructing two separate CSV files, one for violent crimes and one for non-violent crimes. This data, coupled with the state population data, yields the crime rates for each state for a given year. (2022)
FBI Crime Data API (cde.ucr.cjis.gov)
Heart disease deaths data was collected from the Centers for Disease Control and Prevention (CDC). The CSV obtained from their site lists the year, state, number of deaths from heart disease, the rate of heart disease deaths per 100k people, and a URL providing more information for each state. The number of deaths from heart disease was one of the features we used to determine the health of a state. Since the data is reported as total deaths per state, it was normalized by state population during cleaning. Heart disease was chosen since it is a leading cause of death in America and a reasonable indicator of how people take care of themselves. (2022)
Stats of the States - Heart Disease Mortality (cdc.gov)
COVID-19 data was gathered from the New York Times GitHub repository, which hosts multiple datasets with COVID-19 data. We used the us-states.csv file, which contains the date (the single day the information is for), state, fips, cases, and deaths. The repository contains a README.md that describes their methods for collecting and organizing the data. The file was a CSV and easy to import into the workflow during data cleaning. Both COVID-19 cases and deaths were used as features for determining the health of a state. These variables show how a state reacts to rising health risks. (2023)
NYTimes/covid-19-data: A repository of data on coronavirus cases and deaths in the U.S. (github.com)
Education data was acquired through the United States Census Bureau API. The data comes from the American Community Survey (ACS), an ongoing survey that provides data every year about the social, economic, demographic, and housing characteristics of the U.S. population. We focused on the number of people with a bachelor’s degree, associate’s degree, GED or equivalent, and some college for each state from 2021-2022. (2022)
State Education Totals: 2021-2022 (census.gov)
We retrieved data on the number of White, Black, Asian, and Hispanic people per state from the Census API and ACS, similar to the education data. Again, we only chose to use the data from 2021-2022. (2022)
State Race Totals: 2021-2022 (census.gov)
This section of the report will act as a journal of our data cleaning process. We will show the raw data, the code used to clean it, and then the modified data.
After loading the relevant libraries and reading in the data, let’s take a look at the national voting data.
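The import step itself is not shown in the report; below is a minimal sketch, assuming tab 9 of the FEC workbook was already saved as a CSV under a hypothetical file name (readxl::read_excel() could alternatively read the .xlsx sheet directly).
#load the converted FEC results (hypothetical file name; the actual path is not given in the report)
df <- read.csv("fec_2020_pres_general_results.csv")
cat("Shape of Data:", dim(df))
head(df)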
Shape of Data: 681 13
X...1 | FEC.ID | STATE | STATE.ABBREVIATION | GENERAL.ELECTION.DATE | FIRST.NAME | LAST.NAME | LAST.NAME...FIRST | TOTAL.VOTES | PARTY | GENERAL.RESULTS | GENERAL.. | TOTAL.VOTES.. | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<int> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | |
1 | 2 | ||||||||||||
2 | 3 | P80001571 | Alabama | AL | 11/3/20 | Donald J. | Trump | Trump, Donald J. | R | 1,441,170 | 62.03% | ||
3 | 4 | P80000722 | Alabama | AL | 11/3/20 | Joseph R. | Biden | Biden, Joseph R. | D | 849,624 | 36.57% | ||
4 | 5 | P00013524 | Alabama | AL | 11/3/20 | Jo | Jorgensen | Jorgensen, Jo | IND | 25,176 | 1.08% | ||
5 | 6 | n/a | Alabama | AL | 11/3/20 | Scattered | Scattered | W | 7,312 | 0.31% | |||
6 | 7 | n/a | Alabama | AL | 11/3/20 | Total State Votes: | 2,323,282 |
We can see that the above data has multiple rows per state, with rows for each candidate and total state votes. Let’s clean this data. We can start by only keeping votes for the two major parties and then create two new columns (one for Candidate and another for Total Votes for that candidate). The data then needs to be converted from long to wide so that each row/observation represents a state (including DC). Finally, let’s create some new columns that will be useful for our analysis: Total Votes Overall, Vote Outcome, and Percentage Democratic.
#VOTING DATA
# remove all votes not for trump or biden
df <- df[df$PARTY %in% c("D", "R"), ]
# create columns for biden and trump votes for each state
df <- df %>%
mutate(VoteTotal = as.numeric(gsub(",", "", GENERAL.RESULTS)),
Candidate = ifelse(PARTY == "R", "Trump", ifelse(PARTY == "D", "Biden", NA))) %>%
group_by(STATE, Candidate) %>%
summarize(TotalVotes = sum(VoteTotal))
# convert from long to wide
df_wide <- df %>%
spread(key = Candidate, value = TotalVotes, fill = 0)
# create columns for total votes, outcome, and percentage democratic
df_result <- df_wide %>%
mutate(TotalVotes = Biden + Trump,
Outcome = ifelse(Biden > Trump, "Democratic", "Republican"),
Percentage_Democratic = round((Biden / TotalVotes) * 100, 2))
cat("Shape of Data:", dim(df_result))
head(df_result)
`summarise()` has grouped output by 'STATE'. You can override using the
`.groups` argument.
Shape of Data: 51 6
STATE | Biden | Trump | TotalVotes | Outcome | Percentage_Democratic |
---|---|---|---|---|---|
<chr> | <dbl> | <dbl> | <dbl> | <chr> | <dbl> |
Alabama | 849624 | 1441170 | 2290794 | Republican | 37.09 |
Alaska | 153778 | 189951 | 343729 | Republican | 44.74 |
Arizona | 1672143 | 1661686 | 3333829 | Democratic | 50.16 |
Arkansas | 423932 | 760647 | 1184579 | Republican | 35.79 |
California | 11110639 | 6006518 | 17117157 | Democratic | 64.91 |
Colorado | 1804352 | 1364607 | 3168959 | Democratic | 56.94 |
Now we can read in the population data that was acquired from the United States Census Bureau and see what it looks like.
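As before, the import is not shown; a minimal sketch, assuming the Census population totals were saved locally under a hypothetical file name:
# hypothetical file name for the Census state population totals (2020-2022) CSV
pop <- read.csv("state_population_2020_2022.csv")
cat("Shape of Data:", dim(pop))
head(pop)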
Shape of Data: 51 2
X...State | Population.Estimate | |
---|---|---|
<chr> | <chr> | |
1 | .Alabama | 5,031,362 |
2 | .Alaska | 732,923 |
3 | .Arizona | 7,179,943 |
4 | .Arkansas | 3,014,195 |
5 | .California | 39,501,653 |
6 | .Colorado | 5,784,865 |
Let’s remove the period before the state name, convert the population column to numeric, and check the sum of all populations to verify the accuracy of the data. Let’s then merge the cleaned data together by STATE.
#POPULATION DATA
# Remove the dot before the state name
pop <- pop %>%
mutate(STATE = gsub("\\.", "", X...State))
# Convert Population.Estimate to numeric after removing commas
pop$Population.Estimate <- as.numeric(gsub(",", "", pop$Population.Estimate))
pop <- select(pop, -X...State)
sum(pop$Population.Estimate)
# Merge the two datasets
df_merged <- left_join(df_result, pop, by = "STATE")
cat("Shape of Data:", dim(df_merged))
head(df_merged)
Shape of Data: 51 7
STATE | Biden | Trump | TotalVotes | Outcome | Percentage_Democratic | Population.Estimate |
---|---|---|---|---|---|---|
<chr> | <dbl> | <dbl> | <dbl> | <chr> | <dbl> | <dbl> |
Alabama | 849624 | 1441170 | 2290794 | Republican | 37.09 | 5031362 |
Alaska | 153778 | 189951 | 343729 | Republican | 44.74 | 732923 |
Arizona | 1672143 | 1661686 | 3333829 | Democratic | 50.16 | 7179943 |
Arkansas | 423932 | 760647 | 1184579 | Republican | 35.79 | 3014195 |
California | 11110639 | 6006518 | 17117157 | Democratic | 64.91 | 39501653 |
Colorado | 1804352 | 1364607 | 3168959 | Democratic | 56.94 | 5784865 |
Let’s read in the income data that was acquired from the U.S. Census Bureau and see what it looks like.
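The income import is likewise not shown; a minimal sketch, assuming the state and median income columns of Table H-8 were extracted to a CSV with a hypothetical file name:
# hypothetical file name: state and median household income extracted from Table H-8
income <- read.csv("median_household_income_h08.csv")
cat("Shape of Data:", dim(income))
head(income)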
Shape of Data: 52 2
X...State | Median.income | |
---|---|---|
<chr> | <chr> | |
1 | United States | 68,010 |
2 | Alabama | 54,690 |
3 | Alaska | 74,750 |
4 | Arizona | 67,090 |
5 | Arkansas | 50,780 |
6 | California | 77,650 |
We need to remove the United States row and rename the state column so that it matches the merged data. Below is the newly merged data.
#INCOME DATA
# remove united states row and rename the state column
income <- income %>%
filter(X...State != "United States") %>%
rename(STATE = X...State)
# merge the income data with the merged data
df_merged <- left_join(df_merged, income, by = "STATE")
cat("Shape of Data:", dim(df_merged))
head(df_merged)
Shape of Data: 51 8
STATE | Biden | Trump | TotalVotes | Outcome | Percentage_Democratic | Population.Estimate | Median.income |
---|---|---|---|---|---|---|---|
<chr> | <dbl> | <dbl> | <dbl> | <chr> | <dbl> | <dbl> | <chr> |
Alabama | 849624 | 1441170 | 2290794 | Republican | 37.09 | 5031362 | 54,690 |
Alaska | 153778 | 189951 | 343729 | Republican | 44.74 | 732923 | 74,750 |
Arizona | 1672143 | 1661686 | 3333829 | Democratic | 50.16 | 7179943 | 67,090 |
Arkansas | 423932 | 760647 | 1184579 | Republican | 35.79 | 3014195 | 50,780 |
California | 11110639 | 6006518 | 17117157 | Democratic | 64.91 | 39501653 | 77,650 |
Colorado | 1804352 | 1364607 | 3168959 | Democratic | 56.94 | 5784865 | 83,780 |
Let’s read in the crime data and see what it looks like.
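The crime import is not shown either; a minimal sketch, assuming the two CSVs built from the Crime Data Explorer were saved under hypothetical file names and share the state and year columns shown below:
# hypothetical file names for the two CSVs built from the CDE queries
violent <- read.csv("violent_crimes_2020_2022.csv")
property <- read.csv("property_crimes_2020_2022.csv")
# join them so each state-year row carries both crime counts
crimes <- left_join(violent, property, by = c("X...State", "Year"))
cat("Shape of Data:", dim(crimes))
head(crimes)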
Shape of Data: 153 4
X...State | Year | Violent_Crimes | Property_Crimes | |
---|---|---|---|---|
<chr> | <int> | <int> | <int> | |
1 | AL | 2020 | 22322 | 105161 |
2 | AL | 2021 | 17590 | 74271 |
3 | AL | 2022 | 20759 | 88240 |
4 | AK | 2020 | 6126 | 16528 |
5 | AK | 2021 | 5573 | 13456 |
6 | AK | 2022 | 5567 | 13124 |
Again, we must rename the state column to match the rest of the data. Additionally, this column now contains state abbreviations, so we must map them to the full state names, which can be done with the following code. Once done, we can drop the original state column. Similar to before, the data is in long format and we must convert it to wide so that each row (unit of observation) is a state and there is a column for each crime type for different years. Once done, we can merge the crime data with the rest of the data.
#CRIME DATA
# rename the state column
colnames(crimes)[colnames(crimes) == "X...State"] <- "STATE"
# Create a mapping between state abbreviations and full state names
state_mapping <- data.frame(
Abbreviation = c(
"AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DC", "DE", "FL", "GA",
"HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD",
"MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ",
"NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC",
"SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY"
),
StateName = c(
"Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado",
"Connecticut", "District of Columbia", "Delaware", "Florida", "Georgia", "Hawaii", "Idaho",
"Illinois", "Indiana", "Iowa", "Kansas", "Kentucky", "Louisiana", "Maine",
"Maryland", "Massachusetts", "Michigan", "Minnesota", "Mississippi", "Missouri",
"Montana", "Nebraska", "Nevada", "New Hampshire", "New Jersey", "New Mexico",
"New York", "North Carolina", "North Dakota", "Ohio", "Oklahoma", "Oregon",
"Pennsylvania", "Rhode Island", "South Carolina", "South Dakota", "Tennessee",
"Texas", "Utah", "Vermont", "Virginia", "Washington", "West Virginia", "Wisconsin", "Wyoming"
)
)
# convert STATE column to character
crimes$STATE <- as.character(crimes$STATE)
# Merge crime_data with state_mapping to get full state names
crimes <- left_join(crimes, state_mapping, by = c("STATE" = "Abbreviation"))
# Drop the original 'STATE' column
crimes <- select(crimes, -STATE)
# change column name
colnames(crimes)[colnames(crimes) == "StateName"] <- "STATE"
crimes <- distinct(crimes, STATE, Year, .keep_all = TRUE)
# convert from long to wide
crimes_wide <- crimes %>%
pivot_wider(
names_from = Year,
values_from = c(Violent_Crimes, Property_Crimes)
)
# merge the crimes data with the merged data
df_merged <- left_join(df_merged, crimes_wide, by = "STATE")
cat("Shape of Data:", dim(df_merged))
head(df_merged)
Shape of Data: 51 14
STATE | Biden | Trump | TotalVotes | Outcome | Percentage_Democratic | Population.Estimate | Median.income | Violent_Crimes_2020 | Violent_Crimes_2021 | Violent_Crimes_2022 | Property_Crimes_2020 | Property_Crimes_2021 | Property_Crimes_2022 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<chr> | <dbl> | <dbl> | <dbl> | <chr> | <dbl> | <dbl> | <chr> | <int> | <int> | <int> | <int> | <int> | <int> |
Alabama | 849624 | 1441170 | 2290794 | Republican | 37.09 | 5031362 | 54,690 | 22322 | 17590 | 20759 | 105161 | 74271 | 88240 |
Alaska | 153778 | 189951 | 343729 | Republican | 44.74 | 732923 | 74,750 | 6126 | 5573 | 5567 | 16528 | 13456 | 13124 |
Arizona | 1672143 | 1661686 | 3333829 | Democratic | 50.16 | 7179943 | 67,090 | 35980 | 30922 | 31754 | 165323 | 153641 | 151421 |
Arkansas | 423932 | 760647 | 1184579 | Republican | 35.79 | 3014195 | 50,780 | 20363 | 21271 | 19654 | 79200 | 76580 | 74664 |
California | 11110639 | 6006518 | 17117157 | Democratic | 64.91 | 39501653 | 77,650 | 174026 | 188343 | 194935 | 842054 | 847567 | 914517 |
Colorado | 1804352 | 1364607 | 3168959 | Democratic | 56.94 | 5784865 | 83,780 | 24570 | 27916 | 28759 | 164582 | 182850 | 183816 |
Let’s read in the health data and see what it looks like. We’ll start with data on heart disease that was acquired from the CDC.
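A minimal sketch of the heart disease import, assuming a hypothetical local file name and that the downloaded CSV uses the column names shown below (the URL column described earlier is dropped):
# hypothetical file name for the CDC heart disease mortality CSV
hearts <- read.csv("cdc_heart_disease_mortality.csv")
# keep only the columns used later; the URL column is not needed
hearts <- select(hearts, YEAR, STATE, RATE, DEATHS)
cat("Shape of Data:", dim(hearts))
head(hearts)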
Shape of Data: 450 4
YEAR | STATE | RATE | DEATHS | |
---|---|---|---|---|
<int> | <chr> | <dbl> | <chr> | |
1 | 2021 | AL | 247.5 | 15173 |
2 | 2021 | AK | 154.7 | 1011 |
3 | 2021 | AZ | 158.3 | 14550 |
4 | 2021 | AR | 231.0 | 8547 |
5 | 2021 | CA | 147.8 | 65471 |
6 | 2021 | CO | 135.1 | 8081 |
Similar to before, we must alter the state column to match the rest of the data. As the data uses abbreviations instead of state names, we must once again map them to the full state names. We will also rename some of the other columns for clarity. Furthermore, we only want to keep years 2020 and onwards. As done before, we must convert the data from long to wide, with each row representing one unit of observation (a state). Let’s then merge the heart disease deaths and rates with the rest of the data. Finally, we will rename some of the columns in the merged dataset for clarity.
#HEART DISEASE DATA
# convert STATE column to character
hearts$STATE <- as.character(hearts$STATE)
# Merge crime_data with state_mapping to get full state names
hearts <- left_join(hearts, state_mapping, by = c("STATE" = "Abbreviation"))
# Drop the original 'STATE' column
hearts <- select(hearts, -STATE)
# renaming columns
hearts <- hearts %>%
rename(STATE = StateName, DeathsPer100k = RATE)
# keep only data from 2020 onwards
hearts <- hearts %>%
filter(YEAR >= 2020)
# converting hearts from long to wide
hearts_wide <- hearts %>%
pivot_wider(
names_from = YEAR,
values_from = c(DEATHS, DeathsPer100k),
names_glue = "{.value}{YEAR}"
)
# merge the heart disease mortality data with the merged data
df_merged <- left_join(df_merged, hearts_wide, by = "STATE")
# renaming columns for clarity
df_merged <- df_merged %>%
rename(BidenVotes = Biden, TrumpVotes = Trump, VoteOutcome = Outcome, DemPercentage = Percentage_Democratic, Population = Population.Estimate,
HeartDeaths2021 = DEATHS2021, HeartDeaths2020 = DEATHS2020, HeartDeathsPer100k2021 = DeathsPer100k2021, HeartDeathsPer100k2020 = DeathsPer100k2020)
cat("Shape of Data:", dim(df_merged))
head(df_merged)
Shape of Data: 51 18
STATE | BidenVotes | TrumpVotes | TotalVotes | VoteOutcome | DemPercentage | Population | Median.income | Violent_Crimes_2020 | Violent_Crimes_2021 | Violent_Crimes_2022 | Property_Crimes_2020 | Property_Crimes_2021 | Property_Crimes_2022 | HeartDeaths2021 | HeartDeaths2020 | HeartDeathsPer100k2021 | HeartDeathsPer100k2020 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<chr> | <dbl> | <dbl> | <dbl> | <chr> | <dbl> | <dbl> | <chr> | <int> | <int> | <int> | <int> | <int> | <int> | <chr> | <chr> | <dbl> | <dbl> |
Alabama | 849624 | 1441170 | 2290794 | Republican | 37.09 | 5031362 | 54,690 | 22322 | 17590 | 20759 | 105161 | 74271 | 88240 | 15173 | 14739 | 247.5 | 237.5 |
Alaska | 153778 | 189951 | 343729 | Republican | 44.74 | 732923 | 74,750 | 6126 | 5573 | 5567 | 16528 | 13456 | 13124 | 1011 | 915 | 154.7 | 139.8 |
Arizona | 1672143 | 1661686 | 3333829 | Democratic | 50.16 | 7179943 | 67,090 | 35980 | 30922 | 31754 | 165323 | 153641 | 151421 | 14550 | 14196 | 158.3 | 144.8 |
Arkansas | 423932 | 760647 | 1184579 | Republican | 35.79 | 3014195 | 50,780 | 20363 | 21271 | 19654 | 79200 | 76580 | 74664 | 8547 | 8621 | 231.0 | 222.5 |
California | 11110639 | 6006518 | 17117157 | Democratic | 64.91 | 39501653 | 77,650 | 174026 | 188343 | 194935 | 842054 | 847567 | 914517 | 65471 | 66538 | 147.8 | 144.0 |
Colorado | 1804352 | 1364607 | 3168959 | Democratic | 56.94 | 5784865 | 83,780 | 24570 | 27916 | 28759 | 164582 | 182850 | 183816 | 8081 | 8023 | 135.1 | 128.1 |
Let’s read in the COVID-19 data (acquired from The New York Times) and see what it looks like.
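One way to pull the file is directly from the repository’s raw URL (the report may instead have used a downloaded copy):
# read us-states.csv straight from the NYT covid-19-data repository
covid <- read.csv("https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv")
cat("Shape of Data:", dim(covid))
head(covid)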
Shape of Data: 61942 5
date | state | fips | cases | deaths | |
---|---|---|---|---|---|
<chr> | <chr> | <int> | <int> | <int> | |
1 | 1/21/2020 | Washington | 53 | 1 | 0 |
2 | 1/22/2020 | Washington | 53 | 1 | 0 |
3 | 1/23/2020 | Washington | 53 | 1 | 0 |
4 | 1/24/2020 | Illinois | 17 | 1 | 0 |
5 | 1/24/2020 | Washington | 53 | 1 | 0 |
6 | 1/25/2020 | California | 6 | 1 | 0 |
The above data counts the total cumulative cases and deaths for each state, so we can take the maximum value for each state to find its total cases and deaths. This is done via the code below. We can then merge this dataset with the rest of the data. It is important to use a left join as we only want to keep the states that are in the merged dataframe. Additionally, we will convert the relevant columns to numeric and then create new columns that express COVID-19 cases and deaths as percentages of the state population. Below is the newly merged data.
#COVID DATA
# getting total cases and deaths per state
covid <- covid %>%
group_by(state) %>%
summarize(
covid_cases = max(cases, na.rm = TRUE),
covid_deaths = max(deaths, na.rm = TRUE)
)
#rename the state column
covid <- covid %>%
rename(STATE = state)
# merge the covid data with the merged data
df_merged <- left_join(df_merged, covid, by = "STATE")
# Convert relevant columns to numeric
df_merged$covid_cases <- as.numeric(df_merged$covid_cases)
df_merged$covid_deaths <- as.numeric(df_merged$covid_deaths)
# Create new columns for COVID cases and deaths as percentages
df_merged$covid_cases_percentage <- round((df_merged$covid_cases / df_merged$Population) * 100, 2)
df_merged$covid_deaths_percentage <- round((df_merged$covid_deaths / df_merged$Population) * 100, 2)
cat("Shape of Data:", dim(df_merged))
head(df_merged)
Shape of Data: 51 22
STATE | BidenVotes | TrumpVotes | TotalVotes | VoteOutcome | DemPercentage | Population | Median.income | Violent_Crimes_2020 | Violent_Crimes_2021 | ... | Property_Crimes_2021 | Property_Crimes_2022 | HeartDeaths2021 | HeartDeaths2020 | HeartDeathsPer100k2021 | HeartDeathsPer100k2020 | covid_cases | covid_deaths | covid_cases_percentage | covid_deaths_percentage |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<chr> | <dbl> | <dbl> | <dbl> | <chr> | <dbl> | <dbl> | <chr> | <int> | <int> | ... | <int> | <int> | <chr> | <chr> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
Alabama | 849624 | 1441170 | 2290794 | Republican | 37.09 | 5031362 | 54,690 | 22322 | 17590 | ... | 74271 | 88240 | 15173 | 14739 | 247.5 | 237.5 | 1648385 | 21631 | 32.76 | 0.43 |
Alaska | 153778 | 189951 | 343729 | Republican | 44.74 | 732923 | 74,750 | 6126 | 5573 | ... | 13456 | 13124 | 1011 | 915 | 154.7 | 139.8 | 308893 | 1438 | 42.15 | 0.20 |
Arizona | 1672143 | 1661686 | 3333829 | Democratic | 50.16 | 7179943 | 67,090 | 35980 | 30922 | ... | 153641 | 151421 | 14550 | 14196 | 158.3 | 144.8 | 2451062 | 33190 | 34.14 | 0.46 |
Arkansas | 423932 | 760647 | 1184579 | Republican | 35.79 | 3014195 | 50,780 | 20363 | 21271 | ... | 76580 | 74664 | 8547 | 8621 | 231.0 | 222.5 | 1008303 | 13068 | 33.45 | 0.43 |
California | 11110639 | 6006518 | 17117157 | Democratic | 64.91 | 39501653 | 77,650 | 174026 | 188343 | ... | 847567 | 914517 | 65471 | 66538 | 147.8 | 144.0 | 12169158 | 104277 | 30.81 | 0.26 |
Colorado | 1804352 | 1364607 | 3168959 | Democratic | 56.94 | 5784865 | 83,780 | 24570 | 27916 | ... | 182850 | 183816 | 8081 | 8023 | 135.1 | 128.1 | 1771010 | 14245 | 30.61 | 0.25 |
Let’s read in the education data that was acquired through the U.S. Census Bureau API and take a look at it.
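The retrieval code is not shown in the report; below is a sketch of one way to request the full DP02 profile group from the ACS 1-year Data Profile endpoint. The endpoint, survey vintage, and the CENSUS_API_KEY environment variable are assumptions, not details taken from the original workflow.
library(httr)
library(jsonlite)
# request an ACS Data Profile group for all states and return it as a data frame
acs_profile <- function(group, year = 2022) {
  url <- paste0("https://api.census.gov/data/", year, "/acs/acs1/profile",
                "?get=group(", group, ")&for=state:*",
                "&key=", Sys.getenv("CENSUS_API_KEY"))
  raw <- fromJSON(content(GET(url), as = "text"))
  out <- as.data.frame(raw[-1, , drop = FALSE], stringsAsFactors = FALSE)
  colnames(out) <- raw[1, ]  # first row of the response holds the variable names
  out
}
# DP02 is the profile group containing the educational attainment counts
education <- acs_profile("DP02")
cat("Shape of Data:", dim(education))
head(education)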
Shape of Data: 51 1235
X...DP02_0001E | DP02_0001EA | DP02_0001M | DP02_0001MA | DP02_0001PE | DP02_0001PEA | DP02_0001PM | DP02_0001PMA | DP02_0002E | DP02_0002EA | ... | DP02_0154EA | DP02_0154M | DP02_0154MA | DP02_0154PE | DP02_0154PEA | DP02_0154PM | DP02_0154PMA | GEO_ID | NAME | state | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<int> | <lgl> | <int> | <lgl> | <int> | <lgl> | <int> | <chr> | <int> | <lgl> | ... | <lgl> | <int> | <lgl> | <dbl> | <lgl> | <dbl> | <lgl> | <chr> | <chr> | <int> | |
1 | 1967559 | NA | 10527 | NA | 1967559 | NA | -888888888 | (X) | 904392 | NA | ... | NA | 12886 | NA | 85.0 | NA | 0.5 | NA | 0400000US01 | Alabama | 1 |
2 | 2817723 | NA | 10850 | NA | 2817723 | NA | -888888888 | (X) | 1344242 | NA | ... | NA | 12165 | NA | 91.4 | NA | 0.3 | NA | 0400000US04 | Arizona | 4 |
3 | 1183675 | NA | 7882 | NA | 1183675 | NA | -888888888 | (X) | 565893 | NA | ... | NA | 10272 | NA | 85.5 | NA | 0.6 | NA | 0400000US05 | Arkansas | 5 |
4 | 13429063 | NA | 19170 | NA | 13429063 | NA | -888888888 | (X) | 6517082 | NA | ... | NA | 23412 | NA | 92.9 | NA | 0.1 | NA | 0400000US06 | California | 6 |
5 | 2313042 | NA | 8099 | NA | 2313042 | NA | -888888888 | (X) | 1124072 | NA | ... | NA | 10722 | NA | 93.0 | NA | 0.3 | NA | 0400000US08 | Colorado | 8 |
6 | 1428313 | NA | 5900 | NA | 1428313 | NA | -888888888 | (X) | 664848 | NA | ... | NA | 8523 | NA | 92.2 | NA | 0.4 | NA | 0400000US09 | Connecticut | 9 |
Using the code below, we can select only the specific columns that we want and rename them for clarity. Then we can merge this education data with the larger dataframe.
#EDUCATION DATA
#rename and select relevant columns
education <- education %>%
rename(STATE = NAME, HighSchoolGraduates = DP02_0062E, BachelorsDegree = DP02_0065E, AssociatesDegree = DP02_0064E, SomeCollege = DP02_0063E)
education2 <- education %>% select(STATE, HighSchoolGraduates, BachelorsDegree, AssociatesDegree, SomeCollege)
# merge the education data with the merged data
df_merged <- left_join(df_merged, education2, by = "STATE")
cat("Shape of Data:", dim(df_merged))
head(df_merged)
Shape of Data: 51 26
STATE | BidenVotes | TrumpVotes | TotalVotes | VoteOutcome | DemPercentage | Population | Median.income | Violent_Crimes_2020 | Violent_Crimes_2021 | ... | HeartDeathsPer100k2021 | HeartDeathsPer100k2020 | covid_cases | covid_deaths | covid_cases_percentage | covid_deaths_percentage | HighSchoolGraduates | BachelorsDegree | AssociatesDegree | SomeCollege |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<chr> | <dbl> | <dbl> | <dbl> | <chr> | <dbl> | <dbl> | <chr> | <int> | <int> | ... | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <int> | <int> | <int> | <int> |
Alabama | 849624 | 1441170 | 2290794 | Republican | 37.09 | 5031362 | 54,690 | 22322 | 17590 | ... | 247.5 | 237.5 | 1648385 | 21631 | 32.76 | 0.43 | 1079285 | 572276 | 303028 | 705662 |
Alaska | 153778 | 189951 | 343729 | Republican | 44.74 | 732923 | 74,750 | 6126 | 5573 | ... | 154.7 | 139.8 | 308893 | 1438 | 42.15 | 0.20 | 135473 | 101422 | 40276 | 118250 |
Arizona | 1672143 | 1661686 | 3333829 | Democratic | 50.16 | 7179943 | 67,090 | 35980 | 30922 | ... | 158.3 | 144.8 | 2451062 | 33190 | 34.14 | 0.46 | 1168057 | 985673 | 467305 | 1180491 |
Arkansas | 423932 | 760647 | 1184579 | Republican | 35.79 | 3014195 | 50,780 | 20363 | 21271 | ... | 231.0 | 222.5 | 1008303 | 13068 | 33.45 | 0.43 | 695062 | 324137 | 161542 | 435235 |
California | 11110639 | 6006518 | 17117157 | Democratic | 64.91 | 39501653 | 77,650 | 174026 | 188343 | ... | 147.8 | 144.0 | 12169158 | 104277 | 30.81 | 0.26 | 5578997 | 5958030 | 2120275 | 5287901 |
Colorado | 1804352 | 1364607 | 3168959 | Democratic | 56.94 | 5784865 | 83,780 | 24570 | 27916 | ... | 135.1 | 128.1 | 1771010 | 14245 | 30.61 | 0.25 | 814373 | 1107309 | 334157 | 793438 |
We can add in race data using the same process as above.
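Assuming the same retrieval approach sketched for the education data, the race counts can be pulled from the DP05 (demographic and housing estimates) profile group:
# reuse the assumed acs_profile() helper from the education step
race <- acs_profile("DP05")
cat("Shape of Data:", dim(race))
head(race)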
Shape of Data: 52 731
DP05_0001E | DP05_0001EA | DP05_0001M | DP05_0001MA | DP05_0001PE | DP05_0001PEA | DP05_0001PM | DP05_0001PMA | DP05_0002E | DP05_0002EA | ... | DP05_0091EA | DP05_0091M | DP05_0091MA | DP05_0091PE | DP05_0091PEA | DP05_0091PM | DP05_0091PMA | GEO_ID | NAME | state | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<int> | <lgl> | <int> | <chr> | <int> | <lgl> | <int> | <chr> | <int> | <lgl> | ... | <lgl> | <int> | <lgl> | <dbl> | <lgl> | <dbl> | <lgl> | <chr> | <chr> | <int> | |
1 | 5074296 | NA | -555555555 | ***** | 5074296 | NA | -888888888 | (X) | 2461248 | NA | ... | NA | 4729 | NA | 52.4 | NA | 0.1 | NA | 0400000US01 | Alabama | 1 |
2 | 733583 | NA | -555555555 | ***** | 733583 | NA | -888888888 | (X) | 385667 | NA | ... | NA | 2046 | NA | 46.8 | NA | 0.3 | NA | 0400000US02 | Alaska | 2 |
3 | 7359197 | NA | -555555555 | ***** | 7359197 | NA | -888888888 | (X) | 3678381 | NA | ... | NA | 8862 | NA | 50.4 | NA | 0.1 | NA | 0400000US04 | Arizona | 4 |
4 | 3045637 | NA | -555555555 | ***** | 3045637 | NA | -888888888 | (X) | 1504488 | NA | ... | NA | 3663 | NA | 51.3 | NA | 0.1 | NA | 0400000US05 | Arkansas | 5 |
5 | 39029342 | NA | -555555555 | ***** | 39029342 | NA | -888888888 | (X) | 19536425 | NA | ... | NA | 28487 | NA | 50.6 | NA | 0.1 | NA | 0400000US06 | California | 6 |
6 | 5839926 | NA | -555555555 | ***** | 5839926 | NA | -888888888 | (X) | 2960896 | NA | ... | NA | 7496 | NA | 49.4 | NA | 0.1 | NA | 0400000US08 | Colorado | 8 |
#RACE DATA
# rename and select relevant columns
race <- race %>%
rename(STATE = NAME, White = DP05_0037E, Black = DP05_0038E, Asian = DP05_0044E, Hispanic_Latino = DP05_0073E)
race2 <- race %>% select(STATE, White, Black, Asian, Hispanic_Latino)
# merge the race data with the merged data
df_merged <- left_join(df_merged, race2, by = "STATE")
cat("Shape of Data:", dim(df_merged))
head(df_merged)
Shape of Data: 51 30
STATE | BidenVotes | TrumpVotes | TotalVotes | VoteOutcome | DemPercentage | Population | Median.income | Violent_Crimes_2020 | Violent_Crimes_2021 | ... | covid_cases_percentage | covid_deaths_percentage | HighSchoolGraduates | BachelorsDegree | AssociatesDegree | SomeCollege | White | Black | Asian | Hispanic_Latino |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<chr> | <dbl> | <dbl> | <dbl> | <chr> | <dbl> | <dbl> | <chr> | <int> | <int> | ... | <dbl> | <dbl> | <int> | <int> | <int> | <int> | <int> | <int> | <int> | <int> |
Alabama | 849624 | 1441170 | 2290794 | Republican | 37.09 | 5031362 | 54,690 | 22322 | 17590 | ... | 32.76 | 0.43 | 1079285 | 572276 | 303028 | 705662 | 3302528 | 1302035 | 78893 | 246477 |
Alaska | 153778 | 189951 | 343729 | Republican | 44.74 | 732923 | 74,750 | 6126 | 5573 | ... | 42.15 | 0.20 | 135473 | 101422 | 40276 | 118250 | 437533 | 22202 | 46184 | 56491 |
Arizona | 1672143 | 1661686 | 3333829 | Democratic | 50.16 | 7179943 | 67,090 | 35980 | 30922 | ... | 34.14 | 0.46 | 1168057 | 985673 | 467305 | 1180491 | 4254015 | 340760 | 266441 | 2388520 |
Arkansas | 423932 | 760647 | 1184579 | Republican | 35.79 | 3014195 | 50,780 | 20363 | 21271 | ... | 33.45 | 0.43 | 695062 | 324137 | 161542 | 435235 | 2103849 | 437331 | 48921 | 255416 |
California | 11110639 | 6006518 | 17117157 | Democratic | 64.91 | 39501653 | 77,650 | 174026 | 188343 | ... | 30.81 | 0.26 | 5578997 | 5958030 | 2120275 | 5287901 | 15175598 | 2121422 | 6054038 | 15732184 |
Colorado | 1804352 | 1364607 | 3168959 | Democratic | 56.94 | 5784865 | 83,780 | 24570 | 27916 | ... | 30.61 | 0.25 | 814373 | 1107309 | 334157 | 793438 | 4106707 | 235519 | 190181 | 1314962 |
Normalizing data is a vital step in the data science life cycle as it ensures that all variables contribute equitably to analysis, irrespective of their initial scale. This process not only prevents biases and inaccuracies in statistical models or machine learning algorithms but also enhances clarity in understanding differences between units of observation. By placing these units on a comparable scale, normalization facilitates more meaningful comparisons and analyses.
The COVID data is already normalized. Thus, we will now normalize the crime, education, and race data per state. For the crime data, we take the average of violent crimes and property crimes for 2020-2022 and divide it by the state population. For the education and race data, we simply divide the counts by the state population. Finally, for the heart disease data, we take the average of heart deaths per 100k for 2020 and 2021. Below is an example of the newly normalized data, which is then saved to a CSV file for future use.
#normalize the crime data per state
df_merged$normalized_violent_crimes <- round(apply(df_merged[, c("Violent_Crimes_2020", "Violent_Crimes_2021", "Violent_Crimes_2022")], 1, mean, na.rm = TRUE) / df_merged$Population, 5)
df_merged$normalized_property_crimes <- round(apply(df_merged[, c("Property_Crimes_2020", "Property_Crimes_2021", "Property_Crimes_2022")], 1, mean, na.rm = TRUE) / df_merged$Population, 5)
# normalize education data per state
df_merged$normalized_high_school <- round(df_merged$HighSchoolGraduates / df_merged$Population, 5)
df_merged$normalized_bachelors <- round(df_merged$BachelorsDegree / df_merged$Population, 5)
df_merged$normalized_associates <- round(df_merged$AssociatesDegree / df_merged$Population, 5)
df_merged$normalized_some_college <- round(df_merged$SomeCollege / df_merged$Population, 5)
# finally, let's normalize the race data per state
df_merged$normalized_white <- round(df_merged$White / df_merged$Population, 5)
df_merged$normalized_black <- round(df_merged$Black / df_merged$Population, 5)
df_merged$normalized_asian <- round(df_merged$Asian / df_merged$Population, 5)
df_merged$normalized_hispanic_latino <- round(df_merged$Hispanic_Latino / df_merged$Population, 5)
#Averaged 2020 and 2021 heart death death data
df_merged$HeartDeathsPer100k <- ((df_merged$HeartDeathsPer100k2020 + df_merged$HeartDeathsPer100k2021 )/2)
cat("Shape of Data:", dim(df_merged))
head(df_merged)
#save to csv
write.csv(df_merged, file = 'cleaned.csv', row.names = FALSE)
Shape of Data: 51 41
STATE | BidenVotes | TrumpVotes | TotalVotes | VoteOutcome | DemPercentage | Population | Median.income | Violent_Crimes_2020 | Violent_Crimes_2021 | ... | normalized_property_crimes | normalized_high_school | normalized_bachelors | normalized_associates | normalized_some_college | normalized_white | normalized_black | normalized_asian | normalized_hispanic_latino | HeartDeathsPer100k |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<chr> | <dbl> | <dbl> | <dbl> | <chr> | <dbl> | <dbl> | <chr> | <int> | <int> | ... | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
Alabama | 849624 | 1441170 | 2290794 | Republican | 37.09 | 5031362 | 54,690 | 22322 | 17590 | ... | 0.01773 | 0.21451 | 0.11374 | 0.06023 | 0.14025 | 0.65639 | 0.25878 | 0.01568 | 0.04899 | 242.50 |
Alaska | 153778 | 189951 | 343729 | Republican | 44.74 | 732923 | 74,750 | 6126 | 5573 | ... | 0.01961 | 0.18484 | 0.13838 | 0.05495 | 0.16134 | 0.59697 | 0.03029 | 0.06301 | 0.07708 | 147.25 |
Arizona | 1672143 | 1661686 | 3333829 | Democratic | 50.16 | 7179943 | 67,090 | 35980 | 30922 | ... | 0.02184 | 0.16268 | 0.13728 | 0.06508 | 0.16442 | 0.59249 | 0.04746 | 0.03711 | 0.33267 | 151.55 |
Arkansas | 423932 | 760647 | 1184579 | Republican | 35.79 | 3014195 | 50,780 | 20363 | 21271 | ... | 0.02548 | 0.23060 | 0.10754 | 0.05359 | 0.14440 | 0.69798 | 0.14509 | 0.01623 | 0.08474 | 226.75 |
California | 11110639 | 6006518 | 17117157 | Democratic | 64.91 | 39501653 | 77,650 | 174026 | 188343 | ... | 0.02197 | 0.14123 | 0.15083 | 0.05368 | 0.13387 | 0.38418 | 0.05370 | 0.15326 | 0.39827 | 145.90 |
Colorado | 1804352 | 1364607 | 3168959 | Democratic | 56.94 | 5784865 | 83,780 | 24570 | 27916 | ... | 0.03061 | 0.14078 | 0.19141 | 0.05776 | 0.13716 | 0.70991 | 0.04071 | 0.03288 | 0.22731 | 131.60 |
Exploratory Data Analysis (EDA) is a fundamental starting point in data analysis, helping grasp the data’s characteristics, patterns, and possible outliers. It provides essential insights for making informed modeling decisions.
By analyzing the below data, we hope to gain an understanding of overall trends that can aid in refining our hypotheses.
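The code that produced the summary statistics and the box plot described below is not shown; a minimal sketch, assuming the cleaned data is reloaded from cleaned.csv and the median income column is converted to a number:
# reload the cleaned data and strip the thousands separators from Median.income
df <- read.csv("cleaned.csv")
df$Median.income <- as.numeric(gsub(",", "", df$Median.income))
# summary statistics for median household income
cat("BASIC STATISTICS:\n")
cat("----------------------\n")
summary(df$Median.income)
# box plot of the median income distribution
ggplot(df, aes(y = Median.income)) +
  geom_boxplot() +
  ylab("Median Income") +
  ggtitle("Distribution of Median Income Across States")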
BASIC STATISTICS:
----------------------
Min. 1st Qu. Median Mean 3rd Qu. Max.
45130 60360 67410 68660 78200 94790
This box plot shows the distribution of median income. The plot has a wide interquartile range, and the median value of 67,410 is indicated by the line inside the box.
ggplot(df, aes(x = reorder(STATE, Median.income), y = Median.income, fill = VoteOutcome)) +
geom_bar(stat = "identity") +
scale_fill_manual(values = c("Republican" = "red", "Democratic" = "blue"), name = "Political Affiliation") +
coord_flip() +
xlab("State") +
ylab("Median Income") +
ggtitle("Median Income by State (Colored by Political Affiliation)")
This is a bar chart showing median income by state, with the color indicating vote outcome. The states at the top, which have higher median incomes, tend to have voted Democratic, while states at the bottom with lower median incomes tend to have voted Republican.
ggplot(df, aes(x = VoteOutcome, y = Median.income, fill = VoteOutcome)) +
geom_boxplot() +
scale_fill_manual(values = c("Republican" = "red", "Democratic" = "blue"), name = "Political Affiliation") +
xlab("Political Affiliation") +
ylab("Median Income") +
ggtitle("Median Income by Political Affiliation") +
theme(legend.position = "none")
This is a boxplot comparing median income distributions between Democratic and Republican vote outcomes. We can see that the median income for Democratic areas is higher than for Republican areas, as indicated by the position of the median line within each box. Additionally, the interquartile range for Democratic areas seems slightly broader, suggesting more variability in income.
ggplot(df, aes(x = DemPercentage, y = Median.income, color = VoteOutcome)) +
geom_point() +
geom_smooth(method = "lm") +
scale_color_manual(values = c("Republican" = "red", "Democratic" = "blue"), name = "Political Affiliation") +
xlab("Percentage of Vote for Winning Party") +
ylab("Median Income") +
ggtitle("Median Income vs Percentage of Vote")
`geom_smooth()` using formula = 'y ~ x'
This is a scatter plot with a trend line comparing median income to the percentage of vote for the winning party, again distinguished by vote outcome. This plot suggests a positive correlation for both Democratic and Republican areas, meaning that as the percentage of vote for the winning party increases, the median income also tends to increase. The shaded areas around the trend lines are the confidence intervals, which seem to be broader for Republican areas, suggesting more variability or less certainty in the trend compared to the Democratic areas.
#show summary stats
cat("BASIC STATISTICS:\n")
cat("----------------------\n")
cat("Violet Crimes:")
summary(df$normalized_violent_crimes)
cat("\nProperty Crimes:")
summary(df$normalized_property_crimes)
ggplot(df, aes(x = VoteOutcome, y = normalized_violent_crimes, fill = VoteOutcome)) +
geom_boxplot() +
scale_fill_manual(values = c("Republican" = "red", "Democratic" = "blue")) +
labs(title = "Violent Crime Rates by Political Affiliation", x = "Political Affiliation", y = "Normalized Violent Crimes") +
theme(legend.position = "none")
BASIC STATISTICS:
----------------------
Violent Crimes:
Property Crimes:
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.001090 0.002755 0.003360 0.003835 0.004560 0.009410
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.01021 0.01517 0.01866 0.01937 0.02239 0.03801
The box plot shows that the median violent crime rate for Republican states is higher than that of Democratic states. Two Democratic states are notable outliers as well, which pull the Democratic mean higher even though it remains below that of the Republican states.
ggplot(df, aes(x = VoteOutcome, y = normalized_property_crimes, fill = VoteOutcome)) +
geom_boxplot() +
scale_fill_manual(values = c("Republican" = "red", "Democratic" = "blue")) +
labs(title = "Property Crime Rates by Political Affiliation", x = "Political Affiliation", y = "Normalized Property Crimes") +
theme(legend.position = "none")
The second box plot shows the distribution for non-violent crime rates. Again, we see that Democratic states overall have lower crime rates, with the property crimes making the difference even more obvious.
BASIC STATISTICS:
----------------------
STATE BidenVotes TrumpVotes VoteOutcome
Length:51 Min. : 73491 Min. : 18586 Length:51
Class :character 1st Qu.: 399258 1st Qu.: 473638 Class :character
Mode :character Median : 856034 Median :1020280 Mode :character
Mean : 1586213 Mean :1449562
3rd Qu.: 2375907 3rd Qu.:1791166
Max. :11110639 Max. :6006518
Population HeartDeathsPer100k2020 HeartDeathsPer100k2021
Min. : 577605 Min. :118.1 Min. :123.9
1st Qu.: 1820311 1st Qu.:146.3 1st Qu.:155.2
Median : 4507445 Median :162.4 Median :167.8
Mean : 6500226 Mean :169.0 Mean :177.1
3rd Qu.: 7451987 3rd Qu.:183.9 3rd Qu.:194.2
Max. :39501653 Max. :245.6 Max. :264.2
NA's :1 NA's :1
covid_cases_percentage covid_deaths_percentage
Min. :22.19 Min. :0.1300
1st Qu.:28.36 1st Qu.:0.2650
Median :31.60 Median :0.3500
Mean :31.11 Mean :0.3308
3rd Qu.:33.56 3rd Qu.:0.3950
Max. :42.15 Max. :0.4600
ggplot(df, aes(x = DemPercentage, y = covid_cases_percentage, color = VoteOutcome)) +
geom_point() +
#geom_smooth(method = "lm") +
scale_color_manual(values = c("Republican" = "red", "Democratic" = "blue"), name = "Political Affiliation") +
xlab("Percentage of Vote for Democratic Party") +
ylab("COVID-19 Cases (percent of population)") +
ggtitle("Covid Case Rates vs Percentage of Vote")
# covid_deaths_percentage is already expressed as a percent of the population, so no rescaling is applied here
ggplot(df, aes(x = DemPercentage, y = covid_deaths_percentage, color = VoteOutcome)) +
geom_point() +
#geom_smooth(method = "lm") +
scale_color_manual(values = c("Republican" = "red", "Democratic" = "blue"), name = "Political Affiliation") +
xlab("Percentage of Vote for Democratic Party") +
ylab("COVID-19 Deaths (percent of population)") +
ggtitle("Covid Death Rates vs Percentage of Vote")
ggplot(df, aes(x = DemPercentage, y = HeartDeathsPer100k, color = VoteOutcome)) +
geom_point() +
#geom_smooth(method = "lm") +
scale_color_manual(values = c("Republican" = "red", "Democratic" = "blue"), name = "Political Affiliation") +
xlab("Percentage of Vote for Democratic Party") +
ylab("Heart Disease Deaths (per 100k)") +
ggtitle("Heart Disease Death Rates per 100K vs Percentage of Vote",
subtitle = "Average of 2020 and 2021 Data")
Warning message:
"Removed 1 rows containing missing values (`geom_point()`)."
Examining the scatterplots, there may be a difference in the disease and death rates between Democratic and Republican states. More Republican states have higher rates of COVID-19 cases, and more Democratic states have lower rates. The heart disease deaths per 100k people for each state show a similar distribution. When looking at the COVID-19 death rate, though, the difference between Democratic and Republican states becomes less evident, if a difference exists at all.
ggplot(df, aes(x = VoteOutcome, y = covid_cases_percentage, fill = VoteOutcome)) +
geom_boxplot() +
scale_fill_manual(values = c("Republican" = "red", "Democratic" = "blue")) +
xlab("Political Affiliation") +
ylab("COVID-19 Cases (percent of population)") +
ggtitle("Covid Case Rates by Political Affiliation") +
theme(legend.position = "none")
# covid_deaths_percentage is already a percentage of the population; no rescaling needed
ggplot(df, aes(x = VoteOutcome, y = covid_deaths_percentage, fill = VoteOutcome)) +
geom_boxplot() +
scale_fill_manual(values = c("Republican" = "red", "Democratic" = "blue")) +
xlab("Political Affiliation") +
ylab("COVID-19 Deaths (percent of population)") +
ggtitle("Covid Death Rates by Political Affiliation") +
theme(legend.position = "none")
ggplot(df, aes(x = VoteOutcome, y = HeartDeathsPer100k, fill = VoteOutcome)) +
geom_boxplot() +
scale_fill_manual(values = c("Republican" = "red", "Democratic" = "blue")) +
xlab("Political Affiliation") +
ylab("Heart Disease Deaths (per 100k)") +
ggtitle("Heart Disease Death Rates per 100K by Political Affiliation",
subtitle = "Average of 2020 and 2021 Data") +
theme(legend.position = "none")
Warning message:
"Removed 1 rows containing non-finite values (`stat_boxplot()`)."
The box plots more clearly reveal that states that voted Republican have higher rates of COVID-19 cases, COVID-19 deaths, and heart disease deaths.
ggplot(df, aes(x = reorder(STATE, covid_cases_percentage), y = covid_cases_percentage, fill = VoteOutcome)) +
geom_bar(stat = "identity") +
scale_fill_manual(values = c("Republican" = "red", "Democratic" = "blue")) +
coord_flip() +
xlab("State") +
ylab("COVID-19 Cases (percent of population)") +
ggtitle("COVID-19 Cases by State (Colored by Political Affiliation)") +
theme(legend.position = "none")
# covid_deaths_percentage is already a percentage of the population; no rescaling needed
ggplot(df, aes(x = reorder(STATE, covid_deaths_percentage), y = covid_deaths_percentage, fill = VoteOutcome)) +
geom_bar(stat = "identity") +
scale_fill_manual(values = c("Republican" = "red", "Democratic" = "blue")) +
coord_flip() +
xlab("State") +
ylab("COVID-19 Deaths (percent of population)") +
ggtitle("COVID-19 Deaths by State (Colored by Political Affiliation)") +
theme(legend.position = "none")
ggplot(df, aes(x = reorder(STATE, HeartDeathsPer100k), y = HeartDeathsPer100k, fill = VoteOutcome)) +
geom_bar(stat = "identity") +
scale_fill_manual(values = c("Republican" = "red", "Democratic" = "blue")) +
coord_flip() +
xlab("State") +
ylab("Heart Disease Deaths (per 100k)") +
ggtitle("Heart Disease Deaths by State (Colored by Political Affiliation)",
subtitle = "Average of 2020 and 2021 Data") +
theme(legend.position = "none")
Warning message:
"Removed 1 rows containing missing values (`position_stack()`)."
These horizontal bar charts help us identify which states had the highest disease rates and what their political affiliation was. For COVID-19 case, COVID-19 death, and heart disease death rates, the top ten of each category consists mostly of Republican states.
df1 <- df %>%
rename(
`High School Graduate` = normalized_high_school,
`Bachelors Degree` = normalized_bachelors,
`Some College` = normalized_some_college,
`Associates Degree` = normalized_associates
)
education <- gather(df1, key = "EducationLevel", value = "EducationCount",
'High School Graduate', 'Bachelors Degree', 'Some College', 'Associates Degree')
ggplot(education, aes(x = EducationLevel, y = EducationCount, fill = VoteOutcome)) +
geom_boxplot() +
scale_fill_manual(values = c("Republican" = "red", "Democratic" = "blue"), name = "Political Affiliation" ) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
xlab("Education Level") +
ylab("Normalized Education Percentage of Individuals") +
ggtitle("Normalized Education Levels by Political Affiliation")
STATE | BidenVotes | TrumpVotes | TotalVotes | VoteOutcome | DemPercentage | Population | Median.income | Violent_Crimes_2020 | Violent_Crimes_2021 | ... | Hispanic_Latino | normalized_violent_crimes | normalized_property_crimes | normalized_white | normalized_black | normalized_asian | normalized_hispanic_latino | HeartDeathsPer100k | EducationLevel | EducationCount |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<chr> | <dbl> | <dbl> | <dbl> | <chr> | <dbl> | <dbl> | <dbl> | <int> | <int> | ... | <int> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <chr> | <dbl> |
Alabama | 849624 | 1441170 | 2290794 | Republican | 37.09 | 5031362 | 54690 | 22322 | 17590 | ... | 246477 | 0.00402 | 0.01773 | 0.65639 | 0.25878 | 0.01568 | 0.04899 | 242.50 | normalized_high_school | 0.21451 |
Alaska | 153778 | 189951 | 343729 | Republican | 44.74 | 732923 | 74750 | 6126 | 5573 | ... | 56491 | 0.00785 | 0.01961 | 0.59697 | 0.03029 | 0.06301 | 0.07708 | 147.25 | normalized_high_school | 0.18484 |
Arizona | 1672143 | 1661686 | 3333829 | Democratic | 50.16 | 7179943 | 67090 | 35980 | 30922 | ... | 2388520 | 0.00458 | 0.02184 | 0.59249 | 0.04746 | 0.03711 | 0.33267 | 151.55 | normalized_high_school | 0.16268 |
Arkansas | 423932 | 760647 | 1184579 | Republican | 35.79 | 3014195 | 50780 | 20363 | 21271 | ... | 255416 | 0.00678 | 0.02548 | 0.69798 | 0.14509 | 0.01623 | 0.08474 | 226.75 | normalized_high_school | 0.23060 |
California | 11110639 | 6006518 | 17117157 | Democratic | 64.91 | 39501653 | 77650 | 174026 | 188343 | ... | 15732184 | 0.00470 | 0.02197 | 0.38418 | 0.05370 | 0.15326 | 0.39827 | 145.90 | normalized_high_school | 0.14123 |
Colorado | 1804352 | 1364607 | 3168959 | Democratic | 56.94 | 5784865 | 83780 | 24570 | 27916 | ... | 1314962 | 0.00468 | 0.03061 | 0.70991 | 0.04071 | 0.03288 | 0.22731 | 131.60 | normalized_high_school | 0.14078 |
In this box plot, we can see the distribution of normalized counts across different education levels, separated by vote outcome. For those with an associate’s level of education, the distribution is fairly similar for both Democratic and Republican areas, as the median lines are at nearly the same level. However, the interquartile range is slightly broader for Democratic areas, indicating more variability. For those at the bachelor’s level, the median count is higher in Democratic areas than in Republican areas. The range is also broader for Democratic areas, showing a wider spread of counts. For those at the high school level, the median count again appears to be higher for Democratic areas compared to Republican areas. The interquartile ranges are quite similar, suggesting similar variability between the two vote outcomes. For those with some college, we see a higher median count for Democratic areas, with the interquartile range again being wider for Democratic areas. Overall, the plot suggests a trend where Democratic areas have a higher normalized count of individuals with higher education levels (bachelor’s and some college), while the counts are more similar for lower levels of education (high school and associate’s).
Welch’s t-test, also called the unequal variances t-test, compares the means of two populations to test the null hypothesis. Unlike the Student’s t-test, which assumes the two populations are normally distributed with equal variances, Welch’s t-test was designed for cases where the populations have different variances, though it still assumes the samples are normally distributed. It seems natural to assume that Welch’s t-test should be used only when the variances differ, but it has been argued that it should always be preferred over the Student’s t-test because it has better control of Type I and Type II errors (2006). The test statistic is the difference of the two sample means divided by the square root of the sum of their squared standard errors: \(t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{s_{\bar{X}_1}^2 + s_{\bar{X}_2}^2}}\), where \(\bar{X}_i\) is the sample mean of group \(i\) and \(s_{\bar{X}_i}\) is its standard error.
Hotelling’s T-Squared Test is used to compare the means of two groups across multiple variables simultaneously. It’s an extension of the Student’s t-test to the multivariate case, where each observation is a vector of values rather than a single scalar value. This test is useful when the variables are correlated, and it assumes that the data from both groups are drawn from multivariate normally distributed populations with equal covariance matrices. The fundamental assumptions for the test are that the samples are randomly drawn from normally distributed populations and that the samples are independent of each other. The test statistic \(T^2\) is derived from the means, variances, and covariances of the variables, and it’s often transformed into an F-statistic for determining statistical significance.
A bootstrap approach samples each observed population n times with replacement, where n is the size of that population, and uses the resampled datasets to compute a mean. The bootstrapped mean for population two is then subtracted from (or, as in our tests below, divided into) that of population one. This process of resampling and comparing means is repeated many times, creating a distribution of differences (or ratios) of means. Next, the quantiles of this distribution are examined at a predetermined confidence level. Common confidence levels are 90 and 95 percent; with a confidence level of 90 percent, we would look at the 5 and 95 percent quantiles. Bootstrap approaches are useful when one or both of the observed samples are small.
\(H_0:\) There is no difference in median household income between states that voted for Biden and states that voted for Trump.
\(H_A:\) There is a difference in median household income between states that voted for Biden and states that voted for Trump.
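The call that produced the output below is not shown in the report; a sketch consistent with the reported output, using base R’s t.test(), which applies the Welch correction by default (var.equal = FALSE):
# subset the cleaned data by vote outcome
democratic <- df[df$VoteOutcome == "Democratic", ]
republican <- df[df$VoteOutcome == "Republican", ]
# Welch two-sample t-test on median household income
t.test(democratic$Median.income, republican$Median.income)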
Welch Two Sample t-test
data: democratic$Median.income and republican$Median.income
t = 4.7036, df = 48.366, p-value = 2.167e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
7463.532 18604.499
sample estimates:
mean of x mean of y
75049.62 62015.60
We used the Welch Two Sample t-test to compare the median incomes between Democratic and Republican states. The t-statistic of 4.7036 with 48.366 degrees of freedom, and a very low p-value of 2.167e-05, indicates strong evidence to reject the null hypothesis. This suggests that there is a significant difference in median incomes between the two political affiliations. The 95 percent confidence interval (7463.532 to 18604.499) for the true difference in means similarly reveals that Democratic states tend to have higher median incomes compared to Republican states.
\(H_0:\) There is no difference in the mean normalized rates of violent crimes between Democratic and Republican states.
\(H_A:\) There is a difference in the mean normalized rates of violent crimes between Democratic and Republican states.
Welch Two Sample t-test
data: democratic$normalized_violent_crimes and republican$normalized_violent_crimes
t = -1.0662, df = 47.928, p-value = 0.2917
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.0014633197 0.0004491966
sample estimates:
mean of x mean of y
0.003586538 0.004093600
We ran another Welch Two-Sample t-test to compare the normalized rates of violent crimes between Democratic and Republican states. The results gave us a t-statistic of -1.0662 with 47.928 degrees of freedom and a p-value of 0.2917. This suggests that there is not enough evidence to reject the null hypothesis, indicating no significant difference in the mean normalized rates of violent crimes between the two political affiliations.
\(H_0:\) There is no difference in the mean normalized rates of property crimes between Democratic and Republican states.
\(H_A:\) There is a difference in the mean normalized rates of property crimes between Democratic and Republican states.
Welch Two Sample t-test
data: democratic$normalized_property_crimes and republican$normalized_property_crimes
t = -0.51149, df = 40.431, p-value = 0.6118
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.004177888 0.002489857
sample estimates:
mean of x mean of y
0.01896038 0.01980440
In this Welch Two-Sample t-test, we are comparing the normalized rates of property crimes between Democratic and Republican states. This gave us a t-statistic of -0.51149 with 40.431 degrees of freedom, and a p-value of 0.6118. This suggests that there is insufficient evidence to reject the null hypothesis, indicating no significant difference in the mean normalized rates of property crimes between Democratic and Republican states.
\(H_0:\) There is no difference in the multivariate means of health-related variables between Democratic and Republican states.
\(H_A:\) There is a difference in the multivariate means of health-related variables between Democratic and Republican states.
# Load required library
library(Hotelling)
# Subset the health-related variables for Democratic and Republican states
democratic_data2 <- df[df$VoteOutcome == "Democratic", c("covid_cases_percentage", "covid_deaths_percentage", "HeartDeathsPer100k")]
republican_data2 <- df[df$VoteOutcome == "Republican", c("covid_cases_percentage", "covid_deaths_percentage", "HeartDeathsPer100k")]
# Conduct Hotelling's T-squared test on the health-related variables
hotelling_result2 <- hotelling.test(democratic_data2, republican_data2)
# Inspect the results
print(hotelling_result2)
Hotelling's T-squared test assesses differences in the multivariate means of the health-related variables, covid_cases_percentage, covid_deaths_percentage, and HeartDeathsPer100k, between Democratic and Republican states. A small p-value from this test provides strong evidence to reject the null hypothesis of equal multivariate means, which would indicate disparities in these health outcomes across states based on their political affiliations. The bootstrap analyses below examine the two COVID-19 variables individually.
\(H_0:\) The ratio of mean COVID-19 case percentages between Democratic and Republican states is equal to 1 (no difference).
\(H_A:\) The ratio of mean COVID-19 case percentages between Democratic and Republican states is not equal to 1 (there is a difference).
# COVID-19 case percentages for each group of states
covidc_dem <- subset(df, select = covid_cases_percentage,
                     subset = VoteOutcome == "Democratic", drop = TRUE)
covidc_rep <- subset(df, select = covid_cases_percentage,
                     subset = VoteOutcome == "Republican", drop = TRUE)
# Observed ratio of means (Republican / Democratic)
actual_mean_ratio <- mean(covidc_rep) / mean(covidc_dem)
# Bootstrap the ratio of means
N <- 10000
ratio.mean <- numeric(N)
for (i in 1:N)
{
  dem <- sample(covidc_dem, length(covidc_dem), replace = TRUE)
  rep <- sample(covidc_rep, length(covidc_rep), replace = TRUE)
  ratio.mean[i] <- mean(rep) / mean(dem)
}
# 95 percent bootstrap confidence interval for the ratio of means
ci.1 <- quantile(ratio.mean, c(0.025, 0.975))
ci.1
# Bootstrap distribution with the observed ratio (red, dashed) and CI bounds (blue)
hist(ratio.mean, main = "Ratio of Means (Rep/Dem)", xlab = "Ratio of Means", breaks = 30)
abline(v = actual_mean_ratio, col = "red", lty = 2)
abline(v = ci.1, col = 4, lwd = 2)
As the 95% confidence interval for the ratio of means does not include 1, we reject the null hypothesis that the ratio of means is 1 and conclude that there is a difference in COVID-19 case percentages between Democratic and Republican states.
\(H_0:\) The ratio of mean COVID-19 death percentages between Democratic and Republican states is equal to 1 (no difference).
\(H_A:\) The ratio of mean COVID-19 death percentages between Democratic and Republican states is not equal to 1 (there is a difference).
# COVID-19 death percentages for each group of states
covidd_dem <- subset(df, select = covid_deaths_percentage,
                     subset = VoteOutcome == "Democratic", drop = TRUE)
covidd_rep <- subset(df, select = covid_deaths_percentage,
                     subset = VoteOutcome == "Republican", drop = TRUE)
# Observed ratio of means (Republican / Democratic)
actual_mean_ratio <- mean(covidd_rep) / mean(covidd_dem)
# Bootstrap the ratio of means
N <- 10000
ratio.mean <- numeric(N)
for (i in 1:N)
{
  dem <- sample(covidd_dem, length(covidd_dem), replace = TRUE)
  rep <- sample(covidd_rep, length(covidd_rep), replace = TRUE)
  ratio.mean[i] <- mean(rep) / mean(dem)
}
# 95 percent bootstrap confidence interval for the ratio of means
ci.1 <- quantile(ratio.mean, c(0.025, 0.975))
ci.1
# Bootstrap distribution with the observed ratio (red, dashed) and CI bounds (blue)
hist(ratio.mean, main = "Ratio of Means (Rep/Dem)", xlab = "Ratio of Means", breaks = 30)
abline(v = actual_mean_ratio, col = "red", lty = 2)
abline(v = ci.1, col = 4, lwd = 2)
As the 95% confidence interval for the ratio of means does not include 1, we reject the null hypothesis that the ratio of means is 1 and conclude that there is a difference in COVID-19 death percentages between Democratic and Republican states.
\(H_0:\) There is no difference in the multivariate means of education levels between Democratic and Republican states.
\(H_A:\) There is a difference in the multivariate means of education levels between Democratic and Republican states.
# Load required libraries
library(Hotelling)
# Subset data for Democratic and Republican states
democratic_data <- df[df$VoteOutcome == "Democratic", c("normalized_high_school", "normalized_bachelors", "normalized_associates", "normalized_some_college")]
republican_data <- df[df$VoteOutcome == "Republican", c("normalized_high_school", "normalized_bachelors", "normalized_associates", "normalized_some_college")]
# Conduct Hotelling's T-squared test
hotelling_result <- hotelling.test(democratic_data, republican_data)
# Inspect the results
print(hotelling_result)
Test stat: 27.92
Numerator df: 4
Denominator df: 46
P-value: 0.0002927
We used Hotelling's T-squared test here to assess differences in the multivariate means of the normalized education levels between Democratic and Republican states. The test yielded a test statistic of 27.92 with 4 numerator and 46 denominator degrees of freedom, and a very low p-value of 0.0002927, providing strong evidence to reject the null hypothesis. This supports the conclusion that there is a significant difference in the normalized education levels between the two political affiliations.
In conclusion, the findings from our hypothesis testing shed light on the relationship between political leanings and various aspects of citizen well-being in the United States. We analyzed the impact of political affiliations on income, crime rates, health outcomes, and education levels. The Welch Two-Sample t-test for median household income revealed a significant difference, indicating that Democratic states tend to have higher median incomes than Republican states. In contrast, our tests on normalized rates of violent and property crimes did not yield sufficient evidence to reject the null hypothesis, suggesting no significant differences between Democratic and Republican states in these respects.
Moving to health-related variables, Hotelling's T-squared test was used to compare the multivariate means of COVID-19-related variables between Democratic and Republican states, and our bootstrap analyses of COVID-19 case and death percentages indicated significant disparities between the states based on political affiliation. This underscores the impact of political affiliations on health outcomes, particularly during the COVID-19 pandemic.
As for education, the Hotelling’s T-squared test on education levels indicated a substantial distinction in the multivariate means between Democratic and Republican states, highlighting a meaningful difference in the normalized education levels across the United States.
Our analysis underscores the significant influence of political leanings on various facets of citizen well-being. These findings not only contribute to the understanding of the interplay between politics and everyday life but also may be helpful for policymakers and individuals making decisions about where to reside.