Political Colors and Quality of Life: Analyzing How State Political Leanings Influence Citizen Well-Being

Authors

Billy McGloin

Eric Dapkus

Nolan Penoyer

Caitlyn Dovel

Published

November 29, 2023

Abstract

The interplay between political leanings and quality of life in the United States forms a complex tapestry. This study aims to dissect this interplay, focusing on how a state’s political orientation impacts various facets of citizen well-being. This project not only seeks to highlight the correlations between political inclinations and quality-of-life indicators but also strives to understand the causal relationships underlying them. The primary research question guiding this study is: How do state political leanings influence the quality of life of their residents?

Introduction

Choosing where to live is one of the most important decisions in a person’s life (2010). Our study aims to inform that decision by ascertaining whether the political leanings of states, specifically Democratic or Republican affiliations, exert influence on the quality of life experienced by their citizens. We approach this investigation through a comprehensive examination of key indicators such as income, crime rates, health outcomes, and education.

Income levels often serve as a foundational determinant of an individual’s standard of living. A state’s political leanings influence its economic policies, tax structures, and resource allocation, subsequently impacting income distribution among its residents. While relative income may not improve the quality of someone’s everyday experience, it has been shown to affect how we evaluate our lives (2010).

Likewise, crime rates, another critical aspect of our investigation, are closely linked to personal safety and community well-being. The political landscape may shape law enforcement policies, social programs, community engagement, and program funding, all of which can have direct implications for public safety and the overall quality of life. Past studies have shown that specific crimes can have a large and lasting impact on our overall well-being (2008).

Health outcomes constitute an additional pillar of our study, one that directly affects quality of life (2000). Political orientations may influence healthcare policies, accessibility of medical services, and public health initiatives, thereby contributing to variations in health outcomes among state residents.

The educational landscape, shaped by political decisions on funding, curriculum, and access, can significantly impact aspects of our lives. Access to education can lead to social benefits, increase analytical skills, improve intellectual capital, and change the constraints and opportunities of our lives (2011).

In synthesizing these indicators, our study strives to bridge the gap between political leanings and the lived experiences of residents. By focusing on these concrete aspects rather than delving into technical details, we aim to provide a practical foundation for understanding the potential impact of political affiliations on the quality of life in different states. As we proceed to analyze and interpret the data, our findings hold the promise of offering actionable insights for individuals making residential decisions and policymakers shaping the future of their communities.

Methods and Materials

Data Collection

National Voting Data

National voting data was collected from the Federal Election Commission (FEC), which hosts election and voting information for every federal election since 1982. For our analysis we used the 2020 data, which is available in Portable Document Format (PDF) and Excel (.xlsx) formats. We used the .xlsx version and converted it to a Comma-Separated Values (CSV) file. The data we used for our project can be found on tab 9, 2020 Pres General Results, in the Excel file (FEC). This data is used to determine each state’s political leaning. (2020)

FEC | Election and voting information

Population Data

Population data was taken from the United States Census Bureau. To match the voting data, we used the State Population Totals and Components of Change: 2020-2022. The data is provided in a CSV file that is easy to import into the data cleaning workflow. This data was used to normalize the other data we gathered. (2022)

State Population Totals: 2020-2022 (census.gov)
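As a small illustration of that normalization step, dividing a raw state-level count by the state’s population yields a per-capita share that is comparable across states. The figures below are Alabama’s values as they appear in the cleaned data shown later in this report:

```r
# Per-capita normalization sketch: raw count / state population.
# Figures are Alabama's 2020 population estimate and bachelor's-degree
# holder count, as they appear in the cleaned data later in this report.
population <- 5031362
bachelors  <- 572276

per_capita_bachelors <- round(bachelors / population, 5)
per_capita_bachelors
```

The result (0.11374) matches the `normalized_bachelors` value for Alabama in the final cleaned dataset.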

Income Data

Income data was also obtained from the U.S. Census Bureau, which collects and maintains numerous sets of economic and income data for the United States. We used Table H-8, median household income by state, as that information was the cleanest for our use case. Table H-8 is an Excel document with several header rows that contain no data. (2022)

Historical Income Tables: Households (census.gov)
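Those non-data header rows can be skipped at import time. A minimal sketch, using a small mock file in place of the real Table H-8 download (the row contents below are illustrative):

```r
# Mock a Table H-8 style file: title and note rows precede the real header.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "Table H-8. Median Household Income by State",  # title row, no data
  "(In 2022 dollars)",                            # note row, no data
  "State,Median income",                          # actual header
  "Alabama,54690",
  "Alaska,74750"
), tmp)

# skip = 2 drops the two non-data rows so the third line is read as the header
income <- read.csv(tmp, skip = 2)
income
```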

Crime Data

We scraped crime data from the FBI’s Crime Data Explorer (CDE). The CDE permits querying the number of crimes by state, year, and type of crime (property crime, larceny, etc.). For our analysis, we retrieved crime data from 2020-2022 for each state, constructing two separate CSV files: one for violent crimes and one for non-violent crimes. This data, coupled with the state population data, reveals the crime rate for each state in a given year. (2022)

FBI Crime Data API (cde.ucr.cjis.gov)
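Once coupled with population, the raw counts can be expressed as rates. A minimal sketch of the per-100k calculation, using Alabama’s 2020 violent-crime count and population estimate as they appear later in this report:

```r
# Crime rate per 100,000 residents = count / population * 100,000
violent_crimes_2020 <- 22322    # Alabama violent crimes, 2020
population          <- 5031362  # Alabama 2020 population estimate

violent_rate_per_100k <- round(violent_crimes_2020 / population * 100000, 1)
violent_rate_per_100k
```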

Health Data

Data on heart disease deaths was collected from the Centers for Disease Control and Prevention (CDC). The CSV obtained from their site lists the year, state, number of deaths from heart disease, rate of heart disease deaths per 100k people per state, and a URL providing more information for each state. The number of deaths from heart disease was one of the features we used to determine the health of a state. Since the data was reported as deaths per state, it was normalized by state population during cleaning. Heart disease was chosen because it is a leading cause of death in America and a good indicator of how people take care of themselves. (2022)

Stats of the States - Heart Disease Mortality (cdc.gov)

COVID-19 data was gathered from the New York Times GitHub repository, which hosts multiple datasets of COVID-19 data. We used the us-states.csv file, which contains date (the single day the information is for), state, fips, cases, and deaths. The repository contains a README.md describing their methods for collecting and organizing the data. The file is a CSV and was easy to import into the workflow during data cleaning. Both COVID-19 cases and deaths were used as features for determining the health of a state; these variables show how a state reacts to rising health risks. (2023)

NYTimes/covid-19-data: A repository of data on coronavirus cases and deaths in the U.S. (github.com)

Education Data

Education data was acquired through the United States Census Bureau API. The data comes from the American Community Survey (ACS), an ongoing survey that provides data every year about social, economic, demographic, and housing characteristics of the U.S. population. We focused on the number of people with a bachelor’s degree, associate’s degree, GED or equivalent, and some college for each state from 2021-2022. (2022)

State Education Totals: 2021-2022 (census.gov)
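Requests to the Census API take the form of a URL naming the dataset, the variables wanted, and a geography. The sketch below builds such a URL for the ACS data-profile educational-attainment variables renamed later in this report; the 2021 `acs/acs1/profile` dataset path is an assumption and should be checked against the Census API documentation:

```r
# Sketch of a Census API request URL for ACS data-profile variables.
# DP02_0062E-DP02_0065E are the educational-attainment counts renamed
# later in this report; the 2021 acs/acs1/profile path is an assumption.
variables <- c("NAME", "DP02_0062E", "DP02_0063E", "DP02_0064E", "DP02_0065E")

url <- paste0(
  "https://api.census.gov/data/2021/acs/acs1/profile",
  "?get=", paste(variables, collapse = ","),
  "&for=state:*"
)
url
# The JSON response could then be fetched with, e.g., jsonlite::fromJSON(url)
```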

Race Data

We retrieved data on the number of White, Black, Asian, and Hispanic people per state from the Census API and ACS, similar to the education data. Again, we only used the data from 2021-2022. (2022)

State Race Totals: 2021-2022 (census.gov)

Data Cleaning

This section of the report will act as a journal of our data cleaning process. We will show the raw data, the code used to clean it, and then the modified data.

National Voting Data

After loading the relevant libraries and reading in the data, let’s take a look at what the national voting data looks like.

Code
#load in relevant libraries
library(tidyverse)
Code
# read in the national voting data
df <- read.csv('voting.csv')

#what does the national voting data look like?
cat("Shape of Data:", dim(df))
head(df)
Shape of Data: 681 13
A data.frame: 6 x 13
X...1 FEC.ID STATE STATE.ABBREVIATION GENERAL.ELECTION.DATE FIRST.NAME LAST.NAME LAST.NAME...FIRST TOTAL.VOTES PARTY GENERAL.RESULTS GENERAL.. TOTAL.VOTES..
<int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 2
2 3 P80001571 Alabama AL 11/3/20 Donald J. Trump Trump, Donald J. R 1,441,170 62.03%
3 4 P80000722 Alabama AL 11/3/20 Joseph R. Biden Biden, Joseph R. D 849,624 36.57%
4 5 P00013524 Alabama AL 11/3/20 Jo Jorgensen Jorgensen, Jo IND 25,176 1.08%
5 6 n/a Alabama AL 11/3/20 Scattered Scattered W 7,312 0.31%
6 7 n/a Alabama AL 11/3/20 Total State Votes: 2,323,282

We can see that the above data has multiple rows per state, with rows for each candidate and total state votes. Let’s clean this data. We can start by only keeping votes for the two major parties and then create two new columns (one for Candidate and another for Total Votes for that candidate). The data then needs to be converted from long to wide so that each row/observation represents a state (including DC). Finally, let’s create some new columns that will be useful for our analysis: Total Votes Overall, Vote Outcome, and Percentage Democratic.

Code
#VOTING DATA
# remove all votes not for trump or biden
df <- df[df$PARTY %in% c("D", "R"), ]

# create columns for biden and trump votes for each state
df <- df %>%
  mutate(VoteTotal = as.numeric(gsub(",", "", GENERAL.RESULTS)),
         Candidate = ifelse(PARTY == "R", "Trump", ifelse(PARTY == "D", "Biden", NA))) %>%
  group_by(STATE, Candidate) %>%
  summarize(TotalVotes = sum(VoteTotal))
  
# convert from long to wide
df_wide <- df %>%
  spread(key = Candidate, value = TotalVotes, fill = 0)

# create columns for total votes, outcome, and percentage democratic
df_result <- df_wide %>%
  mutate(TotalVotes = Biden + Trump,
         Outcome = ifelse(Biden > Trump, "Democratic", "Republican"),
         Percentage_Democratic = round((Biden / TotalVotes) * 100, 2))

cat("Shape of Data:", dim(df_result))
head(df_result)
`summarise()` has grouped output by 'STATE'. You can override using the
`.groups` argument.
Shape of Data: 51 6
A grouped_df: 6 x 6
STATE Biden Trump TotalVotes Outcome Percentage_Democratic
<chr> <dbl> <dbl> <dbl> <chr> <dbl>
Alabama 849624 1441170 2290794 Republican 37.09
Alaska 153778 189951 343729 Republican 44.74
Arizona 1672143 1661686 3333829 Democratic 50.16
Arkansas 423932 760647 1184579 Republican 35.79
California 11110639 6006518 17117157 Democratic 64.91
Colorado 1804352 1364607 3168959 Democratic 56.94

Population Data

Now we can read in the population data that was acquired from the United States Census Bureau and see what it looks like.

Code
# read in the population data
pop <- read.csv('population.csv')

# what does the data look like?
cat("Shape of Data:", dim(pop))
head(pop)
Shape of Data: 51 2
A data.frame: 6 x 2
X...State Population.Estimate
<chr> <chr>
1 .Alabama 5,031,362
2 .Alaska 732,923
3 .Arizona 7,179,943
4 .Arkansas 3,014,195
5 .California 39,501,653
6 .Colorado 5,784,865

Let’s remove the period before the state name, convert the population column to numeric, and sum all populations to verify the accuracy of the data. Then let’s merge the clean data together by STATE.

Code
#POPULATION DATA
# Remove the dot before the state name
pop <- pop %>%
  mutate(STATE = gsub("\\.", "", X...State))

# Convert Population.Estimate to numeric after removing commas
pop$Population.Estimate <- as.numeric(gsub(",", "", pop$Population.Estimate))

pop <- select(pop, -X...State)

sum(pop$Population.Estimate)

# Merge the two datasets
df_merged <- left_join(df_result, pop, by = "STATE")

cat("Shape of Data:", dim(df_merged))
head(df_merged)
331511512
Shape of Data: 51 7
A grouped_df: 6 x 7
STATE Biden Trump TotalVotes Outcome Percentage_Democratic Population.Estimate
<chr> <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
Alabama 849624 1441170 2290794 Republican 37.09 5031362
Alaska 153778 189951 343729 Republican 44.74 732923
Arizona 1672143 1661686 3333829 Democratic 50.16 7179943
Arkansas 423932 760647 1184579 Republican 35.79 3014195
California 11110639 6006518 17117157 Democratic 64.91 39501653
Colorado 1804352 1364607 3168959 Democratic 56.94 5784865

Income Data

Let’s read in the income data that was acquired from the U.S. Census Bureau and see what it looks like.

Code
# read in the income data
income <- read.csv('median_income.csv')

# what does the data look like?
cat("Shape of Data:", dim(income))
head(income)
Shape of Data: 52 2
A data.frame: 6 x 2
X...State Median.income
<chr> <chr>
1 United States 68,010
2 Alabama 54,690
3 Alaska 74,750
4 Arizona 67,090
5 Arkansas 50,780
6 California 77,650

We need to remove the United States row and rename the state column so that it matches the merged data. Below is the newly merged data.

Code
#INCOME DATA
# remove united states row and rename the state column
income <- income %>%
  filter(X...State != "United States") %>%
  rename(STATE = X...State)

# merge the income data with the merged data
df_merged <- left_join(df_merged, income, by = "STATE")

cat("Shape of Data:", dim(df_merged))
head(df_merged)
Shape of Data: 51 8
A grouped_df: 6 x 8
STATE Biden Trump TotalVotes Outcome Percentage_Democratic Population.Estimate Median.income
<chr> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <chr>
Alabama 849624 1441170 2290794 Republican 37.09 5031362 54,690
Alaska 153778 189951 343729 Republican 44.74 732923 74,750
Arizona 1672143 1661686 3333829 Democratic 50.16 7179943 67,090
Arkansas 423932 760647 1184579 Republican 35.79 3014195 50,780
California 11110639 6006518 17117157 Democratic 64.91 39501653 77,650
Colorado 1804352 1364607 3168959 Democratic 56.94 5784865 83,780

Crime Data

Let’s read in the crime data and see what it looks like.

Code
crimes <- read.csv('crimes.csv')

cat("Shape of Data", dim(crimes))
head(crimes)
Shape of Data 153 4
A data.frame: 6 x 4
X...State Year Violent_Crimes Property_Crimes
<chr> <int> <int> <int>
1 AL 2020 22322 105161
2 AL 2021 17590 74271
3 AL 2022 20759 88240
4 AK 2020 6126 16528
5 AK 2021 5573 13456
6 AK 2022 5567 13124

Again, we must rename the state column to match the rest of the data. Additionally, the state column now contains abbreviations, so we must map them to full state names, which can be done with the following code. Once done, we can drop the original state column. Similar to before, the data is in long format and we must convert it to wide so that each row (unit of observation) is a state, with a column for each crime type in each year. Once done, we can merge the crime data with the rest of the data.

Code
#CRIME DATA
# rename the state column 
colnames(crimes)[colnames(crimes) == "X...State"] <- "STATE"

# Create a mapping between state abbreviations and full state names
state_mapping <- data.frame(
  Abbreviation = c(
    "AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DC", "DE", "FL", "GA",
    "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD",
    "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ",
    "NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC",
    "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY"
  ),
  StateName = c(
    "Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado",
    "Connecticut", "District of Columbia", "Delaware", "Florida", "Georgia", "Hawaii", "Idaho",
    "Illinois", "Indiana", "Iowa", "Kansas", "Kentucky", "Louisiana", "Maine",
    "Maryland", "Massachusetts", "Michigan", "Minnesota", "Mississippi", "Missouri",
    "Montana", "Nebraska", "Nevada", "New Hampshire", "New Jersey", "New Mexico",
    "New York", "North Carolina", "North Dakota", "Ohio", "Oklahoma", "Oregon",
    "Pennsylvania", "Rhode Island", "South Carolina", "South Dakota", "Tennessee",
    "Texas", "Utah", "Vermont", "Virginia", "Washington", "West Virginia", "Wisconsin", "Wyoming"
  )
)

# convert STATE column to character
crimes$STATE <- as.character(crimes$STATE)

# Merge crime_data with state_mapping to get full state names
crimes <- left_join(crimes, state_mapping, by = c("STATE" = "Abbreviation"))

# Drop the original 'STATE' column
crimes <- select(crimes, -STATE)

# change column name
colnames(crimes)[colnames(crimes) == "StateName"] <- "STATE"

crimes <- distinct(crimes, STATE, Year, .keep_all = TRUE)

# convert from long to wide
crimes_wide <- crimes %>%
  pivot_wider(
    names_from = Year,
    values_from = c(Violent_Crimes, Property_Crimes)
  )

# merge the crimes data with the merged data
df_merged <- left_join(df_merged, crimes_wide, by = "STATE")

cat("Shape of Data:", dim(df_merged))
head(df_merged)
Shape of Data: 51 14
A grouped_df: 6 x 14
STATE Biden Trump TotalVotes Outcome Percentage_Democratic Population.Estimate Median.income Violent_Crimes_2020 Violent_Crimes_2021 Violent_Crimes_2022 Property_Crimes_2020 Property_Crimes_2021 Property_Crimes_2022
<chr> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <chr> <int> <int> <int> <int> <int> <int>
Alabama 849624 1441170 2290794 Republican 37.09 5031362 54,690 22322 17590 20759 105161 74271 88240
Alaska 153778 189951 343729 Republican 44.74 732923 74,750 6126 5573 5567 16528 13456 13124
Arizona 1672143 1661686 3333829 Democratic 50.16 7179943 67,090 35980 30922 31754 165323 153641 151421
Arkansas 423932 760647 1184579 Republican 35.79 3014195 50,780 20363 21271 19654 79200 76580 74664
California 11110639 6006518 17117157 Democratic 64.91 39501653 77,650 174026 188343 194935 842054 847567 914517
Colorado 1804352 1364607 3168959 Democratic 56.94 5784865 83,780 24570 27916 28759 164582 182850 183816

Health Data

Let’s read in the health data and see what it looks like. We’ll start with data on heart disease that was acquired from the CDC.

Code
hearts <- read.csv('heart_disease_deaths.csv')

cat("Shape of Data:", dim(hearts))
head(hearts)
Shape of Data: 450 4
A data.frame: 6 x 4
YEAR STATE RATE DEATHS
<int> <chr> <dbl> <chr>
1 2021 AL 247.5 15173
2 2021 AK 154.7 1011
3 2021 AZ 158.3 14550
4 2021 AR 231.0 8547
5 2021 CA 147.8 65471
6 2021 CO 135.1 8081

Similar to before, we must alter the state column to match the rest of the data. As the data uses abbreviations instead of state names, we must once again map them to full state names. We will also rename some of the other columns for clarity. Furthermore, we only want to keep years 2020 and onward. As done before, we must convert the data from long to wide, with each row representing one unit of observation (a state). Let’s then merge the heart disease deaths and rates with the rest of the data. Finally, we will rename some of the columns in the merged dataset for clarity.

Code
#HEART DISEASE DATA
# convert STATE column to character
hearts$STATE <- as.character(hearts$STATE)

# Merge hearts with state_mapping to get full state names
hearts <- left_join(hearts, state_mapping, by = c("STATE" = "Abbreviation"))

# Drop the original 'STATE' column
hearts <- select(hearts, -STATE)

# renaming columns
hearts <- hearts %>%
  rename(STATE = StateName, DeathsPer100k = RATE)

# keep only years 2020 and onward
hearts <- hearts %>%
  filter(YEAR >= 2020)
  
# converting hearts from long to wide
hearts_wide <- hearts %>%
  pivot_wider(
    names_from = YEAR,
    values_from = c(DEATHS, DeathsPer100k),
    names_glue = "{.value}{YEAR}"
  )

# merge the heart disease mortality data with the merged data
df_merged <- left_join(df_merged, hearts_wide, by = "STATE")

# renaming columns for clarity
df_merged <- df_merged %>%
  rename(BidenVotes = Biden, TrumpVotes = Trump, VoteOutcome = Outcome, DemPercentage = Percentage_Democratic, Population = Population.Estimate, 
  HeartDeaths2021 = DEATHS2021, HeartDeaths2020 = DEATHS2020, HeartDeathsPer100k2021 = DeathsPer100k2021, HeartDeathsPer100k2020 = DeathsPer100k2020)

cat("Shape of Data:", dim(df_merged))
head(df_merged)
Shape of Data: 51 18
A grouped_df: 6 x 18
STATE BidenVotes TrumpVotes TotalVotes VoteOutcome DemPercentage Population Median.income Violent_Crimes_2020 Violent_Crimes_2021 Violent_Crimes_2022 Property_Crimes_2020 Property_Crimes_2021 Property_Crimes_2022 HeartDeaths2021 HeartDeaths2020 HeartDeathsPer100k2021 HeartDeathsPer100k2020
<chr> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <chr> <int> <int> <int> <int> <int> <int> <chr> <chr> <dbl> <dbl>
Alabama 849624 1441170 2290794 Republican 37.09 5031362 54,690 22322 17590 20759 105161 74271 88240 15173 14739 247.5 237.5
Alaska 153778 189951 343729 Republican 44.74 732923 74,750 6126 5573 5567 16528 13456 13124 1011 915 154.7 139.8
Arizona 1672143 1661686 3333829 Democratic 50.16 7179943 67,090 35980 30922 31754 165323 153641 151421 14550 14196 158.3 144.8
Arkansas 423932 760647 1184579 Republican 35.79 3014195 50,780 20363 21271 19654 79200 76580 74664 8547 8621 231.0 222.5
California 11110639 6006518 17117157 Democratic 64.91 39501653 77,650 174026 188343 194935 842054 847567 914517 65471 66538 147.8 144.0
Colorado 1804352 1364607 3168959 Democratic 56.94 5784865 83,780 24570 27916 28759 164582 182850 183816 8081 8023 135.1 128.1

Let’s read in the COVID-19 data (acquired from The New York Times) and see what it looks like.

Code
covid <- read.csv('covid.csv')

cat("Shape of Data:", dim(covid))
head(covid)
Shape of Data: 61942 5
A data.frame: 6 x 5
date state fips cases deaths
<chr> <chr> <int> <int> <int>
1 1/21/2020 Washington 53 1 0
2 1/22/2020 Washington 53 1 0
3 1/23/2020 Washington 53 1 0
4 1/24/2020 Illinois 17 1 0
5 1/24/2020 Washington 53 1 0
6 1/25/2020 California 6 1 0

The above data counts the cumulative number of cases and deaths for each state by day. Thus, we can take the maximum value for each state to find the total cases and deaths per state. This is done via the code below. We can then merge this dataset with the rest of the data. It is important to use a left join, as we only want to keep the states that are in the merged dataframe. Additionally, we will convert the relevant columns to numeric and then create new columns that calculate COVID cases and deaths as percentages of the state population. Below is the newly merged data.

Code
#COVID DATA
# getting total cases and deaths per state
covid <- covid %>%
  group_by(state) %>%
  summarize(
    covid_cases = max(cases, na.rm = TRUE),
    covid_deaths = max(deaths, na.rm = TRUE)
  )

#rename the state column
covid <- covid %>%
  rename(STATE = state)

# merge the covid data with the merged data
df_merged <- left_join(df_merged, covid, by = "STATE")

# Convert relevant columns to numeric
df_merged$covid_cases <- as.numeric(df_merged$covid_cases)
df_merged$covid_deaths <- as.numeric(df_merged$covid_deaths)

# Create new columns for COVID cases and deaths as percentages
df_merged$covid_cases_percentage <- round((df_merged$covid_cases / df_merged$Population) * 100, 2)
df_merged$covid_deaths_percentage <- round((df_merged$covid_deaths / df_merged$Population) * 100, 2)

cat("Shape of Data:", dim(df_merged))
head(df_merged)
Shape of Data: 51 22
A grouped_df: 6 x 22
STATE BidenVotes TrumpVotes TotalVotes VoteOutcome DemPercentage Population Median.income Violent_Crimes_2020 Violent_Crimes_2021 ... Property_Crimes_2021 Property_Crimes_2022 HeartDeaths2021 HeartDeaths2020 HeartDeathsPer100k2021 HeartDeathsPer100k2020 covid_cases covid_deaths covid_cases_percentage covid_deaths_percentage
<chr> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <chr> <int> <int> ... <int> <int> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
Alabama 849624 1441170 2290794 Republican 37.09 5031362 54,690 22322 17590 ... 74271 88240 15173 14739 247.5 237.5 1648385 21631 32.76 0.43
Alaska 153778 189951 343729 Republican 44.74 732923 74,750 6126 5573 ... 13456 13124 1011 915 154.7 139.8 308893 1438 42.15 0.20
Arizona 1672143 1661686 3333829 Democratic 50.16 7179943 67,090 35980 30922 ... 153641 151421 14550 14196 158.3 144.8 2451062 33190 34.14 0.46
Arkansas 423932 760647 1184579 Republican 35.79 3014195 50,780 20363 21271 ... 76580 74664 8547 8621 231.0 222.5 1008303 13068 33.45 0.43
California 11110639 6006518 17117157 Democratic 64.91 39501653 77,650 174026 188343 ... 847567 914517 65471 66538 147.8 144.0 12169158 104277 30.81 0.26
Colorado 1804352 1364607 3168959 Democratic 56.94 5784865 83,780 24570 27916 ... 182850 183816 8081 8023 135.1 128.1 1771010 14245 30.61 0.25

Education Data

Let’s read in the education data that was acquired from the Census Bureau API and take a look at it.

Code
education <- read.csv('education.csv') #keep in mind that this data is from 2021

cat("Shape of Data:", dim(education))
head(education)
Shape of Data: 51 1235
A data.frame: 6 x 1235
X...DP02_0001E DP02_0001EA DP02_0001M DP02_0001MA DP02_0001PE DP02_0001PEA DP02_0001PM DP02_0001PMA DP02_0002E DP02_0002EA ... DP02_0154EA DP02_0154M DP02_0154MA DP02_0154PE DP02_0154PEA DP02_0154PM DP02_0154PMA GEO_ID NAME state
<int> <lgl> <int> <lgl> <int> <lgl> <int> <chr> <int> <lgl> ... <lgl> <int> <lgl> <dbl> <lgl> <dbl> <lgl> <chr> <chr> <int>
1 1967559 NA 10527 NA 1967559 NA -888888888 (X) 904392 NA ... NA 12886 NA 85.0 NA 0.5 NA 0400000US01 Alabama 1
2 2817723 NA 10850 NA 2817723 NA -888888888 (X) 1344242 NA ... NA 12165 NA 91.4 NA 0.3 NA 0400000US04 Arizona 4
3 1183675 NA 7882 NA 1183675 NA -888888888 (X) 565893 NA ... NA 10272 NA 85.5 NA 0.6 NA 0400000US05 Arkansas 5
4 13429063 NA 19170 NA 13429063 NA -888888888 (X) 6517082 NA ... NA 23412 NA 92.9 NA 0.1 NA 0400000US06 California 6
5 2313042 NA 8099 NA 2313042 NA -888888888 (X) 1124072 NA ... NA 10722 NA 93.0 NA 0.3 NA 0400000US08 Colorado 8
6 1428313 NA 5900 NA 1428313 NA -888888888 (X) 664848 NA ... NA 8523 NA 92.2 NA 0.4 NA 0400000US09 Connecticut 9

Using the code below, we can select only the columns that we want and rename them for clarity. Then we can merge this education data with the larger dataframe.

Code
#EDUCATION DATA
#rename and select relevant columns
education <- education %>%
  rename(STATE = NAME, HighSchoolGraduates = DP02_0062E, BachelorsDegree = DP02_0065E, AssociatesDegree = DP02_0064E, SomeCollege = DP02_0063E)

education2 <- education %>% select(STATE, HighSchoolGraduates, BachelorsDegree, AssociatesDegree, SomeCollege)

# merge the education data with the merged data
df_merged <- left_join(df_merged, education2, by = "STATE")

cat("Shape of Data:", dim(df_merged))
head(df_merged)
Shape of Data: 51 26
A grouped_df: 6 x 26
STATE BidenVotes TrumpVotes TotalVotes VoteOutcome DemPercentage Population Median.income Violent_Crimes_2020 Violent_Crimes_2021 ... HeartDeathsPer100k2021 HeartDeathsPer100k2020 covid_cases covid_deaths covid_cases_percentage covid_deaths_percentage HighSchoolGraduates BachelorsDegree AssociatesDegree SomeCollege
<chr> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <chr> <int> <int> ... <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int> <int> <int>
Alabama 849624 1441170 2290794 Republican 37.09 5031362 54,690 22322 17590 ... 247.5 237.5 1648385 21631 32.76 0.43 1079285 572276 303028 705662
Alaska 153778 189951 343729 Republican 44.74 732923 74,750 6126 5573 ... 154.7 139.8 308893 1438 42.15 0.20 135473 101422 40276 118250
Arizona 1672143 1661686 3333829 Democratic 50.16 7179943 67,090 35980 30922 ... 158.3 144.8 2451062 33190 34.14 0.46 1168057 985673 467305 1180491
Arkansas 423932 760647 1184579 Republican 35.79 3014195 50,780 20363 21271 ... 231.0 222.5 1008303 13068 33.45 0.43 695062 324137 161542 435235
California 11110639 6006518 17117157 Democratic 64.91 39501653 77,650 174026 188343 ... 147.8 144.0 12169158 104277 30.81 0.26 5578997 5958030 2120275 5287901
Colorado 1804352 1364607 3168959 Democratic 56.94 5784865 83,780 24570 27916 ... 135.1 128.1 1771010 14245 30.61 0.25 814373 1107309 334157 793438

Race Data

We can add in race data using the same process as above.

Code
race <- read.csv('Race.csv')

cat("Shape of Data:", dim(race))
head(race)
Shape of Data: 52 731
A data.frame: 6 x 731
DP05_0001E DP05_0001EA DP05_0001M DP05_0001MA DP05_0001PE DP05_0001PEA DP05_0001PM DP05_0001PMA DP05_0002E DP05_0002EA ... DP05_0091EA DP05_0091M DP05_0091MA DP05_0091PE DP05_0091PEA DP05_0091PM DP05_0091PMA GEO_ID NAME state
<int> <lgl> <int> <chr> <int> <lgl> <int> <chr> <int> <lgl> ... <lgl> <int> <lgl> <dbl> <lgl> <dbl> <lgl> <chr> <chr> <int>
1 5074296 NA -555555555 ***** 5074296 NA -888888888 (X) 2461248 NA ... NA 4729 NA 52.4 NA 0.1 NA 0400000US01 Alabama 1
2 733583 NA -555555555 ***** 733583 NA -888888888 (X) 385667 NA ... NA 2046 NA 46.8 NA 0.3 NA 0400000US02 Alaska 2
3 7359197 NA -555555555 ***** 7359197 NA -888888888 (X) 3678381 NA ... NA 8862 NA 50.4 NA 0.1 NA 0400000US04 Arizona 4
4 3045637 NA -555555555 ***** 3045637 NA -888888888 (X) 1504488 NA ... NA 3663 NA 51.3 NA 0.1 NA 0400000US05 Arkansas 5
5 39029342 NA -555555555 ***** 39029342 NA -888888888 (X) 19536425 NA ... NA 28487 NA 50.6 NA 0.1 NA 0400000US06 California 6
6 5839926 NA -555555555 ***** 5839926 NA -888888888 (X) 2960896 NA ... NA 7496 NA 49.4 NA 0.1 NA 0400000US08 Colorado 8
Code
#RACE DATA
# rename and select relevant columns
race <- race %>%
  rename(STATE = NAME, White = DP05_0037E, Black = DP05_0038E, Asian = DP05_0044E, Hispanic_Latino = DP05_0073E)

race2 <- race %>% select(STATE, White, Black, Asian, Hispanic_Latino)

# merge the race data with the merged data
df_merged <- left_join(df_merged, race2, by = "STATE")

cat("Shape of Data:", dim(df_merged))
head(df_merged)
Shape of Data: 51 30
A grouped_df: 6 x 30
STATE BidenVotes TrumpVotes TotalVotes VoteOutcome DemPercentage Population Median.income Violent_Crimes_2020 Violent_Crimes_2021 ... covid_cases_percentage covid_deaths_percentage HighSchoolGraduates BachelorsDegree AssociatesDegree SomeCollege White Black Asian Hispanic_Latino
<chr> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <chr> <int> <int> ... <dbl> <dbl> <int> <int> <int> <int> <int> <int> <int> <int>
Alabama 849624 1441170 2290794 Republican 37.09 5031362 54,690 22322 17590 ... 32.76 0.43 1079285 572276 303028 705662 3302528 1302035 78893 246477
Alaska 153778 189951 343729 Republican 44.74 732923 74,750 6126 5573 ... 42.15 0.20 135473 101422 40276 118250 437533 22202 46184 56491
Arizona 1672143 1661686 3333829 Democratic 50.16 7179943 67,090 35980 30922 ... 34.14 0.46 1168057 985673 467305 1180491 4254015 340760 266441 2388520
Arkansas 423932 760647 1184579 Republican 35.79 3014195 50,780 20363 21271 ... 33.45 0.43 695062 324137 161542 435235 2103849 437331 48921 255416
California 11110639 6006518 17117157 Democratic 64.91 39501653 77,650 174026 188343 ... 30.81 0.26 5578997 5958030 2120275 5287901 15175598 2121422 6054038 15732184
Colorado 1804352 1364607 3168959 Democratic 56.94 5784865 83,780 24570 27916 ... 30.61 0.25 814373 1107309 334157 793438 4106707 235519 190181 1314962

Normalizing Data

Normalizing data is a vital step in the data science life cycle as it ensures that all variables contribute equitably to analysis, irrespective of their initial scale. This process not only prevents biases and inaccuracies in statistical models or machine learning algorithms but also enhances clarity in understanding differences between units of observation. By placing these units on a comparable scale, normalization facilitates more meaningful comparisons and analyses.

The COVID data is already normalized. Thus, we will now normalize the crime, education, and race data per state. For the crime data, we take the average of violent crimes and property crimes for 2020-22 and divide by the state population. For the education and race data, we simply divide the counts by the state population. Finally, for the heart death data, we take the average of heart deaths per 100k for 2020 and 2021. Below is an example of the newly normalized data, which is then saved to a CSV file for future use.

Code
#normalize the crime data per state
df_merged$normalized_violent_crimes <- round(apply(df_merged[, c("Violent_Crimes_2020", "Violent_Crimes_2021", "Violent_Crimes_2022")], 1, mean, na.rm = TRUE) / df_merged$Population, 5)
df_merged$normalized_property_crimes <- round(apply(df_merged[, c("Property_Crimes_2020", "Property_Crimes_2021", "Property_Crimes_2022")], 1, mean, na.rm = TRUE) / df_merged$Population, 5)

# normalize education data per state
df_merged$normalized_high_school <- round(df_merged$HighSchoolGraduates  / df_merged$Population, 5)
df_merged$normalized_bachelors <- round(df_merged$BachelorsDegree    / df_merged$Population, 5)
df_merged$normalized_associates <- round(df_merged$AssociatesDegree  / df_merged$Population, 5)
df_merged$normalized_some_college <- round(df_merged$SomeCollege     / df_merged$Population, 5)

# finally, let's normalize the race data per state
df_merged$normalized_white <- round(df_merged$White  / df_merged$Population, 5)
df_merged$normalized_black <- round(df_merged$Black  / df_merged$Population, 5)
df_merged$normalized_asian <- round(df_merged$Asian  / df_merged$Population, 5)
df_merged$normalized_hispanic_latino <- round(df_merged$Hispanic_Latino  / df_merged$Population, 5)

# average the 2020 and 2021 heart death data
df_merged$HeartDeathsPer100k <- ((df_merged$HeartDeathsPer100k2020 + df_merged$HeartDeathsPer100k2021 )/2)

cat("Shape of Data:", dim(df_merged))
head(df_merged)

#save to csv
write.csv(df_merged, file = 'cleaned.csv', row.names = FALSE)
Shape of Data: 51 41
A grouped_df: 6 x 41
STATE BidenVotes TrumpVotes TotalVotes VoteOutcome DemPercentage Population Median.income Violent_Crimes_2020 Violent_Crimes_2021 ... normalized_property_crimes normalized_high_school normalized_bachelors normalized_associates normalized_some_college normalized_white normalized_black normalized_asian normalized_hispanic_latino HeartDeathsPer100k
<chr> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <chr> <int> <int> ... <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
Alabama 849624 1441170 2290794 Republican 37.09 5031362 54,690 22322 17590 ... 0.01773 0.21451 0.11374 0.06023 0.14025 0.65639 0.25878 0.01568 0.04899 242.50
Alaska 153778 189951 343729 Republican 44.74 732923 74,750 6126 5573 ... 0.01961 0.18484 0.13838 0.05495 0.16134 0.59697 0.03029 0.06301 0.07708 147.25
Arizona 1672143 1661686 3333829 Democratic 50.16 7179943 67,090 35980 30922 ... 0.02184 0.16268 0.13728 0.06508 0.16442 0.59249 0.04746 0.03711 0.33267 151.55
Arkansas 423932 760647 1184579 Republican 35.79 3014195 50,780 20363 21271 ... 0.02548 0.23060 0.10754 0.05359 0.14440 0.69798 0.14509 0.01623 0.08474 226.75
California 11110639 6006518 17117157 Democratic 64.91 39501653 77,650 174026 188343 ... 0.02197 0.14123 0.15083 0.05368 0.13387 0.38418 0.05370 0.15326 0.39827 145.90
Colorado 1804352 1364607 3168959 Democratic 56.94 5784865 83,780 24570 27916 ... 0.03061 0.14078 0.19141 0.05776 0.13716 0.70991 0.04071 0.03288 0.22731 131.60

EDA

Exploratory Data Analysis (EDA) is a fundamental starting point in data analysis, helping us grasp the data’s characteristics, patterns, and possible outliers. It provides essential insights for making informed modeling decisions.

By analyzing the below data, we hope to gain an understanding of overall trends that can aid in refining our hypotheses.

Code
#rename to df for ease of use
df <- df_merged

Income

Code
df$Median.income = as.numeric(gsub(",", "", df$Median.income))
#show summary stats
cat("BASIC STATISTICS:\n")
cat("----------------------\n")
summary(df$Median.income)

ggplot(df, aes(y = Median.income)) +
  geom_boxplot(fill = "pink", alpha = 0.7) +
  labs(title = "Median Income", y = "Median Income")
BASIC STATISTICS:
----------------------
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  45130   60360   67410   68660   78200   94790 

This box plot shows the distribution of median income across states. The plot has a wide interquartile range, and the line inside the box marks the median at $67,410.

Code
ggplot(df, aes(x = reorder(STATE, Median.income), y = Median.income, fill = VoteOutcome)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("Republican" = "red", "Democratic" = "blue"), name = "Political Affiliation") +
  coord_flip() +
  xlab("State") +
  ylab("Median Income") +
  ggtitle("Median Income by State (Colored by Political Affiliation)")

This is a bar chart showing median income by state, with the color indicating vote outcome. The states at the top, which have higher median incomes, tend to have voted Democratic, while states at the bottom with lower median incomes tend to have voted Republican.

Code
ggplot(df, aes(x = VoteOutcome, y = Median.income, fill = VoteOutcome)) +
    geom_boxplot() +
    scale_fill_manual(values = c("Republican" = "red", "Democratic" = "blue"), name = "Political Affiliation") +
    xlab("Political Affiliation") +
    ylab("Median Income") +
    ggtitle("Median Income by Political Affiliation") +
    theme(legend.position = "none")

This is a boxplot comparing median income distributions between Democratic and Republican vote outcomes. We can see that the median income for Democratic areas is higher than for Republican areas, as indicated by the position of the median line within each box. Additionally, the interquartile range for Democratic areas seems slightly broader, suggesting more variability in income.

Code
ggplot(df, aes(x = DemPercentage, y = Median.income, color = VoteOutcome)) +
    geom_point() +
    geom_smooth(method = "lm") +
    scale_color_manual(values = c("Republican" = "red", "Democratic" = "blue"), name = "Political Affiliation") +
    xlab("Percentage of Vote for Winning Party") +
    ylab("Median Income") +
    ggtitle("Median Income vs Percentage of Vote")
`geom_smooth()` using formula = 'y ~ x'

This is a scatter plot with a trend line comparing median income to the percentage of vote for the winning party, again distinguished by vote outcome. This plot suggests a positive correlation for both Democratic and Republican areas, meaning that as the percentage of vote for the winning party increases, the median income also tends to increase. The shaded areas around the trend lines are the confidence intervals, which seem to be broader for Republican areas, suggesting more variability or less certainty in the trend compared to the Democratic areas.

Crime

Code
#show summary stats
cat("BASIC STATISTICS:\n")
cat("----------------------\n")
cat("Violent Crimes:")
summary(df$normalized_violent_crimes)

cat("\nProperty Crimes:")
summary(df$normalized_property_crimes)

ggplot(df, aes(x = VoteOutcome, y = normalized_violent_crimes, fill = VoteOutcome)) +
  geom_boxplot() +
  scale_fill_manual(values = c("Republican" = "red", "Democratic" = "blue")) +
  labs(title = "Violent Crime Rates by Political Affiliation", x = "Political Affiliation", y = "Normalized Violent Crimes") +
  theme(legend.position = "none")
BASIC STATISTICS:
----------------------
Violent Crimes:
Property Crimes:
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
0.001090 0.002755 0.003360 0.003835 0.004560 0.009410 
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.01021 0.01517 0.01866 0.01937 0.02239 0.03801 

The box plot shows that the median violent crime rate for Republican states is higher than for Democratic states. Two Democratic states are significant outliers, which pulled the Democratic mean upward, though it remains below that of the Republican states.

Code
ggplot(df, aes(x = VoteOutcome, y = normalized_property_crimes, fill = VoteOutcome)) +
  geom_boxplot() +
  scale_fill_manual(values = c("Republican" = "red", "Democratic" = "blue")) +
  labs(title = "Property Crime Rates by Political Affiliation", x = "Political Affiliation", y = "Normalized Property Crimes") +
  theme(legend.position = "none")

The second box plot shows the distribution of property crime rates. Again, Democratic states overall have lower crime rates, and for property crimes the difference is even more pronounced.

Health

Code
health <- df[c("STATE", "BidenVotes", "TrumpVotes", "VoteOutcome", "Population", 
                "HeartDeathsPer100k2020", "HeartDeathsPer100k2021", "covid_cases_percentage", "covid_deaths_percentage")]

#show summary stats
cat("BASIC STATISTICS:\n")
cat("----------------------\n")
print(summary(health))
BASIC STATISTICS:
----------------------
    STATE             BidenVotes         TrumpVotes      VoteOutcome       
 Length:51          Min.   :   73491   Min.   :  18586   Length:51         
 Class :character   1st Qu.:  399258   1st Qu.: 473638   Class :character  
 Mode  :character   Median :  856034   Median :1020280   Mode  :character  
                    Mean   : 1586213   Mean   :1449562                     
                    3rd Qu.: 2375907   3rd Qu.:1791166                     
                    Max.   :11110639   Max.   :6006518                     
                                                                           
   Population       HeartDeathsPer100k2020 HeartDeathsPer100k2021
 Min.   :  577605   Min.   :118.1          Min.   :123.9         
 1st Qu.: 1820311   1st Qu.:146.3          1st Qu.:155.2         
 Median : 4507445   Median :162.4          Median :167.8         
 Mean   : 6500226   Mean   :169.0          Mean   :177.1         
 3rd Qu.: 7451987   3rd Qu.:183.9          3rd Qu.:194.2         
 Max.   :39501653   Max.   :245.6          Max.   :264.2         
                    NA's   :1              NA's   :1             
 covid_cases_percentage covid_deaths_percentage
 Min.   :22.19          Min.   :0.1300         
 1st Qu.:28.36          1st Qu.:0.2650         
 Median :31.60          Median :0.3500         
 Mean   :31.11          Mean   :0.3308         
 3rd Qu.:33.56          3rd Qu.:0.3950         
 Max.   :42.15          Max.   :0.4600         
                                               
Code
ggplot(df, aes(x = DemPercentage, y = covid_cases_percentage, color = VoteOutcome)) +
    geom_point() +
    #geom_smooth(method = "lm") +
    scale_color_manual(values = c("Republican" = "red", "Democratic" = "blue"), name = "Political Affiliation") +
    xlab("Percentage of Vote for Democratic Party") +
    ylab("COVID-19 Cases (percent of population)") +
    ggtitle("Covid Case Rates vs Percentage of Vote")
    
# covid_deaths_percentage is already expressed as a percent of population; no rescaling is needed

ggplot(df, aes(x = DemPercentage, y = covid_deaths_percentage, color = VoteOutcome)) +
    geom_point() +
    #geom_smooth(method = "lm") +
    scale_color_manual(values = c("Republican" = "red", "Democratic" = "blue"), name = "Political Affiliation") +
    xlab("Percentage of Vote for Democratic Party") +
    ylab("COVID-19 Deaths (percent of population)") +
    ggtitle("Covid Death Rates vs Percentage of Vote")

ggplot(df, aes(x = DemPercentage, y = HeartDeathsPer100k, color = VoteOutcome)) +
    geom_point() +
    #geom_smooth(method = "lm") +
    scale_color_manual(values = c("Republican" = "red", "Democratic" = "blue"), name = "Political Affiliation") +
    xlab("Percentage of Vote for Democratic Party") +
    ylab("Heart Disease Deaths (per 100k)") +
    ggtitle("Heart Disease Death Rates per 100K vs Percentage of Vote",
    subtitle = "Average of 2020 and 2021 Data")

Warning message:
"Removed 1 rows containing missing values (`geom_point()`)."

Examining the scatterplots, there may be a difference in disease death rates between Democratic and Republican states. More Republican states have higher rates of COVID-19 cases, and more Democratic states have lower rates. The heart disease deaths per 100k for each state show a similar distribution. When looking at the COVID-19 death rate, however, the difference between Democratic and Republican states becomes less evident, if a difference exists at all.

Code
ggplot(df, aes(x = VoteOutcome, y = covid_cases_percentage, fill = VoteOutcome)) +
    geom_boxplot() +
    scale_fill_manual(values = c("Republican" = "red", "Democratic" = "blue")) +
    xlab("Political Affiliation") +
    ylab("COVID-19 Cases (percent of population)") +
    ggtitle("Covid Case Rates by Political Affiliation") +
    theme(legend.position = "none")


ggplot(df, aes(x = VoteOutcome, y = covid_deaths_percentage, fill = VoteOutcome)) +
    geom_boxplot() +
    scale_fill_manual(values = c("Republican" = "red", "Democratic" = "blue")) +
    xlab("Political Affiliation") +
    ylab("COVID-19 Deaths (percent of population)") +
    ggtitle("Covid Death Rates by Political Affiliation") +
    theme(legend.position = "none")

ggplot(df, aes(x = VoteOutcome, y = HeartDeathsPer100k, fill = VoteOutcome)) +
    geom_boxplot() +
    scale_fill_manual(values = c("Republican" = "red", "Democratic" = "blue")) +
    xlab("Political Affiliation") +
    ylab("Heart Disease Deaths (per 100k)") +
    ggtitle("Heart Disease Death Rates per 100K by Political Affiliation",
    subtitle = "Average of 2020 and 2021 Data") + 
    theme(legend.position = "none")

Warning message:
"Removed 1 rows containing non-finite values (`stat_boxplot()`)."

The box plots more clearly reveal that states that voted Republican have higher rates of COVID-19 cases, COVID-19 deaths, and heart disease deaths.

Code
ggplot(df, aes(x = reorder(STATE, covid_cases_percentage), y = covid_cases_percentage, fill = VoteOutcome)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("Republican" = "red", "Democratic" = "blue")) +
  coord_flip() +
  xlab("State") +
  ylab("COVID-19 Cases (percent of population)") +
  ggtitle("COVID-19 Cases by State (Colored by Political Affiliation)") +
  theme(legend.position = "none")


ggplot(df, aes(x = reorder(STATE, covid_deaths_percentage), y = covid_deaths_percentage, fill = VoteOutcome)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("Republican" = "red", "Democratic" = "blue")) +
  coord_flip() +
  xlab("State") +
  ylab("COVID-19 Deaths (percent of population)") +
  ggtitle("COVID-19 Deaths by State (Colored by Political Affiliation)") +
  theme(legend.position = "none")

ggplot(df, aes(x = reorder(STATE, HeartDeathsPer100k), y = HeartDeathsPer100k, fill = VoteOutcome)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("Republican" = "red", "Democratic" = "blue")) +
  coord_flip() +
  xlab("State") +
  ylab("Heart Disease Deaths (per 100k)") +
  ggtitle("Heart Disease Deaths by State (Colored by Political Affiliation)",
           subtitle = "Average of 2020 and 2021 Data") +
  theme(legend.position = "none")

Warning message:
"Removed 1 rows containing missing values (`position_stack()`)."

These horizontal bar charts help identify which states had the highest disease rates and their political affiliations. For COVID-19 case rates, COVID-19 death rates, and heart disease death rates, the top ten states in each category are predominantly Republican.

Education

Code
df1 <- df %>%
  rename(
    `High School Graduate` = normalized_high_school,
    `Bachelors Degree` = normalized_bachelors,
    `Some College` = normalized_some_college,
    `Associates Degree` = normalized_associates
  )

education <- gather(df1, key = "EducationLevel", value = "EducationCount", 
                    'High School Graduate', 'Bachelors Degree', 'Some College', 'Associates Degree')

ggplot(education, aes(x = EducationLevel, y = EducationCount, fill = VoteOutcome)) +
  geom_boxplot() +
  scale_fill_manual(values = c("Republican" = "red", "Democratic" = "blue"), name = "Political Affiliation" ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  xlab("Education Level") +
  ylab("Normalized Education Percentage of Individuals") +
  ggtitle("Normalized Education Levels by Political Affiliation")

Code
head(education)
A grouped_df: 6 x 39
STATE BidenVotes TrumpVotes TotalVotes VoteOutcome DemPercentage Population Median.income Violent_Crimes_2020 Violent_Crimes_2021 ... Hispanic_Latino normalized_violent_crimes normalized_property_crimes normalized_white normalized_black normalized_asian normalized_hispanic_latino HeartDeathsPer100k EducationLevel EducationCount
<chr> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <int> <int> ... <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
Alabama 849624 1441170 2290794 Republican 37.09 5031362 54690 22322 17590 ... 246477 0.00402 0.01773 0.65639 0.25878 0.01568 0.04899 242.50 normalized_high_school 0.21451
Alaska 153778 189951 343729 Republican 44.74 732923 74750 6126 5573 ... 56491 0.00785 0.01961 0.59697 0.03029 0.06301 0.07708 147.25 normalized_high_school 0.18484
Arizona 1672143 1661686 3333829 Democratic 50.16 7179943 67090 35980 30922 ... 2388520 0.00458 0.02184 0.59249 0.04746 0.03711 0.33267 151.55 normalized_high_school 0.16268
Arkansas 423932 760647 1184579 Republican 35.79 3014195 50780 20363 21271 ... 255416 0.00678 0.02548 0.69798 0.14509 0.01623 0.08474 226.75 normalized_high_school 0.23060
California 11110639 6006518 17117157 Democratic 64.91 39501653 77650 174026 188343 ... 15732184 0.00470 0.02197 0.38418 0.05370 0.15326 0.39827 145.90 normalized_high_school 0.14123
Colorado 1804352 1364607 3168959 Democratic 56.94 5784865 83780 24570 27916 ... 1314962 0.00468 0.03061 0.70991 0.04071 0.03288 0.22731 131.60 normalized_high_school 0.14078

In this boxplot, we can see the distribution of normalized percentages across different education levels, separated by vote outcome. At the associate’s level, the distributions are fairly similar for Democratic and Republican states, with the median lines at nearly the same level, though the interquartile range is slightly broader for Democratic states, indicating more variability. At the bachelor’s level, the median is higher in Democratic states than in Republican states, and the range is also broader, showing a wider spread. At the high school level, the medians are closer together and the interquartile ranges are quite similar, suggesting comparable variability between the two vote outcomes. For those with some college, Democratic states show a higher median, with a wider interquartile range as well. Overall, the plot suggests that Democratic states have a higher normalized percentage of individuals with higher education levels (bachelor’s and some college), while the percentages are more similar at lower levels of education (high school and associate’s).

Statistical Methods

Welch’s T-Test

Welch’s t-test, also called the unequal variances t-test, compares the means of two populations to test the null hypothesis that they are equal. Unlike the Student’s t-test, which assumes the two samples are drawn from normally distributed populations with equal variances, Welch’s t-test was designed for cases where the populations have different variances, though it still assumes the samples are normally distributed. It seems easy to assume that Welch’s t-test should be used only when variances differ, but it has been argued that Welch’s t-test should always be preferred over the Student’s t-test because it has better control of Type I and Type II errors (2006). Welch’s t-statistic is calculated by dividing the difference of the two sample means by the square root of the sum of their squared standard errors: \(t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{s_{\bar{X}_1}^2 + s_{\bar{X}_2}^2}}\), where \(\bar{X}_i\) is the mean of sample \(i\) and \(s_{\bar{X}_i}\) is its standard error.
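To make the formula concrete, here is a minimal sketch (using made-up sample vectors, not our state data) that computes the statistic by hand and checks it against R’s built-in `t.test`, which performs Welch’s test by default:

```r
# Toy samples (hypothetical values, for illustration only)
x <- c(52.1, 48.3, 55.0, 60.2, 47.8, 53.6)
y <- c(44.9, 50.2, 41.7, 46.3, 48.8)

# Standard error of each sample mean
se_x <- sd(x) / sqrt(length(x))
se_y <- sd(y) / sqrt(length(y))

# Welch's t-statistic: difference of means over the combined standard error
t_manual <- (mean(x) - mean(y)) / sqrt(se_x^2 + se_y^2)

# t.test uses the Welch form by default (var.equal = FALSE)
t_builtin <- unname(t.test(x, y)$statistic)

all.equal(t_builtin, t_manual)  # TRUE
```

Note that `t.test` also computes the Welch-Satterthwaite degrees of freedom, which is why the reported df values in the results below are not whole numbers.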

Hotelling’s T-Squared Test

Hotelling’s T-Squared Test is used to compare the means of two groups across multiple variables simultaneously. It is an extension of the Student’s t-test to the multivariate case, where each observation is a vector of values rather than a single scalar value. This test is useful when the variables are correlated, and it assumes that the data from both groups are randomly drawn from multivariate normally distributed populations with equal covariance matrices, and that the samples are independent of each other. The test statistic \(T^2\) is derived from the means, variances, and covariances of the variables, and it is often transformed into an F-statistic for determining statistical significance.
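As a sketch of the mechanics (using simulated bivariate data rather than our dataset), the statistic can be computed directly from the group mean vectors and the pooled covariance matrix, then converted to an F-statistic:

```r
# Simulated bivariate data for two groups (hypothetical values)
set.seed(1)
g1 <- cbind(rnorm(15, mean = 5), rnorm(15, mean = 2))
g2 <- cbind(rnorm(12, mean = 6), rnorm(12, mean = 2.5))

n1 <- nrow(g1); n2 <- nrow(g2); p <- ncol(g1)

# Difference of the mean vectors and the pooled covariance matrix
d <- colMeans(g1) - colMeans(g2)
S <- ((n1 - 1) * cov(g1) + (n2 - 1) * cov(g2)) / (n1 + n2 - 2)

# Hotelling's T-squared statistic
T2 <- (n1 * n2 / (n1 + n2)) * drop(t(d) %*% solve(S) %*% d)

# Transform to an F statistic with (p, n1 + n2 - p - 1) degrees of freedom
F_stat <- (n1 + n2 - p - 1) / ((n1 + n2 - 2) * p) * T2
p_value <- pf(F_stat, p, n1 + n2 - p - 1, lower.tail = FALSE)
```

In the hypothesis tests below we instead rely on the `Hotelling` package’s `hotelling.test`, which wraps this same computation.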

Bootstrapped Difference of Means

A bootstrap approach resamples each observed population n times with replacement, where n is the number of observations in that sample, and uses the resampled datasets to compute a mean. The bootstrapped mean of population two is subtracted from that of population one. This process of computing means from resamples and taking their difference is repeated many times, producing a distribution of differences of means. Next, the quantiles of this distribution are examined at a predetermined confidence level. Common confidence levels are 90 and 95 percent; with a confidence level of 90 percent we would look at the 5 and 95 percent quantiles. Bootstrap approaches are useful when one or both of the observed samples are small.
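The procedure described above can be sketched in a few lines (using simulated samples rather than our state data); the resampled differences of means are collected, and the 5 and 95 percent quantiles give a 90 percent confidence interval:

```r
# Simulated samples (hypothetical values, for illustration only)
set.seed(42)
pop1 <- rnorm(20, mean = 10, sd = 2)
pop2 <- rnorm(25, mean = 9, sd = 2)

N <- 10000
diff_means <- numeric(N)
for (i in 1:N) {
  # Resample each group with replacement at its own sample size
  s1 <- sample(pop1, length(pop1), replace = TRUE)
  s2 <- sample(pop2, length(pop2), replace = TRUE)
  diff_means[i] <- mean(s1) - mean(s2)
}

# 90 percent percentile confidence interval (5% and 95% quantiles)
ci <- quantile(diff_means, c(0.05, 0.95))
```

If the resulting interval excludes 0 (or excludes 1 when a ratio of means is used, as in the tests below), the null hypothesis of no difference is rejected at that confidence level.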

Results: Hypothesis Testing

Hypothesis Test 1: Income

\(H_0:\) There is no difference in median household income between states that voted for Biden and states that voted for Trump.

\(H_A:\) There is a difference in median household income between states that voted for Biden and states that voted for Trump.

Code
democratic <- df[df$VoteOutcome == "Democratic", ]

republican <- df[df$VoteOutcome == 'Republican', ]

t.test(democratic$Median.income, republican$Median.income, alternative = "two.sided")

    Welch Two Sample t-test

data:  democratic$Median.income and republican$Median.income
t = 4.7036, df = 48.366, p-value = 2.167e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  7463.532 18604.499
sample estimates:
mean of x mean of y 
 75049.62  62015.60 

We used the Welch Two Sample t-test to compare median incomes between Democratic and Republican states. The t-statistic of 4.7036 with 48.366 degrees of freedom and a very low p-value of 2.167e-05 provide strong evidence to reject the null hypothesis, indicating a significant difference in median incomes between the two political affiliations. The 95 percent confidence interval for the true difference in means (7463.53 to 18604.50) suggests that Democratic states have median household incomes roughly $7,500 to $18,600 higher than Republican states.

Hypothesis Test 2.1: Violent Crime

\(H_0:\) There is no difference in the mean normalized rates of violent crimes between Democratic and Republican states.

\(H_A:\) There is a difference in the mean normalized rates of violent crimes between Democratic and Republican states.

Code
t.test(democratic$normalized_violent_crimes, republican$normalized_violent_crimes, alternative = "two.sided")

    Welch Two Sample t-test

data:  democratic$normalized_violent_crimes and republican$normalized_violent_crimes
t = -1.0662, df = 47.928, p-value = 0.2917
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.0014633197  0.0004491966
sample estimates:
  mean of x   mean of y 
0.003586538 0.004093600 

We ran another Welch Two-Sample t-test to compare the normalized rates of violent crimes between Democratic and Republican states. The results gave us a t-statistic of -1.0662 with 47.928 degrees of freedom and a p-value of 0.2917. This suggests that there is not enough evidence to reject the null hypothesis, indicating no significant difference in the mean normalized rates of violent crimes between the two political affiliations.

Hypothesis Test 2.2: Property Crime

\(H_0:\) There is no difference in the mean normalized rates of property crimes between Democratic and Republican states.

\(H_A:\) There is a difference in the mean normalized rates of property crimes between Democratic and Republican states.

Code
t.test(democratic$normalized_property_crimes, republican$normalized_property_crimes, alternative = "two.sided")

    Welch Two Sample t-test

data:  democratic$normalized_property_crimes and republican$normalized_property_crimes
t = -0.51149, df = 40.431, p-value = 0.6118
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.004177888  0.002489857
sample estimates:
 mean of x  mean of y 
0.01896038 0.01980440 

In this Welch Two-Sample t-test, we are comparing the normalized rates of property crimes between Democratic and Republican states. This gave us a t-statistic of -0.51149 with 40.431 degrees of freedom, and a p-value of 0.6118. This suggests that there is insufficient evidence to reject the null hypothesis, indicating no significant difference in the mean normalized rates of property crimes between Democratic and Republican states.

Hypothesis Test 3.1: Health Hotelling Test

\(H_0:\) There is no difference in the multivariate means of health-related variables between Democratic and Republican states.

\(H_A:\) There is a difference in the multivariate means of health-related variables between Democratic and Republican states.

Code
# Subset data for Democratic and Republican states
democratic_data2 <- df[df$VoteOutcome == "Democratic", c("covid_cases_percentage", "covid_deaths_percentage", "HeartDeathsPer100k")]
republican_data2 <- df[df$VoteOutcome == "Republican", c("covid_cases_percentage", "covid_deaths_percentage", "HeartDeathsPer100k")]

# Conduct Hotelling's T-squared test (the Hotelling package must be loaded first)
library(Hotelling)
hotelling_result2 <- hotelling.test(democratic_data2, republican_data2)

# Inspect the results
print(hotelling_result2)

Hotelling’s T-squared test lets us assess differences in the multivariate means of the health-related variables (covid_cases_percentage, covid_deaths_percentage, and HeartDeathsPer100k) between Democratic and Republican states simultaneously. A sufficiently low p-value from this test would provide strong evidence to reject the null hypothesis, indicating significant disparities in these health-related variables across states based on their political affiliations.

Hypothesis Test 3.2: Covid Cases Bootstrapping

\(H_0:\) The ratio of mean COVID-19 case percentages between Democratic and Republican states is equal to 1 (no difference).

\(H_A:\) The ratio of mean COVID-19 case percentages between Democratic and Republican states is not equal to 1 (there is a difference).

Code
covidc_dem <- subset(df, select=covid_cases_percentage,
                      subset=VoteOutcome=="Democratic", drop=T)

covidc_rep  <- subset(df, select=covid_cases_percentage,
                      subset=VoteOutcome=="Republican", drop=T)

rep_mean <- mean(df$covid_cases_percentage[df$VoteOutcome == "Republican"])
dem_mean <- mean(df$covid_cases_percentage[df$VoteOutcome == 'Democratic'])

actual_mean_ratio <- rep_mean / dem_mean

N <- 10000
ratio.mean <- numeric(N) 

for (i in 1:N)
{
dem <- sample(df$covid_cases_percentage[df$VoteOutcome=='Democratic'], length(covidc_dem), replace = TRUE) 
rep <- sample(df$covid_cases_percentage[df$VoteOutcome=="Republican"], length(covidc_rep), replace = TRUE)
ratio.mean[i] <- mean(rep)/mean(dem)
}

ci.1 <- quantile(ratio.mean, c(0.025, 0.975))
ci.1

hist(ratio.mean, main = "Ratio of Means (Rep/Dem)", xlab = "Ratio of Means", breaks = 30)
abline(v = actual_mean_ratio, col = "red", lty = 2)
abline(v = ci.1, col = 4, lwd = 2)
2.5%
1.04355974917208
97.5%
1.19842059043686

As the 95% confidence interval for the ratio of means (1.04 to 1.20) does not include 1, we reject the null hypothesis that the ratio of means is 1, concluding that there is a difference in COVID-19 case percentages between Democratic and Republican states, with Republican states tending to have higher case rates.

Hypothesis Test 3.3: Covid Deaths Bootstrapping

\(H_0:\) The ratio of mean COVID-19 death percentages between Democratic and Republican states is equal to 1 (no difference).

\(H_A:\) The ratio of mean COVID-19 death percentages between Democratic and Republican states is not equal to 1 (there is a difference).

Code
covidd_dem <- subset(df, select=covid_deaths_percentage,
                      subset=VoteOutcome=="Democratic", drop=T)

covidd_rep  <- subset(df, select=covid_deaths_percentage,
                      subset=VoteOutcome=="Republican", drop=T)

rep_mean <- mean(df$covid_deaths_percentage[df$VoteOutcome == "Republican"])
dem_mean <- mean(df$covid_deaths_percentage[df$VoteOutcome == 'Democratic'])

actual_mean_ratio <- rep_mean / dem_mean

N <- 10000
ratio.mean <- numeric(N) 

for (i in 1:N)
{
dem <- sample(df$covid_deaths_percentage[df$VoteOutcome=='Democratic'], length(covidd_dem), replace = TRUE) 
rep <- sample(df$covid_deaths_percentage[df$VoteOutcome=="Republican"], length(covidd_rep), replace = TRUE)
ratio.mean[i] <- mean(rep)/mean(dem)
}

ci.1 <- quantile(ratio.mean, c(0.025, 0.975))
ci.1

hist(ratio.mean, main = "Ratio of Means (Rep/Dem)", xlab = "Ratio of Means", breaks = 30)
abline(v = actual_mean_ratio, col = "red", lty = 2)
abline(v = ci.1, col = 4, lwd = 2)
2.5%
1.01864528424825
97.5%
1.33756871172315

As the 95% confidence interval for the ratio of means (1.02 to 1.34) does not include 1, we reject the null hypothesis that the ratio of means is 1, concluding that there is a difference in COVID-19 death percentages between Democratic and Republican states, with Republican states tending to have higher death rates.

Hypothesis Test 4: Education

\(H_0:\) There is no difference in the multivariate means of education levels between Democratic and Republican states.

\(H_A:\) There is a difference in the multivariate means of education levels between Democratic and Republican states.

Code
# Load required libraries
library(Hotelling)

# Subset data for Democratic and Republican states
democratic_data <- df[df$VoteOutcome == "Democratic", c("normalized_high_school", "normalized_bachelors", "normalized_associates", "normalized_some_college")]
republican_data <- df[df$VoteOutcome == "Republican", c("normalized_high_school", "normalized_bachelors", "normalized_associates", "normalized_some_college")]

# Conduct Hotelling's T-squared test
hotelling_result <- hotelling.test(democratic_data, republican_data)

# Inspect the results
print(hotelling_result)
Test stat:  27.92 
Numerator df:  4 
Denominator df:  46 
P-value:  0.0002927 

We used a Hotelling’s T-squared test here to assess the differences in the multivariate means of education levels between Democratic and Republican states. The test yielded a test statistic of 27.92, with 4 degrees of freedom in the numerator and 46 in the denominator, and a very low p-value of 0.0002927, providing strong evidence to reject the null hypothesis. This supports the conclusion that there is a significant difference in the normalized education levels between the two political affiliations.

Conclusion

In conclusion, the findings from our hypothesis testing shed light on the relationship between political leanings and various aspects of citizen well-being in the United States. We analyzed the impact of political affiliations on income, crime rates, health outcomes, and education levels. The results of the Welch Two-Sample t-test for median household income revealed a significant difference, indicating that Democratic states tend to have higher median incomes compared to Republican states. On the contrary, our tests on normalized rates of violent and property crimes did not yield sufficient evidence to reject the null hypothesis, suggesting no significant differences between Democratic and Republican states in these aspects.

Moving to health-related variables, the Hotelling’s T-squared test provided evidence of differences in the multivariate means of COVID-19-related variables between Democratic and Republican states. This underscores the impact of political affiliations on health outcomes, particularly during the COVID-19 pandemic. Additionally, our bootstrapping analyses on COVID-19 cases and death percentages reinforced the notion of significant disparities between the states based on political affiliations.

As for education, the Hotelling’s T-squared test on education levels indicated a substantial distinction in the multivariate means between Democratic and Republican states, highlighting a meaningful difference in the normalized education levels across the United States.

Our analysis underscores the significant influence of political leanings on various facets of citizen well-being. These findings not only contribute to the understanding of the interplay between politics and everyday life but also may be helpful for policymakers and individuals making decisions about where to reside.

References

“American Community Survey 1-Year Data (2005-2022).” 2022. United States Census Bureau. https://www.census.gov/data/developers/data-sets/acs-1year.html.
Cohen, Mark A. 2008. “The Effect of Crime on Life Satisfaction.” The Journal of Legal Studies 37 (S2): S325–53.
“Covid-19-Data.” 2023. NY Times. https://github.com/nytimes/covid-19-data.
“Crime Data Explorer.” 2022. FBI. https://cde.ucr.cjis.gov/LATEST/webapp/#/pages/home.
Edgerton, Jason D, Lance W Roberts, and Susanne von Below. 2011. “Education and Quality of Life.” Handbook of Social Indicators and Quality of Life Research, 265–96.
“Election Results and Voting Information.” 2020. Federal Election Commission. https://www.fec.gov/introduction-campaign-finance/election-results-and-voting-information/.
Florida, Richard. 2010. Who’s Your City?: How the Creative Economy Is Making Where to Live the Most Important Decision of Your Life. Vintage Canada.
“Heart Disease Mortality by State.” 2022. Centers for Disease Control and Prevention. https://www.cdc.gov/nchs/pressroom/sosmap/heart_disease_mortality/heart_disease.htm.
“Historical Income Tables: Households.” 2022. United States Census Bureau. https://www.census.gov/data/tables/time-series/demo/income-poverty/historical-income-households.html.
Kahneman, Daniel, and Angus Deaton. 2010. “High Income Improves Evaluation of Life but Not Emotional Well-Being.” Proceedings of the National Academy of Sciences 107 (38): 16489–93.
Michalos, Alex C, Bruno D Zumbo, and Anita Hubley. 2000. “Health and the Quality of Life.” Social Indicators Research 51: 245–86.
Ruxton, Graeme D. 2006. “The Unequal Variance t-test Is an Underused Alternative to Student’s t-test and the Mann–Whitney U Test.” Behavioral Ecology 17 (4): 688–90. https://doi.org/10.1093/beheco/ark016.
“State Population Totals and Components of Change: 2020-2022.” 2022. United States Census Bureau. https://www.census.gov/data/tables/time-series/demo/popest/2020s-state-total.html.