#load in relevant libraries
library(tidyverse)
Billy McGloin
Eric Dapkus
Nolan Penoyer
Caitlyn Dovel
November 29, 2023
The interplay between political leanings and quality of life in the United States forms a complex tapestry. This study aims to dissect this interplay, focusing on how a state’s political orientations impact various facets of citizen well-being. This project not only seeks to highlight the correlations between political inclinations and quality of life indicators but also strives to understand the causative relationships within. The primary research question guiding this study is: How do state political leanings influence the quality of life of their residents?
Choosing where to live is one of the most important decisions in a person’s life (2010). Our study aims to inform that decision by ascertaining whether the political leanings of states, specifically Democratic or Republican affiliations, influence the quality of life experienced by their citizens. We approach this investigation through a comprehensive examination of key indicators such as income, crime rates, health outcomes, and education.
Income levels often serve as a foundational determinant of an individual’s standard of living. The political leanings of a state influence economic policies, tax structures, and resource allocation, subsequently impacting income distribution among its residents. While relative income may not improve the quality of someone’s everyday experience, it has been shown to improve how we evaluate our lives (2010).
Likewise, crime rates, another critical aspect of our investigation, are closely linked to personal safety and community well-being. The political landscape may shape law enforcement policies, social programs, community engagement, and program funding, all of which can have direct implications for public safety and the overall quality of life. Past studies have shown that specific crimes can have a large and lasting impact on our overall well-being (2008).
Health outcomes, which strongly affect our quality of life (2000), constitute an additional pillar of our study. Political orientations may influence healthcare policies, accessibility to medical services, and public health initiatives, thereby contributing to variations in health outcomes among state residents.
The educational landscape, shaped by political decisions on funding, curriculum, and access, can significantly impact aspects of our lives. Access to education can lead to social benefits, increase analytical skills, improve intellectual capital, and change the constraints and opportunities of our lives (2011).
In synthesizing these indicators, our study strives to bridge the gap between political leanings and the lived experiences of residents. By focusing on these concrete aspects rather than delving into technical details, we aim to provide a practical foundation for understanding the potential impact of political affiliations on the quality of life in different states. As we proceed to analyze and interpret the data, our findings hold the promise of offering actionable insights for individuals making residential decisions and policymakers shaping the future of their communities.
National voting data was collected from the Federal Election Commission (FEC), which hosts election and voting information for every federal election since 1982. For our analysis we used the 2020 data, which is available in Portable Document Format (PDF) and Excel (.xlsx) formats. We used the .xlsx version and converted it to a Comma-Separated Values (CSV) file. The data we used for our project can be found on tab 9, 2020 Pres General Results, in the Excel file (FEC). This data is used to determine the states’ political leanings. (2020)
FEC | Election and voting information
Population data was taken from the United States Census Bureau. To match the voting data we used the State Population Totals and Components of Change: 2020-2022. The data is provided in a CSV file that is easy to import into the data cleaning workflow. This data was used to normalize other data we gathered. (2022)
State Population Totals: 2020-2022 (census.gov)
Income data was also obtained from the U.S. Census Bureau, which collects and maintains numerous sets of economic and income data for the United States. We used Table H-8, median household income by state, as the information was the cleanest for our use case. Table H-8 is an Excel document with numerous header rows that do not contain any data. (2022)
Historical Income Tables: Households (census.gov)
We scraped crime data from the FBI’s Crime Data Explorer (CDE). The CDE permits querying the number of crimes by state, year, and type of crime (property crime, larceny, etc.). For our analysis, we retrieved crime data from 2020-2022 for each state, constructing two separate CSV files, one for violent crimes and one for non-violent crimes. This data, coupled with the state population data, yields the crime rates for each state for a given year. (2022)
FBI Crime Data API (cde.ucr.cjis.gov)
Heart disease deaths data was collected from the Centers for Disease Control and Prevention (CDC). The CSV obtained from their site lists the year, state, number of deaths from heart disease, the rate of heart disease deaths per 100k people, and a URL providing more information for each state. The number of deaths from heart disease was one of the features we used to determine the health of a state. Since the data is reported as total deaths per state, it was normalized by state population during cleaning. Heart disease was chosen since it is a leading cause of death in America and a reasonable indicator of how people take care of themselves. (2022)
Stats of the States - Heart Disease Mortality (cdc.gov)
COVID-19 data was gathered from the New York Times GitHub repository, which hosts multiple datasets with COVID-19 data. We used the us-states.csv file, which contains the date (the single day the information is for), state, fips, cases, and deaths. The repository contains a README.md that describes their methods for collecting and organizing the data. The file was a CSV and easy to import into the workflow during data cleaning. Both COVID-19 cases and deaths were used as features for determining the health of a state. These variables show how a state reacts to rising health risks. (2023)
NYTimes/covid-19-data: A repository of data on coronavirus cases and deaths in the U.S. (github.com)
Education data was acquired through the United States Census Bureau API. The data comes from the American Community Survey (ACS), an ongoing survey that provides data every year about the social, economic, demographic, and housing characteristics of the U.S. population. We focused on the number of people with a bachelor’s degree, associate’s degree, GED or equivalent, and some college for each state from 2021-2022. (2022)
State Education Totals: 2021-2022 (census.gov)
We retrieved data on the number of White, Black, Asian, and Hispanic people per state from the Census API and ACS, similar to the education data. Again, we only chose to use the data from 2021-2022. (2022)
State Race Totals: 2021-2022 (census.gov)
This section of the report will act as a journal of our data cleaning process. We will show the raw data, the code used to clean it, and then the modified data.
After loading the relevant libraries and reading in the data, let’s take a look at the national voting data.
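The import step itself is not shown in the report; below is a minimal sketch, assuming tab 9 of the FEC workbook was already saved as a CSV under a hypothetical file name (readxl::read_excel() could alternatively read the .xlsx sheet directly).
#load the converted FEC results (hypothetical file name; the actual path is not given in the report)
df <- read.csv("fec_2020_pres_general_results.csv")
cat("Shape of Data:", dim(df))
head(df)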
Shape of Data: 681 13
X...1 | FEC.ID | STATE | STATE.ABBREVIATION | GENERAL.ELECTION.DATE | FIRST.NAME | LAST.NAME | LAST.NAME...FIRST | TOTAL.VOTES | PARTY | GENERAL.RESULTS | GENERAL.. | TOTAL.VOTES.. | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<int> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | |
1 | 2 | ||||||||||||
2 | 3 | P80001571 | Alabama | AL | 11/3/20 | Donald J. | Trump | Trump, Donald J. | R | 1,441,170 | 62.03% | ||
3 | 4 | P80000722 | Alabama | AL | 11/3/20 | Joseph R. | Biden | Biden, Joseph R. | D | 849,624 | 36.57% | ||
4 | 5 | P00013524 | Alabama | AL | 11/3/20 | Jo | Jorgensen | Jorgensen, Jo | IND | 25,176 | 1.08% | ||
5 | 6 | n/a | Alabama | AL | 11/3/20 | Scattered | Scattered | W | 7,312 | 0.31% | |||
6 | 7 | n/a | Alabama | AL | 11/3/20 | Total State Votes: | 2,323,282 |
We can see that the above data has multiple rows per state, with rows for each candidate and total state votes. Let’s clean this data. We can start by only keeping votes for the two major parties and then create two new columns (one for Candidate and another for Total Votes for that candidate). The data then needs to be converted from long to wide so that each row/observation represents a state (including DC). Finally, let’s create some new columns that will be useful for our analysis: Total Votes Overall, Vote Outcome, and Percentage Democratic.
#VOTING DATA
# remove all votes not for trump or biden
df <- df[df$PARTY %in% c("D", "R"), ]
# create columns for biden and trump votes for each state
df <- df %>%
mutate(VoteTotal = as.numeric(gsub(",", "", GENERAL.RESULTS)),
Candidate = ifelse(PARTY == "R", "Trump", ifelse(PARTY == "D", "Biden", NA))) %>%
group_by(STATE, Candidate) %>%
summarize(TotalVotes = sum(VoteTotal))
# convert from long to wide
df_wide <- df %>%
spread(key = Candidate, value = TotalVotes, fill = 0)
# create columns for total votes, outcome, and percentage democratic
df_result <- df_wide %>%
mutate(TotalVotes = Biden + Trump,
Outcome = ifelse(Biden > Trump, "Democratic", "Republican"),
Percentage_Democratic = round((Biden / TotalVotes) * 100, 2))
cat("Shape of Data:", dim(df_result))
head(df_result)
`summarise()` has grouped output by 'STATE'. You can override using the
`.groups` argument.
Shape of Data: 51 6
STATE | Biden | Trump | TotalVotes | Outcome | Percentage_Democratic |
---|---|---|---|---|---|
<chr> | <dbl> | <dbl> | <dbl> | <chr> | <dbl> |
Alabama | 849624 | 1441170 | 2290794 | Republican | 37.09 |
Alaska | 153778 | 189951 | 343729 | Republican | 44.74 |
Arizona | 1672143 | 1661686 | 3333829 | Democratic | 50.16 |
Arkansas | 423932 | 760647 | 1184579 | Republican | 35.79 |
California | 11110639 | 6006518 | 17117157 | Democratic | 64.91 |
Colorado | 1804352 | 1364607 | 3168959 | Democratic | 56.94 |
Now we can read in the population data that was acquired from the United States Census Bureau and see what it looks like.
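As before, the import is not shown; a minimal sketch, assuming the Census population totals were saved locally under a hypothetical file name:
# hypothetical file name for the Census state population totals (2020-2022) CSV
pop <- read.csv("state_population_2020_2022.csv")
cat("Shape of Data:", dim(pop))
head(pop)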
Shape of Data: 51 2
X...State | Population.Estimate | |
---|---|---|
<chr> | <chr> | |
1 | .Alabama | 5,031,362 |
2 | .Alaska | 732,923 |
3 | .Arizona | 7,179,943 |
4 | .Arkansas | 3,014,195 |
5 | .California | 39,501,653 |
6 | .Colorado | 5,784,865 |
Let’s remove the period before the state name, convert the population column to numeric, and check the sum of all populations to verify the accuracy of the data. Let’s then merge the cleaned data together by STATE.
#POPULATION DATA
# Remove the dot before the state name
pop <- pop %>%
mutate(STATE = gsub("\\.", "", X...State))
# Convert Population.Estimate to numeric after removing commas
pop$Population.Estimate <- as.numeric(gsub(",", "", pop$Population.Estimate))
pop <- select(pop, -X...State)
sum(pop$Population.Estimate)
# Merge the two datasets
df_merged <- left_join(df_result, pop, by = "STATE")
cat("Shape of Data:", dim(df_merged))
head(df_merged)
Shape of Data: 51 7
STATE | Biden | Trump | TotalVotes | Outcome | Percentage_Democratic | Population.Estimate |
---|---|---|---|---|---|---|
<chr> | <dbl> | <dbl> | <dbl> | <chr> | <dbl> | <dbl> |
Alabama | 849624 | 1441170 | 2290794 | Republican | 37.09 | 5031362 |
Alaska | 153778 | 189951 | 343729 | Republican | 44.74 | 732923 |
Arizona | 1672143 | 1661686 | 3333829 | Democratic | 50.16 | 7179943 |
Arkansas | 423932 | 760647 | 1184579 | Republican | 35.79 | 3014195 |
California | 11110639 | 6006518 | 17117157 | Democratic | 64.91 | 39501653 |
Colorado | 1804352 | 1364607 | 3168959 | Democratic | 56.94 | 5784865 |
Let’s read in the income data that was acquired from the U.S. Census Bureau and see what it looks like.
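The income import is likewise not shown; a minimal sketch, assuming the state and median income columns of Table H-8 were extracted to a CSV with a hypothetical file name:
# hypothetical file name: state and median household income extracted from Table H-8
income <- read.csv("median_household_income_h08.csv")
cat("Shape of Data:", dim(income))
head(income)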
Shape of Data: 52 2
X...State | Median.income | |
---|---|---|
<chr> | <chr> | |
1 | United States | 68,010 |
2 | Alabama | 54,690 |
3 | Alaska | 74,750 |
4 | Arizona | 67,090 |
5 | Arkansas | 50,780 |
6 | California | 77,650 |
We need to remove the United States row and rename the state column so that it matches the merged data. Below is the newly merged data.
#INCOME DATA
# remove united states row and rename the state column
income <- income %>%
filter(X...State != "United States") %>%
rename(STATE = X...State)
# merge the income data with the merged data
df_merged <- left_join(df_merged, income, by = "STATE")
cat("Shape of Data:", dim(df_merged))
head(df_merged)
Shape of Data: 51 8
STATE | Biden | Trump | TotalVotes | Outcome | Percentage_Democratic | Population.Estimate | Median.income |
---|---|---|---|---|---|---|---|
<chr> | <dbl> | <dbl> | <dbl> | <chr> | <dbl> | <dbl> | <chr> |
Alabama | 849624 | 1441170 | 2290794 | Republican | 37.09 | 5031362 | 54,690 |
Alaska | 153778 | 189951 | 343729 | Republican | 44.74 | 732923 | 74,750 |
Arizona | 1672143 | 1661686 | 3333829 | Democratic | 50.16 | 7179943 | 67,090 |
Arkansas | 423932 | 760647 | 1184579 | Republican | 35.79 | 3014195 | 50,780 |
California | 11110639 | 6006518 | 17117157 | Democratic | 64.91 | 39501653 | 77,650 |
Colorado | 1804352 | 1364607 | 3168959 | Democratic | 56.94 | 5784865 | 83,780 |
Let’s read in the crime data and see what it looks like.
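The crime import is not shown either; a minimal sketch, assuming the two CSVs built from the Crime Data Explorer were saved under hypothetical file names and share the state and year columns shown below:
# hypothetical file names for the two CSVs built from the CDE queries
violent <- read.csv("violent_crimes_2020_2022.csv")
property <- read.csv("property_crimes_2020_2022.csv")
# join them so each state-year row carries both crime counts
crimes <- left_join(violent, property, by = c("X...State", "Year"))
cat("Shape of Data:", dim(crimes))
head(crimes)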
Shape of Data: 153 4
X...State | Year | Violent_Crimes | Property_Crimes | |
---|---|---|---|---|
<chr> | <int> | <int> | <int> | |
1 | AL | 2020 | 22322 | 105161 |
2 | AL | 2021 | 17590 | 74271 |
3 | AL | 2022 | 20759 | 88240 |
4 | AK | 2020 | 6126 | 16528 |
5 | AK | 2021 | 5573 | 13456 |
6 | AK | 2022 | 5567 | 13124 |
Again, we must rename the state column to match the rest of the data. Additionally, this column now contains state abbreviations, so we must map them to the full state names, which can be done with the following code. Once done, we can drop the original state column. Similar to before, the data is in long format and we must convert it to wide so that each row (unit of observation) is a state and there is a column for each crime type for different years. Once done, we can merge the crime data with the rest of the data.
#CRIME DATA
# rename the state column
colnames(crimes)[colnames(crimes) == "X...State"] <- "STATE"
# Create a mapping between state abbreviations and full state names
state_mapping <- data.frame(
Abbreviation = c(
"AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DC", "DE", "FL", "GA",
"HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD",
"MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ",
"NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC",
"SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY"
),
StateName = c(
"Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado",
"Connecticut", "District of Columbia", "Delaware", "Florida", "Georgia", "Hawaii", "Idaho",
"Illinois", "Indiana", "Iowa", "Kansas", "Kentucky", "Louisiana", "Maine",
"Maryland", "Massachusetts", "Michigan", "Minnesota", "Mississippi", "Missouri",
"Montana", "Nebraska", "Nevada", "New Hampshire", "New Jersey", "New Mexico",
"New York", "North Carolina", "North Dakota", "Ohio", "Oklahoma", "Oregon",
"Pennsylvania", "Rhode Island", "South Carolina", "South Dakota", "Tennessee",
"Texas", "Utah", "Vermont", "Virginia", "Washington", "West Virginia", "Wisconsin", "Wyoming"
)
)
# convert STATE column to character
crimes$STATE <- as.character(crimes$STATE)
# Merge crime_data with state_mapping to get full state names
crimes <- left_join(crimes, state_mapping, by = c("STATE" = "Abbreviation"))
# Drop the original 'STATE' column
crimes <- select(crimes, -STATE)
# change column name
colnames(crimes)[colnames(crimes) == "StateName"] <- "STATE"
crimes <- distinct(crimes, STATE, Year, .keep_all = TRUE)
# convert from long to wide
crimes_wide <- crimes %>%
pivot_wider(
names_from = Year,
values_from = c(Violent_Crimes, Property_Crimes)
)
# merge the crimes data with the merged data
df_merged <- left_join(df_merged, crimes_wide, by = "STATE")
cat("Shape of Data:", dim(df_merged))
head(df_merged)
Shape of Data: 51 14
STATE | Biden | Trump | TotalVotes | Outcome | Percentage_Democratic | Population.Estimate | Median.income | Violent_Crimes_2020 | Violent_Crimes_2021 | Violent_Crimes_2022 | Property_Crimes_2020 | Property_Crimes_2021 | Property_Crimes_2022 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<chr> | <dbl> | <dbl> | <dbl> | <chr> | <dbl> | <dbl> | <chr> | <int> | <int> | <int> | <int> | <int> | <int> |
Alabama | 849624 | 1441170 | 2290794 | Republican | 37.09 | 5031362 | 54,690 | 22322 | 17590 | 20759 | 105161 | 74271 | 88240 |
Alaska | 153778 | 189951 | 343729 | Republican | 44.74 | 732923 | 74,750 | 6126 | 5573 | 5567 | 16528 | 13456 | 13124 |
Arizona | 1672143 | 1661686 | 3333829 | Democratic | 50.16 | 7179943 | 67,090 | 35980 | 30922 | 31754 | 165323 | 153641 | 151421 |
Arkansas | 423932 | 760647 | 1184579 | Republican | 35.79 | 3014195 | 50,780 | 20363 | 21271 | 19654 | 79200 | 76580 | 74664 |
California | 11110639 | 6006518 | 17117157 | Democratic | 64.91 | 39501653 | 77,650 | 174026 | 188343 | 194935 | 842054 | 847567 | 914517 |
Colorado | 1804352 | 1364607 | 3168959 | Democratic | 56.94 | 5784865 | 83,780 | 24570 | 27916 | 28759 | 164582 | 182850 | 183816 |
Let’s read in the health data and see what it looks like. We’ll start with data on heart disease that was acquired from the CDC.
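A minimal sketch of the heart disease import, assuming a hypothetical local file name and that the downloaded CSV uses the column names shown below (the URL column described earlier is dropped):
# hypothetical file name for the CDC heart disease mortality CSV
hearts <- read.csv("cdc_heart_disease_mortality.csv")
# keep only the columns used later; the URL column is not needed
hearts <- select(hearts, YEAR, STATE, RATE, DEATHS)
cat("Shape of Data:", dim(hearts))
head(hearts)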
Shape of Data: 450 4
YEAR | STATE | RATE | DEATHS | |
---|---|---|---|---|
<int> | <chr> | <dbl> | <chr> | |
1 | 2021 | AL | 247.5 | 15173 |
2 | 2021 | AK | 154.7 | 1011 |
3 | 2021 | AZ | 158.3 | 14550 |
4 | 2021 | AR | 231.0 | 8547 |
5 | 2021 | CA | 147.8 | 65471 |
6 | 2021 | CO | 135.1 | 8081 |
Similar to before, we must alter the state column to match the rest of the data. As the data uses abbreviations instead of state names, we must once again map them to the full state names. We will also rename some of the other columns for clarity. Furthermore, we only want to keep years 2020 and onwards. As done before, we must convert the data from long to wide, with each row representing one unit of observation (a state). Let’s then merge the heart disease deaths and rates with the rest of the data. Finally, we will rename some of the columns in the merged dataset for clarity.
#HEART DISEASE DATA
# convert STATE column to character
hearts$STATE <- as.character(hearts$STATE)
# Merge crime_data with state_mapping to get full state names
hearts <- left_join(hearts, state_mapping, by = c("STATE" = "Abbreviation"))
# Drop the original 'STATE' column
hearts <- select(hearts, -STATE)
# renaming columns
hearts <- hearts %>%
rename(STATE = StateName, DeathsPer100k = RATE)
# keep only data from 2020 onwards
hearts <- hearts %>%
filter(YEAR >= 2020)
# converting hearts from long to wide
hearts_wide <- hearts %>%
pivot_wider(
names_from = YEAR,
values_from = c(DEATHS, DeathsPer100k),
names_glue = "{.value}{YEAR}"
)
# merge the heart disease mortality data with the merged data
df_merged <- left_join(df_merged, hearts_wide, by = "STATE")
# renaming columns for clarity
df_merged <- df_merged %>%
rename(BidenVotes = Biden, TrumpVotes = Trump, VoteOutcome = Outcome, DemPercentage = Percentage_Democratic, Population = Population.Estimate,
HeartDeaths2021 = DEATHS2021, HeartDeaths2020 = DEATHS2020, HeartDeathsPer100k2021 = DeathsPer100k2021, HeartDeathsPer100k2020 = DeathsPer100k2020)
cat("Shape of Data:", dim(df_merged))
head(df_merged)
Shape of Data: 51 18
STATE | BidenVotes | TrumpVotes | TotalVotes | VoteOutcome | DemPercentage | Population | Median.income | Violent_Crimes_2020 | Violent_Crimes_2021 | Violent_Crimes_2022 | Property_Crimes_2020 | Property_Crimes_2021 | Property_Crimes_2022 | HeartDeaths2021 | HeartDeaths2020 | HeartDeathsPer100k2021 | HeartDeathsPer100k2020 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<chr> | <dbl> | <dbl> | <dbl> | <chr> | <dbl> | <dbl> | <chr> | <int> | <int> | <int> | <int> | <int> | <int> | <chr> | <chr> | <dbl> | <dbl> |
Alabama | 849624 | 1441170 | 2290794 | Republican | 37.09 | 5031362 | 54,690 | 22322 | 17590 | 20759 | 105161 | 74271 | 88240 | 15173 | 14739 | 247.5 | 237.5 |
Alaska | 153778 | 189951 | 343729 | Republican | 44.74 | 732923 | 74,750 | 6126 | 5573 | 5567 | 16528 | 13456 | 13124 | 1011 | 915 | 154.7 | 139.8 |
Arizona | 1672143 | 1661686 | 3333829 | Democratic | 50.16 | 7179943 | 67,090 | 35980 | 30922 | 31754 | 165323 | 153641 | 151421 | 14550 | 14196 | 158.3 | 144.8 |
Arkansas | 423932 | 760647 | 1184579 | Republican | 35.79 | 3014195 | 50,780 | 20363 | 21271 | 19654 | 79200 | 76580 | 74664 | 8547 | 8621 | 231.0 | 222.5 |
California | 11110639 | 6006518 | 17117157 | Democratic | 64.91 | 39501653 | 77,650 | 174026 | 188343 | 194935 | 842054 | 847567 | 914517 | 65471 | 66538 | 147.8 | 144.0 |
Colorado | 1804352 | 1364607 | 3168959 | Democratic | 56.94 | 5784865 | 83,780 | 24570 | 27916 | 28759 | 164582 | 182850 | 183816 | 8081 | 8023 | 135.1 | 128.1 |
Let’s read in the COVID-19 data (acquired from The New York Times) and see what it looks like.
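One way to pull the file is directly from the repository’s raw URL (the report may instead have used a downloaded copy):
# read us-states.csv straight from the NYT covid-19-data repository
covid <- read.csv("https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv")
cat("Shape of Data:", dim(covid))
head(covid)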
Shape of Data: 61942 5
date | state | fips | cases | deaths | |
---|---|---|---|---|---|
<chr> | <chr> | <int> | <int> | <int> | |
1 | 1/21/2020 | Washington | 53 | 1 | 0 |
2 | 1/22/2020 | Washington | 53 | 1 | 0 |
3 | 1/23/2020 | Washington | 53 | 1 | 0 |
4 | 1/24/2020 | Illinois | 17 | 1 | 0 |
5 | 1/24/2020 | Washington | 53 | 1 | 0 |
6 | 1/25/2020 | California | 6 | 1 | 0 |
The above data counts the total cumulative cases and deaths for each state, so we can take the maximum value for each state to find its total cases and deaths. This is done via the code below. We can then merge this dataset with the rest of the data. It is important to use a left join as we only want to keep the states that are in the merged dataframe. Additionally, we will convert the relevant columns to numeric and then create new columns that express COVID-19 cases and deaths as percentages of the state population. Below is the newly merged data.
#COVID DATA
# getting total cases and deaths per state
covid <- covid %>%
group_by(state) %>%
summarize(
covid_cases = max(cases, na.rm = TRUE),
covid_deaths = max(deaths, na.rm = TRUE)
)
#rename the state column
covid <- covid %>%
rename(STATE = state)
# merge the covid data with the merged data
df_merged <- left_join(df_merged, covid, by = "STATE")
# Convert relevant columns to numeric
df_merged$covid_cases <- as.numeric(df_merged$covid_cases)
df_merged$covid_deaths <- as.numeric(df_merged$covid_deaths)
# Create new columns for COVID cases and deaths as percentages
df_merged$covid_cases_percentage <- round((df_merged$covid_cases / df_merged$Population) * 100, 2)
df_merged$covid_deaths_percentage <- round((df_merged$covid_deaths / df_merged$Population) * 100, 2)
cat("Shape of Data:", dim(df_merged))
head(df_merged)
Shape of Data: 51 22
STATE | BidenVotes | TrumpVotes | TotalVotes | VoteOutcome | DemPercentage | Population | Median.income | Violent_Crimes_2020 | Violent_Crimes_2021 | ... | Property_Crimes_2021 | Property_Crimes_2022 | HeartDeaths2021 | HeartDeaths2020 | HeartDeathsPer100k2021 | HeartDeathsPer100k2020 | covid_cases | covid_deaths | covid_cases_percentage | covid_deaths_percentage |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<chr> | <dbl> | <dbl> | <dbl> | <chr> | <dbl> | <dbl> | <chr> | <int> | <int> | ... | <int> | <int> | <chr> | <chr> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
Alabama | 849624 | 1441170 | 2290794 | Republican | 37.09 | 5031362 | 54,690 | 22322 | 17590 | ... | 74271 | 88240 | 15173 | 14739 | 247.5 | 237.5 | 1648385 | 21631 | 32.76 | 0.43 |
Alaska | 153778 | 189951 | 343729 | Republican | 44.74 | 732923 | 74,750 | 6126 | 5573 | ... | 13456 | 13124 | 1011 | 915 | 154.7 | 139.8 | 308893 | 1438 | 42.15 | 0.20 |
Arizona | 1672143 | 1661686 | 3333829 | Democratic | 50.16 | 7179943 | 67,090 | 35980 | 30922 | ... | 153641 | 151421 | 14550 | 14196 | 158.3 | 144.8 | 2451062 | 33190 | 34.14 | 0.46 |
Arkansas | 423932 | 760647 | 1184579 | Republican | 35.79 | 3014195 | 50,780 | 20363 | 21271 | ... | 76580 | 74664 | 8547 | 8621 | 231.0 | 222.5 | 1008303 | 13068 | 33.45 | 0.43 |
California | 11110639 | 6006518 | 17117157 | Democratic | 64.91 | 39501653 | 77,650 | 174026 | 188343 | ... | 847567 | 914517 | 65471 | 66538 | 147.8 | 144.0 | 12169158 | 104277 | 30.81 | 0.26 |
Colorado | 1804352 | 1364607 | 3168959 | Democratic | 56.94 | 5784865 | 83,780 | 24570 | 27916 | ... | 182850 | 183816 | 8081 | 8023 | 135.1 | 128.1 | 1771010 | 14245 | 30.61 | 0.25 |
Let’s read in the education data that was acquired through the U.S. Census Bureau API and take a look at it.
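The retrieval code is not shown in the report; below is a sketch of one way to request the full DP02 profile group from the ACS 1-year Data Profile endpoint. The endpoint, survey vintage, and the CENSUS_API_KEY environment variable are assumptions, not details taken from the original workflow.
library(httr)
library(jsonlite)
# request an ACS Data Profile group for all states and return it as a data frame
acs_profile <- function(group, year = 2022) {
  url <- paste0("https://api.census.gov/data/", year, "/acs/acs1/profile",
                "?get=group(", group, ")&for=state:*",
                "&key=", Sys.getenv("CENSUS_API_KEY"))
  raw <- fromJSON(content(GET(url), as = "text"))
  out <- as.data.frame(raw[-1, , drop = FALSE], stringsAsFactors = FALSE)
  colnames(out) <- raw[1, ]  # first row of the response holds the variable names
  out
}
# DP02 is the profile group containing the educational attainment counts
education <- acs_profile("DP02")
cat("Shape of Data:", dim(education))
head(education)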
Shape of Data: 51 1235
X...DP02_0001E | DP02_0001EA | DP02_0001M | DP02_0001MA | DP02_0001PE | DP02_0001PEA | DP02_0001PM | DP02_0001PMA | DP02_0002E | DP02_0002EA | ... | DP02_0154EA | DP02_0154M | DP02_0154MA | DP02_0154PE | DP02_0154PEA | DP02_0154PM | DP02_0154PMA | GEO_ID | NAME | state | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<int> | <lgl> | <int> | <lgl> | <int> | <lgl> | <int> | <chr> | <int> | <lgl> | ... | <lgl> | <int> | <lgl> | <dbl> | <lgl> | <dbl> | <lgl> | <chr> | <chr> | <int> | |
1 | 1967559 | NA | 10527 | NA | 1967559 | NA | -888888888 | (X) | 904392 | NA | ... | NA | 12886 | NA | 85.0 | NA | 0.5 | NA | 0400000US01 | Alabama | 1 |
2 | 2817723 | NA | 10850 | NA | 2817723 | NA | -888888888 | (X) | 1344242 | NA | ... | NA | 12165 | NA | 91.4 | NA | 0.3 | NA | 0400000US04 | Arizona | 4 |
3 | 1183675 | NA | 7882 | NA | 1183675 | NA | -888888888 | (X) | 565893 | NA | ... | NA | 10272 | NA | 85.5 | NA | 0.6 | NA | 0400000US05 | Arkansas | 5 |
4 | 13429063 | NA | 19170 | NA | 13429063 | NA | -888888888 | (X) | 6517082 | NA | ... | NA | 23412 | NA | 92.9 | NA | 0.1 | NA | 0400000US06 | California | 6 |
5 | 2313042 | NA | 8099 | NA | 2313042 | NA | -888888888 | (X) | 1124072 | NA | ... | NA | 10722 | NA | 93.0 | NA | 0.3 | NA | 0400000US08 | Colorado | 8 |
6 | 1428313 | NA | 5900 | NA | 1428313 | NA | -888888888 | (X) | 664848 | NA | ... | NA | 8523 | NA | 92.2 | NA | 0.4 | NA | 0400000US09 | Connecticut | 9 |
Using the code below, we can select only the specific columns that we want and rename them for clarity. Then we can merge this education data with the larger dataframe.
#EDUCATION DATA
#rename and select relevant columns
education <- education %>%
rename(STATE = NAME, HighSchoolGraduates = DP02_0062E, BachelorsDegree = DP02_0065E, AssociatesDegree = DP02_0064E, SomeCollege = DP02_0063E)
education2 <- education %>% select(STATE, HighSchoolGraduates, BachelorsDegree, AssociatesDegree, SomeCollege)
# merge the education data with the merged data
df_merged <- left_join(df_merged, education2, by = "STATE")
cat("Shape of Data:", dim(df_merged))
head(df_merged)
Shape of Data: 51 26
STATE | BidenVotes | TrumpVotes | TotalVotes | VoteOutcome | DemPercentage | Population | Median.income | Violent_Crimes_2020 | Violent_Crimes_2021 | ... | HeartDeathsPer100k2021 | HeartDeathsPer100k2020 | covid_cases | covid_deaths | covid_cases_percentage | covid_deaths_percentage | HighSchoolGraduates | BachelorsDegree | AssociatesDegree | SomeCollege |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<chr> | <dbl> | <dbl> | <dbl> | <chr> | <dbl> | <dbl> | <chr> | <int> | <int> | ... | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <int> | <int> | <int> | <int> |
Alabama | 849624 | 1441170 | 2290794 | Republican | 37.09 | 5031362 | 54,690 | 22322 | 17590 | ... | 247.5 | 237.5 | 1648385 | 21631 | 32.76 | 0.43 | 1079285 | 572276 | 303028 | 705662 |
Alaska | 153778 | 189951 | 343729 | Republican | 44.74 | 732923 | 74,750 | 6126 | 5573 | ... | 154.7 | 139.8 | 308893 | 1438 | 42.15 | 0.20 | 135473 | 101422 | 40276 | 118250 |
Arizona | 1672143 | 1661686 | 3333829 | Democratic | 50.16 | 7179943 | 67,090 | 35980 | 30922 | ... | 158.3 | 144.8 | 2451062 | 33190 | 34.14 | 0.46 | 1168057 | 985673 | 467305 | 1180491 |
Arkansas | 423932 | 760647 | 1184579 | Republican | 35.79 | 3014195 | 50,780 | 20363 | 21271 | ... | 231.0 | 222.5 | 1008303 | 13068 | 33.45 | 0.43 | 695062 | 324137 | 161542 | 435235 |
California | 11110639 | 6006518 | 17117157 | Democratic | 64.91 | 39501653 | 77,650 | 174026 | 188343 | ... | 147.8 | 144.0 | 12169158 | 104277 | 30.81 | 0.26 | 5578997 | 5958030 | 2120275 | 5287901 |
Colorado | 1804352 | 1364607 | 3168959 | Democratic | 56.94 | 5784865 | 83,780 | 24570 | 27916 | ... | 135.1 | 128.1 | 1771010 | 14245 | 30.61 | 0.25 | 814373 | 1107309 | 334157 | 793438 |
We can add in race data using the same process as above.
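Assuming the same retrieval approach sketched for the education data, the race counts can be pulled from the DP05 (demographic and housing estimates) profile group:
# reuse the assumed acs_profile() helper from the education step
race <- acs_profile("DP05")
cat("Shape of Data:", dim(race))
head(race)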
Shape of Data: 52 731
DP05_0001E | DP05_0001EA | DP05_0001M | DP05_0001MA | DP05_0001PE | DP05_0001PEA | DP05_0001PM | DP05_0001PMA | DP05_0002E | DP05_0002EA | ... | DP05_0091EA | DP05_0091M | DP05_0091MA | DP05_0091PE | DP05_0091PEA | DP05_0091PM | DP05_0091PMA | GEO_ID | NAME | state | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<int> | <lgl> | <int> | <chr> | <int> | <lgl> | <int> | <chr> | <int> | <lgl> | ... | <lgl> | <int> | <lgl> | <dbl> | <lgl> | <dbl> | <lgl> | <chr> | <chr> | <int> | |
1 | 5074296 | NA | -555555555 | ***** | 5074296 | NA | -888888888 | (X) | 2461248 | NA | ... | NA | 4729 | NA | 52.4 | NA | 0.1 | NA | 0400000US01 | Alabama | 1 |
2 | 733583 | NA | -555555555 | ***** | 733583 | NA | -888888888 | (X) | 385667 | NA | ... | NA | 2046 | NA | 46.8 | NA | 0.3 | NA | 0400000US02 | Alaska | 2 |
3 | 7359197 | NA | -555555555 | ***** | 7359197 | NA | -888888888 | (X) | 3678381 | NA | ... | NA | 8862 | NA | 50.4 | NA | 0.1 | NA | 0400000US04 | Arizona | 4 |
4 | 3045637 | NA | -555555555 | ***** | 3045637 | NA | -888888888 | (X) | 1504488 | NA | ... | NA | 3663 | NA | 51.3 | NA | 0.1 | NA | 0400000US05 | Arkansas | 5 |
5 | 39029342 | NA | -555555555 | ***** | 39029342 | NA | -888888888 | (X) | 19536425 | NA | ... | NA | 28487 | NA | 50.6 | NA | 0.1 | NA | 0400000US06 | California | 6 |
6 | 5839926 | NA | -555555555 | ***** | 5839926 | NA | -888888888 | (X) | 2960896 | NA | ... | NA | 7496 | NA | 49.4 | NA | 0.1 | NA | 0400000US08 | Colorado | 8 |
#RACE DATA
# rename and select relevant columns
race <- race %>%
rename(STATE = NAME, White = DP05_0037E, Black = DP05_0038E, Asian = DP05_0044E, Hispanic_Latino = DP05_0073E)
race2 <- race %>% select(STATE, White, Black, Asian, Hispanic_Latino)
# merge the race data with the merged data
df_merged <- left_join(df_merged, race2, by = "STATE")
cat("Shape of Data:", dim(df_merged))
head(df_merged)
Shape of Data: 51 30
STATE | BidenVotes | TrumpVotes | TotalVotes | VoteOutcome | DemPercentage | Population | Median.income | Violent_Crimes_2020 | Violent_Crimes_2021 | ... | covid_cases_percentage | covid_deaths_percentage | HighSchoolGraduates | BachelorsDegree | AssociatesDegree | SomeCollege | White | Black | Asian | Hispanic_Latino |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<chr> | <dbl> | <dbl> | <dbl> | <chr> | <dbl> | <dbl> | <chr> | <int> | <int> | ... | <dbl> | <dbl> | <int> | <int> | <int> | <int> | <int> | <int> | <int> | <int> |
Alabama | 849624 | 1441170 | 2290794 | Republican | 37.09 | 5031362 | 54,690 | 22322 | 17590 | ... | 32.76 | 0.43 | 1079285 | 572276 | 303028 | 705662 | 3302528 | 1302035 | 78893 | 246477 |
Alaska | 153778 | 189951 | 343729 | Republican | 44.74 | 732923 | 74,750 | 6126 | 5573 | ... | 42.15 | 0.20 | 135473 | 101422 | 40276 | 118250 | 437533 | 22202 | 46184 | 56491 |
Arizona | 1672143 | 1661686 | 3333829 | Democratic | 50.16 | 7179943 | 67,090 | 35980 | 30922 | ... | 34.14 | 0.46 | 1168057 | 985673 | 467305 | 1180491 | 4254015 | 340760 | 266441 | 2388520 |
Arkansas | 423932 | 760647 | 1184579 | Republican | 35.79 | 3014195 | 50,780 | 20363 | 21271 | ... | 33.45 | 0.43 | 695062 | 324137 | 161542 | 435235 | 2103849 | 437331 | 48921 | 255416 |
California | 11110639 | 6006518 | 17117157 | Democratic | 64.91 | 39501653 | 77,650 | 174026 | 188343 | ... | 30.81 | 0.26 | 5578997 | 5958030 | 2120275 | 5287901 | 15175598 | 2121422 | 6054038 | 15732184 |
Colorado | 1804352 | 1364607 | 3168959 | Democratic | 56.94 | 5784865 | 83,780 | 24570 | 27916 | ... | 30.61 | 0.25 | 814373 | 1107309 | 334157 | 793438 | 4106707 | 235519 | 190181 | 1314962 |
Normalizing data is a vital step in the data science life cycle as it ensures that all variables contribute equitably to analysis, irrespective of their initial scale. This process not only prevents biases and inaccuracies in statistical models or machine learning algorithms but also enhances clarity in understanding differences between units of observation. By placing these units on a comparable scale, normalization facilitates more meaningful comparisons and analyses.
The COVID data is already normalized. Thus, we will now normalize the crime, education, and race data per state. For the crime data, we take the average of violent crimes and property crimes for 2020-2022 and divide it by the state population. For the education and race data, we simply divide the counts by the state population. Finally, for the heart disease data, we take the average of heart deaths per 100k for 2020 and 2021. Below is an example of the newly normalized data, which is then saved to a CSV file for future use.
#normalize the crime data per state
df_merged$normalized_violent_crimes <- round(apply(df_merged[, c("Violent_Crimes_2020", "Violent_Crimes_2021", "Violent_Crimes_2022")], 1, mean, na.rm = TRUE) / df_merged$Population, 5)
df_merged$normalized_property_crimes <- round(apply(df_merged[, c("Property_Crimes_2020", "Property_Crimes_2021", "Property_Crimes_2022")], 1, mean, na.rm = TRUE) / df_merged$Population, 5)
# normalize education data per state
df_merged$normalized_high_school <- round(df_merged$HighSchoolGraduates / df_merged$Population, 5)
df_merged$normalized_bachelors <- round(df_merged$BachelorsDegree / df_merged$Population, 5)
df_merged$normalized_associates <- round(df_merged$AssociatesDegree / df_merged$Population, 5)
df_merged$normalized_some_college <- round(df_merged$SomeCollege / df_merged$Population, 5)
# finally, let's normalize the race data per state
df_merged$normalized_white <- round(df_merged$White / df_merged$Population, 5)
df_merged$normalized_black <- round(df_merged$Black / df_merged$Population, 5)
df_merged$normalized_asian <- round(df_merged$Asian / df_merged$Population, 5)
df_merged$normalized_hispanic_latino <- round(df_merged$Hispanic_Latino / df_merged$Population, 5)
#Averaged 2020 and 2021 heart death death data
df_merged$HeartDeathsPer100k <- ((df_merged$HeartDeathsPer100k2020 + df_merged$HeartDeathsPer100k2021 )/2)
cat("Shape of Data:", dim(df_merged))
head(df_merged)
#save to csv
write.csv(df_merged, file = 'cleaned.csv', row.names = FALSE)
Shape of Data: 51 41
STATE | BidenVotes | TrumpVotes | TotalVotes | VoteOutcome | DemPercentage | Population | Median.income | Violent_Crimes_2020 | Violent_Crimes_2021 | ... | normalized_property_crimes | normalized_high_school | normalized_bachelors | normalized_associates | normalized_some_college | normalized_white | normalized_black | normalized_asian | normalized_hispanic_latino | HeartDeathsPer100k |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<chr> | <dbl> | <dbl> | <dbl> | <chr> | <dbl> | <dbl> | <chr> | <int> | <int> | ... | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
Alabama | 849624 | 1441170 | 2290794 | Republican | 37.09 | 5031362 | 54,690 | 22322 | 17590 | ... | 0.01773 | 0.21451 | 0.11374 | 0.06023 | 0.14025 | 0.65639 | 0.25878 | 0.01568 | 0.04899 | 242.50 |
Alaska | 153778 | 189951 | 343729 | Republican | 44.74 | 732923 | 74,750 | 6126 | 5573 | ... | 0.01961 | 0.18484 | 0.13838 | 0.05495 | 0.16134 | 0.59697 | 0.03029 | 0.06301 | 0.07708 | 147.25 |
Arizona | 1672143 | 1661686 | 3333829 | Democratic | 50.16 | 7179943 | 67,090 | 35980 | 30922 | ... | 0.02184 | 0.16268 | 0.13728 | 0.06508 | 0.16442 | 0.59249 | 0.04746 | 0.03711 | 0.33267 | 151.55 |
Arkansas | 423932 | 760647 | 1184579 | Republican | 35.79 | 3014195 | 50,780 | 20363 | 21271 | ... | 0.02548 | 0.23060 | 0.10754 | 0.05359 | 0.14440 | 0.69798 | 0.14509 | 0.01623 | 0.08474 | 226.75 |
California | 11110639 | 6006518 | 17117157 | Democratic | 64.91 | 39501653 | 77,650 | 174026 | 188343 | ... | 0.02197 | 0.14123 | 0.15083 | 0.05368 | 0.13387 | 0.38418 | 0.05370 | 0.15326 | 0.39827 | 145.90 |
Colorado | 1804352 | 1364607 | 3168959 | Democratic | 56.94 | 5784865 | 83,780 | 24570 | 27916 | ... | 0.03061 | 0.14078 | 0.19141 | 0.05776 | 0.13716 | 0.70991 | 0.04071 | 0.03288 | 0.22731 | 131.60 |
Exploratory Data Analysis (EDA) is a fundamental starting point in data analysis, helping grasp the data’s characteristics, patterns, and possible outliers. It provides essential insights for making informed modeling decisions.
By analyzing the below data, we hope to gain an understanding of overall trends that can aid in refining our hypotheses.
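The code that produced the summary statistics and the box plot described below is not shown; a minimal sketch, assuming the cleaned data is reloaded from cleaned.csv and the median income column is converted to a number:
# reload the cleaned data and strip the thousands separators from Median.income
df <- read.csv("cleaned.csv")
df$Median.income <- as.numeric(gsub(",", "", df$Median.income))
# summary statistics for median household income
cat("BASIC STATISTICS:\n")
cat("----------------------\n")
summary(df$Median.income)
# box plot of the median income distribution
ggplot(df, aes(y = Median.income)) +
  geom_boxplot() +
  ylab("Median Income") +
  ggtitle("Distribution of Median Income Across States")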
BASIC STATISTICS:
----------------------
Min. 1st Qu. Median Mean 3rd Qu. Max.
45130 60360 67410 68660 78200 94790
This box plot shows the distribution of median income. The plot has a wide interquartile range, and the median value of 67,410 is indicated by the line inside the box.
ggplot(df, aes(x = reorder(STATE, Median.income), y = Median.income, fill = VoteOutcome)) +
geom_bar(stat = "identity") +
scale_fill_manual(values = c("Republican" = "red", "Democratic" = "blue"), name = "Political Affiliation") +
coord_flip() +
xlab("State") +
ylab("Median Income") +
ggtitle("Median Income by State (Colored by Political Affiliation)")
This is a bar chart showing median income by state, with the color indicating vote outcome. The states at the top, which have higher median incomes, tend to have voted Democratic, while states at the bottom with lower median incomes tend to have voted Republican.
ggplot(df, aes(x = VoteOutcome, y = Median.income, fill = VoteOutcome)) +
geom_boxplot() +
scale_fill_manual(values = c("Republican" = "red", "Democratic" = "blue"), name = "Political Affiliation") +
xlab("Political Affiliation") +
ylab("Median Income") +
ggtitle("Median Income by Political Affiliation") +
theme(legend.position = "none")
This is a boxplot comparing median income distributions between Democratic and Republican vote outcomes. We can see that the median income for Democratic areas is higher than for Republican areas, as indicated by the position of the median line within each box. Additionally, the interquartile range for Democratic areas seems slightly broader, suggesting more variability in income.
ggplot(df, aes(x = DemPercentage, y = Median.income, color = VoteOutcome)) +
geom_point() +
geom_smooth(method = "lm") +
scale_color_manual(values = c("Republican" = "red", "Democratic" = "blue"), name = "Political Affiliation") +
xlab("Percentage of Vote for Winning Party") +
ylab("Median Income") +
ggtitle("Median Income vs Percentage of Vote")
`geom_smooth()` using formula = 'y ~ x'
This is a scatter plot with a trend line comparing median income to the percentage of vote for the winning party, again distinguished by vote outcome. This plot suggests a positive correlation for both Democratic and Republican areas, meaning that as the percentage of vote for the winning party increases, the median income also tends to increase. The shaded areas around the trend lines are the confidence intervals, which seem to be broader for Republican areas, suggesting more variability or less certainty in the trend compared to the Democratic areas.
#show summary stats
cat("BASIC STATISTICS:\n")
cat("----------------------\n")
cat("Violet Crimes:")
summary(df$normalized_violent_crimes)
cat("\nProperty Crimes:")
summary(df$normalized_property_crimes)
ggplot(df, aes(x = VoteOutcome, y = normalized_violent_crimes, fill = VoteOutcome)) +
geom_boxplot() +
scale_fill_manual(values = c("Republican" = "red", "Democratic" = "blue")) +
labs(title = "Violent Crime Rates by Political Affiliation", x = "Political Affiliation", y = "Normalized Violent Crimes") +
theme(legend.position = "none")
BASIC STATISTICS:
----------------------
Violent Crimes:
Property Crimes:
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.001090 0.002755 0.003360 0.003835 0.004560 0.009410
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.01021 0.01517 0.01866 0.01937 0.02239 0.03801
The box plot shows that the median violent crime rate for Republican states is higher than that of Democratic states. Two Democratic states are notable outliers as well, which pull the Democratic mean higher even though it remains below that of the Republican states.
ggplot(df, aes(x = VoteOutcome, y = normalized_property_crimes, fill = VoteOutcome)) +
geom_boxplot() +
scale_fill_manual(values = c("Republican" = "red", "Democratic" = "blue")) +
labs(title = "Property Crime Rates by Political Affiliation", x = "Political Affiliation", y = "Normalized Property Crimes") +
theme(legend.position = "none")
The second box plot shows the distribution for non-violent crime rates. Again, we see that Democratic states overall have lower crime rates, with the property crimes making the difference even more obvious.
BASIC STATISTICS:
----------------------
STATE BidenVotes TrumpVotes VoteOutcome
Length:51 Min. : 73491 Min. : 18586 Length:51
Class :character 1st Qu.: 399258 1st Qu.: 473638 Class :character
Mode :character Median : 856034 Median :1020280 Mode :character
Mean : 1586213 Mean :1449562
3rd Qu.: 2375907 3rd Qu.:1791166
Max. :11110639 Max. :6006518
Population HeartDeathsPer100k2020 HeartDeathsPer100k2021
Min. : 577605 Min. :118.1 Min. :123.9
1st Qu.: 1820311 1st Qu.:146.3 1st Qu.:155.2
Median : 4507445 Median :162.4 Median :167.8
Mean : 6500226 Mean :169.0 Mean :177.1
3rd Qu.: 7451987 3rd Qu.:183.9 3rd Qu.:194.2
Max. :39501653 Max. :245.6 Max. :264.2
NA's :1 NA's :1
covid_cases_percentage covid_deaths_percentage
Min. :22.19 Min. :0.1300
1st Qu.:28.36 1st Qu.:0.2650
Median :31.60 Median :0.3500
Mean :31.11 Mean :0.3308
3rd Qu.:33.56 3rd Qu.:0.3950
Max. :42.15 Max. :0.4600
ggplot(df, aes(x = DemPercentage, y = covid_cases_percentage, color = VoteOutcome)) +
geom_point() +
#geom_smooth(method = "lm") +
scale_color_manual(values = c("Republican" = "red", "Democratic" = "blue"), name = "Political Affiliation") +
xlab("Percentage of Vote for Democratic Party") +
ylab("COVID-19 Cases (percent of population)") +
ggtitle("Covid Case Rates vs Percentage of Vote")
# covid_deaths_percentage is already expressed as a percent of the population, so no rescaling is applied here
ggplot(df, aes(x = DemPercentage, y = covid_deaths_percentage, color = VoteOutcome)) +
geom_point() +
#geom_smooth(method = "lm") +
scale_color_manual(values = c("Republican" = "red", "Democratic" = "blue"), name = "Political Affiliation") +
xlab("Percentage of Vote for Democratic Party") +
ylab("COVID-19 Deaths (percent of population)") +
ggtitle("Covid Death Rates vs Percentage of Vote")
ggplot(df, aes(x = DemPercentage, y = HeartDeathsPer100k, color = VoteOutcome)) +
geom_point() +
#geom_smooth(method = "lm") +
scale_color_manual(values = c("Republican" = "red", "Democratic" = "blue"), name = "Political Affiliation") +
xlab("Percentage of Vote for Democratic Party") +
ylab("Heart Disease Deaths (per 100k)") +
ggtitle("Heart Disease Death Rates per 100K vs Percentage of Vote",
subtitle = "Average of 2020 and 2021 Data")
Warning message:
"Removed 1 rows containing missing values (`geom_point()`)."
Examining the scatterplots, there may be a difference in the disease and death rates between Democratic and Republican states. More Republican states have higher rates of COVID-19 cases, and more Democratic states have lower rates. The heart disease deaths per 100k people for each state show a similar distribution. When looking at the COVID-19 death rate, though, the difference between Democratic and Republican states becomes less evident, if a difference exists at all.
ggplot(df, aes(x = VoteOutcome, y = covid_cases_percentage, fill = VoteOutcome)) +
geom_boxplot() +
scale_fill_manual(values = c("Republican" = "red", "Democratic" = "blue")) +
xlab("Political Affiliation") +
ylab("COVID-19 Cases (percent of population)") +
ggtitle("Covid Case Rates by Political Affiliation") +
theme(legend.position = "none")
# covid_deaths_percentage is already a percentage of the population; no rescaling needed
ggplot(df, aes(x = VoteOutcome, y = covid_deaths_percentage, fill = VoteOutcome)) +
geom_boxplot() +
scale_fill_manual(values = c("Republican" = "red", "Democratic" = "blue")) +
xlab("Political Affiliation") +
ylab("COVID-19 Deaths (percent of population)") +
ggtitle("Covid Death Rates by Political Affiliation") +
theme(legend.position = "none")
ggplot(df, aes(x = VoteOutcome, y = HeartDeathsPer100k, fill = VoteOutcome)) +
geom_boxplot() +
scale_fill_manual(values = c("Republican" = "red", "Democratic" = "blue")) +
xlab("Political Affiliation") +
ylab("Heart Disease Deaths (per 100k)") +
ggtitle("Heart Disease Death Rates per 100K by Political Affiliation",
subtitle = "Average of 2020 and 2021 Data") +
theme(legend.position = "none")
Warning message:
"Removed 1 rows containing non-finite values (`stat_boxplot()`)."
The box plots more clearly reveal that states that voted Republican have higher rates of COVID-19 cases, COVID-19 deaths, and heart disease deaths.
ggplot(df, aes(x = reorder(STATE, covid_cases_percentage), y = covid_cases_percentage, fill = VoteOutcome)) +
geom_bar(stat = "identity") +
scale_fill_manual(values = c("Republican" = "red", "Democratic" = "blue")) +
coord_flip() +
xlab("State") +
ylab("COVID-19 Cases (percent of population)") +
ggtitle("COVID-19 Cases by State (Colored by Political Affiliation)") +
theme(legend.position = "none")
# covid_deaths_percentage is already a percentage of the population; no rescaling needed
ggplot(df, aes(x = reorder(STATE, covid_deaths_percentage), y = covid_deaths_percentage, fill = VoteOutcome)) +
geom_bar(stat = "identity") +
scale_fill_manual(values = c("Republican" = "red", "Democratic" = "blue")) +
coord_flip() +
xlab("State") +
ylab("COVID-19 Deaths (percent of population)") +
ggtitle("COVID-19 Deaths by State (Colored by Political Affiliation)") +
theme(legend.position = "none")
ggplot(df, aes(x = reorder(STATE, HeartDeathsPer100k), y = HeartDeathsPer100k, fill = VoteOutcome)) +
geom_bar(stat = "identity") +
scale_fill_manual(values = c("Republican" = "red", "Democratic" = "blue")) +
coord_flip() +
xlab("State") +
ylab("Heart Disease Deaths (per 100k)") +
ggtitle("Heart Disease Deaths by State (Colored by Political Affiliation)",
subtitle = "Average of 2020 and 2021 Data") +
theme(legend.position = "none")
Warning message:
"Removed 1 rows containing missing values (`position_stack()`)."
These horizontal bar charts help us identify which states had the highest disease rates and what their political affiliation was. For COVID-19 case, COVID-19 death, and heart disease death rates, the top ten of each category consists mostly of Republican states.
df1 <- df %>%
rename(
`High School Graduate` = normalized_high_school,
`Bachelors Degree` = normalized_bachelors,
`Some College` = normalized_some_college,
`Associates Degree` = normalized_associates
)
education <- gather(df1, key = "EducationLevel", value = "EducationCount",
'High School Graduate', 'Bachelors Degree', 'Some College', 'Associates Degree')
ggplot(education, aes(x = EducationLevel, y = EducationCount, fill = VoteOutcome)) +
geom_boxplot() +
scale_fill_manual(values = c("Republican" = "red", "Democratic" = "blue"), name = "Political Affiliation" ) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
xlab("Education Level") +
ylab("Normalized Education Percentage of Individuals") +
ggtitle("Normalized Education Levels by Political Affiliation")
STATE | BidenVotes | TrumpVotes | TotalVotes | VoteOutcome | DemPercentage | Population | Median.income | Violent_Crimes_2020 | Violent_Crimes_2021 | ... | Hispanic_Latino | normalized_violent_crimes | normalized_property_crimes | normalized_white | normalized_black | normalized_asian | normalized_hispanic_latino | HeartDeathsPer100k | EducationLevel | EducationCount |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<chr> | <dbl> | <dbl> | <dbl> | <chr> | <dbl> | <dbl> | <dbl> | <int> | <int> | ... | <int> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <chr> | <dbl> |
Alabama | 849624 | 1441170 | 2290794 | Republican | 37.09 | 5031362 | 54690 | 22322 | 17590 | ... | 246477 | 0.00402 | 0.01773 | 0.65639 | 0.25878 | 0.01568 | 0.04899 | 242.50 | normalized_high_school | 0.21451 |
Alaska | 153778 | 189951 | 343729 | Republican | 44.74 | 732923 | 74750 | 6126 | 5573 | ... | 56491 | 0.00785 | 0.01961 | 0.59697 | 0.03029 | 0.06301 | 0.07708 | 147.25 | normalized_high_school | 0.18484 |
Arizona | 1672143 | 1661686 | 3333829 | Democratic | 50.16 | 7179943 | 67090 | 35980 | 30922 | ... | 2388520 | 0.00458 | 0.02184 | 0.59249 | 0.04746 | 0.03711 | 0.33267 | 151.55 | normalized_high_school | 0.16268 |
Arkansas | 423932 | 760647 | 1184579 | Republican | 35.79 | 3014195 | 50780 | 20363 | 21271 | ... | 255416 | 0.00678 | 0.02548 | 0.69798 | 0.14509 | 0.01623 | 0.08474 | 226.75 | normalized_high_school | 0.23060 |
California | 11110639 | 6006518 | 17117157 | Democratic | 64.91 | 39501653 | 77650 | 174026 | 188343 | ... | 15732184 | 0.00470 | 0.02197 | 0.38418 | 0.05370 | 0.15326 | 0.39827 | 145.90 | normalized_high_school | 0.14123 |
Colorado | 1804352 | 1364607 | 3168959 | Democratic | 56.94 | 5784865 | 83780 | 24570 | 27916 | ... | 1314962 | 0.00468 | 0.03061 | 0.70991 | 0.04071 | 0.03288 | 0.22731 | 131.60 | normalized_high_school | 0.14078 |
In this box plot, we can see the distribution of normalized counts across different education levels, separated by vote outcome. For those with an associate’s level of education, the distribution is fairly similar for both Democratic and Republican areas, as the median lines are at nearly the same level. However, the interquartile range is slightly broader for Democratic areas, indicating more variability. For those at the bachelor’s level, the median count is higher in Democratic areas than in Republican areas. The range is also broader for Democratic areas, showing a wider spread of counts. For those at the high school level, the median count again appears to be higher for Democratic areas compared to Republican areas. The interquartile ranges are quite similar, suggesting similar variability between the two vote outcomes. For those with some college, we see a higher median count for Democratic areas, with the interquartile range again being wider for Democratic areas. Overall, the plot suggests a trend where Democratic areas have a higher normalized count of individuals with higher education levels (bachelor’s and some college), while the counts are more similar for lower levels of education (high school and associate’s).
Welch’s t-test, also called the unequal variances t-test, compares the means of two populations to test the null hypothesis. Unlike the Student’s t-test, which assumes the two populations are normally distributed with equal variances, Welch’s t-test was designed for cases where the populations have different variances, though it still assumes the samples are normally distributed. It seems natural to assume that Welch’s t-test should be used only when the variances differ, but it has been argued that it should always be preferred over the Student’s t-test because it has better control of Type I and Type II errors (2006). The test statistic is the difference of the two sample means divided by the square root of the sum of their squared standard errors: \(t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{s_{\bar{X}_1}^2 + s_{\bar{X}_2}^2}}\), where \(\bar{X}_i\) is the sample mean of group \(i\) and \(s_{\bar{X}_i}\) is its standard error.
Hotelling’s T-Squared Test is used to compare the means of two groups across multiple variables simultaneously. It’s an extension of the Student’s t-test to the multivariate case, where each observation is a vector of values rather than a single scalar value. This test is useful when the variables are correlated, and it assumes that the data from both groups are drawn from multivariate normally distributed populations with equal covariance matrices. The fundamental assumptions for the test are that the samples are randomly drawn from normally distributed populations and that the samples are independent of each other. The test statistic \(T^2\) is derived from the means, variances, and covariances of the variables, and it’s often transformed into an F-statistic for determining statistical significance.
A bootstrap approach samples each observed population n times with replacement, where n is the size of that population, and uses the resampled datasets to compute a mean. The bootstrapped mean for population two is then subtracted from (or, as in our tests below, divided into) that of population one. This process of resampling and comparing means is repeated many times, creating a distribution of differences (or ratios) of means. Next, the quantiles of this distribution are examined at a predetermined confidence level. Common confidence levels are 90 and 95 percent; with a confidence level of 90 percent, we would look at the 5 and 95 percent quantiles. Bootstrap approaches are useful when one or both of the observed samples are small.
\(H_0:\) There is no difference in median household income between states that voted for Biden and states that voted for Trump.
\(H_A:\) There is a difference in median household income between states that voted for Biden and states that voted for Trump.
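The call that produced the output below is not shown in the report; a sketch consistent with the reported output, using base R’s t.test(), which applies the Welch correction by default (var.equal = FALSE):
# subset the cleaned data by vote outcome
democratic <- df[df$VoteOutcome == "Democratic", ]
republican <- df[df$VoteOutcome == "Republican", ]
# Welch two-sample t-test on median household income
t.test(democratic$Median.income, republican$Median.income)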
Welch Two Sample t-test
data: democratic$Median.income and republican$Median.income
t = 4.7036, df = 48.366, p-value = 2.167e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
7463.532 18604.499
sample estimates:
mean of x mean of y
75049.62 62015.60
We used the Welch Two Sample t-test to compare the median incomes between Democratic and Republican states. The t-statistic of 4.7036 with 48.366 degrees of freedom, and a very low p-value of 2.167e-05, indicates strong evidence to reject the null hypothesis. This suggests that there is a significant difference in median incomes between the two political affiliations. The 95 percent confidence interval (7463.532 to 18604.499) for the true difference in means similarly reveals that Democratic states tend to have higher median incomes compared to Republican states.
\(H_0:\) There is no difference in the mean normalized rates of violent crimes between Democratic and Republican states.
\(H_A:\) There is a difference in the mean normalized rates of violent crimes between Democratic and Republican states.
Welch Two Sample t-test
data: democratic$normalized_violent_crimes and republican$normalized_violent_crimes
t = -1.0662, df = 47.928, p-value = 0.2917
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.0014633197 0.0004491966
sample estimates:
mean of x mean of y
0.003586538 0.004093600
We ran another Welch Two-Sample t-test to compare the normalized rates of violent crimes between Democratic and Republican states. The results gave us a t-statistic of -1.0662 with 47.928 degrees of freedom and a p-value of 0.2917. This suggests that there is not enough evidence to reject the null hypothesis, indicating no significant difference in the mean normalized rates of violent crimes between the two political affiliations.
\(H_0:\) There is no difference in the mean normalized rates of property crimes between Democratic and Republican states.
\(H_A:\) There is a difference in the mean normalized rates of property crimes between Democratic and Republican states.
Welch Two Sample t-test
data: democratic$normalized_property_crimes and republican$normalized_property_crimes
t = -0.51149, df = 40.431, p-value = 0.6118
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.004177888 0.002489857
sample estimates:
mean of x mean of y
0.01896038 0.01980440
In this Welch Two-Sample t-test, we are comparing the normalized rates of property crimes between Democratic and Republican states. This gave us a t-statistic of -0.51149 with 40.431 degrees of freedom, and a p-value of 0.6118. This suggests that there is insufficient evidence to reject the null hypothesis, indicating no significant difference in the mean normalized rates of property crimes between Democratic and Republican states.
\(H_0:\) There is no difference in the multivariate means of health-related variables between Democratic and Republican states.
\(H_A:\) There is a difference in the multivariate means of health-related variables between Democratic and Republican states.
# Load required library
library(Hotelling)
# Subset the health-related variables for Democratic and Republican states
democratic_data2 <- df[df$VoteOutcome == "Democratic", c("covid_cases_percentage", "covid_deaths_percentage", "HeartDeathsPer100k")]
republican_data2 <- df[df$VoteOutcome == "Republican", c("covid_cases_percentage", "covid_deaths_percentage", "HeartDeathsPer100k")]
# Conduct Hotelling's T-squared test on the health-related variables
hotelling_result2 <- hotelling.test(democratic_data2, republican_data2)
# Inspect the results
print(hotelling_result2)
Hotelling's T-squared test assesses differences in the multivariate means of the health-related variables, covid_cases_percentage, covid_deaths_percentage, and HeartDeathsPer100k, between Democratic and Republican states. A small p-value from this test provides strong evidence to reject the null hypothesis of equal multivariate means, which would indicate disparities in these health outcomes across states based on their political affiliations. The bootstrap analyses below examine the two COVID-19 variables individually.
\(H_0:\) The ratio of mean COVID-19 case percentages between Democratic and Republican states is equal to 1 (no difference).
\(H_A:\) The ratio of mean COVID-19 case percentages between Democratic and Republican states is not equal to 1 (there is a difference).
# COVID-19 case percentages for each group of states
covidc_dem <- subset(df, select = covid_cases_percentage,
                     subset = VoteOutcome == "Democratic", drop = TRUE)
covidc_rep <- subset(df, select = covid_cases_percentage,
                     subset = VoteOutcome == "Republican", drop = TRUE)
# Observed ratio of means (Republican / Democratic)
actual_mean_ratio <- mean(covidc_rep) / mean(covidc_dem)
# Bootstrap the ratio of means
N <- 10000
ratio.mean <- numeric(N)
for (i in 1:N)
{
  dem <- sample(covidc_dem, length(covidc_dem), replace = TRUE)
  rep <- sample(covidc_rep, length(covidc_rep), replace = TRUE)
  ratio.mean[i] <- mean(rep) / mean(dem)
}
# 95 percent bootstrap confidence interval for the ratio of means
ci.1 <- quantile(ratio.mean, c(0.025, 0.975))
ci.1
# Bootstrap distribution with the observed ratio (red, dashed) and CI bounds (blue)
hist(ratio.mean, main = "Ratio of Means (Rep/Dem)", xlab = "Ratio of Means", breaks = 30)
abline(v = actual_mean_ratio, col = "red", lty = 2)
abline(v = ci.1, col = 4, lwd = 2)
As the 95% confidence interval for the ratio of means does not include 1, we reject the null hypothesis that the ratio of means is 1 and conclude that there is a difference in COVID-19 case percentages between Democratic and Republican states.
\(H_0:\) The ratio of mean COVID-19 death percentages between Democratic and Republican states is equal to 1 (no difference).
\(H_A:\) The ratio of mean COVID-19 death percentages between Democratic and Republican states is not equal to 1 (there is a difference).
# COVID-19 death percentages for each group of states
covidd_dem <- subset(df, select = covid_deaths_percentage,
                     subset = VoteOutcome == "Democratic", drop = TRUE)
covidd_rep <- subset(df, select = covid_deaths_percentage,
                     subset = VoteOutcome == "Republican", drop = TRUE)
# Observed ratio of means (Republican / Democratic)
actual_mean_ratio <- mean(covidd_rep) / mean(covidd_dem)
# Bootstrap the ratio of means
N <- 10000
ratio.mean <- numeric(N)
for (i in 1:N)
{
  dem <- sample(covidd_dem, length(covidd_dem), replace = TRUE)
  rep <- sample(covidd_rep, length(covidd_rep), replace = TRUE)
  ratio.mean[i] <- mean(rep) / mean(dem)
}
# 95 percent bootstrap confidence interval for the ratio of means
ci.1 <- quantile(ratio.mean, c(0.025, 0.975))
ci.1
# Bootstrap distribution with the observed ratio (red, dashed) and CI bounds (blue)
hist(ratio.mean, main = "Ratio of Means (Rep/Dem)", xlab = "Ratio of Means", breaks = 30)
abline(v = actual_mean_ratio, col = "red", lty = 2)
abline(v = ci.1, col = 4, lwd = 2)
As the 95% confidence interval for the ratio of means does not include 1, we reject the null hypothesis that the ratio of means is 1 and conclude that there is a difference in COVID-19 death percentages between Democratic and Republican states.
\(H_0:\) There is no difference in the multivariate means of education levels between Democratic and Republican states.
\(H_A:\) There is a difference in the multivariate means of education levels between Democratic and Republican states.
# Load required libraries
library(Hotelling)
# Subset data for Democratic and Republican states
democratic_data <- df[df$VoteOutcome == "Democratic", c("normalized_high_school", "normalized_bachelors", "normalized_associates", "normalized_some_college")]
republican_data <- df[df$VoteOutcome == "Republican", c("normalized_high_school", "normalized_bachelors", "normalized_associates", "normalized_some_college")]
# Conduct Hotelling's T-squared test
hotelling_result <- hotelling.test(democratic_data, republican_data)
# Inspect the results
print(hotelling_result)
Test stat: 27.92
Numerator df: 4
Denominator df: 46
P-value: 0.0002927
We used Hotelling's T-squared test here to assess differences in the multivariate means of the normalized education levels between Democratic and Republican states. The test yielded a test statistic of 27.92 with 4 numerator and 46 denominator degrees of freedom, and a very low p-value of 0.0002927, providing strong evidence to reject the null hypothesis. This supports the conclusion that there is a significant difference in the normalized education levels between the two political affiliations.
In conclusion, the findings from our hypothesis testing shed light on the relationship between political leanings and various aspects of citizen well-being in the United States. We analyzed the impact of political affiliations on income, crime rates, health outcomes, and education levels. The Welch Two-Sample t-test for median household income revealed a significant difference, indicating that Democratic states tend to have higher median incomes than Republican states. In contrast, our tests on normalized rates of violent and property crimes did not yield sufficient evidence to reject the null hypothesis, suggesting no significant differences between Democratic and Republican states in these respects.
Moving to health-related variables, Hotelling's T-squared test was used to compare the multivariate means of COVID-19-related variables between Democratic and Republican states, and our bootstrap analyses of COVID-19 case and death percentages indicated significant disparities between the states based on political affiliation. This underscores the impact of political affiliations on health outcomes, particularly during the COVID-19 pandemic.
As for education, the Hotelling’s T-squared test on education levels indicated a substantial distinction in the multivariate means between Democratic and Republican states, highlighting a meaningful difference in the normalized education levels across the United States.
Our analysis underscores the significant influence of political leanings on various facets of citizen well-being. These findings not only contribute to the understanding of the interplay between politics and everyday life but also may be helpful for policymakers and individuals making decisions about where to reside.