Assignment 3
Assignment 3: Analyzing 2016 “Poll” Data in R
In this assignment, I analyze a small dataset of polling results using R. The dataset compares two fictional polls, one from ABC and one from CBS, for seven political candidates. The purpose is to practice data wrangling, visualization, and interpretation with ggplot2 while also reflecting on how to properly use polling data.
R Code
# Step 1: Define data
Name <- c("Jeb", "Donald", "Ted", "Marco", "Carly", "Hillary", "Bernie")
ABC_poll <- c( 4, 62, 51, 21, 2, 14, 15)
CBS_poll <- c( 12, 75, 43, 19, 1, 21, 19)
# Step 2: Create data frame
df_polls <- data.frame(Name, ABC_poll, CBS_poll)
# Step 3: Inspect data structure and first few rows
str(df_polls)
head(df_polls)
# Step 4: Summary statistics
mean(df_polls$ABC_poll) # Mean ABC poll
median(df_polls$CBS_poll) # Median CBS poll
range(df_polls[, c("ABC_poll","CBS_poll")]) # Range for both polls
# Step 5: Add difference column
df_polls$Diff <- df_polls$CBS_poll - df_polls$ABC_poll
df_polls
# Step 6: Visualization with ggplot2
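# Install the packages only if they are not already available
if (!requireNamespace("ggplot2", quietly = TRUE)) install.packages("ggplot2")
if (!requireNamespace("tidyr", quietly = TRUE)) install.packages("tidyr")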
library(tidyr)
df_long <- pivot_longer(df_polls, cols = c("ABC_poll", "CBS_poll"),
                        names_to = "Poll", values_to = "Value")
library(ggplot2)
ggplot(df_long, aes(x = Name, y = Value, fill = Poll)) +
  geom_col(position = "dodge") +
  labs(title = "Comparison of 2016 Poll Results", x = "Candidate",
       y = "Poll Percentage", fill = "Poll Source") +
  theme_minimal()
Output
1. Structure of the Data Frame (str())
2. First Rows (head())
3. Summary Statistics
- Mean (average) of ABC poll values = 24.14286
- Median (middle value) of CBS poll values = 19
- Range of poll scores across both polls = 1 to 75
4. Data with Differences
Table showing each candidate’s ABC poll, CBS poll, and the calculated difference (CBS - ABC).
Key Patterns in the Data
The ABC and CBS results show clear differences for some candidates. For example, Donald’s score is 62 in the ABC poll but 75 in the CBS poll, a gap of 13 points. Hillary also shows a difference, with 14 in ABC and 21 in CBS, a gap of 7 points. The gaps do not all run in the same direction: Jeb is 8 points higher in CBS, while Ted is 8 points lower. In contrast, Carly’s and Marco’s results differ by only 1 and 2 points, suggesting more consistency between the two polls for those candidates.
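As a rough sketch of how these gaps could be ranked in R, the snippet below rebuilds the data frame from the code above and sorts the candidates by the absolute size of the CBS minus ABC difference. The AbsDiff column and the df_ranked name are introduced here for illustration only and are not part of the original script.

```r
# Rebuild the fictional poll data so this snippet runs on its own
Name <- c("Jeb", "Donald", "Ted", "Marco", "Carly", "Hillary", "Bernie")
ABC_poll <- c(4, 62, 51, 21, 2, 14, 15)
CBS_poll <- c(12, 75, 43, 19, 1, 21, 19)
df_polls <- data.frame(Name, ABC_poll, CBS_poll)

# Signed difference (CBS - ABC) and its absolute size
df_polls$Diff <- df_polls$CBS_poll - df_polls$ABC_poll
df_polls$AbsDiff <- abs(df_polls$Diff)  # illustrative helper column

# Sort so that the largest gaps come first
df_ranked <- df_polls[order(-df_polls$AbsDiff), ]
print(df_ranked)
```

Sorting on the absolute value keeps large negative gaps (CBS lower than ABC) near the top, which a sort on the signed Diff column would push to the bottom.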
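The summary statistics add another view, with a caveat. The mean score in the ABC poll is about 24, while the median score in the CBS poll is 19. Because these are different statistics from different polls, they cannot be compared directly. The difference column is the better guide: CBS reported higher values for four of the seven candidates (Jeb, Donald, Hillary, and Bernie) and lower values for the other three. The bar chart (Figure 1) makes these differences easy to compare side by side.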
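For a like-for-like comparison, one option is to compute the same statistic for both polls. The sketch below does this with base R; the names poll_means, poll_medians, and n_cbs_higher are made up for this example.

```r
# Rebuild the fictional poll data so this snippet runs on its own
Name <- c("Jeb", "Donald", "Ted", "Marco", "Carly", "Hillary", "Bernie")
ABC_poll <- c(4, 62, 51, 21, 2, 14, 15)
CBS_poll <- c(12, 75, 43, 19, 1, 21, 19)
df_polls <- data.frame(Name, ABC_poll, CBS_poll)

poll_cols <- c("ABC_poll", "CBS_poll")

# Same statistic for each poll, so the numbers are comparable
poll_means <- colMeans(df_polls[, poll_cols])
poll_medians <- sapply(df_polls[, poll_cols], median)
print(round(poll_means, 2))
print(poll_medians)

# Count the candidates who score higher in CBS than in ABC
n_cbs_higher <- sum(df_polls$CBS_poll > df_polls$ABC_poll)
print(n_cbs_higher)
```

On this practice dataset, both the CBS mean and the CBS median come out above their ABC counterparts, which supports the comparison more directly than setting a mean against a median.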
Limitations of Using Made-Up Data
This dataset is fictional, so the numbers do not represent real voter behavior. Made-up data is useful for practicing R and learning visualization, but it has no real-world meaning. The differences we observe are only examples created for training. In real analysis, using fabricated data without clear labeling could be misleading. Readers might think the numbers reflect actual public opinion. That is why it is important to state clearly that this dataset is for practice only.
Collecting and Validating Real Poll Data
For meaningful results, poll data should come from reliable sources like FiveThirtyEight, Pew Research, or Gallup. These groups provide details such as sample size, margin of error, and methodology, which help assess the trustworthiness of the numbers. To validate results, analysts should compare multiple polls, check when the surveys were conducted, and review whether the questions were neutrally worded. Data should also be cleaned and checked for outliers before making graphs. These steps make the analysis more accurate and reliable.
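To make the margin-of-error point concrete, here is a minimal sketch using the common 95% approximation for a simple random sample, 1.96 * sqrt(p * (1 - p) / n). It rests on several assumptions: the sample size of 1000 is hypothetical (the practice dataset has no sample sizes), the poll scores are treated as percentages, and moe_95 is a helper name invented for this example.

```r
# Approximate 95% margin of error for a proportion from a simple random sample
moe_95 <- function(p, n) {
  1.96 * sqrt(p * (1 - p) / n)
}

n_assumed <- 1000        # hypothetical sample size, not part of the dataset
abc_share <- 62 / 100    # Donald in the fictional ABC poll
cbs_share <- 75 / 100    # Donald in the fictional CBS poll

abc_moe <- moe_95(abc_share, n_assumed)
cbs_moe <- moe_95(cbs_share, n_assumed)

# Rough, conservative check: is the gap larger than both margins added together?
gap <- abs(cbs_share - abc_share)
gap_exceeds_margins <- gap > (abc_moe + cbs_moe)

# Margins in percentage points, then the result of the check
print(round(100 * c(ABC = abc_moe, CBS = cbs_moe), 1))
print(gap_exceeds_margins)
```

Real polls often use weighting and more complex sampling, so published margins of error can differ from this simple formula. When a pollster reports its own margin, that figure is the better one to use.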