Assignment 9
Assignment 9: Visualization in R – Base Graphics, Lattice, and ggplot2
Objectives
- Compare three major visualization systems in R: Base graphics, Lattice, and ggplot2.
- Apply all three to the same dataset to observe syntactic, conceptual, and visual differences.
- Produce clear, reproducible code and interpret key findings.
Dataset Overview
For this assignment, I used the Breast Cancer Wisconsin Diagnostic dataset (brca) from the dslabs package — a common bioinformatics dataset used to classify tumor samples as Benign (B) or Malignant (M) based on cell nucleus measurements.
This dataset contains features computed from breast mass cell nuclei.
- brca$y = Diagnosis (Benign or Malignant)
- brca$x = 30 numeric features describing cell characteristics
Load and explore the data
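The loading code itself is not shown in this post. Here is a minimal sketch of what it could look like, assuming the dslabs package is installed. The way df is built (binding the diagnosis factor to the feature matrix) is an assumption based on the description of df in this section, not the original code.

```r
# Assumes dslabs is installed: install.packages("dslabs")
library(dslabs)
data(brca)

# Diagnosis factor and numeric feature matrix
head(brca$y)
dim(brca$x)

# Assumed construction of df: diagnosis plus the 30 numeric features
df <- data.frame(diagnosis = brca$y, brca$x)
str(df[, 1:4])
```

The later plots refer to columns such as radius_mean and texture_mean, which come from the column names of brca$x.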
head(brca$y) shows the first six entries of the diagnosis vector. The first six tumors are Benign (B). The line Levels: B M indicates two possible classes.
brca$x is a numeric matrix with 569 rows (samples) and 30 columns (features). The df data frame now contains one categorical variable (diagnosis) and 30 numeric predictors.
Base R Graphics
Base R graphics provide a simple and direct way to plot data using built-in functions such as plot(), hist(), and boxplot(). While less visually refined, they are highly flexible and fast to use.
Scatter Plot – Texture vs Radius
    plot(df$radius_mean, df$texture_mean,
         col = ifelse(df$diagnosis == "M", "red", "blue"),
         pch = 19,
         xlab = "Mean Radius",
         ylab = "Mean Texture",
         main = "Base R: Texture vs Radius by Diagnosis")
    legend("topright", legend = c("Benign", "Malignant"),
           col = c("blue", "red"), pch = 19)
Output:
This scatter plot shows texture_mean (y-axis) versus radius_mean (x-axis).
- Blue points = Benign tumors
- Red points = Malignant tumors
Interpretation:
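Most malignant tumors (red) appear toward the upper-right region, meaning they tend to have higher radius and texture values than benign tumors. Larger nuclei with more variable texture are consistent with malignancy, although the two groups overlap in the middle of the plot.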
Histogram – Radius Distribution (Malignant)
    hist(df$radius_mean[df$diagnosis == "M"],
         main = "Base R: Radius Distribution (Malignant)",
         xlab = "Mean Radius", col = "pink", breaks = 20)
Output:
The pink histogram reveals how mean radius values are distributed among malignant cases.
Interpretation:
Malignant mean radii cluster roughly between 15 and 20. This histogram shows malignant cases only, so the comparison with benign tumors rests on the boxplot and density plots that follow.
Boxplot – Radius by Diagnosis
    boxplot(radius_mean ~ diagnosis, data = df,
            main = "Base R: Boxplot of Mean Radius by Diagnosis",
            xlab = "Diagnosis", ylab = "Mean Radius",
            col = c("lightblue", "lightpink"))
Output:
This boxplot compares the spread of mean radius values for benign vs malignant tumors.
Interpretation:
The two tumor types separate clearly:
- Malignant (M) cases show a higher median and a greater spread of radius values.
- Benign (B) cases are tightly grouped around smaller radii, indicating less variability.
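The boxplot reading above can be backed up with numbers. This is a small sketch, assuming df is built from brca as described earlier (the data.frame() call is my assumption), that computes the median and interquartile range of radius_mean for each diagnosis.

```r
library(dslabs)
data(brca)
df <- data.frame(diagnosis = brca$y, brca$x)  # assumed construction of df

# Median (center) and IQR (spread) of mean radius per diagnosis
med <- tapply(df$radius_mean, df$diagnosis, median)
iqr <- tapply(df$radius_mean, df$diagnosis, IQR)

rbind(median = med, IQR = iqr)
```

The median corresponds to the thick line inside each box, and the IQR to the height of the box.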
Lattice Graphics
Lattice plots are powerful for multivariate conditioning and small multiples. They handle grouped data elegantly and are more consistent in structure than Base R graphics.
Conditional Scatter Plot
    library(lattice)
    xyplot(texture_mean ~ radius_mean | diagnosis,
           data = df,
           layout = c(2, 1),
           main = "Lattice: Texture vs Radius by Diagnosis",
           xlab = "Mean Radius",
           ylab = "Mean Texture",
           col = "darkgreen",
           pch = 20)
Output Interpretation
The lattice system automatically creates two conditioned panels — one for Benign and one for Malignant.
You can clearly see:
- Benign: smaller, compact cluster (low radius and texture)
- Malignant: larger, dispersed pattern (higher values)
Lattice plots are excellent for comparing subsets directly.
Box-and-Whisker Plot
    bwplot(radius_mean ~ diagnosis,
           data = df,
           main = "Lattice: Radius by Diagnosis",
           xlab = "Diagnosis",
           ylab = "Mean Radius",
           fill = "lightblue")
Density Plot
    densityplot(~ radius_mean, groups = diagnosis,
                data = df, auto.key = TRUE,
                main = "Lattice: Density of Mean Radius by Diagnosis",
                plot.points = FALSE)
Output Interpretation
The density curves show two distinct distributions:
- Benign (B): peaks around 12–14
- Malignant (M): peaks around 17–19
This indicates that radius_mean separates the two diagnoses well, which makes it a candidate feature for machine learning classification.
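One rough way to check how well radius_mean separates the groups is a single-cutoff rule. The sketch below is illustrative only: the rule, the cutoff grid, and the construction of df are my assumptions, and the accuracy is measured on the same data used to pick the cutoff, so it is optimistic.

```r
library(dslabs)
data(brca)
df <- data.frame(diagnosis = brca$y, brca$x)  # assumed construction of df

# Hypothetical rule: call a tumor malignant if radius_mean exceeds a cutoff
cutoffs <- seq(10, 20, by = 0.5)
accuracy <- sapply(cutoffs, function(cut) {
  predicted <- ifelse(df$radius_mean > cut, "M", "B")
  mean(predicted == as.character(df$diagnosis))
})

# Cutoff with the highest in-sample accuracy
best <- which.max(accuracy)
data.frame(cutoff = cutoffs[best], accuracy = accuracy[best])
```

A proper evaluation would split the data into training and test sets before choosing the cutoff.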
ggplot2 Visualizations
The ggplot2 package uses the Grammar of Graphics, a structured, layered approach that allows precise and elegant control over aesthetics, themes, and statistical transformations.
Scatter Plot with Regression Line
    library(ggplot2)
    ggplot(df, aes(x = radius_mean, y = texture_mean, color = diagnosis)) +
      geom_point(size = 2, alpha = 0.6) +
      geom_smooth(method = "lm", se = FALSE) +
      labs(title = "ggplot2: Texture vs Radius by Diagnosis",
           x = "Mean Radius",
           y = "Mean Texture") +
      theme_minimal()
Output Interpretation
This ggplot2 scatter plot adds a linear regression line for each diagnosis group.
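You can clearly observe:
- Both groups show a positive correlation between texture and radius.
- The malignant trend line is steeper, meaning texture rises faster per unit of radius in that group. Slope alone does not show a stronger relationship; that is measured by the correlation coefficient.
- With ggplot2's default palette, the first factor level (B) is drawn in red and the second (M) in teal, so the malignant line is the teal one unless the colors are set manually.
theme_minimal() gives a clean, uncluttered look that suits publications.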
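To compare the trend lines numerically rather than by eye, the slope and the correlation can be computed per group. This is a sketch under the assumption that df is built from brca as described earlier. It fits the same kind of linear model that geom_smooth(method = "lm") draws.

```r
library(dslabs)
data(brca)
df <- data.frame(diagnosis = brca$y, brca$x)  # assumed construction of df

# One data frame per diagnosis level
by_group <- split(df, df$diagnosis)

# Slope of texture on radius, and their Pearson correlation, per group
stats <- sapply(by_group, function(d) {
  fit <- lm(texture_mean ~ radius_mean, data = d)
  c(slope = unname(coef(fit)["radius_mean"]),
    correlation = cor(d$radius_mean, d$texture_mean))
})

round(stats, 3)
```

The slope row describes how steep each line is, while the correlation row describes how tightly the points follow it.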
Faceted Histogram
    ggplot(df, aes(x = radius_mean, fill = diagnosis)) +
      geom_histogram(binwidth = 1, color = "black", alpha = 0.6) +
      facet_wrap(~ diagnosis, scales = "free_y") +
      labs(title = "ggplot2: Radius Distribution by Diagnosis",
           x = "Mean Radius",
           y = "Count") +
      theme_minimal() +
      theme(legend.position = "none")
Discussion Questions
1. How do the syntax and workflow differ between Base, Lattice, and ggplot2?
Base R uses a procedural, step-by-step approach. A high-level function such as plot(), hist(), or boxplot() draws the plot, and further elements are added with separate calls such as legend(). You manually specify colors, point types, legends, and axis labels. This system is quick for exploratory plotting and simple visualizations, but multi-panel or layered plots take more manual work.
Lattice follows a formula-based approach, for example y ~ x | factor, where the plot automatically handles multi-panel conditioning. You pass the dataset and a formula describing the relationship, and Lattice takes care of panel layout, spacing, and axis scales. This is very useful for comparing subsets of data, but layering additional elements can be less intuitive.
ggplot2 uses the grammar-of-graphics approach. You first define the data and aesthetics (aes()), and then add layers such as points, lines, smooth curves, or facets using geom_*() functions. Themes and additional customizations can be applied systematically. ggplot2 is highly flexible and supports complex, publication-ready visualizations, but the layered workflow has a steeper learning curve compared to Base R or Lattice.
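The layered workflow can be seen by building a plot in stages. This sketch assumes df is built from brca as described earlier; the object names p, p_points, and p_full are only for illustration.

```r
library(ggplot2)
library(dslabs)
data(brca)
df <- data.frame(diagnosis = brca$y, brca$x)  # assumed construction of df

# Data and aesthetic mappings only: no geoms yet
p <- ggplot(df, aes(x = radius_mean, y = texture_mean, color = diagnosis))

# Each + adds one layer or setting to the same plot object
p_points <- p + geom_point(alpha = 0.6)
p_full <- p_points +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~ diagnosis) +
  theme_minimal()

print(p_full)
```

Because each stage is an ordinary R object, p_points can be reused with a different set of layers without repeating the data and aes() specification.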
2. Which system gave you the most control or produced the most “publication‑quality” output with minimal code?
ggplot2
- Allows layering of multiple geoms (points, lines, smooth curves, histograms) in one plot.
- Automatically handles legends, colors, faceting, and themes for professional aesthetics.
- Minimal additional code produces plots suitable for academic publications or clinical reports.
- Base R gives speed but lacks styling; Lattice is structured but less flexible.
Example: The ggplot2 scatter plot with regression line clearly separates benign and malignant tumors, shows trends, and looks clean and polished with only a few lines of code.
3. Challenges or surprises when switching between systems
- Base → Lattice: The formula interface (y ~ x | factor) takes some adjustment. Multi-panel plots are powerful but less intuitive at first.
- Lattice → ggplot2: The grammar-of-graphics approach introduces a layered concept, where aesthetics and layers are defined separately. It can be confusing initially but allows far more flexibility.
- General surprises:
  - Color and labeling conventions differ between systems; care is needed to ensure consistent representation.
  - Lattice automatically handles panels, while in Base R you have to manually subset data.
  - ggplot2 faceting and layering make it easier to compare groups without creating multiple plots manually.
- Key takeaway: Transitioning between systems teaches the trade-offs between speed, structure, and flexibility in data visualization.
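To make the point about manual subsetting in Base R concrete, here is a sketch of a two-panel plot comparable to the Lattice conditional scatter plot. It assumes df is built from brca as described earlier. The shared axis limits are my choice, so that the panels can be compared directly.

```r
library(dslabs)
data(brca)
df <- data.frame(diagnosis = brca$y, brca$x)  # assumed construction of df

# Shared axis limits so both panels use the same scale
xlim <- range(df$radius_mean)
ylim <- range(df$texture_mean)

old_par <- par(mfrow = c(1, 2))  # one row, two panels
for (level in levels(df$diagnosis)) {
  sub <- df[df$diagnosis == level, ]  # manual subset for this panel
  plot(sub$radius_mean, sub$texture_mean,
       xlim = xlim, ylim = ylim, pch = 20, col = "darkgreen",
       xlab = "Mean Radius", ylab = "Mean Texture",
       main = paste("Diagnosis:", level))
}
par(old_par)  # restore the previous layout
```

Lattice does the same with the | diagnosis part of the formula, and ggplot2 with facet_wrap(~ diagnosis).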