Assignment 9

 

Assignment 9: Visualization in R – Base Graphics, Lattice, and ggplot2

Objectives

  • Compare three major visualization systems in R: Base graphics, Lattice, and ggplot2.
  • Apply all three to the same dataset to observe syntactic, conceptual, and visual differences.
  • Produce clear, reproducible code and interpret key findings.

Dataset Overview

For this assignment, I used the Breast Cancer Wisconsin Diagnostic dataset (brca) from the dslabs package — a common bioinformatics dataset used to classify tumor samples as Benign (B) or Malignant (M) based on cell nucleus measurements.

This dataset contains features computed from breast mass cell nuclei.

  • brca$y = Diagnosis (Benign or Malignant)
  • brca$x = 30 numeric features describing cell characteristics

Load and explore the data

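# Load libraries and dataset
# Install dslabs only if it is not already available
if (!requireNamespace("dslabs", quietly = TRUE)) install.packages("dslabs")
library(dslabs)
data("brca")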

# 'brca$y' contains the diagnosis (Benign/Malignant)
# 'brca$x' is a matrix of 30 numeric features
head(brca$y)

Output:
The command head(brca$y) shows the first six entries of the diagnosis vector.

The first six tumors are Benign (B). The "Levels: B M" line indicates the two possible classes.

str(brca$x)
Output:

This shows that brca$x is a numeric matrix with 569 rows (samples) and 30 columns (features).
The feature names describe tumor properties such as radius_mean, texture_mean, and area_mean.

# Combine into one data frame for easy plotting
df <- data.frame(diagnosis = brca$y, brca$x)
head(df)
Output:
The df data frame now contains one categorical variable (diagnosis) and 30 numeric predictors.
Viewing the first 6 rows shows that all early entries are benign tumors with moderate radius and texture values.
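
Before plotting, it can help to confirm that the combined data frame has the expected shape. The following is a minimal sketch, assuming dslabs is installed; it rebuilds df so that it runs on its own.

```r
library(dslabs)
data("brca")

# Rebuild the combined data frame so this block runs on its own
df <- data.frame(diagnosis = brca$y, brca$x)

dim(df)               # rows (samples) and columns (diagnosis + features)
table(df$diagnosis)   # number of benign (B) and malignant (M) samples
anyNA(df)             # TRUE would mean some values are missing
```

The class counts from table() are worth noting, because unequal group sizes affect how histogram heights compare later on.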

Base R Graphics

Base R graphics provide a simple and direct way to plot data using built-in functions such as plot(), hist(), and boxplot(). While less visually refined, they are highly flexible and fast to use.

Scatter Plot – Texture vs Radius

plot(df$radius_mean, df$texture_mean,
     col = ifelse(df$diagnosis == "M", "red", "blue"),
     pch = 19,
     xlab = "Mean Radius",
     ylab = "Mean Texture",
     main = "Base R: Texture vs Radius by Diagnosis")
legend("topright", legend = c("Benign", "Malignant"),
       col = c("blue", "red"), pch = 19)

Output:
This scatter plot shows texture_mean (y-axis) versus radius_mean (x-axis).

  • Blue points = Benign tumors
  • Red points = Malignant tumors

Interpretation:
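Most malignant tumors (red) appear toward the upper-right region, indicating higher radius and texture values. In this dataset, radius measures the size of the cell nuclei and texture the variation in their gray-scale values, so malignant samples tend to have larger nuclei with more variable texture.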

Histogram – Radius Distribution (Malignant)

hist(df$radius_mean[df$diagnosis == "M"],
     main = "Base R: Radius Distribution (Malignant)",
     xlab = "Mean Radius", col = "pink", breaks = 20)

Output:
The pink histogram reveals how mean radius values are distributed among malignant cases.

Interpretation:
Most malignant tumors have a mean radius between 15 and 20. This histogram shows only the malignant group; the boxplot below adds the comparison with benign tumors, whose radii are generally smaller.

Boxplot – Radius by Diagnosis

boxplot(radius_mean ~ diagnosis, data = df,
        main = "Base R: Boxplot of Mean Radius by Diagnosis",
        xlab = "Diagnosis", ylab = "Mean Radius",
        col = c("lightblue", "lightpink"))

Output:
This boxplot compares the spread of mean radius values for benign vs malignant tumors.

Interpretation:

The two tumor types differ clearly:

  • Malignant (M) cases show higher median and greater spread of radius values.
  • Benign (B) cases are tightly grouped around smaller radii, indicating less variability.
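
The boxplot reading above can be checked numerically. This is a minimal sketch, assuming dslabs is installed; radius_median, radius_iqr, and radius_summary are names introduced here for illustration.

```r
library(dslabs)
data("brca")
df <- data.frame(diagnosis = brca$y, brca$x)

# Median and interquartile range of radius_mean within each diagnosis
radius_median <- tapply(df$radius_mean, df$diagnosis, median)
radius_iqr    <- tapply(df$radius_mean, df$diagnosis, IQR)

radius_summary <- data.frame(
  diagnosis = names(radius_median),
  median    = as.numeric(radius_median),
  iqr       = as.numeric(radius_iqr)
)
print(radius_summary)
```

The median is the line inside each box and the IQR is the height of the box, so these two columns are the numbers the boxplot draws.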

Lattice Graphics

Lattice plots are powerful for multivariate conditioning and small multiples. They handle grouped data elegantly and are more consistent in structure than Base R graphics.

Conditional Scatter Plot

library(lattice)

xyplot(texture_mean ~ radius_mean | diagnosis,
       data = df,
       layout = c(2, 1),
       main = "Lattice: Texture vs Radius by Diagnosis",
       xlab = "Mean Radius",
       ylab = "Mean Texture",
       col = "darkgreen",
       pch = 20)

Output Interpretation

The lattice system automatically creates two conditioned panels, one for Benign and one for Malignant.
You can clearly see:

  • Benign: smaller, compact cluster (low radius & texture)

  • Malignant: larger, dispersed pattern (higher values)

Lattice plots are excellent for comparing subsets directly.
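
Conditioning with | puts each diagnosis in its own panel. As a sketch of the alternative, the groups argument overlays both diagnoses in one panel and auto.key adds a legend. The name overlay_plot is introduced here for illustration, and the block assumes dslabs and lattice are installed.

```r
library(dslabs)
library(lattice)
data("brca")
df <- data.frame(diagnosis = brca$y, brca$x)

# One panel, with point color set by diagnosis instead of separate panels
overlay_plot <- xyplot(texture_mean ~ radius_mean,
                       data = df,
                       groups = diagnosis,
                       auto.key = list(columns = 2),
                       # set pch through par.settings so the legend matches the points
                       par.settings = list(superpose.symbol = list(pch = 20)),
                       main = "Lattice: Texture vs Radius (grouped)",
                       xlab = "Mean Radius",
                       ylab = "Mean Texture")
print(overlay_plot)
```

Panels make each group's shape easier to see; the overlay makes it easier to see where the two groups overlap.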

Box-and-Whisker Plot

bwplot(radius_mean ~ diagnosis,
       data = df,
       main = "Lattice: Radius by Diagnosis",
       xlab = "Diagnosis",
       ylab = "Mean Radius",
       fill = "lightblue")

Output Interpretation

This is similar to the Base R boxplot, but lattice handles the layout and axis formatting automatically, so less manual work is needed.
Again, malignant cases show a higher and wider distribution of mean radius.

Density Plot

densityplot(~ radius_mean, groups = diagnosis,
            data = df, auto.key = TRUE,
            main = "Lattice: Density of Mean Radius by Diagnosis",
            plot.points = FALSE)

Output Interpretation

The density curves show two distinct distributions:

  • Benign (B): peaks around 12–14

  • Malignant (M): peaks around 17–19

This suggests that radius_mean separates the two diagnoses well, although the curves overlap, which makes it a promising feature for machine learning classification.
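
One rough way to gauge that separation is a single-cutoff rule. The sketch below assumes a cutoff halfway between the two group medians, which is an illustrative choice and not a tuned classifier; cutoff, predicted, and accuracy are names introduced here.

```r
library(dslabs)
data("brca")
df <- data.frame(diagnosis = brca$y, brca$x)

# Assumed cutoff: midpoint between the benign and malignant medians
cutoff <- mean(tapply(df$radius_mean, df$diagnosis, median))

# Call a sample malignant when its mean radius exceeds the cutoff
predicted <- factor(ifelse(df$radius_mean > cutoff, "M", "B"),
                    levels = levels(df$diagnosis))

# Share of samples where the rule matches the recorded diagnosis
accuracy <- mean(predicted == df$diagnosis)

print(cutoff)
print(table(predicted = predicted, actual = df$diagnosis))
print(accuracy)
```

Because the rule is evaluated on the same data used to choose the cutoff, the accuracy is an optimistic indication only.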

ggplot2 Visualizations

The ggplot2 package uses the Grammar of Graphics, a structured, layered approach that allows precise and elegant control over aesthetics, themes, and statistical transformations.

Scatter Plot with Regression Line

library(ggplot2)

ggplot(df, aes(x = radius_mean, y = texture_mean, color = diagnosis)) +
  geom_point(size = 2, alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "ggplot2: Texture vs Radius by Diagnosis",
       x = "Mean Radius",
       y = "Mean Texture") +
  theme_minimal()

Output Interpretation

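This ggplot2 scatter plot adds a linear regression line for each diagnosis group. With ggplot2's default palette, benign (B, the first factor level) is drawn in red and malignant (M) in teal.
You can clearly observe:

  • Both groups show a positive correlation between texture and radius.

  • The malignant trend line is steeper, meaning mean texture rises faster per unit of mean radius in that group. Slope alone does not measure how strong the relationship is; the correlation coefficient does.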
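
The slope and the strength of each group's trend can be read off numerically instead of judged by eye. This is a minimal sketch, assuming dslabs is installed; by_diagnosis and fit_summary are names introduced here. It fits the same per-group linear model that geom_smooth(method = "lm") draws.

```r
library(dslabs)
data("brca")
df <- data.frame(diagnosis = brca$y, brca$x)

# Split the data by diagnosis and fit texture_mean ~ radius_mean in each group
by_diagnosis <- split(df, df$diagnosis)

fit_summary <- do.call(rbind, lapply(names(by_diagnosis), function(g) {
  d   <- by_diagnosis[[g]]
  fit <- lm(texture_mean ~ radius_mean, data = d)
  data.frame(
    diagnosis   = g,
    slope       = unname(coef(fit)["radius_mean"]),     # steepness of the line
    correlation = cor(d$radius_mean, d$texture_mean)    # strength of the linear relationship
  )
}))
print(fit_summary)
```

Comparing the two columns shows whether the steeper line also has the higher correlation.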

theme_minimal() removes the gray panel background, giving a clean look that is well suited to publications.

Faceted Histogram

ggplot(df, aes(x = radius_mean, fill = diagnosis)) +
  geom_histogram(binwidth = 1, color = "black", alpha = 0.6) +
  facet_wrap(~ diagnosis, scales = "free_y") +
  labs(title = "ggplot2: Radius Distribution by Diagnosis",
       x = "Mean Radius",
       y = "Count") +
  theme_minimal() +
  theme(legend.position = "none")

Output Interpretation

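Faceted histograms display each group separately:

  • Benign tumors show smaller radii (clustered around 12–14)

  • Malignant tumors are shifted to the right (larger radii)

Because scales = "free_y" gives each panel its own count axis, compare the shapes and positions of the two distributions rather than the bar heights. Base R can produce the same side-by-side layout, but it takes manual panel setup with par(mfrow = ...).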
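
For comparison, here is a rough Base R counterpart of the faceted histogram. It is a sketch and not the only way to do it; panel_cols, bin_breaks, and old_par are names introduced here, and the block assumes dslabs is installed.

```r
library(dslabs)
data("brca")
df <- data.frame(diagnosis = brca$y, brca$x)

# Shared bin edges of width 1 so both panels use the same x-axis bins
bin_breaks <- seq(floor(min(df$radius_mean)),
                  ceiling(max(df$radius_mean)), by = 1)
panel_cols <- c(B = "lightblue", M = "lightpink")

# Split the device into 1 row x 2 columns, remembering the old settings
old_par <- par(mfrow = c(1, 2))
for (level in levels(df$diagnosis)) {
  hist(df$radius_mean[df$diagnosis == level],
       breaks = bin_breaks,
       main = paste("Diagnosis:", level),
       xlab = "Mean Radius",
       col = panel_cols[[level]])
}
par(old_par)  # restore the single-panel layout
```

The loop, the manual subsetting, and the par() bookkeeping are what facet_wrap() does in one line.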

Discussion Questions

1. How do the syntax and workflow differ between Base, Lattice, and ggplot2?

Base R uses a procedural, step-by-step approach: a high-level function such as plot(), hist(), or boxplot() draws the plot, and further elements are added with separate calls such as legend(). You manually specify colors, point types, legends, and axis labels. This system is quick for exploratory plotting and simple visualizations, but multi-panel or layered plots require manual setup.

Lattice follows a formula-based approach, for example y ~ x | factor, where the plot automatically handles multi-panel conditioning. You pass the dataset and a formula describing the relationship, and Lattice takes care of panel layout, spacing, and axis scales. This is very useful for comparing subsets of data, but layering additional elements can be less intuitive.

ggplot2 uses the grammar-of-graphics approach. You first define the data and aesthetics (aes()), and then add layers such as points, lines, smooth curves, or facets using geom_*() functions. Themes and additional customizations can be applied systematically. ggplot2 is highly flexible and supports complex, publication-ready visualizations, but the layered workflow has a steeper learning curve compared to Base R or Lattice.
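
The layered workflow can be sketched by building one plot in stages. The names base_plot, with_points, and with_facets are introduced here for illustration, and the block assumes dslabs and ggplot2 are installed.

```r
library(dslabs)
library(ggplot2)
data("brca")
df <- data.frame(diagnosis = brca$y, brca$x)

# Stage 1: data and aesthetic mappings only (nothing is drawn yet)
base_plot <- ggplot(df, aes(x = radius_mean, y = texture_mean,
                            color = diagnosis))

# Stage 2: add a geometry layer
with_points <- base_plot + geom_point(alpha = 0.6)

# Stage 3: split into one panel per diagnosis
with_facets <- with_points + facet_wrap(~ diagnosis)

print(with_facets)
```

Each stage is an ordinary R object, so intermediate plots can be stored, reused, and extended. Base R has no equivalent, because its functions draw directly to the device.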

2. Which system gave you the most control or produced the most “publication‑quality” output with minimal code?

ggplot2

  • Allows layering of multiple geoms (points, lines, smooth curves, histograms) in one plot.

  • Automatically handles legends, colors, faceting, and themes for professional aesthetics.

  • Minimal additional code produces plots suitable for academic publications or clinical reports.

  • Base R gives speed but lacks styling; Lattice is structured but less flexible.

Example: The ggplot2 scatter plot with regression line clearly separates benign and malignant tumors, shows trends, and looks clean and polished with only a few lines of code.

3. Challenges or surprises when switching between systems

  • Base → Lattice: The formula interface (y ~ x | factor) takes some adjustment. Multi-panel plots are powerful but less intuitive at first.

  • Lattice → ggplot2: The grammar-of-graphics approach introduces a layered concept, where aesthetics and layers are defined separately. It can be confusing initially but allows far more flexibility.

  • General surprises:

    • Color and labeling conventions differ between systems; care is needed to ensure consistent representation.

    • Lattice automatically handles panels, while in Base R you have to manually subset data.

    • ggplot2 faceting and layering make it easier to compare groups without creating multiple plots manually.

Key takeaway: Transitioning between systems teaches the trade-offs between speed, structure, and flexibility in data visualization.

