Final Project Package
OncoMarker: Targeted Genomic Biomarker Discovery in Breast Cancer
Introduction
High-throughput sequencing has transformed oncology, producing massive datasets that describe gene expression, mutations, and epigenetic alterations. While this wealth of information has propelled research, whole-genome analyses often overwhelm computational pipelines and slow clinical translation.
Analyzing all ~20,000 genes at once is computationally heavy and hard to interpret. OncoMarker addresses this challenge by providing a streamlined R framework for analyzing targeted gene panels, enabling researchers and clinicians to quickly identify differential expression patterns, visualize results, and stratify patients by biomarker risk. The package is designed to be accessible: it operates on pre-processed expression matrices rather than raw sequencing data.
Ideology
The philosophy of OncoMarker is simple: "The Simple Twist". Instead of starting with raw FASTQ files, the package ingests normalized CSV expression matrices (Level 3 data) and clinical metadata. This approach:
- Eliminates the need for high-performance computing.
- Reduces the complexity of whole-genome analysis.
- Provides robust S4 object-oriented data management for reproducibility.
By focusing on targeted panels (e.g., PAM50 or custom biomarker sets), OncoMarker allows users to identify clinically relevant genes efficiently.
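As a rough illustration of the targeted-panel idea, the sketch below reduces an expression matrix to a small gene list. The helper name `subset_to_panel` and the simulated values are assumptions made for this example and are not part of OncoMarker.

```r
# Toy normalized expression matrix: genes as rows, samples as columns.
# Values are simulated purely for illustration.
set.seed(1)
genes <- c("TP53", "BRCA1", "MYC", "ERBB2", "MKI67", "GAPDH", "ACTB")
expr <- matrix(rnorm(length(genes) * 4), nrow = length(genes),
               dimnames = list(genes, paste0("S", 1:4)))

# Hypothetical helper: keep only the panel genes present in the matrix
# and warn about any that are missing.
subset_to_panel <- function(expr, panel_genes) {
  found <- intersect(panel_genes, rownames(expr))
  missing <- setdiff(panel_genes, rownames(expr))
  if (length(missing) > 0) {
    warning("Panel genes not found: ", paste(missing, collapse = ", "))
  }
  expr[found, , drop = FALSE]
}

panel_expr <- subset_to_panel(expr, c("TP53", "BRCA1", "MYC"))
dim(panel_expr)
```

Every downstream step then runs on a handful of rows instead of the full matrix, which is what keeps the workflow light enough for a laptop.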
Description
OncoMarker is an R/Bioconductor-style package that combines:
- Strictly typed S4 objects for data integrity (GenePanel class).
- Differential expression analysis with fold-change and p-values.
- Visualization via customizable Volcano plots.
- Simple yet flexible risk prediction models for key tumor suppressors and oncogenes (e.g., TP53, BRCA1, MYC).
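To show what "strictly typed" can mean in practice, here is a minimal sketch of an S4 class with a validity function. The slot names mirror those used in the Code Glimpse, but the class name `GenePanelSketch` and the specific checks are assumptions for illustration and may differ from the package's actual GenePanel definition.

```r
library(methods)

# Hypothetical stand-in for the package's GenePanel class.
setClass(
  "GenePanelSketch",
  slots = c(
    expression_data  = "matrix",
    patient_metadata = "data.frame",
    cancer_type      = "character"
  ),
  validity = function(object) {
    problems <- character()
    if (!is.numeric(object@expression_data)) {
      problems <- c(problems, "expression_data must be a numeric matrix")
    }
    if (!"Diagnosis" %in% names(object@patient_metadata)) {
      problems <- c(problems, "patient_metadata needs a Diagnosis column")
    }
    if (!identical(colnames(object@expression_data),
                   rownames(object@patient_metadata))) {
      problems <- c(problems, "sample names must match between matrix and metadata")
    }
    if (length(problems) == 0) TRUE else problems
  }
)

expr <- matrix(c(5.1, 7.3, 6.2, 8.8), nrow = 2,
               dimnames = list(c("TP53", "MYC"), c("S1", "S2")))
meta <- data.frame(Diagnosis = c("Tumor", "Normal"), row.names = c("S1", "S2"))
panel <- new("GenePanelSketch", expression_data = expr,
             patient_metadata = meta, cancer_type = "TCGA-BRCA")

# Metadata without a Diagnosis column is rejected at construction time.
bad <- tryCatch(
  new("GenePanelSketch", expression_data = expr,
      patient_metadata = data.frame(row.names = c("S1", "S2")),
      cancer_type = "TCGA-BRCA"),
  error = function(e) "rejected"
)
```

Because the checks run whenever an object is built with `new()`, malformed inputs fail early rather than producing misleading statistics later.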
The package works with datasets like the Breast Cancer Gene Expression Dataset from Kaggle, which includes 17,814 genes across 590 samples, with clear separation between normal and tumor tissues.
Breast Cancer Gene Expression Dataset (Kaggle): https://www.kaggle.com/datasets/orvile/gene-expression-profiles-of-breast-cancer
Functionality
OncoMarker includes three core user-facing functions:
- calc_fold_change(): Computes log2 fold-change and p-values between Tumor and Normal samples. Handles multiple testing with FDR correction.
- plot_volcano(): Generates Volcano plots highlighting statistically significant upregulated and downregulated genes.
- predict_risk(): Performs cohort-based risk prediction for a given gene using a median cutoff. Differentiates tumor suppressors from oncogenes automatically.
Internally, the package supports robust validation of input matrices and metadata, ensuring consistent results across different datasets.
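The differential expression step can be sketched roughly as follows. This is not the package's code: the function name `fold_change_sketch`, the use of a Welch t-test, the Benjamini-Hochberg method, and the assumption that values are already on a log2 scale are all choices made for this example, and the data are simulated.

```r
# Hypothetical sketch of a Tumor-vs-Normal comparison.
# Assumes `expr` holds log2-scale values (genes x samples) and `diagnosis`
# is a character vector aligned with the columns of `expr`.
fold_change_sketch <- function(expr, diagnosis) {
  tumor  <- expr[, diagnosis == "Tumor",  drop = FALSE]
  normal <- expr[, diagnosis == "Normal", drop = FALSE]

  # On log2-scale data the log2 fold-change is a difference of group means.
  log2fc <- rowMeans(tumor) - rowMeans(normal)

  # Welch t-test per gene (an assumed choice of test).
  p_value <- vapply(seq_len(nrow(expr)), function(i) {
    t.test(tumor[i, ], normal[i, ])$p.value
  }, numeric(1))

  data.frame(
    gene    = rownames(expr),
    log2FC  = log2fc,
    p_value = p_value,
    # Benjamini-Hochberg adjustment controls the false discovery rate.
    p_adj   = p.adjust(p_value, method = "BH"),
    row.names = NULL
  )
}

# Simulated data: 10 tumor and 10 normal samples, with one gene shifted upward.
set.seed(42)
diagnosis <- rep(c("Tumor", "Normal"), each = 10)
expr <- matrix(rnorm(3 * 20, mean = 8), nrow = 3,
               dimnames = list(c("MKI67", "TP53", "GAPDH"), paste0("S", 1:20)))
expr["MKI67", diagnosis == "Tumor"] <- expr["MKI67", diagnosis == "Tumor"] + 3
res <- fold_change_sketch(expr, diagnosis)
```

With only a targeted panel to test, the multiple-testing burden is far smaller than in a genome-wide scan, so fewer true signals are lost to the correction.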
Description File
The package includes a README and a vignette to guide users. Its DESCRIPTION file is shown below.
Package: OncoMarker
Type: Package
Title: OncoMarker: Targeted Genomic Analysis Workflow
Version: 1.0.0
Authors@R:
    c(person("Premitha", "Pagadala",
             email = "premitha@usf.edu",
             role = c("aut", "cre")))
Description: An S4-based bioinformatics framework for targeted biomarker discovery, differential expression analysis, and clinical risk stratification using TCGA breast cancer datasets.
License: MIT + file LICENSE
Encoding: UTF-8
Imports:
    ggplot2,
    methods,
    stats,
    utils
Suggests:
    knitr,
    rmarkdown,
    testthat,
    devtools
VignetteBuilder: knitr
URL: https://github.com/premitha27/OncoMarker
BugReports: https://github.com/premitha27/OncoMarker/issues
RoxygenNote: 7.3.3
LazyData: true
Installation:
# Install devtools if not already installed
install.packages("devtools")
# Install from GitHub
devtools::install_github("premitha27/OncoMarker")
library(OncoMarker)
Data Requirements:
- CSV expression matrices: Genes as rows, samples as columns.
- Metadata CSV: Must include a Diagnosis column (Tumor vs Normal).
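The layout described above can be illustrated with two tiny files. The header names `Gene` and `Sample`, the sample IDs, and all values are made up for this sketch; only the genes-as-rows orientation and the Diagnosis column come from the requirements.

```r
# Write a tiny expression matrix and metadata table to temporary CSV files.
expr_file <- tempfile(fileext = ".csv")
meta_file <- tempfile(fileext = ".csv")

writeLines(c(
  "Gene,S1,S2,S3",
  "TP53,7.1,6.8,9.2",
  "MYC,5.4,5.9,8.7"
), expr_file)

writeLines(c(
  "Sample,Diagnosis",
  "S3,Tumor",
  "S1,Normal",
  "S2,Normal"
), meta_file)

# check.names = FALSE keeps sample IDs untouched even if they contain hyphens.
expr <- as.matrix(read.csv(expr_file, row.names = 1, check.names = FALSE))
meta <- read.csv(meta_file, row.names = 1, stringsAsFactors = FALSE)

# Reorder metadata rows to follow the matrix columns, then confirm alignment.
meta <- meta[colnames(expr), , drop = FALSE]
stopifnot(identical(rownames(meta), colnames(expr)),
          all(meta$Diagnosis %in% c("Tumor", "Normal")))
```

Aligning the metadata to the matrix columns before building any object avoids silently comparing the wrong samples when the two files are sorted differently.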
Code Glimpse
# Load package and data
library(OncoMarker)
expression_matrix <- as.matrix(read.csv("BC-TCGA-NormalTumor.csv", row.names = 1))
metadata <- read.csv("BC-TCGA-Metadata.csv", row.names = 1)
# Create GenePanel object
panel <- new("GenePanel", expression_data = expression_matrix,
             patient_metadata = metadata, cancer_type = "TCGA-BRCA")
# Differential expression analysis
de_stats <- calc_fold_change(panel)
# Volcano plot visualization
plot_volcano(panel)
# Risk stratification for TP53
risk <- predict_risk(panel, gene = "TP53", direction = "low_risk_high_expr")
panel@patient_metadata$Risk <- risk
Results
Using the BC-TCGA dataset:
- MKI67 and ERBB2 are consistently upregulated in tumor samples.
- TP53 downregulation identifies a "High Risk" patient subgroup.
- Volcano plots visually emphasize significant genes, separating upregulated and downregulated markers.
- Risk predictions can be directly integrated with survival analysis packages like survival for Kaplan-Meier curves.
This demonstrates OncoMarker's ability to quickly translate large expression datasets into interpretable clinical insights.
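The median-cutoff idea behind the "High Risk" subgroup can be sketched in a few lines. This is an illustrative stand-in rather than the package's predict_risk(): the function name `risk_sketch`, the second direction label `high_risk_high_expr`, the handling of ties, and the toy values are all assumptions.

```r
# Hypothetical sketch of a median-cutoff rule.
# direction = "low_risk_high_expr" treats high expression as protective
# (tumor-suppressor-like); "high_risk_high_expr" treats it as harmful
# (oncogene-like). The second label is an assumed name.
risk_sketch <- function(expr, gene,
                        direction = c("low_risk_high_expr", "high_risk_high_expr")) {
  direction <- match.arg(direction)
  values <- expr[gene, ]
  cutoff <- median(values)
  # Samples exactly at the median are counted as not high-expressing.
  high_expr <- values > cutoff
  high_risk <- if (direction == "low_risk_high_expr") !high_expr else high_expr
  factor(ifelse(high_risk, "High Risk", "Low Risk"),
         levels = c("Low Risk", "High Risk"))
}

# Made-up expression values for four patients.
expr <- matrix(c(2.0, 9.5, 3.1, 8.7,
                 6.0, 6.5, 7.0, 7.5),
               nrow = 2, byrow = TRUE,
               dimnames = list(c("TP53", "MYC"), paste0("P", 1:4)))
risk <- risk_sketch(expr, "TP53")
table(risk)
```

A median split always yields two groups of roughly equal size, which is convenient for a first look but is a cohort-relative label, not an absolute clinical threshold.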
Description:
- X-axis: log2 fold-change (Tumor vs Normal)
- Y-axis: -log10(p-value)
- Red points: significantly upregulated genes
- Blue points: significantly downregulated genes
- Grey points: not significant
Figure: Volcano plot highlighting differentially expressed genes in breast cancer (TCGA-BRCA dataset).
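The colour scheme described above can be reproduced with a short sketch. It uses base graphics to stay dependency-free, whereas the package itself imports ggplot2; the helper name `classify_genes`, the thresholds (|log2FC| > 1, p < 0.05), and the five data points are assumptions made for illustration.

```r
# Hypothetical sketch; thresholds are assumed, not taken from the package.
classify_genes <- function(log2fc, p_value, fc_cut = 1, p_cut = 0.05) {
  status <- rep("Not significant", length(log2fc))
  status[p_value < p_cut & log2fc >  fc_cut] <- "Up"
  status[p_value < p_cut & log2fc < -fc_cut] <- "Down"
  factor(status, levels = c("Up", "Down", "Not significant"))
}

# Made-up statistics for five genes, for illustration only.
log2fc  <- c(2.4, -1.8, 0.3, 1.5, -0.2)
p_value <- c(1e-6, 1e-4, 0.40, 0.20, 0.01)
status  <- classify_genes(log2fc, p_value)

point_cols <- c("Up" = "red", "Down" = "blue", "Not significant" = "grey")

# pdf(NULL) draws to a null device so the sketch runs in any session;
# drop the pdf(NULL) and dev.off() lines to see the plot interactively.
pdf(NULL)
plot(log2fc, -log10(p_value),
     col = point_cols[as.character(status)], pch = 19,
     xlab = "log2 fold-change (Tumor vs Normal)",
     ylab = "-log10(p-value)")
abline(v = c(-1, 1), h = -log10(0.05), lty = 2)
invisible(dev.off())
```

A gene must pass both cutoffs to be coloured, so a large fold-change with a weak p-value, or a small p-value with a negligible fold-change, stays grey.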
Conclusion
OncoMarker simplifies targeted biomarker discovery by combining S4 object rigor, flexible analysis functions, and visualization tools. Researchers can bypass complex raw sequencing pipelines, perform reproducible differential expression analysis, and stratify patients based on clinically relevant markers.
By operating on normalized CSV matrices, the package democratizes genomic data analysis, making high-level bioinformatics accessible to both biologists and clinicians.
References
Breast Cancer Gene Expression Dataset (Kaggle): https://www.kaggle.com/datasets/orvile/gene-expression-profiles-of-breast-cancer
R Programming Tutorials (Comprehensive): https://www.r-project.org/other-docs.html
Applied Modeling & Text Analysis (Julia Silge, Tidytext): https://www.tidytextmining.com/
Visualization with ggplot2 (Wickham, R Graphics Cookbook): https://ggplot2.tidyverse.org/
Awesome R Learning Resources: https://github.com/qinwf/awesome-R