Final Project Package


OncoMarker: Targeted Genomic Biomarker Discovery in Breast Cancer

Introduction

High-throughput sequencing has transformed oncology, producing massive datasets that describe gene expression, mutations, and epigenetic alterations. While this wealth of information has propelled research, whole-genome analyses often overwhelm computational pipelines and slow clinical translation.

Analyzing all ~20,000 human genes at once is computationally heavy and hard to interpret. OncoMarker addresses this challenge with a streamlined R framework for analyzing targeted gene panels, so that researchers and clinicians can quickly identify differential expression patterns, visualize results, and stratify patients by biomarker risk. The package is designed to be accessible: it operates on pre-processed expression matrices rather than raw sequencing data.

Ideology

The philosophy of OncoMarker is simple: "The Simple Twist." Instead of starting with raw FASTQ files, the package ingests normalized, comma-separated (CSV) expression matrices (Level 3 data) and clinical metadata. This approach:

  • Eliminates the need for high-performance computing.

  • Reduces the complexity of whole-genome analysis.

  • Provides robust S4 object-oriented data management for reproducibility.

By focusing on targeted panels (e.g., PAM50 or custom biomarker sets), OncoMarker allows users to identify clinically relevant genes efficiently.

Description

OncoMarker is an R/Bioconductor-style package that combines:

  • Strictly typed S4 objects for data integrity (GenePanel class).

  • Differential expression analysis with fold-change and p-values.

  • Visualization via customizable Volcano plots.

  • Simple yet flexible risk prediction models for key tumor suppressors and oncogenes (e.g., TP53, BRCA1, MYC).

The package works with datasets like the Breast Cancer Gene Expression Dataset from Kaggle, which includes 17,814 genes across 590 samples, with clear separation between normal and tumor tissues.

Breast Cancer Gene Expression Dataset (Kaggle): https://www.kaggle.com/datasets/orvile/gene-expression-profiles-of-breast-cancer 

Functionality

OncoMarker includes three core user-facing functions:

  1. calc_fold_change()
    Computes log2 fold-change and p-values between Tumor and Normal samples. Handles multiple testing with FDR correction.

  2. plot_volcano()
    Generates Volcano plots highlighting statistically significant upregulated and downregulated genes.

  3. predict_risk()
    Performs cohort-based risk prediction for a given gene using a median cutoff. The direction argument distinguishes tumor suppressors from oncogenes (e.g., "low_risk_high_expr" for TP53).
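To make the first function concrete, here is a minimal base-R sketch of a per-gene log2 fold-change with p-values and FDR correction. It is not the package's actual code: the helper name sketch_fold_change, the use of a Welch t-test, the Benjamini-Hochberg method, and the toy values are all assumptions, and the data are assumed to be on a log2 scale already.

```r
# Toy log2-scale matrix (made-up values): genes as rows, samples as columns.
expr <- rbind(
  GENE_A = c(5, 6, 5, 6, 8, 9, 8, 9),
  GENE_B = c(7, 8, 7, 8, 7, 8, 8, 7),
  GENE_C = c(9, 10, 9, 10, 7, 8, 7, 8)
)
colnames(expr) <- paste0("S", 1:8)
diagnosis <- rep(c("Normal", "Tumor"), each = 4)

# Hypothetical helper, not OncoMarker's calc_fold_change().
sketch_fold_change <- function(expr, diagnosis) {
  tumor  <- expr[, diagnosis == "Tumor",  drop = FALSE]
  normal <- expr[, diagnosis == "Normal", drop = FALSE]

  # On log2-scale data the log2 fold-change is a difference of group means.
  log2fc <- rowMeans(tumor) - rowMeans(normal)

  # One Welch t-test per gene (assumed test; the package may use another).
  pval <- vapply(seq_len(nrow(expr)), function(i) {
    t.test(tumor[i, ], normal[i, ])$p.value
  }, numeric(1))

  data.frame(
    gene    = rownames(expr),
    log2FC  = unname(log2fc),
    p_value = pval,
    fdr     = p.adjust(pval, method = "BH"),  # Benjamini-Hochberg FDR
    stringsAsFactors = FALSE
  )
}

res <- sketch_fold_change(expr, diagnosis)
print(res)
```

If the input matrix is not log-transformed, the difference of means would need to be replaced by a log2 ratio of means.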

Internally, the package supports robust validation of input matrices and metadata, ensuring consistent results across different datasets.
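One way such validation could be wired into an S4 class is a validity function. The sketch below is an illustration only: the class name GenePanelSketch and the specific checks are assumptions, while the slot names follow the ones used elsewhere in this post.

```r
library(methods)

# Hypothetical stand-in for the GenePanel class; the checks are assumed.
setClass(
  "GenePanelSketch",
  slots = c(
    expression_data  = "matrix",
    patient_metadata = "data.frame",
    cancer_type      = "character"
  ),
  validity = function(object) {
    problems <- character(0)
    if (!is.numeric(object@expression_data)) {
      problems <- c(problems, "expression_data must be numeric")
    }
    if (!"Diagnosis" %in% names(object@patient_metadata)) {
      problems <- c(problems, "patient_metadata needs a Diagnosis column")
    }
    if (!setequal(colnames(object@expression_data),
                  rownames(object@patient_metadata))) {
      problems <- c(problems, "sample IDs in both inputs must match")
    }
    if (length(problems) == 0) TRUE else problems
  }
)

# Toy inputs (made-up values).
expr <- matrix(c(5, 6, 8, 9), nrow = 1,
               dimnames = list("TP53", c("S1", "S2", "S3", "S4")))
meta <- data.frame(Diagnosis = c("Normal", "Normal", "Tumor", "Tumor"),
                   row.names = c("S1", "S2", "S3", "S4"))

panel <- new("GenePanelSketch", expression_data = expr,
             patient_metadata = meta, cancer_type = "TCGA-BRCA")

# Dropping two metadata rows makes construction fail with a clear message.
bad <- tryCatch(
  new("GenePanelSketch", expression_data = expr,
      patient_metadata = meta[1:2, , drop = FALSE],
      cancer_type = "TCGA-BRCA"),
  error = function(e) conditionMessage(e)
)
```

Because new() runs the validity function whenever slots are supplied, a mismatched object is rejected at construction time rather than partway through an analysis.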

Description File

The package includes a README and a vignette to guide users. Its DESCRIPTION file reads:

Package: OncoMarker
Type: Package
Title: OncoMarker: Targeted Genomic Analysis Workflow
Version: 1.0.0
Authors@R:
    c(person("Premitha", "Pagadala",
             email = "premitha@usf.edu",
             role = c("aut", "cre")))
Description: An S4-based bioinformatics framework for targeted biomarker discovery, differential expression analysis, and clinical risk stratification using TCGA breast cancer datasets.
License: MIT + file LICENSE
Encoding: UTF-8
Imports:
    ggplot2,
    stats,
    utils
Suggests:
    knitr,
    rmarkdown,
    testthat,
    devtools
VignetteBuilder: knitr
URL: https://github.com/premitha27/OncoMarker
BugReports: https://github.com/premitha27/OncoMarker/issues
RoxygenNote: 7.3.3
LazyData: true

Installation:

# Install devtools if not already installed
if (!requireNamespace("devtools", quietly = TRUE)) install.packages("devtools")

# Install from GitHub
devtools::install_github("premitha27/OncoMarker")

library(OncoMarker)

Data Requirements:

  • CSV expression matrices: Genes as rows, samples as columns.

  • Metadata CSV: Must include a Diagnosis column (Tumor vs Normal).
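The expected layout can be illustrated with two tiny files. The sketch below writes toy CSVs to temporary files and reads them back the same way the package example does; the gene values and sample IDs are made up for illustration.

```r
expr_file <- tempfile(fileext = ".csv")
meta_file <- tempfile(fileext = ".csv")

# Expression matrix: first column holds gene names, one column per sample.
writeLines(c(
  "gene,S1,S2,S3,S4",
  "TP53,7.1,6.9,5.2,5.0",
  "MYC,6.0,6.2,8.1,8.4"
), expr_file)

# Metadata: first column holds sample IDs, plus the required Diagnosis column.
writeLines(c(
  "sample,Diagnosis",
  "S1,Normal",
  "S2,Normal",
  "S3,Tumor",
  "S4,Tumor"
), meta_file)

expression_matrix <- as.matrix(read.csv(expr_file, row.names = 1))
metadata <- read.csv(meta_file, row.names = 1)

# Basic sanity checks before building a GenePanel object.
stopifnot(
  is.numeric(expression_matrix),
  "Diagnosis" %in% colnames(metadata),
  identical(colnames(expression_matrix), rownames(metadata))
)
```

One thing to watch with real data: read.csv() rewrites column names that are not syntactically valid R names, so sample IDs containing hyphens become dotted unless check.names = FALSE is passed. The metadata row names would then no longer match the matrix columns.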

Code Glimpse

# Load package and data
library(OncoMarker)
expression_matrix <- as.matrix(read.csv("BC-TCGA-NormalTumor.csv", row.names = 1))
metadata <- read.csv("BC-TCGA-Metadata.csv", row.names = 1)

# Create GenePanel object
panel <- new("GenePanel",
             expression_data = expression_matrix,
             patient_metadata = metadata,
             cancer_type = "TCGA-BRCA")

# Differential expression analysis
stats <- calc_fold_change(panel)

# Volcano plot visualization
plot_volcano(panel)

# Risk stratification for TP53
risk <- predict_risk(panel, gene = "TP53", direction = "low_risk_high_expr")
panel@patient_metadata$Risk <- risk
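For readers curious what a median-cutoff rule looks like, here is a small standalone sketch. It is not the body of predict_risk(): the helper name, the second direction value "high_risk_high_expr", the "Low Risk" label, and the handling of ties are assumptions made for illustration.

```r
# Hypothetical helper illustrating a median split on one gene.
sketch_predict_risk <- function(expr, gene,
                                direction = c("low_risk_high_expr",
                                              "high_risk_high_expr")) {
  direction <- match.arg(direction)
  stopifnot(gene %in% rownames(expr))

  values <- expr[gene, ]
  cutoff <- median(values)

  # Samples exactly at the median count as "not high" in this sketch.
  high_expr <- values > cutoff

  # Tumor suppressor: low expression is risky. Oncogene: high expression is.
  high_risk <- if (direction == "low_risk_high_expr") !high_expr else high_expr
  ifelse(high_risk, "High Risk", "Low Risk")
}

# Toy matrix (made-up values): genes as rows, samples as columns.
expr <- rbind(
  TP53 = c(S1 = 7.1, S2 = 6.9, S3 = 5.2, S4 = 5.0),
  MYC  = c(S1 = 6.0, S2 = 6.2, S3 = 8.1, S4 = 8.4)
)

tp53_risk <- sketch_predict_risk(expr, "TP53", "low_risk_high_expr")
myc_risk  <- sketch_predict_risk(expr, "MYC", "high_risk_high_expr")
```

The result is a character vector named by sample, so it can be attached to a metadata table as a new column once the sample order is confirmed to match.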

Results

Using the BC-TCGA dataset:

  • MKI67 and ERBB2 are consistently upregulated in tumor samples.

  • TP53 downregulation identifies a "High Risk" patient subgroup.

  • Volcano plots visually emphasize significant genes, separating upregulated and downregulated markers.

  • Risk predictions can be passed directly to the survival package to draw Kaplan-Meier curves.

This demonstrates OncoMarker's ability to quickly translate large expression datasets into interpretable clinical insights.

Volcano plot description:

  • X-axis: log2 fold-change (Tumor vs Normal)

  • Y-axis: -log10(p-value)

  • Red points: significantly upregulated genes

  • Blue points: significantly downregulated genes

  • Grey points: not significant


Figure: Volcano plot highlighting differentially expressed genes in breast cancer (TCGA-BRCA dataset).
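A plot with this layout can be approximated in a few lines of ggplot2. The sketch below is not plot_volcano() itself: the results table is made up, and the thresholds (|log2FC| >= 1 and p < 0.05) are assumed cutoffs chosen only for illustration.

```r
library(ggplot2)

# Toy results table (illustrative values, not real findings).
res <- data.frame(
  gene    = c("G1", "G2", "G3", "G4"),
  log2FC  = c(2.5, -1.8, 0.2, 1.6),
  p_value = c(0.0001, 0.002, 0.6, 0.2)
)

# Assumed significance thresholds.
fc_cut <- 1
p_cut  <- 0.05

# Classify each gene as Up, Down, or Not significant.
res$status <- "Not significant"
res$status[res$p_value < p_cut & res$log2FC >=  fc_cut] <- "Up"
res$status[res$p_value < p_cut & res$log2FC <= -fc_cut] <- "Down"

volcano <- ggplot(res, aes(x = log2FC, y = -log10(p_value), colour = status)) +
  geom_point() +
  scale_colour_manual(values = c(Up = "red", Down = "blue",
                                 "Not significant" = "grey")) +
  geom_vline(xintercept = c(-fc_cut, fc_cut), linetype = "dashed") +
  geom_hline(yintercept = -log10(p_cut), linetype = "dashed") +
  labs(x = "log2 fold-change (Tumor vs Normal)", y = "-log10(p-value)")

# print(volcano) draws the figure in an interactive session.
```

Note that G4 has a large fold-change but stays grey because its p-value misses the cutoff, which is exactly the distinction a volcano plot is meant to show.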

Conclusion

OncoMarker simplifies targeted biomarker discovery by combining S4 object rigor, flexible analysis functions, and visualization tools. Researchers can bypass complex raw sequencing pipelines, perform reproducible differential expression analysis, and stratify patients based on clinically relevant markers.

By operating on normalized CSV matrices, the package democratizes genomic data analysis, making high-level bioinformatics accessible to both biologists and clinicians.

References

Kaggle BC-TCGA Dataset:
https://www.kaggle.com/datasets/orvile/gene-expression-profiles-of-breast-cancer
R Programming Tutorials (Comprehensive):
https://www.r-project.org/other-docs.html
Applied Modeling & Text Analysis (Julia Silge, Tidytext):
https://www.tidytextmining.com/
Visualization with ggplot2 (package documentation):
https://ggplot2.tidyverse.org/
Awesome R Learning Resources:
https://github.com/qinwf/awesome-R
