# Import libraries
library(ggplot2)
library(readr)
library(dplyr)
library(tibble)
library(reshape2)
library(conflicted)
library(tidyr)
library(purrr)
library(limma)
library(edgeR)
library(corrmorant)
2 Week4 - Introduction to Gene expression analysis part 2
This R script serves as a scaffold for adding the code required to fulfill the assignments. It includes the assignments as well as a few hints.
Add the necessary code and type your answers in this document for your own record.
2.1 Assignment 3: Data exploration on the gene level
2.1.1 Convert the CPM values to FPKM values. For some of the exercises below, we need to convert the CPM expression values to FPKM expression values.
# Load data
<- read_delim("Data/MAGNET_exonLengths.txt", delim = "\t")
geneTotExonLengths <- read_delim("Data/MAGNET_GeneExpressionData_CPM_19112020.txt", delim = "\t")
gxData <- read_csv("Data/MAGNET_SampleData_18112022.csv")
sampleInfo
# Add gene ID as the row names
<- column_to_rownames(gxData, var = "EnsemblGeneID")
gxData <- column_to_rownames(geneTotExonLengths, var = "EnsemblGeneID")
geneTotExonLengths
# Check that row names are the same
all(rownames(geneTotExonLengths) == rownames(gxData)) # TRUE (just a check)
[1] TRUE
# Add sample name as the row names
<- column_to_rownames(sampleInfo, var = "sample_name")
sampleInfo
# Convert CPM expression values to FPKM
<- function(x) {
cpm2fpkm <- 2^(x) * 1E3 / geneTotExonLengths[, 1] # . before variable makes it a hidden variable
t
}<- cpm2fpkm(gxData) gxData_fpkm
2.1.2 What does FPKM stand for? How does this measure differ from CPM? (Google)
These metrics attempt to normalize for sequencing depth and gene length. Normalized expression units are necessary to remove technical biases in sequenced data such as depth of sequencing and gene length, and make gene expressions directly comparable within and across samples. More sequencing depth produces more read count for a gene expressed at the same level and differences in gene length generate unequal reads count for genes expressed at the same level.
CPM is a basic gene expression unit that normalizes only for sequencing depth (depth-normalized counts). It is biased in some applications where the gene length influences gene expression, such as RNA-seq.
\[\begin{equation}{} CPM = \frac{N \;reads \;mapped \;to \;gene \times 10^{6}}{Total \;N \;of \;mapped \;reads} \end{equation}\]RPKM (reads per kilobase of transcript per million reads mapped) is a gene expression unit that measures the expression levels (mRNA abundance) of genes or transcripts. RPKM is a gene length normalized expression unit that is used for identifying the differentially expressed genes by comparing the RPKM values between different experimental conditions. Generally, the higher the RPKM of a gene, the higher the expression of that gene.
\[\begin{equation}{} RPKM = \frac{N \;reads \;mapped \;to \;gene \times 10^{3} \times 10^{6}}{Total \;N \;of \;mapped \;reads \times gene \;length \;in \;bp} \end{equation}\]Here, 10^3 normalizes for gene length and 10^6 for sequencing depth factor.
FPKM (fragments per kilobase of exon per million mapped fragments) is a gene expression unit which is analogous to RPKM. FPKM is used especially for normalizing counts for paired-end RNA-seq data in which two (left and right) reads are sequenced from the same DNA fragment. Generally, the higher the FPKM of a gene, the higher the expression of that gene.
When we map paired-end data, both reads or only one read with high quality from a fragment can map to reference sequence. To avoid confusion or multiple counting, the fragments to which both or single read mapped are counted and represented for FPKM calculation.
2.1.3 In you own words, describe what the code above does.
The code takes the gene expression values in CPM and normalize them by the length of the genes, according to the formula expressed before.
2.1.4 Can we compare the FPKM value of gene A between two samples to state in which sample gene A is more highly expressed?
No, because FPKM values are normalized by the length of genes, which means that we cannot compare the values across different samples.
2.1.5 Can we compare the FPKM value of gene A to gene B in a single sample to state which gene is more highly expressed?
Yes, because FPKM values are normalized by the length of the genes.
2.1.6 Can we compare the CPM values of gene A between two samples to determine in which sample gene A is more highly expressed?
Yes, because CPM values are normalized by the sequencing depth (depth-normalized counts) and they do not take into account the length of genes for the normalization process.
2.1.7 Can we compare the CPM value of gene A to the value of gene B in a single sample to determine which gene is more highly expressed?
No, because CPM values are not normalized by the length of the genes.
2.1.8 Using the FPKM values, answer the following questions:
2.1.8.1 What are the IDs of the 5 highest expressed genes? What is their function according to the GeneCards website?
# Create a column with the mean of expression values of all samples
<- gxData_fpkm %>%
exp_mean_df rownames_to_column(var = "geneID") %>%
rowwise() %>%
mutate(exp_mean = mean(c_across(C00039:P01640))) %>%
column_to_rownames(var = "geneID") %>%
select(exp_mean)
# Select the 5 most expressed genes
<- exp_mean_df %>%
max5_genes_mean slice_max(n= 5, exp_mean)
Gene ID | Name | Function |
---|---|---|
ENSG00000198804 | MT-CO1 | Contributes to cytochrome-c oxidase activity |
ENSG00000198899 | MT-ATP6 | Contributes to proton-transporting ATP synthase activity |
ENSG00000198938 | MT-CO3 | Involved in respiratory chain complex IV assembly |
ENSG00000198712 | MT-CO2 | Contributes to cytochrome-c oxidase activity |
ENSG00000198886 | MT-ND4 | Enables NADH dehydrogenase (ubiquinone) activity |
All of the genes are involved in processes related to mitochondria activity, which make sense because the dataset has muscle samples.
2.1.8.2 What are the IDs of the 5 lowest expressed genes? What is their function according to the GeneCards website?
# Select the 5 lowest expressed genes
<- exp_mean_df %>%
min5_genes_mean slice_min(n = 5, exp_mean)
Gene ID | Name | Function |
---|---|---|
ENSG00000015568 | RGPD5 | RAN is a small GTP-binding protein of the RAS superfamily that is associated with the nuclear membrane |
ENSG00000162105 | SHANK2 | This gene encodes a protein that is a member of the Shank family of synaptic proteins that may function as molecular scaffolds in the postsynaptic density of excitatory synapses |
ENSG00000267586 | LINC00907 | RNA Gene, and is affiliated with the lncRNA class |
ENSG00000215126 | ZNG1F | Predicted to enable ATP binding activity |
ENSG00000183914 | DNAH2 | Dyneins are microtubule-associated motor protein complexes |
There are pseudogenes, RNA genes, and others related to different processes (i.e. synaptic genes).
2.1.8.3 What are the IDs of the 5 most variable genes? What is their function according to the GeneCards website?
# Create a column with the mean of expression values of all samples
<- gxData %>%
exp_var_df rownames_to_column(var = "geneID") %>%
rowwise() %>%
mutate(exp_var = var(c_across(C00039:P01640))) %>%
column_to_rownames(var = "geneID") %>%
select(exp_var)
# Select the 5 most variable genes
<- exp_var_df %>%
max5_genes_var slice_max(n= 5, exp_var)
Gene ID | Name | Function |
---|---|---|
ENSG00000198692 | EIF1AY | Eukaryotic Translation Initiation Factor 1A Y-Linked |
ENSG00000129824 | RPS4Y1 | Ribosomal Protein S4 Y-Linked 1 |
ENSG00000114374 | USP9Y | Ubiquitin Specific Peptidase 9 Y-Linked |
ENSG00000067048 | DDX3Y | DEAD-Box Helicase 3 Y-Linked |
ENSG00000012817 | KDM5D | Lysine Demethylase 5D - encodes a protein containing zinc finger domains |
4 of these genes are related to Y chromosome, which are absent in the female samples.
Note: By using the FPKM dataset, we got the same genes as the 5 mots highly expressed.
2.1.8.3.1 What are the IDs of the 5 least variable (= stable!) genes? What is their function according to the GeneCards website?
# Select the 5 least variable genes
<- exp_var_df %>%
min5_genes_var slice_min(n= 5, exp_var)
Gene ID | Name | Function |
---|---|---|
ENSG00000136709 | WD Repeat Domain 33 | WD repeats are conserved regions, which may facilitate formation of heterotrimeric or multiprotein complexes |
ENSG00000089053 | ANAPC5 | Anaphase Promoting Complex Subunit 5 |
ENSG00000111361 | EIF2B1 | Eukaryotic Translation Initiation Factor 2B Subunit Alpha |
ENSG00000086475 | SEPHS1 | Selenophosphate Synthetase 1 |
ENSG00000106609 | TMEM248 | Transmembrane Protein 248 |
All of the genes are related to conserved cellular functions - house keeping genes.
2.1.9 Using the CPM values, answer the following questions:
2.1.9.1 Which 5 genes show the strongest correlation to age in the control group?
# Get the dataframe with the list of sample names with NF
<- rownames_to_column(sampleInfo, var = "sample") %>%
NF_columns ::filter(etiology == "NF") %>%
dplyr::select(sample) %>%
dplyrpull(sample)
# Get the gene expression data from the NF patients
<- gxData %>%
NF_data ::select(NF_columns)
dplyr
# Transpose the dataframe to have genes as columns
<- as.data.frame(t(NF_data))
NF_data
# Get the metadata from NF patients
<- as.data.frame(t(sampleInfo))
t_sampleInfo
<- t_sampleInfo %>%
NF_metadata rownames_to_column(var = "covariate") %>%
::select(c(NF_columns, covariate)) %>%
dplyrcolumn_to_rownames(var = "covariate")
# Transpose the dataframe to have age as column
<- as.data.frame(t(NF_metadata))
NF_metadata
# Add age column into the gene expression dataframe
<- NF_data %>%
NF_data mutate(age = NF_metadata$age)
# Calculate correlation values
<- cor(NF_data[ , colnames(NF_data) != "age"],
data_cor as.numeric(NF_data$age))
<- as.data.frame(data_cor)
data_cor
# Select the 5 most correlated genes with age in the control group
<- data_cor %>%
corr5_genes_age arrange(desc(abs(V1))) %>%
slice_head(n= 5)
# Get the gene expression data
<- NF_data %>%
corr5_gene_data ::select(row.names(corr5_genes_age))
dplyr
# Calculate significance (p value) of the 5 most correlated genes with age
1,2] <- cor.test(corr5_gene_data$ENSG00000244681, as.numeric(NF_data$age))$p.value
corr5_genes_age[2,2] <- cor.test(corr5_gene_data$ENSG00000244694, as.numeric(NF_data$age))$p.value
corr5_genes_age[3,2] <- cor.test(corr5_gene_data$ENSG00000182264, as.numeric(NF_data$age))$p.value
corr5_genes_age[4,2] <- cor.test(corr5_gene_data$ENSG00000154080, as.numeric(NF_data$age))$p.value
corr5_genes_age[5,2] <- cor.test(corr5_gene_data$ENSG00000250337, as.numeric(NF_data$age))$p.value
corr5_genes_age[
colnames(corr5_genes_age) <- c("estimate", "p_value")
- Is the correlation positive or negative?
4 of the values are positive and one is negative
- Is the correlation significant?
Yes, all of the correlations were significant
2.1.9.2 What is their function according to the GeneCards website? Are they genes of which the expression is known to change with age (use Pubmed)?
Gene ID | Name | Function |
---|---|---|
ENSG00000244681 | MTHFD2P1 | Pseudogene |
ENSG00000244694 | PTCHD4 | Predicted to be integral component of membrane |
ENSG00000182264 | IZUMO1 | The sperm-specific protein Izumo is essential for sperm-egg plasma membrane binding and fusion |
ENSG00000154080 | CHST9 | Catalyzes the transfer of sulfate to position 4 of non-reducing N-acetylgalactosamine (GalNAc) residues in both N-glycans and O-glycans |
ENSG00000250337 | PURPL | RNA Gene, and is affiliated with the lncRNA class. Diseases associated with colorectal cancer and myasthenic syndrome |
2.1.9.3 Visualize the result for at least 1 gene (HINT: CPM values on the y-axis, age in years on the x-axis)
Scatter plot of one gene vs age
# Add age column into the 5 most correlated genes dataframe
<- corr5_gene_data %>%
corr5_gene_data mutate(Age = as.numeric(NF_metadata$age), .before = 1)
<- corr5_gene_data %>%
corr1gene_age select(Age, ENSG00000244681)
<- ggplot(corr1gene_age, aes(x = Age, y = ENSG00000244681)) +
corr1gene_age_plot geom_point() +
labs(x = "Age", y = "ENSG00000244681 Expression") +
scale_x_continuous(n.breaks = 10.0)
corr1gene_age_plot
Scatter plots, correlation values, and distributions of all genes and age
<- ggcorrm(data = corr5_gene_data) +
corr_plot_allgenes theme_corrm(base_size = 6) +
theme(axis.text.x = element_text(angle = 90, size = 8),
axis.text.y = element_text(size = 8),
strip.text.x = element_text(size = 5),
strip.text.y = element_text(size = 5),
legend.text = element_text(size = 8),
legend.title = element_text(size = 8)) +
lotri(geom_point(alpha = 0.5)) +
lotri(geom_smooth(colour = "red4")) +
utri_heatmap(alpha = 0.5, corr_method = "spearman") +
utri_corrtext(corr_method = "spearman", size = 3.5) +
dia_histogram(lower = 0.1, fill = "grey80", color = 1) +
dia_density(lower = 0.1, alpha = .1, colour = "red4") +
scale_fill_gradient2(low = "white", mid = "red3", high = "red4",
midpoint = 0.5, space = "rgb",
guide = guide_colorbar(title = "Correlation coefficient"),
limits = c(0, 1))
corr_plot_allgenes
2.2 Assignment 4: Differential gene expression analysis.
Now that we have explored the gene expression data, it is time to perform a differential gene expression analysis.
2.2.1 What is differential gene expression analysis (DGEA)? What are some of the most common packages in R for DGEA?
Differential expression analysis means taking the normalised read count data and performing statistical analysis to discover quantitative changes in
expression levels between experimental groups.
R packages:
We are going to use the limma package to perform a DGEA. We need to use the CPM normalized values. Have a look at the limma guide section 15.4: (https://www.bioconductor.org/packages/devel/bioc/vignettes/limma/inst/doc/usersguide.pdf)
2.2.2 Implement the steps noted in the limma guide for the MAGNET dataset. Start with a DGEA between DCM patients and healthy controls.
2.2.2.1 Limma-trend
# Convert counts to logCPM values
<- cpm(gxData, log = TRUE, prior.count = 3)
logCPM
# Create design matrix
= model.matrix(~0 + sampleInfo$etiology)
design
# Apply limma pipeline
<- lmFit(logCPM, design)
fit <- eBayes(fit, trend = TRUE)
fit topTable(fit, coef = ncol(design))
logFC AveExpr t P.Value adj.P.Val B
ENSG00000089053 6.381604 6.384943 961.3848 0 0 1415.207
ENSG00000129351 6.429434 6.416465 955.8661 0 0 1413.270
ENSG00000105323 6.380228 6.365844 944.4249 0 0 1409.213
ENSG00000136709 6.329745 6.324768 937.9541 0 0 1406.892
ENSG00000106609 6.348231 6.340574 921.1237 0 0 1400.769
ENSG00000176915 6.281051 6.282988 907.5412 0 0 1395.732
ENSG00000075785 6.481443 6.478184 898.0510 0 0 1392.161
ENSG00000100711 6.284988 6.278731 891.0022 0 0 1389.480
ENSG00000182944 6.459927 6.446393 890.8330 0 0 1389.416
ENSG00000113648 6.404155 6.400461 890.3165 0 0 1389.218
# Give more weight to fold-changes in the gene ranking
<- lmFit(logCPM, design)
fit <- treat(fit, lfc = log2(1.2), trend = TRUE)
fit topTreat(fit, coef = ncol(design))
logFC AveExpr t P.Value adj.P.Val
ENSG00000000003 6.117085 6.111910 388.5360 0 0
ENSG00000000419 6.294569 6.292874 480.5645 0 0
ENSG00000000457 6.100006 6.111824 433.3020 0 0
ENSG00000000460 5.886348 5.887347 308.1765 0 0
ENSG00000000938 5.992674 6.059126 206.5199 0 0
ENSG00000000971 6.391421 6.392179 255.8019 0 0
ENSG00000001036 6.299640 6.308049 581.1499 0 0
ENSG00000001084 6.273295 6.255830 284.7585 0 0
ENSG00000001167 6.134147 6.122626 201.0493 0 0
ENSG00000001460 5.943226 5.954836 370.8629 0 0
Results show the p value of 0 in some genes, which means that something is wrong.
2.2.3 Which co-variates should be taken along for correction? (confounding; see the “alcohol causes lung cancer” example from the lecture)
# Convert counts to logCPM values
<- cpm(gxData, log = TRUE, prior.count = 3)
logCPM
# Create design matrix considering confounding variables
= model.matrix(~0 + etiology + gender + age, data = sampleInfo)
design
# Apply limma pipeline with confounding variables
<- lmFit(logCPM, design)
fit
<- makeContrasts(DCMvsControl = etiologyDCM - etiologyNF,
cont.matrix levels = design)
<- contrasts.fit(fit, cont.matrix)
fit
<- eBayes(fit, trend = TRUE)
efit
<- topTable(efit, coef = 'DCMvsControl', number = nrow(gxData))
dgeRes
glimpse(dgeRes)
Rows: 20,781
Columns: 6
$ logFC <dbl> 0.15831103, 0.33966098, 0.21263013, -0.44141913, 0.36628640,…
$ AveExpr <dbl> 6.286287, 5.952977, 6.101884, 6.022681, 6.049712, 6.371697, …
$ t <dbl> 29.01545, 27.11832, 25.18996, -25.06301, 24.77563, -24.26241…
$ P.Value <dbl> 7.851116e-97, 1.689807e-89, 7.264246e-82, 2.342191e-81, 3.33…
$ adj.P.Val <dbl> 1.631540e-92, 1.755794e-85, 5.031943e-78, 1.216827e-77, 1.38…
$ B <dbl> 210.1073, 193.3613, 175.9173, 174.7550, 172.1180, 167.3887, …
After including the cofounding variables, the p values are not 0, which means that the cofounding variables are important for the analysis.
2.2.4 Copy the top 200 differentially expressed genes to for a quick GO enrichment analysis. Which processes are changed between DCM and controls? Do these processes make biological sense? (quick literature check!)
# Select the 200 most correlated genes with age in the control group
<- dgeRes %>%
to200_corr_genes slice_head(n = 200) %>%
rownames_to_column(var = "Gene_ID") %>%
select(Gene_ID)
# Select the names of all genes in the control group
<- dgeRes %>%
all_corr_genes rownames_to_column(var = "Gene_ID") %>%
select(Gene_ID)
# Export target list to csv file
write_csv(to200_corr_genes, "Outputs/to200_corr_genes.csv", col_names = FALSE)
# Export background list to csv file
write_csv(all_corr_genes, "Outputs/all_corr_genes.csv", col_names = FALSE)
In the GOrilla server, the inputs are the target and background tables exported in the previous step. The results are shown in the following figure:
The results show that the most enriched GO terms are related to inflammation and structural processed, and the immune system, which is consistent with the literature.