Part14 Demo volcano plot
A volcano plot is a type of scatter plot represents differential expression of features (genes for example): on the x-axis we typically find the fold change and on the y-axis the p-value.
Here is a demo of how to create a volcano plot from a file containing :
- Gene symbol
- log2 fold change
- p-value
Read in the file:
# The basic scatter plot: x is "log2FoldChange", y is "pvalue"
ggplot(data=de, aes(x=log2FoldChange, y=pvalue)) +
geom_point()
Doesn’t look quite like a Volcano plot…
In volcano plots, the p-value is converted into a -log10(p-value):
# Convert directly in the aes()
ggplot(data=de, aes(x=log2FoldChange, y=-log10(pvalue))) +
geom_point()
That’s better!
Choose a nicer theme:
Add vertical lines for log2FoldChange thresholds, and one horizontal line for the p-value threshold:
p2 <- p + geom_vline(xintercept=c(-0.6, 0.6), col="red") +
geom_hline(yintercept=-log10(0.05), col="red")
The significantly differentially expressed genes are the ones found in the upper-left and upper-right corners.
We can add a column to the data frame to specify if they are UP- or DOWN- regulated (log2FoldChange respectively positive or negative).
de <- de %>% mutate(diffexpressed=ifelse(log2FoldChange > 0.6 & pvalue < 0.05, "UP",
ifelse(log2FoldChange < -0.6 & pvalue < 0.05, "DOWN", "NO")))
head(de)
## # A tibble: 6 × 4
## gene_symbol log2FoldChange pvalue diffexpressed
## <chr> <dbl> <dbl> <chr>
## 1 Ndufa9 -0.876 3.39e-34 DOWN
## 2 Rpp21 -0.976 1.56e-26 DOWN
## 3 Amotl1 -0.758 2.16e-24 DOWN
## 4 Gm44064 -0.849 5.13e-20 DOWN
## 5 Hcfc1r1 -0.602 3.80e-17 DOWN
## 6 Rgs18 -0.817 1.06e-16 DOWN
Re-plot but, this time, conditionally color the points using diffexpressed column:
p <- ggplot(data=de, aes(x=log2FoldChange, y=-log10(pvalue), col=diffexpressed)) +
geom_point() +
theme_minimal()
Add lines as before…
p2 <- p + geom_vline(xintercept=c(-0.6, 0.6), col="red") +
geom_hline(yintercept=-log10(0.05), col="red")
Now we will write down the name of genes beside the points… but only selected genes.
Create a new column “delabel” to de, that will contain the name of genes differentially expressed (NA will be set, in case they are NO).
ggplot(data=de, aes(x=log2FoldChange, y=-log10(pvalue), col=diffexpressed, label=delabel)) +
geom_point() +
theme_minimal() +
geom_text()
Finally, we can organize the labels nicely using the “ggrepel” package and the geom_text_repel() function:
# load library: either "tick it" in the Packages tab, or use the following:
library(ggrepel)
# plot adding up all layers we have seen so far
ggplot(data=de, aes(x=log2FoldChange, y=-log10(pvalue), col=diffexpressed, label=delabel)) +
geom_point() +
theme_minimal() +
geom_text_repel() +
scale_color_manual(values=c(DOWN="blue", NO="black", UP="red")) +
geom_vline(xintercept=c(-0.6, 0.6), col="red") +
geom_hline(yintercept=-log10(0.05), col="red") +
ggtitle(label="Volcano plot") +
theme(plot.title = element_text(hjust = 0.5, face = "bold"))