Part15 Demo volcano plot

A volcano plot is a type of scatter plot represents differential expression of features (genes for example): on the x-axis we typically find the fold change and on the y-axis the p-value.

Here is a demo of how to create a volcano plot from a file containing :

  • Gene symbol
  • log2 fold change
  • p-value

Read in the file:

de <- read_csv("DataViz_source_files-main/files/demo_volcano.csv")
# The basic scatter plot: x is "log2FoldChange", y is "pvalue"
ggplot(data=de, aes(x=log2FoldChange, y=pvalue)) + 
  geom_point()

Doesn’t look quite like a Volcano plot…

In volcano plots, the p-value is converted into a -log10(p-value):

# Convert directly in the aes()
ggplot(data=de, aes(x=log2FoldChange, y=-log10(pvalue))) + 
  geom_point()

That’s better!

Choose a nicer theme:

p <- ggplot(data=de, aes(x=log2FoldChange, y=-log10(pvalue))) + 
  geom_point() + 
  theme_minimal()

Add vertical lines for log2FoldChange thresholds, and one horizontal line for the p-value threshold:

p2 <- p + geom_vline(xintercept=c(-0.6, 0.6), col="red") +
    geom_hline(yintercept=-log10(0.05), col="red")

The significantly differentially expressed genes are the ones found in the upper-left and upper-right corners.

We can add a column to the data frame to specify if they are UP- or DOWN- regulated (log2FoldChange respectively positive or negative).

Let’s create a new object with these new calculations. Note that we can use function mutate() from {dplyr} to create or modify columns

de_fc <- de %>% mutate(log10pvalue=-log10(pvalue),
                       diffexpressed=ifelse(log2FoldChange > 0.6 & pvalue < 0.05, "UP",
                                         ifelse(log2FoldChange < -0.6 & pvalue < 0.05, "DOWN", "NO")))

head(de_fc)
## # A tibble: 6 × 5
##   gene_symbol log2FoldChange   pvalue log10pvalue diffexpressed
##   <chr>                <dbl>    <dbl>       <dbl> <chr>        
## 1 Ndufa9              -0.876 3.39e-34        33.5 DOWN         
## 2 Rpp21               -0.976 1.56e-26        25.8 DOWN         
## 3 Amotl1              -0.758 2.16e-24        23.7 DOWN         
## 4 Gm44064             -0.849 5.13e-20        19.3 DOWN         
## 5 Hcfc1r1             -0.602 3.80e-17        16.4 DOWN         
## 6 Rgs18               -0.817 1.06e-16        16.0 DOWN

Re-plot but, this time, conditionally color the points using diffexpressed column:

p <- ggplot(data=de_fc, aes(x=log2FoldChange, y=log10pvalue, col=diffexpressed)) +
  geom_point() + 
  theme_minimal()

Add vertical and horizontal lines, as we did before.

p2 <- p + geom_vline(xintercept=c(-0.6, 0.6), col="red") +
        geom_hline(yintercept=-log10(0.05), col="red")

Change point colors:

p3 <- p2 + scale_color_manual(values=c(DOWN="blue", NO="grey50", UP="red"))

p3

Now we will write down the name of genes beside the points… but only selected genes.

Add to de_fc the new column “delabel”, that will contain the name of genes differentially expressed (NA will be set, in case they are NO).

de_fc <- mutate(de_fc, delabel=ifelse(diffexpressed=="NO", NA, gene_symbol))
ggplot(data=de_fc, aes(x=log2FoldChange, y=log10pvalue, col=diffexpressed, label=delabel)) + 
    geom_point() + 
    theme_minimal() +
    geom_text()

Finally, we can organize the labels nicely using the “ggrepel” package and the geom_text_repel() function:

# load library: either "tick it" in the Packages tab, or use the following:
library(ggrepel)
# plot adding up all layers we have seen so far
ggplot(data=de_fc, aes(x=log2FoldChange, y=log10pvalue, col=diffexpressed, label=delabel)) +
        geom_point() + 
        theme_minimal(base_size = 16) +
        geom_text_repel(show.legend = FALSE) +
        scale_color_manual(values=c(DOWN="blue", NO="grey50", UP="red")) +
        geom_vline(xintercept=c(-0.6, 0.6), col="red") +
        geom_hline(yintercept=-log10(0.05), col="red") +
        ggtitle(label="Volcano plot") +
        theme(plot.title = element_text(hjust = 0.5, face = "bold"))