8.7 Exercise 2

  1. Import DataViz_source_files-main/files/gencode.v44.annotation.csv into an object called gtf. Then check the first 20 rows of gtf with the head() function; see its help page (?head) for how it works.
correction
# read_csv() comes from the readr package (loaded with the tidyverse)
gtf <- read_csv("DataViz_source_files-main/files/gencode.v44.annotation.csv")

The data in gtf are a small subset of the GENCODE v44 human gene annotation, created as follows:

  • Selection of protein-coding genes, long non-coding RNA genes, miRNAs, snRNAs and snoRNAs.
  • Selection of chromosomes 1 to 10 only.
  • Creation of a random subset of 1000 genes.
  • Conversion to a user-friendly CSV format.
head(gtf, 20)
## # A tibble: 20 × 5
##    chr   strand gencode_id         gene_type      gene_symbol    
##    <chr> <chr>  <chr>              <chr>          <chr>          
##  1 chr4  +      ENSG00000250938.8  lncRNA         MAD2L1-DT      
##  2 chr4  +      ENSG00000286320.2  lncRNA         ENSG00000286320
##  3 chr1  +      ENSG00000215717.7  protein_coding TMEM167B       
##  4 chr3  -      ENSG00000265028.1  miRNA          ENSG00000265028
##  5 chr9  -      ENSG00000242375.1  lncRNA         ENSG00000242375
##  6 chr1  -      ENSG00000143199.18 protein_coding ADCY10         
##  7 chr5  +      ENSG00000181751.10 protein_coding MACIR          
##  8 chr3  -      ENSG00000290763.1  lncRNA         SDHAP1         
##  9 chr8  +      ENSG00000157168.22 protein_coding NRG1           
## 10 chr9  +      ENSG00000130956.14 protein_coding HABP4          
## 11 chr5  -      ENSG00000250360.1  lncRNA         ENSG00000250360
## 12 chr4  -      ENSG00000250532.1  lncRNA         ENSG00000250532
## 13 chr1  +      ENSG00000067704.10 protein_coding IARS2          
## 14 chr10 -      ENSG00000226083.5  lncRNA         SLC39A12-AS1   
## 15 chr8  -      ENSG00000136960.13 protein_coding ENPP2          
## 16 chr8  +      ENSG00000253263.1  lncRNA         ENSG00000253263
## 17 chr10 +      ENSG00000272381.2  lncRNA         LINC02664      
## 18 chr2  +      ENSG00000236854.2  lncRNA         ENSG00000236854
## 19 chr10 -      ENSG00000188716.6  protein_coding DUSP29         
## 20 chr7  -      ENSG00000284707.2  lncRNA         ENSG00000284707
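
The selection steps listed above can be sketched with dplyr. This is only an illustration of one possible approach, not necessarily how the workshop file was produced: full_annotation is a hypothetical table with made-up rows, and the subset size is reduced to 3 so the example stays small.

```r
library(dplyr)

# Hypothetical full annotation (made-up rows, same columns as gtf)
full_annotation <- data.frame(
  chr         = c("chr1", "chr2", "chr11", "chrX", "chr5", "chr10", "chr3", "chr7"),
  strand      = c("+", "-", "+", "-", "+", "-", "+", "-"),
  gencode_id  = paste0("GENE", 1:8),
  gene_type   = c("protein_coding", "lncRNA", "protein_coding", "miRNA",
                  "processed_pseudogene", "snoRNA", "miRNA", "snRNA"),
  gene_symbol = paste0("SYMBOL", 1:8)
)

kept_types <- c("protein_coding", "lncRNA", "miRNA", "snRNA", "snoRNA")
kept_chrs  <- paste0("chr", 1:10)

set.seed(42)  # make the random subset reproducible
toy_subset <- full_annotation %>%
  filter(gene_type %in% kept_types, chr %in% kept_chrs) %>%  # gene types and chromosomes 1 to 10
  slice_sample(n = 3)                                        # random subset (1000 in the real file)

# Write the result as CSV (to a temporary file in this sketch)
out_csv <- file.path(tempdir(), "toy_annotation.csv")
write.csv(toy_subset, file = out_csv, row.names = FALSE)
```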


  2. Create a simple barplot displaying the number of genes per chromosome:
correction
ggplot(data=gtf, mapping=aes(x=chr)) + 
  geom_bar()
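
geom_bar() counts the rows in each x category on its own, which is why no y aesthetic is needed here. As a minimal sketch of what happens behind the scenes, the same counts can be computed explicitly with dplyr's count() and drawn with geom_col(). The toy_gtf data frame below is a made-up stand-in for gtf, used only to keep the example self-contained.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for gtf (hypothetical values, for illustration only)
toy_gtf <- data.frame(
  chr       = c("chr1", "chr1", "chr2", "chr3", "chr3", "chr3"),
  gene_type = c("lncRNA", "protein_coding", "miRNA",
                "lncRNA", "lncRNA", "protein_coding")
)

# Count the genes per chromosome explicitly
genes_per_chr <- count(toy_gtf, chr, name = "n_genes")

# geom_col() draws bars whose heights are taken from a column (here n_genes)
p_counts <- ggplot(data = genes_per_chr, mapping = aes(x = chr, y = n_genes)) +
  geom_col()

# print(p_counts) displays the plot
```

On real data, ggplot(gtf, aes(x = chr)) + geom_bar() should give the same bar heights without the intermediate table.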


3. Keep chromosomes on the x axis, and split the barplot per gene type.

TIP: remember how we set color= inside aes() in the scatter plot section? Give it a try here!

correction
ggplot(data=gtf, mapping=aes(x=chr, color=gene_type)) + 
  geom_bar()


4. Replace color= with fill= in aes(). What changes?

correction
ggplot(data=gtf, mapping=aes(x=chr, fill=gene_type)) + 
  geom_bar()
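
For bars, color= controls the outline of each bar segment, while fill= controls its interior. A minimal sketch comparing the two on a small made-up data frame (toy_gtf is hypothetical and only stands in for gtf):

```r
library(ggplot2)

# Toy stand-in for gtf (hypothetical values, for illustration only)
toy_gtf <- data.frame(
  chr       = c("chr1", "chr1", "chr2", "chr3", "chr3", "chr3"),
  gene_type = c("lncRNA", "protein_coding", "miRNA",
                "lncRNA", "lncRNA", "protein_coding")
)

# color= only colours the bar outlines; the interiors keep the default grey
p_outline <- ggplot(data = toy_gtf, mapping = aes(x = chr, color = gene_type)) +
  geom_bar()

# fill= colours the bar interiors, so each gene type is easy to tell apart
p_filled <- ggplot(data = toy_gtf, mapping = aes(x = chr, fill = gene_type)) +
  geom_bar()

# print(p_outline) and print(p_filled) display the two plots
```

In both versions the bars are stacked by gene type; only the way each segment is coloured differs.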


5. Add a title to the graph:

correction
ggplot(data=gtf, mapping=aes(x=chr, fill=gene_type)) + 
  geom_bar() +
  ggtitle(label = "Number of genes per chromosome, split by gene type")


6. Change the default theme:

correction
ggplot(data=gtf, mapping=aes(x=chr, fill=gene_type)) + 
  geom_bar() +
  ggtitle(label = "Number of genes per chromosome, split by gene type") +
  theme_bw()


7. Save the graph in PNG format in the workshop’s directory.

correction
# save plot in an object
gtfbars <- ggplot(data=gtf, mapping=aes(x=chr, fill=gene_type)) + 
  geom_bar() +
  ggtitle(label = "Number of genes per chromosome, split by gene type") +
  theme_bw()

# save as PNG file
ggsave(filename="gtfbarplot.png", plot=gtfbars, 
       device="png")
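
ggsave() also accepts width, height, units and dpi arguments to control the size and resolution of the saved image. A minimal sketch, assuming a small made-up data frame and a temporary output path (toy_gtf, toy_bars and out_file are hypothetical names; the size values are arbitrary):

```r
library(ggplot2)

# Toy stand-in for gtf (hypothetical values, for illustration only)
toy_gtf <- data.frame(
  chr       = c("chr1", "chr1", "chr2", "chr3", "chr3", "chr3"),
  gene_type = c("lncRNA", "protein_coding", "miRNA",
                "lncRNA", "lncRNA", "protein_coding")
)

toy_bars <- ggplot(data = toy_gtf, mapping = aes(x = chr, fill = gene_type)) +
  geom_bar() +
  theme_bw()

# Save to a temporary directory; use a path in your workshop directory instead
out_file <- file.path(tempdir(), "toy_barplot.png")

# Width and height are given in inches here; dpi sets the resolution
ggsave(filename = out_file, plot = toy_bars,
       width = 8, height = 5, units = "in", dpi = 300)
```

When filename is a relative path, as in the correction above, the file is written to the current working directory, which you can check with getwd().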