21.3 Join tables

So far we have worked (mostly) with two objects:

  • geneexp, which contains gene expression information
  • gtf, which contains gene annotation information

How can we merge all the information into a single data frame?

dplyr provides an easy way to join two data frames on one or more columns that contain common identifiers, so that the related information ends up in a single table.

Relevant functions are the following:

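  • left_join: keeps all observations from the first table (x); columns coming from y are filled with NA where there is no match
  • right_join: keeps all observations from the second table (y); columns coming from x are filled with NA where there is no match
  • inner_join: keeps only the observations found in both tables (the intersection)
  • full_join: keeps all observations from both tables (the union)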
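Before applying these functions to the real objects, it may help to see them on a tiny example. The sketch below uses two hypothetical toy data frames, annot and expr (they are not part of the course data), that share two of their three gene identifiers:

```r
library(dplyr)

# Hypothetical toy tables, for illustration only (not the course data)
annot <- data.frame(gene_symbol = c("A", "B", "C"),
                    chr = c("1", "2", "3"))
expr <- data.frame(GeneSymbol = c("B", "C", "D"),
                   counts = c(10, 20, 30))

# The key column has a different name in each table,
# so we map the name in x to the name in y
key <- c("gene_symbol" = "GeneSymbol")

toyL <- left_join(x = annot, y = expr, by = key)   # A, B, C (counts is NA for A)
toyR <- right_join(x = annot, y = expr, by = key)  # B, C, D (chr is NA for D)
toyI <- inner_join(x = annot, y = expr, by = key)  # B, C
toyF <- full_join(x = annot, y = expr, by = key)   # A, B, C, D

# Number of rows kept by each join
sapply(list(left = toyL, right = toyR, inner = toyI, full = toyF), nrow)
##  left right inner  full
##     3     3     2     4
```

Note that the key column of the result takes its name from x (here gene_symbol).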

Let’s try all four and check how many rows each result contains (with nrow):

joinL <- left_join(x=gtf, y=geneexp, by=c("gene_symbol" = "GeneSymbol"))
nrow(joinL)
## [1] 420
joinR <- right_join(x=gtf, y=geneexp, by=c("gene_symbol" = "GeneSymbol"))
nrow(joinR)
## [1] 50
joinI <- inner_join(x=gtf, y=geneexp, by=c("gene_symbol" = "GeneSymbol"))
nrow(joinI)
## [1] 49
joinF <- full_join(x=gtf, y=geneexp, by=c("gene_symbol" = "GeneSymbol"))
nrow(joinF)
## [1] 421
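These four numbers are consistent with each other: the full join contains the rows of the left join plus the rows of the right join, minus the matched rows that would otherwise be counted twice (420 + 50 - 49 = 421). They also show that 420 - 49 = 371 rows of gtf have no match in geneexp, and that 50 - 49 = 1 row of geneexp has no match in gtf. As a hedged sketch, the same check can be reproduced on hypothetical toy tables (annot and expr are made up for illustration); dplyr's anti_join is used here as one possible way to list the unmatched rows:

```r
library(dplyr)

# Hypothetical toy tables standing in for gtf and geneexp
annot <- data.frame(gene_symbol = c("A", "B", "C", "D"),
                    chr = c("1", "2", "3", "4"))
expr <- data.frame(GeneSymbol = c("C", "D", "E"),
                   counts = c(5, 8, 13))

key <- c("gene_symbol" = "GeneSymbol")

nL <- nrow(left_join(annot, expr, by = key))
nR <- nrow(right_join(annot, expr, by = key))
nI <- nrow(inner_join(annot, expr, by = key))
nF <- nrow(full_join(annot, expr, by = key))

# full = left + right - inner
nF == nL + nR - nI
## [1] TRUE

# Rows of annot that have no match in expr (genes A and B)
onlyAnnot <- anti_join(annot, expr, by = key)

# Rows of expr that have no match in annot (gene E)
onlyExpr <- anti_join(expr, annot, by = c("GeneSymbol" = "gene_symbol"))
```

On the real objects, the same anti_join calls with gtf and geneexp should return the 371 and 1 unmatched rows, respectively.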