13.6 Exercise 8: Regular expressions

Create the script “exercise8.R” and save it to the “Rcourse/Module2” directory: you will save all the commands of exercise 8 in that script.
Remember you can comment the code using #.

correction

getwd()
setwd("~/Rcourse/Module2")

1- Play with grep

Create the following data frame

df2 <- data.frame(age=c(32, 45, 12, 67, 40, 27), 
    citizenship=c("England", "India", "Spain", "Brasil", "Tunisia", "Poland"), 
    row.names=paste(rep(c("Patient", "Doctor"), c(4, 2)), 1:6, sep=""),
    stringsAsFactors=FALSE)

Using grep: create a smaller data frame df3 that contains only the Patient but NOT the Doctor information.

correction

# Select row names
rownames(df2)

## [1] "Patient1" "Patient2" "Patient3" "Patient4" "Doctor5"  "Doctor6"

# Select only rownames that correspond to patients
grep("Patient", rownames(df2))

## [1] 1 2 3 4

# Create data frame that contains only those rows
df3 <- df2[grep("Patient", rownames(df2)), ]

2- Play with gsub

Build this vector of file names:

vector1 <- c("L2_sample1_GTAGCG.fastq.gz", "L1_sample2_ATTGCC.fastq.gz", 
    "L1_sample3_TGTTAC.fastq.gz", "L4_sample4_ATGGTA.fastq.gz")

Use gsub and an appropriate regular expression to remove all but “sample1”, “sample2”, “sample3” and “sample4” from vector1.

correction

# | is used as OR
gsub(pattern="L[124]{1}_|_[ATGC]{6}.fastq.gz", 
    replacement="", 
    x=vector1)

## [1] "sample1" "sample2" "sample3" "sample4"