9.9 Exercise 5. Data frame manipulation
Create the script “exercise5.R” and save it to the “Rcourse/Module1” directory: you will save all the commands of exercise 5 in that script.
Remember you can comment the code using #.
9.9.1 Exercise 5a
1- Create the following data frame:
|43|181|M|
|34|172|F|
|22|189|M|
|27|167|F|
Row names: John, Jessica, Steve, Rachel.
Column names: Age, Height, Sex.
correction
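One possible way to build the data frame is sketched below. It passes one vector per column to data.frame() and sets the row names with the row.names argument; the object name df is the one used in the following questions.

```r
# Build the data frame column by column; row.names sets the row labels.
df <- data.frame(
  Age       = c(43, 34, 22, 27),
  Height    = c(181, 172, 189, 167),
  Sex       = c("M", "F", "M", "F"),
  row.names = c("John", "Jessica", "Steve", "Rachel")
)

# Print the result to check it against the table above
df
```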
2- Check the structure of df with str().
correction
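A minimal sketch of the structure check, recreating df so the block runs on its own. Depending on your R version, the Sex column may be reported as a character vector (R 4.0 and later) or as a factor (earlier versions).

```r
# Recreate df as in question 1
df <- data.frame(
  Age       = c(43, 34, 22, 27),
  Height    = c(181, 172, 189, 167),
  Sex       = c("M", "F", "M", "F"),
  row.names = c("John", "Jessica", "Steve", "Rachel")
)

# str() shows the number of observations and variables, and each column's type
str(df)
```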
3- Calculate the average age and height in df.
Try different approaches:
- Calculate the average for each column separately.
correction
- Calculate the average of both columns simultaneously using the apply() function.
correction
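Both approaches can be sketched as follows. Note that apply() is restricted here to the two numeric columns: applied to the whole data frame, the Sex column would force a conversion to text and mean() would no longer work.

```r
# Recreate df as in question 1
df <- data.frame(
  Age       = c(43, 34, 22, 27),
  Height    = c(181, 172, 189, 167),
  Sex       = c("M", "F", "M", "F"),
  row.names = c("John", "Jessica", "Steve", "Rachel")
)

# Approach 1: one column at a time
mean(df$Age)
mean(df$Height)

# Approach 2: both columns at once; MARGIN = 2 means "apply over columns"
avg <- apply(df[, c("Age", "Height")], 2, mean)
avg
```

colMeans(df[, c("Age", "Height")]) would be another way to obtain the same values.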
4- Add one row to df: Georges, who is 53 years old and has a height of 168.
correction
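One way to add the row is to build a one-row data frame with the same columns and append it with rbind(). The exercise does not state a Sex value for Georges; "M" below is an assumption made only for illustration.

```r
# Recreate df as in question 1
df <- data.frame(
  Age       = c(43, 34, 22, 27),
  Height    = c(181, 172, 189, 167),
  Sex       = c("M", "F", "M", "F"),
  row.names = c("John", "Jessica", "Steve", "Rachel")
)

# One-row data frame with matching column names (Sex = "M" is assumed)
georges <- data.frame(Age = 53, Height = 168, Sex = "M", row.names = "Georges")

# rbind() appends rows; column names must match between the two data frames
df <- rbind(df, georges)
df
```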
5- Change the row names of df so the data becomes anonymous: Use Patient1, Patient2, etc. instead of actual names.
correction
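A short sketch of the renaming, using paste0() to generate as many labels as there are rows, so it keeps working whether or not the extra row from question 4 is present.

```r
# Recreate df as in question 1
df <- data.frame(
  Age       = c(43, 34, 22, 27),
  Height    = c(181, 172, 189, 167),
  Sex       = c("M", "F", "M", "F"),
  row.names = c("John", "Jessica", "Steve", "Rachel")
)

# Generate Patient1, Patient2, ... and assign them as the new row names
rownames(df) <- paste0("Patient", seq_len(nrow(df)))
df
```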
6- Create the data frame df2 that is a subset of df which will contain only the female entries.
correction
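The female subset can be sketched with a logical test on the Sex column, used as a row selector inside square brackets. The original names are kept as row names here so the result is easy to read.

```r
# Recreate df as in question 1
df <- data.frame(
  Age       = c(43, 34, 22, 27),
  Height    = c(181, 172, 189, 167),
  Sex       = c("M", "F", "M", "F"),
  row.names = c("John", "Jessica", "Steve", "Rachel")
)

# Keep the rows where Sex equals "F"; the empty slot after the comma keeps all columns
df2 <- df[df$Sex == "F", ]
df2
```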
7- Create the data frame df3 that is a subset of df which will contain only entries of males taller than 170.
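A possible solution combines two conditions with the & operator, so that a row is kept only when both are TRUE.

```r
# Recreate df as in question 1
df <- data.frame(
  Age       = c(43, 34, 22, 27),
  Height    = c(181, 172, 189, 167),
  Sex       = c("M", "F", "M", "F"),
  row.names = c("John", "Jessica", "Steve", "Rachel")
)

# Keep the rows where Sex is "M" AND Height is above 170
df3 <- df[df$Sex == "M" & df$Height > 170, ]
df3
```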
9.9.2 Exercise 5b
1- Create two data frames, mydf1 and mydf2, as follows:
mydf1:
|1|14|
|2|12|
|3|15|
|4|10|
mydf2:
|1|paul|
|2|helen|
|3|emily|
|4|john|
|5|mark|
With column names: “id”, “age” for mydf1, and “id”, “name” for mydf2.
correction
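The two data frames could be created as follows, one vector per column.

```r
# mydf1: ids 1 to 4 with their ages
mydf1 <- data.frame(
  id  = 1:4,
  age = c(14, 12, 15, 10)
)

# mydf2: ids 1 to 5 with their names
mydf2 <- data.frame(
  id   = 1:5,
  name = c("paul", "helen", "emily", "john", "mark")
)

mydf1
mydf2
```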
2- Merge mydf1 and mydf2 by their “id” column. Look for the help page of merge and/or Google it!
correction
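A minimal sketch of the merge, storing the result in mydf3 (the name used in the next question). By default merge() keeps only the ids found in both data frames, so id 5 is dropped.

```r
# Recreate the two data frames from question 1
mydf1 <- data.frame(id = 1:4, age = c(14, 12, 15, 10))
mydf2 <- data.frame(id = 1:5,
                    name = c("paul", "helen", "emily", "john", "mark"))

# Merge on the shared "id" column
mydf3 <- merge(mydf1, mydf2, by = "id")
mydf3
```

Adding all = TRUE would keep id 5 as well, with NA in the age column.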
3- Order mydf3 (the merged data frame from question 2) by decreasing age. Look for the help page of order.
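One way to do it: order() returns the row positions that would sort the age column, and those positions are then used to reorder the rows. The name mydf3_sorted is only an illustrative choice.

```r
# Recreate mydf3 as in question 2
mydf1 <- data.frame(id = 1:4, age = c(14, 12, 15, 10))
mydf2 <- data.frame(id = 1:5,
                    name = c("paul", "helen", "emily", "john", "mark"))
mydf3 <- merge(mydf1, mydf2, by = "id")

# Row positions sorted from the highest to the lowest age
idx <- order(mydf3$age, decreasing = TRUE)

# Reorder the rows with those positions
mydf3_sorted <- mydf3[idx, ]
mydf3_sorted
```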
9.9.3 Exercise 5c
1- Using the download.file function, download this file to your current directory. (Right click on “this file” -> Copy link location to get the full path).
correction
2- The function dir() lists the files and directories present in the current directory: check if genes_dataframe.RData was copied.
correction
3- Load genes_dataframe.RData in your environment. Use the load function.
correction
4- genes_dataframe.RData contains the df_genes object: is it now present in your environment?
correction
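Questions 1 to 4 can be sketched as one sequence. Because the link to the course file is not reproduced here, the block below first writes a small toy df_genes to a temporary file and downloads it through a file:// URL (this form of path is assumed to be Unix-like). The toy values and the gene_symbol column name are assumptions, not the real data.

```r
# Stand-in for the course file: a toy df_genes saved under a temporary path
df_genes <- data.frame(
  gene_symbol           = c("Zfp36", "Actb", "Zfp57"),  # hypothetical column name
  pvalue_KOvsWT         = c(0.01, 0.60, 0.03),
  log2FoldChange_KOvsWT = c(1.2, 0.1, -0.9)
)
src <- tempfile(fileext = ".RData")
save(df_genes, file = src)
rm(df_genes)  # remove it so that load() below is what brings it back

# 1- Download: in the exercise, url is the link copied from the course page
url      <- paste0("file://", src)
destfile <- file.path(tempdir(), "genes_dataframe.RData")
download.file(url, destfile = destfile, mode = "wb")

# 2- Check that the file is there
"genes_dataframe.RData" %in% dir(tempdir())

# 3- Load it; load() returns the names of the objects it restored
loaded <- load(destfile)
loaded

# 4- Check that df_genes is now in the environment
exists("df_genes")
```

In the exercise itself you would use destfile = "genes_dataframe.RData" and a plain dir() call, so the file lands in your current directory.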
5- Explore df_genes and see what it contains. You can use a variety of functions: str, head, tail, dim, colnames, rownames, class…
correction
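A few of these functions are shown below on a toy stand-in for df_genes. The values and the gene_symbol column name are assumptions; only the two KOvsWT column names come from the exercise.

```r
# Toy stand-in for df_genes (not the real data)
df_genes <- data.frame(
  gene_symbol           = c("Zfp36", "Actb", "Zfp57"),  # hypothetical column name
  pvalue_KOvsWT         = c(0.01, 0.60, 0.03),
  log2FoldChange_KOvsWT = c(1.2, 0.1, -0.9)
)

class(df_genes)     # type of object
dim(df_genes)       # number of rows, number of columns
colnames(df_genes)  # column names
head(df_genes, 2)   # first two rows
str(df_genes)       # compact overview of every column
```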
6- Select rows for which pvalue_KOvsWT < 0.05 AND log2FoldChange_KOvsWT > 0.5. Store in the up object.
correction
# rows where pvalue_KOvsWT < 0.05
df_genes$pvalue_KOvsWT < 0.05
# rows where log2FoldChange_KOvsWT > 0.5
df_genes$log2FoldChange_KOvsWT > 0.5
# rows that satisfy both of the above conditions
df_genes$pvalue_KOvsWT < 0.05 & df_genes$log2FoldChange_KOvsWT > 0.5
# select rows for which pvalue_KOvsWT < 0.05 AND log2FoldChange_KOvsWT > 0.5
up <- df_genes[df_genes$pvalue_KOvsWT < 0.05 &
               df_genes$log2FoldChange_KOvsWT > 0.5, ]
How many rows (genes) were selected?
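The count can be obtained with nrow(), or by summing the logical vector (each TRUE counts as 1). The sketch below uses toy values, not the real df_genes, so the number it gives is not the answer for the course data.

```r
# Toy stand-in for df_genes (values are made up for illustration)
df_genes <- data.frame(
  pvalue_KOvsWT         = c(0.01, 0.60, 0.03, 0.04, 0.20),
  log2FoldChange_KOvsWT = c(1.2, 0.1, -0.9, 0.7, 2.0)
)

# Logical vector: TRUE for rows that pass both filters
keep <- df_genes$pvalue_KOvsWT < 0.05 & df_genes$log2FoldChange_KOvsWT > 0.5
up   <- df_genes[keep, ]

# Two equivalent ways to count the selected rows
nrow(up)
sum(keep)
```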
7- Select from the up object the zinc finger protein-coding genes (i.e. those whose gene symbol starts with Zfp). Use the grep() function.
correction
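A possible approach is sketched below. grep() returns the positions of the elements matching a pattern, and the ^ anchor restricts the match to the start of the symbol. The column holding the symbols is assumed to be called gene_symbol; check colnames(up) for the real name. The toy rows are made up.

```r
# Toy stand-in for the up object (gene_symbol is a hypothetical column name)
up <- data.frame(
  gene_symbol           = c("Zfp36", "Actb", "Zfp57", "Gapdh"),
  pvalue_KOvsWT         = c(0.01, 0.02, 0.03, 0.04),
  log2FoldChange_KOvsWT = c(1.2, 0.8, 0.9, 0.7)
)

# Positions of the symbols that start with "Zfp"
zfp_rows <- grep("^Zfp", up$gene_symbol)

# Use those positions to select the rows
up_zfp <- up[zfp_rows, ]
up_zfp
```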
8- Select rows for which pvalue_KOvsWT < 0.05 AND log2FoldChange_KOvsWT is > 0.5 OR < -0.5.
For the selection of log2FoldChange: give the abs function a try!
Store in the diff_genes object.
correction
# rows where pvalue_KOvsWT < 0.05
df_genes$pvalue_KOvsWT < 0.05
# rows where log2FoldChange_KOvsWT > 0.5
df_genes$log2FoldChange_KOvsWT > 0.5
# rows where log2FoldChange_KOvsWT < -0.5
df_genes$log2FoldChange_KOvsWT < -0.5
# rows where log2FoldChange_KOvsWT < -0.5 OR log2FoldChange_KOvsWT > 0.5
df_genes$log2FoldChange_KOvsWT > 0.5 | df_genes$log2FoldChange_KOvsWT < -0.5
# same as above but using the abs function
abs(df_genes$log2FoldChange_KOvsWT) > 0.5
# combine all required criteria
df_genes$pvalue_KOvsWT < 0.05 & abs(df_genes$log2FoldChange_KOvsWT) > 0.5
# extract corresponding entries
diff_genes <- df_genes[df_genes$pvalue_KOvsWT < 0.05 &
                       abs(df_genes$log2FoldChange_KOvsWT) > 0.5, ]
How many rows (genes) were selected?
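As a quick check, the sketch below confirms on toy values that the abs() test and the explicit OR test select the same rows, then counts the result. The numbers are made up, so the count is not the answer for the course data.

```r
# Toy stand-in for df_genes (values are made up for illustration)
df_genes <- data.frame(
  pvalue_KOvsWT         = c(0.01, 0.60, 0.03, 0.04, 0.20),
  log2FoldChange_KOvsWT = c(1.2, 0.1, -0.9, 0.7, 2.0)
)
lfc <- df_genes$log2FoldChange_KOvsWT

# The two ways of writing the fold-change filter give the same logical vector
same <- identical(abs(lfc) > 0.5, lfc > 0.5 | lfc < -0.5)
same

# Combine with the p-value filter and count the selected rows
diff_genes <- df_genes[df_genes$pvalue_KOvsWT < 0.05 & abs(lfc) > 0.5, ]
nrow(diff_genes)
```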