20.1 Exploratory Data Analysis
We will use the package “car”. For details on this package, see https://cran.r-project.org/web/packages/car/car.pdf
# install.packages("car") - run this code if you do not have the "car" package installed
library(car)
Let’s explore the dataset “Davis” from the car package. Its title is “Self-Reports of Height and Weight”. The subjects were men and women engaged in regular exercise.
This data frame contains the following columns:
- sex - F, female; M, male
- weight - measured weight in kg
- height - measured height in cm
- repwt - reported weight in kg
- repht - reported height in cm
20.1.1 Data dimensionality: functions dim(), str(), summary(), head(), tail()
data <- Davis # the rest of this section refers to the Davis data frame as "data"
dim(data)
str(data)
head(data)
tail(data)
summary(data) # shows quartiles for each numeric column and the number of NA values
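As a complement to summary(), the missing values can also be counted per column directly. The sketch below uses a small made-up data frame called toy (hypothetical values, not the Davis data) so that it runs without the car package; only the column names mimic Davis.

```r
# Toy data frame with hypothetical values (not taken from Davis)
toy <- data.frame(
  sex    = factor(c("F", "M", "F", "M")),
  weight = c(58, 77, 53, 68),
  repwt  = c(57, NA, 53, NA),
  repht  = c(160, 180, NA, 175)
)

# is.na() on a data frame gives a logical matrix;
# colSums() then counts the TRUE values in each column
na_per_column <- colSums(is.na(toy))
na_per_column
```

The same call on the real data would be colSums(is.na(data)).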
20.1.2 Missing (NA) values in data: functions complete.cases(), na.omit(), all.equal()
How many rows do not contain missing values (i.e., not a single ‘NA’)?
sum(complete.cases(data))
x <- data[complete.cases(data), ] # here they are
y <- na.omit(data)
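all.equal(x, y) # na.omit() records the dropped rows in an "na.action" attribute, so this may report an attribute difference
all.equal(x, y, check.attributes = FALSE) # compares the values only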
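To see what the two approaches have in common and where they differ, here is a minimal sketch on a made-up data frame (toy, a, b and keep are hypothetical names used only for illustration).

```r
# Toy data frame with hypothetical values (not taken from Davis)
toy <- data.frame(
  weight = c(58, 77, 53, 68),
  repwt  = c(57, NA, 53, 70),
  repht  = c(160, 180, NA, 175)
)

keep <- complete.cases(toy)   # TRUE for rows without a single NA
a <- toy[keep, ]              # logical row indexing
b <- na.omit(toy)             # same rows, plus an "na.action" attribute

nrow(a)                       # number of complete rows
names(attributes(b))          # includes "na.action"
all.equal(a, b, check.attributes = FALSE)  # the values agree
```

Both objects contain the same rows; only the extra attribute stored by na.omit() distinguishes them.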
Exercise using complete.cases(): How many rows contain missing values (i.e., at least one ‘NA’)?
20.1.3 Looking at a subset of the data
d <- data[data$weight < 60,] # rows with weight below 60 kg
str(d)
summary(d)
x <- data[data$weight > 50 & data$repwt <= 66,]
x # the result contains rows filled with NA: where repwt is NA, the condition is NA rather than FALSE
x[!complete.cases(x),]
na.omit(x) # not a safe fix: it also removes rows that have an NA in any other column (e.g., repht)
y <- data[data$weight > 50 & data$repwt <= 66 & !is.na(data$repwt), ]
y
dim(y)
y[!complete.cases(y), ]
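The NA-filled rows come from the logical index itself: a comparison with NA gives NA, and an NA in a row index produces a row of NA. The sketch below shows this on a made-up data frame (hypothetical values), together with two alternatives to adding !is.na() by hand: which() and subset().

```r
# Toy data frame with hypothetical values (not taken from Davis)
toy <- data.frame(
  weight = c(58, 77, 53, 68),
  repwt  = c(57, NA, 53, 70)
)

cond <- toy$weight > 50 & toy$repwt <= 66
cond                                 # the second element is NA, not FALSE

with_na_rows <- toy[cond, ]          # NA in the index yields a row filled with NA
via_which    <- toy[which(cond), ]   # which() keeps only the TRUE positions
via_subset   <- subset(toy, weight > 50 & repwt <= 66)  # subset() also drops NA

nrow(with_na_rows)
nrow(via_which)
```

Unlike na.omit(), both alternatives only look at the columns used in the condition, so rows with an NA elsewhere are kept.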
20.1.4 Exercises on data subsetting and missing values
- How many people shorter than 170 cm reported that they are taller?
- What proportion of men in the dataset did not report their height? And women?
x <- data[data$sex == 'M' & is.na(data$repht), ]
nrow(x)
nrow(x)/nrow(data[data$sex == 'M', ])
nrow(data[data$sex == 'F' & is.na(data$repht), ]) / nrow(data[data$sex == 'F', ])
- Is it true that the men who did not report their height are the same men who did not report their weight?
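The proportion of missing reports can also be computed for both sexes in one call, because the mean of a logical vector is the proportion of TRUE values. This is a sketch on made-up data (toy and prop_missing are hypothetical names), not the solution used in this chapter.

```r
# Toy data frame with hypothetical values (not taken from Davis)
toy <- data.frame(
  sex   = factor(c("F", "M", "F", "M", "M")),
  repht = c(160, NA, NA, 175, 180)
)

# tapply() applies mean() to is.na(repht) separately within each sex
prop_missing <- tapply(is.na(toy$repht), toy$sex, mean)
prop_missing
```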
20.1.5 Exploring a particular variable (column): functions unique(), table()
length(data$repwt)
summary(data$repwt)
sum(is.na(data$repwt)) # count NA
table(is.na(data$repwt)) # counts of non-missing (FALSE) and missing (TRUE) values in data$repwt
unique(data$repwt) # shows unique values for a specified column (NA counts as a value)
length(unique(data$repwt))
unique(na.omit(data$repwt)) # shows unique values for a specified column (NA is omitted)
length(unique(na.omit(data$repwt)))
table(data$repwt) # number of rows for each value of repwt
data.frame(table(data$repwt)) # same as above shown as a data frame
table(data$repwt, useNA = "ifany") # table() omits NA by default; useNA = "ifany" adds a count for NA
data.frame(table(data$repwt, useNA = "ifany"))
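The effect of useNA is easiest to see on a short vector. The values below are made up for illustration.

```r
repwt <- c(57, NA, 53, 57, NA, 70)   # hypothetical reported weights

default_counts <- table(repwt)                   # NA is silently dropped
with_na_counts <- table(repwt, useNA = "ifany")  # adds a count for NA

sum(default_counts)   # number of non-missing values
sum(with_na_counts)   # equals length(repwt)
```

Comparing the two sums is a quick check that no observations were lost when tabulating.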
20.1.6 Exploring relationships between variables: functions table(), cut(), and functions for factors levels(), nlevels()
table(data$sex) # table() builds a contingency table of the counts at each combination of factor levels
table(data$sex, data$weight) # shows relationships between variables
table(data$sex, data$weight < 80) # gives the 2x2 contingency table
m <- table(data$sex, data$weight < 80) # save the table; it can be handled like a matrix
colnames(m) <- c("weight >= 80","weight < 80") # add names to columns
rownames(m) <- c("Female","Male") # add names to rows
m # check it out
table(data$weight)
intervals_weight <- cut(data$weight, breaks = seq(30, 170, 20))
table(intervals_weight)
table(intervals_weight, data$sex) # contingency table of weight intervals by sex
class(intervals_weight)
levels(intervals_weight)
nlevels(intervals_weight)
table(data$height)
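intervals_height <- cut(data$height, breaks = seq(40, 200, 20)) # breaks must cover the whole range: seq(55, 200, 20) stops at 195, so taller heights would become NA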
table(intervals_weight, intervals_height)
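By default cut() builds intervals that are open on the left and closed on the right, so a value equal to the first break is not assigned to any interval. A minimal sketch with made-up heights (all names and values are hypothetical):

```r
height <- c(150, 160, 171, 190)   # hypothetical heights in cm
breaks <- seq(150, 190, 20)       # 150, 170, 190

right_closed <- cut(height, breaks = breaks)
with_lowest  <- cut(height, breaks = breaks, include.lowest = TRUE)

levels(right_closed)        # intervals of the form (a,b]
sum(is.na(right_closed))    # 150 equals the first break, so it becomes NA
sum(is.na(with_lowest))     # include.lowest = TRUE closes the first interval
```

Values outside the range of the breaks become NA in the same way, which is why it is worth checking min() and max() before choosing the breaks.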
20.1.7 Exercises using unique(), table() and cut()
Let’s assume that the person with the minimum height, i.e. height == min(data$height), is an erroneous entry in the dataset, and exclude that row from the analysis.
- How many unique values are there for the height?
- How many intervals for the height are obtained with breaks every 10 cm from the minimum to the maximum height? Use min() and max() inside seq(), then nlevels(). Take care that the maximum height falls inside the last interval.
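x <- data[data$height != min(data$height), ] # drop the suspect row with the minimum height
min(x$height)
max(x$height)
# include.lowest = TRUE keeps the minimum height, which equals the first break, inside the first interval
intervals_height <- cut(x$height, breaks = seq(min(x$height), max(x$height)+1, 10), include.lowest = TRUE)
nlevels(intervals_height)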
- How many women are in the last two intervals? (just by looking at the table)