20.1 Exploratory Data Analysis

We will use the package “car”. For details on this package, see https://cran.r-project.org/web/packages/car/car.pdf

# install.packages("car") - run this code if you do not have the "car" package installed
library(car)

Let’s explore the “Davis” dataset from the car package. Its help page is titled “Self-Reports of Height and Weight”. The subjects were men and women engaged in regular exercise.
The data frame contains the following columns:
- sex - F, female; M, male
- weight - measured weight in kg
- height - measured height in cm
- repwt - reported weight in kg
- repht - reported height in cm

?Davis
data <- Davis

20.1.1 Data dimensionality: functions dim(), str(), head(), tail(), summary()

dim(data) 
str(data) 

head(data)
tail(data)

summary(data)  # counts for factor columns; quantiles, mean and number of NA for numeric columns
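
The same inspection functions can be tried on a small hand-made data frame, where it is easier to see what each one reports. This is only an illustrative sketch: `toy` and its values are invented here and are not part of Davis.

```r
# Hypothetical toy data frame (values invented for illustration)
toy <- data.frame(
  sex    = factor(c("F", "M", "F", "M")),
  weight = c(58, 77, 53, NA),
  height = c(161, 182, 158, 175)
)

dim(toy)      # number of rows, then number of columns
nrow(toy)     # first element of dim()
ncol(toy)     # second element of dim()
str(toy)      # type of each column plus its first values
summary(toy)  # counts for the factor; quantiles, mean and NA count for numeric columns
head(toy, 2)  # first two rows
tail(toy, 2)  # last two rows
```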

20.1.2 Missing (NA) values in data: functions complete.cases(), na.omit(), all.equal()

How many rows do not contain missing values (i.e., not a single ‘NA’)?

sum(complete.cases(data)) 
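x <- data[complete.cases(data), ]  # the complete rows themselves
y <- na.omit(data)                 # same rows, obtained with na.omit()
all.equal(x, y)  # na.omit() also stores an "na.action" attribute, which all.equal() may report as a difference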
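
To see how the two approaches relate, the sketch below compares them on a toy data frame. The names `toy`, `by_index` and `by_omit` are made up for this illustration. It assumes you only care about the values, so it asks all.equal() to ignore attributes via its `check.attributes` argument.

```r
# Hypothetical toy data with one NA in each column
toy <- data.frame(a = c(1, NA, 3), b = c("x", "y", NA))

by_index <- toy[complete.cases(toy), ]  # logical indexing
by_omit  <- na.omit(toy)                # na.omit()

# The remaining rows hold the same values ...
all.equal(by_index, by_omit, check.attributes = FALSE)

# ... but na.omit() additionally records which rows it dropped
attr(by_omit, "na.action")
names(attributes(by_index))
names(attributes(by_omit))
```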

Exercise using complete.cases(): How many rows contain missing values (i.e., at least one ‘NA’)?
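
As a hint, the counting pattern can be tried on a toy data frame before applying it to Davis. The data below are invented for illustration. The per-column count with colSums() is an extra that the exercise does not ask for.

```r
# Hypothetical toy data: rows 1 and 2 contain at least one NA
toy <- data.frame(a = c(1, NA, 3, 4), b = c(NA, NA, 6, 7))

sum(complete.cases(toy))   # rows without any NA
sum(!complete.cases(toy))  # rows with at least one NA
colSums(is.na(toy))        # number of NA values in each column
```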

20.1.3 Looking at the subset of data

d <- data[data$weight < 60,] # rows with weight below 60 kg
str(d)
summary(d)

x <- data[data$weight > 50 & data$repwt <= 66,] 
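x  # the result looks strange: rows where repwt is NA show up as rows filled with NA
x[!complete.cases(x),]  # the incomplete rows of x
na.omit(x)  # not what we want: this drops every row with an NA in any column, including valid rows where only repht is missing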

y <- data[data$weight > 50 & data$repwt <= 66 & !is.na(data$repwt), ] 
y
dim(y)
y[!complete.cases(y), ]
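
The NA-filled rows come from the index itself: a comparison with NA is NA, and indexing with NA returns an NA row. The sketch below shows this on invented toy data, together with two alternatives (which() and subset()) that keep only the rows where the condition is TRUE. These alternatives are offered as options, not as the approach used in this section.

```r
# Hypothetical toy data: the second person did not report a weight
toy <- data.frame(weight = c(55, 70, 62), repwt = c(54, NA, 60))

keep <- toy$weight > 50 & toy$repwt <= 66
keep                # TRUE NA TRUE: the comparison with NA is itself NA
toy[keep, ]         # the NA index yields a row filled with NA
toy[which(keep), ]  # which() keeps only the positions that are TRUE
subset(toy, weight > 50 & repwt <= 66)  # subset() also drops rows where the condition is NA
```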

20.1.4 Exercises on data subsetting and missing values

How many people shorter than 170 cm reported a height of 170 cm or more?

x <- data[data$height < 170 & data$repht >= 170 & !is.na(data$repht), ]
nrow(x)

What proportion of men in the dataset did not report their height? And women?

x <- data[data$sex == 'M' & is.na(data$repht), ]
nrow(x)
nrow(x)/nrow(data[data$sex == 'M', ])

nrow(data[data$sex == 'F' & is.na(data$repht), ]) / nrow(data[data$sex == 'F', ])
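
A proportion of missing values can also be computed as the mean of a logical vector, since TRUE counts as 1 and FALSE as 0. The sketch below uses invented toy data and tapply() to get the proportion for every level of sex at once. It is an alternative formulation, not a replacement for the solution above.

```r
# Hypothetical toy data: 2 of 3 men and 1 of 3 women did not report a height
toy <- data.frame(
  sex   = factor(c("F", "F", "M", "M", "M", "F")),
  repht = c(160, NA, 180, NA, NA, 165)
)

mean(is.na(toy$repht[toy$sex == "M"]))   # share of men with missing repht
tapply(is.na(toy$repht), toy$sex, mean)  # the same share for each level of sex
```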

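Is it true that the men who did not report their height are the same men who did not report their weight?

# all.equal() returns a description of the differences rather than FALSE, so wrap it in isTRUE()
isTRUE(all.equal(data[data$sex == 'M' & is.na(data$repht), ], data[data$sex == 'M' & is.na(data$repwt), ]))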
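
Because all.equal() describes differences in text rather than returning FALSE, comparing the row positions directly can be easier to read. The sketch below does this with which(), identical() and setdiff() on invented toy data. The variable names are hypothetical.

```r
# Hypothetical toy data: rows 1, 2 and 4 are men
toy <- data.frame(
  sex   = factor(c("M", "M", "F", "M")),
  repwt = c(NA, 80, NA, NA),
  repht = c(NA, 178, 160, 181)
)

men_no_repht <- which(toy$sex == "M" & is.na(toy$repht))
men_no_repwt <- which(toy$sex == "M" & is.na(toy$repwt))

men_no_repht
men_no_repwt
identical(men_no_repht, men_no_repwt)  # TRUE only if exactly the same rows are selected
setdiff(men_no_repwt, men_no_repht)    # men missing repwt but not repht
```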

20.1.5 Exploring a particular variable (column): functions unique(), table()

length(data$repwt)
summary(data$repwt)
sum(is.na(data$repwt)) # count NA
table(is.na(data$repwt))  # number of reported (FALSE) and missing (TRUE) values of repwt

unique(data$repwt) # unique values of the column (NA counts as a value)
length(unique(data$repwt))

unique(na.omit(data$repwt)) # unique values of the column (NA omitted)
length(unique(na.omit(data$repwt)))

table(data$repwt) # how many rows share each value of repwt
data.frame(table(data$repwt)) # same as above, shown as a data frame

table(data$repwt, useNA = "ifany") # by default table() leaves out missing values; useNA = "ifany" adds them
data.frame(table(data$repwt, useNA = "ifany"))
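
The effect of NA on unique() and table() is easy to check on a short vector. The sketch below uses an invented vector `v`. Sorting the table and picking the most frequent value are small additions that this section does not otherwise use.

```r
# Hypothetical vector with one missing value
v <- c(60, 55, 60, NA, 70, 55, 60)

length(unique(v))           # 4: NA is counted as a distinct value
length(unique(na.omit(v)))  # 3: NA removed first

counts <- table(v, useNA = "ifany")  # includes a count for NA
counts

sort(table(v), decreasing = TRUE)  # most frequent values first
names(which.max(table(v)))         # the most frequent value, as a character string
```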

20.1.6 Exploring relationships between variables: functions table(), cut(), and functions for factors levels(), nlevels()

table(data$sex) # table() counts the observations at each combination of factor levels
table(data$sex, data$weight)  # counts for every sex-by-weight combination (wide and sparse)
table(data$sex, data$weight < 80) # a 2x2 contingency table: sex by weight below 80 kg (FALSE/TRUE)

m <- table(data$sex, data$weight < 80) # save the table
colnames(m) <- c("weight >= 80","weight < 80") # name the columns (order: FALSE, TRUE)
rownames(m) <- c("Female","Male") # name the rows (order of the factor levels: F, M)
m # check it out
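
Dimension names can also be supplied when the table is built, through the `dnn` argument of table(). The sketch below does this on invented toy vectors and adds totals and row-wise proportions with addmargins() and prop.table(). The label "under_80" is a hypothetical name chosen for this example.

```r
# Hypothetical toy vectors
sex    <- factor(c("F", "M", "F", "M", "M"))
weight <- c(55, 85, 62, 78, 90)

m <- table(sex, weight < 80, dnn = c("sex", "under_80"))
m
addmargins(m)              # adds row and column totals
prop.table(m, margin = 1)  # proportions within each row, i.e. within each sex
```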

table(data$weight)
intervals_weight <- cut(data$weight, breaks = seq(30, 170, 20)) 
table(intervals_weight)
table(intervals_weight, data$sex) # contingency table of weight intervals by sex

class(intervals_weight)
levels(intervals_weight)
nlevels(intervals_weight)

table(data$height)
intervals_height <- cut(data$height, breaks = seq(40, 200, 20)) # breaks must span the whole range of height; values outside them become NA
table(intervals_weight, intervals_height)
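
Two properties of cut() are worth checking before trusting such a table: by default the intervals are open on the left and closed on the right, and any value outside the breaks silently becomes NA. The sketch below illustrates both on an invented vector `h`. The names `too_narrow` and `covering` are hypothetical.

```r
# Hypothetical heights
h <- c(150, 160, 170, 197)

# Intervals (150,170] and (170,190]: 150 and 197 fall outside and become NA
too_narrow <- cut(h, breaks = c(150, 170, 190))
too_narrow

# Wider breaks plus include.lowest = TRUE, so the smallest break value is kept
covering <- cut(h, breaks = c(150, 170, 190, 210), include.lowest = TRUE)
covering
table(covering, useNA = "ifany")
```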

20.1.7 Exercises using unique(), table() and cut()

Let’s assume that the person with the minimum height, i.e. the row where height == min(data$height), is an erroneous entry, and exclude it from the analysis.

How many unique values are there for the height?

x <- data[data$height != min(data$height), ]

unique(x$height)
length(unique(x$height))

How many height intervals are obtained with breaks every 10 cm from the minimum to the maximum height? Use min() and max() inside seq(), then nlevels(). Be careful to include both the minimum and the maximum height in an interval.

min(x$height)
max(x$height)
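intervals_height <- cut(x$height, breaks = seq(min(x$height), max(x$height)+1, 10), include.lowest = TRUE) # include.lowest keeps the minimum, which a right-closed first interval would otherwise drop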
nlevels(intervals_height)
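
Another way to make sure neither the smallest nor the largest value is lost is to extend the break sequence past the maximum and use left-closed intervals (right = FALSE). The sketch below shows this on an invented vector `h`. It assumes a fixed step of 10 and is one possible construction, not the only one.

```r
# Hypothetical heights
h <- c(148, 163, 171, 180, 197)

# Going up to max(h) + 10 guarantees that the last break lies above max(h)
breaks <- seq(min(h), max(h) + 10, by = 10)
breaks

# Left-closed intervals [a, b): min(h) sits in the first one, max(h) in the last
iv <- cut(h, breaks = breaks, right = FALSE)
iv
nlevels(iv)
anyNA(iv)  # FALSE means no value fell outside the breaks
```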

How many women are in the last two intervals? (just by looking at the table)

table(intervals_height, x$sex)