library(readxl)
?read_excel # where does the helpfile appear?Intro to R for statistics, Day 2/3
Self-directed exercise
Today’s worksheet introduces you to a real dataset.
The tasks you will cover are:
- Reading data into R from an Excel file
- Checking and cleaning the data
- Revision of some of statistics and graphing that we covered on the first day.
- The importance of checking model validity
You can attempt the worksheet by following the material on your own on day 2.
On day 3 we will run through the work discussing any issues related to R or statistics that arise.
Expect to spend about three hours on today’s work in total. You all have different levels of knowledge and experience with data analysis and programming, and there is hopefully something in the exercises that will stretch everyone no matter what their background!
So make a good effort at the tutorial and the exercises, work together and use extra resources if needed (part of the skill is knowing where to go for help), but don’t spend more than half a day on this unless you want to!
The dataset
The data are from a randomised clinical trial of a new rehabilition intervention (compared to standard care) aimed at improving the walking speed of people recovering from stroke.
Better walking speed is a good indicator of general stroke recovery.
We have recorded:
- The age and sex of each participant,
- The treatment allocation,
- The hospital department from which they were recruited and
- Time they take to complete a walking task after their treatment.
Our research questions are:
- Does the new treatment improve walking speed compared to control treatment?
- By how much does it improve, and how certain are we of this?
The dataset can be found at walkingspeed_day2.xlsx
- Powerpoint slides to support this material are at:
Using readxl to read data from Excel files
R is not used for data entry or storage; you are likely to have data stored as an Excel file, a csv file, in a database, or in some other format generated by a device.
So before we can do any analysis we need to import data. We do this in code like everything else in R.
Base R cannot read data from Excel files. But several add-on ‘packages’ can read and write Excel files. Currently the best package for reading Excel files is called readxl. For writing Excel files the openxlsx package is very good.
Whatever the format your raw data there is likely to be a package to read it into R:
readxlfor Excel formathavenfor other statistics packages (Stata, SPSS etc)readrordata.tablefor csv and plain textpzfxfor GraphPad Prism.- Various
Bioconductorpackages for ’omics data in various formats.
Here we will import a dataset from Excel using the readxl package. If you already installed tidyverse then you already have this package.
Once the package is installed we can use the function readxl::read_excel(), which reads data from an Excel file.
Make sure the data looks OK and is in the right place on your computer
Before we dive in and import it, we need to make sure our data is in a sensible place.
Open RStudio and switch to the project you created in day 1, or start a new project if necessary.
Then, save the example data
walkingspeed_day2.xlsxfor this tutorial into your project folder. Check that it has appeared in the ‘files’ pane in RStudio.Start a new script file that will eventually include all of the commands we need to import, clean, visualise and analyse the data.
Now open the dataset in Excel and explore the file so that you understand what is there.
Read the help!
We are nearly ready to import our data. But before using a new function its always good to read its documentation.
R and R packages are not as self-explanatory as other software, and so you should expect to spend a fair amount of time, particularly as you are learning R, reading documentation, vignettes, blogs, etc on what R can do, which packages exist, and how to use them.
The read_excel() function has a few different options so first we should look at the help file:
Make sure to check:
- Description what does the function do,
- Usage what is the syntax
- Arguments detail of what all the options mean
- Value what do I get when I run this
- Examples (usually very helpful)
Note from the help file that read_excel() can extract data from different sheets and ranges of an Excel workbook, can use or ignore column names, and allows you to specify the type of data (numeric, dates, text etc) if you want to, or leave it to R to guess.
Many R packages also have vignettes or websites including simpler guides to their use in specific cases. readxl has a website that you might find helpful: https://readxl.tidyverse.org/
Now we’ll load the data. We want to use the ‘walking speed’ data from the walkingspeed.xlsx spreadsheet.
- Open the spreadsheet using Excel and find this sheet. The data we want is in the sheet called ‘day2’.
From the read_excel() help file we can deduce the syntax to load this data into R:
library(readxl)
walkingdat <- read_excel(path="walkingspeed_day2.xlsx", sheet="day2")This line assumes that the file walkingspeed_day2.xlsx is in the current working directory (you can check what the current working directory is with getwd(). The current working directory is also shown just above the R console window. You can see the files in the current working directory in the ‘Files’ tab on the bottom right of the RStudio window. When you create or load a project RStudio will set the working directory to the root of the project directory.
So this line calls the read_excel() function, with the arguments path, sheet set. The other arguments will be set to their default values, which you can see from the help file.
We could have set the range of the data in the spreadsheet (I usually do this for safety), but read_excel() can figure it out automatically most of the time; by default it picks the biggest continuous chunk of data starting in the top left of the sheet.
Now you should have a ‘data frame’ object called walkingdat in your environment, which includes the data from the Excel sheet ready to process and analyse.
Our workflow now is:
- Clean and code
- Visualise
- Describe
- Model
- Diagnose model
- Interpret
Cleaning
Before we can analyse our data we should check each variable to make sure that it looks OK.
We’ll use functions from day 1 to explore the content and structure of the dataset, and the distributions of the main variables. We can then address any problems that we find.
Recall these functions from the day 1 notes. We are checking that the variables have the correct type, and the distributions of the categorical variables.
This is an important aspect of any data analysis and we should not skip this step.
library(tidyverse)
summary(walkingdat) patid group time sex
Min. : 1.00 Length:135 Min. : 0.000 Length:135
1st Qu.: 34.50 Class :character 1st Qu.: 2.167 Class :character
Median : 68.00 Mode :character Median : 2.426 Mode :character
Mean : 68.01 Mean : 4.175
3rd Qu.:101.50 3rd Qu.: 2.949
Max. :136.00 Max. :185.362
age department
Length:135 Min. :1.000
Class :character 1st Qu.:2.000
Mode :character Median :3.000
Mean :2.607
3rd Qu.:3.500
Max. :4.000
class(walkingdat$age)[1] "character"
table(walkingdat$group)
control treat
68 67
table(walkingdat$sex)
F M
35 100
table(walkingdat$department)
1 2 3 4
28 31 42 34
Converting between types
You will have noticed that one of the numerical variables, age, has been loaded as character variable and not a numeric. We cannot do a quantitative analysis with a character variable!
Why do you think the variable is imported as character and not numeric?
We can convert age to a numeric using the function as.numeric().
# Tidyverse version:
walkingdat <- walkingdat |> mutate(age = as.numeric(age))
# Base R version:
walkingdat$age <- as.numeric(walkingdat$age)What warning message did you get when you converted this variable? What does it mean?
Checking the data
We mustn’t assume that all our data had been entered correctly.
How might we check on the age and time variables to see if the distribution looks OK?
We could make some descriptive statistics and basic plots!
hist(walkingdat$age)
hist(walkingdat$time)
range(walkingdat$time, na.rm=TRUE)What do you notice from here?
Dealing with outliers
It looks like there are some unreasonably high and low values of walking time.
We can make another graph of walking speed against age, this time on a logarithmic scale so both the extreme high and extreme low points are visible, to see what is going on.
## ggplot2 version
library(ggplot2)
ggplot(walkingdat) + aes(x=age, y=time) + geom_point() + scale_y_log10()
## base R version
plot(walkingdat$age, walkingdat$time, log="y")It seems there are some values for time that are likely to be technical errors. We can remove these values (set them to missing) in a few different ways:
## base R method
walkingdat$time[walkingdat$time<0.1] <- NA
walkingdat$time[walkingdat$time>100] <- NA
## Using mutate and ifelse
walkingdat <- walkingdat |> mutate(time = ifelse(time<0.1,NA,time))
walkingdat <- walkingdat |> mutate(time = ifelse(time>100,NA,time))
## the 'pure' tidyverse way with case_when is a bit clunky.
## look up 'case_when()' to understand this line
walkingdat <- walkingdat |> mutate( time = case_when(time<0.1 ~ NA_real_ ,
time>100 ~ NA_real_,
TRUE ~ time) )
# Now check the distribution of time again.
hist(walkingdat$time, breaks=100)
range(walkingdat$time, na.rm = TRUE)Now we have a cleaned dataset in our environment you can proceed with our visualization and analysis, using the functions you learned in day 1.
If you had trouble with the cleaning exercise then use the version of the dataset found in the cleaned tab of the spreadsheet:
walkingdat <- read_excel(path="walkingspeed_day2.xlsx", sheet="cleaned")Analysis - Exercises
Add commands to your script file to answer the questions below.
Add comments (lines beginning with #) to help you remember which questions you are answering, what you did and why!.
Some solutions are posted at the bottom of the page, but it will help you a lot to try to answer them on your own, by discussing with colleagues, or by looking up help online before using the solutions.
If you don’t understand the given solution, then please ask and we can go through it in the class.
We will discuss the testing, modelling and graphing elements in more detail on day 3, but it would be helpful to attempt these today.
- Structure of the dataset
- What are the names of the variables that comprise the
walkingdatadataframe? - What is the class of each?
- How many columns and rows does it have?
- What are the names of the variables that comprise the
- Descriptive statistics
- What is the age of youngest patient?
- What is the age of the oldest patient?
- Are the oldest and youngest patient male or female?
- Cross-tabulation (hint, use the
tablefunction)- How many Male and Female patients are there?
- Is the sex ratio balanced between treatment groups?
- How many women are there of age under 50? How many men are there of age under 50?
- Descriptive statistics using a subset of the data
- What is the average walking speed for men between 50 and 60 years of age?
- Graphing
- Consider what kind of graph you might make to illustrate the difference in walking speed between treatment groups, and try to make it using ggplot. Eg:
- Make a scatter plot of time vs age
- Make the colour of each point reflect the treatment group
- Make a box plot of time by sex, with a different coloured box per sex
- Label the axes appropriately
- Try a violin plot instead of a boxplot (with geom_violin). Which do you prefer?
- Make any other adjustments you think are informative!
- Hypothesis Testing
- What is the difference in average walking speed between control patients and treated patients?
- Is this difference statistically significant? Use t-test
t.test()and Mann-Whitney testswilcox.test()to examine the differences between the groups. Look up the syntax fort.test()andwilcox.test()using the help system, and use them to check whether the time taken to complete the task varies by treatment status. In each case use the version of the function that takes aformulaagument. Remember theformulasyntax from theboxplotandlmfunctions in day 1. - Compare the p-values for each method.
- Consider the assumptions for a t-test, and whether they are met in this dataset.
- Do you believe that treatment affects walking speed?
- Deriving new variables
- Create a new column that will have the speed of walking, that is 1/time.
- Create a new column that will have the time rounded up to nearest 10th of a second.
- Linear model diagnostics
- Estimate a linear model (with
lm) to estimate/test the effect of treatment on task completion time. - Use
summaryto get the model coefficients and p-values - Use
broom::tidyto get the model coefficients and p-values - Compare the model results to the equivalent t-test
- Extend the model to include the effects of age and sex as potential covariates.
- Use
plot()to explore the model diagnostics - Use
performance::check_modelto explore the model diagnostics - Now estimate a new model to compare the task completion speed between groups.
- Get the model coefficients and p-values
- Check the model diagnostics using
plotorperformance::check_model - Why are the result from this model different? How do you interpret the results?
- Can you get the confidence intervals for the treatment effect (use the help for
broom::tidy) - Use
model_summarypackage to make a pretty model summary table
- Estimate a linear model (with
Solutions
The exercises will have many possible solutions, under each exercise we have listed a few possible answers. Most are possible using the functions we have introduced in day 1 of the course, or with functions I suggest in the questions, but in some cases there are other solutions that I have given using different functions/packages.
If you answered the questions another way then that’s fine, but do try to understand how the given solutions work.
knitr::opts_knit$set(eval=FALSE)Structure of the dataset
What are the names of the variables that comprise the walkingdat dataframe?
library(tidyverse)
# Eg you could use:
str(walkingdat)
names(walkingdat)
summary(walkingdat)What is the class of each? What is the class of walkingdat?
str(walkingdat)
lapply(walkingdat, class) # 'lapply' takes a list (first argument) and applies a function (second argument) to every element of that list.
walkingdat |> map(class) # map is the tidyverse version of lapply. It is in the 'purrr' package which is loaded if you used library(tidyverse).
class(walkingdat) # walkingdat is a dataframe, but since you used read_excel() to get it it is also a `tbl` which is a dataframe with a few tidyverse enhancements.How many columns and rows does walkingdat have?
dim(walkingdat) # Get the dimension
nrow(walkingdat) # Get the number of rows
length(walkingdat) # The 'length' of a data frame is the numveDescriptive statistics
What is the age of youngest patient?
# look at the help for what na.rm does. You'll see many of the summary statistics need this.
min(walkingdat$age, na.rm=TRUE)
walkingdat |> summarize(min(age, na.rm=TRUE))
walkingdat |> select(age) |> min(na.rm=TRUE)What is the age of the oldest patient?
max(walkingdat$age)
walkingdat |> summarize(max(age, na.rm=TRUE))
walkingdat |> select(age) |> max(na.rm=TRUE)
range(walkingdat$age, na.rm=TRUE)
summary(walkingdat)Are the oldest and youngest patient male or female?
walkingdat |> filter(age==47)
walkingdat |> filter(age==72)
walkingdat[which.max(walkingdat$age), "sex"]
walkingdat[which.min(walkingdat$age), "sex"]
# head and tail show the start and end of a dataset.
walkingdat |> arrange(age) |> head()
walkingdat |> arrange(age) |> tail()Cross-tabulation
How many Male and Female patient are there?
walkingdat |> filter( sex =='F' ) |> count()
walkingdat |> filter( sex =='M' ) |> count()
table(walkingdat$sex)
walkingdat |> group_by(sex) |> count()Is the sex ratio balanced between treatment groups?
walkingdat |> filter( group =='treat', sex =='M' ) |> count()
walkingdat |> filter( group =='treat', sex =='F' ) |> count()How many women are there of age under 50? How many men are there of age under 50?
walkingdat |> filter(age < 50 , sex =='F' ) |> count()
walkingdat |> filter(age < 50 , sex =='M' ) |> count()
table(walkingdat$sex, walkingdat$age<50)Descriptive statistics using a subset of the data
What is the average walking speed for men between 60 and 70 years of age?
walkingdat |> filter(age >59, age <71, sex =='M' ) |>
select(time) |> pull() |> mean()
walkingdat |> filter( age >59, age <71, sex =='M' ) |>
summarise(mean(time))Deriving new variables
Create a new column that will have the speed of walking, that is 1/time.
walkingdat <- walkingdat |> mutate(speed=1/as.numeric(time))Create a new column that will have the time rounded to the nearest 10th of a second.
walkingdat <- walkingdat |> mutate(rounded=round(as.numeric(time), digits = 2))Graphing
Make a scatter plot of time vs age Make the colour of each point reflect the treatment group
ggplot(data=walkingdat) +
aes(x=age,y=time,colour=group) +
geom_point()Make a box plot of time by sex, with a different coloured box per sex
ggplot(data=walkingdat) +
aes(x=sex,y=time,fill=sex) +
geom_boxplot()Label the axes appropriately
ggplot(data=walkingdat) +
aes(x=sex,y=time,fill=sex) +
geom_boxplot() +
labs(x="Sex", y="Time to complete task (s)")Try a violin plot instead of a boxplot (with geom_violin). Which do you prefer?*
ggplot(data=walkingdat) +
aes(x=sex,y=time,fill=sex) +
geom_violin() +
labs(x="Sex", y="Time to complete task (s)")Hypothesis testing
What is the difference in task completion time between control patients and treated patients? Is this difference statistically significant?
t.test( time ~ group , data=walkingdat)
lm( time ~ group , data=walkingdat) |> broom::tidy()
lm( time ~ group , data=walkingdat) |> summary()
lm( time ~ group , data=walkingdat) |> anova()Use t-test t.test() and Mann-Whitney tests wilcox.test() to examine the differences between the groups.
t.test( time ~ group , data=walkingdat)
wilcox.test( time ~ group , data=walkingdat)Consider the assumptions underlying the t-test, and whether they are met in this dataset
The t-test and the linear model assume that the outcome variable is normally distributed within groups, which is not the case here. So the results from these is probably wrong.
The Mann-Whitney test (Wilcox test) does not make this assumption, and is more robust in this case.
Do you believe that treatment affects walking speed?
Try the t-test or linear model using the speed variable that you created instead of the time variable that is in the dataset.
Diagnostics
Estimate a linear model (with lm) to estimate/test the effect of treatment on task completion time.
mod1 <- lm(data=walkingdat , time ~ group)Use summary to get the model coefficients and p-values
Use broom::tidy to get the model coefficients and p-values
mod1 |> summary()
mod1 |> broom::tidy()Compare the model results to the equivalent t-test
Extend the model to include the effects of age and sex as potential covariates.
mod2 <- lm(data=walkingdat , time ~ group + age + sex)Use plot() to explore the model diagnostics
## You could plot model 1 or model 2.
plot(mod1)Use performance::check_model to explore the model diagnostics
performance::check_model(mod1)Now estimate a new model to compare the task completion speed between groups.
Get the model coefficients and p-values
Check the model diagnostics using plot or performance::check_model
walkingdat$speed <- 1/walkingdat$time
mod3 <- lm(data=walkingdat , speed ~ group + age + sex)
performance::check_model(mod3)Why are the result from this model different? How do you interpret the results?
Can you get the confidence intervals for the treatment effect (use the help for broom::tidy)
Use model_summary package to make a pretty model summary table
modelsummary::modelsummary(mod3)
modelsummary::modelsummary(list("Speed model"=mod3,"First model"=mod1) ,
estimate="{estimate} [{conf.low}, {conf.high}] {stars}",
coef_rename = c('(Intercept)'="Intercept",
'grouptreat'="Treatment",
'age'="Age (per year)",
'sexM'="Male vs Female"),
statistic=NULL, )