Intro to R for statistics, Day 2/3
Self-directed exercise
Today’s worksheet introduces you to a real dataset.
The tasks you will cover are:
- Reading data into R from an Excel file
- Checking and cleaning the data
- Revision of some of the statistics and graphing that we covered on the first day.
- The importance of checking model validity
You can attempt the worksheet by following the material on your own on day 2.
On day 3 we will run through the work discussing any issues related to R or statistics that arise.
Expect to spend about three hours on today’s work in total. You all have different levels of knowledge and experience with data analysis and programming, and there is hopefully something in the exercises that will stretch everyone no matter what their background!
So make a good effort at the tutorial and the exercises, work together and use extra resources if needed (part of the skill is knowing where to go for help), but don’t spend more than half a day on this unless you want to!
The dataset
The data are from a randomised clinical trial of a new rehabilitation intervention (compared to standard care) aimed at improving the walking speed of people recovering from stroke.
Better walking speed is a good indicator of general stroke recovery.
We have recorded:
- The age and sex of each participant,
- The treatment allocation,
- The hospital department from which they were recruited and
- The time they take to complete a walking task after their treatment.
Our research questions are:
- Does the new treatment improve walking speed compared to control treatment?
- By how much does it improve, and how certain are we of this?
The dataset can be found at walkingspeed_day2.xlsx
- Powerpoint slides to support this material are at:
Using readxl
to read data from Excel files
R is not used for data entry or storage; you are likely to have data stored as an Excel file, a csv file, in a database, or in some other format generated by a device.
So before we can do any analysis we need to import data. We do this in code like everything else in R.
Base R cannot read data from Excel files. But several add-on ‘packages’ can read and write Excel files. Currently the best package for reading Excel files is called readxl. For writing Excel files the openxlsx package is very good.
Whatever the format of your raw data, there is likely to be a package to read it into R:
- readxl for Excel format
- haven for other statistics packages (Stata, SPSS etc)
- readr or data.table for csv and plain text
- pzfx for GraphPad Prism
- Various Bioconductor packages for ’omics data in various formats.
Here we will import a dataset from Excel using the readxl package. If you already installed tidyverse then you already have this package.
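If readxl is missing for any reason, you can install it from CRAN first (this only needs doing once per computer):
install.packages("readxl")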
Once the package is installed we can use the function readxl::read_excel()
, which reads data from an Excel file.
Make sure the data looks OK and is in the right place on your computer
Before we dive in and import it, we need to make sure our data is in a sensible place.
- Open RStudio and switch to the project you created in day 1, or start a new project if necessary.
- Then, save the example data walkingspeed_day2.xlsx for this tutorial into your project folder. Check that it has appeared in the ‘Files’ pane in RStudio.
- Start a new script file that will eventually include all of the commands we need to import, clean, visualise and analyse the data.
Now open the dataset in Excel and explore the file so that you understand what is there.
Read the help!
We are nearly ready to import our data. But before using a new function it's always good to read its documentation.
R and R packages are not as self-explanatory as other software, and so you should expect to spend a fair amount of time, particularly as you are learning R, reading documentation, vignettes, blogs, etc on what R can do, which packages exist, and how to use them.
The read_excel() function has a few different options, so first we should look at the help file:
# where does the helpfile appear?
?read_excel
Make sure to check:
- Description: what the function does
- Usage: what the syntax is
- Arguments: details of what all the options mean
- Value: what you get when you run the function
- Examples (usually very helpful)
Note from the help file that read_excel()
can extract data from different sheets and ranges of an Excel workbook, can use or ignore column names, and allows you to specify the type of data (numeric, dates, text etc) if you want to, or leave it to R to guess.
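For illustration only (the sheet name is real, but the range and column types below are made up for this sketch; adjust them to your own file), a more explicit call could look like:
dat <- read_excel("walkingspeed_day2.xlsx", sheet = "day2",
                  range = "A1:F136",   # illustrative range, not necessarily the true extent
                  col_types = c("numeric", "text", "numeric", "text", "text", "numeric"))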
Many R packages also have vignettes or websites including simpler guides to their use in specific cases. readxl
has a website that you might find helpful: https://readxl.tidyverse.org/
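To browse a package's documentation from inside R you can also try, for example:
help(package = "readxl")      # index of all help pages in the package
browseVignettes("readxl")     # open any vignettes installed with the package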
Now we’ll load the data. We want to use the ‘walking speed’ data from the walkingspeed_day2.xlsx spreadsheet.
- Open the spreadsheet using Excel and find this sheet. The data we want is in the sheet called ‘day2’.
From the read_excel()
help file we can deduce the syntax to load this data into R:
library(readxl)
walkingdat <- read_excel(path="walkingspeed_day2.xlsx", sheet="day2")
This line assumes that the file walkingspeed_day2.xlsx is in the current working directory (you can check what the current working directory is with getwd()). The current working directory is also shown just above the R console window. You can see the files in the current working directory in the ‘Files’ tab on the bottom right of the RStudio window. When you create or load a project, RStudio will set the working directory to the root of the project directory.
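A quick way to check that R can see the file (a minimal sketch):
getwd()        # print the current working directory
list.files()   # walkingspeed_day2.xlsx should appear in this list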
So this line calls the read_excel() function with the arguments path and sheet set. The other arguments will be set to their default values, which you can see from the help file.
We could have set the range of the data in the spreadsheet (I usually do this for safety), but read_excel()
can figure it out automatically most of the time; by default it picks the biggest continuous chunk of data starting in the top left of the sheet.
Now you should have a ‘data frame’ object called walkingdat
in your environment, which includes the data from the Excel sheet ready to process and analyse.
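It is worth a quick first look at the imported object, for example:
head(walkingdat)             # first few rows
dplyr::glimpse(walkingdat)   # compact view of each column and its type (from the tidyverse)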
Our workflow now is:
- Clean and code
- Visualise
- Describe
- Model
- Diagnose model
- Interpret
Cleaning
Before we can analyse our data we should check each variable to make sure that it looks OK.
We’ll use functions from day 1 to explore the content and structure of the dataset, and the distributions of the main variables. We can then address any problems that we find.
Recall these functions from the day 1 notes. We are checking that the variables have the correct type, and looking at the distributions of the categorical variables.
This is an important aspect of any data analysis and we should not skip this step.
library(tidyverse)
summary(walkingdat)
patid group time sex
Min. : 1.00 Length:135 Min. : 0.000 Length:135
1st Qu.: 34.50 Class :character 1st Qu.: 2.167 Class :character
Median : 68.00 Mode :character Median : 2.426 Mode :character
Mean : 68.01 Mean : 4.175
3rd Qu.:101.50 3rd Qu.: 2.949
Max. :136.00 Max. :185.362
age department
Length:135 Min. :1.000
Class :character 1st Qu.:2.000
Mode :character Median :3.000
Mean :2.607
3rd Qu.:3.500
Max. :4.000
class(walkingdat$age)
[1] "character"
table(walkingdat$group)
control treat
68 67
table(walkingdat$sex)
F M
35 100
table(walkingdat$department)
1 2 3 4
28 31 42 34
Converting between types
You will have noticed that one of the numerical variables, age, has been loaded as a character variable and not a numeric one. We cannot do a quantitative analysis with a character variable!
Why do you think the variable is imported as character and not numeric?
We can convert age to a numeric using the function as.numeric()
.
# Tidyverse version:
walkingdat <- walkingdat |> mutate(age = as.numeric(age))

# Base R version:
walkingdat$age <- as.numeric(walkingdat$age)
What warning message did you get when you converted this variable? What does it mean?
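If you want to see exactly which entries triggered the warning, one sketch is to re-read the raw sheet (so the original text values are still there) and pick out the values that cannot be parsed as numbers:
raw <- read_excel(path="walkingspeed_day2.xlsx", sheet="day2")
raw$age[!is.na(raw$age) & is.na(suppressWarnings(as.numeric(raw$age)))]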
Checking the data
We mustn’t assume that all our data has been entered correctly.
How might we check on the age and time variables to see if the distribution looks OK?
We could make some descriptive statistics and basic plots!
hist(walkingdat$age)
hist(walkingdat$time)
range(walkingdat$time, na.rm=TRUE)
What do you notice from here?
Dealing with outliers
It looks like there are some unreasonably high and low values of walking time.
We can make another graph of walking time against age, this time on a logarithmic scale so both the extreme high and extreme low points are visible, to see what is going on.
## ggplot2 version
library(ggplot2)
ggplot(walkingdat) + aes(x=age, y=time) + geom_point() + scale_y_log10()
## base R version
plot(walkingdat$age, walkingdat$time, log="y")
It seems there are some values for time that are likely to be technical errors. We can remove these values (set them to missing) in a few different ways:
## base R method
walkingdat$time[walkingdat$time < 0.1] <- NA
walkingdat$time[walkingdat$time > 100] <- NA

## Using mutate and ifelse
walkingdat <- walkingdat |> mutate(time = ifelse(time < 0.1, NA, time))
walkingdat <- walkingdat |> mutate(time = ifelse(time > 100, NA, time))

## the 'pure' tidyverse way with case_when is a bit clunky.
## look up 'case_when()' to understand this line
walkingdat <- walkingdat |> mutate(time = case_when(time < 0.1 ~ NA_real_,
                                                    time > 100 ~ NA_real_,
                                                    TRUE ~ time))
# Now check the distribution of time again.
hist(walkingdat$time, breaks=100)
range(walkingdat$time, na.rm = TRUE)
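You can also confirm how many values were set to missing, for example:
sum(is.na(walkingdat$time))   # number of missing time values after cleaning
summary(walkingdat$time)      # the NA count is reported at the end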
Now that we have a cleaned dataset in our environment, you can proceed with the visualisation and analysis, using the functions you learned in day 1.
If you had trouble with the cleaning exercise then use the version of the dataset found in the cleaned
tab of the spreadsheet:
walkingdat <- read_excel(path="walkingspeed_day2.xlsx", sheet="cleaned")
Analysis - Exercises
Add commands to your script file to answer the questions below.
Add comments (lines beginning with #) to help you remember which questions you are answering, what you did and why!
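For example, an entry in your script might look like:
# Q2a: age of the youngest patient
min(walkingdat$age, na.rm=TRUE)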
Some solutions are posted at the bottom of the page, but it will help you a lot to try to answer them on your own, by discussing with colleagues, or by looking up help online before using the solutions.
If you don’t understand the given solution, then please ask and we can go through it in the class.
We will discuss the testing, modelling and graphing elements in more detail on day 3, but it would be helpful to attempt these today.
- Structure of the dataset
- What are the names of the variables that comprise the walkingdat dataframe?
- What is the class of each?
- How many columns and rows does it have?
- Descriptive statistics
- What is the age of the youngest patient?
- What is the age of the oldest patient?
- Are the oldest and youngest patients male or female?
- Cross-tabulation (hint: use the table function)
- How many Male and Female patients are there?
- Is the sex ratio balanced between treatment groups?
- How many women are there of age under 50? How many men are there of age under 50?
- Descriptive statistics using a subset of the data
- What is the average walking speed for men between 60 and 70 years of age?
- Graphing
- Consider what kind of graph you might make to illustrate the difference in walking speed between treatment groups, and try to make it using ggplot. Eg:
- Make a scatter plot of time vs age
- Make the colour of each point reflect the treatment group
- Make a box plot of time by sex, with a different coloured box per sex
- Label the axes appropriately
- Try a violin plot instead of a boxplot (with geom_violin). Which do you prefer?
- Make any other adjustments you think are informative!
- Hypothesis Testing
- What is the difference in average walking speed between control patients and treated patients?
- Is this difference statistically significant? Use the t-test t.test() and the Mann-Whitney test wilcox.test() to examine the differences between the groups. Look up the syntax for t.test() and wilcox.test() using the help system, and use them to check whether the time taken to complete the task varies by treatment status. In each case use the version of the function that takes a formula argument. Remember the formula syntax from the boxplot and lm functions in day 1.
- Compare the p-values for each method.
- Consider the assumptions for a t-test, and whether they are met in this dataset.
- Do you believe that treatment affects walking speed?
- Deriving new variables
- Create a new column that will have the speed of walking, that is 1/time.
- Create a new column that will have the time rounded to the nearest 10th of a second.
- Linear model diagnostics
- Estimate a linear model (with lm) to estimate/test the effect of treatment on task completion time.
- Use summary to get the model coefficients and p-values
- Use broom::tidy to get the model coefficients and p-values
- Compare the model results to the equivalent t-test
- Extend the model to include the effects of age and sex as potential covariates.
- Use plot() to explore the model diagnostics
- Use performance::check_model to explore the model diagnostics
- Now estimate a new model to compare the task completion speed between groups.
- Get the model coefficients and p-values
- Check the model diagnostics using plot or performance::check_model
- Why are the results from this model different? How do you interpret the results?
- Can you get the confidence intervals for the treatment effect? (Use the help for broom::tidy.)
- Use the modelsummary package to make a pretty model summary table
Solutions
The exercises have many possible solutions; under each exercise we have listed a few possible answers. Most are possible using the functions we introduced in day 1 of the course, or with functions I suggest in the questions, but in some cases I have also given other solutions using different functions/packages.
If you answered the questions another way then that’s fine, but do try to understand how the given solutions work.
knitr::opts_knit$set(eval=FALSE)
Structure of the dataset
What are the names of the variables that comprise the walkingdat
dataframe?
library(tidyverse)
# Eg you could use:
str(walkingdat)
names(walkingdat)
summary(walkingdat)
What is the class of each? What is the class of walkingdat
?
str(walkingdat)
lapply(walkingdat, class) # 'lapply' takes a list (first argument) and applies a function (second argument) to every element of that list.
walkingdat |> map(class) # map is the tidyverse version of lapply. It is in the 'purrr' package which is loaded if you used library(tidyverse).
class(walkingdat) # walkingdat is a dataframe, but since you used read_excel() to get it it is also a `tbl` which is a dataframe with a few tidyverse enhancements.
How many columns and rows does walkingdat
have?
dim(walkingdat) # Get the dimension
nrow(walkingdat) # Get the number of rows
length(walkingdat) # The 'length' of a data frame is the number of columns
Descriptive statistics
What is the age of the youngest patient?
# look at the help for what na.rm does. You'll see many of the summary statistics need this.
min(walkingdat$age, na.rm=TRUE)
walkingdat |> summarize(min(age, na.rm=TRUE))
walkingdat |> select(age) |> min(na.rm=TRUE)
What is the age of the oldest patient?
max(walkingdat$age, na.rm=TRUE)
walkingdat |> summarize(max(age, na.rm=TRUE))
walkingdat |> select(age) |> max(na.rm=TRUE)
range(walkingdat$age, na.rm=TRUE)
summary(walkingdat)
Are the oldest and youngest patients male or female?
walkingdat |> filter(age==47)
walkingdat |> filter(age==72)

walkingdat[which.max(walkingdat$age), "sex"]
walkingdat[which.min(walkingdat$age), "sex"]

# head and tail show the start and end of a dataset.
walkingdat |> arrange(age) |> head()
walkingdat |> arrange(age) |> tail()
Cross-tabulation
How many Male and Female patients are there?
walkingdat |> filter( sex =='F' ) |> count()
walkingdat |> filter( sex =='M' ) |> count()

table(walkingdat$sex)
walkingdat |> group_by(sex) |> count()
Is the sex ratio balanced between treatment groups?
walkingdat |> filter( group =='treat', sex =='M' ) |> count()
walkingdat |> filter( group =='treat', sex =='F' ) |> count()
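An alternative not shown above is a single two-way table of group by sex:
table(walkingdat$group, walkingdat$sex)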
How many women are there of age under 50? How many men are there of age under 50?
walkingdat |> filter(age < 50 , sex =='F' ) |> count()
walkingdat |> filter(age < 50 , sex =='M' ) |> count()
table(walkingdat$sex, walkingdat$age<50)
Descriptive statistics using a subset of the data
What is the average walking speed for men between 60 and 70 years of age?
walkingdat |> filter(age > 59, age < 71, sex == 'M') |>
  select(time) |> pull() |> mean(na.rm=TRUE)

walkingdat |> filter(age > 59, age < 71, sex == 'M') |>
  summarise(mean(time, na.rm=TRUE))
Deriving new variables
Create a new column that will have the speed of walking, that is 1/time.
walkingdat <- walkingdat |> mutate(speed = 1/as.numeric(time))
Create a new column that will have the time rounded to the nearest 10th of a second.
walkingdat <- walkingdat |> mutate(rounded = round(as.numeric(time), digits = 1))
Graphing
Make a scatter plot of time vs age. Make the colour of each point reflect the treatment group.
ggplot(data=walkingdat) +
aes(x=age,y=time,colour=group) +
geom_point()
Make a box plot of time by sex, with a different coloured box per sex
ggplot(data=walkingdat) +
aes(x=sex,y=time,fill=sex) +
geom_boxplot()
Label the axes appropriately
ggplot(data=walkingdat) +
aes(x=sex,y=time,fill=sex) +
geom_boxplot() +
labs(x="Sex", y="Time to complete task (s)")
Try a violin plot instead of a boxplot (with geom_violin). Which do you prefer?
ggplot(data=walkingdat) +
aes(x=sex,y=time,fill=sex) +
geom_violin() +
labs(x="Sex", y="Time to complete task (s)")
Hypothesis testing
What is the difference in task completion time between control patients and treated patients? Is this difference statistically significant?
t.test( time ~ group , data=walkingdat)
lm( time ~ group , data=walkingdat) |> broom::tidy()
lm( time ~ group , data=walkingdat) |> summary()
lm( time ~ group , data=walkingdat) |> anova()
Use the t-test t.test() and the Mann-Whitney test wilcox.test() to examine the differences between the groups.
t.test( time ~ group , data=walkingdat)
wilcox.test( time ~ group , data=walkingdat)
Consider the assumptions underlying the t-test, and whether they are met in this dataset
The t-test and the linear model assume that the outcome variable is normally distributed within groups, which is not the case here, so the results from these are probably not reliable.
The Mann-Whitney test (the Wilcoxon rank-sum test, wilcox.test()) does not make this assumption, and is more robust in this case.
Do you believe that treatment affects walking speed?
Try the t-test or linear model using the speed variable that you created instead of the time variable that is in the dataset.
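A minimal sketch, assuming you created the speed = 1/time column in the 'Deriving new variables' exercise:
t.test(speed ~ group, data=walkingdat)
lm(speed ~ group, data=walkingdat) |> summary()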
Diagnostics
Estimate a linear model (with lm
) to estimate/test the effect of treatment on task completion time.
mod1 <- lm(data=walkingdat , time ~ group)
Use summary
to get the model coefficients and p-values
Use broom::tidy
to get the model coefficients and p-values
mod1 |> summary()
mod1 |> broom::tidy()
Compare the model results to the equivalent t-test
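Note that lm() with a two-level group corresponds to the pooled (equal-variance) two-sample t-test, while t.test() uses the Welch correction by default; for an exact match you can set var.equal:
t.test(time ~ group, data=walkingdat, var.equal=TRUE)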
Extend the model to include the effects of age and sex as potential covariates.
mod2 <- lm(data=walkingdat , time ~ group + age + sex)
Use plot()
to explore the model diagnostics
## You could plot model 1 or model 2.
plot(mod1)
Use performance::check_model
to explore the model diagnostics
performance::check_model(mod1)
Now estimate a new model to compare the task completion speed between groups.
Get the model coefficients and p-values
Check the model diagnostics using plot
or performance::check_model
walkingdat$speed <- 1/walkingdat$time
mod3 <- lm(data=walkingdat , speed ~ group + age + sex)
performance::check_model(mod3)
Why are the results from this model different? How do you interpret the results?
Can you get the confidence intervals for the treatment effect? (Use the help for broom::tidy.)
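One way, using the conf.int argument of broom::tidy() (base R's confint() gives the same intervals):
broom::tidy(mod3, conf.int=TRUE)
confint(mod3)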
Use the modelsummary package to make a pretty model summary table
modelsummary::modelsummary(mod3)

modelsummary::modelsummary(list("Speed model"=mod3, "First model"=mod1),
                           estimate="{estimate} [{conf.low}, {conf.high}] {stars}",
                           coef_rename = c('(Intercept)'="Intercept",
                                           'grouptreat'="Treatment",
                                           'age'="Age (per year)",
                                           'sexM'="Male vs Female"),
                           statistic=NULL)