Intro to R for statistics, Day 2/3

Self-directed exercise

Authors

George Savva, QIB

Pirita Paajanen, JIC

Published

April 1, 2024

Today’s worksheet introduces you to a real dataset.

The tasks you will cover are importing, cleaning, visualising and analysing the data.

You can attempt the worksheet by following the material on your own on day 2.

On day 3 we will run through the work discussing any issues related to R or statistics that arise.

Note for day 2

Expect to spend about three hours on today’s work in total. You all have different levels of knowledge and experience with data analysis and programming, and there is hopefully something in the exercises that will stretch everyone no matter what their background!

So make a good effort at the tutorial and the exercises, work together and use extra resources if needed (part of the skill is knowing where to go for help), but don’t spend more than half a day on this unless you want to!

The dataset

The data are from a randomised clinical trial of a new rehabilitation intervention (compared to standard care) aimed at improving the walking speed of people recovering from stroke.

Better walking speed is a good indicator of general stroke recovery.

We have recorded:

  • The age and sex of each participant,
  • The treatment allocation,
  • The hospital department from which they were recruited and
  • The time they take to complete a walking task after their treatment.

Our research questions are:

  • Does the new treatment improve walking speed compared to control treatment?
  • By how much does it improve, and how certain are we of this?

The dataset can be found at walkingspeed_day2.xlsx

  • PowerPoint slides to support this material are at:

Graphing - day3.pptx

day3_data_lm.pptx

Using readxl to read data from Excel files

R is not used for data entry or storage; you are likely to have data stored as an Excel file, a csv file, in a database, or in some other format generated by a device.

So before we can do any analysis we need to import data. We do this in code like everything else in R.

Base R cannot read data from Excel files. But several add-on ‘packages’ can read and write Excel files. Currently the best package for reading Excel files is called readxl. For writing Excel files the openxlsx package is very good.

Whatever the format of your raw data, there is likely to be a package to read it into R:

  • readxl for Excel format
  • haven for other statistics packages (Stata, SPSS etc)
  • readr or data.table for csv and plain text
  • pzfx for GraphPad Prism.
  • Various Bioconductor packages for ’omics data in various formats.

Here we will import a dataset from Excel using the readxl package. If you already installed tidyverse then you already have this package.

Once the package is installed we can use the function readxl::read_excel(), which reads data from an Excel file.
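If you are not sure whether readxl is available on your machine, a minimal one-off install and load looks like this (the install line only needs to be run once, so it is commented out here):

# install.packages("readxl")   # only needed once, if readxl is not already installed
library(readxl)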

Make sure the data looks OK and is in the right place on your computer

Before we dive in and import it, we need to make sure our data is in a sensible place.

  1. Open RStudio and switch to the project you created in day 1, or start a new project if necessary.

  2. Then, save the example data walkingspeed_day2.xlsx for this tutorial into your project folder. Check that it has appeared in the ‘files’ pane in RStudio.

  3. Start a new script file that will eventually include all of the commands we need to import, clean, visualise and analyse the data.

  4. Now open the dataset in Excel and explore the file so that you understand what is there.

Read the help!

We are nearly ready to import our data. But before using a new function it's always good to read its documentation.

R and R packages are not as self-explanatory as other software, and so you should expect to spend a fair amount of time, particularly as you are learning R, reading documentation, vignettes, blogs, etc on what R can do, which packages exist, and how to use them.

The read_excel() function has a few different options so first we should look at the help file:

library(readxl)
?read_excel # where does the helpfile appear?

Make sure to check:

  • Description: what the function does
  • Usage: what the syntax is
  • Arguments: details of what all the options mean
  • Value: what you get when you run it
  • Examples (usually very helpful)

Note from the help file that read_excel() can extract data from different sheets and ranges of an Excel workbook, can use or ignore column names, and allows you to specify the type of data (numeric, dates, text etc) if you want to, or leave it to R to guess.
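For example, a more defensive call might set the column names and types explicitly. This is only a sketch: the column order, types and missing-value codes below are assumptions, so adjust them to match your sheet.

read_excel("walkingspeed_day2.xlsx", sheet = "day2",
           col_names = TRUE,    # the first row of the sheet holds the column names
           col_types = c("numeric", "text", "numeric", "text", "text", "numeric"),
           na = c("", "NA"))    # treat blank cells and the text "NA" as missing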

Many R packages also have vignettes or websites including simpler guides to their use in specific cases. readxl has a website that you might find helpful: https://readxl.tidyverse.org/

Now we’ll load the data. We want to use the ‘walking speed’ data from the walkingspeed_day2.xlsx spreadsheet.

  1. Open the spreadsheet using Excel and find this sheet. The data we want is in the sheet called ‘day2’.

From the read_excel() help file we can deduce the syntax to load this data into R:

library(readxl)
walkingdat <- read_excel(path="walkingspeed_day2.xlsx", sheet="day2")

This line assumes that the file walkingspeed_day2.xlsx is in the current working directory (you can check what the current working directory is with getwd()). The current working directory is also shown just above the R console window. You can see the files in the current working directory in the ‘Files’ tab on the bottom right of the RStudio window. When you create or load a project, RStudio will set the working directory to the root of the project directory.
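Before importing, you can confirm from the console that R can actually see the file in the working directory:

getwd()                                 # where R is currently looking
list.files()                            # the files R can see there
file.exists("walkingspeed_day2.xlsx")   # should return TRUE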

So this line calls the read_excel() function with the path and sheet arguments set. The other arguments are left at their default values, which you can see in the help file.

We could have set the range of the data in the spreadsheet (I usually do this for safety), but read_excel() can figure it out automatically most of the time; by default it picks the biggest continuous chunk of data starting in the top left of the sheet.
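If you do want to be explicit, the call might look like this; the cell range shown is an assumption (a header row plus 135 rows of data in columns A to F), so check it against your spreadsheet:

walkingdat <- read_excel(path="walkingspeed_day2.xlsx", sheet="day2",
                         range="A1:F136")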

Now you should have a ‘data frame’ object called walkingdat in your environment, which includes the data from the Excel sheet ready to process and analyse.

Our workflow now is:

  1. Clean and code
  2. Visualise
  3. Describe
  4. Model
  5. Diagnose model
  6. Interpret

Cleaning

Before we can analyse our data we should check each variable to make sure that it looks OK.

We’ll use functions from day 1 to explore the content and structure of the dataset, and the distributions of the main variables. We can then address any problems that we find.

Recall these functions from the day 1 notes. We are checking that the variables have the correct types and looking at the distributions of the categorical variables.

This is an important aspect of any data analysis and we should not skip this step.

library(tidyverse)
summary(walkingdat)
     patid           group                time             sex           
 Min.   :  1.00   Length:135         Min.   :  0.000   Length:135        
 1st Qu.: 34.50   Class :character   1st Qu.:  2.167   Class :character  
 Median : 68.00   Mode  :character   Median :  2.426   Mode  :character  
 Mean   : 68.01                      Mean   :  4.175                     
 3rd Qu.:101.50                      3rd Qu.:  2.949                     
 Max.   :136.00                      Max.   :185.362                     
     age              department   
 Length:135         Min.   :1.000  
 Class :character   1st Qu.:2.000  
 Mode  :character   Median :3.000  
                    Mean   :2.607  
                    3rd Qu.:3.500  
                    Max.   :4.000  
class(walkingdat$age)
[1] "character"
table(walkingdat$group)

control   treat 
     68      67 
table(walkingdat$sex)

  F   M 
 35 100 
table(walkingdat$department)

 1  2  3  4 
28 31 42 34 

Converting between types

You will have noticed that one of the numerical variables, age, has been loaded as a character variable and not a numeric one. We cannot do a quantitative analysis with a character variable!

Why do you think the variable is imported as character and not numeric?

We can convert age to a numeric using the function as.numeric().

# Tidyverse version:
walkingdat <- walkingdat |> mutate(age = as.numeric(age))

# Base R version:
walkingdat$age <- as.numeric(walkingdat$age)

What warning message did you get when you converted this variable? What does it mean?
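A quick way to see which entries were affected (a sketch, assuming you have already run the conversion above) is to list the rows where age is now missing:

which(is.na(walkingdat$age))        # row numbers where age could not be converted (or was already empty)
walkingdat |> filter(is.na(age))    # inspect those rows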

Checking the data

We mustn’t assume that all our data has been entered correctly.

How might we check on the age and time variables to see if the distribution looks OK?

We could make some descriptive statistics and basic plots!

hist(walkingdat$age)
hist(walkingdat$time)
range(walkingdat$time, na.rm=TRUE)

What do you notice from here?

Dealing with outliers

It looks like there are some unreasonably high and low values of walking time.

We can plot walking time against age on a logarithmic scale, so that both the extreme high and extreme low points are visible, to see what is going on.

## ggplot2 version
library(ggplot2)
ggplot(walkingdat) + aes(x=age, y=time) + geom_point() + scale_y_log10()

## base R version
plot(walkingdat$age, walkingdat$time, log="y")

It seems there are some values for time that are likely to be technical errors. We can remove these values (set them to missing) in a few different ways:

## base R method
walkingdat$time[walkingdat$time<0.1] <- NA
walkingdat$time[walkingdat$time>100] <- NA

## Using mutate and ifelse
walkingdat <- walkingdat |> mutate(time = ifelse(time<0.1,NA,time))
walkingdat <- walkingdat |> mutate(time = ifelse(time>100,NA,time))

## the 'pure' tidyverse way with case_when is a bit clunky.
## look up 'case_when()' to understand this line
walkingdat <- walkingdat |> mutate( time = case_when(time<0.1 ~ NA_real_ , 
                                                     time>100 ~ NA_real_, 
                                                     TRUE ~ time) )

# Now check the distribution of time again.
hist(walkingdat$time, breaks=100)
range(walkingdat$time, na.rm = TRUE)

Now that we have a cleaned dataset in our environment, you can proceed with the visualisation and analysis, using the functions you learned on day 1.

If you had trouble with the cleaning exercise then use the version of the dataset found in the cleaned tab of the spreadsheet:

walkingdat <- read_excel(path="walkingspeed_day2.xlsx", sheet="cleaned")

Analysis - Exercises

Add commands to your script file to answer the questions below.

Add comments (lines beginning with #) to help you remember which questions you are answering, what you did and why!
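For example, each answer in your script might look something like this:

# Q1a: what are the names of the variables in the dataset?
names(walkingdat)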

Some solutions are posted at the bottom of the page, but it will help you a lot to try to answer them on your own, by discussing with colleagues, or by looking up help online before using the solutions.

If you don’t understand the given solution, then please ask and we can go through it in the class.

We will discuss the testing, modelling and graphing elements in more detail on day 3, but it would be helpful to attempt these today.


  1. Structure of the dataset
    1. What are the names of the variables that comprise the walkingdat dataframe?
    2. What is the class of each?
    3. How many columns and rows does it have?
  2. Descriptive statistics
    1. What is the age of the youngest patient?
    2. What is the age of the oldest patient?
    3. Are the oldest and youngest patients male or female?
  3. Cross-tabulation (hint, use the table function)
    1. How many Male and Female patients are there?
    2. Is the sex ratio balanced between treatment groups?
    3. How many women are there of age under 50? How many men are there of age under 50?
  4. Descriptive statistics using a subset of the data
    1. What is the average walking speed for men between 60 and 70 years of age?
  5. Graphing
    1. Consider what kind of graph you might make to illustrate the difference in walking speed between treatment groups, and try to make it using ggplot. Eg:
    2. Make a scatter plot of time vs age
    3. Make the colour of each point reflect the treatment group
    4. Make a box plot of time by sex, with a different coloured box per sex
    5. Label the axes appropriately
    6. Try a violin plot instead of a boxplot (with geom_violin). Which do you prefer?
    7. Make any other adjustments you think are informative!
  6. Hypothesis Testing
    1. What is the difference in average walking speed between control patients and treated patients?
    2. Is this difference statistically significant? Use the t-test t.test() and Mann-Whitney test wilcox.test() to examine the differences between the groups. Look up the syntax for t.test() and wilcox.test() using the help system, and use them to check whether the time taken to complete the task varies by treatment status. In each case use the version of the function that takes a formula argument. Remember the formula syntax from the boxplot and lm functions in day 1.
    3. Compare the p-values for each method.
    4. Consider the assumptions for a t-test, and whether they are met in this dataset.
    5. Do you believe that treatment affects walking speed?
  7. Deriving new variables
    1. Create a new column that will have the speed of walking, that is 1/time.
    2. Create a new column that will have the time rounded to the nearest tenth of a second.
  8. Linear model diagnostics
    1. Estimate a linear model (with lm) to estimate/test the effect of treatment on task completion time.
    2. Use summary to get the model coefficients and p-values
    3. Use broom::tidy to get the model coefficients and p-values
    4. Compare the model results to the equivalent t-test
    5. Extend the model to include the effects of age and sex as potential covariates.
    6. Use plot() to explore the model diagnostics
    7. Use performance::check_model to explore the model diagnostics
    8. Now estimate a new model to compare the task completion speed between groups.
    9. Get the model coefficients and p-values
    10. Check the model diagnostics using plot or performance::check_model
    11. Why are the result from this model different? How do you interpret the results?
    12. Can you get the confidence intervals for the treatment effect? (Use the help for broom::tidy.)
    13. Use the modelsummary package to make a pretty model summary table

Solutions

The exercises have many possible solutions; under each exercise we have listed a few possible answers. Most are possible using the functions introduced on day 1 of the course, or the functions suggested in the questions, but in some cases we have also given solutions using different functions or packages.

If you answered the questions another way then that’s fine, but do try to understand how the given solutions work.


Structure of the dataset

What are the names of the variables that comprise the walkingdat dataframe?

library(tidyverse)

# Eg you could use:
str(walkingdat)
names(walkingdat)
summary(walkingdat)

What is the class of each? What is the class of walkingdat?

str(walkingdat)

lapply(walkingdat, class)  #  'lapply' takes a list (first argument) and applies a function (second argument) to every element of that list.

walkingdat |> map(class) # map is the tidyverse version of lapply.  It is in the 'purrr' package which is loaded if you used library(tidyverse).

class(walkingdat) # walkingdat is a data frame, but since you used read_excel() to get it, it is also a `tbl` (tibble), which is a data frame with a few tidyverse enhancements.

How many columns and rows does walkingdat have?

dim(walkingdat) # Get the dimension
nrow(walkingdat) # Get the number of rows
length(walkingdat) # The 'length' of a data frame is the number of columns

Descriptive statistics

What is the age of the youngest patient?

# look at the help for what na.rm does.  You'll see many of the summary statistics need this.
min(walkingdat$age, na.rm=TRUE)
walkingdat |> summarize(min(age, na.rm=TRUE))
walkingdat |> select(age) |> min(na.rm=TRUE)

What is the age of the oldest patient?

max(walkingdat$age, na.rm=TRUE)

walkingdat |> summarize(max(age, na.rm=TRUE))

walkingdat |> select(age) |> max(na.rm=TRUE)

range(walkingdat$age, na.rm=TRUE)

summary(walkingdat)

Are the oldest and youngest patients male or female?

# (substitute the minimum and maximum ages you found above)
walkingdat |> filter(age==47)
walkingdat |> filter(age==72)

walkingdat[which.max(walkingdat$age), "sex"]
walkingdat[which.min(walkingdat$age), "sex"]

# head and tail show the start and end of a dataset.  
walkingdat |> arrange(age) |> head()
walkingdat |> arrange(age) |> tail()

Cross-tabulation

How many Male and Female patients are there?

walkingdat |> filter( sex =='F' ) |> count()
walkingdat |> filter( sex =='M' ) |> count()

table(walkingdat$sex)

walkingdat |> group_by(sex) |> count()

Is the sex ratio balanced between treatment groups?

walkingdat |> filter( group =='treat', sex =='M' ) |> count()
walkingdat |> filter( group =='treat', sex =='F' ) |> count()
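A single cross-tabulation of group by sex (using table(), as elsewhere in this worksheet) answers this more directly:

table(walkingdat$group, walkingdat$sex)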

How many women are there of age under 50? How many men are there of age under 50?

walkingdat |> filter(age < 50 , sex =='F' ) |>  count()
walkingdat |> filter(age < 50 , sex =='M' ) |>  count()

table(walkingdat$sex, walkingdat$age<50)

Descriptive statistics using a subset of the data

What is the average walking speed for men between 60 and 70 years of age?

walkingdat |> filter(age > 59, age < 71, sex == 'M') |> 
  select(time) |> pull() |> mean(na.rm=TRUE)

walkingdat |> filter(age > 59, age < 71, sex == 'M') |> 
  summarise(mean(time, na.rm=TRUE))

Deriving new variables

Create a new column that will have the speed of walking, that is 1/time.

walkingdat <- walkingdat |> mutate(speed=1/as.numeric(time))

Create a new column that will have the time rounded to the nearest tenth of a second.

walkingdat <- walkingdat |> mutate(rounded=round(as.numeric(time), digits = 1)) # digits = 1 rounds to the nearest 0.1 s

Graphing

Make a scatter plot of time vs age. Make the colour of each point reflect the treatment group.

ggplot(data=walkingdat) +
  aes(x=age,y=time,colour=group) + 
  geom_point()

Make a box plot of time by sex, with a different coloured box per sex

ggplot(data=walkingdat) +
  aes(x=sex,y=time,fill=sex) + 
  geom_boxplot()

Label the axes appropriately

ggplot(data=walkingdat) +
  aes(x=sex,y=time,fill=sex) + 
  geom_boxplot() + 
  labs(x="Sex", y="Time to complete task (s)")

Try a violin plot instead of a boxplot (with geom_violin). Which do you prefer?

ggplot(data=walkingdat) +
  aes(x=sex,y=time,fill=sex) + 
  geom_violin() + 
  labs(x="Sex", y="Time to complete task (s)")

Hypothesis testing

What is the difference in task completion time between control patients and treated patients? Is this difference statistically significant?

t.test( time ~ group , data=walkingdat)

lm( time ~ group , data=walkingdat) |> broom::tidy()

lm( time ~ group , data=walkingdat) |> summary()

lm( time ~ group , data=walkingdat) |> anova()

Use t-test t.test() and Mann-Whitney tests wilcox.test() to examine the differences between the groups.

t.test( time ~ group , data=walkingdat)
wilcox.test( time ~ group , data=walkingdat)

Consider the assumptions underlying the t-test, and whether they are met in this dataset

The t-test and the linear model assume that the outcome variable is normally distributed within groups, which is not the case here. So the results from these are probably unreliable.

The Mann-Whitney test (Wilcoxon rank-sum test, wilcox.test()) does not make this assumption and is more robust in this case.
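If you want to see the non-normality for yourself, one quick sketch (not part of the original solutions) is to plot the distribution of time within each group:

ggplot(walkingdat) + aes(x=time) + geom_histogram() + facet_wrap(~group)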

Do you believe that treatment affects walking speed?

Try the t-test or linear model using the speed variable that you created instead of the time variable that is in the dataset.
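For example, using the speed column created in the ‘Deriving new variables’ solution above:

t.test( speed ~ group , data=walkingdat)
lm( speed ~ group , data=walkingdat) |> summary()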

Diagnostics

Estimate a linear model (with lm) to estimate/test the effect of treatment on task completion time.

mod1 <- lm(data=walkingdat , time ~ group)

Use summary to get the model coefficients and p-values

Use broom::tidy to get the model coefficients and p-values

mod1 |> summary()
mod1 |> broom::tidy()

Compare the model results to the equivalent t-test
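A linear model with a single two-level predictor is equivalent to a two-sample t-test that assumes equal variances, so one way to compare them (a sketch) is:

t.test( time ~ group , data=walkingdat, var.equal=TRUE)  # pooled-variance t-test; matches the lm above
t.test( time ~ group , data=walkingdat)                  # Welch t-test (the default)
mod1 |> broom::tidy()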

Extend the model to include the effects of age and sex as potential covariates.

mod2 <- lm(data=walkingdat , time ~ group + age + sex)

Use plot() to explore the model diagnostics

## You could plot model 1 or model 2.
plot(mod1)

Use performance::check_model to explore the model diagnostics

performance::check_model(mod1)

Now estimate a new model to compare the task completion speed between groups.

Get the model coefficients and p-values

Check the model diagnostics using plot or performance::check_model

walkingdat$speed <- 1/walkingdat$time
mod3 <- lm(data=walkingdat , speed ~ group + age + sex)
performance::check_model(mod3)
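To get the coefficients and p-values for this model, the same approach as before works:

mod3 |> summary()
mod3 |> broom::tidy()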

Why are the results from this model different? How do you interpret them?

Can you get the confidence intervals for the treatment effect? (Use the help for broom::tidy.)
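For example, broom::tidy() has a conf.int argument; base R confint() also works:

mod3 |> broom::tidy(conf.int=TRUE)
confint(mod3)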

Use the modelsummary package to make a pretty model summary table

modelsummary::modelsummary(mod3)

modelsummary::modelsummary(list("Speed model"=mod3, "First model"=mod1), 
                           estimate="{estimate} [{conf.low}, {conf.high}] {stars}", 
                           coef_rename = c('(Intercept)'="Intercept", 
                                           'grouptreat'="Treatment", 
                                           'age'="Age (per year)", 
                                           'sexM'="Male vs Female"), 
                           statistic=NULL)