You should have R and Rstudio installed. We will be working with R within the Rstudio application.
Start a new script by going to File –> New File –> R Script
Become familiar with the R studio layout:
Determine which directory you are in. The results of your code will print to the console:
getwd()
ProTip: Use Ctrl + Enter (Windows) or Cmd + Enter (Mac) to run the current line of code using a keyboard shortcut.
Not the right directory? That’s fine–let’s change it!
To change or set the directory in R use the setwd()
function.
# Set the file path to YOUR desktop
setwd("/Users/YourName/Desktop")
Note: In this character string, you can only use a “/” or “\\” between folder names.
For example, to set the directory to a specific folder:
setwd()
parentheses and add your folder name (e.g.,“\\SpecificFolder”).To keep things tidy, create another folder specifically for today’s workshop. This can be done by creating the object todays.file
where you can change the name whenever you wish.
todays.file = "SEAInCode"
You can also make a string of file names or characters representing other objects using c()
. This will come in handy later in the workshop.
Use dir.create()
to make the new folder then set the working directory to todays.file
dir.create(todays.file)
# If you try to create a folder that already exists R will let you know
setwd(todays.file)
ProTip: R does not like spaces. Use underscores instead.
Check to see if you are in the right place using getwd()
, then use the function dir()
to see what is in the folder.
dir()
## [1] "growth_data.csv" "Labs to pull from"
## [3] "MultiRegDataSimulation.R" "NewScript.png"
## [5] "Rstudio.png" "RstudioAnnotated.png"
## [7] "RstudioAnnotated.pptx" "SEAInCodeRIntro2.Rmd"
ProTip: There will be nothing in your directory if you created a new folder to work in. Messed up? Want to move up one directory? No problem! Run setwd("..")
to move up a folder level.
Now save your script. R will often hold your place but it is good to save periodically. When you close Rstudio it will ask if you want to save your R space. This will save everything in the global environment. I tend not to save it since I probably saved all of the output I needed before deciding to close R.
There are millions of R packages designed to do all kinds of data analyses. Two packages that I have grown fond of for manipulating and visualizing data are dplyr
and ggplot2
. Fortunately, somebody decided to bundle these packages and a few others in what is called the “tidyverse”. We can install this package using install.packages()
.
dplyr: a package for data manipulation
ggplot2: a package for creating graphics
tidyverse: the data analysis super package!
install.packages("tidyverse")
Note: Installing the package only needs to be done once.
Load the package every time a new R session is started using library()
. The messages that show up are OK. You should only be concerned if warnings
or Error
pops up.
library(tidyverse)
## -- Attaching packages ------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2 v purrr 0.3.4
## v tibble 3.0.3 v dplyr 1.0.2
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.5.0
## -- Conflicts ---------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
Note: If you are ever wondering how to use these packages check out the cheat sheets:
… or Google it!
Read in the growth_data from Github using read.csv()
and give it the object name “data” using =
or ->
as the assignment operator:
data = read.csv("https://raw.githubusercontent.com/latreesedenson/SEAinCode/main/growth_data.csv")
data = as.data.frame(data)
Note: You could also copy and paste the data from the web browser into an excel file, change the format to comma delimited, saved it as a csv file and then read it in using read.csv()
. This way you would have a copy of the data on your local drive to read from. You would need to replace the weblink with the path to the data on your local computer.
Make sure the data was read in correctly using the function head()
. And look at parts of the data using the $
operator.
head(data)
## daylength temp_range growth species
## 1 short average 132.9485 A
## 2 average average 131.2542 A
## 3 long average 125.8153 A
## 4 short above 123.1229 A
## 5 average above 106.3336 A
## 6 long above 100.3005 A
data$temp_range
## [1] "average" "average" "average" "above" "above" "above" "below"
## [8] "below" "below" "average" "average" NA "above" "above"
## [15] "above" NA "below" "below" "average" "average" "average"
## [22] "above" NA "above" "below" "below" "below" "average"
## [29] "average" "average" "above" "above" "above" "below" "below"
## [36] "below" "average" "average" "average" "above" "above" "above"
## [43] "below" "below" "below" "average" "average" "average" "above"
## [50] "above" "above" "below" "below" "below" "average" "average"
## [57] "average" "above" "above" "above" "below" "below" "below"
## [64] "average" "average" "average" "above" "above" "above" "below"
## [71] "below" "below" "average" "average" "average" "above" "above"
## [78] "above" "below" "below" "below" "average" "average" "average"
## [85] "above" "above" NA "below" "below" "below" "average"
## [92] "average" "average" "above" "above" NA "below" "below"
## [99] "below" "average" "average" "average" "above" "above" "above"
## [106] "below" "below" "below" "average" "average" "average" "above"
## [113] "above" "above" "below" "below" "below" "average" "average"
## [120] "average" "above" "above" "above" "below" "below" "below"
## [127] "average" "average" "average" "above" "above" "above" "below"
## [134] "below" "below" "average" "average" "average" "above" "above"
## [141] "above" "below" "below" "below" "average" "average" "average"
## [148] "above" "above" "above" "below" "below" "below" "average"
## [155] "average" "average" "above" "above" "above" "below" "below"
## [162] "below" "average" "average" "average" "above" "above" "above"
## [169] "below" "below" "below" "average" "average" "average" "above"
## [176] "above" "above" "below" "below" "below" "average" "average"
## [183] "average" "above" "above" "above" "below" "below" "below"
## [190] "average" "average" "average" "above" "above" "above" "below"
## [197] "below" "below" "average" "average" "average" "above" "above"
## [204] "above" "below" "below" NA "average" "average" "average"
## [211] "above" "above" "above" "below" "below" "below" "average"
## [218] "average" NA "above" "above" "above" "below" "below"
## [225] "below" "average" "average" "average" "above" "above" "above"
## [232] "below" "below" "below" "average" "average" "average" "above"
## [239] "above" "above" "below" "below" "below" "average" "average"
## [246] "average" "above" "above" "above" "below" "below" "below"
## [253] "average" "average" "average" "above" "above" "above" "below"
## [260] "below" "below" "average" "average" "average" "above" "above"
## [267] "above" "below" "below" "below"
ProTip: Not sure what a function does? Type ?head()
into the console and see what pops up in the help menu. Scroll down to the bottom–there are often examples of how to use the function that you can practice with.
What are the column names? Is everything there that should be there?
names(data)
## [1] "daylength" "temp_range" "growth" "species"
What are the dimensions of the dataset, how many rows and columns?
dim(data)
## [1] 270 4
Let’s get some basic summary statistics from our data: minimums, maximums and means.
summary(data)
## daylength temp_range growth species
## Length:270 Length:270 Min. : 64.41 Length:270
## Class :character Class :character 1st Qu.: 94.48 Class :character
## Mode :character Mode :character Median :105.66 Mode :character
## Mean :102.97
## 3rd Qu.:111.64
## Max. :143.50
You will notice that there were multiple species here and some NAs in the data. Let’s work with the first species and remove the bad records. We will use the pipe operator %>%
from the magrittr package within the tidyverse to do this. This syntax leads to code that is easier to write and to read. Use the keyboard shortcut: Ctrl + Shift + M (Windows) or Cmd + Shift + M (Mac) to get the pipe operator.
Species_A_Clean = data %>% filter(species =="A",!is.na(temp_range))
head(Species_A_Clean)
## daylength temp_range growth species
## 1 short average 132.9485 A
## 2 average average 131.2542 A
## 3 long average 125.8153 A
## 4 short above 123.1229 A
## 5 average above 106.3336 A
## 6 long above 100.3005 A
dim(Species_A_Clean)
## [1] 86 4
summary(Species_A_Clean)
## daylength temp_range growth species
## Length:86 Length:86 Min. : 99.95 Length:86
## Class :character Class :character 1st Qu.:106.27 Class :character
## Mode :character Mode :character Median :111.17 Mode :character
## Mean :116.64
## 3rd Qu.:130.04
## Max. :143.50
Notice that there are no more NAs in the temp_range summary.
Species_A_Clean$temp_range
## [1] "average" "average" "average" "above" "above" "above" "below"
## [8] "below" "below" "average" "average" "above" "above" "above"
## [15] "below" "below" "average" "average" "average" "above" "above"
## [22] "below" "below" "below" "average" "average" "average" "above"
## [29] "above" "above" "below" "below" "below" "average" "average"
## [36] "average" "above" "above" "above" "below" "below" "below"
## [43] "average" "average" "average" "above" "above" "above" "below"
## [50] "below" "below" "average" "average" "average" "above" "above"
## [57] "above" "below" "below" "below" "average" "average" "average"
## [64] "above" "above" "above" "below" "below" "below" "average"
## [71] "average" "average" "above" "above" "above" "below" "below"
## [78] "below" "average" "average" "average" "above" "above" "below"
## [85] "below" "below"
Let’s do a boxplot of our response vs. our explanatory variables. Here we are using the ggplot2 package, but I also provide code for using Base R.
ggplot can get a bit complex depending on what you want to do but the basic components are typically the same.
You will notice that I have commented out some lines of code using ‘#’. This allows me to block out lines that I don’t want to run, as well as create comments for a better explanation of my code.
# ggplot general inputs/arguments (data, aesthetics - x and y) + plot type + etc.
ggplot(Species_A_Clean, aes(x=as.factor(daylength), y=growth)) +
geom_boxplot(fill="slateblue", alpha=0.2) + xlab("day length")
# To save to a file for viewing later we use the function ggsave()
# The .tiff extension is the type of file, you can also use .jpg
# .tiff files are good for publication graphics
# ggsave("boxplot_daylength.tiff")
# Change the x variable to graph a different explanatory variable
ggplot(Species_A_Clean, aes(x=as.factor(temp_range), y=growth)) +
geom_boxplot(fill="slateblue", alpha=0.2) + xlab("temp")
# ggsave("boxplot_temp.tiff")
# Here is the code to create a boxplot and save it in Base R
# Remove the # to run these 3 lines
# tiff("boxplot.tiff")
# boxplot(growth~temp_range,data=Species_A_Clean,xlab="temp_range", ylab="growth(wt")
# dev.off()
Notice the way we save graphics is different depending on the package we are using. Also dev.off()
closes all graphics devices so you can start a fresh graph next time.
Create a histogram of the data using the geom_histogram()
function. Change the binwidth
to see what happens.
ggplot(Species_A_Clean, aes(x = growth)) +
geom_histogram(binwidth = 1)
# You can remove the '#' in the line below if you want to save this
# ggsave("DataDistribution.tiff")
# Base R
# hist(Species_A_Clean$growth, breaks=1, col="darkgrey", main = "", xlab="growth")
Let’s fit a basic linear model to the data, to determine the impact of day length and temperature on species growth.
We use the lm()
function which needs model and the data as it’s input. The model is setup using a \(\sim\), where \(y\sim x\) means that Y is a function of a series of Xs.
lm(y ~ x1 + x2 + x3, Data = Data)
Let’s run a model with our species data. We tell R that we have factors with multiple levels by using the function factor()
. If you had a continuous variable you wouldn’t need this added piece.
SpeciesA.lm = lm(growth ~ factor(daylength)*factor(temp_range), data = Species_A_Clean)
To get a summary of the linear model output we use the function summary()
.
sum.mod = summary(SpeciesA.lm)
# since sum.mod is an object
# we will need to run the next line to see what is in it
sum.mod
##
## Call:
## lm(formula = growth ~ factor(daylength) * factor(temp_range),
## data = Species_A_Clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.7540 -3.2367 -0.5397 2.6419 12.9402
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 109.0499 1.4656 74.405
## factor(daylength)long -2.6890 2.0727 -1.297
## factor(daylength)short 1.1328 2.0202 0.561
## factor(temp_range)average 26.8020 2.0202 13.267
## factor(temp_range)below 0.1362 2.0202 0.067
## factor(daylength)long:factor(temp_range)average -3.8093 2.8944 -1.316
## factor(daylength)short:factor(temp_range)average -3.4718 2.8192 -1.232
## factor(daylength)long:factor(temp_range)below -0.5685 2.8570 -0.199
## factor(daylength)short:factor(temp_range)below -1.2432 2.8570 -0.435
## Pr(>|t|)
## (Intercept) <2e-16 ***
## factor(daylength)long 0.198
## factor(daylength)short 0.577
## factor(temp_range)average <2e-16 ***
## factor(temp_range)below 0.946
## factor(daylength)long:factor(temp_range)average 0.192
## factor(daylength)short:factor(temp_range)average 0.222
## factor(daylength)long:factor(temp_range)below 0.843
## factor(daylength)short:factor(temp_range)below 0.665
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.397 on 77 degrees of freedom
## Multiple R-squared: 0.8904, Adjusted R-squared: 0.879
## F-statistic: 78.22 on 8 and 77 DF, p-value: < 2.2e-16
####
# Additional functions you may find useful in the future
####
# coefficients(fit) # model coefficients
# confint(fit, level=0.95) # CIs for model parameters
# fitted(fit) # predicted values
# residuals(fit) # residuals
# anova(fit) # anova table
####
Here you have your important outputs to make a conclusion with (pvalues, coefficients, etc.).
Let’s save some of that information to an excel file. Note: you cannot save the entire output to a csv using the lines below. You will need additional steps and functions.
# save results to an object
results = sum.mod$coefficients
# make the object into a dataframe so it is delimited
# try it without as.data.frame to see what I mean
out = as.data.frame(results)
# Write to a csv file
write.csv(out,"ModelCoefficients.csv")
# You can also write to a text file using write.table() and .txt file name instead of csv
Let’s plot the diagnostics of the model fit to the data.plot(linearmodel)
produces 4 main plots; each used to diagnose the fit of the model to the data.
# Remember you can save this graphic by uncommenting the line below
# be sure to use dev.off after to reset things
# tiff("ModelDiagnostics.tiff")
# The graphing format: 2 rows 2 columns
par(mfrow=c(2,2))
# type dev.off and run the next line without this function to see what happens
plot(SpeciesA.lm)
# close the plotting window and resets things
dev.off()
## null device
## 1
This is the part that is similar to what was in the introduction slides.
Let’s say you now need to do this for two other species, nobody wants to write all that code out again!
Now we will build our own function and if there is time we will loop through and analyze the data for all of the species, with the click of a button.
Just like plot()
, header()
, summary()
, dim()
, etc. are all functions, we can make our own functions too. Why do we make functions?
1.There are no available functions to do what you want to do.
2.Work smarter – not harder! You have a task that needs to be repeated, and you do not want to keep writing the same lines of code over and over again.
Functions must have a name and arguments, and they must return something. Arguments are the inputs you plan to use in your function. We often use the function return()
to have our functions “print” to the console or pass something on to another function. Example layout:
func.name = function(arg1,arg2)
{ # bracket to start function
# code to do something, here
return (something)
} # bracket to end function
Let’s do the following in a function: 1. Select the species of interest
Produce and save plots exploring the response vs the explanatory variables
Fit the linear model and export a table
Plot and save the diagnostics
do.analysis = function(species.name = "B", input.data = data)
{
# Use piping to select the data related to the species of choice in the first argument
species.data = input.data %>% filter(species == species.name)
# Create boxplots and save them for publication
ggplot(species.data, aes(x=as.factor(daylength), y=growth)) +
geom_boxplot(fill="slateblue", alpha=0.2) + xlab("daylength")
ggsave("boxplot_daylength.tiff")
ggplot(species.data, aes(x=as.factor(temp_range), y=growth))+
geom_boxplot(fill="slateblue", alpha=0.2) + xlab("temp")
ggsave("boxplot_temp.tiff")
# Fit the model and save some of the results
final.mod = lm(growth~daylength+temp_range,data = species.data)
# Notice we did this earlier but with more lines
out = as.data.frame(final.mod$coefficients)
write.csv(out,"ModelCoefficients.csv")
# Plot and save the diagnostics
tiff("modeldiagnostics.tiff")
par(mfrow=c(2,2))
plot(final.mod)
dev.off()
# use cat() to print a message when you are done with your analyses
cat(paste("\nmodel outputs for species",species.name,"are complete!\n"))
}
# Try out your function
do.analysis("A",data)
## Saving 7 x 5 in image
## Saving 7 x 5 in image
##
## model outputs for species A are complete!
There are two types of loops you will see most often in someones R code. For loops and While Loops. You use loops when you want to do something iteratively. For loops have the following structure:
for (i in 0:5) # a variable and range of values it will take
{ # a bracket to start
#some calculation or function to do something iteratively
} # a bracket to end the for loop
Here “i” sequentially becomes a number from 0 to 5 and is used in each iteration of the loop.
Let’s try a simple example where we want to add i to 5. Our output should consist of a vector of values from 5 to 10. We’ll use the cat()
function so we can see what the loop is doing.
for (i in 0:5)
{
x = i + 5 # Do something to i, iteratively
cat(x) # see what happens if you take this out
# I like to make detailed statements
# about what I am doing inside of a loop ...
# uncomment the next line and run the loop again to see what it does
# cat(paste("i= ", i, "and i + 5 =", x,"\n"))
} # end for loop
## 5678910
Now for the ultimate test
Use what we have learned so far to perform the created function on each species by using a loop.
species.list = c("A","B","C") # vector of species names to loop through
for (i.species in 1:length(species.list))
{
# I like to have a test object to see if my loop is working
# I comment it out before I run the full loop
# It should change the current.species object
# i.species = 1 # test
# It is good practice to set an object name once
# and use it in multiple places
current.species = species.list[i.species]
# create a new folder to work in
# or else things will overwrite because they have the same name
dir.create(current.species)
# work in the newly created directory
setwd(current.species)
# use the function we created previously
do.analysis(current.species,data)
# move back to the main folder to restart the loop
setwd("..")
} # End Species Loop
## Saving 7 x 5 in image
## Saving 7 x 5 in image
##
## model outputs for species A are complete!
## Saving 7 x 5 in image
## Saving 7 x 5 in image
##
## model outputs for species B are complete!
## Saving 7 x 5 in image
## Saving 7 x 5 in image
##
## model outputs for species C are complete!