Workshop Goals:

  • Determine the influence of two environmental factors on species growth.
  • Provide graphs and tables to summarize findings.
  • Create a function to automate your analyses.

Prerequisites:

You should have R and Rstudio installed. We will be working with R within the Rstudio application.

R script Setup

Start a new script by going to File –> New File –> R Script NewScript

Become familiar with the R studio layout: Rstudio Layout

Determine which directory you are in. The results of your code will print to the console:

getwd()

ProTip: Use Ctrl + Enter (Windows) or Cmd + Enter (Mac) to run the current line of code using a keyboard shortcut.

Not the right directory? That’s fine–let’s change it!

To change or set the directory in R use the setwd() function.

# Set the file path to YOUR desktop
setwd("/Users/YourName/Desktop")

Note: In this character string, you can only use a “/” or “\\” between folder names.

For example, to set the directory to a specific folder:

  1. From your desktop or start menu, navigate to the folder you wish to save in, and open it.
  2. Right-click in any empty space inside the folder, select properties.
  3. Select and copy the location (e.g., “C:\\Users\\Me\\Desktop”).
  4. Paste the location name inside the setwd() parentheses and add your folder name (e.g.,“\\SpecificFolder”).

To keep things tidy, create another folder specifically for today’s workshop. This can be done by creating the object todays.file where you can change the name whenever you wish.

todays.file = "SEAInCode"

You can also make a string of file names or characters representing other objects using c(). This will come in handy later in the workshop.

Use dir.create() to make the new folder then set the working directory to todays.file

dir.create(todays.file)
# If you try to create a folder that already exists R will let you know

setwd(todays.file)

ProTip: R does not like spaces. Use underscores instead.

Check to see if you are in the right place using getwd(), then use the function dir() to see what is in the folder.

dir()
## [1] "growth_data.csv"          "Labs to pull from"       
## [3] "MultiRegDataSimulation.R" "NewScript.png"           
## [5] "Rstudio.png"              "RstudioAnnotated.png"    
## [7] "RstudioAnnotated.pptx"    "SEAInCodeRIntro2.Rmd"

ProTip: There will be nothing in your directory if you created a new folder to work in. Messed up? Want to move up one directory? No problem! Run setwd("..") to move up a folder level.

Now save your script. R will often hold your place but it is good to save periodically. When you close Rstudio it will ask if you want to save your R space. This will save everything in the global environment. I tend not to save it since I probably saved all of the output I needed before deciding to close R.

Install and Load Packages

There are millions of R packages designed to do all kinds of data analyses. Two packages that I have grown fond of for manipulating and visualizing data are dplyr and ggplot2. Fortunately, somebody decided to bundle these packages and a few others in what is called the “tidyverse”. We can install this package using install.packages().

dplyr: a package for data manipulation

ggplot2: a package for creating graphics

tidyverse: the data analysis super package!

install.packages("tidyverse")

Note: Installing the package only needs to be done once.

Load the package every time a new R session is started using library(). The messages that show up are OK. You should only be concerned if warnings or Error pops up.

library(tidyverse)
## -- Attaching packages ------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.0.3     v dplyr   1.0.2
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0
## -- Conflicts ---------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Note: If you are ever wondering how to use these packages check out the cheat sheets:

ggplot2 cheat sheet

dplyr cheat sheet

… or Google it!

Data Exploration

Load Data

Read in the growth_data from Github using read.csv() and give it the object name “data” using = or -> as the assignment operator:

data = read.csv("https://raw.githubusercontent.com/latreesedenson/SEAinCode/main/growth_data.csv")

data = as.data.frame(data)

Note: You could also copy and paste the data from the web browser into an excel file, change the format to comma delimited, saved it as a csv file and then read it in using read.csv(). This way you would have a copy of the data on your local drive to read from. You would need to replace the weblink with the path to the data on your local computer.

Explore Data

Make sure the data was read in correctly using the function head(). And look at parts of the data using the $ operator.

head(data)
##   daylength temp_range   growth species
## 1     short    average 132.9485       A
## 2   average    average 131.2542       A
## 3      long    average 125.8153       A
## 4     short      above 123.1229       A
## 5   average      above 106.3336       A
## 6      long      above 100.3005       A
data$temp_range
##   [1] "average" "average" "average" "above"   "above"   "above"   "below"  
##   [8] "below"   "below"   "average" "average" NA        "above"   "above"  
##  [15] "above"   NA        "below"   "below"   "average" "average" "average"
##  [22] "above"   NA        "above"   "below"   "below"   "below"   "average"
##  [29] "average" "average" "above"   "above"   "above"   "below"   "below"  
##  [36] "below"   "average" "average" "average" "above"   "above"   "above"  
##  [43] "below"   "below"   "below"   "average" "average" "average" "above"  
##  [50] "above"   "above"   "below"   "below"   "below"   "average" "average"
##  [57] "average" "above"   "above"   "above"   "below"   "below"   "below"  
##  [64] "average" "average" "average" "above"   "above"   "above"   "below"  
##  [71] "below"   "below"   "average" "average" "average" "above"   "above"  
##  [78] "above"   "below"   "below"   "below"   "average" "average" "average"
##  [85] "above"   "above"   NA        "below"   "below"   "below"   "average"
##  [92] "average" "average" "above"   "above"   NA        "below"   "below"  
##  [99] "below"   "average" "average" "average" "above"   "above"   "above"  
## [106] "below"   "below"   "below"   "average" "average" "average" "above"  
## [113] "above"   "above"   "below"   "below"   "below"   "average" "average"
## [120] "average" "above"   "above"   "above"   "below"   "below"   "below"  
## [127] "average" "average" "average" "above"   "above"   "above"   "below"  
## [134] "below"   "below"   "average" "average" "average" "above"   "above"  
## [141] "above"   "below"   "below"   "below"   "average" "average" "average"
## [148] "above"   "above"   "above"   "below"   "below"   "below"   "average"
## [155] "average" "average" "above"   "above"   "above"   "below"   "below"  
## [162] "below"   "average" "average" "average" "above"   "above"   "above"  
## [169] "below"   "below"   "below"   "average" "average" "average" "above"  
## [176] "above"   "above"   "below"   "below"   "below"   "average" "average"
## [183] "average" "above"   "above"   "above"   "below"   "below"   "below"  
## [190] "average" "average" "average" "above"   "above"   "above"   "below"  
## [197] "below"   "below"   "average" "average" "average" "above"   "above"  
## [204] "above"   "below"   "below"   NA        "average" "average" "average"
## [211] "above"   "above"   "above"   "below"   "below"   "below"   "average"
## [218] "average" NA        "above"   "above"   "above"   "below"   "below"  
## [225] "below"   "average" "average" "average" "above"   "above"   "above"  
## [232] "below"   "below"   "below"   "average" "average" "average" "above"  
## [239] "above"   "above"   "below"   "below"   "below"   "average" "average"
## [246] "average" "above"   "above"   "above"   "below"   "below"   "below"  
## [253] "average" "average" "average" "above"   "above"   "above"   "below"  
## [260] "below"   "below"   "average" "average" "average" "above"   "above"  
## [267] "above"   "below"   "below"   "below"

ProTip: Not sure what a function does? Type ?head() into the console and see what pops up in the help menu. Scroll down to the bottom–there are often examples of how to use the function that you can practice with.

What are the column names? Is everything there that should be there?

names(data)
## [1] "daylength"  "temp_range" "growth"     "species"

What are the dimensions of the dataset, how many rows and columns?

dim(data)
## [1] 270   4

Let’s get some basic summary statistics from our data: minimums, maximums and means.

summary(data)
##   daylength          temp_range            growth         species         
##  Length:270         Length:270         Min.   : 64.41   Length:270        
##  Class :character   Class :character   1st Qu.: 94.48   Class :character  
##  Mode  :character   Mode  :character   Median :105.66   Mode  :character  
##                                        Mean   :102.97                     
##                                        3rd Qu.:111.64                     
##                                        Max.   :143.50

You will notice that there were multiple species here and some NAs in the data. Let’s work with the first species and remove the bad records. We will use the pipe operator %>% from the magrittr package within the tidyverse to do this. This syntax leads to code that is easier to write and to read. Use the keyboard shortcut: Ctrl + Shift + M (Windows) or Cmd + Shift + M (Mac) to get the pipe operator.

Species_A_Clean = data %>% filter(species =="A",!is.na(temp_range))

head(Species_A_Clean)
##   daylength temp_range   growth species
## 1     short    average 132.9485       A
## 2   average    average 131.2542       A
## 3      long    average 125.8153       A
## 4     short      above 123.1229       A
## 5   average      above 106.3336       A
## 6      long      above 100.3005       A
dim(Species_A_Clean)
## [1] 86  4
summary(Species_A_Clean)
##   daylength          temp_range            growth         species         
##  Length:86          Length:86          Min.   : 99.95   Length:86         
##  Class :character   Class :character   1st Qu.:106.27   Class :character  
##  Mode  :character   Mode  :character   Median :111.17   Mode  :character  
##                                        Mean   :116.64                     
##                                        3rd Qu.:130.04                     
##                                        Max.   :143.50

Notice that there are no more NAs in the temp_range summary.

Species_A_Clean$temp_range
##  [1] "average" "average" "average" "above"   "above"   "above"   "below"  
##  [8] "below"   "below"   "average" "average" "above"   "above"   "above"  
## [15] "below"   "below"   "average" "average" "average" "above"   "above"  
## [22] "below"   "below"   "below"   "average" "average" "average" "above"  
## [29] "above"   "above"   "below"   "below"   "below"   "average" "average"
## [36] "average" "above"   "above"   "above"   "below"   "below"   "below"  
## [43] "average" "average" "average" "above"   "above"   "above"   "below"  
## [50] "below"   "below"   "average" "average" "average" "above"   "above"  
## [57] "above"   "below"   "below"   "below"   "average" "average" "average"
## [64] "above"   "above"   "above"   "below"   "below"   "below"   "average"
## [71] "average" "average" "above"   "above"   "above"   "below"   "below"  
## [78] "below"   "average" "average" "average" "above"   "above"   "below"  
## [85] "below"   "below"

Visualize

Let’s do a boxplot of our response vs. our explanatory variables. Here we are using the ggplot2 package, but I also provide code for using Base R.

ggplot can get a bit complex depending on what you want to do but the basic components are typically the same.

You will notice that I have commented out some lines of code using ‘#’. This allows me to block out lines that I don’t want to run, as well as create comments for a better explanation of my code.

# ggplot general inputs/arguments (data, aesthetics - x and y) + plot type + etc.

ggplot(Species_A_Clean, aes(x=as.factor(daylength), y=growth)) + 
  geom_boxplot(fill="slateblue", alpha=0.2) +  xlab("day length")

# To save to a file for viewing later we use the function ggsave()
# The .tiff extension is the type of file, you can also use .jpg
# .tiff files are good for publication graphics
# ggsave("boxplot_daylength.tiff")


# Change the x variable to graph a different explanatory variable
ggplot(Species_A_Clean, aes(x=as.factor(temp_range), y=growth)) + 
  geom_boxplot(fill="slateblue", alpha=0.2) +  xlab("temp")

# ggsave("boxplot_temp.tiff")

# Here is the code to create a boxplot and save it in Base R
# Remove the # to run these 3 lines
# tiff("boxplot.tiff")
# boxplot(growth~temp_range,data=Species_A_Clean,xlab="temp_range", ylab="growth(wt")
# dev.off() 

Notice the way we save graphics is different depending on the package we are using. Also dev.off() closes all graphics devices so you can start a fresh graph next time.

Create a histogram of the data using the geom_histogram() function. Change the binwidth to see what happens.

ggplot(Species_A_Clean, aes(x = growth)) +
  geom_histogram(binwidth = 1)

# You can remove the '#' in the line below if you want to save this
# ggsave("DataDistribution.tiff") 

# Base R
# hist(Species_A_Clean$growth, breaks=1, col="darkgrey", main = "", xlab="growth") 

Analyze

Let’s fit a basic linear model to the data, to determine the impact of day length and temperature on species growth.

We use the lm() function which needs model and the data as it’s input. The model is setup using a \(\sim\), where \(y\sim x\) means that Y is a function of a series of Xs.

lm(y ~ x1 + x2 + x3, Data = Data)

Let’s run a model with our species data. We tell R that we have factors with multiple levels by using the function factor(). If you had a continuous variable you wouldn’t need this added piece.

SpeciesA.lm = lm(growth ~ factor(daylength)*factor(temp_range), data = Species_A_Clean)

To get a summary of the linear model output we use the function summary().

sum.mod = summary(SpeciesA.lm) 
# since sum.mod is an object 
# we will need to run the next line to see what is in it

sum.mod
## 
## Call:
## lm(formula = growth ~ factor(daylength) * factor(temp_range), 
##     data = Species_A_Clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.7540 -3.2367 -0.5397  2.6419 12.9402 
## 
## Coefficients:
##                                                  Estimate Std. Error t value
## (Intercept)                                      109.0499     1.4656  74.405
## factor(daylength)long                             -2.6890     2.0727  -1.297
## factor(daylength)short                             1.1328     2.0202   0.561
## factor(temp_range)average                         26.8020     2.0202  13.267
## factor(temp_range)below                            0.1362     2.0202   0.067
## factor(daylength)long:factor(temp_range)average   -3.8093     2.8944  -1.316
## factor(daylength)short:factor(temp_range)average  -3.4718     2.8192  -1.232
## factor(daylength)long:factor(temp_range)below     -0.5685     2.8570  -0.199
## factor(daylength)short:factor(temp_range)below    -1.2432     2.8570  -0.435
##                                                  Pr(>|t|)    
## (Intercept)                                        <2e-16 ***
## factor(daylength)long                               0.198    
## factor(daylength)short                              0.577    
## factor(temp_range)average                          <2e-16 ***
## factor(temp_range)below                             0.946    
## factor(daylength)long:factor(temp_range)average     0.192    
## factor(daylength)short:factor(temp_range)average    0.222    
## factor(daylength)long:factor(temp_range)below       0.843    
## factor(daylength)short:factor(temp_range)below      0.665    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.397 on 77 degrees of freedom
## Multiple R-squared:  0.8904, Adjusted R-squared:  0.879 
## F-statistic: 78.22 on 8 and 77 DF,  p-value: < 2.2e-16
####
# Additional functions you may find useful in the future
####
# coefficients(fit) # model coefficients
# confint(fit, level=0.95) # CIs for model parameters
# fitted(fit) # predicted values
# residuals(fit) # residuals
# anova(fit) # anova table 
####

Here you have your important outputs to make a conclusion with (pvalues, coefficients, etc.).

Let’s save some of that information to an excel file. Note: you cannot save the entire output to a csv using the lines below. You will need additional steps and functions.

# save results to an object
results = sum.mod$coefficients

# make the object into a dataframe so it is delimited
# try it without as.data.frame to see what I mean
out = as.data.frame(results)

# Write to a csv file 
write.csv(out,"ModelCoefficients.csv")  

# You can also write to a text file using write.table() and .txt file name instead of csv

Let’s plot the diagnostics of the model fit to the data.plot(linearmodel) produces 4 main plots; each used to diagnose the fit of the model to the data.

# Remember you can save this graphic by uncommenting the line below
# be sure to use dev.off after to reset things

# tiff("ModelDiagnostics.tiff")
# The graphing format: 2 rows 2 columns
par(mfrow=c(2,2)) 

# type dev.off and run the next line without this function to see what happens
plot(SpeciesA.lm)

# close the plotting window and resets things
dev.off()  
## null device 
##           1

Automation

This is the part that is similar to what was in the introduction slides.

Let’s say you now need to do this for two other species, nobody wants to write all that code out again!

Now we will build our own function and if there is time we will loop through and analyze the data for all of the species, with the click of a button.

Functions

Just like plot(), header(), summary(), dim(), etc. are all functions, we can make our own functions too. Why do we make functions?

1.There are no available functions to do what you want to do.

2.Work smarter – not harder! You have a task that needs to be repeated, and you do not want to keep writing the same lines of code over and over again.

Functions must have a name and arguments, and they must return something. Arguments are the inputs you plan to use in your function. We often use the function return() to have our functions “print” to the console or pass something on to another function. Example layout:

func.name = function(arg1,arg2) 
  { # bracket to start function
  
  # code to do something, here
  
  return (something)

  } # bracket to end function

Let’s do the following in a function: 1. Select the species of interest

  1. Produce and save plots exploring the response vs the explanatory variables

  2. Fit the linear model and export a table

  3. Plot and save the diagnostics

do.analysis = function(species.name = "B", input.data = data)
{
  # Use piping to select the data related to the species of choice in the first argument
  species.data = input.data %>% filter(species == species.name)
  
# Create boxplots and save them for publication
  ggplot(species.data, aes(x=as.factor(daylength), y=growth)) + 
    geom_boxplot(fill="slateblue", alpha=0.2) +  xlab("daylength")
  ggsave("boxplot_daylength.tiff")
  
  ggplot(species.data, aes(x=as.factor(temp_range), y=growth))+
  geom_boxplot(fill="slateblue", alpha=0.2) +  xlab("temp")
  ggsave("boxplot_temp.tiff")
  
  # Fit the model and save some of the results
  final.mod = lm(growth~daylength+temp_range,data = species.data)
  
  # Notice we did this earlier but with more lines
  out = as.data.frame(final.mod$coefficients)
  write.csv(out,"ModelCoefficients.csv")
  
  # Plot and save the diagnostics
  tiff("modeldiagnostics.tiff")
  par(mfrow=c(2,2))
  plot(final.mod)
  dev.off()
  
  # use cat() to print a message when you are done with your analyses
  cat(paste("\nmodel outputs for species",species.name,"are complete!\n"))
}

# Try out your function
do.analysis("A",data)
## Saving 7 x 5 in image
## Saving 7 x 5 in image
## 
## model outputs for species A are complete!

For Loops

There are two types of loops you will see most often in someones R code. For loops and While Loops. You use loops when you want to do something iteratively. For loops have the following structure:

for (i in 0:5) # a variable and range of values it will take
  { # a bracket to start
   
  #some calculation or function to do something iteratively
  
  } # a bracket to end the for loop

Here “i” sequentially becomes a number from 0 to 5 and is used in each iteration of the loop.

Let’s try a simple example where we want to add i to 5. Our output should consist of a vector of values from 5 to 10. We’ll use the cat() function so we can see what the loop is doing.

for (i in 0:5) 
  {
  x = i + 5 # Do something to i, iteratively
  
  cat(x) # see what happens if you take this out
  
  # I like to make detailed statements 
  # about what I am doing inside of a loop ... 
  # uncomment the next line and run the loop again to see what it does
  
  # cat(paste("i= ", i, "and i + 5 =", x,"\n")) 
  
  } # end for loop
## 5678910

Now for the ultimate test

Use what we have learned so far to perform the created function on each species by using a loop.

species.list = c("A","B","C") # vector of species names to loop through

for (i.species in 1:length(species.list))
{
  
  # I like to have a test object to see if my loop is working
  # I comment it out before I run the full loop
  # It should change the current.species object
  
  # i.species = 1 # test
  
  # It is good practice to set an object name once 
  # and use it in multiple places
  current.species = species.list[i.species] 
  
  # create a new folder to work in 
  # or else things will overwrite because they have the same name
  dir.create(current.species) 
  
  # work in the newly created directory 
  setwd(current.species) 
  
  # use the function we created previously
  do.analysis(current.species,data) 
  
  # move back to the main folder to restart the loop
  setwd("..") 
  
} # End Species Loop
## Saving 7 x 5 in image
## Saving 7 x 5 in image
## 
## model outputs for species A are complete!
## Saving 7 x 5 in image
## Saving 7 x 5 in image
## 
## model outputs for species B are complete!
## Saving 7 x 5 in image
## Saving 7 x 5 in image
## 
## model outputs for species C are complete!

Final ProTips

  1. Comment your code! ( Use the hashtag - # )
  2. R is case-sensitive. For example, “head()” is not the same as “Head().”
  3. Be smart when naming files, folders, etc.. You will be looking for them later. Adding the date or sequential numbers to a folder name is helpful.
  4. Debugging is the hardest part. Copy the error you get and google it. The Stack Overflow community will be your best friend.
  5. There are a million ways to skin a cat! What I showed you here is only one way of using R to do the things we want to do. Find what works for you.