# Online course 1: Introduction to R using a protocol for conducting and presenting results of regression-type analyses

**Flyer for this course**

See the online flyer for a detailed description.

**Course content**

In this course, we will provide an introduction to R and, at the same time, explain how to conduct data exploration, apply (simple) linear regression models, communicate the results, and determine the optimal sample size (via power analysis) when planning a new field study or experiment.

We will use a 10-step protocol based on Zuur and Ieno (2016). The protocol takes us from the organization of data (formulating relevant questions, visualizing data collection, data exploration, identifying dependency), through conducting analysis (presenting, fitting and validating the model) and presenting output (numerically and visually), to extending the model via simulation.

This course is aimed at scientists who would like to learn R through a non-traditional, hands-on approach, as well as scientists who have already taken an introductory R course and want to take their skills to the next level. It is also beneficial if you would like to learn data exploration, data visualization, linear regression modelling, and power analysis.

This online course contains various modules representing a total of approximately 10 hours of work. Each module consists of multiple video files with short theory presentations, followed by exercises using real data sets, and video files discussing the solutions. All video files are on-demand and can be watched online, as often as you want, at any time of day, within a 6-month period.

**Module 1**

We will start this module with a theory presentation based on Zuur and Ieno (2016). Using a 10-step protocol, we will explain how to conduct a regression-type analysis and present the results. Note that we will not dive too deep into the statistical theory underlying the models.

We will then do 3 exercises that will teach you how to import data, manipulate data (deleting rows, selecting columns, etc.), plot data using ggplot2, and formulate your questions in preparation for the statistical analysis.

A more detailed outline is given in the two bullet points below.

- Introduction to R, theory presentation (10-step protocol), and executing steps 1 and 2 of the protocol in R. We will discuss the installation of R, RStudio, and add-on packages, importing data into R, and accessing variables.
- We will present various data sets and discuss how to formulate the underlying questions (which will motivate the application of certain statistical techniques). We will use the ggplot2 package in R to visualize spatial-temporal data and explain how to modify and manipulate data sets in R (e.g. removing rows or columns, creating new variables, etc.).
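To give a flavour of these first steps, the sketch below imports and inspects a small data set, does some basic manipulation, and makes a first ggplot2 graph. The data frame and variable names are hypothetical (created inside the script for illustration); a real analysis would start from `read.csv()` or similar.

```r
# Hypothetical data, standing in for an imported file.
# In practice you would start with: df <- read.csv("MyData.csv")
df <- data.frame(Year  = rep(2001:2005, each = 3),
                 Site  = rep(c("A", "B", "C"), times = 5),
                 Count = c(12, 8, 15, 14, 9, 17, 11, 10,
                           20, 16, 7, 22, 18, 6, 25))

str(df)    # check variable types
head(df)   # inspect the first rows

# Basic manipulation: drop a row, select columns, create a new variable.
df2 <- df[-1, c("Year", "Count")]   # remove the first row, keep two columns
df2$LogCount <- log(df2$Count)      # derived variable

# A first ggplot2 graph (ggplot2 is an add-on package; install it once
# with install.packages("ggplot2")).
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(df, aes(x = Year, y = Count, colour = Site)) +
    geom_point() +
    geom_line()
  print(p)
}
```

The same manipulations can also be done with add-on packages such as dplyr; in this course we keep the syntax minimal.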

**Module 2**

In this module, we start with a theory presentation on data exploration (based on Zuur et al. 2010). We will apply data exploration to 3 data sets. We discuss how to recognize outliers and what to do with them. We also explain how to identify the presence of collinearity (correlation between covariates) using multipanel scatterplots, Pearson correlations, variance inflation factors, and principal component analysis biplots. All statistical techniques are explained in layman's terms. We also explain why you should *not* test the response variable for normality (a common misconception).
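To illustrate two of these tools, here is a minimal base-R sketch on simulated covariates: a Cleveland dotplot for spotting outliers, and a hand-computed variance inflation factor for detecting collinearity. All variable names and values are hypothetical.

```r
# Simulated covariates (hypothetical names); X2 is deliberately collinear with X1.
set.seed(1)
n  <- 50
X1 <- rnorm(n)
X2 <- 0.9 * X1 + rnorm(n, sd = 0.3)
X3 <- rnorm(n)

# Outliers: a Cleveland dotplot shows each observation against its row number.
dotchart(X1, main = "Cleveland dotplot of X1")

# Collinearity: multipanel scatterplots and Pearson correlations.
pairs(cbind(X1, X2, X3))
cor(cbind(X1, X2, X3))

# Variance inflation factor of X1: regress it on the other covariates,
# then VIF = 1 / (1 - R^2). Values above roughly 3-5 signal collinearity.
r2  <- summary(lm(X1 ~ X2 + X3))$r.squared
vif <- 1 / (1 - r2)
vif
```

Because X2 was constructed from X1, the VIF comes out far above the usual cut-off, which is exactly the pattern these tools are meant to reveal.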

The bullet points below summarise module 2.

- Conduct data exploration in R and visualize the dependency structure in the data (steps 3 and 4 of the protocol).
- We continue with the visualization of spatial data, time-series data, and spatial-temporal data. Data exploration is applied to various data sets using base R functions such as plot, boxplot, and dotchart. The emphasis, however, is on the ggplot2 package for making multipanel graphs.
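A sketch of such a multipanel graph in ggplot2, using simulated site-by-year counts (all names and values hypothetical), might look like this:

```r
# Hypothetical time series of counts at four sites.
set.seed(2)
df <- expand.grid(Year = 2001:2010, Site = paste0("Site", 1:4))
df$Abundance <- rpois(nrow(df), lambda = 10 + 2 * as.integer(factor(df$Site)))

# One panel per site via facet_wrap (ggplot2 is an add-on package).
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(df, aes(x = Year, y = Abundance)) +
    geom_point() +
    geom_line() +
    facet_wrap(~ Site) +   # multipanel layout: one panel per site
    theme_bw()
  print(p)
}
```

Faceting on a grouping variable is what makes ggplot2 so convenient for spatial-temporal data: one call produces a consistent panel per site, year, or species.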

**Module 3**

The third module consists of three exercises. In the first exercise, we fit a linear regression model with one covariate and apply the entire protocol. We explain how to apply the model, read the output, judge whether the model is adequate using model validation, and visualize the results using ggplot2. In the second exercise, we use a linear regression model with two covariates and apply the same protocol. Although we use a simple linear regression model, the same steps apply to more advanced models (e.g. linear mixed-effects models, GLMMs, or GAM(M)s).

In the third exercise, we use one of the data sets from earlier in the course. We ask the question: 'If we were to repeat the sampling process, how many observations should we take?' The statistical tool to answer this question is called 'power analysis', and it is surprisingly simple. We will use power analysis to determine what happens if we take fewer or more observations. We will also investigate how many observations we need if we want to be able to detect a 20%, 10%, or 5% change in the response variable. A power analysis can save you a lot of money and time! We will program the power analysis from scratch. Power analysis can also be applied to more advanced models such as GLMMs (although that is not part of this course).
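To convey the idea (this is not the course's exact code), a simulation-based power analysis for a regression slope can be programmed from scratch in a few lines: simulate many data sets under an assumed effect size, fit the model to each, and count how often the effect is detected. All numbers below are hypothetical.

```r
# Simulation-based power analysis for a linear regression slope.
# power = proportion of simulated studies in which the slope is
# significant at the chosen alpha level.
set.seed(4)
power_sim <- function(n, effect, sigma = 2, nsim = 1000, alpha = 0.05) {
  pvals <- replicate(nsim, {
    x <- runif(n, 0, 10)                       # covariate values
    y <- 10 + effect * x + rnorm(n, sd = sigma) # response under assumed effect
    summary(lm(y ~ x))$coefficients["x", "Pr(>|t|)"]
  })
  mean(pvals < alpha)
}

power_sim(n = 20, effect = 0.3)   # small study: low power
power_sim(n = 60, effect = 0.3)   # larger study: higher power
```

Running `power_sim()` over a grid of sample sizes and effect sizes (e.g. a 20%, 10%, or 5% change) then tells you how many observations you need before going into the field.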

The bullet points below summarise the third module.

- Two exercises in which we apply steps 5-10 of the protocol.
- We assume that you are familiar with the basics of linear regression (a layman's explanation is provided). We will show how to implement such a model in R, explain how to assess the underlying assumptions, and visualize the model.
- We will also explain what to present in a paper or report.
- One exercise explaining and applying power analysis to determine the optimal sample size.

Prerequisite knowledge: basic statistics (e.g. mean, variance, normality). No R knowledge is required; you will learn R ‘on the fly’. This is a non-technical course.