# Beginner’s Guide to Data Exploration and Visualisation

A Beginner’s Guide to Data Exploration and Visualisation with R (2015). Ieno EN, Zuur AF.

In 2010 we published a paper in the journal Methods in Ecology and Evolution entitled “A protocol for data exploration to avoid common statistical problems.” Little did we know at the time that this paper would become one of the journal’s all-time top downloaded and top cited papers, with a total of 22,472 downloads between 2010 and 2014. Based on this success we decided to extend the material in the paper into a book.

We give about 25 five-day statistics courses annually. Our typical audience consists of biological scientists at the post-graduate and post-doctoral levels. Early on in each course we have the following conversation with the participants:

• Us: “Do you review submitted manuscripts for journals?”
• Them: “Yes.”
• Us: “How much time do you spend on this?”
• Them: “About 2 hours maximum.”
• Us: “Do you like the statistical part of these manuscripts?”
• Them: “No!”
• Us: “Do you understand the statistical part?”
• Them: “Not always.”

This means that you spend a huge amount of time collecting data, analysing it, writing a manuscript, formatting it for a specific journal, and submitting it. Then someone else spends 2 hours assessing it. That is a lot of your time and money versus 2 hours of someone else’s review time.

What about making life much easier for reviewers, and consequently increasing the likelihood that your work will be published?

No one likes tables with numbers, especially not if they contain interactions. Corrections of interactions and slopes will confuse even the experienced statistician! A graph summarising the results of your statistical model makes it so much easier for a referee to understand what you have done and what the results tell you! This is the Visualisation part of the book title.

The Data Exploration part of the title is where you try to understand your own data (outliers, collinearity, types of relationships) and whether the quality of the data is good enough to answer your biological questions. There is an example in the book where the length of shells eaten by oystercatchers is modelled as a function of season, location, and feeding type. The three-way interaction is highly significant, but a simple graph of the results shows that in one specific combination of covariates (location A, stabbers, December) there are only two shell length observations, and these have the same value and are relatively large. No wonder that the three-way interaction is significant. Perhaps we should conclude that the quality of the data is not good enough for such a model!
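A check of this kind is easy to automate before fitting any model. The snippet below is a minimal sketch in Python (the book itself uses R) that counts the observations in each combination of covariates and flags the sparse ones. The records, the function names, and the threshold `min_n` are all made up for illustration; they are not the oystercatcher data from the book.

```python
from collections import Counter

# Hypothetical records: (season, location, feeding_type, shell_length).
# All values are invented for illustration; this is not the book's data set.
records = [
    ("December", "A", "stabber", 2.4),
    ("December", "A", "stabber", 2.4),
    ("December", "A", "hammerer", 1.9),
    ("December", "A", "hammerer", 2.1),
    ("December", "A", "hammerer", 1.7),
    ("July", "A", "stabber", 1.8),
    ("July", "A", "stabber", 2.0),
    ("July", "A", "stabber", 1.6),
    ("July", "B", "hammerer", 1.5),
    ("July", "B", "hammerer", 1.9),
    ("July", "B", "hammerer", 2.2),
    ("July", "B", "hammerer", 1.4),
]


def cell_counts(rows):
    """Count observations in each (season, location, feeding type) cell."""
    return Counter(
        (season, location, feeding) for season, location, feeding, _ in rows
    )


def sparse_cells(rows, min_n=5):
    """Return (cell, count) pairs with fewer than min_n observations, sorted."""
    counts = cell_counts(rows)
    return sorted((cell, n) for cell, n in counts.items() if n < min_n)


# Flag every observed combination with fewer than three observations.
for cell, n in sparse_cells(records, min_n=3):
    print(cell, n)
```

Note that this only reports combinations that occur at least once; combinations with no observations at all would have to be listed separately.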

The book uses ecological datasets to discuss data exploration and visualisation tools. The authors also explain how to visualise the results of statistical models, an important aspect in publishing scientific papers. The book includes the R code needed to construct, visualise, and explore the main features of the data step by step. We wrote this book in such a way that the statistical knowledge level is as low as possible. A knowledge of linear regression is all that you need.

A Beginner’s Guide to Data Exploration and Visualisation with R is the fourth book in Highland Statistics’ Beginner’s Guide to series. Previous books include A Beginner’s Guide to Generalized Additive Models with R, A Beginner’s Guide to GLM and GLMM with R, and A Beginner’s Guide to GAMM with R. Books can only be ordered from www.highstat.com.

# Three statistics courses at the University of Southampton, UK

Highland Statistics Ltd. will provide three statistics courses at the University of Southampton, UK:

1. Data exploration, regression, GLM & GAM with introduction to R. 23 – 27 March 2015.
2. Introduction to Bayesian statistics and MCMC. 8 – 10 April 2015.
3. Introduction to Linear Mixed Effects Models and GLMM with R. 13 – 17 April 2015.

The courses are organized by Dr. Martin Solan, who writes: “Recent statistical advances provide opportunity to interrogate established understanding and accepted theory, whilst also allowing researchers to generate novel research questions that were not previously answerable using less sophisticated statistical routines. At Ocean and Earth Science, University of Southampton, we believe that developing competency in applying a portfolio of statistical tools is integral to achieving high profile and high impact science, and it is with this focus that I am delighted to host a series of statistical courses run by Highland Statistics through 2014, 2015 and beyond.”

The first course is a repeat of the March 2014 course that Highland Statistics ran at the University of Southampton (there was a waiting list, so don’t wait too long to register). The course starts with data exploration following Zuur et al. (2010). This is actually one of the most downloaded papers in MEE! Quite often people think that multiple linear regression is fitting a straight line through a cloud of observations. Wrong! You can easily model non-linear patterns using linear regression techniques! Once we have explained multiple linear regression (i.e. interactions, model validation, model interpretation, the philosophies of model selection), the rest of statistics is a piece of cake! GLMs and GAMs are all extensions of regression.
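To see why a linear regression can follow a curve, note that “linear” refers to the regression parameters, not to the shape of the fitted line. The snippet below is a minimal sketch in Python (the course itself uses R) that fits y = b0 + b1·x + b2·x² by ordinary least squares, simply by adding x² as an extra covariate. The data are a made-up, noise-free curve and the function names are our own for this illustration.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Build an augmented matrix so the inputs are left unmodified.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_quadratic(xs, ys):
    """Ordinary least squares for y = b0 + b1*x + b2*x^2 via the normal equations."""
    design = [[1.0, x, x * x] for x in xs]  # columns: intercept, x, x^2
    k = 3
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)]
           for i in range(k)]
    xty = [sum(row[i] * y for row, y in zip(design, ys)) for i in range(k)]
    return solve(xtx, xty)


# Made-up, noise-free data from a known curve: y = 1 + 2x - 0.5x^2.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.0 + 2.0 * x - 0.5 * x * x for x in xs]

b0, b1, b2 = fit_quadratic(xs, ys)
print(b0, b1, b2)  # estimates should be very close to 1, 2 and -0.5
```

The model is still a linear regression: the three coefficients enter additively, and the curvature comes entirely from the extra x² column in the design matrix.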

You are competing with a large number of scientists for a small amount of space in scientific journals. Who likes statistics? Not your readers, not the referees of your manuscript, nor your line manager/PhD supervisor. We will show you how to increase your chances of getting your work published!

So you thought that one week of statistics is enough? The third course is about linear mixed effects modelling and generalized linear mixed effects models (GLMM). You need these techniques if you have multiple observations from the same animal, location, site, plant, tree, country, person, vessel, observer, you name it. Realistically speaking, this means that all you guys need mixed modelling! Before you enthusiastically sign up for this course, please read the rest of this blog!

Mixed effects models are essentially linear regression models (or GLMs) that contain a dependency structure. So, before signing up for this course ensure you are familiar with R, data exploration, regression and GLM. The problem with mixed effects models is that the software to estimate these models can only cope with standard distributions (Normal, Poisson, binomial). But for some reason ecologists always manage to end up with highly complicated data sets and models: GLMs and GLMMs with temporal correlation, multiple nested random effects, crossed random effects, zero inflation, spatial correlation, etc. Unfortunately, standard packages in R can no longer be used for such models.

So what do you do? The answer is MCMC. For years we thought that Bayesian statistics and MCMC were difficult things. Priors, posterior distributions, MCMC; they sound scary. However, the concept of Bayesian statistics is actually much easier than frequentist statistics (that is the stuff we do in the first course). We therefore decided to do the mixed modelling course with MCMC. With MCMC the sky is the limit. You can fit almost any model!

There are two problems with MCMC: (i) you need a fast computer and (ii) it is not taught in most undergraduate courses. As to the latter problem, the second course at Southampton University provides a 3-day introduction to Bayesian statistics and MCMC. It is more or less expected that participants who do the mixed modelling course also join the Bayesian statistics and MCMC course. Otherwise you will need to obtain the knowledge through self-study (Introduction to WinBUGS for Ecologists by Marc Kéry is an excellent source, though he uses WinBUGS and we will be using JAGS; the syntax is nearly the same). As to the first problem with MCMC, you need a decent computer, something that is less than 5 years old. Why are we using JAGS and not WinBUGS? Because my old MacBook didn’t like to run WinBUGS under Parallels (see picture below). JAGS is cross-platform, is free, and can be run from R using R2jags.

Bad idea: running WinBUGS via Parallels on a MacBook.