Beginner’s Guide to Data Exploration and Visualisation

A Beginner’s Guide to Data Exploration and Visualisation with R (2015). Ieno EN, Zuur AF. ISBN 9780957174177.

In 2010 we published a paper in the journal Methods in Ecology and Evolution entitled “A protocol for data exploration to avoid common statistical problems.” Little did we know at the time that this paper would become one of the journal’s all-time top downloaded and top cited papers, with a total of 22,472 downloads between 2010 and 2014. Based on this success we decided to extend the material in the paper into a book.

We give about 25 five-day statistics courses each year. Our typical audience consists of biological scientists at the post-graduate and post-doctoral levels. Early on in each course we have the following conversation with the participants:

  • Us: “Do you review submitted manuscripts for journals?”
  • Them: “Yes.”
  • Us: “How much time do you spend on this?”
  • Them: “About 2 hours maximum.”
  • Us: “Do you like the statistical part of these manuscripts?”
  • Them: “No!”
  • Us: “Do you understand the statistical part?”
  • Them: “Not always.”

This means that you spend a huge amount of time collecting data, analysing it, writing a manuscript, formatting it for a specific journal, and submitting it. Then someone else spends 2 hours assessing it. That is a lot of your time and money versus 2 hours of someone else’s review time.

What about making life much easier for reviewers, and consequently increasing the likelihood that your work will be published?

No one likes tables with numbers, especially not if they contain interactions. Corrections of interactions and slopes will confuse even the experienced statistician! A graph summarising the results of your statistical model makes it so much easier for a referee to understand what you have done and what the results tell you! This is the Visualisation part of the book title.
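As a minimal sketch of what such a graph could look like in R (this is not code from the book), the example below fits a linear regression with an interaction and draws the fitted lines with approximate 95% pointwise bands on top of the raw data. It uses R's built-in ToothGrowth data purely as a stand-in for an ecological dataset; the object names fit and newdat are illustrative assumptions.

```r
# Illustrative only: ToothGrowth is a built-in R dataset used as a stand-in
fit <- lm(len ~ dose * supp, data = ToothGrowth)

# Grid of covariate values at which to predict
newdat <- expand.grid(
  dose = seq(0.5, 2, length.out = 25),
  supp = levels(ToothGrowth$supp)
)

# Fitted values with approximate 95% pointwise bands (normal approximation)
pred <- predict(fit, newdata = newdat, se.fit = TRUE)
newdat$fit <- pred$fit
newdat$lo  <- pred$fit - 1.96 * pred$se.fit
newdat$hi  <- pred$fit + 1.96 * pred$se.fit

# Raw data, one colour per level of supp
plot(ToothGrowth$dose, ToothGrowth$len,
     col = as.integer(ToothGrowth$supp), pch = 16,
     xlab = "Dose", ylab = "Length")

# Add one fitted line (solid) and its bands (dashed) per level
for (i in seq_along(levels(ToothGrowth$supp))) {
  d <- newdat[newdat$supp == levels(ToothGrowth$supp)[i], ]
  lines(d$dose, d$fit, col = i, lwd = 2)
  lines(d$dose, d$lo,  col = i, lty = 2)
  lines(d$dose, d$hi,  col = i, lty = 2)
}
legend("bottomright", legend = levels(ToothGrowth$supp),
       col = 1:2, lwd = 2)
```

A reader can see from one picture how the slope differs between the two groups, which is the information an interaction term in a table of estimates encodes.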

The Data Exploration part of the title is where you try to understand your own data (outliers, collinearity, types of relationships) and whether the quality of the data is good enough to answer your biological questions. There is an example in the book where the length of shells eaten by oystercatchers is modelled as a function of season, location, and feeding type. The three-way interaction is highly significant, but a simple graph of the results shows that in one specific combination of covariates (location A, stabbers, December) there are only two shell length observations, and these have the same value and are relatively large. No wonder that the three-way interaction is significant. Perhaps we should conclude that the quality of the data is not good enough for such a model!
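A quick way to spot this kind of problem before fitting a model is to count the observations in every combination of the categorical covariates. The sketch below is not the book's code, and the data are simulated: the data frame oc, its column names, and the threshold of five observations are all assumptions made for illustration.

```r
# Simulated stand-in data; names and values are hypothetical
set.seed(1)
grid <- expand.grid(
  FeedingType = c("hammerer", "stabber"),
  Location    = c("A", "B"),
  Month       = c("Dec", "Jan")
)
oc <- grid[rep(seq_len(nrow(grid)), each = 10), ]   # ten rows per cell
oc$ShellLength <- round(rnorm(nrow(oc), mean = 2, sd = 0.3), 2)

# Make one cell sparse: keep only two rows for stabbers at location A in December
is_cell <- oc$FeedingType == "stabber" & oc$Location == "A" & oc$Month == "Dec"
oc <- oc[-which(is_cell)[-(1:2)], ]

# Give the two remaining observations the same, relatively large value
is_cell <- oc$FeedingType == "stabber" & oc$Location == "A" & oc$Month == "Dec"
oc$ShellLength[is_cell] <- 2.5

# Number of observations per three-way combination
cell_counts <- as.data.frame(
  xtabs(~ FeedingType + Location + Month, data = oc),
  responseName = "n"
)

# Flag combinations with fewer than five observations (arbitrary threshold)
sparse_cells <- cell_counts[cell_counts$n < 5, ]
print(sparse_cells)
```

Any combination flagged here deserves a closer look before a three-way interaction is included in a model.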

The book uses ecological datasets to discuss data exploration and visualisation tools. We also explain how to visualise the results of statistical models, an important aspect of publishing scientific papers. The book includes the R code needed to construct, visualise, and explore the main features of the data step by step. We wrote this book so that the statistical knowledge required is as low as possible. A knowledge of linear regression is all that you need.

A Beginner’s Guide to Data Exploration and Visualisation with R is the fourth book in Highland Statistics’ Beginner’s Guide to series. Previous books include A Beginner’s Guide to Generalized Additive Models with R, A Beginner’s Guide to GLM and GLMM with R, and A Beginner’s Guide to GAMM with R. Books can only be ordered from www.highstat.com.