The word ‘statistics’ can refer either to a collection of numbers or to an active science. As a collection of numbers, ‘stats’ (i.e. data) are now collected ubiquitously, inexpensively and quickly in every walk of life: in health, in business, in sports. As an active science, statistics (or data analytics, as it is becoming more popularly known) encompasses collecting, modelling, visualising and drawing inference from data, while accounting for uncertainty.
While data collection is ubiquitous, appropriate statistical analysis often lags behind. In the current ‘big data’ era, the need for appropriate statistical analysis is much lauded, but it applies to big and ‘not-so-big’ data alike.
Most engineers and scientists are familiar with the concepts of regression or statistical testing from some introductory statistical education. Such methods are typically appropriate only for the analysis of univariate data, i.e. data consisting of a single response variable (perhaps regressed against some explanatory variables).
However, the world in which we live is multivariate: data are collected simultaneously on many variables (which are typically strongly related to each other), often across space and time. In order to appropriately analyse such data, their structured multivariate nature (and their associated uncertainty) must be modelled correctly.
Getting good data from ‘big data’
To use a cooking analogy, in which eggs are the data and the aim is to make a meringue for dessert: if you use the eggs incorrectly, you get scrambled eggs. Dr Chris Horn’s article in this Journal (June 2016), ‘Predictions using big data: a hot theme for the tech sector’, touches on a related issue within the context of ‘big data’, noting under one of its headings that ‘big data is not always good data’.
In a similar vein, good data subjected to incorrect statistical analysis may as well be bad data. With good eggs you may still get very nice scrambled eggs, but they are not much use as a dessert. Employing appropriate statistical models, and accounting for uncertainty, is the key to making principled, data-driven decisions.
Much statistical research aims to develop methods or models that provide suitable tools for the analysis of data in varied settings. Furthermore, many research statisticians focus on developing fast and accurate computational algorithms to estimate such statistical models.
Much of the research conducted by the statistics research group in the School of Mathematics and Statistics at University College Dublin, and in the Science Foundation Ireland-funded Insight Centre for Data Analytics, focuses on the development of statistical models for analysing multivariate data, and on considering such models within the Bayesian framework.
Bayesian statistics, founded on Bayes’ Theorem (arguably the most important theorem in statistics), is a burgeoning area of research, and one that has also been adopted by the machine learning and computer science communities. Bayesian methods combine any prior information the analyst may have with an appropriate statistical model, while providing a natural environment in which to quantify uncertainty. While it may sound somewhat complex, Bayesian statistics is practically applicable in everyday life.
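For the interested reader, the theorem itself is compact: for a quantity of interest $\theta$ and observed data $y$, the posterior belief about $\theta$ combines the prior $p(\theta)$ with the likelihood of the data,

$$
p(\theta \mid y) \;=\; \frac{p(y \mid \theta)\,p(\theta)}{p(y)} \;\propto\; p(y \mid \theta)\,p(\theta).
$$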
The regression models you learned about in introductory statistics courses can equally be estimated within a Bayesian setting. Working in the Bayesian paradigm typically involves computationally heavy Monte Carlo-based sampling algorithms, which are themselves an active area of current statistical research.
Many statistical researchers draw heavily on parallel computing resources, or supercomputers, to run such algorithms; the R project provides an excellent free software environment for statistical computing and graphics, and is attracting increased attention from industry.
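As a purely illustrative sketch of the ideas above (not code from UCD, Insight or Game Changer), the following base-R snippet estimates a simple Bayesian linear regression using a random-walk Metropolis sampler, one of the most basic Monte Carlo sampling algorithms; the data, priors and tuning choices are all assumptions made for the example.

```r
# Illustrative sketch only: simulated data, vague priors, known error sd.
set.seed(1)
n <- 50
x <- runif(n, 0, 10)                    # explanatory variable
y <- 2 + 0.5 * x + rnorm(n, sd = 1)     # response; true intercept 2, slope 0.5

# Log posterior: normal(0, 10) priors on the intercept and slope
log_post <- function(theta) {
  alpha <- theta[1]; beta <- theta[2]
  sum(dnorm(y, mean = alpha + beta * x, sd = 1, log = TRUE)) +
    dnorm(alpha, 0, 10, log = TRUE) + dnorm(beta, 0, 10, log = TRUE)
}

# Random-walk Metropolis sampling from the posterior
iters <- 5000
draws <- matrix(NA, iters, 2, dimnames = list(NULL, c("alpha", "beta")))
theta <- c(0, 0)
for (i in 1:iters) {
  proposal <- theta + rnorm(2, sd = 0.1)
  if (log(runif(1)) < log_post(proposal) - log_post(theta)) theta <- proposal
  draws[i, ] <- theta
}

# Posterior means and 95% intervals, discarding the first 1,000 draws as burn-in
colMeans(draws[-(1:1000), ])
apply(draws[-(1:1000), ], 2, quantile, c(0.025, 0.975))
```

The posterior intervals returned at the end play the role of the uncertainty quantification described above: rather than a single best-fit line, the sampler yields a whole distribution of plausible intercepts and slopes.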
Game Changer: stats in sports
Currently, a popular setting for statistical analysis is ‘stats in sports’. It is not uncommon for amateur Gaelic football teams to have a performance analyst, while professional rugby and soccer teams have one (or often several) by default.
Performance statistics, such as the number of kicks, the number of tackles and so on, are collected for every training session and match. These data are inherently multivariate, as an individual’s performance stats may be related to each other and are often structured in terms of time and space.
The aim of having such data is, presumably, to aid decision-making at management level, or perhaps to highlight areas for improvement at player level. Even though such performance data are typically ‘not-so-big’, it is no less important to model them appropriately. Thus, visualising and modelling such data in an appropriate statistical manner is the key to making good data-driven decisions.
Game Changer, an early-stage venture that recently won a University College Dublin commercialisation award at NovaUCD, is built on appropriate statistical analysis tools for sports-performance statistics. Game Changer resulted from the final-year research project of Emily Duffy, a BSc Statistics student, carried out under my supervision in collaboration with Leinster Rugby’s chief analyst. Game Changer is a software platform that offers simple, user-friendly and fast performance analysis.
Currently, Game Changer consists of two products: DigiCoach and Talent Tracker. DigiCoach provides post-match performance analytics for individuals, teams and organisations. The software, easily and securely accessible on a mobile device, tablet or personal computer, allows team management and players to instantly and visually examine their match performance statistics, and to compare their performance with that in previous games.
The simplicity of DigiCoach is the key to its success: it can easily be ported to a wide range of sports and has already garnered interest from other international rugby, GAA and NBA clubs.
One of the beauties of being a statistician is that the problems on which you work vary continuously in both difficulty and field of application: in the famous words of the statistician John Tukey, ‘the best thing about statistics is that you get to play in everyone else’s back yard’. A great example of this is Game Changer’s second product, Talent Tracker.
Talent Tracker
Talent Tracker is a bespoke product, motivated by the specific needs of an Irish professional provincial rugby club. It is in such situations that statistics, like engineering, excels: a bespoke, mathematically correct statistical analysis of the data must be conducted to solve a very specific problem. Talent Tracker can be used by team management to help identify key players for team selection, promotion to the next squad or potential recruitment.
Talent Tracker is underpinned by a multivariate statistical modelling tool that models the available data jointly, taking account of the dependencies between performance statistics. While Talent Tracker was motivated by a specific sports club’s needs, it demonstrates the flexibility and feasibility of using mathematically principled statistical models when making data-driven decisions.
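Game Changer’s own model is not public, so purely as an illustration of what ‘joint’ modelling of performance statistics means, the short R sketch below simulates a handful of correlated per-match statistics (the stat names, means and correlation values are invented for the example) and summarises their dependence structure together, rather than one statistic at a time.

```r
# Illustrative only: invented performance statistics, not Game Changer data.
library(MASS)   # for mvrnorm(); MASS ships with standard R installations

set.seed(2)
Sigma <- matrix(c(1.0, 0.6, 0.3,
                  0.6, 1.0, 0.4,
                  0.3, 0.4, 1.0), nrow = 3)          # assumed dependence structure
stats <- mvrnorm(n = 40, mu = c(20, 12, 8), Sigma = Sigma)
colnames(stats) <- c("carries", "tackles", "kicks")  # hypothetical stat names

colMeans(stats)   # a player's typical match profile, summarised jointly
cor(stats)        # estimated dependencies between the performance statistics
```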
The Game Changer software platform is now undergoing further development, to include other scalable tools similar to DigiCoach and additional bespoke, club-specific tools. Whether the data are sports-performance stats, genomic data or survey data, the joy of the science of statistics is its utility wherever there are data, no matter how big, or not-so-big, those data may be.
Dr Claire Gormley is assistant professor in statistics in the School of Mathematics and Statistics, University College Dublin, and a funded investigator with the Insight Centre for Data Analytics.
Emily Duffy is a BSc Statistics graduate from University College Dublin.