Statisticians’ Place in Big Data

Sherri Rose is an NSF mathematical sciences postdoctoral research fellow in the department of biostatistics at the Johns Hopkins Bloomberg School of Public Health. Rose recently coauthored the book Targeted Learning: Causal Inference for Observational and Experimental Data with Mark van der Laan for the Springer Series in Statistics.

Big Data has become the new buzz phrase in the world of information collection and analysis. The experiments we conduct and the observational data we collect continue to grow in size, due to rapidly expanding technology.

Large data sets also have drawn the attention of young people, with undergraduate and graduate students choosing computer science, engineering, and statistics for their programs of study. Each of these disciplines brings something unique to the table when discussing the challenges of Big Data, and interdisciplinary collaborations are becoming increasingly common.

Statisticians have a distinct and essential role to play in this new world of Big Data. The multidisciplinary teams statisticians are a part of may include medical doctors, political scientists, economists, or lab scientists. We, as young statisticians, are likely to repeatedly work in new and different application areas as our careers continue. Regardless of the subject matter, we need to have a key position throughout the entirety of each study, not just after the data have been collected.

Even as a graduate student, I encountered multiple situations in which I was approached by potential collaborators who had already implemented their study design and spent countless dollars gathering data. Sometimes this will work. However, in my experience, important variables weren’t collected or certain types of subjects were mistakenly excluded because a statistician wasn’t involved from the beginning. In the worst scenarios, the collaborators could not answer their main research questions with the data at hand.

I know I am not alone in these experiences, and I often recollect the famous R. A. Fisher quote: “To call in the statistician after the experiment is done may be no more than asking him to perform a post-mortem examination: He may be able to say what the experiment died of.”

This brings us to one of our first responsibilities as new statistician members of interdisciplinary teams. Defining the research question is a collaborative effort, and statisticians play a critical part in translating the scientific question into a statistical question. This includes carefully describing the following:

  • Data structure
  • Everything we know about the underlying system that generated the data (the model)
  • What we are trying to assess (the parameter or parameters we wish to estimate)

My introduction to statistical models as an undergraduate was in the context of parametric models, which make strong assumptions about the form of the underlying system that generated the data. However, in very large data sets, our background knowledge may not support these assumptions. Fortunately, we can instead make fewer assumptions about the probability distribution that generated the data using so-called nonparametric or semiparametric models, incorporating all appropriate background knowledge. Similarly, our parameter of interest for effect questions need not necessarily be a coefficient in a parametric regression model. We can define different features of our probability distribution, depending on our research question.
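To make the last point concrete, here is a minimal sketch of a parameter defined as a feature of the distribution rather than as a regression coefficient. It assumes a binary treatment A, a single discrete covariate W, and an outcome Y; those names, the function stratified_effect, and the toy records are all hypothetical illustrations, not data or methods from any study mentioned here. The target is the covariate-standardized difference in mean outcomes, estimated by plugging in stratum-specific sample means, with no assumed functional form.

```python
from collections import defaultdict


def stratified_effect(records):
    """Plug-in estimate of E_W[ E(Y | A=1, W) - E(Y | A=0, W) ].

    records: iterable of (w, a, y) tuples, where a is 0 or 1.
    No parametric form is assumed; each conditional mean is a
    simple sample mean within a (w, a) cell.
    """
    sums = defaultdict(float)    # (w, a) -> sum of outcomes
    counts = defaultdict(int)    # (w, a) -> number of records
    w_counts = defaultdict(int)  # w -> number of records
    n = 0
    for w, a, y in records:
        sums[(w, a)] += y
        counts[(w, a)] += 1
        w_counts[w] += 1
        n += 1

    effect = 0.0
    for w, n_w in w_counts.items():
        if counts[(w, 1)] == 0 or counts[(w, 0)] == 0:
            raise ValueError(f"stratum {w!r} lacks treated or untreated records")
        mean_treated = sums[(w, 1)] / counts[(w, 1)]
        mean_untreated = sums[(w, 0)] / counts[(w, 0)]
        # Weight each stratum by its share of the sample.
        effect += (n_w / n) * (mean_treated - mean_untreated)
    return effect


# Hypothetical toy data: (w, a, y)
toy_records = [
    (0, 1, 3.0), (0, 1, 5.0), (0, 0, 1.0), (0, 0, 3.0),
    (1, 1, 10.0), (1, 0, 4.0), (1, 0, 6.0), (1, 0, 8.0),
]

print(stratified_effect(toy_records))
```

In this toy sample the unadjusted difference in means between treated and untreated records differs from the standardized estimate, because treatment is distributed unevenly across the two strata. With many or continuous covariates, simple stratification like this breaks down, which is one reason tailored estimation methods are needed in large data sets.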

Our next major role is the estimation of our parameter(s). This may require implementing commonly used methods, developing a new method, or integrating techniques from other fields to solve our problem. In many cases, Big Data will not be served well by “off the shelf” methods that work in low-dimensional, less complicated settings. Our work as statisticians does not stop there; we must be involved in the dissemination of the results, including accurate interpretations of what our estimates mean.

In an article I wrote for Significance (“Big Data and the Future,” Volume 9, Issue 4), I highlighted several areas that create high-dimensional data in which statisticians are becoming immersed in scientific research teams. These included neuroimaging, post-market safety analysis, celiac disease, and air quality. The main theme in these seemingly disparate fields is that they are producing large amounts of data, requiring tailored statistical methods and statisticians who have a deep understanding of the science.

While we may be trainees, our background in the fundamentals of statistics provides us with many tools to contribute immediately. As junior faculty, principal investigators, mathematical statisticians, and in other starting positions across academia, industry, and government, we will join collaborative teams as statistical specialists. We need to take the time to understand the science behind our projects before applying and developing new methods. The importance of defining our research questions will not change as methods progress and technology advances. A firm basis in statistical tenets will allow us to both adapt as techniques improve and incorporate new approaches. Thus, we must also embrace the incorporation of computer scientists, engineers, and researchers in our collaborative teams to reach effective solutions.