Managing Large Data Sets

James Shrenk works as a pricing analytics professional in Arizona. He has a diverse background in several industries, including finance, telecommunications, and environmental services. His keen interests in statistical modeling and simulation keep him motivated to find new ways of transforming data into valuable information.

Statistical professionals and data analysts are often confronted with data that do not conform to our expectations of quality and definition. For example, data often come with poor documentation or—worse—no documentation at all. Even data that reside in a functional and mature data warehouse come with no guarantees: the documentation may be unclear, or there may be so much documentation and complexity that making sense of the data is nearly impossible. In the new world of Big Data, these issues are amplified and generally more challenging. How can we navigate data, and especially Big Data, in meaningful ways?

First, it is important to understand that the definition of Big Data is context specific. Some analysts regard data sets of several thousand observations as fairly large. Clinical trials are one domain in which it is often expensive and time consuming to collect even moderate amounts of data. Others, for example in the world of marketing, may process Internet web logs containing millions of transactions (and consuming terabytes of space in large storage arrays). The goals of analysis remain the same: generate insights that define a problem or opportunity, bring clarity to the data, and connect the data to a meaningful result, whether it be furthering academic research or improving bottom-line profits.

The process used to accomplish these goals is important to every analyst and manager. While every analysis or project presents its own challenges, the better we are at generalizing our process, the greater the likelihood that we produce consistent, accurate, and reproducible results. Working with large data sets presents many opportunities to generalize the process and gain valuable experience that can be transferred to new and larger projects. Some parts of the process are technical, such as the choice of data processing tools, working with data warehouses, and sampling strategies. Less technical, but arguably more important, areas include problem-solving capabilities, data intuition, and ensuring that the end product is relevant and understandable to stakeholders.

Tool selection is a necessary first step. Often, the choice of tools is decided well in advance of the specific project of interest. Organizations make the decision to use SAS, SPSS, R, or even Excel for all their data analysis needs. Since the specific applications of those tools are not all known in advance, the choice is made on a one-size-fits-all basis. If given the flexibility to choose other tools, think carefully about the capabilities needed. If running R, is there concern about running out of memory given the size of the data? Can these concerns be addressed with a better server, more memory, or other tools such as Hadoop? Even tools such as SAS come with practical concerns that should be carefully considered.
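As a minimal illustration of that kind of planning, the sketch below reads a small slice of a file, measures its footprint in memory, and scales by the file's size on disk to flag an obvious memory problem before committing to a full load. The file name, transactions.csv, is a hypothetical placeholder.

```r
## Rough memory check before loading a large file into R.
## "transactions.csv" is a hypothetical file name used only for illustration.
path  <- "transactions.csv"
slice <- read.csv(path, nrows = 10000)            # first 10,000 rows only

slice_mem  <- as.numeric(object.size(slice))                          # bytes in RAM
slice_disk <- sum(nchar(readLines(path, n = 10001), type = "bytes"))  # bytes on disk (approx.)
full_disk  <- file.size(path)

est_gb <- slice_mem * (full_disk / slice_disk) / 1024^3
cat(sprintf("Rough estimate of full in-memory size: %.1f GB\n", est_gb))
```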

Give thought to the limitations of your tools and plan for contingencies. With larger data sets, data aggregation and reduction techniques are often worth applying. Handling the first aggregation of a data set within a data warehouse (accessed via SQL) can often save time and effort by reducing millions of records to thousands. However, remain aware that although database platforms are fantastically efficient at data aggregation, there is a price. That first aggregation often eliminates relevant details that then remain hidden throughout the analysis. It is often best to examine the detailed data before grouping it into a more manageable data set.
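One way to push that first aggregation into the warehouse from R is through a DBI connection, as in the sketch below. It is only a sketch: the driver, data source name, and the table and column names (sales_detail, region, order_date, amount) are hypothetical placeholders, not a particular schema.

```r
## Aggregate millions of detail rows server-side; only the summarized result
## (thousands of rows) travels back to R. All names below are hypothetical.
library(DBI)

con <- dbConnect(odbc::odbc(), dsn = "warehouse")   # hypothetical DSN

daily <- dbGetQuery(con, "
  SELECT region, order_date,
         SUM(amount) AS revenue,
         COUNT(*)    AS n_orders
  FROM   sales_detail
  GROUP  BY region, order_date
")

dbDisconnect(con)
```

Before settling on the grouping, it is usually worth pulling a small slice of the un-aggregated detail as well, to confirm that nothing important is being averaged away.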

Perhaps data reduction for a project can be accomplished through a sampling strategy. This will not always work when small effect sizes are expected, but otherwise sampling remains a viable way to reduce the size of the data without losing a great deal of information.
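A simple random sample is often enough. The sketch below assumes the detail rows are already in a data frame named detail (a hypothetical name) and draws a reproducible 5 percent sample.

```r
## Simple random sample of rows; `detail` is a hypothetical data frame.
set.seed(42)                              # make the sample reproducible
idx         <- sample(nrow(detail), size = ceiling(0.05 * nrow(detail)))
detail_5pct <- detail[idx, ]
```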

Perhaps the problem can be split into groups and analyzed within each group. Split-apply-combine techniques such as those detailed by Hadley Wickham can provide relief from data too large to analyze effectively otherwise. Along those lines, spend time learning and researching methods for dealing with large data sets. Dirk Eddelbuettel maintains the High-Performance and Parallel Computing with R task view on CRAN, and many packages are available that enable analysts to accomplish their goals.
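As one illustration, the split-apply-combine pattern can be written in a few lines of base R rather than with any particular package; the data frame detail and its columns region and amount are hypothetical names used only for the sketch.

```r
## Split-apply-combine in base R; `detail`, `region`, and `amount` are
## hypothetical names.
pieces    <- split(detail, detail$region)             # split by group
summaries <- lapply(pieces, function(d) {             # apply a summary to each piece
  data.frame(region  = d$region[1],
             revenue = sum(d$amount),
             n       = nrow(d))
})
result <- do.call(rbind, summaries)                   # combine the pieces

## On a multi-core machine, the apply step parallelizes with little change,
## e.g., parallel::mclapply(pieces, ...) in place of lapply().
```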

Learning and implementing these techniques—regardless of the tool being used—enables the analyst to develop valuable intuition when working with data. After years of practice and, indeed, many failures and successes, problems will become increasingly apparent before writing a line of code. Too many observations? Perhaps it is efficient to prototype using a small sample. Too many variables? Perhaps a principal component analysis is in order. Watching a process run for hours ultimately provides the spark of inspiration needed to learn how to parallelize the process!
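Both ideas fit in a few lines of R; the matrix X below is a hypothetical stand-in for a wide, numeric data set.

```r
## Prototype on a small sample of rows, then reduce the number of variables
## with principal component analysis. `X` is a hypothetical numeric matrix.
set.seed(1)
proto <- X[sample(nrow(X), 5000), ]        # modest sample for prototyping

pca <- prcomp(proto, center = TRUE, scale. = TRUE)
summary(pca)                               # variance explained by each component
scores <- pca$x[, 1:5]                     # keep the first few components
```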

Developing good data intuition is often a mysterious art among data analysts and statisticians. As a group, we constantly express the need to let data drive decision making. Just as important is developing a “good feel” for what the data are, so that those insights become possible. Also know that too much data is sometimes the problem, rather than the solution. Data sense-making expert Stephen Few describes the necessity of being able to differentiate signal from noise. Indeed, popular books have been written (e.g., The Signal and the Noise by Nate Silver) that provide guidance on developing this practical skill.

In the end, the most important facet of working with data is generating insights for those with whom we work. To remain relevant and vital to decision makers (whether they are managers, directors, foundations that provide grants, or university administrators) requires the analyst to be practical and skilled. It is often necessary to take small steps, iterate over an analysis many times, and accept imperfection to achieve great results. Embrace data, large and small, as they enable us to achieve great things. Finally, never overlook an opportunity to offer analytical insight, politely and professionally increase the knowledge of others lacking information-driven insights, or challenge long-held intuitions—whether they are your own or others'.