Min Chen is a lead statistician who earned a PhD in statistics after completing her master’s in oceanography at Texas A&M University.
As statisticians in an organization with a sizeable data science community, we are frequently asked to compare data-driven modeling as practiced by those trained in statistics with that practiced by those trained in machine learning. If not handled with care, the ensuing discussion can deteriorate into a polarized debate.
One camp thinks machine learning is a subfield of statistics, while the other is convinced it is the other way around. Such discourse may lead to tribalism that serves neither community.
Lack of clarity about this topic can cause organizations recruiting data scientists to struggle and job applicants to market themselves incorrectly. A clear differentiation can help organizations map their needs to the most suitable candidates and help statisticians articulate and market their skills effectively.
A review of blogs, books, papers, conference presentations, and colleagues' practices suggests that data modeling by those trained in machine learning often begins with a relatively large data set, which is wrangled before various algorithms are explored. The most suitable models are deployed and their usage is tracked. This process is known as machine learning operations (blue components in image above), and it includes concepts such as version control, model efficiency, and assessment automation. This refined and highly structured workflow resembles an engineering process more than the scientific method.
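As a simple illustration of the assessment-automation idea, consider the following Python sketch. It is our own hypothetical example, not a description of any particular organization's pipeline: a candidate model is trained on synthetic data and must clear an automated quality gate before it is deployed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# A minimal sketch of assessment automation (hypothetical example):
# every candidate model must pass an automated quality gate before
# it can be deployed.
X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test))

DEPLOYMENT_THRESHOLD = 0.85  # hypothetical gate chosen by the team
print("deploy" if accuracy >= DEPLOYMENT_THRESHOLD else "send back for rework")
```

In a production workflow, a gate like this would run automatically each time a model is retrained, with the metric and threshold chosen by the team.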
Adding practices and formal methodologies foundational to most statistics curricula provides a different framework. We refer to these additions as statistical thinking (purple components in image above). This approach leads to a distinct and stronger brand for those formally trained in statistics and helps hiring managers select the most suitable job applicants relative to the positions under consideration.
Statistical thinking includes engaging problem owners (e.g., researchers) as early as possible, hopefully before data collection begins. The aim is to create the best problem statement that will lead to the most valuable answers. It also involves looking at work processes relevant to the project, understanding organizational cultures that could create biases, and assessing available data.
Researchers will recognize these activities as best practices. The goal of this approach echoes John Tukey, who observed that an approximate answer to the right question is far better than an exact answer (i.e., a model) to the wrong question.
Statistical thinking shines in planning data collection. If instruments or apparatuses are involved, repeatability and reproducibility are quantified. Design-of-experiments methods such as blocking and randomization are used to address known or potential confounding variables, as in the sketch below. Appropriate survey or trial designs are considered when people are the subjects being measured.
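To make blocking and randomization concrete, here is a minimal Python sketch (a hypothetical example of ours): four treatments are run in a random order within each of three blocks, say batches of raw material, so that batch effects do not confound the treatment comparisons.

```python
import numpy as np

# A minimal randomized block design (hypothetical example): each
# treatment appears once per block, in a random run order within
# the block, so block-to-block differences cancel out of the
# treatment comparisons.
rng = np.random.default_rng(seed=42)
treatments = ["A", "B", "C", "D"]
blocks = ["batch1", "batch2", "batch3"]

design = {block: list(rng.permutation(treatments)) for block in blocks}
for block, order in design.items():
    print(block, order)
```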
The purpose of incorporating these well-tested methods is to maximize Fisher information before experiments are performed and data are collected. These tools emerged before the era of big data; accordingly, they provide the means to develop economical experimental protocols that yield the most insight.
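For intuition, consider a sketch of this idea for a straight-line model (again our own illustration, not from the column's sources): the Fisher information matrix is proportional to X'X, and a D-optimal design maximizes its determinant. Placing runs at the extremes of the factor range yields far more information than bunching them in the middle.

```python
import numpy as np

# For the model y = b0 + b1*x with independent errors, the Fisher
# information matrix is proportional to X'X; a D-optimal design
# maximizes det(X'X).
def d_criterion(x):
    X = np.column_stack([np.ones_like(x), x])  # design matrix [1, x]
    return np.linalg.det(X.T @ X)

endpoints = np.array([-1.0, -1.0, 1.0, 1.0])  # runs pushed to the extremes
clustered = np.array([-0.1, 0.0, 0.0, 0.1])   # runs bunched in the middle

print(d_criterion(endpoints))  # 16.0: far more information per run
print(d_criterion(clustered))  # 0.08: same budget, little information
```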
Exploratory data analysis and model building are generally similar, regardless of one's training. However, before models are deployed, statistical thinking encourages formal hypothesis testing and checking of fundamental assumptions. Correlations embedded within the models may be scrutinized further for causality. Parsimonious models, which are more likely to reveal underlying phenomena, are targeted. These practices routinely suggest other hypotheses and therefore send researchers back to experimentation and data collection.
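The following Python sketch shows what such formal checks might look like in the simplest setting (hypothetical data of ours): a two-sample t-test is run only after the normality and equal-variance assumptions it relies on are examined.

```python
import numpy as np
from scipy import stats

# Hypothetical data: a control group and a treated group.
rng = np.random.default_rng(seed=1)
control = rng.normal(loc=10.0, scale=2.0, size=30)
treated = rng.normal(loc=11.5, scale=2.0, size=30)

# Check the assumptions before trusting the test.
print("normality (control):", stats.shapiro(control).pvalue)
print("normality (treated):", stats.shapiro(treated).pvalue)
print("equal variances:    ", stats.levene(control, treated).pvalue)

# If the assumptions look reasonable, run the formal hypothesis test.
t_stat, p_value = stats.ttest_ind(control, treated)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```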
The description above is not intended to minimize the value machine learning practitioners bring to an organization dealing with large and highly unstructured data challenges. In fact, poorly deployed and maintained models lead to a significant loss of value.
This column is also not intended to suggest statisticians cannot engage in machine learning operations. Its purpose is to articulate the different approaches to data modeling so hiring teams have a clearer perspective when evaluating job candidates relative to their organizational needs, and so statisticians have a better way to market their training and their broader role in solving ill-posed problems that will eventually be addressed with data models.