The Completely Sufficient Statistician

Ralph G. O’Brien
Keynote Address for the 14th Annual Kansas State University Conference on Applied Statistics in Agriculture, April 2002

Today’s ideal statistical scientist develops and maintains a broad range of technical skills and personal qualities in four domains: (1) numeracy, in mathematics and numerical computing; (2) articulacy and people skills; (3) literacy, in technical writing and in programming; and (4) graphicacy. Yet, many of these components are given short shrift in university statistics programs. Can even the best statistician today really be “completely sufficient”?

We are constantly searching for people who are striving to become Completely Sufficient Statisticians (CSS). What are those?

“… all four types of ability”

The statistical scientist’s raison d’être is to improve empirical studies conducted by subject matter investigators (Figure 1). Statisticians are professional experts in the art and science of designing studies, analyzing data, and communicating the results. The research questions range from elegantly straightforward to utterly tangled to wholly unformed. Whatever the case, we must define a sound study design and plan specific statistical models and tests. No design is perfect, so compromises are required. But as the legendary New York Yankees catcher Yogi Berra said, “Don’t make the wrong mistake.” Textbook data appears from paradise, clean and orderly and complete, but real data often flies in from a Kansas tornado, dirty and disorganized and crippled by missingness. And there is often too much data to possibly be analyzed well, given time and resources. The data analyst has an enormous variety of methods at his disposal. Which ones he uses and how well he uses them depend on his talent, creativity, time, and intellectual interest in the problem. Whatever happens, we hope that new useful knowledge is produced. No analysis is complete, and more questions will arise to examine another day, if someone has the time. But when things come together, when you help discover something or confirm something that really makes a difference, well, nothing (professionally) could be more satisfying.

In his 1976 article in The American Cartographer, W.G.V. Balchin presented a figure similar to Figure 2 to illustrate that humans evolved by first developing keen visual-spatial skills, then social skills, then verbal skills, and finally numerical skills. “In a brain as highly developed as that of a human being,” Balchin exhorted, “the potential for all four types of ability is inborn, but none of them can come to fruition without education.” This rings true for statistics education: The Completely Sufficient Statistician develops and maintains solid skills in

    numeracy: formulating and solving problems using mathematics and computing
    articulacy: speaking and listening; also people skills
    literacy: writing and reading
    graphicacy: producing and understanding graphics

Numeracy

The ideal statistician must be sufficiently mature in using mathematics and numerical computing to define and solve real problems (Figure 3). The mathematical theory of statistics firmly connects statistical science to mathematics, and that theory is still stressed in university statistics programs, as it should be. By “numerical computing,” I mean all the methods and skills that enable us to transform raw observations into sound descriptive and inferential statistics. This involves more than the ordinary use of common statistical software systems (e.g., SAS). The CSS must be able to adapt those systems or use a regular programming language (e.g., C) to solve unique problems or carry out computations for methods that are not available in those systems. Our recruiting efforts at the Cleveland Clinic indicate that too many students are not sufficiently skilled in numerical computing.

Figure 3. Numeracy in statistical science: mathematics and numerical computing

Are those well trained in mathematical statistics and numerical computing sufficiently numerate? No. Full numeracy requires the CSS to be able to use mathematics and numerical computing on real subject-matter studies that may have diffuse and tangled research questions, imperfect and/or unique designs, and messy data. Too many gifted mathematical statisticians are rather unskilled statistical scientists. They did well in courses like measure theory and may even now teach them, but they flounder when translating a concrete issue in a subject-matter study into a statistical model that can yield pragmatic solutions. They flounder again when they must translate their mathematical work back into the concrete terms of the study. This is a failure in numeracy.

In the December 2002 issue of Amstat News, David Moore, Roxy Peck, and Alan Rossman reported on a workshop held in October 2000 at Grinnell College. They wrote:

What Do Statisticians Need from Mathematics?
The two highest priority needs of statistics from the mathematics curriculum are to:

(1) Develop skills and habits of mind for problem solving and for generalization. Such development is deemed more important than coverage of any specific content area.

(2) Focus on conceptual understanding of key issues of calculus and linear algebra, including function, derivative, integral, approximation, and transformation.

While the Grinnell workshop was concerned with undergraduate training, the same general prescription holds for graduate training. When a university’s statistics faculty is composed of people whose primary interests lie in mathematical statistics, this remains the dominating theme of their hiring, their curricula, and their campus-wide service programs (consulting units, if any). When a statistics curriculum overemphasizes mathematical training, it comes at the expense of training in other domains.

All exercises given to students should emanate from something close to reality. Let me illustrate. For several decades, Dr. X (a real person, but not identified here) has been teaching an introductory statistics course in a mathematics department at a top-ranked liberal arts college.

Here, virtually verbatim, is one of his recent homework exercises:

The weight of a chemistry textbook is a normal random variable with a mean 3.5 lbs and standard deviation of 2.2. The weight of an economics textbook is a normal random variable with a mean of 4.6 lbs and a standard deviation of 1.3. Compute

(a) the probability that the total weight of two books will be 9.0 lbs or more.

(b) the probability that the economics book will be heavier than a chemistry book.

The problem is trite, has no connection to anything that a real investigator would study, and makes no sense mathematically (because the specified distribution implies that books can have negative weight). It fails completely to build statistical numeracy. In fact, it hurts it.
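The negative-weight objection can be checked numerically. The following is only a sketch: it supposes the two weights are independent (the exercise does not say so) and uses Python's standard-library `NormalDist`; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

# Distributions exactly as stated in Dr. X's exercise (weights in lbs).
chem = NormalDist(mu=3.5, sigma=2.2)
econ = NormalDist(mu=4.6, sigma=1.3)

# Supposition (not stated in the exercise): the two weights are independent,
# so the variances add for both the sum and the difference.
sd_both = sqrt(chem.variance + econ.variance)
total = NormalDist(chem.mean + econ.mean, sd_both)
diff = NormalDist(econ.mean - chem.mean, sd_both)

p_total_ge_9 = 1 - total.cdf(9.0)    # (a) total weight of 9.0 lbs or more
p_econ_heavier = 1 - diff.cdf(0.0)   # (b) economics book heavier than chemistry book
p_negative_chem = chem.cdf(0.0)      # share of chemistry books weighing less than nothing

print(f"(a) P(total >= 9.0 lbs)    = {p_total_ge_9:.3f}")
print(f"(b) P(econ > chem)         = {p_econ_heavier:.3f}")
print(f"    P(chem weight < 0 lbs) = {p_negative_chem:.3f}")
```

Under these suppositions the answers come to roughly 0.36 for (a) and 0.67 for (b), while the stated model puts about 5.6% of chemistry textbooks below zero pounds.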

This reinforces the oft-heard notion that introductory statistics remains one of the worst courses taught in the undergraduate curriculum. Last year, I received the following email from my younger daughter:

Hi. I just wrote to inform you that my stat prof is incredibly bad. He’s a really nice guy and I’m sure he knows what he’s doing, but he really can’t teach. We spent an entire 1.5 hours learning about mean, median, and mode. WHAT IS WRONG WITH STAT TEACHERS!!? Oh well, should be easy at least. Don’t know how much longer I’ll keep going to class … he teaches straight out of the book. The chance of me following in your footsteps is now reaching a probability of .0000000001. :-)

I am pleased to report that at least for now (Fall 2002), she is majoring in mathematics and psychology and is planning to take as many statistics courses as she can. I hope she will be given problems with more realism, such as:

Suppose that ordinary cow’s milk contains two forms, gamma and omega, of the (fictitious) coenzyme benumerate. Across individual cows, benumerate-gamma has a mean concentration of 20.0 mg/L with a standard deviation of 4.0 mg/L, and benumerate-omega has a mean of 30.0 mg/L and a standard deviation of 6.0 mg/L. The correlation between the two measures is -0.40.

(a) What is the probability that the total benumerate concentration exceeds 40 mg/L in an individual cow’s sample of milk? What did you have to suppose (a better term than “assume”) mathematically in arriving at this answer? Discuss the sensitivity of your answer to what you supposed mathematically. In other words, how likely is it that your mathematical suppositions are so wrong that your results based upon them would be seriously misleading?

(b) What is the probability that the omega form exceeds the gamma form by at least 5 mg/L in an individual cow’s sample of milk? What mathematical suppositions did you make in arriving at this answer? Discuss the sensitivity of your answer to what you supposed mathematically.

The small Milky Way Farm has 25 dairy cows and their milk is pooled.

(c) What is the probability that the total benumerate concentration exceeds 40 mg/L in a sample of pooled milk? What mathematical suppositions did you make in arriving at this answer? Discuss the sensitivity of your answer to what you supposed mathematically.
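One possible starting point for parts (a) through (c) is sketched here. It is not the only defensible answer. It supposes that the two concentrations are jointly normal, that cows are independent of one another, and that each of the 25 cows contributes an equal volume to the pool. None of these is given in the problem, and the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

# Values given in the problem (mg/L).
mean_gamma, sd_gamma = 20.0, 4.0
mean_omega, sd_omega = 30.0, 6.0
rho = -0.40
n_cows = 25

cov = rho * sd_gamma * sd_omega  # covariance between the two forms

# (a) Total = gamma + omega. Supposition: joint normality.
var_total = sd_gamma**2 + sd_omega**2 + 2 * cov
total = NormalDist(mean_gamma + mean_omega, sqrt(var_total))
p_a = 1 - total.cdf(40.0)

# (b) Difference = omega - gamma. Supposition: joint normality.
var_diff = sd_gamma**2 + sd_omega**2 - 2 * cov
diff = NormalDist(mean_omega - mean_gamma, sqrt(var_diff))
p_b = 1 - diff.cdf(5.0)

# (c) Pooled milk. Suppositions: independent cows, equal volume per cow,
# so the pooled concentration is the mean of 25 individual totals.
pooled = NormalDist(mean_gamma + mean_omega, sqrt(var_total / n_cows))
p_c = 1 - pooled.cdf(40.0)

print(f"(a) P(total > 40 mg/L), one cow      = {p_a:.3f}")
print(f"(b) P(omega - gamma >= 5 mg/L)       = {p_b:.3f}")
print(f"(c) P(total > 40 mg/L), pooled milk  = {p_c:.6f}")
```

With these suppositions, (a) is about 0.96, (b) about 0.72, and (c) is indistinguishable from 1. The negative correlation shrinks the variance of the sum and inflates the variance of the difference, which is why (b) is so much less certain than (a).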

Note that questions (a) and (b) are largely isomorphic to Dr. X’s above, except that I have stated nothing about normality, because such things are never truisms in real problems. We suppose (pretend?) that such “assumptions” hold in order to get our work done, and we need to consider how sensitive our answers might be to “violations” of those assumptions. Question (c) hits the notion that the normality assumption is less important for problems involving means, but it does so in a way that reflects how real problems come presented subtly to the statistician. There are no perfect answers for questions like these, which is just fine for developing statistical numeracy. In applying mathematical abstractions to solve real problems, we need to again heed Yogi Berra’s advice: “Don’t make the wrong mistake.”
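One way to gauge that sensitivity is to ask what can be said with no distributional supposition at all. The sketch below uses Cantelli's one-sided inequality, a technique brought in here for illustration and not part of the original exercise. It needs only the stated means, standard deviations, and correlation; part (c) additionally supposes independent cows contributing equal volumes. The function name is hypothetical.

```python
def cantelli_lower_bound(mean, variance, threshold):
    """Distribution-free lower bound on P(X > threshold) for threshold < mean.

    Cantelli's inequality: P(X - mean <= -k) <= variance / (variance + k^2), k > 0.
    """
    k = mean - threshold
    if k <= 0:
        raise ValueError("bound applies only when the threshold is below the mean")
    return k**2 / (variance + k**2)

# Moments implied by the problem statement (mg/L); no normality supposed.
cov = -0.40 * 4.0 * 6.0
var_total = 4.0**2 + 6.0**2 + 2 * cov   # variance of gamma + omega
var_diff = 4.0**2 + 6.0**2 - 2 * cov    # variance of omega - gamma

bound_a = cantelli_lower_bound(50.0, var_total, 40.0)
bound_b = cantelli_lower_bound(10.0, var_diff, 5.0)
# (c) supposes independent cows and equal volumes, so the variance divides by 25.
bound_c = cantelli_lower_bound(50.0, var_total / 25, 40.0)

print(f"(a) P(total > 40)       >= {bound_a:.3f}")
print(f"(b) P(omega - gamma > 5) >= {bound_b:.3f}")
print(f"(c) P(pooled total > 40) >= {bound_c:.3f}")
```

The guaranteed floors are about 0.75 for (a), 0.26 for (b), and 0.99 for (c). Answers for a single cow lean heavily on the distributional supposition, whereas the pooled answer barely moves.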

Mathematical numeracy is not sufficient for statistical numeracy. The Completely Sufficient Statistician must be insightful and practical in adapting her expertise in mathematics and numerical computing to answer real subject-matter questions.
