Students Talk Fellowships

Five fellows from the Data Science for Social Good (DSSG) program offer advice and respond to questions about their experiences, views on data science, and future plans.

Andrew Landgraf is a PhD candidate in statistics at The Ohio State University. He is interested in the intersection of machine learning and statistics and is doing research on dimensionality reduction.

How are you spending your summer as a DSSG fellow?

My team is working with Chicago Public Schools (CPS). Each year, months before the school year starts, CPS has to forecast the number of students enrolled at each school. Based on these forecasts, budgets are made for the schools. If the forecasts are incorrect, the budgets are adjusted, which can cause disruptions for both the students and teachers.

One of the more interesting and challenging approaches has been building a model at the student level. In this model, we are estimating the probability that each student goes to each school and adding them up to predict the number of students going to each school.
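As a rough illustration of the aggregation step described above (not the team's actual model), the sketch below sums per-student school probabilities into an expected enrollment for each school. The function name `expected_enrollment`, the student and school labels, and the probabilities are all hypothetical and made up for this example; in practice the probabilities would come from a fitted model.

```python
from collections import defaultdict


def expected_enrollment(student_probs):
    """Sum each student's per-school probabilities into expected counts per school.

    student_probs maps a student ID to a dict of {school: probability}.
    """
    totals = defaultdict(float)
    for probs in student_probs.values():
        for school, p in probs.items():
            totals[school] += p
    return dict(totals)


# Hypothetical probabilities, invented for illustration only.
student_probs = {
    "student_1": {"School A": 0.75, "School B": 0.25},
    "student_2": {"School A": 0.5, "School B": 0.5},
    "student_3": {"School A": 0.25, "School B": 0.75},
}

forecast = expected_enrollment(student_probs)
for school, count in sorted(forecast.items()):
    print(f"{school}: {count:.2f} expected students")
```

Because each student's probabilities sum to one, the expected counts across schools add up to the total number of students, so the forecast stays consistent at the district level.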

What inspired you to apply?

I enjoy working on applied statistical and machine learning problems, but you can only get so much out of working on an abstract problem or maximizing profits. This fellowship has given me a great opportunity to work on a project with a meaningful impact and hopefully to get involved with more in the future.

I have a strong background in statistics from my coursework and research projects, but I wanted to gain experience in other aspects of data science. To work on the interesting problems of the present and future, statisticians need strong computational skills and the ability to work with large data sets.

Do you recommend fellow statisticians participate in this program in the future? If so, why and what advice do you have for them?

The fellowship has been a great experience, and I think every statistician who is interested would benefit from it. The fellowship comprises a diverse group of very smart fellows and mentors from whom you can learn a lot. The experience of not only working on your own project but also learning from all the other projects is invaluable. My advice is to try out as many different tools and methods as possible. You will likely not have another chance to experiment as freely and to have so many people who can help you along.

The DSSG fellows come from diverse fields. How do you view the relationship of statistics to data science?

There seems to be a battle right now for the definition of data science. In particular, many statisticians believe data science is statistics. Data scientists have to be able to collect and clean data, explore variables and model dependencies, communicate insights, and deliver data products. A solid statistical background is very important for all of these tasks, but it is more crucial for some than for others. If we want to claim that data science is statistics, more equal emphasis would need to be put on all aspects.

What advice do you have for young statisticians wanting to work in data science?

The most important thing is to jump right in. If there is a data set you want to analyze or a language you want to learn, just do it. At their core, data science and statistics are about getting meaningful insights out of data, and the best way to learn is to try. There are more resources available now than ever, so if there is something you would like to learn more about, material is not hard to find. Don’t wait until there is a class project to get started.

What do you plan to do after your fellowship/graduation?

I am going back to OSU to complete my PhD. I am still deciding what to do after graduating, but my motivations for participating in the fellowship remain. I want to work on difficult problems that have meaningful impact and leverage the latest technologies to solve them.

Subho Majumdar is a statistics PhD student at the University of Minnesota. He left an opportunity to join one of India’s top medical schools seven years ago to pursue statistics as a career, but remains passionate about the application of quantitative methods in biological fields, especially human genetics and public health.

How are you spending your summer as a DSSG fellow?

My team is working with the Chicago Department of Public Health on developing a predictive model for lead poisoning in children across the city of Chicago. We further aim to come up with a standalone application that provides house-level information on lead poisoning and risk scores to health professionals, as well as obtain a model that predicts the trajectory of a newborn child’s blood lead level in their early childhood.

At the heart of our model is a random forest–based algorithm, which takes into account several house-level, child-level, and census tract–level aggregate variables to come up with predictions.

What inspired you to apply?

Last summer, I did an internship with the National Marrow Donor Program, which has the largest database in the world for bone marrow donors and organizes transplantations for leukemia patients when the need arises. This made me aware of the different computational methods in social good–related problems. When I came to know about DSSG from another PhD student in my department, I saw it as a natural extension of my previous experience in related work and another great opportunity to spend a summer doing something I am passionate about.

Do you recommend fellow statisticians participate in this program in the future? If so, why and what advice do you have for them?

Definitely yes. Especially in government organizations, vast troves of data are stored due to policy regulations, and valuable information can be gained by analyzing them. Being equipped with a rigorous mathematical and probabilistic framework provides us statisticians the intuition to come up with an applicable model. Moreover, I believe that irrespective of their own research interests, any statistician working in DSSG will leave with an enriched professional skill-set and a better understanding of how to tackle a real-world problem. This is because working in an environment with highly competent people from diverse backgrounds provides the opportunity to obtain various perspectives on the same problem.

The DSSG fellows come from diverse fields. How do you view the relationship of statistics to data science?

I believe that at the heart of data science is the rapid development of computational capabilities in the past two decades. For this reason, it is the confluence of several disciplines that deal with data, like statistics, machine learning, database management, and visualization methods. The field is very much in its infancy, and a combination of expertise from these disciplines, as well as domain knowledge about the specific problem one is dealing with, can make a big impact.

What advice do you have for young statisticians wanting to work in data science?

Know your data. It is not enough to know what you are doing with the data; it is also important to know why you are doing it.

Don’t be afraid to ask questions. Data science is not about only analyzing data, but also storage, visualization, communication, and understanding the practical issues related to the data. It is by asking questions that you’ll gain new insights into the same data.

Broaden your skills. That way you can get answers to more questions by yourself.

Keep abreast of, or at least have a cursory knowledge of, the latest developments in data science outside statistics.

What do you plan to do after your fellowship/graduation?

After getting my PhD, I want to work in interdisciplinary research positions at universities or government laboratories for a few years, but am also open to R&D positions in the health sector. In any case, I would like to be involved in the development of new statistical methodologies as well as their application.

Carl Shan earned a degree in statistics from the University of California at Berkeley, where he graduated early with high honors. In college, he directed a national education nonprofit, started an enrichment summer camp for middle-school students, and taught a math course to other Berkeley students. He is a bit obsessed with improving the field of education and hopes to leave the world a little better than he found it.

How are you spending your summer as a DSSG fellow?

At DSSG, my team and I are working on building models to predict the outcome of social services. Specifically, we’re working with a nonprofit called Health Leads. Health Leads figured out that many people who have health issues also have non-health needs (such as dilapidated housing) that contribute to their poor health. They work with patients to connect them with resources such as food stamps.

Unfortunately, many patients also drop out of the Health Leads program, and our team is digging through their data to figure out why. By identifying factors that contribute to dropouts, Health Leads can improve its service to get patients the resources they need to live a healthy life.

What inspired you to apply?

When I was a freshman in college, I had a deeply moving personal experience: I signed up for a program to mentor nearby elementary school students. To be honest, I signed up mostly because there was a pretty girl in the club I wanted to talk to. However, once I started volunteering, I realized many of the students I was mentoring grew up in situations and backgrounds starkly different from my own. Whereas I had attended a fantastic high school, many of the students would not be lucky enough to receive the same quality of education I did. Frustrated, I dug around and realized that the social institutions best positioned to make a difference in solving this inequality are schools.

I applied to DSSG because I was hoping to use my background in statistics to provide more people with the type of opportunities I received.

Do you recommend fellow statisticians participate in this program in the future? If so, why and what advice do you have for them?

Absolutely. Personally, I have been incredibly happy with the experience I’ve received at DSSG. The other fellows have been fantastic, with varied experiences and skill sets I’ve learned a great deal from. In addition, my project has been motivating and inspiring. The leaders of the fellowship also have been very supportive and effective.

If others would like to apply, one word of advice would be to normalize expectations by realizing that most nonprofits and governments do not do a great job of collecting, organizing, or analyzing data. Therefore, you should be prepared to play the role of a data janitor as much as a data scientist. I certainly spent a part of my summer just cleaning and munging the data.

The DSSG fellows come from diverse fields. How do you view the relationship of statistics to data science?

As I’ve experienced it, statistics is one of the foundational cornerstones of an effective data scientist. But I imagine what separates the best data scientists from the merely good is not simply a strong grasp of statistics and machine learning, but also the ability and knowledge to carefully interpret statistical results and communicate them to others in a meaningful way that inspires action.

Large parts of statistics seem to be primarily about analysis, producing knowledge. Rayid, the director of the fellowship, once mentioned to me that the more promising goal of data science isn’t the production of knowledge, but rather the production of actions.

What advice do you have for young statisticians wanting to work in data science?

I’ve already mentioned the whole “data janitor” vs. “data scientist” part in an earlier answer. Another point I’d like to throw out there comes from a donut discussion (weekly discussions about data science over donuts some fellows had during the summer).

It was pointed out that because data science is still a massively evolving field, it hasn’t developed the full system of checks and balances other fields (especially scientific research) might have.

For example, we claim to be data scientists working for the social good. But who gets to define what “social good” is? How do we hold data scientists who mine private data, but in the pursuit of the public interest, accountable for their actions?

All of these questions are to say that while data science is an incredibly talked-up and emerging profession, like any new field, it can come with potential abuses of power. Channeling Uncle Ben: “With great power comes great responsibility.”

What do you plan to do after your fellowship/graduation?

After the summer, I will actually be continuing on with the fellowship. In addition to a few other fellows who also will be staying on, I’ll be working on another project that uses a variety of factors to predict high-school dropout rates. I’d like to continue working on these projects and developing my statistical, machine learning, and computer science chops while producing social value.

In the longer term, I’m interested in going back to graduate school to study machine learning, public policy, and education. As for what I’ll do afterward, I’ll punt that decision to my (hopefully) wiser and more knowledgeable future self.

Tracy Schifeling is starting her third year as a PhD student at Duke University. In 2010, she graduated from The University of Chicago with a bachelor’s in math.

How are you spending your summer as a DSSG fellow?

My team worked with Chicago Public Schools (CPS) to improve predictions of how many students will enroll at each school in the district. Many months before the start of the school year, CPS must predict back-to-school enrollment numbers at each school and allocate funds accordingly, so accurate predictions are crucial. Our summer was spent exploring their data and fitting statistical models to try to get as much predictive power out of the data as possible.

What inspired you to apply?

I was inspired to become a statistician after reading about the Chicago Police Department finding patterns in their crime data and making predictions about likely locations of future crimes. This fellowship seemed like the perfect opportunity to join the community of humanitarian statisticians and data scientists.

Do you recommend fellow statisticians participate in this program in the future? If so, why and what advice do you have for them?

Yes! It was a great experience to work through the entire data analysis process, from initial exploration of the data to communicating our final results back to CPS. My work at DSSG helped me develop a realistic understanding of what the challenges are when working with real-world data and how statistics can be useful in such real-world applications.

The DSSG fellows come from diverse fields. How do you view the relationship of statistics to data science?

Statistics is definitely an important step in the data science process. From my experience this summer, I would say that the data science process includes understanding the research question and how data can help, data exploration and visualization, data preparation, statistical modeling, and communicating results.

What advice do you have for young statisticians wanting to work in data science?

Data science is fun because you get to be a generalist. I think it is useful to be familiar with a wide range of statistical models and machine learning methods, because a broad knowledge of these areas gives you the opportunity to be creative and explore a lot of modeling ideas.

What do you plan to do after your fellowship/graduation?

After the fellowship, I am returning to my PhD program at Duke.

Hui Fen (Sarah) Tan is a statistics PhD student at Cornell University. Previously, she studied statistics at the University of California at Berkeley and Columbia University and worked at government and nonprofit organizations in New York City. She tweets at @shftan.

How are you spending your summer as a DSSG fellow?

I am part of a team that works with the Nurse-Family Partnership (NFP), a national nonprofit organization that runs a home-visiting program for low-income mothers and children. We are building models using NFP’s operational data to identify mothers who are at risk of dropping out or not reaching the program’s goals. I also am starting a second project where we try to help WBEZ, a public radio station in Chicago, better target their fundraising.

What inspired you to apply?

Three reasons: First, before graduate school, I worked at government and nonprofit organizations, mostly as the only statistician in teams of domain experts. I wanted the experience of working in a data science team with people of varied backgrounds. Second, I was attracted by the prospect of working with some interesting data sets that we statisticians do not necessarily get access to in academia. Last, I wanted to interact with policymakers to better understand the feedback loop between data science and policy.

Do you recommend fellow statisticians participate in this program in the future? If so, why and what advice do you have for them?

Definitely. The program gives us a view into organizations and industries that are traditionally less data-driven yet really stand to benefit from data science. If you would like to use your skills for social good, this program provides exactly that opportunity. Also, Chicago in summer is beautiful! As for advice … as a fellow, when you are knee deep in data cleaning and modeling, it is helpful to take a step back and remember the question you are trying to answer and the use cases of the models you are building.

The DSSG fellows come from diverse fields. How do you view the relationship of statistics to data science?

Statistical models are essential in data science. But statistical thinking is also invaluable. The things we emphasize in statistics—the importance of knowing your data, checking your assumptions, worrying about bias—add rigor to data science. We statisticians have a lot to contribute to and a lot to learn from data science, and I think we have a tremendous opportunity to shape this young, developing field in our own ways.

What advice do you have for young statisticians wanting to work in data science?

Take some programming classes and learn some computer science. Get a hold of some data, and get your hands dirty. Try to experience the data science process from start to end, from data cleaning to modeling and model testing to visualization.

What do you plan to do after your fellowship/graduation?

I just started my second year in the statistics PhD program at Cornell, so I have a while to go before graduation. I hope to continue collaborating on problems with social impact. After I graduate, I would like to find opportunities with interesting problems and data sets.