Research in many areas of science and engineering is rapidly becoming an exercise in data management and analysis. From astronomy to molecular biology, huge data sets are in. A recent piece in the New York Times argues that undergraduate science education must change to teach students data processing skills. While it makes what is basically the standard argument for any new thing that someone wants to see in the curriculum ("this is super important and no one will know how to do it if colleges don't start teaching it better"), having done a bit of this kind of large-scale data analysis in my biology research days, I can appreciate the unique challenges. And there is no doubt that this type of science is only going to grow in importance.
There are actually two sets of skills needed. First, there are the technical skills to manipulate such data sets. Knowing some sort of programming language is critical for this. It doesn't need to be a traditional language; Excel formulas could be a good start. But the researcher must be able to move information from one step in a process to another in a logical way, with lots of error checking. And they must also be able to extract selections from the data set and display those somehow. That is a mode of thinking developed by programmers and engineers.
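To make that first skill set concrete, here is a minimal sketch in Python of the kind of workflow I mean: load some data with error checking, extract a selection, and display it. Everything in it is made up for illustration, including the sample names, the column names, the numbers, and the 30-degree threshold; it is one way to do it, not the way.

```python
import csv
import io

# Hypothetical hand-typed measurements; row a3 contains a deliberate typo.
RAW = """sample,temperature_c,growth_rate
a1,25.0,0.42
a2,30.0,0.57
a3,thirty,0.61
a4,37.0,0.80
"""

def load_rows(text):
    """Step 1: parse the raw text, setting aside rows that fail validation."""
    good, bad = [], []
    for row in csv.DictReader(io.StringIO(text)):
        try:
            good.append({
                "sample": row["sample"],
                "temperature_c": float(row["temperature_c"]),
                "growth_rate": float(row["growth_rate"]),
            })
        except (ValueError, TypeError, KeyError):
            # Keep the bad row so it can be inspected instead of silently lost.
            bad.append(row)
    return good, bad

def select_warm(rows, min_temp_c):
    """Step 2: extract the rows at or above a temperature threshold."""
    return [r for r in rows if r["temperature_c"] >= min_temp_c]

def summarize(rows):
    """Step 3: build a small text display of the selection."""
    lines = [f"{r['sample']}: {r['growth_rate']:.2f}" for r in rows]
    if rows:
        mean = sum(r["growth_rate"] for r in rows) / len(rows)
        lines.append(f"mean growth rate: {mean:.3f}")
    return "\n".join(lines)

good, bad = load_rows(RAW)
warm = select_warm(good, 30.0)
report = summarize(warm)
print(report)
print(f"rows rejected: {len(bad)}")
```

The point is less the specific code than the habit: each step hands clean data to the next, and anything that fails a check is counted and kept rather than quietly dropped.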
The second set of skills is the ability to make observations and come up with testable hypotheses using huge amounts of data. The article argues this is what's missing in current science education. An IBM researcher is quoted as saying that if students "imprint on these small systems" (i.e., the bits of data they can put on their personal computers), then "that becomes their frame of reference and what they're always thinking about". Somehow, according to the people quoted in this article, the fact that there is such an enormous abundance of data changes the way a researcher must think about exploring it. And they argue that this skill set is what undergraduates in today's science programs are lacking.
To rectify this, the National Science Foundation is starting to fund undergraduate education projects that teach students how to do research with large amounts of data. And companies like IBM and Google are apparently making some of their computing power and data available to undergraduate students. This is all great—the more places students have in their curriculum to do real explorations of real data, the better.
Though I've dealt with lots of data, I admittedly have never handled data sets that need supercomputing power to process them, so perhaps I'm missing something. But I don't see conceptually how asking questions of a gigabyte of data is all that different from asking questions of a petabyte (a million times more) of data. Both are a lot more numbers than you could stare at on a piece of paper. With both, you need to be able to make observations, generate hypotheses, and figure out ways of testing those hypotheses by either manipulating the data or comparing the data against something else (a simulation, a different set of data, a theoretical calculation). And either way, you still need some intuition about the underlying natural phenomenon that generated the data (knowledge that comes from poking and prodding, or at least looking at, real things) to help ask interesting questions and pick out answers that are not nonsensical.
I would argue that although it may seem like the scale of the data sets is what is changing, what has really been changing for quite some time is the toolset that students must bring to a research career. The technical programming skills, while they don't themselves contribute to the science directly, open up so many avenues of inquiry to a student that without them, researchers are crippled. It would be like trying to research cellular processes without being able to use a microscope. Not only would you be incapable of investigating most of your hypotheses, but the range of thoughts you came up with would itself be constrained because you couldn't imagine addressing a lot of them. So although I'm all for giving students the opportunity to play with the major data sets now being formed, where I would really like to see the emphasis go is towards teaching students programming and data manipulation skills. I'd be tempted to say that every science undergraduate should learn some of those. Even if they learn them on their own dinky little computer in their dorm room with a data set they type in by hand.