I’m working on a project for a ChiHackNight group. We’re developing a National Voter File to encourage people to run for local offices. Lack of voter information constitutes the largest problem. Of course, you can buy it for big bucks from consulting firms, but this rules out most of us. We’re trying to provide these tools for free. I’m in charge of getting census tract data into the data warehouse. So I need to do some data transformation from the census into our project’s format.
In the past few years, census data has gotten much easier to access via a set of APIs. For those of you who are unfamiliar with Applications Programming Interfaces, it’s a way of getting data from a site via web calls. For instance, with a single web URL, you can specify the Decennial Census (2010) populations, including ‘sex by age’ for all census tracts in a state . This simplifies downloading huge tables by delivering only the information you from a single call.
In starting this project up, my colleague, Ben Galewski, showed me a tool called Pentaho Data Integration, formerly known as Kettle. It is an Extract Transform Load (ETL) application for doing data transformation. ETL came about with the rise of big data and the accompanying necessity to use data sets in combination with other data. With PDI, I can take, data from a variety of formats, run it through various transformation steps and put it out to los of databases and other formats.
Back to the Future
As I started using this very valuable tool, I was reminded of the significant progress we’ve made in transferring data. When I started using computers (1969), we only used keypunch cards for entry and line printers for output. By 1978, when I started working with microcomputers, we had ‘portable’ storage media, but an 8″ diskette from one computer could not be read in another brand of computer. Now, DVDs, USB thumb drives are universally readable if you have that input option on the device you’re using. Furthermore, the program formats are standardized or import functions exist to read common formats, providing rudimentary data transformation. Plus, with the input of some keystrokes, we can transfer masses of information from one remote location to another.
Aha, it turns out there’s a catch with my specific project. The Census Department created a non-standard method for putting out their data (IMHO). As a result, PDI gets the data in a single field and refuses to break it out into separate fields. So I’m back to the drawing board. I guess I’ll have to code this particular program from scratch. Even though we’ve come so far, I’m still facing the same problem. I want to transform data from one format into another. Instead of using a tool, I’ll have to write a whole program for this fairly simple task.