Google Refine lets you fix and handle huge, messy sets of data


Google has just introduced a new product, and this time it's a desktop application (with a browser-based UI). It's called Google Refine, and it solves a problem that is enormous for some people: it lets you take massive sets of "messy data" and massage them into shape so that they're uniform, make sense, and can be statistically analyzed.

The video after the jump shows a very good example, based on a CSV file exported from a publicly available data source (a government contract system, in this case). The data is very realistic – descriptions are inconsistent ("Firm Fixed Price" on some rows and "FFP" on others), and even the number formats are inconsistent (you get 0.78 on one row and a number in the millions on another).
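To get a feel for the kind of cleanup involved, here is a rough Python sketch of collapsing the "Firm Fixed Price" / "FFP" variants into a single label. This is only an illustration, not how Google Refine works internally: the `ALIASES` table, the `normalize_label` helper, and the "Cost Plus" sample value are all made up for this example.

```python
# Hypothetical alias table: maps lowercased variants to one canonical label.
ALIASES = {
    "ffp": "Firm Fixed Price",
    "firm fixed price": "Firm Fixed Price",
}


def normalize_label(raw: str) -> str:
    """Return the canonical label, or the trimmed input if no alias is known."""
    # Collapse runs of whitespace and ignore case before looking up the alias.
    key = " ".join(raw.split()).lower()
    return ALIASES.get(key, raw.strip())


# Sample values in the spirit of the article; "Cost Plus" is an invented extra.
rows = ["Firm Fixed Price", "FFP", " firm  fixed price ", "Cost Plus"]
cleaned = [normalize_label(r) for r in rows]
print(cleaned)
```

Doing this by hand means knowing every variant in advance, which is exactly the tedious part that a tool like Google Refine aims to take off your plate.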

Google Refine lets you easily home in on those inconsistencies and fix them in myriad ways. This is an important data tool because those heaps of messy data are often public records, which are available but not transparent; being able to quickly analyze them could expose some very interesting patterns and anomalies in the way that public institutions and governments behave.

