DATA CURATION: WEAVING RAW DATA INTO BUSINESS GOLD

WorldLine Technology
Jun 18, 2019

Credit by Bill Schmarzo

The Big Data craze caught fire with a provocative declaration that “Data is the New Oil”: that data will fuel economic growth in the 21st century in much the same way that oil fueled economic growth in the 20th century. The “New Oil” analogy was a great way to contextualize the economic value of data and to give the Big Data conversation an easily recognizable face.

However, understanding the “economics of oil” starts with understanding the difference between raw oil and refined fuel. To create value out of oil, the oil must first be refined. For example, when raw oil (West Texas crude) is refined into high-octane fuel (VP MRX02 high-octane racing fuel), a barrel of the fuel is worth roughly 16.9x as much as a barrel of the raw oil [1] (see Figure 1).

*Figure 1: Refining raw oil into more valuable racing fuel*

Raw crude oil goes through a refinement, blending and engineering process that transforms it into more valuable products such as petroleum naphtha, gasoline, diesel fuel, asphalt base, heating oil, kerosene, liquefied petroleum gas, jet fuel and fuel oils. This critical process must happen before the downstream constituents (like you and me and industrial concerns) can actually get value out of the oil (as gasoline or heating oil or diesel fuel). Oil, in and of itself, is of little consumer or industrial value. It’s only through the refinement process that we get an asset of value (see Figure 2).

*Figure 2: Economic Characteristics of Oil*

Without this oil refinement process, we’d all have to pour barrels of raw oil into our cars and then let the cars do the refining for us. Not exactly a user-friendly experience. Plus, that requirement would have dramatically reduced the value of oil to the world. Yet this is exactly what we do in Information Technology: we give our users access to the raw data and force each use case or application to go through the data refinement process to get something of value (see Figure 3).

*Figure 3: Forcing Cars to Refine their Own Oil*

Forcing every analytic use case or application to curate its own data is not only user-unfriendly; it also dramatically reduces the value of the data to the organization. If we really want to serve the organization’s “consumers of data”, we need a methodical process for refining, blending and engineering the raw data into something of higher value – “curated” data.

The Economics of Curated Data

Data experiences the same economic transformation as oil. Raw data needs to go through a refinement process (cleanse, standardize, normalize, align, transform, engineer, enrich) in order to create “curated” data that dramatically increases the economic value and applicability of the data (see Figure 4).

*Figure 4: Economic Similarities of Oil and Data*
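
To make the refinement steps named above more concrete, here is a minimal sketch in Python that chains four of them (cleanse, standardize, transform, enrich) over a few records. Everything in it is an assumption made for illustration: the field names, the sample records and the function names are hypothetical, and a real curation process would cover far more than this.

```python
from datetime import date

# Hypothetical raw records; field names and values are illustrative only.
raw_records = [
    {"customer": " Acme Corp ", "state": "tx", "amount": "1,200.50", "order_date": "2019-04-04"},
    {"customer": "acme corp", "state": "TX", "amount": "300", "order_date": "2019-04-05"},
    {"customer": "", "state": "CA", "amount": "75.00", "order_date": "2019-04-05"},
]

def cleanse(records):
    """Drop records that are missing a customer name."""
    return [r for r in records if r["customer"].strip()]

def standardize(records):
    """Give text fields one consistent spelling and casing."""
    return [
        {**r, "customer": r["customer"].strip().title(), "state": r["state"].upper()}
        for r in records
    ]

def transform(records):
    """Parse string fields into typed values."""
    return [
        {
            **r,
            "amount": float(r["amount"].replace(",", "")),
            "order_date": date.fromisoformat(r["order_date"]),
        }
        for r in records
    ]

def enrich(records):
    """Add a derived field (ISO weekday, 1 = Monday) so consumers need not recompute it."""
    return [{**r, "order_weekday": r["order_date"].isoweekday()} for r in records]

def curate(records):
    """Run the refinement steps once, in order, for all downstream consumers."""
    for step in (cleanse, standardize, transform, enrich):
        records = step(records)
    return records

curated = curate(raw_records)
for record in curated:
    print(record)
```

The point of the sketch is the shape rather than the details: the refinement runs once, in one place, and every downstream use case reads `curated` instead of repeating the same steps on `raw_records`.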

So, what is curated data?

Wikipedia defines it this way:

“Data curation is the organization and integration of data collected from various sources. It involves annotation, publication and presentation of the data such that the value of the data is maintained over time, and the data remains available for reuse and preservation. Data curation includes ‘all the processes needed for principled and controlled data creation, maintenance, and management, together with the capacity to add value to data.’”

This is a good start, and I will expand upon that definition of curated data with the following additional characteristics:

Time and effort have been invested in the data with the goal of improving data cleanliness, completeness, alignment, accuracy, granularity (the level at which the data is stored), and latency (when the data is available for analysis).
The data sets have been enriched with metadata including descriptive metadata, structural metadata, administrative metadata, reference metadata and statistical metadata.
The data is highly governed to ensure the availability, usability, integrity, security and usage compliance of the data across the organization’s different use cases.
Finally, the data has been cataloged and indexed so the data can be easily searched, found, accessed, understood and re-used.
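
As a rough illustration of the metadata and cataloging characteristics above, the following Python sketch models a catalog entry that carries three of the metadata kinds listed and supports a simple keyword search. The `CatalogEntry` class, the `search` function and the sample entries are all hypothetical; real data catalogs are considerably richer.

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    """Hypothetical catalog record for one curated data set."""
    name: str
    descriptive: dict = field(default_factory=dict)     # what the data is about
    structural: dict = field(default_factory=dict)      # columns, granularity
    administrative: dict = field(default_factory=dict)  # owner, latency, access rules

def search(catalog, keyword):
    """Return names of entries whose name or description mentions the keyword."""
    keyword = keyword.lower()
    return [
        entry.name
        for entry in catalog
        if keyword in entry.name.lower()
        or keyword in entry.descriptive.get("description", "").lower()
    ]

# Illustrative entries only; the names and values are made up.
catalog = [
    CatalogEntry(
        name="customer_orders_daily",
        descriptive={"description": "Cleansed customer orders, one row per order"},
        structural={"granularity": "order", "columns": ["customer", "state", "amount"]},
        administrative={"owner": "sales-analytics", "latency": "next day"},
    ),
    CatalogEntry(
        name="store_inventory_weekly",
        descriptive={"description": "Inventory levels by store and week"},
        structural={"granularity": "store-week"},
        administrative={"owner": "supply-chain", "latency": "weekly"},
    ),
]

print(search(catalog, "orders"))  # ['customer_orders_daily']
```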

[1] My math. Prices on 04/04/2019:

Price of West Texas crude = $62/barrel
Price of VP MRX02 racing fuel = $125/5 gallons, or $25/gallon
1 barrel = 42 gallons
1 barrel of VP MRX02 = $1,050/barrel
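
The footnote’s arithmetic can be checked with a few lines of Python. The prices and the 42-gallon barrel come from the footnote itself; the variable names are just illustrative.

```python
# Figures from footnote [1], prices as of 04/04/2019.
CRUDE_PRICE_PER_BARREL = 62.0      # West Texas crude, USD per barrel
FUEL_PRICE_PER_5_GALLONS = 125.0   # VP MRX02 racing fuel, USD per 5 gallons
GALLONS_PER_BARREL = 42

fuel_price_per_gallon = FUEL_PRICE_PER_5_GALLONS / 5
fuel_price_per_barrel = fuel_price_per_gallon * GALLONS_PER_BARREL
value_multiple = fuel_price_per_barrel / CRUDE_PRICE_PER_BARREL

print(f"Racing fuel: ${fuel_price_per_gallon:.2f}/gallon")       # $25.00/gallon
print(f"Racing fuel: ${fuel_price_per_barrel:,.2f}/barrel")      # $1,050.00/barrel
print(f"Refined fuel vs. raw crude: {value_multiple:.1f}x")      # 16.9x
```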