Myths and Principles of Data Usability

Big Data, predictive analytics, real time analytics, machine learning, are today’s top buzzwords. They show up in job descriptions, tech articles, industry panel interviews and they inevitably snake their way into product roadmaps. It’s easy to believe that the answer to any problem is simply to get more data, but in order to set up a succesful data business you need to first start by making the data more usable.

Here are some thoughts on what data usability means.

Is data usable by humans not usable for machines and viceversa? The answer is “No”. Machine-usable data is also human-usable data.

In recent years, machine learning made huge progress when dealing with unstructured data, progress that was spurred by the awareness that humans are exceptionally good at making assumptions about the implicit properties of a dataset. A model that wants to learn how to convert Celsius degrees and Fahrenheit in recipes will have to make assumptions about the relationship between the two numbers and understand that while 419F is a perfectly suitable temperature to bake a cake, 419C can almost melt zinc and is hard to reach in a domestic oven. By making data more usable for a machine, we also ensure it becomes more usable for humans.

Like in the example of Fahrenheit and Celsius degrees, units of measure — in general metadata — are key to the usability of a dataset. This is a very common conundrum in datasets, where a metric about time spent on a site doesn’t mention the unit of measure of that number. Is it milliseconds, seconds or minutes? Being implicit about data is never a good idea, it may seem to make data less verbose, but it will generate questions that will inevitably come back to haunt you. Data structures should be as explicit as possible.

{ "time_in_oven": {"minutes":120, "hours":2}, "oven_temperature": {"F": 419, "C": 215}}

Database naming consistency is hard to achieve. Table and column names are like an Italian archeological excavation site: made of separate strata, each a representative of that era and highly protected. Analysts and algorithms alike are affected by this lack of consistency as features such as user_id, userid, and uid may all represent the same value, yet are not named the same. A naming change will result in applications failing and queries having to be edited: a pain nobody will want to go through. As you have to start somewhere, think of yourself as a monarch and institute an after-you regnal system that will guarantee consistency in the future, while the older tables get slowly deprecated.

Shamelessly taken from The Zen of Python, this one — like any good principle — contradicts (at least in part) the previous one. When desigining tables you may want to keep data repetition to a minimum, trying to fulfill a lifelong ambition of a world of purity and flawless logical order. Once again, reality will come to crush your dreams. Data structures,— both for columns and rows — should first and foremost be optimised for their primary use case, rather than for overall purity. The way that data will be queried and used needs to be defined and crystallised before the table is designed, and the table design follows the usage, not the other way around.

Schemas are the most precious, yet underestimated aspect of data usability to this day. Several attempts, including the schema.org project have been made to standardise them and in general to convey to the software engineering and database design community the importance of metadata. If all datasets were fully explicit we wouldn’t need schemas, but what cannot be reached with explicit data structures, can be with data schemas.

{
"name": "this is my schema",
"version": 1,
"properties" : [
{"name": "feature1"
"created_at": "2018-01-06T14:03:22UTC",
"description": "myfeature",
"type": "string"},
{"name": "feature2"
"created_at": "2018-01-06T14:03:22UTC",
"description": "myfeature",
"type": "number"}]
}

The production and maintenance of schemas is paramount to the health of a data business, where properties of datasets are described, annotated and updated as time goes by. Instituting a flexible schema repository is the first step towards a better data environment.

Eternal searcher, sample of Italian madness. Product and Usability expert. Find more about me on www.mariacerase.com

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store