Myths and Principles of Data Usability

Maria Cerase
3 min read · Jan 6, 2018

Big Data, predictive analytics, real-time analytics and machine learning are today’s top buzzwords. They show up in job descriptions, tech articles and industry panel interviews, and they inevitably snake their way into product roadmaps. It’s easy to believe that the answer to any problem is simply to get more data, but to set up a successful data business you first need to make the data more usable.

Here are some thoughts on what data usability means.

Human vs Machine usability

Is data that is usable by humans unusable for machines, and vice versa? The answer is “No”. Machine-usable data is also human-usable data.

In recent years, machine learning has made huge progress in dealing with unstructured data, progress spurred by the awareness that humans are exceptionally good at making assumptions about the implicit properties of a dataset. A model that learns to convert between Celsius and Fahrenheit in recipes has to make assumptions about the relationship between the two numbers, and understand that while 419F is a perfectly suitable temperature to bake a cake, 419C is almost hot enough to melt zinc and is hard to reach in a domestic oven. By making data more usable for a machine, we also make it more usable for humans.
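To make the implicit assumption concrete, here is a minimal Python sketch of the kind of plausibility check a human does without thinking. The bounds OVEN_MIN_C and OVEN_MAX_C and the function plausible_units are hypothetical, chosen only for illustration; a real domestic oven's range may differ.

```python
# Hypothetical plausibility bounds for a domestic oven, in degrees Celsius.
# These numbers are assumptions for illustration, not a standard.
OVEN_MIN_C = 50.0
OVEN_MAX_C = 300.0


def fahrenheit_to_celsius(f: float) -> float:
    """Convert a temperature from Fahrenheit to Celsius."""
    return (f - 32.0) * 5.0 / 9.0


def plausible_units(value: float) -> list:
    """Return the units under which `value` is a plausible oven temperature."""
    units = []
    # Interpret the bare number as Celsius.
    if OVEN_MIN_C <= value <= OVEN_MAX_C:
        units.append("C")
    # Interpret the bare number as Fahrenheit.
    if OVEN_MIN_C <= fahrenheit_to_celsius(value) <= OVEN_MAX_C:
        units.append("F")
    return units


print(plausible_units(419))  # ['F']
print(plausible_units(200))  # ['C', 'F']
```

A bare 419 can only sensibly be Fahrenheit, but a bare 200 is plausible in both units, so guessing is not enough and the data needs to say which one it means.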

Explicitness

As in the example of Fahrenheit and Celsius degrees, units of measure, and metadata in general, are key to the usability of a dataset. A very common conundrum is a metric about time spent on a site that doesn’t mention its unit of measure. Is it milliseconds, seconds or minutes? Being implicit about data is never a good idea: it may seem to make data less verbose, but it will generate questions that will inevitably come back to haunt you. Data structures should be as explicit as possible.

{ "time_in_oven": {"minutes":120, "hours":2}, "oven_temperature": {"F": 419, "C": 215}}
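The same idea can be carried into code by never storing a bare number. The following is a minimal Python sketch, not a definitive implementation: the Duration class, its `to` method and the SECONDS_PER_UNIT table are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass

# Conversion factors to seconds. The set of supported units is an assumption.
SECONDS_PER_UNIT = {
    "milliseconds": 0.001,
    "seconds": 1.0,
    "minutes": 60.0,
    "hours": 3600.0,
}


@dataclass(frozen=True)
class Duration:
    """A time value that always travels with its unit of measure."""
    value: float
    unit: str

    def __post_init__(self):
        # Reject values whose unit is missing or unknown.
        if self.unit not in SECONDS_PER_UNIT:
            raise ValueError(f"unknown unit: {self.unit!r}")

    def to(self, unit: str) -> "Duration":
        """Return the same duration expressed in another supported unit."""
        if unit not in SECONDS_PER_UNIT:
            raise ValueError(f"unknown unit: {unit!r}")
        seconds = self.value * SECONDS_PER_UNIT[self.unit]
        return Duration(seconds / SECONDS_PER_UNIT[unit], unit)


time_in_oven = Duration(120, "minutes")
print(time_in_oven.to("hours"))  # Duration(value=2.0, unit='hours')
```

Because the unit is part of the value, the question "is it milliseconds, seconds or minutes?" cannot come up, and a record without a recognized unit fails loudly instead of being silently misread.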

Naming Consistency

Database naming consistency is hard to achieve. Table and column names are like an Italian archeological excavation site: made of separate strata, each representative of its era and highly protected. Analysts and algorithms alike are affected by this lack of consistency, as features such as user_id, userid, and uid may all represent the same value, yet are not named the same. A naming change will result in…
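One common way to cope with names like user_id, userid, and uid without renaming the underlying columns is an alias table applied when the data is read. The following is a minimal Python sketch under that assumption; CANONICAL_NAMES and normalize_record are hypothetical names, and the alias list would have to be curated per dataset.

```python
# Hypothetical alias table: every spelling seen in the data maps to one
# canonical name. Keys are lowercase so that lookups are case-insensitive.
CANONICAL_NAMES = {
    "user_id": "user_id",
    "userid": "user_id",
    "uid": "user_id",
}


def normalize_record(record: dict) -> dict:
    """Return a copy of `record` with known aliases renamed to canonical names."""
    normalized = {}
    for key, value in record.items():
        # Unknown keys are kept unchanged.
        canonical = CANONICAL_NAMES.get(key.lower(), key)
        # Two aliases in one record must agree, otherwise the data is ambiguous.
        if canonical in normalized and normalized[canonical] != value:
            raise ValueError(f"conflicting values for {canonical!r}")
        normalized[canonical] = value
    return normalized


rows = [{"uid": 42, "country": "IT"}, {"UserID": 42, "country": "IT"}]
print([normalize_record(row) for row in rows])
```

Both rows come out with the same user_id key, so analysts and models downstream see one feature instead of three.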


Maria Cerase

Eternal searcher, sample of Italian madness. Product and Usability expert. Find out more about me at www.mariacerase.com