Unstructured Data: 4 Questions & Answers You Need To Know

Published in Developer

To begin with, yesterday was like any other Monday morning. I woke up, skulled 2 coffees and opened my wardrobe to find something to wear.

But that’s when it hit me – my wardrobe was f*cking messy.

Rogue socks were splattered across every corner, dirt and crumbs were sprinkled throughout my shirts drawer, and soggy swimmers were hanging from every hook. I even saw a few ants.

I’d been away and things got out of control. Blame my social life, blame me for being Gen Z, or blame my unnecessary clothing accumulation habits. I don’t really care what gets the blame – the main point is that I couldn’t locate anything I needed.

It was just too much to manage. The decision of what to wear was unbearable. I entertained going back to bed – and then remembered that my emails don’t answer themselves and thought ‘hmm better not’.

By now you’re either thinking 1) Sort yourself out and stop living like a pig, or 2) What’s your disorganised personal life got to do with me, and why are you forcing me to read about it?

I’m sharing this lame moment in time because at the core of this anecdote lies a strong parallel with our decision-making in business.

Fundamentally, we all want to make efficient and effective use of the colossal amounts of data that are becoming available. This is especially true when we are using it to guide our decisions. However, the catch is that we can only do so optimally when it’s in a structured and manageable state.

In many of our businesses, the data we care about is largely digital. Fortunately, access to digital data often isn’t a major problem. We live in a world drowning in these kinds of data points. The information is there. But capturing it, processing it, ensuring it’s high-quality, and effectively using it are the trickiest, and of course most advantageous, parts.

Like wardrobes, we can categorise data into 2 main types: structured and unstructured.

The majority of digital data is unstructured – it makes up 80-90% of all the digital data in the world.

Because unstructured data is so prevalent and accumulates so quickly, it’s essential we have effective ways to handle and utilise it.

This blog post explains the differences between structured and unstructured data, and answers 4 common questions we come across on the topic of unstructured data:

Is natural language unstructured data?
Are receipts structured or unstructured?
What does AI have to do with unstructured data?
Why is Unstructured Text Data Important in Decision Making?

Structured Data

Structured data can be thought of as the boot-lickers, obedient rule-stickers, or conforming sheep. Take your pick. It fits neatly into whatever database system you’ve designed and is very predictable because it behaves just as you demand.

In other words, you can easily put structured data into specific and pre-determined fields, e.g., names, locations, credit card numbers.
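
To make that concrete, here is a minimal Python sketch of a structured record. The `Customer` class, its field names, and the sample values are assumptions made up for illustration; the only point is that every value lands in a pre-determined field, in a predictable order.

```python
from dataclasses import dataclass, asdict


@dataclass
class Customer:
    # Hypothetical fields, mirroring the examples in the text.
    name: str
    location: str
    card_number: str


def to_row(customer: Customer) -> tuple:
    """Flatten a record into a table row, in field order."""
    return tuple(asdict(customer).values())


row = to_row(Customer("Ada Lovelace", "London", "4111 1111 1111 1111"))
print(row)
```

Because the shape is fixed up front, every record converts to a row the same way, which is what makes structured data so easy to store and query.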

Unstructured Data

Unstructured data is where it gets interesting. It’s a lot more rebellious – refusing to be confined to simple and pre-defined fields.

E.g.

Text that needs context to be processed
Image, Video and Audio files
Internet of things (IoT) data
Device and network data
Other complex data such as weather data & data generated from social media (e.g., behaviour of users, user demographics, sentiment analysis of language, clicks).

It usually requires more than just rule-based algorithms to deal with unstructured data. You need something closer to humans, something more intelligent. So, this is naturally where machine learning loves to make an appearance.

Common Questions

Q1. Is Natural Language Unstructured Data?

Unsurprisingly, the term “unstructured data” can be interpreted to mean very different things in different scenarios. We, therefore, need to clarify exactly what we’re talking about so we can stay on the same page.

If you asked your English Professor, unstructured data would refer to some data that has absolutely no structure – literally “without structure”.

Take natural language, for example. To your English Professor, it’s structured data, as it follows many strict rules like grammar, punctuation, and spelling.

Paradoxically, in the realm of computer science, natural language is considered unstructured data.

As described earlier, your Data Scientist would tell you unstructured data just means that the data isn’t pre-defined or stored in a structured database format. Instead, it takes shape in a very complex manner. Like natural language – which is merely guided by the language’s laws.

Q2. Are receipts structured or unstructured?

Receipts and invoices may seem to be in a bit of a grey area. The different documents have similar basic structures and content, containing constant and necessary data points like:

Date
Total amount
Vendor name

They begin to vary with the layout and other variable data, such as discounts, line items, more dates, branding, etc.

When you get your hands on receipt data, it doesn’t fit neatly into nice wee standardised table formats (as you’d expect structured data to).

Also, receipts are often initially stored as image files – certainly placing them into the unstructured data category.

However, sometimes they’re considered what’s called “semi-structured” data, due to the relative simplicity and consistency of the data inside the document, coupled with how easy it is for us humans to understand. You should know, however, that this is debatable.
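
As a rough illustration of that grey area, the sketch below pulls the three constant data points (vendor name, date, total amount) out of raw receipt text using simple rules. The sample receipt, the `parse_receipt` function, and the layout it assumes (vendor on the first line, an ISO-style date, a line starting with TOTAL) are all hypothetical, and this is not how any particular OCR product works.

```python
import re

# Hypothetical receipt text, e.g. what an OCR step might produce.
RECEIPT_TEXT = """\
CORNER CAFE
Date: 2023-03-06
Flat white      4.50
Muffin          3.80
Discount       -0.30
TOTAL           8.00
"""


def parse_receipt(text: str) -> dict:
    """Pull the three constant data points out of raw receipt text."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    date = re.search(r"\d{4}-\d{2}-\d{2}", text)
    total = re.search(
        r"^TOTAL\s+(-?\d+\.\d{2})$", text, re.MULTILINE | re.IGNORECASE
    )
    return {
        # Assumption: the vendor name is the first non-empty line.
        "vendor": lines[0] if lines else None,
        "date": date.group(0) if date else None,
        "total": float(total.group(1)) if total else None,
    }


receipt = parse_receipt(RECEIPT_TEXT)
print(receipt)
```

Rules like these work while the layout holds and break as soon as a vendor moves the total or writes the date differently, which is a fair summary of why receipts are hard to call fully structured.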

Q3. What does AI have to do with unstructured data?

Computers are built to deal with numbers and thus structured data. Not unstructured data.

AI technology goes beyond traditional rule-based computing and is proving useful for getting unstructured data into a more computer-friendly and manageable state.

It’s like a friendly robot arriving to clean up my wardrobe.

A variety of AI technologies are emerging for handling unstructured data, e.g.,

NLP for extracting information out of text (like emails, social media text, articles, business documents).
Speech-to-text conversion (for transforming audio speech into text).
Pattern recognition algorithms (for identifying and enabling categorising of objects – e.g., people, animals, biomedical data, and much more).

At its core, AI is being used for these 2 main purposes in unstructured data:

Data Management – AI has the power to process unstructured data and transform it into structured data.

Data Quality – AI can be used to increase the quality of the data, through ‘pre-processing’ filtering and ‘post-processing’ enrichment of the data.
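
The two purposes above can be sketched as a small pipeline: filter the raw inputs, turn each one into a record, then enrich the record. Everything here is an assumption for illustration, including the function names, the 10-character threshold, and the `toy_extract` stand-in, which simply wraps the text and does no real AI work.

```python
from typing import Callable, Iterable, List


def is_usable(text: str) -> bool:
    """Pre-processing filter: drop empty or very short inputs."""
    return len(text.strip()) >= 10  # hypothetical threshold


def enrich(record: dict) -> dict:
    """Post-processing enrichment: add a derived field to the record."""
    enriched = dict(record)
    enriched["word_count"] = len(record["text"].split())
    return enriched


def run_pipeline(
    texts: Iterable[str], extract: Callable[[str], dict]
) -> List[dict]:
    """Filter, extract, then enrich each piece of raw text."""
    cleaned = (t.strip() for t in texts if is_usable(t))
    return [enrich(extract(t)) for t in cleaned]


def toy_extract(text: str) -> dict:
    # Stand-in for an AI model; it only wraps the text in a record.
    return {"text": text}


records = run_pipeline(["", "ok", "Please refund my last order."], toy_extract)
print(records)
```

In a real system the `extract` step is where a trained model would sit; the filter and enrichment steps around it are what the text calls data quality.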

Q4. Why is Unstructured Text Data Important in Decision Making?

Processing unstructured data is especially important when we want to know about the context and sentiment (e.g., emotions) of text. Textual disambiguation is incredibly valuable because it allows text to be put into a structured database and then used for decision-making.
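
Here is a deliberately simple sketch of that flow: free text goes in, a sentiment label comes out, and the result is stored in a structured table that can be queried. The word lists, sample reviews, and table layout are assumptions invented for this example, and counting words is a crude stand-in for the AI-based sentiment analysis discussed here.

```python
import sqlite3

# Toy word lists, assumed for illustration; real systems learn these signals.
POSITIVE = {"love", "great", "fast"}
NEGATIVE = {"broken", "slow", "refund"}


def sentiment(text: str) -> str:
    """Label text by counting distinct positive and negative words."""
    words = {w.strip(".,!?").lower() for w in text.split()}
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


reviews = [
    "Love it, great service!",
    "Arrived broken and support was slow.",
    "It is a kettle.",
]

# Store each review with its label in an in-memory structured table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reviews (text TEXT, sentiment TEXT)")
conn.executemany(
    "INSERT INTO reviews VALUES (?, ?)",
    [(r, sentiment(r)) for r in reviews],
)
counts = dict(
    conn.execute("SELECT sentiment, COUNT(*) FROM reviews GROUP BY sentiment")
)
print(counts)
```

Once the labels sit in a table, a question like "how many negative reviews did we get?" becomes an ordinary query, which is the step that makes the text usable for decision-making.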

We want to use the information to gain insights at scales that we could never have even come close to before (i.e., without AI and computers).

Imagine making business decisions based upon 10-20% of existing digital data. Well, most companies don’t need to imagine. The reality is, without utilising the right tools to handle unstructured data, you’re limited to merely structured data.

This hugely decreases your ability to make confident and optimal decisions. Imagine being able to make use of 8x that amount of data, and often a richer, more natural kind of data too. Well, that’s what this wave of innovation in unstructured data processing technologies enables you to do.

Begin testing the Receipt & Invoice OCR API for free now if you are working on any OCR receipt scanning or other unstructured text data projects.

In case you were curious

I spent the rest of my morning wiping out my drawers, folding my clothes, and shooing away ants. By 11am I was able to find the space in my brain to decide what to wear.

With the help of an AI robot, I could’ve got on with my day 2 hours earlier, and probably chosen a better outfit.