Big Data Needs a Big Idea

Operating in a crazy, data-rich world

Big data, machine learning, AI – all buzzwords currently very much in fashion in the world of business. And like all buzzwords, they need to be separated from the hype that surrounds them if they are to be useful.

Analysing data has always been important for understanding behaviour and trends and for predicting future patterns. Although the analysis and predictions improve as the quantity of data increases, you don’t need vast quantities to get going. More important than the quantity of data is what you do with it, and everyone has to start somewhere.

However, there is still a widespread impression that big data is only for big business.

This perception is compounded by the lack of suitable tools for the job. So what’s wrong with what we do now? Traditional relational database structures hark back to a time when most analysis of data in business was to produce accounts and forecast financial performance in the coming months and years.

The data on which the analysis was based – such as sales figures – was very orderly; every order had a buyer at a defined price, so the data’s structure was simple and unchanging. It was very straightforward to produce reports that analysts and business leaders could digest and understand.

Fast forward to the 21st century and the Internet, where people post and share photos, watch videos, discuss, search, apply for a job, buy stuff, choose restaurants in a city they're visiting for the first time, find their way to those restaurants, review them… All human life is there in all its glorious chaos.

Unsurprisingly, the associated data, now being produced at a rate never seen before, reflects this. It’s no longer neat and tidy with unambiguously defined terms that relate to multiple tables of information.

A job site has roles for software developers and software engineers – the two terms are used synonymously. Your last shopping basket might have included a blue shirt, an azure pullover and a pair of navy chinos. It’s a real jumble – a mish mash, you might say.

However, there’s no time to normalise your database to try and make sense of these anomalies, because the website you’re on wants to recommend other job adverts that might be of interest to you, other ‘blue’ clothes you might be tempted to buy, or other restaurants you might like to try that are near to your current location.

So the data is analysed by a machine – a computer algorithm – and the results are passed on in milliseconds rather than in time for the next fiscal quarter. If this isn’t done quickly, then the customer has browsed on to somewhere else and the opportunity is lost.

Moreover, the application that is analysing the data is continuously being updated with new features or improved algorithms to make more accurate predictions faster. These apps need to get into the data immediately.

There’s no way that you can keep restructuring the database to optimise the performance of the app: before you’ve finished reworking the database once, the app has changed again.

To really mine this data for all it’s worth, new ways of storing the information are needed that deal with its ‘untidy’ nature without trying to impose a rigid structure. So-called ‘unstructured’ data (data without a model or schema) makes up the vast majority of data in existence, so it needs to be handled if it is to be useful. And it needs to be handled at lightning speed and by many different applications at the same time.

The old style of database can’t cope with this brave new world. Ideally, there are three fundamental ways that the next generation of databases should be different from what’s gone before.

1. Databases should be completely opaque

While database systems were once the crucial component in applications, today this is no longer the case.

Applications evolve more rapidly than in the past, their features get ever more sophisticated and their code base is more complex as a consequence. The database should not interfere with application development; it should not impose any specific development practices or programming languages.

Data should be available to developers in a completely accessible way, so that there is no need for them to have to switch back and forth between the application and the programming logic required by the database.

In the ideal database, all the data it contains is available as variables within your app, just as though they were held inside the local memory of the computer running the app. These variables should be exactly the way you created them, having the same structure and organization as you would need for your app – integers, strings, dates, arrays, objects, whatever is usually available in your programming language.

The database should have no structure or schema of its own; it should support the ‘schema’ of your application. Furthermore, you should be able to access the variables in the database using the same programming language you’re using in your app – JavaScript, Python, Java, Ruby, etc – without needing to invoke any specific query language (such as SQL) or any get, put or find methods.
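
To make the idea concrete, here is a minimal Python sketch of what ‘data as ordinary variables’ could feel like. It is not the API of any real database: the `Node` class is a hypothetical, purely in-memory stand-in invented for illustration, and the customer record and the set of ‘blue’ colour words are assumed sample data. The point is only that reads, writes and filtering use plain language constructs, with no query language and no get, put or find calls.

```python
from datetime import date


class Node(dict):
    """Hypothetical stand-in for a database-backed variable.

    It is just a dict whose keys can also be read and written as
    attributes. A real system would persist the data remotely;
    this sketch keeps everything in local memory.
    """

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name)

    def __setattr__(self, name, value):
        # Store attributes as ordinary dictionary entries.
        self[name] = value


# Assumed sample data, shaped the way the application wants it.
db = Node()
db.customer = Node(
    name="Ada",
    joined=date(2020, 1, 15),
    basket=["blue shirt", "azure pullover", "navy chinos"],
)

# Plain language constructs: append to a list, take its length.
db.customer.basket.append("blue socks")
basket_size = len(db.customer.basket)

# Filtering with a comprehension instead of a query language.
# The colour family is an assumption made for this example.
blue_family = {"blue", "azure", "navy"}
blue_like = [
    item for item in db.customer.basket
    if item.split()[0] in blue_family
]
```

In a real system the same statements would be backed by shared, persistent storage rather than a local dictionary; that part is deliberately left out of the sketch.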

2. Everything in the database should behave as an index

When using any database, it’s important to be able to access only certain subsets of all the data, in a way that is best suited to the current task in hand.

In relational databases, you can define indices that relate one element in one table to another element in another, and you can use join statements and other techniques to execute intersections, unions and transformations on what’s stored inside the indices to obtain results that have a different format.

If it is to be useful, the next generation database must allow similar operations to be performed, in spite of its unstructured, schema-less design. It must allow each variable to behave like an index in a relational database, so that variables can be compared, correlated and combined, and it must return restructured formats when the inputs and outputs differ.

For example, let’s imagine that you are storing information from a video-streaming website, which includes each video’s ID, duration and title (1).

Splits between the sets will start to emerge.

For example, assuming these are the only values you have stored, Video will be split by 1, 2, 3, duration, 120, 180, 360, title, first title, second title and third title.

However, duration will only be split by 120, 180 and 360.

Similarly, first title will only be present in 1 and title. It is not an element of duration or 180, for example (2).

The ideal database keeps track of such properties in the data that you store (3). This can then bring benefits when you want to perform some automated data analysis using an algorithm (4).
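
The bookkeeping described in the video example can be sketched in a few lines of Python. This is an illustration only, not how any particular database implements it. The text lists the stored values but not how they pair up, so the assignment of a duration and title to each ID below is an assumption, as are the names `build_membership`, `contains` and `present_in`.

```python
from collections import defaultdict

# Assumed pairing of IDs, durations and titles (illustrative only).
videos = {
    1: {"duration": 120, "title": "first title"},
    2: {"duration": 180, "title": "second title"},
    3: {"duration": 360, "title": "third title"},
}


def build_membership(root_name, data):
    """Map every element to the set of elements found beneath it."""
    contains = defaultdict(set)

    def walk(ancestors, value):
        if isinstance(value, dict):
            for key, child in value.items():
                # Each key is a member of every set above it.
                for ancestor in ancestors:
                    contains[ancestor].add(key)
                walk(ancestors + [key], child)
        else:
            # A leaf value is a member of every set on its path.
            for ancestor in ancestors:
                contains[ancestor].add(value)

    walk([root_name], data)
    return contains


contains = build_membership("Video", videos)

# Invert the mapping: for each element, which sets is it present in?
present_in = defaultdict(set)
for container, members in contains.items():
    for member in members:
        present_in[member].add(container)

duration_split = contains["duration"]      # the three durations only
first_title_in = present_in["first title"]  # "Video", 1 and "title"
```

Here `contains["Video"]` holds all eleven elements listed in the text, `contains["duration"]` holds only the three durations, and `first title` turns out to be present in `1` and `title` (plus the enclosing `Video` set itself), but not in `duration` or `180`.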

Indexing Diagram

3. You should not have to optimize the database and queries you make; the database should optimize your app

In the ideal next generation database, which is opaque by design, all of your variables can be added, subtracted, compared, looped, etc, as if they were locally inside the application’s memory.

The problem is that processing data in this way might not be very efficient, especially if the datasets are large.

It will take time to transfer the data over the network, and then the operation will be restricted to a single computer (the one pulling the data out of the database), which can slow the whole process down.

Once the ideal database has identified the intersections, unions and discontinuities between the arbitrary sets of data (stage 1 in the diagram below), it can distribute the related data evenly around a cluster of computing nodes, if one is available (stage 2).

Then, when an application requests some calculation to be done on the data, instead of pulling data out of the database, you push code into it and let it decide the best way to perform the required calculations (stage 3).

Based on the branches you’ve written in that code and the way that the data is spread across the cluster nodes, the database automatically deduces how to break your code into smaller pieces and run them optimally in parallel on many nodes.

In this way, no single node has a greater burden of the processing effort to bear, and transfers of large amounts of data back and forth across the network are avoided (stage 4).

This greatly reduces the time and resources needed to complete the analysis (stage 5). And all the while, the database stays true to its opaque design and does not impose any specifics, such as frameworks, interfaces or methods, that you have to use.

Optimization Diagram

In the ideal database, the more code you push to run on the database, the better, especially if it must be executed very quickly over potentially massive datasets. The more the database knows about what your application needs to do with the data, the better it can optimize its analysis.
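
The difference between pulling data out and pushing code in can be sketched with a toy example. Everything here is assumed for illustration: the ‘cluster’ is just three Python lists standing in for partitions held on separate nodes, a thread pool stands in for the nodes working in parallel, and the function names are hypothetical. A real database would decide for itself how to split and place the code; this sketch hard-codes one simple split (per-node sum and count) to compute an average duration.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical cluster: each list is the partition held by one node.
partitions = [
    [120, 180, 360],
    [240, 90],
    [600, 30, 45, 300],
]


def partial_sum_and_count(partition):
    """Code 'pushed' to a node: runs next to the data, returns a summary."""
    return sum(partition), len(partition)


def average_by_pulling(parts):
    """Pull every record to the client, then compute on one machine."""
    pulled = [value for part in parts for value in part]
    values_transferred = len(pulled)  # every record crosses the network
    return sum(pulled) / len(pulled), values_transferred


def average_by_pushing(parts):
    """Run the summary on each node in parallel, combine tiny results."""
    with ThreadPoolExecutor(max_workers=len(parts)) as pool:
        summaries = list(pool.map(partial_sum_and_count, parts))
    total = sum(s for s, _ in summaries)
    count = sum(c for _, c in summaries)
    values_transferred = 2 * len(summaries)  # one sum and one count per node
    return total / count, values_transferred


pulled_avg, pulled_cost = average_by_pulling(partitions)
pushed_avg, pushed_cost = average_by_pushing(partitions)
```

Both approaches give the same average, but the pushed version moves only two numbers per node regardless of how many records each node holds, so the gap widens as the partitions grow.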

The next generation is here

In a world where business opportunities can be won or lost in a matter of seconds, a database that holds all the information on your potential customers must be able to cope with highly unstructured data, which a traditional SQL-based relational database would only manage with a great deal of time-consuming normalising and regular reworks of the application.

However, a non-SQL database with no structure whatsoever requires a great deal of data retrieval and transfer across networks, which makes rapid automated analysis by algorithms highly resource-hungry.

A new generation of database is therefore needed for companies of all shapes and sizes to use for their online data management and analysis, one which has no schema and imposes no constraints on the applications with which it interacts, but has sufficient intelligence to distribute data logically across the available resources.

And even better, it should be able to perform the required analysis at the nodes in your network where the relevant data is stored, increasing the speed at which results are generated and opening up the benefits of ‘big data’ to all.

And as luck would have it, such a database already exists – it’s called, appropriately, mishmash io.