Climbing The Data Mountain

There are many things that are in short supply in today’s world – privacy, clean air, good manners – but one aspect of modern life that is not lacking is data. There are mountains of data being generated every second of every day. We are overrun with the stuff.

Companies collect and store data on anything and everything they can think of, knowing that it must be useful to help them solve problems, gain market advantage and increase profits.

But in reality, all that data can only be useful if we can analyse and understand what the information is telling us. This need is a major reason for the increased focus on machine learning and artificial intelligence that we’ve seen in recent years, as only computers have any hope of processing all the data being collected so that we can understand what it all means.

The trend so far has been to create applications (apps) that comprise a ‘front end’ for interacting with the data, a ‘back end’ to connect to the data, and a database to store the data.

A lot of work has been done to develop large-scale computational methods to assist with extracting and then analysing the data, but these are often treated separately from the database.

In other words, you have to remove the data, analyse it, then go back to the database and repeat.

All this can seem daunting to companies without a history of handling and analysing data. While their software developers understand the apps inside out, they may not know where to start in creating an efficient algorithm that can be rapidly applied to the database to derive useful trends or targeted information on customers, etc.

Bringing in expertise in traditional databases is one way forward, but as we shall see, these have serious limitations when it comes to handling the ‘big data’ that is increasingly showing itself to be valuable in solving all sorts of problems.

No need to move mountains

mishmash io presents an alternative approach that unites databases and machine learning but does so in a way that developers will find easy to understand and work with. We describe it as a ‘distributed database system’ because it looks and feels like a normal database but is designed to make it easy to take computational methods – algorithms – written in your app and apply them to the database itself.

This approach has several advantages. It’s much faster and easier to move an algorithm from your app into a database than to move the data into your app, especially with the quantity of data typically handled by apps these days.

Also, because mishmash io is set up to organise the data into useful chunks and then optimise how the algorithm is run (more on this later), you only need a couple of hundred lines of code to perform some pretty sophisticated computations on the data – something that a developer can easily create in a short time as a new feature for your app.

This small chunk of code is also made easier to write for two reasons. Firstly, mishmash io does not have a query language; it will accept and apply your algorithm in the language in which you have chosen to write it – Python, Java, Ruby, whatever. Secondly, mishmash io does not have a data schema; it can use the data model – the arrays and objects – that you have set up in your app and apply it to the data as is. There is no need to use special frameworks or specific lines of source code.

Leave the guessing to others

The ease with which algorithms can be written and run in mishmash io helps make the principles of machine learning more accessible.

For example, it is relatively straightforward for a developer to write an algorithm which splits the data, compares the results and looks for a gain in information relevant to the query. Each split is scored for its information gain, and then the split with the highest score is analysed further until all the key parameters needed to give the highest scores are identified by the algorithm.
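To make the idea of scoring a split concrete, here is a minimal sketch in plain Python. It assumes that "information gain" means the usual reduction in Shannon entropy, which the article does not spell out. It does not use any mishmash io API: the function names, field names and sample rows are all invented for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(rows, feature, target):
    """Entropy of `target` before the split minus the weighted entropy after it."""
    before = entropy([row[target] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[feature], []).append(row[target])
    after = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return before - after


def best_split(rows, features, target):
    """Return the feature whose split scores the highest information gain."""
    return max(features, key=lambda f: information_gain(rows, f, target))


# Made-up rows, purely for illustration.
rows = [
    {"venue": "home", "weather": "dry", "result": "win"},
    {"venue": "home", "weather": "wet", "result": "win"},
    {"venue": "away", "weather": "dry", "result": "loss"},
    {"venue": "away", "weather": "wet", "result": "loss"},
]

print(best_split(rows, ["weather", "venue"], "result"))
```

In this toy data the venue separates wins from losses perfectly while the weather tells you nothing, so the venue split scores highest. Repeating the same scoring inside the winning split gives the iterative analysis described above.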

This process of scoring splits in the data is one way (there are others) in which mishmash io enables machine learning within a system – it’s nothing more magical than that!

However, this way of analysing the data iteratively has the great advantage that you (the person posing the question) don’t have to pick a place to start.

In other words, you don’t have to select the parameters that you think might be important in answering the question, which tends to skew the answer in favour of the chosen starting point (which almost inevitably reflects some bias on your part). It also risks missing other parameters that you, in your wildest dreams, might not have expected to be important.

Dual dynamics

So far, we’ve seen that mishmash io offers some key advantages to help those who are new to machine learning and the analysis of large amounts of data.

But don’t be fooled – there’s a lot of very clever stuff going on ‘underneath the hood’. In fact, mishmash io has a dual function when operating, as illustrated in the diagram.

In the first, shown on the left, mishmash io is ‘digesting’ the data in the database.

This involves applying a proprietary algorithm which decomposes the data into convenient chunks that contain the relevant information for answering the question posed by an algorithm.

We call these chunks mishmashes, and they are stored in a distributed file system which allows them to be processed simultaneously at separate nodes within a cluster of computing locations.

Performing computations in parallel in this way can significantly increase the speed with which the algorithm can be run across all the relevant data.
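The general pattern of working on chunks in parallel and then combining the partial results can be sketched on a single machine. This is only an analogy, not a description of how mishmash io schedules work: a thread pool stands in for the nodes of a cluster, and the chunk contents and function names are invented for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def count_results(chunk):
    """Partial result for one chunk: (number of wins, number of defeats)."""
    wins = sum(1 for r in chunk if r == "win")
    defeats = sum(1 for r in chunk if r == "defeat")
    return wins, defeats


def parallel_counts(chunks, workers=4):
    """Process every chunk independently, then combine the partial counts."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_results, chunks))
    total_wins = sum(w for w, _ in partials)
    total_defeats = sum(d for _, d in partials)
    return total_wins, total_defeats


# Made-up chunks standing in for data held at separate nodes.
chunks = [
    ["win", "defeat", "win"],
    ["win"],
    ["defeat", "defeat", "win"],
]

totals = parallel_counts(chunks)
print(totals)
```

The key property is that each chunk can be counted without looking at any other chunk, so the work can be spread out and only the small partial results need to be brought together. Threads in CPython will not speed up pure-Python arithmetic like this; the sketch shows the shape of the computation, not its performance.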

The second function, illustrated on the right, is the digesting of the submitted query (or algorithm), which mishmash io transforms into equivalent algorithms which will run more efficiently across the clusters in which the mishmashes are stored.

The software looks at how the information is stored, including the splits in the data that have already been made to create the mishmash, then looks at the query that the algorithm is designed to answer, and works out a way of efficiently applying one to the other to minimise the amount of computation involved.

Architecture diagram: representation of the functions of mishmash io used to analyse a database to answer a query posed by an algorithm.

And it doesn’t stop there, because the situation is in a constant state of flux. New data is regularly added by the app, and new queries are also received.

So mishmash io works to re-order the way the data is split into mishmashes to optimise the running of each algorithm, as well as transforming each algorithm to best suit the way the data is stored.

This is the linking gearwheel in the centre of the diagram. It’s a synchronous process where one side is continually adjusting in response to changes in the other, similar to a chemical reaction approaching equilibrium. The difference is that a balance is almost never achieved, because new information is continuously being added by the app in today’s data-hungry world.

A game of two halves

The strength of this approach can be illustrated using a light-hearted example: generating interesting statistics about a football match to help a commentator. Historical data on football fixtures can be purchased, giving information on the teams, venue, referee, scores, players, etc.

This is stored in mishmash io exactly as received, where it forms a tree structure.

If you then want to discover what factors lead to England beating Bulgaria, you can create an algorithm in about 200 lines of code that directs mishmash io to do the following:

  1. Create a mishmash of all information on England-Bulgaria fixtures.
  2. Compute the initial ratio of victories to defeats.
  3. Split the data based on each parameter in turn and compute the number of victories versus the number of defeats for each split.
  4. Retain the split whose parameters give the greatest increase in the ratio from its initial value.
  5. Repeat the process for this split for all the other parameters to find the next best split.
  6. Continue this iterative analysis until the parameters and values have been identified that have the strongest effect on the victories:defeats ratio.

For example, this will identify, amongst other things, that Bulgaria hasn’t beaten England at Wembley since 1967 – a serious home advantage!

We discuss this example in much more detail in a separate article.
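The six steps listed above can be sketched in plain Python. This is not the 200-line algorithm mentioned earlier and it does not call mishmash io: it runs on an in-memory list, and every name in it is hypothetical. The fixture records are invented for illustration and are not real results. One further assumption: the victories:defeats ratio is smoothed by adding 1 to both counts, so that a split with no defeats does not cause a division by zero.

```python
def ratio(rows):
    """Smoothed victories:defeats ratio (add-one smoothing is an assumption)."""
    wins = sum(1 for r in rows if r["result"] == "win")
    defeats = sum(1 for r in rows if r["result"] == "defeat")
    return (wins + 1) / (defeats + 1)


def strongest_factors(rows, parameters):
    """Greedily keep the split that most improves the ratio, then repeat inside it."""
    path = []
    current = rows
    remaining = list(parameters)
    best_ratio = ratio(current)                      # step 2: initial ratio
    while remaining:
        best = None
        for param in remaining:                      # step 3: try each parameter
            for value in sorted({r[param] for r in current}):
                subset = [r for r in current if r[param] == value]
                score = ratio(subset)
                if score > best_ratio and (best is None or score > best[0]):
                    best = (score, param, value, subset)
        if best is None:                             # step 6: no split improves the ratio
            break
        best_ratio, param, value, current = best     # step 4: retain the best split
        path.append((param, value))
        remaining.remove(param)                      # step 5: repeat on the other parameters
    return path, best_ratio


# Invented fixture records, for illustration only.
fixtures = [
    {"opponent": "Bulgaria", "venue": "home", "kickoff": "evening", "result": "win"},
    {"opponent": "Bulgaria", "venue": "home", "kickoff": "afternoon", "result": "win"},
    {"opponent": "Bulgaria", "venue": "home", "kickoff": "evening", "result": "win"},
    {"opponent": "Bulgaria", "venue": "away", "kickoff": "evening", "result": "defeat"},
    {"opponent": "Bulgaria", "venue": "away", "kickoff": "afternoon", "result": "win"},
    {"opponent": "Bulgaria", "venue": "away", "kickoff": "afternoon", "result": "defeat"},
    {"opponent": "Elsewhere", "venue": "home", "kickoff": "evening", "result": "defeat"},
]

# Step 1: keep only the fixtures against the opponent of interest.
subset = [f for f in fixtures if f["opponent"] == "Bulgaria"]
path, final_ratio = strongest_factors(subset, ["venue", "kickoff"])
print(path, final_ratio)
```

With this toy data the search keeps the home-venue split and then stops, because splitting the home fixtures by kick-off time does not raise the ratio any further. Draws are simply ignored by `ratio`, which is another simplification.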

Conceptually, this approach is familiar to software developers, and mishmash io gives them a tool to quickly start applying it to data of any size to extract the maximum value that it can provide.

By writing a short algorithm that uses their choice of language and data model, they can use mishmash io to perform data analysis across multiple nodes in a cluster simultaneously without moving large amounts of data around or using special frameworks or specific query languages.

In other words, mishmash io demystifies and democratises machine learning and opens up the endless possibilities that can be conceived when it is applied to understand data and solve problems.