Data Analysis in a Parallel Universe

Perhaps the most important superpower of mishmash io is its ability to execute your algorithm code in the fastest way possible.

This is achieved by an approach that is more subtle and sophisticated than simply raising processing speeds. mishmash io analyses the code of the algorithm you have created and finds efficient ways of executing it by adjusting how the data is organised within the database.

This allows different parts of the code to be run simultaneously on different areas of the database, before combining the results to return an answer. We call this superpower parallelism, and in this article we’ll try to explain some of the secrets behind how it works.

Designed into the DNA

We’ve already discussed in a previous article that all the data points in mishmash io are held as if they were local variables within the app, with no specific structure or schema being imposed.

This means developers do not have to use specific programming languages or frameworks to ‘fit’ the requirements of the database when interrogating the data, which in turn allows shorter applications to be written.

However, the implications of this design philosophy go much further than that. The use of local variables gives mishmash io complete flexibility in how it organises the data, which is put to maximum advantage by analysing the code of the algorithm.

The code is converted into an abstract syntax tree – a syntactic structure that describes the steps your algorithm should take. This conversion is done by the interpreter or compiler of your chosen programming language.
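As a rough illustration of what such a tree contains, the sketch below hand-writes the tree for a two-branch conditional as a plain JavaScript object and then walks it. The node shapes loosely follow the ESTree convention used by JavaScript parsers and are simplified here. The variable names and the collectBranches helper are hypothetical, and none of this is meant to depict mishmash io's internal format.

```javascript
// Hand-written syntax tree for:
//   if (kind === "a") { label = "first"; } else if (kind === "b") { label = "second"; }
// Simplified, ESTree-like node shapes (an assumption for illustration only).
const tree = {
    type: "IfStatement",
    test: {
        type: "BinaryExpression",
        operator: "===",
        left: { type: "Identifier", name: "kind" },
        right: { type: "Literal", value: "a" }
    },
    consequent: {
        type: "AssignmentExpression",
        operator: "=",
        left: { type: "Identifier", name: "label" },
        right: { type: "Literal", value: "first" }
    },
    alternate: {
        type: "IfStatement",
        test: {
            type: "BinaryExpression",
            operator: "===",
            left: { type: "Identifier", name: "kind" },
            right: { type: "Literal", value: "b" }
        },
        consequent: {
            type: "AssignmentExpression",
            operator: "=",
            left: { type: "Identifier", name: "label" },
            right: { type: "Literal", value: "second" }
        },
        alternate: null
    }
};

// Hypothetical helper: follow the if / else-if chain and record, for each
// branch, the value the input is compared against and the value assigned.
function collectBranches(node) {
    const branches = [];
    while (node !== null && node.type === "IfStatement") {
        branches.push({
            when: node.test.right.value,
            assigns: node.consequent.right.value
        });
        node = node.alternate;
    }
    return branches;
}

console.log(collectBranches(tree));
```

Because the tree is ordinary structured data, a program can answer questions such as "which values does this code distinguish between?" without ever executing the code.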

The result of this code analysis makes it possible in principle to determine all the different ways of performing the required steps in the algorithm. mishmash io can then use its knowledge of how the data is stored to select the most efficient way of running the algorithm within the database.

The syntax tree is much easier to process than the original algorithm code. With a traditional database, the tree is turned into executable code that repeatedly takes data from the database and analyses it within the application.

With mishmash io, things are reversed – the syntax tree is sent to the network where the data is stored. There, the control flow of the algorithm is analysed to obtain a structure akin to a finite state machine model. This is a mapping of the total output space of the algorithm; in other words, what outputs or state transitions are produced for each combination of inputs and states.

Each execution of the algorithm will produce a single output within this space, but by understanding both the set of possible outputs and the distribution of data around the network, mishmash io can work out which outputs can be calculated at each node.
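One way to picture this step is as a lookup problem: given the output space and a record of which inputs live on which node, work out the outputs each node can produce locally. The sketch below does that with plain JavaScript data structures. The names outputSpace, placement and planOutputs, and the sample values, are all invented for illustration and say nothing about how mishmash io represents this internally.

```javascript
// Assumed output space: each input the algorithm distinguishes, mapped to
// the output it produces (hypothetical values).
const outputSpace = new Map([
    ["a", "first"],
    ["b", "second"],
    ["c", "third"]
]);

// Assumed data placement: which inputs are stored on which node.
const placement = {
    node1: ["a", "c"],
    node2: ["b"]
};

// For every node, list the outputs it can calculate from its local data.
function planOutputs(space, nodes) {
    const plan = {};
    for (const [node, inputs] of Object.entries(nodes)) {
        plan[node] = inputs
            .filter((input) => space.has(input))
            .map((input) => space.get(input));
    }
    return plan;
}

console.log(planOutputs(outputSpace, placement));
```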

This process is strengthened further by mishmash io's ability to find sets or subsets with certain characteristics, without anyone having to define indexes or relations.

In this way, optimised versions of the algorithm that can be run at each node are produced and distributed around the cluster, which allows elements of the code to be run in parallel at many locations simultaneously.

However, there is more to it than that – thanks to all the code analysis, mishmash io now also understands which outputs can be calculated directly, without needing to run the actual algorithm.

So when an instance of the algorithm is presented by the application, at least some of the results have often already been computed at different points in the cluster, and any that have not can quickly be found using mishmash io’s knowledge of the output space.
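The effect is comparable to memoisation. The following sketch, which assumes a simple in-memory Map as the store of known results, returns a stored output where one exists and only runs the algorithm otherwise. The resolve function, the stand-in algorithm and the run counter are hypothetical and are only there to make the behaviour visible.

```javascript
// Results assumed to have been computed ahead of time (hypothetical).
const known = new Map([["a", "A"]]);

let runs = 0; // counts how often the algorithm itself had to run

// Stand-in for the developer's algorithm.
function algorithm(input) {
    runs += 1;
    return input.toUpperCase();
}

// Return a known output directly; otherwise run the algorithm and remember it.
function resolve(input) {
    if (!known.has(input)) {
        known.set(input, algorithm(input));
    }
    return known.get(input);
}

console.log(resolve("a")); // A (already known, algorithm not run)
console.log(resolve("b")); // B (computed on demand)
console.log(runs);         // 1
```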

The results are returned to the application and then combined appropriately to provide the solution. This process is illustrated in the diagram below.

In summary, mishmash io optimises the running of complex algorithms by a three-pronged approach:

  • Identifying the most efficient way of running the algorithm.
  • Understanding the entire set of outputs that can be calculated based on the algorithm and the data.
  • Executing only the parts of the code that are appropriate for the data stored at each location, before combining the results to find the overall solution.

[Image: DNA Diagram]

An agricultural example

Conceptually, this is quite a complex scenario, and very different to the way that databases have been interrogated in the past, so we need a very simple example to illustrate it – and what could be simpler than a nursery rhyme?

Imagine a farm in Scotland, which just happens to be run by a farmer named Old MacDonald. He has many different animals, located here and there, and they all make different noises.

Suppose we want to create an algorithm that can generate the words of the nursery rhyme “Old MacDonald Had A Farm”, based on the data on the farm animals in a database.

For simplicity, let's store all the data in a single mishmash, as an object that holds one array of animals per species:

mishmash.farm_animals = {
    cows: [
        {
            location: "here"
        },
        {
            location: "there"
        }
    ],
    pigs: [
        {
            location: "here"
        },
        {
            location: "there"
        }
    ],
    // and so on...
};

For the purposes of this example, let’s assume that this information is distributed over a cluster of nodes, and that each node holds information on one species of animal.
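Under that assumption, the placement can be pictured as one partition per species. The sketch below uses a plain object in place of mishmash.farm_animals and a hypothetical partitionBySpecies helper; the node names are made up.

```javascript
// Plain-object stand-in for the data stored in the mishmash.
const farmAnimals = {
    cows: [{ location: "here" }, { location: "there" }],
    pigs: [{ location: "here" }, { location: "there" }]
};

// Hypothetical helper: give each species its own (made-up) node.
function partitionBySpecies(data) {
    return Object.entries(data).map(([species, animals], index) => ({
        node: "node" + (index + 1),
        species: species,
        animals: animals
    }));
}

console.log(partitionBySpecies(farmAnimals));
```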

The lyrics are quite simple and repetitive and the only parts that change are the animal and the sound it makes. So, let’s write a simple algorithm that will yield a single verse for each animal:

var verses = mishmash(function* (input) {
    // Loop over the species names (the keys of farm_animals).
    for (var animals in input.farm_animals) {
        var sound;

        // Pick the sound that goes with this species.
        if (animals === "cows") {
            sound = "moo";
        } else if (animals === "pigs") {
            sound = "oink";
        } else if (animals === "ducks") {
            sound = "quack";
        } else if (animals === "chickens") {
            sound = "cluck";
        } else if (animals === "cats") {
            sound = "meow";
        }

        // Yield one complete verse for this species.
        yield "Old MacDonald had a farm, E-I-E-I-O\n" +
            "And on his farm he had some " + animals + ", E-I-E-I-O\n" +
            "With a " + sound + " " + sound + " here\n" +
            "And a " + sound + " " + sound + " there\n" +
            "Here a " + sound + ", there a " + sound + "\n" +
            "Everywhere a " + sound + " " + sound + "\n" +
            "Old MacDonald had a farm, E-I-E-I-O\n";
    }
});

In traditional database systems, such an algorithm would take the first animal in the input, check the data to assign the proper sound to that animal, and then yield the verse. Then, it would take the second animal and repeat, then the third, and so on.

The lyrics would be contained in the verses variable from where they can be printed:

for (var verse of verses) {
  console.log(verse);
}
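To try the sequential behaviour described above without a database, a generator can be called directly on a plain object. The sketch below is such a stand-in: it leaves out the mishmash() wrapper, replaces the if...else chain with a lookup table and shortens the verse to one line, so it approximates the example rather than reproducing it. The names singVerses and sounds are hypothetical.

```javascript
// Plain-object stand-in for the input that the mishmash would supply.
const input = {
    farm_animals: {
        cows: [{ location: "here" }, { location: "there" }],
        pigs: [{ location: "here" }, { location: "there" }]
    }
};

// Lookup table standing in for the if...else chain.
const sounds = {
    cows: "moo",
    pigs: "oink",
    ducks: "quack",
    chickens: "cluck",
    cats: "meow"
};

// Yields one (shortened) verse per species, strictly one after another.
function* singVerses(data) {
    for (const animals in data.farm_animals) {
        const sound = sounds[animals];
        yield "And on his farm he had some " + animals +
            ", with a " + sound + " " + sound + " here";
    }
}

for (const verse of singVerses(input)) {
    console.log(verse);
}
```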

mishmash io uses a different approach to compute the same result. The supplied code is analysed and it is found that the set farm_animals is further split into subsets of cows, pigs, etc.

The required output (in this case, each verse) is also determined.

Finally, the control flow of the algorithm is analysed and reveals two requirements: the for loop shows that each species in farm_animals produces a single verse, and the if...else conditional statements show that the sound variable is derived from that same input.

Armed with this understanding of the data distribution and the algorithm code, mishmash io can now run the algorithm. It examines each input (the animal), and then runs the code for that input on the node which holds the data on that animal. For the cows, for example, the code runs with the animal and its sound already filled in:

yield "Old MacDonald had a farm, E-I-E-I-O\n" +
    "And on his farm he had some " + "cows" + ", E-I-E-I-O\n" +
    "With a " + "moo" + " " + "moo" + " here\n" +
    "And a " + "moo" + " " + "moo" + " there\n" +
    "Here a " + "moo" + ", there a " + "moo" + "\n" +
    "Everywhere a " + "moo" + " " + "moo" + "\n" +
    "Old MacDonald had a farm, E-I-E-I-O\n";

It can do this on multiple nodes simultaneously. Once each node has computed its output, the outputs are returned to the app, which links them together to give the complete song lyrics.
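The per-node step and the final combination can be sketched in plain JavaScript as follows. Each simulated node receives a copy of the verse code with its species and sound already fixed, and the app joins the results. The specialise and runCluster names are hypothetical, the verse is shortened, and the map call merely stands in for work that a real cluster would carry out concurrently.

```javascript
// Build the verse code for one species with its constants already bound,
// mimicking the specialised version sent to the node that holds that species.
function specialise(animals, sound) {
    return function runOnNode() {
        return "And on his farm he had some " + animals + ", E-I-E-I-O\n" +
            "With a " + sound + " " + sound + " here\n";
    };
}

// One simulated node per species (assumed placement).
const nodes = [
    specialise("cows", "moo"),
    specialise("pigs", "oink")
];

// Each node computes its own verse; on a real cluster these calls would
// happen at the same time rather than in sequence.
function runCluster(nodeTasks) {
    return nodeTasks.map((task) => task());
}

// The app links the partial results together.
const song = runCluster(nodes).join("\n");
console.log(song);
```

Binding the constants up front is what lets each node skip the loop and the conditional chain entirely: by the time the code arrives, the only work left is building the string.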

But seriously...

Old MacDonald is obviously a rather silly example that hardly needs state-of-the-art distributed computation to solve; indeed, mishmash io would probably take longer to compute the result than the app would on its own, using no database at all.

However, in the real world, 'query' algorithms on large data sets will have hundreds of input parameters, IF statements, loops, functions and arithmetic or other expressions that produce the desired output.

In these situations, quickly finding the most appropriate subsets and transforming the entire algorithm into an optimised equivalent produces the greatest gains in performance.

A more realistic scenario is the football facts algorithm that was discussed in a previous article.

This algorithm examines all the possible combinations of data about football matches between two teams, looking for splits that lead to an increase in the information gain as a way of revealing structure (that is, a relation) between two data points.

Assuming that the data is split over nodes in a cluster, mishmash io will examine the splits at each node to determine which ones can be ignored because they do not reveal any structure. Then, when the algorithm is run, time is saved by not returning splits that add no value.
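For readers unfamiliar with the measure, information gain is the drop in entropy achieved by a split: the entropy of the whole set minus the size-weighted entropy of its parts. The sketch below computes it for a handful of match results and discards the splits with no gain. The outcome labels, the candidate splits and the function names are invented for illustration and are not taken from the football facts algorithm itself.

```javascript
// Shannon entropy (in bits) of a list of outcome labels.
function entropy(labels) {
    const counts = {};
    for (const label of labels) {
        counts[label] = (counts[label] || 0) + 1;
    }
    let h = 0;
    for (const count of Object.values(counts)) {
        const p = count / labels.length;
        h -= p * Math.log2(p);
    }
    return h;
}

// Information gain of splitting `parent` into the given parts.
function informationGain(parent, parts) {
    let weighted = 0;
    for (const part of parts) {
        weighted += (part.length / parent.length) * entropy(part);
    }
    return entropy(parent) - weighted;
}

// Made-up outcomes of four matches between two teams.
const results = ["win", "win", "loss", "loss"];

// Two hypothetical ways of splitting those matches.
const candidates = [
    { name: "home vs away", parts: [["win", "win"], ["loss", "loss"]] },
    { name: "odd vs even week", parts: [["win", "loss"], ["win", "loss"]] }
];

// Keep only the splits that reveal some structure.
const useful = candidates.filter(
    (split) => informationGain(results, split.parts) > 0
);
console.log(useful.map((split) => split.name)); // [ 'home vs away' ]
```

In this toy data the first split separates wins from losses perfectly (a gain of 1 bit), while the second leaves each part as mixed as the whole (a gain of 0), so only the first is worth returning.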

Furthermore, the system will know how best to compute each split by identifying the required data sets and where they are located, rather than having to laboriously extract each pair from the database, analyse it, and then repeat.

Knowledge is power

Thanks to strides in the analysis of software code, a lot of the ‘heavy lifting’ that used to be performed by the code of an algorithm to manipulate the data can now be done by the in-built mechanism of mishmash io, provided that the algorithm is written simply and doesn’t rely on specific programming languages, APIs or frameworks.

So, as a developer, just keep your algorithm code as short, clean and easy to understand as possible, and mishmash io will be able to decipher it and work out optimal ways of running your query over whatever data storage structure is in place. Behold the secret behind the parallelism superpower!

However, keeping the code simple doesn’t mean you need to avoid tackling complicated or challenging problems. In fact, quite the opposite – you should seek to use as much of the capacity of your database as possible and execute as much processing as you wish, because the more information that mishmash io can gather from the algorithm, the more it can analyse, understand and optimise how to efficiently execute the code over the data.

In other words, you can use mishmash io’s machine learning abilities to maximise the ‘return’ from your ‘investment’ in writing your algorithm – which in turn can enhance your superpowers…