Adopting algorithms is the way to extract value from patterns in data. Leave the guessing to others.
Learn what makes mishmash io a unique database by following a simple example.
In this article you will learn how mishmash io can automatically parallelize your algorithms.
Perhaps the most important superpower of mishmash io is its ability to execute your algorithm code in the fastest way possible.
This is achieved by a much more subtle and sophisticated approach than simply faster processing speeds.
mishmash io is able to analyse the code of the algorithm you have created and find efficient ways of running it by adjusting how the data is organised within the database.
This allows different parts of the code to be run simultaneously on different areas of the database, before combining the results to return an answer. We call this superpower parallelism, and in this article we’ll try to explain some of the secrets behind how it works.
We’ve already discussed in a previous article that all the data points in mishmash io are held as if they were local variables within the app, with no specific structure or schema being imposed.
This avoids developers having to use specific programming languages or frameworks to ‘fit’ the requirements of the database when interrogating the data, which allows shorter applications to be written.
However, the implications of this design philosophy go much further than that. The use of local variables gives mishmash io complete flexibility in how it organises the data, which is put to maximum advantage by analysing the code of the algorithm.
The code is converted into an abstract syntax tree – a syntactic structure that describes the steps your algorithm should take. This is done by the interpreter or compiler used by your choice of programming language.
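As a rough illustration of what such a tree looks like, Python's standard ast module can parse a small function and list the kinds of node it contains. The function pick_sound below is made up for this sketch, and the snippet only shows what the language's own parser produces; it says nothing about how mishmash io processes the tree afterwards.

```python
import ast

# Hypothetical function, used only to have something to parse.
SOURCE = '''
def pick_sound(animal):
    if animal == "cows":
        return "moo"
    return "oink"
'''

tree = ast.parse(SOURCE)

# Collect the type name of every node in the tree.
node_types = [type(node).__name__ for node in ast.walk(tree)]

# Control-flow statements are now explicit objects rather than text.
has_branch = any(isinstance(node, ast.If) for node in ast.walk(tree))

print(node_types)
print("contains an if statement:", has_branch)
```

The point of the sketch is that the if statement and the comparison inside it become ordinary objects that a program can inspect, which is what makes the later analysis steps possible.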
The result of this code analysis makes it possible in principle to determine all the different ways of performing the required steps in the algorithm. mishmash io can then use its knowledge of how the data is stored to select the most efficient way of running the algorithm within the database.
The syntax tree is much easier to process than the original algorithm code. When used with a traditional database, executable code is generated that repeatedly takes data from the database and analyses it within the application.
With mishmash io, things are reversed: the syntax tree is sent to the network where the data is stored. There, the control flow of the algorithm is analysed to obtain a structure similar to a finite state machine model. This is a mapping of the total output space of the algorithm; in other words, what outputs or state transitions are produced for each combination of inputs and states.
Each execution of the algorithm will produce a single output within this space, but by understanding both the set of possible outputs and the distribution of data around the network, mishmash io can work out which outputs can be calculated at each node.
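A toy model of this idea, assuming a small and finite set of inputs: evaluate the algorithm on the values each node holds and record which outputs that node can produce on its own. The classify function, the node names and the node_data layout are all hypothetical, and the brute-force evaluation here merely stands in for the control-flow analysis described above.

```python
def classify(temperature):
    """Hypothetical algorithm with three possible outputs."""
    if temperature < 0:
        return "freezing"
    elif temperature < 20:
        return "mild"
    return "hot"

# Assumed data placement: each node holds a slice of the readings.
node_data = {
    "node-a": [-5, -1],
    "node-b": [3, 12, 19],
    "node-c": [25, 31],
}

# For each node, the set of outputs it can produce from its local data alone.
outputs_per_node = {
    node: {classify(value) for value in values}
    for node, values in node_data.items()
}

# The union over all nodes is the part of the output space this data can reach.
output_space = set().union(*outputs_per_node.values())

print(outputs_per_node)
print(output_space)
```

In this made-up placement each node can only ever produce one of the three outputs, so a node would need only the branch of classify that applies to its own data.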
This process is strengthened further by mishmash io's ability to find sets or subsets with certain characteristics, without anyone having to define indexes or relations.
In this way, optimised versions of the algorithm that can be run at each node are produced and distributed around the cluster, which allows elements of the code to be run in parallel at many locations simultaneously.
However, there is more to it than that: thanks to all the code analysis, mishmash io now also understands which outputs can be calculated directly, without needing to run the actual algorithm.
So when an instance of the algorithm is presented by the application, at least some of the results have often already been computed at different points in the cluster, and any that have not can quickly be found using mishmash io's knowledge of the output space.
The results are returned to the application and then combined appropriately to provide the solution. This process is illustrated in the diagram below.
In summary, mishmash io optimises the running of complex algorithms by a three-pronged approach:
Conceptually, this is quite a complex scenario, and very different to the way that databases have been interrogated in the past, so we need a very simple example to illustrate it – and what could be simpler than a nursery rhyme?
Imagine a farm in Scotland, which just happens to be run by a farmer named Old MacDonald. He has many different animals, located here and there, and they all make different noises.
Suppose we want to create an algorithm that can generate the words of the nursery rhyme “Old MacDonald Had A Farm”, based on the data on the farm animals in a database.
For simplicity, let's store all the data in a single mishmash, as an array:
mishmash.farm_animals = {
    "cows": [
        {
            "location": "here"
        },
        {
            "location": "there"
        }
    ],
    "pigs": [
        {
            "location": "here"
        },
        {
            "location": "there"
        }
    ],
    # and so on...
}
mishmash.farm_animals = {
  cows: [
    {
      location: "here"
    },
    {
      location: "there"
    }
  ],
  pigs: [
    {
      location: "here"
    },
    {
      location: "there"
    }
  ],
  // and so on...
};
For the purposes of this example, let’s assume that this information is distributed over a cluster of nodes, and that each node holds information on one species of animal.
The lyrics are quite simple and repetitive and the only parts that change are the animal and the sound it makes. So, let’s write a simple algorithm that will yield a single verse for each animal:
def get_song_text(input):
    for animals in input.farm_animals:
        sound = None
        if animals == "cows":
            sound = "moo"
        elif animals == "pigs":
            sound = "oink"
        elif animals == "ducks":
            sound = "quack"
        elif animals == "chickens":
            sound = "cluck"
        elif animals == "cats":
            sound = "meow"
        # Parentheses let the concatenated string span several lines.
        yield (
            "Old MacDonald had a farm, E-I-E-I-O\n" +
            "And on his farm he had some " + animals + ", E-I-E-I-O\n" +
            "With a " + sound + " " + sound + " here\n" +
            "And a " + sound + " " + sound + " there\n" +
            "Here a " + sound + ", there a " + sound + "\n" +
            "Everywhere a " + sound + " " + sound + "\n" +
            "Old MacDonald had a farm, E-I-E-I-O\n"
        )

verses = mishmash(get_song_text)
var verses = mishmash(function* (input) {
  // Declare the loop variable so it does not leak into the global scope.
  for (var animals in input.farm_animals) {
    var sound;
    if (animals === "cows") {
      sound = "moo";
    } else if (animals === "pigs") {
      sound = "oink";
    } else if (animals === "ducks") {
      sound = "quack";
    } else if (animals === "chickens") {
      sound = "cluck";
    } else if (animals === "cats") {
      sound = "meow";
    }
    yield "Old MacDonald had a farm, E-I-E-I-O\n" +
      "And on his farm he had some " + animals + ", E-I-E-I-O\n" +
      "With a " + sound + " " + sound + " here\n" +
      "And a " + sound + " " + sound + " there\n" +
      "Here a " + sound + ", there a " + sound + "\n" +
      "Everywhere a " + sound + " " + sound + "\n" +
      "Old MacDonald had a farm, E-I-E-I-O\n";
  }
});
In traditional database systems, such an algorithm would take the first animal in the input, check the data to assign the proper sound to that animal, and then yield the verse. Then, it would take the second animal and repeat, then the third, and so on.
The lyrics would be contained in the verses variable from where they can be printed:
for verse in verses:
    print(verse)
for (const verse of verses) {
  console.log(verse);
}
mishmash io uses a different approach to compute the same result. The supplied code is analysed and it is found that the set farm_animals is further split into subsets of cows, pigs, etc.
The required output (in this case, each verse) is also determined.
Finally, the control flow of the algorithm is analysed and reveals two requirements: the for loop shows that each input in farm_animals produces a single verse, and the if...else conditional statements indicate that the sound variable is also derived from each input.
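To make these two findings concrete, here is a minimal sketch that recovers them with Python's standard ast module. It works on a shortened version of get_song_text (two animals and a one-line verse), and the names loop_variable, data_set and branch_table are invented for the illustration. It shows the kind of facts a static analysis can read off the code, not mishmash io's actual analysis.

```python
import ast

# Shortened version of the algorithm, kept as text so it can be parsed.
SOURCE = '''
def get_song_text(input):
    for animals in input.farm_animals:
        sound = None
        if animals == "cows":
            sound = "moo"
        elif animals == "pigs":
            sound = "oink"
        yield animals + " say " + sound
'''

tree = ast.parse(SOURCE)

# Finding 1: the loop iterates over a data set, one input at a time.
loop = next(node for node in ast.walk(tree) if isinstance(node, ast.For))
loop_variable = loop.target.id   # "animals"
data_set = loop.iter.attr        # "farm_animals"

# Finding 2: each `if animals == <constant>` branch that assigns a constant
# yields one (input value -> derived value) pair.
branch_table = {}
for node in ast.walk(tree):
    if not isinstance(node, ast.If):
        continue
    test = node.test
    compares_loop_variable = (
        isinstance(test, ast.Compare)
        and isinstance(test.left, ast.Name)
        and test.left.id == loop_variable
        and len(test.ops) == 1
        and isinstance(test.ops[0], ast.Eq)
        and isinstance(test.comparators[0], ast.Constant)
    )
    if not compares_loop_variable:
        continue
    assignment = node.body[0]
    if isinstance(assignment, ast.Assign) and isinstance(assignment.value, ast.Constant):
        branch_table[test.comparators[0].value] = assignment.value.value

print(loop_variable, data_set, branch_table)
```

Once the table of input values and derived sounds is known, the conditional chain can be replaced by a constant for any given animal, which is the specialisation shown in the next step.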
Armed with this understanding of the data distribution and the algorithm code, mishmash io can now run the algorithm. It examines each input (the animal), and then runs the code for that input on the node which holds the data on that animal, for example:
yield (
    "Old MacDonald had a farm, E-I-E-I-O\n" +
    "And on his farm he had some " + "cows" + ", E-I-E-I-O\n" +
    "With a " + "moo" + " " + "moo" + " here\n" +
    "And a " + "moo" + " " + "moo" + " there\n" +
    "Here a " + "moo" + ", there a " + "moo" + "\n" +
    "Everywhere a " + "moo" + " " + "moo" + "\n" +
    "Old MacDonald had a farm, E-I-E-I-O\n"
)
yield "Old MacDonald had a farm, E-I-E-I-O\n" +
  "And on his farm he had some " + "cows" + ", E-I-E-I-O\n" +
  "With a " + "moo" + " " + "moo" + " here\n" +
  "And a " + "moo" + " " + "moo" + " there\n" +
  "Here a " + "moo" + ", there a " + "moo" + "\n" +
  "Everywhere a " + "moo" + " " + "moo" + "\n" +
  "Old MacDonald had a farm, E-I-E-I-O\n";
It can do this on multiple nodes simultaneously. Once each node has computed its output, the results are returned to the app, which then links them together to give the entire song lyrics.
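The fan-out and combine pattern can be imitated on a single machine, with threads standing in for cluster nodes. This is only a sketch under that assumption: the NODES table, the node names and run_on_node are hypothetical, and a real cluster would move code to the data rather than share memory.

```python
from concurrent.futures import ThreadPoolExecutor

# Assumed placement: one node per species, each already knowing its sound.
NODES = {
    "node-1": ("cows", "moo"),
    "node-2": ("pigs", "oink"),
    "node-3": ("ducks", "quack"),
}

def run_on_node(node_name):
    """Stand-in for the specialised code a node runs for its own species."""
    animals, sound = NODES[node_name]
    return (
        "Old MacDonald had a farm, E-I-E-I-O\n"
        f"And on his farm he had some {animals}, E-I-E-I-O\n"
        f"With a {sound} {sound} here\n"
        f"And a {sound} {sound} there\n"
        f"Here a {sound}, there a {sound}\n"
        f"Everywhere a {sound} {sound}\n"
        "Old MacDonald had a farm, E-I-E-I-O\n"
    )

with ThreadPoolExecutor(max_workers=len(NODES)) as pool:
    # map() returns results in input order even though the work runs concurrently.
    verses = list(pool.map(run_on_node, NODES))

# The app side: link the per-node results together into the full song.
song = "\n".join(verses)
print(song)
```

Because the verses are independent of each other, no node has to wait for another, and the only step that needs all results is the final join.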
Old MacDonald is obviously a rather silly example that hardly needs state-of-the-art distributed computation to solve; indeed, mishmash io would probably take longer to compute the result than the app itself using no database at all.
However, in the real world, query algorithms on large data sets will have hundreds of input parameters, if statements, loops, functions and arithmetic or other expressions that produce the desired output.
In these situations, quickly finding the most appropriate subsets and transforming the entire algorithm into an optimised equivalent produces the greatest gains in performance.
A more realistic scenario is the football facts algorithm that was discussed in a previous article.
This algorithm examines all the possible combinations of data about football matches between two teams, looking for splits that lead to an increase in the information gain as a way of revealing structure (that is, a relation) between two data points.
Assuming that the data is split over nodes in a cluster, then mishmash io will examine the splits at each node to determine which ones can be ignored as they do not reveal any structure. Then, when the algorithm is run, time is saved by not returning splits which do not add any value.
Furthermore, the system will know how best to compute each split by identifying the required data sets and where they are located, rather than having to laboriously extract each pair from the database, analyse it, and then repeat.
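For readers unfamiliar with information gain, the following sketch computes it for two candidate splits and discards the one that reveals nothing. The match records and field names are made up, and the standard entropy-based definition is used here as an assumption; the football facts algorithm from the previous article may differ in its details.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy, in bits, of a list of labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, feature, target):
    """Entropy of the target minus the weighted entropy after splitting on feature."""
    base = entropy([row[target] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[feature], []).append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder

# Made-up match records, for illustration only.
matches = [
    {"venue": "home", "weather": "rain", "result": "win"},
    {"venue": "home", "weather": "dry", "result": "win"},
    {"venue": "away", "weather": "rain", "result": "loss"},
    {"venue": "away", "weather": "dry", "result": "loss"},
]

gains = {f: information_gain(matches, f, "result") for f in ("venue", "weather")}

# Splits with (near) zero gain reveal no structure and can be skipped.
useful_splits = [f for f, gain in gains.items() if gain > 1e-9]

print(gains)
print(useful_splits)
```

In this invented data the venue fully determines the result while the weather tells us nothing, so only the venue split is worth returning.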
Thanks to strides in the analysis of software code, a lot of the ‘heavy lifting’ that used to be performed by the code of an algorithm to manipulate the data can now be done by the in-built mechanism of mishmash io, provided that the algorithm is written simply and doesn’t rely on specific programming languages, APIs or frameworks.
So, as a developer, just keep your algorithm code as short, clean and easy to understand as possible, and mishmash io will be able to decipher it and work out optimal ways of running your query over whatever data storage structure is in place. Behold the secret behind the parallelism superpower!
However, keeping the code simple doesn’t mean you need to avoid tackling complicated or challenging problems. In fact, quite the opposite: you should seek to use as much of the capacity of your database as possible and execute as much processing as you wish, because the more information mishmash io can gather from the algorithm, the more it can analyse, understand and optimise how to execute the code efficiently over the data.
In other words, you can use mishmash io's machine learning abilities to maximise the ‘return’ from your ‘investment’ in writing your algorithm, which in turn can enhance your superpowers…
To help you get going, mishmash io follows three guiding principles that make algorithm development easy and accessible, despite increasing data sizes and complexity.
See how we use an algorithm to find structure in this smart football commentator example app.