Finding Structure in a Mishmash of Data

The power of entropy

One of the superpowers of mishmash io as a database is its ability to implement an entire algorithm that is written in your programming language (python, javascript, java, etc) and perform complicated computations over your datasets, guided by the principle of uncovering structure in the data.

Traditionally, complicated data analysis problems require that you (or rather your code) attempt to structure the data and then see what patterns (if any) that reveals, before changing the structure and then analysing again. Not only is this a long iterative process that involves a lot of data transfer back and forth between your database and your app, but it also requires some educated guesswork about where to start in structuring the data.

In a departure from this approach, mishmash io can apply algorithms that examine all possible data combinations and quantify the structure in terms of what we like to call entropy.

Entropy is a quantity that we’ve borrowed from thermodynamics, which can be loosely defined as the disorder of a system – in other words, a lack of structure. But in order to find the answer to a particular question, we need to look for structure – in other words, a reduction in entropy.

Once we have defined what we mean by a reduction in entropy, we can then program an algorithm to search all possible combinations of parameters and values to find the lowest value of this entropy quantity. By using a quantifiable measure of structure like this, there’s no chance for bias to subconsciously influence the interpretation of the data – the outcome is based on cold, hard mathematics.

To illustrate this, let’s develop a simple application that might appeal to a TV football commentator – finding interesting statistics about an ongoing match.

You know the sort of thing: Team A hasn’t won a game against team B in 10 years unless they scored during the first half, or Striker X has a poor scoring record against defenders Y and Z; all the tidbits of information that true sports fans find irresistable.

Let’s examine how we could develop a simple algorithm to extract such curiosities from historical data on football matches.

Desperately seeking structure

Obviously, the number one challenge is that we don’t know where in the data we should be looking – and a game of football generates a lot of potentially significant data:

  • which team is at home and which is playing away
  • team members selected for the match
  • goals scored in each half
  • number of free kicks, corners, yellow cards, red cards, substitutions
  • ...etc., etc., the list of potentially significant data can be very long

Any combination of any number of these parameters could potentially describe our interesting statistic – team A always wins if they score in the first half at home, for example. There are easily thousands upon thousands of parameters if you consider all the possible combinations.

The only thing we do know is that our interesting statistic is a repetitive outcome that leads to a single result – in this case, whether a team typically wins or typically loses.

In other words, a pattern; some structure in the data. It is this pattern that we want our algorithm to find and then identify the set of circumstances (as expressed by parameters and their values) that cause it to occur most often.

This process – looking for parameters that give structure to data and scoring the results to find the highest measure of this structure – is ideally suited to Machine Learning.

  • we are searching for hidden structure in wins vs. defeats
  • entropy is telling us this is more structured than that
  • highly structured parameters are describing a repetitive outcome: a pattern
  • a pattern in wins vs. defeats is an interesting statistic

So let’s exploit this in a simple algorithm in mishmash io.

Structuring the algorithm

Let’s keep this (relatively) simple for now and focus on the factors that influence the most important aspect of football – whether a team is likely to win or lose an upcoming fixture. (We’ll ignore draws as it keeps the maths simpler – and anyway, they’re often boring.)

The first thing we need is a data array covering all the past fixtures between these two teams such as that shown in part below in JSON format.

{
  "fixture_id": 100,
  "league_id": 2,
  "status": "Match Finished",
  "elapsed": 90,
  "venue": "Goodison Park",
  "referee": "S. Attwell",
  "homeTeam": {
    "team_id": 45,
    "team_name": "Everton"
  },
  "awayTeam": {
    "team_id": 37,
    "team_name": "Huddersfield"
  },
  "goalsHomeTeam": 1,
  "goalsAwayTeam": 1,
  "score": {
    "halftime": "1-1",
    "fulltime": "1-1",
    "extratime": null,
    "penalty": null
  },
  "events": [
    {
      "elapsed": 4,
      "team_id": 37,
      "teamName": "Huddersfield",
      "player_id": 2734,
      "player": "Philip Billing",
      "type": "Card",
      "detail": "Yellow Card"
    },
    // ...
    {
      "elapsed": 36,
      "team_id": 45,
      "teamName": "Everton",
      "player_id": 18766,
      "player": "Dominic Calvert-Lewin",
      "type": "Goal",
      "detail": "Normal Goal"
    }
    // ...
  ]
}

Those of you reading this who are familiar with tree structures will recognise that this array contains the features that are characteristic of such a structure. All the array parameters come from a starting point – the root.

You then have parameters like fixture_id, league_id, etc., which are leaf parameters – they have no other parameters associated with them so they sit at the extremities of the tree. Those parameters which have other parameters nesting within them form the branches. For example, the parameter homeTeam has sub-parameters such as team_id and team_name.

From the point of view of our algorithm, each of the parameters and sub-parameters in the tree are referred to as a key. A key is associated with a value. For instance, the value of the home team name parameter is Everton and it can be found under the key homeTeam.team_name.

Given that we are working with a tree structure, we need our algorithm to walk the tree – that is, visit all the parameters in the tree sequentially by following each branch from the root to the leaf, before going back to the root and following the next branch.

In so doing, the algorithm will find the value associated with every key in the database. We can then program the algorithm to look for a good split: a combination of a key and a value that splits the data into two sub-sets, each of which should be more structured than the original database; in our example, one of them should contain more victories, the other, more defeats.

Then we do the same thing within each of these two subsets, once again looking for a good split in each one of them. This continues so that, at every stage, we not only increase the structure in our data, but we also capture the split condition – the key-value pair that gives us the best split.

For example, in our database there is a venue leaf parameter. During home matches for our team (Everton), it has a value of Goodison Park. Let’s assume (not unreasonably) that Everton win more matches at home than when they are away, in which case our algorithm will at some point identify that when venue equals Goodison Park then there are proportionally more wins for Everton and when venue does not equal Goodison Park then there are proportionally more defeats for Everton.

Then the algorithm will look for another parameter that further splits victories and defeats when the venue is Goodison Park. For example, it may find that there is an event in the match that reduces the entropy in the data, such as when a particular player is in the team. It will then dig deeper and might find that the entropy reduction is even more pronounced if the type parameter value does not equal Card.

In this way, the algorithm can eventually create results such as Everton have not lost at home to Huddersfield if Dominic Calvert-Lewin is playing and doesn’t receive a card in the first half.

Now, don’t go and place any money on this, it’s just an example. The important point is that the algorithm arrives at its conclusions exclusively by looking for the splits in the data that increase structure (reduce entropy) without making any assumptions about which parameters might or might not be important.

So, here is a summary of what the algorithm needs to do:

  1. Calculate the initial entropy (an expression of the ratio of victories to defeats for your team) from all the fixtures in the array, and save this value.

  2. Walk the tree of all the keys in the array and split the data for every key-value pair.

  3. Calculate the entropy for each split and retain those with a lower entropy than you started with (in step 1).

  4. Keep walking the tree until you find the key-value pair that reduces entropy the most and save this as the best split.

  5. Repeat steps 1-4 within each group from the best split that you’ve just generated, this time ignoring whatever parameter was previously found to most reduce the entropy.

    Continue going deeper and deeper, into each sub-group in turn, applying the same steps – choosing the parameter and value that give you the best split between victories/defeats, until you reach a perfect group – one that contains only victories or only defeats.

    At this point, you can extract the set of circumstances that lead to a pattern: if you work your way backwards through the iterations and you collect all the key-value pairs from each step, you have found the rules that define the circumstances leading to typically winning or typically losing and you can stop splitting the current group.

  6. Continue until you have found the best split for all groups or you have no more parameters left to explore.

Let’s implement the steps above using an algorithm in mishmash io.

Example implementation

First, we need to set up a few things:

/*
 * Get the 'main' mishmash variable, giving us access to all the data. 
 * We will use it to compose all sorts of 'mishmashes' later
 */
var mishmash = new Mishmash();

// Initialize some variables - pick the two teams that are next to meet on the field
var teamA = 'Everton';
var teamB = 'Huddersfield';

The teams shown are just an example to get us started.

Now, let’s create a mishmash that contains only games played between the two selected teams (again, keeping it simple for clarity). To do this we use the main mishmash variable we created above just as if it was the entire list of all fixture objects in local memory. We can use properties, array indexes, etc to reach deeper and more specific parts of all the data, or we can enumerate mishmashes in order to combine them.

Since everything in mishmash io is an index – both the keys like homeTeam, awayTeam, but also all values such as Everton or Huddersfield – we can combine them in whatever order we like to compose a sub-mishmash with specific properties, such as homeTeam being equal to teamA (mishmash.homeTeam[teamA]).

We can extend this to specify the away team (mishmash.homeTeam[teamA][Mishmash._parent].awayTeam[teamB]). Then, as teams often exchange visits, we can combine this sub-mishmash with another similar one where homeTeam = teamB and awayTeam = teamA.

/*
 * Let's compose our first 'mishmash' - a set of all fixture objects where teamA played vs teamB,
 * home or away
 */
var all_fixtures = mishmash(
  mishmash.homeTeam[teamA][Mishmash._parent].awayTeam[teamB], 
  mishmash.homeTeam[teamB][Mishmash._parent].awayTeam[teamA]
);

As mishmash io does not have a classical database schema, it lets you explore every possible data structure, as long as it actually exists in the database.

You do not have to strictly follow the hierarchy of your objects if you see a simpler way to describe which ones they are. There are also specific keywords that you can use to represent a hierarchical relationship (like mishmash._parent) if you need one to make things clearer.

Our algorithm will also need to compose separate mishmashes of the wins and the defeats of teamA. You can apply standard methods available in your programming language to introduce more complex logic when composing mishmashes; in this instance we are using the filter() method of a javascript array.

/*
 * Let's compose another mishmash that has only teamA wins...
 */
var teamA_wins = mishmash(
  all_fixtures.homeTeam[teamA].filter(fixture => fixture.goalsHomeTeam > fixture.goalsAwayTeam),
  all_fixtures.awayTeam[teamA].filter(fixture => fixture.goalsAwayTeam > fixture.goalsHomeTeam)
);

/*
 * ... and one with all teamA defeats
 */
var teamA_defeats = mishmash(
  all_fixtures.homeTeam[teamA].filter(fixture => fixture.goalsHomeTeam < fixture.goalsAwayTeam),
  all_fixtures.awayTeam[teamA].filter(fixture => fixture.goalsAwayTeam < fixture.goalsHomeTeam)
);

So far, we have not actually transferred any data between the app and mishmash io, but it’s now time to start. Let’s see how teamA_wins compares to teamA_defeats by computing our initial entropy:

/*
 * Get the initial entropy - our basic measure if we're capturing more structure or not
 */
var num_fixtures = Array.length(all_fixtures);
var num_wins = Array.length(teamA_wins);
var num_defeats = Array.length(teamA_defeats);

var initial_entropy = 0 
  - ((num_wins / num_fixtures) * Math.log2(num_wins / num_fixtures))
  - ((num_defeats / num_fixtures) * Math.log2(num_defeats / num_fixtures));

Array.length() is a standard method in javascript and it can, of course, be applied to a mishmash in order to obtain its length.

In this case, however, it forces the mishmash to actually compute and return this length as an integer. In that respect, it’s akin to what is often known as a terminal operation. By asking for the length of a composed mishmash, all operations that we previously did to compose that mishmash will now be executed, in parallel, on the mishmash cluster so that a single integer length is returned.

We want our algorithm to repetitively and recursively split one mishmash into two separate parts, then find out if we’re gaining information (uncovering structure, losing entropy) in either of the parts, which shows that we are splitting the data further into victories and defeats.

So let’s simplify our algorithm code by implementing a few functions. If necessary, these functions can then be applied remotely on mishmash io to take advantage if its ability to speed up computations:

function get_wins(fixture_list) {
  return mishmash(
    fixture_list.homeTeam[teamA].filter(fixture => fixture.goalsHomeTeam > fixture.goalsAwayTeam),
    fixture_list.awayTeam[teamA].filter(fixture => fixture.goalsAwayTeam > fixture.goalsHomeTeam)
  );
}

function get_defeats(fixture_list) {
  return mishmash(
    fixture_list.homeTeam[teamA].filter(fixture => fixture.goalsHomeTeam < fixture.goalsAwayTeam),
    fixture_list.awayTeam[teamA].filter(fixture => fixture.goalsAwayTeam < fixture.goalsHomeTeam)
  );
}

function entropy(fixture_list, wins, defeats) {
  var num_fixtures = Array.length(fixture_list);
  var num_wins = Array.length(wins);
  var num_defeats = Array.length(defeats);

  return 0 
    - ((num_wins / num_fixtures) * Math.log2(num_wins / num_fixtures))
    - ((num_defeats / num_fixtures) * Math.log2(num_defeats / num_fixtures));
}

Performing actions over a mishmash of elements (like filtering out teamA victories in a mishmash of fixtures) implies a certain notion of order.

For example, applying the array filter() method above implies an order of elements, as they are stored. Often though, we need to process elements in an order that is defined by another logic, specific to the algorithm that we’re implementing.

In our example, we have to loop over all keys and all their values, and attempt to split the data for each key-value pair to check if we gain structure or not.

Iterating over elements based on certain, non-implied algorithmic logic in this way is also possible with mishmash io. One way of doing it is to use a generator function which is a common concept in many programming languages:

/*
 * Let's implement our custom iterator logic that we need to lookup for splits.
 * Basically - it loops over all parameter names in a mishmash and all possible values for each parameter.
 */
function* keys_and_values(key, objs) {
  if (typeof objs === 'object') {
    for (let key of Object.keys(objs)) {
      for (let value of objs[key]) {
        yield* keys_and_values(key, value);
      }
    }
  } else {
    yield {
      key: key, 
      value: objs
    };
  }
}

As you can see, this function will yield an object – a key-value pair.

Let’s now look at one of the most important parts of our algorithm – how to find the best split: a key and a value that best structures a mishmash into victories and defeats:

function find_split(fixtures, ignore_params) {
  // get an initial entropy - one before a split
  var entropy = entropy(fixtures, get_wins(fixtures), get_defeats(fixtures));
  var best_split;

  // loop over keys and values
  for (i of keys_and_values(null, fixtures)) {
    if (ignore_params.includes(i))
      continue; // ignore splits we've used before

    // get a split by the current key and value
    var split = fixtures.filter(fixture => fixture[i.key] == fixture[i.value]);
    // compute the entropy of wins and defeats within the new split
    var split_entropy = entropy(split, get_wins(split), get_defeats(split));
    if (split_entropy < entropy) {
      // we found a better split! so keep it
      best_split = i;
      entropy = split_entropy;
    }
  }

  // return the best split we've found, along with the two mishmashes it splits into
  return {
    split_key: best_split.key,
    split_value: best_split.value,
    entropy: entropy,
    fixtures_left: fixtures[best_split.key][best_split.value],
    fixtures_right: fixtures[best_split.key].filter(fixture => fixture[best_split.key] != fixture[best_split.value])
  }
}

And let’s wrap this up in a nice recursive function that will be composing a tree out of the best splits that we find:

/*
 * Finally, let's look for best splits recursively and build a tree of split keys and values
 */
var build_structure = function(fixtures, node, ignore_params) {
  var best_split = find_split(fixtures);
  if (best_split == null) {
    // when no more splits can be done - do nothing
    return;
  }

  node.split_key = best_split.split_key;
  node.split_value = best_split.split_value;

  var next_ignore_params = ignore_params.slice(0).push({key: best_split.split_key, value: best_split.split_value});
  if(Array.length(best_split.features_left) > 0) {
    // go further in building the structure on the left side, ignoring the key and value we just selected as best
    node.left = {};
    build_structure(best_split.features_left, node.left, next_ignore_params);
  }

  if(Array.length(best_split.features_right) > 0) {
    // go further in building the structure on the right side, ignoring the key and value we just selected as best
    node.right = {};
    build_structure(best_split.features_right, node.right, next_ignore_params);
  }
}

The possibilities are endless…

We’ve made quite a few simplifications in this description to stop it becoming overly complicated. However, mishmash io offers many possibilities for more complex analyses.

For example, our football example would probably produce a lot of results that would be fairly obvious to a football pundit. The top teams especially are more likely to win, so there will be more sets of circumstances that typically lead to them winning, and these will be very well-known and therefore neither surprising nor curious.

However, if a TV commentator wanted to generate interesting statistics, this means we should look for untypical outcomes that typically happen. Calculating how unlikely they are will generally require scanning the entire database (often more than once) – another reason why we are better off implementing an algorithm rather than coming up with a series of queries to explore relations between sets of data.

All this analysis potentially involves a lot of exchanges between the application and the database, which can ruin the performance of the app.

However, there’s one last thing we can do – we can submit all this code directly to mishmash io and let it optimize all those operations, parallelize our code and finally compute and return the result. It would be as if our code actually ran:

var tree = all_fixtures(function(input) {
  return build_structure(input, {}, []);
});

This is the second of mishmash io’s superpowers. Applying a function on a mishmash actually transfers the function to mishmash io.

Here, all its logic is analyzed and, using the knowledge mishmash io has about the data, it is transformed into equivalent logic that will reach the same result, only in parallel on multiple nodes in a cluster – and therefore, much faster.

But that’s another story…