How to optimise probabilistic matching performance on massive data sets
Within a probabilistic matching* engine (PME) there are 3 key stages:
*For an explanation of probabilistic matching see my previous blog here.
These 3 key stages, when implemented together, allow for very efficient performance on massive data sets, both in terms of speed of matching and quality of matching. A basic understanding of these 3 components and how work together is essential for optimising matching:
Within the PME algorithm, the attributes that are used for matching – such as names, DOB and address – are configured to be used. As the first step they are standardized and consolidated into comparison roles. This is so that subsequent processes can treat all data in the same way.
The standardization process can group together data that may be in slightly different formats.
Example: An international phone number and a UK phone number can both be standardized and used as a cover-all Phone Number, for the remainder of the algorithm
The standardization component, along with grouping data into logic groups for the algorithm also does some rudimentary processing of the data.
Example: It removes most special characters from the data (e.g. hyphens or commas) and changes all text to upper case. Depending on the standardization it might also take specific parts of the attribute (e.g. forename and surname for name) so that they can be treated specifically.
After standardization, all data is put into buckets. These buckets use bucketing functions to lump records with similar data together, creating groups of records which may be the same.
Example: you may have a bucketing function that takes a phone number and orders the digits from smallest to largest. Then, records which share the same ordered phone number are grouped together into the same bucket. Those same records may also be grouped into a different group of buckets on other attributes, such as name or address.
In this way, most records will appear in multiple buckets.
The key part to the bucketing is that for 2 records to be considered for matching by the algorithm, those 2 records must share at least 1 bucket (e.g. have the same ordered phone number, or some other such bucket). This is to increase the performance times of matching, since if you had to match a record by exhaustive force, you’d need to compare it against every other record, which is O(n). By reducing the candidates down to only those which share a bucket, we can get the number of comparisons down to O(ck), where c is the number of buckets a member is in and k is the maximum size of each bucket, which as good practice should be no more than 2000.
Example: if you have a dataset with 10 million records, to match by exhaustive force, you’d need to compare the 10,000,001th record to the other 10 million records already in the system (probably taking around 10-15 minutes on typical server hardware (10,000 matches per second)). On the other hand, having 15 matching attributes, each with the maximum 2000 records in them, that same record could be matched in 3 seconds.
The first 2 components of the algorithm are always done synchronously with an insert or update of data, since they are directly dependent upon the data itself. The actual matching – the comparison – is (typically) done asynchronously.
Once you have a list of possible candidates that a record should be compared against, the algorithm performs a pairwise match between that record and all the other candidate records, giving each pair a score. All the pairs which score above a certain threshold (auto-link threshold) are matched together (in transitive linking. In non-transitive linking this is slightly different). This matching is done by the Entity Manager, which looks after entity collapsing (if the match involves collapsing 2 or more existing entities into a single entity).
The algorithm specifies which comparison functions are used to give scores between records. These scores are also known as weights and are dataset dependent (probabilistically set as part of the PME). Depending on the comparison function, different weights will be generated and used, although the generated weights can be manually tweaked if necessary.
Additional benefit – searching capability
The purpose of the algorithm, and the 3 components of the algorithm, are to achieve high quality matches in a practical amount of time. A beneficial by-product of this process is that you can very efficiently use it to perform searching within MDM as well, although with a few caveats.
The 3 components in the previous section were described as if we had a new record that was updated or inserted into MDM and it needed to be matched. Such a record would be standardized and then placed into the buckets, using the bucketing functions. It would then be picked up by the Entity Manager and compared against all the other members that it shared a bucket with to finally determine which entity, or indeed, a new entity, it should go in.
If we change how we frame this example, we can quickly see how searching may take place.
Example: Instead of having a new record, let’s say you have some criteria for searching, such as surname, DOB and postcode. You can imagine packaging those attributes up into a very sparsely populated, temporary record. That temporary record would undergo the 3 stages of matching, namely, standardization, candidate selection and comparison, however, the results of these processes is not persisted.
Instead, the candidates that were compared against your temporary record, and the scores that they received are returned to you, the searcher, in descending order of scores, so that the highest scoring, and therefore most likely match to your original criteria, is returned first.