How to evaluate the performance of matching algorithms – top tips

by Matthew Harris on 19th September 2017

It can be difficult to evaluate the performance of matching algorithms. However, there are several ways to determine how accurate they are, and what actionable insight could be gained by applying them to the data behind your business challenges.

In a technical sense, there are 2, and only 2, direct goals of any matching algorithm:

1. Reduce the number of false positives towards 0
2. Reduce the number of false negatives towards 0

There is potentially a third goal, for some types of algorithms, and that is:

3. Reduce the number of manual review tasks towards 0

From a business point of view, the further these numbers can be reduced, the better the matching and the more business benefit can be derived from it.

Optimistic or Pessimistic Approach

Think of two ends of a spectrum – at one end you match absolutely nothing together, and the number of golden records is the same as the number of records, so you have 0 false positives (FP), but 100% false negatives (FN). On the other end of that spectrum, you end up matching everything into a single entity, so you have 0 false negatives, but 100% false positives. Therefore the right algorithm lies somewhere in-between.

Matching Thresholds

Some MDM users may want to achieve a certain number, or a certain percentage, of matches, instead of adhering to the 3 key goals. This is a hazardous pursuit, because with many algorithm types you can achieve it artificially by sliding the implementation along the spectrum towards the 100% FP / 0% FN end. The algorithm may then appear to perform well, but it will actually be of poor quality and, depending on the industry, have real and measurable impacts on your business, including financial penalties.

In practice, it is unlikely that the number of false positives and false negatives will be reduced to zero. With any algorithm, there will be some rules, whether deterministic or probabilistic, that match some records correctly but others incorrectly; that is simply the nature of real-world data. In these cases, a decision must be made whether to make the algorithm:
– Optimistic: increase the rate of matching, but also increase the rate of false positives
– Pessimistic: reduce the false positive rate, but also the number of matches overall.

[Figure: the optimistic vs pessimistic spectrum]

Probabilistic algorithms have a single threshold that sets both the FP and FN rates: as one increases, the other decreases, and vice versa. The idea of optimistic and pessimistic matching is important for this type of implementation, as a decision will need to be made on how many FP or FN are acceptable. Knowing the consequences of matching 2 records that shouldn't match, or of not matching 2 records that should, is essential in order to determine whether your business should adopt an optimistic or pessimistic approach.

As a general rule, most businesses are pessimistic in nature. This is understandable, because the consequence of sharing one customer's data with another customer (because you incorrectly matched them) is too big a risk to take. For this type of algorithm, evaluating performance is methodical and follows a clear process:

1. Look at individual pairs of records (Pairs Analysis)
2. Increase the granularity to matched entities (Entity Validation)
3. Take overall metrics (Metrics)

Each of these steps is discussed in more depth in the following sections.

Pairs Analysis

Pairs analysis is a process which presents a statistically significant number of pairs of records to a group of data stewards. The stewards must decide whether each pair is a match, not a match, or a maybe, with the results cross-checked against the algorithm under evaluation to determine:

– false positives (FP)
– false negatives (FN)
– true positives (TP)
– true negatives (TN)

Typically, the pairs of records are chosen so that all parts of the algorithm are tested, and so that a clear distinction can be made within the algorithm between matched and non-matched pairs. For instance, with a PME (Probabilistic Matching Engine) algorithm, the pairs are chosen across a range of scores: you would expect low-scoring pairs to be mostly non-matches, high-scoring pairs to be all matches, and the transition, along with some maybe answers, to sit in the middle.

For robustness, you want multiple pairs per data point in your algorithm. That is, for each deterministic rule, you want multiple pairs that satisfy the rule and multiple that do not. For PME algorithms, you want multiple pairs at each score. In this way, you get a range of values, which may highlight poor data or corner cases that your algorithm does, or does not, cope well with.
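As an illustration, here is a minimal Python sketch of how such a stratified sample might be drawn. The shape of the `scored_pairs` input, the band width and the `pairs_per_band` figure are all assumptions for the example, not part of any particular MDM product.

```python
import random
from collections import defaultdict

def sample_pairs_by_score(scored_pairs, pairs_per_band=25, band_width=1.0, seed=42):
    """Stratified sample of candidate pairs so every score band gets reviewed.

    `scored_pairs` is assumed to be an iterable of (record_id_a, record_id_b, score)
    tuples produced by the matching engine; the field layout is illustrative only.
    """
    random.seed(seed)
    bands = defaultdict(list)
    for a, b, score in scored_pairs:
        bands[int(score // band_width)].append((a, b, score))

    sample = []
    for band, pairs in sorted(bands.items()):
        # Take up to `pairs_per_band` pairs from each score band.
        k = min(pairs_per_band, len(pairs))
        sample.extend(random.sample(pairs, k))
    return sample
```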

Furthermore, you want multiple people, ideally an odd number to break ties, to give answers on each pair. The more people doing the analysis, the better, because you can clearly see which pairs are definitely a yes/no/maybe (high agreement between reviewers) and which pairs were difficult to decide (perhaps 2 say yes, 1 says maybe and 2 say no).

To work out the TP, TN, FP and FN figures, you need to distil the stewards' answers down to a single yes/no/maybe for each pair (taking the majority decision is a good rule of thumb), and then compare that answer with the match or non-match decision your algorithm comes up with.

This will then classify each pair according to the below matrix:

[Table: algorithm decision matrix]
Stewards say match, algorithm matches: True Positive (TP)
Stewards say match, algorithm does not match: False Negative (FN)
Stewards say no match, algorithm matches: False Positive (FP)
Stewards say no match, algorithm does not match: True Negative (TN)

Any other answers (such as Maybe) can be put into a 5th category, such as Maybe or Unknown.
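A minimal sketch of this classification step might look like the following, assuming steward votes arrive as simple 'yes'/'no'/'maybe' strings and the algorithm's decision as a boolean; the function names are illustrative only.

```python
from collections import Counter

def steward_consensus(votes):
    """Collapse several steward votes ('yes'/'no'/'maybe') into one answer.

    A simple majority is used as a rule of thumb; ties fall back to 'maybe'.
    """
    top = Counter(votes).most_common()
    if len(top) > 1 and top[0][1] == top[1][1]:
        return "maybe"
    return top[0][0]

def classify_pair(steward_answer, algorithm_matched):
    """Place one reviewed pair into the decision matrix above."""
    if steward_answer == "maybe":
        return "Unknown"
    if steward_answer == "yes":
        return "TP" if algorithm_matched else "FN"
    return "FP" if algorithm_matched else "TN"

# Example: three stewards reviewed a pair that the algorithm matched.
print(classify_pair(steward_consensus(["yes", "yes", "no"]), algorithm_matched=True))  # -> TP
```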

Now that each pair is classified, you may be able to extrapolate these answers outwards. For instance, the IBM MDM Sample Pairs process reports the total number of pairs in the entire data set at each score. With this, you can multiply those totals by the percentages of TP, FN, FP and TN to estimate how many of each type you are likely to get on your full data, not just the sample set. This is an incredibly important step, as it allows you to bridge the gap between a relatively small sample, perhaps 2,000 records, and your full data set, which is likely in the millions.
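As a rough sketch, the extrapolation could be done like this, assuming you have per-score classification counts from the sample and per-score pair totals for the full data set (the dictionary shapes are assumptions for the example):

```python
def extrapolate(sample_counts_by_score, total_pairs_by_score):
    """Scale sample-level TP/FN/FP/TN rates up to the full data set.

    `sample_counts_by_score` maps a score to a dict such as
    {"TP": 12, "FN": 1, "FP": 0, "TN": 7}; `total_pairs_by_score` maps the
    same score to the total number of candidate pairs at that score in the
    full data set (e.g. from a sample-pairs report). Both shapes are assumed.
    """
    estimated = {"TP": 0.0, "FN": 0.0, "FP": 0.0, "TN": 0.0}
    for score, counts in sample_counts_by_score.items():
        sample_size = sum(counts.values())
        if sample_size == 0:
            continue
        total = total_pairs_by_score.get(score, 0)
        for outcome, n in counts.items():
            if outcome in estimated:  # ignore the Maybe/Unknown bucket
                estimated[outcome] += total * (n / sample_size)
    return estimated
```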

At this point, especially with a PME algorithm, you may be able to explore different result scenarios simply by changing the threshold at which the algorithm declares a match. This doesn't necessarily require re-running the matching exercise, as you may be able to do it in a spreadsheet with the sample pairs instead, drastically reducing the turnaround time. In this way, you can produce a graph of the number of TP, TN, FP and FN under different thresholds. Take a look at the graph below for an example:

[Graph: TP, TN, FP and FN counts as the matching threshold changes]

You can see that as the TP rate drops, the FN rate increases and as the FP rate drops the TN rate increases. These pairs of statistics are the direct inverse of one another.

Once you have all of this, you may find that increasing the threshold, or making some other change to your algorithm, results in 1 fewer false positive but 10 more false negatives. Typically we find that a change that increases matching (i.e. decreases false negatives) follows roughly a 20/80 rule, producing 1 new false positive for every 4 false negatives it removes. To make any comparison fairer, you may therefore wish to weight false positives more strongly than false negatives, so that a single false positive is treated as the equivalent of, say, 4 false negatives.
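A simple way to explore this on the reviewed sample, without re-running the matching engine, is to sweep the threshold in code and attach a weighted cost. The sketch below assumes each reviewed pair is available as a (score, steward answer) tuple, with maybe pairs excluded, and uses an illustrative weight of 4 false negatives per false positive:

```python
def sweep_thresholds(reviewed_pairs, thresholds, fp_weight=4.0):
    """Re-score the reviewed sample at different thresholds without re-matching.

    `reviewed_pairs` is assumed to be a list of (score, steward_answer) tuples
    where the answer is 'yes' or 'no'. The weighted cost treats one false
    positive as the equivalent of `fp_weight` false negatives.
    """
    results = []
    for t in thresholds:
        tp = fp = fn = tn = 0
        for score, answer in reviewed_pairs:
            matched = score >= t
            if answer == "yes":
                tp += matched
                fn += not matched
            else:
                fp += matched
                tn += not matched
        results.append({"threshold": t, "TP": tp, "FP": fp, "FN": fn, "TN": tn,
                        "weighted_cost": fp_weight * fp + fn})
    return results
```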

At this point, you may wish to calculate a correlation between your algorithm's decisions and the stewards' answers. Two examples are the Matthews correlation coefficient (https://en.wikipedia.org/wiki/Matthews_correlation_coefficient) and the Pearson correlation coefficient (https://en.wikipedia.org/wiki/Pearson_correlation_coefficient), the latter being suitable only for numerical values. For the Pearson correlation, you may wish to encode Yes as +1.0, No as -1.0 and Maybe as 0.0.

[Figure: correlation calculation]
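For reference, here is a small sketch of both calculations: the Matthews correlation computed directly from the TP/FP/FN/TN counts, and a plain Pearson correlation that could be applied to the +1.0/-1.0/0.0 encodings described above. The helper names are illustrative.

```python
from math import sqrt

def matthews_correlation(tp, fp, fn, tn):
    """Matthews correlation coefficient from the four decision-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def pearson(xs, ys):
    """Plain Pearson correlation, e.g. on answers encoded as +1.0 / -1.0 / 0.0."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return 0.0 if sx == 0 or sy == 0 else cov / (sx * sy)
```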

While this type of analysis is great at analysing both FP and FN (or TP and TN), the power of pairs analysis comes in its combination with other metrics and analysis.

With pairs analysis, the goal should be to find a threshold which minimises the number of false positives, irrespective of false negatives. Then, when you perform broader analysis across your wider data set (analysis which can give you an overall matching rate), you can be fairly confident that the number of inaccurate matches is minimised and that the matches you do have are true positives.

The reason for this is that many matching algorithms employ transitive matching (if A and B match, and B and C match, then A and C match as well). When doing pairs analysis, you get no indication of the transitive nature of the matching, so the number of false negatives/true positives is not actually a good indicator of the total matches. Instead, we try to lock in a good false positive rate, and look at the total number of matches in other ways.
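To make the transitive behaviour concrete, here is a small union-find sketch that groups records into entities from a list of matched pairs; the record identifiers and data shapes are illustrative only.

```python
def build_entities(record_ids, matched_pairs):
    """Group records into entities by transitive matching using union-find.

    If A-B and B-C are matched pairs, then A, B and C end up in the same entity.
    """
    parent = {r: r for r in record_ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in matched_pairs:
        parent[find(a)] = find(b)

    entities = {}
    for r in record_ids:
        entities.setdefault(find(r), []).append(r)
    return list(entities.values())

# Example: A-B and B-C match, so A, B and C form one entity; D stays a singleton.
print(build_entities(["A", "B", "C", "D"], [("A", "B"), ("B", "C")]))
```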

Entity Validation

Pairs of records are the most fine-grained way of looking at matching, so now we move on to a slightly more coarse-grained technique: Entity Validation. This process still compares pairs of records, but only within a group of matched records, producing statistics for that entity. These statistics can then highlight potential false positives or strange records.

Let’s look at two examples of matching – one entity will be a good, solid match, and the other will be a poor match, probably a false positive. Check out the two entities below and let’s assume that the threshold to match is 10 or above:

[Figure: link scores within entities ABCD and EFGH]

In the first example, ABCD, all the scores are well above the matching threshold, so this entity looks reasonably good even at a glance. On the other hand, EFGH has some links that are below the matching threshold, but because some key links are above it, all 4 records are linked together (by transitive matching).

We can look at some simple statistics on these entities to see whether they highlight these findings objectively (note that for the mean and median we only count each score once, so if AB is counted, BA is not):

In this table there are some key differences in the minimum, mean, median and standard deviation. We will examine two other examples before deciding what we should be looking for:

[Figure: link scores within entities IJKLMN and OPQRST]

The overall statistics for all 4 entities are:

Did you find the patterns? Here is a description of each entity:

• ABCD is a very well connected entity, with each record matching to each other record well
• EFGH only has the bare minimum of linkages to match it together, and only 1 of those matches is high (the 19). Half the matches are poor
• IJKLMN are 2 groups of well-connected records (IJK and LMN) joined by a single link. The other links are fairly weak.
• OPQRST is a well-connected group of 5 records (OPQST) with 1 poorly connected record (R) connected to the entity through O

Do the statistics highlight any of this?

• Suspicious entities will always have a minimum score below the threshold. This may indicate that one or more records do not belong in the overall entity
• The maximum score is almost entirely useless
• The mean score also does not give much insight. If the mean were below the matching threshold there would be a definite problem (EFGH), but this rarely happens
• The median score is similar to the mean, but it may flag problems more often, especially when a few very high scores artificially inflate the mean
• The standard deviation is a key indicator of poor transitive matches. While it is difficult to set decisive figures, ordering entities for review by standard deviation is a good idea (see the sketch below)

With these statistics, poorly matched entities can be manually reviewed and either the algorithm updated or, if a business-as-usual (BAU) solution is sought, those entities manually altered by data stewards.
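As an illustration, the per-entity statistics and a review ordering could be computed with something like the following sketch, assuming each entity's undirected link scores are available as a list (counting AB once, not BA as well); the threshold and ranking rule are assumptions for the example.

```python
from statistics import mean, median, pstdev

def entity_link_stats(link_scores):
    """Summarise the pairwise link scores within one matched entity.

    A low minimum or a high standard deviation flags an entity worth reviewing.
    """
    return {
        "min": min(link_scores),
        "max": max(link_scores),
        "mean": mean(link_scores),
        "median": median(link_scores),
        "stdev": pstdev(link_scores),
    }

def rank_entities_for_review(entity_scores, threshold):
    """Order entities so the most suspicious come first.

    `entity_scores` is assumed to map an entity id to its list of link scores.
    Entities with a minimum below the threshold are listed first, then by spread.
    """
    stats = {eid: entity_link_stats(scores) for eid, scores in entity_scores.items()}
    return sorted(stats.items(),
                  key=lambda kv: (kv[1]["min"] < threshold, kv[1]["stdev"]),
                  reverse=True)
```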

Metrics

Hopefully by now you have removed all the false positives from your data, but you're still unsure how many actual matches you have and how much business benefit you'll get. After you match your entire data set, you can produce 3 statistics, which in turn give you 2 key indicators for your matching:

1. The total number of records in the system (members)
2. The total number of distinct groups of matched records in the system (entities)
3. Total number of entities of size 1 (records that don’t match to anything else, also known as singletons)

With these 3 metrics, members, entities and singletons, we can produce 2 more metrics, the Golden Record Rate and the Matching Rate:

Golden Record Rate = entities / members

Matching Rate = (members – singletons) / members

The Golden Record Rate tells you, as a percentage, how many entities you have per 100 records. Its inverse (1 – GRR) is the de-duplication rate: the proportion of records you are needlessly maintaining in your systems, since each group of records that could be de-duplicated is represented by a single entity instead. The Matching Rate gives the percentage of records which take part in matching.
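A minimal sketch of these calculations, assuming you can obtain the size of every entity (singletons included) from your matched data set:

```python
def matching_metrics(entity_sizes):
    """Golden Record Rate and Matching Rate from the sizes of all entities.

    `entity_sizes` is assumed to be a list with one entry per entity giving the
    number of member records it contains (singletons have size 1).
    """
    members = sum(entity_sizes)
    entities = len(entity_sizes)
    singletons = sum(1 for size in entity_sizes if size == 1)
    grr = entities / members
    return {
        "members": members,
        "entities": entities,
        "singletons": singletons,
        "golden_record_rate": grr,
        "de_duplication_rate": 1 - grr,
        "matching_rate": (members - singletons) / members,
    }
```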

Typically, the lower the Golden Record Rate, the better, because it means there are more matches. Higher Matching Rates are also associated with better matching, because it means your algorithm is matching a lot of records. Note that this is only valid if you have been through the Pairs Analysis and Entity Validation processes to reduce the number of false positives, otherwise there is a risk that some of the matches you are getting are false positives.

Let’s look at 2 quick examples to see how these metrics work, both using 100 records:

[Table: example metrics, both with 100 members]
Example 1: 87 entities, 80 singletons, Golden Record Rate 87%, Matching Rate 20%
Example 2: 43 entities, 20 singletons, Golden Record Rate 43%, Matching Rate 80%

Example 1, which doesn't do so well on matching, leaves 80% of records unmatched.

Example 2, which matches very well, leaves only 20% of records unmatched.

Example 2 achieves a much lower Golden Record Rate, as it matches more records into fewer entities.

What this means, in realistic terms, is that if you moved from keeping unmatched records to matched entities in your system, you would reduce your data volumes by 13% for Example 1 and a huge 57% in Example 2. This reduction would have measurable impacts on hardware costs as well as resilience costs.

As for the Matching Rate, it shows that only 20% of records in Example 1 match to some other record, whereas 80% of records in Example 2 do.

Conclusion

By following this process, you will be able to report on these 3 key statistics at a high level and evaluate the performance of matching algorithms within your business. You will also be able to do so with confidence, as the Pairs Analysis and Entity Validation exercises ensure that the metrics reflect high-quality matches and that inaccurate matches have been removed as far as possible. You will have achieved your goals of minimising false positives, false negatives and manual deduplication tasks, and you will have confidence in predicting the business value that can be derived from using the matched data.

For more articles on algorithms and matching see my other blogs:

Why probabilistic matching is not a black box

How to optimise matching performance on massive data sets