Technical Spotlight 6: D-Harmony – how suitable is your data for matching?
Gaining a full, 360° view of your customers is a goal that many organisations aspire to. It can allow data scientists and data analysts to deliver business value in many areas, for example to find spending patterns (retail) or repeat offenders (policing). It is reasonably well understood that the quality of the data used in analysis impacts the ability to make informed decisions. What is less well known is that the suitability of your data for matching is also critical.
I call this feature of data its ‘matchability’. There are many tools for data quality, and each has its own strengths and weaknesses when it comes to highlighting different areas of data quality. Where they all seem to fall down, though, is in this ‘matchability’ area. How suitable is your data, especially customer data, for matching?
Being able to match data means that you can reduce overall data volumes, which often means reducing operational costs. More importantly, it allows you to gain that full, 360° view of your customers, creating opportunities that may not exist without bringing those customer records together.
How to determine data ‘matchability’?
It may seem odd to talk about MDM (Master Data Management) in the context of data quality specifically. However, using a PME (Probabilistic Matching Engine) enables you to gain critical insight into the quality of data, specifically for the purposes of matching.
To give some background, a PME scores a pair of records against each other to indicate how likely or unlikely they are to match. Then, if this score is above a certain threshold, the engine matches the records together.
For instance, you might get 2 very similar records which score +20.0. If the threshold was +13.0, then these 2 records would match. However, if you have 2 other records which were substantially different, these differences might give a score of -20.0, definitely below the threshold and definitely not a match.
To dig a little deeper, a PME gives scores on specific attributes, such as name, address and date of birth, which are configured through an algorithm. These individual scores are then totalled to give the overall score between 2 records, such as the +20.0 and -20.0 in the example above.
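As a minimal sketch of the idea above, the snippet below totals per-attribute scores and compares the result with the threshold. The function names and the individual attribute scores are illustrative assumptions, chosen only so that the totals come to the +20.0 and -20.0 used in the text; this is not the API of any particular PME.

```python
from typing import Dict

MATCH_THRESHOLD = 13.0  # the example threshold from the text


def total_score(attribute_scores: Dict[str, float]) -> float:
    """Total the per-attribute scores into one overall score for the pair."""
    return sum(attribute_scores.values())


def is_match(attribute_scores: Dict[str, float],
             threshold: float = MATCH_THRESHOLD) -> bool:
    """A pair of records matches when its total score is above the threshold."""
    return total_score(attribute_scores) > threshold


# Hypothetical per-attribute scores for two pairs of records.
similar_pair = {"name": 8.0, "address": 7.0, "date_of_birth": 5.0}
different_pair = {"name": -8.0, "address": -7.0, "date_of_birth": -5.0}

print(total_score(similar_pair), is_match(similar_pair))
print(total_score(different_pair), is_match(different_pair))
```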
But what does that tell you about your company’s data quality? Well, instead of using the PME to match pairs of records, you can send through the same record twice, matching it against itself.
By doing this (and I’m assuming you are planning on using the MDM PME to match your data at some stage), you can compare individual records, as well as aggregate metrics over these comparisons, to gain insights into your data.
An IBM InfoSphere MDM example
Let’s assume you have chosen to implement the IBM InfoSphere MDM product, which includes an industry-leading PME. It also has UIs capable of tailoring workflows over your master data to suit the business need.
In a typical implementation you would install the product and then go through some cycles of “match tuning”, an iterative process to finely tune your algorithm to match your data well. During this process you will have set some matching threshold, the score above which records should match.
Let’s assume that:
- Your threshold is set at +13.0.
- You have name, address, phone number, DOB, email and passport number configured in your algorithm.
- Each of these gives +5.0 for an exact match, -5.0 for a complete mismatch and a fairly linear continuum for things in-between. So if you had a complete, high quality record it would score +30.0 when compared against itself (the 6 attributes, each receiving +5.0).
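The assumptions above can be sketched in a few lines. The text only says the continuum is "fairly linear", so the strictly linear mapping below is a simplification, and the similarity input (0.0 for a complete mismatch, 1.0 for an exact match) and all names are hypothetical.

```python
from typing import Dict

ATTRIBUTES = ["name", "address", "phone", "dob", "email", "passport"]
EXACT_MATCH_SCORE = 5.0
MISMATCH_SCORE = -5.0


def attribute_score(similarity: float) -> float:
    """Map a similarity in [0, 1] linearly onto the range [-5.0, +5.0]."""
    if not 0.0 <= similarity <= 1.0:
        raise ValueError("similarity must be between 0.0 and 1.0")
    return MISMATCH_SCORE + (EXACT_MATCH_SCORE - MISMATCH_SCORE) * similarity


def pair_score(similarities: Dict[str, float]) -> float:
    """Total the attribute scores over all six configured attributes."""
    return sum(attribute_score(similarities[name]) for name in ATTRIBUTES)


# A complete, high quality record compared against itself: every attribute is exact.
self_comparison = {name: 1.0 for name in ATTRIBUTES}
# Two substantially different records: every attribute is a complete mismatch.
total_mismatch = {name: 0.0 for name in ATTRIBUTES}

print(pair_score(self_comparison))
print(pair_score(total_mismatch))
```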
The nature of the PME algorithm means that:
- Some data will not be considered for matching, such as dummy or bad data (e.g. 01/01/1900 as a DOB).
- Not all pieces of data receive the same matching score. For instance, in the UK, the name “John Smith” will receive a lower matching score than the rarer “Jacque Gunderson”.
So while 2 records may both have a name, address or phone number, the scores when compared against themselves may be higher or lower, depending on the data values themselves. A high quality, full record could score a maximum of +30.0.
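To illustrate how these two effects change a record's score against itself, here is a rough sketch. The dummy-value list, the commonness weights and the way they scale the score are all assumptions made for illustration; a real PME derives its weights from the data and the configured algorithm, and this is not how any specific product implements it.

```python
from typing import Dict, Optional

MAX_ATTRIBUTE_SCORE = 5.0

# Hypothetical placeholder values that are excluded from matching.
DUMMY_VALUES = {"dob": {"01/01/1900"}}

# Hypothetical weights: 1.0 for a rare value, lower for a more common one.
COMMONNESS_WEIGHT = {
    ("name", "John Smith"): 0.5,
    ("name", "Jacque Gunderson"): 1.0,
}


def self_attribute_score(attribute: str, value: Optional[str]) -> float:
    """Score one attribute of a record compared against itself."""
    if value is None or value in DUMMY_VALUES.get(attribute, set()):
        return 0.0  # missing or dummy data contributes nothing
    weight = COMMONNESS_WEIGHT.get((attribute, value), 1.0)
    return MAX_ATTRIBUTE_SCORE * weight


def self_score(record: Dict[str, Optional[str]]) -> float:
    """Total self-comparison score over all attributes present in the record."""
    return sum(self_attribute_score(attr, value) for attr, value in record.items())


common = {"name": "John Smith", "dob": "01/01/1900"}
rare = {"name": "Jacque Gunderson", "dob": "14/03/1982"}

print(self_score(common), self_score(rare))
```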
The MDM developer can tell you not only the theoretical maximum score but also the absolute minimum a pair of records can score (2 substantially different records). You can take these 2 values and put them at opposite ends of a number line. The MDM developer will also be able to tell you the minimum score a record can get when compared against itself (this is almost certain to be 0.0, although your developer should be able to confirm). Finally, onto this number line you should put your matching threshold. The result may look like the line below.
When comparing records against themselves, we need only consider the minimum and maximum that a self-comparison can achieve, which is 0.0 to +30.0 in our example. However, this process can be extended to matched entities or tasks, for which the absolute minimum, -30.0, may be valid.
So, now, let’s assume we have a high quality record (A), which scores +27.0 (it has a somewhat common name and DOB, so doesn’t achieve the maximum), and then a low quality record (B) which scores only +11.0 (it actually only has part of a name, a common DOB and a poorly formatted address).
One thing is apparent straight away: record B does not reach the matching threshold.
If a record cannot reach the matching threshold when compared against itself, it cannot match any record, since no other record can be more similar to it than it is to itself.
This highlights a problem: the data may contain records that not only do not match, but that this algorithm cannot match at all.
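The check described above is simple to express once each record has a self-comparison score. The sketch below assumes those scores are already available (for example, exported from the PME) and uses hypothetical names throughout. Whether a score exactly equal to the threshold counts as a match depends on the engine; here only scores strictly below the threshold are flagged.

```python
from typing import Dict, List

MATCH_THRESHOLD = 13.0


def unmatchable_records(self_scores: Dict[str, float],
                        threshold: float = MATCH_THRESHOLD) -> List[str]:
    """Return the IDs of records whose self-comparison score is below the threshold.

    Such a record cannot match any other record, because no other record
    can score higher against it than it scores against itself.
    """
    return [record_id for record_id, score in self_scores.items() if score < threshold]


# Records A and B from the example in the text.
example_scores = {"A": 27.0, "B": 11.0}
print(unmatchable_records(example_scores))
```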
So what next?
Every record in your system can be compared against itself, and then the results aggregated together. Some things to look for are:
- Total record count
- Count of records which fall below the matching threshold (the inverse can be calculated using the total record count)
- The mean score over all data
- A percentage breakdown of how many records fall into each 1.0 scoring bracket
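The metrics listed above could be computed along these lines, given a list of self-comparison scores. The function and key names are hypothetical, and each 1.0 bracket is labelled by its lower bound (so 11 covers scores from 11.0 up to, but not including, 12.0).

```python
import math
from collections import Counter
from statistics import mean
from typing import Any, Dict, List


def summarise(self_scores: List[float], threshold: float = 13.0) -> Dict[str, Any]:
    """Aggregate self-comparison scores into headline matchability metrics."""
    if not self_scores:
        raise ValueError("at least one score is required")
    total = len(self_scores)
    below = sum(1 for score in self_scores if score < threshold)
    # Group scores into 1.0-wide brackets keyed by the bracket's lower bound.
    brackets = Counter(math.floor(score) for score in self_scores)
    return {
        "total_records": total,
        "below_threshold": below,
        "at_or_above_threshold": total - below,
        "mean_score": mean(self_scores),
        "bracket_percentages": {
            lower: 100.0 * count / total for lower, count in sorted(brackets.items())
        },
    }


summary = summarise([27.0, 11.0, 11.5, 30.0])
print(summary)
```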
Furthermore, the PME algorithm gives a score breakdown for each portion of the algorithm. This means that, for the example algorithm, you can also do a similar breakdown on name, DOB, address, phone number, email and passport number. For this per-attribute breakdown the matching threshold can be ignored, since it applies to the total score rather than to any single attribute.
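A per-attribute breakdown might look like the sketch below, which assumes each record's self-comparison is available as a mapping from attribute name to score. The names and the two metrics chosen (mean score and the share of records scoring 0.0, which would suggest missing or excluded data) are illustrative assumptions.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List


def attribute_breakdown(score_breakdowns: List[Dict[str, float]]) -> Dict[str, Dict[str, float]]:
    """Summarise self-comparison scores per attribute across all records."""
    scores_by_attribute: Dict[str, List[float]] = defaultdict(list)
    for breakdown in score_breakdowns:
        for attribute, score in breakdown.items():
            scores_by_attribute[attribute].append(score)
    return {
        attribute: {
            "mean_score": mean(scores),
            # Share of records where this attribute contributed nothing.
            "percent_zero": 100.0 * sum(1 for s in scores if s == 0.0) / len(scores),
        }
        for attribute, scores in scores_by_attribute.items()
    }


result = attribute_breakdown([
    {"name": 5.0, "phone": 0.0},
    {"name": 3.0, "phone": 0.0},
])
print(result)
```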
By analysing all of these results you can easily tell which areas of your data are letting down your matching. Maybe you have sparsely populated phone numbers, or maybe your addresses are not standardised well.
In any case, this type of analysis can give you key metrics to deliver to stakeholders around the state of your data when it comes to matching your master data.