Technical Spotlight 6: D-Harmony – how suitable is your data for matching?

by Matthew Harris on 23rd November 2017

D-Harmony & Matchability

Gaining a full, 360° view of your customers is a goal that many organisations aspire to. It can allow data scientists and data analysts to deliver business value in many areas, for example by finding spending patterns (retail) or repeat offenders (policing). It is reasonably well understood that the quality of the data used in analysis affects the ability to make informed decisions. What is less well known is that the suitability of your data for matching is also critical.

I call this feature of data its ‘matchability’. There are many tools for data quality, and each has its own strengths and weaknesses when it comes to highlighting different areas of data quality. Where they all seem to fall down, though, is in this ‘matchability’ area. How suitable is your data (especially customer data) for matching?

Being able to match data means that you can reduce overall data volumes, which often means reducing operational costs.  More importantly, it allows you to gain that full, 360° view of your customers, creating opportunities that may not exist without bringing those customer records together.

How to determine data ‘matchability’?

It may seem odd to talk about MDM (Master Data Management) in the context of data quality specifically. However, using a PME (Probabilistic Matching Engine) gives you critical insight into the quality of your data, specifically for the purposes of matching.


To give some background, a PME scores a pair of records together to indicate how likely or unlikely they are to match.  Then, if this score is above a certain threshold, the engine matches the records together.

For instance, you might get 2 very similar records which score +20.0.  If the threshold was +13.0, then these 2 records would match.  However, if you have 2 other records which were substantially different, these differences might give a score of -20.0, definitely below the threshold and definitely not a match.

To dig a little deeper, a PME gives scores on specific attributes which are configured through an algorithm, such as name, address and date of birth.  These individual scores are then totalled up to give the overall score between 2 records, the +20.0 and -20.0 from just before.
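The scoring described above can be sketched in a few lines of Python. This is only an illustration, not how any particular PME is implemented: the function names and the individual attribute scores are made up, chosen so that they total the +20.0 and -20.0 from the example, and the threshold is the +13.0 used above.

```python
from typing import Dict

# Example threshold from the text; real thresholds are set during tuning.
MATCH_THRESHOLD = 13.0


def total_score(attribute_scores: Dict[str, float]) -> float:
    """Add up the per-attribute scores to get the overall pair score."""
    return sum(attribute_scores.values())


def is_match(attribute_scores: Dict[str, float],
             threshold: float = MATCH_THRESHOLD) -> bool:
    """A pair matches when its overall score is above the threshold."""
    return total_score(attribute_scores) > threshold


# Illustrative attribute scores only (assumed values, not PME output).
similar_pair = {"name": 8.0, "address": 7.0, "dob": 5.0}       # totals +20.0
different_pair = {"name": -9.0, "address": -6.0, "dob": -5.0}  # totals -20.0

print(total_score(similar_pair), is_match(similar_pair))      # 20.0 True
print(total_score(different_pair), is_match(different_pair))  # -20.0 False
```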

But what does that tell you about your company’s data quality?  Well, instead of using the PME to match pairs of records, you can send through the same record twice, matching it against itself.


By doing this (and I’m assuming you plan to use the MDM PME to match your data at some stage), you can examine both the self-comparison scores of individual records and aggregate metrics across those scores to gain insight into your data.
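To make the idea of matching a record against itself concrete, here is a minimal sketch. The `score_pair` function is a toy stand-in for a PME, not the real thing: the weights are hypothetical, agreement simply adds a weight, disagreement subtracts it, and a missing value contributes nothing. A real engine also takes into account how common each value is.

```python
from typing import Dict, Optional

Record = Dict[str, Optional[str]]

# Hypothetical per-attribute weights, assumed purely for illustration.
WEIGHTS = {"name": 10.0, "dob": 8.0, "address": 7.0, "phone": 5.0}


def score_pair(a: Record, b: Record) -> float:
    """Toy scorer: agreement adds the weight, disagreement subtracts it,
    and an attribute missing on either side contributes nothing."""
    total = 0.0
    for attribute, weight in WEIGHTS.items():
        left, right = a.get(attribute), b.get(attribute)
        if not left or not right:
            continue
        total += weight if left == right else -weight
    return total


def self_score(record: Record) -> float:
    """Send the same record through twice, matching it against itself."""
    return score_pair(record, record)


full = {"name": "Jane Smith", "dob": "1980-01-01",
        "address": "1 High St", "phone": "0123"}
sparse = {"name": "Smith", "dob": None, "address": None, "phone": None}

print(self_score(full))    # 30.0
print(self_score(sparse))  # 10.0
```

Even in this simplified form, the sparsely populated record scores far lower against itself than the fully populated one.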

An IBM InfoSphere MDM example

Let’s assume you have chosen to implement the IBM InfoSphere MDM product, which is an industry-leading PME. It also has UIs capable of tailoring workflows over your master data to suit the business need.

In a typical implementation you would install the product and then go through some cycles of “match tuning”, an iterative process to finely tune your algorithm to match your data well.  During this process you will have set some matching threshold, the score above which records should match.

Let’s assume that:

 

The nature of the PME algorithm means that:

 

So while 2 records may both have a name, address or phone number, each record’s score when compared against itself may be higher or lower, depending on the data values themselves. A high-quality, full record could score a maximum of +30.0.

The MDM developer can tell you not only the theoretical maximum but also the absolute minimum that a pair of records can score (2 substantially different records). You can take these 2 values and put them at opposite ends of a number line. The MDM developer will also be able to tell you the minimum score a record can get when compared against itself (this is almost certain to be 0.0, although your developer should be able to confirm). Finally, put your matching threshold onto this number line. The result may look like the line below.

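[Figure: number line running from the absolute minimum score to the theoretical maximum, with the self-comparison minimum (0.0) and the matching threshold marked]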

Now, when comparing records against themselves, we need only worry about the minimum and maximum that such a comparison can achieve, between 0.0 and +30.0 in our example. However, this process can be extended to matched entities or tasks, for which the absolute minimum, -30.0, may be valid.

So now, let’s assume we have a high-quality record (A), which scores +27.0 (it has a somewhat common name and DOB, so it doesn’t achieve the maximum), and a low-quality record (B), which scores only +11.0 (it has only part of a name, a common DOB and a poorly formatted address).

[Figure: records A (+27.0) and B (+11.0) placed on the number line]

There is something that appears straight away when doing this, and that is that record B does not achieve the matching threshold.

If a record cannot achieve the matching threshold when compared against itself, it cannot match to any record, since no other record can be more similar to it than it is to itself.

This highlights a problem: the data may contain records that not only do not match, but that this algorithm can never match.
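The check itself is simple to express. The sketch below uses the self-comparison scores of records A and B from the example. The threshold of +13.0 is borrowed from the earlier illustration and is an assumption here, as is the function name `unmatchable`.

```python
from typing import Dict, List

# Assumed threshold, reusing the +13.0 from the earlier illustration.
MATCH_THRESHOLD = 13.0

# Self-comparison scores for the two example records.
self_scores = {"A": 27.0, "B": 11.0}


def unmatchable(scores: Dict[str, float], threshold: float) -> List[str]:
    """Return the records whose self-comparison score is not above the
    threshold; under this algorithm they cannot match any record."""
    return [record_id for record_id, score in scores.items()
            if score <= threshold]


print(unmatchable(self_scores, MATCH_THRESHOLD))  # ['B']
```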

So what next?

Every record in your system can be compared against itself, and then the results aggregated together.  Some things to look for are:

Furthermore, the PME algorithm gives a score breakdown for each portion of the algorithm. This means that, for the example algorithm, you can also do a similar breakdown by name, DOB, address, phone number, email and passport number. For this breakdown the matching threshold can be set aside, as it applies to the total score rather than to individual attributes.

By analysing all of these results you can easily tell which areas of data are letting down your matching. Maybe you have sparsely populated phone numbers, or maybe your addresses are not standardised well.
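As a rough sketch of what such an aggregation might look like, the snippet below takes per-attribute self-comparison scores for a handful of records and summarises them. The scores, the +13.0 threshold and the choice of metrics (share of unmatchable records, average score per attribute) are all assumptions made for illustration, not output from a real PME.

```python
from statistics import mean
from typing import Dict, List

MATCH_THRESHOLD = 13.0  # assumed threshold for illustration

# Hypothetical per-attribute self-comparison scores, one dict per record.
breakdowns: List[Dict[str, float]] = [
    {"name": 9.0, "dob": 6.0, "address": 7.0, "phone": 5.0},  # total 27.0
    {"name": 4.0, "dob": 3.0, "address": 4.0, "phone": 0.0},  # total 11.0
    {"name": 8.0, "dob": 6.0, "address": 0.0, "phone": 0.0},  # total 14.0
]

# Overall view: what share of records can never reach the threshold?
totals = [sum(scores.values()) for scores in breakdowns]
unmatchable_share = sum(t <= MATCH_THRESHOLD for t in totals) / len(totals)

# Per-attribute view: the threshold is not used here, only the averages.
attributes = breakdowns[0].keys()
mean_by_attribute = {
    attribute: mean(scores[attribute] for scores in breakdowns)
    for attribute in attributes
}
weakest_attribute = min(mean_by_attribute, key=mean_by_attribute.get)

print(f"Unmatchable records: {unmatchable_share:.0%}")
print(f"Weakest attribute: {weakest_attribute}")
```

With these made-up numbers, the phone attribute contributes least on average, which is the kind of signal that would point to sparsely populated phone numbers.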

In any case, this type of analysis can give you key metrics to deliver to stakeholders around the state of your data when it comes to matching your master data.