Five data properties to consider for great MDM matching results
The importance of understanding your data
This blog, the fifth in our series, aims to spark your imagination when it comes to data, invite you to start thinking about the fundamental aspects of that data, and show how you can use those fundamentals to your advantage. It gives you five data properties to consider for great MDM matching results.
In the world of MDM, Probabilistic Matching is one of those buzzwords that has led people to assume they could buy an MDM product, drop it into their organisation, flick a switch and hey-presto! it just works. This is not the case. Organisations need to fully understand their data to make use of Probabilistic Matching and MDM. Once you understand the basics, you will be able to utilise the data in any way you choose during matching (think of yourself as the Karate Kid – you have to first master incredibly mundane tasks, because they will create synergy in your data martial arts).
The five data properties to consider
1. Types of Data
When talking about matching, and whether two objects match, you compare their attributes. Usually, the objects that you are comparing will have multiple attributes, with the decision as to whether they are the same determined by whether these attributes agree or not. At its most basic, each attribute that you compare falls into two of the following six categories (one from each column):

Column 1 – what agreement tells you:
- A – agreement indicates the objects are more likely to match
- B – agreement tells us little either way
- C – agreement indicates the objects are less likely to match

Column 2 – what disagreement tells you:
- D – disagreement indicates the objects are more likely to match
- E – disagreement tells us little either way
- F – disagreement indicates the objects are less likely to match
It should be noted that options C and D are unlikely to be used, and we aren't considering them at this time. If we limit ourselves to the four remaining options (A, B, E and F), we end up with four different combinations:
- A, E – data that when matching indicates objects are more likely to match, but when different doesn’t tell us much
Example: This would be things that are unique to a person but may change over time, such as an address. Comparing different types of the same thing, such as mobile and landline phone numbers, can also produce a lot of this type of data.
- A, F – data that when matching indicates objects are more likely to match and when different are less likely to match.
Example: Full names are a great example of this type of data, where you expect that each person has a fairly unique name. Other examples are passport number, especially when combined with the passport issuer
- B, E – data that doesn’t really tell us much, whether it matches or not
Example: Metadata that is held on the record, but not about the person, could be this type of data. For instance, if an address has been verified and given a verification stamp, the verification stamp would be this type of data.
- B, F – data that when matching doesn’t tell us much about matching, but when different indicates objects are less likely to match
Example: Gender is a great example of this type of data. If you picked two females at random, you couldn’t say whether they were the same woman. However, if you had a male and a female, you could be fairly certain they were not the same person.
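To make these categories concrete, probabilistic engines commonly turn them into agreement and disagreement weights (the Fellegi-Sunter approach): an attribute whose agreement is rare among non-matches earns a large positive weight, and one whose disagreement is rare among true matches earns a large negative weight. Here is a minimal sketch, with entirely made-up m- and u-probabilities for illustration only (no product's actual values):

```python
import math

# Illustrative m- and u-probabilities (invented values, not from any product):
#   m = P(attribute agrees | records are a true match)
#   u = P(attribute agrees | records are NOT a match)
attributes = {
    "full_name": {"m": 0.95, "u": 0.001},  # A, F: strong signal both ways
    "address":   {"m": 0.50, "u": 0.010},  # A, E: agreement helps, disagreement weak
    "gender":    {"m": 0.98, "u": 0.500},  # B, F: agreement weak, disagreement strong
}

def weights(m, u):
    """Log2 agreement and disagreement weights for one attribute."""
    return math.log2(m / u), math.log2((1 - m) / (1 - u))

for name, p in attributes.items():
    agree, disagree = weights(p["m"], p["u"])
    print(f"{name:9} agree={agree:+6.2f}  disagree={disagree:+6.2f}")
```

Notice how gender earns only a small positive weight when it agrees but a large negative weight when it differs, exactly the B, F behaviour described above.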
2. Data Landscape
When thinking about data, you can think of it as having a specific landscape. When I think about MDM and the master data that is used within a matching algorithm, I think of each distinct field (or group of fields) used as a matching criterion as having its own landscape, complete with distinct features and quirks.
Most IT professionals will be familiar with many of these features from data quality assessments, such as uniqueness, null values and invalid values. However, there are other features that can also be of interest.
The overarching tenets of data quality are:
- Completeness – how well populated fields are (i.e. null values)
- Timeliness – how old/new the data is, or has it gone out of date
- Validity – whether the data conforms to a set of rules, such as reference data tables or date ranges
- Consistency – looking at the actual values and whether they appear to be accurate (based on previous or other experience). The count of unique values would be one way of assessing this
- Integrity – confirming that relationships are intact between data elements, e.g. all records with the title ‘Mr’ are male
These checks are valuable and should be performed regularly on all systems, especially as a prerequisite to any matching. Their results are also very useful in determining which attributes can be included within a matching algorithm. However, there are other features that are also important within data quality, but are often overlooked when a data quality analysis is performed.
3. Data Coverage
Data coverage is not applicable to all fields, but is still very useful. It is an extension of the principle of validity, whereby you can categorise data as being valid or invalid, a very binary classification. However, there may be some fields for which a range of values is permissible:
Examples: reference data lists (e.g. title or gender) or a numeric or date range (e.g. date of birth).
In these two examples there will be valid and invalid data, which will be picked up by the regular data quality analysis. For some fields you’ll probably also get a count of the distinct values, such as how many males and how many females are in your data.
These metrics tell you things about the data you have, but don’t give you insight about data that you might be missing.
Example: Let’s look at phone number, specifically a full Australian phone number. These phone numbers are a maximum of 10 digits long and may have regional characteristics (such as a state or region prefix within the 10 digits).
Data you do have: In a traditional data quality assessment you will most likely find some bad data in there, things that are not digits (e.g. TESTPHONE) or things that are only a few digits long. Great! You can put in some rules to remove or cleanse this data. You will probably also find some really frequent phone numbers, such as 1111111111, which is likely just test or dummy data. Again, you can highlight these specially using some logic and cleanse or remove them. You’ll probably also get a count of how many records do not have a phone number.
Data you may not have: In our example of a 10 digit phone number, with a minimum length of 6 digits, you have around 10^10 possible combinations (strictly we should exclude the roughly 100K values with fewer than 6 digits, but compared to the total this is insignificant). What this does is create a well defined range that the data may pull from (another example would be date of birth, where the range of dates is very well defined). Now you can analyse the data you have against this range to see whether there are large swathes of phone numbers for which you don’t have any records.
This is highly likely, because telecommunications companies don’t use up all 10^10 phone numbers. However, being able to identify what these are may be of use. Why?
- You may find that there is some regional subset of phone numbers that is oddly missing, which may indicate that a franchise for that area doesn’t actually collect phone numbers.
- It may also be that a group of phone numbers you thought were valid is actually being removed by a sub-system running old code.
- What other reasons can you think of?
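As a sketch of what a coverage check can look like in practice, the snippet below buckets cleansed phone numbers by their two-digit prefix and reports any expected prefix with no records at all. The sample numbers and the list of expected prefixes are invented for illustration:

```python
from collections import Counter

# Hypothetical sample of cleansed 10-digit phone numbers
phones = ["0298765432", "0298761111", "0387654321",
          "0412345678", "0412398765", "0712345678"]

# Bucket by a 2-digit prefix (e.g. Australian area codes such as 02, 03, 07)
buckets = Counter(p[:2] for p in phones if p.isdigit() and len(p) == 10)

# Compare the buckets against the prefixes we expected to see
expected_prefixes = {"02", "03", "04", "07", "08"}
missing = sorted(expected_prefixes - set(buckets))
print("coverage:", dict(buckets))
print("prefixes with no records:", missing)
```

In a real exercise you would bucket on much finer prefixes, but the principle is the same: an empty bucket in a well defined range is itself a finding.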
Taking a different example, date of birth, you would expect the rate of births to be fairly consistent across all days of the year. It is fairly common knowledge, however, that because of default values for incomplete data, the 1st of each month, and especially the 1st of January, are far more common dates of birth than any other day of the year. This type of frequency analysis would almost always surface during a typical data quality exercise, but by putting the concept of a data range together with the actual data values, you can visualise both and start to see these patterns emerging, such as the spikes for the 1st of the month in the graph below.
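One simple way to surface such spikes is to compare each day-of-month’s frequency against the average. Below is a sketch on fabricated dates of birth, where the 1st has been deliberately over-represented to mimic a default value:

```python
from collections import Counter
from datetime import date

# Fabricated birth dates: a spread of ordinary days, plus a
# default-value spike on the 1st of each month
dobs = [date(1980, m, d) for m in range(1, 13) for d in (3, 9, 17, 24)]
dobs += [date(1980, m, 1) for m in range(1, 13)] * 5  # over-represented 1sts

by_day = Counter(d.day for d in dobs)
mean = sum(by_day.values()) / len(by_day)

# Flag any day of the month occurring at more than twice the average rate
spikes = [day for day, n in sorted(by_day.items()) if n > 2 * mean]
print("suspicious days of month:", spikes)
```

The same approach works for any well defined range: day of year, postcode, or the phone prefixes above.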
As I mentioned earlier, data coverage doesn’t really make much sense for all fields. For instance, if you don’t have a clearly defined range, such as for names or even addresses, then this type of analysis is not valid.
4. Data Distance
While not all fields can be analysed by data coverage, they can all be analysed by distance. This analysis is not done on a single value of data, but rather between values of data. If we think of data values as cities, nodes on a graph, then the data distances are the roads that connect them, the edges of the graph. What this type of analysis gives us is another way of conceptualising a data landscape.
If this is a bit abstract, let’s think about a really easy example: addresses. This is easier because addresses already have a distance between them, the physical distance between two places. Irrespective of whether this distance is as the crow flies, walking distance, driving distance or the time it takes to drive from one place to another, they are all valid. In this example we don’t need to be specific, because all you need is some way to define the distance between any two addresses, which all of those measures do.
Addresses, at least when talking about distance as the crow flies, also have the benefit of being representable on a map, so imagine a map now, with addresses, cities, places and locations. If you had the inclination, you could go through something like Google Maps and plot every address you wanted to analyse, and then compute the distance between each point and every other point. If you think about the nodes and edges that this creates, it might look something like:
Figure 1: http://www.europeandriveguide.com/images/maps/england_large.gif
When you do this visualisation it might become strikingly apparent that you do a great deal of business in one geographic location and almost none in another. That is the type of insight you can gather from a distance analysis. That was a simple but useful example; the benefit of this type of analysis runs deeper, though, if we can think more abstractly.
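If your addresses have been geocoded, the crow-flies edges of such a graph can be computed directly with the haversine formula. A small sketch, using invented city-centre coordinates:

```python
import math
from itertools import combinations

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle ('as the crow flies') distance in kilometres."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Hypothetical geocoded addresses (approximate city-centre coordinates)
places = {"Sydney": (-33.87, 151.21), "Melbourne": (-37.81, 144.96),
          "Brisbane": (-27.47, 153.03)}

# Every pair of nodes becomes an edge weighted by its distance
for (a, pa), (b, pb) in combinations(places.items(), 2):
    print(f"{a} - {b}: {haversine_km(*pa, *pb):.0f} km")
```

With the edges in hand, any off-the-shelf graph tool can draw the node/edge picture described above.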
Instead of addresses, we might want to think about names this time. And instead of physical distance, which doesn’t exist for names, we’ll use an edit distance: how many changes do you have to make to the spelling of one name to turn it into the other? For instance, counting only insertions and deletions, to get from MATT to KATE you would have to make 4 changes (add a K, add an E, remove the M and remove the second T), so the distance between the name MATT and the name KATE would be 4. (If you also allow substitutions, as the Levenshtein distance does, the distance drops to 2: M becomes K and the final T becomes E.)
You can then do the same node/edge visualisation on all your names, only this time names that are spelt similarly will be close to one another and names that are not will be far away. You may also want to prune out edges that are, for instance, greater than a certain edit distance, since they are unlikely to convey any usable information and will just clutter the graph. Then, by augmenting such a graph with the number of records for each name, you may quickly find nicknames for some names, or even very unique names which really stand out.
Figure 2: This graph was produced by calculating the Levenshtein distance between each pair of names, dividing by the sum of the squared lengths of the two strings (lev_dist / (length1^2 + length2^2)), and then adding only the edges with a value less than 0.7
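Here is a sketch of how such a graph could be built, using a standard dynamic-programming Levenshtein implementation and the normalisation described in the caption. The sample names are invented:

```python
from itertools import combinations

def levenshtein(a, b):
    """Classic edit distance allowing insert, delete and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # delete from a
                           cur[j - 1] + 1,       # insert into a
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]

names = ["MATT", "MATTHEW", "MAT", "KATE", "CATHERINE", "KATIE"]

# Normalisation from the figure caption: lev / (len1^2 + len2^2), edges < 0.7
edges = []
for a, b in combinations(names, 2):
    d = levenshtein(a, b) / (len(a) ** 2 + len(b) ** 2)
    if d < 0.7:
        edges.append((a, b, round(d, 3)))

for a, b, d in sorted(edges, key=lambda e: e[2]):
    print(a, b, d)
```

With names this short, almost every pair survives the 0.7 threshold; in practice you would tune the threshold (or the normalisation) until only genuinely similar names remain connected.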
5. Attribute Grouping
While many data profiling tools look to compute composite keys as part of their uniqueness profiling, they traditionally stop short of analysing every single combination for uniqueness (for example name, address, phone number and DOB together might be unique, but you won’t go using these four fields as a primary or secondary key in a database).
Being able to group together attributes as part of a data profiling exercise is also useful for analysing null values (that is which combinations of attributes contain null values). In fact, most traditional DQ exercises can be expanded to include groups of attributes to add value. However, this type of analysis is very costly, as the number of combinations quickly gets out of hand, and you end up with information overload – so much data quality information that you can’t make head nor tail of it.
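As an illustration of grouped null analysis, the sketch below counts, for every pair of attributes, how often both are null on the same record. The records are fabricated for the example:

```python
from collections import Counter
from itertools import combinations

# Fabricated records; None marks a null value
records = [
    {"name": "Ann",  "phone": None,         "dob": "1990-01-01"},
    {"name": "Bob",  "phone": "0412000111", "dob": None},
    {"name": None,   "phone": None,         "dob": "1985-06-12"},
    {"name": "Cara", "phone": "0298765432", "dob": "1970-03-03"},
]
fields = ["name", "phone", "dob"]

# For every pair of attributes, count records where both are null together
pair_nulls = Counter()
for rec in records:
    missing = [f for f in fields if rec[f] is None]
    for combo in combinations(sorted(missing), 2):
        pair_nulls[combo] += 1

print(dict(pair_nulls))
```

Extending `combinations(..., 2)` to triples and beyond shows exactly why this analysis explodes: the number of attribute groups grows combinatorially with the group size.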
Since matching takes place across groupings of records, it makes sense that profiling should also take place across these attributes. In a deterministic algorithm the matching rules are well defined and limited in number, and so profiling on those rules is a viable strategy. However, a probabilistic or Artificial Intelligence matching engine may use any combination of attributes to decide on a match, and there are far too many combinations to profile them all.
It is for this reason that some matching engines can be utilised during their tuning phase to evaluate the quality of data, as well as performing matching. This is discussed in far more detail in “Data Matchability” which will be my next blog so stay tuned! In the interim you can find the previous four blogs in this series here.
In the meantime, keep practising those mundane tasks!