Mastering Big Data – An IBM Approach (Part 2)
Introducing IBM’s Probabilistic Matching Engine
In my last post, I discussed the issues that are typically faced when trying to apply data matching techniques to big data. This post focuses on the IBM Probabilistic Matching Engine (PME), an essential component of IBM’s Big Match product that makes it a great fit for big data.
The PME technology is an integral part of IBM’s Master Data Management suite that has been available as a traditional application server hosted application for many years. When looking at how master data management could be applied in a big data paradigm, the PME was the obvious starting place.
The PME has been able to make the transition to the big data world because the scientific principles that underpin it can be applied at virtually any scale. In layman’s terms, computers do mathematics well. Approaches that rely on decision trees or deterministic rules don’t scale efficiently and are difficult to adapt to changing business requirements, whereas the IBM PME’s scientific approach scales linearly as compute resources are added and adapts more easily to new information as the underlying data model changes.
In this post I am going to discuss the underlying matching approach and, in the next, I will discuss the physical attributes of the implementation.
How does it work?
The matching approach contains three central concepts that can be thought of as the legs of a three-legged stool; take one of these away and your matching effort will fail.
1. Data Statistics – this ‘leg’ effectively estimates the information quotient of each data token (e.g. a word, date or number) that appears in a sample of the data set. In simple terms, the statistics gathered allow PME to allocate a weight to each token that appears in a particular attribute. For example, in a ‘Person Name’ attribute, the name SMITH might be very common, whilst the name WARWICK might be less so. The former token could be said, in this instance, to offer less information to a match decision between two records than the latter. The PME would therefore attribute a lower score to a comparison of the name SMITH.
2. Empirical Context – this is the human element of the PME algorithm and is used to tell PME how closeness should be measured for each attribute type and what reference data should be used to help interpret what is being seen (e.g. phonetics, synonyms, transliteration libraries). Access to domain expertise and a thorough understanding of the underlying data sets and business requirements are needed to do this effectively. This is an area that Entity Group specialise in and our consultants have wide experience of tuning the PME to achieve optimal results.
3. Decision Theory – the IBM PME applies a likelihood ratio test to determine if the resulting comparison score represents a match or mismatch. In mathematics, this technique is a statistical test used for comparing the goodness of fit of two statistical models and readily allows the underlying inputs into those models to be extended. Again, to put this into layman’s terms, the matching algorithms are easy to change when the underlying data or business models change.
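To make the first and third legs more concrete, here is a minimal Python sketch of frequency-based token weighting combined with a likelihood ratio decision. It follows the general textbook style of probabilistic record linkage and is not IBM’s implementation: the function names, the 0.95 agreement probability, the sample data and the threshold are all assumptions chosen purely for illustration.

```python
import math
from collections import Counter

# Assumed probability that a token agrees when two records really are the
# same entity (illustrative value, not taken from PME).
M_PROBABILITY = 0.95


def token_frequencies(values):
    """Relative frequency of each token in a sample of attribute values."""
    counts = Counter(token for value in values for token in value.upper().split())
    total = sum(counts.values())
    return {token: count / total for token, count in counts.items()}


def agreement_weight(token, frequencies, m=M_PROBABILITY):
    """Log likelihood ratio for two records agreeing on this token."""
    # The token's frequency approximates the chance that two unrelated
    # records agree on it by coincidence.
    u = frequencies[token]
    return math.log(m / u)


def classify(score, threshold):
    """Declare a match when the evidence reaches the chosen threshold."""
    return "match" if score >= threshold else "non-match"


# Hypothetical sample: SMITH is common, WARWICK is rare.
sample = ["SMITH"] * 90 + ["WARWICK"] * 10
frequencies = token_frequencies(sample)
smith_weight = agreement_weight("SMITH", frequencies)
warwick_weight = agreement_weight("WARWICK", frequencies)

print(f"SMITH:   {smith_weight:.2f} -> {classify(smith_weight, threshold=1.0)}")
print(f"WARWICK: {warwick_weight:.2f} -> {classify(warwick_weight, threshold=1.0)}")
```

In this toy sample, agreement on the rare name WARWICK contributes far more evidence than agreement on the common name SMITH, which is the behaviour described under Data Statistics. In practice the score would be summed across many attributes before being compared with a threshold.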
Why does this approach work?
That’s the theory. But what’s the reality? Some of the reasons are as follows:
1. Flexibility – a comment I hear regularly from customers that are moving to adopt the PME technology is that their existing matching approach is just not flexible enough, with small changes requiring large consulting efforts or an over-dependence on specialised knowledge that has often left the organisation. This is a major problem for complex projects that start small or implement a ‘dumbed down’ version of matching with the idea of making it more robust at a later date, which is a recipe for certain pain in the future. The benefit of PME is that the approach is not rigid and adapts easily to changing requirements. Whilst a change in business requirements will always mean a degree of configuration and testing, the PME approach easily allows for additional ‘information’ to be added to its comparison algorithms. For assurance purposes this will usually need to be tested with the business, but there is a proven methodology for quickly achieving the assurance level required.
2. Tolerance – poor data quality often derails attempts to match master data. A particular advantage of the PME approach is that it is tolerant of many endemic data quality issues that are difficult to fix. Traditional approaches to the problem require you to fix the data quality first and vendors will happily sell you tools… However, some important quality issues JUST CAN’T BE FIXED. Take the following example:
| | Record 1 | Record 2 |
| Name: | Brent Cumberbatch | Trent Cumberhatch |
| Address: | 22 Acacia Ave, London W1 2GQ | 22 Scacia Ave, London W1 2QG |
The problem here is that, if we start with the assertion that these records represent the same person, the errors in the data relate to the information and not the form. Data Quality tools are excellent at correcting malformed data or validating against reference data, but in this case there is no easy way to do either. Each of the names is convincing; there is no way of knowing which date of birth is correct, as differences could be down to mistyping or to internationalisation errors; similarly, the social security numbers are both well-formed (at least for the UK’s National Insurance format); finally, the addresses could have been mistyped and it is touch and go whether address validation software would be able to validate them effectively.
The matching assertion above is of course subjective, but in my experience most matching approaches would fail to identify these records as potential matches and many would not even select them as candidates for comparison. PME, however, computes the likelihood of variances in information (i.e. data quality issues!) and looks holistically across all the data items it has been configured to use. In this extreme example, PME would very probably identify these records as a potential match.
3. Adaptability – finally, PME is adaptive to the information quotient of the records being compared. This is powerful because, just as a human will instinctively weight a ‘word’ or ‘token’ in a string based on the information it conveys (see SMITH v WARWICK above), the PME adjusts its scoring in a similar way. This gives much greater accuracy than techniques that do not weight comparisons at the individual token level.
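The Tolerance point can be illustrated with a short Python sketch that compares the two sample records using approximate rather than exact string comparison. This is only a stand-in: the post does not describe PME’s actual comparison functions, so the standard library’s `difflib.SequenceMatcher` is used here as an assumed similarity measure, and the `similarity` and `compare` helpers and the plain averaging of the scores are hypothetical simplifications.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Approximate string similarity between 0.0 and 1.0 (case-insensitive)."""
    return SequenceMatcher(None, a.upper(), b.upper()).ratio()


def compare(rec_a, rec_b):
    """Score every attribute of the first record against the second."""
    return {field: similarity(rec_a[field], rec_b[field]) for field in rec_a}


# The two records from the example in this post.
record_1 = {"name": "Brent Cumberbatch", "address": "22 Acacia Ave, London W1 2GQ"}
record_2 = {"name": "Trent Cumberhatch", "address": "22 Scacia Ave, London W1 2QG"}

# A deterministic rule that demands identical values rejects the pair outright.
exact_match = record_1 == record_2

# An approximate comparison looks at every attribute and keeps partial evidence.
scores = compare(record_1, record_2)
overall = sum(scores.values()) / len(scores)

print("exact match:", exact_match)
for field, score in scores.items():
    print(f"{field}: {score:.2f}")
print(f"overall: {overall:.2f}")
```

An exact comparison finds no agreement at all between these records, whereas the approximate scores stay high on both attributes. A probabilistic engine would go further than this sketch by weighting each attribute and token according to the information it carries, rather than taking a simple average.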
In my next post, I will describe the architecture of IBM’s Big Match platform that makes it a great fit for scalable computing environments such as Apache Hadoop.