Using machine learning to improve data matching accuracy in the public sector

by Guy Bradshaw on 14th December 2018

At Entity we’re currently working with IBM on a new project to explore how machine learning can be used to improve the efficiency of MDM data matching and manual task resolution.

A key function of MDM systems is to identify duplicate data across multiple systems, using matching algorithms. When the matching algorithm cannot confidently determine whether two records are in fact duplicates, the decision is passed on to a human data steward.
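As a rough illustration of that hand-off, the sketch below routes a candidate pair according to a match score. The threshold values and the function name `triage` are assumptions made for this example only; they are not taken from any particular MDM product.

```python
# Hypothetical thresholds, chosen only for illustration; real deployments
# tune these values for their own data.
AUTO_MATCH_THRESHOLD = 0.90
NON_MATCH_THRESHOLD = 0.40


def triage(score: float) -> str:
    """Route a candidate pair based on its match score (0.0 to 1.0)."""
    if score >= AUTO_MATCH_THRESHOLD:
        return "auto_match"      # confident duplicate, linked automatically
    if score <= NON_MATCH_THRESHOLD:
        return "non_match"       # confidently different records
    return "steward_review"      # uncertain, so a data steward decides
```

Every pair that falls between the two thresholds becomes a manual task, which is why the size of that middle band drives the cost of data stewardship.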

Our aim with this project is to see whether machine learning can find patterns in the decisions that human data stewards make, which could then be used to make the matching process more efficient.

Using data stewards to identify matches is extremely resource intensive and expensive. With this project, we’re hoping to automate more of these laborious tasks. The idea is to use machine learning to identify patterns in the decisions that humans have previously made, then use that knowledge to improve the accuracy of the matching process and reduce the amount of manual intervention required.

At the moment we’re piloting this project within a couple of local and central government organisations. The public sector is well placed to benefit from this technology because of the number of different datasets its organisations typically hold. For example, one of our public sector clients has an MDM system that matches data from at least 16 different datasets, so any improvement in the data matching algorithm that reduces the need for manual intervention should save significant amounts of time and money.

Data stewards have the benefit of business knowledge that they can use to inform their decisions which, obviously, a matching algorithm does not have. Our hope is that machine learning can identify patterns in those human decisions that standard matching algorithms cannot.

An example might be that data stewards match cases 99% of the time if they’ve come from two particular sources and were both created within a short period of time. This might be because the steward knows that one of the systems creates duplicate data entries if it is unable to find a record in a timely manner, perhaps due to data sync issues. Factors like this are not normally considered in standard matching algorithms, but machine learning may identify such a pattern and use it to improve matching accuracy.
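To make that example concrete, here is a minimal sketch that groups past steward decisions by source pair and by whether the two records were created close together, then reports how often stewards chose to match. It is a plain frequency count rather than a trained model, and the field names (`source_a`, `created_a`, `steward_matched` and so on), the five-minute window and the `min_cases` cut-off are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import timedelta


def pattern_key(decision, window=timedelta(minutes=5)):
    """Describe a decision by its source pair and whether both records
    were created within `window` of each other (hypothetical features)."""
    sources = tuple(sorted((decision["source_a"], decision["source_b"])))
    gap = abs(decision["created_a"] - decision["created_b"])
    return sources, gap <= window


def steward_match_rates(decisions, min_cases=50):
    """Return the share of 'match' decisions for each pattern that has
    at least `min_cases` past steward decisions."""
    counts = defaultdict(lambda: [0, 0])  # pattern -> [matches, total]
    for decision in decisions:
        key = pattern_key(decision)
        counts[key][0] += int(decision["steward_matched"])
        counts[key][1] += 1
    return {
        key: matches / total
        for key, (matches, total) in counts.items()
        if total >= min_cases
    }
```

A pattern with a very high match rate over many past cases is a candidate for automatic resolution, or for use as an extra input feature to a learned matching model.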

In addition to resource savings, we hope this project will bring other benefits. For example, if the accuracy of matching algorithms can be improved, more data can be matched automatically, which in turn allows more insights to be gained from that data without having to factor in time for manual intervention.

More accurate matching has the potential to drive benefits across the public sector, including countering fraud, delivering more ‘connected’ cross-agency services and supporting vulnerable adults and children.