Mastering Big Data – an IBM approach

by Clive Warwick on 29th January 2018

Master Data Management & Big Data – an introduction

Most information professionals are aware of the hype surrounding big data, and those of us working at organisations running projects in this category will be experiencing the problems that such vast volumes of data bring.  We are finding out first-hand that the basic underlying precepts of data management still hold true.

One such rule is that, to be useful, data needs to be trusted and, to be trusted, it needs to be governed.  There are several tools and approaches for data governance, and if you’d like to understand the breadth of the challenge and a sensible approach to getting to grips with it, you could not go far wrong by taking a look at Entity Group’s “Crossing the Data Delta”, a highly authoritative and well-received treatment of the subject.

One important step on that multi-faceted governance journey is ensuring that each data item in your mass of big data can be attributed back to its master entity; i.e. transactions can be accurately attributed to individual products, customers, suppliers, locations and so on. The industry refers to this challenge as Master Data Management, and at big data scale it is particularly difficult.

Drowning in Data

Why is Master Data Management in Big Data so difficult?

Master data management relies on complex, compute-heavy processes to match, associate and govern the many data records that your organisation receives.  Scaling these compute-heavy processes into the big data world might at first seem straightforward… we’ll just run our existing stuff on a cluster, right?

Wrong.

Traditional, deterministic matching approaches based on business rules do not scale well: the number of candidate record pairs grows roughly with the square of the number of records, so the matching workload explodes as data volumes grow.  Big data platforms provide graph databases and top-level Apache projects such as Solr, Spark and Hadoop MapReduce. These are often articulated as the way forward, but in reality the issues of matching master data at big data scale don’t just go away.  The problems faced by traditional, rules-based matching approaches remain pervasive, despite the cluster-oriented tools available.
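To make the scaling problem concrete, here is a minimal sketch (plain Python, not any particular product) of why naive, rule-based pairwise matching blows up as record counts grow, and how a simple blocking key keeps the comparison count tractable. The record fields and the matching rule are invented purely for illustration.

```python
# Illustrative sketch only: why rule-based pairwise matching struggles at
# scale, and how "blocking" trims the comparison space.
from itertools import combinations

records = [
    {"id": 1, "name": "Jon Smith",  "postcode": "SW1A 1AA"},
    {"id": 2, "name": "John Smith", "postcode": "SW1A 1AA"},
    {"id": 3, "name": "Jane Doe",   "postcode": "EH1 2NG"},
]

def rule_match(a, b):
    # A toy deterministic rule: same postcode and same surname.
    return (a["postcode"] == b["postcode"]
            and a["name"].split()[-1] == b["name"].split()[-1])

# Naive approach: compare every record with every other record,
# i.e. n * (n - 1) / 2 comparisons -- quadratic in the record count.
naive_pairs = list(combinations(records, 2))

# Blocking: only compare records that share a coarse key (here, postcode),
# which is how matching engines keep the comparison count manageable.
blocks = {}
for r in records:
    blocks.setdefault(r["postcode"], []).append(r)
blocked_pairs = [p for block in blocks.values() for p in combinations(block, 2)]

print(len(naive_pairs), "naive comparisons vs", len(blocked_pairs), "after blocking")
print([(a["id"], b["id"]) for a, b in blocked_pairs if rule_match(a, b)])
```

With three records the saving looks trivial, but at hundreds of millions of records the naive pair count runs into the quadrillions, which is exactly why simply “running the existing rules on a cluster” doesn’t solve the problem.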

In-house solutions created for a particular use case may work to begin with, but as the data model expands or changes and more sources are added, the approach quickly proves inflexible and unable to scale or adapt to business requirements.  This is particularly apparent now that Agile software development, adopted to help businesses react quickly to market demands, is becoming popular.

Huge growth of data

Introducing IBM Big Match

Fortunately, there are solutions.  IBM has for many years led the field in data matching.  Its Probabilistic Matching Engine (PME), a core component of IBM’s Big Match product, brings a scientific approach to data matching. It has been used for over two decades to provide trusted real-time data matching, entity resolution and de-confliction for some of the world’s largest organisations in both the private and public sectors.
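As a rough illustration of what “probabilistic” means in this context, the sketch below scores a candidate record pair Fellegi–Sunter style: each field contributes a log-likelihood weight for agreement or disagreement, and the total is compared against thresholds. The field weights and thresholds here are made up for the example; this is a generic outline of probabilistic matching, not the PME’s actual implementation.

```python
# Generic probabilistic (Fellegi-Sunter style) scoring sketch -- an
# illustration of the idea behind probabilistic matching, not IBM's PME.
from math import log2

# Hypothetical per-field weights: m = P(field agrees | records match),
# u = P(field agrees | records do not match). In practice these are
# estimated from the data rather than hard-coded.
WEIGHTS = {
    "name":     {"m": 0.95, "u": 0.01},
    "dob":      {"m": 0.97, "u": 0.002},
    "postcode": {"m": 0.90, "u": 0.05},
}

def match_score(rec_a, rec_b):
    """Sum log-likelihood ratios across fields: agreement adds evidence
    for a match, disagreement adds evidence against it."""
    score = 0.0
    for field, w in WEIGHTS.items():
        if rec_a.get(field) == rec_b.get(field):
            score += log2(w["m"] / w["u"])
        else:
            score += log2((1 - w["m"]) / (1 - w["u"]))
    return score

a = {"name": "john smith", "dob": "1980-02-14", "postcode": "sw1a 1aa"}
b = {"name": "john smith", "dob": "1980-02-14", "postcode": "sw1a 2bb"}

score = match_score(a, b)
# Compare against thresholds to decide: match, non-match, or clerical review.
print(f"score = {score:.2f}", "-> likely match" if score > 8 else "-> review/non-match")
```

The appeal of this style of scoring is that weak evidence in one field (a mistyped postcode, say) can be outweighed by strong agreement elsewhere, rather than causing a hard rule to fail outright.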

Global businesses and governments need to master many hundreds of millions of records for a myriad of reasons, and IBM helps them do this.  However, these organisations also face the ubiquitous big data challenge: if they had hundreds of millions of records to deal with yesterday, the relentless growth in data and transactions means they can have billions to deal with today.  This truly massive scale meant that even IBM’s innovative technology was hitting the limits of the underlying compute platforms.  A rethink was needed, and in 2014 IBM launched Big Match, a solution for data matching at big data scale.

In my next posts, I will describe the characteristics of IBM’s PME that make it a great fit for Big Data.

For more information on the matching qualities of IBM’s PME, please check out our past Technical Spotlight blogs here.