Record Linkage in Large Data Sets
Watching a presentation on single customer view at the MongoDB Days conference, I realised how common the problem of having multiple records that describe the same entity is. There are two main reasons for solving this problem:
- To have a complete view of the entities by joining or merging them from two separate data sets. This could be the case of linking customer data from the CRM and the operational database.
- To detect duplicate records within a single data set. An example could be finding out whether there are files with the same content in a directory with millions of files.
Usually a record linkage process has three phases:
- Pre-linkage phase: data is preprocessed or normalized to improve the linking accuracy.
- Linking phase: the actual linking is performed.
- Post-linking phase: review of the results, checking the "possibly linked" records and finally using the results.
The pre-linkage phase is probably the most context-dependent of the three, and its goal is to transform the data to make the linkage possible. Potential transformations include converting dates, telephone numbers and addresses to a standard format, or splitting/merging variables in one data set to match the schema of another.
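As a minimal sketch of this phase, the helpers below normalize phone numbers and names so that the same value written in different styles compares equal. The formats and function names are illustrative, not taken from any specific tool:

```python
import re

def normalize_phone(raw):
    """Strip every non-digit so phone formats match across sources."""
    return re.sub(r"\D", "", raw)

def normalize_name(raw):
    """Lowercase and collapse whitespace for case-insensitive matching."""
    return " ".join(raw.lower().split())

print(normalize_phone("+1 (555) 123-4567"))  # "15551234567"
print(normalize_name("  John  SMITH "))      # "john smith"
```

Any normalization of this kind should be applied consistently to both data sets before the linking phase starts.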
Linking Phase: Deterministic Linkage
Deterministic methods involve exact matching on one or more variables. These methods are simple and fast but detect far fewer links than probabilistic linkage. On the other hand, the linked records are more likely to refer to the same entity.
Simple Deterministic Linkage
The simplest of the deterministic methods consists of linking records that have the same values in the given variables. In the SQL world, we would identify those records with an inner join on one or more columns. In MapReduce, we would build a key by concatenating all the variables and then link every record that falls into the same reduce group. If the keys were too big, we could hash them and use the hash as the key.
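The grouping step can be sketched on a single machine with a dictionary standing in for the reduce groups. The sample records and field names are made up for illustration:

```python
import hashlib
from collections import defaultdict

records = [
    {"id": 1, "name": "ann", "zip": "08001"},
    {"id": 2, "name": "ann", "zip": "08001"},
    {"id": 3, "name": "bob", "zip": "08002"},
]

# Map: build a composite key from the matching variables; hashing it
# keeps large keys at a fixed size.
groups = defaultdict(list)
for r in records:
    key = hashlib.sha1(f"{r['name']}|{r['zip']}".encode()).hexdigest()
    groups[key].append(r["id"])

# Reduce: every group with more than one record is a set of linked records.
links = [ids for ids in groups.values() if len(ids) > 1]
print(links)  # [[1, 2]]
```

In a real MapReduce job the composite key would be emitted by the mapper and the grouping would happen in the shuffle, but the logic is the same.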
Transitive Deterministic Linking
This method consists of linking records if any of the variables match. Let's use the example of an online store that wants to track its users across devices and has the following events:
| EventId | EventType | DeviceId | CardNumber | AccountID |
|---------|-----------|----------|------------|-----------|
| 1       | PageView  | D1       | -          | A1        |
| 2       | Sale      | D2       | C1         | A1        |
| 3       | PageView  | D3       | -          | -         |
| 4       | Sale      | D2       | C2         | -         |
| 5       | Sale      | D3       | C3         | -         |
| 6       | Sale      | D4       | C3         | A2        |
We can infer that events 1 and 2 are linked because, despite being performed from different devices (cookies), in both cases the user was logged in to the same account. Event 2 also seems to be linked with event 4, because both sales were made on the same device, even though the user was not logged in for the second one. So events 1, 2 and 4 can be attributed to the same user.
This process is not easy to do in MapReduce, because the linking involves an iterative algorithm that can join groups of events on each iteration. One way to do it is with Giraph: convert each variable value into a vertex and treat the co-occurrence of two variable values in the same event as an edge. Then it is just a matter of finding the connected components of the graph, and each connected component will represent a user.
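On a single machine the same connected-components idea can be sketched with a union-find structure over the events from the table above (this is an in-memory stand-in for the Giraph computation, not how it would be distributed):

```python
# Union-find over variable values: the values that appear together in
# an event are unioned, so transitively connected values end up in the
# same component (one component per inferred user).
parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

events = [  # (DeviceId, CardNumber, AccountID); None = missing
    ("D1", None, "A1"), ("D2", "C1", "A1"), ("D3", None, None),
    ("D2", "C2", None), ("D3", "C3", None), ("D4", "C3", "A2"),
]
for ev in events:
    vals = [v for v in ev if v is not None]
    for v in vals[1:]:
        union(vals[0], v)

# D1, D2, A1, C1 and C2 collapse into one user; D3, D4, C3 and A2 into another.
users = {}
for v in parent:
    users.setdefault(find(v), set()).add(v)
print(sorted(map(sorted, users.values())))
```

Giraph would reach the same components by propagating the minimum vertex id along edges until convergence.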
Linking Phase: Probabilistic Linkage
Probabilistic linkage, or fuzzy linkage, consists of linking records even if the variables don't match exactly. The best known model is Fellegi-Sunter, which describes a set of rules that are proven to be optimal when the variables are conditionally independent. The main idea is to compute the discriminatory power of each variable and then combine them all to get the probability that two records refer to the same entity.
That said, it is not easy to implement on large data sets, and there are simpler techniques that can give good results. It is also worth noting that probabilistic linkage can produce false positives (linking records that should not be linked) and false negatives (not linking records that should be linked).
Probabilistic Linkage with Blocking
The problem with finding duplicates in a data set is that with N documents you have to compare each document against all the others, performing O(N^2) comparisons. To avoid that, a common technique is blocking, which consists of partitioning the data set into smaller blocks of potential duplicates, making it viable to compare all the documents against each other within each block. Typical blocking keys are normalized strings (lowercase, without punctuation, etc.), locality-sensitive hashes (LSH) and the top-N terms of a document with the highest tf-idf values.
Using MapReduce, in the map phase we would create a blocking key for each document, ideally producing groups of no more than a thousand documents. Then, in the reduce phase, we would perform the comparison using different distance measures depending on the type of each variable: for example, Levenshtein or Jaro-Winkler distances for strings, Euclidean distance for positions, etc. The records would then be linked if all the variables were at least 95% similar, or we could combine the similarity scores of the variable pairs with a weighted mean.
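A single-machine sketch of blocking plus pairwise comparison might look as follows. The blocking key (first letter of the name plus city) and the sample records are made up, and `difflib.SequenceMatcher` from the standard library stands in for a proper Levenshtein or Jaro-Winkler implementation:

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

people = [
    {"id": 1, "name": "Jon Smith",  "city": "London"},
    {"id": 2, "name": "John Smith", "city": "London"},
    {"id": 3, "name": "Ann Lee",    "city": "Paris"},
]

# Blocking: a cheap key so only records in the same block are compared,
# avoiding the full O(N^2) pass over all pairs.
blocks = defaultdict(list)
for p in people:
    blocks[(p["name"][0].lower(), p["city"].lower())].append(p)

def similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

links = []
for block in blocks.values():
    for a, b in combinations(block, 2):
        if similarity(a["name"], b["name"]) >= 0.9:  # threshold, tune per data set
            links.append((a["id"], b["id"]))
print(links)  # [(1, 2)]
```

The choice of blocking key is a trade-off: a coarser key misses fewer true duplicates but produces bigger blocks and more comparisons.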
Similarity of Documents
For records with lots of text (documents), a useful technique is to split each document into smaller chunks and then compare how many chunks it shares with other documents using the Jaccard index, which is the number of shared chunks divided by the number of distinct chunks across both documents. Two documents are then linked if their similarity index is higher than a given threshold.
For example, if we have two documents D1 and D2, split each into four chunks, and they share two of them, then there are six distinct chunks in total and the Jaccard index is 2/6 ≈ 0.33.
The most usual ways to split documents are creating shingles or using content-based chunking algorithms, which create fewer chunks and make the similarity easier to compute. It is also common to hash the shingles to reduce network I/O during the comparison.
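The shingling and Jaccard computation above can be sketched in a few lines (character 3-shingles here; word shingles or content-based chunks would plug into the same `jaccard` function):

```python
def shingles(text, k=3):
    """Character k-shingles of a whitespace-normalized, lowercased text.
    In a distributed setting each shingle would be hashed to cut I/O."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(a, b):
    """Shared chunks divided by distinct chunks across both documents."""
    return len(a & b) / len(a | b)

d1 = shingles("record linkage in large data sets")
d2 = shingles("record linkage on large datasets")
sim = jaccard(d1, d2)
print(round(sim, 2))
# Link the pair if sim exceeds the chosen threshold, e.g. 0.7.
```

Identical documents score 1.0 and documents with no shared shingles score 0.0, so the threshold maps directly onto how aggressive the linking should be.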
Once the documents have been linked, what we do with them depends on the data set. We should be careful not to perform irreversible operations, because we could later discover that the linkage was not correct. The options are:
- Link: adding a reference in each record to the other linked records or storing the links in a different data set.
- Merge: if we're linking two different data sets, we could create a view on top of them that unifies both data sets. That would be the case of single customer view solutions.
- Deduplicate: keeping only one of the records to avoid having duplicates. We should only do this if we keep the raw data and we can reprocess the records.
In probabilistic linkage it is also common to link the records if their similarity is higher than a threshold X, not link them if it is lower than a threshold Y, and manually review the pairs whose similarity falls between X and Y.
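This three-way decision is straightforward to express in code; the thresholds below (0.9 and 0.7) are illustrative and would be tuned per data set:

```python
def classify(similarity, link_above=0.9, review_above=0.7):
    """Three-way decision: link, send to manual review, or reject."""
    if similarity >= link_above:
        return "link"
    if similarity >= review_above:
        return "review"
    return "non-link"

print([classify(s) for s in (0.95, 0.8, 0.3)])
# ['link', 'review', 'non-link']
```

Widening the review band reduces both false positives and false negatives at the cost of more manual work.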