The challenge
In my first article GDPR implementation challenges: RAD tools to the rescue! I showed how it is possible to quickly implement a control panel for the maintenance of sensitive content related to users and customers.
But before you remove or export user/customer data, you must be sure that there are no extra records related to the same user or customer. It might happen that a customer was created twice, the first time with a middle name and the second time without the middle name. How do you catch these anomalies?
Technical approach
There are a lot of deduplication tools in the market, which are normally quite expensive and complex to integrate with your current system. However it is possible to start with something very simple and quick to implement, just to get an idea of the data quality in your system, and eventually build up something more sophisticated, using other tools and techniques more suitable for your problem.
One of the standard features of the Oracle database that can help in this context is the PL/SQL package UTL_MATCH. Without going too much into the details, this package allows the use of the Jaro-Winkler algorithm (see a description on Wikipedia), which gives you the “distance” between two strings.
The 3 very attractive features of this package are:
- The algorithm is implemented for you, no need to reinvent the wheel
- The execution speed is already optimized for you (really fast!)
- Very easy to use.
In my tests I got subsecond response comparing 100.000 strings. Perform some test in your system to see what you get. Here are a couple of examples that you can use:
SELECT row_num
,jw_distance
FROM (SELECT ROWNUM row_num
,UTL_MATCH.jaro_winkler_similarity('100',''||ROWNUM||'') jw_distance
FROM dual
CONNECT BY LEVEL <= 100000)
ORDER BY 2 DESC;
SELECT your_column
,jw_distance
FROM (SELECT your_column
,UTL_MATCH.jaro_winkler_similarity('your_string_to_compare',
your_column) jw_distance
FROM your_table)
ORDER BY 2 DESC;
Demo
In this screenshot the application presents the list of users currently in the system (around 60.000 in this demo).
Now you can click on one of the icons in the “Dedup.” column to go to the next step, or you can try to search for a specific name, in the field near the loop icon, and pick a user that shows different variations that might indicate possible duplicates. This last step is just to show how different are the results in case of normal search performed as LIKE ‘string_to_search%’ and the next screen, with the Jaro-Winkler algorithm in action.
Clicking on the Dedup. column icons you will be presented the screen showing the candidate duplicates:
The match is implemented only on the “Full Name” column (but of course other more complex strategies are possible).
Notice how the ranking shows candidates impossible to discover using a simple “LIKE %” search, as every single character in the string is matched against the target name, which is a strategy that would require much more effort if programmed from scratch.
And here is the link to the GDPR Control Center demo where you can see Jaro-Winkler algorithm in action and play with it.
Have fun! 🙂