Data overlays are perhaps the most insidious of all data warehousing issues. A data overlay is exactly what it sounds like: an existing demographic record with its own unique identifier is overwritten by a completely different demographic record that carries the same unique identifier.
The problem is mostly one of perception. When hospitals and data administrators claim that an identifier is unique, they expect it to be so. From a business standpoint, teams don’t investigate whether source systems might be overwriting records because they don’t see the problem until some time after the overlay has occurred.
Even then, the root cause may stay hidden for quite some time.
What makes overlays especially tricky for centralized hub management is how hard it is to understand why records in an entity remain linked when they are obviously different. Without the proper information, human judgment takes a beating on this one.
Consider the following example:
A user, whom we’ll call Steven, works in the billing department of a hospital. He types in the demographic information for a new patient named Mary Johnson, who, along with other identifying information, is assigned a unique patient identifier.
Some days later, Steven receives billing information for a new patient, John Smith. For any number of reasons, either human or technological, this record is given the same unique identifier as Mary Johnson. Thus, instead of a patient addition, a patient update has been performed.
When the record arrives in the data warehouse for processing with the patient’s other profile data, the first record, Mary Johnson, is overwritten with the demographics for John Smith. And this is where the insidiousness begins. The rest of the data that formerly linked to Mary Johnson, such as lab records, pharmacy information, and any other identifiers tied to her record, now sits on top of John Smith’s information in the entity management system.
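The mechanics of the example can be sketched in a few lines of Python. This is only an illustration, not how any particular source system or hub works: the dictionaries, the function names, the identifier "P-1001", the birth dates, and the linked records are all made up for the sketch. The one assumption that matters is that the store is keyed solely on the patient identifier, so a second patient arriving under the same identifier is treated as an update.

```python
# Hypothetical in-memory stand-ins for the source system and the hub.
patients = {}   # patient_id -> demographics
linked = {}     # patient_id -> clinical records linked to that identifier


def upsert_patient(patient_id, demographics):
    """Insert or overwrite demographics, keyed only on patient_id."""
    action = "update" if patient_id in patients else "insert"
    patients[patient_id] = dict(demographics)
    return action


def link_record(patient_id, record):
    """Attach a clinical record to whatever patient holds this identifier."""
    linked.setdefault(patient_id, []).append(record)


# Mary Johnson is registered and accumulates linked data (sample values).
first = upsert_patient("P-1001", {"name": "Mary Johnson", "dob": "1958-03-14"})
link_record("P-1001", "lab: lipid panel")
link_record("P-1001", "pharmacy: atorvastatin")

# John Smith is entered under the same identifier: an update, not an addition.
second = upsert_patient("P-1001", {"name": "John Smith", "dob": "1990-11-02"})

# The profile now shows John Smith's demographics over Mary Johnson's history.
profile = {"demographics": patients["P-1001"], "linked": linked["P-1001"]}
print(first, second, profile)
```

Nothing in the sketch raises an error or a warning, which is the point: from the system's perspective a perfectly ordinary update took place.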
Just because an overlay has occurred does NOT mean that the records are smart enough to disentangle themselves based on the new attributes. In fact, legitimate updates often change multiple attributes at once, such as an address and a phone number after a move. Would we really want to create a new identity every time that happened?
Nevertheless, it’s the unintentional impacts that cause the real trouble. In this case, as billing administrators attempt to pull Mary Johnson’s profile, a John Smith record will also crop up in the mix, leading to a great deal of confusion, finger-pointing, and even blame against the software itself.
Data overlays wreak havoc because teams don’t know where to look or what to fix. At first blush, it may look like the data warehouse is the culprit, but unless data analysts can inspect the full history of the records in question (when did Mary Johnson “become” John Smith?), teams should avoid jumping to conclusions.
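If an update history is available, the question of when Mary Johnson "became" John Smith can be approached with a simple scan. The following is a rough sketch under stated assumptions, not a description of how any specific product detects overlays: the history format, the field names, the sample rows, and the 0.5 similarity threshold are all hypothetical. It flags an update only when the name changes drastically and the date of birth changes too, so that an ordinary address change is left alone.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Rough string similarity between 0.0 and 1.0, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_suspected_overlays(history, threshold=0.5):
    """Scan (timestamp, patient_id, demographics) rows in time order.

    Returns (timestamp, patient_id, old_name, new_name) for each update
    where the name is very different AND the date of birth changed.
    """
    last_seen = {}
    suspects = []
    for timestamp, patient_id, demo in history:
        prev = last_seen.get(patient_id)
        if prev is not None:
            name_changed = similarity(prev["name"], demo["name"]) < threshold
            dob_changed = prev["dob"] != demo["dob"]
            if name_changed and dob_changed:
                suspects.append((timestamp, patient_id, prev["name"], demo["name"]))
        last_seen[patient_id] = demo
    return suspects


# Made-up sample history for a single identifier.
history = [
    ("2024-01-05", "P-1001", {"name": "Mary Johnson", "dob": "1958-03-14", "city": "Dayton"}),
    ("2024-01-20", "P-1001", {"name": "Mary Johnson", "dob": "1958-03-14", "city": "Akron"}),
    ("2024-02-02", "P-1001", {"name": "John Smith", "dob": "1990-11-02", "city": "Akron"}),
]

suspects = find_suspected_overlays(history)
print(suspects)
```

A heuristic like this only narrows down where to look. A flagged row still needs a person to confirm it against the source system, and a real hub would likely weigh more attributes than name and date of birth.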
The best types of software have a way of detecting overlay issues and have defined methods for unraveling them. These remediation efforts are limited to records within the centralized hub, however, and are relevant only after the data has settled. Treating the symptoms in the data warehouse does not address the root causes, and without going back to the source, these types of data errors may occur again, perhaps frequently. The old adage rings true once again: “an ounce of prevention is worth a pound of cure.”
In summary, if data in a warehouse links together records that are clearly incompatible, you may have an overlay issue on your hands. Stay focused, be persistent, and work hard at being a great data detective.