No data system is immune to quality issues. At some point, a human being will inevitably make a mistake with a record, be it a typo or a coding error, and a bit of corrupted data will make it into the system. What happens next depends largely on how well stakeholders have prepared their organization to confront data quality issues.
Sometimes data issues are caught early, before major damage is done; other times they are overlooked and, quite possibly, accepted as a legitimate version of the “truth.”
Corrupt data is like a pollutant in water: a little bit of it doesn’t hurt the environment, but given enough garbage, you can block the river, kill all the fish, and spend hundreds of thousands of dollars performing the cleanup.
Jim Harris proposes an interesting strategy in his recent blog post “Retroactive Data Quality.” What if, he asks, we “could somehow travel back in time and prevent specific business-impacting occurrences of poor data quality from happening”? Although I am intrigued by this concept, society unfortunately doesn’t possess the technology to travel back in time.
But there is another option to consider. Even though we cannot go back to “putting data right that once went wrong” (Jim’s passing reference to the excellent television program Quantum Leap), we do have the ability to track where the data went wrong in the first place and then backtrack to correct the problem at the source. How, you may ask?
For one thing, we could implement a data warehouse.
A data warehouse is a repository that cleans, transforms, and catalogs data from across the enterprise, and it is built to manage and retrieve that data efficiently. Its tables are often highly denormalized and focus on subjects relevant to separate business domains such as sales, marketing, and manufacturing. What a properly structured data warehouse does very well, however, is store data: not just “once a night” replicated data, mind you, but the entire history of the data flowing from the source systems.
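To make “denormalized” a little more concrete, picture a sales-oriented warehouse table that repeats the customer, product, and store attributes on every row, so the Sales subject area can be queried without joining back to separate normalized tables. The column names below are purely illustrative, not any vendor’s schema:

```python
# One wide, denormalized row per sale: descriptive attributes are copied onto
# every record rather than looked up from separate normalized tables.
sale_row = {
    "sale_id": 1001,
    "sale_date": "2011-03-15",
    "customer_name": "Jane Doe",
    "customer_region": "Northeast",
    "product_name": "Widget A",
    "product_category": "Widgets",
    "store_city": "Boston",
    "quantity": 3,
    "unit_price": 9.99,
}
```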
We can think of a data warehouse as obeying the law of conservation of energy, which states that energy (or, in this case, data) is neither created nor destroyed. A data warehouse is simply the recipient of upstream system data and cannot make modifications on its own. Thus, bad data streaming from the source will produce bad data in the warehouse. The only real strategy for fixing this data is to “cleanse” the source and allow the correction to flow back into the warehouse.
This is where the true power of a data warehouse comes into play. Following the law of conservation, the modified data does not overwrite the old; rather, it is added to the warehouse alongside the corrupt data. Specialized metadata fields flag each record version as “active” or “deleted” and capture the date and time that activation or deletion occurred. With changes arriving from multiple sources every hour, it is easy to imagine how fast the tables in a data warehouse can grow.
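Here is a minimal sketch of that append-only pattern, assuming a hypothetical customer_history table and using SQLite purely for illustration; real warehouse loaders differ, but the metadata fields play the same role:

```python
import sqlite3
from datetime import datetime, timezone

# In-memory database standing in for the warehouse (illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer_history (
        customer_id   INTEGER,
        email         TEXT,
        record_status TEXT,   -- 'active' or 'deleted'
        activated_at  TEXT,   -- when this version of the record became current
        deleted_at    TEXT    -- when it was superseded (NULL while still active)
    )
""")

def load_correction(customer_id, corrected_email):
    """Append a corrected version of a record; never overwrite the old one."""
    now = datetime.now(timezone.utc).isoformat()
    # Flag the existing active version as deleted, preserving it for auditors.
    conn.execute(
        "UPDATE customer_history SET record_status = 'deleted', deleted_at = ? "
        "WHERE customer_id = ? AND record_status = 'active'",
        (now, customer_id),
    )
    # The cleansed value arrives as a brand-new row alongside the corrupt one.
    conn.execute(
        "INSERT INTO customer_history VALUES (?, ?, 'active', ?, NULL)",
        (customer_id, corrected_email, now),
    )
    conn.commit()

# A corrupt record streams in from the source system...
conn.execute(
    "INSERT INTO customer_history VALUES (42, 'jdoe@@example.com', 'active', "
    "'2011-01-01T00:00:00', NULL)"
)
# ...and, once the source is cleansed, the fix flows back into the warehouse.
load_correction(42, "jdoe@example.com")
```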
But the advantages of a properly designed warehouse outweigh the negatives in at least two critical areas. First, data auditors can more easily track the number of modifications stemming from each source system. If data related to Marketing consistently produces more deletions per day than all the other systems combined, that may raise a red flag and prompt a closer look at Marketing best practices. Second, because a data warehouse preserves a snapshot of records as they existed at any point in time, analysts are better able to identify the cause of data quality errors once they are discovered. In essence, analysts can go “back in time” to see where the data first became corrupt.
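Both of those checks reduce to simple queries over the metadata fields sketched above. Continuing that same hypothetical customer_history example, an analyst could reconstruct the records as they existed on a given date, and an auditor could count how many versions were flagged as deleted each day:

```python
def snapshot_as_of(as_of):
    """Return each record version that was active at the given ISO timestamp."""
    return conn.execute(
        "SELECT customer_id, email FROM customer_history "
        "WHERE activated_at <= ? AND (deleted_at IS NULL OR deleted_at > ?)",
        (as_of, as_of),
    ).fetchall()

def deletions_per_day():
    """Count record versions flagged 'deleted' on each calendar day."""
    return conn.execute(
        "SELECT substr(deleted_at, 1, 10) AS day, COUNT(*) "
        "FROM customer_history WHERE record_status = 'deleted' GROUP BY day"
    ).fetchall()

# "Go back in time": what did customer 42 look like before the correction?
print(snapshot_as_of("2011-06-01T00:00:00"))  # the corrupt 'jdoe@@example.com' row
print(deletions_per_day())
```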
Although it is technically not the time travel of Jim Harris’s article, a data warehouse may be the next best thing. There are, of course, multiple factors to consider before buying the latest and greatest data warehouse package. Researching the features vendors provide in terms of architecture, ease of deployment, and out-of-the-box capabilities (especially around deletion and update triggers) is an important first step in making the right decision. Ironically, it is the future purchase of a data warehouse that could identify and resolve an organization’s data quality issues of the past.