Introduction
Everyone has an intuitive notion of what data quality means, but it is difficult to say exactly when that quality is high or low. Thomas Redman provides a useful definition in his book Data Driven (http://www.dataversity.net/contributors/thomas-redman/): data quality is high if the data in question is suitable for its intended use in business activities, decision-making and planning. John Morris of Data Quality Pro explains what "suitable for its intended use" means in practice: perfection in data quality is unattainable, so aim for a sufficient level at the right place at the right time (https://www.dataqualitypro.com/data-quality-for-data-migration-techniques/).
Data eXcellence looks at data quality through the lens of data migration. What are the requirements for data in a migration? How do I ensure that my data meets those requirements? And how do I gain insight into this? In short: when is my data "suitable for its intended use"? This article helps you answer these questions.
Data quality: a determining factor
Data quality is one of the determining factors for the success of a data migration. Poor data quality can turn a migration into a drawn-out problem. Mandatory data is missing or does not meet the requirements: an IBAN fails its check digits, or a citizen service number does not pass the 11-check. Important data is not suitable for automatic processing: it is unstructured or only available as hard copy. Or a database is inconsistent, so that data has to be linked using all sorts of unconventional trickery. All of this takes a lot of effort and lengthens the completion time.
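As an illustration of such a check, here is a minimal Python sketch of the 11-check (elfproef) for a nine-digit Dutch citizen service number; the function name and example numbers are illustrative and not part of any DXF tooling.

```python
def passes_11_check(bsn: str) -> bool:
    """11-check (elfproef) for a nine-digit Dutch citizen service number."""
    if len(bsn) != 9 or not bsn.isdigit():
        return False
    # Weights 9..2 for the first eight digits, -1 for the last (check) digit.
    weights = [9, 8, 7, 6, 5, 4, 3, 2, -1]
    total = sum(int(d) * w for d, w in zip(bsn, weights))
    return total != 0 and total % 11 == 0

print(passes_11_check("111222333"))  # True: weighted sum 66 is divisible by 11
print(passes_11_check("123456789"))  # False: weighted sum 147 is not
```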
High data quality, on the other hand, yields gains in many areas. If all mandatory fields have been filled correctly, the migration team does not have to spend time on this. If all data adheres to strict rules, the migration will be easier, as the migration software does not have to take into account numerous exceptions.
That is why it is important to gain insight into data quality at the earliest possible stage of a migration project. There are many tools that can analyse the content of databases. The most basic versions tell you how varied the content is and what the referential integrity looks like. This provides an indication of the overall quality, but says little about the usability of the data in a target system. More extensive tools let you specify which rules the data must meet, and some even provide a controlled way of repairing data contamination. Gartner has compiled a useful overview at https://www.gartner.com/reviews/market/data-quality-tools.
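To give an idea of what such basic profiling looks like in practice, here is a minimal sketch in Python with pandas; the file and column names are assumptions for the sake of the example.

```python
import pandas as pd

# Hypothetical source extracts: loans reference customers via customer_id.
customers = pd.read_csv("customers.csv")
loans = pd.read_csv("loans.csv")

# Content variety: distinct and missing values per column.
print(loans.nunique())
print(loans.isna().sum())

# Referential integrity: loans whose customer_id has no matching customer.
orphans = loans[~loans["customer_id"].isin(customers["customer_id"])]
print(f"{len(orphans)} loans reference a customer that does not exist")
```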
Data eXcellence has developed its own data quality technology, focused entirely on data quality from a migration perspective. This tooling has been integrated into DXF: the Data eXcellence data migration factory.
The target system takes the lead in a migration
During a data migration, the target system dictates the requirements that the data quality has to meet. This goes beyond filling mandatory fields. The effectiveness of a system can be strongly affected by the presence or absence of certain data. If, for example, the history of entries in mortgage administration records is incomplete, a target system may mistakenly charge arrears. The migration team must therefore first determine under which conditions the target system will work correctly, and then convert these conditions into data quality requirements. A few guidelines apply:
- Data that is not perfect but does not cause the target system to work incorrectly remains unchanged. Remember Redman's definition: suitable for its intended use.
- Only consider data that is actually moved to the target system. For this, it is important to determine the exact scope of the data: its breadth (what types of data) and depth (how much data history).
- Try to limit the history to data that the target system actually needs in order to operate. Historical data often originates from previous migrations or dates back to a time when the source system imposed different requirements. More history means more variety, and thus more data quality rules, more exceptions and more effort to eliminate data quality issues.
Keep all data quality requirements not imposed by the target system out of the scope of the data migration. The focus should be entirely on the correct functioning of the target system. At most, allow for plausibility checks: an implausible age of a mortgagee (under 18 or over 110), or an implausible mortgage amount. The functioning of the target system is not affected, but it is clear that something is not right. The guiding requirement remains that the data quality is sufficient for the target system to work correctly. No more and no less.
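A minimal sketch of such plausibility checks; the field names and the amount threshold are assumptions, and the checks only flag records for review rather than block the migration.

```python
from datetime import date

# Illustrative threshold; the age bounds come from the text, the amount is an assumption.
MAX_PLAUSIBLE_AMOUNT = 5_000_000

def plausibility_warnings(birth_date: date, mortgage_amount: float) -> list[str]:
    """Flag implausible values without blocking the migration."""
    warnings = []
    age = (date.today() - birth_date).days // 365  # approximate age in years
    if age < 18 or age > 110:
        warnings.append(f"implausible mortgagee age: {age}")
    if mortgage_amount <= 0 or mortgage_amount > MAX_PLAUSIBLE_AMOUNT:
        warnings.append(f"implausible mortgage amount: {mortgage_amount}")
    return warnings

print(plausibility_warnings(date(1900, 1, 1), 250_000))  # reports an age warning only
```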
The Data eXcellence approach
Data quality is given high priority in the data migration projects of Data eXcellence. The requirements of the target system are mapped out from the start of a project. The migration team translates these requirements into data quality rules, in consultation with experts on the source and target systems.
Some examples of data quality rules, followed by a sketch of how they can be checked:
- The citizen service number does not pass the 11-check
- The terminated loan has no termination date
- The nationality cannot be found in the country code table
- The commencement date of the mortgage pre-dates 01/01/1980
- The date of the first payment pre-dates the starting date of the loan
- The driver of the car is under 18
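The sketch below shows one way rules like these could be expressed and checked automatically; the record fields and the rule set are illustrative and not the DXF implementation.

```python
from datetime import date

# Hypothetical loan record; field names are illustrative.
loan = {
    "status": "terminated",
    "termination_date": None,
    "commencement_date": date(1975, 6, 1),
    "first_payment_date": date(1975, 5, 1),
}

# Each rule: (description, predicate that is True when the rule is violated).
rules = [
    ("Terminated loan has no termination date",
     lambda r: r["status"] == "terminated" and r["termination_date"] is None),
    ("Commencement date pre-dates 01/01/1980",
     lambda r: r["commencement_date"] < date(1980, 1, 1)),
    ("First payment pre-dates the starting date of the loan",
     lambda r: r["first_payment_date"] < r["commencement_date"]),
]

violations = [description for description, violated in rules if violated(loan)]
print(violations)  # all three rules are violated for this example record
```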
Next, the team checks whether the source data complies with these rules. All exceptions are listed in a report, which is discussed with the customer periodically (weekly, for example). During this so-called cleansing consultation, every reported "violation" is assigned an action owner, who is responsible for tackling the data problem. In this way, every issue remains the focus of attention of those involved until it is resolved.
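As a sketch, one entry in such a cleansing report could look like this; the fields shown are assumptions, not a prescribed format.

```python
from dataclasses import dataclass

@dataclass
class CleansingItem:
    """One entry in the cleansing report discussed with the customer."""
    rule: str               # the data quality rule that is violated
    record_keys: list[str]  # keys of the affected source records
    action_owner: str       # person responsible for tackling the problem
    status: str = "open"    # e.g. "open", "conversion rule defined", "resolved"

item = CleansingItem(
    rule="The terminated loan has no termination date",
    record_keys=["LOAN-0042", "LOAN-0137"],
    action_owner="source system expert",
)
print(item)
```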
There are several ways to resolve data quality problems. The most common are:
- Defining conversion rules. It is often possible to derive a missing piece of data from other source data; a conversion rule can then perform this derivation automatically. In other cases, a default value suffices: a missing country code, for example, is filled with "Netherlands" by default (see the sketch after this list).
- Cleaning data in the source system. By adjusting information in the source data, you literally tackle a problem at the source. In situations where individual pieces of information are missing, this is often the only option, but it is the most time-consuming way of cleaning data and results in the longest completion time.
- Adjusting the target system. This is usually an emergency solution, but if the two solutions above are not feasible or too time-consuming, the choice can be made to adjust the target system. This is often a temporary measure to save time, so that users can still enter the missing data after the migration.
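A minimal sketch of the first option: a conversion rule that derives a missing country code from other source data and otherwise falls back to the default mentioned above. The field names and the derivation from nationality are assumptions.

```python
# Conversion rule used when mapping source records to the target system:
# derive a missing value where possible, otherwise fall back to a default.
DEFAULT_COUNTRY = "NL"  # "Netherlands", the agreed default in this example

def convert_country_code(record: dict) -> str:
    """Return the country code to be loaded into the target system."""
    if record.get("country_code"):   # value present in the source
        return record["country_code"]
    if record.get("nationality"):    # derive it from other source data
        return record["nationality"]
    return DEFAULT_COUNTRY           # default value agreed in the cleansing consultation

print(convert_country_code({"country_code": "BE"}))  # 'BE'
print(convert_country_code({"nationality": "DE"}))   # 'DE' (derived)
print(convert_country_code({}))                      # 'NL' (default)
```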
The experience of Data eXcellence is that 80% of data quality issues can be located and resolved quickly, partly thanks to proper tooling support. Most of the effort goes into the remaining 20%, which often involves manual investigation and the handling of specific exceptions.
Conclusion
Data quality deserves (and is given) a lot of attention within data migration projects. The focus is on the correct functioning of the target system, not on a general improvement of quality. The so-called cleansing consultation is a key element in the Data eXcellence approach. Resolving quality problems follows the 80/20 rule; it is important that the project has sufficient time and resources available for this.