This question is becoming more and more common, yet organisations cannot always give a good answer to it, let alone offer a solution for the underlying problem: removing data from IT systems that you are no longer allowed, or no longer want, to keep. This is known colloquially as ‘cleaning data’.
When is data cleaning relevant?
The reasons for data cleaning are diverse. The most common reason is compliance with laws and regulations, such as the GDPR. Some data may simply no longer be retained and must be deleted. The Dutch Data Protection Authority calls this the “right to be forgotten”. There may also be other reasons to clean data. Consider the Public Records Act, destruction periods, the division of business activities or a BPO (such as a pension provider) that sees a customer leave and must demonstrably remove the associated data. Lastly, the general rule applies that less data means fewer risks in the event of a data breach, lower costs because less storage is used, and often better performance.
Data purging, data clean-up or just data cleaning?
In this article, we use the term data cleaning for removing data from applications and their underlying databases. Terms such as data purging and data clean-up are also regularly used for this. Note that data cleaning, like data cleansing, is also a term from the data quality domain, where it refers to detecting and correcting ‘dirty’ data. That meaning is not what this article is about.
What kind of data is eligible?
It goes without saying that privacy-sensitive data, such as personal data and associated transactional data, is eligible for removal. Examples include financial transactions, medical data or an order history. And what about all the detailed data of graduate students and pupils? The issue occurs in every sector.
Removing competitively sensitive information is also a form of data cleaning, as is cleaning outdated data and data that is no longer usable.
The right to be forgotten
The Dutch Data Protection Authority describes the “right to be forgotten”. If there is no longer a good reason for an organisation to continue processing personal data, the organisation is in some cases obliged to delete this data. This applies, for example, if the organisation no longer needs the personal data for the purpose for which it was collected or processed, or if the statutory retention period has expired. There are a number of exceptions to the right to be forgotten, for example if an organisation is obliged by law to use the data or to keep it for a certain period of time. In that case, the organisation may not delete the data.
Data cleaning, how do you do that?
Ideally, business applications and packaged software offer integrated functionality to delete data based on specific criteria. Unfortunately, this is not always the case. Fortunately, alternatives are available.
If large amounts of data need to be removed at once, offline cleaning is a possibility. The application is taken offline for a maintenance window, during which the data is removed correctly, completely and demonstrably with the help of data integration tools. After validation and acceptance of the result, the ‘cleaned’ application is made available again. The advantage of this solution is that there is no need to take other simultaneous users into account. Because resources and technical possibilities can be used to the maximum, performance is often not an issue. The disadvantage is that this option results in downtime of the application.
If downtime is not possible or not acceptable, you can choose to clean online. Specialist tools are used to remove the data from an operational production environment in a controlled manner. Naturally, relations, sequence and dependencies are taken into account. Because other users and processes are using the application at the same time, performance needs particular attention. Additional validation measures are also necessary to guarantee the integrity of the data while the application is being cleaned and used simultaneously.
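To make the point about relations, sequence and dependencies concrete, here is a minimal sketch in Python using the standard-library sqlite3 module. The `customers` and `orders` tables, the function name `clean_online` and the default batch size are assumptions made for illustration only; a real landscape will have many more tables and a specialist tool in place of this loop. The idea shown is that dependent (child) rows are removed before the rows they refer to, in short transactions that leave room for other users.

```python
import sqlite3
import time


def clean_online(conn, customer_ids, batch_size=100, pause_seconds=0.0):
    """Delete customers and their orders in small, separate transactions.

    Hypothetical schema: orders.customer_id refers to customers.id.
    Child rows (orders) are deleted before parent rows (customers),
    so the foreign-key relation is never violated.
    """
    deleted = {"orders": 0, "customers": 0}
    ids = list(customer_ids)
    for start in range(0, len(ids), batch_size):
        batch = ids[start:start + batch_size]
        marks = ",".join("?" * len(batch))
        with conn:  # commits on success, rolls back the batch on error
            cur = conn.execute(
                f"DELETE FROM orders WHERE customer_id IN ({marks})", batch
            )
            deleted["orders"] += cur.rowcount
            cur = conn.execute(
                f"DELETE FROM customers WHERE id IN ({marks})", batch
            )
            deleted["customers"] += cur.rowcount
        time.sleep(pause_seconds)  # give other users and processes room
    return deleted
```

Keeping each transaction short limits how long rows are locked, at the price of a longer total run time. Which trade-off is right depends on the application and its load.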
For both options, the usual quality assurance principles apply. After configuring the solution and performing (acceptance) tests, a ‘pre-check’ takes place before the actual cleaning is carried out. After acceptance, the process can be performed periodically to continue to meet the cleaning criteria.
If the removal of data is completely impossible or undesirable, anonymisation may be an alternative. In that case, the data is not physically removed, but masked in such a way that the criteria are met.
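As an illustration of masking instead of deleting, the sketch below overwrites the identifying columns of a record while leaving the row and its key in place, so related records stay consistent. The table and column names (`customers`, `name`, `email`, `birth_date`) and the function name are hypothetical. Whether a given masking actually meets your criteria is an assessment that this code cannot make for you.

```python
import sqlite3


def anonymise_customers(conn, customer_ids):
    """Mask identifying columns in place; rows and keys are kept.

    Hypothetical schema: customers(id, name, email, birth_date).
    Returns the number of rows that were masked.
    """
    changed = 0
    with conn:  # all-or-nothing: commit on success, roll back on error
        for customer_id in customer_ids:
            cur = conn.execute(
                "UPDATE customers "
                "SET name = 'ANONYMISED', email = NULL, birth_date = NULL "
                "WHERE id = ?",
                (customer_id,),
            )
            changed += cur.rowcount
    return changed
```

Fixed placeholder values are used here deliberately, because a value derived from the original (such as a hash of the e-mail address) may still be traceable to a person.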
Removing data from production systems is of course not without risks. Therefore, choose a solution that suits your situation and ensure that this process is carried out in a controlled manner.
It starts with an inventory of the data to be removed. Where in the landscape is this data located? Also think of the less obvious places. Data that needs to be cleaned may be located anywhere in the chain. And what about backups, the data warehouse or replication environments? In some situations, a procedural measure can provide the answer.
Determining the correct and unambiguous selection criteria is also an important starting point: which data should be deleted? The result of the selection (the dataset to be removed) forms the basis for the data cleaning.
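What an unambiguous selection criterion can look like is sketched below. Everything specific in it is an assumption for illustration: the `customers` table, the `left_on` and `legal_hold` columns, the function names and the seven-year retention period. The actual period and exceptions follow from the laws and agreements that apply to your data.

```python
import sqlite3
from datetime import date


def years_before(day, years):
    """Return the date `years` years before `day`."""
    try:
        return day.replace(year=day.year - years)
    except ValueError:
        # 29 February has no counterpart in a non-leap year: use 28 February.
        return day.replace(year=day.year - years, day=28)


def select_for_removal(conn, today, retention_years=7):
    """Return ids of customers whose retention period has expired.

    Hypothetical schema: customers(id, left_on, legal_hold), where
    left_on is an ISO date string (YYYY-MM-DD) or NULL for active
    customers, and legal_hold = 1 marks data that must be kept.
    """
    cutoff = years_before(today, retention_years).isoformat()
    rows = conn.execute(
        "SELECT id FROM customers "
        "WHERE left_on IS NOT NULL AND left_on <= ? AND legal_hold = 0 "
        "ORDER BY id",
        (cutoff,),
    ).fetchall()
    return [row[0] for row in rows]
```

Passing `today` in as a parameter, rather than reading the system clock inside the function, makes the selection reproducible: the same date always yields the same dataset, which helps when the selection has to be reviewed or re-run.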
In addition, evidence plays an important role: a comparison of the selection, the situation before the removal and the result. Delivering (audit) reports that demonstrate that the data has been correctly and completely removed is essential.
Finally, a robust solution is necessary during the data cleaning process, especially if cleaning takes place online and over a long period. The following are important:
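The comparison of selection, before-situation and result can be reduced to a few set operations. The sketch below is one possible shape for such a check, not a prescribed report format; the function name `build_audit_report` and the field names are made up for this example, and the inputs are assumed to be collections of record keys.

```python
def build_audit_report(selected, before, after):
    """Compare the selection with the keys present before and after cleaning.

    selected: keys that were supposed to be removed
    before:   keys present before the cleaning run
    after:    keys present after the cleaning run
    """
    selected, before, after = set(selected), set(before), set(after)
    removed = before - after
    not_removed = selected & after      # selected, but still present
    unexpected = removed - selected     # gone, but never selected
    return {
        "selected": len(selected),
        "before": len(before),
        "after": len(after),
        "not_removed": sorted(not_removed),
        "removed_unexpectedly": sorted(unexpected),
        "correct_and_complete": not not_removed and not unexpected,
    }


if __name__ == "__main__":
    report = build_audit_report(selected=[1, 2], before=[1, 2, 3, 4], after=[3, 4])
    for field, value in report.items():
        print(f"{field}: {value}")
```

The two lists answer the two audit questions separately: `not_removed` shows whether the removal was complete, and `removed_unexpectedly` shows whether it was correct.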
- Restartability in error situations
- Assurance of data integrity
- Facilities to influence performance
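The three points above can be sketched together in a few lines. This is an illustration under assumptions, not a description of any particular tool: the `customers` table, the `cleaning_progress` checkpoint table and the function name `run_restartable` are invented for the example, and it assumes that the set of ids to remove is the same on every run of a job. Restartability comes from the checkpoint, integrity from committing each batch together with its checkpoint, and `batch_size` and `pause_seconds` are the knobs for influencing performance.

```python
import sqlite3
import time


def run_restartable(conn, job, ids, batch_size=2, pause_seconds=0.0):
    """Delete customers by id in ascending order, resuming after a failure.

    The highest id processed so far is stored per job in the hypothetical
    table cleaning_progress, in the same transaction as the deletes.
    Returns the number of ids processed in this run.
    """
    with conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS cleaning_progress "
            "(job TEXT PRIMARY KEY, last_id INTEGER NOT NULL)"
        )
    row = conn.execute(
        "SELECT last_id FROM cleaning_progress WHERE job = ?", (job,)
    ).fetchone()
    last_id = row[0] if row else -1  # assumes ids are non-negative
    todo = sorted(i for i in ids if i > last_id)  # skip finished work
    processed = 0
    for start in range(0, len(todo), batch_size):
        batch = todo[start:start + batch_size]
        marks = ",".join("?" * len(batch))
        with conn:  # deletes and checkpoint succeed or fail together
            conn.execute(f"DELETE FROM customers WHERE id IN ({marks})", batch)
            conn.execute(
                "INSERT OR REPLACE INTO cleaning_progress (job, last_id) "
                "VALUES (?, ?)",
                (job, batch[-1]),
            )
        processed += len(batch)
        time.sleep(pause_seconds)  # throttle to spare other users
    return processed
```

If a batch fails, its transaction is rolled back and the checkpoint still points at the last batch that was completed, so starting the same job again continues from there instead of from the beginning.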
Data eXcellence & data cleaning
DX has extensive experience with cleaning data in production systems, both offline and online. Through a targeted approach and specialist tools, DX supports organisations with data cleaning, including its actual implementation.