You use data to test data processing software. In most test environments, testers are not allowed to work with production data, which is why they usually use a data set created specifically for testing purposes. This creates an area of tension between usability and security: the more realistic the test data, the more the software will behave as it would in a production environment during testing, yet the greater the risk that the test set will reveal data that do not belong in a test environment.

So how do you create a realistic yet safe test set? How do you treat commercially sensitive information? And what about privacy-sensitive information? If you are involved in making test data or if you are curious how Data eXcellence approaches this, please read on.

Introduction

Organisations that store and process data take steps to ensure the security of their sensitive data. Sensitive data is data that is strategically or commercially sensitive or data that is privacy-sensitive. In the latter case, the GDPR formulates requirements with regard to the recording and processing of personal data (see https://www.eugdpr.org/ and https://hulpbijprivacy.nl/).

The security of sensitive data requires access restrictions and accurate administration for all systems in which users work with this data. Consequently, this data is, for example, only accessible to a select group of employees in a protected environment where the system logs every action (consulting, editing, deleting).

In a development environment, these kinds of measures are expensive and hamper the work. That is why developers and testers of data processing software use test data. This is data that, should it accidentally end up in the public domain, does not cause privacy issues, does not reveal trade secrets and does not cause harm to the organisation’s reputation.

Test data can be produced by entering fictitious data into the system that needs to be tested. However, this method has disadvantages:

  • Contamination. The approach only works if a separate environment is available (applications, database) in which the test data can be entered, otherwise the test set will be “contaminated” with other data.
  • Time-consuming. Filling a test set of any meaningful volume is time-consuming.
  • Laborious. Creating a realistic amount of history is laborious. For example, mortgages or pensions that have been running for years (or that have just ended), including transaction history. Or historic changes in interest rate.
  • Insufficient variation. Creating sufficient variation in the portfolio is hard. Examples of all “flavours” must be entered.
  • Obsolescence. A test set quickly becomes outdated.
  • Sometimes not possible. In the case of data conversions, the intended target environment is often still incomplete, as a result of which entering test data is not yet possible, or only to a very limited extent.

This article deals with anonymising data. This is editing existing production data in such a way that developers and testers can work with it in an unsecured environment. This involves structured data in databases. Unstructured data such as PDF files, scans or documents are not discussed in this article.

In the case of data conversions, the anonymisation is performed on data from the source systems. Anonymising such a data set and then running the conversion software with the anonymous data creates an anonymous test set in the format of the target system.

I will first discuss anonymising strategically or commercially sensitive data below. I will give tips on how to make a test set of sufficient variation without revealing sensitive information.

I will then discuss anonymising privacy-sensitive data. The consideration between reality and security is particularly important here.

Anonymisation of strategically or commercially sensitive data

“Strategic or commercial sensitivity” refers to data that seems harmless on its own, but which as a whole can reveal important information about an organisation: how many customers does the organisation have, what is the total invested capital, what is the coverage of its securities, what products does the organisation sell, what is the distribution of the portfolio like, how much commission is an intermediary paid, what profit margin does a product have, etc.

An effective way of removing strategic or commercial sensitivity of data is to make a selection. A subset is compiled from the total data set. This subset contains a wide range of values, but the composition of the original set cannot be derived from it.

How is a selection made?

Making a test set with the greatest possible variation and the smallest possible amount of data can be complex. The following approach often suffices:

  1. Determine two or three properties that are important in terms of the variation in the data set (e.g. type of coverage in the case of insurance, product type in the case of a savings product, repayment method in the case of a mortgage). Analyse which manifestations of these properties are present in the dataset (for example, in the property repayment method: annuity, linear, interest-only, savings, investment).
  2. Determine which entity is the central entity in the dataset. In a mortgage administration this will be, for example, the loan, whereas in a customer administration it will be the customer. This is called the main entity.
  3. Randomly select a number of occurrences of this main entity for each property defined under 1. Make sure that all variants of each of these properties are present. In the example stating the repayment method: select random loans so that the repayment methods annuity, linear, interest-only, savings and investment are all present. The number of occurrences of the main entity that is selected depends on the variation in the total data set. A total of 500 to 1000 is a good starting point.
  4. Ensure the data set is consistent. From all tables, select the data that corresponds with the selected main entities. For example: from the total set of entries, only the entries corresponding to a selected main entity are selected. This method ensures the selection of a consistent dataset with a wide variety of properties, in which strategically or commercially sensitive information can no longer be found.
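The selection steps above can be sketched in code. This is a minimal illustration, not Data eXcellence’s actual software; the field names (`id`, the property names) and the target size are hypothetical.

```python
import random

def select_varied_subset(entities, properties, target_size=500, seed=42):
    """Select a small subset of main entities (e.g. loans) that covers
    every observed value of each configured property (step 3)."""
    rng = random.Random(seed)
    selected = []
    selected_ids = set()

    # Guarantee coverage: for each property, pick one entity per observed value.
    for prop in properties:
        for value in {e[prop] for e in entities}:
            candidates = [e for e in entities
                          if e[prop] == value and e["id"] not in selected_ids]
            if candidates:
                choice = rng.choice(candidates)
                selected.append(choice)
                selected_ids.add(choice["id"])

    # Top up with random entities until the target size is reached.
    remaining = [e for e in entities if e["id"] not in selected_ids]
    rng.shuffle(remaining)
    while len(selected) < target_size and remaining:
        entity = remaining.pop()
        selected.append(entity)
        selected_ids.add(entity["id"])

    return selected
```

Step 4 (consistency) then follows by filtering all other tables on the selected `id` values, so that only rows belonging to the selected main entities remain.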

You can add or remove specific data later on if necessary for testing purposes. The selection is therefore gradually improved. For example, you can intentionally include erroneous values found in the production data in the selection to test whether the software can handle these errors.

Make the selection again regularly so that the test set is always based on recent production data. It is important that the creation of the test data is fully automated. This allows the test set to be constructed as required, manually or as part of an automated test.

Data eXcellence uses its own software to make the selection. This software selects a minimum amount of data with maximum variation for the configured properties.

Anonymising privacy-sensitive data

Data elements that can be traced to specific persons or organisations must be kept out of the hands of unauthorised parties. Therefore, you must replace such elements with other random data.

What data elements must be replaced?

A data element must be replaced if this data element, on its own or in combination with other data, reveals which individuals or organisations are in the records.

The fewer elements that are replaced, the more realistic the data set will behave. This increases usability for testing purposes. On the other hand, substituting an insufficient amount will lead to safety risks. This consideration is made together with the data owner. The rule of thumb is to replace all elements that pose a risk, but no more than that.

LOW LEVEL OF SUBSTITUTION

+ Increased usability of the dataset
- Unsafe

HIGH LEVEL OF SUBSTITUTION

+ Safe
- Dataset less usable or even completely useless

Substituting too much can lead to an unusable dataset. The data starts showing internal inconsistencies or no longer meets the requirements of the system for which it is intended: someone with a youth savings account who is 60 years of age according to his anonymised date of birth, or a loan that according to the anonymised starting date runs for 3 years, but which has a transaction history of 20 years. It is therefore important to carefully consider whether or not substitution is necessary for each data element.

Data eXcellence uses its own software that recognises patterns in databases such as a credit card number, IBAN, a citizen service number or an account number. This helps to determine the list of data elements to be replaced. This list practically always contains the following elements:

  • Surname
  • Street name
  • Postcode
  • Place of residence
  • Comments (which often contain notes with privacy-sensitive information: telephone numbers, e-mail addresses, names of partners, account numbers, etc.)
  • Descriptions of a transaction or payment
  • Citizen service number
  • Company Name
  • Chamber of Commerce number
  • IBAN or account number
  • Credit card number

(Check https://en.wikipedia.org/wiki/Personally_identifiable_information)
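A simple version of such pattern recognition can be sketched with regular expressions. This is a hedged illustration, not the proprietary software mentioned above: real detection tools also validate checksums (e.g. the IBAN mod-97 check or the Dutch “11-test” for citizen service numbers), which is omitted here, so these patterns will produce false positives.

```python
import re

# Illustrative patterns only; production tools also validate checksums.
PATTERNS = {
    "iban": re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{10,30}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "bsn": re.compile(r"\b\d{9}\b"),  # Dutch citizen service number: 9 digits
}

def find_sensitive_patterns(text):
    """Return (pattern_name, matched_value) pairs found in a text field,
    e.g. to scan comment columns for privacy-sensitive data."""
    hits = []
    for name, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((name, match.group()))
    return hits
```

Running such a scanner over free-text columns such as comments and transaction descriptions helps build the list of elements to be replaced.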

The following elements need not be replaced:

  • Date of birth (hundreds of people are born in the Netherlands each day, so a date on its own does not lead to a specific person)
  • Date of death (the same principle applies)
  • House number addition
  • Prefix
  • Salutation
  • Gender
  • Title (unless it leads to a unique person)
  • Financial and operational data or transactions that cannot be traced to a natural person or organisation, such as amounts, percentages, transaction types
  • Supporting data required for the system to function

Cases containing data elements that can theoretically pose a privacy problem need to be discussed:

  • First name: a rare first name could be recognisable, but in practice, a first name cannot be traced to a specific natural person
  • Initials: a rare combination of initials might be recognisable, but the chances of this occurring are limited
  • House number: unusual numbers exist that are not very common, but the question is whether such a house number can really be traced to a specific address without any other address details.

The meaning of an element determines whether substitution is needed or not. For example, a loan number in one system can be a meaningless technical key (no substitution), whereas in other systems, this can be an account number or IBAN (substitution).

Furthermore, there is data that may or may not have to be replaced depending on the role in which it occurs in the system. The data of a notary office that has a business loan from a bank must be anonymised (the notary office is a customer). However, when a notary prepares deeds or executes mortgages, anonymising the associated data is not needed. After all, it is no secret that a notary performs such activities (the notary office is a third party performing a service).

Anonymisation agreement

An anonymisation agreement records which data elements are anonymised. This is a record of the above process. If the agreements change later on, this is recorded as well. This way, it is always clear what the current agreements are.

The anonymisation software produces a report that contains the anonymised elements. This report serves as evidence that the anonymisation has been carried out as agreed.

How is data replaced?

The “neater” the data is replaced, the more realistic the result. In principle, you can replace any alphanumeric piece of data with a random series of letters, but that can reduce the usability of the test set: if all names, addresses, etc. look like “Test” or “xxxx”, individual data is practically unrecognisable, which makes testing difficult. On the other hand, if data is replaced very realistically, it is difficult to see whether or not the data has been anonymised. Perhaps the testers accidentally work with production data after all.

A good solution for this is to replace data with similar data, which is immediately recognisable as test data. For example: in a dataset of a Dutch mortgage lender, you can replace all names with English names and all place names with names of foreign cities.

Key values require special treatment. To ensure the referential integrity of the anonymous set, replace equal values in the entire data set with the same anonymous values. For example, if loan number 123456 in one table is replaced with 100001, loan number 123456 must be replaced with 100001 in all other tables containing the value.
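Consistent replacement of key values can be implemented with a mapping that is built up on first sight of a value and reused across all tables. A minimal sketch, assuming numeric keys and a hypothetical starting value:

```python
import itertools

class KeyAnonymiser:
    """Replace key values consistently: the same original value always
    maps to the same anonymous value across the whole data set."""

    def __init__(self, start=100001):
        self._mapping = {}          # original value -> anonymous value
        self._counter = itertools.count(start)

    def anonymise(self, value):
        if value not in self._mapping:
            self._mapping[value] = next(self._counter)
        return self._mapping[value]
```

Applying the same `KeyAnonymiser` instance to every table guarantees that loan number 123456 becomes, say, 100001 everywhere, preserving referential integrity.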

Finally, you must take into account possible data quality requirements of the system for which the test set is intended. Many systems perform checks on the format of certain fields. Examples include a Dutch postcode, citizen service number and IBAN. If such checks are in place, the substituted data must comply with the rules that have been set.
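For example, a substitute for a Dutch postcode must keep its format of four digits followed by two letters, or the target system will reject it. A minimal sketch (note that real validation may be stricter, e.g. certain letter combinations are excluded in practice):

```python
import random
import re

POSTCODE_RE = re.compile(r"^[1-9]\d{3} [A-Z]{2}$")  # Dutch postcode format

def random_dutch_postcode(rng=random):
    """Generate a random value in the Dutch postcode format:
    four digits (1000-9999), a space, and two uppercase letters."""
    digits = rng.randint(1000, 9999)
    letters = "".join(rng.choice("ABCDEFGHIJKLMNOPQRSTUVWXYZ") for _ in range(2))
    return f"{digits} {letters}"
```

The same idea applies to substitute IBANs or citizen service numbers: the generated value must pass whatever checksum or format checks the target system performs.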

Which software is available?

Several software packages are available for carrying out the substitution.
Data eXcellence uses SQL Data Generator by Red Gate (https://www.redgate.com/products/sql-development/sql-data-generator/index) and Privacy by Datprof (https://www.datprof.com). Another well-known solution is ARX (http://arx.deidentifier.org/). If you start searching, you will soon find that there is plenty of choice.

Conclusion

Anonymisation converts a data set with production data into usable test data. Anyone can get started right away with the examples and references given in this article. It is worth making a one-off investment in an automated process and running that process regularly. This keeps the test set up to date and thus a good reflection of reality. Properly record any agreements with the data owner.

Want to know more?