Introduction
Duplicate registrations of customers, articles or other objects in IT systems is a known problem. Duplicate registrations not only require extra maintenance effort, but also cause problems in the provision of management information and the correct execution of business processes.
When executing data migration projects, Data eXcellence is challenged to solve these problems and has developed a working method and tooling for this. A data migration is an excellent moment to correct duplication and to make a fresh start with the new system.
This article describes how DX detects and corrects duplication, thus improving the usability and reliability of business data.
Basic registration rules of the Dutch government: Ask once, use it and report back if you have any doubts about correctness.
Unambiguous and one-time registration; multiple use
The principle of ‘unambiguous and one-time registration for multiple use’ is generally accepted. By capturing data in one place, based on clear definitions, it becomes possible to share that data with other users.
Unfortunately, the reality is often different. For example, data from one object, let’s take a ‘customer’ as an example, is recorded in multiple systems. When the characteristics of this customer are compared across the different systems, all too often it turns out that they contradict each other!
Example: A customer’s relocation has been entered in the CRM registration, but not (yet) in the failure registration system. The result: the engineer and his toolbox are in the wrong location.
Duplication also occurs within one data set, for example, because a customer was not found and was re-entered.
Recognising duplication
To recognise duplication, it is necessary to name criteria that indicate a possible duplication. In certain situations, this is possible on the basis of a shared identifier, such as a citizen service number or a Chamber of Commerce number. In this case, a corresponding value of the identifier indicates the same object. Note: it is wise to perform an extra check, because by swapping two numbers in the identifier, a completely different object is identified. Therefore, in addition to the identical citizen service number, also validate whether the date of birth matches.
It is important to determine that the identifier is applied at the correct level. In addition to the file number, the Chamber of Commerce also uses the location number. The file number is identical for all branches of a company and is known for several registrations. The location number is unique and actually identifies one specific registration. Those who want to identify a group can use the file number, but those who want to identify a unique location must base themselves on the unique location number.
With a functional identifier, clear agreements are made about the ‘conditions’ that must be met in order to establish a match. For example, for a building it can be agreed that registrations are identical if the postcode and house number are the same.
The name usually plays a major role for individuals and companies, whereby fuzzy matching can be used, so that ‘Thyssen’ and ‘Thijssen’ are designated as identical. The result of a fuzzy match may require a ‘manual’ review as these matches are not perfect.
In the case below, two match rules have been applied:
- Technical identifier: If Location no. is the same and (Bank account or Telephone number is the same) then you have a Match.
- If Location no. is not the sameand if File no. is the same and (Postcode-House no. or Bank account is the same) then you have a Match.
1
2
3
4
5
6
7
Firmeks
FirmX
Firmex-Grn
Firmex-ASD
Firmex-Groningen
The firm
Newfirm
12345678
12345678
12345678
98765432
08765430
9876598765
9876598765
5678956789
5678956789
8765432100
6665000100
1012 QX
1021 XQ
9743 XX
1012 QX
9743
4444 QQ
3333 XX
1
99
10
1
2
3
454545676767
454545676767
767675645454
899897676756
010-98980000
010-98980000
020-89855566
040-55669900
01/01/21
01/01/19
01/02/20
01/02/20
01/01/19
01/01/20
01/04/21
2,4
1,4
5
1,2
3
Match & Selection
Groups of registrations can be determined based on the matches found. Based on the above example, four groups of branches are composed.
Group A: Firmeks, FirmX and Firmeks-ASD
Group B: Firmex-GRN and Firmex-Groningen
Group C: The Firm
Group D: Newfirm
To determine group companies, the addition of one extra match rule would suffice:
3. If File no. is the same and bank account or Telephone no. is the same, then you have a Match.
Group A: Firmeks, FirmX and Firmeks-ASD
Group B: Firmex-GRN and Firmex-Groningen
In this particular example, this merges groups A and B.
By applying the criteria, duplications are thus classified into groups. Although the ‘members of the groups’ have now been determined, it is not yet clear which registration should be leading. A set of criteria is also drawn up for this - for example:
- The registration with the most recent date
- The registration with the most filled attributes
- The registration with the most 1-1 matches
You can then choose to use a Golden Record in which data from various registrations are combined into one new object. The records below have been compiled based on both certain groups from the example. The starting point is that the record with the most recent date is leading and is supplemented with data from the other registrations.
9
10
Firmeks
Firmex-Grn
12345678
12345678
9876598765
5678956789
1012 QX
9743 XX
1
10
454545676767
767675645454
010-98980000
010-98980000
01/06/21
01/06/21
The impact on data migrations
In a data migration, data from multiple source systems is brought together to a new target system, where it is not desirable to ‘contaminate’ the target system with duplication. Therefore, in data migrations, DX uses the techniques described above to enter data correctly and uniquely.
During the data migration, a table is built in which each source registration is linked to a new Master data record to be entered. In this case, the composite Golden Records are entered.
1 Firmeks
2 FirmX
3 Firmex-Grn
4 Firmex-ASD
5 Firmex-Groningen
6 The firm
7 Newfirm
A
A
B
A
B
C
D
9 Firmeks
9 Firmeks
10 Firmex-Grn
9 Firmeks
10 Firmex-Grn
6 The firm
7 Newfirm
In this example, the target system lists four objects, the leading group member, in this case the two golden records and the two original source records.
Subsequently, the transaction data, which was related to a source registration, must be linked again on the basis of the translation table. In the example below, all original orders are linked to the new customers.
Deduplicating customers and re-relating the orders means that checking the data migration becomes more complex. Human intervention may also be necessary when fuzzy matching is used. In the latter case, a fully automated migration is not possible.
An absolute count and comparison of source and target object is not enough. Not only because source objects are missing due to failure and filtering, but also because source registrations are merged during deduplication. A count that can demonstrate the completeness of a data migration requires extra attention and knowledge of the deduplication process.
In practice, customers often choose to perform duplication before the migration.
Extra attention to checks: Deduplication of source data makes checking the completeness of a data migration more complex.
Data quality in data migrations
If data deduplication is performed correctly during a data migration, this provides enormous added value. The result is a consistent target database, with each object entered uniquely. Of course, the correct input checks must be carried out when working with the new system. Properly validating whether the customer ‘to be entered’ is already present in the data collection remains the best measure to prevent data contamination.
Data eXcellence is a specialist in performing data migrations and pays a lot of attention to the quality of the data of the target system. The DX method and toolkit contains specific, configurable components that can be used to determine duplication and combine them into unique registrations.