Introduction

Duplicate registrations of customers, articles or other objects in IT systems is a known problem. Duplicate registrations not only require extra maintenance effort, but also cause problems in the provision of management information and the correct execution of business processes.

When executing data migration projects, Data eXcellence is challenged to solve these problems and has developed a working method and tooling for this. A data migration is an excellent moment to correct duplication and to make a fresh start with the new system.

This article describes how DX detects and corrects duplication, thus improving the usability and reliability of business data.

Basic registration rules of the Dutch government: Ask once, use it and report back if you have any doubts about correctness. 

 

Unambiguous and one-time registration; multiple use

The principle of ‘unambiguous and one-time registration for multiple use’ is generally accepted. By capturing data in one place, based on clear definitions, it becomes possible to share that data with other users.

Unfortunately, the reality is often different. For example, data from one object, let’s take a ‘customer’ as an example, is recorded in multiple systems. When the characteristics of this customer are compared across the different systems, all too often it turns out that they contradict each other!

Example: A customer’s relocation has been entered in the CRM registration, but not (yet) in the failure registration system. The result: the engineer and his toolbox are in the wrong location.

Duplication also occurs within one data set, for example, because a customer was not found and was re-entered.

Recognising duplication

To recognise duplication, it is necessary to name criteria that indicate a possible duplication. In certain situations, this is possible on the basis of a shared identifier, such as a citizen service number or a Chamber of Commerce number. In this case, a corresponding value of the identifier indicates the same object. Note: it is wise to perform an extra check, because by swapping two numbers in the identifier, a completely different object is identified. Therefore, in addition to the identical citizen service number, also validate whether the date of birth matches.

It is important to determine that the identifier is applied at the correct level. In addition to the file number, the Chamber of Commerce also uses the location number. The file number is identical for all branches of a company and is known for several registrations. The location number is unique and actually identifies one specific registration. Those who want to identify a group can use the file number, but those who want to identify a unique location must base themselves on the unique location number.

With a functional identifier, clear agreements are made about the ‘conditions’ that must be met in order to establish a match. For example, for a building it can be agreed that registrations are identical if the postcode and house number are the same.

The name usually plays a major role for individuals and companies, whereby fuzzy matching can be used, so that ‘Thyssen’ and ‘Thijssen’ are designated as identical. The result of a fuzzy match may require a ‘manual’ review as these matches are not perfect.

In the case below, two match rules have been applied:

  1. Technical identifier: If Location no. is the same and (Bank account or Telephone number is the same) then you have a Match.
  2. If Location no. is not the sameand if File no. is the same and (Postcode-House no. or Bank account is the same) then you have a Match.
No.
No.

1

2

3

4

5

6

7

Name
Name

Firmeks

FirmX

Firmex-Grn

Firmex-ASD

Firmex-Groningen

The firm

Newfirm

File no.
File no.

12345678

12345678

12345678

98765432

08765430

Branch no.
Branch no.

9876598765

9876598765

5678956789

5678956789

8765432100

6665000100

Postcode
Postcode

1012 QX

1021 XQ

9743 XX

1012 QX

9743

4444 QQ

3333 XX

House no.
House no.

1

99

10

1

2

3

Bank account
Bank account

454545676767

454545676767

767675645454

899897676756

Telephone no.
Telephone no.

010-98980000

010-98980000

020-89855566

040-55669900

Date
Date

01/01/21

01/01/19

01/02/20

01/02/20

01/01/19

01/01/20

01/04/21

Matched
Matched

2,4

1,4

5

1,2

3

Match & Selection

Groups of registrations can be determined based on the matches found. Based on the above example, four groups of branches are composed.

Group A: Firmeks, FirmX and Firmeks-ASD

Group B: Firmex-GRN and Firmex-Groningen

Group C: The Firm

Group D: Newfirm

To determine group companies, the addition of one extra match rule would suffice:

3. If File no. is the same and bank account or Telephone no. is the same, then you have a Match.

Group A: Firmeks, FirmX and Firmeks-ASD

Group B: Firmex-GRN and Firmex-Groningen

In this particular example, this merges groups A and B.

By applying the criteria, duplications are thus classified into groups. Although the ‘members of the groups’ have now been determined, it is not yet clear which registration should be leading. A set of criteria is also drawn up for this - for example:

  • The registration with the most recent date
  • The registration with the most filled attributes
  • The registration with the most 1-1 matches

You can then choose to use a Golden Record in which data from various registrations are combined into one new object. The records below have been compiled based on both certain groups from the example. The starting point is that the record with the most recent date is leading and is supplemented with data from the other registrations.

No.
No.

9

10

Name
Name

Firmeks

Firmex-Grn

File no.
File no.

12345678

12345678

Branch no.
Branch no.

9876598765

5678956789

Postcode
Postcode

1012 QX

9743 XX

House no.
House no.

1

10

Bank account
Bank account

454545676767

767675645454

Telephone no.
Telephone no.

010-98980000

010-98980000

Date
Date

01/06/21

01/06/21

The impact on data migrations

In a data migration, data from multiple source systems is brought together to a new target system, where it is not desirable to ‘contaminate’ the target system with duplication. Therefore, in data migrations, DX uses the techniques described above to enter data correctly and uniquely.

During the data migration, a table is built in which each source registration is linked to a new Master data record to be entered. In this case, the composite Golden Records are entered.

Source
Source

1 Firmeks

2 FirmX

3 Firmex-Grn

4 Firmex-ASD

5 Firmex-Groningen

6 The firm

7 Newfirm

Group
Group

A

A

B

A

B

C

D

Leading
Leading

9 Firmeks

9 Firmeks

10 Firmex-Grn

9 Firmeks

10 Firmex-Grn

6 The firm

7 Newfirm

In this example, the target system lists four objects, the leading group member, in this case the two golden records and the two original source records.

Subsequently, the transaction data, which was related to a source registration, must be linked again on the basis of the translation table. In the example below, all original orders are linked to the new customers.

vertaaltabel klantnummers

Deduplicating customers and re-relating the orders means that checking the data migration becomes more complex. Human intervention may also be necessary when fuzzy matching is used. In the latter case, a fully automated migration is not possible.

An absolute count and comparison of source and target object is not enough. Not only because source objects are missing due to failure and filtering, but also because source registrations are merged during deduplication. A count that can demonstrate the completeness of a data migration requires extra attention and knowledge of the deduplication process.

In practice, customers often choose to perform duplication before the migration.

Extra attention to checks: Deduplication of source data makes checking the completeness of a data migration more complex.

Data quality in data migrations

If data deduplication is performed correctly during a data migration, this provides enormous added value. The result is a consistent target database, with each object entered uniquely. Of course, the correct input checks must be carried out when working with the new system. Properly validating whether the customer ‘to be entered’ is already present in the data collection remains the best measure to prevent data contamination.

Data eXcellence is a specialist in performing data migrations and pays a lot of attention to the quality of the data of the target system. The DX method and toolkit contains specific, configurable components that can be used to determine duplication and combine them into unique registrations.

Want to find out more?