June 2005

The match-and-merge dilemma
Experts tackle the challenges of records ‘de-duplication’


Since the beginning of electronic data entry, the problem of uniquely identifying individuals and removing duplicate records within a database has plagued developers and program managers. Duplicate records can cause problems that are far more serious than inconvenience and aggravation.

Incomplete and fragmented records contribute to:

  • reduced efficiency.
  • inadequate services delivered, resulting in compromised or suboptimal health care.
  • potentially harmful interventions, misdiagnoses, and delivery of inappropriate services.
  • decreased credibility of systems and their information.

The match-merge dilemma has grown more complex as public health agencies increasingly seek to integrate information from disparate program databases. Unfortunately, help in tackling the problem has not been readily available. As a result, public health agencies typically wrestle with their de-duplication issues alone, essentially working in a vacuum.

Now, a Connections workgroup on Unique Records is developing a toolkit to address match-merge issues in public health. The Unique Records workgroup is creating a portfolio of useful guides, including principles and concepts, real-world case studies, a questionnaire, and a self-assessment tool. This approach is designed to assist public health professionals in assessing the problem, evaluating solutions, and making informed decisions. The workgroup is composed of Connections members who have experience in de-duplication issues, other invited experts, consultant Susan Salkowitz of Salkowitz Associates LLC, and Stephen Clyde of Utah State University’s Computer Science Department.

Child health integration projects create enterprise-wide, person-centric systems from disparate databases with different business rules for identification of individuals. Data cleaning activities, often termed “de-duplication,” consist of various processes:

  • matching, identifying existing records that might be for the same person.
  • linking, referencing records for the same individual to each other with the individual records remaining in their separate systems.
  • merging, combining multiple records into one record.

The linking and merging steps are often termed “record coalescing,” meaning linking records, merging records, or both.
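The three processes can be illustrated with a minimal sketch. It is only an illustration under stated assumptions, not the workgroup's method: the `Record` fields, the program names, and the exact-match rule on normalized name and birth date are all hypothetical, and real systems typically need more tolerant matching rules.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:
    # All field names are hypothetical, chosen for illustration only.
    source: str          # name of the program database holding the record
    record_id: str
    first_name: str
    last_name: str
    birth_date: str      # ISO date string, e.g. "2004-03-15"
    phone: Optional[str] = None


def match_key(rec):
    """Normalize the fields used for comparison (an assumed, simplistic rule)."""
    return (rec.first_name.strip().lower(),
            rec.last_name.strip().lower(),
            rec.birth_date)


def find_matches(records):
    """Matching: group records that might be for the same person."""
    groups = {}
    for rec in records:
        groups.setdefault(match_key(rec), []).append(rec)
    return [g for g in groups.values() if len(g) > 1]


def link(group):
    """Linking: cross-reference records; each stays in its own system."""
    return [(r.source, r.record_id) for r in group]


def merge(group):
    """Merging: combine records into one, keeping the first non-empty phone."""
    merged = Record("merged", "+".join(r.record_id for r in group),
                    group[0].first_name, group[0].last_name,
                    group[0].birth_date)
    for r in group:
        if merged.phone is None and r.phone:
            merged.phone = r.phone
    return merged


records = [
    Record("immunization", "IMM-1", "Ana", "Lopez", "2004-03-15"),
    Record("newborn_screening", "NBS-7", " ana", "LOPEZ ", "2004-03-15",
           "555-0100"),
    Record("immunization", "IMM-2", "Ben", "Ortiz", "2003-11-02"),
]

for group in find_matches(records):
    print(link(group))   # cross-references only; source records untouched
    print(merge(group))  # one combined record
```

Linking leaves both source records in place and only records that they refer to the same child, while merging produces a single combined record. Which of the two is appropriate depends on each agency's systems and business rules.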

Knowledge sharing: a key success factor 
For years, public health agencies have wrestled with their de-duplication issues system by system without sharing knowledge among programs. Although most public health programs share similar issues with matching, merging, and linking, each database and system configuration is unique. Even when programs use the same kind of databases, software, or configurations, each system’s data may have unique patterns, data-entry fields, and ways of processing input. 

“In de-duplication, there isn’t one approach that solves everyone’s problems,” said Dr. Clyde. “In public health, you’re dealing with legacy systems that have evolved from different sources, often resulting from funding into silo programs. What may seem like subtle differences in data structure can result in significant problems for matching and merging. There are as many possible solutions to the duplicate data problem as there are individual situations.

“You might ask,” he continued, “why isn’t there only one kind of car on the road? People’s needs and preferences differ and change over time, and that’s reflected in evolving designs, whether they’re cars or databases.”

When issues and problems are shared in a community of practice such as Connections, commonalities among programs and approaches emerge, and this knowledge can be synthesized so that programs can manage de-duplication more effectively.

Right tools at the right time
The Connections community of practice formed a Unique Records workgroup to develop a multifaceted toolkit – set for release in fall 2005 – to help public health agencies improve their de-duplication processes by planning and analyzing their projects methodically. The workgroup agreed that such a toolkit would offer a better chance of addressing the problems and allow individual agencies an opportunity to develop solutions that work best for their integration projects.

The new toolkit does not simply give answers; it helps developers and program managers address de-duplication problems in their own agency settings. Workgroup participants are contributing their own experiences in identifying, managing, and resolving duplicate records to serve as examples and guidelines for others.

“These materials are being assembled in a toolkit format so that people who are ready to roll up their sleeves can start down the road toward improving records quality in a database,” said Salkowitz. “It’s a hands-on guide that distills textbook material and best practices for all programs to apply to get de-duplication done.”

Components of the toolkit
This portfolio approach provides:

  • Values and principles that should guide every data integration project.
  • Concepts of matching, linking, and merging for an integrated system.
  • Case studies that give examples from current Connections projects of de-duplication methods in existing integrated public health information systems.
  • Profile/questionnaire that assists decision-makers in categorizing data sources and de-duplication approaches that they use in their own integration projects. By organizing their own information into this profile, public health program managers and systems developers will be able to compare their approaches with other similar projects and use the materials in the toolkit to manage their de-duplication efforts.
  • Self-assessment tool – a set of checklists – that supports a structured approach to problem identification, quality assurance, and evaluation of de-duplication procedures.

“Our goal,” said Salkowitz, “is to develop tools based on the approaches of experienced public health practitioners and informed by research-based knowledge. In that way, developers and managers of integrated child health information systems can achieve and maintain high-quality databases and uniquely identify children and their associated records.”

©2005 Public Health Informatics Institute
All Rights Reserved

750 Commerce Drive, Suite 400 • Decatur, Georgia 30030
TEL: 1.866.815.9704 • FAX: 1.800.765.7520


Last updated November 1, 2005