Don't Trash That Name

Today, Aug 2005 by Hermansen, John C

Is "GIGO" a thing of the past?

These days, we rarely hear the old saw, "Garbage In, Garbage Out," but it used to be a very common explanation for the mediocre performance of database operations like mail merges and document processing. The fact that we do not hear that expression as often, however, does not mean that the problem has gone away. Quite the contrary.

It is not that our databases are that much cleaner now, but that they are so much larger. Dirty data still plagues every database and that means that an extraordinary amount of valuable information can be lost forever in the mega-sized databases that are now commonplace.

Projects and programs to cleanse or remediate corporate data continue to receive the same short shrift that they always have. One major hard drive crash, and a new DBA never again forgets to back up the database. But erroneous data rarely teaches the same kind of dramatic lesson, and thus dirty data persists. For process managers and document retrieval experts, there are some important things to be aware of when capturing and retrieving records that may be accessed by a person's name.

Don't Use Dirty Names

One of the most difficult database problems to fix is also the one that can cause the most havoc in databases: personal names. While there are programs and services that can "clean up" addresses, telephone numbers, and other data fields, until recently there were no effective tools for helping with bad name data. Why is this?

For data entry applications, proper names must be accepted as presented. And while there is rarely a directory, dictionary, or other reference available, the entry of personal name (PN) data is contingent on an extraordinary amount of education about names. It is at the critical moment of data capture that we have the last opportunity to properly enter such names; unfortunately, it is also at this vital juncture that we paradoxically opt for minimizing our attention.

Especially with names from cultures that we are not familiar with, names often suffer significant damage during data entry. Below are a few examples of names that might be found in the database of any organization today.

The Trouble With Names

There are four major reasons why matching and handling names is so difficult: character variations, phonetic variations, inconsistent parsing, and transliteration issues.

Character variations come in many flavors and occur for many reasons. These can be typos such as Smith/Msith, noise such as Smith/Srrf th, truncations such as Vasconcellos/Vasconce, and concatenations such as Beth Smith/BethSmith. Character variations can be caused by data entry problems, scanning errors, and other such anomalies. As can be seen in these simple examples, they can cause name matching problems that range from trivial to catastrophic.
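To make the range from trivial to catastrophic concrete, here is a minimal sketch that scores the character variations above with a plain Levenshtein edit distance. The article names no particular algorithm, so this choice is an assumption, and the helper names (normalize, edit_distance, name_distance) are hypothetical.

```python
def normalize(name: str) -> str:
    """Lowercase and drop whitespace, so 'Beth Smith' and 'BethSmith' compare equal."""
    return "".join(name.lower().split())


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance: count of insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep when equal)
            ))
        previous = current
    return previous[-1]


def name_distance(a: str, b: str) -> int:
    """Edit distance after normalization (hypothetical helper for this sketch)."""
    return edit_distance(normalize(a), normalize(b))


if __name__ == "__main__":
    for pair in [("Smith", "Msith"),
                 ("Vasconcellos", "Vasconce"),
                 ("Beth Smith", "BethSmith")]:
        print(pair, name_distance(*pair))
```

Under these assumptions the transposed typo scores 2, the truncation scores 4, and the concatenation scores 0 once whitespace is removed. An exact-match lookup would miss all three, while a small distance threshold would recover the typo and the concatenation but might still miss the truncation.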

Phonetic variations are especially prevalent when verbal information is used as either the source data or the query data. Verbal source data would include things like customer databases where the information was relayed verbally. In these cases, the "correct" spelling of the individual's name is frequently not known, so the individual entering the data will make a best guess based upon the pronunciation that was provided (which also may not be correct).
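One long-standing way to tolerate best-guess spellings is a phonetic key. The sketch below uses American Soundex, which the article does not itself mention; it is offered only as an illustration of the idea, it assumes names written in unaccented Roman letters, and the function name soundex is our own.

```python
# Letter-to-digit table for American Soundex; vowels, h, w and y carry no code.
_CODES = {}
for digit, group in (("1", "bfpv"), ("2", "cgjkqsxz"), ("3", "dt"),
                     ("4", "l"), ("5", "mn"), ("6", "r")):
    for letter in group:
        _CODES[letter] = digit


def soundex(name: str) -> str:
    """Return the four-character Soundex key for a name, or '' if it has no letters."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    key = letters[0].upper()
    prev_code = _CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "hw":
            continue  # h and w do not separate two letters with the same code
        code = _CODES.get(c, "")
        if code and code != prev_code:
            key += code
        prev_code = code  # a vowel resets prev_code, so a repeated code counts again
    return (key + "000")[:4]


if __name__ == "__main__":
    for name in ("Smith", "Smyth", "Robert", "Rupert", "Catherine", "Katherine"):
        print(name, soundex(name))
```

Smith and Smyth share a key, as do Robert and Rupert, so a search on either spelling finds both. Catherine and Katherine do not, because Soundex always keeps the first letter. That gap is one reason a simple phonetic key cannot by itself solve the name problem.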

Inconsistent parsing is a problem that is especially complex with ethnically diverse names, as we saw above. This issue is caused by the Anglo-centric standard of forcing name data to be entered in three fields (first name, last name, middle name). While this format has been fine for names such as Beth Ann Smith, names from other traditions (Arabic, Hispanic, Asian, and many more), which can have as many as eight components, cause huge problems with retrieval. Name components are often condensed, concatenated, or dropped altogether to make them fit. Unfortunately, the individual querying the name must be lucky enough to make the same assumptions in order to have a reasonable chance of success when searching for an individual.
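The effect of squeezing a long name into three fields can be sketched in a few lines. The sample name and both strategies below are hypothetical, chosen only to show how two data entry clerks making different assumptions produce different records for the same person.

```python
from typing import NamedTuple


class ThreeFieldName(NamedTuple):
    first: str
    middle: str
    last: str


def drop_extras(full_name: str) -> ThreeFieldName:
    """Hypothetical strategy A: keep the first, second and final components, drop the rest."""
    parts = full_name.split()
    middle = parts[1] if len(parts) > 2 else ""
    return ThreeFieldName(parts[0], middle, parts[-1])


def squeeze_middle(full_name: str) -> ThreeFieldName:
    """Hypothetical strategy B: pack everything between first and final into the middle field."""
    parts = full_name.split()
    return ThreeFieldName(parts[0], " ".join(parts[1:-1]), parts[-1])


if __name__ == "__main__":
    name = "Maria del Carmen Garcia Lopez"  # illustrative name, not taken from the article
    print(drop_extras(name))
    print(squeeze_middle(name))
```

Strategy A stores ("Maria", "del", "Lopez") and strategy B stores ("Maria", "del Carmen Garcia", "Lopez"). A searcher who expects "Garcia" in the last-name field finds neither record, which is the retrieval failure described above.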

Transliteration problems are the most severe obstacle to the storage and retrieval of names and probably the least well understood. What surprises most people is that there are actually many, many ways that names can be transliterated from "foreign" scripts, such as those used for Arabic and Asian languages, into the Roman alphabet that we use for English.

Data Stewardship

There is no substitute for establishing an on-going process of data stewardship. This means first cleaning the existing legacy data, setting up and monitoring methods for error prevention at data entry, and maintaining a routine for scheduled data cleaning. Doing this for other data fields is bothersome enough, but how can this possibly be done for complex personal name data?

Using information about how cultures around the world define and use personal names, it is now possible to actually measure the accuracy of the way a name is parsed into surname and given name fields, no matter what type of name it is. This provides a superior method for cleaning up an existing database, and it allows for interactive checking of data entry. In fact, the entire parsing operation could be - and perhaps should be - automated using new knowledge-based name recognition technology.

 


Content provided in partnership with ProQuest