Advancing Practice, Instruction, and Innovation Through Informatics (APIII 2003): Scientific Session and E-Poster Abstracts

Archives of Pathology & Laboratory Medicine, Oct 2004 by Becich, Michael J, Crowley, Rebecca

Design: Pathology reports from 3 institutions were examined for commonly encountered identifiers. Regular expressions were created to find and remove these identifiers. The scrubber was run iteratively on a training set until it exhibited good scrubbing performance. One thousand eight hundred new pathology reports (600 from each institution, encompassing 3 different time frames) were then processed, and each report was reviewed manually to look for identifiers that were missed (underscrubbing). The listing of removed text was also examined to find nonidentifying text that was removed (overscrubbing).
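The design described above (regular expressions plus a name list, with a log of removed text kept for overscrubbing review) can be sketched roughly as follows. This is a minimal illustration, not the authors' software: the date and accession-number patterns, the tiny name list, and the function name `scrub` are all hypothetical stand-ins for the site-specific rules the abstract refers to.

```python
import re

# Hypothetical patterns; a real scrubber would need site-specific tuning.
PATTERNS = [
    ("DATE", re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")),
    ("ACCESSION", re.compile(r"\b[A-Z]{1,2}\d{2}-\d{3,6}\b")),
]

# Hypothetical name list; the abstract describes a far more extensive one.
NAME_LIST = {"smith", "jones"}


def scrub(text):
    """Return (scrubbed_text, removed), where removed logs every deletion."""
    removed = []

    # Pass 1: replace each regular-expression match with a bracketed label.
    for label, pattern in PATTERNS:
        def _replace(match, label=label):
            removed.append((label, match.group(0)))
            return f"[{label}]"
        text = pattern.sub(_replace, text)

    # Pass 2: replace any whole word found in the name list.
    def _replace_name(match):
        word = match.group(0)
        if word.lower() in NAME_LIST:
            removed.append(("NAME", word))
            return "[NAME]"
        return word

    text = re.sub(r"\b[A-Za-z]+\b", _replace_name, text)
    return text, removed


report = "Specimen S03-12345 received 10/14/2003 from Dr. Smith."
scrubbed, removed = scrub(report)
print(scrubbed)
print(removed)
```

Reviewing the `removed` log corresponds to the overscrubbing check, and reading the scrubbed text for leftover identifiers corresponds to the underscrubbing check. A name list matched this literally would also strip ordinary words that happen to be surnames, which is consistent with the overscrubbing the abstract attributes to its name lists.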

Results: Approximately 33% of the pathology cases contained identifiers in the body of the report. Ninety-six percent of identifiers present in the test set were removed. The identifiers that were missed were largely institution names and foreign addresses. Of the scrubbed cases, 1.3% contained HIPAA-specified identifiers (names, accession numbers, and dates) that were missed. Outside consultation case reports typically contained numerous identifiers and were the most challenging to de-identify comprehensively. There was variation in performance among the test sets, highlighting the need for site-specific customization. Overscrubbing was more prevalent than underscrubbing, and most instances of overscrubbing were due to the extensive list of personal and location names used.

Conclusions: Our first test of this software confirms the initial hypothesis that robust de-identification software can be created using open-source tools. The application currently removes the vast majority of identifying information from pathology reports while leaving the nonidentifying text intact. Although the software does not yet perform perfectly, we expect that fine-tuning the regular expressions and expanding the database will remove the remaining identifiers. The major sources of underscrubbing are misspellings, accession numbers with unusual formats, and unexpected or unusual proper names.

Predicting Tumor Marker Outcomes With Monte Carlo Simulations

http://65.222.228.150/ijb/ramiab.htm

Jules J. Berman, MD, PhD (bermanj@mail.nih.gov). National Cancer Institute, National Institutes of Health, Bethesda, Md.

Context: Genome and proteome research have promised a revolution in tumor diagnosis. The revolution has not arrived. In fact, only a handful of new markers have appeared in the past several years. A simple thought experiment demonstrates the problem.

In a retrospective study, Dr X demonstrated a "perfect" tumor marker that never failed to distinguish between 2 tumor variants (aggressive and indolent) with identical morphology. In this example, the aggressive variant grows 10 times as fast and metastasizes at 10 times the rate of the indolent variant with the same morphology. In a prospective trial of the same marker, 200 tumors are excised at the time of clinical detection (tumor size, 2 cm). Dr X finds that 100 of the tumors stain as "indolent variants" and 100 tumors stain as "aggressive variants." The trial monitors all 200 patients, determining survival at 5 years. At the end of the trial, there is no survival difference between patients with indolent variants and patients with aggressive variants. The marker is considered a total failure, with millions of dollars wasted on the prospective trial.
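One possible reading of this thought experiment can be explored with a small Monte Carlo sketch. The model below is an assumption made for illustration and is not taken from the abstract: tumors grow linearly until excision at 2 cm, metastasis arrives as a random event with a constant rate per year, and outcome depends only on whether metastasis occurred before excision. The specific growth and metastasis rates, and the function names, are hypothetical; only the 10-fold ratios and the 2-cm detection size come from the text.

```python
import math
import random


def simulate_arm(n, growth_cm_per_year, met_rate_per_year,
                 detect_cm=2.0, rng=None):
    """Fraction of n simulated tumors that metastasize before excision."""
    rng = rng or random.Random()
    # Assumed linear growth: time needed to reach the detection size.
    time_to_detection = detect_cm / growth_cm_per_year
    metastasized = 0
    for _ in range(n):
        # Assumed constant hazard: exponential waiting time to metastasis.
        time_to_metastasis = rng.expovariate(met_rate_per_year)
        if time_to_metastasis < time_to_detection:
            metastasized += 1
    return metastasized / n


def analytic_fraction(growth_cm_per_year, met_rate_per_year, detect_cm=2.0):
    """Closed-form probability of metastasis before excision in this model."""
    return 1.0 - math.exp(-met_rate_per_year * detect_cm / growth_cm_per_year)


rng = random.Random(2003)
# Hypothetical baseline rates; the aggressive variant is 10x on both.
indolent = simulate_arm(100_000, 0.5, 0.1, rng=rng)
aggressive = simulate_arm(100_000, 5.0, 1.0, rng=rng)
print(f"indolent:   {indolent:.3f}")
print(f"aggressive: {aggressive:.3f}")
```

Under these assumptions, the aggressive variant reaches 2 cm in one tenth of the time but metastasizes 10 times as fast, so the two effects cancel and both arms carry about the same metastatic burden at excision. That would be consistent with a marker that separates the variants perfectly yet shows no survival difference in the prospective trial.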


 


Content provided in partnership with ProQuest