Classifying electronic documents: A new paradigm

Information Management Journal, Mar/Apr 2002 by Schewe, Donald B

Lessons Learned

The U.S. Department of Education set out to determine whether large volumes of electronic data can be indexed cost-effectively

At the Core

This article:

Explains how the Department of Education (DoEd) used artificial neural network software to classify electronic documents

Reveals the challenges in analyzing and categorizing mass amounts of electronically stored data

The introduction of the desktop computer and its widespread adoption by government and the private sector in the 1980s raised new problems for records managers. The computer increased information volume, manipulation of data became easier, and networks increased the ease with which information was transmitted, further adding to information mass.

When e-mail was added to the mix, quantity began to overwhelm traditional information management systems. Records managers simply could not apply the old classification and storage systems to the huge amount of information generated.

Volume was not the only problem, of course. There was also the issue of what medium to store the information on and how to migrate it forward as new versions of software were developed.

The underlying assumption of the current paradigm for records and information management is that there is now too much information to manage at the document level. The file folder has become the basic unit of control. However, computer users often do not file the information in neat folders. Attempts to force users to do this filing have been largely unsuccessful. The time has come for a new paradigm.

The monster that has created this mountain of information can be tamed and turned into the engine that controls it - and to a far better degree than has been possible in the past. With the power of today's computers - those found in virtually every office - indexing can be done down to the word level, and retrieval can be virtually instantaneous. If only there were a way to categorize the information.

Enter the new paradigm.

Case in Point

The U.S. Department of Education (DoEd) had used artificial neural network technology to analyze and categorize some electronic materials at the end of the Clinton Administration. That project was successful, and the department wanted to see if the technology could be applied to its vast electronic information holdings consisting of word processing documents, spreadsheets in various formats, databases (both off-the-shelf and proprietary), and e-mail messages. The documents could not be deleted because some were record material deserving of retention for varying time periods according to the department's records retention schedule. The cost of storing and maintaining this material was a drain on the budget and hampered ongoing activities.

To see what could be accomplished, the department set up a demonstration project using e-mail and word processing documents from individuals who left the agency at the end of the Clinton Administration. Approximately 4 gigabytes of e-mail and half a gigabyte of word processing documents were provided for the project. Fairfax, Virginia-based STG Inc., the same company that had done the earlier project, was employed to undertake this one, with the significant addition of an experienced records manager.

Prior experience within the department and elsewhere in the federal government with desktop-deployed records management applications had been less than satisfactory. Users were reluctant or adamantly unwilling to use the software, results were spotty, and constant training and retraining were necessary for staff, particularly where constant turnover was the norm. With these factors in mind, the project's goal was to see if a system could be devised to deal with the material using artificial neural network technology and to see if it could be employed as a network service.

The artificial neural network software was Hummingbird's Knowledge Manager Workstation. It uses a mathematical construct that analyzes the frequency and placement of words and concepts within documents to place them within a multidimensional grid. By manipulating the grid's various parameters, one can control the level of inclusiveness within the "clusters" of documents, or groups of documents around specific ideas or concepts, and thereby increase the accuracy of the categorization process. The project team nicknamed the software "ANNie."
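The general idea of placing documents in a word-frequency space and tuning how inclusive each cluster is can be illustrated with a minimal sketch. This is not Hummingbird's algorithm and it is not a neural network; it is a simple stand-in that assumes plain word counts, cosine similarity, and a greedy grouping rule. The function names and the `threshold` parameter are hypothetical, chosen only for illustration.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Map a document to word-frequency counts (a point in word space)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse frequency vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(docs, threshold):
    """Greedy clustering (an assumed, simplified rule): a document joins the
    first cluster whose seed document is at least `threshold` similar;
    otherwise it starts a new cluster. Returns lists of document indices."""
    clusters = []  # each entry: (seed_vector, [document indices])
    for i, text in enumerate(docs):
        vec = term_vector(text)
        for seed, members in clusters:
            if cosine(vec, seed) >= threshold:
                members.append(i)
                break
        else:
            clusters.append((vec, [i]))
    return [members for _, members in clusters]


# Invented sample documents, for illustration only.
docs = [
    "student loan repayment schedule",
    "loan repayment schedule for student borrowers",
    "grant application deadline",
    "deadline for the grant application review",
]
loose = cluster(docs, threshold=0.3)    # inclusive clusters
strict = cluster(docs, threshold=0.95)  # near-duplicates only
print(loose)   # [[0, 1], [2, 3]]
print(strict)  # [[0], [1], [2], [3]]
```

Lowering the threshold makes clusters more inclusive, and raising it splits them apart. That is the kind of trade-off the grid parameters are described as controlling.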

It quickly became apparent that the technology was very powerful and could categorize massive amounts of information, do it in a very short time span, and attain accuracy levels greater than could be expected from even the best of file clerks. Furthermore, accuracy was greatly enhanced when the number of possible categories was narrowed, leading to the project's first major decision: Focus would be not on all possible data within the agency but instead on individual work groups where the number of subjects addressed was limited by the work group's scope. This decision also made it easier to deal with related questions. For instance, access levels could be maintained at the office level just as they were for local area network access. It also allowed for ready application of the "office of record" principle. A memo from the secretary of education to the staff, for example, would show up in every individual's mailbox, but it was only necessary to maintain the outgoing copy from the secretary's office.
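The "office of record" principle lends itself to a short sketch. The one below assumes each stored copy of a message carries an identifier, the office that originated it, and the office whose mailbox holds it. These field names, and the `record_copies` helper, are hypothetical and do not come from the project. The rule is also simplified, since it ignores mail that originates outside the agency.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MessageCopy:
    message_id: str     # same for every copy of one message
    origin_office: str  # office that sent the message
    holder_office: str  # office whose mailbox holds this copy


def record_copies(copies):
    """Keep only copies held by the originating office (the office of record)."""
    return [c for c in copies if c.holder_office == c.origin_office]


# Invented example: one memo from the secretary, delivered to three mailboxes.
copies = [
    MessageCopy("memo-1", "secretary", "secretary"),
    MessageCopy("memo-1", "secretary", "budget"),
    MessageCopy("memo-1", "secretary", "legal"),
]
kept = record_copies(copies)
print(len(kept), kept[0].holder_office)  # 1 secretary
```

Only the outgoing copy in the secretary's office is retained. The copies in the other mailboxes are duplicates and need not be kept as records.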

 


Content provided in partnership with ProQuest