The most influential paper Gerard Salton never wrote

Library Trends, Spring, 2004 by David Dubin

THE TERM DISCRIMINATION MODEL

In 1974 and 1975 Salton published several important papers on a theory of indexing and a method for selecting words from documents and assigning numeric weights to them. The presentation of this model, called the "term discrimination value model" (TDV), would prove to be significant not only because an automatic indexing principle was expressed in this model but also because of its impact on the IR research community's perception of what became known as the VSM.

The term discrimination model proposes that the document features (such as extracted words) most useful for indexing will be those that increase the average dissimilarity between pairs of documents. In the basic conception, the similarity averaged over every document pair is computed twice: once with the feature under consideration included and once without it. The features are then ranked by the difference between those two averages, the best features being those whose inclusion lowers average similarity most sharply. Computing a discrimination value can be sped up by comparing each document to an artificial average, or centroid, document rather than computing similarities for every document pair.
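The procedure just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in Salton's experiments: the cosine measure, the function names, and the four toy documents are all hypothetical choices made here for the example. The sketch shows both the pairwise average and the centroid shortcut.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity of two equal-length weight vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def average_pairwise_similarity(docs):
    """Similarity averaged over every pair of documents."""
    pairs = list(combinations(docs, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

def average_centroid_similarity(docs):
    """Cheaper stand-in: average similarity of each document to the centroid."""
    n = len(docs)
    centroid = [sum(column) / n for column in zip(*docs)]
    return sum(cosine(d, centroid) for d in docs) / n

def drop_term(docs, k):
    """Copy of the collection with term k removed from every document."""
    return [d[:k] + d[k + 1:] for d in docs]

def discrimination_values(docs, density=average_pairwise_similarity):
    """For each term: average similarity without the term minus average with it.

    A positive value means that including the term lowers average similarity,
    which is what the model treats as a good discriminator.
    """
    with_all_terms = density(docs)
    return [density(drop_term(docs, k)) - with_all_terms
            for k in range(len(docs[0]))]

# Hypothetical toy collection: rows are documents, columns are term weights.
# Term 0 occurs in every document; terms 1 and 2 each occur in half of them.
docs = [
    [1, 2, 0],
    [1, 0, 2],
    [1, 2, 0],
    [1, 0, 2],
]

dv_pairwise = discrimination_values(docs)
dv_centroid = discrimination_values(docs, density=average_centroid_similarity)

for k, (p, c) in enumerate(zip(dv_pairwise, dv_centroid)):
    print(f"term {k}: pairwise {p:+.3f}, centroid {c:+.3f}")
```

In this toy collection the term shared by every document receives a negative value under both measures, while the two terms that separate the documents receive positive values. Note that nothing in `discrimination_values` depends on the similarity measure being a vector operation; any function that returns an average similarity for a collection could be passed as `density`.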

It is not essential to the TDV indexing model that similarity computations be explained in terms of operations on vectors or that document features be weighted or ordered. But, not surprisingly, Salton explained the model geometrically using vectors, as he had done in earlier publications. The key publications on the TDV indexing model are a Cornell technical report (Salton, 1974) that was republished a year later as a monograph (Salton, 1975), an article in the January-February 1975 issue of JASIS (Salton, Yang, & Yu, 1975), and an article in the November 1975 issue of CACM (Salton, Wong, & Yang, 1975).

The articles in CACM and JASIS (particularly the former) had the greatest impact on how the VSM came to be viewed. This is largely because of presentational choices that had little direct bearing on the thesis of either article. Most significantly, the CACM article is titled "A Vector Space Model for Automatic Indexing." One might consider this an unfortunate choice since (as discussed above) vector spaces are not essential to the TDV selection and weighting model. What both articles actually present is an "average document similarity model" for automatic indexing. Because Salton and his colleagues were computing document similarity the same way that they had been doing for years, they used the same mathematical models to explain how those computations were performed. Hence the vectors and vector operations. After over a decade of explaining their system design choices in this way, Salton and his colleagues seem to have grown comfortable with vector spaces as an economical explanatory tool. That may help account for why the vector space is foregrounded in the CACM article's title and in the opening paragraph, which begins "Consider a document space ..."

In addition, both articles use the same illustration for their first figure: a three-dimensional coordinate system where index terms are depicted as orthogonal basis vectors and documents are plotted as vectors in the space of term weights. For purposes of advancing and explaining the thesis this illustration is apt, since it gives the reader a correct impression of how similarities were computed in the experiments conducted to evaluate TDV as an indexing strategy. But as we will see, the figure made a lasting impression on readers, and eventually more was read into this illustration than was warranted.


 
