The most influential paper Gerard Salton never wrote

Library Trends, Spring, 2004 by David Dubin

This discussion is noteworthy for two reasons. First, the orthogonality assumption is described as applying to the term vectors (rather than to some unspecified basis, as in the 1979 article). Second, it is another telling example of the confusion between retrieval model and computational model. On the one hand, the authors correctly express a retrieval model issue, that is, the decision to treat words as unrelated. They acknowledge that dependencies known to exist between words in texts are not represented, measured, or used by the system. Salton and McGill understand the impact that this might have on retrieval results and explain why they choose to dismiss that concern.

On the other hand, the authors describe this decision in terms of a vector orthogonality assumption. As explained earlier, term vector orthogonality is not an assumption but rather a fact resulting from definition. Indeed, it is not even accurate to describe the retrieval model as depending on an assumption of term independence; the SMART system makes no probabilistic inference that could be falsified but merely computes document/query similarity in particular ways. (2) This characterization of SMART is another unfortunate consequence of seeing vector spaces as an IR model. As mentioned earlier, it invited Wong and Raghavan to question Salton's theoretical rigor the following year.
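The point about orthogonality by definition can be made concrete with a minimal sketch. It assumes the common representation in which each term is identified with one coordinate axis and each document or query is a list of term weights. The vocabulary, weights, and function names are hypothetical and are not drawn from SMART.

```python
import math

# Hypothetical three-term vocabulary; each term is identified with one axis.
VOCAB = ["retrieval", "vector", "model"]


def term_vector(index, size):
    """Return the coordinate (standard basis) vector for the term at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def dot(u, v):
    """Ordinary inner product of two equal-length weight lists."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine of the angle between two weight vectors (0.0 if either is all zeros)."""
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm else 0.0


TERM_VECTORS = [term_vector(i, len(VOCAB)) for i in range(len(VOCAB))]

# Inner products of all pairs of distinct term vectors.
pairwise = [
    dot(TERM_VECTORS[i], TERM_VECTORS[j])
    for i in range(len(VOCAB))
    for j in range(len(VOCAB))
    if i != j
]

# Hypothetical document and query weights over VOCAB.
doc = [2.0, 1.0, 0.0]
query = [1.0, 0.0, 0.0]
score = cosine(doc, query)
```

Every entry of `pairwise` is zero because of how `term_vector` is constructed, not because of anything observed about word usage. The `cosine` function likewise only computes a similarity value; nothing in it amounts to a probabilistic claim that could be falsified.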

Salton's 1989 book, Automatic Text Processing, includes the author's first full description of the VSM as an IR model. Ironically, much of the characterization is adapted directly from Wong and Raghavan's earlier criticism of what they interpreted Salton to have meant. The illustration of the document space in chapter 10 is an exact copy of figure 1 in Wong and Raghavan's 1984 paper (and their 1986 follow-up) and depicts the term vectors at oblique angles to one another rather than at right angles as in the 1975 TDV papers. Based on Wong and Raghavan's criticism, Salton corrects an earlier (1979) error on the use of term and document correlations to define an orthogonal basis and follows their example in calling for additional information to define the correlations. Citing Raghavan and Wong, Salton repeats the 1983 mischaracterization that term vector orthogonality implies an assumption of term independence.
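The difference between term vectors at right angles and at oblique angles can be sketched as follows. This is an illustration only, not a reproduction of Wong and Raghavan's formulation. It assumes that the inner products between term vectors are supplied as a symmetric matrix (here given the hypothetical names `IDENTITY` and `OBLIQUE`). The correlation value 0.5 is invented for the example, since such values require additional information to define.

```python
def bilinear(u, gram, v):
    """Inner product of two documents given as term-weight lists,
    where gram[i][j] is the inner product of term vectors i and j."""
    return sum(
        u[i] * gram[i][j] * v[j]
        for i in range(len(u))
        for j in range(len(v))
    )


# Term vectors at right angles: distinct terms have inner product 0.
IDENTITY = [[1.0, 0.0],
            [0.0, 1.0]]

# Term vectors at an oblique angle: a made-up correlation of 0.5.
OBLIQUE = [[1.0, 0.5],
           [0.5, 1.0]]

# A document containing only the first term, a query containing only the second.
doc = [1.0, 0.0]
query = [0.0, 1.0]

orthogonal_score = bilinear(doc, IDENTITY, query)  # 0.0
oblique_score = bilinear(doc, OBLIQUE, query)      # 0.5
```

With the identity matrix the computation reduces to the ordinary dot product, so a document and query with no terms in common score zero. With oblique term vectors the same pair receives a nonzero score, and that score depends entirely on the correlation values supplied.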

[FIGURE 1 OMITTED]

EPILOGUE: THE PAPER SALTON NEVER WROTE

As one would expect, published references to the vector model are usually much briefer than the detailed responses, extensions, and alternative proposals discussed above. An author may state, for example, that his or her experimental system realizes or is based on the VSM. Or the VSM may simply be included in a list of other models or formalisms.

It is ironic that in these references the most popular citations for the VSM seem to be the two TDV papers, the 1983 text, and the 1971 collection of SMART system articles. These choices are understandable: the CACM article was suggestively titled, and both it and the JASIS article included the same evocative illustration for figure 1. The 1971 text concerns SMART, the design of which largely defined the loose bundle of operational assumptions and expectations that people associate with systems based on the VSM. The 1983 book by Salton and McGill included descriptions that made it clear that the abstract and computational modeling issues that had been kept distinct in 1968 were by then inextricably intertwined.


 

