The most influential paper Gerard Salton never wrote

Library Trends, Spring, 2004 by David Dubin

Salton may have imagined that the "underlying basis" represented both empirically derived and psychologically real dimensions. For example, an article by Koll, published the same year as Salton's JDoc article, describes a system (called WEIRD) in which a derived vector space is proposed as a solution to the problem of measuring conceptual similarity (Koll, 1979). Alternatively, Salton may have supposed the basis to represent concepts that are neither psychologically real nor derived from data but rather pure abstractions: a few years earlier Salton's future coauthor Michael McGill had published a paper relating SMART to an abstract but informal vector model proposed by Meincke and Atherton (McGill, 1976; Meincke & Atherton, 1976).

The other significance of the orthogonality/correlation issue in 1979 is that it is a special case of a retrieval modeling issue Salton had cited in 1968: relationships between elements of the various sets. The year 1979 saw the first coupling of this abstract modeling issue with vector representations that had been discussed separately in the 1968 book. Furthermore, the earliest characterizations of Salton's VSM as an IR model appeared that year in separate publications by Salton, McGill, and Koll (Salton, 1979; McGill & Huitfeldt, 1979; Koll, 1979).

Koll identifies the basis in Salton's vector model as the index term vectors. Within a few years, Salton would come to agree that it is the index term vectors (not some other basis) that are assumed to be orthogonal in his VSM. But that position is equally problematic: if the basis vectors represent index terms then those vectors are not assumed to be orthogonal, they simply are orthogonal, because all that the vectors represent is the way that term frequency data are used in the system's computations.
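The point can be made concrete with a minimal sketch in Python. It assumes a hypothetical three-term vocabulary and invented term frequencies; none of the names below come from SMART or from Salton's writings. Each index term is simply given its own coordinate axis, so the term vectors are the standard basis vectors, and their pairwise inner products are zero by construction rather than by assumption.

```python
# Hypothetical vocabulary; the terms are illustrative only.
VOCAB = ["retrieval", "vector", "index"]


def term_vector(term):
    """Return the coordinate axis assigned to an index term."""
    return [1.0 if t == term else 0.0 for t in VOCAB]


def dot(u, v):
    """Ordinary inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def doc_vector(freqs):
    """Build a document vector as a frequency-weighted sum of term vectors."""
    vec = [0.0] * len(VOCAB)
    for term, f in freqs.items():
        axis = term_vector(term)
        vec = [x + f * a for x, a in zip(vec, axis)]
    return vec


def gram():
    """Matrix of inner products between every pair of term vectors."""
    return [[dot(term_vector(s), term_vector(t)) for t in VOCAB] for s in VOCAB]
```

Under these assumptions, gram() is the identity matrix, and a document vector is nothing more than its list of term frequencies laid out in vocabulary order. The orthogonality is a bookkeeping fact about the representation, not a claim about language.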

When a commentator on the VSM says that term basis vectors are assumed to be orthogonal, this misstates the fact that dependencies among words in natural language are ignored. Approaches such as WEIRD and Latent Semantic Indexing do compute and use information about these dependencies. SMART's similarity computations never worked that way, but there is ample evidence in the writings of Salton and his colleagues that they understood word/term dependencies and conducted many experiments to employ term associations in retrieval (Salton, 1963; Lesk, 1969; Salton, Buckley, & Yu, 1983).
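The difference between ignoring term dependencies and using them can be sketched as follows. This is an illustration only, not the method of SMART, WEIRD, or Latent Semantic Indexing: the vocabulary, the two one-word documents, and the association weight of 0.8 are all invented, and the association matrix is a generic stand-in for whatever dependency information a system might compute.

```python
import math

# Hypothetical vocabulary and documents; all values are invented.
VOCAB = ["car", "automobile", "library"]
doc_a = [1.0, 0.0, 0.0]  # mentions only "car"
doc_b = [0.0, 1.0, 0.0]  # mentions only "automobile"

# Hypothetical term-term association matrix (symmetric, 1.0 on the diagonal).
ASSOC = [
    [1.0, 0.8, 0.0],
    [0.8, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]


def cosine(u, v, assoc=None):
    """Cosine similarity; `assoc` optionally weights pairs of terms."""
    n = len(u)
    if assoc is None:
        # With no association data, use the identity: dependencies are ignored.
        assoc = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

    def inner(x, y):
        return sum(x[i] * assoc[i][j] * y[j] for i in range(n) for j in range(n))

    return inner(u, v) / math.sqrt(inner(u, u) * inner(v, v))


plain = cosine(doc_a, doc_b)          # dependencies ignored
linked = cosine(doc_a, doc_b, ASSOC)  # dependencies supplied
```

With these invented numbers, the plain cosine of the two documents is 0, while the association-weighted cosine is 0.8. The first result does not assert that "car" and "automobile" are unrelated; it only shows that the computation was given no information relating them.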

It is a subtle error of language or description to claim that the VSM assumes term vectors are orthogonal. And it is no coincidence that this error first appeared when the VSM was first characterized as a retrieval model instead of a computation model. If term vector orthogonality is a simplifying assumption, then that implies the existence of correlated terms independent of their operational definition in the computational design choices. But, as with the "underlying basis" of 1979, it is not clear what those entities could be. Evidently, the familiarity of vector space illustrations has led to a confounding of objective facts (that term dependencies and word associations exist) with implications for how those facts might be modeled (as correlations between vectors in a vector space). In 1968 Salton had included the character of relationships among members of the descriptor set as a retrieval modeling issue. By 1979, discussion of those relationships had become inseparable from discussion of similarity computations. That confusion continued to shape reactions to Salton's contributions over the subsequent years.


 

Content provided in partnership with Thompson Gale