The most influential paper Gerard Salton never wrote

Library Trends, Spring, 2004 by David Dubin

It is at this last data-centric level that one should understand the use of vector abstractions in most of Salton's IR publications: vector components represent raw or modified observations, and relations between vectors (such as the cosine of the angle between pairs of them) are devices for explaining computations or other design choices about how an IR system operates. As we shall see, the habit of describing data and computations in terms of operations on vectors eventually became so familiar that some later interpretations seem to lose sight of the role the vector model was intended to play.

EARLIEST EXAMPLES

The elements of what would come to be known as the VSM are evident in Salton's earliest publications on experimental IR and in the work of other authors (Switzer, 1965; Sammon, 1968). In a 1963 article in the Journal of the Association for Computing Machinery (JACM), Salton describes systems and methods for what at that time he calls "associative document retrieval techniques." Building on earlier work by people such as H. P. Luhn, Salton outlines the architecture for automated systems that extract words from machine-readable texts, select a subset of those words deemed significant enough to represent the document content, and compute measures of association between pairs of terms, pairs of documents, and between documents and queries.

Even in this early paper one finds frequencies of extracted words presented using matrix and vector notation and the cosine of angles between vectors recommended as a measure of association. The vector representation is employed to describe similarities computed using both extracted words and citation data. Furthermore, it is clear that vector representations are to be understood precisely at the data-centric level described above: the term-document matrix is called an incidence matrix, leaving no doubt that what the vector components model are observations. The similarity measures are at all points described as methods or operations on the data that can be interpreted as relations between vectors.
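
The data-centric reading can be made concrete with a small sketch. The Python below is an illustration only, not a reconstruction of Salton's 1963 procedures: the toy documents, the whitespace tokenization, and the function names term_vector and cosine are all assumptions introduced here, and the step of selecting significant words is omitted. It builds a matrix of raw term frequencies and computes the cosine of the angle between a query vector and each document vector.

```python
import math
from collections import Counter


def term_vector(text, vocabulary):
    """Return the raw frequency of each vocabulary term in the text."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy collection; the texts are invented for illustration.
documents = {
    "d1": "retrieval of documents by automatic retrieval",
    "d2": "automatic indexing of documents",
    "d3": "vector cosine measures",
}

# Every extracted word becomes a vector component (no significance filtering here).
vocabulary = sorted({word for text in documents.values() for word in text.lower().split()})

# One row of observed frequencies per document: the term-document matrix.
matrix = {name: term_vector(text, vocabulary) for name, text in documents.items()}

# The query is represented in the same way as the documents.
query = term_vector("automatic retrieval", vocabulary)

# Rank documents by decreasing cosine with the query.
ranking = sorted(documents, key=lambda name: cosine(query, matrix[name]), reverse=True)
print(ranking)
```

Nothing in the sketch requires a commitment to a geometric theory of meaning: the vectors are simply rows of counts, and the cosine is one convenient way of describing the comparison the program performs on them.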

SMART was the system Salton developed over the course of his career as an IR researcher. More than just an IR system, SMART was the working expression of Salton's theories and the experimental environment in which those theories were evaluated and tested (Salton, 1971). The earliest papers describing the SMART system show that the same extraction and association procedures outlined in the JACM article are central to SMART's design and operation (Salton, 1965b; Salton & Lesk, 1965). In 1965 Salton published a paper in IEEE Spectrum titled "Progress in Automatic Information Retrieval" (1965a). That article discusses specific features of SMART and characterizes document representations and similarity computations in terms of vectors. In addition, relevance feedback experiments (conducted by J. J. Rocchio) are described in terms of query vector modifications. In all these examples, the vector spaces illustrate how computations such as similarity measures and relevance feedback are applied to the data; the vector spaces are models of computations executed by the system.
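
What it means to describe relevance feedback as a query vector modification can be sketched in a few lines. The Python below follows the form in which Rocchio's method is commonly presented in later textbooks, not necessarily the form used in the 1965 experiments: the function name modify_query, the weights alpha, beta, and gamma with their default values, and the clipping of negative components to zero are all assumptions made for illustration.

```python
def modify_query(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move a query vector toward judged-relevant documents and away from nonrelevant ones.

    The weights are illustrative assumptions, not values taken from the SMART papers.
    """

    def centroid(vectors):
        # Component-wise mean; an empty set of judgments contributes nothing.
        if not vectors:
            return [0.0] * len(query)
        return [sum(column) / len(vectors) for column in zip(*vectors)]

    rel = centroid(relevant)
    non = centroid(nonrelevant)
    # Clipping negative weights to zero is a common convention, assumed here.
    return [max(0.0, alpha * q + beta * r - gamma * n)
            for q, r, n in zip(query, rel, non)]


# Hypothetical example: a two-term query, one relevant and one nonrelevant document.
new_query = modify_query([1.0, 0.0], relevant=[[0.0, 2.0]], nonrelevant=[[2.0, 0.0]],
                         alpha=1.0, beta=0.5, gamma=0.5)
print(new_query)
```

As with the similarity computation, the vector notation here describes an operation the system carries out on its data: the revised query is just a reweighted list of term values.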


 
