The most influential paper Gerard Salton never wrote
Library Trends, Spring, 2004 by David Dubin
THE VECTOR SPACE AS AN IR MODEL
The next significant stage in the evolution of how the VSM was perceived became evident in 1979. That year Salton published an article in the Journal of Documentation (JDoc) titled "Mathematics and Information Retrieval." This article was the first since the 1968 text to discuss issues of modeling in depth, and it is significant for two reasons:
1. This seems to be the first time Salton refers to the VSM as an IR model in print.
2. Salton describes an orthogonality assumption for the first time in this article.
Informally, one can understand the orthogonality issue as whether the vectors forming the basis of the space (that is, those representing variables under investigation) are at right angles to one another. Modeling variables as orthogonal basis vectors suggests that those variables either are or should be treated as statistically independent of one another. Salton's vector spaces (such as those in the 1975 TDV articles) model frequencies of extracted words with orthogonal basis vectors, which gives the false impression that words are assumed to occur independently of each other. As noted above, however, Salton's use of vector spaces is for modeling how an IR system performs particular computations. No empirical claim about word occurrences is implied: the equations and diagrams merely illustrate how the system was programmed to match documents and queries.
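The operational reading can be made concrete with a minimal sketch of matching a query against documents when each extracted word gets its own axis. The vocabulary, the counts, and the function name below are hypothetical illustrations rather than anything taken from SMART, and cosine is used here only as one common matching function.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two term-frequency vectors.

    Only products of matching coordinates are summed, which is what
    treats the term axes as mutually orthogonal.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical vocabulary: one axis per extracted word.
vocabulary = ["retrieval", "vector", "library"]
doc_a = [2, 1, 0]   # made-up word frequencies
doc_b = [0, 0, 3]
query = [1, 1, 0]

scores = {"doc_a": cosine(doc_a, query), "doc_b": cosine(doc_b, query)}
ranking = sorted(scores, key=scores.get, reverse=True)
```

Nothing in this computation asserts that "retrieval" and "vector" occur independently in real text. The orthogonal axes simply describe how the score is calculated.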
In "Mathematics and Information Retrieval" Salton uses the term "vector processing model" rather than vector space model, and this is the first suggestion that the VSM has shifted from being understood as a model for illustrating specific computations to being an IR model in its own right. This article recapitulates much of the set-theoretic modeling discussion in the 1968 text, but this time puts alongside it a section on "Retrieval as vector matching operations." The description of the vector representation and operations is similar to the earlier computational/operational illustrations but with some telling exceptions: Salton mentions an "underlying basis" out of which the vectors representing index terms are composed via linear combination. Precisely what this basis represents Salton declines to specify, but he states that to assume that this basis is orthogonal would be at odds with "actual fact" since "relationships may exist between individual vector attributes" (Salton, 1979, p. 8).
The significance of this shift in thinking is twofold: First, Salton's use of vector spaces has temporarily drifted from the operational, data-centric conception seen earlier to some other vague level of abstraction. Second, the question of correlation or orthogonality is explicitly linked to a modeling issue that Salton had identified in 1968: the existence of relations or dependencies among the document and query identifiers.
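What it means for term vectors to be linear combinations of an underlying basis can be sketched as follows. Salton does not say what the basis represents, so the two basis dimensions and every coordinate below are assumptions invented purely for illustration. The sketch shows only the arithmetic consequence: once term vectors are not orthogonal, a document and a query that share no terms can still have a nonzero inner product.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Hypothetical coordinates of each term vector in an assumed
# two-dimensional orthonormal underlying basis.
term_vectors = {
    "car":        [1.0, 0.0],
    "automobile": [0.8, 0.6],  # deliberately not orthogonal to "car"
    "library":    [0.0, 1.0],
}
terms = list(term_vectors)

def expand(weights):
    """Express a term-weight vector as a combination of basis vectors."""
    out = [0.0, 0.0]
    for term, w in zip(terms, weights):
        for k, coord in enumerate(term_vectors[term]):
            out[k] += w * coord
    return out

doc = [1, 0, 0]     # contains only "car"
query = [0, 1, 0]   # contains only "automobile"

# Treating terms as orthogonal axes: no shared term, so no match.
orthogonal_score = dot(doc, query)

# Treating terms as combinations of the underlying basis: the
# relationship between the two terms contributes to the match.
basis_score = dot(expand(doc), expand(query))
```

Under these made-up coordinates the first score is zero and the second is not, which is the kind of discrepancy an orthogonality assumption would hide.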
When Salton alludes to the mysterious "underlying basis," he may have in mind latent dimensions of the kind that can be uncovered through, for example, principal components analysis or factor analysis. Methods for representing documents in these empirically derived vector spaces had been proposed before (Switzer, 1965; Sammon, 1968) and since (Deerwester, Dumais, Furnas, Landauer, & Harshman, 1990), but the techniques had not been used in Salton's research. Perhaps the "underlying basis" is supposed to represent psychological variables of the kind that can be studied by eliciting similarity judgments from experimental subjects. Several key studies investigating the suitability of vector spaces as psychological models of similarity were published in the years before and after "Mathematics and Information Retrieval" (Tversky, 1977; Tversky & Gati, 1978; Tversky & Gati, 1982). But such psychological models were never part of the SMART system. In any case, the basis cannot represent either the terms or the documents, since Salton claims that both term and document vectors are linear combinations of the basis vectors. The reader is left with the impression of entities that somehow have a real existence independent of the IR system design decisions and that the system models imperfectly. What could those be?