The most influential paper Gerard Salton never wrote
Library Trends, Spring, 2004 by David Dubin
THE TERM DISCRIMINATION MODEL
In 1974 and 1975 Salton published several important papers on a theory of indexing and a method for selecting words from documents and assigning numeric weights to them. The presentation of this model, called the "term discrimination value model" (TDV), would prove significant not only for the automatic indexing principle it expressed but also for its impact on the IR research community's perception of what became known as the vector space model (VSM).
The term discrimination model proposes that the document features (such as extracted words) most useful for indexing will be those that increase the average dissimilarity between pairs of documents. In the basic conception, the similarity averaged over every document pair is computed twice: once with the feature under consideration included and once without it. The features are then ranked by the difference between those averages, with the best features being those that lower average similarity most dramatically when they are included. The process of computing a discrimination value can be sped up by comparing each document to an artificial average or centroid document rather than computing similarities for every document pair.
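The ranking procedure described above can be illustrated with a minimal sketch. It is not Salton's implementation: cosine similarity is assumed as the similarity measure, and the function names, the sign convention (average similarity without the term minus average similarity with it), and the tiny term-count matrix are all hypothetical choices made for illustration.

```python
import math
from itertools import combinations

def cosine(a, b):
    """Cosine similarity of two equal-length weight lists (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def avg_pairwise_similarity(docs):
    """Average similarity over every pair of documents."""
    pairs = list(combinations(docs, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)

def avg_centroid_similarity(docs):
    """Cheaper variant: average similarity of each document to the centroid."""
    n = len(docs)
    centroid = [sum(col) / n for col in zip(*docs)]
    return sum(cosine(d, centroid) for d in docs) / n

def drop_term(docs, k):
    """Copy of the collection with feature k removed from every document."""
    return [d[:k] + d[k + 1:] for d in docs]

def discrimination_values(docs, density=avg_pairwise_similarity):
    """For each term: (average similarity without it) - (average with it).

    A positive value means that including the term lowers average
    similarity, i.e. the term helps tell documents apart.
    """
    with_all = density(docs)
    n_terms = len(docs[0])
    return [density(drop_term(docs, k)) - with_all for k in range(n_terms)]

# Hypothetical collection: rows are documents, columns are counts of three
# made-up terms. Term 0 occurs equally in every document; terms 1 and 2
# each occur in only half of the documents.
docs = [
    [5, 3, 0],
    [5, 0, 2],
    [5, 1, 0],
    [5, 0, 4],
]

pairwise_dv = discrimination_values(docs)
centroid_dv = discrimination_values(docs, density=avg_centroid_similarity)
```

In this toy collection, the term shared equally by every document receives a negative value under both measures, while the two more selective terms receive positive values. The centroid variant needs one comparison per document rather than one per document pair, which is the saving the text refers to.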
It is not essential to the TDV indexing model that similarity computations be explained in terms of operations on vectors or that document features be weighted or ordered. But, not surprisingly, Salton explained the model geometrically using vectors, as he had done in the earlier publications. The key publications on the TDV indexing model are a Cornell technical report (Salton, 1974) that was republished a year later as a monograph (Salton, 1975), an article in the January-February 1975 issue of JASIS (Salton, Yang, & Yu, 1975), and an article in the November 1975 issue of CACM (Salton, Wong, & Yang, 1975).
The articles in CACM and JASIS (particularly the former) had the greatest impact on how the VSM came to be viewed. This is largely because of presentational choices that had little direct bearing on the thesis of either article. Most significantly, the CACM article is titled "A Vector Space Model for Automatic Indexing." One might consider this an unfortunate choice since (as discussed above) vector spaces are not essential to the TDV selection and weighting model. What both articles actually present is an "average document similarity model" for automatic indexing. Because Salton and his colleagues were computing document similarity the same way that they had been doing for years, they used the same mathematical models to explain how those computations were performed. Hence the vectors and vector operations. After over a decade of explaining their system design choices in this way, Salton and his colleagues seem to have grown comfortable with vector spaces as an economical explanatory tool. That may help account for why the vector space is foregrounded in the CACM article's title and in the opening paragraph, which begins "Consider a document space ..."
In addition, both articles use the same illustration for their first figure: a three-dimensional coordinate system where index terms are depicted as orthogonal basis vectors and documents are plotted as vectors in the space of term weights. For purposes of advancing and explaining the thesis this illustration is apt, since it gives the reader an accurate impression of how similarities were computed in the experiments conducted to evaluate TDV as an indexing strategy. But as we will see, the figure made a lasting impression on readers, and eventually more was read into this illustration than was warranted.


