EnroWiki : InformationRetrieval

Information retrieval

Zipf's Law

We can analyse part of a sentence, such as a subphrase describing a protein–protein interaction or part of a sentence containing a gene and a protein name, but we always run into Zipf's law whenever we write down the rules for how the extraction is done (http://dx.doi.org/10.1371/journal.pbio.0030065.g002

). A small number of patterns describe a reasonable portion of protein–protein interactions, gene names, or mutations, but many of those entities are described by a pattern of words that's only ever used once. Even if we could collect them all—which is impossible—we can't stop new phrases from being used. (Rebholz-Schuhmann, Dietrich & Kirsch, Harald & Couto, Francisco, Facts from Text---Is Text Mining Ready to Deliver?, in 3 PLoS Biology, 2, e65 (2005), http://dx.doi.org/10.1371/journal.pbio.0030065

)

Index-, topic-, and content words

The implicit semantic models in today's systems are based on single word occurrence and, occasionally, cooccurrence between words. This takes us a bit of the way, but not far enough for us to claim we have text understanding within reach. Words are besides vague both polysemous and incomplete: every word means several things and every thing can be expressed with any one of several words. (Karlgren, Jussi, The Basics of Information Retrieval: Statistics and Linguistics, 2000, http://www.sics.se/ jussi/Undervisning/texter/ir-textbook.pdf

)

It is generally accpeted within the text retrieval community that the words used in documents relate to the documents' content, and that documents with similar content will tend to use a similar vocabulary. Therefore, text retrieval systems index and compare documents by the words they contain. The quality of these index words is usually measured by how well the words can disciminate a topic or document from all other topics or documents.

Several researchers have attempted to evaluate index words by their content. Damerau, for example, uses word frequency to extract doman-oriented vocabulary which he defines as "a list of content words (not necessarily complete) that would characteristically be used in talking about a particular subject, say education, as opposed to the list of words used to talk about, say aviation". However, his evaluation method does not guarantee that the words are domain-oriented. It only shows that the selected words are used more often in the domain's documents. Therefore, good discriminators can easily be accepted as topic words by this test if they happened to be used more often in one domain than the other domains tested. But a good discriminating word is not necessarily a content word.

Manning and Schütze define non-content words informally as "words that taken in isolation … do not give much information about the contents of the document". Note that although a content word is usually relevant to the topic being discussed in the document, it does not need to be.

Many researchers opt to remove stopwords using a preset stopword list for efficiency and effectiveness. However, these stopword lists vary from one database to another and from one language to another, so it is desirable that topic relevance measures be able to identify these non-content words as such regardless of the language or database used. (Al-Halimi, Reem K. and Tompa, Frank W., Using Word Position in Documents for Topic Characterization, cs-2003-36, University of Waterloo, Canada, oct, 2003, http://www.cs.uwaterloo.ca/research/tr/2003/36/relevance_measures_TR.pdf

)

Luhn used [Zipf's Law] as a null hypothesis to enable him to specify two cut-offs, an upper and a lower, thus excluding non-significant words. The words exceeding the upper cut-off were considered to be common and those below the lower cut-off rare, and therefore not contributing significantly to the content of the article. He thus devised a counting technique for finding significant words. Consistent with this he assumed that the resolving power of significant words, by which he meant the ability of words to discriminate content, reached a peak at a rank order position half way between the two cut-offs and from the peak fell off in either direction reducing to almost zero at the cut-off points. A certain arbitrariness is involved in determining the cut-offs. There is no oracle which gives their values. They have to be established by trial and error.

The fundamental hypothesis made now is that a content-bearing word is a word that distinguishes more than one class of documents with respect to the extent to which the topic referred to by the word is treated in the documents in each class. (van Rijsbergen, C. J., Information Retrieval, Butterworths, London, 2nd ed., 1979, http://www.dcs.gla.ac.uk/Keith/Preface.html

)

Carroll and Roeloffs compared several frequency criteria with the judgment of panels of indexers for selection of keywords from documents. They found that the statistical ranking agreed better with the consensus of human judgment than did the judgment of individual indexers. (Bookstein, A. and Swanson, Don R., Probabilistic models for automatic indexing, in 25 Journal of the American Society for Information Science, 5, 312 (1974), http://people.csail.mit.edu/jrennie/papers/other/bookstein-indexing-74.pdf

)

One may also use the Zipfian distribution of terms to determine which terms will be considered 'stopwords'. A stopword is a term that will not be indexed or processed in a query because of its high frequency. It is deemed to be of little value in the search process. For example, in most text-based IR systems, articles such as 'the' occur too frequently to be of value. These words provide no context or topicality for a search. The same may be true of frequently occurring nouns in subject-oriented databases, e.g. excluding a word such as 'science' in a science-oriented database since every record in the database probably deals with science. Salton and McGill (1983) hypothesized that the best search terms to use were those that occurred with mid-range frequency within the database, since they would be more discriminating than frequently occurring terms, but would cast a broader net than infrequently occurring terms. (Wolfram, Dietmar, Applications of Informetrics to Information Retrieval Research, in 3 Informing Science The International Journal of an Emerging Transdiscipline, 2, 77-82 (2000), http://www.doaj.org/abstract?id=89976

)

TF * IDF

There are various ways of combining term frequencies and inverse document frequencies, and empirical studies (Salton, Gerard & Yang, Chung-Shu, On the specification of term values in automatic indexing, in 29 Journal of Documentation, 4, 351-372 (1973)) show that the optimal combination may vary from collection to collection. Generally, tf is multiplied by idf to obtain a combined term weight. Alternatives would be for instance to entirely discard terms with idf below a set threshold — which seems to be slightly better for searchers that require high precision. Both measures are usually smoothed by taking logarithms rather than the straight measure — or some similar simple transformation — to avoid dramatic effects of small numbers. (Karlgren, Jussi, The Basics of Information Retrieval: Statistics and Linguistics, op. cit.)

TF*IDF-style measures emerged from extensive empirical studies of combinations of weigthing factors, particularly by the Cornell group (Salton, Gerard & Yang, Chung-Shu, On the specification of term values in automatic indexing, op. cit.). (Robertson, Stephen, Understanding Inverse Document Frequency: On theoretical arguments for IDF, in 60 Journal of Documentation, 5, 503-520 (2004), http://www.soi.city.ac.uk/ ser/idfpapers/Robertson_idf_JDoc.pdf

)

Inverse Document Frequency (IDF) is a popular measure of a word's importance. It is defined as the logarithm of the ratio of number of documents in a collection to the number of documents containing the given word. (...) The traditional justifications of IDF and similar measures assume that words occur independently in documents. (Papineni, Kishore, Why inverse document frequency?, Proceedings of the 2nd Meeting of the North American Chapter of the Association for Computational Linguistics, Association for Computational Linguistics, Pittsburgh, PA, 1-8, 2001, http://acl.ldc.upenn.edu/N/N01/N01-1004.pdf

)

We see that the importance of a word is not linearly proportional to the IDF. Indeed, words that occur extemely rarely have very high IDF but very low gain as do words that occur in almost every document. This corroborates H.P. Luhn's observation that mid-frequency words have the highest "resolving power". (Papineni, Kishore, Why inverse document frequency?, op. cit.)

Document length

Most algorithms in use introduce document length as a normalization factor of some sort, if the documents in the document base vary as regards length (Salton, Gerald & Buckley, Christopher, Term weighting approaches in automatic text retrieval, in 24 Information Processing and Management, 5, 513-523 (1988)). (Karlgren, Jussi, The Basics of Information Retrieval: Statistics and Linguistics, op. cit.)

When collections have documents of varying lengths, longer documents tend to score higher since they contain more words and word repetitions. This effect is usually compensated by normalizing for document lengths in the term weighting method. (Singhal, Amit, Modern information retrieval: A brief overview, in 24 Bulletin of the IEEE Computer Society Technical Commitee on Data Engineering, 4, 35-43 (2001), http://net.pku.edu.cn/ webg/papers/IR%20Overview.pdf

)

Stopwords

In the simple case of removing high frequency words by means of a 'stop' word list we are attempting to increase the level of discrimination between document. (van Rijsbergen, C. J., Information Retrieval, op. cit.)

Another obvious reduction in dictionary size is to compile a list of stopwords and remove them from the dictionary. These are words that almost never have any predictive capability, such as articles a and the and pronouns such as it and they. These common words can be discarded before the feature generation process, but it's more effective to generate the features firts, apply all the other transformations, and at the very last stage reject the ones that correspond to stopwords. (Weiss, Sholom M. and Indurkhya, Nitin & Zhang, Tong & Damerau, Fred J., Text Mining: Predictive Methods for Analyzing Unstructured Information, Springer, New York, 2005)

Precision and recall

In principle, a system is preferred that produces both high recall by retrieving everything that is relevant, and also high precision by rejecting all items that are extraneous. The recall function of retrieval appears to be best served by using broad, high-frequency terms that occur in many documents of the collection. Such terms may be expected to pull out many documents, including many of the relevant documents. The precision factor, however, may be best served by using narrow, highly specific terms that are capable of isolating the few relevant items from the mass of nonrelevant ones. In practice, compromises are normally made by using terms that are broad enough to achieve a reasonable recall level without at the same time producing unreasonably low precision. The differing recall and precision requirements favor the use of composite term weighting factors that contain both recall- and precision-enhancing components. (Salton, Gerald & Buckley, Christopher, Term weighting approaches in automatic text retrieval, op. cit.)