Abstract
Information is one of the most valuable resources we have, and search engines are now the primary way we access it. Unfortunately, written language is rife with inconsistency and ambiguity, which causes many problems, including Vocabulary Mismatch, where authors use different words to describe the same thing. In a text retrieval search engine, Vocabulary Mismatch can be addressed with Query Expansion: adding words to a user’s search query so that it covers a broader vocabulary. If expansion terms are not chosen carefully, query drift can occur: the semantics of the expanded query drift away from what the user meant. This thesis proposes two methods that mitigate query drift directly. The first, Term Frequency Merging (Chapter 7), modifies the ranking function so that the term frequency of each expansion term is accumulated with that of its original query term before the saturation function is applied. Accumulating frequencies in this way prevents over-boosting words that have disproportionately more expansions than other words. The second, Query Context Selection (Chapter 8), is a more restrictive expansion term selection process that prevents the inclusion of semantically incompatible expansion terms. It uses the semantic context shared by the original query terms to select the expansions that relate most strongly to the original query as a whole and to ignore those that are likely spurious. The results from both experiments are promising: both methods outperform no-expansion and naive-expansion. Notably, there is significant improvement in the cases where query drift has caused the most damage. However, other query reformulation approaches, such as Blind Relevance Feedback, still outperform the two proposed methods in many test cases.
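To make the idea behind Term Frequency Merging concrete, the following is a minimal sketch rather than the ranking function developed in Chapter 7. It assumes a BM25-style saturation function without document length normalisation, omits IDF weighting entirely, and uses a hypothetical two-word query and hypothetical document term counts; the function names and the value of `k1` are illustrative only.

```python
def saturate(tf, k1=1.2):
    """BM25-style saturation, simplified (assumption: no length normalisation)."""
    return tf * (k1 + 1) / (tf + k1)


def naive_score(doc_tf, groups, k1=1.2):
    """Naive expansion: every term, original or expansion, is saturated on its own."""
    return sum(saturate(doc_tf.get(term, 0), k1) for group in groups for term in group)


def merged_score(doc_tf, groups, k1=1.2):
    """Merged: frequencies within a group are summed first, then saturated once."""
    return sum(
        saturate(sum(doc_tf.get(term, 0) for term in group), k1) for group in groups
    )


# Hypothetical query "car insurance". Each group holds an original term followed
# by its expansions: "car" has three expansions, "insurance" has none.
groups = [["car", "automobile", "vehicle", "auto"], ["insurance"]]

# Hypothetical term counts for two documents.
doc_about_cars = {"car": 2, "automobile": 2, "vehicle": 2, "auto": 2}
doc_on_topic = {"car": 2, "insurance": 2}

for name, doc in [("cars only", doc_about_cars), ("on topic", doc_on_topic)]:
    print(name, round(naive_score(doc, groups), 3), round(merged_score(doc, groups), 3))
```

Under these assumptions, naive expansion scores the document that only discusses cars above the document that covers both query words, because "car" contributes four separately saturated terms. With merging, the "car" group saturates once, so the document matching both original terms ranks higher.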