Removing a Single Brick From The Language Barrier Between Us and Information

Reuben Crimp

Graduate Thesis/Dissertation

Open access

Master of Science - MSc, University of Otago

2021

Handle:

https://hdl.handle.net/10523/12419

Keywords: New Zealand; Linguistics; NLP; BM25; information retrieval; natural language processing; search engine; Dunedin; information science; query drift

Abstract

Information is one of the most valuable resources we have, and search engines are now the primary way we access it. Unfortunately, written language is rife with inconsistency and ambiguity, which causes many problems, including Vocabulary Mismatch: authors using different words to describe the same thing. In a text retrieval search engine, Vocabulary Mismatch can be addressed with Query Expansion, which adds words to a user’s search query so that it covers a broader vocabulary. If expansion terms are not chosen carefully, query drift can occur: the query’s semantics drift away from what the user meant. This thesis proposes two methods to mitigate query drift directly. The first, Term Frequency Merging (Chapter 7), modifies the ranking function by accumulating the term frequency of each expansion term with its respective original query term under the saturation function. Accumulating frequencies in this way prevents over-boosting words that have disproportionately more expansions than other words. The second, Query Context Selection (Chapter 8), is a more restrictive expansion term selection process that prevents the inclusion of expansion terms that are semantically incompatible. It uses the semantic context shared by the original query terms to select the expansions that relate most strongly to the original query as a whole and to ignore expansions that are likely spurious. The results from both experiments are promising: both methods outperform no-expansion and naive-expansion. Notably, there is significant improvement where query drift has caused the most damage. However, other query reformulation approaches, such as Blind Relevance Feedback, still outperform the two proposed methods in many test cases.
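The idea behind Term Frequency Merging can be illustrated with a minimal sketch. This is not the thesis's implementation: it assumes a standard BM25-style saturation function, and the names saturate, score_naive, score_merged, and the per-term weight table are hypothetical, introduced only for illustration. How the thesis weights merged terms (for example, which IDF it uses) is not stated in the abstract, so the sketch simply reuses the original query term's weight.

```python
def saturate(tf, doc_len, avg_len, k1=1.2, b=0.75):
    """BM25-style term-frequency saturation with length normalisation.

    The result approaches (k1 + 1) as tf grows, so one term can only
    contribute a bounded amount to a document's score.
    """
    norm = k1 * (1.0 - b + b * doc_len / avg_len)
    return tf * (k1 + 1.0) / (tf + norm)


def score_naive(doc_tf, doc_len, avg_len, expansions, weight):
    """Naive expansion: each expansion term is saturated on its own.

    A query word with many expansions gets many separately saturated
    contributions, so it can dominate the score.
    """
    score = 0.0
    for term, exps in expansions.items():
        for t in [term, *exps]:
            tf = doc_tf.get(t, 0)
            # Assumption: expansions inherit the original term's weight.
            score += weight[term] * saturate(tf, doc_len, avg_len)
    return score


def score_merged(doc_tf, doc_len, avg_len, expansions, weight):
    """Merged variant: sum the frequencies first, then saturate once.

    Each original query word contributes at most weight * (k1 + 1),
    no matter how many expansions it has.
    """
    score = 0.0
    for term, exps in expansions.items():
        merged_tf = sum(doc_tf.get(t, 0) for t in [term, *exps])
        score += weight[term] * saturate(merged_tf, doc_len, avg_len)
    return score


# Hypothetical toy data: the query "red car", where "car" has three
# expansions and "red" has none.
doc_tf = {"car": 2, "automobile": 1, "vehicle": 1, "red": 1}
expansions = {"car": ["automobile", "vehicle", "auto"], "red": []}
weight = {"car": 1.0, "red": 1.0}

naive = score_naive(doc_tf, 100, 100, expansions, weight)
merged = score_merged(doc_tf, 100, 100, expansions, weight)
print(f"naive={naive:.3f} merged={merged:.3f}")
```

In this toy example the naive score lets "car" and its expansions contribute more than any single term could on its own, while the merged score keeps each original query word within the saturation bound, which is the over-boosting effect the abstract describes.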

Files and links (1)

thesis_r.pdf (pdf)

Details

Record Identifier: 9926479531401891
Title: Removing a Single Brick From The Language Barrier Between Us and Information
Creators: Reuben Crimp
Contributors: Andrew Trotman (Advisor / Supervisor)
Academic Unit: Computer Science
Publisher: University of Otago
Degree Awarded: Master of Science - MSc
Project Type: Thesis - Masters
Awarding Institution: University of Otago
Date published ; e-published: 2021
Wikidata ID: Q112955070
Language: English
Resource Type ; Subtype: Graduate Thesis/Dissertation
Format: application/pdf