Abstract
Information retrieval (IR) is the process of finding relevant information, based on user queries, in a large collection of documents. The two main performance issues in IR are effectiveness and efficiency; effectiveness measures how accurately an IR system can find relevant information, and efficiency relates to how long it takes the IR system to find the relevant information. To satisfy users, an IR system should find the most relevant information in as short a time as possible.
When considering efficiency issues, IR systems are interesting because they are neither purely input/output (I/O) intensive nor solely central processing unit (CPU) intensive. The efficiency of an IR system is normally analysed in terms of five components: accumulator initialisation, disc I/O, decompression, ranking and sorting. First, an array of accumulators, holding intermediate aggregated results, has to be initialised. Second, disc I/O is required to read dictionary terms and the corresponding lists of postings. Third, these lists are typically stored in a compressed format, so decompression is required after they are fetched from the disc. Fourth, complex ranking functions are applied to calculate similarity scores between the documents and the user queries. Finally, a large number of candidate documents must be sorted so that the most relevant results can be returned to the user.
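The five stages described above can be illustrated with a minimal sketch. It is not the system developed in this research. It assumes a toy in-memory index in which a dictionary lookup stands in for disc I/O, variable-byte coding stands in for the compression scheme, and a simple TF-IDF weight stands in for the ranking function. All function names (`vbyte_encode`, `vbyte_decode`, `build_index`, `search`) are hypothetical and chosen for illustration only.

```python
import heapq
import math


def vbyte_encode(numbers):
    """Variable-byte encode a list of non-negative integers."""
    out = bytearray()
    for n in numbers:
        while n >= 128:
            out.append(n & 0x7F)   # low 7 bits, more bytes follow
            n >>= 7
        out.append(n | 0x80)       # high bit marks the final byte of a number
    return bytes(out)


def vbyte_decode(data):
    """Inverse of vbyte_encode."""
    numbers, n, shift = [], 0, 0
    for byte in data:
        if byte & 0x80:
            numbers.append(n | ((byte & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= byte << shift
            shift += 7
    return numbers


def build_index(docs):
    """Map each term to compressed (doc-id gap, term frequency) pairs."""
    postings = {}
    for doc_id, text in enumerate(docs):
        counts = {}
        for term in text.lower().split():
            counts[term] = counts.get(term, 0) + 1
        for term, tf in counts.items():
            postings.setdefault(term, []).append((doc_id, tf))
    index = {}
    for term, plist in postings.items():
        flat, prev = [], 0
        for doc_id, tf in plist:
            flat.extend((doc_id - prev, tf))   # store gaps, not raw doc ids
            prev = doc_id
        index[term] = vbyte_encode(flat)
    return index


def search(index, num_docs, query, k=3):
    accumulators = [0.0] * num_docs            # 1. accumulator initialisation
    for term in query.lower().split():
        blob = index.get(term)                 # 2. stand-in for disc I/O
        if blob is None:
            continue
        flat = vbyte_decode(blob)              # 3. decompression
        df = len(flat) // 2
        idf = math.log(1 + num_docs / df)      # assumed TF-IDF weighting
        doc_id = 0
        for i in range(0, len(flat), 2):
            doc_id += flat[i]                  # undo the gap encoding
            accumulators[doc_id] += flat[i + 1] * idf   # 4. ranking
    candidates = ((s, d) for d, s in enumerate(accumulators) if s > 0)
    top = heapq.nlargest(k, candidates)        # 5. partial sort for top-k
    return [(d, s) for s, d in top]


docs = ["the quick brown fox", "the lazy dog", "quick quick fox"]
index = build_index(docs)
results = search(index, len(docs), "quick fox", k=2)
```

In this sketch the final stage uses a heap-based top-k selection rather than a full sort of the accumulators, which is one common way to reduce sorting cost when only a few results are returned.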
The objectives of this PhD research were to identify the bottlenecks among the different components of an IR system, provide possible solutions to minimise or eliminate these bottlenecks, and combine the optimised solutions to form a solid baseline for future IR research.