Abstract
In memory-intensive algorithms, the problem size is often so large that it cannot fit into the cache of a CPU, and this may result in an excessive number of cache misses, a bottleneck that can easily make seemingly embarrassingly-parallel algorithms such as feature-matching unscalable in many-core systems. To solve this bottleneck, this paper proposes a general Divide-and-Merge methodology, which divides the feature space into several small sub-spaces, so that the shared resources in each sub-space can be satisfied without causing bottlenecks. Experimental results have shown that the Divide-and-Merge methodology reduces the L3 cache misses and time spent on memory-allocation-related system calls, resulting in a 211% performance improvement on an AMD 64-core CPU machine, and 57% and 16% performance improvements on AMD and Intel 16-core machines respectively. Performance results on a modern GPU also show that a well-tuned algorithm with time complexity of O(F-2) is able to defeat a state-of-the-art O(F-1.5) algorithm by 188% for our real-world dataset, which again highlights the huge performance impact of the memory system.