Bin-Hash Indexing: A Parallel GPU-Based Method For Fast Query Processing
Luke Gosink, Wes Bethel, John D. Owens, Ken Joy, and Kesheng Wu
This work presents a new indexing data structure for query processing, called the Bin-Hash index, that effectively utilizes the parallel processing power of the Graphics Processing Unit (GPU). Our approach concentrates on reducing both the bandwidth and the memory required to evaluate a query. We achieve this goal by integrating two key strategies: we use encoded data tables to help overcome the GPU's limited memory, and a technique known as perfect spatial hashing to accelerate the retrieval of the raw data needed for candidate checks. We support these candidate checks with a flexible dual cache (one on the GPU and one on the CPU) that uses independent replacement policies. In this design, the CPU serves as a host to the GPU, supplying only the raw data needed for candidate checks; all query evaluations are performed on the GPU by executing kernels written in the data-parallel programming language CUDA.
In our timing measurements, our new query processing method can be an order of magnitude faster than current state-of-the-art indexing technologies such as the compressed bitmap index.
Growth in dataset size significantly outpaces the growth of CPU speed and disk throughput. As a result, the efficiency of existing query processing techniques is greatly challenged. The need for accelerated performance forces many researchers to seek alternative techniques for query evaluation. In this paper, we propose a query processing technique that utilizes the impressive processing power available in commodity Graphics Processing Units (GPUs).
GPUs possess two attractive features motivating their use for query processing. First, they are widely available; typical desktop PCs and workstations have access to a GPU. Second, a typical GPU possesses over an order of magnitude more processing power than a typical CPU. Importantly, technology trends support accelerated growth in computational power for future GPUs, indicating that GPUs will become an even more powerful and important resource for general-purpose tasks.
There are, however, significant challenges to effectively employing the GPU for query processing. First, relatively small memory capacities on current GPUs limit how much data is available to query at any one time (current memory for NVIDIA's 8800 GTX is ~768 MB). Out-of-core strategies designed to overcome such memory limitations must deal with a second challenge: the bandwidth bottleneck that exists between the CPU and GPU. This bottleneck can severely limit the performance of GPU-based tasks if the arithmetic intensity of the task is too low.
In addition to these challenges, full GPU utilization for query processing requires an indexing data structure that supports massively parallel operations. Earlier GPU-based query methods adopt a strategy of examining the vertical projection of relevant columns to answer queries. Such vertical "projection scans" are inherently parallel (because each row can be examined independently), but also require access to large amounts of raw data.
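As a concrete illustration, a projection scan over two constrained columns might look like the following sketch (plain Python standing in for a data-parallel kernel; all column names and predicates are hypothetical):

```python
# Hypothetical sketch of a vertical projection scan: every row of each
# relevant column is tested independently, so the loop parallelizes
# trivially (one GPU thread per row), but all raw data must be touched.
def projection_scan(columns, predicates):
    """columns: dict of column name -> list of raw values (equal lengths);
    predicates: dict of column name -> boolean test on a single value."""
    num_rows = len(next(iter(columns.values())))
    hits = []
    for row in range(num_rows):  # independent per-row work
        if all(pred(columns[name][row]) for name, pred in predicates.items()):
            hits.append(row)
    return hits

rows = projection_scan(
    {"pressure": [0.2, 0.9, 0.5, 0.8], "temp": [310, 290, 305, 300]},
    {"pressure": lambda v: v > 0.4, "temp": lambda v: v >= 300},
)
# rows -> [2, 3]
```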
In this work, we propose a new indexing strategy that significantly reduces GPU-based memory requirements while providing the same level of parallelism offered by previous GPU-based work. This new indexing strategy, called the Bin-Hash index, combines the use of in-core (i.e., on the GPU) and out-of-core (i.e., on the CPU) data strategies. The in-core strategies focus on reducing the memory footprint of data that must be stored on the GPU, while the out-of-core strategies help to accelerate access to the raw data needed for candidate checks. The combined intent of these two strategies is to trade slightly more computational work on the GPU for better utilization of GPU memory and bandwidth. This trade takes advantage of long-term technology trends, which favor the growth of GPU processing capability over the growth of GPU memory size and bandwidth.
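The core idea can be sketched in a few lines of Python (a CPU stand-in for the GPU kernels; the bin boundaries and function names are illustrative, not the paper's actual implementation):

```python
import bisect

def encode(values, bounds):
    """Replace each raw value by its bin index; small bin ids take far
    less GPU memory than the full-precision raw values."""
    return [bisect.bisect_right(bounds, v) for v in values]

def bin_hash_query(bin_ids, bounds, lo, hi, fetch_raw):
    """Evaluate lo <= value < hi from bin ids alone where possible;
    fetch_raw(row) pulls raw data from the CPU host only for rows that
    fall in boundary bins (the candidate check)."""
    lo_bin = bisect.bisect_right(bounds, lo)
    hi_bin = bisect.bisect_right(bounds, hi)
    hits = []
    for row, b in enumerate(bin_ids):
        if lo_bin < b < hi_bin:            # bin lies fully inside the range
            hits.append(row)
        elif b == lo_bin or b == hi_bin:   # boundary bin: needs raw data
            if lo <= fetch_raw(row) < hi:
                hits.append(row)
    return hits

raw = [5.0, 12.0, 15.0, 25.0, 35.0]
bounds = [10.0, 20.0, 30.0]                # three boundaries -> four bins
ids = encode(raw, bounds)                  # ids -> [0, 1, 1, 2, 3]
hits = bin_hash_query(ids, bounds, 12.0, 27.0, lambda row: raw[row])
# hits -> [1, 2, 3]
```

Only rows in the two boundary bins ever require a raw-data fetch, which is why the Bin-Hash scheme needs far less bandwidth than a full projection scan.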
The main contributions of this work include:
- We propose an indexing data structure for answering queries on the GPU that combines the use of in-core (i.e., on the GPU) encoded data with out-of-core (i.e., on the CPU) raw data accesses to support candidate checks; earlier work on processing queries with the GPU operates exclusively on raw data.
- In our performance tests, an implementation of our approach is shown to outperform a fast compressed bitmap indexing strategy (FastBit); earlier works lack such a direct comparison.
- We extend perfect spatial hashing to make it efficient for use on GPUs. This spatial hashing data structure is essential for our Bin-Hash index to minimize the amount of data needed on the GPU, and to utilize the GPU's parallel processing capability.
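For intuition, a minimal one-dimensional analogue of perfect spatial hashing (after Lefebvre and Hoppe) can be sketched as follows; the table sizes and greedy construction here are illustrative simplifications, not the paper's GPU implementation:

```python
# Minimal 1-D analogue of perfect spatial hashing: entries live in a hash
# table H of size m; collisions are resolved at build time by a small
# offset table Phi of size r, so every lookup is a constant two-fetch
# operation -- well suited to data-parallel GPU threads.
def build_perfect_hash(keys, values, m, r):
    H = [None] * m
    Phi = [0] * r
    groups = {}
    for k in keys:                       # group keys by offset-table slot
        groups.setdefault(k % r, []).append(k)
    # place the largest groups first (greedy construction)
    for slot, ks in sorted(groups.items(), key=lambda g: -len(g[1])):
        for offset in range(m):          # find an offset with no collisions
            targets = [(k + offset) % m for k in ks]
            if len(set(targets)) == len(ks) and all(H[t] is None for t in targets):
                Phi[slot] = offset
                for k, t in zip(ks, targets):
                    H[t] = (k, values[k])
                break
        else:
            raise ValueError("no perfect hash found; enlarge m or r")
    return H, Phi

def lookup(H, Phi, m, r, k):
    """Two dependent fetches, no collision handling at query time."""
    entry = H[(k + Phi[k % r]) % m]
    return entry[1] if entry is not None and entry[0] == k else None

H, Phi = build_perfect_hash([3, 11, 6], {3: "a", 11: "b", 6: "c"}, m=8, r=3)
# lookup(H, Phi, 8, 3, 11) -> "b"; lookup(H, Phi, 8, 3, 5) -> None
```

Because every lookup touches exactly one offset-table entry and one hash-table entry, thousands of GPU threads can probe the structure concurrently with no divergence from collision chains.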
We present test results from three separate query processing strategies: the projection scan, FastBit (a compressed bitmap index technology used in query-driven visualization, publicly available for download), and our Bin-Hash approach. Each strategy is used to query two different datasets, with each dataset's test suite targeting a specific aspect of query processing performance. Additionally, we examine and compare the storage overhead of each method: compressed data (FastBit), encoded data (Bin-Hash), and raw data (projection scan).
Adding dimensions to a query (i.e., variables or attributes that are indexed and searchable as columns in a database table) can result in exponential growth in processing requirements. This "curse of dimensionality", noted especially for algorithms utilizing tree-based indexing methods, is problematic given that most scientific datasets of interest tend to be high-dimensional.
In the image below, we see the results of the three different querying strategies. This suite of tests explores the query "complexity" issue with the Hurricane Isabel WRF model. The dataset consists of 25 million spatial cells; each cell has 48 timesteps, and at each timestep the records of any of 13 variables may be queried. We begin with a simple query constraining one variable in this dataset. In each subsequent test, an additional variable is added until a total of 8 variables are used in the query. Each new variable adds a further 25 million records that must be processed. To ensure that each variable contributes to the query's results, i.e. adds records to the query's "hit list", each variable is carefully constrained to select a disjoint set equaling 12% of the total 25 million spatial cells.
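Because each variable's constraint selects a disjoint set of cells, the text suggests the per-variable hits are combined by union, with each pass contributing new records. A rough sketch of that setup (hypothetical names, plain Python in place of GPU kernels):

```python
# Hypothetical sketch of the multi-variable test: each variable's range
# constraint is evaluated in its own data-parallel pass, and every pass
# contributes records to the query's overall hit list.
def multi_variable_query(columns, predicates):
    hits = set()
    for name, pred in predicates.items():   # one pass per constrained variable
        hits |= {row for row, v in enumerate(columns[name]) if pred(v)}
    return sorted(hits)

hit_list = multi_variable_query(
    {"wind": [10.0, 42.0, 17.0], "rain": [0.0, 0.1, 2.5]},
    {"wind": lambda v: v > 40.0, "rain": lambda v: v > 2.0},
)
# hit_list -> [1, 2]
```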
Significantly, the disparity between the differing cache levels is notably small for the Bin-Hash method. At 1 variable, the difference in timings between "no cache" and "full GPU cache" is a mere 3 milliseconds. At 8 variables, when the workload has grown from 25 million to 200 million records, this disparity has only doubled to 6 milliseconds. This indicates that it is the query processing work performed on the GPU that drives the timing results, not bandwidth limitations. If bandwidth were a factor in these timings, we would expect the disparity between the "no cache" and "full GPU cache" timings to grow much more sharply as the amount of data sent from the CPU to the GPU increased.
Performance is an important factor in examining these various strategies; also important, however, is storage overhead. In the previous two tests, the distribution of data in the respective datasets varies significantly. The Hurricane dataset contains data that is notably "random"; such data is difficult to compress. Comparatively, the Hydrogen dataset used in the second test suite (these results are in the paper) contains data distributions that are considerably more continuous, or smooth, and such data is often much easier to compress.
The table below shows the total overhead required for each dataset under the three storage approaches used in the test suites: raw data (for the projection scan), compressed data (for FastBit), and encoded data (for the Bin-Hash schemes).
Here we observe, in the three left columns, the penalty incurred by compressing and encoding near-random data: both the compression utilized by FastBit and the encoding performed by the Bin-Hash method approach or exceed double the size of the raw data. In comparison, the Hydrogen dataset, displayed in the three right columns, shows the more "typical" storage gains expected from FastBit's compression scheme.
Luke J. Gosink, Kesheng Wu, E. Wes Bethel, John D. Owens, and Kenneth I. Joy, "Bin-Hash Indexing: A Parallel Method For Fast Query Processing," Technical Report LBNL-729E, Lawrence Berkeley National Laboratory, 2008.