Title Register Packing for Cyclic Reduction: A Case Study (In Proceedings)
in Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units
Author(s) Andrew Davidson, John D. Owens
Keyword(s) GPU Computing, Tridiagonal Solvers, Cyclic Reduction
Year 2011
Location Newport Beach, CA
Date March 2011
Abstract We generalize a method for avoiding GPU shared-memory communication when dealing with a downsweep pattern. We apply this generalization to Cyclic Reduction (CR), a tridiagonal solver that exhibits this pattern. Previously, Cyclic Reduction performed poorly compared to other tridiagonal solvers on the GPU because of shared-memory bandwidth bottlenecks and poor step-efficiency. We address this problem by applying our methodology for reducing shared-memory communication in downsweep patterns. Our re-mapping also allows Cyclic Reduction to solve larger systems directly in a virtual block. By using our generalized mapping, we improve Cyclic Reduction’s performance on a GPU by a factor of 3–4.5x over the original CR implementation, making it 1.5–3x faster than other GPU tridiagonal solvers.
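
For readers unfamiliar with the solver named in the abstract, the following is a minimal serial sketch of textbook cyclic reduction in Python. It is not the register-packed GPU mapping the paper describes, and none of it is taken from the authors' code. The function name `cyclic_reduction` and the array layout (sub-diagonal `a`, diagonal `b`, super-diagonal `c`, right-hand side `d`) are assumptions made for illustration. The sketch does no pivoting, so it assumes the pivots stay nonzero, as they do for a diagonally dominant system.

```python
def cyclic_reduction(a, b, c, d):
    """Solve a tridiagonal system with textbook cyclic reduction (illustrative sketch).

    Row i reads: a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i].
    a[0] and c[-1] are ignored. No pivoting is performed.
    """
    n = len(b)
    if n == 1:
        return [d[0] / b[0]]

    # Forward reduction: eliminate the even-indexed unknowns, leaving a
    # half-size tridiagonal system in the odd-indexed unknowns.
    ra, rb, rc, rd = [], [], [], []
    for i in range(1, n, 2):
        k1 = a[i] / b[i - 1]                      # multiplier for the row above
        na = -a[i - 1] * k1 if i - 1 > 0 else 0.0
        nb = b[i] - c[i - 1] * k1
        nd = d[i] - d[i - 1] * k1
        nc = 0.0
        if i + 1 < n:
            k2 = c[i] / b[i + 1]                  # multiplier for the row below
            nb -= a[i + 1] * k2
            nd -= d[i + 1] * k2
            nc = -c[i + 1] * k2 if i + 2 < n else 0.0
        ra.append(na)
        rb.append(nb)
        rc.append(nc)
        rd.append(nd)

    # Recurse on the reduced system; its solution gives the odd-indexed unknowns.
    x = [0.0] * n
    x[1::2] = cyclic_reduction(ra, rb, rc, rd)

    # Back substitution: each even-indexed unknown follows from its own row
    # once its odd-indexed neighbours are known.
    for i in range(0, n, 2):
        rhs = d[i]
        if i > 0:
            rhs -= a[i] * x[i - 1]
        if i + 1 < n:
            rhs -= c[i] * x[i + 1]
        x[i] = rhs / b[i]
    return x
```

Each level of the recursion halves the number of unknowns, so the reduction takes about log2(n) levels, followed by the same number of back-substitution levels as the recursion unwinds. In a GPU implementation the rows within one level can be processed in parallel; how those rows are mapped to registers and shared memory is the subject of the paper and is not modeled here.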