|Title||Register Packing for Cyclic Reduction: A Case Study (In Proceedings)|
|in||Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units|
|Author(s)||Andrew Davidson, John D. Owens|
|Keyword(s)||GPU Computing, Tridiagonal Solvers, Cyclic Reduction|
|Location||Newport Beach, CA|
We generalize a method for avoiding GPU shared-memory communication when dealing with a downsweep pattern. We apply this generalization to Cyclic Reduction (CR), a tridiagonal solver that exhibits this pattern. Previously, Cyclic Reduction performed poorly compared to other tridiagonal solvers on the GPU because of shared-memory bandwidth bottlenecks and poor step-efficiency. We address this problem by applying our methodology for reducing shared-memory communication in the downsweep. Our re-mapping also allows Cyclic Reduction to solve larger systems directly in a virtual block. By using our generalized mapping, we improve Cyclic Reduction’s performance on a GPU by a factor of 3–4.5x over the original CR implementation, making it 1.5–3x faster than other GPU tridiagonal solvers.
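
For readers unfamiliar with the algorithm, the following is a minimal sequential sketch of textbook Cyclic Reduction in Python. It is not the register-packed GPU implementation described in the paper; it only illustrates the two phases, a forward reduction followed by the downsweep (back-substitution) that the abstract refers to. The function name `cyclic_reduction` and its argument layout (sub-diagonal `a`, diagonal `b`, super-diagonal `c`, right-hand side `d`, with `a[0]` and `c[-1]` equal to zero) are assumptions made for this illustration.

```python
def cyclic_reduction(a, b, c, d):
    """Solve a tridiagonal system with sequential cyclic reduction.

    Row i reads: a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i],
    with a[0] == 0 and c[n-1] == 0. Hypothetical helper for illustration.
    """
    n = len(b)
    # Work on float copies so the caller's lists are not modified.
    a, b, c, d = (list(map(float, v)) for v in (a, b, c, d))
    x = [0.0] * n

    # Forward reduction: at each level, every second active equation
    # eliminates its two neighbours, halving the number of active rows.
    stride = 1
    while 2 * stride <= n:
        for i in range(2 * stride - 1, n, 2 * stride):
            lo, hi = i - stride, i + stride
            k1 = a[i] / b[lo]
            if hi < n:
                k2 = c[i] / b[hi]
                a_hi, c_hi, d_hi = a[hi], c[hi], d[hi]
            else:
                # No right neighbour at this level.
                k2 = a_hi = c_hi = d_hi = 0.0
            b[i] = b[i] - c[lo] * k1 - a_hi * k2
            d[i] = d[i] - d[lo] * k1 - d_hi * k2
            a[i] = -a[lo] * k1
            c[i] = -c_hi * k2
        stride *= 2

    # Downsweep: solve the single remaining equation, then fill in the
    # unknowns skipped at each level using already-known neighbours.
    while stride >= 1:
        for i in range(stride - 1, n, 2 * stride):
            left = x[i - stride] if i - stride >= 0 else 0.0
            right = x[i + stride] if i + stride < n else 0.0
            x[i] = (d[i] - a[i] * left - c[i] * right) / b[i]
        stride //= 2
    return x
```

As a usage example, `cyclic_reduction([0, -1, -1], [2, 2, 2], [-1, -1, 0], [0, 0, 4])` returns `[1.0, 2.0, 3.0]`. In the downsweep loop, each level reads values produced by the previous level; on a GPU, that exchange is the communication the paper's mapping aims to reduce.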