A Novel Hilbert Curve for Cache-locality Preserving Loops

Author(s): C. Bohm, M. Perdacher, C. Plant
Abstract: Modern microprocessors offer a rich memory hierarchy including various levels of cache and registers. Some of these memories (like main memory, L3 cache) are big but slow and shared among all cores. Others (registers, L1 cache) are fast and exclusively assigned to a single core but small. Only if data accesses have high locality can we avoid excessive data transfers between the levels of the memory hierarchy. In this paper we consider fundamental algorithms like matrix multiplication, K-Means, Cholesky decomposition, and the algorithm by Floyd and Warshall, which typically operate in two or three nested loops. We propose to traverse these loops, whenever possible, not in the canonical order but in an order defined by a space-filling curve. This traversal order dramatically improves data locality over a wide range of granularities, which makes it possible to efficiently support not only a cache of a single, known size (cache conscious) but also a hierarchy of various caches where the effective size available to our algorithms may even be unknown (cache oblivious). We propose a new space-filling curve called Fast Unrestricted (FUR) Hilbert with the following advantages: (1) We overcome the usual limitation to square-like grid sizes where the side length is a power of 2 or 3. Instead, our approach allows arbitrary loop boundaries for all variables. (2) FUR-Hilbert is non-recursive with a guaranteed constant worst-case time complexity per loop iteration (in contrast to O(log(gridsize)) for previous methods). (3) Our non-recursive approach makes the application of our cache-oblivious loops in any host algorithm as easy as conventional loops and facilitates automatic optimization by the compiler. (4) We demonstrate that crucial algorithms like Cholesky decomposition as well as the algorithm by Floyd and Warshall can be efficiently supported. (5) Extensive experiments on runtime efficiency, cache usage and energy consumption demonstrate the benefit of our approach.
We believe that future compilers could translate nested loops into cache-oblivious loops either fully automatically or through a user-guided analysis of the data dependencies.
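To illustrate the general idea of replacing the canonical loop order with a space-filling-curve order, here is a minimal Python sketch. It is not the FUR-Hilbert method from the paper: it uses the classic Hilbert index-to-coordinate conversion, which requires a square grid whose side length is a power of 2 and costs O(log(gridsize)) per iteration, i.e. exactly the kind of earlier approach the abstract contrasts itself with. The function names (`hilbert_d2xy`, `hilbert_order`, `matmul_hilbert`) and the choice to apply the curve to the (i, j) loops of a matrix multiplication are assumptions made for illustration only.

```python
def hilbert_d2xy(n, d):
    """Map position d on the classic Hilbert curve to (x, y) on an n x n grid.

    Assumes n is a power of 2 and 0 <= d < n * n.
    """
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            # Rotate/flip the sub-square so consecutive cells stay adjacent.
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def hilbert_order(n):
    """List all cells of an n x n grid in Hilbert-curve order."""
    return [hilbert_d2xy(n, d) for d in range(n * n)]


def matmul_hilbert(a, b):
    """Multiply two n x n matrices, visiting output cells in Hilbert order.

    The result equals that of the canonical i/j loop nest; only the order
    in which the (i, j) pairs are processed differs.
    """
    n = len(a)
    c = [[0] * n for _ in range(n)]
    for i, j in hilbert_order(n):
        row = a[i]
        c[i][j] = sum(row[k] * b[k][j] for k in range(n))
    return c
```

Because consecutive (i, j) pairs along the curve always differ by one step in exactly one coordinate, each iteration reuses either the row of `a` or the column of `b` touched by the previous iteration, which is the locality effect the abstract describes. Pure Python will not show a measurable cache benefit; the sketch only demonstrates the traversal order.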
Organisation(s): Research Group Data Mining and Machine Learning, Research Network Data Science
External organisation(s): Ludwig-Maximilians-Universität München
Journal: IEEE Transactions on Big Data
Volume: 7
Pages: 241-254
No. of pages: 14
DOI: https://doi.org/10.1109/TBDATA.2018.2830378
Publication date: 2018
Peer reviewed: Yes
Austrian Fields of Science 2012: 102033 Data mining, 102023 Supercomputing
Portal url: https://ucrisportal.univie.ac.at/en/publications/33cc1c17-8466-44bf-9d8b-c92b3aa03961