C/C++ tip: How to loop through multi-dimensional arrays quickly

Topics: C/C++

2D images, 3D volumes, and other multi-dimensional data frequently require loops that sweep through an array to compute statistics, normalize values, or apply transfer functions. Maintaining a multi-dimensional array within a single linear array is a common performance technique. Popular "hand optimizations" fiddle with array indexing and pointer math to improve performance, but how well do they work? This article benchmarks nine common multi-dimensional array loop and indexing methods across four common compilers to find the fastest way to sweep through a multi-dimensional array.

How to create a multi-dimensional array

A textbook example of a 3D array creates an array of arrays of arrays of values, then indexes into it using an (i,j,k) tuple:

value = data[i][j][k];

This works, but for large data the pointers in the array of arrays of arrays waste space and the malloc( ) calls are expensive. The component arrays can be scattered all over memory, which makes a start-to-finish sweep through the array jump hither and yon. This defeats modern processor prefetchers that look for regular memory access patterns to guide prefetching data from memory to fast caches in the background. Without prefetcher help, the array of arrays of arrays textbook approach causes lots of cache misses, making it slow.
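As a concrete sketch (assuming an int element type and plain malloc( ) allocation; this code is not from the article), building the textbook array shows why it is expensive: it makes 1 + width + width*height separate malloc( ) calls, and each one can land anywhere in the heap:

```c
#include <stdlib.h>

/* Sketch: textbook 3D array of arrays of arrays of int values.
 * Performs 1 + width + width*height separate malloc( ) calls;
 * the leaf arrays can be scattered anywhere in memory. */
int*** alloc3d( int width, int height, int depth )
{
	int*** data = (int***)malloc( width * sizeof(int**) );
	for ( int i = 0; i < width; ++i )
	{
		data[i] = (int**)malloc( height * sizeof(int*) );
		for ( int j = 0; j < height; ++j )
			data[i][j] = (int*)malloc( depth * sizeof(int) );
	}
	return data;
}
```

Freeing it requires a mirror-image set of free( ) calls, one per malloc( ), which is equally tedious and expensive.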

High-performance code instead implements a multi-dimensional array as a single linear array with hand-authored array indexing math to keep track of what values are where:

value = data[ i * height * depth + j * depth + k ];

Since the array is a single large chunk of memory, sweeping through it from start-to-finish creates a regular access pattern that processor prefetchers easily recognize, which enables them to load caches in the background. The result is fewer cache misses and much better performance.
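By contrast, the linear-array approach makes a single allocation. A minimal sketch (hypothetical helper names, assuming int elements) might look like:

```c
#include <stdlib.h>

/* Sketch: the same 3D array stored as one contiguous block.
 * A single malloc( ) covers all width*height*depth elements. */
int* alloc3dLinear( int width, int height, int depth )
{
	return (int*)malloc( (long)width * (long)height *
		(long)depth * sizeof(int) );
}

/* Element (i,j,k) lives at a computed linear offset.
 * (long) casts avoid 32-bit overflow on large arrays. */
long index3d( int i, int j, int k, int height, int depth )
{
	return (long)i * (long)height * (long)depth +
		(long)j * (long)depth + (long)k;
}
```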

Beyond using linear arrays, high-performance code often plays with array index and pointer math to "optimize" the code to save multiplies or reduce loop costs. This makes the code harder to read, but does it improve performance?

How to loop through a multi-dimensional array

Each of the methods below sweeps through an array and reads every value. Since we're interested in the cost of the loop and indexing math, the loop body is intentionally simple and just sums the values. We don't care about sum overflow.

All of these methods have been seen in "hand optimized" code, even though some of them may be dubious.

Method 1: Nested loops with multiple arrays

The textbook 3D array uses an array of arrays of arrays of values. Three nested loops sweep through the array.

for ( int i = 0; i < width; ++i )
	for ( int j = 0; j < height; ++j )
		for ( int k = 0; k < depth; ++k )
			sum += data[i][j][k];

Method 2: Nested loops with linear array and indexes with 3 multiplies

When a linear array is used instead, straightforward math computes an array index using three multiplies and two adds:

index = (i * height * depth) + (j * depth) + k

Three nested loops sweep through the array. Long integer casts are needed when calculating the array index to avoid 32-bit overflow on large arrays (technically, because of automatic C/C++ type promotion, fewer casts could be used, but casting everything explicitly is good practice).

for ( int i = 0; i < width; ++i )
	for ( int j = 0; j < height; ++j )
		for ( int k = 0; k < depth; ++k )
			sum += data[ (long)i*(long)height*(long)depth + (long)j*(long)depth + (long)k ];

Method 3: Nested loops with linear array and indexes with 2 multiplies

The three multiplies in array index math can be reduced to two with trivial algebra:

index = (i * height + j) * depth + k

Three nested loops sweep through the array.

for ( int i = 0; i < width; ++i )
	for ( int j = 0; j < height; ++j )
		for ( int k = 0; k < depth; ++k )
			sum += data[ ((long)i*(long)height + (long)j)*(long)depth + (long)k ];

Method 4: Nested loops with linear array and invariant multiplies out of loops

The quantity (i*height*depth) doesn't change within the for loop on i, and the quantity (j*depth) doesn't change within the for loop on j. By introducing a pair of temporary variables to store these quantities, the number of multiplies within the innermost loop is reduced.

const long heightdepth = (long)height * (long)depth;
for ( int i = 0; i < width; ++i )
{
	const long ipart = (long)i * heightdepth;
	for ( int j = 0; j < height; ++j )
	{
		const long ijpart = ipart + (long)j * (long)depth;
		for ( int k = 0; k < depth; ++k )
			sum += data[ijpart + (long)k];
	}
}

Method 5: Nested loops with linear array and invariant multiplies saved in an array

The (i*height*depth) and (j*depth) quantities for all i and j don't change as long as the array size stays the same. By precomputing two arrays of these quantities, all of the multiplies are replaced with array references.

Precompute:

const long heightdepth = (long)height * (long)depth;
long*const imult = (long*)malloc( width * sizeof(long) );
for ( int i = 0; i < width; ++i )
	imult[i] = (long)i * heightdepth;

long*const jmult = (long*)malloc( height * sizeof(long) );
for ( int j = 0; j < height; ++j )
	jmult[j] = (long)j * depth;

Sweep:

for ( int i = 0; i < width; ++i )
{
	const long ipart = imult[i];
	for ( int j = 0; j < height; ++j )
	{
		const long ijpart = ipart + jmult[j];
		for ( int k = 0; k < depth; ++k )
			sum += data[ijpart + (long)k];
	}
}

Method 6: Nested loops with linear array and single incrementing index

With a linear array, all of the values are in consecutive memory locations. A single incrementing array index is sufficient for a start-to-finish sweep. Three nested loops might still be used for code clarity (or as an artifact of "hand optimizations" that didn't notice they were redundant).

long index = 0;
for ( int i = 0; i < width; ++i )
	for ( int j = 0; j < height; ++j )
		for ( int k = 0; k < depth; ++k )
			sum += data[index++];

Method 7: Nested loops with linear array and single incrementing pointer

Instead of a single incrementing array index, a single incrementing pointer can be used for start-to-finish sweeps through a linear array. Three nested loops still remain.

const int* d = data;
for ( int i = 0; i < width; ++i )
	for ( int j = 0; j < height; ++j )
		for ( int k = 0; k < depth; ++k )
			sum += *d++;

Method 8: Single loop with linear array and incrementing index

Instead of three nested loops, a single loop is sufficient to run from index 0 to (width*height*depth).

long index = 0;
const long indexend = (long)width * (long)height * (long)depth;
while ( index != indexend )
	sum += data[index++];

Method 9: Single loop with linear array and incrementing pointer

Instead of an incrementing array index, an incrementing pointer does the same job.

const int* d = data;
const int*const dend = d + (long)width * (long)height * (long)depth;
while ( d != dend )
	sum += *d++;

Test specification

The benchmarks reported below time the same code compiled with four common high-performance computing C/C++ compilers: CLANG/LLVM, GCC, ICC, and PGCC.

Compiler flags used here enable maximum code optimizations and 64-bit code tuned to a 2.6GHz Intel EM64T Xeon E5-2670 "Sandy Bridge" processor. The processor has a 32 Kbyte L1 cache, a 256 Kbyte L2 cache, and a 20 Mbyte L3 cache. The L1 cache supports two simultaneous CPU loads per clock cycle with an 8-byte bus for each load. For a 2.6GHz clock, the theoretical maximum bandwidth from the L1 cache to the CPU is (2.6GHz * 2 loads * 8 bytes) = 41.6 Gbytes/sec.

The plots below show performance in Gbytes/sec (vertical axis) of each method as the array size (horizontal axis) is varied from 1 to 4 Gbytes using powers of 2 array dimensions for a 3D array (e.g. 1x1x1, 2x2x2, 4x4x4, etc.). Similar results are found for other array dimensionality (e.g. 2D, 4D, etc.).
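The article's benchmark harness itself is not shown. A minimal sketch of the per-point measurement, assuming wall-clock timing with clock_gettime( CLOCK_MONOTONIC ) around each sweep (hypothetical helper names), could be:

```c
#include <time.h>

/* Sketch: convert bytes swept and elapsed wall-clock seconds
 * into the Gbytes/sec plotted below. */
double bandwidth_gbytes_per_sec( long nbytes, double seconds )
{
	return (double)nbytes / seconds / 1.0e9;
}

/* Elapsed seconds between two clock_gettime( CLOCK_MONOTONIC, ... )
 * samples taken immediately before and after a sweep. */
double elapsed_seconds( struct timespec t0, struct timespec t1 )
{
	return (double)(t1.tv_sec - t0.tv_sec) +
		(double)(t1.tv_nsec - t0.tv_nsec) * 1.0e-9;
}
```

In a harness like this, each method's sweep runs between the two clock samples, and the computed sum is kept in a volatile variable so the optimizer can't delete the loop being measured.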

Each of the above array methods was tested and is shown on the plots in a different color.

Benchmark results – compiled without optimizations

The plot below shows performance for code compiled with GCC without optimizations. CLANG/LLVM, ICC, and PGCC produce similar results.

Compiler observations:

  • Unoptimized code performance for all methods is extremely poor at about 1.2 Gbytes/sec, compared to the processor's theoretical maximum of 41.6 Gbytes/sec.

Method observations:

  • The hand optimizations for the various methods above do improve performance. The slowest method (light blue) uses the textbook array of arrays of arrays, while the fastest method (gray) uses a linear array and a single incrementing pointer.

But these results are for code without any compiler optimizations. Nobody should ever use such code.

Benchmark results – compiled with optimizations

The plots below show performance for code compiled using CLANG/LLVM, GCC, ICC, and PGCC using maximum optimizations. For each pair of plots, the left plot uses a vertical axis from 0 to 40 Gbytes/sec, while the right plot zooms in to the 0 to 15 Gbyte/sec region.

The mountain-shaped curve comes from a combination of loop overhead and the processor's cache speeds. For small arrays, loop overhead dominates and performance is low. For mid-sized arrays, data lodges in the caches and performance is high. As arrays get larger, they spill over from the L1 to L2 cache, L2 to L3, and L3 to memory, and performance drops. The curves smooth out L1/L2/L3 performance steps because the processor's prefetcher tries to stage data in the L1 cache ahead of need.

Compiler observations:

  • CLANG/LLVM and PGCC do poorly, with most methods running at about 5 and 9 Gbytes/sec, respectively. While that's better than unoptimized code, it's well below the processor's theoretical 41.6 Gbytes/sec capability.
  • GCC and ICC both did better, with most methods running between 13 and 15 Gbytes/sec.

Method observations:

  • The stand-out fastest methods by far used a linear array and a single incrementing array index (black) or pointer (gray). With GCC, these methods hit 22 Gbytes/sec. With ICC, they hit 35 Gbytes/sec, about 84% of the theoretical maximum of 41.6 Gbytes/sec. It's clear that ICC produces outstanding code for these cases.
  • The slowest method is the textbook array of arrays of arrays of values (light blue). You're always better off managing a linear array with your own array index math.
  • Several of the methods performed about the same. Optimizing compilers do a good job of recognizing invariant expressions and pulling them out of inner loops. There's little need to hand optimize to do what a good compiler will do anyway.
  • Some of the methods performed worse than simple nested loops with multiplied array indexes (dark blue). Hand optimizations can create odd looking code that's harder for a compiler to match up to an optimization template. It's often better to resist the temptation to "optimize" code and let the compiler do its job.

Conclusions

  • Code from non-optimizing compilers is dreadful. But you already knew that.
  • CLANG/LLVM did poorly. The other compilers beat it easily.
  • The textbook approach to managing multi-dimensional data in an array of arrays of arrays of values is never the fastest approach. Always use a linear array and array indexing math instead.
  • Compiler optimizers are very good. Clever "optimized" loop and array indexing methods often don't improve performance compared to simpler code once compiler optimizers do their job.
  • When doing a full array sweep, using a single incrementing index (black) or pointer (gray) is substantially faster than any other method. A good compiler, like ICC, can generate code that issues multiple wide loads from the cache on every cycle and achieve performance that approaches the theoretical maximum.

Comments


peak bw

The code is single-threaded. You'll never get more than ~10 GB/s with single-threaded code; this scales with the CPU clock. If you do, check your benchmark. ICC might do auto-parallelization, but check compiler flags first.

Re: Peak bw

As noted in the article, as arrays get large they no longer lodge in the L1/L2/L3 caches and performance degrades to the speed of memory. The measured performance is approximately the peak memory bandwidth of the processor used. Adding more threads adds more competition for that same memory bandwidth, which one thread alone is already saturating. So adding threads will actually reduce performance due to thread conflicts, rather than improve performance.

What all this illustrates is the well-known problem that processor performance has improved faster than memory performance. Today's processors bottleneck on memory bandwidth, and much of the power of their cores is wasted idling while waiting on memory. Adding more memory bandwidth to a processor is desirable but difficult, and it doesn't improve memory latency, which is also a big issue. Newer processors do add more memory channels, and there are new memory designs in the works that increase bandwidth. These will help, but designing your code to be memory-use smart will still be a win on those processors, as well as on current and older ones.


Nadeau software consulting