C/C++ tip: How to copy memory quickly

Topics: C/C++

Code performance always matters, and copying data is a common operation. Simple array copy code uses a loop to copy one value at a time. ISO C provides the memcpy( ) and memmove( ) functions to do the same job, but are they faster, by how much, and under what conditions? This article benchmarks five common copy methods across four common compilers to find the fastest way to copy an array.

How to copy an array

Let source data be an array of 4-byte integers to copy into a destination array. Both arrays are on the heap.
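For reference, the heap setup can be sketched as below. The make_array( ) helper and its fill pattern are illustrative only, not part of the benchmark itself:

```c
#include <stdlib.h>

/* Allocate a heap array of n 4-byte integers and fill it with a simple
 * pattern.  Any fill values work for a copy benchmark; this is just a
 * convenient way to verify a copy afterwards. */
static int *make_array( const size_t n )
{
	int *const a = (int*)malloc( n * sizeof(int) );
	if ( a != NULL )
		for ( size_t i = 0; i < n; ++i )
			a[i] = (int)i;
	return a;
}
```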

Method 1: Loop with array indexes

Use a loop with incrementing array indexes to copy from the source to the destination.

for ( size_t i = 0; i < n; ++i )
	dst[i] = src[i];

Method 2: Loop with pointers

Use a loop with incrementing parallel source and destination pointers.

const int* s = src;
int* d = dst;
const int*const dend = dst + n;
while ( d != dend )
	*d++ = *s++;

Method 3: Loop with get/set function calls

Loop through the array and call get( ) and set( ) functions, such as accessor methods on a C++ data class or overloaded C++ array operators.

for ( size_t i = 0; i < n; ++i )
	set( dst, i, get( src, i ) );

The get( ) and set( ) functions used here are defined in the same file:

int  get( const int*const src, const size_t index ) { return src[index]; }
void set( int*const dst, const size_t index, const int value ) { dst[index] = value; }

Method 4: Call memcpy( )

Call the ISO C memcpy( ) function.

memcpy( dst, src, n * sizeof(int) );

Method 5: Call memmove( )

Call the ISO C memmove( ) function (same as memcpy( ) but checks for source and destination overlap).

memmove( dst, src, n * sizeof(int) );
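The overlap check is the one behavioral difference: with overlapping source and destination ranges, memcpy( ) is undefined behavior, while memmove( ) copies as if through a temporary buffer. A minimal sketch (the shift_up( ) helper is hypothetical):

```c
#include <string.h>

/* Shift the first n-1 elements of a up by one slot.  The regions a and
 * a+1 overlap, so memmove( ) is required here; memcpy( ) would be
 * undefined behavior. */
static void shift_up( int *const a, const size_t n )
{
	memmove( a + 1, a, (n - 1) * sizeof(int) );
}
```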

Untested methods

There are many functions very similar to memcpy( ) and memmove( ), usually implemented using them. Here are a few:

  • Memory copy with void pointers:
    • POSIX's deprecated bcopy( ) is identical to memmove( ), but with swapped source and destination arguments.
    • GNU's mempcpy( ) returns a pointer to the byte after the last byte written, but is otherwise identical to memcpy( ).
    • POSIX's memccpy( ) and Microsoft's _memccpy( ) stop the copy on a selected value, but are otherwise identical to memcpy( ).
    • Gnome's g_memmove( ) is a macro that usually maps to memmove( ).

    • Microsoft's CopyMemory( ) and MoveMemory( ) are identical to memcpy( ) and memmove( ).
    • Microsoft's memcpy_s( ) and memmove_s( ) add a destination array length and do array bounds checking, but are otherwise identical to memcpy( ) and memmove( ).
  • Memory copy with wide character pointers:
    • ISO C's wmemcpy( ) and wmemmove( ) and GNU's wmempcpy( ) use wide character pointer arguments, but are otherwise identical to memcpy( ), memmove( ), and mempcpy( ).
    • Microsoft's wmemcpy_s( ), and wmemmove_s( ) add a destination array length and do array bounds checking, but are otherwise identical to wmemcpy( ) and wmemmove( ).
  • String copy with character pointers:
    • ISO C's strncpy( ) and GNU's stpncpy( ) use character pointer arguments, stop the copy at a '\0' character, and zero-fill the rest of the destination, but are otherwise similar to memcpy( ) and mempcpy( ).
    • Microsoft's strncpy_s( ) adds a destination array length and does array bounds checking, but is otherwise identical to strncpy( ).
  • String copy with wide character pointers:
    • ISO C's wcsncpy( ) and GNU's wcpncpy( ) use wide character pointer arguments, stop the copy at a '\0' wide character, and zero-fill the rest of the destination, but are otherwise similar to strncpy( ) and stpncpy( ).
    • Microsoft's wcsncpy_s( ) adds a destination array length and does array bounds checking, but is otherwise identical to wcsncpy( ).
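One difference above is worth a demonstration: strncpy( ) not only stops at the first '\0', it also zero-fills the rest of the destination, which memcpy( ) never does. A small check (the tail_is_zeroed( ) helper is hypothetical):

```c
#include <string.h>

/* Return 1 if, after strncpy( )ing src into an n-byte buffer, every
 * byte past the copied string has been zero-filled; 0 otherwise. */
static int tail_is_zeroed( const char *const src, const size_t n )
{
	char buf[64];
	if ( n > sizeof(buf) )
		return 0;                  /* keep the demo in bounds */
	memset( buf, 'x', sizeof(buf) );   /* poison the buffer first */
	strncpy( buf, src, n );
	for ( size_t i = strlen( src ) + 1; i < n; ++i )
		if ( buf[i] != '\0' )
			return 0;              /* a poison byte survived */
	return 1;
}
```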

Test specification

The benchmarks reported below time the same code compiled with four common C/C++ compilers: CLANG/LLVM, GCC, ICC, and PGCC.

Compiler flags used here enable maximum code optimizations and 64-bit code tuned to a 2.6GHz Intel EMT64T Xeon E5-2670 "Sandy Bridge" processor. The processor has a 32 Kbyte L1 cache, a 256 Kbyte L2 cache, and a 20 Mbyte L3 cache. The L1 cache supports two simultaneous CPU loads per clock cycle and an 8-byte bus for each load. For a 2.6GHz clock, the theoretical maximum bandwidth from the L1 cache to the CPU is (2.6GHz * 2 * 8) = 41.6 Gbytes/sec.
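The peak-bandwidth arithmetic is just clock rate times loads per cycle times bytes per load, which can be expressed directly (the peak_bandwidth( ) helper is illustrative):

```c
/* Theoretical peak L1-to-CPU read bandwidth in bytes/sec:
 * clock (GHz) * loads per cycle * bytes per load.
 * For the test processor: 2.6 * 2 * 8 = 41.6 Gbytes/sec. */
static double peak_bandwidth( const double ghz,
                              const int loads_per_cycle,
                              const int bytes_per_load )
{
	return ghz * 1.0e9 * loads_per_cycle * bytes_per_load;
}
```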

All tests were run on Linux using one core, and all tests were run multiple times and their results averaged.

The plots below show performance in Gbytes/sec (vertical axis) of each method as the array size (horizontal axis) is varied from 4 bytes to 64 Gbytes in powers of 2.

Benchmark results – compiled without optimizations

The plot below shows performance for code compiled with GCC without optimizations. CLANG/LLVM, ICC, and PGCC produce similar results. On the left the vertical axis runs from 0 to 40 Gbytes/sec. On the right, the plot has been zoomed in to the 0 to 2 Gbytes/sec range.

Compiler observations:

  • Unoptimized code performance for the looping methods is extremely poor at 1.5 Gbytes/sec or less, compared to the processor's theoretical maximum of 41.6 Gbytes/sec.

Method observations:

  • The loop methods are certainly faster than the get/set functions, but they're still slow compared to memcpy( ) and memmove( ), which are optimized C library functions.

However, these results are for code without any compiler optimizations. Nobody should ever release unoptimized code.

Benchmark results – compiled with optimizations

The plots below show performance for code compiled using CLANG/LLVM, GCC, ICC, and PGCC using the maximum optimizations available on the test platform.

The mountain-shaped curve comes from a combination of loop overhead and the processor's cache speeds. For very small arrays, loop overhead dominates and performance is low. For mid-sized arrays, fast copying runs near the L1 cache speed and performance is high. As arrays get larger, they spill over from the L1 to L2 cache, L2 to L3, and L3 to memory, and performance drops. The curves smooth out L1/L2/L3 performance steps because the processor's prefetcher tries to stage data in the L1 cache ahead of need.

Compiler observations:

  • PGCC fails to optimize the loop with array indexes (blue) or the get/set functions (yellow). The array index loop is a common code construct that should have been recognized and optimized, and the get/set functions should have been inlined. Additionally, while the pointer loop (green) was optimized, its performance drops off suddenly at the 256 Kbyte array test, which is odd.
  • CLANG/LLVM fails to optimize the loop with pointers (green). This is a very common code construct and should have been optimized.

Method observations:

  • ICC is the undisputed star of the show, with spectacular performance for everything except memmove( ) (purple), and even that is about as fast as the fastest results from the other compilers. The ICC code hits an amazing 42.5 Gbytes/sec, at or above the 41.6 Gbytes/sec theoretical maximum for the base clock (Turbo Boost raises the clock, and with it the ceiling). The get/set functions are clearly inlined, and the looping methods have been optimized to run as fast as ICC's custom fast SSE memcpy( ).

Conclusions

  • Code from non-optimizing compilers is terrible and should never be released.
  • PGCC and CLANG/LLVM both missed optimizing common code constructs, producing unexpectedly bad performance.
  • memcpy( ) is always the fastest method to copy data. A great compiler, like ICC, can optimize loops to do as well.



Nadeau software consulting