Test Kernel
The kernel I am going to benchmark is a variation of the general matrix-vector multiplication (GEMV) found in BLAS level 2. The simplified kernel takes the form

$$y_i \leftarrow y_i + A_{ij} x_j,$$

where summation over index $j$ is assumed. The values $x_j$ and $y_i$ are elements of vectors $x$ and $y$, respectively, and $A_{ij}$ are elements of matrix $A$ of size $n \times n$ for some integer number $n$. For typical problems $n$ is large.
Operational Intensity
The operational intensity[1] of a compute kernel is the ratio between the number of arithmetic operations and the total number of bytes transferred due to memory accesses. To simplify the discussion, I am assuming the destination of all memory transactions is DRAM. For the simplified GEMV kernel above, there are $n^2$ multiplications and $n^2 + n$ additions for a total of $2n^2 + n$ flop. The memory accesses to DRAM are trickier to estimate because of the cache memory hierarchies found in all modern CPU architectures. An upper bound can be estimated by assuming no cache, where every memory access results in a capacity miss.[2] In that case there are $n^2$ reads for $x_j$, $n^2$ reads for the matrix elements $A_{ij}$, $n$ reads for $y_i$ due to the $n$ inner products and another $n$ writes for $y_i$, resulting in a total of $2n^2 + 2n$ memory accesses. For a lower bound estimate we assume an infinite cache where only compulsory misses are relevant.[2] Here every required read and write is performed exactly once, resulting in a total of $n^2 + 3n$ memory accesses. For both estimates the leading-order term of memory accesses is $n^2$, indicating that the GEMV kernel is memory bound. Indeed, all BLAS level 1 and level 2 kernels are memory bound.
Since $n$ is assumed large, it is reasonable to pick the upper bound memory access estimate, where the cache capacity is assumed small compared to the problem size. Assuming 32-bit single precision floating point data, the operational intensity for this test kernel computes as

$$I = \frac{(2n^2 + n)\,\text{flop}}{4\,\text{B} \times (2n^2 + 2n)} = \frac{2n + 1}{8(n + 1)}\,\frac{\text{flop}}{\text{B}} \approx 0.25\,\frac{\text{flop}}{\text{B}}.$$

The approximate value indicated corresponds to the expected value in the limit when $n$ is large.
Naive Implementation
A straightforward implementation of the GEMV kernel discussed in the previous section could look like:
```c
void gemv(const float *A, const float *x, float *y, const size_t n)
{
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < n; ++j)
            y[i] += A[i * n + j] * x[j];
}
```
Assuming the code is built for the same target architecture, a Fortran compiler is generally able to emit more efficient code than a C/C++ compiler for this particular implementation. This is because a Fortran compiler may assume that the array y does not alias x or A. A C/C++ compiler cannot make this assumption, which prevents it from performing certain optimizations during code generation. This is often the reason for the claim that Fortran is faster than C/C++.
Strict Aliasing Rule in C/C++
The concept of pointers in certain programming languages allows declaring variables that “point to” a specific memory location by assigning a corresponding memory address to them. This concept generally does not prevent any two pointers from pointing to the same location in memory. It may also be possible that two pointers point to ranges in memory that overlap. Pointers do not exist in earlier releases of Fortran, so aliasing of memory regions is not a concern there. This is not true for programs written in C/C++. However, to limit the scope of possible aliasing, some rules apply, broadly known as strict aliasing:
Pointers with incompatible types (e.g. `int` and `float`) do not alias in memory. Dereferencing a pointer that points to an incompatible type is undefined behavior.
Consider the following function, for which the strict aliasing rule applies to the pointer parameters a and b:
```c
1  int foo(int *a, int *b)
2  {
3      *a = 1;
4      *b = 2;
5      return *a; /* what is the value of *a? */
6  }
```
The compiler does not know a priori what a and b will point to when foo is called. The return value may be 1 or 2 depending on the arguments that are passed into foo. If the pointers may alias, the compiler cannot reorder read and write operations, which inflicts a performance penalty in cases where no aliasing actually occurs. In the example above, the compiler must issue a read in line 5 because the write in line 4 may change the value assigned in line 3. If the pointers do not alias, the additional read instruction is not necessary.
The code below shows the optimized assembly code generated by GCC 14.2.1 (note that the -fstrict-aliasing flag is implied for the -O2, -O3 and -Os optimization levels). The values of the pointers a and b are stored in registers rdi and rsi, respectively. If the values in these two registers are the same, lines 2 and 3 below write to the same memory location and consequently the compiler is required to perform a fresh read from address rdi when moving the result into the return register eax in line 4. A total of three memory operations is required to ensure correctness of this optimized code.
```asm
1  foo:                             ; gcc 14.2 with flags -O3 -fstrict-aliasing
2      mov     DWORD PTR [rdi], 1   ; *a = 1
3      mov     DWORD PTR [rsi], 2   ; *b = 2
4      mov     eax, DWORD PTR [rdi] ; load *a into register eax
5      ret
```
The compiler must be conservative with respect to optimizations when aliasing is possible. This is the performance trade-off for the dynamic flexibility that pointers offer.
Sometimes strict aliasing may not be desired, especially when performance is a concern (the compiler must have the freedom to reorder memory operations). In that case a pointer can be qualified as restrict, which establishes a contract between the compiler and the programmer that transfers responsibility to the programmer.[3] More freedom (for the person who writes the code) typically means more responsibility. The restrict type qualifier tells the compiler that a pointer will not alias other memory in the block scope in which it is defined (from the compiler's perspective it is more of a liberation than a restriction). If we agree with this contract (assuming it is understood), we can rewrite the code using the restrict type qualifier in the public API of the function:
```c
int foo(int *restrict a, int *restrict b)
{
    *a = 1;
    *b = 2;
    return *a; /* the value of *a is definitely 1! */
}
```
Because of our contract, the compiler can be absolutely certain that the return value is 1. If something else is expected, it is not the compiler's fault. The optimized assembly now looks as follows:
```asm
foo:                             ; gcc 14.2 with flags -O3 -fstrict-aliasing
    mov     DWORD PTR [rdi], 1   ; *a = 1
    mov     eax, 1               ; return value is 1, no question!
    mov     DWORD PTR [rsi], 2   ; *b = 2
    ret
```
This code requires only two memory operations compared to the three required when the strict aliasing rule must be enforced. Potential aliasing requires additional memory instructions to guarantee correct code. These additional memory operations are one of the causes of performance degradation in C/C++. Fortran does not have to deal with this.
Memory Aware Implementation
The naive implementation of the GEMV kernel shown at the beginning of the previous section is subject to the strict aliasing rule. The writes to memory locations pointed to by y may overwrite memory that is subsequently read through A and/or x. Note that the const qualifier applied to these two pointers in the function signature is not sufficient to prevent aliasing. The optimized assembly code for this naive implementation looks like:
```asm
 1  gemv:                                  ; gcc 14.2 with flags -O3 -ftree-vectorize
 2      test    rcx, rcx
 3      je      .L1
 4      lea     r8, [0+rcx*4]
 5      lea     r9, [rdx+r8]
 6  .L3:
 7      movss   xmm1, DWORD PTR [rdx]
 8      xor     eax, eax
 9  .L4:
10      movss   xmm0, DWORD PTR [rdi+rax*4]
11      mulss   xmm0, DWORD PTR [rsi+rax*4]
12      add     rax, 1
13      addss   xmm1, xmm0
14      movss   DWORD PTR [rdx], xmm1
15      cmp     rcx, rax
16      jne     .L4
17      add     rdx, 4
18      add     rdi, r8
19      cmp     r9, rdx
20      jne     .L3
21  .L1:
22      ret
```
The highlighted section of lines 9–16 corresponds to the inner-most loop. This code iterates over index $j$ to compute the inner product $A_{ij} x_j$ and add the result to $y_i$. For every $j$, the value $A_{ij}$ is loaded into register xmm0 (line 10), then multiplied with $x_j$ (line 11, the second operand is loaded from memory) and finally added to $y_i$, which is represented by the accumulator register xmm1 in line 13. Because of the strict aliasing rule, the compiler is required to emit a store instruction in line 14 that writes the current value of $y_i$ back to memory before continuing with reading again in lines 10 and 11. This store is unnecessary if we can guarantee that y does not alias A and/or x (these are typically distinct arrays for the GEMV operation). Furthermore, note that this code is trivial to vectorize, but pointer aliasing prevents the compiler from doing this optimization. This is another cause of performance degradation in C/C++ when pointer aliasing is present. Slight modifications to this kernel implementation will help the compiler emit more efficient code.
Qualify the pointer to the destination array as restrict
As seen in the previous section, one simple way to fix this problem is to add the restrict qualifier to the pointer argument for the destination array y (the one being written to). Changing the function signature of the previous GEMV implementation to

```c
void gemv(const float *A, const float *x, float *restrict y, const size_t n)
```

removes the unnecessary store instruction in the inner-most loop and allows the compiler to emit SIMD instructions for vectorized code. The assembly code for the inner-most loop now looks like:[4]
```asm
 1  .L4:                                 ; gcc 14.2 with flags -O3 -ftree-vectorize
 2      movups  xmm0, XMMWORD PTR [rsi+rax]
 3      movups  xmm3, XMMWORD PTR [rdx+rax]
 4      add     rax, 16
 5      mulps   xmm0, xmm3
 6      movaps  xmm1, xmm0
 7      addss   xmm1, xmm2
 8      movaps  xmm2, xmm0
 9      shufps  xmm2, xmm0, 85
10      addss   xmm1, xmm2
11      movaps  xmm2, xmm0
12      unpckhps xmm2, xmm0
13      shufps  xmm0, xmm0, 255
14      addss   xmm1, xmm2
15      movaps  xmm2, xmm0
16      addss   xmm2, xmm1
17      cmp     rax, rcx
18      jne     .L4
```
Lines 2 and 3 perform 128-bit loads of four consecutive elements of $x$ and $A$, respectively, and store the results in registers xmm0 and xmm3. These are vector loads that operate on 4 32-bit floats simultaneously (4-way SSE SIMD for the given compiler flags; other SIMD instruction set extensions would generate a similar code path). Line 5 then executes a vector multiplication for the products $A_{ij} x_j$ and lines 6–16 correspond to a horizontal reduction into the accumulator register xmm2 (the value $y_i$). The store instruction that was observed for the case with strict aliasing is optimized away for this implementation.
The auto-vectorized code generated by GCC 14.2.1 is not ideal because it performs the horizontal reduction for every loop iteration $j$. A better approach would be to use a vector register for the accumulation of products using FMA instructions and then perform the horizontal reduction of the accumulator register into $y_i$ once after all accumulation steps. For comparison, compiling this code using Clang 19.1.0 with the same optimization flags generates different scalar (non-SIMD) code with 4-way loop unrolling.
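A portable sketch of this idea (my own illustration, not the reference implementation): keep several independent scalar accumulators in the inner loop, which the compiler can map onto vector lanes, and perform the reduction only once per row:

```c
#include <stddef.h>

/* Sketch: defer the horizontal reduction by accumulating into four
 * independent partial sums; the reduction to a single value happens
 * once per call instead of once per iteration. */
float dot_deferred(const float *a, const float *x, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t j = 0;
    for (; j + 4 <= n; j += 4) {
        s0 += a[j]     * x[j];
        s1 += a[j + 1] * x[j + 1];
        s2 += a[j + 2] * x[j + 2];
        s3 += a[j + 3] * x[j + 3];
    }
    for (; j < n; ++j) /* remainder loop for n not divisible by 4 */
        s0 += a[j] * x[j];
    return (s0 + s1) + (s2 + s3); /* single horizontal reduction */
}
```

Note that this reorders the floating point additions, which is only legal for the compiler to do on its own under relaxed floating point semantics (e.g. -Ofast), which is one reason auto-vectorizers are conservative here.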
Use Thread-Private Memory
For the GEMV algorithm discussed here, using the restrict qualifier on the
destination array y is not necessary. The strict aliasing rule is not a
problem for this kernel if we write the code in a way that will impose our
intent more clear on the compiler. In the initial implementation, it is clear
that the loop index i is constant in the inner-most loop. There is no need at
all to index into array y for every iteration j.
```c
void gemv(const float *A, const float *x, float *y, const size_t n)
{
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < n; ++j)
            y[i] += A[i * n + j] * x[j];
}
```
Instead of writing the initial code as shown above, it is better to use an
accumulator for which the compiler will allocate a register (which is
thread-private). There is no need to use a shared resource such as y[i] to
perform the accumulation. A better GEMV implementation would thus be
```c
void gemv(const float *A, const float *x, float *y, const size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        float inner_product = 0.0f;
        for (size_t j = 0; j < n; ++j) {
            inner_product += A[i * n + j] * x[j];
        }
        y[i] += inner_product;
    }
}
```
Note that no restrict qualifier is used for this implementation. The resulting assembly code is identical to the one obtained using the restrict-qualified destination array in the previous section. The code is more elegant because the intention is expressed clearly by the use of an accumulator in thread-private memory. Using thread-private memory whenever possible is furthermore important for writing multi-threaded code, where race conditions may be an issue.
Fortran Implementation
Now it is time to compare against the Fortran equivalent of the GEMV kernel. Because Fortran uses column-major storage, it is assumed that the matrix $A$ is symmetric for the remainder of this post. The following code is taken from the file sgemv.f of the reference BLAS 3.12.0 library, with some code stripped off to match the simplified GEMV variant discussed in the Test Kernel section:
```fortran
 1        SUBROUTINE SGEMV(A,X,Y,N)
 2        INTEGER(4) N,I,J
 3        REAL(4) A(N,*),X(*),Y(*)
 4        REAL(4) TEMP
 5  #ifdef TRANSPOSE
 6  !     Form y := A^T*x + y.
 7        DO 100 J = 1,N
 8            TEMP = 0.0
 9            DO 90 I = 1,N
10                TEMP = TEMP + A(I,J)*X(I)
11     90     CONTINUE
12            Y(J) = Y(J) + TEMP
13    100 CONTINUE
14  #else
15  !     Form y := A*x + y.
16        DO 60 J = 1,N
17            TEMP = X(J)
18            DO 50 I = 1,N
19                Y(I) = Y(I) + TEMP*A(I,J)
20     50     CONTINUE
21     60 CONTINUE
22  #endif
23        RETURN
24        END
```
As mentioned above, the equivalent Fortran algorithm is the transposed form in lines 7–13, compiled when TRANSPOSE is defined. Compiling this code with GFortran 14.2.1 again results in assembly code identical to that obtained for the memory-aware C/C++ versions. Different for Fortran, of course, is the absence of aliasing, such that a loop structure without the TEMP accumulator
```fortran
      DO 100 J = 1,N
          DO 90 I = 1,N
              Y(J) = Y(J) + A(I,J)*X(I)
   90     CONTINUE
  100 CONTINUE
```
still yields identical assembly code.
Benchmark
In this section I am going to benchmark the different C/C++ implementations discussed above and compare them to the BLAS Fortran implementation. The main metric of interest is the operational intensity discussed in a previous section. The necessary instruction counts for flops and memory accesses are determined using hardware counters via the PAPI library. The benchmark results shown below are obtained using an Intel Core i7-8700K CPU running at 3.70 GHz and the problem size is set to $n = 10000$. The benchmark code can be found here.
Assuming the upper bound estimate discussed in an earlier section, a total of $2n^2 + 2n = 200020000$ memory instructions is to be expected. The naive C/C++ kernel with strict aliasing generates the following benchmark output (compiled with GCC 14.2.1 and the -O2 flag):
```
Total cycles: 427373621
Total instructions: 700071682
Instructions per cycle (IPC): 1.64
L1 cache size: 32 KB
L2 cache size: 256 KB
L3 cache size: 12288 KB
Total problem size: 390703 KB
Total L1 data misses: 12515647
Total load/store: 300010748 (expected: 200020000)
Operational intensity: 1.666690e-01 (expected: 2.499875e-01)
Performance [GFlop/s]: 2.204652e+00
Wall-time [micro-seconds]: 9.072179e+04
```
The measured number of load/store instructions exceeds the expected value by 100 million ($n^2$) because of the additional store instruction required in the inner-most loop due to the strict aliasing rule (see Memory Aware Implementation). The additional store instructions reduce the observed operational intensity accordingly. The measurement for the kernel without strict aliasing produces the expected values (compiled with GCC 14.2.1 and the -O2 flag):
```
Total cycles: 421014858
Total instructions: 600101681
Instructions per cycle (IPC): 1.43
L1 cache size: 32 KB
L2 cache size: 256 KB
L3 cache size: 12288 KB
Total problem size: 390703 KB
Total L1 data misses: 12520995
Total load/store: 200020748 (expected: 200020000)
Operational intensity: 2.499866e-01 (expected: 2.499875e-01)
Performance [GFlop/s]: 2.237932e+00
Wall-time [micro-seconds]: 8.937267e+04
```
The roofline plot below summarizes measurements obtained for single-core execution using different compilers as well as the C and Fortran implementations of the GEMV test kernel. The following compilers have been used on an Arch Linux system with kernel 6.12.4:

gcc-14       | GNU C compiler (version 14.2.1)
gfortran-14  | GNU Fortran compiler (version 14.2.1)
clang-18     | LLVM C compiler (version 18.1.8)
flang-18     | LLVM Fortran compiler (version 18.1.8)
nvfortran-24 | Nvidia HPC SDK Fortran compiler (version 24.7, PGI legacy)
All executables are built using -O2 optimization targeting scalar code without
SIMD (AVX2) or FMA instructions (additional flags used with the Nvidia Fortran
compiler are -acc=host and -mno-fma):
The roofline shows that none of the test executables reach the hardware limit for the given level of optimization. The C implementations correspond to the gcc-14 and clang-18 legend entries, where the naive implementations are labeled with "strict aliasing". The executable generated with the Nvidia Fortran compiler exhibits 4x more load/store instructions than expected, which results in slightly worse performance compared to the other test executables. (It is not clear why the compiler does this, but removing the -mno-fma flag immediately results in aggressively optimized code; see the next section.)
The overall performance of all test executables is roughly identical at this level of optimization, irrespective of the strict aliasing rule. Given the upper bound estimate for the operational intensity of the GEMV test kernel, the fastest executable is limited by the bandwidth ceiling, which corresponds to about 10.4 GFlop/s on this Intel architecture. A C compiler that is compliant with the strict aliasing rule (even when aggressive optimizations are enabled) is unlikely to generate better code than what is observed for gcc-14 above. Therefore, the naive C implementation of the GEMV test kernel is expected to remain roughly 5x below that ceiling due to the strict aliasing rule in C/C++.
With Compiler Optimization
More aggressive optimizations can be enabled by allowing the compiler to use SIMD instructions via auto-vectorization as well as FMA instructions and other optimization techniques, some of which may disregard strict standard compliance. The optimization flags used for the following measurements are:
gcc-14       | -Ofast -mavx2 -mfma -march=native -mtune=native -funroll-all-loops
gfortran-14  | -Ofast -mavx2 -mfma -march=native -mtune=native -funroll-all-loops
clang-18     | -Ofast -mavx2 -mfma -march=native -mtune=native -funroll-all-loops
flang-18     | -O3 -ffast-math -fstack-arrays -march=core-avx2
nvfortran-24 | -O3 -acc=host
These are rather aggressive optimization flags which may not always be suitable
for production code. The benchmark output for the C kernel implementation with
strict aliasing now looks like (compiled with GCC 14.2.1 and the optimization
flags for gcc-14 shown in the table above):
```
Total cycles: 421750341
Total instructions: 337611682
Instructions per cycle (IPC): 0.80
L1 cache size: 32 KB
L2 cache size: 256 KB
L3 cache size: 12288 KB
Total problem size: 390703 KB
Total L1 data misses: 12520574
Total load/store: 300010663 (expected: 200020000)
Operational intensity: 1.666691e-01 (expected: 2.499875e-01)
Performance [GFlop/s]: 2.230897e+00
Wall-time [micro-seconds]: 8.965453e+04
```
The total number of cycles remains about the same while the total number of instructions executed has roughly halved due to optimization passes, reducing the average number of instructions per cycle (IPC) by the same factor. The number of load/store instructions remains the same, indicating that the optimized version still issues scalar loads and stores, a consequence of the strict aliasing rule. The numbers show that the GNU C compiler honors the strict aliasing rule even when aggressive optimizations are enabled and thus does not have much freedom to optimize the code. Note that IPC is not an accurate performance metric for this memory-bound code.
The benchmark output for the C kernel implementation using thread private memory
for the loop accumulator looks like the following (compiled with GCC 14.2.1 and
the optimization flags for gcc-14 shown in the table above):
```
Total cycles: 83715295
Total instructions: 30031680
Instructions per cycle (IPC): 0.36
L1 cache size: 32 KB
L2 cache size: 256 KB
L3 cache size: 12288 KB
Total problem size: 390703 KB
Total L1 data misses: 12537289
Total load/store: 25020760 (expected: 200020000)
Operational intensity: 1.998440e+00 (expected: 2.499875e-01)
Performance [GFlop/s]: 1.121076e+01
Wall-time [micro-seconds]: 1.784090e+04
```
Here the compiler was able to use AVX2 SIMD instructions that operate on 256-bit vector registers. Since the test code uses 32-bit (single precision) floating point data, there are 8 SIMD lanes per register. This explains the 8x fewer load/store instructions, since each load/store instruction now operates on 8 elements simultaneously. The fewer memory instructions obfuscate the observed operational intensity, which appears to be 8x larger than expected (because the expected value is calculated assuming scalar memory instructions). The total number of bytes transferred remains the same, however, which is the metric used to define the denominator of the operational intensity.
The roofline for the measurements with optimized executables looks like this:
The optimized executables reach the hardware ceiling with the exception of the naive C implementation that enforces the strict aliasing rule. Interestingly, the latter only applies to the GNU C compiler; the LLVM C compiler seems to ignore the strict aliasing rule for the optimization flags specified in the table above (one of those optimizations that disregard strict standard compliance, as mentioned above). These results show that the performance limiting factor should be the hardware rather than the programming language or compiler. Therefore, Fortran is not faster than C.
1. For a more thorough discussion see this paper. ↩︎
2. See the 3 C's, for example on Wikipedia. ↩︎ ↩︎
3. The restrict type qualifier was introduced in C99. In C++, restrict is not a keyword, but many compilers support it through built-in extensions. GCC support is provided by the built-in __restrict and __restrict__ type qualifiers. ↩︎
4. The listing only shows the vectorized code path of the inner-most loop, assuming the vector size n is an integer multiple of 4. ↩︎