Test Kernel
The kernel I am going to benchmark is a variation of the general matrix-vector multiplication (GEMV) found in BLAS level 2. The simplified kernel takes the form

$$y_i \leftarrow y_i + A_{ij} x_j,$$

where summation over index $j$ is assumed. The values $x_j$ and $y_i$ are elements of vectors $x$ and $y$, respectively, and $A_{ij}$ are elements of matrix $A$ of size $n \times n$ for some integer number $n$. For typical problems $n$ is large.
Operational Intensity
The operational intensity[1] of a compute kernel is the ratio between the number of arithmetic operations and the total number of bytes transferred due to memory accesses. To simplify the discussion, I am assuming the destination of all memory transactions is DRAM. For the simplified GEMV kernel above, there are $n^2$ multiplications and $n^2 + n$ additions for a total of $2n^2 + n$ flop. The memory accesses to DRAM are trickier to estimate because of the cache memory hierarchies found in all modern CPU architectures. An upper bound can be estimated by assuming no cache, where every memory access results in a capacity miss.[2] In that case there are $n^2$ reads for $x_j$, $n^2$ reads for the matrix elements $A_{ij}$, $n$ reads for $y_i$ due to the $n$ inner products and another $n$ writes for $y_i$, resulting in a total of $2n^2 + 2n$ memory accesses. For a lower bound estimate we assume an infinite cache where only compulsory misses are relevant.[2] Here every required read and write is performed exactly once, resulting in a total of $n^2 + 3n$ memory accesses. For both estimates the leading-order term of memory accesses is $n^2$, indicating that the GEMV kernel is memory bound. Indeed, all BLAS level 1 and level 2 kernels are memory bound.
Since $n$ is assumed large, it is reasonable to pick the upper bound memory access estimate, where the cache capacity is assumed small compared to the problem size. Assuming 32-bit single precision floating point data, the operational intensity for this test kernel computes as

$$I = \frac{(2n^2 + n)\,\text{flop}}{4\,\text{B} \times (2n^2 + 2n)} = \frac{2n + 1}{8(n + 1)}\,\frac{\text{flop}}{\text{B}} \approx 0.25\,\frac{\text{flop}}{\text{B}}.$$

The approximate value indicated corresponds to the expected value in the limit when $n$ is large.
Naive Implementation
A straightforward implementation of the GEMV kernel discussed in the previous section could look like:
```c
void gemv(const float *A, const float *x, float *y, const size_t n)
{
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < n; ++j)
            y[i] += A[i * n + j] * x[j];
}
```
Assuming the code is built for the same target architecture, a Fortran compiler is generally able to emit more efficient code than a C/C++ compiler for this particular implementation. This is because a Fortran compiler may assume that the array y does not alias x or A. A C/C++ compiler cannot make this assumption, which prevents it from performing certain optimizations during code generation. This is often the reason for the claim that Fortran is faster than C/C++.
Strict Aliasing Rule in C/C++
The concept of pointers in certain programming languages allows declaring variables that “point to” a specific memory location by assigning a corresponding memory address to them. This concept generally does not prevent any two pointers from pointing to the same location in memory. It may also be possible that two pointers point to ranges in memory that overlap. Pointers do not exist in earlier releases of Fortran, so aliasing of memory regions is not a concern there. This is not true for programs written in C/C++. However, to limit the scope of possible aliasing, some rules apply, broadly known as strict aliasing:
Pointers with incompatible types (e.g. `int` and `float`) do not alias in memory. Dereferencing a pointer that points to an incompatible type is undefined behavior.
Consider the following function, for which the strict aliasing rule applies to the pointer parameters a and b:
```c
1  int foo(int *a, int *b)
2  {
3      *a = 1;
4      *b = 2;
5      return *a; /* what is the value of *a? */
6  }
```
The compiler does not know a priori what a and b will point to when foo is called. The return value may be 1 or 2 depending on the arguments that are passed into foo. If the pointers may alias, the compiler cannot reorder read and write operations, which inflicts a performance penalty in cases where no aliasing actually occurs. In the example above, the compiler must issue a read in line 5 because the write in line 4 may change the value assigned in line 3. If the pointers do not alias, the additional read instruction is not necessary.
The code below shows the optimized assembly code generated by GCC 14.2.1 (note that the -fstrict-aliasing flag is implied for the -O2, -O3 and -Os optimization levels). The values of the pointers a and b are stored in registers rdi and rsi, respectively. If the values in these two registers are the same, lines 2 and 3 below write to the same memory location and consequently the compiler is required to perform a fresh read from address rdi when moving the result into the return register eax in line 4. A total of three memory operations is required to ensure correctness of this optimized code.
```asm
1  foo:                             ; gcc 14.2 with flags -O3 -fstrict-aliasing
2      mov     DWORD PTR [rdi], 1   ; *a = 1
3      mov     DWORD PTR [rsi], 2   ; *b = 2
4      mov     eax, DWORD PTR [rdi] ; load *a into register eax
5      ret
```
The compiler must be conservative with respect to optimizations when aliasing is possible. This is the performance trade-off for the dynamic flexibility that pointers offer.
Sometimes strict aliasing may not be desired, especially when performance is a concern (the compiler must have the freedom to reorder memory operations). In that case a pointer can be qualified as restrict, which establishes a contract between the compiler and the programmer that transfers responsibility to the programmer.[3] More freedom (for the person who writes the code) typically means more responsibility. The restrict type qualifier tells the compiler that a pointer will not alias other memory in the block scope in which it is defined (from the compiler's perspective it is more of a liberation than a restriction). If we agree with this contract (assuming it is understood), we can rewrite the code using the restrict type qualifier in the public API of the function:
```c
int foo(int *restrict a, int *restrict b)
{
    *a = 1;
    *b = 2;
    return *a; /* the value of *a is definitely 1! */
}
```
Because of our contract, the compiler can be absolutely certain that the return value is 1. If something else is expected, it is not the compiler's fault. The optimized assembly now looks as follows:
```asm
foo:                             ; gcc 14.2 with flags -O3 -fstrict-aliasing
    mov     DWORD PTR [rdi], 1   ; *a = 1
    mov     eax, 1               ; return value is 1, no question!
    mov     DWORD PTR [rsi], 2   ; *b = 2
    ret
```
This code requires only two memory operations compared to the three required when the strict aliasing rule must be enforced. Potential aliasing requires additional memory instructions to guarantee correct code. These additional memory operations are one of the causes of performance degradation in C/C++. Fortran does not have to deal with this.
Memory Aware Implementation
The naive implementation of the GEMV kernel shown at the beginning of the previous section is subject to the strict aliasing rule. The writes to memory locations pointed to by y may overwrite memory that is subsequently read through A and/or x. Note that the const qualifier applied to these two pointers in the function signature is not sufficient to prevent aliasing. The optimized assembly code for this naive implementation looks like:
```asm
 1  gemv:                                  ; gcc 14.2 with flags -O3 -ftree-vectorize
 2      test    rcx, rcx
 3      je      .L1
 4      lea     r8, [0+rcx*4]
 5      lea     r9, [rdx+r8]
 6  .L3:
 7      movss   xmm1, DWORD PTR [rdx]
 8      xor     eax, eax
 9  .L4:
10      movss   xmm0, DWORD PTR [rdi+rax*4]
11      mulss   xmm0, DWORD PTR [rsi+rax*4]
12      add     rax, 1
13      addss   xmm1, xmm0
14      movss   DWORD PTR [rdx], xmm1
15      cmp     rcx, rax
16      jne     .L4
17      add     rdx, 4
18      add     rdi, r8
19      cmp     r9, rdx
20      jne     .L3
21  .L1:
22      ret
```
The highlighted section of lines 9–16 corresponds to the inner-most loop. This code iterates over index $j$ to compute the inner product $A_{ij} x_j$ and add the result to $y_i$. For every $j$, the value $A_{ij}$ is loaded into register xmm0 (line 10), then multiplied with $x_j$ (line 11, the second operand is loaded from memory) and finally added to $y_i$, which is represented by the accumulator register xmm1 in line 13. Because of the strict aliasing rule, the compiler is required to emit a store instruction in line 14 that writes the current value of $y_i$ back to memory before continuing with reading again in lines 10 and 11. This store is unnecessary if we can guarantee that y does not alias A and/or x (these are typically distinct arrays for the GEMV operation). Furthermore, note that this code is trivial to vectorize, but pointer aliasing prevents the compiler from doing this optimization. This is another cause of performance degradation in C/C++ when pointer aliasing is present. Slight modifications to this kernel implementation will help the compiler emit more efficient code.
Qualify the pointer to the destination array as restrict
As seen in the previous section, one simple way to fix this problem is to add the restrict qualifier to the pointer argument for the destination array y (the one being written to). Changing the function signature of the previous GEMV implementation to

```c
void gemv(const float *A, const float *x, float *restrict y, const size_t n)
```

removes the unnecessary store instruction in the inner-most loop and allows the compiler to emit SIMD instructions for vectorized code. The assembly code for the inner-most loop now looks like:[4]
```asm
 1  .L4:                                 ; gcc 14.2 with flags -O3 -ftree-vectorize
 2      movups  xmm0, XMMWORD PTR [rsi+rax]
 3      movups  xmm3, XMMWORD PTR [rdx+rax]
 4      add     rax, 16
 5      mulps   xmm0, xmm3
 6      movaps  xmm1, xmm0
 7      addss   xmm1, xmm2
 8      movaps  xmm2, xmm0
 9      shufps  xmm2, xmm0, 85
10      addss   xmm1, xmm2
11      movaps  xmm2, xmm0
12      unpckhps xmm2, xmm0
13      shufps  xmm0, xmm0, 255
14      addss   xmm1, xmm2
15      movaps  xmm2, xmm0
16      addss   xmm2, xmm1
17      cmp     rax, rcx
18      jne     .L4
```
Lines 2 and 3 perform 128-bit loads of four consecutive elements of $x$ and $A$, respectively, and store the results in registers xmm0 and xmm3. These are vector loads that operate on 4 32-bit floats simultaneously (4-way SSE SIMD for the given compiler flags; other SIMD instruction set extensions would generate a similar code path). Line 5 then executes a vector multiplication for the products $A_{ij} x_j$ and lines 6–16 correspond to a horizontal reduction into the accumulator register xmm2 (the value $y_i$). The store instruction that was observed for the case with strict aliasing is optimized away for this implementation.
The auto-vectorized code generated by GCC 14.2.1 is not ideal because it performs the horizontal reduction for every loop iteration $j$. A better approach would be to use a vector register for the accumulation of products using FMA instructions and then perform the horizontal reduction of the accumulator register into $y_i$ once after all accumulation steps. For comparison, compiling this code using Clang 19.1.0 with the same optimization flags generates different scalar (non-SIMD) code with 4-way loop unrolling.
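A portable sketch of this idea (my own illustration, not the reference implementation): keep several independent scalar accumulators in the inner loop, which the compiler can map onto vector lanes, and perform the reduction only once per row:

```c
#include <stddef.h>

/* Sketch: defer the horizontal reduction by accumulating into four
 * independent partial sums; the reduction to a single value happens
 * once per call instead of once per iteration. */
float dot_deferred(const float *a, const float *x, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t j = 0;
    for (; j + 4 <= n; j += 4) {
        s0 += a[j]     * x[j];
        s1 += a[j + 1] * x[j + 1];
        s2 += a[j + 2] * x[j + 2];
        s3 += a[j + 3] * x[j + 3];
    }
    for (; j < n; ++j) /* remainder loop for n not divisible by 4 */
        s0 += a[j] * x[j];
    return (s0 + s1) + (s2 + s3); /* single horizontal reduction */
}
```

Note that this reorders the floating point additions, which is only legal for the compiler to do on its own under relaxed floating point semantics (e.g. -Ofast), which is one reason auto-vectorizers are conservative here.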
Use Thread-Private Memory
For the GEMV algorithm discussed here, using the restrict qualifier on the
destination array y is not necessary. The strict aliasing rule is not a
problem for this kernel if we write the code in a way that will impose our
intent more clear on the compiler. In the initial implementation, it is clear
that the loop index i is constant in the inner-most loop. There is no need at
all to index into array y for every iteration j.
```c
void gemv(const float *A, const float *x, float *y, const size_t n)
{
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < n; ++j)
            y[i] += A[i * n + j] * x[j];
}
```
Instead of writing the initial code as shown above, it is better to use an
accumulator for which the compiler will allocate a register (which is
thread-private). There is no need to use a shared resource such as y[i] to
perform the accumulation. A better GEMV implementation would thus be
```c
void gemv(const float *A, const float *x, float *y, const size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        float inner_product = 0.0f;
        for (size_t j = 0; j < n; ++j) {
            inner_product += A[i * n + j] * x[j];
        }
        y[i] += inner_product;
    }
}
```
Note that no restrict qualifier is used for this implementation. The resulting assembly code is identical to the one obtained using the restrict-qualified destination array in the previous section. The code is more elegant because the intention is expressed clearly by the use of an accumulator in thread-private memory. Using thread-private memory whenever possible is furthermore important for writing multi-threaded code, where race conditions may be an issue.
Fortran Implementation
Now it is time to compare against the Fortran equivalent of the GEMV kernel. Because Fortran uses column-major storage, it is assumed that the matrix $A$ is symmetric for the remainder of this post. The following code is taken from the file sgemv.f of the reference BLAS 3.12.0 library, with some code stripped off to match the simplified GEMV variant discussed in the Test Kernel section:
```fortran
 1        SUBROUTINE SGEMV(A,X,Y,N)
 2        INTEGER(4) N,I,J
 3        REAL(4) A(N,*),X(*),Y(*)
 4        REAL(4) TEMP
 5  #ifdef TRANSPOSE
 6  !     Form y := A^T*x + y.
 7        DO 100 J = 1,N
 8            TEMP = 0.0
 9            DO 90 I = 1,N
10                TEMP = TEMP + A(I,J)*X(I)
11     90     CONTINUE
12            Y(J) = Y(J) + TEMP
13    100 CONTINUE
14  #else
15  !     Form y := A*x + y.
16        DO 60 J = 1,N
17            TEMP = X(J)
18            DO 50 I = 1,N
19                Y(I) = Y(I) + TEMP*A(I,J)
20     50     CONTINUE
21     60 CONTINUE
22  #endif
23        RETURN
24        END
```
As mentioned above, the equivalent Fortran algorithm is the transposed form in lines 7–13, compiled when TRANSPOSE is defined. Compiling this code with GFortran 14.2.1 again results in assembly code identical to that obtained for the memory-aware C/C++ versions. Different for Fortran, of course, is the absence of aliasing, such that a loop structure without the TEMP accumulator
```fortran
      DO 100 J = 1,N
          DO 90 I = 1,N
              Y(J) = Y(J) + A(I,J)*X(I)
   90     CONTINUE
  100 CONTINUE
```
still yields identical assembly code.
Benchmark
In this section I am going to benchmark the different C/C++ implementations discussed above and compare them to the BLAS Fortran implementation. The main metric of interest is the operational intensity discussed in a previous section. The necessary instruction counts for flops and memory accesses are determined using hardware counters via the PAPI library. The benchmark results shown below are obtained using an Intel Core i7-8700K CPU running at 3.70 GHz and the problem size is set to $n = 10000$. The benchmark code can be found here.
Assuming the upper bound estimate discussed in an earlier section, a total of $2n^2 + 2n = 200020000$ memory instructions is to be expected. The naive C/C++ kernel with strict aliasing generates the following benchmark output (compiled with GCC 14.2.1 and the -O2 flag):
```
Total cycles: 427373621
Total instructions: 700071682
Instructions per cycle (IPC): 1.64
L1 cache size: 32 KB
L2 cache size: 256 KB
L3 cache size: 12288 KB
Total problem size: 390703 KB
Total L1 data misses: 12515647
Total load/store: 300010748 (expected: 200020000)
Operational intensity: 1.666690e-01 (expected: 2.499875e-01)
Performance [GFlop/s]: 2.204652e+00
Wall-time [micro-seconds]: 9.072179e+04
```
The measured number of load/store instructions exceeds the expected value by 100 million ($n^2$) because of the additional store instruction required in the inner-most loop due to the strict aliasing rule (see Memory Aware Implementation). The additional store instructions reduce the observed operational intensity accordingly. The measurement for the kernel without strict aliasing produces the expected values (compiled with GCC 14.2.1 and the -O2 flag):
```
Total cycles: 421014858
Total instructions: 600101681
Instructions per cycle (IPC): 1.43
L1 cache size: 32 KB
L2 cache size: 256 KB
L3 cache size: 12288 KB
Total problem size: 390703 KB
Total L1 data misses: 12520995
Total load/store: 200020748 (expected: 200020000)
Operational intensity: 2.499866e-01 (expected: 2.499875e-01)
Performance [GFlop/s]: 2.237932e+00
Wall-time [micro-seconds]: 8.937267e+04
```
The roofline plot below summarizes measurements obtained for single-core execution using different compilers as well as the C and Fortran implementations of the GEMV test kernel. The following compilers have been used on an Arch Linux system with kernel 6.12.4:

gcc-14       | GNU C compiler (version 14.2.1)
gfortran-14  | GNU Fortran compiler (version 14.2.1)
clang-18     | LLVM C compiler (version 18.1.8)
flang-18     | LLVM Fortran compiler (version 18.1.8)
nvfortran-24 | Nvidia HPC SDK Fortran compiler (version 24.7, PGI legacy)
All executables are built using -O2 optimization targeting scalar code without
SIMD (AVX2) or FMA instructions (additional flags used with the Nvidia Fortran
compiler are -acc=host and -mno-fma):
The roofline shows that none of the test executables reach the hardware limit for the given level of optimization. The C implementations correspond to the gcc-14 and clang-18 legend entries, where the naive implementations are labeled with "strict aliasing". The executable generated with the Nvidia Fortran compiler exhibits 4x more load/store instructions than expected, which results in slightly worse performance compared to the other test executables. (It is not clear why the compiler does this, but removing the -mno-fma flag immediately results in aggressively optimized code; see the next section.)
The overall performance of all test executables is roughly identical at this level of optimization, irrespective of the strict aliasing rule. Given the upper bound estimate for the operational intensity of the GEMV test kernel, the fastest executable is limited by the bandwidth ceiling, which corresponds to about 10.4 GFlop/s on this Intel architecture. A C compiler that is compliant with the strict aliasing rule (even when aggressive optimizations are enabled) is unlikely to generate better code than what is observed for gcc-14 above. Therefore, the naive C implementation of the GEMV test kernel is expected to remain roughly 5x below that ceiling due to the strict aliasing rule in C/C++.
With Compiler Optimization
More aggressive optimizations can be enabled by allowing the compiler to use SIMD instructions via auto-vectorization as well as FMA instructions and other optimization techniques, some of which may disregard strict standard compliance. The optimization flags used for the following measurements are:
gcc-14       | -Ofast -mavx2 -mfma -march=native -mtune=native -funroll-all-loops
gfortran-14  | -Ofast -mavx2 -mfma -march=native -mtune=native -funroll-all-loops
clang-18     | -Ofast -mavx2 -mfma -march=native -mtune=native -funroll-all-loops
flang-18     | -O3 -ffast-math -fstack-arrays -march=core-avx2
nvfortran-24 | -O3 -acc=host
These are rather aggressive optimization flags which may not always be suitable
for production code. The benchmark output for the C kernel implementation with
strict aliasing now looks like (compiled with GCC 14.2.1 and the optimization
flags for gcc-14 shown in the table above):
```
Total cycles: 421750341
Total instructions: 337611682
Instructions per cycle (IPC): 0.80
L1 cache size: 32 KB
L2 cache size: 256 KB
L3 cache size: 12288 KB
Total problem size: 390703 KB
Total L1 data misses: 12520574
Total load/store: 300010663 (expected: 200020000)
Operational intensity: 1.666691e-01 (expected: 2.499875e-01)
Performance [GFlop/s]: 2.230897e+00
Wall-time [micro-seconds]: 8.965453e+04
```
The total number of cycles remains about the same while the total number of instructions executed has roughly halved due to optimization passes, reducing the average number of instructions per cycle (IPC) by the same factor. The number of load/store instructions remains the same, indicating that the optimized version still issues scalar loads and stores, a consequence of the strict aliasing rule. The numbers show that the GNU C compiler honors the strict aliasing rule even when aggressive optimizations are enabled and thus does not have much freedom to optimize the code. Note that IPC is not an accurate performance metric for this memory-bound code.
The benchmark output for the C kernel implementation using thread private memory
for the loop accumulator looks like the following (compiled with GCC 14.2.1 and
the optimization flags for gcc-14 shown in the table above):
```
Total cycles: 83715295
Total instructions: 30031680
Instructions per cycle (IPC): 0.36
L1 cache size: 32 KB
L2 cache size: 256 KB
L3 cache size: 12288 KB
Total problem size: 390703 KB
Total L1 data misses: 12537289
Total load/store: 25020760 (expected: 200020000)
Operational intensity: 1.998440e+00 (expected: 2.499875e-01)
Performance [GFlop/s]: 1.121076e+01
Wall-time [micro-seconds]: 1.784090e+04
```
Here the compiler was able to use AVX2 SIMD instructions that operate on 256-bit vector registers. Since the test code uses 32-bit (single precision) floating point data, there are 8 SIMD lanes per register. This explains the 8x fewer load/store instructions, since each load/store instruction now operates on 8 elements simultaneously. The fewer memory instructions obfuscate the observed operational intensity, which appears to be 8x larger than expected (because the expected value is calculated assuming scalar memory instructions). The total number of bytes transferred remains the same, however, which is the metric used to define the denominator of the operational intensity.
The roofline for the measurements with optimized executables looks like this:
The optimized executables reach the hardware ceiling with the exception of the naive C implementation that enforces the strict aliasing rule. Interestingly, the latter only applies to the GNU C compiler; the LLVM C compiler seems to ignore the strict aliasing rule for the optimization flags specified in the table above (one of those optimizations that disregard strict standard compliance, as mentioned above). These results show that the performance limiting factor should be the hardware rather than the programming language or compiler. Therefore, Fortran is not faster than C.
1. For a more thorough discussion see this paper. ↩︎
2. See the 3 C's, for example on Wikipedia. ↩︎ ↩︎
3. The restrict type qualifier was introduced in C99. In C++, restrict is not a keyword, but many compilers support it through built-in extensions. GCC support is provided by the built-in __restrict and __restrict__ type qualifiers. ↩︎
4. The listing only shows the vectorized code path of the inner-most loop, assuming the vector size n is an integer multiple of 4. ↩︎