C Compilers Still Can't Do SIMD
Followers of the ffmpeg project know that the maintainers often emphasize that to achieve top performance in a hot path, you need hand-written assembly. For those who come from high-level languages, or who have never had to chase every last bit of performance, this can come across as showing off and a waste of time rather than something that actually needs to be done. And indeed, compilers nowadays are amazing; even programmers in systems programming or high-frequency trading (HFT), where top performance is a requirement, very rarely (if ever) write assembly by hand. So, why bother?
Well, as a scientific programmer I can say that while C/C++/Rust compilers are amazing in many ways, there is still one thing they can't get consistently right, and that's SIMD.
What is SIMD?
For those who don't know, SIMD stands for Single Instruction, Multiple Data. In a nutshell, a single SIMD instruction applies the same operation (addition, subtraction, multiplication, and so on) to several data elements at once.
Consider this piece of code:
#include <cstddef> // for size_t

void test(const int *a, const int *b, int *c, size_t N)
{
    for (size_t i = 0; i < N; i++)
    {
        c[i] = a[i] + b[i];
    }
}
Without Optimization (-O0)
Here's the x86-64 assembly produced by gcc with no optimization:
test(int const*, int const*, int*, unsigned long):
push rbp
mov rbp, rsp
mov QWORD PTR [rbp-24], rdi
mov QWORD PTR [rbp-32], rsi
mov QWORD PTR [rbp-40], rdx
mov QWORD PTR [rbp-48], rcx
mov QWORD PTR [rbp-8], 0
jmp .L2
.L3:
mov rax, QWORD PTR [rbp-8]
lea rdx, [0+rax*4]
mov rax, QWORD PTR [rbp-24]
add rax, rdx
mov ecx, DWORD PTR [rax]
mov rax, QWORD PTR [rbp-8]
lea rdx, [0+rax*4]
mov rax, QWORD PTR [rbp-32]
add rax, rdx
mov edx, DWORD PTR [rax]
mov rax, QWORD PTR [rbp-8]
lea rsi, [0+rax*4]
mov rax, QWORD PTR [rbp-40]
add rax, rsi
add edx, ecx
mov DWORD PTR [rax], edx
add QWORD PTR [rbp-8], 1
.L2:
mov rax, QWORD PTR [rbp-8]
cmp rax, QWORD PTR [rbp-48]
jb .L3
nop
nop
pop rbp
ret
The most relevant part is .L3 where the addition happens. It does what it's supposed to: load
data from arrays a and b into registers for each i, add them, and
store the result back to c.
With SIMD Optimization (-O3 -mavx2)
But what happens if we build the same piece of code in optimized mode with the flags -O3 -mavx2?
test(int const*, int const*, int*, unsigned long):
test rcx, rcx
je .L33
lea r8, [rcx-1]
cmp r8, 2
jbe .L12
; ... (alignment checks) ...
.L5:
vmovdqu ymm0, YMMWORD PTR [rsi+rax]
vpaddd ymm0, ymm0, YMMWORD PTR [rdi+rax]
vmovdqu YMMWORD PTR [rdx+rax], ymm0
add rax, 32
cmp rax, r8
jne .L5
; ... (tail handling) ...
.L4:
vmovdqu xmm0, XMMWORD PTR [rsi+rax*4]
vpaddd xmm0, xmm0, XMMWORD PTR [rdi+rax*4]
vmovdqu XMMWORD PTR [rdx+rax*4], xmm0
; ... handles remainder ...
.L7:
mov r8d, DWORD PTR [rsi+rax*4]
add r8d, DWORD PTR [rdi+rax*4]
mov DWORD PTR [rdx+rax*4], r8d
; ... repeat 2 more times ...
The output seems complicated, but it's really not. The .L3 block of the unoptimized assembly is
now represented by blocks .L4, .L7, and .L5. Instead of standard
mov and add instructions, we have SIMD instructions like vmovdqu and
vpaddd. What are those? And why are they better?
Instead of processing each integer i separately, the loop now handles 4 or 8 integers at once. Block .L5 loads 8 integers into the ymm0 register (which is 256 bits wide), adds them to the corresponding 8 integers of the other array, and stores the result; this covers the first N − mod(N, 8) integers. The remaining mod(N, 8) elements (the tail) are processed by block .L4 using a 128-bit xmm register (4 integers at once), and finally the remainder mod(mod(N, 8), 4) is processed by block .L7, one integer at a time.
For example, if N = 1029, block .L5 will be invoked 128 times to process 128 × 8 = 1024
elements, and the remaining 5 will be processed in blocks .L4 and .L7.
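To make the structure concrete, here is roughly the same main-loop/tail split written by hand with AVX2 intrinsics. This is only a sketch, not the compiler's actual output: the name test_simd is mine, it assumes a CPU with AVX2, and it uses unaligned loads/stores instead of the alignment checks gcc emits.

#include <immintrin.h>
#include <cstddef>

// Hand-written sketch of what gcc generated above: a 256-bit main loop,
// one optional 128-bit step, and a scalar remainder.
void test_simd(const int *a, const int *b, int *c, size_t N)
{
    size_t i = 0;
    for (; i + 8 <= N; i += 8) { // like .L5: 8 ints per iteration
        __m256i va = _mm256_loadu_si256((const __m256i *)(a + i));
        __m256i vb = _mm256_loadu_si256((const __m256i *)(b + i));
        _mm256_storeu_si256((__m256i *)(c + i), _mm256_add_epi32(va, vb));
    }
    if (i + 4 <= N) { // like .L4: one 128-bit step for part of the tail
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(c + i), _mm_add_epi32(va, vb));
        i += 4;
    }
    for (; i < N; ++i) { // like .L7: one integer at a time
        c[i] = a[i] + b[i];
    }
}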
Is It Actually Faster?
To answer this question, let's recall two key factors that determine the performance of a specific instruction: latency and throughput.
- Latency is the time it takes for a single operation to complete from start to finish. It's measured as the number of clock cycles from when the instruction receives its inputs to when the result is available. vpaddd has a 1-cycle latency, meaning the result is ready 1 cycle after the inputs are ready.
- Throughput is how many operations can be completed per unit of time when running many operations in a pipeline.
For example, suppose we want to add 1000 numbers:
- If each addition takes 4 cycles (latency = 4) but you can start a new addition every cycle (throughput = 1/cycle), then the total execution time is roughly 1000 + 4 = 1004 cycles (one new addition issued per cycle, plus the latency of the last one).
- On the other hand, if each addition takes 1 cycle (latency = 1), but you can only start a new addition every 4 cycles (throughput = 0.25/cycle), then the total time ≈ 1000 × 4 = 4000 cycles.
Typically, both vpaddd and add have the same latency of 1 cycle. However,
add has a higher throughput of 4 per cycle vs. 2 per cycle for vpaddd. Thus:
- vpaddd can process 2 ops/cycle × 8 elements = 16 elements/cycle
- add can process 4 ops/cycle × 1 element = 4 elements/cycle
So using vector registers can be highly beneficial for speeding up parallelized workflows, which are widely used in scientific computing, video, audio, and image processing.
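If you want to verify the speedup on your own hardware, a minimal timing harness could look like the sketch below. It assumes the test function from the beginning of the post is linked in; build it once with -O0 and once with -O3 -mavx2 and compare the reported times. Keep in mind that for arrays this large the loop is largely memory-bound, so the measured gap will be smaller than the 16-vs-4 elements/cycle figure suggests.

#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

void test(const int *a, const int *b, int *c, size_t N); // the function shown earlier

int main()
{
    const size_t N = 1 << 20; // ~1M integers
    std::vector<int> a(N, 1), b(N, 2), c(N);
    const int repeats = 1000;

    auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < repeats; ++r) {
        test(a.data(), b.data(), c.data(), N);
    }
    auto stop = std::chrono::steady_clock::now();

    std::chrono::duration<double> elapsed = stop - start;
    std::printf("%.3f s total, c[0] = %d\n", elapsed.count(), c[0]);
    return 0;
}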
The Problem: Compilers Still Fail
However, in quite a few cases, modern C/C++ compilers (at the time of writing, clang 21 and gcc 15) still fail to produce optimized assembly that takes advantage of SIMD registers. Below is one such toy example.
#include <iostream>
#include <random>
#include <vector>
#include <cmath>
#include <algorithm> // for std::min / std::max
#include <numeric>   // for std::accumulate

int main(int argc, char **argv) {
    int N = 10'000;
    float sigma = 1;
    int radius = 3;
    if (argc > 1) {
        N = std::stoi(argv[1]);
    }
    if (argc > 2) {
        radius = std::stoi(argv[2]);
    }
    if (argc > 3) {
        sigma = std::stof(argv[3]);
    }
    std::vector<float> array(N * N), results(N * N);
    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_real_distribution<> dis(0, 1.0);
    int sample_size = N * N;
    for (auto& val : array) {
        val = dis(gen);
    }
    size_t total_count = 0;
    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < N; ++j) {
            // Window bounds, clamped at the image borders
            const int imax = std::min(i + radius, N - 1);
            const int imin = std::max(i - radius, 0);
            const int jmax = std::min(j + radius, N - 1);
            const int jmin = std::max(j - radius, 0);
            const int index_global = N * i + j;
            const float center_val = array[i * N + j];
            float sum_weights = 0.0f;
            float sum_weighted = 0.0f;
            for (int ii = imin; ii <= imax; ++ii) {
                for (int jj = jmin; jj <= jmax; ++jj) {
                    total_count++;
                    int index_local = N * ii + jj;
                    const float diff = center_val - array[index_local];
                    // Range kernel: the weight depends only on the intensity difference
                    const float w = std::exp(-diff * diff / sigma / sigma / 2);
                    sum_weights += w;
                    sum_weighted += array[index_local] * w;
                }
            }
            results[index_global] = sum_weighted / sum_weights;
        }
    }
    std::cout << "Average " << std::accumulate(results.begin(), results.end(), 0.0) / sample_size << std::endl;
    std::cout << "Total Count " << total_count << std::endl;
    return 0;
}
This is a simplified version of the bilateral filter where the spatial smoothing sigma is set to infinity.
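Concretely, for every pixel p the code computes a normalized weighted average over the window W(p) of up to (2·radius + 1)² pixels around it (clamped at the borders), where the weight depends only on the intensity difference, i.e. the range kernel of the bilateral filter with the spatial kernel dropped:

w(p, q) = exp(-(I[p] - I[q])² / (2·sigma²))
results[p] = Σ_q w(p, q)·I[q] / Σ_q w(p, q), with q running over W(p)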
The inner nested loop over a window can clearly be parallelized: computation of the exponent
std::exp(-diff * diff / sigma / sigma / 2) can be done for each pixel independently, and there
are SIMD-vectorized library calls, _ZGVdN8v_expf@plt and _ZGVbN4v_expf@plt on x86-64, that do exactly that.
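These entry points come from glibc's vector math library, libmvec. For context, gcc will typically emit them even for a loop as simple as the one below when it is compiled with -O3 -ffast-math -mavx2 on x86-64 Linux (the function name here is just for illustration):

#include <cmath>

// With g++ -O3 -ffast-math -mavx2 on x86-64 Linux, this loop is normally
// vectorized into calls to _ZGVdN8v_expf (8 floats per call), with a
// scalar expf tail for the leftover elements.
void exp_all(const float *in, float *out, int n)
{
    for (int i = 0; i < n; ++i) {
        out[i] = std::exp(in[i]);
    }
}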
The full assembly produced by gcc v15 with -O3 -ffast-math -mavx2 can be found here; below are a few snippets:
; SIMD exponent calls
.L77: ; handles 8 floats at a time
vmovups ymm1, YMMWORD PTR [r13+0]
vmovaps YMMWORD PTR [rbp-10256], ymm2
add r13, 32
vmovaps YMMWORD PTR [rbp-10224], ymm3
vsubps ymm0, ymm1, ymm4
vsubps ymm5, ymm4, ymm1
vmovaps YMMWORD PTR [rbp-10160], ymm4
vmovaps YMMWORD PTR [rbp-10192], ymm1
vmulps ymm0, ymm0, ymm5
vmulps ymm0, ymm0, YMMWORD PTR [rbp-10288]
call _ZGVdN8v_expf
vmovaps ymm3, YMMWORD PTR [rbp-10224]
cmp rbx, r13
vmovaps ymm2, YMMWORD PTR [rbp-10256]
vmovaps ymm4, YMMWORD PTR [rbp-10160]
vaddps ymm3, ymm3, ymm0
vmulps ymm0, ymm0, YMMWORD PTR [rbp-10192]
vaddps ymm2, ymm2, ymm0
jne .L77
; tail handling
.L155: ; handling of 4 floats
vbroadcastss xmm0, DWORD PTR [rbp-10300]
lea edx, [rax+1]
movsx rsi, r12d
movsx rax, r15d
add rax, rsi
mov DWORD PTR [rbp-10256], edx
add rax, rcx
vmovups xmm1, XMMWORD PTR [r14+rax*4]
vsubps xmm2, xmm1, xmm0
vsubps xmm0, xmm0, xmm1
vmovaps XMMWORD PTR [rbp-10224], xmm1
vmulps xmm0, xmm2, xmm0
vmulps xmm0, xmm0, XMMWORD PTR [rbp-10368]
vzeroupper
call _ZGVbN4v_expf
vmulps xmm1, xmm0, XMMWORD PTR [rbp-10224]
vaddps xmm0, xmm0, XMMWORD PTR [rbp-10192]
vaddps xmm1, xmm1, XMMWORD PTR [rbp-10160]
mov edx, DWORD PTR [rbp-10256]
vmovhlps xmm2, xmm0, xmm0
vaddps xmm2, xmm2, xmm0
vshufps xmm0, xmm2, xmm2, 85
vaddps xmm0, xmm0, xmm2
vaddss xmm6, xmm0, DWORD PTR [rbp-10296]
vmovhlps xmm0, xmm1, xmm1
vaddps xmm1, xmm0, xmm1
vmovss DWORD PTR [rbp-10296], xmm6
vshufps xmm0, xmm1, xmm1, 85
vaddps xmm0, xmm0, xmm1
vaddss xmm7, xmm0, DWORD PTR [rbp-10292]
vmovss DWORD PTR [rbp-10292], xmm7
test dl, 3
je .L78
; continues ...
.L79: ; handles up to 3 remaining floats, one at a time
lea eax, [r12+rbx]
vmovss xmm6, DWORD PTR [rbp-10300]
cdqe
vmovss xmm1, DWORD PTR [r14+rax*4]
vsubss xmm0, xmm1, xmm6
vsubss xmm2, xmm6, xmm1
vmovss DWORD PTR [rbp-10160], xmm1
vmulss xmm0, xmm0, xmm2
vmulss xmm0, xmm0, DWORD PTR [rbp-10324]
call expf
vaddss xmm5, xmm0, DWORD PTR [rbp-10296]
vmulss xmm0, xmm0, DWORD PTR [rbp-10160]
lea eax, [rbx+1]
vaddss xmm4, xmm0, DWORD PTR [rbp-10292]
mov r13d, DWORD PTR [rbp-10312]
vmovss DWORD PTR [rbp-10296], xmm5
vmovss DWORD PTR [rbp-10292], xmm4
cmp r13d, eax
jl .L78
add eax, r12d
vmovss xmm6, DWORD PTR [rbp-10300]
add ebx, 2
cdqe
vmovss xmm1, DWORD PTR [r14+rax*4]
vsubss xmm0, xmm1, xmm6
vsubss xmm2, xmm6, xmm1
vmovss DWORD PTR [rbp-10160], xmm1
vmulss xmm0, xmm0, xmm2
vmulss xmm0, xmm0, DWORD PTR [rbp-10324]
call expf
vaddss xmm5, xmm0, DWORD PTR [rbp-10296]
vmulss xmm0, xmm0, DWORD PTR [rbp-10160]
vaddss xmm3, xmm0, DWORD PTR [rbp-10292]
vmovss DWORD PTR [rbp-10296], xmm5
vmovss DWORD PTR [rbp-10292], xmm3
cmp r13d, ebx
jl .L78
add ebx, r12d
vmovss xmm6, DWORD PTR [rbp-10300]
movsx rbx, ebx
vmovss xmm1, DWORD PTR [r14+rbx*4]
vsubss xmm0, xmm1, xmm6
vsubss xmm2, xmm6, xmm1
vmovss DWORD PTR [rbp-10160], xmm1
vmulss xmm0, xmm0, xmm2
vmulss xmm0, xmm0, DWORD PTR [rbp-10324]
call expf
vaddss xmm7, xmm0, DWORD PTR [rbp-10296]
vmulss xmm0, xmm0, DWORD PTR [rbp-10160]
vmovss DWORD PTR [rbp-10296], xmm7
vaddss xmm7, xmm0, DWORD PTR [rbp-10292]
vmovss DWORD PTR [rbp-10292], xmm7
The main loop invokes _ZGVdN8v_expf@plt; for the tail, the code may call _ZGVbN4v_expf@plt once, followed by up to three scalar calls to expf@plt. Sounds great, right?
The Compiler is Fragile
Well, it turns out the gcc compiler is very sensitive. If we eliminate the sum_weighted variable
and directly accumulate into results[index_global], no SIMD exponents are emitted (see full assembly here):
.L76: ; one exponent at a time
vmovss xmm1, DWORD PTR [r12]
vmovss xmm7, DWORD PTR [rbp-10144]
vmovss DWORD PTR [rbp-10140], xmm2
add r12, 4
vsubss xmm0, xmm1, xmm7
vsubss xmm4, xmm7, xmm1
vmovss DWORD PTR [rbp-10136], xmm1
vmulss xmm0, xmm0, xmm4
vmulss xmm0, xmm0, DWORD PTR [rbp-10148]
call expf
vmovss xmm1, DWORD PTR [rbp-10136]
vmovss xmm2, DWORD PTR [rbp-10140]
vaddss xmm5, xmm0, DWORD PTR [rbp-10132]
vmulss xmm1, xmm1, xmm0
vmovss DWORD PTR [rbp-10132], xmm5
vaddss xmm2, xmm2, xmm1
vmovss DWORD PTR [r15], xmm2
cmp rbx, r12
jne .L76
mov rcx, QWORD PTR [rbp-10184]
add QWORD PTR [rbp-10208], rcx
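In case the change is hard to picture, the difference in the source boils down to something like the two variants below. This is my reconstruction from the description above, written as standalone per-pixel helpers; the real code keeps the loops inline in main.

#include <cmath>

// Variant A: accumulate into locals and store once per pixel.
// This is the version gcc vectorizes into _ZGVdN8v_expf calls.
void filter_pixel_locals(const float *array, float *results, int N, int i, int j,
                         int imin, int imax, int jmin, int jmax, float sigma)
{
    const float center_val = array[i * N + j];
    float sum_weights = 0.0f, sum_weighted = 0.0f;
    for (int ii = imin; ii <= imax; ++ii) {
        for (int jj = jmin; jj <= jmax; ++jj) {
            const float diff = center_val - array[N * ii + jj];
            const float w = std::exp(-diff * diff / sigma / sigma / 2);
            sum_weights += w;
            sum_weighted += array[N * ii + jj] * w;
        }
    }
    results[i * N + j] = sum_weighted / sum_weights;
}

// Variant B: accumulate directly into the output element.
// This is (roughly) the version for which gcc falls back to scalar expf.
void filter_pixel_in_place(const float *array, float *results, int N, int i, int j,
                           int imin, int imax, int jmin, int jmax, float sigma)
{
    const float center_val = array[i * N + j];
    float sum_weights = 0.0f;
    results[i * N + j] = 0.0f;
    for (int ii = imin; ii <= imax; ++ii) {
        for (int jj = jmin; jj <= jmax; ++jj) {
            const float diff = center_val - array[N * ii + jj];
            const float w = std::exp(-diff * diff / sigma / sigma / 2);
            sum_weights += w;
            results[i * N + j] += array[N * ii + jj] * w; // store inside the loop
        }
    }
    results[i * N + j] /= sum_weights;
}

Presumably the repeated store into results[] creates a potential loop-carried dependency through memory (the compiler cannot easily prove that results does not alias array), and the clean reduction that fed the vectorized expf disappears.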
Importantly, if we switch the compiler to clang-21, neither version will produce assembly with vectorized exponents: see here.
Benchmark Results
How does it affect runtime? On my 13th Gen Intel(R) Core(TM) i5-13400:
- GCC-15 SIMD version: 15 seconds
- GCC-15 non-SIMD version: 34 seconds
- Clang binaries: ~30 seconds
- Mac M1 (all versions): ~32 seconds
This is, of course, not acceptable. The algorithm is not complex; the loops are straightforward, and sliding window algorithms have been used for several decades in various fields.
The Solution (For Now)
Right now, the only way around the problem is to use SIMD intrinsics or inline assembly, with support for multiple architectures, which is what, for example, the ffmpeg project does. This, however, reduces code portability and maintainability. One can only hope that the handling of SIMD will improve in future compiler versions.
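To give a flavor of what that looks like for this filter, here is a rough AVX2 sketch of the innermost accumulation over one window row. The 8-wide exponential is the part you cannot get from the intrinsics headers alone: exp256_ps below is a placeholder (implemented as a scalar fallback just to keep the sketch self-contained) that in real code would come from a SIMD math library or a hand-written polynomial approximation.

#include <immintrin.h>
#include <cmath>

// Placeholder 8-wide expf. A real implementation would use a SIMD math
// library or a polynomial approximation; the scalar loop is only here so
// that the sketch compiles on its own.
static inline __m256 exp256_ps(__m256 x)
{
    alignas(32) float v[8];
    _mm256_store_ps(v, x);
    for (float &f : v) f = std::exp(f);
    return _mm256_load_ps(v);
}

// Accumulate weights and weighted values over one window row [jmin, jmax],
// 8 pixels at a time, with a scalar tail. Names mirror the example above.
void accumulate_row(const float *row, int jmin, int jmax,
                    float center_val, float sigma,
                    float &sum_weights, float &sum_weighted)
{
    const float scale = -1.0f / (2.0f * sigma * sigma);
    __m256 vcenter = _mm256_set1_ps(center_val);
    __m256 vscale  = _mm256_set1_ps(scale);
    __m256 vw      = _mm256_setzero_ps();  // running sum of weights
    __m256 vwv     = _mm256_setzero_ps();  // running sum of weight * value

    int jj = jmin;
    for (; jj + 8 <= jmax + 1; jj += 8) {
        __m256 val  = _mm256_loadu_ps(row + jj);
        __m256 diff = _mm256_sub_ps(vcenter, val);
        __m256 w    = exp256_ps(_mm256_mul_ps(_mm256_mul_ps(diff, diff), vscale));
        vw  = _mm256_add_ps(vw, w);
        vwv = _mm256_add_ps(vwv, _mm256_mul_ps(w, val));
    }

    // Horizontal reduction of the vector accumulators.
    alignas(32) float tmp[8];
    _mm256_store_ps(tmp, vw);
    for (int k = 0; k < 8; ++k) sum_weights += tmp[k];
    _mm256_store_ps(tmp, vwv);
    for (int k = 0; k < 8; ++k) sum_weighted += tmp[k];

    for (; jj <= jmax; ++jj) { // scalar tail
        const float d = center_val - row[jj];
        const float w = std::exp(d * d * scale);
        sum_weights  += w;
        sum_weighted += row[jj] * w;
    }
}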
Conclusion
While modern compilers are incredibly sophisticated, SIMD optimization remains an area where they fall short. For performance-critical code in scientific computing, multimedia processing, and other domains that benefit from vectorization, developers may still need to resort to hand-written SIMD intrinsics or assembly rather than relying on the compiler's auto-vectorization.
This explains why projects like ffmpeg continue to invest in hand-optimized assembly: it's a necessary compromise until compilers can reliably handle these optimizations automatically.