C Compilers Still Can't Do SIMD
Followers of the ffmpeg project know that the maintainers often emphasize that to achieve top performance in a hot path, you need hand-written assembly. For those who come from high-level languages, or who have never had to chase every last bit of performance, this can come across as showing off and a waste of time rather than something that actually needs to be done. And indeed, compilers nowadays are amazing; even programmers in systems programming or high-frequency trading (HFT), where top performance is a requirement, very rarely (if ever) write assembly by hand. So, why bother?
Well, as a scientific programmer I can say that while C/C++/Rust compilers are amazing in many ways, there is still one thing they can't get consistently right, and that's SIMD.
What is SIMD?
For those who don't know, SIMD stands for Single Instruction, Multiple Data. In a nutshell, a single SIMD instruction applies the same operation (addition, subtraction, multiplication, and so on) to several data elements at once.
Consider this piece of code:
#include <cstddef> // for size_t

void test(const int *a, const int *b, int *c, size_t N)
{
    for (size_t i = 0; i < N; i++)
    {
        c[i] = a[i] + b[i];
    }
}
Without Optimization (-O0)
Here's the x86-64 assembly produced by gcc with no optimization:
test(int const*, int const*, int*, unsigned long):
push rbp
mov rbp, rsp
mov QWORD PTR [rbp-24], rdi
mov QWORD PTR [rbp-32], rsi
mov QWORD PTR [rbp-40], rdx
mov QWORD PTR [rbp-48], rcx
mov QWORD PTR [rbp-8], 0
jmp .L2
.L3:
mov rax, QWORD PTR [rbp-8]
lea rdx, [0+rax*4]
mov rax, QWORD PTR [rbp-24]
add rax, rdx
mov ecx, DWORD PTR [rax]
mov rax, QWORD PTR [rbp-8]
lea rdx, [0+rax*4]
mov rax, QWORD PTR [rbp-32]
add rax, rdx
mov edx, DWORD PTR [rax]
mov rax, QWORD PTR [rbp-8]
lea rsi, [0+rax*4]
mov rax, QWORD PTR [rbp-40]
add rax, rsi
add edx, ecx
mov DWORD PTR [rax], edx
add QWORD PTR [rbp-8], 1
.L2:
mov rax, QWORD PTR [rbp-8]
cmp rax, QWORD PTR [rbp-48]
jb .L3
nop
nop
pop rbp
ret
The most relevant part is .L3 where the addition happens. It does what it's supposed to: load
data from arrays a and b into registers for each i, add them, and
store the result back to c.
With SIMD Optimization (-O3 -mavx2)
But what happens if we build the same piece of code in optimized mode with the flags -O3 -mavx2?
test(int const*, int const*, int*, unsigned long):
test rcx, rcx
je .L33
lea r8, [rcx-1]
cmp r8, 2
jbe .L12
; ... (alignment checks) ...
.L5:
vmovdqu ymm0, YMMWORD PTR [rsi+rax]
vpaddd ymm0, ymm0, YMMWORD PTR [rdi+rax]
vmovdqu YMMWORD PTR [rdx+rax], ymm0
add rax, 32
cmp rax, r8
jne .L5
; ... (tail handling) ...
.L4:
vmovdqu xmm0, XMMWORD PTR [rsi+rax*4]
vpaddd xmm0, xmm0, XMMWORD PTR [rdi+rax*4]
vmovdqu XMMWORD PTR [rdx+rax*4], xmm0
; ... handles remainder ...
.L7:
mov r8d, DWORD PTR [rsi+rax*4]
add r8d, DWORD PTR [rdi+rax*4]
mov DWORD PTR [rdx+rax*4], r8d
; ... repeat 2 more times ...
The output seems complicated, but it's really not. The .L3 block of the unoptimized assembly is
now represented by blocks .L4, .L7, and .L5. Instead of standard
mov and add instructions, we have SIMD instructions like vmovdqu and
vpaddd. What are those? And why are they better?
Instead of processing each integer i separately, the loop now handles 4 or 8 integers at once. Block .L5 loads 8 integers into the ymm0 register (which is 256 bits wide), adds them to the corresponding 8 integers of the other array, and stores the result; this covers the first N − mod(N, 8) integers. The remaining mod(N, 8) elements (the tail) are processed by block .L4 using a 128-bit xmm register (4 integers at once), and finally the remainder mod(mod(N, 8), 4) is processed by block .L7, one integer at a time.
For example, if N = 1029, block .L5 will be invoked 128 times to process 128 × 8 = 1024
elements, and the remaining 5 will be processed in blocks .L4 and .L7.
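To make the structure concrete, here is roughly the same main-loop/tail split written by hand with AVX2 intrinsics. This is only a sketch, not the compiler's actual output: the name test_simd is mine, it assumes a CPU with AVX2, and it uses unaligned loads/stores instead of the alignment checks gcc emits.

#include <immintrin.h>
#include <cstddef>

// Hand-written sketch of what gcc generated above: a 256-bit main loop,
// one optional 128-bit step, and a scalar remainder.
void test_simd(const int *a, const int *b, int *c, size_t N)
{
    size_t i = 0;
    for (; i + 8 <= N; i += 8) { // like .L5: 8 ints per iteration
        __m256i va = _mm256_loadu_si256((const __m256i *)(a + i));
        __m256i vb = _mm256_loadu_si256((const __m256i *)(b + i));
        _mm256_storeu_si256((__m256i *)(c + i), _mm256_add_epi32(va, vb));
    }
    if (i + 4 <= N) { // like .L4: one 128-bit step for part of the tail
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(c + i), _mm_add_epi32(va, vb));
        i += 4;
    }
    for (; i < N; ++i) { // like .L7: one integer at a time
        c[i] = a[i] + b[i];
    }
}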
Is It Actually Faster?
To answer this question, let's recall two key factors that determine the performance of a specific instruction: latency and throughput.
- Latency is the time it takes for a single operation to complete from start to finish. It's measured as the number of clock cycles from when the instruction receives its inputs to when the result is available. vpaddd has a 1-cycle latency, meaning the result is ready 1 cycle after the inputs are ready.
- Throughput is how many operations can be completed per unit of time when running many operations in a pipeline.
For example, suppose we want to add 1000 numbers:
- If each addition takes 4 cycles (latency = 4) but you can start a new addition every cycle (throughput = 1/cycle), then the total execution time is roughly 1000 + 4 = 1004 cycles (one new addition issued per cycle, plus the latency of the last one).
- On the other hand, if each addition takes 1 cycle (latency = 1), but you can only start a new addition every 4 cycles (throughput = 0.25/cycle), then the total time ≈ 1000 × 4 = 4000 cycles.
Typically, both vpaddd and add have the same latency of 1 cycle. However,
add has a higher throughput of 4 per cycle vs. 2 per cycle for vpaddd. Thus:
- vpaddd can process 2 ops/cycle × 8 elements = 16 elements/cycle
- add can process 4 ops/cycle × 1 element = 4 elements/cycle
So using vector registers can be highly beneficial for speeding up parallelized workflows, which are widely used in scientific computing, video, audio, and image processing.
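If you want to verify the speedup on your own hardware, a minimal timing harness could look like the sketch below. It assumes the test function from the beginning of the post is linked in; build it once with -O0 and once with -O3 -mavx2 and compare the reported times. Keep in mind that for arrays this large the loop is largely memory-bound, so the measured gap will be smaller than the 16-vs-4 elements/cycle figure suggests.

#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

void test(const int *a, const int *b, int *c, size_t N); // the function shown earlier

int main()
{
    const size_t N = 1 << 20; // ~1M integers
    std::vector<int> a(N, 1), b(N, 2), c(N);
    const int repeats = 1000;

    auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < repeats; ++r) {
        test(a.data(), b.data(), c.data(), N);
    }
    auto stop = std::chrono::steady_clock::now();

    std::chrono::duration<double> elapsed = stop - start;
    std::printf("%.3f s total, c[0] = %d\n", elapsed.count(), c[0]);
    return 0;
}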
The Problem: Compilers Still Fail
However, in quite a few cases, modern C/C++ compilers (at the time of writing, clang 21 and gcc 15) still fail to produce optimized assembly that takes advantage of SIMD registers. Below is one such toy example.
#include <iostream>
#include <random>
#include <vector>
#include <cmath>
#include <algorithm> // for std::min / std::max
#include <numeric>   // for std::accumulate

int main(int argc, char **argv) {
    int N = 10'000;
    float sigma = 1;
    int radius = 3;
    if (argc > 1) {
        N = std::stoi(argv[1]);
    }
    if (argc > 2) {
        radius = std::stoi(argv[2]);
    }
    if (argc > 3) {
        sigma = std::stof(argv[3]);
    }
    std::vector<float> array(N * N), results(N * N);
    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_real_distribution<> dis(0, 1.0);
    int sample_size = N * N;
    for (auto& val : array) {
        val = dis(gen);
    }
    size_t total_count = 0;
    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < N; ++j) {
            // Window bounds, clamped at the image borders
            const int imax = std::min(i + radius, N - 1);
            const int imin = std::max(i - radius, 0);
            const int jmax = std::min(j + radius, N - 1);
            const int jmin = std::max(j - radius, 0);
            const int index_global = N * i + j;
            const float center_val = array[i * N + j];
            float sum_weights = 0.0f;
            float sum_weighted = 0.0f;
            for (int ii = imin; ii <= imax; ++ii) {
                for (int jj = jmin; jj <= jmax; ++jj) {
                    total_count++;
                    int index_local = N * ii + jj;
                    const float diff = center_val - array[index_local];
                    // Range kernel: the weight depends only on the intensity difference
                    const float w = std::exp(-diff * diff / sigma / sigma / 2);
                    sum_weights += w;
                    sum_weighted += array[index_local] * w;
                }
            }
            results[index_global] = sum_weighted / sum_weights;
        }
    }
    std::cout << "Average " << std::accumulate(results.begin(), results.end(), 0.0) / sample_size << std::endl;
    std::cout << "Total Count " << total_count << std::endl;
    return 0;
}
This is a simplified version of the bilateral filter where the spatial smoothing sigma is set to infinity.
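Concretely, for every pixel p the code computes a normalized weighted average over the window W(p) of up to (2·radius + 1)² pixels around it (clamped at the borders), where the weight depends only on the intensity difference, i.e. the range kernel of the bilateral filter with the spatial kernel dropped:

w(p, q) = exp(-(I[p] - I[q])² / (2·sigma²))
results[p] = Σ_q w(p, q)·I[q] / Σ_q w(p, q), with q running over W(p)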
The inner nested loop over a window can clearly be parallelized: computation of the exponent
std::exp(-diff * diff / sigma / sigma / 2) can be done for each pixel independently, and there
are SIMD-vectorized library calls, _ZGVdN8v_expf@plt and _ZGVbN4v_expf@plt on x86-64, that do exactly that.
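These entry points come from glibc's vector math library, libmvec. For context, gcc will typically emit them even for a loop as simple as the one below when it is compiled with -O3 -ffast-math -mavx2 on x86-64 Linux (the function name here is just for illustration):

#include <cmath>

// With g++ -O3 -ffast-math -mavx2 on x86-64 Linux, this loop is normally
// vectorized into calls to _ZGVdN8v_expf (8 floats per call), with a
// scalar expf tail for the leftover elements.
void exp_all(const float *in, float *out, int n)
{
    for (int i = 0; i < n; ++i) {
        out[i] = std::exp(in[i]);
    }
}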
The full assembly produced by gcc v15 with -O3 -ffast-math -mavx2 can be found here; below are a few snippets:
; SIMD exponent calls
.L77: ; handles 8 floats at a time
vmovups ymm1, YMMWORD PTR [r13+0]
vmovaps YMMWORD PTR [rbp-10256], ymm2
add r13, 32
vmovaps YMMWORD PTR [rbp-10224], ymm3
vsubps ymm0, ymm1, ymm4
vsubps ymm5, ymm4, ymm1
vmovaps YMMWORD PTR [rbp-10160], ymm4
vmovaps YMMWORD PTR [rbp-10192], ymm1
vmulps ymm0, ymm0, ymm5
vmulps ymm0, ymm0, YMMWORD PTR [rbp-10288]
call _ZGVdN8v_expf
vmovaps ymm3, YMMWORD PTR [rbp-10224]
cmp rbx, r13
vmovaps ymm2, YMMWORD PTR [rbp-10256]
vmovaps ymm4, YMMWORD PTR [rbp-10160]
vaddps ymm3, ymm3, ymm0
vmulps ymm0, ymm0, YMMWORD PTR [rbp-10192]
vaddps ymm2, ymm2, ymm0
jne .L77
; tail handling
.L155: ; handling of 4 floats
vbroadcastss xmm0, DWORD PTR [rbp-10300]
lea edx, [rax+1]
movsx rsi, r12d
movsx rax, r15d
add rax, rsi
mov DWORD PTR [rbp-10256], edx
add rax, rcx
vmovups xmm1, XMMWORD PTR [r14+rax*4]
vsubps xmm2, xmm1, xmm0
vsubps xmm0, xmm0, xmm1
vmovaps XMMWORD PTR [rbp-10224], xmm1
vmulps xmm0, xmm2, xmm0
vmulps xmm0, xmm0, XMMWORD PTR [rbp-10368]
vzeroupper
call _ZGVbN4v_expf
vmulps xmm1, xmm0, XMMWORD PTR [rbp-10224]
vaddps xmm0, xmm0, XMMWORD PTR [rbp-10192]
vaddps xmm1, xmm1, XMMWORD PTR [rbp-10160]
mov edx, DWORD PTR [rbp-10256]
vmovhlps xmm2, xmm0, xmm0
vaddps xmm2, xmm2, xmm0
vshufps xmm0, xmm2, xmm2, 85
vaddps xmm0, xmm0, xmm2
vaddss xmm6, xmm0, DWORD PTR [rbp-10296]
vmovhlps xmm0, xmm1, xmm1
vaddps xmm1, xmm0, xmm1
vmovss DWORD PTR [rbp-10296], xmm6
vshufps xmm0, xmm1, xmm1, 85
vaddps xmm0, xmm0, xmm1
vaddss xmm7, xmm0, DWORD PTR [rbp-10292]
vmovss DWORD PTR [rbp-10292], xmm7
test dl, 3
je .L78
; continues ...
.L79: ; handles up to 3 remaining floats, one at a time
lea eax, [r12+rbx]
vmovss xmm6, DWORD PTR [rbp-10300]
cdqe
vmovss xmm1, DWORD PTR [r14+rax*4]
vsubss xmm0, xmm1, xmm6
vsubss xmm2, xmm6, xmm1
vmovss DWORD PTR [rbp-10160], xmm1
vmulss xmm0, xmm0, xmm2
vmulss xmm0, xmm0, DWORD PTR [rbp-10324]
call expf
vaddss xmm5, xmm0, DWORD PTR [rbp-10296]
vmulss xmm0, xmm0, DWORD PTR [rbp-10160]
lea eax, [rbx+1]
vaddss xmm4, xmm0, DWORD PTR [rbp-10292]
mov r13d, DWORD PTR [rbp-10312]
vmovss DWORD PTR [rbp-10296], xmm5
vmovss DWORD PTR [rbp-10292], xmm4
cmp r13d, eax
jl .L78
add eax, r12d
vmovss xmm6, DWORD PTR [rbp-10300]
add ebx, 2
cdqe
vmovss xmm1, DWORD PTR [r14+rax*4]
vsubss xmm0, xmm1, xmm6
vsubss xmm2, xmm6, xmm1
vmovss DWORD PTR [rbp-10160], xmm1
vmulss xmm0, xmm0, xmm2
vmulss xmm0, xmm0, DWORD PTR [rbp-10324]
call expf
vaddss xmm5, xmm0, DWORD PTR [rbp-10296]
vmulss xmm0, xmm0, DWORD PTR [rbp-10160]
vaddss xmm3, xmm0, DWORD PTR [rbp-10292]
vmovss DWORD PTR [rbp-10296], xmm5
vmovss DWORD PTR [rbp-10292], xmm3
cmp r13d, ebx
jl .L78
add ebx, r12d
vmovss xmm6, DWORD PTR [rbp-10300]
movsx rbx, ebx
vmovss xmm1, DWORD PTR [r14+rbx*4]
vsubss xmm0, xmm1, xmm6
vsubss xmm2, xmm6, xmm1
vmovss DWORD PTR [rbp-10160], xmm1
vmulss xmm0, xmm0, xmm2
vmulss xmm0, xmm0, DWORD PTR [rbp-10324]
call expf
vaddss xmm7, xmm0, DWORD PTR [rbp-10296]
vmulss xmm0, xmm0, DWORD PTR [rbp-10160]
vmovss DWORD PTR [rbp-10296], xmm7
vaddss xmm7, xmm0, DWORD PTR [rbp-10292]
vmovss DWORD PTR [rbp-10292], xmm7
The main loop invokes _ZGVdN8v_expf@plt; for the tail, the code may call _ZGVbN4v_expf@plt once, followed by up to three scalar calls to expf@plt. Sounds great, right?
The Compiler is Fragile
Well, it turns out the gcc compiler is very sensitive. If we eliminate the sum_weighted variable
and directly accumulate into results[index_global], no SIMD exponents are emitted (see full assembly here):
.L76: ; one exponent at a time
vmovss xmm1, DWORD PTR [r12]
vmovss xmm7, DWORD PTR [rbp-10144]
vmovss DWORD PTR [rbp-10140], xmm2
add r12, 4
vsubss xmm0, xmm1, xmm7
vsubss xmm4, xmm7, xmm1
vmovss DWORD PTR [rbp-10136], xmm1
vmulss xmm0, xmm0, xmm4
vmulss xmm0, xmm0, DWORD PTR [rbp-10148]
call expf
vmovss xmm1, DWORD PTR [rbp-10136]
vmovss xmm2, DWORD PTR [rbp-10140]
vaddss xmm5, xmm0, DWORD PTR [rbp-10132]
vmulss xmm1, xmm1, xmm0
vmovss DWORD PTR [rbp-10132], xmm5
vaddss xmm2, xmm2, xmm1
vmovss DWORD PTR [r15], xmm2
cmp rbx, r12
jne .L76
mov rcx, QWORD PTR [rbp-10184]
add QWORD PTR [rbp-10208], rcx
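In case the change is hard to picture, the difference in the source boils down to something like the two variants below. This is my reconstruction from the description above, written as standalone per-pixel helpers; the real code keeps the loops inline in main.

#include <cmath>

// Variant A: accumulate into locals and store once per pixel.
// This is the version gcc vectorizes into _ZGVdN8v_expf calls.
void filter_pixel_locals(const float *array, float *results, int N, int i, int j,
                         int imin, int imax, int jmin, int jmax, float sigma)
{
    const float center_val = array[i * N + j];
    float sum_weights = 0.0f, sum_weighted = 0.0f;
    for (int ii = imin; ii <= imax; ++ii) {
        for (int jj = jmin; jj <= jmax; ++jj) {
            const float diff = center_val - array[N * ii + jj];
            const float w = std::exp(-diff * diff / sigma / sigma / 2);
            sum_weights += w;
            sum_weighted += array[N * ii + jj] * w;
        }
    }
    results[i * N + j] = sum_weighted / sum_weights;
}

// Variant B: accumulate directly into the output element.
// This is (roughly) the version for which gcc falls back to scalar expf.
void filter_pixel_in_place(const float *array, float *results, int N, int i, int j,
                           int imin, int imax, int jmin, int jmax, float sigma)
{
    const float center_val = array[i * N + j];
    float sum_weights = 0.0f;
    results[i * N + j] = 0.0f;
    for (int ii = imin; ii <= imax; ++ii) {
        for (int jj = jmin; jj <= jmax; ++jj) {
            const float diff = center_val - array[N * ii + jj];
            const float w = std::exp(-diff * diff / sigma / sigma / 2);
            sum_weights += w;
            results[i * N + j] += array[N * ii + jj] * w; // store inside the loop
        }
    }
    results[i * N + j] /= sum_weights;
}

Presumably the repeated store into results[] creates a potential loop-carried dependency through memory (the compiler cannot easily prove that results does not alias array), and the clean reduction that fed the vectorized expf disappears.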
Importantly, if we switch the compiler to clang-21, neither version will produce assembly with vectorized exponents: see here.
Benchmark Results
How does it affect runtime? On my 13th Gen Intel(R) Core(TM) i5-13400:
- GCC-15 SIMD version: 15 seconds
- GCC-15 non-SIMD version: 34 seconds
- Clang binaries: ~30 seconds
- Mac M1 (all versions): ~32 seconds
This is, of course, not acceptable. The algorithm is not complex; the loops are straightforward, and sliding window algorithms have been used for several decades in various fields.
The Solution (For Now)
Right now, the only way around the problem is to use SIMD intrinsics or inline assembly, with support for multiple architectures, which is what, for example, the ffmpeg project does. This, however, reduces code portability and maintainability. One can only hope that the handling of SIMD will improve in future compiler versions.
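To give a flavor of what that looks like for this filter, here is a rough AVX2 sketch of the innermost accumulation over one window row. The 8-wide exponential is the part you cannot get from the intrinsics headers alone: exp256_ps below is a placeholder (implemented as a scalar fallback just to keep the sketch self-contained) that in real code would come from a SIMD math library or a hand-written polynomial approximation.

#include <immintrin.h>
#include <cmath>

// Placeholder 8-wide expf. A real implementation would use a SIMD math
// library or a polynomial approximation; the scalar loop is only here so
// that the sketch compiles on its own.
static inline __m256 exp256_ps(__m256 x)
{
    alignas(32) float v[8];
    _mm256_store_ps(v, x);
    for (float &f : v) f = std::exp(f);
    return _mm256_load_ps(v);
}

// Accumulate weights and weighted values over one window row [jmin, jmax],
// 8 pixels at a time, with a scalar tail. Names mirror the example above.
void accumulate_row(const float *row, int jmin, int jmax,
                    float center_val, float sigma,
                    float &sum_weights, float &sum_weighted)
{
    const float scale = -1.0f / (2.0f * sigma * sigma);
    __m256 vcenter = _mm256_set1_ps(center_val);
    __m256 vscale  = _mm256_set1_ps(scale);
    __m256 vw      = _mm256_setzero_ps();  // running sum of weights
    __m256 vwv     = _mm256_setzero_ps();  // running sum of weight * value

    int jj = jmin;
    for (; jj + 8 <= jmax + 1; jj += 8) {
        __m256 val  = _mm256_loadu_ps(row + jj);
        __m256 diff = _mm256_sub_ps(vcenter, val);
        __m256 w    = exp256_ps(_mm256_mul_ps(_mm256_mul_ps(diff, diff), vscale));
        vw  = _mm256_add_ps(vw, w);
        vwv = _mm256_add_ps(vwv, _mm256_mul_ps(w, val));
    }

    // Horizontal reduction of the vector accumulators.
    alignas(32) float tmp[8];
    _mm256_store_ps(tmp, vw);
    for (int k = 0; k < 8; ++k) sum_weights += tmp[k];
    _mm256_store_ps(tmp, vwv);
    for (int k = 0; k < 8; ++k) sum_weighted += tmp[k];

    for (; jj <= jmax; ++jj) { // scalar tail
        const float d = center_val - row[jj];
        const float w = std::exp(d * d * scale);
        sum_weights  += w;
        sum_weighted += row[jj] * w;
    }
}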
Conclusion
While modern compilers are incredibly sophisticated, SIMD optimization remains an area where they fall short. For performance-critical code in scientific computing, multimedia processing, and other domains that benefit from vectorization, developers may still need to resort to hand-written SIMD intrinsics or assembly rather than relying on the compiler's auto-vectorization.
This explains why projects like ffmpeg continue to invest in hand-optimized assembly: it's a necessary compromise until compilers can reliably handle these optimizations automatically.