i386: Vectorized BF16 arithmetic with AVX10.2 - add/sub/mul/div/FMA/sqrt/smaxmin

Posted on 27 August 2024, updated 25 March 2026, by [email protected]

These patches have been committed to the master branch:

8e16f26ca9f – i386: Support partial vectorized V2BF/V4BF plus/minus/mult/div/sqrt

62df24e5003 – i386: Support partial vectorized V2BF/V4BF smaxmin

f82fa0da4d9 – i386: Support vectorized BF16 add/sub/mul/div with AVX10.2

6d294fb8ac9 – i386: Support vectorized BF16 FMA with AVX10.2

29ef601973d – i386: Support vectorized BF16 smaxmin with AVX10.2

e19f65b0be1 – i386: Support vectorized BF16 sqrt with AVX10.2

f9ca3fd1fe3 – i386: Support partial vectorized FMA for V2BF/V4BF

3d031cc4465 – x86: Refine V4BF/V2BF FMA Testcase

This series of 8 commits enables GCC to auto-vectorize BF16 arithmetic operations using AVX10.2 native instructions. Previously, BF16 arithmetic required conversion to FP32 and back. With AVX10.2, the processor can operate on BF16 values directly, and these patches teach GCC to emit those instructions.

The FP32 Round-Trip Problem

Before AVX10.2, BF16 was a storage-only format on x86. To add two BF16 vectors, GCC had to:

1. Convert both BF16 operands to FP32 (vcvtbf162ps)
2. Perform the FP32 operation (vaddps)
3. Convert the result back to BF16 (vcvtne2ps2bf16)

This three-step sequence also widens the data: eight BF16 elements in a 128-bit register become eight FP32 elements requiring a 256-bit register. In a tight vectorized loop, the round-trip conversion can dominate the execution cost, and the widened intermediates double the register pressure.
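
The round trip described above can be modelled in portable scalar C. This is a minimal sketch, not GCC's generated code: BF16 values are held as raw uint16_t bit patterns, and the names bf16_bits, bf16_to_f32, f32_to_bf16 and bf16_add_roundtrip are hypothetical helpers introduced only for illustration. It assumes the usual definition of BF16 as the upper 16 bits of an IEEE 754 binary32 value, with round-to-nearest-even on the way back.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical type: a BF16 value stored as its raw 16-bit pattern. */
typedef uint16_t bf16_bits;

/* Widen BF16 to FP32: the 16 bits become the upper half of the float. */
static float bf16_to_f32(bf16_bits h)
{
    uint32_t u = (uint32_t)h << 16;
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}

/* Narrow FP32 to BF16 with round-to-nearest-even. */
static bf16_bits f32_to_bf16(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    if ((u & 0x7fffffffu) > 0x7f800000u)          /* NaN: keep it a quiet NaN */
        return (bf16_bits)((u >> 16) | 0x0040u);
    u += 0x7fffu + ((u >> 16) & 1u);              /* ties go to the even result */
    return (bf16_bits)(u >> 16);
}

/* Element-wise add via the FP32 round trip: widen, add, narrow. */
static void bf16_add_roundtrip(bf16_bits *dst, const bf16_bits *a,
                               const bf16_bits *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        float fa = bf16_to_f32(a[i]);             /* step 1: BF16 -> FP32 */
        float fb = bf16_to_f32(b[i]);
        dst[i] = f32_to_bf16(fa + fb);            /* steps 2 and 3 */
    }
}
```

For example, adding 0x3F80 (1.0) and 0x4000 (2.0) yields 0x4040 (3.0). Each element goes through two widening conversions and one narrowing conversion around a single addition, which is the overhead the native instructions avoid.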

[Figure: BF16 arithmetic before vs. after AVX10.2]

What AVX10.2 Brings

AVX10.2 introduces native BF16 arithmetic instructions that operate directly on packed BF16 vectors. The complete set enabled by this patch series:

Instruction          Operation         Commit
-----------          ---------         ------
vaddnepbf16          a + b             f82fa0da4d9
vsubnepbf16          a - b             f82fa0da4d9
vmulnepbf16          a * b             f82fa0da4d9
vdivnepbf16          a / b             f82fa0da4d9
vsqrtnepbf16         sqrt(a)           e19f65b0be1
vfmadd132nepbf16     a*b + c (FMA)     6d294fb8ac9
vmaxpbf16            max(a, b)         29ef601973d
vminpbf16            min(a, b)         29ef601973d

The “NE” in the instruction names stands for “Nearest Even” rounding, which is the IEEE 754 default rounding mode. Each instruction operates on the full packed BF16 vector without any format conversion.

Full-Width Vectors (V8BF/V16BF/V32BF)

The core changes are in sse.md. A new mode iterator VF_BHSD extends the existing VF (vector float) iterator to include BF16 modes:

;; VF_BHSD - BF16 + HF16 + SF + DF vector modes
(define_mode_iterator VF_BHSD
  [(V32BF "TARGET_AVX10_2_512") (V16BF "TARGET_AVX10_2_256")
   (V8BF "TARGET_AVX10_2_256") ...])

By adding BF16 modes to the mode iterator used by the add/sub/mul/div patterns, the existing instruction templates automatically generate the correct BF16 variants. GCC’s machine description already handles the “ne” part of the mnemonic through the sse attribute system.

FMA support is similarly enabled by adding V8BF/V16BF/V32BF to the FMAMODEM iterator, which covers fma, fms, fnma, and fnms patterns.
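
One reason a native FMA is worth having is rounding. A fused multiply-add conventionally rounds once, while a separate multiply followed by an add rounds twice. The scalar C sketch below illustrates that difference. It is an illustration under stated assumptions, not a model of the instruction: BF16 values are raw uint16_t bit patterns, all function names are hypothetical, special cases such as denormals are not treated the way hardware may treat them, and the "one rounding" variant is only exact for inputs whose FP32 intermediate is exact.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical type: a BF16 value stored as its raw 16-bit pattern. */
typedef uint16_t bf16_bits;

static float bf16_to_f32(bf16_bits h)
{
    uint32_t u = (uint32_t)h << 16;
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}

/* Narrow FP32 to BF16 with round-to-nearest-even. */
static bf16_bits f32_to_bf16(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    if ((u & 0x7fffffffu) > 0x7f800000u)          /* NaN stays a quiet NaN */
        return (bf16_bits)((u >> 16) | 0x0040u);
    u += 0x7fffu + ((u >> 16) & 1u);
    return (bf16_bits)(u >> 16);
}

/* Unfused: the product is rounded to BF16 before the addition. */
static bf16_bits bf16_fma_two_roundings(bf16_bits a, bf16_bits b, bf16_bits c)
{
    bf16_bits prod = f32_to_bf16(bf16_to_f32(a) * bf16_to_f32(b));
    return f32_to_bf16(bf16_to_f32(prod) + bf16_to_f32(c));
}

/* Fused (sketch): only the final result is rounded to BF16.
   Assumes the FP32 product and sum are exact for the chosen inputs. */
static bf16_bits bf16_fma_one_rounding(bf16_bits a, bf16_bits b, bf16_bits c)
{
    float exact = bf16_to_f32(a) * bf16_to_f32(b) + bf16_to_f32(c);
    return f32_to_bf16(exact);
}
```

With a = b = 0x3F81 (1 + 2^-7) and c = 0xBF82 (-(1 + 2^-6)), the exact value of a*b + c is 2^-14. The two-rounding variant returns 0x0000 because the 2^-14 term is lost when the product is rounded, while the one-rounding variant returns 0x3880 (2^-14).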

sqrt uses a new iterator VF2HB extending VF2H with BF modes, connecting to the vsqrtnepbf16 instruction.

smax/smin get a new dedicated expand pattern in sse.md:

(define_expand "<code><mode>3"
  [(set (match_operand:VBF_AVX10_2 0 "register_operand")
        (smaxmin:VBF_AVX10_2
          (match_operand:VBF_AVX10_2 1 "nonimmediate_operand")
          (match_operand:VBF_AVX10_2 2 "nonimmediate_operand")))]
  "TARGET_AVX10_2_256"
  "ix86_fixup_binary_operands_no_copy (<CODE>, <MODE>mode, operands);")

Partial Vectors (V2BF/V4BF)

For small loops that don’t fill a full 128-bit register, GCC uses partial vectorization with V2BF (32-bit, 2 elements) and V4BF (64-bit, 4 elements). These are handled in mmx.md with a widening strategy – the partial vector is widened to V8BF, the full-width instruction is used, and the result is extracted back:

(define_mode_iterator VBF_32_64
  [(V4BF "TARGET_MMX_WITH_SSE") V2BF])

(define_expand "<insn><mode>3"
  [(set (match_operand:VBF_32_64 0 "register_operand")
        (plusminusmultdiv:VBF_32_64
          (match_operand:VBF_32_64 1 "nonimmediate_operand")
          (match_operand:VBF_32_64 2 "nonimmediate_operand")))]
  "TARGET_AVX10_2_256"
{
  rtx op0 = gen_reg_rtx (V8BFmode);
  rtx op1 = lowpart_subreg (V8BFmode, force_reg (<MODE>mode, operands[1]),
                            <MODE>mode);
  rtx op2 = lowpart_subreg (V8BFmode, force_reg (<MODE>mode, operands[2]),
                            <MODE>mode);
  emit_insn (gen_<insn>v8bf3 (op0, op1, op2));
  emit_move_insn (operands[0], lowpart_subreg (<MODE>mode, op0, V8BFmode));
  DONE;
})

This same approach is used for all operations: arithmetic, FMA, sqrt, and smax/smin. The widening itself is essentially free, because lowpart_subreg only reinterprets the existing register in V8BF mode rather than copying data. The upper lanes hold unspecified values, but the final lowpart extraction never reads them.
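
The widen, operate, extract strategy can be pictured with a small C model. This is a sketch for intuition only, with several assumptions: float lanes stand in for BF16 lanes, the types vec8 and vec4 and the functions vec8_add and vec4_add_via_vec8 are hypothetical names, and the upper lanes are zeroed to keep the C well defined, whereas in the RTL expansion they are simply left unspecified.

```c
#include <string.h>

enum { FULL_LANES = 8, PART_LANES = 4 };

/* Stand-ins for V8BF and V4BF; float lanes are used for simplicity. */
typedef struct { float lane[FULL_LANES]; } vec8;
typedef struct { float lane[PART_LANES]; } vec4;

/* Models the full-width instruction: operates on all eight lanes. */
static vec8 vec8_add(vec8 a, vec8 b)
{
    vec8 r;
    for (int i = 0; i < FULL_LANES; i++)
        r.lane[i] = a.lane[i] + b.lane[i];
    return r;
}

/* Models the partial-vector expander: widen, operate, take the low part. */
static vec4 vec4_add_via_vec8(vec4 a, vec4 b)
{
    vec8 wa = {{0}}, wb = {{0}};                    /* upper lanes are don't-care */
    memcpy(wa.lane, a.lane, sizeof a.lane);         /* place operand in low lanes */
    memcpy(wb.lane, b.lane, sizeof b.lane);

    vec8 wr = vec8_add(wa, wb);                     /* full-width operation */

    vec4 r;
    memcpy(r.lane, wr.lane, sizeof r.lane);         /* extract the low lanes */
    return r;
}
```

Calling vec4_add_via_vec8 on {1, 2, 3, 4} and {10, 20, 30, 40} gives {11, 22, 33, 44}. Whatever the upper four lanes of the intermediate result contain has no effect on the returned value.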

Test Cases

Each commit includes tests verifying correct instruction selection. For example, the full-width arithmetic test:

/* { dg-do compile } */
/* { dg-options "-mavx10.2 -O2" } */
/* { dg-final { scan-assembler-times "vaddnepbf16" 2 } } */
/* { dg-final { scan-assembler-times "vsubnepbf16" 2 } } */
/* { dg-final { scan-assembler-times "vmulnepbf16" 2 } } */
/* { dg-final { scan-assembler-times "vdivnepbf16" 2 } } */

typedef __bf16 v8bf __attribute__((vector_size (16)));
typedef __bf16 v16bf __attribute__((vector_size (32)));

v8bf addv8bf (v8bf a, v8bf b) { return a + b; }
v16bf addv16bf (v16bf a, v16bf b) { return a + b; }
v8bf subv8bf (v8bf a, v8bf b) { return a - b; }
v16bf subv16bf (v16bf a, v16bf b) { return a - b; }
v8bf mulv8bf (v8bf a, v8bf b) { return a * b; }
v16bf mulv16bf (v16bf a, v16bf b) { return a * b; }
v8bf divv8bf (v8bf a, v8bf b) { return a / b; }
v16bf divv16bf (v16bf a, v16bf b) { return a / b; }

The scan-assembler count of 2 per instruction verifies that both V8BF (128-bit) and V16BF (256-bit) variants are being generated. The partial-vector FMA test additionally verifies vfmadd132nepbf16 emission for V2BF/V4BF loops.

Co-authored-by: Hongtao Liu ([email protected])
