AVX10.2: Support FP16/BF16/FP8 convert instructions

Posted on 27 August 202425 March 2026 by [email protected]

This patch has been committed to the master branch:

2a046117a83 – AVX10.2: Support convert instructions

This patch adds GCC intrinsic support for the AVX10.2 conversion instructions – a new set of instructions for converting between FP16, BF16, FP8 (HF8/BF8), and FP32 formats with various rounding and saturation options. These instructions are the hardware backbone of mixed-precision AI inference and training pipelines on future Intel processors.

Why New Conversion Instructions?

Modern AI workloads use multiple floating-point precisions in a single pipeline. Weights might be stored in 8-bit float for memory efficiency, activations computed in FP16 or BF16 for speed, and critical accumulations done in FP32 for accuracy. Each precision boundary requires a format conversion, and on current hardware these conversions are often multi-instruction sequences. AVX10.2 makes them single instructions:

AVX10.2 Conversion Instruction Map showing paths between FP32, FP16, BF16, HF8, and BF8

The New Floating-Point Formats

AVX10.2 introduces two 8-bit float formats alongside the existing FP32, FP16, and BF16:

Format   Bits   Layout        Range              Use Case
------   ----   -----------   ----------------   --------
FP32     32     S:8:23        ~1.2e-38 to 3.4e38 Accumulation
FP16     16     S:5:10        ~6.1e-5 to 65504   Compute
BF16     16     S:8:7         ~1.2e-38 to 3.4e38 Compute (wider range)
HF8       8     S:4:3         ~0.0625 to 240     Weight storage
BF8       8     S:5:2         ~6.1e-8 to 57344   Weight storage (wider range)

HF8 (IEEE FP8 E4M3) has more mantissa precision, making it better for storing values that cluster around a narrow range. BF8 (IEEE FP8 E5M2) has more exponent range, making it better for values spanning many orders of magnitude. The choice depends on the layer’s weight distribution.

Instruction Categories

The conversion instructions fall into four categories:

1. Direct conversion (NE = Nearest Even)

// FP16 -> HF8 (nearest even rounding)
__m128i _mm_cvtneph_hf8(__m128h A);
__m128i _mm256_cvtneph_hf8(__m256h A);

// FP16 -> BF8
__m128i _mm_cvtneph_bf8(__m128h A);

// HF8 -> FP16 (exact, no rounding needed)
__m128h _mm_cvthf8_ph(__m128i A);
__m256h _mm256_cvthf8_ph(__m128i A);

2. Two-source packing (NE2 variants) – these take two source vectors and interleave the converted results into one destination:

// Two FP32 vectors -> one FP16 vector (2x packing)
__m128h _mm256_cvt2ps_phx(__m256 A, __m256 B);

// Two FP16 vectors -> one HF8 vector (2x packing)
__m128i _mm_cvtne2ph_hf8(__m128h A, __m128h B);
__m256i _mm256_cvtne2ph_hf8(__m256h A, __m256h B);

3. Biased conversion – adds a bias value before converting, enabling stochastic rounding for training:

// FP16 -> HF8 with bias (for stochastic rounding)
__m128i _mm_cvtbiasph_hf8(__m128i bias, __m128h A);
__m128i _mm_cvtbiasph_hf8s(__m128i bias, __m128h A);  // with saturation

4. Saturation variants (S suffix) – clamp out-of-range values to the maximum representable value instead of producing infinity or NaN:

// Without saturation: overflow -> infinity
__m128i _mm_cvtneph_bf8(__m128h A);

// With saturation: overflow -> max representable
__m128i _mm_cvtneph_bf8s(__m128h A);

Implementation Details

The implementation spans several files:

avx10_2convertintrin.h (978 lines) – 128-bit and 256-bit intrinsics with masking variants
avx10_2-512convertintrin.h (548 lines) – 512-bit variants
i386-builtin.def – 46 new builtin definitions
i386-builtin-types.def – new function type signatures for the mixed-width conversions
sse.md – machine description patterns with rounding control support

A key implementation challenge is handling the mixed-width nature of these conversions. When converting FP16 (16-bit) to HF8 (8-bit), the destination is half the width of the source. The intrinsic headers handle this by using wider integer types for the 8-bit results and zeroing the upper portion:

// 256-bit FP16 (16 elements) -> 128-bit HF8 (16 elements, half the bytes)
extern __inline __m128i
__attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm256_cvtneph_hf8 (__m256h __A)
{
  return (__m128i) __builtin_ia32_vcvtneph2hf8_256 ((__v16hf) __A);
}

Each instruction also supports masked variants (merge-masking and zero-masking) across all three vector widths (128-bit, 256-bit, 512-bit where applicable). The 512-bit forms additionally support embedded rounding control, allowing the programmer to specify the rounding mode per-instruction rather than modifying the global MXCSR register.

Test Coverage

The patch includes over 30 new test files covering each instruction variant. Compile-time tests verify correct instruction selection via scan-assembler directives, and runtime tests validate conversion results against reference implementations. The tests cover edge cases including denormals, infinities, NaN propagation, and saturation behavior.

Why New Conversion Instructions?

The New Floating-Point Formats

Instruction Categories

Implementation Details

Test Coverage

Leave a Reply Cancel reply