Support Intel AMX-FP8 – 8-bit float tile matrix multiply

This patch has been committed to the master branch:
dd859e93a16 – Support Intel AMX-FP8

This patch adds GCC support for Intel AMX-FP8, a new extension to the Advanced Matrix Extensions (AMX) that performs tile-based matrix multiplication with 8-bit floating-point (FP8) operands and FP32 accumulation. AMX-FP8 doubles the compute density compared to AMX-BF16 by packing twice…

x86: Fast-math float-to-BF16 truncation via PSRLD

This patch has been committed to the master branch:
8718727509b – x86: Implement Fast-Math Float Truncation to BF16 via PSRLD Instruction

This patch optimizes float to __bf16 conversion under -ffast-math by using a simple bit shift (psrld $16) instead of a dedicated conversion instruction, removing the requirement for AVX512-BF16 or AVX-NE-CONVERT hardware. This makes BF16…
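The shift works because BF16 has the same sign and exponent layout as the upper half of an IEEE-754 binary32 value, so dropping the low 16 bits yields a truncated BF16 bit pattern. The following is a minimal scalar sketch of that idea, not the code GCC generates; the helper names `float_to_bf16_trunc` and `bf16_bits_to_float` are hypothetical and exist only for illustration.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: keep the upper 16 bits of a binary32 value.
   This is the scalar analogue of a logical right shift by 16 on each
   32-bit lane. It truncates (rounds toward zero); it does not round
   to nearest. */
uint16_t float_to_bf16_trunc(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);   /* type-pun without aliasing issues */
    return (uint16_t)(bits >> 16);
}

/* Hypothetical helper: widen BF16 bits back to float by placing them
   in the upper half and zero-filling the lower 16 bits. */
float bf16_bits_to_float(uint16_t h)
{
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

Because the low mantissa bits are simply discarded, a value such as 1.005859375f maps to the same BF16 pattern as 1.0f, whereas a round-to-nearest conversion would round it up. That loss of exact rounding is presumably why the shortcut is tied to -ffast-math.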

x86: Extend AVX512 vectorized popcount to small vector modes

This patch has been committed to the master branch:
85910e650a6 – x86: Extend AVX512 Vectorization for Popcount in Various Modes

This patch enables vectorized popcount (population count / bit counting) for small vector modes that were previously unhandled: V2QI, V4QI, V8QI, V2HI, V4HI, and V2SI. These are the partial-vectorization modes used when loop trip counts…
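As an illustration of the kind of source loop this concerns, here is a small sketch with a fixed trip count of 8 over bytes, which corresponds to one V8QI-sized unit of work. The function name `popcount_u8x8` is hypothetical, and whether the loop is actually vectorized depends on the GCC version and target options, which the excerpt does not specify.

```c
#include <stdint.h>

/* Hypothetical example: count the set bits in each of 8 bytes.
   With only 8 byte elements there is not enough work to fill a full
   128-bit vector, so this is a candidate for a partial vector mode. */
void popcount_u8x8(uint8_t *dst, const uint8_t *src)
{
    for (int i = 0; i < 8; i++)
        dst[i] = (uint8_t)__builtin_popcount(src[i]);
}
```

The same shape with 2 or 4 iterations, or with 16-bit and 32-bit elements, would correspond to the other modes named above.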

i386: Integrate BFmode in ix86_preferred_simd_mode for auto-vectorization

This patch has been committed to the master branch:
b851bce473d – i386: Integrate BFmode for Enhanced Vectorization in ix86_preferred_simd_mode

This small but important patch tells GCC’s auto-vectorizer how to choose the best SIMD register width for BF16 operations, enabling automatic vectorization of BF16 loops without manual intrinsics. It’s the final piece that connects all the…

i386: Native BF16 comparisons with AVX10.2 – vcmppbf16 and vcomsbf16

These patches have been committed to the master branch:
f77435aa391 – i386: Support vec_cmp for V8BF/V16BF/V32BF in AVX10.2
89d50c45048 – i386: Enable V2BF/V4BF vec_cmp with AVX10.2 vcmppbf16
61622cfa463 – i386: Utilize VCOMSBF16 for BF16 Comparisons with AVX10.2

These three patches enable native BF16 comparison support in GCC’s x86 backend using AVX10.2 instructions. Together they cover…

i386: Vectorized BF16 arithmetic with AVX10.2 – add/sub/mul/div/FMA/sqrt/smaxmin

These patches have been committed to the master branch:
8e16f26ca9f – i386: Support partial vectorized V2BF/V4BF plus/minus/mult/div/sqrt
62df24e5003 – i386: Support partial vectorized V2BF/V4BF smaxmin
f82fa0da4d9 – i386: Support vectorized BF16 add/sub/mul/div with AVX10.2
6d294fb8ac9 – i386: Support vectorized BF16 FMA with AVX10.2
29ef601973d – i386: Support vectorized BF16 smaxmin with AVX10.2
e19f65b0be1 – i386:…

AVX10.2: Support FP16/BF16/FP8 convert instructions

This patch has been committed to the master branch:
2a046117a83 – AVX10.2: Support convert instructions

This patch adds GCC intrinsic support for the AVX10.2 conversion instructions – a new set of instructions for converting between FP16, BF16, FP8 (HF8/BF8), and FP32 formats with various rounding and saturation options. These instructions are the hardware backbone of…

x86: BF16 bitwise and sign-manipulation ops for all vector widths

These patches have been committed to the master branch:
f3f9e4ee764 – x86: Support bitwise and/andnot/abs/neg/copysign/xorsign op for V8BF/V16BF/V32BF
d0c86be1ce7 – i386: Support partial signbit/xorsign/copysign/abs/neg/and/xor/ior/andn for V2BF/V4BF

These two patches enable BF16 (Brain Floating-Point 16) bitwise and sign-manipulation operations across all vector widths on x86. The first commit handles full-width vectors (V8BF, V16BF, V32BF) while the…
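Sign manipulation on BF16 needs no floating-point arithmetic, because the sign is the top bit of the 16-bit pattern and the operations reduce to bitwise AND, OR, and XOR with a mask. The sketch below models this per element on raw BF16 bit patterns; the function names and the `uint16_t` representation are assumptions made for illustration and are not taken from the patches.

```c
#include <stdint.h>

/* BF16 layout: 1 sign bit, 8 exponent bits, 7 mantissa bits. */
#define BF16_SIGN_MASK 0x8000u
#define BF16_MAG_MASK  0x7FFFu

/* abs: clear the sign bit. */
uint16_t bf16_abs(uint16_t a)
{
    return (uint16_t)(a & BF16_MAG_MASK);
}

/* neg: flip the sign bit. */
uint16_t bf16_neg(uint16_t a)
{
    return (uint16_t)(a ^ BF16_SIGN_MASK);
}

/* copysign: magnitude of a, sign of b. */
uint16_t bf16_copysign(uint16_t a, uint16_t b)
{
    return (uint16_t)((a & BF16_MAG_MASK) | (b & BF16_SIGN_MASK));
}

/* xorsign: a with its sign flipped when b is negative. */
uint16_t bf16_xorsign(uint16_t a, uint16_t b)
{
    return (uint16_t)(a ^ (b & BF16_SIGN_MASK));
}
```

For example, 0x3F80 is the BF16 pattern for 1.0 and 0xBF80 for -1.0, so `bf16_neg(0x3F80)` gives 0xBF80 and `bf16_abs(0xBF80)` gives 0x3F80.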

x86: Emit vcvtne2ps2bf16 for odd-element BF16 shuffle

This patch has been committed to the master branch:
6d0b7b69d14 – Emit cvtne2ps2bf16 for odd increasing perm in __builtin_shufflevector

This patch teaches GCC’s x86 backend to recognize a specific BF16 vector shuffle pattern – selecting every odd element from two concatenated vectors – and lower it to the vcvtne2ps2bf16 instruction instead of a general-purpose byte…
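To make the permutation concrete, the sketch below spells out "every odd element from two concatenated vectors" as a scalar loop over 16-bit elements. The 4-element width and the name `select_odd_u16x4` are illustrative assumptions; in `__builtin_shufflevector` terms this width would correspond to the index list 1, 3, 5, 7. It models only the element selection, not the instruction the patch emits.

```c
#include <stdint.h>

/* Hypothetical model: out[i] = concat(a, b)[2*i + 1].
   Indices 0..3 of the concatenation refer to a, indices 4..7 to b. */
void select_odd_u16x4(uint16_t out[4], const uint16_t a[4], const uint16_t b[4])
{
    for (int i = 0; i < 4; i++) {
        int idx = 2 * i + 1;                    /* 1, 3, 5, 7 */
        out[i] = idx < 4 ? a[idx] : b[idx - 4]; /* pick from a, then b */
    }
}
```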

x86: 3-instruction byte-swap vector shuffle for V16QI/V8QI [PR107563]

These patches have been committed to the master branch:
a71f90c5a7a – Add 3-instruction subroutine vector shift for V16QI
0022064649d – Fix Logical Shift Issue in expand_vec_perm_psrlw_psllw_por

Bugzilla: PR target/107563, PR target/115146

This pair of patches adds an efficient 3-instruction sequence for byte-swap permutations in V16QI and V8QI vectors on x86. Instead of falling through to…
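The function name expand_vec_perm_psrlw_psllw_por suggests the three instructions are a logical right shift, a logical left shift, and an OR on 16-bit lanes. The sketch below is a scalar model of that idea under the assumption that both shift counts are 8, which swaps the two bytes within each 16-bit lane; the name `bswap16_lanes` is hypothetical and the code is not taken from the patches.

```c
#include <stdint.h>

/* Hypothetical scalar model of a shift/shift/or byte swap.
   Each 16-bit lane holds two adjacent bytes of the byte vector. */
void bswap16_lanes(uint16_t *v, int n)
{
    for (int i = 0; i < n; i++) {
        uint16_t hi = (uint16_t)(v[i] >> 8); /* logical shift right: high byte moves down */
        uint16_t lo = (uint16_t)(v[i] << 8); /* shift left: low byte moves up */
        v[i] = (uint16_t)(hi | lo);          /* combine the two halves */
    }
}
```

Using unsigned lanes matters here: the right shift has to be logical so that no sign bits are smeared into the upper byte before the OR.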