x86: BF16 bitwise and sign-manipulation ops for all vector widths

Posted on 9 July 2024, updated 25 March 2026 by [email protected]

These patches have been committed to the master branch:

f3f9e4ee764 – x86: Support bitwise and/andnot/abs/neg/copysign/xorsign op for V8BF/V16BF/V32BF

d0c86be1ce7 – i386: Support partial signbit/xorsign/copysign/abs/neg/and/xor/ior/andn for V2BF/V4BF

These two patches enable BF16 (Brain Floating-Point 16) bitwise and sign-manipulation operations across all vector widths on x86. The first commit handles full-width vectors (V8BF, V16BF, V32BF) while the second extends support to partial vectors (V2BF, V4BF). Together, they let GCC generate efficient SIMD code for abs, neg, copysign, xorsign, and raw bitwise operations on BF16 vectors without round-tripping through FP32 conversions.

Background: Why Bitwise Ops Matter for BF16

BF16 (bfloat16) uses the same 8-bit exponent, and therefore the same range, as IEEE 754 single-precision float, but keeps only 7 explicit mantissa bits instead of 23. This makes it well suited to machine learning workloads where full FP32 precision isn’t needed. GCC already supported BF16 arithmetic through conversion to FP32, but bitwise operations – which work on the raw bit pattern, sign bit and mantissa included – needed explicit support.
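To make the layout concrete, here is a minimal C sketch of the relationship between the two formats. The helper names f32_to_bf16_trunc and bf16_to_f32 are hypothetical and not part of GCC or the patches. The narrowing helper simply truncates, whereas real conversions normally round to nearest even.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: narrow FP32 to BF16 by keeping the top 16 bits
   (sign, 8 exponent bits, 7 mantissa bits). This truncates; it is an
   illustration only, not how a rounding conversion behaves. */
static uint16_t f32_to_bf16_trunc(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);   /* reinterpret without aliasing issues */
    return (uint16_t)(bits >> 16);
}

/* Hypothetical helper: widen BF16 to FP32 by appending 16 zero bits.
   This direction is always exact. */
static float bf16_to_f32(uint16_t h)
{
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

Because the sign bit sits at the top of both formats, bit 15 of the BF16 pattern is the same bit as bit 31 of the widened FP32 value.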

Operations like abs, neg, copysign, and xorsign are bitwise at the hardware level: abs clears the sign bit, neg flips it, copysign copies the sign from one operand, and xorsign conditionally flips sign based on another value’s sign. These can all be implemented with AND, ANDNOT, OR, and XOR on the raw bit representation – no floating-point unit needed.

[Figure: BF16 bitwise sign manipulation operations]

How Sign Manipulation Maps to SIMD Instructions

Each sign operation reduces to between one and three bitwise SIMD instructions plus a constant mask:

abs(x)          = x AND 0x7FFF          ;  VANDPS  (clear sign bit)
neg(x)          = x XOR 0x8000          ;  VXORPS  (flip sign bit)
copysign(x, y)  = (x AND 0x7FFF)        ;  VANDNPS + VANDPS + VORPS
                  OR (y AND 0x8000)      ;  (keep x magnitude, y sign)
xorsign(x, y)   = x XOR (y AND 0x8000)  ;  VANDPS + VXORPS
                                         ;  (flip x sign if y negative)

The mask 0x7FFF has bit 15 (the sign bit) cleared and all other bits set; 0x8000 has only the sign bit set. Both masks are compile-time constants replicated across every lane, so each operation costs only one to three bitwise instructions at runtime, with no memory traffic beyond loading the operands and the mask.
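The four identities can be checked on scalar bit patterns. The following is a sketch in plain C that treats a BF16 value as a uint16_t. The function and macro names are hypothetical and chosen for illustration; they are not GCC internals.

```c
#include <stdint.h>

#define BF16_SIGN 0x8000u  /* only bit 15 (the sign bit) set */
#define BF16_MAG  0x7FFFu  /* every bit except the sign bit  */

/* abs: clear the sign bit. */
static uint16_t bf16_abs(uint16_t x)
{
    return (uint16_t)(x & BF16_MAG);
}

/* neg: flip the sign bit. */
static uint16_t bf16_neg(uint16_t x)
{
    return (uint16_t)(x ^ BF16_SIGN);
}

/* copysign: magnitude of x, sign of y. */
static uint16_t bf16_copysign(uint16_t x, uint16_t y)
{
    return (uint16_t)((x & BF16_MAG) | (y & BF16_SIGN));
}

/* xorsign: flip the sign of x when the sign bit of y is set. */
static uint16_t bf16_xorsign(uint16_t x, uint16_t y)
{
    return (uint16_t)(x ^ (y & BF16_SIGN));
}
```

For example, 0xBF80 is the BF16 pattern for -1.0 and 0x3F80 is +1.0, so bf16_abs(0xBF80) yields 0x3F80 and bf16_neg(0x3F80) yields 0xBF80. The vector forms apply the same expression to every 16-bit lane at once.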

Full-Width Vectors (V8BF/V16BF/V32BF)

The first patch extends three key functions in i386-expand.cc to recognize BF16 vector modes:

ix86_expand_fp_absneg_operator – handles abs and neg by masking the sign bit
ix86_expand_copysign – implements copysign via sign-bit extraction and insertion
ix86_expand_xorsign – implements xorsign by XORing sign bits

In sse.md, BF16 modes are added to the relevant mode iterators. This is the key mechanism: GCC’s machine description uses mode iterators to define a single pattern that expands to multiple concrete instruction patterns. By adding BF16 modes to the VF_BHSD iterator, all existing bitwise float patterns automatically gain BF16 support:

;; VF_BHSD - all BF/HF/SF/DF vector modes for bitwise float ops
[(V32BF "TARGET_AVX512F && TARGET_EVEX512")
 (V16BF "TARGET_AVX") (V8BF "TARGET_SSE2")
 (V32HF "TARGET_AVX512F && TARGET_EVEX512")
 (V16HF "TARGET_AVX") (V8HF "TARGET_SSE2")
 (V16SF "TARGET_AVX512F && TARGET_EVEX512") (V8SF "TARGET_AVX") V4SF
 (V8DF "TARGET_AVX512F && TARGET_EVEX512")
 (V4DF "TARGET_AVX") (V2DF "TARGET_SSE2")]

The existing patterns for andnot, and, xor, and ior already work generically through this iterator – by adding BF16 modes, all these operations are enabled automatically. The condition guards were also generalized from checking HFmode specifically to checking the scalar size (<ssescalarsize> != 16), so that HF and BF elements are covered uniformly.

Partial Vectors (V2BF/V4BF)

The second patch handles the smaller V2BF (32-bit) and V4BF (64-bit) modes used in partial vectorization. When GCC vectorizes a loop with only 2 or 4 BF16 values, it uses these narrow modes rather than padding to a full 128-bit register. The partial vector support is managed through mmx.md and reuses the existing ix86_build_const_vector and ix86_build_signbit_mask infrastructure, which was extended to recognize the new BF modes:

   case E_V2HFmode:
+  case E_V2BFmode:
     half_mode = E_HFmode;
     n = 2;
     break;
   case E_V4HFmode:
+  case E_V4BFmode:
     half_mode = E_HFmode;
     n = 4;
     break;

The mmx_pabs<mode>2 and mmx_neg<mode>2 patterns in mmx.md were extended to match V2BF and V4BF modes. They use the same AND/XOR-with-mask approach, but operate on only 32 or 64 bits of data rather than a full 128-bit XMM register.
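The effect on a 64-bit partial vector can be sketched with the GCC/Clang vector_size extension. This example deliberately uses uint16_t lanes as a stand-in for V4BF bit patterns so that it compiles without BF16 support. The type v4u16 and the functions abs_v4 and neg_v4 are hypothetical names, not the patterns from mmx.md.

```c
#include <stdint.h>

/* Four 16-bit lanes in 64 bits, standing in for the raw bits of V4BF.
   Relies on the GCC/Clang vector_size extension. */
typedef uint16_t v4u16 __attribute__((vector_size(8)));

/* Lane-wise abs: AND every lane with 0x7FFF. */
static v4u16 abs_v4(v4u16 x)
{
    const v4u16 mask = {0x7FFF, 0x7FFF, 0x7FFF, 0x7FFF};
    return x & mask;
}

/* Lane-wise neg: XOR every lane with 0x8000. */
static v4u16 neg_v4(v4u16 x)
{
    const v4u16 mask = {0x8000, 0x8000, 0x8000, 0x8000};
    return x ^ mask;
}
```

The mask is the same constant replicated across four lanes, which is the role the sign-bit mask infrastructure plays for the real V2BF and V4BF modes.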

What the Compiler Generates

For a simple abs on V8BF, GCC now generates:

absv8bf:
    vandps  xmm0, xmm0, XMMWORD PTR .LC0[rip]   ; AND with 0x7FFF mask
    ret
.LC0:
    .long   2147450879   ; 0x7FFF7FFF repeated

Without these patches, GCC would have to widen each BF16 element to FP32, apply the abs operation, and narrow the result back to BF16 (for example with vcvtneps2bf16) – a much more expensive sequence than a single bitwise AND.
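The decimal constant in the .LC0 entry above is easier to read once it is split into lanes. The sketch below uses a hypothetical helper, pack_lanes, to show how two 16-bit lane masks combine into one 32-bit word.

```c
#include <stdint.h>

/* Hypothetical helper: pack two 16-bit lane masks into one 32-bit word,
   low lane in the low half, as a little-endian .long would store them. */
static uint32_t pack_lanes(uint16_t lo, uint16_t hi)
{
    return (uint32_t)lo | ((uint32_t)hi << 16);
}
```

Packing two 0x7FFF lanes gives 0x7FFF7FFF, which is 2147450879 in decimal, the value emitted in the listing. The sign-bit mask used for neg would pack to 0x80008000 in the same way.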

Test Cases

Runtime tests verify abs and neg for V8BF (AVX2) and V32BF (AVX-512), using both positive and negative BF16 values including edge cases:

/* { dg-do run { target avx2 } } */
/* { dg-options "-O1 -mavx512bf16 -fdump-tree-vect-details -fdump-tree-optimized" } */

typedef __bf16 v8bf __attribute__((vector_size(16)));

v8bf
absv8bf (v8bf a)
{
  return __builtin_elementwise_abs (a);
}

v8bf
negv8bf (v8bf a)
{
  return -a;
}

Scan-assembler tests also verify that the compiler emits the expected bitwise instructions rather than falling back to FP32 conversion paths, across all supported vector widths from V2BF through V32BF.