x86: Extend AVX512 vectorized popcount to small vector modes

This patch has been committed to the master branch:

85910e650a6 – x86: Extend AVX512 Vectorization for Popcount in Various Modes


This patch enables vectorized popcount (population count / bit counting) for small vector modes that were previously unhandled: V2QI, V4QI, V8QI, V2HI, V4HI, and V2SI. These are the partial-vectorization modes used when loop trip counts are too small for full 128-bit or wider vectors.


Background: Partial Vectorization

GCC already supported vectorized popcount for full-width vectors (V16QI, V8HI, V4SI, etc.) using AVX512-BITALG (vpopcntb/vpopcntw) and AVX512-VPOPCNTDQ (vpopcntd/vpopcntq). However, when the vectorizer encountered small loops – like a 2-element int array or a 4-element short array – it couldn’t vectorize the popcount because no patterns existed for the partial vector modes in mmx.md.

Partial vectorization is GCC’s approach to handling loops where the trip count doesn’t fill a full SIMD register. Instead of falling back to scalar code, GCC uses narrow vector modes (V2SI = 64-bit, V4QI = 32-bit, etc.) that occupy the lower portion of an XMM register. The hardware instruction is the same – vpopcntd still operates on a full XMM register – but only the lower lanes contain meaningful data.

Diagram showing AVX-512 popcount partial vector modes

The New Patterns

Three new define_insn patterns are added to mmx.md, one for each element width:

;; Byte popcount: V2QI, V4QI, V8QI -> vpopcntb
(define_insn "popcount<mode>2"
  [(set (match_operand:VI1_16_32_64 0 "register_operand" "=v")
        (popcount:VI1_16_32_64
          (match_operand:VI1_16_32_64 1 "register_operand" "v")))]
  "TARGET_AVX512VL && TARGET_AVX512BITALG"
  "vpopcntb\t{%1, %0|%0, %1}")

;; Halfword popcount: V2HI, V4HI -> vpopcntw
(define_insn "popcount<mode>2"
  [(set (match_operand:VI2_32_64 0 "register_operand" "=v")
        (popcount:VI2_32_64
          (match_operand:VI2_32_64 1 "register_operand" "v")))]
  "TARGET_AVX512VL && TARGET_AVX512BITALG"
  "vpopcntw\t{%1, %0|%0, %1}")

;; Doubleword popcount: V2SI -> vpopcntd
(define_insn "popcountv2si2"
  [(set (match_operand:V2SI 0 "register_operand" "=v")
        (popcount:V2SI
          (match_operand:V2SI 1 "register_operand" "v")))]
  "TARGET_AVX512VPOPCNTDQ && TARGET_AVX512VL && TARGET_MMX_WITH_SSE"
  "vpopcntd\t{%1, %0|%0, %1}")

A new mode iterator VI1_16_32_64 groups the 2-byte, 4-byte, and 8-byte QI vector modes. Similarly, VI2_32_64 groups V2HI and V4HI. These iterators let a single pattern definition expand to cover all partial widths for that element size.

Note the different ISA requirements: byte and halfword popcount need AVX512-BITALG (which provides vpopcntb/vpopcntw), while doubleword popcount needs AVX512-VPOPCNTDQ (which provides vpopcntd/vpopcntq). Both require AVX512VL for the 128-bit register forms.

Why This Matters for Real Code

Popcount is used extensively in bioinformatics (Hamming distance for DNA comparison), cryptography, chess engines (bitboard move generation), and data structures (rank queries in succinct data structures). Many of these use cases operate on small fixed-size arrays – exactly the scenario where partial vectorization kicks in.

For example, counting bits in a pair of 32-bit integers:

void popcount_pair(int *a, int *__restrict b) {
  for (int i = 0; i != 2; i++)
    a[i] = __builtin_popcount(b[i]);
}

Before this patch: two scalar popcnt instructions. After: one vpopcntd instruction operating on both values simultaneously in a single XMM register.

Test Case

/* { dg-do compile } */
/* { dg-options "-O2 -mavx512vpopcntdq -mavx512bitalg -mavx512vl" } */
/* { dg-final { scan-assembler-times "vpopcntd" 1 { target { ! ia32 } } } } */
/* { dg-final { scan-assembler-times "vpopcntw" 2 { target { ! ia32 } } } } */
/* { dg-final { scan-assembler-times "vpopcntb" 3 { target { ! ia32 } } } } */

void foo1 (int* a, int* __restrict b) {
  for (int i = 0; i != 2; i++)
    a[i] = __builtin_popcount (b[i]);
}

void foo2 (unsigned short* a, unsigned short* __restrict b) {
  for (int i = 0; i != 4; i++)
    a[i] = __builtin_popcount (b[i]);
}

void foo4 (unsigned char* a, unsigned char* __restrict b) {
  for (int i = 0; i != 8; i++)
    a[i] = __builtin_popcount (b[i]);
}

The scan-assembler counts verify that each element-size variant generates the correct AVX-512 popcount instruction. The ia32 target check excludes 32-bit mode where register allocation differs.


Leave a Reply

Your email address will not be published. Required fields are marked *