Optimize nested permutation to single VEC_PERM_EXPR [PR54346]

Posted on 24 September 202224 March 2026 by [email protected]

These patches have been committed to the master branch:

b88adba751d – Optimize nested permutation to single VEC_PERM_EXPR

35106383c09 – Move scanning pass of forwprop-19.c to dse1

Bugzilla: PR tree-optimization/54346, PR testsuite/107220

This patch addresses PR 54346, a long-standing optimization opportunity in GCC’s middle-end tree optimization. When two consecutive vector permutations are nested – the result of the first feeding into both operands of the second – the two can be folded into a single VEC_PERM_EXPR by composing their selector indices. This eliminates an entire vector permutation instruction and the intermediate register it produces.

Background: What is VEC_PERM_EXPR?

In GCC’s GIMPLE intermediate representation, VEC_PERM_EXPR <op0, op1, selector> represents a vector permutation (shuffle) operation. It takes two input vectors op0 and op1, conceptually concatenates them into a doubled-length array, and uses the selector vector to pick elements from this combined space. For example, with 4-element integer vectors:

Indices 0-3 select from the first operand op0
Indices 4-7 select from the second operand op1

This operation is the backbone of all SIMD shuffle instructions across architectures – it maps to vperm2i128/vshufps/vpshufb on x86, tbl/trn/zip on AArch64, and equivalent instructions on other targets.

The Problem

Consider the following pattern in GIMPLE IR:

c = VEC_PERM_EXPR <a, b, VCST0>;
d = VEC_PERM_EXPR <c, c, VCST1>;

The second permutation uses c as both its input operands. This means every element of d is selected from c, which is itself just a rearrangement of elements from a and b. The two-hop index chain can be collapsed into a single hop.

Before and after: two VEC_PERM_EXPR collapsed into one — Two nested permutations (left) fold into one (right), eliminating the intermediate vector c.

By computing a new composed selector, we can express the same result as a single operation directly on the original inputs:

d = VEC_PERM_EXPR <a, b, NEW_VCST>;

This eliminates one permutation instruction entirely, removes the intermediate vector c (which may free a register), and reduces both instruction count and register pressure.

How Index Composition Works

The key operation is composing two selector vectors. For each position i in the result:

Look up VCST1[i] to find which element of c is selected
Follow that through VCST0 to find which element of the original (a, b) space that c element came from
That final index becomes NEW_VCST[i]

In formula form: NEW_VCST[i] = VCST0[ VCST1[i] ]. Here is a concrete example:

Step-by-step index composition trace — Each element of the composed selector is computed by looking up the outer index in the inner selector.

In this example, VCST0 = {0, 5, 2, 7} and VCST1 = {2, 0, 3, 1}. The composed result is {2, 0, 7, 5}, which directly references elements from the original a (indices 0-3, blue) and b (indices 4-7, yellow) without the intermediate vector.

The Implementation

The optimization is implemented as a match.pd simplification rule. GCC’s match.pd is a domain-specific language for expressing tree-level pattern-matching transformations. Rules written in match.pd are automatically applied during forward propagation (forwprop) and other optimization passes that invoke the pattern matcher.

Here is how match.pd works at a high level: you specify a pattern (what to match in the IR) and a replacement (what to transform it into), with an optional with block for C++ computation in between. The framework handles traversal – when it encounters a tree node matching the pattern, it runs the replacement logic.

The core rule:

/* Merge
   c = VEC_PERM_EXPR <a, b, VCST0>;
   d = VEC_PERM_EXPR <c, c, VCST1>;
   to
   d = VEC_PERM_EXPR <a, b, NEW_VCST>;  */

(simplify
 (vec_perm (vec_perm@0 @1 @2 VECTOR_CST@3) @0 VECTOR_CST@4)
 (with
  {
    if (!TYPE_VECTOR_SUBPARTS (type).is_constant ())
      return NULL_TREE;

    tree op0;
    machine_mode result_mode = TYPE_MODE (type);
    machine_mode op_mode = TYPE_MODE (TREE_TYPE (@1));
    int nelts = TYPE_VECTOR_SUBPARTS (type).to_constant ();
    vec_perm_builder builder0;
    vec_perm_builder builder1;
    vec_perm_builder builder2 (nelts, nelts, 1);

    if (!tree_to_vec_perm_builder (&builder0, @3)
        || !tree_to_vec_perm_builder (&builder1, @4))
      return NULL_TREE;

    vec_perm_indices sel0 (builder0, 2, nelts);
    vec_perm_indices sel1 (builder1, 1, nelts);

    for (int i = 0; i < nelts; i++)
      builder2.quick_push (sel0[sel1[i].to_constant ()]);

    vec_perm_indices sel2 (builder2, 2, nelts);

    if (!can_vec_perm_const_p (result_mode, op_mode, sel2, false))
      return NULL_TREE;

    op0 = vec_perm_indices_to_tree (TREE_TYPE (@4), sel2);
  }
  (vec_perm @1 @2 { op0; })))

Dissecting the Code

Let’s walk through each part of the implementation:

Pattern matching: (vec_perm (vec_perm@0 @1 @2 VECTOR_CST@3) @0 VECTOR_CST@4)

@0 captures the inner VEC_PERM_EXPR result. It appears as both the first and second operand of the outer permutation – this is the crucial constraint that both inputs are the same vector c.
@1 and @2 are the original input vectors (a and b).
VECTOR_CST@3 and VECTOR_CST@4 capture the two selector constants. Both must be compile-time constant vectors for this optimization to apply.

Guard checks:

TYPE_VECTOR_SUBPARTS(type).is_constant() – Only constant-length vectors are handled. Variable-length vectors (as used in RISC-V V extension or SVE) would require a different approach since the number of elements isn’t known at compile time.
tree_to_vec_perm_builder – Extracts the selector indices from the VECTOR_CST tree nodes into vec_perm_builder objects that can be queried programmatically.

Index composition:

sel0 is constructed with ninputs=2 and sel1 with ninputs=1. This is the key asymmetry: the inner permutation selects from two distinct vectors (a, b) spanning indices 0 to 2*nelts-1, while the outer permutation selects from a single vector (c, c) so its indices wrap modulo nelts.
The composition loop builder2.quick_push(sel0[sel1[i].to_constant()]) performs the two-hop lookup for each element position.

Target validation:

can_vec_perm_const_p(result_mode, op_mode, sel2, false) asks the target backend whether it can actually implement the composed permutation in a single instruction (or acceptable instruction sequence). Not all permutations are cheap on all targets – for instance, crossing 128-bit lane boundaries on pre-AVX2 x86 is expensive. If the target can’t efficiently execute the combined permutation, the optimization is skipped, and the two-permutation sequence is kept.

Test Cases

The new test case gcc.dg/pr54346.c validates the optimization directly:

/* { dg-do compile } */
/* { dg-options "-O -fdump-tree-dse1" } */

typedef int veci __attribute__ ((vector_size (4 * sizeof (int))));

void fun (veci a, veci b, veci *i)
{
  veci c = __builtin_shuffle (a, b, (veci) {1, 4, 2, 7});
  *i = __builtin_shuffle (c, (veci) { 7, 2, 1, 5 });
}

/* { dg-final { scan-tree-dump "VEC_PERM_EXPR.*{ 3, 6, 0, 0 }" "dse1" } } */
/* { dg-final { scan-tree-dump-times "VEC_PERM_EXPR" 1 "dse1" } } */

The first shuffle selects elements {1, 4, 2, 7} from the concatenation of a (indices 0-3) and b (indices 4-7), producing intermediate vector c. The second shuffle then picks elements {7, 2, 1, 5} from (c, c).

After the optimization, GCC collapses these two shuffles into a single VEC_PERM_EXPR with the composed selector { 3, 6, 0, 0 }, directly referencing elements from the original a and b. The test verifies two things:

The composed index vector { 3, 6, 0, 0 } appears in the tree dump
Only one VEC_PERM_EXPR remains (the intermediate was eliminated)

Impact on Existing Tests

The new match.pd rule also changed where the existing forwprop-19.c test’s permutation folding occurs. This test had two nested shuffles that compose into an identity permutation – previously eliminated during forwprop1. With the new composed-permutation rule, the fold now happens earlier and the dead result is cleaned up by dse1 (Dead Store Elimination, pass 1) instead.

A follow-up commit (PR 107220) updated the test expectations accordingly:

-/* { dg-options "-O -fdump-tree-forwprop1" } */
+/* { dg-options "-O -fdump-tree-dse1" } */
 ...
-/* { dg-final { scan-tree-dump-not "VEC_PERM_EXPR" "forwprop1" } } */
+/* { dg-final { scan-tree-dump-not "VEC_PERM_EXPR" "dse1" } } */

This is a common consequence of adding new optimization rules in GCC’s pipeline – earlier folding shifts where downstream passes observe the results, requiring test scan targets to be updated.

Scope and Limitations

This optimization specifically handles the case where the outer permutation’s both operands are the same inner permutation result. Other nesting patterns – such as two different permutations feeding into the outer VEC_PERM_EXPR, or the inner result appearing as only one operand – are not covered by this rule and would require separate patterns.

The constant-length vector restriction (is_constant() guard) means this does not apply to scalable vectors used in architectures like AArch64 SVE or RISC-V V. Extending to variable-length vectors is possible but would need runtime index computation rather than compile-time folding.

Co-authored-by: Hongtao Liu ([email protected])