https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114087

            Bug ID: 114087
           Summary: RISC-V optimization on checking certain bits set ((x &
                    mask) == val)
           Product: gcc
           Version: 13.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: Explorer09 at gmail dot com
  Target Milestone: ---

It might be common in the C family of languages to check if certain bits are
set in an integer with a code pattern like this:

```c
unsigned int x;
if ((x & 0x3000) == 0x1000) {
  // Do something...
}
```

And I am surprised when compilers like GCC and Clang didn't realize they can
use some bit shifts and inversions of bit masks to save some instructions and
emit smaller code.

Here I present 3 possible optimizations that could be implemented in a
compiler. Two of them can apply not only to RISC-V, but other RISC
architectures as well (except ARM, perhaps). The last one is specific to RISC-V
due to the 20-bit immediate operand of its "lui" (load upper immediate)
instruction.

The bit masks should be compile time constants, and the "-Os" flag (optimize
for size) is assumed.

### Test code

The example code and constants are crafted specifically for RISC-V.

Each group of `pred*` functions should function identically (if not, please let
me know; it might be a typo).

* The "a" variants are what I commonly write for checking the set bits.
* The "b" variants are what I believe the compiler should ideally transform the
code to. I wrote them to let compiler developers know how the optimization can
be done. (But in practice the "b" code might transform to "a", meaning the
"optimization" direction reversed.)
* The "c" variants are hacks to make things work. They contain `__asm__
volatile` directives to force GCC or Clang to optimize in the direction I want.
The generated assembly should present what I considered the ideal result.

```c
#include <stdbool.h>
#include <stdint.h>
#define POWER_OF_TWO_FACTOR(x) ((x) & -(x))

// -------------------------------------------------------------------
// Example 1: The bitwise AND mask contains lower bits in all ones.
// By converting the bitwise AND into a bitwise OR, an "addi"
// instruction can be saved.
// (This might conflict with optimizations utilizing RISC-V "bclri"
// instruction; use one or the other.)
// (In ARM there are "bic" instructions already, making this
// optimization useless.)

static uint32_t mask1 = 0x55555FFF;
static uint32_t val1  = 0x14501DEF;
// static_assert((mask1 & val1) == val1);
// static_assert((mask1 & 0xFFF) == 0xFFF);

bool pred1a(uint32_t x) {
  return ((x & mask1) == val1);
}

bool pred1b(uint32_t x) {
  return ((x | ~mask1) == (val1 | ~mask1));
}

bool pred1c(uint32_t x) {
  register uint32_t temp = x | ~mask1;
  __asm__ volatile ("" : "+r" (temp));
  return (temp == (val1 | ~mask1));
}

// -------------------------------------------------------------------
// Example 2: The bitwise AND mask could fit an 11-bit immediate
// operand of RISC-V "andi" instruction with a help of right
// shifting. (Keep the sign bit of the immediate operand zero.)
// (This kind of optimization could also work with other RISC 
// architectures, except ARM.)

static uint32_t mask2 = 0x55500000;
static uint32_t val2  = 0x14500000;

// static_assert(mask2 != 0);
// static_assert((mask2 & val2) == val2);
// static_assert(mask2 / POWER_OF_TWO_FACTOR(mask2) <= 0x7FF);

bool pred2a(uint32_t x) {
  return ((x & mask2) == val2);
}

bool pred2b(uint32_t x) {
  uint32_t factor = POWER_OF_TWO_FACTOR(mask2);
  return ((x >> __builtin_ctz(factor)) & (mask2 / factor))
    == (val2 / factor);
}

bool pred2c(uint32_t x) {
  uint32_t factor = POWER_OF_TWO_FACTOR(mask2);
  register uint32_t temp = x >> 20;
  __asm__ volatile ("" : "+r" (temp));
  return (temp & 0x555) == 0x145;
}

// -------------------------------------------------------------------
// Example 3: The bitwise AND mask could fit a 20-bit immediate
// operand of RISC-V "lui" instruction.
// Only RISC-V has this 20-bit immediate "U-type" format, AFAIK.

static uint32_t mask3 = 0x00055555;
static uint32_t val3  = 0x00045014;

// static_assert(mask3 / POWER_OF_TWO_FACTOR(mask3) <= 0xFFFFF);

bool pred3a(uint32_t x) {
  return ((x & mask3) == val3);
}

bool pred3b(uint32_t x) {
  uint32_t factor = POWER_OF_TWO_FACTOR(mask3);
  return (((x / factor) << 12) & ((mask3 / factor) << 12))
    == ((val3 / factor) << 12);
}

bool pred3c(uint32_t x) {
  uint32_t factor = POWER_OF_TWO_FACTOR(mask3);
  register uint32_t temp = x << 12;
  __asm__ volatile ("" : "+r" (temp));
  return (temp & ((mask3 / factor) << 12))
    == ((val3 / factor) << 12);
}
```

I tested the code in the Compiler Explorer (godbolt.org).

### Generated assembly (for reference only)

```
pred1a:
        li      a5,1431658496
        addi    a5,a5,-1
        and     a0,a0,a5
        li      a5,340795392
        addi    a5,a5,-529
        sub     a0,a0,a5
        seqz    a0,a0
        ret
pred1b:  # Expected result
        li      a5,-1431658496
        or      a0,a0,a5
        li      a5,-1090863104
        addi    a5,a5,-529
        sub     a0,a0,a5
        seqz    a0,a0
        ret
pred2a:
        li      a5,1431306240
        and     a0,a0,a5
        li      a5,340787200
        sub     a0,a0,a5
        seqz    a0,a0
        ret
# pred2b compiles to same code as pred2a
pred2c:  # Expected result
        srliw   a0,a0,20
        andi    a0,a0,1365
        addi    a0,a0,-325
        seqz    a0,a0
        ret
pred3a:
        li      a5,348160
        addi    a5,a5,1365
        and     a0,a0,a5
        li      a5,282624
        addi    a5,a5,20
        sub     a0,a0,a5
        seqz    a0,a0
        ret
# pred3b compiles to same code as pred3a
pred3c:  # Expected result
        slliw   a0,a0,12
        li      a5,1431654400
        and     a0,a0,a5
        li      a5,1157709824
        sub     a0,a0,a5
        seqz    a0,a0
        ret
```

Reply via email to