https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113813
Bug ID: 113813
Summary: Reduction of xor/and/ior of 16 bytes can be improved
Product: gcc
Version: 14.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: enhancement
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: pinskia at gcc dot gnu.org
Target Milestone: ---
Target: aarch64

Take:
```
#define SIGN unsigned
#define TYPE char
#define SIZE 16

void sor(SIGN TYPE *a, SIGN TYPE *r)
{
  SIGN TYPE b = 0;
  for(int i = 0; i < SIZE; i++)
    b |= a[i];
  *r = b;
}

void sxor(SIGN TYPE *a, SIGN TYPE *r)
{
  SIGN TYPE b = 0;
  for(int i = 0; i < SIZE; i++)
    b ^= a[i];
  *r = b;
}

void sand(SIGN TYPE *a, SIGN TYPE *r)
{
  SIGN TYPE b = -1;
  for(int i = 0; i < SIZE; i++)
    b &= a[i];
  *r = b;
}
```

Currently for sor GCC (at `-O3 -march=armv9-a+sve2 -fno-vect-cost-model`) produces:
```
        ptrue   p7.b, vl16
        ptrue   p6.b, all
        ld1b    z31.b, p7/z, [x0]
        mov     z30.b, #0
        sel     z30.b, p7, z31.b, z30.b
        orv     b30, p6, z30.b
        str     b30, [x1]
```

But this could be improved to just:
```
        ptrue   p7.b, vl16
        ld1b    z31.b, p7/z, [x0]
        orv     b30, p7, z31.b
        str     b30, [x1]
```

Similarly for sxor/sand.

The same applies to short/int (with SIZE 8/4).

Note that without `-fno-vect-cost-model` the generated code is much worse (on the trunk only).

Note that we should be able to use the SVE instruction when preferring NEON auto-vec too.
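
For reference, a minimal sketch of the char/short/int variants mentioned above, gathered into one translation unit so all nine reductions can be compiled and inspected together. The `DEFINE_REDUCTIONS` macro and the `_u8`/`_u16`/`_u32` suffixes are hypothetical names introduced only for this illustration; they are not part of the original testcase. The fixed-width types assume the usual aarch64 sizes for `char`, `short` and `int`.

```c
#include <stdint.h>

/* Hypothetical helper macro (not from the original testcase): emits the
   or/xor/and reductions for one element type T and element count N.
   The static assertion checks that each variant covers exactly 16 bytes,
   i.e. one 128-bit vector.  */
#define DEFINE_REDUCTIONS(SUFFIX, T, N)                                  \
  _Static_assert(sizeof(T) * (N) == 16, "variant must span 16 bytes");   \
  void sor_##SUFFIX(const T *a, T *r)                                    \
  {                                                                      \
    T b = 0;                      /* identity for inclusive or */        \
    for (int i = 0; i < (N); i++)                                        \
      b |= a[i];                                                         \
    *r = b;                                                              \
  }                                                                      \
  void sxor_##SUFFIX(const T *a, T *r)                                   \
  {                                                                      \
    T b = 0;                      /* identity for exclusive or */        \
    for (int i = 0; i < (N); i++)                                        \
      b ^= a[i];                                                         \
    *r = b;                                                              \
  }                                                                      \
  void sand_##SUFFIX(const T *a, T *r)                                   \
  {                                                                      \
    T b = (T)-1;                  /* all ones: identity for and */       \
    for (int i = 0; i < (N); i++)                                        \
      b &= a[i];                                                         \
    *r = b;                                                              \
  }

DEFINE_REDUCTIONS(u8,  uint8_t,  16)  /* unsigned char,  SIZE 16 */
DEFINE_REDUCTIONS(u16, uint16_t, 8)   /* unsigned short, SIZE 8  */
DEFINE_REDUCTIONS(u32, uint32_t, 4)   /* unsigned int,   SIZE 4  */
```

Compiling this with the options quoted above should make it possible to compare the generated code for each element size side by side.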