https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91460

            Bug ID: 91460
           Summary: gcc -mprefer-vector-width=256 is slower than
                    -mprefer-vector-width=128 for some loops
           Product: gcc
           Version: 10.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: skpgkp2 at gmail dot com
                CC: hjl.tools at gmail dot com
  Target Milestone: ---

1 static inline void pixel_avg( uint8_t *dst,  int     i_dst_stride,
2                              uint8_t *src1, int i_src1_stride,
3                              uint8_t *src2, int i_src2_stride,
4                               int i_width, int i_height )
5 {
6     for( int y = 0; y < i_height; y++ )
7     {
8         for( int x = 0; x < i_width; x++ )
9             dst[x] = ( src1[x] + src2[x] + 1 ) >> 1;
10         dst  += i_dst_stride;
11         src1 += i_src1_stride;
12         src2 += i_src2_stride;
13     }
14 }

Consider the case where the above code is in a hot loop.

If i_width is between 16 and 32, -mprefer-vector-width=128 can provide a
~6% performance improvement compared to -mprefer-vector-width=256.

i_width must be at least 16 to trigger 128-bit vectorization of the loop at
line 8.

i_width must be at least 32 to trigger 256-bit vectorization of the loop at
line 8.
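
The two thresholds match the number of uint8_t lanes in a vector: 128 / 8 = 16
and 256 / 8 = 32. The sketch below illustrates that arithmetic for a width in
the affected range. The helpers lanes_u8, vector_elems and checksum are
hypothetical names introduced for illustration, and the buffer sizes and fill
values are assumptions, not taken from the original test case. How GCC handles
the elements left over after the full vectors is not stated in this report, so
vector_elems only counts what fits in whole vectors.

```c
#include <assert.h>
#include <stdint.h>

/* Same kernel as in the report, without the listing's line numbers. */
static inline void pixel_avg(uint8_t *dst,  int i_dst_stride,
                             uint8_t *src1, int i_src1_stride,
                             uint8_t *src2, int i_src2_stride,
                             int i_width, int i_height)
{
    for (int y = 0; y < i_height; y++) {
        for (int x = 0; x < i_width; x++)
            dst[x] = (src1[x] + src2[x] + 1) >> 1;
        dst  += i_dst_stride;
        src1 += i_src1_stride;
        src2 += i_src2_stride;
    }
}

/* Hypothetical helper: uint8_t lanes in a vector of vector_bits bits. */
static int lanes_u8(int vector_bits)
{
    return vector_bits / 8;
}

/* Hypothetical helper: elements of one row that fit in whole vectors.
 * A result of 0 means the row is too short for even one full vector. */
static int vector_elems(int i_width, int vector_bits)
{
    int lanes = lanes_u8(vector_bits);
    return (i_width / lanes) * lanes;
}

/* Hypothetical driver: fills both sources with constants, runs the
 * kernel and sums the written part of dst. Sizes are assumptions. */
static unsigned checksum(int i_width, int i_height, uint8_t v1, uint8_t v2)
{
    enum { STRIDE = 64, ROWS = 16 };
    static uint8_t src1[STRIDE * ROWS], src2[STRIDE * ROWS], dst[STRIDE * ROWS];

    assert(i_width >= 0 && i_width <= STRIDE);
    assert(i_height >= 0 && i_height <= ROWS);

    for (int i = 0; i < STRIDE * ROWS; i++) {
        src1[i] = v1;
        src2[i] = v2;
    }

    pixel_avg(dst, STRIDE, src1, STRIDE, src2, STRIDE, i_width, i_height);

    unsigned sum = 0;
    for (int y = 0; y < i_height; y++)
        for (int x = 0; x < i_width; x++)
            sum += dst[y * STRIDE + x];
    return sum;
}
```

For i_width = 24, vector_elems(24, 128) is 16 while vector_elems(24, 256) is 0,
so a row in this range fills one 128-bit vector but no 256-bit vector. To
compare the two settings, a driver that calls checksum in a loop could be
built once with -mprefer-vector-width=128 and once with
-mprefer-vector-width=256. The optimization level and -march value to use are
not given in the report and would have to be chosen by the reader.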
