https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111874

            Bug ID: 111874
           Summary: Missed mask_fold_left_plus with AVX512
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: rguenth at gcc dot gnu.org
  Target Milestone: ---

Currently fold-left reductions are open-coded by the vectorizer, extracting
scalar elements and doing in-order adds.  That's probably as good as it can
get.
For conditional (or loop-masked) fold-left reductions, the scalar
fallback isn't implemented.  But AVX512 has compress instructions
(vpcompress*/vcompressp*) that could be used to implement a more efficient
sequence for a masked fold-left, possibly using a loop and the population
count of the mask.
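
To make the idea concrete, here is a minimal sketch in portable C, not a
proposed implementation.  All names (cond_sum_ref, compress_lanes,
cond_sum_compress, LANES) are hypothetical, and the choice of 8 double lanes
per vector is an assumption.  compress_lanes is a scalar stand-in for what a
compress instruction would do in hardware: pack the active lanes to the
front in their original order.  The inner add loop then runs popcount(mask)
times, so the in-order semantics of the fold-left are preserved.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 8  /* assumption: 8 doubles per 512-bit vector */

/* Reference: conditional in-order (fold-left) reduction. */
double cond_sum_ref(const double *a, const int *c, size_t n, double acc)
{
    for (size_t i = 0; i < n; i++)
        if (c[i])
            acc += a[i];
    return acc;
}

/* Scalar stand-in for a compress instruction: pack the lanes selected by
   MASK to the front of DST, preserving their order.  Returns the number
   of active lanes, i.e. the population count of MASK. */
static unsigned compress_lanes(double *dst, const double *src, uint8_t mask)
{
    unsigned k = 0;
    for (unsigned l = 0; l < LANES; l++)
        if (mask & (1u << l))
            dst[k++] = src[l];
    return k;
}

/* Same reduction, structured as compress + popcount-bounded add loop. */
double cond_sum_compress(const double *a, const int *c, size_t n, double acc)
{
    size_t i = 0;
    for (; i + LANES <= n; i += LANES) {
        /* Build the lane mask from the condition (a vector compare in
           real code). */
        uint8_t mask = 0;
        for (unsigned l = 0; l < LANES; l++)
            mask |= (uint8_t)((c[i + l] != 0) << l);

        double packed[LANES];
        unsigned cnt = compress_lanes(packed, a + i, mask);
        assert(cnt <= LANES);

        /* In-order adds over the active elements only. */
        for (unsigned k = 0; k < cnt; k++)
            acc += packed[k];
    }
    /* Scalar epilogue for the remaining elements. */
    for (; i < n; i++)
        if (c[i])
            acc += a[i];
    return acc;
}
```

Because both functions add the same elements in the same order, their
results should be bit-identical even for inputs where floating-point
reassociation would change the sum.  Whether the real instruction sequence
beats the current open-coded scalar extraction is exactly the open question
here.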

It might be interesting to experiment with this, not so much for the
fully masked loop case but for conditional reduction.  Maybe there's
some expert at Intel or AMD who can produce a good instruction sequence
here.
