https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111874
Bug ID: 111874
Summary: Missed mask_fold_left_plus with AVX512
Product: gcc
Version: 14.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: rguenth at gcc dot gnu.org
Target Milestone: ---

Currently fold-left reductions are open-coded by the vectorizer, extracting
scalar elements and doing in-order adds. That's probably as good as it can
get.

For the case of conditional (or loop masked) fold-left reductions the scalar
fallback isn't implemented. But AVX512 has vpcompress that could be used to
implement a more efficient sequence for a masked fold-left, possibly using a
loop and population count of the mask.

It might be interesting to experiment with this, not so much for the fully
masked loop case but for conditional reduction. Maybe there's some expert at
Intel or AMD who can produce a good instruction sequence here.
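
To make the idea concrete, here is a rough scalar model in C of the conditional fold-left reduction and of the compress-plus-popcount scheme suggested above. It is only a sketch under assumptions: all names (LANES, cond_fold_left_ref, compress_lanes, cond_fold_left_compress) are hypothetical and appear nowhere in GCC, the vector width of 8 doubles is assumed, and compress_lanes merely imitates in plain C what a compress instruction would do on a mask register. It is not a proposed instruction sequence.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed vector width: 8 doubles, as in one 512-bit register. */
#define LANES 8

/* Reference semantics: conditional, strictly in-order (fold-left) sum. */
double cond_fold_left_ref(double acc, const double *a,
                          const unsigned char *cond, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (cond[i])
            acc += a[i];
    return acc;
}

/* Scalar model of a compress: pack the lanes selected by MASK to the
   front of OUT, preserving lane order.  Returns the number of active
   lanes, i.e. the population count of MASK.  */
static unsigned compress_lanes(double out[LANES], const double in[LANES],
                               uint8_t mask)
{
    unsigned k = 0;
    for (unsigned lane = 0; lane < LANES; lane++)
        if (mask & (1u << lane))
            out[k++] = in[lane];
    return k;
}

/* Same reduction, one "vector" at a time: build the mask, compress the
   active lanes, then do exactly popcount(mask) in-order adds.  */
double cond_fold_left_compress(double acc, const double *a,
                               const unsigned char *cond, size_t n)
{
    for (size_t base = 0; base < n; base += LANES) {
        double in[LANES] = {0};
        double packed[LANES];
        uint8_t mask = 0;

        /* Load one vector's worth of data; lanes past N stay inactive. */
        for (unsigned lane = 0; lane < LANES && base + lane < n; lane++) {
            in[lane] = a[base + lane];
            if (cond[base + lane])
                mask |= (uint8_t)(1u << lane);
        }

        unsigned active = compress_lanes(packed, in, mask);

        /* In-order adds over the packed lanes only. */
        for (unsigned k = 0; k < active; k++)
            acc += packed[k];
    }
    return acc;
}
```

Both functions perform the same additions in the same order, so they should agree bit for bit even on inputs where floating-point addition is visibly non-associative. The potential gain in a real sequence would come from the inner loop running only over active lanes instead of testing every lane; whether that beats a scalar fallback is exactly what would need experimenting.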