This does not actually solve those testcases, but currently scatters/gathers are costed as a single vector load/store. Micro-architecturally, Intel seems to have gotten to the point (with Skylake) of hiding everything but the scalar loads/stores, which seem to take up as many load/store ports as expected. I'm not expecting further improvements on top of that, so the following patch reflects this.
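
To make the costing change concrete, here is a minimal sketch in C, not the vectorizer's actual code. The function names and the unit costs (SCALAR_LOAD_COST, VECTOR_LOAD_COST) are made up for illustration; real costs come from the target's cost tables. The sketch also ignores any cost for assembling the loaded elements into a vector.

```c
#include <stddef.h>

/* Hypothetical per-operation costs, chosen only for illustration.
   Real values come from the target's cost tables.  */
enum { SCALAR_LOAD_COST = 1, VECTOR_LOAD_COST = 1 };

/* An indexed load of this shape is a candidate for a vector gather.  */
void
gather_loop (double *restrict out, const double *restrict a,
             const int *restrict idx, size_t n)
{
  for (size_t i = 0; i < n; ++i)
    out[i] = a[idx[i]];
}

/* Current model: each vector gather is costed like one vector load,
   regardless of how many elements it fetches.  */
unsigned
gather_cost_old (unsigned ncopies, unsigned nunits)
{
  (void) nunits;
  return ncopies * VECTOR_LOAD_COST;
}

/* Proposed model: each vector gather is costed like NUNITS scalar
   loads, one per vector element.  */
unsigned
gather_cost_new (unsigned ncopies, unsigned nunits)
{
  return ncopies * nunits * SCALAR_LOAD_COST;
}
```

With these placeholder costs, two vector copies of four elements each cost 2 under the current model and 8 under the proposed one, which is the direction of the change: gathers stop looking as cheap as contiguous vector loads.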

If the vectorizer were to open-code gathers/scatters (for CPUs without scatter/gather support), the extra cost of vector element insertion/extraction would be accounted for.

Bootstrap / regtest running on x86_64-unknown-linux-gnu.

Any objection at this stage?

Thanks,
Richard.

2017-05-28  Richard Biener  <rguent...@suse.de>

        PR tree-optimization/57796
        * tree-vect-stmts.c (vect_model_store_cost): Cost scatters as N
        scalar stores.
        (vect_model_load_cost): Cost gathers as N scalar loads.

Index: gcc/tree-vect-stmts.c
===================================================================
--- gcc/tree-vect-stmts.c       (revision 246500)
+++ gcc/tree-vect-stmts.c       (working copy)
@@ -929,7 +929,8 @@ vect_model_store_cost (stmt_vec_info stm
 
   tree vectype = STMT_VINFO_VECTYPE (stmt_info);
   /* Costs of the stores.  */
-  if (memory_access_type == VMAT_ELEMENTWISE)
+  if (memory_access_type == VMAT_ELEMENTWISE
+      || memory_access_type == VMAT_GATHER_SCATTER)
     /* N scalar stores plus extracting the elements.  */
     inside_cost += record_stmt_cost (body_cost_vec,
                                      ncopies * TYPE_VECTOR_SUBPARTS (vectype),
@@ -1056,7 +1057,8 @@ vect_model_load_cost (stmt_vec_info stmt
     }
 
   /* The loads themselves.  */
-  if (memory_access_type == VMAT_ELEMENTWISE)
+  if (memory_access_type == VMAT_ELEMENTWISE
+      || memory_access_type == VMAT_GATHER_SCATTER)
     {
       /* N scalar loads plus gathering them into a vector.  */
       tree vectype = STMT_VINFO_VECTYPE (stmt_info);