Re: RFR: 8349452: Fix performance regression for Arrays.fill() with AVX512 [v13]
On Fri, 30 Jan 2026 08:33:57 GMT, Emanuel Peter wrote: >>> > > @vamsi-parasa Ok, so now we have one benchmark that shows a speedup and >>> > > one that shows a regression. How are we to proceed? >>> > > It seems that without loads [#28442 >>> > > (comment)](https://github.com/openjdk/jdk/pull/28442#issuecomment-3761659799), >>> > > this patch leads to a regression. >>> > > Only if there is a load from one of the last elements that the >>> > > `Arrays.fill` stored to with a masked operation do we get a slowdown. >>> > > Because of missing load-to-store forwarding. If we instead started >>> > > loading from the first elements, those would probably already be in >>> > > cache, and we would not have any latency issues, right? >>> > > But is it not rather an edge-case that we load from the last elements >>> > > immediately after the `Arrays.fill`? At least for longer arrays, it >>> > > seems an edge case. For short arrays it is probably more likely that we >>> > > access the last element soon after the fill. >>> > > It does not seem like a trivial decision to me if this patch is an >>> > > improvement or not. What do you think @vamsi-parasa ? >>> > > @sviswa7 @dwhite-intel You already approved this PR. What are your >>> > > thoughts here? >>> > >>> > >>> > @eme64 My thoughts are to go ahead with this PR replacing masked stores >>> > with scalar tail processing. As we have seen from >>> > https://bugs.openjdk.org/browse/JDK-8349452 masked stores can cause big >>> > regression in certain scenarios: accessing elements just written or any >>> > other adjacent data that happens to fall in the masked store range. >>> >>> @sviswa7 But once this PR is integrated, I could file a performance >>> regression with the benchmarks from [up >>> here](https://github.com/openjdk/jdk/pull/28442#issuecomment-3761659799). >>> So what's the argument which choice is better, since we have a mix of >>> speedups/regression going either way, and both are probably in the 10-20% >>> range? 
>> >> @eme64 You have a point there, but if you see the performance numbers for >> ByteMatrix.java (from JDK-8349452) in the PR description above we are >> talking about a recovery of 3x or so. The ByteMatrix.java is doing only >> Arrays.fill() on individual arrays of a 2D array. The individual arrays >> happened to be allocated alongside each other by the JVM and the next store >> sees stalls due to the masked store of previous array initialization. That >> was the reason to look for a solution without masked stores. > > @sviswa7 Ah right, the ByteMatrix.java is yet another case. There, we don't > seem to have any loads. > >> The individual arrays happened to be allocated alongside each other by the >> JVM and the next store sees stalls due to the masked store of previous array >> initialization. > > Ah, that sounds interesting! Is there some tool that would let me see that it > was due to masked store stalls? > My (probably uneducated) guess would have been that it is just because a > single element store would be much cheaper than a masked operation. If you > only access a single or 2 elements, then a masked store is not yet > profitable. What if the masked stores were a bit further away, say a > cacheline away? Would that be significantly faster, because there are no > stalls? Or still slow because of the inherent higher cost of masked > operations? > > If we take the ByteMatrix.java benchmark: how would the performance change if > we increase the size of the arrays (height)? Is there some height before > which the non-masked solution is faster, and after which the masked is faster? > > Would it be a solution to use scalar stores for very small arrays, and only > use the masked loop starting at a certain threshold? > > --- > > I would like to see a summary of all the benchmarks we have here, and in > which cases we get speedups/slowdowns, and for which reason. Maybe listing > those reasons lets us see some third option we did not yet consider. 
And > listing all the reasons and code shapes may help us find out which shapes we > care about most, and then come to a decision that weighs off the pros and > cons. > > We should also document our decision nicely in the code, so that if someone > gets a regression in the future, we can see if we had already considered that > code shape. > > Does that make sense? Or do you have a better idea how to make a good > decision here? Hi Emanuel (@eme64), Based on the discussion, I will run further experiments to see if the regressions can be addressed and get back to you at a later date. Thanks, Vamsi - PR Comment: https://git.openjdk.org/jdk/pull/28442#issuecomment-3825349231
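For concreteness, the load-after-fill shape discussed above can be reduced to a tiny Java method (a minimal sketch, not the JMH benchmark from the linked comment; the class and method names are made up):

```java
import java.util.Arrays;

public class FillThenLoad {
    // Minimal shape of the pattern under discussion: fill an array whose
    // length is not a multiple of the vector width, then immediately load
    // one of the last elements the fill just stored. If the fill tail is a
    // masked AVX-512 store, that load is not served by store-to-load
    // forwarding and stalls; scalar tail stores forward normally.
    static byte fillAndReadLast(byte[] a, byte v) {
        Arrays.fill(a, v);       // tail handling is what this PR changes
        return a[a.length - 1];  // load hits the just-stored tail bytes
    }

    public static void main(String[] args) {
        // 37 is not a multiple of 64, so the fill needs a tail store.
        System.out.println(fillAndReadLast(new byte[37], (byte) 7));
    }
}
```

Whether this shape is common in real code is exactly the open question in the thread: for long arrays an immediate load of the last element looks like an edge case, for short arrays it is plausibly frequent.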
Re: RFR: 8349452: Fix performance regression for Arrays.fill() with AVX512 [v13]
On Fri, 30 Jan 2026 17:13:49 GMT, Sandhya Viswanathan wrote: > Vamsi should be able to confirm this. Regarding whether the slowdown is due to masked store stalls, that was my hypothesis based on the optimization guide, excerpts of which Vamsi shared above. MaxVectorSize=64 was used on the platform on which the data in the PR's description for the ByteMatrix fill workload was collected. - PR Comment: https://git.openjdk.org/jdk/pull/28442#issuecomment-3825339469
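Based on how the thread describes it, the ByteMatrix fill workload presumably looks something like the following (a sketch reconstructed from the description only; the actual ByteMatrix.java from JDK-8349452 may differ in dimensions and structure):

```java
import java.util.Arrays;

public class ByteMatrixLikeFill {
    // Assumed shape of the workload described in the thread: every row of a
    // 2D byte array is filled in turn. The JVM tends to allocate the rows
    // next to each other, so a masked tail store at the end of row i can
    // stall the first (unrelated) stores into the adjacent row i + 1.
    static void fillRows(byte[][] m, byte v) {
        for (byte[] row : m) {
            Arrays.fill(row, v);  // each row ends with a tail store
        }
    }

    public static void main(String[] args) {
        byte[][] m = new byte[1024][37];  // 37-byte rows force tail handling
        fillRows(m, (byte) 1);
        System.out.println(m[1023][36]);
    }
}
```

Note there are no loads at all in this shape; the reported 3x regression is attributed purely to store-store interaction across adjacent rows.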
Re: RFR: 8349452: Fix performance regression for Arrays.fill() with AVX512 [v13]
On Thu, 29 Jan 2026 19:43:27 GMT, Sandhya Viswanathan wrote: >> ### Int VectorBulkOperationsArray Fill >> >> Benchmark (ns/op) | Size | -OptimizeFill(JITed code) | >> +OptimizeFill(Masked store) | +OptimizeFill(Unmasked store - >> This PR) | Delta (masked vs. unmasked) >> -- | -- | -- | -- | -- | -- >> VectorBulkOperationsArray.fill_var_int_arrays_fill | 0 | 0.649 | 0.651 | >> 0.655 | 1% >> VectorBulkOperationsArray.fill_var_int_arrays_fill | 1 | 2.371 | 2.801 | >> 2.827 | 1% >> VectorBulkOperationsArray.fill_var_int_arrays_fill | 2 | 2.374 | 2.585 | >> 2.942 | 12% >> VectorBulkOperationsArray.fill_var_int_arrays_fill | 3 | 2.809 | 2.589 | >> 3.094 | 16% >> VectorBulkOperationsArray.fill_var_int_arrays_fill | 4 | 3.356 | 2.587 | >> 2.852 | 9% >> VectorBulkOperationsArray.fill_var_int_arrays_fill | 5 | 3.531 | 2.588 | >> 3.158 | 18% >> VectorBulkOperationsArray.fill_var_int_arrays_fill | 6 | 3.747 | 2.589 | >> 3.118 | 17% >> VectorBulkOperationsArray.fill_var_int_arrays_fill | 7 | 3.989 | 2.589 | >> 3.332 | 22% >> VectorBulkOperationsArray.fill_var_int_arrays_fill | 8 | 5.047 | 2.588 | >> 2.832 | 9% >> VectorBulkOperationsArray.fill_var_int_arrays_fill | 9 | 4.79 | 2.845 | >> 3.056 | 7% >> VectorBulkOperationsArray.fill_var_int_arrays_fill | 10 | 4.982 | 2.85 | >> 3.274 | 13% >> VectorBulkOperationsArray.fill_var_int_arrays_fill | 11 | 4.551 | 2.852 | >> 3.521 | 19% >> VectorBulkOperationsArray.fill_var_int_arrays_fill | 12 | 4.281 | 2.853 | >> 3.12 | 9% >> VectorBulkOperationsArray.fill_var_int_arrays_fill | 13 | 4.391 | 2.894 | >> 3.499 | 17% >> VectorBulkOperationsArray.fill_var_int_arrays_fill | 14 | 4.909 | 2.848 | >> 3.339 | 15% >> VectorBulkOperationsArray.fill_var_int_arrays_fill | 15 | 5.269 | 2.853 | >> 3.524 | 19% >> VectorBulkOperationsArray.fill_var_int_arrays_fill | 16 | 5.663 | 2.836 | >> 3.101 | 9% >> VectorBulkOperationsArray.fill_var_int_arrays_fill | 17 | 5.553 | 2.924 | >> 3.111 | 6% >> VectorBulkOperationsArray.fill_var_int_arrays_fill | 18 | 
5.105 | 2.933 | >> 3.358 | 13% >> VectorBulkOperationsArray.fill_var_int_arrays_fill | 19 | 5.09 | 2.942 | >> 3.583 | 18% >> VectorBulkOperationsArray.fill_var_int_arrays_fill | 20 | 4.457 | 2.927 | >> 3.272 | 11% >> VectorBulkOperationsArray.fill_var_int_arrays_fill | 21 | 4.745 | 3.104 | >> 3.598 | 14% >> VectorBulkOperationsArray.fill_var_int_arrays_fill | 22 | 4.949 | 2.932 | >> 3.481 | 16% >> VectorBulkOperationsArray.fill_var_int_arrays_fill | 23 | 4.992 | 2.939 | >> 3.761 | 22% >> VectorBulkOperationsArray.fill_var_int_arrays_fill | 24 | 5.198 | 2.92 | >> 3.205 | 9% >> VectorBulkOperationsArray.fill_var_int_arrays_fill | 25 | 5 > >> > > @vamsi-parasa Ok, so now we have one benchmark that shows a speedup and >> > > one that shows a regression. How are we to proceed? >> > > It seems that without loads [#28442 >> > > (comment)](https://github.com/openjdk/jdk/pull/28442#issuecomment-3761659799), >> > > this patch leads to a regression. >> > > Only if there is a load from one of the last elements that the >> > > `Arrays.fill` stored to with a masked operation do we get a slowdown. >> > > Because of missing load-to-store forwarding. If we instead started >> > > loading from the first elements, those would probably already be in >> > > cache, and we would not have any latency issues, right? >> > > But is it not rather an edge-case that we load from the last elements >> > > immediately after the `Arrays.fill`? At least for longer arrays, it >> > > seems an edge case. For short arrays it is probably more likely that we >> > > access the last element soon after the fill. >> > > It does not seem like a trivial decision to me if this patch is an >> > > improvement or not. What do you think @vamsi-parasa ? >> > > @sviswa7 @dwhite-intel You already approved this PR. What are your >> > > thoughts here? >> > >> > >> > @eme64 My thoughts are to go ahead with this PR replacing masked stores >> > with scalar tail processing. 
As we have seen from >> > https://bugs.openjdk.org/browse/JDK-8349452 masked stores can cause big >> > regression in certain scenarios: accessing elements just written or any >> > other adjacent data that happens to fall in the masked store range. >> >> @sviswa7 But once this PR is integrated, I could file a performance >> regression with the benchmarks from [up >> here](https://github.com/openjdk/jdk/pull/28442#issuecomment-3761659799). So >> what's the argument which choice is better, since we have a mix of >> speedups/regression going either way, and both are probably in the 10-20% >> range? > > @eme64 You have a point there, but if you see the performance numbers for > ByteMatrix.java (from JDK-8349452) in the PR description above we are talking > about a recovery of 3x or so. The ByteMatrix.java is doing only Arrays.fill() > on individual arrays of a 2D array. The individual arrays happened to be > allocated alongside each other by the JVM and the next store sees stalls due to the masked store of previous array initialization. That was the reason to look for a solution without masked stores.
Re: RFR: 8349452: Fix performance regression for Arrays.fill() with AVX512 [v13]
On Thu, 22 Jan 2026 20:30:26 GMT, Srinivas Vamsi Parasa wrote: >> Srinivas Vamsi Parasa has updated the pull request incrementally with one >> additional commit since the last revision: >> >> Update ALL of ArraysFill JMH micro > > ### Int VectorBulkOperationsArray Fill > > Benchmark (ns/op) | Size | -OptimizeFill(JITed code) | > +OptimizeFill(Masked store) | +OptimizeFill(Unmasked store - > This PR) | Delta (masked vs. unmasked) > -- | -- | -- | -- | -- | -- > VectorBulkOperationsArray.fill_var_int_arrays_fill | 0 | 0.649 | 0.651 | > 0.655 | 1% > VectorBulkOperationsArray.fill_var_int_arrays_fill | 1 | 2.371 | 2.801 | > 2.827 | 1% > VectorBulkOperationsArray.fill_var_int_arrays_fill | 2 | 2.374 | 2.585 | > 2.942 | 12% > VectorBulkOperationsArray.fill_var_int_arrays_fill | 3 | 2.809 | 2.589 | > 3.094 | 16% > VectorBulkOperationsArray.fill_var_int_arrays_fill | 4 | 3.356 | 2.587 | > 2.852 | 9% > VectorBulkOperationsArray.fill_var_int_arrays_fill | 5 | 3.531 | 2.588 | > 3.158 | 18% > VectorBulkOperationsArray.fill_var_int_arrays_fill | 6 | 3.747 | 2.589 | > 3.118 | 17% > VectorBulkOperationsArray.fill_var_int_arrays_fill | 7 | 3.989 | 2.589 | > 3.332 | 22% > VectorBulkOperationsArray.fill_var_int_arrays_fill | 8 | 5.047 | 2.588 | > 2.832 | 9% > VectorBulkOperationsArray.fill_var_int_arrays_fill | 9 | 4.79 | 2.845 | 3.056 > | 7% > VectorBulkOperationsArray.fill_var_int_arrays_fill | 10 | 4.982 | 2.85 | > 3.274 | 13% > VectorBulkOperationsArray.fill_var_int_arrays_fill | 11 | 4.551 | 2.852 | > 3.521 | 19% > VectorBulkOperationsArray.fill_var_int_arrays_fill | 12 | 4.281 | 2.853 | > 3.12 | 9% > VectorBulkOperationsArray.fill_var_int_arrays_fill | 13 | 4.391 | 2.894 | > 3.499 | 17% > VectorBulkOperationsArray.fill_var_int_arrays_fill | 14 | 4.909 | 2.848 | > 3.339 | 15% > VectorBulkOperationsArray.fill_var_int_arrays_fill | 15 | 5.269 | 2.853 | > 3.524 | 19% > VectorBulkOperationsArray.fill_var_int_arrays_fill | 16 | 5.663 | 2.836 | > 3.101 | 9% > 
VectorBulkOperationsArray.fill_var_int_arrays_fill | 17 | 5.553 | 2.924 | > 3.111 | 6% > VectorBulkOperationsArray.fill_var_int_arrays_fill | 18 | 5.105 | 2.933 | > 3.358 | 13% > VectorBulkOperationsArray.fill_var_int_arrays_fill | 19 | 5.09 | 2.942 | > 3.583 | 18% > VectorBulkOperationsArray.fill_var_int_arrays_fill | 20 | 4.457 | 2.927 | > 3.272 | 11% > VectorBulkOperationsArray.fill_var_int_arrays_fill | 21 | 4.745 | 3.104 | > 3.598 | 14% > VectorBulkOperationsArray.fill_var_int_arrays_fill | 22 | 4.949 | 2.932 | > 3.481 | 16% > VectorBulkOperationsArray.fill_var_int_arrays_fill | 23 | 4.992 | 2.939 | > 3.761 | 22% > VectorBulkOperationsArray.fill_var_int_arrays_fill | 24 | 5.198 | 2.92 | > 3.205 | 9% > VectorBulkOperationsArray.fill_var_int_arrays_fill | 25 | 5.097 | 3.116 | > 3.387 | 8% > VectorBulkOperationsArray.fill_var_... > > > @vamsi-parasa Ok, so now we have one benchmark that shows a speedup and > > > one that shows a regression. How are we to proceed? > > > It seems that without loads [#28442 > > > (comment)](https://github.com/openjdk/jdk/pull/28442#issuecomment-3761659799), > > > this patch leads to a regression. > > > Only if there is a load from one of the last elements that the > > > `Arrays.fill` stored to with a masked operation do we get a slowdown. > > > Because of missing load-to-store forwarding. If we instead started > > > loading from the first elements, those would probably already be in > > > cache, and we would not have any latency issues, right? > > > But is it not rather an edge-case that we load from the last elements > > > immediately after the `Arrays.fill`? At least for longer arrays, it seems > > > an edge case. For short arrays it is probably more likely that we access > > > the last element soon after the fill. > > > It does not seem like a trivial decision to me if this patch is an > > > improvement or not. What do you think @vamsi-parasa ? > > > @sviswa7 @dwhite-intel You already approved this PR. 
What are your > > > thoughts here? > > > > > > @eme64 My thoughts are to go ahead with this PR replacing masked stores > > with scalar tail processing. As we have seen from > > https://bugs.openjdk.org/browse/JDK-8349452 masked stores can cause big > > regression in certain scenarios: accessing elements just written or any > > other adjacent data that happens to fall in the masked store range. > > @sviswa7 But once this PR is integrated, I could file a performance > regression with the benchmarks from [up > here](https://github.com/openjdk/jdk/pull/28442#issuecomment-3761659799). So > what's the argument which choice is better, since we have a mix of > speedups/regression going either way, and both are probably in the 10-20% > range? @eme64 You have a point there, but if you see the performance numbers for ByteMatrix.java (from JDK-8349452) in the PR description above we are talking about a recovery of 3x or so. The ByteMatrix.java is doing only Arrays.fill() on individual arrays of a 2D array.
Re: RFR: 8349452: Fix performance regression for Arrays.fill() with AVX512 [v13]
On Thu, 22 Jan 2026 20:30:26 GMT, Srinivas Vamsi Parasa wrote: >> Srinivas Vamsi Parasa has updated the pull request incrementally with one >> additional commit since the last revision: >> >> Update ALL of ArraysFill JMH micro > > ### Int VectorBulkOperationsArray Fill > > Benchmark (ns/op) | Size | -OptimizeFill(JITed code) | > +OptimizeFill(Masked store) | +OptimizeFill(Unmasked store - > This PR) | Delta (masked vs. unmasked) > -- | -- | -- | -- | -- | -- > VectorBulkOperationsArray.fill_var_int_arrays_fill | 0 | 0.649 | 0.651 | > 0.655 | 1% > VectorBulkOperationsArray.fill_var_int_arrays_fill | 1 | 2.371 | 2.801 | > 2.827 | 1% > VectorBulkOperationsArray.fill_var_int_arrays_fill | 2 | 2.374 | 2.585 | > 2.942 | 12% > VectorBulkOperationsArray.fill_var_int_arrays_fill | 3 | 2.809 | 2.589 | > 3.094 | 16% > VectorBulkOperationsArray.fill_var_int_arrays_fill | 4 | 3.356 | 2.587 | > 2.852 | 9% > VectorBulkOperationsArray.fill_var_int_arrays_fill | 5 | 3.531 | 2.588 | > 3.158 | 18% > VectorBulkOperationsArray.fill_var_int_arrays_fill | 6 | 3.747 | 2.589 | > 3.118 | 17% > VectorBulkOperationsArray.fill_var_int_arrays_fill | 7 | 3.989 | 2.589 | > 3.332 | 22% > VectorBulkOperationsArray.fill_var_int_arrays_fill | 8 | 5.047 | 2.588 | > 2.832 | 9% > VectorBulkOperationsArray.fill_var_int_arrays_fill | 9 | 4.79 | 2.845 | 3.056 > | 7% > VectorBulkOperationsArray.fill_var_int_arrays_fill | 10 | 4.982 | 2.85 | > 3.274 | 13% > VectorBulkOperationsArray.fill_var_int_arrays_fill | 11 | 4.551 | 2.852 | > 3.521 | 19% > VectorBulkOperationsArray.fill_var_int_arrays_fill | 12 | 4.281 | 2.853 | > 3.12 | 9% > VectorBulkOperationsArray.fill_var_int_arrays_fill | 13 | 4.391 | 2.894 | > 3.499 | 17% > VectorBulkOperationsArray.fill_var_int_arrays_fill | 14 | 4.909 | 2.848 | > 3.339 | 15% > VectorBulkOperationsArray.fill_var_int_arrays_fill | 15 | 5.269 | 2.853 | > 3.524 | 19% > VectorBulkOperationsArray.fill_var_int_arrays_fill | 16 | 5.663 | 2.836 | > 3.101 | 9% > 
VectorBulkOperationsArray.fill_var_int_arrays_fill | 17 | 5.553 | 2.924 | > 3.111 | 6% > VectorBulkOperationsArray.fill_var_int_arrays_fill | 18 | 5.105 | 2.933 | > 3.358 | 13% > VectorBulkOperationsArray.fill_var_int_arrays_fill | 19 | 5.09 | 2.942 | > 3.583 | 18% > VectorBulkOperationsArray.fill_var_int_arrays_fill | 20 | 4.457 | 2.927 | > 3.272 | 11% > VectorBulkOperationsArray.fill_var_int_arrays_fill | 21 | 4.745 | 3.104 | > 3.598 | 14% > VectorBulkOperationsArray.fill_var_int_arrays_fill | 22 | 4.949 | 2.932 | > 3.481 | 16% > VectorBulkOperationsArray.fill_var_int_arrays_fill | 23 | 4.992 | 2.939 | > 3.761 | 22% > VectorBulkOperationsArray.fill_var_int_arrays_fill | 24 | 5.198 | 2.92 | > 3.205 | 9% > VectorBulkOperationsArray.fill_var_int_arrays_fill | 25 | 5.097 | 3.116 | > 3.387 | 8% > VectorBulkOperationsArray.fill_var_... > > @vamsi-parasa Ok, so now we have one benchmark that shows a speedup and one > > that shows a regression. How are we to proceed? > > It seems that without loads [#28442 > > (comment)](https://github.com/openjdk/jdk/pull/28442#issuecomment-3761659799), > > this patch leads to a regression. > > Only if there is a load from one of the last elements that the > > `Arrays.fill` stored to with a masked operation do we get a slowdown. > > Because of missing load-to-store forwarding. If we instead started loading > > from the first elements, those would probably already be in cache, and we > > would not have any latency issues, right? > > But is it not rather an edge-case that we load from the last elements > > immediately after the `Arrays.fill`? At least for longer arrays, it seems > > an edge case. For short arrays it is probably more likely that we access > > the last element soon after the fill. > > It does not seem like a trivial decision to me if this patch is an > > improvement or not. What do you think @vamsi-parasa ? > > @sviswa7 @dwhite-intel You already approved this PR. What are your thoughts > > here? 
> > @eme64 My thoughts are to go ahead with this PR replacing masked stores with > scalar tail processing. As we have seen from > https://bugs.openjdk.org/browse/JDK-8349452 masked stores can cause big > regression in certain scenarios: accessing elements just written or any other > adjacent data that happens to fall in the masked store range. @sviswa7 But once this PR is integrated, I could file a performance regression with the benchmarks from [up here](https://github.com/openjdk/jdk/pull/28442#issuecomment-3761659799). So what's the argument which choice is better, since we have a mix of speedups/regression going either way, and both are probably in the 10-20% range? - PR Comment: https://git.openjdk.org/jdk/pull/28442#issuecomment-3788945532
Re: RFR: 8349452: Fix performance regression for Arrays.fill() with AVX512 [v13]
On Fri, 16 Jan 2026 20:14:31 GMT, Srinivas Vamsi Parasa wrote:

>> The goal of this PR is to fix the performance regression in Arrays.fill() x86 stubs caused by masked AVX stores. The fix is to replace the masked AVX stores with store instructions without masks (i.e. unmasked stores). `fill32_masked()` and `fill64_masked()` stubs are replaced with `fill32_unmasked()` and `fill64_unmasked()` respectively.
>>
>> To speedup unmasked stores, array fills for sizes < 64 bytes are broken down into sequences of 32B, 16B, 8B, 4B, 2B and 1B stores, depending on the size.
>>
>> ### **Performance comparison for byte array fills in a loop for 1 million times**
>>
>> UseAVX=3 ByteArray Size | +OptimizeFill(Masked store stub) [secs] | -OptimizeFill (No stub) [secs] | --->This PR: +OptimizeFill (Unmasked store stub) [secs]
>> -- | -- | -- | --
>> 1 | 0.46 | 0.14 | 0.189
>> 2 | 0.46 | 0.16 | 0.191
>> 3 | 0.46 | 0.176 | 0.199
>> 4 | 0.46 | 0.244 | 0.212
>> 5 | 0.46 | 0.29 | 0.364
>> 10 | 0.46 | 0.58 | 0.354
>> 15 | 0.46 | 0.42 | 0.325
>> 16 | 0.46 | 0.46 | 0.281
>> 17 | 0.21 | 0.5 | 0.365
>> 20 | 0.21 | 0.37 | 0.326
>> 25 | 0.21 | 0.59 | 0.343
>> 31 | 0.21 | 0.53 | 0.317
>> 32 | 0.21 | 0.58 | 0.249
>> 35 | 0.5 | 0.77 | 0.303
>> 40 | 0.5 | 0.61 | 0.312
>> 45 | 0.5 | 0.52 | 0.364
>> 48 | 0.5 | 0.66 | 0.283
>> 49 | 0.22 | 0.69 | 0.367
>> 50 | 0.22 | 0.78 | 0.344
>> 55 | 0.22 | 0.67 | 0.332
>> 60 | 0.22 | 0.67 | 0.312
>> 64 | 0.22 | 0.82 | 0.253
>> 70 | 0.51 | 1.1 | 0.394
>> 80 | 0.49 | 0.89 | 0.346
>> 90 | 0.225 | 0.68 | 0.385
>> 100 | 0.54 | 1.09 | 0.364
>> 110 | 0.6 | 0.98 | 0.416
>> 120 | 0.26 | 0.75 | 0.367
>> 128 | 0.266 | 1.1 | 0.342
>
> Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision:
>
> Update ALL of ArraysFill JMH micro

### Short VectorBulkOperationsArray Fill

Benchmark (ns/op) | Size | -OptimizeFill (JITed code) | +OptimizeFill (Masked store) | +OptimizeFill (Unmasked store - This PR) | Delta (masked vs. unmasked)
-- | -- | -- | -- | -- | --
VectorBulkOperationsArray.fill_var_short_arrays_fill | 0 | 0.649 | 0.65 | 0.65 | 0%
VectorBulkOperationsArray.fill_var_short_arrays_fill | 1 | 2.366 | 2.806 | 3.025 | 8%
VectorBulkOperationsArray.fill_var_short_arrays_fill | 2 | 2.37 | 2.587 | 2.789 | 8%
VectorBulkOperationsArray.fill_var_short_arrays_fill | 3 | 2.825 | 2.587 | 3.299 | 28%
VectorBulkOperationsArray.fill_var_short_arrays_fill | 4 | 3.09 | 2.59 | 3.024 | 17%
VectorBulkOperationsArray.fill_var_short_arrays_fill | 5 | 3.336 | 2.589 | 3.338 | 29%
VectorBulkOperationsArray.fill_var_short_arrays_fill | 6 | 3.544 | 2.596 | 3.189 | 23%
VectorBulkOperationsArray.fill_var_short_arrays_fill | 7 | 3.712 | 2.719 | 3.449 | 27%
VectorBulkOperationsArray.fill_var_short_arrays_fill | 8 | 4.883 | 2.589 | 2.86 | 10%
VectorBulkOperationsArray.fill_var_short_arrays_fill | 9 | 4.817 | 2.589 | 3.355 | 30%
VectorBulkOperationsArray.fill_var_short_arrays_fill | 10 | 4.774 | 2.585 | 3.16 | 22%
VectorBulkOperationsArray.fill_var_short_arrays_fill | 11 | 4.514 | 2.589 | 3.431 | 33%
VectorBulkOperationsArray.fill_var_short_arrays_fill | 12 | 4.097 | 2.587 | 3.111 | 20%
VectorBulkOperationsArray.fill_var_short_arrays_fill | 13 | 4.351 | 2.599 | 3.393 | 31%
VectorBulkOperationsArray.fill_var_short_arrays_fill | 14 | 4.674 | 2.588 | 3.319 | 28%
VectorBulkOperationsArray.fill_var_short_arrays_fill | 15 | 4.981 | 2.586 | 3.542 | 37%
VectorBulkOperationsArray.fill_var_short_arrays_fill | 16 | 5.406 | 2.586 | 2.833 | 10%
VectorBulkOperationsArray.fill_var_short_arrays_fill | 17 | 5.307 | 2.8 | 3.202 | 14%
VectorBulkOperationsArray.fill_var_short_arrays_fill | 18 | 5.093 | 2.811 | 3.051 | 9%
VectorBulkOperationsArray.fill_var_short_arrays_fill | 19 | 4.68 | 2.817 | 3.568 | 27%
VectorBulkOperationsArray.fill_var_short_arrays_fill | 20 | 4.528 | 2.81 | 3.294 | 17%
VectorBulkOperationsArray.fill_var_short_arrays_fill | 21 | 4.633 | 2.814 | 3.589 | 28%
VectorBulkOperationsArray.fill_var_short_arrays_fill | 22 | 5.102 | 2.809 | 3.495 | 24%
VectorBulkOperationsArray.fill_var_short_arrays_fill | 23 | 5.521 | 2.812 | 3.717 | 32%
VectorBulkOperationsArray.fill_var_short_arrays_fill | 24 | 6.205 | 2.813 | 3.094 | 10%
VectorBulkOperationsArray.fill_var_short_arrays_fill | 25 | 5.92 | 2.816 | 3.58 | 27%
VectorBulkOperationsArray.fill_var_short_arrays_fill | 26 | 4.805 | 2.87 | 3.495 | 22%
VectorBulkOperationsArray.fill_var_short_arrays_fill | 27 | 4.744 | 2.815 | 3.712 | 32%
VectorBulkOperationsArray.fill_var_short_arrays_fill | 28 | 4.45 | 2.811 | 3.361 | 20%
VectorBulkOperationsArray.fill_var_short_arrays_fill | 29 | 4.59 | 2.813 | 3.734 | 33%
VectorBulkOperationsArray.fill_var_short_arrays_fill | 30 | 4.781 | 2.812 | 3.589 | 28%
VectorBulkOperationsArray.fill_var_short_arrays_fill | 31 | 4.992 | 2.81 | 3.817 | 36%
VectorBulkOperationsArray.fill_var_short_arrays_
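The PR description above says sub-64-byte fills are broken into sequences of 32B, 16B, 8B, 4B, 2B and 1B stores. One plausible reading of that (a sketch only, not the actual stub code, which is x86 assembly) is a binary decomposition of the remaining byte count, widest store first:

```java
import java.util.ArrayList;
import java.util.List;

public class TailStorePlan {
    // For a fill of nBytes < 64, emit at most one store of each power-of-two
    // width from 32 down to 1, choosing the widths whose bit is set in
    // nBytes. E.g. 37 bytes = 32 + 4 + 1 -> one 32B, one 4B, one 1B store.
    static List<Integer> storeSizes(int nBytes) {
        List<Integer> sizes = new ArrayList<>();
        for (int s = 32; s >= 1; s >>= 1) {
            if ((nBytes & s) != 0) {
                sizes.add(s);
            }
        }
        return sizes;
    }
}
```

Under this reading, any size below 64 bytes needs at most six unmasked stores, so no mask register is required for the tail at all.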
Re: RFR: 8349452: Fix performance regression for Arrays.fill() with AVX512 [v13]
On Wed, 21 Jan 2026 22:07:22 GMT, Derek White wrote:

> I'm expecting to see a small regression in a write-only fill, and a larger
> improvement in write+read fill, but we didn't present the data in a way that
> makes it easy to compare those two tests. So we should present the graphed
> data as a table as well. Then we can discuss how common the write+read fill
> case is.

Hi Derek,

Please see the data for write-only fill operations for byte, short and int below.

Thanks,
Vamsi

### Byte VectorBulkOperationsArray Fill Benchmark (ns/op)

Benchmark | Size | -OptimizeFill (JITed code) | +OptimizeFill (Masked store) | +OptimizeFill (Unmasked store - This PR) | Delta (masked vs. unmasked)
-- | -- | -- | -- | -- | --
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 0 | 0.649 | 0.65 | 0.653 | 0%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 1 | 2.372 | 2.803 | 2.588 | -8%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 2 | 2.37 | 2.596 | 2.471 | -5%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 3 | 2.813 | 2.591 | 2.495 | -4%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 4 | 3.086 | 2.598 | 2.757 | 6%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 5 | 3.343 | 2.59 | 3.644 | 41%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 6 | 3.549 | 2.589 | 3.536 | 37%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 7 | 3.716 | 2.616 | 3.695 | 41%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 8 | 4.854 | 2.59 | 3.252 | 26%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 9 | 4.771 | 2.587 | 3.591 | 39%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 10 | 4.78 | 2.595 | 3.542 | 36%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 11 | 4.532 | 2.589 | 3.669 | 42%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 12 | 4.164 | 2.592 | 3.505 | 35%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 13 | 4.348 | 2.589 | 3.655 | 41%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 14 | 4.703 | 2.594 | 3.637 | 40%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 15 | 4.973 | 2.591 | 3.754 | 45%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 16 | 5.498 | 2.593 | 3.062 | 18%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 17 | 5.305 | 2.588 | 3.611 | 40%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 18 | 5.081 | 2.59 | 3.649 | 41%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 19 | 4.782 | 2.586 | 3.642 | 41%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 20 | 4.458 | 2.588 | 3.565 | 38%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 21 | 4.66 | 2.586 | 3.741 | 45%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 22 | 5.112 | 2.591 | 3.681 | 42%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 23 | 5.522 | 2.607 | 3.742 | 44%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 24 | 6.02 | 2.589 | 3.27 | 26%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 25 | 4.84 | 2.588 | 3.691 | 43%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 26 | 4.81 | 2.589 | 3.674 | 42%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 27 | 4.695 | 2.591 | 3.761 | 45%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 28 | 4.828 | 2.589 | 3.578 | 38%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 29 | 4.531 | 2.586 | 3.762 | 45%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 30 | 5.38 | 2.59 | 3.713 | 43%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 31 | 4.948 | 2.588 | 3.887 | 50%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 32 | 5.21 | 2.589 | 2.861 | 11%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 33 | 6.258 | 2.824 | 3.377 | 20%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 34 | 4.992 | 2.829 | 3.388 | 20%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 35 | 4.918 | 2.812 | 3.577 | 27%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 36 | 4.647 | 2.814 | 3.351 | 19%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 37 | 4.762 | 2.815 | 3.775 | 34%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 38 | 4.93 | 2.819 | 3.76 | 33%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 39 | 5.137 | 2.821 | 3.954 | 40%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 40 | 5.377 | 2.815 | 3.483 | 24%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 41 | 5.373 | 2.815 | 3.777 | 34%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 42 | 5.309 | 2.815 | 3.77 | 34%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 43 | 5.157 | 2.815 | 3.835 | 36%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 44 | 4.862 | 2.82 | 3.743 | 33%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 45 | 4.957 | 2.816 | 3.882 | 38%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 46 | 5.207 | 2.814 | 3.85 | 37%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 47 | 5.526 | 2.813 | 4.023 | 43%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 48 | 5
Re: RFR: 8349452: Fix performance regression for Arrays.fill() with AVX512 [v13]
On Wed, 21 Jan 2026 00:01:39 GMT, Srinivas Vamsi Parasa wrote: >> @vamsi-parasa Thanks for the extra data! >> >> Do I see this right? In the plots >> [here](https://github.com/openjdk/jdk/pull/28442#issuecomment-3761659799), >> the masked performance lies lower/better than unmasked performance (here we >> measure ns/ops). But in your tables >> [here](https://github.com/openjdk/jdk/pull/28442#issuecomment-3761712841) >> you are measuring ops/ms, and are getting the opposite trend: masked is >> slower than unmasked. >> >> Can you explain the difference between the two results? > >> Can you explain the difference between the two results? >> > Hi Emanuel (@eme64), > Yes, the conclusions you mentioned are correct. The store only benchmark > shows that masked store is slightly better than the unmasked store. However, > the store followed by load benchmarks shows that the unmasked store is better > than masked vector store as masked vector stores have very limited store > forwarding support in the hardware. > > This is because the load operation following the masked vector store is > blocked until the data is written into the cache. This is also mentioned in > the [Intel Software optimization > manual](https://www.intel.com/content/www/us/en/content-details/671488/intel-64-and-ia-32-architectures-optimization-reference-manual-volume-1.html) > (Chapter 18, section 18.4, page 578). > > Pasting the relevant text below for reference: > > > 18.4 FORWARDING AND MEMORY MASKING > When using masked store and load, consider the following: > • When the mask is not all-ones or all-zeroes, the load operation, following > the masked store operation > from the same address is blocked, until the data is written to the cache. > • Unlike GPR forwarding rules, vector loads whether or not they are masked, > do not forward unless > load and store addresses are exactly the same. 
> — st_mask = 10101010, ld_mask = 01010101, can forward: no, should block: yes > — st_mask = , ld_mask = 0011, can forward: no, should block: yes > • When the mask is all-ones, blocking does not occur, because the data may be > forwarded to the load > operation. > — st_mask = , ld_mask = don’t care, can forward: yes, should block: no > • When mask is all-zeroes, blocking does not occur, though neither does > forwarding. > — st_mask = , ld_mask = don’t care, can forward: no, should block: no > In summary, a masked store should be used carefully, for example, if the > remainder size is known at > compile time to be 1, and there is a load operation from the same cache line > after it (or there is an > overlap in addresses + vector lengths), it may be better to use scalar > remainder processing, rather than > a masked remainder block. > > > Thanks, > Vamsi > @vamsi-parasa Ok, so now we have one benchmark that shows a speedup and one > that shows a regression. How are we to proceed? > > It seems that without loads [#28442 > (comment)](https://github.com/openjdk/jdk/pull/28442#issuecomment-3761659799), > this patch leads to a regression. > > Only if there is a load from one of the last elements that the `Arrays.fill` > stored to with a masked operation do we get a slowdown. Because of missing > load-to-store forwarding. If we instead started loading from the first > elements, those would probably already be in cache, and we would not have any > latency issues, right? > > But is it not rather an edge-case that we load from the last elements > immediately after the `Arrays.fill`? At least for longer arrays, it seems an > edge case. For short arrays it is probably more likely that we access the > last element soon after the fill. > > It does not seem like a trivial decision to me if this patch is an > improvement or not. What do you think @vamsi-parasa ? > > @sviswa7 @dwhite-intel You already approved this PR. What are your thoughts > here? 
@eme64 My thoughts are to go ahead with this PR, replacing masked stores with scalar tail processing. As we have seen from https://bugs.openjdk.org/browse/JDK-8349452, masked stores can cause big regressions in certain scenarios: accessing elements just written, or any other adjacent data that happens to fall within the masked store range.

- PR Comment: https://git.openjdk.org/jdk/pull/28442#issuecomment-3785772087
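For illustration, "scalar tail processing" of a fill can be sketched in plain Java. This is a hypothetical sketch, not the actual HotSpot stub (which is x86 assembly operating on raw memory); `fillBytes` and the 8-element "vector" chunk are invented stand-ins:

```java
class FillSketch {
    // Hypothetical sketch of scalar tail processing for a fill.
    // The real stub uses wide (32B/64B) vector stores for the main
    // part; here an unrolled 8-element loop stands in for them.
    static void fillBytes(byte[] a, byte v) {
        int i = 0;
        // Main loop: stand-in for full-width, unmasked vector stores.
        for (; i + 8 <= a.length; i += 8) {
            for (int j = 0; j < 8; j++) {
                a[i + j] = v;
            }
        }
        // Tail: plain scalar stores instead of one masked vector store,
        // so a load that immediately follows the fill can still be
        // satisfied by store-to-load forwarding.
        for (; i < a.length; i++) {
            a[i] = v;
        }
    }
}
```

The trade-off discussed in this thread is exactly here: the masked-store variant replaces the scalar tail loop with a single masked vector store, which is fewer instructions but can block a subsequent load.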
Re: RFR: 8349452: Fix performance regression for Arrays.fill() with AVX512 [v13]
On Fri, 16 Jan 2026 20:14:31 GMT, Srinivas Vamsi Parasa wrote: >> The goal of this PR is to fix the performance regression in Arrays.fill() >> x86 stubs caused by masked AVX stores. The fix is to replace the masked AVX >> stores with store instructions without masks (i.e. unmasked stores). >> `fill32_masked()` and `fill64_masked()` stubs are replaced with >> `fill32_unmasked()` and `fill64_unmasked()` respectively. >> >> To speedup unmasked stores, array fills for sizes < 64 bytes are broken down >> into sequences of 32B, 16B, 8B, 4B, 2B and 1B stores, depending on the size. >> >> >> ### **Performance comparison for byte array fills in a loop for 1 million >> times** >> >> >> UseAVX=3 ByteArray Size | +OptimizeFill(Masked store stub) >> [secs] | -OptimizeFill (No stub) [secs] | --->This PR: +OptimizeFill >> (Unmasked store stub) [secs] >> -- | -- | -- | -- >> 1 | 0.46 | 0.14 | 0.189 >> 2 | 0.46 | 0.16 | 0.191 >> 3 | 0.46 | 0.176 | 0.199 >> 4 | 0.46 | 0.244 | 0.212 >> 5 | 0.46 | 0.29 | 0.364 >> 10 | 0.46 | 0.58 | 0.354 >> 15 | 0.46 | 0.42 | 0.325 >> 16 | 0.46 | 0.46 | 0.281 >> 17 | 0.21 | 0.5 | 0.365 >> 20 | 0.21 | 0.37 | 0.326 >> 25 | 0.21 | 0.59 | 0.343 >> 31 | 0.21 | 0.53 | 0.317 >> 32 | 0.21 | 0.58 | 0.249 >> 35 | 0.5 | 0.77 | 0.303 >> 40 | 0.5 | 0.61 | 0.312 >> 45 | 0.5 | 0.52 | 0.364 >> 48 | 0.5 | 0.66 | 0.283 >> 49 | 0.22 | 0.69 | 0.367 >> 50 | 0.22 | 0.78 | 0.344 >> 55 | 0.22 | 0.67 | 0.332 >> 60 | 0.22 | 0.67 | 0.312 >> 64 | 0.22 | 0.82 | 0.253 >> 70 | 0.51 | 1.1 | 0.394 >> 80 | 0.49 | 0.89 | 0.346 >> 90 | 0.225 | 0.68 | 0.385 >> 100 | 0.54 | 1.09 | 0.364 >> 110 | 0.6 | 0.98 | 0.416 >> 120 | 0.26 | 0.75 | 0.367 >> 128 | 0.266 | 1.1 | 0.342 > > Srinivas Vamsi Parasa has updated the pull request incrementally with one > additional commit since the last revision: > > Update ALL of ArraysFill JMH micro I'm expecting to see a small regression in a write-only fill, and a larger improvement in write+read fill, but we didn't present the data in a way that makes it 
easy to compare those two tests. So we should present the graphed data as a table as well. Then we can discuss how common the write+read fill case is. - PR Comment: https://git.openjdk.org/jdk/pull/28442#issuecomment-3781378167
Re: RFR: 8349452: Fix performance regression for Arrays.fill() with AVX512 [v13]
On Wed, 21 Jan 2026 00:01:39 GMT, Srinivas Vamsi Parasa wrote: >> @vamsi-parasa Thanks for the extra data! >> >> Do I see this right? In the plots >> [here](https://github.com/openjdk/jdk/pull/28442#issuecomment-3761659799), >> the masked performance lies lower/better than unmasked performance (here we >> measure ns/ops). But in your tables >> [here](https://github.com/openjdk/jdk/pull/28442#issuecomment-3761712841) >> you are measuring ops/ms, and are getting the opposite trend: masked is >> slower than unmasked. >> >> Can you explain the difference between the two results? > >> Can you explain the difference between the two results? >> > Hi Emanuel (@eme64), > Yes, the conclusions you mentioned are correct. The store only benchmark > shows that masked store is slightly better than the unmasked store. However, > the store followed by load benchmarks shows that the unmasked store is better > than masked vector store as masked vector stores have very limited store > forwarding support in the hardware. > > This is because the load operation following the masked vector store is > blocked until the data is written into the cache. This is also mentioned in > the [Intel Software optimization > manual](https://www.intel.com/content/www/us/en/content-details/671488/intel-64-and-ia-32-architectures-optimization-reference-manual-volume-1.html) > (Chapter 18, section 18.4, page 578). > > Pasting the relevant text below for reference: > > > 18.4 FORWARDING AND MEMORY MASKING > When using masked store and load, consider the following: > • When the mask is not all-ones or all-zeroes, the load operation, following > the masked store operation > from the same address is blocked, until the data is written to the cache. > • Unlike GPR forwarding rules, vector loads whether or not they are masked, > do not forward unless > load and store addresses are exactly the same. 
> — st_mask = 10101010, ld_mask = 01010101, can forward: no, should block: yes
> — st_mask = , ld_mask = 0011, can forward: no, should block: yes
> • When the mask is all-ones, blocking does not occur, because the data may be forwarded to the load operation.
> — st_mask = , ld_mask = don’t care, can forward: yes, should block: no
> • When mask is all-zeroes, blocking does not occur, though neither does forwarding.
> — st_mask = , ld_mask = don’t care, can forward: no, should block: no
> In summary, a masked store should be used carefully, for example, if the remainder size is known at compile time to be 1, and there is a load operation from the same cache line after it (or there is an overlap in addresses + vector lengths), it may be better to use scalar remainder processing, rather than a masked remainder block.
>
> Thanks,
> Vamsi

@vamsi-parasa Ok, so now we have one benchmark that shows a speedup and one that shows a regression. How are we to proceed?

It seems that without loads (https://github.com/openjdk/jdk/pull/28442#issuecomment-3761659799), this patch leads to a regression.

Only if there is a load from one of the last elements that the `Arrays.fill` stored to with a masked operation do we get a slowdown. Because of missing load-to-store forwarding. If we instead started loading from the first elements, those would probably already be in cache, and we would not have any latency issues, right?

But is it not rather an edge-case that we load from the last elements immediately after the `Arrays.fill`? At least for longer arrays, it seems an edge case. For short arrays it is probably more likely that we access the last element soon after the fill.

It does not seem like a trivial decision to me if this patch is an improvement or not. What do you think @vamsi-parasa ?

@sviswa7 @dwhite-intel You already approved this PR. What are your thoughts here?

- PR Comment: https://git.openjdk.org/jdk/pull/28442#issuecomment-3776741440
Re: RFR: 8349452: Fix performance regression for Arrays.fill() with AVX512 [v13]
On Mon, 19 Jan 2026 08:11:19 GMT, Emanuel Peter wrote:

> Can you explain the difference between the two results?

Hi Emanuel (@eme64),

Yes, the conclusions you mentioned are correct. The store-only benchmark shows that the masked store is slightly better than the unmasked store. However, the store-followed-by-load benchmark shows that the unmasked store is better than the masked vector store, because masked vector stores have very limited store-forwarding support in the hardware.

This is because a load operation following a masked vector store is blocked until the data is written into the cache. This is also mentioned in the [Intel Software Optimization Manual](https://www.intel.com/content/www/us/en/content-details/671488/intel-64-and-ia-32-architectures-optimization-reference-manual-volume-1.html) (Chapter 18, section 18.4, page 578).

Pasting the relevant text below for reference:

> 18.4 FORWARDING AND MEMORY MASKING
> When using masked store and load, consider the following:
> • When the mask is not all-ones or all-zeroes, the load operation following the masked store operation from the same address is blocked until the data is written to the cache.
> • Unlike GPR forwarding rules, vector loads, whether or not they are masked, do not forward unless load and store addresses are exactly the same.
> — st_mask = 10101010, ld_mask = 01010101, can forward: no, should block: yes
> — st_mask = , ld_mask = 0011, can forward: no, should block: yes
> • When the mask is all-ones, blocking does not occur, because the data may be forwarded to the load operation.
> — st_mask = , ld_mask = don’t care, can forward: yes, should block: no
> • When the mask is all-zeroes, blocking does not occur, though neither does forwarding.
> — st_mask = , ld_mask = don’t care, can forward: no, should block: no
> In summary, a masked store should be used carefully. For example, if the remainder size is known at compile time to be 1, and there is a load operation from the same cache line after it (or there is an overlap in addresses + vector lengths), it may be better to use scalar remainder processing rather than a masked remainder block.

Thanks,
Vamsi

- PR Comment: https://git.openjdk.org/jdk/pull/28442#issuecomment-3775508253
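The forwarding rules quoted from the manual can be summarized as a small predicate. This is only a toy model for illustration (assuming an 8-lane byte mask; real hardware additionally requires the load and store addresses to match exactly, which this sketch ignores):

```java
class ForwardingModel {
    // Toy model of the masked-store forwarding rules quoted from the
    // Intel optimization manual, for an 8-lane store mask:
    // 0x00 = all-zeroes, 0xFF = all-ones.
    static boolean canForward(int stMask) {
        // Data can be forwarded to a subsequent load only when the
        // store mask is all-ones.
        return stMask == 0xFF;
    }

    static boolean shouldBlock(int stMask) {
        // A partial mask (neither all-ones nor all-zeroes) blocks the
        // following load until the store data reaches the cache.
        return stMask != 0xFF && stMask != 0x00;
    }
}
```

This is the crux of the regression: a remainder handled by a masked store almost always has a partial mask, so any immediately following load of those elements stalls.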
Re: RFR: 8349452: Fix performance regression for Arrays.fill() with AVX512 [v13]
On Fri, 16 Jan 2026 20:31:28 GMT, Srinivas Vamsi Parasa
wrote:
>> Srinivas Vamsi Parasa has updated the pull request incrementally with one
>> additional commit since the last revision:
>>
>> Update ALL of ArraysFill JMH micro
>
> Also, we can see the benefit of using unmasked stores (this PR) instead of
> masked vector stores (existing implementation) when we update the
> ArraysFill.java JMH micro-benchmark to perform fill (write) followed by read
> of the filled data as shown below using short array fill as an example:
>
>
> @Benchmark
> public short testShortFill() {
>     Arrays.fill(testShortArray, (short) -1);
>     return (short) (testShortArray[0] + testShortArray[size - 1]);
> }
>
>
>
>
>
> ### Table shows throughput (ops/ms); **(Higher is better)**
> Benchmark (ops/ms) MaxVectorSize = 32 | SIZE | +OptimizeFill
> (Masked Store) | +OptimizeFill (Unmasked Store - This PR) | Delta
> -- | -- | -- | -- | --
> ArraysFill.testByteFill | 1 | 175381 | 342456 | 95%
> ArraysFill.testByteFill | 10 | 175421 | 264607 | 51%
> ArraysFill.testByteFill | 20 | 175447 | 27 | 55%
> ArraysFill.testByteFill | 30 | 175454 | 253351 | 44%
> ArraysFill.testByteFill | 40 | 162429 | 273043 | 68%
> ArraysFill.testByteFill | 50 | 162443 | 251734 | 55%
> ArraysFill.testByteFill | 60 | 162454 | 248156 | 53%
> ArraysFill.testByteFill | 70 | 156659 | 236497 | 51%
> ArraysFill.testByteFill | 80 | 175403 | 269433 | 54%
> ArraysFill.testByteFill | 90 | 175422 | 230276 | 31%
> ArraysFill.testByteFill | 100 | 168662 | 252394 | 50%
> ArraysFill.testByteFill | 110 | 146182 | 217917 | 49%
> ArraysFill.testByteFill | 120 | 168693 | 239126 | 42%
> ArraysFill.testByteFill | 130 | 162378 | 166159 | 2%
> ArraysFill.testByteFill | 140 | 156569 | 168296 | 7%
> ArraysFill.testByteFill | 150 | 151214 | 167388 | 11%
> ArraysFill.testByteFill | 160 | 156594 | 173529 | 11%
> ArraysFill.testByteFill | 170 | 156590 | 167976 | 7%
> ArraysFill.testByteFill | 180 | 156561 | 173015 | 11%
> ArraysFill.testByteFill | 190 | 156601 | 173073 | 11%
> ArraysFill.testByteFill | 200 | 168277 | 174293 | 4%
> ArraysFill.testIntFill | 1 | 175403 | 334460 | 91%
> ArraysFill.testIntFill | 10 | 162437 | 273799 | 69%
> ArraysFill.testIntFill | 20 | 156636 | 273483 | 75%
> ArraysFill.testIntFill | 30 | 162440 | 243303 | 50%
> ArraysFill.testIntFill | 40 | 156592 | 175162 | 12%
> ArraysFill.testIntFill | 50 | 156585 | 168433 | 8%
> ArraysFill.testIntFill | 60 | 151193 | 195235 | 29%
> ArraysFill.testIntFill | 70 | 141406 | 167060 | 18%
> ArraysFill.testIntFill | 80 | 141406 | 167119 | 18%
> ArraysFill.testIntFill | 90 | 141437 | 166976 | 18%
> ArraysFill.testIntFill | 100 | 168349 | 173943 | 3%
> ArraysFill.testIntFill | 110 | 132864 | 173096 | 30%
> ArraysFill.testIntFill | 120 | 128972 | 173722 | 35%
> ArraysFill
@vamsi-parasa Thanks for the extra data!
Do I see this right? In the plots
[here](https://github.com/openjdk/jdk/pull/28442#issuecomment-3761659799), the
masked performance lies lower/better than unmasked performance (here we measure
ns/op). But in your tables
[here](https://github.com/openjdk/jdk/pull/28442#issuecomment-3761712841) you
are measuring ops/ms, and are getting the opposite trend: masked is slower than
unmasked.
Can you explain the difference between the two results?
- PR Comment: https://git.openjdk.org/jdk/pull/28442#issuecomment-3767004043
Re: RFR: 8349452: Fix performance regression for Arrays.fill() with AVX512 [v13]
On Fri, 16 Jan 2026 20:14:31 GMT, Srinivas Vamsi Parasa
wrote:
>> The goal of this PR is to fix the performance regression in Arrays.fill()
>> x86 stubs caused by masked AVX stores. The fix is to replace the masked AVX
>> stores with store instructions without masks (i.e. unmasked stores).
>> `fill32_masked()` and `fill64_masked()` stubs are replaced with
>> `fill32_unmasked()` and `fill64_unmasked()` respectively.
>>
>> To speedup unmasked stores, array fills for sizes < 64 bytes are broken down
>> into sequences of 32B, 16B, 8B, 4B, 2B and 1B stores, depending on the size.
>>
>>
>> ### **Performance comparison for byte array fills in a loop for 1 million
>> times**
>>
>>
>> UseAVX=3 ByteArray Size | +OptimizeFill(Masked store stub)
>> [secs] | -OptimizeFill (No stub) [secs] | --->This PR: +OptimizeFill
>> (Unmasked store stub) [secs]
>> -- | -- | -- | --
>> 1 | 0.46 | 0.14 | 0.189
>> 2 | 0.46 | 0.16 | 0.191
>> 3 | 0.46 | 0.176 | 0.199
>> 4 | 0.46 | 0.244 | 0.212
>> 5 | 0.46 | 0.29 | 0.364
>> 10 | 0.46 | 0.58 | 0.354
>> 15 | 0.46 | 0.42 | 0.325
>> 16 | 0.46 | 0.46 | 0.281
>> 17 | 0.21 | 0.5 | 0.365
>> 20 | 0.21 | 0.37 | 0.326
>> 25 | 0.21 | 0.59 | 0.343
>> 31 | 0.21 | 0.53 | 0.317
>> 32 | 0.21 | 0.58 | 0.249
>> 35 | 0.5 | 0.77 | 0.303
>> 40 | 0.5 | 0.61 | 0.312
>> 45 | 0.5 | 0.52 | 0.364
>> 48 | 0.5 | 0.66 | 0.283
>> 49 | 0.22 | 0.69 | 0.367
>> 50 | 0.22 | 0.78 | 0.344
>> 55 | 0.22 | 0.67 | 0.332
>> 60 | 0.22 | 0.67 | 0.312
>> 64 | 0.22 | 0.82 | 0.253
>> 70 | 0.51 | 1.1 | 0.394
>> 80 | 0.49 | 0.89 | 0.346
>> 90 | 0.225 | 0.68 | 0.385
>> 100 | 0.54 | 1.09 | 0.364
>> 110 | 0.6 | 0.98 | 0.416
>> 120 | 0.26 | 0.75 | 0.367
>> 128 | 0.266 | 1.1 | 0.342
>
> Srinivas Vamsi Parasa has updated the pull request incrementally with one
> additional commit since the last revision:
>
> Update ALL of ArraysFill JMH micro
Also, we can see the benefit of using unmasked stores (this PR) instead of
masked vector stores (existing implementation) when we update the
ArraysFill.java JMH micro-benchmark to perform fill (write) followed by read of
the filled data as shown below using short array fill as an example:
@Benchmark
public short testShortFill() {
    Arrays.fill(testShortArray, (short) -1);
    return (short) (testShortArray[0] + testShortArray[size - 1]);
}
### Table shows throughput (ops/ms); **(Higher is better)**
Benchmark (ops/ms) MaxVectorSize = 32 | SIZE | +OptimizeFill (Masked
Store) | +OptimizeFill (Unmasked Store - This PR) | Delta
-- | -- | -- | -- | --
ArraysFill.testByteFill | 1 | 175381 | 342456 | 95%
ArraysFill.testByteFill | 10 | 175421 | 264607 | 51%
ArraysFill.testByteFill | 20 | 175447 | 27 | 55%
ArraysFill.testByteFill | 30 | 175454 | 253351 | 44%
ArraysFill.testByteFill | 40 | 162429 | 273043 | 68%
ArraysFill.testByteFill | 50 | 162443 | 251734 | 55%
ArraysFill.testByteFill | 60 | 162454 | 248156 | 53%
ArraysFill.testByteFill | 70 | 156659 | 236497 | 51%
ArraysFill.testByteFill | 80 | 175403 | 269433 | 54%
ArraysFill.testByteFill | 90 | 175422 | 230276 | 31%
ArraysFill.testByteFill | 100 | 168662 | 252394 | 50%
ArraysFill.testByteFill | 110 | 146182 | 217917 | 49%
ArraysFill.testByteFill | 120 | 168693 | 239126 | 42%
ArraysFill.testByteFill | 130 | 162378 | 166159 | 2%
ArraysFill.testByteFill | 140 | 156569 | 168296 | 7%
ArraysFill.testByteFill | 150 | 151214 | 167388 | 11%
ArraysFill.testByteFill | 160 | 156594 | 173529 | 11%
ArraysFill.testByteFill | 170 | 156590 | 167976 | 7%
ArraysFill.testByteFill | 180 | 156561 | 173015 | 11%
ArraysFill.testByteFill | 190 | 156601 | 173073 | 11%
ArraysFill.testByteFill | 200 | 168277 | 174293 | 4%
ArraysFill.testIntFill | 1 | 175403 | 334460 | 91%
ArraysFill.testIntFill | 10 | 162437 | 273799 | 69%
ArraysFill.testIntFill | 20 | 156636 | 273483 | 75%
ArraysFill.testIntFill | 30 | 162440 | 243303 | 50%
ArraysFill.testIntFill | 40 | 156592 | 175162 | 12%
ArraysFill.testIntFill | 50 | 156585 | 168433 | 8%
ArraysFill.testIntFill | 60 | 151193 | 195235 | 29%
ArraysFill.testIntFill | 70 | 141406 | 167060 | 18%
ArraysFill.testIntFill | 80 | 141406 | 167119 | 18%
ArraysFill.testIntFill | 90 | 141437 | 166976 | 18%
ArraysFill.testIntFill | 100 | 168349 | 173943 | 3%
ArraysFill.testIntFill | 110 | 132864 | 173096 | 30%
ArraysFill.testIntFill | 120 | 128972 | 173722 | 35%
ArraysFill.testIntFill | 130 | 128958 | 149835 | 16%
ArraysFill.testIntFill | 140 | 167934 | 165903 | -1%
ArraysFill.testIntFill | 150 | 121799 | 133351 | 9%
ArraysFill.testIntFill | 160 | 121824 | 154654 | 27%
ArraysFill.testIntFill | 170 | 121800 | 163515 | 34%
ArraysFill.testIntFill | 180 | 121770 | 150235 | 23%
ArraysFill.testIntFill | 190 | 121808 | 145138 | 19%
ArraysFill.testIntFill | 200 | 112433 | 142084 | 26%
ArraysFill.testShortFill | 1 | 99696 | 309697 | 211%
ArraysFill.testShortFill | 10 | 175433 | 290773 | 66%
ArraysFill.testShortFill | 20 | 175417 | 270345 | 54%
ArraysFill.testShortFill | 30 | 162459 | 257180 | 58%
ArraysFill.testShortFill
Re: RFR: 8349452: Fix performance regression for Arrays.fill() with AVX512 [v13]
> The goal of this PR is to fix the performance regression in Arrays.fill() x86
> stubs caused by masked AVX stores. The fix is to replace the masked AVX
> stores with store instructions without masks (i.e. unmasked stores).
> `fill32_masked()` and `fill64_masked()` stubs are replaced with
> `fill32_unmasked()` and `fill64_unmasked()` respectively.
>
> To speedup unmasked stores, array fills for sizes < 64 bytes are broken down
> into sequences of 32B, 16B, 8B, 4B, 2B and 1B stores, depending on the size.
>
> ### **Performance comparison for byte array fills in a loop for 1 million times**
>
> UseAVX=3 ByteArray Size | +OptimizeFill (Masked store stub) [secs] | -OptimizeFill (No stub) [secs] | --->This PR: +OptimizeFill (Unmasked store stub) [secs]
> -- | -- | -- | --
> 1 | 0.46 | 0.14 | 0.189
> 2 | 0.46 | 0.16 | 0.191
> 3 | 0.46 | 0.176 | 0.199
> 4 | 0.46 | 0.244 | 0.212
> 5 | 0.46 | 0.29 | 0.364
> 10 | 0.46 | 0.58 | 0.354
> 15 | 0.46 | 0.42 | 0.325
> 16 | 0.46 | 0.46 | 0.281
> 17 | 0.21 | 0.5 | 0.365
> 20 | 0.21 | 0.37 | 0.326
> 25 | 0.21 | 0.59 | 0.343
> 31 | 0.21 | 0.53 | 0.317
> 32 | 0.21 | 0.58 | 0.249
> 35 | 0.5 | 0.77 | 0.303
> 40 | 0.5 | 0.61 | 0.312
> 45 | 0.5 | 0.52 | 0.364
> 48 | 0.5 | 0.66 | 0.283
> 49 | 0.22 | 0.69 | 0.367
> 50 | 0.22 | 0.78 | 0.344
> 55 | 0.22 | 0.67 | 0.332
> 60 | 0.22 | 0.67 | 0.312
> 64 | 0.22 | 0.82 | 0.253
> 70 | 0.51 | 1.1 | 0.394
> 80 | 0.49 | 0.89 | 0.346
> 90 | 0.225 | 0.68 | 0.385
> 100 | 0.54 | 1.09 | 0.364
> 110 | 0.6 | 0.98 | 0.416
> 120 | 0.26 | 0.75 | 0.367
> 128 | 0.266 | 1.1 | 0.342

Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision:

  Update ALL of ArraysFill JMH micro

- Changes:
  - all: https://git.openjdk.org/jdk/pull/28442/files
  - new: https://git.openjdk.org/jdk/pull/28442/files/5edff7f7..620ae44e
- Webrevs:
  - full: https://webrevs.openjdk.org/?repo=jdk&pr=28442&range=12
  - incr: https://webrevs.openjdk.org/?repo=jdk&pr=28442&range=11-12
- Stats: 8 lines in 1 file changed: 4 ins; 0 del; 4 mod
- Patch: https://git.openjdk.org/jdk/pull/28442.diff
- Fetch: git fetch https://git.openjdk.org/jdk.git pull/28442/head:pull/28442
- PR: https://git.openjdk.org/jdk/pull/28442
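The PR description's "broken down into sequences of 32B, 16B, 8B, 4B, 2B and 1B stores" amounts to covering the byte count with at most one store of each power-of-two width. This is a hypothetical sketch of the width selection only (`storeWidths` is an invented name; the actual stub may order or overlap its stores differently):

```java
import java.util.ArrayList;
import java.util.List;

class StoreWidths {
    // Hypothetical sketch: decompose a fill of n < 64 bytes into at
    // most one unmasked store of each power-of-two width, largest
    // first. Each set bit of n contributes exactly one store.
    static int[] storeWidths(int n) {
        List<Integer> widths = new ArrayList<>();
        for (int w = 32; w >= 1; w >>= 1) {
            if ((n & w) != 0) {   // bit set => one store of this width
                widths.add(w);
            }
        }
        return widths.stream().mapToInt(Integer::intValue).toArray();
    }
}
```

For example, a 13-byte fill would be covered by one 8-byte, one 4-byte and one 1-byte store, and the worst case (63 bytes) needs six stores, none of them masked.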
