Re: RFR: 8349452: Fix performance regression for Arrays.fill() with AVX512 [v13]

2026-01-30 Thread Srinivas Vamsi Parasa
On Fri, 30 Jan 2026 08:33:57 GMT, Emanuel Peter  wrote:

>>> > > @vamsi-parasa Ok, so now we have one benchmark that shows a speedup and 
>>> > > one that shows a regression. How are we to proceed?
>>> > > It seems that without loads [#28442 
>>> > > (comment)](https://github.com/openjdk/jdk/pull/28442#issuecomment-3761659799),
>>> > >  this patch leads to a regression.
>>> > > Only if there is a load from one of the last elements that the 
>>> > > `Arrays.fill` stored to with a masked operation do we get a slowdown. 
>>> > > Because of missing load-to-store forwarding. If we instead started 
>>> > > loading from the first elements, those would probably already be in 
>>> > > cache, and we would not have any latency issues, right?
>>> > > But is it not rather an edge-case that we load from the last elements 
>>> > > immediately after the `Arrays.fill`? At least for longer arrays, it 
>>> > > seems an edge case. For short arrays it is probably more likely that we 
>>> > > access the last element soon after the fill.
>>> > > It does not seem like a trivial decision to me if this patch is an 
>>> > > improvement or not. What do you think @vamsi-parasa ?
>>> > > @sviswa7 @dwhite-intel You already approved this PR. What are your 
>>> > > thoughts here?
>>> > 
>>> > 
>>> > @eme64 My thoughts are to go ahead with this PR replacing masked stores 
>>> > with scalar tail processing. As we have seen from 
>>> > https://bugs.openjdk.org/browse/JDK-8349452 masked stores can cause big 
>>> > regression in certain scenarios: accessing elements just written or any 
>>> > other adjacent data that happens to fall in the masked store range.
>>> 
>>> @sviswa7 But once this PR is integrated, I could file a performance 
>>> regression with the benchmarks from [up 
>>> here](https://github.com/openjdk/jdk/pull/28442#issuecomment-3761659799). 
>>> So what's the argument which choice is better, since we have a mix of 
>>> speedups/regression going either way, and both are probably in the 10-20% 
>>> range?
>> 
>> @eme64 You have a point there, but if you see the performance numbers for 
>> ByteMatrix.java (from JDK-8349452) in the PR description above we are 
>> talking about a recovery of 3x or so. The ByteMatrix.java is doing only 
>> Arrays.fill() on individual arrays of a 2D array. The individual arrays 
>> happened to be allocated alongside each other by the JVM and the next store 
>> sees stalls due to the masked store of previous array initialization. That 
>> was the reason to look for a solution without masked stores.
>
> @sviswa7 Ah right, the ByteMatrix.java is yet another case. There, we don't 
> seem to have any loads.
> 
>> The individual arrays happened to be allocated alongside each other by the 
>> JVM and the next store sees stalls due to the masked store of previous array 
>> initialization.
> 
> Ah, that sounds interesting! Is there some tool that would let me see that it 
> was due to masked store stalls?
> My (probably uneducated) guess would have been that it is just because a 
> single element store would be much cheaper than a masked operation. If you 
> only access a single or 2 elements, then a masked store is not yet 
> profitable. What if the masked stores were a bit further away, say a 
> cacheline away? Would that be significantly faster, because there are no 
> stalls? Or still slow because of the inherent higher cost of masked 
> operations?
> 
> If we take the ByteMatrix.java benchmark: how would the performance change if 
> we increase the size of the arrays (height)? Is there some height before 
> which the non-masked solution is faster, and after which the masked is faster?
> 
> Would it be a solution to use scalar stores for very small arrays, and only 
> use the masked loop starting at a certain threshold?
> 
> ---
> 
> I would like to see a summary of all the benchmarks we have here, and in 
> which cases we get speedups/slowdowns, and for which reason. Maybe listing 
> those reasons lets us see some third option we did not yet consider. And 
> listing all the reasons and code shapes may help us find out which shapes we 
> care about most, and then come to a decision that weighs off the pros and 
> cons.
> 
> We should also document our decision nicely in the code, so that if someone 
> gets a regression in the future, we can see if we had already considered that 
> code shape.
> 
> Does that make sense? Or do you have a better idea how to make a good 
> decision here?
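
The threshold idea raised in the quote above (scalar stores for very small arrays, the vectorized loop beyond a cutoff) could be sketched roughly as follows; `THRESHOLD` and the method names are illustrative assumptions for discussion, not anything from the patch:

```java
import java.util.Arrays;

// Hedged sketch of the proposed hybrid: below a (made-up) THRESHOLD,
// fill with plain scalar stores; at or above it, fall through to the
// bulk path, which stands in for the masked/vectorized fill loop.
public class HybridFillSketch {
    static final int THRESHOLD = 32; // hypothetical cutover point, would need tuning

    static void fill(int[] a, int val) {
        if (a.length < THRESHOLD) {
            for (int i = 0; i < a.length; i++) {
                a[i] = val; // scalar path: no masked store, no store-forwarding stall
            }
        } else {
            Arrays.fill(a, val); // stands in for the vector/masked loop
        }
    }

    public static void main(String[] args) {
        int[] small = new int[5];
        int[] big = new int[100];
        fill(small, 3);
        fill(big, 9);
        System.out.println(small[4] == 3 && big[99] == 9);
    }
}
```

The open question in the thread is precisely where such a cutover would have to sit, and whether any single value wins for both the load-after-fill and the ByteMatrix shapes.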

Hi Emanuel (@eme64),

Based on the discussion, I will run further experiments to see if the 
regressions can be addressed and get back to you at a later date.

Thanks,
Vamsi

-

PR Comment: https://git.openjdk.org/jdk/pull/28442#issuecomment-3825349231


Re: RFR: 8349452: Fix performance regression for Arrays.fill() with AVX512 [v13]

2026-01-30 Thread Srinivas Vamsi Parasa
On Fri, 30 Jan 2026 17:13:49 GMT, Sandhya Viswanathan wrote:

> Vamsi should be able to confirm this. Regarding whether the slowdown is due 
> to masked store stalls, that was my hypothesis based on the optimization 
> guide, excerpts of which Vamsi shared above.
>
MaxVectorSize is 64 on the platform used to collect the ByteMatrix fill data 
shown in the PR description.
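
As a rough reconstruction (the dimensions below are guesses, not the benchmark's actual parameters), the ByteMatrix code shape under discussion is row-by-row fills of a 2D byte array whose rows the JVM tends to allocate next to each other, so each fill's masked tail store can stall the start of the next row's fill:

```java
import java.util.Arrays;

// Hypothetical reconstruction of the ByteMatrix.java code shape:
// row widths below 64 bytes force the fill stub to handle a tail,
// and adjacent row allocations mean a masked tail store can overlap
// (from the store buffer's point of view) with the next row's stores.
public class ByteMatrixShape {
    public static void main(String[] args) {
        byte[][] m = new byte[1000][31]; // 31-byte rows: every fill ends in a <64B tail
        for (byte[] row : m) {
            Arrays.fill(row, (byte) 1); // tail handled by masked or unmasked stub
        }
        System.out.println(m[999][30]);
    }
}
```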

-

PR Comment: https://git.openjdk.org/jdk/pull/28442#issuecomment-3825339469


Re: RFR: 8349452: Fix performance regression for Arrays.fill() with AVX512 [v13]

2026-01-30 Thread Sandhya Viswanathan
On Thu, 29 Jan 2026 19:43:27 GMT, Sandhya Viswanathan wrote:

>> ### Int VectorBulkOperationsArray Fill
>> 
>> Benchmark (ns/op) | Size | -OptimizeFill (JITed code) | +OptimizeFill (Masked store) | +OptimizeFill (Unmasked store - This PR) | Delta (masked vs. unmasked)
>> -- | -- | -- | -- | -- | --
>> VectorBulkOperationsArray.fill_var_int_arrays_fill | 0 | 0.649 | 0.651 | 0.655 | 1%
>> VectorBulkOperationsArray.fill_var_int_arrays_fill | 1 | 2.371 | 2.801 | 2.827 | 1%
>> VectorBulkOperationsArray.fill_var_int_arrays_fill | 2 | 2.374 | 2.585 | 2.942 | 12%
>> VectorBulkOperationsArray.fill_var_int_arrays_fill | 3 | 2.809 | 2.589 | 3.094 | 16%
>> VectorBulkOperationsArray.fill_var_int_arrays_fill | 4 | 3.356 | 2.587 | 2.852 | 9%
>> VectorBulkOperationsArray.fill_var_int_arrays_fill | 5 | 3.531 | 2.588 | 3.158 | 18%
>> VectorBulkOperationsArray.fill_var_int_arrays_fill | 6 | 3.747 | 2.589 | 3.118 | 17%
>> VectorBulkOperationsArray.fill_var_int_arrays_fill | 7 | 3.989 | 2.589 | 3.332 | 22%
>> VectorBulkOperationsArray.fill_var_int_arrays_fill | 8 | 5.047 | 2.588 | 2.832 | 9%
>> VectorBulkOperationsArray.fill_var_int_arrays_fill | 9 | 4.79 | 2.845 | 3.056 | 7%
>> VectorBulkOperationsArray.fill_var_int_arrays_fill | 10 | 4.982 | 2.85 | 3.274 | 13%
>> VectorBulkOperationsArray.fill_var_int_arrays_fill | 11 | 4.551 | 2.852 | 3.521 | 19%
>> VectorBulkOperationsArray.fill_var_int_arrays_fill | 12 | 4.281 | 2.853 | 3.12 | 9%
>> VectorBulkOperationsArray.fill_var_int_arrays_fill | 13 | 4.391 | 2.894 | 3.499 | 17%
>> VectorBulkOperationsArray.fill_var_int_arrays_fill | 14 | 4.909 | 2.848 | 3.339 | 15%
>> VectorBulkOperationsArray.fill_var_int_arrays_fill | 15 | 5.269 | 2.853 | 3.524 | 19%
>> VectorBulkOperationsArray.fill_var_int_arrays_fill | 16 | 5.663 | 2.836 | 3.101 | 9%
>> VectorBulkOperationsArray.fill_var_int_arrays_fill | 17 | 5.553 | 2.924 | 3.111 | 6%
>> VectorBulkOperationsArray.fill_var_int_arrays_fill | 18 | 5.105 | 2.933 | 3.358 | 13%
>> VectorBulkOperationsArray.fill_var_int_arrays_fill | 19 | 5.09 | 2.942 | 3.583 | 18%
>> VectorBulkOperationsArray.fill_var_int_arrays_fill | 20 | 4.457 | 2.927 | 3.272 | 11%
>> VectorBulkOperationsArray.fill_var_int_arrays_fill | 21 | 4.745 | 3.104 | 3.598 | 14%
>> VectorBulkOperationsArray.fill_var_int_arrays_fill | 22 | 4.949 | 2.932 | 3.481 | 16%
>> VectorBulkOperationsArray.fill_var_int_arrays_fill | 23 | 4.992 | 2.939 | 3.761 | 22%
>> VectorBulkOperationsArray.fill_var_int_arrays_fill | 24 | 5.198 | 2.92 | 3.205 | 9%
>> VectorBulkOperationsArray.fill_var_int_arrays_fill | 25 | 5.097 | 3.116 | 3.387 | 8%
>
> @eme64 You have a point there, but if you see the performance numbers for 
> ByteMatrix.java (from JDK-8349452) in the PR description above we are talking 
> about a recovery of 3x or so. The ByteMatrix.java is doing only Arrays.fill() 
> on individual arrays of a 2D array. The individual arrays happened to be 
> allocated alongside each other by the JVM and the next store sees stalls due 
> to the masked store of previous array initialization. That was the reason to 
> look for a solution without masked stores.

Re: RFR: 8349452: Fix performance regression for Arrays.fill() with AVX512 [v13]

2026-01-30 Thread Emanuel Peter
On Thu, 29 Jan 2026 19:43:27 GMT, Sandhya Viswanathan wrote:

> @eme64 You have a point there, but if you see the performance numbers for 
> ByteMatrix.java (from JDK-8349452) in the PR description above we are talking 
> about a recovery of 3x or so. The ByteMatrix.java is doing only Arrays.fill() 
> on individual arrays of a 2D array. The individual arrays happened to be 
> allocated alongside each other by the JVM and the next store sees stalls due 
> to the masked store of previous array initialization. That was the reason to 
> look for a solution without masked stores.

Re: RFR: 8349452: Fix performance regression for Arrays.fill() with AVX512 [v13]

2026-01-29 Thread Sandhya Viswanathan
On Thu, 22 Jan 2026 20:30:26 GMT, Srinivas Vamsi Parasa wrote:

>> Srinivas Vamsi Parasa has updated the pull request incrementally with one 
>> additional commit since the last revision:
>> 
>>   Update ALL of ArraysFill JMH micro
>
> @sviswa7 But once this PR is integrated, I could file a performance 
> regression with the benchmarks from [up 
> here](https://github.com/openjdk/jdk/pull/28442#issuecomment-3761659799). So 
> what's the argument which choice is better, since we have a mix of 
> speedups/regression going either way, and both are probably in the 10-20% 
> range?

@eme64 You have a point there, but if you see the performance numbers for 
ByteMatrix.java (from JDK-8349452) in the PR description above we are talking 
about a recovery of 3x or so. The ByteMatrix.java is doing only Arrays.fill() 
on individual arrays of a 2D array. The individual arrays happened to be 
allocated alongside each other by the JVM and the next store sees stalls due 
to the masked store of previous array initialization. That was the reason to 
look for a solution without masked stores.

Re: RFR: 8349452: Fix performance regression for Arrays.fill() with AVX512 [v13]

2026-01-23 Thread Emanuel Peter
On Thu, 22 Jan 2026 20:30:26 GMT, Srinivas Vamsi Parasa wrote:

> @eme64 My thoughts are to go ahead with this PR replacing masked stores with 
> scalar tail processing. As we have seen from 
> https://bugs.openjdk.org/browse/JDK-8349452 masked stores can cause big 
> regression in certain scenarios: accessing elements just written or any other 
> adjacent data that happens to fall in the masked store range.

@sviswa7 But once this PR is integrated, I could file a performance regression 
with the benchmarks from [up 
here](https://github.com/openjdk/jdk/pull/28442#issuecomment-3761659799). So 
what's the argument which choice is better, since we have a mix of 
speedups/regression going either way, and both are probably in the 10-20% range?

-

PR Comment: https://git.openjdk.org/jdk/pull/28442#issuecomment-3788945532


Re: RFR: 8349452: Fix performance regression for Arrays.fill() with AVX512 [v13]

2026-01-22 Thread Srinivas Vamsi Parasa
On Fri, 16 Jan 2026 20:14:31 GMT, Srinivas Vamsi Parasa wrote:

>> The goal of this PR is to fix the performance regression in Arrays.fill() 
>> x86 stubs caused by masked AVX stores. The fix is to replace the masked AVX 
>> stores with store instructions without masks (i.e. unmasked stores). 
>> `fill32_masked()` and `fill64_masked()` stubs are replaced with 
>> `fill32_unmasked()` and `fill64_unmasked()` respectively.
>> 
>> To speed up unmasked stores, array fills for sizes < 64 bytes are broken 
>> down into sequences of 32B, 16B, 8B, 4B, 2B and 1B stores, depending on the 
>> size.
>> 
>> 
>> ### **Performance comparison for byte array fills in a loop for 1 million 
>> times**
>> 
>> 
>> UseAVX=3 ByteArray Size | +OptimizeFill (Masked store stub) [secs] | -OptimizeFill (No stub) [secs] | This PR: +OptimizeFill (Unmasked store stub) [secs]
>> -- | -- | -- | --
>> 1 | 0.46 | 0.14 | 0.189
>> 2 | 0.46 | 0.16 | 0.191
>> 3 | 0.46 | 0.176 | 0.199
>> 4 | 0.46 | 0.244 | 0.212
>> 5 | 0.46 | 0.29 | 0.364
>> 10 | 0.46 | 0.58 | 0.354
>> 15 | 0.46 | 0.42 | 0.325
>> 16 | 0.46 | 0.46 | 0.281
>> 17 | 0.21 | 0.5 | 0.365
>> 20 | 0.21 | 0.37 | 0.326
>> 25 | 0.21 | 0.59 | 0.343
>> 31 | 0.21 | 0.53 | 0.317
>> 32 | 0.21 | 0.58 | 0.249
>> 35 | 0.5 | 0.77 | 0.303
>> 40 | 0.5 | 0.61 | 0.312
>> 45 | 0.5 | 0.52 | 0.364
>> 48 | 0.5 | 0.66 | 0.283
>> 49 | 0.22 | 0.69 | 0.367
>> 50 | 0.22 | 0.78 | 0.344
>> 55 | 0.22 | 0.67 | 0.332
>> 60 | 0.22 | 0.67 | 0.312
>> 64 | 0.22 | 0.82 | 0.253
>> 70 | 0.51 | 1.1 | 0.394
>> 80 | 0.49 | 0.89 | 0.346
>> 90 | 0.225 | 0.68 | 0.385
>> 100 | 0.54 | 1.09 | 0.364
>> 110 | 0.6 | 0.98 | 0.416
>> 120 | 0.26 | 0.75 | 0.367
>> 128 | 0.266 | 1.1 | 0.342
>
> Srinivas Vamsi Parasa has updated the pull request incrementally with one 
> additional commit since the last revision:
> 
>   Update ALL of ArraysFill JMH micro
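
The <64-byte tail splitting described in the quoted PR description can be sketched in scalar Java as a binary decomposition of the remaining length; the inner loop stands in for a single 32B/16B/8B/4B/2B/1B store, and all names here are illustrative, not the stub's own:

```java
// Sketch (under the assumption count < 64): the tail length's binary
// digits select which power-of-two store widths are emitted, mirroring
// how the unmasked stubs cover the tail without masked stores.
public class TailFillSketch {
    static void fillTail(byte[] a, int from, int count, byte v) {
        int i = from;
        for (int chunk = 32; chunk >= 1; chunk >>= 1) {
            if ((count & chunk) != 0) { // this width appears in count's binary form
                for (int j = 0; j < chunk; j++) {
                    a[i + j] = v; // stands in for one unmasked store of `chunk` bytes
                }
                i += chunk;
            }
        }
    }

    public static void main(String[] args) {
        byte[] a = new byte[64];
        fillTail(a, 0, 37, (byte) 7); // 37 = 32 + 4 + 1
        System.out.println(a[36] == 7 && a[37] == 0);
    }
}
```

Each tail thus takes at most six stores, every one of them a plain unmasked write.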

### Short VectorBulkOperationsArray Fill
Benchmark (ns/op) | Size | -OptimizeFill (JITed code) | +OptimizeFill (Masked store) | +OptimizeFill (Unmasked store - This PR) | Delta (masked vs. unmasked)
-- | -- | -- | -- | -- | --
VectorBulkOperationsArray.fill_var_short_arrays_fill | 0 | 0.649 | 0.65 | 0.65 | 0%
VectorBulkOperationsArray.fill_var_short_arrays_fill | 1 | 2.366 | 2.806 | 3.025 | 8%
VectorBulkOperationsArray.fill_var_short_arrays_fill | 2 | 2.37 | 2.587 | 2.789 | 8%
VectorBulkOperationsArray.fill_var_short_arrays_fill | 3 | 2.825 | 2.587 | 3.299 | 28%
VectorBulkOperationsArray.fill_var_short_arrays_fill | 4 | 3.09 | 2.59 | 3.024 | 17%
VectorBulkOperationsArray.fill_var_short_arrays_fill | 5 | 3.336 | 2.589 | 3.338 | 29%
VectorBulkOperationsArray.fill_var_short_arrays_fill | 6 | 3.544 | 2.596 | 3.189 | 23%
VectorBulkOperationsArray.fill_var_short_arrays_fill | 7 | 3.712 | 2.719 | 3.449 | 27%
VectorBulkOperationsArray.fill_var_short_arrays_fill | 8 | 4.883 | 2.589 | 2.86 | 10%
VectorBulkOperationsArray.fill_var_short_arrays_fill | 9 | 4.817 | 2.589 | 3.355 | 30%
VectorBulkOperationsArray.fill_var_short_arrays_fill | 10 | 4.774 | 2.585 | 3.16 | 22%
VectorBulkOperationsArray.fill_var_short_arrays_fill | 11 | 4.514 | 2.589 | 3.431 | 33%
VectorBulkOperationsArray.fill_var_short_arrays_fill | 12 | 4.097 | 2.587 | 3.111 | 20%
VectorBulkOperationsArray.fill_var_short_arrays_fill | 13 | 4.351 | 2.599 | 3.393 | 31%
VectorBulkOperationsArray.fill_var_short_arrays_fill | 14 | 4.674 | 2.588 | 3.319 | 28%
VectorBulkOperationsArray.fill_var_short_arrays_fill | 15 | 4.981 | 2.586 | 3.542 | 37%
VectorBulkOperationsArray.fill_var_short_arrays_fill | 16 | 5.406 | 2.586 | 2.833 | 10%
VectorBulkOperationsArray.fill_var_short_arrays_fill | 17 | 5.307 | 2.8 | 3.202 | 14%
VectorBulkOperationsArray.fill_var_short_arrays_fill | 18 | 5.093 | 2.811 | 3.051 | 9%
VectorBulkOperationsArray.fill_var_short_arrays_fill | 19 | 4.68 | 2.817 | 3.568 | 27%
VectorBulkOperationsArray.fill_var_short_arrays_fill | 20 | 4.528 | 2.81 | 3.294 | 17%
VectorBulkOperationsArray.fill_var_short_arrays_fill | 21 | 4.633 | 2.814 | 3.589 | 28%
VectorBulkOperationsArray.fill_var_short_arrays_fill | 22 | 5.102 | 2.809 | 3.495 | 24%
VectorBulkOperationsArray.fill_var_short_arrays_fill | 23 | 5.521 | 2.812 | 3.717 | 32%
VectorBulkOperationsArray.fill_var_short_arrays_fill | 24 | 6.205 | 2.813 | 3.094 | 10%
VectorBulkOperationsArray.fill_var_short_arrays_fill | 25 | 5.92 | 2.816 | 3.58 | 27%
VectorBulkOperationsArray.fill_var_short_arrays_fill | 26 | 4.805 | 2.87 | 3.495 | 22%
VectorBulkOperationsArray.fill_var_short_arrays_fill | 27 | 4.744 | 2.815 | 3.712 | 32%
VectorBulkOperationsArray.fill_var_short_arrays_fill | 28 | 4.45 | 2.811 | 3.361 | 20%
VectorBulkOperationsArray.fill_var_short_arrays_fill | 29 | 4.59 | 2.813 | 3.734 | 33%
VectorBulkOperationsArray.fill_var_short_arrays_fill | 30 | 4.781 | 2.812 | 3.589 | 28%
VectorBulkOperationsArray.fill_var_short_arrays_fill | 31 | 4.992 | 2.81 | 3.817 | 36%
VectorBulkOperationsArray.fill_var_short_arrays_

Re: RFR: 8349452: Fix performance regression for Arrays.fill() with AVX512 [v13]

2026-01-22 Thread Srinivas Vamsi Parasa
On Wed, 21 Jan 2026 22:07:22 GMT, Derek White  wrote:

> I'm expecting to see a small regression in a write-only fill, and a larger 
> improvement in write+read fill, but we didn't present the data in a way that 
> makes it easy to compare those two tests. So we should present the graphed 
> data as a table as well. Then we can discuss how common the write+read fill 
> case is.

Hi Derek,

Please see the data for write-only fill operations for byte, short and int 
below.

Thanks,
Vamsi

### Byte VectorBulkOperationsArray Fill
Benchmark   (ns/op) | Size | -OptimizeFill(JITed code) | +OptimizeFill  
(Masked store) | +OptimizeFill  (Unmasked store - This PR) | Delta  
(masked vs. unmasked)
-- | -- | -- | -- | -- | --
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 0 | 0.649 | 0.65 | 0.653 | 0%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 1 | 2.372 | 2.803 | 2.588 | -8%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 2 | 2.37 | 2.596 | 2.471 | -5%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 3 | 2.813 | 2.591 | 2.495 | -4%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 4 | 3.086 | 2.598 | 2.757 | 6%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 5 | 3.343 | 2.59 | 3.644 | 41%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 6 | 3.549 | 2.589 | 3.536 | 37%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 7 | 3.716 | 2.616 | 3.695 | 41%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 8 | 4.854 | 2.59 | 3.252 | 26%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 9 | 4.771 | 2.587 | 3.591 | 39%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 10 | 4.78 | 2.595 | 3.542 | 36%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 11 | 4.532 | 2.589 | 3.669 | 42%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 12 | 4.164 | 2.592 | 3.505 | 35%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 13 | 4.348 | 2.589 | 3.655 | 41%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 14 | 4.703 | 2.594 | 3.637 | 40%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 15 | 4.973 | 2.591 | 3.754 | 45%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 16 | 5.498 | 2.593 | 3.062 | 18%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 17 | 5.305 | 2.588 | 3.611 | 40%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 18 | 5.081 | 2.59 | 3.649 | 41%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 19 | 4.782 | 2.586 | 3.642 | 41%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 20 | 4.458 | 2.588 | 3.565 | 38%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 21 | 4.66 | 2.586 | 3.741 | 45%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 22 | 5.112 | 2.591 | 3.681 | 42%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 23 | 5.522 | 2.607 | 3.742 | 44%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 24 | 6.02 | 2.589 | 3.27 | 26%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 25 | 4.84 | 2.588 | 3.691 | 43%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 26 | 4.81 | 2.589 | 3.674 | 42%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 27 | 4.695 | 2.591 | 3.761 | 45%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 28 | 4.828 | 2.589 | 3.578 | 38%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 29 | 4.531 | 2.586 | 3.762 | 45%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 30 | 5.38 | 2.59 | 3.713 | 43%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 31 | 4.948 | 2.588 | 3.887 | 50%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 32 | 5.21 | 2.589 | 2.861 | 11%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 33 | 6.258 | 2.824 | 3.377 | 20%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 34 | 4.992 | 2.829 | 3.388 | 20%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 35 | 4.918 | 2.812 | 3.577 | 27%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 36 | 4.647 | 2.814 | 3.351 | 19%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 37 | 4.762 | 2.815 | 3.775 | 34%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 38 | 4.93 | 2.819 | 3.76 | 33%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 39 | 5.137 | 2.821 | 3.954 | 40%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 40 | 5.377 | 2.815 | 3.483 | 24%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 41 | 5.373 | 2.815 | 3.777 | 34%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 42 | 5.309 | 2.815 | 3.77 | 34%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 43 | 5.157 | 2.815 | 3.835 | 36%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 44 | 4.862 | 2.82 | 3.743 | 33%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 45 | 4.957 | 2.816 | 3.882 | 38%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 46 | 5.207 | 2.814 | 3.85 | 37%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 47 | 5.526 | 2.813 | 4.023 | 43%
VectorBulkOperationsArray.fill_var_byte_arrays_fill | 48 | 5

Re: RFR: 8349452: Fix performance regression for Arrays.fill() with AVX512 [v13]

2026-01-22 Thread Sandhya Viswanathan
On Wed, 21 Jan 2026 00:01:39 GMT, Srinivas Vamsi Parasa  
wrote:

>> @vamsi-parasa Thanks for the extra data!
>> 
>> Do I see this right? In the plots 
>> [here](https://github.com/openjdk/jdk/pull/28442#issuecomment-3761659799), 
>> the masked performance lies lower/better than unmasked performance (here we 
>> measure ns/ops). But in your tables 
>> [here](https://github.com/openjdk/jdk/pull/28442#issuecomment-3761712841) 
>> you are measuring ops/ms, and are getting the opposite trend: masked is 
>> slower than unmasked.
>> 
>> Can you explain the difference between the two results?
>
>> Can you explain the difference between the two results?
>>
> Hi Emanuel (@eme64),
> Yes, the conclusions you mentioned are correct. The store only benchmark 
> shows that masked store is slightly better than the unmasked store. However, 
> the store followed by load benchmarks shows that the unmasked store is better 
> than masked vector store as masked vector stores have very limited store 
> forwarding support in the hardware.
> 
> This is because the load operation following the masked vector store is 
> blocked until the data is written into the cache. This is also mentioned in 
> the [Intel Software optimization 
> manual](https://www.intel.com/content/www/us/en/content-details/671488/intel-64-and-ia-32-architectures-optimization-reference-manual-volume-1.html)
>  (Chapter 18, section 18.4, page 578).
> 
> Pasting the relevant text below for reference:
> 
> 
> 18.4 FORWARDING AND MEMORY MASKING
> When using masked store and load, consider the following:
> • When the mask is not all-ones or all-zeroes, the load operation, following 
> the masked store operation 
> from the same address is blocked, until the data is written to the cache. 
> • Unlike GPR forwarding rules, vector loads whether or not they are masked, 
> do not forward unless 
> load and store addresses are exactly the same.
> — st_mask = 10101010, ld_mask = 01010101, can forward: no, should block: yes
> — st_mask = 11111111, ld_mask = 00110011, can forward: no, should block: yes
> • When the mask is all-ones, blocking does not occur, because the data may be 
> forwarded to the load 
> operation.
> — st_mask = 11111111, ld_mask = don’t care, can forward: yes, should block: no
> • When mask is all-zeroes, blocking does not occur, though neither does 
> forwarding.
> — st_mask = 00000000, ld_mask = don’t care, can forward: no, should block: no
> In summary, a masked store should be used carefully, for example, if the 
> remainder size is known at 
> compile time to be 1, and there is a load operation from the same cache line 
> after it (or there is an 
> overlap in addresses + vector lengths), it may be better to use scalar 
> remainder processing, rather than 
> a masked remainder block.
> 
> 
> Thanks,
> Vamsi

> @vamsi-parasa Ok, so now we have one benchmark that shows a speedup and one 
> that shows a regression. How are we to proceed?
> 
> It seems that without loads [#28442 
> (comment)](https://github.com/openjdk/jdk/pull/28442#issuecomment-3761659799),
>  this patch leads to a regression.
> 
> Only if there is a load from one of the last elements that the `Arrays.fill` 
> stored to with a masked operation do we get a slowdown. Because of missing 
> load-to-store forwarding. If we instead started loading from the first 
> elements, those would probably already be in cache, and we would not have any 
> latency issues, right?
> 
> But is it not rather an edge-case that we load from the last elements 
> immediately after the `Arrays.fill`? At least for longer arrays, it seems an 
> edge case. For short arrays it is probably more likely that we access the 
> last element soon after the fill.
> 
> It does not seem like a trivial decision to me if this patch is an 
> improvement or not. What do you think @vamsi-parasa ?
> 
> @sviswa7 @dwhite-intel You already approved this PR. What are your thoughts 
> here?

@eme64 My thoughts are to go ahead with this PR, replacing masked stores with 
scalar tail processing. As we have seen from 
https://bugs.openjdk.org/browse/JDK-8349452, masked stores can cause a big 
regression in certain scenarios: accessing elements just written, or any other 
adjacent data that happens to fall in the masked store range.

-

PR Comment: https://git.openjdk.org/jdk/pull/28442#issuecomment-3785772087


Re: RFR: 8349452: Fix performance regression for Arrays.fill() with AVX512 [v13]

2026-01-21 Thread Derek White
On Fri, 16 Jan 2026 20:14:31 GMT, Srinivas Vamsi Parasa  
wrote:

>> The goal of this PR is to fix the performance regression in Arrays.fill() 
>> x86 stubs caused by masked AVX stores. The fix is to replace the masked AVX 
>> stores with store instructions without masks (i.e. unmasked stores). 
>> `fill32_masked()` and `fill64_masked()` stubs are replaced with 
>> `fill32_unmasked()` and `fill64_unmasked()` respectively.
>> 
>> To speed up unmasked stores, array fills for sizes < 64 bytes are broken down 
>> into sequences of 32B, 16B, 8B, 4B, 2B and 1B stores, depending on the size.
>> 
>> 
>> ### **Performance comparison for byte array fills in a loop for 1 million 
>> times**
>> 
>> 
>> UseAVX=3 ByteArray Size | +OptimizeFill (Masked store stub) [secs] | -OptimizeFill (No stub) [secs] | --->This PR: +OptimizeFill (Unmasked store stub) [secs]
>> -- | -- | -- | --
>> 1 | 0.46 | 0.14 | 0.189
>> 2 | 0.46 | 0.16 | 0.191
>> 3 | 0.46 | 0.176 | 0.199
>> 4 | 0.46 | 0.244 | 0.212
>> 5 | 0.46 | 0.29 | 0.364
>> 10 | 0.46 | 0.58 | 0.354
>> 15 | 0.46 | 0.42 | 0.325
>> 16 | 0.46 | 0.46 | 0.281
>> 17 | 0.21 | 0.5 | 0.365
>> 20 | 0.21 | 0.37 | 0.326
>> 25 | 0.21 | 0.59 | 0.343
>> 31 | 0.21 | 0.53 | 0.317
>> 32 | 0.21 | 0.58 | 0.249
>> 35 | 0.5 | 0.77 | 0.303
>> 40 | 0.5 | 0.61 | 0.312
>> 45 | 0.5 | 0.52 | 0.364
>> 48 | 0.5 | 0.66 | 0.283
>> 49 | 0.22 | 0.69 | 0.367
>> 50 | 0.22 | 0.78 | 0.344
>> 55 | 0.22 | 0.67 | 0.332
>> 60 | 0.22 | 0.67 | 0.312
>> 64 | 0.22 | 0.82 | 0.253
>> 70 | 0.51 | 1.1 | 0.394
>> 80 | 0.49 | 0.89 | 0.346
>> 90 | 0.225 | 0.68 | 0.385
>> 100 | 0.54 | 1.09 | 0.364
>> 110 | 0.6 | 0.98 | 0.416
>> 120 | 0.26 | 0.75 | 0.367
>> 128 | 0.266 | 1.1 | 0.342
>
> Srinivas Vamsi Parasa has updated the pull request incrementally with one 
> additional commit since the last revision:
> 
>   Update ALL of ArraysFill JMH micro

I'm expecting to see a small regression in a write-only fill, and a larger 
improvement in write+read fill, but we didn't present the data in a way that 
makes it easy to compare those two tests. So we should present the graphed data 
as a table as well. Then we can discuss how common the write+read fill case is.

-

PR Comment: https://git.openjdk.org/jdk/pull/28442#issuecomment-3781378167


Re: RFR: 8349452: Fix performance regression for Arrays.fill() with AVX512 [v13]

2026-01-21 Thread Emanuel Peter
On Wed, 21 Jan 2026 00:01:39 GMT, Srinivas Vamsi Parasa  
wrote:

>> @vamsi-parasa Thanks for the extra data!
>> 
>> Do I see this right? In the plots 
>> [here](https://github.com/openjdk/jdk/pull/28442#issuecomment-3761659799), 
>> the masked performance lies lower/better than unmasked performance (here we 
>> measure ns/ops). But in your tables 
>> [here](https://github.com/openjdk/jdk/pull/28442#issuecomment-3761712841) 
>> you are measuring ops/ms, and are getting the opposite trend: masked is 
>> slower than unmasked.
>> 
>> Can you explain the difference between the two results?
>
>> Can you explain the difference between the two results?
>>
> Hi Emanuel (@eme64),
> Yes, the conclusions you mentioned are correct. The store only benchmark 
> shows that masked store is slightly better than the unmasked store. However, 
> the store followed by load benchmarks shows that the unmasked store is better 
> than masked vector store as masked vector stores have very limited store 
> forwarding support in the hardware.
> 
> This is because the load operation following the masked vector store is 
> blocked until the data is written into the cache. This is also mentioned in 
> the [Intel Software optimization 
> manual](https://www.intel.com/content/www/us/en/content-details/671488/intel-64-and-ia-32-architectures-optimization-reference-manual-volume-1.html)
>  (Chapter 18, section 18.4, page 578).
> 
> Pasting the relevant text below for reference:
> 
> 
> 18.4 FORWARDING AND MEMORY MASKING
> When using masked store and load, consider the following:
> • When the mask is not all-ones or all-zeroes, the load operation, following 
> the masked store operation 
> from the same address is blocked, until the data is written to the cache. 
> • Unlike GPR forwarding rules, vector loads whether or not they are masked, 
> do not forward unless 
> load and store addresses are exactly the same.
> — st_mask = 10101010, ld_mask = 01010101, can forward: no, should block: yes
> — st_mask = 11111111, ld_mask = 00110011, can forward: no, should block: yes
> • When the mask is all-ones, blocking does not occur, because the data may be 
> forwarded to the load 
> operation.
> — st_mask = 11111111, ld_mask = don’t care, can forward: yes, should block: no
> • When mask is all-zeroes, blocking does not occur, though neither does 
> forwarding.
> — st_mask = 00000000, ld_mask = don’t care, can forward: no, should block: no
> In summary, a masked store should be used carefully, for example, if the 
> remainder size is known at 
> compile time to be 1, and there is a load operation from the same cache line 
> after it (or there is an 
> overlap in addresses + vector lengths), it may be better to use scalar 
> remainder processing, rather than 
> a masked remainder block.
> 
> 
> Thanks,
> Vamsi

@vamsi-parasa Ok, so now we have one benchmark that shows a speedup and one 
that shows a regression. How are we to proceed?

It seems that without loads 
https://github.com/openjdk/jdk/pull/28442#issuecomment-3761659799, this patch 
leads to a regression.

Only if there is a load from one of the last elements that the `Arrays.fill` 
stored to with a masked operation do we get a slowdown. Because of missing 
load-to-store forwarding. If we instead started loading from the first 
elements, those would probably already be in cache, and we would not have any 
latency issues, right?

But is it not rather an edge-case that we load from the last elements 
immediately after the `Arrays.fill`?
At least for longer arrays, it seems an edge case. For short arrays it is 
probably more likely that we access the last element soon after the fill.

It does not seem like a trivial decision to me if this patch is an improvement 
or not. What do you think @vamsi-parasa ?

@sviswa7 @dwhite-intel You already approved this PR. What are your thoughts 
here?

-

PR Comment: https://git.openjdk.org/jdk/pull/28442#issuecomment-3776741440


Re: RFR: 8349452: Fix performance regression for Arrays.fill() with AVX512 [v13]

2026-01-20 Thread Srinivas Vamsi Parasa
On Mon, 19 Jan 2026 08:11:19 GMT, Emanuel Peter  wrote:

> Can you explain the difference between the two results?
>
Hi Emanuel (@eme64),
Yes, the conclusions you mentioned are correct. The store-only benchmark shows 
that the masked store is slightly better than the unmasked store. However, the 
store-followed-by-load benchmark shows that the unmasked store is better than 
the masked vector store, as masked vector stores have very limited store-forwarding 
support in the hardware.

This is because the load operation following the masked vector store is blocked 
until the data is written into the cache. This is also mentioned in the [Intel 
Software optimization 
manual](https://www.intel.com/content/www/us/en/content-details/671488/intel-64-and-ia-32-architectures-optimization-reference-manual-volume-1.html)
 (Chapter 18, section 18.4, page 578).

Pasting the relevant text below for reference:


18.4 FORWARDING AND MEMORY MASKING
When using masked store and load, consider the following:
• When the mask is not all-ones or all-zeroes, the load operation, following 
the masked store operation 
from the same address is blocked, until the data is written to the cache. 
• Unlike GPR forwarding rules, vector loads whether or not they are masked, do 
not forward unless 
load and store addresses are exactly the same.
— st_mask = 10101010, ld_mask = 01010101, can forward: no, should block: yes
— st_mask = 11111111, ld_mask = 00110011, can forward: no, should block: yes
• When the mask is all-ones, blocking does not occur, because the data may be 
forwarded to the load 
operation.
— st_mask = 11111111, ld_mask = don’t care, can forward: yes, should block: no
• When mask is all-zeroes, blocking does not occur, though neither does 
forwarding.
— st_mask = 00000000, ld_mask = don’t care, can forward: no, should block: no
In summary, a masked store should be used carefully, for example, if the 
remainder size is known at 
compile time to be 1, and there is a load operation from the same cache line 
after it (or there is an 
overlap in addresses + vector lengths), it may be better to use scalar 
remainder processing, rather than 
a masked remainder block.
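The decision table quoted above can be condensed into a small Java sketch. This is a hypothetical helper for illustration only (not HotSpot or Intel code): `MaskedForwardingRules`, its 8-bit masks, and the decision to ignore the load mask are all simplifying assumptions.

```java
// Hypothetical sketch: a simplified encoding of the masked store-to-load
// forwarding rules quoted above, assuming 8-bit store masks and ignoring
// the load mask (which does not change the outcome in the quoted examples).
public class MaskedForwardingRules {
    // Forwarding happens only when the load hits exactly the same address
    // and the store mask is all-ones (i.e. effectively an unmasked store).
    static boolean canForward(int stMask, boolean sameAddress) {
        return sameAddress && stMask == 0xFF;
    }

    // A partial store mask (neither all-ones nor all-zeroes) blocks a
    // following load from the same address until the data reaches the cache.
    static boolean shouldBlock(int stMask) {
        return stMask != 0x00 && stMask != 0xFF;
    }

    public static void main(String[] args) {
        System.out.println(shouldBlock(0b10101010)); // partial mask: the masked-tail case
        System.out.println(canForward(0xFF, true));  // all-ones mask: forwarding possible
    }
}
```

This is why the scalar (or unmasked) tail in this PR avoids the stall: every store it emits has an effectively all-ones mask.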


Thanks,
Vamsi

-

PR Comment: https://git.openjdk.org/jdk/pull/28442#issuecomment-3775508253


Re: RFR: 8349452: Fix performance regression for Arrays.fill() with AVX512 [v13]

2026-01-19 Thread Emanuel Peter
On Fri, 16 Jan 2026 20:31:28 GMT, Srinivas Vamsi Parasa  
wrote:

>> Srinivas Vamsi Parasa has updated the pull request incrementally with one 
>> additional commit since the last revision:
>> 
>>   Update ALL of ArraysFill JMH micro
>
> Also, we can see the benefit of using unmasked stores (this PR) instead of 
> masked vector stores (existing implementation) when we update the 
> ArraysFill.java JMH micro-benchmark to perform fill (write) followed by read 
> of the filled data as shown below using short array fill as an example:
> 
> 
> @Benchmark
> public short testShortFill() {
>     Arrays.fill(testShortArray, (short) -1);
>     return (short) (testShortArray[0] + testShortArray[size - 1]);
> }
> 
> 
> 
> 
> 
> ### Table shows throughput (ops/ms); **(Higher is better)** 
> Benchmark (ops/ms) MaxVectorSize = 32 | SIZE | +OptimizeFill (Masked Store) | +OptimizeFill (Unmasked Store - This PR) | Delta
> -- | -- | -- | -- | --
> ArraysFill.testByteFill | 1 | 175381 | 342456 | 95%
> ArraysFill.testByteFill | 10 | 175421 | 264607 | 51%
> ArraysFill.testByteFill | 20 | 175447 | 27 | 55%
> ArraysFill.testByteFill | 30 | 175454 | 253351 | 44%
> ArraysFill.testByteFill | 40 | 162429 | 273043 | 68%
> ArraysFill.testByteFill | 50 | 162443 | 251734 | 55%
> ArraysFill.testByteFill | 60 | 162454 | 248156 | 53%
> ArraysFill.testByteFill | 70 | 156659 | 236497 | 51%
> ArraysFill.testByteFill | 80 | 175403 | 269433 | 54%
> ArraysFill.testByteFill | 90 | 175422 | 230276 | 31%
> ArraysFill.testByteFill | 100 | 168662 | 252394 | 50%
> ArraysFill.testByteFill | 110 | 146182 | 217917 | 49%
> ArraysFill.testByteFill | 120 | 168693 | 239126 | 42%
> ArraysFill.testByteFill | 130 | 162378 | 166159 | 2%
> ArraysFill.testByteFill | 140 | 156569 | 168296 | 7%
> ArraysFill.testByteFill | 150 | 151214 | 167388 | 11%
> ArraysFill.testByteFill | 160 | 156594 | 173529 | 11%
> ArraysFill.testByteFill | 170 | 156590 | 167976 | 7%
> ArraysFill.testByteFill | 180 | 156561 | 173015 | 11%
> ArraysFill.testByteFill | 190 | 156601 | 173073 | 11%
> ArraysFill.testByteFill | 200 | 168277 | 174293 | 4%
> ArraysFill.testIntFill | 1 | 175403 | 334460 | 91%
> ArraysFill.testIntFill | 10 | 162437 | 273799 | 69%
> ArraysFill.testIntFill | 20 | 156636 | 273483 | 75%
> ArraysFill.testIntFill | 30 | 162440 | 243303 | 50%
> ArraysFill.testIntFill | 40 | 156592 | 175162 | 12%
> ArraysFill.testIntFill | 50 | 156585 | 168433 | 8%
> ArraysFill.testIntFill | 60 | 151193 | 195235 | 29%
> ArraysFill.testIntFill | 70 | 141406 | 167060 | 18%
> ArraysFill.testIntFill | 80 | 141406 | 167119 | 18%
> ArraysFill.testIntFill | 90 | 141437 | 166976 | 18%
> ArraysFill.testIntFill | 100 | 168349 | 173943 | 3%
> ArraysFill.testIntFill | 110 | 132864 | 173096 | 30%
> ArraysFill.testIntFill | 120 | 128972 | 173722 | 35%
> ArraysFill

@vamsi-parasa Thanks for the extra data!

Do I see this right? In the plots 
[here](https://github.com/openjdk/jdk/pull/28442#issuecomment-3761659799), the 
masked performance lies lower/better than unmasked performance (here we measure 
ns/ops). But in your tables 
[here](https://github.com/openjdk/jdk/pull/28442#issuecomment-3761712841) you 
are measuring ops/ms, and are getting the opposite trend: masked is slower than 
unmasked.

Can you explain the difference between the two results?

-

PR Comment: https://git.openjdk.org/jdk/pull/28442#issuecomment-3767004043


Re: RFR: 8349452: Fix performance regression for Arrays.fill() with AVX512 [v13]

2026-01-16 Thread Srinivas Vamsi Parasa
On Fri, 16 Jan 2026 20:14:31 GMT, Srinivas Vamsi Parasa  
wrote:

>> The goal of this PR is to fix the performance regression in Arrays.fill() 
>> x86 stubs caused by masked AVX stores. The fix is to replace the masked AVX 
>> stores with store instructions without masks (i.e. unmasked stores). 
>> `fill32_masked()` and `fill64_masked()` stubs are replaced with 
>> `fill32_unmasked()` and `fill64_unmasked()` respectively.
>> 
>> To speed up unmasked stores, array fills for sizes < 64 bytes are broken down 
>> into sequences of 32B, 16B, 8B, 4B, 2B and 1B stores, depending on the size.
>> 
>> 
>> ### **Performance comparison for byte array fills in a loop for 1 million 
>> times**
>> 
>> 
>> UseAVX=3 ByteArray Size | +OptimizeFill (Masked store stub) [secs] | -OptimizeFill (No stub) [secs] | --->This PR: +OptimizeFill (Unmasked store stub) [secs]
>> -- | -- | -- | --
>> 1 | 0.46 | 0.14 | 0.189
>> 2 | 0.46 | 0.16 | 0.191
>> 3 | 0.46 | 0.176 | 0.199
>> 4 | 0.46 | 0.244 | 0.212
>> 5 | 0.46 | 0.29 | 0.364
>> 10 | 0.46 | 0.58 | 0.354
>> 15 | 0.46 | 0.42 | 0.325
>> 16 | 0.46 | 0.46 | 0.281
>> 17 | 0.21 | 0.5 | 0.365
>> 20 | 0.21 | 0.37 | 0.326
>> 25 | 0.21 | 0.59 | 0.343
>> 31 | 0.21 | 0.53 | 0.317
>> 32 | 0.21 | 0.58 | 0.249
>> 35 | 0.5 | 0.77 | 0.303
>> 40 | 0.5 | 0.61 | 0.312
>> 45 | 0.5 | 0.52 | 0.364
>> 48 | 0.5 | 0.66 | 0.283
>> 49 | 0.22 | 0.69 | 0.367
>> 50 | 0.22 | 0.78 | 0.344
>> 55 | 0.22 | 0.67 | 0.332
>> 60 | 0.22 | 0.67 | 0.312
>> 64 | 0.22 | 0.82 | 0.253
>> 70 | 0.51 | 1.1 | 0.394
>> 80 | 0.49 | 0.89 | 0.346
>> 90 | 0.225 | 0.68 | 0.385
>> 100 | 0.54 | 1.09 | 0.364
>> 110 | 0.6 | 0.98 | 0.416
>> 120 | 0.26 | 0.75 | 0.367
>> 128 | 0.266 | 1.1 | 0.342
>
> Srinivas Vamsi Parasa has updated the pull request incrementally with one 
> additional commit since the last revision:
> 
>   Update ALL of ArraysFill JMH micro

Also, we can see the benefit of using unmasked stores (this PR) instead of 
masked vector stores (existing implementation) when we update the 
ArraysFill.java JMH micro-benchmark to perform fill (write) followed by read of 
the filled data as shown below using short array fill as an example:


@Benchmark
public short testShortFill() {
    Arrays.fill(testShortArray, (short) -1);
    return (short) (testShortArray[0] + testShortArray[size - 1]);
}





**(Higher is better)** 
Benchmark (ops/ms) MaxVectorSize = 32 | SIZE | +OptimizeFill (Masked Store) | +OptimizeFill (Unmasked Store - This PR) | Delta
-- | -- | -- | -- | --
ArraysFill.testByteFill | 1 | 175381 | 342456 | 95%
ArraysFill.testByteFill | 10 | 175421 | 264607 | 51%
ArraysFill.testByteFill | 20 | 175447 | 27 | 55%
ArraysFill.testByteFill | 30 | 175454 | 253351 | 44%
ArraysFill.testByteFill | 40 | 162429 | 273043 | 68%
ArraysFill.testByteFill | 50 | 162443 | 251734 | 55%
ArraysFill.testByteFill | 60 | 162454 | 248156 | 53%
ArraysFill.testByteFill | 70 | 156659 | 236497 | 51%
ArraysFill.testByteFill | 80 | 175403 | 269433 | 54%
ArraysFill.testByteFill | 90 | 175422 | 230276 | 31%
ArraysFill.testByteFill | 100 | 168662 | 252394 | 50%
ArraysFill.testByteFill | 110 | 146182 | 217917 | 49%
ArraysFill.testByteFill | 120 | 168693 | 239126 | 42%
ArraysFill.testByteFill | 130 | 162378 | 166159 | 2%
ArraysFill.testByteFill | 140 | 156569 | 168296 | 7%
ArraysFill.testByteFill | 150 | 151214 | 167388 | 11%
ArraysFill.testByteFill | 160 | 156594 | 173529 | 11%
ArraysFill.testByteFill | 170 | 156590 | 167976 | 7%
ArraysFill.testByteFill | 180 | 156561 | 173015 | 11%
ArraysFill.testByteFill | 190 | 156601 | 173073 | 11%
ArraysFill.testByteFill | 200 | 168277 | 174293 | 4%
ArraysFill.testIntFill | 1 | 175403 | 334460 | 91%
ArraysFill.testIntFill | 10 | 162437 | 273799 | 69%
ArraysFill.testIntFill | 20 | 156636 | 273483 | 75%
ArraysFill.testIntFill | 30 | 162440 | 243303 | 50%
ArraysFill.testIntFill | 40 | 156592 | 175162 | 12%
ArraysFill.testIntFill | 50 | 156585 | 168433 | 8%
ArraysFill.testIntFill | 60 | 151193 | 195235 | 29%
ArraysFill.testIntFill | 70 | 141406 | 167060 | 18%
ArraysFill.testIntFill | 80 | 141406 | 167119 | 18%
ArraysFill.testIntFill | 90 | 141437 | 166976 | 18%
ArraysFill.testIntFill | 100 | 168349 | 173943 | 3%
ArraysFill.testIntFill | 110 | 132864 | 173096 | 30%
ArraysFill.testIntFill | 120 | 128972 | 173722 | 35%
ArraysFill.testIntFill | 130 | 128958 | 149835 | 16%
ArraysFill.testIntFill | 140 | 167934 | 165903 | -1%
ArraysFill.testIntFill | 150 | 121799 | 133351 | 9%
ArraysFill.testIntFill | 160 | 121824 | 154654 | 27%
ArraysFill.testIntFill | 170 | 121800 | 163515 | 34%
ArraysFill.testIntFill | 180 | 121770 | 150235 | 23%
ArraysFill.testIntFill | 190 | 121808 | 145138 | 19%
ArraysFill.testIntFill | 200 | 112433 | 142084 | 26%
ArraysFill.testShortFill | 1 | 99696 | 309697 | 211%
ArraysFill.testShortFill | 10 | 175433 | 290773 | 66%
ArraysFill.testShortFill | 20 | 175417 | 270345 | 54%
ArraysFill.testShortFill | 30 | 162459 | 257180 | 58%
ArraysFill.testShortFill

Re: RFR: 8349452: Fix performance regression for Arrays.fill() with AVX512 [v13]

2026-01-16 Thread Srinivas Vamsi Parasa
> The goal of this PR is to fix the performance regression in Arrays.fill() x86 
> stubs caused by masked AVX stores. The fix is to replace the masked AVX 
> stores with store instructions without masks (i.e. unmasked stores). 
> `fill32_masked()` and `fill64_masked()` stubs are replaced with 
> `fill32_unmasked()` and `fill64_unmasked()` respectively.
> 
> To speed up unmasked stores, array fills for sizes < 64 bytes are broken down 
> into sequences of 32B, 16B, 8B, 4B, 2B and 1B stores, depending on the size.
> 
> 
> ### **Performance comparison for byte array fills in a loop for 1 million 
> times**
> 
> 
> UseAVX=3 ByteArray Size | +OptimizeFill (Masked store stub) [secs] | -OptimizeFill (No stub) [secs] | --->This PR: +OptimizeFill (Unmasked store stub) [secs]
> -- | -- | -- | --
> 1 | 0.46 | 0.14 | 0.189
> 2 | 0.46 | 0.16 | 0.191
> 3 | 0.46 | 0.176 | 0.199
> 4 | 0.46 | 0.244 | 0.212
> 5 | 0.46 | 0.29 | 0.364
> 10 | 0.46 | 0.58 | 0.354
> 15 | 0.46 | 0.42 | 0.325
> 16 | 0.46 | 0.46 | 0.281
> 17 | 0.21 | 0.5 | 0.365
> 20 | 0.21 | 0.37 | 0.326
> 25 | 0.21 | 0.59 | 0.343
> 31 | 0.21 | 0.53 | 0.317
> 32 | 0.21 | 0.58 | 0.249
> 35 | 0.5 | 0.77 | 0.303
> 40 | 0.5 | 0.61 | 0.312
> 45 | 0.5 | 0.52 | 0.364
> 48 | 0.5 | 0.66 | 0.283
> 49 | 0.22 | 0.69 | 0.367
> 50 | 0.22 | 0.78 | 0.344
> 55 | 0.22 | 0.67 | 0.332
> 60 | 0.22 | 0.67 | 0.312
> 64 | 0.22 | 0.82 | 0.253
> 70 | 0.51 | 1.1 | 0.394
> 80 | 0.49 | 0.89 | 0.346
> 90 | 0.225 | 0.68 | 0.385
> 100 | 0.54 | 1.09 | 0.364
> 110 | 0.6 | 0.98 | 0.416
> 120 | 0.26 | 0.75 | 0.367
> 128 | 0.266 | 1.1 | 0.342

Srinivas Vamsi Parasa has updated the pull request incrementally with one 
additional commit since the last revision:

  Update ALL of ArraysFill JMH micro
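As a note for readers, the 32B/16B/8B/4B/2B/1B decomposition described in the PR summary can be modeled in plain Java. The sketch below is illustrative only: the actual stubs emit x86 stores in the macro assembler, and `UnmaskedTailFill`, `fillTail`, and the ByteBuffer plumbing are hypothetical stand-ins (Java has no 32-byte scalar store, so the 32B and 16B steps are modeled as pairs of 8-byte stores).

```java
import java.nio.ByteBuffer;

// Illustrative model (not the HotSpot stub): fill n < 64 bytes using a
// descending sequence of power-of-two stores instead of one masked store.
// Each set bit of n is handled by exactly one step, so the steps cover
// n bytes in total without overlap.
public class UnmaskedTailFill {
    static void fillTail(ByteBuffer dst, int off, int n, byte val) {
        long p = (val & 0xFFL) * 0x0101010101010101L; // replicate byte into all 8 lanes
        if ((n & 32) != 0) { for (int i = 0; i < 32; i += 8) dst.putLong(off + i, p); off += 32; }
        if ((n & 16) != 0) { dst.putLong(off, p); dst.putLong(off + 8, p); off += 16; }
        if ((n & 8)  != 0) { dst.putLong(off, p);  off += 8; }
        if ((n & 4)  != 0) { dst.putInt(off, (int) p);   off += 4; }
        if ((n & 2)  != 0) { dst.putShort(off, (short) p); off += 2; }
        if ((n & 1)  != 0) { dst.put(off, val); }
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(64);
        fillTail(buf, 0, 63, (byte) 0xAB); // 63 = 32+16+8+4+2+1: every branch fires
    }
}
```

Because every store here is a full-width (unmasked) store, a load that immediately follows can be forwarded, which is the property the benchmarks in this thread are measuring.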

-

Changes:
  - all: https://git.openjdk.org/jdk/pull/28442/files
  - new: https://git.openjdk.org/jdk/pull/28442/files/5edff7f7..620ae44e

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=28442&range=12
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=28442&range=11-12

  Stats: 8 lines in 1 file changed: 4 ins; 0 del; 4 mod
  Patch: https://git.openjdk.org/jdk/pull/28442.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/28442/head:pull/28442

PR: https://git.openjdk.org/jdk/pull/28442