Re: RFR: 8329331: Intrinsify Unsafe::setMemory [v26]

2024-04-21 Thread Jatin Bhateja
On Sat, 20 Apr 2024 22:31:48 GMT, Scott Gibbons wrote: >> This code makes an intrinsic stub for `Unsafe::setMemory` for x86_64. See >> [this PR](https://github.com/openjdk/jdk/pull/16760) for discussion around >> this change. >> >> Overall, making this an intrinsic improves overall

Integrated: 8318650: Optimized subword gather for x86 targets.

2024-04-21 Thread Jatin Bhateja
On Wed, 25 Oct 2023 04:34:59 GMT, Jatin Bhateja wrote: > Hi All, > > This patch optimizes sub-word gather operation for x86 targets with AVX2 and > AVX512 features. > > Following is the summary of changes:- > > 1) Intrinsify sub-word gather using hybrid algorithm whic

Re: RFR: 8318650: Optimized subword gather for x86 targets. [v18]

2024-04-21 Thread Jatin Bhateja
to caller contexts. > > Kindly review and share your feedback. > > Best Regards, > Jatin Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 16 commits: - Merge branch 'master' of http://github.com/openj

Re: RFR: 8329331: Intrinsify Unsafe::setMemory [v24]

2024-04-20 Thread Jatin Bhateja
On Fri, 19 Apr 2024 22:08:52 GMT, Scott Gibbons wrote: >> This code makes an intrinsic stub for `Unsafe::setMemory` for x86_64. See >> [this PR](https://github.com/openjdk/jdk/pull/16760) for discussion around >> this change. >> >> Overall, making this an intrinsic improves overall

Re: RFR: 8329331: Intrinsify Unsafe::setMemory [v21]

2024-04-19 Thread Jatin Bhateja
On Tue, 16 Apr 2024 00:04:15 GMT, Scott Gibbons wrote: >> This code makes an intrinsic stub for `Unsafe::setMemory` for x86_64. See >> [this PR](https://github.com/openjdk/jdk/pull/16760) for discussion around >> this change. >> >> Overall, making this an intrinsic improves overall

Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v2]

2024-04-15 Thread Jatin Bhateja
On Mon, 15 Apr 2024 22:04:14 GMT, Volodymyr Paprotski wrote: >> src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 394: >> >>> 392: __ lea(aLimbs, Address(aLimbs,8)); >>> 393: __ lea(bLimbs, Address(bLimbs,8)); >>> 394: __ jmp(L_DefaultLoop); >> >> Both sub and cmp are flag

Re: RFR: 8329538: Accelerate P256 on x86_64 using Montgomery intrinsic [v2]

2024-04-05 Thread Jatin Bhateja
On Tue, 2 Apr 2024 19:19:59 GMT, Volodymyr Paprotski wrote: >> Performance. Before: >> >> Benchmark (algorithm) (dataSize) (keyLength) >> (provider) Mode Cnt Score Error Units >> SignatureBench.ECDSA.sign SHA256withECDSA 1024 256

Re: RFR: 8318650: Optimized subword gather for x86 targets. [v17]

2024-03-02 Thread Jatin Bhateja
to caller contexts. > > Kindly review and share your feedback. > > Best Regards, > Jatin Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: Review resolutions. - Changes: - all: https://git.openjdk.o

Re: RFR: JDK-8320448 Accelerate IndexOf using AVX2 [v8]

2024-02-26 Thread Jatin Bhateja
andboxes/jdk-reviews/jdk/make/core.237140) >> # >> # An error report file with more information is saved as: >> # /home/jatinbha/sandboxes/jdk-reviews/jdk/make/hs_err_pid237140.log >>... (rest of output omitted) > > @jatin-bhateja Thanks for the note. Fixed a

Re: RFR: 8318650: Optimized subword gather for x86 targets. [v15]

2024-02-26 Thread Jatin Bhateja
On Mon, 26 Feb 2024 13:47:35 GMT, Emanuel Peter wrote: >> Jatin Bhateja has updated the pull request incrementally with one additional >> commit since the last revision: >> >> Review comments resolutions > > Reposting link to a conversation that is marked "

Re: RFR: 8318650: Optimized subword gather for x86 targets. [v14]

2024-02-26 Thread Jatin Bhateja
On Mon, 26 Feb 2024 15:05:24 GMT, Emanuel Peter wrote: >> I was referring to the various arrays as well above. I think it would be >> exactly more concise if you defined a local label in the loop body. > > Have you had a look at `C2_MacroAssembler::rtm_counters_update`? Correct, with each

Re: RFR: 8318650: Optimized subword gather for x86 targets. [v16]

2024-02-26 Thread Jatin Bhateja
to caller contexts. > > Kindly review and share your feedback. > > Best Regards, > Jatin Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: Review comment resolutions. - Changes: - all: https://git

Re: RFR: 8318650: Optimized subword gather for x86 targets. [v14]

2024-02-26 Thread Jatin Bhateja
On Mon, 26 Feb 2024 13:31:05 GMT, Emanuel Peter wrote: >> At the risk of becoming too nit-picky: which allocations are you talking >> about? Given you only have a single src and a single dst for this >> label/jump. So you won't use `_patch_overflow`. And therefore, all >> allocations are on

Re: RFR: JDK-8320448 Accelerate IndexOf using AVX2 [v13]

2024-02-26 Thread Jatin Bhateja
On Thu, 22 Feb 2024 03:15:10 GMT, Scott Gibbons wrote: >> Re-write the IndexOf code without the use of the pcmpestri instruction, only >> using AVX2 instructions. This change accelerates String.IndexOf on average >> 1.3x for AVX2. The benchmark numbers: >> >> >> Benchmark

Re: RFR: 8318650: Optimized subword gather for x86 targets. [v13]

2024-02-26 Thread Jatin Bhateja
On Mon, 26 Feb 2024 09:36:09 GMT, Emanuel Peter wrote: >> A 64-bit sub-word SPECIES will hold either 8 byte values or 4 short values; >> the algorithm handles both appropriately. > > Are you saying that the constraints are too relaxed, but currently no outside > algorithm would pass something bad? >

Re: RFR: 8318650: Optimized subword gather for x86 targets. [v13]

2024-02-26 Thread Jatin Bhateja
On Mon, 26 Feb 2024 09:37:33 GMT, Emanuel Peter wrote: >> I'll rereview after > > So xtmp1...3 and rtmp cannot have more descriptive names? These are temporary variables and are appropriately named. - PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1502587427

Re: RFR: 8318650: Optimized subword gather for x86 targets. [v14]

2024-02-26 Thread Jatin Bhateja
On Mon, 26 Feb 2024 09:39:01 GMT, Emanuel Peter wrote: >> Jatin Bhateja has updated the pull request incrementally with one additional >> commit since the last revision: >> >> Review comments resolutions. > > src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1

Re: RFR: 8318650: Optimized subword gather for x86 targets. [v15]

2024-02-26 Thread Jatin Bhateja
to caller contexts. > > Kindly review and share your feedback. > > Best Regards, > Jatin Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: Review comments resolutions - Changes: - all: https://git

Re: RFR: 8318650: Optimized subword gather for x86 targets. [v14]

2024-02-26 Thread Jatin Bhateja
On Mon, 26 Feb 2024 09:47:50 GMT, Emanuel Peter wrote: >> Jatin Bhateja has updated the pull request incrementally with one additional >> commit since the last revision: >> >> Review comments resolutions. > > src/jdk.incubator.vector/share/classes/jdk/incubator

Re: RFR: 8318650: Optimized subword gather for x86 targets. [v13]

2024-02-24 Thread Jatin Bhateja
On Tue, 20 Feb 2024 08:36:29 GMT, Emanuel Peter wrote: >> Jatin Bhateja has updated the pull request incrementally with one additional >> commit since the last revision: >> >> Review comments resolutions. > > src/hotspot/cpu/x86/x86.ad line 4120: > >>

Re: RFR: 8318650: Optimized subword gather for x86 targets. [v14]

2024-02-24 Thread Jatin Bhateja
to caller contexts. > > Kindly review and share your feedback. > > Best Regards, > Jatin Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: Review comments resolutions. - Changes: - all: https://git

Re: RFR: 8318650: Optimized subword gather for x86 targets. [v13]

2024-02-24 Thread Jatin Bhateja
On Tue, 20 Feb 2024 08:04:27 GMT, Emanuel Peter wrote: >> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1584: >> >>> 1582: Label *larr[] = {, , , }; >>> 1583: for (int i = 0; i < 4; i++) { >>> 1584: // dst[i] = mask ? src[index[i]] : 0 >> >> I like these comments a lot! >>
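The quoted code comment `dst[i] = mask ? src[index[i]] : 0` states the per-lane semantics of a masked gather. A minimal scalar sketch of that rule in plain Java follows; the class name, method name, and the use of `boolean[]` for the mask are illustrative assumptions and are not taken from the patch, which emits x86 assembly instead.

```java
// Scalar reference for the masked sub-word gather rule quoted above.
// All names here are hypothetical; only the per-lane rule comes from the thread.
final class MaskedGatherReference {
    // For every lane i: dst[i] = mask[i] ? src[index[i]] : 0
    static short[] gather(short[] src, int[] index, boolean[] mask) {
        short[] dst = new short[index.length];
        for (int i = 0; i < index.length; i++) {
            // Lanes whose mask bit is clear are zeroed and src is not read.
            dst[i] = mask[i] ? src[index[i]] : 0;
        }
        return dst;
    }
}
```

Because a cleared mask lane never dereferences `src`, an out-of-range index in such a lane is harmless in this sketch.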

Re: RFR: JDK-8320448 Accelerate IndexOf using AVX2 [v8]

2024-02-20 Thread Jatin Bhateja
On Wed, 14 Feb 2024 14:31:03 GMT, Scott Gibbons wrote: >> Scott Gibbons has updated the pull request incrementally with one additional >> commit since the last revision: >> >> Remove gcc lib fn; reduce special cases to 10 from 32 > > Thank you all for the reviews. I have been asked to

Re: RFR: 8318650: Optimized subword gather for x86 targets. [v13]

2024-02-07 Thread Jatin Bhateja
to caller contexts. > > Kindly review and share your feedback. > > Best Regards, > Jatin Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: Review comments resolutions. - Changes: - all: https://git

Re: RFR: 8318650: Optimized subword gather for x86 targets. [v11]

2024-02-01 Thread Jatin Bhateja
non >> accessible address space, so we want to be super safe here. I am also >> sitting over other solution which is performant. > > Hi @jatin-bhateja, > > The layout of an array is as follows: > > [header] - [length] - [data] > > Since `length` is a 4-byt

Re: RFR: 8318650: Optimized subword gather for x86 targets. [v11]

2024-02-01 Thread Jatin Bhateja
On Wed, 31 Jan 2024 21:29:08 GMT, Sandhya Viswanathan wrote: >> Jatin Bhateja has refreshed the contents of this pull request, and previous >> commits have been removed. Incremental views are not available. > > src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1613: >

Re: RFR: 8318650: Optimized subword gather for x86 targets. [v11]

2024-02-01 Thread Jatin Bhateja
On Thu, 1 Feb 2024 16:25:52 GMT, Jatin Bhateja wrote: >> I guess the fact that the Java objects are 8 byte alignment padded and the >> alignment being done at lines 1609-1611 and 1616-1621 somehow takes care of >> this. > > Hi @sviswa7 , I have rolled back to originally

Re: RFR: 8318650: Optimized subword gather for x86 targets. [v11]

2024-02-01 Thread Jatin Bhateja
On Wed, 31 Jan 2024 23:53:16 GMT, Sandhya Viswanathan wrote: >> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1613: >> >>> 1611: vpand(xtmp, idx_vec, xtmp, vlen_enc); >>> 1612: // Load double words from normalized indices. >>> 1613: evpgatherdd(dst, gmask, Address(base, xtmp,

Re: RFR: 8318650: Optimized subword gather for x86 targets. [v12]

2024-02-01 Thread Jatin Bhateja
ons into caller contexts. > > Kindly review and share your feedback. > > Best Regards, > Jatin Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 10 commits: - Generalizing masked sub-gather support. - Merge

Re: RFR: 8324858: [vectorapi] Bounds checking issues when accessing memory segments

2024-02-01 Thread Jatin Bhateja
On Mon, 29 Jan 2024 19:45:41 GMT, Paul Sandoz wrote: > The implementation of method `VectorSpecies::fromMemorySegment`, in > `AbstractSpecies::fromMemorySegment`, neglects to perform bounds checks on > the offset argument when the method is compiled by C2 (bounds checks are > performed when
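The report above concerns a missing bounds check on the offset argument. As a rough illustration of what such a check validates, the sketch below uses the standard `java.util.Objects.checkFromIndexSize(long, long, long)` (available since Java 16). The class, method, and parameter names are hypothetical, and this is not a claim about how the actual fix in `AbstractSpecies::fromMemorySegment` is written.

```java
// Illustrative only: validates that [offset, offset + vectorByteSize) lies
// inside a segment of segmentByteSize bytes. Names are assumptions.
final class SegmentBoundsCheck {
    static long checkedOffset(long offset, long vectorByteSize, long segmentByteSize) {
        // Throws IndexOutOfBoundsException when the range is out of bounds;
        // otherwise returns offset unchanged.
        return java.util.Objects.checkFromIndexSize(offset, vectorByteSize, segmentByteSize);
    }
}
```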

Re: RFR: 8322768: Optimize non-subword vector compress and expand APIs for AVX2 target. [v9]

2024-01-25 Thread Jatin Bhateja
On Thu, 25 Jan 2024 09:15:26 GMT, Emanuel Peter wrote: >> Jatin Bhateja has updated the pull request with a new target base due to a >> merge or a rebase. The incremental webrev excludes the unrelated changes >> brought in by the merge/rebase. The pull request cont

Integrated: 8322768: Optimize non-subword vector compress and expand APIs for AVX2 target.

2024-01-25 Thread Jatin Bhateja
On Thu, 4 Jan 2024 05:28:59 GMT, Jatin Bhateja wrote: > Hi, > > Patch optimizes non-subword vector compress and expand APIs for x86 AVX2 only > targets. > Upcoming E-core Xeons (Sierra Forest) and Hybrid CPUs only support AVX2 > instruction set. > These are very

Re: RFR: 8322768: Optimize non-subword vector compress and expand APIs for AVX2 target. [v9]

2024-01-24 Thread Jatin Bhateja
On Tue, 23 Jan 2024 15:20:47 GMT, Emanuel Peter wrote: >> Jatin Bhateja has updated the pull request with a new target base due to a >> merge or a rebase. The incremental webrev excludes the unrelated changes >> brought in by the merge/rebase. The pull request cont

Re: RFR: 8322768: Optimize non-subword vector compress and expand APIs for AVX2 target. [v8]

2024-01-23 Thread Jatin Bhateja
On Tue, 23 Jan 2024 08:17:13 GMT, Emanuel Peter wrote: >> Jatin Bhateja has updated the pull request incrementally with one additional >> commit since the last revision: >> >> Review comments resolution > > src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 5

Re: RFR: 8322768: Optimize non-subword vector compress and expand APIs for AVX2 target. [v9]

2024-01-23 Thread Jatin Bhateja
> ops/ms > ColumnFilterBenchmark.filterFloatColumn 2047 thrpt 2 2587.229 > ops/ms > ColumnFilterBenchmark.filterFloatColumn 4096 thrpt 2 1278.665 > ops/ms > ColumnFilterBenchmark.filterIntColumn 1024 thrpt 2 4149.384 > op

Re: RFR: JDK-8320448 Accelerate IndexOf using AVX2 [v7]

2024-01-21 Thread Jatin Bhateja
On Mon, 22 Jan 2024 07:05:56 GMT, Jatin Bhateja wrote: >> Scott Gibbons has updated the pull request with a new target base due to a >> merge or a rebase. The pull request now contains 22 commits: >> >> - Merge branch 'openjdk:master' into indexof >>

Re: RFR: JDK-8320448 Accelerate IndexOf using AVX2 [v7]

2024-01-21 Thread Jatin Bhateja
On Thu, 11 Jan 2024 23:06:32 GMT, Scott Gibbons wrote: >> Re-write the IndexOf code without the use of the pcmpestri instruction, only >> using AVX2 instructions. This change accelerates String.IndexOf on average >> 1.3x for AVX2. The benchmark numbers: >> >> >> Benchmark

Re: RFR: 8318650: Optimized subword gather for x86 targets. [v11]

2024-01-20 Thread Jatin Bhateja
ons into caller contexts. > > Kindly review and share your feedback. > > Best Regards, > Jatin Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: Review comments resolutions. - Changes: - all: https://git.ope

Re: RFR: 8322768: Optimize non-subword vector compress and expand APIs for AVX2 target. [v8]

2024-01-20 Thread Jatin Bhateja
> ops/ms > ColumnFilterBenchmark.filterFloatColumn 2047 thrpt 2 2587.229 > ops/ms > ColumnFilterBenchmark.filterFloatColumn 4096 thrpt 2 1278.665 > ops/ms > ColumnFilterBenchmark.filterIntColumn 1024 thrpt 2 4149.384 > op

Re: RFR: 8322768: Optimize non-subword vector compress and expand APIs for AVX2 target. [v7]

2024-01-19 Thread Jatin Bhateja
> ops/ms > ColumnFilterBenchmark.filterFloatColumn 2047 thrpt 2 2587.229 > ops/ms > ColumnFilterBenchmark.filterFloatColumn 4096 thrpt 2 1278.665 > ops/ms > ColumnFilterBenchmark.filterIntColumn 1024 thrpt 2 4149.384 > op

Re: RFR: 8322768: Optimize non-subword vector compress and expand APIs for AVX2 target. [v5]

2024-01-19 Thread Jatin Bhateja
On Fri, 19 Jan 2024 07:43:18 GMT, Emanuel Peter wrote: >> For long/double, each permute row is 32 bytes in size, hence the shift by 5 to >> compute the row address. > > Ah right. Maybe we could say `32byte = 4 long = 4 * 64bit`. > Because "64bit row" sounds like the whole row is only 64 bit long. It is >

Re: RFR: 8322768: Optimize non-subword vector compress and expand APIs for AVX2 target. [v5]

2024-01-18 Thread Jatin Bhateja
On Tue, 16 Jan 2024 07:08:57 GMT, Emanuel Peter wrote: >> Each long/double permute lane holds a 64-bit value. > > @jatin-bhateja so why do you shift by 5? I thought 4 longs are 32 bit? For long/double, each permute row is 32 bytes in size, hence the shift by 5 to compute the row address. -
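The arithmetic behind the shift can be written out as a small Java sketch. It assumes, for illustration only, that the permute table holds one row per 4-bit mask value and that each row stores 4 lane entries of 8 bytes; the class and method names are hypothetical.

```java
// Worked arithmetic for the "shift by 5" discussed above.
// Assumption: one table row per mask value, 4 long/double entries per row.
final class PermuteRowAddress {
    static final int LANES = 4;
    static final int ROW_BYTES = LANES * Long.BYTES; // 4 * 8 = 32 bytes

    // Byte offset of the row for a given mask: mask * 32, i.e. mask << 5.
    static int rowOffset(int mask) {
        return mask << 5;
    }
}
```

Since 32 is 2 to the power 5, the shift and the multiplication by the row size give the same offset for every mask value.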

Re: RFR: 8322768: Optimize non-subword vector compress and expand APIs for AVX2 target. [v6]

2024-01-18 Thread Jatin Bhateja
> ops/ms > ColumnFilterBenchmark.filterFloatColumn 2047 thrpt 2 2587.229 > ops/ms > ColumnFilterBenchmark.filterFloatColumn 4096 thrpt 2 1278.665 > ops/ms > ColumnFilterBenchmark.filterIntColumn 1024 thrpt 2 4149.384 > op

Re: RFR: JDK-8320448 Accelerate IndexOf using AVX2 [v7]

2024-01-16 Thread Jatin Bhateja
On Thu, 11 Jan 2024 23:06:32 GMT, Scott Gibbons wrote: >> Re-write the IndexOf code without the use of the pcmpestri instruction, only >> using AVX2 instructions. This change accelerates String.IndexOf on average >> 1.3x for AVX2. The benchmark numbers: >> >> >> Benchmark

Re: RFR: JDK-8320448 Accelerate IndexOf using AVX2 [v7]

2024-01-16 Thread Jatin Bhateja
On Thu, 11 Jan 2024 23:06:32 GMT, Scott Gibbons wrote: >> Re-write the IndexOf code without the use of the pcmpestri instruction, only >> using AVX2 instructions. This change accelerates String.IndexOf on average >> 1.3x for AVX2. The benchmark numbers: >> >> >> Benchmark

Re: RFR: JDK-8320448 Accelerate IndexOf using AVX2 [v7]

2024-01-16 Thread Jatin Bhateja
On Thu, 11 Jan 2024 23:06:32 GMT, Scott Gibbons wrote: >> Re-write the IndexOf code without the use of the pcmpestri instruction, only >> using AVX2 instructions. This change accelerates String.IndexOf on average >> 1.3x for AVX2. The benchmark numbers: >> >> >> Benchmark

Re: RFR: JDK-8320448 Accelerate IndexOf using AVX2 [v7]

2024-01-16 Thread Jatin Bhateja
On Thu, 11 Jan 2024 23:06:32 GMT, Scott Gibbons wrote: >> Re-write the IndexOf code without the use of the pcmpestri instruction, only >> using AVX2 instructions. This change accelerates String.IndexOf on average >> 1.3x for AVX2. The benchmark numbers: >> >> >> Benchmark

Re: RFR: JDK-8320448 Accelerate IndexOf using AVX2 [v7]

2024-01-16 Thread Jatin Bhateja
On Thu, 11 Jan 2024 23:06:32 GMT, Scott Gibbons wrote: >> Re-write the IndexOf code without the use of the pcmpestri instruction, only >> using AVX2 instructions. This change accelerates String.IndexOf on average >> 1.3x for AVX2. The benchmark numbers: >> >> >> Benchmark

Re: RFR: JDK-8320448 Accelerate IndexOf using AVX2 [v7]

2024-01-16 Thread Jatin Bhateja
On Thu, 11 Jan 2024 23:06:32 GMT, Scott Gibbons wrote: >> Re-write the IndexOf code without the use of the pcmpestri instruction, only >> using AVX2 instructions. This change accelerates String.IndexOf on average >> 1.3x for AVX2. The benchmark numbers: >> >> >> Benchmark

Re: RFR: 8318650: Optimized subword gather for x86 targets. [v10]

2024-01-15 Thread Jatin Bhateja
On Mon, 15 Jan 2024 14:27:43 GMT, Emanuel Peter wrote: >> Jatin Bhateja has updated the pull request with a new target base due to a >> merge or a rebase. The pull request now contains 12 commits: >> >> - Accelerating masked sub-word gathers for AVX2 targets, this giv

Re: RFR: JDK-8320448 Accelerate IndexOf using AVX2 [v7]

2024-01-15 Thread Jatin Bhateja
On Thu, 11 Jan 2024 23:06:32 GMT, Scott Gibbons wrote: >> Re-write the IndexOf code without the use of the pcmpestri instruction, only >> using AVX2 instructions. This change accelerates String.IndexOf on average >> 1.3x for AVX2. The benchmark numbers: >> >> >> Benchmark

Re: RFR: 8322768: Optimize non-subword vector compress and expand APIs for AVX2 target. [v5]

2024-01-15 Thread Jatin Bhateja
On Mon, 15 Jan 2024 09:10:38 GMT, Emanuel Peter wrote: >> Jatin Bhateja has updated the pull request incrementally with one additional >> commit since the last revision: >> >> Using emulated variable blend E-Core optimized instruction. > > src/hotspot/cpu/x86/

Re: RFR: 8318650: Optimized subword gather for x86 targets. [v10]

2024-01-15 Thread Jatin Bhateja
On Mon, 15 Jan 2024 14:36:38 GMT, Emanuel Peter wrote: >> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1776: >> >>> 1774: for (int i = 0; i < 4; i++) { >>> 1775: movl(rtmp, Address(idx_base, i * 4)); >>> 1776: addl(rtmp, offset); >> >> Can the `offset` not be added to

Re: RFR: 8318650: Optimized subword gather for x86 targets. [v10]

2024-01-15 Thread Jatin Bhateja
On Mon, 15 Jan 2024 13:49:06 GMT, Emanuel Peter wrote: >> Jatin Bhateja has updated the pull request with a new target base due to a >> merge or a rebase. The pull request now contains 12 commits: >> >> - Accelerating masked sub-word gathers for AVX2 targets, this giv

Re: RFR: 8322768: Optimize non-subword vector compress and expand APIs for AVX2 target. [v5]

2024-01-10 Thread Jatin Bhateja
On Tue, 9 Jan 2024 16:48:56 GMT, Jatin Bhateja wrote: >> Hi, >> >> Patch optimizes non-subword vector compress and expand APIs for x86 AVX2 >> only targets. >> Upcoming E-core Xeons (Sierra Forest) and Hybrid CPUs only support AVX2 >> instruction set. >

Re: RFR: 8322768: Optimize non-subword vector compress and expand APIs for AVX2 target. [v5]

2024-01-09 Thread Jatin Bhateja
> ops/ms > ColumnFilterBenchmark.filterFloatColumn 2047 thrpt 2 2587.229 > ops/ms > ColumnFilterBenchmark.filterFloatColumn 4096 thrpt 2 1278.665 > ops/ms > ColumnFilterBenchmark.filterIntColumn 1024 thrpt 2 4149.384 > op

Re: RFR: JDK-8320448 Accelerate IndexOf using AVX2 [v5]

2024-01-09 Thread Jatin Bhateja
On Thu, 21 Dec 2023 15:21:08 GMT, Scott Gibbons wrote: >> Re-write the IndexOf code without the use of the pcmpestri instruction, only >> using AVX2 instructions. This change accelerates String.IndexOf on average >> 1.3x for AVX2. The benchmark numbers: >> >> >> Benchmark

Re: RFR: 8322768: Optimize non-subword vector compress and expand APIs for AVX2 target. [v3]

2024-01-08 Thread Jatin Bhateja
ctor-node, so that it can float out of a loop if the mask is invariant? >> >> CompressV / ExpandV accept only two inputs: the vector to be operated on and >> the mask under which the operation is performed. The permute-table-based implementation >> is specific to x86 backend implemen

Re: RFR: 8322768: Optimize non-subword vector compress and expand APIs for AVX2 target. [v2]

2024-01-08 Thread Jatin Bhateja
On Mon, 8 Jan 2024 07:55:00 GMT, Emanuel Peter wrote: >>> You are using `VectorMask pred = VectorMask.fromLong(ispecies, >>> maskctr++);`. That basically systematically iterates over all masks, which >>> is nice for a correctness test. But that would use different density inside >>> one test

Re: RFR: 8322768: Optimize non-subword vector compress and expand APIs for AVX2 target. [v3]

2024-01-07 Thread Jatin Bhateja
On Fri, 5 Jan 2024 10:02:28 GMT, Emanuel Peter wrote: > Thanks for the updates! > > One more idea: Your AVX2 solution has a lot of cost for converting the mask > to a permutation. Might it make sense to split this off into a separate > vector-node, so that it can float out of a loop if the

Re: RFR: 8322768: Optimize non-subword vector compress and expand APIs for AVX2 target. [v4]

2024-01-07 Thread Jatin Bhateja
> ops/ms > ColumnFilterBenchmark.filterFloatColumn 2047 thrpt 2 2587.229 > ops/ms > ColumnFilterBenchmark.filterFloatColumn 4096 thrpt 2 1278.665 > ops/ms > ColumnFilterBenchmark.filterIntColumn 1024 thrpt 2 4149.384 > op

Re: RFR: 8322768: Optimize non-subword vector compress and expand APIs for AVX2 target. [v2]

2024-01-07 Thread Jatin Bhateja
On Fri, 5 Jan 2024 09:45:11 GMT, Emanuel Peter wrote: > You are using `VectorMask pred = VectorMask.fromLong(ispecies, > maskctr++);`. That basically systematically iterates over all masks, which is > nice for a correctness test. But that would use different density inside one > test run,

Re: RFR: 8322768: Optimize non-subword vector compress and expand APIs for AVX2 target. [v2]

2024-01-04 Thread Jatin Bhateja
On Thu, 4 Jan 2024 13:41:40 GMT, Emanuel Peter wrote: >> Jatin Bhateja has updated the pull request incrementally with one additional >> commit since the last revision: >> >> Updating copyright year of modified files. > > src/hotspot/cpu/x86/c2_MacroAssembler

Re: RFR: 8322768: Optimize non-subword vector compress and expand APIs for AVX2 target. [v3]

2024-01-04 Thread Jatin Bhateja
> ops/ms > ColumnFilterBenchmark.filterFloatColumn 2047 thrpt 2 2587.229 > ops/ms > ColumnFilterBenchmark.filterFloatColumn 4096 thrpt 2 1278.665 > ops/ms > ColumnFilterBenchmark.filterIntColumn 1024 thrpt 2 4149.384 > op

Re: RFR: 8322768: Optimize non-subword vector compress and expand APIs for AVX2 target. [v2]

2024-01-04 Thread Jatin Bhateja
On Thu, 4 Jan 2024 13:30:24 GMT, Emanuel Peter wrote: >> test/micro/org/openjdk/bench/jdk/incubator/vector/ColumnFilterBenchmark.java >> line 94: >> >>> 92:IntVector vec = IntVector.fromArray(ispecies, intinCol, i); >>> 93:VectorMask pred =

Re: RFR: 8322768: Optimize non-subword vector compress and expand APIs for AVX2 target. [v2]

2024-01-04 Thread Jatin Bhateja
On Fri, 5 Jan 2024 07:03:26 GMT, Jatin Bhateja wrote: >> And what about some result verification? Or is there another test that does >> that? > > We do have extensive functional tests for compress/expand APIs in > [test/jdk/jdk/incubator/vector](https://github.com/openjdk

Re: RFR: 8322768: Optimize non-subword vector compress and expand APIs for AVX2 target. [v2]

2024-01-04 Thread Jatin Bhateja
On Thu, 4 Jan 2024 13:33:08 GMT, Emanuel Peter wrote: >> Jatin Bhateja has updated the pull request incrementally with one additional >> commit since the last revision: >> >> Updating copyright year of modified files. > > test/micro/org/open

Re: RFR: 8322768: Optimize non-subword vector compress and expand APIs for AVX2 target. [v2]

2024-01-03 Thread Jatin Bhateja
> ops/ms > ColumnFilterBenchmark.filterFloatColumn 2047 thrpt 2 2587.229 > ops/ms > ColumnFilterBenchmark.filterFloatColumn 4096 thrpt 2 1278.665 > ops/ms > ColumnFilterBenchmark.filterIntColumn 1024 thrpt 2 4149.384 >

RFR: 8322768: Optimize non-subword vector compress and expand APIs for AVX2 target.

2024-01-03 Thread Jatin Bhateja
Hi, Patch optimizes non-subword vector compress and expand APIs for x86 AVX2-only targets. Upcoming E-core Xeons (Sierra Forest) and Hybrid CPUs only support the AVX2 instruction set. These operations are used very frequently in columnar database filter operations. Implementation uses a lookup table
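To make the lookup-table idea concrete, here is a hedged scalar sketch in Java of a mask-indexed permute table driving a 4-lane compress. The table layout, the `-1` filler entry, and all names are assumptions made for illustration; the snippet above is truncated and does not describe the actual table format used by the x86 backend.

```java
// Sketch: compress selected lanes toward lane 0 using a table indexed by mask.
// Table layout and the -1 filler are illustrative assumptions.
final class CompressPermuteTable {
    static final int LANES = 4;

    // Row m lists, in order, the lane numbers whose bit is set in m; rest is -1.
    static int[][] buildTable() {
        int[][] table = new int[1 << LANES][LANES];
        for (int mask = 0; mask < table.length; mask++) {
            int out = 0;
            for (int lane = 0; lane < LANES; lane++) {
                if ((mask & (1 << lane)) != 0) {
                    table[mask][out++] = lane;
                }
            }
            while (out < LANES) {
                table[mask][out++] = -1; // unused output lane, zero-filled later
            }
        }
        return table;
    }

    // Applies the permute row for the mask; unused output lanes become 0.
    static long[] compress(long[] src, int mask, int[][] table) {
        long[] dst = new long[LANES];
        int[] row = table[mask];
        for (int i = 0; i < LANES; i++) {
            dst[i] = row[i] >= 0 ? src[row[i]] : 0L;
        }
        return dst;
    }
}
```

The point of such a table is that the mask-to-permutation conversion becomes a single indexed load, after which one permute produces the compressed result.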

Re: RFR: 8318650: Optimized subword gather for x86 targets. [v10]

2024-01-01 Thread Jatin Bhateja
ons into caller contexts. > > Kindly review and share your feedback. > > Best Regards, > Jatin Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 12 commits: - Accelerating masked sub-word gathers for AVX2

Re: RFR: 8318650: Optimized subword gather for x86 targets. [v9]

2023-12-17 Thread Jatin Bhateja
ons into caller contexts. > > Kindly review and share your feedback. > > Best Regards, > Jatin Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: Removing JDK-8321648 related changes. - Changes: - all: https:

Re: RFR: 8318650: Optimized subword gather for x86 targets. [v8]

2023-12-17 Thread Jatin Bhateja
On Sun, 17 Dec 2023 17:55:11 GMT, Jatin Bhateja wrote: >> Hi All, >> >> This patch optimizes sub-word gather operation for x86 targets with AVX2 and >> AVX512 features. >> >> Following is the summary of changes:- >> >> 1) Intrinsify sub-wo

Re: RFR: 8318650: Optimized subword gather for x86 targets. [v7]

2023-12-17 Thread Jatin Bhateja
On Wed, 15 Nov 2023 02:17:58 GMT, Jatin Bhateja wrote: >> Hi All, >> >> This patch optimizes sub-word gather operation for x86 targets with AVX2 and >> AVX512 features. >> >> Following is the summary of changes:- >> >> 1) Intrinsify sub-wo

Re: RFR: 8318650: Optimized subword gather for x86 targets. [v8]

2023-12-17 Thread Jatin Bhateja
into caller contexts. > > 3) Some minor adjustments in existing gather instruction patterns for > double/quad words. > > > Kindly review and share your feedback. > > > Best Regards, > Jatin Jatin Bhateja has updated the pull request with a new target base

Re: RFR: 8319577: x86_64 AVX2 intrinsics for Arrays.sort methods (int, float arrays) [v10]

2023-12-06 Thread Jatin Bhateja
On Wed, 6 Dec 2023 17:44:25 GMT, Srinivas Vamsi Parasa wrote: >> The goal is to develop faster sort routines for x86_64 CPUs by taking >> advantage of AVX2 instructions. This enhancement provides an order of >> magnitude speedup for Arrays.sort() using int, long, float and double arrays. >>

Re: RFR: 8319577: x86_64 AVX2 intrinsics for Arrays.sort methods (int, float arrays) [v8]

2023-12-05 Thread Jatin Bhateja
On Mon, 4 Dec 2023 22:15:24 GMT, Srinivas Vamsi Parasa wrote: >> The goal is to develop faster sort routines for x86_64 CPUs by taking >> advantage of AVX2 instructions. This enhancement provides an order of >> magnitude speedup for Arrays.sort() using int, long, float and double arrays. >>

Re: RFR: 8319577: x86_64 AVX2 intrinsics for Arrays.sort methods (int, float arrays) [v8]

2023-12-05 Thread Jatin Bhateja
On Mon, 4 Dec 2023 22:15:24 GMT, Srinivas Vamsi Parasa wrote: >> The goal is to develop faster sort routines for x86_64 CPUs by taking >> advantage of AVX2 instructions. This enhancement provides an order of >> magnitude speedup for Arrays.sort() using int, long, float and double arrays. >>

Re: RFR: 8319577: x86_64 AVX2 intrinsics for Arrays.sort methods (int, float arrays) [v2]

2023-11-28 Thread Jatin Bhateja
On Tue, 28 Nov 2023 12:23:12 GMT, Jatin Bhateja wrote: >> Srinivas Vamsi Parasa has updated the pull request with a new target base >> due to a merge or a rebase. The incremental webrev excludes the unrelated >> changes brought in by the merge/rebase. The pull r

Re: RFR: 8319577: x86_64 AVX2 intrinsics for Arrays.sort methods (int, float arrays) [v2]

2023-11-28 Thread Jatin Bhateja
On Sat, 18 Nov 2023 01:21:09 GMT, Srinivas Vamsi Parasa wrote: >> The goal is to develop faster sort routines for x86_64 CPUs by taking >> advantage of AVX2 instructions. This enhancement provides an order of >> magnitude speedup for Arrays.sort() using int, long, float and double arrays. >>

Re: RFR: 8318650: Optimized subword gather for x86 targets. [v7]

2023-11-21 Thread Jatin Bhateja
On Wed, 15 Nov 2023 02:17:58 GMT, Jatin Bhateja wrote: >> Hi All, >> >> This patch optimizes sub-word gather operation for x86 targets with AVX2 and >> AVX512 features. >> >> Following is the summary of changes:- >> >> 1) Intrinsi

Re: RFR: 8310159: Bulk copy with Unsafe::arrayCopy is slower compared to memcpy [v4]

2023-11-20 Thread Jatin Bhateja
On Thu, 16 Nov 2023 21:26:47 GMT, Steve Dohrmann wrote: >> Below is baseline data collected using a modified version of the >> java.lang.foreign.xor micro benchmark referenced by @mcimadamore in the bug >> report. I collected data on an Ubuntu 22.04 laptop with a Tigerlake >> i7-1185G7,

Re: RFR: 8318650: Optimized subword gather for x86 targets. [v7]

2023-11-19 Thread Jatin Bhateja
On Mon, 20 Nov 2023 01:34:57 GMT, Xiaohong Gong wrote: > > > BTW, I have two questions: > > > > > > 1. An intrinsic which should accept the vector as index like non-subword > > > gather is more beneficial in real applications. See: [8287289: > > > Gather/Scatter with Index Vector > > >

Re: RFR: 8318650: Optimized subword gather for x86 targets. [v7]

2023-11-16 Thread Jatin Bhateja
On Wed, 15 Nov 2023 02:17:58 GMT, Jatin Bhateja wrote: >> Hi All, >> >> This patch optimizes sub-word gather operation for x86 targets with AVX2 and >> AVX512 features. >> >> Following is the summary of changes:- >> >> 1) Intrinsi

Re: RFR: 8318650: Optimized subword gather for x86 targets. [v7]

2023-11-16 Thread Jatin Bhateja
On Thu, 16 Nov 2023 04:07:21 GMT, Xiaohong Gong wrote: >> Jatin Bhateja has updated the pull request incrementally with one additional >> commit since the last revision: >> >> Fix incorrect comment > > src/jdk.incubator.vector/share/classes/jdk/incubator/vecto

Re: RFR: 8310159: Bulk copy with Unsafe::arrayCopy is slower compared to memcpy

2023-11-15 Thread Jatin Bhateja
On Wed, 15 Nov 2023 17:03:38 GMT, Steve Dohrmann wrote: >> Do you see any concerns with the multithreaded case, where a writer is >> busy copying 256-byte blocks in a loop and a reader tries to access a location >> not yet flushed out of the write-combining buffer? > > The results a concurrent

Re: RFR: 8310159: Bulk copy with Unsafe::arrayCopy is slower compared to memcpy

2023-11-15 Thread Jatin Bhateja
On Tue, 14 Nov 2023 07:59:22 GMT, Jatin Bhateja wrote: >> Below is baseline data collected using a modified version of the >> java.lang.foreign.xor micro benchmark referenced by @mcimadamore in the bug >> report. I collected data on an Ubuntu 22.04 laptop with a Tiger

Re: RFR: 8318650: Optimized subword gather for x86 targets. [v7]

2023-11-14 Thread Jatin Bhateja
was very bulky. This may impact > in-lining decisions into caller contexts. > > 3) Some minor adjustments in existing gather instruction patterns for > double/quad words. > > > Kindly review and share your feedback. > > > Best Regards, > Jatin Jatin Bhateja has

Re: RFR: 8310159: Bulk copy with Unsafe::arrayCopy is slower compared to memcpy

2023-11-14 Thread Jatin Bhateja
On Wed, 15 Nov 2023 01:17:05 GMT, Steve Dohrmann wrote: >> @jatin-bhateja There is a sfence at line 781. > > Thanks, there is a store fence upon completion of the main loop for the > large size code: > > ![image](https://github.com/openjdk/jdk/assets/3858882/3b

Re: RFR: 8318650: Optimized subword gather for x86 targets. [v6]

2023-11-14 Thread Jatin Bhateja
was very bulky. This may impact > in-lining decisions into caller contexts. > > 3) Some minor adjustments in existing gather instruction patterns for > double/quad words. > > > Kindly review and share your feedback. > > > Best Regards, > Jatin Jatin Bhate

Re: RFR: 8310159: Bulk copy with Unsafe::arrayCopy is slower compared to memcpy

2023-11-14 Thread Jatin Bhateja
On Wed, 8 Nov 2023 23:23:48 GMT, Steve Dohrmann wrote: > Below is baseline data collected using a modified version of the > java.lang.foreign.xor micro benchmark referenced by @mcimadamore in the bug > report. I collected data on an Ubuntu 22.04 laptop with a Tigerlake > i7-1185G7, which

Re: RFR: 8310159: Bulk copy with Unsafe::arrayCopy is slower compared to memcpy

2023-11-14 Thread Jatin Bhateja
On Wed, 8 Nov 2023 23:23:48 GMT, Steve Dohrmann wrote: > Below is baseline data collected using a modified version of the > java.lang.foreign.xor micro benchmark referenced by @mcimadamore in the bug > report. I collected data on an Ubuntu 22.04 laptop with a Tigerlake > i7-1185G7, which

Re: RFR: 8318650: Optimized subword gather for x86 targets. [v5]

2023-11-09 Thread Jatin Bhateja
On Fri, 10 Nov 2023 03:33:51 GMT, Sandhya Viswanathan wrote: >> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1648: >> >>> 1646: vpermd(xtmp3, xtmp1, xtmp3, vlen_enc == Assembler::AVX_512bit ? >>> vlen_enc : Assembler::AVX_256bit); >>> 1647: vpsubd(xtmp1, xtmp1, xtmp2, vlen_enc);

Re: RFR: 8318650: Optimized subword gather for x86 targets. [v5]

2023-11-09 Thread Jatin Bhateja
was very bulky. This may impact > in-lining decisions into caller contexts. > > 3) Some minor adjustments in existing gather instruction patterns for > double/quad words. > > > Kindly review and share your feedback. > > > Best Regards, > Jatin Jatin Bhate

Re: RFR: 8318650: Optimized subword gather for x86 targets. [v3]

2023-11-06 Thread Jatin Bhateja
On Mon, 6 Nov 2023 18:33:46 GMT, Sandhya Viswanathan wrote: > This is not a masked operation so every lane of dst will be written through > pinsrw/pinsrb. A vpxor before is not required. The xor here clears the intermediate vector after each iteration; that vector is eventually ORed with the destination.

Re: RFR: 8318650: Optimized subword gather for x86 targets. [v3]

2023-11-05 Thread Jatin Bhateja
On Fri, 3 Nov 2023 00:22:55 GMT, Sandhya Viswanathan wrote: >> Jatin Bhateja has updated the pull request incrementally with one additional >> commit since the last revision: >> >> Restricting masked sub-word gather to AVX512 target to align with integral >> g

Re: RFR: 8318650: Optimized subword gather for x86 targets. [v3]

2023-11-05 Thread Jatin Bhateja
On Fri, 3 Nov 2023 23:20:49 GMT, Sandhya Viswanathan wrote: >> Jatin Bhateja has updated the pull request incrementally with one additional >> commit since the last revision: >> >> Restricting masked sub-word gather to AVX512 target to align with integral >> g

Re: RFR: 8318650: Optimized subword gather for x86 targets. [v3]

2023-11-05 Thread Jatin Bhateja
On Fri, 3 Nov 2023 20:00:30 GMT, Sandhya Viswanathan wrote: >> Jatin Bhateja has updated the pull request incrementally with one additional >> commit since the last revision: >> >> Restricting masked sub-word gather to AVX512 target to align with integral >> g

Re: RFR: 8318650: Optimized subword gather for x86 targets. [v4]

2023-11-05 Thread Jatin Bhateja
was very bulky. This may impact > in-lining decisions into caller contexts. > > 3) Some minor adjustments in existing gather instruction patterns for > double/quad words. > > > Kindly review and share your feedback. > > > Best Regards, > Jatin Jatin Bhate

Re: RFR: 8318650: Optimized subword gather for x86 targets. [v3]

2023-11-05 Thread Jatin Bhateja
On Sun, 5 Nov 2023 12:58:33 GMT, Jatin Bhateja wrote: >> src/hotspot/cpu/x86/x86.ad line 4074: >> >>> 4072: BasicType elem_bt = Matcher::vector_element_basic_type(this); >>> 4073: assert(!is_subword_type(elem_bt), "sanity"); // T_INT,

Re: RFR: 8318650: Optimized subword gather for x86 targets. [v3]

2023-11-05 Thread Jatin Bhateja
On Fri, 3 Nov 2023 23:07:44 GMT, Sandhya Viswanathan wrote: >> Jatin Bhateja has updated the pull request incrementally with one additional >> commit since the last revision: >> >> Restricting masked sub-word gather to AVX512 target to align with integral >> g
