Re: RFR: 8325438: Add exhaustive tests for Math.round intrinsics [v13]

2024-05-07 Thread Emanuel Peter
On Tue, 7 May 2024 13:23:48 GMT, Emanuel Peter wrote: >> Hamlin Li has updated the pull request incrementally with one additional >> commit since the last revision: >> >> fix issues; modify vm options to make sure test the expected behaviors. > > test/hotspo

Re: RFR: 8325438: Add exhaustive tests for Math.round intrinsics [v13]

2024-05-07 Thread Emanuel Peter
On Mon, 29 Apr 2024 11:38:27 GMT, Hamlin Li wrote: >> Hi, >> Can you have a look at this patch adding some tests for Math.round >> intrinsics? >> Thanks! >> >> ### FYI: >> During the development of RoundVF/RoundF, we faced the issues which were >> only spotted by running test exhaustively

Re: RFR: 8318650: Optimized subword gather for x86 targets. [v17]

2024-03-04 Thread Emanuel Peter
On Sat, 2 Mar 2024 16:22:22 GMT, Jatin Bhateja wrote: >> Hi All, >> >> This patch optimizes sub-word gather operation for x86 targets with AVX2 and >> AVX512 features. >> >> Following is the summary of changes:- >> >> 1) Intrinsify sub-word gather using hybrid algorithm which initially >>

Re: RFR: 8318650: Optimized subword gather for x86 targets. [v16]

2024-02-27 Thread Emanuel Peter
On Tue, 27 Feb 2024 02:47:13 GMT, Jatin Bhateja wrote: >> Hi All, >> >> This patch optimizes sub-word gather operation for x86 targets with AVX2 and >> AVX512 features. >> >> Following is the summary of changes:- >> >> 1) Intrinsify sub-word gather using hybrid algorithm which initially >>

Re: RFR: 8318650: Optimized subword gather for x86 targets. [v16]

2024-02-27 Thread Emanuel Peter
On Tue, 27 Feb 2024 10:25:19 GMT, Emanuel Peter wrote: >> Jatin Bhateja has updated the pull request incrementally with one additional >> commit since the last revision: >> >> Review comment resolutions. > > src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp

Re: RFR: 8318650: Optimized subword gather for x86 targets. [v14]

2024-02-26 Thread Emanuel Peter
On Mon, 26 Feb 2024 15:05:03 GMT, Emanuel Peter wrote: >> Hi @eme64 , I was referring to allocation of label's array. To be concise >> and avoid hand unrolling of loop, I chose an array of labels. > > I was referring to the various arrays as well above. I think it would

Re: RFR: 8318650: Optimized subword gather for x86 targets. [v14]

2024-02-26 Thread Emanuel Peter
On Mon, 26 Feb 2024 14:58:53 GMT, Jatin Bhateja wrote: >> I could not find any other case with the same pattern, of initializing a >> list of Labels. >> >> On the other hand, I can find cases where we already do what I am saying: >> `C2_MacroAssembler::rtm_counters_update` > > Hi @eme64 , I

Re: RFR: 8318650: Optimized subword gather for x86 targets. [v15]

2024-02-26 Thread Emanuel Peter
On Mon, 26 Feb 2024 13:14:24 GMT, Jatin Bhateja wrote: >> Hi All, >> >> This patch optimizes sub-word gather operation for x86 targets with AVX2 and >> AVX512 features. >> >> Following is the summary of changes:- >> >> 1) Intrinsify sub-word gather using hybrid algorithm which initially >>

Re: RFR: 8318650: Optimized subword gather for x86 targets. [v14]

2024-02-26 Thread Emanuel Peter
On Mon, 26 Feb 2024 13:24:05 GMT, Emanuel Peter wrote: >> To avoid invariant initializations to happen within the loop, compiler will >> unroll this small loop and will forward the initializations, if it does not >> then we can save redundant allocation within loop. > >

Re: RFR: 8318650: Optimized subword gather for x86 targets. [v14]

2024-02-26 Thread Emanuel Peter
On Mon, 26 Feb 2024 13:09:22 GMT, Jatin Bhateja wrote: >> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1584: >> >>> 1582: if (elem_bt == T_SHORT) { >>> 1583: Label case0, case1, case2, case3; >>> 1584: Label* larr[] = {&case0, &case1, &case2, &case3}; >> >> Not sure if I asked this already: why define

Re: RFR: 8318650: Optimized subword gather for x86 targets. [v14]

2024-02-26 Thread Emanuel Peter
On Mon, 26 Feb 2024 13:06:19 GMT, Jatin Bhateja wrote: >> src/jdk.incubator.vector/share/classes/jdk/incubator/vector/X-Vector.java.template >> line 4840: >> >>> 4838: >>> 4839: // Check indices are within array bounds. >>> 4840: // FIXME: Check index under mask controlling.

Re: RFR: 8318650: Optimized subword gather for x86 targets. [v14]

2024-02-26 Thread Emanuel Peter
On Sun, 25 Feb 2024 06:27:10 GMT, Jatin Bhateja wrote: >> Hi All, >> >> This patch optimizes sub-word gather operation for x86 targets with AVX2 and >> AVX512 features. >> >> Following is the summary of changes:- >> >> 1) Intrinsify sub-word gather using hybrid algorithm which initially >>

Re: RFR: 8318650: Optimized subword gather for x86 targets. [v13]

2024-02-26 Thread Emanuel Peter
On Tue, 20 Feb 2024 08:29:44 GMT, Emanuel Peter wrote: >> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1716: >> >>> 1714: XMMRegister xtmp3, Register >>> rtmp, >>> 1715:

Re: RFR: 8318650: Optimized subword gather for x86 targets. [v13]

2024-02-26 Thread Emanuel Peter
On Sun, 25 Feb 2024 06:23:50 GMT, Jatin Bhateja wrote: >> src/hotspot/cpu/x86/x86.ad line 4120: >> >>> 4118: BasicType elem_bt = Matcher::vector_element_basic_type(this); >>> 4119: __ lea($tmp$$Register, $mem$$Address); >>> 4120: __ vgather8b(elem_bt, $dst$$XMMRegister,

Re: RFR: 8318650: Optimized subword gather for x86 targets. [v7]

2024-02-20 Thread Emanuel Peter
On Sun, 17 Dec 2023 17:51:37 GMT, Jatin Bhateja wrote: >> Jatin Bhateja has updated the pull request incrementally with one additional >> commit since the last revision: >> >> Fix incorrect comment > > Refined implementation using integral gather operation for AVX512 targets. As > per Intel

Re: RFR: 8318650: Optimized subword gather for x86 targets. [v13]

2024-02-20 Thread Emanuel Peter
On Wed, 7 Feb 2024 18:38:29 GMT, Jatin Bhateja wrote: >> Hi All, >> >> This patch optimizes sub-word gather operation for x86 targets with AVX2 and >> AVX512 features. >> >> Following is the summary of changes:- >> >> 1) Intrinsify sub-word gather using hybrid algorithm which initially >>

Re: RFR: 8318650: Optimized subword gather for x86 targets. [v13]

2024-02-20 Thread Emanuel Peter
On Tue, 20 Feb 2024 07:35:28 GMT, Emanuel Peter wrote: >> Jatin Bhateja has updated the pull request incrementally with one additional >> commit since the last revision: >> >> Review comments resolutions. > > src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp li

Re: RFR: 8322768: Optimize non-subword vector compress and expand APIs for AVX2 target. [v9]

2024-01-25 Thread Emanuel Peter
On Tue, 23 Jan 2024 11:56:58 GMT, Jatin Bhateja wrote: >> Hi, >> >> Patch optimizes non-subword vector compress and expand APIs for x86 AVX2 >> only targets. >> Upcoming E-core Xeons (Sierra Forest) and Hybrid CPUs only support AVX2 >> instruction set. >> These are very frequently used APIs

Re: RFR: 8322768: Optimize non-subword vector compress and expand APIs for AVX2 target. [v9]

2024-01-23 Thread Emanuel Peter
On Tue, 23 Jan 2024 11:56:58 GMT, Jatin Bhateja wrote: >> Hi, >> >> Patch optimizes non-subword vector compress and expand APIs for x86 AVX2 >> only targets. >> Upcoming E-core Xeons (Sierra Forest) and Hybrid CPUs only support AVX2 >> instruction set. >> These are very frequently used APIs

Re: RFR: 8322768: Optimize non-subword vector compress and expand APIs for AVX2 target. [v8]

2024-01-23 Thread Emanuel Peter
On Sat, 20 Jan 2024 09:55:45 GMT, Jatin Bhateja wrote: >> Hi, >> >> Patch optimizes non-subword vector compress and expand APIs for x86 AVX2 >> only targets. >> Upcoming E-core Xeons (Sierra Forest) and Hybrid CPUs only support AVX2 >> instruction set. >> These are very frequently used APIs

Re: RFR: 8322768: Optimize non-subword vector compress and expand APIs for AVX2 target. [v5]

2024-01-18 Thread Emanuel Peter
On Thu, 18 Jan 2024 17:06:55 GMT, Jatin Bhateja wrote: >> @jatin-bhateja so why do you shift by 5? I thought 4 longs are 32 bit? > > For long/double each permute row is 32 byte in size, so a shift by 5 to > compute row address. Ah right. Maybe we could say `32byte = 4 long = 4 * 64bit`.
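
The arithmetic in this exchange can be checked with a small sketch. The class, constant, and method names below are hypothetical and do not come from the patch; the sketch only assumes, as stated in the reply, that each permute-table row is 32 bytes and that the row index is the mask value.

```java
public class PermuteRowOffset {
    // Assumption taken from the thread: every permute-table row is 32 bytes,
    // i.e. 4 longs * 8 bytes (or 8 ints * 4 bytes).
    static final int ROW_BYTES = 4 * Long.BYTES;

    // Byte offset of the row selected by a mask value.
    // Multiplying by 32 is the same as shifting left by 5.
    static long rowOffsetBytes(long maskBits) {
        return maskBits << 5;
    }

    public static void main(String[] args) {
        System.out.println(ROW_BYTES);            // 32
        System.out.println(8 * Integer.BYTES);    // 32, same row size for ints
        System.out.println(rowOffsetBytes(3));    // 96
    }
}
```

A shift by 6 would instead correspond to 64-byte rows, which is why the row size in bytes, not the element width in bits, decides the shift amount.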

Re: RFR: 8318650: Optimized subword gather for x86 targets. [v10]

2024-01-15 Thread Emanuel Peter
On Tue, 16 Jan 2024 06:08:28 GMT, Jatin Bhateja wrote: >> Or would that require too many registers? > >> Can the `offset` not be added to `idx_base` before the loop? > > Offset needs to be added to each index element, please refer to API > specification for details. >

Re: RFR: 8318650: Optimized subword gather for x86 targets. [v10]

2024-01-15 Thread Emanuel Peter
On Tue, 16 Jan 2024 06:08:35 GMT, Jatin Bhateja wrote: >> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1634: >> >>> 1632: Register offset, >>> XMMRegister offset_vec, XMMRegister idx_vec, >>> 1633:

Re: RFR: 8318650: Optimized subword gather for x86 targets. [v10]

2024-01-15 Thread Emanuel Peter
On Tue, 16 Jan 2024 06:08:40 GMT, Jatin Bhateja wrote: >> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1627: >> >>> 1625: vpsrlvd(dst, dst, xtmp, vlen_enc); >>> 1626: // Pack double word vector into byte vector. >>> 1627: vpackI2X(T_BYTE, dst, ones, xtmp, vlen_enc); >> >> I

Re: RFR: 8318650: Optimized subword gather for x86 targets. [v10]

2024-01-15 Thread Emanuel Peter
On Tue, 16 Jan 2024 06:08:31 GMT, Jatin Bhateja wrote: >> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1757: >> >>> 1755: for (int i = 0; i < 4; i++) { >>> 1756: movl(rtmp, Address(idx_base, i * 4)); >>> 1757: pinsrw(dst, Address(base, rtmp, Address::times_2), i); >> >>

Re: RFR: 8318650: Optimized subword gather for x86 targets. [v10]

2024-01-15 Thread Emanuel Peter
On Tue, 16 Jan 2024 06:17:43 GMT, Jatin Bhateja wrote: >> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1900: >> >>> 1898: vgather8b(elem_ty, xtmp3, base, idx_base, rtmp, vlen_enc); >>> 1899: } else { >>> 1900: LP64_ONLY(vgather8b_masked(elem_ty, xtmp3, base, idx_base, >>>

Re: RFR: 8322768: Optimize non-subword vector compress and expand APIs for AVX2 target. [v5]

2024-01-15 Thread Emanuel Peter
On Tue, 16 Jan 2024 06:13:43 GMT, Jatin Bhateja wrote: >> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 5309: >> >>> 5307: assert(bt == T_LONG || bt == T_DOUBLE, ""); >>> 5308: vmovmskpd(rtmp, mask, vec_enc); >>> 5309: shlq(rtmp, 5); // for 64 bit rows (4 longs) >> >>

Re: RFR: 8318650: Optimized subword gather for x86 targets. [v10]

2024-01-15 Thread Emanuel Peter
On Mon, 15 Jan 2024 14:25:28 GMT, Emanuel Peter wrote: >> Jatin Bhateja has updated the pull request with a new target base due to a >> merge or a rebase. The pull request now contains 12 commits: >> >> - Accelerating masked sub-word gathers for AVX2 targets, this giv

Re: RFR: 8318650: Optimized subword gather for x86 targets. [v10]

2024-01-15 Thread Emanuel Peter
On Mon, 1 Jan 2024 14:36:06 GMT, Jatin Bhateja wrote: >> Hi All, >> >> This patch optimizes sub-word gather operation for x86 targets with AVX2 and >> AVX512 features. >> >> Following is the summary of changes:- >> >> 1) Intrinsify sub-word gather using hybrid algorithm which initially >>

Re: RFR: 8322768: Optimize non-subword vector compress and expand APIs for AVX2 target. [v5]

2024-01-15 Thread Emanuel Peter
On Tue, 9 Jan 2024 16:48:56 GMT, Jatin Bhateja wrote: >> Hi, >> >> Patch optimizes non-subword vector compress and expand APIs for x86 AVX2 >> only targets. >> Upcoming E-core Xeons (Sierra Forest) and Hybrid CPUs only support AVX2 >> instruction set. >> These are very frequently used APIs in

Re: RFR: JDK-8320448 Accelerate IndexOf using AVX2 [v6]

2024-01-09 Thread Emanuel Peter
On Mon, 8 Jan 2024 20:48:39 GMT, Scott Gibbons wrote: >> Re-write the IndexOf code without the use of the pcmpestri instruction, only >> using AVX2 instructions. This change accelerates String.IndexOf on average >> 1.3x for AVX2. The benchmark numbers: >> >> >> Benchmark

Re: RFR: 8322768: Optimize non-subword vector compress and expand APIs for AVX2 target. [v2]

2024-01-09 Thread Emanuel Peter
On Tue, 9 Jan 2024 06:13:44 GMT, Jatin Bhateja wrote: >> Yes, IF it is vectorized, then there is no difference between high and low >> density. My concern was more if vectorization is preferable over the scalar >> alternative in the low-density case, where branch prediction is more stable. >

Re: RFR: 8322768: Optimize non-subword vector compress and expand APIs for AVX2 target. [v4]

2024-01-09 Thread Emanuel Peter
On Mon, 8 Jan 2024 06:23:46 GMT, Jatin Bhateja wrote: >> Hi, >> >> Patch optimizes non-subword vector compress and expand APIs for x86 AVX2 >> only targets. >> Upcoming E-core Xeons (Sierra Forest) and Hybrid CPUs only support AVX2 >> instruction set. >> These are very frequently used APIs in

Re: RFR: 8322768: Optimize non-subword vector compress and expand APIs for AVX2 target. [v2]

2024-01-08 Thread Emanuel Peter
On Fri, 5 Jan 2024 09:35:34 GMT, Emanuel Peter wrote: >> Thanks for the comment addition! > > Improvement suggestion: > For a vector with 8 ints, we get `2^8 = 256` many bit patterns for the mask. > The table has a row for each `mask` value, consisting of 8 ints, which &
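
One way to picture the table described in this comment is the sketch below. It is not the stub generator's code: the class and method names are hypothetical, and both the meaning of each row (lane indices of the set mask bits, packed to the front) and the `-1` filler for unused slots are assumptions made for illustration, since the snippet above is cut off before it says what the 8 ints contain.

```java
import java.util.Arrays;

public class CompressPermTable {
    static final int LANES = 8; // a vector with 8 ints

    // Builds 2^8 = 256 rows, one per mask value, each holding 8 ints.
    static int[][] build() {
        int[][] table = new int[1 << LANES][LANES];
        for (int mask = 0; mask < table.length; mask++) {
            int next = 0;
            // Collect the indices of the lanes selected by the mask, in order.
            for (int lane = 0; lane < LANES; lane++) {
                if ((mask & (1 << lane)) != 0) {
                    table[mask][next++] = lane;
                }
            }
            // Assumed filler for slots that no selected lane maps to.
            while (next < LANES) {
                table[mask][next++] = -1;
            }
        }
        return table;
    }

    public static void main(String[] args) {
        int[][] table = build();
        System.out.println(table.length);                        // 256
        System.out.println(Arrays.toString(table[0b00100110]));  // [1, 2, 5, -1, -1, -1, -1, -1]
    }
}
```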

Re: RFR: 8322768: Optimize non-subword vector compress and expand APIs for AVX2 target. [v4]

2024-01-08 Thread Emanuel Peter
On Mon, 8 Jan 2024 06:23:46 GMT, Jatin Bhateja wrote: >> Hi, >> >> Patch optimizes non-subword vector compress and expand APIs for x86 AVX2 >> only targets. >> Upcoming E-core Xeons (Sierra Forest) and Hybrid CPUs only support AVX2 >> instruction set. >> These are very frequently used APIs in

Re: RFR: 8322768: Optimize non-subword vector compress and expand APIs for AVX2 target. [v2]

2024-01-08 Thread Emanuel Peter
On Mon, 8 Jan 2024 06:06:20 GMT, Jatin Bhateja wrote: >> You are using `VectorMask pred = VectorMask.fromLong(ispecies, >> maskctr++);`. >> That basically systematically iterates over all masks, which is nice for a >> correctness test. >> But that would use different density inside one test

Re: RFR: 8322768: Optimize non-subword vector compress and expand APIs for AVX2 target. [v3]

2024-01-05 Thread Emanuel Peter
On Fri, 5 Jan 2024 07:08:35 GMT, Jatin Bhateja wrote: >> Hi, >> >> Patch optimizes non-subword vector compress and expand APIs for x86 AVX2 >> only targets. >> Upcoming E-core Xeons (Sierra Forest) and Hybrid CPUs only support AVX2 >> instruction set. >> These are very frequently used APIs in

Re: RFR: 8322768: Optimize non-subword vector compress and expand APIs for AVX2 target. [v2]

2024-01-05 Thread Emanuel Peter
On Fri, 5 Jan 2024 07:05:51 GMT, Jatin Bhateja wrote: >> We do have extensive functional tests for compress/expand APIs in >> [test/jdk/jdk/incubator/vector](https://github.com/openjdk/jdk/tree/master/test/jdk/jdk/incubator/vector) > >> Could there be equivalent `expand` tests? > > Here are

Re: RFR: 8322768: Optimize non-subword vector compress and expand APIs for AVX2 target. [v2]

2024-01-05 Thread Emanuel Peter
On Fri, 5 Jan 2024 09:37:55 GMT, Emanuel Peter wrote: >> This computes the byte offset from start of the table, both integer and long >> permute table have same row sizes, 8 int elements vs 4 long elements. > > Ah, I understand now. Maybe leave a comment for that? I would

Re: RFR: 8322768: Optimize non-subword vector compress and expand APIs for AVX2 target. [v2]

2024-01-05 Thread Emanuel Peter
On Thu, 4 Jan 2024 13:40:19 GMT, Emanuel Peter wrote: >> Jatin Bhateja has updated the pull request incrementally with one additional >> commit since the last revision: >> >> Updating copyright year of modified files. > > src/hotspot/cpu/x86/stubGenerator_

Re: RFR: 8322768: Optimize non-subword vector compress and expand APIs for AVX2 target. [v2]

2024-01-05 Thread Emanuel Peter
On Fri, 5 Jan 2024 09:31:50 GMT, Emanuel Peter wrote: >> src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 957: >> >>> 955: __ align(CodeEntryAlignment); >>> 956: StubCodeMark mark(this, "StubRoutines", stub_name); >>> 957: address

Re: RFR: 8322768: Optimize non-subword vector compress and expand APIs for AVX2 target. [v2]

2024-01-05 Thread Emanuel Peter
On Fri, 5 Jan 2024 07:03:34 GMT, Jatin Bhateja wrote: >> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 5307: >> >>> 5305: assert(bt == T_LONG || bt == T_DOUBLE, ""); >>> 5306: vmovmskpd(rtmp, mask, vec_enc); >>> 5307: shlq(rtmp, 5); >> >> Might this need to be 6? If I

Re: RFR: 8322768: Optimize non-subword vector compress and expand APIs for AVX2 target. [v2]

2024-01-04 Thread Emanuel Peter
On Thu, 4 Jan 2024 05:39:01 GMT, Jatin Bhateja wrote: >> Hi, >> >> Patch optimizes non-subword vector compress and expand APIs for x86 AVX2 >> only targets. >> Upcoming E-core Xeons (Sierra Forest) and Hybrid CPUs only support AVX2 >> instruction set. >> These are very frequently used APIs in

Re: RFR: 8322768: Optimize non-subword vector compress and expand APIs for AVX2 target. [v2]

2024-01-04 Thread Emanuel Peter
On Thu, 4 Jan 2024 13:09:30 GMT, Emanuel Peter wrote: >> Jatin Bhateja has updated the pull request incrementally with one additional >> commit since the last revision: >> >> Updating copyright year of modified files. > > test/micro/org/open