On Sat, 20 Apr 2024 22:31:48 GMT, Scott Gibbons wrote:
>> This code makes an intrinsic stub for `Unsafe::setMemory` for x86_64. See
>> [this PR](https://github.com/openjdk/jdk/pull/16760) for discussion around
>> this change.
>>
>> Overall, making this an intrinsic improves overall
On Wed, 25 Oct 2023 04:34:59 GMT, Jatin Bhateja wrote:
> Hi All,
>
> This patch optimizes sub-word gather operation for x86 targets with AVX2 and
> AVX512 features.
>
> Following is the summary of changes:-
>
> 1) Intrinsify sub-word gather using hybrid algorithm whic
to caller contexts.
>
> Kindly review and share your feedback.
>
> Best Regards,
> Jatin
Jatin Bhateja has updated the pull request with a new target base due to a
merge or a rebase. The pull request now contains 16 commits:
- Merge branch 'master' of http://github.com/openj
On Fri, 19 Apr 2024 22:08:52 GMT, Scott Gibbons wrote:
>> This code makes an intrinsic stub for `Unsafe::setMemory` for x86_64. See
>> [this PR](https://github.com/openjdk/jdk/pull/16760) for discussion around
>> this change.
>>
>> Overall, making this an intrinsic improves overall
On Tue, 16 Apr 2024 00:04:15 GMT, Scott Gibbons wrote:
>> This code makes an intrinsic stub for `Unsafe::setMemory` for x86_64. See
>> [this PR](https://github.com/openjdk/jdk/pull/16760) for discussion around
>> this change.
>>
>> Overall, making this an intrinsic improves overall
On Mon, 15 Apr 2024 22:04:14 GMT, Volodymyr Paprotski wrote:
>> src/hotspot/cpu/x86/stubGenerator_x86_64_poly_mont.cpp line 394:
>>
>>> 392: __ lea(aLimbs, Address(aLimbs,8));
>>> 393: __ lea(bLimbs, Address(bLimbs,8));
>>> 394: __ jmp(L_DefaultLoop);
>>
>> Both sub and cmp are flag
On Tue, 2 Apr 2024 19:19:59 GMT, Volodymyr Paprotski wrote:
>> Performance. Before:
>>
>> Benchmark (algorithm) (dataSize) (keyLength)
(provider) Mode Cnt Score Error Units
>> SignatureBench.ECDSA.signSHA256withECDSA1024 256
to caller contexts.
>
> Kindly review and share your feedback.
>
> Best Regards,
> Jatin
Jatin Bhateja has updated the pull request incrementally with one additional
commit since the last revision:
Review resolutions.
-
Changes:
- all: https://git.openjdk.o
andboxes/jdk-reviews/jdk/make/core.237140)
>> #
>> # An error report file with more information is saved as:
>> # /home/jatinbha/sandboxes/jdk-reviews/jdk/make/hs_err_pid237140.log
>>... (rest of output omitted)
>
> @jatin-bhateja Thanks for the note. Fixed a
On Mon, 26 Feb 2024 13:47:35 GMT, Emanuel Peter wrote:
>> Jatin Bhateja has updated the pull request incrementally with one additional
>> commit since the last revision:
>>
>> Review comments resolutions
>
> Reposting link to a conversation that is marked "
On Mon, 26 Feb 2024 15:05:24 GMT, Emanuel Peter wrote:
>> I was referring to the various arrays as well above. I think it would be
>> exactly more concise if you defined a local label in the loop body.
>
> Have you had a look at `C2_MacroAssembler::rtm_counters_update`?
Correct, with each
to caller contexts.
>
> Kindly review and share your feedback.
>
> Best Regards,
> Jatin
Jatin Bhateja has updated the pull request incrementally with one additional
commit since the last revision:
Review comment resolutions.
-
Changes:
- all: https://git
On Mon, 26 Feb 2024 13:31:05 GMT, Emanuel Peter wrote:
>> At the risk of becoming too nit-picky: which allocations are you talking
>> about? Given you only have a single src and a single dst for this
>> label/jump. So you won't use `_patch_overflow`. And therefore, all
>> allocations are on
On Thu, 22 Feb 2024 03:15:10 GMT, Scott Gibbons wrote:
>> Re-write the IndexOf code without the use of the pcmpestri instruction, only
>> using AVX2 instructions. This change accelerates String.IndexOf on average
>> 1.3x for AVX2. The benchmark numbers:
>>
>>
>> Benchmark
On Mon, 26 Feb 2024 09:36:09 GMT, Emanuel Peter wrote:
>> A 64-bit sub-word SPECIES holds either 8 byte values or 4 short values; the
>> algorithm handles both appropriately.
>
> Are you saying that the constraints are too relaxed, but currently no outside
> algorithm would pass something bad?
>
On Mon, 26 Feb 2024 09:37:33 GMT, Emanuel Peter wrote:
>> I'll rereview after
>
> So xtmp1...3 and rtmp cannot have more descriptive names?
These are temporary variables and are appropriately named.
-
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1502587427
On Mon, 26 Feb 2024 09:39:01 GMT, Emanuel Peter wrote:
>> Jatin Bhateja has updated the pull request incrementally with one additional
>> commit since the last revision:
>>
>> Review comments resolutions.
>
> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1
to caller contexts.
>
> Kindly review and share your feedback.
>
> Best Regards,
> Jatin
Jatin Bhateja has updated the pull request incrementally with one additional
commit since the last revision:
Review comments resolutions
-
Changes:
- all: https://git
On Mon, 26 Feb 2024 09:47:50 GMT, Emanuel Peter wrote:
>> Jatin Bhateja has updated the pull request incrementally with one additional
>> commit since the last revision:
>>
>> Review comments resolutions.
>
> src/jdk.incubator.vector/share/classes/jdk/incubator
On Tue, 20 Feb 2024 08:36:29 GMT, Emanuel Peter wrote:
>> Jatin Bhateja has updated the pull request incrementally with one additional
>> commit since the last revision:
>>
>> Review comments resolutions.
>
> src/hotspot/cpu/x86/x86.ad line 4120:
>
>>
to caller contexts.
>
> Kindly review and share your feedback.
>
> Best Regards,
> Jatin
Jatin Bhateja has updated the pull request incrementally with one additional
commit since the last revision:
Review comments resolutions.
-
Changes:
- all: https://git
On Tue, 20 Feb 2024 08:04:27 GMT, Emanuel Peter wrote:
>> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1584:
>>
>>> 1582: Label *larr[] = {, , , };
>>> 1583: for (int i = 0; i < 4; i++) {
>>> 1584: // dst[i] = mask ? src[index[i]] : 0
>>
>> I like these comments a lot!
>>
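The comment quoted above, `dst[i] = mask ? src[index[i]] : 0`, describes the per-lane semantics of a masked gather. A minimal scalar sketch of that semantics in plain Java follows; the class name, method name, and the use of a `boolean[]` mask are illustrative assumptions and are not taken from the HotSpot sources.

```java
public class MaskedGatherSketch {
    // Scalar reference for a masked sub-word gather (hypothetical helper):
    // each output lane reads src[index[i]] when its mask bit is set, else 0.
    static short[] gather(short[] src, int[] index, boolean[] mask) {
        short[] dst = new short[index.length];
        for (int i = 0; i < index.length; i++) {
            dst[i] = mask[i] ? src[index[i]] : (short) 0;
        }
        return dst;
    }

    public static void main(String[] args) {
        short[] src = {10, 20, 30, 40, 50};
        int[] index = {4, 0, 2, 1};
        boolean[] mask = {true, false, true, true};
        // Lane 1 is masked off, so it stays 0.
        System.out.println(java.util.Arrays.toString(gather(src, index, mask)));
    }
}
```

A reference like this is mainly useful for checking a vectorized implementation lane by lane.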
On Wed, 14 Feb 2024 14:31:03 GMT, Scott Gibbons wrote:
>> Scott Gibbons has updated the pull request incrementally with one additional
>> commit since the last revision:
>>
>> Remove gcc lib fn; reduce special cases to 10 from 32
>
> Thank you all for the reviews. I have been asked to
to caller contexts.
>
> Kindly review and share your feedback.
>
> Best Regards,
> Jatin
Jatin Bhateja has updated the pull request incrementally with one additional
commit since the last revision:
Review comments resolutions.
-
Changes:
- all: https://git
non
>> accessible address space, so we want to be super safe here. I am also
>> working on another solution that is performant.
>
> Hi @jatin-bhateja,
>
> The layout of an array is as follows:
>
> [header] - [length] - [data]
>
> Since `length` is a 4-byt
On Wed, 31 Jan 2024 21:29:08 GMT, Sandhya Viswanathan
wrote:
>> Jatin Bhateja has refreshed the contents of this pull request, and previous
>> commits have been removed. Incremental views are not available.
>
> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1613:
>
On Thu, 1 Feb 2024 16:25:52 GMT, Jatin Bhateja wrote:
>> I guess the fact that the Java objects are 8 byte alignment padded and the
>> alignment being done at lines 1609-1611 and 1616-1621 somehow takes care of
>> this.
>
> Hi @sviswa7 , I have rolled back to originally
On Wed, 31 Jan 2024 23:53:16 GMT, Sandhya Viswanathan
wrote:
>> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1613:
>>
>>> 1611: vpand(xtmp, idx_vec, xtmp, vlen_enc);
>>> 1612: // Load double words from normalized indices.
>>> 1613: evpgatherdd(dst, gmask, Address(base, xtmp,
ons into caller contexts.
>
> Kindly review and share your feedback.
>
> Best Regards,
> Jatin
Jatin Bhateja has updated the pull request with a new target base due to a
merge or a rebase. The pull request now contains 10 commits:
- Generalizing masked sub-gather support.
- Merge
On Mon, 29 Jan 2024 19:45:41 GMT, Paul Sandoz wrote:
> The implementation of method `VectorSpecies::fromMemorySegment`, in
> `AbstractSpecies::fromMemorySegment`, neglects to perform bounds checks on
> the offset argument when the method is compiled by C2 (bounds checks are
> performed when
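As a rough illustration of the kind of check being discussed, the sketch below validates that a vector-sized window starting at `offset` fits inside a segment of a given byte size. It uses the standard `java.util.Objects.checkFromIndexSize(long, long, long)` method; the class name, method name, and parameter names are assumptions for illustration, and this is not the actual fix applied in `AbstractSpecies::fromMemorySegment`.

```java
import java.util.Objects;

public class SegmentBoundsSketch {
    // Hypothetical helper: returns offset if [offset, offset + vectorByteSize)
    // lies within [0, segmentByteSize), otherwise throws IndexOutOfBoundsException.
    static long checkedOffset(long offset, long vectorByteSize, long segmentByteSize) {
        return Objects.checkFromIndexSize(offset, vectorByteSize, segmentByteSize);
    }

    public static void main(String[] args) {
        System.out.println(checkedOffset(0, 32, 64));   // fits
        try {
            checkedOffset(40, 32, 64);                  // 40 + 32 > 64
        } catch (IndexOutOfBoundsException e) {
            System.out.println("rejected out-of-bounds offset");
        }
    }
}
```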
On Thu, 25 Jan 2024 09:15:26 GMT, Emanuel Peter wrote:
>> Jatin Bhateja has updated the pull request with a new target base due to a
>> merge or a rebase. The incremental webrev excludes the unrelated changes
>> brought in by the merge/rebase. The pull request cont
On Thu, 4 Jan 2024 05:28:59 GMT, Jatin Bhateja wrote:
> Hi,
>
> Patch optimizes non-subword vector compress and expand APIs for x86 AVX2 only
> targets.
> Upcoming E-core Xeons (Sierra Forest) and Hybrid CPUs only support AVX2
> instruction set.
> These are very
On Tue, 23 Jan 2024 15:20:47 GMT, Emanuel Peter wrote:
>> Jatin Bhateja has updated the pull request with a new target base due to a
>> merge or a rebase. The incremental webrev excludes the unrelated changes
>> brought in by the merge/rebase. The pull request cont
On Tue, 23 Jan 2024 08:17:13 GMT, Emanuel Peter wrote:
>> Jatin Bhateja has updated the pull request incrementally with one additional
>> commit since the last revision:
>>
>> Review comments resolution
>
> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 5
> ops/ms
> ColumnFilterBenchmark.filterFloatColumn 2047 thrpt 2 2587.229 ops/ms
> ColumnFilterBenchmark.filterFloatColumn 4096 thrpt 2 1278.665 ops/ms
> ColumnFilterBenchmark.filterIntColumn 1024 thrpt 2 4149.384
> op
On Mon, 22 Jan 2024 07:05:56 GMT, Jatin Bhateja wrote:
>> Scott Gibbons has updated the pull request with a new target base due to a
>> merge or a rebase. The pull request now contains 22 commits:
>>
>> - Merge branch 'openjdk:master' into indexof
>>
On Thu, 11 Jan 2024 23:06:32 GMT, Scott Gibbons wrote:
>> Re-write the IndexOf code without the use of the pcmpestri instruction, only
>> using AVX2 instructions. This change accelerates String.IndexOf on average
>> 1.3x for AVX2. The benchmark numbers:
>>
>>
>> Benchmark
ons into caller contexts.
>
> Kindly review and share your feedback.
>
> Best Regards,
> Jatin
Jatin Bhateja has updated the pull request incrementally with one additional
commit since the last revision:
Review comments resolutions.
-
Changes:
- all: https://git.ope
> ops/ms
> ColumnFilterBenchmark.filterFloatColumn 2047 thrpt 2 2587.229 ops/ms
> ColumnFilterBenchmark.filterFloatColumn 4096 thrpt 2 1278.665 ops/ms
> ColumnFilterBenchmark.filterIntColumn 1024 thrpt 2 4149.384
> op
> ops/ms
> ColumnFilterBenchmark.filterFloatColumn 2047 thrpt 2 2587.229 ops/ms
> ColumnFilterBenchmark.filterFloatColumn 4096 thrpt 2 1278.665 ops/ms
> ColumnFilterBenchmark.filterIntColumn 1024 thrpt 2 4149.384
> op
On Fri, 19 Jan 2024 07:43:18 GMT, Emanuel Peter wrote:
>> For long/double, each permute row is 32 bytes in size, so a shift by 5
>> computes the row address.
>
> Ah right. Maybe we could say `32byte = 4 long = 4 * 64bit`.
> Because "64bit row" sounds like the whole row is only 64 bit long. It is
>
On Tue, 16 Jan 2024 07:08:57 GMT, Emanuel Peter wrote:
>> Each long/double permute lane holds 64 bit value.
>
> @jatin-bhateja so why do you shift by 5? I thought 4 longs are 32 bit?
For long/double, each permute row is 32 bytes in size, so a shift by 5 computes
the row address.
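The arithmetic behind the shift can be sketched as follows, assuming (as stated in the thread) that a long/double permute row is 32 bytes, i.e. 4 lanes of 64 bits, and that rows are indexed directly by the mask value. The class, constant, and method names are hypothetical.

```java
public class PermuteRowOffsetSketch {
    static final int ROW_BYTES = 32; // 4 longs * 8 bytes per long
    // log2(32) = 5, which is where the shift amount comes from.
    static final int ROW_SHIFT = Integer.numberOfTrailingZeros(ROW_BYTES);

    // Hypothetical helper: address of the permute row selected by `mask`.
    static long rowAddress(long tableBase, int mask) {
        return tableBase + ((long) mask << ROW_SHIFT);
    }

    public static void main(String[] args) {
        System.out.println(ROW_SHIFT);          // 5
        System.out.println(rowAddress(0L, 3));  // 96 = 3 * 32
    }
}
```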
-
> ops/ms
> ColumnFilterBenchmark.filterFloatColumn 2047 thrpt 2 2587.229 ops/ms
> ColumnFilterBenchmark.filterFloatColumn 4096 thrpt 2 1278.665 ops/ms
> ColumnFilterBenchmark.filterIntColumn 1024 thrpt 2 4149.384
> op
On Thu, 11 Jan 2024 23:06:32 GMT, Scott Gibbons wrote:
>> Re-write the IndexOf code without the use of the pcmpestri instruction, only
>> using AVX2 instructions. This change accelerates String.IndexOf on average
>> 1.3x for AVX2. The benchmark numbers:
>>
>>
>> Benchmark
On Thu, 11 Jan 2024 23:06:32 GMT, Scott Gibbons wrote:
>> Re-write the IndexOf code without the use of the pcmpestri instruction, only
>> using AVX2 instructions. This change accelerates String.IndexOf on average
>> 1.3x for AVX2. The benchmark numbers:
>>
>>
>> Benchmark
On Thu, 11 Jan 2024 23:06:32 GMT, Scott Gibbons wrote:
>> Re-write the IndexOf code without the use of the pcmpestri instruction, only
>> using AVX2 instructions. This change accelerates String.IndexOf on average
>> 1.3x for AVX2. The benchmark numbers:
>>
>>
>> Benchmark
On Thu, 11 Jan 2024 23:06:32 GMT, Scott Gibbons wrote:
>> Re-write the IndexOf code without the use of the pcmpestri instruction, only
>> using AVX2 instructions. This change accelerates String.IndexOf on average
>> 1.3x for AVX2. The benchmark numbers:
>>
>>
>> Benchmark
On Thu, 11 Jan 2024 23:06:32 GMT, Scott Gibbons wrote:
>> Re-write the IndexOf code without the use of the pcmpestri instruction, only
>> using AVX2 instructions. This change accelerates String.IndexOf on average
>> 1.3x for AVX2. The benchmark numbers:
>>
>>
>> Benchmark
On Thu, 11 Jan 2024 23:06:32 GMT, Scott Gibbons wrote:
>> Re-write the IndexOf code without the use of the pcmpestri instruction, only
>> using AVX2 instructions. This change accelerates String.IndexOf on average
>> 1.3x for AVX2. The benchmark numbers:
>>
>>
>> Benchmark
On Mon, 15 Jan 2024 14:27:43 GMT, Emanuel Peter wrote:
>> Jatin Bhateja has updated the pull request with a new target base due to a
>> merge or a rebase. The pull request now contains 12 commits:
>>
>> - Accelerating masked sub-word gathers for AVX2 targets, this giv
On Thu, 11 Jan 2024 23:06:32 GMT, Scott Gibbons wrote:
>> Re-write the IndexOf code without the use of the pcmpestri instruction, only
>> using AVX2 instructions. This change accelerates String.IndexOf on average
>> 1.3x for AVX2. The benchmark numbers:
>>
>>
>> Benchmark
On Mon, 15 Jan 2024 09:10:38 GMT, Emanuel Peter wrote:
>> Jatin Bhateja has updated the pull request incrementally with one additional
>> commit since the last revision:
>>
>> Using emulated variable blend E-Core optimized instruction.
>
> src/hotspot/cpu/x86/
On Mon, 15 Jan 2024 14:36:38 GMT, Emanuel Peter wrote:
>> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1776:
>>
>>> 1774: for (int i = 0; i < 4; i++) {
>>> 1775: movl(rtmp, Address(idx_base, i * 4));
>>> 1776: addl(rtmp, offset);
>>
>> Can the `offset` not be added to
On Mon, 15 Jan 2024 13:49:06 GMT, Emanuel Peter wrote:
>> Jatin Bhateja has updated the pull request with a new target base due to a
>> merge or a rebase. The pull request now contains 12 commits:
>>
>> - Accelerating masked sub-word gathers for AVX2 targets, this giv
On Tue, 9 Jan 2024 16:48:56 GMT, Jatin Bhateja wrote:
>> Hi,
>>
>> Patch optimizes non-subword vector compress and expand APIs for x86 AVX2
>> only targets.
>> Upcoming E-core Xeons (Sierra Forest) and Hybrid CPUs only support AVX2
>> instruction set.
>
> ops/ms
> ColumnFilterBenchmark.filterFloatColumn 2047 thrpt 2 2587.229 ops/ms
> ColumnFilterBenchmark.filterFloatColumn 4096 thrpt 2 1278.665 ops/ms
> ColumnFilterBenchmark.filterIntColumn 1024 thrpt 2 4149.384
> op
On Thu, 21 Dec 2023 15:21:08 GMT, Scott Gibbons wrote:
>> Re-write the IndexOf code without the use of the pcmpestri instruction, only
>> using AVX2 instructions. This change accelerates String.IndexOf on average
>> 1.3x for AVX2. The benchmark numbers:
>>
>>
>> Benchmark
ctor-node, so that it can float out of a loop if the mask is invariant?
>>
>> CompressV / ExpandV only accepts two inputs, vector to be operated on and
>> mask under which operation is performed, permute table based implementation
>> is specific to x86 backend implemen
On Mon, 8 Jan 2024 07:55:00 GMT, Emanuel Peter wrote:
>>> You are using `VectorMask pred = VectorMask.fromLong(ispecies,
>>> maskctr++);`. That basically systematically iterates over all masks, which
>>> is nice for a correctness test. But that would use different density inside
>>> one test
On Fri, 5 Jan 2024 10:02:28 GMT, Emanuel Peter wrote:
> Thanks for the updates!
>
> One more idea: Your AVX2 solution has a lot of cost for converting the mask
> to a permutation. Might it make sense to split this off into a separate
> vector-node, so that it can float out of a loop if the
> ops/ms
> ColumnFilterBenchmark.filterFloatColumn 2047 thrpt 2 2587.229 ops/ms
> ColumnFilterBenchmark.filterFloatColumn 4096 thrpt 2 1278.665 ops/ms
> ColumnFilterBenchmark.filterIntColumn 1024 thrpt 2 4149.384
> op
On Fri, 5 Jan 2024 09:45:11 GMT, Emanuel Peter wrote:
> You are using `VectorMask pred = VectorMask.fromLong(ispecies,
> maskctr++);`. That basically systematically iterates over all masks, which is
> nice for a correctness test. But that would use different density inside one
> test run,
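To make the density point concrete: when the mask is built from an incrementing counter, the fraction of set lanes varies from one iteration to the next. The sketch below computes that fraction with `Long.bitCount` for a given lane count; it does not use the Vector API itself, and the class and method names are illustrative assumptions.

```java
public class MaskDensitySketch {
    // Fraction of active lanes when the low `lanes` bits of maskBits form the mask.
    static double density(long maskBits, int lanes) {
        long laneMask = (lanes == 64) ? -1L : (1L << lanes) - 1;
        return (double) Long.bitCount(maskBits & laneMask) / lanes;
    }

    public static void main(String[] args) {
        int lanes = 8;
        // An incrementing counter sweeps through masks of very different density.
        for (long maskctr = 0; maskctr < 6; maskctr++) {
            System.out.println(maskctr + " -> " + density(maskctr, lanes));
        }
    }
}
```

A benchmark that wants a fixed density could instead pick mask bit patterns with a constant bit count.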
On Thu, 4 Jan 2024 13:41:40 GMT, Emanuel Peter wrote:
>> Jatin Bhateja has updated the pull request incrementally with one additional
>> commit since the last revision:
>>
>> Updating copyright year of modified files.
>
> src/hotspot/cpu/x86/c2_MacroAssembler
> ops/ms
> ColumnFilterBenchmark.filterFloatColumn 2047 thrpt 2 2587.229 ops/ms
> ColumnFilterBenchmark.filterFloatColumn 4096 thrpt 2 1278.665 ops/ms
> ColumnFilterBenchmark.filterIntColumn 1024 thrpt 2 4149.384
> op
On Thu, 4 Jan 2024 13:30:24 GMT, Emanuel Peter wrote:
>> test/micro/org/openjdk/bench/jdk/incubator/vector/ColumnFilterBenchmark.java
>> line 94:
>>
>>> 92:IntVector vec = IntVector.fromArray(ispecies, intinCol, i);
>>> 93:VectorMask pred =
On Fri, 5 Jan 2024 07:03:26 GMT, Jatin Bhateja wrote:
>> And what about some result verification? Or is there another test that does
>> that?
>
> We do have extensive functional tests for compress/expand APIs in
> [test/jdk/jdk/incubator/vector](https://github.com/openjdk
On Thu, 4 Jan 2024 13:33:08 GMT, Emanuel Peter wrote:
>> Jatin Bhateja has updated the pull request incrementally with one additional
>> commit since the last revision:
>>
>> Updating copyright year of modified files.
>
> test/micro/org/open
> ops/ms
> ColumnFilterBenchmark.filterFloatColumn 2047 thrpt 2 2587.229 ops/ms
> ColumnFilterBenchmark.filterFloatColumn 4096 thrpt 2 1278.665 ops/ms
> ColumnFilterBenchmark.filterIntColumn 1024 thrpt 2 4149.384
>
Hi,
Patch optimizes non-subword vector compress and expand APIs for x86 AVX2 only
targets.
Upcoming E-core Xeons (Sierra Forest) and Hybrid CPUs only support AVX2
instruction set.
These operations are used very frequently in columnar database filters.
Implementation uses a lookup table
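The general idea of a lookup-table-based compress can be sketched in scalar Java: the mask value selects a precomputed row of source lane indices, and the result is a permutation of the source by that row. The sketch assumes 4 lanes and a hypothetical table layout in which unused slots hold -1 and produce 0; the real permute table layout in the x86 backend may differ.

```java
public class CompressTableSketch {
    static final int LANES = 4;
    // TABLE[mask] lists the set lanes of `mask` in ascending order, padded with -1.
    static final int[][] TABLE = buildTable();

    static int[][] buildTable() {
        int[][] table = new int[1 << LANES][LANES];
        for (int mask = 0; mask < table.length; mask++) {
            int out = 0;
            for (int lane = 0; lane < LANES; lane++) {
                if ((mask & (1 << lane)) != 0) {
                    table[mask][out++] = lane;
                }
            }
            while (out < LANES) {
                table[mask][out++] = -1; // no source lane: result lane becomes 0
            }
        }
        return table;
    }

    // Packs the selected lanes of src into the low lanes of the result.
    static long[] compress(long[] src, int mask) {
        long[] dst = new long[LANES];
        int[] row = TABLE[mask];
        for (int j = 0; j < LANES; j++) {
            dst[j] = row[j] >= 0 ? src[row[j]] : 0L;
        }
        return dst;
    }

    public static void main(String[] args) {
        long[] src = {1, 2, 3, 4};
        // Mask 0b1010 selects lanes 1 and 3.
        System.out.println(java.util.Arrays.toString(compress(src, 0b1010)));
    }
}
```

Because the table depends only on the mask, the per-element work reduces to one table lookup followed by a permute.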
ons into caller contexts.
>
> Kindly review and share your feedback.
>
> Best Regards,
> Jatin
Jatin Bhateja has updated the pull request with a new target base due to a
merge or a rebase. The pull request now contains 12 commits:
- Accelerating masked sub-word gathers for AVX2
ons into caller contexts.
>
> Kindly review and share your feedback.
>
> Best Regards,
> Jatin
Jatin Bhateja has updated the pull request incrementally with one additional
commit since the last revision:
Removing JDK-8321648 related changes.
-
Changes:
- all: https:
On Sun, 17 Dec 2023 17:55:11 GMT, Jatin Bhateja wrote:
>> Hi All,
>>
>> This patch optimizes sub-word gather operation for x86 targets with AVX2 and
>> AVX512 features.
>>
>> Following is the summary of changes:-
>>
>> 1) Intrinsify sub-wo
On Wed, 15 Nov 2023 02:17:58 GMT, Jatin Bhateja wrote:
>> Hi All,
>>
>> This patch optimizes sub-word gather operation for x86 targets with AVX2 and
>> AVX512 features.
>>
>> Following is the summary of changes:-
>>
>> 1) Intrinsify sub-wo
into caller contexts.
>
> 3) Some minor adjustments in existing gather instruction patterns for
> double/quad words.
>
>
> Kindly review and share your feedback.
>
>
> Best Regards,
> Jatin
Jatin Bhateja has updated the pull request with a new target base
On Wed, 6 Dec 2023 17:44:25 GMT, Srinivas Vamsi Parasa wrote:
>> The goal is to develop faster sort routines for x86_64 CPUs by taking
>> advantage of AVX2 instructions. This enhancement provides an order of
>> magnitude speedup for Arrays.sort() using int, long, float and double arrays.
>>
On Mon, 4 Dec 2023 22:15:24 GMT, Srinivas Vamsi Parasa wrote:
>> The goal is to develop faster sort routines for x86_64 CPUs by taking
>> advantage of AVX2 instructions. This enhancement provides an order of
>> magnitude speedup for Arrays.sort() using int, long, float and double arrays.
>>
On Mon, 4 Dec 2023 22:15:24 GMT, Srinivas Vamsi Parasa wrote:
>> The goal is to develop faster sort routines for x86_64 CPUs by taking
>> advantage of AVX2 instructions. This enhancement provides an order of
>> magnitude speedup for Arrays.sort() using int, long, float and double arrays.
>>
On Tue, 28 Nov 2023 12:23:12 GMT, Jatin Bhateja wrote:
>> Srinivas Vamsi Parasa has updated the pull request with a new target base
>> due to a merge or a rebase. The incremental webrev excludes the unrelated
>> changes brought in by the merge/rebase. The pull r
On Sat, 18 Nov 2023 01:21:09 GMT, Srinivas Vamsi Parasa
wrote:
>> The goal is to develop faster sort routines for x86_64 CPUs by taking
>> advantage of AVX2 instructions. This enhancement provides an order of
>> magnitude speedup for Arrays.sort() using int, long, float and double arrays.
>>
On Wed, 15 Nov 2023 02:17:58 GMT, Jatin Bhateja wrote:
>> Hi All,
>>
>> This patch optimizes sub-word gather operation for x86 targets with AVX2 and
>> AVX512 features.
>>
>> Following is the summary of changes:-
>>
>> 1) Intrinsi
On Thu, 16 Nov 2023 21:26:47 GMT, Steve Dohrmann wrote:
>> Below is baseline data collected using a modified version of the
>> java.lang.foreign.xor micro benchmark referenced by @mcimadamore in the bug
>> report. I collected data on an Ubuntu 22.04 laptop with a Tigerlake
>> i7-1185G7,
On Mon, 20 Nov 2023 01:34:57 GMT, Xiaohong Gong wrote:
> > > BTW, I have two questions:
> > >
> > > 1. An intrinsic that accepts a vector as the index, like the non-subword
> > > gather, is more beneficial in real applications. See: [8287289:
> > > Gather/Scatter with Index Vector
> > >
On Wed, 15 Nov 2023 02:17:58 GMT, Jatin Bhateja wrote:
>> Hi All,
>>
>> This patch optimizes sub-word gather operation for x86 targets with AVX2 and
>> AVX512 features.
>>
>> Following is the summary of changes:-
>>
>> 1) Intrinsi
On Thu, 16 Nov 2023 04:07:21 GMT, Xiaohong Gong wrote:
>> Jatin Bhateja has updated the pull request incrementally with one additional
>> commit since the last revision:
>>
>> Fix incorrect comment
>
> src/jdk.incubator.vector/share/classes/jdk/incubator/vecto
On Wed, 15 Nov 2023 17:03:38 GMT, Steve Dohrmann wrote:
>> Do you see any concerns in the multithreaded case, where a writer is busy
>> copying 256-byte blocks in a loop and a reader tries to access a location
>> not yet flushed out of the write-combining buffer?
>
> The results a concurrent
On Tue, 14 Nov 2023 07:59:22 GMT, Jatin Bhateja wrote:
>> Below is baseline data collected using a modified version of the
>> java.lang.foreign.xor micro benchmark referenced by @mcimadamore in the bug
>> report. I collected data on an Ubuntu 22.04 laptop with a Tiger
was very bulky. This may impact
> in-lining decisions into caller contexts.
>
> 3) Some minor adjustments in existing gather instruction patterns for
> double/quad words.
>
>
> Kindly review and share your feedback.
>
>
> Best Regards,
> Jatin
Jatin Bhateja has
On Wed, 15 Nov 2023 01:17:05 GMT, Steve Dohrmann wrote:
>> @jatin-bhateja There is a sfence at line 781.
>
> Thanks, there is a store fence upon completion of the main loop for the
> large size code:
>
> ![image](https://github.com/openjdk/jdk/assets/3858882/3b
was very bulky. This may impact
> in-lining decisions into caller contexts.
>
> 3) Some minor adjustments in existing gather instruction patterns for
> double/quad words.
>
>
> Kindly review and share your feedback.
>
>
> Best Regards,
> Jatin
Jatin Bhate
On Wed, 8 Nov 2023 23:23:48 GMT, Steve Dohrmann wrote:
> Below is baseline data collected using a modified version of the
> java.lang.foreign.xor micro benchmark referenced by @mcimadamore in the bug
> report. I collected data on an Ubuntu 22.04 laptop with a Tigerlake
> i7-1185G7, which
On Wed, 8 Nov 2023 23:23:48 GMT, Steve Dohrmann wrote:
> Below is baseline data collected using a modified version of the
> java.lang.foreign.xor micro benchmark referenced by @mcimadamore in the bug
> report. I collected data on an Ubuntu 22.04 laptop with a Tigerlake
> i7-1185G7, which
On Fri, 10 Nov 2023 03:33:51 GMT, Sandhya Viswanathan
wrote:
>> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1648:
>>
>>> 1646: vpermd(xtmp3, xtmp1, xtmp3, vlen_enc == Assembler::AVX_512bit ?
>>> vlen_enc : Assembler::AVX_256bit);
>>> 1647: vpsubd(xtmp1, xtmp1, xtmp2, vlen_enc);
was very bulky. This may impact
> in-lining decisions into caller contexts.
>
> 3) Some minor adjustments in existing gather instruction patterns for
> double/quad words.
>
>
> Kindly review and share your feedback.
>
>
> Best Regards,
> Jatin
Jatin Bhate
On Mon, 6 Nov 2023 18:33:46 GMT, Sandhya Viswanathan
wrote:
> This is not a masked operation, so every lane of dst will be written through
> pinsrw/pinsrb. A vpxor before is not required.
The xor here clears the intermediate vector after each iteration; that vector
is eventually ORed with the destination.
On Fri, 3 Nov 2023 00:22:55 GMT, Sandhya Viswanathan
wrote:
>> Jatin Bhateja has updated the pull request incrementally with one additional
>> commit since the last revision:
>>
>> Restricting masked sub-word gather to AVX512 target to align with integral
>> g
On Fri, 3 Nov 2023 23:20:49 GMT, Sandhya Viswanathan
wrote:
>> Jatin Bhateja has updated the pull request incrementally with one additional
>> commit since the last revision:
>>
>> Restricting masked sub-word gather to AVX512 target to align with integral
>> g
On Fri, 3 Nov 2023 20:00:30 GMT, Sandhya Viswanathan
wrote:
>> Jatin Bhateja has updated the pull request incrementally with one additional
>> commit since the last revision:
>>
>> Restricting masked sub-word gather to AVX512 target to align with integral
>> g
was very bulky. This may impact
> in-lining decisions into caller contexts.
>
> 3) Some minor adjustments in existing gather instruction patterns for
> double/quad words.
>
>
> Kindly review and share your feedback.
>
>
> Best Regards,
> Jatin
Jatin Bhate
On Sun, 5 Nov 2023 12:58:33 GMT, Jatin Bhateja wrote:
>> src/hotspot/cpu/x86/x86.ad line 4074:
>>
>>> 4072: BasicType elem_bt = Matcher::vector_element_basic_type(this);
>>> 4073: assert(!is_subword_type(elem_bt), "sanity"); // T_INT,
On Fri, 3 Nov 2023 23:07:44 GMT, Sandhya Viswanathan
wrote:
>> Jatin Bhateja has updated the pull request incrementally with one additional
>> commit since the last revision:
>>
>> Restricting masked sub-word gather to AVX512 target to align with integral
>> g