Re: RFR: 8268276: Base64 Decoding optimization for x86 using AVX-512 [v7]

2021-06-24 Thread Sandhya Viswanathan
On Thu, 24 Jun 2021 14:50:01 GMT, Vladimir Kozlov wrote:

>> Scott Gibbons has updated the pull request incrementally with one additional 
>> commit since the last revision:
>> 
>>   Fixing Windows build warnings
>
> The rest of the testing (hs-tier1-4 and xcomp) has finished and is clean.
> So this is the only failure. I attached the hs_err file to the RFE.

Thanks a lot @vnkozlov for the review and test.

-

PR: https://git.openjdk.java.net/jdk/pull/4368


Re: RFR: 8268276: Base64 Decoding optimization for x86 using AVX-512 [v7]

2021-06-24 Thread Scott Gibbons
On Thu, 24 Jun 2021 14:50:01 GMT, Vladimir Kozlov wrote:

>> Scott Gibbons has updated the pull request incrementally with one additional 
>> commit since the last revision:
>> 
>>   Fixing Windows build warnings
>
> The rest of the testing (hs-tier1-4 and xcomp) has finished and is clean.
> So this is the only failure. I attached the hs_err file to the RFE.

Hi, @vnkozlov.  I just pushed a change that fixes a register overwrite.  Can 
you please start the tests again?

Thanks

-

PR: https://git.openjdk.java.net/jdk/pull/4368


Re: RFR: 8268276: Base64 Decoding optimization for x86 using AVX-512 [v7]

2021-06-24 Thread Vladimir Kozlov
On Wed, 23 Jun 2021 00:31:55 GMT, Scott Gibbons wrote:

>> Add the Base64 Decode intrinsic for x86 to utilize AVX-512 for acceleration. 
>> Also allows for performance improvement for non-AVX-512 enabled platforms. 
>> Due to the nature of MIME-encoded inputs, modify the intrinsic signature to 
>> accept an additional parameter (isMIME) for fast-path MIME decoding.
>> 
>> A change was made to the signature of DecodeBlock in Base64.java to provide 
>> the intrinsic information as to whether MIME decoding was being done.  This 
>> allows for the intrinsic to bypass the expensive setup of zmm registers from 
>> AVX tables, knowing there may be invalid Base64 characters every 76 
>> characters or so.  A change was also made here removing the restriction that 
>> the intrinsic must return an even multiple of 3 bytes decoded.  This 
>> implementation handles the pad characters at the end of the string and will 
>> return the actual number of characters decoded.
>> 
>> The AVX portion of this code will decode in blocks of 256 bytes per loop 
>> iteration, then in chunks of 64 bytes, followed by end fixup decoding.  The 
>> non-AVX code is an assembly-optimized version of the java DecodeBlock and 
>> behaves identically.
>> 
>> Running the Base64Decode benchmark, this change increases decode performance 
>> by an average of 2.6x with a maximum 19.7x for buffers > ~20k.  The numbers 
>> are given in the table below.
>> 
>> **Base Score** is without intrinsic support, **Optimized Score** is using 
>> this intrinsic, and **Gain** is **Base** / **Optimized**.
>> 
>> 
>> Benchmark Name | Base Score | Optimized Score | Gain
>> -- | -- | -- | --
>> testBase64Decode size 1 | 15.36 | 15.32 | 1.00
>> testBase64Decode size 3 | 17.00 | 16.72 | 1.02
>> testBase64Decode size 7 | 20.60 | 18.82 | 1.09
>> testBase64Decode size 32 | 34.21 | 26.77 | 1.28
>> testBase64Decode size 64 | 54.43 | 38.35 | 1.42
>> testBase64Decode size 80 | 66.40 | 48.34 | 1.37
>> testBase64Decode size 96 | 73.16 | 52.90 | 1.38
>> testBase64Decode size 112 | 84.93 | 51.82 | 1.64
>> testBase64Decode size 512 | 288.81 | 32.04 | 9.01
>> testBase64Decode size 1000 | 560.48 | 40.79 | 13.74
>> testBase64Decode size 2 | 9530.28 | 483.37 | 19.72
>> testBase64Decode size 5 | 24552.24 | 1735.07 | 14.15
>> testBase64MIMEDecode size 1 | 22.87 | 21.36 | 1.07
>> testBase64MIMEDecode size 3 | 27.79 | 25.32 | 1.10
>> testBase64MIMEDecode size 7 | 44.74 | 43.81 | 1.02
>> testBase64MIMEDecode size 32 | 142.69 | 129.56 | 1.10
>> testBase64MIMEDecode size 64 | 256.90 | 243.80 | 1.05
>> testBase64MIMEDecode size 80 | 311.60 | 310.80 | 1.00
>> testBase64MIMEDecode size 96 | 364.00 | 346.66 | 1.05
>> testBase64MIMEDecode size 112 | 472.88 | 394.78 | 1.20
>> testBase64MIMEDecode size 512 | 1814.96 | 1671.28 | 1.09
>> testBase64MIMEDecode size 1000 | 3623.50 | 3227.61 | 1.12
>> testBase64MIMEDecode size 2 | 70484.09 | 64940.77 | 1.09
>> testBase64MIMEDecode size 5 | 191732.34 | 158158.95 | 1.21
>> testBase64WithErrorInputsDecode size 1 | 1531.02 | 1185.19 | 1.29
>> testBase64WithErrorInputsDecode size 3 | 1306.59 | 1170.99 | 1.12
>> testBase64WithErrorInputsDecode size 7 | 1238.11 | 1176.62 | 1.05
>> testBase64WithErrorInputsDecode size 32 | 1346.46 | 1138.47 | 1.18
>> testBase64WithErrorInputsDecode size 64 | 1195.28 | 1172.52 | 1.02
>> testBase64WithErrorInputsDecode size 80 | 1469.00 | 1180.94 | 1.24
>> testBase64WithErrorInputsDecode size 96 | 1434.48 | 1167.74 | 1.23
>> testBase64WithErrorInputsDecode size 112 | 1440.06 | 1162.56 | 1.24
>> testBase64WithErrorInputsDecode size 512 | 1362.79 | 1193.42 | 1.14
>> testBase64WithErrorInputsDecode size 1000 | 1426.07 | 1194.44 | 1.19
>> testBase64WithErrorInputsDecode size   2 | 1398.44 | 1138.17 | 1.23
>> testBase64WithErrorInputsDecode size   5 | 1409.41 | 1114.16 | 1.26
>
> Scott Gibbons has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Fixing Windows build warnings

The rest of the testing (hs-tier1-4 and xcomp) has finished and is clean.
So this is the only failure. I attached the hs_err file to the RFE.

-

PR: https://git.openjdk.java.net/jdk/pull/4368


Re: RFR: 8268276: Base64 Decoding optimization for x86 using AVX-512 [v7]

2021-06-24 Thread Vladimir Kozlov
On Wed, 23 Jun 2021 00:31:55 GMT, Scott Gibbons wrote:

> Scott Gibbons has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Fixing Windows build warnings

I hit a strange failure in the compiler/intrinsics/base64/TestBase64.java test 
on a Windows machine which has an Intel 8167M CPU (AVX512).

#  EXCEPTION_ACCESS_VIOLATION (0xc005) at pc=0x7ff92bcbd99e, pid=24628, 
tid=6804
#
# Problematic frame:
# V  [jvm.dll+0xabd99e]  ObjectMonitor::object_peek+0xe
#

Current thread (0x016c923de2c0):  JavaThread "MainThread" [_thread_in_Java, 
id=6804, stack(0x0060df60,0x0060df70)]

Stack: [0x0060df60,0x0060df70],  sp=0x0060df6fcb50,  free 
space=1010k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [jvm.dll+0xabd99e]  ObjectMonitor::object_peek+0xe  (objectMonitor.cpp:304)
V  [jvm.dll+0xc48d5b]  ObjectSynchronizer::quick_enter+0x9b  
(synchronizer.cpp:331)
V  [jvm.dll+0xb9b6f6]  SharedRuntime::monitor_enter_helper+0x36  
(sharedRuntime.cpp:2112)
V  [jvm.dll+0x389894]  Runtime1::monitorenter+0x94  (c1_Runtime1.cpp:748)
C  0x016c99c4a757

Java frames: (J=compiled Java code, 

Re: RFR: 8268276: Base64 Decoding optimization for x86 using AVX-512 [v7]

2021-06-23 Thread Vladimir Kozlov
On Wed, 23 Jun 2021 00:31:55 GMT, Scott Gibbons wrote:

> Scott Gibbons has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Fixing Windows build warnings

I will run our internal testing before approving this.

-

PR: https://git.openjdk.java.net/jdk/pull/4368


RE: RFR: 8268276: Base64 Decoding optimization for x86 using AVX-512 [v7]

2021-06-23 Thread Gibbons, Scott
Hi, David.  I don't have permissions to run tests in this repo.  I have tested 
on several x86 platforms (ICX, SKL) with several options.  I'll be running more 
tests today.

Thanks,
--Scott

-Original Message-
From: hotspot-dev On Behalf Of David Holmes
Sent: Tuesday, June 22, 2021 7:21 PM
To: build-dev@openjdk.java.net; core-libs-...@openjdk.java.net; 
hotspot-...@openjdk.java.net; hotspot-compiler-...@openjdk.java.net
Subject: Re: RFR: 8268276: Base64 Decoding optimization for x86 using AVX-512 
[v7]

On Wed, 23 Jun 2021 00:31:55 GMT, Scott Gibbons wrote:

> Scott Gibbons has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Fixing Windows build warnings

What testing has been done for this change? I do not see that the GitHub 
Actions have been run for this PR. Has this been tested on a range of x86 
systems with differing AVX capabilities?

Thanks,
David

-

PR: https://git.openjdk.java.net/jdk/pull/4368


Re: RFR: 8268276: Base64 Decoding optimization for x86 using AVX-512 [v7]

2021-06-22 Thread David Holmes
On Wed, 23 Jun 2021 00:31:55 GMT, Scott Gibbons wrote:

> Scott Gibbons has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Fixing Windows build warnings

What testing has been done for this change? I do not see that the GitHub 
Actions have been run for this PR. Has this been tested on a range of x86 
systems with differing AVX capabilities?

Thanks,
David

-

PR: https://git.openjdk.java.net/jdk/pull/4368


Re: RFR: 8268276: Base64 Decoding optimization for x86 using AVX-512 [v7]

2021-06-22 Thread Scott Gibbons
> Add the Base64 Decode intrinsic for x86 to utilize AVX-512 for acceleration. 
> Also allows for performance improvement for non-AVX-512 enabled platforms. 
> Due to the nature of MIME-encoded inputs, modify the intrinsic signature to 
> accept an additional parameter (isMIME) for fast-path MIME decoding.
> 
> A change was made to the signature of DecodeBlock in Base64.java to provide 
> the intrinsic information as to whether MIME decoding was being done.  This 
> allows for the intrinsic to bypass the expensive setup of zmm registers from 
> AVX tables, knowing there may be invalid Base64 characters every 76 
> characters or so.  A change was also made here removing the restriction that 
> the intrinsic must return an even multiple of 3 bytes decoded.  This 
> implementation handles the pad characters at the end of the string and will 
> return the actual number of characters decoded.
> 
> The AVX portion of this code will decode in blocks of 256 bytes per loop 
> iteration, then in chunks of 64 bytes, followed by end fixup decoding.  The 
> non-AVX code is an assembly-optimized version of the java DecodeBlock and 
> behaves identically.
> 
> Running the Base64Decode benchmark, this change increases decode performance 
> by an average of 2.6x with a maximum 19.7x for buffers > ~20k.  The numbers 
> are given in the table below.
> 
> **Base Score** is without intrinsic support, **Optimized Score** is using 
> this intrinsic, and **Gain** is **Base** / **Optimized**.
> 
> 
> Benchmark Name | Base Score | Optimized Score | Gain
> -- | -- | -- | --
> testBase64Decode size 1 | 15.36 | 15.32 | 1.00
> testBase64Decode size 3 | 17.00 | 16.72 | 1.02
> testBase64Decode size 7 | 20.60 | 18.82 | 1.09
> testBase64Decode size 32 | 34.21 | 26.77 | 1.28
> testBase64Decode size 64 | 54.43 | 38.35 | 1.42
> testBase64Decode size 80 | 66.40 | 48.34 | 1.37
> testBase64Decode size 96 | 73.16 | 52.90 | 1.38
> testBase64Decode size 112 | 84.93 | 51.82 | 1.64
> testBase64Decode size 512 | 288.81 | 32.04 | 9.01
> testBase64Decode size 1000 | 560.48 | 40.79 | 13.74
> testBase64Decode size 2 | 9530.28 | 483.37 | 19.72
> testBase64Decode size 5 | 24552.24 | 1735.07 | 14.15
> testBase64MIMEDecode size 1 | 22.87 | 21.36 | 1.07
> testBase64MIMEDecode size 3 | 27.79 | 25.32 | 1.10
> testBase64MIMEDecode size 7 | 44.74 | 43.81 | 1.02
> testBase64MIMEDecode size 32 | 142.69 | 129.56 | 1.10
> testBase64MIMEDecode size 64 | 256.90 | 243.80 | 1.05
> testBase64MIMEDecode size 80 | 311.60 | 310.80 | 1.00
> testBase64MIMEDecode size 96 | 364.00 | 346.66 | 1.05
> testBase64MIMEDecode size 112 | 472.88 | 394.78 | 1.20
> testBase64MIMEDecode size 512 | 1814.96 | 1671.28 | 1.09
> testBase64MIMEDecode size 1000 | 3623.50 | 3227.61 | 1.12
> testBase64MIMEDecode size 2 | 70484.09 | 64940.77 | 1.09
> testBase64MIMEDecode size 5 | 191732.34 | 158158.95 | 1.21
> testBase64WithErrorInputsDecode size 1 | 1531.02 | 1185.19 | 1.29
> testBase64WithErrorInputsDecode size 3 | 1306.59 | 1170.99 | 1.12
> testBase64WithErrorInputsDecode size 7 | 1238.11 | 1176.62 | 1.05
> testBase64WithErrorInputsDecode size 32 | 1346.46 | 1138.47 | 1.18
> testBase64WithErrorInputsDecode size 64 | 1195.28 | 1172.52 | 1.02
> testBase64WithErrorInputsDecode size 80 | 1469.00 | 1180.94 | 1.24
> testBase64WithErrorInputsDecode size 96 | 1434.48 | 1167.74 | 1.23
> testBase64WithErrorInputsDecode size 112 | 1440.06 | 1162.56 | 1.24
> testBase64WithErrorInputsDecode size 512 | 1362.79 | 1193.42 | 1.14
> testBase64WithErrorInputsDecode size 1000 | 1426.07 | 1194.44 | 1.19
> testBase64WithErrorInputsDecode size   2 | 1398.44 | 1138.17 | 1.23
> testBase64WithErrorInputsDecode size   5 | 1409.41 | 1114.16 | 1.26
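
For readers checking the table, the Gain column can be recomputed from the two score columns using the definition given above (Gain = Base / Optimized). The sketch below does this for three rows. The class and method names (`GainCheck`, `gain`, `fmt`) are made up for illustration and are not part of the PR. Because a gain above 1 is reported as an improvement, the scores appear to be times where lower is better, but the thread does not state the benchmark unit, so treat that as an assumption.

```java
import java.util.Locale;

public class GainCheck {
    // Gain as defined in the description: base score divided by optimized score.
    static double gain(double baseScore, double optimizedScore) {
        return baseScore / optimizedScore;
    }

    // Format to two decimals, matching the precision used in the table.
    static String fmt(double value) {
        return String.format(Locale.ROOT, "%.2f", value);
    }

    public static void main(String[] args) {
        // Score pairs copied from three rows of the table above.
        System.out.println(fmt(gain(288.81, 32.04)));    // testBase64Decode size 512: prints 9.01
        System.out.println(fmt(gain(560.48, 40.79)));    // testBase64Decode size 1000: prints 13.74
        System.out.println(fmt(gain(1814.96, 1671.28))); // testBase64MIMEDecode size 512: prints 1.09
    }
}
```

The recomputed values match the Gain column for these rows.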

Scott Gibbons has updated the pull request incrementally with one additional 
commit since the last revision:

  Fixing Windows build warnings
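
As background for the isMIME parameter mentioned in the description: MIME-encoded input produced by `java.util.Base64` contains a CRLF line separator after every 76 characters, and those separator bytes are not in the Base64 alphabet. The sketch below demonstrates this using only the public `java.util.Base64` API. It does not touch the intrinsic or the internal decode routine, and the class name `MimeDecodeDemo` and helper `sample` are made up for illustration.

```java
import java.util.Arrays;
import java.util.Base64;

public class MimeDecodeDemo {
    // Build a deterministic input buffer of n bytes (hypothetical helper).
    static byte[] sample(int n) {
        byte[] data = new byte[n];
        for (int i = 0; i < n; i++) {
            data[i] = (byte) i;
        }
        return data;
    }

    public static void main(String[] args) {
        byte[] data = sample(100);

        // The MIME encoder wraps its output at 76 characters with "\r\n".
        String mime = Base64.getMimeEncoder().encodeToString(data);
        System.out.println(mime.indexOf("\r\n")); // prints 76

        // The MIME decoder skips the line separators and recovers the input.
        byte[] decoded = Base64.getMimeDecoder().decode(mime);
        System.out.println(Arrays.equals(decoded, data)); // prints true

        // The basic decoder treats the separators as illegal characters.
        try {
            Base64.getDecoder().decode(mime);
            System.out.println("accepted");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected"); // prints rejected
        }
    }
}
```

This is why a decoder that knows it is handling MIME input can expect to meet non-alphabet bytes at regular intervals, which is the situation the description refers to.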

-

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/4368/files
  - new: https://git.openjdk.java.net/jdk/pull/4368/files/e1b4af9e..58461b80

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk=4368=06
 - incr: https://webrevs.openjdk.java.net/?repo=jdk=4368=05-06

  Stats: 24 lines in 1 file changed: 8 ins; 0 del; 16 mod
  Patch: https://git.openjdk.java.net/jdk/pull/4368.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/4368/head:pull/4368

PR: https://git.openjdk.java.net/jdk/pull/4368