Re: RFR: 8155795: Optimize Integer/Long.reverse by using reverseBytes
Hi, I think this shows the compiler isn't doing anything really strange; it'd be fun to know if it's the different instruction order or the use of one less constant that shows an effect, but it's not really significant either way. I think the latest patch is fine as it is and will push it soon. Thanks! /Claes On 2016-05-09 21:50, Jaroslav Kameník wrote: Hi guys, I have looked at generated asm, it does not seem there is problem, only little different order of shifts and ands.. Plus, at haswell, times do not differ so much. Here are different parts: Slower version: 057 sallR10, #24 05b movlR11, RAX# spill 05e sallR11, #8 062 movlR8, RAX # spill 065 shrlR8, #8 069 andlR11, #16711680 # int 070 andlR8, #65280 # int 077 shrlRAX, #24 Faster version: 057 shrlR10, #24 05b movlR11, RAX# spill 05e shrlR11, #8 062 movlR8, RAX # spill 065 andlR8, #65280 # int 06c andlR11, #65280 # int 073 sallR8, #8 077 sallRAX, #24 Full code: ORIGINAL - 5,169 ± 0,377 ns/op === 000 B1: # N1 <- BLOCK HEAD IS JUNK Freq: 1 000 # stack bang (96 bytes) pushq rbp # Save rbp subqrsp, #16# Create frame 00c movlRAX, RSI# spill 00e andlRAX, #1431655765# int 014 shrlRSI, #1 016 sallRAX, #1 018 andlRSI, #1431655765# int 01e orl RAX, RSI# int 020 movlR10, RAX# spill 023 shrlR10, #2 027 andlRAX, #858993459 # int 02d andlR10, #858993459 # int 034 sallRAX, #2 037 orl RAX, R10# int 03a movlR10, RAX# spill 03d shrlR10, #4 041 andlRAX, #252645135 # int 047 andlR10, #252645135 # int 04e sallRAX, #4 051 orl RAX, R10# int 054 movlR10, RAX# spill 057 shrlR10, #24 05b movlR11, RAX# spill 05e shrlR11, #8 062 movlR8, RAX # spill 065 andlR8, #65280 # int 06c andlR11, #65280 # int 073 sallR8, #8 077 sallRAX, #24 07a orl RAX, R8 # int 07d orl RAX, R11# int 080 orl RAX, R10# int 083 addqrsp, 16 # Destroy frame USED INTRINSIC - 4,093 ± 0,100 ns/op = 000 B1: # N1 <- BLOCK HEAD IS JUNK Freq: 1 000 # stack bang (96 bytes) pushq rbp # Save rbp subqrsp, #16# Create frame 00c movlRAX, RSI# spill 00e andlRAX, #1431655765# int 014 shrlRSI, #1 016 sallRAX, #1 018 andlRSI, #1431655765# int 01e orl RAX, RSI# int 020 movlR10, RAX# spill 023 shrlR10, #2 027 andlRAX, #858993459 # int 02d andlR10, #858993459 # int 034 sallRAX, #2 037 orl RAX, R10# int 03a movlR10, RAX# spill 03d shrlR10, #4 041 andlRAX, #252645135 # int 047 andlR10, #252645135 # int 04e sallRAX, #4 051 orl RAX, R10# int 054 bswapl RAX 056 addqrsp, 16 # Destroy frame INTRINSICS DISABLED - 5,173 ± 0,096 ns/op == 000 B1: # N1 <- BLOCK HEAD IS JUNK Freq: 1 000 # stack bang (96 bytes) pushq rbp # Save rbp subqrsp, #16# Create frame 00c movlRAX, RSI# spill 00e andlRAX, #1431655765# int 014 shrlRSI, #1 016 sallRAX, #1 018 andlRSI, #1431655765# int 01e orl RAX, RSI# int 020 movlR10, RAX# spill 023 shrlR10, #2 027 andlRAX, #858993459 # int 02d andlR10, #858993459 # int 034 sallRAX, #2 037 orl RAX, R10# int 03a movlR10, RAX# spill 03d shrlR10, #4 041 andlRAX, #252645135 # int 047 andlR10, #252645135 # int 04e sallRAX, #4 051 orl RAX, R10# int 054 movlR10, RAX# spill 057 sallR10, #24 05b movlR11, RAX# spill 05e sallR11, #8 062 movlR8, RAX # spill 065 shrlR8, #8 069 andlR11, #16711680 # int 070 andlR8, #65280 # int 077 shrlRAX, #24 07a orl RAX, R8 # int 07d orl RAX, R11# int 080 orl RAX, R10# int 083 addqrsp, 16 # Destroy frame INTRINSICS DISABLED 2 - 5,081 ± 0,092 ns/op === 000 B1: # N1 <- BLOCK HEAD IS JUNK Freq: 1 000 # stack bang (96 bytes) pushq rbp # Save rbp subqrsp, #16# Create frame 00c movlRAX, RSI# spill 00e andlRAX, #1431655765#
Re: RFR: 8155795: Optimize Integer/Long.reverse by using reverseBytes
Hi guys, I have looked at generated asm, it does not seem there is problem, only little different order of shifts and ands.. Plus, at haswell, times do not differ so much. Here are different parts: Slower version: 057 sallR10, #24 05b movlR11, RAX# spill 05e sallR11, #8 062 movlR8, RAX # spill 065 shrlR8, #8 069 andlR11, #16711680 # int 070 andlR8, #65280 # int 077 shrlRAX, #24 Faster version: 057 shrlR10, #24 05b movlR11, RAX# spill 05e shrlR11, #8 062 movlR8, RAX # spill 065 andlR8, #65280 # int 06c andlR11, #65280 # int 073 sallR8, #8 077 sallRAX, #24 Full code: ORIGINAL - 5,169 ± 0,377 ns/op === 000 B1: # N1 <- BLOCK HEAD IS JUNK Freq: 1 000 # stack bang (96 bytes) pushq rbp # Save rbp subqrsp, #16# Create frame 00c movlRAX, RSI# spill 00e andlRAX, #1431655765# int 014 shrlRSI, #1 016 sallRAX, #1 018 andlRSI, #1431655765# int 01e orl RAX, RSI# int 020 movlR10, RAX# spill 023 shrlR10, #2 027 andlRAX, #858993459 # int 02d andlR10, #858993459 # int 034 sallRAX, #2 037 orl RAX, R10# int 03a movlR10, RAX# spill 03d shrlR10, #4 041 andlRAX, #252645135 # int 047 andlR10, #252645135 # int 04e sallRAX, #4 051 orl RAX, R10# int 054 movlR10, RAX# spill 057 shrlR10, #24 05b movlR11, RAX# spill 05e shrlR11, #8 062 movlR8, RAX # spill 065 andlR8, #65280 # int 06c andlR11, #65280 # int 073 sallR8, #8 077 sallRAX, #24 07a orl RAX, R8 # int 07d orl RAX, R11# int 080 orl RAX, R10# int 083 addqrsp, 16 # Destroy frame USED INTRINSIC - 4,093 ± 0,100 ns/op = 000 B1: # N1 <- BLOCK HEAD IS JUNK Freq: 1 000 # stack bang (96 bytes) pushq rbp # Save rbp subqrsp, #16# Create frame 00c movlRAX, RSI# spill 00e andlRAX, #1431655765# int 014 shrlRSI, #1 016 sallRAX, #1 018 andlRSI, #1431655765# int 01e orl RAX, RSI# int 020 movlR10, RAX# spill 023 shrlR10, #2 027 andlRAX, #858993459 # int 02d andlR10, #858993459 # int 034 sallRAX, #2 037 orl RAX, R10# int 03a movlR10, RAX# spill 03d shrlR10, #4 041 andlRAX, #252645135 # int 047 andlR10, #252645135 # int 04e sallRAX, #4 051 orl RAX, R10# int 054 bswapl RAX 056 addqrsp, 16 # Destroy frame INTRINSICS DISABLED - 5,173 ± 0,096 ns/op == 000 B1: # N1 <- BLOCK HEAD IS JUNK Freq: 1 000 # stack bang (96 bytes) pushq rbp # Save rbp subqrsp, #16# Create frame 00c movlRAX, RSI# spill 00e andlRAX, #1431655765# int 014 shrlRSI, #1 016 sallRAX, #1 018 andlRSI, #1431655765# int 01e orl RAX, RSI# int 020 movlR10, RAX# spill 023 shrlR10, #2 027 andlRAX, #858993459 # int 02d andlR10, #858993459 # int 034 sallRAX, #2 037 orl RAX, R10# int 03a movlR10, RAX# spill 03d shrlR10, #4 041 andlRAX, #252645135 # int 047 andlR10, #252645135 # int 04e sallRAX, #4 051 orl RAX, R10# int 054 movlR10, RAX# spill 057 sallR10, #24 05b movlR11, RAX# spill 05e sallR11, #8 062 movlR8, RAX # spill 065 shrlR8, #8 069 andlR11, #16711680 # int 070 andlR8, #65280 # int 077 shrlRAX, #24 07a orl RAX, R8 # int 07d orl RAX, R11# int 080 orl RAX, R10# int 083 addqrsp, 16 # Destroy frame INTRINSICS DISABLED 2 - 5,081 ± 0,092 ns/op === 000 B1: # N1 <- BLOCK HEAD IS JUNK Freq: 1 000 # stack bang (96 bytes) pushq rbp # Save rbp subqrsp, #16# Create frame 00c movlRAX, RSI# spill 00e andlRAX, #1431655765# int 014 shrlRSI, #1 016 sallRAX, #1 018 andlRSI, #1431655765# int 01e orl RAX, RSI# int 020 movlR10, RAX# spill 023 shrlR10, #2 027 andlRAX, #858993459 # int 02d andlR10, #858993459 # int 034 sallRAX, #2 037 orl RAX, R10# int 03a movlR10, RAX# spill 03d
Re: RFR: 8155795: Optimize Integer/Long.reverse by using reverseBytes
Thanks for comments. I have changed tests as Aleksey suggested. Btw. is not there some class used for utility methods as leftpad? Ad performance of changed reverseBytes, I tried to compare both reverseBytes alone, and it seems to be same, but differs when inlined to reverse(). J. 2016-05-02 17:00 GMT+02:00 Claes Redestad: > On 2016-05-02 16:48, Aleksey Shipilev wrote: > >> On 05/02/2016 05:07 PM, Claes Redestad wrote: >> >>> I'd like to sponsor this patch proposed by Jaroslav Kameník here: >>> >>> http://mail.openjdk.java.net/pipermail/core-libs-dev/2016-April/040660.html >>> >>> Bug https://bugs.openjdk.java.net/browse/JDK-8155795 >>> Webrev: http://cr.openjdk.java.net/~redestad/8155795/webrev.01/ >>> >> *) I wonder if Integer.reverseBytes is better spelled as: >> >> return (i >>> 24) | >> ((i & 0xFF) >> 8) | >> ((i & 0xFF00) << 8) | >> (i << 24); >> > > The reverseBytes changes were motivated by seeing slightly better > performance on the micro with the suggested changes, but after > discussing this a bit I think we should revert to the original code alone > and investigate if there's some compiler quirk lurking here separately. > > I'll file a bug. > > >> *) The test should probably follow the same randomized testing pattern >> as other tests: >> >> for (int i = 0; i < N; i++) { >> int x = rnd.nextInt(); >> >> String expected = new StringBuilder(). >> .append(leftpad(Integer.toBinaryString(x), 32)) >> .reverse().toString(); >> >> String actual = >> leftpad(Integer.toBinaryString(Integer.reverse(x)), 32); >> >> if (!expected.equals(actual)) { >> throw new RuntimeException("reverse: \n" + >>expected + " \n" + actual); >> } >> >> // That's how development looks like in 2016. >> private static String leftpad(String s, int width) { >> String r = s; >> for (int c = 0; c < width - s.length(); c++) { >> r = "0" + r; >> } >> return r; >> } >> >> ...and should probably precede any other test that uses reverse(). >> > > Seems reasonable. > > Jaroslav, do you mind improving the test as per Aleksey's > suggestions? > > Thanks! > > /Claes > diff -r 2bf84670f079 src/java.base/share/classes/java/lang/Integer.java --- a/src/java.base/share/classes/java/lang/Integer.javaSat Apr 30 16:08:48 2016 -0700 +++ b/src/java.base/share/classes/java/lang/Integer.javaMon May 02 22:09:54 2016 +0200 @@ -1790,9 +1790,8 @@ i = (i & 0x) << 1 | (i >>> 1) & 0x; i = (i & 0x) << 2 | (i >>> 2) & 0x; i = (i & 0x0f0f0f0f) << 4 | (i >>> 4) & 0x0f0f0f0f; -i = (i << 24) | ((i & 0xff00) << 8) | -((i >>> 8) & 0xff00) | (i >>> 24); -return i; + +return reverseBytes(i); } /** @@ -1820,10 +1819,10 @@ */ @HotSpotIntrinsicCandidate public static int reverseBytes(int i) { -return ((i >>> 24) ) | - ((i >> 8) & 0xFF00) | - ((i << 8) & 0xFF) | - ((i << 24)); +return (i << 24)| + ((i & 0xff00) << 8) | + ((i >>> 8) & 0xff00) | + (i >>> 24); } /** diff -r 2bf84670f079 src/java.base/share/classes/java/lang/Long.java --- a/src/java.base/share/classes/java/lang/Long.java Sat Apr 30 16:08:48 2016 -0700 +++ b/src/java.base/share/classes/java/lang/Long.java Mon May 02 22:09:54 2016 +0200 @@ -1952,10 +1952,8 @@ i = (i & 0xL) << 1 | (i >>> 1) & 0xL; i = (i & 0xL) << 2 | (i >>> 2) & 0xL; i = (i & 0x0f0f0f0f0f0f0f0fL) << 4 | (i >>> 4) & 0x0f0f0f0f0f0f0f0fL; -i = (i & 0x00ff00ff00ff00ffL) << 8 | (i >>> 8) & 0x00ff00ff00ff00ffL; -i = (i << 48) | ((i & 0xL) << 16) | -((i >>> 16) & 0xL) | (i >>> 48); -return i; + +return reverseBytes(i); } /** diff -r 2bf84670f079 test/java/lang/Integer/BitTwiddle.java --- a/test/java/lang/Integer/BitTwiddle.javaSat Apr 30 16:08:48 2016 -0700 +++ b/test/java/lang/Integer/BitTwiddle.javaMon May 02 22:09:54 2016 +0200 @@ -58,6 +58,21 @@ for (int i = 0; i < N; i++) { int x = rnd.nextInt(); + +String expected = new StringBuilder() +.append(leftpad(toBinaryString(x), 32)) +.reverse().toString(); + +String actual = leftpad(toBinaryString(reverse(x)), 32); + +if (!expected.equals(actual)) { +throw new RuntimeException("reverse: \n" + +expected + " \n" + actual); +} +} + +for (int i = 0; i < N; i++) { +int x = rnd.nextInt(); if (highestOneBit(x) != reverse(lowestOneBit(reverse(x throw new RuntimeException("g: " +
Re: RFR: 8155795: Optimize Integer/Long.reverse by using reverseBytes
Am 02.05.2016 um 17:00 schrieb Claes Redestad: The reverseBytes changes were motivated by seeing slightly better performance on the micro with the suggested changes, but after discussing this a bit I think we should revert to the original code alone and investigate if there's some compiler quirk lurking here separately. Maybe (i & 0xFF00) is faster than (i & 0xFF), because the first can be executed by shorter 16-bit CPU op code. Looking at HotSpot disassembler output could solve the miracle. -Ulf
Re: RFR: 8155795: Optimize Integer/Long.reverse by using reverseBytes
On 2016-05-02 16:48, Aleksey Shipilev wrote: On 05/02/2016 05:07 PM, Claes Redestad wrote: I'd like to sponsor this patch proposed by Jaroslav Kameník here: http://mail.openjdk.java.net/pipermail/core-libs-dev/2016-April/040660.html Bug https://bugs.openjdk.java.net/browse/JDK-8155795 Webrev: http://cr.openjdk.java.net/~redestad/8155795/webrev.01/ *) I wonder if Integer.reverseBytes is better spelled as: return (i >>> 24) | ((i & 0xFF) >> 8) | ((i & 0xFF00) << 8) | (i << 24); The reverseBytes changes were motivated by seeing slightly better performance on the micro with the suggested changes, but after discussing this a bit I think we should revert to the original code alone and investigate if there's some compiler quirk lurking here separately. I'll file a bug. *) The test should probably follow the same randomized testing pattern as other tests: for (int i = 0; i < N; i++) { int x = rnd.nextInt(); String expected = new StringBuilder(). .append(leftpad(Integer.toBinaryString(x), 32)) .reverse().toString(); String actual = leftpad(Integer.toBinaryString(Integer.reverse(x)), 32); if (!expected.equals(actual)) { throw new RuntimeException("reverse: \n" + expected + " \n" + actual); } // That's how development looks like in 2016. private static String leftpad(String s, int width) { String r = s; for (int c = 0; c < width - s.length(); c++) { r = "0" + r; } return r; } ...and should probably precede any other test that uses reverse(). Seems reasonable. Jaroslav, do you mind improving the test as per Aleksey's suggestions? Thanks! /Claes
Re: RFR: 8155795: Optimize Integer/Long.reverse by using reverseBytes
On 05/02/2016 05:07 PM, Claes Redestad wrote: > I'd like to sponsor this patch proposed by Jaroslav Kameník here: > http://mail.openjdk.java.net/pipermail/core-libs-dev/2016-April/040660.html > > Bug https://bugs.openjdk.java.net/browse/JDK-8155795 > Webrev: http://cr.openjdk.java.net/~redestad/8155795/webrev.01/ *) I wonder if Integer.reverseBytes is better spelled as: return (i >>> 24) | ((i & 0xFF) >> 8) | ((i & 0xFF00) << 8) | (i << 24); *) The test should probably follow the same randomized testing pattern as other tests: for (int i = 0; i < N; i++) { int x = rnd.nextInt(); String expected = new StringBuilder(). .append(leftpad(Integer.toBinaryString(x), 32)) .reverse().toString(); String actual = leftpad(Integer.toBinaryString(Integer.reverse(x)), 32); if (!expected.equals(actual)) { throw new RuntimeException("reverse: \n" + expected + " \n" + actual); } // That's how development looks like in 2016. private static String leftpad(String s, int width) { String r = s; for (int c = 0; c < width - s.length(); c++) { r = "0" + r; } return r; } ...and should probably precede any other test that uses reverse(). Thanks, -Aleksey
RFR: 8155795: Optimize Integer/Long.reverse by using reverseBytes
Hi, I'd like to sponsor this patch proposed by Jaroslav Kameník here: http://mail.openjdk.java.net/pipermail/core-libs-dev/2016-April/040660.html Bug https://bugs.openjdk.java.net/browse/JDK-8155795 Webrev: http://cr.openjdk.java.net/~redestad/8155795/webrev.01/ The idea is that reusing reverseBytes can bring a speedup since this method is intrinsified on many platforms. Thanks! /Claes