Re: RFR: 8155795: Optimize Integer/Long.reverse by using reverseBytes

2016-05-10 Thread Claes Redestad

Hi,

I think this shows the compiler isn't doing anything really strange; 
it'd be fun to know if it's
the different instruction order or the use of one less constant that 
shows an effect, but it's

not really significant either way.

I think the latest patch is fine as it is and will push it soon.

Thanks!

/Claes

On 2016-05-09 21:50, Jaroslav Kameník wrote:

Hi guys,

I have looked at generated asm, it does not seem there is problem, 
only little different order
of shifts and ands.. Plus, at haswell, times do not differ so much. 
Here are different parts:


Slower version:

057 sallR10, #24
05b movlR11, RAX# spill
05e sallR11, #8
062 movlR8, RAX # spill
065 shrlR8, #8
069 andlR11, #16711680  # int
070 andlR8, #65280  # int
077 shrlRAX, #24

Faster version:

057 shrlR10, #24
05b movlR11, RAX# spill
05e shrlR11, #8
062 movlR8, RAX # spill
065 andlR8, #65280  # int
06c andlR11, #65280 # int
073 sallR8, #8
077 sallRAX, #24

Full code:

   ORIGINAL - 5,169 ± 0,377  ns/op
   ===

000   B1: # N1 <- BLOCK HEAD IS JUNK   Freq: 1
000 # stack bang (96 bytes)
 pushq   rbp # Save rbp
 subqrsp, #16# Create frame

00c movlRAX, RSI# spill
00e andlRAX, #1431655765# int
014 shrlRSI, #1
016 sallRAX, #1
018 andlRSI, #1431655765# int
01e orl RAX, RSI# int
020 movlR10, RAX# spill
023 shrlR10, #2
027 andlRAX, #858993459 # int
02d andlR10, #858993459 # int
034 sallRAX, #2
037 orl RAX, R10# int
03a movlR10, RAX# spill
03d shrlR10, #4
041 andlRAX, #252645135 # int
047 andlR10, #252645135 # int
04e sallRAX, #4
051 orl RAX, R10# int
054 movlR10, RAX# spill
057 shrlR10, #24
05b movlR11, RAX# spill
05e shrlR11, #8
062 movlR8, RAX # spill
065 andlR8, #65280  # int
06c andlR11, #65280 # int
073 sallR8, #8
077 sallRAX, #24
07a orl RAX, R8 # int
07d orl RAX, R11# int
080 orl RAX, R10# int
083 addqrsp, 16 # Destroy frame


   USED INTRINSIC - 4,093 ± 0,100  ns/op
   =

000   B1: # N1 <- BLOCK HEAD IS JUNK   Freq: 1
000 # stack bang (96 bytes)
 pushq   rbp # Save rbp
 subqrsp, #16# Create frame

00c movlRAX, RSI# spill
00e andlRAX, #1431655765# int
014 shrlRSI, #1
016 sallRAX, #1
018 andlRSI, #1431655765# int
01e orl RAX, RSI# int
020 movlR10, RAX# spill
023 shrlR10, #2
027 andlRAX, #858993459 # int
02d andlR10, #858993459 # int
034 sallRAX, #2
037 orl RAX, R10# int
03a movlR10, RAX# spill
03d shrlR10, #4
041 andlRAX, #252645135 # int
047 andlR10, #252645135 # int
04e sallRAX, #4
051 orl RAX, R10# int
054 bswapl  RAX
056 addqrsp, 16 # Destroy frame


   INTRINSICS DISABLED - 5,173 ± 0,096  ns/op
   ==


000   B1: # N1 <- BLOCK HEAD IS JUNK   Freq: 1
000 # stack bang (96 bytes)
 pushq   rbp # Save rbp
 subqrsp, #16# Create frame

00c movlRAX, RSI# spill
00e andlRAX, #1431655765# int
014 shrlRSI, #1
016 sallRAX, #1
018 andlRSI, #1431655765# int
01e orl RAX, RSI# int
020 movlR10, RAX# spill
023 shrlR10, #2
027 andlRAX, #858993459 # int
02d andlR10, #858993459 # int
034 sallRAX, #2
037 orl RAX, R10# int
03a movlR10, RAX# spill
03d shrlR10, #4
041 andlRAX, #252645135 # int
047 andlR10, #252645135 # int
04e sallRAX, #4
051 orl RAX, R10# int
054 movlR10, RAX# spill
057 sallR10, #24
05b movlR11, RAX# spill
05e sallR11, #8
062 movlR8, RAX # spill
065 shrlR8, #8
069 andlR11, #16711680  # int
070 andlR8, #65280  # int
077 shrlRAX, #24
07a orl RAX, R8 # int
07d orl RAX, R11# int
080 orl RAX, R10# int
083 addqrsp, 16 # Destroy frame


   INTRINSICS DISABLED 2 - 5,081 ± 0,092  ns/op
   ===

000   B1: # N1 <- BLOCK HEAD IS JUNK   Freq: 1
000 # stack bang (96 bytes)
 pushq   rbp # Save rbp
 subqrsp, #16# Create frame

00c movlRAX, RSI# spill
00e andlRAX, #1431655765# 

Re: RFR: 8155795: Optimize Integer/Long.reverse by using reverseBytes

2016-05-09 Thread Jaroslav Kameník
Hi guys,

I have looked at generated asm, it does not seem there is problem, only
little different order
of shifts and ands.. Plus, at haswell, times do not differ so much. Here
are different parts:

Slower version:

057 sallR10, #24
05b movlR11, RAX# spill
05e sallR11, #8
062 movlR8, RAX # spill
065 shrlR8, #8
069 andlR11, #16711680  # int
070 andlR8, #65280  # int
077 shrlRAX, #24

Faster version:

057 shrlR10, #24
05b movlR11, RAX# spill
05e shrlR11, #8
062 movlR8, RAX # spill
065 andlR8, #65280  # int
06c andlR11, #65280 # int
073 sallR8, #8
077 sallRAX, #24

Full code:

  ORIGINAL - 5,169 ± 0,377  ns/op
  ===

000   B1: # N1 <- BLOCK HEAD IS JUNK   Freq: 1
000 # stack bang (96 bytes)
pushq   rbp # Save rbp
subqrsp, #16# Create frame

00c movlRAX, RSI# spill
00e andlRAX, #1431655765# int
014 shrlRSI, #1
016 sallRAX, #1
018 andlRSI, #1431655765# int
01e orl RAX, RSI# int
020 movlR10, RAX# spill
023 shrlR10, #2
027 andlRAX, #858993459 # int
02d andlR10, #858993459 # int
034 sallRAX, #2
037 orl RAX, R10# int
03a movlR10, RAX# spill
03d shrlR10, #4
041 andlRAX, #252645135 # int
047 andlR10, #252645135 # int
04e sallRAX, #4
051 orl RAX, R10# int
054 movlR10, RAX# spill
057 shrlR10, #24
05b movlR11, RAX# spill
05e shrlR11, #8
062 movlR8, RAX # spill
065 andlR8, #65280  # int
06c andlR11, #65280 # int
073 sallR8, #8
077 sallRAX, #24
07a orl RAX, R8 # int
07d orl RAX, R11# int
080 orl RAX, R10# int
083 addqrsp, 16 # Destroy frame


  USED INTRINSIC - 4,093 ± 0,100  ns/op
  =

000   B1: # N1 <- BLOCK HEAD IS JUNK   Freq: 1
000 # stack bang (96 bytes)
pushq   rbp # Save rbp
subqrsp, #16# Create frame

00c movlRAX, RSI# spill
00e andlRAX, #1431655765# int
014 shrlRSI, #1
016 sallRAX, #1
018 andlRSI, #1431655765# int
01e orl RAX, RSI# int
020 movlR10, RAX# spill
023 shrlR10, #2
027 andlRAX, #858993459 # int
02d andlR10, #858993459 # int
034 sallRAX, #2
037 orl RAX, R10# int
03a movlR10, RAX# spill
03d shrlR10, #4
041 andlRAX, #252645135 # int
047 andlR10, #252645135 # int
04e sallRAX, #4
051 orl RAX, R10# int
054 bswapl  RAX
056 addqrsp, 16 # Destroy frame


  INTRINSICS DISABLED - 5,173 ± 0,096  ns/op
  ==


000   B1: # N1 <- BLOCK HEAD IS JUNK   Freq: 1
000 # stack bang (96 bytes)
pushq   rbp # Save rbp
subqrsp, #16# Create frame

00c movlRAX, RSI# spill
00e andlRAX, #1431655765# int
014 shrlRSI, #1
016 sallRAX, #1
018 andlRSI, #1431655765# int
01e orl RAX, RSI# int
020 movlR10, RAX# spill
023 shrlR10, #2
027 andlRAX, #858993459 # int
02d andlR10, #858993459 # int
034 sallRAX, #2
037 orl RAX, R10# int
03a movlR10, RAX# spill
03d shrlR10, #4
041 andlRAX, #252645135 # int
047 andlR10, #252645135 # int
04e sallRAX, #4
051 orl RAX, R10# int
054 movlR10, RAX# spill
057 sallR10, #24
05b movlR11, RAX# spill
05e sallR11, #8
062 movlR8, RAX # spill
065 shrlR8, #8
069 andlR11, #16711680  # int
070 andlR8, #65280  # int
077 shrlRAX, #24
07a orl RAX, R8 # int
07d orl RAX, R11# int
080 orl RAX, R10# int
083 addqrsp, 16 # Destroy frame


  INTRINSICS DISABLED 2 - 5,081 ± 0,092  ns/op
  ===

000   B1: # N1 <- BLOCK HEAD IS JUNK   Freq: 1
000 # stack bang (96 bytes)
pushq   rbp # Save rbp
subqrsp, #16# Create frame

00c movlRAX, RSI# spill
00e andlRAX, #1431655765# int
014 shrlRSI, #1
016 sallRAX, #1
018 andlRSI, #1431655765# int
01e orl RAX, RSI# int
020 movlR10, RAX# spill
023 shrlR10, #2
027 andlRAX, #858993459 # int
02d andlR10, #858993459 # int
034 sallRAX, #2
037 orl RAX, R10# int
03a movlR10, RAX# spill
03d

Re: RFR: 8155795: Optimize Integer/Long.reverse by using reverseBytes

2016-05-02 Thread Jaroslav Kameník
Thanks for comments. I have changed tests as Aleksey suggested.

Btw. is not there some class used for utility methods as leftpad?

Ad performance of changed reverseBytes, I tried to compare
both reverseBytes alone, and it seems to be same, but differs
when inlined to reverse().

J.

2016-05-02 17:00 GMT+02:00 Claes Redestad :

> On 2016-05-02 16:48, Aleksey Shipilev wrote:
>
>> On 05/02/2016 05:07 PM, Claes Redestad wrote:
>>
>>> I'd like to sponsor this patch proposed by Jaroslav Kameník here:
>>>
>>> http://mail.openjdk.java.net/pipermail/core-libs-dev/2016-April/040660.html
>>>
>>> Bug https://bugs.openjdk.java.net/browse/JDK-8155795
>>> Webrev: http://cr.openjdk.java.net/~redestad/8155795/webrev.01/
>>>
>> *) I wonder if Integer.reverseBytes is better spelled as:
>>
>>   return (i >>> 24) |
>>  ((i & 0xFF) >> 8) |
>>  ((i & 0xFF00) << 8)   |
>>  (i << 24);
>>
>
> The reverseBytes changes were motivated by seeing slightly better
> performance on the micro with the suggested changes, but after
> discussing this a bit I think we should revert to the original code alone
> and investigate if there's some compiler quirk lurking here separately.
>
> I'll file a bug.
>
>
>> *) The test should probably follow the same randomized testing pattern
>> as other tests:
>>
>>   for (int i = 0; i < N; i++) {
>> int x = rnd.nextInt();
>>
>> String expected = new StringBuilder().
>> .append(leftpad(Integer.toBinaryString(x), 32))
>> .reverse().toString();
>>
>> String actual   =
>> leftpad(Integer.toBinaryString(Integer.reverse(x)), 32);
>>
>> if (!expected.equals(actual)) {
>> throw new RuntimeException("reverse: \n" +
>>expected + " \n" + actual);
>> }
>>
>>   // That's how development looks like in 2016.
>>   private static String leftpad(String s, int width) {
>> String r = s;
>> for (int c = 0; c < width - s.length(); c++) {
>>   r = "0" + r;
>> }
>> return r;
>>   }
>>
>> ...and should probably precede any other test that uses reverse().
>>
>
> Seems reasonable.
>
> Jaroslav, do you mind improving the test as per Aleksey's
> suggestions?
>
> Thanks!
>
> /Claes
>
diff -r 2bf84670f079 src/java.base/share/classes/java/lang/Integer.java
--- a/src/java.base/share/classes/java/lang/Integer.javaSat Apr 30 
16:08:48 2016 -0700
+++ b/src/java.base/share/classes/java/lang/Integer.javaMon May 02 
22:09:54 2016 +0200
@@ -1790,9 +1790,8 @@
 i = (i & 0x) << 1 | (i >>> 1) & 0x;
 i = (i & 0x) << 2 | (i >>> 2) & 0x;
 i = (i & 0x0f0f0f0f) << 4 | (i >>> 4) & 0x0f0f0f0f;
-i = (i << 24) | ((i & 0xff00) << 8) |
-((i >>> 8) & 0xff00) | (i >>> 24);
-return i;
+
+return reverseBytes(i);
 }
 
 /**
@@ -1820,10 +1819,10 @@
  */
 @HotSpotIntrinsicCandidate
 public static int reverseBytes(int i) {
-return ((i >>> 24)   ) |
-   ((i >>   8) &   0xFF00) |
-   ((i <<   8) & 0xFF) |
-   ((i << 24));
+return (i << 24)|
+   ((i & 0xff00) << 8)  |
+   ((i >>> 8) & 0xff00) |
+   (i >>> 24);
 }
 
 /**
diff -r 2bf84670f079 src/java.base/share/classes/java/lang/Long.java
--- a/src/java.base/share/classes/java/lang/Long.java   Sat Apr 30 16:08:48 
2016 -0700
+++ b/src/java.base/share/classes/java/lang/Long.java   Mon May 02 22:09:54 
2016 +0200
@@ -1952,10 +1952,8 @@
 i = (i & 0xL) << 1 | (i >>> 1) & 0xL;
 i = (i & 0xL) << 2 | (i >>> 2) & 0xL;
 i = (i & 0x0f0f0f0f0f0f0f0fL) << 4 | (i >>> 4) & 0x0f0f0f0f0f0f0f0fL;
-i = (i & 0x00ff00ff00ff00ffL) << 8 | (i >>> 8) & 0x00ff00ff00ff00ffL;
-i = (i << 48) | ((i & 0xL) << 16) |
-((i >>> 16) & 0xL) | (i >>> 48);
-return i;
+
+return reverseBytes(i);
 }
 
 /**
diff -r 2bf84670f079 test/java/lang/Integer/BitTwiddle.java
--- a/test/java/lang/Integer/BitTwiddle.javaSat Apr 30 16:08:48 2016 -0700
+++ b/test/java/lang/Integer/BitTwiddle.javaMon May 02 22:09:54 2016 +0200
@@ -58,6 +58,21 @@
 
 for (int i = 0; i < N; i++) {
 int x = rnd.nextInt();
+
+String expected = new StringBuilder()
+.append(leftpad(toBinaryString(x), 32))
+.reverse().toString();
+
+String actual = leftpad(toBinaryString(reverse(x)), 32);
+
+if (!expected.equals(actual)) {
+throw new RuntimeException("reverse: \n" +
+expected + " \n" + actual);
+}
+}
+
+for (int i = 0; i < N; i++) {
+int x = rnd.nextInt();
 if (highestOneBit(x) != reverse(lowestOneBit(reverse(x
 throw new RuntimeException("g: " + 

Re: RFR: 8155795: Optimize Integer/Long.reverse by using reverseBytes

2016-05-02 Thread Ulf Zibis

Am 02.05.2016 um 17:00 schrieb Claes Redestad:

The reverseBytes changes were motivated by seeing slightly better
performance on the micro with the suggested changes, but after
discussing this a bit I think we should revert to the original code alone
and investigate if there's some compiler quirk lurking here separately.
Maybe (i & 0xFF00) is faster than (i & 0xFF), because the first can be executed by shorter 
16-bit CPU op code.

Looking at HotSpot disassembler output could solve the miracle.

-Ulf



Re: RFR: 8155795: Optimize Integer/Long.reverse by using reverseBytes

2016-05-02 Thread Claes Redestad

On 2016-05-02 16:48, Aleksey Shipilev wrote:

On 05/02/2016 05:07 PM, Claes Redestad wrote:

I'd like to sponsor this patch proposed by Jaroslav Kameník here:
http://mail.openjdk.java.net/pipermail/core-libs-dev/2016-April/040660.html

Bug https://bugs.openjdk.java.net/browse/JDK-8155795
Webrev: http://cr.openjdk.java.net/~redestad/8155795/webrev.01/

*) I wonder if Integer.reverseBytes is better spelled as:

  return (i >>> 24) |
 ((i & 0xFF) >> 8) |
 ((i & 0xFF00) << 8)   |
 (i << 24);


The reverseBytes changes were motivated by seeing slightly better
performance on the micro with the suggested changes, but after
discussing this a bit I think we should revert to the original code alone
and investigate if there's some compiler quirk lurking here separately.

I'll file a bug.



*) The test should probably follow the same randomized testing pattern
as other tests:

  for (int i = 0; i < N; i++) {
int x = rnd.nextInt();

String expected = new StringBuilder().
.append(leftpad(Integer.toBinaryString(x), 32))
.reverse().toString();

String actual   =
leftpad(Integer.toBinaryString(Integer.reverse(x)), 32);

if (!expected.equals(actual)) {
throw new RuntimeException("reverse: \n" +
   expected + " \n" + actual);
}

  // That's how development looks like in 2016.
  private static String leftpad(String s, int width) {
String r = s;
for (int c = 0; c < width - s.length(); c++) {
  r = "0" + r;
}
return r;
  }

...and should probably precede any other test that uses reverse().


Seems reasonable.

Jaroslav, do you mind improving the test as per Aleksey's
suggestions?

Thanks!

/Claes


Re: RFR: 8155795: Optimize Integer/Long.reverse by using reverseBytes

2016-05-02 Thread Aleksey Shipilev
On 05/02/2016 05:07 PM, Claes Redestad wrote:
> I'd like to sponsor this patch proposed by Jaroslav Kameník here:
> http://mail.openjdk.java.net/pipermail/core-libs-dev/2016-April/040660.html
> 
> Bug https://bugs.openjdk.java.net/browse/JDK-8155795
> Webrev: http://cr.openjdk.java.net/~redestad/8155795/webrev.01/

*) I wonder if Integer.reverseBytes is better spelled as:

 return (i >>> 24) |
((i & 0xFF) >> 8) |
((i & 0xFF00) << 8)   |
(i << 24);

*) The test should probably follow the same randomized testing pattern
as other tests:

 for (int i = 0; i < N; i++) {
   int x = rnd.nextInt();

   String expected = new StringBuilder().
   .append(leftpad(Integer.toBinaryString(x), 32))
   .reverse().toString();

   String actual   =
   leftpad(Integer.toBinaryString(Integer.reverse(x)), 32);

   if (!expected.equals(actual)) {
   throw new RuntimeException("reverse: \n" +
  expected + " \n" + actual);
   }

 // That's how development looks like in 2016.
 private static String leftpad(String s, int width) {
   String r = s;
   for (int c = 0; c < width - s.length(); c++) {
 r = "0" + r;
   }
   return r;
 }

...and should probably precede any other test that uses reverse().

Thanks,
-Aleksey



RFR: 8155795: Optimize Integer/Long.reverse by using reverseBytes

2016-05-02 Thread Claes Redestad

Hi,

I'd like to sponsor this patch proposed by Jaroslav Kameník here: 
http://mail.openjdk.java.net/pipermail/core-libs-dev/2016-April/040660.html


Bug https://bugs.openjdk.java.net/browse/JDK-8155795
Webrev: http://cr.openjdk.java.net/~redestad/8155795/webrev.01/

The idea is that reusing reverseBytes can bring a speedup since this 
method is intrinsified on many platforms.


Thanks!

/Claes