[jira] [Comment Edited] (LUCENE-4062) More fine-grained control over the packed integer implementation that is chosen

2012-07-01 Thread Toke Eskildsen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13404629#comment-13404629
 ] 

Toke Eskildsen edited comment on LUCENE-4062 at 7/1/12 11:06 AM:
-

Another angle on the auto-selection problem: Packed64SingleBlock supports 1, 2, 
3, 4, 5, 6, 7, 8, 9, 10, 12, 16, 21 and 32 as bpv's. The Direct classes handle 8, 
16 & 32 bpv and, with the inheritance trick, Packed64 handles 1, 2 & 4 exactly 
as Packed64SingleBlock. This leaves us with 3, 5, 6, 7, 9, 10, 12 and 21 bpv as 
unique implementations. In the hope of getting a better overview, I tried 
making bar charts with those bpv's: http://ekot.dk/misc/packedints/padding.html
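The per-block waste behind those charts is easy to compute: with padding, each 64-bit block holds floor(64/bpv) values and the leftover bits are lost. A minimal sketch (my own illustration of the arithmetic, not code from the patch; the class name is made up):

```java
// Sketch: padding overhead per 64-bit block for the bpv's that only the
// padded (Packed64SingleBlock-style) layout covers: 3, 5, 6, 7, 9, 10, 12, 21.
class PaddingWaste {
    // Values that fit in one 64-bit block without crossing its boundary.
    static int valuesPerBlock(int bpv) {
        return 64 / bpv;
    }

    // Bits left unused at the top of each block.
    static int wastedBits(int bpv) {
        return 64 - valuesPerBlock(bpv) * bpv;
    }

    public static void main(String[] args) {
        for (int bpv : new int[] {3, 5, 6, 7, 9, 10, 12, 21}) {
            System.out.println(bpv + " bpv: " + valuesPerBlock(bpv)
                + " values/block, " + wastedBits(bpv) + " wasted bits");
        }
    }
}
```

For instance, 21 bpv packs 3 values per block and wastes only 1 bit (~1.6%), while 10 and 12 bpv each waste 4 bits per block.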

Again, it would be illuminating to get measurements from an older as well as a 
newer AMD machine. I also hope to get the time to run the test on another 
i7 machine to see if the trend from te-prime and mars can be confirmed, but I 
am leaving for vacation on Tuesday. As of now, the most fitting implementation 
depends on the machine, and since the choice of implementation affects the 
persistence format, I find this problematic. It is not optimal for a 
heterogeneous distributed setup.

  was (Author: toke):
In the hope of getting a better overview, I tried making bar-charts with 
measurements only for the supported bpv's for Packed64SingleBlock: 
http://ekot.dk/misc/packedints/padding.html
  
 More fine-grained control over the packed integer implementation that is 
 chosen
 ---

 Key: LUCENE-4062
 URL: https://issues.apache.org/jira/browse/LUCENE-4062
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/other
Reporter: Adrien Grand
Assignee: Adrien Grand
Priority: Minor
  Labels: performance
 Fix For: 4.0, 5.0

 Attachments: LUCENE-4062-2.patch, LUCENE-4062.patch, 
 LUCENE-4062.patch, LUCENE-4062.patch, LUCENE-4062.patch, LUCENE-4062.patch, 
 LUCENE-4062.patch, LUCENE-4062.patch, Packed64SingleBlock.java, 
 Packed64Strategy.java, Packed64calc.java, PackedIntsBenchmark.java, 
 PackedIntsBenchmark.java, PackedIntsBenchmark.java, 
 measurements_te_graphs.pdf, measurements_te_i7.txt, measurements_te_p4.txt, 
 measurements_te_xeon.txt


 In order to save space, Lucene has two main PackedInts.Mutable implementations, 
 one that is very fast and is based on a byte/short/integer/long array 
 (Direct*) and another one which packs bits in a memory-efficient manner 
 (Packed*).
 The packed implementation tends to be much slower than the direct one, which 
 discourages some Lucene components from using it. On the other hand, if you 
 store 21-bit integers in a Direct32, this is a space loss of (32-21)/32 ≈ 34%.
 If you are willing to trade some space for speed, you could store three of 
 these 21-bit integers in a long, resulting in an overhead of 1/3 bit per value. 
 One advantage of this approach is that you never need to read more than one 
 block to read or write a value, so this can be significantly faster than 
 Packed32 and Packed64, which always need to read/write two blocks in order to 
 avoid costly branches.
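The single-block layout described above can be sketched as follows for 21-bit values. This is my own illustration of the idea, not the patch's Packed64SingleBlock code; the class name is made up:

```java
// Sketch: 3 x 21-bit values per 64-bit block, 1 padding bit per block.
// A value never crosses a block boundary, so get/set touch exactly one long.
class SingleBlock21 {
    static final int BPV = 21;
    static final int VALUES_PER_BLOCK = 64 / BPV; // 3
    static final long MASK = (1L << BPV) - 1;

    final long[] blocks;

    SingleBlock21(int valueCount) {
        blocks = new long[(valueCount + VALUES_PER_BLOCK - 1) / VALUES_PER_BLOCK];
    }

    long get(int index) {
        int shift = (index % VALUES_PER_BLOCK) * BPV;
        return (blocks[index / VALUES_PER_BLOCK] >>> shift) & MASK;
    }

    void set(int index, long value) {
        int block = index / VALUES_PER_BLOCK;
        int shift = (index % VALUES_PER_BLOCK) * BPV;
        blocks[block] = (blocks[block] & ~(MASK << shift)) | ((value & MASK) << shift);
    }

    public static void main(String[] args) {
        SingleBlock21 s = new SingleBlock21(6);
        s.set(0, 123456L);
        s.set(4, 99L);
        System.out.println(s.get(0) + " " + s.get(4));
    }
}
```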
 I ran some tests, and for 1000 21-bit values, this implementation takes 
 less than 2% more space and has 44% faster writes and 30% faster reads. The 
 12-bit version (5 values per block) has the same performance improvement and 
 a 6% memory overhead compared to the packed implementation.
 In order to select the best implementation for a given integer size, I wrote 
 the {{PackedInts.getMutable(valueCount, bitsPerValue, 
 acceptableOverheadPerValue)}} method. This method selects the fastest 
 implementation that has less than {{acceptableOverheadPerValue}} wasted bits 
 per value. For example, if you accept an overhead of 20% 
 ({{acceptableOverheadPerValue = 0.2f * bitsPerValue}}), which is pretty 
 reasonable, here are the implementations that would be selected:
  * 1: Packed64SingleBlock1
  * 2: Packed64SingleBlock2
  * 3: Packed64SingleBlock3
  * 4: Packed64SingleBlock4
  * 5: Packed64SingleBlock5
  * 6: Packed64SingleBlock6
  * 7: Direct8
  * 8: Direct8
  * 9: Packed64SingleBlock9
  * 10: Packed64SingleBlock10
  * 11: Packed64SingleBlock12
  * 12: Packed64SingleBlock12
  * 13: Packed64
  * 14: Direct16
  * 15: Direct16
  * 16: Direct16
  * 17: Packed64
  * 18: Packed64SingleBlock21
  * 19: Packed64SingleBlock21
  * 20: Packed64SingleBlock21
  * 21: Packed64SingleBlock21
  * 22: Packed64
  * 23: Packed64
  * 24: Packed64
  * 25: Packed64
  * 26: Packed64
  * 27: Direct32
  * 28: Direct32
  * 29: Direct32
  * 30: Direct32
  * 31: Direct32
  * 32: Direct32
  * 33: Packed64
  * 34: Packed64
  * 35: Packed64
  * 36: Packed64
  * 37: Packed64
  * 38: Packed64
  * 39: Packed64
  * 40: 
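As a reading aid, the table above is consistent with a simple rule: take a Direct* size if one fits within the acceptable overhead, otherwise a padded single-block size, and fall back to Packed64. The sketch below is a hypothetical reconstruction of that rule (not the actual getMutable code; the class name is made up):

```java
// Hypothetical reconstruction of the selection rule illustrated by the
// table above; the real logic lives in PackedInts.getMutable.
class ImplChooser {
    static final int[] DIRECT = {8, 16, 32, 64};
    static final int[] SINGLE_BLOCK = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 16, 21, 32};

    static String choose(int bitsPerValue, float acceptableOverheadPerValue) {
        float maxBits = bitsPerValue + acceptableOverheadPerValue;
        for (int d : DIRECT) {            // fastest: plain array access
            if (d >= bitsPerValue && d <= maxBits) return "Direct" + d;
        }
        for (int s : SINGLE_BLOCK) {      // next: padded single-block
            if (s >= bitsPerValue && s <= maxBits) return "Packed64SingleBlock" + s;
        }
        return "Packed64";                // fallback: fully packed
    }

    public static void main(String[] args) {
        for (int bpv = 1; bpv <= 32; bpv++) {
            System.out.println(bpv + ": " + choose(bpv, 0.2f * bpv));
        }
    }
}
```

With a 20% overhead budget this reproduces the table, e.g. 7 bpv fits in Direct8 (8 ≤ 8.4) while 13 bpv falls through to Packed64 (16 > 15.6).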

[jira] [Comment Edited] (LUCENE-4062) More fine-grained control over the packed integer implementation that is chosen

2012-06-29 Thread Toke Eskildsen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13404234#comment-13404234
 ] 

Toke Eskildsen edited comment on LUCENE-4062 at 6/29/12 10:03 PM:
--

+1 for replacing the Packed64SingleBlock with the optimized version. It is 
consistently better.

I have updated the charts at http://ekot.dk/misc/packedints/ with i7 
measurements. For the three non-i7-based Intel processors, Packed64SingleBlock 
seems clear-cut for the Best possible performance without going all-out with 
Direct.

I tried running two tests in parallel on te-prime (see 
http://ekot.dk/misc/packedints/te-prime_parallel.html) and got very consistent 
results:
  * For set, contiguous strategy is faster than or equal to padded for bpvs 
3-12 and 16-21 (with bpv 1 and 2 being hard to measure). Since padded only 
supports bpv 1-10, 12, 16, 21 & 32 with non-conforming bpvs rounded up, this 
effectively means that there is zero gain in using padded over contiguous with 
regard to set on that machine.
  * For get, the picture is nearly the same, except for bpv 17-21 where 
contiguous is slower than padded (observed in process 2 as well as the 
single-thread run). The difference is less than 10% though. The same pattern, 
although noisier, can be seen on mars.
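For reference, the core difference between the two strategies being measured: in the contiguous layout a value may straddle two longs, so get sometimes combines two block reads, whereas the padded layout always reads exactly one. A rough sketch assuming LSB-first packing (my own illustration, not Packed64Strategy itself; the class name is made up):

```java
// Sketch: contiguous packing where a value may straddle two 64-bit blocks.
class ContiguousPacked {
    final long[] blocks;
    final int bpv;
    final long mask;

    ContiguousPacked(int valueCount, int bpv) {
        this.bpv = bpv;
        this.mask = (1L << bpv) - 1;
        this.blocks = new long[(valueCount * bpv + 63) / 64];
    }

    long get(int index) {
        int bitPos = index * bpv;
        int block = bitPos >>> 6;   // bitPos / 64
        int offset = bitPos & 63;   // bitPos % 64
        if (offset + bpv <= 64) {
            return (blocks[block] >>> offset) & mask; // one block read
        }
        // value straddles the boundary: combine two block reads
        return ((blocks[block] >>> offset) | (blocks[block + 1] << (64 - offset))) & mask;
    }

    void set(int index, long value) {
        int bitPos = index * bpv;
        int block = bitPos >>> 6;
        int offset = bitPos & 63;
        blocks[block] = (blocks[block] & ~(mask << offset)) | ((value & mask) << offset);
        if (offset + bpv > 64) {    // spill the high bits into the next block
            int bitsInFirst = 64 - offset;
            long hiMask = mask >>> bitsInFirst;
            blocks[block + 1] = (blocks[block + 1] & ~hiMask) | ((value & mask) >>> bitsInFirst);
        }
    }

    public static void main(String[] args) {
        ContiguousPacked p = new ContiguousPacked(10, 21);
        p.set(3, 12345L); // bit position 63: straddles the first block boundary
        System.out.println(p.get(3));
    }
}
```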

My preliminary conclusion for i7-processors is thus that Packed64Strategy is 
the right choice for Best possible performance without going all-out with 
Direct. I am getting very curious about the untested AMD architecture now.

A by-product of the te-prime parallel test is that the amount of cache seems to 
matter little when it comes to selecting the most fitting implementation. Thank 
$deity for small favors.

  was (Author: toke):
+1 for replacing the Packed64SingleBlock with the optimized version. It is 
consistently better.

I have updated the charts at http://ekot.dk/misc/packedints/ with i7 
measurements. For the three non-i7-based Intel processors, Packed64SingleBlock 
seems clear-cut for the Best possible performance without going all-out with 
Direct.

I tried running two tests in parallel on te-prime (see 
http://ekot.dk/misc/packedints/te-prime_parallel.html) and got very consistent 
results:
  * For set, contiguous strategy is faster than or equal to padded for bpvs 
3-12 and 16-21 (with bpv 1 and 2 being hard to measure). Since padded only 
supports bpv 1-10, 12, 16, 21 & 32 with non-conforming bpvs rounded up, this 
effectively means that there is zero gain in using padded over contiguous with 
regard to set on that machine.
  * For get, the picture is the same, except for process 2, where 17-21 was 
slower for contiguous than for padded. Sigh. I should of course have started 3 
processes so voting would have been easier. The same pattern, although noisier, 
can be seen on mars.

My preliminary conclusion for i7-processors is thus that Packed64Strategy is 
the right choice for Best possible performance without going all-out with 
Direct. I am getting very curious about the untested AMD architecture now.

A by-product of the te-prime parallel test is that the amount of cache seems to 
matter little when it comes to selecting the most fitting implementation. Thank 
$deity for small favors.
  

[jira] [Comment Edited] (LUCENE-4062) More fine-grained control over the packed integer implementation that is chosen

2012-06-27 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13402201#comment-13402201
 ] 

Adrien Grand edited comment on LUCENE-4062 at 6/27/12 12:58 PM:


Thanks for sharing your results. Here are mine: 
http://people.apache.org/~jpountz/packed_ints_calc.html (E5500 @ 2.80GHz, java 
1.7.0_02, hotspot build 22.0-b10). Funny to see those little bumps when the 
number of bits per value is 8, 16, 32 or 64 (24 as well, although it is 
smaller)!

Whether this impl is faster or slower than the single-block impl (or even the 
3-blocks impl; I am happily surprised by the read throughput on the Pentium 4 
machine) seems to depend on the hardware. However, this new impl seems to be 
consistently better than the current Packed64 class, so I think we should 
replace it with your new impl. What do you think? Can you write a patch?

  was (Author: jpountz):
Thanks for sharing your results. Here are mines: 
http://people.apache.org/~jpountz/packed_ints_calc.html (E5500 @ 2.80GHz, java 
1.7.0_02, hotspot build 22.0-b10). Funny to see those little bumps when the 
number of bits per value is 8, 16, 32 or 64 (24 as well, although it is 
smaller)!

It is not clear whether this impl is faster or slower than the single-block 
impl (or even the 3 blocks impl, I am happily surprised by the read throughput 
on the intel 4 machine) depending on the hardware. However, this new impl seems 
to be consistently better than the actual Packed64 class so I think we should 
replace it with your new impl. What do you think? Can you write a patch?
  