[jira] [Comment Edited] (LUCENE-4062) More fine-grained control over the packed integer implementation that is chosen
[ https://issues.apache.org/jira/browse/LUCENE-4062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13404629#comment-13404629 ]

Toke Eskildsen edited comment on LUCENE-4062 at 7/1/12 11:06 AM:
-----------------------------------------------------------------

Another angle on the auto-selection problem: Packed64SingleBlock supports 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 16, 21 and 32 as bpv's. The Direct classes handle 8, 16 and 32 bpv, and with the inheritance trick, Packed64 handles 1, 2 and 4 exactly as Packed64SingleBlock does. This leaves us with 3, 5, 6, 7, 9, 10, 12 and 21 bpv as unique implementations.

In the hope of getting a better overview, I tried making bar charts for those bpv's: http://ekot.dk/misc/packedints/padding.html

Again, it would be illuminating to get measurements from an older as well as a newer AMD machine. I also hope to find the time to run the test on another i7 machine to see whether the trend from te-prime and mars can be confirmed, but I am leaving for vacation on Tuesday.

As of now, the most fitting implementation depends on the machine, and since the choice of implementation affects the persistence format, I find this problematic. It is not optimal for a heterogeneous distributed setup.
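The per-block padding those bar charts measure can be sketched as follows. This is an illustrative helper, not Lucene code, and the class/method names are made up: a Packed64SingleBlock-style layout fits floor(64/bpv) values into one 64-bit block and wastes the remaining bits as padding.

```java
// Sketch of the padding a single-block (padded) layout pays per 64-bit block.
// Hypothetical helper names; the actual Lucene classes are structured differently.
public class SingleBlockPadding {
    // How many values fit in one 64-bit block at the given bits per value.
    static int valuesPerBlock(int bitsPerValue) {
        return 64 / bitsPerValue;
    }

    // Unused (padding) bits left over in each 64-bit block.
    static int paddingBits(int bitsPerValue) {
        return 64 - valuesPerBlock(bitsPerValue) * bitsPerValue;
    }

    public static void main(String[] args) {
        // The bpv's supported by Packed64SingleBlock, per the comment above.
        int[] supported = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 16, 21, 32};
        for (int bpv : supported) {
            System.out.printf("bpv=%2d values/block=%2d padding=%d bits%n",
                bpv, valuesPerBlock(bpv), paddingBits(bpv));
        }
    }
}
```

For bpv=21 this gives 3 values per block with 1 padding bit, matching the 1/3 bit per value overhead discussed in this issue.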
More fine-grained control over the packed integer implementation that is chosen
-------------------------------------------------------------------------------

                Key: LUCENE-4062
                URL: https://issues.apache.org/jira/browse/LUCENE-4062
            Project: Lucene - Java
         Issue Type: Improvement
         Components: core/other
           Reporter: Adrien Grand
           Assignee: Adrien Grand
           Priority: Minor
             Labels: performance
            Fix For: 4.0, 5.0
        Attachments: LUCENE-4062-2.patch, LUCENE-4062.patch, LUCENE-4062.patch, LUCENE-4062.patch, LUCENE-4062.patch, LUCENE-4062.patch, LUCENE-4062.patch, LUCENE-4062.patch, Packed64SingleBlock.java, Packed64Strategy.java, Packed64calc.java, PackedIntsBenchmark.java, PackedIntsBenchmark.java, PackedIntsBenchmark.java, measurements_te_graphs.pdf, measurements_te_i7.txt, measurements_te_p4.txt, measurements_te_xeon.txt

In order to save space, Lucene has two main PackedInts.Mutable implementations: one that is very fast and is based on a byte/short/int/long array (Direct*), and another one which packs bits in a memory-efficient manner (Packed*). The packed implementation tends to be much slower than the direct one, which discourages some Lucene components from using it. On the other hand, if you store 21-bit integers in a Direct32, this is a space loss of (32-21)/32 ≈ 34%.

If you accept to trade some space for speed, you could store 3 of these 21-bit integers in a long, resulting in an overhead of 1/3 bit per value. One advantage of this approach is that you never need to read more than one block to read or write a value, so this can be significantly faster than Packed32 and Packed64, which always need to read/write two blocks in order to avoid costly branches. I ran some tests, and for 1000 21-bit values, this implementation takes less than 2% more space and has 44% faster writes and 30% faster reads. The 12-bit version (5 values per block) has the same performance improvement and a 6% memory overhead compared to the packed implementation.

In order to select the best implementation for a given integer size, I wrote the {{PackedInts.getMutable(valueCount, bitsPerValue, acceptableOverheadPerValue)}} method. This method selects the fastest implementation that has less than {{acceptableOverheadPerValue}} wasted bits per value. For example, if you accept an overhead of 20% ({{acceptableOverheadPerValue = 0.2f * bitsPerValue}}), which is pretty reasonable, here is what implementations would be selected:

* 1: Packed64SingleBlock1
* 2: Packed64SingleBlock2
* 3: Packed64SingleBlock3
* 4: Packed64SingleBlock4
* 5: Packed64SingleBlock5
* 6: Packed64SingleBlock6
* 7: Direct8
* 8: Direct8
* 9: Packed64SingleBlock9
* 10: Packed64SingleBlock10
* 11: Packed64SingleBlock12
* 12: Packed64SingleBlock12
* 13: Packed64
* 14: Direct16
* 15: Direct16
* 16: Direct16
* 17: Packed64
* 18: Packed64SingleBlock21
* 19: Packed64SingleBlock21
* 20: Packed64SingleBlock21
* 21: Packed64SingleBlock21
* 22: Packed64
* 23: Packed64
* 24: Packed64
* 25: Packed64
* 26: Packed64
* 27: Direct32
* 28: Direct32
* 29: Direct32
* 30: Direct32
* 31: Direct32
* 32: Direct32
* 33: Packed64
* 34: Packed64
* 35: Packed64
* 36: Packed64
* 37: Packed64
* 38: Packed64
* 39: Packed64
* 40:
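The selection rule behind the list above can be sketched as follows. This is an illustrative reimplementation of the idea, not the actual PackedInts code: try the Direct widths first (fastest), then the Packed64SingleBlock widths, and fall back to the fully packed Packed64. A candidate width is accepted when rounding bitsPerValue up to it wastes no more bits per value than the budget allows (per-block padding is ignored in this sketch).

```java
// Hedged sketch of overhead-driven implementation selection, in the spirit of
// PackedInts.getMutable(valueCount, bitsPerValue, acceptableOverheadPerValue).
// The class name and structure are illustrative, not Lucene's actual code.
public class OverheadSelection {
    // Widths backed by plain byte/short/int/long arrays (fastest).
    static final int[] DIRECT = {8, 16, 32, 64};
    // Widths supported by Packed64SingleBlock, per this issue.
    static final int[] SINGLE_BLOCK = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 16, 21, 32};

    static String select(int bitsPerValue, float acceptableOverheadPerValue) {
        // Prefer Direct*: rounding up to the array width must stay in budget.
        for (int w : DIRECT) {
            if (w >= bitsPerValue && w - bitsPerValue <= acceptableOverheadPerValue) {
                return "Direct" + w;
            }
        }
        // Then single-block: round up to the nearest supported width.
        for (int w : SINGLE_BLOCK) {
            if (w >= bitsPerValue && w - bitsPerValue <= acceptableOverheadPerValue) {
                return "Packed64SingleBlock" + w;
            }
        }
        // Fall back to the fully packed layout: no rounding overhead at all.
        return "Packed64";
    }

    public static void main(String[] args) {
        for (int bpv = 1; bpv <= 32; bpv++) {
            System.out.println(bpv + ": " + select(bpv, 0.2f * bpv));
        }
    }
}
```

With a 20% budget this sketch reproduces the table above, e.g. bpv 7 rounds up to Direct8 (1 wasted bit ≤ 1.4), bpv 11 to Packed64SingleBlock12, and bpv 13 falls through to Packed64 because 16-13=3 exceeds the 2.6-bit budget.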
[ https://issues.apache.org/jira/browse/LUCENE-4062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13404234#comment-13404234 ]

Toke Eskildsen edited comment on LUCENE-4062 at 6/29/12 10:03 PM:
------------------------------------------------------------------

+1 for replacing the Packed64SingleBlock with the optimized version. It is consistently better.

I have updated the charts at http://ekot.dk/misc/packedints/ with i7 measurements. For the three non-i7-based Intel processors, Packed64SingleBlock seems the clear-cut choice for the best possible performance without going all-out with Direct.

I tried running two tests in parallel on te-prime (see http://ekot.dk/misc/packedints/te-prime_parallel.html) and got very consistent results:

* For set, the contiguous strategy is faster than or equal to the padded one for bpv's 3-12 and 16-21 (with bpv 1 and 2 being hard to measure). Since padded only supports bpv's 1-10, 12, 16, 21 and 32, with non-conforming bpv's rounded up, this effectively means that there is zero gain in using padded over contiguous with regard to set on that machine.
* For get, the picture is nearly the same, except for bpv's 17-21, where contiguous is slower than padded (observed in process 2 as well as in the single-thread run). The difference is less than 10%, though.

The same pattern, although noisier, can be seen on mars. My preliminary conclusion for i7 processors is thus that Packed64Strategy is the right choice for the best possible performance without going all-out with Direct. I am getting very curious about the untested AMD architecture now.

A by-product of the te-prime parallel test is that the amount of cache seems to matter little when it comes to selecting the most fitting implementation. Thank $deity for small favors.
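The contiguous-vs-padded distinction being benchmarked can be sketched like this. It is a minimal illustrative version, not Lucene's Packed64/Packed64SingleBlock code: in a contiguous layout a value may straddle two 64-bit blocks, so get/set sometimes touch two longs, which is exactly the case a padded layout never hits.

```java
// Minimal sketch of a contiguous bit-packed layout (illustrative names, not
// Lucene code). Values of bitsPerValue bits are laid out back to back across
// an array of 64-bit blocks, so a value can span a block boundary.
public class ContiguousPacking {
    static long mask(int bitsPerValue) {
        return bitsPerValue == 64 ? -1L : (1L << bitsPerValue) - 1;
    }

    // Read value #index; may need to combine bits from two adjacent blocks.
    static long get(long[] blocks, int index, int bitsPerValue) {
        long bitPos = (long) index * bitsPerValue;
        int block = (int) (bitPos >>> 6);  // bitPos / 64
        int offset = (int) (bitPos & 63);  // bitPos % 64
        long m = mask(bitsPerValue);
        if (offset + bitsPerValue <= 64) {
            return (blocks[block] >>> offset) & m;  // fits in one block
        }
        // Spans two blocks: low bits from this block, high bits from the next.
        long lo = blocks[block] >>> offset;
        long hi = blocks[block + 1] << (64 - offset);
        return (lo | hi) & m;
    }

    // Write value #index; may need to update two adjacent blocks.
    static void set(long[] blocks, int index, int bitsPerValue, long value) {
        long bitPos = (long) index * bitsPerValue;
        int block = (int) (bitPos >>> 6);
        int offset = (int) (bitPos & 63);
        long m = mask(bitsPerValue);
        value &= m;
        blocks[block] = (blocks[block] & ~(m << offset)) | (value << offset);
        if (offset + bitsPerValue > 64) {
            int spill = offset + bitsPerValue - 64;  // bits overflowing into the next block
            long hiMask = (1L << spill) - 1;
            blocks[block + 1] = (blocks[block + 1] & ~hiMask) | (value >>> (64 - offset));
        }
    }
}
```

With bpv=21, value #3 starts at bit 63 and straddles blocks 0 and 1, so both get and set take the two-block path; a padded layout instead skips the last bit of each block so that every value stays inside a single long.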
[ https://issues.apache.org/jira/browse/LUCENE-4062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13402201#comment-13402201 ]

Adrien Grand edited comment on LUCENE-4062 at 6/27/12 12:58 PM:
----------------------------------------------------------------

Thanks for sharing your results. Here are mine: http://people.apache.org/~jpountz/packed_ints_calc.html (E5500 @ 2.80GHz, Java 1.7.0_02, HotSpot build 22.0-b10).

Funny to see those little bumps when the number of bits per value is 8, 16, 32 or 64 (24 as well, although it is smaller)! Depending on the hardware, it is not clear whether this impl is faster or slower than the single-block impl (or even the 3-block impl; I am happily surprised by the read throughput on the Pentium 4 machine). However, this new impl seems to be consistently better than the current Packed64 class, so I think we should replace it with your new impl. What do you think? Can you write a patch?