[jira] [Commented] (LUCENE-9027) SIMD-based decoding of postings lists
[ https://issues.apache.org/jira/browse/LUCENE-9027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17192488#comment-17192488 ] Michael McCandless commented on LUCENE-9027: {quote}Furthermore, I think this advertises a best-practice for other format writers – to do the same for their formats. If I have a custom DocValuesFormat and change it, seeing the best-practice of versions in names clarifies how to go about versioning. {quote} I think that holds true for our default (supported) codec formats, but it is not clear to me that this full discipline is needed for our experimental formats. Users need to understand and maybe be reminded that those formats are truly experimental. > SIMD-based decoding of postings lists > - > > Key: LUCENE-9027 > URL: https://issues.apache.org/jira/browse/LUCENE-9027 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Minor > Fix For: 8.4 > > Time Spent: 3h 50m > Remaining Estimate: 0h > > [~rcmuir] has been mentioning the idea for quite some time that we might be > able to write the decoding logic in such a way that Java would use SIMD > instructions. More recently [~paul.masurel] wrote a [blog > post|https://fulmicoton.com/posts/bitpacking/] that raises the point that > Lucene could still do decode multiple ints at once in a single instruction by > packing two ints in a long and we've had some discussions about what we could > try in Lucene to speed up the decoding of postings. This made me want to look > a bit deeper at what we could do. > Our current decoding logic reads data in a byte[] and decodes packed integers > from it. Unfortunately it doesn't make use of SIMD instructions and looks > like > [this|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/NaiveByteDecoder.java]. > I confirmed by looking at the generated assembly that if I take an array of > integers and shift them all by the same number of bits then Java will use > SIMD instructions to shift multiple integers at once. This led me to writing > this > [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SimpleSIMDDecoder.java] > that tries as much as possible to shift long sequences of ints by the same > number of bits to speed up decoding. It is indeed faster than the current > logic we have, up to about 2x faster for some numbers of bits per value. > Currently the best > [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SIMDDecoder.java] > I've been able to come up with combines the above idea with the idea that > Paul mentioned in his blog that consists of emulating SIMD from Java by > packing multiple integers into a long: 2 ints, 4 shorts or 8 bytes. It is a > bit harder to read but gives another speedup on top of the above > implementation. > I have a [JMH > benchmark|https://github.com/jpountz/decode-128-ints-benchmark/] available in > case someone would like to play with this and maybe even come up with an even > faster implementation. It is 2-2.5x faster than our current implementation > for most numbers of bits per value. I'm copying results here: > {noformat} > * `readLongs` just reads 2*bitsPerValue longs from the ByteBuffer, it serves > as >a baseline. > * `decodeNaiveFromBytes` reads a byte[] and decodes from it. This is what the >current Lucene codec does. > * `decodeNaiveFromLongs` decodes from longs on the fly. > * `decodeSimpleSIMD` is a simple implementation that relies on how Java >recognizes some patterns and uses SIMD instructions. > * `decodeSIMD` is a more complex implementation that both relies on the C2 >compiler to generate SIMD instructions and encodes 8 bytes, 4 shorts or >2 ints in a long in order to decompress multiple values at once. > Benchmark (bitsPerValue) (byteOrder) > Mode Cnt Score Error Units > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 1 LE > thrpt5 12.912 ± 0.393 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 1 BE > thrpt5 12.862 ± 0.395 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 2 LE > thrpt5 13.040 ± 1.162 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 2 BE > thrpt5 13.027 ± 0.270 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 3 LE > thrpt5 12.409 ± 0.637 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 3 BE > thrpt5 12.268 ± 0.947 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 4 LE > thrpt5 14.177 ±
[jira] [Commented] (LUCENE-9027) SIMD-based decoding of postings lists
[ https://issues.apache.org/jira/browse/LUCENE-9027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181320#comment-17181320 ] David Smiley commented on LUCENE-9027: -- Furthermore, I think this advertises a best-practice for other format writers -- to do the same for their formats. If I have a custom DocValuesFormat and change it, seeing the best-practice of versions in names clarifies how to go about versioning. > SIMD-based decoding of postings lists > - > > Key: LUCENE-9027 > URL: https://issues.apache.org/jira/browse/LUCENE-9027 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Minor > Fix For: 8.4 > > Time Spent: 3h 50m > Remaining Estimate: 0h > > [~rcmuir] has been mentioning the idea for quite some time that we might be > able to write the decoding logic in such a way that Java would use SIMD > instructions. More recently [~paul.masurel] wrote a [blog > post|https://fulmicoton.com/posts/bitpacking/] that raises the point that > Lucene could still do decode multiple ints at once in a single instruction by > packing two ints in a long and we've had some discussions about what we could > try in Lucene to speed up the decoding of postings. This made me want to look > a bit deeper at what we could do. > Our current decoding logic reads data in a byte[] and decodes packed integers > from it. Unfortunately it doesn't make use of SIMD instructions and looks > like > [this|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/NaiveByteDecoder.java]. > I confirmed by looking at the generated assembly that if I take an array of > integers and shift them all by the same number of bits then Java will use > SIMD instructions to shift multiple integers at once. This led me to writing > this > [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SimpleSIMDDecoder.java] > that tries as much as possible to shift long sequences of ints by the same > number of bits to speed up decoding. It is indeed faster than the current > logic we have, up to about 2x faster for some numbers of bits per value. > Currently the best > [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SIMDDecoder.java] > I've been able to come up with combines the above idea with the idea that > Paul mentioned in his blog that consists of emulating SIMD from Java by > packing multiple integers into a long: 2 ints, 4 shorts or 8 bytes. It is a > bit harder to read but gives another speedup on top of the above > implementation. > I have a [JMH > benchmark|https://github.com/jpountz/decode-128-ints-benchmark/] available in > case someone would like to play with this and maybe even come up with an even > faster implementation. It is 2-2.5x faster than our current implementation > for most numbers of bits per value. I'm copying results here: > {noformat} > * `readLongs` just reads 2*bitsPerValue longs from the ByteBuffer, it serves > as >a baseline. > * `decodeNaiveFromBytes` reads a byte[] and decodes from it. This is what the >current Lucene codec does. > * `decodeNaiveFromLongs` decodes from longs on the fly. > * `decodeSimpleSIMD` is a simple implementation that relies on how Java >recognizes some patterns and uses SIMD instructions. > * `decodeSIMD` is a more complex implementation that both relies on the C2 >compiler to generate SIMD instructions and encodes 8 bytes, 4 shorts or >2 ints in a long in order to decompress multiple values at once. > Benchmark (bitsPerValue) (byteOrder) > Mode Cnt Score Error Units > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 1 LE > thrpt5 12.912 ± 0.393 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 1 BE > thrpt5 12.862 ± 0.395 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 2 LE > thrpt5 13.040 ± 1.162 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 2 BE > thrpt5 13.027 ± 0.270 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 3 LE > thrpt5 12.409 ± 0.637 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 3 BE > thrpt5 12.268 ± 0.947 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 4 LE > thrpt5 14.177 ± 2.263 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 4 BE > thrpt5 11.457 ± 0.150 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 5 LE > thrpt5 10.988 ± 1.179 ops/us >
[jira] [Commented] (LUCENE-9027) SIMD-based decoding of postings lists
[ https://issues.apache.org/jira/browse/LUCENE-9027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181319#comment-17181319 ] David Smiley commented on LUCENE-9027: -- bq. the version doesn't really matter I disagree. Without a version, users will likely get mysterious low level errors when incompatible versions read/write each other's segments. It's pretty low-maintenance to just apply this version suffix strategy on the name and bump it appropriately. > SIMD-based decoding of postings lists > - > > Key: LUCENE-9027 > URL: https://issues.apache.org/jira/browse/LUCENE-9027 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Minor > Fix For: 8.4 > > Time Spent: 3h 50m > Remaining Estimate: 0h > > [~rcmuir] has been mentioning the idea for quite some time that we might be > able to write the decoding logic in such a way that Java would use SIMD > instructions. More recently [~paul.masurel] wrote a [blog > post|https://fulmicoton.com/posts/bitpacking/] that raises the point that > Lucene could still do decode multiple ints at once in a single instruction by > packing two ints in a long and we've had some discussions about what we could > try in Lucene to speed up the decoding of postings. This made me want to look > a bit deeper at what we could do. > Our current decoding logic reads data in a byte[] and decodes packed integers > from it. Unfortunately it doesn't make use of SIMD instructions and looks > like > [this|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/NaiveByteDecoder.java]. > I confirmed by looking at the generated assembly that if I take an array of > integers and shift them all by the same number of bits then Java will use > SIMD instructions to shift multiple integers at once. This led me to writing > this > [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SimpleSIMDDecoder.java] > that tries as much as possible to shift long sequences of ints by the same > number of bits to speed up decoding. It is indeed faster than the current > logic we have, up to about 2x faster for some numbers of bits per value. > Currently the best > [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SIMDDecoder.java] > I've been able to come up with combines the above idea with the idea that > Paul mentioned in his blog that consists of emulating SIMD from Java by > packing multiple integers into a long: 2 ints, 4 shorts or 8 bytes. It is a > bit harder to read but gives another speedup on top of the above > implementation. > I have a [JMH > benchmark|https://github.com/jpountz/decode-128-ints-benchmark/] available in > case someone would like to play with this and maybe even come up with an even > faster implementation. It is 2-2.5x faster than our current implementation > for most numbers of bits per value. I'm copying results here: > {noformat} > * `readLongs` just reads 2*bitsPerValue longs from the ByteBuffer, it serves > as >a baseline. > * `decodeNaiveFromBytes` reads a byte[] and decodes from it. This is what the >current Lucene codec does. > * `decodeNaiveFromLongs` decodes from longs on the fly. > * `decodeSimpleSIMD` is a simple implementation that relies on how Java >recognizes some patterns and uses SIMD instructions. > * `decodeSIMD` is a more complex implementation that both relies on the C2 >compiler to generate SIMD instructions and encodes 8 bytes, 4 shorts or >2 ints in a long in order to decompress multiple values at once. > Benchmark (bitsPerValue) (byteOrder) > Mode Cnt Score Error Units > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 1 LE > thrpt5 12.912 ± 0.393 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 1 BE > thrpt5 12.862 ± 0.395 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 2 LE > thrpt5 13.040 ± 1.162 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 2 BE > thrpt5 13.027 ± 0.270 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 3 LE > thrpt5 12.409 ± 0.637 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 3 BE > thrpt5 12.268 ± 0.947 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 4 LE > thrpt5 14.177 ± 2.263 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 4 BE > thrpt5 11.457 ± 0.150 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 5 LE > thrpt5 10.988 ±
[jira] [Commented] (LUCENE-9027) SIMD-based decoding of postings lists
[ https://issues.apache.org/jira/browse/LUCENE-9027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17180780#comment-17180780 ] Michael McCandless commented on LUCENE-9027: Ahh thanks [~jpountz]. Yeah +1 to remove any implied versioning from this experimental postings format's name! > SIMD-based decoding of postings lists > - > > Key: LUCENE-9027 > URL: https://issues.apache.org/jira/browse/LUCENE-9027 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Minor > Fix For: 8.4 > > Time Spent: 3h 50m > Remaining Estimate: 0h > > [~rcmuir] has been mentioning the idea for quite some time that we might be > able to write the decoding logic in such a way that Java would use SIMD > instructions. More recently [~paul.masurel] wrote a [blog > post|https://fulmicoton.com/posts/bitpacking/] that raises the point that > Lucene could still do decode multiple ints at once in a single instruction by > packing two ints in a long and we've had some discussions about what we could > try in Lucene to speed up the decoding of postings. This made me want to look > a bit deeper at what we could do. > Our current decoding logic reads data in a byte[] and decodes packed integers > from it. Unfortunately it doesn't make use of SIMD instructions and looks > like > [this|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/NaiveByteDecoder.java]. > I confirmed by looking at the generated assembly that if I take an array of > integers and shift them all by the same number of bits then Java will use > SIMD instructions to shift multiple integers at once. This led me to writing > this > [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SimpleSIMDDecoder.java] > that tries as much as possible to shift long sequences of ints by the same > number of bits to speed up decoding. It is indeed faster than the current > logic we have, up to about 2x faster for some numbers of bits per value. > Currently the best > [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SIMDDecoder.java] > I've been able to come up with combines the above idea with the idea that > Paul mentioned in his blog that consists of emulating SIMD from Java by > packing multiple integers into a long: 2 ints, 4 shorts or 8 bytes. It is a > bit harder to read but gives another speedup on top of the above > implementation. > I have a [JMH > benchmark|https://github.com/jpountz/decode-128-ints-benchmark/] available in > case someone would like to play with this and maybe even come up with an even > faster implementation. It is 2-2.5x faster than our current implementation > for most numbers of bits per value. I'm copying results here: > {noformat} > * `readLongs` just reads 2*bitsPerValue longs from the ByteBuffer, it serves > as >a baseline. > * `decodeNaiveFromBytes` reads a byte[] and decodes from it. This is what the >current Lucene codec does. > * `decodeNaiveFromLongs` decodes from longs on the fly. > * `decodeSimpleSIMD` is a simple implementation that relies on how Java >recognizes some patterns and uses SIMD instructions. > * `decodeSIMD` is a more complex implementation that both relies on the C2 >compiler to generate SIMD instructions and encodes 8 bytes, 4 shorts or >2 ints in a long in order to decompress multiple values at once. > Benchmark (bitsPerValue) (byteOrder) > Mode Cnt Score Error Units > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 1 LE > thrpt5 12.912 ± 0.393 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 1 BE > thrpt5 12.862 ± 0.395 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 2 LE > thrpt5 13.040 ± 1.162 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 2 BE > thrpt5 13.027 ± 0.270 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 3 LE > thrpt5 12.409 ± 0.637 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 3 BE > thrpt5 12.268 ± 0.947 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 4 LE > thrpt5 14.177 ± 2.263 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 4 BE > thrpt5 11.457 ± 0.150 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 5 LE > thrpt5 10.988 ± 1.179 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 5 BE > thrpt5 11.226 ± 0.088 ops/us >
[jira] [Commented] (LUCENE-9027) SIMD-based decoding of postings lists
[ https://issues.apache.org/jira/browse/LUCENE-9027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17180639#comment-17180639 ] Adrien Grand commented on LUCENE-9027: -- I think that David is thinking of e.g. "FST50" (FSTPostingsFormat). Since this postings format is about the terms dictionary (it mostly reuses Lucene84PostingsWriter/Reader for convenience) I was inclined to keep it named FST50 since the format of the terms dictionary had not changed. I don't have strong feelings though, I'd be ok with renaming it "FST84", or maybe even just "FST" as the version doesn't really matter for these postings formats that we don't provide backward compatibility for. > SIMD-based decoding of postings lists > - > > Key: LUCENE-9027 > URL: https://issues.apache.org/jira/browse/LUCENE-9027 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Minor > Fix For: 8.4 > > Time Spent: 3h 50m > Remaining Estimate: 0h > > [~rcmuir] has been mentioning the idea for quite some time that we might be > able to write the decoding logic in such a way that Java would use SIMD > instructions. More recently [~paul.masurel] wrote a [blog > post|https://fulmicoton.com/posts/bitpacking/] that raises the point that > Lucene could still do decode multiple ints at once in a single instruction by > packing two ints in a long and we've had some discussions about what we could > try in Lucene to speed up the decoding of postings. This made me want to look > a bit deeper at what we could do. > Our current decoding logic reads data in a byte[] and decodes packed integers > from it. Unfortunately it doesn't make use of SIMD instructions and looks > like > [this|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/NaiveByteDecoder.java]. > I confirmed by looking at the generated assembly that if I take an array of > integers and shift them all by the same number of bits then Java will use > SIMD instructions to shift multiple integers at once. This led me to writing > this > [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SimpleSIMDDecoder.java] > that tries as much as possible to shift long sequences of ints by the same > number of bits to speed up decoding. It is indeed faster than the current > logic we have, up to about 2x faster for some numbers of bits per value. > Currently the best > [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SIMDDecoder.java] > I've been able to come up with combines the above idea with the idea that > Paul mentioned in his blog that consists of emulating SIMD from Java by > packing multiple integers into a long: 2 ints, 4 shorts or 8 bytes. It is a > bit harder to read but gives another speedup on top of the above > implementation. > I have a [JMH > benchmark|https://github.com/jpountz/decode-128-ints-benchmark/] available in > case someone would like to play with this and maybe even come up with an even > faster implementation. It is 2-2.5x faster than our current implementation > for most numbers of bits per value. I'm copying results here: > {noformat} > * `readLongs` just reads 2*bitsPerValue longs from the ByteBuffer, it serves > as >a baseline. > * `decodeNaiveFromBytes` reads a byte[] and decodes from it. This is what the >current Lucene codec does. > * `decodeNaiveFromLongs` decodes from longs on the fly. > * `decodeSimpleSIMD` is a simple implementation that relies on how Java >recognizes some patterns and uses SIMD instructions. > * `decodeSIMD` is a more complex implementation that both relies on the C2 >compiler to generate SIMD instructions and encodes 8 bytes, 4 shorts or >2 ints in a long in order to decompress multiple values at once. > Benchmark (bitsPerValue) (byteOrder) > Mode Cnt Score Error Units > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 1 LE > thrpt5 12.912 ± 0.393 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 1 BE > thrpt5 12.862 ± 0.395 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 2 LE > thrpt5 13.040 ± 1.162 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 2 BE > thrpt5 13.027 ± 0.270 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 3 LE > thrpt5 12.409 ± 0.637 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 3 BE > thrpt5 12.268 ± 0.947 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 4 LE > thrpt5 14.177 ± 2.263 ops/us >
[jira] [Commented] (LUCENE-9027) SIMD-based decoding of postings lists
[ https://issues.apache.org/jira/browse/LUCENE-9027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17179726#comment-17179726 ] Michael McCandless commented on LUCENE-9027: {quote}_Shouldn't the other postings formats that switched from use of this postings write/reader also bump their version suffix?_ {quote} Hmm, can you give an example? What other postings formats (besides {{Lucene84PostingsFormat}}) are using the new {{Lucene84PostingsWriter/Reader}}? > SIMD-based decoding of postings lists > - > > Key: LUCENE-9027 > URL: https://issues.apache.org/jira/browse/LUCENE-9027 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Minor > Fix For: 8.4 > > Time Spent: 3h 50m > Remaining Estimate: 0h > > [~rcmuir] has been mentioning the idea for quite some time that we might be > able to write the decoding logic in such a way that Java would use SIMD > instructions. More recently [~paul.masurel] wrote a [blog > post|https://fulmicoton.com/posts/bitpacking/] that raises the point that > Lucene could still do decode multiple ints at once in a single instruction by > packing two ints in a long and we've had some discussions about what we could > try in Lucene to speed up the decoding of postings. This made me want to look > a bit deeper at what we could do. > Our current decoding logic reads data in a byte[] and decodes packed integers > from it. Unfortunately it doesn't make use of SIMD instructions and looks > like > [this|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/NaiveByteDecoder.java]. > I confirmed by looking at the generated assembly that if I take an array of > integers and shift them all by the same number of bits then Java will use > SIMD instructions to shift multiple integers at once. This led me to writing > this > [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SimpleSIMDDecoder.java] > that tries as much as possible to shift long sequences of ints by the same > number of bits to speed up decoding. It is indeed faster than the current > logic we have, up to about 2x faster for some numbers of bits per value. > Currently the best > [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SIMDDecoder.java] > I've been able to come up with combines the above idea with the idea that > Paul mentioned in his blog that consists of emulating SIMD from Java by > packing multiple integers into a long: 2 ints, 4 shorts or 8 bytes. It is a > bit harder to read but gives another speedup on top of the above > implementation. > I have a [JMH > benchmark|https://github.com/jpountz/decode-128-ints-benchmark/] available in > case someone would like to play with this and maybe even come up with an even > faster implementation. It is 2-2.5x faster than our current implementation > for most numbers of bits per value. I'm copying results here: > {noformat} > * `readLongs` just reads 2*bitsPerValue longs from the ByteBuffer, it serves > as >a baseline. > * `decodeNaiveFromBytes` reads a byte[] and decodes from it. This is what the >current Lucene codec does. > * `decodeNaiveFromLongs` decodes from longs on the fly. > * `decodeSimpleSIMD` is a simple implementation that relies on how Java >recognizes some patterns and uses SIMD instructions. > * `decodeSIMD` is a more complex implementation that both relies on the C2 >compiler to generate SIMD instructions and encodes 8 bytes, 4 shorts or >2 ints in a long in order to decompress multiple values at once. > Benchmark (bitsPerValue) (byteOrder) > Mode Cnt Score Error Units > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 1 LE > thrpt5 12.912 ± 0.393 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 1 BE > thrpt5 12.862 ± 0.395 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 2 LE > thrpt5 13.040 ± 1.162 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 2 BE > thrpt5 13.027 ± 0.270 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 3 LE > thrpt5 12.409 ± 0.637 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 3 BE > thrpt5 12.268 ± 0.947 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 4 LE > thrpt5 14.177 ± 2.263 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 4 BE > thrpt5 11.457 ± 0.150 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 5 LE > thrpt5
[jira] [Commented] (LUCENE-9027) SIMD-based decoding of postings lists
[ https://issues.apache.org/jira/browse/LUCENE-9027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17179200#comment-17179200 ] David Smiley commented on LUCENE-9027: -- Question: This feature introduced Lucene84PostingsFormat because, presumably, the use of Lucene50PostingsWriter changed to Lucene84PostingsWriter (ditto for reader). New formats have new names: "Lucene50" -> "Lucene84". _Shouldn't the other postings formats that switched from use of this postings write/reader also bump their version suffix?_ No need to duplicate them because we may not want the maintenance of actually supporting the previous version. Too late now, of course. This format naming/version strategy seems like something that ought to be documented. > SIMD-based decoding of postings lists > - > > Key: LUCENE-9027 > URL: https://issues.apache.org/jira/browse/LUCENE-9027 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Minor > Fix For: 8.4 > > Time Spent: 3h 50m > Remaining Estimate: 0h > > [~rcmuir] has been mentioning the idea for quite some time that we might be > able to write the decoding logic in such a way that Java would use SIMD > instructions. More recently [~paul.masurel] wrote a [blog > post|https://fulmicoton.com/posts/bitpacking/] that raises the point that > Lucene could still do decode multiple ints at once in a single instruction by > packing two ints in a long and we've had some discussions about what we could > try in Lucene to speed up the decoding of postings. This made me want to look > a bit deeper at what we could do. > Our current decoding logic reads data in a byte[] and decodes packed integers > from it. Unfortunately it doesn't make use of SIMD instructions and looks > like > [this|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/NaiveByteDecoder.java]. > I confirmed by looking at the generated assembly that if I take an array of > integers and shift them all by the same number of bits then Java will use > SIMD instructions to shift multiple integers at once. This led me to writing > this > [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SimpleSIMDDecoder.java] > that tries as much as possible to shift long sequences of ints by the same > number of bits to speed up decoding. It is indeed faster than the current > logic we have, up to about 2x faster for some numbers of bits per value. > Currently the best > [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SIMDDecoder.java] > I've been able to come up with combines the above idea with the idea that > Paul mentioned in his blog that consists of emulating SIMD from Java by > packing multiple integers into a long: 2 ints, 4 shorts or 8 bytes. It is a > bit harder to read but gives another speedup on top of the above > implementation. > I have a [JMH > benchmark|https://github.com/jpountz/decode-128-ints-benchmark/] available in > case someone would like to play with this and maybe even come up with an even > faster implementation. It is 2-2.5x faster than our current implementation > for most numbers of bits per value. I'm copying results here: > {noformat} > * `readLongs` just reads 2*bitsPerValue longs from the ByteBuffer, it serves > as >a baseline. > * `decodeNaiveFromBytes` reads a byte[] and decodes from it. This is what the >current Lucene codec does. > * `decodeNaiveFromLongs` decodes from longs on the fly. > * `decodeSimpleSIMD` is a simple implementation that relies on how Java >recognizes some patterns and uses SIMD instructions. > * `decodeSIMD` is a more complex implementation that both relies on the C2 >compiler to generate SIMD instructions and encodes 8 bytes, 4 shorts or >2 ints in a long in order to decompress multiple values at once. > Benchmark (bitsPerValue) (byteOrder) > Mode Cnt Score Error Units > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 1 LE > thrpt5 12.912 ± 0.393 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 1 BE > thrpt5 12.862 ± 0.395 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 2 LE > thrpt5 13.040 ± 1.162 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 2 BE > thrpt5 13.027 ± 0.270 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 3 LE > thrpt5 12.409 ± 0.637 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 3 BE > thrpt5 12.268 ± 0.947 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes
[jira] [Commented] (LUCENE-9027) SIMD-based decoding of postings lists
[ https://issues.apache.org/jira/browse/LUCENE-9027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16980279#comment-16980279 ] ASF subversion and git services commented on LUCENE-9027: - Commit bc758601cd8f77136e0f8bb8467927c3e37c7ddf in lucene-solr's branch refs/heads/branch_8x from Adrien Grand [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=bc75860 ] LUCENE-9027: Try to get back some indexing speed. > SIMD-based decoding of postings lists > - > > Key: LUCENE-9027 > URL: https://issues.apache.org/jira/browse/LUCENE-9027 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Minor > Fix For: 8.4 > > Time Spent: 3h 50m > Remaining Estimate: 0h > > [~rcmuir] has been mentioning the idea for quite some time that we might be > able to write the decoding logic in such a way that Java would use SIMD > instructions. More recently [~paul.masurel] wrote a [blog > post|https://fulmicoton.com/posts/bitpacking/] that raises the point that > Lucene could still do decode multiple ints at once in a single instruction by > packing two ints in a long and we've had some discussions about what we could > try in Lucene to speed up the decoding of postings. This made me want to look > a bit deeper at what we could do. > Our current decoding logic reads data in a byte[] and decodes packed integers > from it. Unfortunately it doesn't make use of SIMD instructions and looks > like > [this|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/NaiveByteDecoder.java]. > I confirmed by looking at the generated assembly that if I take an array of > integers and shift them all by the same number of bits then Java will use > SIMD instructions to shift multiple integers at once. This led me to writing > this > [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SimpleSIMDDecoder.java] > that tries as much as possible to shift long sequences of ints by the same > number of bits to speed up decoding. It is indeed faster than the current > logic we have, up to about 2x faster for some numbers of bits per value. > Currently the best > [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SIMDDecoder.java] > I've been able to come up with combines the above idea with the idea that > Paul mentioned in his blog that consists of emulating SIMD from Java by > packing multiple integers into a long: 2 ints, 4 shorts or 8 bytes. It is a > bit harder to read but gives another speedup on top of the above > implementation. > I have a [JMH > benchmark|https://github.com/jpountz/decode-128-ints-benchmark/] available in > case someone would like to play with this and maybe even come up with an even > faster implementation. It is 2-2.5x faster than our current implementation > for most numbers of bits per value. I'm copying results here: > {noformat} > * `readLongs` just reads 2*bitsPerValue longs from the ByteBuffer, it serves > as >a baseline. > * `decodeNaiveFromBytes` reads a byte[] and decodes from it. This is what the >current Lucene codec does. > * `decodeNaiveFromLongs` decodes from longs on the fly. > * `decodeSimpleSIMD` is a simple implementation that relies on how Java >recognizes some patterns and uses SIMD instructions. > * `decodeSIMD` is a more complex implementation that both relies on the C2 >compiler to generate SIMD instructions and encodes 8 bytes, 4 shorts or >2 ints in a long in order to decompress multiple values at once. > Benchmark (bitsPerValue) (byteOrder) > Mode Cnt Score Error Units > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 1 LE > thrpt5 12.912 ± 0.393 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 1 BE > thrpt5 12.862 ± 0.395 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 2 LE > thrpt5 13.040 ± 1.162 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 2 BE > thrpt5 13.027 ± 0.270 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 3 LE > thrpt5 12.409 ± 0.637 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 3 BE > thrpt5 12.268 ± 0.947 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 4 LE > thrpt5 14.177 ± 2.263 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 4 BE > thrpt5 11.457 ± 0.150 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 5 LE > thrpt5 10.988 ± 1.179 ops/us >
[jira] [Commented] (LUCENE-9027) SIMD-based decoding of postings lists
[ https://issues.apache.org/jira/browse/LUCENE-9027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16980280#comment-16980280 ] ASF subversion and git services commented on LUCENE-9027: - Commit c51006c3c48e41dfb68b62cdaf39916d5eed65b8 in lucene-solr's branch refs/heads/master from Adrien Grand [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=c51006c ] LUCENE-9027: Try to get back some indexing speed. > SIMD-based decoding of postings lists > - > > Key: LUCENE-9027 > URL: https://issues.apache.org/jira/browse/LUCENE-9027 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Minor > Fix For: 8.4 > > Time Spent: 3h 50m > Remaining Estimate: 0h > > [~rcmuir] has been mentioning the idea for quite some time that we might be > able to write the decoding logic in such a way that Java would use SIMD > instructions. More recently [~paul.masurel] wrote a [blog > post|https://fulmicoton.com/posts/bitpacking/] that raises the point that > Lucene could still do decode multiple ints at once in a single instruction by > packing two ints in a long and we've had some discussions about what we could > try in Lucene to speed up the decoding of postings. This made me want to look > a bit deeper at what we could do. > Our current decoding logic reads data in a byte[] and decodes packed integers > from it. Unfortunately it doesn't make use of SIMD instructions and looks > like > [this|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/NaiveByteDecoder.java]. > I confirmed by looking at the generated assembly that if I take an array of > integers and shift them all by the same number of bits then Java will use > SIMD instructions to shift multiple integers at once. This led me to writing > this > [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SimpleSIMDDecoder.java] > that tries as much as possible to shift long sequences of ints by the same > number of bits to speed up decoding. It is indeed faster than the current > logic we have, up to about 2x faster for some numbers of bits per value. > Currently the best > [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SIMDDecoder.java] > I've been able to come up with combines the above idea with the idea that > Paul mentioned in his blog that consists of emulating SIMD from Java by > packing multiple integers into a long: 2 ints, 4 shorts or 8 bytes. It is a > bit harder to read but gives another speedup on top of the above > implementation. > I have a [JMH > benchmark|https://github.com/jpountz/decode-128-ints-benchmark/] available in > case someone would like to play with this and maybe even come up with an even > faster implementation. It is 2-2.5x faster than our current implementation > for most numbers of bits per value. I'm copying results here: > {noformat} > * `readLongs` just reads 2*bitsPerValue longs from the ByteBuffer, it serves > as >a baseline. > * `decodeNaiveFromBytes` reads a byte[] and decodes from it. This is what the >current Lucene codec does. > * `decodeNaiveFromLongs` decodes from longs on the fly. > * `decodeSimpleSIMD` is a simple implementation that relies on how Java >recognizes some patterns and uses SIMD instructions. > * `decodeSIMD` is a more complex implementation that both relies on the C2 >compiler to generate SIMD instructions and encodes 8 bytes, 4 shorts or >2 ints in a long in order to decompress multiple values at once. > Benchmark (bitsPerValue) (byteOrder) > Mode Cnt Score Error Units > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 1 LE > thrpt5 12.912 ± 0.393 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 1 BE > thrpt5 12.862 ± 0.395 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 2 LE > thrpt5 13.040 ± 1.162 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 2 BE > thrpt5 13.027 ± 0.270 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 3 LE > thrpt5 12.409 ± 0.637 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 3 BE > thrpt5 12.268 ± 0.947 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 4 LE > thrpt5 14.177 ± 2.263 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 4 BE > thrpt5 11.457 ± 0.150 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 5 LE > thrpt5 10.988 ± 1.179 ops/us >
[jira] [Commented] (LUCENE-9027) SIMD-based decoding of postings lists
[ https://issues.apache.org/jira/browse/LUCENE-9027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16980032#comment-16980032 ] Adrien Grand commented on LUCENE-9027: -- This and my followup changes seem to have introduced a hit on indexing throughput http://people.apache.org/~mikemccand/lucenebench/indexing.html. I'll dig. > SIMD-based decoding of postings lists > - > > Key: LUCENE-9027 > URL: https://issues.apache.org/jira/browse/LUCENE-9027 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Minor > Fix For: 8.4 > > Time Spent: 3h 50m > Remaining Estimate: 0h > > [~rcmuir] has been mentioning the idea for quite some time that we might be > able to write the decoding logic in such a way that Java would use SIMD > instructions. More recently [~paul.masurel] wrote a [blog > post|https://fulmicoton.com/posts/bitpacking/] that raises the point that > Lucene could still do decode multiple ints at once in a single instruction by > packing two ints in a long and we've had some discussions about what we could > try in Lucene to speed up the decoding of postings. This made me want to look > a bit deeper at what we could do. > Our current decoding logic reads data in a byte[] and decodes packed integers > from it. Unfortunately it doesn't make use of SIMD instructions and looks > like > [this|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/NaiveByteDecoder.java]. > I confirmed by looking at the generated assembly that if I take an array of > integers and shift them all by the same number of bits then Java will use > SIMD instructions to shift multiple integers at once. This led me to writing > this > [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SimpleSIMDDecoder.java] > that tries as much as possible to shift long sequences of ints by the same > number of bits to speed up decoding. It is indeed faster than the current > logic we have, up to about 2x faster for some numbers of bits per value. > Currently the best > [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SIMDDecoder.java] > I've been able to come up with combines the above idea with the idea that > Paul mentioned in his blog that consists of emulating SIMD from Java by > packing multiple integers into a long: 2 ints, 4 shorts or 8 bytes. It is a > bit harder to read but gives another speedup on top of the above > implementation. > I have a [JMH > benchmark|https://github.com/jpountz/decode-128-ints-benchmark/] available in > case someone would like to play with this and maybe even come up with an even > faster implementation. It is 2-2.5x faster than our current implementation > for most numbers of bits per value. I'm copying results here: > {noformat} > * `readLongs` just reads 2*bitsPerValue longs from the ByteBuffer, it serves > as >a baseline. > * `decodeNaiveFromBytes` reads a byte[] and decodes from it. This is what the >current Lucene codec does. > * `decodeNaiveFromLongs` decodes from longs on the fly. > * `decodeSimpleSIMD` is a simple implementation that relies on how Java >recognizes some patterns and uses SIMD instructions. > * `decodeSIMD` is a more complex implementation that both relies on the C2 >compiler to generate SIMD instructions and encodes 8 bytes, 4 shorts or >2 ints in a long in order to decompress multiple values at once. > Benchmark (bitsPerValue) (byteOrder) > Mode Cnt Score Error Units > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 1 LE > thrpt5 12.912 ± 0.393 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 1 BE > thrpt5 12.862 ± 0.395 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 2 LE > thrpt5 13.040 ± 1.162 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 2 BE > thrpt5 13.027 ± 0.270 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 3 LE > thrpt5 12.409 ± 0.637 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 3 BE > thrpt5 12.268 ± 0.947 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 4 LE > thrpt5 14.177 ± 2.263 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 4 BE > thrpt5 11.457 ± 0.150 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 5 LE > thrpt5 10.988 ± 1.179 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 5 BE > thrpt5 11.226 ± 0.088 ops/us >
[jira] [Commented] (LUCENE-9027) SIMD-based decoding of postings lists
[ https://issues.apache.org/jira/browse/LUCENE-9027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16978213#comment-16978213 ] Bruno Roustant commented on LUCENE-9027: Impressive! > SIMD-based decoding of postings lists > - > > Key: LUCENE-9027 > URL: https://issues.apache.org/jira/browse/LUCENE-9027 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Minor > Fix For: 8.4 > > Time Spent: 3h 50m > Remaining Estimate: 0h > > [~rcmuir] has been mentioning the idea for quite some time that we might be > able to write the decoding logic in such a way that Java would use SIMD > instructions. More recently [~paul.masurel] wrote a [blog > post|https://fulmicoton.com/posts/bitpacking/] that raises the point that > Lucene could still do decode multiple ints at once in a single instruction by > packing two ints in a long and we've had some discussions about what we could > try in Lucene to speed up the decoding of postings. This made me want to look > a bit deeper at what we could do. > Our current decoding logic reads data in a byte[] and decodes packed integers > from it. Unfortunately it doesn't make use of SIMD instructions and looks > like > [this|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/NaiveByteDecoder.java]. > I confirmed by looking at the generated assembly that if I take an array of > integers and shift them all by the same number of bits then Java will use > SIMD instructions to shift multiple integers at once. This led me to writing > this > [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SimpleSIMDDecoder.java] > that tries as much as possible to shift long sequences of ints by the same > number of bits to speed up decoding. It is indeed faster than the current > logic we have, up to about 2x faster for some numbers of bits per value. > Currently the best > [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SIMDDecoder.java] > I've been able to come up with combines the above idea with the idea that > Paul mentioned in his blog that consists of emulating SIMD from Java by > packing multiple integers into a long: 2 ints, 4 shorts or 8 bytes. It is a > bit harder to read but gives another speedup on top of the above > implementation. > I have a [JMH > benchmark|https://github.com/jpountz/decode-128-ints-benchmark/] available in > case someone would like to play with this and maybe even come up with an even > faster implementation. It is 2-2.5x faster than our current implementation > for most numbers of bits per value. I'm copying results here: > {noformat} > * `readLongs` just reads 2*bitsPerValue longs from the ByteBuffer, it serves > as >a baseline. > * `decodeNaiveFromBytes` reads a byte[] and decodes from it. This is what the >current Lucene codec does. > * `decodeNaiveFromLongs` decodes from longs on the fly. > * `decodeSimpleSIMD` is a simple implementation that relies on how Java >recognizes some patterns and uses SIMD instructions. > * `decodeSIMD` is a more complex implementation that both relies on the C2 >compiler to generate SIMD instructions and encodes 8 bytes, 4 shorts or >2 ints in a long in order to decompress multiple values at once. > Benchmark (bitsPerValue) (byteOrder) > Mode Cnt Score Error Units > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 1 LE > thrpt5 12.912 ± 0.393 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 1 BE > thrpt5 12.862 ± 0.395 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 2 LE > thrpt5 13.040 ± 1.162 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 2 BE > thrpt5 13.027 ± 0.270 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 3 LE > thrpt5 12.409 ± 0.637 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 3 BE > thrpt5 12.268 ± 0.947 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 4 LE > thrpt5 14.177 ± 2.263 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 4 BE > thrpt5 11.457 ± 0.150 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 5 LE > thrpt5 10.988 ± 1.179 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 5 BE > thrpt5 11.226 ± 0.088 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 6 LE > thrpt5 9.791 ± 0.305 ops/us >
[jira] [Commented] (LUCENE-9027) SIMD-based decoding of postings lists
[ https://issues.apache.org/jira/browse/LUCENE-9027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16978171#comment-16978171 ] Adrien Grand commented on LUCENE-9027: -- Nightly benchmarks have picked up this change. Several queries got a 5+% increase of QPS: - Prefix ~ +14% - IntervalsOrdered ~ +10% - SpanNearQuery ~ +9% - AndMedOrHighHigh ~ +7% - AndHighMed ~ +7% - OrHighMed ~ +7% - Wildcard ~ +7% More details at [http://people.apache.org/~mikemccand/lucenebench/] > SIMD-based decoding of postings lists > - > > Key: LUCENE-9027 > URL: https://issues.apache.org/jira/browse/LUCENE-9027 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Minor > Fix For: 8.4 > > Time Spent: 3h 50m > Remaining Estimate: 0h > > [~rcmuir] has been mentioning the idea for quite some time that we might be > able to write the decoding logic in such a way that Java would use SIMD > instructions. More recently [~paul.masurel] wrote a [blog > post|https://fulmicoton.com/posts/bitpacking/] that raises the point that > Lucene could still do decode multiple ints at once in a single instruction by > packing two ints in a long and we've had some discussions about what we could > try in Lucene to speed up the decoding of postings. This made me want to look > a bit deeper at what we could do. > Our current decoding logic reads data in a byte[] and decodes packed integers > from it. Unfortunately it doesn't make use of SIMD instructions and looks > like > [this|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/NaiveByteDecoder.java]. > I confirmed by looking at the generated assembly that if I take an array of > integers and shift them all by the same number of bits then Java will use > SIMD instructions to shift multiple integers at once. This led me to writing > this > [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SimpleSIMDDecoder.java] > that tries as much as possible to shift long sequences of ints by the same > number of bits to speed up decoding. It is indeed faster than the current > logic we have, up to about 2x faster for some numbers of bits per value. > Currently the best > [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SIMDDecoder.java] > I've been able to come up with combines the above idea with the idea that > Paul mentioned in his blog that consists of emulating SIMD from Java by > packing multiple integers into a long: 2 ints, 4 shorts or 8 bytes. It is a > bit harder to read but gives another speedup on top of the above > implementation. > I have a [JMH > benchmark|https://github.com/jpountz/decode-128-ints-benchmark/] available in > case someone would like to play with this and maybe even come up with an even > faster implementation. It is 2-2.5x faster than our current implementation > for most numbers of bits per value. I'm copying results here: > {noformat} > * `readLongs` just reads 2*bitsPerValue longs from the ByteBuffer, it serves > as >a baseline. > * `decodeNaiveFromBytes` reads a byte[] and decodes from it. This is what the >current Lucene codec does. > * `decodeNaiveFromLongs` decodes from longs on the fly. > * `decodeSimpleSIMD` is a simple implementation that relies on how Java >recognizes some patterns and uses SIMD instructions. > * `decodeSIMD` is a more complex implementation that both relies on the C2 >compiler to generate SIMD instructions and encodes 8 bytes, 4 shorts or >2 ints in a long in order to decompress multiple values at once. > Benchmark (bitsPerValue) (byteOrder) > Mode Cnt Score Error Units > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 1 LE > thrpt5 12.912 ± 0.393 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 1 BE > thrpt5 12.862 ± 0.395 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 2 LE > thrpt5 13.040 ± 1.162 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 2 BE > thrpt5 13.027 ± 0.270 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 3 LE > thrpt5 12.409 ± 0.637 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 3 BE > thrpt5 12.268 ± 0.947 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 4 LE > thrpt5 14.177 ± 2.263 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 4 BE > thrpt5 11.457 ± 0.150 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 5 LE > thrpt
[jira] [Commented] (LUCENE-9027) SIMD-based decoding of postings lists
[ https://issues.apache.org/jira/browse/LUCENE-9027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16977532#comment-16977532 ] ASF subversion and git services commented on LUCENE-9027: - Commit cb1a72ad1642400888683e1f735a0316334b2484 in lucene-solr's branch refs/heads/branch_8x from Adrien Grand [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=cb1a72a ] LUCENE-9027: Use SIMD instructions to decode postings. (#973) > SIMD-based decoding of postings lists > - > > Key: LUCENE-9027 > URL: https://issues.apache.org/jira/browse/LUCENE-9027 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Minor > Time Spent: 3h 50m > Remaining Estimate: 0h > > [~rcmuir] has been mentioning the idea for quite some time that we might be > able to write the decoding logic in such a way that Java would use SIMD > instructions. More recently [~paul.masurel] wrote a [blog > post|https://fulmicoton.com/posts/bitpacking/] that raises the point that > Lucene could still do decode multiple ints at once in a single instruction by > packing two ints in a long and we've had some discussions about what we could > try in Lucene to speed up the decoding of postings. This made me want to look > a bit deeper at what we could do. > Our current decoding logic reads data in a byte[] and decodes packed integers > from it. Unfortunately it doesn't make use of SIMD instructions and looks > like > [this|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/NaiveByteDecoder.java]. > I confirmed by looking at the generated assembly that if I take an array of > integers and shift them all by the same number of bits then Java will use > SIMD instructions to shift multiple integers at once. This led me to writing > this > [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SimpleSIMDDecoder.java] > that tries as much as possible to shift long sequences of ints by the same > number of bits to speed up decoding. It is indeed faster than the current > logic we have, up to about 2x faster for some numbers of bits per value. > Currently the best > [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SIMDDecoder.java] > I've been able to come up with combines the above idea with the idea that > Paul mentioned in his blog that consists of emulating SIMD from Java by > packing multiple integers into a long: 2 ints, 4 shorts or 8 bytes. It is a > bit harder to read but gives another speedup on top of the above > implementation. > I have a [JMH > benchmark|https://github.com/jpountz/decode-128-ints-benchmark/] available in > case someone would like to play with this and maybe even come up with an even > faster implementation. It is 2-2.5x faster than our current implementation > for most numbers of bits per value. I'm copying results here: > {noformat} > * `readLongs` just reads 2*bitsPerValue longs from the ByteBuffer, it serves > as >a baseline. > * `decodeNaiveFromBytes` reads a byte[] and decodes from it. This is what the >current Lucene codec does. > * `decodeNaiveFromLongs` decodes from longs on the fly. > * `decodeSimpleSIMD` is a simple implementation that relies on how Java >recognizes some patterns and uses SIMD instructions. > * `decodeSIMD` is a more complex implementation that both relies on the C2 >compiler to generate SIMD instructions and encodes 8 bytes, 4 shorts or >2 ints in a long in order to decompress multiple values at once. > Benchmark (bitsPerValue) (byteOrder) > Mode Cnt Score Error Units > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 1 LE > thrpt5 12.912 ± 0.393 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 1 BE > thrpt5 12.862 ± 0.395 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 2 LE > thrpt5 13.040 ± 1.162 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 2 BE > thrpt5 13.027 ± 0.270 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 3 LE > thrpt5 12.409 ± 0.637 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 3 BE > thrpt5 12.268 ± 0.947 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 4 LE > thrpt5 14.177 ± 2.263 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 4 BE > thrpt5 11.457 ± 0.150 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 5 LE > thrpt5 10.988 ± 1.179 ops/us >
[jira] [Commented] (LUCENE-9027) SIMD-based decoding of postings lists
[ https://issues.apache.org/jira/browse/LUCENE-9027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16976752#comment-16976752 ] ASF subversion and git services commented on LUCENE-9027: - Commit 7755cdf03fc250e310c3b7d9b2e785f2939d3dc9 in lucene-solr's branch refs/heads/master from Adrien Grand [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=7755cdf ] LUCENE-9027: Use SIMD instructions to decode postings. (#973) > SIMD-based decoding of postings lists > - > > Key: LUCENE-9027 > URL: https://issues.apache.org/jira/browse/LUCENE-9027 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Minor > Time Spent: 3h 50m > Remaining Estimate: 0h > > [~rcmuir] has been mentioning the idea for quite some time that we might be > able to write the decoding logic in such a way that Java would use SIMD > instructions. More recently [~paul.masurel] wrote a [blog > post|https://fulmicoton.com/posts/bitpacking/] that raises the point that > Lucene could still do decode multiple ints at once in a single instruction by > packing two ints in a long and we've had some discussions about what we could > try in Lucene to speed up the decoding of postings. This made me want to look > a bit deeper at what we could do. > Our current decoding logic reads data in a byte[] and decodes packed integers > from it. Unfortunately it doesn't make use of SIMD instructions and looks > like > [this|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/NaiveByteDecoder.java]. > I confirmed by looking at the generated assembly that if I take an array of > integers and shift them all by the same number of bits then Java will use > SIMD instructions to shift multiple integers at once. This led me to writing > this > [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SimpleSIMDDecoder.java] > that tries as much as possible to shift long sequences of ints by the same > number of bits to speed up decoding. It is indeed faster than the current > logic we have, up to about 2x faster for some numbers of bits per value. > Currently the best > [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SIMDDecoder.java] > I've been able to come up with combines the above idea with the idea that > Paul mentioned in his blog that consists of emulating SIMD from Java by > packing multiple integers into a long: 2 ints, 4 shorts or 8 bytes. It is a > bit harder to read but gives another speedup on top of the above > implementation. > I have a [JMH > benchmark|https://github.com/jpountz/decode-128-ints-benchmark/] available in > case someone would like to play with this and maybe even come up with an even > faster implementation. It is 2-2.5x faster than our current implementation > for most numbers of bits per value. I'm copying results here: > {noformat} > * `readLongs` just reads 2*bitsPerValue longs from the ByteBuffer, it serves > as >a baseline. > * `decodeNaiveFromBytes` reads a byte[] and decodes from it. This is what the >current Lucene codec does. > * `decodeNaiveFromLongs` decodes from longs on the fly. > * `decodeSimpleSIMD` is a simple implementation that relies on how Java >recognizes some patterns and uses SIMD instructions. > * `decodeSIMD` is a more complex implementation that both relies on the C2 >compiler to generate SIMD instructions and encodes 8 bytes, 4 shorts or >2 ints in a long in order to decompress multiple values at once. > Benchmark (bitsPerValue) (byteOrder) > Mode Cnt Score Error Units > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 1 LE > thrpt5 12.912 ± 0.393 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 1 BE > thrpt5 12.862 ± 0.395 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 2 LE > thrpt5 13.040 ± 1.162 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 2 BE > thrpt5 13.027 ± 0.270 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 3 LE > thrpt5 12.409 ± 0.637 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 3 BE > thrpt5 12.268 ± 0.947 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 4 LE > thrpt5 14.177 ± 2.263 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 4 BE > thrpt5 11.457 ± 0.150 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 5 LE > thrpt5 10.988 ± 1.179 ops/us >
[jira] [Commented] (LUCENE-9027) SIMD-based decoding of postings lists
[ https://issues.apache.org/jira/browse/LUCENE-9027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16973707#comment-16973707 ] Yonik Seeley commented on LUCENE-9027: -- Yep, going little-endian is the right choice. big-endian has been dead for a while, and is very unlikely to be revived. > SIMD-based decoding of postings lists > - > > Key: LUCENE-9027 > URL: https://issues.apache.org/jira/browse/LUCENE-9027 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Minor > Time Spent: 2h 10m > Remaining Estimate: 0h > > [~rcmuir] has been mentioning the idea for quite some time that we might be > able to write the decoding logic in such a way that Java would use SIMD > instructions. More recently [~paul.masurel] wrote a [blog > post|https://fulmicoton.com/posts/bitpacking/] that raises the point that > Lucene could still do decode multiple ints at once in a single instruction by > packing two ints in a long and we've had some discussions about what we could > try in Lucene to speed up the decoding of postings. This made me want to look > a bit deeper at what we could do. > Our current decoding logic reads data in a byte[] and decodes packed integers > from it. Unfortunately it doesn't make use of SIMD instructions and looks > like > [this|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/NaiveByteDecoder.java]. > I confirmed by looking at the generated assembly that if I take an array of > integers and shift them all by the same number of bits then Java will use > SIMD instructions to shift multiple integers at once. This led me to writing > this > [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SimpleSIMDDecoder.java] > that tries as much as possible to shift long sequences of ints by the same > number of bits to speed up decoding. It is indeed faster than the current > logic we have, up to about 2x faster for some numbers of bits per value. > Currently the best > [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SIMDDecoder.java] > I've been able to come up with combines the above idea with the idea that > Paul mentioned in his blog that consists of emulating SIMD from Java by > packing multiple integers into a long: 2 ints, 4 shorts or 8 bytes. It is a > bit harder to read but gives another speedup on top of the above > implementation. > I have a [JMH > benchmark|https://github.com/jpountz/decode-128-ints-benchmark/] available in > case someone would like to play with this and maybe even come up with an even > faster implementation. It is 2-2.5x faster than our current implementation > for most numbers of bits per value. I'm copying results here: > {noformat} > * `readLongs` just reads 2*bitsPerValue longs from the ByteBuffer, it serves > as >a baseline. > * `decodeNaiveFromBytes` reads a byte[] and decodes from it. This is what the >current Lucene codec does. > * `decodeNaiveFromLongs` decodes from longs on the fly. > * `decodeSimpleSIMD` is a simple implementation that relies on how Java >recognizes some patterns and uses SIMD instructions. > * `decodeSIMD` is a more complex implementation that both relies on the C2 >compiler to generate SIMD instructions and encodes 8 bytes, 4 shorts or >2 ints in a long in order to decompress multiple values at once. > Benchmark (bitsPerValue) (byteOrder) > Mode Cnt Score Error Units > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 1 LE > thrpt5 12.912 ± 0.393 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 1 BE > thrpt5 12.862 ± 0.395 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 2 LE > thrpt5 13.040 ± 1.162 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 2 BE > thrpt5 13.027 ± 0.270 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 3 LE > thrpt5 12.409 ± 0.637 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 3 BE > thrpt5 12.268 ± 0.947 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 4 LE > thrpt5 14.177 ± 2.263 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 4 BE > thrpt5 11.457 ± 0.150 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 5 LE > thrpt5 10.988 ± 1.179 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 5 BE > thrpt5 11.226 ± 0.088 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 6
[jira] [Commented] (LUCENE-9027) SIMD-based decoding of postings lists
[ https://issues.apache.org/jira/browse/LUCENE-9027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16973691#comment-16973691 ] Adrien Grand commented on LUCENE-9027: -- Following a suggestion by [~rcmuir] I specialized the patch for little-endian architectures. Big-endian architectures will get a slowdown due to the need to do a Long#reverseBytes on all read longs, but this probably doesn't matter much and helps simplify DataInput. > SIMD-based decoding of postings lists > - > > Key: LUCENE-9027 > URL: https://issues.apache.org/jira/browse/LUCENE-9027 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Minor > Time Spent: 2h 10m > Remaining Estimate: 0h > > [~rcmuir] has been mentioning the idea for quite some time that we might be > able to write the decoding logic in such a way that Java would use SIMD > instructions. More recently [~paul.masurel] wrote a [blog > post|https://fulmicoton.com/posts/bitpacking/] that raises the point that > Lucene could still do decode multiple ints at once in a single instruction by > packing two ints in a long and we've had some discussions about what we could > try in Lucene to speed up the decoding of postings. This made me want to look > a bit deeper at what we could do. > Our current decoding logic reads data in a byte[] and decodes packed integers > from it. Unfortunately it doesn't make use of SIMD instructions and looks > like > [this|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/NaiveByteDecoder.java]. > I confirmed by looking at the generated assembly that if I take an array of > integers and shift them all by the same number of bits then Java will use > SIMD instructions to shift multiple integers at once. This led me to writing > this > [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SimpleSIMDDecoder.java] > that tries as much as possible to shift long sequences of ints by the same > number of bits to speed up decoding. It is indeed faster than the current > logic we have, up to about 2x faster for some numbers of bits per value. > Currently the best > [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SIMDDecoder.java] > I've been able to come up with combines the above idea with the idea that > Paul mentioned in his blog that consists of emulating SIMD from Java by > packing multiple integers into a long: 2 ints, 4 shorts or 8 bytes. It is a > bit harder to read but gives another speedup on top of the above > implementation. > I have a [JMH > benchmark|https://github.com/jpountz/decode-128-ints-benchmark/] available in > case someone would like to play with this and maybe even come up with an even > faster implementation. It is 2-2.5x faster than our current implementation > for most numbers of bits per value. I'm copying results here: > {noformat} > * `readLongs` just reads 2*bitsPerValue longs from the ByteBuffer, it serves > as >a baseline. > * `decodeNaiveFromBytes` reads a byte[] and decodes from it. This is what the >current Lucene codec does. > * `decodeNaiveFromLongs` decodes from longs on the fly. > * `decodeSimpleSIMD` is a simple implementation that relies on how Java >recognizes some patterns and uses SIMD instructions. > * `decodeSIMD` is a more complex implementation that both relies on the C2 >compiler to generate SIMD instructions and encodes 8 bytes, 4 shorts or >2 ints in a long in order to decompress multiple values at once. > Benchmark (bitsPerValue) (byteOrder) > Mode Cnt Score Error Units > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 1 LE > thrpt5 12.912 ± 0.393 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 1 BE > thrpt5 12.862 ± 0.395 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 2 LE > thrpt5 13.040 ± 1.162 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 2 BE > thrpt5 13.027 ± 0.270 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 3 LE > thrpt5 12.409 ± 0.637 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 3 BE > thrpt5 12.268 ± 0.947 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 4 LE > thrpt5 14.177 ± 2.263 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 4 BE > thrpt5 11.457 ± 0.150 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 5 LE > thrpt5 10.988 ± 1.179 ops/us >
[jira] [Commented] (LUCENE-9027) SIMD-based decoding of postings lists
[ https://issues.apache.org/jira/browse/LUCENE-9027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16973328#comment-16973328 ] Adrien Grand commented on LUCENE-9027: -- I guess it depends what kind of analytics, I expect counts to get the best speedup indeed, since they are only about decoding postings and counting matches. However other analytics use-cases like facets still need to read values for a doc-value field, which likely drowns the speedup a bit, like sorting. We have some benchmarks for Elasticsearch that run nightly at https://benchmarks.elastic.co but most analytics queries like histograms or terms facets run on a match_all query, so I'm not expecting to see any speedup. I've done some more research regarding the prefix sum. Unfortunately, I didn't manage to get C2 to generate SIMD instructions for the prefix sum (even with gaps of 2 or 4 instead of 1). The best option I have for now works by hooking into the decoding logic and summing up longs when they still represent 2 packed integers (which effectively computes the prefix sum for [0:64) and [64:128) in parallel and then add the value at index 63 to all values at indices [64:128). This last step is vectorized by the JVM. See benchmark results at the bottom of https://github.com/jpountz/decode-128-ints-benchmark if you are interested. I wish we could do better, but it's still nice that we can both decompress and compute the prefix sum faster that we can just decompress with the current codec on average. Having the sum pre-computed also helps simplify some conditions in the PostingsEnum implementations of nextDoc/advance. Here is the result of a run on wikibigall with the current pull request. I included some sorting tasks as well to see how much they benefit from this change {noformat} TaskQPS baseline StdDev QPS patch StdDev Pct diff Fuzzy2 99.40 (10.4%) 95.46 (6.9%) -4.0% ( -19% - 14%) IntNRQ 966.00 (2.4%) 977.27 (2.1%) 1.2% ( -3% -5%) Fuzzy1 97.25 (7.8%) 99.66 (9.4%) 2.5% ( -13% - 21%) OrHighHigh 84.95 (3.7%) 87.40 (4.2%) 2.9% ( -4% - 11%) Term 1316.61 (3.1%) 1357.55 (4.8%) 3.1% ( -4% - 11%) HighTermDayOfYearSort 35.89 (6.3%) 37.66 (4.2%) 4.9% ( -5% - 16%) Phrase 60.22 (2.1%) 63.45 (4.3%) 5.4% ( -1% - 12%) HighTermMonthSort 63.83 (9.1%) 67.40 (10.7%) 5.6% ( -12% - 27%) AndHighHigh 27.27 (3.4%) 28.80 (4.0%) 5.6% ( -1% - 13%) SloppyPhrase1.35 (7.4%)1.43 (9.3%) 6.2% ( -9% - 24%) AndHighOrMedMed 26.38 (1.7%) 28.17 (2.6%) 6.8% ( 2% - 11%) IntervalsOrdered 21.99 (2.8%) 23.57 (1.9%) 7.2% ( 2% - 12%) OrHighMed 39.27 (2.8%) 42.32 (3.0%) 7.8% ( 1% - 13%) SpanNear9.74 (3.2%) 10.71 (2.0%) 10.0% ( 4% - 15%) AndHighMed 59.40 (3.1%) 65.44 (3.6%) 10.2% ( 3% - 17%) Wildcard 127.70 (7.7%) 140.93 (3.3%) 10.4% ( 0% - 23%) AndMedOrHighHigh 30.65 (1.4%) 34.15 (2.1%) 11.4% ( 7% - 15%) Prefix3 46.12 (9.9%) 53.01 (10.1%) 14.9% ( -4% - 38%) {noformat} I now plan to focus of getting this into a mergeable state. > SIMD-based decoding of postings lists > - > > Key: LUCENE-9027 > URL: https://issues.apache.org/jira/browse/LUCENE-9027 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Minor > Time Spent: 10m > Remaining Estimate: 0h > > [~rcmuir] has been mentioning the idea for quite some time that we might be > able to write the decoding logic in such a way that Java would use SIMD > instructions. More recently [~paul.masurel] wrote a [blog > post|https://fulmicoton.com/posts/bitpacking/] that raises the point that > Lucene could still do decode multiple ints at once in a single instruction by > packing two ints in a long and we've had some discussions about what we could > try in Lucene to speed up the decoding of postings. This made me want to look > a bit deeper at what we could do. > Our current decoding logic reads data in a byte[] and decodes packed integers > from it. Unfortunately it doesn't make use of SIMD instructions and looks > like >
[jira] [Commented] (LUCENE-9027) SIMD-based decoding of postings lists
[ https://issues.apache.org/jira/browse/LUCENE-9027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16973070#comment-16973070 ] Paul Masurel commented on LUCENE-9027: -- The benefit is probably greater on analytics. Is there a benchmark for the ELK stack somewhere? > SIMD-based decoding of postings lists > - > > Key: LUCENE-9027 > URL: https://issues.apache.org/jira/browse/LUCENE-9027 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Minor > Time Spent: 10m > Remaining Estimate: 0h > > [~rcmuir] has been mentioning the idea for quite some time that we might be > able to write the decoding logic in such a way that Java would use SIMD > instructions. More recently [~paul.masurel] wrote a [blog > post|https://fulmicoton.com/posts/bitpacking/] that raises the point that > Lucene could still do decode multiple ints at once in a single instruction by > packing two ints in a long and we've had some discussions about what we could > try in Lucene to speed up the decoding of postings. This made me want to look > a bit deeper at what we could do. > Our current decoding logic reads data in a byte[] and decodes packed integers > from it. Unfortunately it doesn't make use of SIMD instructions and looks > like > [this|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/NaiveByteDecoder.java]. > I confirmed by looking at the generated assembly that if I take an array of > integers and shift them all by the same number of bits then Java will use > SIMD instructions to shift multiple integers at once. This led me to writing > this > [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SimpleSIMDDecoder.java] > that tries as much as possible to shift long sequences of ints by the same > number of bits to speed up decoding. It is indeed faster than the current > logic we have, up to about 2x faster for some numbers of bits per value. > Currently the best > [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SIMDDecoder.java] > I've been able to come up with combines the above idea with the idea that > Paul mentioned in his blog that consists of emulating SIMD from Java by > packing multiple integers into a long: 2 ints, 4 shorts or 8 bytes. It is a > bit harder to read but gives another speedup on top of the above > implementation. > I have a [JMH > benchmark|https://github.com/jpountz/decode-128-ints-benchmark/] available in > case someone would like to play with this and maybe even come up with an even > faster implementation. It is 2-2.5x faster than our current implementation > for most numbers of bits per value. I'm copying results here: > {noformat} > * `readLongs` just reads 2*bitsPerValue longs from the ByteBuffer, it serves > as >a baseline. > * `decodeNaiveFromBytes` reads a byte[] and decodes from it. This is what the >current Lucene codec does. > * `decodeNaiveFromLongs` decodes from longs on the fly. > * `decodeSimpleSIMD` is a simple implementation that relies on how Java >recognizes some patterns and uses SIMD instructions. > * `decodeSIMD` is a more complex implementation that both relies on the C2 >compiler to generate SIMD instructions and encodes 8 bytes, 4 shorts or >2 ints in a long in order to decompress multiple values at once. > Benchmark (bitsPerValue) (byteOrder) > Mode Cnt Score Error Units > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 1 LE > thrpt5 12.912 ± 0.393 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 1 BE > thrpt5 12.862 ± 0.395 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 2 LE > thrpt5 13.040 ± 1.162 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 2 BE > thrpt5 13.027 ± 0.270 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 3 LE > thrpt5 12.409 ± 0.637 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 3 BE > thrpt5 12.268 ± 0.947 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 4 LE > thrpt5 14.177 ± 2.263 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 4 BE > thrpt5 11.457 ± 0.150 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 5 LE > thrpt5 10.988 ± 1.179 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 5 BE > thrpt5 11.226 ± 0.088 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 6 LE > thrpt5
[jira] [Commented] (LUCENE-9027) SIMD-based decoding of postings lists
[ https://issues.apache.org/jira/browse/LUCENE-9027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16970621#comment-16970621 ] Michael McCandless commented on LUCENE-9027: Thanks [~jpountz] – I forgot that Codec impacts make our {{Term}} query tasks super fast! Your explanation makes sense. I'll look at the PR around endianess – as ugly as it sounds, I think it is the right tradeoff. We should do things at indexing time to make search time faster, and if one of those things is writing multi-byte values in the "right" order for searching, that's fair game! > SIMD-based decoding of postings lists > - > > Key: LUCENE-9027 > URL: https://issues.apache.org/jira/browse/LUCENE-9027 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Minor > Time Spent: 10m > Remaining Estimate: 0h > > [~rcmuir] has been mentioning the idea for quite some time that we might be > able to write the decoding logic in such a way that Java would use SIMD > instructions. More recently [~paul.masurel] wrote a [blog > post|https://fulmicoton.com/posts/bitpacking/] that raises the point that > Lucene could still do decode multiple ints at once in a single instruction by > packing two ints in a long and we've had some discussions about what we could > try in Lucene to speed up the decoding of postings. This made me want to look > a bit deeper at what we could do. > Our current decoding logic reads data in a byte[] and decodes packed integers > from it. Unfortunately it doesn't make use of SIMD instructions and looks > like > [this|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/NaiveByteDecoder.java]. > I confirmed by looking at the generated assembly that if I take an array of > integers and shift them all by the same number of bits then Java will use > SIMD instructions to shift multiple integers at once. This led me to writing > this > [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SimpleSIMDDecoder.java] > that tries as much as possible to shift long sequences of ints by the same > number of bits to speed up decoding. It is indeed faster than the current > logic we have, up to about 2x faster for some numbers of bits per value. > Currently the best > [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SIMDDecoder.java] > I've been able to come up with combines the above idea with the idea that > Paul mentioned in his blog that consists of emulating SIMD from Java by > packing multiple integers into a long: 2 ints, 4 shorts or 8 bytes. It is a > bit harder to read but gives another speedup on top of the above > implementation. > I have a [JMH > benchmark|https://github.com/jpountz/decode-128-ints-benchmark/] available in > case someone would like to play with this and maybe even come up with an even > faster implementation. It is 2-2.5x faster than our current implementation > for most numbers of bits per value. I'm copying results here: > {noformat} > * `readLongs` just reads 2*bitsPerValue longs from the ByteBuffer, it serves > as >a baseline. > * `decodeNaiveFromBytes` reads a byte[] and decodes from it. This is what the >current Lucene codec does. > * `decodeNaiveFromLongs` decodes from longs on the fly. > * `decodeSimpleSIMD` is a simple implementation that relies on how Java >recognizes some patterns and uses SIMD instructions. > * `decodeSIMD` is a more complex implementation that both relies on the C2 >compiler to generate SIMD instructions and encodes 8 bytes, 4 shorts or >2 ints in a long in order to decompress multiple values at once. > Benchmark (bitsPerValue) (byteOrder) > Mode Cnt Score Error Units > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 1 LE > thrpt5 12.912 ± 0.393 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 1 BE > thrpt5 12.862 ± 0.395 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 2 LE > thrpt5 13.040 ± 1.162 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 2 BE > thrpt5 13.027 ± 0.270 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 3 LE > thrpt5 12.409 ± 0.637 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 3 BE > thrpt5 12.268 ± 0.947 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 4 LE > thrpt5 14.177 ± 2.263 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 4 BE > thrpt5 11.457 ± 0.150 ops/us >
[jira] [Commented] (LUCENE-9027) SIMD-based decoding of postings lists
[ https://issues.apache.org/jira/browse/LUCENE-9027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969575#comment-16969575 ] Adrien Grand commented on LUCENE-9027: -- I'm curious whether you have thoughts about how the pull request specializes for the endianness of the machine that creates the index. I'm a bit unhappy about this, but on the other hand reversing bytes proved to be a bottleneck on the benchmarks that I ran. > SIMD-based decoding of postings lists > - > > Key: LUCENE-9027 > URL: https://issues.apache.org/jira/browse/LUCENE-9027 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Minor > Time Spent: 10m > Remaining Estimate: 0h > > [~rcmuir] has been mentioning the idea for quite some time that we might be > able to write the decoding logic in such a way that Java would use SIMD > instructions. More recently [~paul.masurel] wrote a [blog > post|https://fulmicoton.com/posts/bitpacking/] that raises the point that > Lucene could still do decode multiple ints at once in a single instruction by > packing two ints in a long and we've had some discussions about what we could > try in Lucene to speed up the decoding of postings. This made me want to look > a bit deeper at what we could do. > Our current decoding logic reads data in a byte[] and decodes packed integers > from it. Unfortunately it doesn't make use of SIMD instructions and looks > like > [this|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/NaiveByteDecoder.java]. > I confirmed by looking at the generated assembly that if I take an array of > integers and shift them all by the same number of bits then Java will use > SIMD instructions to shift multiple integers at once. This led me to writing > this > [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SimpleSIMDDecoder.java] > that tries as much as possible to shift long sequences of ints by the same > number of bits to speed up decoding. It is indeed faster than the current > logic we have, up to about 2x faster for some numbers of bits per value. > Currently the best > [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SIMDDecoder.java] > I've been able to come up with combines the above idea with the idea that > Paul mentioned in his blog that consists of emulating SIMD from Java by > packing multiple integers into a long: 2 ints, 4 shorts or 8 bytes. It is a > bit harder to read but gives another speedup on top of the above > implementation. > I have a [JMH > benchmark|https://github.com/jpountz/decode-128-ints-benchmark/] available in > case someone would like to play with this and maybe even come up with an even > faster implementation. It is 2-2.5x faster than our current implementation > for most numbers of bits per value. I'm copying results here: > {noformat} > * `readLongs` just reads 2*bitsPerValue longs from the ByteBuffer, it serves > as >a baseline. > * `decodeNaiveFromBytes` reads a byte[] and decodes from it. This is what the >current Lucene codec does. > * `decodeNaiveFromLongs` decodes from longs on the fly. > * `decodeSimpleSIMD` is a simple implementation that relies on how Java >recognizes some patterns and uses SIMD instructions. > * `decodeSIMD` is a more complex implementation that both relies on the C2 >compiler to generate SIMD instructions and encodes 8 bytes, 4 shorts or >2 ints in a long in order to decompress multiple values at once. > Benchmark (bitsPerValue) (byteOrder) > Mode Cnt Score Error Units > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 1 LE > thrpt5 12.912 ± 0.393 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 1 BE > thrpt5 12.862 ± 0.395 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 2 LE > thrpt5 13.040 ± 1.162 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 2 BE > thrpt5 13.027 ± 0.270 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 3 LE > thrpt5 12.409 ± 0.637 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 3 BE > thrpt5 12.268 ± 0.947 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 4 LE > thrpt5 14.177 ± 2.263 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 4 BE > thrpt5 11.457 ± 0.150 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 5 LE > thrpt5 10.988 ± 1.179 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes
[jira] [Commented] (LUCENE-9027) SIMD-based decoding of postings lists
[ https://issues.apache.org/jira/browse/LUCENE-9027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969571#comment-16969571 ] Adrien Grand commented on LUCENE-9027: -- This QPS is consistent with what we are seeing on the nightly benchmarks http://people.apache.org/~mikemccand/lucenebench/Term.html. I think Term doesn't show a speedup because decoding postings is not the bottleneck of term queries. When running a term query, Lucene would only decode blocks of postings that have a competitive match. On the other hand, for queries like AndHighMed, the high-cardinality clause needs to decode lots of blocks, and it's not unlikely that many decoded blocks don't even translate to a match for the conjunction. > SIMD-based decoding of postings lists > - > > Key: LUCENE-9027 > URL: https://issues.apache.org/jira/browse/LUCENE-9027 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Minor > Time Spent: 10m > Remaining Estimate: 0h > > [~rcmuir] has been mentioning the idea for quite some time that we might be > able to write the decoding logic in such a way that Java would use SIMD > instructions. More recently [~paul.masurel] wrote a [blog > post|https://fulmicoton.com/posts/bitpacking/] that raises the point that > Lucene could still do decode multiple ints at once in a single instruction by > packing two ints in a long and we've had some discussions about what we could > try in Lucene to speed up the decoding of postings. This made me want to look > a bit deeper at what we could do. > Our current decoding logic reads data in a byte[] and decodes packed integers > from it. Unfortunately it doesn't make use of SIMD instructions and looks > like > [this|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/NaiveByteDecoder.java]. > I confirmed by looking at the generated assembly that if I take an array of > integers and shift them all by the same number of bits then Java will use > SIMD instructions to shift multiple integers at once. This led me to writing > this > [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SimpleSIMDDecoder.java] > that tries as much as possible to shift long sequences of ints by the same > number of bits to speed up decoding. It is indeed faster than the current > logic we have, up to about 2x faster for some numbers of bits per value. > Currently the best > [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SIMDDecoder.java] > I've been able to come up with combines the above idea with the idea that > Paul mentioned in his blog that consists of emulating SIMD from Java by > packing multiple integers into a long: 2 ints, 4 shorts or 8 bytes. It is a > bit harder to read but gives another speedup on top of the above > implementation. > I have a [JMH > benchmark|https://github.com/jpountz/decode-128-ints-benchmark/] available in > case someone would like to play with this and maybe even come up with an even > faster implementation. It is 2-2.5x faster than our current implementation > for most numbers of bits per value. I'm copying results here: > {noformat} > * `readLongs` just reads 2*bitsPerValue longs from the ByteBuffer, it serves > as >a baseline. > * `decodeNaiveFromBytes` reads a byte[] and decodes from it. This is what the >current Lucene codec does. > * `decodeNaiveFromLongs` decodes from longs on the fly. > * `decodeSimpleSIMD` is a simple implementation that relies on how Java >recognizes some patterns and uses SIMD instructions. > * `decodeSIMD` is a more complex implementation that both relies on the C2 >compiler to generate SIMD instructions and encodes 8 bytes, 4 shorts or >2 ints in a long in order to decompress multiple values at once. > Benchmark (bitsPerValue) (byteOrder) > Mode Cnt Score Error Units > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 1 LE > thrpt5 12.912 ± 0.393 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 1 BE > thrpt5 12.862 ± 0.395 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 2 LE > thrpt5 13.040 ± 1.162 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 2 BE > thrpt5 13.027 ± 0.270 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 3 LE > thrpt5 12.409 ± 0.637 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 3 BE > thrpt5 12.268 ± 0.947 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 4 LE > thrpt5 14.177 ± 2.263
[jira] [Commented] (LUCENE-9027) SIMD-based decoding of postings lists
[ https://issues.apache.org/jira/browse/LUCENE-9027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969339#comment-16969339 ] Michael McCandless commented on LUCENE-9027: Hmm why are your {{Term}} results crazy fast (~1400 QPS) on {{wikibigall}}? I thought {{Term}} would also show gains here, since it just walks through all postings blocks for the term, decoding and collecting? > SIMD-based decoding of postings lists > - > > Key: LUCENE-9027 > URL: https://issues.apache.org/jira/browse/LUCENE-9027 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Minor > Time Spent: 10m > Remaining Estimate: 0h > > [~rcmuir] has been mentioning the idea for quite some time that we might be > able to write the decoding logic in such a way that Java would use SIMD > instructions. More recently [~paul.masurel] wrote a [blog > post|https://fulmicoton.com/posts/bitpacking/] that raises the point that > Lucene could still do decode multiple ints at once in a single instruction by > packing two ints in a long and we've had some discussions about what we could > try in Lucene to speed up the decoding of postings. This made me want to look > a bit deeper at what we could do. > Our current decoding logic reads data in a byte[] and decodes packed integers > from it. Unfortunately it doesn't make use of SIMD instructions and looks > like > [this|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/NaiveByteDecoder.java]. > I confirmed by looking at the generated assembly that if I take an array of > integers and shift them all by the same number of bits then Java will use > SIMD instructions to shift multiple integers at once. This led me to writing > this > [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SimpleSIMDDecoder.java] > that tries as much as possible to shift long sequences of ints by the same > number of bits to speed up decoding. It is indeed faster than the current > logic we have, up to about 2x faster for some numbers of bits per value. > Currently the best > [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SIMDDecoder.java] > I've been able to come up with combines the above idea with the idea that > Paul mentioned in his blog that consists of emulating SIMD from Java by > packing multiple integers into a long: 2 ints, 4 shorts or 8 bytes. It is a > bit harder to read but gives another speedup on top of the above > implementation. > I have a [JMH > benchmark|https://github.com/jpountz/decode-128-ints-benchmark/] available in > case someone would like to play with this and maybe even come up with an even > faster implementation. It is 2-2.5x faster than our current implementation > for most numbers of bits per value. I'm copying results here: > {noformat} > * `readLongs` just reads 2*bitsPerValue longs from the ByteBuffer, it serves > as >a baseline. > * `decodeNaiveFromBytes` reads a byte[] and decodes from it. This is what the >current Lucene codec does. > * `decodeNaiveFromLongs` decodes from longs on the fly. > * `decodeSimpleSIMD` is a simple implementation that relies on how Java >recognizes some patterns and uses SIMD instructions. > * `decodeSIMD` is a more complex implementation that both relies on the C2 >compiler to generate SIMD instructions and encodes 8 bytes, 4 shorts or >2 ints in a long in order to decompress multiple values at once. > Benchmark (bitsPerValue) (byteOrder) > Mode Cnt Score Error Units > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 1 LE > thrpt5 12.912 ± 0.393 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 1 BE > thrpt5 12.862 ± 0.395 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 2 LE > thrpt5 13.040 ± 1.162 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 2 BE > thrpt5 13.027 ± 0.270 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 3 LE > thrpt5 12.409 ± 0.637 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 3 BE > thrpt5 12.268 ± 0.947 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 4 LE > thrpt5 14.177 ± 2.263 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 4 BE > thrpt5 11.457 ± 0.150 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 5 LE > thrpt5 10.988 ± 1.179 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 5 BE > thrpt
[jira] [Commented] (LUCENE-9027) SIMD-based decoding of postings lists
[ https://issues.apache.org/jira/browse/LUCENE-9027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16967237#comment-16967237 ] Robert Muir commented on LUCENE-9027: - +1 for PFOR to avoid the silly worst-cases. IIRC the only reason for the more wasteful FOR was just to make a baby step from vbyte encoding, that was tough enough at the time. Maybe the support for exceptions is even more important for other datasets. > SIMD-based decoding of postings lists > - > > Key: LUCENE-9027 > URL: https://issues.apache.org/jira/browse/LUCENE-9027 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Minor > Time Spent: 10m > Remaining Estimate: 0h > > [~rcmuir] has been mentioning the idea for quite some time that we might be > able to write the decoding logic in such a way that Java would use SIMD > instructions. More recently [~paul.masurel] wrote a [blog > post|https://fulmicoton.com/posts/bitpacking/] that raises the point that > Lucene could still do decode multiple ints at once in a single instruction by > packing two ints in a long and we've had some discussions about what we could > try in Lucene to speed up the decoding of postings. This made me want to look > a bit deeper at what we could do. > Our current decoding logic reads data in a byte[] and decodes packed integers > from it. Unfortunately it doesn't make use of SIMD instructions and looks > like > [this|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/NaiveByteDecoder.java]. > I confirmed by looking at the generated assembly that if I take an array of > integers and shift them all by the same number of bits then Java will use > SIMD instructions to shift multiple integers at once. This led me to writing > this > [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SimpleSIMDDecoder.java] > that tries as much as possible to shift long sequences of ints by the same > number of bits to speed up decoding. It is indeed faster than the current > logic we have, up to about 2x faster for some numbers of bits per value. > Currently the best > [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SIMDDecoder.java] > I've been able to come up with combines the above idea with the idea that > Paul mentioned in his blog that consists of emulating SIMD from Java by > packing multiple integers into a long: 2 ints, 4 shorts or 8 bytes. It is a > bit harder to read but gives another speedup on top of the above > implementation. > I have a [JMH > benchmark|https://github.com/jpountz/decode-128-ints-benchmark/] available in > case someone would like to play with this and maybe even come up with an even > faster implementation. It is 2-2.5x faster than our current implementation > for most numbers of bits per value. I'm copying results here: > {noformat} > * `readLongs` just reads 2*bitsPerValue longs from the ByteBuffer, it serves > as >a baseline. > * `decodeNaiveFromBytes` reads a byte[] and decodes from it. This is what the >current Lucene codec does. > * `decodeNaiveFromLongs` decodes from longs on the fly. > * `decodeSimpleSIMD` is a simple implementation that relies on how Java >recognizes some patterns and uses SIMD instructions. > * `decodeSIMD` is a more complex implementation that both relies on the C2 >compiler to generate SIMD instructions and encodes 8 bytes, 4 shorts or >2 ints in a long in order to decompress multiple values at once. > Benchmark (bitsPerValue) (byteOrder) > Mode Cnt Score Error Units > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 1 LE > thrpt5 12.912 ± 0.393 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 1 BE > thrpt5 12.862 ± 0.395 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 2 LE > thrpt5 13.040 ± 1.162 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 2 BE > thrpt5 13.027 ± 0.270 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 3 LE > thrpt5 12.409 ± 0.637 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 3 BE > thrpt5 12.268 ± 0.947 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 4 LE > thrpt5 14.177 ± 2.263 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 4 BE > thrpt5 11.457 ± 0.150 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 5 LE > thrpt5 10.988 ± 1.179 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes
[jira] [Commented] (LUCENE-9027) SIMD-based decoding of postings lists
[ https://issues.apache.org/jira/browse/LUCENE-9027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16963116#comment-16963116 ] Adrien Grand commented on LUCENE-9027: -- The fact that throughput is better for lower bits per value made me want to try out PFOR delta to try getting lower numbers of bits per value when possible. I implemented it with a maximum of 3 exceptions per block. It made the .doc file 9.4% smaller and the .pos file 8.7% smaller. {noformat} TaskQPS baseline StdDev QPS patch StdDev Pct diff Fuzzy2 71.64 (6.3%) 70.99 (7.9%) -0.9% ( -14% - 14%) IntNRQ 105.31 (5.9%) 104.79 (4.8%) -0.5% ( -10% - 10%) Fuzzy1 130.67 (9.6%) 130.58 (8.8%) -0.1% ( -16% - 20%) Term 1397.31 (3.4%) 1400.02 (3.1%) 0.2% ( -6% -6%) AndHighHigh 34.23 (2.7%) 34.67 (3.5%) 1.3% ( -4% -7%) OrHighHigh 11.64 (2.8%) 11.93 (2.5%) 2.5% ( -2% -8%) OrHighMed 72.36 (3.0%) 76.48 (3.0%) 5.7% ( 0% - 12%) AndHighMed 59.88 (3.1%) 63.68 (3.9%) 6.4% ( 0% - 13%) Wildcard 50.18 (8.4%) 53.41 (8.0%) 6.4% ( -9% - 24%) SpanNear 23.99 (2.9%) 25.56 (1.1%) 6.5% ( 2% - 10%) AndHighOrMedMed 34.24 (2.1%) 36.89 (2.2%) 7.8% ( 3% - 12%) IntervalsOrdered4.39 (6.0%)4.76 (5.3%) 8.3% ( -2% - 20%) AndMedOrHighHigh 30.16 (2.7%) 32.76 (2.6%) 8.6% ( 3% - 14%) Phrase 69.68 (1.3%) 75.76 (1.4%) 8.7% ( 5% - 11%) Prefix3 170.85 (14.3%) 194.70 (10.9%) 14.0% ( -9% - 45%) SloppyPhrase 18.07 (3.8%) 20.92 (4.1%) 15.8% ( 7% - 24%) {noformat} > SIMD-based decoding of postings lists > - > > Key: LUCENE-9027 > URL: https://issues.apache.org/jira/browse/LUCENE-9027 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Minor > Time Spent: 10m > Remaining Estimate: 0h > > [~rcmuir] has been mentioning the idea for quite some time that we might be > able to write the decoding logic in such a way that Java would use SIMD > instructions. More recently [~paul.masurel] wrote a [blog > post|https://fulmicoton.com/posts/bitpacking/] that raises the point that > Lucene could still do decode multiple ints at once in a single instruction by > packing two ints in a long and we've had some discussions about what we could > try in Lucene to speed up the decoding of postings. This made me want to look > a bit deeper at what we could do. > Our current decoding logic reads data in a byte[] and decodes packed integers > from it. Unfortunately it doesn't make use of SIMD instructions and looks > like > [this|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/NaiveByteDecoder.java]. > I confirmed by looking at the generated assembly that if I take an array of > integers and shift them all by the same number of bits then Java will use > SIMD instructions to shift multiple integers at once. This led me to writing > this > [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SimpleSIMDDecoder.java] > that tries as much as possible to shift long sequences of ints by the same > number of bits to speed up decoding. It is indeed faster than the current > logic we have, up to about 2x faster for some numbers of bits per value. > Currently the best > [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SIMDDecoder.java] > I've been able to come up with combines the above idea with the idea that > Paul mentioned in his blog that consists of emulating SIMD from Java by > packing multiple integers into a long: 2 ints, 4 shorts or 8 bytes. It is a > bit harder to read but gives another speedup on top of the above > implementation. > I have a [JMH > benchmark|https://github.com/jpountz/decode-128-ints-benchmark/] available in > case someone would like to play with this and maybe even come up with an even > faster implementation. It is 2-2.5x faster than our current implementation > for most numbers of bits per value. I'm copying results here: > {noformat} > * `readLongs` just reads 2*bitsPerValue longs from the ByteBuffer, it serves > as >a baseline. > * `decodeNaiveFromBytes` reads a
[jira] [Commented] (LUCENE-9027) SIMD-based decoding of postings lists
[ https://issues.apache.org/jira/browse/LUCENE-9027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16959085#comment-16959085 ] Adrien Grand commented on LUCENE-9027: -- This is a file created by vim that should not have been pushed. :) I removed it. > SIMD-based decoding of postings lists > - > > Key: LUCENE-9027 > URL: https://issues.apache.org/jira/browse/LUCENE-9027 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Minor > Time Spent: 10m > Remaining Estimate: 0h > > [~rcmuir] has been mentioning the idea for quite some time that we might be > able to write the decoding logic in such a way that Java would use SIMD > instructions. More recently [~paul.masurel] wrote a [blog > post|https://fulmicoton.com/posts/bitpacking/] that raises the point that > Lucene could still do decode multiple ints at once in a single instruction by > packing two ints in a long and we've had some discussions about what we could > try in Lucene to speed up the decoding of postings. This made me want to look > a bit deeper at what we could do. > Our current decoding logic reads data in a byte[] and decodes packed integers > from it. Unfortunately it doesn't make use of SIMD instructions and looks > like > [this|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/NaiveByteDecoder.java]. > I confirmed by looking at the generated assembly that if I take an array of > integers and shift them all by the same number of bits then Java will use > SIMD instructions to shift multiple integers at once. This led me to writing > this > [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SimpleSIMDDecoder.java] > that tries as much as possible to shift long sequences of ints by the same > number of bits to speed up decoding. It is indeed faster than the current > logic we have, up to about 2x faster for some numbers of bits per value. > Currently the best > [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SIMDDecoder.java] > I've been able to come up with combines the above idea with the idea that > Paul mentioned in his blog that consists of emulating SIMD from Java by > packing multiple integers into a long: 2 ints, 4 shorts or 8 bytes. It is a > bit harder to read but gives another speedup on top of the above > implementation. > I have a [JMH > benchmark|https://github.com/jpountz/decode-128-ints-benchmark/] available in > case someone would like to play with this and maybe even come up with an even > faster implementation. It is 2-2.5x faster than our current implementation > for most numbers of bits per value. I'm copying results here: > {noformat} > * `readLongs` just reads 2*bitsPerValue longs from the ByteBuffer, it serves > as >a baseline. > * `decodeNaiveFromBytes` reads a byte[] and decodes from it. This is what the >current Lucene codec does. > * `decodeNaiveFromLongs` decodes from longs on the fly. > * `decodeSimpleSIMD` is a simple implementation that relies on how Java >recognizes some patterns and uses SIMD instructions. > * `decodeSIMD` is a more complex implementation that both relies on the C2 >compiler to generate SIMD instructions and encodes 8 bytes, 4 shorts or >2 ints in a long in order to decompress multiple values at once. > Benchmark (bitsPerValue) (byteOrder) > Mode Cnt Score Error Units > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 1 LE > thrpt5 12.912 ± 0.393 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 1 BE > thrpt5 12.862 ± 0.395 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 2 LE > thrpt5 13.040 ± 1.162 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 2 BE > thrpt5 13.027 ± 0.270 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 3 LE > thrpt5 12.409 ± 0.637 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 3 BE > thrpt5 12.268 ± 0.947 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 4 LE > thrpt5 14.177 ± 2.263 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 4 BE > thrpt5 11.457 ± 0.150 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 5 LE > thrpt5 10.988 ± 1.179 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 5 BE > thrpt5 11.226 ± 0.088 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 6 LE > thrpt5 9.791 ± 0.305 ops/us >
[jira] [Commented] (LUCENE-9027) SIMD-based decoding of postings lists
[ https://issues.apache.org/jira/browse/LUCENE-9027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16959066#comment-16959066 ] Uwe Schindler commented on LUCENE-9027: --- What is the .swp file in the PR? > SIMD-based decoding of postings lists > - > > Key: LUCENE-9027 > URL: https://issues.apache.org/jira/browse/LUCENE-9027 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Minor > Time Spent: 10m > Remaining Estimate: 0h > > [~rcmuir] has been mentioning the idea for quite some time that we might be > able to write the decoding logic in such a way that Java would use SIMD > instructions. More recently [~paul.masurel] wrote a [blog > post|https://fulmicoton.com/posts/bitpacking/] that raises the point that > Lucene could still do decode multiple ints at once in a single instruction by > packing two ints in a long and we've had some discussions about what we could > try in Lucene to speed up the decoding of postings. This made me want to look > a bit deeper at what we could do. > Our current decoding logic reads data in a byte[] and decodes packed integers > from it. Unfortunately it doesn't make use of SIMD instructions and looks > like > [this|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/NaiveByteDecoder.java]. > I confirmed by looking at the generated assembly that if I take an array of > integers and shift them all by the same number of bits then Java will use > SIMD instructions to shift multiple integers at once. This led me to writing > this > [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SimpleSIMDDecoder.java] > that tries as much as possible to shift long sequences of ints by the same > number of bits to speed up decoding. It is indeed faster than the current > logic we have, up to about 2x faster for some numbers of bits per value. > Currently the best > [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SIMDDecoder.java] > I've been able to come up with combines the above idea with the idea that > Paul mentioned in his blog that consists of emulating SIMD from Java by > packing multiple integers into a long: 2 ints, 4 shorts or 8 bytes. It is a > bit harder to read but gives another speedup on top of the above > implementation. > I have a [JMH > benchmark|https://github.com/jpountz/decode-128-ints-benchmark/] available in > case someone would like to play with this and maybe even come up with an even > faster implementation. It is 2-2.5x faster than our current implementation > for most numbers of bits per value. I'm copying results here: > {noformat} > * `readLongs` just reads 2*bitsPerValue longs from the ByteBuffer, it serves > as >a baseline. > * `decodeNaiveFromBytes` reads a byte[] and decodes from it. This is what the >current Lucene codec does. > * `decodeNaiveFromLongs` decodes from longs on the fly. > * `decodeSimpleSIMD` is a simple implementation that relies on how Java >recognizes some patterns and uses SIMD instructions. > * `decodeSIMD` is a more complex implementation that both relies on the C2 >compiler to generate SIMD instructions and encodes 8 bytes, 4 shorts or >2 ints in a long in order to decompress multiple values at once. > Benchmark (bitsPerValue) (byteOrder) > Mode Cnt Score Error Units > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 1 LE > thrpt5 12.912 ± 0.393 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 1 BE > thrpt5 12.862 ± 0.395 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 2 LE > thrpt5 13.040 ± 1.162 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 2 BE > thrpt5 13.027 ± 0.270 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 3 LE > thrpt5 12.409 ± 0.637 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 3 BE > thrpt5 12.268 ± 0.947 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 4 LE > thrpt5 14.177 ± 2.263 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 4 BE > thrpt5 11.457 ± 0.150 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 5 LE > thrpt5 10.988 ± 1.179 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 5 BE > thrpt5 11.226 ± 0.088 ops/us > PackedIntsDecodeBenchmark.decodeNaiveFromBytes 6 LE > thrpt5 9.791 ± 0.305 ops/us >
[jira] [Commented] (LUCENE-9027) SIMD-based decoding of postings lists
[ https://issues.apache.org/jira/browse/LUCENE-9027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16959054#comment-16959054 ] Adrien Grand commented on LUCENE-9027: -- And here are results on wikibigall {noformat} TaskQPS baseline StdDev QPS patch StdDev Pct diff IntNRQ 1054.07 (2.2%) 999.02 (7.8%) -5.2% ( -14% -4%) Prefix3 135.69 (5.3%) 132.79 (7.2%) -2.1% ( -13% - 11%) Term 1341.46 (2.6%) 1350.12 (2.6%) 0.6% ( -4% -5%) Wildcard 137.74 (1.9%) 138.76 (4.5%) 0.7% ( -5% -7%) Fuzzy1 100.54 (8.9%) 101.56 (8.7%) 1.0% ( -15% - 20%) AndHighHigh 72.82 (2.8%) 74.01 (2.7%) 1.6% ( -3% -7%) SpanNear2.17 (4.5%)2.22 (5.3%) 2.1% ( -7% - 12%) OrHighHigh 11.36 (3.0%) 11.64 (3.5%) 2.5% ( -3% -9%) IntervalsOrdered4.45 (3.7%)4.59 (7.5%) 3.0% ( -7% - 14%) Fuzzy2 87.88 (13.2%) 91.56 (10.0%) 4.2% ( -16% - 31%) OrHighMed 47.04 (2.8%) 49.16 (3.3%) 4.5% ( -1% - 10%) AndHighOrMedMed 26.00 (2.1%) 27.23 (2.6%) 4.7% ( 0% -9%) AndHighMed 65.16 (2.7%) 68.27 (2.9%) 4.8% ( 0% - 10%) AndMedOrHighHigh 27.20 (2.3%) 28.61 (2.4%) 5.2% ( 0% - 10%) Phrase 26.55 (1.2%) 27.96 (2.3%) 5.3% ( 1% -8%) SloppyPhrase 17.99 (2.6%) 19.77 (4.0%) 9.9% ( 3% - 16%) {noformat} As expected it mostly helps queries that use "advance" by large intervals (and might thus decode entire blocks for only a couple doc IDs) and phrases which spend a lot of time reading positions. > SIMD-based decoding of postings lists > - > > Key: LUCENE-9027 > URL: https://issues.apache.org/jira/browse/LUCENE-9027 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Minor > Time Spent: 10m > Remaining Estimate: 0h > > [~rcmuir] has been mentioning the idea for quite some time that we might be > able to write the decoding logic in such a way that Java would use SIMD > instructions. More recently [~paul.masurel] wrote a [blog > post|https://fulmicoton.com/posts/bitpacking/] that raises the point that > Lucene could still do decode multiple ints at once in a single instruction by > packing two ints in a long and we've had some discussions about what we could > try in Lucene to speed up the decoding of postings. This made me want to look > a bit deeper at what we could do. > Our current decoding logic reads data in a byte[] and decodes packed integers > from it. Unfortunately it doesn't make use of SIMD instructions and looks > like > [this|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/NaiveByteDecoder.java]. > I confirmed by looking at the generated assembly that if I take an array of > integers and shift them all by the same number of bits then Java will use > SIMD instructions to shift multiple integers at once. This led me to writing > this > [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SimpleSIMDDecoder.java] > that tries as much as possible to shift long sequences of ints by the same > number of bits to speed up decoding. It is indeed faster than the current > logic we have, up to about 2x faster for some numbers of bits per value. > Currently the best > [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SIMDDecoder.java] > I've been able to come up with combines the above idea with the idea that > Paul mentioned in his blog that consists of emulating SIMD from Java by > packing multiple integers into a long: 2 ints, 4 shorts or 8 bytes. It is a > bit harder to read but gives another speedup on top of the above > implementation. > I have a [JMH > benchmark|https://github.com/jpountz/decode-128-ints-benchmark/] available in > case someone would like to play with this and maybe even come up with an even > faster implementation. It is 2-2.5x faster than our current implementation > for most numbers of bits per value. I'm copying results here: > {noformat} > * `readLongs` just reads 2*bitsPerValue longs from the ByteBuffer, it serves > as >a baseline. > * `decodeNaiveFromBytes` reads a byte[] and decodes from it. This is what the >current