[jira] [Commented] (LUCENE-9027) SIMD-based decoding of postings lists

2020-09-08 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17192488#comment-17192488
 ] 

Michael McCandless commented on LUCENE-9027:


{quote}Furthermore, I think this advertises a best-practice for other format 
writers – to do the same for their formats. If I have a custom DocValuesFormat 
and change it, seeing the best-practice of versions in names clarifies how to 
go about versioning.
{quote}
I think that holds true for our default (supported) codec formats, but it is 
not clear to me that this full discipline is needed for our experimental 
formats.  Users need to understand and maybe be reminded that those formats are 
truly experimental.

> SIMD-based decoding of postings lists
> -
>
> Key: LUCENE-9027
> URL: https://issues.apache.org/jira/browse/LUCENE-9027
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Fix For: 8.4
>
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> [~rcmuir] has been mentioning the idea for quite some time that we might be 
> able to write the decoding logic in such a way that Java would use SIMD 
> instructions. More recently [~paul.masurel] wrote a [blog 
> post|https://fulmicoton.com/posts/bitpacking/] that raises the point that 
> Lucene could still do decode multiple ints at once in a single instruction by 
> packing two ints in a long and we've had some discussions about what we could 
> try in Lucene to speed up the decoding of postings. This made me want to look 
> a bit deeper at what we could do.
> Our current decoding logic reads data in a byte[] and decodes packed integers 
> from it. Unfortunately it doesn't make use of SIMD instructions and looks 
> like 
> [this|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/NaiveByteDecoder.java].
> I confirmed by looking at the generated assembly that if I take an array of 
> integers and shift them all by the same number of bits then Java will use 
> SIMD instructions to shift multiple integers at once. This led me to writing 
> this 
> [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SimpleSIMDDecoder.java]
>  that tries as much as possible to shift long sequences of ints by the same 
> number of bits to speed up decoding. It is indeed faster than the current 
> logic we have, up to about 2x faster for some numbers of bits per value.
> Currently the best 
> [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SIMDDecoder.java]
>  I've been able to come up with combines the above idea with the idea that 
> Paul mentioned in his blog that consists of emulating SIMD from Java by 
> packing multiple integers into a long: 2 ints, 4 shorts or 8 bytes. It is a 
> bit harder to read but gives another speedup on top of the above 
> implementation.
> I have a [JMH 
> benchmark|https://github.com/jpountz/decode-128-ints-benchmark/] available in 
> case someone would like to play with this and maybe even come up with an even 
> faster implementation. It is 2-2.5x faster than our current implementation 
> for most numbers of bits per value. I'm copying results here:
> {noformat}
>  * `readLongs` just reads 2*bitsPerValue longs from the ByteBuffer, it serves 
> as
>a baseline.
>  * `decodeNaiveFromBytes` reads a byte[] and decodes from it. This is what the
>current Lucene codec does.
>  * `decodeNaiveFromLongs` decodes from longs on the fly.
>  * `decodeSimpleSIMD` is a simple implementation that relies on how Java
>recognizes some patterns and uses SIMD instructions.
>  * `decodeSIMD` is a more complex implementation that both relies on the C2
>compiler to generate SIMD instructions and encodes 8 bytes, 4 shorts or
>2 ints in a long in order to decompress multiple values at once.
> Benchmark   (bitsPerValue)  (byteOrder)   
> Mode  Cnt   Score   Error   Units
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   1   LE  
> thrpt5  12.912 ± 0.393  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   1   BE  
> thrpt5  12.862 ± 0.395  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   2   LE  
> thrpt5  13.040 ± 1.162  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   2   BE  
> thrpt5  13.027 ± 0.270  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   3   LE  
> thrpt5  12.409 ± 0.637  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   3   BE  
> thrpt5  12.268 ± 0.947  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   4   LE  
> thrpt5  14.177 ± 

[jira] [Commented] (LUCENE-9027) SIMD-based decoding of postings lists

2020-08-20 Thread David Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181320#comment-17181320
 ] 

David Smiley commented on LUCENE-9027:
--

Furthermore, I think this advertises a best-practice for other format writers 
-- to do the same for their formats.  If I have a custom DocValuesFormat and 
change it, seeing the best-practice of versions in names clarifies how to go 
about versioning.

> SIMD-based decoding of postings lists
> -
>
> Key: LUCENE-9027
> URL: https://issues.apache.org/jira/browse/LUCENE-9027
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Fix For: 8.4
>
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> [~rcmuir] has been mentioning the idea for quite some time that we might be 
> able to write the decoding logic in such a way that Java would use SIMD 
> instructions. More recently [~paul.masurel] wrote a [blog 
> post|https://fulmicoton.com/posts/bitpacking/] that raises the point that 
> Lucene could still do decode multiple ints at once in a single instruction by 
> packing two ints in a long and we've had some discussions about what we could 
> try in Lucene to speed up the decoding of postings. This made me want to look 
> a bit deeper at what we could do.
> Our current decoding logic reads data in a byte[] and decodes packed integers 
> from it. Unfortunately it doesn't make use of SIMD instructions and looks 
> like 
> [this|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/NaiveByteDecoder.java].
> I confirmed by looking at the generated assembly that if I take an array of 
> integers and shift them all by the same number of bits then Java will use 
> SIMD instructions to shift multiple integers at once. This led me to writing 
> this 
> [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SimpleSIMDDecoder.java]
>  that tries as much as possible to shift long sequences of ints by the same 
> number of bits to speed up decoding. It is indeed faster than the current 
> logic we have, up to about 2x faster for some numbers of bits per value.
> Currently the best 
> [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SIMDDecoder.java]
>  I've been able to come up with combines the above idea with the idea that 
> Paul mentioned in his blog that consists of emulating SIMD from Java by 
> packing multiple integers into a long: 2 ints, 4 shorts or 8 bytes. It is a 
> bit harder to read but gives another speedup on top of the above 
> implementation.
> I have a [JMH 
> benchmark|https://github.com/jpountz/decode-128-ints-benchmark/] available in 
> case someone would like to play with this and maybe even come up with an even 
> faster implementation. It is 2-2.5x faster than our current implementation 
> for most numbers of bits per value. I'm copying results here:
> {noformat}
>  * `readLongs` just reads 2*bitsPerValue longs from the ByteBuffer, it serves 
> as
>a baseline.
>  * `decodeNaiveFromBytes` reads a byte[] and decodes from it. This is what the
>current Lucene codec does.
>  * `decodeNaiveFromLongs` decodes from longs on the fly.
>  * `decodeSimpleSIMD` is a simple implementation that relies on how Java
>recognizes some patterns and uses SIMD instructions.
>  * `decodeSIMD` is a more complex implementation that both relies on the C2
>compiler to generate SIMD instructions and encodes 8 bytes, 4 shorts or
>2 ints in a long in order to decompress multiple values at once.
> Benchmark   (bitsPerValue)  (byteOrder)   
> Mode  Cnt   Score   Error   Units
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   1   LE  
> thrpt5  12.912 ± 0.393  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   1   BE  
> thrpt5  12.862 ± 0.395  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   2   LE  
> thrpt5  13.040 ± 1.162  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   2   BE  
> thrpt5  13.027 ± 0.270  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   3   LE  
> thrpt5  12.409 ± 0.637  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   3   BE  
> thrpt5  12.268 ± 0.947  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   4   LE  
> thrpt5  14.177 ± 2.263  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   4   BE  
> thrpt5  11.457 ± 0.150  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   5   LE  
> thrpt5  10.988 ± 1.179  ops/us
> 

[jira] [Commented] (LUCENE-9027) SIMD-based decoding of postings lists

2020-08-20 Thread David Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181319#comment-17181319
 ] 

David Smiley commented on LUCENE-9027:
--

bq. the version doesn't really matter

I disagree.  Without a version, users will likely get mysterious low level 
errors when incompatible versions read/write each other's segments.  It's 
pretty low-maintenance to just apply this version suffix strategy on the name 
and bump it appropriately.

> SIMD-based decoding of postings lists
> -
>
> Key: LUCENE-9027
> URL: https://issues.apache.org/jira/browse/LUCENE-9027
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Fix For: 8.4
>
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> [~rcmuir] has been mentioning the idea for quite some time that we might be 
> able to write the decoding logic in such a way that Java would use SIMD 
> instructions. More recently [~paul.masurel] wrote a [blog 
> post|https://fulmicoton.com/posts/bitpacking/] that raises the point that 
> Lucene could still do decode multiple ints at once in a single instruction by 
> packing two ints in a long and we've had some discussions about what we could 
> try in Lucene to speed up the decoding of postings. This made me want to look 
> a bit deeper at what we could do.
> Our current decoding logic reads data in a byte[] and decodes packed integers 
> from it. Unfortunately it doesn't make use of SIMD instructions and looks 
> like 
> [this|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/NaiveByteDecoder.java].
> I confirmed by looking at the generated assembly that if I take an array of 
> integers and shift them all by the same number of bits then Java will use 
> SIMD instructions to shift multiple integers at once. This led me to writing 
> this 
> [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SimpleSIMDDecoder.java]
>  that tries as much as possible to shift long sequences of ints by the same 
> number of bits to speed up decoding. It is indeed faster than the current 
> logic we have, up to about 2x faster for some numbers of bits per value.
> Currently the best 
> [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SIMDDecoder.java]
>  I've been able to come up with combines the above idea with the idea that 
> Paul mentioned in his blog that consists of emulating SIMD from Java by 
> packing multiple integers into a long: 2 ints, 4 shorts or 8 bytes. It is a 
> bit harder to read but gives another speedup on top of the above 
> implementation.
> I have a [JMH 
> benchmark|https://github.com/jpountz/decode-128-ints-benchmark/] available in 
> case someone would like to play with this and maybe even come up with an even 
> faster implementation. It is 2-2.5x faster than our current implementation 
> for most numbers of bits per value. I'm copying results here:
> {noformat}
>  * `readLongs` just reads 2*bitsPerValue longs from the ByteBuffer, it serves 
> as
>a baseline.
>  * `decodeNaiveFromBytes` reads a byte[] and decodes from it. This is what the
>current Lucene codec does.
>  * `decodeNaiveFromLongs` decodes from longs on the fly.
>  * `decodeSimpleSIMD` is a simple implementation that relies on how Java
>recognizes some patterns and uses SIMD instructions.
>  * `decodeSIMD` is a more complex implementation that both relies on the C2
>compiler to generate SIMD instructions and encodes 8 bytes, 4 shorts or
>2 ints in a long in order to decompress multiple values at once.
> Benchmark   (bitsPerValue)  (byteOrder)   
> Mode  Cnt   Score   Error   Units
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   1   LE  
> thrpt5  12.912 ± 0.393  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   1   BE  
> thrpt5  12.862 ± 0.395  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   2   LE  
> thrpt5  13.040 ± 1.162  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   2   BE  
> thrpt5  13.027 ± 0.270  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   3   LE  
> thrpt5  12.409 ± 0.637  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   3   BE  
> thrpt5  12.268 ± 0.947  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   4   LE  
> thrpt5  14.177 ± 2.263  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   4   BE  
> thrpt5  11.457 ± 0.150  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   5   LE  
> thrpt5  10.988 ± 

[jira] [Commented] (LUCENE-9027) SIMD-based decoding of postings lists

2020-08-19 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17180780#comment-17180780
 ] 

Michael McCandless commented on LUCENE-9027:


Ahh thanks [~jpountz].

Yeah +1 to remove any implied versioning from this experimental postings 
format's name!

> SIMD-based decoding of postings lists
> -
>
> Key: LUCENE-9027
> URL: https://issues.apache.org/jira/browse/LUCENE-9027
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Fix For: 8.4
>
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> [~rcmuir] has been mentioning the idea for quite some time that we might be 
> able to write the decoding logic in such a way that Java would use SIMD 
> instructions. More recently [~paul.masurel] wrote a [blog 
> post|https://fulmicoton.com/posts/bitpacking/] that raises the point that 
> Lucene could still do decode multiple ints at once in a single instruction by 
> packing two ints in a long and we've had some discussions about what we could 
> try in Lucene to speed up the decoding of postings. This made me want to look 
> a bit deeper at what we could do.
> Our current decoding logic reads data in a byte[] and decodes packed integers 
> from it. Unfortunately it doesn't make use of SIMD instructions and looks 
> like 
> [this|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/NaiveByteDecoder.java].
> I confirmed by looking at the generated assembly that if I take an array of 
> integers and shift them all by the same number of bits then Java will use 
> SIMD instructions to shift multiple integers at once. This led me to writing 
> this 
> [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SimpleSIMDDecoder.java]
>  that tries as much as possible to shift long sequences of ints by the same 
> number of bits to speed up decoding. It is indeed faster than the current 
> logic we have, up to about 2x faster for some numbers of bits per value.
> Currently the best 
> [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SIMDDecoder.java]
>  I've been able to come up with combines the above idea with the idea that 
> Paul mentioned in his blog that consists of emulating SIMD from Java by 
> packing multiple integers into a long: 2 ints, 4 shorts or 8 bytes. It is a 
> bit harder to read but gives another speedup on top of the above 
> implementation.
> I have a [JMH 
> benchmark|https://github.com/jpountz/decode-128-ints-benchmark/] available in 
> case someone would like to play with this and maybe even come up with an even 
> faster implementation. It is 2-2.5x faster than our current implementation 
> for most numbers of bits per value. I'm copying results here:
> {noformat}
>  * `readLongs` just reads 2*bitsPerValue longs from the ByteBuffer, it serves 
> as
>a baseline.
>  * `decodeNaiveFromBytes` reads a byte[] and decodes from it. This is what the
>current Lucene codec does.
>  * `decodeNaiveFromLongs` decodes from longs on the fly.
>  * `decodeSimpleSIMD` is a simple implementation that relies on how Java
>recognizes some patterns and uses SIMD instructions.
>  * `decodeSIMD` is a more complex implementation that both relies on the C2
>compiler to generate SIMD instructions and encodes 8 bytes, 4 shorts or
>2 ints in a long in order to decompress multiple values at once.
> Benchmark   (bitsPerValue)  (byteOrder)   
> Mode  Cnt   Score   Error   Units
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   1   LE  
> thrpt5  12.912 ± 0.393  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   1   BE  
> thrpt5  12.862 ± 0.395  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   2   LE  
> thrpt5  13.040 ± 1.162  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   2   BE  
> thrpt5  13.027 ± 0.270  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   3   LE  
> thrpt5  12.409 ± 0.637  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   3   BE  
> thrpt5  12.268 ± 0.947  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   4   LE  
> thrpt5  14.177 ± 2.263  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   4   BE  
> thrpt5  11.457 ± 0.150  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   5   LE  
> thrpt5  10.988 ± 1.179  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   5   BE  
> thrpt5  11.226 ± 0.088  ops/us
> 

[jira] [Commented] (LUCENE-9027) SIMD-based decoding of postings lists

2020-08-19 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17180639#comment-17180639
 ] 

Adrien Grand commented on LUCENE-9027:
--

I think that David is thinking of e.g. "FST50" (FSTPostingsFormat). Since this 
postings format is about the terms dictionary (it mostly reuses 
Lucene84PostingsWriter/Reader for convenience) I was inclined to keep it named 
FST50 since the format of the terms dictionary had not changed. I don't have 
strong feelings though, I'd be ok with renaming it "FST84", or maybe even just 
"FST" as the version doesn't really matter for these postings formats that we 
don't provide backward compatibility for.

> SIMD-based decoding of postings lists
> -
>
> Key: LUCENE-9027
> URL: https://issues.apache.org/jira/browse/LUCENE-9027
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Fix For: 8.4
>
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> [~rcmuir] has been mentioning the idea for quite some time that we might be 
> able to write the decoding logic in such a way that Java would use SIMD 
> instructions. More recently [~paul.masurel] wrote a [blog 
> post|https://fulmicoton.com/posts/bitpacking/] that raises the point that 
> Lucene could still do decode multiple ints at once in a single instruction by 
> packing two ints in a long and we've had some discussions about what we could 
> try in Lucene to speed up the decoding of postings. This made me want to look 
> a bit deeper at what we could do.
> Our current decoding logic reads data in a byte[] and decodes packed integers 
> from it. Unfortunately it doesn't make use of SIMD instructions and looks 
> like 
> [this|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/NaiveByteDecoder.java].
> I confirmed by looking at the generated assembly that if I take an array of 
> integers and shift them all by the same number of bits then Java will use 
> SIMD instructions to shift multiple integers at once. This led me to writing 
> this 
> [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SimpleSIMDDecoder.java]
>  that tries as much as possible to shift long sequences of ints by the same 
> number of bits to speed up decoding. It is indeed faster than the current 
> logic we have, up to about 2x faster for some numbers of bits per value.
> Currently the best 
> [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SIMDDecoder.java]
>  I've been able to come up with combines the above idea with the idea that 
> Paul mentioned in his blog that consists of emulating SIMD from Java by 
> packing multiple integers into a long: 2 ints, 4 shorts or 8 bytes. It is a 
> bit harder to read but gives another speedup on top of the above 
> implementation.
> I have a [JMH 
> benchmark|https://github.com/jpountz/decode-128-ints-benchmark/] available in 
> case someone would like to play with this and maybe even come up with an even 
> faster implementation. It is 2-2.5x faster than our current implementation 
> for most numbers of bits per value. I'm copying results here:
> {noformat}
>  * `readLongs` just reads 2*bitsPerValue longs from the ByteBuffer, it serves 
> as
>a baseline.
>  * `decodeNaiveFromBytes` reads a byte[] and decodes from it. This is what the
>current Lucene codec does.
>  * `decodeNaiveFromLongs` decodes from longs on the fly.
>  * `decodeSimpleSIMD` is a simple implementation that relies on how Java
>recognizes some patterns and uses SIMD instructions.
>  * `decodeSIMD` is a more complex implementation that both relies on the C2
>compiler to generate SIMD instructions and encodes 8 bytes, 4 shorts or
>2 ints in a long in order to decompress multiple values at once.
> Benchmark   (bitsPerValue)  (byteOrder)   
> Mode  Cnt   Score   Error   Units
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   1   LE  
> thrpt5  12.912 ± 0.393  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   1   BE  
> thrpt5  12.862 ± 0.395  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   2   LE  
> thrpt5  13.040 ± 1.162  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   2   BE  
> thrpt5  13.027 ± 0.270  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   3   LE  
> thrpt5  12.409 ± 0.637  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   3   BE  
> thrpt5  12.268 ± 0.947  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   4   LE  
> thrpt5  14.177 ± 2.263  ops/us
> 

[jira] [Commented] (LUCENE-9027) SIMD-based decoding of postings lists

2020-08-18 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17179726#comment-17179726
 ] 

Michael McCandless commented on LUCENE-9027:


{quote}_Shouldn't the other postings formats that switched from use of this 
postings write/reader also bump their version suffix?_
{quote}
Hmm, can you give an example?  What other postings formats (besides 
{{Lucene84PostingsFormat}}) are using the new {{Lucene84PostingsWriter/Reader}}?

> SIMD-based decoding of postings lists
> -
>
> Key: LUCENE-9027
> URL: https://issues.apache.org/jira/browse/LUCENE-9027
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Fix For: 8.4
>
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> [~rcmuir] has been mentioning the idea for quite some time that we might be 
> able to write the decoding logic in such a way that Java would use SIMD 
> instructions. More recently [~paul.masurel] wrote a [blog 
> post|https://fulmicoton.com/posts/bitpacking/] that raises the point that 
> Lucene could still do decode multiple ints at once in a single instruction by 
> packing two ints in a long and we've had some discussions about what we could 
> try in Lucene to speed up the decoding of postings. This made me want to look 
> a bit deeper at what we could do.
> Our current decoding logic reads data in a byte[] and decodes packed integers 
> from it. Unfortunately it doesn't make use of SIMD instructions and looks 
> like 
> [this|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/NaiveByteDecoder.java].
> I confirmed by looking at the generated assembly that if I take an array of 
> integers and shift them all by the same number of bits then Java will use 
> SIMD instructions to shift multiple integers at once. This led me to writing 
> this 
> [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SimpleSIMDDecoder.java]
>  that tries as much as possible to shift long sequences of ints by the same 
> number of bits to speed up decoding. It is indeed faster than the current 
> logic we have, up to about 2x faster for some numbers of bits per value.
> Currently the best 
> [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SIMDDecoder.java]
>  I've been able to come up with combines the above idea with the idea that 
> Paul mentioned in his blog that consists of emulating SIMD from Java by 
> packing multiple integers into a long: 2 ints, 4 shorts or 8 bytes. It is a 
> bit harder to read but gives another speedup on top of the above 
> implementation.
> I have a [JMH 
> benchmark|https://github.com/jpountz/decode-128-ints-benchmark/] available in 
> case someone would like to play with this and maybe even come up with an even 
> faster implementation. It is 2-2.5x faster than our current implementation 
> for most numbers of bits per value. I'm copying results here:
> {noformat}
>  * `readLongs` just reads 2*bitsPerValue longs from the ByteBuffer, it serves 
> as
>a baseline.
>  * `decodeNaiveFromBytes` reads a byte[] and decodes from it. This is what the
>current Lucene codec does.
>  * `decodeNaiveFromLongs` decodes from longs on the fly.
>  * `decodeSimpleSIMD` is a simple implementation that relies on how Java
>recognizes some patterns and uses SIMD instructions.
>  * `decodeSIMD` is a more complex implementation that both relies on the C2
>compiler to generate SIMD instructions and encodes 8 bytes, 4 shorts or
>2 ints in a long in order to decompress multiple values at once.
> Benchmark   (bitsPerValue)  (byteOrder)   
> Mode  Cnt   Score   Error   Units
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   1   LE  
> thrpt5  12.912 ± 0.393  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   1   BE  
> thrpt5  12.862 ± 0.395  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   2   LE  
> thrpt5  13.040 ± 1.162  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   2   BE  
> thrpt5  13.027 ± 0.270  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   3   LE  
> thrpt5  12.409 ± 0.637  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   3   BE  
> thrpt5  12.268 ± 0.947  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   4   LE  
> thrpt5  14.177 ± 2.263  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   4   BE  
> thrpt5  11.457 ± 0.150  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   5   LE  
> thrpt5  

[jira] [Commented] (LUCENE-9027) SIMD-based decoding of postings lists

2020-08-17 Thread David Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17179200#comment-17179200
 ] 

David Smiley commented on LUCENE-9027:
--

Question:  This feature introduced Lucene84PostingsFormat because, presumably, 
the use of Lucene50PostingsWriter changed to Lucene84PostingsWriter (ditto for 
reader).  New formats have new names: "Lucene50" -> "Lucene84".  _Shouldn't the 
other postings formats that switched from use of this postings write/reader 
also bump their version suffix?_  No need to duplicate them because we may not 
want the maintenance of actually supporting the previous version.  Too late 
now, of course.   This format naming/version strategy seems like something that 
ought to be documented.

> SIMD-based decoding of postings lists
> -
>
> Key: LUCENE-9027
> URL: https://issues.apache.org/jira/browse/LUCENE-9027
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Fix For: 8.4
>
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> [~rcmuir] has been mentioning the idea for quite some time that we might be 
> able to write the decoding logic in such a way that Java would use SIMD 
> instructions. More recently [~paul.masurel] wrote a [blog 
> post|https://fulmicoton.com/posts/bitpacking/] that raises the point that 
> Lucene could still do decode multiple ints at once in a single instruction by 
> packing two ints in a long and we've had some discussions about what we could 
> try in Lucene to speed up the decoding of postings. This made me want to look 
> a bit deeper at what we could do.
> Our current decoding logic reads data in a byte[] and decodes packed integers 
> from it. Unfortunately it doesn't make use of SIMD instructions and looks 
> like 
> [this|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/NaiveByteDecoder.java].
> I confirmed by looking at the generated assembly that if I take an array of 
> integers and shift them all by the same number of bits then Java will use 
> SIMD instructions to shift multiple integers at once. This led me to writing 
> this 
> [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SimpleSIMDDecoder.java]
>  that tries as much as possible to shift long sequences of ints by the same 
> number of bits to speed up decoding. It is indeed faster than the current 
> logic we have, up to about 2x faster for some numbers of bits per value.
> Currently the best 
> [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SIMDDecoder.java]
>  I've been able to come up with combines the above idea with the idea that 
> Paul mentioned in his blog that consists of emulating SIMD from Java by 
> packing multiple integers into a long: 2 ints, 4 shorts or 8 bytes. It is a 
> bit harder to read but gives another speedup on top of the above 
> implementation.
> I have a [JMH 
> benchmark|https://github.com/jpountz/decode-128-ints-benchmark/] available in 
> case someone would like to play with this and maybe even come up with an even 
> faster implementation. It is 2-2.5x faster than our current implementation 
> for most numbers of bits per value. I'm copying results here:
> {noformat}
>  * `readLongs` just reads 2*bitsPerValue longs from the ByteBuffer, it serves 
> as
>a baseline.
>  * `decodeNaiveFromBytes` reads a byte[] and decodes from it. This is what the
>current Lucene codec does.
>  * `decodeNaiveFromLongs` decodes from longs on the fly.
>  * `decodeSimpleSIMD` is a simple implementation that relies on how Java
>recognizes some patterns and uses SIMD instructions.
>  * `decodeSIMD` is a more complex implementation that both relies on the C2
>compiler to generate SIMD instructions and encodes 8 bytes, 4 shorts or
>2 ints in a long in order to decompress multiple values at once.
> Benchmark   (bitsPerValue)  (byteOrder)   
> Mode  Cnt   Score   Error   Units
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   1   LE  
> thrpt5  12.912 ± 0.393  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   1   BE  
> thrpt5  12.862 ± 0.395  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   2   LE  
> thrpt5  13.040 ± 1.162  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   2   BE  
> thrpt5  13.027 ± 0.270  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   3   LE  
> thrpt5  12.409 ± 0.637  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   3   BE  
> thrpt5  12.268 ± 0.947  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes

[jira] [Commented] (LUCENE-9027) SIMD-based decoding of postings lists

2019-11-22 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16980279#comment-16980279
 ] 

ASF subversion and git services commented on LUCENE-9027:
-

Commit bc758601cd8f77136e0f8bb8467927c3e37c7ddf in lucene-solr's branch 
refs/heads/branch_8x from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=bc75860 ]

LUCENE-9027: Try to get back some indexing speed.


> SIMD-based decoding of postings lists
> -
>
> Key: LUCENE-9027
> URL: https://issues.apache.org/jira/browse/LUCENE-9027
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Fix For: 8.4
>
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> [~rcmuir] has been mentioning the idea for quite some time that we might be 
> able to write the decoding logic in such a way that Java would use SIMD 
> instructions. More recently [~paul.masurel] wrote a [blog 
> post|https://fulmicoton.com/posts/bitpacking/] that raises the point that 
> Lucene could still do decode multiple ints at once in a single instruction by 
> packing two ints in a long and we've had some discussions about what we could 
> try in Lucene to speed up the decoding of postings. This made me want to look 
> a bit deeper at what we could do.
> Our current decoding logic reads data in a byte[] and decodes packed integers 
> from it. Unfortunately it doesn't make use of SIMD instructions and looks 
> like 
> [this|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/NaiveByteDecoder.java].
> I confirmed by looking at the generated assembly that if I take an array of 
> integers and shift them all by the same number of bits then Java will use 
> SIMD instructions to shift multiple integers at once. This led me to writing 
> this 
> [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SimpleSIMDDecoder.java]
>  that tries as much as possible to shift long sequences of ints by the same 
> number of bits to speed up decoding. It is indeed faster than the current 
> logic we have, up to about 2x faster for some numbers of bits per value.
> Currently the best 
> [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SIMDDecoder.java]
>  I've been able to come up with combines the above idea with the idea that 
> Paul mentioned in his blog that consists of emulating SIMD from Java by 
> packing multiple integers into a long: 2 ints, 4 shorts or 8 bytes. It is a 
> bit harder to read but gives another speedup on top of the above 
> implementation.
> I have a [JMH 
> benchmark|https://github.com/jpountz/decode-128-ints-benchmark/] available in 
> case someone would like to play with this and maybe even come up with an even 
> faster implementation. It is 2-2.5x faster than our current implementation 
> for most numbers of bits per value. I'm copying results here:
> {noformat}
>  * `readLongs` just reads 2*bitsPerValue longs from the ByteBuffer, it serves 
> as
>a baseline.
>  * `decodeNaiveFromBytes` reads a byte[] and decodes from it. This is what the
>current Lucene codec does.
>  * `decodeNaiveFromLongs` decodes from longs on the fly.
>  * `decodeSimpleSIMD` is a simple implementation that relies on how Java
>recognizes some patterns and uses SIMD instructions.
>  * `decodeSIMD` is a more complex implementation that both relies on the C2
>compiler to generate SIMD instructions and encodes 8 bytes, 4 shorts or
>2 ints in a long in order to decompress multiple values at once.
> Benchmark   (bitsPerValue)  (byteOrder)   
> Mode  Cnt   Score   Error   Units
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   1   LE  
> thrpt5  12.912 ± 0.393  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   1   BE  
> thrpt5  12.862 ± 0.395  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   2   LE  
> thrpt5  13.040 ± 1.162  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   2   BE  
> thrpt5  13.027 ± 0.270  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   3   LE  
> thrpt5  12.409 ± 0.637  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   3   BE  
> thrpt5  12.268 ± 0.947  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   4   LE  
> thrpt5  14.177 ± 2.263  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   4   BE  
> thrpt5  11.457 ± 0.150  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   5   LE  
> thrpt5  10.988 ± 1.179  ops/us
> 

[jira] [Commented] (LUCENE-9027) SIMD-based decoding of postings lists

2019-11-22 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16980280#comment-16980280
 ] 

ASF subversion and git services commented on LUCENE-9027:
-

Commit c51006c3c48e41dfb68b62cdaf39916d5eed65b8 in lucene-solr's branch 
refs/heads/master from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=c51006c ]

LUCENE-9027: Try to get back some indexing speed.


> SIMD-based decoding of postings lists
> -
>
> Key: LUCENE-9027
> URL: https://issues.apache.org/jira/browse/LUCENE-9027
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Fix For: 8.4
>
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> [~rcmuir] has been mentioning the idea for quite some time that we might be 
> able to write the decoding logic in such a way that Java would use SIMD 
> instructions. More recently [~paul.masurel] wrote a [blog 
> post|https://fulmicoton.com/posts/bitpacking/] that raises the point that 
> Lucene could still do decode multiple ints at once in a single instruction by 
> packing two ints in a long and we've had some discussions about what we could 
> try in Lucene to speed up the decoding of postings. This made me want to look 
> a bit deeper at what we could do.
> Our current decoding logic reads data in a byte[] and decodes packed integers 
> from it. Unfortunately it doesn't make use of SIMD instructions and looks 
> like 
> [this|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/NaiveByteDecoder.java].
> I confirmed by looking at the generated assembly that if I take an array of 
> integers and shift them all by the same number of bits then Java will use 
> SIMD instructions to shift multiple integers at once. This led me to writing 
> this 
> [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SimpleSIMDDecoder.java]
>  that tries as much as possible to shift long sequences of ints by the same 
> number of bits to speed up decoding. It is indeed faster than the current 
> logic we have, up to about 2x faster for some numbers of bits per value.
> Currently the best 
> [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SIMDDecoder.java]
>  I've been able to come up with combines the above idea with the idea that 
> Paul mentioned in his blog that consists of emulating SIMD from Java by 
> packing multiple integers into a long: 2 ints, 4 shorts or 8 bytes. It is a 
> bit harder to read but gives another speedup on top of the above 
> implementation.
> I have a [JMH 
> benchmark|https://github.com/jpountz/decode-128-ints-benchmark/] available in 
> case someone would like to play with this and maybe even come up with an even 
> faster implementation. It is 2-2.5x faster than our current implementation 
> for most numbers of bits per value. I'm copying results here:
> {noformat}
>  * `readLongs` just reads 2*bitsPerValue longs from the ByteBuffer, it serves 
> as
>a baseline.
>  * `decodeNaiveFromBytes` reads a byte[] and decodes from it. This is what the
>current Lucene codec does.
>  * `decodeNaiveFromLongs` decodes from longs on the fly.
>  * `decodeSimpleSIMD` is a simple implementation that relies on how Java
>recognizes some patterns and uses SIMD instructions.
>  * `decodeSIMD` is a more complex implementation that both relies on the C2
>compiler to generate SIMD instructions and encodes 8 bytes, 4 shorts or
>2 ints in a long in order to decompress multiple values at once.
> Benchmark   (bitsPerValue)  (byteOrder)   
> Mode  Cnt   Score   Error   Units
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   1   LE  
> thrpt5  12.912 ± 0.393  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   1   BE  
> thrpt5  12.862 ± 0.395  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   2   LE  
> thrpt5  13.040 ± 1.162  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   2   BE  
> thrpt5  13.027 ± 0.270  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   3   LE  
> thrpt5  12.409 ± 0.637  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   3   BE  
> thrpt5  12.268 ± 0.947  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   4   LE  
> thrpt5  14.177 ± 2.263  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   4   BE  
> thrpt5  11.457 ± 0.150  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   5   LE  
> thrpt5  10.988 ± 1.179  ops/us
> 

[jira] [Commented] (LUCENE-9027) SIMD-based decoding of postings lists

2019-11-22 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16980032#comment-16980032
 ] 

Adrien Grand commented on LUCENE-9027:
--

This and my followup changes seem to have introduced a hit on indexing 
throughput http://people.apache.org/~mikemccand/lucenebench/indexing.html. I'll 
dig.

> SIMD-based decoding of postings lists
> -
>
> Key: LUCENE-9027
> URL: https://issues.apache.org/jira/browse/LUCENE-9027
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Fix For: 8.4
>
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> [~rcmuir] has been mentioning the idea for quite some time that we might be 
> able to write the decoding logic in such a way that Java would use SIMD 
> instructions. More recently [~paul.masurel] wrote a [blog 
> post|https://fulmicoton.com/posts/bitpacking/] that raises the point that 
> Lucene could still do decode multiple ints at once in a single instruction by 
> packing two ints in a long and we've had some discussions about what we could 
> try in Lucene to speed up the decoding of postings. This made me want to look 
> a bit deeper at what we could do.
> Our current decoding logic reads data in a byte[] and decodes packed integers 
> from it. Unfortunately it doesn't make use of SIMD instructions and looks 
> like 
> [this|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/NaiveByteDecoder.java].
> I confirmed by looking at the generated assembly that if I take an array of 
> integers and shift them all by the same number of bits then Java will use 
> SIMD instructions to shift multiple integers at once. This led me to writing 
> this 
> [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SimpleSIMDDecoder.java]
>  that tries as much as possible to shift long sequences of ints by the same 
> number of bits to speed up decoding. It is indeed faster than the current 
> logic we have, up to about 2x faster for some numbers of bits per value.
> Currently the best 
> [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SIMDDecoder.java]
>  I've been able to come up with combines the above idea with the idea that 
> Paul mentioned in his blog that consists of emulating SIMD from Java by 
> packing multiple integers into a long: 2 ints, 4 shorts or 8 bytes. It is a 
> bit harder to read but gives another speedup on top of the above 
> implementation.
> I have a [JMH 
> benchmark|https://github.com/jpountz/decode-128-ints-benchmark/] available in 
> case someone would like to play with this and maybe even come up with an even 
> faster implementation. It is 2-2.5x faster than our current implementation 
> for most numbers of bits per value. I'm copying results here:
> {noformat}
>  * `readLongs` just reads 2*bitsPerValue longs from the ByteBuffer, it serves 
> as
>a baseline.
>  * `decodeNaiveFromBytes` reads a byte[] and decodes from it. This is what the
>current Lucene codec does.
>  * `decodeNaiveFromLongs` decodes from longs on the fly.
>  * `decodeSimpleSIMD` is a simple implementation that relies on how Java
>recognizes some patterns and uses SIMD instructions.
>  * `decodeSIMD` is a more complex implementation that both relies on the C2
>compiler to generate SIMD instructions and encodes 8 bytes, 4 shorts or
>2 ints in a long in order to decompress multiple values at once.
> Benchmark   (bitsPerValue)  (byteOrder)   
> Mode  Cnt   Score   Error   Units
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   1   LE  
> thrpt5  12.912 ± 0.393  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   1   BE  
> thrpt5  12.862 ± 0.395  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   2   LE  
> thrpt5  13.040 ± 1.162  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   2   BE  
> thrpt5  13.027 ± 0.270  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   3   LE  
> thrpt5  12.409 ± 0.637  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   3   BE  
> thrpt5  12.268 ± 0.947  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   4   LE  
> thrpt5  14.177 ± 2.263  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   4   BE  
> thrpt5  11.457 ± 0.150  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   5   LE  
> thrpt5  10.988 ± 1.179  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   5   BE  
> thrpt5  11.226 ± 0.088  ops/us
> 

[jira] [Commented] (LUCENE-9027) SIMD-based decoding of postings lists

2019-11-20 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16978213#comment-16978213
 ] 

Bruno Roustant commented on LUCENE-9027:


Impressive!

> SIMD-based decoding of postings lists
> -
>
> Key: LUCENE-9027
> URL: https://issues.apache.org/jira/browse/LUCENE-9027
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Fix For: 8.4
>
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> [~rcmuir] has been mentioning the idea for quite some time that we might be 
> able to write the decoding logic in such a way that Java would use SIMD 
> instructions. More recently [~paul.masurel] wrote a [blog 
> post|https://fulmicoton.com/posts/bitpacking/] that raises the point that 
> Lucene could still do decode multiple ints at once in a single instruction by 
> packing two ints in a long and we've had some discussions about what we could 
> try in Lucene to speed up the decoding of postings. This made me want to look 
> a bit deeper at what we could do.
> Our current decoding logic reads data in a byte[] and decodes packed integers 
> from it. Unfortunately it doesn't make use of SIMD instructions and looks 
> like 
> [this|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/NaiveByteDecoder.java].
> I confirmed by looking at the generated assembly that if I take an array of 
> integers and shift them all by the same number of bits then Java will use 
> SIMD instructions to shift multiple integers at once. This led me to writing 
> this 
> [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SimpleSIMDDecoder.java]
>  that tries as much as possible to shift long sequences of ints by the same 
> number of bits to speed up decoding. It is indeed faster than the current 
> logic we have, up to about 2x faster for some numbers of bits per value.
> Currently the best 
> [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SIMDDecoder.java]
>  I've been able to come up with combines the above idea with the idea that 
> Paul mentioned in his blog that consists of emulating SIMD from Java by 
> packing multiple integers into a long: 2 ints, 4 shorts or 8 bytes. It is a 
> bit harder to read but gives another speedup on top of the above 
> implementation.
> I have a [JMH 
> benchmark|https://github.com/jpountz/decode-128-ints-benchmark/] available in 
> case someone would like to play with this and maybe even come up with an even 
> faster implementation. It is 2-2.5x faster than our current implementation 
> for most numbers of bits per value. I'm copying results here:
> {noformat}
>  * `readLongs` just reads 2*bitsPerValue longs from the ByteBuffer, it serves 
> as
>a baseline.
>  * `decodeNaiveFromBytes` reads a byte[] and decodes from it. This is what the
>current Lucene codec does.
>  * `decodeNaiveFromLongs` decodes from longs on the fly.
>  * `decodeSimpleSIMD` is a simple implementation that relies on how Java
>recognizes some patterns and uses SIMD instructions.
>  * `decodeSIMD` is a more complex implementation that both relies on the C2
>compiler to generate SIMD instructions and encodes 8 bytes, 4 shorts or
>2 ints in a long in order to decompress multiple values at once.
> Benchmark   (bitsPerValue)  (byteOrder)   
> Mode  Cnt   Score   Error   Units
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   1   LE  
> thrpt5  12.912 ± 0.393  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   1   BE  
> thrpt5  12.862 ± 0.395  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   2   LE  
> thrpt5  13.040 ± 1.162  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   2   BE  
> thrpt5  13.027 ± 0.270  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   3   LE  
> thrpt5  12.409 ± 0.637  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   3   BE  
> thrpt5  12.268 ± 0.947  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   4   LE  
> thrpt5  14.177 ± 2.263  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   4   BE  
> thrpt5  11.457 ± 0.150  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   5   LE  
> thrpt5  10.988 ± 1.179  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   5   BE  
> thrpt5  11.226 ± 0.088  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   6   LE  
> thrpt5   9.791 ± 0.305  ops/us
> 

[jira] [Commented] (LUCENE-9027) SIMD-based decoding of postings lists

2019-11-19 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16978171#comment-16978171
 ] 

Adrien Grand commented on LUCENE-9027:
--

Nightly benchmarks have picked up this change. Several queries got a 5+% 
increase of QPS:
 - Prefix ~ +14%
 - IntervalsOrdered ~ +10%
 - SpanNearQuery ~ +9%
 - AndMedOrHighHigh ~ +7%
 - AndHighMed ~ +7%
 - OrHighMed ~ +7%
 - Wildcard ~ +7%

More details at [http://people.apache.org/~mikemccand/lucenebench/]

> SIMD-based decoding of postings lists
> -
>
> Key: LUCENE-9027
> URL: https://issues.apache.org/jira/browse/LUCENE-9027
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Fix For: 8.4
>
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> [~rcmuir] has been mentioning the idea for quite some time that we might be 
> able to write the decoding logic in such a way that Java would use SIMD 
> instructions. More recently [~paul.masurel] wrote a [blog 
> post|https://fulmicoton.com/posts/bitpacking/] that raises the point that 
> Lucene could still do decode multiple ints at once in a single instruction by 
> packing two ints in a long and we've had some discussions about what we could 
> try in Lucene to speed up the decoding of postings. This made me want to look 
> a bit deeper at what we could do.
> Our current decoding logic reads data in a byte[] and decodes packed integers 
> from it. Unfortunately it doesn't make use of SIMD instructions and looks 
> like 
> [this|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/NaiveByteDecoder.java].
> I confirmed by looking at the generated assembly that if I take an array of 
> integers and shift them all by the same number of bits then Java will use 
> SIMD instructions to shift multiple integers at once. This led me to writing 
> this 
> [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SimpleSIMDDecoder.java]
>  that tries as much as possible to shift long sequences of ints by the same 
> number of bits to speed up decoding. It is indeed faster than the current 
> logic we have, up to about 2x faster for some numbers of bits per value.
> Currently the best 
> [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SIMDDecoder.java]
>  I've been able to come up with combines the above idea with the idea that 
> Paul mentioned in his blog that consists of emulating SIMD from Java by 
> packing multiple integers into a long: 2 ints, 4 shorts or 8 bytes. It is a 
> bit harder to read but gives another speedup on top of the above 
> implementation.
> I have a [JMH 
> benchmark|https://github.com/jpountz/decode-128-ints-benchmark/] available in 
> case someone would like to play with this and maybe even come up with an even 
> faster implementation. It is 2-2.5x faster than our current implementation 
> for most numbers of bits per value. I'm copying results here:
> {noformat}
>  * `readLongs` just reads 2*bitsPerValue longs from the ByteBuffer, it serves 
> as
>a baseline.
>  * `decodeNaiveFromBytes` reads a byte[] and decodes from it. This is what the
>current Lucene codec does.
>  * `decodeNaiveFromLongs` decodes from longs on the fly.
>  * `decodeSimpleSIMD` is a simple implementation that relies on how Java
>recognizes some patterns and uses SIMD instructions.
>  * `decodeSIMD` is a more complex implementation that both relies on the C2
>compiler to generate SIMD instructions and encodes 8 bytes, 4 shorts or
>2 ints in a long in order to decompress multiple values at once.
> Benchmark   (bitsPerValue)  (byteOrder)   
> Mode  Cnt   Score   Error   Units
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   1   LE  
> thrpt5  12.912 ± 0.393  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   1   BE  
> thrpt5  12.862 ± 0.395  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   2   LE  
> thrpt5  13.040 ± 1.162  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   2   BE  
> thrpt5  13.027 ± 0.270  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   3   LE  
> thrpt5  12.409 ± 0.637  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   3   BE  
> thrpt5  12.268 ± 0.947  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   4   LE  
> thrpt5  14.177 ± 2.263  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   4   BE  
> thrpt5  11.457 ± 0.150  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   5   LE  
> thrpt

[jira] [Commented] (LUCENE-9027) SIMD-based decoding of postings lists

2019-11-19 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16977532#comment-16977532
 ] 

ASF subversion and git services commented on LUCENE-9027:
-

Commit cb1a72ad1642400888683e1f735a0316334b2484 in lucene-solr's branch 
refs/heads/branch_8x from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=cb1a72a ]

LUCENE-9027: Use SIMD instructions to decode postings. (#973)


> SIMD-based decoding of postings lists
> -
>
> Key: LUCENE-9027
> URL: https://issues.apache.org/jira/browse/LUCENE-9027
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> [~rcmuir] has been mentioning the idea for quite some time that we might be 
> able to write the decoding logic in such a way that Java would use SIMD 
> instructions. More recently [~paul.masurel] wrote a [blog 
> post|https://fulmicoton.com/posts/bitpacking/] that raises the point that 
> Lucene could still do decode multiple ints at once in a single instruction by 
> packing two ints in a long and we've had some discussions about what we could 
> try in Lucene to speed up the decoding of postings. This made me want to look 
> a bit deeper at what we could do.
> Our current decoding logic reads data in a byte[] and decodes packed integers 
> from it. Unfortunately it doesn't make use of SIMD instructions and looks 
> like 
> [this|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/NaiveByteDecoder.java].
> I confirmed by looking at the generated assembly that if I take an array of 
> integers and shift them all by the same number of bits then Java will use 
> SIMD instructions to shift multiple integers at once. This led me to writing 
> this 
> [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SimpleSIMDDecoder.java]
>  that tries as much as possible to shift long sequences of ints by the same 
> number of bits to speed up decoding. It is indeed faster than the current 
> logic we have, up to about 2x faster for some numbers of bits per value.
> Currently the best 
> [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SIMDDecoder.java]
>  I've been able to come up with combines the above idea with the idea that 
> Paul mentioned in his blog that consists of emulating SIMD from Java by 
> packing multiple integers into a long: 2 ints, 4 shorts or 8 bytes. It is a 
> bit harder to read but gives another speedup on top of the above 
> implementation.
> I have a [JMH 
> benchmark|https://github.com/jpountz/decode-128-ints-benchmark/] available in 
> case someone would like to play with this and maybe even come up with an even 
> faster implementation. It is 2-2.5x faster than our current implementation 
> for most numbers of bits per value. I'm copying results here:
> {noformat}
>  * `readLongs` just reads 2*bitsPerValue longs from the ByteBuffer, it serves 
> as
>a baseline.
>  * `decodeNaiveFromBytes` reads a byte[] and decodes from it. This is what the
>current Lucene codec does.
>  * `decodeNaiveFromLongs` decodes from longs on the fly.
>  * `decodeSimpleSIMD` is a simple implementation that relies on how Java
>recognizes some patterns and uses SIMD instructions.
>  * `decodeSIMD` is a more complex implementation that both relies on the C2
>compiler to generate SIMD instructions and encodes 8 bytes, 4 shorts or
>2 ints in a long in order to decompress multiple values at once.
> Benchmark   (bitsPerValue)  (byteOrder)   
> Mode  Cnt   Score   Error   Units
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   1   LE  
> thrpt5  12.912 ± 0.393  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   1   BE  
> thrpt5  12.862 ± 0.395  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   2   LE  
> thrpt5  13.040 ± 1.162  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   2   BE  
> thrpt5  13.027 ± 0.270  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   3   LE  
> thrpt5  12.409 ± 0.637  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   3   BE  
> thrpt5  12.268 ± 0.947  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   4   LE  
> thrpt5  14.177 ± 2.263  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   4   BE  
> thrpt5  11.457 ± 0.150  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   5   LE  
> thrpt5  10.988 ± 1.179  ops/us
> 

[jira] [Commented] (LUCENE-9027) SIMD-based decoding of postings lists

2019-11-18 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16976752#comment-16976752
 ] 

ASF subversion and git services commented on LUCENE-9027:
-

Commit 7755cdf03fc250e310c3b7d9b2e785f2939d3dc9 in lucene-solr's branch 
refs/heads/master from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=7755cdf ]

LUCENE-9027: Use SIMD instructions to decode postings. (#973)




> SIMD-based decoding of postings lists
> -
>
> Key: LUCENE-9027
> URL: https://issues.apache.org/jira/browse/LUCENE-9027
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> [~rcmuir] has been mentioning the idea for quite some time that we might be 
> able to write the decoding logic in such a way that Java would use SIMD 
> instructions. More recently [~paul.masurel] wrote a [blog 
> post|https://fulmicoton.com/posts/bitpacking/] that raises the point that 
> Lucene could still do decode multiple ints at once in a single instruction by 
> packing two ints in a long and we've had some discussions about what we could 
> try in Lucene to speed up the decoding of postings. This made me want to look 
> a bit deeper at what we could do.
> Our current decoding logic reads data in a byte[] and decodes packed integers 
> from it. Unfortunately it doesn't make use of SIMD instructions and looks 
> like 
> [this|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/NaiveByteDecoder.java].
> I confirmed by looking at the generated assembly that if I take an array of 
> integers and shift them all by the same number of bits then Java will use 
> SIMD instructions to shift multiple integers at once. This led me to writing 
> this 
> [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SimpleSIMDDecoder.java]
>  that tries as much as possible to shift long sequences of ints by the same 
> number of bits to speed up decoding. It is indeed faster than the current 
> logic we have, up to about 2x faster for some numbers of bits per value.
> Currently the best 
> [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SIMDDecoder.java]
>  I've been able to come up with combines the above idea with the idea that 
> Paul mentioned in his blog that consists of emulating SIMD from Java by 
> packing multiple integers into a long: 2 ints, 4 shorts or 8 bytes. It is a 
> bit harder to read but gives another speedup on top of the above 
> implementation.
> I have a [JMH 
> benchmark|https://github.com/jpountz/decode-128-ints-benchmark/] available in 
> case someone would like to play with this and maybe even come up with an even 
> faster implementation. It is 2-2.5x faster than our current implementation 
> for most numbers of bits per value. I'm copying results here:
> {noformat}
>  * `readLongs` just reads 2*bitsPerValue longs from the ByteBuffer, it serves 
> as
>a baseline.
>  * `decodeNaiveFromBytes` reads a byte[] and decodes from it. This is what the
>current Lucene codec does.
>  * `decodeNaiveFromLongs` decodes from longs on the fly.
>  * `decodeSimpleSIMD` is a simple implementation that relies on how Java
>recognizes some patterns and uses SIMD instructions.
>  * `decodeSIMD` is a more complex implementation that both relies on the C2
>compiler to generate SIMD instructions and encodes 8 bytes, 4 shorts or
>2 ints in a long in order to decompress multiple values at once.
> Benchmark   (bitsPerValue)  (byteOrder)   
> Mode  Cnt   Score   Error   Units
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   1   LE  
> thrpt5  12.912 ± 0.393  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   1   BE  
> thrpt5  12.862 ± 0.395  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   2   LE  
> thrpt5  13.040 ± 1.162  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   2   BE  
> thrpt5  13.027 ± 0.270  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   3   LE  
> thrpt5  12.409 ± 0.637  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   3   BE  
> thrpt5  12.268 ± 0.947  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   4   LE  
> thrpt5  14.177 ± 2.263  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   4   BE  
> thrpt5  11.457 ± 0.150  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   5   LE  
> thrpt5  10.988 ± 1.179  ops/us
> 

[jira] [Commented] (LUCENE-9027) SIMD-based decoding of postings lists

2019-11-13 Thread Yonik Seeley (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16973707#comment-16973707
 ] 

Yonik Seeley commented on LUCENE-9027:
--

Yep, going little-endian is the right choice.  big-endian has been dead for a 
while, and is very unlikely to be revived.

> SIMD-based decoding of postings lists
> -
>
> Key: LUCENE-9027
> URL: https://issues.apache.org/jira/browse/LUCENE-9027
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> [~rcmuir] has been mentioning the idea for quite some time that we might be 
> able to write the decoding logic in such a way that Java would use SIMD 
> instructions. More recently [~paul.masurel] wrote a [blog 
> post|https://fulmicoton.com/posts/bitpacking/] that raises the point that 
> Lucene could still do decode multiple ints at once in a single instruction by 
> packing two ints in a long and we've had some discussions about what we could 
> try in Lucene to speed up the decoding of postings. This made me want to look 
> a bit deeper at what we could do.
> Our current decoding logic reads data in a byte[] and decodes packed integers 
> from it. Unfortunately it doesn't make use of SIMD instructions and looks 
> like 
> [this|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/NaiveByteDecoder.java].
> I confirmed by looking at the generated assembly that if I take an array of 
> integers and shift them all by the same number of bits then Java will use 
> SIMD instructions to shift multiple integers at once. This led me to writing 
> this 
> [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SimpleSIMDDecoder.java]
>  that tries as much as possible to shift long sequences of ints by the same 
> number of bits to speed up decoding. It is indeed faster than the current 
> logic we have, up to about 2x faster for some numbers of bits per value.
> Currently the best 
> [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SIMDDecoder.java]
>  I've been able to come up with combines the above idea with the idea that 
> Paul mentioned in his blog that consists of emulating SIMD from Java by 
> packing multiple integers into a long: 2 ints, 4 shorts or 8 bytes. It is a 
> bit harder to read but gives another speedup on top of the above 
> implementation.
> I have a [JMH 
> benchmark|https://github.com/jpountz/decode-128-ints-benchmark/] available in 
> case someone would like to play with this and maybe even come up with an even 
> faster implementation. It is 2-2.5x faster than our current implementation 
> for most numbers of bits per value. I'm copying results here:
> {noformat}
>  * `readLongs` just reads 2*bitsPerValue longs from the ByteBuffer, it serves 
> as
>a baseline.
>  * `decodeNaiveFromBytes` reads a byte[] and decodes from it. This is what the
>current Lucene codec does.
>  * `decodeNaiveFromLongs` decodes from longs on the fly.
>  * `decodeSimpleSIMD` is a simple implementation that relies on how Java
>recognizes some patterns and uses SIMD instructions.
>  * `decodeSIMD` is a more complex implementation that both relies on the C2
>compiler to generate SIMD instructions and encodes 8 bytes, 4 shorts or
>2 ints in a long in order to decompress multiple values at once.
> Benchmark   (bitsPerValue)  (byteOrder)   
> Mode  Cnt   Score   Error   Units
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   1   LE  
> thrpt5  12.912 ± 0.393  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   1   BE  
> thrpt5  12.862 ± 0.395  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   2   LE  
> thrpt5  13.040 ± 1.162  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   2   BE  
> thrpt5  13.027 ± 0.270  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   3   LE  
> thrpt5  12.409 ± 0.637  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   3   BE  
> thrpt5  12.268 ± 0.947  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   4   LE  
> thrpt5  14.177 ± 2.263  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   4   BE  
> thrpt5  11.457 ± 0.150  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   5   LE  
> thrpt5  10.988 ± 1.179  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   5   BE  
> thrpt5  11.226 ± 0.088  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   6   

[jira] [Commented] (LUCENE-9027) SIMD-based decoding of postings lists

2019-11-13 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16973691#comment-16973691
 ] 

Adrien Grand commented on LUCENE-9027:
--

Following a suggestion by [~rcmuir] I specialized the patch for little-endian 
architectures. Big-endian architectures will get a slowdown due to the need to 
do a Long#reverseBytes on all read longs, but this probably doesn't matter much 
and helps simplify DataInput.

> SIMD-based decoding of postings lists
> -
>
> Key: LUCENE-9027
> URL: https://issues.apache.org/jira/browse/LUCENE-9027
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> [~rcmuir] has been mentioning the idea for quite some time that we might be 
> able to write the decoding logic in such a way that Java would use SIMD 
> instructions. More recently [~paul.masurel] wrote a [blog 
> post|https://fulmicoton.com/posts/bitpacking/] that raises the point that 
> Lucene could still do decode multiple ints at once in a single instruction by 
> packing two ints in a long and we've had some discussions about what we could 
> try in Lucene to speed up the decoding of postings. This made me want to look 
> a bit deeper at what we could do.
> Our current decoding logic reads data in a byte[] and decodes packed integers 
> from it. Unfortunately it doesn't make use of SIMD instructions and looks 
> like 
> [this|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/NaiveByteDecoder.java].
> I confirmed by looking at the generated assembly that if I take an array of 
> integers and shift them all by the same number of bits then Java will use 
> SIMD instructions to shift multiple integers at once. This led me to writing 
> this 
> [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SimpleSIMDDecoder.java]
>  that tries as much as possible to shift long sequences of ints by the same 
> number of bits to speed up decoding. It is indeed faster than the current 
> logic we have, up to about 2x faster for some numbers of bits per value.
> Currently the best 
> [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SIMDDecoder.java]
>  I've been able to come up with combines the above idea with the idea that 
> Paul mentioned in his blog that consists of emulating SIMD from Java by 
> packing multiple integers into a long: 2 ints, 4 shorts or 8 bytes. It is a 
> bit harder to read but gives another speedup on top of the above 
> implementation.
> I have a [JMH 
> benchmark|https://github.com/jpountz/decode-128-ints-benchmark/] available in 
> case someone would like to play with this and maybe even come up with an even 
> faster implementation. It is 2-2.5x faster than our current implementation 
> for most numbers of bits per value. I'm copying results here:
> {noformat}
>  * `readLongs` just reads 2*bitsPerValue longs from the ByteBuffer, it serves 
> as
>a baseline.
>  * `decodeNaiveFromBytes` reads a byte[] and decodes from it. This is what the
>current Lucene codec does.
>  * `decodeNaiveFromLongs` decodes from longs on the fly.
>  * `decodeSimpleSIMD` is a simple implementation that relies on how Java
>recognizes some patterns and uses SIMD instructions.
>  * `decodeSIMD` is a more complex implementation that both relies on the C2
>compiler to generate SIMD instructions and encodes 8 bytes, 4 shorts or
>2 ints in a long in order to decompress multiple values at once.
> Benchmark   (bitsPerValue)  (byteOrder)   
> Mode  Cnt   Score   Error   Units
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   1   LE  
> thrpt5  12.912 ± 0.393  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   1   BE  
> thrpt5  12.862 ± 0.395  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   2   LE  
> thrpt5  13.040 ± 1.162  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   2   BE  
> thrpt5  13.027 ± 0.270  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   3   LE  
> thrpt5  12.409 ± 0.637  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   3   BE  
> thrpt5  12.268 ± 0.947  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   4   LE  
> thrpt5  14.177 ± 2.263  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   4   BE  
> thrpt5  11.457 ± 0.150  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   5   LE  
> thrpt5  10.988 ± 1.179  ops/us
> 

[jira] [Commented] (LUCENE-9027) SIMD-based decoding of postings lists

2019-11-13 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16973328#comment-16973328
 ] 

Adrien Grand commented on LUCENE-9027:
--

I guess it depends what kind of analytics, I expect counts to get the best 
speedup indeed, since they are only about decoding postings and counting 
matches. However other analytics use-cases like facets still need to read 
values for a doc-value field, which likely drowns the speedup a bit, like 
sorting. We have some benchmarks for Elasticsearch that run nightly at 
https://benchmarks.elastic.co but most analytics queries like histograms or 
terms facets run on a match_all query, so I'm not expecting to see any speedup.

I've done some more research regarding the prefix sum. Unfortunately, I didn't 
manage to get C2 to generate SIMD instructions for the prefix sum (even with 
gaps of 2 or 4 instead of 1). The best option I have for now works by hooking 
into the decoding logic and summing up longs when they still represent 2 packed 
integers (which effectively computes the prefix sum for [0:64) and [64:128) in 
parallel and then add the value at index 63 to all values at indices [64:128). 
This last step is vectorized by the JVM. See benchmark results at the bottom of 
https://github.com/jpountz/decode-128-ints-benchmark if you are interested. I 
wish we could do better, but it's still nice that we can both decompress and 
compute the prefix sum faster that we can just decompress with the current 
codec on average. Having the sum pre-computed also helps simplify some 
conditions in the PostingsEnum implementations of nextDoc/advance.

Here is the result of a run on wikibigall with the current pull request. I 
included some sorting tasks as well to see how much they benefit from this 
change

{noformat}
TaskQPS baseline  StdDev   QPS patch  StdDev
Pct diff
  Fuzzy2   99.40 (10.4%)   95.46  (6.9%)   
-4.0% ( -19% -   14%)
  IntNRQ  966.00  (2.4%)  977.27  (2.1%)
1.2% (  -3% -5%)
  Fuzzy1   97.25  (7.8%)   99.66  (9.4%)
2.5% ( -13% -   21%)
  OrHighHigh   84.95  (3.7%)   87.40  (4.2%)
2.9% (  -4% -   11%)
Term 1316.61  (3.1%) 1357.55  (4.8%)
3.1% (  -4% -   11%)
   HighTermDayOfYearSort   35.89  (6.3%)   37.66  (4.2%)
4.9% (  -5% -   16%)
  Phrase   60.22  (2.1%)   63.45  (4.3%)
5.4% (  -1% -   12%)
   HighTermMonthSort   63.83  (9.1%)   67.40 (10.7%)
5.6% ( -12% -   27%)
 AndHighHigh   27.27  (3.4%)   28.80  (4.0%)
5.6% (  -1% -   13%)
SloppyPhrase1.35  (7.4%)1.43  (9.3%)
6.2% (  -9% -   24%)
 AndHighOrMedMed   26.38  (1.7%)   28.17  (2.6%)
6.8% (   2% -   11%)
IntervalsOrdered   21.99  (2.8%)   23.57  (1.9%)
7.2% (   2% -   12%)
   OrHighMed   39.27  (2.8%)   42.32  (3.0%)
7.8% (   1% -   13%)
SpanNear9.74  (3.2%)   10.71  (2.0%)   
10.0% (   4% -   15%)
  AndHighMed   59.40  (3.1%)   65.44  (3.6%)   
10.2% (   3% -   17%)
Wildcard  127.70  (7.7%)  140.93  (3.3%)   
10.4% (   0% -   23%)
AndMedOrHighHigh   30.65  (1.4%)   34.15  (2.1%)   
11.4% (   7% -   15%)
 Prefix3   46.12  (9.9%)   53.01 (10.1%)   
14.9% (  -4% -   38%)
{noformat}

I now plan to focus of getting this into a mergeable state. 

> SIMD-based decoding of postings lists
> -
>
> Key: LUCENE-9027
> URL: https://issues.apache.org/jira/browse/LUCENE-9027
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> [~rcmuir] has been mentioning the idea for quite some time that we might be 
> able to write the decoding logic in such a way that Java would use SIMD 
> instructions. More recently [~paul.masurel] wrote a [blog 
> post|https://fulmicoton.com/posts/bitpacking/] that raises the point that 
> Lucene could still do decode multiple ints at once in a single instruction by 
> packing two ints in a long and we've had some discussions about what we could 
> try in Lucene to speed up the decoding of postings. This made me want to look 
> a bit deeper at what we could do.
> Our current decoding logic reads data in a byte[] and decodes packed integers 
> from it. Unfortunately it doesn't make use of SIMD instructions and looks 
> like 
> 

[jira] [Commented] (LUCENE-9027) SIMD-based decoding of postings lists

2019-11-12 Thread Paul Masurel (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16973070#comment-16973070
 ] 

Paul Masurel commented on LUCENE-9027:
--

The benefit is probably greater on analytics.

Is there a benchmark for the ELK stack somewhere?

 

> SIMD-based decoding of postings lists
> -
>
> Key: LUCENE-9027
> URL: https://issues.apache.org/jira/browse/LUCENE-9027
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> [~rcmuir] has been mentioning the idea for quite some time that we might be 
> able to write the decoding logic in such a way that Java would use SIMD 
> instructions. More recently [~paul.masurel] wrote a [blog 
> post|https://fulmicoton.com/posts/bitpacking/] that raises the point that 
> Lucene could still do decode multiple ints at once in a single instruction by 
> packing two ints in a long and we've had some discussions about what we could 
> try in Lucene to speed up the decoding of postings. This made me want to look 
> a bit deeper at what we could do.
> Our current decoding logic reads data in a byte[] and decodes packed integers 
> from it. Unfortunately it doesn't make use of SIMD instructions and looks 
> like 
> [this|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/NaiveByteDecoder.java].
> I confirmed by looking at the generated assembly that if I take an array of 
> integers and shift them all by the same number of bits then Java will use 
> SIMD instructions to shift multiple integers at once. This led me to writing 
> this 
> [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SimpleSIMDDecoder.java]
>  that tries as much as possible to shift long sequences of ints by the same 
> number of bits to speed up decoding. It is indeed faster than the current 
> logic we have, up to about 2x faster for some numbers of bits per value.
> Currently the best 
> [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SIMDDecoder.java]
>  I've been able to come up with combines the above idea with the idea that 
> Paul mentioned in his blog that consists of emulating SIMD from Java by 
> packing multiple integers into a long: 2 ints, 4 shorts or 8 bytes. It is a 
> bit harder to read but gives another speedup on top of the above 
> implementation.
> I have a [JMH 
> benchmark|https://github.com/jpountz/decode-128-ints-benchmark/] available in 
> case someone would like to play with this and maybe even come up with an even 
> faster implementation. It is 2-2.5x faster than our current implementation 
> for most numbers of bits per value. I'm copying results here:
> {noformat}
>  * `readLongs` just reads 2*bitsPerValue longs from the ByteBuffer, it serves 
> as
>a baseline.
>  * `decodeNaiveFromBytes` reads a byte[] and decodes from it. This is what the
>current Lucene codec does.
>  * `decodeNaiveFromLongs` decodes from longs on the fly.
>  * `decodeSimpleSIMD` is a simple implementation that relies on how Java
>recognizes some patterns and uses SIMD instructions.
>  * `decodeSIMD` is a more complex implementation that both relies on the C2
>compiler to generate SIMD instructions and encodes 8 bytes, 4 shorts or
>2 ints in a long in order to decompress multiple values at once.
> Benchmark   (bitsPerValue)  (byteOrder)   
> Mode  Cnt   Score   Error   Units
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   1   LE  
> thrpt5  12.912 ± 0.393  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   1   BE  
> thrpt5  12.862 ± 0.395  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   2   LE  
> thrpt5  13.040 ± 1.162  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   2   BE  
> thrpt5  13.027 ± 0.270  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   3   LE  
> thrpt5  12.409 ± 0.637  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   3   BE  
> thrpt5  12.268 ± 0.947  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   4   LE  
> thrpt5  14.177 ± 2.263  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   4   BE  
> thrpt5  11.457 ± 0.150  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   5   LE  
> thrpt5  10.988 ± 1.179  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   5   BE  
> thrpt5  11.226 ± 0.088  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   6   LE  
> thrpt5   

[jira] [Commented] (LUCENE-9027) SIMD-based decoding of postings lists

2019-11-08 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16970621#comment-16970621
 ] 

Michael McCandless commented on LUCENE-9027:


Thanks [~jpountz] – I forgot that Codec impacts make our {{Term}} query tasks 
super fast!  Your explanation makes sense.

I'll look at the PR around endianess – as ugly as it sounds, I think it is the 
right tradeoff.  We should do things at indexing time to make search time 
faster, and if one of those things is writing multi-byte values in the "right" 
order for searching, that's fair game!

> SIMD-based decoding of postings lists
> -
>
> Key: LUCENE-9027
> URL: https://issues.apache.org/jira/browse/LUCENE-9027
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> [~rcmuir] has been mentioning the idea for quite some time that we might be 
> able to write the decoding logic in such a way that Java would use SIMD 
> instructions. More recently [~paul.masurel] wrote a [blog 
> post|https://fulmicoton.com/posts/bitpacking/] that raises the point that 
> Lucene could still do decode multiple ints at once in a single instruction by 
> packing two ints in a long and we've had some discussions about what we could 
> try in Lucene to speed up the decoding of postings. This made me want to look 
> a bit deeper at what we could do.
> Our current decoding logic reads data in a byte[] and decodes packed integers 
> from it. Unfortunately it doesn't make use of SIMD instructions and looks 
> like 
> [this|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/NaiveByteDecoder.java].
> I confirmed by looking at the generated assembly that if I take an array of 
> integers and shift them all by the same number of bits then Java will use 
> SIMD instructions to shift multiple integers at once. This led me to writing 
> this 
> [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SimpleSIMDDecoder.java]
>  that tries as much as possible to shift long sequences of ints by the same 
> number of bits to speed up decoding. It is indeed faster than the current 
> logic we have, up to about 2x faster for some numbers of bits per value.
> Currently the best 
> [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SIMDDecoder.java]
>  I've been able to come up with combines the above idea with the idea that 
> Paul mentioned in his blog that consists of emulating SIMD from Java by 
> packing multiple integers into a long: 2 ints, 4 shorts or 8 bytes. It is a 
> bit harder to read but gives another speedup on top of the above 
> implementation.
> I have a [JMH 
> benchmark|https://github.com/jpountz/decode-128-ints-benchmark/] available in 
> case someone would like to play with this and maybe even come up with an even 
> faster implementation. It is 2-2.5x faster than our current implementation 
> for most numbers of bits per value. I'm copying results here:
> {noformat}
>  * `readLongs` just reads 2*bitsPerValue longs from the ByteBuffer, it serves 
> as
>a baseline.
>  * `decodeNaiveFromBytes` reads a byte[] and decodes from it. This is what the
>current Lucene codec does.
>  * `decodeNaiveFromLongs` decodes from longs on the fly.
>  * `decodeSimpleSIMD` is a simple implementation that relies on how Java
>recognizes some patterns and uses SIMD instructions.
>  * `decodeSIMD` is a more complex implementation that both relies on the C2
>compiler to generate SIMD instructions and encodes 8 bytes, 4 shorts or
>2 ints in a long in order to decompress multiple values at once.
> Benchmark   (bitsPerValue)  (byteOrder)   
> Mode  Cnt   Score   Error   Units
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   1   LE  
> thrpt5  12.912 ± 0.393  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   1   BE  
> thrpt5  12.862 ± 0.395  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   2   LE  
> thrpt5  13.040 ± 1.162  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   2   BE  
> thrpt5  13.027 ± 0.270  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   3   LE  
> thrpt5  12.409 ± 0.637  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   3   BE  
> thrpt5  12.268 ± 0.947  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   4   LE  
> thrpt5  14.177 ± 2.263  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   4   BE  
> thrpt5  11.457 ± 0.150  ops/us
> 

[jira] [Commented] (LUCENE-9027) SIMD-based decoding of postings lists

2019-11-07 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969575#comment-16969575
 ] 

Adrien Grand commented on LUCENE-9027:
--

I'm curious whether you have thoughts about how the pull request specializes 
for the endianness of the machine that creates the index. I'm a bit unhappy 
about this, but on the other hand reversing bytes proved to be a bottleneck on 
the benchmarks that I ran.

> SIMD-based decoding of postings lists
> -
>
> Key: LUCENE-9027
> URL: https://issues.apache.org/jira/browse/LUCENE-9027
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> [~rcmuir] has been mentioning the idea for quite some time that we might be 
> able to write the decoding logic in such a way that Java would use SIMD 
> instructions. More recently [~paul.masurel] wrote a [blog 
> post|https://fulmicoton.com/posts/bitpacking/] that raises the point that 
> Lucene could still do decode multiple ints at once in a single instruction by 
> packing two ints in a long and we've had some discussions about what we could 
> try in Lucene to speed up the decoding of postings. This made me want to look 
> a bit deeper at what we could do.
> Our current decoding logic reads data in a byte[] and decodes packed integers 
> from it. Unfortunately it doesn't make use of SIMD instructions and looks 
> like 
> [this|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/NaiveByteDecoder.java].
> I confirmed by looking at the generated assembly that if I take an array of 
> integers and shift them all by the same number of bits then Java will use 
> SIMD instructions to shift multiple integers at once. This led me to writing 
> this 
> [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SimpleSIMDDecoder.java]
>  that tries as much as possible to shift long sequences of ints by the same 
> number of bits to speed up decoding. It is indeed faster than the current 
> logic we have, up to about 2x faster for some numbers of bits per value.
> Currently the best 
> [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SIMDDecoder.java]
>  I've been able to come up with combines the above idea with the idea that 
> Paul mentioned in his blog that consists of emulating SIMD from Java by 
> packing multiple integers into a long: 2 ints, 4 shorts or 8 bytes. It is a 
> bit harder to read but gives another speedup on top of the above 
> implementation.
> I have a [JMH 
> benchmark|https://github.com/jpountz/decode-128-ints-benchmark/] available in 
> case someone would like to play with this and maybe even come up with an even 
> faster implementation. It is 2-2.5x faster than our current implementation 
> for most numbers of bits per value. I'm copying results here:
> {noformat}
>  * `readLongs` just reads 2*bitsPerValue longs from the ByteBuffer, it serves 
> as
>a baseline.
>  * `decodeNaiveFromBytes` reads a byte[] and decodes from it. This is what the
>current Lucene codec does.
>  * `decodeNaiveFromLongs` decodes from longs on the fly.
>  * `decodeSimpleSIMD` is a simple implementation that relies on how Java
>recognizes some patterns and uses SIMD instructions.
>  * `decodeSIMD` is a more complex implementation that both relies on the C2
>compiler to generate SIMD instructions and encodes 8 bytes, 4 shorts or
>2 ints in a long in order to decompress multiple values at once.
> Benchmark   (bitsPerValue)  (byteOrder)   
> Mode  Cnt   Score   Error   Units
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   1   LE  
> thrpt5  12.912 ± 0.393  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   1   BE  
> thrpt5  12.862 ± 0.395  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   2   LE  
> thrpt5  13.040 ± 1.162  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   2   BE  
> thrpt5  13.027 ± 0.270  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   3   LE  
> thrpt5  12.409 ± 0.637  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   3   BE  
> thrpt5  12.268 ± 0.947  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   4   LE  
> thrpt5  14.177 ± 2.263  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   4   BE  
> thrpt5  11.457 ± 0.150  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   5   LE  
> thrpt5  10.988 ± 1.179  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   

[jira] [Commented] (LUCENE-9027) SIMD-based decoding of postings lists

2019-11-07 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969571#comment-16969571
 ] 

Adrien Grand commented on LUCENE-9027:
--

This QPS is consistent with what we are seeing on the nightly benchmarks 
http://people.apache.org/~mikemccand/lucenebench/Term.html. I think Term 
doesn't show a speedup because decoding postings is not the bottleneck of term 
queries. When running a term query, Lucene would only decode blocks of postings 
that have a competitive match. On the other hand, for queries like AndHighMed, 
the high-cardinality clause needs to decode lots of blocks, and it's not 
unlikely that many decoded blocks don't even translate to a match for the 
conjunction.


> SIMD-based decoding of postings lists
> -
>
> Key: LUCENE-9027
> URL: https://issues.apache.org/jira/browse/LUCENE-9027
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> [~rcmuir] has been mentioning the idea for quite some time that we might be 
> able to write the decoding logic in such a way that Java would use SIMD 
> instructions. More recently [~paul.masurel] wrote a [blog 
> post|https://fulmicoton.com/posts/bitpacking/] that raises the point that 
> Lucene could still do decode multiple ints at once in a single instruction by 
> packing two ints in a long and we've had some discussions about what we could 
> try in Lucene to speed up the decoding of postings. This made me want to look 
> a bit deeper at what we could do.
> Our current decoding logic reads data in a byte[] and decodes packed integers 
> from it. Unfortunately it doesn't make use of SIMD instructions and looks 
> like 
> [this|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/NaiveByteDecoder.java].
> I confirmed by looking at the generated assembly that if I take an array of 
> integers and shift them all by the same number of bits then Java will use 
> SIMD instructions to shift multiple integers at once. This led me to writing 
> this 
> [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SimpleSIMDDecoder.java]
>  that tries as much as possible to shift long sequences of ints by the same 
> number of bits to speed up decoding. It is indeed faster than the current 
> logic we have, up to about 2x faster for some numbers of bits per value.
> Currently the best 
> [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SIMDDecoder.java]
>  I've been able to come up with combines the above idea with the idea that 
> Paul mentioned in his blog that consists of emulating SIMD from Java by 
> packing multiple integers into a long: 2 ints, 4 shorts or 8 bytes. It is a 
> bit harder to read but gives another speedup on top of the above 
> implementation.
> I have a [JMH 
> benchmark|https://github.com/jpountz/decode-128-ints-benchmark/] available in 
> case someone would like to play with this and maybe even come up with an even 
> faster implementation. It is 2-2.5x faster than our current implementation 
> for most numbers of bits per value. I'm copying results here:
> {noformat}
>  * `readLongs` just reads 2*bitsPerValue longs from the ByteBuffer, it serves 
> as
>a baseline.
>  * `decodeNaiveFromBytes` reads a byte[] and decodes from it. This is what the
>current Lucene codec does.
>  * `decodeNaiveFromLongs` decodes from longs on the fly.
>  * `decodeSimpleSIMD` is a simple implementation that relies on how Java
>recognizes some patterns and uses SIMD instructions.
>  * `decodeSIMD` is a more complex implementation that both relies on the C2
>compiler to generate SIMD instructions and encodes 8 bytes, 4 shorts or
>2 ints in a long in order to decompress multiple values at once.
> Benchmark   (bitsPerValue)  (byteOrder)   
> Mode  Cnt   Score   Error   Units
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   1   LE  
> thrpt5  12.912 ± 0.393  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   1   BE  
> thrpt5  12.862 ± 0.395  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   2   LE  
> thrpt5  13.040 ± 1.162  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   2   BE  
> thrpt5  13.027 ± 0.270  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   3   LE  
> thrpt5  12.409 ± 0.637  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   3   BE  
> thrpt5  12.268 ± 0.947  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   4   LE  
> thrpt5  14.177 ± 2.263  

[jira] [Commented] (LUCENE-9027) SIMD-based decoding of postings lists

2019-11-07 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969339#comment-16969339
 ] 

Michael McCandless commented on LUCENE-9027:


Hmm why are your {{Term}} results crazy fast (~1400 QPS) on {{wikibigall}}?  I 
thought {{Term}} would also show gains here, since it just walks through all 
postings blocks for the term, decoding and collecting?

> SIMD-based decoding of postings lists
> -
>
> Key: LUCENE-9027
> URL: https://issues.apache.org/jira/browse/LUCENE-9027
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> [~rcmuir] has been mentioning the idea for quite some time that we might be 
> able to write the decoding logic in such a way that Java would use SIMD 
> instructions. More recently [~paul.masurel] wrote a [blog 
> post|https://fulmicoton.com/posts/bitpacking/] that raises the point that 
> Lucene could still do decode multiple ints at once in a single instruction by 
> packing two ints in a long and we've had some discussions about what we could 
> try in Lucene to speed up the decoding of postings. This made me want to look 
> a bit deeper at what we could do.
> Our current decoding logic reads data in a byte[] and decodes packed integers 
> from it. Unfortunately it doesn't make use of SIMD instructions and looks 
> like 
> [this|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/NaiveByteDecoder.java].
> I confirmed by looking at the generated assembly that if I take an array of 
> integers and shift them all by the same number of bits then Java will use 
> SIMD instructions to shift multiple integers at once. This led me to writing 
> this 
> [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SimpleSIMDDecoder.java]
>  that tries as much as possible to shift long sequences of ints by the same 
> number of bits to speed up decoding. It is indeed faster than the current 
> logic we have, up to about 2x faster for some numbers of bits per value.
> Currently the best 
> [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SIMDDecoder.java]
>  I've been able to come up with combines the above idea with the idea that 
> Paul mentioned in his blog that consists of emulating SIMD from Java by 
> packing multiple integers into a long: 2 ints, 4 shorts or 8 bytes. It is a 
> bit harder to read but gives another speedup on top of the above 
> implementation.
> I have a [JMH 
> benchmark|https://github.com/jpountz/decode-128-ints-benchmark/] available in 
> case someone would like to play with this and maybe even come up with an even 
> faster implementation. It is 2-2.5x faster than our current implementation 
> for most numbers of bits per value. I'm copying results here:
> {noformat}
>  * `readLongs` just reads 2*bitsPerValue longs from the ByteBuffer, it serves 
> as
>a baseline.
>  * `decodeNaiveFromBytes` reads a byte[] and decodes from it. This is what the
>current Lucene codec does.
>  * `decodeNaiveFromLongs` decodes from longs on the fly.
>  * `decodeSimpleSIMD` is a simple implementation that relies on how Java
>recognizes some patterns and uses SIMD instructions.
>  * `decodeSIMD` is a more complex implementation that both relies on the C2
>compiler to generate SIMD instructions and encodes 8 bytes, 4 shorts or
>2 ints in a long in order to decompress multiple values at once.
> Benchmark   (bitsPerValue)  (byteOrder)   
> Mode  Cnt   Score   Error   Units
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   1   LE  
> thrpt5  12.912 ± 0.393  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   1   BE  
> thrpt5  12.862 ± 0.395  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   2   LE  
> thrpt5  13.040 ± 1.162  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   2   BE  
> thrpt5  13.027 ± 0.270  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   3   LE  
> thrpt5  12.409 ± 0.637  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   3   BE  
> thrpt5  12.268 ± 0.947  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   4   LE  
> thrpt5  14.177 ± 2.263  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   4   BE  
> thrpt5  11.457 ± 0.150  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   5   LE  
> thrpt5  10.988 ± 1.179  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   5   BE  
> thrpt

[jira] [Commented] (LUCENE-9027) SIMD-based decoding of postings lists

2019-11-04 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16967237#comment-16967237
 ] 

Robert Muir commented on LUCENE-9027:
-

+1 for PFOR to avoid the silly worst-cases. IIRC the only reason for the more 
wasteful FOR was just to make a baby step from vbyte encoding, that was tough 
enough at the time. Maybe the support for exceptions is even more important for 
other datasets.

> SIMD-based decoding of postings lists
> -
>
> Key: LUCENE-9027
> URL: https://issues.apache.org/jira/browse/LUCENE-9027
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> [~rcmuir] has been mentioning the idea for quite some time that we might be 
> able to write the decoding logic in such a way that Java would use SIMD 
> instructions. More recently [~paul.masurel] wrote a [blog 
> post|https://fulmicoton.com/posts/bitpacking/] that raises the point that 
> Lucene could still do decode multiple ints at once in a single instruction by 
> packing two ints in a long and we've had some discussions about what we could 
> try in Lucene to speed up the decoding of postings. This made me want to look 
> a bit deeper at what we could do.
> Our current decoding logic reads data in a byte[] and decodes packed integers 
> from it. Unfortunately it doesn't make use of SIMD instructions and looks 
> like 
> [this|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/NaiveByteDecoder.java].
> I confirmed by looking at the generated assembly that if I take an array of 
> integers and shift them all by the same number of bits then Java will use 
> SIMD instructions to shift multiple integers at once. This led me to writing 
> this 
> [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SimpleSIMDDecoder.java]
>  that tries as much as possible to shift long sequences of ints by the same 
> number of bits to speed up decoding. It is indeed faster than the current 
> logic we have, up to about 2x faster for some numbers of bits per value.
> Currently the best 
> [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SIMDDecoder.java]
>  I've been able to come up with combines the above idea with the idea that 
> Paul mentioned in his blog that consists of emulating SIMD from Java by 
> packing multiple integers into a long: 2 ints, 4 shorts or 8 bytes. It is a 
> bit harder to read but gives another speedup on top of the above 
> implementation.
> I have a [JMH 
> benchmark|https://github.com/jpountz/decode-128-ints-benchmark/] available in 
> case someone would like to play with this and maybe even come up with an even 
> faster implementation. It is 2-2.5x faster than our current implementation 
> for most numbers of bits per value. I'm copying results here:
> {noformat}
>  * `readLongs` just reads 2*bitsPerValue longs from the ByteBuffer, it serves 
> as
>a baseline.
>  * `decodeNaiveFromBytes` reads a byte[] and decodes from it. This is what the
>current Lucene codec does.
>  * `decodeNaiveFromLongs` decodes from longs on the fly.
>  * `decodeSimpleSIMD` is a simple implementation that relies on how Java
>recognizes some patterns and uses SIMD instructions.
>  * `decodeSIMD` is a more complex implementation that both relies on the C2
>compiler to generate SIMD instructions and encodes 8 bytes, 4 shorts or
>2 ints in a long in order to decompress multiple values at once.
> Benchmark   (bitsPerValue)  (byteOrder)   
> Mode  Cnt   Score   Error   Units
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   1   LE  
> thrpt5  12.912 ± 0.393  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   1   BE  
> thrpt5  12.862 ± 0.395  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   2   LE  
> thrpt5  13.040 ± 1.162  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   2   BE  
> thrpt5  13.027 ± 0.270  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   3   LE  
> thrpt5  12.409 ± 0.637  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   3   BE  
> thrpt5  12.268 ± 0.947  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   4   LE  
> thrpt5  14.177 ± 2.263  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   4   BE  
> thrpt5  11.457 ± 0.150  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   5   LE  
> thrpt5  10.988 ± 1.179  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   

[jira] [Commented] (LUCENE-9027) SIMD-based decoding of postings lists

2019-10-30 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16963116#comment-16963116
 ] 

Adrien Grand commented on LUCENE-9027:
--

The fact that throughput is better for lower bits per value made me want to try 
out PFOR delta to try getting lower numbers of bits per value when possible. I 
implemented it with a maximum of 3 exceptions per block. It made the .doc file 
9.4% smaller and the .pos file 8.7% smaller.

{noformat}
TaskQPS baseline  StdDev   QPS patch  StdDev
Pct diff
  Fuzzy2   71.64  (6.3%)   70.99  (7.9%)   
-0.9% ( -14% -   14%)
  IntNRQ  105.31  (5.9%)  104.79  (4.8%)   
-0.5% ( -10% -   10%)
  Fuzzy1  130.67  (9.6%)  130.58  (8.8%)   
-0.1% ( -16% -   20%)
Term 1397.31  (3.4%) 1400.02  (3.1%)
0.2% (  -6% -6%)
 AndHighHigh   34.23  (2.7%)   34.67  (3.5%)
1.3% (  -4% -7%)
  OrHighHigh   11.64  (2.8%)   11.93  (2.5%)
2.5% (  -2% -8%)
   OrHighMed   72.36  (3.0%)   76.48  (3.0%)
5.7% (   0% -   12%)
  AndHighMed   59.88  (3.1%)   63.68  (3.9%)
6.4% (   0% -   13%)
Wildcard   50.18  (8.4%)   53.41  (8.0%)
6.4% (  -9% -   24%)
SpanNear   23.99  (2.9%)   25.56  (1.1%)
6.5% (   2% -   10%)
 AndHighOrMedMed   34.24  (2.1%)   36.89  (2.2%)
7.8% (   3% -   12%)
IntervalsOrdered4.39  (6.0%)4.76  (5.3%)
8.3% (  -2% -   20%)
AndMedOrHighHigh   30.16  (2.7%)   32.76  (2.6%)
8.6% (   3% -   14%)
  Phrase   69.68  (1.3%)   75.76  (1.4%)
8.7% (   5% -   11%)
 Prefix3  170.85 (14.3%)  194.70 (10.9%)   
14.0% (  -9% -   45%)
SloppyPhrase   18.07  (3.8%)   20.92  (4.1%)   
15.8% (   7% -   24%)
{noformat}

> SIMD-based decoding of postings lists
> -
>
> Key: LUCENE-9027
> URL: https://issues.apache.org/jira/browse/LUCENE-9027
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> [~rcmuir] has been mentioning the idea for quite some time that we might be 
> able to write the decoding logic in such a way that Java would use SIMD 
> instructions. More recently [~paul.masurel] wrote a [blog 
> post|https://fulmicoton.com/posts/bitpacking/] that raises the point that 
> Lucene could still do decode multiple ints at once in a single instruction by 
> packing two ints in a long and we've had some discussions about what we could 
> try in Lucene to speed up the decoding of postings. This made me want to look 
> a bit deeper at what we could do.
> Our current decoding logic reads data in a byte[] and decodes packed integers 
> from it. Unfortunately it doesn't make use of SIMD instructions and looks 
> like 
> [this|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/NaiveByteDecoder.java].
> I confirmed by looking at the generated assembly that if I take an array of 
> integers and shift them all by the same number of bits then Java will use 
> SIMD instructions to shift multiple integers at once. This led me to writing 
> this 
> [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SimpleSIMDDecoder.java]
>  that tries as much as possible to shift long sequences of ints by the same 
> number of bits to speed up decoding. It is indeed faster than the current 
> logic we have, up to about 2x faster for some numbers of bits per value.
> Currently the best 
> [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SIMDDecoder.java]
>  I've been able to come up with combines the above idea with the idea that 
> Paul mentioned in his blog that consists of emulating SIMD from Java by 
> packing multiple integers into a long: 2 ints, 4 shorts or 8 bytes. It is a 
> bit harder to read but gives another speedup on top of the above 
> implementation.
> I have a [JMH 
> benchmark|https://github.com/jpountz/decode-128-ints-benchmark/] available in 
> case someone would like to play with this and maybe even come up with an even 
> faster implementation. It is 2-2.5x faster than our current implementation 
> for most numbers of bits per value. I'm copying results here:
> {noformat}
>  * `readLongs` just reads 2*bitsPerValue longs from the ByteBuffer, it serves 
> as
>a baseline.
>  * `decodeNaiveFromBytes` reads a 

[jira] [Commented] (LUCENE-9027) SIMD-based decoding of postings lists

2019-10-24 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16959085#comment-16959085
 ] 

Adrien Grand commented on LUCENE-9027:
--

This is a file created by vim that should not have been pushed. :) I removed it.

> SIMD-based decoding of postings lists
> -
>
> Key: LUCENE-9027
> URL: https://issues.apache.org/jira/browse/LUCENE-9027
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> [~rcmuir] has been mentioning the idea for quite some time that we might be 
> able to write the decoding logic in such a way that Java would use SIMD 
> instructions. More recently [~paul.masurel] wrote a [blog 
> post|https://fulmicoton.com/posts/bitpacking/] that raises the point that 
> Lucene could still do decode multiple ints at once in a single instruction by 
> packing two ints in a long and we've had some discussions about what we could 
> try in Lucene to speed up the decoding of postings. This made me want to look 
> a bit deeper at what we could do.
> Our current decoding logic reads data in a byte[] and decodes packed integers 
> from it. Unfortunately it doesn't make use of SIMD instructions and looks 
> like 
> [this|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/NaiveByteDecoder.java].
> I confirmed by looking at the generated assembly that if I take an array of 
> integers and shift them all by the same number of bits then Java will use 
> SIMD instructions to shift multiple integers at once. This led me to writing 
> this 
> [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SimpleSIMDDecoder.java]
>  that tries as much as possible to shift long sequences of ints by the same 
> number of bits to speed up decoding. It is indeed faster than the current 
> logic we have, up to about 2x faster for some numbers of bits per value.
> Currently the best 
> [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SIMDDecoder.java]
>  I've been able to come up with combines the above idea with the idea that 
> Paul mentioned in his blog that consists of emulating SIMD from Java by 
> packing multiple integers into a long: 2 ints, 4 shorts or 8 bytes. It is a 
> bit harder to read but gives another speedup on top of the above 
> implementation.
> I have a [JMH 
> benchmark|https://github.com/jpountz/decode-128-ints-benchmark/] available in 
> case someone would like to play with this and maybe even come up with an even 
> faster implementation. It is 2-2.5x faster than our current implementation 
> for most numbers of bits per value. I'm copying results here:
> {noformat}
>  * `readLongs` just reads 2*bitsPerValue longs from the ByteBuffer, it serves 
> as
>a baseline.
>  * `decodeNaiveFromBytes` reads a byte[] and decodes from it. This is what the
>current Lucene codec does.
>  * `decodeNaiveFromLongs` decodes from longs on the fly.
>  * `decodeSimpleSIMD` is a simple implementation that relies on how Java
>recognizes some patterns and uses SIMD instructions.
>  * `decodeSIMD` is a more complex implementation that both relies on the C2
>compiler to generate SIMD instructions and encodes 8 bytes, 4 shorts or
>2 ints in a long in order to decompress multiple values at once.
> Benchmark   (bitsPerValue)  (byteOrder)   
> Mode  Cnt   Score   Error   Units
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   1   LE  
> thrpt5  12.912 ± 0.393  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   1   BE  
> thrpt5  12.862 ± 0.395  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   2   LE  
> thrpt5  13.040 ± 1.162  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   2   BE  
> thrpt5  13.027 ± 0.270  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   3   LE  
> thrpt5  12.409 ± 0.637  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   3   BE  
> thrpt5  12.268 ± 0.947  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   4   LE  
> thrpt5  14.177 ± 2.263  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   4   BE  
> thrpt5  11.457 ± 0.150  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   5   LE  
> thrpt5  10.988 ± 1.179  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   5   BE  
> thrpt5  11.226 ± 0.088  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   6   LE  
> thrpt5   9.791 ± 0.305  ops/us
> 

[jira] [Commented] (LUCENE-9027) SIMD-based decoding of postings lists

2019-10-24 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16959066#comment-16959066
 ] 

Uwe Schindler commented on LUCENE-9027:
---

What is the .swp file in the PR?

> SIMD-based decoding of postings lists
> -
>
> Key: LUCENE-9027
> URL: https://issues.apache.org/jira/browse/LUCENE-9027
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> [~rcmuir] has been mentioning the idea for quite some time that we might be 
> able to write the decoding logic in such a way that Java would use SIMD 
> instructions. More recently [~paul.masurel] wrote a [blog 
> post|https://fulmicoton.com/posts/bitpacking/] that raises the point that 
> Lucene could still do decode multiple ints at once in a single instruction by 
> packing two ints in a long and we've had some discussions about what we could 
> try in Lucene to speed up the decoding of postings. This made me want to look 
> a bit deeper at what we could do.
> Our current decoding logic reads data in a byte[] and decodes packed integers 
> from it. Unfortunately it doesn't make use of SIMD instructions and looks 
> like 
> [this|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/NaiveByteDecoder.java].
> I confirmed by looking at the generated assembly that if I take an array of 
> integers and shift them all by the same number of bits then Java will use 
> SIMD instructions to shift multiple integers at once. This led me to writing 
> this 
> [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SimpleSIMDDecoder.java]
>  that tries as much as possible to shift long sequences of ints by the same 
> number of bits to speed up decoding. It is indeed faster than the current 
> logic we have, up to about 2x faster for some numbers of bits per value.
> Currently the best 
> [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SIMDDecoder.java]
>  I've been able to come up with combines the above idea with the idea that 
> Paul mentioned in his blog that consists of emulating SIMD from Java by 
> packing multiple integers into a long: 2 ints, 4 shorts or 8 bytes. It is a 
> bit harder to read but gives another speedup on top of the above 
> implementation.
> I have a [JMH 
> benchmark|https://github.com/jpountz/decode-128-ints-benchmark/] available in 
> case someone would like to play with this and maybe even come up with an even 
> faster implementation. It is 2-2.5x faster than our current implementation 
> for most numbers of bits per value. I'm copying results here:
> {noformat}
>  * `readLongs` just reads 2*bitsPerValue longs from the ByteBuffer, it serves 
> as
>a baseline.
>  * `decodeNaiveFromBytes` reads a byte[] and decodes from it. This is what the
>current Lucene codec does.
>  * `decodeNaiveFromLongs` decodes from longs on the fly.
>  * `decodeSimpleSIMD` is a simple implementation that relies on how Java
>recognizes some patterns and uses SIMD instructions.
>  * `decodeSIMD` is a more complex implementation that both relies on the C2
>compiler to generate SIMD instructions and encodes 8 bytes, 4 shorts or
>2 ints in a long in order to decompress multiple values at once.
> Benchmark   (bitsPerValue)  (byteOrder)   
> Mode  Cnt   Score   Error   Units
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   1   LE  
> thrpt5  12.912 ± 0.393  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   1   BE  
> thrpt5  12.862 ± 0.395  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   2   LE  
> thrpt5  13.040 ± 1.162  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   2   BE  
> thrpt5  13.027 ± 0.270  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   3   LE  
> thrpt5  12.409 ± 0.637  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   3   BE  
> thrpt5  12.268 ± 0.947  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   4   LE  
> thrpt5  14.177 ± 2.263  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   4   BE  
> thrpt5  11.457 ± 0.150  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   5   LE  
> thrpt5  10.988 ± 1.179  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   5   BE  
> thrpt5  11.226 ± 0.088  ops/us
> PackedIntsDecodeBenchmark.decodeNaiveFromBytes   6   LE  
> thrpt5   9.791 ± 0.305  ops/us
> 

[jira] [Commented] (LUCENE-9027) SIMD-based decoding of postings lists

2019-10-24 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16959054#comment-16959054
 ] 

Adrien Grand commented on LUCENE-9027:
--

And here are results on wikibigall

{noformat}
TaskQPS baseline  StdDev   QPS patch  StdDev
Pct diff
  IntNRQ 1054.07  (2.2%)  999.02  (7.8%)   
-5.2% ( -14% -4%)
 Prefix3  135.69  (5.3%)  132.79  (7.2%)   
-2.1% ( -13% -   11%)
Term 1341.46  (2.6%) 1350.12  (2.6%)
0.6% (  -4% -5%)
Wildcard  137.74  (1.9%)  138.76  (4.5%)
0.7% (  -5% -7%)
  Fuzzy1  100.54  (8.9%)  101.56  (8.7%)
1.0% ( -15% -   20%)
 AndHighHigh   72.82  (2.8%)   74.01  (2.7%)
1.6% (  -3% -7%)
SpanNear2.17  (4.5%)2.22  (5.3%)
2.1% (  -7% -   12%)
  OrHighHigh   11.36  (3.0%)   11.64  (3.5%)
2.5% (  -3% -9%)
IntervalsOrdered4.45  (3.7%)4.59  (7.5%)
3.0% (  -7% -   14%)
  Fuzzy2   87.88 (13.2%)   91.56 (10.0%)
4.2% ( -16% -   31%)
   OrHighMed   47.04  (2.8%)   49.16  (3.3%)
4.5% (  -1% -   10%)
 AndHighOrMedMed   26.00  (2.1%)   27.23  (2.6%)
4.7% (   0% -9%)
  AndHighMed   65.16  (2.7%)   68.27  (2.9%)
4.8% (   0% -   10%)
AndMedOrHighHigh   27.20  (2.3%)   28.61  (2.4%)
5.2% (   0% -   10%)
  Phrase   26.55  (1.2%)   27.96  (2.3%)
5.3% (   1% -8%)
SloppyPhrase   17.99  (2.6%)   19.77  (4.0%)
9.9% (   3% -   16%)
{noformat}

As expected it mostly helps queries that use "advance" by large intervals (and 
might thus decode entire blocks for only a couple doc IDs) and phrases which 
spend a lot of time reading positions.

> SIMD-based decoding of postings lists
> -
>
> Key: LUCENE-9027
> URL: https://issues.apache.org/jira/browse/LUCENE-9027
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> [~rcmuir] has been mentioning the idea for quite some time that we might be 
> able to write the decoding logic in such a way that Java would use SIMD 
> instructions. More recently [~paul.masurel] wrote a [blog 
> post|https://fulmicoton.com/posts/bitpacking/] that raises the point that 
> Lucene could still do decode multiple ints at once in a single instruction by 
> packing two ints in a long and we've had some discussions about what we could 
> try in Lucene to speed up the decoding of postings. This made me want to look 
> a bit deeper at what we could do.
> Our current decoding logic reads data in a byte[] and decodes packed integers 
> from it. Unfortunately it doesn't make use of SIMD instructions and looks 
> like 
> [this|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/NaiveByteDecoder.java].
> I confirmed by looking at the generated assembly that if I take an array of 
> integers and shift them all by the same number of bits then Java will use 
> SIMD instructions to shift multiple integers at once. This led me to writing 
> this 
> [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SimpleSIMDDecoder.java]
>  that tries as much as possible to shift long sequences of ints by the same 
> number of bits to speed up decoding. It is indeed faster than the current 
> logic we have, up to about 2x faster for some numbers of bits per value.
> Currently the best 
> [implementation|https://github.com/jpountz/decode-128-ints-benchmark/blob/master/src/main/java/jpountz/SIMDDecoder.java]
>  I've been able to come up with combines the above idea with the idea that 
> Paul mentioned in his blog that consists of emulating SIMD from Java by 
> packing multiple integers into a long: 2 ints, 4 shorts or 8 bytes. It is a 
> bit harder to read but gives another speedup on top of the above 
> implementation.
> I have a [JMH 
> benchmark|https://github.com/jpountz/decode-128-ints-benchmark/] available in 
> case someone would like to play with this and maybe even come up with an even 
> faster implementation. It is 2-2.5x faster than our current implementation 
> for most numbers of bits per value. I'm copying results here:
> {noformat}
>  * `readLongs` just reads 2*bitsPerValue longs from the ByteBuffer, it serves 
> as
>a baseline.
>  * `decodeNaiveFromBytes` reads a byte[] and decodes from it. This is what the
>current