[jira] [Commented] (LUCENE-7563) BKD index should compress unused leading bytes
[ https://issues.apache.org/jira/browse/LUCENE-7563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15728813#comment-15728813 ]

ASF subversion and git services commented on LUCENE-7563:

Commit fd1f608b49a7a8b5f7e6cc805378da2217ec657a in lucene-solr's branch refs/heads/branch_6x from Mike McCandless
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=fd1f608 ]

LUCENE-7563: remove redundant array copy in PackedIndexTree.clone

> BKD index should compress unused leading bytes
> ----------------------------------------------
>
> Key: LUCENE-7563
> URL: https://issues.apache.org/jira/browse/LUCENE-7563
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Michael McCandless
> Fix For: master (7.0), 6.4
>
> Attachments: LUCENE-7563-prefixlen-unary.patch,
> LUCENE-7563-prefixlen-unary.patch, LUCENE-7563.patch, LUCENE-7563.patch,
> LUCENE-7563.patch, LUCENE-7563.patch
>
> Today the BKD (points) in-heap index always uses {{dimensionNumBytes}} per
> dimension, but if e.g. you are indexing {{LongPoint}} yet only use the bottom
> two bytes in a given segment, we shouldn't store all those leading 0s in the
> index.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-7563) BKD index should compress unused leading bytes
[ https://issues.apache.org/jira/browse/LUCENE-7563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15728814#comment-15728814 ]

ASF subversion and git services commented on LUCENE-7563:

Commit 0c8e8e396a4ccc41e6af78ac7d0342716c36902a in lucene-solr's branch refs/heads/branch_6x from Mike McCandless
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=0c8e8e3 ]

LUCENE-7563: fix 6.x backport compilation errors
[jira] [Commented] (LUCENE-7563) BKD index should compress unused leading bytes
[ https://issues.apache.org/jira/browse/LUCENE-7563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15728811#comment-15728811 ]

ASF subversion and git services commented on LUCENE-7563:

Commit f51766c00fc374a6fc6f407b723bd8458556de7d in lucene-solr's branch refs/heads/branch_6x from Mike McCandless
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=f51766c ]

LUCENE-7563: use a compressed format for the in-heap BKD index
[jira] [Commented] (LUCENE-7563) BKD index should compress unused leading bytes
[ https://issues.apache.org/jira/browse/LUCENE-7563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15727037#comment-15727037 ]

ASF subversion and git services commented on LUCENE-7563:

Commit bd8b191505d92c89a483a6189497374238476a00 in lucene-solr's branch refs/heads/apiv2 from Mike McCandless
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=bd8b191 ]

LUCENE-7563: remove redundant array copy in PackedIndexTree.clone
[jira] [Commented] (LUCENE-7563) BKD index should compress unused leading bytes
[ https://issues.apache.org/jira/browse/LUCENE-7563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15727036#comment-15727036 ]

ASF subversion and git services commented on LUCENE-7563:

Commit 5e8db2e068f2549b9619d5ac48a50c8032fc292b in lucene-solr's branch refs/heads/apiv2 from Mike McCandless
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=5e8db2e ]

LUCENE-7563: use a compressed format for the in-heap BKD index
[jira] [Commented] (LUCENE-7563) BKD index should compress unused leading bytes
[ https://issues.apache.org/jira/browse/LUCENE-7563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15722537#comment-15722537 ]

Michael McCandless commented on LUCENE-7563:

Ahh, OK; I think we should restrict {{TestBKD}} to the same dimension count / bytes per dimension limits that Lucene enforces? As we tighten up how we compress the index, both on disk and in heap, we should only test what we actually offer to the end user.
[jira] [Commented] (LUCENE-7563) BKD index should compress unused leading bytes
[ https://issues.apache.org/jira/browse/LUCENE-7563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15722349#comment-15722349 ]

Adrien Grand commented on LUCENE-7563:

I dug into it: the test failure may happen with large numbers of bytes per dimension. It could be fixed if we limited the number of bytes per value in BKDWriter to 16 (like we do in FieldInfos) and made {{code}} a long.
[jira] [Commented] (LUCENE-7563) BKD index should compress unused leading bytes
[ https://issues.apache.org/jira/browse/LUCENE-7563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15722090#comment-15722090 ]

Michael McCandless commented on LUCENE-7563:

bq. I think there is just a redundant arraycopy in clone()?

Thanks, I pushed a fix!

bq. For the record, I played with another idea leveraging the fact that the prefix lengths on two consecutive levels are likely close to each other,

I like this idea! But I hit this test failure ... doesn't reproduce on trunk:

{noformat}
   [junit4]   2> NOTE: reproduce with: ant test -Dtestcase=TestBKD -Dtests.method=testWastedLeadingBytes -Dtests.seed=2E5F0E183BBA1098 -Dtests.locale=es-PR -Dtests.timezone=CST -Dtests.asserts=true -Dtests.file.encoding=US-ASCII
   [junit4] ERROR   0.90s J1 | TestBKD.testWastedLeadingBytes <<<
   [junit4]    > Throwable #1: java.lang.ArrayIndexOutOfBoundsException: -32
   [junit4]    >    at __randomizedtesting.SeedInfo.seed([2E5F0E183BBA1098:ABD9D50B47794EFC]:0)
   [junit4]    >    at org.apache.lucene.util.bkd.BKDReader$PackedIndexTree.readNodeData(BKDReader.java:442)
   [junit4]    >    at org.apache.lucene.util.bkd.BKDReader$PackedIndexTree.<init>(BKDReader.java:343)
   [junit4]    >    at org.apache.lucene.util.bkd.BKDReader.getIntersectState(BKDReader.java:526)
   [junit4]    >    at org.apache.lucene.util.bkd.BKDReader.intersect(BKDReader.java:498)
   [junit4]    >    at org.apache.lucene.util.bkd.TestBKD.testWastedLeadingBytes(TestBKD.java:1042)
   [junit4]    >    at java.lang.Thread.run(Thread.java:745)
{noformat}
[jira] [Commented] (LUCENE-7563) BKD index should compress unused leading bytes
[ https://issues.apache.org/jira/browse/LUCENE-7563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15722054#comment-15722054 ]

ASF subversion and git services commented on LUCENE-7563:

Commit bd8b191505d92c89a483a6189497374238476a00 in lucene-solr's branch refs/heads/master from Mike McCandless
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=bd8b191 ]

LUCENE-7563: remove redundant array copy in PackedIndexTree.clone
[jira] [Commented] (LUCENE-7563) BKD index should compress unused leading bytes
[ https://issues.apache.org/jira/browse/LUCENE-7563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15719702#comment-15719702 ]

ASF subversion and git services commented on LUCENE-7563:

Commit 5e8db2e068f2549b9619d5ac48a50c8032fc292b in lucene-solr's branch refs/heads/master from Mike McCandless
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=5e8db2e ]

LUCENE-7563: use a compressed format for the in-heap BKD index
[jira] [Commented] (LUCENE-7563) BKD index should compress unused leading bytes
[ https://issues.apache.org/jira/browse/LUCENE-7563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15702891#comment-15702891 ]

Adrien Grand commented on LUCENE-7563:

bq. Hmm I think I am already doing that?

You are right, I had not read the code correctly.

bq. Oooh that's a great idea! Saves 1 byte per inner node. We need 5 bits for the prefix I think since it can range 0 .. 16 inclusive, and 3 bits for the splitDim since it's 0 .. 7 inclusive.

I have been thinking about it more and I think we can make it more general. The first two bytes that differ are likely close to each other, so if we call their difference {{firstByteDelta}}, we could pack {{firstByteDelta}}, {{splitDim}} and {{prefix}} into a single vint (e.g. {{(firstByteDelta * (1 + bytesPerDim) + prefix) * numDims + splitDim}}) that would sometimes take only one byte (quite often when {{numDims}} and {{bytesPerDim}} are small, and rarely in the opposite case).

bq. but it felt wrong to just pass these packed bytes to the simple text format ...

Agreed. Maybe we should duplicate the current BKDReader/BKDWriter into a new impl that would be specific to SimpleText and would not need all those optimizations, so that both impls can evolve separately.
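For readers following along, the packing formula proposed above can be sketched as below. This is only an illustration under assumptions, not the code from any attached patch: the class {{PackedNodeCodeSketch}} and its method names are hypothetical, {{firstByteDelta}} is assumed to be non-negative, and {{vIntLength}} just counts 7-bit groups the way a variable-length int encoding would.

```java
/** Hypothetical sketch of the proposed packing; not taken from the LUCENE-7563 patches. */
final class PackedNodeCodeSketch {

  /** Packs the three per-node values into one integer using the formula from the comment. */
  static int encode(int firstByteDelta, int prefix, int splitDim, int numDims, int bytesPerDim) {
    // prefix ranges over 0..bytesPerDim inclusive, hence the (1 + bytesPerDim) radix.
    return (firstByteDelta * (1 + bytesPerDim) + prefix) * numDims + splitDim;
  }

  /** Inverse of encode: returns {firstByteDelta, prefix, splitDim}. */
  static int[] decode(int code, int numDims, int bytesPerDim) {
    int splitDim = code % numDims;
    code /= numDims;
    int prefix = code % (1 + bytesPerDim);
    int firstByteDelta = code / (1 + bytesPerDim);
    return new int[] {firstByteDelta, prefix, splitDim};
  }

  /** Number of bytes a 7-bits-per-byte variable-length encoding needs for a non-negative value. */
  static int vIntLength(int v) {
    int n = 1;
    while ((v & ~0x7F) != 0) {
      v >>>= 7;
      n++;
    }
    return n;
  }

  public static void main(String[] args) {
    // Small numDims / bytesPerDim: the code fits in a single byte.
    int small = encode(3, 2, 1, 2, 4);
    System.out.println(small + " needs " + vIntLength(small) + " byte(s)");

    // Largest supported shape (8 dims, 16 bytes per dim) with a big delta: more bytes needed.
    int large = encode(200, 16, 7, 8, 16);
    System.out.println(large + " needs " + vIntLength(large) + " byte(s)");
  }
}
```

With 2 dimensions and 4 bytes per dimension the example code stays below 128, so it fits in one byte; with 8 dimensions, 16 bytes per dimension and a large delta it does not, which matches the "quite often ... and rarely in the opposite case" expectation above.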
[jira] [Commented] (LUCENE-7563) BKD index should compress unused leading bytes
[ https://issues.apache.org/jira/browse/LUCENE-7563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15702280#comment-15702280 ]

Michael McCandless commented on LUCENE-7563:

bq. It seems we are always delta coding with the split value of the parent level, but for the multi-dimensional case, I think it would be better to delta-code with the last split value that was on the same dimension?

Hmm I think I am already doing that? Note that the {{splitValuesStack}} in {{BKDReader.PackedIndexTree}} holds all dimensions' last split values, and then when I read the suffix bytes in, I copy them into the packed values for the current split dimension:

{noformat}
in.readBytes(splitValuesStack[level], splitDim*bytesPerDim+prefix, suffix);
{noformat}

I think? I'll test on the OpenStreetMaps geo benchmark to measure the impact ... I'll also run the 2B tests to make sure nothing broke.

bq. For instance we use whole bytes to store the split dimension or the prefix length while they only need 3 and 4 bits? In the multi-dimensional case we could store both on a single byte.

Oooh that's a great idea! Saves 1 byte per inner node. We need 5 bits for the prefix I think since it can range 0 .. 16 inclusive, and 3 bits for the {{splitDim}} since it's 0 .. 7 inclusive.

bq. It doesn't need to be done in the same patch, but it would also be nice for SimpleText to not use the legacy format of the index. I'm not sure how to proceed however.

Yeah I'm not sure what to do here either ... but it felt wrong to just pass these packed bytes to the simple text format ... that packed form is even further from "simple" than the two arrays we have now.
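A minimal sketch of the single-byte idea discussed above, assuming the limits stated in the comment (prefix in 0 .. 16, {{splitDim}} in 0 .. 7). The class {{NodeHeaderSketch}} and its methods are hypothetical names for illustration and are not part of the patch.

```java
/** Hypothetical helper, not part of Lucene: one header byte holding prefix and splitDim. */
final class NodeHeaderSketch {
  static final int MAX_PREFIX = 16; // prefix in 0..16 inclusive needs 5 bits
  static final int MAX_DIMS = 8;    // splitDim in 0..7 inclusive needs 3 bits

  /** Stores prefix in the upper 5 bits and splitDim in the lower 3 bits. */
  static byte pack(int prefix, int splitDim) {
    if (prefix < 0 || prefix > MAX_PREFIX) {
      throw new IllegalArgumentException("prefix out of range: " + prefix);
    }
    if (splitDim < 0 || splitDim >= MAX_DIMS) {
      throw new IllegalArgumentException("splitDim out of range: " + splitDim);
    }
    return (byte) ((prefix << 3) | splitDim);
  }

  /** Recovers the prefix length; the mask undoes Java's sign extension of byte. */
  static int prefix(byte header) {
    return (header & 0xFF) >>> 3;
  }

  /** Recovers the split dimension from the low 3 bits. */
  static int splitDim(byte header) {
    return header & 0x07;
  }

  public static void main(String[] args) {
    // Round-trip every legal combination to check that nothing is lost.
    for (int p = 0; p <= MAX_PREFIX; p++) {
      for (int d = 0; d < MAX_DIMS; d++) {
        byte h = pack(p, d);
        if (prefix(h) != p || splitDim(h) != d) {
          throw new AssertionError("round trip failed for " + p + "," + d);
        }
      }
    }
    System.out.println("all combinations round-trip in one byte");
  }
}
```

The largest header, prefix 16 with {{splitDim}} 7, is 135, which still fits in eight bits; that is why 5 + 3 bits are enough only as long as the 16-byte and 8-dimension limits hold.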
[jira] [Commented] (LUCENE-7563) BKD index should compress unused leading bytes
[ https://issues.apache.org/jira/browse/LUCENE-7563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15701587#comment-15701587 ]

Adrien Grand commented on LUCENE-7563:

It seems we are always delta coding with the split value of the parent level, but for the multi-dimensional case, I think it would be better to delta-code with the last split value that was on the same dimension? Otherwise compression would be very poor if both dimensions store a very different range of values?

Something else I was wondering is whether we can make bigger gains. For instance we use whole bytes to store the split dimension or the prefix length while they only need 3 and 4 bits? In the multi-dimensional case we could store both on a single byte. Maybe we can do even better, I haven't thought much about it.

It doesn't need to be done in the same patch, but it would also be nice for SimpleText to not use the legacy format of the index. I'm not sure how to proceed however.
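To make the concern about mixed value ranges concrete, here is a small sketch comparing the shared prefix against the parent's split value (on another dimension) with the shared prefix against the last split value on the same dimension. Everything in it is an assumption for illustration: the class {{SplitDeltaSketch}} is hypothetical, the sample values are made up, and values are encoded as plain big-endian longs, which is not necessarily how Lucene encodes points.

```java
import java.nio.ByteBuffer;

/** Hypothetical illustration of prefix sharing between split values; not Lucene code. */
final class SplitDeltaSketch {

  /** Number of leading bytes that a and b have in common. */
  static int sharedPrefix(byte[] a, byte[] b) {
    int n = Math.min(a.length, b.length);
    int i = 0;
    while (i < n && a[i] == b[i]) {
      i++;
    }
    return i;
  }

  /** Plain big-endian encoding of a long (an assumption made for this sketch). */
  static byte[] encode(long v) {
    return ByteBuffer.allocate(Long.BYTES).putLong(v).array();
  }

  public static void main(String[] args) {
    // Dimension 0 holds small values, dimension 1 holds large ones.
    byte[] parentSplitDim1 = encode(0x7F00_0000_0000_1234L); // parent node split on dim 1
    byte[] lastSplitDim0 = encode(0x1200L);                   // last split seen on dim 0
    byte[] currentSplitDim0 = encode(0x1234L);                // this node splits on dim 0

    // Against the parent (different dimension) no leading bytes are shared.
    System.out.println("prefix vs parent split: "
        + sharedPrefix(parentSplitDim1, currentSplitDim0));
    // Against the last split on the same dimension most leading bytes are shared.
    System.out.println("prefix vs same-dimension split: "
        + sharedPrefix(lastSplitDim0, currentSplitDim0));
  }
}
```

In this made-up case the same-dimension reference shares 7 of 8 bytes while the parent shares none, so only one suffix byte would need to be stored per node instead of eight.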
[jira] [Commented] (LUCENE-7563) BKD index should compress unused leading bytes
[ https://issues.apache.org/jira/browse/LUCENE-7563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15667700#comment-15667700 ]

Adrien Grand commented on LUCENE-7563:

+1