[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-4069:
Fix Version/s: 5.0

Applied to trunk in revision 1368567

Segment-level Bloom filters for a 2 x speed up on rare term searches

Key: LUCENE-4069
URL: https://issues.apache.org/jira/browse/LUCENE-4069
Project: Lucene - Core
Issue Type: Improvement
Components: core/index
Affects Versions: 3.6, 4.0-ALPHA
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor
Fix For: 4.0, 5.0
Attachments: 4069Failure.zip, BloomFilterPostingsBranch4x.patch, LUCENE-4069-tryDeleteDocument.patch, LUCENE-4203.patch, MHBloomFilterOn3.6Branch.patch, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PrimaryKeyPerfTest40.java

An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access.

Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests.

Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments

Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU

Patch based on 3.6 codebase attached. There are no 3.6 API changes currently - to play just add a field with _blm on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly.

Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
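The fast-fail idea in the issue description can be illustrated with a small, self-contained sketch. This is not the code from the attached patches: the class `SegmentBloomSketch`, its hash functions, and the `diskLookups` counter are all hypothetical, and a `HashSet` stands in for a segment's on-disk term dictionary. It only shows why a negative answer from the filter lets a term search skip the expensive lookup.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

/** Toy per-segment Bloom filter. Hypothetical names; not a Lucene class. */
class SegmentBloomSketch {
    private final BitSet bits;
    private final int numBits;
    // Stand-in for the segment's real (disk-based) term dictionary.
    private final Set<String> termsOnDisk = new HashSet<>();
    // Counts how often the simulated expensive lookup was actually needed.
    int diskLookups = 0;

    SegmentBloomSketch(int numBits) {
        this.numBits = numBits;
        this.bits = new BitSet(numBits);
    }

    // Two bit positions derived from one FNV-1a pass (an arbitrary choice for this sketch).
    private int[] positions(String term) {
        int h1 = 0x811C9DC5;
        for (byte b : term.getBytes(StandardCharsets.UTF_8)) {
            h1 ^= (b & 0xFF);
            h1 *= 16777619;
        }
        int h2 = Integer.rotateLeft(h1, 15) * 0x9E3779B1;
        return new int[] { Math.floorMod(h1, numBits), Math.floorMod(h2, numBits) };
    }

    void add(String term) {
        termsOnDisk.add(term);
        for (int p : positions(term)) {
            bits.set(p);
        }
    }

    /** false means "definitely absent"; true only means "possibly present". */
    boolean mightContain(String term) {
        for (int p : positions(term)) {
            if (!bits.get(p)) {
                return false;
            }
        }
        return true;
    }

    /** Term lookup that consults the filter before touching the "disk". */
    boolean lookup(String term) {
        if (!mightContain(term)) {
            return false; // fast-fail: no disk access needed
        }
        diskLookups++;
        return termsOnDisk.contains(term);
    }

    public static void main(String[] args) {
        SegmentBloomSketch segment = new SegmentBloomSketch(1 << 16);
        segment.add("pk-42");
        System.out.println("pk-42 found: " + segment.lookup("pk-42"));
        System.out.println("pk-43 found: " + segment.lookup("pk-43"));
        System.out.println("disk lookups: " + segment.diskLookups);
    }
}
```

With many segments, a primary-key search would run `lookup` once per segment, and most segments would answer from the bitset alone, which is where the saving described above would come from.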
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-4069:
Attachment: (was: BloomFilterPostingsBranch4x.patch)

Segment-level Bloom filters for a 2 x speed up on rare term searches

Key: LUCENE-4069
URL: https://issues.apache.org/jira/browse/LUCENE-4069
Project: Lucene - Core
Issue Type: Improvement
Components: core/index
Affects Versions: 3.6, 4.0-ALPHA
Reporter: Mark Harwood
Priority: Minor
Fix For: 4.0
Attachments: 4069Failure.zip, LUCENE-4069-tryDeleteDocument.patch, LUCENE-4203.patch, MHBloomFilterOn3.6Branch.patch, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PrimaryKeyPerfTest40.java
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-4069:
Attachment: BloomFilterPostingsBranch4x.patch

Updated with fix to issue explored in Lucene-4275

Segment-level Bloom filters for a 2 x speed up on rare term searches

Key: LUCENE-4069
URL: https://issues.apache.org/jira/browse/LUCENE-4069
Project: Lucene - Core
Issue Type: Improvement
Components: core/index
Affects Versions: 3.6, 4.0-ALPHA
Reporter: Mark Harwood
Priority: Minor
Fix For: 4.0
Attachments: 4069Failure.zip, BloomFilterPostingsBranch4x.patch, LUCENE-4069-tryDeleteDocument.patch, LUCENE-4203.patch, MHBloomFilterOn3.6Branch.patch, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PrimaryKeyPerfTest40.java
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-4069:
Attachment: (was: BloomFilterPostingsBranch4x.patch)

Segment-level Bloom filters for a 2 x speed up on rare term searches

Key: LUCENE-4069
URL: https://issues.apache.org/jira/browse/LUCENE-4069
Project: Lucene - Core
Issue Type: Improvement
Components: core/index
Affects Versions: 3.6, 4.0-ALPHA
Reporter: Mark Harwood
Priority: Minor
Fix For: 4.0
Attachments: 4069Failure.zip, LUCENE-4069-tryDeleteDocument.patch, LUCENE-4203.patch, MHBloomFilterOn3.6Branch.patch, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PrimaryKeyPerfTest40.java
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-4069:
Attachment: BloomFilterPostingsBranch4x.patch

Updated patch to bring in line with latest core API changes. All tests now pass clean so will commit soon

Segment-level Bloom filters for a 2 x speed up on rare term searches

Key: LUCENE-4069
URL: https://issues.apache.org/jira/browse/LUCENE-4069
Project: Lucene - Core
Issue Type: Improvement
Components: core/index
Affects Versions: 3.6, 4.0-ALPHA
Reporter: Mark Harwood
Priority: Minor
Fix For: 4.0
Attachments: 4069Failure.zip, BloomFilterPostingsBranch4x.patch, LUCENE-4069-tryDeleteDocument.patch, LUCENE-4203.patch, MHBloomFilterOn3.6Branch.patch, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PrimaryKeyPerfTest40.java
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-4069:
Attachment: 4069Failure.zip

Attached a log of thread activity showing how TestIndexWriterCommit.testCommitThreadSafety() is failing. At this stage I can't tell if this is a failing in MockDirectoryWrapper or the test or the BloomPF class but it is related to files being removed unexpectedly.

Segment-level Bloom filters for a 2 x speed up on rare term searches

Key: LUCENE-4069
URL: https://issues.apache.org/jira/browse/LUCENE-4069
Project: Lucene - Java
Issue Type: Improvement
Components: core/index
Affects Versions: 3.6, 4.0-ALPHA
Reporter: Mark Harwood
Priority: Minor
Fix For: 4.0
Attachments: 4069Failure.zip, BloomFilterPostingsBranch4x.patch, LUCENE-4069-tryDeleteDocument.patch, LUCENE-4203.patch, MHBloomFilterOn3.6Branch.patch, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PrimaryKeyPerfTest40.java
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-4069:
Attachment: (was: BloomFilterPostingsBranch4x.patch)

Segment-level Bloom filters for a 2 x speed up on rare term searches

Key: LUCENE-4069
URL: https://issues.apache.org/jira/browse/LUCENE-4069
Project: Lucene - Java
Issue Type: Improvement
Components: core/index
Affects Versions: 3.6, 4.0-ALPHA
Reporter: Mark Harwood
Priority: Minor
Fix For: 4.0
Attachments: LUCENE-4069-tryDeleteDocument.patch, LUCENE-4203.patch, MHBloomFilterOn3.6Branch.patch, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PrimaryKeyPerfTest40.java
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-4069:
Attachment: BloomFilterPostingsBranch4x.patch

New patch with use of SegmentWriteState to right-size the choice of bitset for volume of content.

Segment-level Bloom filters for a 2 x speed up on rare term searches

Key: LUCENE-4069
URL: https://issues.apache.org/jira/browse/LUCENE-4069
Project: Lucene - Java
Issue Type: Improvement
Components: core/index
Affects Versions: 3.6, 4.0-ALPHA
Reporter: Mark Harwood
Priority: Minor
Fix For: 4.0
Attachments: BloomFilterPostingsBranch4x.patch, LUCENE-4069-tryDeleteDocument.patch, LUCENE-4203.patch, MHBloomFilterOn3.6Branch.patch, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PrimaryKeyPerfTest40.java
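The thread does not show how the patch sizes its bitset, so the following is only a sketch of the textbook Bloom filter sizing rule, under the assumption that an estimate of the number of distinct terms is available when the segment is written (for a primary-key field, the segment's document count would be one such estimate). The class and method names are hypothetical. For `n` expected terms and a target false-positive rate `p`, the usual choice is `m = -n ln(p) / (ln 2)^2` bits and `k = (m / n) ln 2` hash functions.

```java
/** Textbook Bloom filter sizing. Hypothetical helper; not taken from the patch. */
class BloomSizingSketch {

    /** Number of bits m for n expected terms at false-positive rate p. */
    static long optimalBits(long expectedTerms, double falsePositiveRate) {
        if (expectedTerms <= 0 || falsePositiveRate <= 0.0 || falsePositiveRate >= 1.0) {
            throw new IllegalArgumentException("need expectedTerms > 0 and 0 < p < 1");
        }
        double ln2 = Math.log(2);
        return (long) Math.ceil(-expectedTerms * Math.log(falsePositiveRate) / (ln2 * ln2));
    }

    /** Number of hash functions k that minimises false positives for m bits and n terms. */
    static int optimalHashes(long bits, long expectedTerms) {
        return Math.max(1, (int) Math.round((double) bits / expectedTerms * Math.log(2)));
    }

    public static void main(String[] args) {
        long terms = 1_000_000L; // assumed estimate for one segment
        long bits = optimalBits(terms, 0.01);
        System.out.println("bits: " + bits + ", hashes: " + optimalHashes(bits, terms));
    }
}
```

For one million terms at a 1% false-positive rate this works out to roughly 9.6 million bits, or about 1.2 MB, which gives a feel for why sizing per segment matters: small segments need far less, and a fixed large bitset would waste memory on them.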
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-4069:
Attachment: (was: BloomFilterPostingsBranch4x.patch)

Segment-level Bloom filters for a 2 x speed up on rare term searches

Key: LUCENE-4069
URL: https://issues.apache.org/jira/browse/LUCENE-4069
Project: Lucene - Java
Issue Type: Improvement
Components: core/index
Affects Versions: 3.6, 4.0-ALPHA
Reporter: Mark Harwood
Priority: Minor
Fix For: 4.0
Attachments: LUCENE-4069-tryDeleteDocument.patch, LUCENE-4203.patch, MHBloomFilterOn3.6Branch.patch, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PrimaryKeyPerfTest40.java
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-4069:
Attachment: BloomFilterPostingsBranch4x.patch

Added bloom package.html and changes.txt. I plan to commit in a day or two if there are no objections.

Segment-level Bloom filters for a 2 x speed up on rare term searches

Key: LUCENE-4069
URL: https://issues.apache.org/jira/browse/LUCENE-4069
Project: Lucene - Java
Issue Type: Improvement
Components: core/index
Affects Versions: 3.6, 4.0-ALPHA
Reporter: Mark Harwood
Priority: Minor
Fix For: 4.0
Attachments: BloomFilterPostingsBranch4x.patch, LUCENE-4069-tryDeleteDocument.patch, LUCENE-4203.patch, MHBloomFilterOn3.6Branch.patch, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PrimaryKeyPerfTest40.java
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-4069:
Attachment: PKLookupUpdatePerfTest.java

Only substantial change was I fixed the updates vs insert counting to work correctly when doDirectDelete is true.

Segment-level Bloom filters for a 2 x speed up on rare term searches

Key: LUCENE-4069
URL: https://issues.apache.org/jira/browse/LUCENE-4069
Project: Lucene - Java
Issue Type: Improvement
Components: core/index
Affects Versions: 3.6, 4.0
Reporter: Mark Harwood
Priority: Minor
Fix For: 4.0, 3.6.1
Attachments: BloomFilterPostingsBranch4x.patch, LUCENE-4069-tryDeleteDocument.patch, MHBloomFilterOn3.6Branch.patch, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PrimaryKeyPerfTest40.java
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-4069:
Attachment: LUCENE-4203.patch

Initial rough patch ... not ready to commit and needs a good random test.

Segment-level Bloom filters for a 2 x speed up on rare term searches

Key: LUCENE-4069
URL: https://issues.apache.org/jira/browse/LUCENE-4069
Project: Lucene - Java
Issue Type: Improvement
Components: core/index
Affects Versions: 3.6, 4.0
Reporter: Mark Harwood
Priority: Minor
Fix For: 4.0, 3.6.1
Attachments: BloomFilterPostingsBranch4x.patch, LUCENE-4069-tryDeleteDocument.patch, LUCENE-4203.patch, MHBloomFilterOn3.6Branch.patch, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PrimaryKeyPerfTest40.java
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-4069:
Attachment: PKLookupUpdatePerfTest.java

Updated performance test with option to alter the ratio of inserts vs updates via keyspace size.

Segment-level Bloom filters for a 2 x speed up on rare term searches

Key: LUCENE-4069
URL: https://issues.apache.org/jira/browse/LUCENE-4069
Project: Lucene - Java
Issue Type: Improvement
Components: core/index
Affects Versions: 3.6, 4.0
Reporter: Mark Harwood
Priority: Minor
Fix For: 4.0, 3.6.1
Attachments: BloomFilterPostingsBranch4x.patch, LUCENE-4069-tryDeleteDocument.patch, MHBloomFilterOn3.6Branch.patch, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PrimaryKeyPerfTest40.java
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-4069:
Attachment: LUCENE-4069-tryDeleteDocument.patch
            PKLookupUpdatePerfTest.java

I made various improvements to the PK lookup/update performance tester (called out as booleans at the top) so we can test the impact of:

* Reusing the enums
* Pre-sorting the keys before lookup
* Best case (flush once after each update) vs worst case
* Using DocValues direct source instead of a stored field to hold the counter
* Using base 36 for the PK vs base 10

Also, it's really wasteful that we go and look up the docID by PK, to retrieve the old count, and then do an updateDocument call which forces IW to go and do the exact same (costly) lookup again. So I added a new method enabling deletion by document ID in IndexWriter. I named it tryDeleteDocument, and it takes SegmentInfo and int docID. If that segment has not been merged away, the delete succeeds, else it fails (and the app must delete through the normal way). This seems to give ~20% speedup.

I think e.g. Solr's update of a document field (retrieving the old doc, changing the field, calling IW.updateDocument) could also use this method.

Segment-level Bloom filters for a 2 x speed up on rare term searches

Key: LUCENE-4069
URL: https://issues.apache.org/jira/browse/LUCENE-4069
Project: Lucene - Java
Issue Type: Improvement
Components: core/index
Affects Versions: 3.6, 4.0
Reporter: Mark Harwood
Priority: Minor
Fix For: 4.0, 3.6.1
Attachments: BloomFilterPostingsBranch4x.patch, LUCENE-4069-tryDeleteDocument.patch, MHBloomFilterOn3.6Branch.patch, PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, PrimaryKeyPerfTest40.java
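The try-then-fall-back contract described for tryDeleteDocument can be modelled without Lucene at all. The sketch below is a toy in-memory model, not the attached patch and not IndexWriter: segments are identified by a plain string name rather than SegmentInfo, and `TryDeleteSketch`, `mergeAway`, and `deleteForUpdate` are hypothetical names invented for illustration. It shows only the control flow: delete by docID while the segment still exists, otherwise fall back to the normal delete by key.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model of "delete by docID if the segment still exists". Hypothetical; not IndexWriter. */
class TryDeleteSketch {
    // Live segment name -> docIDs marked deleted in that segment.
    private final Map<String, Set<Integer>> liveSegments = new HashMap<>();
    // Keys that had to go through the slower delete-by-key path.
    final List<String> slowDeletesByKey = new ArrayList<>();

    void addSegment(String name) {
        liveSegments.put(name, new HashSet<>());
    }

    /** Simulates a merge replacing this segment, which invalidates its docIDs. */
    void mergeAway(String name) {
        liveSegments.remove(name);
    }

    /** Succeeds only while the segment the docID belongs to has not been merged away. */
    boolean tryDeleteDocument(String segmentName, int docId) {
        Set<Integer> deleted = liveSegments.get(segmentName);
        if (deleted == null) {
            return false;
        }
        deleted.add(docId);
        return true;
    }

    /** Caller-side pattern: fast path first, normal delete by key otherwise. */
    boolean deleteForUpdate(String segmentName, int docId, String primaryKey) {
        if (tryDeleteDocument(segmentName, docId)) {
            return true;
        }
        slowDeletesByKey.add(primaryKey); // stand-in for the usual key-based delete
        return false;
    }

    public static void main(String[] args) {
        TryDeleteSketch writer = new TryDeleteSketch();
        writer.addSegment("_0");
        System.out.println("fast path: " + writer.deleteForUpdate("_0", 7, "pk-7"));
        writer.mergeAway("_0");
        System.out.println("fast path after merge: " + writer.deleteForUpdate("_0", 8, "pk-8"));
    }
}
```

The saving reported in the message comes from the fast path skipping the second primary-key lookup; the fallback keeps the operation correct when a merge has happened between the lookup and the delete.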
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Harwood updated LUCENE-4069:
Attachment: (was: BloomFilterPostingsBranch4x.patch)

Segment-level Bloom filters for a 2 x speed up on rare term searches

Key: LUCENE-4069
URL: https://issues.apache.org/jira/browse/LUCENE-4069
Project: Lucene - Java
Issue Type: Improvement
Components: core/index
Affects Versions: 3.6, 4.0
Reporter: Mark Harwood
Priority: Minor
Fix For: 4.0, 3.6.1
Attachments: MHBloomFilterOn3.6Branch.patch, PKLookupUpdatePerfTest.java, PrimaryKeyPerfTest40.java
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-4069: - Attachment: BloomFilterPostingsBranch4x.patch Added customizable saturation threshold after which Bloom filters are retired and no longer maintained (due to merges creating very large segments). Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0 Reporter: Mark Harwood Priority: Minor Fix For: 4.0, 3.6.1 Attachments: BloomFilterPostingsBranch4x.patch, MHBloomFilterOn3.6Branch.patch, PKLookupUpdatePerfTest.java, PrimaryKeyPerfTest40.java
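For readers following the thread, here is a minimal sketch of what a saturation threshold could look like. It is not the code from the patch: the class name SaturationSketch, the single-hash bit selection via String.hashCode(), and the retireIfSaturated method are all assumptions made for illustration. The idea shown is only that a filter whose bitset is mostly set no longer rejects many lookups, so it can be dropped.

```java
import java.util.BitSet;

/** Hypothetical sketch, not the patch's code: a one-hash Bloom-style bitset that reports its saturation. */
final class SaturationSketch {
    private final BitSet bits;
    private final int size;

    SaturationSketch(int size) {
        this.size = size;
        this.bits = new BitSet(size);
    }

    /** Records a term by setting the bit selected by its hash (assumed hash: String.hashCode()). */
    void add(String term) {
        bits.set(Math.floorMod(term.hashCode(), size));
    }

    /** Fraction of bits set, between 0.0 and 1.0. */
    double saturation() {
        return (double) bits.cardinality() / size;
    }

    /** false means "definitely absent"; true means "possibly present". */
    boolean mayContain(String term) {
        return bits.get(Math.floorMod(term.hashCode(), size));
    }

    /** Returns this filter, or null when it is fuller than the threshold and should be retired. */
    SaturationSketch retireIfSaturated(double maxSaturation) {
        return saturation() > maxSaturation ? null : this;
    }

    public static void main(String[] args) {
        SaturationSketch filter = new SaturationSketch(64);
        for (int i = 0; i < 1000; i++) {
            filter.add("doc-" + i);
        }
        System.out.println("saturation = " + filter.saturation());
        System.out.println("retired = " + (filter.retireIfSaturated(0.9) == null));
    }
}
```

A caller that gets null back would simply skip the filter for that segment and go straight to the normal term lookup.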
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-4069: - Attachment: PKLookupUpdatePerfTest.java Attached a performance test (adapted from Mike's PKLookupPerfTest) that demonstrates the worst-case scenario where BloomFilter offers the 2x speed up not previously revealed in Mike's other tests. This test case mixes reads and writes on a growing index and is representative of the real-world scenario I am seeking to optimize. See the javadoc for test details. Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0 Reporter: Mark Harwood Priority: Minor Fix For: 4.0, 3.6.1 Attachments: BloomFilterPostingsBranch4x.patch, MHBloomFilterOn3.6Branch.patch, PKLookupUpdatePerfTest.java, PrimaryKeyPerfTest40.java
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-4069: - Attachment: (was: BloomFilterPostingsBranch4x.patch) Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0 Reporter: Mark Harwood Priority: Minor Fix For: 4.0, 3.6.1 Attachments: MHBloomFilterOn3.6Branch.patch, PrimaryKeyPerfTest40.java
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-4069: - Attachment: BloomFilterPostingsBranch4x.patch Fix for the bug where the bitset was not being downsized, plus a subsequent issue which that fix revealed. The 2nd issue was that on saturation, the downsize method would actually upsize into a bigger bitset. This causes false negatives on searches: it's safe to downsize the indexing bitset but not to upsize it, as there is already some information loss involved. Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0 Reporter: Mark Harwood Priority: Minor Fix For: 4.0, 3.6.1 Attachments: BloomFilterPostingsBranch4x.patch, MHBloomFilterOn3.6Branch.patch
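To make the downsize/upsize asymmetry concrete, here is a small sketch under stated assumptions. It is not taken from the patch: ResizeSketch, slot, downsize and naiveUpsize are hypothetical names, and the bit for a term is assumed to be its hash modulo the bitset size. Under that assumption, folding a bitset into a smaller one whose size divides the original keeps every "possibly present" answer, while copying a small bitset into a larger one loses track of where the bits would have landed.

```java
import java.util.BitSet;

/** Hypothetical sketch, not the patch's code: why folding down is safe and growing is not. */
final class ResizeSketch {

    /** Bit position for a hash in a bitset of the given size (assumed scheme: hash modulo size). */
    static int slot(int hash, int size) {
        return Math.floorMod(hash, size);
    }

    /** Folds a larger bitset into a smaller one; smallSize must divide bigSize. */
    static BitSet downsize(BitSet big, int bigSize, int smallSize) {
        if (bigSize % smallSize != 0) {
            throw new IllegalArgumentException("smallSize must divide bigSize");
        }
        BitSet small = new BitSet(smallSize);
        for (int i = big.nextSetBit(0); i >= 0; i = big.nextSetBit(i + 1)) {
            // (hash mod bigSize) mod smallSize equals hash mod smallSize, so no term is lost.
            small.set(i % smallSize);
        }
        return small;
    }

    /** Copies bits unchanged into a bigger bitset; shown only to demonstrate the failure. */
    static BitSet naiveUpsize(BitSet small, int bigSize) {
        BitSet big = new BitSet(bigSize);
        big.or(small);
        return big;
    }

    public static void main(String[] args) {
        int hash = 9;

        BitSet big = new BitSet(16);
        big.set(slot(hash, 16));                                // sets bit 9
        BitSet folded = downsize(big, 16, 8);
        System.out.println(folded.get(slot(hash, 8)));          // true: term still reported as possibly present

        BitSet small = new BitSet(8);
        small.set(slot(hash, 8));                               // sets bit 1
        BitSet grown = naiveUpsize(small, 16);
        System.out.println(grown.get(slot(hash, 16)));          // false: a false negative
    }
}
```

The second lookup fails because the small bitset only recorded bit 1, and nothing in it says whether the original hash would map to bit 1 or bit 9 in a 16-bit set.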
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-4069: - Attachment: (was: PrimaryKeyPerfTest40.java) Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0 Reporter: Mark Harwood Priority: Minor Fix For: 4.0, 3.6.1 Attachments: BloomFilterPostingsBranch4x.patch, MHBloomFilterOn3.6Branch.patch
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-4069: - Attachment: PrimaryKeyPerfTest40.java Updated Performance test code based on new IndexReader changes for accessing subreaders Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0 Reporter: Mark Harwood Priority: Minor Fix For: 4.0, 3.6.1 Attachments: BloomFilterPostingsBranch4x.patch, MHBloomFilterOn3.6Branch.patch, PrimaryKeyPerfTest40.java
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-4069: - Attachment: (was: BloomFilterPostingsBranch4x.patch) Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0 Reporter: Mark Harwood Priority: Minor Fix For: 4.0, 3.6.1 Attachments: BloomFilterPostingsBranch4x.patch, MHBloomFilterOn3.6Branch.patch, PrimaryKeyPerfTest40.java
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-4069: - Attachment: BloomFilterPostingsBranch4x.patch New patch with Terms.intersect overridden for faster Fuzzy queries. Didn't get too far with running the Wikipedia perf tests due to missing data file (see http://code.google.com/a/apache-extras.org/p/luceneutil/issues/detail?id=7 ) Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0 Reporter: Mark Harwood Priority: Minor Fix For: 4.0, 3.6.1 Attachments: BloomFilterPostingsBranch4x.patch, MHBloomFilterOn3.6Branch.patch, PrimaryKeyPerfTest40.java
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-4069: - Attachment: (was: BloomFilterPostings40.patch) Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0 Reporter: Mark Harwood Priority: Minor Fix For: 4.0, 3.6.1 Attachments: MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-4069: - Attachment: BloomFilterPostingsBranch4x.patch
Updated as follows:
* Extracted Bloom filter functionality as new oal.util.FuzzySet class - the name is changed because Bloom filtering is one application for a FuzzySet, fuzzy count distincts being another.
* BloomFilterPostingsFormat now takes a factory that can tailor choice of BloomFilter per field (bitset size/saturation settings and choice of hash algo). Provided a default factory implementation.
* All unit tests pass now that I have a test PostingsFormat class that uses very small bitsets, where before the many-field unit tests would cause OOM.
Will follow up with benchmarks when I have more time to run and document them. Initial results from my large-scale tests on growing indexes show a nice flat line in the face of a growing index, whereas a non-Bloomed index saw-tooths upwards as segments grow/merge.
Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0 Reporter: Mark Harwood Priority: Minor Fix For: 4.0, 3.6.1 Attachments: BloomFilterPostingsBranch4x.patch, MHBloomFilterOn3.6Branch.patch
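A rough illustration of the per-field factory idea, for anyone who has not opened the patch. None of these names come from the patch: PerFieldSettingsSketch, Settings, Factory, DefaultFactory and OverridingFactory are hypothetical, the default values (2^20 bits, 0.9 saturation) are arbitrary assumptions, and the choice of hash algorithm is left out to keep the sketch short.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch, not the patch's API: a factory that picks filter settings per field. */
final class PerFieldSettingsSketch {

    /** Illustrative settings holder: bitset size and the saturation at which a filter is retired. */
    static final class Settings {
        final int bitSetSize;
        final double maxSaturation;

        Settings(int bitSetSize, double maxSaturation) {
            this.bitSetSize = bitSetSize;
            this.maxSaturation = maxSaturation;
        }
    }

    /** The factory contract: given a field name, return the settings to use for its filter. */
    interface Factory {
        Settings settingsFor(String fieldName);
    }

    /** Default implementation: the same (assumed) settings for every field. */
    static final class DefaultFactory implements Factory {
        @Override
        public Settings settingsFor(String fieldName) {
            return new Settings(1 << 20, 0.9);
        }
    }

    /** Wraps another factory and overrides the settings of selected fields. */
    static final class OverridingFactory implements Factory {
        private final Factory fallback;
        private final Map<String, Settings> overrides = new HashMap<>();

        OverridingFactory(Factory fallback) {
            this.fallback = fallback;
        }

        OverridingFactory with(String fieldName, Settings settings) {
            overrides.put(fieldName, settings);
            return this;
        }

        @Override
        public Settings settingsFor(String fieldName) {
            Settings s = overrides.get(fieldName);
            return s != null ? s : fallback.settingsFor(fieldName);
        }
    }

    public static void main(String[] args) {
        Factory factory = new OverridingFactory(new DefaultFactory())
                .with("id", new Settings(1 << 24, 0.5));
        System.out.println("id bits = " + factory.settingsFor("id").bitSetSize);
        System.out.println("body bits = " + factory.settingsFor("body").bitSetSize);
    }
}
```

The point of the indirection is that a test can plug in a factory that hands out tiny bitsets, which is one way the many-field OOM mentioned above could be avoided.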
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-4069: - Attachment: (was: PrimaryKey40PerformanceTestSrc.zip) Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0 Reporter: Mark Harwood Priority: Minor Fix For: 4.0, 3.6.1 Attachments: BloomFilterPostingsBranch4x.patch, MHBloomFilterOn3.6Branch.patch
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-4069: - Attachment: PrimaryKeyPerfTest40.java Benchmark tool adapted from Mike's original Pulsing codec benchmark. Now includes Bloom postings example. Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0 Reporter: Mark Harwood Priority: Minor Fix For: 4.0, 3.6.1 Attachments: BloomFilterPostingsBranch4x.patch, MHBloomFilterOn3.6Branch.patch, PrimaryKeyPerfTest40.java
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-4069: - Attachment: BloomFilterPostings40.patch
This is looking more promising. Running ant test-core -Dtests.postingsformat=TestBloomFilteredLucene40Postings now passes all tests but causes OOM exception on 3 tests:
* TestConsistentFieldNumbers.testManyFields
* TestIndexableField.testArbitraryFields
* TestIndexWriter.testManyFields
Any pointers on how to annotate or otherwise avoid the BloomFilter class for many-field tests would be welcome. These are not realistic tests for this class (we don't expect indexes with 100s of primary-key like fields).
In this patch I've
* added an SPI lookup mechanism for pluggable hash algos.
* documented the file format
* fixed issues with TermVector tests
* changed the API
To use: BloomFilteringPostingFormat now takes a delegate PostingsFormat and a set of field names that are to have bloom-filters created. Fields that are not listed in the filter set can be safely indexed as per normal and doing so is beneficial because it allows filtered and non filtered field data to co-exist in the same physical files created by the delegate PostingsFormat.
Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0 Reporter: Mark Harwood Priority: Minor Fix For: 4.0, 3.6.1 Attachments: BloomFilterCodec40.patch, BloomFilterPostings40.patch, MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip
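The "delegate plus a set of filtered field names" arrangement can be pictured with the toy model below. It deliberately uses no Lucene classes: FilteredLookupSketch is a hypothetical name, a HashMap of term sets stands in for the delegate PostingsFormat, and a one-hash BitSet stands in for the Bloom filter. It shows only the control flow, where a filtered field can fast-fail without touching the delegate while unlisted fields go straight through.

```java
import java.util.BitSet;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch, not Lucene code: a filter in front of a delegate term store for chosen fields. */
final class FilteredLookupSketch {
    private final Map<String, Set<String>> delegateTerms = new HashMap<>(); // stand-in for the delegate
    private final Map<String, BitSet> filters = new HashMap<>();            // one filter per filtered field
    private final Set<String> filteredFields;
    private final int filterSize;
    int delegateLookups = 0;                                                // counts the "expensive" lookups

    FilteredLookupSketch(Set<String> filteredFields, int filterSize) {
        this.filteredFields = filteredFields;
        this.filterSize = filterSize;
    }

    private int slot(String term) {
        return Math.floorMod(term.hashCode(), filterSize);
    }

    /** Every field goes to the delegate; only listed fields also get a filter bit set. */
    void index(String field, String term) {
        delegateTerms.computeIfAbsent(field, f -> new HashSet<>()).add(term);
        if (filteredFields.contains(field)) {
            filters.computeIfAbsent(field, f -> new BitSet(filterSize)).set(slot(term));
        }
    }

    /** Consults the filter first; a clear bit means the term is definitely absent. */
    boolean contains(String field, String term) {
        BitSet filter = filters.get(field);
        if (filter != null && !filter.get(slot(term))) {
            return false; // fast-fail: the delegate is never consulted
        }
        delegateLookups++;
        Set<String> terms = delegateTerms.get(field);
        return terms != null && terms.contains(term);
    }

    public static void main(String[] args) {
        FilteredLookupSketch index = new FilteredLookupSketch(Collections.singleton("id"), 1024);
        index.index("id", "a");
        index.index("body", "x");
        System.out.println(index.contains("id", "b") + ", delegate lookups: " + index.delegateLookups);
        System.out.println(index.contains("id", "a") + ", delegate lookups: " + index.delegateLookups);
        System.out.println(index.contains("body", "y") + ", delegate lookups: " + index.delegateLookups);
    }
}
```

In this model the miss on the filtered "id" field costs no delegate lookup, whereas the miss on the unfiltered "body" field does.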
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-4069: - Attachment: (was: BloomFilterCodec40.patch) Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0 Reporter: Mark Harwood Priority: Minor Fix For: 4.0, 3.6.1 Attachments: BloomFilterPostings40.patch, MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-4069: - Attachment: (was: BloomFilterPostings40.patch) Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0 Reporter: Mark Harwood Priority: Minor Fix For: 4.0, 3.6.1 Attachments: BloomFilterPostings40.patch, MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-4069: - Attachment: BloomFilterPostings40.patch Added missing class Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0 Reporter: Mark Harwood Priority: Minor Fix For: 4.0, 3.6.1 Attachments: BloomFilterPostings40.patch, MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-4069: - Attachment: (was: BloomFilterCodec40.patch) Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6 Reporter: Mark Harwood Priority: Minor Fix For: 3.6.1 Attachments: MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-4069: - Attachment: BloomFilterCodec40.patch
Updated to work with trunk.
* Changed to use FixedBitSet
* Is now a PostingsFormat abstract base class
* Added missing MurmurHash class
TODOs
* Move Bloom filter logic to common utils classes
* Use Service Providers for pluggable choice of hash algos?
* Expose settings for memory/saturation
Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6 Reporter: Mark Harwood Priority: Minor Fix For: 3.6.1 Attachments: BloomFilterCodec40.patch, MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip
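For readers unfamiliar with the data structure behind this update, here is a minimal sketch of a Bloom filter over a bit set. It is not the code from BloomFilterCodec40.patch: the class name SimpleBloomFilter and its methods are hypothetical, java.util.BitSet stands in for Lucene's FixedBitSet, and a seeded FNV-1a hash stands in for MurmurHash. The saturation() method illustrates the kind of memory/saturation figure that the TODO list suggests exposing.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

/** Illustrative Bloom filter; a hypothetical class, not the one in the patch. */
final class SimpleBloomFilter {
    private final BitSet bits;      // stand-in for Lucene's FixedBitSet
    private final int numBits;
    private final int numHashes;

    SimpleBloomFilter(int numBits, int numHashes) {
        if (numBits <= 0 || numHashes <= 0) {
            throw new IllegalArgumentException("numBits and numHashes must be positive");
        }
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    // Seeded 32-bit FNV-1a; assumed here as a simple substitute for MurmurHash.
    private static int fnv1a(byte[] data, int seed) {
        int h = 0x811C9DC5 ^ seed;
        for (byte b : data) {
            h ^= (b & 0xff);
            h *= 0x01000193;
        }
        return h;
    }

    // Double hashing: the i-th bit position is derived from two base hashes.
    private int position(byte[] term, int i) {
        int h1 = fnv1a(term, 0);
        int h2 = fnv1a(term, 0x9E3779B9);
        return Math.floorMod(h1 + i * h2, numBits);
    }

    void add(String term) {
        byte[] bytes = term.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < numHashes; i++) {
            bits.set(position(bytes, i));
        }
    }

    /** False means "definitely absent"; true means "possibly present". */
    boolean mightContain(String term) {
        byte[] bytes = term.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(position(bytes, i))) {
                return false;
            }
        }
        return true;
    }

    /** Fraction of bits set; a rising value means more false positives. */
    double saturation() {
        return (double) bits.cardinality() / numBits;
    }

    public static void main(String[] args) {
        SimpleBloomFilter filter = new SimpleBloomFilter(1 << 16, 3);
        filter.add("LUCENE-4069");
        System.out.println("added key present?  " + filter.mightContain("LUCENE-4069"));
        System.out.println("saturation:         " + filter.saturation());
    }
}
```

A filter of this kind never reports a stored term as absent, which is what makes it safe to skip a segment on a negative answer. The cost is occasional false positives, which grow as saturation rises.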
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-4069: - Description: An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access. Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests. Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU Patch based on 3.6 codebase attached. There are no 3.6 API changes currently - to play just add a field with _blm on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly. Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat was: An addition to each segment which stores a Bloom filter for selected fields in order to give fast-fail to term searches, helping avoid wasted disk access. Best suited for low-frequency fields e.g. primary keys on big indexes with many segments but also speeds up general searching in my tests. Overview slideshow here: http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU Patch based on 3.6 codebase attached. There are no API changes currently - to play just add a field with _blm on the end of the name to invoke special indexing/querying capability. Clearly a new Field or schema declaration(!) would need adding to APIs to configure the service properly. 
Affects Version/s: 4.0 Fix Version/s: 4.0 Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6, 4.0 Reporter: Mark Harwood Priority: Minor Fix For: 4.0, 3.6.1 Attachments: BloomFilterCodec40.patch, MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-4069: - Attachment: (was: BloomFilterCodec40.patch) Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6 Reporter: Mark Harwood Priority: Minor Fix For: 3.6.1 Attachments: MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-4069: - Attachment: BloomFilterCodec40.patch Fixed the issue with 1 field in an index. Tests on random lookups on Wikipedia titles (unique keys) now show a 3 x speed up for a Bloom-filtered index over standard 4.0 Codec for fully warmed indexes. Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6 Reporter: Mark Harwood Priority: Minor Fix For: 3.6.1 Attachments: BloomFilterCodec40.patch, MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-4069: - Attachment: PrimaryKey40PerformanceTestSrc.zip BloomFilterCodec40.patch Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6 Reporter: Mark Harwood Priority: Minor Fix For: 3.6.1 Attachments: BloomFilterCodec40.patch, MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-4069: - Attachment: BloomFilterCodec40.patch PrimaryKey40PerformanceTestSrc.zip I've ported this Bloom Filtering code to work as a 4.0 Codec now. I see a 35% improvement over standard Codecs on random lookups on a warmed index. I also notice that the PulsingCodec is no longer faster than the standard Codec. Is this news to people? I thought it was supposed to be the way forward. My test rig (adapted from Mike's original primary key test rig here http://blog.mikemccandless.com/2010/06/lucenes-pulsingcodec-on-primary-key.html) is attached as a zip. The new BloomFilteringCodec is also attached here as a patch. Searches against plain text fields also look to be faster (using AOL500k queries searching Wikipedia English) but obviously that particular test rig is harder to include as an attachment here. I can open a separate JIRA issue for this 4.0 version of the code if that makes more sense. Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6 Reporter: Mark Harwood Priority: Minor Fix For: 3.6.1 Attachments: BloomFilterCodec40.patch, MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip
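To make the "fast-fail" idea behind these primary-key lookups concrete, here is a rough, self-contained simulation. It is not the attached test rig and does not use any Lucene API. FilteredSegment and SegmentedIndexDemo are hypothetical names, an in-memory HashSet stands in for a segment's on-disk term dictionary, and the hash mixing is an arbitrary choice made for illustration. The demo counts how many term-dictionary probes a key lookup across several segments needs with and without a per-segment filter.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical segment: a term dictionary guarded by a small Bloom filter. */
final class FilteredSegment {
    private static final int NUM_BITS = 1 << 16;
    private static final int NUM_HASHES = 3;

    // Stands in for the on-disk term dictionary (an assumption of this sketch).
    private final Set<String> termDictionary = new HashSet<>();
    private final BitSet filter = new BitSet(NUM_BITS);
    int dictionaryProbes = 0; // counts the simulated "disk" lookups

    private static int slot(String key, int i) {
        int h = key.hashCode();
        int h2 = Integer.rotateLeft(h * 0x9E3779B9, 15) | 1; // arbitrary second hash
        return Math.floorMod(h + i * h2, NUM_BITS);
    }

    void add(String key) {
        termDictionary.add(key);
        for (int i = 0; i < NUM_HASHES; i++) {
            filter.set(slot(key, i));
        }
    }

    boolean contains(String key, boolean useFilter) {
        if (useFilter) {
            for (int i = 0; i < NUM_HASHES; i++) {
                if (!filter.get(slot(key, i))) {
                    return false; // fast-fail: the dictionary is never touched
                }
            }
        }
        dictionaryProbes++;
        return termDictionary.contains(key);
    }
}

final class SegmentedIndexDemo {
    /** Builds segments holding keys "doc-0", "doc-1", ... in order. */
    static List<FilteredSegment> build(int segments, int keysPerSegment) {
        List<FilteredSegment> result = new ArrayList<>();
        for (int s = 0; s < segments; s++) {
            FilteredSegment segment = new FilteredSegment();
            for (int k = 0; k < keysPerSegment; k++) {
                segment.add("doc-" + (s * keysPerSegment + k));
            }
            result.add(segment);
        }
        return result;
    }

    /** Returns the index of the segment holding the key, or -1 if absent. */
    static int findSegment(List<FilteredSegment> segments, String key, boolean useFilter) {
        for (int i = 0; i < segments.size(); i++) {
            if (segments.get(i).contains(key, useFilter)) {
                return i;
            }
        }
        return -1;
    }

    static int totalProbes(List<FilteredSegment> segments) {
        int total = 0;
        for (FilteredSegment segment : segments) {
            total += segment.dictionaryProbes;
        }
        return total;
    }

    public static void main(String[] args) {
        List<FilteredSegment> plain = build(10, 100);
        List<FilteredSegment> filtered = build(10, 100);
        for (int id = 0; id < 1000; id++) {
            findSegment(plain, "doc-" + id, false);
            findSegment(filtered, "doc-" + id, true);
        }
        System.out.println("probes without filter: " + totalProbes(plain));
        System.out.println("probes with filter:    " + totalProbes(filtered));
    }
}
```

The simulation counts avoided probes only. It says nothing about the wall-clock speed-ups reported in this thread, which depend on the real codec, the index size, and how warm the index is.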
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-4069: - Attachment: (was: BloomFilterCodec40.patch) Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6 Reporter: Mark Harwood Priority: Minor Fix For: 3.6.1 Attachments: BloomFilterCodec40.patch, MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-4069: - Attachment: (was: PrimaryKey40PerformanceTestSrc.zip) Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6 Reporter: Mark Harwood Priority: Minor Fix For: 3.6.1 Attachments: BloomFilterCodec40.patch, MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip
[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches
[ https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Harwood updated LUCENE-4069: - Attachment: MHBloomFilterOn3.6Branch.patch Initial patch Segment-level Bloom filters for a 2 x speed up on rare term searches Key: LUCENE-4069 URL: https://issues.apache.org/jira/browse/LUCENE-4069 Project: Lucene - Java Issue Type: Improvement Components: core/index Affects Versions: 3.6 Reporter: Mark Harwood Priority: Minor Fix For: 3.6.1 Attachments: MHBloomFilterOn3.6Branch.patch