[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-08-02 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Fix Version/s: 5.0

Applied to trunk in revision 1368567

 Segment-level Bloom filters for a 2 x speed up on rare term searches
 

 Key: LUCENE-4069
 URL: https://issues.apache.org/jira/browse/LUCENE-4069
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/index
Affects Versions: 3.6, 4.0-ALPHA
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor
 Fix For: 4.0, 5.0

 Attachments: 4069Failure.zip, BloomFilterPostingsBranch4x.patch, 
 LUCENE-4069-tryDeleteDocument.patch, LUCENE-4203.patch, 
 MHBloomFilterOn3.6Branch.patch, PKLookupUpdatePerfTest.java, 
 PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, 
 PKLookupUpdatePerfTest.java, PrimaryKeyPerfTest40.java


 An addition to each segment which stores a Bloom filter for selected fields 
 in order to give fast-fail to term searches, helping avoid wasted disk access.
 Best suited for low-frequency fields e.g. primary keys on big indexes with 
 many segments but also speeds up general searching in my tests.
 Overview slideshow here: 
 http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments
 Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU
 Patch based on 3.6 codebase attached.
 There are currently no 3.6 API changes. To experiment, just add a field whose 
 name ends in _blm to invoke the special indexing/querying capability. 
 Clearly a new Field or schema declaration(!) would need to be added to the 
 APIs to configure the service properly.
 Also attached: a patch for the Lucene 4.0 codebase introducing a new PostingsFormat.
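
 The fast-fail idea in the description can be illustrated with a toy model. 
 This is only a sketch and not the code from the attached patches: the 
 Segment class, the HashMap standing in for the on-disk term dictionary, the 
 bitset size and the two hash functions are all assumptions made for 
 illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

/** Toy model of the fast-fail idea; NOT the code from the attached patches. */
final class BloomSegmentSketch {

    /** Hypothetical stand-in for one index segment. */
    static final class Segment {
        // Stands in for the (expensive, on-disk) term dictionary.
        private final Map<String, Integer> termToDoc = new HashMap<>();
        private final BitSet bits;
        private final int numBits;
        // Counts how often the "expensive" dictionary was actually consulted.
        int dictionaryLookups = 0;

        Segment(int numBits) {
            this.numBits = numBits;
            this.bits = new BitSet(numBits);
        }

        void add(String term, int docId) {
            termToDoc.put(term, docId);
            bits.set(position(term, 0));
            bits.set(position(term, 1));
        }

        /** Returns the doc ID for the term, or -1 if this segment does not hold it. */
        int lookup(String term) {
            // A clear bit proves the term was never added: skip the dictionary.
            if (!bits.get(position(term, 0)) || !bits.get(position(term, 1))) {
                return -1;
            }
            // Both bits set: the term may be present (or a false positive).
            dictionaryLookups++;
            Integer doc = termToDoc.get(term);
            return doc == null ? -1 : doc;
        }

        // Derives the i-th bit position from two base hashes (an assumed scheme).
        private int position(String term, int i) {
            int h1 = term.hashCode();
            int h2 = fnv1a(term);
            return Math.floorMod(h1 + i * h2, numBits);
        }

        // 32-bit FNV-1a over the UTF-8 bytes of the term.
        private static int fnv1a(String s) {
            int h = 0x811c9dc5;
            for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
                h ^= (b & 0xff);
                h *= 0x01000193;
            }
            return h;
        }
    }

    public static void main(String[] args) {
        Segment[] segments = { new Segment(1 << 16), new Segment(1 << 16) };
        segments[0].add("pk-1", 0);
        segments[1].add("pk-2", 0);
        // A primary-key lookup has to visit every segment; the filter lets
        // most segments answer "not here" without touching the dictionary.
        for (Segment s : segments) {
            System.out.println("doc=" + s.lookup("pk-2")
                + " dictionaryLookups=" + s.dictionaryLookups);
        }
    }
}
```

 A Bloom filter never gives a false negative, so a clear bit is a safe reason 
 to skip a segment; a set bit still requires the normal term lookup.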

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-08-01 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Attachment: (was: BloomFilterPostingsBranch4x.patch)

 Segment-level Bloom filters for a 2 x speed up on rare term searches
 

 Key: LUCENE-4069
 URL: https://issues.apache.org/jira/browse/LUCENE-4069
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/index
Affects Versions: 3.6, 4.0-ALPHA
Reporter: Mark Harwood
Priority: Minor
 Fix For: 4.0

 Attachments: 4069Failure.zip, LUCENE-4069-tryDeleteDocument.patch, 
 LUCENE-4203.patch, MHBloomFilterOn3.6Branch.patch, 
 PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, 
 PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, 
 PrimaryKeyPerfTest40.java





[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-08-01 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Attachment: BloomFilterPostingsBranch4x.patch

Updated with a fix for the issue explored in LUCENE-4275.

 Segment-level Bloom filters for a 2 x speed up on rare term searches
 

 Key: LUCENE-4069
 URL: https://issues.apache.org/jira/browse/LUCENE-4069
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/index
Affects Versions: 3.6, 4.0-ALPHA
Reporter: Mark Harwood
Priority: Minor
 Fix For: 4.0

 Attachments: 4069Failure.zip, BloomFilterPostingsBranch4x.patch, 
 LUCENE-4069-tryDeleteDocument.patch, LUCENE-4203.patch, 
 MHBloomFilterOn3.6Branch.patch, PKLookupUpdatePerfTest.java, 
 PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, 
 PKLookupUpdatePerfTest.java, PrimaryKeyPerfTest40.java





[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-08-01 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Attachment: (was: BloomFilterPostingsBranch4x.patch)

 Segment-level Bloom filters for a 2 x speed up on rare term searches
 

 Key: LUCENE-4069
 URL: https://issues.apache.org/jira/browse/LUCENE-4069
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/index
Affects Versions: 3.6, 4.0-ALPHA
Reporter: Mark Harwood
Priority: Minor
 Fix For: 4.0

 Attachments: 4069Failure.zip, LUCENE-4069-tryDeleteDocument.patch, 
 LUCENE-4203.patch, MHBloomFilterOn3.6Branch.patch, 
 PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, 
 PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, 
 PrimaryKeyPerfTest40.java





[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-08-01 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Attachment: BloomFilterPostingsBranch4x.patch

Updated patch to bring it in line with the latest core API changes.
All tests now pass cleanly, so I will commit soon.

 Segment-level Bloom filters for a 2 x speed up on rare term searches
 

 Key: LUCENE-4069
 URL: https://issues.apache.org/jira/browse/LUCENE-4069
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/index
Affects Versions: 3.6, 4.0-ALPHA
Reporter: Mark Harwood
Priority: Minor
 Fix For: 4.0

 Attachments: 4069Failure.zip, BloomFilterPostingsBranch4x.patch, 
 LUCENE-4069-tryDeleteDocument.patch, LUCENE-4203.patch, 
 MHBloomFilterOn3.6Branch.patch, PKLookupUpdatePerfTest.java, 
 PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, 
 PKLookupUpdatePerfTest.java, PrimaryKeyPerfTest40.java





[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-07-20 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Attachment: 4069Failure.zip

Attached a log of thread activity showing how 
TestIndexWriterCommit.testCommitThreadSafety() is failing.
At this stage I can't tell whether this is a failure in MockDirectoryWrapper, 
the test, or the BloomPF class, but it is related to files being removed 
unexpectedly.

 Segment-level Bloom filters for a 2 x speed up on rare term searches
 

 Key: LUCENE-4069
 URL: https://issues.apache.org/jira/browse/LUCENE-4069
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/index
Affects Versions: 3.6, 4.0-ALPHA
Reporter: Mark Harwood
Priority: Minor
 Fix For: 4.0

 Attachments: 4069Failure.zip, BloomFilterPostingsBranch4x.patch, 
 LUCENE-4069-tryDeleteDocument.patch, LUCENE-4203.patch, 
 MHBloomFilterOn3.6Branch.patch, PKLookupUpdatePerfTest.java, 
 PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, 
 PKLookupUpdatePerfTest.java, PrimaryKeyPerfTest40.java





[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-07-17 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Attachment: (was: BloomFilterPostingsBranch4x.patch)

 Segment-level Bloom filters for a 2 x speed up on rare term searches
 

 Key: LUCENE-4069
 URL: https://issues.apache.org/jira/browse/LUCENE-4069
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/index
Affects Versions: 3.6, 4.0-ALPHA
Reporter: Mark Harwood
Priority: Minor
 Fix For: 4.0

 Attachments: LUCENE-4069-tryDeleteDocument.patch, LUCENE-4203.patch, 
 MHBloomFilterOn3.6Branch.patch, PKLookupUpdatePerfTest.java, 
 PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, 
 PKLookupUpdatePerfTest.java, PrimaryKeyPerfTest40.java





[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-07-17 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Attachment: BloomFilterPostingsBranch4x.patch

New patch that uses SegmentWriteState to right-size the choice of bitset for 
the volume of content.
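
Right-sizing a Bloom filter's bitset can be sketched with the textbook 
sizing formula m = -n * ln(p) / (ln 2)^2. The sizing rule the patch actually 
uses is not stated in this message, so the class below, its method names and 
the 1% target false-positive rate are assumptions for illustration only; the 
expected term count is assumed to be something the writer can estimate when 
the segment is written.

```java
/** Sizing sketch using the textbook Bloom filter formula; not taken from the patch. */
final class BloomSizingSketch {

    /** Bits needed to hold expectedTerms entries at the target false-positive rate. */
    static long bitsFor(long expectedTerms, double falsePositiveRate) {
        if (expectedTerms <= 0 || falsePositiveRate <= 0.0 || falsePositiveRate >= 1.0) {
            throw new IllegalArgumentException("need expectedTerms > 0 and 0 < rate < 1");
        }
        double ln2 = Math.log(2.0);
        // m = -n * ln(p) / (ln 2)^2, rounded up to a whole number of bits.
        return (long) Math.ceil(-expectedTerms * Math.log(falsePositiveRate) / (ln2 * ln2));
    }

    /** Number of hash functions that minimises false positives: k = (m / n) * ln 2. */
    static int hashCountFor(long bits, long expectedTerms) {
        return Math.max(1, (int) Math.round((double) bits / expectedTerms * Math.log(2.0)));
    }

    public static void main(String[] args) {
        // A small segment needs far fewer bits than a large one at the same error rate.
        long small = bitsFor(1_000, 0.01);
        long large = bitsFor(1_000_000, 0.01);
        System.out.println(small + " bits vs " + large + " bits");
    }
}
```

The point of sizing per segment is that a small, freshly flushed segment does 
not pay for a bitset large enough for a fully merged one.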

 Segment-level Bloom filters for a 2 x speed up on rare term searches
 

 Key: LUCENE-4069
 URL: https://issues.apache.org/jira/browse/LUCENE-4069
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/index
Affects Versions: 3.6, 4.0-ALPHA
Reporter: Mark Harwood
Priority: Minor
 Fix For: 4.0

 Attachments: BloomFilterPostingsBranch4x.patch, 
 LUCENE-4069-tryDeleteDocument.patch, LUCENE-4203.patch, 
 MHBloomFilterOn3.6Branch.patch, PKLookupUpdatePerfTest.java, 
 PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, 
 PKLookupUpdatePerfTest.java, PrimaryKeyPerfTest40.java





[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-07-16 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Attachment: (was: BloomFilterPostingsBranch4x.patch)

 Segment-level Bloom filters for a 2 x speed up on rare term searches
 

 Key: LUCENE-4069
 URL: https://issues.apache.org/jira/browse/LUCENE-4069
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/index
Affects Versions: 3.6, 4.0-ALPHA
Reporter: Mark Harwood
Priority: Minor
 Fix For: 4.0

 Attachments: LUCENE-4069-tryDeleteDocument.patch, LUCENE-4203.patch, 
 MHBloomFilterOn3.6Branch.patch, PKLookupUpdatePerfTest.java, 
 PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, 
 PKLookupUpdatePerfTest.java, PrimaryKeyPerfTest40.java





[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-07-16 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Attachment: BloomFilterPostingsBranch4x.patch

Added bloom package.html and changes.txt. I plan to commit in a day or two if 
there are no objections.

 Segment-level Bloom filters for a 2 x speed up on rare term searches
 

 Key: LUCENE-4069
 URL: https://issues.apache.org/jira/browse/LUCENE-4069
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/index
Affects Versions: 3.6, 4.0-ALPHA
Reporter: Mark Harwood
Priority: Minor
 Fix For: 4.0

 Attachments: BloomFilterPostingsBranch4x.patch, 
 LUCENE-4069-tryDeleteDocument.patch, LUCENE-4203.patch, 
 MHBloomFilterOn3.6Branch.patch, PKLookupUpdatePerfTest.java, 
 PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, 
 PKLookupUpdatePerfTest.java, PrimaryKeyPerfTest40.java





[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-07-09 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-4069:
---

Attachment: PKLookupUpdatePerfTest.java

The only substantial change is that I fixed the update vs. insert counting to 
work correctly when doDirectDelete is true.

 Segment-level Bloom filters for a 2 x speed up on rare term searches
 

 Key: LUCENE-4069
 URL: https://issues.apache.org/jira/browse/LUCENE-4069
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/index
Affects Versions: 3.6, 4.0
Reporter: Mark Harwood
Priority: Minor
 Fix For: 4.0, 3.6.1

 Attachments: BloomFilterPostingsBranch4x.patch, 
 LUCENE-4069-tryDeleteDocument.patch, MHBloomFilterOn3.6Branch.patch, 
 PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, 
 PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, 
 PrimaryKeyPerfTest40.java





[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-07-09 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-4069:
---

Attachment: LUCENE-4203.patch

Initial rough patch ... not ready to commit and needs a good random test.

 Segment-level Bloom filters for a 2 x speed up on rare term searches
 

 Key: LUCENE-4069
 URL: https://issues.apache.org/jira/browse/LUCENE-4069
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/index
Affects Versions: 3.6, 4.0
Reporter: Mark Harwood
Priority: Minor
 Fix For: 4.0, 3.6.1

 Attachments: BloomFilterPostingsBranch4x.patch, 
 LUCENE-4069-tryDeleteDocument.patch, LUCENE-4203.patch, 
 MHBloomFilterOn3.6Branch.patch, PKLookupUpdatePerfTest.java, 
 PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, 
 PKLookupUpdatePerfTest.java, PrimaryKeyPerfTest40.java





[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-07-06 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Attachment: PKLookupUpdatePerfTest.java

Updated performance test with option to alter the ratio of inserts vs updates 
via keyspace size.

 Segment-level Bloom filters for a 2 x speed up on rare term searches
 

 Key: LUCENE-4069
 URL: https://issues.apache.org/jira/browse/LUCENE-4069
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/index
Affects Versions: 3.6, 4.0
Reporter: Mark Harwood
Priority: Minor
 Fix For: 4.0, 3.6.1

 Attachments: BloomFilterPostingsBranch4x.patch, 
 LUCENE-4069-tryDeleteDocument.patch, MHBloomFilterOn3.6Branch.patch, 
 PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, 
 PKLookupUpdatePerfTest.java, PrimaryKeyPerfTest40.java





[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-06-28 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-4069:
---

Attachment: LUCENE-4069-tryDeleteDocument.patch
            PKLookupUpdatePerfTest.java

I made various improvements to the PK lookup/update performance tester (called 
out as booleans at the top) so we can test the impact of:
  * Reusing the enums
  * Pre-sorting the keys before lookup
  * Best case (flush once after each update) vs worst case
  * Using DocValues direct source instead of a stored field to hold the counter
  * Using base 36 for the PK vs base 10

Also, it's really wasteful that we go and look up the docID by PK, to retrieve 
the old count, and then do an updateDocument call which forces IW to go and do 
the exact same (costly) lookup again.

So I added a new method enabling deletion by document ID in IndexWriter.  I 
named it tryDeleteDocument, and it takes SegmentInfo and int docID.  If that 
segment has not been merged away, the delete succeeds, else it fails (and the 
app must delete through the normal way).  This seems to give ~20% speedup.

I think that, e.g., Solr's way of updating a document field (retrieving the old 
doc, changing the field, calling IW.updateDocument) could also use this method.
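
The try-then-fall-back pattern described above can be sketched with a toy 
model. Nothing here is Lucene code: ToySegment, ToyWriter and deleteByPk are 
hypothetical stand-ins, and only the shape of tryDeleteDocument (a segment 
plus a docID, returning whether the delete succeeded) follows the description 
in this message.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy model of "delete by docID if possible, else delete by key"; not Lucene code. */
final class TryDeleteSketch {

    /** Hypothetical stand-in for a segment. */
    static final class ToySegment {
        final Map<String, Integer> pkToDoc = new HashMap<>();
        final Set<Integer> deleted = new HashSet<>();
    }

    /** Hypothetical stand-in for the writer. */
    static final class ToyWriter {
        final Set<ToySegment> live = new HashSet<>();
        // Counts the costly per-segment primary-key lookups.
        int pkLookups = 0;

        /** Succeeds only if the segment has not been merged away. */
        boolean tryDeleteDocument(ToySegment seg, int docId) {
            if (!live.contains(seg)) {
                return false; // segment is gone: caller must delete the normal way
            }
            seg.deleted.add(docId);
            return true;
        }

        /** The normal way: look the key up again in every live segment. */
        void deleteByPk(String pk) {
            for (ToySegment seg : live) {
                pkLookups++;
                Integer doc = seg.pkToDoc.get(pk);
                if (doc != null) {
                    seg.deleted.add(doc);
                }
            }
        }
    }

    /** The app already knows (segment, docId) from its own lookup of the old value. */
    static void delete(ToyWriter writer, ToySegment seg, int docId, String pk) {
        if (!writer.tryDeleteDocument(seg, docId)) {
            writer.deleteByPk(pk);
        }
    }

    public static void main(String[] args) {
        ToyWriter writer = new ToyWriter();
        ToySegment seg = new ToySegment();
        seg.pkToDoc.put("pk-1", 0);
        writer.live.add(seg);
        delete(writer, seg, 0, "pk-1");
        System.out.println("deleted=" + seg.deleted + " pkLookups=" + writer.pkLookups);
    }
}
```

The saving comes from the first branch: when the segment is still live, the 
second key lookup is skipped entirely.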

 Segment-level Bloom filters for a 2 x speed up on rare term searches
 

 Key: LUCENE-4069
 URL: https://issues.apache.org/jira/browse/LUCENE-4069
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/index
Affects Versions: 3.6, 4.0
Reporter: Mark Harwood
Priority: Minor
 Fix For: 4.0, 3.6.1

 Attachments: BloomFilterPostingsBranch4x.patch, 
 LUCENE-4069-tryDeleteDocument.patch, MHBloomFilterOn3.6Branch.patch, 
 PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, 
 PrimaryKeyPerfTest40.java





[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-06-25 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Attachment: (was: BloomFilterPostingsBranch4x.patch)

 Segment-level Bloom filters for a 2 x speed up on rare term searches
 

 Key: LUCENE-4069
 URL: https://issues.apache.org/jira/browse/LUCENE-4069
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/index
Affects Versions: 3.6, 4.0
Reporter: Mark Harwood
Priority: Minor
 Fix For: 4.0, 3.6.1

 Attachments: MHBloomFilterOn3.6Branch.patch, 
 PKLookupUpdatePerfTest.java, PrimaryKeyPerfTest40.java





[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-06-25 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Attachment: BloomFilterPostingsBranch4x.patch

Added a customizable saturation threshold after which Bloom filters are retired 
and no longer maintained (due to merges creating very large segments).
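
A saturation check of this kind might look like the sketch below. The class 
and method names and the example threshold are assumptions, not taken from the 
patch; the point is only that saturation is the fraction of bits set, and that 
a filter above the threshold answers "maybe" too often to be worth keeping.

```java
import java.util.BitSet;

/** Hypothetical saturation policy for a Bloom filter's bitset. */
class SaturationCheck {
    /** Fraction of bits set, in the range [0, 1]. */
    static double saturation(BitSet bits, int sizeInBits) {
        return (double) bits.cardinality() / sizeInBits;
    }

    /** True when the filter is fuller than the configured threshold. */
    static boolean shouldRetire(BitSet bits, int sizeInBits, double maxSaturation) {
        return saturation(bits, sizeInBits) > maxSaturation;
    }

    public static void main(String[] args) {
        BitSet bits = new BitSet(8);
        bits.set(0, 7); // sets bits 0..6, i.e. 7 of 8 bits
        System.out.println(saturation(bits, 8));
        System.out.println(shouldRetire(bits, 8, 0.9));
    }
}
```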

 Segment-level Bloom filters for a 2 x speed up on rare term searches
 

 Key: LUCENE-4069
 URL: https://issues.apache.org/jira/browse/LUCENE-4069
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/index
Affects Versions: 3.6, 4.0
Reporter: Mark Harwood
Priority: Minor
 Fix For: 4.0, 3.6.1

 Attachments: BloomFilterPostingsBranch4x.patch, 
 MHBloomFilterOn3.6Branch.patch, PKLookupUpdatePerfTest.java, 
 PrimaryKeyPerfTest40.java





[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-06-22 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Attachment: PKLookupUpdatePerfTest.java

Attached a performance test (adapted from Mike's PKLookupPerfTest) that 
demonstrates the worst-case scenario where BloomFilter offers the 2x speed up 
not previously revealed in Mike's other tests.

This test case mixes reads and writes on a growing index and is representative 
of the real-world scenario I am seeking to optimize. See the javadoc for test 
details.

 Segment-level Bloom filters for a 2 x speed up on rare term searches
 

 Key: LUCENE-4069
 URL: https://issues.apache.org/jira/browse/LUCENE-4069
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/index
Affects Versions: 3.6, 4.0
Reporter: Mark Harwood
Priority: Minor
 Fix For: 4.0, 3.6.1

 Attachments: BloomFilterPostingsBranch4x.patch, 
 MHBloomFilterOn3.6Branch.patch, PKLookupUpdatePerfTest.java, 
 PrimaryKeyPerfTest40.java





[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-06-20 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Attachment: (was: BloomFilterPostingsBranch4x.patch)

 Segment-level Bloom filters for a 2 x speed up on rare term searches
 

 Key: LUCENE-4069
 URL: https://issues.apache.org/jira/browse/LUCENE-4069
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/index
Affects Versions: 3.6, 4.0
Reporter: Mark Harwood
Priority: Minor
 Fix For: 4.0, 3.6.1

 Attachments: MHBloomFilterOn3.6Branch.patch, PrimaryKeyPerfTest40.java





[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-06-20 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Attachment: BloomFilterPostingsBranch4x.patch

Fix for the not-downsizing bug and a subsequent issue which that fix 
revealed. The 2nd issue was that on saturation, the downsize method would 
actually upsize into a bigger bitset. This causes false negatives on searches: 
it's safe to downsize the indexing bitset but not to upsize it, as there is 
already some information loss involved.
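
The asymmetry can be shown with a small sketch. It assumes power-of-two bitset 
sizes and a bit position computed as hash & (size - 1); that scheme and the 
names below are assumptions for illustration, not a copy of the patch. Under 
that scheme, folding into a smaller bitset keeps every added hash findable, 
while a larger bitset would need hash bits that were already discarded.

```java
import java.util.BitSet;

/** Sketch of why a masked bitset can shrink safely but cannot grow. */
class BitsetResize {
    /** Bit position for a hash in a power-of-two sized bitset. */
    static int position(int hash, int size) {
        return hash & (size - 1);
    }

    /** Folds a power-of-two bitset into a smaller power-of-two bitset. */
    static BitSet downsize(BitSet src, int newSize) {
        BitSet dst = new BitSet(newSize);
        for (int i = src.nextSetBit(0); i >= 0; i = src.nextSetBit(i + 1)) {
            dst.set(i & (newSize - 1)); // same bit a fresh insert would set
        }
        return dst;
    }

    public static void main(String[] args) {
        int hash = 21; // binary 10101
        BitSet sixteen = new BitSet(16);
        sixteen.set(position(hash, 16)); // sets bit 5

        BitSet four = downsize(sixteen, 4);
        System.out.println("found after downsize: " + four.get(position(hash, 4)));

        // Reusing the 16-bit contents as if they were a 32-bit filter:
        // the hash now maps to bit 21, which was never set.
        System.out.println("found after naive upsize: " + sixteen.get(position(hash, 32)));
    }
}
```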

 Segment-level Bloom filters for a 2 x speed up on rare term searches
 

 Key: LUCENE-4069
 URL: https://issues.apache.org/jira/browse/LUCENE-4069
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/index
Affects Versions: 3.6, 4.0
Reporter: Mark Harwood
Priority: Minor
 Fix For: 4.0, 3.6.1

 Attachments: BloomFilterPostingsBranch4x.patch, 
 MHBloomFilterOn3.6Branch.patch





[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-06-20 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Attachment: (was: PrimaryKeyPerfTest40.java)

 Segment-level Bloom filters for a 2 x speed up on rare term searches
 

 Key: LUCENE-4069
 URL: https://issues.apache.org/jira/browse/LUCENE-4069
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/index
Affects Versions: 3.6, 4.0
Reporter: Mark Harwood
Priority: Minor
 Fix For: 4.0, 3.6.1

 Attachments: BloomFilterPostingsBranch4x.patch, 
 MHBloomFilterOn3.6Branch.patch





[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-06-20 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Attachment: PrimaryKeyPerfTest40.java

Updated Performance test code based on new IndexReader changes for accessing 
subreaders

 Segment-level Bloom filters for a 2 x speed up on rare term searches
 

 Key: LUCENE-4069
 URL: https://issues.apache.org/jira/browse/LUCENE-4069
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/index
Affects Versions: 3.6, 4.0
Reporter: Mark Harwood
Priority: Minor
 Fix For: 4.0, 3.6.1

 Attachments: BloomFilterPostingsBranch4x.patch, 
 MHBloomFilterOn3.6Branch.patch, PrimaryKeyPerfTest40.java





[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-06-18 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Attachment: (was: BloomFilterPostingsBranch4x.patch)

 Segment-level Bloom filters for a 2 x speed up on rare term searches
 

 Key: LUCENE-4069
 URL: https://issues.apache.org/jira/browse/LUCENE-4069
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/index
Affects Versions: 3.6, 4.0
Reporter: Mark Harwood
Priority: Minor
 Fix For: 4.0, 3.6.1

 Attachments: BloomFilterPostingsBranch4x.patch, 
 MHBloomFilterOn3.6Branch.patch, PrimaryKeyPerfTest40.java





[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-06-18 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Attachment: BloomFilterPostingsBranch4x.patch

New patch with Terms.intersect overridden for faster Fuzzy queries.
Didn't get too far with running the Wikipedia perf tests due to missing data 
file (see 
http://code.google.com/a/apache-extras.org/p/luceneutil/issues/detail?id=7 )

 Segment-level Bloom filters for a 2 x speed up on rare term searches
 

 Key: LUCENE-4069
 URL: https://issues.apache.org/jira/browse/LUCENE-4069
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/index
Affects Versions: 3.6, 4.0
Reporter: Mark Harwood
Priority: Minor
 Fix For: 4.0, 3.6.1

 Attachments: BloomFilterPostingsBranch4x.patch, 
 MHBloomFilterOn3.6Branch.patch, PrimaryKeyPerfTest40.java





[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-06-08 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Attachment: (was: BloomFilterPostings40.patch)

 Segment-level Bloom filters for a 2 x speed up on rare term searches
 

 Key: LUCENE-4069
 URL: https://issues.apache.org/jira/browse/LUCENE-4069
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/index
Affects Versions: 3.6, 4.0
Reporter: Mark Harwood
Priority: Minor
 Fix For: 4.0, 3.6.1

 Attachments: MHBloomFilterOn3.6Branch.patch, 
 PrimaryKey40PerformanceTestSrc.zip





[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-06-08 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Attachment: BloomFilterPostingsBranch4x.patch

Updated as follows:
* Extracted Bloom filter functionality as new oal.util.FuzzySet class - the 
name is changed because Bloom filtering is one application for a FuzzySet, 
fuzzy count distincts being another.
* BloomFilterPostingsFormat now takes a factory that can tailor the choice of 
BloomFilter per field (bitset size/saturation settings and choice of hash 
algo). Provided a default factory implementation.
* All unit tests pass now that I have a test PostingsFormat class that uses 
very small bitsets, where before the many-field unit tests would cause OOM. 

Will follow up with benchmarks when I have more time to run and document them. 
Initial results from my large-scale tests on growing indexes show a nice flat 
line in the face of a growing index whereas a non-Bloomed index saw-tooths 
upwards as segments grow/merge.

 Segment-level Bloom filters for a 2 x speed up on rare term searches
 

 Key: LUCENE-4069
 URL: https://issues.apache.org/jira/browse/LUCENE-4069
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/index
Affects Versions: 3.6, 4.0
Reporter: Mark Harwood
Priority: Minor
 Fix For: 4.0, 3.6.1

 Attachments: BloomFilterPostingsBranch4x.patch, 
 MHBloomFilterOn3.6Branch.patch





[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-06-08 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Attachment: (was: PrimaryKey40PerformanceTestSrc.zip)

 Segment-level Bloom filters for a 2 x speed up on rare term searches
 

 Key: LUCENE-4069
 URL: https://issues.apache.org/jira/browse/LUCENE-4069
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/index
Affects Versions: 3.6, 4.0
Reporter: Mark Harwood
Priority: Minor
 Fix For: 4.0, 3.6.1

 Attachments: BloomFilterPostingsBranch4x.patch, 
 MHBloomFilterOn3.6Branch.patch





[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-06-08 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Attachment: PrimaryKeyPerfTest40.java

Benchmark tool adapted from Mike's original Pulsing codec benchmark. Now 
includes Bloom postings example.

 Segment-level Bloom filters for a 2 x speed up on rare term searches
 

 Key: LUCENE-4069
 URL: https://issues.apache.org/jira/browse/LUCENE-4069
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/index
Affects Versions: 3.6, 4.0
Reporter: Mark Harwood
Priority: Minor
 Fix For: 4.0, 3.6.1

 Attachments: BloomFilterPostingsBranch4x.patch, 
 MHBloomFilterOn3.6Branch.patch, PrimaryKeyPerfTest40.java





[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-05-31 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Attachment: BloomFilterPostings40.patch

This is looking more promising.

Running ant test-core 
-Dtests.postingsformat=TestBloomFilteredLucene40Postings now passes all tests 
except for 3 that fail with an OOM exception:
* TestConsistentFieldNumbers.testManyFields
* TestIndexableField.testArbitraryFields
* TestIndexWriter.testManyFields

Any pointers on how to annotate or otherwise avoid the BloomFilter class for 
many-field tests would be welcome. These are not realistic tests for this 
class (we don't expect indexes with 100s of primary-key-like fields).

In this patch I've
* added an SPI lookup mechanism for pluggable hash algos.
* documented the file format
* fixed issues with TermVector tests
* changed the API


To use:
BloomFilteringPostingFormat now takes a delegate PostingsFormat and a set of 
field names that are to have Bloom filters created.
Fields that are not listed in the filter set can safely be indexed as normal, 
and doing so is beneficial because it allows filtered and non-filtered field 
data to co-exist in the same physical files created by the delegate 
PostingsFormat.
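
The delegate-plus-field-set arrangement can be pictured with the toy wrapper 
below. None of these classes exist in Lucene: ToyTermStore stands in for the 
delegate PostingsFormat, FilteredTermStore for the Bloom-filtering wrapper, 
and the single-probe String.hashCode filter is a simplification chosen for 
brevity. Listed fields get a fast-fail check; all other fields go straight to 
the delegate, which stores the data for both kinds of field.

```java
import java.util.BitSet;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical wrapper: Bloom-checks listed fields, delegates everything. */
class FilteredTermStore {
    private static final int SIZE = 1024; // bits per field filter (arbitrary)
    private final ToyTermStore delegate;
    private final Set<String> filteredFields;
    private final Map<String, BitSet> filters = new HashMap<>();

    FilteredTermStore(ToyTermStore delegate, Set<String> filteredFields) {
        this.delegate = delegate;
        this.filteredFields = filteredFields;
    }

    private static int pos(String term) {
        return Math.floorMod(term.hashCode(), SIZE);
    }

    void add(String field, String term) {
        if (filteredFields.contains(field)) {
            filters.computeIfAbsent(field, f -> new BitSet(SIZE)).set(pos(term));
        }
        delegate.add(field, term); // the delegate holds all field data
    }

    boolean contains(String field, String term) {
        if (filteredFields.contains(field)) {
            BitSet bits = filters.get(field);
            if (bits == null || !bits.get(pos(term))) {
                return false; // fast-fail: the delegate is never consulted
            }
        }
        return delegate.contains(field, term);
    }

    public static void main(String[] args) {
        ToyTermStore store = new ToyTermStore();
        FilteredTermStore filtered =
            new FilteredTermStore(store, Collections.singleton("id"));
        filtered.add("id", "a1");
        filtered.add("body", "hello");
        System.out.println(filtered.contains("id", "a1"));
        System.out.println("delegate lookups: " + store.lookups);
    }
}

/** Toy term dictionary standing in for the delegate; names are hypothetical. */
class ToyTermStore {
    private final Map<String, Set<String>> termsByField = new HashMap<>();
    int lookups = 0; // counts the lookups that reach the delegate

    void add(String field, String term) {
        termsByField.computeIfAbsent(field, f -> new HashSet<>()).add(term);
    }

    boolean contains(String field, String term) {
        lookups++;
        Set<String> terms = termsByField.get(field);
        return terms != null && terms.contains(term);
    }
}
```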


 Segment-level Bloom filters for a 2 x speed up on rare term searches
 

 Key: LUCENE-4069
 URL: https://issues.apache.org/jira/browse/LUCENE-4069
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/index
Affects Versions: 3.6, 4.0
Reporter: Mark Harwood
Priority: Minor
 Fix For: 4.0, 3.6.1

 Attachments: BloomFilterCodec40.patch, BloomFilterPostings40.patch, 
 MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip





[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-05-31 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Attachment: (was: BloomFilterCodec40.patch)

 Segment-level Bloom filters for a 2 x speed up on rare term searches
 

 Key: LUCENE-4069
 URL: https://issues.apache.org/jira/browse/LUCENE-4069
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/index
Affects Versions: 3.6, 4.0
Reporter: Mark Harwood
Priority: Minor
 Fix For: 4.0, 3.6.1

 Attachments: BloomFilterPostings40.patch, 
 MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip






[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-05-31 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Attachment: (was: BloomFilterPostings40.patch)

 Segment-level Bloom filters for a 2 x speed up on rare term searches
 

 Key: LUCENE-4069
 URL: https://issues.apache.org/jira/browse/LUCENE-4069
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/index
Affects Versions: 3.6, 4.0
Reporter: Mark Harwood
Priority: Minor
 Fix For: 4.0, 3.6.1

 Attachments: BloomFilterPostings40.patch, 
 MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip






[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-05-31 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Attachment: BloomFilterPostings40.patch

Added missing class

 Segment-level Bloom filters for a 2 x speed up on rare term searches
 

 Key: LUCENE-4069
 URL: https://issues.apache.org/jira/browse/LUCENE-4069
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/index
Affects Versions: 3.6, 4.0
Reporter: Mark Harwood
Priority: Minor
 Fix For: 4.0, 3.6.1

 Attachments: BloomFilterPostings40.patch, 
 MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip






[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-05-29 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Attachment: (was: BloomFilterCodec40.patch)

 Segment-level Bloom filters for a 2 x speed up on rare term searches
 

 Key: LUCENE-4069
 URL: https://issues.apache.org/jira/browse/LUCENE-4069
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/index
Affects Versions: 3.6
Reporter: Mark Harwood
Priority: Minor
 Fix For: 3.6.1

 Attachments: MHBloomFilterOn3.6Branch.patch, 
 PrimaryKey40PerformanceTestSrc.zip


 An addition to each segment which stores a Bloom filter for selected fields 
 in order to give fast-fail to term searches, helping avoid wasted disk access.
 Best suited for low-frequency fields e.g. primary keys on big indexes with 
 many segments but also speeds up general searching in my tests.
 Overview slideshow here: 
 http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments
 Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU
 Patch based on 3.6 codebase attached.
 There are no API changes currently - to play just add a field with _blm on 
 the end of the name to invoke special indexing/querying capability. Clearly a 
 new Field or schema declaration(!) would need adding to APIs to configure the 
 service properly.




[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-05-29 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Attachment: BloomFilterCodec40.patch

Updated to work with trunk.
* Changed to use FixedBitSet
* Is now a PostingsFormat abstract base class
* Added missing MurmurHash class

TODOs
* Move Bloom filter logic to common utils classes
* Use Service Providers for pluggable choice of hash algos?
* Expose settings for memory/saturation
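
For readers following the thread, the bit-set-plus-hash structure discussed above can be sketched roughly as below. This is only an illustration and not the code in the attached patch: the class name SimpleBloomFilter, its methods and the constructor parameters are hypothetical, java.util.BitSet stands in for Lucene's FixedBitSet, and a seeded FNV-1a hash stands in for MurmurHash. The saturation() method shows the kind of figure a memory/saturation setting might be based on.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

/** Illustrative Bloom filter; NOT the class from the attached patch. */
final class SimpleBloomFilter {
    private final BitSet bits;     // stand-in for Lucene's FixedBitSet
    private final int numBits;     // assumed setting: memory budget in bits
    private final int numHashes;   // assumed setting: probes per term

    SimpleBloomFilter(int numBits, int numHashes) {
        if (numBits <= 0 || numHashes <= 0) {
            throw new IllegalArgumentException("numBits and numHashes must be positive");
        }
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    /** Seeded 32-bit FNV-1a, used here only as a simple stand-in for MurmurHash. */
    private static int hash(byte[] data, int seed) {
        int h = 0x811C9DC5 ^ seed;
        for (byte b : data) {
            h ^= (b & 0xff);
            h *= 16777619;
        }
        return h;
    }

    /** i-th probe position, derived from two base hashes (double hashing). */
    private int position(byte[] term, int i) {
        int h1 = hash(term, 0);
        int h2 = hash(term, 0x9747b28c);
        return Math.floorMod(h1 + i * h2, numBits);
    }

    void add(String term) {
        byte[] bytes = term.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < numHashes; i++) {
            bits.set(position(bytes, i));
        }
    }

    /** false means the term was definitely never added; true means "maybe". */
    boolean mightContain(String term) {
        byte[] bytes = term.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(position(bytes, i))) {
                return false;
            }
        }
        return true;
    }

    /** Fraction of bits set; a very full filter gives many false positives. */
    double saturation() {
        return (double) bits.cardinality() / numBits;
    }

    public static void main(String[] args) {
        SimpleBloomFilter filter = new SimpleBloomFilter(1 << 16, 2);
        for (int i = 0; i < 1000; i++) {
            filter.add("doc-" + i);
        }
        System.out.println("doc-42 maybe present: " + filter.mightContain("doc-42"));
        System.out.println("saturation: " + filter.saturation());
    }
}
```

A filter like this never gives a false negative, so a "no" answer can safely skip the segment; how many bits and hashes to use is the memory/saturation trade-off mentioned in the TODOs.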

 Segment-level Bloom filters for a 2 x speed up on rare term searches
 

 Key: LUCENE-4069
 URL: https://issues.apache.org/jira/browse/LUCENE-4069
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/index
Affects Versions: 3.6
Reporter: Mark Harwood
Priority: Minor
 Fix For: 3.6.1

 Attachments: BloomFilterCodec40.patch, 
 MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip






[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-05-29 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

  Description: 
An addition to each segment which stores a Bloom filter for selected fields in 
order to give fast-fail to term searches, helping avoid wasted disk access.

Best suited for low-frequency fields e.g. primary keys on big indexes with many 
segments but also speeds up general searching in my tests.

Overview slideshow here: 
http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments

Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU

Patch based on 3.6 codebase attached.
There are no 3.6 API changes currently - to play just add a field with _blm 
on the end of the name to invoke special indexing/querying capability. Clearly 
a new Field or schema declaration(!) would need adding to APIs to configure the 
service properly.

Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat

  was:
An addition to each segment which stores a Bloom filter for selected fields in 
order to give fast-fail to term searches, helping avoid wasted disk access.

Best suited for low-frequency fields e.g. primary keys on big indexes with many 
segments but also speeds up general searching in my tests.

Overview slideshow here: 
http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments

Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU

Patch based on 3.6 codebase attached.
There are no API changes currently - to play just add a field with _blm on 
the end of the name to invoke special indexing/querying capability. Clearly a 
new Field or schema declaration(!) would need adding to APIs to configure the 
service properly.

Affects Version/s: 4.0
Fix Version/s: 4.0

 Segment-level Bloom filters for a 2 x speed up on rare term searches
 

 Key: LUCENE-4069
 URL: https://issues.apache.org/jira/browse/LUCENE-4069
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/index
Affects Versions: 3.6, 4.0
Reporter: Mark Harwood
Priority: Minor
 Fix For: 4.0, 3.6.1

 Attachments: BloomFilterCodec40.patch, 
 MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip






[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-05-28 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Attachment: (was: BloomFilterCodec40.patch)

 Segment-level Bloom filters for a 2 x speed up on rare term searches
 

 Key: LUCENE-4069
 URL: https://issues.apache.org/jira/browse/LUCENE-4069
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/index
Affects Versions: 3.6
Reporter: Mark Harwood
Priority: Minor
 Fix For: 3.6.1

 Attachments: MHBloomFilterOn3.6Branch.patch, 
 PrimaryKey40PerformanceTestSrc.zip






[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-05-28 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Attachment: BloomFilterCodec40.patch

Fixed the issue with 1 field in an index.
Tests on random lookups on Wikipedia titles (unique keys) now show a 3 x speed 
up for a Bloom-filtered index over standard 4.0 Codec for fully warmed indexes.
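
Why unique-key lookups benefit can be sketched as follows. A primary key lives in at most one segment, so without a filter every other segment still pays for a term dictionary seek that finds nothing. The toy model below is not Lucene API and not the patch code: BloomedSegments, Segment, seekExact and the one-hash filter of 2^16 bits are all hypothetical names and assumptions, and a HashSet stands in for the on-disk term dictionary. It only counts how many costly seeks remain once each segment is asked "might you contain this key?" first.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical model of an index with one Bloom filter per segment. */
final class BloomedSegments {

    /** One segment; a HashSet stands in for the on-disk term dictionary. */
    static final class Segment {
        private static final int NUM_BITS = 1 << 16; // assumed filter size
        private final Set<String> terms = new HashSet<>();
        private final BitSet filter = new BitSet(NUM_BITS);

        void add(String key) {
            terms.add(key);
            filter.set(Math.floorMod(key.hashCode(), NUM_BITS));
        }

        /** Cheap in-memory check; false means the key is definitely absent. */
        boolean mightContain(String key) {
            return filter.get(Math.floorMod(key.hashCode(), NUM_BITS));
        }

        /** Stand-in for the costly term dictionary seek. */
        boolean seekExact(String key) {
            return terms.contains(key);
        }
    }

    private final List<Segment> segments = new ArrayList<>();
    int dictionarySeeks = 0; // counts the expensive seeks actually performed

    Segment newSegment() {
        Segment s = new Segment();
        segments.add(s);
        return s;
    }

    /** Returns the index of the segment holding the key, or -1 if none does. */
    int lookup(String key) {
        for (int i = 0; i < segments.size(); i++) {
            Segment s = segments.get(i);
            if (!s.mightContain(key)) {
                continue; // fast-fail: skip the term dictionary entirely
            }
            dictionarySeeks++;
            if (s.seekExact(key)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        BloomedSegments index = new BloomedSegments();
        for (int seg = 0; seg < 20; seg++) {
            Segment s = index.newSegment();
            for (int doc = 0; doc < 1000; doc++) {
                s.add("id-" + seg + "-" + doc);
            }
        }
        int hit = index.lookup("id-17-42");
        System.out.println("found in segment " + hit + " after "
                + index.dictionarySeeks + " dictionary seeks");
    }
}
```

The more segments an index has, the more seeks the filters can skip per lookup, which fits the observation that the gain is largest for primary keys on big indexes with many segments.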

 Segment-level Bloom filters for a 2 x speed up on rare term searches
 

 Key: LUCENE-4069
 URL: https://issues.apache.org/jira/browse/LUCENE-4069
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/index
Affects Versions: 3.6
Reporter: Mark Harwood
Priority: Minor
 Fix For: 3.6.1

 Attachments: BloomFilterCodec40.patch, 
 MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip






[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-05-25 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Attachment: PrimaryKey40PerformanceTestSrc.zip, BloomFilterCodec40.patch

 Segment-level Bloom filters for a 2 x speed up on rare term searches
 

 Key: LUCENE-4069
 URL: https://issues.apache.org/jira/browse/LUCENE-4069
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/index
Affects Versions: 3.6
Reporter: Mark Harwood
Priority: Minor
 Fix For: 3.6.1

 Attachments: BloomFilterCodec40.patch, 
 MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip






[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-05-25 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Attachment: BloomFilterCodec40.patch, PrimaryKey40PerformanceTestSrc.zip

I've ported this Bloom filtering code to work as a 4.0 Codec now.
I see a 35% improvement over standard Codecs on random lookups on a warmed 
index. 

I also notice that the PulsingCodec is no longer faster than the standard Codec. 
Is this news to people? I thought it was supposed to be the way forward.

My test rig (adapted from Mike's original primary key test rig here: 
http://blog.mikemccandless.com/2010/06/lucenes-pulsingcodec-on-primary-key.html) 
is attached as a zip.
The new BloomFilteringCodec is also attached here as a patch.

Searches against plain text fields also look to be faster (using AOL500k 
queries searching Wikipedia English), but that particular test rig is 
harder to include as an attachment here.

I can open a separate JIRA issue for this 4.0 version of the code if that makes 
more sense.



 Segment-level Bloom filters for a 2 x speed up on rare term searches
 

 Key: LUCENE-4069
 URL: https://issues.apache.org/jira/browse/LUCENE-4069
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/index
Affects Versions: 3.6
Reporter: Mark Harwood
Priority: Minor
 Fix For: 3.6.1

 Attachments: BloomFilterCodec40.patch, 
 MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip






[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-05-25 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Attachment: (was: BloomFilterCodec40.patch)

 Segment-level Bloom filters for a 2 x speed up on rare term searches
 

 Key: LUCENE-4069
 URL: https://issues.apache.org/jira/browse/LUCENE-4069
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/index
Affects Versions: 3.6
Reporter: Mark Harwood
Priority: Minor
 Fix For: 3.6.1

 Attachments: BloomFilterCodec40.patch, 
 MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip






[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-05-25 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Attachment: (was: PrimaryKey40PerformanceTestSrc.zip)

 Segment-level Bloom filters for a 2 x speed up on rare term searches
 

 Key: LUCENE-4069
 URL: https://issues.apache.org/jira/browse/LUCENE-4069
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/index
Affects Versions: 3.6
Reporter: Mark Harwood
Priority: Minor
 Fix For: 3.6.1

 Attachments: BloomFilterCodec40.patch, 
 MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip






[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-05-18 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Attachment: MHBloomFilterOn3.6Branch.patch

Initial patch

 Segment-level Bloom filters for a 2 x speed up on rare term searches
 

 Key: LUCENE-4069
 URL: https://issues.apache.org/jira/browse/LUCENE-4069
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/index
Affects Versions: 3.6
Reporter: Mark Harwood
Priority: Minor
 Fix For: 3.6.1

 Attachments: MHBloomFilterOn3.6Branch.patch


