Re: DisjunctionScorer performance

2009-01-06 Thread John Wang
One more thing I missed. I don't quite get your point about skip() vs next(). With OR queries, skipping does not help as much as it does with AND queries. -John On Tue, Jan 6, 2009 at 11:55 PM, John Wang wrote: > Paul: > >Our very simple/naive testing methodology for OrDocIdSetIterator: >

Re: DisjunctionScorer performance

2009-01-06 Thread John Wang
Paul: Our very simple/naive testing methodology for OrDocIdSetIterator: 5 sub-iterators, each of which just iterates from 0 to 1,000,000. The test iterates the OrDocIdSetIterator until next() returns false. Do you want me to run the same test against DisjunctDisi? -John On Tue, Jan
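For readers who want to reproduce the shape of this test outside Lucene, here is a minimal Python sketch. It is not the OrDocIdSetIterator or DisjunctionMaxScorer code: the names `or_iterate` and `run_benchmark` are hypothetical, the heap-based merge is just one common way to build a disjunction, and the range is assumed to be 0 up to (but not including) 1,000,000, since the message does not say whether the upper bound is inclusive.

```python
import heapq
import time


def or_iterate(sub_iterators):
    """Yield each doc id once, in increasing order, from sorted sub-iterators."""
    heap = []
    for index, it in enumerate(sub_iterators):
        first = next(it, None)
        if first is not None:
            # The unique index breaks ties, so iterators are never compared.
            heap.append((first, index, it))
    heapq.heapify(heap)
    last = None
    while heap:
        doc, index, it = heap[0]
        if doc != last:
            last = doc
            yield doc
        following = next(it, None)
        if following is None:
            heapq.heappop(heap)  # this sub-iterator is exhausted
        else:
            heapq.heapreplace(heap, (following, index, it))


def run_benchmark(num_subs=5, max_doc=1_000_000):
    """Drain the OR iterator; return (distinct doc count, elapsed seconds)."""
    subs = [iter(range(max_doc)) for _ in range(num_subs)]
    start = time.perf_counter()
    count = sum(1 for _ in or_iterate(subs))
    return count, time.perf_counter() - start
```

Calling `run_benchmark()` with its defaults uses the sizes described above (5 sub-iterators, 1,000,000 doc ids each). Because every sub-iterator holds the same ids, this is close to a worst case for a disjunction: all five advance on every document.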

Re: DisjunctionScorer performance

2009-01-06 Thread Paul Elschot
On Wednesday 07 January 2009 07:36:06 John Wang wrote: > Hi guys: > > We have been building a suite of boolean operators DocIdSets > (e.g. AndDocIdSet/Iterator, OrDocIdSet/Iterator, > NotDocIdSet/Iterator). We compared our implementation on the > OrDocIdSetIterator (based on DisjunctionMaxScor

DisjunctionScorer performance

2009-01-06 Thread John Wang
Hi guys: We have been building a suite of boolean operators DocIdSets (e.g. AndDocIdSet/Iterator, OrDocIdSet/Iterator, NotDocIdSet/Iterator). We compared our implementation on the OrDocIdSetIterator (based on DisjunctionMaxScorer code) with some code tuning, and we see the performance doubled

Re: TestIndexInput test failures on jdk 1.6/linux after r641303

2009-01-06 Thread Sami Siren
Michael McCandless wrote: I'll remove those 2 test cases. The build now works perfectly. Thanks Mike! -- Sami Siren

[jira] Updated: (LUCENE-1314) IndexReader.clone

2009-01-06 Thread Jason Rutherglen (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Rutherglen updated LUCENE-1314: - Attachment: LUCENE-1314.patch Everything in the previous post should be working and comp

[jira] Updated: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector

2009-01-06 Thread Mark Miller (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller updated LUCENE-1483: Attachment: LUCENE-1483.patch Merged everything and put Sort.ORD back the way it was (using ORD_SU

Re: [jira] Commented: (LUCENE-1513) fastss fuzzyquery

2009-01-06 Thread Robert Muir
for the k=1 case in my mind your last comment might not really be that much slower than storing the additional data... sounds worth investigating On Tue, Jan 6, 2009 at 8:04 PM, robert engels wrote: > I think you would need to store the position in the stream using position > == to the k factor.

[jira] Commented: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector

2009-01-06 Thread Mark Miller (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661394#action_12661394 ] Mark Miller commented on LUCENE-1483: - bq. I think we should fix TestSort so that

Re: [jira] Commented: (LUCENE-1513) fastss fuzzyquery

2009-01-06 Thread robert engels
I think you would need to store the position in the stream using position == to the k factor. Pretty straightforward, both for indexing and for searching. I think if you want the utmost in performance this is the way to go. If you don't want to store all of the additional data, I still think

[jira] Commented: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector

2009-01-06 Thread Mark Miller (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661390#action_12661390 ] Mark Miller commented on LUCENE-1483: - Can't seem to use the partial patch, but I'll t

Re: [jira] Commented: (LUCENE-1513) fastss fuzzyquery

2009-01-06 Thread Robert Muir
robert, there's only one problem i see: i don't see how you can do a single search, since fastssWC returns some false positives (with k=1 it will still return some things with an ED of 2). maybe if you store the deletion position information as a payload (thus using original fastss where there are no fal
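To make the false-positive point concrete, here is a small Python sketch. It assumes candidates are matched purely on a shared deletion variant, with no position information, which is a simplification and not the code attached to LUCENE-1513. All function names are hypothetical. The words 'abc' and 'bcd' both reduce to 'bc' after one deletion, yet their edit distance is 2, so a verification pass is needed.

```python
def single_deletions(word):
    """The word itself plus every string made by deleting one character (k=1)."""
    return {word} | {word[:i] + word[i + 1:] for i in range(len(word))}


def is_candidate(query, term):
    """True when the two words share at least one k=1 deletion variant."""
    return not single_deletions(query).isdisjoint(single_deletions(term))


def levenshtein(a, b):
    """Classic dynamic-programming edit distance, one row at a time."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def fuzzy_match(query, terms):
    """Filter by shared deletion variant, then verify distance <= 1."""
    candidates = [t for t in terms if is_candidate(query, t)]
    return [t for t in candidates if levenshtein(query, t) <= 1]
```

Here `is_candidate("abc", "bcd")` is true while `levenshtein("abc", "bcd")` is 2, which is the kind of result the verification step in `fuzzy_match` discards.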

Re: [jira] Commented: (LUCENE-1513) fastss fuzzyquery

2009-01-06 Thread robert engels
I understand now. The index in my case would definitely be MUCH larger, but I think it would perform better, as you only need to do a single search - for obert (if you assume it was a misspelling). In your case you would eventually do an OR search in the lucene index for all possible matc

Re: [jira] Commented: (LUCENE-1513) fastss fuzzyquery

2009-01-06 Thread Robert Muir
i see what you are saying here. this is different than fastss but sounds nice for spelling correction. i suppose one reason why i like fastss is for my application i need the true complete edit distance, i'm actually not using it for spelling correction but as a first step for other tasks. but ma

Re: [jira] Commented: (LUCENE-1513) fastss fuzzyquery

2009-01-06 Thread robert engels
To clarify a statement in the last email: generating the 'possible source words' in real time is not as difficult as it first seems, if you assume some sort of first-character prefix (which is what it appears Google does). For example, assume the user typed 'robrt' instead of 'robert'. You s

Re: [jira] Commented: (LUCENE-1513) fastss fuzzyquery

2009-01-06 Thread Robert Muir
On Tue, Jan 6, 2009 at 5:15 PM, robert engels wrote: > It is definitely going to increase the index size, but not any more than > the external one would (if my understanding is correct). > The nice thing is that you don't have to try and keep document numbers in > sync - it will be automati

Re: [jira] Commented: (LUCENE-1513) fastss fuzzyquery

2009-01-06 Thread robert engels
It is definitely going to increase the index size, but not any more than the external one would (if my understanding is correct). The nice thing is that you don't have to try and keep document numbers in sync - it will be automatic. Maybe I don't understand what your external index is

[jira] Resolved: (LUCENE-1502) CharArraySet behaves inconsistently in add(Object) and contains(Object)

2009-01-06 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved LUCENE-1502. Resolution: Fixed Lucene Fields: [New, Patch Available] (was: [Patch Availa

Re: [jira] Commented: (LUCENE-1513) fastss fuzzyquery

2009-01-06 Thread Robert Muir
i see, your idea would definitely simplify some things. What about the index size difference between this approach and using a separate index? Would this separate field increase index size? I guess my line of thinking is if you have 10 docs with robert, with a separate index you just have robert, and

Re: [jira] Commented: (LUCENE-1513) fastss fuzzyquery

2009-01-06 Thread robert engels
I don't think that is the case. You will have a single deletion neighborhood. The number of unique terms in the field is going to be the union of the deletion dictionaries of each source term. For example, given the following documents: A, which has field 'X' with value best, and document B wi

Re: [jira] Commented: (LUCENE-1513) fastss fuzzyquery

2009-01-06 Thread Robert Muir
a deletion neighborhood can be pretty large (for example robert is something like robert, obert, rbert, robrt, robet, ...) so if you have 100 million docs with 1 billion words, but only 100k unique terms, it definitely would be wasteful to have 1 billion deletion neighborhoods when you only need 100k.
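As a rough illustration of the sizes involved, the Python sketch below builds a deletion neighborhood and then computes neighborhoods once per unique term rather than once per occurrence. The function names are hypothetical and this is not the FastSS implementation discussed in the thread. It assumes the neighborhood is the word plus every string reachable by deleting up to k characters.

```python
from itertools import combinations


def deletion_neighborhood(word, k=1):
    """The word plus every string formed by deleting up to k characters."""
    variants = set()
    for deletions in range(min(k, len(word)) + 1):
        for positions in combinations(range(len(word)), deletions):
            removed = set(positions)
            variants.add("".join(ch for i, ch in enumerate(word)
                                 if i not in removed))
    return variants


def neighborhoods_for_unique_terms(tokens, k=1):
    """Compute one neighborhood per distinct term, not per occurrence."""
    return {term: deletion_neighborhood(term, k) for term in set(tokens)}
```

For 'robert' with k=1 this gives seven entries: the word itself and six single-deletion variants. The set grows quickly with k, which is why repeating it for every occurrence instead of every unique term would be wasteful.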

Re: [jira] Commented: (LUCENE-1513) fastss fuzzyquery

2009-01-06 Thread robert engels
Why not just create a new field for this? That is, if you have FieldA, create field FieldAFuzzy and put the various permutations there. The fuzzy scorer/parser can be changed to automatically use the Fuzzy field when required. You could also store positions, and allow that the first ter

[jira] Commented: (LUCENE-1513) fastss fuzzyquery

2009-01-06 Thread Robert Muir (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661314#action_12661314 ] Robert Muir commented on LUCENE-1513: - otis, discussion was on java-user. again, I ap

[jira] Commented: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector

2009-01-06 Thread Mark Miller (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661306#action_12661306 ] Mark Miller commented on LUCENE-1483: - bq. Could we just make ctors on each comparator

[jira] Commented: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector

2009-01-06 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661304#action_12661304 ] Michael McCandless commented on LUCENE-1483: {quote} > I'm trying to get loca

[jira] Commented: (LUCENE-1513) fastss fuzzyquery

2009-01-06 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661302#action_12661302 ] Otis Gospodnetic commented on LUCENE-1513: -- I feel like I missed some FastSS disc

[jira] Commented: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector

2009-01-06 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661295#action_12661295 ] Michael McCandless commented on LUCENE-1483: {quote} > Not sure about new cons

[jira] Updated: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector

2009-01-06 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-1483: --- Attachment: LUCENE-1483-partial.patch Attached prototype changes to switch to "setBo

[jira] Commented: (LUCENE-1513) fastss fuzzyquery

2009-01-06 Thread Otis Gospodnetic (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661286#action_12661286 ] Otis Gospodnetic commented on LUCENE-1513: -- References provided by Glen Newton:

[jira] Commented: (LUCENE-1304) Memory Leak when using Custom Sort (i.e., DistanceSortSource) of LocalLucene with Lucene

2009-01-06 Thread Mark Miller (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661269#action_12661269 ] Mark Miller commented on LUCENE-1304: - The main impact is that most of that code will

[jira] Commented: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector

2009-01-06 Thread Mark Miller (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661264#action_12661264 ] Mark Miller commented on LUCENE-1483: - I think we are wrapping up, but it may make sen

[jira] Commented: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector

2009-01-06 Thread Ryan McKinley (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661260#action_12661260 ] Ryan McKinley commented on LUCENE-1483: --- Any estimates on how far along this is? Is

[jira] Commented: (LUCENE-1504) SerialChainFilter should use DocSet API rather then deprecated BitSet API

2009-01-06 Thread Mark Miller (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661249#action_12661249 ] Mark Miller commented on LUCENE-1504: - I think there are contrib dependency examples in

[jira] Commented: (LUCENE-1512) Incorporate GeoHash in contrib/spatial

2009-01-06 Thread Ryan McKinley (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661241#action_12661241 ] Ryan McKinley commented on LUCENE-1512: --- Any chance you could make a new patch witho

[jira] Commented: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector

2009-01-06 Thread Mark Miller (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661238#action_12661238 ] Mark Miller commented on LUCENE-1483: - Here is what that example policy has to be esse

[jira] Commented: (LUCENE-1512) Incorporate GeoHash in contrib/spatial

2009-01-06 Thread Ryan McKinley (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661223#action_12661223 ] Ryan McKinley commented on LUCENE-1512: --- This is awesome. thanks patrick! > Incorp

[jira] Updated: (LUCENE-1513) fastss fuzzyquery

2009-01-06 Thread Robert Muir (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-1513: Attachment: fastSSfuzzy.zip > fastss fuzzyquery > - > > Key: LUCEN

[jira] Created: (LUCENE-1513) fastss fuzzyquery

2009-01-06 Thread Robert Muir (JIRA)
fastss fuzzyquery - Key: LUCENE-1513 URL: https://issues.apache.org/jira/browse/LUCENE-1513 Project: Lucene - Java Issue Type: New Feature Components: contrib/* Reporter: Robert Muir Priority

[jira] Commented: (LUCENE-1314) IndexReader.clone

2009-01-06 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661214#action_12661214 ] Michael McCandless commented on LUCENE-1314: {quote} > The problem is the use

[jira] Updated: (LUCENE-1512) Incorporate GeoHash in contrib/spatial

2009-01-06 Thread patrick o'leary (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] patrick o'leary updated LUCENE-1512: Attachment: LUCENE-1512.patch spatial-lucene GeoHash implementation based on http://en.wi

[jira] Created: (LUCENE-1512) Incorporate GeoHash in contrib/spatial

2009-01-06 Thread patrick o'leary (JIRA)
Incorporate GeoHash in contrib/spatial -- Key: LUCENE-1512 URL: https://issues.apache.org/jira/browse/LUCENE-1512 Project: Lucene - Java Issue Type: New Feature Components: contrib/spatial

[jira] Commented: (LUCENE-1304) Memory Leak when using Custom Sort (i.e., DistanceSortSource) of LocalLucene with Lucene

2009-01-06 Thread patrick o'leary (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661199#action_12661199 ] patrick o'leary commented on LUCENE-1304: - How will LUCENE-1483 impact this immedi

[jira] Commented: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector

2009-01-06 Thread Mark Miller (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661165#action_12661165 ] Mark Miller commented on LUCENE-1483: - There are other little conversion steps that ha

[jira] Issue Comment Edited: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector

2009-01-06 Thread Mark Miller (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661160#action_12661160 ] markrmil...@gmail.com edited comment on LUCENE-1483 at 1/6/09 6:57 AM: -

[jira] Commented: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector

2009-01-06 Thread Mark Miller (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661160#action_12661160 ] Mark Miller commented on LUCENE-1483: - bq. Mark, I see 3 testcase failures in TestSort

[jira] Commented: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector

2009-01-06 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661149#action_12661149 ] Michael McCandless commented on LUCENE-1483: On what ComparatorPolicy to use b

[jira] Commented: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector

2009-01-06 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661148#action_12661148 ] Michael McCandless commented on LUCENE-1483: I prototyped a rough change to th

[jira] Commented: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector

2009-01-06 Thread Michael McCandless (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661145#action_12661145 ] Michael McCandless commented on LUCENE-1483: Mark, I see 3 testcase failures i

[jira] Commented: (LUCENE-1509) IndexCommit.getFileNames() should not return dups

2009-01-06 Thread Shalin Shekhar Mangar (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661143#action_12661143 ] Shalin Shekhar Mangar commented on LUCENE-1509: --- Thanks Michael! > IndexCom

[jira] Commented: (LUCENE-1227) NGramTokenizer to handle more than 1024 chars

2009-01-06 Thread Grant Ingersoll (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661125#action_12661125 ] Grant Ingersoll commented on LUCENE-1227: - Yes, please do have a look and let us k