[jira] Commented: (LUCENE-1582) Make TrieRange completely independent from Document/Field with TokenStream of prefix encoded values
[ https://issues.apache.org/jira/browse/LUCENE-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695261#action_12695261 ] Shalin Shekhar Mangar commented on LUCENE-1582:
---
bq. trieCodeLong/Int() returns a TokenStream. During encoding, all char[] arrays are reused by the Token API; additional String[] arrays for the encoded result are not created, instead the TokenStream enumerates the trie values.

+1

Make TrieRange completely independent from Document/Field with TokenStream of prefix encoded values
---
Key: LUCENE-1582
URL: https://issues.apache.org/jira/browse/LUCENE-1582
Project: Lucene - Java
Issue Type: Improvement
Components: contrib/*
Affects Versions: 2.9
Reporter: Uwe Schindler
Assignee: Uwe Schindler
Fix For: 2.9

TrieRange currently has the following problems:
- To add a field that uses trie encoding, you can manually add each term to the index or use a helper method from TrieUtils. The helper method has the problem that it uses a fixed field configuration.
- TrieUtils currently creates, by default, a helper field containing the lower-precision terms to enable sorting (a limitation of one term per document for sorting).
- trieCodeLong/Int() unnecessarily creates String[] and char[] arrays, which is heavy on the GC if you index a lot of numeric values. A lot of char[]-to-String copying is also involved.

This issue should improve this:
- trieCodeLong/Int() returns a TokenStream. During encoding, all char[] arrays are reused by the Token API; additional String[] arrays for the encoded result are not created, instead the TokenStream enumerates the trie values.
- Trie fields can be added to Documents during indexing using the standard API: new Field(name, TokenStream, ...), so no extra util method is needed. By using token filters, one could also add payloads and so on, and customize everything.

The drawback is: sorting would not work anymore. To enable sorting, a (sub-)issue can extend the FieldCache to stop iterating the terms as soon as a lower-precision one is enumerated by TermEnum. I will create a hack patch for TrieUtils use only, that uses an unchecked Exception in the Parser to stop iteration. With LUCENE-831, a more generic API for this can be used (custom parser/iterator implementation for FieldCache). I will attach the field cache patch (with the temporary solution, until FieldCache is reimplemented) as a separate patch file, or maybe open another issue for it.

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
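The prefix-encoding scheme behind TrieRange can be illustrated with a small standalone sketch. This is a simplification for illustration only; the class name and the term format here are invented, not the actual TrieUtils encoding:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified illustration of trie prefix encoding: each lower-precision term
// drops `precisionStep` low-order bits of the value, so a range query can
// cover a whole subtree of values with a single lower-precision term.
class TriePrefixSketch {
    static List<String> prefixTerms(long value, int precisionStep) {
        List<String> terms = new ArrayList<>();
        for (int shift = 0; shift < 64; shift += precisionStep) {
            // Tag each term with its shift so terms of different precision
            // never collide in the index.
            terms.add(shift + ":" + Long.toHexString(value >>> shift));
        }
        return terms;
    }
}
```

For example, with precisionStep=16 a long produces four terms, one per precision level; a TokenStream-based API would enumerate these terms one at a time instead of materializing a String[] up front.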
[jira] Commented: (LUCENE-1516) Integrate IndexReader with IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-1516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695306#action_12695306 ] Michael McCandless commented on LUCENE-1516:

Good catch! I'll fix.

Integrate IndexReader with IndexWriter
---
Key: LUCENE-1516
URL: https://issues.apache.org/jira/browse/LUCENE-1516
Project: Lucene - Java
Issue Type: Improvement
Affects Versions: 2.4
Reporter: Jason Rutherglen
Assignee: Michael McCandless
Priority: Minor
Fix For: 2.9
Attachments: LUCENE-1516.patch (28 revisions), magnetic.png, ssd.png, ssd2.png
Original Estimate: 672h
Remaining Estimate: 672h

The current problem is that an IndexReader and an IndexWriter cannot be open at the same time and both perform updates, as they both require the write lock to the index. While methods such as IW.deleteDocuments enable deleting from IW, methods such as IR.deleteDocument(int doc) and norms updating are not available from IW. This limits the capability of performing updates to the index dynamically or in realtime without closing the IW and opening an IR, deleting or updating norms, flushing, then opening the IW again, a process which can be detrimental to realtime updates.

This patch will expose an IndexWriter.getReader method that returns the currently flushed state of the index as a class that implements IndexReader. The new IR implementation will differ from existing IR implementations such as MultiSegmentReader in that flushing will synchronize updates with IW, in part by sharing the write lock. All methods of IR will be usable, including reopen and clone.
Re: IndexWriter.addIndexesNoOptimize(IndexReader[] readers)
Makes sense. Wanna make a patch? We'd then deprecate addIndexes(IndexReader[]). Mike On Thu, Apr 2, 2009 at 9:16 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote: This seems like something that's tenable? It would be useful for merging ram indexes to disk where if a directory is passed, the directory may be changed. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1584) Callback for intercepting merging segments in IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695318#action_12695318 ] Michael McCandless commented on LUCENE-1584:

I think this can be achieved, today, by making your own MergeScheduler wrapper, or by subclassing ConcurrentMergeScheduler and e.g. overriding the doMerge method? If so, I'd prefer not to add a callback to IW.

Callback for intercepting merging segments in IndexWriter
---
Key: LUCENE-1584
URL: https://issues.apache.org/jira/browse/LUCENE-1584
Project: Lucene - Java
Issue Type: Improvement
Components: Index
Affects Versions: 2.4.1
Reporter: Jason Rutherglen
Priority: Minor
Fix For: 2.9
Attachments: LUCENE-1584.patch
Original Estimate: 96h
Remaining Estimate: 96h

For things like merging field caches or bitsets, it's useful to know which segments were merged to create a new segment.
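The wrapper approach suggested above can be sketched with toy interfaces. This is illustrative only; `Merger`, `RecordingMerger`, and the genealogy format are invented for the example, and Lucene's real MergeScheduler API differs:

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration of intercepting merges via a wrapper instead of an
// IndexWriter callback: the wrapper records which source segments produced
// which merged segment, then delegates to the real merge logic.
interface Merger {
    void doMerge(List<String> sourceSegments, String mergedSegment);
}

class RecordingMerger implements Merger {
    private final Merger delegate;
    final List<String> genealogy = new ArrayList<>(); // "parents -> child" records

    RecordingMerger(Merger delegate) { this.delegate = delegate; }

    @Override
    public void doMerge(List<String> sourceSegments, String mergedSegment) {
        genealogy.add(sourceSegments + " -> " + mergedSegment); // intercept first
        delegate.doMerge(sourceSegments, mergedSegment);        // then really merge
    }
}
```

In real Lucene the analogous hook would be the ConcurrentMergeScheduler.doMerge override point that Mike mentions, which sees the segments involved in each merge.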
Re: Lucene filter
Could you re-ask this on java-user, instead? Thanks.

Mike

On Thu, Apr 2, 2009 at 6:24 PM, addman addiek...@yahoo.com wrote:

How do you create a Lucene Filter to check if a field has a value? It is part of a ChainedFilter that I am creating.

--
View this message in context: http://www.nabble.com/Lucene-filter-tp22858220p22858220.html
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.

- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Future projects
On Thu, Apr 2, 2009 at 5:56 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote:

I think I need to understand better why delete by Query isn't viable in your situation... The delete by query is a separate problem which I haven't fully explored yet.

Oh, I had thought we were tugging on this thread in order to explore delete-by-docID in the writer. OK.

Tracking the segment genealogy is really an interim step for merging field caches before column stride fields get implemented.

I see -- meaning in Bobo you'd like to manage your own memory-resident field caches, and merge them whenever IW has merged a segment? Seems like you don't need genealogy for that.

Actually CSF cannot be used with Bobo's field caches anyways, which means we'd need a way to find out about the segment parents. CSF isn't really designed yet.

How come it can't be used with Bobo's field caches? We can try to accommodate Bobo's field cache needs when designing CSF. Does it operate at the segment level? Seems like that'd give you good enough realtime performance (though merging in RAM will definitely be faster).

We need to see how Bobo integrates with LUCENE-1483.

Lucene's internal field cache usage is now entirely at the segment level (ie, Lucene core should never request a full field cache array at the MultiSegmentReader level). I think Bobo must have to do the same, if it handles near-realtime updates, to get adequate performance. Though... since we have LUCENE-831 (rework the API Lucene exposes for accessing arrays-of-atomic-types-per-segment) and LUCENE-1231 (CSF = a more efficient impl (than uninversion) of the API we expose in LUCENE-831) on deck, we should try to understand Bobo's needs. EG how come Bobo made its own field cache impl? Just because uninversion is too slow?

It seems like we've been talking about CSF for 2 years and there isn't a patch for it? If I had more time I'd take a look. What is the status of it?

I think Michael is looking into it? I'd really like to get it into 2.9.

We should do it in conjunction with 831 since they are so tied. I'll write a patch that implements a callback for the segment merging such that the user can decide what information they want to record about the merged SRs (I'm pretty sure there isn't a way to do this with MergePolicy?)

Actually I think you can do this w/ a simple MergeScheduler wrapper or by subclassing CMS. I'll put a comment on the issue.

Mike

- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Future projects
On Thu, Apr 2, 2009 at 6:55 PM, John Wang john.w...@gmail.com wrote: Just to clarify, Approach 1 and Approach 2 are both currently performing OK for us. OK, that's very good to know. Mike - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)
[ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695338#action_12695338 ] Shai Erera commented on LUCENE-1575:

I've been thinking about TimeLimitedCollector and the revert to extend HitCollector I had to do in the last patch - the main reason was that I couldn't find a better name and did not want to deprecate it. But then, I thought that perhaps the current name is not so good, and we can change it? Grammatically, it is not a 'limited' collector, but more of a 'limiting' collector (I think; not being a native English speaker, I may be wrong). Alternative names I've been thinking about are TimeKeeperCollector, TimeLimitingCollector, TimingOutCollector. The advantage is that we deprecate the current one and have clear back-compat support, instead of changing it in 3.0 to extend Collector. If you agree with any of these names, I can create a new class, deprecate the current one, and change the tests back to use the new version (and remove all those comments about the changes in 3.0). What do you think?

Refactoring Lucene collectors (HitCollector and extensions)
---
Key: LUCENE-1575
URL: https://issues.apache.org/jira/browse/LUCENE-1575
Project: Lucene - Java
Issue Type: Improvement
Components: Search
Reporter: Shai Erera
Assignee: Michael McCandless
Fix For: 2.9
Attachments: LUCENE-1575.1.patch, LUCENE-1575.2.patch, LUCENE-1575.3.patch, LUCENE-1575.4.patch, LUCENE-1575.5.patch, LUCENE-1575.6.patch, LUCENE-1575.patch

This issue is a result of a recent discussion we've had on the mailing list. You can read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html]. We have agreed to do the following refactoring:
* Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base class for all Collector implementations.
* Deprecate HitCollector in favor of the new Collector.
* Introduce new methods in IndexSearcher that accept Collector, and deprecate those that accept HitCollector.
** Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector.
** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we remove HitCollector.
** It will remove any instanceof checks that currently exist in IndexSearcher code.
* Create a new (abstract) TopDocsCollector, which will:
** Leave collect and setNextReader unimplemented.
** Introduce protected members PriorityQueue and totalHits.
** Introduce a single protected constructor which accepts a PriorityQueue.
** Implement topDocs() and getTotalHits() using the PQ and totalHits members. These can be used as-is by extending classes, as well as be overridden.
** Introduce a new topDocs(start, howMany) method which will be used as a convenience method when implementing a search application which allows paging through search results. It will also attempt to improve the memory allocation, by allocating a ScoreDoc[] of the requested size only.
* Change TopScoreDocCollector to extend TopDocsCollector, and use the topDocs() and getTotalHits() implementations as they are from TopDocsCollector. The class will also be made final.
* Change TopFieldCollector to extend TopDocsCollector, and make the class final. Also implement topDocs(start, howMany).
* Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, instead of TopScoreDocCollector. Implement topDocs(start, howMany).
* Review other places where HitCollector is used, such as in Scorer; deprecate those places and use Collector instead.

Additionally, the following proposal was made w.r.t. decoupling score from collect():
* Change collect to accept only a doc id (unbased).
* Introduce a setScorer(Scorer) method.
* If during collect the implementation needs the score, it can call scorer.score().

If we do this, then we need to review all places in the code where collect(doc, score) is called, and assert whether Scorer can be passed. Also, this raises a few questions:
* What if during collect() Scorer is null (i.e., not set) - is it even possible?
* I noticed that many (if not all) of the collect() implementations discard the document if its score is not greater than 0. Doesn't it mean that the score is needed in collect() always?

Open issues:
* The name for Collector.
* TopDocsCollector was mentioned on the thread as TopResultsCollector, but that was when we thought to call Collector ResultsCollector. Since we decided (so far) on Collector, I think TopDocsCollector makes sense, because of its TopDocs output.
* Decoupling score from collect().

I will post a patch a bit later, as this is expected to be a very large patch. I will split it into 2: (1) a code patch and (2) test cases (moving to use Collector instead of HitCollector, as well as testing the new topDocs(start, howMany) method). There might even be a 3rd patch which handles the setScorer thing in Collector (maybe even a different issue?)
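The proposed decoupling of score from collect() can be sketched with toy types. This is a sketch only; the names mirror the proposal above but this is not the final Lucene API:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the proposal: collect() receives only a (segment-relative) doc id;
// a Scorer is handed to the collector up front via setScorer(), and only
// collectors that actually need the score call scorer.score().
interface Scorer {
    float score();
}

abstract class Collector {
    protected Scorer scorer;
    void setScorer(Scorer scorer) { this.scorer = scorer; }
    abstract void collect(int doc);
}

// A collector that only gathers doc ids never pays for scoring at all.
class DocIdCollector extends Collector {
    final List<Integer> docs = new ArrayList<>();
    @Override
    void collect(int doc) { docs.add(doc); }
}
```

This also shows why the "score greater than 0" question above matters: a collector like DocIdCollector keeps every doc without looking at the score, while score-filtering collectors would have to call scorer.score() explicitly.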
[jira] Updated: (LUCENE-1582) Make TrieRange completely independent from Document/Field with TokenStream of prefix encoded values
[ https://issues.apache.org/jira/browse/LUCENE-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-1582:
--
Attachment: LUCENE-1582.patch
[jira] Commented: (LUCENE-1582) Make TrieRange completely independent from Document/Field with TokenStream of prefix encoded values
[ https://issues.apache.org/jira/browse/LUCENE-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695341#action_12695341 ] Uwe Schindler commented on LUCENE-1582:

A first version of the patch:
- JavaDocs not finished (examples, documentation) yet
- New classes: IntTrieTokenStream, LongTrieTokenStream
- Removed TrieUtils.trieCodeInt/Long()
- Removed TrieUtils.addIndexFields()
- Removed all fields[] arrays; now only one field name is supported everywhere

To index a trie-encoded field, just use (preferred way):
{code}
Field f = new Field(name, new LongTrieTokenStream(value, precisionStep));
f.setOmitNorms(true);
f.setOmitTermFreqAndPositions(true);
{code}
(Maybe TrieUtils could supply a shortcut helper method that uses these special optimal settings when creating the field, e.g. TrieUtils.newLongTrieField().) This is extensible with TokenFilters, if somebody wants to add payloads and so on.

This patch also contains the sorting fixes in the core: FieldCache.StopFillCacheException can be thrown from within the parser. Maybe this should be provided as a separate sub-issue (or top-level issue), because I cannot apply patches to core. Mike, can you do this when we commit this?

Yonik: It would be nice to hear some comments from you, too. I really like the new way to create trie-encoded fields. When this moves to core, the tokenizers can be renamed to IntTokenStream, and TrieUtils will then only contain the converters to/from doubles and the encoding and range split.

About the GC note in the description of this issue: The new API does not use so many array allocations and array copies, and reuses the Token. But as it is needed to generate a TokenStream instance for every numeric value, the GC cost is about the same for the new and old API, especially because each TokenStream creates a LinkedHashMap internally for the attributes.
Just a question for the indexer people: Is it possible to add two fields with the same field name to a document, both with a TokenStream? This is needed to add more than one trie encoded value (which worked with the old API). I just want to be sure.
[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)
[ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695358#action_12695358 ] Michael McCandless commented on LUCENE-1575:

bq. If you agree with any of these names I can create a new class, deprecate the current one, change the tests back to use the new version (and remove all those comments about the changes in 3.0). What do you think?

I like this approach. I like TimeLimitingCollector, or maybe TimeoutCollector?
[jira] Updated: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)
[ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-1575:
---
Attachment: LUCENE-1575.patch

OK, I attached a new patch with some minor changes:
* Beefed up javadocs in Collector.java; fixed other javadoc warnings. Tweaked CHANGES.txt.
* Renamed PositiveOnlyScoresCollector -> PositiveScoresOnlyCollector

And also came across these questions/issues:
* TopFieldCollector's updateBottom/add methods take score, and are passed score from the non-scoring collectors, but shouldn't?
* TermScorer need not override score(HitCollector hc) (super does the same thing).
* The changes to TermScorer make me a bit nervous. EG, the new InternalScorer: will it hurt performance? Also this part:
{code}
+// Set the Scorer doc and score before calling collect in case it will be
+// used in collect()
+s.d = doc;
+s.score = score;
+c.collect(doc); // collect score
{code}
is spooky: I don't like how we worry that one may call scorer.doc() (I don't like the ambiguity in the API -- we both pass doc and fear you may call scorer.doc()). Not sure how to resolve it.
* Hmm -- we added a new abstract method to src/java/org/apache/lucene/search/Searcher.java (that accepts Collector). Should that method be concrete (and throw UOE), for back compat?
* We've also added a method to the Searchable interface, which is a break in back-compat. But my feeling is we should allow this break (but Shai, can you add another note at the top of CHANGES.txt calling this out?).
[jira] Commented: (LUCENE-1582) Make TrieRange completely independent from Document/Field with TokenStream of prefix encoded values
[ https://issues.apache.org/jira/browse/LUCENE-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695362#action_12695362 ] Michael McCandless commented on LUCENE-1582: --- bq. Maybe this should be provided as a separate sub-issue (or top-level issue), because I cannot apply patches to core. Mike, can you do this, when we commit this? It's fine to include these changes in this patch -- I can commit them all at once. bq. But as it is needed to generate a TokenStream instance for every numeric value, the GC cost is about the same for the new and old API, especially because each TokenStream creates a LinkedHashMap internally for the attributes. Hmm, we should do some perf tests to see how big a deal this turns out to be. It'd be nice to get some sort of reuse API working if performance is really hurt. (E.g. Analyzers can provide reusableTokenStream, keyed by thread.) You'd presumably have to key on thread + field name. If you do this, then probably a shortcut helper method should be the preferred way. bq. Just a question for the indexer people: Is it possible to add two fields with the same field name to a document, both with a TokenStream? Each with a different TokenStream instance, right? Yes, this should be fine; the tokens are logically concatenated just like multi-valued String fields. Make TrieRange completely independent from Document/Field with TokenStream of prefix encoded values --- Key: LUCENE-1582 URL: https://issues.apache.org/jira/browse/LUCENE-1582 Project: Lucene - Java Issue Type: Improvement Components: contrib/* Affects Versions: 2.9 Reporter: Uwe Schindler Assignee: Uwe Schindler Fix For: 2.9 Attachments: LUCENE-1582.patch TrieRange currently has the following problems: - To add a field that uses trie encoding, you can manually add each term to the index or use a helper method from TrieUtils. The helper method has the problem that it uses a fixed field configuration. - TrieUtils currently creates by default a helper field containing the lower-precision terms to enable sorting (a limitation of one term per document for sorting). - trieCodeLong/Int() unnecessarily creates String[] and char[] arrays, which is heavy on the GC if you index a lot of numeric values. A lot of char[]-to-String copying is also involved. This issue should improve this: - trieCodeLong/Int() returns a TokenStream. During encoding, all char[] arrays are reused by the Token API; additional String[] arrays for the encoded result are not created. Instead, the TokenStream enumerates the trie values. - Trie fields can be added to Documents during indexing using the standard API: new Field(name, TokenStream, ...), so no extra util method is needed. By using token filters, one could also add payloads and so on and customize everything. The drawback is: sorting would not work anymore. To enable sorting, a (sub-)issue can extend the FieldCache to stop iterating the terms as soon as a lower-precision one is enumerated by TermEnum. I will create a hack patch for TrieUtils use only, that uses an unchecked Exception in the Parser to stop iteration. With LUCENE-831, a more generic API for this type can be used (custom parser/iterator implementation for FieldCache). I will attach the field cache patch (with the temporary solution, until FieldCache is reimplemented) as a separate patch file, or maybe open another issue for it. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
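The core idea of the issue above -- enumerating prefix-coded trie terms instead of materializing a String[] up front -- can be sketched as follows. This is a toy illustration under stated assumptions, not Lucene's actual trie encoding: the term format (a shift-derived prefix character plus the hex of the shifted value) and the class name are invented for this sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch (not Lucene's real encoding): enumerate trie terms for a long
// value by repeatedly shifting off precisionStep low bits, the way the
// proposed TokenStream would, instead of building a String[] eagerly.
class TrieTermSketch {

    // Returns one prefix-coded term per precision level. A real TokenStream
    // would hand these out one incrementToken() call at a time, reusing a
    // single char[] buffer rather than collecting them into a list.
    static List<String> trieTerms(long value, int precisionStep) {
        List<String> terms = new ArrayList<>();
        for (int shift = 0; shift < 64; shift += precisionStep) {
            // lead each term with its shift level so terms of different
            // precisions never collide in the term dictionary
            terms.add((char) ('a' + shift / precisionStep) + Long.toHexString(value >>> shift));
        }
        return terms;
    }

    public static void main(String[] args) {
        // precisionStep 8 over a 64-bit value yields 8 precision levels
        System.out.println(trieTerms(123456789L, 8));
    }
}
```

A range query then only needs the coarser-precision terms to cover the interior of the range, which is why the lower-precision levels are indexed at all.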
[jira] Commented: (LUCENE-1582) Make TrieRange completely independent from Document/Field with TokenStream of prefix encoded values
[ https://issues.apache.org/jira/browse/LUCENE-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695364#action_12695364 ] Uwe Schindler commented on LUCENE-1582: --- bq. Hmm, we should do some perf tests to see how big a deal this turns out to be. It'd be nice to get some sort of reuse API working if performance is really hurt. (E.g. Analyzers can provide reusableTokenStream, keyed by thread.) You'd presumably have to key on thread + field name. If you do this, then probably a shortcut helper method should be the preferred way. We can also leave this to the implementor: if somebody indexes thousands of documents, he could reuse one instance of the TokenStream for each document. As the instance is only read on document addition, he must provide a separate instance for each field, but can reuse it for the next document. This is the same as reusing Field instances during indexing. I can add a setValue() method to the TokenStream that resets it with the new value. So one could use one instance and always call setValue() to supply a new value for each document. The precisionStep should not be modifiable. {quote} bq. Just a question for the indexer people: Is it possible to add two fields with the same field name to a document, both with a TokenStream? Each with a different TokenStream instance, right? Yes, this should be fine; the tokens are logically concatenated just like multi-valued String fields. 
{quote} Yes, sure :-)
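The per-document reuse pattern Uwe proposes -- one stream instance per field, reset via setValue() for each new document -- might look like the sketch below. The class and the exact next()/setValue() shape are hypothetical stand-ins for the proposed TrieUtils TokenStream, not its real API.

```java
// Sketch of the reuse pattern discussed above: construct once, then call
// setValue() per document instead of allocating a new stream each time.
class ReusableTrieStream {
    private final int precisionStep;  // fixed at construction, as proposed
    private long value;
    private int shift;

    ReusableTrieStream(int precisionStep) { this.precisionStep = precisionStep; }

    // Reset the stream for the next document's value.
    void setValue(long value) { this.value = value; this.shift = 0; }

    // Stand-in for incrementToken(): next trie term, or null when exhausted.
    // Uses the same toy term format as nothing in Lucene -- purely illustrative.
    String next() {
        if (shift >= 64) return null;
        String term = (char) ('a' + shift / precisionStep) + Long.toHexString(value >>> shift);
        shift += precisionStep;
        return term;
    }

    public static void main(String[] args) {
        ReusableTrieStream stream = new ReusableTrieStream(8);
        for (long v : new long[] {42L, 7L}) {  // one setValue() per "document"
            stream.setValue(v);
            for (String t; (t = stream.next()) != null; ) System.out.println(t);
        }
    }
}
```

Note how this mirrors reusing Field instances: the same object is handed to consecutive documents, so it must not be shared across fields of one document.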
[jira] Commented: (LUCENE-1341) BoostingNearQuery class (prototype)
[ https://issues.apache.org/jira/browse/LUCENE-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695370#action_12695370 ] Grant Ingersoll commented on LUCENE-1341: --- Hi Peter, This looks good; I think it just needs some unit tests and then it will be ready. BoostingNearQuery class (prototype) --- Key: LUCENE-1341 URL: https://issues.apache.org/jira/browse/LUCENE-1341 Project: Lucene - Java Issue Type: Improvement Components: Query/Scoring Affects Versions: 2.3.1 Reporter: Peter Keegan Assignee: Grant Ingersoll Priority: Minor Fix For: 3.0 Attachments: bnq.patch, bnq.patch, BoostingNearQuery.java, BoostingNearQuery.java, LUCENE-1341-new.patch, LUCENE-1341.patch This patch implements term boosting for SpanNearQuery. Refer to: http://www.gossamer-threads.com/lists/lucene/java-user/62779 This patch works but probably needs more work. I don't like the use of 'instanceof', but I didn't want to touch Spans or TermSpans. Also, the payload code is mostly a copy of what's in BoostingTermQuery and could be common-sourced somewhere. Feel free to throw darts at it :)
[jira] Commented: (LUCENE-1582) Make TrieRange completely independent from Document/Field with TokenStream of prefix encoded values
[ https://issues.apache.org/jira/browse/LUCENE-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695400#action_12695400 ] Michael McCandless commented on LUCENE-1582: --- bq. I can add a setValue() method to the tokenStream that resets it with the new value. That's a good step forward, but it'd likely mean the default is slower performance? In general I prefer (when realistic) to have the default out-of-the-box experience be good performance, but in this case it doesn't seem like there's an easy way to have a natural high-performance default. And e.g. we don't reuse Document/Field by default, so expecting someone to do a bit of work to reuse Trie's TokenStreams seems OK. It's almost like Analyzer.reusableTokenStream(...) should know it's dealing with a numeric field, and handle the reuse for you, in a future world when Lucene knows that a Field is a NumericField, meant to be indexed using trie. But we can leave all of that for future optimization; for now, providing setValue is great.
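The reuse Mike sketches verbally -- caching one stream per (thread, field), the way Analyzer.reusableTokenStream caches per thread -- could look like this. The Stream type here is a hypothetical placeholder, not a real Lucene class; the point is only the keying scheme.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: cache reusable streams keyed by thread (via ThreadLocal) and then
// by field name, since each field of a Document needs its own instance.
class PerThreadFieldCache {
    static final class Stream {       // hypothetical stand-in for a TokenStream
        long value;
        void setValue(long v) { value = v; }
    }

    // ThreadLocal gives per-thread isolation without locking; the inner map
    // holds one stream per field name for that thread.
    private static final ThreadLocal<Map<String, Stream>> CACHE =
        ThreadLocal.withInitial(HashMap::new);

    static Stream reusableStream(String field, long value) {
        Stream s = CACHE.get().computeIfAbsent(field, f -> new Stream());
        s.setValue(value);            // reset for the new document's value
        return s;
    }

    public static void main(String[] args) {
        Stream a = reusableStream("price", 10L);
        Stream b = reusableStream("price", 20L);
        // same thread + same field -> same instance, just re-valued
        System.out.println(a == b);
        System.out.println(a.value);
    }
}
```

A shortcut helper method wrapping this lookup would then be the "preferred way" the comment mentions, hiding the cache from indexing code.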
[jira] Commented: (LUCENE-1539) Improve Benchmark
[ https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695462#action_12695462 ] Michael McCandless commented on LUCENE-1539: --- This patch looks good -- some questions: * Is CreateWikiIndex intended to be committed? I thought not? I.e. I thought the goal with this issue was to add the necessary tasks so that CreateWikiIndex could be done as an alg. * I think we shouldn't bump to Java 1.5 -- it's only CreateWikiIndex that needs it anyway (in only 2 places). * PrintReaderTask never closes the reader. * Not sure why you needed to relax private -> protected in AddDocTask? Improve Benchmark - Key: LUCENE-1539 URL: https://issues.apache.org/jira/browse/LUCENE-1539 Project: Lucene - Java Issue Type: Improvement Components: contrib/benchmark Affects Versions: 2.4 Reporter: Jason Rutherglen Assignee: Michael McCandless Priority: Minor Fix For: 2.9 Attachments: LUCENE-1539.patch, LUCENE-1539.patch, sortBench2.py, sortCollate2.py Original Estimate: 336h Remaining Estimate: 336h Benchmark can be improved by incorporating recent suggestions posted on java-dev. M. McCandless' Python scripts that execute multiple rounds of tests can either be incorporated into the codebase or converted to Java.
[jira] Assigned: (LUCENE-1539) Improve Benchmark
[ https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless reassigned LUCENE-1539: -- Assignee: Michael McCandless
[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)
[ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695478#action_12695478 ] Shai Erera commented on LUCENE-1575: --- bq. I like TimeLimitingCollector, or maybe TimeoutCollector? I like TimeLimitingCollector better, as I think the name makes the class more self-explanatory. bq. TopFieldCollector's updateBottom add methods take score, and are passed score from the non-scoring collectors, but shouldn't? At the end of the day, even the non-scoring collectors store a score in ScoreDoc, which is Float.NaN. So they should pass a score. Unlike the scoring ones, they always pass Float.NaN without ever calling scorer.score(). That's the cleanest way I've found I can make the changes to that class, w/o duplicating implementation all over the place. Notice that the scoring versions extend the non-scoring, and just add score computation, which resulted in a very clean implementation. bq. TermScorer need not override score(HitCollector hc) (super does the same thing). Agreed. bq. The changes to TermScorer make me a bit nervous. Since we pass Scorer to Collector, I thought we cannot really rely on anyone not calling scorer.doc() or getSimilarity ever - it is in the API. Since doc() is abstract, I had to implement it and just thought that returning the current doc is better than -1, for example. There are some alternatives I see to resolve it: # Create an abstract ScoringOnlyScorer which extends Scorer and implements all methods to throw UOE (also as final), besides score() which it will define abstract. We then define a ScoringOnlyScorerWrapper which takes a Scorer and delegates the score() calls. We use SOSW in places where we can't extend SOS. Where we can, we just extend it directly and implement score(), like in the InternalScorer case. 
# Create a new class which implements just score() (I've yet to come up with a good name since Scorer is already taken) and create a wrapper which takes a Scorer and delegates the score() calls to it. Then Collector will use that new class, and we're sure that only score() can be called. The last two comments are completely an oversight on my side. I'm not so sure about your proposal though. If we add to Searcher a concrete impl which throws UOE, how would that work in 3.0? How would anyone who extends Searcher know that it has to override this method? Maybe do it now, and document that in 3.0 it will become abstract again? About Searchable, I wonder how many do implement Searchable, rather than extend IndexSearcher. Perhaps instead of making any changes in back-compat and adding documentation to CHANGES, I'll just comment out this method with a TODO to reinstate it in 3.0? Refactoring Lucene collectors (HitCollector and extensions) --- Key: LUCENE-1575 URL: https://issues.apache.org/jira/browse/LUCENE-1575 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Shai Erera Assignee: Michael McCandless Fix For: 2.9 Attachments: LUCENE-1575.1.patch, LUCENE-1575.2.patch, LUCENE-1575.3.patch, LUCENE-1575.4.patch, LUCENE-1575.5.patch, LUCENE-1575.6.patch, LUCENE-1575.patch, LUCENE-1575.patch This issue is a result of a recent discussion we've had on the mailing list. You can read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html]. We have agreed to do the following refactoring: * Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base class for all Collector implementations. * Deprecate HitCollector in favor of the new Collector. * Introduce new methods in IndexSearcher that accept Collector, and deprecate those that accept HitCollector. ** Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector. 
** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we remove HitCollector. ** It will remove any instanceof checks that currently exist in IndexSearcher code. * Create a new (abstract) TopDocsCollector, which will: ** Leave collect and setNextReader unimplemented. ** Introduce protected members PriorityQueue and totalHits. ** Introduce a single protected constructor which accepts a PriorityQueue. ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. These can be used as-are by extending classes, as well as be overridden. ** Introduce a new topDocs(start, howMany) method which will be used as a convenience method when implementing a search application which allows paging through search results. It will also attempt to improve the memory allocation, by allocating a ScoreDoc[] of the requested size
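The decoupling proposal in the plan above -- collect() takes only a doc id, and a collector that wants the score pulls it from the Scorer handed to it via setScorer() -- can be sketched with minimal stand-in types. These are not Lucene's real signatures, just the shape of the agreed API.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the decoupled Collector API: score computation moves behind
// setScorer(), so non-scoring collectors never pay for it.
class CollectorSketch {
    interface Scorer { float score(); }

    static abstract class Collector {
        protected Scorer scorer;
        void setScorer(Scorer scorer) { this.scorer = scorer; }
        abstract void collect(int doc);  // doc id only; no score parameter
    }

    // A scoring collector calls scorer.score() only when it needs it...
    static final class ScoringCollector extends Collector {
        final List<Float> scores = new ArrayList<>();
        void collect(int doc) { scores.add(scorer.score()); }
    }

    // ...while a plain hit counter never computes a score at all.
    static final class CountingCollector extends Collector {
        int totalHits;
        void collect(int doc) { totalHits++; }
    }

    static int countHits(int... docs) {
        CountingCollector c = new CountingCollector();
        // a Scorer that blows up proves the counter never asks for scores
        c.setScorer(() -> { throw new AssertionError("score() must not be called"); });
        for (int doc : docs) c.collect(doc);
        return c.totalHits;
    }

    public static void main(String[] args) {
        System.out.println(countHits(0, 3, 7));
    }
}
```

This is exactly the shape that makes the "what if Scorer is null during collect()?" question from the thread meaningful: the contract has to guarantee setScorer() is called before the first collect().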
[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)
[ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695513#action_12695513 ] Michael McCandless commented on LUCENE-1575: --- bq. I like TimeLimitingCollector better, as I think the name makes the class more self-explanatory. OK, let's go with that! {quote} At the end of the day, even the non-scoring collectors store a score in ScoreDoc, which is Float.NaN. So they should pass a score. Unlike the scoring ones, they always pass Float.NaN without ever calling scorer.score(). That's the cleanest way I've found I can make the changes to that class, w/o duplicating implementation all over the place. Notice that the scoring versions extend the non-scoring, and just add score computation, which resulted in a very clean implementation. {quote} OK... let's stick with this approach for now. Since the impl is locked down (the ctor for TopFieldCollector is private), we can freely switch up this API in the future without breaking back compat, if we want to optimize away passing/copying around the unused score. Can't the scoring collector impls in TopFieldCollector be final? bq. Since we pass Scorer to Collector, I thought we cannot really rely on anyone not calling scorer.doc() or getSimilarity ever Maybe instead make InternalScorer non-static, and then doc() can return the doc from the TermScorer instance, instead of having to copy s.d = doc each time? score can do a similar thing. Actually, hang on: if I'm using a Collector that doesn't need the score, TermScorer is still computing it? We don't want that, right? Can we simply pass this to setScorer(...)? bq. If we add to Searcher a concrete impl which throws UOE, how would that work in 3.0? How would anyone who extends Searcher know that it has to extend this method? Maybe do it now, and document that in 3.0 it will become abstract again? OK, let's do that? bq. About Searchable, I wonder how many do implement Searchable, rather than extend IndexSearcher. Perhaps instead of making any changes in back-compat and adding documentation to CHANGES I'll just comment out this method with a TODO to reinstate in 3.0? OK. Make sure at the end of all of this, you open a new issue, marked as fix version 3.0, that collects all the "and then in 3.0 we do XYZ" items from this one.
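The topDocs(start, howMany) convenience described in the plan is mostly paging arithmetic: allocate only as many results as the requested page can actually contain. A minimal sketch of that arithmetic, with ScoreDoc reduced to nothing (only the size computation is shown):

```java
// Sketch of the topDocs(start, howMany) allocation rule: the returned
// ScoreDoc[] should be sized to the page, clamped by the total hit count.
class PagingSketch {
    static int pageSize(int totalHits, int start, int howMany) {
        if (start >= totalHits) return 0;          // page is past the results
        return Math.min(howMany, totalHits - start); // partial last page
    }

    public static void main(String[] args) {
        // 25 hits, asking for page starting at 20 with size 10 -> 5 results
        System.out.println(pageSize(25, 20, 10));
    }
}
```

The "improve the memory allocation" point in the plan amounts to this clamp: the old pattern of allocating howMany slots wastes space on the last page.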
[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)
[ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695523#action_12695523 ] Shai Erera commented on LUCENE-1575: --- bq. Can't the scoring collector impls in TopFieldCollector be final? They can, but they are private so they cannot be extended anyway. I can do that, but does it really matter? bq. We don't want that right? Can we simply pass this to setScorer(...)? That's what I wanted to do, but then noticed that TermScorer's score() method is a bit different. However, now that I look at it again, I wonder if they really are different. The difference is that score() does, at the end, {code} return raw * Similarity.decodeNorm(norms[doc]); {code} while score(Collector, int) does {code} float[] normDecoder = Similarity.getNormDecoder(); ... score *= normDecoder[norms[doc] & 0xFF]; {code} Looking at Similarity.decodeNorm, it does exactly what's done in score(Collector, int). So I guess this code has been duplicated for no good reason? Please validate what I wrote, and if you agree, I can change the entire score(Collector, int) method to not compute any score and call c.setScorer(this). That will solve it. So are you ok with passing Scorer to Collector, instead of just a class with a single score() method? I will open an issue w/ a fix version of 3.0 and take care of all those TODOs. Should that issue also get rid of the deprecated methods? Or will we have a general issue in 3.0 that removes all deprecated methods? 
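The duplication Shai points out rests on one fact: decodeNorm(b) and getNormDecoder()[b & 0xFF] read the same precomputed 256-entry table, so the two TermScorer code paths compute identical values. The sketch below shows the equivalence with a toy decode function (i/255, not Lucene's real byte-to-float norm encoding); the `& 0xFF` matters because Java bytes are signed.

```java
// Sketch: a table-based norm decoder where the method form and the raw-array
// form are trivially the same lookup. The decode formula is a stand-in.
class NormDecodeSketch {
    private static final float[] NORM_TABLE = new float[256];
    static {
        for (int i = 0; i < 256; i++) NORM_TABLE[i] = i / 255f;  // toy decode
    }

    // Method form, as used in score():
    static float decodeNorm(byte b) { return NORM_TABLE[b & 0xFF]; }

    // Raw-table form, as used in score(Collector, int):
    static float[] getNormDecoder() { return NORM_TABLE; }

    public static void main(String[] args) {
        byte b = (byte) 200;  // negative as a signed byte; & 0xFF recovers 200
        float[] normDecoder = getNormDecoder();
        System.out.println(decodeNorm(b) == normDecoder[b & 0xFF]);
    }
}
```

Hoisting the table into a local (the normDecoder variable) only saves the method call per hit; the arithmetic is identical, which supports the "duplicated for no good reason" reading.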
[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)
[ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695525#action_12695525 ] Shai Erera commented on LUCENE-1575: --- BTW Mike - I think the accidental changes to Searchable and Searcher could have been easily detected by test-tags if we had classes in the back-compat tag which implemented the interfaces / extended the abstract classes with empty implementations. These are not really junit tests, but if someone changed an interface or abstract class, then attempting to compile the test package against the trunk would fail. It is not so relevant now, since the next release is 2.9 followed by a 3.0 and back-compat will completely go away in 3.0, but perhaps post-3.0? Also, it would prevent us from making changes to back-compat like we wanted to in this issue, but perhaps that's good?
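Shai's test-tag idea is a compile-time canary: a do-nothing implementation of each public interface. It asserts nothing interesting at runtime; its value is that adding a method to the interface makes the class stop compiling against trunk. The Searchable stand-in below is illustrative, not the real Lucene interface.

```java
// Sketch of a back-compat "test": an empty implementation that only compiles
// while the frozen interface surface is unchanged.
class BackCompatCheck {
    interface Searchable {            // hypothetical frozen API surface
        void search(String query);
        void close();
    }

    // Adding any method to Searchable breaks this class's compilation,
    // which is exactly the signal the back-compat tag should raise.
    static final class Impl implements Searchable {
        public void search(String query) {}
        public void close() {}
    }

    public static void main(String[] args) {
        System.out.println(new Impl() instanceof Searchable);
    }
}
```

The same trick works for abstract classes: a trivial concrete subclass fails to compile the moment a new abstract method appears.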
[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)
[ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695537#action_12695537 ] Michael McCandless commented on LUCENE-1575: I ran a first "do no harm" perf test, comparing trunk with this patch:
||query||sort||hits||qps||qpsnew||pctg||
|147|score|6953|3631.1|3641.8|0.3%|
|147|title|6953|2916.7|2255.6|-22.7%|
|147|doc|6953|3251.2|2676.8|-17.7%|
|text|score|157101|208.1|202.1|-2.9%|
|text|title|157101|96.7|84.8|-12.3%|
|text|doc|157101|174.0|115.2|-33.8%|
|1|score|565452|58.0|56.4|-2.8%|
|1|title|565452|44.5|34.1|-23.4%|
|1|doc|565452|49.2|32.8|-33.3%|
|1 OR 2|score|784928|14.1|13.7|-2.8%|
|1 OR 2|title|784928|12.5|11.5|-8.0%|
|1 OR 2|doc|784928|13.0|11.9|-8.5%|
|1 AND 2|score|333153|15.5|15.5|0.0%|
|1 AND 2|title|333153|14.8|13.7|-7.4%|
|1 AND 2|doc|333153|15.2|14.2|-6.6%|
Looks like:
* Sort by relevance got maybe a tad slower (~3%)
* Sort by field is now quite a bit slower (23-33% on term query '1')
This was on a full Wikipedia index, with 14 segments, Sun Java 1.6.0_07 on OS X (Mac Pro quad core), on an Intel X25-M 160 GB SSD. I think we need to iterate some to try to get some performance back. Refactoring Lucene collectors (HitCollector and extensions) --- Key: LUCENE-1575 URL: https://issues.apache.org/jira/browse/LUCENE-1575 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Shai Erera Assignee: Michael McCandless Fix For: 2.9 Attachments: LUCENE-1575.1.patch, LUCENE-1575.2.patch, LUCENE-1575.3.patch, LUCENE-1575.4.patch, LUCENE-1575.5.patch, LUCENE-1575.6.patch, LUCENE-1575.patch, LUCENE-1575.patch This issue is a result of a recent discussion we've had on the mailing list. You can read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html].
Re: Future projects
By default bobo DOES use a flavor of the field cache data structure, with some additional information for performance (e.g. minDocid, maxDocid, freq per term). Bobo is architected as a platform where clients can write their own FacetHandlers, in which each FacetHandler manages its own view of the memory structure, and thus can be more complicated than the field cache. At LinkedIn, we write FacetHandlers for geo lat/lon filtering and social network faceting. -John On Fri, Apr 3, 2009 at 3:35 AM, Michael McCandless luc...@mikemccandless.com wrote: On Thu, Apr 2, 2009 at 5:56 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote: I think I need to understand better why delete by Query isn't viable in your situation... The delete by query is a separate problem which I haven't fully explored yet. Oh, I had thought we were tugging on this thread in order to explore delete-by-docID in the writer. OK. Tracking the segment genealogy is really an interim step for merging field caches before column stride fields gets implemented. I see -- meaning in Bobo you'd like to manage your own memory-resident field caches, and merge them whenever IW has merged a segment? Seems like you don't need genealogy for that. Actually CSF cannot be used with Bobo's field caches anyway, which means we'd need a way to find out about the segment parents. CSF isn't really designed yet. How come it can't be used with Bobo's field caches? We can try to accommodate Bobo's field cache needs when designing CSF. Does it operate at the segment level? Seems like that'd give you good enough realtime performance (though merging in RAM will definitely be faster). We need to see how Bobo integrates with LUCENE-1483. Lucene's internal field cache usage is now entirely at the segment level (ie, Lucene core should never request a full field cache array at the MultiSegmentReader level). I think Bobo must do the same, if it handles near-realtime updates, to get adequate performance. Though...
since we have LUCENE-831 (rework API Lucene exposes for accessing arrays-of-atomic-types-per-segment) and LUCENE-1231 (CSF = a more efficient impl (than uninversion) of the API we expose in LUCENE-831) on deck, we should try to understand Bobo's needs. EG how come Bobo made its own field cache impl? Just because uninversion is too slow? It seems like we've been talking about CSF for 2 years and there isn't a patch for it? If I had more time I'd take a look. What is the status of it? I think Michael is looking into it? I'd really like to get it into 2.9. We should do it in conjunction with 831 since they are so tied. I'll write a patch that implements a callback for the segment merging such that the user can decide what information they want to record about the merged SRs (I'm pretty sure there isn't a way to do this with MergePolicy?) Actually I think you can do this w/ a simple MergeScheduler wrapper or by subclassing CMS. I'll put a comment on the issue. Mike - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
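Mike's suggestion of a simple MergeScheduler wrapper could be sketched like this. This is a minimal sketch under simplifying assumptions: MergeScheduler here is a hypothetical one-method stand-in (Lucene's real scheduler works on IndexWriter and MergePolicy.OneMerge objects), so only the record-then-delegate pattern carries over:

```java
import java.util.*;

// Hypothetical one-method stand-in for Lucene's MergeScheduler; the real API
// takes an IndexWriter and pulls MergePolicy.OneMerge objects from it.
interface MergeScheduler {
    void merge(List<String> segmentsToMerge, String mergedSegment);
}

// Wrapper that records each merged segment's "parents" (its genealogy)
// before delegating the actual merge work to the wrapped scheduler.
class RecordingMergeScheduler implements MergeScheduler {
    private final MergeScheduler delegate;
    final Map<String, List<String>> genealogy = new HashMap<>();

    RecordingMergeScheduler(MergeScheduler delegate) {
        this.delegate = delegate;
    }

    public void merge(List<String> segmentsToMerge, String mergedSegment) {
        // Remember which segments produced the new one, then do the real merge.
        genealogy.put(mergedSegment, new ArrayList<>(segmentsToMerge));
        delegate.merge(segmentsToMerge, mergedSegment);
    }
}
```

An application like Bobo could consult the genealogy map after each merge to combine its per-segment caches instead of rebuilding them from scratch.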
[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)
[ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695541#action_12695541 ] Michael McCandless commented on LUCENE-1575: {quote} Can't the scoring collector impls in TopFieldCollector be final? They can, but they are private so they cannot be extended anyway. I can do that, but does it really matter? {quote} I was thinking in case it ekes out some performance. bq. So I guess this code has been duplicated for no good reason? Duplicated for performance, I think. bq. I can change the entire method (score(Collector, int)) to not compute any score and call c.setScorer(this). That will solve it. I think we should try this? bq. So are you ok with passing Scorer to Collector, instead of just a class with a single score() method? Good question... I'm not sure. It would be cleaner to expose only score() (and I think we could add methods over time), but then we'd be creating a new instance per segment per search, which would only slow things down. bq. I will open an issue w/ a fix version 3.0 and take care of all those TODOs. Should the issue also get rid of the deprecated methods? Or will we have a general issue in 3.0 that removes all deprecated methods? You don't need to enumerate deprecated methods to get rid of -- we won't forget those ones :) It's these other special tasks that may slip through the cracks.
[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)
[ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695543#action_12695543 ] Michael McCandless commented on LUCENE-1575: bq. BTW Mike - I think the accidental changes to Searchable and Searcher could have been easily detected by test-tags if we had classes in the back-compat tag which implemented interfaces / extended abstract classes with empty implementations. These are not really junit tests, but if someone had changed an interface or abstract class, then attempting to compile the test package against the trunk would fail. I think that's a great idea! Every interface/abstract class should have a "just compile me" subclass in the tests. bq. It is not so relevant now, since the next release is 2.9 followed by a 3.0 and back-compat will completely go away in 3.0, but perhaps post 3.0? It is relevant because neither Searchable nor Searcher are deprecated (yet)? Ie during development of 2.9 and of 3.0 we have to ensure we don't break back compat of non-deprecated APIs. So maybe fold this in on the next patch iteration? bq. Also, it will prevent us from making changes to back-compat like we wanted to in this issue, but perhaps it's good? It's good, because it'd raise the issue right away vs us catching it or not by staring at the code :)
[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)
[ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695553#action_12695553 ] Jason Rutherglen commented on LUCENE-1575: -- Something related to time-limiting collectors that we may want to solve (maybe not in this patch) is passing the time limit down to the sub-scorers. At the hit-collector level, the sub-scorers of a multi-clause query could be busy exceeding the time limit before returning the first doc hit? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
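The topDocs(start, howMany) paging convenience proposed for TopDocsCollector can be sketched in a self-contained way. ScoreDoc and PagingCollector below are simplified stand-ins (using java.util.PriorityQueue rather than Lucene's PQ, and sorting a copy where the real method would pop entries), but the idea of returning only the requested page is the same:

```java
import java.util.*;

// Simplified stand-in for Lucene's ScoreDoc.
class ScoreDoc {
    final int doc;
    final float score;
    ScoreDoc(int doc, float score) { this.doc = doc; this.score = score; }
}

// Illustrative collector keeping the best maxSize hits in a min-heap,
// with the proposed topDocs(start, howMany) paging convenience.
class PagingCollector {
    private final PriorityQueue<ScoreDoc> pq =
        new PriorityQueue<>(Comparator.comparingDouble((ScoreDoc sd) -> sd.score));
    private final int maxSize;
    int totalHits;

    PagingCollector(int maxSize) { this.maxSize = maxSize; }

    void collect(int doc, float score) {
        totalHits++;
        pq.offer(new ScoreDoc(doc, score));
        if (pq.size() > maxSize) pq.poll(); // evict the current worst hit
    }

    // Returns hits [start, start + howMany), best first. The real method would
    // pop from the PQ directly; this sketch sorts a copy for simplicity.
    ScoreDoc[] topDocs(int start, int howMany) {
        ScoreDoc[] all = pq.toArray(new ScoreDoc[0]);
        Arrays.sort(all, (a, b) -> Float.compare(b.score, a.score));
        int from = Math.min(start, all.length);
        int to = Math.min(start + howMany, all.length);
        return Arrays.copyOfRange(all, from, to);
    }
}
```

A paging search UI would then call topDocs(0, 10) for the first page, topDocs(10, 10) for the second, and so on, without ever materializing all collected hits.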
[jira] Updated: (LUCENE-1582) Make TrieRange completely independent from Document/Field with TokenStream of prefix encoded values
[ https://issues.apache.org/jira/browse/LUCENE-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-1582: -- Attachment: LUCENE-1582.patch Make TrieRange completely independent from Document/Field with TokenStream of prefix encoded values --- Key: LUCENE-1582 URL: https://issues.apache.org/jira/browse/LUCENE-1582 Project: Lucene - Java Issue Type: Improvement Components: contrib/* Affects Versions: 2.9 Reporter: Uwe Schindler Assignee: Uwe Schindler Fix For: 2.9 Attachments: LUCENE-1582.patch, LUCENE-1582.patch TrieRange currently has the following problems:
- To add a field that uses trie encoding, you can manually add each term to the index or use a helper method from TrieUtils. The helper method has the problem that it uses a fixed field configuration.
- TrieUtils currently creates by default a helper field containing the lower-precision terms to enable sorting (a limitation of one term/document for sorting).
- trieCodeLong/Int() unnecessarily creates String[] and char[] arrays, which is heavy on GC if you index lots of numeric values. Also a lot of char[]-to-String copying is involved.
This issue should improve this:
- trieCodeLong/Int() returns a TokenStream. During encoding, all char[] arrays are reused by the Token API; additional String[] arrays for the encoded result are not created, instead the TokenStream enumerates the trie values.
- Trie fields can be added to Documents during indexing using the standard API: new Field(name,TokenStream,...), so no extra util method is needed. By using token filters, one could also add payloads and so on, and customize everything.
The drawback is: sorting would not work anymore. To enable sorting, a (sub-)issue can extend the FieldCache to stop iterating the terms as soon as a lower-precision one is enumerated by TermEnum. I will create a hack patch for TrieUtils use only, that uses a non-checked Exception in the Parser to stop iteration. With LUCENE-831, a more generic API for this can be used (custom parser/iterator implementation for FieldCache). I will attach the field cache patch (with the temporary solution, until FieldCache is reimplemented) as a separate patch file, or maybe open another issue for it.
[jira] Commented: (LUCENE-1582) Make TrieRange completely independent from Document/Field with TokenStream of prefix encoded values
[ https://issues.apache.org/jira/browse/LUCENE-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695554#action_12695554 ] Uwe Schindler commented on LUCENE-1582: --- Updated patch:
- supports a setValue() to reset the TokenStream with a new value for reuse (as discussed before)
- completed JavaDocs
- removed dead code parts
- small change in the RangeBuilder API (unneeded parameters)
The difference between reusing fields and token streams versus always creating new ones is measurable (I compared in the test case), but not significant. The JavaDocs contain information on how to reuse. I have done everything I planned; now it's time to discuss the change.
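The prefix-encoded values that such a TokenStream would enumerate can be illustrated with a short sketch. The term format here (a shift prefix plus hex digits) is an assumption for readability, not TrieRange's actual byte-level encoding, and a real trieCodeLong() would emit these one Token at a time with reused char[] buffers rather than building a list:

```java
import java.util.*;

// Illustrative prefix encoding of a long at decreasing precision: one term per
// precisionStep, where higher shifts keep fewer leading bits (lower precision).
class TriePrefixEncoder {
    static List<String> encode(long value, int precisionStep) {
        List<String> terms = new ArrayList<>();
        for (int shift = 0; shift < 64; shift += precisionStep) {
            // Tag each term with its shift so different precisions never collide.
            terms.add(String.format("shift%02d:%x", shift, value >>> shift));
        }
        return terms;
    }
}
```

For example, encode(0x1234L, 16) yields four terms, from full precision (shift 0) down to the coarsest (shift 48); a range query can then match a few coarse terms in the middle of the range plus fine terms at its edges.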
[jira] Commented: (LUCENE-1584) Callback for intercepting merging segments in IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695557#action_12695557 ] Jason Rutherglen commented on LUCENE-1584: -- I would like to move away from our current position of somewhat closed APIs that require user classes to be a part of the Lucene packages. It's always best to reuse existing APIs; however, we've migrated to OSGi, which means anytime we need to place new classes in Lucene packages, we need to roll out specific JARs (I think, perhaps it's more complex) for the few classes outside of our main package classes. This makes deployment of search applications a bit more difficult and time consuming. A related thread regarding MergePolicy is at: http://markmail.org/thread/h5bxjflpcyejrcqg Callback for intercepting merging segments in IndexWriter - Key: LUCENE-1584 URL: https://issues.apache.org/jira/browse/LUCENE-1584 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: 2.4.1 Reporter: Jason Rutherglen Priority: Minor Fix For: 2.9 Attachments: LUCENE-1584.patch Original Estimate: 96h Remaining Estimate: 96h For things like merging field caches or bitsets, it's useful to know which segments were merged to create a new segment.
[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)
[ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695575#action_12695575 ] Shai Erera commented on LUCENE-1575: How do I run such a test? Is there an algorithm for that in the benchmark package? I compared the new TSDC to the trunk's version and the new code does ('-' means a negative change, '+' means a positive change, '|' means neither/undetermined): * adds one collector.setScorer() call to each query. (-) * The scorer.score() call in collect() was just moved from whoever called collect() to inside collect(), so I don't think there's a difference. (|) * Does not check if score 0.0f in each collect (+) * implements the new topDocs() method. Previously, it just implemented topDocs() which returned everything. Now, topDocs() calls topDocs(0, pq.size()), which verifies parameters and such - since that's executed once at the end of the search, I doubt that it has any effect major effect on the results. BTW, as I scanned through the code I noticed that previously TSDC returned maxScore = Float.NEGATIVE_INFINITY in case there were 0 results to the query, and now it returns Float.NaN. I'm not sure however if this breaks anything, since maxScore is probably used (if at all) for normalization of scores, and in case there are 0 results you don't really have anything to normalize? However I'm not sure ... Regarding TopFieldDocs I am quite surprised. I assume the test uses the OneComparatorScoringCollector, which means scores are computed: * It has the same issue as in TSDC regarding topDocs(). So I think it should be changed here as well, however I doubt that's the cause for the performance hit. * It computes the score and then does super.collect(), which adds a method call (-) * It doesn't check if the score is 0 (+) * It calls comparator.setScorer, which is ignored in all comparators besides RelevanceComparator. 
Not sure if it has any performance effects (|) The rest of the code in collect() is exactly the same. Can it be that super.collect() has such an effect? When I think on the results of TSDC (-3%) vs. TFC (-28% on avg.), I think it might be since setScorer() is called once before the series of collect() calls, however super.collect() is called for every document. Your index is large (2M documents, right?) and I don't know how many results are for each query, if they are in the range of 100Ks, then that could be the explanation. Mike - in case it's faster for you to run it, can you try to run the test again with a change in the code which inlines super.collect() into OneComparatorScoringCollector and compare the results again? I will run it also after you tell me which algorithm you used, but only tomorrow morning, so if you get to do it before then, that'd be great. I doubt that the change in topDocs() affects the query time that much, since it's called at the end of the search, and doing 4-5 'if' statements is really not that expensive (I mean once per the entire search), comparing to ScoreDoc[] array allocation, fetching Stored fields from the index etc. So I'd hate to implement all 3 topDocs() in each of the TopDocsCollector extensions unless it proves to be a problem. Shai On Fri, Apr 3, 2009 at 10:02 PM, Michael McCandless (JIRA) j...@apache.orgwrote: Refactoring Lucene collectors (HitCollector and extensions) --- Key: LUCENE-1575 URL: https://issues.apache.org/jira/browse/LUCENE-1575 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Shai Erera Assignee: Michael McCandless Fix For: 2.9 Attachments: LUCENE-1575.1.patch, LUCENE-1575.2.patch, LUCENE-1575.3.patch, LUCENE-1575.4.patch, LUCENE-1575.5.patch, LUCENE-1575.6.patch, LUCENE-1575.patch, LUCENE-1575.patch This issue is a result of a recent discussion we've had on the mailing list. 
You can read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html]. We have agreed to do the following refactoring: * Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base class for all Collector implementations. * Deprecate HitCollector in favor of the new Collector. * Introduce new methods in IndexSearcher that accept Collector, and deprecate those that accept HitCollector. ** Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector. ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we remove HitCollector. ** It will remove any instanceof checks that currently exist in IndexSearcher code. * Create a new (abstract) TopDocsCollector, which will: ** Leave collect and setNextReader unimplemented.
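To make the proposed contract concrete, here is a minimal standalone sketch of a Collector with setScorer()/collect()/setNextReader() and a TopDocsCollector-style implementation backed by a priority queue. The class and method names mirror the proposal above, but this is illustrative code, not the actual Lucene patch.

```java
import java.util.PriorityQueue;

// Simplified sketch of the proposed Collector contract; not actual Lucene code.
abstract class Collector {
    // Called once per segment; subsequent collect() calls receive
    // segment-relative doc ids, and docBase rebases them to index-wide ids.
    public abstract void setNextReader(int docBase);
    // Decouples scoring from collection: implementations call scorer.score()
    // only when they actually need the score.
    public abstract void setScorer(Scorer scorer);
    public abstract void collect(int doc);
}

interface Scorer {
    float score();
}

// Keeps the numHits highest-scoring documents in a min-heap.
final class TopScoreDocCollector extends Collector {
    final PriorityQueue<float[]> pq; // each entry is {score, globalDoc}
    final int numHits;
    int docBase;
    int totalHits;
    private Scorer scorer;

    TopScoreDocCollector(int numHits) {
        this.numHits = numHits;
        this.pq = new PriorityQueue<>(numHits, (a, b) -> Float.compare(a[0], b[0]));
    }

    public void setNextReader(int docBase) { this.docBase = docBase; }

    public void setScorer(Scorer scorer) { this.scorer = scorer; }

    public void collect(int doc) {
        totalHits++;
        float score = scorer.score();
        if (pq.size() < numHits) {
            pq.add(new float[] { score, docBase + doc });
        } else if (score > pq.peek()[0]) {
            pq.poll();
            pq.add(new float[] { score, docBase + doc });
        }
    }
}
```

Note how the scorer is set once per segment rather than the score being pushed into every collect() call, which is exactly the setScorer() decoupling debated above.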
[jira] Commented: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector
[ https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695577#action_12695577 ] Jeremy Volkman commented on LUCENE-1483: I'm trying to create a FieldValueHitQueue outside of an IndexSearcher. One part of my code collects all results in a fashion similar to http://www.gossamer-threads.com/lists/lucene/java-user/66362#66362. At the end of my collection, I used to pass the results through a FieldSortedHitQueue of the proper size to get sorted results. The problem now is that FieldValueHitQueue takes an array of subreaders instead of one IndexReader. As far as I can tell, there's no way for me to get a proper sorted array of subreaders for an IndexReader without copying and pasting the gatherSubReaders and sortSubReaders methods from IndexSearcher. This isn't desirable, so could IndexSearcher perhaps provide some sort of getSortedSubReaders() method? Either that, or extract this functionality out into a common utility method that IndexSearcher uses. 
Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector Key: LUCENE-1483 URL: https://issues.apache.org/jira/browse/LUCENE-1483 Project: Lucene - Java Issue Type: Improvement Affects Versions: 2.9 Reporter: Mark Miller Assignee: Michael McCandless Priority: Minor Fix For: 2.9 Attachments: LUCENE-1483-backcompat.patch, LUCENE-1483-partial.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, sortBench.py, sortCollate.py This issue changes how an IndexSearcher searches over multiple segments. The current method of searching multiple segments is to use a MultiSegmentReader and treat all of the segments as one. This causes filters and FieldCaches to be keyed to the MultiReader and makes reopen expensive. If only a few segments change, the FieldCache is still loaded for all of them. This patch changes things by searching each individual segment one at a time, but sharing the HitCollector used across each segment. This allows FieldCaches and Filters to be keyed on individual SegmentReaders, making reopen much cheaper. FieldCache loading over multiple segments can be much faster as well - with the old method, all unique terms for every segment are enumerated against each segment - because of the likely logarithmic change in terms per segment, this can be very wasteful. Searching individual segments avoids this cost.
The term/document statistics from the multireader are used to score results for each segment. When sorting, it's more difficult to use a single HitCollector for each sub-searcher: ordinals are not comparable across segments. To account for this, a new field-sort-enabled HitCollector is introduced that is able to collect and sort across segments (because of its ability to compare ordinals across segments). This TopFieldCollector class will collect the values/ordinals for a given segment, and upon moving to the next segment, translate any ordinals/values so that they can be compared against the values for the new segment. This is done lazily. All in all, the switch seems to provide numerous performance benefits, in both sorted and non-sorted search. We were seeing a noticeable loss on indices with lots of segments (1000?) and certain queue sizes / queries, but the latest results seem to show that's been mostly taken care of (you shouldn't be using such a large queue on such a segmented index anyway). * Introduces ** MultiReaderHitCollector - a HitCollector that can collect across multiple IndexReaders. Old HitCollectors are wrapped to support multiple IndexReaders. ** TopFieldCollector - a HitCollector that can compare values/ordinals across IndexReaders and sort on fields. ** FieldValueHitQueue - a priority queue that is part of the TopFieldCollector implementation. ** FieldComparator - a new Comparator class that works across IndexReaders. Part of the TopFieldCollector implementation. ** FieldComparatorSource - new class to allow for custom Comparators.
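The per-segment collection described above can be sketched as a small loop in which one collector is shared across all segments and only setNextReader() sees the docBase. This is a hypothetical standalone simplification with invented names, not the actual patch:

```java
// Minimal sketch of searching each segment separately while sharing one
// collector, as the patch describes. Doc ids inside a segment start at 0;
// the running docBase rebases them to index-wide ids.
interface SegmentCollector {
    void setNextReader(int segment, int docBase);
    void collect(int doc); // doc is segment-relative
}

class PerSegmentSearcher {
    final int[] segmentSizes; // maxDoc of each segment

    PerSegmentSearcher(int[] segmentSizes) { this.segmentSizes = segmentSizes; }

    void search(SegmentCollector collector) {
        int docBase = 0;
        for (int i = 0; i < segmentSizes.length; i++) {
            // FieldCaches/Filters would be keyed per segment here.
            collector.setNextReader(i, docBase);
            for (int doc = 0; doc < segmentSizes[i]; doc++) {
                collector.collect(doc); // only matching docs in real code
            }
            docBase += segmentSizes[i];
        }
    }
}
```

Because the collector sees segment-relative doc ids, caches keyed on a SegmentReader stay valid after a reopen that only adds or merges a few segments, which is the whole point of the change.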
[jira] Commented: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector
[ https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695586#action_12695586 ] Shai Erera commented on LUCENE-1483: Hi Jeremy This will be taken care of in 1575 by removing the IndexReader[] arg from TopFieldCollector. As a matter of fact, 1575 changes quite a bit the collector's API, so you might want to take a look there. Anyway, I've run into the same issue there and realized this arg can be safely removed from TopFieldCollector as well as FieldValueHitQueue. Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector Key: LUCENE-1483 URL: https://issues.apache.org/jira/browse/LUCENE-1483 Project: Lucene - Java Issue Type: Improvement Affects Versions: 2.9 Reporter: Mark Miller Assignee: Michael McCandless Priority: Minor Fix For: 2.9 Attachments: LUCENE-1483-backcompat.patch, LUCENE-1483-partial.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, sortBench.py, sortCollate.py This issue changes how an IndexSearcher searches over multiple segments. The current method of searching multiple segments is to use a MultiSegmentReader and treat all of the segments as one. This causes filters and FieldCaches to be keyed to the MultiReader and makes reopen expensive. 
If only a few segments change, the FieldCache is still loaded for all of them. This patch changes things by searching each individual segment one at a time, but sharing the HitCollector used across each segment. This allows FieldCaches and Filters to be keyed on individual SegmentReaders, making reopen much cheaper. FieldCache loading over multiple segments can be much faster as well - with the old method, all unique terms for every segment is enumerated against each segment - because of the likely logarithmic change in terms per segment, this can be very wasteful. Searching individual segments avoids this cost. The term/document statistics from the multireader are used to score results for each segment. When sorting, its more difficult to use a single HitCollector for each sub searcher. Ordinals are not comparable across segments. To account for this, a new field sort enabled HitCollector is introduced that is able to collect and sort across segments (because of its ability to compare ordinals across segments). This TopFieldCollector class will collect the values/ordinals for a given segment, and upon moving to the next segment, translate any ordinals/values so that they can be compared against the values for the new segment. This is done lazily. All and all, the switch seems to provide numerous performance benefits, in both sorted and non sorted search. We were seeing a good loss on indices with lots of segments (1000?) and certain queue sizes / queries, but the latest results seem to show thats been mostly taken care of (you shouldnt be using such a large queue on such a segmented index anyway). * Introduces ** MultiReaderHitCollector - a HitCollector that can collect across multiple IndexReaders. Old HitCollectors are wrapped to support multiple IndexReaders. ** TopFieldCollector - a HitCollector that can compare values/ordinals across IndexReaders and sort on fields. ** FieldValueHitQueue - a Priority queue that is part of the TopFieldCollector implementation. 
** FieldComparator - a new Comparator class that works across IndexReaders. Part of the TopFieldCollector implementation. ** FieldComparatorSource - new class to allow for custom Comparators. * Alters ** IndexSearcher uses a single HitCollector to collect hits against each individual SegmentReader. All the other changes stem from this ;) * Deprecates ** TopFieldDocCollector ** FieldSortedHitQueue -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail:
[jira] Commented: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector
[ https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695587#action_12695587 ] Uwe Schindler commented on LUCENE-1483: --- This will be changed as part of LUCENE-1575.
Re: Future projects
meaning in Bobo you'd like to manage your own memory resident field caches, and merge them whenever IW has merged a segment? Seems like you don't need genealogy for that. Agreed, there is no need for full genealogy. CSF isn't really designed yet. How come it can't be used with Bobo's field caches? I guess CSF should be able to support it, makes sense. As long as the container is flexible with the encoding (I need to look into this more on the Bobo side). Lucene's internal field cache usage is now entirely at the segment level (ie, Lucene core should never request full field cache array at the MultiSegmentReader level). I think Bobo must have to do the same, if it handles near realtime updates, to get adequate performance. Bobo needs to migrate to this model, I don't think we've done that yet. EG how come Bobo made its own field cache impl? Just because uninversion is too slow? It could be integrated once LUCENE-831 is completed. I think the current model of a weak reference and the inability to unload if needed is a concern. I don't think it's because of uninversion. On Fri, Apr 3, 2009 at 3:35 AM, Michael McCandless luc...@mikemccandless.com wrote: On Thu, Apr 2, 2009 at 5:56 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote: I think I need to understand better why delete by Query isn't viable in your situation... The delete by query is a separate problem which I haven't fully explored yet. Oh, I had thought we were tugging on this thread in order to explore delete-by-docID in the writer. OK. Tracking the segment genealogy is really an interim step for merging field caches before column stride fields gets implemented. I see -- meaning in Bobo you'd like to manage your own memory resident field caches, and merge them whenever IW has merged a segment? Seems like you don't need genealogy for that. Actually CSF cannot be used with Bobo's field caches anyways which means we'd need a way to find out about the segment parents. CSF isn't really designed yet. 
How come it can't be used with Bobo's field caches? We can try to accommodate Bobo's field cache needs when designing CSF. Does it operate at the segment level? Seems like that'd give you good enough realtime performance (though merging in RAM will definitely be faster). We need to see how Bobo integrates with LUCENE-1483. Lucene's internal field cache usage is now entirely at the segment level (ie, Lucene core should never request full field cache array at the MultiSegmentReader level). I think Bobo must have to do the same, if it handles near realtime updates, to get adequate performance. Though... since we have LUCENE-831 (rework API Lucene exposes for accessing arrays-of-atomic-types-per-segment) and LUCENE-1231 (CSF = a more efficient impl (than uninversion) of the API we expose in LUCENE-831) on deck, we should try to understand Bobo's needs. EG how come Bobo made its own field cache impl? Just because uninversion is too slow? It seems like we've been talking about CSF for 2 years and there isn't a patch for it? If I had more time I'd take a look. What is the status of it? I think Michael is looking into it? I'd really like to get it into 2.9. We should do it in conjunction with 831 since they are so tied. I'll write a patch that implements a callback for the segment merging such that the user can decide what information they want to record about the merged SRs (I'm pretty sure there isn't a way to do this with MergePolicy?) Actually I think you can do this w/ a simple MergeScheduler wrapper or by subclassing CMS. I'll put a comment on the issue. Mike
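The merge-callback idea discussed here could be sketched roughly as below. All names are invented for illustration; the real hook, as Mike suggests, would be a MergeScheduler wrapper or a ConcurrentMergeScheduler subclass rather than a new class:

```java
import java.util.List;

// Hypothetical listener telling the application which source segments were
// combined into which merged segment, so it can merge its own memory-resident
// field caches the same way (the "genealogy" being discussed).
interface MergeListener {
    void onMerge(List<String> sourceSegments, String mergedSegment);
}

// Hypothetical wrapper invoked when a merge finishes.
class NotifyingMerger {
    private final MergeListener listener;

    NotifyingMerger(MergeListener listener) { this.listener = listener; }

    void merged(List<String> sources, String target) {
        listener.onMerge(sources, target);
    }
}
```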
[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)
[ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695595#action_12695595 ] Shai Erera commented on LUCENE-1575: BTW, I can change FieldValueHitQueue like I changed TopFieldCollector, by introducing a factory create() method which will return a OneComparatorFieldValueHitQueue or MultiComparatorsFieldValueHitQueue. Today, FVHQ.lessThan checks numComparators in each call, which is redundant. Also, the class isn't final and I'm not sure whether we want to change that. What do you think? Refactoring Lucene collectors (HitCollector and extensions) --- Key: LUCENE-1575 URL: https://issues.apache.org/jira/browse/LUCENE-1575 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Shai Erera Assignee: Michael McCandless Fix For: 2.9 Attachments: LUCENE-1575.1.patch, LUCENE-1575.2.patch, LUCENE-1575.3.patch, LUCENE-1575.4.patch, LUCENE-1575.5.patch, LUCENE-1575.6.patch, LUCENE-1575.patch, LUCENE-1575.patch This issue is a result of a recent discussion we've had on the mailing list. You can read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html]. We have agreed to do the following refactoring: * Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base class for all Collector implementations. * Deprecate HitCollector in favor of the new Collector. * Introduce new methods in IndexSearcher that accept Collector, and deprecate those that accept HitCollector. ** Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector. ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we remove HitCollector. ** It will remove any instanceof checks that currently exist in IndexSearcher code. * Create a new (abstract) TopDocsCollector, which will: ** Leave collect and setNextReader unimplemented.
** Introduce protected members PriorityQueue and totalHits. ** Introduce a single protected constructor which accepts a PriorityQueue. ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. These can be used as they are by extending classes, as well as be overridden. ** Introduce a new topDocs(start, howMany) method which will be used as a convenience method when implementing a search application which allows paging through search results. It will also attempt to improve the memory allocation, by allocating a ScoreDoc[] of the requested size only. * Change TopScoreDocCollector to extend TopDocsCollector, and use the topDocs() and getTotalHits() implementations as they are from TopDocsCollector. The class will also be made final. * Change TopFieldCollector to extend TopDocsCollector, and make the class final. Also implement topDocs(start, howMany). * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, instead of TopScoreDocCollector. Implement topDocs(start, howMany). * Review other places where HitCollector is used, such as in Scorer, deprecate those places and use Collector instead. Additionally, the following proposal was made w.r.t. decoupling score from collect(): * Change collect to accept only a doc Id (unbased, i.e., relative to the current reader). * Introduce a setScorer(Scorer) method. * If during collect the implementation needs the score, it can call scorer.score(). If we do this, then we need to review all places in the code where collect(doc, score) is called, and assert whether Scorer can be passed. Also this raises a few questions: * What if during collect() Scorer is null? (i.e., not set) - is it even possible? * I noticed that many (if not all) of the collect() implementations discard the document if its score is not greater than 0. Doesn't that mean the score is always needed in collect()? Open issues: * The name for Collector. * TopDocsCollector was mentioned on the thread as TopResultsCollector, but that was when we thought to call Collector ResultsCollector.
Since we decided (so far) on Collector, I think TopDocsCollector makes sense, because of its TopDocs output. * Decoupling score from collect(). I will post a patch a bit later, as this is expected to be a very large patch. I will split it into 2: (1) code patch (2) test cases (moving to use Collector instead of HitCollector, as well as testing the new topDocs(start, howMany) method). There might even be a 3rd patch which handles the setScorer thing in Collector (maybe even a different issue?)
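As a rough illustration of the proposed topDocs(start, howMany) paging, here is a simplified standalone sketch (not the actual patch): the collector keeps the top scores in a min-heap, and the paging method allocates a result array of only the requested size. Note the sketch drains the queue, so it is called once at the end of a search, which matches how topDocs() is used.

```java
import java.util.PriorityQueue;

// Standalone simplification: tracks only scores, where Lucene tracks ScoreDocs.
class TopDocsSketch {
    final PriorityQueue<Float> pq = new PriorityQueue<>(); // min-heap of top scores
    int totalHits;

    void collect(float score, int maxSize) {
        totalHits++;
        if (pq.size() < maxSize) {
            pq.add(score);
        } else if (score > pq.peek()) {
            pq.poll();
            pq.add(score);
        }
    }

    // Returns scores [start, start + howMany) in descending order; the result
    // array is only as large as the requested page. Drains the queue, so it is
    // invoked once, at the end of the search.
    float[] topDocs(int start, int howMany) {
        int size = pq.size();
        if (start < 0 || start >= size) return new float[0];
        howMany = Math.min(howMany, size - start);
        float[] all = new float[size];
        for (int i = size - 1; i >= 0; i--) all[i] = pq.poll(); // ascending pops
        float[] page = new float[howMany];
        System.arraycopy(all, start, page, 0, howMany);
        return page;
    }
}
```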
[jira] Commented: (LUCENE-1231) Column-stride fields (aka per-document Payloads)
[ https://issues.apache.org/jira/browse/LUCENE-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695666#action_12695666 ] Michael Busch commented on LUCENE-1231: --- For the search side we need an API similar to TermDocs and Payloads; let's call it ColumnStrideFieldAccessor (CSFA) for now. It should have next(), skipTo(), doc(), etc. methods. However, the way TermPositions#getPayloads() currently works is that it always forces you to copy the bytes from the underlying IndexInput into the payload byte[] array. Since we usually use a BufferedIndexInput, this is then an arraycopy from BufferedIndexInput's buffer array into the byte array. I think to improve this we could allow users to call methods like readVInt() directly on the CSFA. So I was thinking about adding DataInput and DataOutput as superclasses of IndexInput and IndexOutput. DataIn(Out)put would implement the different read and write methods, whereas IndexIn(Out)put would only implement methods like close(), seek(), getFilePointer(), length(), flush(), etc. So then CSFA would extend DataInput, or alternatively have a getDataInput() method. The danger here, compared to the current payloads API, would be that the user might read too few or too many bytes of a CSF, which would result in undefined and possibly hard-to-debug behavior. But we could offer e.g.:
{code}
static ColumnStrideFieldAccessor getAccessor(ColumnStrideFieldAccessor in, Mode mode) {
  if (mode == Mode.Fast) {
    return in;
  } else if (mode == Mode.Safe) {
    return new SafeAccessor(in);
  }
  throw new IllegalArgumentException("Unknown mode: " + mode);
}
{code}
The SafeAccessor would count for you the number of bytes read and throw exceptions if you don't consume the number of bytes you should consume. This is of course overhead, but users could use the SafeAccessor until they're confident that everything works fine in their system, and then switch to the fast accessor for better performance.
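The byte-counting idea behind SafeAccessor could look roughly like this. This is a standalone sketch with invented names, over an in-memory byte array instead of an IndexInput, not a proposed implementation:

```java
// Wraps raw value bytes and counts consumption, throwing if the caller reads
// past the declared value length or leaves bytes unread. Hypothetical names.
class SafeAccessorSketch {
    private final byte[] data;
    private int pos;
    private int limit;

    SafeAccessorSketch(byte[] data) { this.data = data; }

    // Declare the length of the next value before reading it.
    void startValue(int length) { limit = pos + length; }

    byte readByte() {
        if (pos >= limit) throw new IllegalStateException("read past value end");
        return data[pos++];
    }

    // Verify the caller consumed exactly the declared number of bytes.
    void endValue() {
        if (pos != limit) {
            throw new IllegalStateException((limit - pos) + " bytes left unread");
        }
    }
}
```

A fast-mode accessor would skip the two checks entirely, which is the overhead trade-off the comment describes.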
If there are no objections I will open a separate JIRA issue for the DataInput/Output patch. Column-stride fields (aka per-document Payloads) Key: LUCENE-1231 URL: https://issues.apache.org/jira/browse/LUCENE-1231 Project: Lucene - Java Issue Type: New Feature Components: Index Reporter: Michael Busch Assignee: Michael Busch Priority: Minor Fix For: 3.0 This new feature has been proposed and discussed here: http://markmail.org/search/?q=per-document+payloads#query:per-document%20payloads+page:1+mid:jq4g5myhlvidw3oc+state:results Currently it is possible in Lucene to store data as stored fields or as payloads. Stored fields provide good performance if you want to load all fields for one document, because this is a sequential I/O operation. If you however want to load the data from one field for a large number of documents, then stored fields perform quite badly, because lots of I/O seeks might have to be performed. A better way to do this is using payloads. By creating a special posting list that has one posting with payload for each document you can simulate a column-stride field. The performance is significantly better compared to stored fields, but still not optimal. The reason is that for each document the freq value, which is in this particular case always 1, has to be decoded, and one position value, which is always 0, has to be loaded. As a solution we want to add real column-stride fields to Lucene. A possible format for the new data structure could look like this (CSD stands for column-stride data; once we decide on a final name for this feature we can change this): CSDList -- FixedLengthList | VariableLengthList, SkipList FixedLengthList -- Payload^SegSize VariableLengthList -- DocDelta, PayloadLength?, Payload Payload -- Byte^PayloadLength PayloadLength -- VInt SkipList -- see frq.file We distinguish here between the fixed length and the variable length cases. To allow flexibility, Lucene could automatically pick the right data structure.
This could work like this: When the DocumentsWriter writes a segment it checks whether all values of a field have the same length. If yes, it stores them as a FixedLengthList; if not, as a VariableLengthList. When the SegmentMerger merges two or more segments it checks if all segments have a FixedLengthList with the same length for a column-stride field. If not, it writes a VariableLengthList to the new segment. Once this feature is implemented, we should think about making the column-stride fields updateable, similar to the norms. This will be a very powerful feature that can for example be used for low-latency tagging of documents. Other use cases: - replace norms - allow to store boost values separately from norms - as input for the FieldCache, thus
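The flush-time decision described here can be illustrated with a small standalone sketch (invented names; the real logic would live in DocumentsWriter/SegmentMerger): if every value of the field has the same length, a single length is written up front; otherwise each value carries a VInt length prefix, mirroring the FixedLengthList/VariableLengthList split in the grammar above.

```java
import java.io.ByteArrayOutputStream;
import java.util.List;

// Sketch of choosing FixedLengthList vs VariableLengthList at flush time.
class CsfWriterSketch {
    static byte[] write(List<byte[]> values) {
        boolean fixed = true;
        int len = values.get(0).length;
        for (byte[] v : values) {
            if (v.length != len) { fixed = false; break; }
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(fixed ? 1 : 0);                // format flag (a header in real life)
        if (fixed) {
            writeVInt(out, len);                 // one shared length for all values
            for (byte[] v : values) out.write(v, 0, v.length);
        } else {
            for (byte[] v : values) {
                writeVInt(out, v.length);        // per-value VInt length prefix
                out.write(v, 0, v.length);
            }
        }
        return out.toByteArray();
    }

    // Lucene-style variable-length int: 7 bits per byte, high bit = continuation.
    static void writeVInt(ByteArrayOutputStream out, int i) {
        while ((i & ~0x7F) != 0) {
            out.write((i & 0x7F) | 0x80);
            i >>>= 7;
        }
        out.write(i);
    }
}
```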
[jira] Created: (LUCENE-1585) Allow to control how payloads are merged
Allow to control how payloads are merged Key: LUCENE-1585 URL: https://issues.apache.org/jira/browse/LUCENE-1585 Project: Lucene - Java Issue Type: New Feature Components: Index Reporter: Michael Busch Assignee: Michael Busch Priority: Minor Lucene handles backwards-compatibility of its data structures by converting them from the old into the new formats during segment merging. Payloads are simply byte arrays in which users can store arbitrary data. Applications that use payloads might want to convert the format of their payloads in a similar fashion; otherwise it's not easily possible to ever change the encoding of a payload without reindexing. So I propose to introduce a PayloadMerger class that the SegmentMerger invokes to merge the payloads from multiple segments. Users can then implement their own PayloadMerger to convert payloads from an old into a new format. In the future we will need this kind of flexibility also for column-stride fields (LUCENE-1231) and flexible indexing codecs. In addition, it would be nice if users could store version information in the segments file; e.g. they could record that in segment _2 the term a:b uses payloads of format x.y.
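A PayloadMerger hook along the proposed lines might look like this. This is a hypothetical sketch; no such class exists in Lucene yet, and all names, including the version parameter, are illustrative:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical hook: each payload passes through the merger during segment
// merging, letting an application upgrade old payload encodings on the fly
// instead of reindexing.
interface PayloadMerger {
    // oldFormatVersion would come from per-segment metadata; returns the
    // payload bytes re-encoded in the current format.
    byte[] merge(byte[] payload, int oldFormatVersion);
}

class SegmentMergerSketch {
    static List<byte[]> mergePayloads(List<byte[]> payloads, int version, PayloadMerger m) {
        List<byte[]> out = new ArrayList<>();
        for (byte[] p : payloads) {
            out.add(m.merge(p, version));
        }
        return out;
    }
}
```

An application would plug in a merger that, say, rewrites a version-1 encoding into version 2, and old segments would be upgraded as they are merged away.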