[jira] Commented: (LUCENE-1582) Make TrieRange completely independent from Document/Field with TokenStream of prefix encoded values

2009-04-03 Thread Shalin Shekhar Mangar (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695261#action_12695261
 ] 

Shalin Shekhar Mangar commented on LUCENE-1582:
---

bq. trieCodeLong/Int() returns a TokenStream. During encoding, all char[] 
arrays are reused via the Token API; no additional String[] arrays are created 
for the encoded result. Instead, the TokenStream enumerates the trie values.

+1

 Make TrieRange completely independent from Document/Field with TokenStream of 
 prefix encoded values
 ---

 Key: LUCENE-1582
 URL: https://issues.apache.org/jira/browse/LUCENE-1582
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/*
Affects Versions: 2.9
Reporter: Uwe Schindler
Assignee: Uwe Schindler
 Fix For: 2.9


 TrieRange currently has the following problems:
 - To add a field that uses trie encoding, you must either manually add each 
 term to the index or use a helper method from TrieUtils. The helper method 
 has the problem that it uses a fixed field configuration.
 - TrieUtils currently creates, by default, a helper field containing the 
 lower-precision terms to enable sorting (sorting is limited to one term per 
 document).
 - trieCodeLong/Int() unnecessarily creates String[] and char[] arrays, which 
 is heavy on GC if you index lots of numeric values. A lot of char[]-to-String 
 copying is also involved.
 This issue should improve that:
 - trieCodeLong/Int() returns a TokenStream. During encoding, all char[] 
 arrays are reused via the Token API; no additional String[] arrays are 
 created for the encoded result. Instead, the TokenStream enumerates the trie 
 values.
 - Trie fields can be added to Documents during indexing using the standard 
 API: new Field(name, TokenStream, ...), so no extra util method is needed. By 
 using token filters, one could also add payloads and customize everything.
 The drawback: sorting would no longer work. To enable sorting, a (sub-)issue 
 can extend the FieldCache to stop iterating the terms as soon as a 
 lower-precision one is enumerated by TermEnum. I will create a hack patch for 
 TrieUtils use only, which uses an unchecked exception in the Parser to stop 
 iteration. With LUCENE-831, a more generic API can be used for this (custom 
 parser/iterator implementation for FieldCache). I will attach the field cache 
 patch (with the temporary solution, until FieldCache is reimplemented) as a 
 separate patch file, or maybe open another issue for it.




[jira] Commented: (LUCENE-1516) Integrate IndexReader with IndexWriter

2009-04-03 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695306#action_12695306
 ] 

Michael McCandless commented on LUCENE-1516:


Good catch!  I'll fix.

 Integrate IndexReader with IndexWriter 
 ---

 Key: LUCENE-1516
 URL: https://issues.apache.org/jira/browse/LUCENE-1516
 Project: Lucene - Java
  Issue Type: Improvement
Affects Versions: 2.4
Reporter: Jason Rutherglen
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, 
 LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, 
 LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, 
 LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, 
 LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, 
 LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, 
 LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, 
 LUCENE-1516.patch, magnetic.png, ssd.png, ssd2.png

   Original Estimate: 672h
  Remaining Estimate: 672h

 The current problem is that an IndexReader and an IndexWriter cannot be
 open at the same time and perform updates, as they both require the
 write lock on the index. While methods such as IW.deleteDocuments enable
 deleting from IW, methods such as IR.deleteDocument(int doc) and
 norms updating are not available from IW. This limits the ability to
 update the index dynamically or in realtime without closing the IW and
 opening an IR, deleting or updating norms, flushing, then opening the
 IW again, a process which can be detrimental to realtime updates.
 This patch will expose an IndexWriter.getReader method that returns
 the currently flushed state of the index as a class that implements
 IndexReader. The new IR implementation will differ from existing IR
 implementations such as MultiSegmentReader in that flushing will
 synchronize updates with IW in part by sharing the write lock. All
 methods of IR will be usable including reopen and clone.
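
 For illustration, a minimal sketch of how the proposed getReader method might 
 be used (getReader is the method proposed here; the rest is existing 2.4 API, 
 and dir, analyzer, doc and the "id" term are assumed for the example):
{code}
IndexWriter writer = new IndexWriter(dir, analyzer, IndexWriter.MaxFieldLength.LIMITED);
writer.addDocument(doc);

IndexReader reader = writer.getReader();  // sees the writer's current state; writer stays open
IndexSearcher searcher = new IndexSearcher(reader);

writer.deleteDocuments(new Term("id", "42"));
IndexReader newReader = reader.reopen();  // reopen/clone keep working per the proposal
{code}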




Re: IndexWriter.addIndexesNoOptimize(IndexReader[] readers)

2009-04-03 Thread Michael McCandless
Makes sense.  Wanna make a patch?  We'd then deprecate
addIndexes(IndexReader[]).

Mike

On Thu, Apr 2, 2009 at 9:16 PM, Jason Rutherglen
jason.rutherg...@gmail.com wrote:
 This seems like something that's tenable? It would be useful for merging
 RAM indexes to disk, where, if a directory is passed, the directory may be
 changed.





[jira] Commented: (LUCENE-1584) Callback for intercepting merging segments in IndexWriter

2009-04-03 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695318#action_12695318
 ] 

Michael McCandless commented on LUCENE-1584:


I think this can be achieved today by making your own MergeScheduler wrapper, 
or by subclassing ConcurrentMergeScheduler and e.g. overriding the doMerge 
method?  If so, I'd prefer not to add a callback to IW.
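
For illustration, a minimal sketch of that approach (recordMerge is a 
hypothetical hook; the doMerge signature is per the 2.x 
ConcurrentMergeScheduler API as I recall):
{code}
import java.io.IOException;
import org.apache.lucene.index.ConcurrentMergeScheduler;
import org.apache.lucene.index.MergePolicy;

public class RecordingMergeScheduler extends ConcurrentMergeScheduler {
  protected void doMerge(MergePolicy.OneMerge merge) throws IOException {
    recordMerge(merge);   // hypothetical hook: note which segments this merge combines
    super.doMerge(merge); // then let CMS perform the actual merge
  }

  private void recordMerge(MergePolicy.OneMerge merge) {
    // e.g. remember the merge so external field caches / bitsets can be
    // remapped once the merged segment becomes visible
  }
}
{code}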

 Callback for intercepting merging segments in IndexWriter
 -

 Key: LUCENE-1584
 URL: https://issues.apache.org/jira/browse/LUCENE-1584
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.4.1
Reporter: Jason Rutherglen
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1584.patch

   Original Estimate: 96h
  Remaining Estimate: 96h

 For things like merging field caches or bitsets, it's useful to
 know which segments were merged to create a new segment.




Re: Lucene filter

2009-04-03 Thread Michael McCandless
Could you re-ask this on java-user, instead?  Thanks.

Mike

On Thu, Apr 2, 2009 at 6:24 PM, addman addiek...@yahoo.com wrote:

 How do you create a Lucene Filter to check if a field has a value?  It is
 part of a ChainedFilter that I am creating.



Re: Future projects

2009-04-03 Thread Michael McCandless
On Thu, Apr 2, 2009 at 5:56 PM, Jason Rutherglen
jason.rutherg...@gmail.com wrote:
 I think I need to understand better why delete by Query isn't
 viable in your situation...

 The delete by query is a separate problem which I haven't fully
 explored yet.

Oh, I had thought we were tugging on this thread in order to explore
delete-by-docID in the writer.  OK.

 Tracking the segment genealogy is really an
 interim step for merging field caches before column stride
 fields gets implemented.

I see -- meaning in Bobo you'd like to manage your own memory resident
field caches, and merge them whenever IW has merged a segment?  Seems
like you don't need genealogy for that.

 Actually CSF cannot be used with Bobo's
 field caches anyways which means we'd need a way to find out
 about the segment parents.

CSF isn't really designed yet.  How come it can't be used with Bobo's
field caches?  We can try to accommodate Bobo's field cache needs when
designing CSF.

 Does it operate at the segment level? Seems like that'd give
 you good enough realtime performance (though merging in RAM will
 definitely be faster).

 We need to see how Bobo integrates with LUCENE-1483.

Lucene's internal field cache usage is now entirely at the segment
level (ie, Lucene core should never request full field cache array at
the MultiSegmentReader level).  I think Bobo must have to do the same,
if it handles near realtime updates, to get adequate performance.

Though... since we have LUCENE-831 (rework API Lucene exposes for
accessing arrays-of-atomic-types-per-segment) and LUCENE-1231 (CSF = a
more efficient impl (than uninversion) of the API we expose in
LUCENE-831) on deck, we should try to understand Bobo's needs.

EG how come Bobo made its own field cache impl?  Just because
uninversion is too slow?

 It seems like we've been talking about CSF for 2 years and there
 isn't a patch for it? If I had more time I'd take a look. What
 is the status of it?

I think Michael is looking into it?  I'd really like to get it into
2.9.  We should do it in conjunction with 831 since they are so tied.

 I'll write a patch that implements a callback for the segment
 merging such that the user can decide what information they want
 to record about the merged SRs (I'm pretty sure there isn't a
 way to do this with MergePolicy?)

Actually I think you can do this w/ a simple MergeScheduler wrapper or
by subclassing CMS.  I'll put a comment on the issue.

Mike




Re: Future projects

2009-04-03 Thread Michael McCandless
On Thu, Apr 2, 2009 at 6:55 PM, John Wang john.w...@gmail.com wrote:
 Just to clarify, Approach 1 and Approach 2 are both currently performing OK
 for us.

OK that's very good to know.

Mike




[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)

2009-04-03 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695338#action_12695338
 ] 

Shai Erera commented on LUCENE-1575:


I've been thinking about TimeLimitedCollector and the revert to extend 
HitCollector I had to do in the last patch - the main reason was that I 
couldn't find a better name and did not want to deprecate it. But then I 
thought that perhaps the current name is not so good, and we can change it? 
Semantically, it is not a 'limited' collector but more of a 'limiting' 
collector (I think; not being a native English speaker, I may be wrong).
Alternative names I've been thinking of are TimeKeeperCollector, 
TimeLimitingCollector and TimingOutCollector.
The advantage is that we deprecate the current one and have clear back-compat 
support, instead of changing it in 3.0 to extend Collector. If you agree with 
any of these names, I can create a new class, deprecate the current one, and 
change the tests back to use the new version (removing all those comments 
about the changes in 3.0). What do you think?

 Refactoring Lucene collectors (HitCollector and extensions)
 ---

 Key: LUCENE-1575
 URL: https://issues.apache.org/jira/browse/LUCENE-1575
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Shai Erera
Assignee: Michael McCandless
 Fix For: 2.9

 Attachments: LUCENE-1575.1.patch, LUCENE-1575.2.patch, 
 LUCENE-1575.3.patch, LUCENE-1575.4.patch, LUCENE-1575.5.patch, 
 LUCENE-1575.6.patch, LUCENE-1575.patch


 This issue is a result of a recent discussion we've had on the mailing list. 
 You can read the thread 
 [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html].
 We have agreed to do the following refactoring:
 * Rename MultiReaderHitCollector to Collector, with the purpose that it will 
 be the base class for all Collector implementations.
 * Deprecate HitCollector in favor of the new Collector.
 * Introduce new methods in IndexSearcher that accept Collector, and deprecate 
 those that accept HitCollector.
 ** Create a final class HitCollectorWrapper, and use it in the deprecated 
 methods in IndexSearcher, wrapping the given HitCollector.
 ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, 
 when we remove HitCollector.
 ** This will remove any instanceof checks that currently exist in 
 IndexSearcher code.
 * Create a new (abstract) TopDocsCollector, which will:
 ** Leave collect and setNextReader unimplemented.
 ** Introduce protected members PriorityQueue and totalHits.
 ** Introduce a single protected constructor which accepts a PriorityQueue.
 ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. 
 These can be used as-is by extending classes, as well as overridden.
 ** Introduce a new topDocs(start, howMany) method which will be used as a 
 convenience method when implementing a search application which allows paging 
 through search results. It will also attempt to improve memory allocation by 
 allocating a ScoreDoc[] of only the requested size.
 * Change TopScoreDocCollector to extend TopDocsCollector, using the topDocs() 
 and getTotalHits() implementations as they are from TopDocsCollector. The 
 class will also be made final.
 * Change TopFieldCollector to extend TopDocsCollector, and make the class 
 final. Also implement topDocs(start, howMany).
 * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector 
 instead of TopScoreDocCollector. Implement topDocs(start, howMany).
 * Review other places where HitCollector is used, such as in Scorer; 
 deprecate those places and use Collector instead.
 Additionally, the following proposal was made w.r.t. decoupling score from 
 collect():
 * Change collect to accept only a doc Id (unbased).
 * Introduce a setScorer(Scorer) method.
 * If during collect the implementation needs the score, it can call 
 scorer.score().
 If we do this, then we need to review all places in the code where 
 collect(doc, score) is called, and assess whether a Scorer can be passed. 
 This also raises a few questions:
 * What if during collect() the Scorer is null (i.e., not set)? Is that even 
 possible?
 * I noticed that many (if not all) of the collect() implementations discard 
 the document if its score is not greater than 0. Doesn't that mean the score 
 is always needed in collect()?
 Open issues:
 * The name for Collector.
 * TopDocsCollector was mentioned on the thread as TopResultsCollector, but 
 that was when we thought to call Collector ResultsCollector. Since we decided 
 (so far) on Collector, I think TopDocsCollector makes sense, because of its 
 TopDocs output.
 * Decoupling score from collect().
 I will post a patch a bit later, as this is expected to be a very large 
 patch. I will split it into two: (1) code patch, (2) test cases (moving to 
 use Collector instead of HitCollector, as well as testing the new 
 topDocs(start, howMany) method). There might even be a 3rd patch which 
 handles the setScorer thing in Collector (maybe even a different issue?).

[jira] Updated: (LUCENE-1582) Make TrieRange completely independent from Document/Field with TokenStream of prefix encoded values

2009-04-03 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-1582:
--

Attachment: LUCENE-1582.patch




[jira] Commented: (LUCENE-1582) Make TrieRange completely independent from Document/Field with TokenStream of prefix encoded values

2009-04-03 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695341#action_12695341
 ] 

Uwe Schindler commented on LUCENE-1582:
---

A first version of the patch:
- JavaDocs not finished (examples, documentation) yet
- New classes: IntTrieTokenStream, LongTrieTokenStream
- Removed TrieUtils.trieCodeInt/Long()
- Removed TrieUtils.addIndexFields()
- Removed all fields[] arrays, now only one field name is supported everywhere

To index a trie-encoded field, just use (preferred way):
{code}
Field f = new Field(name, new LongTrieTokenStream(value, precisionStep));
f.setOmitNorms(true);
f.setOmitTermFreqAndPositions(true);
{code}
(maybe TrieUtils supplies a shortcut helper method that uses these special 
optimal settings when creating the field, e.g. TrieUtils.newLongTrieField()). 
This is extensible with TokenFilters, if somebody wants to add payloads and so 
on.
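
Such a shortcut might look like this (newLongTrieField is hypothetical, per 
the suggestion above):
{code}
// hypothetical shortcut helper with the optimal settings baked in:
public static Field newLongTrieField(String name, long value, int precisionStep) {
  Field f = new Field(name, new LongTrieTokenStream(value, precisionStep));
  f.setOmitNorms(true);                 // norms are useless for trie terms
  f.setOmitTermFreqAndPositions(true);  // freqs/positions are not needed either
  return f;
}
{code}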

This patch also contains the sorting fixes in the core: 
FieldCache.StopFillCacheException can be thrown from within the parser. Maybe 
this should be provided as a separate sub-issue (or top-level issue), because I 
cannot apply patches to core. Mike, can you do this when we commit?

Yonik: It would be nice to hear some comments from you, too.

I really like the new way to create trie-encoded fields. When this moves to 
core, the tokenizers can be renamed to IntTokenStream; TrieUtils then only 
contains the converters to/from doubles, the encoding, and the range split.

About the GC note in the description of this issue: The new API does not use 
as many array allocations and copies, and it reuses the Token. But as a 
TokenStream instance must be generated for every numeric value, the GC cost is 
about the same for the new and old API, especially because each TokenStream 
internally creates a LinkedHashMap for the attributes.

Just a question for the indexer people: Is it possible to add two fields with 
the same field name to a document, both with a TokenStream? This is needed to 
add more than one trie encoded value (which worked with the old API). I just 
want to be sure.




[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)

2009-04-03 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695358#action_12695358
 ] 

Michael McCandless commented on LUCENE-1575:


bq. If you agree with any of these names, I can create a new class, deprecate 
the current one, and change the tests back to use the new version (removing 
all those comments about the changes in 3.0). What do you think?

I like this approach.  I like TimeLimitingCollector, or maybe 
TimeoutCollector?


[jira] Updated: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)

2009-04-03 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1575:
---

Attachment: LUCENE-1575.patch


OK, I attached a new patch with some minor changes:

  * Beefed up javadocs in Collector.java; fixed other javadocs
warnings.  Tweaked CHANGES.txt.

  * Renamed PositiveOnlyScoresCollector -> PositiveScoresOnlyCollector

And also came across these questions/issues:

  * TopFieldCollector's updateBottom & add methods take score, and are
passed a score from the non-scoring collectors, but shouldn't be?

  * TermScorer need not override score(HitCollector hc) (super does
the same thing).

  * The changes to TermScorer make me a bit nervous.  EG, the new
InternalScorer: will it hurt performance?  Also this part:
{code}
+// Set the Scorer doc and score before calling collect in case it will be
+// used in collect()
+s.d = doc;
+s.score = score;
+c.collect(doc);  // collect score
{code}
is spooky: I don't like how we worry that one may call scorer.doc() (I
don't like the ambiguity in the API -- we both pass doc and fear you
may call scorer.doc()).  Not sure how to resolve it.

  * Hmm -- we added a new abstract method to
src/java/org/apache/lucene/search/Searcher.java (that accepts
Collector).  Should that method be concrete (and throw UOE), for
back compat?

  * We've also added a method to the Searchable interface, which is
a break in back-compat.  But my feeling is we should allow this
break (but Shai can you add another Note at the top of
CHANGES.txt, calling this out?).



[jira] Commented: (LUCENE-1582) Make TrieRange completely independent from Document/Field with TokenStream of prefix encoded values

2009-04-03 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695362#action_12695362
 ] 

Michael McCandless commented on LUCENE-1582:


bq. Maybe this should be provided as a separate sub-issue (or top-level 
issue), because I cannot apply patches to core. Mike, can you do this when we 
commit?

It's fine to include these changes in this patch -- I can commit them all at 
once.

bq. But as a TokenStream instance must be generated for every numeric value, 
the GC cost is about the same for the new and old API, especially because each 
TokenStream internally creates a LinkedHashMap for the attributes.

Hmm, we should do some perf tests to see how big a deal this turns out to be.  
It'd be nice to get some sort of reuse API working if performance is really 
hurt.  (E.g. Analyzers can provide reusableTokenStream, keyed by thread.)  
You'd presumably have to key on thread & field name.  If you do this then 
probably a shortcut helper method should be the preferred way.

bq. Just a question for the indexer people: Is it possible to add two fields 
with the same field name to a document, both with a TokenStream? 

Each with a different TokenStream instance, right?  Yes, this should be fine; 
the tokens are logically concatenated just like multi-valued String fields.
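
For example (LongTrieTokenStream and precisionStep are from the attached 
patch; the field name is arbitrary):
{code}
Document doc = new Document();
// two values for the same trie field, each with its own TokenStream instance:
doc.add(new Field("price", new LongTrieTokenStream(42L, precisionStep)));
doc.add(new Field("price", new LongTrieTokenStream(7L, precisionStep)));
// at search time the tokens behave as if concatenated, like a multi-valued String field
{code}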




[jira] Commented: (LUCENE-1582) Make TrieRange completely independent from Document/Field with TokenStream of prefix encoded values

2009-04-03 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695364#action_12695364
 ] 

Uwe Schindler commented on LUCENE-1582:
---

bq. Hmm, we should do some perf tests to see how big a deal this turns out to 
be. It'd be nice to get some sort of reuse API working if performance is really 
hurt. (E.g. Analyzers can provide reusableTokenStream, keyed by thread.) You'd 
presumably have to key on thread & field name. If you do this then probably a 
shortcut helper method should be the preferred way.

We can also leave this to the implementor: if somebody indexes thousands of 
documents, he could reuse one instance of the TokenStream for each document. As 
the instance is only read on document addition, he must provide a separate 
instance for each field name, but can reuse it for the next document. This is 
the same as reusing Field instances during indexing.

I can add a setValue() method to the TokenStream that resets it with the new 
value. So one could use one instance and always call setValue() to supply a new 
value for each document. The precisionStep should not be modifiable.
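
A sketch of that reuse pattern (setValue() is the proposed method, not yet in 
the patch; values and writer are assumed to exist):
{code}
LongTrieTokenStream stream = new LongTrieTokenStream(0L, precisionStep);
Field field = new Field("timestamp", stream); // one instance per field name

for (int i = 0; i < values.length; i++) {
  stream.setValue(values[i]);    // reset the stream with the next value
  Document doc = new Document();
  doc.add(field);                // Field and TokenStream are reused
  writer.addDocument(doc);       // the stream is consumed here
}
{code}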

{quote}
bq. Just a question for the indexer people: Is it possible to add two fields 
with the same field name to a document, both with a TokenStream? 

Each with a different TokenStream instance, right? Yes, this should be fine; 
the tokens are logically concatenated just like multi-valued String fields.
{quote}

Yes, sure :-)




[jira] Commented: (LUCENE-1341) BoostingNearQuery class (prototype)

2009-04-03 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695370#action_12695370
 ] 

Grant Ingersoll commented on LUCENE-1341:
-

Hi Peter,

This looks good; I think it just needs some unit tests and then it will be 
ready.

 BoostingNearQuery class (prototype)
 ---

 Key: LUCENE-1341
 URL: https://issues.apache.org/jira/browse/LUCENE-1341
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Query/Scoring
Affects Versions: 2.3.1
Reporter: Peter Keegan
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 3.0

 Attachments: bnq.patch, bnq.patch, BoostingNearQuery.java, 
 BoostingNearQuery.java, LUCENE-1341-new.patch, LUCENE-1341.patch


 This patch implements term boosting for SpanNearQuery. Refer to: 
 http://www.gossamer-threads.com/lists/lucene/java-user/62779
 This patch works but probably needs more work. I don't like the use of 
 'instanceof', but I didn't want to touch Spans or TermSpans. Also, the 
 payload code is mostly a copy of what's in BoostingTermQuery and could be 
 common-sourced somewhere. Feel free to throw darts at it :)




[jira] Commented: (LUCENE-1582) Make TrieRange completely independent from Document/Field with TokenStream of prefix encoded values

2009-04-03 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695400#action_12695400
 ] 

Michael McCandless commented on LUCENE-1582:


bq. I can add a setValue() method to the TokenStream that resets it with the 
new value.

That's a good step forward, but it'd likely mean the default is slower 
performance?  In general I prefer (when realistic) for the default 
out-of-the-box experience to be good performance, but in this case there 
doesn't seem to be an easy way to have a natural high-performance default.  
And e.g. we don't reuse Document & Field by default, so expecting someone to 
do a bit of work to reuse Trie's TokenStreams seems OK.

It's almost like Analyzer.reusableTokenStream(...) should know it's 
dealing with a numeric field, and handle the reuse for you, in a future world 
when Lucene knows that a Field is a NumericField, meant to be indexed using 
trie.  But we can leave all of that for future optimization; for now, providing 
setValue is great.




[jira] Commented: (LUCENE-1539) Improve Benchmark

2009-04-03 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695462#action_12695462
 ] 

Michael McCandless commented on LUCENE-1539:


This patch looks good -- some questions:

  * Is CreateWikiIndex intended to be committed?  I thought not?  I.e. I
thought the goal w/ this issue was to add the necessary tasks so that
CreateWikiIndex could be done as an alg.

  * I think we shouldn't bump to Java 1.5 -- it's only CreateWikiIndex
that needs it anyway (in only 2 places).

  * PrintReaderTask never closes the reader.

  * Not sure why you needed to relax private -> protected in AddDocTask?


 Improve Benchmark
 -

 Key: LUCENE-1539
 URL: https://issues.apache.org/jira/browse/LUCENE-1539
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/benchmark
Affects Versions: 2.4
Reporter: Jason Rutherglen
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1539.patch, LUCENE-1539.patch, sortBench2.py, 
 sortCollate2.py

   Original Estimate: 336h
  Remaining Estimate: 336h

 Benchmark can be improved by incorporating recent suggestions posted
 on java-dev. M. McCandless' Python scripts that execute multiple
 rounds of tests can either be incorporated into the codebase or
 converted to Java.




[jira] Assigned: (LUCENE-1539) Improve Benchmark

2009-04-03 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless reassigned LUCENE-1539:
--

Assignee: Michael McCandless




[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)

2009-04-03 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695478#action_12695478
 ] 

Shai Erera commented on LUCENE-1575:


bq. I like TimeLimitingCollector, or maybe TimeoutCollector?

I like TimeLimitingCollector better, as I think the name makes the class more 
self-explanatory.

bq. TopFieldCollector's updateBottom & add methods take score, and are passed 
a score from the non-scoring collectors, but shouldn't be?

At the end of the day, even the non-scoring collectors store a score in 
ScoreDoc, which is Float.NaN. So they should pass a score. Unlike the scoring 
ones, they always pass Float.NaN without ever calling scorer.score(). That's 
the cleanest way I've found to make the changes to that class w/o duplicating 
implementation all over the place. Notice that the scoring versions extend the 
non-scoring ones and just add score computation, which resulted in a very 
clean implementation.

bq. TermScorer need not override score(HitCollector hc) (super does the same 
thing).

Agreed.

bq. The changes to TermScorer make me a bit nervous.

Since we pass Scorer to Collector, I thought we cannot really rely on anyone 
never calling scorer.doc() or getSimilarity - they are in the API. Since doc() 
is abstract, I had to implement it, and just thought that returning the 
current doc is better than, for example, -1. There are some alternatives I see 
to resolve it:
# Create an abstract ScoringOnlyScorer which extends Scorer and implements all 
methods to throw UOE (also as final), besides score(), which it will define as 
abstract. We then define a ScoringOnlyScorerWrapper which takes a Scorer and 
delegates the score() calls. We use SOSW in places where we can't extend SOS. 
Where we can, we just extend it directly and implement score(), like in the 
InternalScorer case.
# Create a new class which implements just score() (I've yet to come up with a 
good name, since Scorer is already taken) and create a wrapper which takes a 
Scorer and delegates the score() calls to it. Then Collector will use that new 
class, and we're sure that only score() can be called (sketched below).
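
A rough sketch of option 2 (both class names are hypothetical):
{code}
import java.io.IOException;
import org.apache.lucene.search.Scorer;

// exposes only score(); name is hypothetical
abstract class DocScorer {
  public abstract float score() throws IOException;
}

// delegates score() to a real Scorer, hiding doc(), getSimilarity(), etc.
final class ScorerWrapper extends DocScorer {
  private final Scorer scorer;
  ScorerWrapper(Scorer scorer) { this.scorer = scorer; }
  public float score() throws IOException { return scorer.score(); }
}
{code}
Collector.setScorer(DocScorer) would then guarantee that collect() can only 
ever ask for the score.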

The last two comments are completely an oversight on my side. I'm not so sure 
about your proposal though. If we add to Searcher a concrete impl which throws 
UOE, how would that work in 3.0? How would anyone who extends Searcher know 
that it has to override this method? Maybe do it now, and document that in 3.0 
it will become abstract again?
About Searchable, I wonder how many implement Searchable rather than extend 
IndexSearcher. Perhaps, instead of making any changes in back-compat and 
adding documentation to CHANGES, I'll just comment out this method with a TODO 
to reinstate it in 3.0?


[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)

2009-04-03 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695513#action_12695513
 ] 

Michael McCandless commented on LUCENE-1575:


bq. I like TimeLimitingCollector better, as I think the name makes the class 
more self-explanatory.

OK let's go with that!

{quote}
At the end of the day, even the non-scoring collectors store a score in 
ScoreDoc, which is Float.NaN. So they should pass a score. Unlike the scoring 
ones, they always pass Float.NaN without ever calling scorer.score(). That's 
the cleanest way I've found to make the changes to that class w/o duplicating 
implementation all over the place. Notice that the scoring versions extend the 
non-scoring ones and just add score computation, which resulted in a very 
clean implementation.
{quote}

OK... let's stick with this approach for now.  Since the impl is
locked down (ctor for TopFieldCollector is private) we can freely
switch up this API in the future without breaking back compat, if we
want to optimize not passing/copying around the unused score.

Can't the scoring collector impls in TopFieldCollector be final?

bq. Since we pass Scorer to Collector, I thought we cannot really rely on 
anyone never calling scorer.doc() or getSimilarity

Maybe instead make InternalScorer non-static, and then doc() can
return the doc from the TermScorer instance, instead of having to copy
s.d = doc each time?  score can do a similar thing.

Actually, hang on: if I'm using a Collector that doesn't need the
score, TermScorer is still computing it?  We don't want that, right?
Can we simply pass this to setScorer(...)?

bq. If we add to Searcher a concrete impl which throws UOE, how would that 
work in 3.0? How would anyone who extends Searcher know that it has to 
override this method? Maybe do it now, and document that in 3.0 it will become 
abstract again?

OK let's do that?

bq. About Searchable, I wonder how many implement Searchable rather than 
extend IndexSearcher. Perhaps, instead of making any changes in back-compat 
and adding documentation to CHANGES, I'll just comment out this method with a 
TODO to reinstate it in 3.0?

OK.

Make sure that at the end of all of this you open a new issue, marked as
fix version 3.0, that gathers all the 'and then in 3.0 we do XYZ' items
from this one.

 Refactoring Lucene collectors (HitCollector and extensions)
 ---

 Key: LUCENE-1575
 URL: https://issues.apache.org/jira/browse/LUCENE-1575
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Shai Erera
Assignee: Michael McCandless
 Fix For: 2.9

 Attachments: LUCENE-1575.1.patch, LUCENE-1575.2.patch, 
 LUCENE-1575.3.patch, LUCENE-1575.4.patch, LUCENE-1575.5.patch, 
 LUCENE-1575.6.patch, LUCENE-1575.patch, LUCENE-1575.patch


 This issue is a result of a recent discussion we've had on the mailing list. 
 You can read the thread 
 [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html].
 We have agreed to do the following refactoring:
 * Rename MultiReaderHitCollector to Collector, with the purpose that it will 
 be the base class for all Collector implementations.
 * Deprecate HitCollector in favor of the new Collector.
 * Introduce new methods in IndexSearcher that accept Collector, and deprecate 
 those that accept HitCollector.
 ** Create a final class HitCollectorWrapper, and use it in the deprecated 
 methods in IndexSearcher, wrapping the given HitCollector.
 ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, 
 when we remove HitCollector.
 ** It will remove any instanceof checks that currently exist in IndexSearcher 
 code.
 * Create a new (abstract) TopDocsCollector, which will:
 ** Leave collect and setNextReader unimplemented.
 ** Introduce protected members PriorityQueue and totalHits.
 ** Introduce a single protected constructor which accepts a PriorityQueue.
 ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. 
 These can be used as-is by extending classes, as well as overridden.
 ** Introduce a new topDocs(start, howMany) method which will be used as a 
 convenience method when implementing a search application which allows paging 
 through search results. It will also attempt to improve the memory 
 allocation, by allocating a ScoreDoc[] of the requested size only.
 * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() 
 and getTotalHits() implementations as they are from TopDocsCollector. The 
 class will also be made final.
 * Change TopFieldCollector to extend TopDocsCollector, and make the class 
 final. Also implement topDocs(start, howMany).
 * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, 
 instead of TopScoreDocCollector. Implement topDocs(start, howMany)
 * Review other places where HitCollector is used, such as in Scorer, 
 deprecate those places and use Collector instead.
 Additionally, the following proposal was made w.r.t. decoupling score from 
 collect():
 * Change collect to accept only a doc Id (unbased).
 * Introduce a setScorer(Scorer) method.
 * If during collect the implementation needs the score, it can call 
 scorer.score().
 If we do this, then we need to review all places in the code where 
 collect(doc, score) is called, and assert whether Scorer can be passed. Also 
 this raises a few questions:
 * What if during collect() Scorer is null? (i.e., not set) - is it even 
 possible?
 * I noticed that many (if not all) of the collect() implementations discard 
 the document if its score is not greater than 0. Doesn't it mean that score 
 is needed in collect() always?
 Open issues:
 * The name for Collector
 * TopDocsCollector was mentioned on the thread as TopResultsCollector, but 
 that was when we thought to call Collector ResultsCollector. Since we decided 
 (so far) on Collector, I think TopDocsCollector makes sense, because of its 
 TopDocs output.
 * Decoupling score from collect().
 I will post a patch a bit later, as this is expected to be a very large 
 patch. I will split it into 2: (1) code patch (2) test cases (moving to use 
 Collector instead of HitCollector, as well as testing the new topDocs(start, 
 howMany) method.
 There might even be a 3rd patch which handles the setScorer thing in 
 Collector (maybe even a different issue?)

[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)

2009-04-03 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695523#action_12695523
 ] 

Shai Erera commented on LUCENE-1575:


bq. Can't the scoring collector impls in TopFieldCollector be final?

They can, but they are private so they cannot be extended anyway. I can do 
that, but does it really matter?

bq. We don't want that right? Can we simply pass this to setScorer(...)?

That's what I wanted to do, but then noticed that TermScorer's score() method is 
a bit different. However, now that I look at it again, I wonder whether they 
really are different. The difference is that in score(), it does at the end
{code}
return raw * Similarity.decodeNorm(norms[doc]);
{code}
and in score(Collector, int) it does
{code}
float[] normDecoder = Similarity.getNormDecoder();
...
score *= normDecoder[norms[doc] & 0xFF];
{code}

Looking at Similarity.decodeNorm, it does exactly what's done in 
score(Collector, int). So I guess this code has been duplicated for no good 
reason? Please validate what I wrote, and if you agree, I can change the 
entire method (score(Collector, int)) to not compute any score and call 
c.setScorer(this). That will solve it.
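A sketch of what that change to TermScorer.score(Collector, int) might look like (the loop internals, doc field and next() are simplified from memory of the 2.4-era method; treat the body as an assumption):

{code}
// Proposed shape: hand the collector this Scorer once, never compute the
// score eagerly; collectors that need it call scorer.score() themselves.
protected boolean score(Collector c, int end) throws IOException {
  c.setScorer(this);
  while (doc < end) {
    c.collect(doc);      // no per-hit score computation here anymore
    if (!next()) {
      return false;      // exhausted
    }
  }
  return true;           // more docs remain beyond 'end'
}
{code}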

So are you ok with passing Scorer to Collector, instead of just a class with a 
single score() method?

I will open an issue w/ a fix version 3.0 and take care of all those TODOs. 
Should the issue also get rid of the deprecated methods? Or will we have a 
general issue in 3.0 that removes all deprecated methods?


[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)

2009-04-03 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695525#action_12695525
 ] 

Shai Erera commented on LUCENE-1575:


BTW Mike - I think the accidental changes to Searchable and Searcher could have 
been easily detected by test-tags if we had classes in the back-compat tag 
which implemented interfaces / extended abstract classes with empty 
implementations. These are not really junit tests, but if someone changed an 
interface or abstract class, attempting to compile the test package against 
the trunk would fail.

It is not so relevant now, since the next release is 2.9 followed by 3.0, and 
back-compat will completely go away in 3.0, but perhaps post-3.0? Also, it 
would prevent us from making changes to back-compat like we wanted to in this 
issue, but perhaps that's good?



[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)

2009-04-03 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695537#action_12695537
 ] 

Michael McCandless commented on LUCENE-1575:



I ran a first "do no harm" perf test, comparing trunk with this patch:

||query||sort||hits||qps||qpsnew||pctg||
|147|score|   6953|3631.1|3641.8|  0.3%|
|147|title|   6953|2916.7|2255.6|-22.7%|
|147|doc|   6953|3251.2|2676.8|-17.7%|
|text|score| 157101| 208.1| 202.1| -2.9%|
|text|title| 157101|  96.7|  84.8|-12.3%|
|text|doc| 157101| 174.0| 115.2|-33.8%|
|1|score| 565452|  58.0|  56.4| -2.8%|
|1|title| 565452|  44.5|  34.1|-23.4%|
|1|doc| 565452|  49.2|  32.8|-33.3%|
|1 OR 2|score| 784928|  14.1|  13.7| -2.8%|
|1 OR 2|title| 784928|  12.5|  11.5| -8.0%|
|1 OR 2|doc| 784928|  13.0|  11.9| -8.5%|
|1 AND 2|score| 333153|  15.5|  15.5|  0.0%|
|1 AND 2|title| 333153|  14.8|  13.7| -7.4%|
|1 AND 2|doc| 333153|  15.2|  14.2| -6.6%|

Looks like:
 
  * Sort by relevance got maybe a tad slower (~3%)

  * Sort by field is now quite a bit slower (23-33% on term query '1')

This was on a full wikipedia index, with 14 segments, Sun java
1.6.0_07 on OS X Mac Pro quad core, on Intel X25M 160 GB
SSD.

I think we need to iterate some to try to get some performance back.



Re: Future projects

2009-04-03 Thread John Wang
By default bobo DOES use a flavor of the field cache data structure with
some additional information for performance (e.g. minDocid, maxDocid, freq per
term).
Bobo is architected as a platform where clients can write their own
FacetHandlers, in which each FacetHandler manages its own view of the memory
structure, and thus can be more complicated than the field cache.
At LinkedIn, we write FacetHandlers for geo lat/lon filtering and social
network faceting.

-John

On Fri, Apr 3, 2009 at 3:35 AM, Michael McCandless 
luc...@mikemccandless.com wrote:

 On Thu, Apr 2, 2009 at 5:56 PM, Jason Rutherglen
 jason.rutherg...@gmail.com wrote:
  I think I need to understand better why delete by Query isn't
  viable in your situation...
 
  The delete by query is a separate problem which I haven't fully
  explored yet.

 Oh, I had thought we were tugging on this thread in order to explore
 delete-by-docID in the writer.  OK.

  Tracking the segment genealogy is really an
  interim step for merging field caches before column stride
  fields gets implemented.

 I see -- meaning in Bobo you'd like to manage your own memory resident
 field caches, and merge them whenever IW has merged a segment?  Seems
 like you don't need genealogy for that.

  Actually CSF cannot be used with Bobo's
  field caches anyways which means we'd need a way to find out
  about the segment parents.

 CSF isn't really designed yet.  How come it can't be used with Bobo's
 field caches?  We can try to accommodate Bobo's field cache needs when
 designing CSF.

  Does it operate at the segment level? Seems like that'd give
  you good enough realtime performance (though merging in RAM will
  definitely be faster).
 
  We need to see how Bobo integrates with LUCENE-1483.

 Lucene's internal field cache usage is now entirely at the segment
 level (ie, Lucene core should never request a full field cache array at
 the MultiSegmentReader level).  I think Bobo must do the same,
 if it handles near realtime updates, to get adequate performance.

 Though... since we have LUCENE-831 (rework API Lucene exposes for
 accessing arrays-of-atomic-types-per-segment) and LUCENE-1231 (CSF = a
 more efficient impl (than uninversion) of the API we expose in
 LUCENE-831) on deck, we should try to understand Bobo's needs.

 EG how come Bobo made its own field cache impl?  Just because
 uninversion is too slow?

  It seems like we've been talking about CSF for 2 years and there
  isn't a patch for it? If I had more time I'd take a look. What
  is the status of it?

 I think Michael is looking into it?  I'd really like to get it into
 2.9.  We should do it in conjunction with 831 since they are so tied.

  I'll write a patch that implements a callback for the segment
  merging such that the user can decide what information they want
  to record about the merged SRs (I'm pretty sure there isn't a
  way to do this with MergePolicy?)

 Actually I think you can do this w/ a simple MergeScheduler wrapper or
 by subclassing CMS.  I'll put a comment on the issue.

 Mike
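For readers wanting to try the CMS-subclass approach mentioned above, a hedged sketch follows; the doMerge() hook and OneMerge members are from memory of the 2.9-era API, so verify them against your Lucene version before relying on this:

{code}
import java.io.IOException;
import org.apache.lucene.index.ConcurrentMergeScheduler;
import org.apache.lucene.index.MergePolicy;

// Hypothetical sketch: intercept each merge to learn which segments were
// coalesced, so application-level caches can be merged too.
public class RecordingMergeScheduler extends ConcurrentMergeScheduler {
  protected void doMerge(MergePolicy.OneMerge merge) throws IOException {
    recordSources(merge);   // app callback: inspect the source segments first
    super.doMerge(merge);   // run the actual merge
    recordResult(merge);    // merged segment now exists; remap caches here
  }

  private void recordSources(MergePolicy.OneMerge merge) { /* app-specific */ }
  private void recordResult(MergePolicy.OneMerge merge) { /* app-specific */ }
}
{code}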





[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)

2009-04-03 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695541#action_12695541
 ] 

Michael McCandless commented on LUCENE-1575:


{quote}
 Can't the scoring collector impls in TopFieldCollector be final?

They can, but they are private so they cannot be extended anyway. I can do 
that, but does it really matter?
{quote}

I was thinking it might eke out some performance.

bq. So I guess this code has been duplicated for no good reason?

Duplicated for performance I think.

bq. I can change the entire method (score(Collector, int)) to not compute any 
score and call c.setScorer(this). That will solve it.

I think we should try this?

bq. So are you ok with passing Scorer to Collector, instead of just a class 
with a single score() method?

Good question... I'm not sure.  It would be cleaner to expose only
score() (and I think we could add methods over time), but then we'd
be creating a new instance per segment per search, which would only slow
things down.

bq. I will open an issue w/ a fix version 3.0 and take care of all those TODOs. 
Should the issue also get rid of the deprecated methods? Or will we have a 
general issue in 3.0 that removes all deprecated methods?

You don't need to enumerate deprecated methods to get rid of -- we
won't forget those ones :) It's these other special tasks that may
slip through the cracks.



[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)

2009-04-03 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695543#action_12695543
 ] 

Michael McCandless commented on LUCENE-1575:


bq. BTW Mike - I think the accidental changes to Searchable and Searcher could 
have been easily detected by test-tags if we had classes in the back-compat tag 
which implemented interfaces / extended abstract classes with empty 
implementations. These are not really junit tests, but if someone would have 
changed an interface or abstract class, then attempting to compile the test 
package against the trunk would fail.

I think that's a great idea!  Every interface/abstract class should
have a "just compile me" subclass in the tests.
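A hedged illustration of the pattern, shown against a made-up interface so the example stands alone (the real stubs would target Searchable, Searcher, and friends):

{code}
// Lives in the back-compat test tag; its only job is to compile. If trunk
// adds or changes an abstract method, compiling the tag fails immediately.
interface SomePublicApi {              // stand-in for e.g. Searchable
  int doSomething(String arg);
}

final class JustCompileSomePublicApi implements SomePublicApi {
  public int doSomething(String arg) {
    throw new UnsupportedOperationException("compile-time check only");
  }
}
{code}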

bq. It is not so relevant now, since the next release is 2.9 following by a 3.0 
and back-compat will completely go away in 3.0, but perhaps post 3.0?

It is relevant because neither Searchable nor Searcher is deprecated
(yet)?  Ie, during development of 2.9 and of 3.0 we have to ensure we
don't break back compat of non-deprecated APIs.

So maybe fold this in on the next patch iteration?

bq. Also, it will prevent us from making changes to back-compat like we wanted 
to in this issue, but perhaps it's good?

It's good, because it'd raise the issue right away vs. us catching it (or
not) by staring at the code :)



[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)

2009-04-03 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695553#action_12695553
 ] 

Jason Rutherglen commented on LUCENE-1575:
--

Something related to time-limiting collectors that we may want to
solve (maybe not in this patch) is passing the time limit down to
the sub-scorers. At the hit collector level, the sub-scorers of a
multi-clause query could stay busy well past the time limit before
returning the first hit?



[jira] Updated: (LUCENE-1582) Make TrieRange completely independent from Document/Field with TokenStream of prefix encoded values

2009-04-03 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-1582:
--

Attachment: LUCENE-1582.patch

 Make TrieRange completely independent from Document/Field with TokenStream of 
 prefix encoded values
 ---

 Key: LUCENE-1582
 URL: https://issues.apache.org/jira/browse/LUCENE-1582
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/*
Affects Versions: 2.9
Reporter: Uwe Schindler
Assignee: Uwe Schindler
 Fix For: 2.9

 Attachments: LUCENE-1582.patch, LUCENE-1582.patch


 TrieRange currently has the following problems:
 - To add a field that uses trie encoding, you can manually add each term 
 to the index or use a helper method from TrieUtils. The helper method has the 
 problem that it uses a fixed field configuration
 - TrieUtils currently creates by default a helper field containing the lower 
 precision terms to enable sorting (limitation of one term/document for 
 sorting)
 - trieCodeLong/Int() unnecessarily creates String[] and char[] arrays, which 
 is heavy on GC if you index a lot of numeric values. A lot of char[] to 
 String copying is also involved.
 This issue should improve this:
 - trieCodeLong/Int() returns a TokenStream. During encoding, all char[] 
 arrays are reused by the Token API, additional String[] arrays for the encoded 
 result are not created; instead the TokenStream enumerates the trie values.
 - Trie fields can be added to Documents during indexing using the standard 
 API: new Field(name,TokenStream,...), so no extra util method is needed. By 
 using token filters, one could also add payloads and so on, and customize 
 everything.
 The drawback is: Sorting would not work anymore. To enable sorting, a 
 (sub-)issue can extend the FieldCache to stop iterating the terms as soon as 
 a lower precision one is enumerated by TermEnum. I will create a hack patch 
 for TrieUtils use only, that uses an unchecked Exception in the Parser to 
 stop iteration (see the sketch below). With LUCENE-831, a more generic API 
 for this type can be used (custom parser/iterator implementation for 
 FieldCache). I will attach the field cache patch (with the temporary 
 solution, until FieldCache is reimplemented) as a separate patch file, or 
 maybe open another issue for it.
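A hedged sketch of that hack, assuming a FieldCache.LongParser and an unchecked marker exception; the precision test and decoder are placeholders, not the patch's actual code:

{code}
// Parser that aborts FieldCache loading once lower-precision trie terms
// appear; the unchecked exception is caught by the (patched) cache filler.
final class StopFillCacheException extends RuntimeException {}

FieldCache.LongParser parser = new FieldCache.LongParser() {
  public long parseLong(String term) {
    if (!isFullPrecision(term))            // placeholder precision test
      throw new StopFillCacheException();  // stop TermEnum iteration early
    return decodeTrieTerm(term);           // placeholder decoder
  }
};
{code}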




[jira] Commented: (LUCENE-1582) Make TrieRange completely independent from Document/Field with TokenStream of prefix encoded values

2009-04-03 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695554#action_12695554
 ] 

Uwe Schindler commented on LUCENE-1582:
---

Updated patch:
- supports a setValue() to reset the TokenStream with a new value for reuse (as 
discussed before; sketched below)
- completed JavaDocs
- removed dead code parts
- small change in the RangeBuilder API (removed unneeded parameters)

The difference between reusing fields and token streams and always creating 
new ones is measurable (I compared in the test case), but not significant. The 
JavaDocs explain how to reuse.

I have done everything that I planned; now it's time to discuss the change.
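A hedged sketch of the reuse pattern being described; the stream type name, its constructor, and the setValue() signature follow the comment above but are assumptions, and the exact names in the patch may differ:

{code}
// Hypothetical reuse pattern: one Document, one Field, one trie TokenStream.
Document doc = new Document();
LongTrieTokenStream stream = new LongTrieTokenStream(precisionStep); // assumed type
doc.add(new Field("price", stream));        // Field(String, TokenStream)
for (int i = 0; i < values.length; i++) {
  stream.setValue(values[i]);               // reset the reusable stream
  writer.addDocument(doc);                  // same Document/Field instances
}
{code}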





[jira] Commented: (LUCENE-1584) Callback for intercepting merging segments in IndexWriter

2009-04-03 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695557#action_12695557
 ] 

Jason Rutherglen commented on LUCENE-1584:
--

I would like to move away from our current position of somewhat
closed APIs that require user classes to be part of the Lucene
packages.

It's always best to reuse existing APIs; however, we've migrated
to OSGi, which means any time we need to place new classes in
Lucene packages, we need to roll out specific JARs (I think;
perhaps it's more complex) for the few classes outside of our
main package classes. This makes deployment of search
applications a bit more difficult and time-consuming.

A related thread regarding MergePolicy is at:
http://markmail.org/thread/h5bxjflpcyejrcqg 

 Callback for intercepting merging segments in IndexWriter
 -

 Key: LUCENE-1584
 URL: https://issues.apache.org/jira/browse/LUCENE-1584
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.4.1
Reporter: Jason Rutherglen
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1584.patch

   Original Estimate: 96h
  Remaining Estimate: 96h

 For things like merging field caches or bitsets, it's useful to
 know which segments were merged to create a new segment.




[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)

2009-04-03 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695575#action_12695575
 ] 

Shai Erera commented on LUCENE-1575:




How do I run such a test? Is there an algorithm for that in the benchmark
package?

I compared the new TSDC to the trunk's version and the new code does ('-'
means a negative change, '+' means a positive change, '|' means
neither/undetermined):
* adds one collector.setScorer() call to each query. (-)
* The scorer.score() call in collect() was just moved from whoever called
collect() to inside collect(), so I don't think there's a difference. (|)
* Does not check if score > 0.0f in each collect (+) (see the sketch below)
* implements the new topDocs() method. Previously, it just implemented
topDocs() which returned everything. Now, topDocs() calls topDocs(0,
pq.size()), which verifies parameters and such - since that's executed once
at the end of the search, I doubt that it has any major effect on the
results.
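For reference, the new TSDC collect() essentially boils down to this shape, reconstructed from the differences listed above rather than quoted from the patch (docBase is assumed to be remembered in setNextReader, and insertWithOverflow stands in for whatever PQ insert the patch actually uses):

{code}
// One setScorer() call per segment/search; no 'score > 0.0f' pre-filter.
public void collect(int doc) throws IOException {
  float score = scorer.score();   // scorer was handed over via setScorer()
  totalHits++;
  pq.insertWithOverflow(new ScoreDoc(doc + docBase, score));
}
{code}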

BTW, as I scanned through the code I noticed that previously TSDC returned
maxScore = Float.NEGATIVE_INFINITY in case there were 0 results for the
query, and now it returns Float.NaN. I'm not sure whether this breaks
anything, since maxScore is probably used (if at all) for normalization of
scores, and in case there are 0 results you don't really have anything to
normalize ...

Regarding TopFieldDocs, I am quite surprised. I assume the test uses the
OneComparatorScoringCollector, which means scores are computed:
* It has the same issue as in TSDC regarding topDocs(). So I think it should
be changed here as well, however I doubt that's the cause of the
performance hit.
* It computes the score and then does super.collect(), which adds a method
call (-)
* It doesn't check if the score is > 0 (+)
* It calls comparator.setScorer, which is ignored in all comparators besides
RelevanceComparator. Not sure if it has any performance effect (|)
The rest of the code in collect() is exactly the same.

Can it be that super.collect() has such an effect? When I think about the
results of TSDC (-3%) vs. TFC (-28% on avg.), it might be, since
setScorer() is called once before the series of collect() calls, whereas
super.collect() is called for every document. Your index is large (2M
documents, right?) and I don't know how many results there are for each
query; if they are in the range of 100Ks, that could be the explanation.

Mike - in case it's faster for you to run it, can you try to run the test
again with a change in the code which inlines super.collect() into
OneComparatorScoringCollector and compare the results again? I will run it
also after you tell me which algorithm you used, but only tomorrow morning,
so if you get to do it before then, that'd be great.

I doubt that the change in topDocs() affects the query time that much, since
it's called at the end of the search, and doing 4-5 'if' statements is
really not that expensive (I mean once per entire search), compared to
ScoreDoc[] array allocation, fetching stored fields from the index, etc. So
I'd hate to implement all 3 topDocs() in each of the TopDocsCollector
extensions unless it proves to be a problem.

Shai

On Fri, Apr 3, 2009 at 10:02 PM, Michael McCandless (JIRA)
j...@apache.org wrote:




[jira] Commented: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector

2009-04-03 Thread Jeremy Volkman (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695577#action_12695577
 ] 

Jeremy Volkman commented on LUCENE-1483:


I'm trying to create a FieldValueHitQueue outside of an IndexSearcher. One part 
of my code collects all results in a fashion similar to 
http://www.gossamer-threads.com/lists/lucene/java-user/66362#66362. At the end 
of my collection, I used to pass the results through a FieldSortedHitQueue of 
the proper size to get sorted results. The problem now is that 
FieldValueHitQueue takes an array of subreaders instead of one IndexReader. As 
far as I can tell, there's no way for me to get a proper sorted array of 
subreaders for an IndexReader without copying and pasting the gatherSubReaders 
and sortSubReaders methods from IndexSearcher. This isn't desirable, so could 
IndexSearcher perhaps provide some sort of getSortedSubReaders() method? Either 
that, or extract this functionality out into a common utility method that 
IndexSearcher uses.
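Until such a method exists, something like the following utility can flatten a reader, assuming the 2.9 IndexReader.getSequentialSubReaders() API; the class and method names here are made up:

{code}
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.index.IndexReader;

public final class ReaderUtil {
  // Flattens a (possibly composite) reader into its leaf subreaders, in
  // document order, which is what FieldValueHitQueue expects.
  public static IndexReader[] gatherSubReaders(IndexReader r) {
    List leaves = new ArrayList();
    gather(leaves, r);
    return (IndexReader[]) leaves.toArray(new IndexReader[leaves.size()]);
  }

  private static void gather(List leaves, IndexReader r) {
    IndexReader[] subs = r.getSequentialSubReaders();
    if (subs == null) {
      leaves.add(r); // already a leaf, e.g. a SegmentReader
    } else {
      for (int i = 0; i < subs.length; i++) gather(leaves, subs[i]);
    }
  }
}
{code}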

 Change IndexSearcher multisegment searches to search each individual segment 
 using a single HitCollector
 

 Key: LUCENE-1483
 URL: https://issues.apache.org/jira/browse/LUCENE-1483
 Project: Lucene - Java
  Issue Type: Improvement
Affects Versions: 2.9
Reporter: Mark Miller
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1483-backcompat.patch, LUCENE-1483-partial.patch, 
 LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
 LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
 LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
 LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
 LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
 LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
 LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
 LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
 LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, sortBench.py, 
 sortCollate.py


 This issue changes how an IndexSearcher searches over multiple segments. The 
 current method of searching multiple segments is to use a MultiSegmentReader 
 and treat all of the segments as one. This causes filters and FieldCaches to 
 be keyed to the MultiReader and makes reopen expensive. If only a few 
 segments change, the FieldCache is still loaded for all of them.
 This patch changes things by searching each individual segment one at a time, 
 but sharing the HitCollector used across each segment. This allows 
 FieldCaches and Filters to be keyed on individual SegmentReaders, making 
 reopen much cheaper. FieldCache loading over multiple segments can be much 
 faster as well - with the old method, all unique terms for every segment are 
 enumerated against each segment - because of the likely logarithmic change in 
 terms per segment, this can be very wasteful. Searching individual segments 
 avoids this cost. The term/document statistics from the multireader are used 
 to score results for each segment.
 When sorting, it's more difficult to use a single HitCollector for each sub 
 searcher. Ordinals are not comparable across segments. To account for this, a 
 new field-sort-enabled HitCollector is introduced that is able to collect and 
 sort across segments (because of its ability to compare ordinals across 
 segments). This TopFieldCollector class will collect the values/ordinals for 
 a given segment, and upon moving to the next segment, translate any 
 ordinals/values so that they can be compared against the values for the new 
 segment. This is done lazily.
 All in all, the switch seems to provide numerous performance benefits, in 
 both sorted and non-sorted search. We were seeing a sizable loss on indices 
 with lots of segments (1000?) and certain queue sizes / queries, but the 
 latest results seem to show that's been mostly taken care of (you shouldn't 
 be using such a large queue on such a segmented index anyway).
 * Introduces
 ** MultiReaderHitCollector - a HitCollector that can collect across multiple 
 IndexReaders. Old HitCollectors are wrapped to support multiple IndexReaders.
 ** TopFieldCollector - a HitCollector that can compare values/ordinals across 
 IndexReaders and sort on fields.
 ** FieldValueHitQueue - a Priority queue that is part of the 
 TopFieldCollector implementation.
 ** FieldComparator - a new Comparator class that works across IndexReaders. 
 Part of the TopFieldCollector implementation.
 ** FieldComparatorSource - new class to allow for custom Comparators.
 * Alters
 ** IndexSearcher uses a single HitCollector to collect hits against each 
 individual SegmentReader. All the other changes stem from this ;)
 * Deprecates
 ** TopFieldDocCollector
 ** FieldSortedHitQueue
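In outline, the per-segment search loop this issue introduces looks roughly like the following; the signatures are simplified assumptions (the real IndexSearcher also threads a Filter and wrapping of plain HitCollectors through):

{code}
// Sketch of the core idea: one collector, notified per segment with its
// docBase, so FieldCaches/Filters key on each SegmentReader.
void searchEachSegment(Weight weight, IndexReader[] subReaders,
                       int[] docStarts, MultiReaderHitCollector collector)
    throws IOException {
  for (int i = 0; i < subReaders.length; i++) {
    collector.setNextReader(subReaders[i], docStarts[i]); // new segment + base
    Scorer scorer = weight.scorer(subReaders[i]);          // assumed signature
    if (scorer != null) scorer.score(collector);           // segment-local docIds
  }
}
{code}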

[jira] Commented: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector

2009-04-03 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695586#action_12695586
 ] 

Shai Erera commented on LUCENE-1483:


Hi Jeremy

This will be taken care of in 1575 by removing the IndexReader[] arg from 
TopFieldCollector. As a matter of fact, 1575 changes the collector's API 
quite a bit, so you might want to take a look there. Anyway, I ran into 
the same issue there and realized this arg can be safely removed from 
TopFieldCollector as well as FieldValueHitQueue.



[jira] Commented: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector

2009-04-03 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695587#action_12695587
 ] 

Uwe Schindler commented on LUCENE-1483:
---

This will be changed as part of LUCENE-1575.

 Change IndexSearcher multisegment searches to search each individual segment 
 using a single HitCollector
 

 Key: LUCENE-1483
 URL: https://issues.apache.org/jira/browse/LUCENE-1483
 Project: Lucene - Java
  Issue Type: Improvement
Affects Versions: 2.9
Reporter: Mark Miller
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1483-backcompat.patch, LUCENE-1483-partial.patch, 
 LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
 LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
 LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
 LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
 LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
 LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
 LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
 LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
 LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, sortBench.py, 
 sortCollate.py


 This issue changes how an IndexSearcher searches over multiple segments. The 
 current method of searching multiple segments is to use a MultiSegmentReader 
 and treat all of the segments as one. This causes filters and FieldCaches to 
 be keyed to the MultiReader and makes reopen expensive. If only a few 
 segments change, the FieldCache is still loaded for all of them.
 This patch changes things by searching each individual segment one at a time, 
 but sharing the HitCollector used across each segment. This allows 
 FieldCaches and Filters to be keyed on individual SegmentReaders, making 
 reopen much cheaper. FieldCache loading over multiple segments can be much 
 faster as well: with the old method, all unique terms across every segment are 
 enumerated against each segment; because the number of unique terms likely grows 
 logarithmically per segment, this can be very wasteful. Searching individual 
 segments avoids this cost. The term/document statistics from the MultiReader 
 are used to score results for each segment.
 When sorting, it's more difficult to use a single HitCollector for each 
 sub-searcher: ordinals are not comparable across segments. To account for this, 
 a new field-sort-enabled HitCollector is introduced that is able to collect and 
 sort across segments (because of its ability to compare ordinals across 
 segments). This TopFieldCollector class will collect the values/ordinals for 
 a given segment and, upon moving to the next segment, translate any 
 ordinals/values so that they can be compared against the values for the new 
 segment. This is done lazily.
 All in all, the switch seems to provide numerous performance benefits, in 
 both sorted and non-sorted search. We were seeing a noticeable performance loss 
 on indices with lots of segments (1000?) and certain queue sizes / queries, but 
 the latest results seem to show that's been mostly taken care of (you shouldn't 
 be using such a large queue on such a heavily segmented index anyway).
 * Introduces
 ** MultiReaderHitCollector - a HitCollector that can collect across multiple 
 IndexReaders. Old HitCollectors are wrapped to support multiple IndexReaders.
 ** TopFieldCollector - a HitCollector that can compare values/ordinals across 
 IndexReaders and sort on fields.
 ** FieldValueHitQueue - a Priority queue that is part of the 
 TopFieldCollector implementation.
 ** FieldComparator - a new Comparator class that works across IndexReaders. 
 Part of the TopFieldCollector implementation.
 ** FieldComparatorSource - new class to allow for custom Comparators.
 * Alters
 ** IndexSearcher uses a single HitCollector to collect hits against each 
 individual SegmentReader. All the other changes stem from this ;)
 * Deprecates
 ** TopFieldDocCollector
 ** FieldSortedHitQueue
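 A minimal sketch of a collector under this per-segment model, assuming the 
 MultiReaderHitCollector shape introduced above (these are patch-era 
 signatures; the committed API may differ):
 {code}
 import org.apache.lucene.index.IndexReader;

 // Counts hits and tracks the last hit in the top-level doc id space,
 // even though collect() receives segment-relative ids.
 public class CountingCollector extends MultiReaderHitCollector {
   private int docBase;            // doc id offset of the current segment
   private int count;
   private int lastGlobalDoc = -1;

   public void setNextReader(IndexReader reader, int docBase) {
     this.docBase = docBase;       // called once per segment
   }

   public void collect(int doc, float score) {
     count++;
     lastGlobalDoc = docBase + doc;  // rebase into the MultiReader's space
   }

   public int getCount() { return count; }
 }
 {code}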

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Future projects

2009-04-03 Thread Jason Rutherglen
 meaning in Bobo you'd like to manage your own memory resident
field caches, and merge them whenever IW has merged a segment?
Seems like you don't need genealogy for that.

Agreed, there is no need for full genealogy.

 CSF isn't really designed yet. How come it can't be used with
Bobo's field caches?

I guess CSF should be able to support it, makes sense. As long
as the container is flexible with the encoding (I need to look
into this more on the Bobo side).

 Lucene's internal field cache usage is now entirely at the
segment level (ie, Lucene core should never request full field
cache array at the MultiSegmentReader level). I think Bobo must
have to do the same, if it handles near realtime updates, to get
adequate performance.

Bobo needs to migrate to this model, I don't think we've done
that yet.

 EG how come Bobo made its own field cache impl? Just because
uninversion is too slow?

It could be integrated once LUCENE-831 is completed. I think the
current model of a weak reference and the inability to unload if
needed is a concern.  I don't think it's because of uninversion.

On Fri, Apr 3, 2009 at 3:35 AM, Michael McCandless 
luc...@mikemccandless.com wrote:

 On Thu, Apr 2, 2009 at 5:56 PM, Jason Rutherglen
 jason.rutherg...@gmail.com wrote:
  I think I need to understand better why delete by Query isn't
  viable in your situation...
 
  The delete by query is a separate problem which I haven't fully
  explored yet.

 Oh, I had thought we were tugging on this thread in order to explore
 delete-by-docID in the writer.  OK.

  Tracking the segment genealogy is really an
  interim step for merging field caches before column stride
  fields gets implemented.

 I see -- meaning in Bobo you'd like to manage your own memory resident
 field caches, and merge them whenever IW has merged a segment?  Seems
 like you don't need genealogy for that.

  Actually CSF cannot be used with Bobo's
  field caches anyways which means we'd need a way to find out
  about the segment parents.

 CSF isn't really designed yet.  How come it can't be used with Bobo's
 field caches?  We can try to accommodate Bobo's field cache needs when
 designing CSF.

  Does it operate at the segment level? Seems like that'd give
  you good enough realtime performance (though merging in RAM will
  definitely be faster).
 
  We need to see how Bobo integrates with LUCENE-1483.

 Lucene's internal field cache usage is now entirely at the segment
 level (ie, Lucene core should never request full field cache array at
 the MultiSegmentReader level).  I think Bobo must have to do the same,
 if it handles near realtime updates, to get adequate performance.

 Though... since we have LUCENE-831 (rework API Lucene exposes for
 accessing arrays-of-atomic-types-per-segment) and LUCENE-1231 (CSF = a
 more efficient impl (than uninversion) of the API we expose in
 LUCENE-831) on deck, we should try to understand Bobo's needs.

 EG how come Bobo made its own field cache impl?  Just because
 uninversion is too slow?

  It seems like we've been talking about CSF for 2 years and there
  isn't a patch for it? If I had more time I'd take a look. What
  is the status of it?

 I think Michael is looking into it?  I'd really like to get it into
 2.9.  We should do it in conjunction with 831 since they are so tied.

  I'll write a patch that implements a callback for the segment
  merging such that the user can decide what information they want
  to record about the merged SRs (I'm pretty sure there isn't a
  way to do this with MergePolicy?)

 Actually I think you can do this w/ a simple MergeScheduler wrapper or
 by subclassing CMS.  I'll put a comment on the issue.
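
 Something like this might work (treating the exact 2.9-era hooks, a
 protected doMerge(OneMerge) on CMS and OneMerge.segString(Directory),
 as assumptions to verify against the tree):

 {code}
 import java.io.IOException;
 import org.apache.lucene.index.ConcurrentMergeScheduler;
 import org.apache.lucene.index.MergePolicy;
 import org.apache.lucene.store.Directory;

 public class TrackingMergeScheduler extends ConcurrentMergeScheduler {
   private final Directory dir;

   public TrackingMergeScheduler(Directory dir) {
     this.dir = dir;
   }

   protected void doMerge(MergePolicy.OneMerge merge) throws IOException {
     // The source segments are known before the merge runs...
     System.out.println("merging " + merge.segString(dir));
     super.doMerge(merge);
     // ...and afterwards the application can merge its own per-segment
     // field caches to match the newly created segment.
     System.out.println("merged  " + merge.segString(dir));
   }
 }
 {code}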

 Mike

 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org




[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)

2009-04-03 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695595#action_12695595
 ] 

Shai Erera commented on LUCENE-1575:


BTW, I can change FieldValueHitQueue like I changed TopFieldCollector, by
introducing a factory create() method which will return either a
OneComparatorFieldValueHitQueue or a MultiComparatorsFieldValueHitQueue.
Today, FVHQ.lessThan checks numComparators on each call, which is
redundant.

Also, the class isn't final, and I'm not sure whether we want to change that.

What do you think?
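
A quick sketch of what I mean (illustrative, not a patch; the subclass
bodies are simplified stand-ins):

{code}
import java.util.Comparator;

abstract class FieldValueHitQueue<T> {

  // create() picks the subclass once, so lessThan() never re-checks
  // the comparator count per call.
  static <T> FieldValueHitQueue<T> create(Comparator<T>[] comparators) {
    return comparators.length == 1
        ? new OneComparatorFieldValueHitQueue<T>(comparators[0])
        : new MultiComparatorsFieldValueHitQueue<T>(comparators);
  }

  abstract boolean lessThan(T hitA, T hitB);

  private static final class OneComparatorFieldValueHitQueue<T>
      extends FieldValueHitQueue<T> {
    private final Comparator<T> comparator;

    OneComparatorFieldValueHitQueue(Comparator<T> comparator) {
      this.comparator = comparator;
    }

    boolean lessThan(T hitA, T hitB) {
      // Single sort field: no loop, no length check.
      return comparator.compare(hitA, hitB) < 0;
    }
  }

  private static final class MultiComparatorsFieldValueHitQueue<T>
      extends FieldValueHitQueue<T> {
    private final Comparator<T>[] comparators;

    MultiComparatorsFieldValueHitQueue(Comparator<T>[] comparators) {
      this.comparators = comparators;
    }

    boolean lessThan(T hitA, T hitB) {
      // Walk the comparators until one breaks the tie.
      for (int i = 0; i < comparators.length; i++) {
        int cmp = comparators[i].compare(hitA, hitB);
        if (cmp != 0) {
          return cmp < 0;
        }
      }
      return false;
    }
  }
}
{code}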




 Refactoring Lucene collectors (HitCollector and extensions)
 ---

 Key: LUCENE-1575
 URL: https://issues.apache.org/jira/browse/LUCENE-1575
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Shai Erera
Assignee: Michael McCandless
 Fix For: 2.9

 Attachments: LUCENE-1575.1.patch, LUCENE-1575.2.patch, 
 LUCENE-1575.3.patch, LUCENE-1575.4.patch, LUCENE-1575.5.patch, 
 LUCENE-1575.6.patch, LUCENE-1575.patch, LUCENE-1575.patch


 This issue is a result of a recent discussion we've had on the mailing list. 
 You can read the thread 
 [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html].
 We have agreed to do the following refactoring:
 * Rename MultiReaderHitCollector to Collector, with the purpose that it will 
 be the base class for all Collector implementations.
 * Deprecate HitCollector in favor of the new Collector.
 * Introduce new methods in IndexSearcher that accept Collector, and deprecate 
 those that accept HitCollector.
 ** Create a final class HitCollectorWrapper, and use it in the deprecated 
 methods in IndexSearcher, wrapping the given HitCollector.
 ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, 
 when we remove HitCollector.
 ** It will remove any instanceof checks that currently exist in IndexSearcher 
 code.
 * Create a new (abstract) TopDocsCollector, which will:
 ** Leave collect and setNextReader unimplemented.
 ** Introduce protected members PriorityQueue and totalHits.
 ** Introduce a single protected constructor which accepts a PriorityQueue.
 ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. 
 These can be used as-is by extending classes, as well as overridden.
 ** Introduce a new topDocs(start, howMany) method which will be used as a 
 convenience method when implementing a search application which allows paging 
 through search results. It will also attempt to improve the memory 
 allocation by allocating a ScoreDoc[] of the requested size only.
 * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() 
 and getTotalHits() implementations as they are from TopDocsCollector. The 
 class will also be made final.
 * Change TopFieldCollector to extend TopDocsCollector, and make the class 
 final. Also implement topDocs(start, howMany).
 * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, 
 instead of TopScoreDocCollector. Implement topDocs(start, howMany)
 * Review other places where HitCollector is used, such as in Scorer, 
 deprecate those places and use Collector instead.
 Additionally, the following proposal was made w.r.t. decoupling score from 
 collect() (see the sketch at the end of this description):
 * Change collect to accept only a doc Id (unbased).
 * Introduce a setScorer(Scorer) method.
 * If during collect the implementation needs the score, it can call 
 scorer.score().
 If we do this, then we need to review all places in the code where 
 collect(doc, score) is called, and assess whether a Scorer can be passed. Also 
 this raises a few questions:
 * What if during collect() Scorer is null? (i.e., not set) - is it even 
 possible?
 * I noticed that many (if not all) of the collect() implementations discard 
 the document if its score is not greater than 0. Doesn't it mean that score 
 is needed in collect() always?
 Open issues:
 * The name for Collector
 * TopDocsCollector was mentioned on the thread as TopResultsCollector, but 
 that was when we thought to call Collector ResultsCollector. Since we decided 
 (so far) on Collector, I think TopDocsCollector makes sense, because of its 
 TopDocs output.
 * Decoupling score from collect().
 I will post a patch a bit later, as this is expected to be a very large 
 patch. I will split it into 2: (1) the code patch, (2) test cases (moving to 
 use Collector instead of HitCollector, as well as testing the new 
 topDocs(start, howMany) method).
 There might even be a 3rd patch which handles the setScorer thing in 
 Collector (maybe even a different issue?)
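 A sketch of the proposed shape (illustrative only; method names follow the 
 lists above, not a final API):
 {code}
 import java.io.IOException;
 import org.apache.lucene.index.IndexReader;
 import org.apache.lucene.search.Scorer;

 public abstract class Collector {

   // Called before hits are collected from a new Scorer.
   public abstract void setScorer(Scorer scorer) throws IOException;

   // Called once per segment; subsequent doc ids are segment-relative.
   public abstract void setNextReader(IndexReader reader, int docBase)
       throws IOException;

   // doc is unbased; implementations call scorer.score() only when the
   // score is actually needed.
   public abstract void collect(int doc) throws IOException;
 }
 {code}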

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (LUCENE-1231) Column-stride fields (aka per-document Payloads)

2009-04-03 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695666#action_12695666
 ] 

Michael Busch commented on LUCENE-1231:
---

For the search side we need an API similar to TermDocs and Payloads;
let's call it ColumnStrideFieldAccessor (CSFA) for now. It should have
next(), skipTo(), doc(), etc. methods.
However, the way TermPositions#getPayload() currently works
is that it always forces you to copy the bytes from the underlying
IndexInput into the payload byte[] array. Since we usually use a
BufferedIndexInput, this is then an arraycopy from
BufferedIndexInput's buffer array into the byte array.

I think to improve this we could allow users to call methods like
readVInt() directly on the CSFA. So I was thinking about adding
DataInput and DataOutput as superclasses of IndexInput and
IndexOutput. DataIn(Out)put would implement the different read and
write methods, whereas IndexIn(Out)put would only implement methods
like close(), seek(), getFilePointer(), length(), flush(), etc.
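
A sketch of the proposed split (hypothetical shape; only readByte() and
readVInt() are shown, and the method split follows the description above):

{code}
import java.io.IOException;

public abstract class DataInput {

  public abstract byte readByte() throws IOException;

  // Standard Lucene VInt decoding, living on the data-reading superclass.
  public int readVInt() throws IOException {
    byte b = readByte();
    int i = b & 0x7F;
    for (int shift = 7; (b & 0x80) != 0; shift += 7) {
      b = readByte();
      i |= (b & 0x7F) << shift;
    }
    return i;
  }
}

// In a separate source file: only the file-oriented methods remain here.
public abstract class IndexInput extends DataInput {
  public abstract void close() throws IOException;
  public abstract long getFilePointer();
  public abstract void seek(long pos) throws IOException;
  public abstract long length();
}
{code}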

So then CSFA would extend DataInput or alternatively have a
getDataInput() method. The danger here compared to the current
payloads API would be that the user might read too few or too many
bytes of a CSF, which would result in an undefined and possibly hard
to debug behavior. But we could offer e.g.:

{code}
static ColumnStrideFieldsAccessor getAccessor(ColumnStrideFieldsAccessor in,
                                              Mode mode) {
  if (mode == Mode.Fast) {
    return in;
  } else if (mode == Mode.Safe) {
    return new SafeAccessor(in);
  }
  throw new IllegalArgumentException("unknown mode: " + mode);
}
{code}

The SafeAccessor would count for you the number of read bytes and
throw exceptions if you don't consume the number of bytes you should
consume. This is of course overhead, but users could use the
SafeAccessor until they're confident that everything works fine in
their system, and then switch to the fast accessor for better
performance.
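
For illustration, a hypothetical SafeAccessor along those lines (all names
here are made up; it reuses the DataInput shape sketched above):

{code}
import java.io.IOException;

// Wraps the raw input and verifies each CSF value is consumed exactly.
class SafeAccessor {
  private final DataInput in;
  private int remaining;    // bytes left in the current document's value

  SafeAccessor(DataInput in) {
    this.in = in;
  }

  // Called when positioned on a document whose value has the given length.
  void startValue(int length) {
    this.remaining = length;
  }

  byte readByte() throws IOException {
    if (remaining <= 0) {
      throw new IOException("read past the end of the current CSF value");
    }
    remaining--;
    return in.readByte();
  }

  // Called before advancing to the next document.
  void endValue() throws IOException {
    if (remaining != 0) {
      throw new IOException(remaining + " unread bytes left in CSF value");
    }
  }
}
{code}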

If there are no objections I will open a separate JIRA issue for the
DataInput/Output patch.

 Column-stride fields (aka per-document Payloads)
 

 Key: LUCENE-1231
 URL: https://issues.apache.org/jira/browse/LUCENE-1231
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
 Fix For: 3.0


 This new feature has been proposed and discussed here:
 http://markmail.org/search/?q=per-document+payloads#query:per-document%20payloads+page:1+mid:jq4g5myhlvidw3oc+state:results
 Currently it is possible in Lucene to store data as stored fields or as 
 payloads.
 Stored fields provide good performance if you want to load all fields for one
 document, because this is a sequential I/O operation.
 If, however, you want to load the data from one field for a large number of 
 documents, then stored fields perform quite badly, because lots of I/O seeks 
 might have to be performed. 
 A better way to do this is using payloads. By creating a special posting list
 that has one posting with payload for each document, you can simulate a 
 column-stride field. The performance is significantly better compared to 
 stored fields, however still not optimal. The reason is that for each 
 document the freq value, which is in this particular case always 1, has to 
 be decoded, and also one position value, which is always 0, has to be loaded.
 As a solution we want to add real column-stride fields to Lucene. A possible
 format for the new data structure could look like this (CSD stands for 
 column-stride data; once we decide on a final name for this feature we can 
 change this):
 CSDList --> FixedLengthList | VariableLengthList, SkipList 
 FixedLengthList --> Payload^SegSize 
 VariableLengthList --> <DocDelta, PayloadLength?, Payload> 
 Payload --> Byte^PayloadLength 
 PayloadLength --> VInt 
 SkipList --> see .frq file
 We distinguish here between the fixed-length and the variable-length cases. To
 allow flexibility, Lucene could automatically pick the right data structure. 
 This could work like this: when the DocumentsWriter writes a segment, it 
 checks whether all values of a field have the same length. If yes, it stores 
 them as a FixedLengthList; if not, then as a VariableLengthList. When the 
 SegmentMerger merges two or more segments, it checks whether all segments 
 have a FixedLengthList with the same length for a column-stride field. If 
 not, it writes a VariableLengthList to the new segment. 
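 For illustration only, the length check a writer could run to pick between 
 the two encodings above (names are not from any patch):
 {code}
 // True if every value of the field has the same byte length, in which
 // case the field can be written as a FixedLengthList; otherwise the
 // writer falls back to a VariableLengthList.
 static boolean isFixedLength(byte[][] values) {
   for (byte[] value : values) {
     if (value.length != values[0].length) {
       return false;
     }
   }
   return true;
 }
 {code}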
 Once this feature is implemented, we should think about making the 
 column-stride fields updateable, similar to the norms. This will be a very 
 powerful feature that can, for example, be used for low-latency tagging of 
 documents.
 Other use cases:
 - replace norms
 - allow to store boost values separately from norms
 - as input for the FieldCache, thus 

[jira] Created: (LUCENE-1585) Allow to control how payloads are merged

2009-04-03 Thread Michael Busch (JIRA)
Allow to control how payloads are merged


 Key: LUCENE-1585
 URL: https://issues.apache.org/jira/browse/LUCENE-1585
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor


Lucene handles backwards-compatibility of its data structures by
converting them from the old into the new formats during segment
merging. 

Payloads are simply byte arrays in which users can store arbitrary
data. Applications that use payloads might want to convert the format
of their payloads in a similar fashion. Otherwise it's not easily
possible to ever change the encoding of a payload without reindexing.

So I propose to introduce a PayloadMerger class that the SegmentMerger
invokes to merge the payloads from multiple segments. Users can then
implement their own PayloadMerger to convert payloads from an old into
a new format.
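
A hypothetical shape for such a class (nothing like this exists yet; the
class name follows the proposal, the signature is illustrative):

{code}
public abstract class PayloadMerger {

  /**
   * Called by the SegmentMerger for each payload it copies; returns the
   * bytes to write into the merged segment, possibly re-encoded.
   */
  public abstract byte[] merge(String field, byte[] payload);
}

// An application could then upgrade old payloads during merging:
class UpgradingPayloadMerger extends PayloadMerger {

  public byte[] merge(String field, byte[] payload) {
    return isOldFormat(payload) ? reencode(payload) : payload;
  }

  private boolean isOldFormat(byte[] payload) {
    return payload.length > 0 && payload[0] == 0;  // app-specific marker
  }

  private byte[] reencode(byte[] old) {
    return old;  // app-specific conversion would go here
  }
}
{code}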

In the future we need this kind of flexibility also for column-stride
fields (LUCENE-1231) and flexible indexing codecs.

In addition to that, it would be nice if users could store version
information in the segments file. E.g., they could store in segment _2:
"the term a:b uses payloads of format x.y".


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org