[jira] Commented: (LUCENE-1582) Make TrieRange completely independent from Document/Field with TokenStream of prefix encoded values

2009-04-03 Thread Shalin Shekhar Mangar (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695261#action_12695261
 ] 

Shalin Shekhar Mangar commented on LUCENE-1582:
---

bq. trieCodeLong/Int() returns a TokenStream. During encoding, all char[] 
arrays are reused via the Token API; no additional String[] arrays are created 
for the encoded result. Instead, the TokenStream enumerates the trie values.

+1

 Make TrieRange completely independent from Document/Field with TokenStream of 
 prefix encoded values
 ---

 Key: LUCENE-1582
 URL: https://issues.apache.org/jira/browse/LUCENE-1582
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/*
Affects Versions: 2.9
Reporter: Uwe Schindler
Assignee: Uwe Schindler
 Fix For: 2.9


 TrieRange currently has the following problems:
 - To add a field that uses trie encoding, you must either manually add each 
 term to the index or use a helper method from TrieUtils. The helper method 
 has the problem that it uses a fixed field configuration.
 - TrieUtils currently creates, by default, a helper field containing the 
 lower-precision terms to enable sorting (sorting is limited to one term per 
 document).
 - trieCodeLong/Int() unnecessarily creates String[] and char[] arrays, which 
 is heavy on GC if you index lots of numeric values. A lot of char[]-to-String 
 copying is also involved.
 This issue should improve that:
 - trieCodeLong/Int() returns a TokenStream. During encoding, all char[] 
 arrays are reused via the Token API; no additional String[] arrays are 
 created for the encoded result. Instead, the TokenStream enumerates the trie 
 values.
 - Trie fields can be added to Documents during indexing using the standard 
 API: new Field(name, TokenStream, ...), so no extra util method is needed. By 
 using token filters, one could also add payloads and customize everything.
 The drawback: sorting would no longer work. To enable sorting, a (sub-)issue 
 can extend the FieldCache to stop iterating the terms as soon as a 
 lower-precision one is enumerated by TermEnum. I will create a hack patch for 
 TrieUtils use only, which uses an unchecked exception in the Parser to stop 
 iteration. With LUCENE-831, a more generic API can be used for this (custom 
 parser/iterator implementation for FieldCache). I will attach the field cache 
 patch (with the temporary solution, until FieldCache is reimplemented) as a 
 separate patch file, or maybe open another issue for it.




[jira] Commented: (LUCENE-1516) Integrate IndexReader with IndexWriter

2009-04-03 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695306#action_12695306
 ] 

Michael McCandless commented on LUCENE-1516:


Good catch!  I'll fix.

 Integrate IndexReader with IndexWriter 
 ---

 Key: LUCENE-1516
 URL: https://issues.apache.org/jira/browse/LUCENE-1516
 Project: Lucene - Java
  Issue Type: Improvement
Affects Versions: 2.4
Reporter: Jason Rutherglen
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, 
 LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, 
 LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, 
 LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, 
 LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, 
 LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, 
 LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, 
 LUCENE-1516.patch, magnetic.png, ssd.png, ssd2.png

   Original Estimate: 672h
  Remaining Estimate: 672h

 The current problem is that an IndexReader and an IndexWriter cannot be
 open at the same time and perform updates, as they both require the
 write lock on the index. While methods such as IW.deleteDocuments enable
 deleting from IW, methods such as IR.deleteDocument(int doc) and
 norms updating are not available from IW. This limits the ability to
 update the index dynamically or in realtime without closing the IW and
 opening an IR, deleting or updating norms, flushing, then opening the
 IW again, a process which can be detrimental to realtime updates.
 This patch will expose an IndexWriter.getReader method that returns
 the currently flushed state of the index as a class that implements
 IndexReader. The new IR implementation will differ from existing IR
 implementations such as MultiSegmentReader in that flushing will
 synchronize updates with IW in part by sharing the write lock. All
 methods of IR will be usable including reopen and clone.
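
 For illustration, a minimal sketch of how the proposed getReader method might 
 be used (getReader is the method proposed here; the rest is existing 2.4 API, 
 and dir, analyzer, doc and the "id" term are assumed for the example):
{code}
IndexWriter writer = new IndexWriter(dir, analyzer, IndexWriter.MaxFieldLength.LIMITED);
writer.addDocument(doc);

IndexReader reader = writer.getReader();  // sees the writer's current state; writer stays open
IndexSearcher searcher = new IndexSearcher(reader);

writer.deleteDocuments(new Term("id", "42"));
IndexReader newReader = reader.reopen();  // reopen/clone keep working per the proposal
{code}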




Re: IndexWriter.addIndexesNoOptimize(IndexReader[] readers)

2009-04-03 Thread Michael McCandless
Makes sense.  Wanna make a patch?  We'd then deprecate
addIndexes(IndexReader[]).

Mike

On Thu, Apr 2, 2009 at 9:16 PM, Jason Rutherglen
jason.rutherg...@gmail.com wrote:
 This seems like something that's tenable? It would be useful for merging
 RAM indexes to disk, where, if a directory is passed, the directory may be
 changed.





[jira] Commented: (LUCENE-1584) Callback for intercepting merging segments in IndexWriter

2009-04-03 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695318#action_12695318
 ] 

Michael McCandless commented on LUCENE-1584:


I think this can be achieved today by making your own MergeScheduler wrapper, 
or by subclassing ConcurrentMergeScheduler and e.g. overriding the doMerge 
method?  If so, I'd prefer not to add a callback to IW.
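
For illustration, a minimal sketch of that approach (recordMerge is a 
hypothetical hook; the doMerge signature is per the 2.x 
ConcurrentMergeScheduler API as I recall):
{code}
import java.io.IOException;
import org.apache.lucene.index.ConcurrentMergeScheduler;
import org.apache.lucene.index.MergePolicy;

public class RecordingMergeScheduler extends ConcurrentMergeScheduler {
  protected void doMerge(MergePolicy.OneMerge merge) throws IOException {
    recordMerge(merge);   // hypothetical hook: note which segments this merge combines
    super.doMerge(merge); // then let CMS perform the actual merge
  }

  private void recordMerge(MergePolicy.OneMerge merge) {
    // e.g. remember the merge so external field caches / bitsets can be
    // remapped once the merged segment becomes visible
  }
}
{code}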

 Callback for intercepting merging segments in IndexWriter
 -

 Key: LUCENE-1584
 URL: https://issues.apache.org/jira/browse/LUCENE-1584
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.4.1
Reporter: Jason Rutherglen
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1584.patch

   Original Estimate: 96h
  Remaining Estimate: 96h

 For things like merging field caches or bitsets, it's useful to
 know which segments were merged to create a new segment.




Re: Lucene filter

2009-04-03 Thread Michael McCandless
Could you re-ask this on java-user, instead?  Thanks.

Mike

On Thu, Apr 2, 2009 at 6:24 PM, addman addiek...@yahoo.com wrote:

 How do you create a Lucene Filter to check if a field has a value?  It is
 part of a ChainedFilter that I am creating.



Re: Future projects

2009-04-03 Thread Michael McCandless
On Thu, Apr 2, 2009 at 5:56 PM, Jason Rutherglen
jason.rutherg...@gmail.com wrote:
 I think I need to understand better why delete by Query isn't
 viable in your situation...

 The delete by query is a separate problem which I haven't fully
 explored yet.

Oh, I had thought we were tugging on this thread in order to explore
delete-by-docID in the writer.  OK.

 Tracking the segment genealogy is really an
 interim step for merging field caches before column stride
 fields gets implemented.

I see -- meaning in Bobo you'd like to manage your own memory resident
field caches, and merge them whenever IW has merged a segment?  Seems
like you don't need genealogy for that.

 Actually CSF cannot be used with Bobo's
 field caches anyways which means we'd need a way to find out
 about the segment parents.

CSF isn't really designed yet.  How come it can't be used with Bobo's
field caches?  We can try to accommodate Bobo's field cache needs when
designing CSF.

 Does it operate at the segment level? Seems like that'd give
 you good enough realtime performance (though merging in RAM will
 definitely be faster).

 We need to see how Bobo integrates with LUCENE-1483.

Lucene's internal field cache usage is now entirely at the segment
level (ie, Lucene core should never request full field cache array at
the MultiSegmentReader level).  I think Bobo must have to do the same,
if it handles near realtime updates, to get adequate performance.

Though... since we have LUCENE-831 (rework API Lucene exposes for
accessing arrays-of-atomic-types-per-segment) and LUCENE-1231 (CSF = a
more efficient impl (than uninversion) of the API we expose in
LUCENE-831) on deck, we should try to understand Bobo's needs.

EG how come Bobo made its own field cache impl?  Just because
uninversion is too slow?

 It seems like we've been talking about CSF for 2 years and there
 isn't a patch for it? If I had more time I'd take a look. What
 is the status of it?

I think Michael is looking into it?  I'd really like to get it into
2.9.  We should do it in conjunction with 831 since they are so tied.

 I'll write a patch that implements a callback for the segment
 merging such that the user can decide what information they want
 to record about the merged SRs (I'm pretty sure there isn't a
 way to do this with MergePolicy?)

Actually I think you can do this w/ a simple MergeScheduler wrapper or
by subclassing CMS.  I'll put a comment on the issue.

Mike




Re: Future projects

2009-04-03 Thread Michael McCandless
On Thu, Apr 2, 2009 at 6:55 PM, John Wang john.w...@gmail.com wrote:
 Just to clarify, Approach 1 and Approach 2 are both currently performing OK
 for us.

OK that's very good to know.

Mike




[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)

2009-04-03 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695338#action_12695338
 ] 

Shai Erera commented on LUCENE-1575:


I've been thinking about TimeLimitedCollector and the revert to extend 
HitCollector I had to do in the last patch - the main reason was that I 
couldn't find a better name and did not want to deprecate it. But then I 
thought that perhaps the current name is not so good, and we can change it? 
Semantically, it is not a 'limited' collector but more of a 'limiting' 
collector (I think; not being a native English speaker, I may be wrong).
Alternative names I've been thinking of are TimeKeeperCollector, 
TimeLimitingCollector and TimingOutCollector.
The advantage is that we deprecate the current one and have clear back-compat 
support, instead of changing it in 3.0 to extend Collector. If you agree with 
any of these names, I can create a new class, deprecate the current one, and 
change the tests back to use the new version (removing all those comments 
about the changes in 3.0). What do you think?

 Refactoring Lucene collectors (HitCollector and extensions)
 ---

 Key: LUCENE-1575
 URL: https://issues.apache.org/jira/browse/LUCENE-1575
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Shai Erera
Assignee: Michael McCandless
 Fix For: 2.9

 Attachments: LUCENE-1575.1.patch, LUCENE-1575.2.patch, 
 LUCENE-1575.3.patch, LUCENE-1575.4.patch, LUCENE-1575.5.patch, 
 LUCENE-1575.6.patch, LUCENE-1575.patch


 This issue is a result of a recent discussion we've had on the mailing list. 
 You can read the thread 
 [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html].
 We have agreed to do the following refactoring:
 * Rename MultiReaderHitCollector to Collector, with the purpose that it will 
 be the base class for all Collector implementations.
 * Deprecate HitCollector in favor of the new Collector.
 * Introduce new methods in IndexSearcher that accept Collector, and deprecate 
 those that accept HitCollector.
 ** Create a final class HitCollectorWrapper, and use it in the deprecated 
 methods in IndexSearcher, wrapping the given HitCollector.
 ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, 
 when we remove HitCollector.
 ** This will remove any instanceof checks that currently exist in 
 IndexSearcher code.
 * Create a new (abstract) TopDocsCollector, which will:
 ** Leave collect and setNextReader unimplemented.
 ** Introduce protected members PriorityQueue and totalHits.
 ** Introduce a single protected constructor which accepts a PriorityQueue.
 ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. 
 These can be used as-is by extending classes, as well as overridden.
 ** Introduce a new topDocs(start, howMany) method which will be used as a 
 convenience method when implementing a search application which allows paging 
 through search results. It will also attempt to improve memory allocation by 
 allocating a ScoreDoc[] of only the requested size.
 * Change TopScoreDocCollector to extend TopDocsCollector, using the topDocs() 
 and getTotalHits() implementations as they are from TopDocsCollector. The 
 class will also be made final.
 * Change TopFieldCollector to extend TopDocsCollector, and make the class 
 final. Also implement topDocs(start, howMany).
 * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector 
 instead of TopScoreDocCollector. Implement topDocs(start, howMany).
 * Review other places where HitCollector is used, such as in Scorer; 
 deprecate those places and use Collector instead.
 Additionally, the following proposal was made w.r.t. decoupling score from 
 collect():
 * Change collect to accept only a doc Id (unbased).
 * Introduce a setScorer(Scorer) method.
 * If during collect the implementation needs the score, it can call 
 scorer.score().
 If we do this, then we need to review all places in the code where 
 collect(doc, score) is called, and assess whether a Scorer can be passed. 
 This also raises a few questions:
 * What if during collect() the Scorer is null (i.e., not set)? Is that even 
 possible?
 * I noticed that many (if not all) of the collect() implementations discard 
 the document if its score is not greater than 0. Doesn't that mean the score 
 is always needed in collect()?
 Open issues:
 * The name for Collector.
 * TopDocsCollector was mentioned on the thread as TopResultsCollector, but 
 that was when we thought to call Collector ResultsCollector. Since we decided 
 (so far) on Collector, I think TopDocsCollector makes sense, because of its 
 TopDocs output.
 * Decoupling score from collect().
 I will post a patch a bit later, as this is expected to be a very large 
 patch. I will split it into two: (1) code patch, (2) test cases (moving to 
 use Collector instead of HitCollector, as well as testing the new 
 topDocs(start, howMany) method). There might even be a 3rd patch which 
 handles the setScorer thing in Collector (maybe even a different issue?).

[jira] Updated: (LUCENE-1582) Make TrieRange completely independent from Document/Field with TokenStream of prefix encoded values

2009-04-03 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-1582:
--

Attachment: LUCENE-1582.patch




[jira] Commented: (LUCENE-1582) Make TrieRange completely independent from Document/Field with TokenStream of prefix encoded values

2009-04-03 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695341#action_12695341
 ] 

Uwe Schindler commented on LUCENE-1582:
---

A first version of the patch:
- JavaDocs not finished (examples, documentation) yet
- New classes: IntTrieTokenStream, LongTrieTokenStream
- Removed TrieUtils.trieCodeInt/Long()
- Removed TrieUtils.addIndexFields()
- Removed all fields[] arrays, now only one field name is supported everywhere

To index a trie-encoded field, just use (preferred way):
{code}
Field f = new Field(name, new LongTrieTokenStream(value, precisionStep));
f.setOmitNorms(true);
f.setOmitTermFreqAndPositions(true);
{code}
(maybe TrieUtils supplies a shortcut helper method that uses these special 
optimal settings when creating the field, e.g. TrieUtils.newLongTrieField()). 
This is extensible with TokenFilters, if somebody wants to add payloads and so 
on.
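
Such a shortcut might look like this (newLongTrieField is hypothetical, per 
the suggestion above):
{code}
// hypothetical shortcut helper with the optimal settings baked in:
public static Field newLongTrieField(String name, long value, int precisionStep) {
  Field f = new Field(name, new LongTrieTokenStream(value, precisionStep));
  f.setOmitNorms(true);                 // norms are useless for trie terms
  f.setOmitTermFreqAndPositions(true);  // freqs/positions are not needed either
  return f;
}
{code}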

This patch also contains the sorting fixes in the core: 
FieldCache.StopFillCacheException can be thrown from within the parser. Maybe 
this should be provided as a separate sub-issue (or top-level issue), because I 
cannot apply patches to core. Mike, can you do this when we commit?

Yonik: It would be nice to hear some comments from you, too.

I really like the new way to create trie-encoded fields. When this moves to 
core, the tokenizers can be renamed to IntTokenStream; TrieUtils then only 
contains the converters to/from doubles, the encoding, and the range split.

About the GC note in the description of this issue: The new API does not use 
as many array allocations and copies, and it reuses the Token. But as a 
TokenStream instance must be generated for every numeric value, the GC cost is 
about the same for the new and old API, especially because each TokenStream 
internally creates a LinkedHashMap for the attributes.

Just a question for the indexer people: Is it possible to add two fields with 
the same field name to a document, both with a TokenStream? This is needed to 
add more than one trie encoded value (which worked with the old API). I just 
want to be sure.




[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)

2009-04-03 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695358#action_12695358
 ] 

Michael McCandless commented on LUCENE-1575:


bq. If you agree with any of these names, I can create a new class, deprecate 
the current one, and change the tests back to use the new version (removing 
all those comments about the changes in 3.0). What do you think?

I like this approach.  I like TimeLimitingCollector, or maybe 
TimeoutCollector?


[jira] Updated: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)

2009-04-03 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1575:
---

Attachment: LUCENE-1575.patch


OK, I attached a new patch with some minor changes:

  * Beefed up javadocs in Collector.java; fixed other javadocs
warnings.  Tweaked CHANGES.txt.

  * Renamed PositiveOnlyScoresCollector -> PositiveScoresOnlyCollector

And also came across these questions/issues:

  * TopFieldCollector's updateBottom & add methods take score, and are
passed a score from the non-scoring collectors, but shouldn't be?

  * TermScorer need not override score(HitCollector hc) (super does
the same thing).

  * The changes to TermScorer make me a bit nervous.  EG, the new
InternalScorer: will it hurt performance?  Also this part:
{code}
+// Set the Scorer doc and score before calling collect in case it will be
+// used in collect()
+s.d = doc;
+s.score = score;
+c.collect(doc);  // collect score
{code}
is spooky: I don't like how we worry that one may call scorer.doc() (I
don't like the ambiguity in the API -- we both pass doc and fear you
may call scorer.doc()).  Not sure how to resolve it.

  * Hmm -- we added a new abstract method to
src/java/org/apache/lucene/search/Searcher.java (that accepts
Collector).  Should that method be concrete (and throw UOE), for
back compat?

  * We've also added a method to the Searchable interface, which is
a break in back-compat.  But my feeling is we should allow this
break (but Shai can you add another Note at the top of
CHANGES.txt, calling this out?).



[jira] Commented: (LUCENE-1582) Make TrieRange completely independent from Document/Field with TokenStream of prefix encoded values

2009-04-03 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695362#action_12695362
 ] 

Michael McCandless commented on LUCENE-1582:


bq. Maybe this should be provided as a separate sub-issue (or top-level 
issue), because I cannot apply patches to core. Mike, can you do this when we 
commit?

It's fine to include these changes in this patch -- I can commit them all at 
once.

bq. But as a TokenStream instance must be generated for every numeric value, 
the GC cost is about the same for the new and old API, especially because each 
TokenStream internally creates a LinkedHashMap for the attributes.

Hmm, we should do some perf tests to see how big a deal this turns out to be.  
It'd be nice to get some sort of reuse API working if performance is really 
hurt.  (E.g. Analyzers can provide reusableTokenStream, keyed by thread.)  
You'd presumably have to key on thread & field name.  If you do this then 
probably a shortcut helper method should be the preferred way.

bq. Just a question for the indexer people: Is it possible to add two fields 
with the same field name to a document, both with a TokenStream? 

Each with a different TokenStream instance, right?  Yes, this should be fine; 
the tokens are logically concatenated just like multi-valued String fields.
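
For example (LongTrieTokenStream and precisionStep are from the attached 
patch; the field name is arbitrary):
{code}
Document doc = new Document();
// two values for the same trie field, each with its own TokenStream instance:
doc.add(new Field("price", new LongTrieTokenStream(42L, precisionStep)));
doc.add(new Field("price", new LongTrieTokenStream(7L, precisionStep)));
// at search time the tokens behave as if concatenated, like a multi-valued String field
{code}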




[jira] Commented: (LUCENE-1582) Make TrieRange completely independent from Document/Field with TokenStream of prefix encoded values

2009-04-03 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695364#action_12695364
 ] 

Uwe Schindler commented on LUCENE-1582:
---

bq. Hmm, we should do some perf tests to see how big a deal this turns out to 
be. It'd be nice to get some sort of reuse API working if performance is really 
hurt. (E.g. Analyzers can provide reusableTokenStream, keyed by thread.) You'd 
presumably have to key on thread & field name. If you do this then probably a 
shortcut helper method should be the preferred way.

We can also leave this to the implementor: if somebody indexes thousands of 
documents, he could reuse one instance of the TokenStream for each document. As 
the instance is only read on document addition, he must provide a separate 
instance for each field name, but can reuse it for the next document. This is 
the same as reusing Field instances during indexing.

I can add a setValue() method to the TokenStream that resets it with the new 
value. So one could use one instance and always call setValue() to supply a new 
value for each document. The precisionStep should not be modifiable.
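
A sketch of that reuse pattern (setValue() is the proposed method, not yet in 
the patch; values and writer are assumed to exist):
{code}
LongTrieTokenStream stream = new LongTrieTokenStream(0L, precisionStep);
Field field = new Field("timestamp", stream); // one instance per field name

for (int i = 0; i < values.length; i++) {
  stream.setValue(values[i]);    // reset the stream with the next value
  Document doc = new Document();
  doc.add(field);                // Field and TokenStream are reused
  writer.addDocument(doc);       // the stream is consumed here
}
{code}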

{quote}
bq. Just a question for the indexer people: Is it possible to add two fields 
with the same field name to a document, both with a TokenStream? 

Each with a different TokenStream instance, right? Yes, this should be fine; 
the tokens are logically concatenated just like multi-valued String fields.
{quote}

Yes, sure :-)




[jira] Commented: (LUCENE-1341) BoostingNearQuery class (prototype)

2009-04-03 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695370#action_12695370
 ] 

Grant Ingersoll commented on LUCENE-1341:
-

Hi Peter,

This looks good; I think it just needs some unit tests and then it will be 
ready.

 BoostingNearQuery class (prototype)
 ---

 Key: LUCENE-1341
 URL: https://issues.apache.org/jira/browse/LUCENE-1341
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Query/Scoring
Affects Versions: 2.3.1
Reporter: Peter Keegan
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 3.0

 Attachments: bnq.patch, bnq.patch, BoostingNearQuery.java, 
 BoostingNearQuery.java, LUCENE-1341-new.patch, LUCENE-1341.patch


 This patch implements term boosting for SpanNearQuery. Refer to: 
 http://www.gossamer-threads.com/lists/lucene/java-user/62779
 This patch works but probably needs more work. I don't like the use of 
 'instanceof', but I didn't want to touch Spans or TermSpans. Also, the 
 payload code is mostly a copy of what's in BoostingTermQuery and could be 
 common-sourced somewhere. Feel free to throw darts at it :)




[jira] Commented: (LUCENE-1582) Make TrieRange completely independent from Document/Field with TokenStream of prefix encoded values

2009-04-03 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695400#action_12695400
 ] 

Michael McCandless commented on LUCENE-1582:


bq. I can add a setValue() method to the TokenStream that resets it with the 
new value.

That's a good step forward, but it'd likely mean the default is slower 
performance?  In general I prefer (when realistic) for the default 
out-of-the-box experience to be good performance, but in this case there 
doesn't seem to be an easy way to have a natural high-performance default.  
And e.g. we don't reuse Document & Field by default, so expecting someone to 
do a bit of work to reuse Trie's TokenStreams seems OK.

It's almost like Analyzer.reusableTokenStream(...) should know it's 
dealing with a numeric field, and handle the reuse for you, in a future world 
when Lucene knows that a Field is a NumericField, meant to be indexed using 
trie.  But we can leave all of that for future optimization; for now, providing 
setValue is great.




[jira] Commented: (LUCENE-1539) Improve Benchmark

2009-04-03 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695462#action_12695462
 ] 

Michael McCandless commented on LUCENE-1539:


This patch looks good -- some questions:

  * Is CreateWikiIndex intended to be committed?  I thought not?  I.e. I
thought the goal w/ this issue was to add the necessary tasks so that
CreateWikiIndex could be done as an alg.

  * I think we shouldn't bump to Java 1.5 -- it's only CreateWikiIndex
that needs it anyway (in only 2 places).

  * PrintReaderTask never closes the reader.

  * Not sure why you needed to relax private -> protected in AddDocTask?


 Improve Benchmark
 -

 Key: LUCENE-1539
 URL: https://issues.apache.org/jira/browse/LUCENE-1539
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/benchmark
Affects Versions: 2.4
Reporter: Jason Rutherglen
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1539.patch, LUCENE-1539.patch, sortBench2.py, 
 sortCollate2.py

   Original Estimate: 336h
  Remaining Estimate: 336h

 Benchmark can be improved by incorporating recent suggestions posted
 on java-dev. M. McCandless' Python scripts that execute multiple
 rounds of tests can either be incorporated into the codebase or
 converted to Java.




[jira] Assigned: (LUCENE-1539) Improve Benchmark

2009-04-03 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless reassigned LUCENE-1539:
--

Assignee: Michael McCandless




[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)

2009-04-03 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695478#action_12695478
 ] 

Shai Erera commented on LUCENE-1575:


bq. I like TimeLimitingCollector, or maybe TimeoutCollector?

I like TimeLimitingCollector better, as I think the name makes the class more 
self-explanatory.

bq. TopFieldCollector's updateBottom & add methods take score, and are passed 
a score from the non-scoring collectors, but shouldn't be?

At the end of the day, even the non-scoring collectors store a score in 
ScoreDoc, which is Float.NaN. So they should pass a score. Unlike the scoring 
ones, they always pass Float.NaN without ever calling scorer.score(). That's 
the cleanest way I've found to make the changes to that class w/o duplicating 
implementation all over the place. Notice that the scoring versions extend the 
non-scoring ones and just add score computation, which resulted in a very 
clean implementation.

bq. TermScorer need not override score(HitCollector hc) (super does the same 
thing).

Agreed.

bq. The changes to TermScorer make me a bit nervous.

Since we pass Scorer to Collector, I thought we cannot really rely on anyone 
never calling scorer.doc() or getSimilarity - they are in the API. Since doc() 
is abstract, I had to implement it, and just thought that returning the 
current doc is better than, for example, -1. There are some alternatives I see 
to resolve it:
# Create an abstract ScoringOnlyScorer which extends Scorer and implements all 
methods to throw UOE (also as final), besides score(), which it will define as 
abstract. We then define a ScoringOnlyScorerWrapper which takes a Scorer and 
delegates the score() calls. We use SOSW in places where we can't extend SOS. 
Where we can, we just extend it directly and implement score(), like in the 
InternalScorer case.
# Create a new class which implements just score() (I've yet to come up with a 
good name, since Scorer is already taken) and create a wrapper which takes a 
Scorer and delegates the score() calls to it. Then Collector will use that new 
class, and we're sure that only score() can be called (sketched below).
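
A rough sketch of option 2 (both class names are hypothetical):
{code}
import java.io.IOException;
import org.apache.lucene.search.Scorer;

// exposes only score(); name is hypothetical
abstract class DocScorer {
  public abstract float score() throws IOException;
}

// delegates score() to a real Scorer, hiding doc(), getSimilarity(), etc.
final class ScorerWrapper extends DocScorer {
  private final Scorer scorer;
  ScorerWrapper(Scorer scorer) { this.scorer = scorer; }
  public float score() throws IOException { return scorer.score(); }
}
{code}
Collector.setScorer(DocScorer) would then guarantee that collect() can only 
ever ask for the score.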

The last two comments are completely an oversight on my side. I'm not so sure 
about your proposal though. If we add to Searcher a concrete impl which throws 
UOE, how would that work in 3.0? How would anyone who extends Searcher know 
that it has to override this method? Maybe do it now, and document that in 3.0 
it will become abstract again?
About Searchable, I wonder how many implement Searchable rather than extend 
IndexSearcher. Perhaps, instead of making any changes in back-compat and 
adding documentation to CHANGES, I'll just comment out this method with a TODO 
to reinstate it in 3.0?


[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)

2009-04-03 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695513#action_12695513
 ] 

Michael McCandless commented on LUCENE-1575:


bq. I like TimeLimitingCollector better, as I think the name makes the class 
more self-explanatory.

OK let's go with that!

{quote}
At the end of the day, even the non-scoring collectors store a score in 
ScoreDoc, which is Float.NaN. So they should pass a score. Unlike the scoring 
ones, they always pass Float.NaN without ever calling scorer.score(). That's 
the cleanest way I've found to make the changes to that class w/o duplicating 
implementation all over the place. Notice that the scoring versions extend the 
non-scoring ones and just add score computation, which resulted in a very 
clean implementation.
{quote}

OK... let's stick with this approach for now.  Since the impl is
locked down (ctor for TopFieldCollector is private) we can freely
switch up this API in the future without breaking back compat, if we
want to optimize not passing/copying around the unused score.

Can't the scoring collector impls in TopFieldCollector be final?

bq. Since we pass Scorer to Collector, I thought we cannot really rely on 
anyone never calling scorer.doc() or getSimilarity

Maybe instead make InternalScorer non-static, and then doc() can
return the doc from the TermScorer instance, instead of having to copy
s.d = doc each time?  score can do a similar thing.

Actually, hang on: if I'm using a Collector that doesn't need the
score, TermScorer is still computing it?  We don't want that, right?
Can we simply pass this to setScorer(...)?

bq. If we add to Searcher a concrete impl which throws UOE, how would that 
work in 3.0? How would anyone who extends Searcher know that it has to 
override this method? Maybe do it now, and document that in 3.0 it will become 
abstract again?

OK let's do that?

bq. About Searchable, I wonder how many implement Searchable rather than 
extend IndexSearcher. Perhaps, instead of making any changes in back-compat 
and adding documentation to CHANGES, I'll just comment out this method with a 
TODO to reinstate it in 3.0?

OK.

Make sure that at the end of all of this you open a new issue, marked as
fix version 3.0, that gathers all the 'and then in 3.0 we do XYZ' items
from this one.

 Refactoring Lucene collectors (HitCollector and extensions)
 ---

 Key: LUCENE-1575
 URL: https://issues.apache.org/jira/browse/LUCENE-1575
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Shai Erera
Assignee: Michael McCandless
 Fix For: 2.9

 Attachments: LUCENE-1575.1.patch, LUCENE-1575.2.patch, 
 LUCENE-1575.3.patch, LUCENE-1575.4.patch, LUCENE-1575.5.patch, 
 LUCENE-1575.6.patch, LUCENE-1575.patch, LUCENE-1575.patch


 This issue is a result of a recent discussion we've had on the mailing list. 
 You can read the thread 
 [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html].
 We have agreed to do the following refactoring:
 * Rename MultiReaderHitCollector to Collector, with the purpose that it will 
 be the base class for all Collector implementations.
 * Deprecate HitCollector in favor of the new Collector.
 * Introduce new methods in IndexSearcher that accept Collector, and deprecate 
 those that accept HitCollector.
 ** Create a final class HitCollectorWrapper, and use it in the deprecated 
 methods in IndexSearcher, wrapping the given HitCollector.
 ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, 
 when we remove HitCollector.
 ** It will remove any instanceof checks that currently exist in IndexSearcher 
 code.
 * Create a new (abstract) TopDocsCollector, which will:
 ** Leave collect and setNextReader unimplemented.
 ** Introduce protected members PriorityQueue and totalHits.
 ** Introduce a single protected constructor which accepts a PriorityQueue.
 ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. 
 These can be used as-is by extending classes, as well as overridden.
 ** Introduce a new topDocs(start, howMany) method which will be used as a 
 convenience method when implementing a search application which allows paging 
 through search results. It will also attempt to improve the memory 
 allocation, by allocating a ScoreDoc[] of the requested size only.
 * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() 
 and getTotalHits() implementations as they are from TopDocsCollector. The 
 class will also be made final.
 * Change TopFieldCollector to extend TopDocsCollector, and make the class 
 final. Also implement topDocs(start, howMany).
 * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, 
 instead of TopScoreDocCollector. Implement topDocs(start, howMany)
 * Review other places where HitCollector is used, such as in Scorer, 
 deprecate those places and use Collector instead.
 Additionally, the following proposal was made w.r.t. decoupling score from 
 collect():
 * Change collect to accept only a doc Id (unbased).
 * Introduce a setScorer(Scorer) method.
 * If during collect the implementation needs the score, it can call 
 scorer.score().
 If we do this, then we need to review all places in the code where 
 collect(doc, score) is called, and assert whether Scorer can be passed. Also 
 this raises a few questions:
 * What if during collect() Scorer is null? (i.e., not set) - is it even 
 possible?
 * I noticed that many (if not all) of the collect() implementations discard 
 the document if its score is not greater than 0. Doesn't it mean that score 
 is needed in collect() always?
 Open issues:
 * The name for Collector
 * TopDocsCollector was mentioned on the thread as TopResultsCollector, but 
 that was when we thought to call Collector ResultsCollector. Since we decided 
 (so far) on Collector, I think TopDocsCollector makes sense, because of its 
 TopDocs output.
 * Decoupling score from collect().
 I will post a patch a bit later, as this is expected to be a very large 
 patch. I will split it into 2: (1) code patch (2) test cases (moving to use 
 Collector instead of HitCollector, as well as testing the new topDocs(start, 
 howMany) method.
 There might even be a 3rd patch which handles the setScorer thing in 
 Collector (maybe even a different issue?)

[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)

2009-04-03 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695523#action_12695523
 ] 

Shai Erera commented on LUCENE-1575:


bq. Can't the scoring collector impls in TopFieldCollector be final?

They can, but they are private so they cannot be extended anyway. I can do 
that, but does it really matter?

bq. We don't want that right? Can we simply pass this to setScorer(...)?

That's what I wanted to do, but then noticed that TermScorer's score() method is 
a bit different. However, now that I look at it again, I wonder whether they 
really are different. The difference is that in score(), it does at the end
{code}
return raw * Similarity.decodeNorm(norms[doc]);
{code}
and in score(Collector, int) it does
{code}
float[] normDecoder = Similarity.getNormDecoder();
...
score *= normDecoder[norms[doc] & 0xFF];
{code}

Looking at Similarity.decodeNorm, it does exactly what's done in 
score(Collector, int). So I guess this code has been duplicated for no good 
reason? Please validate what I wrote, and if you agree, I can change the 
entire method (score(Collector, int)) to not compute any score and call 
c.setScorer(this). That will solve it.
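A sketch of what that change to TermScorer.score(Collector, int) might look like (the loop internals, doc field and next() are simplified from memory of the 2.4-era method; treat the body as an assumption):

{code}
// Proposed shape: hand the collector this Scorer once, never compute the
// score eagerly; collectors that need it call scorer.score() themselves.
protected boolean score(Collector c, int end) throws IOException {
  c.setScorer(this);
  while (doc < end) {
    c.collect(doc);      // no per-hit score computation here anymore
    if (!next()) {
      return false;      // exhausted
    }
  }
  return true;           // more docs remain beyond 'end'
}
{code}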

So are you ok with passing Scorer to Collector, instead of just a class with a 
single score() method?

I will open an issue w/ a fix version 3.0 and take care of all those TODOs. 
Should the issue also get rid of the deprecated methods? Or will we have a 
general issue in 3.0 that removes all deprecated methods?


[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)

2009-04-03 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695525#action_12695525
 ] 

Shai Erera commented on LUCENE-1575:


BTW Mike - I think the accidental changes to Searchable and Searcher could have 
been easily detected by test-tags if we had classes in the back-compat tag 
which implemented interfaces / extended abstract classes with empty 
implementations. These are not really junit tests, but if someone changed an 
interface or abstract class, attempting to compile the test package against 
the trunk would fail.

It is not so relevant now, since the next release is 2.9 followed by 3.0, and 
back-compat will completely go away in 3.0, but perhaps post-3.0? Also, it 
would prevent us from making changes to back-compat like we wanted to in this 
issue, but perhaps that's good?



[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)

2009-04-03 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695537#action_12695537
 ] 

Michael McCandless commented on LUCENE-1575:



I ran a first "do no harm" perf test, comparing trunk with this patch:

||query||sort||hits||qps||qpsnew||pctg||
|147|score|   6953|3631.1|3641.8|  0.3%|
|147|title|   6953|2916.7|2255.6|-22.7%|
|147|doc|   6953|3251.2|2676.8|-17.7%|
|text|score| 157101| 208.1| 202.1| -2.9%|
|text|title| 157101|  96.7|  84.8|-12.3%|
|text|doc| 157101| 174.0| 115.2|-33.8%|
|1|score| 565452|  58.0|  56.4| -2.8%|
|1|title| 565452|  44.5|  34.1|-23.4%|
|1|doc| 565452|  49.2|  32.8|-33.3%|
|1 OR 2|score| 784928|  14.1|  13.7| -2.8%|
|1 OR 2|title| 784928|  12.5|  11.5| -8.0%|
|1 OR 2|doc| 784928|  13.0|  11.9| -8.5%|
|1 AND 2|score| 333153|  15.5|  15.5|  0.0%|
|1 AND 2|title| 333153|  14.8|  13.7| -7.4%|
|1 AND 2|doc| 333153|  15.2|  14.2| -6.6%|

Looks like:
 
  * Sort by relevance got maybe a tad slower (~3%)

  * Sort by field is now quite a bit slower (23-33% on term query '1')

This was on a full wikipedia index, with 14 segments, Sun java
1.6.0_07 on OS X Mac Pro quad core, on Intel X25M 160 GB
SSD.

I think we need to iterate some to try to get some performance back.



Re: Future projects

2009-04-03 Thread John Wang
By default bobo DOES use a flavor of the field cache data structure with
some additional information for performance (e.g. minDocid, maxDocid, freq per
term).
Bobo is architected as a platform where clients can write their own
FacetHandlers, in which each FacetHandler manages its own view of the memory
structure, and thus can be more complicated than the field cache.
At LinkedIn, we write FacetHandlers for geo lat/lon filtering and social
network faceting.

-John

On Fri, Apr 3, 2009 at 3:35 AM, Michael McCandless 
luc...@mikemccandless.com wrote:

 On Thu, Apr 2, 2009 at 5:56 PM, Jason Rutherglen
 jason.rutherg...@gmail.com wrote:
  I think I need to understand better why delete by Query isn't
  viable in your situation...
 
  The delete by query is a separate problem which I haven't fully
  explored yet.

 Oh, I had thought we were tugging on this thread in order to explore
 delete-by-docID in the writer.  OK.

  Tracking the segment genealogy is really an
  interim step for merging field caches before column stride
  fields gets implemented.

 I see -- meaning in Bobo you'd like to manage your own memory resident
 field caches, and merge them whenever IW has merged a segment?  Seems
 like you don't need genealogy for that.

  Actually CSF cannot be used with Bobo's
  field caches anyways which means we'd need a way to find out
  about the segment parents.

 CSF isn't really designed yet.  How come it can't be used with Bobo's
 field caches?  We can try to accommodate Bobo's field cache needs when
 designing CSF.

  Does it operate at the segment level? Seems like that'd give
  you good enough realtime performance (though merging in RAM will
  definitely be faster).
 
  We need to see how Bobo integrates with LUCENE-1483.

 Lucene's internal field cache usage is now entirely at the segment
 level (ie, Lucene core should never request a full field cache array at
 the MultiSegmentReader level).  I think Bobo must do the same,
 if it handles near realtime updates, to get adequate performance.

 Though... since we have LUCENE-831 (rework API Lucene exposes for
 accessing arrays-of-atomic-types-per-segment) and LUCENE-1231 (CSF = a
 more efficient impl (than uninversion) of the API we expose in
 LUCENE-831) on deck, we should try to understand Bobo's needs.

 EG how come Bobo made its own field cache impl?  Just because
 uninversion is too slow?

  It seems like we've been talking about CSF for 2 years and there
  isn't a patch for it? If I had more time I'd take a look. What
  is the status of it?

 I think Michael is looking into it?  I'd really like to get it into
 2.9.  We should do it in conjunction with 831 since they are so tied.

  I'll write a patch that implements a callback for the segment
  merging such that the user can decide what information they want
  to record about the merged SRs (I'm pretty sure there isn't a
  way to do this with MergePolicy?)

 Actually I think you can do this w/ a simple MergeScheduler wrapper or
 by subclassing CMS.  I'll put a comment on the issue.

 Mike
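For readers wanting to try the CMS-subclass approach mentioned above, a hedged sketch follows; the doMerge() hook and OneMerge members are from memory of the 2.9-era API, so verify them against your Lucene version before relying on this:

{code}
import java.io.IOException;
import org.apache.lucene.index.ConcurrentMergeScheduler;
import org.apache.lucene.index.MergePolicy;

// Hypothetical sketch: intercept each merge to learn which segments were
// coalesced, so application-level caches can be merged too.
public class RecordingMergeScheduler extends ConcurrentMergeScheduler {
  protected void doMerge(MergePolicy.OneMerge merge) throws IOException {
    recordSources(merge);   // app callback: inspect the source segments first
    super.doMerge(merge);   // run the actual merge
    recordResult(merge);    // merged segment now exists; remap caches here
  }

  private void recordSources(MergePolicy.OneMerge merge) { /* app-specific */ }
  private void recordResult(MergePolicy.OneMerge merge) { /* app-specific */ }
}
{code}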





[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)

2009-04-03 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695541#action_12695541
 ] 

Michael McCandless commented on LUCENE-1575:


{quote}
 Can't the scoring collector impls in TopFieldCollector be final?

They can, but they are private so they cannot be extended anyway. I can do 
that, but does it really matter?
{quote}

I was thinking it might eke out some performance.

bq. So I guess this code has been duplicated for no good reason?

Duplicated for performance I think.

bq. I can change the entire method (score(Collector, int)) to not compute any 
score and call c.setScorer(this). That will solve it.

I think we should try this?

bq. So are you ok with passing Scorer to Collector, instead of just a class 
with a single score() method?

Good question... I'm not sure.  It would be cleaner to expose only
score() (and I think we could add methods over time), but then we'd
be creating a new instance per segment per search, which would only slow
things down.

bq. I will open an issue w/ a fix version 3.0 and take care of all those TODOs. 
Should the issue also get rid of the deprecated methods? Or will we have a 
general issue in 3.0 that removes all deprecated methods?

You don't need to enumerate deprecated methods to get rid of -- we
won't forget those ones :) It's these other special tasks that may
slip through the cracks.



[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)

2009-04-03 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695543#action_12695543
 ] 

Michael McCandless commented on LUCENE-1575:


bq. BTW Mike - I think the accidental changes to Searchable and Searcher could 
have been easily detected by test-tags if we had classes in the back-compat tag 
which implemented interfaces / extended abstract classes with empty 
implementations. These are not really junit tests, but if someone would have 
changed an interface or abstract class, then attempting to compile the test 
package against the trunk would fail.

I think that's a great idea!  Every interface/abstract class should
have a "just compile me" subclass in the tests.
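A hedged illustration of the pattern, shown against a made-up interface so the example stands alone (the real stubs would target Searchable, Searcher, and friends):

{code}
// Lives in the back-compat test tag; its only job is to compile. If trunk
// adds or changes an abstract method, compiling the tag fails immediately.
interface SomePublicApi {              // stand-in for e.g. Searchable
  int doSomething(String arg);
}

final class JustCompileSomePublicApi implements SomePublicApi {
  public int doSomething(String arg) {
    throw new UnsupportedOperationException("compile-time check only");
  }
}
{code}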

bq. It is not so relevant now, since the next release is 2.9 following by a 3.0 
and back-compat will completely go away in 3.0, but perhaps post 3.0?

It is relevant because neither Searchable nor Searcher is deprecated
(yet)?  Ie, during development of 2.9 and of 3.0 we have to ensure we
don't break back compat of non-deprecated APIs.

So maybe fold this in on the next patch iteration?

bq. Also, it will prevent us from making changes to back-compat like we wanted 
to in this issue, but perhaps it's good?

It's good, because it'd raise the issue right away vs. us catching it (or
not) by staring at the code :)



[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)

2009-04-03 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695553#action_12695553
 ] 

Jason Rutherglen commented on LUCENE-1575:
--

Something related to time-limiting collectors that we may want to
solve (maybe not in this patch) is passing the time limit down to
the sub-scorers. At the hit collector level, the sub-scorers of a
multi-clause query could stay busy well past the time limit before
returning the first hit?



[jira] Updated: (LUCENE-1582) Make TrieRange completely independent from Document/Field with TokenStream of prefix encoded values

2009-04-03 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-1582:
--

Attachment: LUCENE-1582.patch

 Make TrieRange completely independent from Document/Field with TokenStream of 
 prefix encoded values
 ---

 Key: LUCENE-1582
 URL: https://issues.apache.org/jira/browse/LUCENE-1582
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/*
Affects Versions: 2.9
Reporter: Uwe Schindler
Assignee: Uwe Schindler
 Fix For: 2.9

 Attachments: LUCENE-1582.patch, LUCENE-1582.patch


 TrieRange currently has the following problems:
 - To add a field that uses trie encoding, you can manually add each term 
 to the index or use a helper method from TrieUtils. The helper method has the 
 problem that it uses a fixed field configuration
 - TrieUtils currently creates by default a helper field containing the lower 
 precision terms to enable sorting (limitation of one term/document for 
 sorting)
 - trieCodeLong/Int() unnecessarily creates String[] and char[] arrays, which 
 is heavy on GC if you index a lot of numeric values. A lot of char[] to 
 String copying is also involved.
 This issue should improve this:
 - trieCodeLong/Int() returns a TokenStream. During encoding, all char[] 
 arrays are reused by the Token API, additional String[] arrays for the encoded 
 result are not created; instead the TokenStream enumerates the trie values.
 - Trie fields can be added to Documents during indexing using the standard 
 API: new Field(name,TokenStream,...), so no extra util method is needed. By 
 using token filters, one could also add payloads and so on, and customize 
 everything.
 The drawback is: Sorting would not work anymore. To enable sorting, a 
 (sub-)issue can extend the FieldCache to stop iterating the terms as soon as 
 a lower precision one is enumerated by TermEnum. I will create a hack patch 
 for TrieUtils use only, that uses an unchecked Exception in the Parser to 
 stop iteration (see the sketch below). With LUCENE-831, a more generic API 
 for this type can be used (custom parser/iterator implementation for 
 FieldCache). I will attach the field cache patch (with the temporary 
 solution, until FieldCache is reimplemented) as a separate patch file, or 
 maybe open another issue for it.
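A hedged sketch of that hack, assuming a FieldCache.LongParser and an unchecked marker exception; the precision test and decoder are placeholders, not the patch's actual code:

{code}
// Parser that aborts FieldCache loading once lower-precision trie terms
// appear; the unchecked exception is caught by the (patched) cache filler.
final class StopFillCacheException extends RuntimeException {}

FieldCache.LongParser parser = new FieldCache.LongParser() {
  public long parseLong(String term) {
    if (!isFullPrecision(term))            // placeholder precision test
      throw new StopFillCacheException();  // stop TermEnum iteration early
    return decodeTrieTerm(term);           // placeholder decoder
  }
};
{code}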




[jira] Commented: (LUCENE-1582) Make TrieRange completely independent from Document/Field with TokenStream of prefix encoded values

2009-04-03 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695554#action_12695554
 ] 

Uwe Schindler commented on LUCENE-1582:
---

Updated patch:
- supports a setValue() to reset the TokenStream with a new value for reuse (as 
discussed before; sketched below)
- completed JavaDocs
- removed dead code parts
- small change in the RangeBuilder API (removed unneeded parameters)

The difference between reusing fields and token streams and always creating 
new ones is measurable (I compared in the test case), but not significant. The 
JavaDocs explain how to reuse.

I have done everything that I planned; now it's time to discuss the change.
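A hedged sketch of the reuse pattern being described; the stream type name, its constructor, and the setValue() signature follow the comment above but are assumptions, and the exact names in the patch may differ:

{code}
// Hypothetical reuse pattern: one Document, one Field, one trie TokenStream.
Document doc = new Document();
LongTrieTokenStream stream = new LongTrieTokenStream(precisionStep); // assumed type
doc.add(new Field("price", stream));        // Field(String, TokenStream)
for (int i = 0; i < values.length; i++) {
  stream.setValue(values[i]);               // reset the reusable stream
  writer.addDocument(doc);                  // same Document/Field instances
}
{code}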





[jira] Commented: (LUCENE-1584) Callback for intercepting merging segments in IndexWriter

2009-04-03 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695557#action_12695557
 ] 

Jason Rutherglen commented on LUCENE-1584:
--

I would like to move away from our current position of somewhat
closed APIs that require user classes to be part of the Lucene
packages.

It's always best to reuse existing APIs; however, we've migrated
to OSGi, which means any time we need to place new classes in
Lucene packages, we need to roll out specific JARs (I think;
perhaps it's more complex) for the few classes outside of our
main package classes. This makes deployment of search
applications a bit more difficult and time-consuming.

A related thread regarding MergePolicy is at:
http://markmail.org/thread/h5bxjflpcyejrcqg 

 Callback for intercepting merging segments in IndexWriter
 -

 Key: LUCENE-1584
 URL: https://issues.apache.org/jira/browse/LUCENE-1584
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 2.4.1
Reporter: Jason Rutherglen
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1584.patch

   Original Estimate: 96h
  Remaining Estimate: 96h

 For things like merging field caches or bitsets, it's useful to
 know which segments were merged to create a new segment.




[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)

2009-04-03 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695575#action_12695575
 ] 

Shai Erera commented on LUCENE-1575:




How do I run such a test? Is there an algorithm for that in the benchmark
package?

I compared the new TSDC to the trunk's version and the new code does ('-'
means a negative change, '+' means a positive change, '|' means
neither/undetermined):
* adds one collector.setScorer() call to each query. (-)
* The scorer.score() call in collect() was just moved from whoever called
collect() to inside collect(), so I don't think there's a difference. (|)
* Does not check if score > 0.0f in each collect (+) (see the sketch below)
* implements the new topDocs() method. Previously, it just implemented
topDocs() which returned everything. Now, topDocs() calls topDocs(0,
pq.size()), which verifies parameters and such - since that's executed once
at the end of the search, I doubt that it has any major effect on the
results.
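For reference, the new TSDC collect() essentially boils down to this shape, reconstructed from the differences listed above rather than quoted from the patch (docBase is assumed to be remembered in setNextReader, and insertWithOverflow stands in for whatever PQ insert the patch actually uses):

{code}
// One setScorer() call per segment/search; no 'score > 0.0f' pre-filter.
public void collect(int doc) throws IOException {
  float score = scorer.score();   // scorer was handed over via setScorer()
  totalHits++;
  pq.insertWithOverflow(new ScoreDoc(doc + docBase, score));
}
{code}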

BTW, as I scanned through the code I noticed that previously TSDC returned
maxScore = Float.NEGATIVE_INFINITY in case there were 0 results for the
query, and now it returns Float.NaN. I'm not sure whether this breaks
anything, since maxScore is probably used (if at all) for normalization of
scores, and in case there are 0 results you don't really have anything to
normalize ...

Regarding TopFieldDocs, I am quite surprised. I assume the test uses the
OneComparatorScoringCollector, which means scores are computed:
* It has the same issue as in TSDC regarding topDocs(). So I think it should
be changed here as well, however I doubt that's the cause of the
performance hit.
* It computes the score and then does super.collect(), which adds a method
call (-)
* It doesn't check if the score is > 0 (+)
* It calls comparator.setScorer, which is ignored in all comparators besides
RelevanceComparator. Not sure if it has any performance effect (|)
The rest of the code in collect() is exactly the same.

Can it be that super.collect() has such an effect? When I think about the
results of TSDC (-3%) vs. TFC (-28% on avg.), it might be, since
setScorer() is called once before the series of collect() calls, whereas
super.collect() is called for every document. Your index is large (2M
documents, right?) and I don't know how many results there are for each
query; if they are in the range of 100Ks, that could be the explanation.

Mike - in case it's faster for you to run it, can you try to run the test
again with a change in the code which inlines super.collect() into
OneComparatorScoringCollector and compare the results again? I will run it
also after you tell me which algorithm you used, but only tomorrow morning,
so if you get to do it before then, that'd be great.

I doubt that the change in topDocs() affects the query time that much, since
it's called at the end of the search, and doing 4-5 'if' statements is
really not that expensive (I mean once per entire search), compared to
ScoreDoc[] array allocation, fetching stored fields from the index, etc. So
I'd hate to implement all 3 topDocs() in each of the TopDocsCollector
extensions unless it proves to be a problem.

Shai

On Fri, Apr 3, 2009 at 10:02 PM, Michael McCandless (JIRA)
j...@apache.org wrote:




[jira] Commented: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector

2009-04-03 Thread Jeremy Volkman (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695577#action_12695577
 ] 

Jeremy Volkman commented on LUCENE-1483:


I'm trying to create a FieldValueHitQueue outside of an IndexSearcher. One part 
of my code collects all results in a fashion similar to 
http://www.gossamer-threads.com/lists/lucene/java-user/66362#66362. At the end 
of my collection, I used to pass the results through a FieldSortedHitQueue of 
the proper size to get sorted results. The problem now is that 
FieldValueHitQueue takes an array of subreaders instead of one IndexReader. As 
far as I can tell, there's no way for me to get a proper sorted array of 
subreaders for an IndexReader without copying and pasting the gatherSubReaders 
and sortSubReaders methods from IndexSearcher. This isn't desirable, so could 
IndexSearcher perhaps provide some sort of getSortedSubReaders() method? Either 
that, or extract this functionality out into a common utility method that 
IndexSearcher uses.
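Until such a method exists, something like the following utility can flatten a reader, assuming the 2.9 IndexReader.getSequentialSubReaders() API; the class and method names here are made up:

{code}
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.index.IndexReader;

public final class ReaderUtil {
  // Flattens a (possibly composite) reader into its leaf subreaders, in
  // document order, which is what FieldValueHitQueue expects.
  public static IndexReader[] gatherSubReaders(IndexReader r) {
    List leaves = new ArrayList();
    gather(leaves, r);
    return (IndexReader[]) leaves.toArray(new IndexReader[leaves.size()]);
  }

  private static void gather(List leaves, IndexReader r) {
    IndexReader[] subs = r.getSequentialSubReaders();
    if (subs == null) {
      leaves.add(r); // already a leaf, e.g. a SegmentReader
    } else {
      for (int i = 0; i < subs.length; i++) gather(leaves, subs[i]);
    }
  }
}
{code}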

 Change IndexSearcher multisegment searches to search each individual segment 
 using a single HitCollector
 

 Key: LUCENE-1483
 URL: https://issues.apache.org/jira/browse/LUCENE-1483
 Project: Lucene - Java
  Issue Type: Improvement
Affects Versions: 2.9
Reporter: Mark Miller
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1483-backcompat.patch, LUCENE-1483-partial.patch, 
 LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
 LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
 LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
 LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
 LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
 LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
 LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
 LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
 LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, sortBench.py, 
 sortCollate.py


 This issue changes how an IndexSearcher searches over multiple segments. The 
 current method of searching multiple segments is to use a MultiSegmentReader 
 and treat all of the segments as one. This causes filters and FieldCaches to 
 be keyed to the MultiReader and makes reopen expensive. If only a few 
 segments change, the FieldCache is still loaded for all of them.
 This patch changes things by searching each individual segment one at a time, 
 but sharing the HitCollector used across each segment. This allows 
 FieldCaches and Filters to be keyed on individual SegmentReaders, making 
 reopen much cheaper. FieldCache loading over multiple segments can be much 
 faster as well - with the old method, all unique terms for every segment are 
 enumerated against each segment - because of the likely logarithmic change in 
 terms per segment, this can be very wasteful. Searching individual segments 
 avoids this cost. The term/document statistics from the multireader are used 
 to score results for each segment.
 When sorting, it's more difficult to use a single HitCollector for each sub 
 searcher. Ordinals are not comparable across segments. To account for this, a 
 new field-sort-enabled HitCollector is introduced that is able to collect and 
 sort across segments (because of its ability to compare ordinals across 
 segments). This TopFieldCollector class will collect the values/ordinals for 
 a given segment, and upon moving to the next segment, translate any 
 ordinals/values so that they can be compared against the values for the new 
 segment. This is done lazily.
 All in all, the switch seems to provide numerous performance benefits, in 
 both sorted and non-sorted search. We were seeing a sizable loss on indices 
 with lots of segments (1000?) and certain queue sizes / queries, but the 
 latest results seem to show that's been mostly taken care of (you shouldn't 
 be using such a large queue on such a segmented index anyway).
 * Introduces
 ** MultiReaderHitCollector - a HitCollector that can collect across multiple 
 IndexReaders. Old HitCollectors are wrapped to support multiple IndexReaders.
 ** TopFieldCollector - a HitCollector that can compare values/ordinals across 
 IndexReaders and sort on fields.
 ** FieldValueHitQueue - a Priority queue that is part of the 
 TopFieldCollector implementation.
 ** FieldComparator - a new Comparator class that works across IndexReaders. 
 Part of the TopFieldCollector implementation.
 ** FieldComparatorSource - new class to allow for custom Comparators.
 * Alters
 ** IndexSearcher uses a single HitCollector to collect hits against each 
 individual SegmentReader. All the other changes stem from this ;)
 * Deprecates
 ** TopFieldDocCollector
 ** FieldSortedHitQueue
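In outline, the per-segment search loop this issue introduces looks roughly like the following; the signatures are simplified assumptions (the real IndexSearcher also threads a Filter and wrapping of plain HitCollectors through):

{code}
// Sketch of the core idea: one collector, notified per segment with its
// docBase, so FieldCaches/Filters key on each SegmentReader.
void searchEachSegment(Weight weight, IndexReader[] subReaders,
                       int[] docStarts, MultiReaderHitCollector collector)
    throws IOException {
  for (int i = 0; i < subReaders.length; i++) {
    collector.setNextReader(subReaders[i], docStarts[i]); // new segment + base
    Scorer scorer = weight.scorer(subReaders[i]);          // assumed signature
    if (scorer != null) scorer.score(collector);           // segment-local docIds
  }
}
{code}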

[jira] Commented: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector

2009-04-03 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695586#action_12695586
 ] 

Shai Erera commented on LUCENE-1483:


Hi Jeremy

This will be taken care of in 1575 by removing the IndexReader[] arg from 
TopFieldCollector. As a matter of fact, 1575 changes the collector's API 
quite a bit, so you might want to take a look there. Anyway, I ran into 
the same issue there and realized this arg can be safely removed from 
TopFieldCollector as well as FieldValueHitQueue.



[jira] Commented: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector

2009-04-03 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695587#action_12695587
 ] 

Uwe Schindler commented on LUCENE-1483:
---

This will be changed as part of LUCENE-1575.

 Change IndexSearcher multisegment searches to search each individual segment 
 using a single HitCollector
 

 Key: LUCENE-1483
 URL: https://issues.apache.org/jira/browse/LUCENE-1483
 Project: Lucene - Java
  Issue Type: Improvement
Affects Versions: 2.9
Reporter: Mark Miller
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1483-backcompat.patch, LUCENE-1483-partial.patch, 
 LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
 LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
 LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
 LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
 LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
 LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
 LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
 LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
 LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, sortBench.py, 
 sortCollate.py


 This issue changes how an IndexSearcher searches over multiple segments. The 
 current method of searching multiple segments is to use a MultiSegmentReader 
 and treat all of the segments as one. This causes filters and FieldCaches to 
 be keyed to the MultiReader and makes reopen expensive. If only a few 
 segments change, the FieldCache is still loaded for all of them.
 This patch changes things by searching each individual segment one at a time, 
 but sharing the HitCollector used across each segment. This allows 
 FieldCaches and Filters to be keyed on individual SegmentReaders, making 
 reopen much cheaper. FieldCache loading over multiple segments can be much 
 faster as well: with the old method, all unique terms across every segment are 
 enumerated against each segment; because the number of unique terms likely grows 
 logarithmically per segment, this can be very wasteful. Searching individual 
 segments avoids this cost. The term/document statistics from the MultiReader 
 are used to score results for each segment.
 When sorting, it's more difficult to use a single HitCollector for each 
 sub-searcher: ordinals are not comparable across segments. To account for this, 
 a new field-sort-enabled HitCollector is introduced that is able to collect and 
 sort across segments (because of its ability to compare ordinals across 
 segments). This TopFieldCollector class will collect the values/ordinals for 
 a given segment and, upon moving to the next segment, translate any 
 ordinals/values so that they can be compared against the values for the new 
 segment. This is done lazily.
 All in all, the switch seems to provide numerous performance benefits, in 
 both sorted and non-sorted search. We were seeing a noticeable performance loss 
 on indices with lots of segments (1000?) and certain queue sizes / queries, but 
 the latest results seem to show that's been mostly taken care of (you shouldn't 
 be using such a large queue on such a heavily segmented index anyway).
 * Introduces
 ** MultiReaderHitCollector - a HitCollector that can collect across multiple 
 IndexReaders. Old HitCollectors are wrapped to support multiple IndexReaders.
 ** TopFieldCollector - a HitCollector that can compare values/ordinals across 
 IndexReaders and sort on fields.
 ** FieldValueHitQueue - a Priority queue that is part of the 
 TopFieldCollector implementation.
 ** FieldComparator - a new Comparator class that works across IndexReaders. 
 Part of the TopFieldCollector implementation.
 ** FieldComparatorSource - new class to allow for custom Comparators.
 * Alters
 ** IndexSearcher uses a single HitCollector to collect hits against each 
 individual SegmentReader. All the other changes stem from this ;)
 * Deprecates
 ** TopFieldDocCollector
 ** FieldSortedHitQueue
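 A minimal sketch of a collector under this per-segment model, assuming the 
 MultiReaderHitCollector shape introduced above (these are patch-era 
 signatures; the committed API may differ):
 {code}
 import org.apache.lucene.index.IndexReader;

 // Counts hits and tracks the last hit in the top-level doc id space,
 // even though collect() receives segment-relative ids.
 public class CountingCollector extends MultiReaderHitCollector {
   private int docBase;            // doc id offset of the current segment
   private int count;
   private int lastGlobalDoc = -1;

   public void setNextReader(IndexReader reader, int docBase) {
     this.docBase = docBase;       // called once per segment
   }

   public void collect(int doc, float score) {
     count++;
     lastGlobalDoc = docBase + doc;  // rebase into the MultiReader's space
   }

   public int getCount() { return count; }
 }
 {code}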

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Future projects

2009-04-03 Thread Jason Rutherglen
 meaning in Bobo you'd like to manage your own memory resident
field caches, and merge them whenever IW has merged a segment?
Seems like you don't need genealogy for that.

Agreed, there is no need for full genealogy.

 CSF isn't really designed yet. How come it can't be used with
Bobo's field caches?

I guess CSF should be able to support it, makes sense. As long
as the container is flexible with the encoding (I need to look
into this more on the Bobo side).

 Lucene's internal field cache usage is now entirely at the
segment level (ie, Lucene core should never request full field
cache array at the MultiSegmentReader level). I think Bobo must
have to do the same, if it handles near realtime updates, to get
adequate performance.

Bobo needs to migrate to this model, I don't think we've done
that yet.

 EG how come Bobo made its own field cache impl? Just because
uninversion is too slow?

It could be integrated once LUCENE-831 is completed. I think the
current model of a weak reference and the inability to unload if
needed is a concern.  I don't think it's because of uninversion.

On Fri, Apr 3, 2009 at 3:35 AM, Michael McCandless 
luc...@mikemccandless.com wrote:

 On Thu, Apr 2, 2009 at 5:56 PM, Jason Rutherglen
 jason.rutherg...@gmail.com wrote:
  I think I need to understand better why delete by Query isn't
  viable in your situation...
 
  The delete by query is a separate problem which I haven't fully
  explored yet.

 Oh, I had thought we were tugging on this thread in order to explore
 delete-by-docID in the writer.  OK.

  Tracking the segment genealogy is really an
  interim step for merging field caches before column stride
  fields gets implemented.

 I see -- meaning in Bobo you'd like to manage your own memory resident
 field caches, and merge them whenever IW has merged a segment?  Seems
 like you don't need genealogy for that.

  Actually CSF cannot be used with Bobo's
  field caches anyways which means we'd need a way to find out
  about the segment parents.

 CSF isn't really designed yet.  How come it can't be used with Bobo's
 field caches?  We can try to accommodate Bobo's field cache needs when
 designing CSF.

  Does it operate at the segment level? Seems like that'd give
  you good enough realtime performance (though merging in RAM will
  definitely be faster).
 
  We need to see how Bobo integrates with LUCENE-1483.

 Lucene's internal field cache usage is now entirely at the segment
 level (ie, Lucene core should never request full field cache array at
 the MultiSegmentReader level).  I think Bobo must have to do the same,
 if it handles near realtime updates, to get adequate performance.

 Though... since we have LUCENE-831 (rework API Lucene exposes for
 accessing arrays-of-atomic-types-per-segment) and LUCENE-1231 (CSF = a
 more efficient impl (than uninversion) of the API we expose in
 LUCENE-831) on deck, we should try to understand Bobo's needs.

 EG how come Bobo made its own field cache impl?  Just because
 uninversion is too slow?

  It seems like we've been talking about CSF for 2 years and there
  isn't a patch for it? If I had more time I'd take a look. What
  is the status of it?

 I think Michael is looking into it?  I'd really like to get it into
 2.9.  We should do it in conjunction with 831 since they are so tied.

  I'll write a patch that implements a callback for the segment
  merging such that the user can decide what information they want
  to record about the merged SRs (I'm pretty sure there isn't a
  way to do this with MergePolicy?)

 Actually I think you can do this w/ a simple MergeScheduler wrapper or
 by subclassing CMS.  I'll put a comment on the issue.
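
 Something like this might work (treating the exact 2.9-era hooks, a
 protected doMerge(OneMerge) on CMS and OneMerge.segString(Directory),
 as assumptions to verify against the tree):

 {code}
 import java.io.IOException;
 import org.apache.lucene.index.ConcurrentMergeScheduler;
 import org.apache.lucene.index.MergePolicy;
 import org.apache.lucene.store.Directory;

 public class TrackingMergeScheduler extends ConcurrentMergeScheduler {
   private final Directory dir;

   public TrackingMergeScheduler(Directory dir) {
     this.dir = dir;
   }

   protected void doMerge(MergePolicy.OneMerge merge) throws IOException {
     // The source segments are known before the merge runs...
     System.out.println("merging " + merge.segString(dir));
     super.doMerge(merge);
     // ...and afterwards the application can merge its own per-segment
     // field caches to match the newly created segment.
     System.out.println("merged  " + merge.segString(dir));
   }
 }
 {code}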

 Mike

 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org




[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)

2009-04-03 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695595#action_12695595
 ] 

Shai Erera commented on LUCENE-1575:


BTW, I can change FieldValueHitQueue like I changed TopFieldCollector, by
introducing a factory create() method which will return either a
OneComparatorFieldValueHitQueue or a MultiComparatorsFieldValueHitQueue.
Today, FVHQ.lessThan checks numComparators on each call, which is
redundant.

Also, the class isn't final, and I'm not sure whether we want to change that.

What do you think?
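
A quick sketch of what I mean (illustrative, not a patch; the subclass
bodies are simplified stand-ins):

{code}
import java.util.Comparator;

abstract class FieldValueHitQueue<T> {

  // create() picks the subclass once, so lessThan() never re-checks
  // the comparator count per call.
  static <T> FieldValueHitQueue<T> create(Comparator<T>[] comparators) {
    return comparators.length == 1
        ? new OneComparatorFieldValueHitQueue<T>(comparators[0])
        : new MultiComparatorsFieldValueHitQueue<T>(comparators);
  }

  abstract boolean lessThan(T hitA, T hitB);

  private static final class OneComparatorFieldValueHitQueue<T>
      extends FieldValueHitQueue<T> {
    private final Comparator<T> comparator;

    OneComparatorFieldValueHitQueue(Comparator<T> comparator) {
      this.comparator = comparator;
    }

    boolean lessThan(T hitA, T hitB) {
      // Single sort field: no loop, no length check.
      return comparator.compare(hitA, hitB) < 0;
    }
  }

  private static final class MultiComparatorsFieldValueHitQueue<T>
      extends FieldValueHitQueue<T> {
    private final Comparator<T>[] comparators;

    MultiComparatorsFieldValueHitQueue(Comparator<T>[] comparators) {
      this.comparators = comparators;
    }

    boolean lessThan(T hitA, T hitB) {
      // Walk the comparators until one breaks the tie.
      for (int i = 0; i < comparators.length; i++) {
        int cmp = comparators[i].compare(hitA, hitB);
        if (cmp != 0) {
          return cmp < 0;
        }
      }
      return false;
    }
  }
}
{code}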




 Refactoring Lucene collectors (HitCollector and extensions)
 ---

 Key: LUCENE-1575
 URL: https://issues.apache.org/jira/browse/LUCENE-1575
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Shai Erera
Assignee: Michael McCandless
 Fix For: 2.9

 Attachments: LUCENE-1575.1.patch, LUCENE-1575.2.patch, 
 LUCENE-1575.3.patch, LUCENE-1575.4.patch, LUCENE-1575.5.patch, 
 LUCENE-1575.6.patch, LUCENE-1575.patch, LUCENE-1575.patch


 This issue is a result of a recent discussion we've had on the mailing list. 
 You can read the thread 
 [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html].
 We have agreed to do the following refactoring:
 * Rename MultiReaderHitCollector to Collector, with the purpose that it will 
 be the base class for all Collector implementations.
 * Deprecate HitCollector in favor of the new Collector.
 * Introduce new methods in IndexSearcher that accept Collector, and deprecate 
 those that accept HitCollector.
 ** Create a final class HitCollectorWrapper, and use it in the deprecated 
 methods in IndexSearcher, wrapping the given HitCollector.
 ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, 
 when we remove HitCollector.
 ** It will remove any instanceof checks that currently exist in IndexSearcher 
 code.
 * Create a new (abstract) TopDocsCollector, which will:
 ** Leave collect and setNextReader unimplemented.
 ** Introduce protected members PriorityQueue and totalHits.
 ** Introduce a single protected constructor which accepts a PriorityQueue.
 ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. 
 These can be used as-is by extending classes, as well as overridden.
 ** Introduce a new topDocs(start, howMany) method which will be used as a 
 convenience method when implementing a search application which allows paging 
 through search results. It will also attempt to improve the memory 
 allocation by allocating a ScoreDoc[] of the requested size only.
 * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() 
 and getTotalHits() implementations as they are from TopDocsCollector. The 
 class will also be made final.
 * Change TopFieldCollector to extend TopDocsCollector, and make the class 
 final. Also implement topDocs(start, howMany).
 * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, 
 instead of TopScoreDocCollector. Implement topDocs(start, howMany)
 * Review other places where HitCollector is used, such as in Scorer, 
 deprecate those places and use Collector instead.
 Additionally, the following proposal was made w.r.t. decoupling score from 
 collect() (see the sketch at the end of this description):
 * Change collect to accept only a doc Id (unbased).
 * Introduce a setScorer(Scorer) method.
 * If during collect the implementation needs the score, it can call 
 scorer.score().
 If we do this, then we need to review all places in the code where 
 collect(doc, score) is called, and assess whether a Scorer can be passed. Also 
 this raises a few questions:
 * What if during collect() Scorer is null? (i.e., not set) - is it even 
 possible?
 * I noticed that many (if not all) of the collect() implementations discard 
 the document if its score is not greater than 0. Doesn't it mean that score 
 is needed in collect() always?
 Open issues:
 * The name for Collector
 * TopDocsCollector was mentioned on the thread as TopResultsCollector, but 
 that was when we thought to call Collector ResultsCollector. Since we decided 
 (so far) on Collector, I think TopDocsCollector makes sense, because of its 
 TopDocs output.
 * Decoupling score from collect().
 I will post a patch a bit later, as this is expected to be a very large 
 patch. I will split it into 2: (1) the code patch, (2) test cases (moving to 
 use Collector instead of HitCollector, as well as testing the new 
 topDocs(start, howMany) method).
 There might even be a 3rd patch which handles the setScorer thing in 
 Collector (maybe even a different issue?)
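 A sketch of the proposed shape (illustrative only; method names follow the 
 lists above, not a final API):
 {code}
 import java.io.IOException;
 import org.apache.lucene.index.IndexReader;
 import org.apache.lucene.search.Scorer;

 public abstract class Collector {

   // Called before hits are collected from a new Scorer.
   public abstract void setScorer(Scorer scorer) throws IOException;

   // Called once per segment; subsequent doc ids are segment-relative.
   public abstract void setNextReader(IndexReader reader, int docBase)
       throws IOException;

   // doc is unbased; implementations call scorer.score() only when the
   // score is actually needed.
   public abstract void collect(int doc) throws IOException;
 }
 {code}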

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (LUCENE-1231) Column-stride fields (aka per-document Payloads)

2009-04-03 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695666#action_12695666
 ] 

Michael Busch commented on LUCENE-1231:
---

For the search side we need an API similar to TermDocs and Payloads;
let's call it ColumnStrideFieldAccessor (CSFA) for now. It should have
next(), skipTo(), doc(), etc. methods.
However, the way TermPositions#getPayload() currently works
is that it always forces you to copy the bytes from the underlying
IndexInput into the payload byte[] array. Since we usually use a
BufferedIndexInput, this is then an arraycopy from
BufferedIndexInput's buffer array into the byte array.

I think to improve this we could allow users to call methods like
readVInt() directly on the CSFA. So I was thinking about adding
DataInput and DataOutput as superclasses of IndexInput and
IndexOutput. DataIn(Out)put would implement the different read and
write methods, whereas IndexIn(Out)put would only implement methods
like close(), seek(), getFilePointer(), length(), flush(), etc.
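
A sketch of the proposed split (hypothetical shape; only readByte() and
readVInt() are shown, and the method split follows the description above):

{code}
import java.io.IOException;

public abstract class DataInput {

  public abstract byte readByte() throws IOException;

  // Standard Lucene VInt decoding, living on the data-reading superclass.
  public int readVInt() throws IOException {
    byte b = readByte();
    int i = b & 0x7F;
    for (int shift = 7; (b & 0x80) != 0; shift += 7) {
      b = readByte();
      i |= (b & 0x7F) << shift;
    }
    return i;
  }
}

// In a separate source file: only the file-oriented methods remain here.
public abstract class IndexInput extends DataInput {
  public abstract void close() throws IOException;
  public abstract long getFilePointer();
  public abstract void seek(long pos) throws IOException;
  public abstract long length();
}
{code}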

So then CSFA would extend DataInput or alternatively have a
getDataInput() method. The danger here compared to the current
payloads API would be that the user might read too few or too many
bytes of a CSF, which would result in an undefined and possibly hard
to debug behavior. But we could offer e.g.:

{code}
static ColumnStrideFieldsAccessor getAccessor(ColumnStrideFieldsAccessor in,
                                              Mode mode) {
  if (mode == Mode.Fast) {
    return in;
  } else if (mode == Mode.Safe) {
    return new SafeAccessor(in);
  }
  throw new IllegalArgumentException("unknown mode: " + mode);
}
{code}

The SafeAccessor would count for you the number of read bytes and
throw exceptions if you don't consume the number of bytes you should
consume. This is of course overhead, but users could use the
SafeAccessor until they're confident that everything works fine in
their system, and then switch to the fast accessor for better
performance.
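
For illustration, a hypothetical SafeAccessor along those lines (all names
here are made up; it reuses the DataInput shape sketched above):

{code}
import java.io.IOException;

// Wraps the raw input and verifies each CSF value is consumed exactly.
class SafeAccessor {
  private final DataInput in;
  private int remaining;    // bytes left in the current document's value

  SafeAccessor(DataInput in) {
    this.in = in;
  }

  // Called when positioned on a document whose value has the given length.
  void startValue(int length) {
    this.remaining = length;
  }

  byte readByte() throws IOException {
    if (remaining <= 0) {
      throw new IOException("read past the end of the current CSF value");
    }
    remaining--;
    return in.readByte();
  }

  // Called before advancing to the next document.
  void endValue() throws IOException {
    if (remaining != 0) {
      throw new IOException(remaining + " unread bytes left in CSF value");
    }
  }
}
{code}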

If there are no objections I will open a separate JIRA issue for the
DataInput/Output patch.

 Column-stride fields (aka per-document Payloads)
 

 Key: LUCENE-1231
 URL: https://issues.apache.org/jira/browse/LUCENE-1231
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
 Fix For: 3.0


 This new feature has been proposed and discussed here:
 http://markmail.org/search/?q=per-document+payloads#query:per-document%20payloads+page:1+mid:jq4g5myhlvidw3oc+state:results
 Currently it is possible in Lucene to store data as stored fields or as 
 payloads.
 Stored fields provide good performance if you want to load all fields for one
 document, because this is a sequential I/O operation.
 If, however, you want to load the data from one field for a large number of 
 documents, then stored fields perform quite badly, because lots of I/O seeks 
 might have to be performed. 
 A better way to do this is using payloads. By creating a special posting list
 that has one posting with payload for each document, you can simulate a 
 column-stride field. The performance is significantly better compared to 
 stored fields, however still not optimal. The reason is that for each 
 document the freq value, which is in this particular case always 1, has to 
 be decoded, and also one position value, which is always 0, has to be loaded.
 As a solution we want to add real column-stride fields to Lucene. A possible
 format for the new data structure could look like this (CSD stands for 
 column-stride data; once we decide on a final name for this feature we can 
 change this):
 CSDList --> FixedLengthList | VariableLengthList, SkipList 
 FixedLengthList --> Payload^SegSize 
 VariableLengthList --> <DocDelta, PayloadLength?, Payload> 
 Payload --> Byte^PayloadLength 
 PayloadLength --> VInt 
 SkipList --> see .frq file
 We distinguish here between the fixed-length and the variable-length cases. To
 allow flexibility, Lucene could automatically pick the right data structure. 
 This could work like this: when the DocumentsWriter writes a segment, it 
 checks whether all values of a field have the same length. If yes, it stores 
 them as a FixedLengthList; if not, then as a VariableLengthList. When the 
 SegmentMerger merges two or more segments, it checks whether all segments 
 have a FixedLengthList with the same length for a column-stride field. If 
 not, it writes a VariableLengthList to the new segment. 
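 For illustration only, the length check a writer could run to pick between 
 the two encodings above (names are not from any patch):
 {code}
 // True if every value of the field has the same byte length, in which
 // case the field can be written as a FixedLengthList; otherwise the
 // writer falls back to a VariableLengthList.
 static boolean isFixedLength(byte[][] values) {
   for (byte[] value : values) {
     if (value.length != values[0].length) {
       return false;
     }
   }
   return true;
 }
 {code}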
 Once this feature is implemented, we should think about making the 
 column-stride fields updateable, similar to the norms. This will be a very 
 powerful feature that can, for example, be used for low-latency tagging of 
 documents.
 Other use cases:
 - replace norms
 - allow to store boost values separately from norms
 - as input for the FieldCache, thus 

[jira] Created: (LUCENE-1585) Allow to control how payloads are merged

2009-04-03 Thread Michael Busch (JIRA)
Allow to control how payloads are merged


 Key: LUCENE-1585
 URL: https://issues.apache.org/jira/browse/LUCENE-1585
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor


Lucene handles backwards-compatibility of its data structures by
converting them from the old into the new formats during segment
merging. 

Payloads are simply byte arrays in which users can store arbitrary
data. Applications that use payloads might want to convert the format
of their payloads in a similar fashion. Otherwise it's not easily
possible to ever change the encoding of a payload without reindexing.

So I propose to introduce a PayloadMerger class that the SegmentMerger
invokes to merge the payloads from multiple segments. Users can then
implement their own PayloadMerger to convert payloads from an old into
a new format.
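
A hypothetical shape for such a class (nothing like this exists yet; the
class name follows the proposal, the signature is illustrative):

{code}
public abstract class PayloadMerger {

  /**
   * Called by the SegmentMerger for each payload it copies; returns the
   * bytes to write into the merged segment, possibly re-encoded.
   */
  public abstract byte[] merge(String field, byte[] payload);
}

// An application could then upgrade old payloads during merging:
class UpgradingPayloadMerger extends PayloadMerger {

  public byte[] merge(String field, byte[] payload) {
    return isOldFormat(payload) ? reencode(payload) : payload;
  }

  private boolean isOldFormat(byte[] payload) {
    return payload.length > 0 && payload[0] == 0;  // app-specific marker
  }

  private byte[] reencode(byte[] old) {
    return old;  // app-specific conversion would go here
  }
}
{code}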

In the future we need this kind of flexibility also for column-stride
fields (LUCENE-1231) and flexible indexing codecs.

In addition to that, it would be nice if users could store version
information in the segments file. E.g., they could store in segment _2:
"the term a:b uses payloads of format x.y".


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org