Re: lucene Scorers

2004-11-24 Thread Paul Elschot
On Wednesday 24 November 2004 01:31, Ken McCracken wrote:
 Hi,
 
 Thanks for the pointers in your replies.  Would it be possible to include
 some sort of accrual scorer interface somewhere in the Lucene Query
 APIs?  This could be passed into a query similar to
 MaxDisjunctionQuery; and combine the sum, max, tieBreaker, etc.,
 according to the implementor's discretion, to compute the overall
 score for a document.

The DisjunctionScorer is currently not part of Lucene.
You might try and subclass Similarity to provide what you need and
pass that to your Query.

I'm using a few subclasses of DisjunctionScorer to provide the actual
score value, among others for max and sum.
For each of these scorers, I use a separate Query and Weight.
This gives a parallel class hierarchy for Query, Weight and Scorer.

I guess it's time to have a look at Design Patterns and/or Refactoring
on how to get rid of the parallel class hierarchy. That could also
involve some sort of accrual scorer and Lucene's Similarity.

Regards,
Paul Elschot

 -Ken
 
 On Sat, 13 Nov 2004 12:07:05 +0100, Paul Elschot [EMAIL PROTECTED] wrote:
  On Friday 12 November 2004 22:56, Chuck Williams wrote:
  
  
   I had a similar need and wrote MaxDisjunctionQuery and
   MaxDisjunctionScorer.  Unfortunately these are not available as a patch
   but I've included the original message below that has the code (modulo
   line breaks added by simple text email format).
  
   This code is functional -- I use it in my app.  It is optimized for its
   stated use, which involves a small number of clauses.  You'd want to
   improve the incremental sorting (e.g., using the bucket technique of
   BooleanQuery) if you need it for large numbers of clauses.
  
  If you're interested, you can also have a look here for
  yet another DisjunctionScorer:
  http://issues.apache.org/bugzilla/show_bug.cgi?id=31785
  
  It has the advantage that it implements skipTo() so that it can
  be used as a subscorer of ConjunctionScorer, i.e. it can be
  faster in situations like this:
  
  aa AND (bb OR cc)
  
  where bb and cc are treated by the DisjunctionScorer.
  When aa is a filter this can also be used to implement
  a filtering query.
  
  
  
  
   Re. Paul's suggested steps below, I did not integrate this with query
   parser as I didn't need that functionality (since I'm generating the
   multi-field expansions for which max is a much better scoring choice
   than sum).
  
   Chuck
  
   Included message:
  
   -Original Message-
   From: Chuck Williams [mailto:[EMAIL PROTECTED]
   Sent: Monday, October 11, 2004 9:55 PM
   To: [EMAIL PROTECTED]
   Subject: Contribution: better multi-field searching
  
   The files included below (MaxDisjunctionQuery.java and
   MaxDisjunctionScorer.java) provide a new mechanism for searching across
   multiple fields.
  
  The maximum indeed works well, also when the fields differ a lot in length.
  
  Regards,
  Paul
  
  
  
  
  -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail: [EMAIL PROTECTED]
  
 
 



Re: lucene Scorers

2004-11-23 Thread Ken McCracken
Hi,

Thanks for the pointers in your replies.  Would it be possible to include
some sort of accrual scorer interface somewhere in the Lucene Query
APIs?  This could be passed into a query similar to
MaxDisjunctionQuery; and combine the sum, max, tieBreaker, etc.,
according to the implementor's discretion, to compute the overall
score for a document.
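
A minimal sketch of what such an accrual interface might look like.
The names Accrual, Accruals and maxWithTieBreaker are hypothetical and
not part of Lucene, and the tie-breaker formula is only one plausible
way to blend max and sum. MAX assumes non-negative clause scores.

```java
/** Hypothetical accrual operator; nothing here exists in Lucene. */
interface Accrual {
    /** Combine the scores of the clauses that matched one document. */
    float combine(float[] clauseScores);
}

final class Accruals {
    private Accruals() {}

    /** Sum of all matching clause scores, like "bucket.score += score". */
    static final Accrual SUM = scores -> {
        float sum = 0f;
        for (float s : scores) sum += s;
        return sum;
    };

    /** Best single clause score (assumes scores are >= 0). */
    static final Accrual MAX = scores -> {
        float max = 0f;
        for (float s : scores) max = Math.max(max, s);
        return max;
    };

    /** Max plus a fraction of the rest: tieBreaker 0 gives MAX, 1 gives SUM. */
    static Accrual maxWithTieBreaker(float tieBreaker) {
        return scores -> {
            float sum = SUM.combine(scores);
            float max = MAX.combine(scores);
            return max + tieBreaker * (sum - max);
        };
    }
}
```

A query in the style of MaxDisjunctionQuery could then take an Accrual
in its constructor and hand it to its scorer, so that the choice is made
per search rather than per class.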

-Ken

On Sat, 13 Nov 2004 12:07:05 +0100, Paul Elschot [EMAIL PROTECTED] wrote:
 On Friday 12 November 2004 22:56, Chuck Williams wrote:
 
 
  I had a similar need and wrote MaxDisjunctionQuery and
  MaxDisjunctionScorer.  Unfortunately these are not available as a patch
  but I've included the original message below that has the code (modulo
  line breaks added by simple text email format).
 
  This code is functional -- I use it in my app.  It is optimized for its
  stated use, which involves a small number of clauses.  You'd want to
  improve the incremental sorting (e.g., using the bucket technique of
  BooleanQuery) if you need it for large numbers of clauses.
 
 If you're interested, you can also have a look here for
 yet another DisjunctionScorer:
 http://issues.apache.org/bugzilla/show_bug.cgi?id=31785
 
 It has the advantage that it implements skipTo() so that it can
 be used as a subscorer of ConjunctionScorer, i.e. it can be
 faster in situations like this:
 
 aa AND (bb OR cc)
 
 where bb and cc are treated by the DisjunctionScorer.
 When aa is a filter this can also be used to implement
 a filtering query.
 
 
 
 
  Re. Paul's suggested steps below, I did not integrate this with query
  parser as I didn't need that functionality (since I'm generating the
  multi-field expansions for which max is a much better scoring choice
  than sum).
 
  Chuck
 
  Included message:
 
  -Original Message-
  From: Chuck Williams [mailto:[EMAIL PROTECTED]
  Sent: Monday, October 11, 2004 9:55 PM
  To: [EMAIL PROTECTED]
  Subject: Contribution: better multi-field searching
 
  The files included below (MaxDisjunctionQuery.java and
  MaxDisjunctionScorer.java) provide a new mechanism for searching across
  multiple fields.
 
 The maximum indeed works well, also when the fields differ a lot in length.
 
 Regards,
 Paul
 
 
 
 



RE: lucene Scorers

2004-11-23 Thread Chuck Williams
Hi Ken,

I'm glad our replies were helpful.  It sounds like you looked at the
code in MaxDisjunctionQuery, so you probably noticed that it also
implements skipTo().  Your suggestion sounds like a good thing to do.  I
thought about that when writing MaxDisjunctionQuery, but didn't need the
generality, and it does make the code more complex.  I think Lucene
needs one of these mechanisms in it, at least to solve the problems
associated with the current default use of BooleanQuery for multiple
field expansions.  Your proposal would generalize this to solve
additional cases where different accrual operators are appropriate.

You could write and submit the generalization, although there are no
guarantees anybody would do anything with it.  I didn't get anywhere in
my attempt to submit MaxDisjunctionQuery.  I think there is also a
serious problem in scoring with the current score normalization (it does
not provide meaningfully comparable scores across different searches,
which means that absolute score numbers like 0.8 have no intrinsic
meaning concerning how good a result is or is not).  When I finally get
back to tuning search in my app, that's the next one I'll try a
submission on.

Chuck

   -Original Message-
   From: Ken McCracken [mailto:[EMAIL PROTECTED]
   Sent: Tuesday, November 23, 2004 4:31 PM
   To: Lucene Users List
   Subject: Re: lucene Scorers
   
   Hi,
   
   Thanks for the pointers in your replies.  Would it be possible to
   include some sort of accrual scorer interface somewhere in the Lucene
   Query APIs?  This could be passed into a query similar to
   MaxDisjunctionQuery; and combine the sum, max, tieBreaker, etc.,
   according to the implementor's discretion, to compute the overall
   score for a document.
   
   -Ken
   
   On Sat, 13 Nov 2004 12:07:05 +0100, Paul Elschot [EMAIL PROTECTED] wrote:
On Friday 12 November 2004 22:56, Chuck Williams wrote:
   
   
 I had a similar need and wrote MaxDisjunctionQuery and
 MaxDisjunctionScorer.  Unfortunately these are not available as a patch
 but I've included the original message below that has the code (modulo
 line breaks added by simple text email format).

 This code is functional -- I use it in my app.  It is optimized for its
 stated use, which involves a small number of clauses.  You'd want to
 improve the incremental sorting (e.g., using the bucket technique of
 BooleanQuery) if you need it for large numbers of clauses.
   
If you're interested, you can also have a look here for
yet another DisjunctionScorer:
http://issues.apache.org/bugzilla/show_bug.cgi?id=31785
   
It has the advantage that it implements skipTo() so that it can
be used as a subscorer of ConjunctionScorer, i.e. it can be
faster in situations like this:
   
aa AND (bb OR cc)
   
where bb and cc are treated by the DisjunctionScorer.
When aa is a filter this can also be used to implement
a filtering query.
   
   
   
   
 Re. Paul's suggested steps below, I did not integrate this with query
 parser as I didn't need that functionality (since I'm generating the
 multi-field expansions for which max is a much better scoring choice
 than sum).

 Chuck

 Included message:

 -Original Message-
 From: Chuck Williams [mailto:[EMAIL PROTECTED]
 Sent: Monday, October 11, 2004 9:55 PM
 To: [EMAIL PROTECTED]
 Subject: Contribution: better multi-field searching

 The files included below (MaxDisjunctionQuery.java and
 MaxDisjunctionScorer.java) provide a new mechanism for searching across
 multiple fields.
   
The maximum indeed works well, also when the fields differ a lot in length.
   
Regards,
Paul
   
   
   
   
   



Re: lucene Scorers

2004-11-13 Thread Paul Elschot
On Friday 12 November 2004 22:56, Chuck Williams wrote:
 I had a similar need and wrote MaxDisjunctionQuery and
 MaxDisjunctionScorer.  Unfortunately these are not available as a patch
 but I've included the original message below that has the code (modulo
 line breaks added by simple text email format).

 This code is functional -- I use it in my app.  It is optimized for its
 stated use, which involves a small number of clauses.  You'd want to
 improve the incremental sorting (e.g., using the bucket technique of
 BooleanQuery) if you need it for large numbers of clauses.

If you're interested, you can also have a look here for
yet another DisjunctionScorer:
http://issues.apache.org/bugzilla/show_bug.cgi?id=31785

It has the advantage that it implements skipTo() so that it can
be used as a subscorer of ConjunctionScorer, i.e. it can be
faster in situations like this:

aa AND (bb OR cc)

where bb and cc are treated by the DisjunctionScorer.
When aa is a filter this can also be used to implement
a filtering query.
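
To illustrate why skipTo() matters for aa AND (bb OR cc), here is a
self-contained sketch. DocIter, TermIter, OrIter, AndIter and SkipToDemo
are hypothetical stand-ins that work on plain sorted int arrays; they
are not the scorers from the Bugzilla entry, and only the
next()/doc()/skipTo() shape is borrowed from Lucene's Scorer. Doc ids
are assumed to be non-negative.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical doc-id iterator; only the method shape mimics Lucene's Scorer. */
interface DocIter {
    boolean next();             // advance to the next doc, false when exhausted
    int doc();                  // current doc, valid after next()/skipTo() returned true
    boolean skipTo(int target); // advance to the first later doc that is >= target
}

/** Iterates over a sorted array of doc ids (stands in for one term). */
final class TermIter implements DocIter {
    private final int[] docs;
    private int pos = -1;
    TermIter(int... docs) { this.docs = docs; }
    public boolean next() { return ++pos < docs.length; }
    public int doc() { return docs[pos]; }
    public boolean skipTo(int target) {
        do { if (!next()) return false; } while (doc() < target);
        return true;
    }
}

/** Disjunction: the current doc is the smallest doc of the live subiterators. */
final class OrIter implements DocIter {
    private final DocIter[] subs;
    private final List<DocIter> live = new ArrayList<>();
    private boolean started = false;
    private int current = -1;
    OrIter(DocIter... subs) { this.subs = subs; }
    public boolean next() { return advance(current + 1); }
    public boolean skipTo(int target) { return advance(Math.max(target, current + 1)); }
    public int doc() { return current; }
    private boolean advance(int target) {
        if (!started) {
            started = true;
            for (DocIter s : subs) if (s.skipTo(target)) live.add(s);
        } else {
            // Only subiterators that lag behind the target need to move.
            for (Iterator<DocIter> it = live.iterator(); it.hasNext();) {
                DocIter s = it.next();
                if (s.doc() < target && !s.skipTo(target)) it.remove();
            }
        }
        if (live.isEmpty()) return false;
        int min = Integer.MAX_VALUE;
        for (DocIter s : live) min = Math.min(min, s.doc());
        current = min;
        return true;
    }
}

/** Conjunction of two iterators: each side skips to the other's doc until they agree. */
final class AndIter implements DocIter {
    private final DocIter a, b;
    private boolean bStarted = false;
    AndIter(DocIter a, DocIter b) { this.a = a; this.b = b; }
    public boolean next() { return a.next() && align(); }
    public boolean skipTo(int target) { return a.skipTo(target) && align(); }
    public int doc() { return a.doc(); }
    private boolean align() {
        if (!bStarted) {
            bStarted = true;
            if (!b.skipTo(a.doc())) return false;
        }
        while (a.doc() != b.doc()) {
            if (a.doc() < b.doc()) { if (!a.skipTo(b.doc())) return false; }
            else { if (!b.skipTo(a.doc())) return false; }
        }
        return true;
    }
}

final class SkipToDemo {
    /** Drains an iterator into a list of doc ids. */
    static List<Integer> collect(DocIter it) {
        List<Integer> out = new ArrayList<>();
        while (it.next()) out.add(it.doc());
        return out;
    }
}
```

With aa = {1, 4, 7, 9}, bb = {2, 4, 8} and cc = {3, 7, 10},
SkipToDemo.collect(new AndIter(new TermIter(1, 4, 7, 9),
new OrIter(new TermIter(2, 4, 8), new TermIter(3, 7, 10)))) yields
[4, 7]. The point is that the disjunction is asked to skip straight to
the docs that aa proposes instead of enumerating every doc of bb and cc.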

 
 Re. Paul's suggested steps below, I did not integrate this with query
 parser as I didn't need that functionality (since I'm generating the
 multi-field expansions for which max is a much better scoring choice
 than sum).
 
 Chuck
 
 Included message:
 
 -Original Message-
 From: Chuck Williams [mailto:[EMAIL PROTECTED] 
 Sent: Monday, October 11, 2004 9:55 PM
 To: [EMAIL PROTECTED]
 Subject: Contribution: better multi-field searching
 
 The files included below (MaxDisjunctionQuery.java and
 MaxDisjunctionScorer.java) provide a new mechanism for searching across
 multiple fields.

The maximum indeed works well, also when the fields differ a lot in length.
 
Regards,
Paul





lucene Scorers

2004-11-12 Thread Ken McCracken
Hi,

I am looking at the Similarity class overview, and wondering if I can
replace the SUM operator with a MAX operator, or any other operator
(across the terms in a query).

For example, if I search for car OR automobile, a BooleanScorer is
used to add the values from each subexpression together.  In the
BooleanScorer from lucene_1_4_final, in the inner class Collector, we
have in the collect(...) method, the line

 bucket.score += score;   // increment score

that I may want to replace with a MAX operator such as

 if (score > bucket.score) bucket.score = score;  // take the max

I may also want to keep track of both the max and the sum, by
extending the inner class Bucket.
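
A minimal sketch of a bucket that tracks both values. ScoreBucket and
ScoreTable are hypothetical simplifications (a HashMap keyed by doc id
rather than BooleanScorer's actual bucket table), meant only to show the
accumulation logic:

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for BooleanScorer's inner Bucket. */
final class ScoreBucket {
    float sum;  // what "bucket.score += score" accumulates
    float max;  // best single clause score seen so far
    int hits;   // number of clauses that matched this document

    void collect(float score) {
        sum += score;
        if (hits == 0 || score > max) max = score;
        hits++;
    }
}

/** Hypothetical table shared by the collectors of all clauses. */
final class ScoreTable {
    private final Map<Integer, ScoreBucket> buckets = new HashMap<>();

    /** Called once per (document, matching clause) pair. */
    void collect(int doc, float score) {
        buckets.computeIfAbsent(doc, d -> new ScoreBucket()).collect(score);
    }

    /** Returns the bucket for a doc, or null if no clause matched it. */
    ScoreBucket get(int doc) { return buckets.get(doc); }
}
```

Keeping both numbers per document means the choice between SUM and MAX
can be deferred until the final score is read out.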

Do you have any suggestions on how to implement such a change? 
Ideally, I would like to have the ability to define my choice of
scoring algorithm at search time (at run time), and use the Lucene SUM
scorer for some searches, and the MAX scorer for other searches.

Thanks for your help.

-Ken

PS.  The code I'm talking about falls in the following area, for my
example search car OR automobile.  If I walk the code during search,
I see that the BooleanScorer$Collector is created by the Weight that
was just created, in BooleanQuery$BooleanWeight.scorer(...), as it
adds the subscorers for each of the terms in the BooleanScorer.  When
that collector is asked to collect(...), its bucketTable is filled in.
Since the collectors for each of the terms use the same bucketTable,
if the document already appears in the bucketTable, then its score is
added to implement a SUM operator.




Re: lucene Scorers

2004-11-12 Thread Paul Elschot
On Friday 12 November 2004 20:48, Ken McCracken wrote:
 Hi,
 
 I am looking at the Similarity class overview, and wondering if I can
 replace the SUM operator with a MAX operator, or any other operator
 (across the terms in a query).
 
 For example, if I search for car OR automobile, a BooleanScorer is
 used to add the values from each subexpression together.  In the
 BooleanScorer from lucene_1_4_final, in the inner class Collector, we
 have in the collect(...) method, the line
 
  bucket.score += score; // increment score
 
 that I may want to replace with a MAX operator such as
 
  if (score > bucket.score) bucket.score = score; // take the max
 
 I may also want to keep track of both the max and the sum, by
 extending the inner class Bucket.
 
 Do you have any suggestions on how to implement such a change? 
 Ideally, I would like to have the ability to define my choice of
 scoring algorithm at search time (at run time), and use the Lucene SUM
 scorer for some searches, and the MAX scorer for other searches.
 
 Thanks for your help.
 
 -Ken
 
 PS.  The code I'm talking about falls in the following area, for my
 example search car OR automobile.  If I walk the code during search,
 I see that the BooleanScorer$Collector is created by the Weight that
 was just created, in BooleanQuery$BooleanWeight.scorer(...), as it
 adds the subscorers for each of the terms in the BooleanScorer.  When
 that collector is asked to collect(...), its bucketTable is filled in.
 Since the collectors for each of the terms use the same bucketTable,
 if the document already appears in the bucketTable, then its score is
 added to implement a SUM operator.

Since you are that far already, you can (in reverse order):
- replace the BooleanScorer by another one that takes the max
 instead of summing.
- replace the weight to return that scorer.
- replace the BooleanQuery to return that weight.
- override QueryParser.getBooleanQuery() to return that query
 in the cases you want, that is when all clauses are optional.

"Replace" usually means "inherit from" in new code.
If you need more info on this, try lucene-dev.
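
The steps above can be sketched as a parallel chain of subclasses. All
types below are simplified, hypothetical stand-ins: the real Lucene
Query, Weight, Scorer and QueryParser classes have different and larger
signatures, and the boolean parameter replaces inspecting the actual
clauses.

```java
// Hypothetical stand-ins for Scorer, Weight, Query and QueryParser.
abstract class SketchScorer { abstract float score(float[] clauseScores); }
abstract class SketchWeight { abstract SketchScorer scorer(); }
abstract class SketchQuery  { abstract SketchWeight createWeight(); }

/** Default chain: the scorer sums the clause scores. */
class SumScorer extends SketchScorer {
    float score(float[] clauseScores) {
        float sum = 0f;
        for (float s : clauseScores) sum += s;
        return sum;
    }
}
class SumWeight extends SketchWeight {
    SketchScorer scorer() { return new SumScorer(); }
}
class SumQuery extends SketchQuery {
    SketchWeight createWeight() { return new SumWeight(); }
}
class SketchParser {
    SketchQuery getBooleanQuery(boolean allClausesOptional) { return new SumQuery(); }
}

/** Step 1: a scorer that takes the max instead of summing. */
class MaxScorer extends SumScorer {
    @Override float score(float[] clauseScores) {
        float max = 0f; // assumes non-negative scores
        for (float s : clauseScores) max = Math.max(max, s);
        return max;
    }
}
/** Step 2: a weight that returns that scorer. */
class MaxWeight extends SumWeight {
    @Override SketchScorer scorer() { return new MaxScorer(); }
}
/** Step 3: a query that returns that weight. */
class MaxQuery extends SumQuery {
    @Override SketchWeight createWeight() { return new MaxWeight(); }
}
/** Step 4: a parser that returns that query only when all clauses are optional. */
class MaxParser extends SketchParser {
    @Override SketchQuery getBooleanQuery(boolean allClausesOptional) {
        return allClausesOptional ? new MaxQuery() : super.getBooleanQuery(allClausesOptional);
    }
}
```

Each new behaviour needs one subclass per level, which is the parallel
class hierarchy that comes up later in this thread.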

Regards,
Paul Elschot.





RE: lucene Scorers

2004-11-12 Thread Chuck Williams
 the Explanation for our score
 */
    public Explanation explain(int doc) throws IOException {
        throw new UnsupportedOperationException();
    }

}



   -Original Message-
   From: Paul Elschot [mailto:[EMAIL PROTECTED]
   Sent: Friday, November 12, 2004 12:02 PM
   To: [EMAIL PROTECTED]
   Subject: Re: lucene Scorers
   
   On Friday 12 November 2004 20:48, Ken McCracken wrote:
Hi,
   
I am looking at the Similarity class overview, and wondering if I can
replace the SUM operator with a MAX operator, or any other operator
(across the terms in a query).
   
For example, if I search for car OR automobile, a BooleanScorer is
used to add the values from each subexpression together.  In the
BooleanScorer from lucene_1_4_final, in the inner class Collector, we
have in the collect(...) method, the line
   
 bucket.score += score; // increment score
   
that I may want to replace with a MAX operator such as
   
 if (score > bucket.score) bucket.score = score; // take the max
   
I may also want to keep track of both the max and the sum, by
extending the inner class Bucket.
   
Do you have any suggestions on how to implement such a change?
Ideally, I would like to have the ability to define my choice of
scoring algorithm at search time (at run time), and use the Lucene SUM
scorer for some searches, and the MAX scorer for other searches.
   
Thanks for your help.
   
-Ken
   
PS.  The code I'm talking about falls in the following area, for my
example search car OR automobile.  If I walk the code during search,
I see that the BooleanScorer$Collector is created by the Weight that
was just created, in BooleanQuery$BooleanWeight.scorer(...), as it
adds the subscorers for each of the terms in the BooleanScorer.  When
that collector is asked to collect(...), its bucketTable is filled in.
Since the collectors for each of the terms use the same bucketTable,
if the document already appears in the bucketTable, then its score is
added to implement a SUM operator.
   
   Since you are that far already, you can (in reverse order):
   - replace the BooleanScorer by another one that takes the max
instead of summing.
   - replace the weight to return that scorer.
   - replace the BooleanQuery to return that weight.
   - override QueryParser.getBooleanQuery() to return that query
in the cases you want, that is when all clauses are optional.
   
   "Replace" usually means "inherit from" in new code.
   If you need more info on this, try lucene-dev.
   
   Regards,
   Paul Elschot.
   
   
  