Re: lucene Scorers
On Wednesday 24 November 2004 01:31, Ken McCracken wrote:
> Hi,
>
> Thanks for the pointers in your replies. Would it be possible to include some sort of accrual scorer interface somewhere in the Lucene Query APIs? This could be passed into a query similar to MaxDisjunctionQuery, and combine the sum, max, tieBreaker, etc., according to the implementor's discretion, to compute the overall score for a document.

The DisjunctionScorer is currently not part of Lucene. You might try to subclass Similarity to provide what you need and pass that to your Query.

I'm using a few subclasses of DisjunctionScorer to provide the actual score value, among others for max and sum. For each of these scorers I use a separate Query and Weight. This gives a parallel class hierarchy for Query, Weight and Scorer.

I guess it's time to have a look at Design Patterns and/or Refactoring on how to get rid of the parallel class hierarchy. That could also involve some sort of accrual scorer and Lucene's Similarity.

Regards,
Paul Elschot

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: lucene Scorers
Hi,

Thanks for the pointers in your replies. Would it be possible to include some sort of accrual scorer interface somewhere in the Lucene Query APIs? This could be passed into a query similar to MaxDisjunctionQuery, and combine the sum, max, tieBreaker, etc., according to the implementor's discretion, to compute the overall score for a document.

-Ken

On Sat, 13 Nov 2004 12:07:05 +0100, Paul Elschot [EMAIL PROTECTED] wrote:
> If you're interested, you can also have a look here for yet another DisjunctionScorer:
> http://issues.apache.org/bugzilla/show_bug.cgi?id=31785
>
> It has the advantage that it implements skipTo() so that it can be used as a subscorer of ConjunctionScorer, i.e. it can be faster in situations like this:
>
> aa AND (bb OR cc)
>
> where bb and cc are treated by the DisjunctionScorer. When aa is a filter this can also be used to implement a filtering query.
>
> The maximum indeed works well, also when the fields differ a lot in length.
>
> Regards,
> Paul
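One way Ken's accrual scorer interface could look is sketched below. This is only an illustration: the `Accrual` interface, the `Accruals` helper and the `maxWithTieBreaker` formula (max plus a fraction of the remaining sum) are hypothetical names and choices made for this sketch, not part of Lucene or of MaxDisjunctionQuery. It also assumes clause scores are non-negative.

```java
/** Hypothetical accrual interface; the name and signature are illustrative only. */
interface Accrual {
    /** Combines the scores of the clauses that matched one document. */
    float combine(float[] clauseScores);
}

/** A few example accrual operators (assumes non-negative clause scores). */
final class Accruals {
    private Accruals() {}

    /** Adds all clause scores together. */
    static final Accrual SUM = scores -> {
        float sum = 0f;
        for (float s : scores) sum += s;
        return sum;
    };

    /** Keeps only the best clause score. */
    static final Accrual MAX = scores -> {
        float max = 0f;
        for (float s : scores) max = Math.max(max, s);
        return max;
    };

    /**
     * Assumed combination for this sketch: the best clause score plus
     * tieBreaker times the sum of the other clause scores.
     * tieBreaker = 0 behaves like MAX, tieBreaker = 1 behaves like SUM.
     */
    static Accrual maxWithTieBreaker(float tieBreaker) {
        return scores -> {
            float sum = SUM.combine(scores);
            float max = MAX.combine(scores);
            return max + tieBreaker * (sum - max);
        };
    }
}
```

A query class could accept an `Accrual` in its constructor and hand it to its scorer, so the operator is picked per search rather than per class.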
RE: lucene Scorers
Hi Ken,

I'm glad our replies were helpful. It sounds like you looked at the code in MaxDisjunctionQuery, so you probably noticed that it also implements skipTo().

Your suggestion sounds like a good thing to do. I thought about that when writing MaxDisjunctionQuery, but didn't need the generality, and it does make the code more complex. I think Lucene needs one of these mechanisms in it, at least to solve the problems associated with the current default use of BooleanQuery for multiple field expansions. Your proposal would generalize this to solve additional cases where different accrual operators are appropriate.

You could write and submit the generalization, although there are no guarantees anybody would do anything with it. I didn't get anywhere in my attempt to submit MaxDisjunctionQuery.

I think there is also a serious problem in scoring with the current score normalization: it does not provide meaningfully comparable scores across different searches, which means that absolute score numbers like 0.8 have no intrinsic meaning concerning how good a result is or is not. When I finally get back to tuning search in my app, that's the next one I'll try a submission on.

Chuck

-Original Message-
From: Ken McCracken [mailto:[EMAIL PROTECTED]
Sent: Tuesday, November 23, 2004 4:31 PM
To: Lucene Users List
Subject: Re: lucene Scorers

Hi,

Thanks for the pointers in your replies. Would it be possible to include some sort of accrual scorer interface somewhere in the Lucene Query APIs? This could be passed into a query similar to MaxDisjunctionQuery, and combine the sum, max, tieBreaker, etc., according to the implementor's discretion, to compute the overall score for a document.

-Ken
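Chuck's message does not say how the scores are normalized. Purely as an assumption for illustration, suppose each result list is divided by its top raw score whenever that score exceeds 1.0. The hypothetical `NormalizationSketch` below shows how two searches with very different raw scores then report identical normalized scores, which is the kind of non-comparability he describes.

```java
/** Illustration only; the normalization rule here is an assumption, not taken from the thread. */
final class NormalizationSketch {
    private NormalizationSketch() {}

    /**
     * Scales a result list (sorted by descending raw score) so that the top
     * hit is at most 1.0. Lists whose top score is already <= 1.0 are copied unchanged.
     */
    static float[] normalize(float[] rawDescending) {
        float norm = 1.0f;
        if (rawDescending.length > 0 && rawDescending[0] > 1.0f) {
            norm = 1.0f / rawDescending[0];
        }
        float[] out = new float[rawDescending.length];
        for (int i = 0; i < rawDescending.length; i++) {
            out[i] = rawDescending[i] * norm;
        }
        return out;
    }
}
```

Under this assumed rule, raw lists {4.0, 2.0} and {8.0, 4.0} both normalize to {1.0, 0.5}, so a reported 0.5 says nothing about which search matched better in absolute terms.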
Re: lucene Scorers
On Friday 12 November 2004 22:56, Chuck Williams wrote:
> I had a similar need and wrote MaxDisjunctionQuery and MaxDisjunctionScorer. Unfortunately these are not available as a patch, but I've included the original message below that has the code (modulo line breaks added by simple text email format). This code is functional -- I use it in my app. It is optimized for its stated use, which involves a small number of clauses. You'd want to improve the incremental sorting (e.g., using the bucket technique of BooleanQuery) if you need it for large numbers of clauses.

If you're interested, you can also have a look here for yet another DisjunctionScorer:
http://issues.apache.org/bugzilla/show_bug.cgi?id=31785

It has the advantage that it implements skipTo() so that it can be used as a subscorer of ConjunctionScorer, i.e. it can be faster in situations like this:

aa AND (bb OR cc)

where bb and cc are treated by the DisjunctionScorer. When aa is a filter this can also be used to implement a filtering query.

> Re. Paul's suggested steps below, I did not integrate this with query parser as I didn't need that functionality (since I'm generating the multi-field expansions, for which max is a much better scoring choice than sum).
>
> Chuck
>
> Included message:
>
> -Original Message-
> From: Chuck Williams [mailto:[EMAIL PROTECTED]
> Sent: Monday, October 11, 2004 9:55 PM
> To: [EMAIL PROTECTED]
> Subject: Contribution: better multi-field searching
>
> The files included below (MaxDisjunctionQuery.java and MaxDisjunctionScorer.java) provide a new mechanism for searching across multiple fields.

The maximum indeed works well, also when the fields differ a lot in length.

Regards,
Paul
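A minimal sketch of why skipTo() matters for aa AND (bb OR cc): the conjunction can tell the disjunction to jump straight to the next candidate document instead of enumerating every bb or cc match. The `DocIterator`, `ArrayDocs`, `OrDocs`, `AndDocs` and `SkipToSketch` types are hypothetical stand-ins over sorted doc id arrays; they are not the Lucene Scorer API and not the code attached to the Bugzilla issue, and scoring is left out entirely.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical iterator over ascending doc ids (not the Lucene Scorer API). */
interface DocIterator {
    int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Moves to the first doc id >= target, never backwards, and returns it. */
    int skipTo(int target);
}

/** Leaf iterator over a sorted array of doc ids. */
final class ArrayDocs implements DocIterator {
    private final int[] docs;
    private int pos = 0;

    ArrayDocs(int... docs) { this.docs = docs; }

    public int skipTo(int target) {
        while (pos < docs.length && docs[pos] < target) pos++;
        return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
    }
}

/** Disjunction: the smallest doc id >= target over all sub-iterators. */
final class OrDocs implements DocIterator {
    private final DocIterator[] subs;

    OrDocs(DocIterator... subs) { this.subs = subs; }

    public int skipTo(int target) {
        int min = NO_MORE_DOCS;
        for (DocIterator sub : subs) min = Math.min(min, sub.skipTo(target));
        return min;
    }
}

/** Conjunction: leapfrogs the two sides until both sit on the same doc id. */
final class AndDocs implements DocIterator {
    private final DocIterator a;
    private final DocIterator b;

    AndDocs(DocIterator a, DocIterator b) { this.a = a; this.b = b; }

    public int skipTo(int target) {
        int da = a.skipTo(target);
        int db = b.skipTo(da);
        while (da != db) {
            if (da < db) da = a.skipTo(db);
            else db = b.skipTo(da);
        }
        return da;
    }
}

final class SkipToSketch {
    private SkipToSketch() {}

    /** Drains an iterator into a list of doc ids. */
    static List<Integer> collect(DocIterator it) {
        List<Integer> out = new ArrayList<>();
        for (int d = it.skipTo(0); d != DocIterator.NO_MORE_DOCS; d = it.skipTo(d + 1)) {
            out.add(d);
        }
        return out;
    }
}
```

With aa = {1, 5, 9, 20}, bb = {2, 5, 30} and cc = {9, 40}, `SkipToSketch.collect(new AndDocs(aa, new OrDocs(bb, cc)))` yields [5, 9]. When aa is sparse, most bb and cc postings are skipped over rather than visited, which is the speed-up Paul refers to.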
lucene Scorers
Hi,

I am looking at the Similarity class overview, and wondering if I can replace the SUM operator with a MAX operator, or any other operator (across the terms in a query).

For example, if I search for car OR automobile, a BooleanScorer is used to add the values from each subexpression together. In the BooleanScorer from lucene_1_4_final, in the inner class Collector, we have in the collect(...) method the line

    bucket.score += score;    // increment score

that I may want to replace with a MAX operator such as

    if (score > bucket.score) bucket.score = score;    // take the max

I may also want to keep track of both the max and the sum, by extending the inner class Bucket.

Do you have any suggestions on how to implement such a change? Ideally, I would like to have the ability to define my choice of scoring algorithm at search time (at run time), and use the Lucene SUM scorer for some searches, and the MAX scorer for other searches.

Thanks for your help.

-Ken

PS. The code I'm talking about falls in the following area, for my example search car OR automobile. If I walk the code during search, I see that the BooleanScorer$Collector is created by the Weight that was just created, in BooleanQuery$BooleanWeight.scorer(...), as it adds the subscorers for each of the terms in the BooleanScorer. When that collector is asked to collect(...), its bucketTable is filled in. Since the collectors for each of the terms use the same bucketTable, if the document already appears in the bucketTable, then its score is added to implement a SUM operator.
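The idea in the message above, a shared bucket table that tracks both the sum and the max so the operator can be chosen at search time, can be sketched as follows. `BucketTableSketch`, its `Bucket` and its `Mode` enum are hypothetical stand-ins written for this illustration; they do not reproduce the internals of the Lucene 1.4 BooleanScorer.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a bucket table shared by all term collectors. */
final class BucketTableSketch {

    /** One bucket per document, tracking both accumulations. */
    static final class Bucket {
        float sum;
        float max;
        int hits;
    }

    /** Which accumulated value to report as the document score. */
    enum Mode { SUM, MAX }

    private final Map<Integer, Bucket> buckets = new HashMap<>();

    /** Called once per (document, clause) match, as collect(...) is in the message. */
    void collect(int doc, float score) {
        Bucket b = buckets.computeIfAbsent(doc, d -> new Bucket());
        b.sum += score;                                    // increment score
        if (b.hits == 0 || score > b.max) b.max = score;   // take the max
        b.hits++;
    }

    /** The operator is picked when the score is read, so it can differ per search. */
    float score(int doc, Mode mode) {
        Bucket b = buckets.get(doc);
        if (b == null) return 0f;
        return mode == Mode.SUM ? b.sum : b.max;
    }
}
```

If document 7 matches car with 0.5 and automobile with 0.25, `score(7, Mode.SUM)` gives 0.75 and `score(7, Mode.MAX)` gives 0.5, from the same table and without rerunning the search.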
Re: lucene Scorers
On Friday 12 November 2004 20:48, Ken McCracken wrote:
> Hi,
>
> I am looking at the Similarity class overview, and wondering if I can replace the SUM operator with a MAX operator, or any other operator (across the terms in a query).
>
> For example, if I search for car OR automobile, a BooleanScorer is used to add the values from each subexpression together. In the BooleanScorer from lucene_1_4_final, in the inner class Collector, we have in the collect(...) method the line
>
>     bucket.score += score;    // increment score
>
> that I may want to replace with a MAX operator such as
>
>     if (score > bucket.score) bucket.score = score;    // take the max
>
> I may also want to keep track of both the max and the sum, by extending the inner class Bucket.
>
> Do you have any suggestions on how to implement such a change? Ideally, I would like to have the ability to define my choice of scoring algorithm at search time (at run time), and use the Lucene SUM scorer for some searches, and the MAX scorer for other searches.
>
> Thanks for your help.
>
> -Ken
>
> PS. The code I'm talking about falls in the following area, for my example search car OR automobile. If I walk the code during search, I see that the BooleanScorer$Collector is created by the Weight that was just created, in BooleanQuery$BooleanWeight.scorer(...), as it adds the subscorers for each of the terms in the BooleanScorer. When that collector is asked to collect(...), its bucketTable is filled in. Since the collectors for each of the terms use the same bucketTable, if the document already appears in the bucketTable, then its score is added to implement a SUM operator.

Since you are that far already, you can (in reverse order):
- replace the BooleanScorer by another one that takes the max instead of summing.
- replace the weight to return that scorer.
- replace the BooleanQuery to return that weight.
- override QueryParser.getBooleanQuery() to return that query in the cases you want, that is when all clauses are optional.

"Replace" usually means "inherit from" in new code. If you need more info on this, try lucene-dev.

Regards,
Paul Elschot.
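The chain of replacements Paul lists can be sketched with plain inheritance. Every type below (`SketchScorer`, `SketchWeight`, `SketchQuery`, `SketchParser` and their `Max...` subclasses) is a hypothetical, heavily simplified stand-in invented for this sketch; the real Lucene classes have different constructors and signatures, so treat this as the shape of the approach rather than working Lucene code.

```java
/** Stand-in scorer: by default sums the clause scores. */
class SketchScorer {
    float score(float[] clauseScores) {
        float sum = 0f;
        for (float s : clauseScores) sum += s;
        return sum;
    }
}

/** Step 1: a scorer that takes the max instead of summing (assumes scores >= 0). */
class MaxSketchScorer extends SketchScorer {
    @Override
    float score(float[] clauseScores) {
        float max = 0f;
        for (float s : clauseScores) max = Math.max(max, s);
        return max;
    }
}

class SketchWeight {
    SketchScorer scorer() { return new SketchScorer(); }
}

/** Step 2: a weight that returns the max scorer. */
class MaxSketchWeight extends SketchWeight {
    @Override
    SketchScorer scorer() { return new MaxSketchScorer(); }
}

class SketchQuery {
    SketchWeight createWeight() { return new SketchWeight(); }
}

/** Step 3: a query that returns the max weight. */
class MaxSketchQuery extends SketchQuery {
    @Override
    SketchWeight createWeight() { return new MaxSketchWeight(); }
}

class SketchParser {
    SketchQuery getBooleanQuery(boolean allClausesOptional) { return new SketchQuery(); }
}

/** Step 4: a parser that returns the max query only when all clauses are optional. */
class MaxSketchParser extends SketchParser {
    @Override
    SketchQuery getBooleanQuery(boolean allClausesOptional) {
        return allClausesOptional ? new MaxSketchQuery() : super.getBooleanQuery(allClausesOptional);
    }
}
```

Each layer only overrides the factory method that hands out the next layer, which is also why the approach ends up with three parallel subclasses per scoring operator.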
RE: lucene Scorers
the Explanation for our score */
    public Explanation explain(int doc) throws IOException {
        throw new UnsupportedOperationException();
    }
}

-Original Message-
From: Paul Elschot [mailto:[EMAIL PROTECTED]
Sent: Friday, November 12, 2004 12:02 PM
To: [EMAIL PROTECTED]
Subject: Re: lucene Scorers

Since you are that far already, you can (in reverse order):
- replace the BooleanScorer by another one that takes the max instead of summing.
- replace the weight to return that scorer.
- replace the BooleanQuery to return that weight.
- override QueryParser.getBooleanQuery() to return that query in the cases you want, that is when all clauses are optional.

"Replace" usually means "inherit from" in new code. If you need more info on this, try lucene-dev.

Regards,
Paul Elschot.