RE: Relevance percentage
Gururaja,

If you want to score based solely on coord(), then Paul's approach looks best. However, based on your earlier messages, it looks to me like you want to score based on all factors (with coord boosted as Paul suggested, or lengthNorm flattened as I suggested -- either will get the order you want in the example you posted), but you want to print the (unboosted) coord percentage along with each result in the result list.

If this is the case, then since the number of results per page on the result list is presumably small, I think you are best off replicating the explain() mechanism. I don't have the source code at hand, but you can look at IndexSearcher.explain(), which recreates the weight with Query.weight(), then calls what in this case will be BooleanQuery.BooleanWeight.explain(), which has the code to recompute coord on a result (specifically, it computes overlap and maxOverlap and then calls Similarity.coord()). You could cut and paste this code to compute just coord for your top-level BooleanQuery's.

Sorry I don't have source code to do this, but the approach should work.

Good luck,
Chuck

-----Original Message-----
From: Paul Elschot [mailto:[EMAIL PROTECTED]
Sent: Wednesday, December 22, 2004 11:59 PM
To: lucene-user@jakarta.apache.org
Subject: Re: Relevance percentage

On Thursday 23 December 2004 08:13, Gururaja H wrote:
> Hi Chuck Williams,
>
> Thanks much for the reply.
>
>> If your queries are all BooleanQuery's of TermQuery's, then this is very simple. Iterate down the list of BooleanClause's and count the number whose score is greater than 0, then divide this by the total number of clauses. Take a look at BooleanQuery.BooleanWeight.explain() as it does this (along with generating the rest of the explanation). If you support the full Lucene query language, then you need to look at all the query types and decide what exactly you want to compute (as coord is not always well-defined).
>
> We are supporting the full Lucene query language. My request is: assuming the queries are all BooleanQuery's, please post the implementation source code for this, i.e. code to calculate the coord() input parameters overlap and maxOverlap.

I don't have the code, but I can give an overview of possible steps:

First, inherit from BooleanScorer to implement a score() method that returns only the coord() value (preferably a precomputed one).
Then inherit from BooleanQuery.BooleanWeight to return the above Scorer.
Then inherit from BooleanQuery to use the above Weight in createWeight().
Then inherit from QueryParser to use the above Query in getBooleanQuery().
Finally, use such a query in a search: the document scores will be the coord() values.

Regards,
Paul Elschot.
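To make the counting step concrete, here is a rough plain-Java sketch of how overlap and maxOverlap could be recomputed for one document. It is not copied from Lucene: `CoordCounter` and `ClauseResult` are hypothetical names, and leaving prohibited clauses out of both counts is an assumption that should be checked against BooleanQuery.BooleanWeight.explain() in your Lucene version.

```java
class CoordCounter {
    /** Hypothetical stand-in for one BooleanClause plus the score its sub-query gave a document. */
    static final class ClauseResult {
        final boolean prohibited;
        final float score;

        ClauseResult(boolean prohibited, float score) {
            this.prohibited = prohibited;
            this.score = score;
        }
    }

    /** Returns {overlap, maxOverlap} for one document. */
    static int[] overlapCounts(ClauseResult[] clauses) {
        int overlap = 0;
        int maxOverlap = 0;
        for (ClauseResult c : clauses) {
            if (c.prohibited) {
                continue; // assumption: prohibited clauses count toward neither value
            }
            maxOverlap++;
            if (c.score > 0f) {
                overlap++; // this clause matched the document
            }
        }
        return new int[] { overlap, maxOverlap };
    }

    /** Same formula as the DefaultSimilarity.coord() quoted in this thread. */
    static float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }

    public static void main(String[] args) {
        // Made-up clause scores: four of five optional clauses match, plus one prohibited clause.
        ClauseResult[] clauses = {
            new ClauseResult(false, 0.4f), new ClauseResult(false, 0.2f),
            new ClauseResult(false, 0.1f), new ClauseResult(false, 0.3f),
            new ClauseResult(false, 0.0f), new ClauseResult(true, 0.0f)
        };
        int[] counts = overlapCounts(clauses);
        System.out.println(counts[0] + "/" + counts[1] + " clauses matched, coord = "
                + coord(counts[0], counts[1]));
    }
}
```

In a real implementation the per-clause scores would come from the sub-query weights or scorers for the document in question; the sketch only shows the bookkeeping.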
RE: Relevance percentage
Hi Chuck Williams,

Thanks much for the reply.

> If your queries are all BooleanQuery's of TermQuery's, then this is very simple. Iterate down the list of BooleanClause's and count the number whose score is greater than 0, then divide this by the total number of clauses. Take a look at BooleanQuery.BooleanWeight.explain() as it does this (along with generating the rest of the explanation). If you support the full Lucene query language, then you need to look at all the query types and decide what exactly you want to compute (as coord is not always well-defined).

We are supporting the full Lucene query language. My request is: assuming the queries are all BooleanQuery's, please post the implementation source code for this, i.e. code to calculate the coord() input parameters overlap and maxOverlap.

Thanks,
Gururaja

Chuck Williams [EMAIL PROTECTED] wrote:
> The coord() value is not saved anywhere, so you would need to recompute it. You could either call explain() and parse the result string or, better, look at explain() and implement what it does more efficiently just for coord().
>
> If your queries are all BooleanQuery's of TermQuery's, then this is very simple. Iterate down the list of BooleanClause's and count the number whose score is greater than 0, then divide this by the total number of clauses. Take a look at BooleanQuery.BooleanWeight.explain() as it does this (along with generating the rest of the explanation).
>
> If you support the full Lucene query language, then you need to look at all the query types and decide what exactly you want to compute (as coord is not always well-defined).
>
> I'm on the West Coast of the U.S., so evidently in a very different time zone from you -- will look at your other message next.
>
> Chuck
Re: Relevance percentage
On Thursday 23 December 2004 08:13, Gururaja H wrote:
> Hi Chuck Williams,
>
> Thanks much for the reply.
>
>> If your queries are all BooleanQuery's of TermQuery's, then this is very simple. Iterate down the list of BooleanClause's and count the number whose score is greater than 0, then divide this by the total number of clauses. Take a look at BooleanQuery.BooleanWeight.explain() as it does this (along with generating the rest of the explanation). If you support the full Lucene query language, then you need to look at all the query types and decide what exactly you want to compute (as coord is not always well-defined).
>
> We are supporting the full Lucene query language. My request is: assuming the queries are all BooleanQuery's, please post the implementation source code for this, i.e. code to calculate the coord() input parameters overlap and maxOverlap.

I don't have the code, but I can give an overview of possible steps:

First, inherit from BooleanScorer to implement a score() method that returns only the coord() value (preferably a precomputed one).
Then inherit from BooleanQuery.BooleanWeight to return the above Scorer.
Then inherit from BooleanQuery to use the above Weight in createWeight().
Then inherit from QueryParser to use the above Query in getBooleanQuery().
Finally, use such a query in a search: the document scores will be the coord() values.

Regards,
Paul Elschot.
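The "precomputed" part of the first step can be illustrated without any Lucene classes. The sketch below is only a plain-Java model of a scorer that returns nothing but coord: `CoordOnlyScorer` is a hypothetical name, it does not extend BooleanScorer, and the number of matched clauses is assumed to be supplied by whatever drives the scoring.

```java
class CoordOnlyScorer {
    // coordTable[n] holds coord(n, maxOverlap), filled once per query.
    private final float[] coordTable;

    CoordOnlyScorer(int maxOverlap) {
        if (maxOverlap <= 0) {
            throw new IllegalArgumentException("maxOverlap must be positive");
        }
        coordTable = new float[maxOverlap + 1];
        for (int n = 0; n <= maxOverlap; n++) {
            coordTable[n] = n / (float) maxOverlap; // default coord() formula
        }
    }

    /** Score for a document that matched the given number of clauses. */
    float score(int matchedClauses) {
        return coordTable[matchedClauses];
    }

    public static void main(String[] args) {
        CoordOnlyScorer scorer = new CoordOnlyScorer(5);
        for (int matched = 0; matched <= 5; matched++) {
            System.out.println(matched + " of 5 clauses -> " + scorer.score(matched));
        }
    }
}
```

Since the table depends only on the number of clauses in the query, scoring each document becomes a single array lookup.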
Re: Relevance percentage
I'm still new to Lucene, but wouldn't that be the coord()? My understanding is that the coord() is the fraction of the boolean query that matched a given document. Again, I'm new, so somebody else will have to confirm or deny...

-Mike

On Mon, 20 Dec 2004 00:33:21 -0800 (PST), Gururaja H [EMAIL PROTECTED] wrote:
> How do I find the percentage of matched terms in the document(s) using Lucene?
>
> Here is an example of what I am trying to do. The search query has 5 terms (ibm, risc, tape, drive, manual) and there are 4 matching documents with the following attributes:
>
> Doc#1: contains terms (ibm, drive)
> Doc#2: contains terms (ibm, risc, tape, drive)
> Doc#3: contains terms (ibm, risc, tape, drive)
> Doc#4: contains terms (ibm, risc, tape, drive, manual)
>
> The percentages displayed would be 100% (Doc#4), 80% (Doc#2), 80% (Doc#3) and 40% (Doc#1).
>
> Any help on how to go about doing this?
>
> Thanks,
> Gururaja
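For the example in the question, the arithmetic behind those percentages is just overlap / maxOverlap, where overlap is the number of query terms found in a document and maxOverlap is the number of query terms. A small plain-Java sketch of that arithmetic follows; it uses no Lucene classes, and `MatchedTermPercent` is a made-up name used only for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class MatchedTermPercent {
    /** Number of query terms that occur in the document (the "overlap"). */
    static int overlap(List<String> queryTerms, Set<String> docTerms) {
        int count = 0;
        for (String term : queryTerms) {
            if (docTerms.contains(term)) {
                count++;
            }
        }
        return count;
    }

    /** overlap / maxOverlap expressed as a whole-number percentage. */
    static int percent(int overlap, int maxOverlap) {
        return Math.round(100f * overlap / maxOverlap);
    }

    public static void main(String[] args) {
        List<String> query = Arrays.asList("ibm", "risc", "tape", "drive", "manual");
        Set<String> doc1 = new HashSet<>(Arrays.asList("ibm", "drive"));
        Set<String> doc2 = new HashSet<>(Arrays.asList("ibm", "risc", "tape", "drive"));
        Set<String> doc4 = new HashSet<>(Arrays.asList("ibm", "risc", "tape", "drive", "manual"));

        System.out.println("Doc#1: " + percent(overlap(query, doc1), query.size()) + "%"); // 40%
        System.out.println("Doc#2: " + percent(overlap(query, doc2), query.size()) + "%"); // 80%
        System.out.println("Doc#4: " + percent(overlap(query, doc4), query.size()) + "%"); // 100%
    }
}
```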
Re: Relevance percentage
Hi,

But how do I calculate the coord() fraction? I know that by default, in DefaultSimilarity, coord() is defined as below:

  /** Implemented as overlap / maxOverlap. */
  public float coord(int overlap, int maxOverlap) {
    return overlap / (float)maxOverlap;
  }

How do I get the overlap and maxOverlap values for each of the matched document(s)?

Thanks,
Gururaja

Mike Snare [EMAIL PROTECTED] wrote:
> I'm still new to Lucene, but wouldn't that be the coord()? My understanding is that the coord() is the fraction of the boolean query that matched a given document. Again, I'm new, so somebody else will have to confirm or deny...
>
> -Mike
RE: Relevance percentage
The coord() value is not saved anywhere, so you would need to recompute it. You could either call explain() and parse the result string or, better, look at explain() and implement what it does more efficiently just for coord().

If your queries are all BooleanQuery's of TermQuery's, then this is very simple. Iterate down the list of BooleanClause's and count the number whose score is greater than 0, then divide this by the total number of clauses. Take a look at BooleanQuery.BooleanWeight.explain() as it does this (along with generating the rest of the explanation).

If you support the full Lucene query language, then you need to look at all the query types and decide what exactly you want to compute (as coord is not always well-defined).

I'm on the West Coast of the U.S., so evidently in a very different time zone from you -- will look at your other message next.

Chuck

-----Original Message-----
From: Gururaja H [mailto:[EMAIL PROTECTED]
Sent: Monday, December 20, 2004 6:10 AM
To: Lucene Users List; Mike Snare
Subject: Re: Relevance percentage

Hi,

But how do I calculate the coord() fraction? I know that by default, in DefaultSimilarity, coord() is defined as below:

  /** Implemented as overlap / maxOverlap. */
  public float coord(int overlap, int maxOverlap) {
    return overlap / (float)maxOverlap;
  }

How do I get the overlap and maxOverlap values for each of the matched document(s)?

Thanks,
Gururaja
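The first option mentioned above, parsing the explain() output, could look roughly like the sketch below. It assumes the explanation text contains an entry of the form "coord(4/5)"; that format is an assumption to verify against your Lucene version, and the entry may be missing altogether (for example when every clause matched). `CoordFromExplain` is a hypothetical helper and the sample strings are made up. Only the first coord(...) entry is read, so this is meant for a flat BooleanQuery of TermQuery's.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CoordFromExplain {
    // Assumed shape of the coord entry in the explanation text, e.g. "0.8 = coord(4/5)".
    private static final Pattern COORD = Pattern.compile("coord\\((\\d+)/(\\d+)\\)");

    /** Returns {overlap, maxOverlap}, or null if the text has no coord(...) entry. */
    static int[] parseCoord(String explanation) {
        Matcher m = COORD.matcher(explanation);
        if (!m.find()) {
            return null;
        }
        return new int[] { Integer.parseInt(m.group(1)), Integer.parseInt(m.group(2)) };
    }

    public static void main(String[] args) {
        // Made-up explanation text; a real one would come from Explanation.toString().
        String text = "0.24 = product of:\n  0.3 = sum of:\n  0.8 = coord(4/5)\n";
        int[] counts = parseCoord(text);
        if (counts != null) {
            System.out.println("overlap=" + counts[0] + ", maxOverlap=" + counts[1]);
        }
    }
}
```

A null result has to be interpreted by the caller. Treating it as 100% is only safe if the coord entry really is omitted exactly when all clauses match, which is worth confirming in the explain() source.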
Re: Relevance percentage
On Monday 20 December 2004 15:09, Gururaja H wrote:
> Hi,
>
> But how do I calculate the coord() fraction? I know that by default, in DefaultSimilarity, coord() is defined as below:
>
>   /** Implemented as overlap / maxOverlap. */
>   public float coord(int overlap, int maxOverlap) {
>     return overlap / (float)maxOverlap;
>   }
>
> How do I get the overlap and maxOverlap values for each of the matched document(s)?

In case you only want the coordination factor to have more influence on the order of your search results, you can use a Similarity with a coord() function that has a power higher than 1:

  public float coord(int overlap, int maxOverlap) {
    return (float) Math.pow((overlap / (float)maxOverlap), SOME_POWER);
  }

I'd first try values between 3.0f and 5.0f for SOME_POWER.

The searching code precomputes all coord values once per query per search, so there is no need to worry about the computing efficiency.

This has the advantage that the other scoring factors are still used for ranking. Since the other factors can vary quite a bit, it is difficult to guarantee that any coord() implementation will provide a score that sorts by the number of matching clauses. Higher powers as above can go a long way, though.

Regards,
Paul Elschot
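To see why a higher power changes the ordering, here is a small plain-Java sketch. The raw scores standing in for "the other scoring factors" are invented for illustration, and `PoweredCoord` is a hypothetical name; only the coord() formula itself comes from the message above.

```java
class PoweredCoord {
    /** coord raised to a power, as in the Similarity override suggested above. */
    static float coord(int overlap, int maxOverlap, float power) {
        return (float) Math.pow(overlap / (float) maxOverlap, power);
    }

    public static void main(String[] args) {
        // Invented raw scores (everything except coord) for two documents.
        float rawA = 0.3f; // document A matches 4 of 5 clauses
        float rawB = 0.9f; // document B matches 2 of 5 clauses

        for (float power : new float[] { 1.0f, 4.0f }) {
            float scoreA = rawA * coord(4, 5, power);
            float scoreB = rawB * coord(2, 5, power);
            System.out.println("power=" + power + "  A=" + scoreA + "  B=" + scoreB
                    + "  first: " + (scoreA > scoreB ? "A" : "B"));
        }
    }
}
```

With these invented numbers, B ranks first at power 1 even though it matches fewer clauses, while at power 4 A ranks first. A large enough gap in the other factors can still reverse the order, which is the caveat about there being no guarantee.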
Re: Relevance percentage
Thanks much for the reply.

Paul Elschot [EMAIL PROTECTED] wrote:
> In case you only want the coordination factor to have more influence on the order of your search results, you can use a Similarity with a coord() function that has a power higher than 1:
>
>   public float coord(int overlap, int maxOverlap) {
>     return (float) Math.pow((overlap / (float)maxOverlap), SOME_POWER);
>   }
>
> I'd first try values between 3.0f and 5.0f for SOME_POWER.
>
> The searching code precomputes all coord values once per query per search, so there is no need to worry about the computing efficiency.
>
> This has the advantage that the other scoring factors are still used for ranking. Since the other factors can vary quite a bit, it is difficult to guarantee that any coord() implementation will provide a score that sorts by the number of matching clauses. Higher powers as above can go a long way, though.
>
> Regards,
> Paul Elschot
RE: Relevance percentage
Thanks much for the reply.

Chuck Williams [EMAIL PROTECTED] wrote:
> The coord() value is not saved anywhere, so you would need to recompute it. You could either call explain() and parse the result string or, better, look at explain() and implement what it does more efficiently just for coord().
>
> If your queries are all BooleanQuery's of TermQuery's, then this is very simple. Iterate down the list of BooleanClause's and count the number whose score is greater than 0, then divide this by the total number of clauses. Take a look at BooleanQuery.BooleanWeight.explain() as it does this (along with generating the rest of the explanation).
>
> If you support the full Lucene query language, then you need to look at all the query types and decide what exactly you want to compute (as coord is not always well-defined).
>
> I'm on the West Coast of the U.S., so evidently in a very different time zone from you -- will look at your other message next.
>
> Chuck