RE: Relevance percentage

2004-12-23 Thread Chuck Williams
Gururaja,

If you want to score based solely on coord(), then Paul's approach looks
best.  However, based on your earlier messages, it looks to me like you
want to score based on all factors (with coord boosted as Paul
suggested, or lengthNorm flattened as I suggested -- either will get the
order you want in the example you posted), but you want to print the
(unboosted) coord percentage along with each result in the result list.

If this is the case, since the number of results per page on the result
list is presumably small, I think you are best off replicating the
explain() mechanism.  I don't have the source code, but you can look at
IndexSearcher.explain(), which recreates the weight with Query.weight(),
then calls what in this case will be
BooleanQuery.BooleanWeight.explain(), which has the code to recompute
coord on a result (specifically it computes overlap and maxOverlap and
then calls Similarity.coord()).  You could cut and paste this code to
just compute coord for your top-level BooleanQuery's.
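
A rough sketch of the display side of this idea: the ranking keeps the full score, and the coord percentage is recomputed only for the handful of hits on the current page. This is plain Java, not Lucene code. The Hit class, the renderPage() helper, and the score values are hypothetical placeholders, and maxOverlap is simplified to the number of query terms.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustration only: Hit is a hypothetical stand-in for a ranked search
// result, not a Lucene class. The scores below are made-up placeholders.
class ResultPageSketch {

    static final class Hit {
        final String label;
        final float score;        // full relevance score, used only for ordering
        final Set<String> terms;  // terms the document contains

        Hit(String label, float score, String... terms) {
            this.label = label;
            this.score = score;
            this.terms = new HashSet<>(Arrays.asList(terms));
        }
    }

    /** Recomputes overlap / maxOverlap for one hit and returns it as a percentage. */
    static int coordPercent(List<String> queryTerms, Hit hit) {
        int overlap = 0;
        for (String term : queryTerms) {
            if (hit.terms.contains(term)) {
                overlap++;
            }
        }
        return Math.round(100f * overlap / queryTerms.size());
    }

    /** Formats one page of already-ranked hits; coord is computed only for these. */
    static List<String> renderPage(List<String> queryTerms, List<Hit> rankedHits, int pageSize) {
        List<String> lines = new ArrayList<>();
        int end = Math.min(pageSize, rankedHits.size());
        for (Hit hit : rankedHits.subList(0, end)) {
            lines.add(hit.label + " " + coordPercent(queryTerms, hit) + "%");
        }
        return lines;
    }

    public static void main(String[] args) {
        List<String> query = Arrays.asList("ibm", "risc", "tape", "drive", "manual");
        List<Hit> ranked = Arrays.asList(
            new Hit("Doc#4", 0.9f, "ibm", "risc", "tape", "drive", "manual"),
            new Hit("Doc#2", 0.6f, "ibm", "risc", "tape", "drive"),
            new Hit("Doc#3", 0.5f, "ibm", "risc", "tape", "drive"),
            new Hit("Doc#1", 0.2f, "ibm", "drive"));
        for (String line : renderPage(query, ranked, 3)) {
            System.out.println(line);
        }
    }
}
```

Because the percentage is computed only inside renderPage(), the extra cost grows with the page size rather than with the total number of hits.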

Sorry I don't have source code to do this, but the approach should work.
Good luck,

Chuck



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Relevance percentage

2004-12-22 Thread Gururaja H
Hi Chuck Williams,

Thanks much for the reply.

> If your queries are all BooleanQuery's of
> TermQuery's, then this is very simple. Iterate down the list of
> BooleanClause's and count the number whose score is > 0, then divide
> this by the total number of clauses. Take a look at
> BooleanQuery.BooleanWeight.explain() as it does this (along with
> generating the rest of the explanation). If you support the full Lucene
> query language, then you need to look at all the query types and decide
> what exactly you want to compute (as coord is not always well-defined).

We are supporting the full Lucene query language.

My request is: assuming the queries are all BooleanQuery's, please
post the implementation source code for this, i.e. the code that calculates
overlap and maxOverlap, the input parameters of the coord() method.
 
Thanks,
Gururaja






Re: Relevance percentage

2004-12-22 Thread Paul Elschot
On Thursday 23 December 2004 08:13, Gururaja H wrote:
> Hi Chuck Williams,
>
> Thanks much for the reply.
>
> > If your queries are all BooleanQuery's of
> > TermQuery's, then this is very simple. Iterate down the list of
> > BooleanClause's and count the number whose score is > 0, then divide
> > this by the total number of clauses. Take a look at
> > BooleanQuery.BooleanWeight.explain() as it does this (along with
> > generating the rest of the explanation). If you support the full Lucene
> > query language, then you need to look at all the query types and decide
> > what exactly you want to compute (as coord is not always well-defined).
>
> We are supporting the full Lucene query language.
>
> My request is: assuming the queries are all BooleanQuery's, please
> post the implementation source code for this, i.e. the code that calculates
> overlap and maxOverlap, the input parameters of the coord() method.

I don't have the code, but I can give an overview of possible
steps:

First inherit from BooleanScorer to implement a score() method that
returns only the coord() value (preferably a precomputed one).
Then inherit from BooleanQuery.BooleanWeight to return the above
Scorer.
Then inherit from BooleanQuery to use the above Weight in createWeight().
Then inherit from QueryParser to use the above Query in getBooleanQuery().
Finally use such a query in a search: the document scores will be
the coord() values.

Regards,
Paul Elschot.





Re: Relevance percentage

2004-12-20 Thread Mike Snare
I'm still new to Lucene, but wouldn't that be the coord()?  My
understanding is that the coord() is the fraction of the boolean query
that matched a given document.

Again, I'm new, so somebody else will have to confirm or deny...

-Mike


On Mon, 20 Dec 2004 00:33:21 -0800 (PST), Gururaja H
[EMAIL PROTECTED] wrote:
> How can I find the percentage of matched terms in the document(s) using
> Lucene?
> Here is an example of what I am trying to do:
> The search query has 5 terms (ibm, risc, tape, drive, manual) and there are 4
> matching documents with the following attributes:
> Doc#1: contains terms (ibm, drive)
> Doc#2: contains terms (ibm, risc, tape, drive)
> Doc#3: contains terms (ibm, risc, tape, drive)
> Doc#4: contains terms (ibm, risc, tape, drive, manual).
> The percentages displayed would be 100% (Doc#4), 80% (Doc#2), 80% (Doc#3)
> and 40% (Doc#1).
>
> Any help on how to go about doing this?
>
> Thanks,
> Gururaja
 
 



Re: Relevance percentage

2004-12-20 Thread Gururaja H
Hi,

But how do I calculate the coord() fraction?  I know that by default,
in DefaultSimilarity, the coord() fraction is defined as below:

  /** Implemented as <code>overlap / maxOverlap</code>. */
  public float coord(int overlap, int maxOverlap) {
    return overlap / (float)maxOverlap;
  }

How do I get the overlap and maxOverlap values for each of the matched
document(s)?

Thanks,
Gururaja


RE: Relevance percentage

2004-12-20 Thread Chuck Williams
The coord() value is not saved anywhere so you would need to recompute
it.  You could either call explain() and parse the result string, or
better, look at explain() and implement what it does more efficiently
just for coord().  If your queries are all BooleanQuery's of
TermQuery's, then this is very simple.  Iterate down the list of
BooleanClause's and count the number whose score is > 0, then divide
this by the total number of clauses.  Take a look at
BooleanQuery.BooleanWeight.explain() as it does this (along with
generating the rest of the explanation).  If you support the full Lucene
query language, then you need to look at all the query types and decide
what exactly you want to compute (as coord is not always well-defined).
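
A minimal sketch of that counting, using the example from the original question. It is plain Java rather than Lucene code: a list of query terms stands in for the BooleanClause list, and "the document contains the term" stands in for "the clause's score is > 0". The class and method names are made up for illustration, and in a real BooleanQuery the total would presumably have to leave out prohibited clauses.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustration only: plain Java collections stand in for Lucene's
// BooleanClause list and per-clause scores.
class CoordSketch {

    /** Number of query terms that occur in the document (the "overlap"). */
    static int overlap(List<String> queryTerms, Set<String> docTerms) {
        int count = 0;
        for (String term : queryTerms) {
            if (docTerms.contains(term)) {
                count++;
            }
        }
        return count;
    }

    /** Same formula as the DefaultSimilarity.coord() quoted in this thread. */
    static float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }

    /** Coord fraction as a whole-number percentage. */
    static int percent(List<String> queryTerms, Set<String> docTerms) {
        return Math.round(100f * coord(overlap(queryTerms, docTerms), queryTerms.size()));
    }

    public static void main(String[] args) {
        List<String> query = Arrays.asList("ibm", "risc", "tape", "drive", "manual");
        Set<String> doc1 = new HashSet<>(Arrays.asList("ibm", "drive"));
        Set<String> doc2 = new HashSet<>(Arrays.asList("ibm", "risc", "tape", "drive"));
        Set<String> doc4 = new HashSet<>(Arrays.asList("ibm", "risc", "tape", "drive", "manual"));

        System.out.println("Doc#1: " + percent(query, doc1) + "%"); // 2 of 5 terms
        System.out.println("Doc#2: " + percent(query, doc2) + "%"); // 4 of 5 terms
        System.out.println("Doc#4: " + percent(query, doc4) + "%"); // 5 of 5 terms
    }
}
```

For the documents in the question this gives 40%, 80% and 100%, matching the percentages asked for.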

I'm on the West Coast of the U.S. so evidently on a very different time
zone from you -- will look at your other message next.

Chuck




Re: Relevance percentage

2004-12-20 Thread Paul Elschot
On Monday 20 December 2004 15:09, Gururaja H wrote:
> Hi,
>
> But how do I calculate the coord() fraction?  I know that by default,
> in DefaultSimilarity, the coord() fraction is defined as below:
>
>   /** Implemented as <code>overlap / maxOverlap</code>. */
>   public float coord(int overlap, int maxOverlap) {
>     return overlap / (float)maxOverlap;
>   }
>
> How do I get the overlap and maxOverlap values for each of the matched
> document(s)?

In case you only want the coordination factor to have more influence
in the order of your search results you can use a Similarity with
a coord() function that has a power higher than 1:

  public float coord(int overlap, int maxOverlap) {
    return (float) Math.pow((overlap / (float)maxOverlap), SOME_POWER);
  }

I'd first try values between 3.0f and 5.0f for SOME_POWER.

The searching code precomputes all coord values once per query
per search, so there is no need to worry about the computing efficiency.

This has the advantage that the other scoring factors are still used
for ranking.

Since the other factors can vary quite a bit, it is difficult to guarantee
that any coord() implementation will provide a score that sorts by the
number of matching clauses. Higher powers as above can come
a long way, though.
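
To get a feel for the effect, here is a small standalone sketch that prints the plain and the powered coord values for a five-clause query. It is plain Java with static methods standing in for a Similarity subclass. The class name is made up, and the exponent 3.0f is just the low end of the range suggested above.

```java
// Illustration only: plain static methods standing in for a Similarity
// subclass; the exponent 3.0f is one of the values suggested above.
class PoweredCoordSketch {

    /** The default formula quoted in this thread. */
    static float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }

    /** The default fraction raised to a power, as in the suggestion above. */
    static float poweredCoord(int overlap, int maxOverlap, float power) {
        return (float) Math.pow(overlap / (float) maxOverlap, power);
    }

    public static void main(String[] args) {
        int maxOverlap = 5;
        float power = 3.0f;
        for (int overlap = 1; overlap <= maxOverlap; overlap++) {
            System.out.printf("%d of %d: plain=%.3f powered=%.3f%n",
                overlap, maxOverlap,
                coord(overlap, maxOverlap),
                poweredCoord(overlap, maxOverlap, power));
        }
    }
}
```

With a power of 3, a document matching 4 of 5 clauses gets a coord of about 0.512 instead of 0.8, so the other scoring factors would have to differ by nearly a factor of 2 (rather than 1.25) to rank it above a document matching all 5.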

Regards,
Paul Elschot


  



Re: Relevance percentage

2004-12-20 Thread Gururaja H
Thanks much for the reply.

Paul Elschot [EMAIL PROTECTED] wrote:
> On Monday 20 December 2004 15:09, Gururaja H wrote:
> > Hi,
> >
> > But how do I calculate the coord() fraction?  I know that by default,
> > in DefaultSimilarity, the coord() fraction is defined as below:
> >
> >   /** Implemented as <code>overlap / maxOverlap</code>. */
> >   public float coord(int overlap, int maxOverlap) {
> >     return overlap / (float)maxOverlap;
> >   }
> >
> > How do I get the overlap and maxOverlap values for each of the matched
> > document(s)?
>
> In case you only want the coordination factor to have more influence
> in the order of your search results you can use a Similarity with
> a coord() function that has a power higher than 1:
>
>   public float coord(int overlap, int maxOverlap) {
>     return (float) Math.pow((overlap / (float)maxOverlap), SOME_POWER);
>   }
>
> I'd first try values between 3.0f and 5.0f for SOME_POWER.
>
> The searching code precomputes all coord values once per query
> per search, so there is no need to worry about the computing efficiency.
>
> This has the advantage that the other scoring factors are still used
> for ranking.
>
> Since the other factors can vary quite a bit, it is difficult to guarantee
> that any coord() implementation will provide a score that sorts by the
> number of matching clauses. Higher powers as above can come
> a long way, though.
>
> Regards,
> Paul Elschot




RE: Relevance percentage

2004-12-20 Thread Gururaja H
Thanks much for the reply.

Chuck Williams [EMAIL PROTECTED] wrote:
> The coord() value is not saved anywhere so you would need to recompute
> it. You could either call explain() and parse the result string, or
> better, look at explain() and implement what it does more efficiently
> just for coord(). If your queries are all BooleanQuery's of
> TermQuery's, then this is very simple. Iterate down the list of
> BooleanClause's and count the number whose score is > 0, then divide
> this by the total number of clauses. Take a look at
> BooleanQuery.BooleanWeight.explain() as it does this (along with
> generating the rest of the explanation). If you support the full Lucene
> query language, then you need to look at all the query types and decide
> what exactly you want to compute (as coord is not always well-defined).
>
> I'm on the West Coast of the U.S. so evidently on a very different time
> zone from you -- will look at your other message next.
>
> Chuck
