RE: Poor Lucene Ranking for Short Text

2004-12-24 Thread Chuck Williams
I think you are confusing lengthNorm with the overall normalization of the 
score.  For overall normalization (prior to a final forced normalization in 
Hits), Lucene uses the formula you cite, except that it never sums 
tf_d*idf_t, using tf_q*idf_t again instead.  The former is computationally 
intractable: changing even a single document changes the idf values, which 
means either that all document norms would have to be recomputed or that the 
sum over the document would need to happen at query time.  Recomputing is 
unacceptable at indexing time with large indices, and summing at query time 
is unacceptable with large documents.
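
A minimal sketch of why the document-side sum cannot simply be stored at 
indexing time.  It assumes an idf of the form 1 + ln(numDocs/(docFreq+1)), 
modeled on what DefaultSimilarity uses as far as I know; the class name, the 
terms, and the counts are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

final class IdfDriftSketch {
    // Assumed idf shape: 1 + ln(numDocs / (docFreq + 1)).
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    // Euclidean length of one document's tf*idf vector: sqrt(sum((tf_d*idf_t)^2)).
    static double docNorm(Map<String, Integer> termFreqs,
                          Map<String, Integer> docFreqs, int numDocs) {
        double sumOfSquares = 0.0;
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            double weight = e.getValue() * idf(docFreqs.get(e.getKey()), numDocs);
            sumOfSquares += weight * weight;
        }
        return Math.sqrt(sumOfSquares);
    }

    // Returns {norm before, norm after} for an untouched document when one
    // other document containing "lucene" is added to the index.
    static double[] normBeforeAndAfter() {
        Map<String, Integer> termFreqs = new HashMap<String, Integer>();
        termFreqs.put("lucene", 2);
        termFreqs.put("ranking", 1);

        Map<String, Integer> docFreqs = new HashMap<String, Integer>();
        docFreqs.put("lucene", 3);
        docFreqs.put("ranking", 5);
        double before = docNorm(termFreqs, docFreqs, 10);

        docFreqs.put("lucene", 4); // the newly added document contains "lucene"
        double after = docNorm(termFreqs, docFreqs, 11);
        return new double[] {before, after};
    }

    public static void main(String[] args) {
        double[] norms = normBeforeAndAfter();
        System.out.println("norm before: " + norms[0] + ", after: " + norms[1]);
    }
}
```

The document itself never changed, yet its norm did, so a stored value would 
go stale with every addition to the index.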

lengthNorm is by default 1/sqrt(number_terms_in_document).  It is not 1.0f by 
default because 1.0f is in general not a good value; e.g., a single occurrence 
of a term in a 1 MB document is not as significant as a single occurrence of 
the same term in a 1 KB document.  However, I find that the default value needs 
additional damping because it affects the score too much, especially for small 
documents.  So, I use something like
   3.0f/log10(1000 + number_terms_in_document)
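
To compare the two curves, here is a small standalone sketch.  It only 
evaluates the two formulas quoted above and does not touch Lucene; the class 
and method names are hypothetical.

```java
final class LengthNormSketch {
    // Default described above: 1/sqrt(number_terms_in_document).
    static float defaultNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // Damped variant described above: 3.0f/log10(1000 + number_terms_in_document).
    static float dampedNorm(int numTerms) {
        return (float) (3.0 / Math.log10(1000.0 + numTerms));
    }

    public static void main(String[] args) {
        int[] lengths = {10, 100, 1000, 10000, 100000};
        for (int n : lengths) {
            System.out.printf("%7d terms: default=%.4f damped=%.4f%n",
                    n, defaultNorm(n), dampedNorm(n));
        }
    }
}
```

Going from 10 to 10000 terms, the default drops by a factor of sqrt(1000), 
about 32, while the damped variant only moves from roughly 1.0 to roughly 
0.74, so short documents get a much smaller advantage.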

Chuck

  > -Original Message-
  > From: [EMAIL PROTECTED]
  > Sent: Friday, December 24, 2004 8:24 AM
  > To: 'Lucene Users List'
  > Subject: RE: Poor Lucene Ranking for Short Text
  > 
  > Hi Kevin,
  > 
  > It seems you have some knowledge about the lengthNorm value in Lucene.
  > Compared with the formula in "Modern Information Retrieval", does it
  > compute the denominator
  > sqrt(sum((tf_d*idf_t)²)) * sqrt(sum((tf_q*idf_t)²))?
  > 
  > Just a quick note is OK.
  > 
  > Besides that, could you invite me to Rojo? Their beta status seems to
  > have lasted quite long.
  > 
  > Thanks
  > Michael
  > 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Poor Lucene Ranking for Short Text

2004-12-24 Thread Michael Hartmann
Hi Kevin,

It seems you have some knowledge about the lengthNorm value in Lucene.
Compared with the formula in "Modern Information Retrieval", does it compute
the denominator sqrt(sum((tf_d*idf_t)²)) * sqrt(sum((tf_q*idf_t)²))?

Just a quick note is OK.

Besides that, could you invite me to Rojo? Their beta status seems to have
lasted quite long.

Thanks
Michael

| -Original Message-
| From: [EMAIL PROTECTED]
| On behalf of Kevin A. Burton
| Sent: Wednesday, 27 October 2004 22:48
| To: Lucene Users List
| Subject: Re: Poor Lucene Ranking for Short Text
| 
| Daniel Naber wrote:
| 
| > (Kevin complains about shorter documents ranked higher)
| >
| > This is something that can easily be fixed. Just use a Similarity
| > implementation that extends DefaultSimilarity and that overrides
| > lengthNorm: just return 1.0f there. You need to use that Similarity
| > for indexing and searching, i.e. it requires reindexing.
| 
| What happens when I do this with an existing index? I don't 
| want to have to rewrite this index as it will take FOREVER
| 
| If the current behavior is all that happens this is fine... 
| this way I can just get this behavior for new documents that 
| are added.
| 
| Also... why isn't this the default?
| 
| Kevin
| 



Re: Poor Lucene Ranking for Short Text

2004-10-27 Thread Daniel Naber
On Wednesday 27 October 2004 22:47, Kevin A. Burton wrote:

> If the current behavior is all that happens this is fine... this way I
> can just get this behavior for new documents that are added.

You'll have to try it out, I'm not sure what exactly will happen.

> Also... why isn't this the default?

You'll probably end up with many documents having exactly the same ranking. 
And those documents will then be sorted in a random order (not really, 
they will be sorted by internal ID I think, but that's not a useful order for 
most use cases).
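
A toy sketch of that effect, not Lucene code.  It assumes a score of 
sqrt(term frequency) times the length norm, and that ties fall back to 
ascending internal ID as suspected above; all names and numbers are made up.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

final class TieOrderSketch {
    static final class Hit {
        final int docId;
        final float score;
        Hit(int docId, float score) { this.docId = docId; this.score = score; }
    }

    // Toy score: sqrt(term frequency) times the length norm.
    static float score(int termFreq, int numTerms, boolean flatNorm) {
        float norm = flatNorm ? 1.0f : (float) (1.0 / Math.sqrt(numTerms));
        return (float) Math.sqrt(termFreq) * norm;
    }

    // Ranks documents by descending score; ties fall back to ascending doc ID.
    static List<Integer> rank(int[] termFreqs, int[] lengths, boolean flatNorm) {
        List<Hit> hits = new ArrayList<Hit>();
        for (int id = 0; id < termFreqs.length; id++) {
            hits.add(new Hit(id, score(termFreqs[id], lengths[id], flatNorm)));
        }
        Collections.sort(hits, new Comparator<Hit>() {
            public int compare(Hit a, Hit b) {
                int byScore = Float.compare(b.score, a.score);
                return byScore != 0 ? byScore : Integer.compare(a.docId, b.docId);
            }
        });
        List<Integer> order = new ArrayList<Integer>();
        for (Hit h : hits) {
            order.add(h.docId);
        }
        return order;
    }

    public static void main(String[] args) {
        int[] termFreqs = {1, 1, 1, 1};        // each document matches the term once
        int[] lengths = {500, 20, 3000, 80};   // document lengths in terms
        System.out.println("length-based norm: " + rank(termFreqs, lengths, false));
        System.out.println("flat norm:         " + rank(termFreqs, lengths, true));
    }
}
```

With the length-based norm the four documents get four different scores; with 
a flat norm of 1.0f they all tie and come back in ID order.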

Regards
 Daniel

-- 
http://www.danielnaber.de




Re: Poor Lucene Ranking for Short Text

2004-10-27 Thread Kevin A. Burton
Daniel Naber wrote:

> (Kevin complains about shorter documents ranked higher)
>
> This is something that can easily be fixed. Just use a Similarity 
> implementation that extends DefaultSimilarity and that overrides 
> lengthNorm: just return 1.0f there. You need to use that Similarity for 
> indexing and searching, i.e. it requires reindexing.

What happens when I do this with an existing index? I don't want to have 
to rewrite this index as it will take FOREVER

If the current behavior is all that happens this is fine... this way I 
can just get this behavior for new documents that are added.

Also... why isn't this the default?

Kevin
--
Use Rojo (RSS/Atom aggregator).  Visit http://rojo.com. Ask me for an 
invite!  Also see irc.freenode.net #rojo if you want to chat.

Rojo is Hiring! - http://www.rojonetworks.com/JobsAtRojo.html
If you're interested in RSS, Weblogs, Social Networking, etc... then you 
should work for Rojo!  If you recommend someone and we hire them you'll 
get a free iPod!
   
Kevin A. Burton, Location - San Francisco, CA
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412



Re: Poor Lucene Ranking for Short Text

2004-10-27 Thread Daniel Naber
On Wednesday 27 October 2004 20:20, Kevin A. Burton wrote:

> http://www.peerfear.org/rss/permalink/2004/10/26/PoorLuceneRankingForShortText/

(Kevin complains about shorter documents ranked higher)

This is something that can easily be fixed. Just use a Similarity 
implementation that extends DefaultSimilarity and that overrides 
lengthNorm: just return 1.0f there. You need to use that Similarity for 
indexing and searching, i.e. it requires reindexing.
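
A rough sketch of such an override.  To keep it compilable without Lucene on 
the classpath, a stand-in base class mirrors the 
lengthNorm(String fieldName, int numTerms) signature as I understand it; in a 
real project the subclass would extend 
org.apache.lucene.search.DefaultSimilarity instead.  The class names are 
hypothetical.

```java
// Stand-in for DefaultSimilarity; only the lengthNorm signature is mirrored here.
class StandInDefaultSimilarity {
    public float lengthNorm(String fieldName, int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }
}

// Hypothetical subclass: ignores field length entirely.
class FlatLengthNormSimilarity extends StandInDefaultSimilarity {
    @Override
    public float lengthNorm(String fieldName, int numTerms) {
        return 1.0f; // every field gets the same norm, whatever its length
    }
}

final class FlatLengthNormDemo {
    public static void main(String[] args) {
        StandInDefaultSimilarity byLength = new StandInDefaultSimilarity();
        StandInDefaultSimilarity flat = new FlatLengthNormSimilarity();
        int[] lengths = {10, 1000, 100000};
        for (int n : lengths) {
            System.out.println(n + " terms: default=" + byLength.lengthNorm("body", n)
                    + " flat=" + flat.lengthNorm("body", n));
        }
    }
}
```

With Lucene itself, the same instance would need to be set on both the 
IndexWriter and the searcher (I believe both have a setSimilarity method, but 
check the API of your version), because the norm is written into the index 
when documents are added.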

Regards
 Daniel

-- 
http://www.danielnaber.de




Poor Lucene Ranking for Short Text

2004-10-27 Thread Kevin A. Burton
http://www.peerfear.org/rss/permalink/2004/10/26/PoorLuceneRankingForShortText/