RE: Poor Lucene Ranking for Short Text
I think you are confusing lengthNorm with the overall normalization of the score.

For overall normalization (prior to a final forced normalization in Hits), Lucene uses the formula you cite, except that it never sums tf_d*idf_t; it uses tf_q*idf_t again instead, because the former is computationally intractable. Changing even a single document changes the idf values, which means either that all document norms would have to be recomputed or that the sum over the document would need to happen at query time. The former is unacceptable for indexing time with large indices, and the latter is unacceptable for query time with large documents.

lengthNorm is by default 1/sqrt(number_terms_in_document). It is not 1.0f by default because 1.0f is in general not a good value; e.g., a single occurrence of a term in a 1 MB document is not as significant as a single occurrence of the same term in a 1 KB document. However, I find that the default value needs additional damping because it affects the score too much, especially for small documents. So I use something like

    3.0f/log10(1000 + number_terms_in_document)

Chuck

> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
> Sent: Friday, December 24, 2004 8:24 AM
> To: 'Lucene Users List'
> Subject: AW: Poor Lucene Ranking for Short Text
>
> Hi Kevin,
>
> It seems you have some knowledge about the lengthNorm value in Lucene.
> Compared to the formula in "Modern Information Retrieval", does it
> correspond to the denominator
> sqrt(sum((tf_d*idf_t)²)) * sqrt(sum((tf_q*idf_t)²))?
>
> A quick note is fine.
>
> Thanks
> Michael

To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
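To make the effect of the damping in Chuck's message concrete, here is a minimal Java sketch that evaluates both formulas from the thread side by side. The class and method names (LengthNormDemo, defaultNorm, dampedNorm) are hypothetical helpers for illustration only; they are not Lucene classes, and the sketch ignores how Lucene actually stores norms in the index.

```java
// Hypothetical stand-alone sketch; these helpers are NOT part of Lucene.
class LengthNormDemo {

    /** Default norm as described in the thread: 1/sqrt(number of terms). */
    static float defaultNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    /** Damped variant from the thread: 3.0f/log10(1000 + number of terms). */
    static float dampedNorm(int numTerms) {
        return (float) (3.0 / Math.log10(1000.0 + numTerms));
    }

    public static void main(String[] args) {
        // Document lengths (in terms) chosen arbitrarily for comparison.
        int[] lengths = {10, 100, 1000, 10000, 100000};
        for (int n : lengths) {
            System.out.printf("terms=%d default=%.4f damped=%.4f%n",
                    n, defaultNorm(n), dampedNorm(n));
        }
    }
}
```

Between a 100-term and a 10,000-term document, the default norm differs by a factor of 10, while the damped variant differs by only about a factor of 1.3, which is the reduced sensitivity to document length that the message describes.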
AW: Poor Lucene Ranking for Short Text
Hi Kevin,

It seems you have some knowledge about the lengthNorm value in Lucene. Compared to the formula in "Modern Information Retrieval", does it correspond to the denominator

    sqrt(sum((tf_d*idf_t)²)) * sqrt(sum((tf_q*idf_t)²))?

A quick note is fine.

Besides that, could you invite me to Rojo? Their beta status seems to be lasting quite long.

Thanks
Michael

| -----Original Message-----
| From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] on behalf of Kevin A. Burton
| Sent: Wednesday, October 27, 2004 22:48
| To: Lucene Users List
| Subject: Re: Poor Lucene Ranking for Short Text
|
| Daniel Naber wrote:
|
| > (Kevin complains about shorter documents ranked higher)
| >
| > This is something that can easily be fixed. Just use a Similarity
| > implementation that extends DefaultSimilarity and overrides
| > lengthNorm: just return 1.0f there. You need to use that Similarity
| > for indexing and searching, i.e. it requires reindexing.
|
| What happens when I do this with an existing index? I don't
| want to have to rewrite this index as it will take FOREVER.
|
| If the current behavior is all that happens this is fine...
| this way I can just get this behavior for new documents that
| are added.
|
| Also... why isn't this the default?
|
| Kevin
Re: Poor Lucene Ranking for Short Text
On Wednesday 27 October 2004 22:47, Kevin A. Burton wrote:

> If the current behavior is all that happens this is fine... this way I
> can just get this behavior for new documents that are added.

You'll have to try it out; I'm not sure what exactly will happen.

> Also... why isn't this the default?

You'll probably end up with many documents having exactly the same ranking. Those documents will then be sorted in a random order (not really; they will be sorted by internal ID, I think, but that's not a useful order for most use cases).

Regards
 Daniel

--
http://www.danielnaber.de
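The tie-ordering concern in Daniel's reply can be illustrated with a small, self-contained Java sketch. It assumes, as Daniel tentatively suggests, that equal scores fall back to ascending internal document ID; the TieOrderDemo and Hit types are hypothetical and do not model Lucene's actual Hits implementation.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of score ordering with an ID tie-break; not Lucene code.
class TieOrderDemo {

    /** Hypothetical hit record: internal document ID plus its score. */
    static final class Hit {
        final int docId;
        final float score;

        Hit(int docId, float score) {
            this.docId = docId;
            this.score = score;
        }
    }

    /** Sorts by descending score; equal scores fall back to ascending docId. */
    static List<Hit> rank(List<Hit> hits) {
        List<Hit> sorted = new ArrayList<>(hits);
        Comparator<Hit> byScoreDesc =
                Comparator.comparingDouble((Hit h) -> h.score).reversed();
        sorted.sort(byScoreDesc.thenComparingInt(h -> h.docId));
        return sorted;
    }

    public static void main(String[] args) {
        List<Hit> hits = new ArrayList<>();
        hits.add(new Hit(7, 1.0f));
        hits.add(new Hit(3, 1.0f));
        hits.add(new Hit(5, 2.0f));
        for (Hit h : rank(hits)) {
            System.out.println("doc " + h.docId + " score " + h.score);
        }
    }
}
```

With a flat length norm, many documents would share a score like documents 7 and 3 here, so their relative order would reflect only when they were indexed, not how relevant they are.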
Re: Poor Lucene Ranking for Short Text
Daniel Naber wrote:

> (Kevin complains about shorter documents ranked higher)
>
> This is something that can easily be fixed. Just use a Similarity
> implementation that extends DefaultSimilarity and overrides
> lengthNorm: just return 1.0f there. You need to use that Similarity
> for indexing and searching, i.e. it requires reindexing.

What happens when I do this with an existing index? I don't want to have to rewrite this index as it will take FOREVER.

If the current behavior is all that happens this is fine... this way I can just get this behavior for new documents that are added.

Also... why isn't this the default?

Kevin

--
Use Rojo (RSS/Atom aggregator). Visit http://rojo.com. Ask me for an invite! Also see irc.freenode.net #rojo if you want to chat.

Rojo is Hiring! - http://www.rojonetworks.com/JobsAtRojo.html

If you're interested in RSS, Weblogs, Social Networking, etc... then you should work for Rojo! If you recommend someone and we hire them you'll get a free iPod!

Kevin A. Burton, Location - San Francisco, CA
AIM/YIM - sfburtonator, Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
Re: Poor Lucene Ranking for Short Text
On Wednesday 27 October 2004 20:20, Kevin A. Burton wrote:

> http://www.peerfear.org/rss/permalink/2004/10/26/PoorLuceneRankingForShortText/

(Kevin complains about shorter documents ranked higher)

This is something that can easily be fixed. Just use a Similarity implementation that extends DefaultSimilarity and overrides lengthNorm: just return 1.0f there. You need to use that Similarity for both indexing and searching, i.e. it requires reindexing.

Regards
 Daniel

--
http://www.danielnaber.de
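As a rough illustration of what Daniel's suggestion changes, the following Java sketch compares a square-root length norm with a flat norm of 1.0f. Everything here is a hypothetical stand-in: the LengthNorm interface only mimics the shape of the lengthNorm hook named in the thread, and toyScore is a deliberately simplified score (term frequency times norm) that omits idf and every other factor Lucene applies. In a real index the override would live in a DefaultSimilarity subclass used for both indexing and searching, as the message says.

```java
// Hypothetical stand-in for the lengthNorm hook; this is NOT Lucene's API.
class FlatNormDemo {

    /** Mimics the idea of a per-field length normalization function. */
    interface LengthNorm {
        float lengthNorm(String fieldName, int numTerms);
    }

    /** Default behaviour described in the thread: 1/sqrt(number of terms). */
    static final LengthNorm DEFAULT = (field, n) -> (float) (1.0 / Math.sqrt(n));

    /** Daniel's suggestion: ignore document length entirely. */
    static final LengthNorm FLAT = (field, n) -> 1.0f;

    /** Toy score: term frequency times length norm (idf etc. omitted). */
    static float toyScore(LengthNorm norm, int termFreq, int numTerms) {
        return termFreq * norm.lengthNorm("body", numTerms);
    }

    public static void main(String[] args) {
        // Short document: 1 match in 4 terms. Long document: 3 matches in 400 terms.
        System.out.println("default short=" + toyScore(DEFAULT, 1, 4)
                + " long=" + toyScore(DEFAULT, 3, 400));
        System.out.println("flat    short=" + toyScore(FLAT, 1, 4)
                + " long=" + toyScore(FLAT, 3, 400));
    }
}
```

Under the square-root norm the short document outranks the long one despite fewer matches; with the flat norm the ordering flips, which is the behavior Kevin was asking for.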
Poor Lucene Ranking for Short Text
http://www.peerfear.org/rss/permalink/2004/10/26/PoorLuceneRankingForShortText/

--
Kevin A. Burton, Location - San Francisco, CA
AIM/YIM - sfburtonator, Web - http://peerfear.org/