On Thu, Jul 28, 2011 at 08:31, Chris Hostetter <hossman_luc...@fucit.org> wrote:

>
> : Presumably, they are doing this by increasing tf (term frequency),
> : i.e., by repeating keywords multiple times. If so, you can use a custom
> : similarity class that caps term frequency, and/or ensures that the scoring
> : increases less than linearly with tf. Please see
>

In some cases, yes, they are repeating keywords multiple times, stuffing in
different combinations: Solr, Solr Lucene, Solr Search, Solr Apache, Solr
Guide.
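
As a rough illustration of the suggestion quoted above (cap tf and keep its
growth sublinear), here is a minimal sketch in plain Python. It only shows the
arithmetic and is not Lucene API: the function name `capped_tf` and the
cutoff `cap=5` are made up for the example. In Lucene the same idea would
presumably live in the tf method of a custom Similarity subclass.

```python
import math


def capped_tf(freq, cap=5):
    """Sublinear tf factor that stops growing once freq reaches cap.

    `cap` is a hypothetical tuning value, not a Lucene default.
    """
    # Occurrences beyond `cap` are ignored, so repeating a keyword
    # more than `cap` times earns no extra score.
    effective = min(freq, cap)
    # The square root makes growth below the cap less than linear.
    return math.sqrt(effective)
```

With this, a document that repeats "Solr" 100 times gets the same tf factor
as one that mentions it 5 times, so the stuffing beyond the cap buys nothing.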


>
> in particular, using something like SweetSpotSimilarity tuned to know what
> values make sense for "good" content in your domain can be useful because
> it can actually penalize documents that are too short/long or have term
> freqs that are outside of a reasonable expected range.
>

I am not a Solr expert, but I was thinking in this direction. The ratio
tokens/total_length would be nearer to 1 for a stuffed document and nearer
to 0 for a bogus document. Somewhere between the two lie the documents that
are more likely to be meaningful. I am not sure how to use
SweetSpotSimilarity. I am googling it, but any useful insights would be
much appreciated.
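
To make the "somewhere between the two" idea concrete, here is a small sketch
of a plateau-shaped length norm, modeled on the formula that, as far as I
understand it, the SweetSpotSimilarity Javadoc describes for its length
normalization. It is plain Python for the arithmetic only, not the Lucene
class itself, and the parameter values in the usage note (100, 1000, 0.5) are
invented for illustration rather than recommended settings.

```python
import math


def sweet_spot_length_norm(num_tokens, min_len, max_len, steepness):
    """Length norm that is 1.0 inside [min_len, max_len] and decays outside.

    All parameters are hypothetical tuning knobs for this sketch.
    """
    # Zero inside the plateau; twice the distance to the nearest edge outside.
    overshoot = (
        abs(num_tokens - min_len)
        + abs(num_tokens - max_len)
        - (max_len - min_len)
    )
    # Larger steepness makes the norm fall off faster outside the plateau.
    return 1.0 / math.sqrt(steepness * overshoot + 1.0)
```

For example, with `min_len=100`, `max_len=1000` and `steepness=0.5`, a
500-token document keeps a norm of 1.0, while a 10-token stub and a
5000-token keyword dump both get a norm below 1.0 and are pushed down.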
