On Thu, Jul 28, 2011 at 08:31, Chris Hostetter <hossman_luc...@fucit.org> wrote:
> : Presumably, they are doing this by increasing tf (term frequency),
> : i.e., by repeating keywords multiple times. If so, you can use a custom
> : similarity class that caps term frequency, and/or ensures that the
> : scoring increases less than linearly with tf. Please see

In some cases, yes, they are repeating keywords multiple times. They stuff
different combinations: Solr, Solr Lucene, Solr Search, Solr Apache, Solr
Guide.

> in particular, using something like SweetSpotSimilarity tuned to know what
> values make sense for "good" content in your domain can be useful because
> it can actually penalize documents that are too short/long or have term
> freqs that are outside of a reasonable expected range.

I am not a Solr expert, but I was thinking in this direction. The ratio of
tokens/total_length would be nearer to 1 for a stuffed document, while it
would be nearer to 0 for a bogus document. Somewhere between the two lie the
documents that are more likely to be meaningful.

I am not sure how to use SweetSpotSimilarity. I am googling this, but any
useful insights would be much appreciated.
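
To make the two suggestions above concrete for myself, here is a minimal plain-Java sketch of the underlying math. It does not use Lucene at all. The class name, method names, and all parameter values (cap of 16, sweet spot of 20 to 200 terms, steepness 0.5) are my own assumptions for illustration. The baselineTf and lengthNorm formulas follow my reading of the SweetSpotSimilarity Javadoc and should be checked against the Lucene version in use.

```java
// Illustrative sketch only: plain Java, no Lucene dependency.
// All names and numeric parameters here are hypothetical.
final class SweetSpotSketch {

    // Capped tf: repeating a keyword more than `cap` times earns nothing extra,
    // and below the cap the score grows as sqrt(freq), i.e. less than linearly.
    static float cappedTf(float freq, float cap) {
        return (float) Math.sqrt(Math.min(freq, cap));
    }

    // Baseline tf: flat at `base` for freq <= min, then sqrt-like growth.
    // Assumed form: (x <= min) ? base : sqrt(x + base^2 - min)
    static float baselineTf(float freq, float base, float min) {
        return (freq <= min) ? base : (float) Math.sqrt(freq + (base * base) - min);
    }

    // Length norm with a plateau: 1.0 when numTerms is inside [min, max],
    // decaying on both sides, so too-short and too-long documents are penalized.
    // Assumed form: 1 / sqrt(steepness * (|x-min| + |x-max| - (max-min)) + 1)
    static float lengthNorm(int numTerms, int min, int max, float steepness) {
        float distance = Math.abs(numTerms - min) + Math.abs(numTerms - max) - (max - min);
        return (float) (1.0 / Math.sqrt(steepness * distance + 1.0));
    }

    public static void main(String[] args) {
        // Compare a normal keyword count with a stuffed one.
        for (int freq : new int[] {1, 4, 16, 100}) {
            System.out.printf("freq=%d cappedTf=%.3f baselineTf=%.3f%n",
                    freq, cappedTf(freq, 16f), baselineTf(freq, 1.5f, 5f));
        }
        // Compare documents below, inside, and above the assumed sweet spot.
        for (int len : new int[] {5, 100, 230}) {
            System.out.printf("numTerms=%d lengthNorm=%.3f%n",
                    len, lengthNorm(len, 20, 200, 0.5f));
        }
    }
}
```

With these assumed settings, a keyword repeated 100 times gets the same capped tf as one repeated 16 times, and a 5-term document gets a quarter of the length norm of a 100-term one. In Solr, a custom Similarity class is registered through the <similarity> element in schema.xml. The right thresholds depend on what "good" content looks like in the domain.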