Re: Dealing with keyword stuffing

2011-07-29 Thread Pranav Prakash
Cool. I used SweetSpotSimilarity with the default parameters and I see some
improvement. However, some of the 'stuffed' documents still come up in the
results, so I feel that SweetSpotSimilarity alone is not enough. Going through
http://trec.nist.gov/pubs/trec16/papers/ibm-haifa.mq.final.pdf I gather
that there are other things, pivoted length normalization and term
frequency normalization, that need fine-tuning too.

Should I create a custom Similarity class that overrides all the default
behavior? I guess that should help me get more relevant results. Where
should I start? Please do not assume I know the less obvious things; I
am still learning! :-)
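
For reference, the pivoted length normalization mentioned above can be sketched in a few lines. This is plain Python arithmetic, not Lucene or Solr code. It assumes the commonly cited formulation in which the normalization factor is (1 - slope) * pivot + slope * length, with the pivot usually taken to be the average document length. The function name, the scaling to 1.0 at the pivot, and the slope and pivot values are illustrative assumptions, not values from the paper.

```python
def pivoted_length_norm(doc_len, pivot, slope=0.25):
    """Return a score multiplier that equals 1.0 when doc_len == pivot.

    Documents longer than the pivot get a multiplier below 1.0 and shorter
    ones get a multiplier above 1.0. A smaller slope flattens the curve,
    so length matters less. (Hypothetical helper, not a Lucene API.)
    """
    if pivot <= 0 or doc_len < 0:
        raise ValueError("pivot must be positive and doc_len non-negative")
    if not 0.0 <= slope <= 1.0:
        raise ValueError("slope must be between 0 and 1")
    return pivot / ((1.0 - slope) * pivot + slope * doc_len)


if __name__ == "__main__":
    # Assumed pivot of 500 tokens, e.g. the average length in the index.
    for length in (100, 500, 2000):
        print(length, round(pivoted_length_norm(length, pivot=500), 3))
```

In a custom Similarity, a factor like this would take the place of the default length norm, and slope and pivot are the values that would need tuning against real queries.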

*Pranav Prakash*



On Thu, Jul 28, 2011 at 17:03, Gora Mohanty g...@mimirtech.com wrote:

 On Thu, Jul 28, 2011 at 3:48 PM, Pranav Prakash pra...@gmail.com wrote:
 [...]
  I am not sure how to use SweetSpotSimilarity. I am googling on this, but
  any useful insights are so much appreciated.

 Replace the existing DefaultSimilarity class in schema.xml (look towards
 the bottom of the file) with the SweetSpotSimilarity class, e.g., have a
 line like:
  <similarity class="org.apache.lucene.misc.SweetSpotSimilarity"/>

 Regards,
 Gora



Re: Dealing with keyword stuffing

2011-07-28 Thread Pranav Prakash
On Thu, Jul 28, 2011 at 08:31, Chris Hostetter hossman_luc...@fucit.org wrote:


 : Presumably, they are doing this by increasing tf (term frequency),
 : i.e., by repeating keywords multiple times. If so, you can use a custom
 : similarity class that caps term frequency, and/or ensures that the scoring
 : increases less than linearly with tf. Please see


In some cases, yes, they are repeating keywords multiple times, stuffing
different combinations: Solr, Solr Lucene, Solr Search, Solr Apache, Solr
Guide.



 in particular, using something like SweetSpotSimilarity tuned to know what
 values make sense for good content in your domain can be useful because
 it can actually penalize documents that are too short/long or have term
 freqs that are outside of a reasonable expected range.


I am not a Solr expert, but I was thinking in this direction. The ratio of
tokens/total_length would be nearer to 1 for a stuffed document, while it
would be nearer to 0 for a bogus document. Somewhere between the two lie
the documents that are more likely to be meaningful. I am not sure how to use
SweetSpotSimilarity. I am googling this, but any useful insights would be
much appreciated.
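
One possible reading of the ratio above is "occurrences of the search term divided by the total number of tokens in the document". A minimal Python sketch of that reading follows. The function name, the whitespace tokenization and the lowercasing are all assumptions made for illustration; a real index would use the field's analyzer.

```python
def term_density(text, term):
    """Fraction of whitespace-separated tokens in text that equal term.

    Close to 1.0 for a document that is almost nothing but the term,
    close to 0.0 for a document that barely mentions it.
    (Hypothetical helper, not part of Solr.)
    """
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return tokens.count(term.lower()) / len(tokens)


if __name__ == "__main__":
    stuffed = "solr solr solr solr"
    normal = "a guide to configuring solr for search"
    print(term_density(stuffed, "solr"))  # every token is the term
    print(term_density(normal, "solr"))   # one token out of seven
```

A density like this could be computed at index time and stored in a field, so that documents far outside the range seen in known-good content can be inspected or demoted.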


Re: Dealing with keyword stuffing

2011-07-28 Thread Gora Mohanty
On Thu, Jul 28, 2011 at 3:48 PM, Pranav Prakash pra...@gmail.com wrote:
[...]
 I am not sure how to use SweetSpotSimilarity. I am googling on this, but
 any useful insights are so much appreciated.

Replace the existing DefaultSimilarity class in schema.xml (look towards
the bottom of the file) with the SweetSpotSimilarity class, e.g., have a line
like:
  <similarity class="org.apache.lucene.misc.SweetSpotSimilarity"/>

Regards,
Gora


Re: Dealing with keyword stuffing

2011-07-27 Thread Gora Mohanty
On Wed, Jul 27, 2011 at 7:15 PM, Pranav Prakash pra...@gmail.com wrote:
 I guess most of you have already handled, and many of you might still be
 handling, keyword stuffing. Here is my scenario. We have a huge index
 containing about 6m docs. (Not sure if that is huge :-)) Every document
 contains title, description, tags, content (textual data). People have been
 doing keyword stuffing on the documents, so when searching for a query
 term, the first results are always the ones that are optimized.

 So, instead of people getting relevant results, they get spam content
 (highly optimized, keyword stuffed content) as first few results. I have
 tried a couple of things like providing different boosts to different
 fields, but almost everything seems to fail.
[...]

Presumably, they are doing this by increasing tf (term frequency),
i.e., by repeating keywords multiple times. If so, you can use a custom
similarity class that caps term frequency, and/or ensures that the scoring
increases less than linearly with tf. Please see
http://wiki.apache.org/solr/SchemaXml#Similarity , and/or do a web
search for more details.
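
As a rough illustration of the two options above (capping tf, and making the score grow less than linearly with tf), here is the arithmetic in plain Python. It is not Lucene code: the function names and the max_freq value are made up for illustration. The one Lucene fact relied on is that DefaultSimilarity computes tf as sqrt(freq), so the default is already sub-linear; a cap or a logarithm dampens repetition further.

```python
import math


def default_tf(freq):
    """Square-root tf, the shape DefaultSimilarity uses."""
    return math.sqrt(freq)


def capped_tf(freq, max_freq=9):
    """Square-root tf that stops growing once freq reaches max_freq.

    max_freq is a hypothetical tuning value, not a Lucene default.
    """
    return math.sqrt(min(freq, max_freq))


def log_tf(freq):
    """Logarithmic tf: grows much more slowly than sqrt for large freq."""
    return 1.0 + math.log(freq) if freq > 0 else 0.0


if __name__ == "__main__":
    for freq in (1, 4, 9, 100):
        print(freq, default_tf(freq), capped_tf(freq), round(log_tf(freq), 3))
```

With the cap in place, a document that repeats a keyword 100 times gets the same tf contribution as one that uses it 9 times, which removes the incentive to repeat it further.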

Regards,
Gora


Re: Dealing with keyword stuffing

2011-07-27 Thread Chris Hostetter

: Presumably, they are doing this by increasing tf (term frequency),
: i.e., by repeating keywords multiple times. If so, you can use a custom
: similarity class that caps term frequency, and/or ensures that the scoring
: increases less than linearly with tf. Please see

in particular, using something like SweetSpotSimilarity tuned to know what
values make sense for good content in your domain can be useful because
it can actually penalize documents that are too short/long or have term
freqs that are outside of a reasonable expected range.
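
To give a feel for the length penalty described above, here is a plain-Python sketch of a plateau-shaped length norm of the kind SweetSpotSimilarity computes: 1.0 anywhere inside the "sweet spot" between a minimum and maximum length, falling off on both sides. The formula follows my recollection of the class's Javadoc, 1 / sqrt(steepness * (|len - min| + |len - max| - (max - min)) + 1), so treat it as an assumption and check it against your Lucene version. The function name and the min, max and steepness values are hypothetical.

```python
import math


def sweet_spot_length_norm(num_tokens, min_len, max_len, steepness=0.5):
    """Length norm that is flat (1.0) on [min_len, max_len] and decays outside.

    Illustrative re-implementation in Python, not the Lucene class itself.
    """
    if min_len > max_len:
        raise ValueError("min_len must not exceed max_len")
    # Distance outside the plateau, counted twice; zero inside it.
    overshoot = (abs(num_tokens - min_len)
                 + abs(num_tokens - max_len)
                 - (max_len - min_len))
    return 1.0 / math.sqrt(steepness * overshoot + 1.0)


if __name__ == "__main__":
    # Assumed sweet spot of 100 to 1000 tokens for "good" content.
    for length in (10, 97, 500, 1003, 5000):
        print(length, round(sweet_spot_length_norm(length, 100, 1000), 3))
```

The useful property is that documents inside the expected range are not ranked against each other by length at all, while very short or very long documents are both pushed down.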

FWIW though: that's really just a generic answer to a generic question.
the better you understand your data, the better you can configure solr for
it -- and that goes equally for the advice people can give you about how
to configure solr.  you haven't given any information about the nature of
your data: the types of documents, the authoritative source, the fields
involved, where/how/when people edit this data, who is keyword spamming,
etc.; or how you want to use it: what types of queries you need to
support, what your users' objectives are, etc.  That makes it impossible
for anyone to suggest anything but the most general answer: "customize
your Similarity".

-Hoss