Re: [Spambayes] Images of commercial text with decoy text are mushing my index

skip Mon, 01 Jan 2007 07:48:03 -0800

    Jamie> With OCR, will this continue to be an issue?

Forgot to answer this question.  The decoy text will still be scored using
the same parameters.  By default, the classifier only considers the 150 most
extreme (highest- and lowest-scoring) tokens, so if the message is near that
limit, adding high- or low-scoring OCR-generated tokens will push some other
tokens out of consideration.  OTOH, the problem with most of these image
spams is that they contain very few tokens of any significance.  Without the
contribution of OCR-generated tokens they tend to score near 0.5 as a whole.
(Most of the tokens extracted from the decoy text score near 0.5 and are
discarded.)
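
To make the "pushed out of consideration" point concrete, here is a rough
sketch of that kind of token selection.  It is not the SpamBayes code itself:
select_clues, max_clues and min_strength are names made up for illustration,
and the 0.1 cutoff for "near 0.5" is an assumption.

```python
def select_clues(token_probs, max_clues=150, min_strength=0.1):
    """Pick the tokens that would be considered when scoring a message.

    token_probs maps each token to its spam probability (0.0 to 1.0).
    Tokens closer to 0.5 than min_strength are dropped as uninformative
    (assumed cutoff), and only the max_clues most extreme survivors are kept.
    """
    strong = [(tok, p) for tok, p in token_probs.items()
              if abs(p - 0.5) >= min_strength]
    # Most extreme first: largest distance from the neutral score of 0.5.
    strong.sort(key=lambda item: abs(item[1] - 0.5), reverse=True)
    return strong[:max_clues]


# A message already at the limit: 150 strong tokens plus bland decoy text.
probs = {f"word{i}": 0.9 for i in range(150)}
probs.update({f"decoy{i}": 0.5 for i in range(20)})
before = select_clues(probs)

# Two strongly spammy OCR tokens displace two of the existing tokens.
probs.update({"ocr:pharmacy": 0.99, "ocr:stock": 0.98})
after = select_clues(probs)
print(len(before), len(after))
```

In this toy case the decoy tokens never make the cut, and both lists stay at
150 entries; the OCR tokens only matter because they are more extreme than
something already selected.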


The only way to tell for sure is to examine the tokens generated and their
scores to see what is contributing to the overall classification.
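
If you can get the per-token scores out as a mapping, a listing like the one
below makes that examination easier.  This is only a sketch: clue_report is a
hypothetical helper, not part of SpamBayes, and the 0.1 cutoff for tokens
treated as "near 0.5" is an assumption.

```python
def clue_report(token_probs, min_strength=0.1):
    """Return one line per token, sorted from most hammy to most spammy.

    token_probs maps each token to its spam probability (0.0 to 1.0).
    Tokens within min_strength of 0.5 are marked "ignored" (assumed cutoff).
    """
    lines = []
    for tok, p in sorted(token_probs.items(), key=lambda item: item[1]):
        if abs(p - 0.5) < min_strength:
            label = "ignored"
        elif p > 0.5:
            label = "spam"
        else:
            label = "ham"
        lines.append(f"{p:.3f}  {label:<7}  {tok}")
    return lines


# Made-up scores for illustration only.
for line in clue_report({"meeting": 0.02, "the": 0.5, "ocr:pharmacy": 0.97}):
    print(line)
```

A message where nearly every line says "ignored" is the near-0.5 case
described above.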

Skip

_______________________________________________
[email protected]
http://mail.python.org/mailman/listinfo/spambayes
Check the FAQ before asking: http://spambayes.sf.net/faq.html
