"Amir 'CG' Caspi" <ceph...@3phase.com> writes: > Well, not really true, because of the rising resurgence of spammers using > image-based spam, i.e. the number of words in text/plain or text/html is > very low, and all of the spam content is embedded in a binary attached > image, which uses either regular links or even imagemap links to direct > victims to the final spam site. > > In fact, now that I think about it, almost all of my bayes_00 FNs are > these image spams, which have very little text... but the text content is > usually pretty generic (like "unsubscribe here" and/or a mailing address) > so one would still think it should hit near 50, not 00. This is why I > want to see what the matched tokens are and why I'm still suspicious of a > problem in my DB.
So perhaps the bayes output should include not only the probability but also some notion of the number of tokens it was computed from, and the assigned score should take that token count into account. Specifically, a 00-type result based on only a few dozen tokens should perhaps carry a much weaker negative score than one based on hundreds.
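
To make the idea concrete, here is a rough sketch in Python. It is only an illustration of the scaling, not how SpamAssassin computes or assigns Bayes scores today. The function name, the linear ramp, the 150-token threshold, and the -2.0 base score are all assumptions picked for the example.

```python
def scaled_bayes_score(base_score: float, num_tokens: int,
                       full_weight_tokens: int = 150) -> float:
    """Shrink a Bayes rule score toward 0 when few tokens were examined.

    base_score         -- score the rule would normally get (hypothetical)
    num_tokens         -- number of tokens the classifier actually used
    full_weight_tokens -- assumed token count at which the full score
                          applies; below it the score is scaled linearly
    """
    if full_weight_tokens <= 0:
        raise ValueError("full_weight_tokens must be positive")
    # Weight runs from 0.0 (no tokens) up to 1.0 (enough tokens).
    weight = min(1.0, max(0, num_tokens) / full_weight_tokens)
    return base_score * weight


if __name__ == "__main__":
    # An image spam with a few dozen generic tokens vs. a wordy message.
    for tokens in (30, 150, 600):
        print(tokens, scaled_bayes_score(-2.0, tokens))
```

With something like this, a 00 result on a 30-token image spam would contribute only a fraction of the negative score, while a 00 result on a long text message would keep its full weight. A linear ramp is just the simplest choice; any function that rises with the token count would serve the same purpose.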