On 29.05.2016 at 02:46, Dianne Skoll wrote:
And also, two-word phrases can be stronger indicators than the
individual words; "hot" and "sex" in isolation may not be strong spam
indicators, but "hot sex" probably is stronger.

Going from one-word tokens to one+two-word tokens will have a pretty
big payoff, I think.  I'm not so sure about two to three


the best result for many of the short spams which try to defeat bayes would be 2- or 3-word tokens - we currently complement bayes with 1500 handcrafted body rules with scores of 0.5/1.5/2.5/3.5/4.5 points

the majority of those rules have 2 or 3 words

the current tokens should stay as they are, with *additional* 2-word tokens from the same messages - that would boost bayes to a completely different level given enough training data

one-word tokens are limited in many ways (although they work reasonably well, it has to be said)
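
to illustrate what "keep the current tokens and add 2-word tokens" could look like, here is a minimal sketch in Python. it is not the tokenizer of any real filter: the function name `tokens_with_bigrams` and the simple lowercase word-splitting regex are assumptions made only for this example

```python
import re


def tokens_with_bigrams(text):
    """Return the one-word tokens of text plus additional two-word tokens.

    Hypothetical helper for illustration; a real Bayes tokenizer
    would do far more normalisation than this.
    """
    # Assumed tokenization: lowercase runs of letters, digits and apostrophes.
    words = re.findall(r"[a-z0-9']+", text.lower())
    # Pair each word with its successor to form the additional 2-word tokens.
    bigrams = [f"{a} {b}" for a, b in zip(words, words[1:])]
    # The one-word tokens stay as they are; the bigrams are appended.
    return words + bigrams


print(tokens_with_bigrams("Hot sex tonight"))
# ['hot', 'sex', 'tonight', 'hot sex', 'sex tonight']
```

with this, "hot" and "sex" are still learned individually, while "hot sex" gets its own token and can build up its own probability from the training data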
