On 21.01.2016 at 13:11, RW wrote:
> On Wed, 20 Jan 2016 22:21:49 -0800 Marc Perkel wrote:
>
>> OK - Just to show you this isn't Bayesian - see if you can do this.
>>
>> Here is a list of 5505874 words and phrases used in the subject line
>> of HAM and never seen in the subject line of SPAM
>> http://www.junkemailfilter.com/data/subject-ham.txt
>>
>> Here is a list of 3494938 words and phrases used in the subject line
>> of SPAM and never seen in the subject line of HAM
>> http://www.junkemailfilter.com/data/subject-spam.txt
>>
>> Hope you understand it now. Not Bayesian!!!!
>
> the only difference between
>
>   "ambulatory care" -> only in ham
>   "aall cards" -> only in spam
>
> and
>
>   "ambulatory care" occurs 16 times in ham and 0 times in spam
>   "aall cards" occurs 0 times in ham and 3 times in spam
>
> is that you have discarded the count information

not entirely, as long as "Currently, SA's bayes tokens are single words" from https://mail-archives.apache.org/mod_mbox/spamassassin-dev/201211.mbox/%3c509d55a8.30...@gmail.com%3E is still true

please review the response below and consider adding 2/4 word tokens *additionally* in the SA tokenizer; with a well-trained bayes it will easily beat the "new magic" in all cases
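
To make the suggestion concrete, here is a minimal sketch of what "additional" multi-word tokens could look like. It is Python for illustration only, not SpamAssassin's actual tokenizer. The function name `phrase_tokens`, the `max_words` parameter, and the plain whitespace split are all assumptions, as is reading "2/4 word tokens" as phrases of two to four adjacent words.

```python
def phrase_tokens(subject, max_words=4):
    """Return single words plus every run of 2..max_words adjacent words.

    Hypothetical helper: a real tokenizer would also handle punctuation,
    encodings and header-specific prefixes.
    """
    words = subject.lower().split()
    tokens = []
    for n in range(1, max_words + 1):          # n = 1 keeps the single-word tokens
        for i in range(len(words) - n + 1):    # slide a window of n words
            tokens.append(" ".join(words[i:i + n]))
    return tokens


if __name__ == "__main__":
    for token in phrase_tokens("Ambulatory care billing update"):
        print(token)
```

The single words stay in the token list, so the phrases are fed to Bayes in addition to what it already sees, not instead of it.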

-------- Forwarded Message --------
Subject: Re: My new method for blocking spam - REVEALED!
Date: Wed, 20 Jan 2016 15:20:01 -0500
From: Dianne Skoll <d...@roaringpenguin.com>
Organization: Roaring Penguin Software Inc.
To: users@spamassassin.apache.org

On Wed, 20 Jan 2016 12:11:02 -0800
Marc Perkel <supp...@junkemailfilter.com> wrote:

> Again - it's not about matching as Bayes does. It's about not
> matching.

It's not about not matching. It's about a preprocessing step that discards tokens that don't have extreme probabilities.

I think your method works as well as it does because you're using up to four-word phrases as tokens. The rest of the method is nonsense, but the four-word phrase tokens are the magic ingredient; they'd make Bayes work awesomely also.
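
The point about discarded counts and extreme probabilities can be sketched in a few lines. This is an illustration under stated assumptions, not the code behind the published lists: the function name `exclusive_lists` is hypothetical, the two phrase counts are the ones quoted earlier in the thread, and the "invoice" entry is invented to show a token that gets dropped.

```python
from collections import Counter


def exclusive_lists(ham_counts, spam_counts):
    """Keep only tokens seen on one side, then throw the counts away.

    Equivalent to keeping tokens whose observed spam fraction is exactly
    0 or 1 and discarding everything in between.
    """
    ham_only = {t for t in ham_counts if spam_counts[t] == 0}
    spam_only = {t for t in spam_counts if ham_counts[t] == 0}
    return ham_only, spam_only


# Counts from the thread, plus one made-up token seen on both sides.
ham_counts = Counter({"ambulatory care": 16, "invoice": 5})
spam_counts = Counter({"aall cards": 3, "invoice": 2})

ham_only, spam_only = exclusive_lists(ham_counts, spam_counts)
print(sorted(ham_only))   # tokens never seen in spam
print(sorted(spam_only))  # tokens never seen in ham
```

Once the sets are built, a phrase seen 16 times in ham and one seen a single time look identical, which is exactly the information a Bayes classifier would have kept.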