Re: SA not correctly classifying spam

Karsten Bräckelmann Mon, 11 Nov 2013 20:46:16 -0800

On Tue, 2013-11-12 at 01:57 -0200, Sergio Durigan Junior wrote:
> On Tuesday, November 12 2013, Benny Pedersen wrote:
> > Karsten Bräckelmann skrev den 2013-11-12 03:20:
> >
> > > [1] Also, just as shown in this thread, properly handling list posts is
> > >     not trivial.
> >
> > maillist is good ham learning spams :)
> 
> Yeah, that's a good reason to keep scanning mailing lists.  Actually,
> it's because of that that I have lots of hams learned :-).


And precisely that is what is not correct.

Your personal mail traffic shows certain words (sorry, tokens) that will
*never* appear in spam. And ham patterns are rather stable; they change
much more slowly than spam patterns and tokens.


Mailing lists *can* be a good source of ham. There is, however, a
plethora of counterexamples showing why it can be really bad and harmful.

* Lists like this very one are prone to include spam samples, with tokens
  that you'd *never* encounter in regular ham. Cue a dozen curse words,
  misspelled body parts and raw URIs.

* Lists that are unfortunately accepting posts without subscription, or
  not properly filtering spam.

* High volume lists. Subscribe to LKML, and it won't take long until
  your Bayes db is severely biased.

Then there are things other than lists that might be worth filtering
out.

* An admin's cron noise often includes URIs somewhere in the logs. Guess
  what, half of them ended up in the cron noise precisely because they
  are blacklisted. Filter that and still expect it to be ham...

* Bugzilla (and any other (bug|issue|problem) tracking tool) mail is
  full of tokens you will never observe in human-generated mail. It is
  also prone to contain the stuff mentioned above.

* It appears (from some threads here; I have not observed it myself) that
  e.g. Debian accepts bug reports via email, and a substantial amount of
  that crap is delivered to the bug-$(product)-list subscribers. Ham?
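
As a rough sketch of keeping such mail out of ham training, the Python
below flags list and machine-generated messages by their headers before
anything is handed to a learner. The function name looks_automated and
the choice of headers are assumptions made for illustration, not
something SA does itself. List-Id is defined in RFC 2919, List-Post and
List-Unsubscribe in RFC 2369, Auto-Submitted in RFC 3834. Whether your
cron or tracker mail actually carries any of these depends on the setup.

```python
from email import message_from_string

# Headers that usually mark list traffic (an assumed selection, adjust to taste).
LIST_HEADERS = ("List-Id", "List-Post", "List-Unsubscribe")


def looks_automated(raw_message: str) -> bool:
    """Return True if the message looks like list or machine-generated mail."""
    msg = message_from_string(raw_message)

    # Any list header at all is taken as a sign of list traffic.
    if any(msg.get(header) is not None for header in LIST_HEADERS):
        return True

    # Conventional (non-standardized) Precedence values used by list software.
    precedence = (msg.get("Precedence") or "").strip().lower()
    if precedence in ("bulk", "list", "junk"):
        return True

    # RFC 3834: anything other than "no" marks an automatic message.
    auto_submitted = (msg.get("Auto-Submitted") or "no").strip().lower()
    return auto_submitted != "no"
```

Messages for which this returns True would simply be skipped when
collecting ham for training; they can still be scanned as usual.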

And if that is not enough already: There is a reason Bayes requires a
minimum training of both ham and spam. There is a reason why the advice
(though varying) is to keep the training ratio within a certain range.

The range commonly varies between 1:1 and ham:spam ratio. Of course,
this is highly dependent on one's in-stream, user base, throughput and
phase of the moon.

With one notable exception: If (almost) all you teach Bayes is foo,
almost all mail will look like foo to Bayes.

In other words: If all you're being taught as a child is good, how will
you ever know what is bad?
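
To make the effect concrete, here is a toy sketch in Python. It is not
SA's Bayes implementation: it uses a simplified Robinson-style per-token
score and then plainly averages the token scores, where a real filter
would combine them more carefully. The function names, the constants
x=0.5 and s=1.0, and the sample messages are all made up for
illustration.

```python
from collections import Counter


def train(messages):
    """Count in how many messages each token occurs; return (counts, total)."""
    counts = Counter()
    for text in messages:
        counts.update(set(text.lower().split()))
    return counts, len(messages)


def token_score(tok, ham, n_ham, spam, n_spam, x=0.5, s=1.0):
    """Robinson-style spamminess of one token; x is the score of unseen tokens."""
    g, b = ham[tok], spam[tok]
    n = g + b
    if n == 0:
        return x
    bad = b / n_spam if n_spam else 0.0
    good = g / n_ham if n_ham else 0.0
    p = bad / (bad + good)
    # Pull tokens with little evidence towards the neutral value x.
    return (s * x + n * p) / (s + n)


def message_score(text, ham, n_ham, spam, n_spam):
    """Toy combination: the plain average of the token scores."""
    toks = set(text.lower().split())
    if not toks:
        return 0.5
    return sum(token_score(t, ham, n_ham, spam, n_spam) for t in toks) / len(toks)


personal = ["lunch tomorrow at noon", "project report attached"]
spam_msgs = ["cheap pills online", "cheap watches online"]
# List posts that quote a spam sample, learned as ham.
list_posts = ["why was cheap pills online not caught"] * 20

spam_counts, n_spam = train(spam_msgs)
ham_a, n_a = train(personal)
ham_b, n_b = train(personal + list_posts)

probe = "cheap pills online"
score_personal_only = message_score(probe, ham_a, n_a, spam_counts, n_spam)
score_with_lists = message_score(probe, ham_b, n_b, spam_counts, n_spam)
print(round(score_personal_only, 2), round(score_with_lists, 2))
```

With ham learned from personal mail only, the probe scores roughly 0.8.
Once the twenty list posts quoting that spam are learned as ham, the
same probe drops below 0.5, i.e. it no longer looks like spam at all.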


-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}
