Re: Default Bayes Database

Karsten Bräckelmann Fri, 10 May 2013 14:15:06 -0700

On Fri, 2013-05-10 at 15:51 -0400, David F. Skoll wrote:
> On Wed, 08 May 2013 19:32:26 +0200 Axb <axb.li...@gmail.com> wrote:
> 
> > - your HAM is somebody else's SPAM
> 
> Do you have evidence for that?

Evidence... examples, rather.

I happened to be the lucky recipient of specific spam campaigns in
languages I do not speak. "Campaign" here means quite a few samples
within a specific, relatively short time period. This definitely
happened with French, Spanish, and Turkish. For me, odds are high that
any word in those languages is on the seriously spammy side. Not so for
anyone actually speaking these languages...

Being easily associated with particular water sports is like a magnet
for getting spammed with totally unrelated water sports. For me, one
style is good, all others are bad-ish. That would be the same for other
folks, though with the signs swapped.

I do receive quite specific campaigns, plain text, no obfuscation,
offering private health insurance ("Private Krankenversicherung" in
German). That is a totally valid phrase. Unlike English, German tends to
concatenate words into compounds -- "Krankenversicherung" is pretty
much a word-by-word translation of "health insurance". This makes the
compound a rarer and more specific token, whereas "health" on its own
hardly gives a hint. And the totally legit word is spammy for me,
because I usually do not talk about that topic in mail. My next door
neighbor probably would disagree...
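
To put rough numbers on that: below is a minimal sketch of a per-token
spam estimate from per-user training counts. It is a plain frequency
ratio for illustration only, not the formula the Bayes engine in SA
actually uses, and all counts are made up. The function name and the
two mailboxes are hypothetical.

```python
def token_spam_probability(spam_hits, ham_hits, n_spam, n_ham):
    """Naive per-token estimate: how much more often does the token
    appear in this user's spam than in this user's ham?"""
    spam_freq = spam_hits / n_spam
    ham_freq = ham_hits / n_ham
    if spam_freq + ham_freq == 0:
        return 0.5  # token never seen: no evidence either way
    return spam_freq / (spam_freq + ham_freq)


# Hypothetical counts for the token "Krankenversicherung".
# Mailbox A never discusses the topic in ham.
mailbox_a = token_spam_probability(spam_hits=40, ham_hits=0,
                                   n_spam=1000, n_ham=1000)
# Mailbox B gets the same campaign, but also legit mail on the topic.
mailbox_b = token_spam_probability(spam_hits=40, ham_hits=60,
                                   n_spam=1000, n_ham=1000)

print(mailbox_a, mailbox_b)
```

Same token, same campaign, but the estimate lands on the spammy side
for one mailbox and on the hammy side for the other.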

"Your ham is someone else's spam" on a different level: There are quite
a few reports in Bugzilla where an obfuscation pattern matches a legit
word in a non-English language.

Accents are good for obfuscation. But accents also are entirely legit.
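
A minimal sketch of how that collision happens, assuming a made-up
fuzzy pattern for "credit". The regex and the helper below are
hypothetical and are not taken from any actual SA rule. The pattern
accepts accented and digit look-alikes and skips the plain spelling,
so it also fires on the perfectly ordinary French word "crédit".

```python
import re

# Hypothetical obfuscation pattern, for illustration only.
FUZZY_CREDIT_SKETCH = re.compile(r"\bcr[eéèê3]d[iíì1!]t\b", re.IGNORECASE)


def looks_obfuscated(text):
    """True if some match is a look-alike spelling rather than the
    plain word 'credit'."""
    return any(m.group(0).lower() != "credit"
               for m in FUZZY_CREDIT_SKETCH.finditer(text))


print(looks_obfuscated("Get cr3dit now"))            # obfuscated spelling
print(looks_obfuscated("Your credit is approved"))   # plain English word
print(looks_obfuscated("Votre crédit est accepté"))  # legit French word
```

The first and the third call both report a hit, although only the
first one is actually obfuscated.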

PayPal. And them notifying their customers about changes in the terms of
use. And actually sending out the full terms of use in the same mail. In
this case, again, German -- but they managed to score a whopping 12.2
once for me. Yes, of course, BAYES_99.

Plus some other rules indicating shady business, triggered at various
times: FUZZY_CREDIT, TRACKER_ID, URI_DOT_INFO.
Oh, lovely. That 2009 sample has FUZZY_VLIUM and FRT_VALIUMx.

> Karsten Bräckelmann wrote:
> > Just try to imagine working in an industry where e.g. Viagra and
> > Cialis are totally legit phrases to use...
> 
> Actually, we find that is not a problem because spammers use things
> like Vi@gr@ and C1AL1S that are far more damning than the unmodified words
> themselves.

That was one quick example. See above for a similar scenario not
involving medication, but sports.

> Also, our Bayes implementation uses word pairs as well as
> individual words which improves its selectivity.

Good for you, but that is irrelevant to the discussion at hand, which is
about the Bayes engine in SA.

-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}
