On Fri, 2013-05-10 at 15:51 -0400, David F. Skoll wrote:
> On Wed, 08 May 2013 19:32:26 +0200 Axb <axb.li...@gmail.com> wrote:
>
> > - your HAM is somebody else's SPAM
>
> Do you have evidence for that?

Evidence... examples, rather.

I happened to be the lucky recipient of specific spam campaigns in
languages I do not speak. "Campaign" here means quite a few samples
within a specific, relatively short time period. This definitely
happened with French, Spanish, and Turkish. For me, odds are high that
any word in those languages is on the seriously spammy side. Not so for
anyone actually speaking these languages...

Being easily associated with a particular water sport is like a magnet
for getting spammed about totally unrelated water sports. One style is
good, all others are bad-ish. That would be the same for other folks,
though with different signs.

I do receive quite specific campaigns, plain text, no obfuscation,
offering private health insurance ("Private Krankenversicherung" in
German). That is a totally valid phrase. Unlike English, German tends
to concatenate words to form specifics -- "Krankenversicherung" is
pretty much a word-by-word translation of "health insurance". This
makes the word rarer; "health" on its own, in comparison, hardly gives
a hint. And the totally legit word is spammy for me, because I usually
do not talk about that topic in mail. My next-door neighbor probably
would disagree...

"Your ham is someone else's spam" on a different level: There are quite
a few reports in bugzilla where an obfuscation pattern matches a legit
word in a non-English language. Accents are good for obfuscation. But
accents are also entirely legit.

Paypal. And them notifying their customers about changes in the terms
of use. And actually sending out the full terms of use in the same
mail. In this case, again, German -- but they managed to score a
whopping 12.2 once for me. Yes, of course, BAYES_99. Plus some other
rules indicating shady business, triggered multiple times:
FUZZY_CREDIT, TRACKER_ID, URI_DOT_INFO. Oh, lovely. That 2009 sample
has FUZZY_VLIUM and FRT_VALIUMx.

> Karsten Bräckelmann wrote:
> > Just try to imagine working in an industry where e.g.
> > Viagra and Cialis are totally legit phrases to use...
>
> Actually, we find that is not a problem because spammers use things
> like Vi@gr@ and C1AL1S that are far more damning than the unmodified words
> themselves.

That was one quick example. See above for a similar scenario not
involving medication, but sports.

> Also, our Bayes implementation uses word pairs as well as
> individual words which improves its selectivity.

Good for you, but that is irrelevant to the discussion at hand, which
is about the Bayes engine in SA.

-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}