On Fri, 10 May 2013 23:14:36 +0200 Karsten Bräckelmann <guent...@rudersport.de> wrote:
> I happened to be the lucky recipient of specific spam campaigns in
> languages I do not speak. Campaign referring to quite a few samples
> during a specific, relatively short time period. This definitely
> happened with French, Spanish, and Turkish. Odds are high for any word
> in those languages being on the seriously spammy side. Unlike for
> anyone actually speaking these languages...

We (probably) have a much larger sample population, so this tends not
to be as much of a problem for us.

> I do receive quite specific campaigns, plain text, no obfuscation,
> offering private health insurance ("Private Krankenversicherung" in
> German). That is a totally valid phrase. Unlike English, German tends
> to concatenate words to form specifics -- "Krankenversicherung" is
> pretty much a word-by-word translation of "health insurance". This
> makes the word more rare, "health" on its own in comparison hardly
> gives a hint. And the totally legit word is spammy for me, because I
> usually do not talk about that topic in mail. My next door neighbor
> probably would disagree...

Again, the key is a large sample size.

> "Your ham is someone else's spam" on a different level: There are
> quite a few reports in bugzilla, where an obfuscation pattern matches
> a legit word in non-English languages.

These are edge cases that are pretty easily handled with personal Bayes
databases or whitelisting if the system keeps getting it wrong.

> Accents are good for obfuscation. But accents also are entirely legit.

And we can tell which is which, based on a large sample size.

> Paypal. And them notifying their customers about changes in the terms
> of use. And actually sending out the full terms of use in the same
> mail. In this case, again, German -- but they managed to score a
> whopping 12.2 once for me. Yes, of course, BAYES_99.

Was this with your personal Bayes data? Even that can be wrong
sometimes...

Regards,

David.