On Wednesday 20 January 2016 at 17:52:05, Marc Perkel wrote:

> Suppose I get an email with the subject line "Let's get some lunch". I
> know it's a good email because spammers never say "Let's go to lunch".
> In fact there are an infinite number of words and phrases that are used
> in good email that are never ever used in spam.

Surely this is going to change as soon as enough people implement your
filtering system - spammers will use legitimate phrases from ham, both in the
subject line and the body of their emails, and thereby get classified as ham?

> And if I'm using words and phrases *never used in spam* that are used in ham
> - it's good email.
> And similarly - if I'm using words and phrases that are used in spam and
> *never used in spam* - it's spam.

I'm assuming that last line should be "*never used in ham* - it's spam".

> So - how do I get a list of words and phrases never used in spam? I
> create a list of words and phrases that are used in spam and check to
> see if it's *not on the list*.

So, you're identifying ham by checking that it does not contain words or
phrases which you have previously seen in spam...

Sounds very much like Bayes to me.

> What I do is tokenize the spamiest parts of the email, like the subject
> line

How do you identify "the spammiest parts" of an email?

> I'd like to see SA implement this.
> I'm not going to share my code because my code is specific to my system and
> it a combination of bash scripts, redis, pascal, php, and Exim rules. And
> the open source programmers are likely to implement it better than I have.

Given that you have *some* source code, no matter how bad / buggy / specific
it is, I think you'll get much greater take-up (and also comprehension of
exactly what your technique is) if you at least publish that and invite people
to improve on it, rather than say "here's a method idea - you guys code it".

> I'm seeing close to 100% accuracy.

1. How close?
2. On what volume of email?
3. What proportion of spam / ham?
4. What % false positives / negatives?
5. How many different domains' email are you feeding into it?
6. How long have you been testing it (i.e. how much have you seen of how it
   adapts to new spam patterns)?

> It is so accurate it's scary and I think my implementation is crude at best.
> I think if it were done right it could even get closer to 100% than I have.

I can repeat that I think you'll get far more interest and involvement from
coders if you at least publish what you have.

Regards,

Antony.

-- 
How I want a drink, alcoholic of course, after the heavy chapters involving
quantum mechanics.

 - mnemonic for 3.14159265358979

Please reply to the list; please *don't* CC me.
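
For anyone trying to follow the technique under discussion, here is one possible reading of it as a rough Python sketch. It is not Marc's code (which has not been published) and not anything in SpamAssassin. The names `tokens`, `ExclusiveTokenFilter`, `train` and `classify`, the choice of word n-grams up to three words as "phrases", and the simple majority vote are all assumptions made for illustration.

```python
import re
from typing import Set


def tokens(text: str, max_n: int = 3) -> Set[str]:
    """Lower-cased words plus word n-grams ("phrases") of up to max_n words.

    The n-gram length and the regex are illustrative assumptions.
    """
    words = re.findall(r"[a-z0-9']+", text.lower())
    out: Set[str] = set()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            out.add(" ".join(words[i:i + n]))
    return out


class ExclusiveTokenFilter:
    """Hypothetical filter: only tokens seen in one class but never the other count."""

    def __init__(self) -> None:
        self.spam_seen: Set[str] = set()
        self.ham_seen: Set[str] = set()

    def train(self, text: str, is_spam: bool) -> None:
        # Record every token of this message in the list for its class.
        (self.spam_seen if is_spam else self.ham_seen).update(tokens(text))

    def classify(self, text: str) -> str:
        toks = tokens(text)
        # Tokens used in ham and never seen in spam.
        ham_only = toks & (self.ham_seen - self.spam_seen)
        # Tokens used in spam and never seen in ham.
        spam_only = toks & (self.spam_seen - self.ham_seen)
        if len(ham_only) > len(spam_only):
            return "ham"
        if len(spam_only) > len(ham_only):
            return "spam"
        return "unsure"


if __name__ == "__main__":
    f = ExclusiveTokenFilter()
    f.train("Let's get some lunch", is_spam=False)
    f.train("Lunch on Friday?", is_spam=False)
    f.train("Cheap pills online", is_spam=True)
    f.train("Get cheap watches now", is_spam=True)
    for subject in ("Let's go to lunch",
                    "Cheap pills now",
                    "Let's get some lunch, cheap pills"):
        print(subject, "->", f.classify(subject))
```

With this toy training data the third subject, which pads a spam phrase with a ham phrase, comes out as "ham". That is the evasion concern raised at the top of this reply, at least for this simplified reading of the method.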