On Wednesday 20 January 2016 at 17:52:05, Marc Perkel wrote:

> Suppose I get an email with the subject line "Let's get some lunch". I
> know it's a good email because spammers never say "Let's go to lunch".
> In fact there are an infinite number of words and phrases that are used
> in good email that are never ever used in spam.

Surely this is going to change as soon as enough people implement your 
filtering system - spammers will use legitimate phrases from ham, both in the 
subject line and the body of their emails, and thereby get classified as ham?

> And if I'm using words and phrases *never used in spam* that are used in ham
> - it's good email.
> And similarly - if I'm using words and phrases that are used in spam and
> *never used in spam* - it's spam.

I'm assuming that last line should be "*never used in ham* - it's spam".

> So - how do I get a list of words and phrases never used in spam? I
> create a list of words and phrases that are used in spam and check to
> see if it's *not on the list*.

So, you're identifying ham by checking that it does not contain words or 
phrases which you have previously seen in spam...

Sounds very much like Bayes to me.
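
To make the idea concrete, here is a minimal sketch of the scheme as I 
understand it from the description above. It is not Marc's code (which has not 
been published); the names tokenize and TokenSets, the choice of 1- to 3-word 
phrases, and the "count and compare" decision rule are all assumptions made 
for illustration.

```python
import re


def tokenize(subject, max_n=3):
    """Return the set of 1..max_n word phrases in a subject line.

    Assumption: phrases are runs of adjacent lower-cased words.
    """
    words = re.findall(r"[a-z0-9']+", subject.lower())
    return {
        " ".join(words[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(words) - n + 1)
    }


class TokenSets:
    """Hypothetical store of phrases seen in spam and in ham."""

    def __init__(self):
        self.spam = set()
        self.ham = set()

    def train(self, subject, is_spam):
        target = self.spam if is_spam else self.ham
        target.update(tokenize(subject))

    def classify(self, subject):
        tokens = tokenize(subject)
        # Phrases seen in ham and never in spam, and the reverse.
        ham_only = {t for t in tokens if t in self.ham and t not in self.spam}
        spam_only = {t for t in tokens if t in self.spam and t not in self.ham}
        if len(ham_only) > len(spam_only):
            return "ham"
        if len(spam_only) > len(ham_only):
            return "spam"
        return "unsure"


sets = TokenSets()
sets.train("Cheap pills, get some now", is_spam=True)
sets.train("Let's get some lunch", is_spam=False)

print(sets.classify("Let's get some lunch tomorrow"))  # ham
print(sets.classify("Cheap pills now"))                # spam
# A spammer who copies a ham subject verbatim is also classified as ham:
print(sets.classify("Let's get some lunch"))           # ham
```

The last call illustrates the objection raised earlier: once spammers reuse 
phrases taken from ham, a pure "never seen in spam" test passes them.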

> What I do is tokenize the spamiest parts of the email, like the subject
> line

How do you identify "the spammiest parts" of an email?

> I'd like to see SA implement this.

> I'm not going to share my code because my code is specific to my system and
> it a combination of bash scripts, redis, pascal, php, and Exim rules. And
> the open source programmers are likely to implement it better than I have.

Given that you have *some* source code, no matter how bad / buggy / specific it 
is, I think you'll get much greater take-up (and also comprehension of exactly 
what your technique is) if you at least publish that and invite people to 
improve on it, rather than say "here's a method idea - you guys code it".

> I'm seeing close to 100% accuracy.

1. How close?

2. On what volume of email?

3. What proportion of spam / ham?

4. What % false positives / negatives?

5. How many different domains' email are you feeding in to it?

6. How long have you been testing it (i.e. how much have you seen of how it 
adapts to new spam patterns)?
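
The reason questions 3 and 4 matter can be shown with a little arithmetic. The 
sketch below uses made-up counts (not measurements from any real system) and a 
hypothetical helper named rates; "positive" here means "classified as spam".

```python
def rates(tp, fp, tn, fn):
    """Compute accuracy and error rates from a confusion matrix.

    tp: spam classified as spam    fp: ham classified as spam
    tn: ham classified as ham      fn: spam classified as ham
    """
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
        "false_negative_rate": fn / (fn + tp) if (fn + tp) else 0.0,
    }


# Illustrative only: 9900 spam and 100 ham, with a "filter" that simply
# calls every message spam.
result = rates(tp=9900, fp=100, tn=0, fn=0)
print(result["accuracy"])             # 0.99
print(result["false_positive_rate"])  # 1.0
print(result["false_negative_rate"])  # 0.0
```

With a mail stream that is 99% spam, a filter that rejects everything is "99% 
accurate" while losing every legitimate message, so an accuracy figure on its 
own says little without the spam / ham proportion and the separate error rates.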

> It is so accurate it's scary and I think my implementation is crude at best.
> I think if it were done right it could even get closer to 100% than I have.

I can only repeat that I think you'll get far more interest and involvement 
from coders if you at least publish what you have.


Regards,


Antony.

-- 
How I want a drink, alcoholic of course, after the heavy chapters involving 
quantum mechanics.

 - mnemonic for 3.14159265358979

                                                   Please reply to the list;
                                                         please *don't* CC me.
