On 12/11/15 06:58, RW wrote:
On Thu, 10 Dec 2015 13:54:05 -0800
Marc Perkel wrote:

Bayes breaks the message down into some sort of tokens and then does
statistics on those tokens as to tokens found in spam vs. tokens
found in ham.

But what about combinations of tokens? I'm thinking that I'd like to
have something that says when it sees tokens X and Y and Z then
that's spam even though X,Y,Z might be in ham when not combined.

Does bayes do that or is there anything that does?
In general making arbitrary combinations is not practical. What some
filters do is make tokens out of word combinations in a sliding window.
This can be very useful in catching difficult spams that are composed
of common neutral words, although in my experience it's a little more
prone to FPs than Bayes.

I use Bogofilter and DSPAM.
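
To make the sliding-window idea above concrete, here is a minimal sketch. It does not describe how Bogofilter or DSPAM actually tokenize; the function name, the window size and the "*" separator are all assumptions chosen for illustration.

```python
def window_tokens(words, window=3):
    """Return the single words plus ordered word pairs.

    A pair is emitted for every two words that fall inside the same
    sliding window of `window` consecutive words. The name, the window
    size and the "*" separator are illustrative assumptions.
    """
    tokens = list(words)
    for i, first in enumerate(words):
        # Pair this word with the words that follow it inside the window.
        for second in words[i + 1 : i + window]:
            tokens.append(f"{first}*{second}")
    return tokens


print(window_tokens(["act", "now", "limited", "offer"]))
```

Only nearby words are paired, which keeps the token count roughly linear in message length instead of growing with every arbitrary combination.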

On Thu, 10 Dec 2015 21:28:44 -0800
Marc Perkel wrote:

I'm thinking about incorporating Bogofilter but instead of feeding it
messages I'm thinking about feeding it the SpamAssassin results - the
rule names it hit + other data about the message and then let it
score the rules. That's what I want to experiment with.
I thought of trying something like that myself, but my filtering became
practically perfect before I got around to it, so I never bothered. And
I think there are some problems with it.

The first is that FNs in SpamAssassin tend to come from a lack of
useful information rather than the scoring system failing to combine it
well.

The second is that most rules are either fairly neutral or strongly
spammy. There are few strong ham indicators to balance the rest. You
might be able to balance it with metadata, and reputation information,
but the trick is to do it without getting a high FP rate on new senders.

If you did wish to take account of rule combinations, you'd really have
to do it yourself because sliding-window tokenization wouldn't do it
well.



What I was thinking about doing was creating a string of tokens that represent key features of the message, then running that through a program that creates a new token out of every possible combination of 2 tokens and adds those to the string, then running Bayes on the result. My tokens would not be the text of the message but the rules hit, including a lot of rules I create not for points but just for tokens.
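
A rough sketch of that pairing step, assuming the rule hits arrive as a list of rule names. The function name and the "+" separator are made up for illustration, and the Bayes training itself is left out.

```python
from itertools import combinations


def add_pair_tokens(rule_hits):
    """Return the rule names plus one token for every pair of rules.

    Names are de-duplicated and sorted first so that the same two rules
    always produce the same pair token, whatever order they fired in.
    """
    singles = sorted(set(rule_hits))
    pairs = [f"{a}+{b}" for a, b in combinations(singles, 2)]
    return singles + pairs


print(add_pair_tokens(["MONEY", "JESUS", "ROYALTY"]))
```

A message that hits n rules yields n*(n-1)/2 pair tokens, so 20 hits become 190 extra tokens. That stays manageable for pairs but grows quickly if triples are added.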

I create rules that look for many phrases about a subject, and the subject becomes a token. For example:

JESUS
ROYALTY
MONEY

By themselves these are not indicators of spam, but if you have all 3 then it's definitely spam. The idea is to look not at words but at the meaning of phrases. For instance, introductions:

Dear (friend)
I am (someone)
I am contacting you because (some reason)

This says - I don't know you.

I am a member of the (Nigerian royal family|Armed forces in Iraq) etc.

These can all be reduced to tokens, and then you just look for combinations of tokens.
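
Something along these lines is what I mean by reducing phrases to tokens. The patterns and token names below are invented examples, not real SpamAssassin rules, and a real set would need many more phrases per subject.

```python
import re

# Hypothetical subject rules: each token name maps to phrases about that subject.
TOPIC_PATTERNS = {
    "INTRO_STRANGER": re.compile(r"\bdear friend\b|\bI am contacting you because\b", re.I),
    "ROYALTY": re.compile(r"\broyal family\b|\bprince\b", re.I),
    "MONEY": re.compile(r"\bmillion dollars\b|\bbank transfer\b", re.I),
}


def topic_tokens(text):
    """Return the sorted names of all subjects whose phrases appear in text."""
    return sorted(name for name, pattern in TOPIC_PATTERNS.items() if pattern.search(text))


message = "Dear friend, I am a member of the Nigerian royal family."
print(topic_tokens(message))
```

The output is a short list of subject tokens per message, which is what would then be checked for combinations.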



--
Marc Perkel - Sales/Support
supp...@junkemailfilter.com
http://www.junkemailfilter.com
Junk Email Filter dot com
415-992-3400
