Re: A New Approach: Find the Ham

John Rudd Sun, 11 Feb 2007 02:32:25 -0800

Giampaolo Tomassoni wrote:

From: Miles Fidelman [mailto:[EMAIL PROTECTED]
Dan wrote:
I've developed a new approach to scoring that I want to 1) share with everyone and 2) make into a working system that's as accurate as what I've already built, but easier to use. First, the theory:
NEW ASSUMPTION
All messages are spam unless x,y,z score says they're ham.

NEW APPROACH
Block everything, then create rules to not catch what you do want. I.e., build tests that target the spam (keeping all the tests you've already built), then score the thousands of ways ham triggers on those tests.
It strikes me that the hardest part of this approach is filtering out too much ham. At least for me, it's more important to make sure that people reach me, than to filter out all spam. If we take the approach that everything is to be filtered out, except x,y,z - then the risk of filtering out too much seems pretty high.
I definitely agree with you.

By the way, if Dan really brought a new perspective to us (i.e., a new way to
detect ham), what would stop us from integrating it into SA?


Nothing would stop you from integrating it into SA.

For one, you could give every message a +5 just for existing. Now you've assumed all messages are spam, and you're going to require that the message characteristics lower the score below 5.
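
A minimal sketch of that idea, in Python rather than real SA rule syntax. The ham test names and their scores are made up for illustration; only the +5 baseline and the threshold of 5 come from the paragraph above:

```python
# Hypothetical sketch: every message starts at the spam threshold,
# and only ham-like traits can pull the score back under it.
SPAM_THRESHOLD = 5.0
BASELINE = 5.0  # the "+5 just for existing"

# Made-up ham tests and scores, for illustration only.
HAM_SCORES = {
    "sender_in_address_book": -3.0,
    "reply_to_my_own_message": -2.5,
    "known_mailing_list": -1.5,
}

def score_message(traits):
    """Return the total score given the set of ham tests a message hit."""
    return BASELINE + sum(HAM_SCORES[t] for t in traits if t in HAM_SCORES)

def is_spam(traits):
    """A message is spam unless its ham traits lower the score below 5."""
    return score_message(traits) >= SPAM_THRESHOLD
```

With this setup a message that hits no ham test stays at 5 and is treated as spam, which is exactly the "all messages are spam unless proven otherwise" assumption.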

The problem I see with this approach is that: spam, by its nature, all has characteristics in common that are already targeted:

a) coming from common points of origin, such as spamhauses, open relays, etc. (countered with blacklisting)

b) urging you to take certain actions, such as clicking on links, calling phone numbers, replying in order to opt-out, etc. (URIBLs, RE's and bayes)


c) similar topics, such as medication, porn, stocks, etc. (RE's and bayes)

d) mailers with similar bad behaviors, such as things which are easy to target via greet_pause, greylisting, nolisting, looking for format violations, etc.

So, in the "finding the spam" approach, you're looking for these features as a means of trying to identify the message as spam.

In order to develop a "find the ham" approach, you have to figure out "what are the characteristics of ham?"

e) does it come from common points of origin? no. It can, and in my experience does, come from anywhere.


f) does it urge you to take certain actions?  not generally.

g) does it all have similar topics? for my mailing lists, sure... but rarely do my gf and mother talk about the same topic...

Trying to narrow ham down to a range of sources, actions, and topics seems to be MUCH more difficult than trying to do the same for spam.

About the only thing you can do that sets ham apart from spam in these lists is "d" -- you could have a set "h" which says "if it comes from an RFC compliant source, we'll mark it as being slightly more ham-like". At which point, all of the spammers will get more RFC compliant. That still leaves the problem that e-g are nowhere near as identifiable as targets as a-c are.
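
A rough sketch of what a set "h" test might look like, again in Python rather than real SA rule syntax. The check names and the -0.5 weight are invented for illustration:

```python
# Hypothetical sketch of set "h": a small ham bonus for RFC-compliant senders.
# The check names and the -0.5 weight are assumptions, not real SA tests.
RFC_CHECKS = ("valid_helo", "waits_for_greeting", "well_formed_headers")
RFC_BONUS = -0.5  # "slightly more ham-like"

def rfc_adjustment(results):
    """Return the bonus only if every compliance check passed.

    `results` maps check names to booleans; a missing check counts as failed.
    """
    if all(results.get(check, False) for check in RFC_CHECKS):
        return RFC_BONUS
    return 0.0
```

Nothing in that check is specific to ham: a spammer whose mailer passes all three checks collects the same bonus.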

(that said: I'm not saying "don't try" -- do try ... I would love to be proven wrong, as long as the solution doesn't involve something as bad for the internet as challenge-response type systems are)
