Giampaolo Tomassoni wrote:
From: Miles Fidelman [mailto:[EMAIL PROTECTED]
Dan wrote:
I've developed a new approach to scoring that I want to 1) share with everyone and 2) make into a working system that's as accurate as what I've already built, but easier to use. First, the theory:

NEW ASSUMPTION
All messages are spam unless x,y,z score says they're ham.

NEW APPROACH
Block everything, then create rules to not catch what you do want, i.e., build tests that target the spam (keeping all the tests you've already built), then score the thousands of ways ham triggers on those tests.
It strikes me that the hardest part of this approach is filtering out too much ham. At least for me, it's more important to make sure that people reach me, than to filter out all spam. If we take the approach that everything is to be filtered out, except x,y,z - then the risk of filtering out too much seems pretty high.

I definitely agree with you.

By the way, if Dan really brought a new perspective to us (i.e., a new way to detect ham), what would stop us from integrating it into SA?


Nothing would stop you from integrating it into SA.

For one, you could give every message a +5 just for existing. Now you've assumed all messages are spam, and you're going to require that the message characteristics lower the score below 5.
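To make the arithmetic concrete, here is a minimal sketch in Python rather than actual SpamAssassin configuration. It assumes a spam threshold of 5 (the score mentioned above), and every rule name and score in it is hypothetical, made up purely for illustration:

```python
# Sketch of the "everything is spam until proven ham" scoring idea.
# All rule names and scores below are hypothetical examples.

SPAM_THRESHOLD = 5.0  # a message at or above this score is treated as spam
BASELINE = 5.0        # the "+5 just for existing"

# Hypothetical ham indicators: negative scores pull a message below the threshold.
HAM_RULES = {
    "known_correspondent": -3.0,
    "subscribed_mailing_list": -2.5,
}

# Hypothetical spam indicators: the existing tests are kept, as in the proposal.
SPAM_RULES = {
    "listed_on_blacklist": 3.0,
    "uri_on_uribl": 2.5,
}


def score(hits):
    """Return the total score for the set of rule names a message triggered."""
    total = BASELINE
    for name in hits:
        total += HAM_RULES.get(name, 0.0)
        total += SPAM_RULES.get(name, 0.0)
    return total


def is_spam(hits):
    """A message is spam unless ham rules pull it below the threshold."""
    return score(hits) >= SPAM_THRESHOLD
```

With these made-up numbers, a message that triggers no rules at all is classified as spam, and one that hits only known_correspondent scores 2.0 and gets through. The hard part, as discussed below, is finding ham rules that are actually reliable.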

The problem I see with this approach is that spam, by its nature, has characteristics in common, and those are already targeted:

a) coming from common points of origin, such as spamhauses, open relays, etc. (countered with blacklisting)

b) urging you to take certain actions, such as clicking on links, calling phone numbers, replying in order to opt out, etc. (URIBLs, REs and Bayes)

c) similar topics, such as medication, porn, stocks, etc. (REs and Bayes)

d) mailers with similar bad behaviors, such as things which are easy to target via greet_pause, greylisting, nolisting, looking for format violations, etc.

So, in the "finding the spam" approach, you're looking for these features as a means of trying to identify the message as spam.


In order to develop a "find the ham" approach, you have to figure out "what are the characteristics of ham?"

e) does it come from common points of origin? no. It can, and in my experience does, come from anywhere.

f) does it urge you to take certain actions?  not generally.

g) does it all have similar topics? for my mailing lists, sure... but rarely do my gf and mother talk about the same topic...

Trying to narrow ham down to a range of sources, actions, and topics seems to be MUCH more difficult than trying to do the same for spam.

About the only thing you can do that sets ham apart from spam in these lists is "d" -- you could have a set "h" which says "if it comes from an RFC-compliant source, we'll mark it as being slightly more ham-like". At which point, all of the spammers will get more RFC-compliant. That still leaves the problem that e-g are nowhere near as identifiable as targets as a-c are.


(that said: I'm not saying "don't try" -- do try ... I would love to be proven wrong, as long as the solution doesn't involve something as bad for the internet as challenge-response type systems are)
