Giampaolo Tomassoni wrote:
From: Miles Fidelman [mailto:[EMAIL PROTECTED]
Dan wrote:
I've developed a new approach to scoring that I want to 1) share with
everyone and 2) make into a working system that's as accurate as what
I've already built, but easier to use. First, the theory:
NEW ASSUMPTION
All messages are spam unless x,y,z score says they're ham.
NEW APPROACH
Block everything, then create rules to not catch what you do want.
i.e., build tests that target the spam (keeping all the tests you've
already built), then score the thousands of ways ham triggers on those
tests.
It strikes me that the biggest risk of this approach is filtering out
too much ham. At least for me, it's more important to make sure that
people can reach me than to filter out all spam. If we take the approach
that everything is to be filtered out except x,y,z, then the risk of
filtering out too much seems pretty high.
I definitely agree with you.
By the way, if Dan really brought a new perspective to us (i.e., a new way to
detect ham), what would stop us from integrating it into SA?
Nothing would stop you from integrating it into SA.
For one, you could give every message a +5 just for existing. Now
you've assumed all messages are spam, and you're going to require that
the message characteristics lower the score below 5.
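To make that concrete, here is a rough sketch in Python (not SpamAssassin code) of what "start at +5 and make the message earn its way down" might look like. The rule names, the scores, and the example addresses are all made up for illustration; the only thing taken from the discussion is the +5 baseline and the threshold of 5.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

SPAM_THRESHOLD = 5.0  # a message scoring at or above this is treated as spam
BASELINE_SCORE = 5.0  # the "+5 just for existing"


@dataclass
class Rule:
    name: str
    score: float  # negative = ham-like evidence, positive = spam-like evidence
    matches: Callable[[Dict[str, str]], bool]


# Hypothetical rules: names, scores, and addresses are assumptions.
RULES: List[Rule] = [
    Rule("KNOWN_CORRESPONDENT", -3.0,
         lambda m: m.get("from", "") in {"mom@example.org"}),
    Rule("HAS_LIST_ID", -2.0, lambda m: "list-id" in m),
    Rule("MENTIONS_PILLS", 2.0,
         lambda m: "pills" in m.get("body", "").lower()),
]


def score_message(msg: Dict[str, str]) -> float:
    """Start from the baseline and add the score of every matching rule."""
    total = BASELINE_SCORE
    for rule in RULES:
        if rule.matches(msg):
            total += rule.score
    return total


def is_spam(msg: Dict[str, str]) -> bool:
    """A message is spam unless ham-like rules pull it below the threshold."""
    return score_message(msg) >= SPAM_THRESHOLD
```

With these made-up rules, a message from an unknown sender that matches nothing stays at 5 and is blocked, which is exactly the "too much ham filtered" risk raised above: any legitimate mail that no ham rule anticipates is lost by default.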
The problem I see with this approach is that spam, by its nature,
shares characteristics that are already targeted:
a) coming from common points of origin, such as spamhauses, open relays,
etc. (countered with blacklisting)
b) urging you to take certain actions, such as clicking on links,
calling phone numbers, replying in order to opt-out, etc. (URIBLs, REs
and bayes)
c) similar topics, such as medication, porn, stocks, etc. (REs and bayes)
d) mailers with similar bad behaviors, such as things which are easy to
target via greet_pause, greylisting, nolisting, looking for format
violations, etc.
So, in the "finding the spam" approach, you're looking for these
features as a means of trying to identify the message as spam.
In order to develop a "find the ham" approach, you have to figure out
"what are the characteristics of ham?"
e) does it come from common points of origin? no. It can, and in my
experience does, come from anywhere.
f) does it urge you to take certain actions? not generally.
g) does it all have similar topics? for my mailing lists, sure... but
rarely do my gf and mother talk about the same topic...
Trying to narrow ham down to a range of sources, actions, and topics
seems to be MUCH more difficult than trying to do the same for spam.
About the only thing you can do that sets ham apart from spam in these
lists is "d" -- you could have a set "h" which says "if it comes from an
RFC-compliant source, we'll mark it as being slightly more ham-like".
At which point, all of the spammers will get more RFC-compliant. That
still leaves the problem that e through g are nowhere near as
identifiable as targets as a through c are.
(that said: I'm not saying "don't try" -- do try ... I would love to be
proven wrong, as long as the solution doesn't involve something as bad
for the internet as challenge-response type systems are)