Re: A New Approach: Find the Ham

2007-02-12 Thread michael moncur

I agree that this isn't going to be the best approach. Detecting ham
is simply more difficult:

1. New types of ham emerge more often than new types of spam. Spammers
generally stick to tried-and-true subjects while ham is all over the
place.

2. Ham is more personalized than spam. Everyone gets very similar
spam, but nobody gets the same mix of ham messages that I get.

3. Ham has a much greater range of potential subjects and patterns
than spam. For all the spam, nobody's doing anything creative like
trying to sell fountain pens or beverage dispensers or books of poetry
with spam - it's all fake rolexes and cheap pharmaceuticals. Ham, on
the other hand, has a million potential subjects and you get
one-of-a-kind messages every day.

4. Spammers will have an easier time faking ham characteristics than
removing spam characteristics, which may be endemic to their methods
(spamming software, botnets, etc.)

5. Network effects are very helpful with spam (DNS blacklists, Razor,
etc.) but not very helpful with ham.

Of course, ham rules are helpful - especially personalized ones. I use
a bunch. But they're best used with the existing framework of spam
detection.


Re: A New Approach: Find the Ham

2007-02-12 Thread Dan

Duncan & Michael,

Thank you for the careful thought and detailed input.  Please read my
Prototype Config email of yesterday afternoon.  This is not what it
appears to be: NOT a weighted, ham-finding rules approach, but rather a
non-weighted, ham-tuned, spam-finding rules approach.  It's unconventional
and takes a little getting used to.


Thanks!
Dan




HTML mail (was Re: A New Approach: Find the Ham)

2007-02-12 Thread Kelson

Tom Allison wrote:

Personally, I think HTML email should be outright discarded from the start.
If you look at this argument presented by the OP then it reinforces the 
idea that most ascii is ham and most html is spam.  Therefore, reject 
delivery of all html based email.  Or to be more succinct -- reject any 
MIME type of alternative content or html only content.  That would 
remove probably 90% of the spam in one shot.
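
As a rough illustration of what the proposed policy would check, here is a minimal sketch using Python's standard email package. The function name would_reject and the three sample messages are invented for the example; a real mail server would apply such a rule in its own configuration, not in a script like this.

```python
from email import message_from_string
from email.message import EmailMessage


def would_reject(raw_message):
    """Return True if the proposed policy would refuse this message."""
    msg = message_from_string(raw_message)
    # Collect the content type of every part, including the container itself.
    types = {part.get_content_type() for part in msg.walk()}
    if "multipart/alternative" in types:
        return True
    # HTML with no plain-text part anywhere counts as "html only".
    return "text/html" in types and "text/plain" not in types


# Three invented sample messages.
plain = EmailMessage()
plain.set_content("Lunch at noon?")

html_only = EmailMessage()
html_only.set_content("<b>Lunch at noon?</b>", subtype="html")

formatted = EmailMessage()
formatted.set_content("Lunch at noon?")
formatted.add_alternative("<b>Lunch at noon?</b>", subtype="html")

for name, sample in [("plain", plain), ("html_only", html_only),
                     ("formatted", formatted)]:
    print(name, would_reject(sample.as_string()))
```

Note that the "formatted" sample, an ordinary message with both a plain-text and an HTML alternative, is refused along with the HTML-only one. That is the cost the replies in this thread object to.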


Speaking from an ISP perspective:

I hate to break it to you, but most end users want some sort of 
formatted mail.  The days of all email being ASCII-only are over, just 
as the days of all websites being text-only are over.


Now, if you can come up with another markup language for formatting email...

* That satisfies end users' wants without being vulnerable to the
  filter-evasion that HTML makes possible
* And you can get all the major email clients to render it
* And you can get all the major email clients to use it for formatted
  composition instead of HTML (so end users can still make their text
  blue and embed the latest cute image of kittens)
* And you can get commercial email campaign software to use it instead
  of HTML (so organizations can include a company logo, or pictures of
  the items that they're promoting in this week's newsletter)

...*then* it'll be viable to discard HTML.

Obviously, individuals and businesses handling their own mail can apply 
stricter rules.  But it's not something that can be done (yet) on a 
large scale without disappointing a lot of people -- and not just the 
spammers.


--
Kelson Vibber
SpeedGate Communications www.speed.net


Re: HTML mail (was Re: A New Approach: Find the Ham)

2007-02-12 Thread Gene Heskett
On Monday 12 February 2007 13:27, Kelson wrote:
Tom Allison wrote:
 Personally, I think HTML email should be outright discarded from the
 start. If you look at this argument presented by the OP then it
 reinforces the idea that most ascii is ham and most html is spam. 
 Therefore, reject delivery of all html based email.  Or to be more
 succinct -- reject any MIME type of alternative content or html only
 content.  That would remove probably 90% of the spam in one shot.

Speaking from an ISP perspective:

I hate to break it to you, but most end users want some sort of
formatted mail.  The days of all email being ASCII-only are over, just
as the days of all websites being text-only are over.

With all due respect, that's 100% BS.  MIME was invented to handle the 
non-ascii stuff, and does it very well except for M$, who couldn't follow 
a standard rule with a loaded .44 magnum stuck in Bill's ear.

Now, if you can come up with another markup language for formatting
 email...

* That satisfies end users' wants without being vulnerable to the
   filter-evasion that HTML makes possible
* And you can get all the major email clients to render it
* And you can get all the major email clients to use it for formatted
   composition instead of HTML (so end users can still make their text
   blue and embed the latest cute image of kittens)
* And you can get commercial email campaign software to use it instead
   of HTML (so organizations can include a company logo, or pictures of
   the items that they're promoting in this week's newsletter)

...*then* it'll be viable to discard HTML.

There is: it's the proper use of mimetypes.

Obviously, individuals and businesses handling their own mail can apply
stricter rules.  But it's not something that can be done (yet) on a
large scale without disappointing a lot of people -- and not just the
spammers.

-- 
Cheers, Gene
There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order.
-Ed Howdershelt (Author)
Yahoo.com and AOL/TW attorneys please note, additions to the above
message by Gene Heskett are:
Copyright 2007 by Maurice Eugene Heskett, all rights reserved.


RE: HTML mail (was Re: A New Approach: Find the Ham)

2007-02-12 Thread Coffey, Neal
Gene Heskett wrote:
 On Monday 12 February 2007 13:27, Kelson wrote:
 Now, if you can come up with another markup language for formatting
 email... 
 
 [...]
 * And you can get all the major email clients to use it for formatted
   composition instead of HTML (so end users can still make their text
   blue and embed the latest cute image of kittens)
 [...]
 
 There is: it's the proper use of mimetypes.

I'm sorry, but I must have missed the class where they explained how
MIME types can be used for text markup.  Can you link me to a website
explaining how to use MIME types to change font colors and display
inline images?


Re: HTML mail (was Re: A New Approach: Find the Ham)

2007-02-12 Thread Kelson

Gene Heskett wrote:
With all due respect, that's 100% BS.  MIME was invented to handle the 
non-ascii stuff, and does it very well except for M$, who couldn't follow 
a std rule with a loaded 44 magnum stuck in Bill's ear.


100% BS?  So end-users don't like formatting in their messages?  Email 
is still all-ASCII?  Websites are still all-text?  Or are you responding 
to something else?



There is: it's the proper use of mimetypes.


I'm not talking about the MIME structure, I'm talking about the 
formatted version of the message.  Last I looked, MIME *by itself* 
didn't allow you to change fonts or colors, add bold or italics, create 
bulleted lists that flow properly, allow images to appear within a 
document instead of as a separate segment, etc.


In other words, what can adequately replace text/html in the 
non-plaintext multipart/alternative section such that HTML becomes 
irrelevant for legitimate uses?  Microsoft Word?  PDF?  RTF?  Any of 
those would be worse, IMO.  text/richtext might do the job, except 
Eudora is the only client I can think of that composes in it.


--
Kelson Vibber
SpeedGate Communications www.speed.net


Re: HTML mail (was Re: A New Approach: Find the Ham)

2007-02-12 Thread Kenneth Porter
--On Monday, February 12, 2007 12:50 PM -0800 Kelson [EMAIL PROTECTED] 
wrote:



In other words, what can adequately replace text/html in the
non-plaintext multipart/alternative section such that HTML becomes
irrelevant for legitimate uses?  Microsoft Word?  PDF?  RTF?  Any of
those would be worse, IMO.  text/richtext might do the job, except Eudora
is the only client I can think of that composes in it.


Mulberry does that and text/enriched.

http://www.mulberrymail.com/

The author is currently preparing it for open-sourcing.

I think all you need is an inline image markup and that format would then 
serve most needs.


Of course, Word and Outlook would likely still generate messages 10x the 
size of equivalent text, and add additional non-standard undocumented 
markup to embrace and extend the basic text/enriched format and lock out 
competitors from interoperability.


Re: HTML mail (was Re: A New Approach: Find the Ham)

2007-02-12 Thread John Rudd

Kelson wrote:

Tom Allison wrote:
Personally, I think HTML email should be outright discarded from the 
start.
If you look at this argument presented by the OP then it reinforces 
the idea that most ascii is ham and most html is spam.  Therefore, 
reject delivery of all html based email.  Or to be more succinct -- 
reject any MIME type of alternative content or html only content.  
That would remove probably 90% of the spam in one shot.


Speaking from an ISP perspective:

I hate to break it to you, but most end users want some sort of 
formatted mail.  The days of all email being ASCII-only are over, just 
as the days of all websites being text-only are over.




Speaking with my postmaster hat on, I agree: the days of ASCII-only are 
over, and I, as a postmaster, must allow the flow of "cooked" mail 
(HTML, Doc, graphics, etc.), and can neither force nor enforce "raw" text email.



Speaking with my "sender and recipient of email" user hat on: I 
couldn't give a rodent's posterior what other people do or don't want to 
put in their email.  By the time I see it, it's plain text, and if that 
removes some essential content, that's the other person's problem for 
having made a poor choice in the format of their message.  I take no 
responsibility for what they intended to send me.



Stepping back from both of those perspectives: I can't force other 
people to do any particular thing, and I don't want to.  But they can't 
force me to do any particular thing either.  I'm going to read plain 
text email, and sometimes look at attached images if I want to.  I won't 
stop you from sending me html-only, but I won't read it, either.



(in fact, earlier today, someone at work sent me a "please answer the 
part in green" message, and I answered back "none of it was in green" ... 
probably because I filter out any non-plain-text components of the 
email ... still waiting for her to reply)
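
The kind of filtering described here, keeping only the plain-text components of a message, might look something like the following sketch. It uses Python's standard email package; the function name plain_text_only and the sample message are assumptions made for illustration, not a description of the poster's actual setup.

```python
from email import message_from_string
from email.message import EmailMessage


def plain_text_only(raw_message):
    """Return the concatenated text/plain parts, dropping everything else."""
    msg = message_from_string(raw_message)
    kept = []
    for part in msg.walk():
        # Multipart containers report multipart/*, so only leaf parts match.
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True)
            charset = part.get_content_charset() or "utf-8"
            kept.append(payload.decode(charset, errors="replace"))
    return "\n".join(kept)


# Invented sample: the colour exists only in the HTML alternative.
sample = EmailMessage()
sample["Subject"] = "Quick question"
sample.set_content("Please answer the second question.")
sample.add_alternative(
    '<p style="color: green">Please answer the second question.</p>',
    subtype="html",
)

visible_text = plain_text_only(sample.as_string())
print(visible_text)
```

Any meaning carried only by the markup (here, the colour) never reaches the reader, which is exactly the situation in the anecdote.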


Re: A New Approach: Find the Ham

2007-02-12 Thread Duncan Findlay
On Sun, Feb 11, 2007 at 11:10:53PM -0500, Duncan Findlay wrote:
 I've read most of the e-mails on this topic and I think the underlying
 problem is that this method relies on knowing exactly which profiles
 (i.e. combinations of rules) valid ham can hit.

After re-reading your message with your prototype (there was one thing
I missed before), I'd like to revise my criticism. Will you respond?

 I see a number of problems:

 - How do we actually generate the profiles that are to be considered
 ham? Does it just need to happen once in anyone's mass-check logs?
 Does it have to happen multiple times? How many rules will this generate?

This remains valid.

 - Won't spammers be able to craft their messages (possibly even by
 breaking more rules) to meet a (known) ham profile? Currently spammers
 can craft messages but they have to avoid the major rules. By allowing
 certain profiles (depending on your answer to the previous point) this
 will give spammers enough room to put some pretty obvious spam
 through.

This remains valid, though the "possibly even by breaking more rules"
bit doesn't really make sense. My point is still that spammers can
craft spam messages that will fit your ham profiles, and (depending on
your response to my point above) this can probably let through some
pretty spammy messages.

 - What happens when you add a new local rule? How would you figure
 out which combinations are valid ham profiles with the new rule?

Still valid.

 - Suppose we start seeing a new type of ham that hits two low scoring
 rules under the current system? How will your system deal with that?
 The way I understand it, these new ham messages would be considered
 spam.

Still valid.

 Interesting theory, but I don't think it'll work in practice.

Still valid.

Thanks,

-- 
Duncan Findlay




Re: A New Approach: Find the Ham

2007-02-12 Thread Duncan Findlay
On Mon, Feb 12, 2007 at 11:00:06PM -0500, Duncan Findlay wrote:
 On Sun, Feb 11, 2007 at 11:10:53PM -0500, Duncan Findlay wrote:
  I've read most of the e-mails on this topic and I think the underlying
  problem is that this method relies on knowing exactly which profiles
  (i.e. combinations of rules) valid ham can hit.

 After re-reading your message with your prototype (there was one thing
 I missed before), I'd like to revise my criticism. Will you respond?

That paragraph seems to have an unnecessarily adversarial tone to
it. That was unintentional. I would like to know if I'm still missing
something, or if you have a cool solution to these problems. :-)

-- 
Duncan Findlay




Re: A New Approach: Find the Ham

2007-02-11 Thread John Rudd

Giampaolo Tomassoni wrote:

From: Miles Fidelman [mailto:[EMAIL PROTECTED]

Dan wrote:
I've developed a new approach to scoring that I want to 1) share with 
everyone and 2) make into a working system thats as accurate as what 
I've already built, but easier to use.  First, the theory:


NEW ASSUMPTION
All messages are spam unless x,y,z score says they're ham.

NEW APPROACH
Block everything, then create rules to not catch what you do want.  
ie, build tests that target the spam (keeping all the tests you've 
already built), then score the thousands of ways ham triggers on those 
tests.
It strikes me that the hardest part of this approach is filtering out 
too much ham.  At least for me, it's more important to make sure that 
people reach me, than to filter out all spam.  If we take the approach 
that everything is to be filtered out, except x,y,z - then the risk of 
filtering out too much seems pretty high.


I definitely agree with you.

By the way, if Dan really brought a new perspective to us (i.e.: a new way to 
detect ham), what would stop us from integrating it into SA?



Nothing would stop you from integrating it into SA.

For one, you could give every message a +5 just for existing.  Now 
you've assumed all messages are spam, and you're going to require that 
the message characteristics lower the score below 5.
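
A minimal sketch of that idea follows. The rule names and scores are hypothetical, chosen only for illustration; they are not SpamAssassin rules, and the threshold of 5 simply mirrors the number used in the paragraph above.

```python
# Hypothetical values: every message starts at the spam threshold,
# and only ham-indicating rules can pull it back under.
BASELINE = 5.0    # awarded to every message "just for existing"
THRESHOLD = 5.0   # at or above this, the message is treated as spam

# Invented ham-indicating rules with negative scores.
HAM_RULES = {
    "SENDER_IN_ADDRESS_BOOK": -3.0,
    "REPLY_TO_MY_MESSAGE": -2.5,
    "KNOWN_MAILING_LIST": -2.0,
}


def score(hits):
    """Total score for the collection of rule names a message hit."""
    return BASELINE + sum(HAM_RULES.get(name, 0.0) for name in hits)


def is_spam(hits):
    return score(hits) >= THRESHOLD


print(is_spam([]))                          # no ham evidence at all
print(is_spam(["SENDER_IN_ADDRESS_BOOK"]))  # one ham rule fired
```

With this arrangement a message that matches no rule at all is treated as spam, so every legitimate message has to earn its way in.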


The problem I see with this approach is that spam, by its nature, has 
characteristics in common that are already targeted:


a) coming from common points of origin, such as spamhauses, open relays, 
etc. (countered with blacklisting)


b) urging you to take certain actions, such as clicking on links, 
calling phone numbers, replying in order to opt-out, etc. (URIBLs, RE's 
and bayes)


c) similar topics, such as medication, porn, stocks, etc. (RE's and bayes)

d) mailers with similar bad behaviors, such as things which are easy to 
target via greet_pause, greylisting, nolisting, looking for format 
violations, etc.


So, in the "finding the spam" approach, you're looking for these 
features as a means of trying to identify the message as spam.



In order to develop a "find the ham" approach, you have to figure out: 
what are the characteristics of ham?


e) does it come from common points of origin?  no.  It can, and in my 
experience does, come from anywhere.


f) does it urge you to take certain actions?  not generally.

g) does it all have similar topics?  for my mailing lists, sure... but 
rarely do my gf and mother talk about the same topic...


Trying to narrow ham down to a range of sources, actions, and topics 
seems to be MUCH more difficult than trying to do the same for spam.


About the only thing you can do that sets ham apart from spam in these 
lists is d -- you could have a set h which says "if it comes from an 
RFC compliant source, we'll mark it as being slightly more ham-like". 
At which point, all of the spammers will get more RFC compliant.  That 
still leaves the problem that e-g are nowhere near as identifiable as 
targets as a-c are.



(that said: I'm not saying don't try -- do try ... I would love to be 
proven wrong, as long as the solution doesn't involve something as bad 
for the internet as challenge-response type systems are)


Re: A New Approach: Find the Ham

2007-02-11 Thread John Andersen
On Saturday 10 February 2007, Dan wrote:
 On Feb 10, 2007, at 14:38, Mathieu Bouchard wrote:
  How do you ever find FPs if you have so many TP to sort through  
  that it's not even worth sorting through FP+TP to find the FP ?  
  IMHO, that'd be why we assume that mails are ham rather than assume  
  that they are spam.

 I haven't found FP reviewing to be a big deal.  In my latest SA based  
 configuration, for example...

Whoa...

Which side of the fence are you on? 

How can you cite your current configuration of SA as any kind
of indication of how hard it would be to find FPs in a totally
reversed situation?




-- 
_
John Andersen




Re: A New Approach: Find the Ham

2007-02-11 Thread Justin Mason

Long-time SpamAssassin users with a good memory might recall that back in
SpamAssassin 2.4x, we included quite a few ham-targeting rules, such as
"was this sent using User-Agent: Mozilla?", "is this formatted like a
reply to a previous message?", "does it include headers from a mailing
list?" and "is it formatted like a PGP-signed message?"

Pretty soon, spammers simply adopted _all_ of those attributes,
sending spam containing User-Agent: mozilla, In-Reply-To headers,
formatted like PGP-signed reply messages ;)

If you give spammers a way to get negative points easily, they'll attack
it.  It's simply unsafe to assume they won't.  A published ruleset that
does this based on forgeable attributes will be quickly attacked (again).

Having said that, rules that are *unforgeable* are entirely safe to use,
and we include those -- namely whitelist_from_rcvd/spf/dk/dkim, and the
locally-trained Bayes tests (which spammers have a much harder time
guessing).

Also, writing your own local ham-spotting rules is generally safe, as long
as you don't publish them where spammers can find out about them.
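
The attack on forgeable ham rules can be sketched in a few lines. Everything here is hypothetical: the feature names and scores are invented for the example and are not the actual 2.4x rules.

```python
THRESHOLD = 5.0  # at or above this, the message is treated as spam

# Invented spam-indicating rules.
SPAM_RULES = {
    "mentions_cheap_pills": 4.0,
    "link_to_blocklisted_domain": 3.0,
}

# Invented ham-indicating rules that depend only on attributes
# the sender fully controls, and so can be forged at no cost.
FORGEABLE_HAM_RULES = {
    "user_agent_mozilla": -2.0,
    "has_in_reply_to": -2.0,
    "looks_pgp_signed": -1.5,
}


def score(features):
    """Sum the scores of every rule whose feature the message exhibits."""
    rules = {**SPAM_RULES, **FORGEABLE_HAM_RULES}
    return sum(rules.get(name, 0.0) for name in features)


honest_spam = {"mentions_cheap_pills", "link_to_blocklisted_domain"}
# The spammer reads the published ruleset and adds every cheap negative.
forged_spam = honest_spam | set(FORGEABLE_HAM_RULES)

print(score(honest_spam) >= THRESHOLD)
print(score(forged_spam) >= THRESHOLD)
```

The content of the two messages is identical. Only the forged headers differ, and they are enough to carry the second one under the threshold.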

--j.

Nigel Frankcom writes:
On Sat, 10 Feb 2007 15:14:56 -0500, Miles Fidelman
[EMAIL PROTECTED] wrote:

Dan wrote:
 I've developed a new approach to scoring that I want to 1) share with
 everyone and 2) make into a working system thats as accurate as what
 I've already built, but easier to use.  First, the theory:

 NEW ASSUMPTION
 All messages are spam unless x,y,z score says they're ham.

 NEW APPROACH
 Block everything, then create rules to not catch what you do want.
 ie, build tests that target the spam (keeping all the tests you've
 already built), then score the thousands of ways ham triggers on those
 tests.
It strikes me that the hardest part of this approach is filtering out
too much ham.  At least for me, it's more important to make sure that
people reach me, than to filter out all spam.  If we take the approach
that everything is to be filtered out, except x,y,z - then the risk of
filtering out too much seems pretty high.

These are my local stats... I'd far rather those numbers were the
other way round.

Even if Dan is wrong, at least he's thinking.

http://www.blue-canoe.com/stats/index.php?D1=11

What do Theo, Matt & Co have to say? They've been doing this a lot
longer than us.

Kind regards




Re: A New Approach: Find the Ham

2007-02-11 Thread tom


On Feb 10, 2007, at 3:19 PM, Giampaolo Tomassoni wrote:


From: Tom Allison [mailto:[EMAIL PROTECTED]

Personally, I think HTML email should be outright discarded from
the start.
If you look at this argument presented by the OP then it
reinforces the idea
that most ascii is ham and most html is spam.  Therefore, reject
delivery of all
html based email.  Or to be more succinct -- reject any MIME type
of alternative
content or html only content.  That would remove probably 90% of
the spam in one
shot.


Sending text/ascii e-mails may well fit your habits and those of
your contacts, but it would result in trashing a lot of ham on
larger userbases.


Giampaolo



I am clearly thinking in more revolutionary terms than what email has  
been doing over the last decade of trying to accommodate every Tom,  
Dick and Harry that comes along with a wish list.




RE: A New Approach: Find the Ham

2007-02-11 Thread Giampaolo Tomassoni
From: tom [mailto:[EMAIL PROTECTED]
 
 On Feb 10, 2007, at 3:19 PM, Giampaolo Tomassoni wrote:
 
  From: Tom Allison [mailto:[EMAIL PROTECTED]
  Personally, I think HTML email should be outright discarded from
  the start.
  If you look at this argument presented by the OP then it
  reinforces the idea
  that most ascii is ham and most html is spam.  Therefore, reject
  delivery of all
  html based email.  Or to be more succinct -- reject any MIME type
  of alternative
  content or html only content.  That would remove probably 90% of
  the spam in one
  shot.
 
  Sending text/ascii e-mails may well fit your habits and those of
  your contacts, but it would result in trashing a lot of ham on
  larger userbases.
 
  Giampaolo
 
 
 I am clearly thinking in more revolutionary terms than what email has
 been doing over the last decade of trying to accommodate every Tom,
 Dick and Harry that comes along with a wish list.

Well, I don't know: I don't dislike the fact that e-mail messages may carry 
HTML content. The problem is not what you bring to a destination, but how. 
The problem is that the whole set of RFCs governing electronic mail fails to 
prevent the use of fake addresses and leaves the sender completely anonymous 
to the occasional recipient.

This, combined with the very low cost at which one can send spam, results in 
a lot of spam.

If the identity of the sender could really be trusted, I believe it would be 
a lot easier to control spam and, eventually, get rid of it. There are RFCs 
for message signing and the like, but they are basically optional operations, 
not mandatory.

I may have a pessimistic view of the world, but as long as there is a 
business around spam, it will be very difficult to impose a new, 
sender-identity-aware, mandatory standard on electronic mail: there are too 
many economic interests around it. From ISPs to computer resellers, everybody 
gains something from it.

Giampaolo



Re: A New Approach: Find the Ham

2007-02-11 Thread Theo Van Dinter
On Sat, Feb 10, 2007 at 08:22:41PM +, Nigel Frankcom wrote:
 What do Theo, Matt & Co have to say? They've been doing this a lot
 longer than us.

Unless I'm missing something, this approach is the standard "block
everything except for what we explicitly want to receive".  Which is
great, if you can define "what we want to receive" in a way that
isn't able to be forged.  By and large, that means whitelist_from_*.
Then everything that isn't whitelisted gets blocked.

If that's what you want to do, that's fine.  The main issue is that
it's very likely that not all of the mails you would want to receive
are whitelisted.  If you don't care, then you're done.

If you do care ...  then you can't block the mails, and need to accept
them to figure out if you actually want to receive them.  How do you
deal with that?  Since you've already gotten rid of the mails that you
know you want, you need to filter the rest so that you get rid of the
stuff you don't want (ie: spam).

In the end, this is the methodology described on this list for years.
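
The two-stage flow described above can be sketched as follows. The addresses, the threshold, and the function name deliver are invented for the example, and the sketch assumes the sender identity has already been verified by some unforgeable means; it does not show how that verification is done.

```python
# Hypothetical whitelist of verified senders.
WHITELIST = {"alice@example.org", "list@example.net"}
THRESHOLD = 5.0  # spam score at or above this is rejected


def deliver(verified_sender, spam_score):
    """Stage 1: accept whitelisted senders. Stage 2: spam-filter the rest."""
    if verified_sender in WHITELIST:
        return "accept"
    # Wanted mail from unknown senders still exists, so the remainder
    # cannot simply be blocked; it goes through ordinary spam scoring.
    return "reject" if spam_score >= THRESHOLD else "accept"


print(deliver("alice@example.org", 9.0))
print(deliver("stranger@example.com", 9.0))
print(deliver("stranger@example.com", 1.0))
```

Once the second stage is added, the system is again a spam filter with a whitelist in front of it.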

-- 
Randomly Selected Tagline:
Kluge.net belongs to Theo, my ex-roommate from Worcester, who I can say
 with some measure of admiration, is insane.
 - Alan Caulkins, http://www.maxint.net/~fatman/




RE: A New Approach: Find the Ham

2007-02-11 Thread Philip Seccombe
Apologies if this has been answered before or anything, but where/how
are you generating those stats?
I'm not using SA with SQL so I'm not sure if it will work for me, but
those I like!

Stats in question: http://www.blue-canoe.com/stats/index.php?D1=11 


Kind Regards,
Philip Seccombe
Turnstone Technologies NZ Limited

Phone: +64 9 970 5550
Fax: +64 9 970 5559
DDI: +64 9 970 5552
Email: [EMAIL PROTECTED] 
Web: www.turnstone.co.nz 


Re: A New Approach: Find the Ham

2007-02-11 Thread .rp
On 10 Feb 2007 at 11:43, Dan wrote:

 I've developed a new approach to scoring that I want to 1) share with 
 everyone and 2) make into a working system thats as accurate as what 
 I've already built, but easier to use.  First, the theory:
[...] 
 NEW SITUATION
 Ham is now the tiniest minority of all email.
 
 NEW ASSUMPTION
 All messages are spam unless x,y,z score says they're ham.
 
 NEW APPROACH
 Block everything, then create rules to not catch what you do want.  
 ie, build tests that target the spam (keeping all the tests you've 
 already built), then score the thousands of ways ham triggers on 
 those tests.
 
 NEW RESULT
 Spend less time and energy while catching more of what you do want 
 and less of what you don't.
 
 CHALLENGE
 All filtering software is written to score for results that equal 
 spam - catch the bad
 
 SOLUTION
 Make filtering software score for results that equal ham - uncatch 
 the good.
 
 Your thoughts?
 Dan
 
The science fiction periodical ANALOG had a story based exactly on this
premise. I think the story ran in 2005. In the story, just about everything
is connected to a public interface, so everything is subject to getting
spammed - and worse.



Re: A New Approach: Find the Ham

2007-02-11 Thread Duncan Findlay
Hey Dan,

I've read most of the e-mails on this topic and I think the underlying
problem is that this method relies on knowing exactly which profiles
(i.e. combinations of rules) valid ham can hit.

I see a number of problems:

- How do we actually generate the profiles that are to be considered
ham? Does it just need to happen once in anyone's mass-check logs?
Does it have to happen multiple times? How many rules will this generate?

- Won't spammers be able to craft their messages (possibly even by
breaking more rules) to meet a (known) ham profile? Currently spammers
can craft messages but they have to avoid the major rules. By allowing
certain profiles (depending on your answer to the previous point) this
will give spammers enough room to put some pretty obvious spam
through.

- What happens when you add a new local rule? How would you figure
out which combinations are valid ham profiles with the new rule?

- Suppose we start seeing a new type of ham that hits two low scoring
rules under the current system? How will your system deal with that?
The way I understand it, these new ham messages would be considered
spam.

Interesting theory, but I don't think it'll work in practice.
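
To make the objections above concrete, here is a small sketch of profile matching as this message understands it: a message counts as ham only if the exact combination of rules it hit is a known ham profile. The rule names and the profile list are hypothetical, and this is one reading of the proposal, not Dan's actual prototype.

```python
# A "profile" is the exact set of rules a message hit.
# Hypothetical profiles, as if collected from earlier ham.
KNOWN_HAM_PROFILES = {
    frozenset(),                      # hit no rules at all
    frozenset({"RULE_A"}),
    frozenset({"RULE_A", "RULE_B"}),
}


def classify(hits):
    """Ham only when the exact combination of hits is a known profile."""
    return "ham" if frozenset(hits) in KNOWN_HAM_PROFILES else "spam"


# Previously seen combination: accepted.
print(classify({"RULE_A", "RULE_B"}))

# New kind of ham hitting two low-scoring rules in an unseen combination.
print(classify({"RULE_A", "RULE_C"}))

# Same old ham, but it also trips a newly added local rule.
print(classify({"RULE_A", "NEW_LOCAL_RULE"}))
```

Under this reading, every unseen combination defaults to spam, so both new kinds of ham and new local rules require the profile list to be regenerated.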

-- 
Duncan Findlay




RE: A New Approach: Find the Ham

2007-02-10 Thread Giampaolo Tomassoni
From: Dan [mailto:[EMAIL PROTECTED]
 
 I've developed a new approach to scoring that I want to 1) share with  
 everyone and 2) make into a working system thats as accurate as what  
 I've already built, but easier to use.  First, the theory:
 
 
 
 SITUATION
 In the beginning, all email was ham.  When spam came along, we left  
 the ham alone and targeted the annoyance (spam).
 
 ASSUMPTION
 All messages are ham unless x,y,z score says they're spam.
 
 APPROACH
 Block nothing, then create rules to catch what you don't want.  ie,  
 build tests that target the spam, then score the millions of ways  
 spam can occur.
 
 RESULT
 Huge time spent tuning and retuning weights, catching everything in  
 sight (including much ham).
 
 
 
 NEW SITUATION
 Ham is now the tiniest minority of all email.
 
 NEW ASSUMPTION
 All messages are spam unless x,y,z score says they're ham.
 
 NEW APPROACH
 Block everything, then create rules to not catch what you do want.   
 ie, build tests that target the spam (keeping all the tests you've  
 already built), then score the thousands of ways ham triggers on  
 those tests.
 
 NEW RESULT
 Spend less time and energy while catching more of what you do want  
 and less of what you don't.
 
 
 
 CHALLENGE
 All filtering software is written to score for results that equal  
 spam - catch the bad
 
 SOLUTION
 Make filtering software score for results that equal ham - uncatch  
 the good.
 
 
 Your thoughts?

How can this method take less time and energy? Aren't you just going to build 
a mirror image of the current method? Wouldn't your rules be like the current 
ones, but negated?

Giampaolo

 
 Dan
 
 
 BTW, is there a better forum for this level of question?
 



Re: A New Approach: Find the Ham

2007-02-10 Thread Nigel Frankcom
On Sat, 10 Feb 2007 20:52:17 +0100, Giampaolo Tomassoni
[EMAIL PROTECTED] wrote:

How can this method spend less time and energy? Aren't you going to build a 
mirrored method with respect to the actual one? Your rules wouldn't be like 
the actual ones, but negated?

Dan has a good point, on the surface at least. Spam now accounts for
80%+ of all mail, so why are we concentrating on that?

At least the point is worth debate (IMHO).

Can it be done? Even I can see that it can, given the right impetus,
though perhaps too many companies are making a good $/£/¥ off
anti-spam systems based on, around or directly using SA.

It'll be interesting to see where this thread goes.

Kind regards

Nigel


Re: A New Approach: Find the Ham

2007-02-10 Thread Tom Allison



CHALLENGE
All filtering software is written to score for results that equal  
spam - catch the bad


SOLUTION
Make filtering software score for results that equal ham - uncatch  
the good.



Your thoughts?


How can this method spend less time and energy? Aren't you going to build a 
mirrored method with respect to the actual one? Your rules wouldn't be like the actual 
ones, but negated?

Giampaolo


Dan


BTW, is there a better forum for this level of question?






This would be easier to filter.
It would also be better suited to a statistical approach than a regex approach.

Personally, I think HTML email should be outright discarded from the start.
If you look at the argument presented by the OP, it reinforces the idea 
that most ASCII is ham and most HTML is spam.  Therefore, reject delivery of all 
HTML-based email.  Or to be more succinct -- reject any message with a MIME type 
of multipart/alternative or with HTML-only content.  That would probably remove 
90% of the spam in one shot.
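
As an illustration only, the check proposed above might be sketched in
Python with the standard `email` package. The function name
`is_html_mail` is made up for this example and is not part of SA; the
sketch assumes that any part typed multipart/alternative or text/html
is enough to reject the message.

```python
from email import message_from_string

# Content types that the proposal above would reject outright.
REJECTED_TYPES = {"multipart/alternative", "text/html"}


def is_html_mail(raw_message: str) -> bool:
    """Return True if any MIME part is multipart/alternative or text/html."""
    msg = message_from_string(raw_message)
    # walk() yields the message itself followed by every nested part.
    return any(part.get_content_type() in REJECTED_TYPES for part in msg.walk())
```

A message with no Content-Type header is treated as text/plain by the
`email` package, so it passes this check.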


Re: A New Approach: Find the Ham

2007-02-10 Thread Miles Fidelman

Dan wrote:
I've developed a new approach to scoring that I want to 1) share with 
everyone and 2) make into a working system thats as accurate as what 
I've already built, but easier to use.  First, the theory:


NEW ASSUMPTION
All messages are spam unless x,y,z score says they're ham.

NEW APPROACH
Block everything, then create rules to not catch what you do want.  
ie, build tests that target the spam (keeping all the tests you've 
already built), then score the thousands of ways ham triggers on those 
tests.

It strikes me that the biggest problem with this approach is filtering out 
too much ham.  At least for me, it's more important to make sure that 
people reach me than to filter out all spam.  If we take the approach 
that everything is to be filtered out, except x,y,z - then the risk of 
filtering out too much seems pretty high.


RE: A New Approach: Find the Ham

2007-02-10 Thread Giampaolo Tomassoni
From: Tom Allison [mailto:[EMAIL PROTECTED]
 
  CHALLENGE
  All filtering software is written to score for results that equal  
  spam - catch the bad
 
  SOLUTION
  Make filtering software score for results that equal ham - uncatch  
  the good.
 
 
  Your thoughts?
  
  How can this method spend less time and energy? Aren't you 
 going to build a mirrored method with respect to the actual 
 one? Your rules wouldn't be like the actual ones, but negated?
  
  Giampaolo
  
  Dan
 
 
  BTW, is there a better forum for this level of question?
 
  
  
 
 This would be easier to filter.
 It would also be more adaptive to a statistical approach than a 
 regex approach.
 
 Personally, I think HTML email should be outright discarded from 
 the start.
 If you look at this arguement presented by the OP then it 
 reinforces the idea 
 that most ascii is ham and most html is spam.  Therefore, reject 
 delivery of all 
 html based email.  Or to be more succinct -- reject any MIME type 
 of alternative 
 content or html only content.  That would remove probably 90% of 
 the spam in one 
 shot.

Sending plain-text (ASCII) e-mail may well fit your habits and those of your 
contacts, but rejecting HTML would result in trashing a lot of ham on larger user bases.

Giampaolo



Re: A New Approach: Find the Ham

2007-02-10 Thread urgrue
One consideration is that spam getting through is never more than an 
annoyance, while ham getting caught can be a big problem. So any kind of 
deny-by-default system has to deal with how to respond to people whose 
mail gets trapped, and provide a way for the sender to get 
approval.  How does one join the global whitelist, and how does one 
prevent spammers from joining it?


I don't think spam will ever go away until sending email costs money, via 
some kind of global digital stamp system. Frankly, I would welcome that 
with open arms, but it will probably never happen.



Dan has a good point; on the surface at least. spam now accounts for
80%+ of all mail, so why are we concentrating on that?

At least the point is worth debate (IMHO).

Can it be done? Even I can see that it can, given the right impetus.
Though perhaps too many companies are making a good $/£/Y off
anti-spam systems based on, around or directly using SA.

Be interesting to see where this thread goes.

Kind regards

Nigel
  




Re: A New Approach: Find the Ham

2007-02-10 Thread urgrue




This would be easier to filter.
It would also be more adaptive to a statistical approach than a regex 
approach.


Personally, I think HTML email should be outright discarded from the 
start.
If you look at this arguement presented by the OP then it reinforces 
the idea that most ascii is ham and most html is spam.  Therefore, 
reject delivery of all html based email.  Or to be more succinct -- 
reject any MIME type of alternative content or html only content.  
That would remove probably 90% of the spam in one shot.


Yeah, for about a week. Obviously they won't keep sending HTML mail if 
everyone is blocking it, right?


Re: A New Approach: Find the Ham

2007-02-10 Thread Nigel Frankcom
On Sat, 10 Feb 2007 15:14:56 -0500, Miles Fidelman
[EMAIL PROTECTED] wrote:

Dan wrote:
 I've developed a new approach to scoring that I want to 1) share with 
 everyone and 2) make into a working system thats as accurate as what 
 I've already built, but easier to use.  First, the theory:

 NEW ASSUMPTION
 All messages are spam unless x,y,z score says they're ham.

 NEW APPROACH
 Block everything, then create rules to not catch what you do want.  
 ie, build tests that target the spam (keeping all the tests you've 
 already built), then score the thousands of ways ham triggers on those 
 tests.
It strikes me that the hardest part of this approach is filtering out 
too much ham.  At least for me, it's more important to make sure that 
people reach me, than to filter out all spam.  If we take the approach 
that everything is to be filtered out, except x,y,z - then the risk of 
filtering out too much seems pretty high.

These are my local stats... I'd far rather those numbers were the
other way round.

Even if Dan is wrong, at least he's thinking.

http://www.blue-canoe.com/stats/index.php?D1=11

What do Theo, Matt & Co have to say? They've been doing this a lot
longer than us.

Kind regards


RE: A New Approach: Find the Ham

2007-02-10 Thread Giampaolo Tomassoni
From: Miles Fidelman [mailto:[EMAIL PROTECTED]
 
 Dan wrote:
  I've developed a new approach to scoring that I want to 1) share with 
  everyone and 2) make into a working system thats as accurate as what 
  I've already built, but easier to use.  First, the theory:
 
  NEW ASSUMPTION
  All messages are spam unless x,y,z score says they're ham.
 
  NEW APPROACH
  Block everything, then create rules to not catch what you do want.  
  ie, build tests that target the spam (keeping all the tests you've 
  already built), then score the thousands of ways ham triggers on those 
  tests.
 It strikes me that the hardest part of this approach is filtering out 
 too much ham.  At least for me, it's more important to make sure that 
 people reach me, than to filter out all spam.  If we take the approach 
 that everything is to be filtered out, except x,y,z - then the risk of 
 filtering out too much seems pretty high.

I definitely agree with you.

By the way, if Dan really brought us a new perspective (i.e., a new way to 
detect ham), what would stop us from integrating it into SA?

I would like to see this new perspective, however...

Giampaolo



Re: A New Approach: Find the Ham

2007-02-10 Thread Dan

Clarifications:

1) I'm not talking about generating new rules.  Rules stay the same.   
I'm describing a new scoring process only.


2) This would not be a replacement for SA, but an improvement: just a  
new way to process results already generated by SA.  Ideally, this  
would be a replacement for weights and metas.


Dan



How can this method spend less time and energy? Aren't you going  
to build a mirrored method with respect to the actual one? Your  
rules wouldn't be like the actual ones, but negated?


Giampaolo


Dan has a good point; on the surface at least. spam now accounts for
80%+ of all mail, so why are we concentrating on that?

At least the point is worth debate (IMHO).

Can it be done? Even I can see that it can, given the right impetus.
Though perhaps too many companies are making a good $/£/Y off
anti-spam systems based on, around or directly using SA.

Be interesting to see where this thread goes.

Kind regards

Nigel




Re: A New Approach: Find the Ham

2007-02-10 Thread Mark Samples
Is that the same as whitelisting? Maybe I do not understand, but a very
rigorous approach would be a whitelist methodology in which, once a new
account is created, the user sends email to everyone they want to
communicate with, and it 'autowhitelists' those addresses, so you can
only receive from those you communicate with (or want to). I.e., the
user has to authorize the receipt of a message into the whitelist (that
way the email address owner is solely responsible for what they
receive).

The main problem (although someone may be able to come up with an
appropriate compromise) is that if everyone were using this methodology,
how would one ever receive email?

Nonetheless, since there is less ham than spam nowadays, it makes more
sense to do what you are saying and deal only with the traffic the user
wishes to see instead of that which they don't. It seems the actual
programming needed to deal with this would be less stressful on machine
resources as well, i.e. fewer resources would be consumed dealing with
less incoming crap (er, mail, I mean). Stop it at the connection...
maybe a ulog plugin. Just a thought.
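
A minimal sketch of the 'autowhitelist' idea described above, assuming
that writing to an address counts as authorizing mail from it. The class
and method names are hypothetical and do not correspond to any SA
feature.

```python
class AutoWhitelist:
    """Accept mail only from addresses the user has already written to."""

    def __init__(self) -> None:
        self._allowed: set[str] = set()

    def record_outgoing(self, recipient: str) -> None:
        # Assumption: sending to an address authorizes replies from it.
        self._allowed.add(recipient.strip().lower())

    def accepts(self, sender: str) -> bool:
        # Deny by default: unknown senders are refused.
        return sender.strip().lower() in self._allowed


wl = AutoWhitelist()
wl.record_outgoing("Alice@example.org")
print(wl.accepts("alice@example.org"))     # True
print(wl.accepts("stranger@example.net"))  # False
```

The sketch also shows the deadlock raised above: if both parties run
this filter, neither can deliver the first message to the other.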

Miles Fidelman wrote:


Dan wrote:

I've developed a new approach to scoring that I want to 1) share with 
everyone and 2) make into a working system thats as accurate as what 
I've already built, but easier to use.  First, the theory:


NEW ASSUMPTION
All messages are spam unless x,y,z score says they're ham.

NEW APPROACH
Block everything, then create rules to not catch what you do want.  
ie, build tests that target the spam (keeping all the tests you've 
already built), then score the thousands of ways ham triggers on 
those tests.


It strikes me that the hardest part of this approach is filtering out 
too much ham.  At least for me, it's more important to make sure that 
people reach me, than to filter out all spam.  If we take the approach 
that everything is to be filtered out, except x,y,z - then the risk of 
filtering out too much seems pretty high.






Re: A New Approach: Find the Ham

2007-02-10 Thread Dan

On Feb 10, 2007, at 12:14, Miles Fidelman wrote:

Dan wrote:
I've developed a new approach to scoring that I want to 1) share  
with everyone and 2) make into a working system thats as accurate  
as what I've already built, but easier to use.  First, the theory:


NEW ASSUMPTION
All messages are spam unless x,y,z score says they're ham.

NEW APPROACH
Block everything, then create rules to not catch what you do  
want.  ie, build tests that target the spam (keeping all the tests  
you've already built), then score the thousands of ways ham  
triggers on those tests.
It strikes me that the hardest part of this approach is filtering  
out too much ham.  At least for me, it's more important to make  
sure that people reach me, than to filter out all spam.  If we take  
the approach that everything is to be filtered out, except x,y,z -  
then the risk of filtering out too much seems pretty high.


Actually, [unparalleled] accuracy is built into this approach.   
Currently, a ham gets caught and you either take out the rule that  
caught it or make a whitelist entry.


Lots of ongoing work = little cumulative return

With Find the Ham, whitelisting is almost obsolete.  When you find an  
FP, you make an exception for the specific profile, i.e. the permutation  
of tests/rules that caught the message, so this specific combination  
doesn't catch any more messages.  Each rule stays at full strength for  
every other permutation and no whitelist is needed.
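
One way to picture the profile exceptions described above is the rough
Python sketch below. It is only a guess at the mechanics: the class
name, the rule names, and the assumption that a message is caught
whenever it hits at least one rule and its exact combination of hits is
not a known ham profile are all illustrative, not taken from SA or from
a working configuration.

```python
from typing import FrozenSet, Iterable, Set

Profile = FrozenSet[str]  # the exact set of rule names a message hit


class ProfileFilter:
    """Catch every rule-hitting message except known ham profiles."""

    def __init__(self) -> None:
        self.ham_profiles: Set[Profile] = set()

    def add_exception(self, rules_hit: Iterable[str]) -> None:
        # Called after a reviewer confirms a false positive.
        self.ham_profiles.add(frozenset(rules_hit))

    def is_caught(self, rules_hit: Iterable[str]) -> bool:
        profile = frozenset(rules_hit)
        # Assumption: a message that hits no rule at all is not caught.
        if not profile:
            return False
        return profile not in self.ham_profiles


f = ProfileFilter()
f.add_exception(["RULE_A", "RULE_B"])            # hypothetical rule names
print(f.is_caught(["RULE_B", "RULE_A"]))         # False: excepted profile
print(f.is_caught(["RULE_A"]))                   # True: different profile
```

Because the exception is keyed on the whole combination, RULE_A and
RULE_B keep catching messages in every other combination, and no sender
whitelist or weight is involved.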


This training process is the best part of the whole approach.  It  
begins with a huge number of FPs, but significant improvements take  
only a few weeks.  After a few months (depending on the diversity of  
your ham), FPs are very, very rare.


Little ongoing work = huge cumulative return


Dan


Re: A New Approach: Find the Ham

2007-02-10 Thread Raul Dias
 NEW SITUATION
 Ham is now the tiniest minority of all email.
 
 NEW ASSUMPTION
 All messages are spam unless x,y,z score says they're ham.
 
 NEW APPROACH
 Block everything, then create rules to not catch what you do want.   
 ie, build tests that target the spam (keeping all the tests you've  
 already built), then score the thousands of ways ham triggers on  
 those tests.
 
 NEW RESULT
 Spend less time and energy while catching more of what you do want  
 and less of what you don't.
 
 
 
 CHALLENGE
 All filtering software is written to score for results that equal  
 spam - catch the bad
 
 SOLUTION
 Make filtering software score for results that equal ham - uncatch  
 the good.
 
 
 Your thoughts?


Here is my $0,02.

I have a similar approach already.  My problem is that 80% of the
messages are in pt_BR, which makes a lot of the rules in SA that target
English ineffective.

There is a large grey area that contains too much spam (FN) and ham (FP).

So, my approach is to quarantine some users' mail at scores as low as
4.0 (or even less).

This mail is moved to an IMAP folder and then manually sorted into
ham and spam folders.  This lets rules be created to catch spam, but also
to catch ham (which is harder and dangerous ground).
If necessary, white and black lists are created, but this is a last
resort, as it is not an affordable/scalable solution.

The spam and ham folders are then fed to sa-learn for training, and the
ham is given back to the user if necessary.
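
The workflow above could be sketched roughly as follows. This is not the
actual set of scripts: the function names and directory paths are
invented, and the sketch assumes the sorted IMAP folders have been
exported to local directories. Only the `--spam` and `--ham` options of
sa-learn are relied on, and the commands are built, not executed.

```python
QUARANTINE_SCORE = 4.0  # threshold mentioned in the message above


def should_quarantine(score: float, opted_in: bool,
                      threshold: float = QUARANTINE_SCORE) -> bool:
    """Quarantine only for users who agreed, and only at or above the threshold."""
    return opted_in and score >= threshold


def sa_learn_commands(spam_dir: str, ham_dir: str) -> list[list[str]]:
    """Build the training commands for the manually sorted folders."""
    return [
        ["sa-learn", "--spam", spam_dir],
        ["sa-learn", "--ham", ham_dir],
    ]


# Hypothetical paths; pass each list to subprocess.run() to train for real.
for cmd in sa_learn_commands("/var/quarantine/spam", "/var/quarantine/ham"):
    print(" ".join(cmd))
```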

This approach has a drawback.  Explicit authorization from the user is
necessary (in my view).  So a user who wants to help may choose either
to let their mail be quarantined and then get it back, or to let their
mail (above a 4.0 score) be analysed but not quarantined (only a copy is
kept, so nothing needs to be given back).

The good side is that it doesn't take a lot of users letting their
mail be analysed.  The rules improve for everyone based on a few
users.

Bayes also plays a more important role than in an English environment,
because of the lack of good rules in the native language.

Site-wide Bayes is missed (per-user is used); it would help separate
the grey area even more for non-monitored or low-volume users.

On the scripting side, I use Mail::IMAPClient, and I urge anyone writing
their own scripts to stay away from Mail::Box.


-Raul Dias



Re: A New Approach: Find the Ham

2007-02-10 Thread Mathieu Bouchard

On Sat, 10 Feb 2007, Dan wrote:


With Find the Ham, whitelisting is almost obsolete.  When you find an FP,


How do you ever find FPs if you have so many TPs to sort through that it's 
not even worth sorting through FP+TP to find the FPs? IMHO, that'd be why 
we assume that mails are ham rather than assume that they are spam.


 _ _ __ ___ _  _ _ ...
| Mathieu Bouchard - tél:+1.514.383.3801 - http://artengine.ca/matju
| Freelance Digital Arts Engineer, Montréal QC Canada

Re: A New Approach: Find the Ham

2007-02-10 Thread Dan


On Feb 10, 2007, at 14:38, Mathieu Bouchard wrote:
How do you ever find FPs if you have so many TP to sort through  
that it's not even worth sorting through FP+TP to find the FP ?  
IMHO, that'd be why we assume that mails are ham rather than assume  
that they are spam.


I haven't found FP reviewing to be a big deal.  In my latest SA-based  
configuration, for example, I organize captures according to the  
number of tests a given message fails.  The more tests are  
involved, the less a message needs to be double-checked.
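
A small sketch of that ordering, assuming each captured message is
represented by an ID and the list of tests it failed (both hypothetical
here). Messages that failed the fewest tests come first, since they are
the likeliest FPs.

```python
from typing import Dict, List, Tuple


def review_order(captures: Dict[str, List[str]]) -> List[Tuple[str, int]]:
    """Return (message ID, failed-test count) pairs, fewest failures first."""
    counted = [(msg_id, len(tests)) for msg_id, tests in captures.items()]
    # sorted() is stable, so ties keep their original capture order.
    return sorted(counted, key=lambda item: item[1])


captures = {
    "msg-1": ["RULE_A", "RULE_B", "RULE_C"],  # hypothetical rule names
    "msg-2": ["RULE_A"],
    "msg-3": ["RULE_A", "RULE_B"],
}
print(review_order(captures))
# [('msg-2', 1), ('msg-3', 2), ('msg-1', 3)]
```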


So as with other particulars, ease of use will depend on how well the  
approach is implemented.


Dan




Re: A New Approach: Find the Ham

2007-02-10 Thread Burak Ueda
Good point, but it will cause trouble UNLESS we find a way to recognize 
ham 100%. And it must be exactly 100% (99% won't be enough).
As other users said, with the current system, if we can filter 70-80% of the 
spam, the remaining 20-30% will only be an annoyance, but ham will be delivered.


But with the new approach, even if spam were stopped 100%, only 1% 
undelivered ham would cause a lot of trouble.
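
To put rough numbers on that trade-off, here is a small worked example.
The volume of 1,000 messages a day is invented for illustration, the 80%
spam share is the figure quoted earlier in the thread, and 75% is simply
the midpoint of the 70-80% range above.

```python
def daily_outcome(total: int, spam_pct: int,
                  spam_caught_pct: int, ham_lost_pct: int) -> tuple[int, int]:
    """Return (spam delivered, ham lost) per day, using integer arithmetic."""
    spam = total * spam_pct // 100
    ham = total - spam
    spam_delivered = spam - spam * spam_caught_pct // 100
    ham_lost = ham * ham_lost_pct // 100
    return spam_delivered, ham_lost


# Current system: 75% of spam caught, no ham lost.
print(daily_outcome(1000, 80, 75, 0))   # (200, 0)
# New approach as feared above: all spam caught, 1% of ham lost.
print(daily_outcome(1000, 80, 100, 1))  # (0, 2)
```

Under these assumptions the choice is between 200 annoying messages a
day and 2 wanted messages that never arrive.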


Just my 1 Yen  :-)




Dan wrote:
I've developed a new approach to scoring that I want to 1) share with 
everyone and 2) make into a working system thats as accurate as what 
I've already built, but easier to use.  First, the theory:




SITUATION
In the beginning, all email was ham.  When spam came along, we left 
the ham alone and targeted the annoyance (spam).


ASSUMPTION
All messages are ham unless x,y,z score says they're spam.

APPROACH
Block nothing, then create rules to catch what you don't want.  ie, 
build tests that target the spam, then score the millions of ways spam 
can occur.


RESULT
Huge time spent tuning and retuning weights, catching everything in 
sight (including much ham).




NEW SITUATION
Ham is now the tiniest minority of all email.

NEW ASSUMPTION
All messages are spam unless x,y,z score says they're ham.

NEW APPROACH
Block everything, then create rules to not catch what you do want.  
ie, build tests that target the spam (keeping all the tests you've 
already built), then score the thousands of ways ham triggers on those 
tests.


NEW RESULT
Spend less time and energy while catching more of what you do want and 
less of what you don't.




CHALLENGE
All filtering software is written to score for results that equal spam 
- catch the bad


SOLUTION
Make filtering software score for results that equal ham - uncatch 
the good.



Your thoughts?

Dan


BTW, is there a better forum for this level of question?