Re: A New Approach: Find the Ham

Raul Dias Sat, 10 Feb 2007 13:51:07 -0800

> NEW SITUATION
> Ham is now the tiniest minority of all email.
> 
> NEW ASSUMPTION
> All messages are spam unless x,y,z score says they're ham.
> 
> NEW APPROACH
> Block everything, then create rules to not catch what you do want.   
> ie, build tests that target the spam (keeping all the tests you've  
> already built), then score the thousands of ways ham triggers on  
> those tests.
> 
> NEW RESULT
> Spend less time and energy while catching more of what you do want  
> and less of what you don't.
> 
> 
> 
> CHALLENGE
> All filtering software is written to score for results that equal  
> spam -> catch the bad
> 
> SOLUTION
> Make filtering software score for results that equal ham -> uncatch  
> the good.
> 
> 
> Your thoughts?



Here is my $0.02.

I already use a similar approach.  My problem is that 80% of the
messages are in pt_BR, which makes many of the SA rules that target
English ineffective.

This leaves a large grey area of scores that contains both missed spam
(FN) and wrongly flagged ham (FP).

So my approach is to quarantine mail for some users at scores as low as
4.0 (or even less).

This mail is diverted to an IMAP folder and then manually sorted into
ham and spam folders.  This lets rules be written to catch spam, but also
to catch ham (which is harder and more dangerous ground).
If necessary, whitelists and blacklists are created, but only as a last
resort, since they are not an affordable or scalable solution.

The spam and ham folders are then used to train Bayes with sa-learn, and
the ham is given back to the user if necessary.
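The training step could be scripted roughly as follows.  This is only a
minimal sketch in Python, not my actual scripts: it assumes the sorted
folders are reachable as directories on disk, and the function names and
paths are made up for illustration.  It relies only on the standard
sa-learn --spam and --ham switches.

```python
import subprocess


def sa_learn_commands(spam_dir, ham_dir):
    """Build the sa-learn command lines for the two sorted folders.

    spam_dir and ham_dir are hypothetical on-disk paths to the
    manually sorted spam and ham folders.
    """
    return [
        ["sa-learn", "--spam", spam_dir],
        ["sa-learn", "--ham", ham_dir],
    ]


def train(spam_dir, ham_dir, runner=subprocess.run):
    """Run sa-learn on the spam folder, then on the ham folder.

    runner is injectable so the function can be exercised without
    SpamAssassin installed; by default it is subprocess.run.
    """
    for cmd in sa_learn_commands(spam_dir, ham_dir):
        # check=True makes a failing sa-learn raise CalledProcessError
        runner(cmd, check=True)
```

With per-user Bayes, this would have to run as the user whose database
is being trained (or otherwise be pointed at that user's database).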

This approach has a drawback: explicit authorization from the user is
necessary (in my view).  So a user who wants to help may choose either
to let their mail be quarantined and then get it back, or to let their
mail (scoring above 4.0) be analysed but not quarantined (only a copy is
kept, so nothing needs to be given back).
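The per-user choice above can be sketched as a small routing decision.
Again this is an illustrative Python sketch under stated assumptions,
not my real setup: it assumes SpamAssassin has already added an
X-Spam-Status header containing "score=N", and the mode names
("quarantine", "copy", "none") and return values are invented here.

```python
import re
from email import message_from_string

# Matches the "score=4.3" field of an X-Spam-Status header
SCORE_RE = re.compile(r"score=(-?\d+(?:\.\d+)?)")

VALID_MODES = ("quarantine", "copy", "none")


def spam_score(raw_message):
    """Return the score from X-Spam-Status, or None if it is absent."""
    msg = message_from_string(raw_message)
    match = SCORE_RE.search(msg.get("X-Spam-Status", ""))
    return float(match.group(1)) if match else None


def route(raw_message, mode, threshold=4.0):
    """Decide what to do with one message for one user.

    mode is the user's (hypothetical) opt-in setting:
      "quarantine": hold grey-area mail, return ham later
      "copy":       deliver normally, keep a copy for analysis
      "none":       user has not authorized any inspection
    """
    if mode not in VALID_MODES:
        raise ValueError("unknown mode: %r" % (mode,))
    score = spam_score(raw_message)
    # Unscored mail, low scores and non-participating users pass through
    if mode == "none" or score is None or score < threshold:
        return "deliver"
    if mode == "quarantine":
        return "quarantine"
    return "deliver+copy"
```

The threshold defaults to 4.0 to match the figure above; it can be
lowered per user where needed.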

The good side is that it does not take many users letting their mail be
analysed.  The rules improve for everyone based on a few users.

Bayes also plays a more important role than in an English environment,
because of the lack of good rules in the native language.

I miss having site-wide Bayes (per-user is used here); it would help
separate the grey area even further for non-monitored or low-volume
users.

On the scripting side I use Mail::IMAPClient, and I urge anyone writing
their own scripts to stay away from Mail::Box.


-Raul Dias
