> NEW SITUATION
> Ham is now the tiniest minority of all email.
>
> NEW ASSUMPTION
> All messages are spam unless x,y,z score says they're ham.
>
> NEW APPROACH
> Block everything, then create rules to not catch what you do want.
> ie, build tests that target the spam (keeping all the tests you've
> already built), then score the thousands of ways ham triggers on
> those tests.
>
> NEW RESULT
> Spend less time and energy while catching more of what you do want
> and less of what you don't.
>
> CHALLENGE
> All filtering software is written to score for results that equal
> spam -> catch the bad
>
> SOLUTION
> Make filtering software score for results that equal ham -> uncatch
> the good.
>
> Your thoughts?

Here is my $0.02.

I have a similar approach already. My problem is that 80% of the messages are in pt_BR, which makes many of the SA rules that target English ineffective. There is a large grey area that contains too much spam (FN) and ham (FP).

So my approach is to quarantine mail for some users at scores as low as 4.0 (or even less). This mail is moved to an IMAP folder and then manually sorted into ham and spam folders. This lets rules be created to catch spam, but also to catch ham (which is harder and dangerous ground). If necessary, whitelists and blacklists are created, but this is a last resort, as it is not an affordable/scalable solution. The spam and ham folders are then fed to sa-learn for training, and the ham is given back to the user if necessary.

This approach has a drawback: explicit authorization from the user is necessary (in my view). So a user who wants to help may choose either to let their mail be quarantined and then get it back, or to let their mail (above a 4.0 score) be analysed but not quarantined (only a copy is kept, so nothing needs to be given back).

A good side of this is that it does not take many users letting their mail be analysed. The rules improve for everyone based on a few users.

Bayes also plays a more important role than in an English environment, because of the lack of good rules in the native language. Site-wide Bayes is missed here (per-user Bayes is used); it would help separate the grey area even more for non-monitored or low-volume users.

On the scripting side I use Mail::IMAPClient, and I urge anyone writing their own scripts to stay away from Mail::Box.

-Raul Dias
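
For illustration only, here is a minimal sketch of the per-user routing decision described above (deliver, keep a copy for analysis, or quarantine). It is written in Python rather than Perl, and every name in it (OptIn, Routing, route_message) is hypothetical. Treating a score equal to the threshold as "above" is also an assumption; only the 4.0 figure and the two opt-in modes come from the message.

```python
from dataclasses import dataclass
from enum import Enum


class OptIn(Enum):
    """Hypothetical per-user consent levels, mirroring the two choices above."""
    NONE = "none"              # user has not authorized any inspection
    COPY_ONLY = "copy_only"    # deliver normally, keep a copy for analysis
    QUARANTINE = "quarantine"  # hold the message until it is classified by hand


@dataclass
class Routing:
    deliver_now: bool      # message goes straight to the user's inbox
    keep_for_review: bool  # message (or a copy) lands in the review folder


def route_message(score: float, opt_in: OptIn, threshold: float = 4.0) -> Routing:
    """Decide what to do with one scored message.

    Assumption: a score at or above the threshold counts as grey-area mail.
    """
    # Users who did not opt in, and low-scoring mail, are never touched.
    if opt_in is OptIn.NONE or score < threshold:
        return Routing(deliver_now=True, keep_for_review=False)
    # Copy-only users get their mail immediately; only a copy is analysed.
    if opt_in is OptIn.COPY_ONLY:
        return Routing(deliver_now=True, keep_for_review=True)
    # Quarantine users get the message back only after manual inspection.
    return Routing(deliver_now=False, keep_for_review=True)
```

The review folder filled this way is what would later be sorted by hand into ham and spam and passed to sa-learn.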