Re: Train and use bayes on different adresses

2008-06-26 Thread John Hardin

On Thu, 26 Jun 2008, Florian Lindner wrote:


Hello,
I use (honestly: I plan) the following procedure to filter my spam using SA:

All mail (for my family and me) is piped through spamc.
required_score is set to a high value of 9 to avoid false positives. Mail that
is detected as spam is deleted.


Refine that a bit. Leave the threshold at 5 so that suspicious messages 
get marked, but delete at a high level (e.g. 10+)
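
As a rough illustration of the two-level idea above (tag at the default threshold, delete only at a much higher score), here is a minimal Python sketch. It assumes the message has already passed through SpamAssassin and carries the usual X-Spam-Status header with a score= field. The function names and the 10.0 cut-off are made up for this example and are not part of SpamAssassin.

```python
import re
from email import message_from_string

TAG_THRESHOLD = 5.0      # SpamAssassin's default required_score
DELETE_THRESHOLD = 10.0  # assumed cut-off for discarding outright

_SCORE_RE = re.compile(r"score=(-?\d+(?:\.\d+)?)")


def spam_score(raw_message):
    """Return the score found in X-Spam-Status, or None if it is absent."""
    msg = message_from_string(raw_message)
    status = msg.get("X-Spam-Status", "")
    match = _SCORE_RE.search(status)
    return float(match.group(1)) if match else None


def disposition(score):
    """Map a score to one of: 'deliver', 'deliver-tagged', 'delete'."""
    if score is None:
        return "deliver"          # unscanned mail is never dropped
    if score >= DELETE_THRESHOLD:
        return "delete"
    if score >= TAG_THRESHOLD:
        return "deliver-tagged"   # user sees it, marked as suspicious
    return "deliver"
```

In a real setup this decision would more likely live in the MDA's filter rules (procmail or maildrop); the sketch only shows the logic.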


All SA filtering is done on the server side. On the client side 
additional filtering is done by statistic filters of Apple Mail and 
Thunderbird.


Now I want to train the server SA filter by moving the junk mails (which
have slipped through SA) on the client into an IMAP folder. This is done
only with the mail I receive, not the mail the rest of the family receives.


Why not let others train? Just give each user training folders.

Will this setup cause any problems? I ask because the bayes filter I 
train with only my email is used for all email.


It's better if you train with all users' email. Note that *you* may 
actually be doing the training, but it's still their email.


Some tools that may help you set things up are available here:

  http://www.impsec.org/~jhardin/antispam/

Hooking up spamc via procmail, special handling at a given score, and 
training from per-user spam and ham boxes. The only difference between 
what you're suggesting and what I'm doing today is that I have two mail 
servers, one at a hosted site and one at home (fed by fetchmail from the 
hosted server), so I have some extra glue moving the training folders from 
the home server's IMAP folders back out to the hosted server where SA 
runs. All my family have training folders, but I pretty much do all the 
training classification whenever I'm doing administrative stuff to their 
systems.
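
To make "training from per-user spam and ham boxes" a little more concrete, here is a small Python sketch that builds the sa-learn command lines for each user's training folders. It is not taken from the scripts on the page linked above. sa-learn's --spam and --ham options are real, but the Maildir layout, the folder names (.train-spam, .train-ham) and the function itself are assumptions chosen for illustration.

```python
from pathlib import PurePosixPath


def training_commands(mail_root, users,
                      spam_folder=".train-spam", ham_folder=".train-ham"):
    """Build one 'sa-learn --spam' and one 'sa-learn --ham' command per user.

    The commands are returned as argument lists rather than executed, so
    they can be inspected first or handed to subprocess.run() later.
    """
    commands = []
    for user in users:
        maildir = PurePosixPath(mail_root) / user / "Maildir"
        commands.append(["sa-learn", "--spam", str(maildir / spam_folder / "cur")])
        commands.append(["sa-learn", "--ham", str(maildir / ham_folder / "cur")])
    return commands
```

Run from cron, something like this would let one person sort mail into the folders while the learning itself stays automatic.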


--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 [EMAIL PROTECTED]    FALaholic #11174     pgpk -a [EMAIL PROTECTED]
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  Users mistake widespread adoption of Microsoft Office as the
  development of a standard document format.
---
 8 days until the 232nd anniversary of the Declaration of Independence


Re: Train and use bayes on different adresses

2008-06-26 Thread Florian Lindner


On 26.06.2008 at 18:26, John Hardin wrote:


On Thu, 26 Jun 2008, Florian Lindner wrote:


Hello,
I use (honestly: I plan) the following procedure to filter my spam  
using SA:


All mail (for my family and me) is piped through spamc.
required_score is set to a high value of 9 to avoid false positives.
Mail that is detected as spam is deleted.


Refine that a bit. Leave the threshold at 5 so that suspicious  
messages get marked, but delete at a high level (e.g. 10+)


What should be done with marked messages?

All SA filtering is done on the server side. On the client side  
additional filtering is done by statistic filters of Apple Mail and  
Thunderbird.


Now I want to train the server SA filter by moving the junk mails
(which have slipped through SA) on the client into an IMAP folder.
This is done only with the mail I receive, not the mail the rest of
the family receives.


Why not let others train? Just give each user training folders.


The rest of the family is not very computer-savvy, and I'm happy that
they get along well with the Thunderbird filter.


Will this setup cause any problems? I ask because the bayes filter  
I train with only my email is used for all email.


It's better if you train with all users' email. Note that *you* may  
actually be doing the training, but it's still their email.


Another option would be to completely disable the statistical filters
for my family and leave this completely up to Thunderbird. I would be
using another SA config with statistics. How can I implement this? Is it
sufficient to use spamc -F nostat.cf with use_bayes 0 in the
config file and just spamc for me? Are these two spamc invocations
separated from each other?






Some tools that may help you set things up are available here:

 http://www.impsec.org/~jhardin/antispam/


It's very interesting but way too sophisticated for my situation and  
audience.





Regards,

Florian



Re: Train and use bayes on different adresses

2008-06-26 Thread John Hardin

On Thu, 26 Jun 2008, Florian Lindner wrote:



On 26.06.2008 at 18:26, John Hardin wrote:


On Thu, 26 Jun 2008, Florian Lindner wrote:

 Hello,
 I use (honestly: I plan) the following procedure to filter my spam using 
 SA:
 
 All mail (for my family and me) is piped through spamc.
 required_score is set to a high value of 9 to avoid false positives. Mail
 that is detected as spam is deleted.


Refine that a bit. Leave the threshold at 5 so that suspicious messages get 
marked, but delete at a high level (e.g. 10+)


What should be done with marked messages?


If they are spam, the user can drop them into their spam training folder - 
the assumption is bayes doesn't recognize them well enough yet, but that 
isn't always the case.


If you want to minimize the number of low-scoring spams that your users
have to see, and you are less sensitive to FPs (which your original
proposal suggests), then you'd just delete at a lower score (e.g. 9+ or
8+).


Generally speaking, it's a bad idea to fiddle with the threshold as all 
the base rulesets are scored by the masscheck process with the assumption 
that 5 is spammy.


 All SA filtering is done on the server side. On the client side 
 additional filtering is done by statistic filters of Apple Mail and 
 Thunderbird.
 
 Now I want to train the server SA filter by moving the junk mails (which
 have slipped through SA) on the client into an IMAP folder. This is done
 only with the mail I receive, not the mail the rest of the family receives.


Why not let others train? Just give each user training folders.


The rest of the family is not very computer-savvy, and I'm happy that they
get along well with the Thunderbird filter.


That's reasonable. In my experience what you'll see when you review the
mailbox is a few false positives that you can copy to the user's ham
training folder for them. They will generally just delete any spams unless
you stress repeatedly that spams which leak through should go into the
spam training folder rather than the trash, and you may be able to tell
the MUA's classifier to save to the spam training folder rather than
deleting.


 Will this setup cause any problems? I ask because the bayes filter I 
 train with only my email is used for all email.


It's better if you train with all users' email. Note that *you* may 
actually be doing the training, but it's still their email.


Another option would be to completely disable the statistical filters for my
family and leave this completely up to Thunderbird. I would be using another
SA config with statistics. How can I implement this? Is it sufficient to use
spamc -F nostat.cf with use_bayes 0 in the config file and just spamc for
me? Are these two spamc invocations separated from each other?


I'd recommend against that, personally. Bayes is very helpful even if you 
can't get your users to train it themselves.


You might want to have Thunderbird move spams to the spam training folder
as I suggested; that way bayes will be led by Thunderbird and the
classification at the server (which is where it should be) will get
better.



Some tools that may help you set things up are available here:

 http://www.impsec.org/~jhardin/antispam/


It's very interesting but way too sophisticated for my situation and 
audience.


Most of it will be visible only to you. My wife and MiL don't worry 
about training and they get along well.


Then again, it also depends on how allergic to receiving _any_ spam your 
users are.




Re: Train and use bayes on different adresses

2008-06-26 Thread Florian Lindner


On 26.06.2008 at 19:31, John Hardin wrote:


On Thu, 26 Jun 2008, Florian Lindner wrote:



On 26.06.2008 at 18:26, John Hardin wrote:


On Thu, 26 Jun 2008, Florian Lindner wrote:

Hello,
I use (honestly: I plan) the following procedure to filter my spam
using SA:

All mail (for my family and me) is piped through spamc.
required_score is set to a high value of 9 to avoid false positives.
Mail that is detected as spam is deleted.

Refine that a bit. Leave the threshold at 5 so that suspicious
messages get marked, but delete at a high level (e.g. 10+)


What should be done with marked messages?


If they are spam, the user can drop them into their spam training  
folder - the assumption is bayes doesn't recognize them well enough  
yet, but that isn't always the case.


If you want to minimize the number of weak-scores spams that your  
users have to see, and you are less sensitive to FPs (which your  
original proposal suggests) then you'd just delete at a lower score  
(e.g. 9+ or 8+).


Generally speaking, it's a bad idea to fiddle with the threshold as  
all the base rulesets are scored by the masscheck process with the  
assumption that 5 is spammy.


Sorry, I don't understand this. What is the difference between changing
the threshold and deleting all spam messages, or leaving the threshold at
5 and deleting mail with 9 points? Does the threshold change anything
other than: if score > threshold: mark spam, else: mark ham, after all
tests have been run?


All SA filtering is done on the server side. On the client side
additional filtering is done by statistic filters of Apple Mail
and Thunderbird.

Now I want to train the server SA filter by moving the junk
mails (which have slipped through SA) on the client into an IMAP
folder. This is done only with the mail I receive, not the mail
the rest of the family receives.

Why not let others train? Just give each user training folders.


The rest of the family is not very computer-savvy, and I'm happy that
they get along well with the Thunderbird filter.


That's reasonable. In my experience what you'll see when you review
the mailbox is a few false positives that you can copy to the user's
ham training folder for them. They will generally just delete any
spams unless you stress repeatedly that spams which leak through
should go into the spam training folder rather than the trash, and
you may be able to tell the MUA's classifier to save to the spam
training folder rather than deleting.


For my family I want to leave it as it is.

Will this setup cause any problems? I ask because the bayes
filter I train with only my email is used for all email.

It's better if you train with all users' email. Note that *you*
may actually be doing the training, but it's still their email.


Another option would be to completely disable the statistical filters
for my family and leave this completely up to Thunderbird. I would
be using another SA config with statistics. How can I implement this?
Is it sufficient to use spamc -F nostat.cf with use_bayes 0 in
the config file and just spamc for me? Are these two spamc
invocations separated from each other?


I'd recommend against that, personally. Bayes is very helpful even  
if you can't get your users to train it themselves.


Can I use two different bayes DBs? One for my family without training  
(just the auto train functions) and one for me that is trained?


spamc is invoked from the maildrop MDA. I can't change the system user
I invoke spamc from, but the best solution would be two kinds of spamc
invocations that act as if they came from different users.
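
One possible way to get "two kinds of spamc invocations" is spamc's -u option, which asks spamd to process the message as a given user. The Python sketch below only picks the user name per recipient. The name "family", the set of trained users and the function are hypothetical, and whether spamd then really keeps separate Bayes databases for those names depends on how spamd itself is started and configured, which this thread does not settle.

```python
def spamc_command(recipient, trained_users, shared_user="family"):
    """Return the spamc argument list to use for one recipient address.

    Recipients whose local part is in trained_users get their own spamd
    user; everyone else shares a single (hypothetical) 'family' user.
    """
    local_part = recipient.split("@", 1)[0].lower()
    user = local_part if local_part in trained_users else shared_user
    return ["spamc", "-u", user]
```

For example, spamc_command("florian@example.org", {"florian"}) selects the user "florian", while any other family address falls back to the shared user.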


You might want to have Thunderbird move spams to the spam training
folder as I suggested; that way bayes will be led by Thunderbird and
the classification at the server (which is where it should be) will
get better.



Some tools that may help you set things up are available here:

http://www.impsec.org/~jhardin/antispam/


It's very interesting but way too sophisticated for my situation  
and audience.


Most of it will be visible only to you. My wife and MiL don't worry  
about training and they get along well.


Then again, it also depends on how allergic to receiving _any_ spam  
your users are.


I want to optimize it primarily for me, it's working fine for my family.

Florian


Re: Train and use bayes on different adresses

2008-06-26 Thread John Hardin

On Thu, 26 Jun 2008, Florian Lindner wrote:

Generally speaking, it's a bad idea to fiddle with the threshold as all the 
base rulesets are scored by the masscheck process with the assumption that 
5 is spammy.


Sorry, I don't understand this. What is the difference between changing the
threshold and deleting all spam messages, or leaving the threshold at 5 and
deleting mail with 9 points?


Raising the threshold will result in more emails that are obviously spam 
to a human being coming into their mailbox without a [SPAM] tag. It will 
make your antispam efforts look less effective - you're guaranteeing 
yourself more false negatives.


Leaving the threshold at 5 and deleting at the higher threshold will 
result in lower-scoring (i.e. possibly-not-spam) spams being delivered 
with a [SPAM] tag as a warning, while the higher-scoring (9+ obvious spam) 
spams don't get delivered.
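
The contrast between the two approaches can be shown with a few sample scores. This is only a sketch: the function names and the sample scores are invented for illustration, with the cut-offs of 5 and 9 taken from the discussion above.

```python
def raised_threshold_policy(score, threshold=9.0):
    """Single cut-off: mail at or above it counts as spam and is deleted."""
    return "delete" if score >= threshold else "deliver untagged"


def two_level_policy(score, tag_at=5.0, delete_at=9.0):
    """Tag at the default threshold, delete only at the higher one."""
    if score >= delete_at:
        return "delete"
    if score >= tag_at:
        return "deliver with [SPAM] tag"
    return "deliver untagged"


def compare(scores):
    """Return (score, raised-threshold outcome, two-level outcome) tuples."""
    return [(s, raised_threshold_policy(s), two_level_policy(s)) for s in scores]


if __name__ == "__main__":
    for row in compare([2.0, 6.5, 11.0]):
        print(row)
```

The two policies differ only for mail scoring between 5 and 9: with the raised threshold it arrives unmarked, with the two-level policy it arrives carrying the warning tag.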



For my family I want to leave it as it is.


Fair enough.

Can I use two different bayes DBs? One for my family without training (just 
the auto train functions) and one for me that is trained?


...that I don't know. Others may be able to comment.



Re: Train and use bayes on different adresses

2008-06-26 Thread Michael Monnerie
On Thursday, 26 June 2008, Florian Lindner wrote:
 Can I use two different bayes DBs? One for my family without training
   (just the auto train functions) and one for me that is trained?

You don't want that, really. If you use a trained bayes, it helps everyone.
You do not need to receive all the spam that your family gets yourself. Don't
forget that bayes also auto-learns. So just take your ham/spam, keep
bayes in training, and let it learn. Feed all e-mails through it, and the
results will be good.

Regards, zmi
-- 
// Michael Monnerie, Ing.BSc-  http://it-management.at
// Tel: 0660 / 415 65 31  .network.your.ideas.
// PGP Key: curl -s http://zmi.at/zmi.asc | gpg --import
// Fingerprint: AC19 F9D5 36ED CD8A EF38  500E CE14 91F7 1C12 09B4
// Keyserver: www.keyserver.net   Key-ID: 1C1209B4

