Subject: Allowing IMAP users to train spam/ham

Re: was: Allowing IMAP users to train spam/ham is:simplify training of misclassified emails

2012-03-22 Thread Kevin A. McGrail


> Before anyone rushes ahead and puts any time or money into this, I
> think it's worth establishing whether it makes any significant
> difference.

It solves several real-world problems that I'm aware of, but I agree it's
not going to hold up 3.4.0 or be a top priority for me.


regards,
KAM

Re: was: Allowing IMAP users to train spam/ham is:simplify training of misclassified emails

2012-03-22 Thread RW

On Thu, 22 Mar 2012 07:59:39 -0400
Kevin A. McGrail wrote:

> Yes and no. What you have missed is that David F Skoll is a key
> author of MIMEDefang. They also publish a great COTS solution for
> email filtering called CanIT. So his plugin is part of the commercial
> product.

AFAIK his Bayes uses word-pair tokenization, and DSPAM supports
various multi-word tokenizers, so they are somewhat more susceptible
to header rewriting.

> 
> However, his idea on tokens is very elegant. To
> extract them, I planned on using SA's existing Bayesian framework and
> delivering them to a header. What is done with the header from there is
> a spam/ham delivery issue, but at best sa-learn could use it. Lots of
> security and privacy issues to deal with, but I am just in the idea
> phase.

Before anyone rushes ahead and puts any time or money into this, I
think it's worth establishing whether it makes any significant
difference.

AFAIK Bayes tokenizes after any encoding is removed, so unless
Exchange does something extreme like converting to Unicode or rich-text
format, I doubt it makes any difference at all to the body.

I don't know how Exchange mangles headers, but I'm sceptical that it has
much effect, if any. You'd really need to look at the details.

Extra headers added after processing shouldn't be a problem, and it's
easy enough to strip them if you're paranoid.
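
As a rough illustration of how simple such stripping can be, here is a minimal Python sketch using only the standard library. The header prefixes are assumptions chosen for illustration, not a verified list of what Exchange adds, and the function name is hypothetical.

```python
from email import message_from_string

# Hypothetical header-name prefixes added after filtering (illustrative only;
# check real messages on your own system before relying on any of these).
POST_FILTER_PREFIXES = ("x-ms-exchange-", "x-example-added-")


def strip_added_headers(raw_message, prefixes=POST_FILTER_PREFIXES):
    """Return the message text with headers matching the prefixes removed."""
    msg = message_from_string(raw_message)
    for name in list(msg.keys()):
        if name.lower().startswith(prefixes):
            # Deleting a header removes every occurrence of that name and
            # does not raise if it is already gone.
            del msg[name]
    return msg.as_string()
```

The cleaned text could then be handed to whatever does the learning. Whether stripping is needed at all is, as said above, something to check against real messages first.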

Re: was: Allowing IMAP users to train spam/ham is:simplify training of misclassified emails

2012-03-22 Thread Kevin A. McGrail

Yes and no. What you have missed is that David F Skoll is a key author of 
MIMEDefang. They also publish a great COTS solution for email filtering called 
CanIT. So his plugin is part of the commercial product.

However, his idea on tokens is very elegant. To extract
them, I planned on using SA's existing Bayesian framework and delivering them to a
header. What is done with the header from there is a spam/ham delivery issue,
but at best sa-learn could use it. Lots of security and privacy issues to deal
with, but I am just in the idea phase.

Regards,
KAM

Per-Erik Persson wrote:

> Since we are on the subject of adding "magic links" to email headers to
> make it easier for non-technical staff to report spam:
> I don't understand how to extract the tokenized data needed to represent
> the specific email.
> Have I missed some plugin that everyone else knows about?
>
> The rest of the problem seems trivial if you already have an
> infrastructure deployed with SSO and a decent web interface.
>
> The setup with Postfix facing the world, SpamAssassin sanitizing it
> and Exchange storing it is something that I see quite often nowadays.

Re: was: Allowing IMAP users to train spam/ham is:simplify training of misclassified emails

2012-03-22 Thread David F. Skoll

On Thu, 22 Mar 2012 07:51:07 +0100
Per-Erik Persson  wrote:

> Since we are on the subject of adding "magic links" to email headers to
> make it easier for non-technical staff to report spam:
> I don't understand how to extract the tokenized data needed to
> represent the specific email.

We have an entire infrastructure built to support this.  It is proprietary,
however, and is not easily implemented as a SpamAssassin plugin, though
the basic idea probably could be.

Regards,

David.

Re: was: Allowing IMAP users to train spam/ham is:simplify training of misclassified emails

2012-03-22 Thread Robert Schetterer

On 22.03.2012 09:15, xTrade Assessory wrote:
> Robert Schetterer wrote:
>>>
>>
>> However, I have a mail address with a ham/spam learning transport;
>> almost no users forward anything to it. No wonder, since
>> the false positive rate is nearly zero.
>>
>> In fact, there are systems with webmail GUIs for classifying
>> spam, e.g. AOL, and reality shows that users don't use them very wisely.
>> Perhaps the "spam" and "delete" buttons are too close together, or the
>> users are simply clueless.
>>
>> My conclusion: don't waste your time implementing complicated mechanisms
>> for ham/spam training; work on the tagging/rejecting side to reduce
>> the false positive rate.
>>
> 
> Hi
> 
> I could not agree more ... in the end, sooner or later, you
> discover you have spent time on something with erroneous or no return at
> all ... not to mention the support overhead these extra mailboxes
> will create.
> 
> Besides the obvious points you already made, it is still highly
> questionable whether a "user" is able to classify reliably.
> 
> Also, IMO, most spam hits obvious account names/combinations, and most
> users are not affected by the problem unless their addresses are
> standard_names@
> 
> For years I have not cared so much any more and run a pretty standard
> SpamAssassin, but I query the maillog for delivery attempts to nonexistent
> accounts. First I slow the sender down after 2 invalid destination addresses,
> but I also record the sender details and block them for three months from
> within the access file (I run sendmail everywhere).
> 
> That works smoothly for me, still with almost zero CPU overhead for
> spamd, and it is practical, easy and cheap. The result: before, I got
> 50 spams per day on certain accounts; now 2, maybe 3. Those numbers
> are for mail servers, each with 50,000+ accounts going through.
> 
> Hans
> 

Something like http://mailfud.org/postpals/ may be helpful too at some
sites. I have heard amavis has a similar mechanism.

However, there is a lot a postmaster can do before trusting users'
spam/ham classification (e.g. there is the SpamAssassin blacklist and
whitelist feature). If somebody does rely on it, don't trust your users
completely: user training should only ever be one score among others.
It may, for example, add high Bayes points, but it should not push a
message over the spam/ham border in a single step.

(This is for ISP-style mail systems; the policy might be different for
dedicated company mail etc., but it's still complicated there too.)

But as reality shows, e.g. at AOL, their user abuse/spam reporting program
is totally broken: I have never had a "true spam alarm" from their users
for mail sent from my systems. On the other side, the AOL mail systems
themselves have a very high rate of trying to deliver spam to my servers.

-- 
Best Regards

MfG Robert Schetterer

Germany/Munich/Bavaria

Re: was: Allowing IMAP users to train spam/ham is:simplify training of misclassified emails

2012-03-22 Thread Per-Erik Persson

On 03/22/2012 07:59 AM, Robert Schetterer wrote:
> On 22.03.2012 07:51, Per-Erik Persson wrote:
>> Since we are on the subject of adding "magic links" to email headers to
>> make it easier for non-technical staff to report spam:
>> I don't understand how to extract the tokenized data needed to represent
>> the specific email.
>> Have I missed some plugin that everyone else knows about?
>>
>> The rest of the problem seems trivial if you already have an
>> infrastructure deployed with SSO and a decent web interface.
>>
>> The setup with Postfix facing the world, SpamAssassin sanitizing it
>> and Exchange storing it is something that I see quite often nowadays.
>>
>>
>>
> However, I have a mail address with a ham/spam learning transport;
> almost no users forward anything to it. No wonder, since
> the false positive rate is nearly zero.
>
> In fact, there are systems with webmail GUIs for classifying
> spam, e.g. AOL, and reality shows that users don't use them very wisely.
> Perhaps the "spam" and "delete" buttons are too close together, or the
> users are simply clueless.
>
> My conclusion: don't waste your time implementing complicated mechanisms
> for ham/spam training; work on the tagging/rejecting side to reduce
> the false positive rate.
>
You are right about how the average user works. ("Oh, I am tired of the
mailing list; let's classify it as spam since I don't know how to unsubscribe.")
However, a helpdesk and similar functions often get complaints about spam
getting through, and it is virtually impossible to make most users cut and
paste a header.
But pasting a single field from the header and sending it to the right
helpdesk queue or a web interface is probably just the right amount of work.
I have a personal toolbox to sieve out the phishing emails (and false
positives) and would like to make a closed loop for feeding
SpamAssassin without having access to the original emails.

Re: was: Allowing IMAP users to train spam/ham is:simplify training of misclassified emails

2012-03-22 Thread xTrade Assessory

Robert Schetterer wrote:
>>
> 
> However, I have a mail address with a ham/spam learning transport;
> almost no users forward anything to it. No wonder, since
> the false positive rate is nearly zero.
> 
> In fact, there are systems with webmail GUIs for classifying
> spam, e.g. AOL, and reality shows that users don't use them very wisely.
> Perhaps the "spam" and "delete" buttons are too close together, or the
> users are simply clueless.
> 
> My conclusion: don't waste your time implementing complicated mechanisms
> for ham/spam training; work on the tagging/rejecting side to reduce
> the false positive rate.
> 

Hi

I could not agree more ... in the end, sooner or later, you
discover you have spent time on something with erroneous or no return at
all ... not to mention the support overhead these extra mailboxes
will create.

Besides the obvious points you already made, it is still highly
questionable whether a "user" is able to classify reliably.

Also, IMO, most spam hits obvious account names/combinations, and most
users are not affected by the problem unless their addresses are
standard_names@

For years I have not cared so much any more and run a pretty standard
SpamAssassin, but I query the maillog for delivery attempts to nonexistent
accounts. First I slow the sender down after 2 invalid destination addresses,
but I also record the sender details and block them for three months from
within the access file (I run sendmail everywhere).

That works smoothly for me, still with almost zero CPU overhead for
spamd, and it is practical, easy and cheap. The result: before, I got
50 spams per day on certain accounts; now 2, maybe 3. Those numbers
are for mail servers, each with 50,000+ accounts going through.
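
The bookkeeping behind this can be sketched in a few lines of Python. This is only an illustration under assumptions: the input is taken to be already-parsed (sender IP, timestamp) pairs for rejected unknown recipients, "three months" is approximated as 90 days, the block is assumed to trigger at the same threshold of 2 invalid addresses, and the function names are hypothetical. The `REJECT` right-hand side is the usual sendmail access-map action.

```python
from collections import defaultdict
from datetime import datetime, timedelta

INVALID_LIMIT = 2                   # act after 2 invalid destinations (assumed trigger)
BLOCK_PERIOD = timedelta(days=90)   # "three months", approximated


def find_blocks(events, now):
    """events: iterable of (sender_ip, timestamp) for unknown-recipient rejects.

    Returns {sender_ip: block_expiry} for senders still blocked at `now`.
    """
    counts = defaultdict(int)
    last_seen = {}
    for sender_ip, ts in events:
        counts[sender_ip] += 1
        last_seen[sender_ip] = max(ts, last_seen.get(sender_ip, ts))
    blocks = {}
    for ip, n in counts.items():
        expiry = last_seen[ip] + BLOCK_PERIOD
        if n >= INVALID_LIMIT and expiry > now:
            blocks[ip] = expiry
    return blocks


def access_lines(blocks):
    """Render blocked senders as tab-separated access-map entries."""
    return [f"{ip}\tREJECT" for ip in sorted(blocks)]
```

Parsing the actual maillog lines is left out on purpose, since the log format depends on the MTA version and configuration.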

Hans

-- 
XTrade Assessory
International Facilitator
BR - US - CA - DE - GB - RU - UK
+55 (11) 4249.
http://xtrade.matik.com.br

Re: was: Allowing IMAP users to train spam/ham is:simplify training of misclassified emails

2012-03-22 Thread Robert Schetterer

On 22.03.2012 07:51, Per-Erik Persson wrote:
> Since we are on the subject of adding "magic links" to email headers to
> make it easier for non-technical staff to report spam:
> I don't understand how to extract the tokenized data needed to represent
> the specific email.
> Have I missed some plugin that everyone else knows about?
> 
> The rest of the problem seems trivial if you already have an
> infrastructure deployed with SSO and a decent web interface.
> 
> The setup with Postfix facing the world, SpamAssassin sanitizing it
> and Exchange storing it is something that I see quite often nowadays.
> 
> 
> 

However, I have a mail address with a ham/spam learning transport;
almost no users forward anything to it. No wonder, since
the false positive rate is nearly zero.

In fact, there are systems with webmail GUIs for classifying
spam, e.g. AOL, and reality shows that users don't use them very wisely.
Perhaps the "spam" and "delete" buttons are too close together, or the
users are simply clueless.

My conclusion: don't waste your time implementing complicated mechanisms
for ham/spam training; work on the tagging/rejecting side to reduce
the false positive rate.

-- 
Best Regards

MfG Robert Schetterer

Germany/Munich/Bavaria

was: Allowing IMAP users to train spam/ham is:simplify training of misclassified emails

2012-03-21 Thread Per-Erik Persson

Since we are on the subject of adding "magic links" to email headers to
make it easier for non-technical staff to report spam:
I don't understand how to extract the tokenized data needed to represent
the specific email.
Have I missed some plugin that everyone else knows about?

The rest of the problem seems trivial if you already have an
infrastructure deployed with SSO and a decent web interface.

The setup with Postfix facing the world, SpamAssassin sanitizing it
and Exchange storing it is something that I see quite often nowadays.

Re: Allowing IMAP users to train spam/ham

2012-03-21 Thread Benny Pedersen


On 2012-03-21 13:38, Michael Scheidell wrote:

> so, what would you manually learn?

With DSPAM it's not a problem; it only needs the DSPAM signature:

internet > postfix > dspam > postfix > exchange relay transport

Now Exchange has the DSPAM signature and can report back whether it's spam
or ham. How to make that work is out of my scope :=)

It's good that there is no Exchange SMTP problem.

Re: Allowing IMAP users to train spam/ham

2012-03-21 Thread David F. Skoll

On Wed, 21 Mar 2012 10:41:31 -0400
Michael Scheidell  wrote:

> But, what do you do about an email that was forwarded to someone else?
> And, that someone else has one of those silly anti-malware plugins
> that surfs to every url in any inbound email?

By default, our system won't allow training until the user logs in.
So clicking the link takes you to an authentication screen and the
voting only happens after you log in.

We provide an option to bypass this for those who are willing to risk
the things you mention.  Also, if you're sending outbound mail through our
system, we strip off preexisting voting links, which helps reduce the problem,
and of course we use the "nofollow" attribute in the link where possible
so that search engines that index mail archives don't cause voting to happen.

Regards,

David.

Re: Allowing IMAP users to train spam/ham

2012-03-21 Thread Kevin A. McGrail


On 3/21/2012 10:41 AM, Michael Scheidell wrote:
> On 3/21/12 9:57 AM, Kevin A. McGrail wrote:
>> Very elegant IMO.  I'd love to look at moving some of the framework
>> to support this into SA.  Any objections?  Won't be anything quick
>> but it's a really great idea.
>
> We thought about this once.
>
> Add (i.e. modify the body of the email with) 'report spam' and
> 'blacklist sender' links.
>
> If the links are internal (private IPs), or internally resolvable
> names, or names or IPs that resolve only locally or via VPN, then
> that might be OK.
>
> But, what do you do about an email that was forwarded to someone else?
> And, that someone else has one of those silly anti-malware plugins
> that surfs to every URL in any inbound email?
>
> (or some forwarded recipient decides to click on one of the links)

From my perspective, the key point is solely the framework to store the
Bayesian tokens from the email before delivery of the email, so that later a
"this is spam"/"this is ham" mechanism can take advantage of that
information without having the entire email.

The issues you are pointing to have more to do with the implementation
of the "this is spam"/"this is ham" mechanism.


Regards,
KAM

Re: Allowing IMAP users to train spam/ham

2012-03-21 Thread Michael Scheidell


On 3/21/12 9:57 AM, Kevin A. McGrail wrote:
> Very elegant IMO.  I'd love to look at moving some of the framework to
> support this into SA.  Any objections?  Won't be anything quick but
> it's a really great idea.

We thought about this once.

Add (i.e. modify the body of the email with) 'report spam' and
'blacklist sender' links.

If the links are internal (private IPs), or internally resolvable
names, or names or IPs that resolve only locally or via VPN, then that
might be OK.

But, what do you do about an email that was forwarded to someone else?
And, that someone else has one of those silly anti-malware plugins that
surfs to every URL in any inbound email?

(or some forwarded recipient decides to click on one of the links)


--
Michael Scheidell, CTO
o: 561-999-5000
d: 561-948-2259
SECNAP Network Security Corporation

   * Best Mobile Solutions Product of 2011
   * Best Intrusion Prevention Product
   * Hot Company Finalist 2011
   * Best Email Security Product
   * Certified SNORT Integrator

__
This email has been scanned and certified safe by SpammerTrap(r). 
For Information please see http://www.spammertrap.com/
__

Re: Allowing IMAP users to train spam/ham

2012-03-21 Thread Kevin A. McGrail


On 3/21/2012 10:03 AM, David F. Skoll wrote:
> On Wed, 21 Mar 2012 09:57:33 -0400
> "Kevin A. McGrail" wrote:
>
> [Storing Bayes tokens on the server and retrieving them when training]
>
>> Very elegant IMO.  I'd love to look at moving some of the framework
>> to support this into SA.  Any objections?  Won't be anything quick
>> but it's a really great idea.
>
> Feel free to use the idea.  Alas, the code is proprietary and wouldn't
> fit well into SpamAssassin anyway, so I can't contribute that back.

The idea alone is good enough, thanks.  I figured you had it in CanIT,
so I wanted to ask.


Regards,
KAM

Re: Allowing IMAP users to train spam/ham

2012-03-21 Thread David F. Skoll

On Wed, 21 Mar 2012 09:57:33 -0400
"Kevin A. McGrail"  wrote:

[Storing Bayes tokens on the server and retrieving them when training]

> Very elegant IMO.  I'd love to look at moving some of the framework
> to support this into SA.  Any objections?  Won't be anything quick
> but it's a really great idea.

Feel free to use the idea.  Alas, the code is proprietary and wouldn't
fit well into Spamassassin anyway, so I can't contribute that back.

Regards,

David.

Re: Allowing IMAP users to train spam/ham

2012-03-21 Thread Kevin A. McGrail


On 3/21/2012 9:30 AM, David F. Skoll wrote:
> Actually, there's a third way and it's what we do (but difficult to
> set up with pure SpamAssassin.)
>
> We tokenize inbound messages and store the tokens on the server. In
> each message, we add links for doing training. When you click on a
> training link, the system trains the message based on the tokens
> stored on the server. In that way, you are training using exactly the
> tokens that the Bayes code saw.
>
> Regards,
> David.

Very elegant IMO.  I'd love to look at moving some of the framework to
support this into SA.  Any objections?  Won't be anything quick but it's
a really great idea.

Re: Allowing IMAP users to train spam/ham

2012-03-21 Thread David F. Skoll

On Wed, 21 Mar 2012 13:44:49 +0100
Matus UHLAR - fantomas  wrote:

> Mangling of data by Exchange is a big problem when trying to filter
> spam in front of it. I see two ways to avoid this problem:
> - use a spam server for Exchange. We use one from GFI, with quite good
> results.
> - you can use a spam filter in front of Exchange, store copies on it
> and learn from them. However, you will probably be the only one who
> can train the spam filter in such a case.

Actually, there's a third way and it's what we do (but difficult to set
up with pure SpamAssassin.)

We tokenize inbound messages and store the tokens on the server.  In
each message, we add links for doing training.  When you click on a
training link, the system trains the message based on the tokens
stored on the server.  In that way, you are training using exactly the
tokens that the Bayes code saw.
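
To make the flow concrete, here is a minimal Python sketch of the general idea (store tokens at delivery time, train later from a link). It is not the proprietary implementation described above: the in-memory store, the naive tokenizer, the HMAC-signed link, the `filter.example` URL and all function names are assumptions made purely for illustration.

```python
import hashlib
import hmac
import re
from urllib.parse import urlencode

SECRET = b"change-me"             # hypothetical signing key
TOKEN_STORE = {}                  # message id -> tokens seen at filter time
SPAM_COUNTS, HAM_COUNTS = {}, {}  # stand-ins for a real Bayes database


def tokenize(text):
    # Deliberately naive; a real Bayes tokenizer is far more elaborate.
    return set(re.findall(r"[a-z0-9$'-]{3,}", text.lower()))


def sign(msg_id, verdict):
    data = f"{msg_id}:{verdict}".encode()
    return hmac.new(SECRET, data, hashlib.sha256).hexdigest()


def store_and_link(msg_id, text, base_url="https://filter.example/train"):
    """Store the tokens for a message and return its training links."""
    TOKEN_STORE[msg_id] = tokenize(text)
    return {
        v: base_url + "?" + urlencode({"id": msg_id, "as": v, "sig": sign(msg_id, v)})
        for v in ("spam", "ham")
    }


def train_from_link(msg_id, verdict, sig):
    """Train from the stored tokens; the original email is not needed."""
    if not hmac.compare_digest(sig, sign(msg_id, verdict)):
        raise ValueError("bad signature")
    counts = SPAM_COUNTS if verdict == "spam" else HAM_COUNTS
    for tok in TOKEN_STORE[msg_id]:
        counts[tok] = counts.get(tok, 0) + 1
    return len(TOKEN_STORE[msg_id])
```

The point of the design is that training uses exactly the tokens recorded before delivery, so whatever the mail store does to the message afterwards no longer matters.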

Regards,

David.

Re: Allowing IMAP users to train spam/ham

2012-03-21 Thread RW

On Wed, 21 Mar 2012 10:06:58 +0100
Matus UHLAR - fantomas wrote:

> >On Fri, 9 Mar 2012 16:38:49 +0100
> >Matus UHLAR - fantomas wrote:

> >No, it isn't. Bayes is a statistical filter; it needs to learn a lot
> >of diverse spam and ham to reach its optimum accuracy. It's been
> >demonstrated on Bogofilter that "train-on-everything" outperforms
> >"train-on-error" on the same corpora. They both end up with similar
> >accuracy, but "train-on-everything" gets there very much faster.
> >Bogofilter is almost identical to BAYES; they just differ in the
> >details of the tokenizer and the Robinson parameters.
> >
> >Training on SA misclassification is going to be glacially slow.
> 
> there are two problems when requiring users to manually learn on 
> everything.

I'm not advocating that users be forced to do anything; my preference
is to allow them to choose what they want to train on. Whether or not
your script chooses to learn everything they submit is a different
matter.

> - it's more work to implement

In general it's easier to implement explicit learn-spam and learn-ham
folders than it is to keep track of what is moved in and out of a spam
folder.
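
A minimal sketch of the explicit-folder approach, in Python. The folder names `.learn-spam` and `.learn-ham` and the Maildir layout are assumptions for illustration; `sa-learn --spam` and `sa-learn --ham` are the standard options, and they accept a directory of messages.

```python
from pathlib import PurePosixPath


def sa_learn_commands(maildir_root):
    """Build the sa-learn invocations for a user's explicit training folders."""
    root = PurePosixPath(maildir_root)
    return [
        # Hypothetical folder names; adjust to whatever the IMAP server exposes.
        ["sa-learn", "--spam", str(root / ".learn-spam" / "cur")],
        ["sa-learn", "--ham", str(root / ".learn-ham" / "cur")],
    ]
```

Each command list could be passed to `subprocess.run(cmd, check=True)` from a cron job. Nothing needs to track which messages moved in or out of a Spam folder, which is what makes this variant easier to implement.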

> - it's more work for users to do the training.

Not really. If they choose to learn just the SpamAssassin
misclassifications it's the same work, but they have the option to learn
more, in particular important ham. Personally, if I saw that
important mail was hitting BAYES_50, I'd feel pretty frustrated
sitting around waiting for FPs to train Bayes, knowing that those
FPs are avoidable.

> Note that the main goal of spam filters is to save people some work, 
> not to give it to them. The users will want to do the "train only on 
> misfires", and the sooner they get there, the better.

On Wed, 21 Mar 2012 08:38:24 -0400
Michael Scheidell wrote:

> On 3/21/12 5:06 AM, Matus UHLAR - fantomas wrote:
> > there are two problems when requiring users to manually learn on 
> > everything.
> > - it's more work to implement
> > - it's more work for users to do the training.
> and, if 95% of the users are using Microsoft Exchange, Exchange will 
> horribly mangle the headers, and the body, even changing the actual 
> encoding.
> so, what would you manually learn?

That applies to any form of manual user training, so it's a different
issue.

I don't know the details of what Exchange does, but I suspect it matters
less than you think because most of the information used by Bayes
is in normalized form.

Re: Allowing IMAP users to train spam/ham

2012-03-21 Thread Matus UHLAR - fantomas


> On 3/21/12 5:06 AM, Matus UHLAR - fantomas wrote:
>> there are two problems when requiring users to manually learn on
>> everything.
>>
>> - it's more work to implement
>> - it's more work for users to do the training.

On 21.03.12 08:38, Michael Scheidell wrote:
> and, if 95% of the users are using Microsoft Exchange, Exchange will
> horribly mangle the headers, and the body, even changing the actual
> encoding.
>
> so, what would you manually learn?

Mangling of data by Exchange is a big problem when trying to filter
spam in front of it. I see two ways to avoid this problem:
- use a spam server for Exchange. We use one from GFI, with quite good
results.
- you can use a spam filter in front of Exchange, store copies on it and
learn from them. However, you will probably be the only one who can train
the spam filter in such a case.

You actually _can_ train from messages that went through Exchange, but
mangling by Exchange will somewhat blur the results and lower Bayes
accuracy.
--

Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
BSE = Mad Cow Desease ... BSA = Mad Software Producents Desease

Re: Allowing IMAP users to train spam/ham

2012-03-21 Thread Michael Scheidell


On 3/21/12 5:06 AM, Matus UHLAR - fantomas wrote:
> there are two problems when requiring users to manually learn on
> everything.
>
> - it's more work to implement
> - it's more work for users to do the training.

and, if 95% of the users are using Microsoft Exchange, Exchange will
horribly mangle the headers, and the body, even changing the actual
encoding.

so, what would you manually learn?




Re: Allowing IMAP users to train spam/ham

2012-03-21 Thread Matus UHLAR - fantomas

> On Fri, 9 Mar 2012 16:38:49 +0100
> Matus UHLAR - fantomas wrote:
>> You can of course configure the mailer to train automatically on anything
>> received/delivered.  However this would apparently cause much higher
>> FP and FN rates than letting users train only on those that misfire.

On 10.03.12 00:07, RW wrote:
> The use of the word "apparently" never inspires much confidence. I'm
> guessing that you don't have any real evidence.

No, I don't have evidence from comparing long-running autolearn
against manual learning. However, cases were mentioned here on
the list where people complained about autolearn going wrong when no
manual training was used.

>>> If you're going to train on error then train on the right error, not
>>> a rarer, correlated error.
>>
>> The only error that really matters is the one that causes misfiring.
>
> No, it isn't. Bayes is a statistical filter; it needs to learn a lot of
> diverse spam and ham to reach its optimum accuracy. It's been
> demonstrated on Bogofilter that "train-on-everything" outperforms
> "train-on-error" on the same corpora. They both end up with similar
> accuracy, but "train-on-everything" gets there very much faster.
> Bogofilter is almost identical to BAYES; they just differ in the
> details of the tokenizer and the Robinson parameters.
>
> Training on SA misclassification is going to be glacially slow.

There are two problems when requiring users to manually learn on
everything.

- it's more work to implement
- it's more work for users to do the training.

Note that the main goal of spam filters is to save people some work,
not to give it to them. The users will want to do the "train only on
misfires", and the sooner they get there, the better.

Maybe relaxing the autolearn rules until the number of hams and spams
crosses the required values would help us.

--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
Emacs is a complicated operating system without good text editor.

Re: Allowing IMAP users to train spam/ham

2012-03-11 Thread RW

On Sun, 11 Mar 2012 13:56:52 -0600
LuKreme wrote:

> 
> On 09 Mar 2012, at 17:07 , RW wrote:
> 
> > It's been demonstrated on Bogofilter that "train-on-everything"
> > outperforms "train-on-error" on the same corpora. They both end-up
> > with similar accuracy, but "train-on-everything" gets there very
> > much faster.
> 
> But training is exceedingly slow. Under normal load, sa-learn putters
> along at 2.5-4 mesg/sec, and under load it can drop to under 1.
> 
> Now, sure, perhaps I should throw a quad core i7 at it, but REALLY?

You're missing the point. What I'm saying is that train-on-error is not
more accurate than train-on-everything, and that training on
SpamAssassin errors is going to be worse, not the optimal method as
was claimed.

If you want to trade accuracy for cost that's fine as long as you're
clear about it, but it shouldn't be dressed up as a better way to learn.

I'm not saying everything needs to be learned. In general, training on spam
that doesn't hit BAYES_99 and ham that doesn't hit BAYES_00 is a
reasonable compromise. The big problem with only training on full
SpamAssassin errors is that failure to properly classify ham will
rarely be corrected.
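
The compromise rule is small enough to state as code. This Python sketch assumes the true class comes from a human decision and that the set of SpamAssassin rule names that fired is available; the function name is hypothetical.

```python
def should_train(is_spam, rule_hits):
    """Decide whether a message is worth feeding to Bayes.

    is_spam:   the true class, as judged by the user.
    rule_hits: set of SpamAssassin rule names that fired on the message.
    """
    if is_spam:
        # Spam that Bayes is already certain about adds little.
        return "BAYES_99" not in rule_hits
    # Likewise for ham that Bayes already rates as clearly ham.
    return "BAYES_00" not in rule_hits
```

Note that a ham message hitting BAYES_50 gets trained under this rule even when its overall score was below the spam threshold, which is exactly the case that training only on full SpamAssassin errors misses.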

Re: Allowing IMAP users to train spam/ham

2012-03-11 Thread LuKreme

On 09 Mar 2012, at 17:07 , RW wrote:

> It's been demonstrated on Bogofilter that "train-on-everything" outperforms 
> "train-on-error" on the same corpora. They both end-up with similar accuracy, 
> but "train-on-everything" gets there very much faster.

But training is exceedingly slow. Under normal load, sa-learn putters along at 
2.5-4 mesg/sec, and under load it can drop to under 1.

Now, sure, perhaps I should throw a quad core i7 at it, but REALLY?

-- 
I NO LONGER WANT MY MTV Bart chalkboard Ep. 3G02

Re: Allowing IMAP users to train spam/ham

2012-03-09 Thread RW

On Fri, 9 Mar 2012 16:38:49 +0100
Matus UHLAR - fantomas wrote:

> You can of course configure the mailer to train automatically on anything 
> received/delivered.  However this would apparently cause much higher
> FP and FN rates than letting users train only on those that misfire.

The use of the word "apparently" never inspires much confidence. I'm
guessing that you don't have any real evidence.

> >If you're going to train on error then train on the right error, not
> >a rarer, correlated error.
> 
> The only error that really matters is the one that causes misfiring.

No, it isn't. Bayes is a statistical filter; it needs to learn a lot of
diverse spam and ham to reach its optimum accuracy. It's been
demonstrated on Bogofilter that "train-on-everything" outperforms
"train-on-error" on the same corpora. They both end up with similar
accuracy, but "train-on-everything" gets there very much faster.
Bogofilter is almost identical to BAYES; they just differ in the
details of the tokenizer and the Robinson parameters.

Training on SA misclassification is going to be glacially slow.

Re: Allowing IMAP users to train spam/ham

2012-03-09 Thread Matus UHLAR - fantomas

On Fri, 9 Mar 2012 08:38:21 +0100
Matus UHLAR - fantomas wrote:

>> On 05.03.12 12:15, RW wrote:
>> >I don't like it. It relies on FPs being removed from the SPAM
>> >folder rather than spam being sent to a learn-spam folder.

>On Wed, 7 Mar 2012 15:35:05 +0100
>Matus UHLAR - fantomas wrote:
>> Pardon me, but:
>>
>> Usage for end users
>>
>>  *move mail into SPAM folder to classify as spam
>>  *move mail out of SPAM folder to classify as not spam
>>
>> isn't the former what you want?

On 07.03.12 21:44, RW wrote:
>I'm more concerned about what happens to the mail that isn't moved.

apparently nothing, because it is assumed to be correctly evaluated.

On 09.03.12 14:13, RW wrote:

So are you saying that a legitimate mail that hits BAYES_99 and
scores 4.9 isn't worth learning as ham because it's correctly evaluated?

It's easier - it takes less CPU time and user effort.
It's also MUCH more important to train FPs than to train all.

>I think  positive training is better than supervised autolearning

those above clearly indicate positive and negative training, or do you
have different information?

When I first looked at it, it retrained on errors, with DSPAM
autotraining on everything. It probably does support train-on-error,
but IMO it would be inappropriate to train Bayes that way.

You can of course configure the mailer to train automatically on anything
received/delivered.  However this would apparently cause much higher FP
and FN rates than letting users train only on those that misfire.

>The scheme might work well for pure train-on-error, but that's not
>really practical on Spamassassin where the classification is
>distinct from the Bayes result.

pardon?

If you're going to train on error then train on the right error, not a
rarer, correlated error.

The only error that really matters is the one that causes misfiring.

The FP/FN rate based on the SA classification isn't anywhere near high
enough to train BAYES. If a user receives 10 legitimate mails a day and
SA works at its target FP rate of 1 in 2500, it would take over
100 years for Bayes to even turn-on.

With an FP rate of 1 in 2500, it will not matter that much :-)

But yes, this is one of the weaknesses of the Bayes system. It requires
a lot of mail to start firing.  However, you can lower both
bayes_min_ham_num and bayes_min_spam_num and the Bayes rules will start
hitting sooner. You can also modify the autolearning scores, though.

--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
"The box said 'Requires Windows 95 or better', so I bought a Macintosh".

Re: Allowing IMAP users to train spam/ham

2012-03-09 Thread RW

On Fri, 9 Mar 2012 08:38:21 +0100
Matus UHLAR - fantomas wrote:

> >> On 05.03.12 12:15, RW wrote:
> >> >I don't like it. It relies on FPs being removed from the SPAM
> >> >folder rather than spam being sent to a learn-spam folder.
> 
> >On Wed, 7 Mar 2012 15:35:05 +0100
> >Matus UHLAR - fantomas wrote:
> >> Pardon me, but:
> >>
> >> Usage for end users
> >>
> >>  *move mail into SPAM folder to classify as spam
> >>  *move mail out of SPAM folder to classify as not spam
> >>
> >> isn't the former what you want?
> 
> On 07.03.12 21:44, RW wrote:
> >I'm more concerned about what happens to the mail that isn't moved.
> 
> apparently nothing, because it is assumed to be correctly evaluated.

So are you saying that a legitimate mail that hits BAYES_99 and
scores 4.9 isn't worth learning as ham because it's correctly evaluated?

> 
> >I think  positive training is better than supervised autolearning
> 
> The steps above clearly indicate both positive and negative training, or do
> you have different information?

When I first looked at it, it retrained on errors, with DSPAM
autotraining on everything. It probably does support train-on-error,
but IMO it would be inappropriate to train Bayes that way.

> >The scheme might work well for pure train-on-error, but that's not
> >really practical on Spamassassin where the classification is
> >distinct from the Bayes result.
> 
> pardon?

If you're going to train on error then train on the right error, not a
rarer, correlated error.

The FP/FN rate based on the SA classification isn't anywhere near high
enough to train BAYES. If a user receives 10 legitimate mails a day and
SA works at its target FP rate of 1 in 2500, it would take over 
100 years for Bayes to even turn-on.

Re: Allowing IMAP users to train spam/ham

2012-03-08 Thread Matus UHLAR - fantomas

On 05.03.12 12:15, RW wrote:
>I don't like it. It relies on FPs being removed from the SPAM folder
>rather than spam being sent to a learn-spam folder.

On Wed, 7 Mar 2012 15:35:05 +0100
Matus UHLAR - fantomas wrote:

Pardon me, but:

Usage for end users

 *move mail into SPAM folder to classify as spam
 *move mail out of SPAM folder to classify as not spam

isn't the former what you want?

On 07.03.12 21:44, RW wrote:

I'm more concerned about what happens to the mail that isn't moved.

apparently nothing, because it is assumed to be correctly evaluated.

I think  positive training is better than supervised autolearning

The steps above clearly indicate both positive and negative training, or do
you have different information?

The scheme might work well for pure train-on-error, but that's not
really practical on Spamassassin where the classification is
distinct from the Bayes result.

pardon?

--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
Boost your system's speed by 500% - DEL C:\WINDOWS\*.*

Re: Allowing IMAP users to train spam/ham

2012-03-07 Thread RW

On Wed, 7 Mar 2012 15:35:05 +0100
Matus UHLAR - fantomas wrote:

> On 05.03.12 12:15, RW wrote:
> >I don't like it. It relies on FPs being removed from the SPAM folder
> >rather than spam being sent to a learn-spam folder.
> 
> Pardon me, but:
> 
> Usage for end users
> 
>  *move mail into SPAM folder to classify as spam
>  *move mail out of SPAM folder to classify as not spam
> 
> isn't the former what you want?

I'm more concerned about what happens to the mail that isn't moved.
I think positive training is better than supervised autolearning.

The scheme might work well for pure train-on-error, but that's not
really practical on Spamassassin where the classification is
distinct from the Bayes result.

Re: Allowing IMAP users to train spam/ham

2012-03-07 Thread Matus UHLAR - fantomas




On 04.03.12 14:02, RW wrote:
>An alternative would be to be more selective. I'm not sure if this is
>specific to dovecot but when I copy/move a file in IMAP the new
>maildir file has the same mtime, but a new epoch time in the file
>name. What you might do is generate a list of filenames that contain
>an epoch time later than the start of the previous run and sim-link
>them into a temporary directory, and then learn that.



On Mon, 5 Mar 2012 10:54:22 +0100
Matus UHLAR - fantomas wrote:

afaik, dovecot itself has plugin to learn spam/ham:

http://johannes.sipsolutions.net/Projects/dovecot-antispam


On 05.03.12 12:15, RW wrote:

I don't like it. It relies on FPs being removed from the SPAM folder
rather than spam being sent to a learn-spam folder.


Pardon me, but:

Usage for end users

*move mail into SPAM folder to classify as spam
*move mail out of SPAM folder to classify as not spam

isn't the former what you want?
--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
Micro$oft random number generator: 0, 0, 0, 4.33e+67, 0, 0, 0...

Re: Allowing IMAP users to train spam/ham

2012-03-06 Thread Christian Grunfeld

Hi,

Do you have per-virtual-user Bayes training, or site-wide? I ask because
I have a setup like yours and everything works fine. In my setup, users
manually move FNs to the spam folder and move FPs from the spam folder
back to the inbox. When they make those moves, a script copies the
message to a site-wide spam or ham folder as appropriate. A nightly
script then learns from the site-wide spam and ham folders and deletes
the messages from them, but not from the users' folders. Users can do
what they want with those mails (delete them or not).

Webmail plugins are available to do that work! They can also make the
copies over IMAP instead of through filesystem-level access.

Cheers

2012/3/4 LuKreme :
> I used to have a setup where IMAP users could put mail into either SPAM or 
> Junk mailboxes to have it auto trained and then I had a script that stepped 
> through and did the training, and it also processed non-new mail in the inbox 
> as ham.
>
> USERROOT="$HOME"
> MAILP="Maildir"
>
> J_PATH="$USERROOT/$MAILP/.Junk"
> S_PATH="$USERROOT/$MAILP/.SPAM"
> H_PATH="$USERROOT/$MAILP/cur"
>
> if [ -d "$J_PATH" ]; then
>   /usr/local/bin/sa-learn --spam --progress "$J_PATH"/new "$J_PATH"/cur
> fi
>
> if [ -d "$S_PATH" ]; then
>   /usr/local/bin/sa-learn --spam --progress "$S_PATH"/new "$S_PATH"/cur
> fi
>
> if [ -d "$H_PATH" ]; then
>   /usr/local/bin/sa-learn --ham "$H_PATH"
> fi
>
> This all worked fine, but it was very resource intensive, and it only worked 
> with the very few shell users. I tried to run it (manually) a few times with 
> the virtual users, but I ended up with a process that ground the computer to 
> a halt and generated a bayes database that was massively large (GBs).
>
> So, other than throwing more iron at the problem, is there something I can do 
> to make this process a little smarter? Make it work with the virtual users 
> without generating a massive db file?
>
> --
> 'What can I do? I'm only human,' he said aloud.  Someone said, Not all
> of you. --Pyramids
>

Re: Allowing IMAP users to train spam/ham

2012-03-05 Thread Robert Schetterer

Am 05.03.2012 13:15, schrieb RW:
> On Mon, 5 Mar 2012 10:54:22 +0100
> Matus UHLAR - fantomas wrote:
> 
>> On 04.03.12 14:02, RW wrote:
>>> An alternative would be to be more selective. I'm not sure if this is
>>> specific to dovecot but when I copy/move a file in IMAP the new
>>> maildir file has the same mtime, but a new epoch time in the file
>>> name. What you might do is generate a list of filenames that contain
>>> an epoch time later than the start of the previous run and sim-link
>>> them into a temporary directory, and then learn that.
>>
>> afaik, dovecot itself has plugin to learn spam/ham:
>>
>> http://johannes.sipsolutions.net/Projects/dovecot-antispam
> 
> I don't like it. It relies on FPs being removed from the SPAM folder
> rather than spam being sent to a learn-spam folder.
> 

I use a spam/ham forwarding email transport,

something like these:
http://patrick-wessel.de/projektlinuxserver/spamtraining-mit-perl/
http://www.localside.net/sal-wrapper/

but to be honest, it's not widely used or needed.
-- 
Best Regards

MfG Robert Schetterer

Germany/Munich/Bavaria

Re: Allowing IMAP users to train spam/ham

2012-03-05 Thread RW

On Mon, 5 Mar 2012 10:54:22 +0100
Matus UHLAR - fantomas wrote:

> On 04.03.12 14:02, RW wrote:
> >An alternative would be to be more selective. I'm not sure if this is
> >specific to dovecot but when I copy/move a file in IMAP the new
> >maildir file has the same mtime, but a new epoch time in the file
> >name. What you might do is generate a list of filenames that contain
> >an epoch time later than the start of the previous run and sim-link
> >them into a temporary directory, and then learn that.
> 
> afaik, dovecot itself has plugin to learn spam/ham:
> 
> http://johannes.sipsolutions.net/Projects/dovecot-antispam

I don't like it. It relies on FPs being removed from the SPAM folder
rather than spam being sent to a learn-spam folder.

Re: Allowing IMAP users to train spam/ham

2012-03-05 Thread Matus UHLAR - fantomas


LuKreme wrote:
I used to have a setup where IMAP users could put mail into either 
SPAM or Junk mailboxes to have it auto trained and then I had a 
script that stepped through and did the training, and it also 
processed non-new mail in the inbox as ham.


On 04.03.12 07:55, xTrade Assessory wrote:

what do you think of something less complex?

you need but probably have autolearn enabled


I guess you mean "you probably need autolearn enabled".
One of autolearn's problems is that once it starts misfiring, it will 
misfire more and more.

The manual part is what prevents this: it is mostly the incorrectly 
classified mail that needs to be learned.



I offer the users a mailbox where they can drop/move any message they
think is spam but that SpamAssassin evidently did not classify as such.
In my case the folder's name is X-SPAM.
This extra folder is necessary because what is in SPAM already is
supposed to be spam.


correct.


I don't know if it is a good idea to run sa-learn on new msgs without
knowing what they are.


It surely is not, however re-learning those as spam will fix that.


Also, choose your users well, so that they do not throw everything into this
folder.

Then you run a script from cron once a day like this:

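###
#!/bin/sh
# Find each user's X-Spam mbox file (the name must match exactly, case
# included) and learn its contents as spam.
/usr/bin/find /home/ -maxdepth 2 -type f -name X-Spam -print |
while IFS= read -r folder; do
   /usr/local/bin/sa-learn --spam --mbox "$folder"
done
###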


I think it would be wise to move messages away after learning.
--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
- Holmes, what kind of school did you study to be a detective?
- Elementary, Watson.  -- Daffy Duck & Porky Pig

Re: Allowing IMAP users to train spam/ham

2012-03-05 Thread Matus UHLAR - fantomas


On 04.03.12 14:02, RW wrote:

An alternative would be to be more selective. I'm not sure if this is
specific to dovecot but when I copy/move a file in IMAP the new
maildir file has the same mtime, but a new epoch time in the file name.
What you might do is generate a list of filenames that contain an epoch
time later than the start of the previous run and sim-link them into a
temporary directory, and then learn that.


afaik, dovecot itself has plugin to learn spam/ham:

http://johannes.sipsolutions.net/Projects/dovecot-antispam

--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
- Have you got anything without Spam in it?
- Well, there's Spam egg sausage and Spam, that's not got much Spam in it.

Re: Allowing IMAP users to train spam/ham

2012-03-04 Thread jdow


On 2012/03/04 11:57, John Hardin wrote:

On Sun, 4 Mar 2012, jdow wrote:


On 2012/03/04 10:30, LuKreme wrote:

On 04 Mar 2012, at 05:36 , xTrade Assessory wrote:
> question is if necessary ...

Being able to train mis-tagged spam is necessary, yes. I don’t see
anyway to process a message in a maildir and then move that message.
How would you do it?


bash script with for each on the directory. Train then delete each file in
sequence.


I'd suggest that it's a bad idea to delete your training corpus.


And the messages would be good training. However, privacy concerns may
require it be deleted. If not, mv works as well as rm.
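
A minimal sketch of the train-then-move loop described above, assuming sa-learn lives at /usr/local/bin/sa-learn as in the scripts in this thread. The function name, the LEARN_CMD variable and the archive directory are hypothetical; moving rather than deleting keeps the corpus around for retraining.

```shell
#!/bin/sh
# Sketch: learn every message file in a folder, then move it to an archive
# directory instead of deleting it. LEARN_CMD and the directories passed in
# are placeholders; adjust them for the local setup.
LEARN_CMD=${LEARN_CMD:-/usr/local/bin/sa-learn}

# train_and_archive KIND SRC_DIR ARCHIVE_DIR   (KIND is "spam" or "ham")
train_and_archive() {
    kind=$1
    src=$2
    archive=$3
    mkdir -p "$archive" || return 1
    for msg in "$src"/*; do
        [ -f "$msg" ] || continue      # skip an unmatched glob and subdirs
        # Move the file only if learning succeeded, so that a failed
        # message is still there for the next run.
        if "$LEARN_CMD" "--$kind" "$msg" >/dev/null; then
            mv "$msg" "$archive"/
        fi
    done
}
```

It could be called as, for example, train_and_archive spam "$HOME/Maildir/.Junk/cur" "$HOME/learned-spam". Starting sa-learn once per file is slow on large folders, so this is best suited to the handful of messages users move each day.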

{^_-}

Re: Allowing IMAP users to train spam/ham

2012-03-04 Thread Jari Fredriksson

4.3.2012 22:44, LuKreme kirjoitti:
> Trouble with simply moving the messages about in the shell between Maildirs 
> is that the courier files don’t get updated properly. 
> 

I move my files all the time, and no problems occurred so far. I use
Courier too...

-- 

Things past redress and now with me past care.
-- William Shakespeare, "Richard II"




Re: Allowing IMAP users to train spam/ham

2012-03-04 Thread LuKreme


On 04 Mar 2012, at 12:57 , John Hardin wrote:

> On Sun, 4 Mar 2012, jdow wrote:
> 
>> On 2012/03/04 10:30, LuKreme wrote:
>>> On 04 Mar 2012, at 05:36 , xTrade Assessory wrote:
>>> >  question is if necessary ...
>>> 
>>> Being able to train mis-tagged spam is necessary, yes. I don’t see
>>> anyway to process a message in a maildir and then move that message.
>>> How would you do it?
>> 
>> bash script with for each on the directory. Train then delete each file in 
>> sequence.
> 
> I'd suggest that it's a bad idea to delete your training corpus.

Yeah, I never said anything about deleting.

Trouble with simply moving the messages about in the shell between Maildirs is 
that the courier files don’t get updated properly. 

-- 
Criticizing evolutionary theory because Darwin was limited is like
claiming computers don't work because Chuck Babbage didn't foresee Duke
Nukem 3.

Re: Allowing IMAP users to train spam/ham

2012-03-04 Thread Jari Fredriksson

4.3.2012 20:49, jdow kirjoitti:
> On 2012/03/04 10:30, LuKreme wrote:
>> On 04 Mar 2012, at 05:36 , xTrade Assessory wrote:
>>> question is if necessary ...
>>
>> Being able to train mis-tagged spam is necessary, yes. I don’t see
>> anyway to process a message in a maildir and then move that message.
>> How would you do it?
> 
> bash script with for each on the directory. Train then delete each file in
> sequence.
> 

If doing this, training via spamc would be good. And the spamd must have
--allow-tell to make this work.
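
A minimal sketch of training through spamc, assuming spamd is already running with --allow-tell. spamc's -L option takes spam, ham or forget and reads the message on standard input; the wrapper function and the SPAMC variable are illustrative only.

```shell
#!/bin/sh
# Sketch: hand one message to a running spamd for learning.
# spamd must have been started with --allow-tell for this to work.
SPAMC=${SPAMC:-spamc}

# learn_via_spamc KIND FILE   (KIND is spam, ham or forget)
learn_via_spamc() {
    case $1 in
        spam|ham|forget) ;;
        *) echo "usage: learn_via_spamc spam|ham|forget FILE" >&2
           return 2 ;;
    esac
    # spamc reads the message from standard input.
    "$SPAMC" -L "$1" < "$2"
}
```

Because spamd is already loaded, this avoids starting a fresh sa-learn process for every file in the loop.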

-- 

Today is the first day of the rest of the mess.




Re: Allowing IMAP users to train spam/ham

2012-03-04 Thread John Hardin


On Sun, 4 Mar 2012, jdow wrote:


On 2012/03/04 10:30, LuKreme wrote:

 On 04 Mar 2012, at 05:36 , xTrade Assessory wrote:
>  question is if necessary ...

 Being able to train mis-tagged spam is necessary, yes. I don’t see
 anyway to process a message in a maildir and then move that message.
 How would you do it?


bash script with for each on the directory. Train then delete each file 
in sequence.


I'd suggest that it's a bad idea to delete your training corpus.

--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  Failure to plan ahead on someone else's part does not constitute
  an emergency on my part. -- David W. Barts in a.s.r
---
 7 days until Daylight Saving Time begins in U.S. - Spring Forward

Re: Allowing IMAP users to train spam/ham

2012-03-04 Thread jdow


On 2012/03/04 10:30, LuKreme wrote:

On 04 Mar 2012, at 05:36 , xTrade Assessory wrote:

question is if necessary ...


Being able to train mis-tagged spam is necessary, yes. I don’t see anyway to 
process a message in a maildir and then move that message. How would you do it?


A bash script with a for-each loop over the directory: train on each file,
then delete it, in sequence.

{^_^}

Re: Allowing IMAP users to train spam/ham

2012-03-04 Thread LuKreme

On 04 Mar 2012, at 05:36 , xTrade Assessory wrote:
> question is if necessary ...

Being able to train mis-tagged spam is necessary, yes. I don’t see any way to 
process a message in a maildir and then move that message. How would you do it?

-- 
Lister: What d'ya think of Betty? Cat: Betty Rubble? Well, I would go
with Betty... but I'd be thinking of Wilma. Lister: This is crazy. Why
are we talking about going to bed with Wilma Flintstone? Cat: You're
right. We're nuts. This is an insane conversation. Lister: She'll never
leave Fred, and we know it.

Re: Allowing IMAP users to train spam/ham

2012-03-04 Thread RW

On Sun, 04 Mar 2012 09:36:25 -0300
xTrade Assessory wrote:

> LuKreme wrote:
> > On 04 Mar 2012, at 03:55 , xTrade Assessory wrote:
> >
> >> what do you think of something less complex?
> > Yeah, I went with Junk/NotJunk, anything placed in Junk gets
> > trained as spam, anything in NotJunk trained as ham. What I’d like
> > to do though is move the messages that are in NotJunk to the inbox
> > maildir as they are processed.

That's similar to what I do and some ESPs like Tuffmail do.

An alternative would be to be more selective. I'm not sure if this is
specific to dovecot, but when I copy/move a file in IMAP the new
maildir file has the same mtime, but a new epoch time in the file name.
What you might do is generate a list of filenames that contain an epoch
time later than the start of the previous run, symlink them into a
temporary directory, and then learn that.
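
A minimal sketch of the filename-timestamp idea, assuming Maildir filenames begin with a time in epoch seconds followed by a dot, as described above for Dovecot. The function name and the usage shown in the comments are hypothetical.

```shell
#!/bin/sh
# Sketch: symlink Maildir messages whose filename timestamp is newer than
# the previous run into a scratch directory, so only those get learned.
# Assumes filenames look like "<epoch-seconds>.<rest>".

# link_newer_than SINCE_EPOCH SRC_DIR DEST_DIR
# SRC_DIR should be an absolute path so the symlinks resolve from DEST_DIR.
link_newer_than() {
    since=$1
    src=$2
    dest=$3
    mkdir -p "$dest" || return 1
    for msg in "$src"/*; do
        [ -f "$msg" ] || continue
        name=${msg##*/}
        stamp=${name%%.*}              # text before the first dot
        case $stamp in
            ''|*[!0-9]*) continue ;;   # not a plain number: skip the file
        esac
        if [ "$stamp" -gt "$since" ]; then
            ln -s "$msg" "$dest/$name"
        fi
    done
}

# Possible use (paths are examples only):
#   link_newer_than "$last_run" "$HOME/Maildir/.Junk/cur" "$scratch"
#   sa-learn --spam "$scratch"
```

The start time of each run would need to be stored somewhere, for instance in a small state file, and passed in as SINCE_EPOCH on the next run.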

> if you have bayes already active as well as autolearn then why should
> you run all this again, still more since manual work may not be
> accurate. or do you read all this msgs to be sure they are ham/spam?

Because autolearn is better than nothing, but isn't very good.

It only learns the spam that's easily caught, it's very poor at
capturing a representative selection of ham without mislearning, and
it won't train on actual errors where BAYES has generated a point or more
in the wrong direction.
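
One way to look for the errors described above is to scan a folder of known ham for messages that nevertheless hit a high Bayes rule. This is only a sketch: it assumes SpamAssassin's X-Spam-Status header with its tests= list is present in the stored mail, and the function name is made up.

```shell
#!/bin/sh
# Sketch: list messages in a ham folder whose X-Spam-Status header shows a
# high Bayes hit, i.e. mail where Bayes itself pointed the wrong way even
# if the overall verdict was right.

# bayes_misfires DIR   -> prints the paths of matching files
bayes_misfires() {
    for msg in "$1"/*; do
        [ -f "$msg" ] || continue
        awk '
            /^$/ { exit }              # blank line: end of the headers
            /^[^ \t]/ { in_status = ($0 ~ /^X-Spam-Status:/) }
            in_status && /BAYES_(80|95|99)/ { found = 1; exit }
            END { exit !found }
        ' "$msg" && printf '%s\n' "$msg"
    done
}
```

The awk program only reads the header block and follows folded continuation lines of X-Spam-Status, so a mention of BAYES_99 in a subject or body does not count. The files it lists are candidates for sa-learn --ham.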

Re: Allowing IMAP users to train spam/ham

2012-03-04 Thread xTrade Assessory

LuKreme wrote:
> On 04 Mar 2012, at 03:55 , xTrade Assessory wrote:
>
>> what do you think of something less complex?
> Yeah, I went with Junk/NotJunk, anything placed in Junk gets trained as spam, 
> anything in NotJunk trained as ham. What I’d like to do though is move the 
> messages that are in NotJunk to the inbox maildir as they are processed.
>
> Possible?
>

everything is possible :)

question is if necessary ...

If you already have Bayes active as well as autolearn, then why should
you run all this again, all the more since manual work may not be
accurate? Or do you read all these msgs to be sure they are ham/spam?

As I understand it, sa-learn should be used only on content you are
sure is ham or spam, which is difficult unless you trust yourself and use
it only on your own mailbox :)

I use it because sometimes you get commercial messages which technically
are not spam, and even have correct auth headers and everything, but to me
they are SPAM because I do not want to receive some kind of offer every day.
Those I can pipe into sa-learn so they land in the spam folder
next time they come ...

but that is only my opinion

Hans

-- 
XTrade Assessory
International Facilitator
BR - US - CA - DE - GB - RU - UK
+55 (11) 4249.
http://xtrade.matik.com.br

Re: Allowing IMAP users to train spam/ham

2012-03-04 Thread LuKreme


On 04 Mar 2012, at 03:55 , xTrade Assessory wrote:

> what do you think of something less complex?

Yeah, I went with Junk/NotJunk, anything placed in Junk gets trained as spam, 
anything in NotJunk trained as ham. What I’d like to do though is move the 
messages that are in NotJunk to the inbox maildir as they are processed.

Possible?

-- 
Belief is one of the most powerful organic forces in the multiverse. It
may not be able to move mountains, exactly. But it can create someone
who can.

Re: Allowing IMAP users to train spam/ham

2012-03-04 Thread xTrade Assessory

LuKreme wrote:
> I used to have a setup where IMAP users could put mail into either SPAM or 
> Junk mailboxes to have it auto trained and then I had a script that stepped 
> through and did the training, and it also processed non-new mail in the inbox 
> as ham.

Hi

what do you think of something less complex?

you need but probably have autolearn enabled

I offer the users a mailbox where they can drop/move any message they
think is spam but that SpamAssassin evidently did not classify as such.
In my case the folder's name is X-SPAM.
This extra folder is necessary because what is in SPAM already is
supposed to be spam.

I don't know if it is a good idea to run sa-learn on new msgs without
knowing what they are.

Also, choose your users well, so that they do not throw everything into this
folder.

Then you run a script from cron once a day like this:

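###
#!/bin/sh
# Find each user's X-Spam mbox file (the name must match exactly, case
# included) and learn its contents as spam.
/usr/bin/find /home/ -maxdepth 2 -type f -name X-Spam -print |
while IFS= read -r folder; do
    /usr/local/bin/sa-learn --spam --mbox "$folder"
done
###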


good luck
Hans

> USERROOT="$HOME"
> MAILP="Maildir"
>
> J_PATH="$USERROOT/$MAILP/.Junk"
> S_PATH="$USERROOT/$MAILP/.SPAM"
> H_PATH="$USERROOT/$MAILP/cur"
>
> if [ -d "$J_PATH" ]; then
>   /usr/local/bin/sa-learn --spam --progress "$J_PATH"/new "$J_PATH"/cur
> fi
>
> if [ -d "$S_PATH" ]; then
>   /usr/local/bin/sa-learn --spam --progress "$S_PATH"/new "$S_PATH"/cur
> fi
>
> if [ -d "$H_PATH" ]; then
>   /usr/local/bin/sa-learn --ham "$H_PATH"
> fi
>
> This all worked fine, but it was very resource intensive, and it only worked 
> with the very few shell users. I tried to run it (manually) a few times with 
> the virtual users, but I ended up with a process that ground the computer to 
> a halt and generated a bayes database that was massively large (GBs).
>
> So, other than throwing more iron at the problem, is there something I can do 
> to make this process a little smarter? Make it work with the virtual users 
> without generating a massive db file?
>


-- 
XTrade Assessory
International Facilitator
BR - US - CA - DE - GB - RU - UK
+55 (11) 4249.
http://xtrade.matik.com.br

Allowing IMAP users to train spam/ham

2012-03-04 Thread LuKreme

I used to have a setup where IMAP users could put mail into either the SPAM or 
the Junk mailbox to have it auto-trained: a script stepped through those folders 
and did the training, and it also processed non-new mail in the inbox as ham.

USERROOT="$HOME"
MAILP="Maildir"

J_PATH="$USERROOT/$MAILP/.Junk"
S_PATH="$USERROOT/$MAILP/.SPAM"
H_PATH="$USERROOT/$MAILP/cur"

# [ -d DIR ] tests the directory itself. Wrapping `test -d` in backticks
# inside [ ] is always false, because test prints nothing.
if [ -d "$J_PATH" ]; then
   /usr/local/bin/sa-learn --spam --progress "$J_PATH"/new "$J_PATH"/cur
fi

if [ -d "$S_PATH" ]; then
   /usr/local/bin/sa-learn --spam --progress "$S_PATH"/new "$S_PATH"/cur
fi

# Everything in the inbox's cur/ (i.e. non-new mail) is learned as ham.
if [ -d "$H_PATH" ]; then
   /usr/local/bin/sa-learn --ham "$H_PATH"
fi

This all worked fine, but it was very resource intensive, and it only worked 
with the very few shell users. I tried to run it (manually) a few times with 
the virtual users, but I ended up with a process that ground the computer to a 
halt and generated a bayes database that was massively large (GBs).

So, other than throwing more iron at the problem, is there something I can do 
to make this process a little smarter? Make it work with the virtual users 
without generating a massive db file?

-- 
'What can I do? I'm only human,' he said aloud.  Someone said, Not all
of you. --Pyramids
