Re: was: Allowing IMAP users to train spam/ham is:simplify training of misclassified emails
> Before anyone rushes ahead and puts any time or money into this. I think
> it's worth establishing whether it makes any significant difference.

It solves several real-world problems that I'm aware of, but I agree it's not going to hold up 3.4.0 or be a top priority for me.

Regards,
KAM
Re: was: Allowing IMAP users to train spam/ham is:simplify training of misclassified emails
On Thu, 22 Mar 2012 07:59:39 -0400 Kevin A. McGrail wrote:

> Yes and no. What you have missed is that David F Skoll is a key
> author of MIMEDefang. They also publish a great COTS solution for
> email filtering called CanIT. So his plugin is part of the commercial
> product.

AFAIK his Bayes uses word-pair tokenization, and DSPAM supports various multi-word tokenizers, so they are somewhat more susceptible to header rewriting.

> However, his idea is very elegant on tokens is an elegant idea. To
> extract them, I planned on using SA's existing Bayesian framework and
> deliver them to a header. What is done with the header from there is
> a spam/ham delivery issue but at best sa-learn could use it. Lots of
> security and privacy issues to deal with but I am just in the idea
> phase.

Before anyone rushes ahead and puts any time or money into this, I think it's worth establishing whether it makes any significant difference.

AFAIK Bayes tokenizes after any encoding is removed, so unless Exchange does something extreme like converting to Unicode or rich-text format, I doubt it makes any difference at all to the body. I don't know how Exchange mangles headers, but I'm sceptical it has much effect, if any. You'd really need to look at the details.

Extra headers added after processing shouldn't be a problem, and it's easy enough to strip them if you're paranoid.
Re: was: Allowing IMAP users to train spam/ham is:simplify training of misclassified emails
Yes and no. What you have missed is that David F Skoll is a key author of MIMEDefang. They also publish a great COTS solution for email filtering called CanIT. So his plugin is part of the commercial product.

However, his idea on tokens is an elegant one. To extract them, I planned on using SA's existing Bayesian framework and delivering them to a header. What is done with the header from there is a spam/ham delivery issue, but at best sa-learn could use it. There are lots of security and privacy issues to deal with, but I am just in the idea phase.

Regards,
KAM

Per-Erik Persson wrote:
> Since we are on the subject of adding "magic links" to email header to
> make it easier for nontech staff to report spam.
> I don't understand how to extract the tokinzed data needed to represent
> the specific email.
> Have I missed some plugin that everyone else knows about?
>
> The rest of the problem seems trivial if you already have an
> infrastructure deployed with SSO and a decent webinterface.
>
> The setup with postfix facing the world, spamassassin sanitizinging it
> and exchange storing it is something that I see quite often nowdays.
Re: was: Allowing IMAP users to train spam/ham is:simplify training of misclassified emails
On Thu, 22 Mar 2012 07:51:07 +0100 Per-Erik Persson wrote:

> Since we are on the subject of adding "magic links" to email header to
> make it easier for nontech staff to report spam.
> I don't understand how to extract the tokinzed data needed to
> represent the specific email.

We have an entire infrastructure built to support this. It is proprietary, however, and is not easily implemented as a SpamAssassin plugin, though the basic idea probably could be.

Regards,
David.
Re: was: Allowing IMAP users to train spam/ham is:simplify training of misclassified emails
On 22.03.2012 09:15, xTrade Assessory wrote:
> Robert Schetterer wrote:
>> however , i have a ham/spam transport learn mail address,
>> nearly null users forwards something to it, no wonder
>> the false positve rate is nearly null
>>
>> in fact , there are systems with webmail guis for classify
>> spam i.e aol, reality shows users dont use it very wise
>> perhaps clicking field spam and delte are to near etc or they are simply
>> dummy
>>
>> my conclusion dont waste your time to implement complicated mechs
>> for ham/spam training, work on the tagging/rejecting side to reduce
>> false positive rate
>
> Hi
>
> I can not agree more to that ... at the end, sooner or later, you
> discover having spent time on something with erroneous or no return at
> all ... not even talking about the support-overhead this extra mboxes
> will create
>
> beside the obvious you already said it is still highly questionable if a
> "user" is able to classify reliable.
>
> also, IMO, most SPAM hits obvious account names/combinations and most
> user are not affected by the problem, unless their addresses are
> standard_names@
>
> since years I do not care so much any more and run a pretty standard
> spamassassin but I query maillog for delivering attempts to not existing
> accounts. First I slow it down after 2 invalid destination addresses but
> also record the sender details and block them for three month from
> within access file (I run sendmail everywhere)
>
> that works so smooth for me, still with almost zero cpu overhead for
> spamd and it is practical, easy and cheap, the result is, before I got
> on certain accounts 50 SPAMS per day, now 2 maybe 3 and that numbers
> are for mservers with each of them having +50.000 accounts going through
>
> Hans

Something like http://mailfud.org/postpals/ may be helpful too at some sites. I have heard amavis has a similar mechanism.

However, there is a lot a postmaster can do before trusting users' spam/ham classification (e.g. there is the SpamAssassin blacklist and whitelist feature). But if somebody does so, don't trust your users completely. User training should only ever be one tag among others: it may contribute high Bayes points, but it should not push a message over the spam/ham border in a single step. (This is for ISP-style mail systems; the policy might be different for dedicated company mail, but it's still complicated there too.)

But as reality shows, e.g. at AOL, their user spam-reporting program is totally broken. I have never had a genuine spam report from their users for mail sent from my systems, and on the other side the AOL mail systems themselves try to deliver spam to my servers at a very high rate.

--
Best Regards
MfG Robert Schetterer
Germany/Munich/Bavaria
Re: was: Allowing IMAP users to train spam/ham is:simplify training of misclassified emails
On 03/22/2012 07:59 AM, Robert Schetterer wrote:
> On 22.03.2012 07:51, Per-Erik Persson wrote:
>> Since we are on the subject of adding "magic links" to email header to
>> make it easier for nontech staff to report spam.
>> I don't understand how to extract the tokinzed data needed to represent
>> the specific email.
>> Have I missed some plugin that everyone else knows about?
>>
>> The rest of the problem seems trivial if you already have an
>> infrastructure deployed with SSO and a decent webinterface.
>>
>> The setup with postfix facing the world, spamassassin sanitizinging it
>> and exchange storing it is something that I see quite often nowdays.
>
> however , i have a ham/spam transport learn mail address,
> nearly null users forwards something to it, no wonder
> the false positve rate is nearly null
>
> in fact , there are systems with webmail guis for classify
> spam i.e aol, reality shows users dont use it very wise
> perhaps clicking field spam and delte are to near etc or they are simply
> dummy
>
> my conclusion dont waste your time to implement complicated mechs
> for ham/spam training, work on the tagging/rejecting side to reduce
> false positive rate

You are right about how the average user works. ("Oh, I am tired of this mailing list, let's classify it as spam since I don't know how to unsubscribe.")

However, a helpdesk or similar function often gets complaints about spam getting through, and it is virtually impossible to make most users cut and paste a full header. But pasting a single field from the header and sending it to the right helpdesk queue or a web interface is probably just the right amount of work.

I have a personal toolbox to sieve out the phishing emails (and false positives) and would like to build a closed loop for feeding SpamAssassin without having access to the original emails.
Re: was: Allowing IMAP users to train spam/ham is:simplify training of misclassified emails
Robert Schetterer wrote: >> > > however , i have a ham/spam transport learn mail address, > nearly null users forwards something to it, no wonder > the false positve rate is nearly null > > in fact , there are systems with webmail guis for classify > spam i.e aol, reality shows users dont use it very wise > perhaps clicking field spam and delte are to near etc or they are simply > dummy > > my conclusion dont waste your time to implement complicated mechs > for ham/spam training, work on the tagging/rejecting side to reduce > false positive rate > Hi I can not agree more to that ... at the end, sooner or later, you discover having spent time on something with erroneous or no return at all ... not even talking about the support-overhead this extra mboxes will create beside the obvious you already said it is still highly questionable if a "user" is able to classify reliable. also, IMO, most SPAM hits obvious account names/combinations and most user are not affected by the problem, unless their addresses are standard_names@ since years I do not care so much any more and run a pretty standard spamassassin but I query maillog for delivering attempts to not existing accounts. First I slow it down after 2 invalid destination addresses but also record the sender details and block them for three month from within access file (I run sendmail everywhere) that works so smooth for me, still with almost zero cpu overhead for spamd and it is practical, easy and cheap, the result is, before I got on certain accounts 50 SPAMS per day, now 2 maybe 3 and that numbers are for mservers with each of them having +50.000 accounts going through Hans -- XTrade Assessory International Facilitator BR - US - CA - DE - GB - RU - UK +55 (11) 4249. http://xtrade.matik.com.br
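The maillog approach Hans describes (count delivery attempts to nonexistent accounts, then block repeat offenders in the access file) could be sketched roughly as follows. This is only an illustration under assumptions: the sample log layout, the regular expression, the function names and the `Connect:` entry style are not taken from his setup, and expiring the entries after three months is left out.

```python
import re
from collections import Counter

# Assumption: rejected recipients are logged with the relay's IP in square
# brackets followed by "User unknown" on the same line.
UNKNOWN_USER = re.compile(r"relay=\S*\s*\[(?P<ip>[0-9a-fA-F.:]+)\].*User unknown")


def offenders(log_lines, threshold=2):
    """Return the relay IPs with at least `threshold` unknown-user rejections."""
    counts = Counter()
    for line in log_lines:
        match = UNKNOWN_USER.search(line)
        if match:
            counts[match.group("ip")] += 1
    return sorted(ip for ip, hits in counts.items() if hits >= threshold)


def access_entries(ips):
    """Format one access-file line per IP (hypothetical entry style)."""
    return [f"Connect:{ip}\tREJECT" for ip in ips]


sample_log = [
    "sendmail[1]: ruleset=check_rcpt, arg1=<a@example.com>, "
    "relay=h.example.net [192.0.2.10], reject=550 5.1.1 <a@example.com>... User unknown",
    "sendmail[2]: ruleset=check_rcpt, arg1=<b@example.com>, "
    "relay=h.example.net [192.0.2.10], reject=550 5.1.1 <b@example.com>... User unknown",
    "sendmail[3]: ruleset=check_rcpt, arg1=<c@example.com>, "
    "relay=x.example.org [192.0.2.20], reject=550 5.1.1 <c@example.com>... User unknown",
]
blocked = offenders(sample_log)
```

The threshold of 2 mirrors the "after 2 invalid destination addresses" figure in the message; a real deployment would also need to record when each entry was added so it can be removed again later.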
Re: was: Allowing IMAP users to train spam/ham is:simplify training of misclassified emails
On 22.03.2012 07:51, Per-Erik Persson wrote:
> Since we are on the subject of adding "magic links" to email header to
> make it easier for nontech staff to report spam.
> I don't understand how to extract the tokinzed data needed to represent
> the specific email.
> Have I missed some plugin that everyone else knows about?
>
> The rest of the problem seems trivial if you already have an
> infrastructure deployed with SSO and a decent webinterface.
>
> The setup with postfix facing the world, spamassassin sanitizinging it
> and exchange storing it is something that I see quite often nowdays.

However, I have a mail address with a ham/spam learning transport, and almost no users forward anything to it. No wonder: the false positive rate is almost zero.

In fact, there are systems with webmail GUIs for classifying spam, e.g. AOL. Reality shows that users don't use them very wisely; perhaps the spam and delete buttons are too close together, or the users are simply clueless.

My conclusion: don't waste your time implementing complicated mechanisms for ham/spam training. Work on the tagging/rejecting side to reduce the false positive rate.

--
Best Regards
MfG Robert Schetterer
Germany/Munich/Bavaria
was: Allowing IMAP users to train spam/ham is:simplify training of misclassified emails
Since we are on the subject of adding "magic links" to the email header to make it easier for non-technical staff to report spam: I don't understand how to extract the tokenized data needed to represent the specific email. Have I missed some plugin that everyone else knows about?

The rest of the problem seems trivial if you already have an infrastructure deployed with SSO and a decent web interface.

The setup with Postfix facing the world, SpamAssassin sanitizing the mail and Exchange storing it is something that I see quite often nowadays.
Re: Allowing IMAP users to train spam/ham
On 2012-03-21 13:38, Michael Scheidell wrote:
> so, what would you manually learn?

With DSPAM this is not a problem; it only needs the DSPAM signature:

internet > postfix > dspam > postfix > exchange relay transport

Now Exchange has the DSPAM signature and can report back whether the message is spam or ham. How to make that work is out of my scope :=)

It's good that there is no Exchange SMTP problem.
Re: Allowing IMAP users to train spam/ham
On Wed, 21 Mar 2012 10:41:31 -0400 Michael Scheidell wrote:

> But, what do you do about an email that was forwarded to someone else?
> And, that someone else has one of those silly anti-malware plugins
> that surfs to every url in any inbound email?

By default, our system won't allow training until the user logs in. So clicking the link takes you to an authentication screen, and the voting only happens after you log in. We provide an option to bypass this for those who are willing to risk the things you mention.

Also, if you're sending outbound mail through our system, we strip off preexisting voting links, which helps reduce the problem. And of course we use the "nofollow" attribute in the link where possible, so that search engines that index mail archives don't cause voting to happen.

Regards,
David.
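The safeguards described above (no vote without a login unless a bypass is enabled, "nofollow" links, stripping old voting links from outbound mail) might look roughly like the sketch below. It is not the proprietary implementation discussed in the thread: the HMAC-signed token, the secret, the URL layout and all function names are assumptions made for illustration.

```python
import hashlib
import hmac
import html
import re
from urllib.parse import urlencode

SECRET = b"change-me"  # hypothetical server-side signing key


def make_vote_token(message_id, verdict):
    """Sign the (message, verdict) pair so links cannot be forged or altered."""
    payload = f"{message_id}|{verdict}".encode("utf-8")
    return hmac.new(SECRET, payload, hashlib.sha256).hexdigest()


def make_vote_link(base_url, message_id, verdict):
    """Build an HTML voting link carrying rel="nofollow"."""
    query = urlencode({"id": message_id, "verdict": verdict,
                       "token": make_vote_token(message_id, verdict)})
    href = html.escape(f"{base_url}?{query}", quote=True)
    return f'<a rel="nofollow" href="{href}">Report as {verdict}</a>'


def handle_vote(message_id, verdict, token, logged_in_user, allow_anonymous=False):
    """Accept a vote only with a valid token and, by default, a logged-in user."""
    expected = make_vote_token(message_id, verdict)
    if not hmac.compare_digest(expected, token):
        return "rejected"
    if logged_in_user is None and not allow_anonymous:
        return "login-required"
    return "accepted"


def strip_vote_links(body_html, base_url):
    """Remove existing voting links, e.g. from outbound or forwarded mail."""
    pattern = re.compile(
        r'<a\b[^>]*href="' + re.escape(base_url) + r'[^"]*"[^>]*>.*?</a>',
        re.IGNORECASE | re.DOTALL)
    return pattern.sub("", body_html)
```

With the default `allow_anonymous=False`, a URL-prefetching plugin that follows the link only reaches the login step and no training happens; the bypass option corresponds to passing `allow_anonymous=True`.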
Re: Allowing IMAP users to train spam/ham
On 3/21/2012 10:41 AM, Michael Scheidell wrote: On 3/21/12 9:57 AM, Kevin A. McGrail wrote: Very elegant IMO. I'd love to look at moving some of the framework to support this into SA. Any objections? Won't be anything quick but it's a really great idea. We thought about this once. add (ie: modify body of email) with 'report spam', 'blacklist sender' links. If the links are internal (private ip's), or internally resolvable names, or names or ip's that resolve only locally or via vpn, then that might be ok. But, what do you do about an email that was forwarded to someone else? And, that someone else has one of those silly anti-malware plugins that surfs to every url in any inbound email? (or some forwarder recipient decides to click on of the links) From my perspective, the key point is solely the framework to store the Bayesian tokens from the email before delivery of the email so later, a "this is spam" "this is ham" mechanism can take advantage of that information without having the entire email. The issues you are pointing to have to deal more with the implementation of the this is spam/this is ham mechanism. Regards, KAM
Re: Allowing IMAP users to train spam/ham
On 3/21/12 9:57 AM, Kevin A. McGrail wrote:
> Very elegant IMO. I'd love to look at moving some of the framework to
> support this into SA. Any objections? Won't be anything quick but it's a
> really great idea.

We thought about this once: add 'report spam' and 'blacklist sender' links to the message (i.e. modify the body of the email). If the links are internal (private IPs), or internally resolvable names, or names or IPs that resolve only locally or via VPN, then that might be OK.

But what do you do about an email that was forwarded to someone else? And what if that someone else has one of those silly anti-malware plugins that surfs to every URL in any inbound email? (Or some forwarded recipient decides to click on one of the links.)

--
Michael Scheidell, CTO
SECNAP Network Security Corporation
Re: Allowing IMAP users to train spam/ham
On 3/21/2012 10:03 AM, David F. Skoll wrote: On Wed, 21 Mar 2012 09:57:33 -0400 "Kevin A. McGrail" wrote: [Storing Bayes tokens on the server and retrieving them when training] Very elegant IMO. I'd love to look at moving some of the framework to support this into SA. Any objections? Won't be anything quick but it's a really great idea. Feel free to use the idea. Alas, the code is proprietary and wouldn't fit well into Spamassassin anyway, so I can't contribute that back. The idea alone is good enough, thanks. I figured you had it in Can-IT so I wanted to ask. Regards, KAM
Re: Allowing IMAP users to train spam/ham
On Wed, 21 Mar 2012 09:57:33 -0400 "Kevin A. McGrail" wrote: [Storing Bayes tokens on the server and retrieving them when training] > Very elegant IMO. I'd love to look at moving some of the framework > to support this into SA. Any objections? Won't be anything quick > but it's a really great idea. Feel free to use the idea. Alas, the code is proprietary and wouldn't fit well into Spamassassin anyway, so I can't contribute that back. Regards, David.
Re: Allowing IMAP users to train spam/ham
On 3/21/2012 9:30 AM, David F. Skoll wrote: Actually, there's a third way and it's what we do (but difficult to set up with pure SpamAssassin.) We tokenize inbound messages and store the tokens on the server. In each message, we add links for doing training. When you click on a training link, the system trains the message based on the tokens stored on the server. In that way, you are training using exactly the tokens that the Bayes code saw. Regards, David. Very elegant IMO. I'd love to look at moving some of the framework to support this into SA. Any objections? Won't be anything quick but it's a really great idea.
Re: Allowing IMAP users to train spam/ham
On Wed, 21 Mar 2012 13:44:49 +0100 Matus UHLAR - fantomas wrote: > Mangling data by exchange is a big. problem when trying to filter > spam in front of it. I see two ways to avoid this problem: > - use spam server for exchange. We use one from GFI, with quite good > results. > - you can use spam filter in front of exchange, store copies on it > and learn from them. However, you will probably be the only one who > can train spamfilter in such case. Actually, there's a third way and it's what we do (but difficult to set up with pure SpamAssassin.) We tokenize inbound messages and store the tokens on the server. In each message, we add links for doing training. When you click on a training link, the system trains the message based on the tokens stored on the server. In that way, you are training using exactly the tokens that the Bayes code saw. Regards, David.
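The "third way" described above (tokenize at delivery time, store the tokens on the server, and train later from the stored tokens rather than from the possibly mangled message) can be illustrated with a toy sketch. Everything here is an assumption for illustration: the in-memory store, the trivial tokenizer and the counting "Bayes" class stand in for real components and are not SpamAssassin or CanIT code.

```python
import re
from collections import Counter


class TokenStore:
    """In-memory stand-in for server-side token storage, keyed by message ID."""

    def __init__(self):
        self._tokens = {}

    def save(self, message_id, tokens):
        self._tokens[message_id] = list(tokens)

    def load(self, message_id):
        return self._tokens[message_id]


def tokenize(text):
    # Toy tokenizer: lowercase alphanumeric runs of 3+ characters.
    # A real Bayes tokenizer also handles headers, URIs and much more.
    return re.findall(r"[a-z0-9]{3,}", text.lower())


class ToyBayes:
    """Counts in how many trained messages each token appeared."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()

    def train(self, tokens, is_spam):
        target = self.spam if is_spam else self.ham
        target.update(set(tokens))


def on_delivery(store, message_id, body):
    """At scan time: remember exactly the tokens the classifier saw."""
    tokens = tokenize(body)
    store.save(message_id, tokens)
    return tokens


def on_vote(store, bayes, message_id, is_spam):
    """At training time: the message itself is not needed, only its ID."""
    bayes.train(store.load(message_id), is_spam)


store, bayes = TokenStore(), ToyBayes()
on_delivery(store, "<1@example>", "Cheap pills, cheap watches")
on_vote(store, bayes, "<1@example>", is_spam=True)
```

Because training only needs the message ID, whatever Exchange or a mail client later does to the headers and body has no effect on what is learned.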
Re: Allowing IMAP users to train spam/ham
On Wed, 21 Mar 2012 10:06:58 +0100 Matus UHLAR - fantomas wrote: > >On Fri, 9 Mar 2012 16:38:49 +0100 > >Matus UHLAR - fantomas wrote: > >No, it isn't. Bayes is a statistical filter it needs to learn a lot > >of diverse spam and ham to reach it's optimum accuracy. It's been > >demonstrated on Bogofilter that "train-on-everything" outperforms > >"train-on-error" on the same corpora. They both end-up with similar > >accuracy, but "train-on-everything" gets there very much faster. > >Bogofilter is almost identical to BAYES; they just differ in the > >details of the tokenizer and the Robinson parameters. > > > >Training on SA miss-classification is going to be glacially slow. > > there are two problems when requiring users to manually learn on > everythhing. I'm not advocating that users be forced to do anything, my preference is to allow them to choose what they want to train on. Whether or not your script chooses to learn everything they submit is it different matter. > - it's more work to implement In general it's easier to implement explicit learn-spam and learn-ham folders than it is to keep track of what is moved in and out of a spam folder. > - it's more work for users to do the training. Not really, If they choose to learn just the spamassassin miss-classifications it's the same work, but they have option to learn more - in particular important ham. Personally, if I saw that important mail was hitting BAYES_50, I'd feel pretty frustated sitting around waiting for FPs to train Bayes, knowing that those FPs are avoidable. > Note that the main goal of spam filters is to save people some work, > not to give it to them. The users will want to to the "train only on > misfires", and the sooner they get there, the better. On Wed, 21 Mar 2012 08:38:24 -0400 Michael Scheidell wrote: > On 3/21/12 5:06 AM, Matus UHLAR - fantomas wrote: > > there are two problems when requiring users to manually learn on > > everythhing. 
> > - it's more work to implement > > - it's more work for users to do the training. > and, if 95% of the users are using microsoft exchange, exchange will > horribly mangle the headers, and the body, even changing the actual > encoding. > so, what would you manually learn? That applies to any form of manual user training, so it's a different issue. I don't know the details of what exchange does, but I suspect it matters less than you think because most of the information used by Bayes is in normalized form.
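To illustrate why explicit learn-spam and learn-ham folders are simple to implement: a periodic job only has to hand each folder to `sa-learn` with the matching `--spam` or `--ham` switch. The sketch below assumes hypothetical folder paths and function names, and it does nothing to avoid re-reading messages on later runs.

```python
import subprocess


def sa_learn_command(kind, folder):
    """Build the sa-learn invocation for one training folder."""
    if kind not in ("spam", "ham"):
        raise ValueError(f"unknown training class: {kind}")
    return ["sa-learn", f"--{kind}", folder]


def learn_folders(folders, run=subprocess.run):
    """Run sa-learn once per folder; `run` is injectable for dry runs."""
    for kind, folder in folders.items():
        run(sa_learn_command(kind, folder), check=True)


# Hypothetical Maildir layout for one user.
user_folders = {
    "spam": "/home/alice/Maildir/.learn-spam/cur",
    "ham": "/home/alice/Maildir/.learn-ham/cur",
}
```

Calling `learn_folders(user_folders)` would execute both commands; passing a different `run` callable makes it possible to log or inspect the commands instead.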
Re: Allowing IMAP users to train spam/ham
On 3/21/12 5:06 AM, Matus UHLAR - fantomas wrote: there are two problems when requiring users to manually learn on everythhing. - it's more work to implement - it's more work for users to do the training. On 21.03.12 08:38, Michael Scheidell wrote: and, if 95% of the users are using microsoft exchange, exchange will horribly mangle the headers, and the body, even changing the actual encoding. so, what would you manually learn? Mangling data by exchange is a big. problem when trying to filter spam in front of it. I see two ways to avoid this problem: - use spam server for exchange. We use one from GFI, with quite good results. - you can use spam filter in front of exchange, store copies on it and learn from them. However, you will probably be the only one who can train spamfilter in such case. you actually _can_ train from messages that went through exchange, but mangling by exchange will somehow blur the results and lower bayes accuracy. -- Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/ Warning: I wish NOT to receive e-mail advertising to this address. Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu. BSE = Mad Cow Desease ... BSA = Mad Software Producents Desease
Re: Allowing IMAP users to train spam/ham
On 3/21/12 5:06 AM, Matus UHLAR - fantomas wrote:
> there are two problems when requiring users to manually learn on
> everythhing.
> - it's more work to implement
> - it's more work for users to do the training.

And if 95% of the users are using Microsoft Exchange, Exchange will horribly mangle the headers and the body, even changing the actual encoding.

So, what would you manually learn?

--
Michael Scheidell, CTO
SECNAP Network Security Corporation
Re: Allowing IMAP users to train spam/ham
On Fri, 9 Mar 2012 16:38:49 +0100 Matus UHLAR - fantomas wrote: You can of course configure mailer to train automatically on anything received/delivered. However this would apparently cause much more FP's and FN's rate than letting user train only those that misfire. On 10.03.12 00:07, RW wrote: The use of the word "apparently" never inspires much confidence. I'm guessing that you don't have any real evidence. No, I don't have evidence from comparing between long-time running autolearn versus manual learning. However cases were mentioned here on the list where people complained about autolearn going well when no manual traing was used. >If you're going to train on error then train on the right error, not >a rarer, correlated error. The only error that really matters is the one that causes misfiring. No, it isn't. Bayes is a statistical filter it needs to learn a lot of diverse spam and ham to reach it's optimum accuracy. It's been demonstrated on Bogofilter that "train-on-everything" outperforms "train-on-error" on the same corpora. They both end-up with similar accuracy, but "train-on-everything" gets there very much faster. Bogofilter is almost identical to BAYES; they just differ in the details of the tokenizer and the Robinson parameters. Training on SA miss-classification is going to be glacially slow. there are two problems when requiring users to manually learn on everythhing. - it's more work to implement - it's more work for users to do the training. Note that the main goal of spam filters is to save people some work, not to give it to them. The users will want to to the "train only on misfires", and the sooner they get there, the better. Maybe relaxing the autolearn rules until number of hams and spams will cross the required values would help us. -- Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/ Warning: I wish NOT to receive e-mail advertising to this address. Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu. 
Emacs is a complicated operating system without good text editor.
Re: Allowing IMAP users to train spam/ham
On Sun, 11 Mar 2012 13:56:52 -0600 LuKreme wrote:

> On 09 Mar 2012, at 17:07 , RW wrote:
>
> > It's been demonstrated on Bogofilter that "train-on-everything"
> > outperforms "train-on-error" on the same corpora. They both end-up
> > with similar accuracy, but "train-on-everything" gets there very
> > much faster.
>
> But training is exceedingly slow. Under normal load, sa-learn putters
> along at 2.5-4 mesg/sec, and under load it can drop to under 1.
>
> Now, sure, perhaps I should throw a quad core i7 at it, but REALLY?

You're missing the point. What I'm saying is that train-on-error is not more accurate than train-on-everything, and that training on SpamAssassin errors is going to be worse, not the optimal method as was claimed. If you want to trade accuracy for cost, that's fine as long as you're clear about it, but it shouldn't be dressed up as a better way to learn.

I'm not saying everything needs to be learned. In general, training on spam that doesn't hit BAYES_99 and ham that doesn't hit BAYES_00 is a reasonable compromise. The big problem with only training on full SpamAssassin errors is that a failure to properly classify ham will rarely be corrected.
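The compromise suggested above (train on spam that doesn't hit BAYES_99 and on ham that doesn't hit BAYES_00) amounts to a small decision rule. A minimal sketch, assuming the list of rules hit is read from an unfolded `X-Spam-Status` header value; the function names and the header parsing are illustrative, not part of SpamAssassin.

```python
import re


def rules_from_status(status_header):
    """Extract the rule names from an X-Spam-Status value.

    Assumption: the header is unfolded, so the tests= list contains
    no whitespace.
    """
    match = re.search(r"tests=(\S*)", status_header)
    if not match or not match.group(1):
        return set()
    return set(match.group(1).split(","))


def should_train(true_class, rules_hit):
    """Train unless Bayes is already as confident as it can be, and right."""
    if true_class == "spam":
        return "BAYES_99" not in rules_hit
    if true_class == "ham":
        return "BAYES_00" not in rules_hit
    raise ValueError(f"unknown class: {true_class}")


# A legitimate mail that hit BAYES_99 but stayed under the threshold:
status = "No, score=4.9 required=5.0 tests=BAYES_99,HTML_MESSAGE autolearn=no"
train_it = should_train("ham", rules_from_status(status))
```

Note that the rule looks only at the Bayes result, not at the overall SpamAssassin verdict, so the ham in the example is learned even though it was delivered correctly.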
Re: Allowing IMAP users to train spam/ham
On 09 Mar 2012, at 17:07 , RW wrote:
> It's been demonstrated on Bogofilter that "train-on-everything" outperforms
> "train-on-error" on the same corpora. They both end-up with similar accuracy,
> but "train-on-everything" gets there very much faster.

But training is exceedingly slow. Under normal load, sa-learn putters along at 2.5-4 mesg/sec, and under load it can drop to under 1.

Now, sure, perhaps I should throw a quad core i7 at it, but REALLY?

--
I NO LONGER WANT MY MTV
Bart chalkboard Ep. 3G02
Re: Allowing IMAP users to train spam/ham
On Fri, 9 Mar 2012 16:38:49 +0100 Matus UHLAR - fantomas wrote: > You can of course configure mailer to train automatically on anything > received/delivered. However this would apparently cause much more > FP's and FN's rate than letting user train only those that misfire. The use of the word "apparently" never inspires much confidence. I'm guessing that you don't have any real evidence. > >If you're going to train on error then train on the right error, not > >a rarer, correlated error. > > The only error that really matters is the one that causes misfiring. No, it isn't. Bayes is a statistical filter it needs to learn a lot of diverse spam and ham to reach it's optimum accuracy. It's been demonstrated on Bogofilter that "train-on-everything" outperforms "train-on-error" on the same corpora. They both end-up with similar accuracy, but "train-on-everything" gets there very much faster. Bogofilter is almost identical to BAYES; they just differ in the details of the tokenizer and the Robinson parameters. Training on SA miss-classification is going to be glacially slow.
Re: Allowing IMAP users to train spam/ham
On Fri, 9 Mar 2012 08:38:21 +0100 Matus UHLAR - fantomas wrote: >> On 05.03.12 12:15, RW wrote: >> >I don't like it. It relies on FPs being removed from the SPAM >> >folder rather than spam being sent to a learn-spam folder. >On Wed, 7 Mar 2012 15:35:05 +0100 >Matus UHLAR - fantomas wrote: >> Pardon me, but: >> >> Usage for end users >> >> *move mail into SPAM folder to classify as spam >> *move mail out of SPAM folder to classify as not spam >> >> isn't the former what you want? On 07.03.12 21:44, RW wrote: >I'm more concerned about what happens to the mail that isn't moved. apparently nothing, because it is assumed to be correctly evaluated. On 09.03.12 14:13, RW wrote: So are you saying that a legitimate mail that hits BAYES_99 and scores 4.9 isn't worth learning as ham because it's correctly evaluated. It's easier - it takes less CPU time and users' effort. It's alsu MUCH more important to train FPs then train all. >I think positive training is better than supervised autolearning those above clearly indicate postive and negative trainin, or do you have different informations? When I first looked at it, it retrained on errors, with DSPAM autotraining on everything. It probably does support train-on-error, but IMO it would be inappropriate to train Bayes that way. You can of course configure mailer to train automatically on anything received/delivered. However this would apparently cause much more FP's and FN's rate than letting user train only those that misfire. >The scheme might work well for pure train-on-error, but that's not >really practical on Spamassassin where the classification is >distinct from the Bayes result. pardon? If you're going to train on error then train on the right error, not a rarer, correlated error. The only error that really matters is the one that causes misfiring. The FP/FN rate based on the SA classification isn't anywhere near high enough to train BAYES. 
If a user receives 10 legitimate mails a day and SA works at its target FP rate of 1 in 2500, it would take over 100 years for Bayes to even turn-on. with FP rate of 1 in 2500, it will not matter that much :-) But yes, this is one of weaknesses of bayes system. It requires much mail to start firing. However you can lower both bayes_min_ham_num and bayes_min_spam_num and they will start hitting sooner. You can also modify autolearning scores although. -- Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/ Warning: I wish NOT to receive e-mail advertising to this address. Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu. "The box said 'Requires Windows 95 or better', so I bought a Macintosh".
Re: Allowing IMAP users to train spam/ham
On Fri, 9 Mar 2012 08:38:21 +0100 Matus UHLAR - fantomas wrote: > >> On 05.03.12 12:15, RW wrote: > >> >I don't like it. It relies on FPs being removed from the SPAM > >> >folder rather than spam being sent to a learn-spam folder. > > >On Wed, 7 Mar 2012 15:35:05 +0100 > >Matus UHLAR - fantomas wrote: > >> Pardon me, but: > >> > >> Usage for end users > >> > >> *move mail into SPAM folder to classify as spam > >> *move mail out of SPAM folder to classify as not spam > >> > >> isn't the former what you want? > > On 07.03.12 21:44, RW wrote: > >I'm more concerned about what happens to the mail that isn't moved. > > apparently nothing, because it is assumed to be correctly evaluated. So are you saying that a legitimate mail that hits BAYES_99 and scores 4.9 isn't worth learning as ham because it's correctly evaluated. > > >I think positive training is better than supervised autolearning > > those above clearly indicate postive and negative trainin, or do you > have different informations? When I first looked at it, it retrained on errors, with DSPAM autotraining on everything. It probably does support train-on-error, but IMO it would be inappropriate to train Bayes that way. > >The scheme might work well for pure train-on-error, but that's not > >really practical on Spamassassin where the classification is > >distinct from the Bayes result. > > pardon? If you're going to train on error then train on the right error, not a rarer, correlated error. The FP/FN rate based on the SA classification isn't anywhere near high enough to train BAYES. If a user receives 10 legitimate mails a day and SA works at its target FP rate of 1 in 2500, it would take over 100 years for Bayes to even turn-on.
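The "over 100 years" figure can be checked with a little arithmetic. The sketch below assumes the stock `bayes_min_ham_num` value of 200 (the message does not state which threshold it used) and that ham is learned only from false positives.

```python
def years_until_bayes_enabled(ham_per_day, fp_one_in, min_ham=200):
    """Years of FP-only ham training needed before Bayes starts scoring."""
    days_per_fp = fp_one_in / ham_per_day   # days between false positives
    days_needed = min_ham * days_per_fp     # days to collect min_ham FPs
    return days_per_fp, days_needed, days_needed / 365.25


days_per_fp, days_needed, years = years_until_bayes_enabled(10, 2500)
print(f"one FP every {days_per_fp:.0f} days, "
      f"{days_needed:.0f} days in total, about {years:.0f} years")
```

At 10 legitimate mails a day and one false positive per 2500, a user sees one false positive roughly every 250 days, so collecting 200 of them takes 50,000 days.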
Re: Allowing IMAP users to train spam/ham
>> On 05.03.12 12:15, RW wrote:
>> >I don't like it. It relies on FPs being removed from the SPAM folder
>> >rather than spam being sent to a learn-spam folder.

>On Wed, 7 Mar 2012 15:35:05 +0100 Matus UHLAR - fantomas wrote:
>> Pardon me, but:
>>
>> Usage for end users
>>
>> *move mail into SPAM folder to classify as spam
>> *move mail out of SPAM folder to classify as not spam
>>
>> isn't the former what you want?

On 07.03.12 21:44, RW wrote:
>I'm more concerned about what happens to the mail that isn't moved.

apparently nothing, because it is assumed to be correctly evaluated.

>I think positive training is better than supervised autolearning

those above clearly indicate positive and negative training, or do you have different information?

>The scheme might work well for pure train-on-error, but that's not
>really practical on Spamassassin where the classification is distinct
>from the Bayes result.

pardon?

-- 
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
Boost your system's speed by 500% - DEL C:\WINDOWS\*.*
Re: Allowing IMAP users to train spam/ham
On Wed, 7 Mar 2012 15:35:05 +0100 Matus UHLAR - fantomas wrote:

> On 05.03.12 12:15, RW wrote:
> >I don't like it. It relies on FPs being removed from the SPAM folder
> >rather than spam being sent to a learn-spam folder.
>
> Pardon me, but:
>
> Usage for end users
>
> *move mail into SPAM folder to classify as spam
> *move mail out of SPAM folder to classify as not spam
>
> isn't the former what you want?

I'm more concerned about what happens to the mail that isn't moved.

I think positive training is better than supervised autolearning.

The scheme might work well for pure train-on-error, but that's not really practical on Spamassassin where the classification is distinct from the Bayes result.
Re: Allowing IMAP users to train spam/ham
>> On 04.03.12 14:02, RW wrote:
>> >An alternative would be to be more selective. I'm not sure if this is
>> >specific to dovecot but when I copy/move a file in IMAP the new
>> >maildir file has the same mtime, but a new epoch time in the file
>> >name. What you might do is generate a list of filenames that contain
>> >an epoch time later than the start of the previous run and symlink
>> >them into a temporary directory, and then learn that.

>On Mon, 5 Mar 2012 10:54:22 +0100 Matus UHLAR - fantomas wrote:
>> afaik, dovecot itself has a plugin to learn spam/ham:
>>
>> http://johannes.sipsolutions.net/Projects/dovecot-antispam

On 05.03.12 12:15, RW wrote:
>I don't like it. It relies on FPs being removed from the SPAM folder
>rather than spam being sent to a learn-spam folder.

Pardon me, but:

Usage for end users

*move mail into SPAM folder to classify as spam
*move mail out of SPAM folder to classify as not spam

isn't the former what you want?

-- 
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
Micro$oft random number generator: 0, 0, 0, 4.33e+67, 0, 0, 0...
Re: Allowing IMAP users to train spam/ham
Hi,

do you have per-virtual-user Bayes training, or sitewide? I ask because I have a setup like yours and everything works fine!

In my setup, users move FNs into the spam folder by hand and move FPs from the spam folder back to the inbox. When they do, a script copies the message to a sitewide spam or ham folder accordingly. A nightly script then learns from the sitewide spam and ham folders. Afterwards the messages are deleted from the sitewide folders but not from the users' folders, so users can do what they want with their own copies (delete them or not).

Webmail plugins are available to do that work! The copies can also be made over IMAP instead of with filesystem-level access.

Cheers

2012/3/4 LuKreme:
> I used to have a setup where IMAP users could put mail into either SPAM or
> Junk mailboxes to have it auto trained and then I had a script that stepped
> through and did the training, and it also processed non-new mail in the inbox
> as ham.
>
> USERROOT="$HOME"
> MAILP="Maildir"
>
> J_PATH="$USERROOT/$MAILP/.Junk"
> S_PATH="$USERROOT/$MAILP/.SPAM"
> H_PATH="$USERROOT/$MAILP/cur"
>
> # Train both junk folders as spam (new and cur).
> if [ -d "$J_PATH" ]; then
>     /usr/local/bin/sa-learn --spam --progress "$J_PATH"/{new,cur}
> fi
>
> if [ -d "$S_PATH" ]; then
>     /usr/local/bin/sa-learn --spam --progress "$S_PATH"/{new,cur}
> fi
>
> # Train already-read inbox mail as ham.
> if [ -d "$H_PATH" ]; then
>     /usr/local/bin/sa-learn --ham "$H_PATH"
> fi
>
> This all worked fine, but it was very resource intensive, and it only worked
> with the very few shell users. I tried to run it (manually) a few times with
> the virtual users, but I ended up with a process that ground the computer to
> a halt and generated a bayes database that was massively large (GBs).
>
> So, other than throwing more iron at the problem, is there something I can do
> to make this process a little smarter? Make it work with the virtual users
> without generating a massive db file?
>
> --
> 'What can I do? I'm only human,' he said aloud. Someone said, Not all
> of you. --Pyramids
>
Re: Allowing IMAP users to train spam/ham
On 05.03.2012 13:15, RW wrote:
> On Mon, 5 Mar 2012 10:54:22 +0100
> Matus UHLAR - fantomas wrote:
>
>> On 04.03.12 14:02, RW wrote:
>>> An alternative would be to be more selective. I'm not sure if this is
>>> specific to dovecot but when I copy/move a file in IMAP the new
>>> maildir file has the same mtime, but a new epoch time in the file
>>> name. What you might do is generate a list of filenames that contain
>>> an epoch time later than the start of the previous run and symlink
>>> them into a temporary directory, and then learn that.
>>
>> afaik, dovecot itself has a plugin to learn spam/ham:
>>
>> http://johannes.sipsolutions.net/Projects/dovecot-antispam
>
> I don't like it. It relies on FPs being removed from the SPAM folder
> rather than spam being sent to a learn-spam folder.
>

I use a spam/ham forward email transport, something like the ones here:

http://patrick-wessel.de/projektlinuxserver/spamtraining-mit-perl/
http://www.localside.net/sal-wrapper/

but to be honest, it's not widely used or needed.

-- 
Best Regards
MfG Robert Schetterer
Germany/Munich/Bavaria
Re: Allowing IMAP users to train spam/ham
On Mon, 5 Mar 2012 10:54:22 +0100 Matus UHLAR - fantomas wrote:

> On 04.03.12 14:02, RW wrote:
> >An alternative would be to be more selective. I'm not sure if this is
> >specific to dovecot but when I copy/move a file in IMAP the new
> >maildir file has the same mtime, but a new epoch time in the file
> >name. What you might do is generate a list of filenames that contain
> >an epoch time later than the start of the previous run and symlink
> >them into a temporary directory, and then learn that.
>
> afaik, dovecot itself has a plugin to learn spam/ham:
>
> http://johannes.sipsolutions.net/Projects/dovecot-antispam

I don't like it. It relies on FPs being removed from the SPAM folder rather than spam being sent to a learn-spam folder.
Re: Allowing IMAP users to train spam/ham
>LuKreme wrote:
>> I used to have a setup where IMAP users could put mail into either SPAM or
>> Junk mailboxes to have it auto trained and then I had a script that stepped
>> through and did the training, and it also processed non-new mail in the inbox
>> as ham.

On 04.03.12 07:55, xTrade Assessory wrote:
>what do you think of something less complex?
>
>you need but probably have autolearn enabled

I guess you mean "you probably need autolearn enabled". One of autolearn's problems is that if it starts misfiring, it will misfire more and more... The manual part is what is needed to prevent this - mostly the incorrectly classified mail needs to be learned.

>I offer the users a mailbox where they can drop/move any message they
>think is spam, which obviously was not processed by SpamAssassin and
>classified as such
>
>in my case the folder's name is X-SPAM
>
>this extra folder is necessary because what is in SPAM already is
>supposed to be SPAM

correct.

>I don't know if it is a good idea running sa-learn on new msgs without
>knowing what it is

It surely is not; however, re-learning those as spam will fix that.

>Also, choose your users well, so that they do not throw everything into
>this folder
>
>then you run a script from cron once a day like this
>
>###
>#!/bin/sh
># find each user's X-SPAM mbox (find's -name is case-sensitive)
>folders=`/usr/bin/find /home/ -maxdepth 2 -type f -name X-SPAM -print`
>for folder in $folders; do
>    /usr/local/bin/sa-learn --spam --mbox "$folder"
>done

I think it would be wise to move messages away after learning.

-- 
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
- Holmes, what kind of school did you study to be a detective?
- Elementary, Watson.
  -- Daffy Duck & Porky Pig
Re: Allowing IMAP users to train spam/ham
On 04.03.12 14:02, RW wrote:
>An alternative would be to be more selective. I'm not sure if this is
>specific to dovecot but when I copy/move a file in IMAP the new
>maildir file has the same mtime, but a new epoch time in the file
>name. What you might do is generate a list of filenames that contain
>an epoch time later than the start of the previous run and symlink
>them into a temporary directory, and then learn that.

afaik, dovecot itself has a plugin to learn spam/ham:

http://johannes.sipsolutions.net/Projects/dovecot-antispam

-- 
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
- Have you got anything without Spam in it?
- Well, there's Spam egg sausage and Spam, that's not got much Spam in it.
Re: Allowing IMAP users to train spam/ham
On 2012/03/04 11:57, John Hardin wrote:
> On Sun, 4 Mar 2012, jdow wrote:
>
>> On 2012/03/04 10:30, LuKreme wrote:
>>> On 04 Mar 2012, at 05:36 , xTrade Assessory wrote:
>>> > question is if necessary ...
>>>
>>> Being able to train mis-tagged spam is necessary, yes. I don’t see
>>> any way to process a message in a maildir and then move that message.
>>> How would you do it?
>>
>> bash script with for each on the directory. Train then delete each file in
>> sequence.
>
> I'd suggest that it's a bad idea to delete your training corpus.

And the messages would be good training. However, privacy concerns may require it be deleted. If not, mv works as well as rm.

{^_-}
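The "train, then mv instead of rm" loop discussed above could look roughly like this. It is only a sketch: the function name, the SA_LEARN override variable and the example paths are made up for illustration, and it assumes sa-learn exits 0 when it has processed the message.

```shell
#!/bin/sh
# SA_LEARN is a hypothetical override so the command can be swapped or stubbed.
SA_LEARN=${SA_LEARN:-/usr/local/bin/sa-learn}

# Learn each message in a directory, then move it to an archive directory
# so the training corpus is kept rather than deleted.
train_and_archive() {
    kind=$1      # "spam" or "ham"
    src=$2       # e.g. "$HOME/Maildir/.Junk/cur" (example path)
    archive=$3   # where trained messages are kept

    mkdir -p "$archive" || return 1
    for msg in "$src"/*; do
        [ -f "$msg" ] || continue      # also skips the unexpanded glob
        if "$SA_LEARN" "--$kind" "$msg" >/dev/null; then
            mv "$msg" "$archive"/      # swap in rm here if privacy requires it
        fi
    done
}

# Example call (paths are illustrative only):
#   train_and_archive spam "$HOME/Maildir/.Junk/cur" "$HOME/corpus/spam"
```

A message that fails to train stays where it is, so the next run retries it. Starting sa-learn once per message is slow on big folders; learning the whole directory in one call and then moving everything is the cheaper variant.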
Re: Allowing IMAP users to train spam/ham
On 4.3.2012 22:44, LuKreme wrote:
> Trouble with simply moving the messages about in the shell between Maildirs
> is that the courier files don’t get updated properly.
>

I move my files all the time, and no problems have occurred so far. I use Courier too...

-- 
Things past redress and now with me past care.
  -- William Shakespeare, "Richard II"
Re: Allowing IMAP users to train spam/ham
On 04 Mar 2012, at 12:57 , John Hardin wrote:
> On Sun, 4 Mar 2012, jdow wrote:
>
>> On 2012/03/04 10:30, LuKreme wrote:
>>> On 04 Mar 2012, at 05:36 , xTrade Assessory wrote:
>>> > question is if necessary ...
>>>
>>> Being able to train mis-tagged spam is necessary, yes. I don’t see
>>> any way to process a message in a maildir and then move that message.
>>> How would you do it?
>>
>> bash script with for each on the directory. Train then delete each file in
>> sequence.
>
> I'd suggest that it's a bad idea to delete your training corpus.

Yeah, I never said anything about deleting.

Trouble with simply moving the messages about in the shell between Maildirs is that the courier files don’t get updated properly.

-- 
Criticizing evolutionary theory because Darwin was limited is like claiming computers don't work because Chuck Babbage didn't foresee Duke Nukem 3.
Re: Allowing IMAP users to train spam/ham
On 4.3.2012 20:49, jdow wrote:
> On 2012/03/04 10:30, LuKreme wrote:
>> On 04 Mar 2012, at 05:36 , xTrade Assessory wrote:
>>> question is if necessary ...
>>
>> Being able to train mis-tagged spam is necessary, yes. I don’t see
>> any way to process a message in a maildir and then move that message.
>> How would you do it?
>
> bash script with for each on the directory. Train then delete each file in
> sequence.
>

If doing this, training via spamc would be good. And spamd must be started with --allow-tell to make this work.

-- 
Today is the first day of the rest of the mess.
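A minimal sketch of the spamc route, assuming spamd is already running with --allow-tell and is reachable with spamc's default settings. The function name and the SPAMC override variable are made up for illustration; -L is spamc's learn-type option.

```shell
#!/bin/sh
# SPAMC is a hypothetical override so the client can be swapped or stubbed.
SPAMC=${SPAMC:-spamc}

# Feed every message in a directory to spamd for learning and print how
# many messages were submitted.
learn_via_spamc() {
    kind=$1   # "spam" or "ham"
    dir=$2    # e.g. "$HOME/Maildir/.Junk/cur" (example path)
    count=0
    for msg in "$dir"/*; do
        [ -f "$msg" ] || continue
        # spamc's exit status is deliberately ignored in this sketch; check
        # the spamc man page for the codes it returns when learning.
        "$SPAMC" -L "$kind" < "$msg" >/dev/null 2>&1
        count=$((count + 1))
    done
    echo "$count"
}

# Example call (path is illustrative only):
#   learn_via_spamc spam "$HOME/Maildir/.Junk/cur"
```

The point of going through spamc is that the already-running spamd does the work, so there is no per-message startup cost as there is when sa-learn is launched once per file.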
Re: Allowing IMAP users to train spam/ham
On Sun, 4 Mar 2012, jdow wrote:

> On 2012/03/04 10:30, LuKreme wrote:
>> On 04 Mar 2012, at 05:36 , xTrade Assessory wrote:
>> > question is if necessary ...
>>
>> Being able to train mis-tagged spam is necessary, yes. I don’t see
>> any way to process a message in a maildir and then move that message.
>> How would you do it?
>
> bash script with for each on the directory. Train then delete each file in
> sequence.

I'd suggest that it's a bad idea to delete your training corpus.

-- 
 John Hardin KA7OHZ    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174    pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
---
 Failure to plan ahead on someone else's part does not constitute
 an emergency on my part. -- David W. Barts in a.s.r
---
 7 days until Daylight Saving Time begins in U.S. - Spring Forward
Re: Allowing IMAP users to train spam/ham
On 2012/03/04 10:30, LuKreme wrote:
> On 04 Mar 2012, at 05:36 , xTrade Assessory wrote:
>> question is if necessary ...
>
> Being able to train mis-tagged spam is necessary, yes. I don’t see
> any way to process a message in a maildir and then move that message.
> How would you do it?

bash script with for each on the directory. Train then delete each file in sequence.

{^_^}
Re: Allowing IMAP users to train spam/ham
On 04 Mar 2012, at 05:36 , xTrade Assessory wrote:
> question is if necessary ...

Being able to train mis-tagged spam is necessary, yes. I don’t see any way to process a message in a maildir and then move that message. How would you do it?

-- 
Lister: What d'ya think of Betty?
Cat: Betty Rubble? Well, I would go with Betty... but I'd be thinking of Wilma.
Lister: This is crazy. Why are we talking about going to bed with Wilma Flintstone?
Cat: You're right. We're nuts. This is an insane conversation.
Lister: She'll never leave Fred, and we know it.
Re: Allowing IMAP users to train spam/ham
On Sun, 04 Mar 2012 09:36:25 -0300 xTrade Assessory wrote:

> LuKreme wrote:
> > On 04 Mar 2012, at 03:55 , xTrade Assessory wrote:
> >
> >> what do you think of something less complex?
> > Yeah, I went with Junk/NotJunk, anything placed in Junk gets
> > trained as spam, anything in NotJunk trained as ham. What I’d like
> > to do though is move the messages that are in NotJunk to the inbox
> > maildir as they are processed.

That's similar to what I do and some ESPs like Tuffmail do.

An alternative would be to be more selective. I'm not sure if this is specific to dovecot, but when I copy/move a file in IMAP the new maildir file has the same mtime, but a new epoch time in the file name. What you might do is generate a list of filenames that contain an epoch time later than the start of the previous run, symlink them into a temporary directory, and then learn that.

> if you have bayes already active as well as autolearn then why should
> you run all this again, all the more since manual work may not be
> accurate? or do you read all these msgs to be sure they are ham/spam?

Because autolearn is better than nothing, but isn't very good. It only learns the spam that's easily caught, it's very poor at capturing a representative selection of ham without mislearning, and it won't train actual errors where BAYES has generated a point or more in the wrong direction.
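The epoch-in-filename idea above could be sketched like this. It assumes maildir file names of the form <epoch>.<unique>.<host>..., which is what the message describes for dovecot; the function names and example paths are made up, and whether sa-learn follows symlinks inside a directory is worth verifying on your version before relying on it.

```shell
#!/bin/sh
# Print the files in the given directories whose name starts with an epoch
# time later than LAST_RUN (first argument).
select_new() {
    last_run=$1; shift
    for dir in "$@"; do
        for msg in "$dir"/*; do
            [ -f "$msg" ] || continue
            name=${msg##*/}      # strip the directory part
            epoch=${name%%.*}    # text before the first dot
            case $epoch in
                ''|*[!0-9]*) continue ;;   # not a plain number: skip
            esac
            if [ "$epoch" -gt "$last_run" ]; then
                printf '%s\n' "$msg"
            fi
        done
    done
}

# Symlink the selected files into a scratch directory for a later learn run.
# Pass absolute directory paths, otherwise the links will dangle.
link_new() {
    last_run=$1; scratch=$2; shift 2
    mkdir -p "$scratch" || return 1
    select_new "$last_run" "$@" | while IFS= read -r msg; do
        ln -s "$msg" "$scratch"/
    done
}

# Example (paths and the stored timestamp file are illustrative only):
#   link_new "$(cat /var/tmp/last-learn)" /tmp/learn.$$ "$HOME/Maildir/.Junk/cur"
#   sa-learn --spam /tmp/learn.$$
```

Only messages that arrived in the folder since the last run get learned, which avoids re-reading the whole folder every night.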
Re: Allowing IMAP users to train spam/ham
LuKreme wrote:
> On 04 Mar 2012, at 03:55 , xTrade Assessory wrote:
>
>> what do you think of something less complex?
> Yeah, I went with Junk/NotJunk, anything placed in Junk gets trained as spam,
> anything in NotJunk trained as ham. What I’d like to do though is move the
> messages that are in NotJunk to the inbox maildir as they are processed.
>
> Possible?
>

everything is possible :)

question is if necessary ...

if you have bayes already active as well as autolearn then why should you run all this again, all the more since manual work may not be accurate? or do you read all these msgs to be sure they are ham/spam?

I understand that sa-learn should be used only on content you are sure is ham/spam, which is difficult unless you trust yourself and use it only on your own mailbox :)

I use it because sometimes you get commercial messages which technically are not spam and even have correct auth headers and everything, but to me they are SPAM because I do not want to receive some kind of offer every day. Those I can pipe into sa-learn so they land in the spam folder the next time they come ...

but that is only my opinion

Hans

-- 
XTrade Assessory International Facilitator BR - US - CA - DE - GB - RU - UK +55 (11) 4249. http://xtrade.matik.com.br
Re: Allowing IMAP users to train spam/ham
On 04 Mar 2012, at 03:55 , xTrade Assessory wrote:
> what do you think of something less complex?

Yeah, I went with Junk/NotJunk, anything placed in Junk gets trained as spam, anything in NotJunk trained as ham. What I’d like to do though is move the messages that are in NotJunk to the inbox maildir as they are processed.

Possible?

-- 
Belief is one of the most powerful organic forces in the multiverse. It may not be able to move mountains, exactly. But it can create someone who can.
Re: Allowing IMAP users to train spam/ham
LuKreme wrote:
> I used to have a setup where IMAP users could put mail into either SPAM or
> Junk mailboxes to have it auto trained and then I had a script that stepped
> through and did the training, and it also processed non-new mail in the inbox
> as ham.

Hi

what do you think of something less complex?

you need but probably have autolearn enabled

I offer the users a mailbox where they can drop/move any message they think is spam, which obviously was not processed by SpamAssassin and classified as such

in my case the folder's name is X-SPAM

this extra folder is necessary because what is in SPAM already is supposed to be SPAM

I don't know if it is a good idea running sa-learn on new msgs without knowing what it is

Also, choose your users well, so that they do not throw everything into this folder

then you run a script from cron once a day like this

###
#!/bin/sh
# find each user's X-SPAM mbox (find's -name is case-sensitive)
folders=`/usr/bin/find /home/ -maxdepth 2 -type f -name X-SPAM -print`
for folder in $folders; do
    /usr/local/bin/sa-learn --spam --mbox "$folder"
done
###

good luck

Hans

> USERROOT="$HOME"
> MAILP="Maildir"
>
> J_PATH="$USERROOT/$MAILP/.Junk"
> S_PATH="$USERROOT/$MAILP/.SPAM"
> H_PATH="$USERROOT/$MAILP/cur"
>
> # Train both junk folders as spam (new and cur).
> if [ -d "$J_PATH" ]; then
>     /usr/local/bin/sa-learn --spam --progress "$J_PATH"/{new,cur}
> fi
>
> if [ -d "$S_PATH" ]; then
>     /usr/local/bin/sa-learn --spam --progress "$S_PATH"/{new,cur}
> fi
>
> # Train already-read inbox mail as ham.
> if [ -d "$H_PATH" ]; then
>     /usr/local/bin/sa-learn --ham "$H_PATH"
> fi
>
> This all worked fine, but it was very resource intensive, and it only worked
> with the very few shell users. I tried to run it (manually) a few times with
> the virtual users, but I ended up with a process that ground the computer to
> a halt and generated a bayes database that was massively large (GBs).
>
> So, other than throwing more iron at the problem, is there something I can do
> to make this process a little smarter? Make it work with the virtual users
> without generating a massive db file?
>

-- 
XTrade Assessory International Facilitator BR - US - CA - DE - GB - RU - UK +55 (11) 4249. http://xtrade.matik.com.br
Allowing IMAP users to train spam/ham
I used to have a setup where IMAP users could put mail into either SPAM or Junk mailboxes to have it auto trained and then I had a script that stepped through and did the training, and it also processed non-new mail in the inbox as ham.

USERROOT="$HOME"
MAILP="Maildir"

J_PATH="$USERROOT/$MAILP/.Junk"
S_PATH="$USERROOT/$MAILP/.SPAM"
H_PATH="$USERROOT/$MAILP/cur"

# Train both junk folders as spam (new and cur).
if [ -d "$J_PATH" ]; then
    /usr/local/bin/sa-learn --spam --progress "$J_PATH"/{new,cur}
fi

if [ -d "$S_PATH" ]; then
    /usr/local/bin/sa-learn --spam --progress "$S_PATH"/{new,cur}
fi

# Train already-read inbox mail as ham.
if [ -d "$H_PATH" ]; then
    /usr/local/bin/sa-learn --ham "$H_PATH"
fi

This all worked fine, but it was very resource intensive, and it only worked with the very few shell users. I tried to run it (manually) a few times with the virtual users, but I ended up with a process that ground the computer to a halt and generated a bayes database that was massively large (GBs).

So, other than throwing more iron at the problem, is there something I can do to make this process a little smarter? Make it work with the virtual users without generating a massive db file?

-- 
'What can I do? I'm only human,' he said aloud. Someone said, Not all of you. --Pyramids