Re: Looking for assist on a rule
On Wed, 1 Nov 2017, Gary Smith wrote:

> We have recently seen a huge uptick in spam from a bunch of different TLD's. Bayes has been a little whacky with them as well. Our install is 3.3.1 (we're going to be replacing it soon).
>
> I'm looking to implement a rule that will assign a higher score to specific TLD's. I tried the rule below based upon the guidelines from https://wiki.apache.org/spamassassin/WritingRules. Nothing seems to hit it though.
>
> header HS_BAD_DOMAIN From =~ /^\.(top|study|click|party|link|stream|info|trade|bid|xxx)/i
> describe HS_BAD_DOMAIN Contains one of the bad domains that commonly spams
> score HS_BAD_DOMAIN 0.1 0.1 0.1 0.1

Here is what I have (after adding the ones you list that I don't):

  header   FROM_RARE_TLD   From:addr =~ /\.(?:wor(?:k|ld)|space|club|science|pub|red|blue|green|link|ninja|lol|xyz|faith|review|download|top|global|(?:web)?site|tech|party|pro|bid|trade|win|moda|news|online|biz|host|loan|study|click|stream|xxx)$/i
  describe FROM_RARE_TLD   From address in rarely-nonspam TLD
  score    FROM_RARE_TLD   3.000

  header   REPTO_RARE_TLD  Reply-To =~ /\.(?:wor(?:k|ld)|space|club|science|pub|red|blue|green|link|ninja|lol|xyz|faith|review|download|top|global|(?:web)?site|tech|party|pro|bid|trade|win|moda|news|online|biz|host|loan|study|click|stream|xxx)>?$/i
  describe REPTO_RARE_TLD  Reply-To address in rarely-nonspam TLD
  score    REPTO_RARE_TLD  3.000

  uri      URI_RARE_TLD    m;://[^/]+\.(?:wor(?:k|ld)|space|club|science|pub|red|blue|green|link|ninja|lol|xyz|faith|review|download|top|global|(?:web)?site|tech|party|pro|bid|trade|win|moda|news|online|biz|host|loan|study|click|stream|xxx)(?:/|$);i
  describe URI_RARE_TLD    URI refers to rarely-nonspam TLD
  score    URI_RARE_TLD    3.000

.info has too many legit domains now that I don't think it's justified to block that entire TLD.
--
 John Hardin KA7OHZ    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174    pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
---
 "Bother," said Pooh as he struggled with /etc/sendmail.cf, "it never does quite what I want. I wish Christopher Robin was here."
   -- Peter da Silva in a.s.r
---
 4 days until Daylight Saving Time ends in U.S. - Fall Back
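For anyone adapting the rules above, here is a minimal Python sketch of how the Reply-To and URI patterns behave. It is not SpamAssassin itself: Python's `re` stands in for Perl's regex engine (the two agree on these constructs), the TLD list is shortened for readability, and every sample address and URL is invented for illustration.

```python
import re

# Shortened TLD list for illustration only; the rules above use a longer one.
TLDS = r"(?:wor(?:k|ld)|(?:web)?site|top|click|xyz)"

# Reply-To is matched against the raw header, so the pattern tolerates
# an optional closing angle bracket before the end of the string.
repto = re.compile(r"\." + TLDS + r">?$", re.IGNORECASE)

# URI rule: the TLD must be the last label of the host part, i.e. it is
# followed either by "/" or by the end of the string.
uri = re.compile(r"://[^/]+\." + TLDS + r"(?:/|$)", re.IGNORECASE)

# Hypothetical samples.
reply_to_samples = [
    "Bob <bob@mail.example.work>",   # raw header ending in ">"
    "bob@example.website",           # bare address
    "bob@example.com",               # TLD not in the list
]
uri_samples = [
    "http://promo.example.xyz/offer",  # listed TLD ends the host
    "http://example.top",              # listed TLD at end of string
    "https://example.com/top.html",    # "top" only appears in the path
    "http://click.example.org/",       # "click" is a subdomain, not the TLD
]

for s in reply_to_samples:
    print(s, bool(repto.search(s)))
for s in uri_samples:
    print(s, bool(uri.search(s)))
```

The `(?:/|$)` tail is what keeps the URI rule from firing on a host such as click.example.org, where a listed word appears somewhere other than the final label.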
Re: very basic SA-Learn performance question: is 90 seconds or so per token really, really slow or roughly normal?
Oh, I wiped the bayes data and started over already once, it isn't (or shouldn't be) that big a deal.

Disk performance: seems OK to me.

# diskinfo -t /dev/aacd0
/dev/aacd0
        512             # sectorsize
        73295462400     # mediasize in bytes (68G)
        143155200       # mediasize in sectors
        0               # stripesize
        0               # stripeoffset
        8910            # Cylinders according to firmware.
        255             # Heads according to firmware.
        63              # Sectors according to firmware.
                        # Disk ident.

Seek times:
        Full stroke:      250 iter in 2.966242 sec = 11.865 msec
        Half stroke:      250 iter in 2.126653 sec =  8.507 msec
        Quarter stroke:   500 iter in 3.616484 sec =  7.233 msec
        Short forward:    400 iter in 1.540087 sec =  3.850 msec
        Short backward:   400 iter in 1.104617 sec =  2.762 msec
        Seq outer:       2048 iter in 0.546351 sec =  0.267 msec
        Seq inner:       2048 iter in 0.726598 sec =  0.355 msec

Transfer rates:
        outside:   102400 kbytes in 2.103472 sec = 48681 kbytes/sec
        middle:    102400 kbytes in 2.300709 sec = 44508 kbytes/sec
        inside:    102400 kbytes in 3.192841 sec = 32072 kbytes/sec

Nothing amazing, but nothing unexpectedly bad either.

-------- Original Message --------
Subject: Re: very basic SA-Learn performance question: is 90 seconds or so per token really, really slow or roughly normal?
From: David Jones
To: users@spamassassin.apache.org
Date: Thu Nov 02 2017 01:00:40 GMT+0300 (AST)

> If you want to try to keep your existing Bayes data, try dumping it to a
> backup file, clear the DB, then restore it back to see if this resets things
> properly. Hopefully this won't take weeks to dump. :)
>
> https://wiki.apache.org/spamassassin/BayesMigration
>
> BTW, do you have normal file IO performance? Have you checked iotop and
> iostats to see what kind of IOPs/Mbps you are getting on your filesystem
> where the Bayes DB files are?
Re: very basic SA-Learn performance question: is 90 seconds or so per token really, really slow or roughly normal?
On 11/01/2017 04:40 PM, David Gessel wrote:

> Bill, Thanks for the advice. I'm not too worried about the permissions config, though I will make the mods once I get performance up to the point where bayes is usable at all - I wouldn't want to lose all those sweet, sweet tokens to some unauthorized write permission. -David

If you want to try to keep your existing Bayes data, try dumping it to a backup file, clear the DB, then restore it back to see if this resets things properly. Hopefully this won't take weeks to dump. :)

https://wiki.apache.org/spamassassin/BayesMigration

BTW, do you have normal file IO performance? Have you checked iotop and iostats to see what kind of IOPs/Mbps you are getting on your filesystem where the Bayes DB files are?

> -------- Original Message --------
> Subject: Re: very basic SA-Learn performance question: is 90 seconds or so per token really, really slow or roughly normal?
> From: Bill Cole
> To: users@spamassassin.apache.org
> Date: Wed Nov 01 2017 06:57:55 GMT+0300 (AST)
>
>> On 31 Oct 2017, at 7:27 (-0400), David Gessel wrote:
>>
>>> bayes_file_mode 0777
>>
>> Don't do that. I know the SiteWideBayes page recommends that, but it's wrong. It's a bad idea to EVER make ANY file mode 0777 on any normal system. Something mangled your Bayes DB. Anything running on that system *could* do so. Maybe it was innocent, maybe not.
>>
>> One alternative: use 0770 (or even 775) and use group membership to control access. You can then symlink the ~/.spamassassin directories of users in the group to that of the primary SA user (i.e. whatever amavisd runs as) OR hardlink the Bayes and autowhitelist files from the primary user's directory into that of other users.
>>
>> Another alternative: use 0700 and whenever doing anything with the Bayes/AWL/TxRep DBs, do it as the primary user of the sitewide DB. This requires giving that user read access to user mail but that's safe because it already is seeing it all pre-delivery anyway. The safest approach for that is setting an ACL on the Maildir/. I use MIMEDefang instead of amavisd so the ACL for mine looks like this:
>>
>> bigsky:~ bill$ ls -led Maildir/
>> drwx--+ 239 bill bill 8670 Oct 31 09:31 Maildir/
>>  0: user:defang allow list,search,readattr,file_inherit,directory_inherit

--
David Jones
Re: very basic SA-Learn performance question: is 90 seconds or so per token really, really slow or roughly normal?
Bill,

Thanks for the advice. I'm not too worried about the permissions config, though I will make the mods once I get performance up to the point where bayes is usable at all - I wouldn't want to lose all those sweet, sweet tokens to some unauthorized write permission.

-David

-------- Original Message --------
Subject: Re: very basic SA-Learn performance question: is 90 seconds or so per token really, really slow or roughly normal?
From: Bill Cole
To: users@spamassassin.apache.org
Date: Wed Nov 01 2017 06:57:55 GMT+0300 (AST)

> On 31 Oct 2017, at 7:27 (-0400), David Gessel wrote:
>
>> bayes_file_mode 0777
>
> Don't do that. I know the SiteWideBayes page recommends that, but it's wrong.
> It's a bad idea to EVER make ANY file mode 0777 on any normal system.
> Something mangled your Bayes DB. Anything running on that system *could* do
> so. Maybe it was innocent, maybe not.
>
> One alternative: use 0770 (or even 775) and use group membership to control
> access. You can then symlink the ~/.spamassassin directories of users in the
> group to that of the primary SA user (i.e. whatever amavisd runs as) OR
> hardlink the Bayes and autowhitelist files from the primary user's directory
> into that of other users.
>
> Another alternative: use 0700 and whenever doing anything with the
> Bayes/AWL/TxRep DBs, do it as the primary user of the sitewide DB. This
> requires giving that user read access to user mail but that's safe because it
> already is seeing it all pre-delivery anyway. The safest approach for that is
> setting an ACL on the Maildir/. I use MIMEDefang instead of amavisd so the
> ACL for mine looks like this:
>
> bigsky:~ bill$ ls -led Maildir/
> drwx--+ 239 bill bill 8670 Oct 31 09:31 Maildir/
>  0: user:defang allow list,search,readattr,file_inherit,directory_inherit
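As a companion to the file-mode advice quoted above, here is a small Python sketch that checks whether a file is world-writable, which is the property that separates 0777 from the suggested 0770 or 0775. The helper name `world_writable` is made up for this example, it assumes a POSIX system, and it runs against a throwaway temp file rather than a real Bayes DB.

```python
import os
import stat
import tempfile


def world_writable(path: str) -> bool:
    """Return True if users outside the owner and group may write to path."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return bool(mode & stat.S_IWOTH)


# Throwaway file standing in for a Bayes DB file (hypothetical).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    demo_path = tmp.name

results = {}
for mode in (0o777, 0o775, 0o770, 0o700):
    os.chmod(demo_path, mode)
    results[oct(mode)] = world_writable(demo_path)

os.remove(demo_path)

for mode, flagged in results.items():
    print(mode, "world-writable" if flagged else "ok")
```

Pointing the same check at the actual Bayes files (their location depends on the local bayes_path setting) would show whether anything on the system is still able to write to them.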
RE: Looking for assist on a rule
Bowie (and the rest that answered),

Thanks for the follow up. I went with your suggestion of adding the additional addr field and fixed the ^ and it's catching now. The multiple values on the same line were intentional. I actually have different scores for bayes inclusion and network test (just tweaking them a little).

Final is:

header HS_BAD_DOMAIN From:addr =~ /\.(top|study|click|party|link|stream|info|trade|bid|xxx)$/i

Thanks again,
Gary-

-----Original Message-----
From: Bowie Bailey [mailto:bowie_bai...@buc.com]
Sent: Wednesday, November 1, 2017 12:03 PM
To: users@spamassassin.apache.org
Subject: Re: Looking for assist on a rule

On 11/1/2017 2:39 PM, Gary Smith wrote:
> We have recently seen a huge uptick in spam from a bunch of different TLD's.
> Bayes has been a little whacky with them as well. Our install is 3.3.1
> (we're going to be replacing it soon).
>
> I'm looking to implement a rule that will assign a higher score to specific
> TLD's. I tried the rule below based upon the guidelines from
> https://wiki.apache.org/spamassassin/WritingRules. Nothing seems to hit it
> though.
>
> header HS_BAD_DOMAIN From =~ /^\.(top|study|click|party|link|stream|info|trade|bid|xxx)/i
> describe HS_BAD_DOMAIN Contains one of the bad domains that commonly spams
> score HS_BAD_DOMAIN 0.1 0.1 0.1 0.1

The problem is the caret (^). That says that the match must START with a period. For example:

From: .top

What you probably want is to anchor the expression on the other end:

header HS_BAD_DOMAIN From:addr =~ /\.(top|study|click|party|link|stream|info|trade|bid|xxx)$/i

The ':addr' part makes sure the match only hits on the first email address in the header to prevent false positives.

Also, you don't need to specify multiple scores unless they are different.

score HS_BAD_DOMAIN 0.1

This works exactly the same and is a bit easier to read.

--
Bowie
Re: Spam via sendgrid.net
On 11/01/2017 02:00 PM, Pedro David Marco wrote:

> Hi! Is anybody scoring emails coming via sendgrid.net? I get tons of spam relayed through them! Thanks! Pedro

I have them whitelist_auth'd because they honor their abuse reports. Report the spam via https://sendgrid.com/report-spam/

--
David Jones
Re: Looking for assist on a rule
On 11/1/2017 2:39 PM, Gary Smith wrote:

> We have recently seen a huge uptick in spam from a bunch of different TLD's. Bayes has been a little whacky with them as well. Our install is 3.3.1 (we're going to be replacing it soon).
>
> I'm looking to implement a rule that will assign a higher score to specific TLD's. I tried the rule below based upon the guidelines from https://wiki.apache.org/spamassassin/WritingRules. Nothing seems to hit it though.
>
> header HS_BAD_DOMAIN From =~ /^\.(top|study|click|party|link|stream|info|trade|bid|xxx)/i
> describe HS_BAD_DOMAIN Contains one of the bad domains that commonly spams
> score HS_BAD_DOMAIN 0.1 0.1 0.1 0.1

The problem is the caret (^). That says that the match must START with a period. For example:

From: .top

What you probably want is to anchor the expression on the other end:

header HS_BAD_DOMAIN From:addr =~ /\.(top|study|click|party|link|stream|info|trade|bid|xxx)$/i

The ':addr' part makes sure the match only hits on the first email address in the header to prevent false positives.

Also, you don't need to specify multiple scores unless they are different.

score HS_BAD_DOMAIN 0.1

This works exactly the same and is a bit easier to read.

--
Bowie
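The anchoring point above can be checked outside SpamAssassin. The following is a rough Python sketch, not SA's own matching code: `first_addr` is a hypothetical helper that uses `email.utils.parseaddr` as an approximation of the `:addr` modifier, and the sample From header is invented.

```python
import re
from email.utils import parseaddr

TLDS = r"(top|study|click|party|link|stream|info|trade|bid|xxx)"

# Original attempt: "^" means the value must START with ".top", ".study", ...
start_anchored = re.compile(r"^\." + TLDS, re.IGNORECASE)
# Suggested fix: "$" means the value must END with one of the TLDs.
end_anchored = re.compile(r"\." + TLDS + r"$", re.IGNORECASE)


def first_addr(header_value: str) -> str:
    """Approximate SpamAssassin's :addr by extracting the bare address."""
    return parseaddr(header_value)[1]


# Hypothetical header value for illustration.
raw_from = '"Cheap Deals" <promo@offers.example.top>'
addr = first_addr(raw_from)

print("address:", addr)
print("^-anchored on address:", bool(start_anchored.search(addr)))
print("$-anchored on address:", bool(end_anchored.search(addr)))
print("$-anchored on raw header:", bool(end_anchored.search(raw_from)))
```

The last check is the reason the modifier matters once the pattern is end-anchored: a raw From header in this form ends with ">", so the "$" version can only match after the address has been extracted.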
Spam via sendgrid.net
Hi! Is anybody scoring emails coming via sendgrid.net? I get tons of spam relayed through them! Thanks! Pedro
Re: Looking for assist on a rule
Hi Gary,

Try this (you are wrongly anchoring with ^):

header HS_BAD_DOMAIN From =~ /\.(top|study|click|party|link|stream|info|trade|bid|xxx)$/i
describe HS_BAD_DOMAIN Contains one of the bad domains that commonly spams
score HS_BAD_DOMAIN 0.1 0.1 0.1 0.1

Pedro
Re: Looking for assist on a rule
On 11/01/2017 01:39 PM, Gary Smith wrote:

> We have recently seen a huge uptick in spam from a bunch of different TLD's. Bayes has been a little whacky with them as well. Our install is 3.3.1 (we're going to be replacing it soon).
>
> I'm looking to implement a rule that will assign a higher score to specific TLD's. I tried the rule below based upon the guidelines from https://wiki.apache.org/spamassassin/WritingRules. Nothing seems to hit it though.
>
> header HS_BAD_DOMAIN From =~ /^\.(top|study|click|party|link|stream|info|trade|bid|xxx)/i
> describe HS_BAD_DOMAIN Contains one of the bad domains that commonly spams
> score HS_BAD_DOMAIN 0.1 0.1 0.1 0.1

You are close but your regex is a little off. Use https://regex101.com/ to test your regex.

/\.(top|study|click|party|link|stream|info|trade|bid|xxx)$/

--
David Jones
Looking for assist on a rule
We have recently seen a huge uptick in spam from a bunch of different TLD's. Bayes has been a little whacky with them as well. Our install is 3.3.1 (we're going to be replacing it soon).

I'm looking to implement a rule that will assign a higher score to specific TLD's. I tried the rule below based upon the guidelines from https://wiki.apache.org/spamassassin/WritingRules. Nothing seems to hit it though.

header HS_BAD_DOMAIN From =~ /^\.(top|study|click|party|link|stream|info|trade|bid|xxx)/i
describe HS_BAD_DOMAIN Contains one of the bad domains that commonly spams
score HS_BAD_DOMAIN 0.1 0.1 0.1 0.1
Re: spample: Microsoft Office DDE exploit (in OpenXML attachment)
On Wed, 1 Nov 2017, Rupert Gallagher wrote:

> We apply a no-nonsense policy, mirroring paper mail policy. Both mail and e-mail sent to undisclosed recipients is either paid-for massmail or spam.

I'll grant "largely", but there are legitimate uses for BCC. I hope you're only enforcing this policy on email from the Internet...

> A client, that used to be spammed 60 times per day on each account and wasted paid-for hours of employees work, called us yesterday to thank us. This month they received 3 junk mails only.

The problem with that policy is: how do they know how much *legitimate* email got rejected?

--
 John Hardin KA7OHZ    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174    pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
---
 "Bother," said Pooh as he struggled with /etc/sendmail.cf, "it never does quite what I want. I wish Christopher Robin was here."
   -- Peter da Silva in a.s.r
---
 4 days until Daylight Saving Time ends in U.S. - Fall Back
Re: very basic SA-Learn performance question: is 90 seconds or so per token really, really slow or roughly normal?
On Wed, 01 Nov 2017 15:11:01 +0300 ges...@blackrosetech.com wrote:

> It is when I run it on a large mailbox that it takes what seems like
> too long to complete (at least a week for 4,000 message mailbox).
> I've almost certainly configured something wrong/weird. The rate is
> way, way below what it should be. A hint that suggests it isn't any
> sort of processing performance issue is that CPU load barely
> registers for perl/sa-learn.

First set

  bayes_auto_expire 0

This is a good idea anyway, as auto-expiry can cause problems during scanning. Running sa-learn --force-expire from cron is better.

I would also delete the existing database in case there is some corruption. It's worrying that the nspam count drops to zero.

It wouldn't hurt to run smartctl -t long on the drive device, followed by a full fsck. Also check your logs for warnings.
Re: very basic SA-Learn performance question: is 90 seconds or so per token really, really slow or roughly normal?
On 2017-11-01 14:31, RW wrote:

> On Tue, 31 Oct 2017 08:44:20 - Kevin Golding wrote:
>
>> On Mon, 30 Oct 2017 22:35:08 -, David Gessel wrote:
>>
>>> 1) sa-learn seems really, really slow. Slow enough that spam
>>> sometimes comes in faster. This seems far slower than the
>>> benchmark results suggest is within the range of normal. I'm sure
>>> I'm doing something really wrong, but not sure what.
>>
>> sa-learn is more suited to individual or small batches of messages.
>> You'll get significantly improved performance using spamc -L spam (or
>> ham, or forget).
>
> Aside from the fact that the OP is not using spamd, it's the other way around. sa-learn is inefficient for training emails one at a time because of the overhead of repeating the initialization, but it is efficient if you run it on a large mailbox.

It is when I run it on a large mailbox that it takes what seems like too long to complete (at least a week for 4,000 message mailbox). I've almost certainly configured something wrong/weird. The rate is way, way below what it should be. A hint that suggests it isn't any sort of processing performance issue is that CPU load barely registers for perl/sa-learn.

I'm certainly not certain, but I suspect there's some sort of lock/unlock thing happening - perhaps on the maildir, perhaps on the token db which is stalling the script (line 511 perhaps?) - which is seriously constipating the process.
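To put the figure above in perspective, here is a quick back-of-the-envelope calculation in Python. It takes "at least a week" as exactly seven days, which is an assumption, so the result is a lower bound on the time per message.

```python
SECONDS_PER_DAY = 24 * 60 * 60

messages = 4000  # mailbox size quoted above
days = 7         # "at least a week", taken here as exactly 7 days

seconds_per_message = days * SECONDS_PER_DAY / messages
print(f"{seconds_per_message:.1f} s per message")  # prints "151.2 s per message"
```

With CPU load barely registering, nearly all of that time would be spent waiting rather than computing, which fits the suspicion of a lock or I/O stall.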
Re: very basic SA-Learn performance question: is 90 seconds or so per token really, really slow or roughly normal?
On Tue, 31 Oct 2017 08:44:20 - Kevin Golding wrote:

> On Mon, 30 Oct 2017 22:35:08 -, David Gessel wrote:
>
> > 1) sa-learn seems really, really slow. Slow enough that spam
> > sometimes comes in faster. This seems far slower than the
> > benchmark results suggest is within the range of normal. I'm sure
> > I'm doing something really wrong, but not sure what.
>
> sa-learn is more suited to individual or small batches of messages.
> You'll get significantly improved performance using spamc -L spam (or
> ham, or forget).

Aside from the fact that the OP is not using spamd, it's the other way around. sa-learn is inefficient for training emails one at a time because of the overhead of repeating the initialization, but it is efficient if you run it on a large mailbox.
Re: spample: Microsoft Office DDE exploit (in OpenXML attachment)
We apply a no-nonsense policy, mirroring paper mail policy. Both mail and e-mail sent to undisclosed recipients are either paid-for massmail or spam. Paper junk and e-mail junk whose origin is verifiable and within legal domain goes to the lawyer, who sues the sender and gets an economic compensation for us. The remaining junk is automatically rejected. We are not 100% efficient on this, as we reject stuff that may go to the lawyer, but we are happy.

A client, that used to be spammed 60 times per day on each account and wasted paid-for hours of employees' work, called us yesterday to thank us. This month they received 3 junk mails only.

Full disclosure: no, we are not Protonmail.

On Wed, Nov 1, 2017 at 9:20 AM, LuKreme wrote:

> On Nov 1, 2017, at 00:52, Rupert Gallagher wrote:
>> By local policy, we *reject* e-mail to undisclosed recipient, so this is not
>> a problem for us.
>
> You are rejecting legitimate mail then.
Re: very basic SA-Learn performance question: is 90 seconds or so per token really, really slow or roughly normal?
-------- Original Message --------
Subject: Re: very basic SA-Learn performance question: is 90 seconds or so per token really, really slow or roughly normal?
From: Matus UHLAR - fantomas
To: users@spamassassin.apache.org
Date: Tue Oct 31 2017 23:05:23 GMT+0300 (AST)

>>> On 31.10.17 01:35, David Gessel wrote:
>>>> amavisd-new-2.11.0_2,1 I'm finding the command /usr/local/bin/sa-learn --spam --showdots /mail/blackrosetech.com/gessel/.Junk/{cur,new} is taking a while to
>
>>> if you use amavis, you must train amavis' bayes database
>>> (/var/lib/amavis/.spamassassin/ here), not your own.
>
> On 31.10.17 14:27, David Gessel wrote:
>> huh, I was getting bayes filter results, as I =think= I'm training a global
>> bayes database per
>> https://wiki.apache.org/spamassassin/SiteWideBayesSetup
>
> that is quite dangerous setup if anyone has access to your system, and also
> useless when you use amavis.

I'll have to review the amavis config again. I think perhaps it is redundant; the amavis setup was done some time ago and I think it was working. The site-wide mod to 0777 permissions was made more recently while debugging, and I think it is just a mistake. The right answer is that just the amavis user is the owner of the bayes db, correct?

>>> I have trained my DB years ago and I rarely need new training now.
>>
>> Yes, I do understand that. The cron jobs I set up quite some time ago
>> # learn ham and spam
>> 17 3 * * 0 root /usr/local/bin/sa-learn --ham --no-sync /mail/blackrosetech.com/gessel/.archives.2017/{cur,new}
>> 22 3 * * 0 root /usr/local/bin/sa-learn --ham --no-sync /mail/blackrosetech.com/gessel/.Sent/{cur,new}
>> 27 3 * * 0 root /usr/local/bin/sa-learn --spam --no-sync /mail/blackrosetech.com/gessel/.ManJunk/{cur,new}
>> 22 3 * * 0 root /usr/local/bin/sa-learn --ham --no-sync /mail/blackrosetech.com/carolyn/.Archives.2017/{cur,new}
>> 32 3 * * 0 root /usr/local/bin/sa-learn --spam --no-sync /mail/blackrosetech.com/carolyn/.ManJunk/{cur,new}
>> 37 3 * * 0 root /usr/local/bin/sa-learn --ham --no-sync /mail/blackrosetech.com/carolyn/.Sent/{cur,new}
>> 55 3 * * 0 root /usr/local/bin/sa-learn --sync
>
> that will kill your machine each night. unnecessarily.
> but if really needed, I'd run them sequentially from one script.

The thing is, and this may be a big hint, it doesn't kill the machine at all. It barely generates any load, 0.2 max or so. sa-learn is running at the moment at 0.0% of cpu, 0.2% of mem, total time 1.22.03 over 24 hours. I think something may be locking the process in some way - either db locks or something else. No?

>> I disabled auto-learn because non-spam would occasionally get through to
>> spam and I didn't want to train on that. The theory here was to wipe the
>> database,
>
> not needed, re-training helps very fast usually. re-training false positives
> (especially those marked autolearn=spam) and false negatives (autolearn=ham)
> is much better.

>>>> but then 24 hours later...
>>>> # sa-learn --dump magic
>>>> 0.000 0 3 0 non-token data: bayes db version
>>>> 0.000 0 0 0 non-token data: nspam
>>>> 0.000 0 0 0 non-token data: nham
>>>
>>> are you sure someone did not back up your spam DB
>>
>> Aside from the cron jobs above, no, but if they did that, then yes.
>
> no idea how can it get lost then... maybe concurrent writes from the scripts
> above?
> But I think Berkeley DB should be resistant against this.

>>>> Would something like specifying the mailbox format also help?
>>>
>>> only if you use mbox format.
>>
>> No, maildir. Not really relevant (I don't think) but:
>>
>> dovecot2-2.2.31_1
>
> dovecot's antispam plugin could fix your problems
>
> https://wiki2.dovecot.org/Plugins/Antispam
>
> your users would maintain the SA DB themselves.

This looks like a great plugin and I'd be happy to use it, but I don't know if it will help if sa-learn is so slow. Something definitely isn't right... I've done something dumb somewhere and I'm not sure what or where.
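The suggestion above to run the learning jobs sequentially from one script could look roughly like the Python sketch below. The sa-learn path, the flags and the folder names are copied from the crontab quoted above; the function names and the idea of a single wrapper script are assumptions for illustration, not something anyone in the thread posted.

```python
import subprocess

SA_LEARN = "/usr/local/bin/sa-learn"
MAIL_ROOT = "/mail/blackrosetech.com"

# (mode, maildir folder) pairs taken from the crontab quoted above.
JOBS = [
    ("--ham", "gessel/.archives.2017"),
    ("--ham", "gessel/.Sent"),
    ("--spam", "gessel/.ManJunk"),
    ("--ham", "carolyn/.Archives.2017"),
    ("--spam", "carolyn/.ManJunk"),
    ("--ham", "carolyn/.Sent"),
]


def build_commands():
    """Return the sa-learn invocations in the order they should run."""
    cmds = []
    for mode, folder in JOBS:
        # Expand {cur,new} by hand, since no shell is involved here.
        dirs = [f"{MAIL_ROOT}/{folder}/{sub}" for sub in ("cur", "new")]
        cmds.append([SA_LEARN, mode, "--no-sync", *dirs])
    # One sync at the very end, after all learning runs have finished.
    cmds.append([SA_LEARN, "--sync"])
    return cmds


def run_all(runner=subprocess.run):
    """Run each command to completion before starting the next one."""
    for cmd in build_commands():
        runner(cmd, check=True)


# Print the plan; a single cron entry would call run_all() instead.
for cmd in build_commands():
    print(" ".join(cmd))
```

Because each command finishes before the next starts, two sa-learn processes never write to the Bayes DB at the same time, and the final --sync cannot overlap a learning run that is still going. Spelling out cur and new explicitly also removes any dependence on whether cron's shell performs brace expansion.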
Re: spample: Microsoft Office DDE exploit (in OpenXML attachment)
On Nov 1, 2017, at 00:52, Rupert Gallagher wrote: > By local policy, we *reject* e-mail to undisclosed recipient, so this is not > a problem for us. You are rejecting legitimate mail then. -- This is my signature. There are many like it, but this one is mine.