DKIM length 'l=' tag
The DKIM RFC https://datatracker.ietf.org/doc/html/rfc6376#section-8.2 tells us that it is not safe to rely on the DKIM length (l=) tag, and https://www.zone.eu/blog/2024/05/17/bimi-and-dmarc-cant-save-you/ shows how it can be used to subvert BIMI*. I am looking at extending Mail::SpamAssassin::Plugin::DKIM to indicate when a DKIM body signature only covers part of the message body and how much of the body is unsigned (bytes, percentage or possibly both). I am new to the SpamAssassin code, so any comments or suggestions would be welcome. * I am not a fan of BIMI, but big name players appear to be using it to display "trustable" logos on GUI mail clients, so users *will* be caught when it breaks. Thanks, -- Andrew C. Aitchison Kendal, UK and...@aitchison.me.uk
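The calculation being proposed can be sketched as follows. This is only an illustration in Python, not the plugin's Perl: it assumes the canonicalized body length and the l= value have already been extracted, and the function name unsigned_body_stats and its arguments are hypothetical, not part of Mail::SpamAssassin::Plugin::DKIM.

```python
from typing import Optional, Tuple


def unsigned_body_stats(body_len: int, l_tag: Optional[int]) -> Tuple[int, float]:
    """Return (unsigned_bytes, unsigned_percent) for one DKIM signature.

    body_len: length in bytes of the canonicalized body (assumed known).
    l_tag:    value of the signature's l= tag, or None when the tag is
              absent, in which case the whole body is covered.
    """
    if body_len < 0 or (l_tag is not None and l_tag < 0):
        raise ValueError("lengths must be non-negative")
    if l_tag is None:
        return 0, 0.0
    if l_tag > body_len:
        # RFC 6376 treats an l= larger than the body as a verification
        # failure; this sketch does not model that case.
        raise ValueError("l= exceeds the canonicalized body length")
    unsigned = body_len - l_tag
    if unsigned == 0:
        return 0, 0.0
    return unsigned, 100.0 * unsigned / body_len
```

For example, a 1000-byte body signed with l=250 leaves 750 bytes (75%) outside the signature, which is the kind of figure a rule could then score on.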
Re: Spamassassin 4 and ClamAVMultipleScores.
Thanks for the reply Jimmy. After playing some more - with priorities in clamav.cf, I got it working, and was just about to explain a fix, when I noticed Henrik has updated the ClamAVMultipleScores page to have a similar (actually better!) fix that I was going to suggest! # Run CLAMAV early so all the rules here will see the results priority CLAMAV -10 and removal of all the individual priorities Thanks Henrik! Andrew. On Fri, 3 Nov 2023 at 02:15, Jimmy wrote: > > The X-Spam-Virus could be absent from the email header. > > You can consider adding the following line: > > add_header spam Virus _VIRUSRESULT_ > > If this doesn't work, the ClamAV plugin might need to include > "put_metadata('X-Spam-Virus')" when it detects a virus. > > Jimmy > > > On Fri, Nov 3, 2023 at 4:06 AM Andrew Hearn wrote: > >> Hello, >> >> We're using clam, some extra signatures, and the plugin/config as >> described on >> https://cwiki.apache.org/confluence/display/SPAMASSASSIN/ClamAVMultipleScores >> to give different signature families different scores. >> >> Since moving to v4, I don't think it's working... >> >> The only rule that is matched now, is the generic CLAMAV_VIRUS rule. >> The rules for the various other signatures are no longer matched. >> Could this be due to the change in priorities for meta rules, and now >> these meta rules are running before they get to see the results from clam? >> >> I can send my config examples and debug output if that's helpful. >> >> Thanks! >> >
Spamassassin 4 and ClamAVMultipleScores.
Hello, We're using clam, some extra signatures, and the plugin/config as described on https://cwiki.apache.org/confluence/display/SPAMASSASSIN/ClamAVMultipleScores to give different signature families different scores. Since moving to v4, I don't think it's working... The only rule that is matched now, is the generic CLAMAV_VIRUS rule. The rules for the various other signatures are no longer matched. Could this be due to the change in priorities for meta rules, and now these meta rules are running before they get to see the results from clam? I can send my config examples and debug output if that's helpful. Thanks!
Re: Lint problem with KAM.cf
Hi, There is a new DecodeShortURLs in SpamAssassin trunk, and its API has changed from the one in the original module on GitHub. The new built-in module has the short_url function, but the original module uses short_url_tests and does not have a short_url function, hence the error. You possibly need "has" checks to differentiate between the two different modules with the same name currently in circulation. - Andrew > On 30 Aug 2021, at 23:13, Kevin A. McGrail wrote: > > We will take a look. We check with lint for every publication but maybe > there's a condition we missed or a spelling issue. Thanks for bringing it up. > KAM
Re: updates.spamassassin.org not resolving
My bad, actually thought updates.spamassassin.org was one of the mirrored-by urls but it is sa-update.spamassassin.org > On 23 Jul 2021, at 14:35, Kevin A. McGrail wrote: > > TL;DR: Everything looks good to me.
updates.spamassassin.org not resolving
Hi, updates.spamassassin.org is not resolving; I have tested with various DNS systems. Can the admins please check? Kind Regards, Andrew
Re: Spamassassin 3.4.4 on centos7
> On 09 Dec 2020, at 21:13, Benny Pedersen wrote: > > thanks for reporting, but this should be added to centos bug tracker since > its a centos problem, not a spamassassin problem to solve, this 2 modules is > only optional There is no bug here to be reported, those packages do exist in CentOS7 # yum whatprovides "perl(BSD::Resource)" Loaded plugins: fastestmirror Loading mirror speeds from cached hostfile * base: centos.mirror.liquidtelecom.com * epel: fedora.is.co.za * extras: centos.mirror.liquidtelecom.com * updates: centos.mirror.liquidtelecom.com perl-BSD-Resource-1.29.07-1.el7.x86_64 : BSD process resource limit and priority functions Repo: epel Matched from: Provides: perl(BSD::Resource) = 1.2907 yum whatprovides "perl(Net::CIDR::Lite)" Loaded plugins: fastestmirror Loading mirror speeds from cached hostfile * base: centos.mirror.liquidtelecom.com * epel: fedora.is.co.za * extras: centos.mirror.liquidtelecom.com * updates: centos.mirror.liquidtelecom.com perl-Net-CIDR-Lite-0.21-11.el7.noarch : Perl extension for merging IPv4 or IPv6 CIDR addresses Repo: epel Matched from: Provides: perl(Net::CIDR::Lite) = 0.21 signature.asc Description: Message signed with OpenPGP
Re: Spamassassin 3.4.4 on centos7
Use: yum localinstall spamassassin-3.4.4-1.el7.centos.x86_64.rpm That will pull in the dependencies for you. > On 09 Dec 2020, at 13:01, Niamh Holding wrote: > > rpm -ivh spamassassin-3.4.4-1.el7.centos.x86_64.rpm
Re: contact from blacklist
> On 20 Nov 2020, at 22:23, Levente Birta wrote: > > I'd like to try the KAM channel. A quick install how-to would be nice too I would like to test the KAM channel tool. Thanks, Andrew
Count of DNS lookups
Hello, Is there a way to count and log the number of individual DNS lookups that SpamAssassin does whilst processing an email? I'm really after just the number of lookups requested, but a list of the individual lookup types would be nice. Thanks. Skeffling.
Re: Google anti-phishing code project
I've not come across these before. I too am interested in how to integrate them into SA, thanks. On 20 February 2017 at 21:56, Alex wrote: > Hi, > > On Mon, Feb 20, 2017 at 2:32 PM, Dianne Skoll > wrote: > > On Mon, 20 Feb 2017 14:21:08 -0500 > > Alex wrote: > > > >> Maybe we're using something different. This is the link I was using to > >> download the phishing addresses until the other day, when it became a > >> dead link: > > > >> https://aper.svn.sourceforge.net/svnroot/aper/phishing_reply_addresses > > > > That URL works for me. However, I am currently pulling the SVN repo from > > svn://svn.code.sf.net/p/aper/code (also can use > http://svn.code.sf.net/p/aper/code) > > > > It looks like the list of addresses has not been updated since > 2017-02-16, but > > the list of phishing URLs has an entry dated 2017-02-20. > > It looks like the URL has just now become available again. Do you > happen to know the script that can be used to convert the > phishing_links file into SA rules in the same way as the > phishing_reply_addresses are converted? > > Thanks, > Alex > > > > > > > > Regards, > > > > Dianne. >
Training Bayes with BAYES_999 Mail
I'm not an expert on the mechanics of Bayes so I'm wondering how valuable it is to continue training with collected spam that is properly tagged with BAYES_999. Does that help to reinforce the logic or is it overly focusing the database on emails it can already detect? Should I only be training it with miscategorized emails and emails in the 20-80% confidence range? Thanks for clarifying, -- Andrew
Bayes Corruption
Hi, Invoked through a plugin in KerioConnect SpamAssassin 3.3.1 Platform is CentOS 5.10 So, my Bayes.db is corrupt and out of curiosity I just wanted to take a look at it. I used SQLiteBrowser to do so. Now I have some questions about the bayes_token table: 1) Is there a reason why the id is not auto-incremented? 2) The majority of the tokens appear to be valid bytea. But a large number show as (BLOB). Is this perhaps the source of the corruption? If so why would that happen? And, if not why are they (BLOB)? Thanks
FYI - ahbl.org and BIND DNS errors
Per http://ahbl.org/content/changes-ahbl, AHBL is going away (it is still used in spamassassin-3.3.1). Meanwhile, AHBL is serving strange DNS responses, e.g. (from wireshark):

  1 0.00 142.90.100.186 -> 162.243.209.249 DNS 93 Standard query 0xc828 A zuz.rhsbl.ahbl.org
  2 0.072481 162.243.209.249 -> 142.90.100.186 DNS 246 Standard query response 0xc828
  Authoritative nameservers
    rhsbl.ahbl.org: type NS, class IN, ns invalid.ahbl.org
    rhsbl.ahbl.org: type NS, class IN, ns unresponsive.ahbl.org
    rhsbl.ahbl.org: type NS, class IN, ns unresponsive2.ahbl.org
      Name Server: unresponsive2.ahbl.org
  Additional records
    invalid.ahbl.org: type A, class IN, addr 244.254.254.254
      Addr: 244.254.254.254 (244.254.254.254)
    unresponsive.ahbl.org: type A, class IN, addr 10.230.230.230
      Addr: 10.230.230.230 (10.230.230.230)
    unresponsive2.ahbl.org: type A, class IN, addr 192.168.230.230
      Addr: 192.168.230.230 (192.168.230.230)
    invalid.ahbl.org: type AAAA, class IN, addr fe80::
      Addr: fe80::

This last one, fe80::, is an IPv6 link-local address that causes the BIND nameserver to log a weird error:

  named[31365]: socket.c:4373: unexpected error:
  named[31365]: 22/Invalid argument

Per http://www.mail-archive.com/bind-users@lists.isc.org/msg05240.html connect() fails as it is missing scoping information. -- Andrew Daviel, TRIUMF, Canada Tel. +1 (604) 222-7376 (Pacific Time) Network Security Manager
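As a quick cross-check of why none of these glue addresses can be reached from the public Internet, here is a small sketch using Python's standard ipaddress module. The addresses are the ones in the capture above; treating the record with the blank type as AAAA is an assumption, and the classify helper is purely illustrative.

```python
import ipaddress

# Glue addresses from the DNS response quoted above.
# Assumption: the record shown with an empty type is an AAAA record.
GLUE = {
    "invalid.ahbl.org (A)": "244.254.254.254",
    "unresponsive.ahbl.org (A)": "10.230.230.230",
    "unresponsive2.ahbl.org (A)": "192.168.230.230",
    "invalid.ahbl.org (AAAA)": "fe80::",
}


def classify(addr: str) -> str:
    """Name the special-purpose range an address falls in, if any."""
    ip = ipaddress.ip_address(addr)
    if ip.is_link_local:
        return "link-local"
    if ip.is_reserved:
        return "reserved"
    if ip.is_private:
        return "private"
    return "global"


if __name__ == "__main__":
    for name, addr in GLUE.items():
        print(f"{name:28} {addr:18} {classify(addr)}")
```

A link-local destination is only meaningful together with an interface scope, which fits the linked explanation that connect() fails for lack of scoping information.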
Re: Detecting very recently registered domain names
On Thu, 19 Dec 2013 10:02:39 -0500 Joe Quinn jqu...@pccc.com wrote:
> We are noticing a lot of spam coming from domains that are less than two months old. Is there a good way to detect this automatically? We've thought about whois, but do not want to get blocked for looking like we are harvesting information.
This may be off topic, but is this related to Communicado Ltd, who register domains daily in order to send spam? There is more info and a maintained list (at least at the moment) at: http://blog.hinterlands.org/2013/10/unwanted-email-from-communicado-ltd/ -- Andrew
RE: USPS Spam
Just wanted to throw in my two cents here - I have spoken to USPS about this and they said that they never send out these messages unless the client requests them, and that it should be safe to completely block messages like this. The same cannot be said about UPS and FedEx, by the way. -Original Message- From: Matt [mailto:matt.mailingli...@gmail.com] Sent: Friday, August 30, 2013 4:23 PM To: users@spamassassin.apache.org Subject: USPS Spam I am seeing tons of junk getting through claiming to be from the USPS about a missed delivery package. Anyone else seeing this? I am running SpamAssassin 3.3.1 and execute sa-update weekly.
SUBJ_ALL_CAPS
Hey all - Does anybody know how long the string needs to be to trigger SUBJ_ALL_CAPS? I know it has to be multi-word and over a certain length. Was wondering the specific length. Thanks in advance :)
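For anyone wanting to experiment, the shape of such a check can be sketched in Python. This is not SpamAssassin's code. The multi-word and minimum-length conditions come from the question above, but the 10-character threshold and the list of stripped reply prefixes are assumptions recalled from the HeaderEval plugin and should be verified against the installed version.

```python
import re

# ASSUMPTION: threshold recalled from the HeaderEval plugin source;
# check your installed SpamAssassin before relying on it.
MIN_LENGTH = 10

# ASSUMPTION: reply/forward prefixes are ignored before testing.
_PREFIX = re.compile(r"^(?:(?:re|fwd|fw):\s*)+", re.IGNORECASE)


def subject_is_all_caps(subject: str) -> bool:
    """Rough model of an 'all capitals subject' test."""
    subject = _PREFIX.sub("", subject.strip())
    if not re.search(r"\s", subject):
        return False  # single-word subjects never match
    if len(subject) < MIN_LENGTH:
        return False  # short subjects never match
    letters = re.sub(r"[^a-zA-Z]", "", subject)  # only letters count
    return bool(letters) and not re.search(r"[a-z]", letters)
```

Under these assumptions "BUY NOW AND SAVE" matches, while "BUY NOW" (too short) and "BUYNOWANDSAVE" (one word) do not.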
Low scoring pill spam
Hello, I have a low scoring pills spam: http://pastebin.com/q6nWqzMR I only get the following on it:

* 1.0 RCVD_IN_MSPIKE_L3 RBL: Low reputation (-3)
*     [219.94.129.82 listed in bl.mailspike.net]
* 0.0 SUBJECT_FUZZY_CHEAP Attempt to obfuscate words in Subject:
* 0.5 FROM_LOCAL_NOVOWEL From: localpart has series of non-vowel letters
* -2.8 RP_MATCHES_RCVD Envelope sender domain matches handover relay domain
* 0.0 RCVD_NOT_IN_IPREPDNS Sender not listed at
*     http://www.chaosreigns.com/iprep/

Am I missing anything (apart from Bayes) that would help catch this? Many thanks! -- Andrew
Whitelisting subdomains?
Hey, all - I'm trying to whitelist all our internal subdomains but I can't seem to get it to work. We have so many of them that it's impractical to do them individually. For instance, we have _...@logs.domain.com, @admin-sql.domain.com etc. I was thinking that whitelist_from *.domain.com would work, but it doesn't. I can't seem to find any documentation on the net anywhere - is it even possible to do this?
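To test candidate patterns outside SpamAssassin, here is a small Python sketch of file-glob matching for addresses. It assumes the semantics given in the Mail::SpamAssassin::Conf documentation for whitelist_from (only * and ? are special); the case-insensitive match and the helper names glob_to_regex and is_whitelisted are assumptions for illustration only.

```python
import re


def glob_to_regex(pattern: str):
    """Translate an address glob ('*' = any run, '?' = one char) to a regex."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))  # everything else is literal
    # Assumption: addresses are compared case-insensitively.
    return re.compile("^" + "".join(parts) + "$", re.IGNORECASE)


def is_whitelisted(address: str, patterns) -> bool:
    """True if the address matches any of the glob patterns."""
    return any(glob_to_regex(p).match(address) for p in patterns)
```

Under these assumptions both `*.domain.com` and the stricter `*@*.domain.com` match user@logs.domain.com, so if the entry has no effect the cause may lie elsewhere (for example, which address the message actually carries) rather than in the pattern syntax.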
RE: PayPal spam filter?
I just had to weigh in here to say that we have DCC_CHECK scored up to a 4, and all of these kinds of spam messages get caught by that because they always hit at least another 1 point worth of rules. Also, those two rules require plugins, I believe. -Original Message- From: Juerg Reimann [mailto:j...@jworld.ch] Sent: Wednesday, June 26, 2013 6:42 PM To: users@spamassassin.apache.org Cc: 'Benny Pedersen' Subject: RE: PayPal spam filter? Hi Benny Thanks for your tip. Could you elaborate on this a bit? First of all, a rule with the name SPF_DID_NOT_PASS or DKIM_DID_NOT_PASS seem not to exist. How and where would I configure this? Thanks, Juerg -Original Message- From: Benny Pedersen [mailto:m...@junc.eu] Sent: Wednesday, June 12, 2013 9:38 PM To: users@spamassassin.apache.org Subject: Re: PayPal spam filter? Juerg Reimann skrev den 2013-06-12 21:30: Is there a filter to block PayPal phishing mails, i.e. everything that claims to come from PayPal but is not? meta SPF_DID_NOT_PASS (!SPF_PASS) simple ? :=) if paypal do use dkim then it could be checked with meta DKIM_DID_NOT_PASS (!DKIM_VALID_AU) phishing emails seldom pass on this 2 tests -- senders that put my email into body content will deliver it to my own trashcan, so if you like to get reply, dont do it
Chain rules?
Hey all - Is there a way to chain rules together such that one rule will only fire if another is hit? Specifically, we have a client that is getting hit with a bunch of messages that are just links, but the links contain sex words. We want to do a body scan for a list of sex words if and only if the "body contains only a link" rule we have is triggered. I tried to get this to work with meta rules but it seems like it won't do it. Is there currently a way to do this sort of conditional check?
RE: Chain rules?
This is what I was wondering. We don't want to have to run a computationally-expensive body rule unless we need to. No choice though, I guess. Thanks for your help! -Original Message- From: John Hardin [mailto:jhar...@impsec.org] Sent: Monday, June 24, 2013 1:20 PM To: users@spamassassin.apache.org Subject: Re: Chain rules? On Mon, 24 Jun 2013, Andrew Talbot wrote: Is there a way to chain rules together such that one rule will only fire if another is hit? Specifically, we have a client that is getting hit with a bunch of messages that are just links, but the links contain sex words. We want to do a body scan for a list of sex words if and only if the body contains only a link rule we have is triggered. I tried to get this to work with meta rules but it seems like it won't do it. Is there currently a way to do this sort of conditional check? Unfortunately you can't control whether or not a rule is *executed*, you can only control whether or not it contributes to the message's overall score. -- John Hardin KA7OHZhttp://www.impsec.org/~jhardin/ jhar...@impsec.orgFALaholic #11174 pgpk -a jhar...@impsec.org key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79 --- Look at the people at the top of both efforts. Linus Torvalds is a university graduate with a CS degree. Bill Gates is a university dropout who bragged about dumpster-diving and using other peoples' garbage code as the basis for his code. Maybe that has something to do with the difference in quality/security between Linux and Windows. -- anytwofiveelevenis on Y! SCOX --- 10 days until the 237th anniversary of the Declaration of Independence
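John's point can be illustrated with a toy model in Python. It is only a sketch of the idea, not how SpamAssassin is implemented: every sub-rule is evaluated regardless, and the meta expression decides only whether a score is added. The rule names, patterns, and score below are hypothetical; the double-underscore prefix follows the SpamAssassin convention for unscored sub-rules.

```python
import re

# Hypothetical sub-rules; names starting with "__" carry no score themselves.
SUB_RULES = {
    "__BODY_ONLY_LINK": re.compile(r"^\s*https?://\S+\s*$", re.IGNORECASE),
    "__LINK_BAD_WORD": re.compile(r"https?://\S*(?:badword1|badword2)", re.IGNORECASE),
}

META_SCORE = 3.0  # hypothetical score for the combined rule


def score(body: str) -> float:
    """Evaluate all sub-rules, then apply the meta rule to the results."""
    # Every sub-rule runs; nothing here short-circuits the second regex.
    hits = {name: bool(rx.search(body)) for name, rx in SUB_RULES.items()}
    # The meta rule only combines results and decides what is scored.
    meta_hit = hits["__BODY_ONLY_LINK"] and hits["__LINK_BAD_WORD"]
    return META_SCORE if meta_hit else 0.0
```

So a meta rule gives the desired scoring behaviour (points only when both conditions hold), but it does not save the cost of running the body scan.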
Rule to scan for .html attachments?
Hey all - I'm trying to set up a custom rule that scores HTML attachments. The problem I'm running across is that using a rule like this one: mimeheader HTML_ATTACH Content-Type =~ /^text\/html/i will flag all messages that come in as HTML (vs. plain text). I found this: header HTML_ATTACH_RULE_2 Content-Disposition =~ /^filename\=\"[a-z]{2}\.html\"/i But that doesn't ... work ... at all. Any suggestions? Is this even possible?
Re: Rule to scan for .html attachments?
That didn't work :( On Fri, May 31, 2013 at 12:40 PM, Martin Gregorie mar...@gregorie.org wrote: On Fri, 2013-05-31 at 11:51 -0400, Andrew Talbot wrote: I'm trying to set up a custom rule that scores HTML attachments. ..snippage.. I found this : header HTML_ATTACH_RULE_2 Content-Disposition =~ /^filename\=\"[a-z]{2}\.html\"/i Don't anchor it to the start of the line, i.e. try this: header HTML_RULE Content-Disposition =~ /filename\=\"[a-z]{2}\.html\"/i I have a very similar rule for matching ZIP file attachments whose name is xx.zip which works as expected. The only significant difference from your rule is that it doesn't use the '^' BOL anchor symbol. My guess is that SA's body text parser converts the MIME header into one line, so requiring 'filename' to be at the start of the line will always fail. Martin
Re: Rule to scan for .html attachments?
Didn't work with mime_header (or mimeheader) with either rule. On Fri, May 31, 2013 at 12:23 PM, Axb axb.li...@gmail.com wrote: On 05/31/2013 05:51 PM, Andrew Talbot wrote: Hey all - I'm trying to set up a custom rule that scores HTML attachments. The problem I'm running across is that using a rule like this one: mimeheader HTML_ATTACH Content-Type =~ /^text\/html/i Will flag all messages that come in as HTML (vs. plain text). I found this : header HTML_ATTACH_RULE_2 Content-Disposition =~ /^filename\=\"[a-z]{2}\.html\"/i But that doesn't ... Work ... At all. Any suggestions? Is this even possible? use mime_header instead of header
RE: Rule to scan for .html attachments?
That's what I was afraid of. We generally avoid those kinds of rules since we are scanning millions of messages a day. -Original Message- From: David F. Skoll [mailto:d...@roaringpenguin.com] Sent: Friday, May 31, 2013 2:22 PM To: users@spamassassin.apache.org Subject: Re: Rule to scan for .html attachments? On Fri, 31 May 2013 14:10:36 -0400 Andrew Talbot andrew.talbot.ownweb...@gmail.com wrote: That didn't work :( What didn't work? Oh... you top-posted. Anyway... you might need a full rule, which can be expensive. Something like: full HTML_RULE /Content-Disposition:.{0,50}name\s{0,2}=\s{0,2}\"?.{0,50}\.html?/i Completely untested, of course! :) Regards, David.
RE: Rule to scan for .html attachments?
I need it to fire on any HTML attachment. The modules are enabled. I can get it to pick up text/html, remember, but the problem is that it detects messages sent as HTML when it's set up like that. It doesn't detect plain-text messages, but it will flag plain-text messages with HTML files attached. -Original Message- From: Martin Gregorie [mailto:mar...@gregorie.org] Sent: Friday, May 31, 2013 2:35 PM To: users@spamassassin.apache.org Subject: Re: Rule to scan for .html attachments? On Fri, 2013-05-31 at 14:10 -0400, Andrew Talbot wrote: That didn't work :( Can you post one or two examples of actual MIME attachment headers that you're trying to get the rule to fire on? Obvious question, but have you enabled the MIME header module? I'm using MimeMagic and enabling it requires that MimeMagic.pm and MimeMagic.cf be included in /etc/mail/spamassassin (or wherever you have told SA to look for its configuration etc. Martin
RE: Rule to scan for .html attachments?
Hi, Martin - Thank you for your response. The original test was using a file arbitrarily named aa.html .. It still doesn't work with the rewrite you provided :/ -Original Message- From: Martin Gregorie [mailto:mar...@gregorie.org] Sent: Friday, May 31, 2013 3:38 PM To: users@spamassassin.apache.org Subject: Re: Rule to scan for .html attachments? On Fri, 2013-05-31 at 14:45 -0400, Andrew Talbot wrote: I need it to fire on any HTML attachment. The modules are enabled. I can get it to pick up text/html, remember, but the problem is that it detects messages sent as HTML when it's set up like that. It doesn't detect plain-text messages, but it will flag plain-text messages with HTML files attached. Well, that's exactly what your second rule won't do: it will only fire on the header of an html attachment for a file that has one of a very restricted set of filenames. As you haven't posted any example MIME header sets I can only guess, but my guess is that none of the messages you've tried it against have attachments with names that match the restriction. As I said before the rule can't work with the '^' in place, because that says that the 'filename=' string must be at the beginning of a line and NOT preceded by any white space. That's a harmful restriction because you never see MIME headers like that. With the '^' removed the rule becomes: header HTML_ATTACH_RULE_2 Content-Disposition =~ /filename\=\"[a-z]{2}\.html\"/i which has a better chance of working. This version will only fire if the filename associated with the attachment has precisely two alphabetic characters plus a .html extension, i.e. it will fire on filename=aa.html or filename=ZZ.HTML because the trailing 'i' makes it a caseless match, but it won't fire on filename=cat.html or filename=x.html because these don't have two character names and it won't fire if the attachment follows the common Windows convention of using a .htm extension. 
If you want the rule to fire on *any* HTML attachment it should be: header HTML_ATTACH_RULE_2 Content-Disposition =~ /filename\=\".{0,30}\.html{0,1}\"/i which will match any filename with a .html or .htm extension (including .html and .htm). Could I respectfully suggest that you learn about Perl regular expressions before you try writing any more SA rules? SA rules are all based on using the Perl flavour of regular expressions to match character strings in headers and the message body. You could do a lot worse than getting a copy of Programming Perl by Larry Wall, Tom Christiansen and Jon Orwant, published by O'Reilly. If there isn't one in the firm's technical library, they should be willing to buy a copy. It's a brick of a book, but you only need to read Chapter 5: Pattern Matching to write SA rules and in any case the rest of its contents will come in handy in future if anybody needs to write Perl programs or SA extension modules. Martin
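The three patterns discussed in this thread can be tried against sample header values with a short Python sketch. It assumes the archive dropped the double quotes from the rules, so the intended patterns are the quoted forms below. Python's re module is used in place of Perl, which behaves the same for these constructs. The sample header values are invented for illustration.

```python
import re

# Assumption: the original rules contained \" around the filename and the
# archive stripped the quotes; these are the reconstructed patterns.
ANCHORED = re.compile(r'^filename="[a-z]{2}\.html"', re.IGNORECASE)
TWO_CHAR = re.compile(r'filename="[a-z]{2}\.html"', re.IGNORECASE)
ANY_HTML = re.compile(r'filename=".{0,30}\.html{0,1}"', re.IGNORECASE)

# Invented Content-Disposition values, for illustration only.
SAMPLES = [
    'attachment; filename="aa.html"',
    'attachment; filename="ZZ.HTML"',
    'attachment; filename="cat.htm"',
    'attachment; filename="report.pdf"',
]


def fires(rule, header_value: str) -> bool:
    """True if the pattern matches anywhere in the header value."""
    return rule.search(header_value) is not None


if __name__ == "__main__":
    for value in SAMPLES:
        print(value, fires(ANCHORED, value), fires(TWO_CHAR, value), fires(ANY_HTML, value))
```

This only exercises the regular expressions themselves. Whether SpamAssassin hands a MIME part's Content-Disposition header to the rule at all (header versus mimeheader) is a separate question that the sketch does not address.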
Re: Bayes + DCC / Bayes as a false-positive killer
Hi, Dave - We don't have anything else learning because we deal in such bulk. We're an email service provider hosting hundreds of thousands of accounts. Re: Your last line about I don't understand what their concerns are ... Welcome to my world. Right now I am manually writing rules - custom rules - based on the subject lines (only the subject lines) of spam that gets reported to us. We are very very clearly Doing It Wrong, so I'm trying to find a way to do it better. As far as why we can't have Bayes and DCC on at the same time I've got no idea. I just work here, Dave! :) Thank you for your response. On Tue, May 28, 2013 at 8:12 PM, Dave Warren da...@hireahit.com wrote: On 2013-05-28 13:43, Andrew Talbot wrote: As some of you may have known from talking with me over the past few weeks, I've been having a difficult time 'selling' my bosses on the idea of Bayes; it simply doesn't seem to do anything new to them. But looking at the data today, I came up with an idea: use Bayes to reduce false positives. Do you have anything else that heuristically learns from your mail and adapts in real-time to your mail flow? That would mean we'd completely nerf the rules that add points to the score, but we'd trust Bayes to subtract points from messages it is confident are ham. I am aware of how silly that sounds. But would it work? We don't have another way to filter out false positives - we've got tons of ways to add points! What do ya'll think? I think it's a great idea, but that I wouldn't zero out the positive score unless it's hurting you, I think I'd just let it do what it does. If it saves you a subscription service, then that alone should be a strong selling point, unless there are false positives (and if so, I'd look into tuning your ham training before abandoning all hope) I guess part of it is that I don't understand what their concerns are with using Bayesian learning? 
-- Dave Warren http://www.hireahit.com/ http://ca.linkedin.com/in/davejwarren
Re: Bayes + DCC / Bayes as a false-positive killer
Hi there, RW- Thank you for your response. A lot of interesting points in there. The issue with something like Bogofilter or its ilk is that it: 1- Requires manual intervention from users (we don't have access to the content of their messages) 2- Apparently doesn't scale well to huge client bases with all kinds of diverse businesses. Our clients range from banking institutions to employment agencies to ... ehh... purveyors of adult objects. So its tough to find commonalities, and since we're so large, we can't exactly have different user accounts for each. Go figure. Bayes performs beautifully in my test environment. I just need to find that extra WOW factor. I thought that saving the cost on DCC would be it but ... That didn't seem to make a difference. Go figure. On Wed, May 29, 2013 at 8:02 AM, RW rwmailli...@googlemail.com wrote: On Tue, 28 May 2013 16:43:20 -0400 Andrew Talbot wrote: Hey all - I've got two questions: 1- ... That said, I'm wondering if it's redundant to run DCC and Bayes at the same time? From what I understand, DCC is a subscription-based service, so it would be nice to be able to cut that cost out! It depends what you mean by DCC, the basic version is free, but is actually only a a way of identifying *bulk* mail which is why it doesn't score all that much. The paid version is a reputation system, it doesn't get discussed much here. Spamassassin is score-based, it doesn't rely on poison-pill rules. It doesn't matter that all DCC hits are also Bayes hits provided that the FPs and FNs don't also overlap and some spam that hits Bayes is pushed over the 5 point threshold by DCC. As some of you may have known from talking with me over the past few weeks, I've been having a difficult time 'selling' my bosses on the idea of Bayes; it simply doesn't seem to do anything new to them. But looking at the data today, I came up with an idea: use Bayes to reduce false positives. 
That would mean we'd completely nerf the rules that add points to the score, but we'd trust Bayes to subtract points from messages it is confident are ham. I am aware of how silly that sounds. But would it work? We don't have another way to filter out false positives - we've got tons of ways to add points! Reducing FPs is already one of the main benefits of Bayes. The trouble is that if you rescore it, you will still be using the Bayes scoreset that's optimized around Bayes doing a lot of the spam catching. I think you'd be better-off scoring Bogofilter, or a similar filter with 3-way clustering, into SpamAssassin. You still have the problem of learning representative ham if you want accurate ham identification.
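RW's argument about overlapping hits is easy to see with a little arithmetic, sketched here in Python. BAYES_80 and DCC_CHECK are real rule names, but every score below is hypothetical and OTHER_RULES is a placeholder for whatever else the message hits. Only the 5.0 threshold is SpamAssassin's default.

```python
THRESHOLD = 5.0  # SpamAssassin's default required score

# Hypothetical scores, for illustration only.
SCORES = {
    "BAYES_80": 2.0,
    "DCC_CHECK": 1.1,
    "OTHER_RULES": 2.5,  # placeholder for the rest of the rule hits
}


def total(hits) -> float:
    """Sum the scores of the rules a message hit."""
    return sum(SCORES[name] for name in hits)


def is_spam(hits) -> bool:
    """A message is tagged once its total reaches the threshold."""
    return total(hits) >= THRESHOLD
```

With these numbers a message hitting BAYES_80 and OTHER_RULES totals 4.5 and slips through, while the same message with DCC_CHECK added reaches 5.6 and is tagged. DCC can therefore still change outcomes even if Bayes also fires on every message DCC catches.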
Re: Bayes + DCC / Bayes as a false-positive killer
Hi, Matus - I wanted to ask you about your last point about the bayes9x fps and the 0x fns, mostly because it seems like that contradicts the sentence that follows (that you don't consider it to be 100%). If there's no FNs or FPs, it's about as good as it gets, no? On Wed, May 29, 2013 at 3:13 AM, Matus UHLAR - fantomas uh...@fantomas.skwrote: On 28.05.13 16:43, Andrew Talbot wrote: That said, I'm wondering if it's redundant to run DCC and Bayes at the same time? From what I understand, DCC is a subscription-based service, so it would be nice to be able to cut that cost out! No, it is not. It only requires you using other than public DCC servers when your daily rate is over 200k. The server must share the checksums with the DCC network (otherwise you couldn't catch those spams even). If you have that many messages daily, it would not be even a bad idea have DCC locally. score, but we'd trust Bayes to subtract points from messages it is confident are ham. I rarely have BAYES_9x FPs and BAYES_0x FNs. While BAYES is great, I don't consider it to be 100% -- Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/ Warning: I wish NOT to receive e-mail advertising to this address. Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu. Saving Private Ryan... Private Ryan exists. Overwrite? (Y/N)
Bayes + DCC / Bayes as a false-positive killer
Hey all - I've got two questions: 1- We're running Bayes and DCC on our server, and we've just been running Bayes locally to see how well it works. It's been about three weeks now so I finally really started poring over the results. One thing I noticed that I thought was a particularly interesting anomaly: Bayes caught 100% of what DCC caught. 100%. Without exception - in thousands of messages. The reverse wasn't true at all. That said, I'm wondering if it's redundant to run DCC and Bayes at the same time? From what I understand, DCC is a subscription-based service, so it would be nice to be able to cut that cost out! 2- As some of you may have known from talking with me over the past few weeks, I've been having a difficult time 'selling' my bosses on the idea of Bayes; it simply doesn't seem to do anything new to them. But looking at the data today, I came up with an idea: use Bayes to reduce false positives. That would mean we'd completely nerf the rules that add points to the score, but we'd trust Bayes to subtract points from messages it is confident are ham. I am aware of how silly that sounds. But would it work? We don't have another way to filter out false positives - we've got tons of ways to add points! What do ya'll think?
Bayes autolearning: logarithmic?
Hey all - I set up Bayes with autolearning a few weeks ago. It took forever to get started, but now it seems like the learning speed has accelerated. Is the autolearning supposed to accelerate? I can't help but feel like it may just be feeding itself its own data or something.
RE: Default Bayes Database
You all are keeping me sane and grounded as I deal with the Powers That Be here trying to set this up. It's good to know that I'm not wrong (I agree with everything everyone has said, and pointed out from the beginning a default database would be awful). And this: If he insists on starting with a pre-populated Bayes database, he sure knows why. Other than I'm the boss, I want. ... Is exactly right too. We're implementing it locally with auto-learning enabled this weekend (oh, yeah, boss didn't want auto-learning enabled either..). So here goes!! Thanks for all your help. -Original Message- From: Karsten Bräckelmann [mailto:guent...@rudersport.de] Sent: Wednesday, May 08, 2013 8:18 PM To: users@spamassassin.apache.org Subject: Re: Default Bayes Database On Wed, 2013-05-08 at 14:09 -0400, Andrew Talbot wrote: Well, I certainly hope someone offers to help! Heh! I am really confident, Alex didn't mean to be rude, neither that he actually hopes no one will help you. Quite the contrary... He DID try to help you by explaining why a default Bayes database is a bad idea in the first place. And that was his way of telling you... If only to say there is no default database. That. :) There is none, and there never has been. As we've spoken about off-list, my boss is being very particular about the deployment of Bayes, and it sounds like one of his caveats is that we don't start from a blank database. I can see how the idea of basing off of some known to be classified tokens sounds tempting. However, there is no such token. None. Just try to imagine working in an industry where e.g. Viagra and Cialis are totally legit phrases to use... Feel free to direct your boss here. If he insists on starting with a pre- populated Bayes database, he sure knows why. Other than I'm the boss, I want. Anyway, Andrew, your idea of that whole blank slate is inaccurate. If you import someone else's data, before importing your database has been empty. 
If you collect some ham and spam for initial training, before training your database has been empty. You even do NOT have to deploy SA prior to that. I don't know the size of your user base, but it seems it shouldn't be hard to have a few of the users chip in. Get a few of them to collect hand-classified ham and spam for you. Train Bayes with that. After that, deploy SA to your mail processing chain. There you go! A pre-populated Bayes database, based on YOUR particular ham and spam tokens, before deploying SA in production. -- char *t=\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4 ; main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;il;i++){ i%8? c=1: (c=*++x); c128 (s+=h); if (!(h=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}
Default Bayes Database
Hey all - I remember seeing somewhere that there was a default Bayes database for Bayes to start using right away, but can't seem to find that information again on the Wiki or in my notes. Can someone please help?
RE: Default Bayes Database
Well, I certainly hope someone offers to help! If only to say there is no default database.

As we've spoken about off-list, my boss is being very particular about the deployment of Bayes, and it sounds like one of his caveats is that we don't start from a blank database.

For the record, I agree with your logic completely. And I hate to say stupid things like this, but it doesn't even matter to me if the tokens in the default database are useless at this point, or if there are only 20 of them. I just need to get this deployed so it can start learning.

-----Original Message-----
From: Axb [mailto:axb.li...@gmail.com]
Sent: Wednesday, May 08, 2013 1:32 PM
To: users@spamassassin.apache.org
Subject: Re: Default Bayes Database

On 05/08/2013 07:26 PM, Andrew Talbot wrote:
> Hey all - I remember seeing somewhere that there was a default Bayes database for Bayes to start using right away, but can't seem to find that information again on the Wiki or in my notes. Can someone please help?

I hope nobody offers to help. Why?

- Your HAM is somebody else's SPAM.
- A decent Bayes DB is highly dynamic, and yesterday's tokens from someone else's traffic will be useless to your traffic today.
- If you have a decent traffic flow, it takes less than 4 hours of autolearning with YOUR data to see Bayes scoring.
Bayes Autolearning
Hey All - I'm about to set up Bayes on one of our mail servers. A lot of the documentation says that I need to manually sift through a few hundred messages and classify them to 'teach' the filter, and it sounds like I may need to do that on an ongoing basis. That is not a very plausible solution - our servers process about 2 million messages a day.

Does Bayes start out with a completely blank slate? That is, if I never have it learn anything from my servers, will it still be pulling from something already defined? Can I set it to autolearn and leave it be? Or will it require continual maintenance and manual message feeding?

Any suggestions any of you have for a Bayes newbie - about what I just asked or otherwise - would be very much appreciated :)
RE: Bayes Autolearning
Thank you for that! Off-list you mentioned that you don't need to set the cron/expire because of Redis features; why is it commented out here?

-----Original Message-----
From: Axb [mailto:axb.li...@gmail.com]
Sent: Wednesday, May 01, 2013 2:14 PM
To: users@spamassassin.apache.org
Subject: Re: Bayes Autolearning

On 05/01/2013 08:01 PM, Andrew Talbot wrote:
> Any suggestions any of you have for a Bayes newbie - about what I just asked or otherwise - would be very much appreciated.

I advocate autolearning as it has always worked fine for me. Can take a bit longer to see good results but with some tuning I can sit back and hear it purr and not worry about collecting ham and spam and training, which under certain circumstances may even be impossible.

Before moving on to Redis, these were my bayes settings:

    # bayes.cf
    use_bayes 1
    bayes_auto_learn 1
    bayes_auto_expire 0
    bayes_learn_to_journal 0

    # Don't want to wait for the default 200 hams/spams
    bayes_min_ham_num 20
    bayes_min_spam_num 20

    bayes_auto_learn_threshold_nonspam -1.0
    bayes_auto_learn_threshold_spam 15.0

    # FILE BASED
    # mkdir /etc/bayes
    bayes_path /etc/mail/spamassassin/bayes/bayes
    # Check permissions/modify if needed
    #bayes_file_mode 0666
    bayes_expiry_max_db_size 35
    # SDBM is faster than other r/w DBs
    bayes_store_module Mail::SpamAssassin::BayesStore::SDBM

    # cron weekly
    # sa-learn --force-expire
RE: Bayes Autolearning
Hi, Steve - Thanks for your response. Is that just for performance reasons?

-----Original Message-----
From: Steve Freegard [mailto:steve.freeg...@fsl.com]
Sent: Wednesday, May 01, 2013 2:24 PM
To: users@spamassassin.apache.org
Subject: Re: Bayes Autolearning

All good advice there from Axb; the only thing I'd add to that is:

    bayes_auto_learn_on_error 1

Which prevents Bayes from over-training when the classifier already agrees with what the autolearn is trying to train on.

Cheers,
Steve.

On 01/05/13 19:14, Axb wrote:
> On 05/01/2013 08:01 PM, Andrew Talbot wrote:
>> Any suggestions any of you have for a Bayes newbie - about what I just asked or otherwise - would be very much appreciated.
>
> I advocate autolearning as it has always worked fine for me. Can take a bit longer to see good results but with some tuning I can sit back and hear it purr and not worry about collecting ham and spam and training, which under certain circumstances may even be impossible.
>
> Before moving on to Redis, these were my bayes settings:
>
>     # bayes.cf
>     use_bayes 1
>     bayes_auto_learn 1
>     bayes_auto_expire 0
>     bayes_learn_to_journal 0
>
>     # Don't want to wait for the default 200 hams/spams
>     bayes_min_ham_num 20
>     bayes_min_spam_num 20
>
>     bayes_auto_learn_threshold_nonspam -1.0
>     bayes_auto_learn_threshold_spam 15.0
>
>     # FILE BASED
>     # mkdir /etc/bayes
>     bayes_path /etc/mail/spamassassin/bayes/bayes
>     # Check permissions/modify if needed
>     #bayes_file_mode 0666
>     bayes_expiry_max_db_size 35
>     # SDBM is faster than other r/w DBs
>     bayes_store_module Mail::SpamAssassin::BayesStore::SDBM
>
>     # cron weekly
>     # sa-learn --force-expire
RE: Bayes Autolearning
Hey there, thanks for responding. That's an interesting point. Are you saying I should not use autolearning at all? I don't have any way to review a large corpus of messages because we don't have access to them - after they run through our servers they are sent on, and the text of the message is not stored on our server. Man, I wish there was an easier way to feed Bayes an initial set of spam/ham to teach it properly .. I've been told that letting it autolearn for a few hours/days would make it learn well enough though. If only our mail server only got 100 messages a day - then I could just manually mark them! :) -Original Message- From: RW [mailto:rwmailli...@googlemail.com] Sent: Wednesday, May 01, 2013 6:24 PM To: users@spamassassin.apache.org Subject: Re: Bayes Autolearning On Wed, 01 May 2013 22:02:43 +0100 Steve Freegard wrote: On 01/05/13 19:40, Andrew Talbot wrote: Hi, Seve - Thanks for your response. Is that just for performance reasons? Performance is one of the things that bayes_auto_learn_on_error 1 will give you. It means that if the message was already considered spam by Bayes, then the message won't be autolearnt again which means a bit less IO. It will also result in the Bayes databases being smaller as it is likely that with this option that less tokens will be present overall which will also save disk IO and space. But the key reason I like this option is that it doesn't allow bayes to overtrain in one direction (e.g. spam or ham). It only autolearns when Bayes either has the wrong result or isn't sure which IMO has to be better for accuracy in the long run. The evidence from trials with Bogofilter (which is similar to Bayes) showed that initially train-on-everything significantly outperforms train-on-error. The latter asymptotically catches up after thousands of errors. It seems that the most important thing is to learn a few thousand hams and spams by any means; and train-on-error can take a long time to get there. 
For this reason DSPAM only allows train-on-error when 2500 hams have been learned. There *may* be advantages to train-on-error after this in preventing BAYES becoming insensitive to learning. The chief problem with autolearning is learning ham. If you set a positive threshold you end-up learning a lot of spam as ham, if you set a negative threshold you effectively turn-over ham training to the DNS whitelists since they are the only tests with significant negative scores that aren't excluded from autolearning. Any problems with miss-learning are likely to be exacerbated by train-on-error. If I had to use autolearning I'd mark the DNS whitelists as noautolearn and write some negative-scoring, site-specific rules.
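The difference between train-on-everything and train-on-error discussed above can be sketched in a few lines of Python. This is only an illustration, not SpamAssassin's actual autolearn code: the function name should_autolearn is hypothetical, the thresholds are the ones from Axb's bayes.cf earlier in the thread, and the 0.99/0.01 cutoffs for "Bayes already agrees" are an assumption. Real autolearning also excludes some rules from the score it looks at, which is ignored here.

```python
def should_autolearn(total_score, bayes_prob, on_error=True,
                     ham_threshold=-1.0, spam_threshold=15.0):
    """Return 'spam', 'ham', or None for a simplified autolearn decision.

    Hypothetical helper: thresholds mirror the bayes.cf quoted in this
    thread; the 0.99 / 0.01 agreement cutoffs are an assumption.
    """
    if total_score > spam_threshold:
        label = "spam"
    elif total_score < ham_threshold:
        label = "ham"
    else:
        return None  # score falls between the thresholds: learn nothing

    if on_error:
        # Train-on-error: skip messages the classifier already gets right.
        already_spam = bayes_prob >= 0.99
        already_ham = bayes_prob <= 0.01
        if (label == "spam" and already_spam) or (label == "ham" and already_ham):
            return None
    return label


# Bayes already agrees, so train-on-error skips the message...
print(should_autolearn(20.0, 0.995))
# ...while train-on-everything would still learn it as spam.
print(should_autolearn(20.0, 0.995, on_error=False))
```

With on_error=True the token database only grows when Bayes was wrong or unsure, which is the trade-off RW describes: less over-training, but slower to reach a few thousand learned messages.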
RE: More longer rules or fewer shorter ones?
Martin - Interesting. How many mailboxes does your deployment cover?

-----Original Message-----
From: Martin Gregorie [mailto:mar...@gregorie.org]
Sent: Thursday, April 25, 2013 8:08 PM
To: users@spamassassin.apache.org
Subject: Re: More longer rules or fewer shorter ones?

On Thu, 2013-04-25 at 18:45 -0400, Andrew Talbot wrote:
> I like your point about the portmanteau rules (and I award you two Points for using one of my favorite words in a new - yet appropriate - manner!).

:-)

> I never thought about scoring each rule as a 0.001 or something really low then tying them all together with meta-rules. It's been a while since I separated everything out but I believe I have around 1000 different checks (most of them portmanteau'd) so it seems like those meta rules would just get ... Messy. But it's a good idea, and I think I can especially make use of it in my Specific Word list.

The metas aren't too bad, though I must admit to building some of them as metas of metas to keep all lines down to 72 chars or so. Most of these submetas are simply lists of other rules that have been ANDed or ORed together.

You may find that the Portmanteau Generator reduces your rule count because it too can generate metas, which I use to deal with situations where a term can appear in more than one case, e.g. a generated rule can have this form:

    describe GENRULE Example rule
    header   __GR1 Reply-to =~ /(\@spam1\.com|\@spammer\.co\.uk|)
    header   __GR2 From =~ /(\@spam1\.com|\@spammer\.co\.uk|)
    uri      __GR3 From =~ /(\@spam1\.com|\@spammer\.co\.uk|)
    meta     GENRULE (__GR1 || __GR2 || __GR3)
    score    GENRULE 1.5

which has two advantages. First, GENRULE is a single name that covers the same spammy term regardless of where it was used, and secondly, since each generated rule has its own source file, this makes the three related lists easier to edit, since there's a good chance that a spammy term might be used in more than one of the related lists.

> Keeping the rules under 1-2mb is a good rule of thumb to follow. Luckily we're nowhere near that point yet.

Nor am I. As I said, my biggest generated rule is a bit over 9 KB.

> Can I ask how many rules you have, and how many of those are meta rules?

I have 31 portmanteau rules, of which 9 contain metas. Only 12 of these have a score exceeding 1.0 and these are not usually used as part of higher level metarules.

My local.cf is where any very specific rules live, along with the higher level metarules that use the low scoring portmanteau rules. This contains 129 rules which between them contain 96 'meta' statements. 36 of these have scores of under 1.0, so are probably used as components of metarules.

The total number of rules was obtained by using grep+wc to count lines containing '^score'. My local.cf and portmanteau.cf files are both 29 KB in size.

Martin
Re: More longer rules or fewer shorter ones?
Hi, Martin - Thank you for your response. I like your point about the portmanteau rules (and I award you two Points for using one of my favorite words in a new - yet appropriate - manner!). I never thought about scoring each rule as a 0.001 or something really low then tying them all together with meta-rules. It's been a while since I separated everything out but I believe I have around 1000 different checks (most of them portmanteau'd) so it seems like those meta rules would just get ... Messy. But it's a good idea, and I think I can especially make use of it in my Specific Word list. It's interesting that you don't use Bayes for the opposite reason that we don't - we don't do it because of high volume, you don't do it because of low volume. Go figure. Keeping the rules under 1-2mb is a good rule of thumb to follow. Luckily we're nowhere near that point yet. Can I ask how many rules you have, and how many of those are meta rules? -Original Message- From: Martin Gregorie [mailto:mar...@gregorie.org] Sent: Wednesday, April 24, 2013 3:03 PM To: users@spamassassin.apache.org Subject: Re: More longer rules or fewer shorter ones? On Wed, 2013-04-24 at 12:32 -0400, Andrew Talbot wrote: I have my customized deployment split up into a bunch of separate CF files (by category) and I have those further split up into rules based on score. I also use very long rules, mainly due to spamiferous mailing lists, because all the headers are pretty much the same (apart from sender names), so about all you're left with for spam recognition is the body content. I found a problem with very long rules, where for me 'very long' means rules longer than the width of my editor's screen. I refer to these as 'portmanteau rules' (private slang). 
As I hate editing anything that's longer than my editor's text line and find it particularly annoying to deal with such a line containing a regex consisting of a lot of alternates, I wrote a portmanteau rule generator to make their maintenance a bit easier. It is a gawk script that assembles an arbitrarily long rule from a file containing rule fragments (regexes, etc) that are each placed on a separate line. Since sounds as though you may have a similar problem, you may also find it useful. You can find it and its documentation here: http://www.libelle-systems.com/free/portmanteau/portmanteau.tgz I find it particularly helpful to make the portmanteau rules fairly low scoring and to combine them into higher scoring meta-rules, e.g. if I'm trapping sales spiel I'll have a portmanteau rule listing selling phrases, one containing monetary terms and another containing product terms and names, all scores at 0.001. I'll also have a meta-rule that ANDs these three rules together and scores around 5. This approach is much better at distinguishing spam from ham than a series of higher scoring non-meta rules and has the additional benefit of recognising sales-related text from previously unseen combinations of elements in the three rules. BTW, I don't use Bayes because my mail volume is small and I have difficulty collecting decent training corpuses and find my current setup easier to manage. They are WAY longer than that (and some of them include further nesting of the pipe), but that's the general idea. My question is: is it better performance-wise to have the rules set up like this, or to have each separate thing have its own separate rule? What JH said. When I was thinking of setting up this approach I asked about performance and limits on the size of the generated rules and was told that I shouldn't worry about rule size until they exceeded a megabyte or two. Currently my longest rule is just over 9KB, with the averages being just under 1KB and 51 alternates per rule. 
Martin
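The idea of assembling one long rule from a file of one-fragment-per-line regexes can be sketched roughly as follows. This is not Martin's gawk script, whose details are not shown in this thread; it is a minimal Python illustration, and the function name build_portmanteau_rule, the rule name SALES_PHRASES and the sample fragments are all made up for the example.

```python
def build_portmanteau_rule(name, fragments, score, description):
    """Join regex fragments into a single SpamAssassin-style body rule.

    Hypothetical helper: blank lines and '#' comment lines in the
    fragment list are skipped, everything else becomes one alternate.
    """
    alternates = [f.strip() for f in fragments
                  if f.strip() and not f.strip().startswith("#")]
    if not alternates:
        raise ValueError("no fragments supplied")
    pattern = "|".join(alternates)
    return "\n".join([
        f"describe {name} {description}",
        f"body     {name} /({pattern})/i",
        f"score    {name} {score}",
    ])


# Fragments as they might appear in a source file, one per line.
fragments = ["# selling phrases", r"buy\s+now", r"limited\s+offer", "", r"act\s+fast"]
rule = build_portmanteau_rule("SALES_PHRASES", fragments, 0.001, "Selling phrases")
print(rule)
```

Keeping the fragments one per line is what makes the long rule editable; the low score (0.001 here) follows the suggestion above of letting a meta-rule carry the real weight.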
More longer rules or fewer shorter ones?
Hey, all - I have my customized deployment split up into a bunch of separate CF files (by category) and I have those further split up into rules based on score. So, I have a bunch of stuff like:

    header   RULE_1 Subject =~ /\b(this|that|theother|blah|blah)/i
    score    RULE_1 1
    describe RULE_1 Rule 1

    header   RULE_2 Subject =~ /\b(foo|bar|etc)/i
    score    RULE_2 2
    describe RULE_2 Rule 2

They are WAY longer than that (and some of them include further nesting of the pipe), but that's the general idea.

My question is: is it better performance-wise to have the rules set up like this, or to have each separate thing have its own separate rule?
RE: More longer rules or fewer shorter ones?
John, Thanks for your prompt response! A lot of the rules are big jumbles of rules we are generating in real time and adding to as things come in. Like I said in my original question, we have them separated into separate cf files by category, and within those cf files they are separated by score. So we have just absolutely gargantuan rules for (for instance) sex words that we assign a 5 to automatically. There's also lists of specific words and phrases that we see in real-time spam (like the *$#ing garden hose spam). We are just tacking new rules on to the end to make them easier to read. Our rules properly work with (this|that|theother) if it hits any one of the words. Should we maybe have separate rules for all the phrases, since they're longer strings? There's rules in there that are like RULE Subject =~ /you.have.(new|waiting|blah|blah).*(ecard|message|calendar.invite|blah|blah) )|(garden|new|stretchy|bendy|whatever).*(hose|vaccum|other.thing) . . . . . . Etc. It goes on. .. My syntax is terrible and obviously those aren't the actual rules but the point is that it's a bunch of Or for some really long strings. Should I separate them out and have those long (this|that|theother) rules be only for single words? Alternately, should I separate out the rules with embedded pipes in them (like in the example above)? -Original Message- From: John Hardin [mailto:jhar...@impsec.org] Sent: Wednesday, April 24, 2013 12:58 PM To: users@spamassassin.apache.org Subject: Re: More longer rules or fewer shorter ones? On Wed, 24 Apr 2013, Andrew Talbot wrote: Hey, all - I have my customized deployment split up into a bunch of separate CF files (by category) and I have those further split up into rules based on score. 
So, I have a bunch of stuff like:

    header   RULE_1 Subject =~ /\b(this|that|theother|blah|blah)/i
    score    RULE_1 1
    describe RULE_1 Rule 1

    header   RULE_2 Subject =~ /\b(foo|bar|etc)/i
    score    RULE_2 2
    describe RULE_2 Rule 2

They are WAY longer than that (and some of them include further nesting of the pipe), but that's the general idea. My question is: is it better performance-wise to have the rules set up like this, or to have each separate thing have its own separate rule?

For performance, with simple lists of variant values having no repetition across the list e.g. (x|y|z){n,m}, if the most-likely variants are listed first a big rule will (generally-speaking) process less than a set of individual rules for each variant. The big rule will stop trying as soon as a match for one variant is found, whereas all of the individual rules must be tried regardless of what other rules may have hit. RULE_1 won't try matching "that", "theother", "blah", etc. if "this" matches.

Ignoring performance, the alternatives are *not* syntactically equivalent. Absent tflags multiple, RULE_1 would hit only once on a subject containing both "this" and "that" and "theother", but if you split it up into separate rules *each* would hit. This likely would affect scoring.

--
John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
---
Vista security improvements consist of attempting to shift blame onto the user when things go wrong.
---
328 days since the first successful private support mission to ISS (SpaceX)
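John's point about scoring can be illustrated with a small Python sketch. It only models the behaviour described above (one combined rule hits once, split rules each hit); it uses Python's re module rather than Perl, and the per-rule score of 1.0 is an arbitrary value chosen for the example.

```python
import re

subject = "this and that and theother"
variants = ["this", "that", "theother"]

# One combined rule: it hits at most once, however many variants appear.
combined = re.compile(r"\b(" + "|".join(variants) + r")", re.I)
combined_score = 1.0 if combined.search(subject) else 0.0

# One rule per variant: every rule that matches adds its own score.
split_score = sum(1.0 for v in variants
                  if re.search(r"\b" + v, subject, re.I))

print(combined_score, split_score)
```

So splitting a big alternation into individual rules is not just a performance decision; unless the scores are adjusted, a message containing several of the terms ends up with a higher total.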
RE: More longer rules or fewer shorter ones?
Hi again, John - It's a good idea to add the realtime rules to the beginning of the filter. I didn't realize that would have such an impact. And the (?=x) tip is a good one too; thank you for that. As far as Bayes, don't get me started! :) I work for an Email Service Provider and about 2 million messages go through our servers every day, so we have Bayes turned off because it would be too computationally expensive. I wish we could turn it on - it'd certainly make my job easier - but The Boss says no. Go figure. Autolearn, same story. Having such a large organization makes it a difficult balance to avoid false positives, too. We have one client who deals with credit reports and refinancing and stuff and pretty much every message that goes to their mailboxes looks like spam. We just have them set up to avoid all our financial rules. Luckily we don't have too many doctors' offices so we needn't really concern ourselves with legitimate Viagra email! :) I've scoured the net looking for rulesets from others that already have a lot of this stuff in there but I haven't found any rulesets since 2006. A lot of what I've seen is irrelevant - do you know a good place to get custom rulesets? I feel like there's someone else out there who already figured out how to write a rule that captures all those learn a new language spam messages so I don't need to just score Language as +4 ! : ) -Original Message- From: John Hardin [mailto:jhar...@impsec.org] Sent: Wednesday, April 24, 2013 1:53 PM To: users@spamassassin.apache.org Subject: RE: More longer rules or fewer shorter ones? On Wed, 24 Apr 2013, Andrew Talbot wrote: John, Thanks for your prompt response! A lot of the rules are big jumbles of rules we are generating in real time and adding to as things come in. Like I said in my original question, we have them separated into separate cf files by category, and within those cf files they are separated by score. 
So we have just absolutely gargantuan rules for (for instance) sex words that we assign a 5 to automatically. There's also lists of specific words and phrases that we see in real-time spam (like the *$#ing garden hose spam). We are just tacking new rules on to the end to make them easier to read. Our rules properly work with (this|that|theother) if it hits any one of the words. Should we maybe have separate rules for all the phrases, since they're longer strings? There's rules in there that are like RULE Subject =~ /you.have.(new|waiting|blah|blah).*(ecard|message|calendar.invite|blah |blah) )|(garden|new|stretchy|bendy|whatever).*(hose|vaccum|other.thing) . . . . . . Etc. It goes on. .. My syntax is terrible and obviously those aren't the actual rules but the point is that it's a bunch of Or for some really long strings. Should I separate them out and have those long (this|that|theother) rules be only for single words? Simple alternations on phrases are equivalent to simple alternations on single words with respect to the performance concerns. Performance is more governed by the number of alternations and the presence of repetition and .* than their raw length. You might want to limit the total number of alternations per rule. Another performance optimization would be to ensure all of the alternations in a given rule start with the same letter, and put (?=x) before the list of alternatatives e.g. /\b(?=x)(x1|x2|x3|x4)/ so that the engine can skip more easily. If they are simple alternations, it also depends on how you want to score them. For poison pill words or phrases, sure, a long alternation with a high score will be pretty efficient. I'd suggest tacking new hits onto the *front* of the list of alternatives, though, as it's reasonable to assume a spam run will use the same phrasing for a while, then change. Alternately, should I separate out the rules with embedded pipes in them (like in the example above)? 
Yeah, avoiding nested alternatives where possible will help. Is Bayes not catching things like this? -Original Message- From: John Hardin [mailto:jhar...@impsec.org] Sent: Wednesday, April 24, 2013 12:58 PM To: users@spamassassin.apache.org Subject: Re: More longer rules or fewer shorter ones? On Wed, 24 Apr 2013, Andrew Talbot wrote: Hey, all - I have my customized deployment split up into a bunch of separate CF files (by category) and I have those further split up into rules based on score. So, I have a bunch of stuff like: header RULE_1 Subject =~ /\b(this|that|theother|blah|blah)/i score RULE_1 1 describe RULE_1 Rule 1 header RULE_2 Subject =~ /\b(foo|bar|etc)/i score RULE_2 2 describe RULE_2 Rule 2 They are WAY longer than that (and some of them include further nesting of the pipe), but that's the general idea. My question is: is it better performance-wise to have the rules set up like this, or to have each separate thing have its own separate rule? For performance, with simple lists
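The (?=x) suggestion can be checked for correctness with a short Python sketch. It shows only that adding the lookahead does not change what matches when every alternative starts with the same letter; any speed benefit depends on the regex engine (the advice above concerns Perl), so nothing is claimed about timing here. The word list is invented for the example.

```python
import re

# All alternatives start with the same letter, as the tip requires.
words = ["viagra", "valium", "vicodin", "vioxx"]

plain = re.compile(r"\b(" + "|".join(words) + r")", re.I)
guarded = re.compile(r"\b(?=v)(" + "|".join(words) + r")", re.I)

samples = ["Cheap Valium here", "nothing to see", "Vioxx recall news", "very vague"]
results = [bool(guarded.search(s)) for s in samples]

# The guarded pattern must agree with the plain one on every sample.
for s in samples:
    assert bool(plain.search(s)) == bool(guarded.search(s))
print(results)
```

The lookahead lets an engine reject a position after inspecting a single character instead of trying each alternative in turn, which is why it only makes sense when the alternatives share their first letter.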
Re: Fwd: RE: alert: New event: ET EXPLOIT Possible SpamAssassin Milter Plugin Remote Arbitrary Command Injection Attempt (fwd)
On Thu, 10 Feb 2011, Michael Scheidell wrote:

http://seclists.org/fulldisclosure/2010/Mar/140
http://www.securityfocus.com/bid/38578

Vulnerable: SpamAssassin Milter Plugin SpamAssassin Milter Plugin 0.3.1

I don't see anything on bugtraq about a fix.

The securityfocus page lists some Debian fixes. The Debian patch spamass-milter_0.3.1-8+lenny2.diff.gz changelog includes:

+spamass-milter (0.3.1-8+lenny1) stable-security; urgency=high
+
+  * Use new popenenv function instead of open; fixes remote code exploit
+    as the spamass-milter user when run using -x. (closes: #573228)
+
+ -- Don Armstrong d...@debian.org  Wed, 17 Mar 2010 12:52:56 -0700

per http://security.debian.org/pool/updates/main/s/spamass-milter/

--
Andrew Daviel, TRIUMF, Canada
Tel. +1 (604) 222-7376 (Pacific Time)
Network Security Manager
URLs with Spaces
Hello, I'm wondering if I'm missing some rules that would have given this message more points - I know it's missing bayes (I'm not sure why as our servers should use bayes, but it seems not to have been run for this message.) http://www.pastebin.ca/1473975 Thanks -- Andrew.
Re: URLs with Spaces
Kasper Sacharias Eenberg wrote:
> There's been a rule circulating this mailing list for a couple of weeks. This is the latest edition to catch those med-things (afaik).
>
> --
> body     AE_MEDS35 /\bwww\s(?:\W\s)?\w{3,6}\d{2,6}\s(?:\W\s)?(?:c\s?o\s?m|n\s?e\s?t|o\s?r\s?g)\b/i
> describe AE_MEDS35 obfuscated domain in message
> score    AE_MEDS35 5.0
> --
>
> It works good for me.

Thanks Kasper,

Also the Sanesecurity sigs for Clam catch it (thanks to Steve)
FuzzyOCR only runs when specifying spamassassin -D
I've been looking at some of the spam emails I've received lately with images attached and noticed that FuzzyOCR wasn't running against them. The same seems to be true when I take these messages and run them with:

    spamassassin -t img-email.eml

However if I run them through as follows, I get FuzzyOCR showing up in the results:

    spamassassin -t -D img-email.eml

I also get substantially different AWL results between the two (although I guess that may be part of the debug procedure).

Does anyone know why this might be happening? I seem to recall experiencing this before, but can't remember what I did to fix it.

spamassassin -t:

Content analysis details:   (22.2 points, 5.0 required)

 pts rule name              description
---- ---------------------- --------------------------------------------------
 1.2 RCVD_IN_PBL            RBL: Received via a relay in Spamhaus PBL
                            [68.186.154.187 listed in zen.spamhaus.org]
 3.0 RCVD_IN_XBL            RBL: Received via a relay in Spamhaus XBL
 0.9 RCVD_IN_SORBS_DUL      RBL: SORBS: sent directly from dynamic IP address
                            [68.186.154.187 listed in dnsbl.sorbs.net]
 3.5 BAYES_99               BODY: Bayesian spam probability is 99 to 100%
                            [score: 1.]
 1.0 FH_HELO_EQ_CHARTER     Helo is d-d-d-d charter.com
 4.3 HELO_DYNAMIC_HCC       Relay HELO'd using suspicious hostname (HCC)
 4.4 HELO_DYNAMIC_IPADDR2   Relay HELO'd using suspicious hostname (IP addr 2)
 0.0 FH_HELO_EQ_D_D_D_D     Helo is d-d-d-d
 2.0 RCVD_IN_BL_SPAMCOP_NET RBL: Received via a relay in bl.spamcop.net
                            [Blocked - see ]
 0.0 HTML_MESSAGE           BODY: HTML included in message
 0.1 RDNS_DYNAMIC           Delivered to trusted network by host with
                            dynamic-looking rDNS
 1.8 AWL                    AWL: From: address is in the auto white-list

spamassassin -t -D:

Content analysis details:   (25.7 points, 5.0 required)

 pts rule name              description
---- ---------------------- --------------------------------------------------
 3.0 RCVD_IN_XBL            RBL: Received via a relay in Spamhaus XBL
                            [68.186.154.187 listed in zen.spamhaus.org]
 1.2 RCVD_IN_PBL            RBL: Received via a relay in Spamhaus PBL
 0.9 RCVD_IN_SORBS_DUL      RBL: SORBS: sent directly from dynamic IP address
                            [68.186.154.187 listed in dnsbl.sorbs.net]
 3.5 BAYES_99               BODY: Bayesian spam probability is 99 to 100%
                            [score: 1.]
 1.0 FH_HELO_EQ_CHARTER     Helo is d-d-d-d charter.com
 4.3 HELO_DYNAMIC_HCC       Relay HELO'd using suspicious hostname (HCC)
 4.4 HELO_DYNAMIC_IPADDR2   Relay HELO'd using suspicious hostname (IP addr 2)
 0.0 FH_HELO_EQ_D_D_D_D     Helo is d-d-d-d
 2.0 RCVD_IN_BL_SPAMCOP_NET RBL: Received via a relay in bl.spamcop.net
                            [Blocked - see ]
 0.0 HTML_MESSAGE           BODY: HTML included in message
 0.1 RDNS_DYNAMIC           Delivered to trusted network by host with
                            dynamic-looking rDNS
  10 FUZZY_OCR_KNOWN_HASH   BODY:
-5.2 AWL                    AWL: From: address is in the auto white-list
Always show test scores in email header
Is it possible to have a header, or X-Spam-Status, always show the individual scores for each of the tests performed against a particular email (whether it is tagged as spam or not)? I see that when using MailScanner with SpamAssassin this always happens, but cannot replicate the same for a straight SpamAssassin installation.

This is an example of what I get in an email's source from MailScanner and would like to replicate in SpamAssassin:

    X-MailScanner-Spam: not spam, SpamAssassin (not cached, score=4.616,
        required 5, BAYES_40 -0.18, DCC_CHECK 4.50, HTML_MESSAGE 0.00,
        RDNS_DYNAMIC 0.10, SARE_HTML_USL_A 0.20)

Regards,
Andrew Bruce
Re: Always show test scores in email header
On Tue, 31 Mar 2009 23:08:14 -0400, Matt Kettler mkettler...@verizon.net wrote:
> Andrew Bruce wrote:
>> Is it possible to have a header, or in X-Spam-Status always show the individual scores for each of the test performed against a particular email (whether it is tagged as spam or not)? I see that when using MailScanner with SpamAssassin this always happens, but cannot replicate the same for a straight SpamAssassin installation. This is an example of what I get in an emails source from MailScanner and would like to replicate in SpamAssassin:
>>
>>     X-MailScanner-Spam: not spam, SpamAssassin (not cached, score=4.616,
>>         required 5, BAYES_40 -0.18, DCC_CHECK 4.50, HTML_MESSAGE 0.00,
>>         RDNS_DYNAMIC 0.10, SARE_HTML_USL_A 0.20)
>
> You're using MailScanner, which generates its own markup. SA by default always adds such a header, but MailScanner doesn't use it. There's an option in MailScanner.conf to make MailScanner do this. It's something like always include spamassassin report or something like that.

Odd, because on SpamAssassin it never showed that header unless the message was marked as spam. Although I should have mentioned that it's being called through amavisd-new, which may have had something to do with it.

I've added a custom header, and played with the $sa_tag_level_deflt values in amavis; now the header shows up:

    X-Spam-Scores: ALL_TRUSTED=-1.8,BAYES_00=-2.599,HTML_MESSAGE=0.001,
        MIME_HTML_ONLY=1.457,NO_DNS_FOR_FROM=1.496
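Once a per-test header like the X-Spam-Scores line above is present, it is easy to post-process. The following Python sketch parses the header value quoted in this message into a dictionary and totals it; the function name parse_scores is made up, and the NAME=value,NAME=value layout is assumed from that one example rather than from any documented format.

```python
def parse_scores(header_value):
    """Split 'NAME=score,NAME=score,...' into a {name: float} dict.

    Hypothetical helper; the format is assumed from the single
    X-Spam-Scores example quoted in this thread.
    """
    scores = {}
    for item in header_value.split(","):
        item = item.strip()  # folded header lines leave stray whitespace
        if not item:
            continue
        name, _, value = item.partition("=")
        scores[name] = float(value)
    return scores


header = ("ALL_TRUSTED=-1.8,BAYES_00=-2.599,HTML_MESSAGE=0.001, "
          "MIME_HTML_ONLY=1.457,NO_DNS_FOR_FROM=1.496")
scores = parse_scores(header)
total = round(sum(scores.values()), 3)
print(scores)
print(total)
```

Summing the individual values is a quick sanity check that the header lists every rule that contributed to the overall score.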
Re: Spamc giving different scores
On Thu, 26 Mar 2009 18:15:01 -0700 (PDT), asimsinan yuksel.asim.si...@gmail.com wrote:
> I ran spamc a couple of times. It sometimes gives different scores for the same email. Sometimes it gives higher than 5, sometimes lower. What can be wrong?

Pipe the email through SpamAssassin on the command line using the command below:

    spamassassin -Dt /path/to/email

You can then see the full output and what checks are hitting and missing and what the scores are.

Andrew
Not scoring well on 'claims of £500,000 pounds' type emails
Hello,

Our setup seems to work pretty well, but some spams are slipping through. Has anyone got any suggestions of rules that will catch these types of emails: http://www.pastebin.ca/1266571

I do run Bayes, but it seems that Bayes didn't run for this message. I also run the sought rules, and greylist before spamassassin for most messages. (v3.2.4)

Thanks,
Andrew.
Re: using RHEL / CentOS / Fedora perl?
Justin Mason wrote: have you seen this? http://blog.vipul.net/2008/08/24/redhat-perl-what-a-tragedy/ That bug in Red Hat perl will almost definitely slow down SpamAssassin, too, I would say. Can anyone verify? --j. This fixed it for me on a couple of centos servers: http://people.centos.org/z00dax/bz379791/
Re: using RHEL / CentOS / Fedora perl?
Randal, Phil wrote: Andrew Hearn wrote: Justin Mason wrote: have you seen this? http://blog.vipul.net/2008/08/24/redhat-perl-what-a-tragedy/ That bug in Red Hat perl will almost definitely slow down SpamAssassin, too, I would say. Can anyone verify? --j. This fixed it for me on a couple of centos servers: http://people.centos.org/z00dax/bz379791/ Did you notice a real-world performance boost after doing that? Got any numbers for pre- and post- spamassassin performance? No, not that I've noticed yet anyway ;-)
Fraud spam text in .doc attachments
Hi, Has anyone else seen emails with Word documents attached, where the Word document contains the text of an 'African fraud'? Example: http://pastebin.com/mad34c97 I've not seen a Word document plugin for SpamAssassin; is there one? Thanks! -- Andrew Hearn
Not scoring high enough on this spam...
http://pastebin.ca/961075 I've only seen one so far, but apart from the 0.0 BAYES_50 (I will learn this message), does anyone have rules that push this kind of message over 5.0? Thanks! Andrew
Ensuring Custom Rules Are Scored Properly
I'm experimenting with Fedora 8 and a miltered sendmail configuration running as a mail gateway (smf-sav, smf-spf, milter-greylist, clamav-milter, spamass-milter). I've configured spamassassin's local.cf with a custom rule. It's a simple regex which checks the 'Received' header on inbound mail for any IP in a specific Class C range, and negatively scores the message with -100 (probably extreme). I'm just trying to ensure these messages are never tagged as spam. I've --lint-ed the rule and I receive no syntax errors. However, messages coming in from an IP in the specified range don't appear to be negatively scored. In fact, the test messages being sent were scored as, say, 2.8 before AND after the rule was put into place. Spamass and spamassassin (as I'm running spamassassin daemonized) were both restarted after rule creation. I've verified the regex is correct by running it through a couple of regex testers. So, I guess I'd be expecting the X-Spam header on these messages to indicate a score of -97.2. Am I assuming incorrectly? Thanks
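A regex tester only shows that the pattern matches the sample text pasted into it, so it can help to check the same idea against a complete Received header outside SpamAssassin. The following is a minimal Python sketch of that check, not the poster's rule: the network 192.0.2.0/24, the sample headers and the function names are all hypothetical stand-ins, and it says nothing about how SpamAssassin itself evaluates header rules.

```python
import ipaddress
import re
from typing import List

# Hypothetical /24 ("Class C") range standing in for the poster's network.
TRUSTED_NET = ipaddress.ip_network("192.0.2.0/24")

# Relay addresses usually appear in square brackets in a Received header.
IPV4_IN_BRACKETS = re.compile(r"\[(\d{1,3}(?:\.\d{1,3}){3})\]")


def received_ips(header: str) -> List[ipaddress.IPv4Address]:
    """Return every syntactically valid bracketed IPv4 address in the header."""
    ips = []
    for text in IPV4_IN_BRACKETS.findall(header):
        try:
            ips.append(ipaddress.IPv4Address(text))
        except ipaddress.AddressValueError:
            pass  # e.g. [999.1.1.1] looks like an address but is not one
    return ips


def from_trusted_net(header: str) -> bool:
    """True if any relay address in the header falls inside TRUSTED_NET."""
    return any(ip in TRUSTED_NET for ip in received_ips(header))


inside = "from mail.example.com (mail.example.com [192.0.2.45]) by mx.example.org with ESMTP"
outside = "from mail.example.net (mail.example.net [198.51.100.7]) by mx.example.org with ESMTP"
print(from_trusted_net(inside), from_trusted_net(outside))
```

Parsing the address and testing network membership avoids the usual regex pitfalls with dotted quads, such as an unescaped dot or a pattern that also matches a longer number.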
Re: How many use CRM114?
Blaine Fleming wrote: Slightly off-topic, but I'm curious, how many of you are using CRM114? How well does it work for you? Was it difficult to train? I've been looking at it and haven't found much except the official plugin guide and a single page saying that it works better than other learning methods. Any info would be appreciated. Hello I've only just started using it on a test server, I'll let you know how I find the results! Andrew
Re: Lots Of SPAM
Tarak Ranjan wrote: Hi List, I have posted my raw email at http://pastebin.ca/918849 . I'm receiving 1000 to 4000 messages of this kind per day. SA is also skipping this kind of mail. / Tarak
I get 8.2 without Bayes...
1.5 IXHASH2 BODY: mail has been classified as spam @ LogInSolutions AG, Germany
0.0 CLAMAV Clam AntiVirus detected something...
4.0 JM_SOUGHT_1 JM_SOUGHT_1
0.2 RDNS_NONE Delivered to trusted network by a host with no rDNS
2.5 CLAMAV_SANESPAM found by ClamAV SaneSecurity signatures
(JM_SOUGHT was talked about earlier on the list) Andrew.
Not sure why DOS_OE_TO_MX fired
Hello, I'm not sure why DOS_OE_TO_MX fired on this message, as the headers say it was delivered to b.painless.aaisp.net.uk which relayed it on to z.hopeless.aaisp.net.uk. b.painless isn't the MX for the domain... Any ideas? -Thanks! Return-path: [EMAIL PROTECTED] Envelope-to: [EMAIL PROTECTED] Delivery-date: Fri, 14 Dec 2007 11:45:39 + Received: from [2001:8b0:0:81::51bb:5134] (helo=b.painless.aaisp.net.uk) by z.hopeless.aaisp.net.uk with esmtp (Exim 4.63) (envelope-from [EMAIL PROTECTED]) id 1J38z2-0004B8-FV for [EMAIL PROTECTED]; Fri, 14 Dec 2007 11:45:39 + Received: from [217.169.3.9] (helo=DFTJ542J) by b.painless.aaisp.net.uk with smtp (Exim 4.62) (envelope-from [EMAIL PROTECTED]) id 1J38z2-00036f-7g for [EMAIL PROTECTED]; Fri, 14 Dec 2007 11:45:36 + Message-ID: [EMAIL PROTECTED] From: Fiona Murphy [EMAIL PROTECTED] To: [EMAIL PROTECTED] Subject: website emergency! Date: Fri, 14 Dec 2007 11:45:33 - MIME-Version: 1.0 Content-Type: multipart/alternative; boundary==_NextPart_000_00AF_01C83E46.D5CB6A50 X-Priority: 3 X-MSMail-Priority: Normal X-Mailer: Microsoft Outlook Express 6.00.2900.3138 X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.3198 X-Virus-Scanned: Clear (Version: ClamAV 0.91.2/5116/Fri Dec 14 07:14:39 2007, by smtp.aaisp.net.uk) X-AA-SMTP-Time-Scanned:YES X-Spam-Score: 4.0 X-AASpam-Report: Spam detection software, running on the system b.spamless.aaisp.net.uk, has processed this message. This message scored (4.0 points and 4.6 are required to mark as spam) pts rule name description -- -- 1.2 HTML_MESSAGE BODY: HTML included in message 0.0 BAYES_50 BODY: Bayesian spam probability is 40 to 60% [score: 0.5071] 0.0 NO_VIRUS_FOUND There were no viruses found in this message by ClamAV 2.8 DOS_OE_TO_MX Delivered direct to MX with OE headers
Re: HELO_DYNAMIC_SPLIT_IP
Giampaolo Tomassoni wrote: -Original Message- From: Andrew Hearn [mailto:[EMAIL PROTECTED] Sent: Tuesday, December 11, 2007 12:04 PM Hi, Can anyone explain why this email: http://pastebin.ca/811938 is getting a hit on HELO_DYNAMIC_SPLIT_IP? I'm seeing a few ham messages being caught by this (SpamAssassin version 3.2.3, sa-update)
smtp.aaisp.net.uk maps to two IP addresses (81.187.81.51 and 81.187.81.52). An outgoing mail server is supposed to announce itself via HELO with its own, specific name, not with a service name (like smtp.etc.etc). aaisp.net.uk could define the following:
smtp1 A 81.187.81.51
smtp2 A 81.187.81.52
smtp A 81.187.81.51
 A 81.187.81.52
where the last name is only meant for their customers, as the place to submit mail for delivery. Then, when delivery occurs, the SMTP server should identify itself with its unique name. For example:
EHLO smtp1.aaisp.net.uk
This also allows two different entries to be defined in aaisp.net.uk's reverse DNS mappings:
51 PTR smtp1.aaisp.net.uk.
52 PTR smtp2.aaisp.net.uk.
which may help to identify the abused host, should that ever happen. Giampaolo
Thanks for the reply and explanation, I'll look into this!
HELO_DYNAMIC_SPLIT_IP
Hi, Can anyone explain why this email: http://pastebin.ca/811938 is getting a hit on HELO_DYNAMIC_SPLIT_IP? I'm seeing a few ham messages being caught by this (SpamAssassin version 3.2.3, sa-update) Thanks! Andrew
Re: SQL-based AWL and Bayes not working with 3.2.3
Rene Caspari wrote: Hi, I'm using SpamAssassin 3.2.3 with user-specific rules from an SQL database:
/etc/spamassassin/local.cf:
user_scores_dsn DBI:mysql:spamassassin:localhost
[...]
bayes_store_module Mail::SpamAssassin::BayesStore::SQL
[...]
auto_whitelist_factory Mail::SpamAssassin::SQLBasedAddrList
spamc is called by procmail. /etc/procmailrc:
:0fw
* < 256000
| /usr/bin/spamc -U /var/run/spamd.sock -u $USER
(where $USER is created by Postfix: /usr/bin/procmail -t -m USER=${recipient} SENDER=${sender} /etc/procmailrc)
Since I updated to 3.2.3 (Debian Volatile) I get this error message in /var/log/mail.log:
[...] spamd: still running as root: user not specified with -u, not found, or set to root, falling back to nobody
After this, SpamAssassin uses the user-specific SQL tables with the user nobody, not the specific user who is the recipient of the mail being scanned. Do you have an idea how I can resolve this?
I think I have the same problem on one of our test servers. This is the one I'm running 3.2.3 on, using the same config as our other 3.1.7 machines, which are happy with Bayes... User preferences are being used; I can tell because the required score is being set correctly from the preferences. -- Andrew Hearn
user_in_whitelist , how do I find out which one?
I have many users in the whitelist_from in the local.cf. When I get forwarded spam email like this, how do I find which one it matched? Which FROM entry is it actually looking at? -Andrew X-Spam-Checker-Version: SpamAssassin 3.2.1 (2007-05-02) on xphotonics.com X-Spam-Level: X-Spam-Status: No, score=-72.0 required=5.0 tests=BAYES_50,DCC_CHECK, DIGEST_MULTIPLE,DRUGS_ERECTILE,HTML_MESSAGE,HTML_MIME_NO_HTML_TAG, MIME_HTML_ONLY,PYZOR_CHECK,RAZOR2_CF_RANGE_51_100,RAZOR2_CF_RANGE_E4_51_100, RAZOR2_CHECK,SARE_FROM_DRUGS,UNPARSEABLE_RELAY,USER_IN_WHITELIST autolearn=no version=3.2.1 X-Spam-Pyzor: Reported 4263 times. X-Spam-Report: * -100 USER_IN_WHITELIST From: address is in the user's white-list * 1.7 SARE_FROM_DRUGS From a drug * 5.5 UNPARSEABLE_RELAY Informational: message has unparseable relay lines * 0.0 HTML_MESSAGE BODY: HTML included in message * 0.0 BAYES_50 BODY: Bayesian spam probability is 40 to 60% * [score: 0.5000] * 3.5 MIME_HTML_ONLY BODY: Message only has text/html MIME parts * 5.0 RAZOR2_CHECK Listed in Razor2 (http://razor.sf.net/) * 1.5 RAZOR2_CF_RANGE_E4_51_100 Razor2 gives engine 4 confidence level * above 50% * [cf: 100] * 0.5 RAZOR2_CF_RANGE_51_100 Razor2 gives confidence level above 50% * [cf: 100] * 5.0 PYZOR_CHECK Listed in Pyzor (http://pyzor.sf.net/) * 5.0 DCC_CHECK Listed in DCC (http://rhyolite.com/anti-spam/dcc/) * 0.0 DIGEST_MULTIPLE Message hits more than one network digest check * 0.3 DRUGS_ERECTILE Refers to an erectile drug * 0.1 HTML_MIME_NO_HTML_TAG HTML-only message, but there is no HTML tag Received: from xphotonics.com (localhost [127.0.0.1]) by xphotonics.com (8.14.1/8.14.1) with ESMTP id l9MFJIOp032936 (version=TLSv1/SSLv3 cipher=DHE-DSS-AES256-SHA bits=256 verify=NO) for [EMAIL PROTECTED]; Mon, 22 Oct 2007 11:19:18 -0400 (EDT) (envelope-from [EMAIL PROTECTED]) Received: (from [EMAIL PROTECTED]) by xphotonics.com (8.14.1/8.14.1/Submit) id l9MFJIKX032935 for xiang; Mon, 22 Oct 2007 11:19:18 -0400 (EDT) (envelope-from lian) 
Received: from 029ae8f252bf4ac (84pavel.dialup.corbina.ru [85.21.237.209]) by xphotonics.com (8.14.1/8.14.1) with SMTP id l9MFHg8N032899 for [EMAIL PROTECTED]; Mon, 22 Oct 2007 11:17:44 -0400 (EDT) (envelope-from [EMAIL PROTECTED]) Date: Mon, 22 Oct 2007 11:17:42 -0400 (EDT) Received: from Susana Ware (10.11.17.11) by 029ae8f252bf4ac (PowerMTA(TM) v3.2r4) id hfp31o62d55j87 for [EMAIL PROTECTED]; Mon, 22 Oct 2007 07:17:20 +0300 Message-Id: [EMAIL PROTECTED] To: [EMAIL PROTECTED] Subject: October 79% OFF From: VIAGRA ?Official Site [EMAIL PROTECTED] MIME-Version: 1.0 Content-Type: text/html; charset=iso-8859-1 Content-Transfer-Encoding: 8bit X-Virus-Scanned: ClamAV 0.91.1/4559/Mon Oct 22 00:02:57 2007 on xphotonics.com X-Virus-Scanned: ClamAV 0.91.1/4559/Mon Oct 22 00:02:57 2007 on xphotonics.com X-Virus-Status: Clean
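One way to narrow down which whitelist_from entry fired is to test the candidate addresses from the message (From:, envelope sender and so on) against the list offline. The sketch below is a rough Python approximation, assuming only that whitelist_from entries are file-glob-style patterns where * and ? are wildcards, as the Mail::SpamAssassin::Conf documentation describes. The entries, the addresses and the function names are hypothetical, and SpamAssassin's own matching may differ in details.

```python
import re
from typing import List


def glob_to_regex(pattern: str):
    """Translate a whitelist-style glob ('*' and '?' only) to a compiled regex."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))  # everything else is taken literally
    return re.compile("".join(parts), re.IGNORECASE)


def matching_entries(address: str, entries: List[str]) -> List[str]:
    """Return every entry whose pattern matches the whole address."""
    return [e for e in entries if glob_to_regex(e).fullmatch(address)]


# Hypothetical whitelist_from values copied out of a local.cf.
entries = ["*@example.com", "friend@example.org", "*.example.net"]
for candidate in ["Bob@Example.com", "someone@mail.example.net", "spammer@example.info"]:
    print(candidate, "->", matching_entries(candidate, entries))
```

Running each address found in the headers through matching_entries shows quickly whether an overly broad pattern, such as one covering a whole domain, is the likely culprit.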
Problem with ERROR: invalid byte sequence for encoding UTF8: 0x8a
I keep seeing these in my postgresql log file. What did I do wrong? ERROR: invalid byte sequence for encoding UTF8: 0xd255 HINT: This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by client_encoding. STATEMENT: SELECT spam_count, ham_count, atime FROM bayes_token WHERE id = $1 AND token = $2 ERROR: invalid byte sequence for encoding UTF8: 0xd255 HINT: This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by client_encoding. STATEMENT: INSERT INTO bayes_token (id, token, spam_count, ham_count, atime) VALUES ($1,$2,$3,$4,$5) Here is my local.cf file: http://www.pastebin.ca/639583 spamd runs with these arguments: /usr/bin/spamd -d -i 127.0.0.1 -m 5 -H -q -x -d --pidfile=/var/run/spamd.pid Any help would be appreciated. Thanks.
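For context on the error text itself: the byte pair 0xd2 0x55 quoted in the log is not a valid UTF-8 sequence, because 0xd2 announces a two-byte character and 0x55 is not a continuation byte. The Python sketch below only illustrates that fact. It assumes nothing about how the Bayes SQL store encodes tokens, and the helper name is made up for the example.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


token = bytes([0xD2, 0x55])                  # the byte pair from the log
print(is_valid_utf8(token))                  # rejected as UTF-8
print(is_valid_utf8("Ò".encode("utf-8")))    # same character, properly encoded
print(repr(token.decode("latin-1")))         # the same bytes read as ISO-8859-1
```

Raw 8-bit tokens of this kind can only be stored if the database encoding, or the client_encoding mentioned in the HINT, accepts arbitrary bytes, which is what the HINT is pointing at.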
Auto-whitelist Errors others.
I am having some serious problems with SpamAssassin. For example, check out my logs:
Mar 8 14:42:32 penguin spamd[15553]: spamd: connection from localhost [127.0.0.1] at port 52601
Mar 8 14:42:32 penguin spamd[15553]: spamd: setuid to root succeeded
Mar 8 14:42:32 penguin spamd[15553]: spamd: still running as root: user not specified with -u, not found, or set to root, falling back to nobody at /usr/bin/spamd line 1147, GEN15 line 4.
Mar 8 14:42:32 penguin spamd[15553]: spamd: processing message [EMAIL PROTECTED] for root:99
Mar 8 14:42:32 penguin spamd[15553]: locker: safe_lock: cannot create tmp lockfile /var/spool/spamassassin/auto_whitelist.lock.penguin.leapcash.com.15553 for /var/spool/spamassassin/auto_whitelist.lock: Permission denied
Mar 8 14:42:32 penguin spamd[15553]: auto-whitelist: open of auto-whitelist file failed: locker: safe_lock: cannot create tmp lockfile /var/spool/spamassassin/auto_whitelist.lock.penguin.leapcash.com.15553 for /var/spool/spamassassin/auto_whitelist.lock: Permission denied
Mar 8 14:42:32 penguin spamd[15553]: spamd: identified spam (1000.0/5.0) for root:99 in 0.2 seconds, 834 bytes.
Mar 8 14:42:32 penguin spamd[15553]: spamd: result: Y 999 - GTUBE,NO_RECEIVED,NO_RELAYS scantime=0.2,size=834,user=root,uid=99,required_score=5.0,rhost=localhost,raddr=127.0.0.1,rport=52601,mid=[EMAIL PROTECTED],autolearn=no
Mar 8 14:42:32 penguin spamd[15537]: prefork: child states: II
Here are the permissions for the folder:
drw-rw-rw- 2 root nobody 4096 Mar 8 14:35 spamassassin/
And the files:
-rw-rw1 root nobody 12288 Mar 30 2006 auto-whitelist
-rw-rw1 root nobody 12288 Feb 16 2005 bayes_seen
-rw-rw1 root nobody 12288 Feb 16 2005 bayes_toks
-rw-r--r-- 1 root nobody 1218 Feb 16 2005 user_prefs
Now, if the spamassassin folder has group write access for nobody, then why won't it write to this folder?
Mar 8 14:42:32 penguin spamd[15553]: spamd: still running as root: user not specified with -u, not found, or set to root, falling back to nobody at /usr/bin/
That tells me it's using the nobody user.
Mar 8 14:42:32 penguin spamd[15553]: locker: safe_lock: cannot create tmp lockfile /var/spool/spamassassin/auto_whitelist.lock.penguin.leapcash.com.15553 for /var/spool/spamassassin/auto_whitelist.lock: Permission denied
So why the error above? Any help GREATLY appreciated =) -- View this message in context: http://www.nabble.com/Auto-whitelist-Errors---others.-tf3371373.html#a9381345 Sent from the SpamAssassin - Users mailing list archive at Nabble.com.
Re: [2] Auto-whitelist Errors others.
Why does a directory need execute permissions?
Theo Van Dinter-2 wrote: On Thu, Mar 08, 2007 at 11:44:31AM -0800, Andrew Rosolino wrote:
Mar 8 14:42:32 penguin spamd[15553]: spamd: setuid to root succeeded
Mar 8 14:42:32 penguin spamd[15553]: spamd: still running as root: user not specified with -u, not found, or set to root, falling back to nobody at /usr/bin/spamd line 1147, GEN15 line 4.
Don't call spamd (via spamc) as root.
Here are the permissions for the folder: drw-rw-rw- 2 root nobody 4096 Mar 8 14:35 spamassassin/
That's definitely not going to work. 0777, not 0666 (directory, not a file). -- Randomly Selected Tagline: You can't build a reputation on what you are going to do. - Henry Ford -- View this message in context: http://www.nabble.com/Auto-whitelist-Errors---others.-tf3371373.html#a9386463 Sent from the SpamAssassin - Users mailing list archive at Nabble.com.
Re: [2] Auto-whitelist Errors others.
Thanks guys everything is good now =D! Phil Barnett wrote: On Thursday 08 March 2007 19:46, Andrew Rosolino wrote: Why does a directory need execute permissions? Because you can't use it and you can't move into it unless it does. -- Balmer is basically saying: We know there's a problem but we're not going to tell you what it is because we want to ambush you in the future. -- View this message in context: http://www.nabble.com/Auto-whitelist-Errors---others.-tf3371373.html#a9386624 Sent from the SpamAssassin - Users mailing list archive at Nabble.com.
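The point made in this thread, that a directory needs the execute (search) bit as well as the write bit before anything can be created inside it, can be expressed as a small check on the mode bits. This is a simplified Python sketch of the standard Unix rule, with a hypothetical function name. It ignores root (which bypasses these checks), ACLs and the sticky bit.

```python
import stat


def can_create_in_dir(mode: int, who: str) -> bool:
    """Can a non-root 'owner', 'group' or 'other' user create a file in a
    directory with this mode? Both write and execute bits are required."""
    bits = {
        "owner": (stat.S_IWUSR, stat.S_IXUSR),
        "group": (stat.S_IWGRP, stat.S_IXGRP),
        "other": (stat.S_IWOTH, stat.S_IXOTH),
    }
    write_bit, exec_bit = bits[who]
    return bool(mode & write_bit) and bool(mode & exec_bit)


# drw-rw-rw- (0666): writable on paper, but nobody can enter the directory.
print(can_create_in_dir(0o666, "group"))
# drwxrwxrwx (0777): the fix suggested in the thread.
print(can_create_in_dir(0o777, "group"))
# drwxrwx--- (0770): a tighter mode that would still let the group write.
print(can_create_in_dir(0o770, "group"))
```

With mode 0666 the lock file cannot be created even though the write bit is set, which matches the "Permission denied" from safe_lock in the original log. The 0770 case is an addition for illustration only and was not proposed in the thread.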
Rules - How to capture matched text
Hello, In Perl you can use $&, or parens with $1, $2, etc., to capture the text that matched a regex; but how do you do it in SA? Thank you, Andrew
Re: Rules - How to capture matched text
On 12/18/06 at 3:41 PM, [EMAIL PROTECTED] (Theo Van Dinter) wrote: On Mon, Dec 18, 2006 at 02:39:13PM -0500, Andrew Brosnan wrote: In Perl you can use $&, or parens with $1, $2, etc., to capture the text that matched a regex; but how do you do it in SA? It depends what you're trying to do. If you want to do matching between different rules, you can't do it, short of writing a plugin to do what you want. If you want to match within the same regex, it's like any other regex: /([a-z]+) foo bar \1/ Generally speaking, capturing increases resource usage, so don't do it unless necessary (hence the large amount of (?:...) instead of (...) in the rules). Thanks Theo, I'd like the rule to catch when the first name in From: is also the Subject:. I was going to capture the name in From: and compare it to Subject:. I'll have to give some thought to how I can do that without capturing text. :-) Regards, Andrew
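The two ideas in this exchange, a backreference inside a single regex and a comparison across two headers, can be tried outside SpamAssassin. The Python sketch below uses the same pattern Theo quotes. The cross-header function is a hypothetical stand-in for what a plugin would have to do, and its name and sample addresses are invented for illustration.

```python
import re
from email.utils import parseaddr

# Same shape as the rule regex in the thread: \1 must repeat what group 1 matched.
BACKREF = re.compile(r"([a-z]+) foo bar \1")

# A non-capturing group: groups alternatives without storing the match.
NON_CAPTURING = re.compile(r"(?:foo|bar) baz")


def first_name_is_subject(from_header: str, subject: str) -> bool:
    """True if the first word of the From: display name equals the Subject:
    (compared case-insensitively). A single regex cannot do this, because
    the two values live in different headers."""
    display_name, _address = parseaddr(from_header)
    words = display_name.split()
    if not words:
        return False
    return words[0].lower() == subject.strip().lower()


print(bool(BACKREF.search("abc foo bar abc")))   # repeated word: match
print(bool(BACKREF.search("abc foo bar xyz")))   # different word: no match
print(NON_CAPTURING.search("bar baz").groups())  # no groups were captured
print(first_name_is_subject("Fiona Murphy <fiona@example.com>", "fiona"))
```

The backreference works because both occurrences sit in the same string. Comparing From: with Subject: needs code that sees both headers, which is why the reply points to a plugin.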
Score counting error
Hi, In my headers I see: X-Spam-Status: No, score=4.3 required=4.4 tests=BAYES_99,NO_RELAYS autolearn=disabled version=3.1.7 X-Spam-Report: * -0.0 NO_RELAYS Informational: message was not relayed via SMTP * 4.4 BAYES_99 BODY: Bayesian spam probability is 99 to 100% * [score: 1.] Seems odd that score doesn't add up? (4.4 + 0.0 = 4.3!!) -- Andrew Hearn
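One plausible explanation, offered as an assumption rather than a confirmed diagnosis, is rounding: rule scores can carry more decimal places than the report shows, so each line and the total are rounded separately. The Python sketch below uses made-up scores (4.36 and -0.02) chosen only to reproduce the displayed figures. The real values for this message are not known from the post.

```python
def displayed(score: float) -> str:
    """Format a score the way a one-decimal report line would show it."""
    return f"{score:.1f}"


# Hypothetical underlying scores; only their rounded forms appear in the report.
rule_scores = {"NO_RELAYS": -0.02, "BAYES_99": 4.36}

for name, score in rule_scores.items():
    print(displayed(score), name)

total = sum(rule_scores.values())   # roughly 4.34 before rounding
print("score=" + displayed(total))
```

With these values the report lines read -0.0 and 4.4 while the total rounds to 4.3, the same pattern as in the headers above. The "-0.0" on the NO_RELAYS line fits this reading, since it suggests a small negative score rather than an exact zero.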
Newbie Question
Hi, I'm writing some code to integrate SpamAssassin with Apache JAMES. I want to set up an address to allow me to pipe spam into sa-learn. I have a prototype of this working fine, but would like to allow various webmail client users to be able to forward spam messages to this address. As I have a very limited understanding of how SA works, I don't want to end up blocking the forwarding addresses. If I whitelist the forwarding addresses, can I then simply pipe a forwarded spam from that address into sa-learn, or is there more to it? Thanks a lot for your help. -- Kind Regards Andrew Sykes [EMAIL PROTECTED] Sykes Development Ltd http://www.sykesdevelopment.com
Re: Newbie Question
Matt, Thank you, that makes things a lot clearer. Is there any way to utilise forwarded messages, or is it a lost cause? Thanks Andrew On Fri, 2006-11-24 at 10:22 -0500, Matt Kettler wrote: Andrew Sykes wrote: Hi, I'm writing some code to integrate SpamAssassin with Apache JAMES. I want to set up an address to allow me to pipe spam into sa-learn. I have a prototype of this working fine, but would like to allow various webmail client users to be able to forward spam messages to this address. As I have a very limited understanding of how SA works, I don't want to end up blocking the forwarding addresses. If I whitelist the forwarding addresses, can I then simply pipe a forwarded spam from that address into sa-learn, or is there more to it? There's MUCH more to it. In fact, whitelisting won't really affect what sa-learn does at all. Generally speaking, forwarded messages are mostly useless to sa-learn. Exactly how useless depends a bit on the mail client. SA tokenizes MANY mail headers, including Received:, not just From: and To:. All the headers in a forwarded message are completely new, thus the sa-learn process will be learning the headers generated by forwarding, and not spam. SA also tokenizes the body of the message. However, most mail clients substantially modify the body of the message when you forward. Generally speaking, they only preserve one of the MIME sections in a multipart/alternative message. Spammers FREQUENTLY have text/plain sections which are dissimilar from the text/html. By forwarding you're losing all but one MIME section (generally text/html is kept). On top of this, most mail clients also insert "Forwarded message:" type text into the body, and add "Fwd:" to the subject. SA also tokenizes the in-body MIME headers describing how the message was encoded. However, when you forward, the mail client doing the forward re-encodes things its own way. What might have been base64 encoded may now be quoted-printable, 8 bit, or 7 bit.
So, fundamentally, as far as Bayes is concerned, the forwarded message is a completely different message from the original spam. You can try this sometime by taking an original spam and a forwarded version of it, and feeding them both to spamassassin or sa-learn with -D bayes added. This will cause the debug output to list all the tokens used. Take a look at the tokens. Some are the same, but many are different. -- Kind Regards Andrew Sykes [EMAIL PROTECTED] Sykes Development Ltd http://www.sykesdevelopment.com
Re: Sudden drop in spam-rate, parallel to a surge of new trojans - beware
Chris wrote: On Tuesday 21 November 2006 6:47 pm, Chr. v. Stuckrad wrote: Hi! Yesterday we had a sudden drop in spam-percentage from 80% to near 60%. Parallel to it I got six copies of an undetectable (by NAI and ClamAV) new trojan 'exe' in the Mail. Do we have to prepare for a new flood by an updated (just now reorganizing) botnet? Stucki Yes, I did see a drop in yesterday's spam load: Total: 255 reports in 16m 54s. 3.97 seconds per report. Mon Nov 20 21:01:17 CST 2006 compared with Sunday's: Total: 434 reports in 30m 34s. 4.22 seconds per report. Sun Nov 19 20:03:19 CST 2006 But today's was a killer!: Total: 580 reports in 39m 28s. 4.08 seconds per report. Tue Nov 21 22:08:56 CST 2006 Sorry to be OT, but are these spam stats a built-in feature of SA, or have you got a plugin to get this information? Thanks! -- Andrew Hearn
Spam with two subject headers
Hello, I'm running SpamAssassin 3.1.3 on Qmail. 99% of the spam that is processed by SA has the subject header rewritten. A few times a day, however, there are spams that get processed by SA and do not have the 'detected spam' string in the subject. In these spams there are two Subject lines: the first is the original subject and the second is the string that identifies an email as spam. The 'X-Spam-Prev-Subject' header says '(nonexistent)', which is not the case at all! Here are two links to the headers of two of these spams: spam_1 http://boxmodel.com/spam.txt spam_2 http://boxmodel.com/more_spam.txt I'd really appreciate any advice that this group could give me to help me resolve this issue. Many thanks in advance. Andrew
RE: Spam with two subject headers
Sorry for not being more specific. I'm not using qmail-scanner, just thought it might be helpful to mention qmail is my MTA. I have the same results as you after removing SA markup and retesting... The difference between the two however is the X-Spam-Prev-Subject header - it doesn't read '(nonexistent)' as it did in the email links I posted. Also the missing subject rule never got hit during the test of the cleaned email. Any chance spamd is not processing the same? Perhaps a clever spammer trick? -Original Message- From: Theo Van Dinter [mailto:[EMAIL PROTECTED] Sent: Thursday, November 16, 2006 8:06 To: users@spamassassin.apache.org Subject: Re: Spam with two subject headers On Thu, Nov 16, 2006 at 07:43:52AM -0800, Andrew Hawthorne wrote: I'm running SpamAssassin 3.1.3 on Qmail. What does that mean exactly? qmail-scanner ? Here are two links to the headers of two of these spams: spam_1 http://boxmodel.com/spam.txt spam_2 http://boxmodel.com/more_spam.txt I took them both, removed the SA markup, added GTUBE appropriately, and ran it through w/ a rewrite_header Subject ... config, and it worked fine. -- Randomly Selected Tagline: This tagline is ANNOYWARE! To register, send me some fish.
Subject not rewritten, two subject headers
Greetings, I've been receiving a number of spams lately that are being correctly identified as spam by SA; however, the subject line is not being rewritten. I have noticed that there are two Subject lines and that the X-Spam-Prev-Subject header states (nonexistent). Below is part of one of the email headers that contains the two Subjects. When the email is delivered, the subject reads "Full of health? Then don't click!", completely untouched! All other SA headers appear normal and are not included, to try and make this message smaller. This message's score was 50+. I'm running SpamAssassin 3.1.3. Any help resolving this would be greatly appreciated. ~thanks Subject: Full of health? Then don't click! Date: Wed, 15 Nov 2006 00:30:22 +0100 MIME-Version: 1.0 Content-Type: multipart/related; type=multipart/alternative; boundary=ms030907010507030208050907 X-Priority: 3 X-MSMail-Priority: Normal X-Mailer: Microsoft Outlook Express 6.00.2900.2180 X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.2180 Subject: ***SPAM*** X-Spam-Prev-Subject: (nonexistent) This is a multi-part message in MIME format. --ms030907010507030208050907 Content-Type: multipart/alternative;
RE: Subject not rewritten, two subject headers
Question, since you only quoted some of the headers.. is there a blank line anywhere in the headers before the subject header? There are no blank lines... anything else I should check? I attempted to send all the headers and the email was bounced back to me because it was too spammy *grin*. ~thanks
Re: White listing yahoo groups
On Tue, 14 Nov 2006 10:21:02 -0800, Bill Moseley [EMAIL PROTECTED] wrote: [...] Yes, it is my machine rejecting the mail that is flagged spam. And when I reject too many messages Yahoo's mailing list software considers my email non-working and stops delivering list messages. Snap! I have the same issue here, I reject with a high score, and it only takes one to put it into bounce mode. Also, they never let you know you are bouncing until like the next couple of days. The other problem is I have a system here which does some checks on the SMTP transaction and performs checks which gets to SA, and due to the way Yahoo delivers the messages to multiple recipients on the same domain (through sending the message multiple times in the same SMTP transaction) this caused problems as well. I guess I'm just curious how others deal with mailing lists. I suspect just like any other mail -- if a message has a high enough spam score then reject it. I am going to try some of the other messages in this thread - may take a while though, as I have to wait for one to trip the system. Andrew.
Training sa-learn from Outlook.
I imagine the following questions have been asked a lot, but I haven't seen the exact answers I'm after yet, so here goes. We are running qmail, vpopmail, SpamAssassin and SMB shares using Samba, among other things, on FreeBSD. I want to set up public ham and spam folders such that our users can drag emails into them from Outlook. I can then set up a cron job that runs sa-learn on those folders and deletes the mail. Can I just create two public Samba shares, use those for the emails, and run sa-learn on them? I guess not, because by this stage the emails have been wrecked by Outlook. How else can I do this? Also, I don't understand exactly the implications of which user you run sa-learn under. How do I set this up when running sa-learn? I suppose if I run it as the same user as vpopmail then this will work? Apologies if these questions have already been covered on this mailing list or elsewhere. Andrew.
Re: Where to install imageinfo.pm?
BG Mahesh wrote: hi I am using SA-3.1.4. I am in the process of installing http://www.rulesemporium.com/plugins.htm Where do I install ImageInfo.pm http://www.rulesemporium.com/plugins/ImageInfo.pm [which directory]? On my FreeBSD box, I put ImageInfo.pm here: /usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/Plugin/ Andrew
Is anyone else seeing these?
Is anyone else seeing this sort of spam? It consists of a short message and always has a URL in it that ends with the string '/sk/'. The URL points to a web site advertising human growth hormone and testosterone treatment. These spams aren't firing on enough rules to be tagged by SpamAssassin. The URL changes often enough that the URIBL plugin doesn't catch a lot of them. Has anyone had more luck than me at stopping these emails? Andrew just wanted to see if you were still dreaming the notion of getting toned? I so want to be, that is why i am so joyous i chanced upon http://www.dontimesogooder.org/sk/ It was best decisevely having someone to support me out. to examine it, I found career that it was of the beasts rain again closing visit religious conviction, as much
Re: SA-LEARN Question
Jim Maul wrote: Christopher Mills wrote: Hi, We have over 100 domains on a server, all of which are getting junk mail. SA 3.1.4 installed, but I don't think it's properly trained yet (even though I did upgrade from an earlier version). If I set up a [EMAIL PROTECTED] address and tell all my customers to forward the junk mail they get to that address, then run sa-learn on that mailbox, will that help, or, will it train SA that the users that forwarded the junk ARE the spammers and start to assign higher scores to legitimate customers? If you forward the emails, this process will not work. You must either forward it as an attachment and then strip the attachment and run sa-learn on that or use some other method which preserves the original headers. How you do this depends largely on your setup. Here's a link describing how I use maildrop to deliver emails to special maildirs for processing by sa-learn. http://www.arda.homeunix.net/spamassassin.html#bayesian Andrew
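The "forward as an attachment, then strip the attachment" step can be sketched with Python's standard email package. This is an illustration only, assuming the mail client attaches the original as a message/rfc822 part. The function name and the sample messages are hypothetical, and handing the result to sa-learn is left out.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage
from typing import List


def extract_attached_messages(raw: bytes) -> List[bytes]:
    """Return every message/rfc822 attachment in raw form, with its
    original headers intact, ready to be handed to a learner."""
    outer = message_from_bytes(raw, policy=policy.default)
    found = []
    for part in outer.walk():
        if part.get_content_type() == "message/rfc822":
            inner = part.get_payload(0)   # the embedded message object
            found.append(inner.as_bytes())
    return found


# Build a stand-in for "spam forwarded as an attachment".
spam = EmailMessage()
spam["From"] = "spammer@example.net"
spam["Subject"] = "Buy now"
spam.set_content("cheap pills")

wrapper = EmailMessage()
wrapper["From"] = "customer@example.org"
wrapper["Subject"] = "Fwd: spam"
wrapper.set_content("See attached.")
wrapper.add_attachment(spam)              # attached as message/rfc822

extracted = extract_attached_messages(wrapper.as_bytes())
print(len(extracted))
print(extracted[0].decode("ascii", "replace"))
```

The extracted bytes carry the spammer's headers and body rather than the customer's forwarding headers, which is the property the reply above says sa-learn needs.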
Re: spamassassin on qmail
Kjetil Kjernsmo wrote: On Saturday 29 July 2006 08:48, Kaushal Shriyan wrote: does spamassassin work on qmail MTA Yes. Also, you might want to look into using the qpsmtpd component, as it gives you a lot of power over the SMTP dialogue: http://smtpd.develooper.com/ You might also want to have a look at my Howto describing my netqmail/SpamAssassin setup. http://www.arda.homeunix.net/spamassassin.html Andrew
whitelists and blacklists question
I currently have SA running in a site-wide configuration using spamc/spamd. I would like to implement whitelists/blacklists on a per account basis. I use qmail and maildrop, so for per account processing, I plan to invoke SA from a .mailfilter file and keep user prefs in a SQL database. My question is, can I invoke SA to check user whitelists/blacklists only without it running any rule tests? Email will already contain SA headers from the site-wide SA installation. I want to use per account whitelists/blacklists as a possible override to whatever verdict the site-wide SA gives an email.
SpamAssassin Howto
I've written a Howto document describing my SpamAssassin setup. I have a site-wide configuration using spamd/spamc with Bayesian and auto-whitelist data in a MySQL database. If anyone is interested in having a look, you can find it here: http://www.arda.homeunix.net/spamassassin.html Of course, constructive feedback is always welcome. Andrew
Re: Spam Assassin Detecting our emails as spam
spectacularstuff wrote: I have just set up Spam Assassin on our server. It is working very nicely however whenever we try to send an email from our own server to someone else on the same server, it gets picked up as spam. I am wondering if anyone here has experience with Spam Assassin and can help me fix the issues below as I don't know what they mean exactly. I have spam assassin set to detect at 8 points whether or not an email is spam. We are way over that because of the following reasons. What do I have to fix on our server to fix the 4 issues below?
1. We are losing 3.4 points because of HELO_DYNAMIC_IPADDR.
2. We are losing 2.6 points because of NO_DNS_FOR_FROM.
3. We are losing 2.0 points because of RCVD_IN_SORBS_DUL.
4. We are losing 1.7 points because of RCVD_IN_NJABL_DUL.
Here is a standard header from Spam Assassin that we get when we send each other email. Code:
3.4 HELO_DYNAMIC_IPADDR Relay HELO'd using suspicious hostname (IP addr1)
0.1 HTML_TAG_EXIST_TBODY BODY: HTML has tbody tag
0.7 MIME_HTML_MOSTLY BODY: Multipart message mostly text/html MIME
0.0 HTML_MESSAGE BODY: HTML included in message
2.6 NO_DNS_FOR_FROM DNS: Envelope sender has no MX or A DNS records
2.0 RCVD_IN_SORBS_DUL RBL: SORBS: sent directly from dynamic IP address [68.56.175.199 listed in dnsbl.sorbs.net]
1.7 RCVD_IN_NJABL_DUL RBL: NJABL: dialup sender did non-local SMTP [68.56.175.199 listed in combined.njabl.org]
-0.2 AWL AWL: From: address is in the auto white-list
Thanks for any help with this. Wayne -- View this message in context: http://www.nabble.com/Spam+Assassin+Detecting+our+emails+as+spam-t1653798.html#a4480701 Sent from the SpamAssassin - Users forum at Nabble.com.
Read about trusted_networks and internal_networks in the Mail::SpamAssassin::Conf man page. These parameters go into your local.cf configuration file. Andrew
Re: Even More Sa-update Problems
David Baron wrote:
> I have this working fine. However, once that 0300011 directory exists, all my custom rules (i.e. bayes, regex tests, etc) are no longer working and most all spams get through! Took it off once again. Something needs be modified before this can be used.

I just set up sa-update myself. I've downloaded the latest ruleset from updates.spamassassin.org and restarted spamd. I don't seem to be having the problem you describe, though. I have the Bayesian filter on and some SARE rulesets in /usr/local/etc/mail/spamassassin and I'm still seeing hits with them. Here are the options I use when I start spamd:

    spamd --siteconfigpath=/usr/local/etc/mail/spamassassin --pidfile=/var/run/spamd.pid

Before I set up sa-update, I also had the --configpath option set to /usr/local/share/spamassassin. I had to take this out; otherwise spamd wouldn't find the rulesets in /var/lib/spamassassin/3.001001/.

I'm using SpamAssassin 3.1.1 on FreeBSD, by the way.

Andrew
Re: Even More Sa-update Problems
David Baron wrote:
> On Sunday 14 May 2006 21:24, Andrew wrote:
>>> I have this working fine. However, once that 0300011 directory exists, all my custom rules (i.e. bayes, regex tests, etc) are no longer working and most all spams get through! Took it off once again. Something needs be modified before this can be used.
>>
>> I just set up sa-update myself. I've downloaded the latest ruleset from updates.spamassassin.org and restarted spamd. I don't seem to be having the problem you describe, though. I have the Bayesian filter on and some SARE rulesets in /usr/local/etc/mail/spamassassin and I'm still seeing hits with them. Here are the options I use when I start spamd.
>>
>>     spamd --siteconfigpath=/usr/local/etc/mail/spamassassin --pidfile=/var/run/spamd.pid
>
> OK. There is no siteconfigpath in /etc/init.d/spamassassin nor in /etc/default/spamassassin which gives this is $OPTIONS. It would be easy enough to try. Put this in which of these files?

Browsing through the Mail::SpamAssassin::Conf man page, I couldn't find a configuration file parameter equivalent to the --siteconfigpath command line option for spamd. I'd put it in your /etc/init.d/spamassassin startup script.

>> Before I set up sa-update, I also had the --configpath option set to /usr/local/share/spamassassin. I had to take this out otherwise spamd wouldn't find the rulesets in /var/lib/spamassassin/3.001001/
>
> This is probably the default. Now it looks first in the version set /var/lib . 3.001001/, Does the siteconfigpath override or add to this (most probably adds or should) ? Should there be a multiple siteconfigpath? Symlinks to various directories from the ...3.001001?

Here is what appears to be happening on my system.

1. Because I don't have configpath set, spamd is looking in /var/lib/spamassassin for rulesets.
2. Because it finds the /var/lib/spamassassin directory, it doesn't check /usr/local/share/spamassassin, where the rulesets distributed with SpamAssassin reside.
3. Because I have set siteconfigpath, spamd loads extra rulesets and configuration info from /usr/local/etc/mail/spamassassin.

In my case, what spamd finds in siteconfigpath is definitely used in addition to what it finds in /var/lib/spamassassin. I've never tried specifying more than one siteconfigpath. My gut feeling is that it won't work. I can't think of a reason why I would need more than one.

Andrew
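The three observations above can be written down as a small model. This is only a sketch of the behaviour seen on this one system, not SpamAssassin's actual lookup code: the function rule_dirs and its parameters are hypothetical, and only the three paths come from the message.

```python
from typing import List, Optional, Set

# Paths quoted in the message above.
UPDATE_DIR = "/var/lib/spamassassin"
DEFAULT_RULES_DIR = "/usr/local/share/spamassassin"
SITE_CONFIG_DIR = "/usr/local/etc/mail/spamassassin"


def rule_dirs(existing: Set[str],
              configpath: Optional[str] = None,
              siteconfigpath: Optional[str] = None) -> List[str]:
    """Hypothetical model: directories read, in order, as observed above."""
    if configpath is not None:
        # An explicit --configpath is used as given, so the sa-update
        # directory is never consulted (the reason it had to be removed).
        dirs = [configpath]
    elif UPDATE_DIR in existing:
        # The sa-update directory exists, so the distributed rules are skipped.
        dirs = [UPDATE_DIR]
    else:
        dirs = [DEFAULT_RULES_DIR]
    if siteconfigpath is not None:
        # Site rules are loaded in addition to whatever was chosen above.
        dirs.append(siteconfigpath)
    return dirs


print(rule_dirs({UPDATE_DIR}, siteconfigpath=SITE_CONFIG_DIR))
```

Under this model, custom rules placed in the siteconfigpath directory keep working after sa-update has run, which matches what I see here.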
Re: Rule to select sender starting with string
Matt Kettler wrote:
> Al Danks wrote:
>> Matt Kettler mkettler at evi-inc.com writes:
>>> Try a rule something like this:
>>>
>>>     header L_FROM_STRING From =~ /$string/
>>
>> It appears that the rule is also hitting senders with the string following a '.'. I.e. From =~ /$com/ hits
>>
>>     comalksdfl.net
>>     aksafjdla.com
>
> Interesting.. that shouldn't happen with the $ there.. I'll have to test that, unless Theo or one of the other devs can offer an explanation as to why..

Are SA regexes different from other regexes? If not, use '^' to anchor a pattern at the beginning of a string and '$' to anchor it at the end. Try this pattern: /^com/

Andrew
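To illustrate the anchors, here is a minimal sketch using Python's re module rather than Perl or SpamAssassin itself; for patterns this simple I would expect '^' and '$' to behave the same way. The two test strings are the bare domains from the message, not full From headers, so a real rule would still need to account for the rest of the address.

```python
import re

# The two sender domains quoted in the thread.
senders = ["comalksdfl.net", "aksafjdla.com"]

# No anchor: "com" may appear anywhere in the string.
unanchored = [s for s in senders if re.search(r"com", s)]

# '^' anchors the match at the beginning of the string.
starts_with = [s for s in senders if re.search(r"^com", s)]

# '$' anchors the match at the end of the string.
ends_with = [s for s in senders if re.search(r"com$", s)]

print("unanchored: ", unanchored)
print("starts with:", starts_with)
print("ends with:  ", ends_with)
```

Only the '^' version restricts the match to strings that start with "com".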
X-Originating and X-Apparently-From
Hi, We are trying to perform DNSBL checks on incoming mail and we are not seeing any actual DNS queries. When looking at the code it seems that the information on which IP(s) to check is obtained from X-Originating and X-Apparently-From headers. Grepping through the code I do not see these headers anywhere else. We are using Postfix as our MTA, perhaps that is the problem? We could either write a postfix rule or edit the SA code to check the Received header. Thanks, Andrew
Re: X-Originating and X-Apparently-From
> Andrew Doughety wrote:
>> Hi, We are trying to perform DNSBL checks on incoming mail and we are not seeing any actual DNS queries. When looking at the code it seems that the information on which IP(s) to check is obtained from X-Originating and X-Apparently-From headers.
>
> No, SA should be checking the IPs from the Received: headers. However, make sure your trust path is working correctly. If you ever see spam matching ALL_TRUSTED, then that email is going to be exempt from DNSBL tests. 9 times out of 10, this is the trust-path guesser being confused by a NAT config. See the wiki on how to fix this: http://wiki.apache.org/spamassassin/TrustPath

Restricting the trusted path fixed the problem. Thanks!
problem with AWL and SQL
I'm trying to set up SA to use MySQL to store the Auto WhiteList but it's just not working out for me. SA seems to be trying to create a lock file on disk. The problem is that I run spamd as a user which doesn't have a home directory. Here is what I find in my spamd log files.

    @40004445b038302e5f9c [59195] error: locker: safe_lock: cannot create tmp lockfile /nonexistent/.spamassassin/auto-whitelist.lock.lorien.arda.homeunix.net.59195 for /nonexistent/.spamassassin/auto-whitelist.lock: No such file or directory
    @40004445b0383030c4e4 [59195] warn: auto-whitelist: open of auto-whitelist file failed: locker: safe_lock: cannot create tmp lockfile /nonexistent/.spamassassin/auto-whitelist.lock.lorien.arda.homeunix.net.59195 for /nonexistent/.spamassassin/auto-whitelist.lock: No such file or directory

(The funny strings with the '@' sign at the beginning of the lines are timestamps. I use daemontools to run spamd instead of inetd.)

Is this normal behaviour even when using an SQL database to store the AWL? Here are the relevant parameters from my local.cf file.

    user_awl_dsn           DBI:mysql:saawl:localhost:3306
    user_awl_sql_username  sa
    user_awl_sql_password  password
    user_awl_sql_table     awl

I use MySQL with the same credentials to store the Bayesian database and that's working fine. Only the AWL is giving me a problem. I can manually log into the saawl database and even insert and delete rows as the sa user.

Andrew
Re: Bayes learning email address
John D. Hardin wrote:
> On Sat, 15 Apr 2006, mouss wrote:
>> - you are trusting your users to make the right decision. The problem is that different people have different opinions of what is spam and what is not. Things get even worse if one user isn't honest...
>
> That's a problem with *any* scheme for allowing the users to train Bayes themselves. In practice, however, I think you'll see much more apathy than stupidity or malice. My problem was with getting my users to even *look at* their marginal-spams folder and classify the messages. Ever.

You should check for things like your own quota notification messages in the spam folder. If you send a boilerplate email in response to someone sending an email to your abuse or postmaster address, check for that too. I used to work for a fairly large ISP and we got these sorts of things sent to us all the time.

Andrew
Bayes rules taking minutes - solved by moving to innodb?
Hi, people. This started as a plea for help but ended as a report of an investigation, so hopefully it will be a useful addition to the archives.

About 1% of my scans were taking more than 300 seconds. Extra debugging in spamd showed me that the Bayes checks were the culprit:

    13:38:05 spamd[16852]: slow: run_eval_tests BAYES_40 took 773 seconds
    13:45:18 spamd[16852]: slow: run_eval_tests BAYES_80 took 427 seconds
    13:45:20 spamd[16852]: slow: do_body_eval_tests(0) took 1212 seconds

I am using per-user Bayes (on the recommendation of half this list, and against the recommendation of the other half :-), and perform about 100,000 scans per day. Bayes_seen was ~ 150M, and bayes_token ~ 1.5G. The bayes_token index was 4.7G.

MySQL's slow query log showed that the queries did not take long to execute after they achieved a lock, but I suspected they were not getting their locks in reasonable time:

    mysql> SHOW STATUS LIKE 'Table%';
    +-----------------------+--------+
    | Variable_name         | Value  |
    +-----------------------+--------+
    | Table_locks_immediate | 171036 |
    | Table_locks_waited    | 220999 |
    +-----------------------+--------+

In a healthy database, table_locks_waited is a small fraction of table_locks_immediate.

I turned off bayes_auto_expire in case it was the expiry which caused the contention, but no change. I need bayes_auto_expire turned on because as we've discussed before, there is no way to perform expiration for every user in an SQL Bayes database.

Well, I started this email a week ago and now I've found that at peak times, SHOW PROCESSLIST shows many threads -- like 100 -- locked on SELECT FROM bayes_token and INSERT INTO bayes_token. So I tried to convert bayes_token to InnoDB to take advantage of its row-level locking (this is advised by the developers but not reflected in bayes_mysql.sql). After MySQL worked on that for a few days I stopped it, dropped the database (innodb was very confused), and recreated the database and all tables using innodb and two-byte IDs.
It's early days, with only 7.6M tokens seen and few accounts over the activation mark of 200 ham. But I'm hoping my timeout problems are over.

So my advice is:

    SHOW STATUS LIKE 'Table%';
    SHOW PROCESSLIST;

Change to innodb

    ALTER TABLE bayes_token MODIFY id SMALLINT UNSIGNED NOT NULL, MODIFY spam_count SMALLINT UNSIGNED NOT NULL, MODIFY ham_count SMALLINT UNSIGNED NOT NULL;
    ALTER TABLE bayes_expire MODIFY id SMALLINT UNSIGNED NOT NULL;
    ALTER TABLE bayes_seen MODIFY id SMALLINT UNSIGNED NOT NULL;
    ALTER TABLE bayes_vars MODIFY id SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT;

--
Andrew Donkin
Waikato University, Hamilton, New Zealand
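For anyone comparing their own numbers, the lock-contention check from the SHOW STATUS output earlier in this message comes down to simple arithmetic. A minimal sketch, using the two counter values from that output; the helper name lock_wait_ratio is hypothetical, and where to draw the line for "healthy" is a judgment call rather than a MySQL rule.

```python
def lock_wait_ratio(immediate: int, waited: int) -> float:
    """Table_locks_waited divided by Table_locks_immediate."""
    if immediate <= 0:
        raise ValueError("Table_locks_immediate must be positive")
    return waited / immediate


# Counter values from the SHOW STATUS output quoted in this message.
immediate = 171036
waited = 220999

ratio = lock_wait_ratio(immediate, waited)
waited_share = waited / (immediate + waited)

print(f"waited / immediate: {ratio:.2f}")
print(f"share of lock requests that had to wait: {waited_share:.0%}")
```

With these figures more lock requests waited than were granted immediately, which is the opposite of the "small fraction" you would want to see.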
Re: SQL Bayes - MyISAM locks a problem?
Duane Hill has:
> per-user [...] just over 10 gig [...] InnoDB [...] http://wiki.apache.org/spamassassin/DBIPlugin [...] bayes_vars table has 14,102 rows

Jason Frisvold:
> I'll have to give innodb a try.. :) Thanks for the tip...

Jason, if you haven't moved to innodb already, try SHOW PROCESSLIST in mysql. Do you have many threads locked on SELECT FROM bayes_token and INSERT INTO bayes_token? I had about 100 threads locked, so I am changing to InnoDB for its fine-grained locking. About three days ago I issued ALTER TABLE bayes_token ENGINE innodb. I'll let you know when it finishes.

--
Andrew Donkin
Waikato University, Hamilton, New Zealand
Re: prefork: server reached --max-clients setting, consider raising it messages
> After upgrading to 3.1 from 3.0 we are starting to see the following error messages in our logs
>
>     prefork: server reached --max-clients setting, consider raising it

Short version: try --round-robin on spamd.

We scan about 100k messages a day balanced (with -H) between two spamd hosts. Traffic is bursty, and during the bursts a lot of spam leaks through unchecked because spamc reaches its 120s timeout. The really annoying thing is that a spamd child would continue to chew on its message for a further few hundred seconds before classifying it, only to find that spamc had already given up. That child could have been working for another spamc. I wonder if there is a way for spamd to catch SIGPIPE or some other message from the client, and abort.

So I added --round-robin and things improved markedly. The logging isn't nearly so good (grep prefork: without --round-robin draws you a great load histogram) but far less spam is leaking through.

One theory is that spamd doesn't spawn children quickly enough to cope with rapidly-ramping load. I was thinking of ripping out spamd's one-new-child-per-second throttling to see if it improved matters, but that experiment is way down the task list now.

Try --round-robin. Scale it up until your spamd hosts are maximising the use of their RAM. Note that your spamd hosts should be similarly capable - spamc will split the load evenly between all of them, even when all children are busy on one.

--
Andrew Donkin
Waikato University, Hamilton, New Zealand