DKIM length 'l=' tag

2024-06-03 Thread Andrew C Aitchison



The DKIM RFC
   https://datatracker.ietf.org/doc/html/rfc6376#section-8.2
tells us that it is not safe to rely on the DKIM length (l=) tag
and
   https://www.zone.eu/blog/2024/05/17/bimi-and-dmarc-cant-save-you/
shows how it can be used to subvert BIMI*.

I am looking at extending Mail::SpamAssassin::Plugin::DKIM to indicate
when a DKIM body signature only covers part of the message body,
and how much of the body is unsigned (bytes, percentage or possibly both).

I am new to the SpamAssassin code, so any comments or suggestions would be
welcome.
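
A minimal sketch of the calculation described above, in Python rather than the plugin's Perl. It assumes the raw DKIM-Signature header value and the length of the canonicalized body are already available. The function name and the sample signature are hypothetical, and this is not how Mail::SpamAssassin::Plugin::DKIM is implemented.

```python
import re

def unsigned_body_stats(dkim_signature, canonical_body_len):
    """Return (unsigned_bytes, unsigned_percent) for one DKIM-Signature value."""
    # Tags in a DKIM-Signature are separated by ';' and the l= tag is optional.
    match = re.search(r"(?:^|;)\s*l\s*=\s*(\d+)", dkim_signature)
    if match is None:
        # No l= tag: the signature covers the whole body.
        return 0, 0.0
    # Simplification: an l= value larger than the body is clamped here,
    # whereas a real verifier would treat that signature as failing.
    signed = min(int(match.group(1)), canonical_body_len)
    unsigned = canonical_body_len - signed
    percent = 100.0 * unsigned / canonical_body_len if canonical_body_len else 0.0
    return unsigned, percent

# Hypothetical signature: only the first 1000 of 4000 body bytes are signed.
sig = "v=1; a=rsa-sha256; d=example.com; s=sel; l=1000; bh=abc; b=xyz"
print(unsigned_body_stats(sig, 4000))
```

Reporting both numbers seems useful: a few unsigned bytes is usually a mailing-list footer, while a large unsigned percentage is the pattern the zone.eu article describes.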


* I am not a fan of BIMI, but big name players appear to be using
it to display "trustable" logos on GUI mail clients, so users *will*
be caught when it breaks.

Thanks,

--
Andrew C. Aitchison  Kendal, UK
   and...@aitchison.me.uk


Re: Spamassassin 4 and ClamAVMultipleScores.

2023-11-03 Thread Andrew Hearn
Thanks for the reply Jimmy.

After playing some more - with priorities in clamav.cf, I got it working,
and was just about to explain a fix, when I noticed Henrik has updated the
ClamAVMultipleScores page to have a similar (actually better!) fix that I
was going to suggest!

# Run CLAMAV early so all the rules here will see the results
priority CLAMAV -10

and removal of all the individual priorities

Thanks Henrik!

Andrew.

On Fri, 3 Nov 2023 at 02:15, Jimmy  wrote:

>
> The X-Spam-Virus could be absent from the email header.
>
> You can consider adding the following line:
>
> add_header spam Virus _VIRUSRESULT_
>
> If this doesn't work, the ClamAV plugin might need to include
> "put_metadata('X-Spam-Virus')" when it detects a virus.
>
> Jimmy
>
>
> On Fri, Nov 3, 2023 at 4:06 AM Andrew Hearn  wrote:
>
>> Hello,
>>
>> We're using clam, some extra signatures, and the plugin/config as
>> described on
>> https://cwiki.apache.org/confluence/display/SPAMASSASSIN/ClamAVMultipleScores
>> to give different signature families different scores.
>>
>> Since moving to v4, I don't think it's working...
>>
>> The only rule that is matched now, is the generic CLAMAV_VIRUS rule.
>> The rules for the various other signatures are no longer matched.
>> Could this be due to the change in priorities for meta rules, and now
>> these meta rules are running before they get to see the results from clam?
>>
>> I can send my config examples and debug output if that's helpful.
>>
>> Thanks!
>>
>


Spamassassin 4 and ClamAVMultipleScores.

2023-11-02 Thread Andrew Hearn
Hello,

We're using clam, some extra signatures, and the plugin/config as described
on
https://cwiki.apache.org/confluence/display/SPAMASSASSIN/ClamAVMultipleScores
to give different signature families different scores.

Since moving to v4, I don't think it's working...

The only rule that is matched now, is the generic CLAMAV_VIRUS rule.
The rules for the various other signatures are no longer matched.
Could this be due to the change in priorities for meta rules, and now these
meta rules are running before they get to see the results from clam?

I can send my config examples and debug output if that's helpful.

Thanks!


Re: Lint problem with KAM.cf

2021-08-31 Thread Andrew Colin Kissa
Hi

There is a new DecodeShortURLs in Spamassassin trunk, the API has changed
from the one in the original module on GitHub.

The new builtin module has the short_url function but the original module uses
short_url_tests, the original module does not have a short_url function thus
the error generated.

You possibly need "has" checks to differentiate between the two different 
modules
with the same name currently in circulation.

- Andrew 

> On 30 Aug 2021, at 23:13, Kevin A. McGrail  wrote:
> 
> We will take a look.  We check with lint for every publication but maybe 
> there's a condition we missed or a spelling issue. Thanks for bringing it up. 
> KAM



Re: updates.spamassassin.org not resolving

2021-07-23 Thread Andrew Colin Kissa
My bad, actually thought updates.spamassassin.org was one of the mirrored-by
urls but it is sa-update.spamassassin.org

> On 23 Jul 2021, at 14:35, Kevin A. McGrail  wrote:
> 
> TL;DR: Everything looks good to me.  



updates.spamassassin.org not resolving

2021-07-23 Thread Andrew Colin Kissa
Hi

updates.spamassassin.org is not resolving, tested with various
DNS systems. Can the admins please check ?

Kind Regards,
Andrew


Re: Spamassassin 3.4.4 on centos7

2020-12-10 Thread Andrew Colin Kissa


> On 09 Dec 2020, at 21:13, Benny Pedersen  wrote:
> 
> thanks for reporting, but this should be added to centos bug tracker since 
> its a centos problem, not a spamassassin problem to solve, this 2 modules is 
> only optional

There is no bug here to be reported, those packages do exist in CentOS7

# yum whatprovides "perl(BSD::Resource)"
Loaded plugins: fastestmirror
Loading mirror speeds from cached hostfile
 * base: centos.mirror.liquidtelecom.com
 * epel: fedora.is.co.za
 * extras: centos.mirror.liquidtelecom.com
 * updates: centos.mirror.liquidtelecom.com
perl-BSD-Resource-1.29.07-1.el7.x86_64 : BSD process resource limit and 
priority functions
Repo: epel
Matched from:
Provides: perl(BSD::Resource) = 1.2907

yum whatprovides "perl(Net::CIDR::Lite)"
Loaded plugins: fastestmirror
Loading mirror speeds from cached hostfile
 * base: centos.mirror.liquidtelecom.com
 * epel: fedora.is.co.za
 * extras: centos.mirror.liquidtelecom.com
 * updates: centos.mirror.liquidtelecom.com
perl-Net-CIDR-Lite-0.21-11.el7.noarch : Perl extension for merging IPv4 or IPv6 
CIDR addresses
Repo: epel
Matched from:
Provides: perl(Net::CIDR::Lite) = 0.21








Re: Spamassassin 3.4.4 on centos7

2020-12-09 Thread Andrew Colin Kissa
Use

yum localinstall spamassassin-3.4.4-1.el7.centos.x86_64.rpm

That will pull in the dependencies for you.

> On 09 Dec 2020, at 13:01, Niamh Holding  wrote:
> 
> rpm -ivh spamassassin-3.4.4-1.el7.centos.x86_64.rpm





Re: contact from blacklist

2020-11-21 Thread Andrew Colin Kissa



> On 20 Nov 2020, at 22:23, Levente Birta  wrote:
> 
> I'd like to try the KAM channel. A quick install how-to would be nice too

I would like to test the KAM channel tool.

Thanks,
Andrew



Count of DNS lookups

2017-09-11 Thread Andrew
Hello,

Is there a way to count and log the number of individual DNS lookups that
Spamassassin does whilst processing an email?

I'm really after just a number of the lookups requested, but a list of all
the individual lookups types would be nice.

Thanks.

Skeffling.


Re: Google anti-phishing code project

2017-02-20 Thread Andrew
I've not come across these before. I am also interested in how to integrate
them into SA, thanks.

On 20 February 2017 at 21:56, Alex  wrote:

> Hi,
>
> On Mon, Feb 20, 2017 at 2:32 PM, Dianne Skoll 
> wrote:
> > On Mon, 20 Feb 2017 14:21:08 -0500
> > Alex  wrote:
> >
> >> Maybe we're using something different. This is the link I was using to
> >> download the phishing addresses until the other day, when it became a
> >> dead link:
> >
> >> https://aper.svn.sourceforge.net/svnroot/aper/phishing_reply_addresses
> >
> > That URL works for me.  However, I am currently pulling the SVN repo from
> > svn://svn.code.sf.net/p/aper/code (also can use
> http://svn.code.sf.net/p/aper/code)
> >
> > It looks like the list of addresses has not been updated since
> 2017-02-16, but
> > the list of phishing URLs has an entry dated 2017-02-20.
>
> It looks like the URL has just now become available again. Do you
> happen to know the script that can be used to convert the
> phishing_links file into SA rules in the same way as the
> phishing_reply_addresses are converted?
>
> Thanks,
> Alex
>
>
>
>
> >
> > Regards,
> >
> > Dianne.
>


Training Bayes with BAYES_999 Mail

2015-10-02 Thread Andrew Davidson
I'm not an expert on the mechanics of Bayes so I'm wondering how
valuable it is to continue training with collected spam that is
properly tagged with BAYES_999.

Does that help to reinforce the logic or is it overly focusing the
database on emails it can already detect? Should I only be training it
with miscategorized emails and emails in the 20-80% confidence range?

Thanks for clarifying,

-- Andrew


Bayes Corruption

2015-01-30 Thread Andrew Watson
Hi,

Invoked through a plugin in KerioConnect
SpamAssassin 3.3.1
Platform is CentOS 5.10

So, my Bayes.db is corrupt and out of curiosity I just wanted to take a look at 
it. I used SQLiteBrowser to do so. Now I have some questions about the 
bayes_token table:

1) Is there a reason why the id is not auto-incremented?
2) The majority of the tokens appear to be valid bytea. But a large number show 
as (BLOB). Is this perhaps the source of the corruption? If so why would that 
happen? And, if not why are they (BLOB)?

Thanks



FYI - ahbl.org and BIND DNS errors

2014-06-10 Thread Andrew Daviel


Per http://ahbl.org/content/changes-ahbl, AHBL is going away (still used 
in spamassassin-3.3.1)


Meanwhile, AHBL is serving strange DNS responses, e.g.
(from wireshark)

  1   0.00 142.90.100.186 -> 162.243.209.249 DNS 93 Standard query 0xc828 A zuz.rhsbl.ahbl.org
  2   0.072481 162.243.209.249 -> 142.90.100.186 DNS 246 Standard query response 0xc828
Authoritative nameservers
rhsbl.ahbl.org: type NS, class IN, ns invalid.ahbl.org
rhsbl.ahbl.org: type NS, class IN, ns unresponsive.ahbl.org
rhsbl.ahbl.org: type NS, class IN, ns unresponsive2.ahbl.org
Name Server: unresponsive2.ahbl.org
Additional records
invalid.ahbl.org: type A, class IN, addr 244.254.254.254
Addr: 244.254.254.254 (244.254.254.254)
unresponsive.ahbl.org: type A, class IN, addr 10.230.230.230
Addr: 10.230.230.230 (10.230.230.230)
unresponsive2.ahbl.org: type A, class IN, addr 192.168.230.230
Addr: 192.168.230.230 (192.168.230.230)
invalid.ahbl.org: type AAAA, class IN, addr fe80::
Addr: fe80::

This last one, fe80::, is an IPv6 scope-link address that causes the BIND 
nameserver to log a weird error

named[31365]: socket.c:4373: unexpected error:
named[31365]: 22/Invalid argument
Per http://www.mail-archive.com/bind-users@lists.isc.org/msg05240.html
connect() fails as it is missing scoping information.
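
As a rough illustration of why none of these glue addresses can work as a nameserver, the following Python sketch classifies them with the standard ipaddress module. The classify helper is a hypothetical name, and the order of the checks is a choice made for this example.

```python
import ipaddress

def classify(addr):
    """Return a coarse label for an IPv4 or IPv6 address string."""
    ip = ipaddress.ip_address(addr)
    if ip.is_link_local:      # e.g. fe80::/10, only valid with an interface scope
        return "link-local"
    if ip.is_reserved:        # e.g. 240.0.0.0/4
        return "reserved"
    if ip.is_private:         # e.g. 10.0.0.0/8, 192.168.0.0/16
        return "private"
    return "global"

# The addresses from the additional records in the capture above.
for addr in ("244.254.254.254", "10.230.230.230", "192.168.230.230", "fe80::"):
    print(addr, classify(addr))
```

A resolver could use a check of this kind to discard unusable glue before attempting connect().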


--
Andrew Daviel, TRIUMF, Canada
Tel. +1 (604) 222-7376  (Pacific Time)
Network Security Manager


Re: Detecting very recently registered domain names

2014-01-06 Thread Andrew Hearn
On Thu, 19 Dec 2013 10:02:39 -0500
Joe Quinn jqu...@pccc.com wrote:

 We are noticing a lot of spam coming from domains that are less than
 two months old. Is there a good way to detect this automatically?
 
 We've thought about whois, but do not want to get blocked for looking 
 like we are harvesting information.


This may be off topic, but is this related to Communicado Ltd, who register
domains daily in order to send spam? There is more info and a maintained
list (at least at the moment) at:
http://blog.hinterlands.org/2013/10/unwanted-email-from-communicado-ltd/


-- 
Andrew



RE: USPS Spam

2013-09-03 Thread Andrew Talbot
Just wanted to throw in my two cents here - I have spoken to USPS about this
and they said that they never send out these messages unless the client
requests them, and that it should be safe to completely block messages like
this. 

The same cannot be said about UPS and FedEx, by the way.



 -Original Message-
 From: Matt [mailto:matt.mailingli...@gmail.com]
 Sent: Friday, August 30, 2013 4:23 PM
 To: users@spamassassin.apache.org
 Subject: USPS Spam
 
 I am seeing tons of junk getting through claiming to be from the USPS
about a
 missed delivery package.  Anyone else seeing this?
 
 I am running SpamAssassin 3.3.1 and execute sa-update weekly.



SUBJ_ALL_CAPS

2013-08-20 Thread Andrew Talbot
Hey all -

 

Does anybody know how long the string needs to be to trigger SUBJ_ALL_CAPS?
I know it has to be multi-word and over a certain length. Was wondering the
specific length. Thanks in advance :)

 

 



Low scoring pill spam

2013-08-16 Thread Andrew Hearn
Hello,

I have a low scoring pills spam:
 http://pastebin.com/q6nWqzMR

I only get the following on it:

*  1.0 RCVD_IN_MSPIKE_L3 RBL: Low reputation (-3)
*  [219.94.129.82 listed in bl.mailspike.net]
*  0.0 SUBJECT_FUZZY_CHEAP Attempt to obfuscate words in
  Subject:
*  0.5 FROM_LOCAL_NOVOWEL From: localpart has series of
  non-vowel letters
* -2.8 RP_MATCHES_RCVD Envelope sender domain matches handover
  relay domain
*  0.0 RCVD_NOT_IN_IPREPDNS Sender not listed at
*  http://www.chaosreigns.com/iprep/


Am I missing anything (apart from Bayes) that would help catch this?
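
To make the arithmetic of the report above explicit, here is a small Python sketch that sums the listed scores. The threshold of 5.0 is an assumption (the usual default required_score), and the helper name is made up for this example.

```python
REQUIRED_SCORE = 5.0  # assumption: default required_score, not stated in the report

# Rule hits and scores copied from the report above.
hits = {
    "RCVD_IN_MSPIKE_L3": 1.0,
    "SUBJECT_FUZZY_CHEAP": 0.0,
    "FROM_LOCAL_NOVOWEL": 0.5,
    "RP_MATCHES_RCVD": -2.8,
    "RCVD_NOT_IN_IPREPDNS": 0.0,
}

def total_score(rule_hits):
    """Sum the rule scores, rounded to one decimal as in SpamAssassin reports."""
    return round(sum(rule_hits.values()), 1)

total = total_score(hits)
shortfall = round(REQUIRED_SCORE - total, 1)
print(total, shortfall)
```

The positive rules only add 1.5 points, and RP_MATCHES_RCVD alone removes 2.8, so the message would need more than six additional points to be tagged.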

Many thanks!

-- 
Andrew



Whitelisting subdomains?

2013-08-14 Thread Andrew Talbot
Hey, all -

 

I'm trying to whitelist all our internal subdomains but I can't seem to get
it to work.

 

We have so many of them that it's impractical to do them individually. For
instance, we have _...@logs.domain.com, @admin-sql.domain.com etc. etc. etc.

 

I was thinking that whitelist_from *.domain.com would work but it doesn't.

 

I can't seem to find any documentation on the net anywhere - is it even
possible to do this? 

 

 



RE: PayPal spam filter?

2013-06-27 Thread Andrew Talbot
I just had to weigh in here to say that we have DCC_CHECK scored up to a 4, and 
all of these kinds of spam messages get caught by that because they always hit 
at least another 1 point worth of rules. 

Also, those two rules require plugins, I believe. 



 -Original Message-
 From: Juerg Reimann [mailto:j...@jworld.ch]
 Sent: Wednesday, June 26, 2013 6:42 PM
 To: users@spamassassin.apache.org
 Cc: 'Benny Pedersen'
 Subject: RE: PayPal spam filter?
 
 Hi Benny
 
 Thanks for your tip. Could you elaborate on this a bit? First of all, a rule 
 with
 the name SPF_DID_NOT_PASS or DKIM_DID_NOT_PASS seem not to exist.
 How and where would I configure this?
 
 Thanks,
 Juerg
 
  -Original Message-
  From: Benny Pedersen [mailto:m...@junc.eu]
  Sent: Wednesday, June 12, 2013 9:38 PM
  To: users@spamassassin.apache.org
  Subject: Re: PayPal spam filter?
 
  Juerg Reimann skrev den 2013-06-12 21:30:
 
   Is there a filter to block PayPal phishing mails, i.e. everything
   that claims to come from PayPal but is not?
 
  meta SPF_DID_NOT_PASS (!SPF_PASS)
 
  simple ? :=)
 
  if paypal do use dkim then it could be checked with
 
  meta DKIM_DID_NOT_PASS (!DKIM_VALID_AU)
 
  phishing emails seldom pass on this 2 tests
 
  --
  senders that put my email into body content will deliver it to my own
  trashcan, so if you like to get reply, dont do it




Chain rules?

2013-06-24 Thread Andrew Talbot
Hey all -

 

Is there a way to chain rules together such that one rule will only fire
if another is hit? 

 

Specifically, we have a client that is getting hit with a bunch of messages
that are just links, but the links contain sex words. We want to do a body
scan for a list of sex words if and only if the "body contains only a link"
rule we have is triggered.

 

I tried to get this to work with meta rules but it seems like it won't do
it. Is there currently a way to do this sort of conditional check? 

 

 



RE: Chain rules?

2013-06-24 Thread Andrew Talbot
This is what I was wondering. We don't want to have to run a
computationally-expensive body rule unless we need to. No choice though, I
guess. Thanks for your help!


 -Original Message-
 From: John Hardin [mailto:jhar...@impsec.org]
 Sent: Monday, June 24, 2013 1:20 PM
 To: users@spamassassin.apache.org
 Subject: Re: Chain rules?
 
 On Mon, 24 Jun 2013, Andrew Talbot wrote:
 
  Is there a way to chain rules together such that one rule will only
  fire if another is hit?
 
  Specifically, we have a client that is getting hit with a bunch of
  messages that are just links, but the links contain sex words. We want
  to do a body scan for a list of sex words if and only if the body
contains
 only a link
  rule we have is triggered.
 
  I tried to get this to work with meta rules but it seems like it won't
  do it. Is there currently a way to do this sort of conditional check?
 
 Unfortunately you can't control whether or not a rule is *executed*, you can
 only control whether or not it contributes to the message's overall score.
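
 The point can be illustrated with a short Python sketch (not SpamAssassin code). Both sub-rules always run, and only the combined result would carry a score. The rule names and the word list are placeholders invented for this example.

```python
import re

# Placeholder patterns: a body that is nothing but one URL, and a word list.
LINK_ONLY = re.compile(r"^\s*https?://\S+\s*$")
WORD_LIST = re.compile(r"(?:badword1|badword2)", re.IGNORECASE)

def evaluate(body):
    """Run every sub-rule unconditionally, then combine them like a meta rule."""
    link_only = bool(LINK_ONLY.match(body))   # always executed
    has_word = bool(WORD_LIST.search(body))   # always executed as well
    return {
        "__LINK_ONLY": link_only,             # sub-rule, would score nothing
        "__HAS_WORD": has_word,               # sub-rule, would score nothing
        "LINK_ONLY_WITH_WORD": link_only and has_word,  # the only scored result
    }

print(evaluate("http://example.com/badword1/page"))
print(evaluate("Hello, see http://example.com/badword1"))
```

 The cost of the word scan is paid on every message; the conjunction only decides whether it counts.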
 
 --
   John Hardin KA7OHZ    http://www.impsec.org/~jhardin/
   jhar...@impsec.org    FALaholic #11174    pgpk -a jhar...@impsec.org
   key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
 ---
Look at the people at the top of both efforts. Linus Torvalds is a
university graduate with a CS degree. Bill Gates is a university
dropout who bragged about dumpster-diving and using other peoples'
garbage code as the basis for his code. Maybe that has something to
do with the difference in quality/security between Linux and
Windows.   -- anytwofiveelevenis on Y! SCOX
 ---
   10 days until the 237th anniversary of the Declaration of Independence



Rule to scan for .html attachments?

2013-05-31 Thread Andrew Talbot
Hey all -

I'm trying to set up a custom rule that scores HTML attachments.

The problem I'm running across is that using a rule like this one:
mimeheader HTML_ATTACH Content-Type =~ /^text\/html/i

Will flag all messages that come in as HTML (vs. plain text).

I found this :
header HTML_ATTACH_RULE_2 Content-Disposition =~
/^filename\=\"[a-z]{2}\.html\"/i

But that doesn't ... Work ... At all.


Any suggestions? Is this even possible?


Re: Rule to scan for .html attachments?

2013-05-31 Thread Andrew Talbot
That didn't work :(



On Fri, May 31, 2013 at 12:40 PM, Martin Gregorie mar...@gregorie.org wrote:

 On Fri, 2013-05-31 at 11:51 -0400, Andrew Talbot wrote:
  I'm trying to set up a custom rule that scores HTML attachments.
 
 ..snippage..

  I found this :
 header HTML_ATTACH_RULE_2 Content-Disposition =~
  /^filename\=\"[a-z]{2}\.html\"/i
 
 Don't anchor it to the start of the line, i.e. try this:

 header HTML_RULE Content-Disposition =~ /filename\=\"[a-z]{2}\.html\"/i

 I have a very similar rule for matching ZIP file attachments whose name
 is xx.zip which works as expected. The only significant difference from
 your rule is that it doesn't use the '^' BOL anchor symbol. My guess is
 that SA's body text parser converts the MIME header into one line, so
 requiring 'filename' to be at the start of the line will always fail.


 Martin






Re: Rule to scan for .html attachments?

2013-05-31 Thread Andrew Talbot
Didn't work with mime_header (or mimeheader) with either rule.


On Fri, May 31, 2013 at 12:23 PM, Axb axb.li...@gmail.com wrote:

 On 05/31/2013 05:51 PM, Andrew Talbot wrote:

 Hey all -

 I'm trying to set up a custom rule that scores HTML attachments.

 The problem I'm running across is that using a rule like this one:
 mimeheader HTML_ATTACH Content-Type =~ /^text\/html/i

 Will flag all messages that come in as HTML (vs. plain text).

 I found this :
 header HTML_ATTACH_RULE_2 Content-Disposition =~
 /^filename\=\"[a-z]{2}\.html\"/i

 But that doesn't ... Work ... At all.


 Any suggestions? Is this even possible?


 use mime_header instead of header



RE: Rule to scan for .html attachments?

2013-05-31 Thread Andrew Talbot
That's what I was afraid of. We generally avoid those kinds of rules since
we are scanning millions of messages a day. 

 -Original Message-
 From: David F. Skoll [mailto:d...@roaringpenguin.com]
 Sent: Friday, May 31, 2013 2:22 PM
 To: users@spamassassin.apache.org
 Subject: Re: Rule to scan for .html attachments?
 
 On Fri, 31 May 2013 14:10:36 -0400
 Andrew Talbot andrew.talbot.ownweb...@gmail.com wrote:
 
  That didn't work :(
 
 What didn't work?  Oh... you top-posted.
 
 Anyway... you might need a full rule, which can be expensive.
 Something like:
 
 full HTML_RULE /Content-Disposition:.{0,50}name\s{0,2}=\s{0,2}\"?.{0,50}\.html?/i
 
 Completely untested, of course! :)
 
 Regards,
 
 David.



RE: Rule to scan for .html attachments?

2013-05-31 Thread Andrew Talbot
I need it to fire on any HTML attachment. The modules are enabled. I can get it 
to pick up text/html, remember, but the problem is that it detects messages 
sent as HTML when it's set up like that. It doesn't detect plain-text messages, 
but it will flag plain-text messages with HTML files attached. 
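
To pin down the distinction I am after (an HTML body part versus an HTML file attached), here is a sketch using Python's standard email package rather than a SpamAssassin rule. The sample message and the function name are made up for illustration.

```python
from email import message_from_string

def html_attachments(raw_message):
    """Return filenames of parts that are attachments and look like HTML."""
    names = []
    for part in message_from_string(raw_message).walk():
        # An HTML body part has no "attachment" disposition, so it is skipped.
        if part.get_content_disposition() != "attachment":
            continue
        filename = part.get_filename() or ""
        if (part.get_content_type() == "text/html"
                or filename.lower().endswith((".htm", ".html"))):
            names.append(filename)
    return names

# Hypothetical message: an HTML body plus one attached HTML file.
SAMPLE = "\n".join([
    "MIME-Version: 1.0",
    'Content-Type: multipart/mixed; boundary="BOUND"',
    "",
    "--BOUND",
    "Content-Type: text/html; charset=us-ascii",
    "",
    "<p>HTML body</p>",
    "--BOUND",
    'Content-Type: text/html; name="aa.html"',
    'Content-Disposition: attachment; filename="aa.html"',
    "",
    "<p>attached</p>",
    "--BOUND--",
    "",
])

print(html_attachments(SAMPLE))
```

The deciding signal is the Content-Disposition of each MIME part, not the Content-Type, which is why a rule keyed on text/html alone fires on every HTML message.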


 -Original Message-
 From: Martin Gregorie [mailto:mar...@gregorie.org]
 Sent: Friday, May 31, 2013 2:35 PM
 To: users@spamassassin.apache.org
 Subject: Re: Rule to scan for .html attachments?
 
 On Fri, 2013-05-31 at 14:10 -0400, Andrew Talbot wrote:
  That didn't work :(
 
 Can you post one or two examples of actual MIME attachment headers that
 you're trying to get the rule to fire on?
 
 Obvious question, but have you enabled the MIME header module?
 I'm using MimeMagic and enabling it requires that MimeMagic.pm and
 MimeMagic.cf be included in /etc/mail/spamassassin (or wherever you have
 told SA to look for its configuration etc.)
 
 
 Martin
 
 




RE: Rule to scan for .html attachments?

2013-05-31 Thread Andrew Talbot
Hi, Martin -

Thank you for your response. The original test was using a file arbitrarily 
named aa.html .. It still doesn't work with the rewrite you provided :/ 





 -Original Message-
 From: Martin Gregorie [mailto:mar...@gregorie.org]
 Sent: Friday, May 31, 2013 3:38 PM
 To: users@spamassassin.apache.org
 Subject: Re: Rule to scan for .html attachments?
 
 On Fri, 2013-05-31 at 14:45 -0400, Andrew Talbot wrote:
  I need it to fire on any HTML attachment. The modules are enabled. I
  can get it to pick up text/html, remember, but the problem is that it
  detects messages sent as HTML when it's set up like that. It doesn't
  detect plain-text messages, but it will flag plain-text messages with
  HTML files attached.
 
 Well, that's exactly what your second rule won't do: it will only fire on the
 header of an html attachment for a file that has one of a very restricted set
 of filenames. As you haven't posted any example MIME header sets I can
 only guess, but my guess is that none of the messages you've tried it against
 have attachments with names that match the restriction.
 
 As I said before the rule can't work with the '^' in place, because that says
 that the 'filename=' string must be at the beginning of a line and NOT
 preceded by any white space. Thats a harmful restriction because you never
 see MIME headers like that. With the '^' removed the rule
 becomes:
 
 header HTML_ATTACH_RULE_2 Content-Disposition =~ /filename\=\"[a-z]{2}\.html\"/i
 
 which has a better chance of working. This version will only fire if the
 filename associated with the attachment has precisely two alphabetic
 characters plus a .html extension, i.e. it will fire on filename=aa.html or
 filename=ZZ.HTML because the trailing 'i' makes it a caseless match, but it
 won't fire on filename=cat.html
 or filename=x.html because these don't have two character names and it
 won't fire if the attachment follows the common Windows convention of
 using a .htm extension.
 
 If you want the rule to fire on *any* HTML attachment it should be:
 
 header HTML_ATTACH_RULE_2 Content-Disposition =~
 /filename\=\".{0,30}\.html{0,1}\"/i
 
 which will match any filename with a .html or .htm extension (including
 .html and .htm).
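
 The behaviour described above can be checked outside SA with a short Python sketch. It uses Python's re module, not an SA rule, and the pattern is a variant written for this example: it makes the quotes optional and requires the extension to end the filename. The helper name is hypothetical.

```python
import re

# Variant of the pattern above: optional quotes, .htm or .html, and the
# extension must be the end of the filename parameter.
HTML_FILENAME = re.compile(
    r'filename\s*=\s*"?[^";]*\.html?"?\s*(?:;|$)', re.IGNORECASE
)

def is_html_attachment(content_disposition):
    """True if the Content-Disposition value names a .htm or .html file."""
    return HTML_FILENAME.search(content_disposition) is not None

for value in (
    'attachment; filename="aa.html"',
    "attachment; filename=ZZ.HTM",
    'attachment; filename="cat.html"; size=120',
    'attachment; filename="report.pdf"',
    'attachment; filename="x.html.pdf"',
):
    print(value, is_html_attachment(value))
```

 Note there is no '^' anchor: the filename parameter normally follows "attachment;" on the same header line.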
 
 Could I respectfully suggest that you learn about Perl regular expressions
 before you try writing any more SA rules? SA rules are all based on using the
 Perl flavour of regular expressions to match character strings in headers and
 the message body.
 
 You could do a lot worse than getting a copy of "Programming Perl" by Larry
 Wall, Tom Christiansen & Jon Orwant, published by O'Reilly. If there isn't one
 in the firm's technical library, they should be willing to buy a copy. It's a
 brick of a book, but you only need to read Chapter 5: Pattern Matching to
 write SA rules, and in any case the rest of its contents will come in handy
 in future if anybody needs to write Perl programs or SA extension modules.
 
 
 Martin
 
 
 
 




Re: Bayes + DCC / Bayes as a false-positive killer

2013-05-29 Thread Andrew Talbot
Hi, Dave -

We don't have anything else learning because we deal in such bulk. We're an
email service provider hosting hundreds of thousands of accounts.

Re: your last line about "I don't understand what their concerns are"
... welcome to my world. Right now I am manually writing rules - custom
rules - based on the subject lines (only the subject lines) of spam that
gets reported to us. We are very very clearly Doing It Wrong, so I'm trying
to find a way to do it better.

As for why we can't have Bayes and DCC on at the same time, I've got
no idea.

I just work here, Dave! :)

Thank you for your response.


On Tue, May 28, 2013 at 8:12 PM, Dave Warren da...@hireahit.com wrote:

 On 2013-05-28 13:43, Andrew Talbot wrote:

 As some of you may have known from talking with me over the past few
 weeks, I've been having a difficult time 'selling' my bosses on the idea of
 Bayes; it simply doesn't seem to do anything new to them. But looking at
 the data today, I came up with an idea: use Bayes to reduce false positives.


 Do you have anything else that heuristically learns from your mail and
 adapts in real-time to your mail flow?




 That would mean we'd completely nerf the rules that add points to the
 score, but we'd trust Bayes to subtract points from messages it is
 confident are ham.

 I am aware of how silly that sounds. But would it work? We don't have
 another way to filter out false positives - we've got tons of ways to add
 points!

 What do ya'll think?


 I think it's a great idea, but that I wouldn't zero out the positive score
 unless it's hurting you, I think I'd just let it do what it does.

 If it saves you a subscription service, then that alone should be a strong
 selling point, unless there are false positives (and if so, I'd look into
 tuning your ham training before abandoning all hope)

 I guess part of it is that I don't understand what their concerns are with
 using Bayesian learning?

 --
 Dave Warren
 http://www.hireahit.com/
 http://ca.linkedin.com/in/davejwarren




Re: Bayes + DCC / Bayes as a false-positive killer

2013-05-29 Thread Andrew Talbot
Hi there, RW-

Thank you for your response. A lot of interesting points in there. The
issue with something like Bogofilter or its ilk is that it:
1- Requires manual intervention from users (we don't have access to the
content of their messages)
2- Apparently doesn't scale well to huge client bases with all kinds of
diverse businesses. Our clients range from banking institutions to
employment agencies to ... ehh... purveyors of adult objects. So it's tough
to find commonalities, and since we're so large, we can't exactly have
different user accounts for each.

Go figure.

Bayes performs beautifully in my test environment. I just need to find
that extra WOW factor. I thought that saving the cost on DCC would be it
but ... That didn't seem to make a difference. Go figure.


On Wed, May 29, 2013 at 8:02 AM, RW rwmailli...@googlemail.com wrote:

 On Tue, 28 May 2013 16:43:20 -0400
 Andrew Talbot wrote:

  Hey all -
 
  I've got two questions:
 
  1-
 
 ...
  That said, I'm wondering if it's redundant to run DCC and Bayes at
  the same time? From what I understand, DCC is a subscription-based
  service, so it would be nice to be able to cut that cost out!

 It depends what you mean by DCC, the basic version is free, but is
 actually only a way of identifying *bulk* mail which is why it
 doesn't score all that much. The paid version is a reputation system, it
 doesn't get discussed much here.

 Spamassassin is score-based, it doesn't rely on poison-pill rules. It
 doesn't matter that all DCC hits are also Bayes hits provided that
 the FPs and FNs don't also overlap and some spam that hits Bayes is
 pushed over the 5 point threshold by DCC.
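
 A small Python sketch of that argument follows. The rule scores are illustrative values chosen for this example, not figures from this thread, and "OTHER_RULES" stands in for whatever else happens to hit.

```python
THRESHOLD = 5.0  # the 5 point threshold mentioned above

# Illustrative scores only; real values depend on the active scoreset.
SCORES = {"BAYES_99": 3.5, "DCC_CHECK": 1.1, "OTHER_RULES": 0.8}

def is_spam(rule_hits):
    """A message is tagged when the summed scores reach the threshold."""
    return sum(SCORES[name] for name in rule_hits) >= THRESHOLD

# Same message, with and without DCC contributing.
with_dcc = ("BAYES_99", "DCC_CHECK", "OTHER_RULES")
without_dcc = ("BAYES_99", "OTHER_RULES")
print(is_spam(with_dcc), is_spam(without_dcc))
```

 Bayes hits in both cases, yet only the run with DCC crosses the threshold, so 100% overlap in hits does not by itself make DCC redundant.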


  As some of you may have known from talking with me over the past few
  weeks, I've been having a difficult time 'selling' my bosses on the
  idea of Bayes; it simply doesn't seem to do anything new to them. But
  looking at the data today, I came up with an idea: use Bayes to
  reduce false positives.
 
  That would mean we'd completely nerf the rules that add points to the
  score, but we'd trust Bayes to subtract points from messages it is
  confident are ham.
 
  I am aware of how silly that sounds. But would it work? We don't have
  another way to filter out false positives - we've got tons of ways to
  add points!

 Reducing FPs is already one of the main benefits of Bayes. The trouble
 is that if you rescore it,  you will still be using the Bayes scoreset
 that's optimized around Bayes doing a lot of the spam catching.

 I think you'd be better-off scoring Bogofilter, or a similar filter with
 3-way clustering, into SpamAssassin. You still have the problem of
 learning representative ham if you want accurate ham identification.



Re: Bayes + DCC / Bayes as a false-positive killer

2013-05-29 Thread Andrew Talbot
Hi, Matus -

I wanted to ask you about your last point about the bayes9x fps and the 0x
fns, mostly because it seems like that contradicts the sentence that
follows (that you don't consider it to be 100%). If there's no FNs or FPs,
it's about as good as it gets, no?


On Wed, May 29, 2013 at 3:13 AM, Matus UHLAR - fantomas
uh...@fantomas.sk wrote:

 On 28.05.13 16:43, Andrew Talbot wrote:

 That said, I'm wondering if it's redundant to run DCC and Bayes at the
 same
 time? From what I understand, DCC is a subscription-based service, so it
 would be nice to be able to cut that cost out!


 No, it is not. It only requires you using other than public DCC servers
 when
 your daily rate is over 200k.  The server must share the checksums with the
 DCC network (otherwise you couldn't catch those spams even).  If you have
 that many messages daily, it would not be even a bad idea have DCC locally.


  score, but we'd trust Bayes to subtract points from messages it is
 confident are ham.


 I rarely have BAYES_9x FPs and BAYES_0x FNs. While BAYES is great, I don't
 consider it to be 100%
 --
 Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
 Warning: I wish NOT to receive e-mail advertising to this address.
 Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
 Saving Private Ryan...
 Private Ryan exists. Overwrite? (Y/N)



Bayes + DCC / Bayes as a false-positive killer

2013-05-28 Thread Andrew Talbot
Hey all -

I've got two questions:

1-

We're running Bayes and DCC on our server, and we've just been running
Bayes locally to see how well it works. It's been about three weeks now so
I finally really started poring over the results.

One thing I noticed that I thought was a particularly interesting anomaly:
Bayes caught 100% of what DCC caught. 100%. Without exception - in
thousands of messages. The reverse wasn't true at all.

That said, I'm wondering if it's redundant to run DCC and Bayes at the same
time? From what I understand, DCC is a subscription-based service, so it
would be nice to be able to cut that cost out!



2-

As some of you may have known from talking with me over the past few weeks,
I've been having a difficult time 'selling' my bosses on the idea of Bayes;
it simply doesn't seem to do anything new to them. But looking at the data
today, I came up with an idea: use Bayes to reduce false positives.

That would mean we'd completely nerf the rules that add points to the
score, but we'd trust Bayes to subtract points from messages it is
confident are ham.

I am aware of how silly that sounds. But would it work? We don't have
another way to filter out false positives - we've got tons of ways to add
points!

What do y'all think?


Bayes autolearning: logarithmic?

2013-05-22 Thread Andrew Talbot
Hey all -

I set up Bayes with autolearning a few weeks ago. It took forever to get
started, but now it seems like the learning speed has accelerated.

Is the autolearning supposed to accelerate? I can't help but feel like it
may just be feeding itself its own data or something.


RE: Default Bayes Database

2013-05-10 Thread Andrew Talbot
You all are keeping me sane and grounded as I deal with the Powers That Be
here trying to set this up. It's good to know that I'm not wrong (I agree
with everything everyone has said, and pointed out from the beginning a
default database would be awful). 

And this:  "If he insists on starting with a pre-populated Bayes database,
he sure knows why. Other than 'I'm the boss, I want'."  ... is exactly
right too. 

We're implementing it locally with auto-learning enabled this weekend (oh,
yeah, boss didn't want auto-learning enabled either..). 

So here goes!! 

Thanks for all your help. 


 -Original Message-
 From: Karsten Bräckelmann [mailto:guent...@rudersport.de]
 Sent: Wednesday, May 08, 2013 8:18 PM
 To: users@spamassassin.apache.org
 Subject: Re: Default Bayes Database
 
 On Wed, 2013-05-08 at 14:09 -0400, Andrew Talbot wrote:
  Well, I certainly hope someone offers to help!
 
 Heh! I am really confident, Alex didn't mean to be rude, neither that he
 actually hopes no one will help you. Quite the contrary...
 
 He DID try to help you by explaining why a default Bayes database is a bad
 idea in the first place. And that was his way of telling you...
 
  If only to say there is no default database.
 
 That. :)  There is none, and there never has been.
 
 
  As we've spoken about off-list, my boss is being very particular about
  the deployment of Bayes, and it sounds like one of his caveats is that
  we don't start from a blank database.
 
 I can see how the idea of basing off of some known to be classified
 tokens sounds tempting. However, there is no such token. None. Just try to
 imagine working in an industry where e.g. Viagra and Cialis are totally legit
 phrases to use...
 
 Feel free to direct your boss here. If he insists on starting with a pre-
 populated Bayes database, he sure knows why. Other than I'm the boss, I
 want.
 
 
 Anyway, Andrew, your idea of that whole blank slate is inaccurate. If you
 import someone else's data, before importing your database has been
 empty.
 
 If you collect some ham and spam for initial training, before training your
 database has been empty.
 
 You even do NOT have to deploy SA prior to that. I don't know the size of
 your user base, but it seems it shouldn't be hard to have a few of the users
 chip in. Get a few of them to collect hand-classified ham and spam for you.
 Train Bayes with that. After that, deploy SA to your mail processing chain.
 
 There you go! A pre-populated Bayes database, based on YOUR particular
 ham and spam tokens, before deploying SA in production.
 
 
 --
 char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
 main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
 (c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}




Default Bayes Database

2013-05-08 Thread Andrew Talbot
Hey all -

I remember seeing somewhere that there was a default Bayes database for
Bayes to start using right away, but can't seem to find that information
again on the Wiki or in my notes.

Can someone please help?


RE: Default Bayes Database

2013-05-08 Thread Andrew Talbot
Well, I certainly hope someone offers to help! 

If only to say there is no default database. 

As we've spoken about off-list, my boss is being very particular about the
deployment of Bayes, and it sounds like one of his caveats is that we don't
start from a blank database. 

For the record, I agree with your logic completely .. And I hate to say
stupid things like this, but it doesn't even matter to me if the tokens in
the default database are useless at this point, or if there are only 20 of
them. I just need to get this deployed so it can start learning. 




 -Original Message-
 From: Axb [mailto:axb.li...@gmail.com]
 Sent: Wednesday, May 08, 2013 1:32 PM
 To: users@spamassassin.apache.org
 Subject: Re: Default Bayes Database
 
 On 05/08/2013 07:26 PM, Andrew Talbot wrote:
  Hey all -
 
  I remember seeing somewhere that there was a default Bayes database
  for Bayes to start using right away, but can't seem to find that
  information again on the Wiki or in my notes.
 
  Can someone please help?
 
 I hope nobody offers to help.
 
 Why?
 - your HAM is somebody else's SPAM
 - A decent Bayes DB is highly dynamic and yesterday's tokens from someone
 else's traffic will be useless to your traffic, today.
 - If you have a decent traffic flow, it takes less than 4 hours of
 autolearning with YOUR data to see Bayes scoring.
 




Bayes Autolearning

2013-05-01 Thread Andrew Talbot
Hey All -

 

I'm about to set up Bayes on one of our mail servers. A lot of the
documentation says that I need to manually sift through a few hundred
messages and classify them to 'teach' the filter, and it sounds like I may
need to do that on an ongoing basis. 

 

That is not a very plausible solution - our servers process about 2 million
messages a day.

 

Does Bayes start out with a completely blank slate? That is, if I never have
it learn anything from my servers, will it still be pulling from something
already defined?

 

Can I set it to autolearn and leave it be? Or will it require continual
maintenance and manual message feeding?

 

Any suggestions any of you have for a Bayes newbie - about what I just asked
or otherwise - would be very much appreciated :)

 

 

 

 

 



RE: Bayes Autolearning

2013-05-01 Thread Andrew Talbot
Thank you for that! 

Off-list you mentioned that you don't need to set the cron/expire because of
Redis features; why is it commented out here? 




 -Original Message-
 From: Axb [mailto:axb.li...@gmail.com]
 Sent: Wednesday, May 01, 2013 2:14 PM
 To: users@spamassassin.apache.org
 Subject: Re: Bayes Autolearning
 
 On 05/01/2013 08:01 PM, Andrew Talbot wrote:
 
  Any suggestions any of you have for a Bayes newbie - about what I just
  asked or otherwise - would be very much appreciated.
 
 I advocate autolearning as it has always worked fine for me.
 Can take a bit longer to see good results but with some tuning I can sit back
 and hear it purr and not worry about collecting ham and spam and training,
 which under certain circumstances may even be impossible.
 
 Before moving on to Redis, these were my bayes settings
 
 # bayes.cf
 
 use_bayes 1
 bayes_auto_learn  1
 bayes_auto_expire  0
 
 bayes_learn_to_journal 0
 
 # Don't want to wait for the default 200 hams/spams
 bayes_min_ham_num 20
 bayes_min_spam_num 20
 
 bayes_auto_learn_threshold_nonspam -1.0
 bayes_auto_learn_threshold_spam 15.0
 
 
 # FILE BASED
 # mkdir /etc/bayes
 bayes_path /etc/mail/spamassassin/bayes/bayes
 
 # Check permissions/modify if needed
 #bayes_file_mode 0666
 
 bayes_expiry_max_db_size 35
 # SDBM is faster than other r/w  DBs
 bayes_store_module   Mail::SpamAssassin::BayesStore::SDBM
 
 # cron weekly
 #  sa-learn --force-expire




RE: Bayes Autolearning

2013-05-01 Thread Andrew Talbot
Hi, Steve -

Thanks for your response. Is that just for performance reasons?




 -Original Message-
 From: Steve Freegard [mailto:steve.freeg...@fsl.com]
 Sent: Wednesday, May 01, 2013 2:24 PM
 To: users@spamassassin.apache.org
 Subject: Re: Bayes Autolearning
 
 All good advice there from Axb; the only thing I'd add to that is:
 
 bayes_auto_learn_on_error 1
 
 Which prevents Bayes from over-training when the classifier already agrees
 with what the autolearn is trying to train on.
 
 Cheers,
 Steve.
 
 On 01/05/13 19:14, Axb wrote:
  On 05/01/2013 08:01 PM, Andrew Talbot wrote:
 
  Any suggestions any of you have for a Bayes newbie - about what I
  just asked or otherwise - would be very much appreciated.
 
  I advocate autolearning as it has always worked fine for me.
  Can take  a bit longer to see good results but with some tuning I can
  sit back and hear it purr and not worry about collecting ham and spam
  and training, which under certain circumstances may even be impossible.
 
  Before moving on to Redis, these were my bayes settings
 
  # bayes.cf
 
  use_bayes 1
  bayes_auto_learn  1
  bayes_auto_expire  0
 
  bayes_learn_to_journal 0
 
  # Don't want to wait for the default 200 hams/spams
  bayes_min_ham_num 20
  bayes_min_spam_num 20
 
  bayes_auto_learn_threshold_nonspam -1.0
  bayes_auto_learn_threshold_spam 15.0
 
 
  # FILE BASED
  # mkdir /etc/bayes
  bayes_path /etc/mail/spamassassin/bayes/bayes
 
  # Check permissions/modify if needed
  #bayes_file_mode 0666
 
  bayes_expiry_max_db_size 35
  # SDBM is faster than other r/w  DBs
  bayes_store_module   Mail::SpamAssassin::BayesStore::SDBM
 
  # cron weekly
  #  sa-learn --force-expire
 
 
 




RE: Bayes Autolearning

2013-05-01 Thread Andrew Talbot
Hey there, thanks for responding. That's an interesting point.

Are you saying I should not use autolearning at all? 

I don't have any way to review a large corpus of messages because we don't
have access to them - after they run through our servers they are sent on,
and the text of the message is not stored on our server. 

Man, I wish there was an easier way to feed Bayes an initial set of spam/ham
to teach it properly .. I've been told that letting it autolearn for a few
hours/days would make it learn well enough though.

If only our mail server only got 100 messages a day - then I could just
manually mark them! :) 




 -Original Message-
 From: RW [mailto:rwmailli...@googlemail.com]
 Sent: Wednesday, May 01, 2013 6:24 PM
 To: users@spamassassin.apache.org
 Subject: Re: Bayes Autolearning
 
 On Wed, 01 May 2013 22:02:43 +0100
 Steve Freegard wrote:
 
  On 01/05/13 19:40, Andrew Talbot wrote:
   Hi, Steve -
  
   Thanks for your response. Is that just for performance reasons?
  
 
  Performance is one of the things that bayes_auto_learn_on_error 1 will
  give you.  It means that if the message was already considered spam by
  Bayes, then the message won't be autolearnt again which means
  a bit less IO.   It will also result in the Bayes databases being
  smaller, as with this option fewer tokens are likely to be
  present overall, which will also save disk IO and space.
 
  But the key reason I like this option is that it doesn't allow bayes
  to overtrain in one direction (e.g. spam or ham).  It only autolearns
  when Bayes either has the wrong result or isn't sure which IMO has to
  be better for accuracy in the long run.
 
 The evidence from trials with Bogofilter (which is similar to Bayes) showed
 that initially train-on-everything significantly outperforms train-on-error.
 The latter asymptotically catches up after thousands of errors. It seems that
 the most important thing is to learn a few thousand hams and spams by any
 means; and train-on-error can take a long time to get there. For this reason
 DSPAM only allows train-on-error when 2500 hams have been learned.
 
 There *may* be advantages to train-on-error after this in preventing BAYES
 becoming insensitive to learning.
 
 The chief problem with autolearning is learning ham. If you set a positive
 threshold you end up learning a lot of spam as ham; if you set a negative
 threshold you effectively turn over ham training to the DNS whitelists, since
 they are the only tests with significant negative scores that aren't excluded
 from autolearning. Any problems with mislearning are likely to be
 exacerbated by train-on-error.
 
 If I had to use autolearning I'd mark the DNS whitelists as noautolearn and
 write some negative-scoring, site-specific rules.
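
The difference between train-on-everything and train-on-error discussed above can be sketched in a few lines of Python. This is only an illustration of the decision, not SpamAssassin's or Bogofilter's actual code; the function name, the parameters and the 0.4 to 0.6 "unsure" band are assumptions made up for the example.

```python
def should_autolearn(bayes_prob, learn_as_spam, on_error_only=True,
                     unsure_low=0.4, unsure_high=0.6):
    """Decide whether a message is fed back to the classifier.

    bayes_prob    -- classifier's current spam probability for the message
    learn_as_spam -- True if the autolearner wants to train it as spam
    The unsure band (0.4 to 0.6) is a hypothetical choice for illustration.
    """
    if not on_error_only:
        return True                      # train-on-everything
    if unsure_low <= bayes_prob <= unsure_high:
        return True                      # classifier isn't sure
    bayes_says_spam = bayes_prob > unsure_high
    return bayes_says_spam != learn_as_spam   # classifier disagrees


decisions = [
    should_autolearn(0.99, True),                        # agrees: skip
    should_autolearn(0.99, False),                       # disagrees: learn
    should_autolearn(0.50, True),                        # unsure: learn
    should_autolearn(0.99, True, on_error_only=False),   # always learn
]
print(decisions)
```

With train-on-error the first message is skipped because the classifier already agrees, which is why the database grows more slowly and why it takes longer to reach the few thousand learned messages mentioned above.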



RE: More longer rules or fewer shorter ones?

2013-04-26 Thread Andrew Talbot
Martin -

Interesting. How many mailboxes does your deployment cover?



-Original Message-
From: Martin Gregorie [mailto:mar...@gregorie.org] 
Sent: Thursday, April 25, 2013 8:08 PM
To: users@spamassassin.apache.org
Subject: Re: More longer rules or fewer shorter ones?

On Thu, 2013-04-25 at 18:45 -0400, Andrew Talbot wrote:

 I like your point about the portmanteau rules (and I award you two 
 Points for using one of my favorite words in a new - yet appropriate - 
 manner!).
 
:-)


 I never thought about scoring each rule as a 0.001 or something really 
 low then tying them all together with meta-rules. It's been a while 
 since I separated everything out but I believe I have around 1000 
 different checks (most of them portmanteau'd) so it seems like those 
 meta rules would just get ... Messy. But it's a good idea, and I think 
 I can especially make use of it in my Specific Word list.
 
The metas aren't too bad, though I must admit to building some of them as metas 
of metas to keep all lines down to 72 chars or so. Most of these submetas are 
simply lists of other rules that have been ANDed or ORed together.

You may find that the Portmanteau Generator reduces your rule count because it 
too can generate metas, which I use to deal with situations where a term can 
appear in more than one case, e.g. a generated rule can have this form:

describe GENRULE Example rule
header   __GR1   Reply-to =~ /(\@spam1\.com|\@spammer\.co\.uk)/
header   __GR2   From =~ /(\@spam1\.com|\@spammer\.co\.uk)/
uri      __GR3   /(\@spam1\.com|\@spammer\.co\.uk)/
meta     GENRULE (__GR1 || __GR2 || __GR3)
score    GENRULE 1.5

which has two advantages. First, that GENRULE is a single name that covers the 
same spammy term regardless of where it was used and secondly, since each 
generated rule has its own source file, this makes the three related lists 
easier to edit, since there's a good chance that a spammy term might be used in 
more than one of the related lists.
  
 Keeping the rules under 1-2mb is a good rule of thumb to follow.
 Luckily we're nowhere near that point yet. 
 
Nor am I. As I said, my biggest generated rule is a bit over 9 KB.

 Can I ask how many rules you have, and how many of those are meta 
 rules?

I have 31 portmanteau rules, of which 9 contain metas. Only 12 of these have a 
score exceeding 1.0 and these are not usually used as part of higher level 
metarules.

My local.cf is where any very specific rules live, along with the higher level 
metarules that use the low scoring portmanteau rules. This contains 129 rules 
which between them contain 96 'meta' statements. 36 of these have scores of 
under 1.0, so are probably used as components of metarules.  The total number 
of rules was obtained by using grep+wc to count lines containing '^score'.
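
For anyone without grep+wc to hand, the same count can be sketched in Python. The `count_rules` helper and the sample rule text are hypothetical; the only thing taken from the description above is "count the lines that start with 'score'".

```python
def count_rules(cf_text):
    """Count lines starting with 'score', like grep '^score' | wc -l."""
    return sum(1 for line in cf_text.splitlines() if line.startswith("score"))


# Hypothetical rule file contents, for illustration only.
sample_cf = """\
body     DEMO_A  /foo/
score    DEMO_A  0.001
body     DEMO_B  /bar/
score    DEMO_B  0.001
meta     DEMO_AB (DEMO_A && DEMO_B)
score    DEMO_AB 5.0
"""

rule_count = count_rules(sample_cf)
print(rule_count)
```

Note that this counts scored rules only; unscored sub-rules (names starting with a double underscore) are not included, just as with the grep approach.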

My local.cf and portmanteau.cf files are both 29 KB in size.


Martin







Re: More longer rules or fewer shorter ones?

2013-04-25 Thread Andrew Talbot
Hi, Martin -



Thank you for your response.



I like your point about the portmanteau rules (and I award you two Points
for using one of my favorite words in a new - yet appropriate - manner!).



I never thought about scoring each rule as a 0.001 or something really low
then tying them all together with meta-rules. It's been a while since I
separated everything out but I believe I have around 1000 different checks
(most of them portmanteau'd) so it seems like those meta rules would just
get ... Messy. But it's a good idea, and I think I can especially make use
of it in my Specific Word list.



It's interesting that you don't use Bayes for the opposite reason that we
don't - we don't do it because of high volume, you don't do it because of
low volume. Go figure.



Keeping the rules under 1-2mb is a good rule of thumb to follow. Luckily
we're nowhere near that point yet.





Can I ask how many rules you have, and how many of those are meta rules?













-Original Message-

From: Martin Gregorie [mailto:mar...@gregorie.org]

Sent: Wednesday, April 24, 2013 3:03 PM

To: users@spamassassin.apache.org

Subject: Re: More longer rules or fewer shorter ones?



On Wed, 2013-04-24 at 12:32 -0400, Andrew Talbot wrote:

 I have my customized deployment split up into a bunch of separate CF

 files (by category) and I have those further split up into rules based

 on score.



I also use very long rules, mainly due to spamiferous mailing lists,
because all the headers are pretty much the same (apart from sender names),
so about all you're left with for spam recognition is the body content.



I found a problem with very long rules, where for me 'very long' means
rules longer than the width of my editor's screen. I refer to these as
'portmanteau rules' (private slang). As I hate editing anything that's
longer than my editor's text line and find it particularly annoying to deal
with such a line containing a regex consisting of a lot of alternates, I
wrote a portmanteau rule generator to make their maintenance a bit easier.
It is a gawk script that assembles an arbitrarily long rule from a file
containing rule fragments (regexes, etc.) that are each placed on a separate
line. Since it sounds as though you may have a similar problem, you may also
find it useful. You can find it and its documentation here:

http://www.libelle-systems.com/free/portmanteau/portmanteau.tgz



I find it particularly helpful to make the portmanteau rules fairly low
scoring and to combine them into higher scoring meta-rules, e.g. if I'm
trapping sales spiel I'll have a portmanteau rule listing selling phrases,
one containing monetary terms and another containing product terms and
names, all scored at 0.001. I'll also have a meta-rule that ANDs these
three rules together and scores around 5. This approach is much better at
distinguishing spam from ham than a series of higher scoring non-meta rules
and has the additional benefit of recognising sales-related text from
previously unseen combinations of elements in the three rules.
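
The approach described above (three near-zero-scoring term lists, plus one meta-rule that fires only when all three hit) can be sketched in Python. This is not SpamAssassin code; the rule names, word lists and the `score_body` helper are made up for illustration, and only the 0.001 and "around 5" scores come from the text.

```python
import re

# Hypothetical component rules, each a long alternation in real use.
COMPONENTS = {
    "SELLING": re.compile(r"\b(buy now|special offer|limited time)", re.I),
    "MONEY":   re.compile(r"(\$\d+|\bdiscount|\bcheap)", re.I),
    "PRODUCT": re.compile(r"\b(watches|garden hose)", re.I),
}
COMPONENT_SCORE = 0.001   # each list barely counts on its own
META_SCORE = 5.0          # the AND of all three is what matters


def score_body(body):
    """Score a message body: tiny per-list scores plus the meta-rule."""
    hits = {name for name, rx in COMPONENTS.items() if rx.search(body)}
    score = COMPONENT_SCORE * len(hits)
    if len(hits) == len(COMPONENTS):      # meta: SELLING && MONEY && PRODUCT
        score += META_SCORE
    return round(score, 3)


spammy = score_body("Special offer: cheap garden hose, only $10")
hammy = score_body("Found cheap flights to Rome for the conference")
print(spammy, hammy)
```

A message that merely mentions money scores almost nothing, while any combination of one selling phrase, one monetary term and one product term trips the meta-rule, including combinations never seen before.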



BTW, I don't use Bayes because my mail volume is small and I have
difficulty collecting decent training corpuses and find my current setup
easier to manage.





  They are WAY longer than that (and some of them include further nesting
of the pipe), but that's the general idea.



 My question is: is it better performance-wise to have the rules set up

 like this, or to have each separate thing have its own separate rule?



What JH said. When I was thinking of setting up this approach I asked about
performance and limits on the size of the generated rules and was told that
I shouldn't worry about rule size until they exceeded a megabyte or two.
Currently my longest rule is just over 9KB, with the averages being just
under 1KB and 51 alternates per rule.



Martin


More longer rules or fewer shorter ones?

2013-04-24 Thread Andrew Talbot
 

Hey, all -

 

I have my customized deployment split up into a bunch of separate CF files
(by category) and I have those further split up into rules based on score.

 

So, I have a bunch of stuff like:

header RULE_1 Subject =~ /\b(this|that|theother|blah|blah)/i

score RULE_1 1

describe RULE_1 Rule 1

 

header RULE_2 Subject =~ /\b(foo|bar|etc)/i

score RULE_2 2

describe RULE_2 Rule 2

 

They are WAY longer than that (and some of them include further nesting of
the pipe), but that's the general idea.

 

My question is: is it better performance-wise to have the rules set up like
this, or to have each separate thing have its own separate rule?



RE: More longer rules or fewer shorter ones?

2013-04-24 Thread Andrew Talbot
John, 

Thanks for your prompt response!

A lot of the rules are big jumbles of rules we are generating in real time
and adding to as things come in. Like I said in my original question, we
have them separated into separate cf files by category, and within those cf
files they are separated by score. So we have just absolutely gargantuan
rules for (for instance) sex words that we assign a 5 to automatically.
There's also lists of specific words and phrases that we see in real-time
spam (like the *$#ing garden hose spam).

We are just tacking new rules on to the end to make them easier to read. Our
rules properly work with (this|that|theother) if it hits any one of the
words. 

Should we maybe have separate rules for all the phrases, since they're
longer strings? There's rules in there that are like RULE Subject =~
/you.have.(new|waiting|blah|blah).*(ecard|message|calendar.invite|blah|blah)
)|(garden|new|stretchy|bendy|whatever).*(hose|vaccum|other.thing) . . .  . .
. 


Etc. It goes on. .. My syntax is terrible and obviously those aren't the
actual rules but the point is that it's a bunch of Or for some really long
strings. Should I separate them out and have those long (this|that|theother)
rules be only for single words?

Alternately, should I separate out the rules with embedded pipes in them
(like in the example above)? 


-Original Message-
From: John Hardin [mailto:jhar...@impsec.org] 
Sent: Wednesday, April 24, 2013 12:58 PM
To: users@spamassassin.apache.org
Subject: Re: More longer rules or fewer shorter ones?

On Wed, 24 Apr 2013, Andrew Talbot wrote:

 Hey, all -

 I have my customized deployment split up into a bunch of separate CF 
 files (by category) and I have those further split up into rules based
 on score.

 So, I have a bunch of stuff like:

 header RULE_1 Subject =~ /\b(this|that|theother|blah|blah)/i
 score RULE_1 1
 describe RULE_1 Rule 1

 header RULE_2 Subject =~ /\b(foo|bar|etc)/i
 score RULE_2 2
 describe RULE_2 Rule 2

 They are WAY longer than that (and some of them include further 
 nesting of the pipe), but that's the general idea.

 My question is: is it better performance-wise to have the rules set up 
 like this, or to have each separate thing have its own separate rule?

For performance, with simple lists of variant values having no repetition
across the list e.g. (x|y|z){n,m}, if the most-likely variants are listed
first a big rule will (generally-speaking) process less than a set of
individual rules for each variant. The big rule will stop trying as soon as
a match for one variant is found, whereas all of the individual rules must
be tried regardless of what other rules may have hit. RULE_1 won't try
matching "that", "theother", "blah", etc. if "this" matches.

Ignoring performance, the alternatives are *not* syntactically equivalent. 
Absent "tflags multiple", RULE_1 would hit only once on a subject containing
both "this" and "that" and "theother", but if you split it up into separate
rules *each* would hit. This likely would affect scoring.
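
The scoring difference can be illustrated with a small Python sketch. It only mimics the "a rule hits at most once" behaviour described above; the split rule names, the scores and the `score_subject` helper are hypothetical.

```python
import re

# One combined rule versus the same terms split into separate rules.
COMBINED = {"RULE_1": (re.compile(r"\b(this|that|theother)", re.I), 1.0)}
SPLIT = {
    "RULE_1A": (re.compile(r"\bthis", re.I), 1.0),
    "RULE_1B": (re.compile(r"\bthat", re.I), 1.0),
    "RULE_1C": (re.compile(r"\btheother", re.I), 1.0),
}


def score_subject(rules, subject):
    """Each rule adds its score at most once, however often it matches."""
    return sum(score for rx, score in rules.values() if rx.search(subject))


subject = "this and that and theother"
combined_score = score_subject(COMBINED, subject)
split_score = score_subject(SPLIT, subject)
print(combined_score, split_score)
```

The combined rule contributes its score once, while the split rules contribute three times for the same subject, so splitting a list changes the total unless the individual scores are adjusted.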

-- 
  John Hardin KA7OHZ    http://www.impsec.org/~jhardin/
  jhar...@impsec.org    FALaholic #11174    pgpk -a jhar...@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
   Vista security improvements consist of attempting to shift blame
   onto the user when things go wrong.
---
  328 days since the first successful private support mission to ISS (SpaceX)



RE: More longer rules or fewer shorter ones?

2013-04-24 Thread Andrew Talbot
Hi again, John -

It's a good idea to add the realtime rules to the beginning of the filter. I
didn't realize that would have such an impact. And the (?=x) tip is a good
one too; thank you for that.

As far as Bayes, don't get me started! :)  I work for an Email Service
Provider and about 2 million messages go through our servers every day, so
we have Bayes turned off because it would be too computationally expensive.
I wish we could turn it on - it'd certainly make my job easier - but The
Boss says no. Go figure. Autolearn, same story. 

Having such a large organization makes it a difficult balance to avoid false
positives, too. We have one client who deals with credit reports and
refinancing and stuff and pretty much every message that goes to their
mailboxes looks like spam. We just have them set up to avoid all our
financial rules. 

Luckily we don't have too many doctors' offices so we needn't really concern
ourselves with legitimate Viagra email! :) 

I've scoured the net looking for rulesets from others that already have a
lot of this stuff in there but I haven't found any rulesets since 2006. A
lot of what I've seen is irrelevant - do you know a good place to get custom
rulesets? I feel like there's someone else out there who already figured out
how to write a rule that captures all those learn a new language spam
messages so I don't need to just score Language as +4 ! : )





-Original Message-
From: John Hardin [mailto:jhar...@impsec.org] 
Sent: Wednesday, April 24, 2013 1:53 PM
To: users@spamassassin.apache.org
Subject: RE: More longer rules or fewer shorter ones?

On Wed, 24 Apr 2013, Andrew Talbot wrote:

 John,

 Thanks for your prompt response!

 A lot of the rules are big jumbles of rules we are generating in real 
 time and adding to as things come in. Like I said in my original 
 question, we have them separated into separate cf files by category, 
 and within those cf files they are separated by score. So we have just 
 absolutely gargantuan rules for (for instance) sex words that we assign a
 5 to automatically.
 There's also lists of specific words and phrases that we see in 
 real-time spam (like the *$#ing garden hose spam).

 We are just tacking new rules on to the end to make them easier to 
 read. Our rules properly work with (this|that|theother) if it hits any 
 one of the words.

 Should we maybe have separate rules for all the phrases, since they're 
 longer strings? There's rules in there that are like RULE Subject =~
 /you.have.(new|waiting|blah|blah).*(ecard|message|calendar.invite|blah
 |blah)
 )|(garden|new|stretchy|bendy|whatever).*(hose|vaccum|other.thing) . . .  .
.
 .

 Etc. It goes on. .. My syntax is terrible and obviously those aren't 
 the actual rules but the point is that it's a bunch of Or for some 
 really long strings. Should I separate them out and have those long 
 (this|that|theother) rules be only for single words?

Simple alternations on phrases are equivalent to simple alternations on
single words with respect to the performance concerns. Performance is more
governed by the number of alternations and the presence of repetition and
.* than their raw length. You might want to limit the total number of
alternations per rule.

Another performance optimization would be to ensure all of the alternations
in a given rule start with the same letter, and put (?=x) before the list of
alternatives e.g. /\b(?=x)(x1|x2|x3|x4)/ so that the engine can skip more
easily.
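
A quick way to convince yourself that the (?=x) prefix does not change what a rule matches is to compare both forms on a few subjects. The sketch below uses Python's re module rather than Perl, so it demonstrates equivalence only and says nothing about the speed-up in SpamAssassin's regex engine; the word list and subjects are made up.

```python
import re

# Same alternation with and without the leading lookahead.
plain = re.compile(r"\b(xanax|xenical|xylophone)", re.I)
guarded = re.compile(r"\b(?=x)(xanax|xenical|xylophone)", re.I)

subjects = ["Cheap Xanax here", "meeting notes", "extra xylophone lessons"]

plain_hits = [bool(plain.search(s)) for s in subjects]
guarded_hits = [bool(guarded.search(s)) for s in subjects]
print(plain_hits, guarded_hits)
```

The lookahead is only safe when every alternative really does start with the same letter; an alternative starting with a different letter would silently stop matching.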

If they are simple alternations, it also depends on how you want to score
them.

For poison pill words or phrases, sure, a long alternation with a high
score will be pretty efficient. I'd suggest tacking new hits onto the
*front* of the list of alternatives, though, as it's reasonable to assume a
spam run will use the same phrasing for a while, then change.

 Alternately, should I separate out the rules with embedded pipes in 
 them (like in the example above)?

Yeah, avoiding nested alternatives where possible will help.

Is Bayes not catching things like this?

 -Original Message-
 From: John Hardin [mailto:jhar...@impsec.org]
 Sent: Wednesday, April 24, 2013 12:58 PM
 To: users@spamassassin.apache.org
 Subject: Re: More longer rules or fewer shorter ones?

 On Wed, 24 Apr 2013, Andrew Talbot wrote:

 Hey, all -

 I have my customized deployment split up into a bunch of separate CF 
 files (by category) and I have those further split up into rules
 based on score.

 So, I have a bunch of stuff like:

 header RULE_1 Subject =~ /\b(this|that|theother|blah|blah)/i
 score RULE_1 1
 describe RULE_1 Rule 1

 header RULE_2 Subject =~ /\b(foo|bar|etc)/i
 score RULE_2 2
 describe RULE_2 Rule 2

 They are WAY longer than that (and some of them include further 
 nesting of the pipe), but that's the general idea.

 My question is: is it better performance-wise to have the rules set 
 up like this, or to have each separate thing have its own separate rule?

 For performance, with simple lists

Re: Fwd: RE: alert: New event: ET EXPLOIT Possible SpamAssassin Milter Plugin Remote Arbitrary Command Injection Attempt (fwd)

2011-02-10 Thread Andrew Daviel


On Thu, 10 Feb 2011, Michael Scheidell wrote:


http://seclists.org/fulldisclosure/2010/Mar/140
http://www.securityfocus.com/bid/38578

Vulnerable: SpamAssassin Milter Plugin SpamAssassin Milter Plugin 0.3.1

I don't see anything on bugtraq about a fix.


The securityfocus page lists some Debian fixes. The Debian patch 
spamass-milter_0.3.1-8+lenny2.diff.gz changelog includes:

+spamass-milter (0.3.1-8+lenny1) stable-security; urgency=high
+
+  * Use new popenenv function instead of open; fixes remote code exploit
+as the spamass-milter user when run using -x. (closes: #573228)
+
+ -- Don Armstrong d...@debian.org  Wed, 17 Mar 2010 12:52:56 -0700

per http://security.debian.org/pool/updates/main/s/spamass-milter/

--
Andrew Daviel, TRIUMF, Canada
Tel. +1 (604) 222-7376  (Pacific Time)
Network Security Manager


URLs with Spaces

2009-06-25 Thread Andrew Hearn
Hello,

I'm wondering if I'm missing some rules that would have given this
message more points - I know it's missing bayes (I'm not sure why as our
servers should use bayes, but it seems not to have been run for this
message.)

http://www.pastebin.ca/1473975

Thanks

-- 
Andrew.


Re: URLs with Spaces

2009-06-25 Thread Andrew Hearn
Kasper Sacharias Eenberg wrote:
 There's been a rule circulating this mailing list for a couple of weeks.
 This is the latest edition to catch those med-things (afaik).
 
 --
 body AE_MEDS35 /\bwww\s(?:\W\s)?\w{3,6}\d{2,6}\s(?:\W\s)?(?:c\s?o\s?m|n\s?e\s?t|o\s?r\s?g)\b/i
 describe AE_MEDS35 obfuscated domain in message
 score    AE_MEDS35 5.0
 --
 
 It works good for me.
 


Thanks Kasper,

Also the Sanesecurity sigs for Clam catch it (thanks to Steve)




FuzzyOCR only runs when specifying spamassassin -D

2009-04-28 Thread Andrew Bruce


I've been looking at some of the spam emails I've received lately with
images attached and noticed that FuzzyOCR wasn't running against them. 

The same seems to be true when I take these messages and run them with: 

spamassassin -t < img-email.eml 

However if I run them through as follows, I get FuzzyOCR showing up in the
results: 

spamassassin -t -D < img-email.eml 

I also get substantially different AWL results between the two (although I
guess that may be part of the debug procedure). 

Does anyone know why this might be happening? I seem to recall
experiencing this before, but can't remember what I did to fix it. 

spamassassin -t: 

Content analysis details: (22.2 points, 5.0 required)

 pts rule name              description
---- ---------------------- --------------------------------------------------
 1.2 RCVD_IN_PBL            RBL: Received via a relay in Spamhaus PBL
                            [68.186.154.187 listed in zen.spamhaus.org]
 3.0 RCVD_IN_XBL            RBL: Received via a relay in Spamhaus XBL
 0.9 RCVD_IN_SORBS_DUL      RBL: SORBS: sent directly from dynamic IP address
                            [68.186.154.187 listed in dnsbl.sorbs.net]
 3.5 BAYES_99               BODY: Bayesian spam probability is 99 to 100%
                            [score: 1.]
 1.0 FH_HELO_EQ_CHARTER     Helo is d-d-d-d charter.com
 4.3 HELO_DYNAMIC_HCC       Relay HELO'd using suspicious hostname (HCC)
 4.4 HELO_DYNAMIC_IPADDR2   Relay HELO'd using suspicious hostname (IP addr
                            2)
 0.0 FH_HELO_EQ_D_D_D_D     Helo is d-d-d-d
 2.0 RCVD_IN_BL_SPAMCOP_NET RBL: Received via a relay in bl.spamcop.net
                            [Blocked - see ]
 0.0 HTML_MESSAGE           BODY: HTML included in message
 0.1 RDNS_DYNAMIC           Delivered to trusted network by host with
                            dynamic-looking rDNS
 1.8 AWL                    AWL: From: address is in the auto white-list

spamassassin -t -D: 

Content analysis details: (25.7 points, 5.0 required)

 pts rule name              description
---- ---------------------- --------------------------------------------------
 3.0 RCVD_IN_XBL            RBL: Received via a relay in Spamhaus XBL
                            [68.186.154.187 listed in zen.spamhaus.org]
 1.2 RCVD_IN_PBL            RBL: Received via a relay in Spamhaus PBL
 0.9 RCVD_IN_SORBS_DUL      RBL: SORBS: sent directly from dynamic IP address
                            [68.186.154.187 listed in dnsbl.sorbs.net]
 3.5 BAYES_99               BODY: Bayesian spam probability is 99 to 100%
                            [score: 1.]
 1.0 FH_HELO_EQ_CHARTER     Helo is d-d-d-d charter.com
 4.3 HELO_DYNAMIC_HCC       Relay HELO'd using suspicious hostname (HCC)
 4.4 HELO_DYNAMIC_IPADDR2   Relay HELO'd using suspicious hostname (IP addr
                            2)
 0.0 FH_HELO_EQ_D_D_D_D     Helo is d-d-d-d
 2.0 RCVD_IN_BL_SPAMCOP_NET RBL: Received via a relay in bl.spamcop.net
                            [Blocked - see ]
 0.0 HTML_MESSAGE           BODY: HTML included in message
 0.1 RDNS_DYNAMIC           Delivered to trusted network by host with
                            dynamic-looking rDNS
  10 FUZZY_OCR_KNOWN_HASH   BODY:
-5.2 AWL                    AWL: From: address is in the auto white-list


Always show test scores in email header

2009-03-31 Thread Andrew Bruce
Is it possible to have a header, or X-Spam-Status, always show the
individual scores for each of the tests performed against a particular email
(whether it is tagged as spam or not)?

I see that when using MailScanner with SpamAssassin this always happens,
but cannot replicate the same for a straight SpamAssassin installation.

This is an example of what I get in an email's source from MailScanner and
would like to replicate in SpamAssassin:
X-MailScanner-Spam: not spam, SpamAssassin (not cached,
 score=4.616, required 5, BAYES_40 -0.18, DCC_CHECK 4.50,
 HTML_MESSAGE 0.00, RDNS_DYNAMIC 0.10, SARE_HTML_USL_A 0.20) 


Regards,


Andrew Bruce


Re: Always show test scores in email header

2009-03-31 Thread Andrew Bruce
On Tue, 31 Mar 2009 23:08:14 -0400, Matt Kettler mkettler...@verizon.net
wrote:
 Andrew Bruce wrote:
 Is it possible to have a header, or X-Spam-Status, always show the
 individual scores for each of the tests performed against a particular
 email (whether it is tagged as spam or not)?

 I see that when using MailScanner with SpamAssassin this always happens,
 but cannot replicate the same for a straight SpamAssassin installation.

 This is an example of what I get in an email's source from MailScanner
 and would like to replicate in SpamAssassin:
 X-MailScanner-Spam: not spam, SpamAssassin (not cached,
  score=4.616, required 5, BAYES_40 -0.18, DCC_CHECK 4.50,
  HTML_MESSAGE 0.00, RDNS_DYNAMIC 0.10, SARE_HTML_USL_A 0.20)

 You're using MailScanner, which generates its own markup. SA by default
 always adds such a header, but MailScanner doesn't use it.

 There's an option in MailScanner.conf to make MailScanner do this. It's
 something like "always include spamassassin report" or something like
 that.

Odd, because on SpamAssassin it never showed that header unless the message
was marked as spam.  Although I should have mentioned that it's being
called through amavisd-new which may have had something to do with it. 
I've added a custom header, and played with the $sa_tag_level_deflt values
in amavis, now the header shows up:

X-Spam-Scores: ALL_TRUSTED=-1.8,BAYES_00=-2.599,HTML_MESSAGE=0.001,
MIME_HTML_ONLY=1.457,NO_DNS_FOR_FROM=1.496
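
For anyone who wants to post-process a header like this one, here is a minimal Python sketch that splits it back into per-rule scores. The helper name `parse_scores` is made up for illustration; the header value is simply the one shown above, and the sketch assumes the `RULE=score,RULE=score` layout holds for every entry.

```python
def parse_scores(header_value: str) -> dict[str, float]:
    """Split 'RULE=score,RULE=score' into a {rule: score} mapping."""
    scores = {}
    # Folded headers may contain newlines; drop them before splitting.
    for item in header_value.replace("\n", "").split(","):
        item = item.strip()
        if not item:
            continue
        rule, _, value = item.partition("=")
        scores[rule] = float(value)
    return scores


header = ("ALL_TRUSTED=-1.8,BAYES_00=-2.599,HTML_MESSAGE=0.001,"
          "MIME_HTML_ONLY=1.457,NO_DNS_FOR_FROM=1.496")
scores = parse_scores(header)
total = sum(scores.values())
print(f"{len(scores)} rules, total {total:.3f}")
```

Summing the individual values this way is a quick check that the per-rule scores account for the overall score reported for the message.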



Re: Spamc giving different scores

2009-03-26 Thread Andrew Bruce
On Thu, 26 Mar 2009 18:15:01 -0700 (PDT), asimsinan
yuksel.asim.si...@gmail.com wrote:
 
 I ran spamc a couple of times. It sometimes gives different scores for
 the same email. Sometimes it gives higher than 5, sometimes lower. What
 can be wrong?
 --
 View this message in context:

http://www.nabble.com/Spamc-giving-different-scores-tp22734449p22734449.html
 Sent from the SpamAssassin - Users mailing list archive at Nabble.com.

Pipe the email through SpamAssassin on the command line using the command
below:
spamassassin -Dt < /path/to/email


You can then see the full output and what checks are hitting and missing
and what the scores are.


Andrew



Not scoring well on 'claims of £500,000 pounds' type emails

2008-11-25 Thread Andrew Hearn
Hello,

Our setup seems to work pretty well, but some spams are slipping
through. Has anyone got any suggestions of rules that will catch these
types of emails:
http://www.pastebin.ca/1266571

I do run Bayes, but it seems that Bayes didn't run for this message. I also
run the sought rules, and greylist before SpamAssassin for most messages.
(v3.2.4)

Thanks, Andrew.


Re: using RHEL / CentOS / Fedora perl?

2008-09-09 Thread Andrew Hearn

Justin Mason wrote:

have you seen this?

  http://blog.vipul.net/2008/08/24/redhat-perl-what-a-tragedy/

That bug in Red Hat perl will almost definitely slow down SpamAssassin,
too, I would say.  Can anyone verify?

--j.



This fixed it for me on a couple of centos servers:

http://people.centos.org/z00dax/bz379791/


Re: using RHEL / CentOS / Fedora perl?

2008-09-09 Thread Andrew Hearn

Randal, Phil wrote:

Andrew Hearn wrote:

Justin Mason wrote:

have you seen this?

  http://blog.vipul.net/2008/08/24/redhat-perl-what-a-tragedy/

That bug in Red Hat perl will almost definitely slow down
SpamAssassin, too, I would say.  Can anyone verify?

--j.


This fixed it for me on a couple of centos servers:

http://people.centos.org/z00dax/bz379791/


Did you notice a real-world performance boost after doing that?  Got any
numbers for pre- and post- spamassassin performance?



No, not that I've noticed yet anyway ;-)


Fraud spam text in .doc attachments

2008-05-16 Thread Andrew Hearn
Hi,

Has anyone else seen emails with Word documents attached, where the Word
document contains the text of an 'African fraud'?

example: http://pastebin.com/mad34c97

I've not seen a Word Doc plugin for SpamAssassin, is there one?

Thanks!

-- 
Andrew Hearn


Not scoring high enough on this spam...

2008-03-28 Thread Andrew Hearn

http://pastebin.ca/961075

I've only seen one so far, but apart from the 0.0 BAYES_50 (I will learn
this message), does anyone have rules that push this kind of message
over 5.0?


thanks!

Andrew



Ensuring Custom Rules Are Scored Properly

2008-03-18 Thread Andrew Wilkinson
I'm experimenting with Fedora 8 and a miltered sendmail configuration 
running as a mail gateway (smf-sav, smf-spf, milter-greylist, 
clamav-milter, spamass-milter).  I've configured spamassassin's local.cf 
with a custom rule.  It's a simple regex which checks the 'Received' 
header on inbound mail for any  IP in a specific Class C range, and 
negatively scores the message with -100 (probably extreme).  I'm just 
trying to ensure these messages are never tagged as spam.  I've 
--lint-ed the rule and I receive no syntax errors.  However, messages 
coming in from an IP in the specified range don't appear to be 
negatively scored.  In fact, the test messages being sent were scored 
as, say, 2.8 before AND after the rule was put into place.  spamass-milter and
spamassassin (as I'm running spamassassin daemonized) were both
restarted after rule creation.  I've verified the regex is correct,
running it through a couple of regex testers.

So, I guess I'd be expecting the X-Spam header on these messages to 
indicate a score of -97.2.  Am I assuming incorrectly?
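
The arithmetic in that expectation can be checked with a short Python sketch. Everything in it is an assumption made for illustration: the 192.0.2.0/24 range (reserved for documentation), the sample Received header, and the helper name `adjusted_score`. It is not the rule from local.cf, only a way to test a pattern against the raw header text and see what the total would be if the rule hit.

```python
import re

# Hypothetical /24 range; 192.0.2.0/24 is reserved for documentation.
TRUSTED_RANGE = re.compile(r"\[192\.0\.2\.\d{1,3}\]")
RULE_SCORE = -100.0


def adjusted_score(received_header: str, base_score: float) -> float:
    """Return the score after applying the custom rule, if it matches."""
    if TRUSTED_RANGE.search(received_header):
        return base_score + RULE_SCORE
    return base_score


received = "from mail.example.org (mail.example.org [192.0.2.17]) by gw.example.net"
print(round(adjusted_score(received, 2.8), 1))
```

If the rule were firing, a message that otherwise scores 2.8 would come out at -97.2, so an unchanged 2.8 suggests the rule is either not loaded by the running daemon or not matching the header as SpamAssassin sees it.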


thanks


Re: How many use CRM114?

2008-03-04 Thread Andrew Hearn
Blaine Fleming wrote:
 Slightly off-topic, but I'm curious, how many of you are using CRM114? 
 How well does it work for you?  Was it difficult to train?  I've been
 looking at it and haven't found much except the official plugin guide
 and a single page saying that it works better than other learning
 methods.  Any info would be appreciated.

Hello

I've only just started using it on a test server, I'll let you know how
I find the results!



Andrew


Re: Lots Of SPAM

2008-02-26 Thread Andrew Hearn
Tarak Ranjan wrote:
  Hi List,
  i have posted my RAW email in http://pastebin.ca/918849 ,
  i'm receiving 1000 to 4000 messages of this kind per day.
  SA is also skipping this kind of mail
 
  /
  TArak
 
 

I get 8.2 without Bayes...

1.5 IXHASH2 BODY: mail has been classified as spam @ LogInSolutions AG, Germany
0.0 CLAMAV Clam AntiVirus detected something...
4.0 JM_SOUGHT_1 JM_SOUGHT_1
0.2 RDNS_NONE  Delivered to trusted network by a host with no rDNS
2.5 CLAMAV_SANESPAM found by ClamAV SaneSecurity signatures

(JM_SOUGHT was talked about earlier in the list)

Andrew.



unsubscribe

2008-01-25 Thread Andrew Xiang


unsubscribe

2007-12-19 Thread Andrew Xiang
unsubscribe

Not sure why DOS_OE_TO_MX fired

2007-12-14 Thread Andrew Hearn
Hello,

I'm not sure why DOS_OE_TO_MX fired on this message, as the headers say
it was delivered to b.painless.aaisp.net.uk which relayed it on to
z.hopeless.aaisp.net.uk.

b.painless isn't the MX for the domain...

Any ideas? -Thanks!


Return-path: [EMAIL PROTECTED]
Envelope-to: [EMAIL PROTECTED]
Delivery-date: Fri, 14 Dec 2007 11:45:39 +
Received: from [2001:8b0:0:81::51bb:5134] (helo=b.painless.aaisp.net.uk)
by z.hopeless.aaisp.net.uk with esmtp (Exim 4.63)
(envelope-from [EMAIL PROTECTED])
id 1J38z2-0004B8-FV
for [EMAIL PROTECTED]; Fri, 14 Dec 2007 11:45:39 +
Received: from [217.169.3.9] (helo=DFTJ542J)
by b.painless.aaisp.net.uk with smtp (Exim 4.62)
(envelope-from [EMAIL PROTECTED])
id 1J38z2-00036f-7g
for [EMAIL PROTECTED]; Fri, 14 Dec 2007 11:45:36 +
Message-ID: [EMAIL PROTECTED]
From: Fiona Murphy [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Subject: website emergency!
Date: Fri, 14 Dec 2007 11:45:33 -
MIME-Version: 1.0
Content-Type: multipart/alternative;
boundary==_NextPart_000_00AF_01C83E46.D5CB6A50
X-Priority: 3
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook Express 6.00.2900.3138
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.3198
X-Virus-Scanned: Clear (Version: ClamAV 0.91.2/5116/Fri Dec 14 07:14:39
2007, by smtp.aaisp.net.uk)
X-AA-SMTP-Time-Scanned:YES
X-Spam-Score: 4.0 
X-AASpam-Report: Spam detection software, running on the system
b.spamless.aaisp.net.uk, has
processed this message.
This message scored (4.0 points and 4.6 are required to mark as spam)
 pts rule name              description
---- ---------------------- --------------------------------------------------
 1.2 HTML_MESSAGE           BODY: HTML included in message
 0.0 BAYES_50               BODY: Bayesian spam probability is 40 to 60%
                            [score: 0.5071]
 0.0 NO_VIRUS_FOUND         There were no viruses found in this message
                            by ClamAV
 2.8 DOS_OE_TO_MX           Delivered direct to MX with OE headers


Re: HELO_DYNAMIC_SPLIT_IP

2007-12-12 Thread Andrew Hearn
Giampaolo Tomassoni wrote:
 -Original Message-
 From: Andrew Hearn [mailto:[EMAIL PROTECTED]
 Sent: Tuesday, December 11, 2007 12:04 PM

 Hi,

 Can anyone explain why this email:
 http://pastebin.ca/811938
 is getting a hit on HELO_DYNAMIC_SPLIT_IP.

 I'm seeing a few ham messages being caught by this.

 (SpamAssassin version 3.2.3, sa-update)
 
 smtp.aaisp.net.uk maps to two IP addresses (81.187.81.51 and 81.187.81.52).
 
 An outgoing mail server is supposed to announce itself via HELO with its
 own, specific name, not with a service name (like smtp.etc.etc).
 
 aaisp.net.uk could define the following:
 
   smtp1   A   81.187.81.51
   smtp2   A   81.187.81.52
   smtp    A   81.187.81.51
           A   81.187.81.52
 
 where the latter name is only meant for their customers, to
 accept mail to be delivered. Then, when delivery occurs, the SMTP server
 should identify itself with its unique name. For example:
 
   EHLO smtp1.aaisp.net.uk
 
 This also allows two different entries to be defined in aaisp.net.uk's DNS
 reverse mappings:
 
   51  PTR smtp1.aaisp.net.uk.
   52  PTR smtp2.aaisp.net.uk.
 
 which may help in better identifying the abused host, should abuse ever happen.
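 
The naming scheme described above can be sketched in Python with plain dictionaries standing in for DNS. The tables mirror the suggested records, and `helo_identifies_host` is a made-up helper; this is not how the HELO_DYNAMIC_SPLIT_IP rule itself is implemented, only an illustration of why a per-host name is easier to verify than a shared service name.

```python
# Hypothetical zone data mirroring the records suggested above.
A_RECORDS = {
    "smtp1.aaisp.net.uk": ["81.187.81.51"],
    "smtp2.aaisp.net.uk": ["81.187.81.52"],
    "smtp.aaisp.net.uk": ["81.187.81.51", "81.187.81.52"],
}
PTR_RECORDS = {
    "81.187.81.51": "smtp1.aaisp.net.uk",
    "81.187.81.52": "smtp2.aaisp.net.uk",
}


def helo_identifies_host(helo: str, client_ip: str) -> bool:
    """True when the HELO name maps to exactly this one address and back."""
    forward_ok = A_RECORDS.get(helo) == [client_ip]
    reverse_ok = PTR_RECORDS.get(client_ip) == helo
    return forward_ok and reverse_ok


print(helo_identifies_host("smtp1.aaisp.net.uk", "81.187.81.51"))
print(helo_identifies_host("smtp.aaisp.net.uk", "81.187.81.51"))
```

With the unique name, the forward and reverse lookups agree on a single host; with the shared service name they cannot.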
 
 Giampaolo
 


Thanks for the reply and explanation, I'll look in to this!


HELO_DYNAMIC_SPLIT_IP

2007-12-11 Thread Andrew Hearn
Hi,

Can anyone explain why this email:
http://pastebin.ca/811938
is getting a hit on HELO_DYNAMIC_SPLIT_IP.

I'm seeing a few ham messages being caught by this.

(SpamAssassin version 3.2.3, sa-update)

Thanks!

Andrew


Re: SQL-based AWL and Bayes not working with 3.2.3

2007-11-19 Thread Andrew Hearn (AAISP)
Rene Caspari wrote:
 Hi,
 
 I'm using SpamAssassin 3.2.3 with user-specific rules from an SQL
 database:
 
 /etc/spamassassin/local.cf:
 user_scores_dsn DBI:mysql:spamassassin:localhost
 [...]
 bayes_store_module  Mail::SpamAssassin::BayesStore::SQL
 [...]
 auto_whitelist_factory  Mail::SpamAssassin::SQLBasedAddrList
 
 spamc is called by procmail.
 /etc/procmailrc:
 :0fw
 * < 256000
 | /usr/bin/spamc -U /var/run/spamd.sock -u $USER
 
 (where $USER is created by Postfix:
 /usr/bin/procmail -t -m USER=${recipient} SENDER=${sender} /etc/procmailrc)
 
 Since I updated to 3.2.3 (Debian Volatile) I get the error message in
 /var/log/mail.log:
 [...] spamd: still running as root: user not specified with -u, not found, or 
 set to root, falling back to nobody
 
 After this, spamassassin uses the user-specific SQL tables with the user
 nobody, not the specific user who is the recipient of the mail being scanned.
 
 Do you have an idea how I can resolve this?

I think I have the same problem too, on one of our test servers. This
is the one I'm running 3.2.3 on, using the same config as our other
3.1.7 machines, which are happy with Bayes...

User preferences are being used: I can tell because the required score
is being set correctly from the preferences.




-- 
Andrew Hearn


user_in_whitelist , how do I find out which one?

2007-10-22 Thread Andrew Xiang
I have many users in the whitelist_from in the local.cf.
When I get forwarded spam email like this, how do I find which one it matched? 
Which FROM entry is it actually looking at?
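
One way to narrow this down by hand is to test each sender address found in the headers against every whitelist_from pattern. The Python sketch below assumes the patterns are file-glob style (`*` and `?`), as the SpamAssassin configuration documentation describes; the local.cf excerpt, the addresses, and the helper name `matching_whitelist_entries` are placeholders. Note that Python's `fnmatch` also treats `[...]` specially, which may differ from SpamAssassin's own matching.

```python
import fnmatch

# Hypothetical excerpt of a local.cf; the addresses are placeholders.
LOCAL_CF = """
whitelist_from *@example.com
whitelist_from friend@example.org newsletter@*.example.net
"""


def matching_whitelist_entries(config_text: str, address: str) -> list[str]:
    """Return every whitelist_from pattern that matches the given address."""
    hits = []
    for line in config_text.splitlines():
        parts = line.split()
        if len(parts) < 2 or parts[0] != "whitelist_from":
            continue
        for pattern in parts[1:]:
            # Compare case-insensitively, since mail addresses usually are.
            if fnmatch.fnmatchcase(address.lower(), pattern.lower()):
                hits.append(pattern)
    return hits


print(matching_whitelist_entries(LOCAL_CF, "Someone@Example.com"))
```

Running it once for the From: address and once for the envelope sender should show which entry is responsible, on the assumption that both are consulted for whitelisting.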

-Andrew


X-Spam-Checker-Version: SpamAssassin 3.2.1 (2007-05-02) on xphotonics.com
X-Spam-Level: 
X-Spam-Status: No, score=-72.0 required=5.0 tests=BAYES_50,DCC_CHECK,
 DIGEST_MULTIPLE,DRUGS_ERECTILE,HTML_MESSAGE,HTML_MIME_NO_HTML_TAG,
 MIME_HTML_ONLY,PYZOR_CHECK,RAZOR2_CF_RANGE_51_100,RAZOR2_CF_RANGE_E4_51_100,
 RAZOR2_CHECK,SARE_FROM_DRUGS,UNPARSEABLE_RELAY,USER_IN_WHITELIST autolearn=no
 version=3.2.1
X-Spam-Pyzor: Reported 4263 times.
X-Spam-Report: 
 * -100 USER_IN_WHITELIST From: address is in the user's white-list
 *  1.7 SARE_FROM_DRUGS From a drug
 *  5.5 UNPARSEABLE_RELAY Informational: message has unparseable relay lines
 *  0.0 HTML_MESSAGE BODY: HTML included in message
 *  0.0 BAYES_50 BODY: Bayesian spam probability is 40 to 60%
 *  [score: 0.5000]
 *  3.5 MIME_HTML_ONLY BODY: Message only has text/html MIME parts
 *  5.0 RAZOR2_CHECK Listed in Razor2 (http://razor.sf.net/)
 *  1.5 RAZOR2_CF_RANGE_E4_51_100 Razor2 gives engine 4 confidence level
 *  above 50%
 *  [cf: 100]
 *  0.5 RAZOR2_CF_RANGE_51_100 Razor2 gives confidence level above 50%
 *  [cf: 100]
 *  5.0 PYZOR_CHECK Listed in Pyzor (http://pyzor.sf.net/)
 *  5.0 DCC_CHECK Listed in DCC (http://rhyolite.com/anti-spam/dcc/)
 *  0.0 DIGEST_MULTIPLE Message hits more than one network digest check
 *  0.3 DRUGS_ERECTILE Refers to an erectile drug
 *  0.1 HTML_MIME_NO_HTML_TAG HTML-only message, but there is no HTML tag
Received: from xphotonics.com (localhost [127.0.0.1])
 by xphotonics.com (8.14.1/8.14.1) with ESMTP id l9MFJIOp032936
 (version=TLSv1/SSLv3 cipher=DHE-DSS-AES256-SHA bits=256 verify=NO)
 for [EMAIL PROTECTED]; Mon, 22 Oct 2007 11:19:18 -0400 (EDT)
 (envelope-from [EMAIL PROTECTED])
Received: (from [EMAIL PROTECTED])
 by xphotonics.com (8.14.1/8.14.1/Submit) id l9MFJIKX032935
 for xiang; Mon, 22 Oct 2007 11:19:18 -0400 (EDT)
 (envelope-from lian)
Received: from 029ae8f252bf4ac (84pavel.dialup.corbina.ru [85.21.237.209])
 by xphotonics.com (8.14.1/8.14.1) with SMTP id l9MFHg8N032899
 for [EMAIL PROTECTED]; Mon, 22 Oct 2007 11:17:44 -0400 (EDT)
 (envelope-from [EMAIL PROTECTED])
Date: Mon, 22 Oct 2007 11:17:42 -0400 (EDT)
Received: from Susana Ware (10.11.17.11) by 029ae8f252bf4ac (PowerMTA(TM) 
v3.2r4) id hfp31o62d55j87 for [EMAIL PROTECTED]; Mon, 22 Oct 2007 07:17:20 
+0300
Message-Id: [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Subject: October 79% OFF
From: VIAGRA ?Official Site [EMAIL PROTECTED]
MIME-Version: 1.0
Content-Type: text/html; charset=iso-8859-1
Content-Transfer-Encoding: 8bit
X-Virus-Scanned: ClamAV 0.91.1/4559/Mon Oct 22 00:02:57 2007 on xphotonics.com
X-Virus-Scanned: ClamAV 0.91.1/4559/Mon Oct 22 00:02:57 2007 on xphotonics.com
X-Virus-Status: Clean

style
!DOCTYPE html PUBLIC -//W3C//DTD XHTML 1.0 Strict//EN 
http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd;
html dir=ltr
head
meta http-equiv=Content-Type content=text/html; charset=unicode
meta name=Generator content=Microsoft SafeHTML
titleWL 90-day Email 1a/title
table width=550 border=0 cellpadding=0 cellspacing=0 bgcolor=#99
/tr
tr valign=top
td colspan=5img src=http://ads1.oqr.com/ads/pronws/CIQ3536/1a_banner.jpg; 
alt=Windows
 Live Hotmail width=548 height=224 border=0/td


Problem with ERROR: invalid byte sequence for encoding UTF8: 0x8a

2007-07-30 Thread Andrew R Jackson
I keep seeing these in my postgresql log file. What did I do wrong?

ERROR:  invalid byte sequence for encoding UTF8: 0xd255
HINT:  This error can also happen if the byte sequence does not match the
encoding expected by the server, which is controlled by client_encoding.
STATEMENT:  SELECT spam_count, ham_count, atime
   FROM bayes_token
  WHERE id = $1
AND token = $2
ERROR:  invalid byte sequence for encoding UTF8: 0xd255
HINT:  This error can also happen if the byte sequence does not match the
encoding expected by the server, which is controlled by client_encoding.
STATEMENT:  INSERT INTO bayes_token
   (id, token, spam_count, ham_count, atime)
   VALUES ($1,$2,$3,$4,$5)


Here is my local.cf file:

http://www.pastebin.ca/639583


spamd runs with these arguments:

 /usr/bin/spamd -d -i 127.0.0.1 -m 5 -H -q -x -d
--pidfile=/var/run/spamd.pid



Any help would be appreciated. Thanks.
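
As background on the error message itself, the byte sequences PostgreSQL complains about really are not valid UTF-8, which a few lines of Python can confirm. This is only a sketch of the encoding rule, under the assumption that the Bayes tokens being written are arbitrary binary data rather than text; `is_valid_utf8` is a made-up helper name.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# 0xd2 announces a two-byte sequence, but 0x55 is not a continuation byte.
print(is_valid_utf8(b"\xd2\x55"))
# 0x8a (from the subject line) is a continuation byte with no lead byte.
print(is_valid_utf8(b"\x8a"))
# Ordinary text is fine.
print(is_valid_utf8("token".encode("utf-8")))
```

If that assumption holds, the token column needs a binary-safe type or encoding on the database side; that is a guess about the cause, not a confirmed diagnosis.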


Auto-whitelist Errors others.

2007-03-08 Thread Andrew Rosolino

I am having some serious problems with SpamAssassin. For example, check out my
logs:

Mar  8 14:42:32 penguin spamd[15553]: spamd: connection from localhost [127.0.0.1] at port 52601
Mar  8 14:42:32 penguin spamd[15553]: spamd: setuid to root succeeded
Mar  8 14:42:32 penguin spamd[15553]: spamd: still running as root: user not specified with -u, not found, or set to root, falling back to nobody at /usr/bin/spamd line 1147, GEN15 line 4.
Mar  8 14:42:32 penguin spamd[15553]: spamd: processing message [EMAIL PROTECTED] for root:99
Mar  8 14:42:32 penguin spamd[15553]: locker: safe_lock: cannot create tmp lockfile /var/spool/spamassassin/auto_whitelist.lock.penguin.leapcash.com.15553 for /var/spool/spamassassin/auto_whitelist.lock: Permission denied
Mar  8 14:42:32 penguin spamd[15553]: auto-whitelist: open of auto-whitelist file failed: locker: safe_lock: cannot create tmp lockfile /var/spool/spamassassin/auto_whitelist.lock.penguin.leapcash.com.15553 for /var/spool/spamassassin/auto_whitelist.lock: Permission denied
Mar  8 14:42:32 penguin spamd[15553]: spamd: identified spam (1000.0/5.0) for root:99 in 0.2 seconds, 834 bytes.
Mar  8 14:42:32 penguin spamd[15553]: spamd: result: Y 999 - GTUBE,NO_RECEIVED,NO_RELAYS scantime=0.2,size=834,user=root,uid=99,required_score=5.0,rhost=localhost,raddr=127.0.0.1,rport=52601,mid=[EMAIL PROTECTED],autolearn=no
Mar  8 14:42:32 penguin spamd[15537]: prefork: child states: II


Here are the permissions for the folder:
drw-rw-rw-2 root nobody   4096 Mar  8 14:35 spamassassin/

And the files:
-rw-rw1 root nobody  12288 Mar 30  2006 auto-whitelist
-rw-rw1 root nobody  12288 Feb 16  2005 bayes_seen
-rw-rw1 root nobody  12288 Feb 16  2005 bayes_toks
-rw-r--r--1 root nobody   1218 Feb 16  2005 user_prefs

Now, if the spamassassin folder has group write access for nobody, then why
won't it write to this folder?

 Mar  8 14:42:32 penguin spamd[15553]: spamd: still running as root: user
 not specified with -u, not found, or set to root, falling back to nobody
 at /usr/bin/
That tells me it's using the nobody user.

 Mar  8 14:42:32 penguin spamd[15553]: locker: safe_lock: cannot create tmp
 lockfile
 /var/spool/spamassassin/auto_whitelist.lock.penguin.leapcash.com.15553 for
 /var/spool/spamassassin/auto_whitelist.lock: Permission denied
So why the error above?

Any help GREATLY appreciated =)
-- 
View this message in context: 
http://www.nabble.com/Auto-whitelist-Errors---others.-tf3371373.html#a9381345
Sent from the SpamAssassin - Users mailing list archive at Nabble.com.



Re: [2] Auto-whitelist Errors others.

2007-03-08 Thread Andrew Rosolino

Why does a directory need execute permissions?


Theo Van Dinter-2 wrote:
 
 On Thu, Mar 08, 2007 at 11:44:31AM -0800, Andrew Rosolino wrote:
 Mar  8 14:42:32 penguin spamd[15553]: spamd: setuid to root succeeded
 Mar  8 14:42:32 penguin spamd[15553]: spamd: still running as root: user
 not
 specified with -u, not found, or set to root, falling back to nobody at
 /usr/bin/spamd line 1147, GEN15 line 4.
 
 don't call spamd (via spamc) as root.
 
 Here is the permissions for the folder:
 drw-rw-rw-2 root nobody   4096 Mar  8 14:35 spamassassin/
 
 That's definitely not going to work.  0777, not 0666 (directory, not a
 file).
 
 -- 
 Randomly Selected Tagline:
 You can't build a reputation on what you are going to do. - Henry Ford
 
  
 

-- 
View this message in context: 
http://www.nabble.com/Auto-whitelist-Errors---others.-tf3371373.html#a9386463
Sent from the SpamAssassin - Users mailing list archive at Nabble.com.



Re: [2] Auto-whitelist Errors others.

2007-03-08 Thread Andrew Rosolino

Thanks guys everything is good now =D!


Phil Barnett wrote:
 
 On Thursday 08 March 2007 19:46, Andrew Rosolino wrote:
 Why does a directory need execute permissions?
 
 Because you can't use it and you can't move into it unless it does.
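 
To make the point concrete, here is a small Python sketch that decodes the two directory modes discussed in this thread (0666 and 0777) using only the standard `stat` module. The helper `group_can_create_files` is made up for illustration and looks at the group bits only; it does not touch the filesystem.

```python
import stat


def group_can_create_files(mode: int) -> bool:
    """Creating a file in a directory needs both write and search (x) bits."""
    return bool(mode & stat.S_IWGRP) and bool(mode & stat.S_IXGRP)


for perms in (0o666, 0o777):
    mode = stat.S_IFDIR | perms
    print(stat.filemode(mode), group_can_create_files(mode))
```

On a directory the "execute" bit means permission to search it, so without it the lock file cannot be created even though the write bit is set.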
 
 -- 
 Balmer is basically saying: We know there's a problem but we're not going
 to 
 tell you what it is because we want to ambush you in the future.
 
 

-- 
View this message in context: 
http://www.nabble.com/Auto-whitelist-Errors---others.-tf3371373.html#a9386624
Sent from the SpamAssassin - Users mailing list archive at Nabble.com.



Rules - How to capture matched text

2006-12-18 Thread Andrew Brosnan
Hello,

In Perl you can use $&, or parens with $1, $2, etc., to capture the text that
matched a regex; but how do you do it in SA?

Thank you
Andrew



Re: Rules - How to capture matched text

2006-12-18 Thread Andrew Brosnan
On 12/18/06 at 3:41 PM, [EMAIL PROTECTED] (Theo Van Dinter) wrote:

 On Mon, Dec 18, 2006 at 02:39:13PM -0500, Andrew Brosnan wrote:
  In Perl you can use $&, or parens with $1, $2, etc., to capture the text
  that matched a regex; but how do you do it in SA?
 
 It depends what you're trying to do.  If you want to do matching 
 between different rules, you can't do it, short of writing a plugin 
 to do what you want.  If you want to match within the same regex, 
 it's like any other regex:
 
 /([a-z]+) foo bar \1/
 
 generally speaking, capturing increases resource usage, so don't do 
 it unless necessary (hence the large amount of (?:...) instead of 
 (...) in the rules).
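
The two constructs mentioned there, a backreference within one regex and a non-capturing group, behave the same way in Python's `re` module, so they can be tried outside SpamAssassin. This is only an illustration with made-up sample strings, not a SpamAssassin rule.

```python
import re

# Capturing group plus backreference: the same word must appear twice.
repeated = re.compile(r"([a-z]+) foo bar \1")
# Non-capturing group: groups the alternation without storing the match.
grouped = re.compile(r"(?:foo|bar) baz")

match = repeated.search("same foo bar same")
print(match.group(1) if match else None)
print(repeated.search("same foo bar other"))
print(grouped.search("bar baz").groups())
```

The first pattern only matches when the captured word recurs; the second matches without keeping any captured text, which is why `groups()` comes back empty.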


Thanks Theo,

I'd like the rule to catch when the first name in from: is also the
subject:.

I was going to capture the name in from: and compare it to subject:.

I'll have to give some thought to how I can do that without capturing
text. :-)

Regards,
Andrew


Score counting error

2006-12-08 Thread Andrew Hearn (AAISP)
Hi,

In my headers I see:

X-Spam-Status: No, score=4.3 required=4.4 tests=BAYES_99,NO_RELAYS
autolearn=disabled version=3.1.7
X-Spam-Report:
* -0.0 NO_RELAYS Informational: message was not relayed via SMTP
*  4.4 BAYES_99 BODY: Bayesian spam probability is 99 to 100%
*  [score: 1.]

Seems odd that score doesn't add up? (4.4 + 0.0 = 4.3!!)
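
One plausible explanation, offered as an assumption rather than a confirmed diagnosis, is that each figure in the report is rounded to one decimal place separately, so the rounded parts need not add up to the rounded total. The Python sketch below uses hypothetical unrounded scores chosen only to reproduce the effect; they are not the real rule scores.

```python
def shown(score: float) -> str:
    """Format a score the way the report shows it: one decimal place."""
    return f"{score:.1f}"


# Hypothetical unrounded rule scores, chosen only to reproduce the effect.
bayes_99 = 4.38
no_relays = -0.04

print(shown(no_relays), shown(bayes_99), shown(bayes_99 + no_relays))
```

With these values the report would show -0.0 and 4.4 for the rules but 4.3 for the total, which matches the pattern in the headers above.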


-- 
Andrew Hearn


Newbie Question

2006-11-24 Thread Andrew Sykes
Hi,

I'm writing some code to integrate SpamAssassin with Apache JAMES.

I want to setup an address to allow me to pipe spam into sa-learn. I
have a prototype of this working fine, but would like to allow various
webmail client users to be able to forward spam messages to this
address.

As I have very limited understanding of how SA works, I don't want to
end up blocking the forwarding addresses.

If I whitelist the forwarding addresses, can I then simply pipe a
forwarded spam from that address into sa-learn or is there more to it?

Thanks a lot for your help.
-- 
Kind Regards
Andrew Sykes [EMAIL PROTECTED]
Sykes Development Ltd
http://www.sykesdevelopment.com



Re: Newbie Question

2006-11-24 Thread Andrew Sykes
Matt,

Thank you, that makes things a lot clearer, is there any way to utilise
forwarded messages or is it a lost cause?

Thanks
Andrew

On Fri, 2006-11-24 at 10:22 -0500, Matt Kettler wrote:
 Andrew Sykes wrote:
  Hi,
 
  I'm writing some code to integrate SpamAssassin with Apache JAMES.
 
  I want to setup an address to allow me to pipe spam into sa-learn. I
  have a prototype of this working fine, but would like to allow various
  webmail client users to be able to forward spam messages to this
  address.
 
  As I have very limited understanding of how SA works, I don't want to
  end up blocking the forwarding addresses.
 
  If I whitelist the forwarding addresses, can I then simply pipe a
  forwarded spam from that address into sa-learn or is there more to it?

 
 There's MUCH more to it.. In fact, whitelisting won't really affect what
 sa-learn does at all.
 
 Generally speaking, forwarded messages are mostly useless to sa-learn.
 Exactly how useless depends a bit on the mail client..
 
 SA tokenizes MANY mail headers, including Received:, not just From: and
 To. All the headers in a forwarded message are completely new, thus the
 sa-learn process will be learning the headers generated by forwarding,
 and not spam.
 
 SA also tokenizes the body of the message. However, most mail clients
 substantially modify the body of the message when you forward. 
 Generally speaking they only preserve one of the mime sections in a
 multipart/alternative message. Spammers FREQUENTLY have text/plain
 sections which are dissimilar from the text/html. By forwarding you're
 losing all but one mime section (generally text/html is kept).
 
 On top of this, most mail clients also insert Forwarded message: type
 text into the body, and add Fwd: to the subject.
 
 SA also tokenizes the in-body mime headers describing how the message
 was encoded. However, when you forward, the mail client doing the
 forward re-encodes things its own way. What might have been base64
 encoded may now be quoted-printable, 8 bit, or 7 bit.
 
 So, fundamentally, as far as bayes is concerned the forwarded message is
 a completely different message than the original spam.
 
 You can try this sometime by taking an original spam and a forwarded
 version of it, and feeding them both to spamassassin or sa-learn with "-D
 bayes" added. This will cause the debug output to list all the tokens
 used. Take a look at the tokens: some are the same, but many are different.
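 
The effect described above can be illustrated with a rough Python sketch. The tokenizer here is a deliberately naive stand-in (lowercased, whitespace-split words), not SpamAssassin's Bayes tokenizer, and both messages are made up; the point is only that forwarding changes headers, encoding markers, and body text, so the token sets diverge.

```python
def naive_tokens(text: str) -> set[str]:
    """Very rough stand-in for a tokenizer: lowercase whitespace-split words."""
    return set(text.lower().split())


# Made-up messages: an original spam and the same spam after forwarding.
original = """Received: from bot.example.net
Subject: Cheap pills
Content-Transfer-Encoding: base64

Buy cheap pills today"""

forwarded = """Received: from webmail.example.com
Subject: Fwd: Cheap pills
Content-Transfer-Encoding: quoted-printable

---------- Forwarded message ----------
Buy cheap pills today"""

a, b = naive_tokens(original), naive_tokens(forwarded)
overlap = len(a & b) / len(a | b)
print(f"shared: {len(a & b)}, only original: {len(a - b)}, only forwarded: {len(b - a)}")
print(f"Jaccard overlap: {overlap:.2f}")
```

Even in this toy case the forwarded copy gains tokens the spammer never sent and loses some that were in the original.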
 
 
 
 
 
 
 
-- 
Kind Regards
Andrew Sykes [EMAIL PROTECTED]
Sykes Development Ltd
http://www.sykesdevelopment.com



Re: Sudden drop in spam-rate, parallel to a surge of new trojans - beware

2006-11-22 Thread Andrew Hearn (AAISP)
Chris wrote:
 On Tuesday 21 November 2006 6:47 pm, Chr. v. Stuckrad wrote:
 Hi!

 Yesterday we had a sudden drop in spam-percentage from 80% to near 60%.
 Parallel to it I got six copies of an undetectable (by NAI and ClamAV)
 new trojan 'exe' in the Mail.

 Do we have to prepare for a new flood by an updated
 (just now reorganizing) botnet?

 Stucki
 
 Yes, I did see a drop in yesterdays spam load:
 
 Total:  255 reports in 16m 54s.  3.97 seconds per report.
 Mon Nov 20 21:01:17 CST 2006
 
 compared with Sunday's:
 
 Total:  434 reports in 30m 34s.  4.22 seconds per report.
 Sun Nov 19 20:03:19 CST 2006
 
 But today's was a killer!:
 
 Total:  580 reports in 39m 28s.  4.08 seconds per report.
 Tue Nov 21 22:08:56 CST 2006
 

Sorry to be OT, but are these spam stats a built in feature of SA, or
have you got a plugin to get this information? Thanks!

-- 
Andrew Hearn


Spam with two subject headers

2006-11-16 Thread Andrew Hawthorne
Hello,

I'm running SpamAssassin 3.1.3 on Qmail.

99% of the spam that is processed by SA has the subject header rewritten. A
few times a day, however, there are spams that get processed by SA and do
not have the 'detected spam' string in the subject. In these spams there are
two Subject lines: the first is the original subject and the second is
the string that identifies an email as spam in the subject. The
'X-Spam-Prev-Subject' header says '(nonexistent)', which is not the case at
all!

Here are two links to the headers of two of these spams: spam_1
http://boxmodel.com/spam.txt  spam_2 http://boxmodel.com/more_spam.txt

I'd really appreciate any advice that this group could give me
to help me resolve this issue. Many thanks in advance.

Andrew



RE: Spam with two subject headers

2006-11-16 Thread Andrew Hawthorne
Sorry for not being more specific. I'm not using qmail-scanner, just thought
it might be helpful to mention qmail is my MTA.

I have the same results as you after removing SA markup and retesting... The
difference between the two however is the X-Spam-Prev-Subject header - it
doesn't read '(nonexistent)' as it did in the email links I posted. Also the
missing subject rule never got hit during the test of the cleaned email.

Any chance spamd is not processing the same?
Perhaps a clever spammer trick? 



 -Original Message-
 From: Theo Van Dinter [mailto:[EMAIL PROTECTED]
 Sent: Thursday, November 16, 2006 8:06
 To: users@spamassassin.apache.org
 Subject: Re: Spam with two subject headers
 
 On Thu, Nov 16, 2006 at 07:43:52AM -0800, Andrew Hawthorne wrote:
  I'm running SpamAssassin 3.1.3 on Qmail.
 
 What does that mean exactly?  qmail-scanner ?
 
  Here are two links to the headers of two of these spams: spam_1
  http://boxmodel.com/spam.txt  spam_2
 http://boxmodel.com/more_spam.txt
 
 I took them both, removed the SA markup, added GTUBE appropriately,
 and ran it
 through w/ a rewrite_header Subject ... config, and it worked fine.
 
 --
 Randomly Selected Tagline:
 This tagline is ANNOYWARE! To register, send me some fish.




Subject not rewritten, two subject headers

2006-11-15 Thread Andrew Hawthorne
Greetings,

I've been receiving a number of spams lately that are being correctly
identified as spam by SA; however, the subject line is not being rewritten.
I have noticed that there are two subject lines and the X-Spam-Prev-Subject
header states "(nonexistent)". Below is part of one of the email headers that
contains the two Subjects. When the email is delivered, the subject reads
"Full of health? Then don't click!", completely untouched!

All other SA headers appear normal and are not included, to try and make this
message smaller. This message's score was 50+. I'm running SpamAssassin
3.1.3. Any help resolving this would be greatly appreciated. ~thanks



Subject: Full of health? Then don't click!
Date: Wed, 15 Nov 2006 00:30:22 +0100
MIME-Version: 1.0
Content-Type: multipart/related;
 type=multipart/alternative;
 boundary=ms030907010507030208050907
X-Priority: 3
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook Express 6.00.2900.2180
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.2180
Subject: ***SPAM*** 
X-Spam-Prev-Subject: (nonexistent)

This is a multi-part message in MIME format.

--ms030907010507030208050907
Content-Type: multipart/alternative;


RE: Subject not rewritten, two subject headers

2006-11-15 Thread Andrew Hawthorne
Question, since you only quoted some of the headers.. is there a blank
line anywhere in the headers before the subject header?


There are no blank lines... anything else I should check? I attempted to
send all the headers and the email was bounced back to me because it was too
spammy *grin*.

~thanks




Re: White listing yahoo groups

2006-11-14 Thread Andrew Hodgson
On Tue, 14 Nov 2006 10:21:02 -0800, Bill Moseley [EMAIL PROTECTED]
wrote:

[...]

Yes, it is my machine rejecting the mail that is flagged spam.
And when I reject too many messages Yahoo's mailing list software
considers my email non-working and stops delivering list messages.

Snap!  I have the same issue here, I reject with a high score, and it
only takes one to put it into bounce mode.  Also, they never let you
know you are bouncing until like the next couple of days.

The other problem is I have a system here which does some checks on
the SMTP transaction and performs checks which gets to SA, and due to
the way Yahoo delivers the messages to multiple recipients on the same
domain (through sending the message multiple times in the same SMTP
transaction) this caused problems as well.

I guess I'm just curious how others deal with mailing lists.  I
suspect just like any other mail -- if a message has a high enough
spam score then reject it.

I am going to try some of the other messages in this thread - may take
a while though, as I have to wait for one to trip the system.

Andrew.



Training sa-learn from Outlook.

2006-09-20 Thread Andrew van Tilburg








I imagine the following questions have been asked a lot,
but I haven't seen the exact answers I'm after yet, so here goes.



We are running qmail, vpopmail, spamassassin, smb shares
using samba, among other things, on freebsd. I want to set up public ham and
spam folders such that our users can drag emails from Outlook. I can then set
up a cron job that runs sa-learn on those folders and deletes
the mail. 
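
A cron job along those lines could be sketched in Python roughly as follows. This is only an illustration, not a tested setup: the folder path in the usage note is hypothetical, and it assumes every file in a folder is one complete, unmodified RFC 822 message, which may well not hold for mail dragged out of Outlook. `sa-learn --spam <dir>` and `sa-learn --ham <dir>` are real sa-learn invocations; the function names are made up for the example.

```python
import subprocess
from pathlib import Path


def build_learn_command(kind: str, folder: Path) -> list[str]:
    """Return the sa-learn command line for a folder of spam or ham."""
    if kind not in ("spam", "ham"):
        raise ValueError(f"unknown message class: {kind}")
    return ["sa-learn", f"--{kind}", str(folder)]


def learn_and_purge(kind: str, folder: Path) -> None:
    """Train on every file in the folder, then delete those files."""
    if not folder.is_dir():
        return
    # Snapshot the file list first so mail dropped in mid-run is kept.
    messages = [p for p in folder.iterdir() if p.is_file()]
    if not messages:
        return
    # check=True raises if sa-learn fails, so nothing is deleted on error.
    subprocess.run(build_learn_command(kind, folder), check=True)
    for path in messages:
        path.unlink()
```

A script run from cron would then call, for example, `learn_and_purge("spam", Path("/home/shares/spam"))` and the same for the ham folder.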



Can I just create two public samba shares, then use those
for the emails and run sa-learn on them? I guess not because
the emails by this stage are wrecked by Outlook. How else can I do this?



Also, I don't understand exactly the implications of
which user you run sa-learn under. How do I set this up
when running sa-learn? I suppose if I run it as the
same user as vpopmail then this will work?



Apologies if these questions have already been covered in
this mailing list or elsewhere.



Andrew.








Re: Where to install imageinfo.pm?

2006-08-24 Thread Andrew

BG Mahesh wrote:


hi

I am using SA-3.1.4. I am in the process of installing 
http://www.rulesemporium.com/plugins.htm
Where do I install ImageInfo.pm 
http://www.rulesemporium.com/plugins/ImageInfo.pm [which directory]?




On my FreeBSD box, I put ImageInfo.pm here:
/usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/Plugin/

Andrew



Is anyone else seeing these?

2006-08-22 Thread Andrew
Is anyone else seeing this sort of spam? It consists of a short message 
and always has a URL in it that ends with the string '/sk/'. The URL 
points to a web site advertising human growth hormone and testosterone 
treatment.


These spams aren't firing on enough rules to be tagged by SpamAssassin. 
The URL changes often enough that the URIBL plugin doesn't catch a lot 
of them. Has anyone had more luck than me at stopping these emails?


Andrew



just wanted to see if you were still dreaming the notion of getting toned?

I so want to be, that is why i am so joyous i chanced upon

http://www.dontimesogooder.org/sk/

It was best decisevely having someone to support me out.

to examine it, I found career that it was
of the beasts rain again closing
visit religious conviction, as much



Re: SA-LEARN Question

2006-08-22 Thread Andrew
Jim Maul wrote:
 Christopher Mills wrote:
 Hi,
 We have over 100 domains on a server, all of which are getting junk
 mail. SA 3.1.4 installed, but I don't think it's properly trained yet
 (even though I did upgrade from an earlier version).

 If I set up a [EMAIL PROTECTED]
 address and tell all my customers to forward the junk mail they get to
 that address, then run sa-learn on that mailbox, will that help, or,
 will it train SA that the users that forwarded the junk ARE the
 spammers and start to assign higher scores to legitimate customers?
 
 If you forward the emails, this process will not work.  You must either
 forward it as an attachment and then strip the attachment and run
 sa-learn on that or use some other method which preserves the original
 headers.  How you do this depends largely on your setup.
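
The "strip the attachment" step could look something like this in Python. It is a minimal sketch using only the standard library `email` package, and it assumes the mail client forwarded the junk as a `message/rfc822` attachment; the function name is made up for the example. Each returned byte string is an original message, headers included, that could then be written to a file or maildir for sa-learn.

```python
from email import message_from_bytes, policy


def extract_attached_messages(raw: bytes) -> list[bytes]:
    """Return every message/rfc822 attachment found in a forwarded mail."""
    wrapper = message_from_bytes(raw, policy=policy.default)
    originals = []
    # walk() also descends into attached messages, so messages nested
    # inside an attachment are returned as well.
    for part in wrapper.walk():
        if part.get_content_type() == "message/rfc822":
            # The payload of a message/rfc822 part is a one-element list
            # holding the attached message object.
            inner = part.get_payload(0)
            originals.append(inner.as_bytes())
    return originals
```

A mail that was forwarded inline (not as an attachment) yields an empty list, which matches the point above: the original headers are gone and there is nothing useful to learn from.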
 

Here's a link describing how I use maildrop to deliver emails to special
maildirs for processing by sa-learn.

http://www.arda.homeunix.net/spamassassin.html#bayesian

Andrew



Re: spamassassin on qmail

2006-07-29 Thread Andrew

Kjetil Kjernsmo wrote:

On Saturday 29 July 2006 08:48, Kaushal Shriyan wrote:


does spamassassin work on qmail MTA



Yes. Also, you might want to look into using the qpsmtpd component, as 
it gives you a lot of power over the SMTP dialogue: 
http://smtpd.develooper.com/




You might also want to have a look at my Howto describing my 
netqmail/SpamAssassin setup.


http://www.arda.homeunix.net/spamassassin.html

Andrew



whitelists and blacklists question

2006-07-07 Thread Andrew
I currently have SA running in a site-wide configuration using 
spamc/spamd. I would like to implement whitelists/blacklists on a per 
account basis. I use qmail and maildrop, so for per account processing, 
I plan to invoke SA from a .mailfilter file and keep user prefs in a SQL 
database.


My question is, can I invoke SA to check user whitelists/blacklists only 
without it running any rule tests? Email will already contain SA headers 
from the site-wide SA installation. I want to use per account 
whitelists/blacklists as a possible override to whatever verdict the 
site-wide SA gives an email.




SpamAssassin Howto

2006-07-03 Thread Andrew
I've written a Howto document describing my SpamAssassin setup. I have a 
site-wide configuration using spamd/spamc with Bayesian and 
auto-whitelist data in a MySQL database. If anyone is interested in 
having a look, you can find it here:


http://www.arda.homeunix.net/spamassassin.html

Of course, constructive feedback is always welcome.

Andrew



Re: Spam Assassin Detecting our emails as spam

2006-05-20 Thread Andrew

spectacularstuff wrote:

I have just set up Spam Assassin on our server.
It is working very nicely however whenever we try to send an email from our
own server to someone else on the same server, it gets picked up as spam.

I am wondering if anyone here has experience with Spam Assassin and can help
me fix the issues below as I don't know what they mean exactly.

I have spam assassin set to detect at 8 points whether or not an email is
spam. We are way over that because of the following reasons.

What do I have to fix on our server to fix the 4 issues below?

1. We are losing 3.4 points because of HELO_DYNAMIC_IPADDR.

2. We are losing 2.6 points because of NO_DNS_FOR_FROM.

3. We are losing 2.0 points because of RCVD_IN_SORBS_DUL.

4. We are losing 1.7 points because of RCVD_IN_NJABL_DUL.


Here is a standard header from Spam Assassin that we get when we sent each
other email.

Code:
 3.4 HELO_DYNAMIC_IPADDR    Relay HELO'd using suspicious hostname (IP addr 1)
 0.1 HTML_TAG_EXIST_TBODY   BODY: HTML has tbody tag
 0.7 MIME_HTML_MOSTLY       BODY: Multipart message mostly text/html MIME
 0.0 HTML_MESSAGE           BODY: HTML included in message
 2.6 NO_DNS_FOR_FROM        DNS: Envelope sender has no MX or A DNS records
 2.0 RCVD_IN_SORBS_DUL      RBL: SORBS: sent directly from dynamic IP address
                            [68.56.175.199 listed in dnsbl.sorbs.net]
 1.7 RCVD_IN_NJABL_DUL      RBL: NJABL: dialup sender did non-local SMTP
                            [68.56.175.199 listed in combined.njabl.org]
-0.2 AWL                    AWL: From: address is in the auto white-list
Thanks for any help with this.

Wayne
--
View this message in context: 
http://www.nabble.com/Spam+Assassin+Detecting+our+emails+as+spam-t1653798.html#a4480701
Sent from the SpamAssassin - Users forum at Nabble.com.




Read about trusted_networks and internal_networks in the 
Mail::SpamAssassin::Conf man page. These parameters go into your 
local.cf configuration file.


Andrew



Re: Even More Sa-update Problems

2006-05-14 Thread Andrew

David Baron wrote:
I have this working fine. However, once that 0300011 directory exists, 
all my custom rules (i.e. bayes, regex tests, etc) are no longer working and 
most all spams get through!


Took it off once again. Something needs be modified before this can be used.



I just set up sa-update myself. I've downloaded the latest ruleset from 
updates.spamassassin.org and restarted spamd. I don't seem to be having 
the problem you describe, though. I have the Bayesian filter on and some 
SARE rulesets in /usr/local/etc/mail/spamassassin and I'm still seeing 
hits with them. Here are the options I use when I start spamd.


spamd --siteconfigpath=/usr/local/etc/mail/spamassassin --pidfile=/var/run/spamd.pid


Before I set up sa-update, I also had the --configpath option set to 
/usr/local/share/spamassassin. I had to take this out otherwise spamd 
wouldn't find the rulesets in /var/lib/spamassassin/3.001001/


I'm using SpamAssassin 3.1.1 on FreeBSD by the way.

Andrew



Re: Even More Sa-update Problems

2006-05-14 Thread Andrew

David Baron wrote:


On Sunday 14 May 2006 21:24, Andrew wrote:


I have this working fine. However, once that 0300011 directory
exists, all my custom rules (i.e. bayes, regex tests, etc) are no longer
working and most all spams get through!

Took it off once again. Something needs be modified before this can be
used.


I just set up sa-update myself. I've downloaded the latest ruleset from
updates.spamassassin.org and restarted spamd. I don't seem to be having
the problem you describe, though. I have the Bayesian filter on and some
SARE rulesets in /usr/local/etc/mail/spamassassin and I'm still seeing
hits with them. Here are the options I use when I start spamd.

spamd --siteconfigpath=/usr/local/etc/mail/spamassassin --pidfile=/var/run/spamd.pid



OK. There is no siteconfigpath in /etc/init.d/spamassassin nor 
in /etc/default/spamassassin which gives this is $OPTIONS. It would be easy 
enough to try. Put this in which of these files?


Browsing through the Mail::SpamAssassin::Conf man page, I couldn't find 
a configuration file parameter equivalent to the --siteconfigpath 
command line option for spamd. I'd put it in your 
/etc/init.d/spamassassin startup script.





Before I set up sa-update, I also had the --configpath option set to
/usr/local/share/spamassassin. I had to take this out otherwise spamd
wouldn't find the rulesets in /var/lib/spamassassin/3.001001/



This is probably the default. Now it looks first in the version 
set /var/lib . 3.001001/, Does the siteconfigpath override or add to this 
(most probably adds or should) ? Should there be a multiple siteconfigpath? 
Symlinks to various directories from the ...3.001001?




Here is what appears to be happening on my system.

1. Because I don't have configpath set, spamd is looking in 
/var/lib/spamassassin for rulesets.


2. Because it finds the /var/lib/spamassassin directory, it doesn't 
check /usr/local/share/spamassassin where the rulesets distributed with 
SpamAssassin reside.


3. Because I have set siteconfigpath, spamd loads extra rulesets and 
configuration info from /usr/local/etc/mail/spamassassin.


In my case, what spamd finds in siteconfigpath is definitely used in 
addition to what it finds in /var/lib/spamassassin.
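
The three observations above can be written down as a small Python model. To be clear, this is a toy restatement of the behaviour seen on this one FreeBSD box, not spamd's actual lookup code; the function name and its arguments are invented for the illustration, and the default paths are the ones mentioned in this thread.

```python
def rule_directories(update_dir_exists,
                     configpath=None,
                     siteconfigpath=None,
                     update_dir="/var/lib/spamassassin/3.001001",
                     default_dir="/usr/local/share/spamassassin"):
    """Model which directories get read for rules and configuration."""
    if configpath is not None:
        # An explicit --configpath wins, so sa-update's rules are not found.
        base = configpath
    elif update_dir_exists:
        # The sa-update directory replaces the distributed rules.
        base = update_dir
    else:
        base = default_dir
    dirs = [base]
    if siteconfigpath is not None:
        # Site rules are read in addition to the base ruleset.
        dirs.append(siteconfigpath)
    return dirs
```

With the options shown earlier, `rule_directories(True, siteconfigpath="/usr/local/etc/mail/spamassassin")` gives the sa-update directory followed by the site directory, which is the combination that works here.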


I've never tried specifying more than one siteconfigpath. My gut feeling 
is that it won't work. I can't think of a reason why I would need more 
than one.


Andrew



Re: Rule to select sender starting with string

2006-04-27 Thread Andrew

Matt Kettler wrote:

Al Danks wrote:


Matt Kettler mkettler at evi-inc.com writes:


 


Try a rule something like this:

header L_FROM_STRING From =~ /$string/


   


It appears that the rule is also hitting senders with the string following a .

I.e. From =~ /$com/ hits 


comalksdfl.net

aksafjdla.com
 



Interesting.. that shouldn't happen with the $ there.. I'll have to test
that, unless Theo or one of the other devs can offer an explanation as
to why..




Are SA regexes different from other regexes? If not, use '^' to specify 
the beginning of a string and '$' its end. Try this pattern:

/^com/
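
To show the difference between the two anchors, here is a quick check in Python. It uses Python's `re` module, whose `^` and `$` behave like Perl's for simple cases like this one, so treat it as an illustration of anchoring in general and not as a test of SpamAssassin itself. The two sample strings are the ones from the earlier message.

```python
import re

# '^' anchors the match at the start of the string.
starts_with_com = re.compile(r"^com")
# '$' anchors the match at the end of the string.
ends_with_com = re.compile(r"com$")

for sender in ("comalksdfl.net", "aksafjdla.com"):
    print(sender,
          "starts with com:", bool(starts_with_com.search(sender)),
          "ends with com:", bool(ends_with_com.search(sender)))
```

Only the first string matches `^com`, and only the second matches `com$`.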

Andrew



X-Originating and X-Apparently-From

2006-04-18 Thread Andrew Doughety

Hi,
	We are trying to perform DNSBL checks on incoming mail and we are 
not seeing any actual DNS queries.  When looking at the code it seems that 
the information on which IP(s) to check is obtained from X-Originating and 
X-Apparently-From headers.  Grepping through the code I do not see these 
headers anywhere else.  We are using Postfix as our MTA, perhaps that 
is the problem?  We could either write a postfix rule or edit the SA code to 
check the Received header.


Thanks,
Andrew


Re: X-Originating and X-Apparently-From

2006-04-18 Thread Andrew Doughety

Andrew Doughety wrote:

Hi,
We are trying to perform DNSBL checks on incoming mail and we are
not seeing any actual DNS queries.  When looking at the code it seems
that the information on which IP(s) to check is obtained from
X-Originating and X-Apparently-From headers.


No, SA should be checking the IPs from the Received: headers.

However, make sure your trust path is working correctly. If you ever see spam
matching ALL_TRUSTED, then that email is going to be exempt from DNSBL tests.

9 times out of 10, this is the trust-path guesser being confused by a NAT
config. See the wiki on how to fix this:

http://wiki.apache.org/spamassassin/TrustPath


Restricting the trusted path fixed the problem.  Thanks!


problem with AWL and SQL

2006-04-18 Thread Andrew
I'm trying to set up SA to use MySQL to store the Auto WhiteList but 
it's just not working out for me. SA seems to be trying to create a lock 
file on disk. The problem is that I run spamd as a user which doesn't 
have a home directory. Here is what I find in my spamd log files.


@40004445b038302e5f9c [59195] error: locker: safe_lock: cannot 
create tmp lockfile 
/nonexistent/.spamassassin/auto-whitelist.lock.lorien.arda.homeunix.net.59195 
for /nonexistent/.spamassassin/auto-whitelist.lock: No such file or 
directory
@40004445b0383030c4e4 [59195] warn: auto-whitelist: open of 
auto-whitelist file failed: locker: safe_lock: cannot create tmp 
lockfile 
/nonexistent/.spamassassin/auto-whitelist.lock.lorien.arda.homeunix.net.59195 
for /nonexistent/.spamassassin/auto-whitelist.lock: No such file or 
directory


(The funny strings with the '@' sign at the beginning of the lines are 
timestamps. I use daemontools to run spamd instead of inetd.)


Is this normal behaviour even when using an SQL database to store the AWL?

Here are the relevant parameters from my local.cf file.

user_awl_dsn           DBI:mysql:saawl:localhost:3306
user_awl_sql_username  sa
user_awl_sql_password  password
user_awl_sql_table     awl

I use MySQL with the same credentials to store the Bayesian database and 
that's working fine. Only the AWL is giving me a problem. I can manually 
log into the saawl database and even insert and delete rows as the sa user.


Andrew



Re: Bayes learning email address

2006-04-16 Thread Andrew

John D. Hardin wrote:

On Sat, 15 Apr 2006, mouss wrote:



- you are trusting your users to make the right decision. The
problem is that different people have different opinions of what
is spam and what is not. Things get even worse if one user isn't
honest...



That's a problem with *any* scheme for allowing the users to train
Bayes themselves.

In practice, however, I think you'll see much more apathy than
stupidity or malice. My problem was with getting my users to even
*look at* their marginal-spams folder and classify the messages. Ever.

You should check for things like your own quota notification messages in 
the spam folder. If you send a boilerplate email in response to someone 
sending an email to your abuse or postmaster address, check for that 
too. I used to work for a fairly large ISP and we got these sorts of 
things sent to us all the time.


Andrew



Bayes rules taking minutes - solved by moving to innodb?

2006-03-23 Thread Andrew Donkin

Hi, people.  This started as a plea for help but ended as a report of
an investigation, so hopefully it will be a useful addition to the
archives.

About 1% of my scans were taking more than 300 seconds.  Extra
debugging in spamd showed me that the Bayes checks were the culprit:

13:38:05 spamd[16852]: slow: run_eval_tests BAYES_40 took 773 seconds 
13:45:18 spamd[16852]: slow: run_eval_tests BAYES_80 took 427 seconds 
13:45:20 spamd[16852]: slow: do_body_eval_tests(0) took 1212 seconds 

I am using per-user Bayes (on the recommendation of half this list,
and against the recommendation of the other half :-), and perform
about 100,000 scans per day.  Bayes_seen was ~ 150M, and bayes_token ~
1.5G.  The bayes_token index was 4.7G.

MySQL's slow query log showed that the queries did not take long to
execute after they achieved a lock, but I suspected they were not
getting their locks in reasonable time:

 mysql> SHOW STATUS LIKE 'Table%';
 +-----------------------+--------+
 | Variable_name         | Value  |
 +-----------------------+--------+
 | Table_locks_immediate | 171036 |
 | Table_locks_waited    | 220999 |
 +-----------------------+--------+

In a healthy database, table_locks_waited is a small fraction of
table_locks_immediate.
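
To put a number on that, the share of lock requests that had to wait can be worked out from the two counters. The few lines of Python below are just arithmetic on the values shown above; the function name is made up for the illustration.

```python
def lock_wait_fraction(immediate: int, waited: int) -> float:
    """Fraction of table lock requests that could not be granted at once."""
    total = immediate + waited
    return waited / total if total else 0.0


# Counter values from the SHOW STATUS output above.
fraction = lock_wait_fraction(immediate=171036, waited=220999)
print(f"{fraction:.1%} of table lock requests had to wait")
```

Here more than half of all lock requests waited, which is nowhere near a small fraction.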

I turned off bayes_auto_expire in case it was the expiry which caused
the contention, but no change.  I need bayes_auto_expire turned on
because as we've discussed before, there is no way to perform
expiration for every user in an SQL Bayes database.

Well, I started this email a week ago and now I've found that at peak
times, SHOW PROCESSLIST shows many threads -- like 100 -- locked on
SELECT FROM bayes_token and INSERT INTO bayes_token.

So I tried to convert bayes_token to InnoDB to take advantage of its
row-level locking (this is advised by the developers but not reflected
in bayes_mysql.sql).  After MySQL worked on that for a few days I
stopped it, dropped the database (innodb was very confused), and
recreated the database and all tables using innodb and two-byte IDs.

It's early days, with only 7.6M tokens seen and few accounts over the
activation mark of 200 ham.  But I'm hoping my timeout problems are
over.

So my advice is:

 SHOW STATUS LIKE 'Table%';
 SHOW PROCESSLIST;
 Change to innodb
 ALTER TABLE bayes_token MODIFY id SMALLINT UNSIGNED NOT NULL,
 MODIFY spam_count SMALLINT UNSIGNED NOT NULL,
 MODIFY ham_count SMALLINT UNSIGNED NOT NULL;
 ALTER TABLE bayes_expire MODIFY id SMALLINT UNSIGNED NOT NULL;
 ALTER TABLE bayes_seen MODIFY id SMALLINT UNSIGNED NOT NULL;
 ALTER TABLE bayes_vars MODIFY id SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT;

-- 
_
Andrew Donkin  Waikato University, Hamilton,  New Zealand


Re: SQL Bayes - MyISAM locks a problem?

2006-03-21 Thread Andrew Donkin

Duane Hill has:

 per-user [...] just over 10 gig [...] InnoDB [...]
 http://wiki.apache.org/spamassassin/DBIPlugin [...] bayes_vars table
 has 14,102 rows

Jason Frisvold:

 I'll have to give innodb a try..  :)  Thanks for the tip...

Jason, if you haven't moved to innodb already, try SHOW PROCESSLIST
in mysql.  Do you have many threads locked on SELECT FROM
bayes_token and INSERT INTO bayes_token?

I had about 100 threads locked, so I am changing to InnoDB for its
fine-grained locking.  About three days ago I issued ALTER TABLE
bayes_token ENGINE innodb.  I'll let you know when it finishes.

-- 
_
Andrew Donkin  Waikato University, Hamilton,  New Zealand


Re: prefork: server reached --max-clients setting, consider raising it messages

2006-03-06 Thread Andrew Donkin

 After upgrading to 3.1 from 3.0 we are starting to see the following
 error messages in our logs prefork: server reached --max-clients
 setting, consider raising it

Short version: try --round-robin on spamd.

We scan about 100k messages a day balanced (with -H) between two spamd
hosts.  Traffic is bursty, and during the bursts a lot of spam leaks
through unchecked because spamc reaches its 120s timeout.

The really annoying thing is that a spamd child would continue to chew
on its message for a further few hundred seconds before classifying
it, only to find that spamc had already given up.  That child could
have been working for another spamc.  I wonder if there is a way for
spamd to catch SIGPIPE or some other message from the client, and
abort.

So I added --round-robin and things improved markedly.  The logging
isn't nearly so good (grep prefork: without --round-robin draws you
a great load histogram) but far less spam is leaking through.

One theory is that spamd doesn't spawn children quickly enough to cope
with rapidly-ramping load.  I was thinking of ripping out spamd's
one-new-child-per-second throttling to see if it improved matters, but
that experiment is way down the task list now.

Try --round-robin.  Scale it up until your spamd hosts are maximising
the use of their RAM.

Note that your spamd hosts should be similarly capable - spamc will
split the load evenly between all of them, even when all children are
busy on one.

-- 
_
Andrew Donkin  Waikato University, Hamilton,  New Zealand

