Re: Identifying Source of False Positives -- RESOLVED

2009-06-05 Thread Rich Shepard

On Tue, 2 Jun 2009, Rich Shepard wrote:


 I started doing this today. Each of the false positive messages was
exported from alpine to a file, and I ran sa-learn on that file telling it
the text is ham.


  Today the mail and logwatch summary reports appeared in my inbox and there
were no false positives in the holding cell. This may have resolved the
issue of missing messages, but I'll continue to monitor and train SA on the
ham that was mistakenly labeled as spam.


The empty body problem is a more difficult problem.  Have procmail save a
copy of the raw message somewhere and take a look at it.  Make sure there
is a blank line between the headers and the body.  Run 'spamassassin -D'
on this saved message and look for anything unusual in the debug output.


  This seems to have been resolved by replacing the old
/etc/mail/spamassassin/local.cf with the new version. Many fewer rules and
other entries, but I no longer see the EMPTY_BODY test adding 2.5 to the
scores.

Thank you all very much,

Rich

--
Richard B. Shepard, Ph.D.          |  Integrity            Credibility
Applied Ecosystem Services, Inc.   |            Innovation
http://www.appl-ecosys.com     Voice: 503-667-4517     Fax: 503-667-8863


Re: Identifying Source of False Positives -- RESOLVED

2009-06-05 Thread Bowie Bailey

Rich Shepard wrote:
The empty body problem is a more difficult problem.  Have procmail save a
copy of the raw message somewhere and take a look at it.  Make sure there
is a blank line between the headers and the body.  Run 'spamassassin -D'
on this saved message and look for anything unusual in the debug output.


  This seems to have been resolved by replacing the old
/etc/mail/spamassassin/local.cf with the new version. Many fewer rules and
other entries, but I no longer see the EMPTY_BODY test adding 2.5 to the
scores.


In that case, you should be able to track down the issue by comparing 
the two files.  Is the EMPTY_BODY rule defined in the old local.cf 
file?  If so, what does it say?


--
Bowie
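
A minimal sketch of the comparison suggested above. It assumes the old file
was kept under a name such as local.cf.old next to the working local.cf; the
thread does not give the saved file's name, so both paths in the usage
comment are assumptions. The helper show_rule is hypothetical and only wraps
grep.

```shell
#!/bin/sh
# show_rule RULE FILE...
# Print every line that mentions RULE in the given config files,
# prefixed with the file name and line number.
show_rule() {
    rule=$1
    shift
    # -H prints the file name even for a single file, -n adds line numbers,
    # -w matches RULE as a whole word only.
    grep -Hnw -- "$rule" "$@"
}

# Usage (paths are assumptions):
#   show_rule EMPTY_BODY /etc/mail/spamassassin/local.cf.old \
#                        /etc/mail/spamassassin/local.cf
#   diff -u /etc/mail/spamassassin/local.cf.old /etc/mail/spamassassin/local.cf
```

If the rule shows up only in the old file, it was a local addition rather
than a stock rule, which narrows down where to look.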


Re: Identifying Source of False Positives -- RESOLVED

2009-06-05 Thread Rich Shepard

On Fri, 5 Jun 2009, Bowie Bailey wrote:


In that case, you should be able to track down the issue by comparing the
two files. Is the EMPTY_BODY rule defined in the old local.cf file? If
so, what does it say?


Bowie,

  Yes, it was in the old local.cf:

# for empty message bodies:
body   EMPTY_BODY   m'^[^\n]+\n\s*$'
describe   EMPTY_BODY   Message has subject but no body
score  EMPTY_BODY   2.5

  It apparently used to work, but it no longer does with the new SA to
which I upgraded a few months ago.

Thanks,

Rich



Re: Identifying Source of False Positives

2009-06-04 Thread Rich Shepard

On Mon, 1 Jun 2009, Bowie Bailey wrote:


The empty body problem is a more difficult problem. Have procmail save a
copy of the raw message somewhere and take a look at it. Make sure there
is a blank line between the headers and the body.


Bowie, et al.:

  Progress is being made. I discovered that the local.cf was for sa-1.3 or
so, and there was a local.cf.new in the same directory. I saved the old
version and made the .new one the working copy. Many fewer rules.

  On a real spam that was saved for my examination I see that the EMPTY_BODY
check was not triggered. I'll watch this a couple of days and see if that
continues to hold true.

  In the meantime, I'm retraining SA on the false positives to teach it that
they are ham rather than spam. When my log summary reports start appearing
in my INBOX and the other false positives from the mail lists (such as this
one), stop appearing in the spam hold mailbox, I'll relax.

  Thank you all for the very helpful suggestions. I'll update the status
over the next days.

Rich



Re: Identifying Source of False Positives

2009-06-03 Thread Rich Shepard

On Tue, 2 Jun 2009, Charles Gregory wrote:


This *really* suggests that one of two things MUST be occurring:
1) What you are seeing is NOT what spamassassin sees.


Charles,

  Quite possible.


2) A character (null/ascii-zeros?) has been injected into the e-mail
  somewhere in the headers, causing SpamAssassin to cease its scan at that
  point...


  Hmm-m-m-m. I cannot perceive a scenario where this is selective. For
example, the log reports sent by local root to me on the local machine, some
messages posted to this mail list (but not others in the same thread), some
messages posted to other mail lists (again, not all in the same thread), and
so on. There is no consistent pattern other than the locally generated log
summary reports.


Presuming upon the latter, try examining all the headers injected by other
processes like clamav. Particularly where *some* messages receive this
treatment, but not *all*, you should be able to find a 'header difference'
between the passed and failed messages.


  No clamav or similar. We run only linux with incoming mail processed by
postfix and procmail.


Something to try:
Set up a custom rule in local.cf to match a custom header
  X-Spam-Test: YES
And then, just before you scan the e-mail with spamassassin, use 'formail' to
add that header to the mail.


  I've not before used formail. SA is called from within
~/procmail/recipes.rc:

## Call SpamAssassin
:0fw: spamassassin.lock
* < 256000
| spamassassin

  Where do I insert a call to formail and what is the appropriate format?

Thanks,

Rich
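
One way to try the header test by hand before touching the recipe, as a
sketch only. formail -A appends a header field and spamassassin -t adds a
report of the rules hit; the rule name LOCAL_X_SPAM_TEST, its score, and the
file path are hypothetical. The local.cf lines and the procmail placement are
shown as comments.

```shell
#!/bin/sh
# tag_message FILE
# Write FILE to stdout with "X-Spam-Test: YES" appended to its headers.
# formail -A appends a header field without removing existing ones.
tag_message() {
    formail -A "X-Spam-Test: YES" < "$1"
}

# Matching rule for local.cf (rule name and score are hypothetical):
#   header   LOCAL_X_SPAM_TEST  X-Spam-Test =~ /^YES$/
#   describe LOCAL_X_SPAM_TEST  Test header added by formail was seen
#   score    LOCAL_X_SPAM_TEST  0.01
#
# Manual check on a saved message (path is an assumption):
#   tag_message /tmp/saved-message | spamassassin -t | grep LOCAL_X_SPAM_TEST
#
# In procmail the same formail call would go in a header filter recipe
# placed before the SpamAssassin recipe:
#   :0 fhw
#   | formail -A "X-Spam-Test: YES"
```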



Re: Identifying Source of False Positives

2009-06-02 Thread Rich Shepard

On Mon, 1 Jun 2009, Bowie Bailey wrote:


Your biggest problems here are BAYES_99 and EMPTY_BODY.  To fix the Bayes
problem, sa-learn some of these messages as ham.  Make sure you are
learning as the right user...


Bowie,

  I started doing this today. Each of the false positive messages was
exported from alpine to a file, and I ran sa-learn on that file telling it
the text is ham.


The empty body problem is a more difficult problem.  Have procmail save a
copy of the raw message somewhere and take a look at it.  Make sure there
is a blank line between the headers and the body.  Run 'spamassassin -D'
on this saved message and look for anything unusual in the debug output.


  Part of the problem is that I cannot tell what's unusual in the debug
output. When I tried this yesterday (properly), I saw where the score
suddenly jumped from 1.2 to 5.21 with no visible (to me) explanation.

Rich
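
A small sketch for locating a score jump like the one described above. It
assumes the debug trace was saved to a file and that its checkpoints look
like the "running ... tests; score so far=" lines quoted elsewhere in this
thread. The helper name score_steps and the file names are hypothetical.

```shell
#!/bin/sh
# score_steps LOGFILE
# Print each "running <type> tests; score so far=<n>" checkpoint from a
# saved 'spamassassin -D' trace, one per line, as "<type> <score>".
score_steps() {
    sed -n 's/.*rules: running \([a-z_]*\) tests; score so far=\([-0-9.]*\).*/\1 \2/p' "$1"
}

# Usage (file names are assumptions):
#   spamassassin -D < /tmp/saved-message > /dev/null 2> /tmp/sa-debug.log
#   score_steps /tmp/sa-debug.log
```

The rules reported as "got hit" between the two checkpoints where the score
changes are the ones to look at.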


Re: Identifying Source of False Positives

2009-06-02 Thread Charles Gregory

On Tue, 2 Jun 2009, Rich Shepard wrote:

 This morning not only was the mail log report and logwatch report falsely
flagged as spam, but so were several messages posted to the google group
mail list for an application I use. What is interesting to me is that every
one had a +2.5 score for EMPTY_BODY, while none of them had empty bodies.


This *really* suggests that one of two things MUST be occurring:

1) What you are seeing is NOT what spamassassin sees.

2) A character (null/ascii-zeros?) has been injected into the e-mail
   somewhere in the headers, causing SpamAssassin to cease its scan at
   that point...

Presuming upon the latter, try examining all the headers injected by other 
processes like clamav. Particularly where *some* messages receive this 
treatment, but not *all*, you should be able to find a 'header difference' 
between the passed and failed messages.


Something to try:
Set up a custom rule in local.cf to match a custom header
   X-Spam-Test: YES
And then, just before you scan the e-mail with spamassassin, use 'formail'
to add that header to the mail. It will get injected at the end of the
headers. If the test rule 'hits' then you have a real mystery. If the test
rule does *not* 'hit', then we have evidence that something is causing
SpamAssassin to behave like an End-Of-File condition has been reached on
the mail before it read it all. Null/zeros or something.


- Charles
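
The null-byte theory can be checked directly on a saved copy of a flagged
message. A minimal sketch, assuming the raw message is in a file; the helper
name count_nul_bytes and the path are hypothetical.

```shell
#!/bin/sh
# count_nul_bytes FILE
# Print the number of NUL (0x00) bytes in FILE by comparing its size
# with and without them.
count_nul_bytes() {
    total=$(wc -c < "$1")
    without_nul=$(tr -d '\000' < "$1" | wc -c)
    echo $((total - without_nul))
}

# Usage (path is an assumption):
#   count_nul_bytes /tmp/saved-message
```

A result of 0 rules out embedded nulls in that particular message.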


Re: Identifying Source of False Positives

2009-06-01 Thread McDonald, Dan
On Mon, 2009-06-01 at 09:28 -0700, Rich Shepard wrote:
 I'm running SA-3.2.5 on Slackware-12.2 and encountering false positives on
 messages that have not before been seen as spam by SA. Specifically, the
 daily postfix mail log summary report and the daily logwatch report are
 marked as spam; they are sent by root to me as a user. Because
 /etc/procmailrc threw these messages away it took a long time to figure out
 that it was SA mis-labeling these messages that was the immediate problem.
 
Over the past few months I've also had problems with messages from three
 specific domains that were never delivered to my inbox. However, when a
 procmail recipe directed all messages to me at my business domain to a
 different mail file, they were delivered.
 
How can I determine what causes SA to mark the log summary reports as
 spam? 

Run the message through spamassassin -D and see what tests fire.

Most likely it will be that some of the domains that are reported in
your summary are listed in URIBL, SURBL, or some other uri block list.


-- 
Daniel J McDonald, CCIE # 2495, CISSP # 78281, CNX
www.austinenergy.com




Re: Identifying Source of False Positives

2009-06-01 Thread Charles Gregory

On Mon, 1 Jun 2009, Rich Shepard wrote:

messages that have not before been seen as spam by SA. Specifically, the
daily postfix mail log summary report and the daily logwatch report are
marked as spam;


Well, firstly, examine the mail's full headers. There should be an
X-Spam-Status header listing the tests that matched on the e-mail.

At a first guess, I would suspect that your log includes a reference to
a blacklisted URI or e-mail. Given the nature of logs to contain 
information of this sort, I would strongly urge you to 'whitelist' the 
logs. For that matter, if this is internally generated mail, why are you 
running spamassassin at all? Or is this mail being passed via an outside 
(untrusted) network to your mailbox?


- C
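
A sketch for pulling the list of matched tests out of a saved message. It
assumes the usual folded X-Spam-Status layout with a tests=... field; the
helper name spam_tests and the path are hypothetical.

```shell
#!/bin/sh
# spam_tests FILE
# Print the rule names from the X-Spam-Status header of a saved message,
# one per line. Folded continuation lines are joined first.
spam_tests() {
    awk '
        /^$/ { exit }                                   # end of headers
        /^X-Spam-Status:/ { grab = 1; hdr = $0; next }
        grab && /^[ \t]/ {                              # folded continuation
            sub(/^[ \t]+/, "")
            # keep the comma-separated list intact, separate other fields
            hdr = hdr (hdr ~ /,$/ ? "" : " ") $0
            next
        }
        grab { exit }                                   # next header field
        END { print hdr }
    ' "$1" |
    sed -n 's/.*tests=\([^ ]*\).*/\1/p' |
    tr ',' '\n'
}

# Usage (path is an assumption):
#   spam_tests /tmp/saved-message
```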


Re: Identifying Source of False Positives

2009-06-01 Thread John Hardin

On Mon, 1 Jun 2009, Rich Shepard wrote:


 I'm running SA-3.2.5 on Slackware-12.2 and encountering false positives on
messages that have not before been seen as spam by SA. Specifically, the
daily postfix mail log summary report and the daily logwatch report are
marked as spam; they are sent by root to me as a user.


That sort of thing shouldn't even be hitting SA. If you're using procmail 
to glue in SA, you might want to add some exclusionary clauses to the 
stanza that calls SA.



 Over the past few months I've also had problems with messages from three
specific domains that were never delivered to my inbox. However, when a
procmail recipe directed all messages to me at my business domain to a
different mail file, they were delivered.


It can be a bad idea, particularly if you're an administrator or delegate 
for the postmaster@ or abuse@ aliases, to discard mail that SA has marked 
as spam. Quarantine it and periodically review the quarantine.



How can I determine what causes SA to mark the log summary reports as
spam? This is the first issue I want to resolve.


First, capture the messages rather than discarding them. The FPs should 
have the list of rules that hit in the headers.


For historical messages you should be able to look in your mail log 
(typically /var/log/maillog or rotated to /var/log/maillog.1.gz etc.) for 
the SA log entry for the messages in question, which also list the rules 
hit.


If you post the list of rules hit, or better a complete FP message with 
all headers intact, we may be able to suggest more precisely. Please don't 
post messages to the list; post them on pastebin or a webserver you 
control, and send the URL to the list.


--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  It is not the business of government to make men virtuous or
  religious, or to preserve the fool from the consequences of his own
  folly.  -- Henry George
---
 5 days until the 65th anniversary of D-Day


Re: Identifying Source of False Positives

2009-06-01 Thread Rich Shepard

On Mon, 1 Jun 2009, Charles Gregory wrote:


Well, firstly, examine the mail full headers. There should be an
X-Spam-Status header listing the tests that matched on the e-mail.


Charles/Dan/John:

  I certainly managed to forget this. I just ran /etc/cron.daily/1pflogsumm
and looked at the report.

  Here are the headers:


From r...@salmo.appl-ecosys.com Mon Jun  1 11:25:44 2009

Return-Path: r...@salmo.appl-ecosys.com
X-Spam-Flag: YES
X-Spam-Checker-Version: SpamAssassin 3.2.5-ph20040310.0 (2008-06-10) on
salmo.appl-ecosys.com
X-Spam-Level: 
X-Spam-Status: Yes, score=4.9 required=4.0 tests=ALL_TRUSTED,AWL,BAYES_99,
EMPTY_BODY,NORMAL_HTTP_TO_IP,NUMERIC_HTTP_ADDR,URI_HEX,URI_NOVOWEL
autolearn=no version=3.2.5-ph20040310.0
X-Spam-Report:
* -1.3 ALL_TRUSTED Passed through trusted hosts only via SMTP
*  3.5 BAYES_99 BODY: Bayesian spam probability is 99 to 100%
*  [score: 1.]
*  2.5 EMPTY_BODY BODY: Message has subject but no body
*  0.0 NORMAL_HTTP_TO_IP URI: Uses a dotted-decimal IP address in URL
*  0.4 URI_HEX URI: URI hostname has long hexadecimal sequence
*  0.0 NUMERIC_HTTP_ADDR URI: Uses a numeric IP address in URL
*  1.6 URI_NOVOWEL URI: URI hostname has long non-vowel sequence
* -1.8 AWL AWL: From: address is in the auto white-list
X-Original-To: rshep...@appl-ecosys.com

  I can send the entire report if that's necessary.

  There is certainly body content in the message; it's not empty so I don't
understand the 2.5 on that third test. I also don't know where the 3.5 on
the second test arises.

  For about a decade these log summary reports showed up every day with no
problems. Earlier this spring they became sporadic, then ceased appearing at
all. This correlates with a distribution and SpamAssassin upgrade, so it
must be something different in SA that's triggering this response now.

  Suggestions on how to proceed greatly appreciated.

Thanks,

Rich


Re: [sa] Re: Identifying Source of False Positives

2009-06-01 Thread Charles Gregory

On Mon, 1 Jun 2009, Rich Shepard wrote:

 *  2.5 EMPTY_BODY BODY: Message has subject but no body
 There is certainly body content in the message; it's not empty so I don't
understand the 2.5 on that third test. I also don't know where the 3.5 on
the second test arises.


Just to be clear, are you looking at the body in the actual rejected 
message, to make sure it is still there (not 'stripped' from the message)?

First guess, look at the procmail code that 'chooses' to run spamassassin.
Have you used an 'h' where you meant to use an 'H', thereby feeding *only* 
the header to spamassassin?


- C


Re: Identifying Source of False Positives

2009-06-01 Thread John Hardin

On Mon, 1 Jun 2009, Rich Shepard wrote:


 Here are the headers:


From r...@salmo.appl-ecosys.com Mon Jun  1 11:25:44 2009

Return-Path: r...@salmo.appl-ecosys.com
X-Spam-Flag: YES
X-Spam-Checker-Version: SpamAssassin 3.2.5-ph20040310.0 (2008-06-10) on
salmo.appl-ecosys.com
X-Spam-Level: 
X-Spam-Status: Yes, score=4.9 required=4.0 tests=ALL_TRUSTED,AWL,BAYES_99,
 EMPTY_BODY,NORMAL_HTTP_TO_IP,NUMERIC_HTTP_ADDR,URI_HEX,URI_NOVOWEL
 autolearn=no version=3.2.5-ph20040310.0
X-Spam-Report:
 * -1.3 ALL_TRUSTED Passed through trusted hosts only via SMTP
 *  3.5 BAYES_99 BODY: Bayesian spam probability is 99 to 100%
 *  [score: 1.]



I also don't know where the 3.5 on the second test arises.



If these are system-generated messages, something is improperly training 
SA that they are spam. Do you use autolearn?



 Suggestions on how to proceed greatly appreciated.


Primarily I'd suggest you exclude locally-generated emails from SA 
completely. If you'd post the Received: headers from such a message and 
the procmail stanza where you pass messages to SA for scoring I could 
suggest something.


--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  Of the twenty-two civilizations that have appeared in history,
  nineteen of them collapsed when they reached the moral state the
  United States is in now.  -- Arnold Toynbee
---
 5 days until the 65th anniversary of D-Day


Re: [sa] Re: Identifying Source of False Positives

2009-06-01 Thread Rich Shepard

On Mon, 1 Jun 2009, Charles Gregory wrote:


Just to be clear, are you looking at the body in the actual rejected
message,


Charles,

  Yes. The body consists of the mail log summary.


First guess, look at the procmail code that 'chooses' to run spamassassin.
Have you used an 'h' where you meant to use an 'H', thereby feeding *only*
the header to spamassassin?


## Call SpamAssassin
:0fw: spamassassin.lock
* < 256000
| spamassassin

  This is how it's been for years.

Rich


Re: Identifying Source of False Positives

2009-06-01 Thread Rich Shepard

On Mon, 1 Jun 2009, John Hardin wrote:


If these are system-generated messages, something is improperly training
SA that they are spam. Do you use autolearn?


John,

  No. Once a week or so I run sa-learn specifying spam on the spam-uncaught
mbox file. Less frequently I run it on mail list files specifying them as
ham.


Primarily I'd suggest you exclude locally-generated emails from SA
completely. If you'd post the Received: headers from such a message and
the procmail stanza where you pass messages to SA for scoring I could
suggest something.


  Here are all headers from the mail log summary:


From r...@salmo.appl-ecosys.com Mon Jun  1 11:25:44 2009

Return-Path: r...@salmo.appl-ecosys.com
X-Spam-Flag: YES
X-Spam-Checker-Version: SpamAssassin 3.2.5-ph20040310.0 (2008-06-10) on
salmo.appl-ecosys.com
X-Spam-Level: 
X-Spam-Status: Yes, score=4.9 required=4.0 tests=ALL_TRUSTED,AWL,BAYES_99,
EMPTY_BODY,NORMAL_HTTP_TO_IP,NUMERIC_HTTP_ADDR,URI_HEX,URI_NOVOWEL
autolearn=no version=3.2.5-ph20040310.0
X-Spam-Report:
* -1.3 ALL_TRUSTED Passed through trusted hosts only via SMTP
*  3.5 BAYES_99 BODY: Bayesian spam probability is 99 to 100%
*  [score: 1.]
*  2.5 EMPTY_BODY BODY: Message has subject but no body
*  0.0 NORMAL_HTTP_TO_IP URI: Uses a dotted-decimal IP address in URL
*  0.4 URI_HEX URI: URI hostname has long hexadecimal sequence
*  0.0 NUMERIC_HTTP_ADDR URI: Uses a numeric IP address in URL
*  1.6 URI_NOVOWEL URI: URI hostname has long non-vowel sequence
* -1.8 AWL AWL: From: address is in the auto white-list
X-Original-To: rshep...@appl-ecosys.com
Delivered-To: rshep...@appl-ecosys.com
Received: from salmo.appl-ecosys.com (localhost.localdomain [127.0.0.1])
by salmo.appl-ecosys.com (Postfix) with ESMTP id 8DA0F1026
for rshep...@appl-ecosys.com; Mon,  1 Jun 2009 11:25:44 -0700 (PDT)
Received: (from r...@localhost)
by salmo.appl-ecosys.com (8.14.3/8.14.2/Submit) id n51IPibx030133;
Mon, 1 Jun 2009 11:25:44 -0700
Date: Mon, 1 Jun 2009 11:25:44 -0700
From: r...@salmo.appl-ecosys.com
Message-Id: 200906011825.n51ipibx030...@salmo.appl-ecosys.com
To: rshep...@appl-ecosys.com
Subject: *SPAM* salmo Daily Mail Report for Monday, 01 June 2009
X-Spam-Prev-Subject: salmo Daily Mail Report for Monday, 01 June 2009

Report based on information in /var/log/maillog

  And this is from ~/procmail/recipes.rc:

## Call SpamAssassin
:0fw: spamassassin.lock
* < 256000
| spamassassin

Thanks,

Rich


Re: Identifying Source of False Positives

2009-06-01 Thread Bowie Bailey

Rich Shepard wrote:

  Here are all headers from the mail log summary:

From r...@salmo.appl-ecosys.com Mon Jun  1 11:25:44 2009
Return-Path: r...@salmo.appl-ecosys.com
X-Spam-Flag: YES
X-Spam-Checker-Version: SpamAssassin 3.2.5-ph20040310.0 (2008-06-10) on
salmo.appl-ecosys.com
X-Spam-Level: 
X-Spam-Status: Yes, score=4.9 required=4.0 tests=ALL_TRUSTED,AWL,BAYES_99,
EMPTY_BODY,NORMAL_HTTP_TO_IP,NUMERIC_HTTP_ADDR,URI_HEX,URI_NOVOWEL
autolearn=no version=3.2.5-ph20040310.0
X-Spam-Report:
* -1.3 ALL_TRUSTED Passed through trusted hosts only via SMTP
*  3.5 BAYES_99 BODY: Bayesian spam probability is 99 to 100%
*  [score: 1.]
*  2.5 EMPTY_BODY BODY: Message has subject but no body
*  0.0 NORMAL_HTTP_TO_IP URI: Uses a dotted-decimal IP address in URL
*  0.4 URI_HEX URI: URI hostname has long hexadecimal sequence
*  0.0 NUMERIC_HTTP_ADDR URI: Uses a numeric IP address in URL
*  1.6 URI_NOVOWEL URI: URI hostname has long non-vowel sequence
* -1.8 AWL AWL: From: address is in the auto white-list
X-Original-To: rshep...@appl-ecosys.com
Delivered-To: rshep...@appl-ecosys.com
Received: from salmo.appl-ecosys.com (localhost.localdomain [127.0.0.1])
by salmo.appl-ecosys.com (Postfix) with ESMTP id 8DA0F1026
for rshep...@appl-ecosys.com; Mon,  1 Jun 2009 11:25:44 -0700 (PDT)
Received: (from r...@localhost)
by salmo.appl-ecosys.com (8.14.3/8.14.2/Submit) id n51IPibx030133;
Mon, 1 Jun 2009 11:25:44 -0700
Date: Mon, 1 Jun 2009 11:25:44 -0700
From: r...@salmo.appl-ecosys.com
Message-Id: 200906011825.n51ipibx030...@salmo.appl-ecosys.com
To: rshep...@appl-ecosys.com
Subject: *SPAM* salmo Daily Mail Report for Monday, 01 June 2009
X-Spam-Prev-Subject: salmo Daily Mail Report for Monday, 01 June 2009

Report based on information in /var/log/maillog


Your biggest problems here are BAYES_99 and EMPTY_BODY.  To fix the 
Bayes problem, sa-learn some of these messages as ham.  Make sure you 
are learning as the right user...


The empty body problem is a more difficult problem.  Have procmail save 
a copy of the raw message somewhere and take a look at it.  Make sure 
there is a blank line between the headers and the body.  Run 
'spamassassin -D' on this saved message and look for anything unusual in 
the debug output.


--
Bowie
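
A minimal sketch of the separator check described above, assuming a raw copy
of the message has already been saved to a file. The helper name body_lines
and the file names are hypothetical.

```shell
#!/bin/sh
# body_lines FILE
# Count non-blank lines after the first empty line of a saved message.
# Prints 0 if there is no header/body separator or the body is blank.
body_lines() {
    awk '
        in_body && /[^ \t\r]/ { n++ }          # non-blank body line
        !in_body && /^\r?$/ { in_body = 1 }    # first empty line ends headers
        END { print n + 0 }
    ' "$1"
}

# Usage (file names are assumptions):
#   body_lines /tmp/saved-message
#   spamassassin -D < /tmp/saved-message > /dev/null 2> /tmp/sa-debug.log
#
# One way to have procmail keep a raw copy is a carbon-copy recipe placed
# before the SpamAssassin recipe (mailbox name is an assumption):
#   :0 c:
#   raw-copy.mbox
```

A non-zero count for a message that still scores EMPTY_BODY suggests the
rule itself, not the message, is at fault.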


Re: Identifying Source of False Positives

2009-06-01 Thread John Hardin

On Mon, 1 Jun 2009, Rich Shepard wrote:


On Mon, 1 Jun 2009, John Hardin wrote:


 If these are system-generated messages, something is improperly training
 SA that they are spam. Do you use autolearn?


John,

 No. Once a week or so I run sa-learn specifying spam on the spam-uncaught
mbox file. Less frequently I run it on mail list files specifying them as
ham.


And I assume you look at the spam-uncaught file before learning it?

If some log files got in there and were learned, that could explain the 
deterioration.


Have you kept your spam and ham corpora? I would suggest wiping your Bayes 
database and retraining it, after reviewing the corpora.
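
A sketch of that reset, assuming both corpora still exist as reviewed mbox
files. The helper name retrain_bayes and the file names are hypothetical;
the sa-learn options used are --clear, --spam, --ham, --mbox and
--dump magic.

```shell
#!/bin/sh
# retrain_bayes SPAM_MBOX HAM_MBOX
# Wipe the current user's Bayes database, relearn from two reviewed mbox
# files, then print the database counters. Stops at the first failure.
retrain_bayes() {
    spam_mbox=$1
    ham_mbox=$2
    sa-learn --clear &&
    sa-learn --spam --mbox "$spam_mbox" &&
    sa-learn --ham --mbox "$ham_mbox" &&
    sa-learn --dump magic
}

# Usage (file names are assumptions):
#   retrain_bayes ~/mail/spam-reviewed ~/mail/ham-reviewed
```

Run it as the same user that the mail is scanned as, otherwise a different
Bayes database gets trained.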



 Primarily I'd suggest you exclude locally-generated emails from SA
 completely. If you'd post the Received: headers from such a message and
 the procmail stanza where you pass messages to SA for scoring I could
 suggest something.


 Here are all headers from the mail log summary:


From r...@salmo.appl-ecosys.com Mon Jun  1 11:25:44 2009

Return-Path: r...@salmo.appl-ecosys.com
X-Spam-Flag: YES
X-Spam-Checker-Version: SpamAssassin 3.2.5-ph20040310.0 (2008-06-10) on
salmo.appl-ecosys.com
X-Spam-Level: 
X-Spam-Status: Yes, score=4.9 required=4.0 tests=ALL_TRUSTED,AWL,BAYES_99,
 EMPTY_BODY,NORMAL_HTTP_TO_IP,NUMERIC_HTTP_ADDR,URI_HEX,URI_NOVOWEL
 autolearn=no version=3.2.5-ph20040310.0
X-Spam-Report:
 * -1.3 ALL_TRUSTED Passed through trusted hosts only via SMTP
 *  3.5 BAYES_99 BODY: Bayesian spam probability is 99 to 100%
 *  [score: 1.]
 *  2.5 EMPTY_BODY BODY: Message has subject but no body
 *  0.0 NORMAL_HTTP_TO_IP URI: Uses a dotted-decimal IP address in URL
 *  0.4 URI_HEX URI: URI hostname has long hexadecimal sequence
 *  0.0 NUMERIC_HTTP_ADDR URI: Uses a numeric IP address in URL
 *  1.6 URI_NOVOWEL URI: URI hostname has long non-vowel sequence
 * -1.8 AWL AWL: From: address is in the auto white-list
X-Original-To: rshep...@appl-ecosys.com
Delivered-To: rshep...@appl-ecosys.com
Received: from salmo.appl-ecosys.com (localhost.localdomain [127.0.0.1])
 by salmo.appl-ecosys.com (Postfix) with ESMTP id 8DA0F1026
 for rshep...@appl-ecosys.com; Mon,  1 Jun 2009 11:25:44 -0700 (PDT)


Okay, let's key on that one.


## Call SpamAssassin
:0fw: spamassassin.lock
* < 256000
| spamassassin


:0 fw: spamassassin.lock
* < 256000
* ! ^TO_abuse@
* ! ^List-Id: .*?use...@.]spamassassin\.apache\.org?
* ! ^Received: from salmo\.appl-ecosys\.com \(localhost\.localdomain \[127\.0\.0\.1\]\) by salmo\.appl-ecosys\.com
| /usr/bin/spamc

Using spamc creates less load than launching spamassassin from scratch for 
every email, but you do have to manage the daemon (i.e. restart it if the 
rules change).


Are your resources really so limited that you want to serialize all email 
delivery? As a middle ground you might consider per-user lockfiles 
instead, e.g.:


   :0 fw: $HOME/.spamassassin.lock

I'd also suggest upping the size limit a bit, but that's not a big issue.

There are more complex things you can do; you might want to take a look at 
http://www.impsec.org/~jhardin/antispam/spamassassin.procmail


--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  We have to realize that people who run the government can and do
  change. Our society and laws must assume that bad people -
  criminals even - will run the government, at least part of the
  time.   -- John Gilmore
---
 5 days until the 65th anniversary of D-Day


Re: [sa] Re: Identifying Source of False Positives

2009-06-01 Thread Charles Gregory



 First guess, look at the procmail code that 'chooses' to run spamassassin.
 Have you used an 'h' where you meant to use an 'H', thereby feeding *only*
 the header to spamassassin?

## Call SpamAssassin
:0fw: spamassassin.lock
* < 256000
| spamassassin


Is there anywhere in the procmail recipe *above* this one that some 
special condition has been specified as:


   :0fwh

...which has the effect of 'filtering' the message down to just its
headers? It wouldn't necessarily have to be a recent change to your
procmailrc, it might just be a subtle change in the log mail that
'triggers' the rule when it didn't before.

Next guess: Has this log summary grown in size past some limit that would 
cause the whole body to be 'truncated'?


- Charles


Re: [sa] Re: Identifying Source of False Positives

2009-06-01 Thread Rich Shepard

On Mon, 1 Jun 2009, Charles Gregory wrote:



Is there anywhere in the procmail recipe *above* this one that some
special condition has been specified as:

  :0fwh

...which has the effect of 'filtering' the message down to just its
headers? It wouldn't necessarily have to be a recent change to your
procmailrc, it might just be a subtle change in the log mail that
'triggers' the rule when it didn't before.


Charles,

# BEGIN RECIPES

# Nuke duplicate messages
#:0 Wh: msgid.lock
#| $FORMAIL -D 8192 msgid.cache

## Call SpamAssassin
:0fw: spamassassin.lock
* < 256000
| spamassassin

  The first recipe has been commented out for a while now, so the call to SA
is at the top of the list.

Next guess: Has this log summary grown in size past some limit that would 
cause the whole body to be 'truncated'?


  No. The log summary report (with headers) is  26,000 bytes.

Rich


Re: Identifying Source of False Positives

2009-06-01 Thread Rich Shepard

On Mon, 1 Jun 2009, Bowie Bailey wrote:


Your biggest problems here are BAYES_99 and EMPTY_BODY.  To fix the Bayes
problem, sa-learn some of these messages as ham.  Make sure you are
learning as the right user...


Bowie,

  I just did this on a run from this morning. I'll do so again tomorrow
morning with both the mail log and log watch reports.


The empty body problem is a more difficult problem.  Have procmail save a
copy of the raw message somewhere and take a look at it.  Make sure there
is a blank line between the headers and the body.  Run 'spamassassin -D'
on this saved message and look for anything unusual in the debug output.


  There is always a blank line between headers and body. I tried running
'spamassassin -D' on the saved message and nothing happened. Should it take
more than a few seconds to complete and return a debug report?

Thanks,

Rich


Re: Identifying Source of False Positives

2009-06-01 Thread Theo Van Dinter
fwiw, even if there isn't a blank line, SA will figure it out (though
it'll trigger a MISSING_HB_SEP rule hit).

As for the debug output ... it depends, how did you run the command
(ie: what was the command you tried).  My guess is you did something
like spamassassin -D filename, where filename gets treated as the
argument to -D, so then it was waiting for input.  If this is the
case, try spamassassin -D < filename > /dev/null. :)

On Mon, Jun 1, 2009 at 6:09 PM, Rich Shepard rshep...@appl-ecosys.com wrote:
  There is always a blank line between headers and body. I tried running
 'spamassassin -D' on the saved message and nothing happened. Should it take
 more than a few seconds to complete and return a debug report?
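
A sketch of the redirection described above, with the debug trace kept in a
file so it can be searched. The message goes in on stdin and the trace comes
out on stderr. The helper names sa_debug and hit_rules and the file names
are hypothetical, and the pattern assumes "ran ... rule NAME ... got hit"
lines like those in spamassassin's debug output.

```shell
#!/bin/sh
# sa_debug FILE LOG
# Scan FILE with debugging on. The rewritten message on stdout is
# discarded and the debug trace on stderr is saved to LOG.
sa_debug() {
    spamassassin -D < "$1" > /dev/null 2> "$2"
}

# hit_rules LOG
# List the names of the rules reported as hits in a saved debug trace.
hit_rules() {
    sed -n 's/.*rules: ran [a-z_]* rule \([A-Za-z0-9_]*\) =*>* *got hit.*/\1/p' "$1"
}

# Usage (file names are assumptions):
#   sa_debug /tmp/saved-message /tmp/sa-debug.log
#   hit_rules /tmp/sa-debug.log
```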


Re: Identifying Source of False Positives

2009-06-01 Thread Rich Shepard

On Mon, 1 Jun 2009, John Hardin wrote:


And I assume you look at the spam-uncaught file before learning it?


  Yes. The messages in there are those I deliberately move there after
they've ended up in my inbox because neither the postfix filters nor the
spamassassin rules caught them.

If some log files got in there and were learned, that could explain the 
deterioration.


  That seems very reasonable, but I would have had to move them there myself
and I cannot recall doing so. Also, before running sa-learn to classify them
as spam I look over the list. So, it's quite possible that they ended up
classified as spam unintentionally.


Have you kept your spam and ham corpora?


  I'm not sure. The spam comes from the spam-uncaught file which is cleared
each time it's run. The ham comes from various mail lists and they grow over
time.


Okay, let's key on that one.


## Call SpamAssassin
:0fw: spamassassin.lock
* < 256000
| spamassassin


:0 fw: spamassassin.lock
* < 256000
* ! ^TO_abuse@
* ! ^List-Id: .*?use...@.]spamassassin\.apache\.org?
* ! ^Received: from salmo\.appl-ecosys\.com \(localhost\.localdomain \[127\.0\.0\.1\]\) by salmo\.appl-ecosys\.com
| /usr/bin/spamc

Using spamc creates less load than launching spamassassin from scratch for 
every email, but you do have to manage the daemon (i.e. restart it if the 
rules change).


  I run spamd:

 2978 ?        Ss    12:16 /usr/bin/spamd -d --pidfile=/var/run/spamd.pid
 3052 ?        S      0:04 spamd child
 3054 ?        S      0:05 spamd child

Is this not adequate for a light load?

Are your resources really so limited that you want to serialize all email 
delivery? As a middle ground you might consider per-user lockfiles instead, 
e.g.:



  :0 fw: $HOME/.spamassassin.lock

I'd also suggest upping the size limit a bit, but that's not a big issue.

There are more complex things you can do; you might want to take a look at 
http://www.impsec.org/~jhardin/antispam/spamassassin.procmail


  There are only two users on this network and a low mail volume for each of
us.

  The size limit has been at that value for years without a problem. I'll
keep teaching SA that the log reports are ham and see if that makes a
difference. As I wrote earlier, this is all within the past quarter year,
and it's been a PITA since it's taken time and attention away from my
business.

Thanks,

Rich


Re: Identifying Source of False Positives

2009-06-01 Thread Rich Shepard

On Mon, 1 Jun 2009, Theo Van Dinter wrote:


My guess is you did something like spamassassin -D filename, where
filename gets treated as the argument to -D, so then it was waiting for input.


Theo,

  Yes, this is what I did.


If this is the case, try spamassassin -D < filename > /dev/null. :)


  Interesting:

[785] dbg: rules: running uri tests; score so far=1.2
[785] dbg: rules: compiled uri tests
[785] dbg: rules: ran uri rule NORMAL_HTTP_TO_IP == got hit:
http://211.129.107.12;
[785] dbg: rules: ran uri rule URI_HEX == got hit:
http://kemp-5d866973;
[785] dbg: rules: ran uri rule NUMERIC_HTTP_ADDR == got hit:
http://1898218;
[785] dbg: rules: ran uri rule URI_NOVOWEL == got hit: http://jcwpjkp;
[785] dbg: rules: ran uri rule __DOS_HAS_ANY_URI == got hit: h
[785] dbg: eval: stock info total: 0
[785] warn: rules: failed to run CG_FUJI_JPG test, skipping:
[785] warn:  (Can't locate object method image_name_regex via package
Mail::SpamAssassin::PerMsgStatus at (eval 719) line 1315.
[785] warn: )
[785] warn: rules: failed to run CG_DOUBLEDOT_GIF test, skipping:
[785] warn:  (Can't locate object method image_name_regex via package
Mail::SpamAssassin::PerMsgStatus at (eval 719) line 1580.
[785] warn: )
[785] warn: rules: failed to run CG_SONY_JPG test, skipping:
[785] warn:  (Can't locate object method image_name_regex via package
Mail::SpamAssassin::PerMsgStatus at (eval 719) line 2601.
[785] warn: )
[785] dbg: rules: ran eval rule BAYES_50 == got hit (1)
[785] warn: rules: failed to run CG_CANON_JPG test, skipping:
[785] warn:  (Can't locate object method image_name_regex via package
Mail::SpamAssassin::PerMsgStatus at (eval 719) line 4000.
[785] warn: )
[785] dbg: rules: running rawbody tests; score so far=3.191
[785] dbg: rules: compiled rawbody tests
[785] dbg: rules: running full tests; score so far=3.191
[785] dbg: rules: compiled full tests
[785] dbg: util: current PATH is:
/root/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/usr/local/bin:/usr/bin:/bin:/usr/lib/java/bin:/usr/lib/java/jre/bin:/usr/lib/java/bin:/usr/lib/java/jre/bin:/usr/lib/qt/bin:/usr/share/texmf/bin
[785] dbg: pyzor: pyzor is not available: no pyzor executable found
[785] dbg: pyzor: no pyzor found, disabling Pyzor
[785] dbg: rules: running meta tests; score so far=3.191
[785] dbg: rules: compiled meta tests
[785] dbg: check: running tests for priority: 500
[785] dbg: dns: harvest_dnsbl_queries
[785] dbg: async: select found 4 responses ready (t.o.=0.0)
[785] dbg: async: completed in 0.149 s: URI-DNSBL,
DNSBL:sbl.spamhaus.org.:10.96.127.75
[785] dbg: async: completed in 0.156 s: URI-DNSBL,
DNSBL:sbl.spamhaus.org.:10.178.19.65
[785] dbg: async: completed in 0.155 s: URI-DNSBL,
DNSBL:sbl.spamhaus.org.:11.25.147.192
[785] dbg: async: completed in 0.155 s: URI-DNSBL,
DNSBL:sbl.spamhaus.org.:110.0.55.209
[785] dbg: async: queries completed: 4, started: 0
[785] dbg: async: queries active: URI-DNSBL=62 URI-NS=10 at Mon Jun 1
15:53:13 2009
[785] dbg: dns: harvest_dnsbl_queries - check_tick
[785] dbg: async: select found 1 responses ready (t.o.=1.0)
[785] dbg: async: completed in 0.158 s: URI-DNSBL,
DNSBL:sbl.spamhaus.org.:39.0.58.80
[785] dbg: async: queries completed: 1, started: 0
[785] dbg: async: queries active: URI-DNSBL=61 URI-NS=10 at Mon Jun 1
15:53:13 2009
[785] dbg: dns: harvest_dnsbl_queries - check_tick
  ...
[785] dbg: check: is spam? score=3.191 required=4
[785] dbg: check:
tests=ALL_TRUSTED,BAYES_50,EMPTY_BODY,NORMAL_HTTP_TO_IP,NUMERIC_HTTP_ADDR,URI_HEX,URI_NOVOWEL
[785] dbg: check:
subtests=__DATE_700,__DOS_BODY_MON,__DOS_HAS_ANY_URI,__DOS_RCVD_MON,__DOS_REF_TODAY,__ENV_AND_HDR_FROM_MATCH,__FB_NUM_PERCNT,__HAS_ANY_EMAIL,__HAS_ANY_URI,__HAS_MSGID,__HAS_RCVD,__HAS_SUBJECT,__KAM_MED2,__KAM_NUMBER2,__KAM_TIME4,__MISSING_REF,__MSGID_OK_DIGITS,__MSGID_OK_HOST,__MSOE_MID_WRONG_CASE,__NAKED_TO,__NONEMPTY_BODY,__SANE_MSGID,__TOCC_EXISTS,__hk_obfdomreq2

  It suddenly jumps from 1.2 to 3.191 after looking for images. I don't know
where to fix that. I think that I need to update SPF, too, because that's
compiled against an earlier perl version.
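
To see where a score changes in output like the above, the "score so far" checkpoints can be pulled out and compared. The following is only a minimal sketch, assuming the debug output uses the "running ... tests; score so far=" wording shown in this log; the function name score_deltas is made up for illustration.

```shell
#!/bin/sh
# Read `spamassassin -D` debug output on stdin and print every
# "score so far" checkpoint with the change since the previous one.
# score_deltas is a hypothetical helper name, not part of SpamAssassin.
score_deltas() {
    # Keep only the checkpoint lines, reduced to "<phase> <score>".
    sed -n 's/.*dbg: rules: running \(.*\) tests; score so far=\([0-9.-]*\).*/\1 \2/p' |
    awk '{
        # The first checkpoint has no predecessor, so its delta is shown as 0.
        delta = (NR == 1) ? 0 : $NF - prev
        printf "%-10s score=%s delta=%+.3f\n", $1, $NF, delta
        prev = $NF
    }'
}
```

Since the debug output is written to stderr, one way to feed it in would be `spamassassin -D < filename 2>&1 >/dev/null | score_deltas`. A non-zero delta only narrows the search to the rules that ran between two checkpoints; the "got hit" lines in that stretch of the log still have to be read by hand.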

Rich


Re: Identifying Source of False Positives

2009-06-01 Thread John Hardin

On Mon, 1 Jun 2009, Rich Shepard wrote:


On Mon, 1 Jun 2009, John Hardin wrote:


 Have you kept your spam and ham corpora?


 I'm not sure. The spam comes from the spam-uncaught file which is 
cleared each time it's run.


Pity. If you're manually training it's a very good idea to retain your 
corpora so you can review training and retrain from scratch if needed.
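
One way to retain the training mail is to append it to a dated archive before the working file is cleared. This is a rough sketch under stated assumptions: the function name archive_corpus and the directory layout are invented for illustration, and the training file is assumed to be in mbox format.

```shell
#!/bin/sh
# Append a training mbox to a dated archive file so the messages
# survive after the working file is cleared.
# Usage: archive_corpus SOURCE_MBOX ARCHIVE_DIR
# archive_corpus is a hypothetical helper; the paths are up to the caller.
archive_corpus() {
    src=$1
    dir=$2
    [ -s "$src" ] || return 0            # empty or missing: nothing to keep
    mkdir -p "$dir" || return 1
    dest="$dir/$(basename "$src")-$(date +%Y-%m-%d)"
    cat "$src" >> "$dest" || return 1    # append, so repeated runs accumulate
    printf '%s\n' "$dest"                # report where the copy went
}
```

Calling it just before the file is cleared, for example `archive_corpus "$HOME/mail/spam-uncaught" "$HOME/corpus/spam"` (both paths are examples only), leaves a copy behind. Retraining from scratch would then be a matter of `sa-learn --clear` followed by `sa-learn --spam --mbox` and `sa-learn --ham --mbox` on the archived files.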



 Okay, let's key on that one.

  ## Call SpamAssassin
  :0fw: spamassassin.lock
  * < 256000
  | spamassassin

 :0 fw: spamassassin.lock
 * < 256000
 * ! ^TO_abuse@
 * ! ^List-Id: .*?use...@.]spamassassin\.apache\.org?
 * ! ^Received: from salmo\.appl-ecosys\.com \(localhost\.localdomain \[127\.0\.0\.1\]\) by salmo\.appl-ecosys\.com
 | /usr/bin/spamc

 Using spamc creates less load than launching spamassassin from scratch
 for every email, but you do have to manage the daemon (i.e. restart it
 if the rules change).


 I run spamd:

 2978 ?Ss12:16 /usr/bin/spamd -d --pidfile=/var/run/spamd.pid
 3052 ?S  0:04 spamd child
 3054 ?S  0:05 spamd child

is this not adequate for a light load?


That's fine. If you're currently running spamd, then having procmail call 
spamassassin is wasteful. That recompiles all of the rules from scratch 
for every message you receive, where using spamc/spamd compiles the rules 
once when you restart the daemon.
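
Because spamd only compiles the rules at startup, a rule change needs a reload. A small sketch of checking the configuration and then signalling the daemon follows. It assumes the pidfile path from the ps listing in this thread and that spamd restarts itself on SIGHUP; the function name reload_spamd and the SA_LINT override variable are invented for illustration.

```shell
#!/bin/sh
# Lint the SpamAssassin configuration, then ask a running spamd to reload.
# Usage: reload_spamd [PIDFILE]
# reload_spamd and SA_LINT are hypothetical names; SA_LINT only exists so
# the lint command can be swapped out (for example when testing).
reload_spamd() {
    pidfile=${1:-/var/run/spamd.pid}
    lint=${SA_LINT:-spamassassin --lint}
    # Do not touch the daemon if the rules fail to parse.
    if ! $lint; then
        echo "lint failed; spamd left running with the old rules" >&2
        return 1
    fi
    if [ ! -r "$pidfile" ]; then
        echo "cannot read pidfile $pidfile" >&2
        return 1
    fi
    # Assumption: spamd treats SIGHUP as a request to restart itself.
    kill -HUP "$(cat "$pidfile")"
}
```

Run as root after editing local.cf, this would be just `reload_spamd`. Linting first means a typo in a rule file leaves the old, working daemon in place rather than restarting it onto a broken configuration.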



 Are your resources really so limited that you want to serialize all
 email delivery? As a middle ground you might consider per-user
 lockfiles instead, e.g.:



 :0 fw: $HOME/.spamassassin.lock

 I'd also suggest upping the size limit a bit, but that's not a big issue.

 There are more complex things you can do; you might want to take a
 look at http://www.impsec.org/~jhardin/antispam/spamassassin.procmail


 There are only two users on this network and a low mail volume for each 
of us.


Ok, then your locking should work okay.

I'll keep teaching SA that the log reports are ham and see if that makes 
a difference.


It will help, though it may take a while to override their current 
learning as spam.



As I wrote earlier, this is all within the past quarter year,
and it's been a PITA since it's taken time and attention away from my
business.


--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  ...to announce there must be no criticism of the President or to
  stand by the President right or wrong is not only unpatriotic and
  servile, but is morally treasonous to the American public.
  -- Theodore Roosevelt, 1918
---
 5 days until the 65th anniversary of D-Day