Re: Scanning large-body spam

2010-03-31 Thread Adam Katz
Alex wrote:
 What settings do people typically have these days for the maximum
 scanned message size? Surprisingly, at least to me, I'm seeing spam in
 the 650k and 700k range, at least a few per hour, and they are not scanned.
 
 Does anyone have any suggestions for optimizing the process for spam
 containing just a large image that would therefore bypass the typical
 scanning? Should I be scanning messages that large, then?

Depends on your available CPU resources.  If you always have a low
load average, you can scan larger messages.  My production deployment
is such a workhorse that I've got it set to 1.1MB.

My general advice is that since many spammers will check against a
default SA scan before blasting out their messages, you want something
slightly larger than whatever the default is (actually, in the event
that it has changed between versions, something slightly larger than
the largest default SA has ever shipped with).

Maybe somebody who knows the innards better can comment on how quickly
and efficiently SA can ignore non-text attachments (for those of us
who don't try to decode Word documents and PDFs or use OCR on images).

Wasn't some earlier version of SA capable of scanning just the /first/
[size] of an email?  Probably harder to implement within MIME, but
some control to internally truncate remaining pieces (for scanning
only, like the pseudo-headers) would allow scanning beyond the size limit.


Re: Scanning large-body spam

2010-03-31 Thread Henrik K
On Wed, Mar 31, 2010 at 11:05:57AM -0400, Adam Katz wrote:
 
 Wasn't some earlier version of SA capable of scanning just the /first/
 [size] of an email?  Probably harder to implement within MIME, but
 some control to internally truncate remaining pieces (for scanning
 only, like the pseudo-headers) would allow scanning beyond the size limit.

SA 3.3 has special handling for truncated messages and amavisd-new (if it's
your choice of glue) has already done it since 2.6.3. Never encountered a
problem with it. Here are release notes for the record:


- large messages beyond $sa_mail_body_size_limit are now partially passed
  to SpamAssassin and other spam scanners for checking: a copy passed to
  a spam scanner is truncated near or slightly past the indicated limit.
  Large messages are no longer given an almost free passage through spam
  checks.

  Note that message truncation can invalidate a DKIM or DK signature.
  If using (non-default) SpamAssassin rules to assign score points to mail
  with no valid signatures from authors which are expected to always provide
  a valid signature, the message truncation can cause false positives on
  these rules. As a workaround, to a truncated message passed to spam
  scanners, amavisd inserts a header field:
      X-Amavis-MessageSize: m, TRUNCATED to n
  which can be captured by SpamAssassin rules, e.g.:
      header __TRUNCATED X-Amavis-MessageSize =~ m{\A[^\n]*TRUNCATED}m
  and used in rules like NOTVALID_EBAY to prevent them from triggering.

  Starting with version 3.3.0 of SpamAssassin, its DKIM plugin understands
  the issue and receives undamaged DKIM signature objects directly from
  amavisd, so the above workaround is not needed. Also, a hit on a __TRUNCATED
  rule is automatically generated (explicit header rule is not necessary),
  just in case it might be useful for some purpose.


For other glue, I recommend taking it up with the author to support
truncating properly. (Hmm, I don't think spamc has been enhanced yet..)

Of course we hope that someday SA will have true support for ignoring
useless attachment data.
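
The truncation scheme described in the release notes above can be sketched
roughly as follows. This is a minimal Python illustration, not amavisd-new's
actual code (which is Perl): the function name truncated_copy, the choice to
cut at the first line break at or after the limit, and placing the marker
as the first header field are all assumptions. Only the header field name
and the regular expression are taken from the release notes.

```python
import re

# Pattern from the release notes' __TRUNCATED rule, transcribed for Python's re.
TRUNCATED_RE = re.compile(r"\A[^\n]*TRUNCATED", re.M)


def truncated_copy(raw: bytes, limit: int) -> bytes:
    """Return a copy of `raw` for the spam scanner, cut near `limit` bytes.

    Messages within the limit are returned unchanged. Larger ones are cut at
    the first line break at or after `limit` (an assumption; the release
    notes only say "near or slightly past") and get an X-Amavis-MessageSize
    header field prepended so rules can tell the copy is incomplete.
    """
    if len(raw) <= limit:
        return raw
    cut = raw.find(b"\n", limit)
    if cut == -1 or cut + 1 >= len(raw):
        return raw  # nothing left to cut off after the next line break
    cut += 1
    marker = b"X-Amavis-MessageSize: %d, TRUNCATED to %d\r\n" % (len(raw), cut)
    return marker + raw[:cut]
```

The original message is never modified here; only the scanner's copy is
shortened. A rule equivalent to __TRUNCATED can then be checked by applying
TRUNCATED_RE to the value of the X-Amavis-MessageSize field.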



Re: Scanning large-body spam

2010-03-31 Thread Charles Gregory

On Wed, 31 Mar 2010, Henrik K wrote:

 SA 3.3 has special handling for truncated messages


Excuse me for not *thinking* earlier, but it occurs to me that there is a 
very big drawback to *truncating* a message before passing it to SA, as 
opposed to my original request/suggestion to *flag* (or set a config 
param?) to tell SA to *ignore* parts of a message past a certain size.


I believe it is fairly common practice for MTAs to expect SA to return
the *entire* message, complete with X-Spam header 'markup', from SA's 
standard output stream. This is particularly important where mail 
classified as *slightly* spammy is delivered to a special spam folder 
based upon the headers added by SA. Or on a system where all mail tagged 
as spam is quarantined. Having SA's markup/explanations is critical to 
analysing false positives/negatives.


So SA needs to read and write the *entire* message, but then be given a 
parameter to keep it from thrashing over the really large ones.


- Charles


Re: Scanning large-body spam

2010-03-31 Thread Alex
Hi,

 Does anyone have any suggestions for optimizing the process for spam
 containing just a large image that would therefore bypass the typical
 scanning? Should I be scanning messages that large, then?

 Depends on your available CPU resources.  If you always have a low
 load average, you can scan larger messages.  My production deployment
 is such a workhorse that I've got it set to 1.1MB.

Will messages this large have the benefit of bayes? What would be the
impact on the corresponding sa-learn of a message of that size?
Perhaps only learn the header and body components that aren't an
attachment somehow?

Thanks,
Alex
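
The idea raised above, looking only at the header and the body components
that are not attachments, can be sketched with Python's standard email
package. This is only an illustration of the idea: it is not how sa-learn
or SpamAssassin's Bayes tokenizer works internally, and text_only_view is
a hypothetical helper name. Unknown character sets are not handled.

```python
from email import message_from_bytes, policy


def text_only_view(raw: bytes) -> str:
    """Return the header block plus decoded text/* parts, skipping attachments."""
    msg = message_from_bytes(raw, policy=policy.default)
    header_block = "".join(f"{name}: {value}\n" for name, value in msg.items())
    bodies = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no content of their own
        if part.get_content_maintype() != "text":
            continue  # image, PDF, Word document, ...: ignored
        if part.get_content_disposition() == "attachment":
            continue  # text files sent as attachments are ignored too
        bodies.append(part.get_content())
    return header_block + "\n" + "\n".join(bodies)
```

For a message that is one short text part plus a large image, the result is
a few hundred bytes regardless of how big the image is.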


Re: Scanning large-body spam

2010-03-31 Thread Mark Martinec
On Wednesday March 31 2010 18:05:52 Charles Gregory wrote:
 Excuse me for not *thinking* earlier, but it occurs to me that there is a
 very big drawback to *truncating* a message before passing it to SA, as
 opposed to my original request/suggestion to *flag* (or set a config
 param?) to tell SA to *ignore* parts of a message past a certain size.

 I believe it is fairly common practice for MTAs to expect SA to return
 the *entire* message, complete with X-Spam header 'markup', from SA's
 standard output stream. This is particularly important where mail
 classified as *slightly* spammy is delivered to a special spam folder
 based upon the headers added by SA. Or on a system where all mail tagged
 as spam is quarantined. Having SA's markup/explanations is critical to
 analysing false positives/negatives.
 
 So SA needs to read and write the *entire* message, but then be given a
 parameter to keep it from thrashing over the really large ones.

There are some drawbacks in depriving SpamAssassin of the full message
and letting it work on a truncated message, appropriately marked as one.
But even the message header alone often carries half the value of score
quality. Adding to that the first 400 kB of a body already covers plenty
of information about a message. It would be better of course to let SA
have access to full or summarized info about the rest of the message
(like its attachments) too, but doing without is not too bad. Comparing
the quality of a score on a partial message to not having any score
at all (and passing any big message as clean) makes the decision trivial
(it just needs to be done).

 I believe it is fairly common practice for MTAs to expect SA to return
 the *entire* message, complete with X-Spam header 'markup', from SA's
 standard output stream.

Sure, but this is an implementation detail. There is no underlying reason
that spamc could not keep the original message and only feed part of it
to spamd, then merge the results back and do the final message editing
(like inserting/editing header fields) by itself. Or to modify spamd and
let it handle arbitrary size messages by avoiding its current paradigm
of keeping the entire message in memory.

Anyway, the amavisd glue to SpamAssassin does just that: let SpamAssassin
see only the first 400 kB (configurable) of a large message, then edit
the original message based on results obtained from SpamAssassin. This
offers the best of both worlds: it handles arbitrary size messages, and avoids
SpamAssassin slurping it all into memory. The tricky details are in editing
the message, and ensuring that DKIM and DK signatures survive (which is
done by using an out-of-band channel between a caller and SA with its
plugins, as provided by SA 3.3).

  Mark
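
The approach described above (keep the original, feed only the first part
to the scanner, then edit the original based on the results) might look
roughly like this in Python. It is a sketch under stated assumptions, not
amavisd's or spamc's code: scan_and_annotate is a hypothetical name, the
scan callback stands in for the real call into the spam scanner, and
prepending the new header fields is just the simplest way to edit the
message. DKIM handling is left out entirely.

```python
from typing import Callable, List


def scan_and_annotate(raw: bytes, limit: int,
                      scan: Callable[[bytes], List[bytes]]) -> bytes:
    """Scan at most `limit` bytes of `raw`, then annotate the full message.

    `scan` is a hypothetical stand-in for the spam scanner; it receives the
    (possibly truncated) copy and returns complete header lines without
    line endings, e.g. b"X-Spam-Flag: YES".
    """
    new_headers = scan(raw[:limit])
    # Reuse the message's own line ending so the result stays consistent.
    eol = b"\r\n" if b"\r\n" in raw[:1000] else b"\n"
    block = b"".join(line + eol for line in new_headers)
    # Prepend to the header section; the original body passes through untouched.
    return block + raw
```

The scanner never sees more than limit bytes, yet the caller still gets back
the entire message with the scanner's header fields in place.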


Re: Scanning large-body spam

2010-03-31 Thread Charles Gregory

On Wed, 31 Mar 2010, Mark Martinec wrote:
 and let it handle arbitrary size messages by avoiding its current
 paradigm of keeping the entire message in memory.


Is there really a problem with the in-memory size? I would have thought 
the major concern was the processing time for evaluating 'full' (and 
rawbody?) rules on a large message.



 Anyway, the amavisd glue to SpamAssassin does just that: let SpamAssassin
 see only the first 400 kB (configurable) of a large message, then edit
 the original message based on results obtained from SpamAssassin.


Good for amavisd, but not for those of us who rely on SA to do the whole
job and do not have our MTAs perform any further message modification.


I would be interested in having some of the developers offer an opinion on 
this. Where is the real 'cost' in running SA against a large message? Is 
it just the memory used? Or is it, as I suspect, the use of 'full' rules?


- Charles


Re: Scanning large-body spam

2010-03-31 Thread Mark Martinec
On Wednesday March 31 2010 23:43:25 Charles Gregory wrote:
 Is there really a problem with the in-memory size? I would have thought
 the major concern was the processing time for evaluating 'full' (and
 rawbody?) rules on a large message.

Yes, sure, the main issue is with evaluating regexp rules over
a large message. Nevertheless, even now keeping 50 copies of
100 MB memory-footprint child processes is not to be underestimated.
Add to that several copies (raw, decoded, array of lines, ...)
of a large message in Perl's data structures, and it can be a big deal.
And bear in mind that once a process running Perl extends its
virtual memory, it cannot shrink back, so it stays huge forever
after processing one large message.


  Mark
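
The arithmetic behind the figures above can be spelled out in a few lines
of Python. The 50 children and 100 MB per child come from the message; the
20 MB message size and the count of four in-memory copies are purely
illustrative assumptions, as are the function names.

```python
def resident_memory_mb(children: int, per_child_mb: float) -> float:
    """Total memory held by `children` processes of `per_child_mb` each."""
    return children * per_child_mb


def peak_child_mb(baseline_mb: float, message_mb: float, copies: int) -> float:
    """Rough peak size of one child holding `copies` copies of a message."""
    return baseline_mb + copies * message_mb


# Figures from the message: 50 children at 100 MB each.
baseline_total = resident_memory_mb(50, 100)   # 5000 MB

# Assumed: a 20 MB message held in 4 forms (raw, decoded, lines, ...).
one_child_peak = peak_child_mb(100, 20, 4)     # 180 MB

# If every child has processed such a message once and never shrinks back:
worst_case_total = resident_memory_mb(50, one_child_peak)  # 9000 MB
```

Under these assumptions, one large message per child is enough to nearly
double the pool's footprint for as long as the children live.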


Scanning large-body spam

2010-03-30 Thread Alex
Hi,

What settings do people typically have these days for the maximum
scanned message size? Surprisingly, at least to me, I'm seeing spam in
the 650k and 700k range, at least a few per hour, and they are not scanned.

Does anyone have any suggestions for optimizing the process for spam
containing just a large image that would therefore bypass the typical
scanning? Should I be scanning messages that large, then?

Thanks,
Alex


Re: Scanning large-body spam

2010-03-30 Thread Lee Dilkie
Alex wrote:
 Hi,

 What settings do people typically have these days for the maximum
 scanned message size? Surprisingly, at least to me, I'm seeing spam in
 the 650k and 700k range, at least a few per hour, and they are not scanned.

 Does anyone have any suggestions for optimizing the process for spam
 containing just a large image that would therefore bypass the typical
 scanning? Should I be scanning messages that large, then?

 Thanks,
 Alex
   
I just bumped mine up from 150K to 1M to cover these new ones that
contain a JPEG or PNG and are in the 500K range in size. I'm not sure if
it'll matter too much to scan the odd email that's large; I'll have
to monitor my stats.

-lee