Re: Scanning large-body spam
Alex wrote: What settings do people typically have these days for the maximum scanned message size? Surprisingly, at least to me, I'm seeing spam in the 650k and 700k range, at least a few per hour, and they are not scanned. Does anyone have any suggestions for optimizing the process for spam containing just a large image that would therefore bypass the typical scanning? Should I be scanning messages that large, then?

Depends on your available CPU resources. If you always have a low load average, you can scan larger messages. My production deployment is such a workhorse that I've got it set to 1.1MB.

My general advice is that since many spammers will check against a default SA scan before blasting out their messages, you want something slightly larger than whatever the default is (actually, in the event that it has changed between versions, something slightly larger than the largest default SA has ever shipped with).

Maybe somebody who knows the innards better can comment on how quickly and efficiently SA can ignore non-text attachments (for those of us who don't try to decode Word documents and PDFs or use OCR on images).

Wasn't some earlier version of SA capable of scanning just the /first/ [size] of an email? Probably harder to implement within MIME, but some control to internally truncate remaining pieces (for scanning only, like the pseudo-headers) would allow scanning beyond the size limit.
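The advice above (pick a limit slightly above the largest default ever shipped) can be sketched as follows. This is only an illustration: the default sizes in the list are placeholder values, not verified SpamAssassin defaults, and the function names are made up for this example.

```python
# Placeholder values in bytes; substitute the defaults you have actually
# verified for your SpamAssassin/glue versions. These are NOT authoritative.
ASSUMED_HISTORICAL_DEFAULTS = [250 * 1024, 500 * 1024]


def pick_scan_limit(known_defaults, headroom_percent=10):
    """Return a limit slightly larger than the largest known default."""
    return max(known_defaults) * (100 + headroom_percent) // 100


def should_scan(message_size, limit):
    """Scan only messages whose size does not exceed the chosen limit."""
    return message_size <= limit


limit = pick_scan_limit(ASSUMED_HISTORICAL_DEFAULTS)
print(limit, should_scan(700 * 1024, limit))
```

Integer arithmetic is used for the headroom so the result does not depend on floating-point rounding.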
Re: Scanning large-body spam
On Wed, Mar 31, 2010 at 11:05:57AM -0400, Adam Katz wrote: Wasn't some earlier version of SA capable of scanning just the /first/ [size] of an email? Probably harder to implement within MIME, but some control to internally truncate remaining pieces (for scanning only, like the pseudo-headers) would allow scanning beyond the size limit.

SA 3.3 has special handling for truncated messages and amavisd-new (if it's your choice of glue) has already done it since 2.6.3. Never encountered a problem with it. Here are release notes for the record:

- large messages beyond $sa_mail_body_size_limit are now partially passed to SpamAssassin and other spam scanners for checking: a copy passed to a spam scanner is truncated near or slightly past the indicated limit. Large messages are no longer given an almost free passage through spam checks.

  Note that message truncation can invalidate a DKIM or DK signature. If using (non-default) SpamAssassin rules to assign score points to mail with no valid signatures from authors which are expected to always provide a valid signature, the message truncation can cause false positives on these rules. As a workaround, to a truncated message passed to spam scanners, amavisd inserts a header field:

    X-Amavis-MessageSize: m, TRUNCATED to n

  which can be captured by SpamAssassin rules, e.g.:

    header __TRUNCATED X-Amavis-MessageSize =~ m{\A[^\n]*TRUNCATED}m

  and used in rules like NOTVALID_EBAY to prevent them from triggering.

  Starting with version 3.3.0 of SpamAssassin, its DKIM plugin understands the issue and receives undamaged DKIM signature objects directly from amavisd, so the above workaround is not needed. Also, a hit on a __TRUNCATED rule is automatically generated (explicit header rule is not necessary), just in case it might be useful for some purpose.

For other glue, I recommend taking it up with the author to support truncating properly. (Hmm, I don't think spamc has been enhanced yet..)

Of course we hope that someday SA will have true support for ignoring useless attachment data.
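For anyone handling the truncation marker outside of SpamAssassin, here is a minimal Python sketch of the same check the `__TRUNCATED` header rule performs. The header name and the regular expression come from the release notes quoted above; the function name and the sample sizes are invented for illustration.

```python
import re
from email import message_from_string

# Same pattern as the quoted SA rule: "TRUNCATED" on the first line of the value.
TRUNCATED_RE = re.compile(r"\A[^\n]*TRUNCATED", re.MULTILINE)


def is_truncated(raw_message: str) -> bool:
    """Return True if the X-Amavis-MessageSize header marks a truncated copy."""
    msg = message_from_string(raw_message)
    value = msg.get("X-Amavis-MessageSize")
    return value is not None and TRUNCATED_RE.search(value) is not None


# Sample sizes below are made up for the example.
sample = (
    "Subject: test\n"
    "X-Amavis-MessageSize: 700000, TRUNCATED to 409600\n"
    "\n"
    "body\n"
)
print(is_truncated(sample))
```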
Re: Scanning large-body spam
On Wed, 31 Mar 2010, Henrik K wrote: SA 3.3 has special handling for truncated messages

Excuse me for not *thinking* earlier, but it occurs to me that there is a very big drawback to *truncating* a message before passing it to SA, as opposed to my original request/suggestion to *flag* (or set a config param?) to tell SA to *ignore* parts of a message past a certain size.

I believe it is fairly common practice for MTAs to expect SA to return the *entire* message, complete with X-Spam header 'markup', on SA's standard output stream. This is particularly important where mail classified as *slightly* spammy is delivered to a special spam folder based upon the headers added by SA, or on a system where all mail tagged as spam is quarantined. Having SA's markup/explanations is critical to analysing false positives/negatives.

So SA needs to read and write the *entire* message, but then be given a parameter to keep it from thrashing over the really large ones.

- Charles
Re: Scanning large-body spam
Hi,

Does anyone have any suggestions for optimizing the process for spam containing just a large image that would therefore bypass the typical scanning? Should I be scanning messages that large, then?

Depends on your available CPU resources. If you always have a low load average, you can scan larger messages. My production deployment is such a workhorse that I've got it set to 1.1MB.

Will messages this large have the benefit of Bayes? What would be the impact on the corresponding sa-learn of a message of that size? Perhaps only learn the header and body components that aren't an attachment somehow?

Thanks, Alex
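The idea of learning only the header and the non-attachment text parts could be sketched as a preprocessing step like the one below. This is an assumption-laden illustration using Python's standard `email` package; the thread does not say whether sa-learn already does something equivalent internally, and the function name and choice of header fields are hypothetical.

```python
from email import message_from_bytes


def learnable_text(raw: bytes) -> str:
    """Collect selected headers plus inline text parts, skipping attachments."""
    msg = message_from_bytes(raw)
    chunks = []
    # Hypothetical choice of header fields to keep.
    for name in ("From", "To", "Subject"):
        value = msg.get(name)
        if value:
            chunks.append(f"{name}: {value}")
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no content of their own
        if part.get_content_maintype() != "text":
            continue  # skip images, PDFs, and other binary parts
        if part.get_content_disposition() == "attachment":
            continue  # skip text files sent as attachments
        payload = part.get_payload(decode=True) or b""
        charset = part.get_content_charset() or "utf-8"
        try:
            chunks.append(payload.decode(charset, errors="replace"))
        except LookupError:
            # Unknown charset label: fall back to UTF-8 with replacement.
            chunks.append(payload.decode("utf-8", errors="replace"))
    return "\n".join(chunks)
```

With a message consisting of a short text part and a large image, the result contains only the headers and the text, so its size no longer depends on the image.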
Re: Scanning large-body spam
On Wednesday March 31 2010 18:05:52 Charles Gregory wrote: Excuse me for not *thinking* earlier, but it occurs to me that there is a very big drawback to *truncating* a message before passing it to SA, as opposed to my original request/suggestion to *flag* (or set a config param?) to tell SA to *ignore* parts of a message past a certain size. I believe it is fairly common practice for MTAs to expect SA to return the *entire* message, complete with X-Spam header 'markup', on SA's standard output stream. This is particularly important where mail classified as *slightly* spammy is delivered to a special spam folder based upon the headers added by SA, or on a system where all mail tagged as spam is quarantined. Having SA's markup/explanations is critical to analysing false positives/negatives. So SA needs to read and write the *entire* message, but then be given a parameter to keep it from thrashing over the really large ones.

There are some drawbacks in depriving SpamAssassin of the full message and letting it work on a truncated message, appropriately marked as one. But even the message header alone often carries half the value of score quality. Adding to that, the first 400 kB of a body already covers plenty of information about a message. It would be better of course to let SA have access to full or summarized info about the rest of the message (like its attachments) too, but doing without is not too bad. Comparing the quality of a score on a partial message to not having any score at all (and passing any big message as clean) makes the decision trivial (it just needs to be done).

I believe it is fairly common practice for MTAs to expect SA to return the *entire* message, complete with X-Spam header 'markup', on SA's standard output stream.

Sure, but this is an implementation detail. There is no underlying reason that spamc could not keep the original message and only feed part of it to spamd, then merge the results back and do the final message editing (like inserting/editing header fields) by itself. Another option is to modify spamd and let it handle arbitrary size messages by avoiding its current paradigm of keeping the entire message in memory.

Anyway, the amavisd glue to SpamAssassin does just that: it lets SpamAssassin see only the first 400 kB (configurable) of a large message, then edits the original message based on results obtained from SpamAssassin. This offers the best of both worlds: it handles arbitrary size messages, and avoids SpamAssassin slurping it all into memory. The tricky details are in editing the message, and ensuring that DKIM and DK signatures survive (which is done by using an out-of-band channel between a caller and SA with its plugins, as provided by SA 3.3).

Mark
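The "feed a truncated copy, then edit the original" approach described above might look roughly like this in outline. It is a sketch under stated assumptions, not amavisd's actual code: `scanner` is a hypothetical callable standing in for the real call to SpamAssassin, the header names are illustrative, the cut is made exactly at the limit rather than "near or slightly past" it, and the DKIM handling mentioned above is left out entirely.

```python
from typing import Callable, Tuple

# Hypothetical scanner interface: takes message bytes, returns (score, is_spam).
Scanner = Callable[[bytes], Tuple[float, bool]]


def scan_large_message(raw: bytes, scanner: Scanner,
                       limit: int = 400 * 1024) -> bytes:
    """Scan at most `limit` bytes, then tag the full original message."""
    truncated = len(raw) > limit
    sample = raw[:limit] if truncated else raw
    score, is_spam = scanner(sample)  # only the sample is handed to the scanner

    # Match the line-ending style already used by the message.
    eol = "\r\n" if b"\r\n" in sample else "\n"
    headers = [
        f"X-Spam-Flag: {'YES' if is_spam else 'NO'}",
        f"X-Spam-Score: {score:.1f}",
    ]
    if truncated:
        # Illustrative marker header, not one defined by any real tool.
        headers.append(f"X-Scan-Truncated: {len(raw)} to {limit}")

    # Prepend the result headers to the untouched original message.
    return (eol.join(headers) + eol).encode("ascii") + raw
```

The caller keeps the complete message and hands the scanner only the sample. The scanner's memory use is therefore bounded by the limit, while the delivered message is still the full original plus the result headers.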
Re: Scanning large-body spam
On Wed, 31 Mar 2010, Mark Martinec wrote: and let it handle arbitrary size messages by avoiding its current paradigm of keeping the entire message in memory.

Is there really a problem with the in-memory size? I would have thought the major concern was the processing time for evaluating 'full' (and rawbody?) rules on a large message.

Anyway, the amavisd glue to SpamAssassin does just that: let SpamAssassin see only the first 400 kB (configurable) of a large message, then edit the original message based on results obtained from SpamAssassin.

Good for amavisd, but not for those of us relying on SA to do the whole job, and not having our MTAs perform any further message modification.

I would be interested in having some of the developers offer an opinion on this. Where is the real 'cost' in running SA against a large message? Is it just the memory used? Or is it, as I suspect, the use of 'full' rules?

- Charles
Re: Scanning large-body spam
On Wednesday March 31 2010 23:43:25 Charles Gregory wrote: Is there really a problem with the in-memory size? I would have thought the major concern was the processing time for evaluating 'full' (and rawbody?) rules on a large message.

Yes, sure, the main issue is with evaluating regexp rules over a large message. Nevertheless, even now, keeping 50 child processes with a 100 MB memory footprint each is not to be underestimated. Add to that the several copies (raw, decoded, array of lines, ...) of a large message held in perl's data structures, which can be a big deal. And bear in mind that once a process running perl extends its virtual memory, it cannot shrink back, so it stays huge forever after processing one large message.

Mark
Scanning large-body spam
Hi,

What settings do people typically have these days for the maximum scanned message size? Surprisingly, at least to me, I'm seeing spam in the 650k and 700k range, at least a few per hour, and they are not scanned. Does anyone have any suggestions for optimizing the process for spam containing just a large image that would therefore bypass the typical scanning? Should I be scanning messages that large, then?

Thanks, Alex
Re: Scanning large-body spam
Alex wrote: Hi, What settings do people typically have these days for the maximum scanned message size? Surprisingly, at least to me, I'm seeing spam in the 650k and 700k range, at least a few per hour, and they are not scanned. Does anyone have any suggestions for optimizing the process for spam containing just a large image that would therefore bypass the typical scanning? Should I be scanning messages that large, then? Thanks, Alex

I just bumped mine up from 150K to 1M to cover these new ones that contain a JPEG or PNG and are in the 500K range in size. I'm not sure if it'll matter too much to scan the odd email that's large; I'll have to monitor my stats.

-lee