Justin Mason wrote:
"Ken Anderson (Pacific Internet)" writes:
This is a dev question.
Since SA scores for identical messages are identical if the message is simply passed to SA via Perl (as MailScanner does), wouldn't it increase SA's performance if it cached scores for a short time, keyed on a message checksum?
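For illustration only, the caching idea in the question could be sketched as below. This is not SA or MailScanner code: `ScoreCache`, `get_or_scan` and the `scan` callback are hypothetical names, and the 60-second lifetime is an arbitrary assumption.

```python
import hashlib
import time


class ScoreCache:
    """Short-lived cache of scan scores keyed by a checksum of the raw message.

    Hypothetical sketch: `scan` stands in for whatever actually calls the scanner.
    """

    def __init__(self, ttl_seconds=60.0, clock=time.monotonic):
        self.ttl = ttl_seconds      # how long a cached score stays valid (assumed value)
        self.clock = clock          # injectable clock, handy for testing
        self._entries = {}          # checksum -> (expiry_time, score)

    @staticmethod
    def checksum(raw_message: bytes) -> str:
        # Exact checksum: any byte-level difference yields a different key.
        return hashlib.sha1(raw_message).hexdigest()

    def get_or_scan(self, raw_message: bytes, scan):
        key = self.checksum(raw_message)
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]         # fresh hit: skip the expensive scan
        score = scan(raw_message)   # miss or expired: scan and remember the result
        self._entries[key] = (now + self.ttl, score)
        return score
```

A message split into 10 single-recipient copies would then be scanned once and served from the cache nine times, provided the copies are byte-identical when they reach the scanner. Expired entries are only overwritten here, never purged, so a real cache would also need eviction.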
It would, but ... the issue is that spammers are trying very hard to *avoid* sending hashable messages, since multiple messages with the same hash mean "bulk email", and they do not want their messages to be identified as such. Hence the low accuracy rates of DCC, Pyzor and Razor (low relative to what they *could* be, that is).
Making hashing schemes that are resistant to spammer evasion, without FPs, is quite hard.
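To make the evasion problem concrete, here is a small sketch using a plain SHA-1 checksum. It only illustrates exact hashing; it does not reproduce the checksum schemes DCC, Pyzor or Razor actually use, and the message text and the random token are made up.

```python
import hashlib


def exact_checksum(body: str) -> str:
    """Plain SHA-1 over the message body; no normalisation, no fuzziness."""
    return hashlib.sha1(body.encode("utf-8")).hexdigest()


original = "Buy cheap pills now!"
variant = "Buy cheap pills now! xq7z"   # same spam plus a hypothetical random token

# The two bodies differ by a few characters, so their checksums do not match
# and an exact-match lookup treats them as unrelated messages.
print(exact_checksum(original) == exact_checksum(variant))
```

A scheme that resists this has to ignore or normalise the parts a spammer can vary cheaply, and that is where the risk of FPs comes in.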
How bad are DCC and Pyzor? We seem to see a high correlation: messages that hit the DCC and PYZOR checks in SA also tend to score quite high on other spam tests. They don't seem too bad, at least against messages from lazy spammers. :-)
This would be beneficial in typical dictionary attacks, where the messages are not made unique in some way, or when sendmail is splitting recipients using queue groups so that a message with 10 recipients is actually passed to SA 10 times. That might sound odd, but with MailScanner, SA and sendmail on a mail gateway/relay, this is commonly done to permit per-user rules. The time SA sometimes spends scanning identical messages is a waste of CPU.
No, makes perfect sense -- that's the thing that's initially counterintuitive until you consider what per-user customisation means ;).
But that then points out the other problem with the idea. What if user A has a score for MIME_HTML_ONLY of 0.1, but user B has a score of 5.0? We can't simply cache scores; we would have to cache the rules hit instead. But then what if user A has a Bayes DB that says that the "Daily Blah Newsletter" is ham, but user B has trained that as spam? We'd have to cache all hits *except* for Bayes, and run that separately.
It gets messy very quickly. As far as I can see -- with per-user customisation in the mix, this is not necessarily a good idea at all.
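A rough sketch of the split described above (cache the rule hits, apply each user's scores afterwards, run Bayes per user) might look like the following. Every name here is hypothetical, the default scores are invented, and `run_shared_rules` / `run_user_bayes` are placeholder callbacks rather than SA APIs.

```python
import hashlib

# Made-up default scores, used when a user has no override for a rule.
DEFAULT_SCORES = {"MIME_HTML_ONLY": 1.0, "HYPOTHETICAL_RULE": 2.0}

# checksum -> frozenset of rule names hit by the user-independent rules
hit_cache = {}


def cached_rule_hits(raw_message: bytes, run_shared_rules):
    """Run the user-independent rules once per distinct message."""
    key = hashlib.sha1(raw_message).hexdigest()
    if key not in hit_cache:
        hit_cache[key] = frozenset(run_shared_rules(raw_message))
    return hit_cache[key]


def score_for_user(raw_message, user_scores, run_shared_rules, run_user_bayes):
    """Combine cached rule hits with one user's scores and Bayes result."""
    hits = cached_rule_hits(raw_message, run_shared_rules)
    # Per-user score overrides are applied after the cache lookup, not before.
    total = sum(user_scores.get(rule, DEFAULT_SCORES.get(rule, 0.0)) for rule in hits)
    # Bayes depends on each user's training, so it is never cached here.
    return total + run_user_bayes(raw_message)
```

Even in this toy form the cost is visible: anything that depends on the user (scores, Bayes, and in practice any per-user rule) has to be kept out of the cached part.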
But what if you are not doing per-user custom SA scoring at the SA rule level, only whitelists, blacklists, and spam handling actions based on the SA score returned to MailScanner? Caching seems like it might be a good fit for that configuration, and could speed up SA quite a bit if you are splitting recipients in sendmail as well.
Thanks, Ken A Pacific.Net
--j.
