Re: caching scores - dev idea

Ken Anderson (Pacific Internet) 9 Mar 2004 18:21:36 -0000

Justin Mason wrote:

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
"Ken Anderson (Pacific Internet)" writes:
This is a dev question.
SA gives identical messages identical scores when the message is simply passed to SA via Perl (as MailScanner does). Wouldn't SA perform better if it cached scores for a short time, keyed on a message checksum?
it would, but ...  the issue is that spammers are trying very hard to
*avoid* making hashable messages, since multiple messages with the same
hash means "bulk email", and they do not want their messages to be
identified as that.  Hence the low accuracy rates of DCC, Pyzor and Razor
(low relative to what they *could* be that is).
Making hashing schemes that are resistant to spammer evasion, without
FPs, is quite hard.
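
To make the trade-off concrete, here is a minimal sketch of a short-lived score cache keyed on an exact checksum. It is not SpamAssassin code: the `ScoreCache` class, its method names, the 300-second lifetime and the sample score are all assumptions made for illustration. It shows both the saving on byte-identical messages and why a single random token added by the sender turns a hit into a miss.

```python
import hashlib
import time


class ScoreCache:
    """Hypothetical short-lived cache of scan scores, keyed by exact checksum."""

    def __init__(self, ttl_seconds=300.0, clock=time.monotonic):
        self.ttl = ttl_seconds      # assumed lifetime; "a short time" in the thread
        self.clock = clock          # injectable so expiry can be tested
        self._entries = {}          # checksum -> (expiry time, score)

    @staticmethod
    def checksum(message: bytes) -> str:
        # Exact hash: any change to the bytes produces a different key.
        return hashlib.sha1(message).hexdigest()

    def get(self, message: bytes):
        key = self.checksum(message)
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires, score = entry
        if self.clock() >= expires:
            del self._entries[key]  # stale entry, force a rescan
            return None
        return score

    def put(self, message: bytes, score: float) -> None:
        key = self.checksum(message)
        self._entries[key] = (self.clock() + self.ttl, score)


cache = ScoreCache()
original = b"Subject: cheap pills\n\nBuy now!\n"
cache.put(original, 12.4)                # 12.4 is a made-up score

hit = cache.get(original)                # identical bytes: served from cache
miss = cache.get(original + b"xq7f\n")   # one random token appended: None
```

A fuzzy hash would tolerate the appended token, but as noted above, making such a scheme resistant to evasion without false positives is the hard part.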

How bad are DCC and Pyzor? We see a high correlation: messages that hit the DCC and Pyzor checks in SA also tend to score quite high on other spam tests. They don't seem too bad, at least against messages from lazy spammers. :-)

This would be beneficial in typical dictionary attacks, where the messages are not made unique in any way, or when sendmail is splitting recipients using queue groups so that a message with 10 recipients is actually passed to SA 10 times. That might sound odd, but with MailScanner, SA and sendmail on a mail gateway/relay, this is commonly done to permit per-user rules. The time SA spends scanning identical messages is a waste of CPU.
No, makes perfect sense -- that's the thing that's initially
counterintuitive until you consider what per-user customisation means ;).
But that then points out the other problem with the idea.  What if
user A has a score for MIME_HTML_ONLY of 0.1, but user B has a score
of 5.0?   We can't simply cache scores, we should cache the rules hit.
But then what if user A has a bayes DB that says that the "Daily Blah
Newsletter" is ham, but user B has trained that as spam?  We'd have
to cache all hits *except* for bayes, and run that separately.
It gets messy very quickly.  As far as I can see -- with per-user
customisation in the mix, this is not necessarily a good idea at all.
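
The split described above (cache the rules hit, score them per user, and keep Bayes out of the cache) can be sketched roughly as follows. This is an illustration under assumptions, not how SpamAssassin is structured: `RuleHitCache`, `score_for_user` and the callables passed in are hypothetical, and it assumes Bayes rules can be recognised by a `BAYES_` name prefix. Expiry is left out for brevity.

```python
import hashlib

BAYES_PREFIX = "BAYES_"  # assumption: Bayes rules are identifiable by name


class RuleHitCache:
    """Hypothetical cache of user-independent rule hits, keyed by checksum."""

    def __init__(self, run_shared_rules):
        self._run = run_shared_rules  # callable: message bytes -> set of rule names
        self._hits = {}               # checksum -> frozenset of rule names
        self.scans = 0                # how many real scans were needed

    def hits_for(self, message: bytes) -> frozenset:
        key = hashlib.sha1(message).hexdigest()
        if key not in self._hits:
            self.scans += 1
            hits = self._run(message)
            # Never cache Bayes results: they depend on each user's database.
            self._hits[key] = frozenset(
                h for h in hits if not h.startswith(BAYES_PREFIX)
            )
        return self._hits[key]


def score_for_user(message, cache, user_scores, user_bayes):
    """Combine cached rule hits with one user's scores and Bayes verdict."""
    hits = set(cache.hits_for(message))
    bayes_rule = user_bayes(message)  # per-user, always run separately
    if bayes_rule is not None:
        hits.add(bayes_rule)
    # Rules the user has no score for contribute nothing in this sketch.
    return sum(user_scores.get(rule, 0.0) for rule in hits)
```

With user A scoring `MIME_HTML_ONLY` at 0.1 and user B at 5.0, both totals come from a single scan of the shared rules, but each user's Bayes lookup still runs every time. That remaining per-user work is part of why the idea gets messy.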

But what if you are not doing per-user custom SA scoring at the SA rule level, only whitelists, blacklists, and spam handling actions based on the SA score returned to MailScanner? Caching seems like it might be a good fit for this configuration, and could speed up SA quite a bit if you are also splitting recipients in sendmail.
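
A rough sketch of that configuration: the message is scanned once, and each recipient's whitelist, blacklist and handling action are applied to the shared score afterwards. Everything here is hypothetical (the `apply_policies` function, the policy dictionary layout, the action names and the 5.0 threshold), and it assumes the split copies are byte-identical, which may not hold if the MTA adds per-recipient headers.

```python
import hashlib


def apply_policies(copies, scan, policies, threshold=5.0):
    """Scan each distinct message once, then decide per recipient.

    copies:   iterable of (recipient, sender, message bytes), one per split copy
    scan:     callable returning the shared score for a message
    policies: recipient -> {"whitelist": set, "blacklist": set} (assumed layout)
    """
    scores = {}   # checksum -> score, shared by all recipients
    actions = {}  # recipient -> action name
    for recipient, sender, message in copies:
        key = hashlib.sha1(message).hexdigest()
        if key not in scores:
            scores[key] = scan(message)  # the only expensive step
        policy = policies.get(recipient, {})
        if sender in policy.get("whitelist", ()):
            actions[recipient] = "deliver"
        elif sender in policy.get("blacklist", ()):
            actions[recipient] = "reject"
        elif scores[key] >= threshold:
            actions[recipient] = "quarantine"
        else:
            actions[recipient] = "deliver"
    return actions
```

Because no per-user rule scores or Bayes databases are involved, the cached score is valid for every recipient, so a message split into 10 copies would cost one scan instead of 10.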

Thanks,
Ken A
Pacific.Net

- --j.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFATgRMQTcbUG5Y7woRAjsyAKCQQLs+yQxfY9W3LZw6YogTjjQ9fQCgyhTo
w8rXoAwz/C9/JyYRLU5SHms=
=dGmS
-----END PGP SIGNATURE-----
