http://bugzilla.spamassassin.org/show_bug.cgi?id=3013





------- Additional Comments From [EMAIL PROTECTED]  2004-02-06 11:39 -------
Removing the . and _ sounds like a decent solution to me (though my experience
with spamassassin is limited to the past week).  With web addresses so common,
'.' has become a common delimiter to separate words.  For instance, Java uses
the tld.domain.project.subproject naming scheme for classes.  It's no accident
that many of the X-Mailer: headers use internet domains as their identifiers.

I think you're right that there's no simple way of distinguishing
'ckGmqXGFWNfaNAxRse' from 'ClassifiedVentures' using regular expressions.
Assuming what you're really looking for is randomly generated X-Mailer
strings (or some ratware guy just hitting keys on his keyboard), you might just
look at the "information content" of the string.  'ckGmqXGFWNfaNAxRse' is a
random string of upper/lowercase text, whereas 'ClassifiedVentures' is not
random at all.  The random string contains more "information"; the
non-random one contains less.  A simple test might be trying to compress the
string.  If it's very compressible, it has low information content and wasn't
generated randomly.  If it's not very compressible, it has high information
content and was probably randomly generated.
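
To make the compression idea concrete, here is a rough sketch in Python (not
SpamAssassin code; the function name compression_ratio is made up for
illustration).  It uses zlib as the compressor, which is just one possible
choice.  One caveat I'd expect: on strings as short as an X-Mailer token, the
compressor's fixed overhead probably dominates, so both examples above may come
out looking "incompressible".  The ratio is likely more useful on longer spans
of text, and any cutoff would have to be tuned against real mail.

```python
import zlib


def compression_ratio(text: str) -> float:
    """Return compressed size / original size for the given text.

    Lower values mean the text is more compressible (less "information");
    values near or above 1.0 mean the compressor found little redundancy.
    Assumption: zlib at maximum level stands in for "a compressor".
    """
    data = text.encode("utf-8")
    if not data:
        # Nothing to measure; treat empty input as fully compressible.
        return 0.0
    return len(zlib.compress(data, 9)) / len(data)


if __name__ == "__main__":
    # Compare a highly repetitive sample with the two strings from the bug.
    for sample in ("spam " * 200, "ClassifiedVentures", "ckGmqXGFWNfaNAxRse"):
        print(repr(sample[:20]), round(compression_ratio(sample), 2))
```

Comparing ratios between strings of similar length is probably safer than
comparing against a fixed threshold, since the ratio depends heavily on length.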

Slightly off topic, but could this kind of test be applied to other parts
of a message too?  I've noticed a lot of spam with random strings inserted in
an attempt to get past filters.  If you could identify these strings as
random, you could add to a mail's spam rating.



------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
