Justin Mason said:
In this technology, when a mail comes in it is first stripped of its HTML
tags, so that words like v<aa>i<aa>a<aa>g<aa>r<aa>a are brought back to
their original clear-text form. Then the entropy-type compression that you
suggested is carried out on this cleaned message, and the ratio of
similarity is matched.
Actually, judging by the description here:
http://www.brightmail.com/xsp-as-features.html , it's really a
content-hashing technique, so probably more similar to Razor, DCC
or Pyzor than anything else.
So if the method is not patented, SA could use it?

Yes, we can definitely write a plugin to incorporate this technique into SpamAssassin, and it would make our favourite filter even better. But as Justin mentioned earlier, if this is closer to Razor, DCC or Pyzor, then we already have the plugins in place and no additional work is needed.

What I am not sure about is whether Razor, DCC or Pyzor strip the HTML tags from a message and then compute the checksum of the clear-text message. If we can have that, then we can probably ward off many of the other techniques that spammers are using these days to cheat the filters. One interesting technique that I observed very recently is as follows.
A rather interesting piece of HTML code used to cheat the filters:
<a href=3d"http://www=2espyware-killer-software=2ecom/cgi-bin/rd=2ecgi= ?IvC7R3lvJb">http://www=2espyware-killer-software=2ecom/cgi-bin/rd=2ecgi?=IvC7R3lvJb</a>
Here "=" appears as "=3d", the dot (.) appears as "=2e", and the "?" has a stray "=" next to it, and there was a lot of other stuff similar to this. Interestingly, the mail displayed perfectly well in the mail reader. But when I used the same code in a simple HTML file and tried to view it in a browser, it was a complete mess, as expected. Unfortunately this mail passed through the RBL checks + SpamAssassin + DCC + SpamCopURI + Razor filters. The score it received was just
HTML_MESSAGE 0.10, BAYES_90 2.10
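The "=3d" and "=2e" sequences in the link above look like quoted-printable escapes for "=" and "." (this is an assumption, since the message headers are not shown here). That would explain why a mail reader displays the link correctly while a browser, which does not undo that encoding, shows a mess. A minimal Python sketch of the decoding step, using the standard quopri module; the helper name decode_qp is made up for illustration:

```python
import quopri


def decode_qp(raw: bytes) -> str:
    """Undo quoted-printable escapes such as =2e (.) and =3d (=).

    Hypothetical helper for illustration; assumes the body is ASCII.
    """
    return quopri.decodestring(raw).decode("ascii")


# Part of the link text from the spam sample, as seen in the raw body.
link_text = b"http://www=2espyware-killer-software=2ecom/cgi-bin/rd=2ecgi"
print(decode_qp(link_text))

# The start of the anchor tag from the same sample.
print(decode_qp(b'<a href=3d"http://www=2espyware-killer-software=2ecom/'))
```

A filter that looks at the URL only after this step sees the real host name rather than the escaped form.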
I think I will have to train my Bayes database once more to take care of this message. What we need to develop are techniques to trap these and other kinds of mail.
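As a rough illustration of the "strip the tags, then checksum the clear text" idea raised earlier, here is a minimal Python sketch. It is not how Razor, DCC or Pyzor actually compute their signatures; the names TextExtractor, clear_text and checksum are made up for this example, and SHA-1 over lower-cased, whitespace-collapsed text is only an assumed stand-in for whatever hash a real plugin would use.

```python
import hashlib
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text between tags and ignores the tags themselves."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def clear_text(html: str) -> str:
    """Return the message text with tags removed, lower-cased,
    and with runs of whitespace collapsed to single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).lower().split())


def checksum(html: str) -> str:
    """Hash the cleaned text (SHA-1 chosen only for illustration)."""
    return hashlib.sha1(clear_text(html).encode("utf-8")).hexdigest()


obfuscated = "v<aa>i<aa>a<aa>g<aa>r<aa>a"
print(clear_text(obfuscated))
print(checksum(obfuscated) == checksum("Viagra"))
```

With this kind of normalisation, the obfuscated word and the plain word end up with the same checksum, so inserting bogus tags no longer changes the signature. An exact hash like this would still be defeated by any change in the visible text, which is why the real services use fuzzier signatures.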
Rakesh