http://bugzilla.spamassassin.org/show_bug.cgi?id=2878

------- Additional Comments From [EMAIL PROTECTED]  2004-01-04 13:40 -------
Thanks for the interesting follow-up! 

I suppose HTML::Strip should do a good job of stripping the tags from the HTML
part... Collapsing all whitespace runs to a single space before comparing is
probably a good idea, too.
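
Something like this is roughly what I have in mind (an untested sketch; it
assumes HTML::Strip's default behaviour of emitting a space where a tag used
to be):

  use strict;
  use warnings;
  use HTML::Strip;

  # Strip the tags from the HTML part, then collapse every run of
  # whitespace to a single space so that layout differences don't
  # matter when the two parts are compared.
  sub html_to_plain {
      my ($html) = @_;
      my $hs = HTML::Strip->new();
      my $text = $hs->parse($html);
      $hs->eof;
      $text =~ s/\s+/ /g;        # collapse whitespace
      $text =~ s/^\s+|\s+$//g;   # trim the ends
      return $text;
  }

  print html_to_plain("<p>Buy   now!<br>Limited\noffer</p>"), "\n";
  # should print something like: Buy now! Limited offer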

As for comparing the strings, I wasn't aware of that algorithm, but the link
gave me some keywords, so I googled for "normalized string edit distance", and
came up with several interesting results. Among them, Arslan and Egecioglu:
"Efficient Algorithms For Normalized Edit Distance" (2000),
http://www.cs.ucsb.edu/~omer/DOWNLOADABLE/JDA00.ps

It's not clear to me what the n in your post refers to, but if I understand the
paper's abstract correctly (the abstract is all I read), their algorithm should
be better... :-)
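
Just to make the comparison concrete, here is a quick baseline sketch: plain
Levenshtein distance computed with the usual dynamic program, divided by the
length of the longer string. It is not the algorithm from the paper, just a
crude stand-in to see whether the measure separates matching text/HTML pairs
from unrelated ones:

  use strict;
  use warnings;
  use List::Util qw(min max);

  # Plain Levenshtein distance (the usual O(len_s * len_t) dynamic
  # program), divided by the longer length as a crude normalization
  # to [0, 1]: 0 means identical, 1 means nothing matches.
  sub normalized_edit_distance {
      my ($s, $t) = @_;
      my @s = split //, $s;
      my @t = split //, $t;
      return 0 unless @s || @t;

      my @prev = (0 .. scalar @t);
      for my $i (1 .. scalar @s) {
          my @cur = ($i);
          for my $j (1 .. scalar @t) {
              my $cost = $s[$i - 1] eq $t[$j - 1] ? 0 : 1;
              $cur[$j] = min(
                  $prev[$j] + 1,          # deletion
                  $cur[$j - 1] + 1,       # insertion
                  $prev[$j - 1] + $cost,  # substitution
              );
          }
          @prev = @cur;
      }
      return $prev[-1] / max(scalar @s, scalar @t);
  }

  printf "%.3f\n", normalized_edit_distance("viagra", "v1agra");  # 0.167

If even that crude ratio separates the two cases cleanly on real samples, a
properly normalized edit distance like the one in the paper should only be an
improvement.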

Over the past week, 90% of the spam that has made it past my SMTP rejection
score of 13 has been of this type, so it sure would be a great addition to SA
if we could get this working.

BTW, I've noticed something interesting about this spam: their random-word
database is evidently quite small and contains a bunch of rarely used words.
That is exactly why Bayesian filtering works so well here: such words hardly
ever appear in ham, so when they occur frequently in spam they make the message
easy to catch.

However, it is only a matter of time before spammers build a larger database of
words, and while that can never fool a well-trained Bayesian filter, it may
make its signal weaker, so to speak.


