http://bugzilla.spamassassin.org/show_bug.cgi?id=2878
------- Additional Comments From [EMAIL PROTECTED] 2004-01-04 13:40 -------

Thanks for the interesting follow-up!

I suppose HTML::Strip should be well suited to stripping the HTML from the HTML part... Collapsing all whitespace to a single space is probably also a good idea.

As for comparing the strings, I wasn't aware of that algorithm, but the link gave me some keywords, so I googled for "normalized string edit distance" and came up with several interesting results. Among them is Arslan and Egecioglu: "Efficient Algorithms For Normalized Edit Distance" (2000),
http://www.cs.ucsb.edu/~omer/DOWNLOADABLE/JDA00.ps

It's not clear to me what the n in your post is, but if I understand the abstract of the paper correctly (it was all that I read), their algorithm should be better... :-)

Over the past week, 90% of the spam that has passed my SMTP rejection score of 13 has been of this type, so it sure would be a great addition to SA if we could get this working.

BTW, I've noticed something interesting about this spam: their random-word database is evidently quite small and contains a bunch of rarely used words. That is why Bayesian filtering works so well against it: such words rarely appear in any ham, so when they occur frequently in spam, they make it easy to catch. However, it is just a matter of time before spammers build a larger database of words, and while that can never fool a well-trained Bayesian filter, it may make its signal weaker, so to speak.
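
To make the idea concrete, here is a rough sketch in Python rather than Perl, so it is not SA code. It assumes the goal is to compare the text/plain part of a message with the stripped text of its HTML part. The function names are made up for illustration, Python's standard html.parser stands in for HTML::Strip, and the score is plain Levenshtein distance divided by the longer length, which is a simple normalization and not the Arslan and Egecioglu algorithm.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and drops tags (a rough stand-in for HTML::Strip)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_html(html):
    """Return the text content of an HTML string, with tags removed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # Joining with a space keeps words in adjacent blocks apart; it can also
    # split a word that has a tag in the middle, which is acceptable here.
    return " ".join(parser.chunks)


def normalize(text):
    """Collapse every run of whitespace to a single space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def edit_distance(a, b):
    """Plain Levenshtein distance, O(len(a) * len(b)) time, one row of memory."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def part_dissimilarity(plain_part, html_part):
    """Score in [0, 1]: 0 means identical text, 1 means nothing in common."""
    a = normalize(plain_part)
    b = normalize(strip_html(html_part))
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return edit_distance(a, b) / longest
```

A score close to 1 would suggest that the two parts carry unrelated text. Any threshold for turning that into a rule would have to be tuned against real mail.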
