Re: Trying to detect bogus end tags

Berend De Schouwer 2 Mar 2004 14:50:59 -0000

On Monday 01 March 2004 11:54, Berend De Schouwer wrote:
> On Saturday 28 February 2004 05:46, Loren Wilton wrote:
> > I'm trying to come up with a way to detect bogus end tags, and so
> > far I'm not having much luck.
> >
> > What I'm specifically trying to catch are things like
> >
> > </table>
> > </belch></huntsville></delusion></wilma></boswell></attune>
> > </vasectomy></centum></surf></yeasty></molt></autocollimate>
> > </acrobat></harvest></gage></flagrant></fumble></nowadays>
> > </BODY>
> > </HTML>


This does catch them:

full     __HTML_BAYES_POISON     /(\<\/[a-zA-Z]{3,}\>[\s]*){20,}/
meta     BDS_HTML_BAYES_POISON   (HTML_MESSAGE && __HTML_BAYES_POISON)
describe BDS_HTML_BAYES_POISON   Multiple garbage HTML close tag
score    BDS_HTML_BAYES_POISON   2.0

You might get false positives if you regularly get HTML e-mail with real 
runs of close tags like (/b)(/font)(/tr)(/td)(/table), but I think that 
won't be common, at least not 20 tags in a row.  To reduce false 
positives, I've made it count only tags of three or more letters, so 
(/b) and (/tr) won't count, and I've used 'full' instead of 'rawbody' so 
I don't have a problem with \r\n.
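
To try the pattern outside SpamAssassin, here is a minimal Python 
sketch.  It assumes the Perl regex carries over unchanged to Python's 
re module (the backslashes before the angle brackets are dropped, since 
they are not needed).  The function name and the sample strings are made 
up for illustration and are not from real mail.

```python
import re

# Same pattern as the __HTML_BAYES_POISON rule, transcribed to Python's re syntax.
POISON = re.compile(r"(</[a-zA-Z]{3,}>\s*){20,}")


def looks_poisoned(body: str) -> bool:
    """Return True if body has a run of 20+ close tags of 3+ letters each."""
    return POISON.search(body) is not None


# Hypothetical sample data: one real close tag followed by 21 garbage ones.
garbage = "</table>\n" + "</belch></huntsville></delusion>\n" * 7

# Hypothetical legitimate markup: short tags such as </b>, </tr> and </td>
# break up every run, so the 20-tag threshold is never reached.
legit = "</b></font></tr></td></table>\n" * 10

print(looks_poisoned(garbage))
print(looks_poisoned(legit))
```

The short tags do the work in the second sample: each (/b), (/tr) or 
(/td) interrupts the run, so even a long stretch of ordinary table 
markup stays well under 20 consecutive matches.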

The ideal solution is for SpamAssassin, or a plugin, to search for 
unmatched HTML tags, in HTML MIME parts, except for (p) and (br), and 
score 0.1 for each unmatched tag.  In the absence of perfection, try 
this.
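
For what the "ideal" check might look like, here is a rough sketch in 
Python rather than as a real SpamAssassin plugin.  It uses the standard 
html.parser module, and it assumes "unmatched" means a close tag with no 
open tag before it; open tags that are never closed are not counted.  
The class and function names are hypothetical.

```python
from html.parser import HTMLParser

IGNORED = {"p", "br"}   # tags the post says to skip
SCORE_PER_TAG = 0.1     # per-tag score suggested in the post


class UnmatchedTagCounter(HTMLParser):
    """Counts close tags that have no matching open tag (an assumption
    about what 'unmatched' should mean)."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # stack of currently open tag names
        self.unmatched = 0    # close tags seen with nothing to close

    def handle_starttag(self, tag, attrs):
        if tag not in IGNORED:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in IGNORED:
            return
        if tag in self.open_tags:
            # Pop back to the most recent matching open tag.
            while self.open_tags.pop() != tag:
                pass
        else:
            self.unmatched += 1


def unmatched_score(html: str) -> float:
    """Return 0.1 for every unmatched close tag found in html."""
    parser = UnmatchedTagCounter()
    parser.feed(html)
    parser.close()
    return round(parser.unmatched * SCORE_PER_TAG, 1)


# Hypothetical sample: three garbage close tags after a real table.
sample = ("<html><body><table><tr><td>hi</td></tr></table>"
          "</belch></huntsville></delusion></body></html>")
print(unmatched_score(sample))
```

Unlike the regex, this does not need a threshold of 20 tags in a row, 
but it would have to be run on each HTML MIME part separately.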

-- 
Berend De Schouwer
