On Monday 01 March 2004 11:54, Berend De Schouwer wrote:
> On Saturday 28 February 2004 05:46, Loren Wilton wrote:
> > I'm trying to come up with a way to detect bogus end tags, and so
> > far I'm not having much luck.
> >
> > What I'm specifically trying to catch are things like
> >
> > </table>
> > </belch></huntsville></delusion></wilma></boswell></attune>
> > </vasectomy></centum></surf></yeasty></molt></autocollimate>
> > </acrobat></harvest></gage></flagrant></fumble></nowadays>
> > </BODY>
> > </HTML>
This does catch them:
full __HTML_BAYES_POISON /(\<\/[a-zA-Z]{3,}\>[\s]*){20,}/
meta BDS_HTML_BAYES_POISON (HTML_MESSAGE && __HTML_BAYES_POISON)
describe BDS_HTML_BAYES_POISON Multiple garbage HTML close tags
score BDS_HTML_BAYES_POISON 2.0
You might get false positives if you regularly receive HTML e-mail with
real runs of close tags like (/b)(/font)(/tr)(/td)(/table), but I don't
think that will be common, at least not 20 tags in a row. To reduce
false positives, the rule only counts tags of three or more letters, so
(/b) and (/tr) don't count and will break up a run. I've used 'full'
instead of 'rawbody' so I don't have a problem with \r\n.
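If you want to try the pattern outside SpamAssassin first, here is a rough Python sketch. It only exercises the regular expression itself, not the 'full' rule type or the meta rule, and the function name looks_poisoned and the test strings are made up for illustration.

```python
import re

# Same pattern as the __HTML_BAYES_POISON rule, in Python syntax:
# 20 or more consecutive close tags, each with a name of 3+ letters,
# optionally separated by whitespace (including \r\n).
POISON = re.compile(r"(</[a-zA-Z]{3,}>\s*){20,}")


def looks_poisoned(text):
    """Return True if text contains a run of 20+ qualifying close tags."""
    return POISON.search(text) is not None


# The sample from the original question: 18 bogus tags plus
# </table>, </BODY> and </HTML>, 21 close tags in a row.
spam_tail = (
    "</table>\r\n"
    "</belch></huntsville></delusion></wilma></boswell></attune>\r\n"
    "</vasectomy></centum></surf></yeasty></molt></autocollimate>\r\n"
    "</acrobat></harvest></gage></flagrant></fumble></nowadays>\r\n"
    "</BODY>\r\n"
    "</HTML>\r\n"
)

# Ordinary markup: short tags such as </b>, </td> and </tr> are not
# counted, so they interrupt the run long before it reaches 20.
normal_tail = "</b></font></td></tr></table>" * 5

print(looks_poisoned(spam_tail))
print(looks_poisoned(normal_tail))
```

The first call should report a match and the second should not, since no run in the second string gets anywhere near 20 tags.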
The ideal solution is for SpamAssassin, or a plugin, to search for
unmatched HTML tags, in HTML MIME parts, except for (p) and (br), and
score 0.1 for each unmatched tag. In the absence of perfection, try
this.
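For what it's worth, the unmatched-tag idea could look roughly like the Python sketch below. This is not a SpamAssassin plugin and not anything SpamAssassin ships. It uses Python's standard html.parser, and it assumes "unmatched" means a close tag with no earlier open tag of the same name (open tags that are never closed are not counted). The names UnmatchedCloseCounter and unmatched_score are hypothetical.

```python
from html.parser import HTMLParser

EXEMPT = {"p", "br"}    # tags the idea above says to ignore
SCORE_PER_TAG = 0.1     # per-tag score suggested above


class UnmatchedCloseCounter(HTMLParser):
    """Count close tags that have no matching open tag before them."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # stack of currently open tag names
        self.unmatched = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser already lower-cases tag names.
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Pop back to (and including) the most recent matching open
            # tag, silently dropping anything left unclosed inside it.
            while self.open_tags.pop() != tag:
                pass
        elif tag not in EXEMPT:
            self.unmatched += 1


def unmatched_score(html):
    """Return 0.1 for every unmatched, non-exempt close tag in html."""
    parser = UnmatchedCloseCounter()
    parser.feed(html)
    parser.close()
    return parser.unmatched * SCORE_PER_TAG


sample = "<html><body><table></table></belch></wilma></surf></p></body></html>"
print(unmatched_score(sample))
```

In the sample, (/belch), (/wilma) and (/surf) have no open tag and are counted, while the stray (/p) is exempt. A real implementation would also have to restrict itself to the HTML MIME parts, which this sketch does not attempt.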
--
Berend De Schouwer