On Saturday 28 February 2004 05:46, Loren Wilton wrote:
> I'm trying to come up with a way to detect bogus end tags, and so far
> I'm not having much luck.
>
> What I'm specifically trying to catch are things like
>
> </table>
> </belch></huntsville></delusion></wilma></boswell></attune>
> </vasectomy></centum></surf></yeasty></molt></autocollimate>
> </acrobat></harvest></gage></flagrant></fumble></nowadays>
> </BODY>
> </HTML>
>
> Now, it looks like there is an html_tag_balance eval that would catch
> the fact that there is no "<belch" to match the "</belch>" in the
> above hunk of spam, if only there were some way that I could feed
> "belch" into the eval. I can detect end tags eash enough with a
> regexp, but I can't find any way that works to pull the found tag out
> and feed it to the eval routine within an SA rule definition.
I see this too. I don't know the proper answer, but you could just look
for 20 or more close tags in a row:

rawbody TOO_MANY_CLOSE_TAGS /(\<\/[^\>]+\>){20,}/i

I don't think this will be very effective in the long term, but it'll catch
some spams.
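
If you want to see what that pattern does and doesn't hit, here's a rough
standalone check in plain Perl (not a real SpamAssassin test; the tag names
are made up):

#!/usr/bin/perl
# Rough check of the pattern above, outside SA.  The tag names are made up.
use strict;
use warnings;

my $pattern = qr/(\<\/[^\>]+\>){20,}/i;

# twenty close tags back to back: matches
my $run = join '', map { "</tag$_>" } 1 .. 20;
print "back to back: ", ($run =~ $pattern ? "hit" : "miss"), "\n";

# the same tags one per line (closer to the spam above): no match,
# because the pattern needs them with nothing at all in between
my $spread = join "\n", map { "</tag$_>" } 1 .. 20;
print "one per line: ", ($spread =~ $pattern ? "hit" : "miss"), "\n";

So it only fires when the close tags are really crammed together with
nothing in between; the sample above, with half a dozen tags per line,
would slip past unless you lower the count or allow whitespace between
the tags.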
> Alternately, is there a way to write a regexp that will let me look
> backward for <belch once I have found </belch>? I can't seem to
> figure this one out either.
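I couldn't come up with a plain rule regexp for that either; looking
backwards would want a variable-length lookbehind, which Perl doesn't do.
In plain Perl you can just walk the tags in order and remember which ones
were opened. A rough sketch (the helper name and the sample string are
made up, and wiring it into SA would still need an eval test or plugin):

#!/usr/bin/perl
# Sketch only: report close tags whose opening tag never appeared
# earlier in the body.  The helper name and sample string are made up.
use strict;
use warnings;

sub unmatched_close_tags {
    my ($body) = @_;
    my %seen_open;
    my @bogus;
    # walk every tag in order; [^>]* skips any attributes
    while ($body =~ /<(\/?)([a-z][a-z0-9]*)[^>]*>/gi) {
        my ($slash, $tag) = ($1, lc $2);
        if ($slash) {
            push @bogus, $tag unless $seen_open{$tag};
        } else {
            $seen_open{$tag} = 1;
        }
    }
    return @bogus;
}

my $sample = '<html><body><table></table></belch></wilma></body></html>';
print join(' ', unmatched_close_tags($sample)), "\n";   # prints: belch wilma

A real version would probably want to ignore comments and be forgiving
about sloppy but legitimate HTML, but for word salad like the sample above
it's probably close enough.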
>
> Thanks,
> Loren
--
Berend De Schouwer