On Saturday 28 February 2004 05:46, Loren Wilton wrote:
> I'm trying to come up with a way to detect bogus end tags, and so far
> I'm not having much luck.
>
> What I'm specifically trying to catch are things like
>
> </table>
> </belch></huntsville></delusion></wilma></boswell></attune>
> </vasectomy></centum></surf></yeasty></molt></autocollimate>
> </acrobat></harvest></gage></flagrant></fumble></nowadays>
> </BODY>
> </HTML>
>
> Now, it looks like there is an html_tag_balance eval that would catch
> the fact that there is no "<belch" to match the "</belch>" in the
> above hunk of spam, if only there were some way that I could feed
> "belch" into the eval. I can detect end tags eash enough with a
> regexp, but I can't find any way that works to pull the found tag out
> and feed it to the eval routine within an SA rule definition.
I see this too. I don't know the proper answer, but you could just look
for 20 or more close tags in a row:

rawbody TOO_MANY_CLOSE_TAGS /(\<\/[^\>]+\>){20,}/i

I don't think this will be very effective in the long term, but it'll catch
some spams.
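
If you want to see what that pattern does and doesn't hit, here's a rough
standalone check in plain Perl (not a real SpamAssassin test; the tag names
are made up):

#!/usr/bin/perl
# Rough check of the pattern above, outside SA.  The tag names are made up.
use strict;
use warnings;

my $pattern = qr/(\<\/[^\>]+\>){20,}/i;

# twenty close tags back to back: matches
my $run = join '', map { "</tag$_>" } 1 .. 20;
print "back to back: ", ($run =~ $pattern ? "hit" : "miss"), "\n";

# the same tags one per line (closer to the spam above): no match,
# because the pattern needs them with nothing at all in between
my $spread = join "\n", map { "</tag$_>" } 1 .. 20;
print "one per line: ", ($spread =~ $pattern ? "hit" : "miss"), "\n";

So it only fires when the close tags are really crammed together with
nothing in between; the sample above, with half a dozen tags per line,
would slip past unless you lower the count or allow whitespace between
the tags.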
> Alternately, is there a way to write a regexp that will let me look
> backward for <belch once I have found </belch>? I can't seem to
> figure this one out either.
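I couldn't come up with a plain rule regexp for that either; looking
backwards would want a variable-length lookbehind, which Perl doesn't do.
In plain Perl you can just walk the tags in order and remember which ones
were opened. A rough sketch (the helper name and the sample string are
made up, and wiring it into SA would still need an eval test or plugin):

#!/usr/bin/perl
# Sketch only: report close tags whose opening tag never appeared
# earlier in the body.  The helper name and sample string are made up.
use strict;
use warnings;

sub unmatched_close_tags {
    my ($body) = @_;
    my %seen_open;
    my @bogus;
    # walk every tag in order; [^>]* skips any attributes
    while ($body =~ /<(\/?)([a-z][a-z0-9]*)[^>]*>/gi) {
        my ($slash, $tag) = ($1, lc $2);
        if ($slash) {
            push @bogus, $tag unless $seen_open{$tag};
        } else {
            $seen_open{$tag} = 1;
        }
    }
    return @bogus;
}

my $sample = '<html><body><table></table></belch></wilma></body></html>';
print join(' ', unmatched_close_tags($sample)), "\n";   # prints: belch wilma

A real version would probably want to ignore comments and be forgiving
about sloppy but legitimate HTML, but for word salad like the sample above
it's probably close enough.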
>
> Thanks,
> Loren
--
Berend De Schouwer