Try these. They're not the prettiest, but they at least work most of the
time :)

#This takes care of tags that don't exist such as <zebra>
#The last / is in there so it doesn't freak out about closing tags.
#<KBD> is a valid tag, but I don't believe we'll see it all that much in email.
#Added in to not pickup <[EMAIL PROTECTED]>
rawbody         __MK_BAD_HTML_08                /<[^abcdefhilmopstuv\/[EMAIL PROTECTED],80}>/i

#This takes care of closing tags that don't exist such as </zebra>
rawbody         __MK_BAD_HTML_09                /<\/[^abcdefhilmopstuv]/i

#Added in ? due to MS <?xml:blahblah> tag
rawbody         __MK_GOOD_HTML_01               /<\??xml/i
#Another MS Office casualty
rawbody         __MK_GOOD_HTML_02               /<\/xml>/i


meta            MK_BAD_HTML_11          HTML_MESSAGE && __MK_BAD_HTML_08 && !__MK_GOOD_HTML_01
describe        MK_BAD_HTML_11          Bad HTML form.  HTML beginning tag that does not exist used.
score           MK_BAD_HTML_11          0.8

meta            MK_BAD_HTML_12          HTML_MESSAGE && __MK_BAD_HTML_09 && !__MK_GOOD_HTML_02
describe        MK_BAD_HTML_12          Bad HTML form.  HTML closing tag that does not exist used.
score           MK_BAD_HTML_12          0.8
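
If you want to poke at the closing-tag rule outside SpamAssassin, here is a
rough Python sketch of the same logic. The function name bad_html_12 and the
html_message flag are made up for illustration, and it treats the whole body
as one string, so it ignores how SpamAssassin actually feeds rawbody rules.

```python
import re

# Same patterns as __MK_BAD_HTML_09 and __MK_GOOD_HTML_02; the /i flag maps to re.I.
BAD_CLOSE = re.compile(r"</[^abcdefhilmopstuv]", re.I)
GOOD_XML_CLOSE = re.compile(r"</xml>", re.I)


def bad_html_12(raw_body, html_message=True):
    """Approximates: HTML_MESSAGE && __MK_BAD_HTML_09 && !__MK_GOOD_HTML_02.

    Hypothetical helper: html_message stands in for the HTML_MESSAGE test.
    """
    return (
        html_message
        and BAD_CLOSE.search(raw_body) is not None
        and GOOD_XML_CLOSE.search(raw_body) is None
    )


if __name__ == "__main__":
    for sample in ["<p>hi</zebra>", "<p>hi</p>", "<xml>x</xml></zebra>", "</KBD>"]:
        print(repr(sample), bad_html_12(sample))
```

Two things this makes visible: </KBD> gets flagged because "k" is not in the
character class, and any </xml> in the message switches the rule off even if
a bogus closing tag is also present.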


Mike



> -----Original Message-----
> From: Loren Wilton [mailto:[EMAIL PROTECTED] 
> Sent: Monday, March 01, 2004 2:56 PM
> To: [EMAIL PROTECTED]
> Subject: Re: Trying to detect bogus end tags
> 
> 
> I've finally got a strange regexp that *almost* works.  I can see it
> getting the right results internally, but it never seems to return true
> or whatever a regexp returns on success.  At this point I have no idea
> if I'm dealing with bugs, features, or a strong misunderstanding of the
> language.
> 
> If anyone wants to try to figure out what is wrong with it (or can see
> it at a glance) here is the ugly little thing:
> 
> full  BAD__HTML   m|<BODY>(.*)</BODY>(??{ (scalar reverse $1) =~ />([a-z]+)\/<(?!.*[\s>]\1<)/is; })|s
> 
>         Loren
> 
> 
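
For comparison, here is a rough Python sketch of what the quoted pattern
appears to be after: a closing tag whose name was never opened earlier in the
body. This is only a guess at the intent, not a fix for the regexp itself.
The function name bogus_end_tags and the tag-name pattern (letters followed
by letters or digits) are assumptions for illustration.

```python
import re

# Opening or closing tag start, e.g. "<b", "</b", "<h1"; the tag name must be
# followed by whitespace or ">" (checked with a lookahead).
TAG = re.compile(r"<(/?)([a-z][a-z0-9]*)(?=[\s>])", re.I)


def bogus_end_tags(body):
    """Return names of closing tags that have no earlier opening tag."""
    opened = set()
    bogus = []
    for match in TAG.finditer(body):
        is_closing = bool(match.group(1))
        name = match.group(2).lower()
        if is_closing:
            if name not in opened:
                bogus.append(name)
        else:
            opened.add(name)
    return bogus


if __name__ == "__main__":
    print(bogus_end_tags("<body><b>hi</b></zebra></body>"))
```

A single forward pass with a set of seen names avoids reversing the string
and the negative lookahead entirely.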
