Try these. They're not the prettiest, but they at least work most of the time :)

#This takes care of tags that don't exist such as <zebra>
#The last / is in there so it doesn't freak out about closing tags.
#<KBD> is a valid tag, but I don't believe we'll see it all that much in email.
#Added in to not pickup <[EMAIL PROTECTED]>
rawbody __MK_BAD_HTML_08 /<[^abcdefhilmopstuv\/[EMAIL PROTECTED],80}>/i

#This takes care of closing tags that don't exist such as </zebra>
rawbody __MK_BAD_HTML_09 /<\/[^abcdefhilmopstuv]/i

#Added in ? due to MS <?xml:blahblah> tag
rawbody __MK_GOOD_HTML_01 /<\??xml/i

#Another MS Office casualty
rawbody __MK_GOOD_HTML_02 /<\/xml>/i

meta MK_BAD_HTML_11 HTML_MESSAGE && __MK_BAD_HTML_08 && !__MK_GOOD_HTML_01
describe MK_BAD_HTML_11 Bad HTML form. HTML beginning tag that does not exist used.
score MK_BAD_HTML_11 0.8

meta MK_BAD_HTML_12 HTML_MESSAGE && __MK_BAD_HTML_09 && !__MK_GOOD_HTML_02
describe MK_BAD_HTML_12 Bad HTML form. HTML closing tag that does not exist used.
score MK_BAD_HTML_12 0.8

Mike

> -----Original Message-----
> From: Loren Wilton [mailto:[EMAIL PROTECTED]
> Sent: Monday, March 01, 2004 2:56 PM
> To: [EMAIL PROTECTED]
> Subject: Re: Trying to detect bogus end tags
>
> I've finally got a strange regexp that *almost* works. I can see it
> getting the right results internally, but it never seems to return true
> or whatever a regexp returns on success. At this point I have no idea
> if I'm dealing with bugs, features, or a strong misunderstanding of the
> language.
>
> If anyone wants to try to figure out what is wrong with it (or can see
> it at a glance) here is the ugly little thing:
>
> full BAD__HTML m|<BODY>(.*)</BODY>(??{ (scalar reverse $1) =~
> />([a-z]+)\/<(?!.*[\s>]\1<)/is; })|s
>
> Loren
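
For anyone who wants to poke at the closing-tag rules from the reply outside SpamAssassin, here is a rough Python sketch. The two patterns are copied from __MK_BAD_HTML_09 and __MK_GOOD_HTML_02; the function name bad_html_12 and the html_message flag are made up for illustration, and the flag only stands in for HTML_MESSAGE. It is a simplification: it searches the whole body in one go and does not model how SpamAssassin feeds rawbody rules their text.

```python
import re

# Patterns copied from __MK_BAD_HTML_09 and __MK_GOOD_HTML_02.
# The /i flag in the rules becomes re.IGNORECASE here.
BAD_CLOSE = re.compile(r"<\/[^abcdefhilmopstuv]", re.IGNORECASE)
GOOD_XML_CLOSE = re.compile(r"<\/xml>", re.IGNORECASE)


def bad_html_12(body: str, html_message: bool = True) -> bool:
    """Hypothetical stand-in for the MK_BAD_HTML_12 meta rule.

    html_message plays the role of HTML_MESSAGE (an assumption; the real
    rule is evaluated by SpamAssassin, not by this function).
    """
    return (
        html_message
        and BAD_CLOSE.search(body) is not None
        and GOOD_XML_CLOSE.search(body) is None
    )


if __name__ == "__main__":
    for sample in ("<p>hi</zebra>", "<p>hi</p></BODY>", "<xml>x</xml></zebra>"):
        print(repr(sample), bad_html_12(sample))
```

Note that the </xml> exemption is all-or-nothing in this sketch, just as in the meta rule: a single </xml> anywhere in the body switches the check off, even if a bogus closing tag is also present.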
