Re: [Bug 1987] Rule for detecting non-HTML tags

Daniel Quinlan 5 Feb 2004 23:28:40 -0000

> Methods:
> -  Check for bad tags <viagra> and count them.

HEAD already does this.


> -  Check for wacky parameters to tags.  Count them

This is a lot more work; we don't do this yet.

> -  Note the incorrect use of quotes for tags

Ditto.

> -  Unclosed tags

We already have rules for this; they're called HTML_TAG_BALANCE_*

I still think the tidy experiment would be immensely valuable if someone
has the initiative to do it.

1. jam tidy (even a shell out) into the HEAD parser and run it on
   decoded HTML sections (right where HTML::Parser is run right now)

2. save output from tidy on a recent corpus of spam and ham (ham
   including HTML, of course) in two files: spam.err and ham.err

3. sort /tmp/spam.err | uniq -c | sort -nr > spam.count
   sort /tmp/ham.err | uniq -c | sort -nr > ham.count

4. run "masses/freqdiff /tmp/spam.err /tmp/ham.err" to identify possible
   high probability rules

5. Any line with > 0.99 probability that appears more than once,
   we try to turn into a rule

The .err files might require some massaging to remove unique details,
such as unknown tag names, that would otherwise ruin the counts.
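
A rough sketch of the massaging, counting and filtering steps, in Python
rather than the shell pipeline above. Several things here are assumptions:
the diagnostic format ("line N column M - Warning: ..."), the <TAG>
placeholder, and the function names are all hypothetical, and the
probability is a plain normalized spam/(spam+ham) ratio standing in for
whatever masses/freqdiff actually computes.

```python
import re
from collections import Counter

# Assumed diagnostic shape: "line 12 column 3 - Warning: <viagra> is not recognized!"
POSITION = re.compile(r"^line \d+ column \d+ - ")
UNKNOWN_TAG = re.compile(r"<[^<>\s]+> is not recognized")


def normalize(line):
    """Drop the position prefix and collapse unknown tag names to <TAG>."""
    line = POSITION.sub("", line.strip())
    return UNKNOWN_TAG.sub("<TAG> is not recognized", line)


def count_lines(lines):
    """Count normalized diagnostics, roughly what 'sort | uniq -c' gives."""
    return Counter(normalize(line) for line in lines if line.strip())


def candidates(spam_lines, ham_lines, min_prob=0.99, min_hits=2):
    """Return (probability, spam_hits, message) tuples worth turning into rules.

    The probability is a simple stand-in, not freqdiff's real formula:
    the spam frequency divided by the sum of spam and ham frequencies.
    """
    spam = count_lines(spam_lines)
    ham = count_lines(ham_lines)
    spam_total = sum(spam.values()) or 1
    ham_total = sum(ham.values()) or 1

    found = []
    for message, hits in spam.items():
        if hits < min_hits:  # must appear more than once
            continue
        spam_freq = hits / spam_total
        ham_freq = ham[message] / ham_total
        prob = spam_freq / (spam_freq + ham_freq)
        if prob > min_prob:
            found.append((prob, hits, message))
    return sorted(found, reverse=True)
```

Feeding it the lines of spam.err and ham.err (for example via
open("spam.err").readlines()) would give a short list of diagnostics to
review by hand before writing any rule.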

Daniel
