http://bugzilla.spamassassin.org/show_bug.cgi?id=1987
------- Additional Comments From [EMAIL PROTECTED]  2004-02-05 15:28 -------
Subject: Re: Rule for detecting non-HTML tags

> Methods:
> - Check for bad tags <viagra> and count them.

HEAD already does this.

> - Check for wacky parameters to tags. Count them

This is a lot more work; we don't do this yet.

> - Note the incorrect use of quotes for tags

Ditto.

> - Unclosed tags

We already have rules for this; they're called HTML_TAG_BALANCE_*

I still think the tidy experiment would be immensely valuable if someone
has the initiative to do it.

1. jam tidy (even a shell out) into the HEAD parser and run it on decoded
   HTML sections (right where HTML::Parser is run right now)
2. save output from tidy on a recent corpus of spam and ham (ham including
   HTML, of course) in two files: spam.err and ham.err
3. sort /tmp/spam.err | uniq -c | sort -nr > spam.count
   sort /tmp/ham.err | uniq -c | sort -nr > ham.count
4. run "masses/freqdiff /tmp/spam.err /tmp/ham.err" to identify possible
   high probability rules
5. Any line with > 0.99 probability that appears more than once, we try
   to turn into a rule

The .err files might require some massaging to remove unique stuff that
would ruin the counts, like unknown tag names.

Daniel

------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
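
For anyone picking this up, steps 3 to 5 plus the "massaging" could be roughed out as below. This is only a sketch under stated assumptions, not what masses/freqdiff actually computes: the position prefix ("line N column M - ") and the "is not recognized" wording are assumed shapes for tidy's messages, the probability is a plain ratio of per-corpus frequencies, and the names normalize, count_messages and candidate_rules are made up for illustration.

```python
import re
from collections import Counter

# Assumed shape of tidy's position prefix, e.g. "line 12 column 5 - ".
POSITION = re.compile(r"^line \d+ column \d+ - ")
# Unknown tag names differ from message to message, so collapse them
# to a placeholder before counting (the "massaging" step).
UNKNOWN_TAG = re.compile(r"<[^>]+> is not recognized")


def normalize(line):
    """Strip the position prefix and the unique unknown-tag name."""
    line = POSITION.sub("", line.strip())
    return UNKNOWN_TAG.sub("<TAG> is not recognized", line)


def count_messages(lines):
    """Equivalent of `sort | uniq -c` on the normalized lines."""
    return Counter(normalize(l) for l in lines if l.strip())


def candidate_rules(spam_lines, ham_lines, min_prob=0.99, min_count=2):
    """Return (message, spam_count, probability) for likely rule candidates.

    The probability is a simple frequency ratio chosen for this sketch;
    it is an assumption, not the formula used by masses/freqdiff.
    """
    spam, ham = count_messages(spam_lines), count_messages(ham_lines)
    spam_total = sum(spam.values()) or 1
    ham_total = sum(ham.values()) or 1
    results = []
    for msg, s in spam.items():
        spam_rate = s / spam_total
        ham_rate = ham[msg] / ham_total  # Counter gives 0 for unseen messages
        prob = spam_rate / (spam_rate + ham_rate)
        # "> 0.99 probability and appears more than once"
        if prob > min_prob and s >= min_count:
            results.append((msg, s, prob))
    # Most frequent first, then alphabetical for a stable order.
    return sorted(results, key=lambda r: (-r[1], r[0]))
```

Feeding it the lines of spam.err and ham.err (for example via open("/tmp/spam.err").read().splitlines()) would give a short list of messages to inspect by hand before writing any rule.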
