http://bugzilla.spamassassin.org/show_bug.cgi?id=1987
------- Additional Comments From [EMAIL PROTECTED] 2004-02-05 13:21 -------

Rather than reinventing the wheel, how about using the Tidy project to check the HTML portion of an e-mail? I know that the project has a Perl module in addition to a library and an executable.

As a test, I extracted the HTML portion of a spam mail and ran the tidy binary against it. I ended up getting 42 warnings (see below). Perhaps the warning count could be multiplied by some value and the result used as the score for this test. If you wanted to go a step further, you could parse the warning log and give a score to each entry. It appears the "discarding unexpected" entries should be given more weight.

line 1 column 1 - Warning: SYSTEM, PUBLIC, W3C, DTD, EN must be upper case
line 6 column 1 - Warning: <meta> unexpected or duplicate quote mark
line 6 column 1 - Warning: <meta> attribute with missing trailing quote mark
line 6 column 1 - Warning: <meta> unexpected or duplicate quote mark
line 6 column 1 - Warning: unknown attribute "text/html;"
line 6 column 1 - Warning: <meta> attribute with missing trailing quote mark
line 8 column 1 - Warning: <style> unexpected or duplicate quote mark
line 8 column 1 - Warning: <style> attribute with missing trailing quote mark
line 16 column 1 - Warning: <table> unexpected or duplicate quote mark
line 16 column 1 - Warning: <table> attribute with missing trailing quote mark
line 16 column 1 - Warning: <table> unexpected or duplicate quote mark
line 16 column 1 - Warning: <table> attribute with missing trailing quote mark
line 16 column 1 - Warning: <table> unexpected or duplicate quote mark
line 16 column 1 - Warning: <table> attribute with missing trailing quote mark
line 16 column 1 - Warning: <table> unexpected or duplicate quote mark
line 16 column 1 - Warning: <table> attribute with missing trailing quote mark
line 16 column 1 - Warning: <table> attribute "cellpadding" has invalid value "3D"
line 16 column 1 - Warning: <table> attribute "cellspacing" has invalid value "3D"
line 16 column 1 - Warning: <table> attribute "width" has invalid value "3D"
line 16 column 1 - Warning: <table> lacks "summary" attribute
line 19 column 25 - Warning: <div> unexpected or duplicate quote mark
line 19 column 25 - Warning: <div> attribute with missing trailing quote mark
line 32 column 26 - Warning: <a> unexpected or duplicate quote mark
line 32 column 26 - Warning: <a> attribute with missing trailing quote mark
line 38 column 1 - Warning: discarding unexpected </earthmoving>
line 38 column 15 - Warning: discarding unexpected </pomegranate>
line 38 column 29 - Warning: discarding unexpected </intimacy>
line 38 column 40 - Warning: discarding unexpected </mightn>
line 38 column 51 - Warning: discarding unexpected </coherent>
line 38 column 62 - Warning: discarding unexpected </curse>
line 39 column 1 - Warning: discarding unexpected </guilford>
line 39 column 12 - Warning: discarding unexpected </civet>
line 39 column 20 - Warning: discarding unexpected </suffragette>
line 39 column 34 - Warning: discarding unexpected </certify>
line 39 column 44 - Warning: discarding unexpected </buyer>
line 39 column 52 - Warning: discarding unexpected </czarina>
line 40 column 1 - Warning: discarding unexpected </alongside>
line 40 column 13 - Warning: discarding unexpected </bromide>
line 40 column 23 - Warning: discarding unexpected </gully>
line 40 column 31 - Warning: discarding unexpected </buff>
line 40 column 38 - Warning: discarding unexpected </waive>
line 40 column 46 - Warning: discarding unexpected </wander>
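
For what it's worth, here is a rough sketch of the per-entry scoring idea. It is written in Python purely for illustration (an actual SpamAssassin test would presumably be in Perl), and it only parses log text in the "line N column M - Warning: ..." form shown in this comment; it does not run tidy itself. The function names and both weights (BASE_WEIGHT, DISCARD_WEIGHT) are made up for the example and are not tuned against any corpus.

```python
import re

# Matches one entry of the warning log format quoted in this comment:
#   line <N> column <M> - Warning: <message>
WARNING_RE = re.compile(r"^line (\d+) column (\d+) - Warning: (.*)$")

# Hypothetical weights, chosen only for illustration.
BASE_WEIGHT = 0.1      # any ordinary warning
DISCARD_WEIGHT = 0.5   # "discarding unexpected" entries count for more


def parse_warnings(log_text):
    """Return a list of (line, column, message) tuples, one per warning."""
    entries = []
    for raw in log_text.splitlines():
        match = WARNING_RE.match(raw.strip())
        if match:
            entries.append(
                (int(match.group(1)), int(match.group(2)), match.group(3))
            )
    return entries


def score_warnings(entries):
    """Sum a weight per warning, weighting discarded bogus tags higher."""
    total = 0.0
    for _line, _column, message in entries:
        if message.startswith("discarding unexpected"):
            total += DISCARD_WEIGHT
        else:
            total += BASE_WEIGHT
    return total


# A few entries copied from the log above, used as sample input.
SAMPLE_LOG = """\
line 1 column 1 - Warning: SYSTEM, PUBLIC, W3C, DTD, EN must be upper case
line 16 column 1 - Warning: <table> lacks "summary" attribute
line 38 column 1 - Warning: discarding unexpected </earthmoving>
line 38 column 15 - Warning: discarding unexpected </pomegranate>
"""

if __name__ == "__main__":
    sample_entries = parse_warnings(SAMPLE_LOG)
    print(len(sample_entries), "warnings, score", score_warnings(sample_entries))
```

The simpler variant from the comment (warning count times a constant) falls out of the same code by setting both weights to the same value.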
