-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Daniel Quinlan writes:
> These results are for the HTML_MESSAGE messages in our corpus.
>
> OVERALL%     SPAM%      HAM%    S/O  RANK  SCORE  NAME
>   186686    182745      3941  0.979  0.00   0.00  (all messages)
>  100.000   97.8890    2.1110  0.979  0.00   0.00  (all messages as %)
>
> Anything with an S/O below 0.500 is hitting on more HTML ham than HTML
> spam.  Since most HTML messages are spam, these rules do have good
> overall S/O ratios, but I don't think they add too much to our accuracy.
> They generally have lower scores (average of .68 score for rules better
> than HTML_MESSAGE and .29 score for rules less than HTML_MESSAGE).
>
> First, the color rules don't seem very effective:
>
>    5.869    5.9400    2.5628  0.699  0.33   0.00  HTML_COLOR_MAGENTA
>    5.801    5.8612    3.0195  0.660  0.28   0.06  HTML_COLOR_GREEN
>  100.000  100.0000  100.0000  0.500  0.26   0.16  HTML_MESSAGE
>   22.040   22.1527   16.8231  0.568  0.20   0.10  HTML_COLOR_RED
>    4.548    4.5380    5.0241  0.475  0.11   0.10  HTML_COLOR_UNKNOWN
>    5.407    5.3791    6.6988  0.445  0.09   0.00  HTML_COLOR_CYAN
>   10.938   10.8446   15.2753  0.415  0.08   0.00  HTML_COLOR_YELLOW
>   18.396   18.1581   29.4342  0.382  0.08   0.10  HTML_COLOR_UNSAFE
>   18.562   18.2894   31.1850  0.370  0.07   0.10  HTML_COLOR_BLUE
>   13.377   13.1938   21.8726  0.376  0.07   0.00  HTML_COLOR_GRAY
>
> Second, the image area rules seem even less effective:
>
>    0.260    0.2632    0.1015  0.722  0.35   0.00  HTML_IMAGE_AREA_06
>  100.000  100.0000  100.0000  0.500  0.26   0.16  HTML_MESSAGE
>    0.366    0.3672    0.3299  0.527  0.14   0.28  HTML_IMAGE_AREA_05
>    0.524    0.5248    0.5075  0.508  0.12   0.00  HTML_IMAGE_AREA_04
>    0.016    0.0159    0.0254  0.385  0.05   0.00  HTML_IMAGE_AREA_08
>    0.047    0.0449    0.1522  0.228  0.01   1.61  HTML_IMAGE_AREA_07
>    0.081    0.0717    0.5075  0.124  0.00   0.00  HTML_IMAGE_AREA_09
>
> So, it looks like spammers stopped adding size tags to images.

agreed.

> The image ratio and image only rules that don't rely on size tags seem
> to be working better:
>
>    6.636    6.7788    0.0254  0.996  0.94   2.75  HTML_IMAGE_ONLY_04
>    4.399    4.4904    0.1776  0.962  0.84   1.90  HTML_IMAGE_ONLY_08
>    2.540    2.5899    0.2284  0.919  0.73   0.53  HTML_IMAGE_ONLY_16
>    3.083    3.1399    0.4567  0.873  0.63   1.53  HTML_IMAGE_ONLY_12
>    5.176    5.2664    0.9642  0.845  0.57   0.82  HTML_IMAGE_RATIO_04
>    7.104    7.2215    1.6493  0.814  0.52   0.00  HTML_IMAGE_RATIO_02
>    3.126    3.1689    1.1165  0.739  0.38   0.79  HTML_IMAGE_ONLY_24
>    2.293    2.3174    1.1672  0.665  0.28   0.61  HTML_IMAGE_ONLY_20
>    2.140    2.1631    1.0911  0.665  0.28   0.94  HTML_IMAGE_RATIO_06
>  100.000  100.0000  100.0000  0.500  0.26   0.16  HTML_MESSAGE
>    2.669    2.6884    1.7762  0.602  0.21   0.60  HTML_IMAGE_RATIO_08
>    0.564    0.5324    2.0046  0.210  0.01   0.32  HTML_IMAGE_RATIO_12
>    0.666    0.6162    2.9688  0.172  0.01   0.00  HTML_IMAGE_RATIO_14
>    0.267    0.2375    1.6240  0.128  0.00   0.54  HTML_IMAGE_RATIO_10
>
> Any thoughts or objections to removing them?

nope, sounds good to me... those colour tags have always bothered
me anyway ;)

- --j.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFAe0rHQTcbUG5Y7woRAkfzAJ9kie5sAlG/V1en+5ao9gjmFB9BbgCfT7dI
IImHNTsWxson+OtmhAorzb4=
=CiiW
-----END PGP SIGNATURE-----
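
For anyone re-checking the numbers: the S/O column in the quoted tables is
consistent with SPAM% / (SPAM% + HAM%), which is why HTML_MESSAGE itself
sits at 0.500 and anything below that hits a larger share of HTML ham than
of HTML spam. The thread does not state the formula, so treat it as an
assumption that happens to match the rows. The sketch below is only
illustrative; the names so_ratio and below_baseline are made up here, and
the sample rows are copied from the tables in this message.

```python
def so_ratio(spam_pct: float, ham_pct: float) -> float:
    """Fraction of a rule's hit rate that falls on spam.

    Assumed formula: SPAM% / (SPAM% + HAM%). It reproduces the S/O
    column in the quoted tables but is not spelled out in the thread.
    """
    total = spam_pct + ham_pct
    if total == 0:
        raise ValueError("rule hit no messages")
    return spam_pct / total


# (SPAM%, HAM%, rule name), copied from the quoted tables.
ROWS = [
    (5.9400, 2.5628, "HTML_COLOR_MAGENTA"),
    (18.2894, 31.1850, "HTML_COLOR_BLUE"),
    (0.0717, 0.5075, "HTML_IMAGE_AREA_09"),
    (6.7788, 0.0254, "HTML_IMAGE_ONLY_04"),
]


def below_baseline(rows, baseline=0.5):
    """Names of rules whose S/O is under the HTML_MESSAGE baseline."""
    return [name for spam, ham, name in rows if so_ratio(spam, ham) < baseline]


if __name__ == "__main__":
    for spam, ham, name in ROWS:
        print(f"{so_ratio(spam, ham):.3f}  {name}")
    print("below 0.500:", below_baseline(ROWS))
```

Working from the percentage columns rather than raw counts keeps the
comparison independent of how lopsided the corpus is (about 97.9% spam
here).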
