These results are for the HTML_MESSAGE messages in our corpus.

OVERALL%     SPAM%      HAM%     S/O   RANK  SCORE  NAME
  186686    182745      3941   0.979   0.00   0.00  (all messages)
 100.000   97.8890    2.1110   0.979   0.00   0.00  (all messages as %)

Anything with an S/O below 0.500 is hitting on more HTML ham than HTML
spam.  Since most HTML messages are spam, these rules do have good
overall S/O ratios, but I don't think they add much to our accuracy.
They generally have low scores (an average score of 0.68 for rules
better than HTML_MESSAGE and 0.29 for rules worse than HTML_MESSAGE).

First, the color rules don't seem very effective:

   5.869    5.9400    2.5628   0.699   0.33   0.00  HTML_COLOR_MAGENTA
   5.801    5.8612    3.0195   0.660   0.28   0.06  HTML_COLOR_GREEN
 100.000  100.0000  100.0000   0.500   0.26   0.16  HTML_MESSAGE
  22.040   22.1527   16.8231   0.568   0.20   0.10  HTML_COLOR_RED
   4.548    4.5380    5.0241   0.475   0.11   0.10  HTML_COLOR_UNKNOWN
   5.407    5.3791    6.6988   0.445   0.09   0.00  HTML_COLOR_CYAN
  10.938   10.8446   15.2753   0.415   0.08   0.00  HTML_COLOR_YELLOW
  18.396   18.1581   29.4342   0.382   0.08   0.10  HTML_COLOR_UNSAFE
  18.562   18.2894   31.1850   0.370   0.07   0.10  HTML_COLOR_BLUE
  13.377   13.1938   21.8726   0.376   0.07   0.00  HTML_COLOR_GRAY

Second, the image area rules seem even less effective:

   0.260    0.2632    0.1015   0.722   0.35   0.00  HTML_IMAGE_AREA_06
 100.000  100.0000  100.0000   0.500   0.26   0.16  HTML_MESSAGE
   0.366    0.3672    0.3299   0.527   0.14   0.28  HTML_IMAGE_AREA_05
   0.524    0.5248    0.5075   0.508   0.12   0.00  HTML_IMAGE_AREA_04
   0.016    0.0159    0.0254   0.385   0.05   0.00  HTML_IMAGE_AREA_08
   0.047    0.0449    0.1522   0.228   0.01   1.61  HTML_IMAGE_AREA_07
   0.081    0.0717    0.5075   0.124   0.00   0.00  HTML_IMAGE_AREA_09

So, it looks like spammers have stopped adding size tags to images.

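For anyone who wants to check the S/O column: the figures above are
consistent with S/O = SPAM% / (SPAM% + HAM%), so a value below 0.500
simply means HAM% exceeds SPAM%.  A minimal Python sketch, assuming that
formula (inferred from the rows, not taken from the tool's source); the
function name s_o is mine and purely illustrative:

```python
def s_o(spam_pct: float, ham_pct: float) -> float:
    """Share of a rule's hit rate that comes from spam.

    Assumed formula: SPAM% / (SPAM% + HAM%). It reproduces the S/O
    column for the rows listed below, but is inferred from the data.
    """
    total = spam_pct + ham_pct
    if total == 0:
        raise ValueError("rule hit no messages, S/O is undefined")
    return spam_pct / total


# (SPAM%, HAM%) pairs copied from the tables in this message.
rows = {
    "HTML_COLOR_MAGENTA": (5.9400, 2.5628),
    "HTML_MESSAGE": (100.0000, 100.0000),
    "HTML_COLOR_UNKNOWN": (4.5380, 5.0241),
    "HTML_IMAGE_AREA_09": (0.0717, 0.5075),
}

for name, (spam, ham) in rows.items():
    print(f"{s_o(spam, ham):.3f}  {name}")
```

HTML_MESSAGE comes out at exactly 0.500 because it hits 100% of both
spam and ham in this subset, which is why it works as the dividing line
in each table.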
The image ratio and image-only rules that don't rely on size tags seem
to be working better:

   6.636    6.7788    0.0254   0.996   0.94   2.75  HTML_IMAGE_ONLY_04
   4.399    4.4904    0.1776   0.962   0.84   1.90  HTML_IMAGE_ONLY_08
   2.540    2.5899    0.2284   0.919   0.73   0.53  HTML_IMAGE_ONLY_16
   3.083    3.1399    0.4567   0.873   0.63   1.53  HTML_IMAGE_ONLY_12
   5.176    5.2664    0.9642   0.845   0.57   0.82  HTML_IMAGE_RATIO_04
   7.104    7.2215    1.6493   0.814   0.52   0.00  HTML_IMAGE_RATIO_02
   3.126    3.1689    1.1165   0.739   0.38   0.79  HTML_IMAGE_ONLY_24
   2.293    2.3174    1.1672   0.665   0.28   0.61  HTML_IMAGE_ONLY_20
   2.140    2.1631    1.0911   0.665   0.28   0.94  HTML_IMAGE_RATIO_06
 100.000  100.0000  100.0000   0.500   0.26   0.16  HTML_MESSAGE
   2.669    2.6884    1.7762   0.602   0.21   0.60  HTML_IMAGE_RATIO_08
   0.564    0.5324    2.0046   0.210   0.01   0.32  HTML_IMAGE_RATIO_12
   0.666    0.6162    2.9688   0.172   0.01   0.00  HTML_IMAGE_RATIO_14
   0.267    0.2375    1.6240   0.128   0.00   0.54  HTML_IMAGE_RATIO_10

Any thoughts or objections to removing them?

Daniel

-- 
Daniel Quinlan                     anti-spam (SpamAssassin), Linux,
http://www.pathname.com/~quinlan/    and open source consulting
