http://bugzilla.spamassassin.org/show_bug.cgi?id=3023
------- Additional Comments From [EMAIL PROTECTED]  2004-02-21 23:13 -------
Created an attachment (id=1792)
 --> (http://bugzilla.spamassassin.org/attachment.cgi?id=1792&action=view)
code that was added

Okay.  Test added to SVN for testing using mostly Gary Funck's ideas.
Note that I am checking a huge range of possible cut-off values for
length and percentage of unique words.

It works pretty well, although about 40 words or so are needed for good
results.

One concern I have is about the word tokenization system.  It's not very
locale independent.  Maybe we should use/reuse/steal some Bayes code for
this.

It might work better to scale the percentages based on number of words,
perhaps using Zipf's law a bit more intelligently.  I think only one rule
should be sufficient for this test.  I don't really want yet another
range test...

The top rules of the long long list:

OVERALL%   SPAM%     HAM%     S/O    RANK   SCORE  NAME
  36096     8124    27972    0.225   0.00    0.00  (all messages)
100.000  22.5066  77.4934    0.225   0.00    0.00  (all messages as %)
  2.527  11.2014   0.0072    0.999   1.00    0.01  T_CHECK_UNIQUE_WORDS_070_930
  2.499  11.0783   0.0072    0.999   1.00    0.01  T_CHECK_UNIQUE_WORDS_070_935
  2.457  10.8936   0.0072    0.999   1.00    0.01  T_CHECK_UNIQUE_WORDS_070_940
  2.585  11.4476   0.0107    0.999   1.00    0.01  T_CHECK_UNIQUE_WORDS_070_920
  2.571  11.3860   0.0107    0.999   1.00    0.01  T_CHECK_UNIQUE_WORDS_070_925
  2.421  10.7336   0.0072    0.999   1.00    0.01  T_CHECK_UNIQUE_WORDS_080_920
  2.410  10.6844   0.0072    0.999   1.00    0.01  T_CHECK_UNIQUE_WORDS_070_945
  2.407  10.6721   0.0072    0.999   1.00    0.01  T_CHECK_UNIQUE_WORDS_080_925
  2.651  11.7307   0.0143    0.999   1.00    0.01  T_CHECK_UNIQUE_WORDS_070_915
  2.529  11.2014   0.0107    0.999   1.00    0.01  T_CHECK_UNIQUE_WORDS_080_910
  2.369  10.4998   0.0072    0.999   1.00    0.01  T_CHECK_UNIQUE_WORDS_080_930
  2.360  10.4628   0.0072    0.999   1.00    0.01  T_CHECK_UNIQUE_WORDS_070_950
  2.482  10.9921   0.0107    0.999   1.00    0.01  T_CHECK_UNIQUE_WORDS_080_915
  2.599  11.4968   0.0143    0.999   1.00    0.01  T_CHECK_UNIQUE_WORDS_080_905
  2.341  10.3767   0.0072    0.999   1.00    0.01  T_CHECK_UNIQUE_WORDS_080_935
  2.335  10.3520   0.0072    0.999   1.00    0.01  T_CHECK_UNIQUE_WORDS_070_955
  2.457  10.8813   0.0107    0.999   1.00    0.01  T_CHECK_UNIQUE_WORDS_060_955

Any comments or ideas?

------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.
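
For readers without the attachment, the kind of test described in the comment above (a cut-off on message length combined with a cut-off on the percentage of unique words) might look roughly like the sketch below. This is not the attached code. The function names, the tokenizer, and the direction of both comparisons are assumptions made for illustration. Reading a rule name such as T_CHECK_UNIQUE_WORDS_070_930 as "at least 70 words, at least 93.0% unique" is also a guess, not something the comment states.

```python
import re

# Hypothetical tokenizer: lowercase ASCII words with an optional apostrophe part.
# It is deliberately naive and shows the locale concern raised in the comment,
# since accented or non-Latin words are split or dropped.
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")


def unique_word_stats(text):
    """Return (total words, unique words, percentage of unique words)."""
    words = WORD_RE.findall(text.lower())
    total = len(words)
    unique = len(set(words))
    pct = 100.0 * unique / total if total else 0.0
    return total, unique, pct


def check_unique_words(text, min_words, min_unique_pct):
    """Assumed rule shape: fire only on bodies that are long enough
    and whose words are almost all distinct."""
    total, _unique, pct = unique_word_stats(text)
    return total >= min_words and pct >= min_unique_pct
```

Under that reading, check_unique_words(body, 70, 93.0) would stand in for one row of the table. A single rule that scales the percentage threshold with the word count, as suggested above, would replace the two fixed cut-offs with a function of the total.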
