http://bugzilla.spamassassin.org/show_bug.cgi?id=3023





------- Additional Comments From [EMAIL PROTECTED]  2004-02-21 23:13 -------
Created an attachment (id=1792)
 --> (http://bugzilla.spamassassin.org/attachment.cgi?id=1792&action=view)
code that was added

Okay.  Test added to SVN for testing using mostly Gary Funck's ideas.

Note that I am checking a huge range of possible cut-off values for length and
percentage of unique words.  It works pretty well, although about 40 words or
so are needed for good results.
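
For readers following along, here is a minimal Python sketch of the kind of
check described above. It is not the code from the attachment: the tokenizer,
the function names, and the assumption that the test fires when the
unique-word fraction is at or above a cut-off are all illustrative guesses.

```python
import re


def unique_word_stats(text):
    """Return (total_words, unique_fraction) for a message body.

    The tokenizer is a naive ASCII-only regex (an assumption made for this
    sketch), which is exactly the kind of locale dependence that makes a
    simple approach fragile.
    """
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words)
    if total == 0:
        return 0, 0.0
    return total, len(set(words)) / total


def check_unique_words(text, min_words, min_unique_fraction):
    """Hypothetical rule: fire only when the message is long enough and the
    fraction of distinct words reaches the cut-off. The direction of the
    comparison is an assumption, not taken from the attachment."""
    total, fraction = unique_word_stats(text)
    return total >= min_words and fraction >= min_unique_fraction
```

Sweeping `min_words` and `min_unique_fraction` over a grid of values would
produce one candidate rule per pair of cut-offs.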

One concern I have is about the word tokenization system.  It's not very
locale-independent.  Maybe we should use/reuse/steal some Bayes code for this.

It might work better to scale the percentages based on number of words, perhaps
using Zipf's law a bit more intelligently.  I think only one rule should be
sufficient for this test.  I don't really want yet another range test...

The top rules of the long, long list:

OVERALL%    SPAM%     HAM%     S/O    RANK   SCORE  NAME
  36096     8124    27972    0.225   0.00    0.00  (all messages)
100.000  22.5066  77.4934    0.225   0.00    0.00  (all messages as %)
  2.527  11.2014   0.0072    0.999   1.00    0.01  T_CHECK_UNIQUE_WORDS_070_930
  2.499  11.0783   0.0072    0.999   1.00    0.01  T_CHECK_UNIQUE_WORDS_070_935
  2.457  10.8936   0.0072    0.999   1.00    0.01  T_CHECK_UNIQUE_WORDS_070_940
  2.585  11.4476   0.0107    0.999   1.00    0.01  T_CHECK_UNIQUE_WORDS_070_920
  2.571  11.3860   0.0107    0.999   1.00    0.01  T_CHECK_UNIQUE_WORDS_070_925
  2.421  10.7336   0.0072    0.999   1.00    0.01  T_CHECK_UNIQUE_WORDS_080_920
  2.410  10.6844   0.0072    0.999   1.00    0.01  T_CHECK_UNIQUE_WORDS_070_945
  2.407  10.6721   0.0072    0.999   1.00    0.01  T_CHECK_UNIQUE_WORDS_080_925
  2.651  11.7307   0.0143    0.999   1.00    0.01  T_CHECK_UNIQUE_WORDS_070_915
  2.529  11.2014   0.0107    0.999   1.00    0.01  T_CHECK_UNIQUE_WORDS_080_910
  2.369  10.4998   0.0072    0.999   1.00    0.01  T_CHECK_UNIQUE_WORDS_080_930
  2.360  10.4628   0.0072    0.999   1.00    0.01  T_CHECK_UNIQUE_WORDS_070_950
  2.482  10.9921   0.0107    0.999   1.00    0.01  T_CHECK_UNIQUE_WORDS_080_915
  2.599  11.4968   0.0143    0.999   1.00    0.01  T_CHECK_UNIQUE_WORDS_080_905
  2.341  10.3767   0.0072    0.999   1.00    0.01  T_CHECK_UNIQUE_WORDS_080_935
  2.335  10.3520   0.0072    0.999   1.00    0.01  T_CHECK_UNIQUE_WORDS_070_955
  2.457  10.8813   0.0107    0.999   1.00    0.01  T_CHECK_UNIQUE_WORDS_060_955
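
As a reading aid for the table, the sketch below shows how the OVERALL%, SPAM%,
HAM% and S/O columns relate to raw hit counts, taking S/O to be
SPAM% / (SPAM% + HAM%), which matches every row shown. The function name is
made up for illustration, and the hit counts in the usage note (910 spam, 2 ham)
are back-calculated from the first rule's percentages, not stated in the
report.

```python
def hit_stats(spam_hits, ham_hits, n_spam, n_ham):
    """Return (overall_pct, spam_pct, ham_pct, s_o) for one rule.

    spam_hits / ham_hits: messages in each corpus that the rule matched.
    n_spam / n_ham: corpus sizes (8124 and 27972 in the table).
    """
    spam_pct = 100.0 * spam_hits / n_spam
    ham_pct = 100.0 * ham_hits / n_ham
    overall_pct = 100.0 * (spam_hits + ham_hits) / (n_spam + n_ham)
    denom = spam_pct + ham_pct
    # S/O is the share of the rule's hit rate that comes from spam.
    s_o = spam_pct / denom if denom else 0.0
    return overall_pct, spam_pct, ham_pct, s_o
```

With `hit_stats(910, 2, 8124, 27972)` the rounded values come out as 2.527,
11.2014, 0.0072 and 0.999, in line with the T_CHECK_UNIQUE_WORDS_070_930 row.
An S/O of 0.999 therefore corresponds to roughly two ham hits in about 28,000
ham messages.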


Any comments or ideas?



