http://bugzilla.spamassassin.org/show_bug.cgi?id=3331

------- Additional Comments From [EMAIL PROTECTED]  2004-04-29 14:02 -------
Justin Mason wrote:

> BTW, I know Dan was expressing some concerns last night about
> exposing too much of the bayes data, saying that even with that
> info, it's not much use for people to do stuff with -- sure, you
> know that token "foo.html" is a good spam sign, but in what way does
> that help, given that it's not really possible to actually edit the
> bayes db token-by-token?

For the end user, providing db statistics that show the original
tokens would serve the same function as showing individual token
scores for a message: it would help the user understand what
SpamAssassin is doing, so that the user has reasonable expectations
of what it can and cannot do, and it would show the user why he or
she should take more care over which messages are learned.  BTW, I
myself consider this a weak reason for keeping the original tokens,
given that the tokens can already be shown in the message.

The much more important reason for keeping the original tokens is as
a source of ideas on how to make future improvements.  For example,
one improvement that has been suggested (and used elsewhere) is to
tokenize word pairs.  Examining the database might give hints about
whether too many rare word pairs are being retained.  Another question
that can be examined is how much random character strings (used in
some spam) inflate the database.  With the original tokens kept, one
can simply have a look.  As for "a.html", perhaps tokenizing markup
would help.
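
To make the word-pair question concrete, here is a rough sketch of the
kind of look one could take if the original tokens are kept.  It is
not SpamAssassin's tokenizer or database format: the tokenizer, the
function names, the sample messages and the "seen at most once"
threshold are all assumptions made up for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_pairs(tokens):
    """Return each pair of adjacent tokens, joined by a single space."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]


def rare_fraction(counts, threshold=1):
    """Fraction of distinct tokens that were seen at most `threshold` times."""
    if not counts:
        return 0.0
    rare = sum(1 for n in counts.values() if n <= threshold)
    return rare / len(counts)


# Hypothetical learned messages; the last one ends in a random string.
messages = [
    "buy cheap pills now",
    "buy cheap watches now",
    "cheap pills xkqzjv",
]

single = Counter()
pairs = Counter()
for msg in messages:
    toks = tokenize(msg)
    single.update(toks)
    pairs.update(word_pairs(toks))

print("distinct single tokens:", len(single), "rare:", rare_fraction(single))
print("distinct word pairs:   ", len(pairs), "rare:", rare_fraction(pairs))
```

On this toy data a larger share of the word pairs than of the single
tokens is seen only once, which is the sort of hint about retention
that only a readable database can give.  The same counting would show
how many entries come from random strings such as "xkqzjv".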




