At 01:11 PM 3/14/2005, Michael Parker wrote:
In general, no, it's not possible to dump the bayesian tokens in a
readable (well, they are readable, it's just hard to read them :))
format, unless you do a little work yourself.  It is possible to dump
them by making use of the given plugin hooks that allow you to fetch
the "raw" token value and match it to the SHA1 hash for the token.

True. However, given only a bayes DB in 3.0's normal format, you can't dump it in text format: the DB stores just the hashes, so the plugin would have to have been running while the bayes DB was created in order to record the raw tokens.
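To make the point concrete, here is a minimal sketch of the matching step, not SpamAssassin's actual code. It assumes the stored key is the last 5 bytes of the token's SHA1 digest (my recollection of the 3.0 behaviour; treat the truncation length as an assumption), and the function names are hypothetical. It shows why the raw tokens have to be captured separately: a hash can only be labelled if you already hold the token that produced it.

```python
import hashlib

# Assumption: the DB key is the last 5 bytes of the SHA1 digest of the token.
HASH_BYTES = 5


def token_key(raw_token: str) -> str:
    """Return the (assumed) truncated SHA1 key for a raw token, as hex."""
    digest = hashlib.sha1(raw_token.encode("utf-8")).digest()
    return digest[-HASH_BYTES:].hex()


def build_lookup(raw_tokens):
    """Map hashed key -> raw token for every token we managed to capture."""
    return {token_key(t): t for t in raw_tokens}


def label_dump(hashed_keys, lookup):
    """Pair each hashed key from a dump with its raw token, if known."""
    return [(k, lookup.get(k, "<unknown>")) for k in hashed_keys]
```

Any key whose raw token was never captured stays "<unknown>", which is the situation you are in with a DB that was built without the plugin running.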



The primary motivation for the change was indeed speed, and let me
tell you it was a lot.  Privacy never really entered into the picture,
although I suppose it is a nice side effect, except that with a plugin
it's pretty easy to map the token values.

True. I guess I misrepresented a side effect that some find desirable as a reason for the implementation. Speed was the big motivator.


Of course, I have to ask, how do you find the data "quite useful?"

It's "quite useful" because dumping the bayes DB through sort and looking at the tokens helps you spot tokens that may point to misclassified messages.


E.g., if I see an obfuscated Viagra variant with stats like "0 spam 1 ham 0.000", I know to go dig around in my archives for a misclassified message containing that word and retrain it properly.
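The dump-and-sort check described above could be roughly automated like this. It is a sketch under assumptions: it expects each dump line to carry probability, spam count, ham count, atime and token in that order (check your own dump output before trusting it), and the function names and the pattern are only illustrations.

```python
import re


def parse_dump_line(line):
    """Split one dump line into (prob, nspam, nham, token), or None.

    Assumed column order: probability, spam count, ham count, atime, token.
    """
    parts = line.split(None, 4)
    if len(parts) < 5:
        return None
    prob, nspam, nham, _atime, token = parts
    try:
        return float(prob), int(nspam), int(nham), token.strip()
    except ValueError:
        return None


def suspicious_ham_tokens(lines, pattern):
    """Return tokens matching `pattern` that were only ever seen in ham."""
    rx = re.compile(pattern, re.IGNORECASE)
    hits = []
    for line in lines:
        rec = parse_dump_line(line)
        if rec is None:
            continue
        prob, nspam, nham, token = rec
        if token.startswith("non-token data"):
            continue  # skip bookkeeping rows such as message counts
        if nspam == 0 and nham > 0 and rx.search(token):
            hits.append((token, nspam, nham, prob))
    return sorted(hits)
```

For example, calling suspicious_ham_tokens(lines, r"v[i1!|][a@]gr[a@]") lists obfuscated variants that have only ham hits, which are the ones worth tracing back to a misclassified message. This only works on readable tokens, so with a 3.0 DB the hashes would first have to be mapped back to raw tokens.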

However, as I said before, 9 times out of 10 doing this leads to people over-manipulating their bayes DB by deciding that a particular token "must be" spam or nonspam, and doing things like creating bogus messages to shift the training the way they want it. A lot of admins get really worried about one or two tokens that don't "look right"... Which is a bad thing.





