[I'm moving this over to spambayes-dev because it deals more with the code]
On 7/13/06, Todd Kennedy <[EMAIL PROTECTED]> wrote:
> I'm trying to integrate the spambayes package into my blogging
> software as a comment spam filter. I've read through a bunch of the
> source, looked at the scripts provided and stuff and have a
> rudimentary understanding of how the software works. (i think). but
> i'm getting a ZeroDivisionError when I try to run the score method of
> hammie.
>
> [...]
>
> The exception occurs at:
>   File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/python2.3/site-packages/spambayes/classifier.py", line 320, in probability
>     prob = spamratio / (hamratio + spamratio)
> ZeroDivisionError: float division
>
> I put in some simple print statements to print out nham, nspam,
> spamcount and hamcount. this is their output:
> 22:14:52 (~)
> [EMAIL PROTECTED]> ./test_sp.py
> spamcount 6
> hamcount 6
> nham 6
> nspam 6
> spamcount 6
> hamcount 6
> spamcount 6
> hamcount 6
> spamcount 6
> hamcount 6
> spamcount 0
> hamcount 0
> nham 6
> nspam 6
>
> why would spamcount and hamcount go to 0?

From the WordInfo class comments in classifier.py:

    # ... spamcount is the
    # number of trained spam msgs in which the word appears, and hamcount
    # the number of trained ham msgs.

So spamcount would be 0 if the current word has never been seen in a
trained spam message, and similarly for hamcount. A word will only
appear in the training database if it has appeared in at least one
message so you should never have a word with both counts 0. The
_worddistanceget() function in the Classifier class deals with this by
assigning a default probability to any word that does not appear in
the training data, so the probability calculation should only run on
trained words. It's hard to say how the code might have ended up in
the probability() function with a word that wasn't in the training
data.
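To make the failure concrete, here is a minimal sketch of the ratio step. It is not the real classifier.py code: I am assuming hamratio and spamratio are simply hamcount/nham and spamcount/nspam, and I am leaving out everything the real function does after the line in your traceback. The find_zero_count_words() helper and the wordinfo dict are hypothetical names made up for illustration, not part of spambayes.

```python
def probability(spamcount, hamcount, nspam, nham):
    """Simplified ratio step (assumed to mirror only the traceback line)."""
    hamratio = hamcount / nham
    spamratio = spamcount / nspam
    # With spamcount == hamcount == 0 both ratios are 0.0, so the
    # denominator below is 0.0 and the division raises.
    return spamratio / (hamratio + spamratio)


def find_zero_count_words(wordinfo):
    """Hypothetical helper: list words whose record has both counts at 0."""
    return sorted(
        word
        for word, (spamcount, hamcount) in wordinfo.items()
        if spamcount == 0 and hamcount == 0
    )


# Made-up training data: word -> (spamcount, hamcount).
wordinfo = {"viagra": (6, 0), "hello": (6, 6), "stale": (0, 0)}

print(probability(6, 6, nspam=6, nham=6))  # 0.5

try:
    probability(0, 0, nspam=6, nham=6)
except ZeroDivisionError as exc:
    print("ZeroDivisionError:", exc)

print(find_zero_count_words(wordinfo))  # ['stale']
```

With your numbers (nham 6, nspam 6), the 6/6 words come out at 0.5 and the 0/0 word is the one that blows up, which matches the last spamcount/hamcount pair in your output.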
It might help to print which word produced each of the
spamcount/hamcount pairs and compare those against the training data
to see if there are any that don't appear in the training data.

It would also be interesting to know if you have ever tried to remove
a message from the training data (i.e. untrain the message). When a
message is removed, each word is checked to see if both counts have
gone to 0 (see the _remove_msg function), and the word should be
removed from the training data in that case. I see that you are using
the Postgres storage engine. I'm guessing a little here, but I don't
think Postgres has received as much testing as some of the other
storage formats, so it is possible that the record didn't actually get
deleted from the training database once both counts went to 0.

-- 
Kenny Pitt
_______________________________________________
spambayes-dev mailing list
spambayes-dev@python.org
http://mail.python.org/mailman/listinfo/spambayes-dev