[Todd Kennedy]
> With the definitions of spamcount and hamcount it makes sense that
> they might be zero, since there is minimal training data in the
> system, and the word being scored does not exist in the database.
>
> This might be some sort of small bug with running the filter on a
> small amount of data, as I can reliably replicate a divide by zero
> error.  If spamcount and hamcount are both zero, shouldn't the system
> return some sort of 0% probability for spam or ham (showing its
> uncertainty for the phrase being scored)?

Yes, and it does.  That's what Kenny tried to tell you :-)  This is
Classifier._worddistanceget():

    def _worddistanceget(self, word):
        record = self._wordinfoget(word)
        if record is None:
            prob = options["Classifier", "unknown_word_prob"]
        else:
            prob = self.probability(record)
        distance = abs(prob - 0.5)
        return distance, prob, word, record

If there is no record for the word, then this returns the value of the
"unknown_word_prob" option.  It only tries to _compute_ the probability
if there _is_ a record for the word, and it should never be the case
that a record exists for a word with hamcount and spamcount both 0.

It would be helpful to dump print statements into that function (or run
under Python's debugger) to see exactly which word it is and what's in
that record -- or possibly you'd discover that _worddistanceget() isn't
being called at all.  You didn't include a complete traceback in your
original message, so it's impossible from here to guess who called
probability() to begin with.  A complete traceback would help.

> ...
> If I change line 320 of classify.py (I'm using the latest 1.1a1 release
> now) to a very simple try/except clause:
>     try:
>         prob = spamratio / (hamratio + spamratio)
>     except:
>         prob = 0
>
> You can't replicate the error with the above script.
>
> Is this a patch that should be submitted?

No, because that slows down a speed-critical function to paper over a
problem that should never occur.  The bug isn't that this is dividing
by 0, the bug is that probability() is being _called_ when both counts
are 0.  Something, somewhere, on the path _toward_ calling probability()
is in error.

_______________________________________________
spambayes-dev mailing list
spambayes-dev@python.org
http://mail.python.org/mailman/listinfo/spambayes-dev
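
To make the debugging suggestion concrete, here is a minimal standalone sketch of the kind of check that would name the offending word instead of hiding the division by zero. It is not SpamBayes code: Record, UNKNOWN_WORD_PROB (and its value of 0.5), raw_probability() and checked_worddistance() are all hypothetical names made up for illustration, and only the ratio expression comes from the quoted line of classify.py.

```python
from collections import namedtuple

# Hypothetical stand-in for a word record; it carries only the two
# counts discussed in this thread.
Record = namedtuple("Record", ["spamcount", "hamcount"])

# Assumed value, for illustration only.
UNKNOWN_WORD_PROB = 0.5


def raw_probability(record, nspam, nham):
    """Ratio from the quoted line; divides by zero if both counts are 0."""
    spamratio = record.spamcount / float(nspam or 1)
    hamratio = record.hamcount / float(nham or 1)
    return spamratio / (hamratio + spamratio)


def checked_worddistance(word, record, nspam, nham):
    """Same shape as _worddistanceget(), plus a loud check on the record."""
    if record is None:
        # No record at all: fall back to the unknown-word probability.
        prob = UNKNOWN_WORD_PROB
    else:
        if record.spamcount == 0 and record.hamcount == 0:
            # This state should never exist; report which word has it.
            raise ValueError("zero-count record for word %r: %r"
                             % (word, record))
        prob = raw_probability(record, nspam, nham)
    distance = abs(prob - 0.5)
    return distance, prob, word, record
```

A check like this belongs in a debugging session rather than in the speed-critical path. Its only purpose is to show which word carries a record with both counts at 0, so the search can move to whatever created that record.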