Hello,

Does naive Bayes always classify a document into some category? Or can it refuse to classify a document it cannot place?
For example: I'm working through the naive Bayes tutorial in Taming Text with my own data. I built a Lucene index, ran extract training data, split 90/10, etc. After looking at the seq dumper output for the trained model, I noticed I had made a mistake when building the index: the good/bad documents had a unique id field (in the terms) that didn't get filtered out because of a typo in my little Java program that builds the index.

I went ahead and ran the test just to see what would happen, and the confusion matrix I got was all zeros. No document was classified correctly or incorrectly; no document was classified at all. I suspect this was because the model overfit to the unique id field in the training data, which the test vectors would not have.

While this sounds rational, it only explains the results if naive Bayes can refuse to classify a document into any category whatsoever. So I'm just wondering whether this is true, or whether I should be looking for more mistakes. I'm re-running it right now, but building the index takes a while, so I thought I'd ping the list in the meantime.

Thanks! Take care,
-stu
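
P.S. To make the question concrete, here is a minimal sketch of the textbook multinomial naive Bayes decision rule as I understand it, written in Python rather than the Java I'm actually using. It is not the tutorial's implementation: the `train` and `classify` functions, the Laplace smoothing constant `alpha`, the choice to skip terms never seen in training, and the toy documents are all my own assumptions for illustration. Under this rule the classifier takes an argmax over the labels, so a document that shares no terms with the training data still ends up with whichever label has the largest prior.

```python
import math
from collections import Counter, defaultdict


def train(docs):
    """docs: list of (label, tokens). Returns (label counts, per-label term counts, vocabulary)."""
    label_counts = Counter()
    term_counts = defaultdict(Counter)
    vocab = set()
    for label, tokens in docs:
        label_counts[label] += 1
        term_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, term_counts, vocab


def classify(model, tokens, alpha=1.0):
    """Return (best label, log scores) using the textbook argmax rule."""
    label_counts, term_counts, vocab = model
    total_docs = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        # Start from the log prior of the label.
        score = math.log(label_counts[label] / total_docs)
        total_terms = sum(term_counts[label].values())
        for t in tokens:
            if t not in vocab:
                continue  # assumption: terms unseen in training are ignored
            # Laplace-smoothed log likelihood of the term under this label.
            score += math.log(
                (term_counts[label][t] + alpha) / (total_terms + alpha * len(vocab))
            )
        scores[label] = score
    # argmax always picks some label; there is no "refuse" branch here.
    return max(scores, key=scores.get), scores


# Hypothetical toy data: each document carries a unique id term, as in my broken index.
training = [
    ("good", ["id_001", "great", "product"]),
    ("good", ["id_002", "great", "service"]),
    ("good", ["id_003", "works", "well"]),
    ("bad", ["id_004", "broken", "product"]),
    ("bad", ["id_005", "awful", "service"]),
]
model = train(training)

# A test document with no terms in common with the training data.
label, scores = classify(model, ["zzz", "unseen"])
print(label, scores)
```

If the real implementation behaves like this sketch, every test document would land in some cell of the confusion matrix, so all zeros would point to a different mistake rather than a refusal to classify. I haven't verified that the tutorial's code works this way.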
