Hello,

Does naive bayes always classify a document into a category?
Or will it refuse to classify something it cannot?


For example:


I'm working through the naive bayes tutorial in Taming Text - with my own data.
I built a lucene index, ran extract training data, split 90/10, etc.

After looking at the seq dumper on the trained model - I noticed I made a 
mistake when building the index:
The good/bad documents had a unique id field (in the terms) that didn't get 
filtered out because of a typo/error in my little java program to build the 
index.


I went ahead and ran the test just to see what would happen, and the confusion 
matrix I got was all zeros.
No document was classified correctly or incorrectly.

No document was classified at all.

I suspect this was because it overfit to the unique id field in the training 
data - which the test vectors would not have.

While this sounds rational, it only explains the results if naive bayes can 
refuse to classify a document in any category whatsoever.
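
To make the question concrete, here is my understanding of the textbook 
decision rule as a toy sketch. It is plain multinomial naive bayes with add-one 
smoothing in Python, NOT Mahout's actual implementation; the names train and 
classify and the tiny data set (including the id_N tokens standing in for my 
stray unique id field) are made up for illustration. I am also assuming that 
tokens never seen in training are simply skipped.

```python
import math
from collections import Counter


def train(docs):
    """docs: list of (label, tokens) pairs. Returns a toy model tuple."""
    token_counts = {}      # label -> Counter of token frequencies
    doc_counts = Counter() # label -> number of training documents
    vocab = set()          # every token seen in training
    for label, tokens in docs:
        doc_counts[label] += 1
        token_counts.setdefault(label, Counter()).update(tokens)
        vocab.update(tokens)
    return token_counts, doc_counts, vocab


def classify(model, tokens, alpha=1.0):
    """Return (best_label, scores) using log prior + smoothed log likelihoods."""
    token_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in doc_counts:
        counts = token_counts[label]
        denom = sum(counts.values()) + alpha * len(vocab)
        score = math.log(doc_counts[label] / total_docs)  # class prior
        for tok in tokens:
            if tok in vocab:  # assumption: unseen tokens are ignored
                score += math.log((counts[tok] + alpha) / denom)
        scores[label] = score
    # argmax over classes: some label always comes back
    return max(scores, key=scores.get), scores


# Hypothetical training set where each document carries a unique id token.
docs = [
    ("good", ["id_1", "great", "product"]),
    ("good", ["id_2", "great", "service"]),
    ("bad", ["id_3", "awful", "product"]),
]
model = train(docs)

# A test document sharing no tokens with training still gets a label
# (it falls back to the class prior).
label, scores = classify(model, ["totally", "unseen", "words"])
```

If the real classifier behaves like this sketch, every test document ends up 
with some label, even one whose tokens were all unseen, so an all-zero 
confusion matrix would have to come from somewhere else in my pipeline.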

So I'm just wondering if this is true, or whether I should be looking for more 
mistakes.

I'm re-running it right now, but building the index takes a while, so I thought 
I'd ping the list in the meantime.

Thanks!

Take care,
  -stu
