Hello,

   So I eliminated the feature that was basically a document id, and I'm still 
getting the same results.

Based on what's been said on this thread, this should not happen (because we 
should always be classifying into some category):
12/01/29 15:30:25 INFO bayes.TestClassifier: Loading model from: 
{basePath=/user/stu/machine_learning/bayes/model, classifierType=bayes, 
alpha_i=1.0, dataSource=hdfs, gramSize=1, verbose=false, confusionMatrix=null, 
encoding=UTF-8, defaultCat=unknown, 
testDirPath=/user/stu/machine_learning/bayes/category-test-data}
12/01/29 15:30:25 INFO bayes.TestClassifier: Testing Bayes Classifier
12/01/29 15:30:27 INFO bayes.SequenceFileModelReader: Read 50000 feature weights
12/01/29 15:30:27 INFO bayes.SequenceFileModelReader: Read 100000 feature weights
12/01/29 15:30:27 INFO bayes.SequenceFileModelReader: Read 150000 feature weights
12/01/29 15:30:28 INFO bayes.SequenceFileModelReader: Read 200000 feature weights
12/01/29 15:30:28 INFO bayes.SequenceFileModelReader: 1069718.2183796456
12/01/29 15:30:30 INFO bayes.InMemoryBayesDatastore: good -1537123.539470884 1845854.5550999944 -0.8327435849286697
12/01/29 15:30:30 INFO bayes.InMemoryBayesDatastore: bad -1845854.5550999944 1845854.5550999944 -1.0
12/01/29 15:30:30 INFO bayes.TestClassifier: 
=======================================================
Summary
-------------------------------------------------------
Correctly Classified Instances          :          0             �%

Yet the output above is what I get (from a 90/10 split of the data using the
splitBayesInput class from Taming Text).


So I'm stumped. 

I don't even really know where to begin debugging this..


And just to rule out the most obvious bonehead mistake:

hadoop dfs -ls /user/stu/machine_learning/bayes/category-test-data/
Found 2 items
-rw-r--r--   3 stu supergroup  108810564 2012-01-29 14:50 /user/stu/machine_learning/bayes/category-test-data/bad
-rw-r--r--   3 stu supergroup   38614032 2012-01-29 14:50 /user/stu/machine_learning/bayes/category-test-data/good

Here are a couple of snippets from my seqdump:

Key class: class org.apache.mahout.common.StringTuple Value Class: class 
org.apache.hadoop.io.DoubleWritable
Key: [__WT, bad, 0_lockit]: Value: 42.99318841395318
Key: [__WT, bad, 0_winit]: Value: 49.148550010941364
Key: [__WT, bad, 0x10,0x12,0x13,0x17]: Value: 52.495103287942825
Key: [__WT, bad, 0x10,0x13a]: Value: 11.538787093822286
Key: [__WT, bad, 0x1000040]: Value: 0.07495396643707189
Key: [__WT, bad, 0x1001c]: Value: 0.12800826729901066
Key: [__WT, good, 0array]: Value: 10.481077499671203
Key: [__WT, good, 0cudvdcapturework]: Value: 0.10344809179965245
Key: [__WT, good, 0pav1]: Value: 0.23050782000541226
Key: [__WT, good, 0x1]: Value: 1342.2191134942075
Key: [__WT, good, 0x10000]: Value: 243.74351518918098

Ted,
If you're interested, I can send over the whole seqdump file just to you, but 
I'm a little wary of posting it to the whole list at this point...
Once I understand the problem more, I might realize that giving away the 
information won't hurt anything...


Thoughts?

Take care,
  -stu




________________________________
 From: Ted Dunning <[email protected]>
To: [email protected]; Stuart Smith <[email protected]> 
Sent: Saturday, January 28, 2012 12:36 PM
Subject: Re: Diagnosing naive bayes results
 
It always tells you the most likely category, but you can redefine the
output to only trigger if the most likely category really dominates the
results.

With two categories, this is reasonable.  For a dozen it is much more
debatable.
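
As a rough illustration of the idea above, here is a minimal Python sketch of a "only answer when the winner dominates" rule. It is not Mahout code: the function name classify_with_margin, the min_margin parameter, and the assumption that a higher score means a more likely category (for example a log-probability) are all hypothetical choices made for the example. The "unknown" fallback simply mirrors the defaultCat value seen in the log.

```python
def classify_with_margin(scores, min_margin, default="unknown"):
    """Return the top-scoring label only if it beats the runner-up
    by at least min_margin; otherwise return the default label.

    Assumes higher score = more likely (e.g. log-probabilities).
    """
    if not scores:
        return default

    # Rank labels from most to least likely.
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) == 1:
        return ranked[0][0]

    best_label, best_score = ranked[0]
    _, second_score = ranked[1]

    # Only commit when the winner clearly dominates the runner-up.
    if best_score - second_score >= min_margin:
        return best_label
    return default


if __name__ == "__main__":
    # Made-up scores, purely for illustration.
    print(classify_with_margin({"good": -10.0, "bad": -25.0}, min_margin=5.0))
    print(classify_with_margin({"good": -10.0, "bad": -12.0}, min_margin=5.0))
```

With two categories the margin between winner and runner-up is the whole story; with a dozen, a single margin is a cruder signal, which is one way to read the caveat above.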

This works with the SGD classifiers as well and I have seen this used in a
multi-level classifier.

On Fri, Jan 27, 2012 at 8:06 PM, Stuart Smith <[email protected]> wrote:

> Hello,
>
> Does naive bayes always classify a document into a category?
> Or will it refuse to classify something it cannot?
>
