Hello,
I eliminated the feature that was basically a document ID, and I'm still
getting the same results.
Based on what's been said on this thread, the result below should not happen,
because we should always be classifying into some category:
12/01/29 15:30:25 INFO bayes.TestClassifier: Loading model from:
{basePath=/user/stu/machine_learning/bayes/model, classifierType=bayes,
alpha_i=1.0, dataSource=hdfs, gramSize=1, verbose=false, confusionMatrix=null,
encoding=UTF-8, defaultCat=unknown,
testDirPath=/user/stu/machine_learning/bayes/category-test-data}
12/01/29 15:30:25 INFO bayes.TestClassifier: Testing Bayes Classifier
12/01/29 15:30:27 INFO bayes.SequenceFileModelReader: Read 50000 feature weights
12/01/29 15:30:27 INFO bayes.SequenceFileModelReader: Read 100000 feature weights
12/01/29 15:30:27 INFO bayes.SequenceFileModelReader: Read 150000 feature weights
12/01/29 15:30:28 INFO bayes.SequenceFileModelReader: Read 200000 feature weights
12/01/29 15:30:28 INFO bayes.SequenceFileModelReader: 1069718.2183796456
12/01/29 15:30:30 INFO bayes.InMemoryBayesDatastore: good -1537123.539470884 1845854.5550999944 -0.8327435849286697
12/01/29 15:30:30 INFO bayes.InMemoryBayesDatastore: bad -1845854.5550999944 1845854.5550999944 -1.0
12/01/29 15:30:30 INFO bayes.TestClassifier:
=======================================================
Summary
-------------------------------------------------------
Correctly Classified Instances : 0 �%
Yet this is what I get from a 90/10 split of the data, made with the
splitBayesInput class from Taming Text.
So I'm stumped.
I don't even really know where to begin debugging this.
And just to rule out the most obvious bonehead mistake:
hadoop dfs -ls /user/stu/machine_learning/bayes/category-test-data/
Found 2 items
-rw-r--r-- 3 stu supergroup 108810564 2012-01-29 14:50 /user/stu/machine_learning/bayes/category-test-data/bad
-rw-r--r-- 3 stu supergroup 38614032 2012-01-29 14:50 /user/stu/machine_learning/bayes/category-test-data/good
Here are a couple of snippets from my seqdump:
Key class: class org.apache.mahout.common.StringTuple Value Class: class org.apache.hadoop.io.DoubleWritable
Key: [__WT, bad, 0_lockit]: Value: 42.99318841395318
Key: [__WT, bad, 0_winit]: Value: 49.148550010941364
Key: [__WT, bad, 0x10,0x12,0x13,0x17]: Value: 52.495103287942825
Key: [__WT, bad, 0x10,0x13a]: Value: 11.538787093822286
Key: [__WT, bad, 0x1000040]: Value: 0.07495396643707189
Key: [__WT, bad, 0x1001c]: Value: 0.12800826729901066
Key: [__WT, good, 0array]: Value: 10.481077499671203
Key: [__WT, good, 0cudvdcapturework]: Value: 0.10344809179965245
Key: [__WT, good, 0pav1]: Value: 0.23050782000541226
Key: [__WT, good, 0x1]: Value: 1342.2191134942075
Key: [__WT, good, 0x10000]: Value: 243.74351518918098
Ted,
If you're interested, I can send over the whole seqdump file just to you, but
I'm a little wary of posting it to the whole list at this point...
Once I understand the problem more, I might realize that giving away the
information won't hurt anything...
Thoughts?
Take care,
-stu
________________________________
From: Ted Dunning <[email protected]>
To: [email protected]; Stuart Smith <[email protected]>
Sent: Saturday, January 28, 2012 12:36 PM
Subject: Re: Diagnosing naive bayes results
It always tells you the most likely category, but you can redefine the
output to only trigger if the most likely category really dominates the
results.
With two categories, this is reasonable. For a dozen it is much more
debatable.
This works with the SGD classifiers as well and I have seen this used in a
multi-level classifier.
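
A minimal sketch of that idea, not Mahout code: the function name, the
min_margin parameter, and the example scores are all hypothetical, and it
assumes each category has a score where higher means more likely (for
example a log-likelihood). The "unknown" fallback mirrors the
defaultCat=unknown setting in the log above.

```python
def classify_with_margin(scores, min_margin, default="unknown"):
    """Return the top category only if it clearly dominates the runner-up.

    scores: dict mapping category name -> score, higher is better
            (an assumption; flip the comparison if lower is better).
    min_margin: how far ahead the best score must be to count as dominant.
    default: label returned when no category dominates.
    """
    if not scores:
        return default

    # Rank categories from most to least likely.
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    best_label, best_score = ranked[0]

    # With a single category there is nothing to compare against.
    if len(ranked) == 1:
        return best_label

    runner_up_score = ranked[1][1]
    if best_score - runner_up_score >= min_margin:
        return best_label
    return default


if __name__ == "__main__":
    # Made-up scores, for illustration only.
    print(classify_with_margin({"good": -10.2, "bad": -14.9}, min_margin=2.0))
    print(classify_with_margin({"good": -10.2, "bad": -10.9}, min_margin=2.0))
```

With two categories the margin to the runner-up is the whole story. With a
dozen, the gap between first and second place says less about the rest of
the field, which is one reason the approach becomes more debatable there.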
On Fri, Jan 27, 2012 at 8:06 PM, Stuart Smith <[email protected]> wrote:
> Hello,
>
> Does naive bayes always classify a document into a category?
> Or will it refuse to classify something it cannot?
>