These are some pretty strange-looking terms popping up here. Can you share some of your data?
On Sun, Jan 29, 2012 at 11:43 PM, Stuart Smith <[email protected]> wrote:

> Hello,
>
> So I eliminated the feature that was basically a document id, and I'm
> still getting the same results.
>
> Based on what's been said on this thread, this should not happen (because
> we should always be classifying into some category):
>
> 12/01/29 15:30:25 INFO bayes.TestClassifier: Loading model from:
> {basePath=/user/stu/machine_learning/bayes/model, classifierType=bayes,
> alpha_i=1.0, dataSource=hdfs, gramSize=1, verbose=false,
> confusionMatrix=null, encoding=UTF-8, defaultCat=unknown,
> testDirPath=/user/stu/machine_learning/bayes/category-test-data}
> 12/01/29 15:30:25 INFO bayes.TestClassifier: Testing Bayes Classifier
> 12/01/29 15:30:27 INFO bayes.SequenceFileModelReader: Read 50000 feature weights
> 12/01/29 15:30:27 INFO bayes.SequenceFileModelReader: Read 100000 feature weights
> 12/01/29 15:30:27 INFO bayes.SequenceFileModelReader: Read 150000 feature weights
> 12/01/29 15:30:28 INFO bayes.SequenceFileModelReader: Read 200000 feature weights
> 12/01/29 15:30:28 INFO bayes.SequenceFileModelReader: 1069718.2183796456
> 12/01/29 15:30:30 INFO bayes.InMemoryBayesDatastore: good -1537123.539470884 1845854.5550999944 -0.8327435849286697
> 12/01/29 15:30:30 INFO bayes.InMemoryBayesDatastore: bad -1845854.5550999944 1845854.5550999944 -1.0
> 12/01/29 15:30:30 INFO bayes.TestClassifier:
> =======================================================
> Summary
> -------------------------------------------------------
> Correctly Classified Instances : 0 �%
>
> Yet, this is what I get (from a 90/10 split of the data using the
> splitBayesInput class from Taming Text).
>
>
> So I'm stumped.
>
> I don't even really know where to begin debugging this..
>
>
> And just to rule out the most obvious bonehead mistake:
>
> hadoop dfs -ls /user/stu/machine_learning/bayes/category-test-data/
> Found 2 items
> -rw-r--r-- 3 stu supergroup 108810564 2012-01-29 14:50 /user/stu/machine_learning/bayes/category-test-data/bad
> -rw-r--r-- 3 stu supergroup 38614032 2012-01-29 14:50 /user/stu/machine_learning/bayes/category-test-data/good
>
> Here's a couple snippets from my seqdump:
>
> Key class: class org.apache.mahout.common.StringTuple Value Class: class org.apache.hadoop.io.DoubleWritable
> Key: [__WT, bad, 0_lockit]: Value: 42.99318841395318
> Key: [__WT, bad, 0_winit]: Value: 49.148550010941364
> Key: [__WT, bad, 0x10,0x12,0x13,0x17]: Value: 52.495103287942825
> Key: [__WT, bad, 0x10,0x13a]: Value: 11.538787093822286
> Key: [__WT, bad, 0x1000040]: Value: 0.07495396643707189
> Key: [__WT, bad, 0x1001c]: Value: 0.12800826729901066
> Key: [__WT, good, 0array]: Value: 10.481077499671203
> Key: [__WT, good, 0cudvdcapturework]: Value: 0.10344809179965245
> Key: [__WT, good, 0pav1]: Value: 0.23050782000541226
> Key: [__WT, good, 0x1]: Value: 1342.2191134942075
> Key: [__WT, good, 0x10000]: Value: 243.74351518918098
>
> Ted,
> If you're interested, I can send over the whole seqdump file just to you,
> but I'm a little wary of posting it to the whole list at this point...
> Once I understand the problem more, I might realize that giving away the
> information won't hurt anything...
>
>
> Thoughts?
>
> Take care,
> -stu
>
>
>
>
> ________________________________
> From: Ted Dunning <[email protected]>
> To: [email protected]; Stuart Smith <[email protected]>
> Sent: Saturday, January 28, 2012 12:36 PM
> Subject: Re: Diagnosing naive bayes results
>
> It always tells you the most likely category, but you can redefine the
> output to only trigger if the most likely category really dominates the
> results.
>
> With two categories, this is reasonable. For a dozen it is much more
> debatable.
>
> This works with the SGD classifiers as well and I have seen this used in a
> multi-level classifier.
>
> On Fri, Jan 27, 2012 at 8:06 PM, Stuart Smith <[email protected]> wrote:
>
> > Hello,
> >
> > Does naive bayes always classify a document into a category?
> > Or will it refuse to classify something it cannot?
> >
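
For what it's worth, the idea quoted above (only report a category when the most likely one really dominates) can be sketched roughly as below. This is not Mahout's API, just a minimal illustration: the function name, the `min_margin` parameter, and the assumption that a higher score means a more likely category are all hypothetical, and the `unknown` fallback simply mirrors the `defaultCat=unknown` setting in the log.

```python
def classify_with_margin(scores, min_margin, default_category="unknown"):
    """Return the best category only if it clearly beats the runner-up.

    scores: dict mapping category name -> score. ASSUMPTION: higher is
            better (e.g. a log-likelihood); a real classifier may use a
            different sign convention.
    min_margin: hypothetical threshold on (best score - second-best score).
    default_category: returned when no category dominates.
    """
    if not scores:
        return default_category

    # Rank categories from most to least likely.
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    best_label, best_score = ranked[0]

    # With a single category there is nothing to compare against.
    if len(ranked) == 1:
        return best_label

    runner_up_score = ranked[1][1]
    if best_score - runner_up_score >= min_margin:
        return best_label
    return default_category


if __name__ == "__main__":
    # Made-up scores, for illustration only.
    print(classify_with_margin({"good": -10.0, "bad": -25.0}, min_margin=5.0))
    print(classify_with_margin({"good": -10.0, "bad": -12.0}, min_margin=5.0))
```

With two categories the margin is just the gap between them; with a dozen, comparing only against the runner-up is a much cruder notion of "dominates", which fits the caveat above.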
