These are some pretty strange-looking terms popping up here.

Can you share some of your data?

On Sun, Jan 29, 2012 at 11:43 PM, Stuart Smith <[email protected]> wrote:

> Hello,
>
>    So I eliminated the feature that was basically a document id, and I'm
> still getting the same results.
>
> Based on what's been said on this thread, this should not happen (because
> we should always be classifying into some category):
> 12/01/29 15:30:25 INFO bayes.TestClassifier: Loading model from:
> {basePath=/user/stu/machine_learning/bayes/model, classifierType=bayes,
> alpha_i=1.0, dataSource=hdfs, gramSize=1, verbose=false,
> confusionMatrix=null, encoding=UTF-8, defaultCat=unknown,
> testDirPath=/user/stu/machine_learning/bayes/category-test-data}
> 12/01/29 15:30:25 INFO bayes.TestClassifier: Testing Bayes Classifier
> 12/01/29 15:30:27 INFO bayes.SequenceFileModelReader: Read 50000 feature
> weights
> 12/01/29 15:30:27 INFO bayes.SequenceFileModelReader: Read 100000 feature
> weights
> 12/01/29 15:30:27 INFO bayes.SequenceFileModelReader: Read 150000 feature
> weights
> 12/01/29 15:30:28 INFO bayes.SequenceFileModelReader: Read 200000 feature
> weights
> 12/01/29 15:30:28 INFO bayes.SequenceFileModelReader: 1069718.2183796456
> 12/01/29 15:30:30 INFO bayes.InMemoryBayesDatastore: good
> -1537123.539470884 1845854.5550999944 -0.8327435849286697
> 12/01/29 15:30:30 INFO bayes.InMemoryBayesDatastore: bad
> -1845854.5550999944 1845854.5550999944 -1.0
> 12/01/29 15:30:30 INFO bayes.TestClassifier:
> =======================================================
> Summary
> -------------------------------------------------------
> Correctly Classified Instances          :          0             �%
>
> Yet, this is what I get (from a 90/10 split of the data using the
> splitBayesInput class from Taming Text).
>
>
> So I'm stumped.
>
> I don't even really know where to begin debugging this.
>
>
> And just to rule out the most obvious bonehead mistake:
>
> hadoop dfs -ls /user/stu/machine_learning/bayes/category-test-data/
> Found 2 items
> -rw-r--r--   3 stu supergroup  108810564 2012-01-29 14:50 /user/stu/machine_learning/bayes/category-test-data/bad
> -rw-r--r--   3 stu supergroup   38614032 2012-01-29 14:50 /user/stu/machine_learning/bayes/category-test-data/good
>
> Here are a couple of snippets from my seqdump:
>
> Key class: class org.apache.mahout.common.StringTuple Value Class: class
> org.apache.hadoop.io.DoubleWritable
> Key: [__WT, bad, 0_lockit]: Value: 42.99318841395318
> Key: [__WT, bad, 0_winit]: Value: 49.148550010941364
> Key: [__WT, bad, 0x10,0x12,0x13,0x17]: Value: 52.495103287942825
> Key: [__WT, bad, 0x10,0x13a]: Value: 11.538787093822286
> Key: [__WT, bad, 0x1000040]: Value: 0.07495396643707189
> Key: [__WT, bad, 0x1001c]: Value: 0.12800826729901066
> Key: [__WT, good, 0array]: Value: 10.481077499671203
> Key: [__WT, good, 0cudvdcapturework]: Value: 0.10344809179965245
> Key: [__WT, good, 0pav1]: Value: 0.23050782000541226
> Key: [__WT, good, 0x1]: Value: 1342.2191134942075
> Key: [__WT, good, 0x10000]: Value: 243.74351518918098
>
> Ted,
> If you're interested, I can send over the whole seqdump file just to you,
> but I'm a little wary of posting it to the whole list at this point...
> Once I understand the problem more, I might realize that giving away the
> information won't hurt anything...
>
>
> Thoughts?
>
> Take care,
>   -stu
>
>
>
>
> ________________________________
>  From: Ted Dunning <[email protected]>
> To: [email protected]; Stuart Smith <[email protected]>
> Sent: Saturday, January 28, 2012 12:36 PM
> Subject: Re: Diagnosing naive bayes results
>
> It always tells you the most likely category, but you can redefine the
> output to only trigger if the most likely category really dominates the
> results.
>
> With two categories, this is reasonable.  For a dozen it is much more
> debatable.
>
> This works with the SGD classifiers as well and I have seen this used in a
> multi-level classifier.
>
> On Fri, Jan 27, 2012 at 8:06 PM, Stuart Smith <[email protected]> wrote:
>
> > Hello,
> >
> > Does naive bayes always classify a document into a category?
> > Or will it refuse to classify something it cannot?
> >
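
Ted's suggestion in this thread, to report a category only when it really dominates the results, can be sketched roughly as follows. This is a minimal illustration in plain Python, not Mahout code: the function name `classify_with_reject`, the `scores` dictionary format, and the `min_margin` threshold are all assumptions made for the example. The `"unknown"` fallback simply echoes the `defaultCat=unknown` setting visible in the log above.

```python
def classify_with_reject(scores, min_margin, default="unknown"):
    """Return the top label only if it beats the runner-up by min_margin.

    scores: dict mapping label -> score, where a higher score means more
            likely (for example a log-likelihood). Hypothetical format.
    min_margin: required gap between the best and second-best score.
    default: label returned when no category clearly dominates.
    """
    if not scores:
        return default

    # Sort labels from most to least likely.
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

    # With a single category there is nothing to compare against.
    if len(ranked) == 1:
        return ranked[0][0]

    (best_label, best_score), (_, second_score) = ranked[0], ranked[1]

    # Only "trigger" when the winner dominates the runner-up.
    if best_score - second_score >= min_margin:
        return best_label
    return default


if __name__ == "__main__":
    # Clear winner: the gap is 4.7, which exceeds the margin of 2.0.
    print(classify_with_reject({"good": -10.2, "bad": -14.9}, min_margin=2.0))
    # Near tie: the gap is only 0.7, so the default is returned instead.
    print(classify_with_reject({"good": -10.2, "bad": -10.9}, min_margin=2.0))
```

With two categories the margin between the winner and the runner-up is a natural test. With a dozen categories, as Ted notes, the choice is more debatable, and the threshold would need to be tuned on held-out data.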
