feature vector encoding in Mahout

Chris Schilling Tue, 14 Dec 2010 15:38:07 -0800

Hello,

After going through the newest chapters in MIA (very helpful btw), I have a few
questions that I think I know the answers to, but I wanted to get some
confirmation.


Let's say that I have a list of documents and my own pipeline for feature
extraction.  For each document I have a list of keywords (and multi-word
keyword phrases) with corresponding weights, so each document is now just a
list of phrases and weights, i.e.:

doc1:
phrase1   wt1
phrase2   wt2
phrase3   wt3
...

I would like to use Mahout to train document classifiers using the phrases and 
weights in these files.

Looking at the TrainNewsGroups code in o.a.m.classifier.sgd, it looks like I
can just use the encoder class for these phrases and weights.  Something like
this:

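// DataRecord and trainingData stand in for my own record type and data set.
RecordValueEncoder encoder =
    new StaticWordValueEncoder("variable-name");
for (DataRecord ex : trainingData) {
    // One hashed feature vector per document.
    Vector v = new RandomAccessSparseVector(10000);
    String word = ex.get("variable-name");
    encoder.addToVector(word, v);
}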

Does this make sense?
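
One thing I notice is that the snippet above never hands my own weights to the
encoder.  To make clear what I am after, here is a minimal, self-contained
sketch in plain Java of the hashed encoding as I understand it.
HashedPhraseEncoder, its hash function, and the sample phrases and weights are
all made up for illustration; this is not Mahout's implementation or API, only
the behavior I am hoping to get from the real encoder.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a hashed feature encoder (NOT Mahout code).
class HashedPhraseEncoder {
    private final String name;      // feature name, mixed into the hash
    private final int cardinality;  // fixed length of the feature vector

    HashedPhraseEncoder(String name, int cardinality) {
        this.name = name;
        this.cardinality = cardinality;
    }

    // Map a phrase to a slot in [0, cardinality). Assumed hash: String.hashCode.
    int indexFor(String phrase) {
        return Math.floorMod((name + ":" + phrase).hashCode(), cardinality);
    }

    // Add the phrase's weight at its hashed slot; colliding phrases accumulate.
    void addToVector(String phrase, double weight, double[] vector) {
        vector[indexFor(phrase)] += weight;
    }

    // Encode one document, given as phrase -> weight, into a dense vector.
    double[] encode(Map<String, Double> phraseWeights) {
        double[] vector = new double[cardinality];
        for (Map.Entry<String, Double> e : phraseWeights.entrySet()) {
            addToVector(e.getKey(), e.getValue(), vector);
        }
        return vector;
    }

    public static void main(String[] args) {
        // Placeholder phrases and weights, standing in for doc1 above.
        Map<String, Double> doc1 = new LinkedHashMap<>();
        doc1.put("machine learning", 0.8);
        doc1.put("feature hashing", 0.5);
        doc1.put("mahout", 0.3);

        HashedPhraseEncoder encoder = new HashedPhraseEncoder("phrases", 10000);
        double[] v = encoder.encode(doc1);

        // Print the non-zero slots so the encoding can be inspected.
        for (int i = 0; i < v.length; i++) {
            if (v[i] != 0.0) {
                System.out.println(i + " -> " + v[i]);
            }
        }
    }
}
```

In this sketch each document gets its own vector, and a multi-word phrase is
hashed as a single feature rather than being split into words, which is what I
would want for my phrases.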

I would like to compare the results of SGD and Naive Bayes classification
using this data.  However, I am unclear about the vector formation process in
Naive Bayes.  I have prepared some input for the Bayes classifier using the
prepare20newsgroups "macro" - I was able to get my data into a format similar
to the 20 news groups dataset.  My main question is: can I use Naive Bayes if
I already have the features (the phrases above) and weights that I want to use
for training?
