Hello,
After going through the newest chapters in MIA (very helpful, btw), I have a few
questions that I think I know the answers to, but I wanted to get some
confirmation.
Let's say that I have a list of documents and my own pipeline for feature
extraction. For each document I have a list of keywords (and multi-keyword
phrases) with corresponding weights, so each document is now just a list
of keyword phrases and weights, i.e.
doc1:
    phrase1 wt1
    phrase2 wt2
    phrase3 wt3
    ...
I would like to use Mahout to train document classifiers using the phrases and
weights in these files.
Looking at the TrainNewsGroups code in o.a.m.classifier.sgd, it looks like I
can just use the encoder class for these phrases and weights. Something like
this:
    RecordValueEncoder encoder = new StaticWordValueEncoder("variable-name");
    for (DataRecord ex : trainingData) {
        Vector v = new RandomAccessSparseVector(10000);
        String word = ex.get("variable-name");
        encoder.addToVector(word, v);
    }
Does this make sense?
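To make the question concrete, here is a rough plain-Java sketch of what I
would expect "using the phrases and weights" to amount to: each phrase is
hashed to a slot in a fixed-size vector and its weight is added there. This is
only a stand-in for illustration, not Mahout code. The class and method names
are made up, String.hashCode() is just a placeholder hash, and I am assuming
the weight is the last space-separated token on each line.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in, not Mahout code: hashes weighted phrases into a
// fixed-size dense array to illustrate the idea.
class WeightedPhraseHashing {

    // Parse lines of the form "phrase wt". The weight is assumed to be the
    // last space-separated token, so multi-word phrases stay intact.
    static Map<String, Double> parse(String[] lines) {
        Map<String, Double> phraseWeights = new LinkedHashMap<>();
        for (String raw : lines) {
            String line = raw.trim();
            int split = line.lastIndexOf(' ');
            if (split < 0) {
                continue; // blank line or no weight column: skip it
            }
            String phrase = line.substring(0, split).trim();
            double weight = Double.parseDouble(line.substring(split + 1));
            // If a phrase appears twice, sum its weights.
            phraseWeights.merge(phrase, weight, Double::sum);
        }
        return phraseWeights;
    }

    // Hash each phrase to one of `size` slots and add its weight there.
    // Colliding phrases simply share a slot and their weights add up.
    static double[] encode(Map<String, Double> phraseWeights, int size) {
        double[] v = new double[size];
        for (Map.Entry<String, Double> e : phraseWeights.entrySet()) {
            int slot = Math.floorMod(e.getKey().hashCode(), size);
            v[slot] += e.getValue();
        }
        return v;
    }

    public static void main(String[] args) {
        String[] doc1 = {"machine learning 0.8", "mahout 0.5"};
        double[] v = encode(parse(doc1), 10000);
        double total = 0.0;
        for (double x : v) {
            total += x;
        }
        System.out.println("total weight in vector: " + total);
    }
}
```

If the real encoder can take a per-phrase weight in the same way, then I think
my existing weights could be passed straight through instead of being
recomputed from raw text.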
I would like to compare the results of SGD and Naive Bayes classification
using this data. However, I am unclear about how the vectors are formed in
Naive Bayes. I have prepared some input for the Bayes classifier using the
prepare20newsgroups "macro", and I was able to get my data into a format
similar to the 20 newsgroups dataset. My main question is: can I use Naive
Bayes if I already have the features (the phrases above) and weights that I
want to use for training?