Ted,

Thanks for your answers. I think I might be getting the hang of this thing :)
On Dec 14, 2010, at 11:25 PM, Ted Dunning wrote:

> On Tue, Dec 14, 2010 at 3:40 PM, Chris Schilling <[email protected]> wrote:
>
>> After going through the newest chapters in MIA (very helpful btw), I have a
>> few questions that I think I know the answer to, but just wanted to get some
>> reinforcement.
>
> We are revising them based on comments and would be happy to entertain
> suggestions, so fire away if you have any confusions.

Cool. I have a few thoughts. I will organize them and get back to you. I also
notice typos here and there. The forum does not seem to be the place to mention
such trivial things. Is there an appropriate offline contact?

>> Let's say that I have a list of documents and my own pipeline for feature
>> extraction. So, for each document I have a list of keywords (and multi-keyword
>> phrases) and corresponding weights. Each document is now just a list of
>> keyword phrases and weights, i.e.
>>
>> doc1:
>> phrase1 wt1
>> phrase2 wt2
>> phrase3 wt3
>> ...
>>
>> I would like to use Mahout to train document classifiers using the phrases
>> and weights in these files.
>
> Cool. You may eventually have phrases from different fields as well. More
> about that in a sec.

Okay, a bit about the problem I am working on: I have documents from different
pre-labeled categories. From these documents I run feature extraction and
calculate TF-IDF weights for the keywords and multi-keyword phrases in each
document across a fairly large corpus. So my final dataset looks something
like this:

label1, doc1
phrase1 wt11
phrase2 wt21
phrase3 wt31
...

label1, doc2
phrase1 wt12
phrase4 wt42
phraseX wtX2
...

label2, doc3
phrase2 wt23
phraseY wtY3
...

So basically, for phrase i in document j, I calculate w_ij = tfidf_ij.
Document j can belong to any one of nLabels < nDocs labels, probably ~10
labels (millions of docs).
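For concreteness, the per-document weighting described above (w_ij = tf_ij * idf_i) can be sketched like this. This is only a minimal illustration of the idea, not my actual extraction pipeline; the class and method names are made up:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch of per-document TF-IDF weighting: each document is a list
// of extracted phrases, and w_ij = tf_ij * log(nDocs / df_i), where df_i is
// the number of documents in the corpus that contain phrase i.
public class TfIdfSketch {
    public static Map<String, Double> weights(List<String> docPhrases,
                                              Map<String, Integer> docFreq,
                                              int nDocs) {
        // Term frequency of each phrase within this one document.
        Map<String, Integer> tf = new HashMap<>();
        for (String phrase : docPhrases) {
            tf.merge(phrase, 1, Integer::sum);
        }
        // Scale by inverse document frequency across the corpus.
        Map<String, Double> w = new HashMap<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            double idf = Math.log((double) nDocs / docFreq.get(e.getKey()));
            w.put(e.getKey(), e.getValue() * idf);
        }
        return w;
    }
}
```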
The main difference between my extraction and, say, a 1-gram approach is that
my extraction contains n-grams in general, so my features are a combination of
mostly 1-grams, 2-grams, and 3-grams, although I do not limit it to 3. I
require minimum support and apply other cuts along the way so that the n-grams
I extract are "important." Anyway, the tf-idf weights are calculated for each
document across the corpus (as opposed to being calculated for each label).

If I understand the Naive Bayes implementation correctly, the tf-idf is
calculated across each label (as opposed to each training sample/document).
So, based on what you state below, it would be difficult to implement this
with the NB implementation. It does not seem like I would gain much by using
my own feature vectors for Naive Bayes. In my preliminary tests, I am already
at ~80% classification accuracy on a held-out test set.

>> Looking at the TrainNewsGroups code in o.a.m.classifier.sgd, it looks like
>> I can just use the encoder class for these phrases and weights.
>
> Yes. Absolutely.
>
>> Something like this:
>>
>> RecordValueEncoder encoder =
>>     new StaticWordValueEncoder("variable-name");
>> for (DataRecord ex : trainingData) {
>>     Vector v = new RandomAccessSparseVector(10000);
>>     String word = ex.get("variable-name");
>>     encoder.addToVector(word, v);
>> }
>>
>> Does this make sense?
>
> Yes. You can use the weight that you had in your original data as well.
> That happens with a line like this:
>
>     double weight = ... mumble ...
>     encoder.addToVector(word, weight, v);
>
> Of course, you will need to have comparable weights at classification time.
> Also, the SGD should override your weight in the interest of accuracy.
> Using large weights is also not a great idea because it can cause unstable
> updates. If you use the AdaptiveLogisticRegression, it should manage by
> adapting the learning rate down. Combinations of very large and very small
> weights will cause the items with small weights to be essentially ignored.
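The encoders Ted mentions work by hashing each word into a slot of a fixed-size vector and adding the supplied weight there. The following is a self-contained sketch of that idea only; it is not Mahout's actual StaticWordValueEncoder implementation (which, among other things, can hash each word into several probe locations to soften collisions):

```java
// Self-contained sketch of weighted feature hashing, the idea behind
// encoder.addToVector(word, weight, v) in Mahout. NOT the real Mahout code:
// a single hash probe per word, into a plain double[] standing in for a
// RandomAccessSparseVector of the same cardinality.
public class HashingEncoderSketch {
    private final int numFeatures;

    public HashingEncoderSketch(int numFeatures) {
        this.numFeatures = numFeatures;
    }

    // Add the weight to the slot this word hashes into. Collisions are
    // possible; a larger vector or multiple probes reduces their impact.
    public void addToVector(String word, double weight, double[] v) {
        int slot = Math.floorMod(word.hashCode(), numFeatures);
        v[slot] += weight;
    }
}
```

Two phrases can land in the same slot, which is why the vector size (10000 in the snippet above) matters relative to the vocabulary.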
So, it sounds like LR and NB are in general more concerned with the presence
of a keyword in a document than with its weight. The weights I calculate for
each phrase lie between 0.5 and 1.0, and the variance between documents is
small. Well, I can test the cases of weights and no weights...

>> I would like to compare the results of an SGD and a Naive Bayes
>> classification using this data. However, I am unclear on the vector
>> formation process in Naive Bayes. I have prepared some input for the Bayes
>> classifier using the prepare20newsgroups "macro" - I was able to get my
>> data into a format similar to the 20 newsgroups dataset. I guess my main
>> question is: can I use Naive Bayes if I already have the features (the
>> phrases above) and weights that I want to use for training?
>
> Naive Bayes is very much command-line oriented. The SGD logistic regression
> models are very much API oriented. That means, as you suggest, that you
> have to format your data appropriately for Naive Bayes. Moreover, Naive
> Bayes will simply ignore your weights. SGD may optimize them away
> eventually, but it will pay attention to them in the short run. Naive
> Bayes can only handle text-like input (at the moment) without any fields.
>
> You can handle separately fielded data in SGD by using multiple encoders.

Most of this was just rambling. I want to get deeper into the SGD APIs and
get some performance/evaluation studies running.

Thanks again, Ted.
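To see why SGD "pays attention" to input weights in the short run but can optimize them away, here is a bare-bones binary logistic regression with a plain SGD update. This is an illustration of the update rule only, not Mahout's OnlineLogisticRegression: the gradient step for coefficient beta[i] is scaled by the feature value x[i], so a large input weight means a large step, while over many examples the learned coefficients can shrink to compensate:

```java
// Bare-bones binary logistic regression trained by SGD, to show how feature
// values (e.g. TF-IDF weights) enter the update: the step for beta[i] is
// learningRate * (label - p) * x[i]. Not Mahout's OnlineLogisticRegression.
public class LogisticSgdSketch {
    final double[] beta;
    final double learningRate;

    public LogisticSgdSketch(int numFeatures, double learningRate) {
        this.beta = new double[numFeatures];
        this.learningRate = learningRate;
    }

    // p(label = 1 | x) under the current coefficients.
    public double predict(double[] x) {
        double dot = 0.0;
        for (int i = 0; i < beta.length; i++) {
            dot += beta[i] * x[i];
        }
        return 1.0 / (1.0 + Math.exp(-dot));
    }

    // One SGD step on a single (x, label) example, label in {0, 1}.
    public void train(double[] x, int label) {
        double error = label - predict(x);
        for (int i = 0; i < beta.length; i++) {
            beta[i] += learningRate * error * x[i];
        }
    }
}
```

A fixed learning rate is used here for simplicity; Mahout's AdaptiveLogisticRegression anneals it, which is what keeps large input weights from causing the unstable updates Ted warns about.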
