Ted,

Thanks for your answers. I think I might be getting the hang of this thing :)
On Dec 14, 2010, at 11:25 PM, Ted Dunning wrote:

> On Tue, Dec 14, 2010 at 3:40 PM, Chris Schilling <[email protected]> wrote:
>
>> After going through the newest chapters in MIA (very helpful btw), I have a
>> few questions that I think I know the answer to, but just wanted to get some
>> reinforcement.
>
> We are revising them based on comments and would be happy to entertain
> suggestions, so fire away if you have any confusions.

Cool. I have a few thoughts. I will organize them and get back to you. I also
notice typos here and there. The forum does not seem to be the place to mention
such trivial things. Is there an appropriate offline contact?

>> Let's say that I have a list of documents and my own pipeline for feature
>> extraction. So, for each document I have a list of keywords (and multi-keyword
>> phrases) and corresponding weights. Each document is now just a list of
>> keyword phrases and weights, i.e.
>>
>> doc1:
>> phrase1 wt1
>> phrase2 wt2
>> phrase3 wt3
>> ...
>>
>> I would like to use Mahout to train document classifiers using the phrases
>> and weights in these files.
>
> Cool. You may eventually have phrases from different fields as well. More
> about that in a sec.

Okay, a bit about the problem I am working on: I have documents from different
pre-labeled categories. From these documents I run feature extraction and
calculate TF-IDF weights for the keywords and multi-keyword phrases in each
document across a fairly large corpus. So my final dataset looks something
like this:

label1, doc1
phrase1 wt11
phrase2 wt21
phrase3 wt31
...

label1, doc2
phrase1 wt12
phrase4 wt42
phraseX wtX2
...

label2, doc3
phrase2 wt23
phraseY wtY3
...

So basically, for phrase i in document j, I calculate w_ij = tfidf_ij.
Document j can belong to any one of nLabels < nDocs labels, probably ~10
labels (millions of docs).
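For concreteness, the per-document weighting described above (w_ij = tf_ij * idf_i) can be sketched like this. This is only a minimal illustration of the idea, not my actual extraction pipeline; the class and method names are made up:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch of per-document TF-IDF weighting: each document is a list
// of extracted phrases, and w_ij = tf_ij * log(nDocs / df_i), where df_i is
// the number of documents in the corpus that contain phrase i.
public class TfIdfSketch {
    public static Map<String, Double> weights(List<String> docPhrases,
                                              Map<String, Integer> docFreq,
                                              int nDocs) {
        // Term frequency of each phrase within this one document.
        Map<String, Integer> tf = new HashMap<>();
        for (String phrase : docPhrases) {
            tf.merge(phrase, 1, Integer::sum);
        }
        // Scale by inverse document frequency across the corpus.
        Map<String, Double> w = new HashMap<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            double idf = Math.log((double) nDocs / docFreq.get(e.getKey()));
            w.put(e.getKey(), e.getValue() * idf);
        }
        return w;
    }
}
```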
The main difference between my extraction and, say, a 1-gram approach is that
my extraction contains n-grams in general, so my features are a combination of
mostly 1-grams, 2-grams, and 3-grams, although I do not limit it to 3. I
require minimum support and apply other cuts along the way so that the n-grams
I extract are "important." Anyway, the tf-idf weights are calculated for each
document across the corpus (as opposed to being calculated for each label).

If I understand the Naive Bayes implementation correctly, the tf-idf is
calculated across each label (as opposed to each training sample/document).
So, based on what you state below, it would be difficult to implement this
with the NB implementation. It does not seem like I would gain much by using
my own feature vectors for Naive Bayes. In my preliminary tests, I am already
at ~80% classification accuracy on a held-out test set.

>> Looking at the TrainNewsGroups code in o.a.m.classifier.sgd, it looks like
>> I can just use the encoder class for these phrases and weights.
>
> Yes. Absolutely.
>
>> Something like this:
>>
>> RecordValueEncoder encoder =
>>     new StaticWordValueEncoder("variable-name");
>> for (DataRecord ex : trainingData) {
>>     Vector v = new RandomAccessSparseVector(10000);
>>     String word = ex.get("variable-name");
>>     encoder.addToVector(word, v);
>> }
>>
>> Does this make sense?
>
> Yes. You can use the weight that you had in your original data as well.
> That happens with a line like this:
>
>     double weight = ... mumble ...
>     encoder.addToVector(word, weight, v);
>
> Of course, you will need to have comparable weights at classification time.
> Also, the SGD should override your weight in the interest of accuracy.
> Using large weights is also not a great idea because it can cause unstable
> updates. If you use the AdaptiveLogisticRegression, it should manage by
> adapting the learning rate down. Combinations of very large and very small
> weights will cause the items with small weights to be essentially ignored.
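The encoders Ted mentions work by hashing each word into a slot of a fixed-size vector and adding the supplied weight there. The following is a self-contained sketch of that idea only; it is not Mahout's actual StaticWordValueEncoder implementation (which, among other things, can hash each word into several probe locations to soften collisions):

```java
// Self-contained sketch of weighted feature hashing, the idea behind
// encoder.addToVector(word, weight, v) in Mahout. NOT the real Mahout code:
// a single hash probe per word, into a plain double[] standing in for a
// RandomAccessSparseVector of the same cardinality.
public class HashingEncoderSketch {
    private final int numFeatures;

    public HashingEncoderSketch(int numFeatures) {
        this.numFeatures = numFeatures;
    }

    // Add the weight to the slot this word hashes into. Collisions are
    // possible; a larger vector or multiple probes reduces their impact.
    public void addToVector(String word, double weight, double[] v) {
        int slot = Math.floorMod(word.hashCode(), numFeatures);
        v[slot] += weight;
    }
}
```

Two phrases can land in the same slot, which is why the vector size (10000 in the snippet above) matters relative to the vocabulary.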
So, it sounds like LR and NB are in general more concerned with the presence
of a keyword in a document than with its weight. The weights I calculate for
each phrase lie between 0.5 and 1.0, and the variance between documents is
small. Well, I can test the cases of weights and no weights...

>> I would like to compare the results of an SGD and a Naive Bayes
>> classification using this data. However, I am unclear on the vector
>> formation process in Naive Bayes. I have prepared some input for the Bayes
>> classifier using the prepare20newsgroups "macro" - I was able to get my
>> data into a format similar to the 20 newsgroups dataset. I guess my main
>> question is: can I use Naive Bayes if I already have the features (the
>> phrases above) and weights that I want to use for training?
>
> Naive Bayes is very much command-line oriented. The SGD logistic regression
> models are very much API oriented. That means, as you suggest, that you
> have to format your data appropriately for Naive Bayes. Moreover, Naive
> Bayes will simply ignore your weights. SGD may optimize them away
> eventually, but it will pay attention to them in the short run. Naive
> Bayes can only handle text-like input (at the moment) without any fields.
>
> You can handle separately fielded data in SGD by using multiple encoders.

Most of this was just rambling. I want to get deeper into the SGD APIs and
get some performance/evaluation studies running.

Thanks again, Ted.
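To see why SGD "pays attention" to input weights in the short run but can optimize them away, here is a bare-bones binary logistic regression with a plain SGD update. This is an illustration of the update rule only, not Mahout's OnlineLogisticRegression: the gradient step for coefficient beta[i] is scaled by the feature value x[i], so a large input weight means a large step, while over many examples the learned coefficients can shrink to compensate:

```java
// Bare-bones binary logistic regression trained by SGD, to show how feature
// values (e.g. TF-IDF weights) enter the update: the step for beta[i] is
// learningRate * (label - p) * x[i]. Not Mahout's OnlineLogisticRegression.
public class LogisticSgdSketch {
    final double[] beta;
    final double learningRate;

    public LogisticSgdSketch(int numFeatures, double learningRate) {
        this.beta = new double[numFeatures];
        this.learningRate = learningRate;
    }

    // p(label = 1 | x) under the current coefficients.
    public double predict(double[] x) {
        double dot = 0.0;
        for (int i = 0; i < beta.length; i++) {
            dot += beta[i] * x[i];
        }
        return 1.0 / (1.0 + Math.exp(-dot));
    }

    // One SGD step on a single (x, label) example, label in {0, 1}.
    public void train(double[] x, int label) {
        double error = label - predict(x);
        for (int i = 0; i < beta.length; i++) {
            beta[i] += learningRate * error * x[i];
        }
    }
}
```

A fixed learning rate is used here for simplicity; Mahout's AdaptiveLogisticRegression anneals it, which is what keeps large input weights from causing the unstable updates Ted warns about.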
