On Fri, Sep 9, 2011 at 8:41 AM, Loic Descotte <[email protected]> wrote:
> ... My goal is to make predictions on thousands of text entries, but with
> as little training data as possible (categories may change often, so I
> will not always have hundreds of training entries for each category).

This is very small with respect to Mahout algorithms. There may be better
options. The standard choice for small text datasets like this is a linear
SVM, but SGD should work reasonably well. Naive Bayes may not work as well
with such a small amount of training data. I would avoid the adaptive SGD
and tune the training parameters by hand.

> Another question: in all the examples I've found, Naive Bayes is used to
> analyze sets containing a lot of keywords and to classify them into the
> right category (e.g. the Wikipedia examples:
> https://www.ibm.com/developerworks/java/library/j-mahout/#N10412).
>
> The SGD examples are a little different: instead of working on word
> sequences, they use many predictor values, and each predictor has only
> one value per entry.

That is true in Chapter 13, where SGD is introduced. Later chapters
illustrate its use on the 20 newsgroups data.

> Is it possible to use the SGD algorithm (maybe better for me because I
> have small datasets) with text-only entries (like blog posts)?

Yes. This should work fine. (There is a rough sketch at the end of this
message.)

I would also consider the Luduan algorithm, which is not currently part of
Mahout, although all the pieces are there. The basic idea is that for each
binary decision you have three kinds of documents: unjudged documents,
judged relevant documents, and judged non-relevant documents. Luduan uses
the log-likelihood ratio test to compare the judged relevant and judged
non-relevant sets. This comparison gives a set of search terms that are
used with standard retrieval weighting such as tf-idf or BM25. Term
weights are determined by corpus frequencies without any explicit
reference to the frequencies in the judged relevant or non-relevant
documents. For some classification tasks with modest-sized training data,
this method outperforms most others. I can send a PDF with a more detailed
description. (A sketch of the term-selection step is also at the end of
this message.)

> Thanks a lot for your time; tell me if I'm not clear enough in my
> explanations :)

Please tell me the same.
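
To make the SGD suggestion concrete, here is a rough sketch of training on
raw text with Mahout's SGD logistic regression. Package names have moved
between releases (this follows the 0.5-era layout), and the class name,
feature count, learning rate, lambda, and sample posts are all placeholder
choices, not recommendations:

// Rough sketch only: a hashed bag-of-words encoding fed to Mahout's
// OnlineLogisticRegression. Constants and sample data are made up.
import org.apache.mahout.classifier.sgd.L1;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

public class TextSgdSketch {
  private static final int FEATURES = 10000;  // hashed feature space (placeholder)
  private static final int CATEGORIES = 5;    // hypothetical category count

  // Encode a post as a sparse vector of hashed word features.
  static Vector encode(String post, StaticWordValueEncoder encoder) {
    Vector v = new RandomAccessSparseVector(FEATURES);
    for (String word : post.toLowerCase().split("\\W+")) {
      encoder.addToVector(word, v);
    }
    return v;
  }

  public static void main(String[] args) {
    // Plain (non-adaptive) SGD with an L1 prior; tune rate and lambda by hand.
    OnlineLogisticRegression model =
        new OnlineLogisticRegression(CATEGORIES, FEATURES, new L1())
            .learningRate(20)
            .lambda(1e-4);
    StaticWordValueEncoder encoder = new StaticWordValueEncoder("words");

    int[] labels = {0, 1};                    // made-up training labels
    String[] posts = {"scala implicit conversions explained",
                      "tuning the mahout sgd classifier"};

    for (int pass = 0; pass < 20; pass++) {   // several passes help on tiny data
      for (int i = 0; i < posts.length; i++) {
        model.train(labels[i], encode(posts[i], encoder));
      }
    }

    // classifyFull returns one score per category.
    System.out.println(model.classifyFull(
        encode("implicit conversions in scala", encoder)));
  }
}

The L1 prior is a reasonable starting point here because it drives most
weights to zero, which is what you want when training examples are scarce.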

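To illustrate the Luduan term-selection step: Mahout's LogLikelihood
utility in org.apache.mahout.math.stats computes the 2x2 log-likelihood
ratio directly. The sketch below covers only the term-scoring part, with
made-up documents and a hypothetical class name; the retrieval side
(weighting the selected terms by tf-idf or BM25 from corpus frequencies)
is left out:

// Sketch of Luduan's term-selection step only: score terms by the 2x2
// log-likelihood ratio between judged relevant and judged non-relevant
// documents. Sample documents are made up; retrieval weighting is omitted.
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import org.apache.mahout.math.stats.LogLikelihood;

public class LuduanTermsSketch {
  // Document frequency: in how many docs of the set each term appears.
  static Map<String, Integer> docFreq(String[] docs) {
    Map<String, Integer> df = new HashMap<String, Integer>();
    for (String doc : docs) {
      for (String term : new HashSet<String>(
          Arrays.asList(doc.toLowerCase().split("\\W+")))) {
        Integer n = df.get(term);
        df.put(term, n == null ? 1 : n + 1);
      }
    }
    return df;
  }

  public static void main(String[] args) {
    String[] relevant = {"mahout sgd text classification",
                         "sgd logistic regression for text"};
    String[] nonRelevant = {"a recipe for pancakes",
                            "the weather forecast for monday"};

    Map<String, Integer> relDf = docFreq(relevant);
    Map<String, Integer> nonDf = docFreq(nonRelevant);

    // For each term seen in the relevant set:
    //   k11 = relevant docs with the term,  k12 = non-relevant docs with it,
    //   k21 = relevant docs without it,     k22 = non-relevant docs without it.
    for (Map.Entry<String, Integer> e : relDf.entrySet()) {
      long k11 = e.getValue();
      long k12 = nonDf.containsKey(e.getKey()) ? nonDf.get(e.getKey()) : 0;
      long k21 = relevant.length - k11;
      long k22 = nonRelevant.length - k12;
      double score = LogLikelihood.logLikelihoodRatio(k11, k12, k21, k22);
      System.out.println(e.getKey() + "\t" + score);
    }
    // The highest-scoring terms become the query; weight them with tf-idf
    // or BM25 computed from corpus frequencies, as described above.
  }
}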