It is blowing my own horn to some extent, but take a look at the Mahout in Action book.
http://www.manning.com/owen/

Also, there are several articles with examples for the Naive Bayesian classifiers.

On Sun, May 22, 2011 at 2:08 PM, Keith Thompson <[email protected]> wrote:

> Hi Ted,
>
> Thanks for your help. I have to learn Mahout on my own for a project I am
> doing. I thought I would just "learn by doing" using readily available data
> sets to learn how the software works (even though the data set is small).
> Unfortunately, there doesn't seem to be any documentation that says for
> algorithm X, Mahout requires input in format Y. The API seems helpful only
> if you already know that information. If you know of any resources that
> document this type of thing, I would be grateful to know what they are. Of
> course, maybe the fact that I am not a CS person doesn't help either :-)
>
> On Sun, May 22, 2011 at 4:43 PM, Ted Dunning <[email protected]> wrote:
>
> > First step is to decide what the data is.
> >
> > To me it looks like you have 33 columns with integer values in the range
> > from 0 through 3. The 34th column has integers up to 75. The 35th column
> > has integers in the range from 1 to 6.
> >
> > These values are either numbers or category codes.
> >
> > If you want to use the Naive Bayes algorithm, then they need to be category
> > codes. To process these, you need to convert each value into a "word". My
> > tendency would be to prefix the value with X12- where the 12 is the column
> > number. This makes it so values in one column are not confused with values
> > in another. For column 34, I would pick some cut points and encode that way
> > (deciles or quartiles might be good). Data can be in text form for the
> > NaiveBayes categorizer.
> >
> > For the SGD categorizers, you need to code up a feature vector encoder.
> > Look at the FeatureVectorEncoder and sub-classes for hints about this. You
> > will need 35 encoders, one for each column. You can probably use a pretty
> > small feature vector.
> >
> > This problem is very small, with only 366 data points. As such, Mahout is
> > probably not a particularly good choice for solving your problem. Mahout is
> > optimized for cases where the training data doesn't fit into memory and uses
> > first-order methods. With a small data set like this, you can use all kinds
> > of second-order methods to get potentially better results.
> >
> > On Sun, May 22, 2011 at 12:10 PM, Keith Thompson <[email protected]> wrote:
> >
> > > If I have some numerical data (e.g., the data at
> > > http://archive.ics.uci.edu/ml/machine-learning-databases/dermatology/dermatology.data)
> > > and want to run a Mahout classification algorithm on that data, what steps
> > > do I need to take in order to put the data into the correct input format? I
> > > have read that most everything requires a sequence file but I'm not sure
> > > that I still understand what that is. Do I need to provide a key for each
> > > row in this dataset (and the rest of the row sans the final column would be
> > > the value)?
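
For what it's worth, the "word" encoding described in the quoted reply (prefix each value with its column number, and bin column 34 at cut points) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not Mahout code: the function names, the Q0..Q3 bin labels, the NA token for non-numeric entries, and the tab-separated "label, then words" output layout are all made up for the example, so check what the trainer in your Mahout version actually expects.

```python
import csv
import io


def cut_points(values, n_bins=4):
    """Return n_bins - 1 cut points taken from the sorted values (quartiles by default)."""
    ordered = sorted(values)
    return [ordered[(len(ordered) * k) // n_bins] for k in range(1, n_bins)]


def bin_index(value, cuts):
    """Number of cut points that value reaches, i.e. its bin from 0 to len(cuts)."""
    return sum(value >= c for c in cuts)


def encode_row(row, cuts, binned_col=34):
    """Turn one CSV row into (label, words); columns are numbered from 1.

    The last field is treated as the class label. Every other field becomes
    a token of the form X<column>-<value>, so equal values in different
    columns never collide.
    """
    *features, label = row
    words = []
    for col, raw in enumerate(features, start=1):
        raw = raw.strip()
        if col == binned_col:
            # Assumption: a non-numeric entry gets its own NA token.
            raw = f"Q{bin_index(int(raw), cuts)}" if raw.isdigit() else "NA"
        words.append(f"X{col}-{raw}")
    return label.strip(), words


if __name__ == "__main__":
    # Tiny made-up sample: two code columns, one numeric column, one label.
    sample = "2,0,55,2\n3,1,8,1\n2,3,26,3\n1,0,40,1\n"
    rows = list(csv.reader(io.StringIO(sample)))
    cuts = cut_points([int(r[2]) for r in rows if r[2].isdigit()])
    for row in rows:
        label, words = encode_row(row, cuts, binned_col=3)
        print(label + "\t" + " ".join(words))
```

For the real file you would read dermatology.data instead of the inline sample, compute the cut points from column 34, and leave binned_col at its default of 34. Passing n_bins=10 gives deciles rather than quartiles.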
