Thanks. I appreciate it. I think I need to buy a copy of that!

On Sun, May 22, 2011 at 8:21 PM, Ted Dunning <[email protected]> wrote:
> It is blowing my own horn to some extent, but take a look at the Mahout in
> Action book.
>
> http://www.manning.com/owen/
>
> Also, there are several articles with examples for the Naive Bayesian
> classifiers.
>
> On Sun, May 22, 2011 at 2:08 PM, Keith Thompson <[email protected]> wrote:
>
> > Hi Ted,
> >
> > Thanks for your help. I have to learn Mahout on my own for a project I am
> > doing. I thought I would just "learn by doing" using readily available data
> > sets to learn how the software works (even though the data set is small).
> > Unfortunately, there doesn't seem to be any documentation that says for
> > algorithm X, Mahout requires input in format Y. The API seems helpful only
> > if you already know that information. If you know of any resources that
> > document this type of thing, I would be grateful to know what they are. Of
> > course, maybe the fact that I am not a CS person doesn't help either :-)
> >
> > On Sun, May 22, 2011 at 4:43 PM, Ted Dunning <[email protected]> wrote:
> >
> > > First step is to decide what the data is.
> > >
> > > To me it looks like you have 33 columns with integer values in the range
> > > from 0 through 3. The 34th column has integers up to 75. The 35th column
> > > has integers in the range from 1 to 6.
> > >
> > > These values are either numbers or category codes.
> > >
> > > If you want to use the Naive Bayes algorithm, then they need to be category
> > > codes. To process these, you need to convert each value into a "word". My
> > > tendency would be to prefix the value with X12- where the 12 is the column
> > > number. This makes it so values in one column are not confused with values
> > > in another. For column 34, I would pick some cut points and encode that way
> > > (deciles or quartiles might be good). Data can be in text form for the
> > > NaiveBayes categorizer.
> > >
> > > For the SGD categorizers, you need to code up a feature vector encoder.
> > > Look at the FeatureValueEncoder and sub-classes for hints about this. You
> > > will need 35 encoders, one for each column. You can probably use a pretty
> > > small feature vector.
> > >
> > > This problem is very small, with only 366 data points. As such, Mahout is
> > > probably not a particularly good choice for solving your problem. Mahout is
> > > optimized for cases where the training data doesn't fit into memory and uses
> > > first order methods. With a small data-set like this, you can use all kinds
> > > of second-order methods to get potentially better results.
> > >
> > > On Sun, May 22, 2011 at 12:10 PM, Keith Thompson <[email protected]> wrote:
> > >
> > > > If I have some numerical data (e.g., the data at
> > > > http://archive.ics.uci.edu/ml/machine-learning-databases/dermatology/dermatology.data)
> > > > and want to run a Mahout classification algorithm on that data, what steps
> > > > do I need to take in order to put the data into the correct input format? I
> > > > have read that most everything requires a sequence file but I'm not sure
> > > > that I still understand what that is. Do I need to provide a key for each
> > > > row in this dataset (and the rest of the row sans the final column would be
> > > > the value)?
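
To make the category-code encoding described in the quoted thread concrete, here is a minimal Python sketch (plain Python, not Mahout code). It prefixes each value with its column number and bins column 34 at quartile cut points. The function names, the q0..q3 bin labels, the "missing" token for non-numeric entries, and the assumption that column 35 is the class label are all illustrative choices, not anything Mahout prescribes.

```python
from bisect import bisect_right
from statistics import quantiles

BINNED_COL = 34  # 1-based column with the wide integer range (per the thread)
LABEL_COL = 35   # 1-based column assumed here to hold the class label


def quartile_cuts(values):
    """Return the three quartile cut points for a list of numbers."""
    return quantiles(values, n=4)


def encode_row(fields, cuts):
    """Turn one row of string fields into (label, column-prefixed words)."""
    words = []
    for col, raw in enumerate(fields[:LABEL_COL - 1], start=1):
        if col == BINNED_COL:
            try:
                # Bin index 0..3: how many cut points are <= the value.
                token = "q%d" % bisect_right(cuts, int(raw))
            except ValueError:
                token = "missing"  # hypothetical handling of non-numeric entries
        else:
            token = raw.strip()
        # The column prefix keeps a "2" in column 1 distinct from a "2" in column 12.
        words.append("X%d-%s" % (col, token))
    return fields[LABEL_COL - 1].strip(), words


if __name__ == "__main__":
    # Made-up values for illustration only, not taken from the real data set.
    cuts = quartile_cuts([10, 20, 30, 40, 50, 60, 70])
    label, words = encode_row(["2"] * 33 + ["35", "3"], cuts)
    print(label, " ".join(words))
```

In practice the cut points would be computed from the numeric values of column 34 across the whole file, and each encoded row written out as a line of text. The exact line layout a given Mahout trainer expects is not covered by this sketch.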
