The first step is to decide what the data is.

To me it looks like you have 33 columns with integer values in the range
from 0 through 3.  The 34th column has integers up to 75.  The 35th column
has integers in the range from 1 to 6.

These values are either numbers or category codes.

If you want to use the Naive Bayes algorithm, then they need to be category
codes.  To process these, you need to convert each value into a "word".  My
tendency would be to prefix the value with X12- where the 12 is the column
number.  That way, identical values in different columns are not confused
with each other.  For column 34, I would pick some cut points and encode
each value by the bin it falls into (deciles or quartiles might be good).
The Naive Bayes classifier can take its input in this text form.
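A minimal sketch of that encoding, assuming the first 33 columns hold codes 0 through 3 and column 34 is the larger-valued column.  The quartile cut points here are made up for illustration; you would compute real ones from your data.

```java
import java.util.ArrayList;
import java.util.List;

public class RowEncoder {
    // Hypothetical quartile cut points for column 34 -- replace with
    // boundaries computed from your actual data.
    private static final int[] CUTS = {10, 25, 40};

    // Turns one data row (33 code columns plus column 34) into a list
    // of category "words" suitable for the Naive Bayes trainer.
    public static List<String> encode(int[] row) {
        List<String> words = new ArrayList<>();
        for (int i = 0; i < 33; i++) {
            // Prefix with the column number so the same code in two
            // different columns becomes two different words.
            words.add("X" + (i + 1) + "-" + row[i]);
        }
        // Bin column 34 by its cut points.
        int bin = 0;
        while (bin < CUTS.length && row[33] >= CUTS[bin]) {
            bin++;
        }
        words.add("X34-Q" + (bin + 1));
        return words;
    }

    public static void main(String[] args) {
        int[] row = new int[34];
        row[0] = 2;    // column 1 holds code 2
        row[33] = 30;  // column 34 value 30 falls in the third bin
        List<String> words = RowEncoder.encode(row);
        System.out.println(words.get(0));   // X1-2
        System.out.println(words.get(33));  // X34-Q3
    }
}
```

The prefix is what keeps, say, a 2 in column 5 distinct from a 2 in column 17 once everything is flattened into a bag of words.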

For the SGD classifiers, you need to code up a feature vector encoder.
Look at FeatureVectorEncoder and its sub-classes for hints about this.  You
will need 35 encoders, one for each column.  You can probably use a pretty
small feature vector.
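To show the idea without dragging in Mahout dependencies, here is a hand-rolled sketch of what such an encoder does, using the hashing trick that Mahout's encoder classes are built on.  This is an illustration of the technique, not the Mahout API itself; the class and method names are my own.

```java
public class HashedEncoder {
    private final int numFeatures;

    public HashedEncoder(int numFeatures) {
        this.numFeatures = numFeatures;
    }

    // Hashes a (column, value) pair to a slot in a fixed-size vector
    // and adds weight 1.0 there.  Collisions are allowed; with sparse
    // categorical data like this, a fairly small vector is tolerable.
    public void addToVector(int column, String value, double[] vector) {
        String word = "X" + column + "-" + value;
        int index = Math.floorMod(word.hashCode(), numFeatures);
        vector[index] += 1.0;
    }

    public static void main(String[] args) {
        HashedEncoder encoder = new HashedEncoder(100);
        double[] v = new double[100];
        // Encode the first few code columns of one made-up row.
        int[] row = {2, 0, 1};
        for (int i = 0; i < row.length; i++) {
            encoder.addToVector(i + 1, Integer.toString(row[i]), v);
        }
        double sum = 0;
        for (double x : v) sum += x;
        System.out.println(sum);  // 3.0 -- one unit of weight per column
    }
}
```

In real use you would feed the resulting vector, along with the class label from column 35, to the SGD trainer; the same column-prefix trick as above keeps the columns from colliding by name.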

This problem is very small, with only 366 data points.  As such, Mahout is
probably not a particularly good choice for solving your problem.  Mahout is
optimized for cases where the training data doesn't fit into memory and uses
first-order methods.  With a small data set like this, you can use all kinds
of second-order methods to get potentially better results.



On Sun, May 22, 2011 at 12:10 PM, Keith Thompson <[email protected]> wrote:

> If I have some numerical data (e.g., the data at
>
> http://archive.ics.uci.edu/ml/machine-learning-databases/dermatology/dermatology.data
> )
> and want to run a Mahout classification algorithm on that data, what steps
> do I need to take in order to put the data into the correct input format?
>  I
> have read that most everything requires a sequence file but I'm not sure
> that I still understand what that is.  Do I need to provide a key for each
> row in this dataset (and the rest of the row sans the final column would be
> the value)?
>
