Re: file input formats

Ted Dunning Sun, 22 May 2011 17:22:42 -0700

It is blowing my own horn to some extent, but take a look at the Mahout in
Action book.


http://www.manning.com/owen/

Also, there are several articles with examples for the Naive Bayesian
classifiers.

On Sun, May 22, 2011 at 2:08 PM, Keith Thompson <[email protected]> wrote:

> Hi Ted,
>
> Thanks for your help.  I have to learn Mahout on my own for a project I am
> doing.  I thought I would just "learn by doing" using readily available
> data sets to learn how the software works (even though the data set is
> small).  Unfortunately, there doesn't seem to be any documentation that
> says for algorithm X, Mahout requires input in format Y.  The API seems
> helpful only if you already know that information.  If you know of any
> resources that document this type of thing, I would be grateful to know
> what they are.  Of course, maybe the fact that I am not a CS person doesn't
> help either :-)
>
>
> On Sun, May 22, 2011 at 4:43 PM, Ted Dunning <[email protected]>
> wrote:
>
> > First step is to decide what the data is.
> >
> > To me it looks like you have 33 columns with integer values in the range
> > from 0 through 3.  The 34th column has integers up to 75.  The 35th
> > column has integers in the range from 1 to 6.
> >
> > These values are either numbers or category codes.
> >
> > If you want to use the Naive Bayes algorithm, then they need to be
> > category codes.  To process these, you need to convert each value into a
> > "word".  My tendency would be to prefix the value with X12- where the 12
> > is the column number.  This makes it so values in one column are not
> > confused with values in another.  For column 34, I would pick some cut
> > points and encode that way (deciles or quartiles might be good).  Data
> > can be in text form for the NaiveBayes categorizer.
> >
> > For the SGD categorizers, you need to code up a feature vector encoder.
> > Look at the FeatureVectorEncoder and sub-classes for hints about this.
> > You will need 35 encoders, one for each column.  You can probably use a
> > pretty small feature vector.
> >
> > This problem is very small, with only 366 data points.  As such, Mahout
> > is probably not a particularly good choice for solving your problem.
> > Mahout is optimized for cases where the training data doesn't fit into
> > memory and uses first-order methods.  With a small data set like this,
> > you can use all kinds of second-order methods to get potentially better
> > results.
> >
> >
> >
> > On Sun, May 22, 2011 at 12:10 PM, Keith Thompson <[email protected]>
> > wrote:
> >
> > > If I have some numerical data (e.g., the data at
> > > http://archive.ics.uci.edu/ml/machine-learning-databases/dermatology/dermatology.data)
> > > and want to run a Mahout classification algorithm on that data, what
> > > steps do I need to take in order to put the data into the correct input
> > > format?  I have read that most everything requires a sequence file but
> > > I'm not sure that I still understand what that is.  Do I need to provide
> > > a key for each row in this dataset (and the rest of the row sans the
> > > final column would be the value)?
> > >
> >
>
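To make the Naive Bayes encoding quoted above concrete, here is a minimal
sketch in plain Python (not Mahout code).  It prefixes each value with its
column number and replaces column 34 with a quartile bin.  The function name,
the "q0".."q3" bin labels, the "missing" token for non-numeric entries, and
the sample rows are all assumptions made for illustration; the tab-separated
"label, then words" output is just one convenient text layout, not a format
taken from the Mahout documentation.

```python
from bisect import bisect_right
from statistics import quantiles


def encode_row(fields, cuts, binned_col=34):
    """Split off the class label and turn the other columns into words.

    fields     -- one row as a list of strings; the last entry is the class
    cuts       -- sorted cut points used to bin the numeric column
    binned_col -- 1-based index of the column to bin (34 in the thread)
    """
    feature_fields, label = fields[:-1], fields[-1]
    tokens = []
    for col, raw in enumerate(feature_fields, start=1):
        if col == binned_col:
            if raw.isdigit():
                # Bin index = number of cut points at or below the value.
                value = "q%d" % bisect_right(cuts, int(raw))
            else:
                # Assumption: anything non-numeric is treated as missing.
                value = "missing"
        else:
            value = raw
        # The column prefix keeps a "2" in column 12 distinct from a "2"
        # in column 13.
        tokens.append("X%d-%s" % (col, value))
    return label, tokens


# Hypothetical rows shaped like the data set: 33 codes, one age, one class.
rows = [
    ["2"] * 33 + ["55", "2"],
    ["0"] * 33 + ["8", "1"],
    ["1"] * 33 + ["?", "3"],
    ["3"] * 33 + ["26", "1"],
    ["1"] * 33 + ["40", "3"],
]

# Quartile cut points computed from the numeric entries of column 34.
ages = [int(r[33]) for r in rows if r[33].isdigit()]
cuts = quantiles(ages, n=4)

for r in rows:
    label, tokens = encode_row(r, cuts)
    print(label + "\t" + " ".join(tokens))
```

Swapping n=4 for n=10 in the quantiles call gives deciles instead of
quartiles.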

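For the SGD route, the idea behind a feature vector encoder can be sketched
the same way.  The code below is plain Python and does not use Mahout's
classes; it only illustrates hashing each (column, value) pair into a slot of
a small fixed-size vector.  The vector size of 100, the CRC32 hash, the
choice to treat column 34 as a numeric weight, and keeping the class column
out of the vector as the target are all assumptions of this sketch.

```python
import zlib

VECTOR_SIZE = 100  # hypothetical; the thread only says "pretty small"


def slot(name, value, size):
    """Map a (column name, value) pair to a vector index with a stable hash."""
    key = ("%s:%s" % (name, value)).encode("utf-8")
    return zlib.crc32(key) % size


def encode_features(fields, size=VECTOR_SIZE, numeric_cols=(34,)):
    """Hash one row's feature columns (class column excluded) into a vector.

    Categorical columns add 1.0 at the slot for their (column, value) pair.
    Numeric columns add their value at a single slot for the column.
    Colliding features simply add up in the same slot.
    """
    vector = [0.0] * size
    for col, raw in enumerate(fields, start=1):
        name = "X%d" % col
        if col in numeric_cols:
            # Assumption: non-numeric entries are skipped as missing.
            if raw.isdigit():
                vector[slot(name, "", size)] += float(raw)
        else:
            vector[slot(name, raw, size)] += 1.0
    return vector


# Hypothetical row: 33 category codes followed by the age column.
features = ["2"] * 33 + ["55"]
vec = encode_features(features)
print(len(vec), sum(vec))
```

Because every column contributes to a hashed slot rather than a reserved
position, the vector length stays fixed no matter how many distinct values
show up in the data.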