Re: file input formats

Keith Thompson Sun, 22 May 2011 19:01:11 -0700

Thanks. I appreciate it.  I think I need to buy a copy of that!

On Sun, May 22, 2011 at 8:21 PM, Ted Dunning <[email protected]> wrote:


> It is blowing my own horn to some extent, but take a look at the Mahout in
> Action book.
>
> http://www.manning.com/owen/
>
> Also, there are several articles with examples for the Naive Bayesian
> classifiers.
>
> On Sun, May 22, 2011 at 2:08 PM, Keith Thompson <[email protected]> wrote:
>
> > Hi Ted,
> >
> > Thanks for your help.  I have to learn Mahout on my own for a project I
> > am doing.  I thought I would just "learn by doing" using readily
> > available data sets to learn how the software works (even though the data
> > set is small).  Unfortunately, there doesn't seem to be any documentation
> > that says for algorithm X, Mahout requires input in format Y.  The API
> > seems helpful only if you already know that information.  If you know of
> > any resources that document this type of thing, I would be grateful to
> > know what they are.  Of course, maybe the fact that I am not a CS person
> > doesn't help either :-)
> >
> >
> > On Sun, May 22, 2011 at 4:43 PM, Ted Dunning <[email protected]> wrote:
> >
> > > First step is to decide what the data is.
> > >
> > > To me it looks like you have 33 columns with integer values in the
> > > range from 0 through 3.  The 34th column has integers up to 75.  The
> > > 35th column has integers in the range from 1 to 6.
> > >
> > > These values are either numbers or category codes.
> > >
> > > If you want to use the Naive Bayes algorithm, then they need to be
> > > category codes.  To process these, you need to convert each value into
> > > a "word".  My tendency would be to prefix the value with X12- where the
> > > 12 is the column number.  This makes it so values in one column are not
> > > confused with values in another.  For column 34, I would pick some cut
> > > points and encode that way (deciles or quartiles might be good).  Data
> > > can be in text form for the NaiveBayes categorizer.
> > >
> > > For the SGD categorizers, you need to code up a feature vector encoder.
> > > Look at the FeatureVectorEncoder and sub-classes for hints about this.
> > > You will need 35 encoders, one for each column.  You can probably use a
> > > pretty small feature vector.
> > >
> > > This problem is very small, with only 366 data points.  As such, Mahout
> > > is probably not a particularly good choice for solving your problem.
> > > Mahout is optimized for cases where the training data doesn't fit into
> > > memory and uses first-order methods.  With a small data set like this,
> > > you can use all kinds of second-order methods to get potentially better
> > > results.
> > >
> > >
> > >
> > > On Sun, May 22, 2011 at 12:10 PM, Keith Thompson <[email protected]> wrote:
> > >
> > > > If I have some numerical data (e.g., the data at
> > > > http://archive.ics.uci.edu/ml/machine-learning-databases/dermatology/dermatology.data)
> > > > and want to run a Mahout classification algorithm on that data, what
> > > > steps do I need to take in order to put the data into the correct input
> > > > format?  I have read that most everything requires a sequence file but
> > > > I'm not sure that I still understand what that is.  Do I need to provide
> > > > a key for each row in this dataset (and the rest of the row sans the
> > > > final column would be the value)?
> > > >
> > >
> >
>
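
For anyone following along, the "word" encoding Ted describes for the Naive
Bayes case might be sketched roughly as below.  This is plain Python, not
Mahout code, and it only shows the text preprocessing step.  Several details
are assumptions rather than anything stated in the thread: the rows are
comma-separated with the class label in the last column, a "?" marks a
missing value in the binned column, and the names encode_row,
encode_dataset, and the "q0".."q3" / "missing" tokens are made up for
illustration.  Quartiles are used for the cut points; deciles would work the
same way.

```python
import statistics


def bin_index(value, cuts):
    """Return the index of the first bin whose upper cut point is >= value."""
    for i, cut in enumerate(cuts):
        if value <= cut:
            return i
    return len(cuts)


def encode_row(fields, binned_cuts, binned_col):
    """Turn one row of feature values (label excluded) into word tokens.

    Every value is prefixed with X<column>- so that equal values in
    different columns produce different words.
    """
    words = []
    for col, raw in enumerate(fields, start=1):
        if col == binned_col:
            if raw == "?":
                # Assumption: "?" marks a missing value in this column.
                token = "missing"
            else:
                token = "q" + str(bin_index(float(raw), binned_cuts))
        else:
            token = raw
        words.append("X{}-{}".format(col, token))
    return words


def encode_dataset(lines, binned_col=34):
    """Return a list of (label, words) pairs for comma-separated rows.

    The last field of each row is taken as the class label.  The column
    given by binned_col (1-based) is replaced by its quartile bin.
    """
    rows = [line.strip().split(",") for line in lines if line.strip()]
    known = [float(r[binned_col - 1]) for r in rows if r[binned_col - 1] != "?"]
    cuts = statistics.quantiles(known, n=4)  # three quartile cut points
    return [(r[-1], encode_row(r[:-1], cuts, binned_col)) for r in rows]


if __name__ == "__main__":
    # Toy rows with two feature columns and a label; column 2 is binned.
    sample = ["2,10,1", "0,20,2", "3,30,1", "1,40,3", "2,?,2"]
    for label, words in encode_dataset(sample, binned_col=2):
        print(label, " ".join(words))
```

Each output line is a label followed by space-separated words, which is the
kind of text form mentioned above for the Naive Bayes categorizer.  How the
label and text must be laid out for a particular Mahout version is not
covered here.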
