The thing to look at is the encoder framework in org.apache.mahout.vectorizer.encoders.
See for instance
https://github.com/apache/mahout/blob/trunk/core/src/main/java/org/apache/mahout/vectorizer/encoders/StaticWordValueEncoder.java

Chapter 14 of Mahout in Action describes the process in more detail. There are examples in the Mahout distribution as well.

On Thu, Aug 16, 2012 at 5:41 PM, Chandra Mohan, Ananda Vel Murugan <[email protected]> wrote:

> Hi,
>
> Almost all my data in CSV file is categorical data. Can you elaborate what
> you mean by fancier footwork? Should I convert categories into some numbers
> and store in vector? Thanks!!
>
> -----Original Message-----
> From: Ted Dunning [mailto:[email protected]]
> Sent: Thursday, August 16, 2012 8:08 PM
> To: [email protected]
> Cc: [email protected]
> Subject: Re: Encoding and vectorizing
>
> If your data is dense and numerical, then you don't need anything but
> trivial encoding. Just copy the values from your CSV file into the vector,
> converting to numbers as you go. If some of your data are categorical or
> textual, you will need fancier footwork.
>
> On Thu, Aug 16, 2012 at 3:28 AM, Chandra Mohan, Ananda Vel Murugan
> <[email protected]> wrote:
>
> > I am a beginner in mahout with not much background in math. I want to
> > know what is encoder and vectorizer in mahout.
> >
> > As far I know vector can be thought of as an array or tuple containing
> > values for a specific attribute of the object which vector represents.
> >
> > I have testcell data for mechanical component testing. I create a CSV
> > file with various details gathered from test cell database. I want to run
> > logistic regression on this data and predict the components life based on
> > test cell data. I want to understand what is vectorization and encoding
> > in this context.
> >
> > Any help would be greatly appreciated.
> >
> > Regards,
> > Anand.C
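
To make the difference between "trivial encoding" and "fancier footwork" in this thread a little more concrete, here is a minimal sketch in plain Java of the general hashed-encoding idea for a CSV row that mixes categorical and numeric columns. It is an illustration only and not Mahout's code: the class HashedEncodingSketch, its methods, the column names (material, supplier, load), and the use of String.hashCode are all assumptions made for this example. The real encoders in org.apache.mahout.vectorizer.encoders write into a Mahout Vector and differ in their details.

```java
import java.util.Arrays;

// Illustration only: a plain-Java sketch of hashed feature encoding.
// This is NOT Mahout's implementation; every name here is hypothetical.
final class HashedEncodingSketch {

    // Pick a slot in the vector for a feature name by hashing it.
    static int slot(String name, int size) {
        return Math.floorMod(name.hashCode(), size);
    }

    // Categorical value: hash "column:value" and add 1.0 at that slot.
    static void addCategorical(String column, String value, double[] vector) {
        vector[slot(column + ":" + value, vector.length)] += 1.0;
    }

    // Numeric value: hash the column name only and add the number itself.
    static void addContinuous(String column, double value, double[] vector) {
        vector[slot(column, vector.length)] += value;
    }

    // Encode one CSV row; numeric[i] says whether column i holds a number.
    static double[] encodeRow(String[] header, String[] row,
                              boolean[] numeric, int size) {
        double[] vector = new double[size];
        for (int i = 0; i < header.length; i++) {
            String cell = row[i].trim();
            if (numeric[i]) {
                addContinuous(header[i], Double.parseDouble(cell), vector);
            } else {
                addCategorical(header[i], cell, vector);
            }
        }
        return vector;
    }

    public static void main(String[] args) {
        // Hypothetical test-cell columns and values, not taken from real data.
        String[] header = {"material", "supplier", "load"};
        String[] row = {"steel", "acme", "12.5"};
        boolean[] numeric = {false, false, true};
        System.out.println(Arrays.toString(encodeRow(header, row, numeric, 16)));
    }
}
```

In this sketch two features can hash to the same slot, in which case their weights simply add up; a larger vector size makes such collisions less likely. The point is only that categories are not replaced by arbitrary numbers such as 1, 2, 3, but each (column, value) pair gets its own position in the vector.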
