You need to create a Mahout distributed row matrix, which is one or more
SequenceFiles of key/value pairs:
<IntWritable>: <VectorWritable>

The vector holds all your values, and the IntWritable is the Mahout ID/key
for that vector. It is a positive ordinal. Usually it corresponds to some ID
you already have for the vector, so you create a Mahout int for each new
vector and put it in a dictionary that maps your ID to and from the Mahout ID.
After clustering, you map the Mahout ID back to yours.
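
A minimal sketch of such a dictionary, in plain Java with no Mahout classes.
IdDictionary, keyFor and externalIdFor are hypothetical names used only for
illustration, and starting the keys at 0 is an assumption (shift by one if you
need strictly positive keys):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Mahout: maps your own string IDs to
// consecutive int keys and back again.
class IdDictionary {
    private final Map<String, Integer> toMahout = new HashMap<>();
    private final List<String> toExternal = new ArrayList<>();

    // Returns the int key already assigned to externalId, or assigns the next one.
    // Assumption: keys start at 0 and grow by one per new ID.
    int keyFor(String externalId) {
        Integer key = toMahout.get(externalId);
        if (key == null) {
            key = toExternal.size();
            toMahout.put(externalId, key);
            toExternal.add(externalId);
        }
        return key;
    }

    // Maps an int key read from the clustering output back to your own ID.
    String externalIdFor(int key) {
        return toExternal.get(key);
    }

    // Number of distinct IDs seen so far.
    int size() {
        return toExternal.size();
    }
}
```

You would call keyFor(yourId) while writing each vector and use the result as
the IntWritable key, then call externalIdFor(key) on the keys you read back
after clustering. The dictionary has to be saved somewhere if the two steps run
in separate processes.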

The VectorWritable is created from a Vector. For the data as you describe it
you would use the DenseVector implementation. If you have a lot of 0s you may
want to give your columns Mahout IDs too and use sparse vectors to create a
sparse matrix. All missing values are assumed to be 0. This may improve
performance. You can also use an implementation of Vector called NamedVector,
which lets you put your ID in the vector as a string so you can follow the
vector through the calculations.
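
To illustrate the dense versus sparse choice, here is a sketch in plain Java
that parses one row and keeps only its non-zero columns. It assumes each row is
a line of whitespace-separated numbers, as in the sample in the question, and
RowParsing and its methods are hypothetical names. In Mahout the dense array
could be passed to new DenseVector(double[]), and the non-zero entries could be
set one by one on a RandomAccessSparseVector with set(index, value):

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Mahout.
class RowParsing {
    // Parses one whitespace-separated row such as "0 0 1 0" into a dense array.
    // Assumption: the line is non-empty and every token is a valid number.
    static double[] parseDense(String line) {
        String[] tokens = line.trim().split("\\s+");
        double[] values = new double[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            values[i] = Double.parseDouble(tokens[i]);
        }
        return values;
    }

    // Keeps only the non-zero entries, keyed by column index.
    static Map<Integer, Double> toSparse(double[] dense) {
        Map<Integer, Double> sparse = new TreeMap<>();
        for (int i = 0; i < dense.length; i++) {
            if (dense[i] != 0.0) {
                sparse.put(i, dense[i]);
            }
        }
        return sparse;
    }

    // Reads a column back; a missing column is treated as 0.
    static double get(Map<Integer, Double> sparse, int column) {
        return sparse.getOrDefault(column, 0.0);
    }
}
```

With only 38 columns per row the dense form is already small, so the sparse
form is mainly worth it when most entries are 0.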


On May 24, 2014, at 11:35 AM, Adri Gómez <[email protected]> wrote:

Hello.

First, sorry for my English.

I'm a noob in Mahout and Hadoop. I want to run k-means clustering on Hadoop
in pseudo-distributed mode. I have 5 million vectors in a .mat file,
with 38 numeric features for each vector, like this: 0 0 1 0 0 0 0 0 0 0 0
0 ...

I've run the examples that I've found, like Reuters (
https://mahout.apache.org/users/clustering/k-means-clustering.html) or
synthetic data. I know I have to convert these vectors to a SequenceFile, but
I don't know if I have to do anything else first.

I'm using Mahout 0.7 and Hadoop 1.2.1.

Thanks.

-- 
*Gómez Muñoz, Adrián.*
