You need to create a Mahout distributed row matrix, which is one or more SequenceFiles of <IntWritable, VectorWritable> pairs.
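Before the SequenceFile can be written, each input row needs an integer key and a parsed array of values. Here is a minimal sketch of that preparation step in plain Java, with no Mahout or Hadoop classes involved. It assumes the input is text with whitespace-separated numbers, one vector per line. RowPreparer and its methods are hypothetical names, not part of Mahout, and handing out keys from 0 upward is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Mahout: prepares rows for a
// <IntWritable, VectorWritable> SequenceFile using only the JDK.
final class RowPreparer {
    // Your own ID -> integer key used in the SequenceFile.
    private final Map<String, Integer> toKey = new HashMap<>();
    // Integer key -> your own ID (the list index is the key).
    private final List<String> fromKey = new ArrayList<>();

    // Returns the integer key for an ID, assigning the next ordinal on first use.
    // Keys start at 0 here; that starting value is an assumption of this sketch.
    int keyFor(String ownId) {
        Integer key = toKey.get(ownId);
        if (key == null) {
            key = fromKey.size();
            toKey.put(ownId, key);
            fromKey.add(ownId);
        }
        return key;
    }

    // Maps an integer key back to your own ID, e.g. after clustering.
    String ownIdFor(int key) {
        return fromKey.get(key);
    }

    // Parses one non-empty, whitespace-separated line such as "0 0 1 0" into values.
    static double[] parseRow(String line) {
        String[] tokens = line.trim().split("\\s+");
        double[] values = new double[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            values[i] = Double.parseDouble(tokens[i]);
        }
        return values;
    }

    // Fraction of entries equal to 0; a high value suggests a sparse vector may pay off.
    static double zeroFraction(double[] values) {
        int zeros = 0;
        for (double v : values) {
            if (v == 0.0) {
                zeros++;
            }
        }
        return values.length == 0 ? 0.0 : (double) zeros / values.length;
    }
}
```

Each double[] can then be passed to a DenseVector, wrapped in a VectorWritable, and appended to the SequenceFile under an IntWritable holding the key. That last step is left out here because it needs the Mahout and Hadoop jars on the classpath.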
The vector holds all your values; the IntWritable key holds the Mahout ID for the vector. It is a positive ordinal. Usually it corresponds to some ID you already have for the vector, so you create a Mahout int for each new vector and put it in a dictionary that maps your ID to and from the Mahout ID. After clustering, you map the Mahout ID back to yours.

The VectorWritable is created from a Vector. As you have described things, you would use the DenseVector implementation. If you have a lot of 0s, you may want to give your columns Mahout IDs too and use sparse vectors to create a sparse matrix; all missing values are assumed to be 0. This may improve performance.

Mahout also provides an implementation of Vector called NamedVector, which lets you put your ID in the vector as a string so that it follows the vector through the calculations.

On May 24, 2014, at 11:35 AM, Adri Gómez wrote:

Hello. First, sorry for my English.

I'm a noob in Mahout and Hadoop. I want to run k-means clustering on Hadoop in pseudo-distributed mode. I have 5 million vectors in a .mat file, with 38 numeric features for each vector, like this:

0 0 1 0 0 0 0 0 0 0 0 0 ...

I've run the examples that I've found, like Reuters (https://mahout.apache.org/users/clustering/k-means-clustering.html) or synthetic data. I know I have to convert these vectors to a SequenceFile, but I don't know if I have to do something more before that.

I'm using Mahout 0.7 and Hadoop 1.2.1.

Thanks.

--
*Gómez Muñoz, Adrián.*
