Hello,
I have been trying to use Mahout k-means to cluster some synthetic
data (the A3 set from <https://cs.joensuu.fi/sipu/datasets/>) as a
testing reference, but it seems I'm missing something. I'm quite a
beginner, so please bear with me. Before all the explanations, what I
want to know is:
A) Is there a better way to cluster a series of numeric points while
keeping their id values?
B) What am I doing wrong with this approach?
Thanks for your patience.
This is a sample of the data:
point1,53920,42968
point2,52019,42206
point3,52570,42476
point4,54220,42081
point5,54268,43420
point6,52288,42408
point7,54436,39727
point8,52391,44323
point9,54995,43655
point10,53761,43403
As you can see, it is in the format id_point,value1,value2 (the comma
doesn't matter; I could use tab separators instead).
All the testing data I have seen that uses numeric values (unlike the
Reuters example) has no id column of any sort to identify a point
after the clusterdump, just the values. Is there a way to map the points
to an id using the command line?
On Stack Overflow
<https://stackoverflow.com/questions/8785392/how-to-perform-k-means-clustering-in-mahout-with-vector-data-stored-as-csv>
I saw a code sample that builds its own sequence-file converter using
NamedVectors to capture the id. I created a similar version (code at
the end), and it writes a SequenceFile to HDFS. A sample:
hadoop fs -text /tmp/mahout-another/input/testSequence/ | head
point1 point1:{0:53920.0,1:42968.0}
point2 point2:{0:52019.0,1:42206.0}
point3 point3:{0:52570.0,1:42476.0}
point4 point4:{0:54220.0,1:42081.0}
point5 point5:{0:54268.0,1:43420.0}
point6 point6:{0:52288.0,1:42408.0}
point7 point7:{0:54436.0,1:39727.0}
point8 point8:{0:52391.0,1:44323.0}
point9 point9:{0:54995.0,1:43655.0}
point10 point10:{0:53761.0,1:43403.0}
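To double-check what that file actually contains, I also read it back
with a plain SequenceFile.Reader (a minimal sketch; the class name is
mine and the path is the one from above). Given my writer below, it
should report Text keys and VectorWritable values:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;

public class InspectSequenceFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/tmp/mahout-another/input/testSequence");
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
        // shows which Writable classes the file was written with
        System.out.println("key class:   " + reader.getKeyClassName());
        System.out.println("value class: " + reader.getValueClassName());
        reader.close();
    }
}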
Then I cluster with:
mahout kmeans -k 50 -i /tmp/mahout-another/input/testSequence/ -o
/tmp/mahout-another/output -c /tmp/mahout-initial-clusters --maxIter 10
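In case it helps, this is roughly the programmatic equivalent of that
command as I read the Mahout 0.7 javadoc (the KMeansDriver.run
signature seems to change between versions, so the exact arguments are
my assumption):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.mahout.clustering.kmeans.KMeansDriver;
import org.apache.mahout.clustering.kmeans.RandomSeedGenerator;
import org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure;

public class RunKMeans {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input = new Path("/tmp/mahout-another/input/testSequence");
        Path clustersIn = new Path("/tmp/mahout-initial-clusters");
        Path output = new Path("/tmp/mahout-another/output");
        // -k 50: sample 50 random input points as the initial centroids
        Path seeds = RandomSeedGenerator.buildRandom(conf, input, clustersIn, 50,
                new SquaredEuclideanDistanceMeasure());
        KMeansDriver.run(conf, input, seeds, output,
                0.5,    // convergence delta (the CLI default, I believe)
                10,     // --maxIter 10
                false,  // runClustering: -cl would also write clusteredPoints
                0.0,    // outlier removal threshold: keep every point
                false); // run as MapReduce rather than sequentially
    }
}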
After the kmeans run completes, running clusterdump with:
mahout clusterdump -i /tmp/mahout-another/output/clusters-10-final -o
output.txt -p /tmp/mahout-another/input/testSequence
throws:
Exception in thread "main" java.lang.ClassCastException:
org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.IntWritable
I'm guessing that the problem lies in the SequenceFile writer (code
below), since it requires two class parameters and, following the
Stack Overflow example, I write the name of the vector as Text.class
with the vector itself after it. Clusterdump then tries to read the
points, finds Text keys instead of the class it expects, and the
exception is thrown. But if that's the case, how can I do it?
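To illustrate my guess: as far as I can tell from skimming
ClusterDumper, the -p directory is read expecting (IntWritable cluster
id, WeightedVectorWritable point) pairs, i.e. something like the
sketch below (my reconstruction, not the actual Mahout code, and the
WeightedVectorWritable package may differ by version). Feeding it my
Text-keyed input would fail exactly like my stack trace:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.mahout.clustering.classify.WeightedVectorWritable;
import org.apache.mahout.common.Pair;
import org.apache.mahout.common.iterator.sequencefile.PathType;
import org.apache.mahout.common.iterator.sequencefile.SequenceFileDirIterable;

public class ReadPointsLikeClusterdump {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        Path points = new Path("/tmp/mahout-another/input/testSequence");
        // the iterator instantiates whatever key class the file declares
        // (Text here), so the implicit cast to IntWritable below is where
        // a Text-keyed file blows up
        for (Pair<IntWritable, WeightedVectorWritable> record :
                new SequenceFileDirIterable<IntWritable, WeightedVectorWritable>(
                        points, PathType.LIST, conf)) {
            System.out.println(record.getFirst().get() + " -> " + record.getSecond());
        }
    }
}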
Code for the sequence writer with NamedVectors:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.NamedVector;
import org.apache.mahout.math.VectorWritable;

public class CsvToSequenceFile {

    private static final int INDEX_ID = 0; // the id is the first column

    // args[0] = local input file, args[1] = HDFS output path, args[2] = separator
    public static void main(String[] args) throws IOException {
        BufferedReader br = new BufferedReader(new FileReader(args[0]));
        String line;
        List<NamedVector> vectors = new ArrayList<NamedVector>();
        while ((line = br.readLine()) != null) {
            String[] tokenized = line.split(args[2]);
            String id = tokenized[INDEX_ID];
            // every column after the id is a coordinate
            double[] coordinates = new double[tokenized.length - 1];
            for (int i = 1; i < tokenized.length; i++) {
                coordinates[i - 1] = Double.parseDouble(tokenized[i]);
            }
            vectors.add(createNamedVector(coordinates, id));
        }
        br.close();
        writeSequenceToPath(args[1], vectors);
    }

    public static NamedVector createNamedVector(double[] points, String id) {
        return new NamedVector(new DenseVector(points), id);
    }

    public static void writeSequenceToPath(String directory,
            List<NamedVector> listOfVectors) throws IOException {
        Configuration config = new Configuration();
        FileSystem fs = FileSystem.get(config);
        Path path = new Path(directory);
        // write a SequenceFile from the vectors; this is the line I suspect,
        // since it fixes the key class as Text
        SequenceFile.Writer writer = new SequenceFile.Writer(fs,
                config, path, Text.class, VectorWritable.class);
        VectorWritable vec = new VectorWritable();
        for (NamedVector v : listOfVectors) {
            vec.set(v);
            writer.append(new Text(v.getName()), vec);
        }
        writer.close();
    }
}
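For reference, I run it like this (the jar and input file names are
just placeholders for whatever I package and download):

hadoop jar csv-to-seq.jar CsvToSequenceFile a3.txt /tmp/mahout-another/input/testSequence ","

Note that the separator is passed straight to String.split(), so it is
treated as a regex.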
--
Ramiro Manso
Data Analyst & BI consultant
Telf.: +34 917 680 490
Fax: +34 913 833 301
C/ Manuel Tovar, 49-53 - 28034 Madrid - Spain
http://www.bidoop.es