Eduard,

My guess is you will need to convert your CSV vectors to Mahout vector format and then run that through k-means. I believe the seqdirectory program just converts a collection of individual text files to sequence file format, which can then be transformed to Mahout vectors via the seq2sparse command.

I have never used it, but I do see there is a CSVVectorIterator class that could be used (in your own custom program):
https://cwiki.apache.org/MAHOUT/file-format-integrations.html

This thread talks more about the topic:
http://comments.gmane.org/gmane.comp.apache.mahout.user/11310

Dan
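
To illustrate the conversion step above, here is a minimal sketch in plain Python (not Mahout code) of what "CSV line to numeric vector" means for this kind of data. The function name parse_records is hypothetical, and treating the first column as a record id rather than a feature is an assumption about the dataset. Writing the resulting vectors into a Mahout sequence file would still need the Mahout API in a custom program, which is not shown here.

```python
import csv
import io


def parse_records(text):
    """Parse CSV text into (record_id, feature_vector) pairs.

    Assumption: the first column is a record id, not a feature,
    and every remaining column is numeric.
    """
    records = []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            # csv.reader yields an empty list for blank lines; skip them
            continue
        record_id = int(row[0])
        features = [float(field) for field in row[1:]]
        records.append((record_id, features))
    return records


# Two short made-up rows in the same shape as the census lines
sample = "10000,5,0,1,0\n\n10001,6,1,0,2\n"
records = parse_records(sample)
for record_id, features in records:
    print(record_id, features)
```

Each record keeps its numeric values as a dense vector, which is the shape k-means expects, instead of being tokenized as text the way the Reuters pipeline does.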
________________________________
From: Eduard Gamonal <[email protected]>
To: [email protected]
Sent: Friday, November 30, 2012 4:51 PM
Subject: command line input dataset format for k-means and USCensus dataset

Hi,

I have a text file that contains a few thousand lines. Each line is a set of features, like this:

10000,5,0,1,0,0,5,3,2,2,1,0,1,0,4,3,0,2,0,0,1,0,0,0,0,10,0,1,0,1,0,1,4,2,2,3,0,2,0,2,1,4,3,0,0,0,3,1,0,3,22,0,3,0,1,0,1,0,0,0,5,0,2,1,1,0,11,1,0

source: http://archive.ics.uci.edu/ml/datasets/US+Census+Data+%281990%29

My goal is to cluster all this data with k-means using the command line interface. I read the Reuters k-means tutorial, but I guess I can't apply the same procedure in a straightforward manner. The Reuters example is for analyzing text. However, I want to analyze records.

This is what I did:

$ mahout seqdirectory --input uscensus --output uscensus-seq
$ mahout seq2sparse -i uscensus-seq -o uscensus-vec
$ mahout kmeans -i uscensus-vec/tfidf-vectors -o uscensus-kmeans-clusters -c uscensus-kmeans-centroids -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 10 -ow -cl -k 25

I still haven't guessed a good starting k and x, though.

I get an empty result:

edu@hadoop:~/kmeans-mahout-uscensus$ cat cdump.txt
CL-1{n=2 c=[] r=[]}
    Top Terms:
    Weight : [props - optional]:  Point:
    1.0: []
    1.0: []
CL-0{n=1 c=[] r=[]}
    Top Terms:
edu@hadoop:~/kmeans-mahout-uscensus$

Questions:
* Do you think my vectors are created correctly? I guess they have to be like <1000, 5, 0, ... 1, 0>, but because I'm following the Reuters example I can't see how they could be correct.
* Why should I be using the TFIDF vectors?
