I'm trying to run k-means clustering on purely numeric data. This is how my data looks:

    header1, header2, header3, header4, header5
    0,0,0,0,0
    1,3,2,4,5
    3,2,4,5,6
    .
    .
    .

There are about 3000 rows. For the cluster centroids I created another file:

    (0,0,0,0,0)
    (1,2,3,4,5)

My understanding is that I have to convert these text files to sequence files and then generate sparse vectors from those sequence files for k-means clustering.

I used seqdirectory followed by seq2sparse, and I now have two folders, one for the input and one for the centroids. The input folder has the directories generated by seq2sparse on the input sequence file. Similarly, the centroids folder has the directories generated by seq2sparse on the centroids sequence file.

The command I use to run k-means:

    mahout kmeans --input input/tfidf-vectors --output output -c centroids/tfidf-vectors --maxIter 20

and I get this error:

    No input clusters found in centroids/tfidf-vectors Check your -c argument.

The sequence files contain data, but the files generated by seq2sparse are empty. Can someone please help? BTW, all of this is on HDFS, not in local mode.
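To make the intended computation concrete, here is a minimal plain-Python sketch (not Mahout code, and not a fix for the error) of one k-means assignment-and-update pass over the three sample rows, seeded with the two centroids from the centroid file. The helper names `parse_rows`, `parse_centroids` and `kmeans_step` are hypothetical and exist only for this illustration. Euclidean distance is an assumption, since the command above does not name a distance measure.

```python
import math


def parse_rows(text):
    """Parse CSV text with one header line into a list of float vectors."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    # Skip the header line; every remaining line is one numeric row.
    return [[float(x) for x in ln.split(",")] for ln in lines[1:]]


def parse_centroids(text):
    """Parse lines such as "(1,2,3,4,5)" into float vectors."""
    centroids = []
    for ln in text.strip().splitlines():
        ln = ln.strip().strip("()")
        if ln:
            centroids.append([float(x) for x in ln.split(",")])
    return centroids


def kmeans_step(points, centroids):
    """Assign each point to its nearest centroid, then recompute the means."""
    assignments = []
    for p in points:
        dists = [math.dist(p, c) for c in centroids]  # Euclidean (assumed)
        assignments.append(dists.index(min(dists)))

    new_centroids = []
    for k, old in enumerate(centroids):
        members = [p for p, a in zip(points, assignments) if a == k]
        if members:
            # Column-wise mean of all points assigned to cluster k.
            new_centroids.append([sum(col) / len(members) for col in zip(*members)])
        else:
            # Keep an empty cluster's centroid where it was.
            new_centroids.append(list(old))
    return assignments, new_centroids


data = """header1, header2, header3, header4, header5
0,0,0,0,0
1,3,2,4,5
3,2,4,5,6
"""
seeds = "(0,0,0,0,0)\n(1,2,3,4,5)\n"

points = parse_rows(data)
centroids = parse_centroids(seeds)
assignments, centroids = kmeans_step(points, centroids)
print(assignments)
print(centroids)
```

The point of the sketch is that every row is treated as a dense numeric vector and each seed as a point in the same five-dimensional space, which is the behaviour expected from the clustering run described above.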
