We previously ran k-means on collections of different sizes and noticed that one large cluster was often created alongside a few smaller ones. Digging deeper, it turned out that many of the document vectors (produced via the seq2sparse command) were null (empty). k-means apparently put these together in one large cluster, and I also saw NaN for the computed distances for these vectors. In the "clusteredPoints" file it was clear that many vectors were empty.

For our data, after our custom Lucene analyzer and the tf-idf filtering were applied (in the seq2sparse command), all terms had been removed from many of our documents. These were documents that had minimal (and/or garbage) text to begin with.

So first verify that you are getting proper vectors as the input to k-means. We ended up cleaning up the vectors before clustering them (tossing out the null ones). You can also experiment with different similarity measures in k-means (e.g., Tanimoto).

Dan
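The cleanup step described above (dropping null vectors before clustering) can be sketched as follows. This is a minimal illustration in plain Python over (name, weights) pairs, not Mahout's actual Vector/SequenceFile API; the function name and data layout are hypothetical:

```python
def drop_empty_vectors(named_vectors):
    # Keep only documents that still have at least one nonzero term weight
    # after analysis and tf-idf filtering; collect the names of the rest.
    kept, dropped = [], []
    for name, weights in named_vectors:
        if any(w != 0.0 for w in weights):
            kept.append((name, weights))
        else:
            dropped.append(name)
    return kept, dropped

vectors = [
    ("doc_a.txt", [0.2, 0.0, 0.7]),
    ("doc_b.txt", [0.0, 0.0, 0.0]),  # all terms filtered out -> would yield distance=NaN
]
kept, dropped = drop_empty_vectors(vectors)
print(len(kept), dropped)  # 1 ['doc_b.txt']
```

In a real Mahout pipeline the same filtering would be done by reading the tfidf-vectors SequenceFile, skipping vectors with zero norm, and writing the survivors to a new SequenceFile that is then fed to kmeans.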
________________________________
From: syed kather <[email protected]>
To: [email protected]
Cc: Raja Ramesh <[email protected]>
Sent: Thursday, October 18, 2012 11:03 PM
Subject: K-Means generates only one cluster

Team,

Version used: Mahout 0.6
Hadoop: 5 nodes (1 master + 4 slaves)

We generated k-means clusters for 600000 documents. I ran clusterdump, which extracts the top terms from each cluster, and noticed that only one cluster was made even though we had specified the number of clusters as 10. I cross-checked the same commands on some 1000 documents and applied clustering; for those 1000 documents, Mahout was able to generate 10 clusters.

Some observations I made on the 600000-document data: in clusterdump I added "--pointsDir <path>", because that shows exactly which top terms belong to each document. There I noticed that some of the documents do not have a distance:

1.0 : [distance=NaN]: /0_6_1343_504071_6198107.txt    (0_6_1343_504071_6198107.txt ==> file name)
1.0 : [distance=NaN]: /0_6_1343_504071_6198108.txt
1.0 : [distance=NaN]: /0_6_1343_504071_6198109.txt
1.0 : [distance=NaN]: /0_6_1343_504071_6198110.txt
1.0 : [distance=NaN]: /0_6_1343_504071_6198111.txt
1.0 : [distance=NaN]: /0_6_1343_504071_6198112.txt
1.0 : [distance=NaN]: /0_6_1343_504071_6198113.txt
1.0 : [distance=NaN]: /0_6_1343_504071_6198114.txt
1.0 : [distance=NaN]: /0_6_1343_504071_6198115.txt

Have a look at the commands I executed, one set for the huge data (600000 documents) and one for the small data (1000 documents):

# Sequence file generation
bin/mahout seqdirectory -i /hugeData/hugeData/ -o /hugeData/SequenceFiles/ -c UTF-8 -chunk 64    (600000 documents)
bin/mahout seqdirectory -i /blrdata/blrdata/ -o /blrdata/SequenceFiles/ -c UTF-8 -chunk 64    (1000 documents)

# Term vector creation
bin/mahout seq2sparse -i /hugeData/SequenceFiles/ -o /hugeData/SequenceFiles-sparse --maxDFPercent 85 --namedVector --minDF 15    (600000 documents)
bin/mahout seq2sparse -i /blrdata/SequenceFiles/ -o /blrdata/SequenceFiles-sparse --maxDFPercent 85 --namedVector --minDF 15    (1000 documents)

# Clustering
bin/mahout kmeans -i /hugeData/SequenceFiles-sparse/tfidf-vectors/ -c /hugeData/kmeans-clusters -o /hugeData/kmeans -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 10 -k 10 -ow --clustering    (600000 documents)
bin/mahout kmeans -i /blrdata/SequenceFiles-sparse/tfidf-vectors/ -c /blrdata/kmeans-clusters -o /blrdata/kmeans -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 10 -k 10 -ow --clustering    (1000 documents)

# Cluster dump
bin/mahout clusterdump -s hdfs://localhost:9000/hugeData/kmeans/clusters-2-final/ -d hdfs://localhost:9000/hugeData/SequenceFiles-sparse/dictionary.file-0 -dt sequencefile -b 100 -n 100    (600000 documents)
bin/mahout clusterdump -s hdfs://localhost:9000/blrdata/kmeans/clusters-2-final/ -d hdfs://localhost:9000/blrdata/SequenceFiles-sparse/dictionary.file-0 -dt sequencefile -b 100 -n 10    (1000 documents)

I am using the MapReduce method for calculating k-means. I have no clue what is going wrong, so please help me figure out what I have missed, and please give me some suggestion on how to check what goes wrong. Let me know if any further information is required.

Thanks in advance,
S SYED ABDUL KATHER
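For context on those distance=NaN lines: under the cosine distance measure, a vector with all-zero weights (a document whose terms were all filtered out) has zero norm, and 0.0/0.0 is NaN in Java's double arithmetic. A minimal sketch of the arithmetic in plain Python (illustrative only, not Mahout's CosineDistanceMeasure source):

```python
import math

def cosine_distance(a, b):
    # Cosine distance is 1 - cos(a, b). If either vector is empty (zero norm),
    # the division is 0/0, which Mahout's double arithmetic yields as NaN.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return float("nan")  # mirrors the [distance=NaN] entries in clusteredPoints
    return 1.0 - dot / (norm_a * norm_b)

doc = [0.0, 0.0, 0.0]       # document whose terms were all removed by tf-idf filtering
centroid = [0.3, 0.1, 0.6]
print(cosine_distance(doc, centroid))  # nan
```

This is why a batch of empty vectors tends to collapse into a single cluster: their distances to every centroid are equally meaningless.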
