We previously ran k-means on collections of different sizes and noticed
that one large cluster was often created along with some smaller ones. In
digging deeper, it turned out that a lot of the document vectors (produced
via the seq2sparse command) were null (empty). k-means apparently put these
together in one large cluster. I also saw NaN for the computed distances
for these vectors, and in the "clusteredPoints" file it was clear that many
vectors were empty. For our data, it appeared that after our custom Lucene
analyzer and the tf-idf filtering were applied (in the seq2sparse command),
all terms had been removed from many of our documents. These were documents
that had minimal (and/or garbage) text to begin with.
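The NaN falls out of the cosine-distance formula itself: an all-zero
vector has zero norm, so the ratio is 0/0 and the distance is undefined. A
quick standalone sketch (plain Python, not the Mahout implementation):

```python
import math

def cosine_distance(a, b):
    # Cosine distance = 1 - (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # An empty (all-zero) vector has no direction, so the cosine
        # is undefined -- this is where the NaN in clusterdump comes from.
        return float("nan")
    return 1.0 - dot / (norm_a * norm_b)

print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # 1.0 (orthogonal docs)
print(cosine_distance([1.0, 2.0], [0.0, 0.0]))  # nan (empty document)
```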
So, maybe first verify that you are getting proper vectors as the input to
k-means. We ended up cleaning up the vectors before clustering them
(tossing out the null ones). You can also experiment with different
distance measures in k-means (e.g., Tanimoto).
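As a rough sketch of that cleanup step (plain Python with hypothetical
names, not the Mahout API -- in practice you would filter the named
VectorWritable entries in the tfidf-vectors sequence file before handing
them to kmeans):

```python
def drop_empty_vectors(named_vectors):
    # Keep only documents whose tf-idf vector has at least one nonzero weight
    return {name: vec for name, vec in named_vectors.items()
            if any(w != 0.0 for w in vec)}

docs = {
    "/doc_a.txt": [0.0, 1.2, 0.4],
    "/doc_b.txt": [0.0, 0.0, 0.0],  # analyzer + tf-idf filtering removed every term
}
print(sorted(drop_empty_vectors(docs)))  # ['/doc_a.txt']
```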
 Dan  

________________________________
 From: syed kather <[email protected]>
To: [email protected] 
Cc: Raja Ramesh <[email protected]> 
Sent: Thursday, October 18, 2012 11:03 PM
Subject: K-Means generates only one cluster
  
Team

    Version Used : Mahout 0.6
    Hadoop : 5 Nodes(1 Master + 4 Slaves)

We recently generated k-means clusters for 600,000 documents. I ran
clusterdump, which extracts the top terms from each cluster, and noticed
that only one cluster was produced even though we had specified the number
of clusters as 10. I cross-checked the commands on about 1,000 documents
and applied the same clustering; out of those 1,000 documents, Mahout was
able to generate 10 clusters.

Some observations I made on the 600,000-document data:
    In clusterdump I added "--pointDir <path>", because this tells us
exactly what the top terms are for each document. There I noticed that some
of the documents do not have a distance:
1.0 : [distance=NaN]: /0_6_1343_504071_6198107.txt =]
  0_6_1343_504071_6198107.txt ==> File Name
1.0 : [distance=NaN]: /0_6_1343_504071_6198108.txt =]
1.0 : [distance=NaN]: /0_6_1343_504071_6198109.txt =]
1.0 : [distance=NaN]: /0_6_1343_504071_6198110.txt =]
1.0 : [distance=NaN]: /0_6_1343_504071_6198111.txt =]
1.0 : [distance=NaN]: /0_6_1343_504071_6198112.txt =]
1.0 : [distance=NaN]: /0_6_1343_504071_6198113.txt =]
1.0 : [distance=NaN]: /0_6_1343_504071_6198114.txt =]
1.0 : [distance=NaN]: /0_6_1343_504071_6198115.txt =]

Have a look at the commands I executed; one is for the huge data set
(600,000 documents) and one is for the small data set (1,000 documents).

#Sequence file generation
bin/mahout seqdirectory -i /hugeData/hugeData/ -o /hugeData/SequenceFiles/
-c UTF-8 -chunk 64   (600000 documents)
bin/mahout seqdirectory -i /blrdata/blrdata/ -o /blrdata/SequenceFiles/ -c
UTF-8 -chunk 64               (1000 documents)

#Term Vector Creation.
bin/mahout seq2sparse -i /hugeData/SequenceFiles/ -o
/hugeData/SequenceFiles-sparse --maxDFPercent 85 --namedVector --minDF 15
   (600000 doc)
bin/mahout seq2sparse -i /blrdata/SequenceFiles/ -o
/blrdata/SequenceFiles-sparse --maxDFPercent 85 --namedVector --minDF 15
         (1000 documents)

#Clustering
bin/mahout kmeans -i /hugeData/SequenceFiles-sparse/tfidf-vectors/ -c
/hugeData/kmeans-clusters -o /hugeData/kmeans -dm
org.apache.mahout.common.distance.CosineDistanceMeasure -x 10 -k 10 -ow
--clustering                       (600000 documents)
bin/mahout kmeans -i /blrdata/SequenceFiles-sparse/tfidf-vectors/ -c
/blrdata/kmeans-clusters -o /blrdata/kmeans -dm
org.apache.mahout.common.distance.CosineDistanceMeasure -x 10 -k 10 -ow
--clustering                        (1000 documents)

#Cluster Dump
bin/mahout clusterdump -s
hdfs://localhost:9000/hugeData/kmeans/clusters-2-final/ -d
hdfs://localhost:9000/hugeData/SequenceFiles-sparse/dictionary.file-0 -dt
sequencefile -b 100 -n 100                                  (600000
documents)
bin/mahout clusterdump -s
hdfs://localhost:9000/blrdata/kmeans/clusters-2-final/ -d
hdfs://localhost:9000/blrdata/SequenceFiles-sparse/dictionary.file-0 -dt
sequencefile -b 100 -n 10                                        (1000
documents)

I am using the MapReduce method for computing k-means.

I have no clue what is going wrong, so please help me figure out what I
have missed, and please give me some suggestions on how to check what went
wrong.


Let me know if any further information is required.

Thanks in advance
S SYED ABDUL KATHER
