Team

    Version Used : Mahout 0.6
    Hadoop : 5 Nodes(1 Master + 4 Slaves)

    After generating k-means clusters for 600000 documents, I ran
clusterdump to extract the top terms from each cluster. I noticed that only
one cluster was produced, even though we had specified 10 clusters. To
cross-check the commands, I applied the same clustering to about 1000
documents, and for those Mahout did generate 10 clusters.

Some observations on the 600000-document data:
    I added "--pointsDir <path>" to clusterdump, because it reports the
results document by document. In this output I noticed that some of the
documents do not have a distance:
1.0 : [distance=NaN]: /0_6_1343_504071_6198107.txt =]
  0_6_1343_504071_6198107.txt ==> File Name
1.0 : [distance=NaN]: /0_6_1343_504071_6198108.txt =]
1.0 : [distance=NaN]: /0_6_1343_504071_6198109.txt =]
1.0 : [distance=NaN]: /0_6_1343_504071_6198110.txt =]
1.0 : [distance=NaN]: /0_6_1343_504071_6198111.txt =]
1.0 : [distance=NaN]: /0_6_1343_504071_6198112.txt =]
1.0 : [distance=NaN]: /0_6_1343_504071_6198113.txt =]
1.0 : [distance=NaN]: /0_6_1343_504071_6198114.txt =]
1.0 : [distance=NaN]: /0_6_1343_504071_6198115.txt =]
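
To see how widespread the NaN distances are, the lines above can be counted in the saved clusterdump text. This is only a minimal sketch: it assumes the output was saved to a local text file (the file name is hypothetical) and that every point line contains "[distance=...]" as in the sample above. The function name count_nan_points is made up for illustration.

```shell
#!/bin/sh
# Count point lines with and without a usable distance in saved clusterdump text.
# "$1" is a hypothetical local file holding the clusterdump output.
count_nan_points() {
    dump_file="$1"
    # Point lines look like: "1.0 : [distance=NaN]: /some_file.txt =]"
    total=$(grep -c '\[distance=' "$dump_file" || true)
    nan=$(grep -c '\[distance=NaN\]' "$dump_file" || true)
    echo "points=$total nan=$nan"
}
```

Usage would be something like: count_nan_points clusterdump.txt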

Here are the commands I executed, one set for the large data (600000
documents) and one for the small data (1000 documents).

#Sequence file generation
(600000 documents)
bin/mahout seqdirectory -i /hugeData/hugeData/ -o /hugeData/SequenceFiles/ \
    -c UTF-8 -chunk 64
(1000 documents)
bin/mahout seqdirectory -i /blrdata/blrdata/ -o /blrdata/SequenceFiles/ \
    -c UTF-8 -chunk 64

#Term vector creation
(600000 documents)
bin/mahout seq2sparse -i /hugeData/SequenceFiles/ \
    -o /hugeData/SequenceFiles-sparse \
    --maxDFPercent 85 --namedVector --minDF 15
(1000 documents)
bin/mahout seq2sparse -i /blrdata/SequenceFiles/ \
    -o /blrdata/SequenceFiles-sparse \
    --maxDFPercent 85 --namedVector --minDF 15

#Clustering
(600000 documents)
bin/mahout kmeans -i /hugeData/SequenceFiles-sparse/tfidf-vectors/ \
    -c /hugeData/kmeans-clusters -o /hugeData/kmeans \
    -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
    -x 10 -k 10 -ow --clustering
(1000 documents)
bin/mahout kmeans -i /blrdata/SequenceFiles-sparse/tfidf-vectors/ \
    -c /blrdata/kmeans-clusters -o /blrdata/kmeans \
    -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
    -x 10 -k 10 -ow --clustering

#Cluster dump
(600000 documents)
bin/mahout clusterdump \
    -s hdfs://localhost:9000/hugeData/kmeans/clusters-2-final/ \
    -d hdfs://localhost:9000/hugeData/SequenceFiles-sparse/dictionary.file-0 \
    -dt sequencefile -b 100 -n 100
(1000 documents)
bin/mahout clusterdump \
    -s hdfs://localhost:9000/blrdata/kmeans/clusters-2-final/ \
    -d hdfs://localhost:9000/blrdata/SequenceFiles-sparse/dictionary.file-0 \
    -dt sequencefile -b 100 -n 10
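
The number of clusters and their sizes can be read back from the saved clusterdump text as a sanity check. This is a rough sketch, not a confirmed recipe: it assumes the dump was saved to a local file (hypothetical name) and that each cluster header contains an id of the form "VL-<id>{n=<size>" or "CL-<id>{n=<size>", which may differ between Mahout versions. The function name list_cluster_sizes is made up for illustration.

```shell
#!/bin/sh
# Print one line per cluster: "<cluster-id> <number-of-points>".
# Assumption: headers look like "VL-12{n=345 c=[...] r=[...]}" (or "CL-...").
list_cluster_sizes() {
    grep -oE '[CV]L-[0-9]+\{n=[0-9]+' "$1" | sed -E 's/\{n=/ /'
}
```

Piping the result through "wc -l" would give the cluster count, which should be 10 for -k 10.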

I am using the MapReduce method to compute k-means.

I have no clue what is going wrong. Please help me see what I have missed,
and suggest how I can check what went wrong.


Let me know if any further information is required.

Thanks in advance
 S SYED ABDUL KATHER
