Thanks, Dan.
 Yes, I had tried Tanimoto; it gives 6 clusters.

" It appeared for our data after our custom
lucene analyzer and the tfidf filtering was applied (in seq2sparse command)
all
terms for many of our documents were removed.  These were documents that
had minimal (and/or garbage) text to begin "
  We had also did the same way clearing the junck from the original
documents and even we had removed the stop words . But i our case there is
no use .

 How can I verify the vectors? Please suggest a way.
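One way to spot empty vectors is to dump the tfidf-vectors sequence file to text (e.g., with Mahout's seqdumper utility) and look for documents whose vector has no terms left. A minimal sketch; the "Key: ... Value: {...}" text layout below is an assumption about the dump format, and the file names are made up for illustration:

```python
# Sketch: detect documents whose tf-idf vector came out empty.
# The sample text mimics a seqdumper-style dump of the tfidf-vectors
# directory; the exact format is an assumption, not Mahout's verbatim output.
sample_dump = """\
Key: /doc_a.txt: Value: {12:0.53,88:0.21}
Key: /doc_b.txt: Value: {}
Key: /doc_c.txt: Value: {7:1.0}
"""

def empty_vector_docs(dump_text):
    """Return the document names whose vector has no terms left."""
    return [line.split(":")[1].strip()
            for line in dump_text.splitlines()
            if line.rstrip().endswith("Value: {}")]

print(empty_vector_docs(sample_dump))  # documents that lost all their terms
```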

            Thanks and Regards,
        S SYED ABDUL KATHER



On Fri, Oct 19, 2012 at 9:20 AM, DAN HELM <[email protected]> wrote:

> We previously did some k-means clustering runs on
> different-sized collections and noticed that a large cluster was often
> created
> along with some smaller others. In digging deeper it turned out a lot of
> the
> document vectors (produced via the seq2sparse command) were null (empty).
>  k-means apparently put these together in one large
> cluster.  I also saw NaN for computed distances
> for these vectors.  And in the “clusteredPoints”
> file, it was clear many vectors were empty.  It appeared for our data
> after our custom
> lucene analyzer and the tfidf filtering was applied (in seq2sparse
> command) all
> terms for many of our documents were removed.  These were documents that
> had minimal (and/or garbage) text to begin
> with.
> So, maybe first verify that you are getting
> proper vectors for the input to k-means. We ended up cleaning up the
> vectors
> before clustering them (tossing out the null ones). You can also experiment
> with different similarity measures in k-means (e.g., tanimoto).
>  Dan
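The cleanup step Dan describes (tossing out the null vectors before clustering) can be sketched as below. The vectors here are plain dicts (term index to tf-idf weight) purely for illustration; with Mahout the same filter would run over the tfidf-vectors sequence file:

```python
# Sketch: drop empty (null) vectors before handing the rest to k-means.
# Document names and weights are made-up illustration data.
vectors = {
    "/doc_a.txt": {3: 0.7, 9: 0.2},
    "/doc_b.txt": {},              # all terms removed by the analyzer/filters
    "/doc_c.txt": {1: 1.0},
}

# Keep only documents that still have at least one term.
clean = {name: v for name, v in vectors.items() if v}
print(sorted(clean))
```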
>
> ________________________________
>  From: syed kather <[email protected]>
> To: [email protected]
> Cc: Raja Ramesh <[email protected]>
> Sent: Thursday, October 18, 2012 11:03 PM
> Subject: K-Means generates only one cluster
>
> Team
>
>     Version Used : Mahout 0.6
>     Hadoop : 5 Nodes(1 Master + 4 Slaves)
>
>     Once we had generated k-means clusters for 600000 documents, I ran
> clusterdump, which extracts the top terms from each cluster. There I
> noticed that only one cluster was made, even though we had specified the
> number of clusters as 10. I cross-checked the commands with some 1000
> documents and applied clustering; I noticed that for those 1000
> documents, Mahout was able to generate 10 clusters.
>
> Some observations I made on the 600000-document data:
>     In clusterdump I added "--pointDir <path>", because this option
> tells us exactly what the top terms are for each document. Here I
> noticed that some of the documents do not have a distance:
> 1.0 : [distance=NaN]: /0_6_1343_504071_6198107.txt =]
>   0_6_1343_504071_6198107.txt ==> File Name
> 1.0 : [distance=NaN]: /0_6_1343_504071_6198108.txt =]
> 1.0 : [distance=NaN]: /0_6_1343_504071_6198109.txt =]
> 1.0 : [distance=NaN]: /0_6_1343_504071_6198110.txt =]
> 1.0 : [distance=NaN]: /0_6_1343_504071_6198111.txt =]
> 1.0 : [distance=NaN]: /0_6_1343_504071_6198112.txt =]
> 1.0 : [distance=NaN]: /0_6_1343_504071_6198113.txt =]
> 1.0 : [distance=NaN]: /0_6_1343_504071_6198114.txt =]
> 1.0 : [distance=NaN]: /0_6_1343_504071_6198115.txt =]
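The distance=NaN entries above are exactly what cosine distance produces for an empty vector: the formula divides by the product of the two vector norms, and an empty vector has norm 0. A minimal sketch (vectors as dicts of term index to weight, for illustration only):

```python
import math

def cosine_distance(a, b):
    # Cosine distance = 1 - (a . b) / (|a| * |b|); an empty vector has
    # norm 0, so the division is undefined and we get NaN.
    dot = sum(w * b.get(i, 0.0) for i, w in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return 1.0 - dot / norm if norm > 0 else float("nan")

print(cosine_distance({1: 0.5, 2: 0.5}, {}))  # nan, like the rows above
```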
>
> Here are the commands I executed, one for the huge data set (600000
> documents) and one for the small data set (1000 documents):
>
> #Sequence file generation
> bin/mahout seqdirectory -i /hugeData/hugeData/ -o /hugeData/SequenceFiles/
> -c UTF-8 -chunk 64   (600000 documents)
> bin/mahout seqdirectory -i /blrdata/blrdata/ -o /blrdata/SequenceFiles/ -c
> UTF-8 -chunk 64               (1000 documents)
>
> #Term Vector Creation.
> bin/mahout seq2sparse -i /hugeData/SequenceFiles/ -o
> /hugeData/SequenceFiles-sparse --maxDFPercent 85 --namedVector --minDF 15
>    (600000 doc)
> bin/mahout seq2sparse -i /blrdata/SequenceFiles/ -o
> /blrdata/SequenceFiles-sparse --maxDFPercent 85 --namedVector --minDF 15
>          (1000 documents)
>
> #Clustering
> bin/mahout kmeans -i /hugeData/SequenceFiles-sparse/tfidf-vectors/ -c
> /hugeData/kmeans-clusters -o /hugeData/kmeans -dm
> org.apache.mahout.common.distance.CosineDistanceMeasure -x 10 -k 10 -ow
> --clustering                       (600000 documents)
> bin/mahout kmeans -i /blrdata/SequenceFiles-sparse/tfidf-vectors/ -c
> /blrdata/kmeans-clusters -o /blrdata/kmeans -dm
> org.apache.mahout.common.distance.CosineDistanceMeasure -x 10 -k 10 -ow
> --clustering                        (1000 documents)
>
> #Cluster Dump
> bin/mahout clusterdump -s
> hdfs://localhost:9000/hugeData/kmeans/clusters-2-final/ -d
> hdfs://localhost:9000/hugeData/SequenceFiles-sparse/dictionary.file-0 -dt
> sequencefile -b 100 -n 100                                  (600000
> documents)
> bin/mahout clusterdump -s
> hdfs://localhost:9000/blrdata/kmeans/clusters-2-final/ -d
> hdfs://localhost:9000/blrdata/SequenceFiles-sparse/dictionary.file-0 -dt
> sequencefile -b 100 -n 10                                        (1000
> documents)
>
> I am using the MapReduce method for computing k-means.
>
> I have no clue what is going wrong, so please help me see what I have
> missed here, and please give me some suggestions on how to check what
> went wrong.
>
>
> Let me know if any further information is required.
>
> Thanks in advance
> S SYED ABDUL KATHER
>
