Dear All...

I tried to test Mahout k-means clustering on Arabic data, but I think there
is an encoding problem.

I tried the following commands:
=======================

$ ./mahout seqdirectory -i "....\Arabic_data" -o
"....\ArabicTest\Arabic_data-seqdir" -c UTF-8 -chunk 5

$ ./mahout seq2sparse -i "....\ArabicTest\Arabic_data-seqdir" -o
"....\ArabicTest\Arabic_data_out-seqdir"

$ ./mahout kmeans -i "....\ArabicTest\Arabic_data_out-seqdir\tfidf-vectors/"
-c "....\ArabicTest\clusters" -o "....\ArabicTest\arabic-kmeans" -x 10 -k 20
-ow

$ ./mahout clusterdump -s "....\ArabicTest\arabic-kmeans\clusters-1" -d
"....\ArabicTest\Arabic_data_out-seqdir\dictionary.file-0" -dt sequencefile
-b 100 -n 20


The clusterdump command generates the following output
===================================

no HADOOP_HOME set, running locally
:VL-1{n=1 c=[24:6.187, 31:5.912, 53:7.643, 69:7.958, 77:8.365, ??:2.260,
?????:5.627, ?????:5.627, ??
        Top Terms: 
                ????                                    =>  11.830205917358398
                ?????                                   =>  10.808554649353027
                ???????                                 =>    8.93863296508789
                ?????                                   =>    8.93863296508789
                ???????                                 =>    8.93863296508789
                ???????                                 =>    8.93863296508789
                77                                      =>   8.365219116210938
                ????                                    =>   8.365219116210938
                ??????                                  =>   8.365219116210938
                ???????????                             =>   8.365219116210938
                69                                      =>   7.958374977111816
                ?????                                   =>  7.6428022384643555
                53                                      =>  7.6428022384643555
                ???                                     =>  7.6428022384643555
                ???                                     =>   7.384960651397705
                ?????                                   =>   7.384960651397705
                ?????                                   =>   7.166958332061768
                24                                      =>   6.186699867248535
                31                                      =>  5.9121222496032715
                ?????                                   =>   5.627420902252197
:VL-104{n=1 c=[??:6.089, ????:5.404, ??????:3.795, ???????:5.915,
??????:7.385, ????????:8.939, ?????
        Top Terms: 
                ????????                                =>  12.641136169433594
                ??????                                  =>   9.422260284423828
                ?????????                               =>    8.93863296508789
                ????                                    =>    8.93863296508789


===============================================================
I think the meaningless question marks (?) are an encoding problem. Can
anyone help me with this?
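
To show what I suspect is happening, here is a minimal Python sketch (not Mahout code). The sample term is hypothetical and not taken from my data. It assumes the terms are stored correctly as UTF-8 and are only damaged when they are written out with a charset that has no Arabic letters:

```python
# Illustration only: mimics a writer or console whose charset cannot
# represent Arabic. The sample term is hypothetical.
term = "عربي"

# The term survives a UTF-8 round trip unchanged.
utf8_bytes = term.encode("utf-8")
assert utf8_bytes.decode("utf-8") == term

# Encoding to a charset without Arabic letters replaces each letter with '?'.
lossy = term.encode("ascii", errors="replace").decode("ascii")
print(lossy)  # ????
```

If this is the cause, each term should turn into as many question marks as it has letters, which would be consistent with the varying lengths in the output above.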

Also, I would like a tutorial that describes the k-means clustering command
and its options, and explains what the clusterdump output represents.

Thank you....
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Arabic-K-mean-clustering-tp2518248p2518248.html
Sent from the Mahout User List mailing list archive at Nabble.com.
