Dear all,
I tried to test Mahout k-means clustering on Arabic data, but I think there
is an encoding problem.
I ran the following commands:
=======================
$ ./mahout seqdirectory -i "....\Arabic_data" -o
"....\ArabicTest\Arabic_data-seqdir" -c UTF-8 -chunk 5
$ ./mahout seq2sparse -i "....\ArabicTest\Arabic_data-seqdir" -o
"....\ArabicTest\Arabic_data_out-seqdir"
$ ./mahout kmeans -i "....\ArabicTest\Arabic_data_out-seqdir\tfidf-vectors/"
-c "....\ArabicTest\clusters" -o "....\ArabicTest\arabic-kmeans" -x 10 -k 20
-ow
$ ./mahout clusterdump -s "....\ArabicTest\arabic-kmeans\clusters-1" -d
"....\ArabicTest\Arabic_data_out-seqdir\dictionary.file-0" -dt sequencefile
-b 100 -n 20
The clusterdump command generates the following output:
===================================
no HADOOP_HOME set, running locally
:VL-1{n=1 c=[24:6.187, 31:5.912, 53:7.643, 69:7.958, 77:8.365, ??:2.260,
?????:5.627, ?????:5.627, ??
Top Terms:
???? => 11.830205917358398
????? => 10.808554649353027
??????? => 8.93863296508789
????? => 8.93863296508789
??????? => 8.93863296508789
??????? => 8.93863296508789
77 => 8.365219116210938
???? => 8.365219116210938
?????? => 8.365219116210938
??????????? => 8.365219116210938
69 => 7.958374977111816
????? => 7.6428022384643555
53 => 7.6428022384643555
??? => 7.6428022384643555
??? => 7.384960651397705
????? => 7.384960651397705
????? => 7.166958332061768
24 => 6.186699867248535
31 => 5.9121222496032715
????? => 5.627420902252197
:VL-104{n=1 c=[??:6.089, ????:5.404, ??????:3.795, ???????:5.915,
??????:7.385, ????????:8.939, ?????
Top Terms:
???????? => 12.641136169433594
?????? => 9.422260284423828
????????? => 8.93863296508789
???? => 8.93863296508789
===============================================================
I think the meaningless question marks (?) are caused by an encoding problem.
Can anyone help me with this?
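To illustrate the mechanism I suspect, here is a minimal Java sketch. It is not taken from Mahout: the class name EncodingDemo and the method roundTrip are made up for illustration, and I have not confirmed that clusterdump writes its output this way. It only shows that when Arabic text is encoded with a charset that cannot represent it, Java substitutes one '?' per character, which matches the pattern in the output above.

```java
import java.nio.charset.Charset;

public class EncodingDemo {
    // Hypothetical helper: encode text to bytes with the named charset,
    // then decode those bytes again with the same charset.
    static String roundTrip(String text, String charsetName) {
        Charset cs = Charset.forName(charsetName);
        // String.getBytes(Charset) replaces unmappable characters with the
        // charset's replacement byte, which is '?' for US-ASCII and ISO-8859-1.
        return new String(text.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        // Four Arabic letters, written as Unicode escapes so the result does
        // not depend on the encoding of this source file.
        String arabic = "\u0639\u0631\u0628\u064A";

        // The platform default is what many Java writers use when no
        // charset is given explicitly.
        System.out.println("default charset: " + Charset.defaultCharset());

        // A charset without Arabic letters loses the text.
        System.out.println("US-ASCII round trip: " + roundTrip(arabic, "US-ASCII"));

        // UTF-8 can represent every character, so the text survives.
        System.out.println("UTF-8 preserved: " + roundTrip(arabic, "UTF-8").equals(arabic));
    }
}
```

If the default charset printed on my machine is not UTF-8, that would be consistent with the '?' output. Starting the JVM with the standard system property -Dfile.encoding=UTF-8 might then be worth trying, though I do not know whether that is the right fix for clusterdump.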
Also, I would like a tutorial that describes the k-means clustering command and
its options, and explains what the clusterdump output represents.
Thank you.