Hi, I believe this problem was already addressed in the following JIRA issue: https://issues.apache.org/jira/browse/MAHOUT-594
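
For background, a '?' in place of each Arabic character is what Java typically produces when text is encoded with a charset that cannot represent it. The sketch below only illustrates that general behaviour; it is not the Mahout code, and it assumes (unverified for this setup) that a non-Unicode charset is involved somewhere in the output path. The names EncodingDemo and roundTrip are made up for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    // Hypothetical helper: encode the text to bytes with the given charset,
    // then decode the bytes back, as a writer/reader pair would.
    public static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        // An Arabic word of five letters, written with Unicode escapes so the
        // source file itself does not depend on any particular encoding.
        String arabic = "\u0645\u0631\u062d\u0628\u0627";

        // The charset the JVM uses when none is specified explicitly.
        System.out.println("default charset: " + Charset.defaultCharset());

        // US-ASCII cannot represent Arabic letters, so each one is replaced by '?'.
        System.out.println("US-ASCII: " + roundTrip(arabic, StandardCharsets.US_ASCII));

        // UTF-8 can represent them, so the text survives unchanged.
        System.out.println("UTF-8 intact: " + roundTrip(arabic, StandardCharsets.UTF_8).equals(arabic));
    }
}
```

If the default charset printed on your machine is not UTF-8, running the JVM with -Dfile.encoding=UTF-8 may be worth trying, although I have not checked whether that is sufficient here.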

Thanks,
-- Shige

On Thu, Feb 17, 2011 at 3:57 AM, WaleedAzmy <[email protected]> wrote:
>
> Dear All...
>
> I tried to test Mahout K-Mean clustering on Arabic data. But -I think- there
> is a problems in encoding...
>
> I tried the following commands:
> =======================
>
> $ ./mahout seqdirectory -i "....\Arabic_data" -o "....\ArabicTest\Arabic_data-seqdir" -c UTF-8 -chunk 5
>
> $ ./mahout seq2sparse -i "....\ArabicTest\Arabic_data-seqdir" -o "....\ArabicTest\Arabic_data_out-seqdir"
>
> $ ./mahout kmeans -i "....\ArabicTest\Arabic_data_out-seqdir\tfidf-vectors/" -c "....\ArabicTest\clusters" -o "....\ArabicTest\arabic-kmeans" -x 10 -k 20 -ow
>
> $ ./mahout clusterdump -s "....\ArabicTest\arabic-kmeans\clusters-1" -d "....\ArabicTest\Arabic_data_out-seqdir\dictionary.file-0" -dt sequencefile -b 100 -n 20
>
>
> The clusterdump generate the following output
> ===================================
>
> o HADOOP_HOME set, running locally
> :VL-1{n=1 c=[24:6.187, 31:5.912, 53:7.643, 69:7.958, 77:8.365, ??:2.260, ?????:5.627, ?????:5.627, ??
> Top Terms:
> ???? => 11.830205917358398
> ????? => 10.808554649353027
> ??????? => 8.93863296508789
> ????? => 8.93863296508789
> ??????? => 8.93863296508789
> ??????? => 8.93863296508789
> 77 => 8.365219116210938
> ???? => 8.365219116210938
> ?????? => 8.365219116210938
> ??????????? => 8.365219116210938
> 69 => 7.958374977111816
> ????? => 7.6428022384643555
> 53 => 7.6428022384643555
> ??? => 7.6428022384643555
> ??? => 7.384960651397705
> ????? => 7.384960651397705
> ????? => 7.166958332061768
> 24 => 6.186699867248535
> 31 => 5.9121222496032715
> ????? => 5.627420902252197
> :VL-104{n=1 c=[??:6.089, ????:5.404, ??????:3.795, ???????:5.915, ??????:7.385, ????????:8.939, ?????
> Top Terms:
> ???????? => 12.641136169433594
> ?????? => 9.422260284423828
> ????????? => 8.93863296508789
> ???? => 8.93863296508789
>
>
> ===============================================================
> I think the meaningless (?) is a problem of encoding.... Can anyone help me
> in this????
>
> Also I want a tutorial describing the command for k-mean clustering and it
> attributes and what is the output of clusterdump represent for?
>
> Thank you....
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Arabic-K-mean-clustering-tp2518248p2518248.html
> Sent from the Mahout User List mailing list archive at Nabble.com.
>
