Hi, I believe the following JIRA issue has already addressed this problem:
https://issues.apache.org/jira/browse/MAHOUT-594
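
For illustration only: the "?" symptom in the quoted output is what Java produces when text is encoded with a charset that cannot represent the characters. Below is a minimal sketch of that effect. It is not Mahout code; the class name EncodingDemo and the helper roundTrip are hypothetical, and I am assuming (not confirming) that a non-UTF-8 charset is involved somewhere in the dump path.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Mahout.
class EncodingDemo {
    // Encode and then decode with the same charset, as a writer/reader pair would.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        // A four-letter Arabic word, written with escapes so the source file stays ASCII.
        String term = "\u0643\u062A\u0627\u0628";

        // US-ASCII cannot represent Arabic letters; each one is replaced by '?'.
        System.out.println(roundTrip(term, StandardCharsets.US_ASCII)); // prints "????"

        // UTF-8 preserves the text exactly.
        System.out.println(roundTrip(term, StandardCharsets.UTF_8).equals(term)); // prints "true"
    }
}
```

If the terms look fine in the dictionary file but not in the dump, the charset used when writing or displaying the output is the first thing I would check.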

Thanks, -- Shige

On Thu, Feb 17, 2011 at 3:57 AM, WaleedAzmy <[email protected]> wrote:

>
> Dear All...
>
> I tried to test Mahout k-means clustering on Arabic data, but I think
> there is a problem with the encoding.
>
> I tried the following commands:
> =======================
>
> $ ./mahout seqdirectory -i "....\Arabic_data" -o
> "....\ArabicTest\Arabic_data-seqdir" -c UTF-8 -chunk 5
>
> $ ./mahout seq2sparse -i "....\ArabicTest\Arabic_data-seqdir" -o
> "....\ArabicTest\Arabic_data_out-seqdir"
>
> $ ./mahout kmeans -i
> "....\ArabicTest\Arabic_data_out-seqdir\tfidf-vectors/"
> -c "....\ArabicTest\clusters" -o "....\ArabicTest\arabic-kmeans" -x 10 -k
> 20
> -ow
>
> $ ./mahout clusterdump -s "....\ArabicTest\arabic-kmeans\clusters-1" -d
> "....\ArabicTest\Arabic_data_out-seqdir\dictionary.file-0" -dt sequencefile
> -b 100 -n 20
>
>
> The clusterdump generates the following output:
> ===================================
>
> no HADOOP_HOME set, running locally
> :VL-1{n=1 c=[24:6.187, 31:5.912, 53:7.643, 69:7.958, 77:8.365, ??:2.260,
> ?????:5.627, ?????:5.627, ??
>        Top Terms:
>                ????                                    =>
>  11.830205917358398
>                ?????                                   =>
>  10.808554649353027
>                ???????                                 =>
>  8.93863296508789
>                ?????                                   =>
>  8.93863296508789
>                ???????                                 =>
>  8.93863296508789
>                ???????                                 =>
>  8.93863296508789
>                77                                      =>
> 8.365219116210938
>                ????                                    =>
> 8.365219116210938
>                ??????                                  =>
> 8.365219116210938
>                ???????????                             =>
> 8.365219116210938
>                69                                      =>
> 7.958374977111816
>                ?????                                   =>
>  7.6428022384643555
>                53                                      =>
>  7.6428022384643555
>                ???                                     =>
>  7.6428022384643555
>                ???                                     =>
> 7.384960651397705
>                ?????                                   =>
> 7.384960651397705
>                ?????                                   =>
> 7.166958332061768
>                24                                      =>
> 6.186699867248535
>                31                                      =>
>  5.9121222496032715
>                ?????                                   =>
> 5.627420902252197
> :VL-104{n=1 c=[??:6.089, ????:5.404, ??????:3.795, ???????:5.915,
> ??????:7.385, ????????:8.939, ?????
>        Top Terms:
>                ????????                                =>
>  12.641136169433594
>                ??????                                  =>
> 9.422260284423828
>                ?????????                               =>
>  8.93863296508789
>                ????                                    =>
>  8.93863296508789
>
>
> ===============================================================
> I think the meaningless question marks (?) point to an encoding problem. Can
> anyone help me with this?
>
> Also, I would like a tutorial that describes the k-means clustering command
> and its options, and explains what the clusterdump output represents.
>
> Thank you....
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Arabic-K-mean-clustering-tp2518248p2518248.html
> Sent from the Mahout User List mailing list archive at Nabble.com.
>
