Re: How to use kmeans clustering algorithm of Mahout

Don.Tan Wed, 12 Sep 2012 20:00:18 -0700

I tried it by following the sample code, and I noticed that I should not use seq2sparse directly: that leaves the sparse result empty. Could anyone help me deal with that?


On 09/12/2012 07:09 PM, Paritosh Ranjan wrote:

I think it shouldn't be sparse in the beginning; seq2sparse should take care of that. Someone will correct me if I am wrong, so wait for some time and then go ahead.
On 12-09-2012 16:07, Don.Tan wrote:
Thank you for your prompt reply. Can I ask a question before I go on?

     My original data is in a format like this:
176329,116300,175216,167307,46710,138740,100681,2089,1842,1206,101702,99210,50460,89605,177424,142901,176464,160625,38201,112101,4048,1716,167599,140883,158250,175399,
which is in a sparse format. Is it correct to use seqdirectory and seq2sparse directly?
On 09/12/2012 06:30 PM, Paritosh Ranjan wrote:
Also try following the steps in the cluster-reuters.sh file. This might help.
On 12-09-2012 15:59, Paritosh Ranjan wrote:
Can you explain something about the error and provide the stack trace?

On 12-09-2012 14:22, Don.Tan wrote:
The original data is here:

[hadoop@datamining ~]$ hadoop fs -ls /home/test/test
Found 1 items
-rw-r--r-- 1 hadoop supergroup 129213799 2012-09-12 15:45 /home/test/test/result
After I ran "mahout seqdirectory -i /home/test/test/ -o /home/test/result/ -c UTF-8", I got this:
[hadoop@datamining ~]$ hadoop fs -ls /home/test/result
Found 1 items
-rw-r--r-- 1 hadoop supergroup 129213898 2012-09-12 15:47 /home/test/result/chunk-0
And after "mahout seq2sparse -i /home/test/result -o /home/test/sparse":
[hadoop@datamining ~]$ hadoop fs -ls /home/test/sparse
Found 7 items
drwxr-xr-x - hadoop supergroup 0 2012-09-12 15:54 /home/test/sparse/df-count
-rw-r--r-- 1 hadoop supergroup 442252 2012-09-12 15:53 /home/test/sparse/dictionary.file-0
-rw-r--r-- 1 hadoop supergroup 394853 2012-09-12 15:54 /home/test/sparse/frequency.file-0
drwxr-xr-x - hadoop supergroup 0 2012-09-12 15:53 /home/test/sparse/tf-vectors
drwxr-xr-x - hadoop supergroup 0 2012-09-12 15:54 /home/test/sparse/tfidf-vectors
drwxr-xr-x - hadoop supergroup 0 2012-09-12 15:53 /home/test/sparse/tokenized-documents
drwxr-xr-x - hadoop supergroup 0 2012-09-12 15:53 /home/test/sparse/wordcount
What should I do next? I ran "mahout kmeans -i /home/test/sparse/ -o /home/test/kmeans -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 10 -k 20 -ow --clustering"
but I got an error.....
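
For readers unfamiliar with the -dm option in the command above: it selects a cosine distance measure. The following is a minimal sketch in plain Python of what cosine distance means for sparse vectors like the ones in this thread. It is an illustration only, not Mahout's implementation. The function name, the dict-based vector layout, the short profile `v`, and the value returned for an empty vector are all assumptions made for this sketch.

```python
import math
from typing import Dict

def cosine_distance(a: Dict[int, float], b: Dict[int, float]) -> float:
    """Cosine distance (1 - cosine similarity) between two sparse vectors.

    Each vector maps a feature id to its weight; missing ids count as 0.
    """
    # Only ids present in both vectors contribute to the dot product.
    dot = sum(weight * b[fid] for fid, weight in a.items() if fid in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        # Convention chosen for this sketch: an empty vector is maximally distant.
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)

# User 2 from the thread, as a binary vector.
u = {98660: 1.0, 158620: 1.0, 33900: 1.0}
# Hypothetical short profile built from two ids that appear in the thread.
v = {158620: 1.0, 170482: 1.0}

d = cosine_distance(u, v)
print(round(d, 4))
```

With binary weights the distance depends only on how many feature ids two users share relative to the sizes of their profiles, which is why it is a common choice for sparse nominal data.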

Thx!


On 09/12/2012 03:24 PM, Paritosh Ranjan wrote:
I think you will need these two commands (in this order):

seqdirectory : Generate sequence files (of Text) from a directory
seq2sparse: Sparse Vector generation from Text sequence files

On 12-09-2012 12:28, Don Tan wrote:
I don't think I explained clearly enough; sorry for that.

The example shown before is part of my data.
Each line is a user profile; for example, the first row holds the features of
one user. I want to apply k-means to this data.
I need to create a file that saves all user profiles as sparse vectors and feed
them to the Mahout k-means algorithm. How can I do that?
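
As a rough illustration of the data layout described above, here is a minimal sketch in plain Python that turns each line of feature ids into a sparse binary vector (feature id mapped to 1.0). It does not use Mahout and does not write Hadoop sequence files. The function names and the dict representation are assumptions made for this sketch, and the second sample line is only the first few ids of a longer profile from the thread.

```python
from typing import Dict, List

def parse_profile(line: str) -> Dict[int, float]:
    """Turn one comma-separated line of feature ids into a sparse binary vector."""
    vector: Dict[int, float] = {}
    for token in line.strip().split(","):
        token = token.strip()
        if token:  # lines in the thread end with a trailing comma
            vector[int(token)] = 1.0
    return vector

def parse_profiles(lines: List[str]) -> List[Dict[int, float]]:
    """Parse every non-blank line into one vector per user."""
    return [parse_profile(line) for line in lines if line.strip()]

sample = [
    "98660,158620,33900,",   # user 2 from the thread
    "4780,13922,45040,",     # first ids of another user, truncated for brevity
]
profiles = parse_profiles(sample)
print(len(profiles), sorted(profiles[0]))
```

Only the ids that are present get stored, so a user with a few dozen behaviors out of 50000 possible features costs a few dozen entries rather than 50000.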

  Thanks for your advice!

Don Tan

2012/9/12 Paritosh Ranjan <[email protected]>
I could not understand the question correctly. Can you explain more?
Here you can find how to use the kmeans algorithm of Mahout:
https://cwiki.apache.org/confluence/display/MAHOUT/K-Means+Clustering


On 12-09-2012 11:43, Don.Tan wrote:
Aloha!
I am new to Hadoop and Mahout, but I have set up the Hadoop cluster.
I have been working on a clustering task lately. I don't think I can finish it quickly because I don't know much about how to deal with massive data (my data contains 1400000 users and 50000 features, and it is sparse).
Could you tell me how to deal with that? A slice of the data is here:
167555,152622,162252,79481,66540,41942,75500,167898,61923,182083,180681,181135,174449,166439,167307,174126,87800,2826,
98660,158620,33900,
4780,13922,45040,159210,26423,1471,68200,70402,109721,145860,23740,5818,15087,47861,158620,170482,170161,39120,164514,5854,169183,151229,171110,163457,4356,21363,1307,78105,1322,177011,167822,
176329,116300,175216,167307,46710,138740,100681,2089,1842,1206,101702,99210,50460,89605,177424,142901,176464,160625,38201,112101,4048,1716,167599,140883,158250,175399,
The example above contains 4 users' data, and each number is nominal
(each denotes a kind of user behavior; e.g., user 2 has
"98660", "158620", "33900").
Please tell me how to work on that or which documents I should read.
     Thx!

    Don Tan
