What you are seeing here are the cluster centroids themselves, not the cluster
assignments.
Streaming k-means is a single-pass algorithm to derive these centroids.
Typically, the next step is to cluster these centroids using ball k-means.
*Those* results can then be applied back to the original (or new) input vectors
to get cluster assignments for individual input vectors.
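That last step is plain nearest-centroid assignment. A minimal illustrative sketch in Python (working on sparse {index:weight} vectors like the ones in the dump output; this is not the Mahout API, just the idea):

```python
import math

def cosine_distance(a, b):
    # a, b: dicts mapping dimension index -> weight (sparse vectors,
    # in the same shape as the {index:weight,...} entries seqdumper prints)
    dot = sum(w * b.get(i, 0.0) for i, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    if na == 0 or nb == 0:
        return 1.0
    return 1.0 - dot / (na * nb)

def assign(vectors, centroids):
    # vectors: {product_id: sparse_vector}; centroids: {cluster_id: sparse_vector}
    # Returns {product_id: id of the nearest centroid} -- the cluster
    # assignment the original poster is looking for.
    return {
        pid: min(centroids, key=lambda cid: cosine_distance(v, centroids[cid]))
        for pid, v in vectors.items()
    }
```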
I don't have command line specifics handy, but you seem to have done very well
already at figuring out the details.
On Oct 3, 2013, at 7:30 AM, Jens Bonerz wrote:
> I created a series of scripts to try out streamingkmeans in Mahout and
> increased the number of clusters to a high value, as suggested by Ted.
> Everything seems to work. However, I can't figure out how to access the
> actual cluster data at the end of the process.
>
> It just gives me output that I cannot really understand... I would expect
> my product_ids to be mapped to cluster ids...
>
> Example of the procedure's output:
>
> hadoop binary is not in PATH,HADOOP_HOME/bin,HADOOP_PREFIX/bin, running
> locally
> Input Path: file:MahoutCluster/part-r-00000
> Key class: class org.apache.hadoop.io.IntWritable Value Class: class
> org.apache.mahout.clustering.streaming.mapreduce.CentroidWritable
> Key: 0: Value: key = 8678, weight = 3.00, vector =
> {37:26.83479118347168,6085:8.162049293518066,4785:10.44443130493164,2493:19.677349090576172,2494:16.06648826599121,9659:9.568963050842285,20877:9.307...
> Key: 1: Value: key = 3118, weight = 14.00, vector =
> {19457:5.646900812784831,8774:4.746263821919759,9738:1.022495985031128,13301:5.762300491333008,14947:0.6774413585662842,8787:6.841406504313151,14958...
> Key: 2: Value: key = 2867, weight = 3.00, vector =
> {15873:10.955257415771484,1615:4.029662132263184,20963:4.979445934295654,3978:5.611329555511475,7950:8.364990234375,8018:8.68657398223877,15433:7.959...
> Key: 3: Value: key = 6295, weight = 1.00, vector =
> {17113:10.955257415771484,15347:9.568963050842285,15348:10.955257415771484,19845:7.805374622344971,7945:10.262109756469727,15356:18.090286254882812,1...
> Key: 4: Value: key = 6725, weight = 4.00, vector =
> {10570:7.64715051651001,14915:6.126943588256836,14947:4.064648151397705,14330:9.414812088012695,18271:2.7172491550445557,14335:19.677349090576172,143...
> Key: 5:......
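(Each of those dump lines is one centroid in sparse {index:weight} form. A hypothetical parser for lines in exactly the format shown above, assuming the truncated vectors are completed with a closing brace:)

```python
import re

def parse_centroid_line(line):
    # Parses one seqdumper line of the form shown above, e.g.
    # "Key: 0: Value: key = 8678, weight = 3.00, vector = {37:26.8,6085:8.1}"
    # Returns (cluster_id, weight, sparse_vector_dict).
    # Hypothetical helper; the real dump lines are truncated with "...".
    m = re.match(
        r"Key: (\d+): Value: key = (\d+), weight = ([\d.]+), vector = \{(.*)\}",
        line,
    )
    cluster_id, _key, weight, body = m.groups()
    vector = {}
    for pair in body.split(","):
        idx, val = pair.split(":")
        vector[int(idx)] = float(val)
    return int(cluster_id), float(weight), vector
```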
>
>
>
> this is my recipe:
>
> --------------------------------------------------------------------------------------------------------
> Step 1
> Create a seqfile from my data with Python. It's the product_id (key) and the
> short normalized description (value) that are written into the sequence file.
>
>
>
> --------------------------------------------------------------------------------------------------------
> Step 2
> Create vectors from that data with the following command:
>
> mahout seq2sparse \
> -i productClusterSequenceData/productClusterSequenceData.seq \
> -o productClusterSequenceData/vectors
>
>
>
> --------------------------------------------------------------------------------------------------------
> Step 3
> Cluster the vectors using streamingkmeans with this command:
>
> mahout streamingkmeans \
> -i productClusterSequenceData/vectors/tfidf-vectors \
> -o MahoutCluster \
> --tempDir /tmp \
> -ow -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
> -sc org.apache.mahout.math.neighborhood.FastProjectionSearch \
> -k 10000 -km 500000
>
>
>
> --------------------------------------------------------------------------------------------------------
> Step 4
> Export the streamingkmeans cluster data into a text file (for an extract of
> the result, see above):
>
> mahout seqdumper \
> -i MahoutCluster > similarProducts.txt
>
> What am I missing?
>
>
>
>
>
> 2013/10/3 Ted Dunning <[email protected]>
>
>> Yes. That will work.
>>
>> The sketch will then contain 10,000 x log N centroids. If N = 10^9, log N
>> ≈ 30, so the sketch will have about 300,000 weighted centroids in
>> it. The final clustering will have to process these centroids to produce
>> the desired 5,000 clusters. Since 300,000 is a relatively small number of
>> data points, this clustering step should proceed relatively quickly.
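(Ted's figure checks out if "log N" is read as the base-2 logarithm:)

```python
import math

k = 10_000            # upper bound on the final number of clusters
n = 10**9             # number of input points, N
log_n = math.log2(n)  # ~29.9, matching the "log N ~ 30" above
sketch_size = k * log_n
print(round(sketch_size))  # 298974, i.e. about 300,000 weighted centroids
```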
>>
>>
>>
>> On Wed, Oct 2, 2013 at 10:21 AM, Jens Bonerz <[email protected]>
>> wrote:
>>
>>> Thanks for your detailed answer.
>>>
>>> So if the upper bound on the final number of clusters is unknown at the
>>> beginning, what would happen if I define a very high number that is
>>> guaranteed to be greater than the estimated number of clusters?
>>> For example, if I set it to 10,000 clusters when an estimate of 5,000 is
>>> likely, will that work?
>>>
>>>
>>> 2013/10/2 Ted Dunning <[email protected]>
>>>
>>>> The way that the new streaming k-means works is that there is a first
>>>> sketch pass which only requires an upper bound on the final number of
>>>> clusters you will want. It adaptively creates more or fewer clusters
>>>> depending on the data and your bound. This sketch is guaranteed to be
>>>> computed within at most one map-reduce pass. There is a threaded version
>>>> that runs (fast) on a single machine. The threaded version is liable to
>>>> be faster than the map-reduce version for moderate or smaller data sizes.
>>>>
>>>> That sketch can then be used to do all kinds of things that rely on
>>>> Euclidean distance and still get results within a small factor of the
>>>> same algorithm applied to all of the data. Typically this second phase
>>>> is a ball k-means algorithm, but it could easily be a dp-means
>>>> algorithm [1] if you want a variable number of clusters. Indeed, you
>>>> could run many dp-means passes with different values of lambda on the
>>>> same sketch. Note that the sketch is small enough that in-memory
>>>> clustering is entirely viable and is very fast.
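(To illustrate the dp-means idea from [1] with a toy single-pass sketch — not Mahout's implementation, and the full algorithm also iterates centroid updates: a point farther than lambda from every existing centroid starts a new cluster.)

```python
def dp_means_assign(points, lam):
    # Toy cluster-creation pass of dp-means: a point farther than lam
    # (lambda) from every existing centroid spawns a new cluster.
    # points: list of (x, y) tuples; returns (centroids, labels).
    centroids = []
    labels = []
    for p in points:
        if centroids:
            d, j = min(
                (((p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2) ** 0.5, idx)
                for idx, c in enumerate(centroids)
            )
        else:
            d, j = float("inf"), -1
        if d > lam:          # too far from everything: new cluster
            centroids.append(p)
            labels.append(len(centroids) - 1)
        else:                # close enough: join nearest cluster
            labels.append(j)
    return centroids, labels
```

Larger lambda yields fewer clusters, which is why sweeping lambda over the same sketch gives a range of clusterings.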
>>>>
>>>> For the problem you describe, however, you probably don't need the
>>>> sketch approach at all and can probably apply ball k-means or dp-means
>>>> directly. Running many k-means clusterings with differing values of k
>>>> should be entirely feasible as well with such data sizes.
>>>>
>>>> [1] http://www.cs.berkeley.edu/~jordan/papers/kulis-jordan-icml12.pdf
>>>>
>>>>
>>>>
>>>>
>>>> On Wed, Oct 2, 2013 at 9:11 AM, Jens Bonerz <[email protected]>
>>>> wrote:
>>>>
>>>>> Isn't streaming k-means just a different approach to crunch through
>>>>> the data? In other words, the result of streaming k-means should be
>>>>> comparable to using k-means in multiple chained map-reduce cycles?
>>>>>
>>>>> I just read a paper about the k-means clustering and its underlying
>>>>> algorithm.
>>>>>
>>>>> According to that paper, k-means relies on a predefined number of
>>>>> clusters as an input parameter.
>>>>>
>>>>> Link: http://books.nips.cc/papers/files/nips22/NIPS2009_1085.pdf
>>>>>
>>>>> In my current scenario, however, the number of clusters is unknown
>>>>> at the beginning.
>>>>>
>>>>> Maybe k-means is just not the right algorithm for clustering similar
>>>>> products based on their short description text? What else could I use?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> 2013/10/1 Ted Dunning <[email protected]>
>>>>>
>>>>>> At such small sizes, I would guess that the sequential version of the
>>>>>> streaming k-means or ball k-means would be better options.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, Sep 30, 2013 at 2:14 PM, mercutio7979 <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hello all,
>>>>>>>
>>>>>>> I am currently trying to create clusters from a group of 50,000
>>>>>>> strings that contain product descriptions (around 70-100 characters
>>>>>>> each).
>>>>>>>
>>>>>>> That group of 50,000 consists of roughly 5,000 individual products
>>>>>>> and ten varying product descriptions per product. The product
>>>>>>> descriptions are already prepared for clustering and contain a
>>>>>>> normalized brand name, product model number, etc.
>>>>>>>
>>>>>>> What would be a good approach to maximise the number of clusters
>>>>>>> found (the best possible outcome would be 5,000 clusters with 10
>>>>>>> products each)?
>>>>>>>
>>>>>>> I adapted the Reuters cluster script to read in my data and managed
>>>>>>> to create a first set of clusters. However, I have not managed to
>>>>>>> maximise the cluster count.
>>>>>>>
>>>>>>> The question is: what do I need to tweak in the available Mahout
>>>>>>> settings so that the clusters are created as precisely as possible?
>>>>>>>
>>>>>>> Many regards!
>>>>>>> Jens
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> View this message in context:
>>>>>>> http://lucene.472066.n3.nabble.com/What-are-the-best-settings-for-my-clustering-task-tp4092807.html
>>>>>>> Sent from the Mahout User List mailing list archive at Nabble.com.
>>>>>>
>>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> CEO
>>> Hightech Marketing Group
>>> Cell/Mobile: +49 173 539 3588
>>>
>>> ____
>>>
>>> Hightech Marketing Group
>>> Frankenstraße 32
>>> 50354 Huerth
>>> Germany
>>> Phone: +49 (0)2233 – 619 2741
>>> Fax: +49 (0)2233 – 619 27419
>>> Web: www.hightechmg.com
>>>
>>
>
>
>