Ball k-means is there, at the very least as part of the streaming k-means code. The abbreviation bkm has been used for it in the past.
In looking at the code just now, I don't find any command-line invocation of bkm. It should be quite simple to write one, and it would be very handy to have a way to run streaming k-means without a map-reduce step as well. As such, it might be good to have a new option on streaming k-means to use just bkm in a single thread, to use threaded streaming k-means on a single machine, or to use map-reduce streaming k-means. You up for trying to make a patch?

Sent from my iPhone

On Oct 6, 2013, at 12:37, Jens Bonerz <[email protected]> wrote:

> Hmmm... has ballkmeans made it into the 0.8 release already? I can't find it
> in the list of available programs when calling the mahout binary...
>
>
> 2013/10/3 Ted Dunning <[email protected]>
>
>> What you are seeing here are the cluster centroids themselves, not the
>> cluster assignments.
>>
>> Streaming k-means is a single-pass algorithm to derive these centroids.
>> Typically, the next step is to cluster these centroids using ball k-means.
>> *Those* results can then be applied back to the original (or new) input
>> vectors to get cluster assignments for individual input vectors.
>>
>> I don't have command-line specifics handy, but you seem to have done very
>> well already at figuring out the details.
>>
>>
>> On Oct 3, 2013, at 7:30 AM, Jens Bonerz wrote:
>>
>>> I created a series of scripts to try out streamingkmeans in Mahout and
>>> increased the number of clusters to a high amount as suggested by Ted.
>>> Everything seems to work. However, I can't figure out how to access the
>>> actual cluster data at the end of the process.
>>>
>>> It just gives me output that I cannot really understand... I would expect
>>> my product_ids being referenced to cluster ids...
>>>
>>> Example of the procedure's output:
>>>
>>> hadoop binary is not in PATH,HADOOP_HOME/bin,HADOOP_PREFIX/bin, running
>>> locally
>>> Input Path: file:MahoutCluster/part-r-00000
>>> Key class: class org.apache.hadoop.io.IntWritable Value Class: class
>>> org.apache.mahout.clustering.streaming.mapreduce.CentroidWritable
>>> Key: 0: Value: key = 8678, weight = 3.00, vector =
>>> {37:26.83479118347168,6085:8.162049293518066,4785:10.44443130493164,2493:19.677349090576172,2494:16.06648826599121,9659:9.568963050842285,20877:9.307...
>>> Key: 1: Value: key = 3118, weight = 14.00, vector =
>>> {19457:5.646900812784831,8774:4.746263821919759,9738:1.022495985031128,13301:5.762300491333008,14947:0.6774413585662842,8787:6.841406504313151,14958...
>>> Key: 2: Value: key = 2867, weight = 3.00, vector =
>>> {15873:10.955257415771484,1615:4.029662132263184,20963:4.979445934295654,3978:5.611329555511475,7950:8.364990234375,8018:8.68657398223877,15433:7.959...
>>> Key: 3: Value: key = 6295, weight = 1.00, vector =
>>> {17113:10.955257415771484,15347:9.568963050842285,15348:10.955257415771484,19845:7.805374622344971,7945:10.262109756469727,15356:18.090286254882812,1...
>>> Key: 4: Value: key = 6725, weight = 4.00, vector =
>>> {10570:7.64715051651001,14915:6.126943588256836,14947:4.064648151397705,14330:9.414812088012695,18271:2.7172491550445557,14335:19.677349090576172,143...
>>> Key: 5: ......
>>>
>>>
>>> This is my recipe:
>>>
>>> --------------------------------------------------------------------------------------------------------
>>> Step 1
>>> Create a seqfile from my data with Python. It's the product_id (key) and the
>>> short normalized description (value) that is written into the sequence file.
>>>
>>> --------------------------------------------------------------------------------------------------------
>>> Step 2
>>> Create vectors from that data with the following command:
>>>
>>> mahout seq2sparse \
>>>   -i productClusterSequenceData/productClusterSequenceData.seq \
>>>   -o productClusterSequenceData/vectors
>>>
>>> --------------------------------------------------------------------------------------------------------
>>> Step 3
>>> Cluster the vectors using streamingkmeans with this command:
>>>
>>> mahout streamingkmeans \
>>>   -i productClusterSequenceData/vectors/tfidf-vectors \
>>>   -o MahoutCluster \
>>>   --tempDir /tmp \
>>>   -ow -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
>>>   -sc org.apache.mahout.math.neighborhood.FastProjectionSearch \
>>>   -k 10000 -km 500000
>>>
>>> --------------------------------------------------------------------------------------------------------
>>> Step 4
>>> Export the streamingkmeans cluster data into a textfile (for an extract of
>>> the result, see above):
>>>
>>> mahout seqdumper \
>>>   -i MahoutCluster > similarProducts.txt
>>>
>>> What am I missing?
>>>
>>>
>>> 2013/10/3 Ted Dunning <[email protected]>
>>>
>>>> Yes. That will work.
>>>>
>>>> The sketch will then contain 10,000 x log N centroids. If N = 10^9,
>>>> log N ≈ 30, so the sketch will have about 300,000 weighted centroids in
>>>> it. The final clustering will have to process these centroids to produce
>>>> the desired 5,000 clusters. Since 300,000 is a relatively small number of
>>>> data points, this clustering step should proceed relatively quickly.
>>>>
>>>>
>>>> On Wed, Oct 2, 2013 at 10:21 AM, Jens Bonerz <[email protected]> wrote:
>>>>
>>>>> Thx for your elaborate answer.
>>>>>
>>>>> So if the upper bound on the final number of clusters is unknown at the
>>>>> beginning, what would happen if I define a very high number that is
>>>>> guaranteed to be greater than the estimated number of clusters?
>>>>> For example, if I set it to 10,000 clusters when an estimate of 5,000 is
>>>>> likely, will that work?
>>>>>
>>>>>
>>>>> 2013/10/2 Ted Dunning <[email protected]>
>>>>>
>>>>>> The way that the new streaming k-means works is that there is a first
>>>>>> sketch pass which only requires an upper bound on the final number of
>>>>>> clusters you will want. It adaptively creates more or fewer clusters
>>>>>> depending on the data and your bound. This sketch is guaranteed to be
>>>>>> computed within at most one map-reduce pass. There is a threaded version
>>>>>> that runs (fast) on a single machine. The threaded version is liable to
>>>>>> be faster than the map-reduce version for moderate or smaller data sizes.
>>>>>>
>>>>>> That sketch can then be used to do all kinds of things that rely on
>>>>>> Euclidean distance and still get results within a small factor of the
>>>>>> same algorithm applied to all of the data. Typically this second phase is
>>>>>> a ball k-means algorithm, but it could easily be a dp-means algorithm [1]
>>>>>> if you want a variable number of clusters. Indeed, you could run many
>>>>>> dp-means passes with different values of lambda on the same sketch. Note
>>>>>> that the sketch is small enough that in-memory clustering is entirely
>>>>>> viable and is very fast.
>>>>>>
>>>>>> For the problem you describe, however, you probably don't need the sketch
>>>>>> approach at all and can probably apply ball k-means or dp-means directly.
>>>>>> Running many k-means clusterings with differing values of k should be
>>>>>> entirely feasible as well with such data sizes.
>>>>>> >>>>>> [1] http://www.cs.berkeley.edu/~jordan/papers/kulis-jordan-icml12.pdf >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> On Wed, Oct 2, 2013 at 9:11 AM, Jens Bonerz <[email protected]> >>>>>> wrote: >>>>>> >>>>>>> Isn't the streaming k-means just a different approach to crunch >>>> through >>>>>> the >>>>>>> data? In other words, the result of streaming k-means should be >>>>>> comparable >>>>>>> to using k-means in multiple chained map reduce cycles? >>>>>>> >>>>>>> I just read a paper about the k-means clustering and its underlying >>>>>>> algorithm. >>>>>>> >>>>>>> According to that paper, k-means relies on a preknown/predefined >>>> amount >>>>>> of >>>>>>> clusters as an input parameter. >>>>>>> >>>>>>> Link: http://books.nips.cc/papers/files/nips22/NIPS2009_1085.pdf >>>>>>> >>>>>>> In my current scenario however, the number of clusters is unknown at >>>>> the >>>>>>> beginning. >>>>>>> >>>>>>> Maybe k-means is just not the right algorithm for clustering similar >>>>>>> products based on their short description text? What else could I >>>> use? >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> 2013/10/1 Ted Dunning <[email protected]> >>>>>>> >>>>>>>> At such small sizes, I would guess that the sequential version of >>>> the >>>>>>>> streaming k-means or ball k-means would be better options. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On Mon, Sep 30, 2013 at 2:14 PM, mercutio7979 < >>>>> [email protected] >>>>>>>>> wrote: >>>>>>>> >>>>>>>>> Hello all, >>>>>>>>> >>>>>>>>> I am currently trying create clusters from a group of 50.000 >>>>> strings >>>>>>> that >>>>>>>>> contain product descriptions (around 70-100 characters length >>>>> each). >>>>>>>>> >>>>>>>>> That group of 50.000 consists of roughly 5.000 individual >>>> products >>>>>> and >>>>>>>> ten >>>>>>>>> varying product descriptions per product. The product >>>> descriptions >>>>>> are >>>>>>>>> already prepared for clustering and contain a normalized brand >>>>> name, >>>>>>>>> product >>>>>>>>> model number, etc. 
>>>>>>>>>
>>>>>>>>> What would be a good approach to maximise the number of found clusters
>>>>>>>>> (the best possible result would be 5,000 clusters with 10 products
>>>>>>>>> each)?
>>>>>>>>>
>>>>>>>>> I adapted the Reuters cluster script to read in my data and managed to
>>>>>>>>> create a first set of clusters. However, I have not managed to
>>>>>>>>> maximise the cluster count.
>>>>>>>>>
>>>>>>>>> The question is: what do I need to tweak in the available Mahout
>>>>>>>>> settings so that the clusters are created as precisely as possible?
>>>>>>>>>
>>>>>>>>> Many regards!
>>>>>>>>> Jens
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> View this message in context:
>>>>>>>>> http://lucene.472066.n3.nabble.com/What-are-the-best-settings-for-my-clustering-task-tp4092807.html
>>>>>>>>> Sent from the Mahout User List mailing list archive at Nabble.com.
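[A note on the seqdumper step discussed in the thread: each centroid's weight counts the input vectors absorbed by that centroid, so summing the weights is a cheap sanity check that all points were processed. A minimal sketch, assuming the `weight = X.XX,` line layout shown in the example output and a dump file named similarProducts.txt:]

```shell
# Count centroids and sum their weights in the seqdumper output.
# Assumes lines like "Key: 0: Value: key = 8678, weight = 3.00, vector = ..."
# as in the thread; the weight sum should roughly equal the number of
# input vectors fed to streamingkmeans.
grep -c 'weight =' similarProducts.txt
grep -o 'weight = [0-9.]*' similarProducts.txt |
  awk '{sum += $3} END {print sum}'
```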

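[On the single-machine option proposed at the top of the thread: many Mahout job drivers accept a `--method` (`-xm`) execution switch, so a sequential, non-MapReduce run of the sketch phase might look like the fragment below. The `-xm sequential` flag is an assumption borrowed from other drivers, not confirmed for streamingkmeans; check `mahout streamingkmeans --help` on your release before relying on it.]

```shell
# Hypothetical single-machine sketch run (no MapReduce step).
# NOTE: -xm/--method is assumed from other Mahout drivers and may not
# be wired up in streamingkmeans on every release.
mahout streamingkmeans \
  -i productClusterSequenceData/vectors/tfidf-vectors \
  -o MahoutCluster \
  -ow -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
  -sc org.apache.mahout.math.neighborhood.FastProjectionSearch \
  -k 10000 -km 500000 \
  -xm sequential
```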