What you are seeing here are the cluster centroids themselves, not the cluster
assignments.
Streaming k-means is a single-pass algorithm to derive these centroids.
Typically, the next step is to cluster these centroids using ball k-means.
*Those* results can then be applied back to the original (or new) input vectors
to get cluster assignments for individual input vectors.
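That last step is plain nearest-centroid assignment. A minimal illustrative sketch in Python (working on sparse {index:weight} vectors like the ones in the dump output; this is not the Mahout API, just the idea):

```python
import math

def cosine_distance(a, b):
    # a, b: dicts mapping dimension index -> weight (sparse vectors,
    # in the same shape as the {index:weight,...} entries seqdumper prints)
    dot = sum(w * b.get(i, 0.0) for i, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    if na == 0 or nb == 0:
        return 1.0
    return 1.0 - dot / (na * nb)

def assign(vectors, centroids):
    # vectors: {product_id: sparse_vector}; centroids: {cluster_id: sparse_vector}
    # Returns {product_id: id of the nearest centroid} -- the cluster
    # assignment the original poster is looking for.
    return {
        pid: min(centroids, key=lambda cid: cosine_distance(v, centroids[cid]))
        for pid, v in vectors.items()
    }
```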
I don't have command line specifics handy, but you seem to have done very well
already at figuring out the details.
On Oct 3, 2013, at 7:30 AM, Jens Bonerz wrote:
> I created a series of scripts to try out streamingkmeans in Mahout and
> increased the number of clusters to a high value, as suggested by Ted.
> Everything seems to work. However, I can't figure out how to access the
> actual cluster data at the end of the process.
>
> It just gives me output that I cannot really understand... I would expect
> my product_ids to be mapped to cluster ids...
>
> Example of the procedure's output:
>
> hadoop binary is not in PATH,HADOOP_HOME/bin,HADOOP_PREFIX/bin, running
> locally
> Input Path: file:MahoutCluster/part-r-00000
> Key class: class org.apache.hadoop.io.IntWritable Value Class: class
> org.apache.mahout.clustering.streaming.mapreduce.CentroidWritable
> Key: 0: Value: key = 8678, weight = 3.00, vector =
> {37:26.83479118347168,6085:8.162049293518066,4785:10.44443130493164,2493:19.677349090576172,2494:16.06648826599121,9659:9.568963050842285,20877:9.307...
> Key: 1: Value: key = 3118, weight = 14.00, vector =
> {19457:5.646900812784831,8774:4.746263821919759,9738:1.022495985031128,13301:5.762300491333008,14947:0.6774413585662842,8787:6.841406504313151,14958...
> Key: 2: Value: key = 2867, weight = 3.00, vector =
> {15873:10.955257415771484,1615:4.029662132263184,20963:4.979445934295654,3978:5.611329555511475,7950:8.364990234375,8018:8.68657398223877,15433:7.959...
> Key: 3: Value: key = 6295, weight = 1.00, vector =
> {17113:10.955257415771484,15347:9.568963050842285,15348:10.955257415771484,19845:7.805374622344971,7945:10.262109756469727,15356:18.090286254882812,1...
> Key: 4: Value: key = 6725, weight = 4.00, vector =
> {10570:7.64715051651001,14915:6.126943588256836,14947:4.064648151397705,14330:9.414812088012695,18271:2.7172491550445557,14335:19.677349090576172,143...
> Key: 5:......
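(Each of those dump lines is one centroid in sparse {index:weight} form. A hypothetical parser for lines in exactly the format shown above, assuming the truncated vectors are completed with a closing brace:)

```python
import re

def parse_centroid_line(line):
    # Parses one seqdumper line of the form shown above, e.g.
    # "Key: 0: Value: key = 8678, weight = 3.00, vector = {37:26.8,6085:8.1}"
    # Returns (cluster_id, weight, sparse_vector_dict).
    # Hypothetical helper; the real dump lines are truncated with "...".
    m = re.match(
        r"Key: (\d+): Value: key = (\d+), weight = ([\d.]+), vector = \{(.*)\}",
        line,
    )
    cluster_id, _key, weight, body = m.groups()
    vector = {}
    for pair in body.split(","):
        idx, val = pair.split(":")
        vector[int(idx)] = float(val)
    return int(cluster_id), float(weight), vector
```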
>
>
>
> this is my recipe:
>
> --------------------------------------------------------------------------------------------------------
> Step 1
> Create a seqfile from my data with Python. It's the product_id (key) and the
> short normalized description (value) that are written into the sequence file.
>
>
>
> --------------------------------------------------------------------------------------------------------
> Step 2
> Create vectors from that data with the following command:
>
> mahout seq2sparse \
> -i productClusterSequenceData/productClusterSequenceData.seq \
> -o productClusterSequenceData/vectors
>
>
>
> --------------------------------------------------------------------------------------------------------
> Step 3
> Cluster the vectors using streamingkmeans with this command:
>
> mahout streamingkmeans \
> -i productClusterSequenceData/vectors/tfidf-vectors \
> -o MahoutCluster \
> --tempDir /tmp \
> -ow -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
> -sc org.apache.mahout.math.neighborhood.FastProjectionSearch \
> -k 10000 -km 500000
>
>
>
> --------------------------------------------------------------------------------------------------------
> Step 4
> Export the streamingkmeans cluster data into a text file (for an extract of
> the result, see above):
>
> mahout seqdumper \
> -i MahoutCluster > similarProducts.txt
>
> What am I missing?
>
>
>
>
>
> 2013/10/3 Ted Dunning <[email protected]>
>
>> Yes. That will work.
>>
>> The sketch will then contain 10,000 x log N centroids. If N = 10^9, log N
>> ≈ 30, so the sketch will have about 300,000 weighted centroids in
>> it. The final clustering will have to process these centroids to produce
>> the desired 5,000 clusters. Since 300,000 is a relatively small number of
>> data points, this clustering step should proceed relatively quickly.
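(Ted's figure checks out if "log N" is read as the base-2 logarithm:)

```python
import math

k = 10_000            # upper bound on the final number of clusters
n = 10**9             # number of input points, N
log_n = math.log2(n)  # ~29.9, matching the "log N ~ 30" above
sketch_size = k * log_n
print(round(sketch_size))  # 298974, i.e. about 300,000 weighted centroids
```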
>>
>>
>>
>> On Wed, Oct 2, 2013 at 10:21 AM, Jens Bonerz <[email protected]>
>> wrote:
>>
>>> Thanks for your detailed answer.
>>>
>>> So if the upper bound on the final number of clusters is unknown at the
>>> beginning, what would happen if I define a very high number that is
>>> guaranteed to be greater than the estimated number of clusters?
>>> For example, if I set it to 10,000 clusters when an estimate of 5,000 is
>>> likely, will that work?
>>>
>>>
>>> 2013/10/2 Ted Dunning <[email protected]>
>>>
>>>> The way that the new streaming k-means works is that there is a first
>>>> sketch pass which only requires an upper bound on the final number of
>>>> clusters you will want. It adaptively creates more or fewer clusters
>>>> depending on the data and your bound. This sketch is guaranteed to be
>>>> computed within at most one map-reduce pass. There is a threaded version
>>>> that runs (fast) on a single machine. The threaded version is liable to
>>>> be faster than the map-reduce version for moderate or smaller data sizes.
>>>>
>>>> That sketch can then be used to do all kinds of things that rely on
>>>> Euclidean distance and still get results within a small factor of the
>>>> same algorithm applied to all of the data. Typically this second phase
>>>> is a ball k-means algorithm, but it could easily be a dp-means
>>>> algorithm [1] if you want a variable number of clusters. Indeed, you
>>>> could run many dp-means passes with different values of lambda on the
>>>> same sketch. Note that the sketch is small enough that in-memory
>>>> clustering is entirely viable and is very fast.
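(To illustrate the dp-means idea from [1] with a toy single-pass sketch — not Mahout's implementation, and the full algorithm also iterates centroid updates: a point farther than lambda from every existing centroid starts a new cluster.)

```python
def dp_means_assign(points, lam):
    # Toy cluster-creation pass of dp-means: a point farther than lam
    # (lambda) from every existing centroid spawns a new cluster.
    # points: list of (x, y) tuples; returns (centroids, labels).
    centroids = []
    labels = []
    for p in points:
        if centroids:
            d, j = min(
                (((p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2) ** 0.5, idx)
                for idx, c in enumerate(centroids)
            )
        else:
            d, j = float("inf"), -1
        if d > lam:          # too far from everything: new cluster
            centroids.append(p)
            labels.append(len(centroids) - 1)
        else:                # close enough: join nearest cluster
            labels.append(j)
    return centroids, labels
```

Larger lambda yields fewer clusters, which is why sweeping lambda over the same sketch gives a range of clusterings.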
>>>>
>>>> For the problem you describe, however, you probably don't need the
>>>> sketch approach at all and can probably apply ball k-means or dp-means
>>>> directly. Running many k-means clusterings with differing values of k
>>>> should be entirely feasible as well with such data sizes.
>>>>
>>>> [1] http://www.cs.berkeley.edu/~jordan/papers/kulis-jordan-icml12.pdf
>>>>
>>>>
>>>>
>>>>
>>>> On Wed, Oct 2, 2013 at 9:11 AM, Jens Bonerz <[email protected]>
>>>> wrote:
>>>>
>>>>> Isn't streaming k-means just a different approach to crunch through
>>>>> the data? In other words, the result of streaming k-means should be
>>>>> comparable to using k-means in multiple chained map-reduce cycles?
>>>>>
>>>>> I just read a paper about the k-means clustering and its underlying
>>>>> algorithm.
>>>>>
>>>>> According to that paper, k-means relies on a predefined number of
>>>>> clusters as an input parameter.
>>>>>
>>>>> Link: http://books.nips.cc/papers/files/nips22/NIPS2009_1085.pdf
>>>>>
>>>>> In my current scenario, however, the number of clusters is unknown
>>>>> at the beginning.
>>>>>
>>>>> Maybe k-means is just not the right algorithm for clustering similar
>>>>> products based on their short description text? What else could I use?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> 2013/10/1 Ted Dunning <[email protected]>
>>>>>
>>>>>> At such small sizes, I would guess that the sequential version of the
>>>>>> streaming k-means or ball k-means would be better options.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, Sep 30, 2013 at 2:14 PM, mercutio7979 <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hello all,
>>>>>>>
>>>>>>> I am currently trying to create clusters from a group of 50,000
>>>>>>> strings that contain product descriptions (around 70-100 characters
>>>>>>> each).
>>>>>>>
>>>>>>> That group of 50,000 consists of roughly 5,000 individual products
>>>>>>> and ten varying product descriptions per product. The product
>>>>>>> descriptions are already prepared for clustering and contain a
>>>>>>> normalized brand name, product model number, etc.
>>>>>>>
>>>>>>> What would be a good approach to maximise the number of clusters
>>>>>>> found (the best possible outcome would be 5,000 clusters with 10
>>>>>>> products each)?
>>>>>>>
>>>>>>> I adapted the Reuters cluster script to read in my data and managed
>>>>>>> to create a first set of clusters. However, I have not managed to
>>>>>>> maximise the cluster count.
>>>>>>>
>>>>>>> The question is: what do I need to tweak in the available Mahout
>>>>>>> settings so that the clusters are created as precisely as possible?
>>>>>>>
>>>>>>> Many regards!
>>>>>>> Jens
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> View this message in context:
>>>>>>> http://lucene.472066.n3.nabble.com/What-are-the-best-settings-for-my-clustering-task-tp4092807.html
>>>>>>> Sent from the Mahout User List mailing list archive at Nabble.com.
>>>>>>
>>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> CEO
>>> Hightech Marketing Group
>>> Cell/Mobile: +49 173 539 3588
>>>
>>> ____
>>>
>>> Hightech Marketing Group
>>> Frankenstraße 32
>>> 50354 Huerth
>>> Germany
>>> Phone: +49 (0)2233 – 619 2741
>>> Fax: +49 (0)2233 – 619 27419
>>> Web: www.hightechmg.com
>>>
>>
>
>
>