Ball k-means is there, at the very least as part of the streaming k-means code. The abbreviation bkm has been used for it in the past.
In looking at the code just now, I don't find any command-line invocation of bkm. It should be quite simple to write one, and it would be very handy to have a way to run streaming k-means without a map-reduce step as well. As such, it might be good to have a new option on streaming k-means to use just bkm in a single thread, to use threaded streaming k-means on a single machine, or to use map-reduce streaming k-means. You up for trying to make a patch?

Sent from my iPhone

On Oct 6, 2013, at 12:37, Jens Bonerz <[email protected]> wrote:

> Hmmm... has ballkmeans made it into the 0.8 release already? I can't find it
> in the list of available programs when calling the mahout binary...
>
>
> 2013/10/3 Ted Dunning <[email protected]>
>
>> What you are seeing here are the cluster centroids themselves, not the
>> cluster assignments.
>>
>> Streaming k-means is a single-pass algorithm to derive these centroids.
>> Typically, the next step is to cluster these centroids using ball k-means.
>> *Those* results can then be applied back to the original (or new) input
>> vectors to get cluster assignments for individual input vectors.
>>
>> I don't have command-line specifics handy, but you seem to have done very
>> well already at figuring out the details.
>>
>>
>> On Oct 3, 2013, at 7:30 AM, Jens Bonerz wrote:
>>
>>> I created a series of scripts to try out streamingkmeans in Mahout and
>>> increased the number of clusters to a high amount as suggested by Ted.
>>> Everything seems to work. However, I can't figure out how to access the
>>> actual cluster data at the end of the process.
>>>
>>> It just gives me output that I cannot really understand... I would expect
>>> my product_ids being referenced to cluster ids...
>>>
>>> Example of the procedure's output:
>>>
>>> hadoop binary is not in PATH,HADOOP_HOME/bin,HADOOP_PREFIX/bin, running
>>> locally
>>> Input Path: file:MahoutCluster/part-r-00000
>>> Key class: class org.apache.hadoop.io.IntWritable Value Class: class
>>> org.apache.mahout.clustering.streaming.mapreduce.CentroidWritable
>>> Key: 0: Value: key = 8678, weight = 3.00, vector =
>>> {37:26.83479118347168,6085:8.162049293518066,4785:10.44443130493164,2493:19.677349090576172,2494:16.06648826599121,9659:9.568963050842285,20877:9.307...
>>> Key: 1: Value: key = 3118, weight = 14.00, vector =
>>> {19457:5.646900812784831,8774:4.746263821919759,9738:1.022495985031128,13301:5.762300491333008,14947:0.6774413585662842,8787:6.841406504313151,14958...
>>> Key: 2: Value: key = 2867, weight = 3.00, vector =
>>> {15873:10.955257415771484,1615:4.029662132263184,20963:4.979445934295654,3978:5.611329555511475,7950:8.364990234375,8018:8.68657398223877,15433:7.959...
>>> Key: 3: Value: key = 6295, weight = 1.00, vector =
>>> {17113:10.955257415771484,15347:9.568963050842285,15348:10.955257415771484,19845:7.805374622344971,7945:10.262109756469727,15356:18.090286254882812,1...
>>> Key: 4: Value: key = 6725, weight = 4.00, vector =
>>> {10570:7.64715051651001,14915:6.126943588256836,14947:4.064648151397705,14330:9.414812088012695,18271:2.7172491550445557,14335:19.677349090576172,143...
>>> Key: 5: ......
>>>
>>>
>>> This is my recipe:
>>>
>>> --------------------------------------------------------------------------------------------------------
>>> Step 1
>>> Create a seqfile from my data with Python. It's the product_id (key) and the
>>> short normalized description (value) that is written into the sequence file.
>>>
>>> --------------------------------------------------------------------------------------------------------
>>> Step 2
>>> Create vectors from that data with the following command:
>>>
>>> mahout seq2sparse \
>>>   -i productClusterSequenceData/productClusterSequenceData.seq \
>>>   -o productClusterSequenceData/vectors
>>>
>>> --------------------------------------------------------------------------------------------------------
>>> Step 3
>>> Cluster the vectors using streamingkmeans with this command:
>>>
>>> mahout streamingkmeans \
>>>   -i productClusterSequenceData/vectors/tfidf-vectors \
>>>   -o MahoutCluster \
>>>   --tempDir /tmp \
>>>   -ow -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
>>>   -sc org.apache.mahout.math.neighborhood.FastProjectionSearch \
>>>   -k 10000 -km 500000
>>>
>>> --------------------------------------------------------------------------------------------------------
>>> Step 4
>>> Export the streamingkmeans cluster data into a textfile (for an extract of
>>> the result, see above):
>>>
>>> mahout seqdumper \
>>>   -i MahoutCluster > similarProducts.txt
>>>
>>> What am I missing?
>>>
>>>
>>> 2013/10/3 Ted Dunning <[email protected]>
>>>
>>>> Yes. That will work.
>>>>
>>>> The sketch will then contain 10,000 x log N centroids. If N = 10^9,
>>>> log N ≈ 30, so the sketch will have about 300,000 weighted centroids in
>>>> it. The final clustering will have to process these centroids to produce
>>>> the desired 5,000 clusters. Since 300,000 is a relatively small number of
>>>> data points, this clustering step should proceed relatively quickly.
>>>>
>>>>
>>>> On Wed, Oct 2, 2013 at 10:21 AM, Jens Bonerz <[email protected]> wrote:
>>>>
>>>>> Thx for your elaborate answer.
>>>>>
>>>>> So if the upper bound on the final number of clusters is unknown at the
>>>>> beginning, what would happen if I define a very high number that is
>>>>> guaranteed to be greater than the estimated number of clusters?
>>>>> For example, if I set it to 10,000 clusters when an estimate of 5,000 is
>>>>> likely, will that work?
>>>>>
>>>>>
>>>>> 2013/10/2 Ted Dunning <[email protected]>
>>>>>
>>>>>> The way that the new streaming k-means works is that there is a first
>>>>>> sketch pass which only requires an upper bound on the final number of
>>>>>> clusters you will want. It adaptively creates more or fewer clusters
>>>>>> depending on the data and your bound. This sketch is guaranteed to be
>>>>>> computed within at most one map-reduce pass. There is a threaded version
>>>>>> that runs (fast) on a single machine. The threaded version is liable to
>>>>>> be faster than the map-reduce version for moderate or smaller data sizes.
>>>>>>
>>>>>> That sketch can then be used to do all kinds of things that rely on
>>>>>> Euclidean distance and still get results within a small factor of the
>>>>>> same algorithm applied to all of the data. Typically this second phase is
>>>>>> a ball k-means algorithm, but it could easily be a dp-means algorithm [1]
>>>>>> if you want a variable number of clusters. Indeed, you could run many
>>>>>> dp-means passes with different values of lambda on the same sketch. Note
>>>>>> that the sketch is small enough that in-memory clustering is entirely
>>>>>> viable and is very fast.
>>>>>>
>>>>>> For the problem you describe, however, you probably don't need the sketch
>>>>>> approach at all and can probably apply ball k-means or dp-means directly.
>>>>>> Running many k-means clusterings with differing values of k should be
>>>>>> entirely feasible as well with such data sizes.
>>>>>> >>>>>> [1] http://www.cs.berkeley.edu/~jordan/papers/kulis-jordan-icml12.pdf >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> On Wed, Oct 2, 2013 at 9:11 AM, Jens Bonerz <[email protected]> >>>>>> wrote: >>>>>> >>>>>>> Isn't the streaming k-means just a different approach to crunch >>>> through >>>>>> the >>>>>>> data? In other words, the result of streaming k-means should be >>>>>> comparable >>>>>>> to using k-means in multiple chained map reduce cycles? >>>>>>> >>>>>>> I just read a paper about the k-means clustering and its underlying >>>>>>> algorithm. >>>>>>> >>>>>>> According to that paper, k-means relies on a preknown/predefined >>>> amount >>>>>> of >>>>>>> clusters as an input parameter. >>>>>>> >>>>>>> Link: http://books.nips.cc/papers/files/nips22/NIPS2009_1085.pdf >>>>>>> >>>>>>> In my current scenario however, the number of clusters is unknown at >>>>> the >>>>>>> beginning. >>>>>>> >>>>>>> Maybe k-means is just not the right algorithm for clustering similar >>>>>>> products based on their short description text? What else could I >>>> use? >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> 2013/10/1 Ted Dunning <[email protected]> >>>>>>> >>>>>>>> At such small sizes, I would guess that the sequential version of >>>> the >>>>>>>> streaming k-means or ball k-means would be better options. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On Mon, Sep 30, 2013 at 2:14 PM, mercutio7979 < >>>>> [email protected] >>>>>>>>> wrote: >>>>>>>> >>>>>>>>> Hello all, >>>>>>>>> >>>>>>>>> I am currently trying create clusters from a group of 50.000 >>>>> strings >>>>>>> that >>>>>>>>> contain product descriptions (around 70-100 characters length >>>>> each). >>>>>>>>> >>>>>>>>> That group of 50.000 consists of roughly 5.000 individual >>>> products >>>>>> and >>>>>>>> ten >>>>>>>>> varying product descriptions per product. The product >>>> descriptions >>>>>> are >>>>>>>>> already prepared for clustering and contain a normalized brand >>>>> name, >>>>>>>>> product >>>>>>>>> model number, etc. 
>>>>>>>>>
>>>>>>>>> What would be a good approach to maximise the number of found clusters
>>>>>>>>> (the best possible result would be 5,000 clusters with 10 products
>>>>>>>>> each)?
>>>>>>>>>
>>>>>>>>> I adapted the Reuters cluster script to read in my data and managed to
>>>>>>>>> create a first set of clusters. However, I have not managed to
>>>>>>>>> maximise the cluster count.
>>>>>>>>>
>>>>>>>>> The question is: what do I need to tweak in the available Mahout
>>>>>>>>> settings so that the clusters are created as precisely as possible?
>>>>>>>>>
>>>>>>>>> Many regards!
>>>>>>>>> Jens
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> View this message in context:
>>>>>>>>> http://lucene.472066.n3.nabble.com/What-are-the-best-settings-for-my-clustering-task-tp4092807.html
>>>>>>>>> Sent from the Mahout User List mailing list archive at Nabble.com.
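[A note on the seqdumper step discussed in the thread: each centroid's weight counts the input vectors absorbed by that centroid, so summing the weights is a cheap sanity check that all points were processed. A minimal sketch, assuming the `weight = X.XX,` line layout shown in the example output and a dump file named similarProducts.txt:]

```shell
# Count centroids and sum their weights in the seqdumper output.
# Assumes lines like "Key: 0: Value: key = 8678, weight = 3.00, vector = ..."
# as in the thread; the weight sum should roughly equal the number of
# input vectors fed to streamingkmeans.
grep -c 'weight =' similarProducts.txt
grep -o 'weight = [0-9.]*' similarProducts.txt |
  awk '{sum += $3} END {print sum}'
```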

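[On the single-machine option proposed at the top of the thread: many Mahout job drivers accept a `--method` (`-xm`) execution switch, so a sequential, non-MapReduce run of the sketch phase might look like the fragment below. The `-xm sequential` flag is an assumption borrowed from other drivers, not confirmed for streamingkmeans; check `mahout streamingkmeans --help` on your release before relying on it.]

```shell
# Hypothetical single-machine sketch run (no MapReduce step).
# NOTE: -xm/--method is assumed from other Mahout drivers and may not
# be wired up in streamingkmeans on every release.
mahout streamingkmeans \
  -i productClusterSequenceData/vectors/tfidf-vectors \
  -o MahoutCluster \
  -ow -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
  -sc org.apache.mahout.math.neighborhood.FastProjectionSearch \
  -k 10000 -km 500000 \
  -xm sequential
```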