I'd love to contribute to the community. However, I am not proficient in Java and unfortunately have no dev environment for it. If it were written in Python, PHP, or C++, that would be a different story.
But if you can steer me in the right direction, I will try to suggest a patch.

2013/10/6 Ted Dunning <[email protected]>

> It is there, at the very least as part of the streaming k-means code. The abbreviation bkm has been used in the past.
>
> Looking at the code just now, I don't find any command line invocation of bkm. It should be quite simple to write one, and it would be very handy to have a way to run streaming k-means without a map-reduce step as well. As such, it might be good to have a new option to streaming k-means to use just bkm in a single thread, to use threaded streaming k-means on a single machine, or to use map-reduce streaming k-means.
>
> You up for trying to make a patch?
>
> Sent from my iPhone
>
> On Oct 6, 2013, at 12:37, Jens Bonerz <[email protected]> wrote:
>
> > Hmmm... has ballkmeans already made it into the 0.8 release? I can't find it in the list of available programs when calling the mahout binary...
> >
> > 2013/10/3 Ted Dunning <[email protected]>
> >
> >> What you are seeing here are the cluster centroids themselves, not the cluster assignments.
> >>
> >> Streaming k-means is a single-pass algorithm to derive these centroids. Typically, the next step is to cluster these centroids using ball k-means. *Those* results can then be applied back to the original (or new) input vectors to get cluster assignments for individual input vectors.
> >>
> >> I don't have command line specifics handy, but you seem to have done very well already at figuring out the details.
> >>
> >> On Oct 3, 2013, at 7:30 AM, Jens Bonerz wrote:
> >>
> >>> I created a series of scripts to try out streamingkmeans in Mahout and increased the number of clusters to a high amount as suggested by Ted. Everything seems to work. However, I can't figure out how to access the actual cluster data at the end of the process.
> >>>
> >>> It just gives me output that I cannot really understand... I would expect my product_ids being referenced to cluster ids...
> >>>
> >>> Example of the procedure's output:
> >>>
> >>> hadoop binary is not in PATH,HADOOP_HOME/bin,HADOOP_PREFIX/bin, running locally
> >>> Input Path: file:MahoutCluster/part-r-00000
> >>> Key class: class org.apache.hadoop.io.IntWritable Value Class: class org.apache.mahout.clustering.streaming.mapreduce.CentroidWritable
> >>> Key: 0: Value: key = 8678, weight = 3.00, vector = {37:26.83479118347168,6085:8.162049293518066,4785:10.44443130493164,2493:19.677349090576172,2494:16.06648826599121,9659:9.568963050842285,20877:9.307...
> >>> Key: 1: Value: key = 3118, weight = 14.00, vector = {19457:5.646900812784831,8774:4.746263821919759,9738:1.022495985031128,13301:5.762300491333008,14947:0.6774413585662842,8787:6.841406504313151,14958...
> >>> Key: 2: Value: key = 2867, weight = 3.00, vector = {15873:10.955257415771484,1615:4.029662132263184,20963:4.979445934295654,3978:5.611329555511475,7950:8.364990234375,8018:8.68657398223877,15433:7.959...
> >>> Key: 3: Value: key = 6295, weight = 1.00, vector = {17113:10.955257415771484,15347:9.568963050842285,15348:10.955257415771484,19845:7.805374622344971,7945:10.262109756469727,15356:18.090286254882812,1...
> >>> Key: 4: Value: key = 6725, weight = 4.00, vector = {10570:7.64715051651001,14915:6.126943588256836,14947:4.064648151397705,14330:9.414812088012695,18271:2.7172491550445557,14335:19.677349090576172,143...
> >>> Key: 5: ......
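Each record in the dump above is an (IntWritable, CentroidWritable) pair: the key is an index, and the value is one sketch centroid with its own key, a weight (roughly, how many input vectors it absorbed) and the centroid vector. These are not product-to-cluster assignments yet. A minimal Java sketch for reading them back programmatically might look like the following; CentroidWritable and the part-file path are taken from the dump itself, but getCentroid() and the rest of the wiring are assumptions to verify against the Mahout 0.8 source.

// Minimal sketch: read the streaming k-means output shown above.
// Assumes 0.8-style classes (IntWritable keys, CentroidWritable values);
// verify class and method names against your Mahout version.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.clustering.streaming.mapreduce.CentroidWritable;

public class DumpSketchCentroids {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.getLocal(conf);
    Path input = new Path("MahoutCluster/part-r-00000"); // sketch output from the streamingkmeans step
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, input, conf);
    try {
      IntWritable key = new IntWritable();
      CentroidWritable value = new CentroidWritable();
      while (reader.next(key, value)) {
        // Each record is one sketch centroid: an index, a weight
        // (how much input it absorbed) and the centroid vector itself.
        System.out.println(key.get() + " -> " + value.getCentroid());
      }
    } finally {
      reader.close();
    }
  }
}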
> >>> This is my recipe:
> >>>
> >>> --------------------------------------------------------------------------------------------------------
> >>> Step 1
> >>> Create a seqfile from my data with Python. It's the product_id (key) and the short normalized description (value) that is written into the sequence file.
> >>>
> >>> --------------------------------------------------------------------------------------------------------
> >>> Step 2
> >>> Create vectors from that data with the following command:
> >>>
> >>> mahout seq2sparse \
> >>>   -i productClusterSequenceData/productClusterSequenceData.seq \
> >>>   -o productClusterSequenceData/vectors
> >>>
> >>> --------------------------------------------------------------------------------------------------------
> >>> Step 3
> >>> Cluster the vectors using streamingkmeans with this command:
> >>>
> >>> mahout streamingkmeans \
> >>>   -i productClusterSequenceData/vectors/tfidf-vectors \
> >>>   -o MahoutCluster \
> >>>   --tempDir /tmp \
> >>>   -ow -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
> >>>   -sc org.apache.mahout.math.neighborhood.FastProjectionSearch \
> >>>   -k 10000 -km 500000
> >>>
> >>> --------------------------------------------------------------------------------------------------------
> >>> Step 4
> >>> Export the streamingkmeans cluster data into a textfile (for an extract of the result see above):
> >>>
> >>> mahout seqdumper \
> >>>   -i MahoutCluster > similarProducts.txt
> >>>
> >>> What am I missing?
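The step missing from the recipe is the second phase Ted describes above: cluster the sketch centroids with ball k-means, then assign each product's tf-idf vector to its nearest final centroid, which yields the product_id-to-cluster_id mapping. A rough in-memory Java sketch, reusing the paths from steps 2-4, might look like this; BallKMeans, BruteSearch and the searcher calls follow the 0.8 streaming k-means code, but the exact constructors, method signatures and part-file names are assumptions that need checking against the source.

// Rough sketch of the missing "second pass": ball k-means over the sketch
// centroids, then nearest-centroid assignment of every product vector.
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.clustering.streaming.cluster.BallKMeans;
import org.apache.mahout.clustering.streaming.mapreduce.CentroidWritable;
import org.apache.mahout.common.distance.CosineDistanceMeasure;
import org.apache.mahout.math.Centroid;
import org.apache.mahout.math.VectorWritable;
import org.apache.mahout.math.neighborhood.BruteSearch;
import org.apache.mahout.math.neighborhood.UpdatableSearcher;

public class FinalClustering {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.getLocal(conf);

    // 1. Load the sketch centroids produced by the streamingkmeans step.
    List<Centroid> sketch = new ArrayList<Centroid>();
    SequenceFile.Reader sketchReader =
        new SequenceFile.Reader(fs, new Path("MahoutCluster/part-r-00000"), conf);
    IntWritable idx = new IntWritable();
    CentroidWritable cw = new CentroidWritable();
    while (sketchReader.next(idx, cw)) {
      // Copy the centroid, since the Writable instance is reused between calls.
      sketch.add(cw.getCentroid().clone());
    }
    sketchReader.close();

    // 2. Run ball k-means over the weighted sketch centroids to get the final
    //    clusters (5000 clusters and 20 iterations are placeholder values;
    //    the cosine distance measure simply mirrors step 3).
    BallKMeans bkm = new BallKMeans(new BruteSearch(new CosineDistanceMeasure()), 5000, 20);
    UpdatableSearcher finalCentroids = bkm.cluster(sketch);

    // 3. Assign every original tf-idf product vector to its nearest final
    //    centroid (only one part file is read here for brevity).
    SequenceFile.Reader vectorReader = new SequenceFile.Reader(
        fs, new Path("productClusterSequenceData/vectors/tfidf-vectors/part-r-00000"), conf);
    Text productId = new Text();
    VectorWritable vec = new VectorWritable();
    while (vectorReader.next(productId, vec)) {
      Centroid nearest = (Centroid) finalCentroids.search(vec.get(), 1).get(0).getValue();
      System.out.println(productId + "\t" + nearest.getIndex());
    }
    vectorReader.close();
  }
}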
> >>> 2013/10/3 Ted Dunning <[email protected]>
> >>>
> >>>> Yes. That will work.
> >>>>
> >>>> The sketch will then contain 10,000 x log N centroids. If N = 10^9, log N ≈ 30, so the sketch will have about 300,000 weighted centroids in it. The final clustering will have to process these centroids to produce the desired 5,000 clusters. Since 300,000 is a relatively small number of data points, this clustering step should proceed relatively quickly.
> >>>>
> >>>> On Wed, Oct 2, 2013 at 10:21 AM, Jens Bonerz <[email protected]> wrote:
> >>>>
> >>>>> Thanks for your elaborate answer.
> >>>>>
> >>>>> So if the upper bound on the final number of clusters is unknown at the beginning, what would happen if I define a very high number that is guaranteed to be greater than the estimated number of clusters? For example, if I set it to 10,000 clusters when an estimate of 5,000 is likely, will that work?
> >>>>>
> >>>>> 2013/10/2 Ted Dunning <[email protected]>
> >>>>>
> >>>>>> The way that the new streaming k-means works is that there is a first sketch pass which only requires an upper bound on the final number of clusters you will want. It adaptively creates more or fewer clusters depending on the data and your bound. This sketch is guaranteed to be computed within at most one map-reduce pass. There is a threaded version that runs (fast) on a single machine. The threaded version is liable to be faster than the map-reduce version for moderate or smaller data sizes.
> >>>>>>
> >>>>>> That sketch can then be used to do all kinds of things that rely on Euclidean distance and still get results within a small factor of the same algorithm applied to all of the data. Typically this second phase is a ball k-means algorithm, but it could easily be a dp-means algorithm [1] if you want a variable number of clusters. Indeed, you could run many dp-means passes with different values of lambda on the same sketch. Note that the sketch is small enough that in-memory clustering is entirely viable and is very fast.
> >>>>>>
> >>>>>> For the problem you describe, however, you probably don't need the sketch approach at all and can probably apply ball k-means or dp-means directly. Running many k-means clusterings with differing values of k should be entirely feasible as well with such data sizes.
> >>>>>>
> >>>>>> [1] http://www.cs.berkeley.edu/~jordan/papers/kulis-jordan-icml12.pdf
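To make the dp-means idea in [1] concrete: instead of fixing k, a penalty parameter lambda decides when a point is too far from every existing centroid and should open a new cluster, so lambda indirectly controls how many clusters come out. The following is a deliberately tiny, single-pass Java toy with made-up data; it is not Mahout code, and the real algorithm in the paper iterates Lloyd-style until convergence.

// Toy, single-pass illustration of the dp-means rule from [1].
import java.util.ArrayList;
import java.util.List;

public class DpMeansSketch {
  static double squaredDistance(double[] a, double[] b) {
    double d = 0;
    for (int i = 0; i < a.length; i++) {
      double diff = a[i] - b[i];
      d += diff * diff;
    }
    return d;
  }

  public static void main(String[] args) {
    double lambda = 1.0; // penalty: larger lambda -> fewer clusters
    double[][] points = { {0.0, 0.1}, {0.2, 0.0}, {5.0, 5.1}, {5.2, 4.9}, {9.0, 0.2} };

    List<double[]> centroids = new ArrayList<double[]>();
    List<Integer> counts = new ArrayList<Integer>();

    for (double[] p : points) {
      // Find the nearest existing centroid.
      int best = -1;
      double bestDist = Double.MAX_VALUE;
      for (int c = 0; c < centroids.size(); c++) {
        double d = squaredDistance(p, centroids.get(c));
        if (d < bestDist) { bestDist = d; best = c; }
      }
      if (best == -1 || bestDist > lambda) {
        // Too far from everything: open a new cluster at this point.
        centroids.add(p.clone());
        counts.add(1);
      } else {
        // Otherwise fold the point into the nearest centroid (running mean).
        double[] c = centroids.get(best);
        int n = counts.get(best) + 1;
        for (int i = 0; i < c.length; i++) {
          c[i] += (p[i] - c[i]) / n;
        }
        counts.set(best, n);
      }
    }
    System.out.println("lambda = " + lambda + " -> " + centroids.size() + " clusters");
  }
}

Re-running a pass like this with different values of lambda over the same in-memory sketch is the kind of experiment Ted suggests above.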
> >>>>>> On Wed, Oct 2, 2013 at 9:11 AM, Jens Bonerz <[email protected]> wrote:
> >>>>>>
> >>>>>>> Isn't the streaming k-means just a different approach to crunch through the data? In other words, the result of streaming k-means should be comparable to using k-means in multiple chained map-reduce cycles?
> >>>>>>>
> >>>>>>> I just read a paper about k-means clustering and its underlying algorithm.
> >>>>>>>
> >>>>>>> According to that paper, k-means relies on a pre-known/predefined number of clusters as an input parameter.
> >>>>>>>
> >>>>>>> Link: http://books.nips.cc/papers/files/nips22/NIPS2009_1085.pdf
> >>>>>>>
> >>>>>>> In my current scenario, however, the number of clusters is unknown at the beginning.
> >>>>>>>
> >>>>>>> Maybe k-means is just not the right algorithm for clustering similar products based on their short description text? What else could I use?
> >>>>>>>
> >>>>>>> 2013/10/1 Ted Dunning <[email protected]>
> >>>>>>>
> >>>>>>>> At such small sizes, I would guess that the sequential versions of streaming k-means or ball k-means would be better options.
> >>>>>>>>
> >>>>>>>> On Mon, Sep 30, 2013 at 2:14 PM, mercutio7979 <[email protected]> wrote:
> >>>>>>>>
> >>>>>>>>> Hello all,
> >>>>>>>>>
> >>>>>>>>> I am currently trying to create clusters from a group of 50,000 strings that contain product descriptions (around 70-100 characters each).
> >>>>>>>>>
> >>>>>>>>> That group of 50,000 consists of roughly 5,000 individual products and ten varying product descriptions per product. The product descriptions are already prepared for clustering and contain a normalized brand name, product model number, etc.
> >>>>>>>>>
> >>>>>>>>> What would be a good approach to maximise the number of found clusters (the best possible value would be 5,000 clusters with 10 products each)?
> >>>>>>>>>
> >>>>>>>>> I adapted the reuters cluster script to read in my data and managed to create a first set of clusters. However, I have not managed to maximise the cluster count.
> >>>>>>>>>
> >>>>>>>>> The question is: what do I need to tweak with regard to the available Mahout settings, so the clusters are created as precisely as possible?
> >>>>>>>>>
> >>>>>>>>> Many regards!
> >>>>>>>>> Jens
> >>>>>>>>>
> >>>>>>>>> --
> >>>>>>>>> View this message in context: http://lucene.472066.n3.nabble.com/What-are-the-best-settings-for-my-clustering-task-tp4092807.html
> >>>>>>>>> Sent from the Mahout User List mailing list archive at Nabble.com.
>
> <http://www.hightechmg.com>
