Hmmm... has ballkmeans already made it into the 0.8 release? I can't find it in the list of available programs when calling the mahout binary...
2013/10/3 Ted Dunning <[email protected]>

> What you are seeing here are the cluster centroids themselves, not the
> cluster assignments.
>
> Streaming k-means is a single-pass algorithm to derive these centroids.
> Typically, the next step is to cluster these centroids using ball k-means.
> *Those* results can then be applied back to the original (or new) input
> vectors to get cluster assignments for individual input vectors.
>
> I don't have command line specifics handy, but you seem to have done very
> well already at figuring out the details.
>
> On Oct 3, 2013, at 7:30 AM, Jens Bonerz wrote:
>
> > I created a series of scripts to try out streamingkmeans in mahout and
> > increased the number of clusters to a high amount as suggested by Ted.
> > Everything seems to work. However, I can't figure out how to access the
> > actual cluster data at the end of the process.
> >
> > It just gives me output that I cannot really understand... I would expect
> > my product_ids to be referenced to cluster ids...
> >
> > Example of the procedure's output:
> >
> > hadoop binary is not in PATH,HADOOP_HOME/bin,HADOOP_PREFIX/bin, running locally
> > Input Path: file:MahoutCluster/part-r-00000
> > Key class: class org.apache.hadoop.io.IntWritable Value Class: class org.apache.mahout.clustering.streaming.mapreduce.CentroidWritable
> > Key: 0: Value: key = 8678, weight = 3.00, vector = {37:26.83479118347168,6085:8.162049293518066,4785:10.44443130493164,2493:19.677349090576172,2494:16.06648826599121,9659:9.568963050842285,20877:9.307...
> > Key: 1: Value: key = 3118, weight = 14.00, vector = {19457:5.646900812784831,8774:4.746263821919759,9738:1.022495985031128,13301:5.762300491333008,14947:0.6774413585662842,8787:6.841406504313151,14958...
> > Key: 2: Value: key = 2867, weight = 3.00, vector = {15873:10.955257415771484,1615:4.029662132263184,20963:4.979445934295654,3978:5.611329555511475,7950:8.364990234375,8018:8.68657398223877,15433:7.959...
> > Key: 3: Value: key = 6295, weight = 1.00, vector = {17113:10.955257415771484,15347:9.568963050842285,15348:10.955257415771484,19845:7.805374622344971,7945:10.262109756469727,15356:18.090286254882812,1...
> > Key: 4: Value: key = 6725, weight = 4.00, vector = {10570:7.64715051651001,14915:6.126943588256836,14947:4.064648151397705,14330:9.414812088012695,18271:2.7172491550445557,14335:19.677349090576172,143...
> > Key: 5: ......
> >
> > This is my recipe:
> >
> > --------------------------------------------------------------------------------------------------------
> > Step 1
> > Create a seqfile from my data with Python. It's the product_id (key) and
> > the short normalized description (value) that is written into the
> > sequence file.
> >
> > --------------------------------------------------------------------------------------------------------
> > Step 2
> > Create vectors from that data with the following command:
> >
> > mahout seq2sparse \
> >   -i productClusterSequenceData/productClusterSequenceData.seq \
> >   -o productClusterSequenceData/vectors
> >
> > --------------------------------------------------------------------------------------------------------
> > Step 3
> > Cluster the vectors using streamingkmeans with this command:
> >
> > mahout streamingkmeans \
> >   -i productClusterSequenceData/vectors/tfidf-vectors \
> >   -o MahoutCluster \
> >   --tempDir /tmp \
> >   -ow -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
> >   -sc org.apache.mahout.math.neighborhood.FastProjectionSearch \
> >   -k 10000 -km 500000
> >
> > --------------------------------------------------------------------------------------------------------
> > Step 4
> > Export the streamingkmeans cluster data into a textfile (for an extract
> > of the result see above):
> >
> > mahout seqdumper \
> >   -i MahoutCluster > similarProducts.txt
> >
> > What am I missing?
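The assignment step Ted describes (mapping each original vector back to its nearest centroid) is not shown with a Mahout command in this thread. A minimal single-machine sketch of that step, using toy sparse `{index: weight}` dicts standing in for the dumped tf-idf vectors and centroids (the product ids and values below are made up; the cosine-distance logic mirrors the CosineDistanceMeasure chosen in Step 3):

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity over sparse {index: weight} vectors."""
    dot = sum(w * b.get(i, 0.0) for i, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)

def assign(vectors, centroids):
    """Map each product_id to the index of its nearest centroid."""
    return {
        pid: min(range(len(centroids)),
                 key=lambda c: cosine_distance(v, centroids[c]))
        for pid, v in vectors.items()
    }

# Toy data shaped like the seqdumper output above (values are fabricated).
centroids = [{37: 26.8, 6085: 8.2}, {19457: 5.6, 8774: 4.7}]
vectors = {"prod_1": {37: 10.0, 6085: 3.0}, "prod_2": {19457: 2.0}}
print(assign(vectors, centroids))  # {'prod_1': 0, 'prod_2': 1}
```

This is only an illustration of the final assignment pass, not a replacement for running ball k-means over the centroids first.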
> > 2013/10/3 Ted Dunning <[email protected]>
> >
> > > Yes. That will work.
> > >
> > > The sketch will then contain 10,000 x log N centroids. If N = 10^9,
> > > log N ≈ 30, so the sketch will have about 300,000 weighted centroids
> > > in it. The final clustering will have to process these centroids to
> > > produce the desired 5,000 clusters. Since 300,000 is a relatively
> > > small number of data points, this clustering step should proceed
> > > relatively quickly.
> > >
> > > On Wed, Oct 2, 2013 at 10:21 AM, Jens Bonerz <[email protected]> wrote:
> > >
> > > > Thx for your elaborate answer.
> > > >
> > > > So if the upper bound on the final number of clusters is unknown at
> > > > the beginning, what would happen if I define a very high number that
> > > > is guaranteed to be greater than the estimated number of clusters?
> > > > For example, if I set it to 10,000 clusters when an estimate of
> > > > 5,000 is likely, will that work?
> > > >
> > > > 2013/10/2 Ted Dunning <[email protected]>
> > > >
> > > > > The way that the new streaming k-means works is that there is a
> > > > > first sketch pass which only requires an upper bound on the final
> > > > > number of clusters you will want. It adaptively creates more or
> > > > > fewer clusters depending on the data and your bound. This sketch
> > > > > is guaranteed to be computed within at most one map-reduce pass.
> > > > > There is a threaded version that runs (fast) on a single machine.
> > > > > The threaded version is liable to be faster than the map-reduce
> > > > > version for moderate or smaller data sizes.
> > > > >
> > > > > That sketch can then be used to do all kinds of things that rely
> > > > > on Euclidean distance and still get results within a small factor
> > > > > of the same algorithm applied to all of the data. Typically this
> > > > > second phase is a ball k-means algorithm, but it could easily be a
> > > > > dp-means algorithm [1] if you want a variable number of clusters.
> > > > > Indeed, you could run many dp-means passes with different values
> > > > > of lambda on the same sketch. Note that the sketch is small enough
> > > > > that in-memory clustering is entirely viable and is very fast.
> > > > >
> > > > > For the problem you describe, however, you probably don't need the
> > > > > sketch approach at all and can probably apply ball k-means or
> > > > > dp-means directly. Running many k-means clusterings with differing
> > > > > values of k should be entirely feasible as well with such data
> > > > > sizes.
> > > > >
> > > > > [1] http://www.cs.berkeley.edu/~jordan/papers/kulis-jordan-icml12.pdf
> > > > >
> > > > > On Wed, Oct 2, 2013 at 9:11 AM, Jens Bonerz <[email protected]> wrote:
> > > > >
> > > > > > Isn't the streaming k-means just a different approach to crunch
> > > > > > through the data? In other words, the result of streaming
> > > > > > k-means should be comparable to using k-means in multiple
> > > > > > chained map-reduce cycles?
> > > > > >
> > > > > > I just read a paper about k-means clustering and its underlying
> > > > > > algorithm.
> > > > > >
> > > > > > According to that paper, k-means relies on a preknown/predefined
> > > > > > number of clusters as an input parameter.
> > > > > >
> > > > > > Link: http://books.nips.cc/papers/files/nips22/NIPS2009_1085.pdf
> > > > > >
> > > > > > In my current scenario, however, the number of clusters is
> > > > > > unknown at the beginning.
> > > > > >
> > > > > > Maybe k-means is just not the right algorithm for clustering
> > > > > > similar products based on their short description text? What
> > > > > > else could I use?
> > > > > >
> > > > > > 2013/10/1 Ted Dunning <[email protected]>
> > > > > >
> > > > > > > At such small sizes, I would guess that the sequential
> > > > > > > versions of streaming k-means or ball k-means would be better
> > > > > > > options.
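Ted's sketch-size estimate above (k x log N weighted centroids) can be checked in a couple of lines; his "log N ≈ 30" for N = 10^9 implies a base-2 logarithm:

```python
import math

k = 10_000          # upper bound on the final cluster count given to the sketch
N = 1_000_000_000   # assumed number of input points (N = 10^9 in the email)

log_n = math.log2(N)            # ≈ 29.9, i.e. "log N ≈ 30"
sketch_size = k * round(log_n)  # ≈ 300,000 weighted centroids
print(round(log_n), sketch_size)  # 30 300000
```

The point of the estimate is that 300,000 centroids comfortably fit in memory, so the final ball k-means step over the sketch is cheap.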
> > > > > > > On Mon, Sep 30, 2013 at 2:14 PM, mercutio7979 <[email protected]> wrote:
> > > > > > >
> > > > > > > > Hello all,
> > > > > > > >
> > > > > > > > I am currently trying to create clusters from a group of
> > > > > > > > 50,000 strings that contain product descriptions (around
> > > > > > > > 70-100 characters each).
> > > > > > > >
> > > > > > > > That group of 50,000 consists of roughly 5,000 individual
> > > > > > > > products and ten varying product descriptions per product.
> > > > > > > > The product descriptions are already prepared for clustering
> > > > > > > > and contain a normalized brand name, product model number,
> > > > > > > > etc.
> > > > > > > >
> > > > > > > > What would be a good approach to maximise the number of
> > > > > > > > found clusters? (The best possible value would be 5,000
> > > > > > > > clusters with 10 products each.)
> > > > > > > >
> > > > > > > > I adapted the reuters cluster script to read in my data and
> > > > > > > > managed to create a first set of clusters. However, I have
> > > > > > > > not managed to maximise the cluster count.
> > > > > > > >
> > > > > > > > The question is: what do I need to tweak with regard to the
> > > > > > > > available mahout settings, so the clusters are created as
> > > > > > > > precisely as possible?
> > > > > > > >
> > > > > > > > Many regards!
> > > > > > > > Jens
> > > > > > > >
> > > > > > > > --
> > > > > > > > View this message in context:
> > > > > > > > http://lucene.472066.n3.nabble.com/What-are-the-best-settings-for-my-clustering-task-tp4092807.html
> > > > > > > > Sent from the Mahout User List mailing list archive at Nabble.com.
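For the variable-number-of-clusters case discussed in this thread, the dp-means algorithm from the Kulis & Jordan paper linked above [1] is simple enough to sketch directly. The following is a minimal single-machine illustration with plain Euclidean distance and made-up toy points, not Mahout's implementation: any point farther than lambda from every existing centroid spawns a new cluster, so k emerges from the data.

```python
import math

def dp_means(points, lam, iters=10):
    """Batch dp-means (Kulis & Jordan, ICML 2012): a point farther than
    lam from every centroid starts a new cluster of its own."""
    centroids = [list(points[0])]
    assignments = [0] * len(points)
    for _ in range(iters):
        for i, p in enumerate(points):
            dists = [math.dist(p, c) for c in centroids]
            j = min(range(len(dists)), key=dists.__getitem__)
            if dists[j] > lam:           # too far from everything: new cluster
                centroids.append(list(p))
                assignments[i] = len(centroids) - 1
            else:
                assignments[i] = j
        # Recompute each centroid as the mean of its assigned points.
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:
                centroids[j] = [sum(xs) / len(members) for xs in zip(*members)]
    return assignments, centroids

# Two well-separated toy groups; a lambda between the within-group and
# between-group distances recovers two clusters without specifying k.
pts = [(0.0, 0.0), (0.1, 0.0), (10.0, 10.0), (10.1, 10.0)]
assignments, centroids = dp_means(pts, lam=3.0)
print(len(centroids))  # 2
```

As Ted notes, lambda plays the role k plays in plain k-means, so in practice one would rerun this cheaply over the in-memory sketch with several lambda values.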
