Isn't streaming k-means just a different way to crunch through the data? In other words, shouldn't the result of streaming k-means be comparable to running k-means in multiple chained MapReduce cycles?
I just read a paper about k-means clustering and its underlying algorithm. According to that paper, k-means relies on a predefined number of clusters as an input parameter.
Link: http://books.nips.cc/papers/files/nips22/NIPS2009_1085.pdf

In my current scenario, however, the number of clusters is unknown at the beginning. Maybe k-means is just not the right algorithm for clustering similar products based on their short description text? What else could I use?

2013/10/1 Ted Dunning <[email protected]>

> At such small sizes, I would guess that the sequential version of the
> streaming k-means or ball k-means would be better options.
>
> On Mon, Sep 30, 2013 at 2:14 PM, mercutio7979 <[email protected]> wrote:
>
> > Hello all,
> >
> > I am currently trying to create clusters from a group of 50.000 strings
> > that contain product descriptions (around 70-100 characters in length
> > each).
> >
> > That group of 50.000 consists of roughly 5.000 individual products and
> > ten varying product descriptions per product. The product descriptions
> > are already prepared for clustering and contain a normalized brand name,
> > product model number, etc.
> >
> > What would be a good approach to maximise the number of clusters found
> > (the best possible value would be 5.000 clusters with 10 products each)?
> >
> > I adapted the reuters cluster script to read in my data and managed to
> > create a first set of clusters. However, I have not managed to maximise
> > the cluster count.
> >
> > The question is: what do I need to tweak with regard to the available
> > mahout settings, so the clusters are created as precisely as possible?
> >
> > Many regards!
> > Jens
> >
> > --
> > View this message in context:
> > http://lucene.472066.n3.nabble.com/What-are-the-best-settings-for-my-clustering-task-tp4092807.html
> > Sent from the Mahout User List mailing list archive at Nabble.com.
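
To make the "number of clusters is unknown" question more concrete, here is a rough sketch of the kind of approach I have in mind: a single-pass, threshold-based ("leader") clustering in which the number of clusters falls out of a similarity cutoff instead of being passed in as k. This is plain Python, not Mahout code. The function names, the 0.5 threshold, the word-token Jaccard similarity, and the sample strings are all assumptions made up for illustration.

```python
def tokens(text):
    """Lowercase a description and split it into a set of word tokens."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def leader_cluster(descriptions, threshold=0.5):
    """Single-pass clustering: a description joins the first cluster whose
    leader is similar enough, otherwise it starts a new cluster.
    The threshold is a hypothetical value that would need tuning."""
    clusters = []  # list of (leader_tokens, [member descriptions])
    for text in descriptions:
        toks = tokens(text)
        for leader, members in clusters:
            if jaccard(toks, leader) >= threshold:
                members.append(text)
                break
        else:
            # No existing leader was close enough: open a new cluster.
            clusters.append((toks, [text]))
    return [members for _, members in clusters]


# Made-up sample data standing in for normalized product descriptions.
sample = [
    "acme x100 digital camera black",
    "acme x100 camera black digital 12mp",
    "globex t5 cordless drill",
    "globex t5 drill cordless 18v",
]
result = leader_cluster(sample)
print(len(result))  # number of clusters found, not given as input
```

With something like this the cluster count depends on the threshold and on the order of the input, so it is only meant to show the idea of not fixing k in advance.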
