I will try that right away. Thanks, Ted!
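As a quick sanity check on the numbers in Ted's reply quoted below, here is a back-of-the-envelope sketch of the sketch-size estimate (upper bound on k times log N). It assumes the logarithm is base 2, which is what makes log N come out near 30 for N = 10^9, and it ignores any constant factors. `sketch_size` is a made-up helper for illustration, not a Mahout call.

```python
import math


def sketch_size(k_upper_bound: int, n_points: int) -> int:
    """Rough estimate of weighted centroids in the sketch: k * log2(N).

    Assumption: base-2 logarithm and a constant factor of 1.
    """
    return round(k_upper_bound * math.log2(n_points))


# Ted's example: upper bound of 10,000 clusters, N = 10^9 points.
print(sketch_size(10_000, 10**9))

# The data set from this thread: 50,000 product descriptions.
print(sketch_size(10_000, 50_000))
```

For the 50.000 descriptions in this thread the estimate comes out larger than the data set itself, which fits Ted's earlier remark that the sketch step is probably not needed at this size.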
2013/10/3 Ted Dunning <[email protected]>

> Yes. That will work.
>
> The sketch will then contain 10,000 x log N centroids. If N = 10^9, log N
> \approx 30, so the sketch will have about 300,000 weighted centroids in
> it. The final clustering will have to process these centroids to produce
> the desired 5,000 clusters. Since 300,000 is a relatively small number of
> data points, this clustering step should proceed relatively quickly.
>
> On Wed, Oct 2, 2013 at 10:21 AM, Jens Bonerz <[email protected]> wrote:
>
> > thx for your elaborate answer.
> >
> > so if the upper bound on the final number of clusters is unknown in the
> > beginning, what would happen if I define a very high number that is
> > guaranteed to be greater than the estimated number of clusters?
> > for example, if I set it to 10.000 clusters when an estimate of 5.000 is
> > likely, will that work?
> >
> > 2013/10/2 Ted Dunning <[email protected]>
> >
> > > The way that the new streaming k-means works is that there is a first
> > > sketch pass which only requires an upper bound on the final number of
> > > clusters you will want. It adaptively creates more or fewer clusters
> > > depending on the data and your bound. This sketch is guaranteed to be
> > > computed within at most one map-reduce pass. There is a threaded version
> > > that runs (fast) on a single machine. The threaded version is liable to be
> > > faster than the map-reduce version for moderate or smaller data sizes.
> > >
> > > That sketch can then be used to do all kinds of things that rely on
> > > Euclidean distance and still get results within a small factor of the same
> > > algorithm applied to all of the data. Typically this second phase is a
> > > ball k-means algorithm, but it could easily be a dp-means algorithm [1] if
> > > you want a variable number of clusters. Indeed, you could run many
> > > dp-means passes with different values of lambda on the same sketch.
> > > Note that the sketch is small enough that in-memory clustering is entirely
> > > viable and is very fast.
> > >
> > > For the problem you describe, however, you probably don't need the sketch
> > > approach at all and can probably apply ball k-means or dp-means directly.
> > > Running many k-means clusterings with differing values of k should be
> > > entirely feasible as well with such data sizes.
> > >
> > > [1] http://www.cs.berkeley.edu/~jordan/papers/kulis-jordan-icml12.pdf
> > >
> > > On Wed, Oct 2, 2013 at 9:11 AM, Jens Bonerz <[email protected]> wrote:
> > >
> > > > Isn't the streaming k-means just a different approach to crunch through
> > > > the data? In other words, the result of streaming k-means should be
> > > > comparable to using k-means in multiple chained map reduce cycles?
> > > >
> > > > I just read a paper about the k-means clustering and its underlying
> > > > algorithm.
> > > >
> > > > According to that paper, k-means relies on a preknown/predefined number
> > > > of clusters as an input parameter.
> > > >
> > > > Link: http://books.nips.cc/papers/files/nips22/NIPS2009_1085.pdf
> > > >
> > > > In my current scenario however, the number of clusters is unknown at the
> > > > beginning.
> > > >
> > > > Maybe k-means is just not the right algorithm for clustering similar
> > > > products based on their short description text? What else could I use?
> > > >
> > > > 2013/10/1 Ted Dunning <[email protected]>
> > > >
> > > > > At such small sizes, I would guess that the sequential version of the
> > > > > streaming k-means or ball k-means would be better options.
> > > > >
> > > > > On Mon, Sep 30, 2013 at 2:14 PM, mercutio7979 <[email protected]> wrote:
> > > > >
> > > > > > Hello all,
> > > > > >
> > > > > > I am currently trying to create clusters from a group of 50.000 strings
> > > > > > that contain product descriptions (around 70-100 characters in length
> > > > > > each).
> > > > > >
> > > > > > That group of 50.000 consists of roughly 5.000 individual products and
> > > > > > ten varying product descriptions per product. The product descriptions
> > > > > > are already prepared for clustering and contain a normalized brand name,
> > > > > > product model number, etc.
> > > > > >
> > > > > > What would be a good approach to maximise the number of clusters found
> > > > > > (the best possible value would be 5.000 clusters with 10 products each)?
> > > > > >
> > > > > > I adapted the Reuters cluster script to read in my data and managed to
> > > > > > create a first set of clusters. However, I have not managed to maximise
> > > > > > the cluster count.
> > > > > >
> > > > > > The question is: what do I need to tweak with regard to the available
> > > > > > Mahout settings, so the clusters are created as precisely as possible?
> > > > > >
> > > > > > Many regards!
> > > > > > Jens
> > > > > >
> > > > > > --
> > > > > > View this message in context:
> > > > > > http://lucene.472066.n3.nabble.com/What-are-the-best-settings-for-my-clustering-task-tp4092807.html
> > > > > > Sent from the Mahout User List mailing list archive at Nabble.com.
--
CEO
Hightech Marketing Group
Cell/Mobile: +49 173 539 3588
____

Hightech Marketing Group
Frankenstraße 32
50354 Huerth
Germany
Phone: +49 (0)2233 – 619 2741
Fax: +49 (0)2233 – 619 27419
Web: www.hightechmg.com
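For anyone following the thread who wants to see what the dp-means step mentioned above might look like in memory, here is a minimal sketch in plain Python. It follows the basic idea of the linked Kulis and Jordan paper: a point whose squared distance to every existing center exceeds lambda opens a new cluster. The per-point weights are an assumption added here so the same routine could be fed weighted sketch centroids. This is not Mahout's implementation, and the names `dp_means`, `weighted_mean`, `sq_dist` and `lam` are made up for illustration.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def weighted_mean(points: Sequence[Point], weights: Sequence[float]) -> Point:
    """Weighted average of the given points."""
    total = sum(weights)
    dim = len(points[0])
    return tuple(
        sum(w * p[d] for p, w in zip(points, weights)) / total
        for d in range(dim)
    )


def dp_means(points: Sequence[Point], weights: Sequence[float],
             lam: float, max_iter: int = 100) -> Tuple[List[Point], List[int]]:
    """Cluster weighted points; lam is the squared-distance penalty."""
    # Start with a single cluster at the weighted global mean.
    centers: List[Point] = [weighted_mean(points, weights)]
    assignment = [0] * len(points)
    for _ in range(max_iter):
        changed = False
        for i, p in enumerate(points):
            dists = [sq_dist(p, c) for c in centers]
            best = min(range(len(centers)), key=dists.__getitem__)
            if dists[best] > lam:
                # Too far from every center: open a new cluster here.
                centers.append(p)
                best = len(centers) - 1
            if assignment[i] != best:
                assignment[i] = best
                changed = True
        # Recompute centers as weighted means and drop empty clusters.
        new_centers: List[Point] = []
        remap = {}
        for j in range(len(centers)):
            members = [i for i, a in enumerate(assignment) if a == j]
            if members:
                remap[j] = len(new_centers)
                new_centers.append(weighted_mean(
                    [points[i] for i in members],
                    [weights[i] for i in members]))
        centers = new_centers
        assignment = [remap[a] for a in assignment]
        if not changed:
            break
    return centers, assignment


# Toy data: two tight groups far apart.
pts = [(0.0, 0.0), (0.1, 0.0), (10.0, 10.0), (10.1, 10.0)]
wts = [1.0, 1.0, 1.0, 1.0]
centers, labels = dp_means(pts, wts, lam=1.0)
print(len(centers), labels)
```

A small lambda keeps the two toy groups apart, while a very large lambda merges everything into one cluster, so running the routine with several values of lambda on the same input is a cheap way to explore different cluster counts.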
