Re: What are the best settings for my clustering task

Jens Bonerz Wed, 02 Oct 2013 23:46:02 -0700

I will try that right away. thanks Ted!


2013/10/3 Ted Dunning <[email protected]>

> Yes.  That will work.
>
> The sketch will then contain 10,000 x log N centroids.  If N = 10^9, log N
> \approx 30 so the sketch will have at about 300,000 weighted centroids in
> it.  The final clustering will have to process these centroids to produce
> the desired 5,000 clusters.  Since 300,000 is a relatively small number of
> data points, this clustering step should proceed relatively quickly.
>
>
>
> On Wed, Oct 2, 2013 at 10:21 AM, Jens Bonerz <[email protected]>
> wrote:
>
> > thx for your elaborate answer.
> >
> > so if the upper bound on the final number of clusters is unknown in the
> > beginning, what would happen, if I define a very high number that is
> > guaranteed to be > the estimated number of clusters.
> > for example if I set it to 10.000 clusters if an estimate of 5.000 is
> > likely, will that work?
> >
> >
> > 2013/10/2 Ted Dunning <[email protected]>
> >
> > > The way that the new streaming k-means works is that there is a first
> > > sketch pass which only requires an upper bound on the final number of
> > > clusters you will want.  It adaptively creates more or less clusters
> > > depending on the data and your bound.  This sketch is guaranteed to be
> > > computed within at most one map-reduce pass.  There is a threaded
> version
> > > that runs (fast) on a single machine.  The threaded version is liable
> to
> > be
> > > faster than the map-reduce version for moderate or smaller data sizes.
> > >
> > > That sketch can then be used to do all kinds of things that rely on
> > > Euclidean distance and still get results within a small factor of the
> > same
> > > algorithm applied to all of the data.  Typically this second phase is a
> > > ball k-means algorithm, but it could easily be a dp-means algorithm [1]
> > if
> > > you want a variable number of clusters.  Indeed, you could run many
> > > dp-means passes with different values of lambda on the same sketch.
>  Note
> > > that the sketch is small enough that in-memory clustering is entirely
> > > viable and is very fast.
> > >
> > > For the problem you describe, however, you probably don't need the
> sketch
> > > approach at all and can probably apply ball k-means or dp-means
> directly.
> > >  Running many k-means clusterings with differing values of k should be
> > > entirely feasible as well with such data sizes.
> > >
> > > [1] http://www.cs.berkeley.edu/~jordan/papers/kulis-jordan-icml12.pdf
> > >
> > >
> > >
> > >
> > > On Wed, Oct 2, 2013 at 9:11 AM, Jens Bonerz <[email protected]>
> > > wrote:
> > >
> > > > Isn't the streaming k-means just a different approach to crunch
> through
> > > the
> > > > data? In other words, the result of streaming k-means should be
> > > comparable
> > > > to using k-means in multiple chained map reduce cycles?
> > > >
> > > > I just read a paper about the k-means clustering and its underlying
> > > > algorithm.
> > > >
> > > > According to that paper, k-means relies on a preknown/predefined
> amount
> > > of
> > > > clusters as an input parameter.
> > > >
> > > > Link: http://books.nips.cc/papers/files/nips22/NIPS2009_1085.pdf
> > > >
> > > > In my current scenario however, the number of clusters is unknown at
> > the
> > > > beginning.
> > > >
> > > > Maybe k-means is just not the right algorithm for clustering similar
> > > > products based on their short description text? What else could I
> use?
> > > >
> > > >
> > > >
> > > >
> > > > 2013/10/1 Ted Dunning <[email protected]>
> > > >
> > > > > At such small sizes, I would guess that the sequential version of
> the
> > > > > streaming k-means or ball k-means would be better options.
> > > > >
> > > > >
> > > > >
> > > > > On Mon, Sep 30, 2013 at 2:14 PM, mercutio7979 <
> > [email protected]
> > > > > >wrote:
> > > > >
> > > > > > Hello all,
> > > > > >
> > > > > > I am currently trying create clusters from a group of 50.000
> > strings
> > > > that
> > > > > > contain product descriptions (around 70-100 characters length
> > each).
> > > > > >
> > > > > > That group of 50.000 consists of roughly 5.000 individual
> products
> > > and
> > > > > ten
> > > > > > varying product descriptions per product. The product
> descriptions
> > > are
> > > > > > already prepared for clustering and contain a normalized brand
> > name,
> > > > > > product
> > > > > > model number, etc.
> > > > > >
> > > > > > What would be a good approach to maximise the amound of found
> > > clusters
> > > > > (the
> > > > > > best possible value would be 5.000 clusters with 10 products
> each)
> > > > > >
> > > > > > I adapted the reuters cluster script to read in my data and
> managed
> > > to
> > > > > > create a first set of clusters. However, I have not managed to
> > > maximise
> > > > > the
> > > > > > cluster count.
> > > > > >
> > > > > > The question is: what do I need to tweak with regard to the
> > available
> > > > > > mahout
> > > > > > settings, so the clusters are created as precisely as possible?
> > > > > >
> > > > > > Many regards!
> > > > > > Jens
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > View this message in context:
> > > > > >
> > > > >
> > > >
> > >
> >
> http://lucene.472066.n3.nabble.com/What-are-the-best-settings-for-my-clustering-task-tp4092807.html
> > > > > > Sent from the Mahout User List mailing list archive at
> Nabble.com.
> > > > > >
> > > > >
> > > >
> > >
> >
> >
> >
> > --
> > CEO
> > Hightech Marketing Group
> > Cell/Mobile: +49 173 539 3588
> >
> > ____
> >
> > Hightech Marketing Group
> > Frankenstraße 32
> > 50354 Huerth
> > Germany
> > Phone: +49 (0)2233 – 619 2741
> > Fax: +49 (0)2233 – 619 27419
> > Web: www.hightechmg.com
> >
>



-- 
CEO
Hightech Marketing Group
Cell/Mobile: +49 173 539 3588

____

Hightech Marketing Group
Frankenstraße 32
50354 Huerth
Germany
Phone: +49 (0)2233 – 619 2741
Fax: +49 (0)2233 – 619 27419
Web: www.hightechmg.com

Re: What are the best settings for my clustering task

Reply via email to