Re: Seeding k-means with canopy clustering / Filter canopies

Stefan Kreuzer Thu, 03 Jan 2013 08:08:35 -0800

But even with a small weight (not sure how to apply that) I still have the wrong number of centroids, i.e. the wrong k?

I imagined something like:

1. Do canopy clustering with the clusterFilter param => retrieve a folder with x canopy clusters and a folder with x+n canopy centroids, where x represents a good value for k.

2. Remove centroids that do not correspond with any of the canopy clusters.

3. Use this reduced set of canopy centroids as the seed for k-means.

I don't know if step 2 is possible and, if it is, how it could be achieved. Performance is rather a non-issue in my case.
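For illustration only, step 2 could be sketched outside Mahout roughly as below. This is plain Python, not the Mahout API: `filter_centroids` and `min_size` are hypothetical names, `min_size` merely stands in for the clusterFilter value, and whether Mahout's own comparison is strict or not is not stated in this thread. Each point is assigned to its nearest canopy centroid, and centroids that attract too few points are dropped before seeding k-means.

```python
import math

def nearest(point, centroids):
    """Index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

def filter_centroids(points, centroids, min_size):
    """Keep only centroids that attract at least min_size points.

    min_size is a hypothetical stand-in for clusterFilter (assumption).
    """
    counts = [0] * len(centroids)
    for p in points:
        counts[nearest(p, centroids)] += 1
    return [c for c, n in zip(centroids, counts) if n >= min_size]

# Toy data: two dense groups and one isolated point.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (9.0, 0.0)]
canopy_centroids = [(0.1, 0.1), (5.0, 5.0), (9.0, 0.0)]

# The centroid at (9.0, 0.0) attracts a single point and is removed.
seeds = filter_centroids(points, canopy_centroids, min_size=2)
```

The remaining `seeds` would then be written out in whatever format the k-means job expects as its initial clusters; that conversion is not shown here.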


-----Original Message-----
From: Ted Dunning <[email protected]>
To: user <[email protected]>
Sent: Thu, 3 Jan 2013 4:41 pm
Subject: Re: Seeding k-means with canopy clustering / Filter canopies


The knn stuff on github can run with 0.7.  You would have to pull a few
classes back that have been moved to Mahout, but it shouldn't be hard to do
since the names and paths are identical.

I have no good answer for you about using canopy centroids. The normal way
of doing this is to put a very small or zero weight on the seed centroids.
That means that they start things going but have very little or no
influence later.
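One possible reading of "a very small or zero weight" is a weighted-mean centroid update, sketched below in plain Python. This is an assumption about the mechanism, not a confirmed description of Mahout's implementation; `update_centroid` and `seed_weight` are hypothetical names.

```python
def update_centroid(seed, seed_weight, assigned):
    """Weighted mean of a seed centroid and the points assigned to it.

    seed_weight counts how many "virtual points" the seed is worth
    (hypothetical parameter; 0 means the seed only determines assignment).
    """
    total = seed_weight + len(assigned)
    if total == 0:
        # Nothing assigned and zero weight: leave the seed where it is.
        return list(seed)
    return [
        (seed_weight * seed[d] + sum(p[d] for p in assigned)) / total
        for d in range(len(seed))
    ]

seed = [0.0, 0.0]
assigned = [[2.0, 2.0], [4.0, 4.0]]

zero = update_centroid(seed, 0.0, assigned)   # seed has no pull on the result
heavy = update_centroid(seed, 2.0, assigned)  # seed counts like two extra points
```

With zero weight the updated centroid is just the mean of the assigned points, so the seed influences only the first assignment step. The number of centroids is unchanged either way.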

On Thu, Jan 3, 2013 at 3:43 AM, Stefan Kreuzer <[email protected]> wrote:

I fear I have to stick to 0.7. So there is no solution to get rid of the
superfluous canopy centroids for the k-means seed?


-----Original Message-----
From: Ted Dunning <[email protected]>
To: user <[email protected]>
Sent: Thu, 3 Jan 2013 7:01 am
Subject: Re: Seeding k-means with canopy clustering / Filter canopies


Bitlets have come into Mahout so far, but the core is still in
https://github.com/tdunning/knn

The quick summary is that this code can cluster 10-dimensional data at
about 1 million points in 20 seconds on a single machine.  It also can
scale out horizontally using a single map-reduce pass maintaining about the
same speed.  Performance scales down essentially linearly with higher
dimensionality.
It works by making a fast, single pass through the data to produce a sketch
of the data. This sketch is clustered in memory using a high quality ball
k-means algorithm.
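To give a rough idea of what a single-pass sketch can look like, here is a deliberately simplified version in plain Python. It is not the code from the knn repository: the fixed `radius` cutoff and the name `build_sketch` are assumptions made for illustration, and the in-memory clustering step (ball k-means in the description above) is not shown.

```python
import math

def build_sketch(stream, radius):
    """Single pass: fold each point into the nearest weighted centroid
    within radius, or start a new centroid if none is close enough.

    radius is a hypothetical fixed cutoff chosen for this illustration.
    Returns a list of [centroid, weight] entries.
    """
    sketch = []
    for p in stream:
        best = None  # (distance, entry) of the closest centroid within radius
        for entry in sketch:
            d = math.dist(p, entry[0])
            if d <= radius and (best is None or d < best[0]):
                best = (d, entry)
        if best is None:
            sketch.append([list(p), 1])
        else:
            entry = best[1]
            c, w = entry
            # Running mean: move the centroid toward the new point.
            entry[0] = [(w * ci + pi) / (w + 1) for ci, pi in zip(c, p)]
            entry[1] = w + 1
    return sketch

stream = [(0.0, 0.0), (0.2, 0.0), (10.0, 10.0), (10.2, 10.0), (0.1, 0.1)]
sketch = build_sketch(stream, radius=1.0)
weights = [w for _, w in sketch]
```

The sketch is much smaller than the input (two weighted centroids for five points here), which is what makes clustering it in memory feasible; an in-memory k-means would then treat each entry as a point carrying its weight.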
The API is currently not compatible with the current clustering API. The
algorithms are being tested for quality by Dan Filimon who is also doing
the scaling work.

On Wed, Jan 2, 2013 at 6:00 PM, Stefan Kreuzer <[email protected]> wrote:

Uhm no... where can I look? Sorry

-----Original Message-----
From: Ted Dunning <[email protected]>
To: user <[email protected]>
Sent: Thu, 3 Jan 2013 2:12 am
Subject: Re: Seeding k-means with canopy clustering / Filter canopies


Stefan,

Have you looked at the k-means work that Dan Filimon and I are doing?

On Wed, Jan 2, 2013 at 4:46 PM, Stefan Kreuzer <[email protected]> wrote:

> I try to seed a k-means clustering with canopy clustering. Problem:
> Depending on the choice for t1 and t2, canopy clustering gives me too many
> canopies or just 1.
> I thought I could solve this with the clusterFilter parameter, but no
> luck. Although I can restrict the number of _canopy clusters_ with the
> clusterFilter parameter leading to what would be a good value for k, this
> parameter has no effect on the _canopy centroids_ that are created, and
> these are the seed for k-means.
> Is there a way to get a seed for k-means that reflects the value given for
> the clusterFilter parameter in canopy clustering?
>
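The sensitivity to t1 and t2 described in the question can be reproduced with a textbook-style canopy pass, sketched below in plain Python. This is not Mahout's implementation (which, among other differences, does not have to use the first remaining point as the canopy center); `canopies` is a hypothetical helper name.

```python
import math

def canopies(points, t1, t2):
    """Textbook-style canopy pass; returns (center, members) pairs.

    t1 is the loose threshold (membership), t2 the tight one (removal).
    """
    if not (t1 > t2 > 0):
        raise ValueError("expected t1 > t2 > 0")
    remaining = list(points)
    result = []
    while remaining:
        center = remaining[0]
        # Everything within t1 of the center belongs to this canopy.
        members = [p for p in remaining if math.dist(center, p) < t1]
        # Everything within t2 is removed and cannot start a new canopy.
        remaining = [p for p in remaining if math.dist(center, p) >= t2]
        result.append((center, members))
    return result

points = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (20, 0)]

loose = canopies(points, t1=50, t2=40)    # thresholds larger than the data spread
middle = canopies(points, t1=3, t2=2)     # thresholds matching the group size
tight = canopies(points, t1=0.6, t2=0.5)  # thresholds smaller than any gap
```

On this toy data the loose setting collapses everything into one canopy, the tight setting makes every point its own canopy, and only the middle setting recovers the three visible groups, which is the behaviour the question describes.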
