Bitlets have come into Mahout so far, but the core is still in https://github.com/tdunning/knn.
The quick summary is that this code can cluster 10-dimensional data at about 1 million points in 20 seconds on a single machine. It can also scale out horizontally using a single map-reduce pass while maintaining about the same speed. Speed drops off essentially linearly as dimensionality increases.

It works by making a fast, single pass through the data to produce a sketch of the data. This sketch is then clustered in memory using a high-quality ball k-means algorithm.

The API is not currently compatible with the existing clustering API. The algorithms are being tested for quality by Dan Filimon, who is also doing the scaling work.

On Wed, Jan 2, 2013 at 6:00 PM, Stefan Kreuzer <[email protected]> wrote:

> Uhm no... where can I look? Sorry
>
> -----Original Message-----
> From: Ted Dunning <[email protected]>
> To: user <[email protected]>
> Sent: Thu, 3 Jan 2013 2:12 am
> Subject: Re: Seeding k-means with canopy clustering / Filter canopies
>
> Stefan,
>
> Have you looked at the k-means work that Dan Filimon and I are doing?
>
> On Wed, Jan 2, 2013 at 4:46 PM, Stefan Kreuzer <[email protected]> wrote:
>
> > I try to seed a k-means clustering with canopy clustering. Problem:
> > Depending on the choice for t1 and t2, canopy clustering gives me too
> > many canopies or just 1.
> > I thought I could solve this with the clusterFilter parameter, but no
> > luck. Although I can restrict the number of _canopy clusters_ with the
> > clusterFilter parameter leading to what would be a good value for k,
> > this parameter has no effect on the _canopy centroids_ that are
> > created, and these are the seed for k-means.
> > Is there a way to get a seed for k-means that reflects the value given
> > for the clusterFilter parameter in canopy clustering?
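
P.S. As a footnote to the summary at the top of this message, here is a rough Python illustration of the general "one pass to build a sketch, then cluster the sketch in memory" idea. It is only a toy under stated assumptions and is not the code or API from the knn repository or from Mahout: the function names (`build_sketch`, `weighted_kmeans`), the distance threshold, and the 1.5 growth factor are all made up for illustration, and a plain weighted Lloyd's iteration stands in for ball k-means.

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def insert(sketch, p, w, threshold):
    """Merge (p, w) into the nearest sketch centroid if close enough, else add it."""
    if sketch:
        best = min(sketch, key=lambda cw: dist(cw[0], p))
        if dist(best[0], p) <= threshold:
            total = best[1] + w
            # Weighted mean of the existing centroid and the new point.
            best[0] = [(c * best[1] + x * w) / total for c, x in zip(best[0], p)]
            best[1] = total
            return
    sketch.append([list(p), w])


def build_sketch(points, max_centroids, threshold=1.0):
    """Single pass over the data producing at most max_centroids weighted centroids.

    Assumes threshold > 0 and max_centroids >= 1 (hypothetical parameters).
    """
    sketch = []  # entries are [centroid, weight]
    for p in points:
        insert(sketch, p, 1.0, threshold)
        # If the sketch grew too large, loosen the threshold and re-collapse it.
        while len(sketch) > max_centroids:
            threshold *= 1.5
            old, sketch = sketch, []
            for c, w in old:
                insert(sketch, c, w, threshold)
    return sketch


def weighted_kmeans(sketch, k, iterations=20, seed=0):
    """Plain weighted Lloyd's k-means on the sketch (a stand-in, not ball k-means)."""
    rng = random.Random(seed)
    centers = [list(c) for c, _ in rng.sample(sketch, k)]
    dims = len(centers[0])
    for _ in range(iterations):
        sums = [[0.0] * dims for _ in range(k)]
        weights = [0.0] * k
        for c, w in sketch:
            j = min(range(k), key=lambda i: dist(centers[i], c))
            weights[j] += w
            for d, x in enumerate(c):
                sums[j][d] += w * x
        # Keep the old center if a cluster received no weight this round.
        centers = [
            [s / weights[j] for s in sums[j]] if weights[j] > 0 else centers[j]
            for j in range(k)
        ]
    return centers


if __name__ == "__main__":
    rng = random.Random(42)
    data = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(1000)]
    data += [(rng.gauss(10, 1), rng.gauss(10, 1)) for _ in range(1000)]
    rng.shuffle(data)
    sketch = build_sketch(data, max_centroids=50)
    print(len(sketch), weighted_kmeans(sketch, k=2))
```

The point of the design is that the expensive pass touches each data point once and keeps only a small weighted summary, so the final clustering step can afford a more careful algorithm because it runs on far fewer points.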
