There is a streaming k-means implementation in MLlib that uses reservoir sampling.
On Tue, Jun 2, 2015 at 2:24 AM, Marko Dinic <[email protected]> wrote: > Ted, > > Thank you for your answer. > > What would you then recommend me to do? My idea is to implement it to > enable clustering of time series using DTW (Dynamic Time Warping) as > distance measure. As you know, the main problem is that K-medoids is not > scalable, so that's standing on my way. Of course, it could be used with > other distances as well. > > I have already implemented something that I consider a scalable K-medoids, > based on using pivots to speed up medoid selection ( > https://seer.lcc.ufmg.br/index.php/jidm/article/viewFile/99/82). This > works for distance measures such as Euclidean, has some limitations (best > results are in case of normal distribution, outliers could be a problem), > but it works pretty good (considering the computations). The thing is, it > can't be used with DTW, since it relies on projections, while triangle > inequality for DTW does not hold. That is why I'm considering this > Streaming approach now. > > Would you think that it is worthy of giving a shot? I'm really stretching > for a scalable solution. > > Best regards, > Marko > > > On Tue 02 Jun 2015 12:03:40 AM CEST, Ted Dunning wrote: > >> The streaming k-means works by building a sketch of the data which is then >> used to do real clustering. >> >> It might be that this sketch would be acceptable to do k-medoids, but that >> is definitely not guaranteed. >> >> Similarly, it might be possible to build a medoid sketch instead of a mean >> based sketch, but this is also unexplored ground. >> >> The virtue of the first approach (using a m-means sketch as input to >> k-medoids) would be that it would make the k-medoids scalable. >> >> >> >> On Mon, Jun 1, 2015 at 1:04 PM, Marko Dinic <[email protected]> >> wrote: >> >> Hello everyone, >>> >>> I have an idea and I would like to get a validation from community about >>> it. >>> >>> In Mahout there is an implementation of Streaming K-means. I'm interested >>> in your opinion would it make sense to make a similar implementation of >>> Streaming K-medoids? >>> >>> K-medoids has even bigger problems than K-means because it's not >>> scalable, >>> but can be useful in some cases (e.g. It allows more sophisticated >>> distance >>> measures). >>> >>> What is your opinion about implementation of this? >>> >>> Best regards, >>> Marko >>> >>> >>
