There is a streaming k-means implementation in MLlib that uses reservoir
sampling.

On Tue, Jun 2, 2015 at 2:24 AM, Marko Dinic <[email protected]>
wrote:

> Ted,
>
> Thank you for your answer.
>
> What would you then recommend me to do? My idea is to implement it to
> enable clustering of time series using DTW (Dynamic Time Warping) as
> distance measure. As you know, the main problem is that K-medoids is not
> scalable, so that's standing on my way. Of course, it could be used with
> other distances as well.
>
> I have already implemented something that I consider a scalable K-medoids,
> based on using pivots to speed up medoid selection (
> https://seer.lcc.ufmg.br/index.php/jidm/article/viewFile/99/82). This
> works for distance measures such as Euclidean, has some limitations (best
> results are in case of normal distribution, outliers could be a problem),
> but it works pretty good (considering the computations). The thing is, it
> can't be used with DTW, since it relies on projections, while triangle
> inequality for DTW does not hold. That is why I'm considering this
> Streaming approach now.
>
> Would you think that it is worthy of giving a shot? I'm really stretching
> for a scalable solution.
>
> Best regards,
> Marko
>
>
> On Tue 02 Jun 2015 12:03:40 AM CEST, Ted Dunning wrote:
>
>> The streaming k-means works by building a sketch of the data which is then
>> used to do real clustering.
>>
>> It might be that this sketch would be acceptable to do k-medoids, but that
>> is definitely not guaranteed.
>>
>> Similarly, it might be possible to build a medoid sketch instead of a mean
>> based sketch, but this is also unexplored ground.
>>
>> The virtue of the first approach (using a m-means sketch as input to
>> k-medoids) would be that it would make the k-medoids scalable.
>>
>>
>>
>> On Mon, Jun 1, 2015 at 1:04 PM, Marko Dinic <[email protected]>
>> wrote:
>>
>>  Hello everyone,
>>>
>>> I have an idea and I would like to get a validation from community about
>>> it.
>>>
>>> In Mahout there is an implementation of Streaming K-means. I'm interested
>>> in your opinion would it make sense to make a similar implementation of
>>> Streaming K-medoids?
>>>
>>> K-medoids has even bigger problems than K-means because it's not
>>> scalable,
>>> but can be useful in some cases (e.g. It allows more sophisticated
>>> distance
>>> measures).
>>>
>>> What is your opinion about implementation of this?
>>>
>>> Best regards,
>>> Marko
>>>
>>>
>>

Reply via email to