Anand,

That is a fine idea.  It is called a medoid instead of a mean (
https://en.wikipedia.org/wiki/Medoid )

The basic idea is that for any metric m, you can define the medoid as the
element from the set that minimizes the sum of the distances to the other
elements for that metric.  In one dimension, the L_1-medoid is the median.
The L_2-medoid is the point in the dataset closest to the mean.

As you say, computing the medoid is expensive.  The cost is clearly bounded
by O(n^2) but you might be able to do somewhat better in practice with
clever branch-and-bound pruning.  I don't think that there is anything like
an O(n) algorithm for medoid computation.
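To make that concrete, here is a brute-force sketch of the idea (toy data and all names are invented for illustration; this is not Mahout code, and the DTW here is the bare textbook recurrence):

```python
# Hypothetical sketch, not Mahout code: brute-force medoid under an
# arbitrary metric.  Every pairwise distance is evaluated, which is the
# O(n^2) cost mentioned above.

def dtw(a, b):
    """Minimal dynamic time warping distance between two sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Cheapest way to reach (i, j): match, stretch a, or stretch b.
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

def medoid(items, metric):
    """The element minimizing the sum of distances to all the others."""
    return min(items, key=lambda x: sum(metric(x, y) for y in items))

# DTW tolerates different lengths, so the medoid is well defined even
# when a mean is not.
signals = [[0, 1, 2], [0, 1, 2, 2], [5, 6, 7], [1, 3, 5]]
center = medoid(signals, dtw)

# Sanity check of the one-dimensional claim: under L_1 the medoid is a
# median (2 is a median of {1, 2, 3, 10}).
points = [[1], [2], [3], [10]]
l1 = lambda x, y: abs(x[0] - y[0])
assert medoid(points, l1) == [2]
```

Swapping in any other metric for dtw gives the corresponding medoid, at the same O(n^2) number of metric evaluations.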

On Wed, Jan 14, 2015 at 8:25 PM, Anand Avati <av...@gluster.org> wrote:

> Perhaps you could think of the centroid as one of the signals itself, from
> which the average distance to the rest of the signals in the cluster is the
> lowest? This way instead of finding that mythical "mean" of DTWs, you hop
> from one signal to another over iterations as you refine memberships.
> However "mean" calculation would become O(N^2) (within a cluster).
>
> Be warned I have never tried this method.
>
> Thanks
>
> On Sat Jan 10 2015 at 1:03:25 AM Marko Dinic <marko.di...@nissatech.com>
> wrote:
>
> > Ted,
> >
> > It's because I'm trying to cluster time series that may differ in
> > length, some parts may be shifted, and some parts may last longer in
> > one signal than in the other (somewhat skewed), but they represent
> > more or less the same thing. DTW seems good because it's designed for
> > such things (it's used a lot in speech recognition, where problems
> > like this occur). The problem is that k-means involves calculating
> > centroids (and the thing is, I need some kind of centroids that will
> > represent the clusters in the end). Maybe I'm wrong, please correct
> > me, I'm new to this kind of thing, but there is a problem finding
> > centroids for such signals. The distance measures used when
> > recalculating centroids (such as Euclidean) compare the two signals
> > (a signal and a centroid) component by component. How can I use such
> > a distance measure for signals of different lengths? Also, there is
> > shifting and skewing in signals that represent the same thing, so
> > their mean could be totally wrong. For example, the mean of two
> > sinusoids, one of which is shifted by Pi, is 0. And that's definitely
> > not a good centroid in my case.
> >
> > So, I'm trying to find a scalable solution for my problem. I tried
> > to fit it into what's already implemented in Mahout (for clustering),
> > but it's not so obvious to me how.
> >
> > I'm open to suggestions, I'm still new to all of this.
> >
> > Thanks,
> > Marko
> >
> > On Sat 10 Jan 2015 07:32:33 AM CET, Ted Dunning wrote:
> > > Why is it you can't compute a mean?
> > >
> > >
> > >
> > > On Fri, Jan 9, 2015 at 5:03 AM, Marko Dinic <marko.di...@nissatech.com
> >
> > > wrote:
> > >
> > >> Thank you for your answer Ted.
> > >>
> > >> What about some kind of bisecting k-means? I'm trying to cluster
> > >> time series of different lengths, and I came up with the idea of
> > >> using DTW as a similarity measure, which seems adequate, but the
> > >> thing is, I cannot use it with k-means, since it's hard to define
> > >> centroids based on time series which can have different
> > >> length/phase. So I was thinking about hierarchical clustering,
> > >> since it seems appropriate to combine with DTW, but it is not
> > >> scalable, as you said. So my next thought is to try bisecting
> > >> k-means, which seems scalable, since it is based on repeated
> > >> k-means steps. My idea is, step by step:
> > >>
> > >> - Take two signals as initial centroids (maybe the two signals
> > >> with the smallest similarity, calculated using DTW)
> > >> - Assign all signals to the two initial centroids
> > >> - Repeat the procedure on the biggest cluster
> > >>
> > >> In this way I could use DTW as the distance measure, which could
> > >> be useful since my data may be shifted and skewed, and I would
> > >> avoid calculating centroids. At the end I could take, from each
> > >> cluster, the signal that is most similar to the others in the
> > >> cluster (some kind of centroid/medoid).
> > >>
> > >> What do you think about this approach and about the scalability?
> > >>
> > >> I would highly appreciate your answer, thanks.
> > >>
> > >> On Thu 08 Jan 2015 08:19:18 PM CET, Ted Dunning wrote:
> > >>
> > >>> On Thu, Jan 8, 2015 at 7:00 AM, Marko Dinic <
> marko.di...@nissatech.com
> > >
> > >>> wrote:
> > >>>
> > >>>   1) Is there an implementation of DTW (Dynamic Time Warping) in
> Mahout
> > >>>> that
> > >>>> could be used as a distance measure for clustering?
> > >>>>
> > >>>>
> > >>> No.
> > >>>
> > >>>
> > >>>
> > >>>> 2) Why isn't there an implementation of K-mediods in Mahout? I'm
> > guessing
> > >>>> that it could not be implemented efficiently on Hadoop, but I wanted
> > to
> > >>>> check if something like that is possible.
> > >>>>
> > >>>>
> > >>> Scalability as you suspected.
> > >>>
> > >>>
> > >>>
> > >>>> 3) Same question, just considering Agglomerative Hierarchical
> > clustering.
> > >>>>
> > >>>>
> > >>> Again.  Agglomerative algorithms tend to be O(n^2), which
> > >>> conflicts with scaling.
> > >>>
> > >>>
> > >
> >
>
