On Mon, Feb 7, 2011 at 11:35 AM, Robin Anil <[email protected]> wrote:
> On Tue, Feb 8, 2011 at 12:47 AM, Ted Dunning <[email protected]> wrote:
>
> > The problem is that the centroids are the average of many documents.
> > This means that the number of non-zero elements in each centroid vector
> > increases as the number of documents increases.
>
> If we approximate the centroid by the point nearest to it, then,
> considering we have a lot of input data, the centroids would be real
> points (part of the input dataset) instead of imaginary ones (averages).
> Some loss is incurred here.

This also becomes much more computationally intense because you can't use
combiners. Averages are really good in this respect.

> Hashed encoding would be an easier solution. The same or similar loss is
> incurred here as well due to collisions.

Actually not. If you have multiple probes, then hashed encoding is a form
of random projection and you typically will not lose any expressivity.
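[Editor's aside: the combiner point can be made concrete. A mean decomposes into a partial sum plus a count, both of which merge associatively, so a MapReduce combiner can pre-aggregate each mapper's output before the shuffle; a nearest-real-point (medoid) update has no such decomposition. A minimal Python sketch, not Mahout code — all names here are illustrative:]

```python
def partial_stats(docs):
    # Combiner step: collapse a mapper's documents into one
    # (sum_vector, count) pair.
    total = [0.0] * len(docs[0])
    for d in docs:
        total = [t + x for t, x in zip(total, d)]
    return total, len(docs)

def merge(a, b):
    # Reducer step: partial sums and counts are associative and
    # commutative, so partials can be merged in any order.
    return [x + y for x, y in zip(a[0], b[0])], a[1] + b[1]

def centroid(stats):
    # Final division happens once, after all partials are merged.
    s, n = stats
    return [x / n for x in s]
```

Merging the partials from two mappers gives exactly the same centroid as averaging all the documents in one pass.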

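[Editor's aside: the multiple-probe idea can be sketched as follows. Each feature is hashed into k slots with weight w/sqrt(k), so a matching feature still contributes k * (w/sqrt(k)) * (w'/sqrt(k)) = w * w' to a dot product, while collisions only add noise that shrinks as the table grows — which is why it behaves like a random projection. This is a minimal sketch assuming a simple md5-based hash, not Mahout's actual encoder API; all names are illustrative:]

```python
import hashlib

def probe_index(feature, probe, dim):
    # Deterministic hash of the (feature, probe) pair into [0, dim).
    digest = hashlib.md5(f"{feature}:{probe}".encode()).hexdigest()
    return int(digest, 16) % dim

def hashed_encode(weights, dim=64, probes=3):
    # Spread each feature's weight across `probes` hashed slots,
    # scaled by 1/sqrt(probes) so dot products between encoded
    # vectors are preserved in expectation.
    vec = [0.0] * dim
    scale = probes ** -0.5
    for feature, weight in weights.items():
        for p in range(probes):
            vec[probe_index(feature, p, dim)] += weight * scale
    return vec
```

With a single probe, one collision wipes out a feature; with several probes, a feature's weight survives in the slots that did not collide, so expressivity is largely retained.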