On Mon, Feb 7, 2011 at 11:35 AM, Robin Anil <[email protected]> wrote:

> On Tue, Feb 8, 2011 at 12:47 AM, Ted Dunning <[email protected]>
> wrote:
>
> > The problem is that the centroids are the average of many documents.
> > This means that the number of non-zero elements in each centroid vector
> > increases as the number of documents increases.
> >
> What if we approximate the centroid by the input point nearest to it?
> Given that we have a lot of input data, the centroids would then be real
> points (part of the input dataset) instead of imaginary ones (averages).
> Some loss is incurred here.
>

This also becomes much more computationally intense because you can't use
combiners.  Averages have the nice property that partial sums and counts
from each mapper can be merged in any order; a nearest-point "centroid"
doesn't decompose that way.
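To make the combiner point concrete, here is a minimal sketch (hypothetical helper names, not Mahout code) showing why averages are combiner-friendly: any partition of the input reduces to the same (sum, count) pair, so partial results can be merged on the map side.

```python
# Sketch: averages decompose into (sum, count) pairs that a combiner
# can merge in any grouping.  Hypothetical names; not Mahout code.

def partial(vectors):
    """Map/combine step: reduce a chunk of vectors to (sum, count)."""
    total = [0.0] * len(vectors[0])
    for v in vectors:
        total = [t + x for t, x in zip(total, v)]
    return total, len(vectors)

def merge(a, b):
    """Combiner/reducer step: merge two partial results."""
    (sa, ca), (sb, cb) = a, b
    return [x + y for x, y in zip(sa, sb)], ca + cb

def centroid(part):
    s, c = part
    return [x / c for x in s]

# Splitting the data differently gives the same centroid:
data = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
whole = partial(data)
split = merge(partial(data[:1]), partial(data[1:]))
assert centroid(whole) == centroid(split)
```

The "nearest real point" (medoid) has no such decomposition: the medoid of two chunks tells you nothing definite about the medoid of their union, so every mapper's points must reach the reducer.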


>
> Hashed encoding would be an easier solution. The same or similar loss is
> incurred here as well, due to collisions.
>
>
Actually not.  If you have multiple probes, then hashed encoding is a form
of random projection and you typically will not lose any expressivity.
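A rough sketch of what multi-probe hashed encoding looks like (illustrative only; parameter names and hashing scheme are my own, not Mahout's encoder implementation): each feature's weight is spread across several hashed slots, so a collision at one slot is unlikely to coincide with collisions at the others, and the encoding behaves like a sparse random projection.

```python
import hashlib

DIM = 1 << 10   # size of the hashed vector (assumed, for illustration)
PROBES = 3      # number of hash probes per feature

def slot(feature, probe):
    """Hash a (feature, probe) pair to a slot index."""
    h = hashlib.md5(f"{feature}:{probe}".encode()).hexdigest()
    return int(h, 16) % DIM

def encode(features, dim=DIM, probes=PROBES):
    """Encode {feature: weight} into a dim-sized vector,
    splitting each weight across `probes` hashed locations."""
    v = [0.0] * dim
    for f, weight in features.items():
        for p in range(probes):
            v[slot(f, p)] += weight / probes
    return v

doc = {"hadoop": 2.0, "mahout": 1.0, "cluster": 3.0}
vec = encode(doc)
# Total mass is preserved regardless of collisions:
assert abs(sum(vec) - sum(doc.values())) < 1e-9
```

With a single probe, two colliding features are indistinguishable; with k probes they would have to collide at all k slots to be confused, which is why expressivity is typically retained.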
