On Mon, Feb 7, 2011 at 11:35 AM, Robin Anil <[email protected]> wrote:
> On Tue, Feb 8, 2011 at 12:47 AM, Ted Dunning <[email protected]> wrote:
>
> > The problem is that the centroids are the average of many documents.
> > This means that the number of non-zero elements in each centroid vector
> > increases as the number of documents increases.
>
> If we approximate the centroid by the point nearest to it, then,
> considering we have a lot of input data, the centroids would be real
> points (part of the input dataset) instead of imaginary ones (averages).
> Some loss is incurred here.

This also becomes much more computationally intense because you can't use
combiners. Averages are really good in this respect.

> Hashed encoding would be an easier solution. The same or similar loss is
> incurred here as well due to collisions.

Actually not. If you have multiple probes, then hashed encoding is a form
of random projection and you typically will not lose any expressivity.
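[Editor's aside: the combiner point can be made concrete. A mean decomposes into a partial sum plus a count, both of which merge associatively, so a MapReduce combiner can pre-aggregate each mapper's output before the shuffle; a nearest-real-point (medoid) update has no such decomposition. A minimal Python sketch, not Mahout code — all names here are illustrative:]

```python
def partial_stats(docs):
    # Combiner step: collapse a mapper's documents into one
    # (sum_vector, count) pair.
    total = [0.0] * len(docs[0])
    for d in docs:
        total = [t + x for t, x in zip(total, d)]
    return total, len(docs)

def merge(a, b):
    # Reducer step: partial sums and counts are associative and
    # commutative, so partials can be merged in any order.
    return [x + y for x, y in zip(a[0], b[0])], a[1] + b[1]

def centroid(stats):
    # Final division happens once, after all partials are merged.
    s, n = stats
    return [x / n for x in s]
```

Merging the partials from two mappers gives exactly the same centroid as averaging all the documents in one pass.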

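[Editor's aside: the multiple-probe idea can be sketched as follows. Each feature is hashed into k slots with weight w/sqrt(k), so a matching feature still contributes k * (w/sqrt(k)) * (w'/sqrt(k)) = w * w' to a dot product, while collisions only add noise that shrinks as the table grows — which is why it behaves like a random projection. This is a minimal sketch assuming a simple md5-based hash, not Mahout's actual encoder API; all names are illustrative:]

```python
import hashlib

def probe_index(feature, probe, dim):
    # Deterministic hash of the (feature, probe) pair into [0, dim).
    digest = hashlib.md5(f"{feature}:{probe}".encode()).hexdigest()
    return int(digest, 16) % dim

def hashed_encode(weights, dim=64, probes=3):
    # Spread each feature's weight across `probes` hashed slots,
    # scaled by 1/sqrt(probes) so dot products between encoded
    # vectors are preserved in expectation.
    vec = [0.0] * dim
    scale = probes ** -0.5
    for feature, weight in weights.items():
        for p in range(probes):
            vec[probe_index(feature, p, dim)] += weight * scale
    return vec
```

With a single probe, one collision wipes out a feature; with several probes, a feature's weight survives in the slots that did not collide, so expressivity is largely retained.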