Great, thanks for the info!
On Mon, Feb 7, 2011 at 2:12 PM, Ted Dunning <[email protected]> wrote:

> On Mon, Feb 7, 2011 at 10:43 AM, Marc Hadfield <[email protected]> wrote:
>
>> In the case outlined below, does that mean each node of a hadoop cluster
>> would need to have the centroid information fully in memory for k-means, or
>> is this spread over the cluster in some way?
>
> Yes. Every node needs every centroid in memory.
>
>> If each node has to have the centroid information fully in memory, are
>> there any other data structures which need to be fully in memory in each
>> node, and if so, what are they proportional to (again, specifically for
>> k-means)? I.e., is anything memory resident related to the number of
>> documents?
>
> No. Just centroids. Of course, if you have sparse centroids, then the
> number of non-zero elements will increase roughly with the log of the
> number of documents, but if you have space for the dense version of the
> centroid, then nothing should scale with the number of documents.
>
>> If the centroid information (dependent on the number of features and
>> clusters) needs to be fully in memory in all hadoop nodes, but not
>> anything related to the number of documents, then the k-means algorithm
>> would be scalable in the number of documents (just add more hadoop nodes
>> to increase document throughput), but *not* scalable in the number of
>> clusters / features, since the algorithm requires a full copy of this
>> information in each node. Is this accurate?
>
> Yes.
>
> Scalability in the number of features can be achieved by using a hashed
> encoding.
>
> Scalability in the number of centroids can be achieved by changing the
> code a bit so that the centroid sets are spread across several nodes.
> That would require some cleverness in the input format so that each split
> is sent to several nodes. An alternative would be to add an extra
> map-reduce step where the first reducer is where the partial
> classification is done.
> My guess is that scaling the number of centroids isn't a great idea
> beyond a moderate size because k-means will break down. Better to do
> hierarchical clustering to get very fine distinctions. That should be
> doable in a much more scalable way.
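The hashed encoding mentioned above can be sketched roughly like this (this is not Mahout's implementation; the bucket count and hash function are illustrative assumptions). The point is that the centroid width is fixed by the number of hash buckets, not by the vocabulary, so per-node memory stays bounded no matter how many distinct features the corpus has:

```python
# Sketch of hashed feature encoding -- illustrative, not Mahout's code.
# Tokens are folded into a fixed number of buckets, so every centroid is
# the same (bounded) width regardless of vocabulary size.
import hashlib

NUM_BUCKETS = 2 ** 10  # fixed centroid width, chosen arbitrarily here

def hash_encode(tokens, num_buckets=NUM_BUCKETS):
    """Fold an unbounded token stream into a fixed-width dense vector."""
    vec = [0.0] * num_buckets
    for tok in tokens:
        # Any stable hash works; md5 is used here only for determinism.
        h = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16)
        vec[h % num_buckets] += 1.0
    return vec

# A centroid (or document vector) is now num_buckets floats no matter
# how many distinct features appear across the corpus.
doc = ["hadoop", "kmeans", "centroid", "hadoop"]
v = hash_encode(doc)
print(len(v))   # always NUM_BUCKETS
print(sum(v))   # one count per token occurrence, i.e. 4.0 here
```

Collisions fold different features into the same bucket, which is the usual trade-off of the hashing trick: some accuracy for a hard memory bound.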
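One common way to realize the hierarchical clustering suggested above is top-down bisecting k-means: repeatedly split the largest cluster with k=2. This is a minimal in-memory sketch on 1-D points (assumptions: naive Lloyd iterations, largest-cluster-first splitting), just to illustrate why each step needs only two centroids in memory instead of all of them:

```python
# Minimal bisecting k-means sketch -- an illustration, not Mahout's code.
# Each split runs plain k=2 Lloyd iterations, so only two centroids are
# ever held in memory at once, however fine the final clustering gets.
import random

def kmeans2(points, iters=10):
    """Split a list of 1-D points into two clusters with k=2 Lloyd steps."""
    c = random.sample(points, 2)
    for _ in range(iters):
        a = [p for p in points if abs(p - c[0]) <= abs(p - c[1])]
        b = [p for p in points if abs(p - c[0]) > abs(p - c[1])]
        if a:
            c[0] = sum(a) / len(a)
        if b:
            c[1] = sum(b) / len(b)
    return a, b

def bisecting(points, target_clusters):
    """Top-down hierarchical clustering: keep splitting the largest cluster."""
    clusters = [points]
    while len(clusters) < target_clusters:
        clusters.sort(key=len)      # split the largest cluster next
        big = clusters.pop()
        left, right = kmeans2(big)
        clusters.extend([left, right])
    return clusters

random.seed(0)
data = [random.gauss(m, 0.5) for m in (0, 5, 10) for _ in range(50)]
parts = bisecting(data, 3)
```

Each split is an independent, bounded-memory job, which is what makes the hierarchical approach easier to scale out than flat k-means with a very large k.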
