Also, the knn package has a single pass k-means implementation that can easily handle 20,000 clusters or more. This is done by using an approximate nearest neighbor algorithm inside the k-means implementation to decrease the time dependency on k to roughly log k.
See http://github.com/tdunning/knn Any help in testing these new capabilities or plumbing them into the standard Mahout capabilities would be very much appreciated. On Thu, Aug 9, 2012 at 7:05 AM, Ted Dunning <[email protected]> wrote: > The upcoming knn package has a file based matrix implementation that uses > memory mapping to allow sharing a copy of a large matrix between processes > and threads. > > Sent from my iPhone > > On Aug 9, 2012, at 1:48 AM, Abramov Pavel <[email protected]> wrote: > >> Hello, >> >> If think Zipf's law is relevant for my data. Thanks for idea about >> memory-mapping. >> >> 1) How can I "drop" extremely small/large clusters? There are 50% small >> clusters with only 1 member while 1 large cluster has 50% members. Is >> there a way to "split" large clusters with Kmeans? >> >> 2) Can I force Mahout not to use exact centroid but the closest point from >> my set? Any point has ~10 non-zero components while exact centroid is very >> dense (~200k). >> >> >> Thanks! >> >> Pavel >> >> >> 09.08.12 5:43 пользователь "Lance Norskog" <[email protected]> написал: >> >>> If Zipf's Law is relevant, locality will be much better than random. >>> Maybe you need a Vector implementation that is backed by memory-mapped >>> files? >>> >>> On Wed, Aug 8, 2012 at 12:26 PM, Abramov Pavel <[email protected]> >>> wrote: >>>> Thank you Jeff, Paritosh, >>>> >>>> Reducing cluster count from 1000 to 100 made my day. I will also try >>>> Canopy for initial cluster count. >>>> Unfortunately I don't know how to reduce my 200k dictionary. There are >>>> no >>>> unfrequent terms. >>>> >>>> I will provide you with Hadoop config shortly. But I am pretty sure I >>>> can't decrease number of Mappers/Reducers per node or increase memory >>>> limits. It will affect the whole cluster. >>>> >>>> >>>> Thank you! >>>> >>>> Pavel >>>> >>>> >>>> 08.08.12 16:15 пользователь "Jeff Eastman" <[email protected]> >>>> написал: >>>> >>>>> Consider that each cluster retains 4 vectors in memory in each mapper >>>>> and reducer, and that these vectors tend to become more dense (through >>>>> addition of multiple sparse components) as iterations proceed. With 1000 >>>>> clusters and 200k terms in your dictionary this will cause the heap >>>>> space to be consumed rapidly as you have noted. Some times you can work >>>>> around this problem by increasing your heap size on a per-job basis or >>>>> reducing the number of mappers and reducers allowed on each node. Also >>>>> be sure you are not launching reducers until all of your mapper tasks >>>>> have completed. >>>>> >>>>> In order to provide more help to you, it would be useful to understand >>>>> more about how your cluster is "well tuned". How many mappers & reducers >>>>> are you launching in parallel, the heapspace limits set for tasks on >>>>> each node, etc. >>>>> >>>>> For a quick test, try reducing the k to 500 or 100 to see how many >>>>> clusters you can reasonably compute with your dataset on your cluster. >>>>> Canopy is also a good way to get a feel for a good initial k, though it >>>>> can be hard to arrive at good T-values in some text clustering cases. >>>>> You, can also try hierarchical clustering with reduced k to stay under >>>>> your memory limits. >>>>> >>>>> >>>>> On 8/8/12 5:40 AM, Paritosh Ranjan wrote: >>>>>> A stacktrace of error would have helped in finding the exact error. >>>>>> >>>>>> However, number of clusters can create Heap Space problems ( If the >>>>>> vector dimension is also high ). >>>>>> Either try to reduce the number of initial clusters ( In my opinion, >>>>>> the best way to know about initial clusters is Canopy Clustering >>>>>> https://cwiki.apache.org/confluence/display/MAHOUT/Canopy+Clustering ) >>>>>> >>>>>> or, try to reduce the dimension of the vectors. >>>>>> >>>>>> PS : you are also providing numClusters twice >>>>>> >>>>>> --numClusters 1000 \ --numClusters 5 \ >>>>>> >>>>>> On 08-08-2012 10:42, Abramov Pavel wrote: >>>>>>> Hello, >>>>>>> >>>>>>> I am trying to run KMeans example on 15 000 000 documents (seq2sparse >>>>>>> output). >>>>>>> There are 1 000 clusters, 200 000 terms dictionary and 3-10 terms >>>>>>> document size (titles). seq2sparse produces 200 files 80 MB each. >>>>>>> >>>>>>> My job failed with Java heap space Error. 1st iteration passes while >>>>>>> 2nd iteration fails. On a Map phase of buildClusters I see a lot of >>>>>>> warnings, but it passes. Reduce phase of buildClusters fails with >>>>>>> "Java Heap space". >>>>>>> >>>>>>> I can not increase reducer/mapper memory in hadoop. My cluster is >>>>>>> tunned well. >>>>>>> >>>>>>> How can I avoid this situation? My cluster has 300 Mappers and 220 >>>>>>> Reducers running 40 servers 8-core 12 GB RAM. >>>>>>> >>>>>>> Thanks in advance! >>>>>>> >>>>>>> Here is KMeans parameters: >>>>>>> >>>>>>> ------------------------------------------------ >>>>>>> mahout kmeans -Dmapred.reduce.tasks=200 \ >>>>>>> -i ...tfidf-vectors/ \ >>>>>>> -o /tmp/clustering_results_kmeans/ \ >>>>>>> --clusters /tmp/clusters/ \ >>>>>>> --numClusters 1000 \ >>>>>>> --numClusters 5 \ >>>>>>> --overwrite \ >>>>>>> --clustering >>>>>>> ------------------------------------------------ >>>>>>> >>>>>>> Pavel >>>>>> >>>>>> >>>>>> >>>>>> >>>>> >>>> >>> >>> >>> >>> -- >>> Lance Norskog >>> [email protected] >>
