If Zipf's Law is relevant, locality will be much better than random. Maybe you need a Vector implementation that is backed by memory-mapped files?
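[Editorial note: Lance's suggestion above could be sketched roughly as below. This is a hypothetical illustration, not a Mahout API; the class and method names are invented. The backing storage lives in a memory-mapped file, so the OS page cache, not the JVM heap, holds the element data, and Zipf-skewed access keeps the hot pages resident.]

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.DoubleBuffer;
import java.nio.channels.FileChannel;

// Hypothetical sketch of a dense vector backed by a memory-mapped file.
// Element storage is off-heap in the mapped region; only the small
// wrapper object occupies JVM heap.
public class MMapVector {
    private final DoubleBuffer data;
    private final int size;

    public MMapVector(String path, int size) throws IOException {
        this.size = size;
        try (RandomAccessFile raf = new RandomAccessFile(path, "rw");
             FileChannel ch = raf.getChannel()) {
            // Map size * 8 bytes; per the javadoc, the mapping remains
            // valid after the channel is closed.
            this.data = ch.map(FileChannel.MapMode.READ_WRITE, 0,
                               (long) size * Double.BYTES)
                          .asDoubleBuffer();
        }
    }

    public double get(int i) { return data.get(i); }
    public void set(int i, double v) { data.put(i, v); }
    public int size() { return size; }
}
```

With frequently accessed terms clustered at low indices (the Zipf case), most reads would hit a handful of resident pages rather than faulting across the whole file.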
On Wed, Aug 8, 2012 at 12:26 PM, Abramov Pavel <[email protected]> wrote:
> Thank you Jeff, Paritosh,
>
> Reducing the cluster count from 1000 to 100 made my day. I will also try
> Canopy for the initial cluster count.
> Unfortunately I don't know how to reduce my 200k dictionary. There are no
> infrequent terms.
>
> I will provide you with the Hadoop config shortly. But I am pretty sure I
> can't decrease the number of mappers/reducers per node or increase memory
> limits. It would affect the whole cluster.
>
> Thank you!
>
> Pavel
>
> On 08.08.12 at 16:15, user "Jeff Eastman" <[email protected]> wrote:
>
>> Consider that each cluster retains 4 vectors in memory in each mapper
>> and reducer, and that these vectors tend to become denser (through the
>> addition of multiple sparse components) as iterations proceed. With 1000
>> clusters and 200k terms in your dictionary, this will cause the heap
>> space to be consumed rapidly, as you have noted. Sometimes you can work
>> around this problem by increasing your heap size on a per-job basis or
>> reducing the number of mappers and reducers allowed on each node. Also
>> be sure you are not launching reducers until all of your mapper tasks
>> have completed.
>>
>> In order to provide more help, it would be useful to understand more
>> about how your cluster is "well tuned": how many mappers and reducers
>> you are launching in parallel, the heap space limits set for tasks on
>> each node, etc.
>>
>> For a quick test, try reducing k to 500 or 100 to see how many
>> clusters you can reasonably compute with your dataset on your cluster.
>> Canopy is also a good way to get a feel for a good initial k, though it
>> can be hard to arrive at good T-values in some text clustering cases.
>> You can also try hierarchical clustering with reduced k to stay under
>> your memory limits.
>>
>>
>> On 8/8/12 5:40 AM, Paritosh Ranjan wrote:
>>> A stack trace of the error would have helped in finding the exact error.
>>>
>>> However, the number of clusters can create heap space problems (if the
>>> vector dimension is also high).
>>> Either try to reduce the number of initial clusters (in my opinion,
>>> the best way to learn about initial clusters is Canopy Clustering:
>>> https://cwiki.apache.org/confluence/display/MAHOUT/Canopy+Clustering)
>>>
>>> or try to reduce the dimension of the vectors.
>>>
>>> PS: you are also providing numClusters twice:
>>>
>>> --numClusters 1000 \
>>> --numClusters 5 \
>>>
>>> On 08-08-2012 10:42, Abramov Pavel wrote:
>>>> Hello,
>>>>
>>>> I am trying to run the KMeans example on 15,000,000 documents
>>>> (seq2sparse output).
>>>> There are 1,000 clusters, a 200,000-term dictionary, and documents of
>>>> 3-10 terms (titles). seq2sparse produces 200 files of 80 MB each.
>>>>
>>>> My job failed with a Java heap space error. The 1st iteration passes
>>>> while the 2nd iteration fails. In the map phase of buildClusters I see
>>>> a lot of warnings, but it passes. The reduce phase of buildClusters
>>>> fails with "Java heap space".
>>>>
>>>> I cannot increase reducer/mapper memory in Hadoop. My cluster is
>>>> well tuned.
>>>>
>>>> How can I avoid this situation? My cluster has 300 mappers and 220
>>>> reducers running on 40 servers, each 8-core with 12 GB RAM.
>>>>
>>>> Thanks in advance!
>>>>
>>>> Here are the KMeans parameters:
>>>>
>>>> ------------------------------------------------
>>>> mahout kmeans -Dmapred.reduce.tasks=200 \
>>>> -i ...tfidf-vectors/ \
>>>> -o /tmp/clustering_results_kmeans/ \
>>>> --clusters /tmp/clusters/ \
>>>> --numClusters 1000 \
>>>> --numClusters 5 \
>>>> --overwrite \
>>>> --clustering
>>>> ------------------------------------------------
>>>>
>>>> Pavel

-- 
Lance Norskog
[email protected]
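[Editorial note: Jeff's warning about the 4 retained vectors per cluster can be checked with back-of-envelope arithmetic, using only the numbers quoted in the thread. The worst-case assumption here, that every retained vector densifies toward the full 200k-term dictionary, is my own simplification for illustration.]

```java
// Rough heap estimate from the thread's numbers: 1000 clusters, 4
// retained vectors per cluster in each mapper/reducer, and vectors that
// may densify toward the full 200,000-term dictionary (8-byte doubles).
public class HeapEstimate {
    public static void main(String[] args) {
        long clusters = 1_000;
        long vectorsPerCluster = 4;      // retained per cluster, per Jeff
        long dictionarySize = 200_000;   // terms; worst case fully dense
        long bytesPerDouble = 8;

        long bytes = clusters * vectorsPerCluster * dictionarySize * bytesPerDouble;
        // Worst-case footprint per task, ignoring object overhead:
        System.out.printf("%.1f GB%n", bytes / 1e9);  // prints 6.4 GB
    }
}
```

At 6.4 GB per task in the worst case, several mapper or reducer slots on a 12 GB node cannot all fit, which is consistent with the reduce phase failing as vectors densify; dropping k to 100 shrinks the same estimate tenfold.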
