Hello, I think Zipf's law is relevant for my data. Thanks for the idea about memory-mapping.
1) How can I "drop" extremely small/large clusters? Half of my clusters are tiny (a single member each), while one large cluster holds 50% of all points. Is there a way to "split" large clusters with KMeans?
2) Can I force Mahout to use, instead of the exact centroid, the closest point from my set? Each point has ~10 non-zero components, while the exact centroid is very dense (~200k).

Thanks!

Pavel

On 09.08.12 5:43, "Lance Norskog" <[email protected]> wrote:

>If Zipf's Law is relevant, locality will be much better than random.
>Maybe you need a Vector implementation that is backed by memory-mapped
>files?
>
>On Wed, Aug 8, 2012 at 12:26 PM, Abramov Pavel <[email protected]>
>wrote:
>> Thank you Jeff, Paritosh,
>>
>> Reducing the cluster count from 1000 to 100 made my day. I will also try
>> Canopy for the initial cluster count.
>> Unfortunately I don't know how to reduce my 200k dictionary. There are no
>> infrequent terms.
>>
>> I will provide you with the Hadoop config shortly. But I am pretty sure I
>> can't decrease the number of Mappers/Reducers per node or increase memory
>> limits. It would affect the whole cluster.
>>
>> Thank you!
>>
>> Pavel
>>
>> On 08.08.12 16:15, "Jeff Eastman" <[email protected]> wrote:
>>
>>>Consider that each cluster retains 4 vectors in memory in each mapper
>>>and reducer, and that these vectors tend to become more dense (through
>>>addition of multiple sparse components) as iterations proceed. With 1000
>>>clusters and 200k terms in your dictionary, this will consume heap
>>>space rapidly, as you have noted. Sometimes you can work around this
>>>problem by increasing your heap size on a per-job basis or by reducing
>>>the number of mappers and reducers allowed on each node. Also be sure
>>>you are not launching reducers until all of your mapper tasks have
>>>completed.
>>>
>>>In order to provide more help, it would be useful to understand
>>>more about how your cluster is "well tuned".
>>>How many mappers & reducers are you launching in parallel? What
>>>heap-space limits are set for tasks on each node, etc.?
>>>
>>>For a quick test, try reducing k to 500 or 100 to see how many
>>>clusters you can reasonably compute with your dataset on your cluster.
>>>Canopy is also a good way to get a feel for a good initial k, though it
>>>can be hard to arrive at good T-values in some text clustering cases.
>>>You can also try hierarchical clustering with a reduced k to stay under
>>>your memory limits.
>>>
>>>
>>>On 8/8/12 5:40 AM, Paritosh Ranjan wrote:
>>>> A stack trace of the error would have helped in finding the exact problem.
>>>>
>>>> However, a large number of clusters can cause heap-space problems (if
>>>> the vector dimension is also high).
>>>> Either try to reduce the number of initial clusters (in my opinion,
>>>> the best way to choose initial clusters is Canopy Clustering:
>>>> https://cwiki.apache.org/confluence/display/MAHOUT/Canopy+Clustering)
>>>>
>>>> or try to reduce the dimension of the vectors.
>>>>
>>>> PS: you are also providing numClusters twice:
>>>>
>>>> --numClusters 1000 \ --numClusters 5 \
>>>>
>>>> On 08-08-2012 10:42, Abramov Pavel wrote:
>>>>> Hello,
>>>>>
>>>>> I am trying to run the KMeans example on 15,000,000 documents
>>>>> (seq2sparse output).
>>>>> There are 1,000 clusters, a 200,000-term dictionary, and 3-10 terms
>>>>> per document (titles). seq2sparse produces 200 files of 80 MB each.
>>>>>
>>>>> My job failed with a Java heap space error. The 1st iteration passes
>>>>> while the 2nd iteration fails. In the Map phase of buildClusters I see
>>>>> a lot of warnings, but it passes. The Reduce phase of buildClusters
>>>>> fails with "Java heap space".
>>>>>
>>>>> I can not increase reducer/mapper memory in Hadoop. My cluster is
>>>>> tuned well.
>>>>>
>>>>> How can I avoid this situation? My cluster has 300 Mappers and 220
>>>>> Reducers running on 40 servers (8 cores, 12 GB RAM each).
>>>>>
>>>>> Thanks in advance!
>>>>>
>>>>> Here are the KMeans parameters:
>>>>>
>>>>> ------------------------------------------------
>>>>> mahout kmeans -Dmapred.reduce.tasks=200 \
>>>>>   -i ...tfidf-vectors/ \
>>>>>   -o /tmp/clustering_results_kmeans/ \
>>>>>   --clusters /tmp/clusters/ \
>>>>>   --numClusters 1000 \
>>>>>   --numClusters 5 \
>>>>>   --overwrite \
>>>>>   --clustering
>>>>> ------------------------------------------------
>>>>>
>>>>> Pavel
>>>>
>>>>
>>>
>>
>
>
>--
>Lance Norskog
>[email protected]
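[A note on question 2 and Jeff's heap estimate. Snapping a centroid to the closest real point is essentially the k-medoids idea, and in stock Mahout k-means it would have to be done as a post-processing step over the cluster output. Below is a minimal illustrative sketch, not Mahout API: sparse vectors are modeled as Python `{term_index: weight}` dicts, and the helper names (`sparse_dist`, `snap_to_medoid`) are hypothetical.]

```python
import math

def sparse_dist(a, b):
    """Euclidean distance between two sparse dict-vectors."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))

def snap_to_medoid(centroid, points):
    """Return the actual data point closest to the (possibly dense) centroid,
    so the cluster center stays as sparse as the input vectors."""
    return min(points, key=lambda p: sparse_dist(centroid, p))

# Jeff's heap arithmetic: ~4 vectors retained per cluster in each mapper and
# reducer, and centroids densify toward the full 200k-term dictionary.
clusters, vecs_per_cluster, dims, bytes_per_double = 1000, 4, 200_000, 8
heap_bytes = clusters * vecs_per_cluster * dims * bytes_per_double
print(heap_bytes / 2**30)  # roughly 6 GiB per task, before any JVM overhead
```

[With k = 1000 and 200k dimensions this estimate already exceeds what a 12 GB node running several concurrent tasks can give each JVM, which matches the observed "Java heap space" failure and why dropping k to 100 helped.]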
