The upcoming knn package has a file-based matrix implementation that uses memory mapping, so a single copy of a large matrix can be shared between processes and threads.
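The file-backed idea above can be illustrated with plain Java NIO: map the backing file into memory once, and every process or thread that maps the same file shares one copy via the page cache. This is only a hypothetical sketch, not the knn package's actual API; the class name, file layout (row-major doubles), and constructor are invented for illustration.

```java
import java.io.RandomAccessFile;
import java.nio.DoubleBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Illustrative file-backed dense matrix (NOT Mahout's implementation).
// The file holds rows*cols doubles in row-major order; mapping it
// READ_WRITE lets multiple processes share one copy through the OS.
class MmapMatrix {
    private final DoubleBuffer data;
    private final int cols;

    MmapMatrix(String path, int rows, int cols) throws Exception {
        this.cols = cols;
        try (RandomAccessFile raf = new RandomAccessFile(path, "rw");
             FileChannel ch = raf.getChannel()) {
            long bytes = (long) rows * cols * Double.BYTES;
            // Mapping beyond the current file size extends the file.
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_WRITE, 0, bytes);
            this.data = buf.asDoubleBuffer();
            // The mapping stays valid after the channel is closed.
        }
    }

    double get(int r, int c) { return data.get(r * cols + c); }

    void set(int r, int c, double v) { data.put(r * cols + c, v); }
}
```

Writes go through the shared mapping, so another process mapping the same file sees updates without any explicit IPC; durability still requires an explicit `force()` on the buffer.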
Sent from my iPhone

On Aug 9, 2012, at 1:48 AM, Abramov Pavel <[email protected]> wrote:

> Hello,
>
> I think Zipf's law is relevant for my data. Thanks for the idea about
> memory-mapping.
>
> 1) How can I "drop" extremely small/large clusters? 50% of my clusters
> are small, with only 1 member each, while 1 large cluster holds 50% of
> all members. Is there a way to "split" large clusters with KMeans?
>
> 2) Can I force Mahout not to use the exact centroid but the closest
> point from my set? Each of my points has ~10 non-zero components, while
> the exact centroid is very dense (~200k).
>
> Thanks!
>
> Pavel
>
> On 09.08.12 5:43, "Lance Norskog" <[email protected]> wrote:
>
>> If Zipf's Law is relevant, locality will be much better than random.
>> Maybe you need a Vector implementation that is backed by memory-mapped
>> files?
>>
>> On Wed, Aug 8, 2012 at 12:26 PM, Abramov Pavel <[email protected]> wrote:
>>
>>> Thank you Jeff, Paritosh,
>>>
>>> Reducing the cluster count from 1000 to 100 made my day. I will also
>>> try Canopy for the initial cluster count. Unfortunately I don't know
>>> how to reduce my 200k dictionary. There are no infrequent terms.
>>>
>>> I will provide you with the Hadoop config shortly. But I am pretty
>>> sure I can't decrease the number of Mappers/Reducers per node or
>>> increase the memory limits. It would affect the whole cluster.
>>>
>>> Thank you!
>>>
>>> Pavel
>>>
>>> On 08.08.12 16:15, "Jeff Eastman" <[email protected]> wrote:
>>>
>>>> Consider that each cluster retains 4 vectors in memory in each mapper
>>>> and reducer, and that these vectors tend to become more dense
>>>> (through the addition of multiple sparse components) as iterations
>>>> proceed. With 1000 clusters and 200k terms in your dictionary, this
>>>> causes heap space to be consumed rapidly, as you have noted.
>>>> Sometimes you can work around this problem by increasing your heap
>>>> size on a per-job basis or by reducing the number of mappers and
>>>> reducers allowed on each node. Also be sure you are not launching
>>>> reducers until all of your mapper tasks have completed.
>>>>
>>>> In order to provide more help, it would be useful to understand more
>>>> about how your cluster is "well tuned": how many mappers & reducers
>>>> you are launching in parallel, the heap-space limits set for tasks on
>>>> each node, etc.
>>>>
>>>> For a quick test, try reducing k to 500 or 100 to see how many
>>>> clusters you can reasonably compute with your dataset on your
>>>> cluster. Canopy is also a good way to get a feel for a good initial
>>>> k, though it can be hard to arrive at good T-values in some text
>>>> clustering cases. You can also try hierarchical clustering with a
>>>> reduced k to stay under your memory limits.
>>>>
>>>> On 8/8/12 5:40 AM, Paritosh Ranjan wrote:
>>>>
>>>>> A stack trace of the error would have helped in finding the exact
>>>>> error.
>>>>>
>>>>> However, a large number of clusters can cause heap-space problems
>>>>> (if the vector dimension is also high). Either try to reduce the
>>>>> number of initial clusters (in my opinion, the best way to choose
>>>>> the initial clusters is Canopy Clustering:
>>>>> https://cwiki.apache.org/confluence/display/MAHOUT/Canopy+Clustering)
>>>>> or try to reduce the dimension of the vectors.
>>>>>
>>>>> PS: you are also providing numClusters twice:
>>>>>
>>>>> --numClusters 1000 \ --numClusters 5 \
>>>>>
>>>>> On 08-08-2012 10:42, Abramov Pavel wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I am trying to run the KMeans example on 15 000 000 documents
>>>>>> (seq2sparse output). There are 1 000 clusters, a 200 000-term
>>>>>> dictionary, and documents of 3-10 terms (titles). seq2sparse
>>>>>> produces 200 files of 80 MB each.
>>>>>>
>>>>>> My job failed with a Java heap space error. The 1st iteration
>>>>>> passes while the 2nd iteration fails.
>>>>>> In the Map phase of buildClusters I see a lot of warnings, but it
>>>>>> passes. The Reduce phase of buildClusters fails with "Java heap
>>>>>> space".
>>>>>>
>>>>>> I can not increase reducer/mapper memory in Hadoop. My cluster is
>>>>>> tuned well.
>>>>>>
>>>>>> How can I avoid this situation? My cluster has 300 Mappers and 220
>>>>>> Reducers running on 40 servers, each 8-core with 12 GB RAM.
>>>>>>
>>>>>> Thanks in advance!
>>>>>>
>>>>>> Here are the KMeans parameters:
>>>>>>
>>>>>> ------------------------------------------------
>>>>>> mahout kmeans -Dmapred.reduce.tasks=200 \
>>>>>> -i ...tfidf-vectors/ \
>>>>>> -o /tmp/clustering_results_kmeans/ \
>>>>>> --clusters /tmp/clusters/ \
>>>>>> --numClusters 1000 \
>>>>>> --numClusters 5 \
>>>>>> --overwrite \
>>>>>> --clustering
>>>>>> ------------------------------------------------
>>>>>>
>>>>>> Pavel
>>
>> --
>> Lance Norskog
>> [email protected]
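The heap pressure Jeff describes in the thread above can be checked with back-of-the-envelope arithmetic. The "4 vectors per cluster per task" figure comes from his message; the assumption that a fully densified vector over a 200k-term dictionary costs 8 bytes per component (double precision) is mine, so treat this as a rough sketch rather than an exact account of Mahout's memory use.

```java
// Rough worst-case heap estimate for Mahout k-means state per task
// (a sketch based on the figures in the thread, not Mahout code).
class ClusterHeapEstimate {
    // clusters * retained vectors per cluster * dimensions * 8 bytes/double
    static long worstCaseBytes(long clusters, long vectorsPerCluster, long dims) {
        return clusters * vectorsPerCluster * dims * 8L;
    }

    public static void main(String[] args) {
        // The thread's scenario: k = 1000, 4 retained vectors, 200k terms.
        long atK1000 = worstCaseBytes(1000, 4, 200_000);
        System.out.println(atK1000); // 6400000000 bytes, ~6 GB per task

        // Dropping k to 100, as Pavel did, cuts the bound tenfold.
        long atK100 = worstCaseBytes(100, 4, 200_000);
        System.out.println(atK100); // 640000000 bytes, ~0.6 GB per task
    }
}
```

With several mappers and reducers per 12 GB node, a ~6 GB-per-task worst case explains why the second iteration (once centroids have densified) exhausts the heap while the first does not, and why reducing k or the dictionary size helps.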
