If Zipf's Law is relevant, locality will be much better than random.
Maybe you need a Vector implementation that is backed by memory-mapped
files?
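To make the suggestion concrete, here is a minimal sketch of what a file-backed dense vector could look like. This is not a Mahout Vector implementation and the class name is hypothetical; it only illustrates the memory-mapping idea, where centroid data lives in the OS page cache rather than the JVM heap:

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch: a dense double "vector" backed by a memory-mapped
// file, so data larger than the heap is paged in and out by the OS.
public class MmapVector {
    private final MappedByteBuffer buf;
    private final int cardinality;

    public MmapVector(Path file, int cardinality) throws IOException {
        this.cardinality = cardinality;
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE,
                StandardOpenOption.READ,
                StandardOpenOption.WRITE)) {
            // Map cardinality doubles (8 bytes each) directly from the file.
            buf = ch.map(FileChannel.MapMode.READ_WRITE, 0,
                    (long) cardinality * Double.BYTES);
        }
    }

    public double get(int i) { return buf.getDouble(i * Double.BYTES); }

    public void set(int i, double v) { buf.putDouble(i * Double.BYTES, v); }

    public int size() { return cardinality; }
}
```

With Zipfian access patterns, the frequently touched pages would stay resident while rare ones get evicted, which is why locality matters here.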

On Wed, Aug 8, 2012 at 12:26 PM, Abramov Pavel <[email protected]> wrote:
> Thank you Jeff, Paritosh,
>
> Reducing cluster count from 1000 to 100 made my day. I will also try

> Canopy for initial cluster count.
> Unfortunately I don't know how to reduce my 200k dictionary. There are no
> infrequent terms.
>
> I will provide you with the Hadoop config shortly. But I am pretty sure I
> can't decrease the number of mappers/reducers per node or increase the
> memory limits; it would affect the whole cluster.
>
>
> Thank you!
>
> Pavel
>
>
> 08.08.12 16:15 пользователь "Jeff Eastman" <[email protected]>
> написал:
>
>>Consider that each cluster retains 4 vectors in memory in each mapper
>>and reducer, and that these vectors tend to become more dense (through
>>addition of multiple sparse components) as iterations proceed. With 1000
>>clusters and 200k terms in your dictionary this will cause the heap
>>space to be consumed rapidly, as you have noted. Sometimes you can work
>>around this problem by increasing your heap size on a per-job basis or by
>>reducing the number of mappers and reducers allowed on each node. Also
>>be sure you are not launching reducers until all of your mapper tasks
>>have completed.
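[Editor's note: the figures quoted in this thread allow a quick back-of-the-envelope heap estimate. Assumptions: 4 retained vectors per cluster as described above, and fully dense doubles once the sparse components accumulate.]

```java
// Rough heap estimate per mapper/reducer for the numbers in this thread:
// 1000 clusters x 4 retained vectors x 200k terms x 8 bytes per double.
public class HeapEstimate {
    public static void main(String[] args) {
        long clusters = 1_000;
        long vectorsPerCluster = 4;    // retained per cluster, per the text
        long dictionaryTerms = 200_000;
        long bytesPerDouble = 8;
        long bytes = clusters * vectorsPerCluster * dictionaryTerms * bytesPerDouble;
        System.out.printf("~%.1f GB%n", bytes / 1e9); // ~6.4 GB
    }
}
```

That is roughly 6.4 GB of densified vectors in a single task, which on 12 GB nodes running several tasks each explains the OutOfMemoryError, and why cutting k to 100 made such a difference.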
>>
>>In order to provide more help, it would be useful to understand more
>>about how your cluster is "well tuned": how many mappers and reducers
>>you are launching in parallel, the heap-space limits set for tasks on
>>each node, etc.
>>
>>For a quick test, try reducing k to 500 or 100 to see how many
>>clusters you can reasonably compute with your dataset on your cluster.
>>Canopy is also a good way to get a feel for a good initial k, though it
>>can be hard to arrive at good T-values in some text clustering cases.
>>You can also try hierarchical clustering with reduced k to stay under
>>your memory limits.
>>
>>
>>On 8/8/12 5:40 AM, Paritosh Ranjan wrote:
>>> A stacktrace of the error would have helped in finding the exact problem.
>>>
>>> However, the number of clusters can create heap-space problems (if the
>>> vector dimension is also high).
>>> Either try to reduce the number of initial clusters (in my opinion,
>>> the best way to determine the initial clusters is Canopy Clustering:
>>> https://cwiki.apache.org/confluence/display/MAHOUT/Canopy+Clustering)
>>>
>>> or try to reduce the dimension of the vectors.
>>>
>>> PS: you are also passing --numClusters twice:
>>>
>>> --numClusters 1000 \ --numClusters 5 \
>>>
>>> On 08-08-2012 10:42, Abramov Pavel wrote:
>>>> Hello,
>>>>
>>>> I am trying to run KMeans example on 15 000 000 documents (seq2sparse
>>>> output).
>>>> There are 1 000 clusters, 200 000 terms dictionary and 3-10 terms
>>>> document size (titles). seq2sparse produces 200 files 80 MB each.
>>>>
>>>> My job failed with a "Java heap space" error. The 1st iteration passes,
>>>> while the 2nd iteration fails. On the map phase of buildClusters I see
>>>> a lot of warnings, but it passes. The reduce phase of buildClusters
>>>> fails with "Java heap space".
>>>>
>>>> I can not increase reducer/mapper memory in Hadoop. My cluster is
>>>> well tuned.
>>>>
>>>> How can I avoid this situation? My cluster has 300 mappers and 220
>>>> reducers running on 40 servers (8-core, 12 GB RAM each).
>>>>
>>>> Thanks in advance!
>>>>
>>>> Here are the KMeans parameters:
>>>>
>>>> ------------------------------------------------
>>>> mahout kmeans -Dmapred.reduce.tasks=200 \
>>>> -i ...tfidf-vectors/  \
>>>> -o /tmp/clustering_results_kmeans/ \
>>>> --clusters /tmp/clusters/ \
>>>> --numClusters 1000 \
>>>> --numClusters 5 \
>>>> --overwrite \
>>>> --clustering
>>>> ------------------------------------------------
>>>>
>>>> Pavel
>>>
>>>
>>>
>>>
>>
>



-- 
Lance Norskog
[email protected]
