Re: KMeans job fails during 2nd iteration. Java Heap space

Ted Dunning Thu, 09 Aug 2012 07:25:12 -0700

Also, the knn package has a single pass k-means implementation that
can easily handle 20,000 clusters or more.  This is done by using an
approximate nearest neighbor algorithm inside the k-means
implementation to decrease the time dependency on k to roughly log k.


See http://github.com/tdunning/knn

Any help in testing these new capabilities or plumbing them into the
standard Mahout capabilities would be very much appreciated.

On Thu, Aug 9, 2012 at 7:05 AM, Ted Dunning <[email protected]> wrote:
> The upcoming knn package has a file based matrix implementation that uses 
> memory mapping to allow sharing a copy of a large matrix between processes 
> and threads.
>
> Sent from my iPhone
>
> On Aug 9, 2012, at 1:48 AM, Abramov Pavel <[email protected]> wrote:
>
>> Hello,
>>
>> If think Zipf's law is relevant for my data. Thanks for idea about
>> memory-mapping.
>>
>> 1) How can I "drop" extremely small/large clusters? There are 50% small
>> clusters with only 1 member while 1 large cluster has 50% members. Is
>> there a way to "split" large clusters with Kmeans?
>>
>> 2) Can I force Mahout not to use exact centroid but the closest point from
>> my set? Any point has ~10 non-zero components while exact centroid is very
>> dense (~200k).
>>
>>
>> Thanks!
>>
>> Pavel
>>
>>
>> 09.08.12 5:43 пользователь "Lance Norskog" <[email protected]> написал:
>>
>>> If Zipf's Law is relevant, locality will be much better than random.
>>> Maybe you need a Vector implementation that is backed by memory-mapped
>>> files?
>>>
>>> On Wed, Aug 8, 2012 at 12:26 PM, Abramov Pavel <[email protected]>
>>> wrote:
>>>> Thank you Jeff, Paritosh,
>>>>
>>>> Reducing cluster count from 1000 to 100 made my day. I will also try
>>>> Canopy for initial cluster count.
>>>> Unfortunately I don't know how to reduce my 200k dictionary. There are
>>>> no
>>>> unfrequent terms.
>>>>
>>>> I will provide you with Hadoop config shortly. But I am pretty sure I
>>>> can't decrease number of Mappers/Reducers per node or increase memory
>>>> limits. It will affect the whole cluster.
>>>>
>>>>
>>>> Thank you!
>>>>
>>>> Pavel
>>>>
>>>>
>>>> 08.08.12 16:15 пользователь "Jeff Eastman" <[email protected]>
>>>> написал:
>>>>
>>>>> Consider that each cluster retains 4 vectors in memory in each mapper
>>>>> and reducer, and that these vectors tend to become more dense (through
>>>>> addition of multiple sparse components) as iterations proceed. With 1000
>>>>> clusters and 200k terms in your dictionary this will cause the heap
>>>>> space to be consumed rapidly as you have noted. Some times you can work
>>>>> around this problem by increasing your heap size on a per-job basis or
>>>>> reducing the number of mappers and reducers allowed on each node. Also
>>>>> be sure you are not launching reducers until all of your mapper tasks
>>>>> have completed.
>>>>>
>>>>> In order to provide more help to you, it would be useful to understand
>>>>> more about how your cluster is "well tuned". How many mappers & reducers
>>>>> are you launching in parallel, the heapspace limits set for tasks on
>>>>> each node, etc.
>>>>>
>>>>> For a quick test, try reducing the k to 500 or 100 to see how many
>>>>> clusters you can reasonably compute with your dataset on your cluster.
>>>>> Canopy is also a good way to get a feel for a good initial k, though it
>>>>> can be hard to arrive at good T-values in some text clustering cases.
>>>>> You, can also try hierarchical clustering with reduced k to stay under
>>>>> your memory limits.
>>>>>
>>>>>
>>>>> On 8/8/12 5:40 AM, Paritosh Ranjan wrote:
>>>>>> A stacktrace of error would have helped in finding the exact error.
>>>>>>
>>>>>> However, number of clusters can create Heap Space problems ( If the
>>>>>> vector dimension is also high ).
>>>>>> Either try to reduce the number of initial clusters ( In my opinion,
>>>>>> the best way to know about initial clusters is Canopy Clustering
>>>>>> https://cwiki.apache.org/confluence/display/MAHOUT/Canopy+Clustering )
>>>>>>
>>>>>> or, try to reduce the dimension of the vectors.
>>>>>>
>>>>>> PS : you are also providing numClusters twice
>>>>>>
>>>>>> --numClusters 1000 \ --numClusters 5 \
>>>>>>
>>>>>> On 08-08-2012 10:42, Abramov Pavel wrote:
>>>>>>> Hello,
>>>>>>>
>>>>>>> I am trying to run KMeans example on 15 000 000 documents (seq2sparse
>>>>>>> output).
>>>>>>> There are 1 000 clusters, 200 000 terms dictionary and 3-10 terms
>>>>>>> document size (titles). seq2sparse produces 200 files 80 MB each.
>>>>>>>
>>>>>>> My job failed with Java heap space Error. 1st iteration passes while
>>>>>>> 2nd iteration fails. On a Map phase of buildClusters I see a lot of
>>>>>>> warnings, but it passes. Reduce phase of buildClusters fails with
>>>>>>> "Java Heap space".
>>>>>>>
>>>>>>> I can not increase reducer/mapper memory in hadoop. My cluster is
>>>>>>> tunned well.
>>>>>>>
>>>>>>> How can I avoid this situation? My cluster has 300 Mappers and 220
>>>>>>> Reducers running 40 servers 8-core 12 GB RAM.
>>>>>>>
>>>>>>> Thanks in advance!
>>>>>>>
>>>>>>> Here is KMeans parameters:
>>>>>>>
>>>>>>> ------------------------------------------------
>>>>>>> mahout kmeans -Dmapred.reduce.tasks=200 \
>>>>>>> -i ...tfidf-vectors/  \
>>>>>>> -o /tmp/clustering_results_kmeans/ \
>>>>>>> --clusters /tmp/clusters/ \
>>>>>>> --numClusters 1000 \
>>>>>>> --numClusters 5 \
>>>>>>> --overwrite \
>>>>>>> --clustering
>>>>>>> ------------------------------------------------
>>>>>>>
>>>>>>> Pavel
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Lance Norskog
>>> [email protected]
>>

Re: KMeans job fails during 2nd iteration. Java Heap space

Reply via email to