I tried increasing the child heap size, but as I mentioned, even 4 GB
wasn't enough.
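
For reference, the setting I raised looks roughly like this in
mapred-site.xml (a sketch; -Xmx4096M was my last attempt):

```xml
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx4096M</value>
</property>
```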

I am also not sure whether the block size influences memory consumption,
but I assume it does not, since such a design would be really bad.

Any other ideas?
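
To double-check the back-of-the-envelope estimate from my earlier mail
(quoted below), here is the arithmetic as a quick sketch; it assumes raw
double[] storage with no JVM object overhead:

```python
# Rough memory estimate for the SoftCluster state described below:
# 500 clusters, each holding 4 dense vectors of 90,000 doubles (8 bytes each).
clusters = 500
vectors_per_cluster = 4
dimension = 90_000
bytes_per_double = 8

total_bytes = clusters * vectors_per_cluster * dimension * bytes_per_double
print(total_bytes / 2**30)  # ~1.34 GiB, far below the 4 GB the mapper consumes
```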


On 28.03.2013 17:40, Chris Harrington wrote:
> Don't know if this will help with your heap issues (or if you've already 
> tried it), but increasing mapred.child.java.opts in mapred-site.xml 
> resolved some heap issues I was having. I was clustering 67,000 small text 
> docs into ~180 clusters and was seeing mapper heap issues until I made this 
> change. 
>
>       <property>
>               <name>mapred.child.java.opts</name>
>               <value>-Xmx1024M</value>
>       </property>
>
> Someone please correct me if I'm wrong, but I think the mapper gets kicked off 
> as a child (i.e. in its own JVM), which is why increasing Hadoop's heap size 
> doesn't do anything, but increasing mapred.child.java.opts might help.
>
> Once again, correct me if I'm wrong, but the cause may be Hadoop's block 
> size of 64 MB, so even a small file takes up at least this amount of space, 
> or something like that; I couldn't quite wrap my head around some of the 
> stuff I read on the topic.
>
> On 28 Mar 2013, at 16:26, Sebastian Briesemeister wrote:
>
>> Dear all,
>>
>> I have a large dataset consisting of ~50,000 documents with a
>> dimensionality of 90,000. I split the input vectors into smaller files
>> in order to run a single mapper task on each file.
>> However, even with very small files containing only 50 documents, I run
>> into heap space problems.
>>
>> I tried to debug the problem and started the FuzzyKMeansDriver in local
>> mode in my IDE. Interestingly, it is already the first mapper task that
>> very quickly accumulates more than 4 GB.
>> In class CIMapper the method map(..) gets called by class Mapper for
>> each input vector of the input split file. Either Mapper or CIMapper is
>> responsible for the memory consumption, but I could not see where and
>> why it could accumulate memory since no additional data is saved during
>> the mapping process.
>> I thought maybe the SoftCluster objects require that much, but since
>> each of them contains 4 dense vectors of doubles (8 bytes each) of size
>> 90,000, and I have 500 clusters, they only sum up to 1.34 GB... so where
>> are the missing GBs?
>>
>> Does anyone have an explanation for this behaviour, or experience with
>> memory problems in large-scale clustering?
>>
>> Thanks in advance
>> Sebastian