I tried increasing the child heap size, but as I mentioned, even 4 GB wasn't enough.
I am also not sure whether the block size has any influence on memory use, but I assume it does not, since that would be a really bad design. Any other ideas?

On 28.03.2013 17:40, Chris Harrington wrote:
> Don't know if this will help with your heap issues (or if you've already
> tried it) but increasing the mapred.child.java.opts in the mapred-site.xml
> resolved some heap issues I was having. I was clustering 67000 small text
> docs into ~180 clusters and was seeing mapper heap issues until I made this
> change.
>
> <property>
>   <name>mapred.child.java.opts</name>
>   <value>-Xmx1024M</value>
> </property>
>
> Someone please correct me if I'm wrong but I think the mapper gets kicked off
> as a child (i.e. in its own JVM), which is why increasing Hadoop's heap size
> doesn't do anything but increasing the mapred.child.java.opts might help.
>
> Once again correct me if I'm wrong but the cause may be due to Hadoop's block
> size of 64 MB, so even a small file takes up more than this amount of space or
> something like that. I couldn't quite wrap my head around some of the stuff I
> read on the topic.
>
> On 28 Mar 2013, at 16:26, Sebastian Briesemeister wrote:
>
>> Dear all,
>>
>> I have a large dataset consisting of ~50,000 documents and a dimension
>> of 90,000. I split the created input vectors into smaller files to run
>> a single mapper task on each of the files.
>> However, even with very small files containing only 50 documents, I run
>> into heap space problems.
>>
>> I tried to debug the problem and started the FuzzyKMeansDriver in local
>> mode in my IDE. Interestingly, it is already the first mapper task that
>> very quickly accumulates more than 4 GB.
>> In class CIMapper the method map(..) gets called by class Mapper for
>> each input vector of the input split file. Either Mapper or CIMapper is
>> responsible for the memory consumption, but I could not see where and
>> why it could accumulate memory since no additional data is saved during
>> the mapping process.
>> I thought maybe the SoftCluster objects require that much, but since
>> each of them contains 4 dense vectors of double (8 bytes) of size 90,000
>> and I have 500 clusters, they only sum up to 1.34 GB... so where are the
>> missing GBs?
>>
>> Does anyone have an explanation for this behaviour or experience with
>> memory problems on large-scale clustering?
>>
>> Thanks in advance
>> Sebastian
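
For reference, the SoftCluster estimate from the original message (500 clusters, 4 dense vectors each, 90,000 doubles per vector) can be checked with a few lines of Java. This is only a sketch of the arithmetic: the class and method names (HeapEstimate, denseBytes, toGiB) are hypothetical and not part of Mahout, and the figure covers the raw double payload only, ignoring JVM object and array headers and any temporary copies. It comes to 1,440,000,000 bytes, roughly 1.34 GiB, so the estimate in the message holds for the cluster vectors themselves.

```java
// Hypothetical helper, not part of Mahout: estimates the raw payload of
// dense double vectors held by the clusters.
public class HeapEstimate {
    static final int BYTES_PER_DOUBLE = 8;

    // Raw bytes needed for clusters * vectorsPerCluster dense vectors,
    // each holding `dimension` doubles. Uses long to avoid int overflow.
    static long denseBytes(long clusters, long vectorsPerCluster, long dimension) {
        return clusters * vectorsPerCluster * dimension * BYTES_PER_DOUBLE;
    }

    // Converts a byte count to binary gigabytes (GiB, 1024^3 bytes).
    static double toGiB(long bytes) {
        return bytes / (1024.0 * 1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        // Numbers taken from the original message.
        long bytes = denseBytes(500, 4, 90_000);
        System.out.printf("%d bytes = %.2f GiB%n", bytes, toGiB(bytes));
    }
}
```

Since this is well below the 4 GB that was observed, the remaining memory would have to come from something other than the dense cluster vectors counted here.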
