Dear all,

I have a large dataset of ~50,000 documents with a dimensionality of
90,000. I split the input vectors into smaller files so that a single
mapper task runs on each file.
However, even with very small files containing only 50 documents, I run
into heap space problems.

I tried to debug the problem and started the FuzzyKMeansDriver in local
mode in my IDE. Interestingly, it is already the first mapper task that
very quickly accumulates more than 4 GB.
In the CIMapper class, the map(..) method is called by the Mapper class
for each input vector of the input split file. Either Mapper or CIMapper
is responsible for the memory consumption, but I cannot see where or why
memory would accumulate, since no additional data is saved during the
mapping process.
I thought maybe the SoftCluster objects require that much, but since
each of them contains 4 dense vectors of doubles (8 bytes each) of size
90,000, and I have 500 clusters, they only sum up to 1.34 GB... so where
are the missing GBs?

Does anyone have an explanation for this behaviour, or experience with
memory problems in large-scale clustering?

Thanks in advance
Sebastian
