...and it keeps on increasing.
Maybe there is a bug in some code that zips/unzips data.
109k instances of byte[], followed by 1 mil instances of Finalizer, with
~500k Deflaters, ~500k Inflaters and 1 mil ZStreamRefs.
I assume that's due to either binaryFiles or saveAsObjectFile.
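To illustrate what I mean (a standalone sketch, not code taken from Spark or Hadoop): a Deflater/Inflater whose stream is never closed keeps its Finalizer and ZStreamRef alive until the finalizer thread runs, which would match the counts above.

    import java.io.ByteArrayOutputStream
    import java.util.zip.DeflaterOutputStream

    val bos = new ByteArrayOutputStream()
    val out = new DeflaterOutputStream(bos)   // owns an internal Deflater
    out.write("some xml".getBytes("UTF-8"))
    out.close() // close() ends the internal Deflater; skipping it leaves a
                // Finalizer + ZStreamRef around until GC/finalization runs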
On 11/06/15 13:01, Konstantinos Kougios wrote:
Now I am profiling the executor.
There seems to be a memory leak.
20 mins after the run there were:
157k byte[] for 75MB
519k java.lang.ref.Finalizer for 31MB
291k java.util.zip.Inflater for 17MB
487k java.util.zip.ZStreamRef for 11MB
An hour after the run I got:
186k byte[] for 106MB
863k Finalizer for 52MB
475k Inflater for 29MB
354k Deflater for 24MB
829k ZStreamRef for 19MB
I don't see why those zip classes are leaking. I am not doing any
compression myself (I am reading plain text XML files, extracting a few
elements and reducing them), so I assume it must be the Hadoop streams,
maybe when I do rdd.saveAsObjectFile().
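For reference, the job is roughly the following (run from spark-shell, so sc exists; the paths and the extractKey/extractValue helpers are just placeholders for my real extraction logic):

    def extractKey(xml: String): String = xml.take(10)   // placeholder for real element extraction
    def extractValue(xml: String): Int = xml.length      // placeholder

    val files = sc.binaryFiles("hdfs:///xml/input")       // ~1 mil small xml files
    val pairs = files.map { case (name, stream) =>
      val xml = new String(stream.toArray, "UTF-8")       // plain text xml
      (extractKey(xml), extractValue(xml))
    }
    pairs.reduceByKey(_ + _).saveAsObjectFile("hdfs:///xml/output")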
I am using Hadoop 2.7.0 with Spark 1.3.1-hadoop.
Cheers
On 10/06/15 17:14, Marcelo Vanzin wrote:
So, I don't have an explicit solution to your problem, but...
On Wed, Jun 10, 2015 at 7:13 AM, Kostas Kougios
<kostas.koug...@googlemail.com> wrote:
I am profiling the driver. It currently has 564MB of strings, which might
be the 1 mil file names. But it also has 2.34 GB of long[]! That's so far;
it is still running. What are those long[] used for?
When Spark lists files it also needs all the extra metadata about
where the files are in the HDFS cluster. That is a lot more than just
the file's name - see the "LocatedFileStatus" class in the Hadoop
docs for an idea.
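For example, here is roughly what a located listing carries per file (a standalone sketch, not Spark's internal code; the path is a placeholder) - each entry holds per-block offsets and lengths (longs) plus the hosts storing each block:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    val fs = FileSystem.get(new Configuration())
    val it = fs.listFiles(new Path("hdfs:///some/dir"), true)  // recursive; returns LocatedFileStatus entries
    while (it.hasNext) {
      val status = it.next()
      for (block <- status.getBlockLocations) {
        println(s"${status.getPath} offset=${block.getOffset} " +
                s"len=${block.getLength} hosts=${block.getHosts.mkString(",")}")
      }
    }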
What you could try is to somehow break that input down into smaller
batches, if that's feasible for your app. E.g. organize the files by
directory and use separate directories in different calls to
"binaryFiles()", things like that.
--
Marcelo