Now I am profiling the executor.

There seems to be a memory leak.

20 minutes into the run there were:

157k byte[] for 75MB
519k java.lang.ref.Finalizer for 31MB
291k java.util.zip.Inflater for 17MB
487k java.util.zip.ZStreamRef for 11MB

An hour into the run I got:

186k byte[] for 106MB
863k java.lang.ref.Finalizer for 52MB
475k java.util.zip.Inflater for 29MB
354k java.util.zip.Deflater for 24MB
829k java.util.zip.ZStreamRef for 19MB

I don't see why those zip classes are leaking. I am not doing any compression myself (I am reading plain-text XML files, extracting a few elements and reducing them), so I assume it must be the Hadoop streams, perhaps when I call rdd.saveAsObjectFile().
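
For reference, a minimal sketch of the kind of job I am running; the paths, element names and extraction logic here are made up, only the shape (read plain-text XML, extract a few fields, reduce, saveAsObjectFile) matches what I am actually doing:

    import org.apache.spark.{SparkConf, SparkContext}

    object XmlReduceJob {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("xml-reduce"))

        // read the plain-text xml files (path is made up)
        val xml = sc.wholeTextFiles("hdfs:///data/xml/*")

        // extract a couple of elements per file (placeholder extraction)
        val extracted = xml.map { case (_, content) =>
          val id    = "<id>(.*?)</id>".r.findFirstMatchIn(content).map(_.group(1)).getOrElse("")
          val value = "<value>(.*?)</value>".r.findFirstMatchIn(content).map(_.group(1)).getOrElse("0")
          (id, value.toLong)
        }

        // reduce and persist; saveAsObjectFile serializes into Hadoop SequenceFiles,
        // which is the step I suspect for the Inflater/Deflater instances
        extracted.reduceByKey(_ + _).saveAsObjectFile("hdfs:///out/reduced")

        sc.stop()
      }
    }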


I am using Hadoop 2.7.0 with Spark 1.3.1-hadoop.

Cheers

On 10/06/15 17:14, Marcelo Vanzin wrote:
So, I don't have an explicit solution to your problem, but...

On Wed, Jun 10, 2015 at 7:13 AM, Kostas Kougios <kostas.koug...@googlemail.com> wrote:

    I am profiling the driver. It currently has 564MB of strings, which
    might be the 1 million file names. But it also has 2.34 GB of long[]!
    That's so far; it is still running. What are those long[] used for?


When Spark lists files it also needs all the extra metadata about where the files are in the HDFS cluster. That is a lot more than just the file's name - see the "LocatedFileStatus" class in the Hadoop docs for an idea.
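
For an idea of what that metadata looks like, something like the following (the path is hypothetical) shows what Hadoop hands back per file; the per-block host/offset/length arrays are one plausible source of the String and long[] volume:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    val fs = FileSystem.get(new Configuration())
    // recursive listing returns one LocatedFileStatus per file
    val it = fs.listFiles(new Path("hdfs:///data/xml"), true)
    while (it.hasNext) {
      val status = it.next()                // LocatedFileStatus
      val path   = status.getPath           // file name
      val len    = status.getLen            // size in bytes
      val blocks = status.getBlockLocations // one BlockLocation per HDFS block,
                                            // each with host/name/offset/length arrays
      println(s"$path $len ${blocks.length} blocks")
    }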

What you could try is to break that input down into smaller batches, if that's feasible for your app: e.g. organize the files by directory and pass different directories to separate calls to "binaryFiles()", things like that.
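
A rough sketch of that idea, assuming the files can be grouped into subdirectories (the directory names and the per-file work are placeholders):

    // process one directory per call to binaryFiles() so the driver only
    // holds listing metadata for the current batch
    val dirs = Seq(
      "hdfs:///data/xml/batch-001",
      "hdfs:///data/xml/batch-002",
      "hdfs:///data/xml/batch-003")

    dirs.foreach { dir =>
      sc.binaryFiles(dir)
        .map { case (path, stream) => (path, stream.toArray().length) } // placeholder work
        .saveAsObjectFile(dir.replace("/data/", "/out/"))
    }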

--
Marcelo
