Now I am profiling the executor.

There seems to be a memory leak.

20 minutes into the run there were:

157k byte[] for 75MB
519k java.lang.ref.Finalizer for 31MB
291k java.util.zip.Inflater for 17MB
487k java.util.zip.ZStreamRef for 11MB

An hour into the run I got:

186k byte[] for 106MB
863k java.lang.ref.Finalizer for 52MB
475k java.util.zip.Inflater for 29MB
354k java.util.zip.Deflater for 24MB
829k java.util.zip.ZStreamRef for 19MB

I don't see why those zip classes are leaking. I am not doing any compression myself (I am reading plain-text XML files, extracting a few elements and reducing them), so I assume it must be the Hadoop streams, perhaps when I call rdd.saveAsObjectFile().
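
For reference, a minimal sketch of the kind of job I am running; the paths, element names and extraction logic here are made up, only the shape (read plain-text XML, extract a few fields, reduce, saveAsObjectFile) matches what I am actually doing:

    import org.apache.spark.{SparkConf, SparkContext}

    object XmlReduceJob {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("xml-reduce"))

        // read the plain-text xml files (path is made up)
        val xml = sc.wholeTextFiles("hdfs:///data/xml/*")

        // extract a couple of elements per file (placeholder extraction)
        val extracted = xml.map { case (_, content) =>
          val id    = "<id>(.*?)</id>".r.findFirstMatchIn(content).map(_.group(1)).getOrElse("")
          val value = "<value>(.*?)</value>".r.findFirstMatchIn(content).map(_.group(1)).getOrElse("0")
          (id, value.toLong)
        }

        // reduce and persist; saveAsObjectFile serializes into Hadoop SequenceFiles,
        // which is the step I suspect for the Inflater/Deflater instances
        extracted.reduceByKey(_ + _).saveAsObjectFile("hdfs:///out/reduced")

        sc.stop()
      }
    }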


I am using Hadoop 2.7.0 with Spark 1.3.1-hadoop.

Cheers

On 10/06/15 17:14, Marcelo Vanzin wrote:
So, I don't have an explicit solution to your problem, but...

On Wed, Jun 10, 2015 at 7:13 AM, Kostas Kougios <kostas.koug...@googlemail.com> wrote:

    I am profiling the driver. It currently has 564MB of strings, which
    might be the 1 million file names. But it also has 2.34 GB of long[]!
    That's so far; it is still running. What are those long[] used for?


When Spark lists files it also needs all the extra metadata about where the files are in the HDFS cluster. That is a lot more than just the file's name - see the "LocatedFileStatus" class in the Hadoop docs for an idea.
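
For an idea of what that metadata looks like, something like the following (the path is hypothetical) shows what Hadoop hands back per file; the per-block host/offset/length arrays are one plausible source of the String and long[] volume:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    val fs = FileSystem.get(new Configuration())
    // recursive listing returns one LocatedFileStatus per file
    val it = fs.listFiles(new Path("hdfs:///data/xml"), true)
    while (it.hasNext) {
      val status = it.next()                // LocatedFileStatus
      val path   = status.getPath           // file name
      val len    = status.getLen            // size in bytes
      val blocks = status.getBlockLocations // one BlockLocation per HDFS block,
                                            // each with host/name/offset/length arrays
      println(s"$path $len ${blocks.length} blocks")
    }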

What you could try is to break that input down into smaller batches, if that's feasible for your app: e.g. organize the files by directory and pass different directories to separate calls to "binaryFiles()", things like that.
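
A rough sketch of that idea, assuming the files can be grouped into subdirectories (the directory names and the per-file work are placeholders):

    // process one directory per call to binaryFiles() so the driver only
    // holds listing metadata for the current batch
    val dirs = Seq(
      "hdfs:///data/xml/batch-001",
      "hdfs:///data/xml/batch-002",
      "hdfs:///data/xml/batch-003")

    dirs.foreach { dir =>
      sc.binaryFiles(dir)
        .map { case (path, stream) => (path, stream.toArray().length) } // placeholder work
        .saveAsObjectFile(dir.replace("/data/", "/out/"))
    }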

--
Marcelo
