...and it keeps on increasing.
Maybe there is a bug in some code that zips/unzips data.
109k instances of byte[], followed by 1 mil instances of Finalizer, with
~500k Deflaters, ~500k Inflaters and 1 mil ZStreamRefs.
I assume that's due to either binaryFiles or saveAsObjectFile.
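To illustrate what I mean (a standalone sketch, not code taken from Spark or Hadoop): a Deflater/Inflater whose stream is never closed keeps its Finalizer and ZStreamRef alive until the finalizer thread runs, which would match the counts above.

    import java.io.ByteArrayOutputStream
    import java.util.zip.DeflaterOutputStream

    val bos = new ByteArrayOutputStream()
    val out = new DeflaterOutputStream(bos)   // owns an internal Deflater
    out.write("some xml".getBytes("UTF-8"))
    out.close() // close() ends the internal Deflater; skipping it leaves a
                // Finalizer + ZStreamRef around until GC/finalization runs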
On 11/06/15 13:01, Konstantinos Kougios wrote:
Now I am profiling the executor.
There seems to be a memory leak.
20 mins after the run there were:
157k byte[] for 75MB
519k java.lang.ref.Finalizer for 31MB
291k java.util.zip.Inflater for 17MB
487k java.util.zip.ZStreamRef for 11MB
An hour after the run I got:
186k byte[] for 106MB
863k Finalizer for 52MB
475k Inflater for 29MB
354k Deflater for 24MB
829k ZStreamRef for 19MB
I don't see why those zip classes are leaking. I am not doing any
compression myself (I am reading plain text XML files, extracting a few
elements and reducing them), so I assume it must be the Hadoop streams,
maybe when I do rdd.saveAsObjectFile().
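For reference, the job is roughly the following (run from spark-shell, so sc exists; the paths and the extractKey/extractValue helpers are just placeholders for my real extraction logic):

    def extractKey(xml: String): String = xml.take(10)   // placeholder for real element extraction
    def extractValue(xml: String): Int = xml.length      // placeholder

    val files = sc.binaryFiles("hdfs:///xml/input")       // ~1 mil small xml files
    val pairs = files.map { case (name, stream) =>
      val xml = new String(stream.toArray, "UTF-8")       // plain text xml
      (extractKey(xml), extractValue(xml))
    }
    pairs.reduceByKey(_ + _).saveAsObjectFile("hdfs:///xml/output")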
I am using Hadoop 2.7.0 with Spark 1.3.1-hadoop.
Cheers
On 10/06/15 17:14, Marcelo Vanzin wrote:
So, I don't have an explicit solution to your problem, but...
On Wed, Jun 10, 2015 at 7:13 AM, Kostas Kougios
<kostas.koug...@googlemail.com> wrote:
I am profiling the driver. It currently has 564MB of strings, which might
be the 1 mil file names. But it also has 2.34 GB of long[]! That's so far;
it is still running. What are those long[] used for?
When Spark lists files it also needs all the extra metadata about
where the files are in the HDFS cluster. That is a lot more than just
the file's name - see the "LocatedFileStatus" class in the Hadoop
docs for an idea.
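For example, here is roughly what a located listing carries per file (a standalone sketch, not Spark's internal code; the path is a placeholder) - each entry holds per-block offsets and lengths (longs) plus the hosts storing each block:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    val fs = FileSystem.get(new Configuration())
    val it = fs.listFiles(new Path("hdfs:///some/dir"), true)  // recursive; returns LocatedFileStatus entries
    while (it.hasNext) {
      val status = it.next()
      for (block <- status.getBlockLocations) {
        println(s"${status.getPath} offset=${block.getOffset} " +
                s"len=${block.getLength} hosts=${block.getHosts.mkString(",")}")
      }
    }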
What you could try is to somehow break that input down into smaller
batches, if that's feasible for your app. E.g. organize the files by
directory and use separate directories in different calls to
"binaryFiles()", things like that.
--
Marcelo