Re: spark uses too much memory maybe (binaryFiles() with more than 1 million files in HDFS), groupBy or reduceByKey()

2015-06-11 Thread Konstantinos Kougios
Hi Marcelo, the extracted data end up in, say, class C, where c.id is the id of each record. But an id might appear more than once across those 1 mil XML files, so I am doing a reduceByKey(). Even if I had multiple binaryFiles RDDs, wouldn't I have to ++ (union) them in order to correctly merge the records that share an id?
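A minimal sketch of that union-then-reduce approach, assuming a small domain class C with an id field and a hypothetical extractFromXml parser (names not taken from the actual code), spark-shell style:

    import org.apache.spark.{SparkConf, SparkContext}

    // Hypothetical domain class and parser, standing in for the real ones.
    case class C(id: String, payload: Seq[String])
    def extractFromXml(bytes: Array[Byte]): Seq[C] = Seq.empty

    val sc = new SparkContext(new SparkConf().setAppName("merge-and-reduce"))

    // Several smaller binaryFiles() RDDs instead of one huge one...
    val parts = Seq("hdfs:///data/batch1", "hdfs:///data/batch2").map { dir =>
      sc.binaryFiles(dir).flatMap { case (_, stream) => extractFromXml(stream.toArray()) }
    }

    // ...still have to be unioned (++) before reduceByKey, otherwise records
    // that share the same id across batches are never merged together.
    val merged = parts
      .reduce(_ ++ _)
      .map(c => c.id -> c)
      .reduceByKey((a, b) => C(a.id, a.payload ++ b.payload))

    merged.saveAsObjectFile("hdfs:///data/out")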

Re: spark uses too much memory maybe (binaryFiles() with more than 1 million files in HDFS), groupBy or reduceByKey()

2015-06-11 Thread Konstantinos Kougios
Now I am profiling the executor. There seems to be a memory leak. 20 minutes into the run there were 157k byte[] instances for 75 MB, 519k java.lang.ref.Finalizer for 31 MB, 291k java.util.zip.Inflater for 17 MB, and 487k java.util.zip.ZStreamRef for 11 MB. An hour into the run I got 186k byte[]

Re: spark uses too much memory maybe (binaryFiles() with more than 1 million files in HDFS), groupBy or reduceByKey()

2015-06-11 Thread Konstantinos Kougios
...and it keeps on increasing. Maybe there is a bug in some code that zips/unzips data: 109k instances of byte[], followed by 1 mil instances of Finalizer, ~500k Deflaters, ~500k Inflaters, and 1 mil ZStreamRef. I assume that's due to either binaryFiles or saveAsObjectFile.
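The objects listed above (Finalizer, Inflater, Deflater, ZStreamRef) are the typical footprint of java.util.zip codecs whose native state is only reclaimed through finalization. A standalone sketch, not Spark code, of that pattern and of the end() calls that keep such objects off the finalizer queue:

    import java.util.zip.{Deflater, Inflater}

    // Each Deflater/Inflater holds native zlib state tracked via a ZStreamRef and a
    // Finalizer; if end() is never called, those objects linger until finalization runs.
    def compressRoundTrip(data: Array[Byte]): Array[Byte] = {
      val deflater = new Deflater()
      val inflater = new Inflater()
      try {
        deflater.setInput(data)
        deflater.finish()
        val compressed = new Array[Byte](data.length * 2 + 64)
        val clen = deflater.deflate(compressed)   // one call is enough for this small sketch

        inflater.setInput(compressed, 0, clen)
        val out = new Array[Byte](data.length)
        inflater.inflate(out)
        out
      } finally {
        // Release native buffers eagerly instead of waiting for the finalizer queue.
        deflater.end()
        inflater.end()
      }
    }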

Re: spark uses too much memory maybe (binaryFiles() with more than 1 million files in HDFS), groupBy or reduceByKey()

2015-06-11 Thread Konstantinos Kougios
After 2 h of running there are now 10 GB of long[] (1.3 mil instances), so probably information about the files again.

spark uses too much memory maybe (binaryFiles() with more than 1 million files in HDFS), groupBy or reduceByKey()

2015-06-10 Thread Kostas Kougios
Both the driver (ApplicationMaster running on Hadoop) and the container (CoarseGrainedExecutorBackend) end up exceeding my 25 GB allocation. My code is something like sc.binaryFiles(... 1 mil xml files).flatMap( ... extract some domain classes, not many though, as each XML usually has zero or only a few).
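For reference, a rough sketch of that kind of pipeline, reusing the hypothetical case class C and SparkContext from the sketch further up this digest, and contrasting groupByKey with reduceByKey (reduceByKey combines map-side, so it usually keeps far less per-key data in memory); the paths, minPartitions value, parser and merge function are assumptions:

    // Placeholders for the real parser and merge logic.
    def extractDomainClasses(bytes: Array[Byte]): Seq[C] = Seq.empty
    def merge(a: C, b: C): C = C(a.id, a.payload ++ b.payload)

    val records = sc.binaryFiles("hdfs:///xml-input", minPartitions = 1000)
      .flatMap { case (_, stream) => extractDomainClasses(stream.toArray()) }
      .map(c => c.id -> c)

    // groupByKey materialises every value of a key in one place:
    // val grouped = records.groupByKey().mapValues(_.reduce(merge))

    // reduceByKey merges incrementally before and after the shuffle:
    val reduced = records.reduceByKey(merge)
    reduced.saveAsObjectFile("hdfs:///xml-output")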

Re: spark uses too much memory maybe (binaryFiles() with more than 1 million files in HDFS), groupBy or reduceByKey()

2015-06-10 Thread Kostas Kougios
I am profiling the driver. It currently has 564 MB of Strings, which might be the 1 mil file names. But it also has 2.34 GB of long[]! That's so far; it is still running. What are those long[] used for?

Re: spark uses too much memory maybe (binaryFiles() with more than 1 million files in HDFS), groupBy or reduceByKey()

2015-06-10 Thread Marcelo Vanzin
So, I don't have an explicit solution to your problem, but... On Wed, Jun 10, 2015 at 7:13 AM, Kostas Kougios kostas.koug...@googlemail.com wrote: I am profiling the driver. It currently has 564 MB of Strings, which might be the 1 mil file names. But it also has 2.34 GB of long[]! That's so

Re: spark uses too much memory maybe (binaryFiles() with more than 1 million files in HDFS), groupBy or reduceByKey()

2015-06-10 Thread Kostas Kougios
After some time the driver accumulated 6.67 GB of long[]. The executor memory usage so far is low.
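One possible workaround, sketched with the same hypothetical helpers as the sketches above and not something spelled out in the thread: process the paths in smaller batches so the driver never holds split metadata (the suspected long[]) for all 1 mil files at once. The directory layout and batch count are pure assumptions:

    // Hypothetical batching: assumes the 1 mil XML files can be addressed in
    // smaller groups, e.g. by subdirectory. Each batch is a separate, smaller
    // binaryFiles() RDD, so the driver only tracks one batch's splits at a time.
    val batches: Seq[String] = (0 until 100).map(i => s"hdfs:///xml-input/part-$i")

    batches.zipWithIndex.foreach { case (dir, i) =>
      sc.binaryFiles(dir)
        .flatMap { case (_, stream) => extractDomainClasses(stream.toArray()) }
        .map(c => c.id -> c)
        .reduceByKey(merge)
        .saveAsObjectFile(s"hdfs:///xml-output/batch-$i")
    }
    // Caveat: ids that span batches are only merged within a batch here; a final
    // pass over the saved objectFiles would be needed to merge across batches,
    // which is exactly the ++/reduceByKey concern raised earlier in the thread.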