Hi Marchelo,
The data are collected in, say, class C; c.id is the id of each record.
But that id might appear more than once across those 1mil xml files, so I
am doing a reduceByKey(). Even if I had multiple binaryFiles RDDs,
wouldn't I have to ++ (union) those in order to correctly reduceByKey
across all of them?
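For what it's worth, the union-then-reduce-by-key semantics I mean can be sketched with plain Java collections (no Spark); the class C here is hypothetical, with an assumed id field and an assumed merge rule that combines two records sharing an id:

```java
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ReduceByKeySketch {
    // Hypothetical stand-in for the thread's class C: an id plus some payload.
    record C(String id, int count) {}

    public static void main(String[] args) {
        // Two "RDDs" worth of extracted records; id "a" appears in both.
        List<C> batch1 = List.of(new C("a", 1), new C("b", 2));
        List<C> batch2 = List.of(new C("a", 3));

        // ++ (union) the batches, then reduce by key:
        // records with the same id are merged by an assumed combine function.
        BinaryOperator<C> merge = (x, y) -> new C(x.id(), x.count() + y.count());
        Map<String, C> reduced = Stream.concat(batch1.stream(), batch2.stream())
                .collect(Collectors.toMap(C::id, c -> c, merge));

        System.out.println(reduced.get("a").count()); // prints 4
        System.out.println(reduced.size());           // prints 2
    }
}
```

The point being: the union has to happen before the reduce, otherwise an id split across two RDDs never gets merged.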
Now I am profiling the executor.
There seems to be a memory leak.
20 mins into the run there were:
157k byte[] allocated, for 75MB
519k java.lang.ref.Finalizer, for 31MB
291k java.util.zip.Inflater, for 17MB
487k java.util.zip.ZStreamRef, for 11MB
An hour into the run I got:
186k byte[]
...and it keeps on increasing.
Maybe there is a bug in some code that zips/unzips data. I see 109k
instances of byte[] followed by 1mil instances of Finalizer, with ~500k
Deflaters, ~500k Inflaters and 1mil ZStreamRefs.
I assume that's due to either binaryFiles or saveAsObjectFile.
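Those counts (one Finalizer plus one ZStreamRef per Deflater/Inflater) look consistent with zip objects that are never end()ed, so they sit on the finalizer queue until GC gets to them; on the JDKs of that era Deflater/Inflater held native buffers released by finalize(). This is my guess at the mechanism, not a confirmed Spark bug; a minimal sketch of the eager-release pattern:

```java
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class ZipEndSketch {
    // Round-trips bytes through Deflater/Inflater, releasing the native
    // buffers eagerly with end() instead of waiting for finalization.
    static byte[] roundTrip(byte[] input) throws Exception {
        Deflater deflater = new Deflater();
        Inflater inflater = new Inflater();
        try {
            deflater.setInput(input);
            deflater.finish();
            byte[] buf = new byte[input.length * 2 + 64];
            int clen = deflater.deflate(buf);

            inflater.setInput(buf, 0, clen);
            byte[] out = new byte[input.length];
            int n = inflater.inflate(out);
            if (n != input.length) throw new IllegalStateException("short inflate");
            return out;
        } finally {
            // Without these, each instance parks a Finalizer (and, on JDK 7/8,
            // a ZStreamRef) until GC runs: matching the profile above.
            deflater.end();
            inflater.end();
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] data = "hello hello hello".getBytes("UTF-8");
        System.out.println(new String(roundTrip(data), "UTF-8"));
    }
}
```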
On 11/06/15
After 2h of running, I now have 10GB of long[]: 1.3mil instances of long[].
So probably information about the files again.
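Some back-of-envelope arithmetic on those numbers (my estimate only, nothing confirmed against Spark internals):

```java
public class LongArrayEstimate {
    public static void main(String[] args) {
        long totalBytes = 10L * 1024 * 1024 * 1024; // ~10GB of long[] reported
        long instances = 1_300_000L;                // ~1.3mil long[] instances
        long bytesPerArray = totalBytes / instances;
        long longsPerArray = bytesPerArray / 8;     // 8 bytes per long, ignoring object headers
        System.out.println(bytesPerArray);          // prints 8259
        System.out.println(longsPerArray);          // prints 1032
    }
}
```

That is roughly 8KB, i.e. on the order of a thousand longs per array, and with ~1mil files that would fit per-file metadata (offsets/lengths or similar) being retained per input file.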
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail:
Both the driver (the ApplicationMaster running on Hadoop/YARN) and the
container (CoarseGrainedExecutorBackend) end up exceeding my 25GB allocation.
My code is something like:

sc.binaryFiles(... 1mil xml files).flatMap( ... extract some domain
classes, not many though, as each xml usually has zero
I am profiling the driver. It currently has 564MB of Strings, which might be
the 1mil file names. But it also has 2.34GB of long[]! That's so far; it
is still running. What are those long[] used for?
So, I don't have an explicit solution to your problem, but...
On Wed, Jun 10, 2015 at 7:13 AM, Kostas Kougios
kostas.koug...@googlemail.com wrote:
I am profiling the driver. It currently has 564MB of strings which might be
the 1mil file names. But also it has 2.34 GB of long[] ! That's so
After some time the driver accumulated 6.67GB of long[]. The executor mem
usage so far is low.