A lot of your time is being spent in garbage collection (second image).
 Maybe your dataset doesn't easily fit into memory?  Can you reduce the
number of new objects created in myFun?
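
If myFun allocates temporary objects for every element, one option is to
rewrite it over mapPartitions so a buffer can be reused across a whole
partition. A rough sketch only, since I don't know what myFun actually does
(the StringBuilder and the per-vertex work here are stand-ins):

    // Illustrative: reuse one builder per partition instead of creating
    // fresh intermediate objects for every element.
    val newRdd = oldRdd.mapPartitions { iter =>
      val buf = new StringBuilder          // reused across the partition
      iter.map { v =>
        buf.clear()
        buf.append(v.toString)             // stand-in for the real work
        buf.toString
      }
    }.persist(myStorageLevel)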

How big are your heap sizes?
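
If you haven't set the executor memory explicitly, the workers may be running
with a far smaller heap than the 120 GB the machine has. Something along
these lines (the property names vary a bit between Spark versions and the
20g value is arbitrary, so treat it as a sketch):

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch: larger executor heap plus GC logging, so you can see
    // whether full collections dominate the time.
    val conf = new SparkConf()
      .setAppName("iterative-graph")
      .set("spark.executor.memory", "20g")
      .set("spark.executor.extraJavaOptions",
           "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")
    val sc = new SparkContext(conf)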

Another observation is that in the 4th image some of your RDDs are massive
and some are tiny.
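
If that skew comes from how the graph was initially partitioned rather than
from the data itself, repartitioning before the iterative phase may even out
the task sizes. Again just a sketch (90 is only ~3x your 30 cores, not a
recommendation):

    // Sketch: rebalance, then print per-partition element counts to
    // check that they are roughly even.
    val balanced = oldRdd.repartition(90)
    val sizes = balanced.mapPartitions(it => Iterator(it.size)).collect()
    println(sizes.mkString(", "))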


On Mon, Apr 14, 2014 at 11:45 AM, Andrea Esposito <and1...@gmail.com> wrote:

> Hi all,
>
> I'm developing an iterative computation over graphs, but I'm struggling
> with some embarrassingly low performance.
>
> The computation is heavily iterative and I'm following this RDD usage
> pattern:
>
> val newRdd = oldRdd.map(myFun).persist(myStorageLevel)
> newRdd.foreach(x => {})   // force evaluation before dropping oldRdd
> oldRdd.unpersist(true)    // release the previous iteration's blocks
>
> I'm using a machine equipped with 30 cores and 120 GB of RAM.
> As an example, I've run it on a small graph with 4,000 vertices and
> 80,000 edges; the first iterations already take 10+ minutes, and later
> ones take much longer.
> I attach the Spark UI screenshots of just the first 2 iterations.
>
> I tried MEMORY_ONLY_SER and MEMORY_AND_DISK_SER, and I also changed
> "spark.shuffle.memoryFraction" to 0.3, but nothing changed (with so much
> RAM for 4,000 vertices I guess these settings are fairly pointless anyway).
>
> How should I continue investigating?
>
> Any advice is very welcome, thanks.
>
> Best,
> EA
>
