Update: checkpointing doesn't seem to take effect. I checked via the "isCheckpointed" method, but it always returns false. ???
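For reference, a minimal sketch of the ordering under which isCheckpointed does flip to true, with sc, oldRdd, myFun and myStorageLevel as used elsewhere in this thread and a placeholder directory path. The flag only becomes true after an action has run on the RDD with the checkpoint mark already in place; checking it immediately after calling checkpoint(), or before any job has executed, will still report false:

    sc.setCheckpointDir("/tmp/spark-checkpoints") // placeholder path; must be set before checkpoint()

    val newRdd = oldRdd.map(myFun).persist(myStorageLevel)
    newRdd.checkpoint()             // mark BEFORE the first action on newRdd
    newRdd.foreach(x => {})         // this job materializes the RDD and writes the checkpoint
    println(newRdd.isCheckpointed)  // true only after that action completes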
2014-05-05 23:14 GMT+02:00 Andrea Esposito <and1...@gmail.com>:

> Checkpointing doesn't help, it seems. I do it at each iteration/superstep.
>
> Looking more deeply, the RDDs are recomputed just a few times in the
> initial phase; after that they aren't recomputed anymore. I attach
> screenshots: bootstrap phase, recompute section and after. This is still
> unexpected because I persist all the intermediate results.
>
> Anyway, the time of each iteration degrades steadily: the first superstep
> takes 3 sec and by superstep 70 it takes 8 sec.
>
> An iteration, looking at the screenshot, runs from row 528 down to row 122.
>
> Any idea where to investigate?
>
> 2014-05-02 22:28 GMT+02:00 Andrew Ash <and...@andrewash.com>:
>
>> If you end up with a really long dependency tree between RDDs (like
>> 100+), people have reported success with the .checkpoint() method. This
>> computes the RDD and then saves it, flattening the dependency tree. It
>> turns out that having a really long RDD dependency graph causes the
>> serialization size of tasks to go up, plus any failure causes a long
>> sequence of operations to regenerate the missing partitions.
>>
>> Maybe give that a shot and see if it helps?
>>
>> On Fri, May 2, 2014 at 3:29 AM, Andrea Esposito <and1...@gmail.com> wrote:
>>
>>> Sorry for the very late answer.
>>>
>>> I carefully followed what you pointed out and figured out that the
>>> structure used for each record was too big, with many small objects.
>>> Changing it made the memory usage decrease drastically.
>>>
>>> Despite that, I'm still struggling with performance that degrades
>>> across supersteps. The memory footprint is now much smaller than before
>>> and GC time is no longer noticeable.
>>> I suspected that some RDDs were being recomputed, and watching the
>>> stages carefully there is evidence of that, but I don't understand why
>>> it's happening.
>>>
>>> Recalling my usage pattern:
>>>
>>>> newRdd = oldRdd.map(myFun).persist(myStorageLevel)
>>>> newRdd.foreach(x => {}) // Force evaluation
>>>> oldRdd.unpersist(true)
>>>
>>> Besides this pattern, I also tried not unpersisting the intermediate
>>> RDDs (i.e. oldRdd), but nothing changed.
>>>
>>> Any hints? How could I debug this?
>>>
>>> 2014-04-14 12:55 GMT+02:00 Andrew Ash <and...@andrewash.com>:
>>>
>>>> A lot of your time is being spent in garbage collection (second
>>>> image). Maybe your dataset doesn't easily fit into memory? Can you
>>>> reduce the number of new objects created in myFun?
>>>>
>>>> How big are your heap sizes?
>>>>
>>>> Another observation is that in the 4th image some of your RDDs are
>>>> massive and some are tiny.
>>>>
>>>> On Mon, Apr 14, 2014 at 11:45 AM, Andrea Esposito <and1...@gmail.com> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I'm developing an iterative computation over graphs, but I'm
>>>>> struggling with embarrassingly low performance.
>>>>>
>>>>> The computation is heavily iterative and I'm following this RDD
>>>>> usage pattern:
>>>>>
>>>>>> newRdd = oldRdd.map(myFun).persist(myStorageLevel)
>>>>>> newRdd.foreach(x => {}) // Force evaluation
>>>>>> oldRdd.unpersist(true)
>>>>>
>>>>> I'm using a machine equipped with 30 cores and 120 GB of RAM.
>>>>> As an example, I ran it on a small graph of 4,000 vertices and
>>>>> 80,000 edges; the first iterations take 10+ minutes each and later
>>>>> ones take even longer.
>>>>> I attach the Spark UI screenshots of just the first 2 iterations.
>>>>>
>>>>> I tried MEMORY_ONLY_SER and MEMORY_AND_DISK_SER, and I also changed
>>>>> "spark.shuffle.memoryFraction" to 0.3, but nothing changed (with so
>>>>> much RAM for 4E3 vertices these settings are quite pointless, I guess).
>>>>>
>>>>> How should I continue to investigate?
>>>>>
>>>>> Any advice is very welcome, thanks.
>>>>>
>>>>> Best,
>>>>> EA
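For anyone following the thread, here is a minimal Scala sketch of the loop pattern above with the periodic checkpointing Andrew suggested, to keep the lineage from growing by one map per superstep. It is an illustration, not the original poster's code: sc is assumed to be a live SparkContext, and initialRdd, myFun, numSupersteps, the checkpoint directory and the 10-superstep interval are all placeholders.

    import org.apache.spark.storage.StorageLevel

    sc.setCheckpointDir("/tmp/spark-checkpoints") // placeholder path, required before checkpoint()
    val myStorageLevel = StorageLevel.MEMORY_ONLY_SER
    val checkpointInterval = 10 // assumed interval; tune for your job

    var oldRdd = initialRdd.persist(myStorageLevel)
    for (superstep <- 1 to numSupersteps) {
      val newRdd = oldRdd.map(myFun).persist(myStorageLevel)
      if (superstep % checkpointInterval == 0)
        newRdd.checkpoint()   // mark BEFORE the action; truncates the lineage here
      newRdd.foreach(x => {}) // force evaluation (also writes the checkpoint if marked)
      oldRdd.unpersist(blocking = true)
      oldRdd = newRdd
    }

With the interval in place, the dependency chain never grows past checkpointInterval maps, so task serialization size stays bounded and a lost partition only needs to be regenerated from the last checkpoint rather than from the very first superstep.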