Update: Checkpointing doesn't actually happen. I checked with the "isCheckpointed"
method but it always returns false. ???
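
For reference, here is a minimal sketch of the setup I understand checkpointing
needs (the directory path is just a placeholder): a checkpoint directory has to
be set first, checkpoint() has to be called before the first action on that RDD,
and isCheckpointed only becomes true once an action has actually materialized it.

    sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints") // required before any checkpoint()
    val newRdd = oldRdd.map(myFun).persist(myStorageLevel)
    newRdd.checkpoint()            // must be marked before the first action on newRdd
    newRdd.foreach(x => {})        // the action materializes the RDD and writes the checkpoint
    println(newRdd.isCheckpointed) // true only after the RDD has been materialized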


2014-05-05 23:14 GMT+02:00 Andrea Esposito <and1...@gmail.com>:

> Checkpointing doesn't seem to help. I do it at each iteration/superstep.
>
> Looking more deeply, the RDDs are recomputed only a few times during the
> initial 'phase'; after that they aren't recomputed anymore. I attach
> screenshots: bootstrap phase, recompute section and after. This is still
> unexpected because I persist all the intermediate results.
>
> Anyway, the time of each iteration keeps degrading; for instance, the first
> superstep takes 3 sec and the 70th superstep takes 8 sec.
>
> Looking at the screenshot, an iteration goes from row 528 to row 122.
>
> Any idea where to investigate?
>
>
> 2014-05-02 22:28 GMT+02:00 Andrew Ash <and...@andrewash.com>:
>
>> If you end up with a really long dependency tree between RDDs (like 100+),
>> people have reported success using the .checkpoint() method.  This
>> computes the RDD and then saves it, flattening the dependency tree.  It
>> turns out that having a really long RDD dependency graph causes
>> serialization sizes of tasks to go up, plus any failure causes a long
>> sequence of operations to regenerate the missing partition.
>>
>> Maybe give that a shot and see if it helps?
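>>
>> A rough sketch of how that might fit the iterative pattern quoted further
>> down (the interval of 10 supersteps and the variable names are only
>> placeholders, not the actual code):
>>
>>   var rdd = initialRdd
>>   for (i <- 1 to numSupersteps) {
>>     val next = rdd.map(myFun).persist(myStorageLevel)
>>     if (i % 10 == 0) next.checkpoint() // periodically flatten the lineage
>>     next.foreach(x => {})              // force evaluation (and write the checkpoint)
>>     rdd.unpersist(true)
>>     rdd = next
>>   }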
>>
>>
>> On Fri, May 2, 2014 at 3:29 AM, Andrea Esposito <and1...@gmail.com> wrote:
>>
>>> Sorry for the very late answer.
>>>
>>> I carefully followed what you pointed out and figured out that the
>>> structure used for each record was too big, with many small objects.
>>> After changing it, the memory usage decreased drastically.
>>>
>>> Despite that, I'm still struggling with the performance decreasing across
>>> supersteps. The memory footprint is now much smaller than before and GC
>>> time is not noticeable anymore.
>>> I suspect that some RDDs are being recomputed, and watching the stages
>>> carefully there is evidence of that, but I don't understand why it's happening.
>>>
>>> Recalling my usage pattern:
>>>
>>>> newRdd = oldRdd.map(myFun).persist(myStorageLevel)
>>>> newRdd.foreach(x => {}) // Force evaluation
>>>> oldRdd.unpersist(true)
>>>>
>>>
>>> Following my usage pattern, I also tried not unpersisting the intermediate
>>> RDDs (i.e. oldRdd), but nothing changed.
>>>
>>> Any hints? How could I debug this?
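>>>
>>> As a debugging aid (just a sketch), the lineage can be printed after each
>>> superstep with toDebugString; if the dependency tree keeps growing from one
>>> superstep to the next, that would explain both the recomputation and the
>>> increasing times:
>>>
>>>   println(newRdd.toDebugString) // dumps the full dependency tree of newRdd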
>>>
>>>
>>>
>>> 2014-04-14 12:55 GMT+02:00 Andrew Ash <and...@andrewash.com>:
>>>
>>>> A lot of your time is being spent in garbage collection (second image).
>>>> Maybe your dataset doesn't easily fit into memory?  Can you reduce the
>>>> number of new objects created in myFun?
>>>>
>>>> How big are your heap sizes?
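>>>>
>>>> For reference, a minimal sketch of where those sizes are usually configured
>>>> (the values and app name are only placeholders, and option names may differ
>>>> between Spark versions):
>>>>
>>>>   import org.apache.spark.{SparkConf, SparkContext}
>>>>   val conf = new SparkConf()
>>>>     .setAppName("graph-supersteps")              // hypothetical app name
>>>>     .set("spark.executor.memory", "16g")         // executor heap size
>>>>     .set("spark.storage.memoryFraction", "0.6")  // share of heap kept for cached RDDs
>>>>   val sc = new SparkContext(conf)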
>>>>
>>>> Another observation is that in the 4th image some of your RDDs are
>>>> massive and some are tiny.
>>>>
>>>>
>>>> On Mon, Apr 14, 2014 at 11:45 AM, Andrea Esposito <and1...@gmail.com> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I'm developing an iterative computation over graphs, but I'm struggling
>>>>> with some embarrassingly low performance.
>>>>>
>>>>> The computation is heavily iterative and I'm following this RDD usage
>>>>> pattern:
>>>>>
>>>>>> newRdd = oldRdd.map(myFun).persist(myStorageLevel)
>>>>>> newRdd.foreach(x => {}) // Force evaluation
>>>>>> oldRdd.unpersist(true)
>>>>>>
>>>>>
>>>>> I'm using a machine equipped with 30 cores and 120 GB of RAM.
>>>>> As an example, I've run it on a small graph of 4,000 vertices and 80
>>>>> thousand edges; the first iterations take 10+ minutes and later ones
>>>>> take much longer.
>>>>> I attach the Spark UI screenshots of just the first 2 iterations.
>>>>>
>>>>> I tried MEMORY_ONLY_SER and MEMORY_AND_DISK_SER, and I also changed
>>>>> "spark.shuffle.memoryFraction" to 0.3, but nothing changed (with so much
>>>>> RAM for 4E10 vertices these settings are quite pointless, I guess).
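>>>>>
>>>>> For completeness, this is roughly how I apply those settings (just a
>>>>> sketch; the fraction value is the one mentioned above):
>>>>>
>>>>>   import org.apache.spark.SparkConf
>>>>>   import org.apache.spark.storage.StorageLevel
>>>>>   val conf = new SparkConf().set("spark.shuffle.memoryFraction", "0.3")
>>>>>   val newRdd = oldRdd.map(myFun).persist(StorageLevel.MEMORY_ONLY_SER)
>>>>>   // or StorageLevel.MEMORY_AND_DISK_SER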
>>>>>
>>>>> How should I continue investigating?
>>>>>
>>>>> Any advice is very, very welcome, thanks.
>>>>>
>>>>> Best,
>>>>> EA
>>>>>
>>>>
>>>>
>>>
>>
>
