Re: cluster hangs for no apparent reason

Walrus theCat Sun, 03 Nov 2013 09:52:00 -0800

Hi Shangyu,

Thanks for responding.  This is a refactor of other code that isn't
completely scalable because it pulls data to the driver.  The new code keeps
everything on the cluster.  I left it running for 7 hours, and the log just
froze.  I checked Ganglia, and only one machine's CPU seemed to be doing
anything.  The last log output left my code at a spot where it filters an
RDD by a locally stored set.  No error was thrown.  I thought using a local
set was OK based on the example code, but just in case, I changed it to a
broadcast variable.  The un-refactored code (which pulls all the data to the
driver from time to time) runs in minutes.  I've never before had the log
simply stop responding, and was wondering if anyone knew of any heuristics
I could check.
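
One heuristic that can be checked from the log alone is whether the waiting
stages all trace back to a stage that is still running.  Below is a minimal
Python sketch, assuming the log format matches the DAGScheduler lines quoted
further down in this thread (other Spark versions may print them
differently).  The names `parse` and `blocking_stages` are made up for
illustration and are not part of Spark.

```python
import re

# Sample lines taken from the DAGScheduler output quoted in this thread,
# with the email line wrapping undone.
LOG = """\
13/10/31 02:10:13 INFO scheduler.DAGScheduler: running: Set(Stage 11)
13/10/31 02:10:13 INFO scheduler.DAGScheduler: waiting: Set(Stage 9, Stage 8)
13/10/31 02:10:13 INFO scheduler.DAGScheduler: failed: Set()
13/10/31 02:10:16 INFO scheduler.DAGScheduler: Missing parents for Stage 9: List(Stage 11)
13/10/31 02:10:16 INFO scheduler.DAGScheduler: Missing parents for Stage 8: List(Stage 9)
"""

# Assumed line shapes; adjust the patterns if your log differs.
MISSING = re.compile(r"Missing parents for Stage (\d+): List\((.*?)\)")
RUNNING = re.compile(r"running: Set\((.*?)\)")
STAGE = re.compile(r"Stage (\d+)")


def stage_ids(text):
    """Extract every stage number mentioned in a Set(...) or List(...) body."""
    return [int(n) for n in STAGE.findall(text)]


def parse(log):
    """Return (missing, running), keeping the most recent report of each.

    missing maps a waiting stage to the parent stages it is waiting on;
    running is the set of stages last reported as running.
    """
    missing, running = {}, set()
    for line in log.splitlines():
        m = MISSING.search(line)
        if m:
            missing[int(m.group(1))] = stage_ids(m.group(2))
            continue
        r = RUNNING.search(line)
        if r:
            running = set(stage_ids(r.group(1)))
    return missing, running


def blocking_stages(missing):
    """Stages that others wait on but that are not themselves waiting."""
    parents = {p for ps in missing.values() for p in ps}
    return sorted(parents - set(missing))


missing, running = parse(LOG)
print("blocking stages:", blocking_stages(missing))
print("running stages: ", sorted(running))
```

On the quoted lines this reports Stage 11 as both the only blocking stage and
the only running one.  That would suggest the scheduler is waiting on Stage
11's tasks rather than being stuck itself, so that stage's tasks are the
place to look.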


Thank you


On Sat, Nov 2, 2013 at 2:55 PM, Shangyu Luo <[email protected]> wrote:

> Yes, I think so. The running time depends on what work you are doing and
> how large it is.
>
>
> 2013/11/1 Walrus theCat <[email protected]>
>
>> That's what I thought, too.  So is it not "hanging", just recalculating
>> for a very long time?  The log stops updating and it just gives the output
>> I posted.  If there are any suggestions as to parameters to change, or any
>> other data, it would be appreciated.
>>
>> Thank you, Shangyu.
>>
>>
>> On Fri, Nov 1, 2013 at 11:31 AM, Shangyu Luo <[email protected]> wrote:
>>
>>> I think the missing parent may not be abnormal. From my understanding,
>>> when a Spark task cannot find its parent, it can use some metadata to find
>>> the result of its parent or recalculate its parent's value. Imagine that,
>>> in a loop, a Spark task tries to find some value from the last iteration's
>>> result.
>>>
>>>
>>> 2013/11/1 Walrus theCat <[email protected]>
>>>
>>>> Are there heuristics to check when the scheduler says it is "missing
>>>> parents" and just hangs?
>>>>
>>>>
>>>>
>>>> On Thu, Oct 31, 2013 at 4:56 PM, Walrus theCat <[email protected]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I'm not sure what's going on here.  My code seems to be working thus
>>>>> far (map at SparkLR:90 completed.)  What can I do to help the scheduler
>>>>> out here?
>>>>>
>>>>> Thanks
>>>>>
>>>>> 13/10/31 02:10:13 INFO scheduler.DAGScheduler: Completed ShuffleMapTask(10, 211)
>>>>> 13/10/31 02:10:13 INFO scheduler.DAGScheduler: Stage 10 (map at SparkLR.scala:90) finished in 0.923 s
>>>>> 13/10/31 02:10:13 INFO scheduler.DAGScheduler: looking for newly runnable stages
>>>>> 13/10/31 02:10:13 INFO scheduler.DAGScheduler: running: Set(Stage 11)
>>>>> 13/10/31 02:10:13 INFO scheduler.DAGScheduler: waiting: Set(Stage 9, Stage 8)
>>>>> 13/10/31 02:10:13 INFO scheduler.DAGScheduler: failed: Set()
>>>>> 13/10/31 02:10:16 INFO scheduler.DAGScheduler: Missing parents for Stage 9: List(Stage 11)
>>>>> 13/10/31 02:10:16 INFO scheduler.DAGScheduler: Missing parents for Stage 8: List(Stage 9)
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> --
>>>
>>> Shangyu, Luo
>>> Department of Computer Science
>>> Rice University
>>>
>>> --
>>> Not Just Think About It, But Do It!
>>> --
>>> Success is never final.
>>> --
>>> Losers always whine about their best
>>>
>>
>>
>
>
