Andrew, yes, that's right. Don't forget to implement hashCode() and equals()
when dealing with standard classes (you may not always want to, or be able
to, use case classes in Scala), as forgetting to do so can lead to some
nasty bugs :-).
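A minimal sketch of the point above (class and field names are hypothetical): a plain, non-case class used as a hash-map key needs value-based equals() and hashCode() overrides, otherwise lookups fall back to reference identity and equal-looking keys miss:

```scala
// Hypothetical key class; without these overrides, two instances with
// identical fields would compare by reference and hash to different buckets.
class UserId(val id: Long, val region: String) {
  override def equals(other: Any): Boolean = other match {
    case that: UserId => id == that.id && region == that.region
    case _            => false
  }
  // Keep hashCode consistent with equals: built from the same fields.
  override def hashCode: Int =
    31 * id.hashCode + region.hashCode
}

object Demo {
  def main(args: Array[String]): Unit = {
    val m = scala.collection.mutable.HashMap(new UserId(1L, "eu") -> "a")
    // With the overrides in place, an equal (but distinct) instance
    // finds the stored entry:
    assert(m.contains(new UserId(1L, "eu")))
    println("lookup ok")
  }
}
```

A case class would generate equivalent methods automatically; the manual version matters when you must extend a Java class or otherwise can't use a case class.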

Grega

On Thu, Jan 16, 2014 at 9:42 PM, Andrew Ash <[email protected]> wrote:

> It sounds like the takeaway is that if you're using custom classes, you
> need to make sure that their hashCode() and equals() methods are
> value-based?
>
>
> On Thu, Jan 16, 2014 at 12:08 PM, Patrick Wendell <[email protected]> wrote:
>
>> Thanks for following up and explaining this one! Definitely something
>> other users might run into...
>>
>>
>> On Thu, Jan 16, 2014 at 5:58 AM, Grega Kešpret <[email protected]> wrote:
>>
>>> Just to follow up, we have since pinpointed the problem to be in the
>>> application code (not Spark). In some cases there was an infinite loop in
>>> the Scala HashTable linear-probing algorithm, where an element's next()
>>> pointed at itself. It was probably caused by incorrect hashCode() and
>>> equals() methods on the objects we were storing.
>>>
>>> Milos, we also have the Master node separate from the Worker nodes. Could
>>> someone from the Spark team comment on that?
>>>
>>> Grega
>>> --
>>> *Grega Kešpret*
>>> Analytics engineer
>>>
>>> Celtra — Rich Media Mobile Advertising
>>> celtra.com <http://www.celtra.com/> | @celtramobile <http://www.twitter.com/celtramobile>
>>>
>>>
>>> On Thu, Jan 16, 2014 at 2:46 PM, Milos Nikolic <
>>> [email protected]> wrote:
>>>
>>>> Hello,
>>>>
>>>> I’m facing the same (or a similar) problem. In my case, the last two
>>>> tasks hang in a map function following sc.sequenceFile(…). It happens from
>>>> time to time (more often with TorrentBroadcast than with HttpBroadcast),
>>>> and after a restart it works fine.
>>>>
>>>> The problem always happens on the same node, the one that acts as both
>>>> the master and a worker. Once this node became master-only (i.e., once I
>>>> removed it from conf/slaves), the problem was gone.
>>>>
>>>> Does that mean that the master and workers have to be on separate
>>>> nodes?
>>>>
>>>> Best,
>>>> Milos
>>>>
>>>>
>>>> On Jan 6, 2014, at 5:44 PM, Grega Kešpret <[email protected]> wrote:
>>>>
>>>> Hi,
>>>>
>>>> we are seeing, several times a day, one worker in a Standalone cluster
>>>> hang with 100% CPU at the last task and not proceed. After we restart
>>>> the job, it completes successfully.
>>>>
>>>> We are using Spark v0.8.1-incubating.
>>>>
>>>> Attached please find jstack logs of Worker
>>>> and CoarseGrainedExecutorBackend JVM processes.
>>>>
>>>> Grega
>>>>  <logs.zip>
>>>>
>>>>
>>>>
>>>
>>
>

