[ 
https://issues.apache.org/jira/browse/YARN-2809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14198560#comment-14198560
 ] 

Nathan Roberts commented on YARN-2809:
--------------------------------------

Stack trace:
 {noformat}
[<ffffffff8150d4a8>] ? panic+0xa7/0x16f
 [<ffffffff815116d4>] ? oops_end+0xe4/0x100
 [<ffffffff81046bfb>] ? no_context+0xfb/0x260
 [<ffffffff81449058>] ? dev_hard_start_xmit+0x308/0x530
 [<ffffffff81046e85>] ? __bad_area_nosemaphore+0x125/0x1e0
 [<ffffffff812773a9>] ? cpumask_next_and+0x29/0x50
 [<ffffffff81046f53>] ? bad_area_nosemaphore+0x13/0x20
 [<ffffffff810476b1>] ? __do_page_fault+0x321/0x480
 [<ffffffff81056881>] ? update_curr+0xe1/0x1f0
 [<ffffffff81065905>] ? enqueue_entity+0x125/0x410
 [<ffffffff810524e3>] ? set_next_buddy+0x43/0x50
 [<ffffffff810570e0>] ? check_preempt_wakeup+0x1c0/0x260
 [<ffffffff81065ceb>] ? enqueue_task_fair+0xfb/0x100
 [<ffffffff8105230c>] ? check_preempt_curr+0x7c/0x90
 [<ffffffff815135fe>] ? do_page_fault+0x3e/0xa0
 [<ffffffff815109b5>] ? page_fault+0x25/0x30
 [<ffffffff81056b19>] ? update_cfs_shares+0x29/0x170
 [<ffffffff81065363>] ? dequeue_entity+0x113/0x2e0
 [<ffffffff810664da>] ? dequeue_task_fair+0x6a/0x130
 [<ffffffff81055ebe>] ? dequeue_task+0x8e/0xb0
 [<ffffffff81055f03>] ? deactivate_task+0x23/0x30
 [<ffffffff8150dc99>] ? thread_return+0x127/0x76e
 [<ffffffff810e6e1e>] ? call_rcu+0xe/0x10
 [<ffffffff8107196f>] ? release_task+0x33f/0x4b0
 [<ffffffff81073837>] ? do_exit+0x5b7/0x870
 [<ffffffff81073b48>] ? do_group_exit+0x58/0xd0
 [<ffffffff81088e36>] ? get_signal_to_deliver+0x1f6/0x460
 [<ffffffff8100a265>] ? do_signal+0x75/0x800
 [<ffffffff810dc675>] ? __audit_syscall_exit+0x265/0x290
 [<ffffffff8100aa80>] ? do_notify_resume+0x90/0xc0
 [<ffffffff8100b341>] ? int_signal+0x12/0x17
{noformat}
What's happening is that CgroupsLCEResourcesHandler attempts to delete the 
cgroup before all the tasks within the cgroup have exited (explained later). It 
retries every 20ms until the removal succeeds or a timeout (default 1 
second) expires. Sometimes these attempts hit a race within the kernel where 
the last task has not completely finished tearing down, yet teardown has 
progressed far enough that the cgroup can be removed. This leaves a NULL 
pointer behind, which results in the panic.

The kernel has been fixed and most recent distributions will have the fix. 
However, there are older kernel versions out there that would benefit from a 
simple workaround. The proposed workaround is to wait until the "tasks" file 
within the cgroup is empty, and then delay a small amount of time before 
attempting to delete the cgroup. 
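
A minimal sketch of that workaround, for illustration only (this is not the 
actual patch). The class name, method names, and the 10ms settle delay are 
hypothetical; the 20ms poll interval and 1 second timeout are the defaults 
mentioned above, and here a single deadline is assumed to cover both the wait 
and the removal attempts:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of YARN.
class CgroupDeleteSketch {

  /** True when the cgroup's "tasks" file lists no PIDs, or is already gone. */
  public static boolean tasksFileEmpty(Path cgroupDir) throws IOException {
    try {
      byte[] content = Files.readAllBytes(cgroupDir.resolve("tasks"));
      return new String(content).trim().isEmpty();
    } catch (NoSuchFileException e) {
      return true; // no tasks file means nothing is left to wait for
    }
  }

  /**
   * Waits for the tasks file to drain, pauses briefly, then tries to remove
   * the cgroup directory. Returns false if the deadline passes first.
   */
  public static boolean deleteCgroup(Path cgroupDir, long timeoutMs,
      long pollMs, long settleMs) throws IOException, InterruptedException {
    long deadline = System.nanoTime() + timeoutMs * 1_000_000L;

    // Step 1: do not attempt removal while any task is still listed.
    while (!tasksFileEmpty(cgroupDir)) {
      if (System.nanoTime() >= deadline) {
        return false;
      }
      Thread.sleep(pollMs);
    }

    // Step 2: assumed small delay so the last task can finish tearing down.
    Thread.sleep(settleMs);

    // Step 3: remove the directory (rmdir), retrying until the deadline.
    do {
      if (cgroupDir.toFile().delete()) {
        return true;
      }
      Thread.sleep(pollMs);
    } while (System.nanoTime() < deadline);
    return false;
  }

  public static void main(String[] args) throws Exception {
    if (args.length != 1) {
      System.err.println("usage: CgroupDeleteSketch <cgroup-dir>");
      return;
    }
    // 1 second timeout, 20ms poll, 10ms settle delay (the last is assumed).
    boolean removed = deleteCgroup(Paths.get(args[0]), 1000, 20, 10);
    System.out.println(removed ? "removed" : "timed out");
  }
}
```

On a real cgroup filesystem the directory can be removed even though it 
still shows control files such as "tasks"; on an ordinary filesystem the 
delete only succeeds once the directory is truly empty.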

One question is why there are still tasks in the cgroup. I don't have a 
complete answer here and some of the details may be slightly off, but I do 
know the following. The process tree within a MapReduce cgroup looks like 
"bash -c" -> "java ...". 
When map or reduce processing is complete, the AM is informed, and it then 
informs the NM so that the container can be torn down. A SIGTERM is sent to 
the session (bash is the session leader). bash exits much more quickly than 
everything else, so its parent (container-executor) gets a SIGCHLD and starts 
cleaning up. This includes removing the cgroup, which gets us into the race 
described above. 

> Implement workaround for linux kernel panic when removing cgroup
> ----------------------------------------------------------------
>
>                 Key: YARN-2809
>                 URL: https://issues.apache.org/jira/browse/YARN-2809
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.6.0
>         Environment:  RHEL 6.4
>            Reporter: Nathan Roberts
>            Assignee: Nathan Roberts
>
> Some older versions of Linux have a bug that can cause a kernel panic when 
> the LCE attempts to remove a cgroup. It is a race condition, so it's a bit 
> rare, but on a cluster of a few thousand nodes it can result in a couple of 
> panics per day.
> This is the commit that likely (haven't verified) fixes the problem in Linux: 
> https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?h=linux-2.6.39.y&id=068c5cc5ac7414a8e9eb7856b4bf3cc4d4744267
> Details will be added in comments.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
