[
https://issues.apache.org/jira/browse/YARN-4048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16172195#comment-16172195
]
Ruslan Dautkhanov commented on YARN-4048:
-----------------------------------------
Information on this Linux kernel panic from Red Hat support case 01901460.
They analyzed the crash dump (vmcore) produced during the panic and came up
with the analysis of the problem below.
{noformat}
Kwon, Daniel on Aug 02 2017 at 11:25 PM -06:00
Hi,
This is Daniel Kwon from the kernel team. I'm working with the case owner to
support your case.
The system crashed due to a hard lockup, which means a process was holding a
CPU for more than 60 seconds.
------------------------------------------------------------------------------
CPUS: 88
DATE: Fri Jul 28 14:37:44 2017
UPTIME: 15 days, 16:42:29
LOAD AVERAGE: 109.16, 39.97, 15.17
TASKS: 3728
NODENAME: pc1udahad14
RELEASE: 2.6.32-696.3.2.el6.x86_64
VERSION: #1 SMP Wed Jun 7 11:51:39 EDT 2017
MACHINE: x86_64 (2397 Mhz)
MEMORY: 511.9 GB
PANIC: "Kernel panic - not syncing: Hard LOCKUP"
------------------------------------------------------------------------------
Checking the runqueue timestamps shows that many CPUs had not been updated
recently, which means processes had been holding those CPUs for a long time.
------------------------------------------------------------------------------
crash> runq -t | grep CPU | sort -k3r | awk 'NR==1{now=strtonum("0x"$3)}1{printf"%s\t%7.2fs behind\n",$0,(now-strtonum("0x"$3))/1000000000}'
CPU 25: 4d0f5bec32555 0.00s behind
CPU 2: 4d0f5bec32015 0.00s behind
<... cut ...>
CPU 4: 4d0f5bb9984fb 0.05s behind
CPU 61: 4d0f5bb8f5619 0.05s behind
CPU 57: 4d0f5bb83f85d 0.05s behind
CPU 17: 4d0f5bb78ad7d 0.06s behind
CPU 12: 4d0f5ba972dbd 0.07s behind
CPU 48: 4d0f5ba8b4980 0.07s behind
CPU 84: 4d0f5ba72ca7e 0.07s behind
CPU 13: 4d0f5966bef24 0.68s behind
CPU 15: 4d0f58a123cfd 0.88s behind
CPU 54: 4d0f5832e5754 1.00s behind
CPU 62: 4d0f581d593b7 1.02s behind
CPU 49: 4d0f54868608e 1.99s behind
CPU 52: 4d0f5480bd287 1.99s behind
CPU 24: 4d0eb096f58da 45.99s behind
CPU 35: 4d0e730040f57 62.52s behind
CPU 85: 4d0e42eaeaea0 75.43s behind
CPU 46: 4d0e3287e8aae 79.83s behind
CPU 1: 4d0e1072119ac 88.98s behind
CPU 45: 4d0e061ec766a 91.75s behind
CPU 60: 4d0db70bc6ad6 112.98s behind
CPU 6: 4d0db002d7b9b 114.87s behind
CPU 14: 4d0d9679efaad 121.72s behind
CPU 9: 4d0d938f74e97 122.50s behind
CPU 51: 4d0d912c6d77e 123.14s behind
CPU 5: 4d0d807a6de65 127.63s behind
CPU 53: 4d0d80637174f 127.65s behind
CPU 70: 4d0d78c599c8e 129.69s behind
CPU 44: 4d0d75602d3c3 130.61s behind
CPU 3: 4d0d6fe84455e 132.07s behind
CPU 0: 4d0d6f1c22d11 132.29s behind
CPU 47: 4d0d6f16a2e95 132.29s behind
CPU 64: 4d0d6f06851ee 132.31s behind
CPU 59: 4d0d6da9596d6 132.68s behind
CPU 23: 4d0d6d89ecaaa 132.71s behind
CPU 22: 4d0d6c8ad9dd2 132.98s behind
CPU 67: 4d0d6c853e44d 132.98s behind
------------------------------------------------------------------------------
I have checked the two CPUs that had been stuck the longest, CPU 22 and CPU 67.
The process on CPU 67 was waiting for the runqueue lock of its current CPU.
------------------------------------------------------------------------------
crash> runq -c 67
CPU 67 RUNQUEUE: ffff8841616f6ec0
CURRENT: PID: 40639 TASK: ffff885c89c5b520 COMMAND: "java"
RT PRIO_ARRAY: ffff8841616f7048
[ 0] PID: 271 TASK: ffff8840266c7520 COMMAND: "migration/67"
[ 0] PID: 274 TASK: ffff8840266d6ab0 COMMAND: "watchdog/67"
CFS RB_ROOT: ffff8841616f6f58
[120] PID: 422 TASK: ffff884026392ab0 COMMAND: "events/67"
[120] PID: 7857 TASK: ffff888005966040 COMMAND: "kondemand/67"
crash> bt 40639
PID: 40639 TASK: ffff885c89c5b520 CPU: 67 COMMAND: "java"
#0 [ffff8841616e6e90] crash_nmi_callback at ffffffff81036726
#1 [ffff8841616e6ea0] notifier_call_chain at ffffffff81551085
#2 [ffff8841616e6ee0] atomic_notifier_call_chain at ffffffff815510ea
#3 [ffff8841616e6ef0] notify_die at ffffffff810acd0e
#4 [ffff8841616e6f20] do_nmi at ffffffff8154ec09
#5 [ffff8841616e6f50] nmi at ffffffff8154e5b3
[exception RIP: wait_for_rqlock+0x31]
<... cut ...>
--- <NMI exception stack> ---
#6 [ffff887fd461feb8] wait_for_rqlock at ffffffff8105d751
#7 [ffff887fd461fec0] do_exit at ffffffff81081dbc
#8 [ffff887fd461ff40] do_group_exit at ffffffff810820c8
#9 [ffff887fd461ff70] sys_exit_group at ffffffff81082157
#10 [ffff887fd461ff80] system_call_fastpath at ffffffff8100b0d2
<... cut ...>
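For orientation, wait_for_rqlock() itself is not quoted in this analysis; based
on its name and the stuck RIP, a plausible (purely illustrative) shape is a
busy-wait on the current CPU's runqueue lock:

    /* Illustrative sketch only -- not the actual RHEL-6 source. */
    static inline void wait_for_rqlock(void)
    {
            struct rq *rq = this_rq();           /* this CPU's runqueue */
            while (spin_is_locked(&rq->lock))    /* spin until the holder unlocks */
                    cpu_relax();
    }

So the exiting java task keeps its CPU busy while it waits.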
------------------------------------------------------------------------------
The process on CPU 22 was also waiting for a runqueue lock. This time it was
waiting for the runqueue lock of a specific task, the java process (PID 40639)
that was running on CPU 67.
------------------------------------------------------------------------------
crash> runq -c 22
CPU 22 RUNQUEUE: ffff884161416ec0
CURRENT: PID: 8944 TASK: ffff88801833eab0 COMMAND: "cmf-agent"
RT PRIO_ARRAY: ffff884161417048
[no tasks queued]
CFS RB_ROOT: ffff884161416f58
[no tasks queued]
crash> bt 8944
PID: 8944 TASK: ffff88801833eab0 CPU: 22 COMMAND: "cmf-agent"
#0 [ffff884161406e90] crash_nmi_callback at ffffffff81036726
#1 [ffff884161406ea0] notifier_call_chain at ffffffff81551085
#2 [ffff884161406ee0] atomic_notifier_call_chain at ffffffff815510ea
#3 [ffff884161406ef0] notify_die at ffffffff810acd0e
#4 [ffff884161406f20] do_nmi at ffffffff8154ec09
#5 [ffff884161406f50] nmi at ffffffff8154e5b3
[exception RIP: task_rq_unlock_wait+0x2a]
RIP: ffffffff8105d7ba RSP: ffff88800676fdb8 RFLAGS: 00000016
RAX: 0000000048a1488f RBX: ffff885c89c5b520 RCX: ffff8841616f6ec0
RDX: 00000000000048a1 RSI: ffff882bbb9e6060 RDI: ffff885c89c5b520
RBP: ffff88800676fdb8 R8: ffff888005e28310 R9: dead000000200200
R10: ffff888005e28310 R11: 0000000000000000 R12: 0000000000011958
R13: ffff888005e281c0 R14: 0000000000009e01 R15: ffff888018314cc0
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <NMI exception stack> ---
#6 [ffff88800676fdb8] task_rq_unlock_wait at ffffffff8105d7ba
#7 [ffff88800676fdc0] release_task at ffffffff8107ff19
#8 [ffff88800676fe10] wait_consider_task at ffffffff81080a26
#9 [ffff88800676fe80] do_wait at ffffffff81080e56
#10 [ffff88800676fee0] sys_wait4 at ffffffff81081043
#11 [ffff88800676ff80] system_call_fastpath at ffffffff8100b0d2
<... cut ...>
void task_rq_unlock_wait(struct task_struct *p)
{
        struct rq *rq = task_rq(p);
        smp_mb(); /* spin-unlock-wait is not a full memory barrier */
        spin_unlock_wait(&rq->lock);
}
#7 [ffff88800676fdc0] release_task at ffffffff8107ff19
ffff88800676fdc8: 0000000045c0e1a1 ffff8880183154c8
ffff88800676fdd8: 0000000000000000 00007fffffffeffd
ffff88800676fde8: ffff885c89c5b520 0000000000000000
^
+--- task waiting
ffff88800676fdf8: 0000000000009ebf 0000000000009ebf
ffff88800676fe08: ffff88800676fe78 ffffffff81080a26
#8 [ffff88800676fe10] wait_consider_task at ffffffff81080a26
crash> set ffff885c89c5b520 <-- the exiting java task from CPU 67; its runqueue
lock is the same lock the first process is waiting for.
PID: 40639
COMMAND: "java"
TASK: ffff885c89c5b520 [THREAD_INFO: ffff887fd461c000]
CPU: 67
STATE: EXIT_DEAD (ACTIVE)
/**
* spin_unlock_wait - wait until the spinlock gets unlocked
* @lock: the spinlock in question.
*/
#define spin_unlock_wait(lock) __raw_spin_unlock_wait(&(lock)->raw_lock)
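On x86 in 2.6.32-era kernels, __raw_spin_unlock_wait() boils down to a busy-wait
loop (paraphrased from the upstream ticket-spinlock code, so treat the exact
form as approximate):

    /* arch/x86/include/asm/spinlock.h (2.6.32-era, paraphrased) */
    static inline void __ticket_spin_unlock_wait(raw_spinlock_t *lock)
    {
            /* Spin until the current holder drops the lock; the waiter
             * never sleeps or yields its CPU. */
            while (__ticket_spin_is_locked(lock))
                    cpu_relax();
    }

So both waiters above keep burning their CPUs for as long as the lock holder
runs, which is exactly what the hard-lockup watchdog eventually catches.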
crash> struct rq.lock ffff8841616f6ec0 -ox
struct rq {
[ffff8841616f6ec0] spinlock_t lock;
}
------------------------------------------------------------------------------
So, both processes were stuck on their CPUs waiting for the same runqueue lock.
We need to find out who was holding this runqueue lock (ffff8841616f6ec0).
It turns out it was held by the process running on CPU 30.
------------------------------------------------------------------------------
crash> bt
PID: 0 TASK: ffff8840268e6ab0 CPU: 30 COMMAND: "swapper"
#0 [ffff884161506e90] crash_nmi_callback at ffffffff81036726
#1 [ffff884161506ea0] notifier_call_chain at ffffffff81551085
#2 [ffff884161506ee0] atomic_notifier_call_chain at ffffffff815510ea
#3 [ffff884161506ef0] notify_die at ffffffff810acd0e
#4 [ffff884161506f20] do_nmi at ffffffff8154ec09
#5 [ffff884161506f50] nmi at ffffffff8154e5b3
[exception RIP: load_balance_fair+0xb7]
<... cut ...>
--- <NMI exception stack> ---
#6 [ffff884161503cc0] load_balance_fair at ffffffff8106bcb7
#7 [ffff884161503d88] rebalance_domains at ffffffff810707f3
#8 [ffff884161503e78] run_rebalance_domains at ffffffff81070bfc
#9 [ffff884161503ec8] __do_softirq at ffffffff81085335
#10 [ffff884161503f48] call_softirq at ffffffff8100c38c
#11 [ffff884161503f60] do_softirq at ffffffff8100fc95
#12 [ffff884161503f80] irq_exit at ffffffff810851c5
#13 [ffff884161503f90] smp_apic_timer_interrupt at ffffffff81554c5a
#14 [ffff884161503fb0] apic_timer_interrupt at ffffffff8100bc13
--- <IRQ stack> ---
#15 [ffff8840268f3d98] apic_timer_interrupt at ffffffff8100bc13
<... cut ...>
#16 [ffff8840268f3ee0] cpuidle_idle_call at ffffffff81443a7a
#17 [ffff8840268f3f00] cpu_idle at ffffffff81009fe6
------------------------------------------------------------------------------
While it was moving tasks between CPUs, it had to take CPU 67's runqueue lock.
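For reference, the corresponding load_balance() logic (paraphrased from the
generic 2.6.32 sources, so the exact form is approximate; it appears to be
inlined into rebalance_domains() in this build, hence the labels in the
disassembly below) takes the busiest runqueue's lock, with interrupts disabled,
around the whole move_tasks() call:

    /* kernel/sched.c, load_balance(), 2.6.32-era (paraphrased) */
    if (busiest->nr_running > 1) {
            local_irq_save(flags);
            /* locks both this_rq->lock and busiest->lock -- here busiest
             * is CPU 67's runqueue */
            double_rq_lock(this_rq, busiest);
            ld_moved = move_tasks(this_rq, this_cpu, busiest,
                                  imbalance, sd, idle, &all_pinned);
            double_rq_unlock(this_rq, busiest);
            local_irq_restore(flags);
    }

move_tasks() then walks the scheduler classes, which is how load_balance_fair()
ends up running with that lock held.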
------------------------------------------------------------------------------
/usr/src/debug/kernel-2.6.32-696.3.2.el6/linux-2.6.32-696.3.2.el6.x86_64/kernel/sched.c: 3622
0xffffffff810707c8 <rebalance_domains+0x2b8>: lea -0x40(%rbp),%rdi
0xffffffff810707cc <rebalance_domains+0x2bc>: lea -0x38(%rbp),%rax
<... cut ...>
0xffffffff810707e2 <rebalance_domains+0x2d2>: mov %rax,(%rsp)
0xffffffff810707e6 <rebalance_domains+0x2d6>: mov %r15,%rdx <--- busiest
0xffffffff810707e9 <rebalance_domains+0x2d9>: mov -0x74(%rbp),%esi
0xffffffff810707ec <rebalance_domains+0x2dc>: mov -0x70(%rbp),%rdi
0xffffffff810707f0 <rebalance_domains+0x2e0>: callq *0x40(%rbx)
3611 static int move_tasks(struct rq *this_rq, int this_cpu, struct rq *busiest,
3612 unsigned long max_load_move,
3613 struct sched_domain *sd, enum cpu_idle_type idle,
3614 int *all_pinned)
3615 {
3616 const struct sched_class *class = sched_class_highest;
3617 unsigned long total_load_moved = 0;
3618 int this_best_prio = this_rq->curr->prio;
3619
3620 do {
3621 total_load_moved +=
3622 class->load_balance(this_rq, this_cpu, busiest,
3623 max_load_move - total_load_moved,
3624 sd, idle, all_pinned, &this_best_prio);
4594 static int load_balance(int this_cpu, struct rq *this_rq,
4595 struct sched_domain *sd, enum cpu_idle_type idle,
4596 int *balance)
4597 {
<... cut ...>
4652 imbalance, sd, idle, &all_pinned);
/usr/src/debug/kernel-2.6.32-696.3.2.el6/linux-2.6.32-696.3.2.el6.x86_64/kernel/sched.c: 4650
0xffffffff81070796 <rebalance_domains+0x286>: mov -0x70(%rbp),%rdi <-- this_rq
0xffffffff8107079a <rebalance_domains+0x28a>: mov %r15,%rsi <-- busiest
/usr/src/debug/kernel-2.6.32-696.3.2.el6/linux-2.6.32-696.3.2.el6.x86_64/kernel/sched.c: 3618
<... cut ...>
/usr/src/debug/kernel-2.6.32-696.3.2.el6/linux-2.6.32-696.3.2.el6.x86_64/kernel/sched_fair.c: 2795
0xffffffff8106bc26 <load_balance_fair+0x26>: mov %rdx,-0x78(%rbp) <--- busiest
crash> bt -f | grep " rebalance" -B 1
ffff884161503d88: ffffffff810707f3
#7 [ffff884161503d88] rebalance_domains at ffffffff810707f3
crash> px 0xffff884161503d80-0x78
$15 = 0xffff884161503d08
crash> rd 0xffff884161503d08
ffff884161503d08: ffff8841616f6ec0 .noaA...
crash> runq | grep ffff8841616f6ec0
CPU 67 RUNQUEUE: ffff8841616f6ec0
------------------------------------------------------------------------------
While holding the lock, it was walking each task_group and trying to rebalance
some tasks.
------------------------------------------------------------------------------
2803 list_for_each_entry_rcu(tg, &task_groups, list) {
2804 struct cfs_rq *busiest_cfs_rq = tg->cfs_rq[busiest_cpu];
2805 unsigned long busiest_h_load = busiest_cfs_rq->h_load;
2806 unsigned long busiest_weight = busiest_cfs_rq->load.weight; <---
was running this part
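For context, the surrounding loop (paraphrased from the generic 2.6.32
sched_fair.c, so treat it as a sketch rather than the exact RHEL source) walks
every entry on the global task_groups list, and the caller still holds the
busiest runqueue's lock for the whole walk:

    /* kernel/sched_fair.c, load_balance_fair(), 2.6.32-era (paraphrased) */
    static unsigned long
    load_balance_fair(struct rq *this_rq, int this_cpu, struct rq *busiest,
                      unsigned long max_load_move, struct sched_domain *sd,
                      enum cpu_idle_type idle, int *all_pinned, int *this_best_prio)
    {
            long rem_load_move = max_load_move;
            int busiest_cpu = cpu_of(busiest);
            struct task_group *tg;

            rcu_read_lock();
            /* One full pass per call -- O(number of task_groups), all of it
             * with busiest->lock held by the caller. */
            list_for_each_entry_rcu(tg, &task_groups, list) {
                    struct cfs_rq *busiest_cfs_rq = tg->cfs_rq[busiest_cpu];
                    unsigned long busiest_h_load = busiest_cfs_rq->h_load;
                    unsigned long busiest_weight = busiest_cfs_rq->load.weight;

                    /* ... compute this group's share of the imbalance and
                     * try to move tasks out of busiest_cfs_rq ... */
            }
            rcu_read_unlock();

            return max_load_move - rem_load_move;
    }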
------------------------------------------------------------------------------
As it is not blocked (just busy), the lock will be released some time later,
but the number of groups on the task_groups list was 2,421, which means this
looping operation takes quite some time.
------------------------------------------------------------------------------
crash> list 0xffff886d001a54d8 | wc -l
2421
------------------------------------------------------------------------------
We can tell that there were at least 2,421 task_groups. I didn't check the
children of each task_group, as that would be a long and tedious task and 2,421
is already a big number. Even with this number alone, traversing the list and
moving tasks can take a long time. By default there is only one task_group, and
additional task_groups are created through cgroups; in this case the number of
task_groups was far too large, which was causing the overall slowness.
Checking the sosreport confirms nothing was configured in /etc/cgconfig.conf.
However, I could see that there are runtime-generated cgroups.
------------------------------------------------------------------------------
$ grep cgroup df
cm_cgroups 0 0 0 - /var/run/cloudera-scm-agent/cgroups/blkio
cm_cgroups 0 0 0 - /var/run/cloudera-scm-agent/cgroups/cpuacct
cm_cgroups 0 0 0 - /var/run/cloudera-scm-agent/cgroups/cpu
cm_cgroups 0 0 0 - /var/run/cloudera-scm-agent/cgroups/memory
------------------------------------------------------------------------------
And these cgroups were used by Hadoop applications.
------------------------------------------------------------------------------
yarn 83493 0.0 0.0 23368 1144 ? S 13:13 0:00
/opt/cloudera/parcels/CDH-5.10.1-1.cdh5.10.1.p2224.2407/lib/hadoop-yarn/bin/container-executor
dsteiner dsteiner 1 application_1501200023435_0120
container_e130_1501200023435_0120_01_000209
/hdfs01/yarn/nm/usercache/dsteiner/appcache/application_1501200023435_0120/container_e130_1501200023435_0120_01_000209
/hdfs06/yarn/nm/nmPrivate/application_1501200023435_0120/container_e130_1501200023435_0120_01_000209/launch_container.sh
/hdfs08/yarn/nm/nmPrivate/application_1501200023435_0120/container_e130_1501200023435_0120_01_000209/container_e130_1501200023435_0120_01_000209.tokens
/hdfs02/yarn/nm/nmPrivate/application_1501200023435_0120/container_e130_1501200023435_0120_01_000209/container_e130_1501200023435_0120_01_000209.pid
/hdfs01/yarn/nm /hdfs02/yarn/nm /hdfs03/yarn/nm /hdfs04/yarn/nm
/hdfs05/yarn/nm /hdfs06/yarn/nm /hdfs07/yarn/nm /hdfs08/yarn/nm /hdfs09/yarn/nm
/hdfs10/yarn/nm /var/log/hadoop-yarn/container
cgroups=/var/run/cloudera-scm-agent/cgroups/cpu/hadoop-yarn/container_e130_1501200023435_0120_01_000209/tasks
$ grep cgroups ps | wc -l
23
------------------------------------------------------------------------------
I checked some task_groups from the list and could see that they all belong to
'hadoop-yarn'.
------------------------------------------------------------------------------
crash> list 0xffff886d001a54d8 | head
ffff886d001a54d8
ffff883d335eb8d8
<... cut ...>
ffff8831c4c490d8
crash> px 0xffff886d001a54d8-0xd8
$23 = 0xffff886d001a5400
crash> task_group.css 0xffff886d001a5400
css = {
cgroup = 0xffff8871fbdc8000,
refcnt = {
counter = 0x1
},
flags = 0x0,
id = 0x0
}
crash> cgroup.dentry 0xffff8871fbdc8000
dentry = 0xffff8851949eba80
crash> dentry.d_parent,d_iname 0xffff8851949eba80
d_parent = 0xffff884014da5a40
d_iname =
"META-INF\000\071\063\000\062_0.data\000\000\000\000\000\000\000\000\000\000\000"
crash> dentry.d_parent,d_iname 0xffff884014da5a40
d_parent = 0xffff884014ca3680
d_iname =
"hadoop-yarn\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
crash> dentry.d_parent,d_iname 0xffff884014ca3680
d_parent = 0xffff884014ca3680
d_iname =
"/\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
crash> px 0xffff883d335eb8d8-0xd8
$24 = 0xffff883d335eb800
crash> task_group.css 0xffff883d335eb800
css = {
cgroup = 0xffff88201b512800,
refcnt = {
counter = 0x1
},
flags = 0x0,
id = 0x0
}
crash> cgroup.dentry 0xffff88201b512800
dentry = 0xffff8824e144d980
crash> dentry.d_parent,d_iname 0xffff8824e144d980
d_parent = 0xffff884014da5a40
d_iname = "META-INF\000ult.xml\000ata\000a\000\000\060\332D\341$\210\377\377"
crash> dentry.d_parent,d_iname 0xffff884014da5a40
d_parent = 0xffff884014ca3680
d_iname =
"hadoop-yarn\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
crash> dentry.d_parent,d_iname 0xffff884014ca3680
d_parent = 0xffff884014ca3680
d_iname =
"/\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
crash> px 0xffff8831c4c490d8-0xd8
$25 = 0xffff8831c4c49000
crash> task_group.css 0xffff8831c4c49000
css = {
cgroup = 0xffff883582c55a00,
refcnt = {
counter = 0x1
},
flags = 0x0,
id = 0x0
}
crash> cgroup.dentry 0xffff883582c55a00
dentry = 0xffff882d8f4769c0
crash> dentry.d_parent,d_iname 0xffff882d8f4769c0
d_parent = 0xffff884014da5a40
d_iname = "org\000affinity\000\060\071_0.data\000meta\000\000\000\000"
crash> dentry.d_parent,d_iname 0xffff884014da5a40
d_parent = 0xffff884014ca3680
d_iname =
"hadoop-yarn\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
crash> dentry.d_parent,d_iname 0xffff884014ca3680
d_parent = 0xffff884014ca3680
d_iname =
"/\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
------------------------------------------------------------------------------
I don't see any other applications using these cgroups. As hadoop is the only
user of them, could you check with the application team how to reduce the
number of cgroups generated by these applications? Please let us know if you
have any questions regarding the above. Thank you.
Best regards,
Daniel Kwon
Senior Software Maintenance Engineer, GSS, Red Hat Asia Pacific
{noformat}
Hope this helps.
The number of cgroups seems to be the culprit.
RHEL 6 / Linux kernel 2.6.32 may not scale well when the number of cgroups is high.
> Linux kernel panic under strict CPU limits(on CentOS/RHEL 6.x)
> --------------------------------------------------------------
>
> Key: YARN-4048
> URL: https://issues.apache.org/jira/browse/YARN-4048
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Affects Versions: 2.7.1
> Reporter: Chengbing Liu
> Priority: Critical
> Attachments: panic.png
>
>
> With YARN-2440 and YARN-2531, we have seen some kernel panics happening under
> heavy pressure. Even with YARN-2809, it still panics.
> We are using CentOS 6.5, hadoop 2.5.0-cdh5.2.0 with the above patches. I
> guess the latest version also has the same issue.