[
https://issues.apache.org/jira/browse/YARN-4048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16172195#comment-16172195
]
Ruslan Dautkhanov commented on YARN-4048:
-----------------------------------------
Information on this Linux kernel panic from Red Hat support case 01901460.
They analyzed the crash dump (vmcore) produced during the panic and came up
with the analysis of the problem below.
{noformat}
Kwon, Daniel on Aug 02 2017 at 11:25 PM -06:00
Hi,
This is Daniel Kwon from the kernel team. I'm working with the case owner to
support your case.
The system crashed due to a hard lockup, which means a process was holding a
CPU for more than 60 seconds.
------------------------------------------------------------------------------
CPUS: 88
DATE: Fri Jul 28 14:37:44 2017
UPTIME: 15 days, 16:42:29
LOAD AVERAGE: 109.16, 39.97, 15.17
TASKS: 3728
NODENAME: pc1udahad14
RELEASE: 2.6.32-696.3.2.el6.x86_64
VERSION: #1 SMP Wed Jun 7 11:51:39 EDT 2017
MACHINE: x86_64 (2397 Mhz)
MEMORY: 511.9 GB
PANIC: "Kernel panic - not syncing: Hard LOCKUP"
------------------------------------------------------------------------------
Checking the runqueue timestamps shows that many CPUs had not been updated
recently, which means processes had been holding those CPUs for a long time.
------------------------------------------------------------------------------
crash> runq -t | grep CPU | sort -k3r | awk 'NR==1{now=strtonum("0x"$3)}1{printf"%s\t%7.2fs behind\n",$0,(now-strtonum("0x"$3))/1000000000}'
CPU 25: 4d0f5bec32555 0.00s behind
CPU 2: 4d0f5bec32015 0.00s behind
<... cut ...>
CPU 4: 4d0f5bb9984fb 0.05s behind
CPU 61: 4d0f5bb8f5619 0.05s behind
CPU 57: 4d0f5bb83f85d 0.05s behind
CPU 17: 4d0f5bb78ad7d 0.06s behind
CPU 12: 4d0f5ba972dbd 0.07s behind
CPU 48: 4d0f5ba8b4980 0.07s behind
CPU 84: 4d0f5ba72ca7e 0.07s behind
CPU 13: 4d0f5966bef24 0.68s behind
CPU 15: 4d0f58a123cfd 0.88s behind
CPU 54: 4d0f5832e5754 1.00s behind
CPU 62: 4d0f581d593b7 1.02s behind
CPU 49: 4d0f54868608e 1.99s behind
CPU 52: 4d0f5480bd287 1.99s behind
CPU 24: 4d0eb096f58da 45.99s behind
CPU 35: 4d0e730040f57 62.52s behind
CPU 85: 4d0e42eaeaea0 75.43s behind
CPU 46: 4d0e3287e8aae 79.83s behind
CPU 1: 4d0e1072119ac 88.98s behind
CPU 45: 4d0e061ec766a 91.75s behind
CPU 60: 4d0db70bc6ad6 112.98s behind
CPU 6: 4d0db002d7b9b 114.87s behind
CPU 14: 4d0d9679efaad 121.72s behind
CPU 9: 4d0d938f74e97 122.50s behind
CPU 51: 4d0d912c6d77e 123.14s behind
CPU 5: 4d0d807a6de65 127.63s behind
CPU 53: 4d0d80637174f 127.65s behind
CPU 70: 4d0d78c599c8e 129.69s behind
CPU 44: 4d0d75602d3c3 130.61s behind
CPU 3: 4d0d6fe84455e 132.07s behind
CPU 0: 4d0d6f1c22d11 132.29s behind
CPU 47: 4d0d6f16a2e95 132.29s behind
CPU 64: 4d0d6f06851ee 132.31s behind
CPU 59: 4d0d6da9596d6 132.68s behind
CPU 23: 4d0d6d89ecaaa 132.71s behind
CPU 22: 4d0d6c8ad9dd2 132.98s behind
CPU 67: 4d0d6c853e44d 132.98s behind
------------------------------------------------------------------------------
I have checked the two CPUs that had been stuck the longest, CPU 22 and CPU 67.
The process on CPU 67 was waiting for the runqueue lock of its current CPU.
------------------------------------------------------------------------------
crash> runq -c 67
CPU 67 RUNQUEUE: ffff8841616f6ec0
CURRENT: PID: 40639 TASK: ffff885c89c5b520 COMMAND: "java"
RT PRIO_ARRAY: ffff8841616f7048
[ 0] PID: 271 TASK: ffff8840266c7520 COMMAND: "migration/67"
[ 0] PID: 274 TASK: ffff8840266d6ab0 COMMAND: "watchdog/67"
CFS RB_ROOT: ffff8841616f6f58
[120] PID: 422 TASK: ffff884026392ab0 COMMAND: "events/67"
[120] PID: 7857 TASK: ffff888005966040 COMMAND: "kondemand/67"
crash> bt 40639
PID: 40639 TASK: ffff885c89c5b520 CPU: 67 COMMAND: "java"
#0 [ffff8841616e6e90] crash_nmi_callback at ffffffff81036726
#1 [ffff8841616e6ea0] notifier_call_chain at ffffffff81551085
#2 [ffff8841616e6ee0] atomic_notifier_call_chain at ffffffff815510ea
#3 [ffff8841616e6ef0] notify_die at ffffffff810acd0e
#4 [ffff8841616e6f20] do_nmi at ffffffff8154ec09
#5 [ffff8841616e6f50] nmi at ffffffff8154e5b3
[exception RIP: wait_for_rqlock+0x31]
<... cut ...>
--- <NMI exception stack> ---
#6 [ffff887fd461feb8] wait_for_rqlock at ffffffff8105d751
#7 [ffff887fd461fec0] do_exit at ffffffff81081dbc
#8 [ffff887fd461ff40] do_group_exit at ffffffff810820c8
#9 [ffff887fd461ff70] sys_exit_group at ffffffff81082157
#10 [ffff887fd461ff80] system_call_fastpath at ffffffff8100b0d2
<... cut ...>
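For orientation, wait_for_rqlock() itself is not quoted in this analysis; based
on its name and the stuck RIP, a plausible (purely illustrative) shape is a
busy-wait on the current CPU's runqueue lock:

    /* Illustrative sketch only -- not the actual RHEL-6 source. */
    static inline void wait_for_rqlock(void)
    {
            struct rq *rq = this_rq();           /* this CPU's runqueue */
            while (spin_is_locked(&rq->lock))    /* spin until the holder unlocks */
                    cpu_relax();
    }

So the exiting java task keeps its CPU busy while it waits.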
------------------------------------------------------------------------------
The process on CPU 22 was also waiting for a runqueue lock. This time it was
waiting for the runqueue lock of a specific task, the java process (PID 40639)
that was running on CPU 67.
------------------------------------------------------------------------------
crash> runq -c 22
CPU 22 RUNQUEUE: ffff884161416ec0
CURRENT: PID: 8944 TASK: ffff88801833eab0 COMMAND: "cmf-agent"
RT PRIO_ARRAY: ffff884161417048
[no tasks queued]
CFS RB_ROOT: ffff884161416f58
[no tasks queued]
crash> bt 8944
PID: 8944 TASK: ffff88801833eab0 CPU: 22 COMMAND: "cmf-agent"
#0 [ffff884161406e90] crash_nmi_callback at ffffffff81036726
#1 [ffff884161406ea0] notifier_call_chain at ffffffff81551085
#2 [ffff884161406ee0] atomic_notifier_call_chain at ffffffff815510ea
#3 [ffff884161406ef0] notify_die at ffffffff810acd0e
#4 [ffff884161406f20] do_nmi at ffffffff8154ec09
#5 [ffff884161406f50] nmi at ffffffff8154e5b3
[exception RIP: task_rq_unlock_wait+0x2a]
RIP: ffffffff8105d7ba RSP: ffff88800676fdb8 RFLAGS: 00000016
RAX: 0000000048a1488f RBX: ffff885c89c5b520 RCX: ffff8841616f6ec0
RDX: 00000000000048a1 RSI: ffff882bbb9e6060 RDI: ffff885c89c5b520
RBP: ffff88800676fdb8 R8: ffff888005e28310 R9: dead000000200200
R10: ffff888005e28310 R11: 0000000000000000 R12: 0000000000011958
R13: ffff888005e281c0 R14: 0000000000009e01 R15: ffff888018314cc0
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <NMI exception stack> ---
#6 [ffff88800676fdb8] task_rq_unlock_wait at ffffffff8105d7ba
#7 [ffff88800676fdc0] release_task at ffffffff8107ff19
#8 [ffff88800676fe10] wait_consider_task at ffffffff81080a26
#9 [ffff88800676fe80] do_wait at ffffffff81080e56
#10 [ffff88800676fee0] sys_wait4 at ffffffff81081043
#11 [ffff88800676ff80] system_call_fastpath at ffffffff8100b0d2
<... cut ...>
void task_rq_unlock_wait(struct task_struct *p)
{
        struct rq *rq = task_rq(p);
        smp_mb(); /* spin-unlock-wait is not a full memory barrier */
        spin_unlock_wait(&rq->lock);
}
#7 [ffff88800676fdc0] release_task at ffffffff8107ff19
ffff88800676fdc8: 0000000045c0e1a1 ffff8880183154c8
ffff88800676fdd8: 0000000000000000 00007fffffffeffd
ffff88800676fde8: ffff885c89c5b520 0000000000000000
^
+--- task waiting
ffff88800676fdf8: 0000000000009ebf 0000000000009ebf
ffff88800676fe08: ffff88800676fe78 ffffffff81080a26
#8 [ffff88800676fe10] wait_consider_task at ffffffff81080a26
crash> set ffff885c89c5b520 <-- the exiting java task from CPU 67; its runqueue
lock is the same lock the first process is waiting for.
PID: 40639
COMMAND: "java"
TASK: ffff885c89c5b520 [THREAD_INFO: ffff887fd461c000]
CPU: 67
STATE: EXIT_DEAD (ACTIVE)
/**
* spin_unlock_wait - wait until the spinlock gets unlocked
* @lock: the spinlock in question.
*/
#define spin_unlock_wait(lock) __raw_spin_unlock_wait(&(lock)->raw_lock)
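On x86 in 2.6.32-era kernels, __raw_spin_unlock_wait() boils down to a busy-wait
loop (paraphrased from the upstream ticket-spinlock code, so treat the exact
form as approximate):

    /* arch/x86/include/asm/spinlock.h (2.6.32-era, paraphrased) */
    static inline void __ticket_spin_unlock_wait(raw_spinlock_t *lock)
    {
            /* Spin until the current holder drops the lock; the waiter
             * never sleeps or yields its CPU. */
            while (__ticket_spin_is_locked(lock))
                    cpu_relax();
    }

So both waiters above keep burning their CPUs for as long as the lock holder
runs, which is exactly what the hard-lockup watchdog eventually catches.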
crash> struct rq.lock ffff8841616f6ec0 -ox
struct rq {
[ffff8841616f6ec0] spinlock_t lock;
}
------------------------------------------------------------------------------
So, both processes were stuck on their CPUs waiting for the same runqueue lock.
We need to find out who was holding this runqueue lock (ffff8841616f6ec0).
It turns out it was held by the process running on CPU 30.
------------------------------------------------------------------------------
crash> bt
PID: 0 TASK: ffff8840268e6ab0 CPU: 30 COMMAND: "swapper"
#0 [ffff884161506e90] crash_nmi_callback at ffffffff81036726
#1 [ffff884161506ea0] notifier_call_chain at ffffffff81551085
#2 [ffff884161506ee0] atomic_notifier_call_chain at ffffffff815510ea
#3 [ffff884161506ef0] notify_die at ffffffff810acd0e
#4 [ffff884161506f20] do_nmi at ffffffff8154ec09
#5 [ffff884161506f50] nmi at ffffffff8154e5b3
[exception RIP: load_balance_fair+0xb7]
<... cut ...>
--- <NMI exception stack> ---
#6 [ffff884161503cc0] load_balance_fair at ffffffff8106bcb7
#7 [ffff884161503d88] rebalance_domains at ffffffff810707f3
#8 [ffff884161503e78] run_rebalance_domains at ffffffff81070bfc
#9 [ffff884161503ec8] __do_softirq at ffffffff81085335
#10 [ffff884161503f48] call_softirq at ffffffff8100c38c
#11 [ffff884161503f60] do_softirq at ffffffff8100fc95
#12 [ffff884161503f80] irq_exit at ffffffff810851c5
#13 [ffff884161503f90] smp_apic_timer_interrupt at ffffffff81554c5a
#14 [ffff884161503fb0] apic_timer_interrupt at ffffffff8100bc13
--- <IRQ stack> ---
#15 [ffff8840268f3d98] apic_timer_interrupt at ffffffff8100bc13
<... cut ...>
#16 [ffff8840268f3ee0] cpuidle_idle_call at ffffffff81443a7a
#17 [ffff8840268f3f00] cpu_idle at ffffffff81009fe6
------------------------------------------------------------------------------
While it was moving tasks between CPUs, it had to take CPU 67's runqueue lock.
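For reference, the corresponding load_balance() logic (paraphrased from the
generic 2.6.32 sources, so the exact form is approximate; it appears to be
inlined into rebalance_domains() in this build, hence the labels in the
disassembly below) takes the busiest runqueue's lock, with interrupts disabled,
around the whole move_tasks() call:

    /* kernel/sched.c, load_balance(), 2.6.32-era (paraphrased) */
    if (busiest->nr_running > 1) {
            local_irq_save(flags);
            /* locks both this_rq->lock and busiest->lock -- here busiest
             * is CPU 67's runqueue */
            double_rq_lock(this_rq, busiest);
            ld_moved = move_tasks(this_rq, this_cpu, busiest,
                                  imbalance, sd, idle, &all_pinned);
            double_rq_unlock(this_rq, busiest);
            local_irq_restore(flags);
    }

move_tasks() then walks the scheduler classes, which is how load_balance_fair()
ends up running with that lock held.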
------------------------------------------------------------------------------
/usr/src/debug/kernel-2.6.32-696.3.2.el6/linux-2.6.32-696.3.2.el6.x86_64/kernel/sched.c: 3622
0xffffffff810707c8 <rebalance_domains+0x2b8>: lea -0x40(%rbp),%rdi
0xffffffff810707cc <rebalance_domains+0x2bc>: lea -0x38(%rbp),%rax
<... cut ...>
0xffffffff810707e2 <rebalance_domains+0x2d2>: mov %rax,(%rsp)
0xffffffff810707e6 <rebalance_domains+0x2d6>: mov %r15,%rdx <--- busiest
0xffffffff810707e9 <rebalance_domains+0x2d9>: mov -0x74(%rbp),%esi
0xffffffff810707ec <rebalance_domains+0x2dc>: mov -0x70(%rbp),%rdi
0xffffffff810707f0 <rebalance_domains+0x2e0>: callq *0x40(%rbx)
3611 static int move_tasks(struct rq *this_rq, int this_cpu, struct rq *busiest,
3612 unsigned long max_load_move,
3613 struct sched_domain *sd, enum cpu_idle_type idle,
3614 int *all_pinned)
3615 {
3616 const struct sched_class *class = sched_class_highest;
3617 unsigned long total_load_moved = 0;
3618 int this_best_prio = this_rq->curr->prio;
3619
3620 do {
3621 total_load_moved +=
3622 class->load_balance(this_rq, this_cpu, busiest,
3623 max_load_move - total_load_moved,
3624 sd, idle, all_pinned, &this_best_prio);
4594 static int load_balance(int this_cpu, struct rq *this_rq,
4595 struct sched_domain *sd, enum cpu_idle_type idle,
4596 int *balance)
4597 {
<... cut ...>
4652 imbalance, sd, idle, &all_pinned);
/usr/src/debug/kernel-2.6.32-696.3.2.el6/linux-2.6.32-696.3.2.el6.x86_64/kernel/sched.c: 4650
0xffffffff81070796 <rebalance_domains+0x286>: mov -0x70(%rbp),%rdi <-- this_rq
0xffffffff8107079a <rebalance_domains+0x28a>: mov %r15,%rsi <-- busiest
/usr/src/debug/kernel-2.6.32-696.3.2.el6/linux-2.6.32-696.3.2.el6.x86_64/kernel/sched.c: 3618
<... cut ...>
/usr/src/debug/kernel-2.6.32-696.3.2.el6/linux-2.6.32-696.3.2.el6.x86_64/kernel/sched_fair.c: 2795
0xffffffff8106bc26 <load_balance_fair+0x26>: mov %rdx,-0x78(%rbp) <--- busiest
crash> bt -f | grep " rebalance" -B 1
ffff884161503d88: ffffffff810707f3
#7 [ffff884161503d88] rebalance_domains at ffffffff810707f3
crash> px 0xffff884161503d80-0x78
$15 = 0xffff884161503d08
crash> rd 0xffff884161503d08
ffff884161503d08: ffff8841616f6ec0 .noaA...
crash> runq | grep ffff8841616f6ec0
CPU 67 RUNQUEUE: ffff8841616f6ec0
------------------------------------------------------------------------------
While holding the lock, it was walking each task_group and trying to rebalance
some tasks.
------------------------------------------------------------------------------
2803 list_for_each_entry_rcu(tg, &task_groups, list) {
2804 struct cfs_rq *busiest_cfs_rq = tg->cfs_rq[busiest_cpu];
2805 unsigned long busiest_h_load = busiest_cfs_rq->h_load;
2806 unsigned long busiest_weight = busiest_cfs_rq->load.weight; <---
was running this part
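For context, the surrounding loop (paraphrased from the generic 2.6.32
sched_fair.c, so treat it as a sketch rather than the exact RHEL source) walks
every entry on the global task_groups list, and the caller still holds the
busiest runqueue's lock for the whole walk:

    /* kernel/sched_fair.c, load_balance_fair(), 2.6.32-era (paraphrased) */
    static unsigned long
    load_balance_fair(struct rq *this_rq, int this_cpu, struct rq *busiest,
                      unsigned long max_load_move, struct sched_domain *sd,
                      enum cpu_idle_type idle, int *all_pinned, int *this_best_prio)
    {
            long rem_load_move = max_load_move;
            int busiest_cpu = cpu_of(busiest);
            struct task_group *tg;

            rcu_read_lock();
            /* One full pass per call -- O(number of task_groups), all of it
             * with busiest->lock held by the caller. */
            list_for_each_entry_rcu(tg, &task_groups, list) {
                    struct cfs_rq *busiest_cfs_rq = tg->cfs_rq[busiest_cpu];
                    unsigned long busiest_h_load = busiest_cfs_rq->h_load;
                    unsigned long busiest_weight = busiest_cfs_rq->load.weight;

                    /* ... compute this group's share of the imbalance and
                     * try to move tasks out of busiest_cfs_rq ... */
            }
            rcu_read_unlock();

            return max_load_move - rem_load_move;
    }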
------------------------------------------------------------------------------
As it is not blocked (just busy), the lock will be released some time later,
but the number of groups on the task_groups list was 2,421, which means this
looping operation takes quite some time.
------------------------------------------------------------------------------
crash> list 0xffff886d001a54d8 | wc -l
2421
------------------------------------------------------------------------------
We can tell that there were at least 2,421 task_groups. I didn't check the
children of each task_group, as that would be a long and tedious task and 2,421
is already a big number. Even with this number alone, traversing the list and
moving tasks can take a long time. By default there is only one task_group, and
additional task_groups are created through cgroups; in this case the number of
task_groups was far too large, which was causing the overall slowness.
Checking the sosreport confirms nothing was configured in /etc/cgconfig.conf.
However, I could see that there are runtime-generated cgroups.
------------------------------------------------------------------------------
$ grep cgroup df
cm_cgroups 0 0 0 - /var/run/cloudera-scm-agent/cgroups/blkio
cm_cgroups 0 0 0 - /var/run/cloudera-scm-agent/cgroups/cpuacct
cm_cgroups 0 0 0 - /var/run/cloudera-scm-agent/cgroups/cpu
cm_cgroups 0 0 0 - /var/run/cloudera-scm-agent/cgroups/memory
------------------------------------------------------------------------------
And these cgroups were used by Hadoop applications.
------------------------------------------------------------------------------
yarn 83493 0.0 0.0 23368 1144 ? S 13:13 0:00
/opt/cloudera/parcels/CDH-5.10.1-1.cdh5.10.1.p2224.2407/lib/hadoop-yarn/bin/container-executor
dsteiner dsteiner 1 application_1501200023435_0120
container_e130_1501200023435_0120_01_000209
/hdfs01/yarn/nm/usercache/dsteiner/appcache/application_1501200023435_0120/container_e130_1501200023435_0120_01_000209
/hdfs06/yarn/nm/nmPrivate/application_1501200023435_0120/container_e130_1501200023435_0120_01_000209/launch_container.sh
/hdfs08/yarn/nm/nmPrivate/application_1501200023435_0120/container_e130_1501200023435_0120_01_000209/container_e130_1501200023435_0120_01_000209.tokens
/hdfs02/yarn/nm/nmPrivate/application_1501200023435_0120/container_e130_1501200023435_0120_01_000209/container_e130_1501200023435_0120_01_000209.pid
/hdfs01/yarn/nm /hdfs02/yarn/nm /hdfs03/yarn/nm /hdfs04/yarn/nm
/hdfs05/yarn/nm /hdfs06/yarn/nm /hdfs07/yarn/nm /hdfs08/yarn/nm /hdfs09/yarn/nm
/hdfs10/yarn/nm /var/log/hadoop-yarn/container
cgroups=/var/run/cloudera-scm-agent/cgroups/cpu/hadoop-yarn/container_e130_1501200023435_0120_01_000209/tasks
$ grep cgroups ps | wc -l
23
------------------------------------------------------------------------------
I checked some task_groups from the list and could see that they all belong to
'hadoop-yarn'.
------------------------------------------------------------------------------
crash> list 0xffff886d001a54d8 | head
ffff886d001a54d8
ffff883d335eb8d8
<... cut ...>
ffff8831c4c490d8
crash> px 0xffff886d001a54d8-0xd8
$23 = 0xffff886d001a5400
crash> task_group.css 0xffff886d001a5400
css = {
cgroup = 0xffff8871fbdc8000,
refcnt = {
counter = 0x1
},
flags = 0x0,
id = 0x0
}
crash> cgroup.dentry 0xffff8871fbdc8000
dentry = 0xffff8851949eba80
crash> dentry.d_parent,d_iname 0xffff8851949eba80
d_parent = 0xffff884014da5a40
d_iname =
"META-INF\000\071\063\000\062_0.data\000\000\000\000\000\000\000\000\000\000\000"
crash> dentry.d_parent,d_iname 0xffff884014da5a40
d_parent = 0xffff884014ca3680
d_iname =
"hadoop-yarn\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
crash> dentry.d_parent,d_iname 0xffff884014ca3680
d_parent = 0xffff884014ca3680
d_iname =
"/\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
crash> px 0xffff883d335eb8d8-0xd8
$24 = 0xffff883d335eb800
crash> task_group.css 0xffff883d335eb800
css = {
cgroup = 0xffff88201b512800,
refcnt = {
counter = 0x1
},
flags = 0x0,
id = 0x0
}
crash> cgroup.dentry 0xffff88201b512800
dentry = 0xffff8824e144d980
crash> dentry.d_parent,d_iname 0xffff8824e144d980
d_parent = 0xffff884014da5a40
d_iname = "META-INF\000ult.xml\000ata\000a\000\000\060\332D\341$\210\377\377"
crash> dentry.d_parent,d_iname 0xffff884014da5a40
d_parent = 0xffff884014ca3680
d_iname =
"hadoop-yarn\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
crash> dentry.d_parent,d_iname 0xffff884014ca3680
d_parent = 0xffff884014ca3680
d_iname =
"/\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
crash> px 0xffff8831c4c490d8-0xd8
$25 = 0xffff8831c4c49000
crash> task_group.css 0xffff8831c4c49000
css = {
cgroup = 0xffff883582c55a00,
refcnt = {
counter = 0x1
},
flags = 0x0,
id = 0x0
}
crash> cgroup.dentry 0xffff883582c55a00
dentry = 0xffff882d8f4769c0
crash> dentry.d_parent,d_iname 0xffff882d8f4769c0
d_parent = 0xffff884014da5a40
d_iname = "org\000affinity\000\060\071_0.data\000meta\000\000\000\000"
crash> dentry.d_parent,d_iname 0xffff884014da5a40
d_parent = 0xffff884014ca3680
d_iname =
"hadoop-yarn\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
crash> dentry.d_parent,d_iname 0xffff884014ca3680
d_parent = 0xffff884014ca3680
d_iname =
"/\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
------------------------------------------------------------------------------
I don't see any other applications using these cgroups. As hadoop is the only
user of them, could you check with the application team how to reduce the
number of cgroups generated by these applications? Please let us know if you
have any questions regarding the above. Thank you.
Best regards,
Daniel Kwon
Senior Software Maintenance Engineer, GSS, Red Hat Asia Pacific
{noformat}
Hope this helps.
The number of cgroups seems to be the culprit.
RHEL 6 / Linux kernel 2.6.32 may not scale well when the number of cgroups is high.
> Linux kernel panic under strict CPU limits(on CentOS/RHEL 6.x)
> --------------------------------------------------------------
>
> Key: YARN-4048
> URL: https://issues.apache.org/jira/browse/YARN-4048
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Affects Versions: 2.7.1
> Reporter: Chengbing Liu
> Priority: Critical
> Attachments: panic.png
>
>
> With YARN-2440 and YARN-2531, we have seen some kernel panics happening under
> heavy pressure. Even with YARN-2809, it still panics.
> We are using CentOS 6.5, hadoop 2.5.0-cdh5.2.0 with the above patches. I
> guess the latest version also has the same issue.