Re: [lustre-discuss] 2.10.0 CentOS6.9 ksoftirqd CPU load
Indeed! I have already tested 2.10.4, and the issue is fixed there.

> On May 31, 2018, at 09:15, Hans Henrik Happe wrote:
>
> Happy to report that 2.10.4 fixed this issue.
>
> Cheers,
> Hans Henrik
>
> On 06-04-2018 14:48, Hans Henrik Happe wrote:
>> Just for the record: 2.11.0 has fixed this. Not sure which LU, though.
>>
>> Cheers,
>> Hans Henrik
>>
>> On 30-09-2017 23:43, Hans Henrik Happe wrote:
>>> On 27-09-2017 18:50, Dilger, Andreas wrote:
>>>> On Sep 26, 2017, at 01:10, Hans Henrik Happe wrote:
>>>>> Hi,
>>>>>
>>>>> Did anyone else experience CPU load from ksoftirqd after 'modprobe
>>>>> lustre'? On an otherwise idle node I see:
>>>>>
>>>>>   PID USER PR NI VIRT RES SHR S %CPU %MEM   TIME+ COMMAND
>>>>>     9 root 20  0    0   0   0 S 28.5  0.0 2:05.58 ksoftirqd/1
>>>>>    57 root 20  0    0   0   0 R 23.9  0.0 2:22.91 ksoftirqd/13
>>>>>
>>>>> The sum of those two is about 50% CPU.
>>>>>
>>>>> I have narrowed it down to the ptlrpc module. When I remove that, it
>>>>> stops.
>>>>>
>>>>> I also tested 2.10.1-RC1, which behaves the same.
>>>> If you can run "echo l > /proc/sysrq-trigger" it will report the
>>>> processes that are currently running on the CPUs of your system to the
>>>> console (and also /var/log/messages, if it can write everything in
>>>> time).
>>>>
>>>> You might need to do this several times to get a representative sample
>>>> of the ksoftirqd process stacks to see what they are doing that is
>>>> consuming so much CPU.
>>>>
>>>> Alternately, "echo t > /proc/sysrq-trigger" will report the stacks of
>>>> all processes to the console (and /v/l/m), but there will be a lot of
>>>> them, and no better chance that it catches what ksoftirqd is doing 25%
>>>> of the time.
>>> I've attached the stacks. Some wakeups, which I guess are initiated by
>>> something in the ptlrpc code.
>>>
>>> Cheers,
>>> Hans Henrik
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
Re: [lustre-discuss] 2.10.0 CentOS6.9 ksoftirqd CPU load
Happy to report that 2.10.4 fixed this issue.

Cheers,
Hans Henrik

On 06-04-2018 14:48, Hans Henrik Happe wrote:
> Just for the record: 2.11.0 has fixed this. Not sure which LU, though.
>
> Cheers,
> Hans Henrik
>
> On 30-09-2017 23:43, Hans Henrik Happe wrote:
>> On 27-09-2017 18:50, Dilger, Andreas wrote:
>>> On Sep 26, 2017, at 01:10, Hans Henrik Happe wrote:
>>>> Hi,
>>>>
>>>> Did anyone else experience CPU load from ksoftirqd after 'modprobe
>>>> lustre'? On an otherwise idle node I see:
>>>>
>>>>   PID USER PR NI VIRT RES SHR S %CPU %MEM   TIME+ COMMAND
>>>>     9 root 20  0    0   0   0 S 28.5  0.0 2:05.58 ksoftirqd/1
>>>>    57 root 20  0    0   0   0 R 23.9  0.0 2:22.91 ksoftirqd/13
>>>>
>>>> The sum of those two is about 50% CPU.
>>>>
>>>> I have narrowed it down to the ptlrpc module. When I remove that, it
>>>> stops.
>>>>
>>>> I also tested 2.10.1-RC1, which behaves the same.
>>> If you can run "echo l > /proc/sysrq-trigger" it will report the
>>> processes that are currently running on the CPUs of your system to the
>>> console (and also /var/log/messages, if it can write everything in
>>> time).
>>>
>>> You might need to do this several times to get a representative sample
>>> of the ksoftirqd process stacks to see what they are doing that is
>>> consuming so much CPU.
>>>
>>> Alternately, "echo t > /proc/sysrq-trigger" will report the stacks of
>>> all processes to the console (and /v/l/m), but there will be a lot of
>>> them, and no better chance that it catches what ksoftirqd is doing 25%
>>> of the time.
>> I've attached the stacks. Some wakeups, which I guess are initiated by
>> something in the ptlrpc code.
>>
>> Cheers,
>> Hans Henrik
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
Re: [lustre-discuss] 2.10.0 CentOS6.9 ksoftirqd CPU load
Just for the record: 2.11.0 has fixed this. Not sure which LU, though.

Cheers,
Hans Henrik

On 30-09-2017 23:43, Hans Henrik Happe wrote:
> On 27-09-2017 18:50, Dilger, Andreas wrote:
>> On Sep 26, 2017, at 01:10, Hans Henrik Happe wrote:
>>> Hi,
>>>
>>> Did anyone else experience CPU load from ksoftirqd after 'modprobe
>>> lustre'? On an otherwise idle node I see:
>>>
>>>   PID USER PR NI VIRT RES SHR S %CPU %MEM   TIME+ COMMAND
>>>     9 root 20  0    0   0   0 S 28.5  0.0 2:05.58 ksoftirqd/1
>>>    57 root 20  0    0   0   0 R 23.9  0.0 2:22.91 ksoftirqd/13
>>>
>>> The sum of those two is about 50% CPU.
>>>
>>> I have narrowed it down to the ptlrpc module. When I remove that, it
>>> stops.
>>>
>>> I also tested 2.10.1-RC1, which behaves the same.
>> If you can run "echo l > /proc/sysrq-trigger" it will report the
>> processes that are currently running on the CPUs of your system to the
>> console (and also /var/log/messages, if it can write everything in
>> time).
>>
>> You might need to do this several times to get a representative sample
>> of the ksoftirqd process stacks to see what they are doing that is
>> consuming so much CPU.
>>
>> Alternately, "echo t > /proc/sysrq-trigger" will report the stacks of
>> all processes to the console (and /v/l/m), but there will be a lot of
>> them, and no better chance that it catches what ksoftirqd is doing 25%
>> of the time.
> I've attached the stacks. Some wakeups, which I guess are initiated by
> something in the ptlrpc code.
>
> Cheers,
> Hans Henrik
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
Re: [lustre-discuss] 2.10.0 CentOS6.9 ksoftirqd CPU load
On 27-09-2017 18:50, Dilger, Andreas wrote:
> On Sep 26, 2017, at 01:10, Hans Henrik Happe wrote:
>> Hi,
>>
>> Did anyone else experience CPU load from ksoftirqd after 'modprobe
>> lustre'? On an otherwise idle node I see:
>>
>>   PID USER PR NI VIRT RES SHR S %CPU %MEM   TIME+ COMMAND
>>     9 root 20  0    0   0   0 S 28.5  0.0 2:05.58 ksoftirqd/1
>>    57 root 20  0    0   0   0 R 23.9  0.0 2:22.91 ksoftirqd/13
>>
>> The sum of those two is about 50% CPU.
>>
>> I have narrowed it down to the ptlrpc module. When I remove that, it
>> stops.
>>
>> I also tested 2.10.1-RC1, which behaves the same.
> If you can run "echo l > /proc/sysrq-trigger" it will report the
> processes that are currently running on the CPUs of your system to the
> console (and also /var/log/messages, if it can write everything in
> time).
>
> You might need to do this several times to get a representative sample
> of the ksoftirqd process stacks to see what they are doing that is
> consuming so much CPU.
>
> Alternately, "echo t > /proc/sysrq-trigger" will report the stacks of
> all processes to the console (and /v/l/m), but there will be a lot of
> them, and no better chance that it catches what ksoftirqd is doing 25%
> of the time.

I've attached the stacks. Some wakeups, which I guess are initiated by
something in the ptlrpc code.
Cheers,
Hans Henrik

Sep 30 23:26:01 node033 kernel: NMI backtrace for cpu 11
Sep 30 23:26:01 node033 kernel: CPU 11 Modules linked in: lustre(U) lov(U) mdc(U) fid(U) lmv(U) fld(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) libcfs(U) nfs lockd fscache auth_rpcgss nfs_acl mmfs(U) mmfslinux(U) tracedev(U) autofs4 sha512_generic crc32c_intel sunrpc acpi_cpufreq freq_table mperf bonding ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 xfs exportfs ext2 ipmi_devintf ipmi_si ipmi_msghandler microcode iTCO_wdt iTCO_vendor_support raid0 igb dca i2c_algo_bit i2c_core raid1 ptp pps_core sg serio_raw lpc_ich mfd_core i7core_edac edac_core shpchp ext4 jbd2 mbcache raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx sd_mod crc_t10dif ahci hpsa dm_mirror dm_region_hash dm_log dm_mod [last unloaded: libcfs]
Sep 30 23:26:01 node033 kernel: Pid: 0, comm: swapper Not tainted 2.6.32-696.6.3.el6.x86_64 #1 HP ProLiant DL170e G6 /ProLiant DL170e G6
Sep 30 23:26:01 node033 kernel: RIP: 0010:[] [] intel_idle+0xd1/0x1b0
Sep 30 23:26:01 node033 kernel: RSP: 0018:880c2d067e48 EFLAGS: 0046
Sep 30 23:26:01 node033 kernel: RAX: 0020 RBX: 0008 RCX: 0001
Sep 30 23:26:01 node033 kernel: RDX: RSI: 880c2d067fd8 RDI: 81a98580
Sep 30 23:26:01 node033 kernel: RBP: 880c2d067ed8 R08: 0005 R09: 00c8
Sep 30 23:26:01 node033 kernel: R10: 0011f630c009c747 R11: 880c2d067e68 R12: 0004
Sep 30 23:26:01 node033 kernel: R13: 880c6a6beb40 R14: 0020 R15: 14e940c25ce4ad4a
Sep 30 23:26:01 node033 kernel: FS: () GS:880c6a6a() knlGS:
Sep 30 23:26:01 node033 kernel: CS: 0010 DS: 0018 ES: 0018 CR0: 8005003b
Sep 30 23:26:01 node033 kernel: CR2: 7fe396313000 CR3: 01a8d000 CR4: 07e0
Sep 30 23:26:01 node033 kernel: DR0: DR1: DR2:
Sep 30 23:26:01 node033 kernel: DR3: DR6: 0ff0 DR7: 0400
Sep 30 23:26:01 node033 kernel: Process swapper (pid: 0, threadinfo 880c2d064000, task 880c2d052040)
Sep 30 23:26:01 node033 kernel: Stack:
Sep 30 23:26:01 node033 kernel: 88182d471020 880c6a6b1b80 000b 0011f630b5b0c60b
Sep 30 23:26:01 node033 kernel: 880c0002 0282 804f
Sep 30 23:26:01 node033 kernel: 880c2d067ed8 81444bb4 000b0a59013c
Sep 30 23:26:01 node033 kernel: Call Trace:
Sep 30 23:26:01 node033 kernel: [] ? menu_select+0x174/0x390
Sep 30 23:26:01 node033 kernel: [] cpuidle_idle_call+0x7a/0xe0
Sep 30 23:26:01 node033 kernel: [] cpu_idle+0xb6/0x110
Sep 30 23:26:01 node033 kernel: [] start_secondary+0x2c0/0x316
Sep 30 23:26:01 node033 kernel: Code: ff ff a8 08 75 25 31 d2 48 8d 86 38 c0 ff ff 48 89 d1 0f 01 c8 0f ae f0 48 8b 86 38 c0 ff ff a8 08 75 08 b1 01 4c 89 f0 0f 01 c9 5a 3f db ff 4c 29 f8 48 89 c7 e8 7f 61 d8 ff 49 89 c6 49 89
Sep 30 23:26:01 node033 kernel: Call Trace:
Sep 30 23:26:01 node033 kernel: <#DB[1]> <> Pid: 0, comm: swapper Not tainted 2.6.32-696.6.3.el6.x86_64 #1
Sep 30 23:26:01 node033 kernel: Call Trace:
Sep 30 23:26:01 node033 kernel: [] ? show_regs+0x27/0x30
Sep 30 23:26:01 node033 kernel: [] ? arch_trigger_all_cpu_backtrace_handler+0x99/0xc0
Sep 30 23:26:01 node033 kernel: [] ? notifier_call_chain+0x55/0x80
Sep 30 23:26:01 node033 kernel: [] ? atomic_notifier_call_chain+0x1a/0x20
Sep 30 23:26:01 node033
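As a complement to the sysrq backtraces above, the softirq class that the busy ksoftirqd threads are servicing can be identified by sampling /proc/softirqs twice and diffing the per-CPU counters. A minimal sketch follows; the two inline samples are fabricated for illustration (on a live node you would read /proc/softirqs twice, a few seconds apart):

```python
# Sketch: find which softirq class is rising by diffing two samples of
# /proc/softirqs. The sample strings below are made up for illustration;
# on a real node, read open("/proc/softirqs").read() twice with a sleep
# in between.

def parse_softirqs(text):
    """Return {softirq_name: total_count_summed_across_cpus}."""
    lines = text.strip().splitlines()
    totals = {}
    for line in lines[1:]:          # skip the "CPU0 CPU1 ..." header row
        name, counts = line.split(":", 1)
        totals[name.strip()] = sum(int(c) for c in counts.split())
    return totals

def softirq_deltas(before, after):
    """Per-class increase between two samples, busiest class first."""
    b, a = parse_softirqs(before), parse_softirqs(after)
    return sorted(((n, a[n] - b[n]) for n in a),
                  key=lambda kv: kv[1], reverse=True)

# Fabricated samples, taken (hypothetically) a few seconds apart:
sample_t0 = """\
            CPU0       CPU1
  TIMER:  100000      90000
 NET_RX:    5000        100
  SCHED:   40000      35000
    RCU:  200000     180000
"""

sample_t1 = """\
            CPU0       CPU1
  TIMER:  260000     240000
 NET_RX:    5010        100
  SCHED:   41000      36000
    RCU:  201000     181000
"""

if __name__ == "__main__":
    for name, delta in softirq_deltas(sample_t0, sample_t1):
        print(f"{name:>8}: +{delta}")
```

In this fabricated example the TIMER softirq dominates, which would fit a module arming high-frequency timers; on a real node the dominant class points at which kernel subsystem to inspect next.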
Re: [lustre-discuss] 2.10.0 CentOS6.9 ksoftirqd CPU load
A guess for you to consider: a very common cause of ksoftirqd load is a
hypervisor putting memory pressure on a VM. At least VMware, and I think
KVM and others, use IRQs to implement some of their memory management,
and it can show up like this.

That would of course mean it's not really the ptlrpc module. I'm not
sure how carefully you verified that it is causing this. (Obviously your
"remove it, check, add it, check" method is sound, but if you only
checked once or twice, you may have been wrong through bad luck, or you
could have been right at the limit of your available memory.)

From: lustre-discuss on behalf of Dilger, Andreas
Sent: Wednesday, September 27, 2017 11:50:03 AM
To: Hans Henrik Happe
Cc: Shehata, Amir; lustre-discuss; Olaf Weber
Subject: Re: [lustre-discuss] 2.10.0 CentOS6.9 ksoftirqd CPU load

On Sep 26, 2017, at 01:10, Hans Henrik Happe wrote:
>
> Hi,
>
> Did anyone else experience CPU load from ksoftirqd after 'modprobe
> lustre'? On an otherwise idle node I see:
>
>   PID USER PR NI VIRT RES SHR S %CPU %MEM   TIME+ COMMAND
>     9 root 20  0    0   0   0 S 28.5  0.0 2:05.58 ksoftirqd/1
>    57 root 20  0    0   0   0 R 23.9  0.0 2:22.91 ksoftirqd/13
>
> The sum of those two is about 50% CPU.
>
> I have narrowed it down to the ptlrpc module. When I remove that, it
> stops.
>
> I also tested 2.10.1-RC1, which behaves the same.

If you can run "echo l > /proc/sysrq-trigger" it will report the
processes that are currently running on the CPUs of your system to the
console (and also /var/log/messages, if it can write everything in
time).

You might need to do this several times to get a representative sample
of the ksoftirqd process stacks to see what they are doing that is
consuming so much CPU.

Alternately, "echo t > /proc/sysrq-trigger" will report the stacks of
all processes to the console (and /v/l/m), but there will be a lot of
them, and no better chance that it catches what ksoftirqd is doing 25%
of the time.

Cheers,
Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
Re: [lustre-discuss] 2.10.0 CentOS6.9 ksoftirqd CPU load
On Sep 26, 2017, at 01:10, Hans Henrik Happe wrote:
>
> Hi,
>
> Did anyone else experience CPU load from ksoftirqd after 'modprobe
> lustre'? On an otherwise idle node I see:
>
>   PID USER PR NI VIRT RES SHR S %CPU %MEM   TIME+ COMMAND
>     9 root 20  0    0   0   0 S 28.5  0.0 2:05.58 ksoftirqd/1
>    57 root 20  0    0   0   0 R 23.9  0.0 2:22.91 ksoftirqd/13
>
> The sum of those two is about 50% CPU.
>
> I have narrowed it down to the ptlrpc module. When I remove that, it
> stops.
>
> I also tested 2.10.1-RC1, which behaves the same.

If you can run "echo l > /proc/sysrq-trigger" it will report the
processes that are currently running on the CPUs of your system to the
console (and also /var/log/messages, if it can write everything in
time).

You might need to do this several times to get a representative sample
of the ksoftirqd process stacks to see what they are doing that is
consuming so much CPU.

Alternately, "echo t > /proc/sysrq-trigger" will report the stacks of
all processes to the console (and /v/l/m), but there will be a lot of
them, and no better chance that it catches what ksoftirqd is doing 25%
of the time.

Cheers,
Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
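The %CPU figures shown by top can also be reproduced by hand from /proc/&lt;pid&gt;/stat, which is handy when scripting a before/after comparison around loading and unloading ptlrpc. A rough sketch; the two sample lines are fabricated stand-ins for the live file (field positions per proc(5); USER_HZ is assumed to be the common value of 100):

```python
# Sketch: estimate a thread's average %CPU from two /proc/<pid>/stat
# samples. The sample lines below are fabricated for illustration; on a
# live node you would read /proc/9/stat (etc.) twice, `interval` seconds
# apart.

USER_HZ = 100  # clock ticks per second; the usual value, see proc(5)

def cpu_ticks(stat_line):
    """Return utime + stime (in clock ticks) from a /proc/<pid>/stat line."""
    # comm is in parentheses and may contain spaces, so split after ')'.
    rest = stat_line.rsplit(")", 1)[1].split()
    utime, stime = int(rest[11]), int(rest[12])  # fields 14 and 15
    return utime + stime

def percent_cpu(stat_t0, stat_t1, interval):
    """Average %CPU over `interval` seconds between the two samples."""
    return 100.0 * (cpu_ticks(stat_t1) - cpu_ticks(stat_t0)) / (USER_HZ * interval)

# Fabricated samples for ksoftirqd/1 (pid 9), taken 10 seconds apart.
# As a kernel thread, all its CPU time is stime; utime stays 0.
t0 = "9 (ksoftirqd/1) S 2 0 0 0 -1 2216722752 0 0 0 0 0 12558 0 0 20 0 1 0 33 0 0"
t1 = "9 (ksoftirqd/1) S 2 0 0 0 -1 2216722752 0 0 0 0 0 12843 0 0 20 0 1 0 33 0 0"

if __name__ == "__main__":
    print(f"ksoftirqd/1: {percent_cpu(t0, t1, 10):.1f}% CPU")
```

With the fabricated numbers this reports 28.5%, matching the top output quoted above; run in a loop around modprobe/rmmod it gives a less luck-dependent verification than a one-off glance at top.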