[Bug 1772264] [NEW] watchdog: BUG: soft lockup - CPU#5 stuck for 23s! [java:5783]

2018-05-20 Thread Launchpad Bug Tracker
You have been subscribed to a public bug by Sriharsha  BS (sribs):

Hello Team,

I have a customer who hits this issue about once every 2 days. Here are
the details of the bug:

May 14 05:24:21 localhost kernel: [6006808.160001] watchdog: BUG: soft lockup - 
CPU#5 stuck for 23s! [java:5783]
May 14 05:24:21 localhost kernel: [6006808.160055] Modules linked in: ufs msdos 
xfs ip6table_filter ip6_tables iptable_filter nf_conntrack_ipv4 nf_defrag_ipv4 
xt_owner xt_conntrack nf_conntrack iptable_security ip_tables x_tables udf 
crc_itu_t i2c_piix4 hv_balloon joydev i2c_core serio_raw ib_iser rdma_cm iw_cm 
ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 
btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx 
xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_generic 
crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel aes_x86_64 
crypto_simd glue_helper hid_hyperv cryptd hyperv_fb pata_acpi hv_utils 
cfbfillrect cfbimgblt ptp cfbcopyarea hid hyperv_keyboard hv_netvsc pps_core
May 14 05:24:21 localhost kernel: [6006808.160055] CPU: 5 PID: 5783 Comm: java 
Not tainted 4.13.0-1011-azure #14-Ubuntu
May 14 05:24:21 localhost kernel: [6006808.160055] Hardware name: Microsoft 
Corporation Virtual Machine/Virtual Machine, BIOS 090007  06/02/2017
May 14 05:24:21 localhost kernel: [6006808.160055] task: 8b91a48fc5c0 
task.stack: b5c4cd014000
May 14 05:24:21 localhost kernel: [6006808.160055] RIP: 
0010:fsnotify+0x1f9/0x4f0
May 14 05:24:21 localhost kernel: [6006808.160055] RSP: 0018:b5c4cd017e08 
EFLAGS: 0246 ORIG_RAX: ff0c
May 14 05:24:21 localhost kernel: [6006808.160055] RAX: 0001 RBX: 
8ba0f6246020 RCX: 
May 14 05:24:21 localhost kernel: [6006808.160055] RDX: 8ba0f6246048 RSI: 
 RDI: 9bc57020
May 14 05:24:21 localhost kernel: [6006808.160055] RBP: b5c4cd017ea8 R08: 
 R09: 
May 14 05:24:21 localhost kernel: [6006808.160055] R10: e93042d21080 R11: 
 R12: 
May 14 05:24:21 localhost kernel: [6006808.160055] R13: 8ba0f6246048 R14: 
 R15: 
May 14 05:24:21 localhost kernel: [6006808.160055] FS:  7f154838a700() 
GS:8ba0fd74() knlGS:
May 14 05:24:21 localhost kernel: [6006808.160055] CS:  0010 DS:  ES:  
CR0: 80050033
May 14 05:24:21 localhost kernel: [6006808.160055] CR2: 7f3fcc254000 CR3: 
000165c38000 CR4: 001406e0
May 14 05:24:21 localhost kernel: [6006808.160055] Call Trace:
May 14 05:24:21 localhost kernel: [6006808.160055]  ? new_sync_write+0xe5/0x140
May 14 05:24:21 localhost kernel: [6006808.160055]  vfs_write+0x15a/0x1b0
May 14 05:24:21 localhost kernel: [6006808.160055]  ? 
syscall_trace_enter+0xcd/0x2f0
May 14 05:24:21 localhost kernel: [6006808.160055]  SyS_write+0x55/0xc0
May 14 05:24:21 localhost kernel: [6006808.160055]  do_syscall_64+0x61/0xd0
May 14 05:24:21 localhost kernel: [6006808.160055]  
entry_SYSCALL64_slow_path+0x25/0x25
May 14 05:24:21 localhost kernel: [6006808.160055] RIP: 0033:0x7f489076a2dd. 
#“This indicates Softlockup error message stored in the RIP Register”
May 14 05:24:21 localhost kernel: [6006808.160055] RSP: 002b:7f15483872a0 
EFLAGS: 0293 ORIG_RAX: 0001
May 14 05:24:21 localhost kernel: [6006808.160055] RAX: ffda RBX: 
7f1548389380 RCX: 7f489076a2dd
May 14 05:24:21 localhost kernel: [6006808.160055] RDX: 1740 RSI: 
7f1548387310 RDI: 063c
May 14 05:24:21 localhost kernel: [6006808.160055] RBP: 7f15483872d0 R08: 
7f1548387310 R09: 7f418a55b0b8
May 14 05:24:21 localhost kernel: [6006808.160055] R10: 005b31ee R11: 
0293 R12: 1740
May 14 05:24:21 localhost kernel: [6006808.160055] R13: 7f1548387310 R14: 
063c R15: 7f4051e0
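
To check how often the lockup really occurs on each server, the watchdog lines can be pulled out of the saved kernel log. The following is only a minimal sketch: the function name find_soft_lockups is hypothetical, and it assumes each log entry is on a single line (the lines above are wrapped by the mail client and would need to be rejoined first).

```python
import re

# Pattern for the watchdog message format seen in the log above, e.g.
# "watchdog: BUG: soft lockup - CPU#5 stuck for 23s! [java:5783]"
SOFT_LOCKUP = re.compile(
    r"soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<seconds>\d+)s! "
    r"\[(?P<comm>[^\]]+):(?P<pid>\d+)\]"
)


def find_soft_lockups(lines):
    """Return one dict per soft lockup report found in the given log lines.

    Hypothetical helper for illustration; it only recognises the message
    format shown in this report.
    """
    events = []
    for line in lines:
        match = SOFT_LOCKUP.search(line)
        if match:
            events.append(
                {
                    "cpu": int(match["cpu"]),
                    "seconds": int(match["seconds"]),
                    "comm": match["comm"],
                    "pid": int(match["pid"]),
                }
            )
    return events
```

Running this over the kernel log of each server and counting the returned events would show whether the "once every 2 days" frequency holds across the fleet and whether the stuck task is always java.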

The customer is running Elasticsearch, so he opened an issue on the
Elasticsearch GitHub repository. The maintainers there say that this is a
kernel issue and not an Elasticsearch issue:

GitHub issue for reference:
https://github.com/elastic/elasticsearch/issues/30667

For now, as a temporary workaround, I have asked him to increase the
kernel.watchdog_thresh parameter from 10 to 20.
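
For context on what this change does: the kernel documentation gives the soft lockup threshold as 2 * watchdog_thresh, so the default of 10 reports a lockup after 20 seconds (consistent with the "stuck for 23s" message above), and a value of 20 delays the report until 40 seconds. The sketch below only illustrates that arithmetic; the function names are hypothetical, and the /proc path is the standard location of the sysctl on Linux.

```python
from pathlib import Path


def soft_lockup_threshold(watchdog_thresh):
    """Seconds a CPU must be stuck before a soft lockup is reported.

    The kernel documents this as twice the watchdog_thresh value.
    """
    return 2 * watchdog_thresh


def read_watchdog_thresh(path="/proc/sys/kernel/watchdog_thresh"):
    """Read the current kernel.watchdog_thresh value (Linux only)."""
    return int(Path(path).read_text().strip())


# Example (on a Linux host):
#   soft_lockup_threshold(read_watchdog_thresh())
```

Raising the value only makes the watchdog report later; it does not remove whatever is keeping the CPU busy in fsnotify.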

The customer wants to know for certain whether this is a kernel bug. I
have also asked him to update the kernel. He is willing to do so on all
65 servers, provided it is confirmed that this is a bug in the current
kernel.

The customer also filed a bug with the Java team, since the Java process
seems to be triggering the issue. Their reply was that it is a kernel
issue, and they pointed to the following Launchpad bug, although I
personally do not think it applies here. I may be wrong, however:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1730717

This is the information regarding the performance of the Java process within
the customer's CPU

Avg. Load: Avg=3, max=9