RE: Any known soft lockup issue with vfs_write()->fsnotify()?

2018-03-08 Thread Haiyang Zhang
There was another report of the same issue on CoreOS, 4.14.11-coreos. The 
host/guest is AWS G4. So the problem is not limited to Azure VMs. It doesn't 
happen on older kernel like 4.4. Maybe the problem is related to some (recent) 
changes on fsnotify or other fs code?

Soft lockup kernel panic reboot on AWS instance on fsnotify and vfs_write  #2356
https://github.com/coreos/bugs/issues/2356

Thanks,
- Haiyang

> -Original Message-
> From: Jan Kara 
> Sent: Monday, March 5, 2018 3:49 PM
> To: Dexuan Cui 
> Cc: linux-fsde...@vger.kernel.org; Jan Kara ; Amir Goldstein
> ; Miklos Szeredi ; Haiyang
> Zhang ; 'linux-kernel@vger.kernel.org'  ker...@vger.kernel.org>; Jork Loeser 
> Subject: Re: Any known soft lockup issue with vfs_write()->fsnotify()?
> 
> Hi!
> 
> On Fri 02-03-18 22:28:50, Dexuan Cui wrote:
> > Recently people are getting a soft lock issue with vfs_write()->fsnotify().
> > The detailed calltrace is available at:
> > https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithu
> >
> b.com%2Fcoreos%2Fbugs%2Fissues%2F2356&data=04%7C01%7Chaiyangz%40
> micros
> >
> oft.com%7Ca1b1bc6822c9442195ad08d582da7942%7C72f988bf86f141af91ab2
> d7cd
> >
> 011db47%7C1%7C0%7C636558797237925702%7CUnknown%7CTWFpbGZsb3d8
> eyJWIjoiM
> > C4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwifQ%3D%3D%7C-
> 2&sdata=pdwtsbU
> > 0%2FW3y7Zy%2BX%2Ffkbx%2FPktoKVBgimfxMyVk6Lyw%3D&reserved=0
> > https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithu
> >
> b.com%2Fcoreos%2Fbugs%2Fissues%2F2364&data=04%7C01%7Chaiyangz%40
> micros
> >
> oft.com%7Ca1b1bc6822c9442195ad08d582da7942%7C72f988bf86f141af91ab2
> d7cd
> >
> 011db47%7C1%7C0%7C636558797237925702%7CUnknown%7CTWFpbGZsb3d8
> eyJWIjoiM
> > C4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwifQ%3D%3D%7C-
> 2&sdata=w%2Bjed
> > u0yIYlpRut5sHa2%2Bhs5cdcdxp1dd3sHkyvRCPw%3D&reserved=0
> 
> I didn't see them yet.
> 
> > The kernel versions showing up the issue are:
> > 4.14.11-coreos
> > 4.14.19-coreos
> > 4.13.0-1009 -- this is the kernel with which I'm personally seeing the 
> > lockup.
> >
> > I have not got a chance to try the latest mainline kernel yet.
> 
> It would be good to try 4.15 kernel to see whether recent fixes from Miklos
> didn't fix your problem. They should be present in 4.14.11/19 kernels as well
> but one never knows...
> 
> > Before the lockup error message suddenly appears, Linux has been
> > running fine for many hours.  I have NOT found a consistent way to
> > reproduce the lockup yet.
> >
> > Looks the kernel is stuck in fsnotify(), when it tries to get the
> > fsnotify_mark_srcu lock.
> 
> It is not possible that we would 'hang' in srcu_read_lock() - that is just a 
> read of
> one variable and increment of another. We'd have to be looping somewhere
> and watchdog would have to happen to hit us always at that place. Weird. Are
> you sure RIP points to srcu_read_lock?
> 
> > "git log fs/notify/fsnotify.c" on the latest mainline shows that some
> > recent patches might help.
> >
> > I'd like to check if this is a known issue.
> 
> As I've mentioned above, so far I didn't see reports like this...
> 
>   Honza
> --
> Jan Kara 
> SUSE Labs, CR


Re: Any known soft lockup issue with vfs_write()->fsnotify()?

2018-03-05 Thread Jan Kara
Hi!

On Fri 02-03-18 22:28:50, Dexuan Cui wrote:
> Recently people are getting a soft lock issue with vfs_write()->fsnotify(). 
> The detailed calltrace is available at:
> https://github.com/coreos/bugs/issues/2356
> https://github.com/coreos/bugs/issues/2364

I didn't see them yet.

> The kernel versions showing up the issue are:
> 4.14.11-coreos 
> 4.14.19-coreos
> 4.13.0-1009 -- this is the kernel with which I'm personally seeing the lockup.
> 
> I have not got a chance to try the latest mainline kernel yet.

It would be good to try 4.15 kernel to see whether recent fixes from Miklos
didn't fix your problem. They should be present in 4.14.11/19 kernels as
well but one never knows...

> Before the lockup error message suddenly appears, Linux has been running
> fine for many hours.  I have NOT found a consistent way to reproduce the
> lockup yet.
> 
> Looks the kernel is stuck in fsnotify(), when it tries to get the
> fsnotify_mark_srcu lock.

It is not possible that we would 'hang' in srcu_read_lock() - that is
just a read of one variable and increment of another. We'd have to be
looping somewhere and watchdog would have to happen to hit us always at
that place. Weird. Are you sure RIP points to srcu_read_lock?

> "git log fs/notify/fsnotify.c" on the latest mainline shows that some
> recent patches might help.
> 
> I'd like to check if this is a known issue.

As I've mentioned above, so far I didn't see reports like this...

Honza
-- 
Jan Kara 
SUSE Labs, CR


Any known soft lockup issue with vfs_write()->fsnotify()?

2018-03-02 Thread Dexuan Cui
Hi,
Recently people are getting a soft lock issue with vfs_write()->fsnotify(). 
The detailed calltrace is available at:
https://github.com/coreos/bugs/issues/2356
https://github.com/coreos/bugs/issues/2364

The kernel versions showing up the issue are:
4.14.11-coreos 
4.14.19-coreos
4.13.0-1009 -- this is the kernel with which I'm personally seeing the lockup.

I have not got a chance to try the latest mainline kernel yet.

Before the lockup error message suddenly appears, Linux has been running fine 
for many hours.
I have NOT found a consistent way to reproduce the lockup yet.

Looks the kernel is stuck in fsnotify(), when it tries to get the 
fsnotify_mark_srcu lock.

"git log fs/notify/fsnotify.c" on the latest mainline shows that some recent 
patches might help.

I'd like to check if this is a known issue.

Looking forward to your insights!

Thanks,
-- Dexuan

For your convenience, this is a calltrace from the first link:

18h 30m 8.626s(   4ms): ip-172-45-43-199 login: [67361.641359] watchdog: BUG: 
soft lockup - CPU#10 stuck for 22s! [java:87260]
18h 42m 40.116s(751490ms): [67361.644600] Modules linked in: xfs xt_statistic 
xt_nat xt_recent ipt_REJECT nf_reject_ipv4 xt_comment xt_mark veth 
nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype iptable_filter 
xt_conntrack br_netfilter bridge stp llc ipt_MASQUERADE nf_nat_masquerade_ipv4 
iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack 
libcrc32c crc32c_generic vxlan ip6_udp_tunnel udp_tunnel overlay mousedev 
psmouse sb_edac i2c_piix4 i2c_core evdev edac_core button xenfs xen_privcmd 
sch_fq_codel nls_ascii nls_cp437 vfat fat dm_verity dm_bufio ext4 crc16 mbcache 
jbd2 fscrypto crc32c_intel ata_piix aesni_intel xen_blkfront libata aes_x86_64 
crypto_simd cryptd glue_helper scsi_mod ixgbevf dm_mirror dm_region_hash dm_log 
dm_mod dax
18h 42m 40.142s(  26ms): [67361.668103] CPU: 10 PID: 87260 Comm: java Not 
tainted 4.14.11-coreos #1
18h 42m 40.142s(   0ms): [67361.670391] Hardware name: Xen HVM domU, BIOS 
4.2.amazon 08/24/2006
18h 42m 40.144s(   2ms): [67361.672581] task: 90d6009dbc80 task.stack: 
b2388f704000
18h 42m 40.149s(   5ms): [67361.674604] RIP: 0010:fsnotify+0x166/0x520
18h 42m 40.149s(   0ms): [67361.675971] RSP: 0018:b2388f707e10 EFLAGS: 
0202 ORIG_RAX: ff0c
18h 42m 40.150s(   1ms): [67361.678462] RAX: 0001 RBX: 
 RCX: 
18h 42m 40.152s(   2ms): [67361.680986] RDX: 0001 RSI: 
0002 RDI: 907294c0
18h 42m 40.157s(   5ms): [67361.683340] RBP: 90d2eddffed8 R08: 
 R09: 
18h 42m 40.157s(   0ms): [67361.685709] R10: dd941da7a100 R11: 
 R12: 90d2eddfff00
18h 42m 40.159s(   2ms): [67361.688199] R13:  R14: 
 R15: 
18h 42m 40.165s(   6ms): [67361.690579] FS:  7f491c3f4700() 
GS:90d6ef88() knlGS:
18h 42m 40.165s(   0ms): [67361.693227] CS:  0010 DS:  ES:  CR0: 
80050033
18h 42m 40.166s(   1ms): [67361.695206] CR2: 00c421288300 CR3: 
0005ba5fc002 CR4: 001606e0
18h 42m 40.175s(   9ms): [67361.697655] Call Trace:
18h 42m 40.175s(   0ms): [67361.698499]  vfs_write+0x14f/0x1a0
18h 42m 40.175s(   0ms): [67361.699656]  SyS_write+0x52/0xc0
18h 42m 40.175s(   0ms): [67361.700745]  do_syscall_64+0x59/0x1c0
18h 42m 40.175s(   0ms): [67361.701996]  entry_SYSCALL64_slow_path+0x25/0x25
18h 42m 40.175s(   0ms): [67361.703536] RIP: 0033:0x7f4b5566643d
18h 42m 40.176s(   1ms): [67361.704751] RSP: 002b:7f491c3f0ef0 EFLAGS: 
0293 ORIG_RAX: 0001
18h 42m 40.179s(   3ms): [67361.707375] RAX: ffda RBX: 
0032 RCX: 7f4b5566643d
18h 42m 40.188s(   9ms): [67361.709849] RDX: 00f0 RSI: 
7f491c3f0f50 RDI: 0f40
18h 42m 40.188s(   0ms): [67361.712205] RBP: 7f491c3f0f20 R08: 
7f491c3f1030 R09: 0005f4ff70d8
18h 42m 40.188s(   0ms): [67361.714573] R10: 00052f06 R11: 
0293 R12: 00f0
18h 42m 40.188s(   0ms): [67361.716924] R13: 7f491c3f0f50 R14: 
0f40 R15: 
18h 42m 40.191s(   3ms): [67361.719331] Code: 40 4c 89 7c 24 48 4c 89 7c 24 08 
8b 44 24 18 25 00 00 03 00 89 44 24 34 4d 85 e4 0f 95 c2 48 83 7c 24 08 00 0f 
95 c1 89 c8 08 d0 <0f> 84 96 03 00 00 84 d2 0f 84 e8 02 00 00 48 8b 44 24 40 84 
c9