Re: [Lustre-discuss] soft lockups on NFS server/Lustre client

2009-11-20 Thread Frederik Ferner
Hi All,

just a quick follow-up on this.

Frederik Ferner wrote:
> Robin Humble wrote:
[snip]
> I see this has been closed as duplicate of
>  https://bugzilla.redhat.com/show_bug.cgi?id=499019
> which is unfortunately not accessible to me.
> 
> On the other hand Red Hat support have just pointed me at this bug as 
> well and confirmed that it is not yet fixed in RHEL5.4. 

As you can see in this bug, Red Hat have provided a test kernel which 
we've been using on a number of machines without being able to reproduce 
the problem.

>> Lustre 1.6.6 isn't exactly recent. have you tried 1.6.7.2 on your NFS
>> exporter?

Now we've tried 1.6.7.2 on the NFS/Samba exporter and we still saw the 
soft lockups until we upgraded to the test kernel mentioned above.

NB we had to upgrade the Samba exporters to 1.6.7.2 anyway: after we 
turned on flock we hit an LBUG there that is fixed in 1.6.7.2.
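
For reference, flock support on a Lustre 1.6 client is enabled with the 
flock mount option. This is only a sketch: the MGS node, filesystem name 
and mount point below are placeholders, not taken from this thread. A 
quick probe with util-linux flock(1) on an ordinary file shows whether 
lock requests succeed on a given filesystem:

```shell
# Hypothetical mount line enabling flock on a Lustre client
# (placeholder names; commented out since it needs a live MGS):
#   mount -t lustre -o flock mgsnode@tcp0:/testfs /mnt/lustre

# Probe whether flock() works where a temporary file lives; on a
# Lustre client mounted without "-o flock", taking the lock fails.
tmp=$(mktemp)
if flock -n "$tmp" true; then
    echo "flock works here"
else
    echo "flock not supported here"
fi
rm -f "$tmp"
```

(Lustre 1.6 also had a localflock option giving client-local locks only; 
flock makes them coherent across clients, at some performance cost.)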

So to summarize, the soft lockups are a bug in the RHEL kernel and will 
hopefully be fixed in an official update.

Kind regards,
Frederik
-- 
Frederik Ferner
Computer Systems Administrator  phone: +44 1235 77 8624
Diamond Light Source Ltd.   mob:   +44 7917 08 5110
(Apologies in advance for the lines below. Some bits are a legal
requirement and I have no control over them.)

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] soft lockups on NFS server/Lustre client

2009-10-20 Thread Frederik Ferner
Robin Humble wrote:
> On Mon, Oct 12, 2009 at 05:06:28PM +0100, Frederik Ferner wrote:
>> on our NFS server exporting our Lustre file system to a number of NFS 
>> clients, we've recently started to see "kernel: BUG: soft lockup" 
>> messages. As the locked processes include nfsd, our users are obviously 
>> not happy.
>>
>> Around the time when the soft lockup occurs we also see a lot of 
>> "kernel: BUG: warning at fs/inotify.c:181/set_dentry_child_flags()" 
>> messages, but I don't know if this is related.
> 
> probably not related. we were seeing this too (no NFS involved at all)

I may have been looking at slightly the wrong thing here. It was first 
reported by our users as an NFS problem, but it now seems to be triggered 
by Samba access to some directories on Lustre. We've separated the Samba 
server from the NFS server, and now we only see this on the Samba server 
and not on the NFS server.

>   https://bugzilla.redhat.com/show_bug.cgi?id=526853
> but it's probably being ignored. if you have a rhel support contract
> maybe you can kick it along a bit...

I see this has been closed as duplicate of
 https://bugzilla.redhat.com/show_bug.cgi?id=499019
which is unfortunately not accessible to me.

On the other hand Red Hat support have just pointed me at this bug as 
well and confirmed that it is not yet fixed in RHEL5.4.

> dunno about your soft lockups. as I understand it soft lockups
> themselves aren't harmful as long as they progress eventually.

Well, they are not harmful as such; my problem is that they seem to 
block the machine for some time, and users complained about applications 
timing out when this affected the file system.

> Lustre 1.6.6 isn't exactly recent. have you tried 1.6.7.2 on your NFS
> exporter?

I know; until recently we did not have any real problems with 1.6.6, and 
the machines are in production. I'm currently trying to reproduce it in 
our test setup and may try 1.6.7.2 with an additional test machine on 
the production system as Samba exporter during the next maintenance 
window. On the other hand, it now really looks like a RHEL bug, so I'm 
not too sure how much that would help.

> presumably soft lockups could also be saying your re-exporter or OSS's
> are overloaded or that you have a slow disk or 3 in a RAID... without
> NFS involved are all your OSTs up to speed?

I think the OSTs are not the problem here, as I'm not experiencing any 
problems on any of my other Lustre clients, and no longer on the NFS 
server, which is seeing more load than the Samba server.

> do you still get problems after
>   echo 60 > /proc/sys/kernel/softlockup_thresh

After applying this on the Samba server, I only see the BUG warnings in 
syslog and not the soft lockups; still, my Windows clients seem to 
freeze occasionally for about a minute when browsing the exported file 
system, so no change on the client side.
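
(For anyone replaying this, the tweak quoted above amounts to the sketch 
below. It assumes a RHEL5-era kernel where the tunable still exists; 
later kernels dropped it, and raising it needs root.)

```shell
# Inspect, and optionally raise, the soft-lockup watchdog threshold.
f=/proc/sys/kernel/softlockup_thresh
if [ -r "$f" ]; then
    echo "current threshold: $(cat "$f")s"
    # echo 60 > "$f"       # raise to 60 seconds (root only)
else
    echo "softlockup_thresh not available on this kernel"
fi
# To persist across reboots, /etc/sysctl.conf would carry:
#   kernel.softlockup_thresh = 60
```

Note that raising the threshold only quiets the watchdog's reporting; it 
does not remove the underlying stall, which fits the client-side freezes 
still being visible.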

Cheers,
Frederik



Re: [Lustre-discuss] soft lockups on NFS server/Lustre client

2009-10-18 Thread Robin Humble
On Mon, Oct 12, 2009 at 05:06:28PM +0100, Frederik Ferner wrote:
>Hi List,
>
>on our NFS server exporting our Lustre file system to a number of NFS 
>clients, we've recently started to see "kernel: BUG: soft lockup" 
>messages. As the locked processes include nfsd, our users are obviously 
>not happy.
>
>Around the time when the soft lockup occurs we also see a lot of 
>"kernel: BUG: warning at fs/inotify.c:181/set_dentry_child_flags()" 
>messages, but I don't know if this is related.

probably not related. we were seeing this too (no NFS involved at all)
  https://bugzilla.lustre.org/show_bug.cgi?id=20904
and the upshot is that I'm pretty sure it's harmless and a RHEL bug.
I filed
  https://bugzilla.redhat.com/show_bug.cgi?id=526853
but it's probably being ignored. if you have a rhel support contract
maybe you can kick it along a bit...

dunno about your soft lockups. as I understand it soft lockups
themselves aren't harmful as long as they progress eventually.

Lustre 1.6.6 isn't exactly recent. have you tried 1.6.7.2 on your NFS
exporter?

presumably soft lockups could also be saying your re-exporter or OSS's
are overloaded or that you have a slow disk or 3 in a RAID... without
NFS involved are all your OSTs up to speed?

do you still get problems after
  echo 60 > /proc/sys/kernel/softlockup_thresh

cheers,
robin

>[snip rest of original report and logs]

[Lustre-discuss] soft lockups on NFS server/Lustre client

2009-10-12 Thread Frederik Ferner
Hi List,

on our NFS server exporting our Lustre file system to a number of NFS 
clients, we've recently started to see "kernel: BUG: soft lockup" 
messages. As the locked processes include nfsd, our users are obviously 
not happy.
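
(For context, a setup like this is typically configured with an 
/etc/exports entry on the Lustre client acting as NFS server. The 
fragment below is purely illustrative: the path, client subnet and fsid 
value are placeholders, not from this thread, and the explicit fsid= is 
an assumption based on nfsd wanting a stable filesystem id when the 
exported filesystem has no usable block device.)

```
# Hypothetical /etc/exports entry re-exporting a Lustre mount over NFS
# (placeholder path, client subnet and fsid):
/mnt/lustre  192.168.0.0/24(rw,sync,no_subtree_check,fsid=1)

# After editing, the export table is reloaded with:  exportfs -ra
```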

Around the time when the soft lockup occurs we also see a lot of 
"kernel: BUG: warning at fs/inotify.c:181/set_dentry_child_flags()" 
messages, but I don't know if this is related.

We are using Lustre 1.6.6 on all machines (MDS, OSS, clients). The NFS 
server/Lustre client with the lockups is running RHEL5.4 with an 
unpatched Red Hat kernel (kernel-2.6.18-92.1.10.el5) and the Lustre 
modules from Sun.

See below for sample logs from the Lustre client/NFS server. I can 
provide more logs if required.

I'm not sure if this is a Lustre issue, but I would appreciate it if 
someone could help. We've not seen it on any other NFS server so far, 
and there seems to be at least some Lustre-related stuff in the stack 
trace.

Is this a known issue, and how can we avoid it? I have not found 
anything using Google or the search on bugzilla.lustre.org. At least 
the BUG warning seems to be a known issue on this kernel.

I hope the logs below are readable enough; I tried to find entries where 
the stack traces don't overlap, but this seems to be the best I can find.

Oct  9 15:21:27 cs04r-sc-serv-07 kernel: BUG: warning at 
fs/inotify.c:181/set_dentry_child_flags() (Tainted: G )
Oct  9 15:21:27 cs04r-sc-serv-07 kernel:
Oct  9 15:21:27 cs04r-sc-serv-07 kernel: Call Trace:
Oct  9 15:21:27 cs04r-sc-serv-07 kernel:  [] 
set_dentry_child_flags+0xef/0x14d
Oct  9 15:21:27 cs04r-sc-serv-07 kernel:  [] 
remove_watch_no_event+0x38/0x47
Oct  9 15:21:27 cs04r-sc-serv-07 kernel:  [] 
inotify_remove_watch_locked+0x18/0x3b
Oct  9 15:21:27 cs04r-sc-serv-07 kernel:  [] 
inotify_rm_wd+0x7e/0xa1
Oct  9 15:21:27 cs04r-sc-serv-07 kernel:  [] 
sys_inotify_rm_watch+0x46/0x63
Oct  9 15:21:27 cs04r-sc-serv-07 kernel:  [] 
tracesys+0xd5/0xe0
Oct  9 15:21:27 cs04r-sc-serv-07 kernel:
Oct  9 15:21:27 cs04r-sc-serv-07 kernel: BUG: warning at 
fs/inotify.c:181/set_dentry_child_flags() (Tainted: G )
Oct  9 15:21:27 cs04r-sc-serv-07 kernel:
Oct  9 15:21:27 cs04r-sc-serv-07 kernel: Call Trace:
Oct  9 15:21:27 cs04r-sc-serv-07 kernel:  [] 
set_dentry_child_flags+0xef/0x14d
Oct  9 15:21:27 cs04r-sc-serv-07 kernel:  [] 
remove_watch_no_event+0x38/0x47
Oct  9 15:21:27 cs04r-sc-serv-07 kernel:  [] 
inotify_remove_watch_locked+0x18/0x3b
Oct  9 15:21:27 cs04r-sc-serv-07 kernel:  [] 
inotify_rm_wd+0x7e/0xa1
Oct  9 15:21:27 cs04r-sc-serv-07 kernel:  [] 
sys_inotify_rm_watch+0x46/0x63
Oct  9 15:21:27 cs04r-sc-serv-07 kernel: BUG: soft lockup - CPU#5 stuck 
for 10s! [nfsd:1]
Oct  9 15:21:28 cs04r-sc-serv-07 kernel: CPU 5:
Oct  9 15:21:28 cs04r-sc-serv-07 kernel: Modules linked in: vfat fat 
usb_storage dell_rbu mptctl ipmi_devintf ipmi_si ipmi_msghandler nfs 
fscache nfsd exportfs lockd nfs_acl auth_rpcgss autofs4 hidp mgc(U) 
lustre(U) lov(U) mdc(U) lquota(U) osc(U) ksocklnd(U) ptlrpc(U) ob
dclass(U) lnet(U) lvfs(U) libcfs(U) rfcomm l2cap bluetooth sunrpc ipv6 
xfrm_nalgo crypto_api mlx4_en(U) dm_multipath video sbs backlight i2c_ec 
i2c_core button battery asus_acpi acpi_memhotplug ac parport_pc lp 
parport joydev sr_mod cdrom mlx4_core(U) bnx2 serio_raw pcsp
kr sg dm_snapshot dm_zero dm_mirror dm_mod ata_piix libata shpchp mptsas 
mptscsih mptbase scsi_transport_sas sd_mod scsi_mod ext3 jbd uhci_hcd 
ohci_hcd ehci_hcd
Oct  9 15:21:28 cs04r-sc-serv-07 kernel: Pid: 1, comm: nfsd Tainted: 
G  2.6.18-92.1.10.el5 #1
Oct  9 15:21:28 cs04r-sc-serv-07 kernel: RIP: 0010:[] 
  [] .text.lock.spinlock+0x5/0x30
Oct  9 15:21:28 cs04r-sc-serv-07 kernel: RSP: 0018:810044241ac8 
EFLAGS: 0286
Oct  9 15:21:28 cs04r-sc-serv-07 kernel: RAX: 81006cb6a1a8 RBX: 
81006cb6a178 RCX: 810044241b50
Oct  9 15:21:28 cs04r-sc-serv-07 kernel: RDX:  RSI: 
810044241c90 RDI: 803c7480
Oct  9 15:21:28 cs04r-sc-serv-07 kernel: RBP: 81005d609e90 R08: 
0001 R09: 810044241b50
Oct  9 15:21:28 cs04r-sc-serv-07 kernel: R10: 887cf72a R11: 
000189ef R12: 00a8
Oct  9 15:21:28 cs04r-sc-serv-07 kernel: R13: 810044241c90 R14: 
 R15: 8001d54c
Oct  9 15:21:28 cs04r-sc-serv-07 kernel: FS:  2b637558e6e0() 
GS:810037c0c540() knlGS:
Oct  9 15:21:28 cs04r-sc-serv-07 kernel: CS:  0010 DS:  ES:  
CR0: 8005003b
Oct  9 15:21:28 cs04r-sc-serv-07 kernel: CR2: 2b473a3a4000 CR3: 
6934d000 CR4: 06e0
Oct  9 15:21:28 cs04r-sc-serv-07 kernel:
Oct  9 15:21:28 cs04r-sc-serv-07 kernel: Call Trace:
Oct  9 15:21:28 cs04r-sc-serv-07 kernel:  [] 
d_find_alias+0x1c/0x38
Oct  9 15:21:28 cs04r-sc-serv-07 kernel:  [] 
d_alloc_anon+0xc/0xf8
Oct  9 15:21:28 cs04r-sc-serv-07 kernel:  [] 
:lustre:ll_iget_for_nfs+0x608/0x7e0
Oct  9 15:21:28 cs04r-sc-serv-07 kernel:  [] 
:exportfs:find_exported_dentry+0x43/0x47b