All,

I'd like to follow up on this as I can now repeatedly reproduce this on 
our test file system. I've managed to reproduce it on every version I've 
tried so far, up to Lustre 1.8.6-wc1 on the MDS.

I've also reported it as LU-534 
(http://jira.whamcloud.com/browse/LU-534) and included current stack 
traces etc.

I'll repeat the basic instructions on how to reproduce it here:

Export a Lustre file system via NFS (v3) from a Lustre client, mount it 
on another system over NFS, and run racer on the file system over NFS. 
After a few minutes (sometimes one or two hours) the MDS LBUGs with the 
ASSERTION in the subject.
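
For illustration, here is a rough sketch of the steps; the hostnames, 
mount points and the racer path are only placeholders and will differ 
on other setups:

    # on the Lustre client that acts as the NFS server
    mount -t lustre mds01@tcp:/testfs /mnt/lustre
    echo '/mnt/lustre *(rw,no_root_squash,sync)' >> /etc/exports
    exportfs -ra

    # on a second machine, mount over NFSv3 and start racer there
    mount -t nfs -o vers=3 nfs-exporter:/mnt/lustre /mnt/nfs
    mkdir -p /mnt/nfs/racer-test
    sh /usr/lib64/lustre/tests/racer/racer.sh /mnt/nfs/racer-test

    # after a few minutes (sometimes one or two hours) the MDS hits the
    # ASSERTION(!mds_inode_is_orphan(dchild->d_inode)) LBUG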

If anyone has any suggestions for debug flags to enable, or other ideas 
on how to track down the exact problem, I'd like to hear them.
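
To make this more concrete, below is roughly what I had in mind on the 
MDS, using the standard lctl debug controls; the flags and buffer size 
are only examples, and which combination is actually useful here is 
exactly what I'm unsure about:

    # widen the debug mask and enlarge the debug buffer on the MDS
    # before starting the NFS/racer load
    lctl set_param debug="+inode +dlmtrace +rpctrace +vfstrace"
    lctl set_param debug_mb=256

    # once the LBUG hits, dump the in-memory debug log as well, in
    # addition to the automatic /tmp/lustre-log.* file from the LBUG
    lctl dk /tmp/lustre-debug-$(date +%s).log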

Kind regards,
Frederik

On 08/07/11 14:26, Frederik Ferner wrote:
> All,
>
> we are experiencing what looks like the same MDS LBUG with increasing
> frequency; see below for a sample stack trace. This seems to affect only
> one client at a time, and even this client will recover after some time
> (usually minutes but sometimes longer) and continue to work without
> requiring an immediate MDS reboot.
>
> In the recent past, it seems to have affected one specific client more
> often than others. This client mainly acts as an NFS exporter for the
> Lustre file system. All attempts to trigger the LBUG with known actions
> have been unsuccessful so far. Attempts to trigger it on the test file
> system have been equally unsuccessful, but we are still working on this.
>
> As far as I can see, this could be
> https://bugzilla.lustre.org/show_bug.cgi?id=17764, but there has been no
> recent activity there, and I'm not entirely sure it is the same bug.
>
> As far as I can see, the log dumps don't contain any useful information,
> but I'm happy to provide a sample file if someone offers to look at it.
>
> I'm also looking for suggestions on how to go about debugging this
> problem, ideally initially with as little performance impact as possible,
> so we might apply it on the production system until we can reproduce the
> problem on a test file system. Once we can reproduce it on the test file
> system, debugging with performance implications should be possible as well.
>
> The MDS and clients are currently running Lustre 1.8.3.ddn3.3 on Red Hat
> Enterprise Linux 5.
>
>> Jul  6 11:54:05 cs04r-sc-mds01-01-10ge kernel: LustreError: 
>> 4037:0:(mds_open.c:1295:mds_open()) 
>> ASSERTION(!mds_inode_is_orphan(dchild->d_inode)) failed: dchild 
>> 8ad94b2:0cae8d46 (ffff8101995b0300) inode ffff81041d4e8548/145593522/21276602
>> 2
>> Jul  6 11:54:05 cs04r-sc-mds01-01-10ge kernel: LustreError: 
>> 4037:0:(mds_open.c:1295:mds_open()) LBUG
>> Jul  6 11:54:05 cs04r-sc-mds01-01-10ge kernel: Lustre: 
>> 4037:0:(linux-debug.c:264:libcfs_debug_dumpstack()) showing stack for 
>> process 4037
>> Jul  6 11:54:05 cs04r-sc-mds01-01-10ge kernel: ll_mdt_49     R  running task 
>>       0  4037      1          4038  4036 (L-TLB)
>> Jul  6 11:54:05 cs04r-sc-mds01-01-10ge kernel:  ffff810226da0d00 
>> ffff810247120000 0000000000000286 0000000000000082
>> Jul  6 11:54:05 cs04r-sc-mds01-01-10ge kernel:  0000008100001400 
>> ffff8101db219ef8 0000000000000001 0000000000000001
>> Jul  6 11:54:05 cs04r-sc-mds01-01-10ge kernel:  ffff8101ead74db8 
>> 0000000000000000 ffff810423223e10 ffffffff8008aee7
>> Jul  6 11:54:05 cs04r-sc-mds01-01-10ge kernel: Call Trace:
>> Jul  6 11:54:05 cs04r-sc-mds01-01-10ge kernel:  [<ffffffff8008aee7>] 
>> __wake_up_common+0x3e/0x68
>> Jul  6 11:54:05 cs04r-sc-mds01-01-10ge kernel:  [<ffffffff887acee8>] 
>> :ptlrpc:ptlrpc_main+0x1258/0x1420
>> Jul  6 11:54:05 cs04r-sc-mds01-01-10ge kernel:  [<ffffffff8008cabd>] 
>> default_wake_function+0x0/0xe
>> Jul  6 11:54:05 cs04r-sc-mds01-01-10ge kernel:  [<ffffffff800b7310>] 
>> audit_syscall_exit+0x336/0x362
>> Jul  6 11:54:05 cs04r-sc-mds01-01-10ge kernel:  [<ffffffff8005dfb1>] 
>> child_rip+0xa/0x11
>> Jul  6 11:54:05 cs04r-sc-mds01-01-10ge kernel:  [<ffffffff887abc90>] 
>> :ptlrpc:ptlrpc_main+0x0/0x1420
>> Jul  6 11:54:05 cs04r-sc-mds01-01-10ge kernel:  [<ffffffff8005dfa7>] 
>> child_rip+0x0/0x11
>> Jul  6 11:54:05 cs04r-sc-mds01-01-10ge kernel:
>> Jul  6 11:54:05 cs04r-sc-mds01-01-10ge kernel: LustreError: dumping log to 
>> /tmp/lustre-log.1309949645.4037
>
> Kind regards,
> Frederik


-- 
Frederik Ferner
Computer Systems Administrator          phone: +44 1235 77 8624
Diamond Light Source Ltd.               mob:   +44 7917 08 5110