Re: [Lustre-discuss] MDS crashes daily at the same hour

2010-01-25 Thread Brian J. Murrell
On Sun, 2010-01-24 at 22:54 -0700, Andreas Dilger wrote: 
 
 If they are call traces due to the watchdog timer, then this is somewhat
 expected for extremely high load.

Andreas,

Do you know whether adaptive timeouts take care of setting the timeout
appropriately on watchdogs?

b.





Re: [Lustre-discuss] MDS crashes daily at the same hour

2010-01-25 Thread Johann Lombardi
On Mon, Jan 25, 2010 at 08:51:59AM -0500, Brian J. Murrell wrote:
 Do you know whether adaptive timeouts take care of setting the timeout
 appropriately on watchdogs?

Yes, the watchdog timer is updated based on the estimated RPC service
time (multiplied by a factor, usually 2).
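
For illustration only (hypothetical numbers, not taken from this thread):
with an estimated RPC service time of 30s and the usual factor of 2, the
watchdog would only fire after a request had been stuck for roughly 60s.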

Johann


Re: [Lustre-discuss] MDS crashes daily at the same hour

2010-01-25 Thread Christopher J. Walker
Brian J. Murrell wrote:
 On Sun, 2010-01-24 at 22:54 -0700, Andreas Dilger wrote: 
 If they are call traces due to the watchdog timer, then this is somewhat
 expected for extremely high load.
 
 Andreas,
 
 Do you know whether adaptive timeouts take care of setting the timeout
 appropriately on watchdogs?
 

I don't think this is quite what you are asking, but here are some details
on our setup.

We have a mixture of 1.6.7.2 clients and 1.8.1.1 clients. The 1.6.7.2 
clients were not using adaptive timeouts when the problem occurred[1]. 
At least one of the 1.6 machines gets regularly swamped with network 
traffic, leading to packet loss.

It was 40 1.8.1.1 clients running updatedb that caused the problem.

Chris

[1] One machine is the interface to the outside world, and runs 
1.6.7.2. I see packet loss to this machine at times and have observed 
Lustre hanging for a while. I suspect the problem is that it is 
occasionally overloaded with network packets; Lustre packets are then 
lost (probably at the router), followed by a timeout and recovery. I've 
now enabled adaptive timeouts on this machine, and will install a 
10GigE card too.
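
As a rough sketch of how to check that (file locations recalled from the
1.8-era docs, so verify against your own version), adaptive timeouts are in
effect when at_max is non-zero on both clients and servers:

  # on a 1.6.5+/1.8.x client or server
  cat /proc/sys/lustre/at_min /proc/sys/lustre/at_max
  # at_max = 0 means adaptive timeouts are disabled.
  # A persistent, filesystem-wide setting can reportedly be pushed from the
  # MGS with something like the following (treat the exact conf_param
  # syntax as an assumption and check the manual for your release):
  #   lctl conf_param <fsname>.sys.at_max=600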


Re: [Lustre-discuss] MDS crashes daily at the same hour

2010-01-25 Thread Brian J. Murrell
On Mon, 2010-01-25 at 15:09 +0100, Johann Lombardi wrote: 
 
 Yes, the watchdog timer is updated based on the estimated RPC service
 time (multiplied by a factor, usually 2).

Ahhh.  Great.  It would be interesting to know which Lustre release the
poster who is seeing the stack traces is running.

b.





Re: [Lustre-discuss] MDS crashes daily at the same hour

2010-01-24 Thread Andreas Dilger
On 2010-01-23, at 15:12, Christopher J. Walker wrote:
 I don't see an LBUG in my logs, but there are several Call Traces.
 Would it be useful if I filed a bug too, or should I add to David's bug
 if you'd prefer? If so, can you let me know the bug number, as I can't
 find it in bugzilla.  Would you like /tmp/lustre-log.* too?


If they are call traces due to the watchdog timer, then this is somewhat
expected for extremely high load.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



Re: [Lustre-discuss] MDS crashes daily at the same hour

2010-01-23 Thread Christopher J. Walker
Brian J. Murrell wrote:
 On Wed, 2010-01-06 at 11:25 +0200, David Cohen wrote: 
 It was indeed the *locate update; a simple edit of /etc/updatedb.conf on the 
 clients and the system is stable again.
 

I've just encountered the same thing: the MDS crashing at the same time 
several times this week. It happens just after the *locate update. I've added 
lustre to the excluded filesystems, and so far so good.
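
In case it helps anyone else, the edit in question is just adding "lustre" to
the PRUNEFS line of /etc/updatedb.conf on every client; the exact contents of
the line below are only a sketch, as the default list varies by distribution:

  # /etc/updatedb.conf on each Lustre client
  # add "lustre" (and keep whatever else your distro already lists)
  PRUNEFS = "lustre nfs nfs4 afs iso9660 proc sysfs tmpfs udf"

After that, the nightly updatedb run skips Lustre mounts instead of having
every client walk the whole filesystem at the same time.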

 Great.  But as Andreas said previously, load should not have caused the
 LBUG that you got.  Could you open a bug on our bugzilla about that?
 Please attach to that bug an excerpt from the tech-mds log that covers a
 window of 12 hours prior to the LBUG and an hour after.
 

I don't see an LBUG in my logs, but there are several Call Traces. Would 
it be useful if I filed a bug too, or should I add to David's bug if you'd 
prefer? If so, can you let me know the bug number, as I can't find it 
in bugzilla.  Would you like /tmp/lustre-log.* too?

Chris


Re: [Lustre-discuss] MDS crashes daily at the same hour

2010-01-22 Thread Andreas Dilger
On 2010-01-06, at 04:25, David Cohen wrote:
 On Monday 04 January 2010 20:42:12 Andreas Dilger wrote:
 On 2010-01-04, at 03:02, David Cohen wrote:
 I'm using a mixed environment of a 1.8.0.1 MDS and 1.6.6 OSSs (we had a
 problem with qlogic drivers and rolled back to 1.6.6).
 My MDS becomes unresponsive each day at 4-5 am local time, with no kernel
 panic or error messages beforehand.

 It was indeed the *locate update; a simple edit of /etc/updatedb.conf
 on the clients and the system is stable again.

I asked the upstream Fedora/RHEL maintainer of mlocate to add lustre  
to the exception list in updatedb.conf, and he has already done so for  
Fedora.  There is also a bug filed for RHEL5 to do the same, if anyone  
is interested in following it:

https://bugzilla.redhat.com/show_bug.cgi?id=557712

 Judging by the time, I'd guess this is slocate or mlocate running
 on all of your clients at the same time.  This used to be a source of
 extremely high load back in the old days, but I thought that Lustre
 was in the exclude list in newer versions of *locate.  Looking at the
 installed mlocate on my system, that doesn't seem to be the case...
 strange.

 Some errors and an LBUG appear in the log after force-booting the MDS
 and mounting the MDT; then the log is clear until the next morning:

 Jan  4 06:33:31 tech-mds kernel: LustreError: 6357:0:
 (class_hash.c:225:lustre_hash_findadd_unique_hnode())
 ASSERTION(hlist_unhashed(hnode)) failed
 Jan  4 06:33:31 tech-mds kernel: LustreError: 6357:0:
 (class_hash.c:225:lustre_hash_findadd_unique_hnode()) LBUG
 Jan  4 06:33:31 tech-mds kernel: Lustre: 6357:0:(linux-
 debug.c:222:libcfs_debug_dumpstack()) showing stack for process 6357
 Jan  4 06:33:31 tech-mds kernel: ll_mgs_02  R  running task    0  6357  16340 (L-TLB)
 Jan  4 06:33:31 tech-mds kernel: Call Trace:
 Jan  4 06:33:31 tech-mds kernel: thread_return+0x62/0xfe
 Jan  4 06:33:31 tech-mds kernel: __wake_up_common+0x3e/0x68
 Jan  4 06:33:31 tech-mds kernel: :ptlrpc:ptlrpc_main+0x1218/0x13e0
 Jan  4 06:33:31 tech-mds kernel: default_wake_function+0x0/0xe
 Jan  4 06:33:31 tech-mds kernel: audit_syscall_exit+0x31b/0x336
 Jan  4 06:33:31 tech-mds kernel: child_rip+0xa/0x11
 Jan  4 06:33:31 tech-mds kernel: :ptlrpc:ptlrpc_main+0x0/0x13e0
 Jan  4 06:33:31 tech-mds kernel: child_rip+0x0/0x11

 It shouldn't LBUG during recovery, however.

 Cheers, Andreas
 --
 Andreas Dilger
 Sr. Staff Engineer, Lustre Group
 Sun Microsystems of Canada, Inc.


 -- 
 David Cohen
 Grid Computing
 Physics Department
 Technion - Israel Institute of Technology


Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



Re: [Lustre-discuss] MDS crashes daily at the same hour

2010-01-06 Thread David Cohen
On Monday 04 January 2010 20:42:12 Andreas Dilger wrote:
 On 2010-01-04, at 03:02, David Cohen wrote:
  I'm using a mixed environment of a 1.8.0.1 MDS and 1.6.6 OSSs (we had a
  problem with qlogic drivers and rolled back to 1.6.6).
  My MDS becomes unresponsive each day at 4-5 am local time, with no kernel
  panic or error messages beforehand.

It was indeed the *locate update; a simple edit of /etc/updatedb.conf on the 
clients and the system is stable again.
Many thanks.


 
 Judging by the time, I'd guess this is slocate or mlocate running
 on all of your clients at the same time.  This used to be a source of
 extremely high load back in the old days, but I thought that Lustre
 was in the exclude list in newer versions of *locate.  Looking at the
 installed mlocate on my system, that doesn't seem to be the case...
 strange.
 
  Some errors and an LBUG appear in the log after force-booting the MDS
  and mounting the MDT; then the log is clear until the next morning:
 
  Jan  4 06:33:31 tech-mds kernel: LustreError: 6357:0:
  (class_hash.c:225:lustre_hash_findadd_unique_hnode())
  ASSERTION(hlist_unhashed(hnode)) failed
  Jan  4 06:33:31 tech-mds kernel: LustreError: 6357:0:
  (class_hash.c:225:lustre_hash_findadd_unique_hnode()) LBUG
  Jan  4 06:33:31 tech-mds kernel: Lustre: 6357:0:(linux-
  debug.c:222:libcfs_debug_dumpstack()) showing stack for process 6357
  Jan  4 06:33:31 tech-mds kernel: ll_mgs_02  R  running task    0  6357  16340 (L-TLB)
  Jan  4 06:33:31 tech-mds kernel: Call Trace:
  Jan  4 06:33:31 tech-mds kernel: thread_return+0x62/0xfe
  Jan  4 06:33:31 tech-mds kernel: __wake_up_common+0x3e/0x68
  Jan  4 06:33:31 tech-mds kernel: :ptlrpc:ptlrpc_main+0x1218/0x13e0
  Jan  4 06:33:31 tech-mds kernel: default_wake_function+0x0/0xe
  Jan  4 06:33:31 tech-mds kernel: audit_syscall_exit+0x31b/0x336
  Jan  4 06:33:31 tech-mds kernel: child_rip+0xa/0x11
  Jan  4 06:33:31 tech-mds kernel: :ptlrpc:ptlrpc_main+0x0/0x13e0
  Jan  4 06:33:31 tech-mds kernel: child_rip+0x0/0x11
 
 It shouldn't LBUG during recovery, however.
 
 Cheers, Andreas
 --
 Andreas Dilger
 Sr. Staff Engineer, Lustre Group
 Sun Microsystems of Canada, Inc.
 

-- 
David Cohen
Grid Computing
Physics Department
Technion - Israel Institute of Technology


Re: [Lustre-discuss] MDS crashes daily at the same hour

2010-01-06 Thread Brian J. Murrell
On Wed, 2010-01-06 at 11:25 +0200, David Cohen wrote: 
 
 It was indeed the *locate update; a simple edit of /etc/updatedb.conf on the 
 clients and the system is stable again.

Great.  But as Andreas said previously, load should not have caused the
LBUG that you got.  Could you open a bug on our bugzilla about that?
Please attach to that bug an excerpt from the tech-mds log that covers a
window of 12 hours prior to the LBUG and an hour after.

Thanx,
b.






[Lustre-discuss] MDS crashes daily at the same hour

2010-01-04 Thread David Cohen
Hi,
I'm using a mixed environment of a 1.8.0.1 MDS and 1.6.6 OSSs (we had a problem 
with qlogic drivers and rolled back to 1.6.6).
My MDS becomes unresponsive each day at 4-5 am local time, with no kernel panic 
or error messages beforehand.
Some errors and an LBUG appear in the log after force-booting the MDS and 
mounting the MDT; then the log is clear until the next morning:

Jan  4 06:27:32 tech-mds kernel: LustreError: 6290:0:(ldlm_lib.c:884:target_handle_connect()) technion-MDT: denying connection for new client 192.114.101...@tcp (ab671897-b1e2-76d3-b661-7b87e82d23e7): 34 clients in recovery for 337s
Jan  4 06:27:32 tech-mds kernel: LustreError: 6290:0:(ldlm_lib.c:1826:target_send_reply_msg()) @@@ processing error (-16)  r...@81006f99cc00 x1323646107950586/t0 o38-?@?:0/0 lens 368/264 e 0 to 0 dl 1262579352 ref 1 fl Interpret:/0/0 rc -16/0
Jan  4 06:27:41 tech-mds kernel: Lustre: 6280:0:(ldlm_lib.c:1718:target_queue_last_replay_reply()) technion-MDT: 33 recoverable clients remain
Jan  4 06:27:57 tech-mds kernel: LustreError: 6284:0:(ldlm_lib.c:884:target_handle_connect()) technion-MDT: denying connection for new client 192.114.101...@tcp (ab671897-b1e2-76d3-b661-7b87e82d23e7): 33 clients in recovery for 312s
Jan  4 06:27:57 tech-mds kernel: LustreError: 6284:0:(ldlm_lib.c:1826:target_send_reply_msg()) @@@ processing error (-16)  r...@81011c69d400 x1323646107950600/t0 o38-?@?:0/0 lens 368/264 e 0 to 0 dl 1262579377 ref 1 fl Interpret:/0/0 rc -16/0
Jan  4 06:28:22 tech-mds kernel: LustreError: 6302:0:(ldlm_lib.c:884:target_handle_connect()) technion-MDT: denying connection for new client 192.114.101...@tcp (ab671897-b1e2-76d3-b661-7b87e82d23e7): 33 clients in recovery for 287s
Jan  4 06:28:22 tech-mds kernel: LustreError: 6302:0:(ldlm_lib.c:1826:target_send_reply_msg()) @@@ processing error (-16)  r...@81006fa4e000 x1323646107950612/t0 o38-?@?:0/0 lens 368/264 e 0 to 0 dl 1262579402 ref 1 fl Interpret:/0/0 rc -16/0
Jan  4 06:28:47 tech-mds kernel: LustreError: 6305:0:(ldlm_lib.c:884:target_handle_connect()) technion-MDT: denying connection for new client 192.114.101...@tcp (ab671897-b1e2-76d3-b661-7b87e82d23e7): 33 clients in recovery for 262s
Jan  4 06:28:47 tech-mds kernel: LustreError: 6305:0:(ldlm_lib.c:1826:target_send_reply_msg()) @@@ processing error (-16)  r...@81011c69d800 x1323646107950624/t0 o38-?@?:0/0 lens 368/264 e 0 to 0 dl 1262579427 ref 1 fl Interpret:/0/0 rc -16/0
Jan  4 06:29:01 tech-mds ntpd[5999]: synchronized to 132.68.238.40, stratum 2
Jan  4 06:29:01 tech-mds ntpd[5999]: kernel time sync enabled 0001
Jan  4 06:29:12 tech-mds kernel: LustreError: 6278:0:(ldlm_lib.c:884:target_handle_connect()) technion-MDT: denying connection for new client 192.114.101...@tcp (ab671897-b1e2-76d3-b661-7b87e82d23e7): 33 clients in recovery for 237s
Jan  4 06:29:12 tech-mds kernel: LustreError: 6278:0:(ldlm_lib.c:1826:target_send_reply_msg()) @@@ processing error (-16)  r...@81007053ac00 x1323646107950636/t0 o38-?@?:0/0 lens 368/264 e 0 to 0 dl 1262579452 ref 1 fl Interpret:/0/0 rc -16/0
Jan  4 06:29:37 tech-mds kernel: LustreError: 6293:0:(ldlm_lib.c:884:target_handle_connect()) technion-MDT: denying connection for new client 192.114.101...@tcp (ab671897-b1e2-76d3-b661-7b87e82d23e7): 33 clients in recovery for 212s

Re: [Lustre-discuss] MDS crashes daily at the same hour

2010-01-04 Thread Andreas Dilger
On 2010-01-04, at 03:02, David Cohen wrote:
 I'm using a mixed environment of a 1.8.0.1 MDS and 1.6.6 OSSs (we had a
 problem with qlogic drivers and rolled back to 1.6.6).
 My MDS becomes unresponsive each day at 4-5 am local time, with no kernel
 panic or error messages beforehand.

Judging by the time, I'd guess this is slocate or mlocate running  
on all of your clients at the same time.  This used to be a source of  
extremely high load back in the old days, but I thought that Lustre  
was in the exclude list in newer versions of *locate.  Looking at the  
installed mlocate on my system, that doesn't seem to be the case...   
strange.

 Some errors and an LBUG appear in the log after force-booting the MDS
 and mounting the MDT; then the log is clear until the next morning:

 Jan  4 06:33:31 tech-mds kernel: LustreError: 6357:0:
 (class_hash.c:225:lustre_hash_findadd_unique_hnode())
 ASSERTION(hlist_unhashed(hnode)) failed
 Jan  4 06:33:31 tech-mds kernel: LustreError: 6357:0:
 (class_hash.c:225:lustre_hash_findadd_unique_hnode()) LBUG
 Jan  4 06:33:31 tech-mds kernel: Lustre: 6357:0:(linux-
 debug.c:222:libcfs_debug_dumpstack()) showing stack for process 6357
 Jan  4 06:33:31 tech-mds kernel: ll_mgs_02  R  running task    0  6357  16340 (L-TLB)
 Jan  4 06:33:31 tech-mds kernel: Call Trace:
 Jan  4 06:33:31 tech-mds kernel: thread_return+0x62/0xfe
 Jan  4 06:33:31 tech-mds kernel: __wake_up_common+0x3e/0x68
 Jan  4 06:33:31 tech-mds kernel: :ptlrpc:ptlrpc_main+0x1218/0x13e0
 Jan  4 06:33:31 tech-mds kernel: default_wake_function+0x0/0xe
 Jan  4 06:33:31 tech-mds kernel: audit_syscall_exit+0x31b/0x336
 Jan  4 06:33:31 tech-mds kernel: child_rip+0xa/0x11
 Jan  4 06:33:31 tech-mds kernel: :ptlrpc:ptlrpc_main+0x0/0x13e0
 Jan  4 06:33:31 tech-mds kernel: child_rip+0x0/0x11

It shouldn't LBUG during recovery, however.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
