Re: [Lustre-discuss] MDS crashes daily at the same hour
On Sun, 2010-01-24 at 22:54 -0700, Andreas Dilger wrote:
> If they are call traces due to the watchdog timer, then this is somewhat expected for extremely high load.

Andreas,

Do you know, do adaptive timeouts take care of setting the timeout appropriately on watchdogs?

b.
Re: [Lustre-discuss] MDS crashes daily at the same hour
On Mon, Jan 25, 2010 at 08:51:59AM -0500, Brian J. Murrell wrote:
> Do you know, do adaptive timeouts take care of setting the timeout appropriately on watchdogs?

Yes, the watchdog timer is updated based on the estimated RPC service time (multiplied by a factor, which is usually 2).

Johann
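For anyone who wants to check this on their own systems, both the adaptive-timeout tunables and the per-service RPC time estimates that feed the watchdog are exposed through proc. A minimal sketch, assuming the 1.8.x paths (names and locations vary between releases, so treat this as illustrative rather than definitive):

# Adaptive-timeout tunables; at_max=0 means adaptive timeouts are disabled
cat /proc/sys/lustre/at_min /proc/sys/lustre/at_max /proc/sys/lustre/at_history

# Per-service RPC time estimates used to scale the watchdog
# (assumed 1.8.x parameter names; check your release's proc tree)
lctl get_param -n mdt.MDS.mds.timeouts   # server-side MDS service estimate
lctl get_param -n mdc.*.timeouts         # client-side view of MDS RPCs

With the usual factor of 2, a service-time estimate of 300s would put the watchdog at roughly 600s.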
Re: [Lustre-discuss] MDS crashes daily at the same hour
Brian J. Murrell wrote:
> On Sun, 2010-01-24 at 22:54 -0700, Andreas Dilger wrote:
>> If they are call traces due to the watchdog timer, then this is somewhat expected for extremely high load.
>
> Andreas,
>
> Do you know, do adaptive timeouts take care of setting the timeout appropriately on watchdogs?

I don't think this is quite what you are asking, but here are some details on our setup. We have a mixture of 1.6.7.2 clients and 1.8.1.1 clients. The 1.6.7.2 clients were not using adaptive timeouts when the problem occurred [1]. At least one of the 1.6 machines gets regularly swamped with network traffic, leading to packet loss. It was 40 of the 1.8.1.1 clients running updatedb that caused the problem.

Chris

[1] One machine is the interface to the outside world and runs 1.6.7.2. I see packet loss to this machine at times and have observed Lustre hanging for a while. I suspect the problem is that it is occasionally overloaded with network packets, Lustre packets are then lost (probably at the router), followed by a timeout and recovery. I've now enabled adaptive timeouts on this machine, and will install a 10GigE card too.
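On the 1.6.x releases that support them (1.6.5 and later), adaptive timeouts are compiled in but disabled by default (at_max=0), so enabling them on a machine like the one in the footnote comes down to giving at_max a non-zero value. A hedged sketch, using the filesystem name from this thread purely as an example and the proc path used by 1.6/1.8; verify the exact syntax against the manual for your release:

# Filesystem-wide and persistent: run once on the MGS
# ('technion' is just the fsname appearing elsewhere in this thread)
lctl conf_param technion.sys.at_max=600

# Or temporarily on a single node (reverts on reboot/module reload)
echo 600 > /proc/sys/lustre/at_max     # setting it back to 0 disables AT again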
Re: [Lustre-discuss] MDS crashes daily at the same hour
On Mon, 2010-01-25 at 15:09 +0100, Johann Lombardi wrote:
> Yes, the watchdog timer is updated based on the estimated RPC service time (multiplied by a factor, which is usually 2).

Ahhh. Great. It would be interesting to know which Lustre release the poster seeing the stack traces was using.

b.
Re: [Lustre-discuss] MDS crashes daily at the same hour
On 2010-01-23, at 15:12, Christopher J. Walker wrote:
> I don't see an LBUG in my logs, but there are several Call Traces. Would it be useful if I filed a bug too, or should I add to David's bug if you'd prefer? If so, can you let me know the bug number, as I can't find it in bugzilla. Would you like /tmp/lustre-log.* too?

If they are call traces due to the watchdog timer, then this is somewhat expected for extremely high load.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
Re: [Lustre-discuss] MDS crashes daily at the same hour
Brian J. Murrell wrote:
> On Wed, 2010-01-06 at 11:25 +0200, David Cohen wrote:
>> It was indeed the *locate update, a simple edit of /etc/updatedb.conf on the clients and the system is stable again.

I've just encountered the same thing - the MDS crashing at the same time several times this week. It's just after the *locate update - I've added lustre to the excluded filesystems, and so far so good.

> Great. But as Andreas said previously, load should not have caused the LBUG that you got. Could you open a bug on our bugzilla about that? Please attach to that bug an excerpt from the tech-mds log that covers a window of 12 hours prior to the LBUG and an hour after.

I don't see an LBUG in my logs, but there are several Call Traces. Would it be useful if I filed a bug too, or should I add to David's bug if you'd prefer? If so, can you let me know the bug number, as I can't find it in bugzilla. Would you like /tmp/lustre-log.* too?

Chris
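As an aside, the /tmp/lustre-log.* files mentioned here are binary Lustre debug dumps, so they are normally converted to text before being attached to a bug. A small sketch (the input file name below is a made-up example):

# Convert a binary Lustre debug dump to readable text
# (hypothetical dump name; substitute the actual file)
lctl debug_file /tmp/lustre-log.1262579352.6357 /tmp/lustre-log.6357.txt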
Re: [Lustre-discuss] MDS crashes daily at the same hour
On 2010-01-06, at 04:25, David Cohen wrote:
> On Monday 04 January 2010 20:42:12 Andreas Dilger wrote:
>> On 2010-01-04, at 03:02, David Cohen wrote:
>>> I'm using a mixed environment of 1.8.0.1 MDS and 1.6.6 OSS's (had a problem with qlogic drivers and rolled back to 1.6.6). My MDS gets unresponsive each day at 4-5 am local time, no kernel panic or error messages before.
>
> It was indeed the *locate update, a simple edit of /etc/updatedb.conf on the clients and the system is stable again.

I asked the upstream Fedora/RHEL maintainer of mlocate to add lustre to the exception list in updatedb.conf, and he has already done so for Fedora. There is also a bug filed for RHEL5 to do the same, if anyone is interested in following it:

https://bugzilla.redhat.com/show_bug.cgi?id=557712

>> Judging by the time, I'd guess this is slocate or mlocate running on all of your clients at the same time. This used to be a source of extremely high load back in the old days, but I thought that Lustre was in the exclude list in newer versions of *locate. Looking at the installed mlocate on my system, that doesn't seem to be the case... strange.
>>
>>> Some errors and an LBUG appear in the log after force booting the MDS and mounting the MDT, and then the log is clear until the next morning:
>>>
>>> Jan 4 06:33:31 tech-mds kernel: LustreError: 6357:0:(class_hash.c:225:lustre_hash_findadd_unique_hnode()) ASSERTION(hlist_unhashed(hnode)) failed
>>> Jan 4 06:33:31 tech-mds kernel: LustreError: 6357:0:(class_hash.c:225:lustre_hash_findadd_unique_hnode()) LBUG
>>> Jan 4 06:33:31 tech-mds kernel: Lustre: 6357:0:(linux-debug.c:222:libcfs_debug_dumpstack()) showing stack for process 6357
>>> Jan 4 06:33:31 tech-mds kernel: ll_mgs_02 R running task 0 6357 16340 (L-TLB)
>>> Jan 4 06:33:31 tech-mds kernel: Call Trace:
>>> Jan 4 06:33:31 tech-mds kernel: thread_return+0x62/0xfe
>>> Jan 4 06:33:31 tech-mds kernel: __wake_up_common+0x3e/0x68
>>> Jan 4 06:33:31 tech-mds kernel: :ptlrpc:ptlrpc_main+0x1218/0x13e0
>>> Jan 4 06:33:31 tech-mds kernel: default_wake_function+0x0/0xe
>>> Jan 4 06:33:31 tech-mds kernel: audit_syscall_exit+0x31b/0x336
>>> Jan 4 06:33:31 tech-mds kernel: child_rip+0xa/0x11
>>> Jan 4 06:33:31 tech-mds kernel: :ptlrpc:ptlrpc_main+0x0/0x13e0
>>> Jan 4 06:33:31 tech-mds kernel: child_rip+0x0/0x11
>>
>> It shouldn't LBUG during recovery, however.
>>
>> Cheers, Andreas
>> --
>> Andreas Dilger
>> Sr. Staff Engineer, Lustre Group
>> Sun Microsystems of Canada, Inc.
>
> --
> David Cohen
> Grid Computing
> Physics Department
> Technion - Israel Institute of Technology

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
Re: [Lustre-discuss] MDS crashes daily at the same hour
On Monday 04 January 2010 20:42:12 Andreas Dilger wrote:
> On 2010-01-04, at 03:02, David Cohen wrote:
>> I'm using a mixed environment of 1.8.0.1 MDS and 1.6.6 OSS's (had a problem with qlogic drivers and rolled back to 1.6.6). My MDS gets unresponsive each day at 4-5 am local time, no kernel panic or error messages before.

It was indeed the *locate update, a simple edit of /etc/updatedb.conf on the clients and the system is stable again. Many thanks.

> Judging by the time, I'd guess this is slocate or mlocate running on all of your clients at the same time. This used to be a source of extremely high load back in the old days, but I thought that Lustre was in the exclude list in newer versions of *locate. Looking at the installed mlocate on my system, that doesn't seem to be the case... strange.
>
>> Some errors and an LBUG appear in the log after force booting the MDS and mounting the MDT, and then the log is clear until the next morning:
>>
>> Jan 4 06:33:31 tech-mds kernel: LustreError: 6357:0:(class_hash.c:225:lustre_hash_findadd_unique_hnode()) ASSERTION(hlist_unhashed(hnode)) failed
>> Jan 4 06:33:31 tech-mds kernel: LustreError: 6357:0:(class_hash.c:225:lustre_hash_findadd_unique_hnode()) LBUG
>> Jan 4 06:33:31 tech-mds kernel: Lustre: 6357:0:(linux-debug.c:222:libcfs_debug_dumpstack()) showing stack for process 6357
>> Jan 4 06:33:31 tech-mds kernel: ll_mgs_02 R running task 0 6357 16340 (L-TLB)
>> Jan 4 06:33:31 tech-mds kernel: Call Trace:
>> Jan 4 06:33:31 tech-mds kernel: thread_return+0x62/0xfe
>> Jan 4 06:33:31 tech-mds kernel: __wake_up_common+0x3e/0x68
>> Jan 4 06:33:31 tech-mds kernel: :ptlrpc:ptlrpc_main+0x1218/0x13e0
>> Jan 4 06:33:31 tech-mds kernel: default_wake_function+0x0/0xe
>> Jan 4 06:33:31 tech-mds kernel: audit_syscall_exit+0x31b/0x336
>> Jan 4 06:33:31 tech-mds kernel: child_rip+0xa/0x11
>> Jan 4 06:33:31 tech-mds kernel: :ptlrpc:ptlrpc_main+0x0/0x13e0
>> Jan 4 06:33:31 tech-mds kernel: child_rip+0x0/0x11
>
> It shouldn't LBUG during recovery, however.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.

--
David Cohen
Grid Computing
Physics Department
Technion - Israel Institute of Technology
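For anyone hitting the same problem, the "simple edit" generally amounts to adding lustre to the PRUNEFS line of /etc/updatedb.conf on every client, so the nightly updatedb run skips Lustre mounts. A hedged sketch of what such lines might look like; the default list varies by distribution, so append to what is already there rather than replacing it:

# /etc/updatedb.conf on each Lustre client: make updatedb skip Lustre.
# Keep whatever entries your distribution ships and just add "lustre";
# adding the mount point to PRUNEPATHS works as well
# (/mnt/lustre below is a hypothetical mount point).
PRUNEFS = "NFS nfs nfs4 afs proc sysfs tmpfs lustre"
PRUNEPATHS = "/tmp /var/spool /media /mnt/lustre"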
Re: [Lustre-discuss] MDS crashes daily at the same hour
On Wed, 2010-01-06 at 11:25 +0200, David Cohen wrote:
> It was indeed the *locate update, a simple edit of /etc/updatedb.conf on the clients and the system is stable again.

Great. But as Andreas said previously, load should not have caused the LBUG that you got. Could you open a bug on our bugzilla about that? Please attach to that bug an excerpt from the tech-mds log that covers a window of 12 hours prior to the LBUG and an hour after.

Thanx,
b.
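If it helps, one quick way to cut such a window out of the syslog on the MDS is a sed range over the timestamps; the times and file name below are hypothetical and would need to match the actual LBUG time:

# Grab roughly 12 hours before and an hour after an LBUG at Jan 4 06:33
# (hypothetical timestamps and output name; adjust to the real ones)
sed -n '/^Jan  3 18:3/,/^Jan  4 07:3/p' /var/log/messages > tech-mds-lbug-window.txt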
[Lustre-discuss] MDS crashes daily at the same hour
Hi,

I'm using a mixed environment of 1.8.0.1 MDS and 1.6.6 OSS's (had a problem with qlogic drivers and rolled back to 1.6.6). My MDS gets unresponsive each day at 4-5 am local time, no kernel panic or error messages before. Some errors and an LBUG appear in the log after force booting the MDS and mounting the MDT, and then the log is clear until the next morning:

Jan 4 06:27:32 tech-mds kernel: LustreError: 6290:0:(ldlm_lib.c:884:target_handle_connect()) technion-MDT: denying connection for new client 192.114.101...@tcp (ab671897-b1e2-76d3-b661-7b87e82d23e7): 34 clients in recovery for 337s
Jan 4 06:27:32 tech-mds kernel: LustreError: 6290:0:(ldlm_lib.c:1826:target_send_reply_msg()) @@@ processing error (-16) r...@81006f99cc00 x1323646107950586/t0 o38-?@?:0/0 lens 368/264 e 0 to 0 dl 1262579352 ref 1 fl Interpret:/0/0 rc -16/0
Jan 4 06:27:41 tech-mds kernel: Lustre: 6280:0:(ldlm_lib.c:1718:target_queue_last_replay_reply()) technion-MDT: 33 recoverable clients remain
Jan 4 06:27:57 tech-mds kernel: LustreError: 6284:0:(ldlm_lib.c:884:target_handle_connect()) technion-MDT: denying connection for new client 192.114.101...@tcp (ab671897-b1e2-76d3-b661-7b87e82d23e7): 33 clients in recovery for 312s
Jan 4 06:27:57 tech-mds kernel: LustreError: 6284:0:(ldlm_lib.c:1826:target_send_reply_msg()) @@@ processing error (-16) r...@81011c69d400 x1323646107950600/t0 o38-?@?:0/0 lens 368/264 e 0 to 0 dl 1262579377 ref 1 fl Interpret:/0/0 rc -16/0
Jan 4 06:28:22 tech-mds kernel: LustreError: 6302:0:(ldlm_lib.c:884:target_handle_connect()) technion-MDT: denying connection for new client 192.114.101...@tcp (ab671897-b1e2-76d3-b661-7b87e82d23e7): 33 clients in recovery for 287s
Jan 4 06:28:22 tech-mds kernel: LustreError: 6302:0:(ldlm_lib.c:1826:target_send_reply_msg()) @@@ processing error (-16) r...@81006fa4e000 x1323646107950612/t0 o38-?@?:0/0 lens 368/264 e 0 to 0 dl 1262579402 ref 1 fl Interpret:/0/0 rc -16/0
Jan 4 06:28:47 tech-mds kernel: LustreError: 6305:0:(ldlm_lib.c:884:target_handle_connect()) technion-MDT: denying connection for new client 192.114.101...@tcp (ab671897-b1e2-76d3-b661-7b87e82d23e7): 33 clients in recovery for 262s
Jan 4 06:28:47 tech-mds kernel: LustreError: 6305:0:(ldlm_lib.c:1826:target_send_reply_msg()) @@@ processing error (-16) r...@81011c69d800 x1323646107950624/t0 o38-?@?:0/0 lens 368/264 e 0 to 0 dl 1262579427 ref 1 fl Interpret:/0/0 rc -16/0
Jan 4 06:29:01 tech-mds ntpd[5999]: synchronized to 132.68.238.40, stratum 2
Jan 4 06:29:01 tech-mds ntpd[5999]: kernel time sync enabled 0001
Jan 4 06:29:12 tech-mds kernel: LustreError: 6278:0:(ldlm_lib.c:884:target_handle_connect()) technion-MDT: denying connection for new client 192.114.101...@tcp (ab671897-b1e2-76d3-b661-7b87e82d23e7): 33 clients in recovery for 237s
Jan 4 06:29:12 tech-mds kernel: LustreError: 6278:0:(ldlm_lib.c:1826:target_send_reply_msg()) @@@ processing error (-16) r...@81007053ac00 x1323646107950636/t0 o38-?@?:0/0 lens 368/264 e 0 to 0 dl 1262579452 ref 1 fl Interpret:/0/0 rc -16/0
Jan 4 06:29:37 tech-mds kernel: LustreError: 6293:0:(ldlm_lib.c:884:target_handle_connect()) technion-MDT: denying connection for new client 192.114.101...@tcp (ab671897-b1e2-76d3-b661-7b87e82d23e7): 33 clients in recovery for 212s
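The "N clients in recovery" messages above show the MDT refusing connections it does not recognise (rc -16 is -EBUSY) while it is still in client recovery after the forced reboot. If it is useful for watching a similar recovery run down, progress can usually be read from proc on the MDS; a minimal sketch, assuming the 1.8.x layout (the path differs in other releases):

# On the MDS, show how far client recovery has progressed
lctl get_param mds.*.recovery_status
# equivalent to:
cat /proc/fs/lustre/mds/*/recovery_status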
Re: [Lustre-discuss] MDS crashes daily at the same hour
On 2010-01-04, at 03:02, David Cohen wrote:
> I'm using a mixed environment of 1.8.0.1 MDS and 1.6.6 OSS's (had a problem with qlogic drivers and rolled back to 1.6.6). My MDS gets unresponsive each day at 4-5 am local time, no kernel panic or error messages before.

Judging by the time, I'd guess this is slocate or mlocate running on all of your clients at the same time. This used to be a source of extremely high load back in the old days, but I thought that Lustre was in the exclude list in newer versions of *locate. Looking at the installed mlocate on my system, that doesn't seem to be the case... strange.

> Some errors and an LBUG appear in the log after force booting the MDS and mounting the MDT, and then the log is clear until the next morning:
>
> Jan 4 06:33:31 tech-mds kernel: LustreError: 6357:0:(class_hash.c:225:lustre_hash_findadd_unique_hnode()) ASSERTION(hlist_unhashed(hnode)) failed
> Jan 4 06:33:31 tech-mds kernel: LustreError: 6357:0:(class_hash.c:225:lustre_hash_findadd_unique_hnode()) LBUG
> Jan 4 06:33:31 tech-mds kernel: Lustre: 6357:0:(linux-debug.c:222:libcfs_debug_dumpstack()) showing stack for process 6357
> Jan 4 06:33:31 tech-mds kernel: ll_mgs_02 R running task 0 6357 16340 (L-TLB)
> Jan 4 06:33:31 tech-mds kernel: Call Trace:
> Jan 4 06:33:31 tech-mds kernel: thread_return+0x62/0xfe
> Jan 4 06:33:31 tech-mds kernel: __wake_up_common+0x3e/0x68
> Jan 4 06:33:31 tech-mds kernel: :ptlrpc:ptlrpc_main+0x1218/0x13e0
> Jan 4 06:33:31 tech-mds kernel: default_wake_function+0x0/0xe
> Jan 4 06:33:31 tech-mds kernel: audit_syscall_exit+0x31b/0x336
> Jan 4 06:33:31 tech-mds kernel: child_rip+0xa/0x11
> Jan 4 06:33:31 tech-mds kernel: :ptlrpc:ptlrpc_main+0x0/0x13e0
> Jan 4 06:33:31 tech-mds kernel: child_rip+0x0/0x11

It shouldn't LBUG during recovery, however.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.