Re: [Lustre-discuss] Cannot mount MDS: Lustre: Denying initial registration attempt from nid 10.201.62...@o2ib, specified as failover
Hi Kevin,

> But you specified that as a failover node:
>
> # tunefs.lustre --erase-params --param=failover.node=10.201.62...@o2ib,10.201.30...@tcp failover.node=10.201.62...@o2ib,10.201.30...@tcp mdt.group_upcall=/usr/sbin/l_getgroups /dev/md10

Well: at first I was just running

# tunefs.lustre --param mdt.quota_type=ug /dev/md10

and this alone was enough to break it. I then tried to remove the quota option with --erase-params, and I included both nodes (the primary and the failover) because 'tunefs.lustre /dev/md10' displayed them.

> Not sure what you mean when you say it worked before

It worked before we added the *.quota_type parameters: this installation is over a year old, has seen quite a few remounts, and was upgraded from 1.8.1.1 to 1.8.4.

> did you specify both sets on your mkfs command line?

The initial installation was done / dictated by the Swiss branch of a (no longer existing) three-letter company. This command was used to create the filesystem on the MDS:

# FS_NAME=lustre1
# MGS_1=10.201.62...@o2ib0,10.201.30...@tcp0
# MGS_2=10.201.62...@o2ib0,10.201.30...@tcp0
# mkfs.lustre --reformat --fsname ${FS_NAME} --mdt --mgs --failnode=${MGS_1} --failnode=${MGS_2} /dev/md10

Regards and thanks,
 Adrian

--
RFC 1925:
   (11) Every old idea will be proposed again with a different name and
        a different presentation, regardless of whether it works.
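A minimal way to see what is actually stored on the target before rewriting anything is tunefs.lustre's --dryrun flag, which prints the current settings without modifying the disk; a sketch, assuming the MDT is /dev/md10 as above:

# tunefs.lustre --dryrun /dev/md10

Note that --erase-params wipes all stored parameters at once, so every parameter that should survive (failover.node, mdt.group_upcall, ...) has to be restated on the same command line that erases them.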
Re: [Lustre-discuss] Cannot mount MDS: Lustre: Denying initial registration attempt from nid 10.201.62...@o2ib, specified as failover
Adrian Ulrich wrote:
> Hi Kevin,
>
>> But you specified that as a failover node:
>>
>> # tunefs.lustre --erase-params --param=failover.node=10.201.62...@o2ib,10.201.30...@tcp failover.node=10.201.62...@o2ib,10.201.30...@tcp mdt.group_upcall=/usr/sbin/l_getgroups /dev/md10
>
> Well: at first I was just running
>
> # tunefs.lustre --param mdt.quota_type=ug /dev/md10
>
> and this alone was enough to break it.
>
>> Not sure. did you specify both sets on your mkfs command line?
>
> The initial installation was done / dictated by the Swiss branch of a (no longer existing) three-letter company. This command was used to create the filesystem on the MDS:
>
> # FS_NAME=lustre1
> # MGS_1=10.201.62...@o2ib0,10.201.30...@tcp0
> # MGS_2=10.201.62...@o2ib0,10.201.30...@tcp0
> # mkfs.lustre --reformat --fsname ${FS_NAME} --mdt --mgs --failnode=${MGS_1} --failnode=${MGS_2} /dev/md10

I haven't done a combined MDT/MGS for a while, so I can't recall whether you have to specify the MGS NIDs for the MDT when it is co-located with the MGS, but I think the command should have been more like:

# mkfs.lustre --fsname ${FS_NAME} --mdt --mgs --failnode=${MGS_2} --mgsnode=${MGS_1} --mgsnode=${MGS_2} /dev/md10

with the mkfs/first mount done on MGS_1. As I mentioned, you would not normally specify the mkfs/first-mount NIDs as failover parameters, as they are added automatically by Lustre.

Kevin
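If the configuration logs have already recorded the wrong NIDs, the usual way to regenerate them is the writeconf procedure from the Lustre manual; a sketch, assuming the MDT is /dev/md10 as above and using a placeholder /dev/sdX for each OST device:

# (unmount all clients and all targets first)
# tunefs.lustre --writeconf /dev/md10      (on the MDS)
# tunefs.lustre --writeconf /dev/sdX       (on each OSS, for every OST)

Then remount the MDT first, followed by the OSTs and finally the clients, so that every target re-registers with the MGS and fresh configuration logs are written.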
Re: [Lustre-discuss] Help
Hello Wang Yibin,

Thanks for getting back to me. Yes, S2 and S3 can ping each other using lctl ping. I was using nuttcp for the test, and I also tried the IB tests that come with the IB utilities. I will try lnet-selftest. My goal is to measure the bandwidth when traffic has to reach across the different networks. Are there any such tests specific to Lustre?

Thanks
Nihir

From: Wang Yibin [mailto:wang.yi...@oracle.com]
Sent: Thursday, November 18, 2010 6:32 AM
To: Nihir Parikh
Cc: lustre-discuss@lists.lustre.org
Subject: Re: [Lustre-discuss] Help

Hi,

On 2010-11-17, at 9:17 AM, Nihir Parikh wrote:
> Now my problem is to run some network tests from S2 -> S3 and S3 -> S2 to measure the bandwidth, but somehow both S2 and S3 complain that the network is unreachable. What am I doing wrong?

Your configuration seems OK to me. Can S2 and S3 ping each other using 'lctl ping'? What kind of network test did you do? Note that only Lustre LNET can do the routing. There is a script in the Lustre test suite specifically for testing network connectivity: lnet-selftest.sh.
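The LNET-level connectivity check mentioned above looks like this (the NID is a placeholder; substitute the remote node's actual NID, as shown by lctl list_nids on that node):

# lctl list_nids
# lctl ping <remote_nid>

lctl ping succeeds only if LNET itself, including any LNET routers between the two networks, can deliver messages, which is exactly what ordinary TCP or IB benchmarks such as nuttcp cannot exercise.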
Re: [Lustre-discuss] Help
On 2010-11-20, at 8:39 AM, Nihir Parikh wrote:
> Hello Wang Yibin,
>
> Thanks for getting back to me. Yes, S2 and S3 can ping each other using lctl ping.

This indicates that your routing is working as expected.

> I was using nuttcp for the test, and I also tried the IB tests that come with the IB utilities. I will try lnet-selftest.

These utilities do not understand the LNET protocol, so they won't work.

> My goal is to measure the bandwidth when traffic has to reach across the different networks. Are there any such tests specific to Lustre?

LNET has its own test suite, called lnet-selftest. To measure the bandwidth, you can load the lnet_selftest module on your nodes and execute lst in brw mode.
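A minimal brw (bulk read/write) session might look like the following, run from a console node with the lnet_selftest module loaded on all the machines involved; the NIDs are placeholders for S2's and S3's actual NIDs:

# modprobe lnet_selftest
# export LST_SESSION=$$
# lst new_session bwtest
# lst add_group g_s2 <S2_NID>
# lst add_group g_s3 <S3_NID>
# lst add_batch bulk
# lst add_test --batch bulk --from g_s2 --to g_s3 brw write size=1M
# lst run bulk
# lst stat g_s2 g_s3        (watch the bandwidth figures; Ctrl-C to stop)
# lst end_session

Because lst traffic is carried by LNET itself, the measured bandwidth includes any LNET routers on the path, which is exactly the cross-network number being asked about here.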
Re: [Lustre-discuss] LBUG on lustre 1.8.0
Sure, but I think that for engineering to make progress on this bug, they are going to want a crash dump. If you can enable crash dumps and panic-on-LBUG (and, if running HA, increase the dead timeout so the node can complete the dump before being shot in the head), it would provide more information for the bug report.

That being said, quite a few other bugs have been fixed since 1.8.0, so you really should upgrade to 1.8.4 as soon as possible.

Kevin

On Nov 21, 2010, at 6:59 PM, Larry <tsr...@gmail.com> wrote:

> We had an LBUG several days ago on our Lustre 1.8.0. One OSS reported:
>
> kernel: LustreError: 24669:0:(service.c:1311:ptlrpc_server_handle_request()) ASSERTION(atomic_read(&(export)->exp_refcount) < 0x5a5a5a) failed
> kernel: LustreError: 24669:0:(service.c:1311:ptlrpc_server_handle_request()) LBUG
> kernel: Lustre: 24669:0:(linux-debug.c:222:libcfs_debug_dumpstack()) showing stack for process 24669
> ...
>
> I googled for this and found little information about it. It seems to be a race condition on the OSS, right? Should I open a bugzilla for this LBUG? Thanks.
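On RHEL-style systems of that era, the two pieces fit together roughly like this (a sketch; the package name, paths, and crashkernel size are typical values, not verified against any particular distro):

# echo "options libcfs libcfs_panic_on_lbug=1" >> /etc/modprobe.conf
# yum install kexec-tools
# (add crashkernel=128M@16M to the kernel boot line in grub.conf and reboot)
# chkconfig kdump on
# service kdump start

With libcfs_panic_on_lbug set, an LBUG triggers a kernel panic instead of just hanging the offending thread, and kdump then writes out a vmcore that can be attached to the bug report.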
Re: [Lustre-discuss] LBUG on lustre 1.8.0
Larry wrote:
> We added "options libcfs libcfs_panic_on_lbug=1" to modprobe.conf to make the server kernel panic as soon as the LBUG happens. Is there some way to make the server die a few seconds after the LBUG? We are also puzzled by the messages that are lost when the LBUG happens.

The messages should have gone to the console just fine (hopefully you are logging a serial console). If you are talking about /var/log/messages, then yes, it will be missing the final output, as the messages don't have time to get written to disk on a kernel panic.

Kevin
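Capturing a serial console is the usual way to keep those final messages. A sketch, assuming the console is ttyS0 at 115200 baud with a second machine attached to the port: on the crashing server, add to the kernel line in grub.conf

console=tty0 console=ttyS0,115200

and on the machine attached to the serial port, log everything with

# screen -L /dev/ttyS0 115200

(screen -L writes its log to screenlog.0 in the current directory). As for making the server die "a few seconds after" the LBUG: the closest standard knob is the kernel.panic sysctl, which sets how many seconds the kernel waits after a panic before rebooting (0 means hang forever), e.g. sysctl -w kernel.panic=30.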