Re: [Lustre-discuss] Cannot mount MDS: Lustre: Denying initial registration attempt from nid 10.201.62...@o2ib, specified as failover

2010-11-21 Thread Adrian Ulrich
Hi Kevin,

 But you specified that as a failover node:
   # tunefs.lustre --erase-params 
 --param=failover.node=10.201.62...@o2ib,10.201.30...@tcp 
 failover.node=10.201.62...@o2ib,10.201.30...@tcp 
 mdt.group_upcall=/usr/sbin/l_getgroups /dev/md10

Well: first I was just running

# tunefs.lustre --param mdt.quota_type=ug /dev/md10

and this alone was enough to break it.

Then I tried to remove the quota option with --erase-params, and I included
both nodes (the primary + failover) because 'tunefs.lustre /dev/md10' displayed 
them.


 Not sure what you mean when you say it worked before

It worked before we added the *.quota_type parameters: this installation
is over a year old and has seen quite a few remounts and an upgrade from
1.8.1.1 to 1.8.4.


 did you specify both sets on your mkfs command line?

The initial installation was done/dictated by the Swiss branch of
a (no longer existing) three-letter company. This command was used
to create the filesystem on the MDS:

# FS_NAME=lustre1
# MGS_1=10.201.62...@o2ib0,10.201.30...@tcp0
# MGS_2=10.201.62...@o2ib0,10.201.30...@tcp0
# mkfs.lustre --reformat --fsname ${FS_NAME} --mdt --mgs --failnode=${MGS_1} 
--failnode=${MGS_2} /dev/md10


Regards and thanks,
 Adrian


-- 
 RFC 1925:
   (11) Every old idea will be proposed again with a different name and
a different presentation, regardless of whether it works.

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Cannot mount MDS: Lustre: Denying initial registration attempt from nid 10.201.62...@o2ib, specified as failover

2010-11-21 Thread Kevin Van Maren
Adrian Ulrich wrote:
 Hi Kevin,

   
 But you specified that as a failover node:
   # tunefs.lustre --erase-params 
 --param=failover.node=10.201.62...@o2ib,10.201.30...@tcp 
 failover.node=10.201.62...@o2ib,10.201.30...@tcp 
 mdt.group_upcall=/usr/sbin/l_getgroups /dev/md10
 

 Well: first I was just running

 # tunefs.lustre --param mdt.quota_type=ug /dev/md10

 and this alone was enough to break it.
   

Not sure.

 did you specify both sets on your mkfs command line?
 

 The initial installation was done/dictated by the Swiss branch of
 a (no longer existing) three-letter company. This command was used
 to create the filesystem on the MDS:

 # FS_NAME=lustre1
 # MGS_1=10.201.62...@o2ib0,10.201.30...@tcp0
 # MGS_2=10.201.62...@o2ib0,10.201.30...@tcp0
 # mkfs.lustre --reformat --fsname ${FS_NAME} --mdt --mgs --failnode=${MGS_1} 
 --failnode=${MGS_2} /dev/md10
   

I haven't done a combined MDT/MGS for a while, so I can't recall whether you 
have to specify the MGS NIDs for the MDT when it is colocated with the 
MGS, but I think the command should have been more like:

# mkfs.lustre --fsname ${FS_NAME} --mdt --mgs --failnode=${MGS_2} 
--mgsnode=${MGS_1} --mgsnode=${MGS_2} /dev/md10
with the mkfs/first mount on MGS_1.

As I mentioned, you would not normally specify the mkfs/first-mount NIDs 
as failover parameters, as they are added automatically by Lustre.
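
If it helps, here is a rough (untested) sketch of how I would reset the
parameters on the combined MGS/MDT. The NIDs are placeholders -- put only the
*other* node's NIDs into failover.node, not the NIDs of the node you run
mkfs/first mount on:

# tunefs.lustre /dev/md10
  (inspect the current parameters first)
# tunefs.lustre --erase-params \
    --param="failover.node=<failover-NID>@o2ib,<failover-NID>@tcp" \
    --param="mdt.group_upcall=/usr/sbin/l_getgroups" \
    --param="mdt.quota_type=ug" \
    --writeconf /dev/md10

Be careful with --writeconf: it regenerates the configuration logs, so it
normally has to be done on all targets and the filesystem brought back up in
order (MGS/MDT first, then the OSTs) -- check the manual before running it.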

Kevin

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Help

2010-11-21 Thread Nihir Parikh
Hello Wang Yibin,
Thanks for getting back to me. Yes, S2 and S3 can ping each other using lctl 
ping. I was using the nuttcp test and I also tried the IB tests that come with 
the IB utilities. I will try lnet-selftest.

My goal was to measure the bandwidth when traffic has to cross different 
networks. Are there any such tests specific to Lustre?

Thanks
Nihir


From: Wang Yibin [mailto:wang.yi...@oracle.com]
Sent: Thursday, November 18, 2010 6:32 AM
To: Nihir Parikh
Cc: lustre-discuss@lists.lustre.org
Subject: Re: [Lustre-discuss] Help

Hi,

On 2010-11-17, at 9:17 AM, Nihir Parikh wrote:



Now my problem is to run some network tests from S2 -> S3 and S3 -> S2 to 
measure the bandwidth, but somehow both S2 and S3 complain that the network is 
unreachable. What am I doing wrong?

Your configuration seems OK to me. Can S2 and S3 ping each other using 'lctl 
ping'?
What kind of network test did you do? Note that only Lustre LNET can do the 
routing.
There's a script in the Lustre test suite specifically for testing network 
connectivity - lnet-selftest.sh.



Thanks
Nihir


___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Help

2010-11-21 Thread Wang Yibin

On 2010-11-20, at 8:39 AM, Nihir Parikh wrote:

 Hello Wang Yibin,
 Thanks for getting back to me. Yes, S2 and S3 can ping each other using lctl 
 ping.

This indicates that your routing is working as expected.
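
For completeness, a quick way to double-check the routed path from the shell
(the NIDs below are placeholders for your real S2/S3 addresses):

# lctl ping <S3-NID>@tcp
  (run on S2; the request goes through your LNET router)
# lctl ping <S2-NID>@o2ib
  (run on S3)
# cat /proc/sys/lnet/routes
  (shows the LNET routing table configured on the node)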

 I was using the nuttcp test and I also tried the IB tests that come with the 
 IB utilities. I will try lnet-selftest.

These utilities do not understand the LNET protocol, so they won't work.

  
 My goal was to measure the bandwidth when traffic has to cross different 
 networks. Are there any such tests specific to Lustre?

LNET has its own test suite, called LNET selftest.
To measure the bandwidth, you can load the lnet_selftest module on your nodes 
and run lst in brw mode.
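
A minimal session sketch -- the group names and NIDs are placeholders, so
substitute your real S2/S3 NIDs and load the module on every node involved:

# modprobe lnet_selftest
  (on the console node and on S2 and S3)
# export LST_SESSION=$$
# lst new_session read_write
# lst add_group s2 <S2-NID>@o2ib
# lst add_group s3 <S3-NID>@tcp
# lst add_batch bulk
# lst add_test --batch bulk --from s2 --to s3 brw write check=simple size=1M
# lst run bulk
# lst stat s2 s3
  (watch the MiB/s numbers, Ctrl-C to stop)
# lst end_session

The brw write test pushes bulk data from the 's2' group to the 's3' group
across the router, which should give you the cross-network bandwidth you are
after.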

  
 Thanks
 Nihir
  
 From: Wang Yibin [mailto:wang.yi...@oracle.com] 
 Sent: Thursday, November 18, 2010 6:32 AM
 To: Nihir Parikh
 Cc: lustre-discuss@lists.lustre.org
 Subject: Re: [Lustre-discuss] Help
  
 Hi,
  
 On 2010-11-17, at 9:17 AM, Nihir Parikh wrote:
 
 
  
 Now my problem is to run some network tests from S2 -> S3 and S3 -> S2 to 
 measure the bandwidth, but somehow both S2 and S3 complain that the network is 
 unreachable. What am I doing wrong?
  
 Your configuration seems OK to me. Can S2 and S3 ping each other using 'lctl 
 ping'? 
 What kind of network test did you do? Note that only Lustre LNET can do the 
 routing. 
 There's a script in the Lustre test suite specifically for testing network 
 connectivity - lnet-selftest.sh.
 
 
  
 Thanks
 Nihir
  

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] LBUG on lustre 1.8.0

2010-11-21 Thread Kevin Van Maren
Sure, but I think for engineering to make progress on this bug, they  
are going to want a crash dump.  If you can enable crash dumps and  
panic on lbug (and if HA, increase dead timeout so it can complete the  
dump before being shot in the head) it would provide more info for the  
bug report.
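
In case it is useful, a rough sketch of that setup on a RHEL5-style node (the
crashkernel size is just an example -- size it for your kernel and memory):
append crashkernel=128M@16M to the kernel line in grub.conf, reboot once so
the crash kernel memory gets reserved, and then:

# echo 'options libcfs libcfs_panic_on_lbug=1' >> /etc/modprobe.conf
# chkconfig kdump on
# service kdump start

After an LBUG-triggered panic the vmcore should land under /var/crash, which
is what engineering will want for the bug report.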

That being said, there are quite a few other bugs that have been fixed  
since 1.8.0, so you really should upgrade ASAP to 1.8.4.

Kevin


On Nov 21, 2010, at 6:59 PM, Larry tsr...@gmail.com wrote:

 We had a LBUG several days ago on our lustre 1.8.0. One OSS reported

 kernel: LustreError:
 24669:0:(service.c:1311:ptlrpc_server_handle_request())
 ASSERTION(atomic_read(&(export)->exp_refcount) < 0x5a5a5a) failed
 kernel: LustreError:
 24669:0:(service.c:1311:ptlrpc_server_handle_request()) LBUG
 kernel: Lustre: 24669:0:(linux-debug.c:222:libcfs_debug_dumpstack())
 showing stack for process 24669
 ..

 I googled for this and found little information about it. It seems to
 be a race condition on the OSS, right? Should I open a bugzilla for this
 LBUG?
 Thanks.
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] LBUG on lustre 1.8.0

2010-11-21 Thread Kevin Van Maren
Larry wrote:
 We added 'options libcfs libcfs_panic_on_lbug=1' to modprobe.conf to
 make the server kernel panic as soon as the LBUG happens. Is there some way
 to make the server die a few seconds after the LBUG? We are also
 puzzled by the messages that get lost when the LBUG happens.
   

The messages should have gone to the console just fine (hopefully you 
are logging a serial console).
If you are talking about /var/log/messages, then yes, it will be missing 
the final output as the
messages don't have time to get written to disk on a kernel panic.
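
If there is no serial console wired up, netconsole is a cheap way to catch
those last lines.  A sketch -- the IPs, MAC and ports below are placeholders
for your own log host:

# modprobe netconsole netconsole=6665@10.0.0.8/eth0,6666@10.0.0.9/00:11:22:33:44:55

That streams printk output over UDP to the remote host (capture it there with
netcat or syslog-ng), so the LBUG backtrace survives even though syslog never
gets a chance to flush /var/log/messages to disk.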

Kevin


 On Mon, Nov 22, 2010 at 10:42 AM, Kevin Van Maren
 kevin.van.ma...@oracle.com wrote:
   
 Sure, but I think for engineering to make progress on this bug, they are
 going to want a crash dump.  If you can enable crash dumps and panic on lbug
 (and if HA, increase dead timeout so it can complete the dump before being
 shot in the head) it would provide more info for the bug report.

 That being said, there are quite a few other bugs that have been fixed since
 1.8.0, so you really should upgrade ASAP to 1.8.4.

 Kevin


 On Nov 21, 2010, at 6:59 PM, Larry tsr...@gmail.com wrote:

 
 We had a LBUG several days ago on our lustre 1.8.0. One OSS reported

 kernel: LustreError:
 24669:0:(service.c:1311:ptlrpc_server_handle_request())
 ASSERTION(atomic_read(&(export)->exp_refcount) < 0x5a5a5a) failed
 kernel: LustreError:
 24669:0:(service.c:1311:ptlrpc_server_handle_request()) LBUG
 kernel: Lustre: 24669:0:(linux-debug.c:222:libcfs_debug_dumpstack())
 showing stack for process 24669
 ..

 I googled for this and found little information about it. It seems to
 be a race condition on the OSS, right? Should I open a bugzilla for this
 LBUG?
 Thanks.

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss