Re: [Lustre-discuss] Limits for o2ib lnet network numbers

2012-08-09 Thread Liang Zhen
No Jira ticket yet. The reason for the potential performance issue is 
straightforward: all LNet NIs are linked on a plain list, and we need to scan 
the whole list for each send/receive. That's not an issue for a few 
networks, but it could be problematic for tens or hundreds.

Liang

On Aug 8, 2012, at 10:48 PM, Cory Spitz wrote:

 Liang,
 
 What mainstream perf. issue are you referring to?  Is there a JIRA ticket
 tracking it?
 
 Thanks,
 -Cory
 
 On 08/08/2012 09:38 AM, Liang Zhen wrote:
 Hi, LNet reserves 32 bits for the network number, so you can choose a very large 
 network number if you only have a few networks. However, actually creating many 
 networks will have some issues:
 - o2iblnd will pre-allocate memory resources for each network, so it will 
 consume a lot of memory
 - Mainstream LNet will have a performance issue if there are many networks, 
 for example hundreds, although it's not difficult to fix this.
 
 Liang
 
 On Aug 8, 2012, at 10:23 PM, Rick Mohr wrote:
 
 
 I was curious what limitations exist for o2ib network numbers.  Most of
 the time I am dealing with o2ib0, o2ib1, etc.  As an experiment, I tried
 configuring a machine with o2ib1000, and that seemed to be OK.  I
 figured there must be some limit on how large the network number can
 get, but after doing some searching, I have been unable to find any docs
 that specify a limit.  Does anyone know what the max network number is?
 
 -- 
 Rick Mohr
 HPC Systems Administrator
 National Institute for Computational Sciences
 http://www.nics.tennessee.edu/
 
 ___
 Lustre-discuss mailing list
 Lustre-discuss@lists.lustre.org
 http://lists.lustre.org/mailman/listinfo/lustre-discuss
 



Re: [Lustre-discuss] Limits for o2ib lnet network numbers

2012-08-08 Thread Liang Zhen
Hi, LNet reserves 32 bits for the network number, so you can choose a very large 
network number if you only have a few networks. However, actually creating many 
networks will have some issues:
- o2iblnd will pre-allocate memory resources for each network, so it will 
consume a lot of memory
- Mainstream LNet will have a performance issue if there are many networks, for 
example hundreds, although it's not difficult to fix this.
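For illustration, picking a large network number like the one Rick tried only requires naming it in the LNet module options; a hypothetical sketch (the interface name ib0 and the file path are assumptions, not from the original messages):

```shell
# /etc/modprobe.d/lustre.conf (hypothetical) -- a large o2ib network
# number is fine as long as only a few networks are actually defined
options lnet networks="o2ib1000(ib0)"
```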

Liang

On Aug 8, 2012, at 10:23 PM, Rick Mohr wrote:

 
 I was curious what limitations exist for o2ib network numbers.  Most of
 the time I am dealing with o2ib0, o2ib1, etc.  As an experiment, I tried
 configuring a machine with o2ib1000, and that seemed to be OK.  I
 figured there must be some limit on how large the network number can
 get, but after doing some searching, I have been unable to find any docs
 that specify a limit.  Does anyone know what the max network number is?
 
 -- 
 Rick Mohr
 HPC Systems Administrator
 National Institute for Computational Sciences
 http://www.nics.tennessee.edu/
 



Re: [Lustre-discuss] [wc-discuss] The ost_connect operation failed with -16

2012-05-30 Thread Liang Zhen
Hi, I think you might be hitting this: http://jira.whamcloud.com/browse/LU-952 ; you 
can find the patch in that ticket.

Regards
Liang

On May 30, 2012, at 11:21 AM, huangql wrote:

 Dear  all,
 
 Recently we found a problem on the OSS where some threads may hang when the 
 server is under heavy IO load. In this case, some clients are evicted or 
 refused by some OSTs, with error messages like the following:
 
 Server side:
 
 May 30 11:06:31 boss07 kernel: Lustre: Service thread pid 8011 was inactive 
 for 200.00s. The thread might be hung, or it might only be slow and will 
 resume later. D
 umping the stack trace for debugging purposes: May 30 11:06:31 boss07 kernel: 
 Lustre: Skipped 1 previous similar message
 May 30 11:06:31 boss07 kernel: Pid: 8011, comm: ll_ost_71 
 May 30 11:06:31 boss07 kernel: 
 May 30 11:06:31 boss07 kernel: Call Trace:
 May 30 11:06:31 boss07 kernel:  [886f5d0e] 
 start_this_handle+0x301/0x3cb [jbd2]
 May 30 11:06:31 boss07 kernel:  [800a09ca] 
 autoremove_wake_function+0x0/0x2e
 May 30 11:06:31 boss07 kernel:  [886f5e83] 
 jbd2_journal_start+0xab/0xdf [jbd2]
 May 30 11:06:31 boss07 kernel:  [888ce9b2] 
 fsfilt_ldiskfs_start+0x4c2/0x590 [fsfilt_ldiskfs]
 May 30 11:06:31 boss07 kernel:  [88920551] 
 filter_version_get_check+0x91/0x2a0 [obdfilter]
 May 30 11:06:31 boss07 kernel:  [80036cf4] __lookup_hash+0x61/0x12f
 May 30 11:06:31 boss07 kernel:  [8893108d] 
 filter_setattr_internal+0x90d/0x1de0 [obdfilter]
 May 30 11:06:31 boss07 kernel:  [800e859b] lookup_one_len+0x53/0x61
 May 30 11:06:31 boss07 kernel:  [88925452] 
 filter_fid2dentry+0x512/0x740 [obdfilter]
 May 30 11:06:31 boss07 kernel:  [88924e27] 
 filter_fmd_get+0x2b7/0x320 [obdfilter]
 May 30 11:06:31 boss07 kernel:  [8003027b] __up_write+0x27/0xf2
 May 30 11:06:31 boss07 kernel:  [88932721] 
 filter_setattr+0x1c1/0x3b0 [obdfilter]
 May 30 11:06:31 boss07 kernel:  [8882677a] 
 lustre_pack_reply_flags+0x86a/0x950 [ptlrpc]
 May 30 11:06:31 boss07 kernel:  [8881e658] 
 ptlrpc_send_reply+0x5c8/0x5e0 [ptlrpc]
 May 30 11:06:31 boss07 kernel:  [88822b05] 
 lustre_msg_get_version+0x35/0xf0 [ptlrpc]
 May 30 11:06:31 boss07 kernel:  [888b0abb] ost_handle+0x25db/0x55b0 
 [ost]
 May 30 11:06:31 boss07 kernel:  [80150d56] __next_cpu+0x19/0x28
 May 30 11:06:31 boss07 kernel:  [800767ae] 
 smp_send_reschedule+0x4e/0x53
 May 30 11:06:31 boss07 kernel:  [8883215a] 
 ptlrpc_server_handle_request+0x97a/0xdf0 [ptlrpc]
 May 30 11:06:31 boss07 kernel:  [888328a8] 
 ptlrpc_wait_event+0x2d8/0x310 [ptlrpc]
 May 30 11:06:31 boss07 kernel:  [8008b3bd] 
 __wake_up_common+0x3e/0x68
 May 30 11:06:31 boss07 kernel:  [88833817] ptlrpc_main+0xf37/0x10f0 
 [ptlrpc]
 May 30 11:06:31 boss07 kernel:  [8005dfb1] child_rip+0xa/0x11
 May 30 11:06:31 boss07 kernel:  [888328e0] ptlrpc_main+0x0/0x10f0 
 [ptlrpc]
 May 30 11:06:31 boss07 kernel:  [8005dfa7] child_rip+0x0/0x11
 May 30 11:06:31 boss07 kernel:
 May 30 11:06:31 boss07 kernel: LustreError: dumping log to 
 /tmp/lustre-log.1338347191.8011
 
 
 Client side:
 
 May 30 09:58:36 ccopt kernel: LustreError: 11-0: an error occurred while 
 communicating with 192.168.50.123@tcp. The ost_connect operation failed with 
 -16
 
 When this error message appears, commands such as ls, df, vi, and touch 
 fail, which prevents us from doing anything in the file system.
 I think the ost_connect failure should report an error message to users 
 instead of leaving interactive commands stuck.
 
 Could someone give us some advice or any suggestions to solve this problem?
 
 Thank you very much in advance.
 
 
 Best Regards
 Qiulan Huang
 2012-05-30
 
 Computing center,the Institute of High Energy Physics, China
 Huang, QiulanTel: (+86) 10 8823 6010-105
 P.O. Box 918-7   Fax: (+86) 10 8823 6839
 Beijing 100049  P.R. China   Email: huan...@ihep.ac.cn
 ===   
 
 
 



Re: [Lustre-discuss] is there a way to run Lustre over UDP instead TCP?

2012-04-09 Thread Liang Zhen
Hi, no it can't. That would require a new UDP-based LND (Lustre Network Driver), 
but I don't know of anyone planning to do this yet. 

Regards
Liang

On Apr 10, 2012, at 6:44 AM, Hebenstreit, Michael wrote:

  
 See title…
  
 Thanks
 Michael
  
 
 Michael Hebenstreit Senior Cluster Architect
 Intel Corporation   Software and Services Group/HTE
 2800 N Center Dr, DP3-307   Tel.:   +1 253 371 3144
 WA 98327, DuPont   
 UNITED STATES   E-mail: michael.hebenstr...@intel.com



Re: [Lustre-discuss] LNET Performance Issue

2012-02-15 Thread Liang Zhen
Hi, I assume you are using size=1M for the brw test, right? Performance could 
increase if you set concurrency when adding the brw test, e.g. --concurrency=16
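For reference, a minimal lnet_selftest session along those lines might look like the sketch below (group names and NIDs are placeholders, not taken from the original message; the relevant part is passing --concurrency when the brw test is added):

```shell
# Hypothetical lst session; replace the NIDs with your own
export LST_SESSION=$$
lst new_session brw_concurrency
lst add_group clients 192.168.50.[10-13]@o2ib
lst add_group servers 192.168.50.[1-6]@o2ib
lst add_batch bulkperf
# --concurrency 16 keeps 16 requests in flight per test instance
lst add_test --batch bulkperf --concurrency 16 \
    --from clients --to servers brw write size=1M
lst run bulkperf
lst stat clients servers
lst end_session
```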

Liang

On Feb 16, 2012, at 3:30 AM, Barberi, Carl E wrote:

 We are having issues with LNET performance over Infiniband.  We have a 
 configuration with a single MDT and six (6) OSTs.  The Lustre client I am 
 using to test is configured to use 6 stripes (lfs setstripe -c  6 
 /mnt/lustre).  When I perform a test using the following command:
  
 dd if=/dev/zero of=/mnt/lustre/test.dat bs=1M count=2000
  
 I typically get a write rate of about 815 MB/s, and we never exceed 848 MB/s. 
  When I run obdfilter-survey, we easily get about 3-4GB/s write speed, but 
 when I run a series of lnet-selftests, the read and write rates range from 
 850MB/s – 875MB/s max.  I have performed the following optimizations to 
 increase the data rate:
  
 On the Client:
 lctl set_param osc.*.checksums=0
 lctl set_param osc.*.max_dirty_mb=256
  
 On the OSTs
 lctl set_param obdfilter.*.writethrough_cache_enable=0
 lctl set_param obdfilter.*.read_cache_enable=0
  
 echo 4096 > /sys/block/devices/queue/nr_requests
  
 I have also loaded the ib_sdp module, which also brought an increase in 
 speed.  However, we need to be able to record at no less than 1GB/s, which we 
 cannot achieve right now.  Any thoughts on how I can optimize LNET, which 
 clearly seems to be the bottleneck?
  
 Thank you for any help you can provide,
 Carl Barberi



Re: [Lustre-discuss] Where to download Lustre from since 01 Aug?

2011-08-04 Thread Liang Zhen
Hi, you can download Whamcloud Lustre releases from here:
http://downloads.whamcloud.com

Regards
Liang

On Aug 4, 2011, at 8:03 PM, Torsten Harenberg wrote:

 Dear all,
 
 after the migration of the Sun downloads website to Oracle, I couldn't
 find a place anymore where I can actually download anything lustre
 related.
 
 Can someone point me to a site which still works after August 1st?
 
 Thanks
 
  Torsten
 
 
 
 --
 
   
  Dr. Torsten Harenberg harenb...@physik.uni-wuppertal.de  
  Bergische Universitaet   
  FB C - Physik Tel.: +49 (0)202 439-3521  
  Gaussstr. 20  Fax : +49 (0)202 439-2811  
  42097 Wuppertal  
   
  Of course it runs NetBSD http://www.netbsd.org  
 
 



Re: [Lustre-discuss] [Lustre-devel] Lustre and Multicore

2011-05-31 Thread Liang Zhen
Vilobh,

We do have work-in-progress for performance on multicores 
(http://jira.whamcloud.com/browse/LU-56), but these changes will be almost 
transparent to users, and they have nothing to do with lustre protocol.

Regards
Liang

On May 31, 2011, at 5:56 PM, vilobh meshram wrote:

 Hi ,
 
 Wanted to understand will the design of Lustre(Client/MDS/OSS) change if we 
 have Multicores.
 
 Thanks,
 Vilobh
 ___
 Lustre-devel mailing list
 lustre-de...@lists.lustre.org
 http://lists.lustre.org/mailman/listinfo/lustre-devel



Re: [Lustre-discuss] Landing and tracking tools improvements

2011-05-23 Thread Liang Zhen
A small suggestion about Maloo:
I think it would be helpful if we could sort testing results by base branch; I 
actually can't find branch information even after clicking into a result report. 

Thanks
Liang

On May 24, 2011, at 1:06 AM, Chris Gearing wrote:

 We now have a whole kit of tools [Jira, Gerrit, Jenkins and Maloo] used 
 for tracking, reviewing and testing of code that are being used for the 
 development of Lustre. A lot of time has been spent integrating and 
 connecting them appropriately but as with anything the key is to 
 continuously look for ways to improve what we have and how it works.
 
 So my question is: what do people think of the tools as they stand today, 
 and how can we improve them moving forwards? If people can respond to 
 lustre-discuss then I'll collate the outcome of any discussions and 
 then create a Wiki page that can form a plan for improvement.
 
 Please be as descriptive as possible in your replies and take into 
 account that I and others have no experience of Lustre past so if you 
 liked something prior to the current tools you'll need to help me and 
 them understand the details.
 
 Thanks
 
 Chris
 
 ---
 Chris Gearing
 Snr Engineer
 Whamcloud. Inc.
 
 



Re: [Lustre-discuss] Lustre over o2ib issue

2011-03-23 Thread Liang Zhen
Hi Diego,

Do  you have any other module parameter  for lnet and lnd? 

Regards
Liang 


On Mar 22, 2011, at 9:26 PM, Diego Moreno wrote:

 Hi,
 
 We are having this problem right now with our Lustre 2.0. We tried the 
 proposed solutions but they didn't work for us.
 
 We have 2 QDR IB cards on 4 servers and we have to do lctl ping from 
 each server to every client if we want clients to connect to servers. We 
 don't have ib_mthca modules loaded because we don't have DDR cards and 
 we configured ip2nets with no result.
 
 Our ip2nets configuration ([7-10] interfaces are in servers, the others 
 are in clients):
 o2ib0(ib0) 10.50.0.[7-10] ; o2ib1(ib1) 10.50.1.[7-10] ; o2ib0(ib0) 
 10.50.*.* ; o2ib1(ib0) 10.50.*.*
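As a side note, ip2nets rules like the ones quoted above would normally be expressed as a single module option; a sketch (the file path is an assumption, the rules are copied from the message, and note that LNet evaluates them first-match-first):

```shell
# /etc/modprobe.d/lustre.conf (hypothetical) -- server ranges first,
# then the catch-all client rules, since the first matching rule wins
options lnet ip2nets="o2ib0(ib0) 10.50.0.[7-10]; o2ib1(ib1) 10.50.1.[7-10]; \
o2ib0(ib0) 10.50.*.*; o2ib1(ib0) 10.50.*.*"
```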
 
 So the only way of having clients connected to servers is doing 
 something like this on every server:
 
 for i in $CLIENT_IB_LIST ; do
 lctl ping $i@o2ib0
 lctl ping $i@o2ib1
 done
 
 Before lctl ping we get messages like this one:
 
 Lustre: 50389:0:(lib-move.c:1028:lnet_post_send_locked()) Dropping 
 message for 12345-10.50.1.7@o2ib1: peer not alive
 
 After lctl ping, everything works right.
 
 Maybe I'm missing something or this is a known bug in lustre 2.0...
 
 
 On 16/03/2011 22:13, Andreas Dilger wrote:
 On 2011-03-16, at 3:04 PM, Mike Hanby wrote:
 Thanks, I forgot to include the card info:
 
 The servers each have a single IB card: dual port MT26528 QDR
 o2ib0(ib0) on each server is attached to the QLogic switch (with three 
 attached M3601Q switches 48 attached blades)
 o2ib1(ib1) on each server is attached to a stack of two M3601Q switches 
 with 24 attached blades
 
 The blades connected to o2ib0 each have an MT26428 QDR IB card
 The blades connected to o2ib1 each have an MT25418 DDR IB card
 
 You may also want to check out the ip2nets option for specifying the Lustre 
 networks.  It is made to handle configuration issues like this where the 
 interface name is not constant across client/server nodes.
 
 
 -Original Message-
 From: lustre-discuss-boun...@lists.lustre.org 
 [mailto:lustre-discuss-boun...@lists.lustre.org] On Behalf Of Nirmal Seenu
 Sent: Wednesday, March 16, 2011 2:10 PM
 To: lustre-discuss@lists.lustre.org
 Subject: Re: [Lustre-discuss] Lustre over o2ib issue
 
 If you are using DDR and QDR, or any two different cards in the same 
 machine, there is no guarantee that the same IB cards get assigned to ib0 
 and ib1.
 
 To fix that problem you need to comment out the following 3 lines in 
 /etc/init.d/openibd:
 
 #for i in `grep "^driver: " /etc/sysconfig/hwconf | sed -e 's/driver: //' | grep -w "ib_mthca\|ib_ipath\|mlx4_core\|cxgb3\|iw_nes"`; do
 #load_modules $i
 #done
 
 and include the following lines instead(we wanted the DDR card to be ib0 
 and the QDR card to be ib1):
 load_modules ib_mthca
 /bin/sleep 10
 load_modules mlx4_core
 
 and you will need to restart openibd once again (we included it in 
 rc.local) to make sure that the same IB cards are assigned to the devices 
 ib0 and ib1.
 
 Nirmal
 
 
 Cheers, Andreas
 --
 Andreas Dilger
 Principal Engineer
 Whamcloud, Inc.
 
 
 
 
 


Re: [Lustre-discuss] OST threads

2011-02-26 Thread Liang Zhen
In the long term, I would think we could add a common library for thread pools. 
Several modules have their own thread-pool implementation (the ptlrpc service has 
two thread pools, the LNDs...), so if we had such a library (create threads, kill 
threads, grow threads, shrink threads) we could get rid of a bunch of duplicated 
code, though the APIs would have to be designed carefully so they can satisfy at 
least all current use-cases.

Regards
Liang

On Feb 26, 2011, at 12:16 PM, Andreas Dilger wrote:

 On 2011-02-25, at 4:37 PM, Mervini, Joseph A wrote:
 That could be awful handy - especially when trying to tune a live file 
 system for performance. Is that going to be a 2.0 only enhancement or can it 
 be applied to existing 1.8 versions?
 
 The patch was originally developed for 1.8, and ported to 2.1.  That said, 
 last time I tested it there were a few problems (crashing variety) so it 
 isn't ready for prime time yet.
 
 Testing/debugging would be appreciated, patch for 1.8 and 2.1 are at:
 https://bugzilla.lustre.org/show_bug.cgi?id=22516
 
 
 On Feb 24, 2011, at 9:19 PM, Andreas Dilger wrote:
 Yes, this can be set at startup time to limit the number of started 
 threads. There is a patch I wrote to also reduce the number of running 
 treads but it wasn't landed yet. 
 
 Cheers, Andreas
 
 On 2011-02-24, at 14:04, Mervini, Joseph A jame...@sandia.gov wrote:
 
 I'm inclined to agree. So apparently the only time that modifying the 
 runtime max values has a benefit is while the threads_started is low?
 
 Joe
 
 
 Joe Mervini
 Sandia National Laboratories
 High Performance Computing
 505.844.6770
 jame...@sandia.gov
 
 
 
 On Feb 24, 2011, at 1:09 PM, Kevin Van Maren wrote:
 
 However, I don't think you can decrease the number of running threads.
 See https://bugzilla.lustre.org/show_bug.cgi?id=22417 (and also 
 https://bugzilla.lustre.org/show_bug.cgi?id=22516 )
 
 Kevin
 
 
 Mervini, Joseph A wrote:
 Cool! Thank you Johann.
 
 
 Joe Mervini
 Sandia National Laboratories
 High Performance Computing
 505.844.6770
 jame...@sandia.gov
 
 
 
 On Feb 24, 2011, at 11:05 AM, Johann Lombardi wrote:
 
 
 On Thu, Feb 24, 2011 at 10:48:32AM -0700, Mervini, Joseph A wrote:
 
 Quick question: Has runtime modification of the number of OST threads 
 been implemented in Lustre-1.8.3?
 
 Yes, see bugzilla ticket 18688. It was landed in 1.8.1.
 
 
 
 Cheers, Andreas
 --
 Andreas Dilger 
 Principal Engineer
 Whamcloud, Inc.
 
 
 


Re: [Lustre-discuss] Disabling RDMA on an IB interface

2011-02-23 Thread Liang Zhen

On Feb 24, 2011, at 10:45 AM, Jeremy Filizetti wrote:

 As Chris mentioned, you're talking about two very different methods.  I think 
 you can use netem with IPoIB, but I have never tried it.  If you use connected 
 mode I think you're still technically doing RDMA, but the maximum size (MTU) is 
 around 64k, which isn't sufficient for higher latencies.  In the first few 
 slides of my LUG presentation last year I have some graphs that show how RDMA 
 performance is affected by latency and needs to be increased to compensate for 
 the bandwidth-delay product (BDP).  If you do want to use IPoIB you can add a 
 line similar to the following in your /etc/modprobe.conf or a file in the 
 /etc/modprobe.d directory:
 
   options lnet networks=tcp(ib0)
 
 If you want to use RC QPs as ko2iblnd does, we use the following kernel 
 parameters:
 
options lnet networks=o2ib(ib0)
options ko2iblnd map_on_demand=2 peer_credits=128 credits=256 
 concurrent_sends=256 ntx=512 fmr_pool_size=2048 fmr_flush_trigger=512 
 fmr_cache=1


If you have peer_credits=128, then I would suggest increasing credits to 1024 and 
ntx to 2048; otherwise a couple of clients could consume all NI credits. 
concurrent_sends is not necessary here because o2iblnd will estimate a proper 
value for it. 
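Put together, the suggested settings would look something like the sketch below (assembled from the advice above and the parameters Jeremy quoted; the file path is an assumption and this is not a tested configuration):

```shell
# /etc/modprobe.d/lustre.conf (hypothetical) -- with peer_credits=128,
# raise the shared NI credits so a few peers cannot drain them all;
# concurrent_sends is omitted because o2iblnd estimates it itself
options lnet networks="o2ib(ib0)"
options ko2iblnd map_on_demand=2 peer_credits=128 credits=1024 ntx=2048 \
    fmr_pool_size=2048 fmr_flush_trigger=512 fmr_cache=1
```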

Thanks
Liang






Re: [Lustre-discuss] RDMA limitation?

2010-04-13 Thread Liang Zhen
It's a story like this: if you have to take dozens of global locks 
over the lifetime of an RPC, then the code can't scale well on a large SMP 
system, no matter what kind of network you are using, so the problem 
is scattered everywhere.
Also, we are trying to reduce RPC bouncing between CPUs. In the current code, 
a request can be received by CPU A, then queued on CPU B, processed by 
CPU C, and replied to by CPU D; that's very bad on a large SMP system because 
of the data traffic between CPUs.

Regards
Liang

Jiahua wrote:
 You mean it is inherent in the code? Can you point me to the actual
 code if possible? I am just curious why. Any pointers or hints will be
 appreciated.

 Thanks,
 Jiahua


 On Tue, Apr 13, 2010 at 6:46 PM, Kevin Van Maren kevin.vanma...@sun.com 
 wrote:
   
 Yes, the RPC rate is limited by Lustre code locking to that rate, even with
 rdma.

 Kevin


 On Apr 13, 2010, at 5:08 PM, Jiahua jia...@gmail.com wrote:

 
 Hi all,

 This is kind of a followup question of the thread One or two OSS, no
 difference? last month. In that thread, Andreas stated:

 There is work currently underway to improve the SMP scaling
 performance for the RPC handling layer in Lustre.  Currently that
 limits the delivered RPC rate to 10-15k/sec or so.

 My question is: is the limitation also applied to RDMA on IB? By SMP,
 I guess Andreas was talking about CPU, right? Since RDMA can bypass
 the host CPU, does it mean it can also bypass the limitation?

 Thanks,
 Jiahua


Re: [Lustre-discuss] 1.8.1(-ish) client vs. 1.6.7.2 server

2009-07-21 Thread Liang Zhen
Robin,

These messages should be harmless. 1.8.1 uses a new o2iblnd message 
protocol, so there is a version negotiation if the client's o2iblnd version 
is older. Are there any other o2ib error messages, like Deleting 
messages for xxx.xxx.xxx@o2ib: connection failed, when you see the IO 
failure? Anyway, if you get more complaints from o2ib beyond these 
messages, could you please post them on the bug you filed?

Thanks
Liang

Robin Humble wrote:
 I added this to bugzilla.
   https://bugzilla.lustre.org/show_bug.cgi?id=20227

 cheers,
 robin

 On Wed, Jul 15, 2009 at 01:09:33PM -0400, Robin Humble wrote:
   
 On Wed, Jul 15, 2009 at 08:46:12AM -0400, Robin Humble wrote:
 
 I get a ferocious set of error messages when I mount a 1.6.7.2
 filesystem on a b_release_1_8_1 client.
 is this expected?
   
 just to annotate the below a bit in case it's not clear... sorry -
 should have done that in the first email :-/

 10.8.30.244 is MGS and one MDS, 10.8.30.245 is the other MDS in the
 failover pair. 10.8.30.201 - 208 are OSS's (one OST per OSS), and the
 fs is mounted in the usual failover way eg.
  mount -t lustre 10.8.30@o2ib:10.8.30@o2ib:/system /system

 
 from the below (and other similar logs) it kinda looks like the client
   
 fails and then renegotiates with all the servers.

 cheers,
 robin
 --
 Dr Robin Humble, HPC Systems Analyst, NCI National Facility

 
  Lustre: 13800:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 
 10.8.30@o2ib failed: 5
  Lustre: 13799:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 
 10.8.30@o2ib failed: 5
  Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: 
 retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, 
 msg_size: 4096
  Lustre: mgc10.8.30@o2ib: Reactivating import
  Lustre: 13797:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 
 10.8.30@o2ib failed: 5
  Lustre: 13798:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 
 10.8.30@o2ib failed: 5
  Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: 
 retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, 
 msg_size: 4096
  Lustre: Client system-client has started
  Lustre: 13798:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 
 10.8.30@o2ib failed: 5
  ... last message repeated 17 times ...
  Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: 
 retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, 
 msg_size: 4096
  Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: 
 retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, 
 msg_size: 4096
  Lustre: 13798:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 
 10.8.30@o2ib failed: 5
  Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: 
 retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, 
 msg_size: 4096
  Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: 
 retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, 
 msg_size: 4096
  Lustre: 13797:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 
 10.8.30@o2ib failed: 5
  Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: 
 retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, 
 msg_size: 4096
  Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: 
 retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, 
 msg_size: 4096
  Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: 
 retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, 
 msg_size: 4096
  Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: 
 retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, 
 msg_size: 4096
  Lustre: 13800:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 
 10.8.30@o2ib failed: 5

 looks like it succeeds in the end, but only after a struggle.

 I don't have any problems with 1.8.1 - 1.8.1 or 1.6.7.2 - 1.6.7.2.

 servers are rhel5 x86_64 2.6.18-92.1.26.el5 1.6.7.2 + bz18793 (group
 quota fix).
 client is rhel5 x86_64 patched 2.6.18-128.1.16.el5-b_release_1_8_1 from
 cvs 20090712131220 + bz18793 again.

 BTW, should I be using cvs tag v1_8_1_RC1 instead of b_release_1_8_1?
 I'm confused about which is closest to the final 1.8.1 :-/

 cheers,
 robin
 --
 Dr Robin Humble, HPC Systems Analyst, NCI National Facility


Re: [Lustre-discuss] Lustre Client using 10GigE iWarp

2009-06-01 Thread Liang Zhen
David Dillow wrote:
 On Fri, 2009-05-29 at 08:57 -0500, Dennis Nelson wrote:
   
 I have tried this:

 options lnet networks=o2ib(eth2)

 Is that correct?
 

 I would think so, but someone more experienced in the Lustre o2ib LND
 and iWarp should chime in on that -- I've never used iWarp, only IB.
   

We also need this for iWarp:
options o2iblnd map_on_demand=64
or map_on_demand=32 if startup fails.

Regards
Liang

   
 The client is connected to the IB fabric using a Voltaire 10 GigE line card
 on a IB switch.  Has anyone tested such a configuration?  Should I expect it
 to work?
 

 I don't expect that to work, unless the Voltaire is converting iWarp to
 IB -- and I highly doubt that. It is more likely using EoIB to transport
 the ethernet frames over the IB fabric. You will likely need a Lustre
 router to sit between the two fabrics.
   



Re: [Lustre-discuss] o2ib cant ping/mount Infiniband NID

2009-01-23 Thread Liang Zhen
 172.24.198@o2ib
 failed to ping 172.24.198@o2ib: Input/output error

 /var/log/messages:


 Jan 16 10:24:14 p128 kernel: Lustre: 2750:0:(o2iblnd_cb.c:2687:
 kiblnd_cm_callback()) 172.24.198@o2ib: ROUTE ERROR -22
 Jan 16 10:24:14 p128 kernel: Lustre:
 2750:0:(o2iblnd_cb.c:2101:kiblnd_peer_connect_failed()) Deleting
 messages for 172.24.198@o2ib: connection failed

 how can I get rid of this connection problem?

 ~subbu



 On Fri, Jan 16, 2009 at 2:11 PM, Liang Zhen zhen.li...@sun.com
 mailto:zhen.li...@sun.com wrote:

 Subbu,

 We don't have any tips for setting up IPoIB. It looks like Linux can't
 find the ifaddr of ib0 on the MDS (-99 is EADDRNOTAVAIL), so I
 think it's because you didn't assign any address to ib0 (or
 failed to assign an address to ib0) before loading o2iblnd in
 the first try.
 I can reproduce exactly same error by:
 1. modprobe ib_ipoib
 2. ifconfig ib0 up  // without assigning any address
 3. modprobe ko2iblnd
 4. lctl network up

 Regards
 Liang

 subbu kl:

 Liang,
 after executing the following echo:
 echo +neterror > /proc/sys/lnet/printk

 now lctl ping shows the following error

 # lctl ping 172.24.198@o2ib
 failed to ping 172.24.198@o2ib: Input/output error

 Jan 16 10:24:14 p128 kernel: Lustre:
 2750:0:(o2iblnd_cb.c:2687:kiblnd_cm_callback())
 172.24.198@o2ib: ROUTE ERROR -22
 Jan 16 10:24:14 p128 kernel: Lustre:
 2750:0:(o2iblnd_cb.c:2101:kiblnd_peer_connect_failed())
 Deleting messages for 172.24.198@o2ib: connection failed

 Looks like some problem with IB connection manager !

 1. do we have any help docs to setup IPoIB and Lustre,
 lustre operation manual has very minimal info about this .
 I think I am missing some IPoIB setup part here.
 2. or is it mannual assignment of  IP addresses to ib0
 is creating some problem


 Some more supporting info:
 a subnet manager of the following version is also running:
 OpenSM 3.1.8

 Initially I got this error for MDS mount

 Jan 16 09:45:20 p128 kernel: LustreError:
 4991:0:(linux-tcpip.c:124:libcfs_ipif_query()) Can't get
 IP address for interface ib0
 Jan 16 09:45:20 p128 kernel: LustreError:
 4991:0:(o2iblnd.c:1563:kiblnd_startup()) Can't query IPoIB
 interface ib0: -99
 Jan 16 09:45:21 p128 kernel: LustreError: 105-4: Error
 -100 starting up LNI o2ib
 Jan 16 09:45:21 p128 kernel: LustreError:
 4991:0:(events.c:707:ptlrpc_init_portals()) network
 initialisation failed
 Jan 16 09:45:21 p128 modprobe: WARNING: Error inserting
 ptlrpc
 
 (/lib/modules/2.6.18-53.1.14.el5_lustre.1.6.5.1smp/kernel/fs/lustre/ptlrpc.ko):
 Input/output error
 Jan 16 09:45:21 p128 modprobe: WARNING: Error inserting
 osc
 
 (/lib/modules/2.6.18-53.1.14.el5_lustre.1.6.5.1smp/kernel/fs/lustre/osc.ko):
 Unknown symbol in module, or unknown parameter (see dmesg)
 Jan 16 09:45:21 p128 kernel: osc: Unknown symbol
 ldlm_prep_enqueue_req
 Jan 16 09:45:21 p128 kernel: osc: Unknown symbol
 ldlm_resource_get
 Jan 16 09:45:21 p128 kernel: osc: Unknown symbol
 ptlrpc_lprocfs_register_obd
 .
 .
 .

 then I mannually set the IP address for ib0 as folows :
 # ifconfig ib0 172.24.198.111

 [r...@p186 ~]# ifconfig ib0
 ib0   Link encap:InfiniBand  HWaddr
 80:00:04:04:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
  inet addr:172.24.198.112  Bcast:172.24.255.255
  Mask:255.255.0.0
  UP BROADCAST MULTICAST  MTU:65520  Metric:1
  RX packets:0 errors:0 dropped:0 overruns:0 frame:0
  TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:256
  RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)

 then it mounted sucessfully

 *Jan 16 09:47:09 p128 kernel: Lustre: Added LNI
 172.24.198@o2ib [8/64]
 Jan 16 09:47:09 p128 kernel: Lustre: MGS MGS started*
 Jan 16 09:47:09 p128 kernel: Lustre: Setting parameter
 lustre-MDT.mdt.group_upcall in log lustre-MDT
 Jan 16 09:47:09 p128 kernel: Lustre: Enabling user_xattr
 Jan 16 09:47:09 p128 kernel: Lustre: lustre-MDT: new
 disk, initializing
 Jan 16 09:47

Re: [Lustre-discuss] o2ib cant ping/mount Infiniband NID

2009-01-16 Thread Liang Zhen
Subbu,

We don't have any tips for setting up IPoIB. It looks like Linux can't find 
the ifaddr of ib0 on the MDS (-99 is EADDRNOTAVAIL), so I think it's because 
you didn't assign an address to ib0 (or failed to assign one) before loading 
o2iblnd on the first try.
I can reproduce exactly the same error by:
1. modprobe ib_ipoib
2. ifconfig ib0 up   # without assigning any address
3. modprobe ko2iblnd
4. lctl network up
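
In contrast, the failure goes away when the address is assigned first. A minimal sketch of the working ordering (the address and netmask are illustrative, taken from the examples quoted below):

```shell
# Working ordering: give ib0 an address BEFORE loading the o2ib LND
modprobe ib_ipoib
ifconfig ib0 172.24.198.111 netmask 255.255.0.0 up   # illustrative address

# o2iblnd can now query ib0's address at startup
modprobe ko2iblnd
lctl network up

# Verify the resulting local NID
lctl list_nids
```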

Regards
Liang

subbu kl:
 Liang,
 after executing the following echo:
 echo +neterror > /proc/sys/lnet/printk

 now lctl ping shows the following error

 # lctl ping 172.24.198@o2ib
 failed to ping 172.24.198@o2ib: Input/output error

 Jan 16 10:24:14 p128 kernel: Lustre: 
 2750:0:(o2iblnd_cb.c:2687:kiblnd_cm_callback()) 172.24.198@o2ib: 
 ROUTE ERROR -22
 Jan 16 10:24:14 p128 kernel: Lustre: 
 2750:0:(o2iblnd_cb.c:2101:kiblnd_peer_connect_failed()) Deleting 
 messages for 172.24.198@o2ib: connection failed

 Looks like some problem with the IB connection manager!

 1. do we have any help docs for setting up IPoIB and Lustre? The Lustre 
 operations manual has very minimal info about this. I think I am 
 missing some IPoIB setup step here.
 2. or is the manual assignment of IP addresses to ib0 creating 
 some problem?


 Some more supporting info:
 a subnet manager of the following version is also running: OpenSM 3.1.8

 Initially I got this error for MDS mount

 Jan 16 09:45:20 p128 kernel: LustreError: 
 4991:0:(linux-tcpip.c:124:libcfs_ipif_query()) Can't get IP address 
 for interface ib0
 Jan 16 09:45:20 p128 kernel: LustreError: 
 4991:0:(o2iblnd.c:1563:kiblnd_startup()) Can't query IPoIB interface 
 ib0: -99
 Jan 16 09:45:21 p128 kernel: LustreError: 105-4: Error -100 starting 
 up LNI o2ib
 Jan 16 09:45:21 p128 kernel: LustreError: 
 4991:0:(events.c:707:ptlrpc_init_portals()) network initialisation failed
 Jan 16 09:45:21 p128 modprobe: WARNING: Error inserting ptlrpc 
 (/lib/modules/2.6.18-53.1.14.el5_lustre.1.6.5.1smp/kernel/fs/lustre/ptlrpc.ko):
  
 Input/output error
 Jan 16 09:45:21 p128 modprobe: WARNING: Error inserting osc 
 (/lib/modules/2.6.18-53.1.14.el5_lustre.1.6.5.1smp/kernel/fs/lustre/osc.ko): 
 Unknown symbol in module, or unknown parameter (see dmesg)
 Jan 16 09:45:21 p128 kernel: osc: Unknown symbol ldlm_prep_enqueue_req
 Jan 16 09:45:21 p128 kernel: osc: Unknown symbol ldlm_resource_get
 Jan 16 09:45:21 p128 kernel: osc: Unknown symbol 
 ptlrpc_lprocfs_register_obd
 .
 .
 .

 then I manually set the IP address for ib0 as follows:
 ifconfig ib0 172.24.198.111

 [r...@p186 ~]# ifconfig ib0
 ib0   Link encap:InfiniBand  HWaddr 
 80:00:04:04:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
   inet addr:172.24.198.112  Bcast:172.24.255.255  Mask:255.255.0.0
   UP BROADCAST MULTICAST  MTU:65520  Metric:1
   RX packets:0 errors:0 dropped:0 overruns:0 frame:0
   TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
   collisions:0 txqueuelen:256
   RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
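
 For reference, the address can also be made persistent across reboots, so 
 o2iblnd finds it on every boot; a sketch assuming RHEL-style network 
 scripts (file name and values are illustrative):

```shell
# /etc/sysconfig/network-scripts/ifcfg-ib0  (RHEL-style, illustrative)
DEVICE=ib0
TYPE=InfiniBand
BOOTPROTO=static
IPADDR=172.24.198.111
NETMASK=255.255.0.0
ONBOOT=yes
```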

 then it mounted successfully

 Jan 16 09:47:09 p128 kernel: Lustre: Added LNI 172.24.198@o2ib [8/64]
 Jan 16 09:47:09 p128 kernel: Lustre: MGS MGS started
 Jan 16 09:47:09 p128 kernel: Lustre: Setting parameter 
 lustre-MDT.mdt.group_upcall in log lustre-MDT
 Jan 16 09:47:09 p128 kernel: Lustre: Enabling user_xattr
 Jan 16 09:47:09 p128 kernel: Lustre: lustre-MDT: new disk, 
 initializing
 Jan 16 09:47:09 p128 kernel: Lustre: MDT lustre-MDT now serving 
 dev (lustre-MDT/64db1fc7-03ba-9803-4d20-ab0d2aa66116) with 
 recovery enabled
 Jan 16 09:47:09 p128 kernel: Lustre: 
 5274:0:(lproc_mds.c:262:lprocfs_wr_group_upcall()) lustre-MDT: 
 group upcall set to /usr/sbin/l_getgroups
 Jan 16 09:47:09 p128 kernel: Lustre: lustre-MDT.mdt: set parameter 
 group_upcall=/usr/sbin/l_getgroups
 Jan 16 09:47:09 p128 kernel: Lustre: Server lustre-MDT on device 
 /dev/loop0 has started
 .
 .
 .


 ~subbu


 On Thu, Jan 15, 2009 at 8:37 PM, Liang Zhen zhen.li...@sun.com wrote:

 Subbu,

 I'd suggest:
 1) make sure ko2iblnd has been brought up (please check if there
 is any error message when starting ko2iblnd)
 2) echo +neterror > /proc/sys/lnet/printk, then try lctl
 ping again; if it still doesn't work please post the error messages

 Regards
 Liang

 subbu kl:

 Problem is similar to
 http://lists.lustre.org/pipermail/lustre-discuss/2008-May/007498.html
 But by looking at the thread could not really get the solution
 for the problem.

 I have two RHEL5 Linux servers installed with following packages -

 kernel-lustre-smp-2.6.18-53.1.14.el5_lustre.1.6.5.1
 kernel-ib-1.3-2.6.18_53.1.14.el5_lustre.1.6.5.1smp
 lustre-ldiskfs-3.0.4-2.6.18_53.1.14.el5_lustre.1.6.5.1smp
 lustre-1.6.5.1-2.6.18_53.1.14.el5_lustre.1.6.5.1smp
 lustre-modules-1.6.5.1-2.6.18_53.1.14.el5_lustre.1.6.5.1smp

Re: [Lustre-discuss] o2ib cant ping/mount Infiniband NID

2009-01-15 Thread Liang Zhen
Subbu,

I'd suggest:
1) make sure ko2iblnd has been brought up (please check if there is any 
error message when starting ko2iblnd)
2) echo +neterror > /proc/sys/lnet/printk, then try lctl ping again; if 
it still doesn't work please post the error messages
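
In command form, the two checks above might look like this (the peer NID is illustrative; take the real one from lctl list_nids on the peer):

```shell
# 1) confirm the o2ib LND module is actually loaded and started cleanly
lsmod | grep ko2iblnd
dmesg | grep -i o2ib            # look for startup errors

# 2) enable network-error console messages, then retry the LNet ping
echo +neterror > /proc/sys/lnet/printk
lctl ping 172.24.198.112@o2ib   # illustrative peer NID
```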

Regards
Liang

subbu kl:
 Problem is similar to 
 http://lists.lustre.org/pipermail/lustre-discuss/2008-May/007498.html
 But by looking at the thread could not really get the solution for the 
 problem.

 I have two RHEL5 Linux servers installed with following packages -

 kernel-lustre-smp-2.6.18-53.1.14.el5_lustre.1.6.5.1
 kernel-ib-1.3-2.6.18_53.1.14.el5_lustre.1.6.5.1smp
 lustre-ldiskfs-3.0.4-2.6.18_53.1.14.el5_lustre.1.6.5.1smp
 lustre-1.6.5.1-2.6.18_53.1.14.el5_lustre.1.6.5.1smp
 lustre-modules-1.6.5.1-2.6.18_53.1.14.el5_lustre.1.6.5.1smp
 e2fsprogs-1.40.7.sun3-0redhat


 machine 1: with ib0 IP address : 172.24.198.111
 machine 2: with ib0 IP address : 172.24.198.112

 /etc/modprobe.conf contains
 options lnet networks=o2ib
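
 As a quick sanity check, the effect of that option can be verified after 
 loading the modules (sketch; the expected NID is illustrative):

```shell
grep lnet /etc/modprobe.conf   # should show: options lnet networks=o2ib
modprobe lnet
lctl network up                # starts LNet and the o2ib LND
lctl list_nids                 # expect a NID on the o2ib network
```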

 TCP networking worked fine, and now I am trying an InfiniBand network. 
 I am finding it difficult to communicate with the IB nodes; the mount 
 attempt throws the following error

 [r...@p186 ~]# mount -t lustre -o loop /tmp/lustre-ost1 /mnt/ost1
 mount.lustre: mount /dev/loop0 at /mnt/ost1 failed: Input/output error
 Is the MGS running?

 /var/log/messages :
 Jan 15 16:55:25 p186 kernel: kjournald starting.  Commit interval 5 
 seconds
 Jan 15 16:55:25 p186 kernel: LDISKFS FS on loop0, internal journal
 Jan 15 16:55:25 p186 kernel: LDISKFS-fs: mounted filesystem with 
 ordered data mode.
 Jan 15 16:55:25 p186 kernel: kjournald starting.  Commit interval 5 
 seconds
 Jan 15 16:55:25 p186 kernel: LDISKFS FS on loop0, internal journal
 Jan 15 16:55:25 p186 kernel: LDISKFS-fs: mounted filesystem with 
 ordered data mode.
 Jan 15 16:55:25 p186 kernel: LDISKFS-fs: file extents enabled
 Jan 15 16:55:25 p186 kernel: LDISKFS-fs: mballoc enabled
 Jan 15 16:55:30 p186 kernel: Lustre: Request x7 sent from 
 mgc172.24.198@o2ib to NID 172.24.198@o2ib 5s ago has timed out 
 (limit 5s).
 Jan 15 16:55:30 p186 kernel: LustreError: 
 7193:0:(obd_mount.c:1062:server_start_targets()) Required registration 
 failed for lustre-OST: -5
 Jan 15 16:55:30 p186 kernel: LustreError: 15f-b: Communication error 
 with the MGS.  Is the MGS running?
 Jan 15 16:55:30 p186 kernel: LustreError: 
 7193:0:(obd_mount.c:1597:server_fill_super()) Unable to start targets: -5
 Jan 15 16:55:30 p186 kernel: LustreError: 
 7193:0:(obd_mount.c:1382:server_put_super()) no obd lustre-OST
 Jan 15 16:55:30 p186 kernel: LustreError: 
 7193:0:(obd_mount.c:119:server_deregister_mount()) lustre-OST not 
 registered
 Jan 15 16:55:30 p186 kernel: LDISKFS-fs: mballoc: 0 blocks 0 reqs (0 
 success)
 Jan 15 16:55:30 p186 kernel: LDISKFS-fs: mballoc: 0 extents scanned, 0 
 goal hits, 0 2^N hits, 0 breaks, 0 lost
 Jan 15 16:55:30 p186 kernel: LDISKFS-fs: mballoc: 0 generated and it 
 took 0
 Jan 15 16:55:30 p186 kernel: LDISKFS-fs: mballoc: 0 preallocated, 0 
 discarded
 Jan 15 16:55:30 p186 kernel: Lustre: server umount lustre-OST complete
 Jan 15 16:55:30 p186 kernel: LustreError: 
 7193:0:(obd_mount.c:1951:lustre_fill_super()) Unable to mount  (-5)

 All attempts to ping the IB NIDs, local or remote, also failed.
 I can ping the IP addresses:
 [r...@p186 ~]# ping 172.24.198.112
 PING 172.24.198.112 (172.24.198.112) 56(84) bytes of data.
 64 bytes from 172.24.198.112: icmp_seq=1 
 ttl=64 time=0.052 ms
 64 bytes from 172.24.198.112: icmp_seq=2 
 ttl=64 time=0.024 ms

 --- 172.24.198.112 ping statistics ---
 2 packets transmitted, 2 received, 0% packet loss, time 1000ms
 rtt min/avg/max/mdev = 0.024/0.038/0.052/0.014 ms
 [r...@p186 ~]# ping 172.24.198.111
 PING 172.24.198.111 (172.24.198.111) 56(84) bytes of data.
 64 bytes from 172.24.198.111: icmp_seq=1 
 ttl=64 time=2.16 ms
 64 bytes from 172.24.198.111: icmp_seq=2 
 ttl=64 time=0.296 ms

 --- 172.24.198.111 ping statistics ---
 2 packets transmitted, 2 received, 0% packet loss, time 1000ms
 rtt min/avg/max/mdev = 0.296/1.231/2.166/0.935 ms

 but can't ping the NIDs:
 [r...@p186 ~]# lctl ping 172.24.198@o2ib
 failed to ping 172.24.198@o2ib: Input/output error
 [r...@p186 ~]# lctl ping 172.24.198@o2ib
 failed to ping 172.24.198@o2ib: Input/output error

 Any idea why LNet can't ping the NIDs?

 some more configurations:
 [r...@p186 ~]# ibstat
 CA 'mthca0'
 CA type: MT23108
 Number of ports: 2
 Firmware version: 3.5.0
 Hardware version: a1
 Node GUID: 0x0002c9020021550c

 Machines are connected via IB switch.

 Looking forward to your help.

 ~subbu
 

 ___
 Lustre-discuss mailing list
 Lustre-discuss@lists.lustre.org
 http://lists.lustre.org/mailman/listinfo/lustre-discuss
   


Re: [Lustre-discuss] OFED 1.4

2009-01-06 Thread Liang Zhen
Hi Roger,
I got a chance to try OFED 1.4; it works fine, but Lustre 1.6.6 can't 
compile against it because it conflicts with some of OFED's backport 
headers. We will change our build system for this soon.

Regards
Liang

Roger Spellman:

 Hi,

 Is anyone using Lustre 1.6.6 with OFED 1.4?  If so, how is this going?

 Thanks.

  

 Roger Spellman

 Staff Engineer

 Terascala, Inc.

 508-588-1501

 www.terascala.com http://www.terascala.com/

 




Re: [Lustre-discuss] ko2iblnd panics in kiblnd_map_tx_descs

2008-03-05 Thread Liang Zhen
Hi Chris,
To resolve your problem, please:
1. apply this patch to your lnet:
https://bugzilla.lustre.org/attachment.cgi?id=15733
2. make sure to use this option when running configure: 
--with-o2ib=/path/to/ofed
3. copy /path/to/ofed/Module.symvers into your $LUSTRE tree before building
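
As a command sketch of those three steps (the paths and patch level are assumptions; adjust them to your OFED install and Lustre source tree):

```shell
cd $LUSTRE                               # Lustre 1.6.4.2 source tree
patch -p0 < lnet-attachment-15733.patch  # patch from the bugzilla link; -p level may differ
cp /path/to/ofed/Module.symvers .        # lets modpost resolve OFED symbols
./configure --with-o2ib=/path/to/ofed
make && make install
```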

Regards
Liang

Chris Worley wrote:
 I'm trying to port Lustre 1.6.4.2 to OFED 1.2.5.5 w/ the RHEL kernel
 2.6.9.67.0.4.

 ksocklnd-based mounts work fine, but when I try to mount over IB, I
 get a panic in ko2iblnd in the transmit descriptor mapping routine:

 general protection fault:  [1] SMP
 CPU 1
 Modules linked in: ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U)
 libcfs(U) nfs(U) lockd(U) nfs_acl(U) sunrpc(U) rdma_ucm(U) ib_sdp(U)
 rdma_cm(U) iw_cm(U) ib_addr(U) mlx4_ib(U) mlx4_core(U) ib_mthca(U)
 dm_mod(U) ib_ipoib(U) md5(U) ipv6(U) ib_umad(U) ib_ucm(U) ib_uverbs(U)
 ib_cm(U) ib_sa(U) ib_mad(U) ib_core(U) aic79xx(U) e1000(U) ext3(U)
 jbd(U) raid0(U) mptscsih(U) mptsas(U) mptspi(U) mptscsi(U) mptbase(U)
 sd_mod(U) ata_piix(U) libata(U) scsi_mod(U)
 Pid: 5141, comm: modprobe Not tainted 2.6.9-67.0.4.EL-Lustre-1.6.4.2
 RIP: 0010:[a04659d1]
 a04659d1{:ko2iblnd:kiblnd_map_tx_descs+225}
 RSP: :0102105d7cd8  EFLAGS: 00010286
 RAX: a01e6b4e RBX: ff0010028000 RCX: 0001
 RDX: 1000 RSI: 01020e705000 RDI: 0102154e2000
 RBP: 0102102c4200 R08:  R09: 
 R10:  R11:  R12: 
 R13:  R14:  R15: 0102102c4228
 FS:  002a958a0b00() GS:8046ac00() knlGS:
 CS:  0010 DS:  ES:  CR0: 8005003b
 CR2: 002a9598200f CR3: 9fa08000 CR4: 06e0
 Process modprobe (pid: 5141, threadinfo 0102105d6000, task 
 0102175e0030)
 Stack:  0102102c4080 0102102c4100 0102102c4200
0102179c2b86 0102177df400 010215548ac0 a0466fdf
0102179c2b85 
 Call Trace:a0466fdf{:ko2iblnd:kiblnd_startup+2239}
 a03043dc{:lnet:lnet_startup_lndnis+332}
a02d2f38{:libcfs:cfs_alloc+40}
 a0305206{:lnet:LNetNIInit+278}
a03fcb0a{:ptlrpc:ptlrpc_ni_init+106}
 8012f9cd{default_wake_function+0}
a03fcbfa{:ptlrpc:ptlrpc_init_portals+10}
8012f9cd{default_wake_function+0}
 a045f22b{:ptlrpc:init_module+267}
8014bc0a{sys_init_module+278}
 8010f23e{system_call+126}


 Code: ff 50 08 eb 12 48 8b 3f b9 01 00 00 00 ba 00 10 00 00 e8 30
 RIP a04659d1{:ko2iblnd:kiblnd_map_tx_descs+225} RSP 
 0102105d7cd8

 Does this ring any bells?  Otherwise, any debugging tips?

 Shane said that they get an oops if they compile with the version-specific
 OFA tree.  Is this that oops?

 Thanks,

 Chris
