Re: [Lustre-discuss] Limits for o2ib lnet network numbers
No Jira ticket yet. The reason for the potential performance issue is straightforward: all LNet NIs are linked on a plain list, and we need to scan the whole list for each send/receive. It's not an issue for a few networks, but it could be problematic for tens or hundreds.

Liang

On Aug 8, 2012, at 10:48 PM, Cory Spitz wrote:

Liang,

What main stream perf. issue do you refer to? Is there a JIRA ticket tracking it?

Thanks,
-Cory

On 08/08/2012 09:38 AM, Liang Zhen wrote:

Hi,

LNet reserved 32 bits for the network number, so you can choose a very large network number if you only have a few networks, but actually creating many networks will cause some issues:
- o2iblnd will pre-allocate memory resources for each network, so it will consume a lot of memory
- Main stream LNet will have a performance issue if there are many networks, for example hundreds, although it's not difficult to fix this.

Liang

On Aug 8, 2012, at 10:23 PM, Rick Mohr wrote:

I was curious what limitations exist for o2ib network numbers. Most of the time I am dealing with o2ib0, o2ib1, etc. As an experiment, I tried configuring a machine with o2ib1000, and that seemed to be OK. I figured there must be some limit on how large the network number can get, but after doing some searching, I have been unable to find any docs that specify a limit. Does anyone know what the max network number is?

--
Rick Mohr
HPC Systems Administrator
National Institute for Computational Sciences
http://www.nics.tennessee.edu/
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
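A minimal sketch of the cost Liang describes, assuming NIs are modelled as simple (LND type, network number) tuples. The names `find_ni_linear` and `build_ni_index` are hypothetical and are not LNet functions, and the dictionary index is only one possible fix, not the change that was actually planned for LNet.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical stand-in for an LNet network interface (NI); not the kernel structure.
NetId = Tuple[str, int]  # (LND type, network number), e.g. ("o2ib", 1000)


def find_ni_linear(ni_list: List[NetId], wanted: NetId) -> Optional[int]:
    """Walk a plain list, as the message describes: cost grows with the NI count."""
    for index, net in enumerate(ni_list):
        if net == wanted:
            return index
    return None


def build_ni_index(ni_list: List[NetId]) -> Dict[NetId, int]:
    """One possible fix (an assumption): index the NIs by network for direct lookup."""
    return {net: index for index, net in enumerate(ni_list)}


# With hundreds of networks, the linear scan may touch every entry per message,
# while the indexed lookup does not depend on the number of networks.
nis = [("o2ib", n) for n in range(500)]
ni_index = build_ni_index(nis)
print(find_ni_linear(nis, ("o2ib", 499)), ni_index[("o2ib", 499)])
```

Both lookups return the same position; the difference is only in how much of the list has to be visited on each send or receive.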
Re: [Lustre-discuss] [wc-discuss] The ost_connect operation failed with -16
Hi,

I think you might have hit this: http://jira.whamcloud.com/browse/LU-952 ; you can find the patch in that ticket.

Regards
Liang

On May 30, 2012, at 11:21 AM, huangql wrote:

Dear all,

Recently we found a problem on the OSS where some threads might hang when the server is under heavy IO load. In this case, some clients are evicted or refused by some OSTs and get error messages like the following:

Server side:

May 30 11:06:31 boss07 kernel: Lustre: Service thread pid 8011 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
May 30 11:06:31 boss07 kernel: Lustre: Skipped 1 previous similar message
May 30 11:06:31 boss07 kernel: Pid: 8011, comm: ll_ost_71
May 30 11:06:31 boss07 kernel:
May 30 11:06:31 boss07 kernel: Call Trace:
May 30 11:06:31 boss07 kernel: [886f5d0e] start_this_handle+0x301/0x3cb [jbd2]
May 30 11:06:31 boss07 kernel: [800a09ca] autoremove_wake_function+0x0/0x2e
May 30 11:06:31 boss07 kernel: [886f5e83] jbd2_journal_start+0xab/0xdf [jbd2]
May 30 11:06:31 boss07 kernel: [888ce9b2] fsfilt_ldiskfs_start+0x4c2/0x590 [fsfilt_ldiskfs]
May 30 11:06:31 boss07 kernel: [88920551] filter_version_get_check+0x91/0x2a0 [obdfilter]
May 30 11:06:31 boss07 kernel: [80036cf4] __lookup_hash+0x61/0x12f
May 30 11:06:31 boss07 kernel: [8893108d] filter_setattr_internal+0x90d/0x1de0 [obdfilter]
May 30 11:06:31 boss07 kernel: [800e859b] lookup_one_len+0x53/0x61
May 30 11:06:31 boss07 kernel: [88925452] filter_fid2dentry+0x512/0x740 [obdfilter]
May 30 11:06:31 boss07 kernel: [88924e27] filter_fmd_get+0x2b7/0x320 [obdfilter]
May 30 11:06:31 boss07 kernel: [8003027b] __up_write+0x27/0xf2
May 30 11:06:31 boss07 kernel: [88932721] filter_setattr+0x1c1/0x3b0 [obdfilter]
May 30 11:06:31 boss07 kernel: [8882677a] lustre_pack_reply_flags+0x86a/0x950 [ptlrpc]
May 30 11:06:31 boss07 kernel: [8881e658] ptlrpc_send_reply+0x5c8/0x5e0 [ptlrpc]
May 30 11:06:31 boss07 kernel: [88822b05] lustre_msg_get_version+0x35/0xf0 [ptlrpc]
May 30 11:06:31 boss07 kernel: [888b0abb] ost_handle+0x25db/0x55b0 [ost]
May 30 11:06:31 boss07 kernel: [80150d56] __next_cpu+0x19/0x28
May 30 11:06:31 boss07 kernel: [800767ae] smp_send_reschedule+0x4e/0x53
May 30 11:06:31 boss07 kernel: [8883215a] ptlrpc_server_handle_request+0x97a/0xdf0 [ptlrpc]
May 30 11:06:31 boss07 kernel: [888328a8] ptlrpc_wait_event+0x2d8/0x310 [ptlrpc]
May 30 11:06:31 boss07 kernel: [8008b3bd] __wake_up_common+0x3e/0x68
May 30 11:06:31 boss07 kernel: [88833817] ptlrpc_main+0xf37/0x10f0 [ptlrpc]
May 30 11:06:31 boss07 kernel: [8005dfb1] child_rip+0xa/0x11
May 30 11:06:31 boss07 kernel: [888328e0] ptlrpc_main+0x0/0x10f0 [ptlrpc]
May 30 11:06:31 boss07 kernel: [8005dfa7] child_rip+0x0/0x11
May 30 11:06:31 boss07 kernel:
May 30 11:06:31 boss07 kernel: LustreError: dumping log to /tmp/lustre-log.1338347191.8011

Client side:

May 30 09:58:36 ccopt kernel: LustreError: 11-0: an error occurred while communicating with 192.168.50.123@tcp. The ost_connect operation failed with -16

When this error occurs, commands such as ls, df, vi and touch fail, which prevents us from doing anything in the file system. I think the ost_connect failure should report an error message to users instead of leaving interactive commands stuck.

Could someone give us some advice or suggestions to solve this problem?

Thank you very much in advance.

Best Regards
Qiulan Huang
2012-05-30

Computing center, the Institute of High Energy Physics, China
Huang, Qiulan
Tel: (+86) 10 8823 6010-105
P.O. Box 918-7
Fax: (+86) 10 8823 6839
Beijing 100049 P.R. China
Email: huan...@ihep.ac.cn
Re: [Lustre-discuss] is there a way to run Lustre over UDP instead TCP?
Hi,

No, it can't. It would require a new UDP-based LND (Lustre Network Driver), and I don't know of anyone who plans to do this yet.

Regards
Liang

On Apr 10, 2012, at 6:44 AM, Hebenstreit, Michael wrote:

See title…

Thanks
Michael

Michael Hebenstreit
Senior Cluster Architect
Intel Corporation
Software and Services Group/HTE
2800 N Center Dr, DP3-307
Tel.: +1 253 371 3144
WA 98327, DuPont
UNITED STATES
E-mail: michael.hebenstr...@intel.com
Re: [Lustre-discuss] LNET Performance Issue
Hi,

I assume you are using size=1M for the brw test, right? Performance could increase if you set concurrency when adding the brw test, e.g. --concurrency=16

Liang

On Feb 16, 2012, at 3:30 AM, Barberi, Carl E wrote:

We are having issues with LNET performance over Infiniband. We have a configuration with a single MDT and six (6) OSTs. The Lustre client I am using to test is configured to use 6 stripes (lfs setstripe -c 6 /mnt/lustre). When I perform a test using the following command:

dd if=/dev/zero of=/mnt/lustre/test.dat bs=1M count=2000

I typically get a write rate of about 815 MB/s, and we never exceed 848 MB/s. When I run obdfilter-survey, we easily get about 3-4GB/s write speed, but when I run a series of lnet-selftests, the read and write rates range from 850MB/s – 875MB/s max. I have performed the following optimizations to increase the data rate:

On the Client:
lctl set_param osc.*.checksums=0
lctl set_param osc.*.max_dirty_mb=256

On the OSTs:
lctl set_param obdfilter.*.writethrough_cache_enable=0
lctl set_param obdfilter.*.read_cache_enable=0
echo 4096 > /sys/block/devices/queue/nr_requests

I have also loaded the ib_sdp module, which also brought an increase in speed. However, we need to be able to record at no less than 1GB/s, which we cannot achieve right now. Any thoughts on how I can optimize LNET, which clearly seems to be the bottleneck?

Thank you for any help you can provide,
Carl Barberi
Re: [Lustre-discuss] Where to download Lustre from since 01 Aug?
Hi,

You can download Whamcloud Lustre releases from here: http://downloads.whamcloud.com

Regards
Liang

On Aug 4, 2011, at 8:03 PM, Torsten Harenberg wrote:

Dear all,

after the migration of the Sun downloads website to Oracle, I couldn't find a place anymore where I can actually download anything lustre related.

Can someone point me to a site which still works after August 1st?

Thanks
Torsten

--
Dr. Torsten Harenberg harenb...@physik.uni-wuppertal.de
Bergische Universitaet
FB C - Physik Tel.: +49 (0)202 439-3521
Gaussstr. 20 Fax : +49 (0)202 439-2811
42097 Wuppertal

Of course it runs NetBSD http://www.netbsd.org
Re: [Lustre-discuss] [Lustre-devel] Lustre and Multicore
Vilobh,

We do have work in progress for performance on multicores (http://jira.whamcloud.com/browse/LU-56), but these changes will be almost transparent to users, and they have nothing to do with the Lustre protocol.

Regards
Liang

On May 31, 2011, at 5:56 PM, vilobh meshram wrote:

Hi,

I wanted to understand whether the design of Lustre (Client/MDS/OSS) will change if we have multicores.

Thanks,
Vilobh
___
Lustre-devel mailing list
lustre-de...@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-devel
Re: [Lustre-discuss] Landing and tracking tools improvements
A small suggestion about Maloo: I think it would be helpful if we could sort testing results by base branch. I actually can't find branch information even after clicking into a result report.

Thanks
Liang

On May 24, 2011, at 1:06 AM, Chris Gearing wrote:

We now have a whole kit of tools [Jira, Gerrit, Jenkins and Maloo] used for tracking, reviewing and testing of code that are being used for the development of Lustre. A lot of time has been spent integrating and connecting them appropriately, but as with anything the key is to continuously look for ways to improve what we have and how it works.

So my question is: what do people think of the tools as they stand today, and how can we improve them moving forwards? If people can respond to lustre-discuss, then I'll correlate the outcome of any discussions and create a Wiki page that can form some plan for improvement.

Please be as descriptive as possible in your replies and take into account that I and others have no experience of Lustre's past, so if you liked something prior to the current tools you'll need to help me and them understand the details.

Thanks
Chris

---
Chris Gearing
Snr Engineer
Whamcloud. Inc.
Re: [Lustre-discuss] Lustre over o2ib issue
Hi Diego,

Do you have any other module parameters for lnet and the LND?

Regards
Liang

On Mar 22, 2011, at 9:26 PM, Diego Moreno wrote:

Hi,

We are having this problem right now with our Lustre 2.0. We tried the proposed solutions but we didn't get it working.

We have 2 QDR IB cards on 4 servers and we have to do lctl ping from each server to every client if we want clients to connect to servers. We don't have ib_mthca modules loaded because we don't have DDR cards, and we configured ip2nets with no result.

Our ip2nets configuration ([7-10] interfaces are in servers, the others are in clients):

o2ib0(ib0) 10.50.0.[7-10] ; o2ib1(ib1) 10.50.1.[7-10] ; o2ib0(ib0) 10.50.*.* ; o2ib1(ib0) 10.50.*.*

So the only way of having clients connected to servers is doing something like this on every server:

for i in $CLIENT_IB_LIST ; do
    lctl ping $i@o2ib0
    lctl ping $i@o2ib1
done

Before lctl ping we get messages like this one:

Lustre: 50389:0:(lib-move.c:1028:lnet_post_send_locked()) Dropping message for 12345-10.50.1.7@o2ib1: peer not alive

After 'lctl ping' everything works right.

Maybe I'm missing something or this is a known bug in lustre 2.0...

On 16/03/2011 22:13, Andreas Dilger wrote:

On 2011-03-16, at 3:04 PM, Mike Hanby wrote:

Thanks, I forgot to include the card info:

The servers each have a single IB card: dual port MT26528 QDR
o2ib0(ib0) on each server is attached to the QLogic switch (with three attached M3601Q switches 48 attached blades)
o2ib1(ib1) on each server is attached to a stack of two M3601Q switches with 24 attached blades

The blades connected to o2ib0 each have an MT26428 QDR IB card
The blades connected to o2ib1 each have an MT25418 DDR IB card

You may also want to check out the ip2nets option for specifying the Lustre networks. It is made to handle configuration issues like this where the interface name is not constant across client/server nodes.

-Original Message-
From: lustre-discuss-boun...@lists.lustre.org [mailto:lustre-discuss-boun...@lists.lustre.org] On Behalf Of Nirmal Seenu
Sent: Wednesday, March 16, 2011 2:10 PM
To: lustre-discuss@lists.lustre.org
Subject: Re: [Lustre-discuss] Lustre over o2ib issue

If you are using DDR and QDR or any 2 different cards in the same machine, there is no guarantee that the same IB cards get assigned to ib0 and ib1.

To fix that problem you need to comment out the following 3 lines in /etc/init.d/openibd:

#for i in `grep ^driver: /etc/sysconfig/hwconf | sed -e 's/driver: //' | grep -w ib_mthca\\\|ib_ipath\\\|mlx4_core\\\|cxgb3\\\|iw_nes`; do
#load_modules $i
#done

and include the following lines instead (we wanted the DDR card to be ib0 and the QDR card to be ib1):

load_modules ib_mthca
/bin/sleep 10
load_modules mlx4_core

and you will need to restart openibd once again (we included it in rc.local) to make sure that the same IB cards are assigned to the devices ib0 and ib1.

Nirmal

Cheers, Andreas
--
Andreas Dilger
Principal Engineer
Whamcloud, Inc.
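To make the ip2nets rules quoted in Diego's message easier to read, here is a small sketch that lists which rules an address matches, in rule order. It is a simplified illustration only: it handles `*`, plain numbers and a single `[lo-hi]` range per octet, the helper names are hypothetical, and it does not claim to reproduce LNet's real parser or the way LNet picks among matching rules.

```python
import re
from typing import List, Tuple


def octet_matches(pattern: str, value: int) -> bool:
    """Match one octet against '*', a literal number, or a '[lo-hi]' range."""
    if pattern == "*":
        return True
    ranged = re.fullmatch(r"\[(\d+)-(\d+)\]", pattern)
    if ranged:
        return int(ranged.group(1)) <= value <= int(ranged.group(2))
    return pattern.isdigit() and int(pattern) == value


def ip_matches(pattern: str, ip: str) -> bool:
    """Compare a dotted pattern with a dotted IPv4 address, octet by octet."""
    pats, octets = pattern.split("."), ip.split(".")
    return len(pats) == len(octets) == 4 and all(
        octet_matches(p, int(o)) for p, o in zip(pats, octets)
    )


def parse_rules(ip2nets: str) -> List[Tuple[str, str]]:
    """Split 'net(iface) pattern ; ...' into (net, pattern) pairs."""
    rules = []
    for rule in ip2nets.split(";"):
        fields = rule.split()
        if len(fields) == 2:
            rules.append((fields[0], fields[1]))
    return rules


def matching_nets(ip2nets: str, ip: str) -> List[str]:
    """Return every 'net(iface)' whose pattern matches ip, in rule order."""
    return [net for net, pattern in parse_rules(ip2nets) if ip_matches(pattern, ip)]


# The rule string quoted in the thread.
RULES = ("o2ib0(ib0) 10.50.0.[7-10] ; o2ib1(ib1) 10.50.1.[7-10] ; "
         "o2ib0(ib0) 10.50.*.* ; o2ib1(ib0) 10.50.*.*")

print(matching_nets(RULES, "10.50.0.8"))   # a server address on ib0
print(matching_nets(RULES, "10.50.3.20"))  # a made-up client address
```

The client address `10.50.3.20` is invented for the example. Under this simplified matcher, a server address such as `10.50.0.8` matches the wildcard rules as well as its own server rule, which is the kind of overlap worth checking when a configuration does not behave as expected.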
Re: [Lustre-discuss] OST threads
In the long term, I think we could add a common library for thread pools. We have several modules with their own thread-pool implementations (the ptlrpc service has two thread pools, LNDs...), so if we had such a library (create threads, kill threads, grow threads, shrink threads) we could get rid of a bunch of duplicated code, though these APIs have to be designed carefully so they satisfy at least all current use cases.

Regards
Liang

On Feb 26, 2011, at 12:16 PM, Andreas Dilger wrote:

On 2011-02-25, at 4:37 PM, Mervini, Joseph A wrote:

That could be awful handy - especially when trying to tune a live file system for performance. Is that going to be a 2.0 only enhancement or can it be applied to existing 1.8 versions?

The patch was originally developed for 1.8, and ported to 2.1. That said, last time I tested it there were a few problems (crashing variety) so it isn't ready for prime time yet. Testing/debugging would be appreciated; patches for 1.8 and 2.1 are at: https://bugzilla.lustre.org/show_bug.cgi?id=22516

On Feb 24, 2011, at 9:19 PM, Andreas Dilger wrote:

Yes, this can be set at startup time to limit the number of started threads. There is a patch I wrote to also reduce the number of running threads but it wasn't landed yet.

Cheers, Andreas

On 2011-02-24, at 14:04, Mervini, Joseph A jame...@sandia.gov wrote:

I'm inclined to agree. So apparently the only time that modifying the runtime max values has a benefit is while the threads_started is low?

Joe

Joe Mervini
Sandia National Laboratories
High Performance Computing
505.844.6770
jame...@sandia.gov

On Feb 24, 2011, at 1:09 PM, Kevin Van Maren wrote:

However, I don't think you can decrease the number of running threads.
See https://bugzilla.lustre.org/show_bug.cgi?id=22417 (and also https://bugzilla.lustre.org/show_bug.cgi?id=22516 )

Kevin

Mervini, Joseph A wrote:

Cool! Thank you Johann.

Joe Mervini
Sandia National Laboratories
High Performance Computing
505.844.6770
jame...@sandia.gov

On Feb 24, 2011, at 11:05 AM, Johann Lombardi wrote:

On Thu, Feb 24, 2011 at 10:48:32AM -0700, Mervini, Joseph A wrote:

Quick question: Has runtime modification of the number of OST threads been implemented in Lustre-1.8.3?

Yes, see bugzilla ticket 18688. It was landed in 1.8.1.

Cheers, Andreas
--
Andreas Dilger
Principal Engineer
Whamcloud, Inc.
Re: [Lustre-discuss] Disabling RDMA on an IB interface
On Feb 24, 2011, at 10:45 AM, Jeremy Filizetti wrote:

As Chris mentioned, you're talking about two very different methods. I think you can use netem with IPoIB but I have never tried it. If you use connected mode I think you're still technically doing RDMA, but the maximum size (MTU) is around 64k, which isn't sufficient for higher latencies. In the first few slides of my LUG presentation last year I have some graphs that show how RDMA performance is affected by latency and need to be increased to compensate for the bandwidth delay product (BDP).

If you do want to use IPoIB you can add a line similar to the following in your /etc/modprobe.conf or a file in the /etc/modprobe.d directory:

options lnet networks=tcp(ib0)

If you want to use RC QPs as ko2iblnd does, we use the following kernel parameters:

options lnet networks=o2ib(ib0)
options ko2iblnd map_on_demand=2 peer_credits=128 credits=256 concurrent_sends=256 ntx=512 fmr_pool_size=2048 fmr_flush_trigger=512 fmr_cache=1

If you have peer_credits=128, then I would suggest increasing to credits=1024 and ntx=2048, otherwise a couple of clients could consume all NI credits. concurrent_sends is not necessary here because o2iblnd will estimate a proper value for it.

Thanks
Liang
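The arithmetic behind both points can be sketched as follows. This assumes `credits` bounds concurrent sends on the whole NI and `peer_credits` bounds concurrent sends to one peer, as the message implies. The function names are hypothetical, and the link speed and round-trip time are made-up illustration values, not figures from the thread.

```python
def peers_to_exhaust(ni_credits: int, peer_credits: int) -> int:
    """Smallest number of peers that, each using all of its peer credits,
    would hold every NI credit (ceiling division)."""
    return -(-ni_credits // peer_credits)


def bdp_bytes(bandwidth_bytes_per_s: int, rtt_ms: int) -> int:
    """Bandwidth-delay product: bytes that must be in flight to keep the link busy."""
    return bandwidth_bytes_per_s * rtt_ms // 1000


# Settings quoted in the thread: two busy peers are enough to use up the NI.
print(peers_to_exhaust(256, 128))    # 2
# Liang's suggested credits=1024 raises that to eight peers.
print(peers_to_exhaust(1024, 128))   # 8

# Illustrative numbers only (assumptions): a 1 GB/s link with a 10 ms round trip.
print(bdp_bytes(1_000_000_000, 10))  # 10000000
```

With these example numbers roughly 10 MB has to be outstanding at any time, which is why a 64k maximum transfer size falls short as latency grows.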
Re: [Lustre-discuss] RDMA limitation?
The story is something like: "if you have to take dozens of global locks in the lifetime of an RPC, then the code can't scale well on a large SMP system, no matter what kind of network you are using", so the problem is scattered everywhere. Also, we are trying to reduce RPC bouncing between CPUs. In the current code, a request can be received by CPU A, then queued on CPU B, processed by CPU C, and replied to by CPU D, which is very bad on a large SMP system because of data traffic between CPUs.

Regards
Liang

Jiahua wrote:

You mean it is inherent in the code? Can you point me to the actual code if possible? I am just curious why. Any pointers or hints will be appreciated.

Thanks,
Jiahua

On Tue, Apr 13, 2010 at 6:46 PM, Kevin Van Maren kevin.vanma...@sun.com wrote:

Yes, the RPC rate is limited by Lustre code locking to that rate, even with rdma.

Kevin

On Apr 13, 2010, at 5:08 PM, Jiahua jia...@gmail.com wrote:

Hi all,

This is kind of a followup question to the thread "One or two OSS, no difference?" last month. In that thread, Andreas stated:

There is work currently underway to improve the SMP scaling performance for the RPC handling layer in Lustre. Currently that limits the delivered RPC rate to 10-15k/sec or so.

My question is: does the limitation also apply to RDMA on IB? By SMP, I guess Andreas was talking about CPU, right? Since RDMA can bypass the host CPU, does it mean it can also bypass the limitation?

Thanks,
Jiahua
Re: [Lustre-discuss] 1.8.1(-ish) client vs. 1.6.7.2 server
Robin,

These messages should be harmless. 1.8.1 is using a new o2iblnd message protocol, so there is a version negotiation if the o2iblnd version of the client is older. Are there any other o2ib error messages like "Deleting messages for xxx.xxx.xxx@o2b: connection failed" when you see the IO failure?

Anyway, if you get more complaints from o2ib besides these messages, could you please post them on the bug you filed.

Thanks
Liang

Robin Humble wrote:

I added this to bugzilla.
https://bugzilla.lustre.org/show_bug.cgi?id=20227

cheers,
robin

On Wed, Jul 15, 2009 at 01:09:33PM -0400, Robin Humble wrote:

On Wed, Jul 15, 2009 at 08:46:12AM -0400, Robin Humble wrote:

I get a ferocious set of error messages when I mount a 1.6.7.2 filesystem on a b_release_1_8_1 client. is this expected?

just to annotate the below a bit in case it's not clear... sorry - should have done that in the first email :-/

10.8.30.244 is MGS and one MDS, 10.8.30.245 is the other MDS in the failover pair. 10.8.30.201 - 208 are OSS's (one OST per OSS), and the fs is mounted in the usual failover way eg.

mount -t lustre 10.8.30@o2ib:10.8.30@o2ib:/system /system

from the below (and other similar logs) it kinda looks like the client fails and then renegotiates with all the servers.

cheers,
robin
--
Dr Robin Humble, HPC Systems Analyst, NCI National Facility

Lustre: 13800:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30@o2ib failed: 5
Lustre: 13799:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30@o2ib failed: 5
Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
Lustre: mgc10.8.30@o2ib: Reactivating import
Lustre: 13797:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30@o2ib failed: 5
Lustre: 13798:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30@o2ib failed: 5
Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
Lustre: Client system-client has started
Lustre: 13798:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30@o2ib failed: 5
... last message repeated 17 times ...
Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
Lustre: 13798:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30@o2ib failed: 5
Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
Lustre: 13797:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30@o2ib failed: 5
Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30@o2ib: retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, msg_size: 4096
Lustre: 13800:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.8.30@o2ib failed: 5

looks like it succeeds in the end, but only after a struggle.

I don't have any problems with 1.8.1 - 1.8.1 or 1.6.7.2 - 1.6.7.2.

servers are rhel5 x86_64 2.6.18-92.1.26.el5 1.6.7.2 + bz18793 (group quota fix).
client is rhel5 x86_64 patched 2.6.18-128.1.16.el5-b_release_1_8_1 from cvs 20090712131220 + bz18793 again.

BTW, should I be using cvs tag v1_8_1_RC1 instead of b_release_1_8_1? I'm confused about which is closest to the final 1.8.1 :-/

cheers,
robin
--
Dr Robin Humble, HPC Systems Analyst, NCI National Facility
Re: [Lustre-discuss] Lustre Client using 10GigE iWarp
David Dillow wrote:

On Fri, 2009-05-29 at 08:57 -0500, Dennis Nelson wrote:

I have tried this:

options lnet networks=o2ib(eth2)

Is that correct?

I would think so, but someone more experienced in the Lustre o2ib LND and iWarp should chime in on that -- I've never used iWarp, only IB.

We also need this for iWarp:

options ko2iblnd map_on_demand=64

or map_on_demand=32 if it fails to start up.

Regards
Liang

The client is connected to the IB fabric using a Voltaire 10 GigE line card on an IB switch. Has anyone tested such a configuration? Should I expect it to work?

I don't expect that to work, unless the Voltaire is converting iWarp to IB -- and I highly doubt that. It is more likely using EoIB to transport the ethernet frames over the IB fabric. You will likely need a Lustre router to sit between the two fabrics.
Re: [Lustre-discuss] o2ib cant ping/mount Infiniband NID
172.24.198@o2ib
failed to ping 172.24.198@o2ib: Input/output error

/var/log/messages:
Jan 16 10:24:14 p128 kernel: Lustre: 2750:0:(o2iblnd_cb.c:2687:kiblnd_cm_callback()) 172.24.198@o2ib: ROUTE ERROR -22
Jan 16 10:24:14 p128 kernel: Lustre: 2750:0:(o2iblnd_cb.c:2101:kiblnd_peer_connect_failed()) Deleting messages for 172.24.198@o2ib: connection failed

How can I get rid of this connection problem?

~subbu

On Fri, Jan 16, 2009 at 2:11 PM, Liang Zhen zhen.li...@sun.com wrote:

Subbu,

We don't have any tips for setting up IPoIB. It looks like Linux can't find the ifaddr of ib0 on the MDS (-99 is EADDRNOTAVAIL), so I think it's because you didn't assign any address to ib0 (or failed to assign an address to ib0) before loading o2iblnd on the first try. I can reproduce exactly the same error by:

1. modprobe ib_ipoib
2. ifconfig ib0 up // without assigning any address
3. modprobe ko2iblnd
4. lctl network up

Regards
Liang

subbu kl:

Liang,

after executing the following echo:

echo +neterror > /proc/sys/lnet/printk

lctl ping now shows the following error:

# lctl ping 172.24.198@o2ib
failed to ping 172.24.198@o2ib: Input/output error

Jan 16 10:24:14 p128 kernel: Lustre: 2750:0:(o2iblnd_cb.c:2687:kiblnd_cm_callback()) 172.24.198@o2ib: ROUTE ERROR -22
Jan 16 10:24:14 p128 kernel: Lustre: 2750:0:(o2iblnd_cb.c:2101:kiblnd_peer_connect_failed()) Deleting messages for 172.24.198@o2ib: connection failed

Looks like some problem with the IB connection manager!

1. Do we have any help docs for setting up IPoIB and Lustre? The Lustre operations manual has very minimal info about this. I think I am missing some IPoIB setup step here.
2. Or is the manual assignment of IP addresses to ib0 creating some problem?

Some more supporting info:

A subnet manager of the following version is also running: OpenSM 3.1.8

Initially I got this error for the MDS mount:

Jan 16 09:45:20 p128 kernel: LustreError: 4991:0:(linux-tcpip.c:124:libcfs_ipif_query()) Can't get IP address for interface ib0
Jan 16 09:45:20 p128 kernel: LustreError: 4991:0:(o2iblnd.c:1563:kiblnd_startup()) Can't query IPoIB interface ib0: -99
Jan 16 09:45:21 p128 kernel: LustreError: 105-4: Error -100 starting up LNI o2ib
Jan 16 09:45:21 p128 kernel: LustreError: 4991:0:(events.c:707:ptlrpc_init_portals()) network initialisation failed
Jan 16 09:45:21 p128 modprobe: WARNING: Error inserting ptlrpc (/lib/modules/2.6.18-53.1.14.el5_lustre.1.6.5.1smp/kernel/fs/lustre/ptlrpc.ko): Input/output error
Jan 16 09:45:21 p128 modprobe: WARNING: Error inserting osc (/lib/modules/2.6.18-53.1.14.el5_lustre.1.6.5.1smp/kernel/fs/lustre/osc.ko): Unknown symbol in module, or unknown parameter (see dmesg)
Jan 16 09:45:21 p128 kernel: osc: Unknown symbol ldlm_prep_enqueue_req
Jan 16 09:45:21 p128 kernel: osc: Unknown symbol ldlm_resource_get
Jan 16 09:45:21 p128 kernel: osc: Unknown symbol ptlrpc_lprocfs_register_obd
. . .

then I manually set the IP address for ib0 as follows:

# ifconfig ib0 172.24.198.111

[r...@p186 ~]# ifconfig ib0
ib0 Link encap:InfiniBand HWaddr 80:00:04:04:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
    inet addr:172.24.198.112 Bcast:172.24.255.255 Mask:255.255.0.0
    UP BROADCAST MULTICAST MTU:65520 Metric:1
    RX packets:0 errors:0 dropped:0 overruns:0 frame:0
    TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
    collisions:0 txqueuelen:256
    RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)

then it mounted successfully:

Jan 16 09:47:09 p128 kernel: Lustre: Added LNI 172.24.198@o2ib [8/64]
Jan 16 09:47:09 p128 kernel: Lustre: MGS MGS started
Jan 16 09:47:09 p128 kernel: Lustre: Setting parameter lustre-MDT.mdt.group_upcall in log lustre-MDT
Jan 16 09:47:09 p128 kernel: Lustre: Enabling user_xattr
Jan 16 09:47:09 p128 kernel: Lustre: lustre-MDT: new disk, initializing
Jan 16 09:47
Re: [Lustre-discuss] o2ib cant ping/mount Infiniband NID
Subbu,

We don't have any tips for setting up IPoIB. It looks like Linux can't find the interface address of ib0 on the MDS (-99 is EADDRNOTAVAIL), so I think you didn't assign an address to ib0 (or the assignment failed) before loading o2iblnd on the first try. I can reproduce exactly the same error with:

1. modprobe ib_ipoib
2. ifconfig ib0 up    // without assigning any address
3. modprobe ko2iblnd
4. lctl network up

Regards
Liang

subbu kl:

Liang,

After executing the following echo:

    echo +neterror > /proc/sys/lnet/printk

lctl ping now shows the following error:

    # lctl ping 172.24.198@o2ib
    failed to ping 172.24.198@o2ib: Input/output error

    Jan 16 10:24:14 p128 kernel: Lustre: 2750:0:(o2iblnd_cb.c:2687:kiblnd_cm_callback()) 172.24.198@o2ib: ROUTE ERROR -22
    Jan 16 10:24:14 p128 kernel: Lustre: 2750:0:(o2iblnd_cb.c:2101:kiblnd_peer_connect_failed()) Deleting messages for 172.24.198@o2ib: connection failed

It looks like there is some problem with the IB connection manager!

1. Do we have any help docs for setting up IPoIB and Lustre? The Lustre operations manual has very minimal info about this. I think I am missing some part of the IPoIB setup here.
2. Or is the manual assignment of IP addresses to ib0 creating some problem?

Some more supporting info: a subnet manager of the following version is also running: OpenSM 3.1.8

Initially I got this error for the MDS mount:

    Jan 16 09:45:20 p128 kernel: LustreError: 4991:0:(linux-tcpip.c:124:libcfs_ipif_query()) Can't get IP address for interface ib0
    Jan 16 09:45:20 p128 kernel: LustreError: 4991:0:(o2iblnd.c:1563:kiblnd_startup()) Can't query IPoIB interface ib0: -99
    Jan 16 09:45:21 p128 kernel: LustreError: 105-4: Error -100 starting up LNI o2ib
    Jan 16 09:45:21 p128 kernel: LustreError: 4991:0:(events.c:707:ptlrpc_init_portals()) network initialisation failed
    Jan 16 09:45:21 p128 modprobe: WARNING: Error inserting ptlrpc (/lib/modules/2.6.18-53.1.14.el5_lustre.1.6.5.1smp/kernel/fs/lustre/ptlrpc.ko): Input/output error
    Jan 16 09:45:21 p128 modprobe: WARNING: Error inserting osc (/lib/modules/2.6.18-53.1.14.el5_lustre.1.6.5.1smp/kernel/fs/lustre/osc.ko): Unknown symbol in module, or unknown parameter (see dmesg)
    Jan 16 09:45:21 p128 kernel: osc: Unknown symbol ldlm_prep_enqueue_req
    Jan 16 09:45:21 p128 kernel: osc: Unknown symbol ldlm_resource_get
    Jan 16 09:45:21 p128 kernel: osc: Unknown symbol ptlrpc_lprocfs_register_obd
    . . .
Then I manually set the IP address for ib0 as follows:

    ifconfig ib0 172.24.198.111
    [r...@p186 ~]# ifconfig ib0
    ib0 Link encap:InfiniBand HWaddr 80:00:04:04:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
        inet addr:172.24.198.112 Bcast:172.24.255.255 Mask:255.255.0.0
        UP BROADCAST MULTICAST MTU:65520 Metric:1
        RX packets:0 errors:0 dropped:0 overruns:0 frame:0
        TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
        collisions:0 txqueuelen:256
        RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)

Then it mounted successfully:

    Jan 16 09:47:09 p128 kernel: Lustre: Added LNI 172.24.198@o2ib [8/64]
    Jan 16 09:47:09 p128 kernel: Lustre: MGS MGS started
    Jan 16 09:47:09 p128 kernel: Lustre: Setting parameter lustre-MDT.mdt.group_upcall in log lustre-MDT
    Jan 16 09:47:09 p128 kernel: Lustre: Enabling user_xattr
    Jan 16 09:47:09 p128 kernel: Lustre: lustre-MDT: new disk, initializing
    Jan 16 09:47:09 p128 kernel: Lustre: MDT lustre-MDT now serving dev (lustre-MDT/64db1fc7-03ba-9803-4d20-ab0d2aa66116) with recovery enabled
    Jan 16 09:47:09 p128 kernel: Lustre: 5274:0:(lproc_mds.c:262:lprocfs_wr_group_upcall()) lustre-MDT: group upcall set to /usr/sbin/l_getgroups
    Jan 16 09:47:09 p128 kernel: Lustre: lustre-MDT.mdt: set parameter group_upcall=/usr/sbin/l_getgroups
    Jan 16 09:47:09 p128 kernel: Lustre: Server lustre-MDT on device /dev/loop0 has started
    . . .

~subbu

On Thu, Jan 15, 2009 at 8:37 PM, Liang Zhen zhen.li...@sun.com wrote:

Subbu,

I'd suggest:

1) Make sure ko2iblnd has been brought up (please check whether there is any error message when starting ko2iblnd).
2) echo +neterror > /proc/sys/lnet/printk, then try lctl ping; if it still doesn't work, please post the error messages.

Regards
Liang

subbu kl:

The problem is similar to http://lists.lustre.org/pipermail/lustre-discuss/2008-May/007498.html but from that thread I could not really work out the solution.

I have two RHEL5 Linux servers installed with the following packages:

    kernel-lustre-smp-2.6.18-53.1.14.el5_lustre.1.6.5.1
    kernel-ib-1.3-2.6.18_53.1.14.el5_lustre.1.6.5.1smp
    lustre-ldiskfs-3.0.4-2.6.18_53.1.14.el5_lustre.1.6.5.1smp
    lustre-1.6.5.1-2.6.18_53.1.14.el5_lustre.1.6.5.1smp
    lustre-modules-1.6.5.1-2.6.18_53.1.14.el5_lustre.1.6.5.1smp
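
Liang's diagnosis in this message (ib0 had no address when ko2iblnd was loaded) suggests a simple pre-flight check. What follows is only a rough sketch, not anything shipped with Lustre: `has_inet_addr` and `check_iface_text` are hypothetical helper names, and the pattern assumes the older net-tools `ifconfig` layout quoted in this thread (`inet addr:`). The "no address" sample is an assumed shape of the output, not a capture from the poster's machine.

```shell
#!/bin/sh
# Hypothetical helper (not part of Lustre): reads `ifconfig <iface>` output on
# stdin and succeeds only if an IPv4 address line is present. The pattern
# assumes the net-tools format shown in this thread ("inet addr:").
has_inet_addr() {
    grep -q 'inet addr:[0-9]'
}

# Sample taken from the ifconfig output quoted in this thread.
with_addr='ib0 Link encap:InfiniBand
    inet addr:172.24.198.112 Bcast:172.24.255.255 Mask:255.255.0.0
    UP BROADCAST MULTICAST MTU:65520 Metric:1'

# Assumed shape of the output when the interface is up but has no address.
without_addr='ib0 Link encap:InfiniBand
    UP BROADCAST MULTICAST MTU:65520 Metric:1'

# Report whether the given ifconfig text shows an assigned address.
check_iface_text() {
    if printf '%s\n' "$1" | has_inet_addr; then
        echo "address assigned"
    else
        echo "no address: assign one before modprobe ko2iblnd"
    fi
}

check_iface_text "$with_addr"
check_iface_text "$without_addr"
```

On a real node the same check could presumably be run as `ifconfig ib0 | has_inet_addr` before `modprobe ko2iblnd`; newer `ifconfig` builds print `inet ` without `addr:`, so the pattern would need adjusting there.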
Re: [Lustre-discuss] o2ib cant ping/mount Infiniband NID
Subbu,

I'd suggest:

1) Make sure ko2iblnd has been brought up (please check whether there is any error message when starting ko2iblnd).
2) echo +neterror > /proc/sys/lnet/printk, then try lctl ping; if it still doesn't work, please post the error messages.

Regards
Liang

subbu kl:

The problem is similar to http://lists.lustre.org/pipermail/lustre-discuss/2008-May/007498.html but from that thread I could not really work out the solution.

I have two RHEL5 Linux servers installed with the following packages:

    kernel-lustre-smp-2.6.18-53.1.14.el5_lustre.1.6.5.1
    kernel-ib-1.3-2.6.18_53.1.14.el5_lustre.1.6.5.1smp
    lustre-ldiskfs-3.0.4-2.6.18_53.1.14.el5_lustre.1.6.5.1smp
    lustre-1.6.5.1-2.6.18_53.1.14.el5_lustre.1.6.5.1smp
    lustre-modules-1.6.5.1-2.6.18_53.1.14.el5_lustre.1.6.5.1smp
    e2fsprogs-1.40.7.sun3-0redhat

machine 1: ib0 IP address: 172.24.198.111
machine 2: ib0 IP address: 172.24.198.112

/etc/modprobe.conf contains:

    options lnet networks=o2ib

TCP networking worked fine. Now I am trying the InfiniBand network and am having difficulty communicating with the IB nodes. The mount attempt throws the following error:

    [r...@p186 ~]# mount -t lustre -o loop /tmp/lustre-ost1 /mnt/ost1
    mount.lustre: mount /dev/loop0 at /mnt/ost1 failed: Input/output error
    Is the MGS running?

/var/log/messages:

    Jan 15 16:55:25 p186 kernel: kjournald starting. Commit interval 5 seconds
    Jan 15 16:55:25 p186 kernel: LDISKFS FS on loop0, internal journal
    Jan 15 16:55:25 p186 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
    Jan 15 16:55:25 p186 kernel: kjournald starting. Commit interval 5 seconds
    Jan 15 16:55:25 p186 kernel: LDISKFS FS on loop0, internal journal
    Jan 15 16:55:25 p186 kernel: LDISKFS-fs: mounted filesystem with ordered data mode.
    Jan 15 16:55:25 p186 kernel: LDISKFS-fs: file extents enabled
    Jan 15 16:55:25 p186 kernel: LDISKFS-fs: mballoc enabled
    Jan 15 16:55:30 p186 kernel: Lustre: Request x7 sent from mgc172.24.198@o2ib to NID 172.24.198@o2ib 5s ago has timed out (limit 5s).
    Jan 15 16:55:30 p186 kernel: LustreError: 7193:0:(obd_mount.c:1062:server_start_targets()) Required registration failed for lustre-OST: -5
    Jan 15 16:55:30 p186 kernel: LustreError: 15f-b: Communication error with the MGS. Is the MGS running?
    Jan 15 16:55:30 p186 kernel: LustreError: 7193:0:(obd_mount.c:1597:server_fill_super()) Unable to start targets: -5
    Jan 15 16:55:30 p186 kernel: LustreError: 7193:0:(obd_mount.c:1382:server_put_super()) no obd lustre-OST
    Jan 15 16:55:30 p186 kernel: LustreError: 7193:0:(obd_mount.c:119:server_deregister_mount()) lustre-OST not registered
    Jan 15 16:55:30 p186 kernel: LDISKFS-fs: mballoc: 0 blocks 0 reqs (0 success)
    Jan 15 16:55:30 p186 kernel: LDISKFS-fs: mballoc: 0 extents scanned, 0 goal hits, 0 2^N hits, 0 breaks, 0 lost
    Jan 15 16:55:30 p186 kernel: LDISKFS-fs: mballoc: 0 generated and it took 0
    Jan 15 16:55:30 p186 kernel: LDISKFS-fs: mballoc: 0 preallocated, 0 discarded
    Jan 15 16:55:30 p186 kernel: Lustre: server umount lustre-OST complete
    Jan 15 16:55:30 p186 kernel: LustreError: 7193:0:(obd_mount.c:1951:lustre_fill_super()) Unable to mount (-5)

All attempts to ping the IB NIDs, local or remote, also failed. I can ping the IP addresses:

    [r...@p186 ~]# ping 172.24.198.112
    PING 172.24.198.112 (172.24.198.112) 56(84) bytes of data.
    64 bytes from 172.24.198.112: icmp_seq=1 ttl=64 time=0.052 ms
    64 bytes from 172.24.198.112: icmp_seq=2 ttl=64 time=0.024 ms
    --- 172.24.198.112 ping statistics ---
    2 packets transmitted, 2 received, 0% packet loss, time 1000ms
    rtt min/avg/max/mdev = 0.024/0.038/0.052/0.014 ms

    [r...@p186 ~]# ping 172.24.198.111
    PING 172.24.198.111 (172.24.198.111) 56(84) bytes of data.
    64 bytes from 172.24.198.111: icmp_seq=1 ttl=64 time=2.16 ms
    64 bytes from 172.24.198.111: icmp_seq=2 ttl=64 time=0.296 ms
    --- 172.24.198.111 ping statistics ---
    2 packets transmitted, 2 received, 0% packet loss, time 1000ms
    rtt min/avg/max/mdev = 0.296/1.231/2.166/0.935 ms

but can't ping the NIDs:

    [r...@p186 ~]# lctl ping 172.24.198@o2ib
    failed to ping 172.24.198@o2ib: Input/output error
    [r...@p186 ~]# lctl ping 172.24.198@o2ib
    failed to ping 172.24.198@o2ib: Input/output error

Any idea why LNet can't ping the NIDs?

Some more configuration details:

    [r...@p186 ~]# ibstat
    CA 'mthca0'
        CA type: MT23108
        Number of ports: 2
        Firmware version: 3.5.0
        Hardware version: a1
        Node GUID: 0x0002c9020021550c

The machines are connected via an IB switch. Looking forward to your help.

~subbu

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
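
The logs above report the mount failure as a bare negative number (-5), and other messages in this thread show -22, -99 and -100. As a reading aid only, here is a minimal sketch that maps those four codes to their Linux errno names. `decode_errno` is a hypothetical helper, not a Lustre or lctl command, and the table assumes the standard Linux numbering; it covers nothing beyond the codes quoted in the thread.

```shell
#!/bin/sh
# Hypothetical helper: map a negative kernel return code from the logs to its
# Linux errno name. Only the four codes seen in this thread are listed, and
# the numbering is the Linux one (other systems may differ).
decode_errno() {
    # ${1#-} strips the leading minus sign, so "-99" is matched as "99".
    case "${1#-}" in
        5)   echo EIO ;;            # the "Input/output error" from mount and lctl ping
        22)  echo EINVAL ;;         # the "ROUTE ERROR -22" line
        99)  echo EADDRNOTAVAIL ;;  # "Can't query IPoIB interface ib0: -99"
        100) echo ENETDOWN ;;       # "Error -100 starting up LNI o2ib"
        *)   echo unknown ;;
    esac
}

for code in -5 -22 -99 -100; do
    printf '%s = %s\n' "$code" "$(decode_errno "$code")"
done
```

Read this way, -99 followed by -100 fits the explanation given in the thread: the interface had no address, so the o2ib network interface could not be brought up.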
Re: [Lustre-discuss] OFED 1.4
Hi Roger,

I got a chance to try OFED 1.4. It works fine, but Lustre 1.6.6 can't compile against it because of conflicts with some of OFED's backport headers. We will change our build system for it soon.

Regards
Liang

Roger Spellman:

Hi,

Is anyone using Lustre 1.6.6 with OFED 1.4? If so, how is this going?

Thanks.

Roger Spellman
Staff Engineer
Terascala, Inc.
508-588-1501
www.terascala.com

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] ko2iblnd panics in kiblnd_map_tx_descs
Hi Chris,

To resolve your problem, please:

1. Apply this patch to your lnet: https://bugzilla.lustre.org/attachment.cgi?id=15733
2. Make sure to use this option when running configure: --with-o2ib=/path/to/ofed
3. Copy /path/to/ofed/Module.symvers to your $LUSTRE before building.

Regards
Liang

Chris Worley wrote:

I'm trying to port Lustre 1.6.4.2 to OFED 1.2.5.5 w/ the RHEL kernel 2.6.9.67.0.4. ksocklnd-based mounts work fine, but when I try to mount over IB, I get a panic in ko2iblnd in the transmit descriptor mapping routine:

    general protection fault: [1] SMP
    CPU 1
    Modules linked in: ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) nfs(U) lockd(U) nfs_acl(U) sunrpc(U) rdma_ucm(U) ib_sdp(U) rdma_cm(U) iw_cm(U) ib_addr(U) mlx4_ib(U) mlx4_core(U) ib_mthca(U) dm_mod(U) ib_ipoib(U) md5(U) ipv6(U) ib_umad(U) ib_ucm(U) ib_uverbs(U) ib_cm(U) ib_sa(U) ib_mad(U) ib_core(U) aic79xx(U) e1000(U) ext3(U) jbd(U) raid0(U) mptscsih(U) mptsas(U) mptspi(U) mptscsi(U) mptbase(U) sd_mod(U) ata_piix(U) libata(U) scsi_mod(U)
    Pid: 5141, comm: modprobe Not tainted 2.6.9-67.0.4.EL-Lustre-1.6.4.2
    RIP: 0010:[a04659d1] a04659d1{:ko2iblnd:kiblnd_map_tx_descs+225}
    RSP: :0102105d7cd8 EFLAGS: 00010286
    RAX: a01e6b4e RBX: ff0010028000 RCX: 0001
    RDX: 1000 RSI: 01020e705000 RDI: 0102154e2000
    RBP: 0102102c4200 R08: R09:
    R10: R11: R12:
    R13: R14: R15: 0102102c4228
    FS: 002a958a0b00() GS:8046ac00() knlGS:
    CS: 0010 DS: ES: CR0: 8005003b
    CR2: 002a9598200f CR3: 9fa08000 CR4: 06e0
    Process modprobe (pid: 5141, threadinfo 0102105d6000, task 0102175e0030)
    Stack: 0102102c4080 0102102c4100 0102102c4200 0102179c2b86 0102177df400 010215548ac0 a0466fdf 0102179c2b85
    Call Trace:
    a0466fdf{:ko2iblnd:kiblnd_startup+2239}
    a03043dc{:lnet:lnet_startup_lndnis+332}
    a02d2f38{:libcfs:cfs_alloc+40}
    a0305206{:lnet:LNetNIInit+278}
    a03fcb0a{:ptlrpc:ptlrpc_ni_init+106}
    8012f9cd{default_wake_function+0}
    a03fcbfa{:ptlrpc:ptlrpc_init_portals+10}
    8012f9cd{default_wake_function+0}
    a045f22b{:ptlrpc:init_module+267}
    8014bc0a{sys_init_module+278}
    8010f23e{system_call+126}
    Code: ff 50 08 eb 12 48 8b 3f b9 01 00 00 00 ba 00 10 00 00 e8 30
    RIP a04659d1{:ko2iblnd:kiblnd_map_tx_descs+225} RSP 0102105d7cd8

Does this ring any bells? Otherwise, any debugging tips?

Shane said that they get an oops if they compile with the version-specific OFA tree. Is this the oops?

Thanks,
Chris

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
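
Steps 2 and 3 of Liang's reply to Chris could be scripted roughly as below. This is a sketch under assumptions, not the official build procedure: `prepare_o2ib_build` is a hypothetical helper, the two directory arguments stand in for `/path/to/ofed` and `$LUSTRE` from the message, and the configure line is only printed (assuming the usual `./configure` entry point), never executed. Step 1, applying the patch from the bugzilla attachment, is left to be done by hand.

```shell
#!/bin/sh
# Hypothetical helper: performs step 3 (copy Module.symvers into the Lustre
# tree) and prints the configure command from step 2 instead of running it.
# $1 = OFED source tree, $2 = Lustre source tree (both are placeholders).
prepare_o2ib_build() {
    ofed_dir=$1
    lustre_dir=$2
    # Refuse to continue if the OFED tree has no symbol version file.
    if [ ! -f "$ofed_dir/Module.symvers" ]; then
        echo "missing $ofed_dir/Module.symvers" >&2
        return 1
    fi
    cp "$ofed_dir/Module.symvers" "$lustre_dir/" || return 1
    echo "./configure --with-o2ib=$ofed_dir"
}

# Demonstration against throwaway directories, so no real tree is touched.
demo_ofed=$(mktemp -d)
demo_lustre=$(mktemp -d)
: > "$demo_ofed/Module.symvers"
prepare_o2ib_build "$demo_ofed" "$demo_lustre"
```

Printing the configure command rather than running it keeps the sketch safe to try; on a real build host one would point the two arguments at the actual OFED and Lustre source trees.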