Re: [lustre-discuss] Lnet and IPv6

2016-05-25 Thread Oucharek, Doug S
The situation depicted in those two links has not changed.  IPv6 is not 
supported by LNet due to the reasons given in the second link (i.e. it is a lot 
of work in multiple places).

Doug

On May 25, 2016, at 5:54 PM, Frederick Lefebvre wrote:

Does anyone know if current Lustre code (let's say 2.8 and up) would work on an
all-IPv6 Ethernet network? We found this presentation from 2015
(https://www.openfabrics.org/images/eventpresos/workshops2015/DevWorkshop/Wednesday/wednesday_05.pdf)
which suggests that server discovery in LNet didn't work with IPv6 addresses
(referencing another presentation from 2012:
http://cdn.opensfs.org/wp-content/uploads/2011/11/LNET-Support-for-IPv6_LUG-2012_Isaac-Huang-Xyratex.pdf).
Is that still true? Does anyone have first-hand experience with Lustre and IPv6?

Thanks for any pointers.

Frédérick
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] poor performance on reading small files

2016-08-03 Thread Oucharek, Doug S
Also note: If you are using IB, these small reads will make use of RDMA.  LNet
only uses rdma_writes (for historical reasons), so the client has to use IB
immediate messages to tell the server to write the 20KB file to the client.
The extra round-trip handshake involved adds latency to each file
read.  That could be why writes, which don’t need this extra handshake, perform
better than the reads.

The bigger the files (i.e., the more data moved per rdma_write), the less
noticeable the additional handshake overhead becomes.
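If you want to see that per-message overhead in isolation, a rough lnet selftest
sketch along the following lines (placeholder NIDs, two test nodes) can be compared
against an otherwise identical size=1M run; the small-transfer case is dominated by
the handshake latency described above:

#!/bin/bash
# Minimal sketch: read test at a small transfer size. NIDs are placeholders.
export LST_SESSION=$$
lst new_session small_read
lst add_group servers 10.0.0.2@o2ib
lst add_group clients 10.0.0.1@o2ib
lst add_batch small_reads
# 4K per bulk transfer: the RPC rate stays high but bandwidth stays low because
# each message pays the immediate-message round trip described above
lst add_test --batch small_reads --concurrency 8 --from clients --to servers \
brw read check=simple size=4K
lst run small_reads
lst stat servers & sleep 10; kill $!
lst end_session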

Doug

> On Aug 3, 2016, at 11:32 AM, Jeff Johnson wrote:
> 
> On 8/3/16 10:57 AM, Dilger, Andreas wrote:
>> On Jul 29, 2016, at 03:33, Oliver Mangold wrote:
>>> On 29.07.2016 04:19, Riccardo Veraldi wrote:
 I am using lustre on ZFS.
 
 While write performance is excellent even on smaller files, I find
 there is a drop in performance
 on reading 20KB files. Performance can go as low as 200MB/sec or even
 less.
>>> Getting 200 MB/s with 20kB files means you have to do roughly 10,000 metadata
>>> ops/s. Don't want to say it is impossible to get more than that, but at
>>> least with MDT on ZFS this doesn't sound bad either. Did you run an
>>> mdtest on your system? Maybe some serious tuning of MD performance is in
>>> order.
>> I'd agree with Oliver that getting 200MB/s with 20KB files is not too bad.
>> Are you using HDDs or SSDs for the MDT and OST devices?  If using HDDs,
>> are you using SSD L2ARC to allow the metadata and file data to be cached in
>> L2ARC, and are you allowing enough time for the L2ARC to be warmed up?
>> 
>> Are you using TCP or IB networking?  If using TCP then there is a lower
>> limit on the number of RPCs that can be handled compared to IB.
>> 
>> Cheers, Andreas
> 
> Also consider that you move only 20KB of data per lnet RPC (assuming a 1MB RPC).
> To move 20KB files at 200MB/sec into a non-striped LFS directory, are you using EDR
> for lnet? 100Gb Ethernet?
> 
> --Jeff
> 
> 
> -- 
> --
> Jeff Johnson
> Co-Founder
> Aeon Computing
> 
> jeff.john...@aeoncomputing.com
> www.aeoncomputing.com
> t: 858-412-3810 x1001   f: 858-412-3845
> m: 619-204-9061
> 
> 4170 Morena Boulevard, Suite D - San Diego, CA 92117
> 
> High-performance Computing / Lustre Filesystems / Scale-out Storage
> 

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] difficulties mounting client via an lnet router

2016-07-11 Thread Oucharek, Doug S
You mentioned that the servers are on the o2ib0 network, but the error messages
indicate that the client is trying to communicate with the MDT on the tcp
network.  The filesystem configuration needs to be updated to use the servers'
current o2ib NIDs.
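If the configuration logs do hold stale NIDs, the usual way to regenerate them is
the writeconf procedure. A rough sketch only (device paths are placeholders; unmount
everything first and follow the manual for the exact sequence):

# On the MDS, with the target unmounted (device path is a placeholder)
tunefs.lustre --writeconf /dev/mdt_device
# Repeat on every OSS for each OST device
tunefs.lustre --writeconf /dev/ost_device
# Remount the MDT first, then the OSTs, then the clients, so the config
# logs are regenerated with the servers' current NIDs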

Doug

> On Jul 11, 2016, at 7:34 AM, Jessica Otey  wrote:
> 
> All,
> I am, as before, working on a small test lustre setup (RHEL 6.8, lustre v.
> 2.4.3) to prepare for upgrading a 1.8.9 lustre production system to 2.4.3
> (first the servers and lnet routers, then at a subsequent time, the clients).
> Lustre servers have IB connections, but the clients are 1G ethernet only.
> 
> For the life of me, I cannot get the client to mount via the router on this 
> test system. (Client will mount fine when router is taken out of the 
> equation.) This is the error I am seeing in the syslog from the mount attempt:
> 
> Jul 11 10:15:37 tlclient kernel: Lustre: 
> 3605:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed 
> out for slow reply: [sent 1468246532/real 1468246532]  req@88032a3f9400 
> x1539566484848752/t0(0) 
> o38->tlustre-MDT-mdc-88032ad20400@10.7.29.130@tcp:12/10 lens 400/544 
> e 0 to 1 dl 1468246537 ref 1 fl Rpc:XN/0/ rc 0/-1
> Jul 11 10:16:07 tlclient kernel: Lustre: 
> 3605:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed 
> out for slow reply: [sent 1468246557/real 1468246557]  req@880629819000 
> x1539566484848764/t0(0) 
> o38->tlustre-MDT-mdc-88032ad20400@10.7.29.130@tcp:12/10 lens 400/544 
> e 0 to 1 dl 1468246567 ref 1 fl Rpc:XN/0/ rc 0/-1
> Jul 11 10:16:37 tlclient kernel: Lustre: 
> 3605:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed 
> out for slow reply: [sent 1468246582/real 1468246582]  req@88062a371000 
> x1539566484848772/t0(0) 
> o38->tlustre-MDT-mdc-88032ad20400@10.7.29.130@tcp:12/10 lens 400/544 
> e 0 to 1 dl 1468246597 ref 1 fl Rpc:XN/0/ rc 0/-1
> Jul 11 10:16:44 tlclient kernel: LustreError: 
> 2511:0:(lov_obd.c:937:lov_cleanup()) lov tgt 0 not cleaned! deathrow=0, 
> lovrc=1
> Jul 11 10:16:44 tlclient kernel: Lustre: Unmounted tlustre-client
> Jul 11 10:16:44 tlclient kernel: LustreError: 
> 4881:0:(obd_mount.c:1289:lustre_fill_super()) Unable to mount (-4)
> 
> More than one pair of eyes has looked at the configs and confirmed they look 
> okay. But frankly we've got to be missing something since this should (like 
> lustre on a good day) 'just work'.
> 
> If anyone has seen this issue before and could give some advice, it'd be 
> appreciated. One major question I have is whether the problem is a 
> configuration issue or a procedure issue--perhaps the order in which I am 
> doing things is causing the failure? The order I'm following currently is:
> 
> 1) unmount/remove modules on all boxes
> 2) bring up the lnet modules on the router, and bring up the network
> 3) On the mds: add the modules, bring up the network, mount the mdt
> 4) On the oss: add the modules, bring up the network, mount the oss
> 5) On the client: add the modules, bring up the network, attempt to mount 
> client (fails)
> 
> Configs follow below.
> 
> Thanks in advance,
> Jessica
> 
> tlnet (the router)
> [root@tlnet ~]# cat /etc/modprobe.d/lustre.conf
> # tlnet configuration
> alias ib0 ib_ipoib
> alias net-pf-27 ib_sdp
> options lnet networks="o2ib0(ib0),tcp0(em1)" forwarding="enabled"
> 
> [root@tlnet ~]# ifconfig #lo omitted
> em1   Link encap:Ethernet  HWaddr 78:2B:CB:25:A7:E2
>  inet addr:10.7.29.134  Bcast:10.7.29.255 Mask:255.255.255.0
>  UP BROADCAST RUNNING MULTICAST  MTU:1500 Metric:1
>  RX packets:453441 errors:0 dropped:0 overruns:0 frame:0
>  TX packets:264313 errors:0 dropped:0 overruns:0 carrier:0
>  collisions:0 txqueuelen:1000
>  RX bytes:436188202 (415.9 MiB)  TX bytes:22274957 (21.2 MiB)
> ib0   Link encap:InfiniBand  HWaddr 
> 80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
>  inet addr:10.7.129.134  Bcast:10.7.129.255 Mask:255.255.255.0
>  UP BROADCAST RUNNING MULTICAST  MTU:2044 Metric:1
>  RX packets:650 errors:0 dropped:0 overruns:0 frame:0
>  TX packets:34 errors:0 dropped:0 overruns:0 carrier:0
>  collisions:0 txqueuelen:256
>  RX bytes:75376 (73.6 KiB)  TX bytes:2904 (2.8 KiB)
> 
> tlclient (the client)
> [root@tlclient ~]# cat /etc/modprobe.d/lustre.conf
> options lnet networks="tcp0(em1)" routes="o2ib0 10.7.29.134@tcp0" 
> live_router_check_interval=60 dead_router_check_interval=60
> 
> [root@tlclient ~]# ifconfig #lo omitted
> em1   Link encap:Ethernet  HWaddr 00:26:B9:35:B1:1A
>  inet addr:10.7.29.132  Bcast:10.7.29.255 Mask:255.255.255.0
>  UP BROADCAST RUNNING MULTICAST  MTU:1500 Metric:1
>  RX packets:2817 errors:0 dropped:0 overruns:0 frame:0
>  TX packets:2233 errors:0 dropped:0 overruns:0 carrier:0
>  collisions:0 

Re: [lustre-discuss] LNET Self-test

2017-02-07 Thread Oucharek, Doug S
Because the stat command is “lst stat servers”, the statistics you are seeing 
are from the perspective of the server.  The “from” and “to” parameters can get 
quite confusing for the read case.  When reading, you are transferring the bulk 
data from the “to” group to the “from” group (yes, seems the opposite of what 
you would expect).  I think the “from” and “to” labels were designed to make 
sense in the write case and the logic was just flipped for the read case.
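For example (a rough sketch using the group names from the script quoted below),
collecting the same counters from the client group reports [R]/[W] from that side
instead, which makes the direction easier to sanity-check:

# Server-side view (what the output below shows)
lst stat servers & sleep 10; kill $!
# Client-side view of the same run, reported from the "readers" group's
# perspective
lst stat readers & sleep 10; kill $!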

So, the stats you show indicate that you are writing an average of about 3600 MiB/s
(note: the lnet-selftest stats are mislabeled and should read MiB/s rather than
MB/s…I have fixed this in the latest release.  That works out to roughly 3.8 GB/s).
The reason you see traffic in the read direction is the responses/acks.
That is why there are a lot of small messages going back to the server (high
RPC rate, small bandwidth).

So, your test looks like it is working to me.

Doug

> On Feb 7, 2017, at 2:13 AM, Jon Tegner <teg...@foi.se> wrote:
> 
> Probably doing something wrong here, but I tried to test only READING with 
> the following:
> 
> #!/bin/bash
> export LST_SESSION=$$
> lst new_session read
> lst add_group servers 10.0.12.12@o2ib
> lst add_group readers 10.0.12.11@o2ib
> lst add_batch bulk_read
> lst add_test --batch bulk_read --concurrency 12 --from readers --to servers \
> brw read check=simple size=1M
> lst run bulk_read
> lst stat servers & sleep 10; kill $!
> lst end_session
> 
> which in my case gives:
> 
> [LNet Rates of servers]
> [R] Avg: 3633 RPC/s Min: 3633 RPC/s Max: 3633 RPC/s
> [W] Avg: 7241 RPC/s Min: 7241 RPC/s Max: 7241 RPC/s
> [LNet Bandwidth of servers]
> [R] Avg: 2.29 MB/s  Min: 2.29 MB/s  Max: 2.29 MB/s
> [W] Avg: 3608.44  MB/s  Min: 3608.44  MB/s  Max: 3608.44  MB/s
> 
> It seems strange that it reports non-zero numbers in the [W] positions,
> especially since the bandwidth is low in the [R] position (given that I explicitly
> requested "read"). Also note that if I change "brw read" to "brw write" in the
> script above, the results are "reversed" in the sense that the higher bandwidth
> number is reported in the [R] position. That is, "brw read"
> reports (almost) the expected bandwidth in the [W] position, whereas "brw
> write" reports it in the [R] position.
> 
> This is on CentOS-6.5/Lustre-2.5.3. Will try 7.3/2.9.0 later.
> 
> Thanks,
> /jon
> 
> 
> On 02/06/2017 05:45 PM, Oucharek, Doug S wrote:
>> Try running just a read test and then just a write test rather than having 
>> both at the same time and see if the performance goes up.
> 

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] LNET Self-test

2017-02-05 Thread Oucharek, Doug S
Yes, you can bump your concurrency.  Size caps out at 1M because that is how
LNet is set up to work.  Going over a 1M size would result in an unrealistic
Lustre test.
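For example, a quick concurrency sweep (a sketch with placeholder NIDs, patterned
on the lst scripts used elsewhere in this thread) can show where the link saturates:

#!/bin/bash
export LST_SESSION=$$
for c in 1 2 4 8 16 32; do
  lst new_session bw_c${c}
  lst add_group servers 10.0.12.12@o2ib
  lst add_group readers 10.0.12.11@o2ib
  lst add_batch bulk_read
  lst add_test --batch bulk_read --concurrency ${c} --from readers --to servers \
  brw read check=simple size=1M
  lst run bulk_read
  echo "=== concurrency ${c} ==="
  lst stat servers & sleep 10; kill $!
  lst end_session
done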

Doug

> On Feb 5, 2017, at 11:55 AM, Jeff Johnson wrote:
> 
> Without seeing your entire command it is hard to say for sure but I would 
> make sure your concurrency option is set to 8 for starters. 
> 
> --Jeff
> 
> Sent from my iPhone
> 
>> On Feb 5, 2017, at 11:30, Jon Tegner  wrote:
>> 
>> Hi,
>> 
>> I'm trying to use lnet selftest to evaluate network performance on a test 
>> setup (only two machines). Using e.g., iperf or Netpipe I've managed to 
>> demonstrate the bandwidth of the underlying 10 Gbits/s network (and 
>> typically you reach the expected bandwidth as the packet size increases).
>> 
>> How can I do the same using lnet selftest (i.e., verifying the bandwidth of 
>> the underlying hardware)? My initial thought was to increase the I/O size, 
>> but it seems the maximum size one can use is "--size=1M".
>> 
>> Thanks,
>> 
>> /jon

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] LNET Self-test

2017-02-06 Thread Oucharek, Doug S
You can have larger RPCs, but those get split up into 1M LNet operations.  
Lnet-selftest works with LNet messages and not RPCs.
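If you want to confirm what bulk RPC size a client is actually using, independent
of the 1M LNet fragment limit, something like this should show it (a sketch; run on
a mounted client):

# Pages per bulk RPC for each OSC; with 4K pages, 256 = 1M RPCs, 1024 = 4M RPCs
lctl get_param osc.*.max_pages_per_rpc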

Doug

On Feb 5, 2017, at 3:07 PM, Patrick Farrell <p...@cray.com> wrote:

Doug,

It seems to me that's not true any more, with larger RPC sizes available.  Is 
there some reason that's not true?

- Patrick

From: lustre-discuss <lustre-discuss-boun...@lists.lustre.org> on behalf of Oucharek, Doug S <doug.s.oucha...@intel.com>
Sent: Sunday, February 5, 2017 3:18:10 PM
To: Jeff Johnson
Cc: lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] LNET Self-test

Yes, you can bump your concurrency.  Size caps out at 1M because that is how
LNet is set up to work.  Going over a 1M size would result in an unrealistic
Lustre test.

Doug

> On Feb 5, 2017, at 11:55 AM, Jeff Johnson <jeff.john...@aeoncomputing.com> wrote:
>
> Without seeing your entire command it is hard to say for sure but I would 
> make sure your concurrency option is set to 8 for starters.
>
> --Jeff
>
> Sent from my iPhone
>
>> On Feb 5, 2017, at 11:30, Jon Tegner <teg...@foi.se> wrote:
>>
>> Hi,
>>
>> I'm trying to use lnet selftest to evaluate network performance on a test 
>> setup (only two machines). Using e.g., iperf or Netpipe I've managed to 
>> demonstrate the bandwidth of the underlying 10 Gbits/s network (and 
>> typically you reach the expected bandwidth as the packet size increases).
>>
>> How can I do the same using lnet selftest (i.e., verifying the bandwidth of 
>> the underlying hardware)? My initial thought was to increase the I/O size, 
>> but it seems the maximum size one can use is "--size=1M".
>>
>> Thanks,
>>
>> /jon

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] LNET Self-test

2017-02-06 Thread Oucharek, Doug S
Try running just a read test and then just a write test rather than having both 
at the same time and see if the performance goes up.

Doug

> On Feb 6, 2017, at 4:40 AM, Jon Tegner  wrote:
> 
> Hi,
> 
> I used the following script:
> 
> #!/bin/bash
> export LST_SESSION=$$
> lst new_session read/write
> lst add_group servers 10.0.12.12@o2ib
> lst add_group readers 10.0.12.11@o2ib
> lst add_group writers 10.0.12.11@o2ib
> lst add_batch bulk_rw
> lst add_test --batch bulk_rw --concurrency 12 --from readers --to servers \
> brw read check=simple size=1M
> lst add_test --batch bulk_rw --concurrency 12 --from writers --to servers \
> brw write check=simple size=1M
> # start running
> lst run bulk_rw
> # display server stats for 30 seconds
> lst stat servers & sleep 30; kill $!
> # tear down
> lst end_session
> 
> and tried with concurrency from 0,2,4,8,12,16, results in
> 
> http://renget.se/lnetBandwidth.png
> and
> http://renget.se/lnetRates.png
> 
> From the bandwidth plot, a max of just below 2800 MB/s can be noted. Since in this
> case "readers" and "writers" are the same, I did a few tests with the line
> 
> lst add_test --batch bulk_rw --concurrency 12 --from writers --to servers \
> brw write check=simple size=1M
> 
> removed from the script - which resulted in a bandwidth of around 3600 MB/s.
> 
> I also did tests using mpitests-osu_bw from openmpi, and in that case I 
> monitored a bandwidth of about 3900 MB/s.
> 
> Considering the "openmpi-bandwidth" should I be happy with the numbers 
> obtained by LNet selftest? Is there a way to modify the test so that the 
> result gets closer to what openmpi is giving? And what can be said of the 
> "Rates of servers (RPC/s)" - are they "good" or "bad"? What to compare them 
> with?
> 
> Thanks!
> 
> /jon
> 
> On 02/05/2017 08:55 PM, Jeff Johnson wrote:
>> Without seeing your entire command it is hard to say for sure but I would 
>> make sure your concurrency option is set to 8 for starters.
>> 
>> --Jeff
>> 
>> Sent from my iPhone
>> 
>>> On Feb 5, 2017, at 11:30, Jon Tegner  wrote:
>>> 
>>> Hi,
>>> 
>>> I'm trying to use lnet selftest to evaluate network performance on a test 
>>> setup (only two machines). Using e.g., iperf or Netpipe I've managed to 
>>> demonstrate the bandwidth of the underlying 10 Gbits/s network (and 
>>> typically you reach the expected bandwidth as the packet size increases).
>>> 
>>> How can I do the same using lnet selftest (i.e., verifying the bandwidth of 
>>> the underlying hardware)? My initial thought was to increase the I/O size, 
>>> but it seems the maximum size one can use is "--size=1M".
>>> 
>>> Thanks,
>>> 
>>> /jon

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] RDMA too fragmented, OSTs unavailable (permanently)

2016-09-22 Thread Oucharek, Doug S
Hi Thomas,

It is interesting that you have encountered this error without a router.  Good 
information.   I have updated LU-5718 with a link to this discussion.

The original fix posted to LU-5718 by Liang should fix this problem for you (it
does not assume a router is the cause).  That fix does double the amount of
memory used per QP.  That is probably not an issue for a client, but it could be an
issue for a router (as Cray has found).

Are you using the quotas feature?  There is some evidence that it may play a role
here.

Doug

> On Sep 10, 2016, at 12:38 AM, Thomas Roth  wrote:
> 
> Hi all,
> 
> we are running Lustre 2.5.3 on Infiniband. We have massive problems with 
> clients being unable to communicate with any number of OSTs, rendering the 
> entire cluster quite unusable.
> 
> Clients show
> > LNetError: 1399:0:(o2iblnd_cb.c:1140:kiblnd_init_rdma()) RDMA too 
> > fragmented for 10.20.0.242@o2ib1 (256): 231/256 src 231/256 dst frags
> > LNetError: 1399:0:(o2iblnd_cb.c:1690:kiblnd_reply()) Can't setup rdma for 
> > GET from 10.20.0.242@o2ib1: -90
> 
> which eventually results in OSTs at that nid becoming "temporarily 
> unavailable".
> However, the OSTs are never recovered, until they are manually evicted or the 
> host rebooted.
> 
> On the OSS side, this reads
> >  LNetError: 13660:0:(o2iblnd_cb.c:3075:kiblnd_check_conns()) Timed out RDMA 
> > with 10.20.0.220@o2ib1 (56): c: 7, oc: 0, rc: 7
> 
> 
> We have checked the IB fabric, which shows no errors. Since we are not able 
> to reproduce this effect in a simple way, we have also scrutinized the user 
> code, so far without results.
> 
> Whenever this happens, the connection between client and OSS is fine under 
> all IB test commands.
> Communication between client and OSS is still going on, but obviously when 
> Lustre tries to replay the missed transaction, this fragmentation limit is 
> hit again, so the OST never becomes available again.
> 
> If we understand correctly, the map_on_demand parameter should be increased 
> as a workaround.
> The ko2iblnd module seems to provide this parameter,
> > modinfo ko2iblnd
> > parm:   map_on_demand:map on demand (int)
> 
> but no matter what we load the module with, map_on_demand always remains at 
> the default value,
> > cat /sys/module/ko2iblnd/parameters/map_on_demand
> > 0
> 
> Is there any way to understand
> - why this memory fragmentation occurs/becomes so large?
> - how to measure the real fragmentation degree (o2iblnd simply stops at 256, 
> perhaps we are at 1000?)
> - why map_on_demand cannot be changed?
> 
> 
> Of course this all looks very much like LU-5718, but our clients are not 
> behind LNET routers.
> 
> There is one router which connects to the campus network but is not in use. 
> And there are some routers which connect to an older cluster, but of course 
> the old (1.8) clients never show any of these errors.
> 
> 
> Cheers,
> Thomas
> 
> 
> Thomas Roth
> Department: HPC
> Location: SB3 1.262
> Phone: +49-6159-71 1453  Fax: +49-6159-71 2986
> 
> GSI Helmholtzzentrum für Schwerionenforschung GmbH
> Planckstraße 1
> 64291 Darmstadt
> www.gsi.de
> 
> Gesellschaft mit beschränkter Haftung
> Sitz der Gesellschaft: Darmstadt
> Handelsregister: Amtsgericht Darmstadt, HRB 1528
> 
> Geschäftsführung: Professor Dr. Karlheinz Langanke
> Ursula Weyrich
> Jörg Blaurock
> 
> Vorsitzender des Aufsichtsrates: St Dr. Georg Schütte
> Stellvertreter: Ministerialdirigent Dr. Rolf Bernhardt

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] RDMA too many fragments/timed out - clients slowing entire filesystem performance

2016-11-01 Thread Oucharek, Doug S
Hi Brian,

You need this patch: http://review.whamcloud.com/#/c/12451.  It has not landed 
to master yet and is off by default.  To activate it, add this module parameter 
line to your nodes (all of them):

options ko2iblnd wrq_sge=2

The issue is that something is causing an offset to be introduced to the bulk
transfers.  That causes a misalignment of the source and destination fragments.
Due to how the algorithm works, you then require twice as many descriptors for the
fragments to do the RDMA operation, so you run out of descriptors
when you are only halfway done configuring the transfer.  The above patch
creates two sets of descriptors so the second set can be utilized in situations
like this.  The fix operates on the nodes which are doing the bulk transfers.
Since you can both read and write bulk data, you need the fix on servers,
clients, and LNet routers (basically, everywhere).
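A rough sketch of applying that persistently and reloading the modules (the conf
file name is just a convention; do this with Lustre unmounted on the node):

# Persist the option so it survives reboots
echo "options ko2iblnd wrq_sge=2" > /etc/modprobe.d/ko2iblnd.conf
# Unload and reload the Lustre/LNet modules so the new value takes effect
lustre_rmmod
modprobe -v lustre
# Confirm the running value
cat /sys/module/ko2iblnd/parameters/wrq_sge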

Question: are you using the quotas feature and could it be at or approaching a 
limit?  There has been some evidence that the quotas feature could be 
introducing the offset to bulk transfers.

Doug

On Nov 1, 2016, at 4:08 PM, Brian W. Johanson wrote:

Centos 7.2
Lustre 2.8.0
ZFS 0.6.5.5
OPA 10.2.0.0.158


The clients and servers are on the same OPA network, no routing.  Once a client
gets in this state, the filesystem performance drops to a fraction of what it is
capable of.
The client must be rebooted to clear the issue.


I imagine I am missing a bug in Jira for this; does this look like a
known issue?


Pertinent debug messages from the server:

0800:0002:34.0:1478026118.277782:0:29892:0:(o2iblnd_cb.c:3109:kiblnd_check_txs_locked())
 Timed out tx: active_txs, 4 seconds
0800:0002:34.0:1478026118.277785:0:29892:0:(o2iblnd_cb.c:3172:kiblnd_check_conns())
 Timed out RDMA with 10.4.119.112@o2ib (3): c: 112, oc: 0, rc: 66
0800:0100:34.0:1478026118.277787:0:29892:0:(o2iblnd_cb.c:1913:kiblnd_close_conn_locked())
 Closing conn to 10.4.119.112@o2ib: error -110(waiting)
0100:0002:34.0:1478026118.277844:0:29892:0:(events.c:447:server_bulk_callback())
 event type 5, status -103, desc 883e8e8bcc00
0100:0002:34.0:1478026118.288714:0:29892:0:(events.c:447:server_bulk_callback())
 event type 3, status -103, desc 883e8e8bcc00
0100:0002:34.0:1478026118.299574:0:29892:0:(events.c:447:server_bulk_callback())
 event type 5, status -103, desc 8810e92e9c00
0100:0002:34.0:1478026118.310434:0:29892:0:(events.c:447:server_bulk_callback())
 event type 3, status -103, desc 8810e92e9c00

And from the client:


0400:0100:8.0:1477949860.565777:0:3629:0:(lib-move.c:1489:lnet_parse_put())
 Dropping PUT from 12345-10.4.108.81@o2ib portal 4 match 1549728742532740 
offset 0 length 192: 4
0400:0100:8.0:1477949860.565782:0:3629:0:(lib-move.c:1489:lnet_parse_put())
 Dropping PUT from 12345-10.4.108.81@o2ib portal 4 match 1549728742532740 
offset 0 length 192: 4
0800:0002:8.0:1477949860.702666:0:3629:0:(o2iblnd_cb.c:1094:kiblnd_init_rdma())
 RDMA has too many fragments for peer 10.4.108.81@o2ib (32), src idx/frags: 
16/27 dst idx/frags: 16/27
0800:0002:8.0:1477949860.702667:0:3629:0:(o2iblnd_cb.c:1689:kiblnd_reply())
 Can't setup rdma for GET from 10.4.108.81@o2ib: -90
0100:0002:8.0:1477949860.702669:0:3629:0:(events.c:201:client_bulk_callback())
 event type 1, status -5, desc 880fd5d9bc00
0800:0002:8.0:1477949860.81:0:3629:0:(o2iblnd_cb.c:1094:kiblnd_init_rdma())
 RDMA has too many fragments for peer 10.4.108.81@o2ib (32), src idx/frags: 
16/27 dst idx/frags: 16/27
0800:0002:8.0:1477949860.816668:0:3629:0:(o2iblnd_cb.c:1689:kiblnd_reply())
 Can't setup rdma for GET from 10.4.108.81@o2ib: -90
0100:0002:8.0:1477949860.816669:0:3629:0:(events.c:201:client_bulk_callback())
 event type 1, status -5, desc 880fd5d9bc00
0400:0100:8.0:1477949861.573660:0:3629:0:(lib-move.c:1489:lnet_parse_put())
 Dropping PUT from 12345-10.4.108.81@o2ib portal 4 match 1549728742532740 
offset 0 length 192: 4
0400:0100:8.0:1477949861.573664:0:3629:0:(lib-move.c:1489:lnet_parse_put())
 Dropping PUT from 12345-10.4.108.81@o2ib portal 4 match 1549728742532740 
offset 0 length 192: 4
0400:0100:8.0:1477949861.573667:0:3629:0:(lib-move.c:1489:lnet_parse_put())
 Dropping PUT from 12345-10.4.108.81@o2ib portal 4 match 1549728742532740 
offset 0 length 192: 4
0400:0100:8.0:1477949861.573669:0:3629:0:(lib-move.c:1489:lnet_parse_put())
 Dropping PUT from 12345-10.4.108.81@o2ib portal 4 match 1549728742532740 
offset 0 length 192: 4
0400:0100:8.0:1477949861.573671:0:3629:0:(lib-move.c:1489:lnet_parse_put())
 Dropping PUT from 12345-10.4.108.81@o2ib portal 4 match 1549728742532740 
offset 0 length 192: 4
0400:0100:8.0:1477949861.573673:0:3629:0:(lib-move.c:1489:lnet_parse_put())
 Dropping PUT from 12345-10.4.108.81@o2ib portal 4 match 

Re: [lustre-discuss] building lustre from source rpms, mellanox OFED, CentOS 6.8

2016-12-16 Thread Oucharek, Doug S
What distro do you want to build for?  If RHEL 7.3, the instructions Brett 
quoted no longer work thanks to weak module loading being activated.

Doug

On Dec 16, 2016, at 9:43 AM, Brett Lee wrote:

Hi Lana, Here's a link:  
https://wiki.hpdd.intel.com/display/PUB/Building+Lustre+from+Source

The first time through may require some clarification (at least it did for me). 
 I'm sure this list can help if clarification is needed.

Brett
--
Secure your confidential information with PDS 2
PDS Software Solutions LLC
https://www.TrustPDS.com

On Fri, Dec 16, 2016 at 10:35 AM, Lana Deere wrote:
Hi,

The Lustre manual says that instructions for building Lustre from source are 
available online, but I have not been able to locate anything Lustre-specific, 
just general rpmbuild documentation and items in Jira about trouble people had. 
 Can someone give me a pointer to any Lustre-specific documentation there might 
be?

The reason I want to build from source is because I need to upgrade the OFED in 
CentOS from the distribution version to a current Mellanox version (3.4-2, 
ideally).  Has anyone had any success in building Lustre against the Mellanox 
OFED stack, and if so is there any help or tips online somewhere for me to look 
at?

Thanks!

.. Lana (lana.de...@gmail.com)




___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] client fails to mount

2017-04-25 Thread Oucharek, Doug S
That specific message happens when the “magic” u32 field at the start of a
message does not match what we are expecting.  We do check whether the message was
transmitted with a different endianness than ours, so when you see this error we
assume that the message has been corrupted or the sender is using an invalid magic
value.  I don’t believe this value has changed in the history of the LND, so this is
more likely corruption of some sort.
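If you want more detail from LNet when this happens, turning up the net debug flags
before reproducing can help (a sketch; adjust the flag set to taste):

# Enable network-related debug messages and clear the debug buffer
lctl set_param debug=+net
lctl clear
# ... reproduce the failing mount or lctl ping ...
# Dump the kernel debug buffer to a file for inspection
lctl dk /tmp/lustre-debug.log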

Doug

> On Apr 25, 2017, at 2:29 AM, Dilger, Andreas  wrote:
> 
> I'm not an LNet expert, but I think the critical issue to focus on is:
> 
>  Lustre: Lustre: Build Version: 2.5.3-RC1--PRISTINE-2.6.32-573.el6.x86_64
>  LNet: Added LNI 172.23.54.51@o2ib [8/256/0/180]
>  LNetError: 2878:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) 172.23.55.211@o2ib 
> rejected: consumer defined fatal error
> 
> This means that the LND didn't connect at startup time, but I don't know what 
> the cause is.
> The error that generates this message is IB_CM_REJ_CONSUMER_DEFINED, but I 
> don't know enough about IB to tell you what that means.  Some of the later 
> code is checking for mismatched Lustre versions, but it doesn't even get that 
> far.
> 
> Cheers, Andreas
> 
>> On Apr 25, 2017, at 02:21, Strikwerda, Ger  wrote:
>> 
>> Hi Raj,
>> 
>> [root@pg-gpu01 ~]# lustre_rmmod
>> 
>> [root@pg-gpu01 ~]# modprobe -v lustre
>> insmod 
>> /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/net/lustre/libcfs.ko
>>  
>> insmod 
>> /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/lvfs.ko 
>> insmod 
>> /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/net/lustre/lnet.ko
>>  networks=o2ib(ib0)
>> insmod 
>> /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/obdclass.ko
>>  
>> insmod 
>> /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/ptlrpc.ko
>>  
>> insmod 
>> /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/fid.ko 
>> insmod 
>> /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/mdc.ko 
>> insmod 
>> /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/osc.ko 
>> insmod 
>> /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/lov.ko 
>> insmod 
>> /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/lustre.ko
>>  
>> 
>> dmesg:
>> 
>> LNet: HW CPU cores: 24, npartitions: 4
>> alg: No test for crc32 (crc32-table)
>> alg: No test for adler32 (adler32-zlib)
>> alg: No test for crc32 (crc32-pclmul)
>> Lustre: Lustre: Build Version: 2.5.3-RC1--PRISTINE-2.6.32-573.el6.x86_64
>> LNet: Added LNI 172.23.54.51@o2ib [8/256/0/180]
>> 
>> But no luck,
>> 
>> [root@pg-gpu01 ~]# lctl ping 172.23.55.211@o2ib
>> failed to ping 172.23.55.211@o2ib: Input/output error
>> 
>> [root@pg-gpu01 ~]# mount /home
>> mount.lustre: mount 172.23.55.211@o2ib:172.23.55.212@o2ib:/pghome01 at /home 
>> failed: Input/output error
>> Is the MGS running?
>> 
>> 
>> 
>> 
>> 
>> 
>> On Mon, Apr 24, 2017 at 7:53 PM, Raj  wrote:
>> Yes, this is strange. Normally, I have seen a credits mismatch result in
>> this scenario, but it doesn't look like that is the case here.
>> 
>> You wouldn't want to put the MGS into debug message capture, as there will be a
>> lot of data.
>> 
>> I guess you already tried removing the lustre drivers and adding them again?
>> lustre_rmmod 
>> modprobe -v lustre
>> 
>> And check dmesg for any errors...
>> 
>> 
>> On Mon, Apr 24, 2017 at 12:43 PM Strikwerda, Ger wrote:
>> Hi Raj,
>> 
>> When I do an lctl ping on an MGS server I do not see any logs at all. Also not
>> when I do a successful ping from a working node. Is there a way to make the
>> Lustre logging more verbose to see more detail at the LNET level?
>> 
>> It is very strange that a rebooted node is able to lctl ping compute nodes, 
>> but fails to lctl ping metadata and storage nodes. 
>> 
>> 
>> 
>> 
>> On Mon, Apr 24, 2017 at 7:35 PM, Raj  wrote:
>> Ger,
>> It looks like default configuration of lustre. 
>> 
>> Do you see any error message on the MGS side while you are doing lctl ping 
>> from the rebooted clients? 
>> On Mon, Apr 24, 2017 at 12:27 PM Strikwerda, Ger wrote:
>> Hi Eli,
>> 
>> Nothing can be mounted on the Lustre filesystems so the output is:
>> 
>> [root@pg-gpu01 ~]# lfs df /home/ger/
>> [root@pg-gpu01 ~]# 
>> 
>> Empty..
>> 
>> 
>> 
>> On Mon, Apr 24, 2017 at 7:24 PM, E.S. Rosenberg  wrote:
>> 
>> 
>> On Mon, Apr 24, 2017 at 8:19 PM, Strikwerda, Ger wrote:
>> Hallo Eli,
>> 
>> Logfile/syslog on the client-side:
>> 
>> Lustre: Lustre: Build Version: 2.5.3-RC1--PRISTINE-2.6.32-573.el6.x86_64
>> LNet: Added LNI 172.23.54.51@o2ib [8/256/0/180]
>> LNetError: 2878:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) 172.23.55.211@o2ib 
>> rejected: consumer defined fatal error
>> 
>> lctl df /path/to/some/file
>> 
>> gives 

Re: [lustre-discuss] Does Lustre support RoCE?

2017-05-12 Thread Oucharek, Doug S
I’ve been able to determine what is causing the dump_cqe failures, but not why 
it is happening now (all of a sudden).

In Lustre, we pass an IOV of fragments to be RDMA’ed over IB.  The fragments 
need to be page aligned except that the first fragment does not have to start 
on a page boundary and the last fragment does not have to end on a page 
boundary.

When we set up the DMA addresses for remote RDMA, we mask off the fragments so 
the addresses are all on a page boundary.  I guess the original authors 
believed that all DMA addresses needed to be page aligned for IB hardware.  The 
mlx5 code (MOFED 4 specific?) does not like that we are not using the actual 
start address and is rejecting it in the form of a dump_cqe error.

This code does not seem to be a problem with MOFED 3.x so has something 
changed?  Has a page alignment restriction been removed?  I really cannot just 
turn off this alignment operation as I have no idea what will break elsewhere 
in the world of OFED/MOFED.

Could use some insight from people who understand IB hardware/firmware.

Doug

On May 11, 2017, at 11:26 AM, Indivar Nair 
<indivar.n...@techterra.in<mailto:indivar.n...@techterra.in>> wrote:

Thanks for the advice.
I had a hunch that the development will take time.

Regards,


Indivar Nair

On Thu, May 11, 2017 at 11:28 PM, Oucharek, Doug S <doug.s.oucha...@intel.com> wrote:
As I write this, I am banging my head against this wall trying to figure it
out.  It is related to the new memory region registration process used by mlx5
cards.  I could really use the help of any Mellanox/RDMA experts out there.
The API has virtually no documentation, and without the source code for MOFED 4
I am really unable to do much more than guess at what is going on.

So, expect this to take a long time to resolve and stick with MOFED 3.x.

Doug

On May 11, 2017, at 10:29 AM, Indivar Nair <indivar.n...@techterra.in> wrote:

Thanks a lot, Michael, Andreas, Simon, Doug,
I have already installed MLNX OFED 4:-(
I will now have to undo it and install the earlier version.

Roughly, by when would the support for MLNX OFED 4 be available?

Regards,


Indivar Nair

On Thu, May 11, 2017 at 9:35 PM, Oucharek, Doug S <doug.s.oucha...@intel.com> wrote:
The note regarding MOFED 4 not supported by Lustre: I’m working on it. MOFED 4 
did not drop support of Lustre, but did make API/behaviour changes which Lustre 
has not fully adapted to yet.  The ball is in the Lustre community’s court on 
this one now.

Doug

On May 11, 2017, at 8:47 AM, Simon Guilbault <simon.guilba...@calculquebec.ca> wrote:

Hi, your lnet.conf looks fine. I tested lnet with RoCE v2 a while back with a
pair of servers using ConnectX-4 with a single 25Gb interface, and RDMA was
working with CentOS 7.3, stock RHEL OFED and Lustre 2.9. The only setting that
I had to use in lustre's config was this one:

options lnet networks=o2ib(ens2)

The performance was about the same (1.9GB/s) without any tuning with the lnet 
self-test but the CPU utilisation was a lot lower with RDMA than TCP (3% vs 65% 
of a core).

From the notes I took back then, Lustre needed to be recompiled with MLNX OFED
3.4, and MLNX OFED 4 dropped support for Lustre according to their release notes.

Ref 965588
https://www.mellanox.com/related-docs/prod_software/Mellanox_OFED_Linux_Release_Notes_4_0-2_0_0_1.pdf
https://www.mellanox.com/related-docs/prod_software/Mellanox_OFED_Linux_Release_Notes_4_0-2_0_2_0.pdf


On Thu, May 11, 2017 at 11:34 AM, Indivar Nair <indivar.n...@techterra.in> wrote:
So I should add something like this in lnet.conf -

options lnet networks=o2ib0(p4p1)

Thats it, right?

Regards,


Indivar Nair

On Thu, May 11, 2017 at 8:39 PM, Dilger, Andreas <andreas.dil...@intel.com> wrote:
If you have RoCE cards and configure them with OFED, and configure Lustre to 
use o2iblnd then it should use RDMA for those interfaces. The fact that they 
are RoCE cards is hidden below OFED.

Cheers, Andreas

> On May 11, 2017, at 08:36, Indivar Nair <indivar.n...@techterra.in> wrote:
>
> Hi ...,
>
> I have read in different forums and blogs that Lustre supports RoCE.
> But I can't find any documentation on it.
>
> I have a Lustre setup with 6 OSS and 2 SMB/NFS Gateways.
> They are all interconnected using a Mellanox SN2700 100G switch and Mellanox
> ConnectX-4 100G NICs.
> I have installed the Mellanox OFED drivers, but I can't find a way to tell
> Lustre / LNET to use RoCE.
>
> How do I go about?
>
> Regards,
>
>
> Indivar Nair
>
>

Re: [lustre-discuss] Does Lustre support RoCE?

2017-05-11 Thread Oucharek, Doug S
The note regarding MOFED 4 not being supported by Lustre: I’m working on it. MOFED 4
did not drop support for Lustre, but it did make API/behaviour changes which Lustre
has not fully adapted to yet.  The ball is in the Lustre community’s court on
this one now.

Doug

On May 11, 2017, at 8:47 AM, Simon Guilbault wrote:

Hi, your lnet.conf looks fine. I tested lnet with RoCE v2 a while back with a
pair of servers using ConnectX-4 with a single 25Gb interface, and RDMA was
working with CentOS 7.3, stock RHEL OFED and Lustre 2.9. The only setting that
I had to use in lustre's config was this one:

options lnet networks=o2ib(ens2)

The performance was about the same (1.9GB/s) without any tuning with the lnet 
self-test but the CPU utilisation was a lot lower with RDMA than TCP (3% vs 65% 
of a core).

From the notes I took back then, Lustre needed to be recompiled with MLNX OFED
3.4, and MLNX OFED 4 dropped support for Lustre according to their release notes.

Ref 965588
https://www.mellanox.com/related-docs/prod_software/Mellanox_OFED_Linux_Release_Notes_4_0-2_0_0_1.pdf
https://www.mellanox.com/related-docs/prod_software/Mellanox_OFED_Linux_Release_Notes_4_0-2_0_2_0.pdf


On Thu, May 11, 2017 at 11:34 AM, Indivar Nair wrote:
So I should add something like this in lnet.conf -

options lnet networks=o2ib0(p4p1)

Thats it, right?

Regards,


Indivar Nair

On Thu, May 11, 2017 at 8:39 PM, Dilger, Andreas wrote:
If you have RoCE cards and configure them with OFED, and configure Lustre to 
use o2iblnd then it should use RDMA for those interfaces. The fact that they 
are RoCE cards is hidden below OFED.

Cheers, Andreas

> On May 11, 2017, at 08:36, Indivar Nair wrote:
>
> Hi ...,
>
> I have read in different forums and blogs that Lustre supports RoCE.
> But I can't find any documentation on it.
>
> I have a Lustre setup with 6 OSS and 2 SMB/NFS Gateways.
> They are all interconnected using a Mellanox SN2700 100G switch and Mellanox
> ConnectX-4 100G NICs.
> I have installed the Mellanox OFED drivers, but I can't find a way to tell
> Lustre / LNET to use RoCE.
>
> How do I go about?
>
> Regards,
>
>
> Indivar Nair
>
>

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Does Lustre support RoCE?

2017-05-11 Thread Oucharek, Doug S
As I write this, I am banging my head against this wall trying to figure it
out.  It is related to the new memory region registration process used by mlx5
cards.  I could really use the help of any Mellanox/RDMA experts out there.
The API has virtually no documentation, and without the source code for MOFED 4
I am really unable to do much more than guess at what is going on.

So, expect this to take a long time to resolve and stick with MOFED 3.x.

Doug

On May 11, 2017, at 10:29 AM, Indivar Nair <indivar.n...@techterra.in> wrote:

Thanks a lot, Michael, Andreas, Simon, Doug,
I have already installed MLNX OFED 4:-(
I will now have to undo it and install the earlier version.

Roughly, by when would the support for MLNX OFED 4 be available?

Regards,


Indivar Nair

On Thu, May 11, 2017 at 9:35 PM, Oucharek, Doug S <doug.s.oucha...@intel.com> wrote:
The note regarding MOFED 4 not supported by Lustre: I’m working on it. MOFED 4 
did not drop support of Lustre, but did make API/behaviour changes which Lustre 
has not fully adapted to yet.  The ball is in the Lustre community’s court on 
this one now.

Doug

On May 11, 2017, at 8:47 AM, Simon Guilbault <simon.guilba...@calculquebec.ca> wrote:

Hi, your lnet.conf looks fine. I tested lnet with RoCE v2 a while back with a
pair of servers using ConnectX-4 with a single 25Gb interface, and RDMA was
working with CentOS 7.3, stock RHEL OFED and Lustre 2.9. The only setting that
I had to use in lustre's config was this one:

options lnet networks=o2ib(ens2)

The performance was about the same (1.9GB/s) without any tuning with the lnet 
self-test but the CPU utilisation was a lot lower with RDMA than TCP (3% vs 65% 
of a core).

From the notes I took back then, Lustre needed to be recompiled with MLNX OFED
3.4, and MLNX OFED 4 dropped support for Lustre according to their release notes.

Ref 965588
https://www.mellanox.com/related-docs/prod_software/Mellanox_OFED_Linux_Release_Notes_4_0-2_0_0_1.pdf
https://www.mellanox.com/related-docs/prod_software/Mellanox_OFED_Linux_Release_Notes_4_0-2_0_2_0.pdf


On Thu, May 11, 2017 at 11:34 AM, Indivar Nair <indivar.n...@techterra.in> wrote:
So I should add something like this in lnet.conf -

options lnet networks=o2ib0(p4p1)

Thats it, right?

Regards,


Indivar Nair

On Thu, May 11, 2017 at 8:39 PM, Dilger, Andreas <andreas.dil...@intel.com> wrote:
If you have RoCE cards and configure them with OFED, and configure Lustre to 
use o2iblnd then it should use RDMA for those interfaces. The fact that they 
are RoCE cards is hidden below OFED.

Cheers, Andreas

> On May 11, 2017, at 08:36, Indivar Nair <indivar.n...@techterra.in> wrote:
>
> Hi ...,
>
> I have read in different forums and blogs that Lustre supports RoCE.
> But I can't find any documentation on it.
>
> I have a Lustre setup with 6 OSS and 2 SMB/NFS Gateways.
> They are all interconnected using a Mellanox SN2700 100G switch and Mellanox
> ConnectX-4 100G NICs.
> I have installed the Mellanox OFED drivers, but I can't find a way to tell
> Lustre / LNET to use RoCE.
>
> How do I go about?
>
> Regards,
>
>
> Indivar Nair
>
>



___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Robinhood exhausting RPC resources against 2.5.5 lustre file systems

2017-05-17 Thread Oucharek, Doug S
How is it you are getting the same NID registering twice in the log file:

Feb 24 20:46:44 10.7.7.8 kernel: LNet: Added LNI 10.7.17.8@o2ib [8/256/0/180]
Feb 24 20:46:44 10.7.7.8 kernel: LNet: Added LNI 10.7.17.8@o2ib [8/256/0/180]

Doug

On May 17, 2017, at 11:04 AM, Jessica Otey wrote:


All,

We have observed an unfortunate interaction between Robinhood and two Lustre 
2.5.5 file systems (both of which originated as 1.8.9 file systems).

Robinhood was used successfully against these file systems when they were both 
1.8.9, 2.4.3, and then 2.5.3 (a total time span of about 11 months).

We also have a third Lustre file system that originated as 2.4.3, and has since 
been upgraded to 2.5.5, against which Robinhood is currently operating as 
expected. This leads me to suppose that the issue may have to do with the
interaction between Robinhood and a legacy-1.8.x-now-lustre-2.5.5 system. But I
don't know.

The problem manifests itself as follows: either a Robinhood file scan or the
initiation of changelog consumption results in the consumption of all the
available RPC resources on the MDT. This in turn leads to the MDT not being
able to satisfy any other requests from clients, which in turn leads to client 
disconnections (the MDT thinks they are dead and evicts them). Meanwhile, 
Robinhood itself is unable to traverse the file system to gather the 
information it seeks, and so its scans either hang (due to the client 
disconnect) or run at a rate such that they would never complete (less than 1 
file per second).

If we don't run robinhood at all, the file system performs (after a remount of 
the MDT) as expected.

Initially, we thought that the difficulty might be that we neglected to 
activate the FID-in-dirent feature when we upgraded to 2.4.3. We did so on one
of these systems, and ran an lfsck oi_scrub, but that did not ameliorate the 
problem.

Any thoughts on this matter would be appreciated. (We miss using Robinhood!)

Thanks,

Jessica



More data for those who cannot help themselves:

April 2016 - Robinhood comes into production use against both our 1.8.9 file 
systems.

July 2016 - Upgrade to 2.4.3 (on both production lustre file systems) -- 
Robinhood rebuilt against 2.4.3 client; changelog consumption now included.

Lustre "reconnects" (from /var/log/messages on one of the MDTs):

July 2016: 4

Aug 2016: 20

Sept 2016: 8

Oct 2016: 8

Nov 4-6, 2016 - Upgrade to 2.5.3 (on both production lustre file systems) -- 
Robinhood rebuilt against 2.5.3 client.

Lustre "reconnects":

Nov. 2016: 180

Dec. 2016: 62

Jan. 2017: 96

Feb 1-24, 2017: 2

Feb 24, 2017 - Upgrade to 2.5.5 (on both production lustre file systems)

 NAASC-Lustre MDT coming back 

Feb 24 20:46:44 10.7.7.8 kernel: Lustre: Lustre: Build Version: 
2.5.5-g22a210f-CHANGED-2.6.32-642.6.2.el6_lustre.2.5.5.x86_64
Feb 24 20:46:44 10.7.7.8 kernel: Lustre: Lustre: Build Version: 
2.5.5-g22a210f-CHANGED-2.6.32-642.6.2.el6_lustre.2.5.5.x86_64
Feb 24 20:46:44 10.7.7.8 kernel: LNet: Added LNI 10.7.17.8@o2ib [8/256/0/180]
Feb 24 20:46:44 10.7.7.8 kernel: LNet: Added LNI 10.7.17.8@o2ib [8/256/0/180]
Feb 24 20:46:45 10.7.7.8 kernel: LDISKFS-fs (md127): mounted filesystem with 
ordered data mode. quota=off. Opts:
Feb 24 20:46:45 10.7.7.8 kernel: LDISKFS-fs (md127): mounted filesystem with 
ordered data mode. quota=off. Opts:
Feb 24 20:46:46 10.7.7.8 kernel: Lustre: MGC10.7.17.8@o2ib: Connection restored 
to MGS (at 0@lo)
Feb 24 20:46:46 10.7.7.8 kernel: Lustre: MGC10.7.17.8@o2ib: Connection restored 
to MGS (at 0@lo)
Feb 24 20:46:47 10.7.7.8 kernel: Lustre: naaschpc-MDT: used disk, loading
Feb 24 20:46:47 10.7.7.8 kernel: Lustre: naaschpc-MDT: used disk, loading

The night after this upgrade, a regular rsync to the backup Lustre system 
provokes a failure/client disconnect. (Unfortunately, I don't have the logs to 
look at Robinhood activity from this time, but I believe I restarted the 
service after the system came back.)

Feb 25 02:14:24 10.7.7.8 kernel: LustreError: 
25103:0:(service.c:2020:ptlrpc_server_handle_request()) @@@ Dropping timed-out 
request from 12345-10.7.17.123@o2ib: deadline 6:11s ago
Feb 25 02:14:24 10.7.7.8 kernel: LustreError: 
25103:0:(service.c:2020:ptlrpc_server_handle_request()) @@@ Dropping timed-out 
request from 12345-10.7.17.123@o2ib: deadline 6:11s ago
Feb 25 02:14:24 10.7.7.8 kernel:  req@88045b3a2050 x1560271381909936/t0(0) 
o103->bb228923-4216-cc59-d847-38b543af1ae2@10.7.17.123@o2ib:0/0
 lens 3584/0 e 0 to 0 dl 1488006853 ref 1 fl Interpret:/0/ rc 0/-1
Feb 25 02:14:24 10.7.7.8 kernel:  req@88045b3a2050 x1560271381909936/t0(0) 
o103->bb228923-4216-cc59-d847-38b543af1ae2@10.7.17.123@o2ib:0/0
 lens 3584/0 e 0 

Re: [lustre-discuss] Clients looses IB connection to OSS.

2017-05-01 Thread Oucharek, Doug S
For the “RDMA has too many fragments” issue, you need newly landed patch: 
http://review.whamcloud.com/12451.  For the slow access, not sure if that is 
related to the too many fragments error.  Once you get the too many fragments 
error, that node usually needs to unload/reload the LNet module to recover.

Doug

On May 1, 2017, at 7:47 AM, Hans Henrik Happe wrote:

Hi,

We have experienced problems with losing the connection to an OSS. It starts with:

May  1 03:35:46 node872 kernel: LNetError:
5545:0:(o2iblnd_cb.c:1094:kiblnd_init_rdma()) RDMA has too many
fragments for peer 10.21.10.116@o2ib (256), src idx/frags: 128/236 dst
idx/frags: 128/236
May  1 03:35:46 node872 kernel: LNetError:
5545:0:(o2iblnd_cb.c:1689:kiblnd_reply()) Can't setup rdma for GET from
10.21.10.116@o2ib: -90

The rest of the log is attached.

After this, Lustre access is very slow, i.e. a 'df' can take minutes.
Also, 'lctl ping' to the OSS gives I/O errors. Doing 'lnet net del/add'
makes ping work again until file I/O starts. Then I/O errors again.

We use both IB and TCP on servers, so no routers.

In the attached log astro-OST0001 has been moved to the other server in
the HA pair. This is because 'lctl dl -t' showed strange output when on
the right server:

# lctl dl -t
 0 UP mgc MGC10.21.10.102@o2ib 0b0bbbce-63b6-bf47-403c-28f0c53e8307 5
 1 UP lov astro-clilov-88107412e800
53add9a3-e719-26d9-afb4-3fe9b0fa03bd 4
 2 UP lmv astro-clilmv-88107412e800
53add9a3-e719-26d9-afb4-3fe9b0fa03bd 4
 3 UP mdc astro-MDT-mdc-88107412e800
53add9a3-e719-26d9-afb4-3fe9b0fa03bd 5 10.21.10.102@o2ib
 4 UP osc astro-OST0002-osc-88107412e800
53add9a3-e719-26d9-afb4-3fe9b0fa03bd 5 10.21.10.116@o2ib
 5 UP osc astro-OST0001-osc-88107412e800
53add9a3-e719-26d9-afb4-3fe9b0fa03bd 5 172.20.10.115@tcp1
 6 UP osc astro-OST0003-osc-88107412e800
53add9a3-e719-26d9-afb4-3fe9b0fa03bd 5 10.21.10.117@o2ib
 7 UP osc astro-OST-osc-88107412e800
53add9a3-e719-26d9-afb4-3fe9b0fa03bd 5 10.21.10.114@o2ib

So astro-OST0001 seems to be connected through 172.20.10.115@tcp1, even
though it uses 10.21.10.115@o2ib (verified by performance test and
disabling tcp1 on IB nodes).

Please ask for more details if needed.

Cheers,
Hans Henrik

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] slow mount of lustre CentOS6 clients to 2.9 servers

2017-05-05 Thread Oucharek, Doug S
Are the NIDs "192.168.xxx.yyy@o2ib” really configured that way or did you 
modify those logs when pasting them to email?

Doug

On May 5, 2017, at 11:02 AM, Grigory Shamov wrote:

Hi All,

We have been installing a new Lustre storage system.
To that end, we have built new clients with the following configuration:

CentOS 6.8, kernel 2.6.32-642.el6.x86_64
Mellanox OFED 3.4.1.0 (on QDR fabric)

and either lustre-2.8.0 or lustre-2.9.0 clients, which we rebuilt from source.
The new server is Lustre 2.9 on CentOS 7.3.

Now, the clients we built have a problem mounting the filesystem.  It takes a
long time, and/or fails initially, with messages as follows (for the 2.8
client):

mounting device 192.168.xxx.yyy@o2ib:/lustre at /lustrenew, flags=0x400 
options=flock,device=192.168.xxx.yyy@o2ib:/lustre
mount.lustre: mount 192.168.xxx.yyy@o2ib:/lustre at /lustrenew failed: 
Input/output error retries left: 0
mount.lustre: mount 192.168.xxx.yyy@o2ib:/lustre at /lustrenew failed: 
Input/output error
Is the MGS running?

In dmesg:

LNet: HW CPU cores: 24, npartitions: 4
alg: No test for adler32 (adler32-zlib)
alg: No test for crc32 (crc32-table)
alg: No test for crc32 (crc32-pclmul)
Lustre: Lustre: Build Version: 2.8.0-RC5--PRISTINE-2.6.32-642.el6.x86_64
LNet: Added LNI 192.168.aaa.bbb@o2ib [8/256/0/180]
Lustre: 3476:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has 
timed out for sent delay: [sent 1493927511/real 0]  req@88061a1aac80 
x1566496533774340/t0(0) 
o250->MGC192.168.xxx.yyy@o...@192.168.xxx.yyy@o2ib:26/25
 lens 520/544 e 0 to 1 dl 1493927516 ref 2 fl Rpc:XN/0/ rc 0/-1
LustreError: 15c-8: MGC192.168.xxx.yyy@o2ib: The configuration from log 
'lustre-client' failed (-5). This may be the result of communication errors 
between this node and the MGS, a bad configuration, or other errors. See the 
syslog for more information.
Lustre: Unmounted lustre-client


The initial mount would thus fail; the mount then happens, but the OSTs take a lot
of time to become active:

UUID  1K-blocksUsed  Available Use% Mounted on
lustre-MDT_UUID  1156701708  751100  1077936556  0% /lustrenew[MDT:0]
OST: inactive device
OST0001: inactive device
OST0002: inactive device
OST0003: inactive device
OST0004: inactive device
OST0005: inactive device
OST0006: inactive device
OST0007: inactive device

filesystem summary:0  0  0  0% /lustrenew

Then, after some 10 minutes, the mount completes and, performance-wise, Lustre
seems to be normal.

Same dmesg output from 2.9 client

LNet: HW CPU cores: 24, npartitions: 2
alg: No test for adler32 (adler32-zlib)
alg: No test for crc32 (crc32-table)
alg: No test for crc32 (crc32-pclmul)
Lustre: Lustre: Build Version: 2.9.0
LNet: Added LNI 192.168.aaa.bbb@o2ib [8/256/0/180]
Lustre: 3468:0:(client.c:2111:ptlrpc_expire_one_request()) @@@ Request sent has 
timed out for sent delay: [sent 1493929145/real 0]  req@880631d07c80 
x1566498247147536/t0(0) 
o250->MGC192.168.xxx.yyy@o...@192.168.xxx.yyy@o2ib:26/25
 lens 520/544 e 0 to 1 dl 1493929150 ref 2 fl Rpc:XN/0/ rc 0/-1
LustreError: 15c-8: MGC192.168.xxx.yyy@o2ib: The configuration from log 
'lustre-client' failed (-5). This may be the result of communication errors 
between this node and the MGS, a bad configuration, or other errors. See the 
syslog for more information.
Lustre: Unmounted lustre-client
LustreError: 3413:0:(obd_mount.c:1449:lustre_fill_super()) Unable to mount  (-5)

I am at a loss as to what would cause such behavior. Could anyone advise where
to look for the causes of this problem? Thank you very much in advance!

--
Grigory Shamov
HPC Site Lead,
University of Manitoba
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org



Re: [lustre-discuss] Lustre on Mellanox multi-host infiniband problem

2017-05-05 Thread Oucharek, Doug S
I’m not sure I understand which version of MOFED you are using.  Can you verify 
whether this is MOFED 3.x or 4.x?

Doug
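
For reference, the installed Mellanox stack can usually be identified with (a 
sketch; the rpm name pattern is approximate):

ofed_info -s                                        # prints e.g. MLNX_OFED_LINUX-3.4-2.1.8.0
modinfo mlx5_core | grep -E '^(filename|version)'   # shows whether the MOFED or in-box driver is loaded
rpm -qa | grep -i 'mlnx.*ofa'                       # MOFED kernel support packages, if installed from RPMs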

On May 5, 2017, at 9:51 AM, HM Li > wrote:


Confirmed.

This is a bug in the git version (2.9.55_45); it works well when using 
MLNX_OFED_LINUX-3.4-2.1.8.0-rhel7.3-x86_64 and 
https://downloads.hpdd.intel.com/public/lustre/latest-feature-release/el7.3.1611/server/SRPMS/lustre-2.9.0-1.src.rpm.

node453@tc4600: ~# export LC_ALL=C
node453@tc4600: ~# lctl lustre_build_version
Lustre version: 2.9.0
node453@tc4600: ~# df
Filesystem                  1K-blocks        Used    Available Use% Mounted on
10.10.100.6@o2ib:/lxfs     7341068688    10152136   6960149976   1% /home
10.10.100.1@o2ib:/sgfs   108704716104 24800951320  78393702908  25% /mnt
node453@tc4600: ~# ibv_devinfo
hca_id: mlx5_0
        transport:                      InfiniBand (0)
        fw_ver:                         12.17.1010
        node_guid:                      46e3:e861:1f19:4438
        sys_image_guid:                 46e3:e861:1f19:4438
        vendor_id:                      0x02c9
        vendor_part_id:                 4115
        hw_ver:                         0x0
        board_id:                       SGN1130110032
        phys_port_cnt:                  1
        Device ports:
                port:   1
                        state:          PORT_ACTIVE (4)
                        max_mtu:        4096 (5)
                        active_mtu:     4096 (5)
                        sm_lid:         360
                        port_lid:       362
                        port_lmc:       0x00
                        link_layer:     InfiniBand


On May 3, 2017, at 17:02, HM Li wrote:
Dear all,
I have set up Lustre from git (2.9.55_45) on CentOS 7.3, but the client (on the 
same multi-host IB) can't mount Lustre. Can you help me? Thank you very much.

  *   The server:
 *   mkfs.lustre --fsname=lxfs --mgs --mdt --index=0  --reformat /dev/sda5
 *   mkfs.lustre --fsname=lxfs --mgsnode=10.10.146.1@o2ib1 
--servicenode=10.10.146.1@o2ib1 --ost --reformat --index=1 /dev/sda6
 *   mount -t lustre /dev/sda5 /mnt/mdt
 *   mount -t lustre /dev/sda6 /mnt/ost
 *   mount -t lustre -v  10.10.146.1@o2ib1:/lxfs /home is OK.
 *   lctl list_nids
10.10.146.1@o2ib1
 *   lctl ping 10.10.146.2@o2ib1
12345-0@lo
12345-10.10.146.2@o2ib1

  *   The client:
 *   lctl list_nids
10.10.146.2@o2ib1
 *   lctl ping 10.10.146.1@o2ib1
12345-0@lo
12345-10.10.146.1@o2ib1
 *   mount -t lustre -v 10.10.146.1@o2ib1:/lxfs /home/
arg[0] = /sbin/mount.lustre
arg[1] = -v
arg[2] = -o
arg[3] = rw
arg[4] = 10.10.146.1@o2ib1:/lxfs
arg[5] = /home
source = 10.10.146.1@o2ib1:/lxfs (10.10.146.1@o2ib1:/lxfs), target = /home
options = rw
mounting device 10.10.146.1@o2ib1:/lxfs at /home, flags=0x100 
options=device=10.10.146.1@o2ib1:/lxfs
mount.lustre: mount 10.10.146.1@o2ib1:/lxfs at /home failed: Input/output error 
retries left: 0
mount.lustre: mount 10.10.146.1@o2ib1:/lxfs at /home failed: Input/output error
Is the MGS running?

  *   and now on server dmesg show:
[82709.336007] Lustre: MGS: Connection restored to 
792b2b21-2e57-de7d-3d8f-5e80eb6d7bf2 (at 10.10.146.2@o2ib1)
[82709.339324] mlx5_0:dump_cqe:275:(pid 22740): dump error cqe
[82709.339508]    
[82709.339677]    
[82709.339841]    
[82709.340006]  9d005304 0874 01f1c5d2
[82716.34] Lustre: MGS: Received new LWP connection from 10.10.146.2@o2ib1, 
removing former export from same NID
[82716.343712] Lustre: MGS: Connection restored to 
792b2b21-2e57-de7d-3d8f-5e80eb6d7bf2 (at 10.10.146.2@o2ib1)

  *   IB information:
 *   ibstat
CA 'mlx5_0'
CA type: MT4115
Number of ports: 1
Firmware version: 12.17.1010
Hardware version: 0
Node GUID: 0x46e3e8611f19443a
System image GUID: 0x46e3e8611f194438
Port 1:
State: Active
Physical state: LinkUp
Rate: 100
Base lid: 360
LMC: 2
SM lid: 360
Capability mask: 0x2651e84a
Port GUID: 0x46e3e8611f19443a
Link layer: InfiniBand
 *   ibv_devinfo
hca_id: mlx5_0
        transport:                      InfiniBand (0)
        fw_ver:                         12.17.1010
        node_guid:                      46e3:e861:1f19:443a
        sys_image_guid:                 46e3:e861:1f19:4438
        vendor_id:                      0x02c9
        vendor_part_id:                 4115
        hw_ver:                         0x0
        board_id:                       SGN1130110032
        phys_port_cnt:                  1
        Device ports:
                port:   1
                        state:          PORT_ACTIVE (4)
                        max_mtu:        4096 (5)
                        active_mtu:     4096 (5)
                        sm_lid:         360
                        port_lid:       360
                        port_lmc:       0x02
                        link_layer:     InfiniBand
 *   ibstatus
Infiniband device 'mlx5_0' port 1 status:
default gid: fe80:0000:0000:0000:46e3:e861:1f19:443a
base lid: 0x168
sm lid: 0x168
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 100 Gb/sec (4X EDR)
link_layer: InfiniBand
 *   opensm 
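
Both ends here are on o2ib1 rather than the default o2ib, so it is also worth 
confirming that server and client name the network identically in their lnet 
options (a sketch; ib0 and the modprobe file path are assumptions that depend 
on the installation):

# /etc/modprobe.d/lustre.conf on both server and client
options lnet networks="o2ib1(ib0)"
# after any change, reload on the client (with nothing mounted) and re-check
lustre_rmmod && modprobe lustre && lctl list_nids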

Re: [lustre-discuss] Lustre on Mellanox multi-host infiniband problem

2017-05-05 Thread Oucharek, Doug S
The tag you checked out is missing this fix: 
https://review.whamcloud.com/#/c/24306/.  Try applying that.

Doug
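
A change from review.whamcloud.com can usually be pulled straight into a 
checked-out lustre-release tree along these lines (a sketch; take the exact 
patchset number and fetch URL from the Download box on the change page rather 
than from here, where patchset 2 is only a guess):

cd lustre-release
# the last two digits of the change number (06 for 24306) form the ref prefix
git fetch https://review.whamcloud.com/fs/lustre-release refs/changes/06/24306/2
git cherry-pick FETCH_HEAD
# then rebuild the client packages as before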


Re: [lustre-discuss] Lustre on Mellanox multi-host infiniband problem

2017-05-08 Thread Oucharek, Doug S
I’m currently investigating a problem with MOFED 4.x which seems very similar 
to what you are seeing.  I have no solution yet.

Doug
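
Until that is sorted out, two workarounds suggest themselves: stay on MOFED 3.4, 
which the earlier message confirmed works, or make sure the Lustre modules are 
rebuilt against the MOFED 4.x tree that is actually installed rather than the 
in-kernel OFED (a sketch; /usr/src/ofa_kernel/default is MOFED's usual install 
path and may differ on your nodes):

cd lustre-release
sh autogen.sh
./configure --with-linux=/usr/src/kernels/$(uname -r) \
            --with-o2ib=/usr/src/ofa_kernel/default
make rpms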

On May 7, 2017, at 7:09 AM, HM Li <li...@163.com> wrote:


Thank you very much.

The MLNX OFED used on the Multi-Host nodes is MLNX_OFED_LINUX-4.0-1.0.1.0-rhel7.3-x86_64.

This driver and Lustre (git, 2.9.55_45) work well on other, normal FDR nodes.

On May 6, 2017, at 01:14, Oucharek, Doug S wrote:
The tag you checked out is missing this fix: 
https://review.whamcloud.com/#/c/24306/.  Try applying that.

Doug
