Re: [lustre-discuss] Lustre routing help needed

2017-11-01 Thread Kevin M. Hildebrand
So apparently the issue is indeed with the combination of using a Lustre
2.10.1 router with 2.8 servers and clients.  Downgrading the router to 2.9
seems to have solved the problem.
(I can't run 2.8 on the router, because I'm running MOFED 4.1 for the
Mellanox ConnectX-5, and I can't get 2.8 to build with that version...)

Thanks, everyone, for your assistance!
Kevin


On Mon, Oct 30, 2017 at 5:47 PM, Dilger, Andreas wrote:

> The 2.10 release added support for multi-rail LNet, which may be
> causing problems here. I would suggest installing an older LNet version
> on your routers to match your client/server.
>
> You may need to build your own RPMs for your new kernel, but can use
> --disable-server for configure to simplify things.
>
> Cheers, Andreas
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre routing help needed

2017-10-30 Thread Dilger, Andreas
The 2.10 release added support for multi-rail LNet, which may be causing
problems here. I would suggest installing an older LNet version on your
routers to match your client/server.

You may need to build your own RPMs for your new kernel, but can use 
--disable-server for configure to simplify things.

Cheers, Andreas



Re: [lustre-discuss] Lustre routing help needed

2017-10-30 Thread Kevin M. Hildebrand
Thanks, I completely missed that.  Indeed the ko2iblnd parameters were
different between the servers and the router.  I've updated the parameters
on the router to match those on the server, and things haven't gotten any
better.  (The problem appears to be on the Ethernet side anyway, so you've
probably helped me fix a problem I didn't know I had...)
I don't see much discussion about configuring LNet parameters for Ethernet
networks; I assume those use ksocklnd.  On that side, it appears that
all of the ksocklnd parameters match between the router and clients.
Interesting that peer_timeout is 180, which is almost exactly when my
client gets marked down on the router.

Server (and now router) ko2iblnd parameters:
peer_credits 8
peer_credits_hiw 4
credits 256
concurrent_sends 8
ntx 512
map_on_demand 0
fmr_pool_size 512
fmr_flush_trigger 384
fmr_cache 1

Client and router ksocklnd:
peer_timeout 180
peer_credits 8
keepalive 30
sock_timeout 50
credits 256
rx_buffer_size 0
tx_buffer_size 0
keepalive_idle 30
round_robin 1
sock_timeout 50

Thanks,
Kevin




Re: [lustre-discuss] Lustre routing help needed

2017-10-30 Thread Mohr Jr, Richard Frank (Rick Mohr)

> On Oct 30, 2017, at 8:47 AM, Kevin M. Hildebrand  wrote:
> 
> All of the hosts (client, server, router) have the following in ko2iblnd.conf:
> 
> alias ko2iblnd-opa ko2iblnd
> options ko2iblnd-opa peer_credits=128 peer_credits_hiw=64 credits=1024 
> concurrent_sends=256 ntx=2048 map_on_demand=32 fmr_pool_size=2048 
> fmr_flush_trigger=512 fmr_cache=1 conns_per_peer=4
> 
> install ko2iblnd /usr/sbin/ko2iblnd-probe

Those parameters will only get applied to omnipath interfaces (which you don’t 
have), so everything you have should just be running with default parameters.  
Since your lnet routers have a different version of lustre than your 
servers/clients, it might be possible that the default values for the ko2iblnd 
parameters are different between the two versions.  You can always check this 
by looking at the values in the files under /sys/module/ko2iblnd/parameters.  
It might be worthwhile to compare those values on the lnet routers to the 
values on the servers to see if maybe there is a difference that could affect 
the behavior.
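
As a rough sketch of that comparison (not an official Lustre tool): the
snippet below reads every file in a parameters directory into a dict and
reports the names whose values differ between two hosts.  The function
names are made up for illustration, and it assumes each host's
/sys/module/ko2iblnd/parameters directory has already been copied somewhere
readable, or that the script is run once per host and the results compared.

```python
from pathlib import Path


def read_module_params(param_dir):
    """Return {parameter_name: value} for every readable file in param_dir."""
    params = {}
    for entry in sorted(Path(param_dir).iterdir()):
        if not entry.is_file():
            continue
        try:
            params[entry.name] = entry.read_text().strip()
        except OSError:
            # Some module parameters are not readable; skip those.
            continue
    return params


def diff_params(left, right):
    """Return {name: (left_value, right_value)} for parameters that differ.

    A parameter that exists on only one side is reported with None for the
    side where it is missing.
    """
    differences = {}
    for name in sorted(set(left) | set(right)):
        if left.get(name) != right.get(name):
            differences[name] = (left.get(name), right.get(name))
    return differences
```

For example, with hypothetical copies under params/router and params/server,
diff_params(read_module_params("params/router"),
read_module_params("params/server")) returns only the parameters worth
looking at.  The same two functions work unchanged for
/sys/module/ksocklnd/parameters.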

--
Rick Mohr
Senior HPC System Administrator
National Institute for Computational Sciences
http://www.nics.tennessee.edu



Re: [lustre-discuss] Lustre routing help needed

2017-10-30 Thread Kevin M. Hildebrand
I received a reply from Alejandro suggesting that I check
live_router_check_interval, dead_router_check_interval and
router_ping_timeout.
I had those set to the defaults, which I assume are 60, 60, and 50 seconds
respectively.  I did just try setting those values explicitly, and I'm not
seeing any better behavior.
From watching /proc/sys/lnet/routers on the client, I see that the client
is indeed sending router pings every 60 seconds.  On the router itself,
watching /proc/sys/lnet/peers immediately after doing 'lctl net down; lctl
net up', I see the 'last' column for my test client count from 0 up to
around 180, at which point the client is marked 'down'.  (For the other
peers, all of which are servers, the values count from 0 to around 180 and
then reset to 0, remaining 'up')
Is the 'last' column reflecting the last time the router has received a
'ping' from that peer?  If so, why do the numbers count to 180 instead of
60, which is the frequency they're being sent?
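
To make the timing question concrete, here is a toy model of the behavior
described above.  It is only a sketch under two assumptions that are not
confirmed anywhere in this thread: that 'last' is the number of seconds
since the router last recorded the peer as alive, and that the peer is
marked 'down' once that value reaches peer_timeout.  The function name is
hypothetical and nothing here is taken from the LNet source.

```python
def first_down_time(alive_events, peer_timeout, horizon):
    """Return the first second at which the peer would be marked down.

    alive_events: times (in seconds after 'lctl net up') at which the
                  router records the peer as alive.
    peer_timeout: seconds without aliveness news before the peer is down.
    horizon:      how many seconds to simulate.
    Returns None if the peer stays up for the whole simulation.
    """
    pending = iter(sorted(alive_events))
    next_event = next(pending, None)
    last_alive = 0  # the counter starts at 0 when the network comes up
    for now in range(horizon + 1):
        # Consume every aliveness event that has happened by 'now'.
        while next_event is not None and next_event <= now:
            last_alive = next_event
            next_event = next(pending, None)
        if now - last_alive >= peer_timeout:
            return now
    return None
```

Under this model, a peer whose 60-second pings are all registered never
goes down (first_down_time(range(60, 601, 60), 180, 600) returns None),
while a peer for which nothing is ever registered goes down at exactly
180 seconds (first_down_time([], 180, 600) returns 180).  If the
assumptions hold, the client's counter running straight to 180 would
suggest the router is not counting the client's pings as aliveness news at
all, rather than counting them late.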

Thanks,
Kevin



Re: [lustre-discuss] Lustre routing help needed

2017-10-30 Thread LOPEZ, ALEXANDRE
Hi Kevin,

Just wild-guessing here. Have you tried playing with the 
live_router_check_interval, dead_router_check_interval and router_ping_timeout 
LNet parameters?

HTH,
Alejandro



[lustre-discuss] Lustre routing help needed

2017-10-30 Thread Kevin M. Hildebrand
Hello, I'm trying to set up some new Lustre routers between a set of
Infiniband connected Lustre servers and a few hosts connected to an
external 100G Ethernet network.   The problem I'm having is that the
routers work just fine for a minute or two, and then shortly thereafter
they're marked as 'down' and all traffic stops.  If I unload/reload the
lustre modules on the router, it'll work again for a short time and then
stop again.  The router shows errors like:
[236528.801275] LNetError: 54389:0:(lib-move.c:2120:lnet_parse_get())
10.10.104.2@tcp2: Unable to send REPLY for GET from 12345-10.10.104.201@tcp2:
-113
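
For reference, the trailing -113 is a negated Linux errno value; 113 is
EHOSTUNREACH ("No route to host").  A small lookup sketch, using a
hand-written table of Linux errno numbers so that it gives the same answer
on any platform (the helper name and the choice of entries are just for
illustration):

```python
# Linux errno values for common network errors.  These numbers are
# specific to Linux; other operating systems number them differently.
LINUX_NET_ERRNO = {
    104: ("ECONNRESET", "Connection reset by peer"),
    107: ("ENOTCONN", "Transport endpoint is not connected"),
    110: ("ETIMEDOUT", "Connection timed out"),
    111: ("ECONNREFUSED", "Connection refused"),
    112: ("EHOSTDOWN", "Host is down"),
    113: ("EHOSTUNREACH", "No route to host"),
}


def describe_status(status):
    """Translate a negative errno, as printed in LNet logs, into text."""
    name, text = LINUX_NET_ERRNO.get(abs(status), ("UNKNOWN", "not in table"))
    return f"{status}: {name} ({text})"


print(describe_status(-113))  # -113: EHOSTUNREACH (No route to host)
```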

My Lustre router has a Mellanox ConnectX-3 interface connecting to the
Lustre servers, and a Mellanox ConnectX-5 100G interface connecting to a
100G switch to which my test client is connected.  On the Infiniband side,
I've got lnet configured as o2ib1, and on the Ethernet side, as tcp2.

Clients and servers are all running Lustre 2.8.  The Lustre router at the
moment is running Lustre 2.10.1, because of software dependencies to
support the 100G card.

I've verified that I have stable network connectivity on both the IB and
Ethernet sides.

At the moment, I have very simple lnet configurations, using the built in
defaults.  lnet.conf on the server:
options lnet ip2nets="o2ib1(ib0) 192.168.[64-95].*; tcp1
10.103.[128-159].*" routes="tcp0 192.168.64.[78-79]@o2ib1; tcp2
192.168.64.[78-79]@o2ib1"

On the lustre router:
options lnet networks="o2ib1(ib0),tcp2(p1p1.104)" "forwarding=enabled"

And on the client:
options lnet networks="tcp2(p4p1.104)" routes="o2ib1 10.10.104.[2-3]@tcp2"
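
To spell out what the routes= strings above resolve to, here is a minimal
sketch that expands the bracketed ranges into individual gateway NIDs.  It
only handles the simple "network gateway" form used in this thread and
ignores any optional hop or priority fields; the function names are
hypothetical and this is not how LNet itself parses the option.

```python
import re

_RANGE = re.compile(r"\[(\d+)-(\d+)\]")


def expand_nid_range(nid):
    """Expand bracketed numeric ranges, e.g. '10.10.104.[2-3]@tcp2'."""
    match = _RANGE.search(nid)
    if match is None:
        return [nid]
    low, high = int(match.group(1)), int(match.group(2))
    expanded = []
    for value in range(low, high + 1):
        candidate = nid[:match.start()] + str(value) + nid[match.end():]
        # Recurse in case the NID contains more than one range.
        expanded.extend(expand_nid_range(candidate))
    return expanded


def parse_routes(routes):
    """Map each remote network to the list of gateway NIDs that reach it."""
    table = {}
    for entry in routes.split(";"):
        fields = entry.split()
        if not fields:
            continue
        network = fields[0]
        # Only fields that look like NIDs (containing '@') are gateways.
        for gateway in (f for f in fields[1:] if "@" in f):
            table.setdefault(network, []).extend(expand_nid_range(gateway))
    return table
```

With the client setting above, parse_routes("o2ib1 10.10.104.[2-3]@tcp2")
gives {"o2ib1": ["10.10.104.2@tcp2", "10.10.104.3@tcp2"]}, so the client
expects two routers, and 10.10.104.2@tcp2 is the NID that appears in the
router's error message.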

All of the hosts (client, server, router) have the following in
ko2iblnd.conf:

alias ko2iblnd-opa ko2iblnd
options ko2iblnd-opa peer_credits=128 peer_credits_hiw=64 credits=1024
concurrent_sends=256 ntx=2048 map_on_demand=32 fmr_pool_size=2048
fmr_flush_trigger=512 fmr_cache=1 conns_per_peer=4

install ko2iblnd /usr/sbin/ko2iblnd-probe


Does anyone see anything I've missed, or have any thoughts on where I
should look next?

Thanks,
Kevin

--
Kevin Hildebrand
University of Maryland, College Park