Re: [lustre-discuss] Lustre 2.11 lnet troubleshooting

2018-04-17 Thread Faaland, Olaf P.
So the problem was indeed that "routing" was disabled on the router node.  I
added "routing: 1" to the lnet.conf file for the routers, and now lctl ping works
as expected.
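
For anyone who hits the same thing, this is roughly what the router-side
/etc/lnet.conf ends up looking like.  Treat it as a sketch: the o2ib100-side
interface name (ib0 below) is only a placeholder, and the nesting follows the
client config shown later in this digest.

# /etc/lnet.conf on the router nodes (sketch)
ip2nets:
  - net-spec: o2ib33
    interfaces:
        0: hsi0
  - net-spec: o2ib100
    interfaces:
        0: ib0          # placeholder; whatever interface reaches o2ib100
routing: 1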

The question about the lnet module option "forwarding" still stands.  The lnet
module still accepts the "forwarding" parameter, but it no longer does what it
used to.  Is that just a leftover that needs to be cleaned up?
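
(A quick way to confirm the parameter is still registered, and to read back its
current value, is standard module-parameter inspection; the sysfs file below
only exists if the parameter is exported with readable permissions:)

modinfo -p lnet | grep -i forwarding
cat /sys/module/lnet/parameters/forwarding 2>/dev/null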

thanks,

Olaf P. Faaland
Livermore Computing


Re: [lustre-discuss] Lustre 2.11 lnet troubleshooting

2018-04-17 Thread Alexander I Kulyavtsev
To the original question: lnetctl on the router node shows 'enable: 1':

# lnetctl routing show
routing:
- cpt[0]:
 …snip…
- enable: 1

Lustre 2.10.3-1.el6

Alex.


Re: [lustre-discuss] Lustre 2.11 lnet troubleshooting

2018-04-17 Thread Faaland, Olaf P.
Update:

Joe pointed out "lnetctl set routing 1".  After invoking that on the router 
node, the compute node reports the route as up:

[root@ulna66:lustre-211]# lnetctl route show -v
route:
- net: o2ib100
  gateway: 192.168.128.4@o2ib33
  hop: -1
  priority: 0
  state: up

Does this replace the lnet module parameter "forwarding"?

Olaf P. Faaland
Livermore Computing



[lustre-discuss] Lustre 2.11 lnet troubleshooting

2018-04-17 Thread Faaland, Olaf P.
Hi,

I've got a cluster running 2.11 with 2 routers and 68 compute nodes.  It's the
first time I've used a post-multi-rail version of Lustre.

The problem I'm trying to troubleshoot is that my sample compute node (ulna66)
seems to think the router I configured (ulna4) is down, so an attempt to ping
outside the cluster fails with "no route to XXX" on the console.  I can lctl
ping the router from the compute node and vice versa.  Forwarding is enabled on
the router node via a modprobe argument.
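
(Concretely, the modprobe argument in question is a line like the one below;
the file name is illustrative, and "Enabled"/"Disabled" are the string values
this option has historically taken, so treat this as a sketch:)

# e.g. /etc/modprobe.d/lustre.conf on the router
options lnet forwarding="Enabled"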

lnetctl route show reports that the route is down.  Where I'm stuck is figuring 
out what in userspace (e.g. lnetctl or lctl) can tell me why.
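
(For reference, the userspace commands in play here; the last one is run on the
router and, per the replies above in this digest, turns out to be the relevant
check:)

lctl ping 192.168.128.4@o2ib33    # works in both directions
lnetctl route show --verbose      # on the compute node: shows state: down
lnetctl routing show              # on the router: is forwarding/routing enabled?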

The compute node's lnet configuration is:

[root@ulna66:lustre-211]# cat /etc/lnet.conf
ip2nets:
  - net-spec: o2ib33
    interfaces:
        0: hsi0
    ip-range:
        0: 192.168.128.*
route:
  - net: o2ib100
    gateway: 192.168.128.4@o2ib33
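
The same route can also be added at runtime with lnetctl, which is handy when
iterating, e.g.:

lnetctl route add --net o2ib100 --gateway 192.168.128.4@o2ib33
lnetctl route show --verbose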

After I start lnet, systemctl reports success and the state is as follows:

[root@ulna66:lustre-211]# lnetctl net show
net:
- net type: lo
  local NI(s):
- nid: 0@lo
  status: up
- net type: o2ib33
  local NI(s):
- nid: 192.168.128.66@o2ib33
  status: up
  interfaces:
  0: hsi0

[root@ulna66:lustre-211]# lnetctl peer show --verbose
peer:
- primary nid: 192.168.128.4@o2ib33
  Multi-Rail: False
  peer ni:
- nid: 192.168.128.4@o2ib33
  state: up
  max_ni_tx_credits: 8
  available_tx_credits: 8
  min_tx_credits: 7
  tx_q_num_of_buf: 0
  available_rtr_credits: 8
  min_rtr_credits: 8
  refcount: 4
  statistics:
  send_count: 2
  recv_count: 2
  drop_count: 0

[root@ulna66:lustre-211]# lnetctl route show --verbose
route:
- net: o2ib100
  gateway: 192.168.128.4@o2ib33
  hop: -1
  priority: 0
  state: down

I can instrument the code, but I figure there must be somewhere available to
normal users to look that I'm unaware of.

thanks,

Olaf P. Faaland
Livermore Computing


Re: [lustre-discuss] lustre 2.10.3 lnetctl configurations not persisting through reboot

2018-04-17 Thread Kurt Strosahl
OK, I was following http://wiki.lustre.org/LNet_Router_Config_Guide.

So what about all the peers that the export command generates?  Wouldn't that
accumulate stale data over time if nodes are retired or IPs change for some
reason?
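
One way to avoid carrying that state forward would be to trim the export before
writing it to /etc/lnet.conf.  The sketch below assumes PyYAML is available and
assumes that lnetctl import only needs the net/route/routing sections, which I
have not verified against 2.10.3:

#!/usr/bin/env python
# Sketch: keep only the configuration sections of `lnetctl export` and drop
# runtime state (peers, per-NI statistics) before persisting the result.
import subprocess
import sys

import yaml  # PyYAML


def strip_statistics(node):
    """Recursively drop 'statistics' blocks from the exported YAML."""
    if isinstance(node, dict):
        return {k: strip_statistics(v) for k, v in node.items()
                if k != "statistics"}
    if isinstance(node, list):
        return [strip_statistics(v) for v in node]
    return node


def trimmed_config():
    raw = subprocess.check_output(["lnetctl", "export"])
    full = yaml.safe_load(raw)
    # Which top-level sections lnetctl import actually needs is an
    # assumption; peers are intentionally left out.
    keep = {k: full[k] for k in ("net", "route", "routing") if k in full}
    return strip_statistics(keep)


if __name__ == "__main__":
    yaml.safe_dump(trimmed_config(), sys.stdout, default_flow_style=False)

Run it once after configuring the router by hand and redirect the output to
/etc/lnet.conf.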



Re: [lustre-discuss] lustre 2.10.3 lnetctl configurations not persisting through reboot

2018-04-17 Thread Alexander I Kulyavtsev
The file /etc/lnet.conf is described on the lustre wiki:
   http://wiki.lustre.org/Dynamic_LNet_Configuration_and_lnetctl

Alex.


[lustre-discuss] lustre 2.10.3 lnetctl configurations not persisting through reboot

2018-04-17 Thread Kurt Strosahl
I configured an lnet router today with lustre 2.10.3 as the lustre software.  I
then configured the lnet router using the following lnetctl commands:


lnetctl lnet configure
lnetctl net add --net o2ib0 --if ib1
lnetctl net add --net o2ib1 --if ib0
lnetctl set routing 1

When I rebooted the router the configuration didn't stick.  Is there a way to 
make this persist through a reboot?
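
(For reference, a sketch of one way to persist this; the unit name and whether
the lnet service imports /etc/lnet.conf at start depend on the Lustre
packaging, so verify locally:)

# after configuring once by hand, capture the running state...
lnetctl export > /etc/lnet.conf
# ...and let the lnet systemd service re-apply it at boot
systemctl enable lnet
systemctl start lnet

Alternatively, /etc/lnet.conf can be written by hand; the wiki page referenced
in the replies above describes the format.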

I also noticed that when I do an export of the lnetctl configuration it contains

- net type: o2ib1
  local NI(s):
- nid: @o2ib1
  status: up
  interfaces:
  0: ib0
  statistics:
  send_count: 2958318
  recv_count: 2948077
  drop_count: 0
  tunables:
  peer_timeout: 180
  peer_credits: 8
  peer_buffer_credits: 0
  credits: 256
  lnd tunables:
  peercredits_hiw: 4
  map_on_demand: 256
  concurrent_sends: 8
  fmr_pool_size: 512
  fmr_flush_trigger: 384
  fmr_cache: 1
  ntx: 512
  conns_per_peer: 1
  tcp bonding: 0
  dev cpt: 0
  CPT: "[0,1]"

Is this expected behavior?

w/r,
Kurt J. Strosahl
System Administrator: Lustre, HPC
Scientific Computing Group, Thomas Jefferson National Accelerator Facility