Re: [lustre-discuss] Lustre 2.11 lnet troubleshooting
So the problem was indeed that "routing" was disabled on the router node. I added "routing: 1" to the lnet.conf file for the routers, and lctl ping works as expected.

The question about the lnet module option "forwarding" still stands. The lnet module still accepts a parameter, "forwarding", but it doesn't do what it used to. Is that just a leftover that needs to be cleaned up?

thanks,
Olaf P. Faaland
Livermore Computing
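For comparison, the old and new ways of enabling forwarding on a router node reduce to roughly the following. This is a sketch: the module-parameter value string and the modprobe.d path are from memory and may differ by version; the "routing: 1" line is what is reported to work above.

# Old style: lnet module parameter (reported above as no longer
# effective in 2.11), e.g. in /etc/modprobe.d/lnet.conf:
options lnet forwarding="enabled"

# New style, at runtime:
lnetctl set routing 1

# New style, persistent -- added to the router's /etc/lnet.conf:
routing: 1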
Re: [lustre-discuss] Lustre 2.11 lnet troubleshooting
To the original question: lnetctl on the router node shows 'enable: 1':

# lnetctl routing show
routing:
    - cpt[0]:
      ...snip...
    - enable: 1

Lustre 2.10.3-1.el6

Alex.
Re: [lustre-discuss] Lustre 2.11 lnet troubleshooting
Update: Joe pointed out "lnetctl set routing 1". After invoking that on the router node, the compute node reports the route as up:

[root@ulna66:lustre-211]# lnetctl route show -v
route:
    - net: o2ib100
      gateway: 192.168.128.4@o2ib33
      hop: -1
      priority: 0
      state: up

Does this replace the lnet module parameter "forwarding"?

Olaf P. Faaland
Livermore Computing
[lustre-discuss] Lustre 2.11 lnet troubleshooting
Hi,

I've got a cluster running 2.11 with 2 routers and 68 compute nodes. It's the first time I've used a post-multi-rail version of Lustre.

The problem I'm trying to troubleshoot is that my sample compute node (ulna66) seems to think the router I configured (ulna4) is down, so an attempt to ping outside the cluster results in failure and "no route to XXX" on the console.

I can lctl ping the router from the compute node and vice versa. Forwarding is enabled on the router node via a modprobe argument. lnetctl route show reports that the route is down. Where I'm stuck is figuring out what in userspace (e.g. lnetctl or lctl) can tell me why.

The compute node's lnet configuration is:

[root@ulna66:lustre-211]# cat /etc/lnet.conf
ip2nets:
  - net-spec: o2ib33
    interfaces:
         0: hsi0
    ip-range:
         0: 192.168.128.*
route:
  - net: o2ib100
    gateway: 192.168.128.4@o2ib33

After I start lnet, systemctl reports success and the state is as follows:

[root@ulna66:lustre-211]# lnetctl net show
net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
    - net type: o2ib33
      local NI(s):
        - nid: 192.168.128.66@o2ib33
          status: up
          interfaces:
              0: hsi0

[root@ulna66:lustre-211]# lnetctl peer show --verbose
peer:
    - primary nid: 192.168.128.4@o2ib33
      Multi-Rail: False
      peer ni:
        - nid: 192.168.128.4@o2ib33
          state: up
          max_ni_tx_credits: 8
          available_tx_credits: 8
          min_tx_credits: 7
          tx_q_num_of_buf: 0
          available_rtr_credits: 8
          min_rtr_credits: 8
          refcount: 4
          statistics:
              send_count: 2
              recv_count: 2
              drop_count: 0

[root@ulna66:lustre-211]# lnetctl route show --verbose
route:
    - net: o2ib100
      gateway: 192.168.128.4@o2ib33
      hop: -1
      priority: 0
      state: down

I can instrument the code, but I figure there must be someplace available to normal users to look that I'm unaware of.

thanks,
Olaf P. Faaland
Livermore Computing
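For anyone hitting the same symptom, the checks that come up in this thread reduce to something like the following; the gateway NID is the one from the config above, and "lnetctl routing show" is the router-side check mentioned in the replies.

# On the compute node: confirm reachability of the gateway, then see
# how LNet itself classifies the route and the gateway peer.
lctl ping 192.168.128.4@o2ib33
lnetctl route show --verbose
lnetctl peer show --verbose

# On the router node: check whether forwarding/routing is actually
# enabled -- this is what turned out to be off in this case.
lnetctl routing show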
Re: [lustre-discuss] lustre 2.10.3 lnetctl configurations not persisting through reboot
OK, I was following http://wiki.lustre.org/LNet_Router_Config_Guide.

So what about all the peers the export command generates? Wouldn't that accumulate bad data over time if nodes are retired or IPs change for some reason?
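One way around that (a hedged suggestion, not something verified in this thread) is to not save the full export at all, but to hand-maintain a minimal /etc/lnet.conf with only the static pieces -- nets and routing -- for example:

# Minimal hand-written /etc/lnet.conf for the router; net and interface
# names are taken from the lnetctl commands in the original post, and
# "routing: 1" is the setting reported to work in the 2.11 thread above.
# Whether this pared-down form imports cleanly on 2.10.3 should be
# verified with "lnetctl import < /etc/lnet.conf".
ip2nets:
  - net-spec: o2ib0
    interfaces:
         0: ib1
  - net-spec: o2ib1
    interfaces:
         0: ib0
routing: 1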
Re: [lustre-discuss] lustre 2.10.3 lnetctl configurations not persisting through reboot
The file /etc/lnet.conf is described on the Lustre wiki: http://wiki.lustre.org/Dynamic_LNet_Configuration_and_lnetctl

Alex.
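Following that page, the usual pattern (sketched here; whether the packaged lnet systemd unit imports /etc/lnet.conf automatically depends on the version, so check the unit file) is to capture the running configuration into /etc/lnet.conf and re-import it after a reboot:

# Configure at runtime, then save the running state to the file.
lnetctl lnet configure
lnetctl net add --net o2ib0 --if ib1
lnetctl net add --net o2ib1 --if ib0
lnetctl set routing 1
lnetctl export > /etc/lnet.conf

# Re-apply by hand after a reboot (or have the lnet service do the
# equivalent at startup):
lnetctl import < /etc/lnet.conf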
[lustre-discuss] lustre 2.10.3 lnetctl configurations not persisting through reboot
I configured an lnet router today with Lustre 2.10.3 as the Lustre software. I then configured the lnet router using the following lnetctl commands:

lnetctl lnet configure
lnetctl net add --net o2ib0 --if ib1
lnetctl net add --net o2ib1 --if ib0
lnetctl set routing 1

When I rebooted the router the configuration didn't stick. Is there a way to make this persist through a reboot?

I also noticed that when I do an export of the lnetctl configuration it contains:

- net type: o2ib1
  local NI(s):
    - nid: @o2ib1
      status: up
      interfaces:
          0: ib0
      statistics:
          send_count: 2958318
          recv_count: 2948077
          drop_count: 0
      tunables:
          peer_timeout: 180
          peer_credits: 8
          peer_buffer_credits: 0
          credits: 256
      lnd tunables:
          peercredits_hiw: 4
          map_on_demand: 256
          concurrent_sends: 8
          fmr_pool_size: 512
          fmr_flush_trigger: 384
          fmr_cache: 1
          ntx: 512
          conns_per_peer: 1
      tcp bonding: 0
      dev cpt: 0
      CPT: "[0,1]"

Is this expected behavior?

w/r,
Kurt J. Strosahl
System Administrator: Lustre, HPC
Scientific Computing Group, Thomas Jefferson National Accelerator Facility
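A quick post-reboot check of whether anything persisted, using only commands that already appear in these threads:

# If nothing persisted, these show only the loopback NI (or an error if
# the lnet module is not loaded) and routing disabled.
lnetctl net show
lnetctl routing show
lnetctl route show --verbose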