Re: [lustre-discuss] Odd client behavior with mixed Lustre versions

2019-01-24 Thread Kevin M. Hildebrand
So I'm still experimenting with my 2.10.6 clients mounting from a 2.8
server.  I've found some more information that might narrow down the issue.

To recap:
When a client is rebooted, or after the IB modules are reloaded, the first
Lustre operation against each server takes a very long time to connect.
lctl ping hangs and times out for 30-60 seconds.  Once it makes a
successful connection, subsequent connections to the same server are fine.
So mounting the Lustre filesystem takes a long time, as it has to time out
against each MDS and each OSS before finally succeeding.

What's new:
If I first do an IPoIB ping of the server I'm trying to reach, the lctl
ping succeeds immediately.  So if I ping all of the MDSes and OSSes
beforehand, the filesystem mounts immediately.
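
A minimal sketch of that workaround, in Python for illustration: ping each
server's IPoIB address once, then run lctl ping against its NID, before
attempting the mount.  The names SERVER_NIDS, nid_to_ip and warm_up are made
up for this example, and the NID list is a placeholder.  It assumes the
address part of each NID is the server's IPoIB address (as with
192.168.64.70@o2ib1) and that lctl ping exits non-zero on failure.  It only
automates the workaround; it does not explain the underlying delay.

```python
import subprocess

# Placeholder list: only 192.168.64.70@o2ib1 appears in this thread.
# Add one NID per MDS and OSS.
SERVER_NIDS = ["192.168.64.70@o2ib1"]


def nid_to_ip(nid):
    """Return the address part of an o2ib NID such as '192.168.64.70@o2ib1'."""
    ip, sep, _net = nid.partition("@")
    if not sep or not ip:
        raise ValueError("not a NID: %r" % nid)
    return ip


def warm_up(nids, run=subprocess.call):
    """IP-ping each server once, then lctl ping it; return the NIDs that failed."""
    failed = []
    for nid in nids:
        # One ICMP echo over IPoIB, waiting at most 2 seconds for the reply.
        run(["ping", "-c", "1", "-W", "2", nid_to_ip(nid)])
        # Assumption: lctl ping returns a non-zero exit status on failure.
        if run(["lctl", "ping", nid]) != 0:
            failed.append(nid)
    return failed
```

Calling warm_up(SERVER_NIDS) as root on a client before the mount returns
the NIDs that still fail lctl ping; an empty list means every server
answered.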

Does this sound familiar to anyone?

Thanks,
Kevin



On Thu, Jan 10, 2019 at 4:23 PM Kevin M. Hildebrand wrote:

> I've got a RHEL6 Lustre installation where the servers are running 2.8.0,
> that I'd prefer not to upgrade.
> We've been running 2.8.0 on RHEL6 clients as well and everything's been
> working fine.  However, I just updated the Linux release on the RHEL6
> clients to 6.10, and Lustre 2.8.0 will no longer compile on the latest
> kernel.  I've built and installed 2.10.6 on these clients, and the kernel
> modules load fine, but on first contact with any Lustre server, I get a
> bunch of timeouts before I can get a valid connection.  The Lustre network
> in this case is InfiniBand, using Mellanox OFED on the clients.
> 'lctl ping' hangs for a few seconds and returns 'failed to ping
> 192.168.64.70@o2ib1: Input/output error'.  An IPoIB ping of the server IP
> address works fine.
> At the same time I get a message in syslog that says 'LNet:
> 8778:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Timed out tx for
> 192.168.64.70@o2ib1: 4296292 seconds'
> Nothing shows up in the logs on the server side.
>
> If I repeat the 'lctl ping' a few times, after 30-60 seconds or so, 'lctl
> ping' succeeds.
> This happens for each of my Lustre servers, and once I get a successful
> ping back, it seems to be fully functional up until the next reboot, or
> until the InfiniBand modules are reloaded.
>
> If I try to mount the filesystem without doing the pings, I'll get
> timeouts contacting the MDS for the same 30-60 seconds, and then once the
> MDSes are reachable, I get timeouts to the OSSes for a while, until they
> become reachable, and once they're all talking, all seems to be fine.
>
> Any ideas on what could be wrong?
>
> Thanks,
> Kevin
>
> --
> Kevin Hildebrand
> University of Maryland
>
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Odd client behavior with mixed Lustre versions

2019-01-15 Thread Kevin M. Hildebrand
Yeah, I thought about that.  Both the client and servers are using the
defaults for ko2iblnd:

  tunables:
      peer_timeout: 180
      peer_credits: 8
      peer_buffer_credits: 0
      credits: 256
  lnd tunables:
      peercredits_hiw: 4
      map_on_demand: 0
      concurrent_sends: 8
      fmr_pool_size: 512
      fmr_flush_trigger: 384
      fmr_cache: 1
      ntx: 512
      conns_per_peer: 1
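
One way to double-check that the two sides really match is to save this
dump on a client and on a server and diff the values.  The sketch below is
Python for illustration only: parse_tunables and diff_tunables are
hypothetical helper names, the input is assumed to be "name: value" text in
the format shown here, and the differing server value in the example is
invented purely to show what a mismatch looks like.

```python
def parse_tunables(text):
    """Parse 'name: value' lines into a dict; headers like 'tunables:' are skipped."""
    values = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        value = value.strip()
        if sep and value:
            values[key.strip()] = value
    return values


def diff_tunables(client, server):
    """Return {name: (client_value, server_value)} for every setting that differs."""
    names = sorted(set(client) | set(server))
    return {n: (client.get(n), server.get(n))
            for n in names if client.get(n) != server.get(n)}


# Values as posted in this message (abbreviated).
CLIENT_DUMP = """
tunables:
    peer_timeout: 180
    peer_credits: 8
lnd tunables:
    peercredits_hiw: 4
    map_on_demand: 0
    concurrent_sends: 8
"""

client = parse_tunables(CLIENT_DUMP)
# Hypothetical server with one changed value, only to demonstrate the diff.
server = dict(client, map_on_demand="256")
print(diff_tunables(client, server))
```

An empty result from diff_tunables means the two dumps list the same
settings with the same values.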

Thanks,
Kevin

On Fri, Jan 11, 2019 at 5:17 PM Mohr Jr, Richard Frank (Rick Mohr) <
rm...@utk.edu> wrote:

> Is it possible you have some incompatible ko2iblnd module parameters
> between the 2.8 servers and the 2.10 clients?  If there was something
> causing LNet issues, that could possibly explain some of the symptoms you
> are seeing.
>
> --
> Rick Mohr
> Senior HPC System Administrator
> National Institute for Computational Sciences
> http://www.nics.tennessee.edu


Re: [lustre-discuss] Odd client behavior with mixed Lustre versions

2019-01-11 Thread Mohr Jr, Richard Frank (Rick Mohr)
Is it possible you have some incompatible ko2iblnd module parameters between 
the 2.8 servers and the 2.10 clients?  If there was something causing LNet 
issues, that could possibly explain some of the symptoms you are seeing.

--
Rick Mohr
Senior HPC System Administrator
National Institute for Computational Sciences
http://www.nics.tennessee.edu



