Re: [lustre-discuss] very slow mounts with OSS node down and peer discovery enabled

2023-10-26 Thread Andreas Dilger via lustre-discuss
I can't comment on the LNet peer discovery part, but I would definitely not 
recommend leaving lnet_transaction_timeout that low for normal usage. It can 
cause messages to be dropped while the server is still processing them, which 
introduces failures needlessly. 
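
If the shorter timeout is only wanted to get past a slow mount, one option is 
to lower it just for the mount and restore the previous value afterwards. The 
following is a minimal sketch, not a tested procedure: it assumes that 
"lnetctl global show" prints a "transaction_timeout: N" line, and the helper 
names (current_transaction_timeout, with_short_timeout) as well as the MGS 
NID, fsname and mount point in the usage comment are made up for illustration.

```shell
#!/bin/sh
# Sketch: lower the LNet transaction timeout only while a command (e.g. a
# mount) runs, then put the previous value back.

# Print the current transaction_timeout.
# Assumption: `lnetctl global show` emits a line "transaction_timeout: N".
current_transaction_timeout() {
    lnetctl global show | awk '$1 == "transaction_timeout:" { print $2 }'
}

# Usage: with_short_timeout SECONDS COMMAND [ARGS...]
with_short_timeout() {
    short=$1
    shift
    saved=$(current_transaction_timeout)
    if [ -z "$saved" ]; then
        echo "could not read current transaction_timeout" >&2
        return 1
    fi
    lnetctl set transaction_timeout "$short" || return 1
    "$@"
    rc=$?
    # Restore the original value whether or not the command succeeded.
    lnetctl set transaction_timeout "$saved"
    return $rc
}

# Example (hypothetical MGS NID, fsname and mount point):
#   with_short_timeout 5 mount -t lustre mgs@o2ib:/fsname /mnt/lustre
```

The command's exit status is passed through, so a failed mount is still 
reported as a failure after the timeout has been restored.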

Cheers, Andreas

> On Oct 26, 2023, at 09:48, Bertschinger, Thomas Andrew Hjorth via 
> lustre-discuss  wrote:
> 
> Hello,
> 
> Recently we had an OSS node down for an extended period with hardware 
> problems. While the node was down, mounting lustre on a client took an 
> extremely long time to complete (20-30 minutes). Once the fs is mounted, all 
> operations are normal and there isn't any noticeable impact from the absent 
> node.
> 
> While the client is mounting, the client's debug log shows entries like this 
> slowly going by:
> 
> 0020:0080:87.0:1698333195.993098:0:3801046:0:(obd_config.c:1384:class_process_config())
>  processing cmd: cf005
> 0020:0080:87.0:1698333195.993099:0:3801046:0:(obd_config.c:1396:class_process_config())
>  adding mapping from uuid 10.1.2.3@o2ib to nid 0x50abcd123 (10.1.2.4@o2ib)
> 
> and there is a "llog_process_th" kernel thread hanging in 
> lnet_discover_peer_locked().
> 
> We have peer discovery enabled on our clients, but disabling peer discovery 
> on a client causes the mount to complete quickly. Also, once the down OSS was 
> fixed and powered back on, mounting completed normally again.
> 
> We also found that reducing the following timeout sped up the mount by a 
> factor of ~10:
> 
> $ lnetctl set transaction_timeout 5  # was 50 originally
> 
> Is such a dramatic slowdown normal in this situation? Is there any fix (aside 
> from disabling peer discovery or tuning down the timeout) that could speed up 
> mounts in case we have another OSS down in the future?
> 
> Lustre version (server and client): 2.15.3
> 
> Thanks, 
> Thomas Bertschinger
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] very slow mounts with OSS node down and peer discovery enabled

2023-10-26 Thread Bertschinger, Thomas Andrew Hjorth via lustre-discuss
Hello,

Recently we had an OSS node down for an extended period with hardware problems. 
While the node was down, mounting lustre on a client took an extremely long 
time to complete (20-30 minutes). Once the fs is mounted, all operations are 
normal and there isn't any noticeable impact from the absent node.

While the client is mounting, the client's debug log shows entries like this 
slowly going by:

0020:0080:87.0:1698333195.993098:0:3801046:0:(obd_config.c:1384:class_process_config())
 processing cmd: cf005
0020:0080:87.0:1698333195.993099:0:3801046:0:(obd_config.c:1396:class_process_config())
 adding mapping from uuid 10.1.2.3@o2ib to nid 0x50abcd123 (10.1.2.4@o2ib)

and there is a "llog_process_th" kernel thread hanging in 
lnet_discover_peer_locked().

We have peer discovery enabled on our clients, but disabling peer discovery on 
a client causes the mount to complete quickly. Also, once the down OSS was 
fixed and powered back on, mounting completed normally again.
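
For reference, the discovery setting on a client can be checked and changed 
at runtime with lnetctl. The sketch below is illustrative only: it assumes 
that "lnetctl global show" prints a "discovery: 0|1" line, and the function 
names are invented for this example.

```shell
#!/bin/sh
# Sketch: inspect and, if needed, turn off LNet peer discovery at runtime.

# Print 1 if peer discovery is enabled, 0 if disabled.
# Assumption: `lnetctl global show` emits a line "discovery: 0|1".
discovery_state() {
    lnetctl global show | awk '$1 == "discovery:" { print $2 }'
}

# Disable discovery only if it is currently on, and report what was done.
disable_discovery_if_enabled() {
    if [ "$(discovery_state)" = "1" ]; then
        lnetctl set discovery 0 && echo "peer discovery disabled"
    else
        echo "peer discovery already disabled or state unknown"
    fi
}
```

A change made this way applies to the running LNet instance only; making it 
persistent would need the site's usual LNet configuration mechanism.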

We also found that reducing the following timeout sped up the mount by a factor 
of ~10:

$ lnetctl set transaction_timeout 5  # was 50 originally

Is such a dramatic slowdown normal in this situation? Is there any fix (aside 
from disabling peer discovery or tuning down the timeout) that could speed up 
mounts in case we have another OSS down in the future?

Lustre version (server and client): 2.15.3

Thanks, 
Thomas Bertschinger