Re: [lustre-discuss] lru_size question

2021-09-14 Thread Andreas Dilger via lustre-discuss
On Sep 9, 2021, at 02:49, Thomas Roth <t.r...@gsi.de> wrote:

Hi all,

I have checked the lru_size on a (2.12.5) system that has just been restarted.
The defaults have never been touched on that system, and so I see lru_size=0 
for all OSTs, on the MDS as on a client, as it should be.
The client also reports
> ldlm.namespaces.MGC10.20.3.0@o2ib5.lru_size=12800
while the three MDTs of the system also are shown with lru_size=0

The MDS reports
> ldlm.namespaces.MGC10.20.3.0@o2ib5.lru_size=1600
> ldlm.namespaces.MGS.lru_size=1600
> ldlm.namespaces.hebe-MDT-lwp-MDT.lru_size=494
> ldlm.namespaces.hebe-MDT0001-osp-MDT.lru_size=1600
> ldlm.namespaces.hebe-MDT0002-osp-MDT.lru_size=1600
> ldlm.namespaces.mdt-hebe-MDT_UUID.lru_size=1600

(hebe being the name of the fs)

These values show up immediately after starting the MDS.

For a static-sized LRU the default is 100x the core count on each node.  The 
MGS is always static, since it doesn't need many locks and a static LRU takes 
less effort to manage.
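
As a rough illustration of the 100x rule (a sketch only, not taken from the Lustre sources; the helper name expected_static_lru_size is made up for this example): by that rule the 1600 reported on the MDS would correspond to 16 cores, and the 12800 on the client to 128 cores.

```shell
#!/bin/sh
# Hypothetical helper: expected static lru_size for a node with the given
# core count, using the 100-locks-per-core default described above.
expected_static_lru_size() {
    echo $(( 100 * $1 ))
}

expected_static_lru_size 16     # prints 1600
expected_static_lru_size 128    # prints 12800

# On a live node, compare the local core count against the actual values:
#   expected_static_lru_size "$(nproc)"
#   lctl get_param ldlm.namespaces.*.lru_size
```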

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Disabling multi-rail dynamic discovery

2021-09-14 Thread Andreas Dilger via lustre-discuss

On Sep 14, 2021, at 11:17, Vicker, Darby J. (JSC-EG111)[Jacobs Technology, 
Inc.] via lustre-discuss  wrote:
> 
> Ah yes, I see what the lnet unit file is doing.  OK, I think this is all 
> straightened out and working great now.  We have a fairly extensive init 
> script (the lustre3 script in previous posts) that does various checks in 
> addition to loading modules and mounting/unmounting the filesystems.  But at 
> its core, the start is now doing this:
>  
>    /usr/bin/systemctl start lnet  >& /dev/null
>    modprobe lustre
> 
 
Strictly speaking, the mount command itself should automatically trigger 
"lustre" module loading, so the "modprobe lustre" is redundant.

> The stop portion does:
>   
> 
> /usr/bin/systemctl stop lnet  >& /dev/null
> /usr/sbin/lustre_rmmod
 
In 2.15 the lustre_rmmod script will automatically run "lnetctl lnet 
unconfigure", and conversely lnet.service will run "lustre_rmmod" in the right 
places (assuming the filesystem was previously unmounted), so only one or the 
other will be needed.  Running both isn't harmful, just a bit redundant.
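
A minimal sketch of how the start/stop core could look with those points applied. It assumes the filesystems are listed in /etc/fstab with type lustre; the function names and the DRY_RUN switch are invented for this example, and the 2.15 behaviour is taken from the description above rather than verified here.

```shell
#!/bin/sh
# DRY_RUN=1 (the default in this sketch) only prints the commands.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

lustre_start() {
    run systemctl start lnet      # the lnet unit loads /etc/lnet.conf
    run mount -a -t lustre        # mount itself should load the lustre module
}

lustre_stop() {
    run umount -a -t lustre       # unmount first so the modules can be unloaded
    # Before 2.15, keep "systemctl stop lnet" ahead of this call.
    run lustre_rmmod
}
```

With DRY_RUN left at 1, calling lustre_start just prints the two commands, which makes the ordering easy to review before running it for real.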

Cheers, Andreas



[lustre-discuss] Correct ZoL version matching Lustre 2.12.7 ?

2021-09-14 Thread Riccardo Veraldi

Hello,

I am about to deploy a new Lustre 2.12.7 system.

Which ZoL version should I choose for my Lustre/ZFS system?

0.7.13, 0.8.6, 2.0.5, 2.1.0 ?

Thanks

Rick



Re: [lustre-discuss] Disabling multi-rail dynamic discovery

2021-09-14 Thread Vicker, Darby J. (JSC-EG111)[Jacobs Technology, Inc.] via lustre-discuss
Ah yes, I see what the lnet unit file is doing.  OK, I think this is all 
straightened out and working great now.  We have a fairly extensive init script 
(the lustre3 script in previous posts) that does various checks in addition to 
loading modules and mounting/unmounting the filesystems.  But at its core, the 
start is now doing this:

   /usr/bin/systemctl start lnet  >& /dev/null
   modprobe lustre


The stop portion does:


/usr/bin/systemctl stop lnet  >& /dev/null
/usr/sbin/lustre_rmmod


The final conf files I'm using are:

lnet.conf:

net:
    - net type: o2ib1
      local NI(s):
        - interfaces:
              0: ib0
global:
    discovery: 0



/etc/modprobe.d/lustre.conf:

options ko2iblnd map_on_demand=32



Using the lnet systemd unit file properly loads the configuration and shows 
discovery=0 (without any of the lnet stuff in the modprobe conf file).  We 
could properly enable the lnet unit file and add a dependency to make sure our 
init script runs after the lnet service, but it's a little easier to just run 
the systemctl commands in our init script.

I would be interested if others have a cleaner way to do all mounting, etc. in 
a more native systemd manner.  It probably just involves making a simple unit 
file to run a script.  Probably six of one, half dozen of the other but if 
anyone has experience with the pros and cons, please let me know.

Thanks a ton for the help on this.  Much appreciated.




Re: [lustre-discuss] Disabling multi-rail dynamic discovery

2021-09-14 Thread Horn, Chris via lustre-discuss
When you start LNet via ‘modprobe lnet; lctl net up’, that doesn’t load the 
configuration from /etc/lnet.conf. It is going to configure LNet based only on 
kernel module parameters. Since you removed the ‘options lnet networks’ from 
your modprobe.conf file, it is going to use the default configuration which is 
@tcp on whatever the first ethernet interface w/ipv4 configured that it finds.

To load /etc/lnet.conf you can use systemctl start lnet.service (or 
equivalent), or if you want to do it manually:

modprobe lnet
lnetctl lnet configure
lnetctl lnet import < /etc/lnet.conf
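
One way to confirm the result after the import is to read the discovery value back from "lnetctl global show". The helper below is a sketch; its name is made up, and the sample input only mimics the global section of the lnet.conf in this thread.

```shell
#!/bin/sh
# Hypothetical helper: print the value of the "discovery:" key from
# lnetctl-style YAML read on stdin.
discovery_value() {
    awk '$1 == "discovery:" { print $2; exit }'
}

# Sample input shaped like the global section of lnet.conf:
printf 'global:\n    discovery: 0\n' | discovery_value    # prints 0

# On a live node, after the import:
#   lnetctl global show | discovery_value
```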

Also, I would try this for your lnet.conf

net:
    - net type: o2ib
      local NI(s):
        - interfaces:
              0: ib0
global:
    discovery: 0

Chris Horn


Re: [lustre-discuss] [EXTERNAL] Re: Disabling multi-rail dynamic discovery

2021-09-14 Thread Vicker, Darby J. (JSC-EG111)[Jacobs Technology, Inc.] via lustre-discuss
So I'm a little confused.

When I take the "options lnet networks=o2ib1(ib0)"  line out of the modprobe 
conf file and instead put that info in the lnet.conf file, things don't work 
properly.

[root@r1i1n18 lnet]# cat /etc/modprobe.d/lustre.conf
options ko2iblnd map_on_demand=32
[root@r1i1n18 lnet]# cat /etc/lnet.conf
ip2nets:
 - net-spec: o2ib1
   interfaces:
      0: ib0
global:
    discovery: 0
[root@r1i1n18 lnet]# modprobe lnet
[root@r1i1n18 lnet]# lctl network up
LNET configured
[root@r1i1n18 lnet]# service lustre3 start
Mounting /ephemeral... mount.lustre: mount 10.150.100.30@o2ib1:10.150.100.31@o2ib1:/scratch/work at /ephemeral failed: No such file or directory
Is the MGS specification correct?
Is the filesystem name correct?
If upgrading, is the copied client log valid? (see upgrade docs)
FAILED.
Mounting /nobackup... mount.lustre: mount 10.150.100.30@o2ib1:10.150.100.31@o2ib1:/hpfs-fsl/work at /nobackup failed: No such file or directory
Is the MGS specification correct?
Is the filesystem name correct?
If upgrading, is the copied client log valid? (see upgrade docs)
FAILED.
[root@r1i1n18 lnet]#


The logs when this happens:

Sep 14 09:53:38 r1i1n18 kernel: LNet: Added LNI 10.159.0.39@tcp [8/256/0/180]
Sep 14 09:53:38 r1i1n18 kernel: Lnet: Accept secure, port 988
Sep 14 09:53:54 r1i1n18 kernel: Lustre: Lustre: Build Version: 2.12.6
Sep 14 09:53:55 r1i1n18 kernel: LustreError: 34174:0:(ldlm_lib.c:494:client_obd_setup()) can't add initial connection
Sep 14 09:53:55 r1i1n18 kernel: LustreError: 34174:0:(obd_config.c:559:class_setup()) setup MGC10.150.100.30@o2ib1 failed (-2)
Sep 14 09:53:55 r1i1n18 kernel: LustreError: 34174:0:(obd_mount.c:202:lustre_start_simple()) MGC10.150.100.30@o2ib1 setup error -2
Sep 14 09:53:55 r1i1n18 kernel: LustreError: 34174:0:(obd_mount.c:1608:lustre_fill_super()) Unable to mount  (-2)

Note the @tcp above – it looks like without the modprobe conf file, the lnet 
module isn't getting set up properly.  When this happens, I'm not able to shut 
down lnet or unload the kernel modules to try again.  The only way I've been 
able to recover from this is to reboot the node.  If I add the "options lnet" 
stuff back to the modprobe conf file, everything works as expected.  Do I not 
have enough info in lnet.conf or are both just required?
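
A quick check that would catch this situation before attempting the mount is to confirm that every local NID is on the expected network. This is only a sketch: the helper name is invented, and the sample NID is the one from the log above.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if every NID read on stdin is on the
# expected LNet network (e.g. o2ib1). Fails on empty input as well.
all_nids_on_net() {
    awk -v net="$1" -F@ 'NF { n++; if ($2 != net) bad++ }
                         END { exit (n == 0 || bad > 0) }'
}

# The NID from the log above is on tcp, not o2ib1, so this reports a problem:
if ! echo "10.159.0.39@tcp" | all_nids_on_net o2ib1; then
    echo "LNet is not on o2ib1"
fi

# On a live node:
#   lctl list_nids | all_nids_on_net o2ib1
```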

Chris, adding lnet_peer_discovery_disabled=1 to my lnet options does indeed 
seem to work.  Thanks!

Darby


From: "Horn, Chris"
Date: Monday, September 13, 2021 at 4:59 PM
To: Riccardo Veraldi, "Vicker, Darby J. (JSC-EG111)[Jacobs Technology, Inc.]", 
"lustre-discuss@lists.lustre.org"
Subject: [EXTERNAL] Re: [lustre-discuss] Disabling multi-rail dynamic discovery

I’m not sure why lnetctl import wouldn’t correctly set discovery. Might be a 
bug. You can try setting the kernel module parameter to disable discovery:

options lnet lnet_peer_discovery_disabled=1

This obviously requires LNet to be reloaded.
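
To check that the option is actually present in a modprobe config file before reloading, something like the following could be used. It is a sketch: the helper name is made up, the temporary file stands in for /etc/modprobe.d/lustre.conf, and the sysfs path in the final comment is an assumption about how the parameter is exported on a given build.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the given modprobe config file sets
# lnet_peer_discovery_disabled=1 on an "options lnet" line.
discovery_disabled_in_conf() {
    grep -Eq '^[[:space:]]*options[[:space:]]+lnet[[:space:]].*lnet_peer_discovery_disabled=1' "$1"
}

conf=$(mktemp)
printf 'options lnet networks=o2ib1(ib0) lnet_peer_discovery_disabled=1\n' > "$conf"
discovery_disabled_in_conf "$conf" && echo "discovery disabled in config"
rm -f "$conf"

# After reloading LNet, the running value may be readable from sysfs
# (assumption: the parameter is exported there):
#   cat /sys/module/lnet/parameters/lnet_peer_discovery_disabled
```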

I would not recommend toggling discovery via the CLI as there are some bugs 
with correctly dealing with the fallout of that (peers going from MR enabled to 
MR disabled).

Chris Horn

From: lustre-discuss on behalf of Riccardo Veraldi
Date: Monday, September 13, 2021 at 5:25 PM
To: Vicker, Darby J. (JSC-EG111)[Jacobs Technology, Inc.], 
lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] Disabling multi-rail dynamic discovery

I suppose you removed /etc/modprobe.d/lustre.conf completely.

I only have the lnet service enabled at startup and do not start any lustre3 
service, but I am running Lustre 2.12.0 (sorry, not 2.14), so something might 
be different.

Did you start over with a clean configuration?

Did you reboot your system to make sure it picks up the new config? At least 
for me, the lnet module sometimes does not unload correctly.

I should also mention that in my setup I disabled discovery on the OSSes as 
well, not only on the client side.

Generally it is not advisable to disable Multi-rail unless you have backward 
compatibility issues with older lustre peers.

But disabling discovery will also disable Multi-rail.

You can try

lnetctl set discovery 0

as you already did, then run

lnetctl -b export > /etc/lnet.conf

Check that discovery is set to 0 in the file; if not, edit it and set it to 0.
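
The check-and-edit step on the exported file can be scripted. This is a sketch that assumes GNU sed and a file in the same YAML shape as the lnet.conf shown in this thread; the helper name is invented, and the example runs on a temporary copy rather than on /etc/lnet.conf.

```shell
#!/bin/sh
# Hypothetical helper: rewrite any "discovery: N" line in an lnet.conf-style
# file to "discovery: 0", keeping its indentation. Assumes GNU sed (-i, -E).
force_discovery_off() {
    sed -i -E 's/^([[:space:]]*discovery:)[[:space:]]*[0-9]+/\1 0/' "$1"
}

tmp=$(mktemp)
printf 'global:\n    discovery: 1\n' > "$tmp"
force_discovery_off "$tmp"
grep discovery "$tmp"     # now shows discovery: 0
rm -f "$tmp"
```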

Reboot and see if things change.

In any case, if you did not define any tcp interface in lnet.conf, you should 
not see any tcp peers.


On 9/13/21 2:59 PM, Vicker, Darby J. (JSC-EG111)[Jacobs Technology, Inc.] wrote:
Thanks Rick.  I removed my lnet modprobe options and adapted my lnet.conf file 
to:

# cat /etc/lnet.conf
ip2nets:
 - net-spec: o2ib1
   interfaces:
      0: ib0
global:
    discovery: 0
#


Now "lnetctl export" doesn't have any reference to NIDs on the other networks, 
so that's good.  However, I'm still seeing some values that concern me:


# lnetctl export | grep -e Multi -e discover | sort -u
discovery: 1