Re: [lustre-discuss] Lustre 2.12.0 and locking problems

2019-03-06 Thread Amir Shehata
no problem

On Wed, 6 Mar 2019 at 12:15, Riccardo Veraldi 
wrote:

> On 3/6/19 11:29 AM, Amir Shehata wrote:
>
> The reason the load is split across tcp and o2ib0 for the 2.12 client is
> that the Multi-Rail (MR) code sees both interfaces, realizes it can use
> both of them, and so it does.
> To disable this behavior you can disable discovery on the 2.12 client. I
> think that should get the client to use only the single interface it's
> told to.
>
> thank you very much, this worked out well.
>
> We're currently working on a feature (UDSP) which will allow the
> specification of a "preferred" network. In your case you can set the o2ib
> to be the preferred network. It'll always be used unless it becomes
> unavailable. You get two benefits this way: 1) your preference is adhered
> to. 2) reliability, since the tcp network will be used if the o2ib network
> becomes unavailable.
>
> this feature (UDSP) would be really great.
>
>
> Let me know if disabling discovery on your 2.12 clients works.
>
> yes after disabling discovery on the client side, the situation is much
> better
>
>
> thank you very much
>
>
>
> thanks
> amir
>
> On Tue, 5 Mar 2019 at 18:49, Riccardo Veraldi <
> riccardo.vera...@cnaf.infn.it> wrote:
>
>> Hello Amir, I answer in-line.
>>
>> On 3/5/19 3:42 PM, Amir Shehata wrote:
>>
>> It looks like the ping is passing. Did you try it several times to make
>> sure it always pings successfully?
>>
>> The way it works is the MDS (2.12) discovers all the interfaces on the
>> peer. There is a concept of the primary NID for the peer. That's the first
>> interface configured on the peer. In your case it's the o2ib NID. So when
>> you do lnetctl peer show you'll see Primary NID: @o2ib.
>>
>> - primary nid: 172.21.52.88@o2ib
>>   Multi-Rail: True
>>   peer ni:
>>     - nid: 172.21.48.250@tcp
>>       state: NA
>>     - nid: 172.21.52.88@o2ib
>>       state: NA
>>     - nid: 172.21.48.250@tcp1
>>       state: NA
>>     - nid: 172.21.48.250@tcp2
>>       state: NA
>>
>> On the MDS it uses the primary_nid to identify the peer. So you can ping
>> using the Primary NID. LNet will resolve the Primary NID to the tcp NID. As
>> you can see in the logs, it never actually talks over o2ib. It ends up
>> talking to the peer on its TCP NID, which is what you want to do.
>>
>> I think the problem you're seeing is caused by the combination of 2.12
>> and 2.10.x.
>> From what I understand your servers are 2.12 and your clients are 2.10.x.
>>
>> my clients are 2.10.5, but this problem arises also with one 2.12.0 client;
>> in any case, the combination of 2.10.0 clients and 2.12.0 servers is not
>> working right
>>
>>
>> Can you try disabling dynamic discovery on your servers:
>> lnetctl set discovery 0
>>
>> I did this on the MDS and OSS. I did not disable discovery on the client
>> side.
>>
>> now on the MDS side lnetctl peer show looks right.
>>
>> Anyway, on the client side where I have both IB and tcp, if I write to the
>> lustre filesystem (OSS) what happens is that the write operation is
>> split/load-balanced between IB and tcp (Ethernet), and I do not want this.
>> I would like only IB to be used when the client writes data to the
>> OSS, but both peer NIs (o2ib, tcp) are seen from the 2.12.0 client and
>> traffic goes to both of them, thus reducing performance because IB is not
>> fully used. This does not happen with a 2.10.5 client writing to the same
>> 2.12.0 OSS.
>>
>
>
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre 2.12.0 and locking problems

2019-03-06 Thread Amir Shehata
The reason the load is split across tcp and o2ib0 for the 2.12 client is
that the Multi-Rail (MR) code sees both interfaces, realizes it can use
both of them, and so it does.
To disable this behavior you can disable discovery on the 2.12 client. I
think that should get the client to use only the single interface it's
told to.
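
As a minimal sketch of that step on the client (assuming the stock lnetctl
subcommands in 2.12; verify against your build):

   # turn off dynamic peer discovery at runtime on the 2.12 client
   lnetctl set discovery 0
   # confirm it took effect; the global settings should report discovery: 0
   lnetctl global show
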
We're currently working on a feature (UDSP) which will allow the
specification of a "preferred" network. In your case you can set the o2ib
to be the preferred network. It'll always be used unless it becomes
unavailable. You get two benefits this way: 1) your preference is adhered
to. 2) reliability, since the tcp network will be used if the o2ib network
becomes unavailable.
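
For illustration only, a sketch of how that preference might eventually be
expressed with UDSP (an assumption based on the lnetctl udsp syntax planned
for later releases; it is not available in 2.12.0):

   # give the o2ib network the highest priority (0) so it is preferred
   # whenever it is healthy; tcp stays available as a fallback
   lnetctl udsp add --src o2ib --priority 0
   # list the configured selection policies
   lnetctl udsp show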

Let me know if disabling discovery on your 2.12 clients works.

thanks
amir

On Tue, 5 Mar 2019 at 18:49, Riccardo Veraldi 
wrote:

> Hello Amir, I answer in-line.
>
> On 3/5/19 3:42 PM, Amir Shehata wrote:
>
> It looks like the ping is passing. Did you try it several times to make
> sure it always pings successfully?
>
> The way it works is the MDS (2.12) discovers all the interfaces on the
> peer. There is a concept of the primary NID for the peer. That's the first
> interface configured on the peer. In your case it's the o2ib NID. So when
> you do lnetctl peer show you'll see Primary NID: @o2ib.
>
> - primary nid: 172.21.52.88@o2ib
>   Multi-Rail: True
>   peer ni:
>     - nid: 172.21.48.250@tcp
>       state: NA
>     - nid: 172.21.52.88@o2ib
>       state: NA
>     - nid: 172.21.48.250@tcp1
>       state: NA
>     - nid: 172.21.48.250@tcp2
>       state: NA
>
> On the MDS it uses the primary_nid to identify the peer. So you can ping
> using the Primary NID. LNet will resolve the Primary NID to the tcp NID. As
> you can see in the logs, it never actually talks over o2ib. It ends up
> talking to the peer on its TCP NID, which is what you want to do.
>
> I think the problem you're seeing is caused by the combination of 2.12 and
> 2.10.x.
> From what I understand your servers are 2.12 and your clients are 2.10.x.
>
> my clients are 2.10.5, but this problem arises also with one 2.12.0 client;
> in any case, the combination of 2.10.0 clients and 2.12.0 servers is not
> working right
>
>
> Can you try disabling dynamic discovery on your servers:
> lnetctl set discovery 0
>
> I did this on the MDS and OSS. I did not disable discovery on the client
> side.
>
> now on the MDS side lnetctl peer show looks right.
>
> Anyway, on the client side where I have both IB and tcp, if I write to the
> lustre filesystem (OSS) what happens is that the write operation is
> split/load-balanced between IB and tcp (Ethernet), and I do not want this.
> I would like only IB to be used when the client writes data to the
> OSS, but both peer NIs (o2ib, tcp) are seen from the 2.12.0 client and
> traffic goes to both of them, thus reducing performance because IB is not
> fully used. This does not happen with a 2.10.5 client writing to the same
> 2.12.0 OSS.
>
>
>
> Do that as part of the initial bring-up to make sure 2.12 nodes don't try
> to discover peers. Let me know if that resolves your issue.
>
> On Tue, 5 Mar 2019 at 15:09, Riccardo Veraldi <
> riccardo.vera...@cnaf.infn.it> wrote:
>
>> it is not exactly this problem.
>>
>> here is my setup
>>
>>- MDS is on tcp0
>>- client is on tcp0 and o2ib0
>>- OSS is on tcp0 and o2ib0
>>
>> The problem is that the MDS is discovering both the lustre client and the
>> OSS over o2ib, and it should not, because the MDS has only one
>> ethernet interface. I can see this from lnetctl peer show. This did not
>> happen prior to upgrading to Lustre 2.12.0 from 2.10.5.
>>
>> so I tried to debug with lctl ping from the MDS to the lustre client
>>
>> [root@psmdsana1501 ~]# lctl ping 172.21.48.250@tcp
>> 12345-0@lo
>> 12345-172.21.52.88@o2ib
>> 12345-172.21.48.250@tcp
>>
>> 172.21.52.88 is the ib interface of the lustre client.
>>
>> so I did
>>
>> [root@psmdsana1501 ~]# lctl ping 172.21.52.88@o2ib
>> 12345-0@lo
>> 12345-172.21.52.88@o2ib
>> 12345-172.21.48.250@tcp
>>
>>
>> this is the mds lnet.conf
>>
>> ip2nets:
>>  - net-spec: tcp
>>    interfaces:
>>       0: eth0
>>
>> then I did as you suggested on the MDS:
>>
>> lctl set_param debug=+"net neterror"
>>
>> lctl ping 172.21.48.250
>> 12345-0@lo
>> 12345-172.21.52.88@o2ib
>> 12345-172.21.48.250@tcp
>>
>>
>> LOG file:
>>
>> 0800:0200:3.0F:1551827094.376143:0:17197:0:(socklnd.c:195:ksocknal_find_peer_locked())
>> got peer_ni [8e0f3a312100] -> 12345-172.21.49.100@tcp (4)
>> 0800:0200:3.0:1551827094.376155:0:17197:0:(socklnd_cb.c:757:ksocknal_queue_tx_locked())
>> Sending to 12345-172.21.49.100@tcp ip 172.21.49.100:1021
>> 
>> 0800:0200:3.0:1551827094.376158:0:17197:0:(socklnd_cb.c:776:ksocknal_queue_tx_locked())
>> Packet 8e0f3d32d800 type 192, nob 24 niov 1 nkiov 0
>> 

Re: [lustre-discuss] Lustre 2.12.0 and locking problems

2019-03-06 Thread Riccardo Veraldi

Hello Amir, I answer in-line.

On 3/5/19 3:42 PM, Amir Shehata wrote:
It looks like the ping is passing. Did you try it several times to 
make sure it always pings successfully?


The way it works is the MDS (2.12) discovers all the interfaces on the 
peer. There is a concept of the primary NID for the peer. That's the 
first interface configured on the peer. In your case it's the o2ib 
NID. So when you do lnetctl peer show you'll see Primary NID: @o2ib.


- primary nid: 172.21.52.88@o2ib
  Multi-Rail: True
  peer ni:
    - nid: 172.21.48.250@tcp
      state: NA
    - nid: 172.21.52.88@o2ib
      state: NA
    - nid: 172.21.48.250@tcp1
      state: NA
    - nid: 172.21.48.250@tcp2
      state: NA

On the MDS it uses the primary_nid to identify the peer. So you can 
ping using the Primary NID. LNet will resolve the Primary NID to the 
tcp NID. As you can see in the logs, it never actually talks over 
o2ib. It ends up talking to the peer on its TCP NID, which is what you 
want to do.
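
If you want to double-check which peer NI actually carries the traffic, one
option (assuming the --nid and --verbose options of lnetctl peer show, which
print per-NI statistics) is:

   # inspect this one peer and compare the send/recv counters
   # on its tcp and o2ib peer NIs
   lnetctl peer show --nid 172.21.52.88@o2ib --verbose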


I think the problem you're seeing is caused by the combination of 2.12 
and 2.10.x.

From what I understand your servers are 2.12 and your clients are 2.10.x.
my clients are 2.10.5, but this problem arises also with one 2.12.0 client;
in any case, the combination of 2.10.0 clients and 2.12.0 servers is not
working right


Can you try disabling dynamic discovery on your servers:
lnetctl set discovery 0


I did this on the MDS and OSS. I did not disable discovery on the client 
side.


now on the MDS side lnetctl peer show looks right.

Anyway, on the client side where I have both IB and tcp, if I write to the
lustre filesystem (OSS) what happens is that the write operation is
split/load-balanced between IB and tcp (Ethernet), and I do not want
this. I would like only IB to be used when the client writes
data to the OSS, but both peer NIs (o2ib, tcp) are seen from the 2.12.0
client and traffic goes to both of them, thus reducing performance
because IB is not fully used. This does not happen with a 2.10.5 client
writing to the same 2.12.0 OSS.
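
To illustrate what that end state could look like, here is a sketch of a
client-side lnet.conf in the same ip2nets style as the mds config shown
further down; the interface names eth0 and ib0 are assumptions. Both
networks stay configured (tcp is still needed to reach the MDS), and
pairing this with "lnetctl set discovery 0" on the client is what keeps
traffic on the single NID the client is told for each peer:

   ip2nets:
    - net-spec: tcp
      interfaces:
         0: eth0   # assumed ethernet interface, needed to reach the MDS
    - net-spec: o2ib
      interfaces:
         0: ib0    # assumed IB interface, used for the OSS NIDs the client
                   # is told about once discovery is off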





Do that as part of the initial bring-up to make sure 2.12 nodes don't
try to discover peers. Let me know if that resolves your issue.
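
If you want the setting to survive a reboot, one possibility (assuming the
lnet service imports /etc/lnet.conf at startup and accepts the global
section that lnetctl export writes, as in stock 2.12 packaging) is:

   # /etc/lnet.conf
   global:
       discovery: 0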


On Tue, 5 Mar 2019 at 15:09, Riccardo Veraldi
<riccardo.vera...@cnaf.infn.it> wrote:


it is not exactly this problem.

here is my setup

  * MDS is on tcp0
  * client is on tcp0 and o2ib0
  * OSS is on tcp0 and o2ib0

The problem is that the MDS is discovering both the lustre client
and the OSS over o2ib, and it should not, because the MDS
has only one ethernet interface. I can see this from lnetctl peer
show. This did not happen prior to upgrading to Lustre 2.12.0 from
2.10.5.

so I tried to debug with lctl ping from the MDS to the lustre client

[root@psmdsana1501 ~]# lctl ping 172.21.48.250@tcp
12345-0@lo
12345-172.21.52.88@o2ib
12345-172.21.48.250@tcp

172.21.52.88 is the ib interface of the lustre client.

so I did

[root@psmdsana1501 ~]# lctl ping 172.21.52.88@o2ib
12345-0@lo
12345-172.21.52.88@o2ib
12345-172.21.48.250@tcp


this is the mds lnet.conf

ip2nets:
 - net-spec: tcp
   interfaces:
      0: eth0

then I did as you suggested on the MDS:

lctl set_param debug=+"net neterror"

lctl ping 172.21.48.250
12345-0@lo
12345-172.21.52.88@o2ib
12345-172.21.48.250@tcp
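
For reference, one way to capture the resulting debug buffer into a file
(the output path here is just an example):

   lctl dk > /tmp/lnet-debug.log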


LOG file:


0800:0200:3.0F:1551827094.376143:0:17197:0:(socklnd.c:195:ksocknal_find_peer_locked())
got peer_ni [8e0f3a312100] -> 12345-172.21.49.100@tcp (4)

0800:0200:3.0:1551827094.376155:0:17197:0:(socklnd_cb.c:757:ksocknal_queue_tx_locked())
Sending to 12345-172.21.49.100@tcp ip 172.21.49.100:1021


0800:0200:3.0:1551827094.376158:0:17197:0:(socklnd_cb.c:776:ksocknal_queue_tx_locked())
Packet 8e0f3d32d800 type 192, nob 24 niov 1 nkiov 0

0800:0200:3.0:1551827094.376312:0:17200:0:(socklnd_cb.c:549:ksocknal_process_transmit())
send(0) 0

0800:0200:3.0:1551827097.376102:0:17197:0:(socklnd.c:195:ksocknal_find_peer_locked())
got peer_ni [8e0f346f9400] -> 12345-172.21.49.110@tcp (4)

0800:0200:3.0:1551827097.376110:0:17197:0:(socklnd_cb.c:757:ksocknal_queue_tx_locked())
Sending to 12345-172.21.49.110@tcp ip 172.21.49.110:1021


0800:0200:3.0:1551827097.376114:0:17197:0:(socklnd_cb.c:776:ksocknal_queue_tx_locked())
Packet 8e0f3d32d800 type 192, nob 24 niov 1 nkiov 0

0800:0200:3.0:1551827097.376135:0:17197:0:(socklnd.c:195:ksocknal_find_peer_locked())
got peer_ni [8e0f3d32d000] -> 12345-172.21.48.69@tcp (4)