Re: [lustre-discuss] Multiple IB Interfaces

2021-03-12 Thread Matt Rásó-Barnett via lustre-discuss

Hi Alastair,

On Fri, Mar 12, 2021 at 09:32:03AM +, Alastair Basden via lustre-discuss 
wrote:
Reading is more problematic.  A request from a client (say 10.0.0.100) 
for data on OST2 will come in via card 2 (10.0.0.2).  A thread on CPU2 
(hopefully) will then read the data from OST2, and send it out to the 
client, 10.0.0.100.  However, here, Linux will route the packet through 
the first card on this subnet, so it will go over the inter-cpu link, 
and out of IB card 1.  And this will be the case even if the thread is 
pinned on CPU2.


The question then is whether there is a way to configure Lustre to use 
IB card 2 when sending out data from OST2.


The routing table entries referenced here: 
https://wiki.lustre.org/LNet_Router_Config_Guide#ARP_flux_issue_for_MR_node
should do this for you, I believe, essentially ensuring that packets are 
routed out over the interface on which they are received.
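
For reference, a minimal sketch of that approach, assuming IPoIB 
interfaces ib0 (10.0.0.1) and ib1 (10.0.0.2) on 10.0.0.0/24 - interface 
names and table numbers are illustrative, and the wiki page has the 
authoritative settings:

  # Suppress ARP flux so each address is only answered on its own interface
  sysctl -w net.ipv4.conf.all.arp_ignore=1
  sysctl -w net.ipv4.conf.all.arp_announce=2

  # One routing table per interface, selected by source address, so
  # replies leave via the interface that owns the source IP
  ip route add 10.0.0.0/24 dev ib0 proto kernel scope link src 10.0.0.1 table 101
  ip rule add from 10.0.0.1 table 101
  ip route add 10.0.0.0/24 dev ib1 proto kernel scope link src 10.0.0.2 table 102
  ip rule add from 10.0.0.2 table 102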


I think this is sufficient, but maybe someone more knowledgeable on this 
can confirm.


Cheers,
Matt


On Wed, 10 Mar 2021, Ms. Megan Larko wrote:


[EXTERNAL EMAIL]
Greetings Alastair,

Bonding is supported on InfiniBand, but I believe that it is only 
active/passive.
I think what you might be looking for WRT avoiding data travel through the inter-cpu link is cpu 
"affinity" AKA cpu "pinning".

Cheers,
megan

WRT = "with regards to"
AKA = "also known as"


___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


--
Matt Rásó-Barnett
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Multiple IB Interfaces

2021-03-12 Thread Alastair Basden via lustre-discuss

Hi all,

Thanks for the replies.  The issue as I see it is with sending data from 
an OST to the client, avoiding the inter-CPU link.


So, if I have:
cpu1 - IB card 1 (10.0.0.1), nvme1 (OST1)
cpu2 - IB card 2 (10.0.0.2), nvme2 (OST2)

Both IB cards are on the same subnet.  Therefore, by default, packets will be 
routed out of the server over the preferred card, say IB card 1 (I could 
be wrong, but this is my current understanding, and seems to be what the 
Lustre manual says).


Data coming in (being written to the OST) is not a problem.  The client 
will know the IP address of the card to which the OST is closest.   So, 
to write to OST2, it will use the 10.0.0.2 address (since this will be 
the IP address given in mkfs.lustre for that OST).


The slight complication here is pinning.  A cpu thread may run on cpu1, so 
the data has to traverse the inter-cpu link twice.  However, I am assuming 
that this won't happen - i.e. the kernel or Lustre is clever enough to 
place this thread on cpu2.  As far as I am aware, this should just work, 
though please correct me if I'm wrong.  Perhaps I have to manually specify 
pinning - how does one do that with Lustre?
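
From what I can tell, the mechanism Lustre itself exposes for this is the 
CPU partition table (CPT).  A rough sketch of the module options involved, 
assuming one partition per socket - parameter names and availability 
should be checked against the tuning chapter of the manual for the version 
in use:

  # /etc/modprobe.d/lustre.conf on the OSS (illustrative)
  options libcfs cpu_pattern="N"   # one CPU partition (CPT) per NUMA node
  options ost oss_cpts="[0,1]"     # confine OSS service threads to these CPTs, if supported

LNet NIs can apparently also be bound to CPTs with a bracketed CPT list in 
the networks= string (see the LNet SMP tuning material).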


Reading is more problematic.  A request from a client (say 10.0.0.100) for 
data on OST2 will come in via card 2 (10.0.0.2).  A thread on CPU2 
(hopefully) will then read the data from OST2, and send it out to the 
client, 10.0.0.100.  However, here, Linux will route the packet through 
the first card on this subnet, so it will go over the inter-cpu link, and 
out of IB card 1.  And this will be the case even if the thread is pinned 
on CPU2.
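
One way to check what the kernel will actually do for such a reply 
(assuming the cards show up as ib0 and ib1):

  # Which device is used for a reply to the client sourced from 10.0.0.2?
  ip route get 10.0.0.100 from 10.0.0.2
  # With a plain same-subnet setup this typically reports the first
  # interface (ib0); with per-source routing rules it should report ib1.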


The question then is whether there is a way to configure Lustre to use IB 
card 2 when sending out data from OST2.


Cheers,
Alastair.

On Wed, 10 Mar 2021, Ms. Megan Larko wrote:


[EXTERNAL EMAIL]
Greetings Alastair,

Bonding is supported on InfiniBand, but I believe that it is only 
active/passive.
I think what you might be looking for WRT avoiding data travel through the inter-cpu link is cpu 
"affinity" AKA cpu "pinning".

Cheers,
megan

WRT = "with regards to"
AKA = "also known as"


___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Multiple IB interfaces

2021-03-11 Thread Raj via lustre-discuss
Alastair,
A few scenarios you may consider:
1) Define two LNets, one per IB interface (say o2ib1 and o2ib2), and serve
one OST through o2ib1 and the other through o2ib2. You can map HBA and disk
locality so that they are attached to the same CPU (rough sketch below).

2) Same as above, but serve the OST(s) on both LNets, and configure odd
clients (clients with odd IPs) to use o2ib1 and even clients to use o2ib2.
This may not be exactly what you are looking for, but it can efficiently
utilize both interfaces.
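
A rough sketch of the server side of scenario 1, with illustrative
interface names, NIDs, indices and device paths (the MGS NID in particular
is an assumption):

  # OSS /etc/modprobe.d/lustre.conf: one LNet per IB interface
  options lnet networks="o2ib1(ib0),o2ib2(ib1)"

  # Tie each OST to the NID of its socket-local interface
  mkfs.lustre --ost --fsname=testfs --index=1 --mgsnode=10.0.0.10@o2ib1 \
      --servicenode=10.0.0.1@o2ib1 /dev/nvme1n1
  mkfs.lustre --ost --fsname=testfs --index=2 --mgsnode=10.0.0.10@o2ib1 \
      --servicenode=10.0.0.2@o2ib2 /dev/nvme2n1

For scenario 2 the OSTs would instead be reachable on both nets (e.g. drop
--servicenode, or give one servicenode NID per net), and the split is done
on the clients, e.g. odd clients get networks="o2ib1(ib0)" and even clients
networks="o2ib2(ib0)", making sure the MGS is reachable from whichever LNet
a client ends up on.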

-Raj

On Tue, Mar 9, 2021 at 9:18 AM Alastair Basden via lustre-discuss <
lustre-discuss@lists.lustre.org> wrote:

> Hi,
>
> We are installing some new Lustre servers with 2 InfiniBand cards, 1
> attached to each CPU socket.  Storage is nvme, again, some drives attached
> to each socket.
>
> We want to ensure that data to/from each drive uses the appropriate IB
> card, and doesn't need to travel through the inter-cpu link.  Data being
> written is fairly easy I think, we just set that OST to the appropriate IP
> address.  However, data being read may well go out the other NIC, if I
> understand correctly.
>
> What setup do we need for this?
>
> I think probably not bonding, as that will presumably not tie
> NIC interfaces to cpus.  But I also see a note in the Lustre manual:
>
> """If the server has multiple interfaces on the same subnet, the Linux
> kernel will send all traffic using the first configured interface. This is
> a limitation of Linux, not Lustre. In this case, network interface bonding
> should be used. For more information about network interface bonding, see
> Chapter 7, Setting Up Network Interface Bonding."""
>
> (plus, no idea if bonding is supported on InfiniBand).
>
> Thanks,
> Alastair.
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] Multiple IB Interfaces

2021-03-09 Thread Ms. Megan Larko via lustre-discuss
Greetings Alastair,

Bonding is supported on InfiniBand, but I believe that it is only
active/passive.
I think what you might be looking for WRT avoiding data travel through the
inter-cpu link is cpu "affinity" AKA cpu "pinning".

Cheers,
megan

WRT = "with regards to"
AKA = "also known as"
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] Multiple IB interfaces

2021-03-09 Thread Alastair Basden via lustre-discuss

Hi,

We are installing some new Lustre servers with 2 InfiniBand cards, 1 
attached to each CPU socket.  Storage is nvme, again, some drives attached 
to each socket.
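
For mapping that locality, sysfs reports the NUMA node of each device 
(device names here are examples, and paths can vary slightly by kernel):

  cat /sys/class/infiniband/mlx5_0/device/numa_node   # socket the HCA hangs off
  cat /sys/class/nvme/nvme0/device/numa_node          # socket the NVMe drive hangs off
  lscpu | grep -i numa                                # which CPUs belong to which node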


We want to ensure that data to/from each drive uses the appropriate IB 
card, and doesn't need to travel through the inter-cpu link.  Data being 
written is fairly easy I think, we just set that OST to the appropriate IP 
address.  However, data being read may well go out the other NIC, if I 
understand correctly.


What setup do we need for this?

I think probably not bonding, as that will presumably not tie 
NIC interfaces to cpus.  But I also see a note in the Lustre manual:


"""If the server has multiple interfaces on the same subnet, the Linux 
kernel will send all traffic using the first configured interface. This is 
a limitation of Linux, not Lustre. In this case, network interface bonding 
should be used. For more information about network interface bonding, see 
Chapter 7, Setting Up Network Interface Bonding."""


(plus, no idea if bonding is supported on InfiniBand).

Thanks,
Alastair.
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org