Re: [lustre-discuss] Multiple IB Interfaces
Hi Alastair,

On Fri, Mar 12, 2021 at 09:32:03AM +, Alastair Basden via lustre-discuss wrote:
> Reading is more problematic. A request from a client (say 10.0.0.100) for
> data on OST2 will come in via card 2 (10.0.0.2). A thread on CPU2
> (hopefully) will then read the data from OST2, and send it out to the
> client, 10.0.0.100. However, here, Linux will route the packet through
> the first card on this subnet, so it will go over the inter-cpu link, and
> out of IB card 1. And this will be the case even if the thread is pinned
> on CPU2.
>
> The question then is whether there is a way to configure Lustre to use IB
> card 2 when sending out data from OST2.

The routing table entries referenced here:

https://wiki.lustre.org/LNet_Router_Config_Guide#ARP_flux_issue_for_MR_node

should do this for you I believe, ensuring essentially that packets will be routed out over the interface that they are received on.

I think this is sufficient, but maybe someone more knowledgeable on this can confirm.

Cheers,
Matt

-- 
Matt Rásó-Barnett
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
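
The approach Matt points to is source-based policy routing: one routing table per interface, plus a rule that selects the table by source address. Below is a minimal sketch of that idea for the two-card example in this thread. It only builds iproute2 command strings and does not run them. The device names ib0/ib1, the /24 prefix and the table IDs 200/201 are assumptions for illustration, not values from the thread, and any ARP-related kernel settings described in the wiki section are left out.

```python
import ipaddress


def source_routing_commands(interfaces, first_table_id=200):
    """Build iproute2 commands so replies leave via the interface owning the source IP.

    interfaces: list of (device, cidr) tuples, e.g. ("ib0", "10.0.0.1/24").
    Device names and table IDs are hypothetical placeholders.
    """
    commands = []
    for offset, (dev, cidr) in enumerate(interfaces):
        iface = ipaddress.ip_interface(cidr)
        table = first_table_id + offset
        # Per-interface table holding the connected-subnet route through this device.
        commands.append(
            f"ip route add {iface.network} dev {dev} proto kernel "
            f"scope link src {iface.ip} table {table}"
        )
        # Policy rule: packets sourced from this address consult that table.
        commands.append(f"ip rule add from {iface.ip} table {table}")
    commands.append("ip route flush cache")
    return commands


if __name__ == "__main__":
    for line in source_routing_commands([("ib0", "10.0.0.1/24"), ("ib1", "10.0.0.2/24")]):
        print(line)
```

Printing the commands rather than executing them keeps the sketch safe to run anywhere; whether these rules alone are sufficient for LNet is the open question Matt raises above.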
Re: [lustre-discuss] Multiple IB Interfaces
Hi all,

Thanks for the replies. The issue as I see it is with sending data from an OST to the client while avoiding the inter-CPU link.

So, if I have:

CPU1 - IB card 1 (10.0.0.1), nvme1 (OST1)
CPU2 - IB card 2 (10.0.0.2), nvme2 (OST2)

Both IB cards are on the same subnet. Therefore, by default, packets will be routed out of the server over the preferred card, say IB card 1 (I could be wrong, but this is my current understanding, and seems to be what the Lustre manual says).

Data coming in (being written to the OST) is not a problem. The client will know the IP address of the card to which the OST is closest. So, to write to OST2, it will use the 10.0.0.2 address (since this will be the IP address given in mkfs.lustre for that OST).

The slight complication here is pinning. The thread handling the write may run on CPU1, in which case the data has to traverse the inter-CPU link twice. However, I am assuming that this won't happen, i.e. the kernel or Lustre is clever enough to place this thread on CPU2. As far as I am aware, this should just work, though please correct me if I'm wrong. Perhaps I have to manually specify pinning - how does one do that with Lustre?

Reading is more problematic. A request from a client (say 10.0.0.100) for data on OST2 will come in via card 2 (10.0.0.2). A thread on CPU2 (hopefully) will then read the data from OST2, and send it out to the client, 10.0.0.100. However, here, Linux will route the packet through the first card on this subnet, so it will go over the inter-CPU link, and out of IB card 1. And this will be the case even if the thread is pinned on CPU2.

The question then is whether there is a way to configure Lustre to use IB card 2 when sending out data from OST2.

Cheers,
Alastair.

On Wed, 10 Mar 2021, Ms. Megan Larko wrote:
> Greetings Alastair,
>
> Bonding is supported on InfiniBand, but I believe that it is only
> active/passive.
>
> I think what you might be looking for WRT avoiding data travel through
> the inter-cpu link is cpu "affinity" AKA cpu "pinning".
>
> Cheers,
> megan
>
> WRT = "with regards to"
> AKA = "also known as"
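
Before tuning anything, it may help to confirm that each IB card and the NVMe drive it should serve really sit on the same socket. The sketch below reads the numa_node attribute that Linux exposes in sysfs for PCI devices. The device names (ib0, ib1, nvme0, nvme1) are assumptions, and the sysfs root is a parameter so the helper can be exercised against a test directory. It only reports locality; it does not pin any Lustre threads.

```python
from pathlib import Path


def numa_node(sysfs_root, device_class, name):
    """Return the NUMA node sysfs reports for a device, or None if unavailable."""
    path = Path(sysfs_root) / "class" / device_class / name / "device" / "numa_node"
    try:
        node = int(path.read_text().strip())
    except (OSError, ValueError):
        return None
    # The kernel reports -1 when it has no NUMA information for the device.
    return None if node < 0 else node


def colocated(sysfs_root, nic, nvme_ctrl):
    """True only if both devices report the same, known NUMA node."""
    nic_node = numa_node(sysfs_root, "net", nic)
    nvme_node = numa_node(sysfs_root, "nvme", nvme_ctrl)
    return nic_node is not None and nic_node == nvme_node


if __name__ == "__main__":
    # Hypothetical pairing taken from the layout described above.
    for nic, ctrl in [("ib0", "nvme0"), ("ib1", "nvme1")]:
        print(nic, ctrl, "same NUMA node:", colocated("/sys", nic, ctrl))
```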
Re: [lustre-discuss] Multiple IB interfaces
Alastair,

A few scenarios you may consider:

1) Define 2 LNets, one per IB interface (say o2ib1 and o2ib2), and share out one OST through o2ib1 and the other through o2ib2. You can map HBA and disk locality so that they are attached to the same CPU.

2) Same as above, but share the OST(s) from both LNets, and configure odd clients (clients with odd IPs) to use o2ib1 and even clients to use o2ib2. This may not be exactly what you are looking for, but it can efficiently utilize both interfaces.

-Raj

On Tue, Mar 9, 2021 at 9:18 AM Alastair Basden via lustre-discuss <lustre-discuss@lists.lustre.org> wrote:
> Hi,
>
> We are installing some new Lustre servers with 2 InfiniBand cards, 1
> attached to each CPU socket. Storage is nvme, again, some drives attached
> to each socket.
>
> We want to ensure that data to/from each drive uses the appropriate IB
> card, and doesn't need to travel through the inter-cpu link. Data being
> written is fairly easy I think, we just set that OST to the appropriate IP
> address. However, data being read may well go out the other NIC, if I
> understand correctly.
>
> What setup do we need for this?
>
> I think probably not bonding, as that will presumably not tie
> NIC interfaces to cpus. But I also see a note in the Lustre manual:
>
> """If the server has multiple interfaces on the same subnet, the Linux
> kernel will send all traffic using the first configured interface. This is
> a limitation of Linux, not Lustre. In this case, network interface bonding
> should be used. For more information about network interface bonding, see
> Chapter 7, Setting Up Network Interface Bonding."""
>
> (plus, no idea if bonding is supported on InfiniBand).
>
> Thanks,
> Alastair.
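
Raj's second scenario, splitting clients across the two LNets by odd and even IP, could be sketched as below. The helper picks a network name from the last octet of a client's IPv4 address and formats an lnet module option line for that client. The network names o2ib1/o2ib2 come from Raj's message; the client-side device name ib0 and the idea of generating the line per client are assumptions for illustration.

```python
import ipaddress


def lnet_for_client(ip):
    """Odd last octet -> o2ib1, even last octet -> o2ib2 (as in scenario 2)."""
    last_octet = int(ipaddress.IPv4Address(ip)) & 0xFF
    return "o2ib1" if last_octet % 2 else "o2ib2"


def client_networks_option(ip, client_dev="ib0"):
    """Format an lnet module option line for one client (client_dev is hypothetical)."""
    return f'options lnet networks="{lnet_for_client(ip)}({client_dev})"'


if __name__ == "__main__":
    for client in ("10.0.0.100", "10.0.0.101"):
        print(client, "->", client_networks_option(client))
```

Deriving the network from the address keeps the split deterministic without a per-client lookup table, at the cost of an uneven split if client addresses are not evenly distributed between odd and even.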
[lustre-discuss] Multiple IB Interfaces
Greetings Alastair,

Bonding is supported on InfiniBand, but I believe that it is only active/passive.

I think what you might be looking for WRT avoiding data travel through the inter-cpu link is cpu "affinity" AKA cpu "pinning".

Cheers,
megan

WRT = "with regards to"
AKA = "also known as"
[lustre-discuss] Multiple IB interfaces
Hi,

We are installing some new Lustre servers with 2 InfiniBand cards, 1 attached to each CPU socket. Storage is nvme, again, some drives attached to each socket.

We want to ensure that data to/from each drive uses the appropriate IB card, and doesn't need to travel through the inter-cpu link. Data being written is fairly easy I think, we just set that OST to the appropriate IP address. However, data being read may well go out the other NIC, if I understand correctly.

What setup do we need for this?

I think probably not bonding, as that will presumably not tie NIC interfaces to cpus. But I also see a note in the Lustre manual:

"""If the server has multiple interfaces on the same subnet, the Linux kernel will send all traffic using the first configured interface. This is a limitation of Linux, not Lustre. In this case, network interface bonding should be used. For more information about network interface bonding, see Chapter 7, Setting Up Network Interface Bonding."""

(plus, no idea if bonding is supported on InfiniBand).

Thanks,
Alastair.