I concur with what folks have written so far: it really depends on your use case. For instance, if you are looking at a cluster with GPUs and intend to do some serious computing there, you are going to need RDMA of some sort. But it all depends on what you end up needing for your workflows.

For us, we put most of our network traffic over the IB using IPoIB, combined with aliasing all the nodes to their IB addresses. Thus all the internode traffic spans the IB fabric rather than the Ethernet. We then have 1GbE for our Ethernet backend, which we use mainly for management purposes. So rather than investing heavily in a high-speed Ethernet backbone, we invested in IB.
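
As a rough sketch of the aliasing approach (hostnames, addresses and node
specs below are just placeholders), you can either point each node's short
name at its IPoIB address in /etc/hosts, or keep the Ethernet hostnames and
hand Slurm the IPoIB address via NodeAddr:

    # /etc/hosts -- short names resolve to the IPoIB (ib0) addresses
    10.20.0.1   node01
    10.20.0.2   node02

    # slurm.conf -- alternative: keep Ethernet names, route Slurm traffic over IB
    NodeName=node01 NodeAddr=10.20.0.1 CPUs=64 RealMemory=256000 State=UNKNOWN
    NodeName=node02 NodeAddr=10.20.0.2 CPUs=64 RealMemory=256000 State=UNKNOWN

Either way the point is the same: the names the daemons and users resolve
end up on the IB fabric.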

Investing in both seems to me to be overkill; focus on one or the other unless you have the cash to spend and a good use case for both.

-Paul Edmon-

On 2/26/24 7:07 AM, Dan Healy via slurm-users wrote:
I’m very appreciative of each person who’s provided feedback, especially the lengthy replies.

Sounds like a RoCE-capable Ethernet backbone may be the default way to go /unless/ the end users have some specific requirements that need IB. At this point, we wouldn’t be interested in anything slower than 200Gbps. So perhaps Ethernet and IB are roughly equivalent in terms of latency and RDMA capabilities, except that one is an open standard.
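
(For what it's worth, my working assumption is that we'd sanity-check the
RoCE/RDMA path with the standard tooling from rdma-core and perftest; the
device name below is only an example:

    rdma link show                    # list RDMA-capable links and their netdevs
    ibv_devinfo -d mlx5_0             # port's link_layer shows Ethernet for RoCE
    ib_write_lat                      # on one node (server side)
    ib_write_lat <server hostname>    # on a second node; reports RDMA write latency

Happy to be corrected if there's a better way to validate that.)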

Thanks,

Daniel Healy


On Mon, Feb 26, 2024 at 3:40 AM Cutts, Tim <tim.cu...@astrazeneca.com> wrote:

    My view is that it depends entirely on the workload, and the
    systems with which your compute needs to interact. A few things
    I’ve experienced before:

     1. Modern Ethernet networks have pretty good latency these days,
        so MPI codes can run over them. Whether IB is worth the money
        is a cost/benefit calculation for the codes you want to run.
        The Ethernet network we put in at Sanger in 2016 or so
        measured, in practice, as having similar latency to FDR
        InfiniBand, if I remember correctly. So it wasn’t as good as
        state-of-the-art IB at the time, but not bad; certainly good
        enough for our purposes, and we gained a lot of flexibility
        through software-defined networking, which is important if you
        have workloads that require better security boundaries than
        just a big shared network.
     2. If your workload is predominantly single-node, embarrassingly
        parallel work, you might do better to go with Ethernet and
        invest the saved money in more compute nodes.
     3. If you only have Ethernet, your cluster will be simpler and
        require less specialised expertise to run.
     4. If your parallel filesystem is Lustre, IB seems to be a more
        well-worn path than Ethernet; we encountered a few Lustre bugs
        early on because of that.
     5. On the other hand, if you need to talk to Weka, Ethernet is the
        well-worn path. Weka’s IB implementation requires dedicating
        some cores on every client node, so you lose some compute
        capacity; you don’t have to do that if you’re using Ethernet.

    So, as any lawyer would say, “it depends”. Most of my career has
    been in genomics, where IB definitely wasn’t necessary. Now that
    I’m in pharma, there’s more MPI code, so there’s more of a case
    for it.

    Ultimately, I think you need to run the real benchmarks with real
    code, and as Jason says, work out whether the additional
    complexity and cost of the IB network is worth it for your
    particular workload.  I don’t think the mantra “It’s HPC so it has
    to be Infiniband” is a given.
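
    To be concrete: the OSU micro-benchmarks (osu_latency, osu_bw)
    built against your MPI make that comparison easy. A sketch with
    Open MPI, where the hostnames are placeholders:

        # two ranks on two nodes, forced over plain TCP/Ethernet
        mpirun -np 2 --host nodeA,nodeB --mca pml ob1 --mca btl tcp,self ./osu_latency

        # same pair, letting UCX use the RDMA-capable fabric (IB or RoCE)
        mpirun -np 2 --host nodeA,nodeB --mca pml ucx ./osu_latency

    The gap between those two numbers, and whether it shows up in your
    real codes, is what you're deciding whether to pay for.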

    Tim

--
    *Tim Cutts*

    Scientific Computing Platform Lead

    AstraZeneca

    Find out more about R&D IT Data, Analytics & AI and how we can
    support you by visiting our Service Catalogue
    <https://azcollaboration.sharepoint.com/sites/CMU993>

    *From: *Jason Simms via slurm-users <slurm-users@lists.schedmd.com>
    *Date: *Monday, 26 February 2024 at 01:13
    *To: *Dan Healy <daniel.t.he...@gmail.com>
    *Cc: *slurm-users@lists.schedmd.com <slurm-users@lists.schedmd.com>
    *Subject: *[slurm-users] Re: Question about IB and Ethernet networks

    Hello Daniel,

    In my experience, if you have a high-speed interconnect such as
    IB, you would do IPoIB. You would likely still have a "regular"
    Ethernet connection for management purposes, and yes that means
    both an IB switch and an Ethernet switch, but that switch doesn't
    have to be anything special. Any "real" traffic is routed over IB,
    everything is mounted via IB, etc. That's how the last two
    clusters I've worked with have been configured, and the next one
    will be the same (but will use Omnipath rather than IB). We
    likewise use BeeGFS.
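
    (The IPoIB piece of that is not much work, for what it's worth.
    One way to do it on each node, assuming NetworkManager and an ib0
    interface, with made-up addressing:

        nmcli connection add type infiniband con-name ib0 ifname ib0 \
            ipv4.method manual ipv4.addresses 10.20.0.101/24
        nmcli connection up ib0

    After that the node simply has a second IP network, and storage
    mounts, Slurm traffic, etc. can be pointed at whichever addresses
    you like.)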

    These next comments are perhaps more likely to encounter
    differences of opinion, but I would say that sufficiently fast
    Ethernet is often "good enough" for most workloads (e.g., MPI).
    I'd wager that for all but the most demanding of workloads, it's
    entirely acceptable. You'll also save a bit of money, of course.
    HOWEVER, I do think there is, shall we say, an expectation from
    many researchers that any cluster worth its salt will have some
    kind of fast interconnect, even if at the scale of most on-prem
    work, you might be hard-pressed in real-world conditions to notice
    much of a difference. If you're running jobs that take weeks and
    span hundreds of nodes, the time (and other) savings may add up,
    but if we're talking the difference between a job running on 5
    nodes taking 48 hours vs. slightly less, then?? Your mileage may
    vary, as they say...

    Warmest regards,

    Jason

    On Sun, Feb 25, 2024 at 3:13 PM Dan Healy via slurm-users
    <slurm-users@lists.schedmd.com> wrote:

        Hi Fellow Slurm Users,

        This question is not Slurm-specific, but it might develop into
        that.

        My question relates to understanding how /typical/ HPC clusters
        are designed in terms of networking. To start, is it typical to
        have both a high-speed Ethernet network /and/ an InfiniBand
        network (meaning separate switches and NICs)? I know you can
        easily set up IP over IB, but is IB usually reserved entirely
        for MPI messages? I’m tempted to spec all new HPC clusters with
        only a high-speed (200Gbps) IB network and use IPoIB for all
        Slurm comms with the compute nodes. I plan on using BeeGFS for
        the file system, with RDMA.
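
        For BeeGFS specifically, my assumption is that RDMA is mostly a
        client-side toggle, roughly:

            # /etc/beegfs/beegfs-client.conf
            connUseRDMA        = true
            connInterfacesFile = /etc/beegfs/connInterfacesFile

        with ib0 listed first in that interfaces file so the IB path is
        preferred. Please correct me if that's naive.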

        Just looking for some feedback, please. Is this OK? Is there a
        better way? If yes, please share why it’s better.

        Thanks,

        Daniel Healy




--
    *Jason L. Simms, Ph.D., M.P.H.*

    Manager of Research Computing

    Swarthmore College
    Information Technology Services

    (610) 328-8102

    Schedule a meeting: https://calendly.com/jlsimms


-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
