Re: Side meeting on AIDC

2023-08-15 Thread Hesham ElBakoury
Thanks Jeff for your feedback.
I certainly agree with you about the operational complexity of such
solutions.

If there is a plan to have another AIDC side meeting, we can invite the
proponents of such new solutions (e.g. Cerberus) to discuss their
solutions in detail, including their operational complexity.

It is worth noting that Prof. Torsten Hoefler gave a very interesting
keynote speech on this topic at the FCRC conference last June. You can
access the recording of his talk here:
https://www.youtube.com/watch?v=85p34eBTBWo

Torsten's talk and the two IEEE Computer papers he mentioned at the end of
the talk are really inspiring and insightful.

I strongly recommend inviting Torsten to the next AIDC meeting. He is a
very busy professor, so if there is a plan to have another AIDC meeting, it
would be best to book his time ASAP.

Thanks
Hesham

On Sun, Aug 13, 2023, 7:30 PM Jeff Tantsura wrote:

> Hesham,
>
> A couple of points:
>
> 1. Most proponents of new solutions have never actually built or operated
> an ML cluster; operational complexity is vastly underestimated (if
> considered at all).
>
> 2. Given the focus on the highest possible performance, in most cases ML
> clusters are stand-alone fabrics with physically separated North/South
> (communication with the outside world) and East/West (GPU-GPU) networks,
> and our focus is specifically on the GPU fabrics.
> While the traffic patterns are different, the traffic itself is highly
> homogeneous; for IP it would be RoCEv2 (RDMA over UDP), so you don’t need
> to solve “TCP issues”. The traffic patterns are very well understood and
> usually heavily orchestrated.
>
> 3. They all rely heavily on the notion of a flow, which is not necessarily
> the correct approach. Don’t forget: RDMA (in most operations) is a memory
> block transfer, and within reasonable limits, packets can be placed
> directly into memory independently of the order in which they arrived.
>
> On Aug 13, 2023, at 10:44 AM, Hesham ElBakoury wrote:
>
> In Sigmetrics 2022 the paper "*Cerberus: The Power of Choices in
> Datacenter Topology Design*" was published [
> https://people.csail.mit.edu/ghobadi/papers/cerberus_sigmetrics_2022.pdf].
>
> "This paper uncovered a potential of serving datacenter traffic with the
> switch technology that best matches its structure. By tapping into this
> potential, we developed a solution, Cerberus, which we showed to
> significantly improve throughput"
>
> Perhaps Cerberus is best suited for ML.
>
> Comments?
>
> Hesham
>
> On Fri, Aug 11, 2023, 2:56 AM Hesham ElBakoury wrote:
>
>> The need for a new DC architecture has been the subject of research for
>> quite some time.
>>
>> I recall that Prof. Brighten (UIUC) had a paper on spineless DCs in HotNets
>> 2021 [Spineless Data Centers (acm.org)]. He presented it in the ONUG
>> research track in 2021 [the recording is available here: Spinelessness:
>> The Future of Data Center Networks? - ONUG | ONUG].
>>
>> The abstract of the ONUG talk:
>>
>> "*Leaf-spine and Clos network topologies have become ubiquitous in
>> modern data centers to achieve high throughput for data-intensive
>> applications.  In fact, such designs are not optimal: recent research has
>> developed other topologies, specifically expander graphs, that achieve
>> higher throughput or lower cost, along with potentially easier incremental
>> expansion.  In this talk we’ll explore whether this theoretical performance
>> efficiency can be realized in a practical way to improve enterprise
>> leaf-spine data centers.  This leads to a “spineless” data center, with a
>> single type of switch rather than having separate roles for leafs and
>> spines.  We find that such designs can indeed be more efficient even at
>> small to moderate scale, and we introduce an efficient routing scheme for
>> such networks that uses standard hardware and protocols.  This line of work
>> opens new research directions in topology and routing design that can have
>> significant impact for the most common data centers*."
>>
>> Hesham
>> On 8/9/2023 9:32 PM, Yingzhen Qu wrote:
>>
>> Hi Yao,
>>
>> We used the webex provided by IETF during the side meeting, and the Webex
>> on Jeff's laptop crashed at the end of the meeting. We were told that the
>> recording might be on the IETF chromebook, but haven't heard anything yet.
>>
>> Meanwhile you can have access to all the slides including the ones that
>> didn't get time to present at: Yingzhen-ietf/AIDC-IETF117: This
>> repository is for all the meeting materials. (github.com)
>> 
>>
>> Thanks,
>> Yingzhen
>>
>> On Tue, Aug 8, 2023 at 6:08 PM  wrote:
>>
>>> Hi Yingzhen,
>>>
>>>
>>> Do we have any recording for this meeting?
>>>
>>>
>>> Thanks,

Re: Side meeting on AIDC

2023-08-13 Thread Jeff Tantsura
Hesham,

A couple of points:

1. Most proponents of new solutions have never actually built or operated an
ML cluster; operational complexity is vastly underestimated (if considered at
all).

2. Given the focus on the highest possible performance, in most cases ML
clusters are stand-alone fabrics with physically separated North/South
(communication with the outside world) and East/West (GPU-GPU) networks, and
our focus is specifically on the GPU fabrics.
While the traffic patterns are different, the traffic itself is highly
homogeneous; for IP it would be RoCEv2 (RDMA over UDP), so you don’t need to
solve “TCP issues”. The traffic patterns are very well understood and usually
heavily orchestrated.

3. They all rely heavily on the notion of a flow, which is not necessarily the
correct approach. Don’t forget: RDMA (in most operations) is a memory block
transfer, and within reasonable limits, packets can be placed directly into
memory independently of the order in which they arrived.
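
As a rough illustration of point 3 (a minimal Python sketch, not how any RDMA
NIC is implemented): assume each packet carries the offset at which its
payload belongs in the destination buffer. The (offset, payload) tuple, the
4-byte MTU, and the function names segment, place, and receive are all
hypothetical and are not the RoCEv2 wire format. Under that assumption the
receiver can write each payload straight into place, so arrival order does not
matter:

```python
import random

MTU = 4  # hypothetical payload size per packet, in bytes (assumption)


def segment(message: bytes, mtu: int = MTU):
    """Split a memory block into (offset, payload) packets."""
    return [(off, message[off:off + mtu]) for off in range(0, len(message), mtu)]


def place(buffer: bytearray, packet) -> None:
    """Write one packet's payload at its own offset in the destination buffer."""
    offset, payload = packet
    buffer[offset:offset + len(payload)] = payload


def receive(packets, total_len: int) -> bytes:
    """Place packets in whatever order they arrive; no reordering queue is kept."""
    buffer = bytearray(total_len)
    for packet in packets:
        place(buffer, packet)
    return bytes(buffer)


block = b"gradient shard 0123456789"
packets = segment(block)
random.Random(7).shuffle(packets)  # simulate out-of-order arrival
assert receive(packets, len(block)) == block
```

The only per-packet state the receiver needs is the offset the packet itself
carries, which is one reason strict per-flow ordering may matter less here
than it does for TCP.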

> On Aug 13, 2023, at 10:44 AM, Hesham ElBakoury wrote:
> 
> In Sigmetrics 2022 the paper "Cerberus: The Power of Choices in Datacenter 
> Topology Design" was published 
> [https://people.csail.mit.edu/ghobadi/papers/cerberus_sigmetrics_2022.pdf].
> 
> "This paper uncovered a potential of serving datacenter traffic with the 
> switch technology that best matches its structure. By tapping into this 
> potential, we developed a solution, Cerberus, which we showed to 
> significantly improve throughput"
> 
> Perhaps Cerberus is best suited for ML.
> 
> Comments?
> 
> Hesham
> 
> On Fri, Aug 11, 2023, 2:56 AM Hesham ElBakoury wrote:
>> The need for a new DC architecture has been the subject of research for
>> quite some time.
>> 
>> I recall that Prof. Brighten (UIUC) had a paper on spineless DCs in HotNets
>> 2021 [Spineless Data Centers (acm.org)]. He presented it in the ONUG
>> research track in 2021 [the recording is available here: Spinelessness:
>> The Future of Data Center Networks? - ONUG | ONUG].
>> 
>> The abstract of the ONUG talk:
>> 
>> "Leaf-spine and Clos network topologies have become ubiquitous in modern 
>> data centers to achieve high throughput for data-intensive applications.  In 
>> fact, such designs are not optimal: recent research has developed other 
>> topologies, specifically expander graphs, that achieve higher throughput or 
>> lower cost, along with potentially easier incremental expansion.  In this 
>> talk we’ll explore whether this theoretical performance efficiency can be 
>> realized in a practical way to improve enterprise leaf-spine data centers.  
>> This leads to a “spineless” data center, with a single type of switch rather 
>> than having separate roles for leafs and spines.  We find that such designs 
>> can indeed be more efficient even at small to moderate scale, and we 
>> introduce an efficient routing scheme for such networks that uses standard 
>> hardware and protocols.  This line of work opens new research directions in 
>> topology and routing design that can have significant impact for the most 
>> common data centers."
>> 
>> Hesham
>> 
>> On 8/9/2023 9:32 PM, Yingzhen Qu wrote:
>>> Hi Yao,
>>> 
>>> We used the webex provided by IETF during the side meeting, and the Webex 
>>> on Jeff's laptop crashed at the end of the meeting. We were told that the 
>>> recording might be on the IETF chromebook, but haven't heard anything yet.
>>> 
>>> Meanwhile you can have access to all the slides including the ones that 
>>> didn't get time to present at: Yingzhen-ietf/AIDC-IETF117: This repository 
>>> is for all the meeting materials. (github.com) 
>>> 
>>> 
>>> Thanks,
>>> Yingzhen
>>> 
>>> On Tue, Aug 8, 2023 at 6:08 PM wrote:
>>>> Hi Yingzhen,
>>>> 
>>>> Do we have any recording for this meeting?
>>>> 
>>>> Thanks,
>>>> Yao

___
rtgwg mailing list
rtgwg@ietf.org
https://www.ietf.org/mailman/listinfo/rtgwg


Re: Side meeting on AIDC

2023-08-13 Thread Hesham ElBakoury
In Sigmetrics 2022 the paper "*Cerberus: The Power of Choices in Datacenter
Topology Design*" was published [
https://people.csail.mit.edu/ghobadi/papers/cerberus_sigmetrics_2022.pdf].

"This paper uncovered a potential of serving datacenter traffic with the
switch technology that best matches its structure. By tapping into this
potential, we developed a solution, Cerberus, which we showed to
significantly improve throughput"

Perhaps Cerberus is best suited for ML.

Comments?

Hesham

On Fri, Aug 11, 2023, 2:56 AM Hesham ElBakoury wrote:

> The need for a new DC architecture has been the subject of research for
> quite some time.
>
> I recall that Prof. Brighten (UIUC) had a paper on spineless DCs in HotNets
> 2021 [Spineless Data Centers (acm.org)]. He presented it in the ONUG
> research track in 2021 [the recording is available here: Spinelessness:
> The Future of Data Center Networks? - ONUG | ONUG].
>
> The abstract of the ONUG talk:
>
> "*Leaf-spine and Clos network topologies have become ubiquitous in modern
> data centers to achieve high throughput for data-intensive applications.
> In fact, such designs are not optimal: recent research has developed other
> topologies, specifically expander graphs, that achieve higher throughput or
> lower cost, along with potentially easier incremental expansion.  In this
> talk we’ll explore whether this theoretical performance efficiency can be
> realized in a practical way to improve enterprise leaf-spine data centers.
> This leads to a “spineless” data center, with a single type of switch
> rather than having separate roles for leafs and spines.  We find that such
> designs can indeed be more efficient even at small to moderate scale, and
> we introduce an efficient routing scheme for such networks that uses
> standard hardware and protocols.  This line of work opens new
> research directions in topology and routing design that can have
> significant impact for the most common data centers*."
>
> Hesham
> On 8/9/2023 9:32 PM, Yingzhen Qu wrote:
>
> Hi Yao,
>
> We used the webex provided by IETF during the side meeting, and the Webex
> on Jeff's laptop crashed at the end of the meeting. We were told that the
> recording might be on the IETF chromebook, but haven't heard anything yet.
>
> Meanwhile you can have access to all the slides including the ones that
> didn't get time to present at: Yingzhen-ietf/AIDC-IETF117: This
> repository is for all the meeting materials. (github.com)
> 
>
> Thanks,
> Yingzhen
>
> On Tue, Aug 8, 2023 at 6:08 PM  wrote:
>
>> Hi Yingzhen,
>>
>>
>> Do we have any recording for this meeting?
>>
>>
>> Thanks,
>>
>> Yao


Re: Side meeting on AIDC

2023-08-11 Thread Hesham ElBakoury
The need for a new DC architecture has been the subject of research for
quite some time.

I recall that Prof. Brighten (UIUC) had a paper on spineless DCs in
HotNets 2021 [Spineless Data Centers (acm.org)]. He presented it in the
ONUG research track in 2021 [the recording is available here:
Spinelessness: The Future of Data Center Networks? - ONUG | ONUG].

The abstract of the ONUG talk:

"/Leaf-spine and Clos network topologies have become ubiquitous in 
modern data centers to achieve high throughput for data-intensive 
applications.  In fact, such designs are not optimal: recent research 
has developed other topologies, specifically expander graphs, that 
achieve higher throughput or lower cost, along with potentially easier 
incremental expansion.  In this talk we’ll explore whether this 
theoretical performance efficiency can be realized in a practical way to 
improve enterprise leaf-spine data centers.  This leads to a “spineless” 
data center, with a single type of switch rather than having separate 
roles for leafs and spines.  We find that such designs can indeed be 
more efficient even at small to moderate scale, and we introduce an 
efficient routing scheme for such networks that uses standard hardware 
and protocols.  This line of work opens new research directions in 
topology and routing design that can have significant impact for the 
most common data centers/."


Hesham

On 8/9/2023 9:32 PM, Yingzhen Qu wrote:
Hi Yao,

We used the webex provided by IETF during the side meeting, and the 
Webex on Jeff's laptop crashed at the end of the meeting. We were told 
that the recording might be on the IETF chromebook, but haven't heard 
anything yet.


Meanwhile you can have access to all the slides including the ones 
that didn't get time to present at: Yingzhen-ietf/AIDC-IETF117: This 
repository is for all the meeting materials. (github.com) 



Thanks,
Yingzhen

On Tue, Aug 8, 2023 at 6:08 PM  wrote:

Hi Yingzhen,


Do we have any recording for this meeting?


Thanks,

Yao


Re: Side meeting on AIDC

2023-08-10 Thread liu.yao71
Hi Yingzhen,

Thanks for getting back to me and for the information you provided.

Regards,
Yao

Original
From: Yingzhen Qu
To: 刘尧00165286;
Cc: rtgwg@ietf.org;
Date: August 10, 2023, 12:32
Subject: Re: Side meeting on AIDC

Hi Yao,
We used the webex provided by IETF during the side meeting, and the Webex on 
Jeff's laptop crashed at the end of the meeting. We were told that the 
recording might be on the IETF chromebook, but haven't heard anything yet.

Meanwhile you can have access to all the slides including the ones that didn't 
get time to present at: Yingzhen-ietf/AIDC-IETF117: This repository is for all 
the meeting materials. (github.com)

Thanks,
Yingzhen

On Tue, Aug 8, 2023 at 6:08 PM wrote:

Hi Yingzhen,

Do we have any recording for this meeting?

Thanks,
Yao


Re: Side meeting on AIDC

2023-08-09 Thread Yingzhen Qu
Hi Yao,

We used the Webex provided by the IETF during the side meeting, and the Webex
on Jeff's laptop crashed at the end of the meeting. We were told that the
recording might be on the IETF Chromebook, but we haven't heard anything yet.

Meanwhile, you can access all the slides, including the ones there was no
time to present, at: Yingzhen-ietf/AIDC-IETF117: This repository
is for all the meeting materials. (github.com)

Thanks,
Yingzhen

On Tue, Aug 8, 2023 at 6:08 PM  wrote:

> Hi Yingzhen,
>
>
> Do we have any recording for this meeting?
>
>
> Thanks,
>
> Yao


Re: Side meeting on AIDC

2023-08-08 Thread liu.yao71
Hi Yingzhen,

Do we have any recording for this meeting?

Thanks,
Yao


Re: Side meeting on AIDC

2023-07-22 Thread Jeff Tantsura
Dear presenters,

If you plan to use slides, please share them by Monday morning at the latest.

Thanks,
Yingzhen and Jeff

> On Jul 20, 2023, at 12:23 PM, Yingzhen Qu wrote:
> 
> Hi,
> 
> (RTGWG Chair hats off)
> 
> We're going to have a side meeting on "Advancing Technologies for AI/ML 
> Clusters and High Performance Data Centers". 
> 
> The meeting will be held on Monday, July 24, 15:30 - 17:00, in Continental 2-3.
> (Please refer to the side meeting wiki: IETF 117 Side Meeting Signups | IETF
> Community Wiki.)
> 
> In recent times, there has been a significant rise in the popularity of 
> Artificial Intelligence and Machine Learning, leading to the design of data 
> centers specifically tailored for large-scale AI model training. These 
> advanced networks, dedicated to High-Performance Computing, AI, and ML, 
> demand deterministic high-bandwidth and low latency connections between 
> numerous processing nodes. They also impose unique requirements for network 
> topologies and interconnecting technologies.
> 
> RFC 7938, Use of BGP for Routing in Large-Scale Data Centers, was published 
> several years ago. In this ever-evolving landscape of data centers, it is 
> crucial that we stay up-to-date with the latest technologies deployed within 
> them.
> 
> The primary goals of this meeting will be to identify challenges and 
> requirements associated with these advanced data center networks, as well as 
> to explore new research and standardization opportunities.
> 
> It is important to note that this meeting will be a brainstorming session and 
> open-ended in nature. The intention is not to arrive at immediate conclusions 
> or define specific tasks but rather to foster discussions that stimulate 
> innovative thinking.
> 
> The meeting materials, including the agenda and the Zoom link, are available
> in this GitHub repository: Yingzhen-ietf/AIDC-IETF117: This repository is for
> all the meeting materials. (github.com)
> Slides will be uploaded soon.
> 
> If you have any questions or comments, please contact us.
> 
> Thanks,
> Jeff and Yingzhen
