RE: Need your help to make sure the draft-ietf-rtgwg-net2cloud-problem-statement readability is good.

2023-08-23 Thread Linda Dunbar
Joel,

For your last unresolved comment, the following text has been added to explain 
TN-1 in Figure 1:

“Figure 1 below shows an example in which a portion of the workloads belonging 
to one tenant (e.g., TN-1) is accessible via a virtual router connected to the 
AWS Internet Gateway, some workloads of the same tenant (TN-1) are accessible 
via the AWS vGW, and others are accessible via AWS Direct Connect. The 
workloads belonging to one tenant can communicate within a Cloud DC via 
virtual routers (e.g., vR1, vR2).”
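
For illustration only (the prefixes and names in the sketch below are 
hypothetical and not taken from the draft), the split could be summarized as:

# Hypothetical illustration of Figure 1's point: one tenant's (TN-1) workloads
# are split across different Cloud DC attachment methods, yet remain one tenant.
tn1_reachability = {
    "10.1.1.0/24": "AWS Internet Gateway (via vR1)",  # Internet-facing workloads
    "10.1.2.0/24": "AWS vGW, IPsec VPN (via vR2)",    # VPN-attached workloads
    "10.1.3.0/24": "AWS Direct Connect",              # private-circuit workloads
}

# Inside the Cloud DC, all parts of TN-1 can still reach each other through
# the virtual routers (e.g., vR1, vR2), regardless of how they are reached
# from outside.
for prefix, path in tn1_reachability.items():
    print(f"TN-1 {prefix} is reachable externally via {path}")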

Thank you.

Linda

From: Joel Halpern 
Sent: Monday, August 21, 2023 11:34 PM
To: Linda Dunbar 
Cc: rtgwg-chairs ; 
draft-ietf-rtgwg-net2cloud-problem-statement@ietf.org; rtgwg@ietf.org
Subject: Re: Need your help to make sure the 
draft-ietf-rtgwg-net2cloud-problem-statement readability is good.


Figure 1 in section 4.1 could use some clarification.  It is unclear if the two 
TN-1 are the same networks, or are intended to be different parts of the tenant 
network.  And similarly for the two TN-2.  It is also unclear why the top 
portion is even included in the figure, since it does not seem to have anything 
to do with the data center connectivity task?  Wouldn't it be simpler to just 
note that the diagram only shows part of the tenant infrastructure, and leave 
out irrelevancies?
[Linda] The two TN-1 are intended to be different parts of a single tenant 
network.  Is adding the following good enough?
“TN: Tenant Network. One TN (e.g., TN-1) can be attached to both vR1 and vR2.”
While that at least makes the meaning of the figure clear, I am still left 
confused as to why the upper part of the figure is needed.
Mainly to show that one tenant can have some routes reachable via the Internet 
GW and others reachable via the Virtual GW (IPsec), and that routes belonging 
to one tenant can be connected by vRouters.
You may want to think about ways to better explain your point, since I 
missed it. 


Re: Need your help to make sure the draft-ietf-rtgwg-net2cloud-problem-statement readability is good.

2023-08-22 Thread Joel Halpern

Thank you.

Joel

On 8/22/2023 11:40 PM, Linda Dunbar wrote:


Joel,

Thank you very much for your suggestion. We will take your suggested 
wording into the document:


“When a site failure occurs, many instances can be impacted. When the 
impacted instances’ IP prefixes in a Cloud DC are not well aggregated, which 
is very common, a single site failure can trigger a huge number of BGP UPDATE 
messages. There are proposals, such as [METADATA-PATH], to enhance BGP 
advertisements to address this problem.”

Linda

*From:* Joel Halpern 
*Sent:* Tuesday, August 22, 2023 6:03 PM
*To:* Linda Dunbar 
*Cc:* rtgwg-chairs ; 
draft-ietf-rtgwg-net2cloud-problem-statement@ietf.org; rtgwg@ietf.org
*Subject:* Re: Need your help to make sure the 
draft-ietf-rtgwg-net2cloud-problem-statement readability is good.


I think I now understand your point.  As a problem statement draft, I 
would replace the detailed description of the specific proposal with a 
more generic "There are proposals to enhance BGP advertisements to 
address this problem."


Yours,

Joel

On 8/22/2023 6:34 PM, Linda Dunbar wrote:

Joel,

I see your points. Please see my explanation below.

*From:*Joel Halpern 

*Sent:* Monday, August 21, 2023 11:34 PM
*To:* Linda Dunbar 

*Cc:* rtgwg-chairs 
;
draft-ietf-rtgwg-net2cloud-problem-statement@ietf.org;
rtgwg@ietf.org
*Subject:* Re: Need your help to make sure the
draft-ietf-rtgwg-net2cloud-problem-statement readability is good.

Thank you Linda.  Trimmed the agreements, including acceptable
text from your reply. Leaving the two points that can benefit from
a little more tuning.

Marked 

Yours,

Joel

On 8/22/2023 12:12 AM, Linda Dunbar wrote:

Similarly, section 3.2 looks like it could apply to any operator.
The reference to the presence or absence of IGPs seems largely
irrelevant to the question of how partial failures of a facility
are detected and dealt with.

[Linda] Two reasons why the site failure described in Section 3.2
does not apply to other networks:

 1. One DC can have many server racks concentrated in a small area,
    all of which can fail from a single event. In contrast, a regular
    network failure at one location only impacts the routers at that
    location, which quickly triggers switching the affected services
    to protection paths.
 2. Regular networks run an IGP, which can quickly propagate internal
    fiber-cut failures to the edge, whereas many DCs don’t run an IGP.

Given that even a data center has to deal with internal
failures, and that even traditional ISPs have to deal with
partitioning failures, I don't think the distinction you are
drawing in this section really exists.  If it does, you need to
provide stronger justification.  Also, not all public DCs have
chosen to use just BGP, although I grant that many have. I don't
think you want to argue that the folks who have chosen to use BGP
are wrong.  

Are you referring to Network-Partitioning Failures in Cloud Systems?

Traditional ISPs don’t host end services; they are responsible for
transporting packets, so a protection path can reroute the packets. But a
Cloud DC site/PoD failure causes all the hosted prefixes to become
unreachable.

If a DC site fails, the services fail too.  Yes, the DC
operator has to reinstantiate them.  But that is way outside our
scope.  To the degree that they can recover by rerouting to other
instances (whether using anycast or some other trick), it looks
just like routing around failures in other cases, which BGP and
IGPs can do.  I am still not seeing how this justifies any special
mechanisms.


You are correct that the protection is the same as in regular ISP
networks.

The paragraph is intended to say the following:

When a site failure occurs, many instances can be impacted. When
the impacted instances’ IP prefixes in a Cloud DC are not well
aggregated, which is very common, a single site failure can
trigger a huge number of BGP UPDATE messages. Instead of many BGP
UPDATE messages to the ingress routers for all the impacted
instances, [METADATA-PATH] proposes a single BGP UPDATE indicating
the site failure. The ingress routers can then switch over the
traffic for all the instances associated with the failed site.
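
As a rough sketch of the scaling difference (the prefix count and names below 
are invented, and this is not the actual [METADATA-PATH] encoding), the issue 
is essentially message count:

# Toy sketch of the control-plane load: per-prefix withdrawals vs. a single
# site-level failure indication. All numbers are hypothetical.
impacted_prefixes = [f"10.2.{i}.0/24" for i in range(500)]  # 500 unaggregated prefixes in one failed site

# Without aggregation or a site-level signal, the failure produces roughly one
# BGP UPDATE (withdraw) per impacted prefix toward each ingress router.
per_prefix_updates = len(impacted_prefixes)  # 500 messages

# With a single site-failure indication, an ingress router learns of the
# failure once and can switch away all instances associated with that site.
site_level_updates = 1

print(f"per-prefix: {per_prefix_updates} UPDATEs vs. site-level: {site_level_updates} UPDATE")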






RE: Need your help to make sure the draft-ietf-rtgwg-net2cloud-problem-statement readability is good.

2023-08-22 Thread Linda Dunbar
Joel,

Thank you very much for your suggestion. We will take your suggested wording 
into the document:

“When a site failure occurs, many instances can be impacted. When the 
impacted instances’ IP prefixes in a Cloud DC are not well aggregated, which 
is very common, a single site failure can trigger a huge number of BGP UPDATE 
messages. There are proposals, such as [METADATA-PATH], to enhance BGP 
advertisements to address this problem.”

Linda

From: Joel Halpern 
Sent: Tuesday, August 22, 2023 6:03 PM
To: Linda Dunbar 
Cc: rtgwg-chairs ; 
draft-ietf-rtgwg-net2cloud-problem-statement@ietf.org; rtgwg@ietf.org
Subject: Re: Need your help to make sure the 
draft-ietf-rtgwg-net2cloud-problem-statement readability is good.


I think I now understand your point.  As a problem statement draft, I would 
replace the detailed description of the specific proposal with a more generic 
"There are proposals to enhance BGP advertisements to address this problem."

Yours,

Joel
On 8/22/2023 6:34 PM, Linda Dunbar wrote:
Joel,

I see your points. Please see my explanation below.



From: Joel Halpern 

Sent: Monday, August 21, 2023 11:34 PM
To: Linda Dunbar 
Cc: rtgwg-chairs ; 
draft-ietf-rtgwg-net2cloud-problem-statement@ietf.org;
 rtgwg@ietf.org
Subject: Re: Need your help to make sure the 
draft-ietf-rtgwg-net2cloud-problem-statement readability is good.

Thank you Linda.  Trimmed the agreements, including acceptable text from your 
reply.  Leaving the two points that can benefit from a little more tuning.
Marked 
Yours,
Joel
On 8/22/2023 12:12 AM, Linda Dunbar wrote:

Similarly, section 3.2 looks like it could apply to any operator.  The 
reference to the presence or absence of IGPs seems largely irrelevant to the 
question of how partial failures of a facility are detected and dealt with.
[Linda] Two reasons why the site failure described in Section 3.2 does not 
apply to other networks:

  1.  One DC can have many server racks concentrated in a small area, all of 
which can fail from a single event. In contrast, a regular network failure at 
one location only impacts the routers at that location, which quickly 
triggers switching the affected services to protection paths.
  2.  Regular networks run an IGP, which can quickly propagate internal 
fiber-cut failures to the edge, whereas many DCs don’t run an IGP.
Given that even a data center has to deal with internal failures, and that 
even traditional ISPs have to deal with partitioning failures, I don't think 
the distinction you are drawing in this section really exists.  If it does, you 
need to provide stronger justification.  Also, not all public DCs have chosen 
to use just BGP, although I grant that many have. I don't think you want to 
argue that the folks who have chosen to use BGP are wrong.  

Are you referring to Network-Partitioning Failures in Cloud Systems?
Traditional ISPs don’t host end services; they are responsible for 
transporting packets, so a protection path can reroute the packets. But a 
Cloud DC site/PoD failure causes all the hosted prefixes to become unreachable.
If a DC site fails, the services fail too.  Yes, the DC operator has 
to reinstantiate them.  But that is way outside our scope.  To the degree that 
they can recover by rerouting to other instances (whether using anycast or 
some other trick), it looks just like routing around failures in other cases, 
which BGP and IGPs can do.  I am still not seeing how this justifies any 
special mechanisms.

You are correct that the protection is the same as in regular ISP networks.

The paragraph is intended to say the following:
When a site failure occurs, many instances can be impacted. When the impacted 
instances’ IP prefixes in a Cloud DC are not well aggregated, which is very 
common, a single site failure can trigger a huge number of BGP UPDATE 
messages. Instead of many BGP UPDATE messages to the ingress routers for all 
the impacted instances, [METADATA-PATH] proposes a single BGP UPDATE 
indicating the site failure. The ingress routers can then switch over the 
traffic for all the instances associated with the failed site.







Re: Need your help to make sure the draft-ietf-rtgwg-net2cloud-problem-statement readability is good.

2023-08-22 Thread Joel Halpern
I think I now understand your point.  As a problem statement draft, I 
would replace the detailed description of the specific proposal with a 
more generic "There are proposals to enhance BGP advertisements to 
address this problem."


Yours,

Joel

On 8/22/2023 6:34 PM, Linda Dunbar wrote:

Joel,
I see your points. Please see my explanation below.
*From:* Joel Halpern 
*Sent:* Monday, August 21, 2023 11:34 PM
*To:* Linda Dunbar 
*Cc:* rtgwg-chairs ; 
draft-ietf-rtgwg-net2cloud-problem-statement@ietf.org; rtgwg@ietf.org
*Subject:* Re: Need your help to make sure the 
draft-ietf-rtgwg-net2cloud-problem-statement readability is good.
Thank you Linda.  Trimmed the agreements, including acceptable text 
from your reply.  Leaving the two points that can benefit from a 
little more tuning.

Marked 
Yours,
Joel
On 8/22/2023 12:12 AM, Linda Dunbar wrote:

Similarly, section 3.2 looks like it could apply to any operator. The 
reference to the presence or absence of IGPs seems largely irrelevant 
to the question of how partial failures of a facility are detected and 
dealt with.
[Linda] Two reasons why the site failure described in Section 3.2 does 
not apply to other networks:

 1. One DC can have many server racks concentrated in a small area,
    all of which can fail from a single event. In contrast, a regular
    network failure at one location only impacts the routers at that
    location, which quickly triggers switching the affected services
    to protection paths.
 2. Regular networks run an IGP, which can quickly propagate internal
    fiber-cut failures to the edge, whereas many DCs don’t run an IGP.

Given that even a data center has to deal with internal failures, 
and that even traditional ISPs have to deal with partitioning 
failures, I don't think the distinction you are drawing in this 
section really exists.  If it does, you need to provide stronger 
justification.  Also, not all public DCs have chosen to use just BGP, 
although I grant that many have. I don't think you want to argue that 
the folks who have chosen to use BGP are wrong.  

Are you referring to Network-Partitioning Failures in Cloud Systems?
Traditional ISPs don’t host end services; they are responsible for 
transporting packets, so a protection path can reroute the packets. But a 
Cloud DC site/PoD failure causes all the hosted prefixes to become 
unreachable.
If a DC site fails, the services fail too.  Yes, the DC 
operator has to reinstantiate them.  But that is way outside our 
scope. To the degree that they can recover by rerouting to other 
instances (whether using anycast or some other trick), it looks just 
like routing around failures in other cases, which BGP and IGPs can 
do.  I am still not seeing how this justifies any special mechanisms.



You are correct that the protection is the same as in regular ISP 
networks.

The paragraph is intended to say the following:
When a site failure occurs, many instances can be impacted. When the 
impacted instances’ IP prefixes in a Cloud DC are not well aggregated, 
which is very common, a single site failure can trigger a huge number 
of BGP UPDATE messages. Instead of many BGP UPDATE messages to the 
ingress routers for all the impacted instances, [METADATA-PATH] 
proposes a single BGP UPDATE indicating the site failure. The ingress 
routers can then switch over the traffic for all the instances 
associated with the failed site.



RE: Need your help to make sure the draft-ietf-rtgwg-net2cloud-problem-statement readability is good.

2023-08-22 Thread Linda Dunbar
Joel,

I see your points. Please see my explanation below.



From: Joel Halpern 
Sent: Monday, August 21, 2023 11:34 PM
To: Linda Dunbar 
Cc: rtgwg-chairs ; 
draft-ietf-rtgwg-net2cloud-problem-statement@ietf.org; rtgwg@ietf.org
Subject: Re: Need your help to make sure the 
draft-ietf-rtgwg-net2cloud-problem-statement readability is good.


Thank you Linda.  Trimmed the agreements, including acceptable text from your 
reply.  Leaving the two points that can benefit from a little more tuning.
Marked 
Yours,
Joel
On 8/22/2023 12:12 AM, Linda Dunbar wrote:


Similarly, section 3.2 looks like it could apply to any operator.  The 
reference to the presence or absence of IGPs seems largely irrelevant to the 
question of how partial failures of a facility are detected and dealt with.
[Linda] Two reasons why the site failure described in Section 3.2 does not 
apply to other networks:
1.  One DC can have many server racks concentrated in a small area, all of 
which can fail from a single event. In contrast, a regular network failure at 
one location only impacts the routers at that location, which quickly 
triggers switching the affected services to protection paths.
2.  Regular networks run an IGP, which can quickly propagate internal 
fiber-cut failures to the edge, whereas many DCs don’t run an IGP.
Given that even a data center has to deal with internal failures, and that 
even traditional ISPs have to deal with partitioning failures, I don't think 
the distinction you are drawing in this section really exists.  If it does, you 
need to provide stronger justification.  Also, not all public DCs have chosen 
to use just BGP, although I grant that many have. I don't think you want to 
argue that the folks who have chosen to use BGP are wrong.  

Are you referring to Network-Partitioning Failures in Cloud Systems?
Traditional ISPs don’t host end services; they are responsible for 
transporting packets, so a protection path can reroute the packets. But a 
Cloud DC site/PoD failure causes all the hosted prefixes to become unreachable.
If a DC site fails, the services fail too.  Yes, the DC operator has 
to reinstantiate them.  But that is way outside our scope.  To the degree that 
they can recover by rerouting to other instances (whether using anycast or 
some other trick), it looks just like routing around failures in other cases, 
which BGP and IGPs can do.  I am still not seeing how this justifies any 
special mechanisms.

You are correct that the protection is the same as in regular ISP networks.

The paragraph is intended to say the following:
  When a site failure occurs, many instances can be impacted. When the 
impacted instances’ IP prefixes in a Cloud DC are not well aggregated, which 
is very common, a single site failure can trigger a huge number of BGP UPDATE 
messages. Instead of many BGP UPDATE messages to the ingress routers for all 
the impacted instances, [METADATA-PATH] proposes a single BGP UPDATE 
indicating the site failure. The ingress routers can then switch over the 
traffic for all the instances associated with the failed site.







Re: Need your help to make sure the draft-ietf-rtgwg-net2cloud-problem-statement readability is good.

2023-08-21 Thread Joel Halpern
Thank you Linda.  Trimmed the agreements, including acceptable text from 
your reply.  Leaving the two points that can benefit from a little more 
tuning.


Marked 

Yours,

Joel

On 8/22/2023 12:12 AM, Linda Dunbar wrote:



Similarly, section 3.2 looks like it could apply to any operator.
The reference to the presence or absence of IGPs seems largely
irrelevant to the question of how partial failures of a facility
are detected and dealt with.

[Linda] Two reasons why the site failure described in Section 3.2
does not apply to other networks:

  * One DC can have many server racks concentrated in a small area,
    all of which can fail from a single event. In contrast, a regular
    network failure at one location only impacts the routers at that
    location, which quickly triggers switching the affected services
    to protection paths.
  * Regular networks run an IGP, which can quickly propagate internal
    fiber-cut failures to the edge, whereas many DCs don’t run an IGP.

Given that even a data center has to deal with internal failures, 
and that even traditional ISPs have to deal with partitioning 
failures, I don't think the distinction you are drawing in this 
section really exists.  If it does, you need to provide stronger 
justification.  Also, not all public DCs have chosen to use just BGP, 
although I grant that many have. I don't think you want to argue that 
the folks who have chosen to use BGP are wrong.  


Are you referring to Network-Partitioning Failures in Cloud Systems?

Traditional ISPs don’t host end services; they are responsible for 
transporting packets, so a protection path can reroute the packets. But a 
Cloud DC site/PoD failure causes all the hosted prefixes to become 
unreachable.


If a DC site fails, the services fail too.  Yes, the DC 
operator has to reinstantiate them.  But that is way outside our scope.  
To the degree that they can recover by rerouting to other instances 
(whether using anycast or some other trick), it looks just like routing 
around failures in other cases, which BGP and IGPs can do.  I am still 
not seeing how this justifies any special mechanisms.


Figure 1 in section 4.1 could use some clarification.  It is
unclear if the two TN-1 are the same networks, or are intended to
be different parts of the tenant network.  And similarly for the
two TN-2.  It is also unclear why the top portion is even included
in the figure, since it does not seem to have anything to do with
the data center connectivity task?  Wouldn't it be simpler to just
note that the diagram only shows part of the tenant
infrastructure, and leave out irrelevancies?

[Linda] The two TN-1 are intended to be different parts of a
single tenant network.  Is adding the following good enough?

“TN: Tenant Network. One TN (e.g., TN-1) can be attached to both
vR1 and vR2.”

While that at least makes the meaning of the figure clear, I am 
still left confused as to why the upper part of the figure is 
needed.


Mainly to show that one tenant can have some routes reachable via the 
Internet GW and others reachable via the Virtual GW (IPsec), and that 
routes belonging to one tenant can be connected by vRouters.


You may want to think about ways to better explain your point, 
since I missed it. 


RE: Need your help to make sure the draft-ietf-rtgwg-net2cloud-problem-statement readability is good.

2023-08-21 Thread Linda Dunbar
Joel,

Thank you very much for the quick feedback.
My questions and replies are inserted below with   (in purple text).

Linda

From: Joel Halpern 
Sent: Monday, August 21, 2023 6:15 PM
To: Linda Dunbar 
Cc: rtgwg-chairs ; 
draft-ietf-rtgwg-net2cloud-problem-statement@ietf.org
Subject: Re: Need your help to make sure the 
draft-ietf-rtgwg-net2cloud-problem-statement readability is good.


Thank you for responding so promptly upon your return from PTO.  (I should take 
more time off myself.)

I will annotate in line, with .  (The current IETF discussion about 
differing markup techniques creating difficulty in following responses 
exemplifies why I have adopted this practice.)

For this version, I will include agreements.  Feel free to remove those for 
followup.

Yours,

Joel
On 8/21/2023 6:45 PM, Linda Dunbar wrote:
Joel,

Thank you very much for the valuable feedback. Sorry, I was on vacation without 
internet last week and just got around to studying your comments.

Changes to the document to address your comments and questions are inserted 
below.

Attached is the document with change bars enabled. Once it is okay with you, we 
will upload it.
Linda

From: Joel Halpern 
Sent: Monday, August 14, 2023 3:26 PM
To: Linda Dunbar 
Cc: rtgwg-chairs 
Subject: Re: Need your help to make sure the 
draft-ietf-rtgwg-net2cloud-problem-statement readability is good.

I have read over the draft.  The following comments may be helpful to you.
Major:
If you are going to use the term SD-WAN as central to the definition of 
controller, you need to provide a citation and definition of SD-WAN.
[Linda] As SD-WAN is so widely used, using any vendor's SD-WAN definition is 
inappropriate. How about using Gartner's SD-WAN definition or MEF's SD-WAN? Do 
you have any preference?
Gartner's SD-WAN definition 
(https://www.gartner.com/en/information-technology/glossary/software-defined-wan-sd-wan#:~:text=Software%2DDefined%20WAN%20(SD%2DWAN),-SD%2DWAN%20solutions=SD%2DWAN%20provides%20dynamic%2C%20policy,as%20WAN%20optimization%20and%20firewalls.
 )
"SD-WAN provides dynamic, policy-based, application path selection across 
multiple WAN connections and supports service chaining for additional services 
such as WAN optimization and firewalls."

MEF (70.1) has the definition for SD-WAN Services 
(https://www.mef.net/wp-content/uploads/MEF_70.1.pdf ):
"An overlay connectivity service that optimizes transport of IP Packets over 
one or more Underlay Connectivity Services by recognizing applications 
(Application Flows) and determining forwarding behavior by applying Policies to 
them."
I slightly prefer the MEF definition, but could live with either one.  My 
concern is merely that there be a definition. 
Will use MEF's definition then.
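
To make the MEF wording concrete, a minimal sketch (the application names, 
underlay names, and policy values below are made up for illustration) of 
per-application-flow, policy-driven path selection could look like:

# Hypothetical policy table: forwarding behavior is chosen per recognized
# Application Flow across multiple underlay connectivity services.
policies = {
    "voip": "mpls",         # latency-sensitive flows pinned to the MPLS underlay
    "backup": "broadband",  # bulk, cost-sensitive flows use the broadband underlay
}

def select_underlay(app_flow: str) -> str:
    # Fall back to a default Internet underlay when no policy matches the flow.
    return policies.get(app_flow, "internet")

assert select_underlay("voip") == "mpls"
assert select_underlay("web") == "internet"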

Also, if you are going to claim that controller is interchangeable with SD-WAN 
controller you need to explain why.   The definition seems to imply control 
over something very specific, whereas data center controllers, and even SDN 
Controllers, mean something far more general.
[Linda] As the section on "controller" has been removed in a previous revision 
of the document, the definition is no longer needed. We can remove the 
definition in the next revision (-29). Does that address your concern?

That works fine, thank you. 
I find the beginning of section 3.1 rather confusing.  You seem to be trying to 
distinguish between classical ISP peering policies and Public Data Center 
peering policies.  But the text simply does not match the broad range of 
practices in either category.  The issues that are then discussed do not seem 
to be specific to Cloud Data Centers.  They look like advice that should be 
vetted with the IDR working group, and apply to many different kinds of 
operators.
[Linda] This document is intended to describe the network-related problems for 
connecting to public (Cloud) DCs and mitigation practices. Several IETF 
solution drafts are being proposed in the relevant WGs, such as in 
draft-ietf-idr-sdwan-edge-discovery, draft-ietf-idr-5g-edge-service-metadata, 
draft-ietf-bess-secure-evpn, draft-dmk-rtgwg-multisegment-sdwan, etc. Having 
one document describing the problems and referencing all the relevant solutions 
being developed by IETF can make it easier for the implementers, even though 
some of the solutions can apply to general provider networks in addition to the 
Public DCs.
Many traditional ISPs end up peering with a lot of people.  I think the 
difference is subtler.   I think the argument you are trying to make is "Where 
traditional ISPs view peering as a means to improve network operations, Public 
Data Centers which offer direct peering view that peering