Re: Global Akamai Outage

2021-07-27 Thread Lukas Tribus
Hello,


On Tue, 27 Jul 2021 at 21:02, heasley  wrote:
> > But I have to emphasize that all those are just examples. Unknown bugs
> > or corner cases can lead to similar behavior in "all in one" daemons
> > like Fort and Routinator. That's why specific improvements absolutely
> > do not mean we don't have to monitor the RTR servers.
>
> I am not convinced that I want the RTR server to be any smarter than
> necessary, and I think expiration handling is too smart.  I want it to
> load the VRPs provided and serve them, no more.
>
> Leave expiration to the validator and monitoring of both to the NMS and
> other means.

While I'm all for KISS, the expiration feature makes sure that the
cryptographic validity of the ROAs is respected not only on the
validator, but also on the RTR server. This is necessary because
there is nothing in the RTR protocol that indicates the expiration;
this change brings it at least into the JSON exchange between the
validator and the RTR server.

It's like TTL in DNS, and it's about respecting the wishes of the
authority (the CA and the ROA resource holder).
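
To make this concrete, here is a minimal sketch (Python) of what an
expires-aware RTR server would do when it loads the validator's JSON
export. It assumes the export carries a per-VRP Unix timestamp named
"expires" (the rpki-client 7.1 attribute discussed later in this
thread); the file path and field names are illustrative assumptions,
not authoritative.

#!/usr/bin/env python3
"""Sketch: drop VRPs whose 'expires' timestamp has already passed."""
import json
import sys
import time

def load_fresh_vrps(path):
    """Return only the VRPs that are still cryptographically valid."""
    now = time.time()
    with open(path) as fh:
        data = json.load(fh)
    fresh = []
    for vrp in data.get("roas", []):
        # A VRP without an 'expires' field is kept; an expired one is
        # dropped, mirroring the TTL-like behaviour described above.
        if vrp.get("expires", float("inf")) > now:
            fresh.append(vrp)
    return fresh

if __name__ == "__main__":
    # Example invocation; point this at the JSON your validator writes.
    vrps = load_fresh_vrps(sys.argv[1] if len(sys.argv) > 1 else "vrps.json")
    print(f"{len(vrps)} non-expired VRPs loaded")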


> The delegations should not be changing quickly[1] enough

How do you come to this conclusion? If I decide I'd like to originate
a /24 out of my aggregate, for DDoS mitigation purposes, why shouldn't
I be able to update my ROA and expect quasi-complete convergence in 1
or 2 hours?


> for me to prefer expiration over the grace period to correct a validator
> problem.  That does not prevent an operator from using other means to
> share fate; e.g., if the validator fails completely for 2 hours, stop
> the RTR server.
>
> I perceive this to be choosing stability in the RTR sessions over
> timeliness of updates.  And, if a 15 - 30 minute polling interval is
> reasonable, why isn't 8 - 24 hours?

Well for one, I'd like my ROAs to propagate in 1 or 2 hours. If I need
to wait for 24 hours, then this could cause operational issues for me
(the DDoS mitigation case above for example, or just any other normal
routing change).

The entire RPKI system is designed to fail open, so if you have
multiple failures and *all* your RTR servers go down, the worst case
is that the routes on the BGP routers turn NotFound and you lose the
benefit of RPKI validation. It's *way* *way* more harmful to have
obsolete VRPs on your routers. If it's just a few hours, the impact
will probably not be catastrophic. But what if it's 36 hours, or 72
hours? What if RPKI validation started failing 2 weeks ago, when Jerry
from IT ("the Linux guy") went on vacation?

On the other hand, if only one (of multiple) validator/RTR instances
has a problem and its number of VRPs slowly goes down, nothing bad will
happen on your routers, as they just use the union of the VRPs from
all RTR endpoints, and the VRPs from the broken RTR server will slowly
be withdrawn. Your router will keep using the healthy RTR servers, as
opposed to considering erroneous data from a poisoned RTR server.
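
A toy illustration of that asymmetry, with VRPs modelled as (prefix,
maxlen, origin AS) tuples; all values are made up:

# Why a shrinking RTR endpoint is harmless while a stale one is not.
healthy = {("192.0.2.0/24", 24, 64500), ("198.51.100.0/24", 24, 64501)}

# A dying endpoint that slowly withdraws VRPs: whatever it drops is
# still covered by the healthy endpoint, so the union is unchanged.
shrinking = {("192.0.2.0/24", 24, 64500)}
assert healthy | shrinking == healthy

# A stale endpoint keeps announcing an obsolete VRP: it stays in the
# union and keeps influencing validation on the router.
stale = {("198.51.100.0/24", 24, 64666)}
print(sorted(healthy | stale))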

I define stability not as "RTR session uptime and VRP count", but as
whether my BGP routers are making correct or incorrect decisions.


> I too prefer an approach where the validator and RTR are separate but
> co-located, but this naturally increases the possibility that the two
> might serve different data due to reachability, validator run-time, etc.
> To what extent differences occur, I have not measured.
>
>
> [1] The NIST ROA graph confirms the rate of change is low, as I would
> expect.  But, I have no statistic for ROA stability, considering only
> the prefix and origin.

I don't see how the rate of global ROA changes is in any way related
to this issue. The operational issue a hung RTR endpoint creates for
other people's networks can't be measured with this.


lukas


Re: Global Akamai Outage

2021-07-27 Thread Lukas Tribus
On Tue, 27 Jul 2021 at 16:10, Mark Tinka  wrote:
>
>
>
> On 7/26/21 19:04, Lukas Tribus wrote:
>
> > rpki-client can only remove outdated VRPs if it a) actually runs and
> > b) successfully completes a validation cycle. It also needs to
> > do this BEFORE the RTR server distributes data.
> >
> > If rpki-client for whatever reason doesn't complete a validation cycle
> > [doesn't start, crashes, cannot write to the file] it will not be able
> > to update the file, which stayrtr reads and distributes.
>
> Have you had any odd experiences with rpki-client running? The fact that
> it's not a daemon suggests that it is less likely to bomb out (even
> though that could happen as a runtime binary, but one can reliably test
> for that with any affected changes).

No, I did not have a specific negative experience running rpki-client.

I did have my fair share of:

- fat fingering cronjobs
- fat fingering permissions
- read-only filesystems due to storage/virtualizations problems
- longer VM downtimes

I was also directly impacted by a hung rpki-validator, which I have
referenced in one of the links earlier. This was actually after I
started to have concerns about the lack of monitoring and the danger
of serving stale data, not before.

I was also constantly impacted by generic (non-RPKI-related) gray
failures in other people's networks for the better part of a decade,
which I guess makes me particularly sensitive to topics like this.



> Of course, rpki-client depends on Cron being available and stable, and
> over the years, I have not run into any major issues guaranteeing that.

It's not the quality of the cron code that I'm worried about. It's the
number of variables that can cause rpki-client to not complete and
fully write the validation results to disk, COMBINED with the lack of
monitoring.

You get an alert in your NMS when a link is down, even if that single
link being down doesn't mean your customers are impacted. But you need
to know, so that you can actually intervene to restore full redundancy.
Lack of awareness of a problem is the larger issue here.



> > If your VM went down with both rpki-client and stayrtr, and it stays
> > down for 2 days (maybe a nasty storage or virtualization problem, or
> > maybe just a PSU failure in a SPOF server), when the VM comes
> > back up, stayrtr will read and distribute 2-day-old data - after all,
> > rpki-client is a periodic cronjob while stayrtr will start
> > immediately, so there will be plenty of time to distribute obsolete
> > VRPs. Just because you have another validator and RTR server in
> > another region that was always available doesn't mean that the
> > erroneous and obsolete data served by this server will be ignored.
>
> This is a good point.
>
> So I know that one of the developers of StayRTR is working on having it
> use the "expires" values that rpki-client inherently possesses to ensure
> that StayRTR never delivers stale data to clients. If this works, while
> it does not eliminate the need for some degree of monitoring, it
> certainly makes it less of a hassle, going forward.

Note that expires is based on the cryptographic validity of the ROA
objects. It can be multiple DAYS until expiration strikes; for example,
the expiration value for 8.8.8.0/24 is currently 2 DAYS in the future.
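
For a quick look at how much slack that leaves, a small sketch
(Python) that reports the time-to-expiry of a given prefix from the
validator's JSON export; the file layout and field names are the same
illustrative assumptions as above.

#!/usr/bin/env python3
"""Sketch: report how far in the future a prefix's 'expires' lies."""
import json
import sys
import time

# Usage (illustrative): ./expiry.py vrps.json 8.8.8.0/24
path, prefix = sys.argv[1], sys.argv[2]

with open(path) as fh:
    roas = json.load(fh).get("roas", [])

for vrp in roas:
    if vrp.get("prefix") == prefix and "expires" in vrp:
        hours = (vrp["expires"] - time.time()) / 3600
        print(f"{prefix} (origin {vrp.get('asn')}): expires in {hours:.1f} hours")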



> > There are more reasons and failure scenarios why this 2 piece setup
> > (periodic RPKI validation, separate RTR daemon) can become a "split
> > brain". As you implement more complicated setups (a single global RPKI
> > validation result is distributed to regional RTR servers - the
> > cloudflare approach), things get even more complicated. Generally I
> > prefer the all in one approach for these reasons (FORT validator).
> >
> > At least if it crashes, it takes down the RTR server with it:
> >
> > https://github.com/NICMx/FORT-validator/issues/40#issuecomment-695054163
> >
> >
> > But I have to emphasize that all those are just examples. Unknown bugs
> > or corner cases can lead to similar behavior in "all in one" daemons
> > like Fort and Routinator. That's why specific improvements absolutely
> > do not mean we don't have to monitor the RTR servers.
>
> Agreed.
>
> I've had my fair share of Fort issues in the past month, all of which
> have been fixed and a new release is imminent, so I'm happy.
>
> I'm currently running both Fort and rpki-client + StayRTR. At a basic
> level, they both send the exact same number of VRP's toward clients,
> likely because they share a philosophy in validation schemes, and crypto
> libraries.
>
> We're getting there.


For IOS-XR I have a netconf script that runs all kinds of health
checks at the XR RTR client level (a sketch of the comparison logic
follows below):

- comparing the total number of IPv4 and IPv6 VRPs of each enabled RTR
server against absolute values, warning if there are fewer than the
EXPECTED values
- comparing the v4 and v6 numbers between the RTR endpoints on this XR
box, warning if the disparity crosses a threshold
- warning if a configured RTR server is not in a connected state
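
For reference, a sketch (Python) of the comparison logic behind such a
check. How the per-server VRP counts are collected from the router
(NETCONF, CLI scraping, SNMP) is left out; rtr_state below is a
stand-in for that data, and the names and thresholds are illustrative.

#!/usr/bin/env python3
"""Sketch of the RTR health-check comparison logic described above."""

EXPECTED_V4 = 250_000    # warn if an endpoint serves fewer IPv4 VRPs than this
EXPECTED_V6 = 40_000     # warn if an endpoint serves fewer IPv6 VRPs than this
MAX_DISPARITY = 0.05     # warn if endpoints disagree by more than 5%

# Stand-in for data retrieved from the router; values are made up.
rtr_state = {
    "rtr1.example.net": {"connected": True, "v4": 261_000, "v6": 43_500},
    "rtr2.example.net": {"connected": True, "v4": 259_000, "v6": 43_400},
}

warnings = []
for name, s in rtr_state.items():
    if not s["connected"]:
        warnings.append(f"{name}: configured but not connected")
        continue
    if s["v4"] < EXPECTED_V4 or s["v6"] < EXPECTED_V6:
        warnings.append(f"{name}: VRP count below expected floor ({s['v4']}/{s['v6']})")

for af in ("v4", "v6"):
    counts = [s[af] for s in rtr_state.values() if s["connected"]]
    if counts and (max(counts) - min(counts)) / max(counts) > MAX_DISPARITY:
        warnings.append(f"{af} disparity between RTR endpoints exceeds threshold")

print("\n".join(warnings) if warnings else "all RTR health checks passed")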

Re: Global Akamai Outage

2021-07-27 Thread heasley
Mon, Jul 26, 2021 at 07:04:41PM +0200, Lukas Tribus:
> Hello!
> 
> On Mon, 26 Jul 2021 at 17:50, heasley  wrote:
> >
> > Mon, Jul 26, 2021 at 02:20:39PM +0200, Lukas Tribus:
> > > rpki-client 7.1 emits a new per VRP attribute: expires, which makes it
> > > possible for RTR servers to stop considering outdated VRP's:
> > > https://github.com/rpki-client/rpki-client-openbsd/commit/9e48b3b6ad416f40ac3b5b265351ae0bb13ca925
> >
> > Since rpki-client removes "outdated" (expired) VRPs, how does an RTR
> > server "stop considering" something that does not exist from its PoV?
> 
> rpki-client can only remove outdated VRPs if it a) actually runs and
> b) successfully completes a validation cycle. It also needs to
> do this BEFORE the RTR server distributes data.
> 
> If rpki-client for whatever reason doesn't complete a validation cycle
> [doesn't start, crashes, cannot write to the file] it will not be able
> to update the file, which stayrtr reads and distributes.
> 
> If your VM went down with both rpki-client and stayrtr, and it stays
> down for 2 days (maybe a nasty storage or virtualization problem, or
> maybe just a PSU failure in a SPOF server), when the VM comes
> back up, stayrtr will read and distribute 2-day-old data - after all,
> rpki-client is a periodic cronjob while stayrtr will start
> immediately, so there will be plenty of time to distribute obsolete
> VRPs. Just because you have another validator and RTR server in
> another region that was always available doesn't mean that the
> erroneous and obsolete data served by this server will be ignored.
> 
> There are more reasons and failure scenarios why this 2 piece setup
> (periodic RPKI validation, separate RTR daemon) can become a "split
> brain". As you implement more complicated setups (a single global RPKI
> validation result is distributed to regional RTR servers - the
> cloudflare approach), things get even more complicated. Generally I
> prefer the all in one approach for these reasons (FORT validator).
> 
> At least if it crashes, it takes down the RTR server with it:
> 
> https://github.com/NICMx/FORT-validator/issues/40#issuecomment-695054163
> 
> 
> But I have to emphasize that all those are just examples. Unknown bugs
> or corner cases can lead to similar behavior in "all in one" daemons
> like Fort and Routinator. That's why specific improvements absolutely
> do not mean we don't have to monitor the RTR servers.

I am not convinced that I want the RTR server to be any smarter than
necessary, and I think expiration handling is too smart.  I want it to
load the VRPs provided and serve them, no more.

Leave expiration to the validator and monitoring of both to the NMS and
other means.  The delegations should not be changing quickly[1] enough
for me to prefer expiration over the grace period to correct a validator
problem.  That does not prevent an operator from using other means to
share fate; e.g., if the validator fails completely for 2 hours, stop
the RTR server.

I perceive this to be choosing stability in the RTR sessions over
timeliness of updates.  And, if a 15 - 30 minute polling interval is
reasonable, why isn't 8 - 24 hours?

I too prefer an approach where the validator and RTR are separate but
co-located, but this naturally increases the possibility that the two
might serve different data due to reachability, validator run-time, etc.
To what extent differences occur, I have not measured.


[1] The NIST ROA graph confirms the rate of change is low, as I would
expect.  But, I have no statistic for ROA stability, considering only
the prefix and origin.


Re: Global Akamai Outage

2021-07-27 Thread Mark Tinka




On 7/26/21 19:04, Lukas Tribus wrote:


rpki-client can only remove outdated VRPs if it a) actually runs and
b) successfully completes a validation cycle. It also needs to
do this BEFORE the RTR server distributes data.

If rpki-client for whatever reason doesn't complete a validation cycle
[doesn't start, crashes, cannot write to the file] it will not be able
to update the file, which stayrtr reads and distributes.


Have you had any odd experiences with rpki-client running? The fact that
it's not a daemon suggests that it is less likely to bomb out (even
though that could happen as a runtime binary, but one can reliably test
for that with any affected changes).


Of course, rpki-client depends on Cron being available and stable, and 
over the years, I have not run into any major issues guaranteeing that.


So if you've seen some specific outage scenarios with it, I'd be keen to 
hear about them.




If your VM went down with both rpki-client and stayrtr, and it stays
down for 2 days (maybe a nasty storage or virtualization problem, or
maybe just a PSU failure in a SPOF server), when the VM comes
back up, stayrtr will read and distribute 2-day-old data - after all,
rpki-client is a periodic cronjob while stayrtr will start
immediately, so there will be plenty of time to distribute obsolete
VRPs. Just because you have another validator and RTR server in
another region that was always available doesn't mean that the
erroneous and obsolete data served by this server will be ignored.


This is a good point.

So I know that one of the developers of StayRTR is working on having it 
use the "expires" values that rpki-client inherently possesses to ensure 
that StayRTR never delivers stale data to clients. If this works, while 
it does not eliminate the need for some degree of monitoring, it
certainly makes it less of a hassle, going forward.




There are more reasons and failure scenarios why this 2 piece setup
(periodic RPKI validation, separate RTR daemon) can become a "split
brain". As you implement more complicated setups (a single global RPKI
validation result is distributed to regional RTR servers - the
cloudflare approach), things get even more complicated. Generally I
prefer the all in one approach for these reasons (FORT validator).

At least if it crashes, it takes down the RTR server with it:

https://github.com/NICMx/FORT-validator/issues/40#issuecomment-695054163


But I have to emphasize that all those are just examples. Unknown bugs
or corner cases can lead to similar behavior in "all in one" daemons
like Fort and Routinator. That's why specific improvements absolutely
do not mean we don't have to monitor the RTR servers.


Agreed.

I've had my fair share of Fort issues in the past month, all of which 
have been fixed and a new release is imminent, so I'm happy.


I'm currently running both Fort and rpki-client + StayRTR. At a basic 
level, they both send the exact same number of VRP's toward clients, 
likely because they share a philosophy in validation schemes, and crypto 
libraries.


We're getting there.

Mark.


Re: Global Akamai Outage

2021-07-26 Thread Lukas Tribus
Hello!

On Mon, 26 Jul 2021 at 17:50, heasley  wrote:
>
> Mon, Jul 26, 2021 at 02:20:39PM +0200, Lukas Tribus:
> > rpki-client 7.1 emits a new per VRP attribute: expires, which makes it
> > possible for RTR servers to stop considering outdated VRP's:
> > https://github.com/rpki-client/rpki-client-openbsd/commit/9e48b3b6ad416f40ac3b5b265351ae0bb13ca925
>
> Since rpki-client removes "outdated" (expired) VRPs, how does an RTR
> server "stop considering" something that does not exist from its PoV?

rpki-client can only remove outdated VRPs if it a) actually runs and
b) successfully completes a validation cycle. It also needs to
do this BEFORE the RTR server distributes data.

If rpki-client for whatever reason doesn't complete a validation cycle
[doesn't start, crashes, cannot write to the file] it will not be able
to update the file, which stayrtr reads and distributes.

If your VM went down with both rpki-client and stayrtr, and it stays
down for 2 days (maybe a nasty storage or virtualization problem, or
maybe just a PSU failure in a SPOF server), when the VM comes
back up, stayrtr will read and distribute 2-day-old data - after all,
rpki-client is a periodic cronjob while stayrtr will start
immediately, so there will be plenty of time to distribute obsolete
VRPs. Just because you have another validator and RTR server in
another region that was always available doesn't mean that the
erroneous and obsolete data served by this server will be ignored.
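
One cheap way to catch exactly this failure mode is to alert on the
age of the file stayrtr serves from. A minimal sketch (Python), with
the path and threshold as illustrative assumptions, suitable for
wiring into a Nagios/Icinga-style check:

#!/usr/bin/env python3
"""Sketch: alert if the validator's output file has gone stale."""
import os
import sys
import time

OUTPUT_FILE = "/var/db/rpki-client/json"   # example path to the JSON export
MAX_AGE = 2 * 3600                         # warn if no successful run in 2 hours

try:
    age = time.time() - os.path.getmtime(OUTPUT_FILE)
except FileNotFoundError:
    print(f"CRITICAL: {OUTPUT_FILE} is missing - did the validator ever complete?")
    sys.exit(2)

if age > MAX_AGE:
    print(f"CRITICAL: {OUTPUT_FILE} is {age / 3600:.1f}h old - validation looks stale")
    sys.exit(2)

print(f"OK: validator output refreshed {age / 60:.0f} minutes ago")
sys.exit(0)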

There are more reasons and failure scenarios why this 2 piece setup
(periodic RPKI validation, separate RTR daemon) can become a "split
brain". As you implement more complicated setups (a single global RPKI
validation result is distributed to regional RTR servers - the
cloudflare approach), things get even more complicated. Generally I
prefer the all in one approach for these reasons (FORT validator).

At least if it crashes, it takes down the RTR server with it:

https://github.com/NICMx/FORT-validator/issues/40#issuecomment-695054163


But I have to emphasize that all those are just examples. Unknown bugs
or corner cases can lead to similar behavior in "all in one" daemons
like Fort and Routinator. That's why specific improvements absolutely
do not mean we don't have to monitor the RTR servers.


lukas


Re: Global Akamai Outage

2021-07-26 Thread Mark Tinka




On 7/26/21 17:50, heasley wrote:


Since rpki-client removes "outdated" (expired) VRPs, how does an RTR
server "stop considering" something that does not exist from its PoV?

Did you mean that it can warn about impending expiration?


StayRTR reads the VRP data generated by rpki-client.

Mark.


Re: Global Akamai Outage

2021-07-26 Thread heasley
Mon, Jul 26, 2021 at 02:20:39PM +0200, Lukas Tribus:
> rpki-client 7.1 emits a new per VRP attribute: expires, which makes it
> possible for RTR servers to stop considering outdated VRP's:
> https://github.com/rpki-client/rpki-client-openbsd/commit/9e48b3b6ad416f40ac3b5b265351ae0bb13ca925

Since rpki-client removes "outdated" (expired) VRPs, how does an RTR
server "stop considering" something that does not exist from its PoV?

Did you mean that it can warn about impending expiration?


Re: Global Akamai Outage

2021-07-26 Thread Mark Tinka




On 7/26/21 14:20, Lukas Tribus wrote:


Some specific failure scenarios are currently being addressed, but
this doesn't make monitoring optional:

rpki-client 7.1 emits a new per VRP attribute: expires, which makes it
possible for RTR servers to stop considering outdated VRP's:
https://github.com/rpki-client/rpki-client-openbsd/commit/9e48b3b6ad416f40ac3b5b265351ae0bb13ca925

stayrtr (a gortr fork), will consider this attribute in the future:
https://github.com/bgp/stayrtr/issues/3


I was just about to cite these two as improving this particular issue in 
upcoming releases.


I am running RPKI-Client + StayRTR, alongside Fort, and yes, while 
monitoring should be standard, improvements in the validation and RTR 
objectives will also go a long way in mitigating these issues.


What's quickly happening in this space is that not all validators and
RTR servers are going to be made equal. There are a number of options
currently available (both deprecated and current), but I expect that we
may settle on just a handful as experience increases. And I anticipate
that those which remain will be bolstered to handle these very
problems.


Mark.


Re: Global Akamai Outage

2021-07-26 Thread Lukas Tribus
Hello,


On Mon, 26 Jul 2021 at 11:40, Mark Tinka  wrote:
> I can count, on my hands, the number of RPKI-related outages that we
> have experienced, and all of them have turned out to be a
> misunderstanding of how ROA's work, either by customers or some other
> network on the Internet. The good news is that all of those cases were
> resolved within a few hours of notifying the affected party.

That's good, but the understanding of operational issues in RPKI
systems in the wild is underwhelming; we are bound to make the same
mistakes we made with DNS all over again.

Yes, a complete failure of an RTR server theoretically does not have
big negative effects on networks. But a failure of RPKI validation with
a separate RTR server can lead to outdated VRPs on the routers, just
as RTR server bugs can, which is why monitoring not only availability
but also whether the data is actually current is *very* necessary.


Here are some examples (both of operators' points of view and of actual failure scenarios):


https://mailman.nanog.org/pipermail/nanog/2020-August/208982.html

> we are at fault for not deploying the validation service in a redundant
> setup and for failing at monitoring the service. But we did so because
> we thought it not to be too important, because a failed validation
> service should simply lead to no validation, not a crashed router.

In this case an RTR client bug crashed the router. But the point is
that it is not clear to everyone that setting up RPKI validators and
RTR servers is a serious endeavor, and that monitoring them is not
optional.



https://github.com/cloudflare/gortr/issues/82

> we noticed that one of the ROAs was wrong. When I pulled output.json
> from octorpki (/output.json), it had the correct value. However when
> I ran rtrdump, it had a different ASN value for the prefix. Restarting
> the gortr process did fix it. Sending SIGHUP did not.



https://github.com/RIPE-NCC/rpki-validator-3/issues/264

> yesterday we saw an unexpected ROA propagation delay.
>
> After updating a ROA in the RIPE lirportal, NTT, Telia and Cogent
> saw the update within an hour, but a specific rpki validator
> 3.1-2020.08.06.14.39 in a third party network did not converge
> for more than 4 hours.


I wrote a naive nagios script to check for stalled serials on an RTR server:
https://github.com/lukastribus/rtrcheck

and talked about it in this blog post (shameless plug):
https://labs.ripe.net/author/lukas_tribus/rpki-rov-about-stale-rtr-servers-and-how-to-monitor-them/
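
The idea behind that check, reduced to a sketch (Python): remember the
last serial seen and when it last changed, and alert once it has been
stuck for too long. get_rtr_serial() is a hypothetical placeholder for
an actual RTR query (RFC 8210) or a call into an existing RTR client
library; paths and thresholds are illustrative.

#!/usr/bin/env python3
"""Sketch of a stalled-serial check for an RTR server."""
import json
import sys
import time

STATE_FILE = "/var/tmp/rtrcheck.state"   # example location for state between runs
MAX_STALL = 2 * 3600                     # alert if the serial hasn't moved in 2 hours


def get_rtr_serial(host, port=323):
    """Hypothetical placeholder: query the RTR cache for its current serial."""
    raise NotImplementedError("use an RTR client library or implement RFC 8210 here")


def main(host):
    serial = get_rtr_serial(host)
    now = time.time()

    # Load the serial and timestamp remembered from the previous run.
    try:
        with open(STATE_FILE) as fh:
            state = json.load(fh)
    except (FileNotFoundError, ValueError):
        state = {"serial": serial, "last_change": now}

    if serial != state["serial"]:
        state = {"serial": serial, "last_change": now}

    with open(STATE_FILE, "w") as fh:
        json.dump(state, fh)

    stalled = now - state["last_change"]
    if stalled > MAX_STALL:
        print(f"CRITICAL: serial {serial} unchanged for {stalled / 3600:.1f}h")
        return 2
    print(f"OK: serial {serial}")
    return 0


if __name__ == "__main__":
    sys.exit(main(sys.argv[1]))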

This is on the validation/network side. On the CA side, similar issues apply.

I believe we are still a few high-profile outages (caused by
insufficient reliability in RPKI stacks) away from people starting to
take this seriously.


Some specific failure scenarios are currently being addressed, but
this doesn't make monitoring optional:

rpki-client 7.1 emits a new per VRP attribute: expires, which makes it
possible for RTR servers to stop considering outdated VRP's:
https://github.com/rpki-client/rpki-client-openbsd/commit/9e48b3b6ad416f40ac3b5b265351ae0bb13ca925

stayrtr (a gortr fork), will consider this attribute in the future:
https://github.com/bgp/stayrtr/issues/3



cheers,
lukas


Re: Global Akamai Outage

2021-07-26 Thread Mark Tinka




On 7/26/21 07:25, Saku Ytti wrote:


Doesn't matter. And I'm not trying to say RPKI is a bad thing. I like
that we have good AS:origin mapping that is verifiable and machine
readable; that part of the solution will be needed for many
applications which intend to improve the Internet by some metric.
And of course adding any complexity will have some rearing problems,
particularly if the problem it attempts to address is infrequently
occurring, so it would be naive not to expect an increased rate of
outages while maturing it.


Yes, while RPKI fixes problems that genuinely occur infrequently, it's
intended to work very well when those problems do occur, especially
intentional hijacks, because when they do occur, they disrupt quite a
large part of the Internet, even if only for a few minutes or a couple
of hours. So from that standpoint, RPKI does add value.


Where I do agree with you is that we should restrain ourselves from 
applying RPKI to use-cases that are non-core to its reasons for 
existence, e.g., AS0.


I can count, on my hands, the number of RPKI-related outages that we 
have experienced, and all of them have turned out to be a 
misunderstanding of how ROA's work, either by customers or some other 
network on the Internet. The good news is that all of those cases were 
resolved within a few hours of notifying the affected party.


Mark.


Re: Global Akamai Outage

2021-07-25 Thread Saku Ytti
On Sun, 25 Jul 2021 at 21:41, Mark Tinka  wrote:

> Are you speaking globally, or for NTT?

Doesn't matter. And I'm not trying to say RPKI is a bad thing. I like
that we have good AS:origin mapping that is verifiable and machine
readable; that part of the solution will be needed for many
applications which intend to improve the Internet by some metric.
And of course adding any complexity will have some rearing problems,
particularly if the problem it attempts to address is infrequently
occurring, so it would be naive not to expect an increased rate of
outages while maturing it.

-- 
  ++ytti


Re: Global Akamai Outage

2021-07-25 Thread Mark Tinka




On 7/25/21 17:32, Saku Ytti wrote:


Steering dangerously off-topic from this thread, we have so far had
more operational and availability issues from RPKI than from hijacks.
And it is a bit more embarrassing to say 'we cocked up' than to say
'someone leaked to internet, it be like it do'.


Are you speaking globally, or for NTT?

Mark.


Re: Global Akamai Outage

2021-07-25 Thread Randy Bush
> Very often the corrective and preventive actions appear to be
> different versions and wordings of 'don't make mistakes', in this case:
> 
> - Reviewing and improving input safety checks for mapping components
> - Validate and strengthen the safety checks for the configuration
> deployment zoning process
> 
> It doesn't seem like a tenable solution, when the solution is 'do
> better', since I'm sure whoever did those checks did their best in the
> first place. So we must assume we have some fundamental limits on what
> 'do better' can achieve; we have to assume we have a similar level of
> outage potential in all the work we've produced and continue to
> produce, over which we exert very little control.
>
> I think the mean-time-to-repair actions described are more actionable
> than the 'do better'.  However, Akamai already solved this very fast,
> and it may not be very reasonable to expect big improvements on a 1h
> fault-start-to-resolution time for a big organisation with a complex
> product.
>
> One thing that comes to mind is: what if Akamai assumes they cannot
> reasonably make it fail less often and they can't fix it faster? Is
> this particular product/solution such that having entirely independent
> A and B sides, between which clients fail over, is not possible? If it
> was a DNS problem, it seems like it might have been possible to have A
> fail entirely, with clients automatically reverting to B, perhaps
> adding some latency but also allowing the system to automatically
> detect that A and B are performing at an unacceptable delta.

formal verification


Re: Global Akamai Outage

2021-07-25 Thread Saku Ytti
On Sun, 25 Jul 2021 at 18:14, Jared Mauch  wrote:

> How can we improve response times when things are routed poorly? Time to
> mitigate hijacks is improved by the majority of providers doing RPKI OV, but
> inter-provider response time scales are much longer. I also think about the
> two big CTL long-haul and routing issues last year. How can you mitigate
> these externalities?

Steering dangerously off-topic from this thread, we have so far had
more operational and availability issues from RPKI than from hijacks.
And it is a bit more embarrassing to say 'we cocked up' than to say
'someone leaked to internet, it be like it do'.

-- 
  ++ytti


Re: Global Akamai Outage

2021-07-25 Thread Jared Mauch
Work hat is not on, but context is included from prior workplaces etc. 

> On Jul 25, 2021, at 2:22 AM, Saku Ytti  wrote:
> 
> It doesn't seem like a tenable solution, when the solution is 'do
> better', since I'm sure whoever did those checks did their best in the
> first place. So we must assume we have some fundamental limits on what
> 'do better' can achieve; we have to assume we have a similar level of
> outage potential in all the work we've produced and continue to
> produce, over which we exert very little control.

I have seen a very strong culture around risk and risk avoidance whenever
possible at Akamai. Some minor changes are taken very seriously.

I appreciate that on a daily basis, and when mistakes are made (I am human
after all), reviews of the mistakes and corrective steps are planned and
followed up on. I'm sure this time will not be different.

I also get how easy it is to be cynical about these issues. There's always 
someone with power who can break things, but those can also often fix them just 
as fast. 

Focus on how you can do a transactional routing change and roll it back, how 
you can test etc. 

This is why for years I told one vendor that had a line-by-line parser
that their system was too unsafe for operation.

There's also other questions like:

How can we improve response times when things are routed poorly? Time to
mitigate hijacks is improved by the majority of providers doing RPKI OV, but
inter-provider response time scales are much longer. I also think about the two
big CTL long-haul and routing issues last year. How can you mitigate these
externalities?

- Jared 

Re: Global Akamai Outage

2021-07-25 Thread Mark Tinka




On 7/25/21 08:18, Saku Ytti wrote:


Hey,

Not a critique against Akamai specifically, it applies just the same
to me. Everything seems so complex and fragile.

Very often the corrective and preventive actions appear to be
different versions and wordings of 'don't make mistakes', in this case:

- Reviewing and improving input safety checks for mapping components
- Validate and strengthen the safety checks for the configuration
deployment zoning process

It doesn't seem like a tenable solution, when the solution is 'do
better', since I'm sure whoever did those checks did their best in the
first place. So we must assume we have some fundamental limits on what
'do better' can achieve; we have to assume we have a similar level of
outage potential in all the work we've produced and continue to
produce, over which we exert very little control.

I think the mean-time-to-repair actions described are more actionable
than the 'do better'.  However, Akamai already solved this very fast,
and it may not be very reasonable to expect big improvements on a 1h
fault-start-to-resolution time for a big organisation with a complex
product.

One thing that comes to mind is: what if Akamai assumes they cannot
reasonably make it fail less often and they can't fix it faster? Is
this particular product/solution such that having entirely independent
A and B sides, between which clients fail over, is not possible? If it
was a DNS problem, it seems like it might have been possible to have A
fail entirely, with clients automatically reverting to B, perhaps
adding some latency but also allowing the system to automatically
detect that A and B are performing at an unacceptable delta.

Did some of their affected customers recover faster than Akamai due to
their own actions, automated or manual?


Can we learn something from how the airline industry has incrementally 
improved safety through decades of incidents?


"Doing better" is the lowest hanging fruit any network operator can 
strive for. Unlike airlines, the Internet community - despite being 
built on standards - is quite diverse in how we choose to operate our 
own islands. So "doing better", while a universal goal, means different 
things to different operators. This is why we would likely see different 
RFO's and remedial recommendations from different operators for the 
"same kind of" outage.


In most cases, continuing to "do better" may be the most appealing
prospect, because anything better than that will require significantly
more funding, in an industry where most operators are generally
threading the P needle.


Mark.


Re: Global Akamai Outage

2021-07-25 Thread Miles Fidelman

Indeed.  Worth rereading for that reason alone (or in particular).

Miles Fidelman

Hank Nussbacher wrote:

On 23/07/2021 09:24, Hank Nussbacher wrote:

From Akamai.  How companies and vendors should report outages:

[07:35 UTC on July 24, 2021] Update:

Root Cause:

This configuration directive was sent as part of preparation for 
independent load balancing control of a forthcoming product. Updates 
to the configuration directive for this load balancing component have 
routinely been made on approximately a weekly basis. (Further changes 
to this configuration channel have been blocked until additional 
safety measures have been implemented, as noted in Corrective and 
Preventive Actions.)


The load balancing configuration directive included a formatting 
error. As a safety measure, the load balancing component disregarded 
the improper configuration and fell back to a minimal configuration. 
In this minimal state, based on a VIP-only configuration, it did not 
support load balancing for Enhanced TLS slots greater than 6145.


The missing load balancing data meant that the Akamai authoritative 
DNS system for the akamaiedge.net zone would not receive any directive 
for how to respond to DNS queries for many Enhanced TLS slots. The 
authoritative DNS system will respond with a SERVFAIL when there is no 
directive, as during localized failures resolvers will retry an 
alternate authority.


The zoning process used for deploying configuration changes to the 
network includes an alert check for potential issues caused by the 
configuration changes. The zoning process did result in alerts during 
the deployment. However, due to how the particular safety check was 
configured, the alerts for this load balancing component did not 
prevent the configuration from continuing to propagate, and did not 
result in escalation to engineering SMEs. The input safety check on 
the load balancing component also did not automatically roll back the 
change upon detecting the error.


Contributing Factors:

    The internal alerting which was specific to the load balancing 
component did not result in blocking the configuration from 
propagating to the network, and did not result in an escalation to the 
SMEs for the component.
    The alert and associated procedure indicating widespread SERVFAILs 
potentially due to issues with mapping systems did not lead to an 
appropriately urgent and timely response.
    The internal alerting which fired and was escalated to SMEs was 
for a separate component which uses the load balancing data. This 
internal alerting initially fired for the Edge DNS system rather than 
the mapping system, which delayed troubleshooting potential issues 
with the mapping system and the load balancing component which had the 
configuration change. Subsequent internal alerts more clearly 
indicated an issue with the mapping system.
    The impact to the Enhanced TLS service affected Akamai staff 
access to internal tools and websites, which delayed escalation of 
alerts, troubleshooting, and especially initiation of the incident 
process.


Short Term

Completed:

    Akamai completed rolling back the configuration change at 16:44 
UTC on July 22, 2021.

    Blocked any further changes to the involved configuration channel.
    Other related channels are being reviewed and may be subject to a 
similar block as reviews take place. Channels will be unblocked after 
additional safety measures are assessed and implemented where needed.


In Progress:

    Validate and strengthen the safety checks for the configuration 
deployment zoning process
    Increase the sensitivity and priority of alerting for high rates 
of SERVFAILs.


Long Term

In Progress:

    Reviewing and improving input safety checks for mapping components.
    Auditing critical systems to identify gaps in monitoring and 
alerting, then closing unacceptable gaps.





On 22/07/2021 19:34, Mark Tinka wrote:

https://edgedns.status.akamai.com/

Mark.



[18:30 UTC on July 22, 2021] Update:

Akamai experienced a disruption with our DNS service on July 22, 
2021. The disruption began at 15:45 UTC and lasted for approximately 
one hour. Affected customer sites were significantly impacted for 
connections that were not established before the incident began.


Our teams identified that a change made in a mapping component was 
causing the issue, and in order to mitigate it we rolled the change 
back at approximately 16:44 UTC. We can confirm this was not a 
cyberattack against Akamai's platform. Immediately following the 
rollback, the platform stabilized and DNS services resumed normal 
operations. At this time the incident is resolved, and we are 
monitoring to ensure that traffic remains stable.





--
In theory, there is no difference between theory and practice.
In practice, there is.   Yogi Berra

Theory is when you know everything but nothing works.
Practice is when everything works but no one knows why.
In our lab, theory and practice 

Re: Global Akamai Outage

2021-07-25 Thread Hank Nussbacher

On 25/07/2021 09:18, Saku Ytti wrote:

Hey,

Not a critique against Akamai specifically, it applies just the same
to me. Everything seems so complex and fragile.


Complex systems are apt to break and only a very limited set of tier-3 
engineers will understand what needs to be done to fix it.


KISS

-Hank


Re: Global Akamai Outage

2021-07-25 Thread Saku Ytti
Hey,

Not a critique against Akamai specifically, it applies just the same
to me. Everything seems so complex and fragile.

Very often the corrective and preventive actions appear to be
different versions and wordings of 'don't make mistakes', in this case:

- Reviewing and improving input safety checks for mapping components
- Validate and strengthen the safety checks for the configuration
deployment zoning process

It doesn't seem like a tenable solution, when the solution is 'do
better', since I'm sure whoever did those checks did their best in the
first place. So we must assume we have some fundamental limits on what
'do better' can achieve; we have to assume we have a similar level of
outage potential in all the work we've produced and continue to
produce, over which we exert very little control.

I think the mean-time-to-repair actions described are more actionable
than the 'do better'.  However, Akamai already solved this very fast,
and it may not be very reasonable to expect big improvements on a 1h
fault-start-to-resolution time for a big organisation with a complex
product.

One thing that comes to mind is: what if Akamai assumes they cannot
reasonably make it fail less often and they can't fix it faster? Is
this particular product/solution such that having entirely independent
A and B sides, between which clients fail over, is not possible? If it
was a DNS problem, it seems like it might have been possible to have A
fail entirely, with clients automatically reverting to B, perhaps
adding some latency but also allowing the system to automatically
detect that A and B are performing at an unacceptable delta.

Did some of their affected customers recover faster than Akamai due to
their own actions, automated or manual?

On Sat, 24 Jul 2021 at 21:46, Hank Nussbacher  wrote:
>
> On 23/07/2021 09:24, Hank Nussbacher wrote:
>
>  From Akamai.  How companies and vendors should report outages:
>
> [07:35 UTC on July 24, 2021] Update:
>
> Root Cause:
>
> This configuration directive was sent as part of preparation for
> independent load balancing control of a forthcoming product. Updates to
> the configuration directive for this load balancing component have
> routinely been made on approximately a weekly basis. (Further changes to
> this configuration channel have been blocked until additional safety
> measures have been implemented, as noted in Corrective and Preventive
> Actions.)
>
> The load balancing configuration directive included a formatting error.
> As a safety measure, the load balancing component disregarded the
> improper configuration and fell back to a minimal configuration. In this
> minimal state, based on a VIP-only configuration, it did not support
> load balancing for Enhanced TLS slots greater than 6145.
>
> The missing load balancing data meant that the Akamai authoritative DNS
> system for the akamaiedge.net zone would not receive any directive for
> how to respond to DNS queries for many Enhanced TLS slots. The
> authoritative DNS system will respond with a SERVFAIL when there is no
> directive, as during localized failures resolvers will retry an
> alternate authority.
>
> The zoning process used for deploying configuration changes to the
> network includes an alert check for potential issues caused by the
> configuration changes. The zoning process did result in alerts during
> the deployment. However, due to how the particular safety check was
> configured, the alerts for this load balancing component did not prevent
> the configuration from continuing to propagate, and did not result in
> escalation to engineering SMEs. The input safety check on the load
> balancing component also did not automatically roll back the change upon
> detecting the error.
>
> Contributing Factors:
>
>  The internal alerting which was specific to the load balancing
> component did not result in blocking the configuration from propagating
> to the network, and did not result in an escalation to the SMEs for the
> component.
>  The alert and associated procedure indicating widespread SERVFAILs
> potentially due to issues with mapping systems did not lead to an
> appropriately urgent and timely response.
>  The internal alerting which fired and was escalated to SMEs was for
> a separate component which uses the load balancing data. This internal
> alerting initially fired for the Edge DNS system rather than the mapping
> system, which delayed troubleshooting potential issues with the mapping
> system and the load balancing component which had the configuration
> change. Subsequent internal alerts more clearly indicated an issue with
> the mapping system.
>  The impact to the Enhanced TLS service affected Akamai staff access
> to internal tools and websites, which delayed escalation of alerts,
> troubleshooting, and especially initiation of the incident process.
>
> Short Term
>
> Completed:
>
>  Akamai completed rolling back the configuration change at 16:44 UTC
> on July 22, 2021.
>

Re: Global Akamai Outage

2021-07-24 Thread Hank Nussbacher

On 23/07/2021 09:24, Hank Nussbacher wrote:

From Akamai.  How companies and vendors should report outages:

[07:35 UTC on July 24, 2021] Update:

Root Cause:

This configuration directive was sent as part of preparation for 
independent load balancing control of a forthcoming product. Updates to 
the configuration directive for this load balancing component have 
routinely been made on approximately a weekly basis. (Further changes to 
this configuration channel have been blocked until additional safety 
measures have been implemented, as noted in Corrective and Preventive 
Actions.)


The load balancing configuration directive included a formatting error. 
As a safety measure, the load balancing component disregarded the 
improper configuration and fell back to a minimal configuration. In this 
minimal state, based on a VIP-only configuration, it did not support 
load balancing for Enhanced TLS slots greater than 6145.


The missing load balancing data meant that the Akamai authoritative DNS 
system for the akamaiedge.net zone would not receive any directive for 
how to respond to DNS queries for many Enhanced TLS slots. The 
authoritative DNS system will respond with a SERVFAIL when there is no 
directive, as during localized failures resolvers will retry an 
alternate authority.


The zoning process used for deploying configuration changes to the 
network includes an alert check for potential issues caused by the 
configuration changes. The zoning process did result in alerts during 
the deployment. However, due to how the particular safety check was 
configured, the alerts for this load balancing component did not prevent 
the configuration from continuing to propagate, and did not result in 
escalation to engineering SMEs. The input safety check on the load 
balancing component also did not automatically roll back the change upon 
detecting the error.


Contributing Factors:

    The internal alerting which was specific to the load balancing 
component did not result in blocking the configuration from propagating 
to the network, and did not result in an escalation to the SMEs for the 
component.
    The alert and associated procedure indicating widespread SERVFAILs 
potentially due to issues with mapping systems did not lead to an 
appropriately urgent and timely response.
    The internal alerting which fired and was escalated to SMEs was for 
a separate component which uses the load balancing data. This internal 
alerting initially fired for the Edge DNS system rather than the mapping 
system, which delayed troubleshooting potential issues with the mapping 
system and the load balancing component which had the configuration 
change. Subsequent internal alerts more clearly indicated an issue with 
the mapping system.
    The impact to the Enhanced TLS service affected Akamai staff access 
to internal tools and websites, which delayed escalation of alerts, 
troubleshooting, and especially initiation of the incident process.


Short Term

Completed:

    Akamai completed rolling back the configuration change at 16:44 UTC 
on July 22, 2021.

    Blocked any further changes to the involved configuration channel.
    Other related channels are being reviewed and may be subject to a 
similar block as reviews take place. Channels will be unblocked after 
additional safety measures are assessed and implemented where needed.


In Progress:

    Validate and strengthen the safety checks for the configuration 
deployment zoning process
    Increase the sensitivity and priority of alerting for high rates of 
SERVFAILs.


Long Term

In Progress:

    Reviewing and improving input safety checks for mapping components.
    Auditing critical systems to identify gaps in monitoring and 
alerting, then closing unacceptable gaps.





On 22/07/2021 19:34, Mark Tinka wrote:

https://edgedns.status.akamai.com/

Mark.



[18:30 UTC on July 22, 2021] Update:

Akamai experienced a disruption with our DNS service on July 22, 2021. 
The disruption began at 15:45 UTC and lasted for approximately one 
hour. Affected customer sites were significantly impacted for 
connections that were not established before the incident began.


Our teams identified that a change made in a mapping component was 
causing the issue, and in order to mitigate it we rolled the change 
back at approximately 16:44 UTC. We can confirm this was not a 
cyberattack against Akamai's platform. Immediately following the 
rollback, the platform stabilized and DNS services resumed normal 
operations. At this time the incident is resolved, and we are 
monitoring to ensure that traffic remains stable.





Re: Global Akamai Outage

2021-07-23 Thread Hank Nussbacher

On 22/07/2021 19:34, Mark Tinka wrote:

https://edgedns.status.akamai.com/

Mark.



[18:30 UTC on July 22, 2021] Update:

Akamai experienced a disruption with our DNS service on July 22, 2021. 
The disruption began at 15:45 UTC and lasted for approximately one hour. 
Affected customer sites were significantly impacted for connections that 
were not established before the incident began.


Our teams identified that a change made in a mapping component was 
causing the issue, and in order to mitigate it we rolled the change back 
at approximately 16:44 UTC. We can confirm this was not a cyberattack 
against Akamai's platform. Immediately following the rollback, the 
platform stabilized and DNS services resumed normal operations. At this 
time the incident is resolved, and we are monitoring to ensure that 
traffic remains stable.


Re: Global Akamai Outage

2021-07-22 Thread Andy Ringsmuth


> On Jul 22, 2021, at 12:38 PM, Grant Taylor via NANOG  wrote:
> 
> On 7/22/21 10:56 AM, Andy Ringsmuth wrote:
>> The outage appears to have, ironically, taken out the outages and 
>> outages-discussion lists too.
> 
> I received multiple messages from the Outages (proper) mailing list, 
> including messages about the Akamai issue.
> 
> I'd be surprised if the Outages Discussion mailing list was on different 
> infrastructure.
> 
> I am now seeing some messages sent to Outages (proper) as if others aren't 
> seeing messages (about the Akamai issue).  This is probably going to be a 
> more nuanced issue that affected some but not all entities. Probably weird 
> interactions / dependencies.
> 
> 
> 
> -- 
> Grant. . . .
> unix || die

It does seem that way, and I opened a ticket with my provider to see what they 
can find as well.


-Andy

Re: Global Akamai Outage

2021-07-22 Thread Grant Taylor via NANOG

On 7/22/21 10:56 AM, Andy Ringsmuth wrote:
The outage appears to have, ironically, taken out the outages and 
outages-discussion lists too.


I received multiple messages from the Outages (proper) mailing list, 
including messages about the Akamai issue.


I'd be surprised if the Outages Discussion mailing list was on different 
infrastructure.


I am now seeing some messages sent to Outages (proper) as if others 
aren't seeing messages (about the Akamai issue).  This is probably going 
to be a more nuanced issue that affected some but not all entities. 
Probably weird interactions / dependencies.




--
Grant. . . .
unix || die


Re: Global Akamai Outage

2021-07-22 Thread Jared Mauch



> On Jul 22, 2021, at 12:56 PM, Andy Ringsmuth  wrote:
> 
> The outage appears to have, ironically, taken out the outages and 
> outages-discussion lists too.
> 
> Kinda like having a fire at the 911 dispatch center…



Should not have impacted me in my hosting of the list.  Obviously if the domain 
names were impacted in the lookups for sending e-mail as well, there would be 
problems.



- Jared





Re: Global Akamai Outage

2021-07-22 Thread Andy Ringsmuth
The outage appears to have, ironically, taken out the outages and 
outages-discussion lists too.

Kinda like having a fire at the 911 dispatch center...


Andy Ringsmuth
5609 Harding Drive
Lincoln, NE 68521-5831
(402) 304-0083
a...@andyring.com

“Better even die free, than to live slaves.” - Frederick Douglass, 1863

> On Jul 22, 2021, at 11:34 AM, Mark Tinka  wrote:
> 
> https://edgedns.status.akamai.com/
> 
> Mark.



Re: Global Akamai Outage

2021-07-22 Thread Mark Tinka




On 7/22/21 18:50, Matt Harris wrote:

Seems to be clearing up at this point; I was able to get to a site just 
now that I couldn't a little bit ago.


Yes, seems to be restoring...

    https://twitter.com/akamai/status/1418251400660889603?s=28

Mark.


Re: Global Akamai Outage

2021-07-22 Thread Matt Harris

Matt Harris|Infrastructure Lead
816-256-5446|Direct
On Thu, Jul 22, 2021 at 11:35 AM Mark Tinka  wrote:

> https://edgedns.status.akamai.com/
>
> Mark.
>

Seems to be clearing up at this point; I was able to get to a site just now
that I couldn't a little bit ago.