Re: 2019-01-11 ARIN.NET DNSSEC Outage – Post-Mortem (was: Re: ARIN NS down?)

2019-01-14 Thread Stephane Bortzmeyer
On Fri, Jan 11, 2019 at 08:59:10PM +,
 John Curran  wrote 
 a message of 125 lines which said:

> Our monitoring systems reported being green until the signatures
> expired as they presently check that the SOA's match on the internal
> and external nameservers.

For checking DNSSEC signature expiration (something which is as
crucial to monitor as PKIX certificate expiration), I use

and I'm happy with it.


2019-01-11 ARIN.NET DNSSEC Outage – Post-Mortem (was: Re: ARIN NS down?)

2019-01-11 Thread John Curran
On 11 Jan 2019, at 10:39 AM, John Curran <jcur...@arin.net> wrote:

On Fri, Jan 11, 2019 at 07:57:25PM +0530,
couldn't get address for 'ns1.arin.net': not found

Folks -

   This has been resolved - the arin.net zone is again correctly signed.

Post-mortem forthcoming,

Folks -

The ARIN.NET zone on our public signed DNS servers is populated via an
internal DNS server and associated workflow.  As part of system maintenance
near the end of 2018, the zone file used by the master internal DNS server
was updated incorrectly, resulting in an invalid zone file.  Since the zone
file was invalid, the zone did not reload on our internal master, and the
associated workflow to DNSSEC sign and push this zone to the public servers
did not execute.  Our monitoring systems reported green until the signatures
expired, because they presently check only that the SOA records match on the
internal and external nameservers.
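
The blind spot described above can be sketched in a few lines of Python. This
is only an illustration under stated assumptions, not ARIN's actual monitoring:
the serial numbers, the last_published timestamp and the 3-day max_age are all
hypothetical. It shows that a serial comparison stays green when both sides
serve the same stale zone, while a check on the age of the last successful
sign-and-push run would have failed.

```python
from datetime import datetime, timedelta, timezone


def soa_serials_match(internal_serial: int, external_serial: int) -> bool:
    """The check described above: compare SOA serials on both sides."""
    return internal_serial == external_serial


def publication_is_fresh(last_published: datetime, now: datetime,
                         max_age: timedelta) -> bool:
    """Hypothetical extra check: has the sign-and-push workflow run recently?"""
    return now - last_published <= max_age


# Hypothetical values: the zone failed to reload, so both sides still
# serve the same old serial and nothing has been published for weeks.
internal_serial = external_serial = 2018121501
last_published = datetime(2018, 12, 15, tzinfo=timezone.utc)
now = datetime(2019, 1, 10, tzinfo=timezone.utc)

serial_check_ok = soa_serials_match(internal_serial, external_serial)
freshness_check_ok = publication_is_fresh(last_published, now, timedelta(days=3))
print(serial_check_ok, freshness_check_ok)
```

The design point is that the second check looks at the workflow itself rather
than at its output, so it fails as soon as publication stalls instead of when
the signatures run out.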

At approximately 8:30 AM Eastern time today (11 January 2019), ARIN operations
started seeing issues within its monitoring.  Initial review suggested the
problem was DNSSEC-related due to expired signatures.  We pulled the DS record
from the zone so that DNSSEC validation would not be performed by those
validating resolvers that had not already cached our DS records.  Upon further
investigation we determined that the problem was the result of human error in
editing a zone file that went undetected and resulted in interruption of our
routine zone publication process.  The issue was fixed and signed zones were
then pushed out at 10:25 AM ET.  The DS record was reinstated in the parent at
10:30 AM ET.

As a result of this incident, we will add alerting on any errors in the zone
loading process and will monitor zone signature lifetimes, with appropriate
alerting ahead of any potential expiration of DNSSEC signatures.
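
As a rough sketch of what signature-lifetime monitoring can look like, the
Python below reads the expiration field from an RRSIG record in presentation
format (the fifth rdata field, per RFC 4034) and maps the remaining lifetime
to an alert level. The sample record, the 7-day warning threshold and the
function names are assumptions made for illustration. Fetching the RRSIG from
a live server is left out, and only the YYYYMMDDHHmmSS form of the timestamp
is handled.

```python
from datetime import datetime, timedelta, timezone


def rrsig_expiration(rdata: str) -> datetime:
    """Return the expiration time from RRSIG rdata in presentation format.

    Field order (RFC 4034): type covered, algorithm, labels, original TTL,
    signature expiration, signature inception, key tag, signer, signature.
    """
    expiration = rdata.split()[4]
    parsed = datetime.strptime(expiration, "%Y%m%d%H%M%S")
    return parsed.replace(tzinfo=timezone.utc)


def signature_status(rdata: str, now: datetime,
                     warn_before: timedelta = timedelta(days=7)) -> str:
    """Map the remaining signature lifetime to an alert level."""
    remaining = rrsig_expiration(rdata) - now
    if remaining <= timedelta(0):
        return "CRITICAL"  # signature has already expired
    if remaining <= warn_before:
        return "WARNING"   # expiring soon; re-signing may be stuck
    return "OK"


# Made-up record for illustration; not a real arin.net signature.
SAMPLE_RRSIG = (
    "SOA 8 2 3600 20190111120000 20181228120000 12345 arin.net. ZmFrZQ=="
)

if __name__ == "__main__":
    print(signature_status(SAMPLE_RRSIG, datetime.now(timezone.utc)))
```

The warning threshold should be shorter than the signature validity period
and long enough to leave time to react; the right value depends on how often
the zone is re-signed.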

My apologies for this incident.  While ARIN does have some fragility in our
older systems (which we have been working aggressively to phase out via system
refresh and replacements), it is not acceptable to have this situation with key
infrastructure such as our DNS zones.  We will prioritize the necessary
alerting and monitoring changes, and I will report back to the community once
that has been completed.

Thank you for your patience in this regard.
/John

John Curran
President and CEO
American Registry for Internet Numbers