Re: [DNSOP] sentinel and timing?

2018-02-11 Thread Warren Kumari
On Friday, February 9, 2018, Geoff Huston  wrote:

>
>
> > On 8 Feb 2018, at 5:02 pm, Paul Wouters  wrote:
> >
> > On Wed, 7 Feb 2018, Robert Story wrote:
> >
> >> On Wed 2018-02-07 10:43:16-0500 Paul wrote:
> >>> How about using this query to also encode an
> >>> uptime-processstartedtime value? Maybe with accuracy reduced to
> >>> minutes. I think that would return valuable data.
> >>
> >> -1 for feature creep and the technical reasons Joe mentioned.
> >
> > We have a giant hole in our understanding of why there are updated
> > nameservers running the latest software with the older keys. We
> > need to gain understanding and we know we need more data.
> >
> > Getting more data is the core mission, not feature creep. If there is
> > a technically better way to do this, it's worth considering.
> >
>
> The sentinel mechanism is proposed to be capable of posing a question to a
> user’s “DNS Resolution cloud” - it is not intended to be capable of posing a
> question to an individual DNS resolver.
>
> 

The sentinel mechanism also *only* switches certain valid answers (received
from authoritative servers) into a *SERVFAIL*:
 "SERVFAIL    2   The name server encountered an internal
  failure while processing this request,
  for example an operating system error
  or a forwarding timeout."

KSK Sentinel seems like it stretches "internal failure" a fair bit, but
nowhere near breaking point. Having the recursive server return any sort of
other response (especially for a zone it isn't also authoritative for) is
making up answers from whole cloth, and that feels like a huge change.
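
To make the behaviour above concrete, here is a minimal sketch of the
resolver-side decision, not any implementation's actual code. It assumes the
label names and preconditions later published in RFC 8509
(root-key-sentinel-is-ta-<key-tag> and root-key-sentinel-not-ta-<key-tag>),
which may differ from the draft discussed in this thread. The function name
sentinel_rcode and its parameters are hypothetical.

```python
import re

# Assumption: label format taken from RFC 8509 (five-digit decimal key tag).
SENTINEL = re.compile(r"^root-key-sentinel-(is|not)-ta-(\d{5})$")


def sentinel_rcode(qname, qtype, validated_secure, cd_bit, trust_anchor_tags):
    """Return "SERVFAIL" if the sentinel rule flips the answer, else None.

    None means "send the answer that was already resolved and validated";
    the resolver never invents any other kind of response.
    """
    # The rule only applies to answers that validated as Secure, with the
    # CD bit clear, for address queries.
    if not validated_secure or cd_bit or qtype not in ("A", "AAAA"):
        return None

    # Only the leftmost label of the query name is inspected.
    leftmost = qname.rstrip(".").split(".")[0].lower()
    match = SENTINEL.match(leftmost)
    if match is None:
        return None

    kind, tag = match.group(1), int(match.group(2))
    trusted = tag in trust_anchor_tags

    # is-ta: fail when the key is NOT a trust anchor.
    # not-ta: fail when the key IS a trust anchor.
    if (kind == "is" and not trusted) or (kind == "not" and trusted):
        return "SERVFAIL"
    return None
```

With only KSK-2010 (key tag 19036) configured, a query for the is-ta label of
KSK-2017 (key tag 20326) is turned into SERVFAIL, while every other answer
passes through unchanged.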

While I think it would be great to be able to send questions to resolvers
and get all sorts of interesting stats (uptime, qps, list of keys, list of
zones, phase of moon), I don't think that this is the protocol to do it.
This is really not a knock on your idea - I'd encourage you to write a
draft with a "status query" type solution, but that's (IMO) a separate
thing.

W
P.S.: This was written on a plane. Apologies if it is OBE, etc.


> What I am trying to say is that there is a big difference between a question
> of:
>
> “will this user be impacted at the point of the roll of the KSK”
>
> and
>
> “what are the trust keys for this resolver?”, or
> “What is the process uptime of the DNS process on this resolver?”
>
> My intuition is that the mechanisms to implement a measurement
> framework for these questions would necessarily be very different.
>
> Geoff
>
>
>
>
>
> ___
> DNSOP mailing list
> DNSOP@ietf.org
> https://www.ietf.org/mailman/listinfo/dnsop
>


-- 
I don't think the execution is relevant when it was obviously a bad idea in
the first place.
This is like putting rabid weasels in your pants, and later expressing
regret at having chosen those particular rabid weasels and that pair of
pants.
   ---maf
___
DNSOP mailing list
DNSOP@ietf.org
https://www.ietf.org/mailman/listinfo/dnsop


Re: [DNSOP] sentinel and timing?

2018-02-08 Thread Geoff Huston


> On 8 Feb 2018, at 5:02 pm, Paul Wouters  wrote:
> 
> On Wed, 7 Feb 2018, Robert Story wrote:
> 
>> On Wed 2018-02-07 10:43:16-0500 Paul wrote:
>>> How about using this query to also encode an
>>> uptime-processstartedtime value? Maybe with accuracy reduced to
>>> minutes. I think that would return valuable data.
>> 
>> -1 for feature creep and the technical reasons Joe mentioned.
> 
> We have a giant hole in our understanding of why there are updated
> nameservers running the latest software with the older keys. We
> need to gain understanding and we know we need more data.
> 
> Getting more data is the core mission, not feature creep. If there is
> a technically better way to do this, it's worth considering.
> 

The sentinel mechanism is proposed to be capable of posing a question to a
user’s “DNS Resolution cloud” - it is not intended to be capable of posing a
question to an individual DNS resolver.

What I am trying to say is that there is a big difference between a question of:

“will this user be impacted at the point of the roll of the KSK”

and

“what are the trust keys for this resolver?”, or
“What is the process uptime of the DNS process on this resolver?”

My intuition is that the mechanisms to implement a measurement
framework for these questions would necessarily be very different.
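
The user-level question can be answered from the client side alone. A rough
sketch, assuming the three-name test described in RFC 8509 (a validly signed
is-ta name, a validly signed not-ta name, and a name with a deliberately
broken signature). The function classify and the outcome strings are
illustrative names, not part of any specification.

```python
# Each key records whether the client could resolve
# (is-ta name, not-ta name, invalid name), in that order.
OUTCOMES = {
    (True, False, False): "Vnew: resolvers trust the new KSK, user not impacted",
    (False, True, False): "Vold: resolvers trust only the old KSK, user impacted",
    (True, True, False): "Vind: validating, but no sentinel support, unknown",
    (True, True, True): "nonV: not validating, user not impacted",
}


def classify(is_ta_ok, not_ta_ok, invalid_ok):
    """Map three fetch results from one user onto a sentinel outcome.

    The result describes the user's whole set of resolvers, not any single
    resolver, which is why it cannot answer per-host questions like uptime.
    """
    return OUTCOMES.get((is_ta_ok, not_ta_ok, invalid_ok), "other: indeterminate")
```

A user who can fetch the not-ta name but neither of the other two would be
classified as Vold, that is, likely to be affected by the roll.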

Geoff


Re: [DNSOP] sentinel and timing?

2018-02-08 Thread Joe Abley
On 8 Feb 2018, at 13:52, Paul Wouters  wrote:

> On Thu, 8 Feb 2018, Joe Abley wrote:
> 
>> I don't disagree with the need for more data, but I think the hole you 
>> mention is not so giant. As far as I can tell it's a result of:
> 
> How do you know without the data?

I'm talking about the data that I have seen. I described how I thought that 
data was inadequate (not for lack of uptime statistics).

>> 1. RFC5011 support not being turned on in nameservers that have been 
>> upgraded but whose older, DNSSEC-validating configuration has been preserved 
>> across updates (most cases), and
>> 
>> 2. RFC5011 support exercising a code path that requires a writable, 
>> persistent filesystem to store an updated trust anchor, which turns out not 
>> to be available (fewer, but some cases).
> 
> 3. gold images instantiated in private clouds
> 
> 4. AMI images used in AWS
> 
> 5. docker containers
> 
> 6. kubernetes containers
> 
> 7. old configs not getting updated unrelated to 1. and 2.

Right, I didn't see any of your cases (3) through (7).


Joe



Re: [DNSOP] sentinel and timing?

2018-02-08 Thread Paul Wouters

On Thu, 8 Feb 2018, Joe Abley wrote:


> I don't disagree with the need for more data, but I think the hole you
> mention is not so giant. As far as I can tell it's a result of:

How do you know without the data?

> 1. RFC5011 support not being turned on in nameservers that have been
> upgraded but whose older, DNSSEC-validating configuration has been preserved
> across updates (most cases), and
>
> 2. RFC5011 support exercising a code path that requires a writable,
> persistent filesystem to store an updated trust anchor, which turns out not
> to be available (fewer, but some cases).

3. gold images instantiated in private clouds

4. AMI images used in AWS

5. docker containers

6. kubernetes containers

7. old configs not getting updated unrelated to 1. and 2.

> unbound-anchor and its use in package and system start scripts is, I think,
> a key reason why the two problems described above don't show up in unbound.

Even for bind, the OS vendors update the packages with the new keys, so
I think your statement is somewhat simplified.

It could also be that we keep seeing new updated software doing initial
old keys then going to new keys. Or it could be that we see the same
readonly containers/chroot instances starting up.

The uptime of the OS tells you if these are highly customized containers.
The uptime of the process tells you if this is an instance that might
still update itself (maybe no hold timer via cron, only via daemon?)

> I think that the sentinel approach of measuring end-user impact from the
> end-user perspective gets us much closer to useful data in general. However,
> it's not clear to me how even a trusted, accurate sense of uptime across all
> resolvers would help with those questions.


I didn't say anything about not using sentinel. I was only wondering if
we can use those to get more data with respect to resolver state.

Paul



Re: [DNSOP] sentinel and timing?

2018-02-08 Thread Joe Abley
Hi Paul,

(with apologies for breakfast/iPad MIME crime that surely follows)

> On Feb 8, 2018, at 01:02, Paul Wouters  wrote:
> 
>> On Wed, 7 Feb 2018, Robert Story wrote:
>> 
>>> On Wed 2018-02-07 10:43:16-0500 Paul wrote:
>>> How about using this query to also encode an
>>> uptime-processstartedtime value? Maybe with accuracy reduced to
>>> minutes. I think that would return valuable data.
>> 
>> -1 for feature creep and the technical reasons Joe mentioned.
> 
> We have a giant hole in our understanding of why there are updated
> nameservers running the latest software with the older keys. We
> need to gain understanding and we know we need more data.

I don't disagree with the need for more data, but I think the hole you mention 
is not so giant. As far as I can tell it's a result of:

1. RFC5011 support not being turned on in nameservers that have been upgraded 
but whose older, DNSSEC-validating configuration has been preserved across 
updates (most cases), and

2. RFC5011 support exercising a code path that requires a writable, persistent 
filesystem to store an updated trust anchor, which turns out not to be 
available (fewer, but some cases).

These are both BIND9 problems, which I mean as a compliment since they are 
indicative of (a) widespread use, (b) early implementation of DNSSEC and (c) a 
high degree of backwards compatibility in configuration.

The larger question of whether RFC5011 is a practical or sufficient mechanism 
given this experience is a reasonable one. You may recall I have been a serial 
advocate for adding standardised bootstrap mechanisms that include fetching a 
trust anchor out-of-band, for example, which I still think would be a practical 
remedy even if a slightly inelegant one; unbound-anchor and its use in package 
and system start scripts is, I think, a key reason why the two problems 
described above don't show up in unbound.

My sense from the recent KSK rollover/RFC8145 data collection experience is 
that the actual impact on end-users from validators dependent on the outgoing 
KSK is very small. This is hard to quantify with precision, however, because we 
are not able to measure the state of most resolvers (e.g. those not reporting 
via RFC 8145 or not validating), nor assess their operational impact (e.g. size 
of end-user population and impact of validation failures upon them) with any 
degree of accuracy.
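
For comparison, the RFC 8145 signal mentioned above is carried in the query
name itself. A small sketch of how that name is built, assuming the QNAME form
from RFC 8145 ("_ta-" followed by the sorted key tags as four-digit hex,
joined by hyphens, then the zone name). The helper key_tag_qname is a
hypothetical name used only for illustration.

```python
def key_tag_qname(key_tags, zone="."):
    """Build an RFC 8145 style key tag query name for a set of trust anchors."""
    tags = sorted(set(key_tags))
    if not tags:
        raise ValueError("at least one key tag is required")
    for tag in tags:
        if not 0 <= tag <= 0xFFFF:
            raise ValueError(f"key tag out of range: {tag}")

    # Key tags are 16-bit values, written as zero-padded lowercase hex.
    label = "_ta-" + "-".join(f"{tag:04x}" for tag in tags)

    # Normalise the zone to a fully qualified name with a trailing dot.
    suffix = zone if zone.endswith(".") else zone + "."
    return label + (suffix if suffix == "." else "." + suffix)


# A resolver trusting both KSK-2010 (19036) and KSK-2017 (20326) for the root:
print(key_tag_qname([20326, 19036]))  # _ta-4a5c-4f66.
```

A resolver still reporting only "_ta-4a5c." is one that has not picked up the
new key, but the report says nothing about how many users sit behind it.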

I think that the sentinel approach of measuring end-user impact from the 
end-user perspective gets us much closer to useful data in general. However, 
it's not clear to me how even a trusted, accurate sense of uptime across all 
resolvers would help with those questions.


Joe


Re: [DNSOP] sentinel and timing?

2018-02-07 Thread Paul Wouters

On Wed, 7 Feb 2018, Robert Story wrote:

> On Wed 2018-02-07 10:43:16-0500 Paul wrote:
>> How about using this query to also encode an
>> uptime-processstartedtime value? Maybe with accuracy reduced to
>> minutes. I think that would return valuable data.
>
> -1 for feature creep and the technical reasons Joe mentioned.

We have a giant hole in our understanding of why there are updated
nameservers running the latest software with the older keys. We
need to gain understanding and we know we need more data.

Getting more data is the core mission, not feature creep. If there is
a technically better way to do this, it's worth considering.

Paul



Re: [DNSOP] sentinel and timing?

2018-02-07 Thread Robert Story
On Wed 2018-02-07 10:43:16-0500 Paul wrote:
> How about using this query to also encode an
> uptime-processstartedtime value? Maybe with accuracy reduced to
> minutes. I think that would return valuable data.

-1 for feature creep and the technical reasons Joe mentioned.

Maybe we need a SNMP over DNS draft? :-p

-- 
Robert Story 
USC Information Sciences Institute 




Re: [DNSOP] sentinel and timing?

2018-02-07 Thread Joe Abley
Hi Paul,

> On Feb 7, 2018, at 10:43, Paul Wouters  wrote:
> 
> 
> I think it is useful to know how long the DNS resolver process has been
> up, and/or how long the server running the DNS resolver has been up,
> when it is sending the sentinel queries.
> 
> That would allow us to detect if we are looking at spun up server
> instances and/or provisioned containers with old software stuck to
> KSK2010, versus old software running forever on an unmaintained server.

On the authoritative server, receiving a query from a resolver, it's not 
possible to be certain that two queries from the same source address correspond 
to the same originating host. 

On the client side, receiving a response from a resolver, it's even less 
possible to be certain that two responses from the same source address 
correspond to the same originating host. In particular, there are a relatively 
small number of resolver sources used by a large proportion of the end-user 
population, all of which to my knowledge are provisioned at scale, in clusters 
that are often distributed geographically.

I'm not sure what practical use a host-specific "uptime" indicator would have 
unless we also had a way to tie it to a particular host, and we see enough 
people going to the trouble to obscure the responses to ID.SERVER/CH/TXT type 
queries that such host identification might be contentious.

[Disclosure: there is yet more snow coming down, I have not yet had coffee, I 
have not yet left the house today, I am quite possibly running degraded right 
now, so perhaps wait until the failed units have been replaced before 
commenting on performance.]


Joe