Re: Multi-threaded operation?

2015-10-05 Thread Havard Eidnes via Unbound-users
Hi,

it looks like I'll have to answer my own question, which is a
little disappointing:

> I'm running unbound 1.5.4 on NetBSD/amd64 7.0, and I notice that
> despite me having configured
>
> server:
>   num-threads: 12
>   so-reuseport: yes
>
> only one of the threads is handling all the queries, according to
> the output from "unbound-control stats".  "Not what I wanted."

It turns out that using the "so-reuseport" setting to distribute
the load over the threads is a fairly recent Linuxism, and
relying on it causing the kernel to distribute the load over the
different sockets is not portable.

The first answer in

  http://stackoverflow.com/questions/14388706/socket-options-so-reuseaddr-and-so-reuseport-how-do-they-differ-do-they-mean-t

says it quite clearly:

Linux 3.9 added the option SO_REUSEPORT to Linux as well.
[...] Additionally the kernel performs some "special magic"
for SO_REUSEPORT sockets that isn't found in any other
operating system so far: For UDP sockets, it tries to
distribute datagrams evenly, for TCP listening sockets, it
tries to distribute incoming connect requests (those accepted
by calling accept()) evenly across all the sockets that share
the same address and port combination.  [...]

I'll try to turn off the so-reuseport option later today, and see
if that improves the situation.
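
To check whether a change like this makes any difference, the
per-thread query counters can be pulled out of the output from
"unbound-control stats".  What follows is a minimal sketch in Python,
assuming the output has been captured as text.  The sample values are
made up for illustration, and per_thread_queries() is a hypothetical
helper, not something shipped with unbound.

```python
import re


def per_thread_queries(stats_text):
    """Return {thread number: query count} from 'unbound-control stats' text."""
    counts = {}
    for line in stats_text.splitlines():
        # Per-thread counters look like "thread3.num.queries=1234".
        m = re.fullmatch(r"thread(\d+)\.num\.queries=(\d+)", line.strip())
        if m:
            counts[int(m.group(1))] = int(m.group(2))
    return counts


# Made-up sample: one thread takes every query, the others sit idle.
sample = """\
thread0.num.queries=48211
thread0.num.cachehits=30102
thread1.num.queries=0
thread2.num.queries=0
total.num.queries=48211
"""

counts = per_thread_queries(sample)
busy = [thread for thread, n in counts.items() if n > 0]
print(counts, busy)
```

If only one thread number shows up in "busy" after a reasonable
amount of traffic, the load is not being spread over the threads.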

Regards,

- Håvard


unbound-control dump_cache / load_cache

2015-12-29 Thread Havard Eidnes via Unbound-users
Hi,

a while back I needed/wanted to reconfigure my unbound recursor
to have more memory available for the "rrset cache", in what
seems to be a futile attempt at increasing the cache hit rate.

This would cause unbound to discard its cache in its entirety.  I
thought that in order to soften the blow for the users, I would
use the "dump_cache / load_cache" operations of unbound-control
so that I could (more or less) restore the state of the cache
across the reconfiguration.

However, despite my unbound-control "stats" saying via
"mem.cache.rrset" that it had lots of memory consumed for caching --
at the time around 2GB, current values are

mem.cache.rrset=1357501872
mem.cache.message=4587299
mem.mod.iterator=16540
mem.mod.validator=32503056

i.e. 1.3GB of data in the "RRSET" cache, the result of dump_cache
came to only around 43MB (a re-dump just now gave 36MB), and the
mem.cache.rrset value dropped nearly to zero after the reconfigure
and did not increase much after load_cache was performed.

So ... is dump_cache doing a rather incomplete job of dumping the
cache?

What got me started on this path was that I'm observing that the cache
hit rate of my unbound name server (collected via a collectd plugin,
graphed with grafana) is ... rather pitiful, typically hovering
somewhere around 60-65%.  I've configured it to do "prefetch", but I'm
seeing a rather low rate of prefetches -- below 5%, and my other
recursor, running BIND, appears to see 75-80% cache hit rate (which
isn't all that great either, but appears to do marginally better...).

It is admittedly the "quiet season" now, so the daily query rate is
rather low, but I'm still left wondering if there's something which
could be done about the cache eviction policies to increase the cache
hit rate?  And I'm left wondering what all that memory which doesn't
show up in "dump_cache" is used for...
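
For reference, the rates mentioned above can be derived from the
counters that "unbound-control stats" reports.  This is only a sketch
under assumptions: the counter values below are invented, and defining
the prefetch rate as prefetches per cache lookup is my own choice of
denominator, which may differ from what the collectd plugin computes.

```python
def cache_ratios(stats):
    """Return (cache hit rate, prefetch rate) as fractions of cache lookups."""
    hits = stats["total.num.cachehits"]
    misses = stats["total.num.cachemiss"]
    lookups = hits + misses
    if lookups == 0:
        # No traffic yet, so there is nothing to compute a rate from.
        return 0.0, 0.0
    return hits / lookups, stats["total.num.prefetch"] / lookups


# Invented counter values, chosen to land in the ranges discussed above.
sample_stats = {
    "total.num.cachehits": 6200,
    "total.num.cachemiss": 3800,
    "total.num.prefetch": 400,
}

hit_rate, prefetch_rate = cache_ratios(sample_stats)
print(f"hit rate {hit_rate:.0%}, prefetch rate {prefetch_rate:.0%}")
```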

Best regards,

- Håvard


Query forwarding

2016-01-18 Thread Havard Eidnes via Unbound-users
Hi,

I'm trying to figure out how unbound can be configured to behave
with respect to query forwarding.  In unbound.conf(5) I find this
particular gem:

forward-first: 
   If enabled, a query is attempted without the forward clause if
   it fails.  The data could not be retrieved and would have caused
   SERVFAIL because the servers are unreachable, instead it is
   tried without this clause.  The default is no.

I don't mean to be too harsh, but ... can someone please
disambiguate this for me?  In the first sentence it's far from clear
what "if it fails" means (unbound fails to find a cached answer?),
and also slightly unclear what "the forward clause" means.  In the
second it's unclear what "the servers" means (forwarding servers or
delegated-to name servers).  Thirdly, it strikes me that the sense
of the first sentence is probably the opposite of what's intended,
at least from an intuitive interpretation of the option name --
"forward first = yes" indicates "use the configured forwarding
server(s) first before trying to do own recursion"(?).

Once I understand what this does, I can hopefully come up with a
suggestion for replacement text which isn't so rife with traps for
misinterpretation...

Regards,

- Håvard


Re: message is bogus, non secure rrset with Unbound as local caching resolver

2016-03-19 Thread Havard Eidnes via Unbound-users
> But unbound is trying to set the AD flag in its reply.  And thus it
> needs all the RRsets to be secure.  Thus, the reply from the forwarder
> with CD flag becomes bogus.

Yes, I know unbound is trying to validate the answer.  However,
insisting that a recursor return all pertinent data required for
validation of the response, especially with cd=1 set in the query,
is unreasonable.

> I fixed it so that Unbound uses CD=0 to send queries to a forwarder.
> Unless a dnssec trust anchor exists above the qname, in which case CD=0
> is only attempted on the first query.

Not sure I understand what it means to have a "trust anchor exist
above the qname", but otherwise I suspect and hope this will cure
the problem.

> CD flag is still used on all queries to authorities.

Of course.

Regards,

- Håvard


Re: message is bogus, non secure rrset with Unbound as local caching resolver

2016-03-04 Thread Havard Eidnes via Unbound-users
>> Following the "not a bug" response from the BIND maintainers 
>> yesterday evening, can you please point to chapter and verse 
>> mandating this behaviour for non-authoritative recursive 
>> resolvers?
>
> RFC4035 3.2.3 for validators, all RRsets in answer and authority
> sections should be authentic ...

That's an incomplete quote.  A more complete quote would be:

3.2.3.  The AD Bit

   The name server side of a security-aware recursive name server MUST
   NOT set the AD bit in a response unless the name server considers all
   RRsets in the Answer and Authority sections of the response to be
   authentic.  The name server side SHOULD set the AD bit if and only if
   the resolver side considers all RRsets in the Answer section and any
   relevant negative response RRs in the Authority section to be
   authentic. [...]

However, since the CD ("Checking Disabled") bit is set in the query
Unbound sends to its forwarder, the AD ("Authentic Data") bit is of
course not set in its response, so this whole section doesn't really
apply.
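
For anyone reading along in packet captures, the AD and CD bits sit
in the 16-bit flags word of the DNS header.  A small sketch in Python
of decoding that word follows; decode_flags() is a hypothetical helper
for illustration, and the example values are hand-built, not taken
from the pcaps discussed in this thread.

```python
# Flag bits in the 16-bit header flags word (RFC 1035 section 4.1.1,
# with AD and CD as used by DNSSEC).  Opcode and rcode are ignored here.
FLAG_BITS = [
    ("qr", 0x8000),
    ("aa", 0x0400),
    ("tc", 0x0200),
    ("rd", 0x0100),
    ("ra", 0x0080),
    ("ad", 0x0020),
    ("cd", 0x0010),
]


def decode_flags(word):
    """Return the names of the flag bits set in a DNS header flags word."""
    return [name for name, bit in FLAG_BITS if word & bit]


# A query with RD and CD set, as a validator might send to a forwarder.
print(decode_flags(0x0110))
# A response with "qr rd ra" and no AD bit, as dig would display it.
print(decode_flags(0x8180))
```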

Please also note that this section only talks about setting the AD
bit, not about whether it's mandatory to include "all required DNSSEC
records that the querier is missing and which are required to validate
the included information".

If such or a similar mandate exists on a recursive resolver, it must
come from elsewhere.

Therefore my suggestion: "If the validator needs more information to
complete validation, it had better ask for it explicitly."

As has been pointed out elsewhere in this thread, there's a question of
whether setting the CD bit in queries sent to a forwarder is
appropriate, but RFC 6840 (Clarifications and Implementation notes for
DNS security) seems to recommend the practice, ref. section 5.9, but
also lists a number of different ways this can be done, ref. appendix
B, which also mentions the possibility of there being more than one
validator on the path, as can be the case when using a forwarder.

Regards,

- Håvard


Re: message is bogus, non secure rrset with Unbound as local caching resolver

2016-03-02 Thread Havard Eidnes via Unbound-users
>> Unfortunately, the BIND server only tends to return responses where
>> the authority-section has NS-records but no RRSIG-record
>> during the night.  I suspect it has something to do with
>> traffic levels and what other systems are accessing it. It
>> makes it all a bit hard to troubleshoot.  The main source of
>> information for troubleshooting has been a combination of
>> PCAP-files and log files.
>
> Are you sure this is not the bind wildcard bug? Can you try to resolve
> something like pwouters.fedorahosted.org. That's an expanded wildcard.

A couple of responses to an 'a' query for this name follow
below.  In both cases you'll see the Authority section
contains the NS RRSET but not the RRSIG covering the NS RRSET,
something we're not quite sure is "right" (but have not yet found
the scripture on), and which Olav suspects is triggering Unbound
to be unhappy about the response.
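
To make the observation concrete, here is a minimal sketch in Python
which lists the RRsets in a section that have no covering RRSIG in
that same section.  The records are transcribed by hand from the
Authority section of the first dig output below; unsigned_rrsets() is
a hypothetical helper which only checks for the presence of an RRSIG
and does no cryptographic validation at all.

```python
def unsigned_rrsets(records):
    """Return sorted (owner, type) pairs lacking a covering RRSIG.

    records is an iterable of (owner, rrtype, covered), where covered
    is the type an RRSIG covers, and None for every other record.
    """
    rrsets = set()
    signed = set()
    for owner, rrtype, covered in records:
        if rrtype == "RRSIG":
            signed.add((owner, covered))
        else:
            rrsets.add((owner, rrtype))
    return sorted(rrsets - signed)


# Authority section of the first response, reduced to owner and type.
authority = [
    ("mtn.fedorahosted.org.", "NSEC", None),
    ("mtn.fedorahosted.org.", "RRSIG", "NSEC"),
    ("fedoraproject.org.", "NS", None),
    ("fedoraproject.org.", "NS", None),
    ("fedoraproject.org.", "NS", None),
]

print(unsigned_rrsets(authority))
```

The NSEC record is covered, while the NS RRset for fedoraproject.org
is the one left without a signature.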

> If so, this is the same bug as:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=824219

You mean the ISC RT#21409 which is mentioned in there, or
something else?  The recursor Olav's machine is forwarding to
(oliven.uninett.no) is running BIND 9.9.8-P2, and according to
its CHANGES file, that bug was squashed in the run-up to 9.9.3b2:

3444.   [bug]   The NOQNAME proof was not being returned from cached
insecure responses. [RT #21409]

Or is "the bind wildcard bug" something else?  If so please
provide more information.

Best regards,

- Håvard
: {12} ; dig pwouters.fedorahosted.org. a +dnssec

; <<>> DiG 9.10.2-P4 <<>> pwouters.fedorahosted.org. a +dnssec
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 11578
;; flags: qr rd ra; QUERY: 1, ANSWER: 4, AUTHORITY: 5, ADDITIONAL: 6

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags: do; udp: 4096
;; QUESTION SECTION:
;pwouters.fedorahosted.org. IN  A

;; ANSWER SECTION:
pwouters.fedorahosted.org. 60   IN  CNAME   hosted03.fedoraproject.org.
pwouters.fedorahosted.org. 60   IN  RRSIG   CNAME 5 2 60 20160331192054 
20160301192054 39900 fedorahosted.org. 
P91FaEGxGv2Yrsdo5eDfhkpJD2zqkkoVkJr6dz9XYl0Y2TBG2FQ1OArv 
wUwu/bbi63LDVXsJqmg+AarvQ/xkB6f0C9Ro5/cnQFgQ0zjhi1/n/R7I 
vdXXYMU3xslNTe5s7U2YfCquHtKti8q6bM/ltxgtD03QJz8OxAIbpiyj 4VQ=
hosted03.fedoraproject.org. 267 IN  A   140.211.169.199
hosted03.fedoraproject.org. 267 IN  RRSIG   A 5 3 300 20160331192053 
20160301192053 7725 fedoraproject.org. 
n/lc4F2WKfEnq9kTqjWuBH1YbCjSiFPT1NQuDF9x30BHliC8D6M+EZKC 
Lcx2JVdzi+Gb/DREkp/facfVGsslfGjKfkhl4AL0kDD638I7qhnR8TJp 
D9e+B26xRwORMEDTALc/8KkfPNiBF1rztu2dvVSXR/LsIZd/y/3hyudO Fwk=

;; AUTHORITY SECTION:
mtn.fedorahosted.org.   60  IN  NSEC    sssd.fedorahosted.org. A SSHFP RRSIG NSEC
mtn.fedorahosted.org.   60  IN  RRSIG   NSEC 5 3 86400 20160331192054 
20160301192054 39900 fedorahosted.org. 
p8tlcTZI3cDVAqlk2pbpGHUmDm/tZJyE2PSQNRJsOGXKnVWdZOs9Xovf 
bvJbsnVpeun9S4BosZ6UytlnX7XPn+jVu4KYZ2DK8tdAhyNOJOyVjTnh 
QJtGgPRWnraHA/hKWYsTpkK3meW2/kZdHsSsJodYeQ4WOhsa681htoYp 3vY=
fedoraproject.org.  86367   IN  NS  ns02.fedoraproject.org.
fedoraproject.org.  86367   IN  NS  ns05.fedoraproject.org.
fedoraproject.org.  86367   IN  NS  ns04.fedoraproject.org.

;; ADDITIONAL SECTION:
ns02.fedoraproject.org. 86314   IN  A   152.19.134.139
ns02.fedoraproject.org. 86314   IN  AAAA    2610:28:3090:3001:dead:beef:cafe:fed5
ns05.fedoraproject.org. 86314   IN  A   85.236.55.10
ns05.fedoraproject.org. 86314   IN  AAAA    2001:4178:2:1269:dead:beef:cafe:fed5
ns04.fedoraproject.org. 86314   IN  A   209.132.181.17

;; Query time: 322 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Wed Mar 02 20:06:31 CET 2016
;; MSG SIZE  rcvd: 844

: {13} ; rndc status
version: 9.10.2-P4 
...


: {14} ; dig @oliven.uninett.no. pwouters.fedorahosted.org. a +dnssec

; <<>> DiG 9.10.2-P4 <<>> @oliven.uninett.no. pwouters.fedorahosted.org. a 
+dnssec
; (2 servers found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 35941
;; flags: qr rd ra; QUERY: 1, ANSWER: 4, AUTHORITY: 5, ADDITIONAL: 6

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags: do; udp: 4096
;; QUESTION SECTION:
;pwouters.fedorahosted.org. IN  A

;; ANSWER SECTION:
pwouters.fedorahosted.org. 60   IN  CNAME   hosted03.fedoraproject.org.
pwouters.fedorahosted.org. 60   IN  RRSIG   CNAME 5 2 60 20160331192054 
20160301192054 39900 fedorahosted.org. 
P91FaEGxGv2Yrsdo5eDfhkpJD2zqkkoVkJr6dz9XYl0Y2TBG2FQ1OArv 
wUwu/bbi63LDVXsJqmg+AarvQ/xkB6f0C9Ro5/cnQFgQ0zjhi1/n/R7I 
vdXXYMU3xslNTe5s7U2YfCquHtKti8q6bM/ltxgtD03QJz8OxAIbpiyj 4VQ=
hosted03.fedoraproject.org. 300 IN  A   140.211.169.199
hosted03.fedoraproject.org. 300 IN  RRSIG   A 5 3 300 20160331192053 
20160301192053 7725 fedoraproject.org. 
n/lc4F2WKfEnq9kTqjWuBH1YbCjSiFPT1NQuDF9x30BHliC8D6M+EZKC 
Lcx2JVdzi+Gb/DREkp/facfVGsslfGjKfkhl4AL0kDD638I7qhnR8TJp 

Re: message is bogus, non secure rrset with Unbound as local caching resolver

2016-03-02 Thread Havard Eidnes via Unbound-users
>> The "right" thing is to have RRSIGs for all elements of the
>> answer and authority sections.  This is mandated by
>> RFC4034,4035.  All the RRsets in the answer and authority
>> section MUST validate to mark the response as valid.
> 
> FYI, I've submitted a tentative bug report to the BIND maintainers
> based on my message and the one I'm replying to here, RT#41844.

And... They're not having it:

  This is not a bug.  Section 3.1.1 applies to authoritative nameservers
  not intermediate caching nameservers.  In this case you are seeing the
  referral which is unsigned being returned from the cache.

Regards,

- Håvard


Re: message is bogus, non secure rrset with Unbound as local caching resolver

2016-03-02 Thread Havard Eidnes via Unbound-users
> The "right" thing is to have RRSIGs for all elements of the
> answer and authority sections.  This is mandated by
> RFC4034,4035.  All the RRsets in the answer and authority
> section MUST validate to mark the response as valid.

FYI, I've submitted a tentative bug report to the BIND maintainers
based on my message and the one I'm replying to here, RT#41844.

Regards,

- Håvard


Re: message is bogus, non secure rrset with Unbound as local caching resolver

2016-03-03 Thread Havard Eidnes via Unbound-users
>>   info: validate(cname): sec_status_secure
>>   info: validate(positive): sec_status_secure
>>   info: message is bogus, non secure rrset uninett.no. NS IN
>>
>> As far as I can tell, the problem here is caused by extra NS-records in
>> the authority-section that do not include the RRSIG element for the
>> NS-records, but I can't really say that for certain.
>
> This sounds a lot like a problem we discussed last year. See
> https://unbound.net/pipermail/unbound-users/2015-February/003757.html

Yep, indeed, this does appear to be exactly the same root cause,
although DLV isn't really relevant to the problem.  Unbound appears to
enforce compliance with section 3.1.1 of RFC 4035 even when it's
querying a forwarder, which is a non-authoritative recursive resolver.
In fact, the entire 3.1 section only talks about the behaviour of
authoritative name servers, while section 3.2 talks about recursive
name servers.  Thus, unbound enforcing 3.1.1 on responses to forwarded
queries is just Wrong, and a bug in unbound.

> As I said back then, I think it's wrong to discard the entire response if
> parts of it are bogus. Unbound should keep the valid parts because it
> knows there is nothing wrong with them.

Come to think of it, anything you get from a recursive resolver is
possibly a cached hint, including what you get in the Answer section.
If a validating resolver needs other RRsets than those supplied in the
answer (all sections) or what it has in the cache, it should
explicitly ask for them.

Granted, modern recursive name servers used for forwarding will set
"DNSSEC OK" in outgoing queries (as mandated by RFC 3225), and will
supply any cached related DNSSEC material in the reply when queried
with the "DNSSEC OK" flag set.  However, they do not have to comply with
section 3.1 of RFC 4035 when composing the reply.

> Does Unbound use CD=1 when forwarding? If so, it should expect to receive
> partially bogus answers and should handle them gracefully.

Yep, as Olav replied, and the pcaps I capture on the BIND recursor
agree: CD=1 is set in the forwarded queries.

Regards,

- Håvard


Re: Unbound exiting on stats write failure?

2016-09-20 Thread Havard Eidnes via Unbound-users
> The error is on a pipe between unbound processes (threads).  It should
> not be out of resources (it might block of course, waiting for them, and
> blocking pipes are not a problem for unbound, but this error is like a
> pipe randomly breaks up).

Hm.

> Are you on OpenBSD?  Perhaps upgrade the kernel?

Nope, on NetBSD 7.0.

Regards,

- Håvard


Unbound exiting on stats write failure?

2016-09-20 Thread Havard Eidnes via Unbound-users
Hi,

one of our unbound hosts recently exited, and before it did, it
logged this:

  Sep 19 14:25:56 xxx unbound: [96:4] error: tube msg write failed: Resource temporarily unavailable
  Sep 19 14:25:56 xxx unbound: [96:4] fatal error: could not write stat values over cmd channel

Now, we're periodically polling stats via "unbound-control stats" and
feeding this into collectd, and our collectd hasn't exactly been fully
stable.  However, is there a good reason the failure to write the
stats values is considered a fatal error?  One would have thought that
it would not be, and that abandoning the output channel would be a
reasonable error recovery mechanism, allowing the main task of unbound
to proceed uninterrupted?

Regards,

- Håvard


Re: Unbound exiting on stats write failure?

2016-10-03 Thread Havard Eidnes via Unbound-users
>> one of our unbound hosts recently exited, and before it did, it
>> logged this:
>> 
>>   Sep 19 14:25:56 xxx unbound: [96:4] error: tube msg write failed: 
>> Resource temporarily unavailable
>>   Sep 19 14:25:56 xxx unbound: [96:4] fatal error: could not write stat 
>> values over cmd channel
>
> The error is on a pipe between unbound processes (threads).  It should
> not be out of resources (it might block of course, waiting for them, and
> blocking pipes are not a problem for unbound, but this error is like a
> pipe randomly breaks up).

This turned out to be caused by us running too old a version of
unbound, version 1.5.4.  I've since upgraded to 1.5.9, so this
exact problem should not happen again for us.  In-between there,
tube_write_msg() grew a test for EAGAIN (causing a retry) in the
non-blocking case.
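
The general pattern, retrying a write on a non-blocking descriptor
when it reports EAGAIN, can be sketched as follows in Python.  This is
an illustration of the technique only, not a transcription of
unbound's C code; write_all_nonblocking() and its timeout parameter
are hypothetical.

```python
import os
import select


def write_all_nonblocking(fd, data, timeout=5.0):
    """Write all of data to a non-blocking fd, retrying on EAGAIN."""
    view = memoryview(data)
    while len(view) > 0:
        try:
            written = os.write(fd, view)
        except BlockingIOError:
            # EAGAIN/EWOULDBLOCK: the pipe is full right now.  Wait
            # until it is writable again instead of treating it as fatal.
            _, writable, _ = select.select([], [fd], [], timeout)
            if not writable:
                raise TimeoutError("peer did not drain the pipe in time")
            continue
        # A short write is possible; keep going with the remainder.
        view = view[written:]


# Small demonstration over a pipe with a non-blocking write end.
read_end, write_end = os.pipe()
os.set_blocking(write_end, False)
write_all_nonblocking(write_end, b"stats")
received = os.read(read_end, 16)
print(received)
```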

Regards,

- Håvard


Re: TCP fallback on timeout

2017-04-27 Thread Havard Eidnes via Unbound-users
> Unfortunately, DNS servers aren't required to support TCP.

IMHO, that is an all too commonly held misconception.  Publishing name
servers need to support TCP as well.  I'm pretty sure section 4.2 of
RFC 1035 mandates it.  It doesn't use the formal requirements keywords
because it predates the RFC which defined their use in this document
series.

Regards,

- Håvard


Re: Negative cache being ignored.

2017-10-17 Thread Havard Eidnes via Unbound-users
> In this example, trying to lookup a CAA record for a domain:
> ...
> # time host -t CAA jhmnet.net 192.168.136.181 
...
> real    0m3.876s 
>
> Run this again, immediately after:
..
> real    0m0.016s
>
> Implying the cache is working as expected. (cache-max-negative-ttl: 120)
> 
> However, after about ~9 seconds, the query goes back to taking
> 3-4 seconds, implying its not. Sure enough a tcpdump on the
> host running unbound shows it trying to access the jhmnet.net
> Auth server(s)
>
> Why is unbound not respecting the 2 (120second) min max-negative-ttl?

The situation with jhmnet.net is that it's completely off the
air, because neither of the two delegated-to name servers serve
the zone, so you have a "double lame delegation".

Negative caching revolves around negative authoritative answers,
and this isn't that -- the resolver simply wasn't able to get any
answer whatsoever.
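
The distinction can be sketched as a small classification in Python.
This is a simplified illustration and certainly not unbound's
implementation: classify() is a hypothetical helper, it takes the
rcode as a string, and it ignores referrals, which also have an empty
answer section.

```python
from typing import Optional


def classify(rcode: Optional[str], answer_count: int = 0) -> str:
    """Classify a lookup outcome; rcode is None if no server answered."""
    if rcode is None:
        # Timeouts or lame servers all around: there is no answer,
        # hence nothing authoritative to cache negatively.
        return "no-answer"
    if rcode == "NXDOMAIN":
        return "negative"   # the name does not exist
    if rcode == "NOERROR" and answer_count == 0:
        return "negative"   # NODATA: the name exists, the type does not
    if rcode == "NOERROR":
        return "positive"
    return "failure"        # SERVFAIL, REFUSED and friends


print(classify(None))
print(classify("NXDOMAIN"))
print(classify("NOERROR", answer_count=0))
```

Only the "negative" outcomes are what negative caching is about; the
jhmnet.net lookups fall in the "no-answer" bucket.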

Regards,

- Håvard


Re: Some sites not resolving (DNSSEC?)

2018-05-23 Thread Havard Eidnes via Unbound-users
> This generally seems to work except for several hosts from which I try to
> fetch podcasts. One of these is coder.show.

Just a note,

 http://dnsviz.net/d/coder.show/dnssec/

shows several warnings related to coder.show -- apparently the
auth name servers reply with CNAME *and* other data for the zone
apex, and they also fail to respond with an EDNS0 OPT record.

Regards,

- Håvard