Bug#898073: Bug#897917: Stretch kernel 4.9.88-1 breaks startup of RPC, KDC services

2018-07-16 Thread Sam Hartman
As an FYI, I did some experiments with kvm, and I do seem to have enough
entropy to get the KDC started there.

I have not played with Xen recently.  It's a bit harder to set that up,
and I would not be surprised if it is trickier to get randomness there
than with kvm.



Bug#898073: Bug#897917: Stretch kernel 4.9.88-1 breaks startup of RPC, KDC services

2018-05-08 Thread Greg Hudson
On Mon, 7 May 2018 18:28:01 -0500 Benjamin Kaduk  wrote:
> At high risk of opening up the RNG debate that I did not want to
> revisit, the current stance of upstream krb5 seems to fall into what
> I'll call the "Schneier worldview", that a fully-seeded
> well-designed CSPRNG can produce arbitrary amounts of random output
> with no need to track "entropy depletion" or similar (emphasis on
> fully-seeded).

This is a good summary of my position (speaking as the current best
representative of "upstream krb5").  In particular, prior to
getrandom(), I had been frustrated with these Linux /dev/[u]random
properties:

1. /dev/random's concept of "strong" entropy attempts to mitigate
against a preimage attack against SHA-1, and therefore depletes the
estimate of strong randomness available as random data is read.  If you
aren't trying to mitigate against an attack on the CSPRNG, you don't
need any concept of "running out of" randomness, only a one-time switch
from "not ready" to "ready".  (You might try to further reduce the risk
of predictability after you've switched to "ready" state, of course.
Schneier's Fortuna algorithm does this, under the assumption that
entropy estimates will always be fallible.)

2. Reading from /dev/urandom can fully deplete the strong random
estimate; no portion of the estimated incoming hardware entropy is
reserved for /dev/random.  Most running systems read from /dev/urandom
often enough that the randomness estimate constantly returns to zero, so
reading from /dev/random almost always blocks for a noticeable period.

I had always viewed the "strong" parameter to krb5_c_os_random_entropy()
as a workaround for these kernel properties (and perhaps similar
properties on other kernels, although I'm not aware of any specific
examples).  We always want "strong" randomness.  Running a KDC and
issuing tickets with predictable session keys is terrible, even if it is
less terrible than generating long-term keys predictable to an attacker.
But reading from /dev/random with any frequency is too painful, and
reading from /dev/urandom almost always yields unpredictable output.

For those reasons I was happy to accept a change to use getrandom() with
no flags on Linux, ignoring the "strong" parameter and bypassing the
userspace Fortuna code.  I can see that this change creates new KDC
availability issues on VMs, but I don't see that as a krb5-specific issue.
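
As a minimal sketch of what "getrandom() with no flags" looks like in
practice (this is not the actual krb5 code; fill_random is a hypothetical
helper, and the glibc getrandom() wrapper from <sys/random.h> is assumed,
which needs glibc 2.25 or later; older systems would have to call
syscall(SYS_getrandom, ...) instead):

```c
#include <errno.h>
#include <stddef.h>
#include <sys/random.h>
#include <sys/types.h>

/* Hypothetical helper: fill buf with len bytes from the kernel CSPRNG.
 * With flags == 0 the call blocks only until the pool has been
 * initialized once; there is no notion of later "running out".
 * Returns 0 on success, -1 on error with errno set. */
int fill_random(unsigned char *buf, size_t len)
{
    size_t done = 0;

    while (done < len) {
        ssize_t n = getrandom(buf + done, len - done, 0);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted while waiting; retry */
            return -1;
        }
        done += (size_t)n;      /* large requests may return short */
    }
    return 0;
}
```

A caller would pass a key-sized buffer, for example 32 bytes.  The
availability concern in this thread is exactly the first call: it can
block for a long time on a freshly booted VM with few entropy sources.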



Bug#898073: Bug#897917: Stretch kernel 4.9.88-1 breaks startup of RPC, KDC services

2018-05-08 Thread Benjamin Kaduk
On Tue, May 08, 2018 at 09:28:08AM -0400, Sam Hartman wrote:
> Benjamin> Now, we have getrandom(), which is a great API and is
> Benjamin> pretty much exactly what you want (again, at least in this
> Benjamin> worldview).  IIUC Ted says that you should "just use
> Benjamin> getrandom" for your entropy needs and not worry about
> Benjamin> /dev/*random.  I don't know whether he takes a stance on
> Benjamin> the GRND_RANDOM flag, though.
> 
> And I think that's fine for kadmind.
> I think there's a very real practical question about whether you want
> the KDC to fail to start if your RNG is not seeded.
> Having your KDCs be unavailable from a cold start of an environment is a
> really big thing.

I'll note that the original user report seems to have involved a
virtual machine running on Xen; my general expectation is that
bare-metal KDCs will get enough entropy from device attachment and
network traffic for long blocking to not be an issue.
Enterprise-scale deployments that use virtualized KDCs are likely to
have proper randomness pass-through devices installed, so I suspect
that the number of sites that are at any significant risk of being
affected will be a pretty small percentage.

Do you think we should raise the question on upstream's mailing
list?

-Ben



Bug#898073: Bug#897917: Stretch kernel 4.9.88-1 breaks startup of RPC, KDC services

2018-05-08 Thread Sam Hartman
> "Benjamin" == Benjamin Kaduk  writes:

Benjamin> On Mon, May 07, 2018 at 05:10:27PM +0100, Ben Hutchings wrote:
>> On Mon, 2018-05-07 at 11:57 -0400, Sam Hartman wrote:
>> 
>> There are basically three "strengths" of random numbers available
>> now:
>> 
>> Weak:   /dev/urandom
>> Medium: getrandom(flags=0)
>> Strong: /dev/random, getrandom(flags=GRND_RANDOM)
>> 
>> k5_get_os_entropy() has switched from weak/strong depending on
>> the "strong" flag to always medium.  I think what you actually
>> want is medium/strong.

Benjamin> At high risk of opening up the RNG debate that I did not
Benjamin> want to revisit, the current stance of upstream krb5 seems
Benjamin> to fall into what I'll call the "Schneier worldview", that
Benjamin> a fully-seeded well-designed CSPRNG can produce arbitrary
Benjamin> amounts of random output with no need to track "entropy
Benjamin> depletion" or similar (emphasis on fully-seeded).  So the
Benjamin> question (for them) is not "strong" or "weak", but rather
Benjamin> "fully seeded" or "not seeded yet".

OK, I'm happy with this model.  It's certainly along the lines of what I
had in mind when I wrote some of the interfaces we're talking about.

Benjamin> In this worldview, if
Benjamin> you have to choose between /dev/random and /dev/urandom,
Benjamin> (1) /dev/random is the only thing that actually provides
Benjamin> the guarantee you want, but (2) /dev/random is incredibly
Benjamin> painful for using "all the time", so you're tempted to use
Benjamin> /dev/urandom for cases where it's "less important", like
Benjamin> session keys, but reserve /dev/random for times when you
Benjamin> really care about the "fully seeded" property, like
Benjamin> long-term keys.  When those were the only choices, the
Benjamin> 'strong' flag made sense.

Benjamin> Now, we have getrandom(), which is a great API and is
Benjamin> pretty much exactly what you want (again, at least in this
Benjamin> worldview).  IIUC Ted says that you should "just use
Benjamin> getrandom" for your entropy needs and not worry about
Benjamin> /dev/*random.  I don't know whether he takes a stance on
Benjamin> the GRND_RANDOM flag, though.

And I think that's fine for kadmind.
I think there's a very real practical question about whether you want
the KDC to fail to start if your RNG is not seeded.
Having your KDCs be unavailable from a cold start of an environment is a
really big thing.

Benjamin> Anyway, I mention this all in the hope that we can just
Benjamin> drop this line of discussion and let upstream krb5 decide
Benjamin> what properties they want from a CSPRNG, and not try to
Benjamin> revisit that design.

To the extent that it impacts our jobs as system integrators to actually
provide an enterprise system that has availability, I don't think we
can.


Benjamin> To answer Sam's questions, in the above worldview, the
Benjamin> right answer for the KDC and the right answer for kadmind
Benjamin> are the same -- just use getrandom().  It provides the
Benjamin> output of a high-quality CSPRNG, and is guaranteed to
Benjamin> block until fully seeded.  In this worldview, the
Benjamin> cryptographic quality of the (fully seeded) urandom pool
Benjamin> is more than adequate, so there's no need to ever pass
Benjamin> GRND_RANDOM.

To be clear I'm fine with that as far as it goes.
My concern is simply that we (Debian) need to consider the availability
question.

I know that upstream did not seriously consider that question when we
first adopted the Schneier world view.  I don't know if upstream has
adequately considered that since, and if they have, whether their design
tradeoffs are the same as ours in Debian.

I do really appreciate your reframing the question though because I
think availability vs strength will be an easier design discussion than
quality of random numbers and entropy.

Benjamin> I'm certainly open to having krb5 ship a proof-of-concept
Benjamin> wait-for-entropy.service in unstable for a while, though
Benjamin> it seems like something better suited for libc or systemd
Benjamin> core for the long term.

Benjamin> If we need to for stretch, it would presumably be easy
Benjamin> enough to just add a stanza to the KDC's unit file to
Benjamin> increase the timeout (though how do we know what sort of
Benjamin> timeout is "long enough"?).

Agreed.



Bug#898073: Bug#897917: Stretch kernel 4.9.88-1 breaks startup of RPC, KDC services

2018-05-07 Thread Benjamin Kaduk
On Mon, May 07, 2018 at 05:10:27PM +0100, Ben Hutchings wrote:
> On Mon, 2018-05-07 at 11:57 -0400, Sam Hartman wrote:
> 
> There are basically three "strengths" of random numbers available now:
> 
> Weak:   /dev/urandom
> Medium: getrandom(flags=0)
> Strong: /dev/random, getrandom(flags=GRND_RANDOM)
> 
> k5_get_os_entropy() has switched from weak/strong depending on the
> "strong" flag to always medium.  I think what you actually want is
> medium/strong.

At high risk of opening up the RNG debate that I did not want to
revisit, the current stance of upstream krb5 seems to fall into what
I'll call the "Schneier worldview", that a fully-seeded
well-designed CSPRNG can produce arbitrary amounts of random output
with no need to track "entropy depletion" or similar (emphasis on
fully-seeded).  So the question (for them) is not "strong" or
"weak", but rather "fully seeded" or "not seeded yet".  In this
worldview, if you have to choose between /dev/random and
/dev/urandom, (1) /dev/random is the only thing that actually
provides the guarantee you want, but (2) /dev/random is incredibly
painful for using "all the time", so you're tempted to use
/dev/urandom for cases where it's "less important", like session
keys, but reserve /dev/random for times when you really care about
the "fully seeded" property, like long-term keys.  When those were
the only choices, the 'strong' flag made sense.

Now, we have getrandom(), which is a great API and is pretty much
exactly what you want (again, at least in this worldview).  IIUC Ted
says that you should "just use getrandom" for your entropy needs and
not worry about /dev/*random.  I don't know whether he takes a
stance on the GRND_RANDOM flag, though.

Anyway, I mention this all in the hope that we can just drop this
line of discussion and let upstream krb5 decide what properties they
want from a CSPRNG, and not try to revisit that design.


To answer Sam's questions, in the above worldview, the right answer
for the KDC and the right answer for kadmind are the same -- just
use getrandom().  It provides the output of a high-quality CSPRNG,
and is guaranteed to block until fully seeded.  In this worldview,
the cryptographic quality of the (fully seeded) urandom pool is more
than adequate, so there's no need to ever pass GRND_RANDOM.

I'm certainly open to having krb5 ship a proof-of-concept
wait-for-entropy.service in unstable for a while, though it seems
like something better suited for libc or systemd core for the long
term.

If we need to for stretch, it would presumably be easy enough to
just add a stanza to the KDC's unit file to increase the timeout
(though how do we know what sort of timeout is "long enough"?).
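
For illustration only, such a stanza could be supplied as a systemd
drop-in rather than by editing the shipped unit.  The unit name
krb5-kdc.service, the drop-in file name, and the timeout value are all
assumptions here; the thread does not settle what value is long enough.

```ini
# /etc/systemd/system/krb5-kdc.service.d/entropy-timeout.conf
# (hypothetical drop-in; unit name assumed to be krb5-kdc.service)
[Service]
# Assumed value, chosen only to illustrate the directive.
TimeoutStartSec=10min
```

After adding a drop-in like this, "systemctl daemon-reload" is needed
before the new timeout applies.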

> I'm going to start a discussion on debian-release, as we need to
> coordinate a solution across multiple packages.

Thanks, I'm glad someone with more time than me has already started
getting the right thing done.

> Jann Horn suggested improving systemd-random-seed.service so that it
> actually credits entropy after feeding saved random bits into the RNG. 
> But this will require some care to ensure we never use the same random
> bits twice (including on multiple systems built from the same system
> image).

Indeed, that's in the general case a rather hard problem.

-Ben




Bug#898073: Bug#897917: Stretch kernel 4.9.88-1 breaks startup of RPC, KDC services

2018-05-07 Thread Ben Hutchings
On Mon, 2018-05-07 at 11:57 -0400, Sam Hartman wrote:
> I'm returning from vacation and jumping into the middle of this.
> 
> Back in the day when I wrote the code that became k5_get_os_entropy we
> viewed two cases:
> 
> * kadmind.  There we're likely to sometimes be generating long-term
>   shared secrets, and it seemed like strong random numbers were
>   important.
> 
> * krb5kdc, where we were generating session keys used for a few hours.
> 
> It seems like the change to use the getrandom syscall or other code
> changes have moved all the code to prefer strong random numbers.
> That seems problematic at startup for the KDC.

There are basically three "strengths" of random numbers available now:

Weak:   /dev/urandom
Medium: getrandom(flags=0)
Strong: /dev/random, getrandom(flags=GRND_RANDOM)

k5_get_os_entropy() has switched from weak/strong depending on the
"strong" flag to always medium.  I think what you actually want is
medium/strong.
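
The three sources listed above can be sketched side by side in C.  The
enum and function names are hypothetical, the glibc getrandom() wrapper
(glibc 2.25 or later) is assumed, and the blocking behaviour noted in the
comments is the behaviour described in this thread for kernels of that
era:

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/random.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical labels for the three "strengths" in the list above. */
enum rng_strength { RNG_WEAK, RNG_MEDIUM, RNG_STRONG };

/* Read up to len bytes at the requested strength.
 * Returns the number of bytes read, or -1 on error. */
ssize_t read_random(enum rng_strength s, void *buf, size_t len)
{
    ssize_t n;
    int fd;

    switch (s) {
    case RNG_WEAK:
        /* /dev/urandom never waits for the pool to be initialized. */
        fd = open("/dev/urandom", O_RDONLY);
        if (fd < 0)
            return -1;
        do {
            n = read(fd, buf, len);
        } while (n < 0 && errno == EINTR);
        close(fd);
        return n;
    case RNG_MEDIUM:
        /* Blocks until the pool is initialized, then never again. */
        do {
            n = getrandom(buf, len, 0);
        } while (n < 0 && errno == EINTR);
        return n;
    case RNG_STRONG:
        /* Same source as /dev/random; may block or return short. */
        do {
            n = getrandom(buf, len, GRND_RANDOM);
        } while (n < 0 && errno == EINTR);
        return n;
    }
    errno = EINVAL;
    return -1;
}
```

In these terms, the suggestion above is that a "strong" request should
map to RNG_STRONG and everything else to RNG_MEDIUM, never to RNG_WEAK.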

> Even if we do develop a target indicating that the RNG is seeded, do we
> really want to block the KDC startup on waiting for this target?
> 
> I see a few issues here:
> 
> 1)  What's the right behavior for the KDC?
> 
> 2) What's the right behavior for kadmind?
> 
> 3) Do we want to provide such a service in krb5-kdc or elsewhere?
> 
> 4) What do we want to do about stretch?  It sounds like we need a fix
> that's small enough that we can backport it to stretch and not make the
> SRMs grumpy.

I'm going to start a discussion on debian-release, as we need to
coordinate a solution across multiple packages.

Jann Horn suggested improving systemd-random-seed.service so that it
actually credits entropy after feeding saved random bits into the RNG. 
But this will require some care to ensure we never use the same random
bits twice (including on multiple systems built from the same system
image).

Ben.

-- 
Ben Hutchings
Life is what happens to you while you're busy making other plans.
  - John Lennon





Bug#898073: Bug#897917: Stretch kernel 4.9.88-1 breaks startup of RPC, KDC services

2018-05-07 Thread Sam Hartman
I'm returning from vacation and jumping into the middle of this.

Back in the day when I wrote the code that became k5_get_os_entropy we
viewed two cases:

* kadmind.  There we're likely to sometimes be generating long-term
  shared secrets, and it seemed like strong random numbers were
  important.

* krb5kdc, where we were generating session keys used for a few hours.

It seems like the change to use the getrandom syscall or other code
changes have moved all the code to prefer strong random numbers.
That seems problematic at startup for the KDC.

Even if we do develop a target indicating that the RNG is seeded, do we
really want to block the KDC startup on waiting for this target?

I see a few issues here:

1)  What's the right behavior for the KDC?

2) What's the right behavior for kadmind?

3) Do we want to provide such a service in krb5-kdc or elsewhere?

4) What do we want to do about stretch?  It sounds like we need a fix
that's small enough that we can backport it to stretch and not make the
SRMs grumpy.

--Sam



Bug#898073: Bug#897917: Stretch kernel 4.9.88-1 breaks startup of RPC, KDC services

2018-05-06 Thread Russ Allbery
Benjamin Kaduk  writes:
> On Sun, May 06, 2018 at 07:05:56PM -0700, Russ Allbery wrote:

>> This seems trivial enough that the krb5-kdc package could just ship
>> this service for now and gauge interest.  I think all you'd need is a
>> program that called getrandom() and then exited when it returned, run
>> via systemd as a Type=oneshot service that krb5-kdc depends on and with
>> a reasonable timeout.

> I think that's what it would look like, yes.  It's less clear that
> putting it in krb5-kdc would actually do anything to gauge demand,
> but I suppose I could be wrong.

Yeah, that was probably the wrong phrasing.  Proof of concept?  To be
usable by any other package, it would have to be a separate package, so it
would be more of an immediate workaround.  It would at least demonstrate
whether this solution works, which is a good basis to talk to the systemd
maintainers (either in Debian or upstream).

-- 
Russ Allbery (r...@debian.org)   



Bug#898073: Bug#897917: Stretch kernel 4.9.88-1 breaks startup of RPC, KDC services

2018-05-06 Thread Benjamin Kaduk
On Sun, May 06, 2018 at 07:05:56PM -0700, Russ Allbery wrote:
> Benjamin Kaduk  writes:
> > On Sun, May 06, 2018 at 08:43:13PM +0100, Ben Hutchings wrote:
> >> On Sun, 2018-05-06 at 14:02 -0500, Benjamin Kaduk wrote:
> 
> >>> Arguably more preferable would be to have a systemd target that
> >>> indicates the RNG is seeded, and then krb5 could have its KDC service
> >>> depend on this "RNG-available" service.  So far as I know, no such
> >>> service currently exists, so again, there would need to be some
> >>> systemd effort (potentially in cooperation with the kernel) to provide
> >>> such a service.
> 
> >> Yes, that certainly seems like a good approach.
> 
> > Do you know who would be the right person to talk to about getting
> > that work done?
> 
> This seems trivial enough that the krb5-kdc package could just ship this
> service for now and gauge interest.  I think all you'd need is a program
> that called getrandom() and then exited when it returned, run via systemd
> as a Type=oneshot service that krb5-kdc depends on and with a reasonable
> timeout.

I think that's what it would look like, yes.  It's less clear that
putting it in krb5-kdc would actually do anything to gauge demand,
but I suppose I could be wrong.

-Ben



Bug#898073: Bug#897917: Stretch kernel 4.9.88-1 breaks startup of RPC, KDC services

2018-05-06 Thread Russ Allbery
Benjamin Kaduk  writes:
> On Sun, May 06, 2018 at 08:43:13PM +0100, Ben Hutchings wrote:
>> On Sun, 2018-05-06 at 14:02 -0500, Benjamin Kaduk wrote:

>>> Arguably more preferable would be to have a systemd target that
>>> indicates the RNG is seeded, and then krb5 could have its KDC service
>>> depend on this "RNG-available" service.  So far as I know, no such
>>> service currently exists, so again, there would need to be some
>>> systemd effort (potentially in cooperation with the kernel) to provide
>>> such a service.

>> Yes, that certainly seems like a good approach.

> Do you know who would be the right person to talk to about getting
> that work done?

This seems trivial enough that the krb5-kdc package could just ship this
service for now and gauge interest.  I think all you'd need is a program
that called getrandom() and then exited when it returned, run via systemd
as a Type=oneshot service that krb5-kdc depends on and with a reasonable
timeout.
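
A rough sketch of the core of such a program, assuming the glibc
getrandom() wrapper (glibc 2.25 or later); the function name
wait_for_seeded_rng is hypothetical and nothing like it is known to ship
in krb5-kdc:

```c
#include <errno.h>
#include <sys/random.h>
#include <sys/types.h>

/* Block until the kernel CSPRNG has been initialized.
 * Requests a single byte with flags == 0, so the only possible wait is
 * the one-time wait for the pool to become ready.
 * Returns 0 once the RNG is seeded, -1 on error with errno set. */
int wait_for_seeded_rng(void)
{
    unsigned char byte;
    ssize_t n;

    do {
        n = getrandom(&byte, 1, 0);
    } while (n < 0 && errno == EINTR);   /* retry if a signal arrives */

    return n == 1 ? 0 : -1;
}
```

The service binary's main() would do nothing more than call this function
and exit with status 0 on success.  The bound on how long it may wait
would come from the timeout on the Type=oneshot unit rather than from the
program itself.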

-- 
Russ Allbery (r...@debian.org)   



Bug#898073: Bug#897917: Stretch kernel 4.9.88-1 breaks startup of RPC, KDC services

2018-05-06 Thread Benjamin Kaduk
On Sun, May 06, 2018 at 08:43:13PM +0100, Ben Hutchings wrote:
> On Sun, 2018-05-06 at 14:02 -0500, Benjamin Kaduk wrote:
> > Hi Ben,
> > 
> > On Sun, May 06, 2018 at 06:56:08PM +0100, Ben Hutchings wrote:
> > > I've cloned this bug as #898073 and reassigned that to krb5.
> > > 
> > > krb5 is using the new(ish) getrandom() system call to read random bits,
> > > with the code comment "This ensures strong randomness while only
> > > blocking during first system boot."
> > > 
> > > While this is a regression, the kernel is only doing what krb5 was
> > > asking for (whereas previously it could wrongly provide weak random
> > > bits).
> > > 
> > > We might still revert this change in the kernel temporarily.  However,
> > > the krb5 developers need to decide what they really want, and if that's
> > > strong randomness then they need to configure the service to allow for
> > > a longer delay at boot.
> > 
> > I read through the history on #898073 and am still not sure I have
> > the backstory quite right.  This is what it sounds like has
> > happened:
> > 
> > The kernel in stable has for some time provided a getrandom() system
> > call that provided "weak" (more on this later) random numbers for
> > some time after startup, though did eventually converge to "strong"
> > randomness after some time (a few minutes?).  The kernel 4.9.88-1
> > upload fixed the bug that getrandom() could provide "weak" output
> > (since getrandom() is supposed to block until strong output is
> > ready), and this in turn caused the krb5 KDC to block at boot until
> > the RNG was ready, blocking long enough that systemd timed out the
> > unit and marked it as failed.  We're now talking about the proper
> > way to improve the situation.
> 
> Right.
> 
> > If the above is correct, I'm not yet sure that I see a krb5-specific
> > bug.  It is definitely true that krb5 is specifically requesting the
> > getrandom() semantics of blocking until the RNG is fully seeded, but
> > krb5 is hardly expected to be the only consumer of getrandom().  As
> > such, why should krb5 be responsible for increasing the systemd
> > timeout at boot -- could not systemd be responsible for increasing
> > the default timeout to allow for entropy seeding as used by multiple
> > applications?
> 
> How would systemd determine which systems require this?

I didn't have anything in mind other than globally increasing the
default timeout.

> > Arguably more preferable would be to have a systemd
> > target that indicates the RNG is seeded, and then krb5 could have
> > its KDC service depend on this "RNG-available" service.  So far as I
> > know, no such service currently exists, so again, there would need
> > to be some systemd effort (potentially in cooperation with the
> > kernel) to provide such a service.
> 
> Yes, that certainly seems like a good approach.

Do you know who would be the right person to talk to about getting
that work done?

> > To rephrase in a different way, "getrandom() is a system service,
> > and the system's init system should not penalize other services for
> > using system services -- why should the onus of adapting be placed on
> > individual consumers of that system service?"
> > 
> > 
> > Back to the "weak" random numbers.  How weak are we talking about?
> 
> If I'm reading the code correctly, the previous condition for
> successful return of getrandom() (without the GRND_RANDOM flag) was
> that at least 64 bits of raw random data have been added to the random
> pool.  The raw random data might come from a high quality hardware
> random number generator but might be much weaker.  The current
> condition is that at least 128 bits of entropy have been added (based
> on a conservative estimate of entropy).

Thanks for sharing your interpretation.  Hmm, 64 bits is not very
much (e.g., 64^W56-bit single-DES keys are brute-forceable at
relatively low cost, these days), though I don't have a sense for
what the weakest source that could be used is.  It's of course not
just as simple as the first 64 bits, since other input is
continually added, but it sounds like there is some
larger-than-normal-security-margin chance that an attacker could
reproduce a key that was generated on a user system.  It sounds like
we should try to get some additional eyes on this.

> > The krb5 KDC and kadmind are used (among other things) to generate
> > shared symmetric keys, used to encrypt and authenticate traffic over
> > the network.  Some of these keys are long-lived, and an
> > insufficiently random long-lived key could have rather disastrous
> > consequences for deployments unlucky enough to have generated them.
> > Are we looking at a repeat of the openssl RNG fiasco where piles of
> > ssh keys and TLS certificates had to be regenerated?  If there's a
> > real issue here of weak randomness, we may need to publicize this
> > issue much more widely.
> 
> The real issue is that k5_get_os_entropy() silently falls back to
> reading /dev/urandom, which has never, and 

Bug#898073: Bug#897917: Stretch kernel 4.9.88-1 breaks startup of RPC, KDC services

2018-05-06 Thread Ben Hutchings
On Sun, 2018-05-06 at 20:43 +0100, Ben Hutchings wrote:
> On Sun, 2018-05-06 at 14:02 -0500, Benjamin Kaduk wrote:
[...]
> > If the above is correct, I'm not yet sure that I see a krb5-specific
> > bug.  It is definitely true that krb5 is specifically requesting the
> > getrandom() semantics of blocking until the RNG is fully seeded, but
> > krb5 is hardly expected to be the only consumer of getrandom().  As
> > such, why should krb5 be responsible for increasing the systemd
> > timeout at boot -- could not systemd be responsible for increasing
> > the default timeout to allow for entropy seeding as used by multiple
> > applications?
> 
> How would systemd determine which systems require this?
[...]

I meant, which services.

Ben.

-- 
Ben Hutchings
Life is what happens to you while you're busy making other plans.
  - John Lennon





Bug#898073: Bug#897917: Stretch kernel 4.9.88-1 breaks startup of RPC, KDC services

2018-05-06 Thread Ben Hutchings
On Sun, 2018-05-06 at 14:02 -0500, Benjamin Kaduk wrote:
> Hi Ben,
> 
> On Sun, May 06, 2018 at 06:56:08PM +0100, Ben Hutchings wrote:
> > I've cloned this bug as #898073 and reassigned that to krb5.
> > 
> > krb5 is using the new(ish) getrandom() system call to read random bits,
> > with the code comment "This ensures strong randomness while only
> > blocking during first system boot."
> > 
> > While this is a regression, the kernel is only doing what krb5 was
> > asking for (whereas previously it could wrongly provide weak random
> > bits).
> > 
> > We might still revert this change in the kernel temporarily.  However,
> > the krb5 developers need to decide what they really want, and if that's
> > strong randomness then they need to configure the service to allow for
> > a longer delay at boot.
> 
> I read through the history on #898073 and am still not sure I have
> the backstory quite right.  This is what it sounds like has
> happened:
> 
> The kernel in stable has for some time provided a getrandom() system
> call that provided "weak" (more on this later) random numbers for
> some time after startup, though did eventually converge to "strong"
> randomness after some time (a few minutes?).  The kernel 4.9.88-1
> upload fixed the bug that getrandom() could provide "weak" output
> (since getrandom() is supposed to block until strong output is
> ready), and this in turn caused the krb5 KDC to block at boot until
> the RNG was ready, blocking long enough that systemd timed out the
> unit and marked it as failed.  We're now talking about the proper
> way to improve the situation.

Right.

> If the above is correct, I'm not yet sure that I see a krb5-specific
> bug.  It is definitely true that krb5 is specifically requesting the
> getrandom() semantics of blocking until the RNG is fully seeded, but
> krb5 is hardly expected to be the only consumer of getrandom().  As
> such, why should krb5 be responsible for increasing the systemd
> timeout at boot -- could not systemd be responsible for increasing
> the default timeout to allow for entropy seeding as used by multiple
> applications?

How would systemd determine which systems require this?

> Arguably more preferable would be to have a systemd
> target that indicates the RNG is seeded, and then krb5 could have
> its KDC service depend on this "RNG-available" service.  So far as I
> know, no such service currently exists, so again, there would need
> to be some systemd effort (potentially in cooperation with the
> kernel) to provide such a service.

Yes, that certainly seems like a good approach.

> To rephrase in a different way, "getrandom() is a system service,
> and the system's init system should not penalize other services for
> using system services -- why should the onus of adapting be placed on
> individual consumers of that system service?"
> 
> 
> Back to the "weak" random numbers.  How weak are we talking about?

If I'm reading the code correctly, the previous condition for
successful return of getrandom() (without the GRND_RANDOM flag) was
that at least 64 bits of raw random data have been added to the random
pool.  The raw random data might come from a high quality hardware
random number generator but might be much weaker.  The current
condition is that at least 128 bits of entropy have been added (based
on a conservative estimate of entropy).
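
Whether the kernel currently considers that condition met can be probed
from userspace without blocking.  The sketch below assumes the glibc
getrandom() wrapper (glibc 2.25 or later); rng_is_seeded is a
hypothetical name.  It only reports the kernel's own ready/not-ready
decision and says nothing about how the entropy estimate was reached.

```c
#include <errno.h>
#include <sys/random.h>
#include <sys/types.h>

/* Non-blocking probe of the kernel CSPRNG state.
 * Returns 1 if getrandom(flags=0) would succeed right now,
 *         0 if the pool is not yet initialized (EAGAIN),
 *        -1 on any other error, with errno set. */
int rng_is_seeded(void)
{
    unsigned char byte;
    ssize_t n;

    do {
        n = getrandom(&byte, 1, GRND_NONBLOCK);
    } while (n < 0 && errno == EINTR);

    if (n == 1)
        return 1;
    if (errno == EAGAIN)
        return 0;
    return -1;
}
```

Running a probe like this early in boot on an affected VM would show 0
until the kernel's threshold is reached, which is the window in which the
KDC start was timing out.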

> The krb5 KDC and kadmind are used (among other things) to generate
> shared symmetric keys, used to encrypt and authenticate traffic over
> the network.  Some of these keys are long-lived, and an
> insufficiently random long-lived key could have rather disastrous
> consequences for deployments unlucky enough to have generated them.
> Are we looking at a repeat of the openssl RNG fiasco where piles of
> ssh keys and TLS certificates had to be regenerated?  If there's a
> real issue here of weak randomness, we may need to publicize this
> issue much more widely.

The real issue is that k5_get_os_entropy() silently falls back to
reading /dev/urandom, which has never, and will never, wait for a
reasonable amount of entropy to be available.

Worse still, it ignores the "strong" flag when calling getrandom().

If you're serious about the quality of your random numbers, you need to
deal with those issues rather than quibbling about whether the kernel
issue (CVE-2018-1108) is a "fiasco" or not.

Ben.

-- 
Ben Hutchings
If more than one person is responsible for a bug, no one is at fault.




Bug#898073: Bug#897917: Stretch kernel 4.9.88-1 breaks startup of RPC, KDC services

2018-05-06 Thread Benjamin Kaduk
Hi Ben,

On Sun, May 06, 2018 at 06:56:08PM +0100, Ben Hutchings wrote:
> I've cloned this bug as #898073 and reassigned that to krb5.
> 
> krb5 is using the new(ish) getrandom() system call to read random bits,
> with the code comment "This ensures strong randomness while only
> blocking during first system boot."
> 
> While this is a regression, the kernel is only doing what krb5 was
> asking for (whereas previously it could wrongly provide weak random
> bits).
> 
> We might still revert this change in the kernel temporarily.  However,
> the krb5 developers need to decide what they really want, and if that's
> strong randomness then they need to configure the service to allow for
> a longer delay at boot.

I read through the history on #898073 and am still not sure I have
the backstory quite right.  This is what it sounds like has
happened:

The kernel in stable has for some time provided a getrandom() system
call that provided "weak" (more on this later) random numbers for
some time after startup, though did eventually converge to "strong"
randomness after some time (a few minutes?).  The kernel 4.9.88-1
upload fixed the bug that getrandom() could provide "weak" output
(since getrandom() is supposed to block until strong output is
ready), and this in turn caused the krb5 KDC to block at boot until
the RNG was ready, blocking long enough that systemd timed out the
unit and marked it as failed.  We're now talking about the proper
way to improve the situation.

If the above is correct, I'm not yet sure that I see a krb5-specific
bug.  It is definitely true that krb5 is specifically requesting the
getrandom() semantics of blocking until the RNG is fully seeded, but
krb5 is hardly expected to be the only consumer of getrandom().  As
such, why should krb5 be responsible for increasing the systemd
timeout at boot -- could not systemd be responsible for increasing
the default timeout to allow for entropy seeding as used by multiple
applications?  Arguably more preferable would be to have a systemd
target that indicates the RNG is seeded, and then krb5 could have
its KDC service depend on this "RNG-available" service.  So far as I
know, no such service currently exists, so again, there would need
to be some systemd effort (potentially in cooperation with the
kernel) to provide such a service.
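
As a rough illustration only, the core of such an "RNG-available" helper
might look like the sketch below, assuming Python 3.6+ on Linux.  No
such unit is known to exist, and wait_for_rng is a hypothetical name;
how a unit would wrap and order it is left open.

```python
import os
import time


def wait_for_rng() -> float:
    """Block until the kernel CSPRNG is initialized; return seconds waited."""
    start = time.monotonic()
    # With flags=0, getrandom() blocks until the pool has been
    # initialized, then returns immediately on every later call.
    os.getrandom(1)
    return time.monotonic() - start


if __name__ == "__main__":
    print(f"RNG ready after waiting {wait_for_rng():.3f}s")
```

A helper like this would only have to wait once per boot; services
ordered after it could then call getrandom() without risking a start
timeout.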

To rephrase in a different way, "getrandom() is a system service,
and the system's init system should not penalize other services for
using system services -- why should the onus of adapting be placed on
individual consumers of that system service?"


Back to the "weak" random numbers.  How weak are we talking about?
The krb5 KDC and kadmind are used (among other things) to generate
shared symmetric keys, used to encrypt and authenticate traffic over
the network.  Some of these keys are long-lived, and an
insufficiently random long-lived key could have rather disastrous
consequences for deployments unlucky enough to have generated them.
Are we looking at a repeat of the openssl RNG fiasco where piles of
ssh keys and TLS certificates had to be regenerated?  If there's a
real issue here of weak randomness, we may need to publicize this
issue much more widely.

Thanks,

Ben




Bug#898073: Bug#897917: Stretch kernel 4.9.88-1 breaks startup of RPC, KDC services

2018-05-06 Thread Ben Hutchings
I've cloned this bug as #898073 and reassigned that to krb5.

krb5 is using the new(ish) getrandom() system call to read random bits,
with the code comment "This ensures strong randomness while only
blocking during first system boot."

While this is a regression, the kernel is only doing what krb5 was
asking for (whereas previously it could wrongly provide weak random
bits).

We might still revert this change in the kernel temporarily.  However,
the krb5 developers need to decide what they really want, and if that's
strong randomness then they need to configure the service to allow for
a longer delay at boot.

Ben.

-- 
Ben Hutchings
If more than one person is responsible for a bug, no one is at fault.




Bug#897917: Stretch kernel 4.9.88-1 breaks startup of RPC, KDC services

2018-05-05 Thread Michael J. Redd
On further investigation, Arne's absolutely right. I upgraded the
kernel back to 4.9.88-1 from Debian Security and installed 'haveged'
(another random number generator). Everything started quickly and
normally after a reboot. Turns out I hadn't noticed this on any of my
other virtual servers because they're all running haveged anyway.

So,

Workarounds:

1. Roll back kernel to 4.9.82-1+deb9u3
OR
2. Install another RNG, such as 'haveged'



Bug#897917: Stretch kernel 4.9.88-1 breaks startup of RPC, KDC services

2018-05-05 Thread Michael J. Redd
Interesting! I'd also noticed 'random: init done' being piped to console well 
after the server had booted, but I didn't mention it because I didn't think it 
was related. What you've said makes a lot of sense.

On Sat, 5 May 2018 11:54:54 +0200 Arne Nordmark  wrote:
> I have also seen this on a couple of SSD-only systems.
> 
> I think the problem is that the random number generator takes about two
> minutes to initialize, long enough for systemd to give up on these
> processes. Unbound is similar, but there the unit file keeps trying until
> the random numbers are available.
> 
> From the log:
> May  5 10:19:02 ano2 kernel: [  126.436729] random: crng init done
> 
> Pressing the keyboard a few times (thus providing entropy) will allow
> the boot to continue.
> 
> This definitely seems to be a kernel problem.
> 
> Arne
> 
> 


Bug#897917: Stretch kernel 4.9.88-1 breaks startup of RPC, KDC services

2018-05-05 Thread Arne Nordmark
I have also seen this on a couple of SSD-only systems.

I think the problem is that the random number generator takes about two
minutes to initialize, long enough for systemd to give up on these
processes. Unbound is similar, but there the unit file keeps trying until
the random numbers are available.

From the log:
May  5 10:19:02 ano2 kernel: [  126.436729] random: crng init done
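
The bracketed number in that line is the kernel's uptime stamp, in
seconds since boot.  A small sketch of pulling it out and comparing it
with a start timeout, assuming Python 3.  The 90-second figure is an
assumption (systemd's stock DefaultTimeoutStartSec), not something
taken from the affected host, and crng_ready_seconds is a hypothetical
helper name.

```python
import re
from typing import Optional

# Kernel uptime stamp in brackets, followed by the CRNG-ready message.
CRNG_DONE = re.compile(r"\[\s*(\d+\.\d+)\]\s+random: crng init done")

# Assumption: systemd's stock start timeout; not taken from the report.
ASSUMED_START_TIMEOUT = 90.0


def crng_ready_seconds(log_line: str) -> Optional[float]:
    """Return seconds since boot at which the CRNG became ready, or None."""
    match = CRNG_DONE.search(log_line)
    return float(match.group(1)) if match else None


line = "May  5 10:19:02 ano2 kernel: [  126.436729] random: crng init done"
ready = crng_ready_seconds(line)
print(ready)                          # 126.436729
print(ready > ASSUMED_START_TIMEOUT)  # True
```

Under that assumed timeout, a service that blocks in getrandom() from
early boot would be terminated well before the 126-second mark.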

Pressing the keyboard a few times (thus providing entropy) will allow
the boot to continue.

This definitely seems to be a kernel problem.

Arne



Bug#897917: Stretch kernel 4.9.88-1 breaks startup of RPC, KDC services

2018-05-04 Thread Michael J. Redd
Package: linux-image-4.9.0-6-amd64
Version: 4.9.88-1

Issue:
======

Kernel "linux-image-4.9.0-6-amd64", version 4.9.88-1, breaks systemd
startup of RPC and Kerberos KDC services.

Description:
============

After upgrading to the latest Stretch kernel (4.9.88-1), RPC and KDC
services time out during the boot process. This issue is being seen on
a Kerberos KDC that is also an NFS client. Kerberos auth. and
encryption are being used with NFS in this environment, and this KDC
provides the Kerberos services for that to work.

Network is functional prior to these services starting, which is
proper.

After the server has booted completely, I can issue `service krb5-kdc
restart` and, after a short delay, the KDC service starts normally.

Not sure if this is a kernel bug, a systemd bug, or something else.
Since the kernel package was the only thing that was upgraded before
the issue started, I'm leaning toward the kernel.

Relevant output from /var/log/syslog:
-------------------------------------

May  4 09:03:17  systemd[1]: rpc-svcgssd.service: Start
operation timed out. Terminating.
May  4 09:03:17  systemd[1]: Failed to start RPC security
service for NFS server.
May  4 09:03:17  systemd[1]: rpc-svcgssd.service: Unit
entered failed state.
May  4 09:03:17  systemd[1]: rpc-svcgssd.service: Failed with
result 'timeout'.
May  4 09:03:17  systemd[1]: rpc-gssd.service: Start
operation timed out. Terminating.
May  4 09:03:17  systemd[1]: Failed to start RPC security
service for NFS client and server.
May  4 09:03:17  systemd[1]: rpc-gssd.service: Unit entered
failed state.
May  4 09:03:17  systemd[1]: rpc-gssd.service: Failed with
result 'timeout'.
May  4 09:03:20  systemd[1]: krb5-kdc.service: Start
operation timed out. Terminating.
May  4 09:03:20  systemd[1]: Failed to start Kerberos 5 Key
Distribution Center.
May  4 09:03:20  systemd[1]: krb5-kdc.service: Unit entered
failed state.
May  4 09:03:20  systemd[1]: krb5-kdc.service: Failed with
result 'timeout'.

Workaround:
===========

Rolling back to Stretch kernel 4.9.82-1+deb9u3 fixes the issue.

Setup:
======

1. KDC package: krb5-kdc 1.15-1+deb9u1
2. NFS package: nfs-common 1:1.3.4-2.1
3. Kernel: linux-image-4.9.0-6-amd64 4.9.88-1
4. Systemd version: 232-25+deb9u3
5. Server is a 64-bit Xen PV domU