Re: [DISCUSSION PATCH 00/41] random: possible ways towards NIST SP800-90B compliance
Eric Biggers writes:
> On Fri, Oct 02, 2020 at 02:38:36PM +0200, Torsten Duwe wrote:
>>
>> Would some maintainer please comment on potential problems or
>> shortcomings?
>>
>
> Well, very few people are experts in the Linux RNG *and* have time to
> review large patchsets, especially when three people are all proposing
> conflicting changes. And those that might be able to review these patches
> aren't necessarily interested in compliance with particular government
> standards.

To make it clear: I'm personally not really enthusiastic about some of the
restrictions imposed by SP800-90 either and Jason certainly has a point with
his concerns about "subpar crypto" ([1]). However, at the same time I
acknowledge that for some users FIPS compliance is simply a necessity and I
don't see a strong reason why that shouldn't be supported, if doable without
negatively affecting !fips_enabled users.

> Note that having multiple RNG implementations would cause fragmentation,
> more maintenance burden, etc. So IMO, that should be a last resort.
> Instead we should try to find an implementation that works for everyone.
> I.e., at least to me, Nicolai's patchset seems more on the right track
> than Stephan's patchset...

I suppose that this concern about fragmentation is among the main reasons
for reservations against Stephan's LRNG patchset and that's why I posted
this RFC series here for comparison purposes. But note that, as said ([2]),
it's incomplete and the only intent was to provide at least a rough idea of
what it would take to move the current /dev/random implementation towards
SP800-90 -- I was hoping for either a hard NACK or something along the lines
of "maybe, go ahead and let's see".

> However, not everyone cares about "compliance". So any changes for
> "compliance" either need to have a real technical argument for making the
> change, *or* need to be optional (e.g. controlled by fips_enabled).

Fully agreed.

> AFAICS, this patchset mostly just talks about NIST SP800-90B compliance,
> and doesn't make clear whether the changes make the RNG better, worse, or
> the same from an actual technical perspective.
>
> If that was properly explained, and if the answer was "better" or at
> least "not worse", I expect that people would be more interested.

The goal was not to negatively affect !fips_enabled users, but as outlined
in the cover letter ([2]), a performance impact had been measured on ARMv7.
This probably isn't something which couldn't get sorted out, but I see no
point in doing it at this stage, because
- there's still quite some stuff missing for full SP800-90 compliance
  anyway, c.f. the overview at the end of [2] and
- such optimizations would have bloated this patchset even more; e.g.
  making fips_enabled a static_key should certainly go into a separate
  series.

User visible effects set aside, an obvious downside of SP800-90 compliance
would be the increase in code size and the associated maintenance burden.

That being said, I can imagine that those boot health tests could also get
enabled for !fips_enabled users in the future, if wanted: rather than
inhibiting /dev/random output on failure, a warning would get logged
instead. Whether or not this would be seen as an improvement is for others
to judge though.

Thanks,

Nicolai

[1] https://lkml.kernel.org/r/cahmme9rmxorfxtwdac8yxj+h9gytjj6dpvcxa-jmaagyop+...@mail.gmail.com
[2] https://lkml.kernel.org/r/20200921075857.4424-1-nsta...@suse.de

--
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany
(HRB 36809, AG Nürnberg), GF: Felix Imendörffer
[RFC PATCH 03/41] random: prune dead assignment to entropy_bits in credit_entropy_bits()
Since commit 90ea1c6436d2 ("random: remove the blocking pool"), the last
assignment to 'entropy_bits' is dead. Remove it.

While at it, move the declaration of 'entropy_bits' one scope down and make
it const: it's not used anywhere outside and never updated after
initialization.

Signed-off-by: Nicolai Stange
---
 drivers/char/random.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index c4b7bdbd460e..14c39608cc17 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -716,13 +716,12 @@ static void credit_entropy_bits(struct entropy_store *r, int nbits)
 				  entropy_count >> ENTROPY_SHIFT, _RET_IP_);
 
 	if (r == &input_pool) {
-		int entropy_bits = entropy_count >> ENTROPY_SHIFT;
-
 		if (crng_init < 2) {
+			const int entropy_bits = entropy_count >> ENTROPY_SHIFT;
+
 			if (entropy_bits < 128)
 				return;
 			crng_reseed(&primary_crng, r);
-			entropy_bits = ENTROPY_BITS(r);
 		}
 	}
 }
-- 
2.26.2
[RFC PATCH 04/41] random: drop 'reserved' parameter from extract_entropy()
Since commit 43d8a72cd985 ("random: remove variable limit") all call sites
of extract_entropy() pass in zero for the 'reserved' argument and the
corresponding code in account() is effectively dead.

Remove it and drop the now unused 'reserved' argument from extract_entropy()
as well as from account() called therefrom.

Signed-off-by: Nicolai Stange
---
 drivers/char/random.c | 17 ++++++-----------
 1 file changed, 6 insertions(+), 11 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 14c39608cc17..35e381be20fe 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -506,7 +506,7 @@ struct entropy_store {
 };
 
 static ssize_t extract_entropy(struct entropy_store *r, void *buf,
-			       size_t nbytes, int min, int rsvd);
+			       size_t nbytes, int min);
 static ssize_t _extract_entropy(struct entropy_store *r, void *buf,
 				size_t nbytes, int fips);
 
@@ -944,7 +944,7 @@ static void crng_reseed(struct crng_state *crng, struct entropy_store *r)
 	} buf;
 
 	if (r) {
-		num = extract_entropy(r, &buf, 32, 16, 0);
+		num = extract_entropy(r, &buf, 32, 16);
 		if (num == 0)
 			return;
 	} else {
@@ -1330,8 +1330,7 @@ EXPORT_SYMBOL_GPL(add_disk_randomness);
  * This function decides how many bytes to actually take from the
  * given pool, and also debits the entropy count accordingly.
  */
-static size_t account(struct entropy_store *r, size_t nbytes, int min,
-		      int reserved)
+static size_t account(struct entropy_store *r, size_t nbytes, int min)
 {
 	int entropy_count, orig, have_bytes;
 	size_t ibytes, nfrac;
@@ -1345,8 +1344,6 @@ static size_t account(struct entropy_store *r, size_t nbytes, int min,
 
 	/* never pull more than available */
 	have_bytes = entropy_count >> (ENTROPY_SHIFT + 3);
-	if ((have_bytes -= reserved) < 0)
-		have_bytes = 0;
 	ibytes = min_t(size_t, ibytes, have_bytes);
 	if (ibytes < min)
 		ibytes = 0;
@@ -1469,12 +1466,10 @@ static ssize_t _extract_entropy(struct entropy_store *r, void *buf,
  * returns it in a buffer.
  *
  * The min parameter specifies the minimum amount we can pull before
- * failing to avoid races that defeat catastrophic reseeding while the
- * reserved parameter indicates how much entropy we must leave in the
- * pool after each pull to avoid starving other readers.
+ * failing to avoid races that defeat catastrophic reseeding.
  */
 static ssize_t extract_entropy(struct entropy_store *r, void *buf,
-			       size_t nbytes, int min, int reserved)
+			       size_t nbytes, int min)
 {
 	__u8 tmp[EXTRACT_SIZE];
 	unsigned long flags;
@@ -1495,7 +1490,7 @@ static ssize_t extract_entropy(struct entropy_store *r, void *buf,
 	}
 
 	trace_extract_entropy(r->name, nbytes, ENTROPY_BITS(r), _RET_IP_);
-	nbytes = account(r, nbytes, min, reserved);
+	nbytes = account(r, nbytes, min);
 
 	return _extract_entropy(r, buf, nbytes, fips_enabled);
 }
-- 
2.26.2
[RFC PATCH 40/41] random: trigger startup health test on any failure of the health tests
The startup health tests to be executed at boot as required by NIST
SP800-90B consist of running the continuous health tests, i.e. the Adaptive
Proportion Test (APT) and the Repetition Count Test (RCT), until a certain
number of noise samples has been examined. In case of a test failure during
this period, the startup tests would get restarted by means of
reinitializing the fast_pool's ->warmup member with the original number of
total samples to examine during startup.

A future patch will enable dynamically switching from the initial H=1 or 1/8
per-IRQ min-entropy estimates to lower values upon health test failures in
order to keep those systems going where these more or less arbitrary per-IRQ
entropy estimates turn out to simply be wrong. It is certainly desirable to
restart the startup health tests upon such a switch.

In order to keep the upcoming code comprehensible, move the startup test
restart logic from health_test_process() into add_interrupt_randomness().
For simplicity, make add_interrupt_randomness() trigger a startup test on
each health test failure. Note that there's a change in behaviour: up to
now, only the boot-time startup tests would have restarted themselves upon
failure, whereas now even a failure of the continuous health tests can
potentially trigger a startup test long after boot.

Note that as it currently stands, rerunning the full startup tests after the
crng has received its initial seed only has the effect of inhibiting entropy
dispatch for a while and thus, of potentially delaying those best effort
crng reseeds during runtime. As reseeds never reduce a crng state's entropy,
this behaviour is admittedly questionable. However, further patches
introducing forced reseeds might perhaps become necessary in the future,
c.f. the specification of "reseed_interval" in NIST SP800-90A. Thus, it's
better to keep the startup health test restart logic consistent for now.

Signed-off-by: Nicolai Stange
---
 drivers/char/random.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 86dd87588b1b..bb79dcb96882 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -1098,8 +1098,6 @@ health_test_process(struct health_test *h, unsigned int event_entropy_shift,
 		 * Something is really off, get_cycles() has become
 		 * (or always been) a constant.
 		 */
-		if (h->warmup)
-			health_test_reset(h, event_entropy_shift);
 		return health_discard;
 	}
 
@@ -1110,8 +1108,6 @@ health_test_process(struct health_test *h, unsigned int event_entropy_shift,
 	 */
 	apt = health_test_apt(h, event_entropy_shift, sample_delta);
 	if (unlikely(h->warmup) && --h->warmup) {
-		if (apt == health_discard)
-			health_test_reset(h, event_entropy_shift);
 		/*
 		 * Don't allow the caller to dispatch until warmup
 		 * has completed.
@@ -1928,6 +1924,14 @@ void add_interrupt_randomness(int irq, int irq_flags)
 			health_test_process(&fast_pool->health,
 					    fast_pool->event_entropy_shift,
 					    cycles);
+		if (unlikely(health_result == health_discard)) {
+			/*
+			 * Oops, something's odd. Restart the startup
+			 * tests.
+			 */
+			health_test_reset(&fast_pool->health,
+					  fast_pool->event_entropy_shift);
+		}
 	}
 
 	if (unlikely(crng_init == 0)) {
-- 
2.26.2
[RFC PATCH 30/41] random: add a queued_entropy instance to struct fast_pool
When health tests are introduced with upcoming patches, it will become
necessary to keep entropy queued across add_interrupt_randomness()
invocations for later dispatch to the global balance.

Prepare for this by adding a struct queued_entropy member to the per-CPU
fast_pool. Use it in place of the queue with automatic storage duration in
add_interrupt_randomness().

Signed-off-by: Nicolai Stange
---
 drivers/char/random.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 55e784a5a2ec..37746df53acf 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -885,6 +885,7 @@ struct fast_pool {
 	unsigned short	reg_idx;
 	unsigned char	count;
 	int		event_entropy_shift;
+	struct queued_entropy	q;
 };
 
 /*
@@ -1655,7 +1656,7 @@ void add_interrupt_randomness(int irq, int irq_flags)
 	__u32			c_high, j_high;
 	__u64			ip;
 	bool			reseed;
-	struct queued_entropy	q = { 0 };
+	struct queued_entropy	*q = &fast_pool->q;
 	unsigned int		nfrac;
 
 	if (cycles == 0)
@@ -1700,9 +1701,9 @@ void add_interrupt_randomness(int irq, int irq_flags)
 		nfrac = fast_pool_entropy(fast_pool->count,
 					  fast_pool->event_entropy_shift);
 	}
-	__queue_entropy(r, &q, nfrac);
+	__queue_entropy(r, q, nfrac);
 	__mix_pool_bytes(r, &fast_pool->pool, sizeof(fast_pool->pool));
-	reseed = __dispatch_queued_entropy_fast(r, &q);
+	reseed = __dispatch_queued_entropy_fast(r, q);
 	spin_unlock(&r->lock);
 
 	fast_pool->last = now;
-- 
2.26.2
[RFC PATCH 02/41] random: remove dead code for nbits < 0 in credit_entropy_bits()
The nbits argument to credit_entropy_bits() is never negative and the branch
handling it is dead code. Remove it.

The code for handling the regular nbits > 0 case used to live in the
corresponding else branch, but has now been lifted up to function scope.
Move the declaration of 'pnfrac' to the function prologue in order to adhere
to C99 rules. Likewise, move the declaration of 's' into the loop body, the
only scope it's referenced from.

Signed-off-by: Nicolai Stange
---
 drivers/char/random.c | 69 ++++++++++++++++++-------------------------
 1 file changed, 32 insertions(+), 37 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 0580968fd28c..c4b7bdbd460e 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -654,7 +654,7 @@ static void process_random_ready_list(void)
 }
 
 /*
- * Credit (or debit) the entropy store with n bits of entropy.
+ * Credit the entropy store with n bits of entropy.
  * Use credit_entropy_bits_safe() if the value comes from userspace
  * or otherwise should be checked for extreme values.
  */
@@ -663,50 +663,45 @@ static void credit_entropy_bits(struct entropy_store *r, int nbits)
 	int entropy_count, orig;
 	const int pool_size = r->poolinfo->poolfracbits;
 	int nfrac = nbits << ENTROPY_SHIFT;
+	int pnfrac;
 
 	if (!nbits)
 		return;
 
 retry:
 	entropy_count = orig = READ_ONCE(r->entropy_count);
-	if (nfrac < 0) {
-		/* Debit */
-		entropy_count += nfrac;
-	} else {
-		/*
-		 * Credit: we have to account for the possibility of
-		 * overwriting already present entropy. Even in the
-		 * ideal case of pure Shannon entropy, new contributions
-		 * approach the full value asymptotically:
-		 *
-		 * entropy <- entropy + (pool_size - entropy) *
-		 *	(1 - exp(-add_entropy/pool_size))
-		 *
-		 * For add_entropy <= pool_size/2 then
-		 * (1 - exp(-add_entropy/pool_size)) >=
-		 *    (add_entropy/pool_size)*0.7869...
-		 * so we can approximate the exponential with
-		 * 3/4*add_entropy/pool_size and still be on the
-		 * safe side by adding at most pool_size/2 at a time.
-		 *
-		 * The use of pool_size-2 in the while statement is to
-		 * prevent rounding artifacts from making the loop
-		 * arbitrarily long; this limits the loop to log2(pool_size)*2
-		 * turns no matter how large nbits is.
-		 */
-		int pnfrac = nfrac;
-		const int s = r->poolinfo->poolbitshift + ENTROPY_SHIFT + 2;
+	/*
+	 * Credit: we have to account for the possibility of
+	 * overwriting already present entropy. Even in the
+	 * ideal case of pure Shannon entropy, new contributions
+	 * approach the full value asymptotically:
+	 *
+	 * entropy <- entropy + (pool_size - entropy) *
+	 *	(1 - exp(-add_entropy/pool_size))
+	 *
+	 * For add_entropy <= pool_size/2 then
+	 * (1 - exp(-add_entropy/pool_size)) >=
+	 *    (add_entropy/pool_size)*0.7869...
+	 * so we can approximate the exponential with
+	 * 3/4*add_entropy/pool_size and still be on the
+	 * safe side by adding at most pool_size/2 at a time.
+	 *
+	 * The use of pool_size-2 in the while statement is to
+	 * prevent rounding artifacts from making the loop
+	 * arbitrarily long; this limits the loop to log2(pool_size)*2
+	 * turns no matter how large nbits is.
+	 */
+	pnfrac = nfrac;
+	do {
+		/* The +2 corresponds to the /4 in the denominator */
+		const int s = r->poolinfo->poolbitshift + ENTROPY_SHIFT + 2;
+		unsigned int anfrac = min(pnfrac, pool_size/2);
+		unsigned int add =
+			((pool_size - entropy_count)*anfrac*3) >> s;
 
-		/* The +2 corresponds to the /4 in the denominator */
-		do {
-			unsigned int anfrac = min(pnfrac, pool_size/2);
-			unsigned int add =
-				((pool_size - entropy_count)*anfrac*3) >> s;
-
-			entropy_count += add;
-			pnfrac -= anfrac;
-		} while (unlikely(entropy_count < pool_size-2 && pnfrac));
-	}
+		entropy_count += add;
+		pnfrac -= anfrac;
+	} while (unlikely(entropy_count < pool_size-2 && pnfrac));
 
 	if (WARN_ON(entropy_count < 0)) {
 		pr_warn("negative entropy/overflow: pool %s count %d\n",
-- 
2.26.2
[RFC PATCH 09/41] random: protect ->entropy_count with the pool spinlock
Currently, all updates to ->entropy_count are synchronized by means of
cmpxchg-retry loops found in credit_entropy_bits(),
__credit_entropy_bits_fast() and account() respectively.

However, all but one of the __credit_entropy_bits_fast() call sites grab the
pool ->lock already and it would be nice if the potentially costly cmpxchg
could be avoided in these performance critical paths. In addition to that,
future patches will introduce new fields to struct entropy_store which will
require some kind of synchronization with ->entropy_count updates from said
producer paths as well.

Protect ->entropy_count with the pool ->lock.

- Make callers of __credit_entropy_bits_fast() invoke it with the pool
  ->lock held. Extend existing critical sections where possible. Drop the
  cmpxchg-retry loop in __credit_entropy_bits_fast() in favor of a plain
  assignment.
- Retain the retry loop in credit_entropy_bits(): the potentially expensive
  pool_entropy_delta() should not be called under the lock in order to not
  unnecessarily block contenders. In order to continue to synchronize with
  __credit_entropy_bits_fast() and account(), the cmpxchg gets replaced by
  a plain comparison + store with the ->lock being held.
- Make account() grab the ->lock and drop the cmpxchg-retry loop in favor
  of a plain assignment.

Signed-off-by: Nicolai Stange
---
 drivers/char/random.c | 44 ++++++++++++++++++++++++++++--------------
 1 file changed, 30 insertions(+), 14 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index d9e4dd27d45d..9f87332b158f 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -718,7 +718,7 @@ static unsigned int pool_entropy_delta(struct entropy_store *r,
  * Credit the entropy store with n bits of entropy.
  * To be used from hot paths when it is either known that nbits is
  * smaller than one half of the pool size or losing anything beyond that
- * doesn't matter.
+ * doesn't matter. Must be called with r->lock being held.
  */
 static bool __credit_entropy_bits_fast(struct entropy_store *r, int nbits)
 {
@@ -727,13 +727,11 @@ static bool __credit_entropy_bits_fast(struct entropy_store *r, int nbits)
 	if (!nbits)
 		return false;
 
-retry:
-	orig = READ_ONCE(r->entropy_count);
+	orig = r->entropy_count;
 	entropy_count = orig + pool_entropy_delta(r, orig,
 						  nbits << ENTROPY_SHIFT,
 						  true);
-	if (cmpxchg(&r->entropy_count, orig, entropy_count) != orig)
-		goto retry;
+	WRITE_ONCE(r->entropy_count, entropy_count);
 
 	trace_credit_entropy_bits(r->name, nbits,
 				  entropy_count >> ENTROPY_SHIFT, _RET_IP_);
@@ -755,17 +753,28 @@ static bool __credit_entropy_bits_fast(struct entropy_store *r, int nbits)
 static void credit_entropy_bits(struct entropy_store *r, int nbits)
 {
 	int entropy_count, orig;
+	unsigned long flags;
 
 	if (!nbits)
 		return;
 
 retry:
+	/*
+	 * Don't run the potentially expensive pool_entropy_delta()
+	 * calculations under the spinlock. Instead retry until
+	 * ->entropy_count becomes stable.
+	 */
 	orig = READ_ONCE(r->entropy_count);
 	entropy_count = orig + pool_entropy_delta(r, orig,
 						  nbits << ENTROPY_SHIFT,
 						  false);
-	if (cmpxchg(&r->entropy_count, orig, entropy_count) != orig)
+	spin_lock_irqsave(&r->lock, flags);
+	if (r->entropy_count != orig) {
+		spin_unlock_irqrestore(&r->lock, flags);
 		goto retry;
+	}
+	WRITE_ONCE(r->entropy_count, entropy_count);
+	spin_unlock_irqrestore(&r->lock, flags);
 
 	trace_credit_entropy_bits(r->name, nbits,
 				  entropy_count >> ENTROPY_SHIFT, _RET_IP_);
@@ -1203,12 +1212,11 @@ static void add_timer_randomness(struct timer_rand_state *state, unsigned num)
 	} sample;
 	long delta, delta2, delta3;
 	bool reseed;
+	unsigned long flags;
 
 	sample.jiffies = jiffies;
 	sample.cycles = random_get_entropy();
 	sample.num = num;
-	r = &input_pool;
-	mix_pool_bytes(r, &sample, sizeof(sample));
 
 	/*
	 * Calculate number of bits of randomness we probably added.
@@ -1235,12 +1243,16 @@ static void add_timer_randomness(struct timer_rand_state *state, unsigned num)
 	if (delta > delta3)
 		delta = delta3;
 
+	r = &input_pool;
+	spin_lock_irqsave(&r->lock, flags);
+	__mix_pool_bytes(r, &sample, sizeof(sample));
 	/*
 	 * delta is now minimum absolute delta.
 	 * Round down by 1 bit on general principles,
 	 * and limit entropy estimate to 12 bits.
 	 */
 	reseed = __cred
[RFC PATCH 29/41] random: move definition of struct queued_entropy and related API upwards
The next patch will add a member of type struct queued_entropy to struct
fast_pool and thus, the former's definition needs to be visible at the
latter's. Move the definition of struct queued_entropy upwards in the file
so that it comes before the definition of struct fast_pool. Move the
associated function definitions as well in order to keep everything
together.

Note that said function definitions had originally been inserted at the old
location with the intent to minimize their introducing patch's diff by
placing them near the now removed credit_entropy_delta() they superseded.

Signed-off-by: Nicolai Stange
---
 drivers/char/random.c | 243 +++++++++++++++++++++---------------------
 1 file changed, 124 insertions(+), 119 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 680ccc82a436..55e784a5a2ec 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -519,6 +519,10 @@ static ssize_t extract_entropy(struct entropy_store *r, void *buf,
 static ssize_t _extract_entropy(struct entropy_store *r, void *buf,
 				size_t nbytes, int fips);
 
+static unsigned int pool_entropy_delta(struct entropy_store *r,
+				       int base_entropy_count,
+				       int nfrac, bool fast);
+
 static int min_crng_reseed_pool_entropy(void);
 
 static void crng_reseed(struct crng_state *crng, struct entropy_store *r);
@@ -610,125 +614,6 @@ static void mix_pool_bytes(struct entropy_store *r, const void *in,
 	spin_unlock_irqrestore(&r->lock, flags);
 }
 
-struct fast_pool {
-	__u32		pool[4];
-	unsigned long	last;
-	unsigned short	reg_idx;
-	unsigned char	count;
-	int		event_entropy_shift;
-};
-
-/*
- * This is a fast mixing routine used by the interrupt randomness
- * collector. It's hardcoded for an 128 bit pool and assumes that any
- * locks that might be needed are taken by the caller.
- */
-static void fast_mix(struct fast_pool *f)
-{
-	__u32 a = f->pool[0],	b = f->pool[1];
-	__u32 c = f->pool[2],	d = f->pool[3];
-
-	a += b;			c += d;
-	b = rol32(b, 6);	d = rol32(d, 27);
-	d ^= a;			b ^= c;
-
-	a += b;			c += d;
-	b = rol32(b, 16);	d = rol32(d, 14);
-	d ^= a;			b ^= c;
-
-	a += b;			c += d;
-	b = rol32(b, 6);	d = rol32(d, 27);
-	d ^= a;			b ^= c;
-
-	a += b;			c += d;
-	b = rol32(b, 16);	d = rol32(d, 14);
-	d ^= a;			b ^= c;
-
-	f->pool[0] = a;  f->pool[1] = b;
-	f->pool[2] = c;  f->pool[3] = d;
-	f->count++;
-}
-
-static void process_random_ready_list(void)
-{
-	unsigned long flags;
-	struct random_ready_callback *rdy, *tmp;
-
-	spin_lock_irqsave(&random_ready_list_lock, flags);
-	list_for_each_entry_safe(rdy, tmp, &random_ready_list, list) {
-		struct module *owner = rdy->owner;
-
-		list_del_init(&rdy->list);
-		rdy->func(rdy);
-		module_put(owner);
-	}
-	spin_unlock_irqrestore(&random_ready_list_lock, flags);
-}
-
-/*
- * Based on the pool's current entropy fill level, specified as
- * base_entropy_count, and the number of new entropy bits in units of
- * 2^-ENTROPY_SHIFT to add, return the amount of new entropy to
- * credit. If the 'fast' parameter is set to true, the calculation
- * will be guaranteed to terminate quickly, but this comes at the
- * expense of capping nbits to one half of the pool size.
- */
-static unsigned int pool_entropy_delta(struct entropy_store *r,
-				       int base_entropy_count,
-				       int nfrac, bool fast)
-{
-	const int pool_size = r->poolinfo->poolfracbits;
-	int entropy_count = base_entropy_count;
-
-	if (!nfrac)
-		return 0;
-
-	if (pool_size <= base_entropy_count)
-		return 0;
-
-	/*
-	 * Credit: we have to account for the possibility of
-	 * overwriting already present entropy. Even in the
-	 * ideal case of pure Shannon entropy, new contributions
-	 * approach the full value asymptotically:
-	 *
-	 * entropy <- entropy + (pool_size - entropy) *
-	 *	(1 - exp(-add_entropy/pool_size))
-	 *
-	 * For add_entropy <= pool_size/2 then
-	 * (1 - exp(-add_entropy/pool_size)) >=
-	 *    (add_entropy/pool_size)*0.7869...
-	 * so we can approximate the exponential with
-	 * 3/4*add_entropy/pool_size and still be on the
-	 * safe side by adding at most pool_size/2 at a time.
-	 *
-	 * The use of pool_size-2 in the while statement is to
-	 * prevent rounding artifacts from making the loop
-	 * arbitrarily long; this limits the loop
[RFC PATCH 39/41] random: make the startup tests include multiple APT invocations
Given a per-IRQ min-entropy estimate of H, the Adaptive Proportion Test
(APT) will need to consume at most n = 128/H samples before reaching a
conclusion. The supported values for H are 1, 1/2, 1/4, 1/8, ..., 1/64, but
only 1 and 1/8 are currently being actively used on systems with and without
a high resolution get_cycles() respectively. The corresponding numbers of
samples consumed by one APT execution are 128, 256, 512, 1024, 2048, 4096
and 8192.

Currently, the ->warmup parameter used for controlling the startup is
hardcoded to be initialized to 1024 and the health test logic won't permit
the caller, i.e. add_interrupt_randomness(), to dispatch any entropy to the
global balance until that many events have been processed *and* the first
APT has completed, whichever comes later. It would take roughly eight
successful APT invocations for H=1 until the startup sequence has completed,
but for all H <= 1/8, the ->warmup logic is effectively useless because the
first APT would always need to process >= 1024 samples anyway.

The probabilities of one single APT invocation successfully detecting a
degradation of the per-IRQ min-entropy to H/2 ("power") are as follows for
the different supported H estimates:

   H        n     power
  ----------------------
   1        128   64.7%
   1/2      256   79.1%
   1/4      512   81.6%
   1/8     1024   84.0%
   1/16    2048   84.9%
   1/32    4096   86.9%
   1/64    8192   86.4%

Thus, for H=1, the probability that at least one out of those eight APT
invocations will detect a degradation to H/2 is 1 - (1 - 64.7%)^8 = 99.98%,
which is quite good. OTOH, the 84.0% achievable with the single APT
invocation for H = 1/8 is only semi-satisfactory.

Note that as it currently stands, the only point in time where the health
tests can still intervene and keep back low quality noise from the
primary_crng is before the initial seed has happened. Afterwards, failing
continuous health tests would only potentially delay those best effort
reseeds (which is questionable behaviour in itself, as the crng state's
entropy is never reduced in the course of reseeding).

A future patch will enable dynamically switching from the initial H=1 or
1/8 resp. to lower per-IRQ entropy values upon health test failures in order
to keep those systems going where these more or less arbitrary per-IRQ
entropy estimates turn out to be simply wrong.

From a paranoia POV, it is certainly a good idea to run the APT several
times in a row during startup in order to achieve a good statistical power.
Extending the warmup to cover the larger of the 1024 events required by NIST
SP800-90B and four full APT lengths will result in a combined probability of
detecting an entropy degradation to H/2 of >= 99.98% across all supported
values of H. The obvious downside is that the number of IRQ events required
for the initial seed will be quadrupled, at least for H <= 1/8.

Follow this approach. Amend health_test_reset()'s signature with an
additional parameter, event_entropy_shift, and make it set ->warmup to the
larger of 1024 and 4 * 128 / (2^-event_entropy_shift). Adjust all call sites
accordingly.

Signed-off-by: Nicolai Stange
---
 drivers/char/random.c | 23 +++++++++++++++--------
 1 file changed, 15 insertions(+), 8 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index bd8c24e433d0..86dd87588b1b 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -1058,14 +1058,21 @@ health_test_apt(struct health_test *h, unsigned int event_entropy_shift,
 	return health_queue;
 }
 
-static void health_test_reset(struct health_test *h)
+static void health_test_reset(struct health_test *h,
+			      unsigned int event_entropy_shift)
 {
 	/*
-	 * Don't dispatch until at least 1024 events have been
-	 * processed by the continuous health tests as required by
-	 * NIST SP800-90B for the startup tests.
+	 * Let H = 2^-event_entropy_shift equal the estimated per-IRQ
+	 * min-entropy. One APT will consume at most 128 / H samples
+	 * until completion. Run the startup tests for the larger of
+	 * 1024 events as required by NIST or four times the APT
+	 * length. In either case, the combined probability of the
+	 * resulting number of successive APTs to detect a degradation
+	 * of H to H/2 will be >= 99.8%, for any supported value of
+	 * event_entropy_shift.
 	 */
-	h->warmup = 1024;
+	h->warmup = 4 * (128 << event_entropy_shift);
+	h->warmup = max_t(unsigned int, h->warmup, 1024);
 
 	health_apt_reset(h);
 }
@@ -1092,7 +1099,7 @@ health_test_process(struct health_test *h, unsigned int event_entropy_shift,
 		 * (or always been) a constant.
 		 */
 		if (h->warmup)
-			health_test_reset(h);
+			health_test_reset(h, event_entropy_shift);
 		return health_discard;
 	}
 
@@ -1104
[RFC PATCH 33/41] random: make health_test_process() maintain the get_cycles() delta
The min-entropy estimate has been made for the lower eight bits of the
deltas between cycle counter values from successive IRQ events and thus, the
upcoming health tests should actually be run on these deltas.

Introduce a new field ->previous_sample to struct health_test for storing
the previous get_cycles() value. Make health_test_process() maintain it and
also calculate the delta between the current and the previous value at this
point already in preparation for passing it to the upcoming health tests.

Note that ->previous_sample is deliberately not touched from
health_test_reset() in order to maintain a steady flow of correctly
calculated deltas across health test resets.

Signed-off-by: Nicolai Stange
---
 drivers/char/random.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index cb6441b96b8e..33f9b7b59f92 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -879,7 +879,9 @@ static void discard_queued_entropy(struct entropy_store *r,
 	spin_unlock_irqrestore(&r->lock, flags);
 }
 
-struct health_test {};
+struct health_test {
+	u8 previous_sample;
+};
 
 enum health_result {
 	health_none,
@@ -895,6 +897,16 @@ static enum health_result
 health_test_process(struct health_test *h, unsigned int event_entropy_shift,
 		    u8 sample)
 {
+	u8 sample_delta;
+
+	/*
+	 * The min-entropy estimate has been made for the lower eight
+	 * bits of the deltas between cycle counter values from
+	 * successive IRQ events.
+	 */
+	sample_delta = sample - h->previous_sample;
+	h->previous_sample = sample;
+
 	return health_none;
 }
 
-- 
2.26.2
[RFC PATCH 37/41] random: implement the "Repetition Count" NIST SP800-90B health test
The "Repetition Count Test" (RCT) as specified by NIST SP800-90B simply
counts the number of times the same sample value has been observed and
reports failure if a highly unlikely threshold is exceeded. The exact value
of the latter depends on the estimated per-IRQ min-entropy H as well as on
the upper bound set on the probability of false positives. For the latter, a
maximum value of 2^-20 is recommended and with this value the threshold can
be calculated as 1 + ceil(20 / H).

It should be noted that the RCT has very poor statistical power and is only
intended to detect catastrophic noise source failures, like the get_cycles()
in add_interrupt_randomness() always returning the same constant.

Add the fields needed for maintaining the RCT state to struct health_test:
->rct_previous_delta for storing the previous sample value and ->rct_count
for keeping track of how many times this value has been observed in a row
so far.

Implement the RCT and wrap it in a new function, health_test_rct().

Make the health test entry point, health_test_process(), call it early
before invoking the APT and forward failure reports to the caller. All other
return codes from the RCT are ignored, because
- as said, the statistical power is weak and a positive outcome wouldn't
  tell anything and
- it's not desirable to make the caller, i.e. add_interrupt_randomness(),
  queue any further entropy once the concurrently running APT has signaled
  a successful completion.

Signed-off-by: Nicolai Stange
---
 drivers/char/random.c | 55 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 55 insertions(+)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 2c744d2a9b26..54ee082ca4a8 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -890,6 +890,9 @@ struct health_test {
 	};
 
 	u8 previous_sample;
+
+	u8 rct_previous_delta;
+	unsigned short rct_count;
 };
 
 enum health_result {
@@ -899,6 +902,43 @@ enum health_result {
 	health_discard,
 };
 
+/* Repetition count test. */
+static enum health_result
+health_test_rct(struct health_test *h, unsigned int event_entropy_shift,
+		u8 sample_delta)
+{
+	unsigned int c;
+
+	if (likely(sample_delta != h->rct_previous_delta)) {
+		h->rct_previous_delta = sample_delta;
+		h->rct_count = 0;
+		return health_dispatch;
+	}
+
+	h->rct_count++;
+	if (!h->rct_count) {
+		/* Overflow. */
+		h->rct_count = -1;
+	}
+
+	/*
+	 * With a min-entropy of H = 2^-event_entropy_shift bits per
+	 * event, the maximum probability of seeing any particular
+	 * sample value (i.e. delta) is bounded by 2^-H. Thus, the
+	 * probability to observe the same events C times in a row is
+	 * less than 2^-((C - 1) * H). Limit the false positive rate
+	 * of the repetition count test to 2^-20, which yields a
+	 * cut-off value of C = 1 + 20/H. Note that the actual number
+	 * of repetitions equals ->rct_count + 1, so this offset by
+	 * one must be accounted for in the comparison below.
+	 */
+	c = 20 << event_entropy_shift;
+	if (h->rct_count >= c)
+		return health_discard;
+
+	return health_queue;
+}
+
 /* Adaptive Proportion Test */
 #define HEALTH_APT_PRESEARCH_EVENT_COUNT 7
 
@@ -1027,6 +1067,7 @@ health_test_process(struct health_test *h, unsigned int event_entropy_shift,
 		    u8 sample)
 {
 	u8 sample_delta;
+	enum health_result rct;
 
 	/*
 	 * The min-entropy estimate has been made for the lower eight
@@ -1036,6 +1077,20 @@ health_test_process(struct health_test *h, unsigned int event_entropy_shift,
 	sample_delta = sample - h->previous_sample;
 	h->previous_sample = sample;
 
+	rct = health_test_rct(h, event_entropy_shift, sample_delta);
+	if (rct == health_discard) {
+		/*
+		 * Something is really off, get_cycles() has become
+		 * (or always been) a constant.
+		 */
+		return health_discard;
+	}
+
+	/*
+	 * Otherwise return whatever the APT returns. In particular,
+	 * don't care about whether the RCT needs to consume more
+	 * samples to complete.
+	 */
 	return health_test_apt(h, event_entropy_shift, sample_delta);
 }
 
-- 
2.26.2
[RFC PATCH 28/41] random: don't award entropy to disk + input events if in FIPS mode
NIST SP800-90C prohibits the use of multiple correlated entropy sources. Obviously, add_disk_randomness(), add_input_randomness() and add_interrupt_randomness() are not independent. Follow the approach taken by Stephan Müller's LRNG patchset ([1]) and don't award any entropy to the former two if fips_enabled is true. Note that the entropy loss has already been compensated for by a previous patch increasing the IRQ event estimate.

The actual entropy accounting for add_disk_randomness() and add_input_randomness() is implemented in the common add_timer_randomness() called therefrom. Make the latter not dispatch any entropy to the global entropy balance if fips_enabled is on.

[1] https://lkml.kernel.org/r/5695397.lov4wx5...@positron.chronox.de

Suggested-by: Stephan Müller Signed-off-by: Nicolai Stange --- drivers/char/random.c | 24 ++-- 1 file changed, 18 insertions(+), 6 deletions(-) diff --git a/drivers/char/random.c b/drivers/char/random.c index 8f79e90f2429..680ccc82a436 100644 --- a/drivers/char/random.c +++ b/drivers/char/random.c @@ -1481,12 +1481,24 @@ static void add_timer_randomness(struct timer_rand_state *state, unsigned num) r = _pool; spin_lock_irqsave(>lock, flags); - /* -* delta is now minimum absolute delta. -* Round down by 1 bit on general principles, -* and limit entropy estimate to 12 bits. -*/ - __queue_entropy(r, , min_t(int, fls(delta>>1), 11) << ENTROPY_SHIFT); + if (!fips_enabled) { + unsigned int nfrac; + + /* +* delta is now minimum absolute delta. +* Round down by 1 bit on general principles, +* and limit entropy estimate to 12 bits. +*/ + nfrac = min_t(int, fls(delta>>1), 11) << ENTROPY_SHIFT; + __queue_entropy(r, , nfrac); + } else { + /* +* Multiple correlated entropy sources are prohibited +* by NIST SP800-90C. Leave it up to +* add_interrupt_randomness() to contribute any +* entropy. +*/ + } __mix_pool_bytes(r, , sizeof(sample)); reseed = __dispatch_queued_entropy_fast(r, ); spin_unlock_irqrestore(>lock, flags); -- 2.26.2
[RFC PATCH 31/41] random: introduce struct health_test + health_test_reset() placeholders
The health tests to be implemented will maintain some per-CPU state as they successively process the IRQ samples fed into the resp. fast_pool from add_interrupt_randomness().

In order not to clutter future patches with trivialities, introduce an empty struct health_test supposed to keep said state in the future. Add a member of this new type to struct fast_pool.

Introduce a health_test_reset() stub, which is supposed to (re)initialize instances of struct health_test. Invoke it from fast_pool_init_accounting() to make sure that a fast_pool's contained health_test instance gets initialized once before its first usage.

Make add_interrupt_randomness() call fast_pool_init_accounting() earlier: health test functionality will get invoked before the latter's old location and it must have been initialized by that time.

Signed-off-by: Nicolai Stange --- drivers/char/random.c | 11 +-- 1 file changed, 9 insertions(+), 2 deletions(-) diff --git a/drivers/char/random.c b/drivers/char/random.c index 37746df53acf..0f56c873a501 100644 --- a/drivers/char/random.c +++ b/drivers/char/random.c @@ -879,6 +879,11 @@ static void discard_queued_entropy(struct entropy_store *r, spin_unlock_irqrestore(>lock, flags); } +struct health_test {}; + +static void health_test_reset(struct health_test *h) +{} + struct fast_pool { __u32 pool[4]; unsigned long last; @@ -886,6 +891,7 @@ struct fast_pool { unsigned char count; int event_entropy_shift; struct queued_entropy q; + struct health_test health; }; /* @@ -1644,6 +1650,7 @@ static inline void fast_pool_init_accounting(struct fast_pool *f) return; f->event_entropy_shift = min_irq_event_entropy_shift(); + health_test_reset(>health); } void add_interrupt_randomness(int irq, int irq_flags) @@ -1674,6 +1681,8 @@ void add_interrupt_randomness(int irq, int irq_flags) add_interrupt_bench(cycles); this_cpu_add(net_rand_state.s1, fast_pool->pool[cycles & 3]); + fast_pool_init_accounting(fast_pool); + if (unlikely(crng_init == 0)) { if 
((fast_pool->count >= 64) && crng_fast_load((char *) fast_pool->pool, @@ -1692,8 +1701,6 @@ void add_interrupt_randomness(int irq, int irq_flags) if (!spin_trylock(>lock)) return; - fast_pool_init_accounting(fast_pool); - if (!fips_enabled) { /* award one bit for the contents of the fast pool */ nfrac = 1 << ENTROPY_SHIFT; -- 2.26.2
[RFC PATCH 12/41] random: convert add_interrupt_randomness() to queued_entropy API
In an effort to drop __credit_entropy_bits_fast() in favor of the new __queue_entropy()/__dispatch_queued_entropy_fast() API, convert add_interrupt_randomness() from the former to the latter. There is no change in functionality at this point, because __credit_entropy_bits_fast() has already been reimplemented on top of the new API before. Signed-off-by: Nicolai Stange --- drivers/char/random.c | 28 1 file changed, 16 insertions(+), 12 deletions(-) diff --git a/drivers/char/random.c b/drivers/char/random.c index e8c86abde901..bd3774c6be4b 100644 --- a/drivers/char/random.c +++ b/drivers/char/random.c @@ -1512,6 +1512,7 @@ void add_interrupt_randomness(int irq, int irq_flags) unsigned long seed; int credit = 0; boolreseed; + struct queued_entropy q = { 0 }; if (cycles == 0) cycles = get_reg(fast_pool, regs); @@ -1546,24 +1547,27 @@ void add_interrupt_randomness(int irq, int irq_flags) if (!spin_trylock(>lock)) return; - fast_pool->last = now; - __mix_pool_bytes(r, _pool->pool, sizeof(fast_pool->pool)); - /* * If we have architectural seed generator, produce a seed and -* add it to the pool. For the sake of paranoia don't let the -* architectural seed generator dominate the input from the -* interrupt noise. +* add it to the pool further below. For the sake of paranoia +* don't let the architectural seed generator dominate the +* input from the interrupt noise. */ - if (arch_get_random_seed_long()) { - __mix_pool_bytes(r, , sizeof(seed)); - credit = 1; - } + credit = !!arch_get_random_long(); + fast_pool->last = now; fast_pool->count = 0; - /* award one bit for the contents of the fast pool */ - reseed = __credit_entropy_bits_fast(r, credit + 1); + __queue_entropy(r, , (credit + 1) << ENTROPY_SHIFT); + __mix_pool_bytes(r, _pool->pool, sizeof(fast_pool->pool)); + if (credit) { + /* +* A seed has been obtained from +* arch_get_random_seed_long() above, mix it in. 
+*/ + __mix_pool_bytes(r, , sizeof(seed)); + } + reseed = __dispatch_queued_entropy_fast(r, ); spin_unlock(>lock); if (reseed) crng_reseed(_crng, r); -- 2.26.2
[RFC PATCH 41/41] random: lower per-IRQ entropy estimate upon health test failure
Currently, if fips_enabled is set, a per-IRQ min-entropy estimate of either 1 bit or 1/8 bit is assumed, depending on whether a high resolution get_cycles() is available or not. The statistical NIST SP800-90B startup health tests are run on a certain amount of noise samples and are intended to reject in case this hypothesis turns out to be wrong, i.e. if the actual min-entropy is smaller. As long as the startup tests haven't finished, entropy dispatch and thus, the initial crng seeding, is inhibited. On test failure, the startup tests would restart themselves from the beginning. It follows that in case a system's actual per-IRQ min-entropy is smaller than the more or less arbitrarily assessed 1 bit or 1/8 bit resp., there will be a good chance that the initial crng seed will never complete. AFAICT, such a situation could potentially prevent certain userspace daemons like OpenSSH from loading. In order to still be able to make any progress, make add_interrupt_randomness() lower the per-IRQ min-entropy by one half upon each health test failure, but only until the minimum supported value of 1/64 bits has been reached. Note that health test failures will cause a restart of the startup health tests already and thus, a certain number of additional noise samples resp. IRQ events will have to get examined by the health tests before the initial crng seeding can take place. This number of fresh events required is reciprocal to the estimated per-IRQ min-entropy H: for the Adaptive Proportion Test (APT) it equals ~128 / H. It follows that this patch won't be of much help for embedded systems or VMs with poor IRQ rates at boot time, at least not without manual intervention. But there aren't many options left when fips_enabled is set. With respect to NIST SP800-90B conformance, this patch enters kind of a gray area: NIST SP800-90B has no notion of such a dynamically adjusted min-entropy estimate. 
Instead, it is assumed that some fixed value has been estimated based on general principles and subsequently validated in the course of the certification process. However, I would argue that if a system had successfully passed certification for 1 bit or 1/8 bit resp. of estimated min-entropy per sample, it would automatically be approved for all smaller values as well. Had we started out with such a lower value passing the health tests from the beginning, the latter would never have complained in the first place and the system would have come up just fine.

Finally, note that all statistical tests have a non-zero probability of false positives and so do the NIST SP800-90B health tests. In order not to keep the estimated per-IRQ entropy at a smaller level than necessary forever after spurious health test failures, make add_interrupt_randomness() attempt to double it again after a certain number of successful health test passes at the degraded entropy level have been completed. This threshold should not be too small in order to avoid excessive entropy accounting loss due to continuously alternating between a too large per-IRQ entropy estimate and the next smaller value. For now, choose a value of five as a compromise between quick recovery and limiting said accounting loss.

So, introduce a new member ->good_tests to struct fast_pool for keeping track of the number of successful health test passes. Make add_interrupt_randomness() increment it upon successful health test completion and reset it to zero on failures. Make add_interrupt_randomness() double the current min-entropy estimate and restart the startup health tests in case ->good_tests is > 4 and the entropy had previously been lowered. 
Signed-off-by: Nicolai Stange --- drivers/char/random.c | 25 +++-- 1 file changed, 23 insertions(+), 2 deletions(-) diff --git a/drivers/char/random.c b/drivers/char/random.c index bb79dcb96882..24c09ba9d7d0 100644 --- a/drivers/char/random.c +++ b/drivers/char/random.c @@ -1126,6 +1126,7 @@ struct fast_pool { booldispatch_needed : 1; booldiscard_needed : 1; int event_entropy_shift; + unsigned intgood_tests; struct queued_entropy q; struct health_test health; }; @@ -1926,9 +1927,13 @@ void add_interrupt_randomness(int irq, int irq_flags) cycles); if (unlikely(health_result == health_discard)) { /* -* Oops, something's odd. Restart the startup -* tests. +* Oops, something's odd. Lower the entropy +* estimate and restart the startup tests. */ + fast_pool->event_entropy_shift = + min_t(unsigned int, + fast_pool->event_entropy_shift + 1, 6); + fast_pool-
[RFC PATCH 10/41] random: implement support for delayed entropy dispatching
e follows the same line of reasoning as for the chosen magnitude limit. In order to enable this watermark lifetime management, add yet another new field ->entropy_watermark_leak_time to struct entropy_store. Make account_begin() set it to the current jiffies upon the first increment of ->entropy_watermark_delta from zero. Make account_complete() reset ->entropy_watermark_delta and invalidate all queued entropy as described above whenever ->pending_extractions is zero and either ->entropy_watermark_leak_time is older than two times CRNG_RESEED_INTERVAL or if ->entropy_watermark_delta exceeds one fourth of the pool size. As entropy producers haven't been converted to the new __queue_entropy() + dispatch_queued_entropy()/__dispatch_queued_entropy_fast() API yet, the net effect of this patch is to "fix" a scenario similar to the one initially described for those producers that call __mix_pool_bytes() and __credit_entropy_bits_fast() without dropping the pool's ->lock inbetween, i.e. for add_interrupt_randomness() and add_timer_randomness(). Namely, if said sequence happens to get serialized inbetween the account_begin() (formerly account()) and the last extract_buf() from a concurrent extraction, the pool's entropy watermark will now be taken into account when calculating the amount of new entropy to credit in __credit_entropy_bits_fast(), because the latter has been reimplemented on top of the new API. Other than that, it's noteworthy that the pool entropy watermark might exceed unexpectedly high levels at boot time, namely if multiple producers happen to trigger the initial seeding of the primary_crng and the subsequent entropy extractions complete when entropy has been queued up somewhere else, e.g. in try_to_generate_entropy(). As detailed above, high values of the pool watermark will reduce the efficiency when dispatching new entropy attributions, but note that - There are mechanisms in place to limit the effect in magnitude and time. 
- The watermark can never exceed the total amount of entropy collected so far. So entropy collection at boot time would have to be terribly efficient in order for this to matter. - As seeding the primary_crng is a prerequisite for the described scenario, a temporarily reduced entropy collection efficiency isn't really concerning: getting towards a seeded primary_crng is all that matters at this point. Signed-off-by: Nicolai Stange --- drivers/char/random.c | 315 +++--- 1 file changed, 292 insertions(+), 23 deletions(-) diff --git a/drivers/char/random.c b/drivers/char/random.c index 9f87332b158f..b91d1fc08ac5 100644 --- a/drivers/char/random.c +++ b/drivers/char/random.c @@ -499,7 +499,13 @@ struct entropy_store { spinlock_t lock; unsigned short add_ptr; unsigned short input_rotate; + int entropy_count; + unsigned int entropy_watermark_delta; + unsigned int entropy_watermark_seq; + unsigned int queued_entropy_batches; + unsigned int pending_extractions; + unsigned long entropy_watermark_leak_time; unsigned int initialized:1; unsigned int last_data_init:1; __u8 last_data[EXTRACT_SIZE]; @@ -671,6 +677,9 @@ static unsigned int pool_entropy_delta(struct entropy_store *r, if (!nfrac) return 0; + if (pool_size <= base_entropy_count) + return 0; + /* * Credit: we have to account for the possibility of * overwriting already present entropy. Even in the @@ -714,26 +723,172 @@ static unsigned int pool_entropy_delta(struct entropy_store *r, return entropy_count - base_entropy_count; } +struct queued_entropy { + unsigned int pool_watermark_seq; + unsigned int queued_entropy_fracbits; +}; + /* - * Credit the entropy store with n bits of entropy. - * To be used from hot paths when it is either known that nbits is - * smaller than one half of the pool size or losing anything beyond that - * doesn't matter. Must be called with r->lock being held. + * Queue a given amount of new entropy which is about to mixed into + * the entropy pool for later dispatch. 
+ * + * __queue_entropy() may be called one or more time on the same struct + * queued_entropy instance in order to accumulate entropy for later + * dispatch. However, any such sequence of invocations must eventually + * be followed by exactly one call to either of __dequeue_entropy(), + * __dispatch_queued_entropy_fast() or dispatch_queued_entropy() + * when the actual pool mixing has completed. + * __queue_entropy() must be called with r->lock held. + * + * Entropy extraction is a two-step process: + * 1.) The allocated amount of entropy gets subtracted from ->entropy_count. + * 2.) The entropy is actually extracted from the pool by means of one or more + * extract_buf() invocations. + * Likewise for the mixing side: + * 1.) The new entropy data gets mixed
[RFC PATCH 32/41] random: introduce health test stub and wire it up
NIST SP800-90B requires certain statistical tests to be run continuously on a noise source's output. In preparation to implementing those, introduce an empty stub, health_test_process(), and wire it up to add_interrupt_randomness(). This patch does not implement any actual testing functionality yet, it's merely meant to define the interactions between add_interrupt_randomness() and the health tests.

health_test_process() is to be invoked on individual noise samples, i.e. cycle counter values, and returns one of three possible status codes indicating to the calling add_interrupt_randomness() that
- either some more samples are needed in order to complete the statistical tests,
- that the tests have finished with positive result on the latest run of noise samples or
- that the tests have failed.

Introduce an enum health_result defining constants corresponding to these resp. cases: health_queue, health_dispatch and health_discard. Provide another value, health_none, to indicate the case that the health tests are disabled, because e.g. fips_enabled is unset. Make the stub health_test_process() return this value for now.

As long as the statistical tests need more input noise samples before reaching a conclusion, health_queue will get returned from health_test_process(). FWIW, the number of successive input samples needed by the tests will be on the order of 128 to 8192, depending on the per-IRQ entropy estimate. add_interrupt_randomness() currently attempts to transfer the noise kept within the per-CPU fast_pool, which is of limited capacity, to the global input_pool as soon as a threshold of 64 events is reached and it will continue to do so. However, as long as some tests are pending, i.e. keep returning health_queue, the associated amount of estimated entropy must not get added to the global input_pool balance, but queued up at the fast_pool's queued_entropy instance. 
Once the health tests have eventually succeeded, as indicated by health_test_process(), the entropy previously queued up may get dispatched to the global reserve. OTOH, on test failure health_discard will get returned and all entropy queued up from add_interrupt_randomness() since the last dispatch (or discard resp.) must get discarded. Note that add_interrupt_randomness() will continue to unconditionally mix the samples into the fast_pools and eventually into the global input_pool -- the health test results really only affect the entropy accounting.

So, make add_interrupt_randomness() invoke health_test_process() on the current cycle counter value in case fips_enabled is set. In case a fast_pool's fill level threshold of 64 events is reached at a time when health tests are still pending and keep returning health_queue, let add_interrupt_randomness() continue to mix the fast_pool's contents into the input_pool as before, but enqueue the associated amount of entropy at the fast_pool's associated queued_entropy instance for later dispatch.

Both entropy dispatch and discard operations require a call to __dequeue_entropy(), which in turn must only get invoked with the input_pool's ->lock being held. It follows that in case the spin_trylock() in add_interrupt_randomness() failed, the latter would not be able to perform entropy dispatch or discard operations immediately at the time those have been requested by the health tests. Add two new boolean flags, ->dispatch_needed and ->discard_needed, to struct fast_pool. Set them from add_interrupt_randomness() in case health_test_process() returned health_dispatch or health_discard respectively. Make the current and subsequent add_interrupt_randomness() invocations check for ->dispatch_needed and ->discard_needed and attempt to execute any pending dispatch/discard request. Clear ->dispatch_needed and ->discard_needed again when the prerequisite ->lock could eventually be obtained. 
As actual health tests returning anything but health_none haven't been implemented yet, there is no behavioural change at this point. Signed-off-by: Nicolai Stange --- drivers/char/random.c | 78 +-- 1 file changed, 75 insertions(+), 3 deletions(-) diff --git a/drivers/char/random.c b/drivers/char/random.c index 0f56c873a501..cb6441b96b8e 100644 --- a/drivers/char/random.c +++ b/drivers/char/random.c @@ -881,14 +881,30 @@ static void discard_queued_entropy(struct entropy_store *r, struct health_test {}; +enum health_result { + health_none, + health_queue, + health_dispatch, + health_discard, +}; + static void health_test_reset(struct health_test *h) {} +static enum health_result +health_test_process(struct health_test *h, unsigned int event_entropy_shift, + u8 sample) +{ + return health_none; +} + struct fast_pool { __u32 pool[4]; unsigned long last; unsigned short reg_idx; unsigned char count; + booldispatch_
[RFC PATCH 27/41] random: increase per-IRQ event entropy estimate if in FIPS mode
NIST SP800-90C prohibits the use of multiple correlated entropy sources. However, add_interrupt_randomness(), add_disk_randomness() and add_input_randomness() are clearly not independent and an upcoming patch will make the latter two stop contributing any entropy to the global balance if fips_enabled is on.

With the current parameter settings, it can be assumed that add_disk_randomness() resp. add_input_randomness() are the dominating contributors to the overall entropy reserve for some common workloads: both more or less estimate the entropy per event to equal the width of the minimum out of the first, second and third jiffies deltas to the previous occurrence. add_interrupt_randomness() on the other hand only attributes a single bit of entropy to a full batch of 64 IRQ events (or once a second if that completes earlier). Thus, the upcoming exclusion of two potent entropy sources should somehow be compensated for.

Stephan Müller worked around this very problem in his "LRNG" proposal ([1]) by increasing the entropy estimate per IRQ event. Namely, in case a get_cycles() with instruction granularity is available, he estimated one bit of entropy per IRQ event and (IIRC) 1/10 bits otherwise. I haven't tested this claim myself, in particular not on smaller devices. But for the sake of moving the development of this RFC series forward, I'll assume it as granted and hereby postulate that the lower eight bits of the differences between get_cycles() from two successive IRQ events on the same CPU carry
- one bit of min-entropy in case a get_cycles() with instruction granularity is available and
- 1/8 bit of min-entropy in case get_cycles() is still non-trivial, but has a lower resolution. In particular this is assumed to be true for highly periodic interrupts like those issued for e.g. USB microframes and on all supported architectures.

In the former case, the underlying source of randomness is believed to follow the same principles as for the Jitter RNGs resp.
try_to_generate_entropy(): differences in RAM vs. CPU clockings and unpredictability of cache states to a certain extent.

Notes:
- NIST SP800-90B requires a means to access raw samples for validation purposes. Implementation of such an interface is deliberately not part of this RFC series here, but would necessarily be subject of future work. So there would be a means to at least validate these assumptions.
- The choice of 1/8 over the 1/10 from the LRNG patchset has been made because it's a power of two and I suppose that the estimate of 1/10 had been quite arbitrary anyway. Replacement of the 1/8 by smaller powers of two down to 1/64 will be supported throughout this patch series.

Some health tests as required by NIST SP800-90B will be implemented later in this series. In order to allow for dynamically decreasing the assessed entropy on a per-CPU basis upon health test failures, make it an attribute of the per-CPU struct fast_pool. That is, introduce a new integer field ->event_entropy_shift to struct fast_pool. The estimated entropy per IRQ sample will be calculated as 2^-event_entropy_shift. Initialize it statically with -1 to indicate that runtime initialization hasn't happened yet. Introduce fast_pool_init_accounting(), which gets called unconditionally from add_interrupt_randomness() for doing the necessary runtime initializations once, i.e. if ->event_entropy_shift is still found to be negative.

Implement it with the help of the also new min_irq_event_entropy_shift(), which will return the initial ->event_entropy_shift value as determined according to the rules from above. That is, depending on have_highres_cycle_ctr, the result is either zero or three. Note that have_highres_cycle_ctr will only get properly initialized from rand_initialize() if fips_enabled is set, but ->event_entropy_shift will also only ever get accessed in this case. 
Finally, for the case that fips_enabled is set, make add_interrupt_randomness() estimate the amount of entropy transferred from the fast_pool into the global input_pool as fast_pool_entropy(->count, ->event_entropy_shift), rather than only one single bit. Remember that fast_pool_entropy() calculates the amount of entropy contained in a fast_pool, based on the total number of events mixed into it and the estimated entropy per event.

[1] https://lkml.kernel.org/r/5695397.lov4wx5...@positron.chronox.de

Suggested-by: Stephan Müller Signed-off-by: Nicolai Stange --- drivers/char/random.c | 50 ++- 1 file changed, 45 insertions(+), 5 deletions(-) diff --git a/drivers/char/random.c b/drivers/char/random.c index ac36c56dd135..8f79e90f2429 100644 --- a/drivers/char/random.c +++ b/drivers/char/random.c @@ -615,6 +615,7 @@ struct fast_pool { unsigned long last; unsigned short reg_idx; unsigned char count; + int event_entropy_shift; }; /* @@ -1509,7 +1510,9
[RFC PATCH 13/41] random: convert try_to_generate_entropy() to queued_entropy API
In an effort to drop __credit_entropy_bits_fast() in favor of the new __queue_entropy()/__dispatch_queued_entropy_fast() API, convert try_to_generate_entropy() from the former to the latter. Replace the call to __credit_entropy_bits_fast() from the timer callback, entropy_timer(), by a queue_entropy() operation. Dispatch it from the loop in try_to_generate_entropy() by invoking __dispatch_queued_entropy_fast() after the timestamp has been mixed into the input_pool. In order to provide the timer callback and try_to_generate_entropy() with access to a common struct queued_entropy instance, move the currently anonymous struct definition from the local 'stack' variable declaration in try_to_generate_entropy() to file scope and assign it a name, "struct try_to_generate_entropy_stack". Make entropy_timer() obtain a pointer to the corresponding instance by means of container_of() on the ->timer member contained therein. Amend struct try_to_generate_entropy_stack by a new member ->q of type struct queued_entropy. Note that the described scheme alters behaviour a bit: first of all, new entropy credit now gets only dispatched to the pool after the actual mixing has completed rather than in an unsynchronized manner directly from the timer callback. As the mixing loop try_to_generate_entropy() is expected to run at higher frequency than the timer, this is unlikely to make any difference in practice. Furthermore, the pool entropy watermark as tracked over the period from queuing the entropy in the timer callback and to its subsequent dispatch from try_to_generate_entropy() is now taken into account when calculating the actual credit at dispatch. In consequence, the amount of new entropy dispatched to the pool will potentially be lowered if said period happens to overlap with the pool extraction from an initial crng_reseed() on the primary_crng. However, as getting the primary_crng seeded is the whole point of the try_to_generate_entropy() exercise, this won't matter. 
Note that instead of calling queue_entropy() from the timer callback, an alternative would have been to maintain an invocation counter and queue that up from try_to_generate_entropy() right before the mix operation. This would have reduced the described effect of the pool's entropy watermark and in fact matched the intended queue_entropy() API usage better. However, in this particular case of try_to_generate_entropy(), jitter is desired and invoking queue_entropy() with its buffer locking etc. from the timer callback could potentially contribute to that. Signed-off-by: Nicolai Stange --- drivers/char/random.c | 42 +- 1 file changed, 29 insertions(+), 13 deletions(-) diff --git a/drivers/char/random.c b/drivers/char/random.c index bd3774c6be4b..dfbe49fdbcf1 100644 --- a/drivers/char/random.c +++ b/drivers/char/random.c @@ -1911,6 +1911,12 @@ void get_random_bytes(void *buf, int nbytes) EXPORT_SYMBOL(get_random_bytes); +struct try_to_generate_entropy_stack { + unsigned long now; + struct timer_list timer; + struct queued_entropy q; +} stack; + /* * Each time the timer fires, we expect that we got an unpredictable * jump in the cycle counter. 
Even if the timer is running on another @@ -1926,14 +1932,10 @@ EXPORT_SYMBOL(get_random_bytes); */ static void entropy_timer(struct timer_list *t) { - bool reseed; - unsigned long flags; + struct try_to_generate_entropy_stack *stack; - spin_lock_irqsave(_pool.lock, flags); - reseed = __credit_entropy_bits_fast(_pool, 1); - spin_unlock_irqrestore(_pool.lock, flags); - if (reseed) - crng_reseed(_crng, _pool); + stack = container_of(t, struct try_to_generate_entropy_stack, timer); + queue_entropy(_pool, >q, 1 << ENTROPY_SHIFT); } /* @@ -1942,10 +1944,9 @@ static void entropy_timer(struct timer_list *t) */ static void try_to_generate_entropy(void) { - struct { - unsigned long now; - struct timer_list timer; - } stack; + struct try_to_generate_entropy_stack stack = { 0 }; + unsigned long flags; + bool reseed; stack.now = random_get_entropy(); @@ -1957,14 +1958,29 @@ static void try_to_generate_entropy(void) while (!crng_ready()) { if (!timer_pending()) mod_timer(, jiffies+1); - mix_pool_bytes(_pool, , sizeof(stack.now)); + spin_lock_irqsave(_pool.lock, flags); + __mix_pool_bytes(_pool, , sizeof(stack.now)); + reseed = __dispatch_queued_entropy_fast(_pool, ); + spin_unlock_irqrestore(_pool.lock, flags); + + if (reseed) + crng_reseed(_crng, _pool); + schedule(); stack.now = random_get_entropy(); } del_timer_sync(); destroy_timer_on_stack(); - mix_pool_bytes(
[RFC PATCH 35/41] random: improve the APT's statistical power
lth_apt_presearch_finalize() for restoring the candidate from the presearch bit counters. Call it from health_test_apt() once the n1'th event in a sequence has been processed and the presearch phase is to be concluded. Make health_test_apt() search for the candidate value as determined by the presearch phase among the sequence's remaining n2 = n - n1 samples. Adapt the failure thresholds to the now slightly smaller n2 values. Signed-off-by: Nicolai Stange --- drivers/char/random.c | 58 +-- 1 file changed, 50 insertions(+), 8 deletions(-) diff --git a/drivers/char/random.c b/drivers/char/random.c index 131302cbc495..75a103f24fea 100644 --- a/drivers/char/random.c +++ b/drivers/char/random.c @@ -881,8 +881,13 @@ static void discard_queued_entropy(struct entropy_store *r, struct health_test { unsigned short apt_event_count; - unsigned short apt_candidate_count; - u8 apt_candidate; + union { + u8 apt_presearch_bit_counters[8]; + struct { + unsigned short apt_candidate_count; + u8 apt_candidate; + }; + }; u8 previous_sample; }; @@ -895,9 +900,44 @@ enum health_result { }; /* Adaptive Proportion Test */ +#define HEALTH_APT_PRESEARCH_EVENT_COUNT 7 + +static void health_apt_presearch_update(struct health_test *h, u8 sample_delta) +{ + int i; + + for (i = 0; i < 8; ++i) { + h->apt_presearch_bit_counters[i] = sample_delta & 0x1; + sample_delta >>= 1; + } +} + +static void health_apt_presearch_finalize(struct health_test *h) +{ + int i; + + /* +* If some event octet occurred more than half of the time, +* i.e. more than HEALTH_APT_PRESEARCH_EVENT_COUNT / 2 times, +* then its value can be restored unambigiously from the eight +* ->apt_presearch_bit_counters each holding the count of 1s +* encountered at the corresponding bit positions. 
+*/ + h->apt_candidate = 0; + for (i = 0; i < 8; ++i) { + if (h->apt_presearch_bit_counters[i] >= + (HEALTH_APT_PRESEARCH_EVENT_COUNT + 1) / 2) { + h->apt_candidate |= 1 << i; + } + } + h->apt_candidate_count = 0; +}; + static void health_apt_reset(struct health_test *h) { h->apt_event_count = 0; + memset(h->apt_presearch_bit_counters, 0, + sizeof(h->apt_presearch_bit_counters)); } static enum health_result @@ -911,16 +951,18 @@ health_test_apt(struct health_test *h, unsigned int event_entropy_shift, * values of event_entropy_shift each, should have probability * <= 2^-16. */ - static const unsigned int c[] = {87, 210, 463, 973, 1997, 4044, 8140}; + static const unsigned int c[] = {83, 205, 458, 968, 1991, 4038, 8134}; + + BUILD_BUG_ON(HEALTH_APT_PRESEARCH_EVENT_COUNT != 7); - if (!h->apt_event_count) { - h->apt_event_count = 1; - h->apt_candidate = sample_delta; - h->apt_candidate_count = 0; + ++h->apt_event_count; + if (unlikely(h->apt_event_count <= HEALTH_APT_PRESEARCH_EVENT_COUNT)) { + health_apt_presearch_update(h, sample_delta); + if (h->apt_event_count == HEALTH_APT_PRESEARCH_EVENT_COUNT) + health_apt_presearch_finalize(h); return health_queue; } - ++h->apt_event_count; if (unlikely(h->apt_candidate == sample_delta && ++h->apt_candidate_count == c[event_entropy_shift])) { health_apt_reset(h); -- 2.26.2
[RFC PATCH 34/41] random: implement the "Adaptive Proportion" NIST SP800-90B health test
NIST SP800-90B requires an implementation of the "Adaptive Proportion" health test (APT) or similar for detecting noise source entropy degradations. This test works by counting how many times the first sample value in a sequence of n events occurs among the remaining n-1 samples. The test will reject if this number exceeds a certain threshold. With a min-entropy estimate of H=2^-event_entropy_shift per IRQ event, the probability of observing any particular sample value is bounded by p <= 2^-H. Assuming i.i.d., the number of occurrences of such a sample value among n - 1 events follows the binomial distribution with parameters n - 1 and p. The probability of observing up to k occurrences of a given sample value is not less than that distribution's CDF F(n - 1, p, k) at point k, per the definition of CDFs and the fact that F(n - 1, p1, k) >= F(n - 1, p2, k) for p1 <= p2 in the particular case of binomial distributions. It follows that an upper bound on the probability of observing the same value c or more times among n - 1 consecutive samples is given by 1 - F(n - 1, p, c - 1). In conclusion, the probability of false positives is <= p * (1 - F(n - 1, p, c - 1)) for the Adaptive Proportion test. NIST SP800-90B recommends to set n to either 512 or 1024 and to choose a cut-off value c such that the probability of false positives is <= 2^-20. However, assuming an estimated per-IRQ entropy of 1 bit, it would take 1024/128 == 8 minimum crng seed sizes worth of entropy before the APT eventually completes and the accumulated entropy may get released to the global reserve. Thus, it is desirable to set n such that the APT will complete within 128 bits worth of entropy, i.e. to n = 128 / H. However, for such relatively small values of n, an upper bound as small as 2^-20 for the false positives probability would make the test's statistical power, i.e. the capability to detect degraded noise sources, plummet to uselessness. 
Note that add_interrupt_randomness() will continue to unconditionally mix all events into the fast_pools, independent of the APT's outcome. Thus, allowing for a higher probability of false positives cannot change the output distribution, but only potentially affect the entropy accounting. Choose an upper bound of 2^-16 for the probability of false positives. The resulting cut-off values for the different supported values of per-IRQ entropy estimates are tabulated below. The "power" column lists the probabilities (again for i.i.d.) that the APT would report a failure in case the actual entropy has degraded to one half of the assumed estimate.

  H     n     c     power
  1     128   87    52.5%
  1/2   256   210   67.5%
  1/4   512   463   76.7%
  1/8   1024  973   82.8%
  1/16  2048  1997  82.6%
  1/32  4096  4044  85.8%
  1/64  8192  8140  85.8%

Add a couple of new fields for storing the required APT state to struct health_test:
- ->apt_event_count: total number of samples processed by the currently pending APT,
- ->apt_candidate: the sample value whose number of occurrences the currently pending APT is counting,
- ->apt_candidate_count: the number of occurrences of ->apt_candidate the currently pending APT has encountered so far.
Implement the APT logic and wrap it in a new function, health_test_apt(). Invoke it from health_test_process(). 
Signed-off-by: Nicolai Stange --- drivers/char/random.c | 56 +-- 1 file changed, 54 insertions(+), 2 deletions(-) diff --git a/drivers/char/random.c b/drivers/char/random.c index 33f9b7b59f92..131302cbc495 100644 --- a/drivers/char/random.c +++ b/drivers/char/random.c @@ -880,6 +880,10 @@ static void discard_queued_entropy(struct entropy_store *r, } struct health_test { + unsigned short apt_event_count; + unsigned short apt_candidate_count; + u8 apt_candidate; + u8 previous_sample; }; @@ -890,8 +894,56 @@ enum health_result { health_discard, }; +/* Adaptive Proportion Test */ +static void health_apt_reset(struct health_test *h) +{ + h->apt_event_count = 0; +} + +static enum health_result +health_test_apt(struct health_test *h, unsigned int event_entropy_shift, + u8 sample_delta) +{ + unsigned int n = 128 << event_entropy_shift; + /* +* Observing some particular sample value more often than +* these thresholds, specified for the different possible +* values of event_entropy_shift each, should have probability +* <= 2^-16. +*/ + static const unsigned int c[] = {87, 210, 463, 973, 1997, 4044, 8140}; + + if (!h->apt_event_count) { + h->apt_event_count = 1; + h->apt_candidate = sample_delta; + h->apt_candidate_count = 0; + return health_queue; + } + + ++h->apt_event_count; + if (unlikely(h->apt_ca
[RFC PATCH 36/41] random: optimize the APT's presearch
The Adaptive Proportion Test's presearch phase is supposed to determine the sample value occurring more than half of the time among the first n1=7 events in a sequence, if there is any such one. To this end, it maintains eight counters at struct health_test for counting the numbers of ones observed at each of the eight resp. bit positions within the sample values' octets. The idea is that if any sample value had been encountered more than half of the time, it would dominate all these counters and its value could be restored unambiguously from them. For better reviewability, this had been implemented in the most straightforward way:
- the counters had been represented as an array of eight u8s at struct health_test and
- both the counter updating as well as the candidate extraction code had been implemented by means of a loop over the eight bit positions.
As this is all accessed from add_interrupt_randomness(), optimizations won't harm. In particular, using a total of eight bytes for the bit counters is wasteful and can be reduced to half of that, optimizing the size of struct health_test, which in turn is stored as part of the per-CPU struct fast_pool. For counts up to n1=7, 3 bits would suffice. Rather than using an array of eight u8s, store the bit counters within a u32's eight four-bit nibbles. Even though it probably won't matter on average in practice, because the APT presearch is run only for a small fraction of all IRQ events, reduce the number of instructions issued from the bit counter maintenance and candidate extraction code as well. If nothing else, this can potentially reduce the maximum IRQ latency by a few cycles. Namely, make the bit counter updating code in health_apt_presearch_update() spread the eight bits from the input sample evenly across a u32. That is, the bit at position i will end up at position 4*i. This can be achieved within five binops and four shifts. 
This intermediate value can then get subsequently added in a single operation to the "packed" bit counters kept in struct health_test in order to conclude the operation. As for the final candidate extraction in health_apt_presearch_finalize(), remember that the i'th counter needs to get compared against n1 / 2 = 7 / 2 in order to restore the i'th bit from the resulting candidate value. The condition that one such bit counter is >= 4 is equivalent to testing its bit at the 2nd position, counted from zero. Thus, (->apt_presearch_bit_counters & 0x) >> 2 will yield a value where the LSB from the i'th nibble, equals the i'th bit from the result and everything else is unset. The final result can then be obtained by "shrinking" this intermediate representation back into an u8. In total, the candidate extraction can be achieved within a sequence of seven binops and six shifts. Signed-off-by: Nicolai Stange --- drivers/char/random.c | 71 --- 1 file changed, 53 insertions(+), 18 deletions(-) diff --git a/drivers/char/random.c b/drivers/char/random.c index 75a103f24fea..2c744d2a9b26 100644 --- a/drivers/char/random.c +++ b/drivers/char/random.c @@ -882,7 +882,7 @@ static void discard_queued_entropy(struct entropy_store *r, struct health_test { unsigned short apt_event_count; union { - u8 apt_presearch_bit_counters[8]; + u32 apt_presearch_bit_counters; struct { unsigned short apt_candidate_count; u8 apt_candidate; @@ -904,40 +904,75 @@ enum health_result { static void health_apt_presearch_update(struct health_test *h, u8 sample_delta) { - int i; + u32 encoded; - for (i = 0; i < 8; ++i) { - h->apt_presearch_bit_counters[i] = sample_delta & 0x1; - sample_delta >>= 1; - } + /* +* Encode the sample octet by "widening" it into 8 +* nibbles. That is, bit i from the source will be assigned to +* bit 4*i in the result. All other bits are set to zero. 
+*/ + encoded = sample_delta; + encoded = (encoded & 0xf) | ((encoded >> 4) << 16); + encoded |= (encoded << 6); + encoded |= (encoded << 3); + encoded &= 0x; + + /* +* The nibbles from ->apt_presearch_bit_counters, each +* counting the number of occurences of 1s at the +* corresponding bit position, don't overflow into each other. +*/ + BUILD_BUG_ON(ilog2(HEALTH_APT_PRESEARCH_EVENT_COUNT) >= 4); + h->apt_presearch_bit_counters += encoded; } static void health_apt_presearch_finalize(struct health_test *h) { - int i; + u32 majority, decoded; /* * If some event octet occurred more than half of the time, * i.e. more than HEALTH_APT_PRESEARCH_EVENT_COUNT / 2 times, * then its value can be res
[RFC PATCH 38/41] random: enable NIST SP800-90B startup tests
NIST SP800-90B, section 4.3 requires an entropy source to inhibit output until the so-called "startup" tests have completed. These "startup" tests shall process at least 1024 consecutive samples by means of the continuous health tests, i.e. the already implemented Repetition Count Test (RCT) and Adaptive Proportion Test (APT). Introduce a new field ->warmup to struct health_test. Initialize it to 1024 from health_test_reset(). Make health_test_process() decrement ->warmup once per event processed without test failure, but reset ->warmup to the initial value upon failure. Prevent health_test_process() from returning health_dispatch as long as ->warmup hasn't dropped to zero. This will cause the caller, i.e. add_interrupt_randomness(), to not dispatch any entropy to the global balance until the startup tests have finished. Note that this change will delay the initial seeding of the primary_crng, especially for those values of the estimated per-IRQ min-entropy H where the minimum of 1024 events from above is by several factors larger than 128/H, the number of events to be processed by a single APT run. That would only affect systems running with fips_enabled though and there's simply no way to avoid it without violating the specs. 
Signed-off-by: Nicolai Stange --- drivers/char/random.c | 25 +++-- 1 file changed, 23 insertions(+), 2 deletions(-) diff --git a/drivers/char/random.c b/drivers/char/random.c index 54ee082ca4a8..bd8c24e433d0 100644 --- a/drivers/char/random.c +++ b/drivers/char/random.c @@ -880,6 +880,7 @@ static void discard_queued_entropy(struct entropy_store *r, } struct health_test { + unsigned short warmup; unsigned short apt_event_count; union { u32 apt_presearch_bit_counters; @@ -1059,6 +1060,13 @@ health_test_apt(struct health_test *h, unsigned int event_entropy_shift, static void health_test_reset(struct health_test *h) { + /* +* Don't dispatch until at least 1024 events have been +* processed by the continuous health tests as required by +* NIST SP800-90B for the startup tests. +*/ + h->warmup = 1024; + health_apt_reset(h); } @@ -1067,7 +1075,7 @@ health_test_process(struct health_test *h, unsigned int event_entropy_shift, u8 sample) { u8 sample_delta; - enum health_result rct; + enum health_result rct, apt; /* * The min-entropy estimate has been made for the lower eight @@ -1083,6 +1091,8 @@ health_test_process(struct health_test *h, unsigned int event_entropy_shift, * Something is really off, get_cycles() has become * (or always been) a constant. */ + if (h->warmup) + health_test_reset(h); return health_discard; } @@ -1091,7 +1101,18 @@ health_test_process(struct health_test *h, unsigned int event_entropy_shift, * don't care about whether the RCT needs to consume more * samples to complete. */ - return health_test_apt(h, event_entropy_shift, sample_delta); + apt = health_test_apt(h, event_entropy_shift, sample_delta); + if (unlikely(h->warmup) && --h->warmup) { + if (apt == health_discard) + health_test_reset(h); + /* +* Don't allow the caller to dispatch until warmup +* has completed. +*/ + return apt == health_dispatch ? health_queue : apt; + } + + return apt; } struct fast_pool { -- 2.26.2
[RFC PATCH 15/41] random: convert add_hwgenerator_randomness() to queued_entropy API
In an effort to drop credit_entropy_bits() in favor of the new queue_entropy()/dispatch_queued_entropy() API, convert add_hwgenerator_randomness() from the former to the latter. As a side effect, the pool entropy watermark as tracked over the duration of the mix_pool_bytes() operation is now correctly taken into account when calculating the amount of new entropy to dispatch to the pool based on the latter's fill level. Signed-off-by: Nicolai Stange --- drivers/char/random.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/drivers/char/random.c b/drivers/char/random.c index 60ce185d7b2d..78e65367ea86 100644 --- a/drivers/char/random.c +++ b/drivers/char/random.c @@ -2640,6 +2640,7 @@ void add_hwgenerator_randomness(const char *buffer, size_t count, size_t entropy) { struct entropy_store *poolp = &input_pool; + struct queued_entropy q = { 0 }; if (unlikely(crng_init == 0)) { crng_fast_load(buffer, count); @@ -2652,8 +2653,9 @@ void add_hwgenerator_randomness(const char *buffer, size_t count, */ wait_event_interruptible(random_write_wait, kthread_should_stop() || ENTROPY_BITS(&input_pool) <= random_write_wakeup_bits); + queue_entropy(poolp, &q, entropy << ENTROPY_SHIFT); mix_pool_bytes(poolp, buffer, count); - credit_entropy_bits(poolp, entropy); + dispatch_queued_entropy(poolp, &q); } EXPORT_SYMBOL_GPL(add_hwgenerator_randomness); -- 2.26.2
[RFC PATCH 16/41] random: convert random_ioctl() to queued_entropy API
In an effort to drop credit_entropy_bits_safe() in favor of the new queue_entropy()/dispatch_queued_entropy() API, convert random_ioctl() from the former to the latter. Implement two helpers:
- queue_entropy_bits_safe(), which checks the entropy passed from userspace for extreme values in analogy to what credit_entropy_bits_safe() did,
- discard_queued_entropy(), which is invoked from random_ioctl() to discard the entropy queued prior to the write_pool() call in case the latter fails.
Use them to convert the two call sites of credit_entropy_bits_safe() in random_ioctl() to the new API. As a side effect, the pool entropy watermark as tracked over the duration of the write_pool() operation is now correctly taken into account when calculating the amount of new entropy to dispatch to the pool based on the latter's fill level. Signed-off-by: Nicolai Stange --- drivers/char/random.c | 57 --- 1 file changed, 53 insertions(+), 4 deletions(-) diff --git a/drivers/char/random.c b/drivers/char/random.c index 78e65367ea86..03eadefabbca 100644 --- a/drivers/char/random.c +++ b/drivers/char/random.c @@ -737,7 +737,9 @@ struct queued_entropy { * dispatch. However, any such sequence of invocations must eventually * be followed by exactly one call to either of __dequeue_entropy(), * __dispatch_queued_entropy_fast() or dispatch_queued_entropy() - * when the actual pool mixing has completed. + * when the actual pool mixing has completed. Alternatively, + * discard_queued_entropy() may be called in case the mixing has + * failed. * __queue_entropy() must be called with r->lock held. * * Entropy extraction is a two-step process: @@ -813,6 +815,26 @@ static void queue_entropy(struct entropy_store *r, struct queued_entropy *q, spin_unlock_irqrestore(&r->lock, flags); } +/* + * Queue entropy which comes from userspace and might take extreme + * values. 
+ */ +static int queue_entropy_bits_safe(struct entropy_store *r, + struct queued_entropy *q, + int nbits) +{ + const int nbits_max = r->poolinfo->poolwords * 32; + + if (nbits < 0) + return -EINVAL; + + /* Cap the value to avoid overflows */ + nbits = min(nbits, nbits_max); + + queue_entropy(r, q, nbits << ENTROPY_SHIFT); + return 0; +} + /* * Dequeue previously queued entropy and return the pool entropy * watermark to be used in pool_entropy_delta(). @@ -950,6 +972,22 @@ static void dispatch_queued_entropy(struct entropy_store *r, } } +/* + * Discard queued entropy. May be called when e.g. a write_pool() + * operation failed and the corresponding previously queued entropy + * should not get dispatched to the pool. + */ +static void discard_queued_entropy(struct entropy_store *r, + struct queued_entropy *q) +{ + unsigned long flags; + int pool_watermark; + + spin_lock_irqsave(&r->lock, flags); + __dequeue_entropy(r, q, &pool_watermark); + spin_unlock_irqrestore(&r->lock, flags); +} + /* * Credit the entropy store with n bits of entropy. 
* Use credit_entropy_bits_safe() if the value comes from userspace @@ -2272,6 +2310,7 @@ static long random_ioctl(struct file *f, unsigned int cmd, unsigned long arg) int size, ent_count; int __user *p = (int __user *)arg; int retval; + struct queued_entropy q = { 0 }; switch (cmd) { case RNDGETENTCNT: @@ -2285,7 +2324,11 @@ static long random_ioctl(struct file *f, unsigned int cmd, unsigned long arg) return -EPERM; if (get_user(ent_count, p)) return -EFAULT; - return credit_entropy_bits_safe(&input_pool, ent_count); + retval = queue_entropy_bits_safe(&input_pool, &q, ent_count); + if (retval < 0) + return retval; + dispatch_queued_entropy(&input_pool, &q); + return 0; case RNDADDENTROPY: if (!capable(CAP_SYS_ADMIN)) return -EPERM; if (get_user(ent_count, p++)) return -EFAULT; if (ent_count < 0) return -EINVAL; if (get_user(size, p++)) return -EFAULT; + retval = queue_entropy_bits_safe(&input_pool, &q, ent_count); + if (retval < 0) + return retval; retval = write_pool(&input_pool, (const char __user *)p, size); - if (retval < 0) + if (retval < 0) { + discard_queued_entropy(&input_pool, &q); return retval; - return credit_entropy_bits_safe(&input_pool, ent_count); + } + dispatch_queued_entropy(&input_pool, &q); + return 0;
[RFC PATCH 18/41] random: move arch_get_random_seed() calls in crng_reseed() into own loop
x86's RDSEED/RDRAND insns have reportedly been slowed down significantly due to the ucode update required to mitigate against the "Special Register Buffer Data Sampling" vulnerability (CVE-2020-0543) and should not get invoked from the interrupt path anymore. In preparation of getting rid of that arch_get_random_long() call currently found in add_interrupt_randomness(), move those arch_get_random_long() calls in crng_reseed() into a separate loop and outside of the crng->lock. There is no functional change. Signed-off-by: Nicolai Stange --- drivers/char/random.c | 8 ++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/drivers/char/random.c b/drivers/char/random.c index a49805d0d23c..1945249597e0 100644 --- a/drivers/char/random.c +++ b/drivers/char/random.c @@ -1200,14 +1200,18 @@ static void crng_reseed(struct crng_state *crng, struct entropy_store *r) _crng_backtrack_protect(_crng, buf.block, CHACHA_KEY_SIZE); } - spin_lock_irqsave(>lock, flags); + for (i = 0; i < 8; i++) { unsigned long rv; if (!arch_get_random_seed_long() && !arch_get_random_long()) rv = random_get_entropy(); - crng->state[i+4] ^= buf.key[i] ^ rv; + buf.key[i] ^= rv; } + + spin_lock_irqsave(>lock, flags); + for (i = 0; i < 8; i++) + crng->state[i+4] ^= buf.key[i]; memzero_explicit(, sizeof(buf)); crng->init_time = jiffies; spin_unlock_irqrestore(>lock, flags); -- 2.26.2
[RFC PATCH 22/41] random: introduce arch_has_sp800_90b_random_seed()
NIST SP800-90C allows for the combination of multiple SP800-90B entropy sources by concatenating their individual outputs together, c.f. sec. 5.3.4 and also the constructions in 10.3.1 from the second draft. We're already doing exactly that when reseeding the primary CRNG from the input pool and possibly from arch_get_random_seed_long() + arch_get_random_long(). The input pool will be moved gradually towards SP800-90B compliance with future patches. Provide a means for the random driver to check whether arch_get_random_seed_long() is also a full entropy source conforming to NIST SP800-90B. Note that I couldn't find any explicit statement in the specs that would allow for using NRBGs as defined by SP800-90C as a drop-in replacement for "entropy sources" in the sense of SP800-90B. In particular there is no statement that NRBGs may be combined with other SP800-90B entropy sources. However, NRBGs always provide stronger guarantees in that they provide certain protection against silent failure of backing entropy sources. Thus, I think it would be perfectly acceptable to combine SP800-90C NRBGs, i.e. ARMv8's RNDRRS or x86's RDSEED with other entropy sources. Introduce arch_has_sp800_90b_random_seed(). - Make the generic stub return false. - Make the arm64 variant return false as well: the current arch_get_random_seed_long() is based on RNDR, not RNDRRS. - Make it return false on powerpc and s390, too. - Let arch_has_sp800_90b_random_seed() return true on x86 if the CPU has RDSEED support. Yes, I know, one change per patch, but this is part of a RFC series. 
Signed-off-by: Nicolai Stange --- arch/arm64/include/asm/archrandom.h | 10 +- arch/powerpc/include/asm/archrandom.h | 5 + arch/s390/include/asm/archrandom.h| 5 + arch/x86/include/asm/archrandom.h | 8 include/linux/random.h| 9 + 5 files changed, 36 insertions(+), 1 deletion(-) diff --git a/arch/arm64/include/asm/archrandom.h b/arch/arm64/include/asm/archrandom.h index 055d18713db7..db7813c79b3e 100644 --- a/arch/arm64/include/asm/archrandom.h +++ b/arch/arm64/include/asm/archrandom.h @@ -42,7 +42,15 @@ static inline bool arch_has_random_seed(void) */ return cpus_have_const_cap(ARM64_HAS_RNG); } - +static inine bool arch_has_sp800_90b_random_seed(void) +{ + /* +* Section K.12.1 from the Arm Architecture Reference Manual +* Armv8" (DDI0487F) sounds like RNDRRS could qualify as a +* NIST SP800-90C NRBG, but we're currently using RNDR only. +*/ + return false; +} static inline bool __must_check arch_get_random_long(unsigned long *v) { return false; diff --git a/arch/powerpc/include/asm/archrandom.h b/arch/powerpc/include/asm/archrandom.h index 47c2d74e7244..ba0f816d9750 100644 --- a/arch/powerpc/include/asm/archrandom.h +++ b/arch/powerpc/include/asm/archrandom.h @@ -16,6 +16,11 @@ static inline bool arch_has_random_seed(void) return ppc_md.get_random_seed; } +static inline bool arch_has_sp800_90b_random_seed(void) +{ + return false; +} + static inline bool __must_check arch_get_random_long(unsigned long *v) { return false; diff --git a/arch/s390/include/asm/archrandom.h b/arch/s390/include/asm/archrandom.h index 18973845634c..1ee7f9e4b255 100644 --- a/arch/s390/include/asm/archrandom.h +++ b/arch/s390/include/asm/archrandom.h @@ -31,6 +31,11 @@ static inline bool arch_has_random_seed(void) return static_branch_likely(_arch_random_available); } +static inline bool arch_has_sp800_90b_random_seed(void) +{ + return false; +} + static inline bool __must_check arch_get_random_long(unsigned long *v) { return false; diff --git a/arch/x86/include/asm/archrandom.h 
b/arch/x86/include/asm/archrandom.h index 030f46c9e310..94d4ee8c9e45 100644 --- a/arch/x86/include/asm/archrandom.h +++ b/arch/x86/include/asm/archrandom.h @@ -80,6 +80,14 @@ static inline bool arch_has_random_seed(void) return static_cpu_has(X86_FEATURE_RDSEED); } +static inline bool arch_has_sp800_90b_random_seed(void) +{ + /* +* According to the Intel SDM, rdseed is NIST SP800-90B +* compliant. +*/ + return arch_has_random_seed(); +} static inline bool __must_check arch_get_random_long(unsigned long *v) { return arch_has_random() ? rdrand_long(v) : false; diff --git a/include/linux/random.h b/include/linux/random.h index d4653422a0c7..933f5daa4a1c 100644 --- a/include/linux/random.h +++ b/include/linux/random.h @@ -128,6 +128,15 @@ static inline bool arch_has_random_seed(void) { return false; } +/* + * Whether or not arch_get_random_seed_long() qualifies as a NIST + * SP800-90B compliant entropy source providing full entropy output. + * NIST SP800-90C NRBG's are probably fine, too. + */ +static inline bool arch_has_sp800_90b_random_seed(void) +{ + return false; +} static inline bool __must_check arch_get_random_long(u
[RFC PATCH 23/41] random: don't award entropy to non-SP800-90B arch RNGs in FIPS mode
It is required by SP800-90C that only SP800-90B compliant entropy sources may be used for seeding DRBGs. Don't award any entropy to arch_get_random_long() if fips_enabled is true. Don't award any entropy to arch_get_random_seed_long() if fips_enabled && !arch_has_sp800_90b_random_seed(). This is achieved by making min_crng_reseed_pool_entropy() return the full minimum seed size if fips_enabled && !arch_has_sp800_90b_random_seed() is true. This prevents crng_reseed() from attempting to make up for any lack of entropy in the input_pool by reading from the architectural RNG. Make crng_reseed() bail out in FIPS mode if the input_pool provides insufficient entropy and any of the arch_get_random_seed_long() invocations fails: there's no statement regarding SP800-90B compliance of arch_get_random_long() and so it can't be used as a backup. Signed-off-by: Nicolai Stange --- drivers/char/random.c | 9 +++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/drivers/char/random.c b/drivers/char/random.c index 7712b4464ef5..aaddee4e4ab1 100644 --- a/drivers/char/random.c +++ b/drivers/char/random.c @@ -1195,9 +1195,13 @@ static int min_crng_reseed_pool_entropy(void) * up to one half of the minimum entropy needed for * reseeding. That way it won't dominate the entropy * collected by other means at input_pool. +* If in FIPS mode, restrict this to SP800-90B compliant +* architectural RNGs. */ - if (arch_has_random() || arch_has_random_seed()) + if (arch_has_sp800_90b_random_seed() || + (!fips_enabled && (arch_has_random() || arch_has_random_seed()))) { return 8; + } return 16; } @@ -1233,7 +1237,8 @@ static void crng_reseed(struct crng_state *crng, struct entropy_store *r) for (i = 0; i < 8; i++) { unsigned long rv; if (!arch_get_random_seed_long(&rv) && - !arch_get_random_long(&rv)) { + ((arch_randomness_required && fips_enabled) || + !arch_get_random_long(&rv))) { if (arch_randomness_required) { /* * The input_pool failed to provide -- 2.26.2
[RFC PATCH 26/41] random: implement support for evaluating larger fast_pool entropies
The fast_pools, mixed into from add_interrupt_randomness(), are 128 bits wide and get awarded at most an entropy value as low as one bit in total. An upcoming patch will significantly increase the estimated entropy per event and this will make the fast_pools receive much larger values of successively mixed in event entropy. In analogy to the reasoning in commit 30e37ec516ae ("random: account for entropy loss due to overwrites"), probabilistic collisions will have to get accounted for when calculating the total fast_pool entropy. From said commit, the final fast_pool entropy equals

  e = pool_size * (1 - exp(-num_events * entropy_per_event / pool_size)),

where pool_size is 128 bits in this case and num_events * entropy_per_event equals the sum of all estimated entropy from the IRQ events previously mixed in. Disclaimer: I'm cargo-culting here. That probabilistic overwrites are in fact an issue sounds plausible after having read e.g. "Balls into Bins" by M. Raab and A. Steger. But I haven't managed to derive this equation by myself, nor have I found it in any literature. Anyway, implement the new fast_pool_entropy() for evaluating the equation given above by means of a suitable approximation. add_interrupt_randomness() empties its fast_pool into the global input_pool whenever the number of accumulated events has reached a threshold of 64 and the input_pool's ->lock is uncontended. Thus, the number of IRQ events accumulated at the fast_pool can be assumed to be unlikely to exceed larger factors of 64. The maximum estimate supported for per-IRQ entropy will be 1 bit and thus, this sets an upper bound on the range where the approximation is supposed to work well. At the same time, estimates for the per-IRQ entropy as low as 1/64 bits should be supported and the approximation should not be too coarse in these lower regions in order to avoid excessive loss when entropy is likely a scarce resource anyway. 
Apply a piecewise linear approximation to the fast_pool entropy, with the lengths of the resp. intervals getting doubled with increasing input values. That is, let the first interval cover 32 bits worth of input entropy, the next one 64 bits and stop after a final one of length 128 bits. Any input entropy beyond 32 + 64 + 128 bits gets discarded in order to limit the computations done from interrupt context, but as outlined above, this is unlikely to matter in practice. The shorter intervals for the regions of smaller values will improve the accuracy of the approximation in these areas, i.e. for small estimates for the per-IRQ entropy. Note that the new fast_pool_entropy() is not being used anywhere yet, it will be wired up in an upcoming commit. Signed-off-by: Nicolai Stange --- drivers/char/random.c | 52 +++ 1 file changed, 52 insertions(+) diff --git a/drivers/char/random.c b/drivers/char/random.c index a985ceb22c7c..ac36c56dd135 100644 --- a/drivers/char/random.c +++ b/drivers/char/random.c @@ -1547,6 +1547,58 @@ static __u32 get_reg(struct fast_pool *f, struct pt_regs *regs) return *ptr; } +/* + * Calculate the entropy of a fast_pool after num_events IRQ events of + * assessed entropy 2^-event_entropy_shift each have been mixed in. + */ +static unsigned int fast_pool_entropy(unsigned int num_events, + int event_entropy_shift) +{ + unsigned int result, cur, interval_len; + + /* +* Assume that each event fed into the fast_pool +* carries h = 2^-event_entropy_shift bits of entropy. +* In analogy to how entropy deltas are calculated +* in pool_entropy_delta() for the struct entropy_store +* input_pool, a fast_pool which received num_events +* of total entropy num_events * h will contain +* p * (1 - exp(-num_events * h / p) +* bits of entropy, where p equals the poolsize of 128 bits. +* +* Note that most of the time num_events will be ~64, c.f. +* add_interrupt_randomness. 
Approximate the resulting + * fast_pool entropy in a piecewise linear manner from below: + * from 0 to 32, from 32 to 96 and from 96 to 224. + * Event entropy above 224 bits gets simply discarded. For p = 128, + * the calculated fast_pool entropy is ~226 at + * num_events * h == 32, ~540 at 96 and ~846 at 224, all given + * in units of 2^-ENTROPY_SHIFT. + */ + BUILD_BUG_ON(sizeof(((struct fast_pool *)NULL)->pool) != 16); + BUILD_BUG_ON(ENTROPY_SHIFT != 3); + + /* Interval from 0 to 32. */ + interval_len = 32 << event_entropy_shift; + cur = min_t(unsigned int, num_events, interval_len); + result = (226 * cur) >> 5; /* shift is for /32 */ + num_events -= cur; + + /* Interval of length 64 from 32 to 96. */ + interval_len <<= 1; + cur = min_t(unsigned int, num_events, interval_len); +
[RFC PATCH 25/41] random: probe cycle counter resolution at initialization
An upcoming patch will change the entropy estimate per add_interrupt_randomness() event for fips_enabled based on whether random_get_entropy() resp. get_cycles() is able to capture individual instructions. For example, x86's TSC would qualify, whereas I've seen cycle counters on e.g. a Raspberry Pi 2B with an advertised resolution of only 52ns even though the CPU had been clocked at 1GHz. And then there's possibly hardware which doesn't have a cycle counter at all and where get_cycles() would always return the same constant.

Make rand_initialize() probe the cycle counter resolution.

Introduce a new static_key have_highres_cycle_ctr, indicating whether or not the system's cycle counter is able to capture individual instructions. Initially it's set to true.

Introduce probe_cycle_ctr_resolution() and call it from rand_initialize(). Make probe_cycle_ctr_resolution() compare 16 successive random_get_entropy() values and disable have_highres_cycle_ctr in case the same value has been read twice in a row.

As have_highres_cycle_ctr will only be accessed if fips_enabled is true, make probe_cycle_ctr_resolution() return early in case fips_enabled is not set.
Signed-off-by: Nicolai Stange --- drivers/char/random.c | 29 + 1 file changed, 29 insertions(+) diff --git a/drivers/char/random.c b/drivers/char/random.c index aaddee4e4ab1..a985ceb22c7c 100644 --- a/drivers/char/random.c +++ b/drivers/char/random.c @@ -335,6 +335,7 @@ #include #include #include +#include #include #include @@ -478,6 +479,8 @@ static struct ratelimit_state urandom_warning = static int ratelimit_disable __read_mostly; +static DEFINE_STATIC_KEY_TRUE(have_highres_cycle_ctr); + module_param_named(ratelimit_disable, ratelimit_disable, int, 0644); MODULE_PARM_DESC(ratelimit_disable, "Disable random ratelimit suppression"); @@ -2170,6 +2173,31 @@ static void __init init_std_data(struct entropy_store *r) mix_pool_bytes(r, utsname(), sizeof(*(utsname()))); } +static void probe_cycle_ctr_resolution(void) +{ + cycles_t prev; + int i; + + if (!fips_enabled) + return; + + /* + * Check if the cycle counter has insn granularity (or at + * least close to it). + */ + prev = random_get_entropy(); + for (i = 0; i < 16; ++i) { + cycles_t next; + + next = random_get_entropy(); + if (next == prev) { + static_branch_disable(&have_highres_cycle_ctr); + return; + } + prev = next; + } +} + /* * Note that setup_arch() may call add_device_randomness() * long before we get here. This allows seeding of the pools @@ -2182,6 +2210,7 @@ static void __init init_std_data(struct entropy_store *r) */ int __init rand_initialize(void) { + probe_cycle_ctr_resolution(); init_std_data(&input_pool); crng_initialize_primary(&primary_crng); crng_global_init_time = jiffies; -- 2.26.2
[RFC PATCH 20/41] random: provide min_crng_reseed_pool_entropy()
Currently, the minimum entropy required from the input_pool for reseeding the primary_crng is 16 bytes == 128 bits. A future patch will introduce support for obtaining up to a certain fraction thereof from the architecture's RNG, if available. This will effectively lower the minimum input_pool ->entropy_count required for a successful reseed of the primary_crng.

As this value is used at a couple of places, namely crng_reseed() itself as well as dispatch_queued_entropy() and __dispatch_queued_entropy_fast(), introduce min_crng_reseed_pool_entropy() to ensure consistency among these. min_crng_reseed_pool_entropy() returns the minimum amount of entropy in bytes required from the input_pool for a successful reseed of the primary_crng. Currently it's hardcoded to 16. Use it in place of the hardcoded constants in crng_reseed(), dispatch_queued_entropy() and __dispatch_queued_entropy_fast().

Signed-off-by: Nicolai Stange --- drivers/char/random.c | 18 +++--- 1 file changed, 15 insertions(+), 3 deletions(-) diff --git a/drivers/char/random.c b/drivers/char/random.c index 1945249597e0..424de1565927 100644 --- a/drivers/char/random.c +++ b/drivers/char/random.c @@ -516,6 +516,8 @@ static ssize_t extract_entropy(struct entropy_store *r, void *buf, static ssize_t _extract_entropy(struct entropy_store *r, void *buf, size_t nbytes, int fips); +static int min_crng_reseed_pool_entropy(void); + static void crng_reseed(struct crng_state *crng, struct entropy_store *r); static __u32 input_pool_data[INPUT_POOL_WORDS] __latent_entropy; @@ -916,7 +918,7 @@ static bool __dispatch_queued_entropy_fast(struct entropy_store *r, if (unlikely(r == &input_pool && crng_init < 2)) { const int entropy_bits = entropy_count >> ENTROPY_SHIFT; - return (entropy_bits >= 128); + return (entropy_bits >= min_crng_reseed_pool_entropy() * 8); } return false; @@ -965,7 +967,7 @@ static void dispatch_queued_entropy(struct entropy_store *r, if (crng_init < 2) { const int entropy_bits = entropy_count >>
ENTROPY_SHIFT; - if (entropy_bits < 128) + if (entropy_bits < min_crng_reseed_pool_entropy() * 8) return; crng_reseed(&primary_crng, r); } @@ -1182,6 +1184,15 @@ static int crng_slow_load(const char *cp, size_t len) return 1; } +/* + * Minimum amount of entropy in bytes required from the input_pool for + * a successful reseed of the primary_crng. + */ +static int min_crng_reseed_pool_entropy(void) +{ + return 16; +} + static void crng_reseed(struct crng_state *crng, struct entropy_store *r) { unsigned long flags; @@ -1192,7 +1203,8 @@ static void crng_reseed(struct crng_state *crng, struct entropy_store *r) } buf; if (r) { - num = extract_entropy(r, &buf, 32, 16); + num = extract_entropy(r, &buf, 32, + min_crng_reseed_pool_entropy()); if (num == 0) return; } else { -- 2.26.2
[RFC PATCH 17/41] random: drop credit_entropy_bits() and credit_entropy_bits_safe()
All former call sites of credit_entropy_bits() and credit_entropy_bits_safe() respectively have been converted to the new dispatch_queued_entropy() API. Drop the now unused functions. Signed-off-by: Nicolai Stange --- drivers/char/random.c | 29 + 1 file changed, 1 insertion(+), 28 deletions(-) diff --git a/drivers/char/random.c b/drivers/char/random.c index 03eadefabbca..a49805d0d23c 100644 --- a/drivers/char/random.c +++ b/drivers/char/random.c @@ -533,7 +533,7 @@ static __u32 const twist_table[8] = { /* * This function adds bytes into the entropy "pool". It does not * update the entropy estimate. The caller should call - * credit_entropy_bits if this is appropriate. + * queue_entropy()+dispatch_queued_entropy() if this is appropriate. * * The pool is stirred with a primitive polynomial of the appropriate * degree, and then twisted. We twist by three bits at a time because @@ -988,33 +988,6 @@ static void discard_queued_entropy(struct entropy_store *r, spin_unlock_irqrestore(&r->lock, flags); } -/* - * Credit the entropy store with n bits of entropy. - * Use credit_entropy_bits_safe() if the value comes from userspace - * or otherwise should be checked for extreme values. - */ -static void credit_entropy_bits(struct entropy_store *r, int nbits) -{ - struct queued_entropy q = { 0 }; - - queue_entropy(r, &q, nbits << ENTROPY_SHIFT); - dispatch_queued_entropy(r, &q); -} - -static int credit_entropy_bits_safe(struct entropy_store *r, int nbits) -{ - const int nbits_max = r->poolinfo->poolwords * 32; - - if (nbits < 0) - return -EINVAL; - - /* Cap the value to avoid overflows */ - nbits = min(nbits, nbits_max); - - credit_entropy_bits(r, nbits); - return 0; -} - /* * * CRNG using CHACHA20 -- 2.26.2
[RFC PATCH 05/41] random: don't reset entropy to zero on overflow
credit_entropy_bits() adds one or more positive values to the signed entropy_count and checks if the result is negative afterwards. Note that because the initial value of entropy_count is positive, a negative result can happen only on overflow. However, if the final entropy_count is found to have overflowed, a WARN() is emitted and the entropy_store's entropy count reset to zero. Even though this case should never happen, it is better to retain the previously available entropy, as this will facilitate a future change factoring out that approximation of the exponential.

Make credit_entropy_bits() reset entropy_count to the original value rather than zero on overflow.

Signed-off-by: Nicolai Stange --- drivers/char/random.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/char/random.c b/drivers/char/random.c index 35e381be20fe..6adac462aa0d 100644 --- a/drivers/char/random.c +++ b/drivers/char/random.c @@ -706,7 +706,7 @@ static void credit_entropy_bits(struct entropy_store *r, int nbits) if (WARN_ON(entropy_count < 0)) { pr_warn("negative entropy/overflow: pool %s count %d\n", r->name, entropy_count); - entropy_count = 0; + entropy_count = orig; } else if (entropy_count > pool_size) entropy_count = pool_size; if (cmpxchg(&r->entropy_count, orig, entropy_count) != orig) -- 2.26.2
[RFC PATCH 14/41] random: drop __credit_entropy_bits_fast()
All former call sites of __credit_entropy_bits_fast() have been converted to the new __dispatch_queued_entropy_fast() API. Drop the now unused __credit_entropy_bits_fast(). Signed-off-by: Nicolai Stange --- drivers/char/random.c | 14 -- 1 file changed, 14 deletions(-) diff --git a/drivers/char/random.c b/drivers/char/random.c index dfbe49fdbcf1..60ce185d7b2d 100644 --- a/drivers/char/random.c +++ b/drivers/char/random.c @@ -900,20 +900,6 @@ static bool __dispatch_queued_entropy_fast(struct entropy_store *r, return false; } -/* - * Credit the entropy store with n bits of entropy. - * To be used from hot paths when it is either known that nbits is - * smaller than one half of the pool size or losing anything beyond that - * doesn't matter. Must be called with r->lock being held. - */ -static bool __credit_entropy_bits_fast(struct entropy_store *r, int nbits) -{ - struct queued_entropy q = { 0 }; - - __queue_entropy(r, &q, nbits << ENTROPY_SHIFT); - return __dispatch_queued_entropy_fast(r, &q); -} - /* * Credit the pool with previously queued entropy. * -- 2.26.2
[RFC PATCH 11/41] random: convert add_timer_randomness() to queued_entropy API
In an effort to drop __credit_entropy_bits_fast() in favor of the new __queue_entropy()/__dispatch_queued_entropy_fast() API, convert add_timer_randomness() from the former to the latter. There is no change in functionality at this point, because __credit_entropy_bits_fast() has already been reimplemented on top of the new API before. Signed-off-by: Nicolai Stange --- drivers/char/random.c | 6 -- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/drivers/char/random.c b/drivers/char/random.c index b91d1fc08ac5..e8c86abde901 100644 --- a/drivers/char/random.c +++ b/drivers/char/random.c @@ -1400,6 +1400,7 @@ static void add_timer_randomness(struct timer_rand_state *state, unsigned num) long delta, delta2, delta3; bool reseed; unsigned long flags; + struct queued_entropy q = { 0 }; sample.jiffies = jiffies; sample.cycles = random_get_entropy(); @@ -1432,13 +1433,14 @@ static void add_timer_randomness(struct timer_rand_state *state, unsigned num) r = &input_pool; spin_lock_irqsave(&r->lock, flags); - __mix_pool_bytes(r, &sample, sizeof(sample)); /* * delta is now minimum absolute delta. * Round down by 1 bit on general principles, * and limit entropy estimate to 12 bits. */ - reseed = __credit_entropy_bits_fast(r, min_t(int, fls(delta>>1), 11)); + __queue_entropy(r, &q, min_t(int, fls(delta>>1), 11) << ENTROPY_SHIFT); + __mix_pool_bytes(r, &sample, sizeof(sample)); + reseed = __dispatch_queued_entropy_fast(r, &q); spin_unlock_irqrestore(&r->lock, flags); if (reseed) crng_reseed(&primary_crng, r); -- 2.26.2
[RFC PATCH 21/41] random: don't invoke arch_get_random_long() from add_interrupt_randomness()
writing to the target buffer area following the bytes obtained from the input_pool. Thus, in case failing arch_get_random_long()s in combination with arch_randomness_required set became a problem in the future, it would be better to improve the error path and simply return the unused entropy extracted from the input_pool back. Signed-off-by: Nicolai Stange --- drivers/char/random.c | 49 +-- 1 file changed, 29 insertions(+), 20 deletions(-) diff --git a/drivers/char/random.c b/drivers/char/random.c index 424de1565927..7712b4464ef5 100644 --- a/drivers/char/random.c +++ b/drivers/char/random.c @@ -1190,6 +1190,14 @@ static int crng_slow_load(const char *cp, size_t len) */ static int min_crng_reseed_pool_entropy(void) { + /* + * If there's an architecture provided RNG, use it for + * up to one half of the minimum entropy needed for + * reseeding. That way it won't dominate the entropy + * collected by other means at input_pool. + */ + if (arch_has_random() || arch_has_random_seed()) + return 8; return 16; } @@ -1197,6 +1205,7 @@ static void crng_reseed(struct crng_state *crng, struct entropy_store *r) { unsigned long flags; int i, num; + bool arch_randomness_required = false; union { __u8 block[CHACHA_BLOCK_SIZE]; __u32 key[8]; } buf; @@ -1205,8 +1214,16 @@ static void crng_reseed(struct crng_state *crng, struct entropy_store *r) if (r) { num = extract_entropy(r, &buf, 32, min_crng_reseed_pool_entropy()); - if (num == 0) + if (num == 0) { return; + } else if (num < 16) { + /* + * The input_pool did not provide sufficient + * entropy for reseeding and the architecture + * provided RNG will have to make up for it.
+ */ + arch_randomness_required = true; + } } else { _extract_crng(&primary_crng, buf.block); _crng_backtrack_protect(&primary_crng, buf.block, @@ -1216,8 +1233,17 @@ static void crng_reseed(struct crng_state *crng, struct entropy_store *r) for (i = 0; i < 8; i++) { unsigned long rv; if (!arch_get_random_seed_long(&rv) && - !arch_get_random_long(&rv)) + !arch_get_random_long(&rv)) { + if (arch_randomness_required) { + /* + * The input_pool failed to provide + * sufficient entropy and the arch RNG + * could not make up for that either. + */ + return; + } rv = random_get_entropy(); + } buf.key[i] ^= rv; } @@ -1522,8 +1548,6 @@ void add_interrupt_randomness(int irq, int irq_flags) cycles_t cycles = random_get_entropy(); __u32 c_high, j_high; __u64 ip; - unsigned long seed; - int credit = 0; bool reseed; struct queued_entropy q = { 0 }; @@ -1560,26 +1584,11 @@ void add_interrupt_randomness(int irq, int irq_flags) if (!spin_trylock(&r->lock)) return; - /* - * If we have architectural seed generator, produce a seed and - * add it to the pool further below. For the sake of paranoia - * don't let the architectural seed generator dominate the - * input from the interrupt noise. - */ - credit = !!arch_get_random_long(&seed); - fast_pool->last = now; fast_pool->count = 0; /* award one bit for the contents of the fast pool */ - __queue_entropy(r, &q, (credit + 1) << ENTROPY_SHIFT); + __queue_entropy(r, &q, 1 << ENTROPY_SHIFT); __mix_pool_bytes(r, &fast_pool->pool, sizeof(fast_pool->pool)); - if (credit) { - /* - * A seed has been obtained from - * arch_get_random_seed_long() above, mix it in. - */ - __mix_pool_bytes(r, &seed, sizeof(seed)); - } reseed = __dispatch_queued_entropy_fast(r, &q); spin_unlock(&r->lock); if (reseed) crng_reseed(&primary_crng, r); -- 2.26.2
[RFC PATCH 19/41] random: reintroduce arch_has_random() + arch_has_random_seed()
A future patch will introduce support for making up for a certain amount of lacking entropy in crng_reseed() by means of arch_get_random_long() or arch_get_random_seed_long() respectively. However, before even the tiniest bit of precious entropy is withdrawn from the input_pool, it should be checked whether the current arch even has support for these.

Reintroduce arch_has_random() + arch_has_random_seed() and implement them for arm64, powerpc, s390 and x86 as appropriate (yeah, I know this should go in separate commits, but this is part of an RFC series). Note that this more or less reverts commits 647f50d5d9d9 ("linux/random.h: Remove arch_has_random, arch_has_random_seed") cbac004995a0 ("powerpc: Remove arch_has_random, arch_has_random_seed") 5e054c820f59 ("s390: Remove arch_has_random, arch_has_random_seed") 5f2ed7f5b99b ("x86: Remove arch_has_random, arch_has_random_seed")

Signed-off-by: Nicolai Stange --- arch/arm64/include/asm/archrandom.h | 25 ++--- arch/powerpc/include/asm/archrandom.h | 12 +++- arch/s390/include/asm/archrandom.h| 14 -- arch/x86/include/asm/archrandom.h | 18 ++ include/linux/random.h| 8 5 files changed, 63 insertions(+), 14 deletions(-) diff --git a/arch/arm64/include/asm/archrandom.h b/arch/arm64/include/asm/archrandom.h index 44209f6146aa..055d18713db7 100644 --- a/arch/arm64/include/asm/archrandom.h +++ b/arch/arm64/include/asm/archrandom.h @@ -26,17 +26,13 @@ static inline bool __arm64_rndr(unsigned long *v) return ok; } -static inline bool __must_check arch_get_random_long(unsigned long *v) -{ - return false; -} -static inline bool __must_check arch_get_random_int(unsigned int *v) +static inline bool arch_has_random(void) { return false; } -static inline bool __must_check arch_get_random_seed_long(unsigned long *v) +static inline bool arch_has_random_seed(void) { /* * Only support the generic interface after we have detected @@ -44,7 +40,22 @@ static inline bool __must_check arch_get_random_seed_long(unsigned long *v) * cpufeature
code and with potential scheduling between CPUs * with and without the feature. */ - if (!cpus_have_const_cap(ARM64_HAS_RNG)) + return cpus_have_const_cap(ARM64_HAS_RNG); +} + +static inline bool __must_check arch_get_random_long(unsigned long *v) +{ + return false; +} + +static inline bool __must_check arch_get_random_int(unsigned int *v) +{ + return false; +} + +static inline bool __must_check arch_get_random_seed_long(unsigned long *v) +{ + if (!arch_has_random_seed()) return false; return __arm64_rndr(v); diff --git a/arch/powerpc/include/asm/archrandom.h b/arch/powerpc/include/asm/archrandom.h index 9a53e29680f4..47c2d74e7244 100644 --- a/arch/powerpc/include/asm/archrandom.h +++ b/arch/powerpc/include/asm/archrandom.h @@ -6,6 +6,16 @@ #include +static inline bool arch_has_random(void) +{ + return false; +} + +static inline bool arch_has_random_seed(void) +{ + return ppc_md.get_random_seed; +} + static inline bool __must_check arch_get_random_long(unsigned long *v) { return false; @@ -18,7 +28,7 @@ static inline bool __must_check arch_get_random_int(unsigned int *v) static inline bool __must_check arch_get_random_seed_long(unsigned long *v) { - if (ppc_md.get_random_seed) + if (arch_has_random_seed()) return ppc_md.get_random_seed(v); return false; diff --git a/arch/s390/include/asm/archrandom.h b/arch/s390/include/asm/archrandom.h index de61ce562052..18973845634c 100644 --- a/arch/s390/include/asm/archrandom.h +++ b/arch/s390/include/asm/archrandom.h @@ -21,6 +21,16 @@ extern atomic64_t s390_arch_random_counter; bool s390_arch_random_generate(u8 *buf, unsigned int nbytes); +static inline bool arch_has_random(void) +{ + return false; +} + +static inline bool arch_has_random_seed(void) +{ + return static_branch_likely(&s390_arch_random_available); +} + static inline bool __must_check arch_get_random_long(unsigned long *v) { return false; @@ -33,7 +43,7 @@ static inline bool __must_check
arch_get_random_seed_long(unsigned long *v) { - if (static_branch_likely(&s390_arch_random_available)) { + if (arch_has_random_seed()) { return s390_arch_random_generate((u8 *)v, sizeof(*v)); } return false; @@ -41,7 +51,7 @@ static inline bool __must_check arch_get_random_seed_long(unsigned long *v) static inline bool __must_check arch_get_random_seed_int(unsigned int *v) { - if (static_branch_likely(&s390_arch_random_available)) { + if (arch_has_random_seed()) { return s390_arch_random_generate((u8 *)v, sizeof(*v)); }
[RFC PATCH 24/41] init: call time_init() before rand_initialize()
Commit d55535232c3d ("random: move rand_initialize() earlier") moved the rand_initialize() invocation from the early initcalls to right after timekeeping_init(), but before time_init(). However, rand_initialize() would indirectly invoke random_get_entropy(), which is an alias to get_cycles() on most archs, in case an architectural RNG is not available. Problem is that on some archs, e.g. ARM, get_cycles() can only be relied upon when time_init() has completed. Move the invocation of time_init() a couple of lines up in start_kernel() so that it gets called before rand_initialize(). Note that random_get_entropy() data doesn't get any entropy credit and thus, this issue is not to be considered a bug, but more of an inconsistency. Fixes: d55535232c3d ("random: move rand_initialize() earlier") Signed-off-by: Nicolai Stange --- init/main.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/init/main.c b/init/main.c index ae78fb68d231..30892675f48e 100644 --- a/init/main.c +++ b/init/main.c @@ -942,6 +942,7 @@ asmlinkage __visible void __init __no_sanitize_address start_kernel(void) hrtimers_init(); softirq_init(); timekeeping_init(); + time_init(); /* * For best initial stack canary entropy, prepare it after: @@ -956,7 +957,6 @@ asmlinkage __visible void __init __no_sanitize_address start_kernel(void) add_device_randomness(command_line, strlen(command_line)); boot_init_stack_canary(); - time_init(); perf_event_init(); profile_init(); call_function_init(); -- 2.26.2
[RFC PATCH 07/41] random: let pool_entropy_delta() take nbits in units of 2^-ENTROPY_SHIFT
Currently pool_entropy_delta() expects its nbits argument to be given in units of integral bits. Using fractional bits for processing intermediate entropy counts consistently throughout the code will facilitate upcoming changes to the entropy accounting logic in add_interrupt_randomness(). Replace pool_entropy_delta()'s nbits argument with nfrac, which used to be a local variable and is expected to be given in units of 2^-ENTROPY_SHIFT. Adapt the single caller, credit_entropy_bits(), accordingly. Signed-off-by: Nicolai Stange --- drivers/char/random.c | 19 ++- 1 file changed, 10 insertions(+), 9 deletions(-) diff --git a/drivers/char/random.c b/drivers/char/random.c index 15dd22d74029..08caa7a691a5 100644 --- a/drivers/char/random.c +++ b/drivers/char/random.c @@ -655,21 +655,20 @@ static void process_random_ready_list(void) /* * Based on the pool's current entropy fill level, specified as - * base_entropy_count, and the number of new entropy bits to add, - * return the amount of new entropy to credit. If the 'fast' - * parameter is set to true, the calculation will be guaranteed to - * terminate quickly, but this comes at the expense of capping - * nbits to one half of the pool size. + * base_entropy_count, and the number of new entropy bits in units of + * 2^-ENTROPY_SHIFT to add, return the amount of new entropy to + * credit. If the 'fast' parameter is set to true, the calculation + * will be guaranteed to terminate quickly, but this comes at the + * expense of capping nbits to one half of the pool size. 
*/ static unsigned int pool_entropy_delta(struct entropy_store *r, int base_entropy_count, - int nbits, bool fast) + int nfrac, bool fast) { const int pool_size = r->poolinfo->poolfracbits; int entropy_count = base_entropy_count; - int nfrac = nbits << ENTROPY_SHIFT; - if (!nbits) + if (!nfrac) return 0; /* @@ -729,7 +728,9 @@ static void credit_entropy_bits(struct entropy_store *r, int nbits) retry: orig = READ_ONCE(r->entropy_count); - entropy_count = orig + pool_entropy_delta(r, orig, nbits, false); + entropy_count = orig + pool_entropy_delta(r, orig, + nbits << ENTROPY_SHIFT, + false); if (cmpxchg(&r->entropy_count, orig, entropy_count) != orig) goto retry; -- 2.26.2
[RFC PATCH 06/41] random: factor the exponential approximation in credit_entropy_bits() out
In the course of calculating the actual amount of new entropy to credit, credit_entropy_bits() applies a linear approximation to exp(-nbits/pool_size) (neglecting scaling factors in the exponent for the sake of simplicity). In order to limit approximation errors for large nbits, nbits is divided into chunks of maximum value pool_size/2 each and said approximation is applied to these individually in a loop. That loop has a theoretical upper bound of 2*log2(pool_size), which, with the given pool_size of 128 * 32 bits, equals 24.

However, in practice nbits hardly ever exceeds values as large as pool_size/2 == 2048, especially not when called from interrupt context, i.e. from add_interrupt_randomness() and the like. Thus, imposing a limit of one single iteration in these contexts would yield a good guarantee with respect to runtime while not losing any entropy.

In preparation to enabling that, move the approximation code in credit_entropy_bits() into a separate function, pool_entropy_delta(). Based on the initial pool entropy count and the number of new entropy bits to credit, it calculates and returns a (positive) delta to add to the former. In case the 'fast' parameter is set to true, the calculation will be terminated after the first iteration, effectively capping the input nbits to one half of the pool size. There is no functional change; callers with 'fast' set to true will be introduced in a future patch.

Signed-off-by: Nicolai Stange --- drivers/char/random.c | 53 ++- 1 file changed, 37 insertions(+), 16 deletions(-) diff --git a/drivers/char/random.c b/drivers/char/random.c index 6adac462aa0d..15dd22d74029 100644 --- a/drivers/char/random.c +++ b/drivers/char/random.c @@ -366,7 +366,7 @@ * denominated in units of 1/8th bits. * * 2*(ENTROPY_SHIFT + poolbitshift) must <= 31, or the multiply in - * credit_entropy_bits() needs to be 64 bits wide. + * pool_entropy_delta() needs to be 64 bits wide.
*/ #define ENTROPY_SHIFT 3 #define ENTROPY_BITS(r) ((r)->entropy_count >> ENTROPY_SHIFT) @@ -654,22 +654,24 @@ static void process_random_ready_list(void) } /* - * Credit the entropy store with n bits of entropy. - * Use credit_entropy_bits_safe() if the value comes from userspace - * or otherwise should be checked for extreme values. + * Based on the pool's current entropy fill level, specified as + * base_entropy_count, and the number of new entropy bits to add, + * return the amount of new entropy to credit. If the 'fast' + * parameter is set to true, the calculation will be guaranteed to + * terminate quickly, but this comes at the expense of capping + * nbits to one half of the pool size. */ -static void credit_entropy_bits(struct entropy_store *r, int nbits) +static unsigned int pool_entropy_delta(struct entropy_store *r, + int base_entropy_count, + int nbits, bool fast) { - int entropy_count, orig; const int pool_size = r->poolinfo->poolfracbits; + int entropy_count = base_entropy_count; int nfrac = nbits << ENTROPY_SHIFT; - int pnfrac; if (!nbits) - return; + return 0; -retry: - entropy_count = orig = READ_ONCE(r->entropy_count); /* * Credit: we have to account for the possibility of * overwriting already present entropy. Even in the @@ -691,24 +693,43 @@ static void credit_entropy_bits(struct entropy_store *r, int nbits) * arbitrarily long; this limits the loop to log2(pool_size)*2 * turns no matter how large nbits is. 
*/ - pnfrac = nfrac; do { /* The +2 corresponds to the /4 in the denominator */ const int s = r->poolinfo->poolbitshift + ENTROPY_SHIFT + 2; - unsigned int anfrac = min(pnfrac, pool_size/2); + unsigned int anfrac = min(nfrac, pool_size/2); unsigned int add = ((pool_size - entropy_count)*anfrac*3) >> s; entropy_count += add; - pnfrac -= anfrac; - } while (unlikely(entropy_count < pool_size-2 && pnfrac)); + nfrac -= anfrac; + } while (unlikely(!fast && entropy_count < pool_size-2 && nfrac)); if (WARN_ON(entropy_count < 0)) { pr_warn("negative entropy/overflow: pool %s count %d\n", r->name, entropy_count); - entropy_count = orig; - } else if (entropy_count > pool_size) + entropy_count = base_entropy_count; + } else if (entropy_count > pool_size) { entropy_count = pool_size; + } + + return entropy_count - base_entropy_count; +} + +/* + * Credit the entropy store with n bits of entropy. + * Use credit_entropy_bits_safe() if the value co
[RFC PATCH 01/41] random: remove dead code in credit_entropy_bits()
Since commit 90ea1c6436d2 ("random: remove the blocking pool") the local variable has_initialized in credit_entropy_bits() won't get set anymore and the corresponding if-clause became dead code. Remove it as well as the has_initialized variable itself from credit_entropy_bits(). Signed-off-by: Nicolai Stange --- drivers/char/random.c | 7 +-- 1 file changed, 1 insertion(+), 6 deletions(-) diff --git a/drivers/char/random.c b/drivers/char/random.c index d20ba1b104ca..0580968fd28c 100644 --- a/drivers/char/random.c +++ b/drivers/char/random.c @@ -660,7 +660,7 @@ static void process_random_ready_list(void) */ static void credit_entropy_bits(struct entropy_store *r, int nbits) { - int entropy_count, orig, has_initialized = 0; + int entropy_count, orig; const int pool_size = r->poolinfo->poolfracbits; int nfrac = nbits << ENTROPY_SHIFT; @@ -717,11 +717,6 @@ static void credit_entropy_bits(struct entropy_store *r, int nbits) if (cmpxchg(&r->entropy_count, orig, entropy_count) != orig) goto retry; - if (has_initialized) { - r->initialized = 1; - kill_fasync(&fasync, SIGIO, POLL_IN); - } - trace_credit_entropy_bits(r->name, nbits, entropy_count >> ENTROPY_SHIFT, _RET_IP_); -- 2.26.2
[DISCUSSION PATCH 00/41] random: possible ways towards NIST SP800-90B compliance
MP respectively. Apart from those well controlled experiments on an RPi, I also did some lax benchmarking on my x86 desktop (which has some Intel i9, IIRC). More specifically, I simply didn't touch the system and ftraced add_interrupt_randomness() for 15 mins. The number of captured events had been about 2000 in each configuration. Here the add_interrupt_randomness() performance improved greatly: from 4.3 us on average w/o the patches down to 2.0 us with the patches applied and fips_enabled. However, I suppose this gain was due to the removal of RDSEED from add_interrupt_randomness(). Indeed, when inspecting the distribution of add_interrupt_randomness() runtimes on plain v5.9-rc4 more closely, it can be seen that there's a good portion of events (about 1/4th) where add_interrupt_randomness() took about 10us. So I think that this comparison isn't really a fair one...

To the best of my knowledge, these are the remaining open questions/items towards full SP800-90[A-C] compliance:

- There's no (debugfs?) interface for accessing raw samples for validation purposes yet. That would be doable though.

- try_to_generate_entropy() should probably get wired up to the health tests as well. More or less straightforward to implement, too.

- Diverting fast_pool contents into net_rand_state is not allowed (for a related discussion on this topic see [7]).

- I've been told that SP800-90A is not a hard requirement yet, but I suppose it will eventually become one. This would mean that the chacha20 RNG would have to get replaced by something approved for fips_enabled.

- The sequence of fast_pool -> input_pool -> extract_buf() operations is to be considered a "non-vetted conditioning component" in SP800-90B speak. It would follow that the output can't be estimated as having full entropy, but only 0.999 of its length at max. (c.f. sec. 3.1.5.2). This could be resolved by running a SP800-90A derivation function at CRNG reseeding for fips_enabled.
extract_buf(), which is already SHA1 based, could perhaps be transformed into such a derivation function as well.

- The only mention of combining different noise sources I was able to find had been in SP800-90C, sec. 5.3.4 ("Using Multiple Entropy Sources"): it clearly states that the outputs have to be combined by concatenation. add_hwgenerator_randomness() mixes into the same input_pool as add_interrupt_randomness() though and I would expect that this isn't allowed, independent of whether the noise source backing the former is SP800-90B compliant or not. IIUC, Stephan solved this for his LRNG by maintaining a separate pool for the hw generator.

- SP800-90A sets an upper bound on how many bits may be drawn from a DRBG/crng before a reseed *must* take place ("reseed_interval"). In principle that shouldn't matter much in practice, at least not with CONFIG_NUMA: with reseed_interval == 2^32 bits, a single CRNG instance would be allowed to hand out only 500MB worth of randomness before reseeding, but a (single) numa crng chained to the primary_crng may produce as much as 8PB before the latter must eventually get reseeded from the input_pool. But AFAICT, a SP800-90A conforming implementation would still have to provide provisions for a blocking extract_crng().

- It's entirely unclear to me whether support for "prediction resistance requests" is optional. It would be a pity if it weren't, because IIUC that would effectively imply a return to the former blocking_pool behaviour, which is obviously a no-no.
Nicolai Stange (41):
  random: remove dead code in credit_entropy_bits()
  random: remove dead code for nbits < 0 in credit_entropy_bits()
  random: prune dead assignment to entropy_bits in credit_entropy_bits()
  random: drop 'reserved' parameter from extract_entropy()
  random: don't reset entropy to zero on overflow
  random: factor the exponential approximation in credit_entropy_bits() out
  random: let pool_entropy_delta() take nbits in units of 2^-ENTROPY_SHIFT
  random: introduce __credit_entropy_bits_fast() for hot paths
  random: protect ->entropy_count with the pool spinlock
  random: implement support for delayed entropy dispatching
  random: convert add_timer_randomness() to queued_entropy API
  random: convert add_interrupt_randomness() to queued_entropy API
  random: convert try_to_generate_entropy() to queued_entropy API
  random: drop __credit_entropy_bits_fast()
  random: convert add_hwgenerator_randomness() to queued_entropy API
  random: convert random_ioctl() to queued_entropy API
  random: drop credit_entropy_bits() and credit_entropy_bits_safe()
  random: move arch_get_random_seed() calls in crng_reseed() into own loop
  random: reintroduce arch_has_random() + arch_has_random_seed()
  random: provide min_crng_reseed_pool_entropy()
  random: don't invoke arch_get_random_long() from add_interrupt_randomness()
  random: introduce arch_has_sp800_90b_random_seed()
[RFC PATCH 08/41] random: introduce __credit_entropy_bits_fast() for hot paths
When transferring entropy from the fast_pool into the global input_pool from add_interrupt_randomness(), there are at least two atomic operations involved: one when taking the input_pool's spinlock for the actual mixing and another one in the cmpxchg loop in credit_entropy_bits() for updating the pool's ->entropy_count. Because cmpxchg is potentially costly, it would be nice if it could be avoided.

As said, the input_pool's spinlock is taken anyway, and I see no reason why its scope should not be extended to protect ->entropy_count as well. Performance considerations set aside, this will also facilitate future changes introducing additional fields to input_pool which will also have to get updated atomically from the consumer/producer sides.

The actual move to extend the spinlock's scope to cover ->entropy_count will be the subject of a future patch. Prepare for that by putting a limit on the work to be done with the lock being held. In order to avoid releasing and regrabbing it from hot producer paths, they'll keep the lock while executing those calculations in pool_entropy_delta(). The loop found in the latter has a theoretical upper bound of 2 * log2(pool_size) == 24 iterations. However, as all entropy increments awarded from the interrupt path are less than pool_size/2 in magnitude, it is safe to enforce a guaranteed limit of one on the iteration count by setting pool_entropy_delta()'s 'fast' parameter.

Introduce __credit_entropy_bits_fast() doing exactly that. Currently it resembles the behaviour of credit_entropy_bits(), except that
- pool_entropy_delta() gets called with 'fast' set to true and
- __credit_entropy_bits_fast() returns a bool indicating whether the caller should reseed the primary_crng.

Note that, unlike credit_entropy_bits(), __credit_entropy_bits_fast() won't be able to do the reseeding itself anymore once it actually gets invoked with the pool lock held in the future.

There is no functional change.
Signed-off-by: Nicolai Stange
---
 drivers/char/random.c | 49 ++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 46 insertions(+), 3 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 08caa7a691a5..d9e4dd27d45d 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -714,6 +714,39 @@ static unsigned int pool_entropy_delta(struct entropy_store *r,
 	return entropy_count - base_entropy_count;
 }

+/*
+ * Credit the entropy store with n bits of entropy.
+ * To be used from hot paths when it is either known that nbits is
+ * smaller than one half of the pool size or losing anything beyond that
+ * doesn't matter.
+ */
+static bool __credit_entropy_bits_fast(struct entropy_store *r, int nbits)
+{
+	int entropy_count, orig;
+
+	if (!nbits)
+		return false;
+
+retry:
+	orig = READ_ONCE(r->entropy_count);
+	entropy_count = orig + pool_entropy_delta(r, orig,
+						  nbits << ENTROPY_SHIFT,
+						  true);
+	if (cmpxchg(&r->entropy_count, orig, entropy_count) != orig)
+		goto retry;
+
+	trace_credit_entropy_bits(r->name, nbits,
+				  entropy_count >> ENTROPY_SHIFT, _RET_IP_);
+
+	if (unlikely(r == &input_pool && crng_init < 2)) {
+		const int entropy_bits = entropy_count >> ENTROPY_SHIFT;
+
+		return (entropy_bits >= 128);
+	}
+
+	return false;
+}
+
 /*
  * Credit the entropy store with n bits of entropy.
  * Use credit_entropy_bits_safe() if the value comes from userspace
@@ -1169,6 +1202,7 @@ static void add_timer_randomness(struct timer_rand_state *state, unsigned num)
 		unsigned num;
 	} sample;
 	long delta, delta2, delta3;
+	bool reseed;

 	sample.jiffies = jiffies;
 	sample.cycles = random_get_entropy();
@@ -1206,7 +1240,9 @@ static void add_timer_randomness(struct timer_rand_state *state, unsigned num)
 	 * Round down by 1 bit on general principles,
 	 * and limit entropy estimate to 12 bits.
	 */
-	credit_entropy_bits(r, min_t(int, fls(delta>>1), 11));
+	reseed = __credit_entropy_bits_fast(r, min_t(int, fls(delta>>1), 11));
+	if (reseed)
+		crng_reseed(&primary_crng, r);
 }

 void add_input_randomness(unsigned int type, unsigned int code,
@@ -1274,6 +1310,7 @@ void add_interrupt_randomness(int irq, int irq_flags)
 	__u64 ip;
 	unsigned long seed;
 	int credit = 0;
+	bool reseed;

 	if (cycles == 0)
 		cycles = get_reg(fast_pool, regs);
@@ -1326,7 +1363,9 @@ void add_interrupt_randomness(int irq, int irq_flags)
 	fast_pool->count = 0;

 	/* award one bit for the contents of the fast pool */
-	credit_entropy_b
Re: [RFC PATCH 2/2] livepatch: Clear relocation targets on a module removal
Josh Poimboeuf writes:

> On Wed, Aug 14, 2019 at 01:06:09PM +0200, Miroslav Benes wrote:
>> > Really, we should be going in the opposite direction, by creating module
>> > dependencies, like all other kernel modules do, ensuring that a module
>> > is loaded *before* we patch it. That would also eliminate this bug.
>>
>> Yes, but it is not ideal either with cumulative one-fixes-all patch
>> modules. It would load also modules which are not necessary for a
>> customer and I know that at least some customers care about this. They
>> want to deploy only things which are crucial for their systems.

Security concerns set aside, some of the patched modules might get distributed separately from the main kernel through some sort of kernel-*-extra packages and thus not be found on some target system at all. Or they might have been blacklisted.

> If you frame the question as "do you want to destabilize the live
> patching infrastructure" then the answer might be different.
>
> We should look at whether it makes sense to destabilize live patching
> for everybody, for a small minority of people who care about a small
> minority of edge cases.
>
> Or maybe there's some other solution we haven't thought about, which
> fits more in the framework of how kernel modules already work.
>
>> We could split patch modules as you proposed in the past, but that has
>> issues as well.
>
> Right, I'm not really crazy about that solution either.
>
> Here's another idea: per-object patch modules. Patches to vmlinux are
> in a vmlinux patch module. Patches to kvm.ko are in a kvm patch module.
> That would require:
>
> - Careful management of dependencies between object-specific patches.
>   Maybe that just means that exported function ABIs shouldn't change.
>
> - Some kind of hooking into modprobe to ensure the patch module gets
>   loaded with the real one.
>
> - Changing 'atomic replace' to allow patch modules to be per-object.
> Perhaps I'm misunderstanding, but supporting only per-object livepatch modules would make livepatch creation for something like commit 15fab63e1e57 ("fs: prevent page refcount overflow in pipe_buf_get"), CVE-2019-11487 really cumbersome (see the fuse part)? I think I've seen similar interdependencies between e.g. kvm.ko <-> kvm_intel.ko, but can't find an example right now. Thanks, Nicolai -- SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany (HRB 247165, AG München), GF: Felix Imendörffer
Re: [RFC 3/5] livepatch: Allow to distinguish different version of system state changes
Petr Mladek writes:

> ---
>  include/linux/livepatch.h |  2 ++
>  kernel/livepatch/core.c   |  8 ++++++++
>  kernel/livepatch/state.c  | 40 +++++++++++++++++++++++++++++++++++++++-
>  kernel/livepatch/state.h  |  9 +++++++++
>  4 files changed, 58 insertions(+), 1 deletion(-)
>  create mode 100644 kernel/livepatch/state.h
>
> diff --git a/include/linux/livepatch.h b/include/linux/livepatch.h
> index 591abdee30d7..8bc4c6cc3f3f 100644
> --- a/include/linux/livepatch.h
> +++ b/include/linux/livepatch.h
> @@ -135,10 +135,12 @@ struct klp_object {
>  /**
>   * struct klp_state - state of the system modified by the livepatch
>   * @id: system state identifier (non zero)
> + * @version: version of the change (non-zero)
>   * @data: custom data
>   */
>  struct klp_state {
>  	int id;
> +	int version;
>  	void *data;
>  };
>
> diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c
> index 24c4a13bd26c..614642719825 100644
> --- a/kernel/livepatch/core.c
> +++ b/kernel/livepatch/core.c
> @@ -21,6 +21,7 @@
>  #include
>  #include "core.h"
>  #include "patch.h"
> +#include "state.h"
>  #include "transition.h"
>
>  /*
> @@ -1003,6 +1004,13 @@ int klp_enable_patch(struct klp_patch *patch)
>
>  	mutex_lock(&klp_mutex);
>
> +	if (!klp_is_patch_compatible(patch)) {
> +		pr_err("Livepatch patch (%s) is not compatible with the already installed livepatches.\n",
> +		       patch->mod->name);
> +		mutex_unlock(&klp_mutex);
> +		return -EINVAL;
> +	}
> +
>  	ret = klp_init_patch_early(patch);
>  	if (ret) {
>  		mutex_unlock(&klp_mutex);

Just as a remark: klp_reverse_transition() could still transition back to a !klp_is_patch_compatible() patch.

I don't think it's much of a problem, because for live patches introducing completely new states to the system, it is reasonable to assume that they'll start applying incompatible changes only from their ->post_patch(), I guess.

For state "upgrades" to higher versions, it's not so clear though and some care will be needed. But I think these could still be handled safely at the cost of some complexity in the new live patch's ->post_patch().
Another detail is that ->post_unpatch() will be called for the new live patch which has been unpatched due to transition reversal and one would have to be careful not to free shared state from under the older, still active live patch. How would ->post_unpatch() distinguish between transition reversal and "normal" live patch disabling? By klp_get_prev_state() != NULL? Perhaps transition reversal should be mentioned in the documentation? Thanks, Nicolai -- SUSE Linux GmbH, GF: Felix Imendörffer, Mary Higgins, Sri Rasiah, HRB 21284 (AG Nürnberg)
Re: [RFC 2/5] livepatch: Basic API to track system state changes
Petr Mladek writes:

> ---
>  include/linux/livepatch.h | 15 +++++++++++++++
>  kernel/livepatch/Makefile |  2 +-
>  kernel/livepatch/state.c  | 83 ++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 99 insertions(+), 1 deletion(-)
>  create mode 100644 kernel/livepatch/state.c
>
> diff --git a/include/linux/livepatch.h b/include/linux/livepatch.h
> index eeba421cc671..591abdee30d7 100644
> --- a/include/linux/livepatch.h
> +++ b/include/linux/livepatch.h
> @@ -132,10 +132,21 @@ struct klp_object {
>  	bool patched;
>  };
>
> +/**
> + * struct klp_state - state of the system modified by the livepatch
> + * @id: system state identifier (non zero)
> + * @data: custom data
> + */
> +struct klp_state {
> +	int id;

Can we make this an unsigned long, please? It would be consistent with shadow variable ids and would give more room for encoding bugzilla or CVE numbers or so.

Nicolai

> +	void *data;
> +};
> +
Re: [RFC 0/5] livepatch: new API to track system state changes
Hi Petr,

> this is another piece in the puzzle that helps to maintain more
> livepatches.
>
> Especially pre/post (un)patch callbacks might change a system state.
> Any newly installed livepatch has to somehow deal with system state
> modifications done by already installed livepatches.
>
> This patchset provides, hopefully, a simple and generic API that
> helps to keep and pass information between the livepatches.
> It is also usable to prevent loading incompatible livepatches.

I like it a lot, many thanks for doing this! Minor remarks/questions will follow inline.

Nicolai
Re: [PATCH 0/3] x86_64/ftrace: Emulate calls from int3 when patching functions
Steven Rostedt writes:

> [
>   This is the non-RFC version.
>
>   It went through and passed all my tests. If there's no objections
>   I'm going to include this in my pull request. I still have patches
>   in my INBOX that may still be included, so I need to run those through
>   my tests as well, so a pull request won't be immediate.
> ]

> Josh Poimboeuf (1):
>       x86_64: Add gap to int3 to allow for call emulation
>
> Peter Zijlstra (2):
>       x86_64: Allow breakpoints to emulate call instructions
>       ftrace/x86_64: Emulate call function while updating in breakpoint handler

Reviewed-and-tested-by: Nicolai Stange

for the whole series. Many, many thanks to everybody involved!

I'll resend that live patching selftest once this fix here has been merged.

Nicolai
--
SUSE Linux GmbH, GF: Felix Imendörffer, Mary Higgins, Sri Rasiah, HRB 21284 (AG Nürnberg)
Re: [RFC][PATCH v2] ftrace/x86: Emulate call function while updating in breakpoint handler
Hi Steve, many thanks for moving this forward! Steven Rostedt writes: > > diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c > index ef49517f6bb2..9160f5cc3b6d 100644 > --- a/arch/x86/kernel/ftrace.c > +++ b/arch/x86/kernel/ftrace.c > @@ -17,6 +17,7 @@ > #include > #include > #include > +#include > #include > #include > #include > @@ -232,6 +233,9 @@ int ftrace_modify_call(struct dyn_ftrace *rec, unsigned > long old_addr, > > static unsigned long ftrace_update_func; > > +/* Used within inline asm below */ > +unsigned long ftrace_update_func_call; > + > static int update_ftrace_func(unsigned long ip, void *new) > { > unsigned char old[MCOUNT_INSN_SIZE]; > @@ -259,6 +263,8 @@ int ftrace_update_ftrace_func(ftrace_func_t func) > unsigned char *new; > int ret; > > + ftrace_update_func_call = (unsigned long)func; > + > new = ftrace_call_replace(ip, (unsigned long)func); > ret = update_ftrace_func(ip, new); > > @@ -280,6 +286,125 @@ static nokprobe_inline int is_ftrace_caller(unsigned > long ip) > return 0; > } > > +/* > + * We need to handle the "call func1" -> "call func2" case. > + * Just skipping the call is not sufficient as it will be like > + * changing to "nop" first and then updating the call. But some > + * users of ftrace require calls never to be missed. > + * > + * To emulate the call while converting the call site with a breakpoint, > + * some trampolines are used along with per CPU buffers. > + * There are three trampolines for the call sites and three trampolines > + * for the updating of the call in ftrace trampoline. The three > + * trampolines are: > + * > + * 1) Interrupts are enabled when the breakpoint is hit > + * 2) Interrupts are disabled when the breakpoint is hit > + * 3) The breakpoint was hit in an NMI > + * > + * As per CPU data is used, interrupts must be disabled to prevent them > + * from corrupting the data. A separate NMI trampoline is used for the > + * NMI case. 
If interrupts are already disabled, then the return path
> + * of where the breakpoint was hit (saved in the per CPU data) is pushed
> + * on the stack and then a jump to either the ftrace_caller (which will
> + * loop through all registered ftrace_ops handlers depending on the ip
> + * address), or if its a ftrace trampoline call update, it will call
> + * ftrace_update_func_call which will hold the call that should be
> + * called.
> + */
> +extern asmlinkage void ftrace_emulate_call_irqon(void);
> +extern asmlinkage void ftrace_emulate_call_irqoff(void);
> +extern asmlinkage void ftrace_emulate_call_nmi(void);
> +extern asmlinkage void ftrace_emulate_call_update_irqoff(void);
> +extern asmlinkage void ftrace_emulate_call_update_irqon(void);
> +extern asmlinkage void ftrace_emulate_call_update_nmi(void);
> +
> +static DEFINE_PER_CPU(void *, ftrace_bp_call_return);
> +static DEFINE_PER_CPU(void *, ftrace_bp_call_nmi_return);

Andy mentioned #DB and #MC exceptions here:
https://lkml.kernel.org/r/c55ded25-c60d-4731-9a6b-92bda8771...@amacapital.net

I think that #DB won't be possible, provided the trampolines below get tagged as NOKPROBE (do_int3() and ftrace_int3_handler() already have it).

It's highly theoretical, but tracing do_machine_check() could clobber ftrace_bp_call_return or ftrace_bp_call_nmi_return?

> +#ifdef CONFIG_SMP
> +#ifdef CONFIG_X86_64
> +# define BP_CALL_RETURN		"%gs:ftrace_bp_call_return"
> +# define BP_CALL_NMI_RETURN	"%gs:ftrace_bp_call_nmi_return"
> +#else
> +# define BP_CALL_RETURN		"%fs:ftrace_bp_call_return"
> +# define BP_CALL_NMI_RETURN	"%fs:ftrace_bp_call_nmi_return"
> +#endif
> +#else /* SMP */
> +# define BP_CALL_RETURN		"ftrace_bp_call_return"
> +# define BP_CALL_NMI_RETURN	"ftrace_bp_call_nmi_return"
> +#endif
> +
> +/* To hold the ftrace_caller address to push on the stack */
> +void *ftrace_caller_func = (void *)ftrace_caller;

The live patching ftrace_ops need ftrace_regs_caller.
> + > +asm( > + ".text\n" > + > + /* Trampoline for function update with interrupts enabled */ > + ".global ftrace_emulate_call_irqoff\n" > + ".type ftrace_emulate_call_irqoff, @function\n" > + "ftrace_emulate_call_irqoff:\n\t" > + "push "BP_CALL_RETURN"\n\t" > + "push ftrace_caller_func\n" > + "sti\n\t" > + "ret\n\t" > + ".size ftrace_emulate_call_irqoff, .-ftrace_emulate_call_irqoff\n" > + > + /* Trampoline for function update with interrupts disabled*/ > + ".global ftrace_emulate_call_irqon\n" The naming is perhaps a bit confusing, i.e. "update with interrupts disabled" vs. "irqon"... How about swapping irqoff<->irqon? Thanks, Nicolai > + ".type ftrace_emulate_call_irqon, @function\n" > + "ftrace_emulate_call_irqon:\n\t" > + "push "BP_CALL_RETURN"\n\t" > + "push ftrace_caller_func\n" > + "ret\n\t" > + ".size ftrace_emulate_call_irqon,
Re: [PATCH 3/4] x86/ftrace: make ftrace_int3_handler() not to skip fops invocation
Steven Rostedt writes:

> On Mon, 29 Apr 2019 14:38:35 -0700
> Linus Torvalds wrote:
>
>> On Mon, Apr 29, 2019 at 1:30 PM Steven Rostedt wrote:
>> >
>> > The update from "call custom_trampoline" to "call iterator_trampoline"
>> > is where we have an issue.
>>
>> So it has never worked. Just tell people that they have two choices:
>
> The custom call to iterator translation never worked. One option is to
> always call the iterator, and just take the hit. There's another
> solution to make permanent updates that would handle the live
> patching case, but not the tracing case. It is more critical for
> live patching than tracing (a missed traced function is not as critical
> as running buggy code unexpectedly). I could look to work on that
> instead.

Making the live patching case permanent would disable tracing of live patched functions though?

For unconditionally calling the iterator: I don't have numbers, but I would expect that always searching through something like 50-70 live patching ftrace_ops from some hot path will be noticeable.

> If Nicolai has time, perhaps he can try out the method you suggest and
> see if we can move this forward.

You mean making ftrace handle call->call transitions one by one in non-batch mode, right? Sure, I can do that.

Another question though: is there anything that prevents us from calling ftrace_ops_list_func() directly from ftrace_int3_handler()? We don't have parent_ip, but if that is used for reporting only, maybe setting it to some ftrace_is_in_int3_and_doesnt_know_the_parent() will work?

Graph tracing entries would still be lost, but well...

Thanks,

Nicolai
Re: [PATCH 1/4] x86/thread_info: introduce ->ftrace_int3_stack member
Steven Rostedt writes:

> On Sun, 28 Apr 2019 10:41:10 -0700
> Andy Lutomirski wrote:
>
>> > Note that at any given point
>> > in time, there can be at most four such call insn emulations pending:
>> > namely at most one per "process", "irq", "softirq" and "nmi" context.
>> >
>>
>> That's quite an assumption. I think your list should also contain
>> exception, exceptions nested inside that exception, and machine
>> check, at the very least. I'm also wondering why irq and softirq are
>> treated separately.

You're right, I missed the machine check case.

> 4 has usually been the context count we choose. But I guess in theory,
> if we get exceptions then it could potentially be more.

After having seen the static_call discussion, I'm in no way defending this limited approach here, but out of curiosity: can the code between the push onto the stack from ftrace_int3_handler() and the subsequent pop from the stub actually trigger a (non-#MC) exception? There's an iret in between, but that can fault only when returning to user space, correct?

> As for irq vs softirq, an interrupt can preempt a softirq. Interrupts
> are enabled while softirqs are running. When softirqs run, softirqs are
> disabled to prevent nested softirqs. But interrupts are enabled again,
> and another interrupt may come in while a softirq is executing.

Thanks,

Nicolai
--
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
[PATCH 4/4] selftests/livepatch: add "ftrace a live patched function" test
There had been an issue with interactions between tracing and live patching due to how x86's CONFIG_DYNAMIC_FTRACE used to handle the breakpoints at the updated instructions from its ftrace_int3_handler(): starting to trace a live patched function caused a short period in time where the live patching redirection became ineffective. In particular, the guarantees from the consistency model couldn't be held up in this situation.

Implement a testcase verifying that a function's live patch replacement is kept effective when enabling tracing on it.

Reuse the existing 'test_klp_livepatch' live patch module which patches cmdline_proc_show(), the handler for /proc/cmdline. Let the testcase, in a loop,
- apply this live patch,
- launch a background shell job enabling tracing on that function,
- while continuously verifying that the contents of /proc/cmdline still match what would be expected with the live patch applied.

Signed-off-by: Nicolai Stange
---
 tools/testing/selftests/livepatch/Makefile        |  3 +-
 .../livepatch/test-livepatch-vs-ftrace.sh         | 44 +++++++++++++++++++++
 2 files changed, 46 insertions(+), 1 deletion(-)
 create mode 100755 tools/testing/selftests/livepatch/test-livepatch-vs-ftrace.sh

diff --git a/tools/testing/selftests/livepatch/Makefile b/tools/testing/selftests/livepatch/Makefile
index af4aee79bebb..bfa5353f6d17 100644
--- a/tools/testing/selftests/livepatch/Makefile
+++ b/tools/testing/selftests/livepatch/Makefile
@@ -3,6 +3,7 @@
 TEST_GEN_PROGS := \
 	test-livepatch.sh \
 	test-callbacks.sh \
-	test-shadow-vars.sh
+	test-shadow-vars.sh \
+	test-livepatch-vs-ftrace.sh

 include ../lib.mk
diff --git a/tools/testing/selftests/livepatch/test-livepatch-vs-ftrace.sh b/tools/testing/selftests/livepatch/test-livepatch-vs-ftrace.sh
new file mode 100755
index ..5c982ec56373
--- /dev/null
+++ b/tools/testing/selftests/livepatch/test-livepatch-vs-ftrace.sh
@@ -0,0 +1,44 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (C) 2019 SUSE Linux GmbH
+
+. $(dirname $0)/functions.sh
+
+set -e
+
+MOD_LIVEPATCH=test_klp_livepatch
+
+# TEST: ftrace a live patched function
+# - load a livepatch that modifies the output from /proc/cmdline
+# - install a function tracer at the live patched function
+# - verify that the function is still patched by reading /proc/cmdline
+# - unload the livepatch and make sure the patch was removed
+
+echo -n "TEST: ftrace a live patched function ... "
+dmesg -C
+
+for i in $(seq 1 3); do
+	load_lp $MOD_LIVEPATCH
+
+	( echo cmdline_proc_show > /sys/kernel/debug/tracing/set_ftrace_filter;
+	  echo function > /sys/kernel/debug/tracing/current_tracer ) &
+
+	for j in $(seq 1 200); do
+		if [[ "$(cat /proc/cmdline)" != \
+		      "$MOD_LIVEPATCH: this has been live patched" ]] ; then
+			echo -e "FAIL\n\n"
+			die "livepatch kselftest(s) failed"
+		fi
+	done
+
+	wait %1
+
+	echo nop > /sys/kernel/debug/tracing/current_tracer
+	echo > /sys/kernel/debug/tracing/set_ftrace_filter
+
+	disable_lp $MOD_LIVEPATCH
+	unload_lp $MOD_LIVEPATCH
+done
+
+echo "ok"
+exit 0
--
2.13.7
[PATCH 3/4] x86/ftrace: make ftrace_int3_handler() not to skip fops invocation
With dynamic ftrace, ftrace patches call sites on x86 in three steps:
1. put a breakpoint at the location to be patched,
2. update the call site and
3. finally remove the breakpoint again.

Note that the breakpoint ftrace_int3_handler() simply makes execution skip over the instruction to be patched.

This patching happens in the following circumstances:
a.) the global ftrace_trace_function changes and the call sites at ftrace_call and ftrace_regs_call get patched,
b.) an ftrace_ops' ->func changes and the call site in its trampoline gets patched,
c.) graph tracing gets enabled/disabled and the jump site at ftrace_graph_call gets patched,
d.) a mcount site gets converted from nop -> call, call -> nop, or call -> call.

The latter case, i.e. a mcount call getting redirected, is possible in e.g. a transition from trampolined to not trampolined upon a user enabling function tracing on a live patched function.

The ftrace_int3_handler() simply skipping over the updated insn is quite problematic in the context of live patching, because it means that a live patched function gets temporarily reverted to its unpatched original and this breaks the live patching consistency model. But even without live patching, it is desirable to avoid missing traces when making changes to the tracing setup.

Make ftrace_int3_handler() not skip over the fops invocation, but modify the interrupted control flow to issue a call as needed.

Case c.) from the list above can be ignored, because a jmp instruction gets changed to a nop or vice versa.

The remaining cases a.), b.) and d.) all involve call instructions. For a.) and b.), the call always goes to some ftrace_func_t and the generic ftrace_ops_list_func() implementation will be able to demultiplex and do the right thing. For case d.), the call target is either of ftrace_caller(), ftrace_regs_caller() or some ftrace_ops' trampoline. Because providing the register state won't cause any harm for !FTRACE_OPS_FL_SAVE_REGS ftrace_ops, ftrace_regs_caller() would be a suitable target capable of handling any case.

ftrace_int3_handler()'s context is different from the interrupted call instruction's one, obviously. In order to be able to emulate the call within the original context, make ftrace_int3_handler() set its iret frame's ->ip to some helper stub. Upon return from the trap, this stub will then mimic the call by pushing the return address onto the stack and issuing a jmp to the target address. As described above, the jmp target will be either of ftrace_ops_list_func() or ftrace_regs_caller(). Provide one such stub implementation for each of the two cases.

Finally, the desired return address, which is derived from the trapping ->ip, must get passed from ftrace_int3_handler() to these stubs. Make ftrace_int3_handler() push it onto the ftrace_int3_stack introduced by an earlier patch and let the stubs consume it. Be careful to use proper compiler barriers such that nested int3 handling from e.g. irqs won't clobber entries owned by outer instances.
Suggested-by: Steven Rostedt
Signed-off-by: Nicolai Stange
---
 arch/x86/kernel/Makefile            |   1 +
 arch/x86/kernel/ftrace.c            |  79 ++++++++++++++++++++++++++-------
 arch/x86/kernel/ftrace_int3_stubs.S |  61 +++++++++++++++++++++++++
 3 files changed, 130 insertions(+), 11 deletions(-)
 create mode 100644 arch/x86/kernel/ftrace_int3_stubs.S

diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 00b7e27bc2b7..0b63ae02b1f3 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -93,6 +93,7 @@ obj-$(CONFIG_LIVEPATCH)	+= livepatch.o
 obj-$(CONFIG_FUNCTION_TRACER)		+= ftrace_$(BITS).o
 obj-$(CONFIG_FUNCTION_GRAPH_TRACER)	+= ftrace.o
 obj-$(CONFIG_FTRACE_SYSCALLS)		+= ftrace.o
+obj-$(CONFIG_DYNAMIC_FTRACE)		+= ftrace_int3_stubs.o
 obj-$(CONFIG_X86_TSC)			+= trace_clock.o
 obj-$(CONFIG_KEXEC_CORE)		+= machine_kexec_$(BITS).o
 obj-$(CONFIG_KEXEC_CORE)		+= relocate_kernel_$(BITS).o crash.o
diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
index ef49517f6bb2..917494f35cba 100644
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -280,25 +280,84 @@ static nokprobe_inline int is_ftrace_caller(unsigned long ip)
 	return 0;
 }

-/*
- * A breakpoint was added to the code address we are about to
- * modify, and this is the handle that will just skip over it.
- * We are either changing a nop into a trace call, or a trace
- * call to a nop. While the change is taking place, we treat
- * it just like it was a nop.
- */
+extern void ftrace_graph_call(void);
+
+asmlinkage void ftrace_int3_stub_regs_caller(void);
+asmlinkage void ftrace_int3_stub_list_func(void);
+
+/* A breakpoint was added to the code address we are about to modify. */
 int ftrace_int3_handler(struct pt_regs *regs)
 {
 	unsigned long ip;
+	bool is_ftrace_location;
+	struct ftrace_int3_stack *stack
[PATCH 2/4] ftrace: drop 'static' qualifier from ftrace_ops_list_func()
With an upcoming patch improving x86's ftrace_int3_handler() not to simply skip over the insn being updated, ftrace_ops_list_func() will have to get referenced from arch/x86 code. Drop its 'static' qualifier.

Signed-off-by: Nicolai Stange
---
 kernel/trace/ftrace.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index b920358dd8f7..ed3c20811d9a 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -125,8 +125,8 @@ ftrace_func_t ftrace_trace_function __read_mostly = ftrace_stub;
 struct ftrace_ops global_ops;

 #if ARCH_SUPPORTS_FTRACE_OPS
-static void ftrace_ops_list_func(unsigned long ip, unsigned long parent_ip,
-				 struct ftrace_ops *op, struct pt_regs *regs);
+void ftrace_ops_list_func(unsigned long ip, unsigned long parent_ip,
+			  struct ftrace_ops *op, struct pt_regs *regs);
 #else
 /* See comment below, where ftrace_ops_list_func is defined */
 static void ftrace_ops_no_ops(unsigned long ip, unsigned long parent_ip);
@@ -6302,8 +6302,8 @@ __ftrace_ops_list_func(unsigned long ip, unsigned long parent_ip,
  * set the ARCH_SUPPORTS_FTRACE_OPS.
  */
 #if ARCH_SUPPORTS_FTRACE_OPS
-static void ftrace_ops_list_func(unsigned long ip, unsigned long parent_ip,
-				 struct ftrace_ops *op, struct pt_regs *regs)
+void ftrace_ops_list_func(unsigned long ip, unsigned long parent_ip,
+			  struct ftrace_ops *op, struct pt_regs *regs)
 {
 	__ftrace_ops_list_func(ip, parent_ip, NULL, regs);
 }
--
2.13.7
[PATCH 1/4] x86/thread_info: introduce ->ftrace_int3_stack member
Before actually rewriting an insn, x86's DYNAMIC_FTRACE implementation places an int3 breakpoint on it. Currently, ftrace_int3_handler() simply treats the insn in question as nop and advances %rip past it. An upcoming patch will improve this by making the int3 trap handler emulate the call insn. To this end, ftrace_int3_handler() will be made to change its iret frame's ->ip to some stub which will then mimic the function call in the original context.

Somehow the trapping ->ip address will have to get communicated from ftrace_int3_handler() to these stubs though. Note that at any given point in time, there can be at most four such call insn emulations pending: namely at most one per "process", "irq", "softirq" and "nmi" context.

Introduce struct ftrace_int3_stack providing four entries for storing the instruction pointer.

In principle, it could be made per-cpu, but this would require making ftrace_int3_handler() return with preemption disabled and re-enabling it from those emulation stubs only after the stack's top entry has been consumed. I've been told that this would "break a lot of norms" and that making this stack part of struct thread_info instead would be less fragile. Follow this advice and add a struct ftrace_int3_stack instance to x86's struct thread_info.

Note that these stacks will get accessed only rarely (only during ftrace's code modifications) and thus, cache line dirtying won't have any significant impact on the neighbouring fields.

Initialization will take place implicitly through INIT_THREAD_INFO as per the rules for missing elements in initializers. The memcpy() in arch_dup_task_struct() will propagate the initial state properly, because it's always run in process context and won't ever see a non-zero ->depth value.

Finally, add the necessary bits to asm-offsets for making struct ftrace_int3_stack accessible from assembly.
Suggested-by: Steven Rostedt
Signed-off-by: Nicolai Stange
---
 arch/x86/include/asm/thread_info.h | 11 +++++++++++
 arch/x86/kernel/asm-offsets.c      |  8 ++++++++
 2 files changed, 19 insertions(+)

diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index e0eccbcb8447..83434a88cfbb 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -56,6 +56,17 @@ struct task_struct;
 struct thread_info {
 	unsigned long		flags;		/* low level flags */
 	u32			status;		/* thread synchronous flags */
+#ifdef CONFIG_DYNAMIC_FTRACE
+	struct ftrace_int3_stack {
+		int depth;
+		/*
+		 * There can be at most one slot in use per context,
+		 * i.e. at most one for "normal", "irq", "softirq" and
+		 * "nmi" each.
+		 */
+		unsigned long slots[4];
+	} ftrace_int3_stack;
+#endif
 };
 
 #define INIT_THREAD_INFO(tsk)			\
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index 168543d077d7..ca6ee24a0c6e 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -105,4 +105,12 @@ static void __used common(void)
 	OFFSET(TSS_sp0, tss_struct, x86_tss.sp0);
 	OFFSET(TSS_sp1, tss_struct, x86_tss.sp1);
 	OFFSET(TSS_sp2, tss_struct, x86_tss.sp2);
+
+#ifdef CONFIG_DYNAMIC_FTRACE
+	BLANK();
+	OFFSET(TASK_TI_ftrace_int3_depth, task_struct,
+	       thread_info.ftrace_int3_stack.depth);
+	OFFSET(TASK_TI_ftrace_int3_slots, task_struct,
+	       thread_info.ftrace_int3_stack.slots);
+#endif
 }
-- 
2.13.7
[PATCH 0/4] x86/ftrace: make ftrace_int3_handler() not to skip fops invocation
Hi,

this series is the result of the discussion to the RFC patch found at [1]. The goal is to make x86's ftrace_int3_handler() not simply skip over the trapping instruction, as this is problematic in the context of the live patching consistency model. For details, c.f. the commit message of [3/4] ("x86/ftrace: make ftrace_int3_handler() not to skip fops invocation").

Everything is based on v5.1-rc6; please let me know in case you want me to rebase on something else.

For x86_64, the live patching selftest added in [4/4] succeeds with this series applied and fails without it. On 32 bits I only compile-tested.

checkpatch reports warnings about
- an overlong line in assembly -- I chose to ignore that
- MAINTAINERS perhaps needing updates due to the new files arch/x86/kernel/ftrace_int3_stubs.S and tools/testing/selftests/livepatch/test-livepatch-vs-ftrace.sh. As the existing arch/x86/kernel/ftrace_{32,64}.S haven't got an explicit entry either, this one is probably Ok? The selftest definitely is.

Changes to the RFC patch:
- s/trampoline/stub/ to avoid confusion with the ftrace_ops' trampolines,
- use a fixed size stack kept in struct thread_info for passing the (adjusted) ->ip values from ftrace_int3_handler() to the stubs,
- provide one stub for each of the two possible jump targets and hardcode those,
- add the live patching selftest.
Thanks,

Nicolai

Nicolai Stange (4):
  x86/thread_info: introduce ->ftrace_int3_stack member
  ftrace: drop 'static' qualifier from ftrace_ops_list_func()
  x86/ftrace: make ftrace_int3_handler() not to skip fops invocation
  selftests/livepatch: add "ftrace a live patched function" test

 arch/x86/include/asm/thread_info.h            | 11 +++
 arch/x86/kernel/Makefile                      |  1 +
 arch/x86/kernel/asm-offsets.c                 |  8 +++
 arch/x86/kernel/ftrace.c                      | 79 +++++++++++++++---
 arch/x86/kernel/ftrace_int3_stubs.S           | 61 ++++++++++++++
 kernel/trace/ftrace.c                         |  8 +--
 tools/testing/selftests/livepatch/Makefile    |  3 +-
 .../livepatch/test-livepatch-vs-ftrace.sh     | 44 ++++++++++
 8 files changed, 199 insertions(+), 16 deletions(-)
 create mode 100644 arch/x86/kernel/ftrace_int3_stubs.S
 create mode 100755 tools/testing/selftests/livepatch/test-livepatch-vs-ftrace.sh
-- 
2.13.7
Re: [RFC PATCH 1/1] x86/ftrace: make ftrace_int3_handler() not to skip fops invocation
Steven Rostedt writes:

> On Tue, 23 Apr 2019 20:15:49 +0200
> Nicolai Stange wrote:
>> Steven Rostedt writes:
>> > For 32 bit, we could add 4 variables on the thread_info and make 4
>> > trampolines, one for each context (normal, softirq, irq and NMI), and
>> > have them use the variable stored in the thread_info for that location:
>> >
>> > ftrace_int3_tramp_normal
>> >	push %eax # just for space
>> >	push %eax
>> >	mov whatever_to_get_thread_info, %eax
>> >	mov normal(%eax), %eax
>> >	mov %eax, 4(%esp)
>> >	pop %eax
>> >	jmp ftrace_caller
>> >
>> > ftrace_int3_tramp_softirq
>> >	push %eax # just for space
>> >	push %eax
>> >	mov whatever_to_get_thread_info, %eax
>> >	mov softirq(%eax), %eax
>> >	mov %eax, 4(%esp)
>> >	pop %eax
>> >	jmp ftrace_caller
>> >
>> > etc..
>> >
>> > A bit crazier for 32 bit but it can still be done. ;-)
>>
>> Ok, but currently CONFIG_HAVE_LIVEPATCH=n for x86 && !x86_64,
>> so I'd rather not invest too much energy into screwing 32 bit up ;)
>>
>
> Probably not something you care about, but something that I do. Which
> means it will have to go on my TODO list. I care about missed functions
> being called. This means, if you have something tracing a function, and
> then enable something else to trace that same function, you might miss
> the first one.

Alright, if there's a use case beyond live patching, I can try to handle 32 bits alongside, of course.

However, some care will be needed when determining the actual context from ftrace_int3_handler(): tracing anything before the addition or after the subtraction of HARDIRQ_OFFSET to/from preempt_count in irq_enter() and irq_exit(), respectively, could otherwise clobber the "normal" state in thread_info, correct?

How about having a fixed size stack with three entries shared between "normal", "irq" and "softirq" in thread_info, as well as a dedicated slot for nmi context? (The stack would be modified/consumed only with irqs disabled.)
Which makes me wonder:
- if we disabled preemption from ftrace_int3_handler() and reenabled it again from the *_int3_tramp*s, these stacks could be made per-CPU, AFAICS,
- and why not use this strategy on x86_64, too?

The advantages would be a common implementation between 32 and 64 bit and, more importantly, not relying on that %r11 hack. When doing the initial RFC patch, it wasn't clear to me that at most one stack slot per context would be needed, i.e. only four in total...

What do you think?

Nicolai

-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
Re: [RFC PATCH 1/1] x86/ftrace: make ftrace_int3_handler() not to skip fops invocation
Hi Steven, many thanks for having a look! Steven Rostedt writes: > I just found this in my Inbox, and this looks to be a legit issue. > > On Thu, 26 Jul 2018 12:40:29 +0200 > Nicolai Stange wrote: > > You still working on this? Yes, this still needs to get fixed somehow, preferably at the ftrace layer. > >> With dynamic ftrace, ftrace patches call sites in a three steps: >> 1. Put a breakpoint at the to be patched location, >> 2. update call site and >> 3. finally remove the breakpoint again. >> >> Note that the breakpoint ftrace_int3_handler() simply makes execution >> to skip over the to be patched function. >> >> This patching happens in the following circumstances: >> - the global ftrace_trace_function changes and the call sites at >> ftrace_call and ftrace_regs_call get patched, >> - an ftrace_ops' ->func changes and the call site in its trampoline gets >> patched, >> - graph tracing gets enabled/disabled and the jump site at >> ftrace_graph_call gets patched >> - a mcount site gets converted from nop -> call, call -> nop >> or call -> call. >> >> The latter case, i.e. a mcount call getting redirected, happens for example >> in a transition from trampolined to not trampolined upon a user enabling >> function tracing on a live patched function. >> >> The ftrace_int3_handler() simply skipping over the mcount callsite is >> problematic here, because it means that a live patched function gets >> temporarily reverted to its unpatched original and this breaks live >> patching consistency. >> >> Make ftrace_int3_handler not to skip over the fops invocation, but modify >> the interrupted control flow to issue a call as needed. >> >> For determining what the proper action actually is, modify >> update_ftrace_func() to take an extra argument, func, to be called if the >> corresponding breakpoint gets hit. Introduce a new global variable, >> trace_update_func_dest and let update_ftrace_func() set it. 
For the special >> case of patching the jmp insn at ftrace_graph_call, set it to zero and make >> ftrace_int3_handler() just skip over the insn in this case as before. >> >> Because there's no space left above the int3 handler's iret frame to stash >> an extra call's return address in, this can't be mimicked from the >> ftrace_int3_handler() itself. >> >> Instead, make ftrace_int3_handler() change the iret frame's %rip slot to >> point to the new ftrace_int3_call_trampoline to be executed immediately >> after iret. >> >> The original %rip gets communicated to ftrace_int3_call_trampoline which >> can then take proper action. Note that ftrace_int3_call_trampoline can >> nest, because of NMIs, for example. In order to avoid introducing another >> stack, abuse %r11 for passing the original %rip. This is safe, because the >> interrupted context is always at a call insn and according to the x86_64 >> ELF ABI allows %r11 is callee-clobbered. According to the glibc sources, >> this is also true for the somewhat special mcount/fentry ABI. >> >> OTOH, a spare register doesn't exist on i686 and thus, this is approach is >> limited to x86_64. >> >> For determining the call target address, let ftrace_int3_call_trampoline >> compare ftrace_update_func against the interrupted %rip. > > I don't think we need to do the compare. > >> If they match, an update_ftrace_func() is currently pending and the >> call site is either of ftrace_call, ftrace_regs_call or the call insn >> within a fops' trampoline. Jump to the ftrace_update_func_dest location as >> set by update_ftrace_func(). >> >> If OTOH the interrupted %rip doesn't equal ftrace_update_func, then >> it points to an mcount call site. In this case, redirect control flow to >> the most generic handler, ftrace_regs_caller, which will then do the right >> thing. 
>> >> Finally, reading ftrace_update_func and ftrace_update_func_dest from >> outside of the int3 handler requires some care: inbetween adding and >> removing the breakpoints, ftrace invokes run_sync() which basically >> issues a couple of IPIs. Because the int3 handler has interrupts disabled, >> it orders with run_sync(). >> >> To extend that ordering to also include ftrace_int3_call_trampoline, make >> ftrace_int3_handler() disable interrupts within the iret frame returning to >> it. >> >> Store the original EFLAGS.IF into %r11's MSB which, because it represents >> a kernel address, is redundant. Make ftrace_int3_call_trampoline restore >> it when done. > > This
Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged
Linus Torvalds writes:

> So in order to use it as a signal, first you have to first scrub the
> cache (because if the page was already there, there's no signal at
> all), and then for the signal to be as useful as possible, you're also
> going to want to try to get out more than one bit of information: you
> are going to try to see the patterns and the timings of how it gets
> filled.
>
> And that's actually quite painful. You don't know the initial cache
> state, and you're not (in general) controlling the machine entirely,
> because there's also that actual other entity that you're trying to
> attack and see what it does.
>
> So what you want to do is basically to first make sure the cache is
> scrubbed (only for the pages you're interested in!), then trigger
> whatever behavior you are looking for, and then look how that affected
> the cache.
>
> In other words, you want *multiple* residency status check - first to
> see what the cache state is (because you're going to want that for
> scrubbing), then to see that "yes, it's gone" when doing the
> scrubbing, and then to see the *pattern* and timings of how things are
> brought in.

In an attempt to gain a better understanding of the guided eviction part, and in particular of the relevance of mincore() & friends to it, I worked on reproducing the results from [1], section 6.1 ("Efficient Page Cache Eviction on Linux"). In case anybody wants to run their own experiments: the sources can be found at [2].

Disclaimer: I don't have access to the sources used by the [1]-paper's authors, nor do I know anything about their experimental setup. So it might very well be the case that my implementation is completely different and/or inefficient.

Anyway, quoting from [1], section 6.1: "Eviction Set 1. These are pages already in the page cache, used by other processes. To keep them in the page cache, a thread continuously accesses these pages while also keeping the system load low by using sched yield and sleep.
Consequently, they are among the most recently accessed pages of the system and eviction of these pages becomes highly unlikely." I had two questions: 1.) Do the actual contents of "Eviction set 1" matter for the guided eviction's performance or can they as well be arbitrary but fixed? Because if the set's actual contents weren't of any importance, then mincore() would not be needed to initialize it with "pages already in the page cache". 2.) How does keeping some fixed set resident + doing some IO compare to simply streaming a huge random file through the page cache? (To make it explicit: I didn't look into the probe part of the attack or the checking of the victim page's residency status as a termination condition for the eviction run.) Obviously, there are two primary factors affecting the victim page eviction performance: the file page cache size and disk read bandwidth. Naively speaking, I would suppose that keeping a certain set resident is a cheap and stable way to constrain IO to the remaining portion of the page cache and thus, reduce the amount of data required to be read from disk until the victim page gets evicted. Results summary (details can be found at the end of this mail): - The baseline benchmark of simply streaming random data through the page cache behaves as expected: avg of "Inactive(file)" / avg of "victim page eviction time" yields ~480MB/s, which approx. matches my disk's read bandwidth (note: the victim page was mapped as !PROT_EXEC). - I didn't do any sophisticated fine-tuning wrt. to parameters, but for the configuration yielding the best result, the average victim page eviction time was 147ms (stddev(*): 69ms, stderr: 1ms) with the "random but fixed resident set method". That's an improvement by a factor of 2.6 over the baseline "streaming random data method" (with the same amount of anonymous memory, i.e. "Eviction set 3", allocated: 7GB out of a total of 8GB). - In principle, question 1.) 
can't be answered by experiment without controlling the initial, legitimate system workload. I did some lax tests on my desktop running firefox, libreoffice etc. though and of course, overall responsiveness got a lot better if the "Eviction set 1" had been populated with pages already resident at the time the "attack" started. But the victim page eviction times didn't seem to improve -- at least not by factors such that my biased mind would have been able to recognize any change. In conclusion, keeping an arbitrary, but fixed "Eviction set 1" resident improved the victim page eviction performance by some factor over the "streaming" baseline, where "Eviction set 1" was populated from a single, attacker-controlled file and mincore() was not needed for determining its initial contents. To my surprise though, I needed to rely on mincore() at some other place, namely *my* implementation of keeping the resident set resident. My first naive approach was to have a
Re: [PATCH 1/4] powerpc/64s: Clear on-stack exception marker upon exception return
Michael Ellerman writes: > Joe Lawrence writes: >> From: Nicolai Stange >> >> The ppc64 specific implementation of the reliable stacktracer, >> save_stack_trace_tsk_reliable(), bails out and reports an "unreliable >> trace" whenever it finds an exception frame on the stack. Stack frames >> are classified as exception frames if the STACK_FRAME_REGS_MARKER magic, >> as written by exception prologues, is found at a particular location. >> >> However, as observed by Joe Lawrence, it is possible in practice that >> non-exception stack frames can alias with prior exception frames and thus, >> that the reliable stacktracer can find a stale STACK_FRAME_REGS_MARKER on >> the stack. It in turn falsely reports an unreliable stacktrace and blocks >> any live patching transition to finish. Said condition lasts until the >> stack frame is overwritten/initialized by function call or other means. >> >> In principle, we could mitigate this by making the exception frame >> classification condition in save_stack_trace_tsk_reliable() stronger: >> in addition to testing for STACK_FRAME_REGS_MARKER, we could also take into >> account that for all exceptions executing on the kernel stack >> - their stack frames's backlink pointers always match what is saved >> in their pt_regs instance's ->gpr[1] slot and that >> - their exception frame size equals STACK_INT_FRAME_SIZE, a value >> uncommonly large for non-exception frames. >> >> However, while these are currently true, relying on them would make the >> reliable stacktrace implementation more sensitive towards future changes in >> the exception entry code. Note that false negatives, i.e. not detecting >> exception frames, would silently break the live patching consistency model. >> >> Furthermore, certain other places (diagnostic stacktraces, perf, xmon) >> rely on STACK_FRAME_REGS_MARKER as well. 
>> >> Make the exception exit code clear the on-stack STACK_FRAME_REGS_MARKER >> for those exceptions running on the "normal" kernel stack and returning >> to kernelspace: because the topmost frame is ignored by the reliable stack >> tracer anyway, returns to userspace don't need to take care of clearing >> the marker. >> >> Furthermore, as I don't have the ability to test this on Book 3E or >> 32 bits, limit the change to Book 3S and 64 bits. >> >> Finally, make the HAVE_RELIABLE_STACKTRACE Kconfig option depend on >> PPC_BOOK3S_64 for documentation purposes. Before this patch, it depended >> on PPC64 && CPU_LITTLE_ENDIAN and because CPU_LITTLE_ENDIAN implies >> PPC_BOOK3S_64, there's no functional change here. > > That has nothing to do with the fix and should really be in a separate > patch. > > I can split it when applying. If you don't mind, that would be nice! Or simply drop that chunk... Otherwise, let me know if I shall send a split v2 for this patch [1/4] only. Thanks, Nicolai -- SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
Re: ppc64le reliable stack unwinder and scheduled tasks
Joe Lawrence writes: > On Fri, Jan 11, 2019 at 08:51:54AM +0100, Nicolai Stange wrote: >> Joe Lawrence writes: >> >> > On Fri, Jan 11, 2019 at 01:00:38AM +0100, Nicolai Stange wrote: > > [ ... snip ... ] > >> >> For testing, could you try whether clearing the word at STACK_FRAME_MARKER >> >> from _switch() helps? >> >> >> >> I.e. something like (completely untested): >> > >> > I'll kick off some builds tonight and try to get tests lined up >> > tomorrow. Unfortunately these take a bit of time to run setup, schedule >> > and complete, so perhaps by next week. >> >> Ok, that's probably still a good test for confirmation, but the solution >> you proposed below is much better. > > Good news, 40 tests (RHEL-7 backport) with clearing out the > STACK_FRAME_MARKER were all successful. Previously I would see stalled > livepatch transitions due to the unreliable report in ~10% of test > cases. > >> >> diff --git a/arch/powerpc/kernel/entry_64.S >> >> b/arch/powerpc/kernel/entry_64.S >> >> index 435927f549c4..b747d0647ec4 100644 >> >> --- a/arch/powerpc/kernel/entry_64.S >> >> +++ b/arch/powerpc/kernel/entry_64.S >> >> @@ -596,6 +596,10 @@ _GLOBAL(_switch) >> >> SAVE_8GPRS(14, r1) >> >> SAVE_10GPRS(22, r1) >> >> std r0,_NIP(r1) /* Return to switch caller */ >> >> + >> >> + li r23,0 >> >> + std r23,96(r1) /* 96 == STACK_FRAME_MARKER * sizeof(long) */ >> >> + >> >> mfcrr23 >> >> std r23,_CCR(r1) >> >> std r1,KSP(r3) /* Set old stack pointer */ >> >> >> >> >> > >> > This may be sufficient to avoid the condition, but if the contents of >> > frame 0 are truely uninitialized (not to be trusted), should the >> > unwinder be even looking at that frame (for STACK_FRAMES_REGS_MARKER), >> > aside from the LR and other stack size geometry sanity checks? >> >> That's a very good point: we'll only ever be examining scheduled out tasks >> and current (which at that time is running klp_try_complete_transition()). 
>>
>> current won't be in an interrupt/exception when it's walking its own
>> stack. And scheduled out tasks can't have an exception/interrupt frame
>> as their frame 0, correct?
>>
>> Thus, AFAICS, whenever klp_try_complete_transition() finds a
>> STACK_FRAMES_REGS_MARKER in frame 0, it is known to be garbage, as you
>> said.
>
> I gave this about 20 tests and they were also all successful, what do
> you think? (The log could probably use some trimming and cleanup.) I
> think either solution would fix the issue.
>
> -->8-- -->8-- -->8-- -->8-- -->8-- -->8-- -->8-- -->8-- -->8-- -->8--
>
> From edd3fb9e8a16d5a7ccc98759e9397f386f0e82ca Mon Sep 17 00:00:00 2001
> From: Joe Lawrence
> Date: Fri, 11 Jan 2019 10:25:26 -0500
> Subject: [PATCH] powerpc/livepatch: relax reliable stack tracer exception
>  check
>
> The ppc64le reliable stack tracer iterates over a given task's stack,
> verifying a set of conditions to determine whether the trace is
> 'reliable'. These checks include the presence of any exception frames.
>
> We should be careful when inspecting the bottom-most stack frame (the
> first to be unwound), particularly for scheduled-out tasks. As Nicolai
> Stange explains, "If I'm reading the code in _switch() correctly, the
> first frame is completely uninitialized except for the pointer back to
> the caller's stack frame." If a previous do_IRQ() invocation, for
> example, has left a residual exception-marker on the first frame, the
> stack tracer would incorrectly report this task's trace as unreliable.

FWIW, it wasn't do_IRQ() that wrote the exception marker, but its caller hardware_interrupt_common(), more specifically the EXCEPTION_PROLOG_COMMON_3 part thereof.

> save_stack_trace_tsk_reliable() already skips the first frame when
> verifying the saved Link Register. It should do the same when looking
> for the STACK_FRAME_REGS_MARKER.
> The task it is unwinding will be either 'current', in which case the
> tracer will have never been called from an exception path, or the task
> will be inactive and its first frame's contents will be uninitialized
> (aside from backchain pointer).

I thought about this a little more and can't see anything that would prevent higher, i.e. non-_switch(), frames from also aliasing with prior exception frames. That STACK_FRAME_REGS_MARKER is written to a stack frame's "parameter area" and most functi
Re: ppc64le reliable stack unwinder and scheduled tasks
Joe Lawrence writes:

> On Fri, Jan 11, 2019 at 01:00:38AM +0100, Nicolai Stange wrote:
>> Hi Joe,
>>
>> Joe Lawrence writes:
>>
>> > tl;dr: On ppc64le, what is top-most stack frame for scheduled tasks
>> >about?
>>
>> If I'm reading the code in _switch() correctly, the first frame is
>> completely uninitialized except for the pointer back to the caller's
>> stack frame.
>>
>> For completeness: _switch() saves the return address, i.e. the link
>> register into its parent's stack frame, as is mandated by the ABI and
>> consistent with your findings below: it's always the second stack frame
>> where the return address into __switch_to() is kept.
>>
>
> Hi Nicolai,
>
> Good, that makes a lot of sense. I couldn't find any reference
> explaining the contents of frame 0, only unwinding code here and there
> (as in crash-utility) that stepped over it.

FWIW, I learned about general stack frame usage on ppc from part 4 of the introductory series starting at [1]: it's a good read and I can definitely recommend it.

Summary:
- Callers of other functions always allocate a stack frame and only set the pointer to the previous stack frame (that's the 'stdu r1, -STACK_FRAME_OVERHEAD(r1)' insn).
- Callees save their stuff into the stack frame allocated by the caller, if needed. Where "if needed" == callee in turn calls another function.

The insignificance of frame 0's contents follows from this ABI: the caller might not have called any callee yet, the callee might be a leaf, and so on.

Finally, as I understand it, the only purpose of _switch() creating a standard stack frame at the bottom of scheduled out tasks is that the higher ones can be found (for backtracing, for example): otherwise there would be a pt_regs at the bottom of the stack. But I might be wrong here.
>> >> >> > >> > >> > Example 1 (RHEL-7) >> > == >> > >> > crash> struct task_struct.thread c0022fd015c0 | grep ksp >> > ksp = 0xc000288af9c0 >> > >> > crash> rd 0xc000288af9c0 -e 0xc000288b >> > >> > - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - >> > >> > sp[0]: >> > >> > c000288af9c0: c000288afb90 00dd ...( >> > c000288af9d0: c0002a94 c1c60a00 .*.. >> > >> > crash> sym c0002a94 >> > c0002a94 (T) hardware_interrupt_common+0x114 >> >> So that c0002a94 certainly wasn't stored by _switch(). I think >> what might have happened is that the switching frame aliased with some >> prior interrupt frame as setup by hardware_interrupt_common(). >> >> The interrupt and switching frames seem to share a common layout as far >> as the lower STACK_FRAME_OVERHEAD + sizeof(struct pt_regs) bytes are >> concerned. >> >> That address into hardware_interrupt_common() could have been written by >> the do_IRQ() called from there. >> > > That was my initial theory, but then when I saw an ordinary scheduled > task with a similarly strange frame 0, I thought that _switch() might > have been doing something clever (or not). But according your earlier > explanation, it would line up that these values may be inherited from > do_IRQ() or the like. > >> >> > c000288af9e0: c1c60a80 >> > c000288af9f0: c000288afbc0 00dd ...( >> > c000288afa00: c14322e0 c1c60a00 ."C. >> > c000288afa10: c002303ae380 c002303ae380 ..:0..:0 >> > c000288afa20: 7265677368657265 2200 erehsger.".. >> > >> > Uh-oh... >> > >> > /* Mark stacktraces with exception frames as unreliable. */ >> > stack[STACK_FRAME_MARKER] == STACK_FRAME_REGS_MARKER >> >> >> Aliasing of the switching stack frame with some prior interrupt stack >> frame would explain why that STACK_FRAME_REGS_MARKER is still found on >> the stack, i.e. it's a leftover. >> >> For testing, could you try whether clearing the word at STACK_FRAME_MARKER >> from _switch() helps? >> >> I.e. 
something like (completely untested): > > I'll kick off some builds tonight and try to get tests lined up > tomorrow. Unfortunately these take a bit of time to run setup, schedule > and complete, so perhaps by next week. Ok, that's probably still a good test for confirmation, but the solution you pro
Re: ppc64le reliable stack unwinder and scheduled tasks
Hi Joe, Joe Lawrence writes: > tl;dr: On ppc64le, what is top-most stack frame for scheduled tasks >about? If I'm reading the code in _switch() correctly, the first frame is completely uninitialized except for the pointer back to the caller's stack frame. For completeness: _switch() saves the return address, i.e. the link register into its parent's stack frame, as is mandated by the ABI and consistent with your findings below: it's always the second stack frame where the return address into __switch_to() is kept. > > > Example 1 (RHEL-7) > == > > crash> struct task_struct.thread c0022fd015c0 | grep ksp > ksp = 0xc000288af9c0 > > crash> rd 0xc000288af9c0 -e 0xc000288b > > - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - > > sp[0]: > > c000288af9c0: c000288afb90 00dd ...( > c000288af9d0: c0002a94 c1c60a00 .*.. > > crash> sym c0002a94 > c0002a94 (T) hardware_interrupt_common+0x114 So that c0002a94 certainly wasn't stored by _switch(). I think what might have happened is that the switching frame aliased with some prior interrupt frame as setup by hardware_interrupt_common(). The interrupt and switching frames seem to share a common layout as far as the lower STACK_FRAME_OVERHEAD + sizeof(struct pt_regs) bytes are concerned. That address into hardware_interrupt_common() could have been written by the do_IRQ() called from there. > c000288af9e0: c1c60a80 > c000288af9f0: c000288afbc0 00dd ...( > c000288afa00: c14322e0 c1c60a00 ."C. > c000288afa10: c002303ae380 c002303ae380 ..:0..:0 > c000288afa20: 7265677368657265 2200 erehsger.".. > > Uh-oh... > > /* Mark stacktraces with exception frames as unreliable. */ > stack[STACK_FRAME_MARKER] == STACK_FRAME_REGS_MARKER Aliasing of the switching stack frame with some prior interrupt stack frame would explain why that STACK_FRAME_REGS_MARKER is still found on the stack, i.e. it's a leftover. For testing, could you try whether clearing the word at STACK_FRAME_MARKER from _switch() helps? I.e. 
something like (completely untested):

diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
index 435927f549c4..b747d0647ec4 100644
--- a/arch/powerpc/kernel/entry_64.S
+++ b/arch/powerpc/kernel/entry_64.S
@@ -596,6 +596,10 @@ _GLOBAL(_switch)
 	SAVE_8GPRS(14, r1)
 	SAVE_10GPRS(22, r1)
 	std	r0,_NIP(r1)	/* Return to switch caller */
+
+	li	r23,0
+	std	r23,96(r1)	/* 96 == STACK_FRAME_MARKER * sizeof(long) */
+
 	mfcr	r23
 	std	r23,_CCR(r1)
 	std	r1,KSP(r3)	/* Set old stack pointer */

> save_stack_trace_tsk_reliable
> =============================
>
> arch/powerpc/kernel/stacktrace.c :: save_stack_trace_tsk_reliable() does
> take into account the first stackframe, but only to verify that the link
> register is indeed pointing at kernel code address.

It's actually the other way around:

	if (!firstframe && !__kernel_text_address(ip))
		return 1;

So the address gets sanitized only if it's _not_ coming from the first frame.

Thanks,

Nicolai

-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
[RFC PATCH 1/1] x86/ftrace: make ftrace_int3_handler() not to skip fops invocation
With dynamic ftrace, ftrace patches call sites in three steps:
1. Put a breakpoint at the to be patched location,
2. update the call site and
3. finally remove the breakpoint again.

Note that the breakpoint handler, ftrace_int3_handler(), simply makes execution skip over the to-be-patched instruction.

This patching happens in the following circumstances:
- the global ftrace_trace_function changes and the call sites at ftrace_call and ftrace_regs_call get patched,
- an ftrace_ops' ->func changes and the call site in its trampoline gets patched,
- graph tracing gets enabled/disabled and the jump site at ftrace_graph_call gets patched,
- a mcount site gets converted from nop -> call, call -> nop or call -> call.

The latter case, i.e. a mcount call getting redirected, happens for example in a transition from trampolined to not trampolined upon a user enabling function tracing on a live patched function.

ftrace_int3_handler() simply skipping over the mcount call site is problematic here, because it means that a live patched function gets temporarily reverted to its unpatched original and this breaks live patching consistency.

Make ftrace_int3_handler() not skip over the fops invocation, but modify the interrupted control flow to issue a call as needed.

For determining what the proper action actually is, modify update_ftrace_func() to take an extra argument, func, to be called if the corresponding breakpoint gets hit. Introduce a new global variable, ftrace_update_func_dest, and let update_ftrace_func() set it. For the special case of patching the jmp insn at ftrace_graph_call, set it to zero and make ftrace_int3_handler() just skip over the insn in this case as before.

Because there's no space left above the int3 handler's iret frame to stash an extra call's return address in, this can't be mimicked from ftrace_int3_handler() itself.
Instead, make ftrace_int3_handler() change the iret frame's %rip slot to
point to the new ftrace_int3_call_trampoline, to be executed immediately
after iret. The original %rip gets communicated to
ftrace_int3_call_trampoline, which can then take proper action. Note that
ftrace_int3_call_trampoline can nest, because of NMIs, for example. In
order to avoid introducing another stack, abuse %r11 for passing the
original %rip. This is safe because the interrupted context is always at
a call insn and the x86_64 ELF ABI specifies %r11 as callee-clobbered.
According to the glibc sources, this is also true for the somewhat
special mcount/fentry ABI. OTOH, a spare register doesn't exist on i686
and thus, this approach is limited to x86_64.

For determining the call target address, let ftrace_int3_call_trampoline
compare ftrace_update_func against the interrupted %rip. If they match,
an update_ftrace_func() is currently pending and the call site is one of
ftrace_call, ftrace_regs_call or the call insn within a fops' trampoline.
Jump to the ftrace_update_func_dest location as set by
update_ftrace_func(). If OTOH the interrupted %rip doesn't equal
ftrace_update_func, then it points to a mcount call site. In this case,
redirect control flow to the most generic handler, ftrace_regs_caller,
which will then do the right thing.

Finally, reading ftrace_update_func and ftrace_update_func_dest from
outside of the int3 handler requires some care: in between adding and
removing the breakpoints, ftrace invokes run_sync(), which basically
issues a couple of IPIs. Because the int3 handler has interrupts
disabled, it orders with run_sync(). To extend that ordering to also
include ftrace_int3_call_trampoline, make ftrace_int3_handler() disable
interrupts within the iret frame returning to it. Store the original
EFLAGS.IF into %r11's MSB, which, because %r11 holds a kernel address, is
redundant. Make ftrace_int3_call_trampoline restore it when done.
Signed-off-by: Nicolai Stange
---
 arch/x86/kernel/ftrace.c    | 48 +++++++++++++++++++++++++++++++++++++++----
 arch/x86/kernel/ftrace_64.S | 56 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 97 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
index 01ebcb6f263e..3908a73370a8 100644
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -227,9 +227,11 @@ int ftrace_modify_call(struct dyn_ftrace *rec, unsigned long old_addr,
 	return -EINVAL;
 }
 
-static unsigned long ftrace_update_func;
+unsigned long ftrace_update_func;
+unsigned long ftrace_update_func_dest;
 
-static int update_ftrace_func(unsigned long ip, void *new)
+static int update_ftrace_func(unsigned long ip, void *new,
+			      unsigned long func)
 {
 	unsigned char old[MCOUNT_INSN_SIZE];
 	int ret;
@@ -237,6 +239,8 @@ static int update_ftrace_func(unsigned long ip, void *new)
 	memcpy(old, (void *)ip, MCOUNT_INSN_SIZE);
 
 	ftrace_update_func = ip;
+	ftrace_update_func_dest = func;
+
 	/* Make sure the bre
[RFC PATCH 0/1] x86/ftrace: fix live patching vs. tracing race
Hi,

if a user starts to trace a live patched function, its mcount call will
get redirected from a trampoline to ftrace_regs_caller. In preparation
for that, ftrace on x86 first installs an int3 insn at that call site.
ftrace_int3_handler() in turn simply skips over the mcount call insn,
effectively reverting the livepatch for that function during
ftrace_replace_code(). This breaks KLP's consistency model.

There are two possible options for fixing this:
1.) At the ftrace level.
2.) Search for a matching klp_ops from ftrace_int3_handler() and handle
    the redirection if needed.

Both have their drawbacks, hence the RFC mode for this patch
implementing 1.). The main disadvantage is that it doesn't work on 32
bits (c.f. the patch description), but for KLP this would be fine. OTOH,
it keeps KLP specific code out of ftrace_int3_handler() and might
perhaps be beneficial in other contexts as well.

Thanks for your comments!

Nicolai

Nicolai Stange (1):
  x86/ftrace: make ftrace_int3_handler() not to skip fops invocation

 arch/x86/kernel/ftrace.c    | 48 +++++++++++++++++++++++++++++++++++++++----
 arch/x86/kernel/ftrace_64.S | 56 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 97 insertions(+), 7 deletions(-)

-- 
2.13.7
Re: [kbuild-all] [PATCH] OPTIONAL: cpufreq/intel_pstate: fix debugfs_simple_attr.cocci warnings
Julia Lawall <julia.law...@lip6.fr> writes:

> On Fri, 30 Mar 2018, Nicolai Stange wrote:
>
>> Julia Lawall <julia.law...@lip6.fr> writes:
>>
>> > On Thu, 29 Mar 2018, Fabio Estevam wrote:
>> >
>> >> Hi Julia,
>> >>
>> >> On Thu, Mar 29, 2018 at 4:12 PM, Julia Lawall <julia.law...@lip6.fr> wrote:
>> >> > Use DEFINE_DEBUGFS_ATTRIBUTE rather than DEFINE_SIMPLE_ATTRIBUTE
>> >> > for debugfs files.
>> >> >
>> >> > Semantic patch information:
>> >> > Rationale: DEFINE_SIMPLE_ATTRIBUTE + debugfs_create_file()
>> >> > imposes some significant overhead as compared to
>> >> > DEFINE_DEBUGFS_ATTRIBUTE + debugfs_create_file_unsafe().
>> >>
>> >> Just curious: could you please expand on what "imposes some
>> >> significant overhead" means?
>> >
>> > I don't know. I didn't write this rule. Nicolai, can you explain?
>>
>> From commit 49d200deaa68 ("debugfs: prevent access to removed files'
>> private data"):
>>
>>     Upon return of debugfs_remove()/debugfs_remove_recursive(), it might
>>     still be attempted to access associated private file data through
>>     previously opened struct file objects. If that data has been freed by
>>     the caller of debugfs_remove*() in the meanwhile, the reading/writing
>>     process would either encounter a fault or, if the memory address in
>>     question has been reassigned again, unrelated data structures could get
>>     overwritten.
>>     [...]
>>     Currently, there are ~1000 call sites of debugfs_create_file() spread
>>     throughout the whole tree and touching all of those struct file_operations
>>     in order to make them file removal aware by means of checking the result
>>     of debugfs_use_file_start() from within their methods is unfeasible.
>>
>>     Instead, wrap the struct file_operations by a lifetime managing proxy at
>>     file open [...]
>>
>> The additional overhead comes in terms of additional memory needed: for
>> debugfs files created through debugfs_create_file(), one such struct
>> file_operations proxy is allocated for each struct file instantiation,
>> c.f. full_proxy_open().
>>
>> This was needed to "repair" the ~1000 call sites without touching them.
>>
>> New debugfs users should make their file_operations removal aware
>> themselves by means of DEFINE_DEBUGFS_ATTRIBUTE() and signal that fact
>> to the debugfs core by instantiating them through
>> debugfs_create_file_unsafe().
>>
>> See commit c64688081490 ("debugfs: add support for self-protecting
>> attribute file fops") for further information.
>
> Thanks. Perhaps it would be good to add a reference to this commit in
> the message generated by the semantic patch.

Thanks for doing this!

> Would it be sufficient to just apply the semantic patch everywhere and
> submit the patches?

In principle yes. But I'm not sure whether such a mass application is
worth it: the proxy allocation happens only at file open, and the
expectation is that there aren't that many debugfs files kept open at a
time. OTOH, a struct file_operations consumes 256 bytes with
sizeof(long) == 8.

In my opinion, new users should avoid this overhead, as that's easily
doable. For existing ones, I don't know.

Thanks,

Nicolai

-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton,
HRB 21284 (AG Nürnberg)
Re: [kbuild-all] [PATCH] OPTIONAL: cpufreq/intel_pstate: fix debugfs_simple_attr.cocci warnings
Julia Lawall writes:

> On Thu, 29 Mar 2018, Fabio Estevam wrote:
>
>> Hi Julia,
>>
>> On Thu, Mar 29, 2018 at 4:12 PM, Julia Lawall wrote:
>> > Use DEFINE_DEBUGFS_ATTRIBUTE rather than DEFINE_SIMPLE_ATTRIBUTE
>> > for debugfs files.
>> >
>> > Semantic patch information:
>> > Rationale: DEFINE_SIMPLE_ATTRIBUTE + debugfs_create_file()
>> > imposes some significant overhead as compared to
>> > DEFINE_DEBUGFS_ATTRIBUTE + debugfs_create_file_unsafe().
>>
>> Just curious: could you please expand on what "imposes some
>> significant overhead" means?
>
> I don't know. I didn't write this rule. Nicolai, can you explain?

From commit 49d200deaa68 ("debugfs: prevent access to removed files'
private data"):

    Upon return of debugfs_remove()/debugfs_remove_recursive(), it might
    still be attempted to access associated private file data through
    previously opened struct file objects. If that data has been freed by
    the caller of debugfs_remove*() in the meanwhile, the reading/writing
    process would either encounter a fault or, if the memory address in
    question has been reassigned again, unrelated data structures could get
    overwritten.
    [...]
    Currently, there are ~1000 call sites of debugfs_create_file() spread
    throughout the whole tree and touching all of those struct file_operations
    in order to make them file removal aware by means of checking the result
    of debugfs_use_file_start() from within their methods is unfeasible.

    Instead, wrap the struct file_operations by a lifetime managing proxy at
    file open [...]

The additional overhead comes in terms of additional memory needed: for
debugfs files created through debugfs_create_file(), one such struct
file_operations proxy is allocated for each struct file instantiation,
c.f. full_proxy_open().

This was needed to "repair" the ~1000 call sites without touching them.
New debugfs users should make their file_operations removal aware
themselves by means of DEFINE_DEBUGFS_ATTRIBUTE() and signal that fact
to the debugfs core by instantiating them through
debugfs_create_file_unsafe().

See commit c64688081490 ("debugfs: add support for self-protecting
attribute file fops") for further information.

Thanks,

Nicolai

-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton,
HRB 21284 (AG Nürnberg)
[PATCH v2] net: ipv4: emulate READ_ONCE() on ->hdrincl bit-field in raw_sendmsg()
Commit 8f659a03a0ba ("net: ipv4: fix for a race condition in raw_sendmsg")
fixed the issue of possibly inconsistent ->hdrincl handling due to
concurrent updates by reading this bit-field member into a local variable
and using the thus stabilized value in subsequent tests.

However, the aforementioned commit also adds the (correct) comment that

	/* hdrincl should be READ_ONCE(inet->hdrincl)
	 * but READ_ONCE() doesn't work with bit fields
	 */

because, as it stands, the compiler is free to shortcut or even eliminate
the local variable at will.

Note that I have not seen anything like this happening in reality and
thus, the concern is a theoretical one. However, in order to be on the
safe side, emulate a READ_ONCE() on the bit-field by doing it on the
local 'hdrincl' variable itself:

	int hdrincl = inet->hdrincl;
	hdrincl = READ_ONCE(hdrincl);

This breaks the chain in the sense that the compiler is not allowed to
replace subsequent reads from hdrincl with reloads from inet->hdrincl.

Fixes: 8f659a03a0ba ("net: ipv4: fix for a race condition in raw_sendmsg")
Signed-off-by: Nicolai Stange <nsta...@suse.de>
---
Compile-tested only (with inspection of compiler output on x86_64).
Note that there's still a ->hdrincl reload, but that's due to the
inet_sk_flowi_flags() invocation and probably harmless.

Applicable to linux-next-20180108.

Changes in v2:
- Apply review from Stefano Brivio, i.e. drop the __hdrincl
  intermediate, update comment and commit message accordingly.

 net/ipv4/raw.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 5b9bd5c33d9d..6717cf2008f0 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -520,9 +520,11 @@ static int raw_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 		goto out;
 
 	/* hdrincl should be READ_ONCE(inet->hdrincl)
-	 * but READ_ONCE() doesn't work with bit fields
+	 * but READ_ONCE() doesn't work with bit fields.
+	 * Doing this indirectly yields the same result.
 	 */
 	hdrincl = inet->hdrincl;
+	hdrincl = READ_ONCE(hdrincl);

 	/*
 	 *	Check the flags.
 	 */
-- 
2.13.6
Re: [PATCH] net: ipv4: emulate READ_ONCE() on ->hdrincl bit-field in raw_sendmsg()
Hi Stefano,

Stefano Brivio <sbri...@redhat.com> writes:

> On Tue, 2 Jan 2018 17:30:20 +0100
> Nicolai Stange <nsta...@suse.de> wrote:
>
>> [...]
>>
>> diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
>> index 5b9bd5c33d9d..e84290c28c0c 100644
>> --- a/net/ipv4/raw.c
>> +++ b/net/ipv4/raw.c
>> @@ -513,16 +513,18 @@ static int raw_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
>>  	int err;
>>  	struct ip_options_data opt_copy;
>>  	struct raw_frag_vec rfv;
>> -	int hdrincl;
>> +	int hdrincl, __hdrincl;
>>  
>>  	err = -EMSGSIZE;
>>  	if (len > 0xFFFF)
>>  		goto out;
>>  
>>  	/* hdrincl should be READ_ONCE(inet->hdrincl)
>> -	 * but READ_ONCE() doesn't work with bit fields
>> +	 * but READ_ONCE() doesn't work with bit fields.
>> +	 * Emulate it by doing the READ_ONCE() from an intermediate int.
>>  	 */
>> -	hdrincl = inet->hdrincl;
>> +	__hdrincl = inet->hdrincl;
>> +	hdrincl = READ_ONCE(__hdrincl);
>
> I guess you don't actually need to use a third variable. What about
> doing READ_ONCE() on hdrincl itself after the first assignment?
>
> Perhaps something like the patch below -- applies to net.git, yields
> same binary output as your version with gcc 6, looks IMHO more
> straightforward:
>
> diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
> index 125c1eab3eaa..8c2f783a95fc 100644
> --- a/net/ipv4/raw.c
> +++ b/net/ipv4/raw.c
> @@ -519,10 +519,12 @@ static int raw_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
> 	if (len > 0xFFFF)
> 		goto out;
> 
> -	/* hdrincl should be READ_ONCE(inet->hdrincl)
> -	 * but READ_ONCE() doesn't work with bit fields
> +	/* hdrincl should be READ_ONCE(inet->hdrincl) but READ_ONCE() doesn't
> +	 * work with bit fields. Emulate it by adding a further sequence point.
> 	 */
> 	hdrincl = inet->hdrincl;
> +	hdrincl = READ_ONCE(hdrincl);
> +

Yes, this does also work. In fact, after having been lowered into SSA
form, it should be equivalent to what I posted.
So it's a matter of preference/style, and I'd leave the decision on this
to the maintainers -- for me, either way is fine.

I don't like the "sequence point" wording in the comment above, though:
AFAICS, if taken in the C99 sense, it's not a sequence point but the
volatile access in READ_ONCE() which ensures that there won't be any
reloads from ->hdrincl.

If you don't mind, I'll adjust that comment if asked to resend with your
solution.

Thanks,

Nicolai
[PATCH] net: ipv4: emulate READ_ONCE() on ->hdrincl bit-field in raw_sendmsg()
Commit 8f659a03a0ba ("net: ipv4: fix for a race condition in raw_sendmsg")
fixed the issue of possibly inconsistent ->hdrincl handling due to
concurrent updates by reading this bit-field member into a local variable
and using the thus stabilized value in subsequent tests.

However, aforementioned commit also adds the (correct) comment that

  /* hdrincl should be READ_ONCE(inet->hdrincl)
   * but READ_ONCE() doesn't work with bit fields
   */

because as it stands, the compiler is free to shortcut or even eliminate
the local variable at its will.

Note that I have not seen anything like this happening in reality and thus,
the concern is a theoretical one.

However, in order to be on the safe side, emulate a READ_ONCE() on the
bit-field by introducing an intermediate local variable and doing a
READ_ONCE() from it:

  int __hdrincl = inet->hdrincl;
  int hdrincl = READ_ONCE(__hdrincl);

This breaks the chain in the sense that the compiler is not allowed to
replace subsequent reads from hdrincl with reloads from inet->hdrincl.

Fixes: 8f659a03a0ba ("net: ipv4: fix for a race condition in raw_sendmsg")
Signed-off-by: Nicolai Stange <nsta...@suse.de>
---
Compile-tested only (with inspection of compiler output on x86_64).
Applicable to linux-next-20180102.

 net/ipv4/raw.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 5b9bd5c33d9d..e84290c28c0c 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -513,16 +513,18 @@ static int raw_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 	int err;
 	struct ip_options_data opt_copy;
 	struct raw_frag_vec rfv;
-	int hdrincl;
+	int hdrincl, __hdrincl;
 
 	err = -EMSGSIZE;
 	if (len > 0xFFFF)
 		goto out;
 
 	/* hdrincl should be READ_ONCE(inet->hdrincl)
-	 * but READ_ONCE() doesn't work with bit fields
+	 * but READ_ONCE() doesn't work with bit fields.
+	 * Emulate it by doing the READ_ONCE() from an intermediate int.
 	 */
-	hdrincl = inet->hdrincl;
+	__hdrincl = inet->hdrincl;
+	hdrincl = READ_ONCE(__hdrincl);
 	/*
 	 * Check the flags.
 	 */
-- 
2.13.6
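For readers unfamiliar with the trick, the pattern from the patch can be reproduced in plain userspace C. Note that `struct flags`, `READ_ONCE_EMU()` and `read_hdrincl()` below are invented stand-ins for the kernel's `inet_sock` bit-field, `READ_ONCE()` and `raw_sendmsg()` respectively -- this is a sketch of the technique, not the kernel code itself:

```c
#include <assert.h>

/* Invented stand-in for the kernel's inet_sock: hdrincl is a one-bit
 * field and therefore has no address a READ_ONCE() could operate on. */
struct flags {
	unsigned int hdrincl : 1;
	unsigned int other   : 1;
};

/* Userspace emulation of the kernel's READ_ONCE(): the volatile-qualified
 * load forces the compiler to read the object exactly once and forbids
 * replacing later uses of the result with fresh reloads. */
#define READ_ONCE_EMU(x) (*(const volatile __typeof__(x) *)&(x))

static int read_hdrincl(const struct flags *f)
{
	/* Step 1: copy the bit-field into an ordinary, addressable int. */
	int __hdrincl = f->hdrincl;

	/* Step 2: READ_ONCE() from the intermediate variable; this breaks
	 * the chain so the compiler may not reload from f->hdrincl. */
	int hdrincl = READ_ONCE_EMU(__hdrincl);

	return hdrincl;
}
```

In this single-threaded snippet the volatile access is of course unobservable; the point is only that all subsequent tests of `hdrincl` are guaranteed to see one stable value even if another thread updates `f->hdrincl` concurrently.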
Re: [nohz_full/apic] multiple timer interrupts a second
Hi Luiz,

[John Stultz added to CC]

On Fri, Nov 03 2017, Luiz Capitulino wrote:

> [CC'ing lkml this time]
>
> I've observed that smp_apic_timer_interrupt() is sometimes called
> two or more times a second on a nohz_full core which has a single
> task taking 100% of the core. In one of the calls, hrtimer_interrupt()
> runs tick_sched_timer(), but in others it doesn't call any handler.
> Here's an example (Linus HEAD f34157878):
>
> <...>-1831 [008] 1060.115578: funcgraph_entry: | smp_apic_timer_interrupt() {
> <...>-1831 [008] 1060.115578: funcgraph_entry: | hrtimer_interrupt() {
> <...>-1831 [008] 1060.115578: hrtimer_expire_entry: hrtimer=0x8edfefd12d60 function=tick_sched_timer now=1060079001980
> <...>-1831 [008] 1060.115578: funcgraph_entry: 1.172 us | tick_sched_timer();
> <...>-1831 [008] 1060.115580: funcgraph_exit: 1.757 us | }
> <...>-1831 [008] 1060.115580: hrtimer_start: hrtimer=0x8edfefd12d60 function=tick_sched_timer expires=106107900 softexpires=106107900
> <...>-1831 [008] 1060.115581: funcgraph_exit: 3.026 us | }
> <...>-1831 [008] 1061.115577: funcgraph_entry: | smp_apic_timer_interrupt() {
> <...>-1831 [008] 1061.115577: funcgraph_entry: 0.261 us | hrtimer_interrupt(); <-- NO handler called
> <...>-1831 [008] 1061.115578: funcgraph_exit: 1.349 us | }
> <...>-1831 [008] 1061.115579: funcgraph_entry: | smp_apic_timer_interrupt() {
> <...>-1831 [008] 1061.115579: funcgraph_entry: | hrtimer_interrupt() {
> <...>-1831 [008] 1061.115579: hrtimer_expire_entry: hrtimer=0x8edfefd12d60 function=tick_sched_timer now=1061079001473
> <...>-1831 [008] 1061.115580: funcgraph_entry: 1.413 us | tick_sched_timer();
> <...>-1831 [008] 1061.115581: funcgraph_exit: 2.124 us | }
> <...>-1831 [008] 1061.115582: hrtimer_start: hrtimer=0x8edfefd12d60 function=tick_sched_timer expires=106207900 softexpires=106207900
> <...>-1831 [008] 1061.115582: funcgraph_exit: 3.255 us | }
>
> Is this expected for some reason?
>
> I guess what's happening is that the deadline timer is firing
> earlier than expected. From a few dozen to a few hundreds
> nanoseconds earlier. When this happens, hrtimer_interrupt()
> skips calling the hrtimer handler (since it's early) and the
> apic is programmed to fire in the next microsecond.

Exactly.

> On further research I saw that Nicolai tried to fix a very similar
> problem last year:
>
> commit 1a9e4c564ab174e53ed86def922804a5ddc63e7d
> Author: Nicolai Stange <nicsta...@gmail.com>
> Date: Thu Jul 14 17:22:54 2016 +0200
>
>     x86/timers/apic: Fix imprecise timer interrupts by eliminating TSC
>     clockevents frequency roundoff error
>
>     I noticed the following bug/misbehavior on certain Intel systems: with a
>     single task running on a NOHZ CPU on an Intel Haswell, I recognized
>     that I did not only get the one expected local_timer APIC interrupt, but
>     two per second at minimum. (!)

Note that there's also 6731b0d611a1 ("x86/timers/apic: Inform TSC
deadline clockevent device about recalibration").

> Maybe this issue is still present?

Yes, it is. The reason is that NTP frequency adjustments make it into
the monotonic clocksource's frequency, but not into the
clock_event_device's.

I sent a series back then [1] addressing this. However, I changed jobs
in the meanwhile and thus got somewhat distracted. My bad :/ The result
is that this got merged only up to what corresponds to [10/28] in [1].

I'll pick this up again now and try to get the rest accepted for
inclusion.

Thanks,

Nicolai

[1] The last time I sent this series as a whole can be found at
    http://lkml.kernel.org/r/20161119160055.12491-2-nicsta...@gmail.com
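The magnitude Luiz reports ("a few dozen to a few hundreds nanoseconds") is consistent with a clockevent frequency that has been rounded off. A toy calculation illustrates this; the TSC frequencies below are hypothetical values chosen for illustration, not measurements, and overflow handling is deliberately ignored:

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of the clockevents roundoff error: the deadline timer is
 * programmed in cycles computed from a rounded frequency, while real
 * time advances at the true TSC rate.  Returns by how many nanoseconds
 * a delta_ns timer fires early when rounded_hz < true_hz. */
static int64_t fires_early_ns(uint64_t true_hz, uint64_t rounded_hz,
			      uint64_t delta_ns)
{
	/* Cycles programmed into the timer for a delta_ns sleep, using
	 * the (slightly too small) rounded frequency. */
	uint64_t cycles = rounded_hz * delta_ns / 1000000000ULL;

	/* Nanoseconds that actually elapse until those cycles have
	 * passed at the true rate. */
	uint64_t actual_ns = cycles * 1000000000ULL / true_hz;

	return (int64_t)delta_ns - (int64_t)actual_ns;
}
```

With a hypothetical true TSC rate of 2494073499 Hz programmed as 2494073000 Hz, a one-second tick fires roughly 200 ns early -- squarely in the range seen in the trace above, and enough for hrtimer_interrupt() to skip the handler and reprogram the APIC for another interrupt a moment later.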
[PATCH v3 3/8] debugfs: debugfs_real_fops(): drop __must_hold sparse annotation
Currently, debugfs_real_fops() is annotated with a
__must_hold(&debugfs_srcu) sparse annotation.

With the conversion of the SRCU based protection of users against
concurrent file removals to a per-file refcount based scheme, this
becomes wrong.

Drop this annotation.

Signed-off-by: Nicolai Stange <nicsta...@gmail.com>
---
 fs/debugfs/file.c       | 6 +-----
 include/linux/debugfs.h | 3 +--
 2 files changed, 2 insertions(+), 7 deletions(-)

diff --git a/fs/debugfs/file.c b/fs/debugfs/file.c
index 6644bfdea2f8..08511678b782 100644
--- a/fs/debugfs/file.c
+++ b/fs/debugfs/file.c
@@ -98,13 +98,9 @@ EXPORT_SYMBOL_GPL(debugfs_use_file_finish);
 
 #define F_DENTRY(filp) ((filp)->f_path.dentry)
 
 const struct file_operations *debugfs_real_fops(const struct file *filp)
-	__must_hold(&debugfs_srcu)
 {
 	struct debugfs_fsdata *fsd = F_DENTRY(filp)->d_fsdata;
-	/*
-	 * Neither the pointer to the struct file_operations, nor its
-	 * contents ever change -- srcu_dereference() is not needed here.
-	 */
+
 	return fsd->real_fops;
 }
 EXPORT_SYMBOL_GPL(debugfs_real_fops);
diff --git a/include/linux/debugfs.h b/include/linux/debugfs.h
index 3b914d588148..c5eda259b9d6 100644
--- a/include/linux/debugfs.h
+++ b/include/linux/debugfs.h
@@ -95,8 +95,7 @@ int debugfs_use_file_start(const struct dentry *dentry, int *srcu_idx)
 
 void debugfs_use_file_finish(int srcu_idx) __releases(&debugfs_srcu);
 
-const struct file_operations *debugfs_real_fops(const struct file *filp)
-	__must_hold(&debugfs_srcu);
+const struct file_operations *debugfs_real_fops(const struct file *filp);
 
 int debugfs_file_get(struct dentry *dentry);
 void debugfs_file_put(struct dentry *dentry);
-- 
2.13.6
[PATCH v3 4/8] debugfs: convert to debugfs_file_get() and -put()
Convert all calls to the now obsolete debugfs_use_file_start() and
debugfs_use_file_finish() from the debugfs core itself to the new
debugfs_file_get() and debugfs_file_put() API.

Fixes: 49d200deaa68 ("debugfs: prevent access to removed files' private data")
Signed-off-by: Nicolai Stange <nicsta...@gmail.com>
---
 fs/debugfs/file.c | 106 ++++++++++++++++++++++++++++++++++++++++++++++++++--------------------------------------------------------
 1 file changed, 50 insertions(+), 56 deletions(-)

diff --git a/fs/debugfs/file.c b/fs/debugfs/file.c
index 08511678b782..d3a972b45ff0 100644
--- a/fs/debugfs/file.c
+++ b/fs/debugfs/file.c
@@ -155,15 +155,12 @@ EXPORT_SYMBOL_GPL(debugfs_file_put);
 
 static int open_proxy_open(struct inode *inode, struct file *filp)
 {
-	const struct dentry *dentry = F_DENTRY(filp);
+	struct dentry *dentry = F_DENTRY(filp);
 	const struct file_operations *real_fops = NULL;
-	int srcu_idx, r;
+	int r = 0;
 
-	r = debugfs_use_file_start(dentry, &srcu_idx);
-	if (r) {
-		r = -ENOENT;
-		goto out;
-	}
+	if (debugfs_file_get(dentry))
+		return -ENOENT;
 
 	real_fops = debugfs_real_fops(filp);
 	real_fops = fops_get(real_fops);
@@ -180,7 +177,7 @@ static int open_proxy_open(struct inode *inode, struct file *filp)
 		r = real_fops->open(inode, filp);
 
 out:
-	debugfs_use_file_finish(srcu_idx);
+	debugfs_file_put(dentry);
 	return r;
 }
 
@@ -194,16 +191,16 @@ const struct file_operations debugfs_open_proxy_file_operations = {
 #define FULL_PROXY_FUNC(name, ret_type, filp, proto, args)		\
 static ret_type full_proxy_ ## name(proto)				\
 {									\
-	const struct dentry *dentry = F_DENTRY(filp);			\
+	struct dentry *dentry = F_DENTRY(filp);				\
 	const struct file_operations *real_fops =			\
 		debugfs_real_fops(filp);				\
-	int srcu_idx;							\
 	ret_type r;							\
 									\
-	r = debugfs_use_file_start(dentry, &srcu_idx);			\
-	if (likely(!r))							\
-		r = real_fops->name(args);				\
-	debugfs_use_file_finish(srcu_idx);				\
+	r = debugfs_file_get(dentry);					\
+	if (unlikely(r))						\
+		return r;						\
+	r = real_fops->name(args);					\
+	debugfs_file_put(dentry);					\
	return r;							\
 }
 
@@ -228,18 +225,15 @@ FULL_PROXY_FUNC(unlocked_ioctl, long, filp,
 
 static unsigned int full_proxy_poll(struct file *filp,
				struct poll_table_struct *wait)
 {
-	const struct dentry *dentry = F_DENTRY(filp);
 	const struct file_operations *real_fops = debugfs_real_fops(filp);
-	int srcu_idx;
+	struct dentry *dentry = F_DENTRY(filp);
 	unsigned int r = 0;
 
-	if (debugfs_use_file_start(dentry, &srcu_idx)) {
-		debugfs_use_file_finish(srcu_idx);
+	if (debugfs_file_get(dentry))
 		return POLLHUP;
-	}
 
 	r = real_fops->poll(filp, wait);
-	debugfs_use_file_finish(srcu_idx);
+	debugfs_file_put(dentry);
 	return r;
 }
 
@@ -283,16 +277,13 @@ static void __full_proxy_fops_init(struct file_operations *proxy_fops,
 
 static int full_proxy_open(struct inode *inode, struct file *filp)
 {
-	const struct dentry *dentry = F_DENTRY(filp);
+	struct dentry *dentry = F_DENTRY(filp);
 	const struct file_operations *real_fops = NULL;
 	struct file_operations *proxy_fops = NULL;
-	int srcu_idx, r;
+	int r = 0;
 
-	r = debugfs_use_file_start(dentry, &srcu_idx);
-	if (r) {
-		r = -ENOENT;
-		goto out;
-	}
+	if (debugfs_file_get(dentry))
+		return -ENOENT;
 
 	real_fops = debugfs_real_fops(filp);
 	real_fops = fops_get(real_fops);
@@ -330,7 +321,7 @@ static int full_proxy_open(struct inode *inode, struct file *filp)
 		kfree(proxy_fops);
 	fops_put(real_fops);
 out:
-	debugfs_use_file_finish(srcu_idx);
+	debugfs_file_put(dentry);
 	return r;
 }
 
@@ -341,13 +332,14 @@ const struct file_operations debugfs_full_proxy_file_operations = {
 ssize_t debugfs_attr_read(struct file *file, char __user *buf,
			size_t len, loff_t *ppos)
 {
+	struct dentry *dentry = F_DENTRY(file
[PATCH v3 5/8] IB/hfi1: convert to debugfs_file_get() and -put()
Convert all calls to the now obsolete debugfs_use_file_start() and
debugfs_use_file_finish() to the new debugfs_file_get() and
debugfs_file_put() API.

Fixes: 49d200deaa68 ("debugfs: prevent access to removed files' private data")
Signed-off-by: Nicolai Stange <nicsta...@gmail.com>
---
 drivers/infiniband/hw/hfi1/debugfs.c | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/drivers/infiniband/hw/hfi1/debugfs.c b/drivers/infiniband/hw/hfi1/debugfs.c
index 24128f4e5748..ca03d8b23851 100644
--- a/drivers/infiniband/hw/hfi1/debugfs.c
+++ b/drivers/infiniband/hw/hfi1/debugfs.c
@@ -71,13 +71,13 @@ static ssize_t hfi1_seq_read(
	loff_t *ppos)
 {
	struct dentry *d = file->f_path.dentry;
-	int srcu_idx;
	ssize_t r;
 
-	r = debugfs_use_file_start(d, &srcu_idx);
-	if (likely(!r))
-		r = seq_read(file, buf, size, ppos);
-	debugfs_use_file_finish(srcu_idx);
+	r = debugfs_file_get(d);
+	if (unlikely(r))
+		return r;
+	r = seq_read(file, buf, size, ppos);
+	debugfs_file_put(d);
	return r;
 }
 
@@ -87,13 +87,13 @@ static loff_t hfi1_seq_lseek(
	int whence)
 {
	struct dentry *d = file->f_path.dentry;
-	int srcu_idx;
	loff_t r;
 
-	r = debugfs_use_file_start(d, &srcu_idx);
-	if (likely(!r))
-		r = seq_lseek(file, offset, whence);
-	debugfs_use_file_finish(srcu_idx);
+	r = debugfs_file_get(d);
+	if (unlikely(r))
+		return r;
+	r = seq_lseek(file, offset, whence);
+	debugfs_file_put(d);
	return r;
 }
-- 
2.13.6
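Abstracting away from debugfs, the get/put scheme these conversions target can be sketched as follows. This is a deliberately simplified, single-threaded model with invented names (`file_ref`, `file_get()`, ...); the real implementation uses an atomic count and makes removal block until all references have been dropped:

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified model of per-file removal protection: readers take a
 * reference before touching file data and drop it afterwards; removal
 * marks the file dead so no new references can be taken.  The real
 * kernel code additionally makes the remover wait for count to reach
 * zero before freeing anything. */
struct file_ref {
	int  count;    /* active users of the file's data */
	bool removed;  /* set once removal has begun */
};

static int file_get(struct file_ref *f)
{
	if (f->removed)
		return -1;	/* stands in for the kernel's -EIO/-ENOENT */
	f->count++;
	return 0;
}

static void file_put(struct file_ref *f)
{
	f->count--;
}

static void file_remove(struct file_ref *f)
{
	f->removed = true;
	/* real code: wait here until f->count has dropped to zero */
}

/* Exercise the lifecycle: a successful access, then removal, then a
 * rejected access.  Returns 0 if every step behaved as expected. */
static int file_ref_demo(void)
{
	struct file_ref f = { 0, false };

	if (file_get(&f))	/* file alive: access must succeed */
		return 1;
	file_put(&f);		/* done with the data */

	file_remove(&f);
	if (file_get(&f) != -1)	/* file gone: access must fail */
		return 2;
	return f.count == 0 ? 0 : 3;
}
```

Compared with the SRCU scheme it replaces, failure is reported per file rather than through a global grace period, which is what allows the removal paths to stop calling synchronize_srcu().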
[PATCH v3 6/8] debugfs: purge obsolete SRCU based removal protection
Purge the SRCU based file removal race protection in favour of the new,
refcount based debugfs_file_get()/debugfs_file_put() API.

Fixes: 49d200deaa68 ("debugfs: prevent access to removed files' private data")
Signed-off-by: Nicolai Stange <nicsta...@gmail.com>
---
 fs/debugfs/file.c       | 48 ------------------------------------------------
 fs/debugfs/inode.c      |  7 -------
 include/linux/debugfs.h | 19 -------------------
 lib/Kconfig.debug       |  1 -
 4 files changed, 75 deletions(-)

diff --git a/fs/debugfs/file.c b/fs/debugfs/file.c
index d3a972b45ff0..53f5c9a2af88 100644
--- a/fs/debugfs/file.c
+++ b/fs/debugfs/file.c
@@ -22,7 +22,6 @@
 #include
 #include
 #include
-#include <linux/srcu.h>
 #include
 
 #include "internal.h"
@@ -48,53 +47,6 @@ const struct file_operations debugfs_noop_file_operations = {
	.llseek =	noop_llseek,
 };
 
-/**
- * debugfs_use_file_start - mark the beginning of file data access
- * @dentry: the dentry object whose data is being accessed.
- * @srcu_idx: a pointer to some memory to store a SRCU index in.
- *
- * Up to a matching call to debugfs_use_file_finish(), any
- * successive call into the file removing functions debugfs_remove()
- * and debugfs_remove_recursive() will block. Since associated private
- * file data may only get freed after a successful return of any of
- * the removal functions, you may safely access it after a successful
- * call to debugfs_use_file_start() without worrying about
- * lifetime issues.
- *
- * If -%EIO is returned, the file has already been removed and thus,
- * it is not safe to access any of its data. If, on the other hand,
- * it is allowed to access the file data, zero is returned.
- *
- * Regardless of the return code, any call to
- * debugfs_use_file_start() must be followed by a matching call
- * to debugfs_use_file_finish().
- */
-int debugfs_use_file_start(const struct dentry *dentry, int *srcu_idx)
-	__acquires(&debugfs_srcu)
-{
-	*srcu_idx = srcu_read_lock(&debugfs_srcu);
-	barrier();
-	if (d_unlinked(dentry))
-		return -EIO;
-	return 0;
-}
-EXPORT_SYMBOL_GPL(debugfs_use_file_start);
-
-/**
- * debugfs_use_file_finish - mark the end of file data access
- * @srcu_idx: the SRCU index "created" by a former call to
- *	debugfs_use_file_start().
- *
- * Allow any ongoing concurrent call into debugfs_remove() or
- * debugfs_remove_recursive() blocked by a former call to
- * debugfs_use_file_start() to proceed and return to its caller.
- */
-void debugfs_use_file_finish(int srcu_idx) __releases(&debugfs_srcu)
-{
-	srcu_read_unlock(&debugfs_srcu, srcu_idx);
-}
-EXPORT_SYMBOL_GPL(debugfs_use_file_finish);
-
 #define F_DENTRY(filp) ((filp)->f_path.dentry)
 
 const struct file_operations *debugfs_real_fops(const struct file *filp)
diff --git a/fs/debugfs/inode.c b/fs/debugfs/inode.c
index 6449eb935540..f587aded46b5 100644
--- a/fs/debugfs/inode.c
+++ b/fs/debugfs/inode.c
@@ -27,14 +27,11 @@
 #include
 #include
 #include
-#include <linux/srcu.h>
 
 #include "internal.h"
 
 #define DEBUGFS_DEFAULT_MODE	0700
 
-DEFINE_SRCU(debugfs_srcu);
-
 static struct vfsmount *debugfs_mount;
 static int debugfs_mount_count;
 static bool debugfs_registered;
@@ -693,8 +690,6 @@ void debugfs_remove(struct dentry *dentry)
	inode_unlock(d_inode(parent));
	if (!ret)
		simple_release_fs(&debugfs_mount, &debugfs_mount_count);
-
-	synchronize_srcu(&debugfs_srcu);
 }
 EXPORT_SYMBOL_GPL(debugfs_remove);
@@ -768,8 +763,6 @@ void debugfs_remove_recursive(struct dentry *dentry)
		if (!__debugfs_remove(child, parent))
			simple_release_fs(&debugfs_mount, &debugfs_mount_count);
	inode_unlock(d_inode(parent));
-
-	synchronize_srcu(&debugfs_srcu);
 }
 EXPORT_SYMBOL_GPL(debugfs_remove_recursive);
 
diff --git a/include/linux/debugfs.h b/include/linux/debugfs.h
index c5eda259b9d6..9f821a43bb0d 100644
--- a/include/linux/debugfs.h
+++ b/include/linux/debugfs.h
@@ -23,7 +23,6 @@
 
 struct device;
 struct file_operations;
-struct srcu_struct;
 
 struct debugfs_blob_wrapper {
	void *data;
@@ -43,8 +42,6 @@ struct debugfs_regset32 {
 
 extern struct dentry *arch_debugfs_dir;
 
-extern struct srcu_struct debugfs_srcu;
-
 #define DEFINE_DEBUGFS_ATTRIBUTE(__fops, __get, __set, __fmt)		\
 static int __fops ## _open(struct inode *inode, struct file *file)	\
 {									\
@@ -90,11 +87,6 @@ struct dentry *debugfs_create_automount(const char *name,
 
 void debugfs_remove(struct dentry *dentry);
 void debugfs_remove_recursive(struct dentry *dentry);
 
-int debugfs_use_file_start(const struct dentry *dentry, int *srcu_idx)
-	__acquires(&debugfs_srcu);
-
-void debugfs_use_file_finish(int srcu_idx) __releases(&debugfs_srcu);
-
 const struct file_operations *debugfs_real_fops(const struct file *filp);
 
 int debugfs_file_get(struct dentry *dentry);
@@ -227,17 +219,6 @@ static inline void debugfs_remove(struct dentry *dentry)
[PATCH v3 1/8] debugfs: add support for more elaborate ->d_fsdata
Currently, the user-provided fops, "real_fops", are stored directly into
->d_fsdata.

In order to be able to store more per-file state and thus prepare for
more granular file removal protection, wrap the real_fops into a
dynamically allocated container struct, debugfs_fsdata.

A struct debugfs_fsdata gets allocated at file creation and freed from
the newly introduced ->d_release().

Finally, move the implementation of debugfs_real_fops() out of the
public debugfs header such that struct debugfs_fsdata's declaration can
be kept private.

Signed-off-by: Nicolai Stange <nicsta...@gmail.com>
---
 fs/debugfs/file.c       | 12
 fs/debugfs/inode.c      | 22 +++---
 fs/debugfs/internal.h   |  4
 include/linux/debugfs.h | 20 +++-
 4 files changed, 38 insertions(+), 20 deletions(-)

diff --git a/fs/debugfs/file.c b/fs/debugfs/file.c
index 6dabc4a10396..b6f5ddab66bf 100644
--- a/fs/debugfs/file.c
+++ b/fs/debugfs/file.c
@@ -97,6 +97,18 @@ EXPORT_SYMBOL_GPL(debugfs_use_file_finish);
 
 #define F_DENTRY(filp) ((filp)->f_path.dentry)
 
+const struct file_operations *debugfs_real_fops(const struct file *filp)
+	__must_hold(&debugfs_srcu)
+{
+	struct debugfs_fsdata *fsd = F_DENTRY(filp)->d_fsdata;
+	/*
+	 * Neither the pointer to the struct file_operations, nor its
+	 * contents ever change -- srcu_dereference() is not needed here.
+	 */
+	return fsd->real_fops;
+}
+EXPORT_SYMBOL_GPL(debugfs_real_fops);
+
 static int open_proxy_open(struct inode *inode, struct file *filp)
 {
 	const struct dentry *dentry = F_DENTRY(filp);
diff --git a/fs/debugfs/inode.c b/fs/debugfs/inode.c
index c59f015f386e..a9c3d3e9af39 100644
--- a/fs/debugfs/inode.c
+++ b/fs/debugfs/inode.c
@@ -185,6 +185,11 @@ static const struct super_operations debugfs_super_operations = {
 	.evict_inode	= debugfs_evict_inode,
 };
 
+static void debugfs_release_dentry(struct dentry *dentry)
+{
+	kfree(dentry->d_fsdata);
+}
+
 static struct vfsmount *debugfs_automount(struct path *path)
 {
 	debugfs_automount_t f;
@@ -194,6 +199,7 @@ static struct vfsmount *debugfs_automount(struct path *path)
 
 static const struct dentry_operations debugfs_dops = {
 	.d_delete = always_delete_dentry,
+	.d_release = debugfs_release_dentry,
 	.d_automount = debugfs_automount,
 };
 
@@ -341,24 +347,34 @@ static struct dentry *__debugfs_create_file(const char *name, umode_t mode,
 {
 	struct dentry *dentry;
 	struct inode *inode;
+	struct debugfs_fsdata *fsd;
+
+	fsd = kmalloc(sizeof(*fsd), GFP_KERNEL);
+	if (!fsd)
+		return NULL;
 
 	if (!(mode & S_IFMT))
 		mode |= S_IFREG;
 	BUG_ON(!S_ISREG(mode));
 	dentry = start_creating(name, parent);
 
-	if (IS_ERR(dentry))
+	if (IS_ERR(dentry)) {
+		kfree(fsd);
 		return NULL;
+	}
 
 	inode = debugfs_get_inode(dentry->d_sb);
-	if (unlikely(!inode))
+	if (unlikely(!inode)) {
+		kfree(fsd);
 		return failed_creating(dentry);
+	}
 
 	inode->i_mode = mode;
 	inode->i_private = data;
 
 	inode->i_fop = proxy_fops;
-	dentry->d_fsdata = (void *)real_fops;
+	fsd->real_fops = real_fops;
+	dentry->d_fsdata = fsd;
 
 	d_instantiate(dentry, inode);
 	fsnotify_create(d_inode(dentry->d_parent), dentry);
diff --git a/fs/debugfs/internal.h b/fs/debugfs/internal.h
index b3e8443a1f47..512601eed3ce 100644
--- a/fs/debugfs/internal.h
+++ b/fs/debugfs/internal.h
@@ -19,4 +19,8 @@ extern const struct file_operations debugfs_noop_file_operations;
 extern const struct file_operations debugfs_open_proxy_file_operations;
 extern const struct file_operations debugfs_full_proxy_file_operations;
 
+struct debugfs_fsdata {
+	const struct file_operations *real_fops;
+};
+
 #endif /* _DEBUGFS_INTERNAL_H_ */
diff --git a/include/linux/debugfs.h b/include/linux/debugfs.h
index b93efc8feecd..cbee5f4a02a3 100644
--- a/include/linux/debugfs.h
+++ b/include/linux/debugfs.h
@@ -45,23 +45,6 @@ extern struct dentry *arch_debugfs_dir;
 
 extern struct srcu_struct debugfs_srcu;
 
-/**
- * debugfs_real_fops - getter for the real file operation
- * @filp: a pointer to a struct file
- *
- * Must only be called under the protection established by
- * debugfs_use_file_start().
- */
-static inline const struct file_operations *debugfs_real_fops(const struct file *filp)
-	__must_hold(&debugfs_srcu)
-{
-	/*
-	 * Neither the pointer to the struct file_operations, nor its
-	 * contents ever change -- srcu_dereference() is not needed here.
-	 */
-	return filp->f_path.dentry->d_fsdata;
-}
-
 #define DEFINE_DEBUGFS_ATTRIBUTE(__fops, __get, __set, __fmt)		\
 static int __fops ## _open(struct inode *inode, struct file *file)	\
 {
[PATCH v3 2/8] debugfs: implement per-file removal protection
Since commit 49d200deaa68 ("debugfs: prevent access to removed files'
private data"), accesses to a file's private data are protected from
concurrent removal by covering all file_operations with a SRCU read
section and synchronizing with those before returning from
debugfs_remove() by means of synchronize_srcu().

As pointed out by Johannes Berg, there are debugfs files with forever
blocking file_operations. Their corresponding SRCU read side sections
would block any debugfs_remove() forever as well, even unrelated ones.
This results in a livelock. Because a remover can't cancel any
indefinite blocking within foreign files, this is a problem.

Resolve this by introducing support for more granular protection on a
per-file basis.

This is implemented by introducing an 'active_users' refcount_t to the
per-file struct debugfs_fsdata state. At file creation time, it is set
to one and a debugfs_remove() will drop that initial reference. The new
debugfs_file_get() and debugfs_file_put(), intended to be used in place
of the former debugfs_use_file_start() and debugfs_use_file_finish(),
increment and decrement it respectively. Once the count drops to zero,
debugfs_file_put() will signal a completion which is possibly being
waited for from debugfs_remove(). Thus, as long as there is a
debugfs_file_get() not yet matched by a corresponding
debugfs_file_put() around, debugfs_remove() will block.

Actual users of debugfs_use_file_start() and -finish() will get
converted to the new debugfs_file_get() and debugfs_file_put() by
followup patches.

Fixes: 49d200deaa68 ("debugfs: prevent access to removed files' private data")
Reported-by: Johannes Berg <johan...@sipsolutions.net>
Signed-off-by: Nicolai Stange <nicsta...@gmail.com>
---
 fs/debugfs/file.c       | 48
 fs/debugfs/inode.c      | 29 +++--
 fs/debugfs/internal.h   |  2 ++
 include/linux/debugfs.h | 11 +++
 4 files changed, 84 insertions(+), 6 deletions(-)

diff --git a/fs/debugfs/file.c b/fs/debugfs/file.c
index b6f5ddab66bf..6644bfdea2f8 100644
--- a/fs/debugfs/file.c
+++ b/fs/debugfs/file.c
@@ -109,6 +109,54 @@ const struct file_operations *debugfs_real_fops(const struct file *filp)
 }
 EXPORT_SYMBOL_GPL(debugfs_real_fops);
 
+/**
+ * debugfs_file_get - mark the beginning of file data access
+ * @dentry: the dentry object whose data is being accessed.
+ *
+ * Up to a matching call to debugfs_file_put(), any successive call
+ * into the file removing functions debugfs_remove() and
+ * debugfs_remove_recursive() will block. Since associated private
+ * file data may only get freed after a successful return of any of
+ * the removal functions, you may safely access it after a successful
+ * call to debugfs_file_get() without worrying about lifetime issues.
+ *
+ * If -%EIO is returned, the file has already been removed and thus,
+ * it is not safe to access any of its data. If, on the other hand,
+ * it is allowed to access the file data, zero is returned.
+ */
+int debugfs_file_get(struct dentry *dentry)
+{
+	struct debugfs_fsdata *fsd = dentry->d_fsdata;
+
+	/* Avoid starvation of removers. */
+	if (d_unlinked(dentry))
+		return -EIO;
+
+	if (!refcount_inc_not_zero(&fsd->active_users))
+		return -EIO;
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(debugfs_file_get);
+
+/**
+ * debugfs_file_put - mark the end of file data access
+ * @dentry: the dentry object formerly passed to
+ *	    debugfs_file_get().
+ *
+ * Allow any ongoing concurrent call into debugfs_remove() or
+ * debugfs_remove_recursive() blocked by a former call to
+ * debugfs_file_get() to proceed and return to its caller.
+ */
+void debugfs_file_put(struct dentry *dentry)
+{
+	struct debugfs_fsdata *fsd = dentry->d_fsdata;
+
+	if (refcount_dec_and_test(&fsd->active_users))
+		complete(&fsd->active_users_drained);
+}
+EXPORT_SYMBOL_GPL(debugfs_file_put);
+
 static int open_proxy_open(struct inode *inode, struct file *filp)
 {
 	const struct dentry *dentry = F_DENTRY(filp);
diff --git a/fs/debugfs/inode.c b/fs/debugfs/inode.c
index a9c3d3e9af39..6449eb935540 100644
--- a/fs/debugfs/inode.c
+++ b/fs/debugfs/inode.c
@@ -374,6 +374,7 @@ static struct dentry *__debugfs_create_file(const char *name, umode_t mode,
 	inode->i_fop = proxy_fops;
 	fsd->real_fops = real_fops;
+	refcount_set(&fsd->active_users, 1);
 	dentry->d_fsdata = fsd;
 
 	d_instantiate(dentry, inode);
@@ -631,18 +632,34 @@ struct dentry *debugfs_create_symlink(const char *name, struct dentry *parent,
 }
 EXPORT_SYMBOL_GPL(debugfs_create_symlink);
 
+static void __debugfs_remove_file(struct dentry *dentry, struct dentry *parent)
+{
+	struct debugfs_fsdata *fsd;
+
+	simple_unlink(d_inode(parent), dentry);
+	d_delete(dentry);
+	fsd = dentry->d_fsdata;
+	init_completion(&fsd->active_users_drained);
[PATCH v3 7/8] debugfs: call debugfs_real_fops() only after debugfs_file_get()
The current implementation of debugfs_real_fops() relies on a
debugfs_fsdata instance to be installed at ->d_fsdata. With future
patches introducing lazy allocation of these, this requirement will be
guaranteed to be fulfilled only in between a
debugfs_file_get()/debugfs_file_put() pair.

The full proxies' fops implemented by debugfs happen to be the only
offenders. Fix them up by moving their debugfs_real_fops() calls past
those to debugfs_file_get().

full_proxy_release() is special as it doesn't invoke debugfs_file_get()
at all. Leave it alone for now.

Signed-off-by: Nicolai Stange <nicsta...@gmail.com>
---
 fs/debugfs/file.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/fs/debugfs/file.c b/fs/debugfs/file.c
index 53f5c9a2af88..bc3549c95574 100644
--- a/fs/debugfs/file.c
+++ b/fs/debugfs/file.c
@@ -144,13 +144,13 @@ const struct file_operations debugfs_open_proxy_file_operations = {
 static ret_type full_proxy_ ## name(proto)				\
 {									\
 	struct dentry *dentry = F_DENTRY(filp);				\
-	const struct file_operations *real_fops =			\
-		debugfs_real_fops(filp);				\
+	const struct file_operations *real_fops;			\
 	ret_type r;							\
 									\
 	r = debugfs_file_get(dentry);					\
 	if (unlikely(r))						\
 		return r;						\
+	real_fops = debugfs_real_fops(filp);				\
 	r = real_fops->name(args);					\
 	debugfs_file_put(dentry);					\
 	return r;							\
@@ -177,13 +177,14 @@ FULL_PROXY_FUNC(unlocked_ioctl, long, filp,
 static unsigned int full_proxy_poll(struct file *filp,
 				struct poll_table_struct *wait)
 {
-	const struct file_operations *real_fops = debugfs_real_fops(filp);
 	struct dentry *dentry = F_DENTRY(filp);
 	unsigned int r = 0;
+	const struct file_operations *real_fops;
 
 	if (debugfs_file_get(dentry))
 		return POLLHUP;
 
+	real_fops = debugfs_real_fops(filp);
 	r = real_fops->poll(filp, wait);
 	debugfs_file_put(dentry);
 	return r;
-- 
2.13.6
[PATCH v3 8/8] debugfs: defer debugfs_fsdata allocation to first usage
Currently, __debugfs_create_file allocates one struct debugfs_fsdata
instance for every file created. However, there are potentially many
debugfs files around, most of which are never touched by userspace.

Thus, defer the allocations to the first usage, i.e. to the first
debugfs_file_get().

A dentry's ->d_fsdata starts out pointing to the "real", user-provided
fops. After a debugfs_fsdata instance has been allocated (and the real
fops pointer has been moved over into its ->real_fops member),
->d_fsdata is changed to point to it from then on. The two cases are
distinguished by setting BIT(0) for the real fops case.

struct debugfs_fsdata's foremost purpose is to track active users and
to make debugfs_remove() block until they are done. Since no
debugfs_fsdata instance means no active users, make debugfs_remove()
return immediately in this case.

Take care of possible races between debugfs_file_get() and
debugfs_remove(): either debugfs_remove() must see a debugfs_fsdata
instance and thus wait for possible active users or debugfs_file_get()
must see a dead dentry and return immediately.

Make a dentry's ->d_release(), i.e. debugfs_release_dentry(), check
whether ->d_fsdata is actually a debugfs_fsdata instance before
kfree()ing it.

Similarly, make debugfs_real_fops() check whether ->d_fsdata is
actually a debugfs_fsdata instance before returning it, otherwise emit
a warning.

The set of possible error codes returned from debugfs_file_get() has
grown from -EIO to -EIO and -ENOMEM. Make open_proxy_open() and
full_proxy_open() pass the -ENOMEM onwards to their callers.

Signed-off-by: Nicolai Stange <nicsta...@gmail.com>
---
 fs/debugfs/file.c     | 55 ++-
 fs/debugfs/inode.c    | 36 +
 fs/debugfs/internal.h |  8
 3 files changed, 73 insertions(+), 26 deletions(-)

diff --git a/fs/debugfs/file.c b/fs/debugfs/file.c
index bc3549c95574..65872340e301 100644
--- a/fs/debugfs/file.c
+++ b/fs/debugfs/file.c
@@ -53,6 +53,15 @@ const struct file_operations *debugfs_real_fops(const struct file *filp)
 {
 	struct debugfs_fsdata *fsd = F_DENTRY(filp)->d_fsdata;
 
+	if ((unsigned long)fsd & DEBUGFS_FSDATA_IS_REAL_FOPS_BIT) {
+		/*
+		 * Urgh, we've been called w/o a protecting
+		 * debugfs_file_get().
+		 */
+		WARN_ON(1);
+		return NULL;
+	}
+
 	return fsd->real_fops;
 }
 EXPORT_SYMBOL_GPL(debugfs_real_fops);
@@ -74,9 +83,35 @@ EXPORT_SYMBOL_GPL(debugfs_real_fops);
  */
 int debugfs_file_get(struct dentry *dentry)
 {
-	struct debugfs_fsdata *fsd = dentry->d_fsdata;
+	struct debugfs_fsdata *fsd;
+	void *d_fsd;
+
+	d_fsd = READ_ONCE(dentry->d_fsdata);
+	if (!((unsigned long)d_fsd & DEBUGFS_FSDATA_IS_REAL_FOPS_BIT)) {
+		fsd = d_fsd;
+	} else {
+		fsd = kmalloc(sizeof(*fsd), GFP_KERNEL);
+		if (!fsd)
+			return -ENOMEM;
+
+		fsd->real_fops = (void *)((unsigned long)d_fsd &
+					~DEBUGFS_FSDATA_IS_REAL_FOPS_BIT);
+		refcount_set(&fsd->active_users, 1);
+		init_completion(&fsd->active_users_drained);
+		if (cmpxchg(&dentry->d_fsdata, d_fsd, fsd) != d_fsd) {
+			kfree(fsd);
+			fsd = READ_ONCE(dentry->d_fsdata);
+		}
+	}
 
-	/* Avoid starvation of removers. */
+	/*
+	 * In case of a successful cmpxchg() above, this check is
+	 * strictly necessary and must follow it, see the comment in
+	 * __debugfs_remove_file().
+	 * OTOH, if the cmpxchg() hasn't been executed or wasn't
+	 * successful, this serves the purpose of not starving
+	 * removers.
+	 */
 	if (d_unlinked(dentry))
 		return -EIO;
 
@@ -98,7 +133,7 @@ EXPORT_SYMBOL_GPL(debugfs_file_get);
  */
 void debugfs_file_put(struct dentry *dentry)
 {
-	struct debugfs_fsdata *fsd = dentry->d_fsdata;
+	struct debugfs_fsdata *fsd = READ_ONCE(dentry->d_fsdata);
 
 	if (refcount_dec_and_test(&fsd->active_users))
 		complete(&fsd->active_users_drained);
@@ -109,10 +144,11 @@ static int open_proxy_open(struct inode *inode, struct file *filp)
 {
 	struct dentry *dentry = F_DENTRY(filp);
 	const struct file_operations *real_fops = NULL;
-	int r = 0;
+	int r;
 
-	if (debugfs_file_get(dentry))
-		return -ENOENT;
+	r = debugfs_file_get(dentry);
+	if (r)
+		return r == -EIO ? -ENOENT : r;
 
 	real_fops = debugfs_real_fops(filp);
 	real_fops = fops_get(real_fops);
@@ -233,10 +269,11 @@ static int full_proxy_open(struct inode *inode, struct file *filp)
 {
 	struct dentry *dentry = F_DENTRY(filp);
 	const struct file_operations *real_fops
[PATCH v3 0/8] debugfs: per-file removal protection
Hi,

this is v3 of the per-file removal protection, with the main change to v2
being that it's sent in non-RFC mode now. For this, I dropped the
questionable former [9/9] ("debugfs: free debugfs_fsdata instances") for
now. Perhaps I'll resend it later on its own as an RFC again.

The problem this series attempts to address is that any indefinitely
blocking fops blocks _unrelated_ debugfs_remove()ers. This is resolved by
introducing a per-file removal protection mechanism in place of the former
global debugfs_srcu shared among all debugfs files.

This problem was first spotted and debugged by Johannes Berg [1], and
Tyler Hall is now facing the same issue again [2].

There's one patch to non-debugfs code, [5/8] ("IB/hfi1: convert to
debugfs_file_get() and -put()"), which should be taken together with the
rest of the series because it depends on prior patches and later patches
depend on it.

Applies to next-20171018.

I reviewed the patches once again and did an allmodconfig as well as an
allnoconfig build. I did more thorough testing on v2, including whether
the removal protection still works.

Thanks,

Nicolai

Changes to v2:
  [9/9] ("debugfs: free debugfs_fsdata instances")
    - dropped for now.

Changes to v1:
  [2/9] ("debugfs: implement per-file removal protection")
    - In an attempt to resolve the issue reported by the kernel test
      robot for v1, restrict the "extended removal logic" to regular
      files in __debugfs_remove().
  [8/9] ("debugfs: defer debugfs_fsdata allocation to first usage")
    - Following review from Johannes Berg, replace the WARN_ON in
      debugfs_real_fops() by a WARN + 'return NULL'. The return NULL is
      expected to crash current soon and serves as an alternative to a
      BUG_ON here.
    - Mention the change in debugfs_real_fops() in the commit message.
  [9/9] ("debugfs: free debugfs_fsdata instances")
    - Following advice from Paul E. McKenney, make debugfs_file_get()
      release the RCU read section in between retry loop iterations.
    - Fix a race in debugfs_file_get()'s path handling a concurrent
      debugfs_file_put(): the former must not "help out resetting
      ->d_fsdata", because this can wipe out another debugfs_file_get()'s
      achievements.

[1] http://lkml.kernel.org/r/1490280886.2766.4.ca...@sipsolutions.net
[2] https://lkml.kernel.org/r/CAOjnSCYGprej+vEEsSXwr=wo+ewle2d6shqytpp-dfpq3pm...@mail.gmail.com

Nicolai Stange (8):
  debugfs: add support for more elaborate ->d_fsdata
  debugfs: implement per-file removal protection
  debugfs: debugfs_real_fops(): drop __must_hold sparse annotation
  debugfs: convert to debugfs_file_get() and -put()
  IB/hfi1: convert to debugfs_file_get() and -put()
  debugfs: purge obsolete SRCU based removal protection
  debugfs: call debugfs_real_fops() only after debugfs_file_get()
  debugfs: defer debugfs_fsdata allocation to first usage

 drivers/infiniband/hw/hfi1/debugfs.c |  20 ++--
 fs/debugfs/file.c                    | 210 +--
 fs/debugfs/inode.c                   |  56 +++---
 fs/debugfs/internal.h                |  14 +++
 include/linux/debugfs.h              |  33 +-
 lib/Kconfig.debug                    |   1 -
 6 files changed, 196 insertions(+), 138 deletions(-)

--
2.13.6
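The per-file protection this series introduces pairs a per-file reference count taken by debugfs_file_get() with a wait-for-zero on the removal side. As a rough userspace analogue of that pattern (hypothetical names throughout; the actual kernel code lives in fs/debugfs/ and uses a refcount plus a completion rather than a pthread condition variable):

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Per-file state, loosely analogous to debugfs_fsdata. */
struct file_prot {
	pthread_mutex_t lock;
	pthread_cond_t drained;		/* signalled when active_users hits 0 */
	unsigned int active_users;
	bool removed;
};

#define FILE_PROT_INIT { PTHREAD_MUTEX_INITIALIZER, \
			 PTHREAD_COND_INITIALIZER, 0, false }

/* Analogue of debugfs_file_get(): refuses new users once removal began. */
static int file_get(struct file_prot *fp)
{
	int ret = 0;

	pthread_mutex_lock(&fp->lock);
	if (fp->removed)
		ret = -1;		/* like an error from debugfs_file_get() */
	else
		fp->active_users++;
	pthread_mutex_unlock(&fp->lock);
	return ret;
}

/* Analogue of debugfs_file_put(): wakes a waiting remover on the last put. */
static void file_put(struct file_prot *fp)
{
	pthread_mutex_lock(&fp->lock);
	if (--fp->active_users == 0)
		pthread_cond_broadcast(&fp->drained);
	pthread_mutex_unlock(&fp->lock);
}

/*
 * Analogue of the removal side: mark the file dead so no new users can
 * enter, then wait only for *this* file's users to drain.  An
 * indefinitely blocking fops on some other file no longer stalls us,
 * unlike with the former global debugfs_srcu.
 */
static void file_remove(struct file_prot *fp)
{
	pthread_mutex_lock(&fp->lock);
	fp->removed = true;
	while (fp->active_users)
		pthread_cond_wait(&fp->drained, &fp->lock);
	pthread_mutex_unlock(&fp->lock);
}
```

This is only a sketch of the get/put-versus-remove handshake, not the series' implementation; details such as the lockless fast path and the lazy debugfs_fsdata allocation from [8/8] are omitted.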
Re: [PATCH 2/3] PCI: don't use snprintf() in driver_override_show()
Bjorn Helgaas <helg...@kernel.org> writes:

> On Mon, Sep 11, 2017 at 04:55:11AM -0700, Greg Kroah-Hartman wrote:
>> On Mon, Sep 11, 2017 at 09:45:41AM +0200, Nicolai Stange wrote:
>> > Quote from Documentation/filesystems/sysfs.txt:
>> >
>> >   show() must not use snprintf() when formatting the value to be
>> >   returned to user space. If you can guarantee that an overflow
>> >   will never happen you can use sprintf() otherwise you must use
>> >   scnprintf().
>> >
>> > Commit 4efe874aace5 ("PCI: Don't read past the end of sysfs
>> > "driver_override" buffer") introduced such a snprintf() usage from
>> > driver_override_show() while at the same time tweaking
>> > driver_override_store() such that the write buffer can't ever get
>> > overflowed.
>> >
>> > Reasoning:
>> > Since the aforementioned commit, driver_override_store() only accepts
>> > written buffers less than PAGE_SIZE - 1 in size.
>> >
>> > The then kstrndup()'ed driver_override string will be at most
>> > PAGE_SIZE - 1 in length, including the trailing '\0'.
>> >
>> > After the addition of a '\n' in driver_override_show(), the result
>> > won't exceed PAGE_SIZE characters in length, again including the
>> > trailing '\0'.
>> >
>> > Hence, snprintf(buf, PAGE_SIZE, ...) and sprintf(buf, ...) are
>> > equivalent at this point.
>> >
>> > Replace the former by the latter in order to adhere to the rules in
>> > Documentation/filesystems/sysfs.txt.
>> >
>> > This is a style fix only and there's no change in functionality.
>> >
>> > Signed-off-by: Nicolai Stange <nsta...@suse.de>
>> > ---
>> >  drivers/pci/pci-sysfs.c | 2 +-
>> >  1 file changed, 1 insertion(+), 1 deletion(-)
>> >
>> > diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
>> > index 8e075ea2743e..43f7fbede448 100644
>> > --- a/drivers/pci/pci-sysfs.c
>> > +++ b/drivers/pci/pci-sysfs.c
>> > @@ -722,7 +722,7 @@ static ssize_t driver_override_show(struct device *dev,
>> >	ssize_t len;
>> >
>> >	device_lock(dev);
>> > -	len = snprintf(buf, PAGE_SIZE, "%s\n", pdev->driver_override);
>> > +	len = sprintf(buf, "%s\n", pdev->driver_override);
>>
>> While I'm all for changes like this, it's an uphill battle to change
>> them, usually it's best to just catch them before they go into the tree.

I did this patch mostly to demonstrate why excluding the sprintf() ->
snprintf() change from the port of 3d713e0e382e ("driver core: platform:
add device binding path 'driver_override'") to platform in [3/3] is safe.
I'm pretty sure that [3/3] would have been rejected if it had included
that chunk, due to the non-conformity to Documentation/.

>> Anyway, nice summary, very good job with that.
>>
>> Acked-by: Greg Kroah-Hartman <gre...@linuxfoundation.org>
>
> Why use snprintf() instead of scnprintf()?

s/snprintf/sprintf/, right?

> It looks like snprintf() is probably safe, but requires a bunch of
> analysis to prove that, while scnprintf() would be obviously safe.

Personal preference, I guess. Something like "using sprintf() implicitly
documents that the buffer is always large enough", i.e. that the size
check in driver_override_store() is already strict enough / correct.

Anyway, I don't have a strong opinion on that, so if you want me to
change this, I will, of course. Just note that Greg K-H has queued up
[3/3] somewhere already, and thus it would probably take two patches
rather than only this one to make PCI and platform look the same again,
provided that's what is wanted.

So, given that this patch has fulfilled its purpose already, I'm fine
with either of:
 - dropping it
 - taking it
 - s/sprintf/scnprintf/ and doing the same for platform.

Thanks,

Nicolai
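The sysfs rule being discussed rests on snprintf()'s return value: it reports the length the output *would* have had without truncation, not what was actually written, so a show() returning it to sysfs could claim more bytes than the buffer holds. The kernel's scnprintf() instead returns the truncated length. A small userspace demonstration (scnprintf() is kernel-only, so it is emulated here under the hypothetical name my_scnprintf()):

```c
#include <stdarg.h>
#include <stdio.h>

/*
 * Userspace emulation of the kernel's scnprintf(): like snprintf(),
 * but return the number of characters actually written into buf,
 * excluding the trailing '\0'.
 */
static int my_scnprintf(char *buf, size_t size, const char *fmt, ...)
{
	va_list args;
	int i;

	va_start(args, fmt);
	i = vsnprintf(buf, size, fmt, args);
	va_end(args);

	if (i < 0 || size == 0)
		return i < 0 ? i : 0;
	return (size_t)i < size ? i : (int)(size - 1);
}

/*
 * Format a 16-character result ("driver_override" plus '\n') into an
 * 8-byte buffer.  snprintf() reports the untruncated length, 16; the
 * scnprintf() emulation reports only the 7 characters that fit.
 */
static void demo(int *sn_len, int *scn_len)
{
	char buf[8];

	*sn_len = snprintf(buf, sizeof(buf), "%s\n", "driver_override");
	*scn_len = my_scnprintf(buf, sizeof(buf), "%s\n", "driver_override");
}
```

This is why the quoted Documentation/ rule allows sprintf() only when overflow is provably impossible (as the commit message's PAGE_SIZE accounting argues) and demands scnprintf() otherwise.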
[PATCH 0/3] make PCI's and platform's driver_override_store()/show() converge
Hi,

both drivers/pci/pci-sysfs.c and drivers/base/platform.c implement a
driver_override_store()/show() pair which used to coincide between these
two subsystems (up to ->driver_override's storage location). In the
meanwhile, each has received a fix which the other subsystem lacks:

  commit 6265539776a0 ("driver core: platform: fix race condition with
  driver_override")

and

  commit 4efe874aace5 ("PCI: Don't read past the end of sysfs
  "driver_override" buffer")

Port the missing fix to the other subsystem each.

There's also the style fixup [2/3], which isn't strictly needed but makes
both driver_override_show() variants look the same again.

Applies to next-20170908.

Note that checkpatch complains about commit references to the PCI commit
mentioned above; I believe this is a false positive caused by the
quotation marks in that commit's title.

Thanks,

Nicolai

Nicolai Stange (3):
  PCI: fix race condition with driver_override
  PCI: don't use snprintf() in driver_override_show()
  driver core: platform: Don't read past the end of "driver_override"
    buffer

 drivers/base/platform.c |  3 ++-
 drivers/pci/pci-sysfs.c | 11 +--
 2 files changed, 11 insertions(+), 3 deletions(-)

--
2.13.5