Re: [DISCUSSION PATCH 00/41] random: possible ways towards NIST SP800-90B compliance

2020-10-07 Thread Nicolai Stange
Eric Biggers  writes:

> On Fri, Oct 02, 2020 at 02:38:36PM +0200, Torsten Duwe wrote:
>>
>> Would some maintainer please comment on potential problems or
>> shortcomings?
>> 
>
> Well, very few people are experts in the Linux RNG *and* have time to review large
> patchsets, especially when three people are all proposing conflicting changes.
> And those that might be able to review these patches aren't necessarily
> interested in compliance with particular government standards.

To make it clear: I'm personally not really enthusiastic about some of
the restrictions imposed by SP800-90 either and Jason certainly has a
point with his concerns about "subpar crypto" ([1]). However, at the
same time I'm acknowledging that for some users FIPS compliance is
simply a necessity and I don't see a strong reason why that shouldn't be
supported, if doable without negatively affecting !fips_enabled users.


> Note that having multiple RNG implementations would cause fragmentation, more
> maintenance burden, etc.  So IMO, that should be a last resort.  Instead we
> should try to find an implementation that works for everyone.  I.e., at least
> to me, Nicolai's patchset seems more on the right track than Stephan's
> patchset...

I suppose that this concern about fragmentation is among the main
reasons for reservations against Stephan's LRNG patchset and that's why
I posted this RFC series here for comparison purposes. But note that, as
said ([2]), it's incomplete and the only intent was to provide at least
a rough idea of what it would take to move the current /dev/random
implementation towards SP800-90 -- I was hoping for either a hard NACK
or something along the lines of "maybe, go ahead and let's see".


> However, not everyone cares about "compliance".  So any changes for
> "compliance" either need to have a real technical argument for making the
> change, *or* need to be optional (e.g. controlled by fips_enabled).

Fully agreed.


> AFAICS, this patchset mostly just talks about NIST SP800-90B compliance, and
> doesn't make clear whether the changes make the RNG better, worse, or the same
> from an actual technical perspective.
>
> If that was properly explained, and if the answer was "better" or at least
> "not worse", I expect that people would be more interested.

The goal was not to negatively affect !fips_enabled users, but as
outlined in the cover letter ([2]), a performance impact had been
measured on ARMv7. This could probably be sorted out, but I see no point
in doing it at this stage, because
- there's still quite some stuff missing for full SP800-90 compliance
  anyway, c.f. the overview at the end of [2] and
- such optimizations would have bloated this patchset even more,
  e.g. for making fips_enabled a static_key (see the sketch below), which
  should certainly go into a separate series.
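
For illustration only, a minimal sketch of what a static_key based
fips_enabled check could look like -- the key name and the init hook are
made up here and are not part of this series:

  /* Hypothetical sketch, not part of this patchset. */
  #include <linux/fips.h>
  #include <linux/jump_label.h>

  DEFINE_STATIC_KEY_FALSE(fips_enabled_key);

  static int __init fips_enabled_key_init(void)
  {
          /* Mirror the existing fips_enabled boolean into a static key. */
          if (fips_enabled)
                  static_branch_enable(&fips_enabled_key);
          return 0;
  }
  early_initcall(fips_enabled_key_init);

  static void example_hot_path(void)
  {
          /* Patched-in branch, no memory load on the common path. */
          if (static_branch_unlikely(&fips_enabled_key)) {
                  /* FIPS-only processing would go here. */
          }
  }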

User visible effects set aside, an obvious downside of SP800-90
compliance would be the increase in code size and the associated
maintenance burden.

That being said, I can imagine that those boot health tests could also
get enabled for !fips_enabled users in the future, if wanted: rather
than inhibiting /dev/random output on failure, a warning would get
logged instead.  Whether or not this would be seen as an improvement
is for others to judge though.
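
To make that idea a bit more concrete, a purely hypothetical sketch (not
part of this series) of such a warn-only mode in add_interrupt_randomness()
could look like:

  /* Hypothetical sketch only -- not implemented by this patchset. */
  if (health_result == health_discard) {
          if (fips_enabled) {
                  /* Inhibit dispatch and drop the queued entropy. */
                  fast_pool->discard_needed = true;
          } else {
                  /* Warn-only mode for !fips_enabled users. */
                  pr_warn_ratelimited("random: health test failure on CPU %d\n",
                                      smp_processor_id());
          }
  }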

Thanks,

Nicolai


[1] 
https://lkml.kernel.org/r/cahmme9rmxorfxtwdac8yxj+h9gytjj6dpvcxa-jmaagyop+...@mail.gmail.com
[2] https://lkml.kernel.org/r/20200921075857.4424-1-nsta...@suse.de

-- 
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany
(HRB 36809, AG Nürnberg), GF: Felix Imendörffer


[RFC PATCH 03/41] random: prune dead assignment to entropy_bits in credit_entropy_bits()

2020-09-21 Thread Nicolai Stange
Since commit 90ea1c6436d2 ("random: remove the blocking pool"), the
last assignment to 'entropy_bits' is dead. Remove it.

While at it, move the declaration of 'entropy_bits' one scope down and make
it const: it's not used anywhere outside and never updated after
initialization.

Signed-off-by: Nicolai Stange 
---
 drivers/char/random.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index c4b7bdbd460e..14c39608cc17 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -716,13 +716,12 @@ static void credit_entropy_bits(struct entropy_store *r, 
int nbits)
  entropy_count >> ENTROPY_SHIFT, _RET_IP_);
 
if (r == &input_pool) {
-   int entropy_bits = entropy_count >> ENTROPY_SHIFT;
-
if (crng_init < 2) {
+   const int entropy_bits = entropy_count >> ENTROPY_SHIFT;
+
if (entropy_bits < 128)
return;
crng_reseed(&primary_crng, r);
-   entropy_bits = ENTROPY_BITS(r);
}
}
 }
-- 
2.26.2



[RFC PATCH 04/41] random: drop 'reserved' parameter from extract_entropy()

2020-09-21 Thread Nicolai Stange
Since commit 43d8a72cd985 ("random: remove variable limit") all call
sites of extract_entropy() pass in zero for the 'reserved' argument
and the corresponding code in account() is effectively dead.

Remove it and drop the now unused 'reserved' argument from
extract_entropy() as well as from account() called therefrom.

Signed-off-by: Nicolai Stange 
---
 drivers/char/random.c | 17 ++---
 1 file changed, 6 insertions(+), 11 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 14c39608cc17..35e381be20fe 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -506,7 +506,7 @@ struct entropy_store {
 };
 
 static ssize_t extract_entropy(struct entropy_store *r, void *buf,
-  size_t nbytes, int min, int rsvd);
+  size_t nbytes, int min);
 static ssize_t _extract_entropy(struct entropy_store *r, void *buf,
size_t nbytes, int fips);
 
@@ -944,7 +944,7 @@ static void crng_reseed(struct crng_state *crng, struct 
entropy_store *r)
} buf;
 
if (r) {
-   num = extract_entropy(r, &buf, 32, 16, 0);
+   num = extract_entropy(r, &buf, 32, 16);
if (num == 0)
return;
} else {
@@ -1330,8 +1330,7 @@ EXPORT_SYMBOL_GPL(add_disk_randomness);
  * This function decides how many bytes to actually take from the
  * given pool, and also debits the entropy count accordingly.
  */
-static size_t account(struct entropy_store *r, size_t nbytes, int min,
- int reserved)
+static size_t account(struct entropy_store *r, size_t nbytes, int min)
 {
int entropy_count, orig, have_bytes;
size_t ibytes, nfrac;
@@ -1345,8 +1344,6 @@ static size_t account(struct entropy_store *r, size_t 
nbytes, int min,
/* never pull more than available */
have_bytes = entropy_count >> (ENTROPY_SHIFT + 3);
 
-   if ((have_bytes -= reserved) < 0)
-   have_bytes = 0;
ibytes = min_t(size_t, ibytes, have_bytes);
if (ibytes < min)
ibytes = 0;
@@ -1469,12 +1466,10 @@ static ssize_t _extract_entropy(struct entropy_store 
*r, void *buf,
  * returns it in a buffer.
  *
  * The min parameter specifies the minimum amount we can pull before
- * failing to avoid races that defeat catastrophic reseeding while the
- * reserved parameter indicates how much entropy we must leave in the
- * pool after each pull to avoid starving other readers.
+ * failing to avoid races that defeat catastrophic reseeding.
  */
 static ssize_t extract_entropy(struct entropy_store *r, void *buf,
-size_t nbytes, int min, int reserved)
+size_t nbytes, int min)
 {
__u8 tmp[EXTRACT_SIZE];
unsigned long flags;
@@ -1495,7 +1490,7 @@ static ssize_t extract_entropy(struct entropy_store *r, 
void *buf,
}
 
trace_extract_entropy(r->name, nbytes, ENTROPY_BITS(r), _RET_IP_);
-   nbytes = account(r, nbytes, min, reserved);
+   nbytes = account(r, nbytes, min);
 
return _extract_entropy(r, buf, nbytes, fips_enabled);
 }
-- 
2.26.2



[RFC PATCH 40/41] random: trigger startup health test on any failure of the health tests

2020-09-21 Thread Nicolai Stange
The startup health tests to be executed at boot as required by NIST 800-90B
consist of running the continuous health tests, i.e. the Adaptive Proportion
Test (APT) and the Repetition Count Test (RCT), until a certain amount
of noise samples have been examined. In case of test failure during this
period, the startup tests would get restarted by means of reinitializing
the fast_pool's ->warmup member with the original number of total samples
to examine during startup.

A future patch will enable dynamically switching from the initial H=1 or
1/8 per-IRQ min-entropy estimates to lower values upon health test
failures in order to keep those systems going where these more or less
arbitrary per-IRQ entropy estimates turn out to simply be wrong. It is
certainly desirable to restart the startup health tests upon such a switch.

In order to keep the upcoming code comprehensible, move the startup test
restart logic from health_test_process() into add_interrupt_randomness().
For simplicity, make add_interrupt_randomness() trigger a startup test on
each health test failure. Note that there's a change in behaviour: up to
now, only the boot-time startup tests would have restarted themselves upon
failure, whereas now even a failure of the continuous health tests can
potentially trigger a startup test long after boot.

Note that as it currently stands, rerunning the full startup tests after
the crng has received its initial seed has the sole effect of inhibiting
entropy dispatch for a while and thus of potentially delaying those best
effort crng reseeds during runtime. As reseeds never reduce a crng state's
entropy, this behaviour is admittedly questionable. However, further
patches introducing forced reseeds might perhaps become necessary in the
future, c.f. the specification of "reseed_interval" in NIST SP800-90A.
Thus, it's better to keep the startup health test restart logic consistent
for now.

Signed-off-by: Nicolai Stange 
---
 drivers/char/random.c | 12 
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 86dd87588b1b..bb79dcb96882 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -1098,8 +1098,6 @@ health_test_process(struct health_test *h, unsigned int 
event_entropy_shift,
 * Something is really off, get_cycles() has become
 * (or always been) a constant.
 */
-   if (h->warmup)
-   health_test_reset(h, event_entropy_shift);
return health_discard;
}
 
@@ -1110,8 +1108,6 @@ health_test_process(struct health_test *h, unsigned int 
event_entropy_shift,
 */
apt = health_test_apt(h, event_entropy_shift, sample_delta);
if (unlikely(h->warmup) && --h->warmup) {
-   if (apt == health_discard)
-   health_test_reset(h, event_entropy_shift);
/*
 * Don't allow the caller to dispatch until warmup
 * has completed.
@@ -1928,6 +1924,14 @@ void add_interrupt_randomness(int irq, int irq_flags)
health_test_process(&fast_pool->health,
fast_pool->event_entropy_shift,
cycles);
+   if (unlikely(health_result == health_discard)) {
+   /*
+* Oops, something's odd. Restart the startup
+* tests.
+*/
+   health_test_reset(&fast_pool->health,
+ fast_pool->event_entropy_shift);
+   }
}
 
if (unlikely(crng_init == 0)) {
-- 
2.26.2



[RFC PATCH 30/41] random: add a queued_entropy instance to struct fast_pool

2020-09-21 Thread Nicolai Stange
When health tests are introduced with upcoming patches, it will become
necessary to keep entropy queued across add_interrupt_randomness()
invocations for later dispatch to the global balance.

Prepare for this by adding a struct queued_entropy member to the per-CPU
fast_pool. Use it in place of that queue with automatic storage duration
in add_interrupt_randomness().

Signed-off-by: Nicolai Stange 
---
 drivers/char/random.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 55e784a5a2ec..37746df53acf 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -885,6 +885,7 @@ struct fast_pool {
unsigned short  reg_idx;
unsigned char   count;
int event_entropy_shift;
+   struct queued_entropy   q;
 };
 
 /*
@@ -1655,7 +1656,7 @@ void add_interrupt_randomness(int irq, int irq_flags)
__u32   c_high, j_high;
__u64   ip;
bool    reseed;
-   struct queued_entropy   q = { 0 };
+   struct queued_entropy   *q = &fast_pool->q;
unsigned intnfrac;
 
if (cycles == 0)
@@ -1700,9 +1701,9 @@ void add_interrupt_randomness(int irq, int irq_flags)
nfrac = fast_pool_entropy(fast_pool->count,
  fast_pool->event_entropy_shift);
}
-   __queue_entropy(r, &q, nfrac);
+   __queue_entropy(r, q, nfrac);
__mix_pool_bytes(r, &fast_pool->pool, sizeof(fast_pool->pool));
-   reseed = __dispatch_queued_entropy_fast(r, &q);
+   reseed = __dispatch_queued_entropy_fast(r, q);
spin_unlock(&r->lock);
 
fast_pool->last = now;
-- 
2.26.2



[RFC PATCH 02/41] random: remove dead code for nbits < 0 in credit_entropy_bits()

2020-09-21 Thread Nicolai Stange
The nbits argument to credit_entropy_bits() is never negative and
the branch handling it is dead code. Remove it.

The code for handling the regular nbits > 0 case used to live in the
corresponding else branch, but has now been lifted up to function scope.
Move the declaration of 'pnfrac' to the function prologue in order to
adhere to C99 rules. Likewise, move the declaration of 's' into the
loop body, the only scope it's referenced from.

Signed-off-by: Nicolai Stange 
---
 drivers/char/random.c | 69 ---
 1 file changed, 32 insertions(+), 37 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 0580968fd28c..c4b7bdbd460e 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -654,7 +654,7 @@ static void process_random_ready_list(void)
 }
 
 /*
- * Credit (or debit) the entropy store with n bits of entropy.
+ * Credit the entropy store with n bits of entropy.
  * Use credit_entropy_bits_safe() if the value comes from userspace
  * or otherwise should be checked for extreme values.
  */
@@ -663,50 +663,45 @@ static void credit_entropy_bits(struct entropy_store *r, 
int nbits)
int entropy_count, orig;
const int pool_size = r->poolinfo->poolfracbits;
int nfrac = nbits << ENTROPY_SHIFT;
+   int pnfrac;
 
if (!nbits)
return;
 
 retry:
entropy_count = orig = READ_ONCE(r->entropy_count);
-   if (nfrac < 0) {
-   /* Debit */
-   entropy_count += nfrac;
-   } else {
-   /*
-* Credit: we have to account for the possibility of
-* overwriting already present entropy.  Even in the
-* ideal case of pure Shannon entropy, new contributions
-* approach the full value asymptotically:
-*
-* entropy <- entropy + (pool_size - entropy) *
-*  (1 - exp(-add_entropy/pool_size))
-*
-* For add_entropy <= pool_size/2 then
-* (1 - exp(-add_entropy/pool_size)) >=
-*(add_entropy/pool_size)*0.7869...
-* so we can approximate the exponential with
-* 3/4*add_entropy/pool_size and still be on the
-* safe side by adding at most pool_size/2 at a time.
-*
-* The use of pool_size-2 in the while statement is to
-* prevent rounding artifacts from making the loop
-* arbitrarily long; this limits the loop to log2(pool_size)*2
-* turns no matter how large nbits is.
-*/
-   int pnfrac = nfrac;
-   const int s = r->poolinfo->poolbitshift + ENTROPY_SHIFT + 2;
+   /*
+* Credit: we have to account for the possibility of
+* overwriting already present entropy.  Even in the
+* ideal case of pure Shannon entropy, new contributions
+* approach the full value asymptotically:
+*
+* entropy <- entropy + (pool_size - entropy) *
+*  (1 - exp(-add_entropy/pool_size))
+*
+* For add_entropy <= pool_size/2 then
+* (1 - exp(-add_entropy/pool_size)) >=
+*(add_entropy/pool_size)*0.7869...
+* so we can approximate the exponential with
+* 3/4*add_entropy/pool_size and still be on the
+* safe side by adding at most pool_size/2 at a time.
+*
+* The use of pool_size-2 in the while statement is to
+* prevent rounding artifacts from making the loop
+* arbitrarily long; this limits the loop to log2(pool_size)*2
+* turns no matter how large nbits is.
+*/
+   pnfrac = nfrac;
+   do {
/* The +2 corresponds to the /4 in the denominator */
+   const int s = r->poolinfo->poolbitshift + ENTROPY_SHIFT + 2;
+   unsigned int anfrac = min(pnfrac, pool_size/2);
+   unsigned int add =
+   ((pool_size - entropy_count)*anfrac*3) >> s;
 
-   do {
-   unsigned int anfrac = min(pnfrac, pool_size/2);
-   unsigned int add =
-   ((pool_size - entropy_count)*anfrac*3) >> s;
-
-   entropy_count += add;
-   pnfrac -= anfrac;
-   } while (unlikely(entropy_count < pool_size-2 && pnfrac));
-   }
+   entropy_count += add;
+   pnfrac -= anfrac;
+   } while (unlikely(entropy_count < pool_size-2 && pnfrac));
 
if (WARN_ON(entropy_count < 0)) {
pr_warn("negative entropy/overflow: pool %s count %d\n",
-- 
2.26.2



[RFC PATCH 09/41] random: protect ->entropy_count with the pool spinlock

2020-09-21 Thread Nicolai Stange
Currently, all updates to ->entropy_count are synchronized by means of
cmpxchg-retry loops found in credit_entropy_bits(),
__credit_entropy_bits_fast() and account() respectively.

However, all but one of the __credit_entropy_bits_fast() call sites grab the
pool ->lock already and it would be nice if the potentially costly cmpxchg
could be avoided in these performance critical paths. In addition to that,
future patches will introduce new fields to struct entropy_store which will
require some kind of synchronization with ->entropy_count updates from
said producer paths as well.

Protect ->entropy_count with the pool ->lock.

- Make callers of __credit_entropy_bits_fast() invoke it with the
  pool ->lock held. Extend existing critical sections where possible.
  Drop the cmpxchg-retry loop in __credit_entropy_bits_fast() in favor of
  a plain assignment.
- Retain the retry loop in credit_entropy_bits(): the potentially
  expensive pool_entropy_delta() should not be called under the lock in
  order to not unnecessarily block contenders. In order to continue to
  synchronize with  __credit_entropy_bits_fast() and account(), the
  cmpxchg gets replaced by a plain comparison + store with the ->lock being
  held.
- Make account() grab the ->lock and drop the cmpxchg-retry loop in favor
  of a plain assignment.

Signed-off-by: Nicolai Stange 
---
 drivers/char/random.c | 44 +--
 1 file changed, 30 insertions(+), 14 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index d9e4dd27d45d..9f87332b158f 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -718,7 +718,7 @@ static unsigned int pool_entropy_delta(struct entropy_store 
*r,
  * Credit the entropy store with n bits of entropy.
  * To be used from hot paths when it is either known that nbits is
  * smaller than one half of the pool size or losing anything beyond that
- * doesn't matter.
+ * doesn't matter. Must be called with r->lock being held.
  */
 static bool __credit_entropy_bits_fast(struct entropy_store *r, int nbits)
 {
@@ -727,13 +727,11 @@ static bool __credit_entropy_bits_fast(struct 
entropy_store *r, int nbits)
if (!nbits)
return false;
 
-retry:
-   orig = READ_ONCE(r->entropy_count);
+   orig = r->entropy_count;
entropy_count = orig + pool_entropy_delta(r, orig,
  nbits << ENTROPY_SHIFT,
  true);
-   if (cmpxchg(&r->entropy_count, orig, entropy_count) != orig)
-   goto retry;
+   WRITE_ONCE(r->entropy_count, entropy_count);
 
trace_credit_entropy_bits(r->name, nbits,
  entropy_count >> ENTROPY_SHIFT, _RET_IP_);
@@ -755,17 +753,28 @@ static bool __credit_entropy_bits_fast(struct 
entropy_store *r, int nbits)
 static void credit_entropy_bits(struct entropy_store *r, int nbits)
 {
int entropy_count, orig;
+   unsigned long flags;
 
if (!nbits)
return;
 
 retry:
+   /*
+* Don't run the potentially expensive pool_entropy_delta()
+* calculations under the spinlock. Instead retry until
+* ->entropy_count becomes stable.
+*/
orig = READ_ONCE(r->entropy_count);
entropy_count = orig + pool_entropy_delta(r, orig,
  nbits << ENTROPY_SHIFT,
  false);
-   if (cmpxchg(&r->entropy_count, orig, entropy_count) != orig)
+   spin_lock_irqsave(&r->lock, flags);
+   if (r->entropy_count != orig) {
+   spin_unlock_irqrestore(&r->lock, flags);
goto retry;
+   }
+   WRITE_ONCE(r->entropy_count, entropy_count);
+   spin_unlock_irqrestore(&r->lock, flags);
 
trace_credit_entropy_bits(r->name, nbits,
  entropy_count >> ENTROPY_SHIFT, _RET_IP_);
@@ -1203,12 +1212,11 @@ static void add_timer_randomness(struct 
timer_rand_state *state, unsigned num)
} sample;
long delta, delta2, delta3;
bool reseed;
+   unsigned long flags;
 
sample.jiffies = jiffies;
sample.cycles = random_get_entropy();
sample.num = num;
-   r = &input_pool;
-   mix_pool_bytes(r, &sample, sizeof(sample));
 
/*
 * Calculate number of bits of randomness we probably added.
@@ -1235,12 +1243,16 @@ static void add_timer_randomness(struct 
timer_rand_state *state, unsigned num)
if (delta > delta3)
delta = delta3;
 
+   r = &input_pool;
+   spin_lock_irqsave(&r->lock, flags);
+   __mix_pool_bytes(r, &sample, sizeof(sample));
/*
 * delta is now minimum absolute delta.
 * Round down by 1 bit on general principles,
 * and limit entropy estimate to 12 bits.
 */
reseed = __cred

[RFC PATCH 29/41] random: move definition of struct queued_entropy and related API upwards

2020-09-21 Thread Nicolai Stange
The next patch will add a member of type struct queued_entropy to struct
fast_pool and thus, the former's definition needs to be visible at the
latter's.

Move the definition of struct queued_entropy upwards in the file so that
it comes before the definition struct fast_pool. Move the associated
function definitions as well in order to keep everything together.

Note that said function definitions had originally been inserted at the
old location with the intent to minimize their introducing patch's diff
by placing them near the now removed credit_entropy_delta() they
superseded.

Signed-off-by: Nicolai Stange 
---
 drivers/char/random.c | 243 +-
 1 file changed, 124 insertions(+), 119 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 680ccc82a436..55e784a5a2ec 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -519,6 +519,10 @@ static ssize_t extract_entropy(struct entropy_store *r, 
void *buf,
 static ssize_t _extract_entropy(struct entropy_store *r, void *buf,
size_t nbytes, int fips);
 
+static unsigned int pool_entropy_delta(struct entropy_store *r,
+  int base_entropy_count,
+  int nfrac, bool fast);
+
 static int min_crng_reseed_pool_entropy(void);
 
 static void crng_reseed(struct crng_state *crng, struct entropy_store *r);
@@ -610,125 +614,6 @@ static void mix_pool_bytes(struct entropy_store *r, const 
void *in,
spin_unlock_irqrestore(&r->lock, flags);
 }
 
-struct fast_pool {
-   __u32   pool[4];
-   unsigned long   last;
-   unsigned short  reg_idx;
-   unsigned char   count;
-   int event_entropy_shift;
-};
-
-/*
- * This is a fast mixing routine used by the interrupt randomness
- * collector.  It's hardcoded for an 128 bit pool and assumes that any
- * locks that might be needed are taken by the caller.
- */
-static void fast_mix(struct fast_pool *f)
-{
-   __u32 a = f->pool[0],   b = f->pool[1];
-   __u32 c = f->pool[2],   d = f->pool[3];
-
-   a += b; c += d;
-   b = rol32(b, 6);d = rol32(d, 27);
-   d ^= a; b ^= c;
-
-   a += b; c += d;
-   b = rol32(b, 16);   d = rol32(d, 14);
-   d ^= a; b ^= c;
-
-   a += b; c += d;
-   b = rol32(b, 6);d = rol32(d, 27);
-   d ^= a; b ^= c;
-
-   a += b; c += d;
-   b = rol32(b, 16);   d = rol32(d, 14);
-   d ^= a; b ^= c;
-
-   f->pool[0] = a;  f->pool[1] = b;
-   f->pool[2] = c;  f->pool[3] = d;
-   f->count++;
-}
-
-static void process_random_ready_list(void)
-{
-   unsigned long flags;
-   struct random_ready_callback *rdy, *tmp;
-
-   spin_lock_irqsave(&random_ready_list_lock, flags);
-   list_for_each_entry_safe(rdy, tmp, &random_ready_list, list) {
-   struct module *owner = rdy->owner;
-
-   list_del_init(&rdy->list);
-   rdy->func(rdy);
-   module_put(owner);
-   }
-   spin_unlock_irqrestore(&random_ready_list_lock, flags);
-}
-
-/*
- * Based on the pool's current entropy fill level, specified as
- * base_entropy_count, and the number of new entropy bits in units of
- * 2^-ENTROPY_SHIFT to add, return the amount of new entropy to
- * credit. If the 'fast' parameter is set to true, the calculation
- * will be guaranteed to terminate quickly, but this comes at the
- * expense of capping nbits to one half of the pool size.
- */
-static unsigned int pool_entropy_delta(struct entropy_store *r,
-  int base_entropy_count,
-  int nfrac, bool fast)
-{
-   const int pool_size = r->poolinfo->poolfracbits;
-   int entropy_count = base_entropy_count;
-
-   if (!nfrac)
-   return 0;
-
-   if (pool_size <= base_entropy_count)
-   return 0;
-
-   /*
-* Credit: we have to account for the possibility of
-* overwriting already present entropy.  Even in the
-* ideal case of pure Shannon entropy, new contributions
-* approach the full value asymptotically:
-*
-* entropy <- entropy + (pool_size - entropy) *
-*  (1 - exp(-add_entropy/pool_size))
-*
-* For add_entropy <= pool_size/2 then
-* (1 - exp(-add_entropy/pool_size)) >=
-*(add_entropy/pool_size)*0.7869...
-* so we can approximate the exponential with
-* 3/4*add_entropy/pool_size and still be on the
-* safe side by adding at most pool_size/2 at a time.
-*
-* The use of pool_size-2 in the while statement is to
-* prevent rounding artifacts from making the loop
-* arbitrarily long; this limits the loop 

[RFC PATCH 39/41] random: make the startup tests include multiple APT invocations

2020-09-21 Thread Nicolai Stange
Given a per-IRQ min-entropy estimate of H, the Adaptive Proportion Test
(APT) will need to consume at most n = 128/H samples before reaching a
conclusion. The supported values for H are 1, 1/2, 1/4, 1/8, ..., 1/64,
but only 1 and 1/8 are currently being actively used on systems with and
without a high resolution get_cycles() respectively. The corresponding
numbers of samples consumed by one APT execution are 128, 256, 512, 1024,
2048, 4096 and 8192. Currently, the ->warmup parameter used for controlling
the startup is hardcoded to be initialized to 1024 and the health test
logic won't permit the caller, i.e. add_interrupt_randomness() to dispatch
any entropy to the global balance until that many events have been
processed *and* the first APT has completed, whichever comes later. It
would take roughly eight successful APT invocations for H=1 until the
startup sequence has completed, but for all H <= 1/8, the ->warmup logic is
effectively useless because the first APT would always need to process
>= 1024 samples anyway.

The probabilities of one single APT invocation successfully detecting a
degradation of the per-IRQ min-entropy to H/2 ("power") are as follows for
the different supported H estimates:

       H      n   power
  ----------------------
       1    128   64.7%
     1/2    256   79.1%
     1/4    512   81.6%
     1/8   1024   84.0%
    1/16   2048   84.9%
    1/32   4096   86.9%
    1/64   8192   86.4%

Thus, for H=1, the probability that at least one out of those eight APT
invocations will detect a degradation to H/2 is 1 - (1 - 64.7%)^8 = 99.98%,
which is quite good. OTOH, the 84.0% achievable with the single APT
invocation for H = 1/8 is only semi-satisfactory.

Note that as it currently stands, the only point in time where the health
tests can still intervene and keep back low quality noise from
the primary_crng is before the initial seed has happened. Afterwards,
failing continuous health tests would only potentially delay those best
effort reseeds (which is questionable behaviour in itself, as the crng
state's entropy is never reduced in the course of reseeding).

A future patch will enable dynamically switching from the initial H=1 or
1/8 resp. to lower per-IRQ entropy values upon health test failures in
order to keep those systems going where these more or less arbitrary
per-IRQ entropy estimates turn out to be simply wrong. From a paranoia POV,
it is certainly a good idea to run the APT several times in a row during
startup in order to achieve a good statistical power. Extending the warmup
to cover the larger of the 1024 events required by NIST SP800-90B and four
full APT lengths will result in a combined probability of detecting an
entropy degradation to H/2 of >= 99.98% across all supported values of H.
The obvious downside is that the number of IRQ events required for the
initial seed will be quadrupled, at least for H <= 1/8.

Follow this approach. Amend health_test_reset()'s signature by an
additional parameter, event_entropy_shift, and make it set ->warmup to
the larger of 1024 and 4 * 128 / (2^-event_entropy_shift). Adjust all
call sites accordingly.
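
For illustration, the ->warmup values resulting from this formula, i.e.
max(1024, 4 * 128 / H) with H = 2^-event_entropy_shift, work out as follows
(derived here for convenience, not spelled out in the patch):

       H   4 * 128/H   ->warmup
  -----------------------------
       1         512       1024
     1/2        1024       1024
     1/4        2048       2048
     1/8        4096       4096
    1/16        8192       8192
    1/32       16384      16384
    1/64       32768      32768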

Signed-off-by: Nicolai Stange 
---
 drivers/char/random.c | 23 +++
 1 file changed, 15 insertions(+), 8 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index bd8c24e433d0..86dd87588b1b 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -1058,14 +1058,21 @@ health_test_apt(struct health_test *h, unsigned int 
event_entropy_shift,
return health_queue;
 }
 
-static void health_test_reset(struct health_test *h)
+static void health_test_reset(struct health_test *h,
+ unsigned int event_entropy_shift)
 {
/*
-* Don't dispatch until at least 1024 events have been
-* processed by the continuous health tests as required by
-* NIST SP800-90B for the startup tests.
+* Let H = 2^-event_entropy_shift equal the estimated per-IRQ
+* min-entropy. One APT will consume at most 128 / H samples
+* until completion. Run the startup tests for the larger of
+* 1024 events as required by NIST or four times the APT
+* length. In either case, the combined probability of the
+* resulting number of successive APTs to detect a degradation
+* of H to H/2 will be >= 99.8%, for any supported value of
+* event_entropy_shift.
 */
-   h->warmup = 1024;
+   h->warmup = 4 * (128 << event_entropy_shift);
+   h->warmup = max_t(unsigned int, h->warmup, 1024);
 
health_apt_reset(h);
 }
@@ -1092,7 +1099,7 @@ health_test_process(struct health_test *h, unsigned int 
event_entropy_shift,
 * (or always been) a constant.
 */
if (h->warmup)
-   health_test_reset(h);
+   health_test_reset(h, event_entropy_shift);
return health_discard;
}
 
@@ -1104

[RFC PATCH 33/41] random: make health_test_process() maintain the get_cycles() delta

2020-09-21 Thread Nicolai Stange
The min-entropy estimate has been made for the lower eight bits of the
deltas between cycle counter values from successive IRQ events and thus,
the upcoming health tests should actually be run on these deltas.

Introduce a new field ->previous_sample to struct health_test for storing
the previous get_cycles() value. Make health_test_process() maintain it
and also calculate the delta between the current and the previous value
at this point already in preparation to passing it to the upcoming health
tests. Note that ->previous_sample is deliberately not touched from
health_test_reset() in order to maintain a steady flow of correctly
calculated deltas across health test resets.

Signed-off-by: Nicolai Stange 
---
 drivers/char/random.c | 14 +-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index cb6441b96b8e..33f9b7b59f92 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -879,7 +879,9 @@ static void discard_queued_entropy(struct entropy_store *r,
spin_unlock_irqrestore(&r->lock, flags);
 }
 
-struct health_test {};
+struct health_test {
+   u8 previous_sample;
+};
 
 enum health_result {
health_none,
@@ -895,6 +897,16 @@ static enum health_result
 health_test_process(struct health_test *h, unsigned int event_entropy_shift,
u8 sample)
 {
+   u8 sample_delta;
+
+   /*
+* The min-entropy estimate has been made for the lower eight
+* bits of the deltas between cycle counter values from
+* successive IRQ events.
+*/
+   sample_delta = sample - h->previous_sample;
+   h->previous_sample = sample;
+
return health_none;
 }
 
-- 
2.26.2



[RFC PATCH 37/41] random: implement the "Repetition Count" NIST SP800-90B health test

2020-09-21 Thread Nicolai Stange
The "Repetition Count Test" (RCT) as specified by NIST SP800-90B simply
counts the number of times the same sample value has been observed and
reports failure if a highly unlikely threshold is exceeded. The exact
values of the latter depend on the estimated per-IRQ min-entropy H as well
as on the upper bounds set on the probability of false positives. For the
latter, a maximum value of 2^-20 is recommended and with this value the
threshold can be calculated as 1 + ceil(20 / H). It should be noted that
the RCT has very poor statistical power and is only intended to detect
catastrophic noise source failures, like the get_cycles() in
add_interrupt_randomness() always returning the same constant.
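
For illustration, plugging in the two actively used estimates (derived from
the formula above, not spelled out separately in the patch): for H = 1 the
cut-off is 1 + ceil(20 / 1) = 21 identical deltas in a row, and for H = 1/8
it is 1 + ceil(20 / (1/8)) = 161. In the code below this corresponds to
c = 20 << event_entropy_shift, with ->rct_count only counting the
repetitions following the first observation of a given delta.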

Add the fields needed for maintaining the RCT state to struct health_test:
->rct_previous_delta for storing the previous sample value and ->rct_count
for keeping track of how many times this value has been observed in a row
so far.

Implement the RCT and wrap it in a new function, health_test_rct().

Make the health test entry point, health_test_process(), call it early
before invoking the APT and forward failure reports to the caller. All
other return codes from the RCT are ignored, because
- as said, the statistical power is weak and a positive outcome wouldn't
  tell us anything, and
- it's not desirable to make the caller, i.e. add_interrupt_randomness(),
  queue up any further entropy once the concurrently running APT has
  signaled a successful completion.

Signed-off-by: Nicolai Stange 
---
 drivers/char/random.c | 55 +++
 1 file changed, 55 insertions(+)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 2c744d2a9b26..54ee082ca4a8 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -890,6 +890,9 @@ struct health_test {
};
 
u8 previous_sample;
+
+   u8 rct_previous_delta;
+   unsigned short rct_count;
 };
 
 enum health_result {
@@ -899,6 +902,43 @@ enum health_result {
health_discard,
 };
 
+/* Repetition count test. */
+static enum health_result
+health_test_rct(struct health_test *h, unsigned int event_entropy_shift,
+   u8 sample_delta)
+{
+   unsigned int c;
+
+   if (likely(sample_delta != h->rct_previous_delta)) {
+   h->rct_previous_delta = sample_delta;
+   h->rct_count = 0;
+   return health_dispatch;
+   }
+
+   h->rct_count++;
+   if (!h->rct_count) {
+   /* Overflow. */
+   h->rct_count = -1;
+   }
+
+   /*
+* With a min-entropy of H = 2^-event_entropy_shift bits per
+* event, the maximum probability of seing any particular
+* sample value (i.e. delta) is bounded by 2^-H. Thus, the
+* probability to observe the same events C times in a row is
+* less than 2^-((C - 1) * H). Limit the false positive rate
+* of the repetition count test to 2^-20, which yields a
+* cut-off value of C = 1 + 20/H. Note that the actual number
+* of repetitions equals ->rct_count + 1, so this offset by
+* one must be accounted for in the comparison below.
+*/
+   c = 20 << event_entropy_shift;
+   if (h->rct_count >= c)
+   return health_discard;
+
+   return health_queue;
+}
+
 /* Adaptive Proportion Test */
 #define HEALTH_APT_PRESEARCH_EVENT_COUNT 7
 
@@ -1027,6 +1067,7 @@ health_test_process(struct health_test *h, unsigned int 
event_entropy_shift,
u8 sample)
 {
u8 sample_delta;
+   enum health_result rct;
 
/*
 * The min-entropy estimate has been made for the lower eight
@@ -1036,6 +1077,20 @@ health_test_process(struct health_test *h, unsigned int 
event_entropy_shift,
sample_delta = sample - h->previous_sample;
h->previous_sample = sample;
 
+   rct = health_test_rct(h, event_entropy_shift, sample_delta);
+   if (rct == health_discard) {
+   /*
+* Something is really off, get_cycles() has become
+* (or always been) a constant.
+*/
+   return health_discard;
+   }
+
+   /*
+* Otherwise return whatever the APT returns. In particular,
+* don't care about whether the RCT needs to consume more
+* samples to complete.
+*/
return health_test_apt(h, event_entropy_shift, sample_delta);
 }
 
-- 
2.26.2



[RFC PATCH 28/41] random: don't award entropy to disk + input events if in FIPS mode

2020-09-21 Thread Nicolai Stange
NIST SP800-90C prohibits the use of multiple correlated entropy sources.

Obviously, add_disk_randomness(), add_input_randomness() and
add_interrupt_randomness() are not independent.

Follow the approach taken by Stephan Müller's LRNG patchset ([1]) and don't
award any entropy to the former two if fips_enabled is true.
Note that the entropy loss has already been compensated for by a previous
patch increasing the IRQ event estimate.

The actual entropy accounting from add_disk_randomness() and
add_input_randomness() is implemented in the common add_timer_randomness()
called therefrom.

Make the latter not dispatch any entropy to the global entropy balance
if fips_enabled is on.

[1] https://lkml.kernel.org/r/5695397.lov4wx5...@positron.chronox.de

Suggested-by: Stephan Müller 
Signed-off-by: Nicolai Stange 
---
 drivers/char/random.c | 24 ++--
 1 file changed, 18 insertions(+), 6 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 8f79e90f2429..680ccc82a436 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -1481,12 +1481,24 @@ static void add_timer_randomness(struct 
timer_rand_state *state, unsigned num)
 
r = &input_pool;
spin_lock_irqsave(&r->lock, flags);
-   /*
-* delta is now minimum absolute delta.
-* Round down by 1 bit on general principles,
-* and limit entropy estimate to 12 bits.
-*/
-   __queue_entropy(r, &q, min_t(int, fls(delta>>1), 11) << ENTROPY_SHIFT);
+   if (!fips_enabled) {
+   unsigned int nfrac;
+
+   /*
+* delta is now minimum absolute delta.
+* Round down by 1 bit on general principles,
+* and limit entropy estimate to 12 bits.
+*/
+   nfrac = min_t(int, fls(delta>>1), 11) << ENTROPY_SHIFT;
+   __queue_entropy(r, &q, nfrac);
+   } else {
+   /*
+* Multiple correlated entropy sources are prohibited
+* by NIST SP800-90C. Leave it up to
+* add_interrupt_randomness() to contribute any
+* entropy.
+*/
+   }
__mix_pool_bytes(r, &sample, sizeof(sample));
reseed = __dispatch_queued_entropy_fast(r, &q);
spin_unlock_irqrestore(&r->lock, flags);
-- 
2.26.2



[RFC PATCH 31/41] random: introduce struct health_test + health_test_reset() placeholders

2020-09-21 Thread Nicolai Stange
The to be implemented health tests will maintain some per-CPU state as they
successively process the IRQ samples fed into the resp. fast_pool from
add_interrupt_randomness().

In order not to clutter future patches with trivialities, introduce
an empty struct health_test supposed to keep said state in the future.
Add a member of this new type to struct fast_pool.

Introduce a health_test_reset() stub, which is supposed to (re)initialize
instances of struct health_test.

Invoke it from the fast_pool_init_accounting() to make sure that a
fast_pool's contained health_test instance gets initialized once before
its first usage.

Make add_interrupt_randomness() call fast_pool_init_accounting() earlier:
health test functionality will get invoked before the latter's old location
and the accounting state must have been initialized by that time.

Signed-off-by: Nicolai Stange 
---
 drivers/char/random.c | 11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 37746df53acf..0f56c873a501 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -879,6 +879,11 @@ static void discard_queued_entropy(struct entropy_store *r,
spin_unlock_irqrestore(&r->lock, flags);
 }
 
+struct health_test {};
+
+static void health_test_reset(struct health_test *h)
+{}
+
 struct fast_pool {
__u32   pool[4];
unsigned long   last;
@@ -886,6 +891,7 @@ struct fast_pool {
unsigned char   count;
int event_entropy_shift;
struct queued_entropy   q;
+   struct health_test  health;
 };
 
 /*
@@ -1644,6 +1650,7 @@ static inline void fast_pool_init_accounting(struct 
fast_pool *f)
return;
 
f->event_entropy_shift = min_irq_event_entropy_shift();
+   health_test_reset(&f->health);
 }
 
 void add_interrupt_randomness(int irq, int irq_flags)
@@ -1674,6 +1681,8 @@ void add_interrupt_randomness(int irq, int irq_flags)
add_interrupt_bench(cycles);
this_cpu_add(net_rand_state.s1, fast_pool->pool[cycles & 3]);
 
+   fast_pool_init_accounting(fast_pool);
+
if (unlikely(crng_init == 0)) {
if ((fast_pool->count >= 64) &&
crng_fast_load((char *) fast_pool->pool,
@@ -1692,8 +1701,6 @@ void add_interrupt_randomness(int irq, int irq_flags)
if (!spin_trylock(&r->lock))
return;
 
-   fast_pool_init_accounting(fast_pool);
-
if (!fips_enabled) {
/* award one bit for the contents of the fast pool */
nfrac = 1 << ENTROPY_SHIFT;
-- 
2.26.2



[RFC PATCH 12/41] random: convert add_interrupt_randomness() to queued_entropy API

2020-09-21 Thread Nicolai Stange
In an effort to drop __credit_entropy_bits_fast() in favor of the new
__queue_entropy()/__dispatch_queued_entropy_fast() API, convert
add_interrupt_randomness() from the former to the latter.

There is no change in functionality at this point, because
__credit_entropy_bits_fast() has already been reimplemented on top of the
new API before.

Signed-off-by: Nicolai Stange 
---
 drivers/char/random.c | 28 
 1 file changed, 16 insertions(+), 12 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index e8c86abde901..bd3774c6be4b 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -1512,6 +1512,7 @@ void add_interrupt_randomness(int irq, int irq_flags)
unsigned long   seed;
int credit = 0;
bool    reseed;
+   struct queued_entropy   q = { 0 };
 
if (cycles == 0)
cycles = get_reg(fast_pool, regs);
@@ -1546,24 +1547,27 @@ void add_interrupt_randomness(int irq, int irq_flags)
if (!spin_trylock(&r->lock))
return;
 
-   fast_pool->last = now;
-   __mix_pool_bytes(r, &fast_pool->pool, sizeof(fast_pool->pool));
-
/*
 * If we have architectural seed generator, produce a seed and
-* add it to the pool.  For the sake of paranoia don't let the
-* architectural seed generator dominate the input from the
-* interrupt noise.
+* add it to the pool further below. For the sake of paranoia
+* don't let the architectural seed generator dominate the
+* input from the interrupt noise.
 */
-   if (arch_get_random_seed_long(&seed)) {
-   __mix_pool_bytes(r, &seed, sizeof(seed));
-   credit = 1;
-   }
+   credit = !!arch_get_random_seed_long(&seed);
 
+   fast_pool->last = now;
fast_pool->count = 0;
-
/* award one bit for the contents of the fast pool */
-   reseed = __credit_entropy_bits_fast(r, credit + 1);
+   __queue_entropy(r, &q, (credit + 1) << ENTROPY_SHIFT);
+   __mix_pool_bytes(r, &fast_pool->pool, sizeof(fast_pool->pool));
+   if (credit) {
+   /*
+* A seed has been obtained from
+* arch_get_random_seed_long() above, mix it in.
+*/
+   __mix_pool_bytes(r, &seed, sizeof(seed));
+   }
+   reseed = __dispatch_queued_entropy_fast(r, &q);
spin_unlock(&r->lock);
if (reseed)
crng_reseed(&primary_crng, r);
-- 
2.26.2



[RFC PATCH 41/41] random: lower per-IRQ entropy estimate upon health test failure

2020-09-21 Thread Nicolai Stange
Currently, if fips_enabled is set, a per-IRQ min-entropy estimate of
either 1 bit or 1/8 bit is assumed, depending on whether a high resolution
get_cycles() is available or not. The statistical NIST SP800-90B startup
health tests are run on a certain amount of noise samples and are intended
to reject in case this hypothesis turns out to be wrong, i.e. if the
actual min-entropy is smaller. As long as the startup tests haven't
finished, entropy dispatch and thus, the initial crng seeding, is
inhibited. On test failure, the startup tests would restart themselves
from the beginning.

It follows that in case a system's actual per-IRQ min-entropy is smaller
than the more or less arbitrarily assessed 1 bit or 1/8 bit resp., there
will be a good chance that the initial crng seed will never complete.
AFAICT, such a situation could potentially prevent certain userspace
daemons like OpenSSH from loading.

In order to still be able to make any progress, make
add_interrupt_randomness() lower the per-IRQ min-entropy by one half upon
each health test failure, but only until the minimum supported value of
1/64 bits has been reached. Note that health test failures will cause a
restart of the startup health tests already and thus, a certain number of
additional noise samples resp. IRQ events will have to get examined by the
health tests before the initial crng seeding can take place. This number
of fresh events required is reciprocal to the estimated per-IRQ
min-entropy H: for the Adaptive Proportion Test (APT) it equals ~128 / H.
It follows that this patch won't be of much help for embedded systems or
VMs with poor IRQ rates at boot time, at least not without manual
intervention. But there aren't many options left when fips_enabled is set.

With respect to NIST SP800-90B conformance, this patch enters kind of a
gray area: NIST SP800-90B has no notion of such a dynamically adjusted
min-entropy estimate. Instead, it is assumed that some fixed value has been
estimated based on general principles and subsequently validated in the
course of the certification process. However, I would argue that if a
system had successfully passed certification for 1 bit or 1/8 bit resp. of
estimated min-entropy per sample, it would automatically be approved for
all smaller values as well. Had we started out with such a lower value
passing the health tests from the beginning, the latter would never have
complained in the first place and the system would have come up just fine.

Finally, note that all statistical tests have a non-zero probability of
false positives and so do the NIST SP800-90B health tests. In order to not
keep the estimated per-IRQ entropy at a smaller level than necessary
forever after spurious health test failures, make
add_interrupt_randomness() attempt to double it again after a certain
number of successful health test passes at the degraded entropy level have
been completed. This threshold should not be too small in order to avoid
excessive entropy accounting loss due to continuously alternating between
a too large per-IRQ entropy estimate and the next smaller value. For now,
choose a value of five as a compromise between quick recovery and limiting
said accounting loss.

So, introduce a new member ->good_tests to struct fast_pool for keeping
track of the number of successful health test passes. Make
add_interrupt_randomness() increment it upon successful health test
completion and reset it to zero on failures. Make
add_interrupt_randomness() double the current min-entropy estimate and
restart the startup health tests in case ->good_tests is > 4 and the entropy
had previously been lowered.
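
Condensed into pseudo-C, the policy described above amounts to something
like the following sketch (paraphrased; the recovery branch is cut off in
the quoted hunk below and details may differ):

  if (health_result == health_discard) {
          /* Halve the estimate, but never below 2^-6 = 1/64 bits per IRQ. */
          fast_pool->event_entropy_shift =
                  min_t(unsigned int, fast_pool->event_entropy_shift + 1, 6);
          fast_pool->good_tests = 0;
          health_test_reset(&fast_pool->health,
                            fast_pool->event_entropy_shift);
  } else if (health_result == health_dispatch) {
          fast_pool->good_tests++;
          if (fast_pool->good_tests > 4 &&
              fast_pool->event_entropy_shift > min_irq_event_entropy_shift()) {
                  /* Attempt to double the estimate again after five passes. */
                  fast_pool->event_entropy_shift--;
                  health_test_reset(&fast_pool->health,
                                    fast_pool->event_entropy_shift);
          }
  }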

Signed-off-by: Nicolai Stange 
---
 drivers/char/random.c | 25 +++--
 1 file changed, 23 insertions(+), 2 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index bb79dcb96882..24c09ba9d7d0 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -1126,6 +1126,7 @@ struct fast_pool {
bool    dispatch_needed : 1;
bool    discard_needed : 1;
int event_entropy_shift;
+   unsigned int    good_tests;
struct queued_entropy   q;
struct health_test  health;
 };
@@ -1926,9 +1927,13 @@ void add_interrupt_randomness(int irq, int irq_flags)
cycles);
if (unlikely(health_result == health_discard)) {
/*
-* Oops, something's odd. Restart the startup
-* tests.
+* Oops, something's odd. Lower the entropy
+* estimate and restart the startup tests.
 */
+   fast_pool->event_entropy_shift =
+   min_t(unsigned int,
+ fast_pool->event_entropy_shift + 1, 6);
+   fast_pool-

[RFC PATCH 10/41] random: implement support for delayed entropy dispatching

2020-09-21 Thread Nicolai Stange
e follows the
same line of reasoning as for the chosen magnitude limit.

In order to enable this watermark lifetime management, add yet another new
field ->entropy_watermark_leak_time to struct entropy_store. Make
account_begin() set it to the current jiffies upon the first increment of
->entropy_watermark_delta from zero. Make account_complete() reset
->entropy_watermark_delta and invalidate all queued entropy as
described above whenever ->pending_extractions is zero and either
->entropy_watermark_leak_time is older than two times CRNG_RESEED_INTERVAL
or if ->entropy_watermark_delta exceeds one fourth of the pool size.

As entropy producers haven't been converted to the new __queue_entropy() +
dispatch_queued_entropy()/__dispatch_queued_entropy_fast() API yet, the net
effect of this patch is to "fix" a scenario similar to the one initially
described for those producers that call __mix_pool_bytes() and
__credit_entropy_bits_fast() without dropping the pool's ->lock inbetween,
i.e. for add_interrupt_randomness() and add_timer_randomness(). Namely, if
said sequence happens to get serialized inbetween the account_begin()
(formerly account()) and the last extract_buf() from a concurrent
extraction, the pool's entropy watermark will now be taken into account
when calculating the amount of new entropy to credit in
__credit_entropy_bits_fast(), because the latter has been reimplemented on
top of the new API.
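
For reference, the resulting usage pattern of the new API in a producer
path (condensed from the conversion patches later in this series; nfrac,
buf and len stand in for the producer's actual values):

  struct queued_entropy q = { 0 };
  unsigned long flags;
  bool reseed;

  spin_lock_irqsave(&r->lock, flags);
  __queue_entropy(r, &q, nfrac);          /* account for the new entropy */
  __mix_pool_bytes(r, buf, len);          /* actually mix the bytes in */
  reseed = __dispatch_queued_entropy_fast(r, &q);  /* credit it */
  spin_unlock_irqrestore(&r->lock, flags);
  if (reseed)
          crng_reseed(&primary_crng, r);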

Other than that, it's noteworthy that the pool entropy watermark might
exceed unexpectedly high levels at boot time, namely if multiple producers
happen to trigger the initial seeding of the primary_crng and the
subsequent entropy extractions complete when entropy has been queued up
somewhere else, e.g. in try_to_generate_entropy(). As detailed above, high
values of the pool watermark will reduce the efficiency when dispatching
new entropy attributions, but note that
- There are mechanisms in place to limit the effect in magnitude and
  time.
- The watermark can never exceed the total amount of entropy collected
  so far. So entropy collection at boot time would have to be terribly
  efficient in order for this to matter.
- As seeding the primary_crng is a prerequisite for the described scenario,
  a temporarily reduced entropy collection efficiency isn't really
  concerning: getting towards a seeded primary_crng is all that matters at
  this point.

Signed-off-by: Nicolai Stange 
---
 drivers/char/random.c | 315 +++---
 1 file changed, 292 insertions(+), 23 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 9f87332b158f..b91d1fc08ac5 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -499,7 +499,13 @@ struct entropy_store {
spinlock_t lock;
unsigned short add_ptr;
unsigned short input_rotate;
+
int entropy_count;
+   unsigned int entropy_watermark_delta;
+   unsigned int entropy_watermark_seq;
+   unsigned int queued_entropy_batches;
+   unsigned int pending_extractions;
+   unsigned long entropy_watermark_leak_time;
unsigned int initialized:1;
unsigned int last_data_init:1;
__u8 last_data[EXTRACT_SIZE];
@@ -671,6 +677,9 @@ static unsigned int pool_entropy_delta(struct entropy_store 
*r,
if (!nfrac)
return 0;
 
+   if (pool_size <= base_entropy_count)
+   return 0;
+
/*
 * Credit: we have to account for the possibility of
 * overwriting already present entropy.  Even in the
@@ -714,26 +723,172 @@ static unsigned int pool_entropy_delta(struct 
entropy_store *r,
return entropy_count - base_entropy_count;
 }
 
+struct queued_entropy {
+   unsigned int pool_watermark_seq;
+   unsigned int queued_entropy_fracbits;
+};
+
 /*
- * Credit the entropy store with n bits of entropy.
- * To be used from hot paths when it is either known that nbits is
- * smaller than one half of the pool size or losing anything beyond that
- * doesn't matter. Must be called with r->lock being held.
+ * Queue a given amount of new entropy which is about to mixed into
+ * the entropy pool for later dispatch.
+ *
+ * __queue_entropy() may be called one or more time on the same struct
+ * queued_entropy instance in order to accumulate entropy for later
+ * dispatch. However, any such sequence of invocations must eventually
+ * be followed by exactly one call to either of __dequeue_entropy(),
+ * __dispatch_queued_entropy_fast() or dispatch_queued_entropy()
+ * when the actual pool mixing has completed.
+ * __queue_entropy() must be called with r->lock held.
+ *
+ * Entropy extraction is a two-step process:
+ * 1.) The allocated amount of entropy gets subtracted from ->entropy_count.
+ * 2.) The entropy is actually extracted from the pool by means of one or more
+ * extract_buf() invocations.
+ * Likewise for the mixing side:
+ * 1.) The new entropy data gets mixed

[RFC PATCH 32/41] random: introduce health test stub and wire it up

2020-09-21 Thread Nicolai Stange
NIST SP800-90B requires certain statistical tests to be run continuously on
a noise source's output.

In preparation to implementing those, introduce an empty stub,
health_test_process() and wire it up to add_interrupt_randomness(). This
patch does not implement any actual testing functionality yet; it's merely
meant to define the interactions between add_interrupt_randomness() and
the health tests.

health_test_process() is to be invoked on individual noise samples, i.e.
cycle counter values, and returns either of three possible status
codes indicating to the calling add_interrupt_randomness() that
- either some more samples are needed in order to complete the statistical
  tests,
- that the tests have finished with positive result on the latest run
  of noise samples or
- that the tests have failed.

Introduce an enum health_result defining constants corresponding to these
resp. cases: health_queue, health_dispatch and health_discard. Provide
another value, health_none, to indicate the case that the health tests
are disabled, because e.g. fips_enabled is unset. Make the stub
health_test_process() return this value for now.

As long as the statistical tests need more input noise samples before
reaching a conclusion, health_queue will get returned from
health_test_process(). FWIW, the number of successive input samples needed
by the tests will be on the order of 128 to 8192, depending on the per-IRQ
entropy estimate. add_interrupt_randomness() currently attempts to transfer
the noise kept within the per-CPU fast_pool, which is of limited
capacity, to the global input_pool as soon as a threshold of 64 events is
reached and it will continue to do so. However, as long as some tests are
pending, i.e. keep returning health_queue, the associated amount of
estimated entropy must not get added to the global input_pool balance, but
queued up at the fast_pool's queued_entropy instance. Once the health test
have eventually succeeded, as indiciated by health_test_process(), the
entropy previously queued up may get dispatched to the global reserve.
OTOH, on test failure health_discard will get returned and all entropy
queued up from add_interrupt_randomness() since the last dispatch (or
discard resp.) must get discarded.

Note that add_interrupt_randomness() will continue to unconditionally mix
the samples into the fast_pools and eventually into the global input_pool
-- the health test results really only affect the entropy accounting.

So, make add_interrupt_randomness() invoke health_test_process() on
the current cycle counter value in case fips_enabled is set.

In case a fast_pool's fill level threshold of 64 events is reached at a
time when health tests are still pending and keep returning health_queue,
let add_interrupt_randomness() continue to mix the fast_pool's contents
into the input_pool as before, but enqueue the associated amount of entropy
at the fast_pool's associated queued_entropy instance for later dispatch.

Both, entropy dispatch as well as discard operations, require a call to
__dequeue_entropy(), which in turn must only get invoked with the
input_pool's ->lock being held. It follows that in case the spin_trylock()
in add_interrupt_randomness() failed, the latter would not be able to
perform entropy dispatch or discard operations immediately at the time
those have been requested by the health tests. Add two new boolean flags,
->dispatch_needed and ->discard_needed, to struct fast_pool. Set them from
add_interrupt_randomness() in case health_test_process() returned
health_dispatch or health_discard resp. Make the current and subsequent
add_interrupt_randomness() invocations check for ->dispatch_needed and
->discard_needed and attempt to execute any pending dispatch/discard
request. Clear ->dispatch_needed and ->discard_needed again when the
prerequisite ->lock could eventually be obtained.
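
Schematically, the control flow described above looks roughly like this
(a condensed sketch, not a literal hunk from the patch):

  if (health_result == health_dispatch)
          fast_pool->dispatch_needed = true;
  else if (health_result == health_discard)
          fast_pool->discard_needed = true;

  if (!spin_trylock(&r->lock))
          return; /* flags stay set, handled on a later invocation */

  if (fast_pool->dispatch_needed || fast_pool->discard_needed) {
          /* dispatch resp. discard the entropy queued at fast_pool->q ... */
          fast_pool->dispatch_needed = false;
          fast_pool->discard_needed = false;
  }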

As actual health tests returning anything but health_none haven't been
implemented yet, there is no behavioural change at this point.

Signed-off-by: Nicolai Stange 
---
 drivers/char/random.c | 78 +--
 1 file changed, 75 insertions(+), 3 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 0f56c873a501..cb6441b96b8e 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -881,14 +881,30 @@ static void discard_queued_entropy(struct entropy_store 
*r,
 
 struct health_test {};
 
+enum health_result {
+   health_none,
+   health_queue,
+   health_dispatch,
+   health_discard,
+};
+
 static void health_test_reset(struct health_test *h)
 {}
 
+static enum health_result
+health_test_process(struct health_test *h, unsigned int event_entropy_shift,
+   u8 sample)
+{
+   return health_none;
+}
+
 struct fast_pool {
__u32   pool[4];
unsigned long   last;
unsigned short  reg_idx;
unsigned char   count;
+   booldispatch_

[RFC PATCH 27/41] random: increase per-IRQ event entropy estimate if in FIPS mode

2020-09-21 Thread Nicolai Stange
NIST SP800-90C prohibits the use of multiple correlated entropy sources.
However, add_interrupt_randomness(), add_disk_randomness() and
add_input_randomness() are clearly not independent and an upcoming patch
will make the latter two stop contributing any entropy to the global
balance if fips_enabled is on.

With the current parameter settings, it can be assumed that
add_disk_randomness() resp. add_input_randomness() are the dominating
contributors to the overall entropy reserve for some common workloads: both
more or less estimate the entropy per event to equal the width of the
minimum out of the first, second and third jiffies deltas to the previous
occurrence.

add_interrupt_randomness() on the other hand only attributes a single bit of
entropy to a full batch of 64 IRQ events (or to whatever has accumulated
after one second, if that happens first).

Thus, the upcoming exclusion of two potent entropy sources should somehow
be compensated for.

Stephan Müller worked around this very problem in his "LRNG" proposal ([1])
by increasing the entropy estimate per IRQ event. Namely, in case a
get_cycles() with instruction granularity is available, he estimated one
bit of entropy per IRQ event and (IIRC) 1/10 bits otherwise. I haven't
tested this claim myself, in particular not on smaller devices. But for the
sake of moving the development of this RFC series forward, I'll assume it
as granted and hereby postulate that

  The lower eight bits of the differences between get_cycles() from two
  successive IRQ events on the same CPU carry
  - one bit of min-entropy in case a get_cycles() with instruction
granularity is available and
  - 1/8 bit of min-entropy in case get_cycles() is still non-trivial, but
has a lower resolution.

In particular this is assumed to be true for highly periodic interrupts
like those issued for e.g. USB microframes and on all supported
architectures. In the former case, the underlying source of randomness is
believed to follow the same principles as for the Jitter RNGs resp.
try_to_generate_entropy(): differences in RAM vs. CPU clockings and
unpredictability of cache states to a certain extent.

Notes:
- NIST SP800-90B requires a means to access raw samples for validation
  purposes. Implementation of such an interface is deliberately not part
  of this RFC series here, but would necessarily be subject of future work.
  So there would be a means to at least validate these assumptions.
- The choice of 1/8 over the 1/10 from the LRNG patchset has been made
  because it's a power of two and I suppose that the estimate of 1/10
  had been quite arbitrary anyway. Replacement of the 1/8 by smaller
  powers of two down to 1/64 will be supported throughout this patch
  series.

Some health tests as required by NIST SP800-90B will be implemented later
in this series. In order to allow for dynamically decreasing the assessed
entropy on a per-CPU basis upon health test failures, make it an attribute
of the per-CPU struct fast_pool. That is, introduce a new integer field
->event_entropy_shift to struct fast_pool. The estimated entropy per
IRQ sample will be calculated as 2^-event_entropy_shift. Initialize it
statically with -1 to indicate that runtime initialization hasn't happened
yet. Introduce fast_pool_init_accounting() which gets called
unconditionally from add_interrupt_randomness() for doing the necessary
runtime initializations once, i.e. if ->event_entropy_shift is
still found to be negative. Implement it with the help of the also new
min_irq_event_entropy_shift(), which will return the initial
->event_entropy_shift value as determined according to the rules from
above. That is, depending on have_highres_cycle_ctr, the result is
either zero or three. Note that have_highres_cycle_ctr will only get
properly initialized from rand_initialize() if fips_enabled is set, but
->event_entropy_shift will also only ever get accessed in this case.
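
Just to make those two values concrete, a trivial user-space sketch (not
kernel code) of what the possible initial ->event_entropy_shift values mean,
i.e. the assumed per-IRQ min-entropy 2^-event_entropy_shift and the number of
IRQ events it would take to accumulate 128 bits:

#include <stdio.h>

int main(void)
{
	/* shift 0: get_cycles() with insn granularity, shift 3: lower resolution */
	int shifts[] = { 0, 3 };
	unsigned int i;

	for (i = 0; i < sizeof(shifts) / sizeof(shifts[0]); i++) {
		int shift = shifts[i];
		double h = 1.0 / (double)(1 << shift); /* bits per IRQ event */
		unsigned int events = 128u << shift;   /* events needed for 128 bits */

		printf("event_entropy_shift %d: %.3f bits/IRQ, %u IRQs per 128 bits\n",
		       shift, h, events);
	}
	return 0;
}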

Finally, for the case that fips_enabled is set, make
add_interrupt_randomness() estimate the amount of entropy transferred
from the fast_pool into the global input_pool as
fast_pool_entropy(->count, ->event_entropy_shift), rather than only one
single bit. Remember that fast_pool_entropy() calculates the amount of
entropy contained in a fast_pool, based on the total number of events mixed
into it and the estimated entropy per event.

[1] https://lkml.kernel.org/r/5695397.lov4wx5...@positron.chronox.de

Suggested-by: Stephan Müller 
Signed-off-by: Nicolai Stange 
---
 drivers/char/random.c | 50 ++-
 1 file changed, 45 insertions(+), 5 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index ac36c56dd135..8f79e90f2429 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -615,6 +615,7 @@ struct fast_pool {
unsigned long   last;
unsigned short  reg_idx;
unsigned char   count;
+   int event_entropy_shift;
 };
 
 /*
@@ -1509,7 +1510,9 

[RFC PATCH 13/41] random: convert try_to_generate_entropy() to queued_entropy API

2020-09-21 Thread Nicolai Stange
In an effort to drop __credit_entropy_bits_fast() in favor of the new
__queue_entropy()/__dispatch_queued_entropy_fast() API, convert
try_to_generate_entropy() from the former to the latter.

Replace the call to __credit_entropy_bits_fast() from the timer callback,
entropy_timer(), by a queue_entropy() operation. Dispatch it from the loop
in try_to_generate_entropy() by invoking __dispatch_queued_entropy_fast()
after the timestamp has been mixed into the input_pool.

In order to provide the timer callback and try_to_generate_entropy() with
access to a common struct queued_entropy instance, move the currently
anonymous struct definition from the local 'stack' variable declaration in
try_to_generate_entropy() to file scope and assign it a name,
"struct try_to_generate_entropy_stack". Make entropy_timer() obtain a
pointer to the corresponding instance by means of container_of() on the
->timer member contained therein. Amend struct
try_to_generate_entropy_stack by a new member ->q of type struct
queued_entropy.

Note that the described scheme alters behaviour a bit: first of all, new
entropy credit now gets only dispatched to the pool after the actual mixing
has completed rather than in an unsynchronized manner directly from the
timer callback. As the mixing loop in try_to_generate_entropy() is expected to
run at higher frequency than the timer, this is unlikely to make any
difference in practice.

Furthermore, the pool entropy watermark as tracked over the period from
queuing the entropy in the timer callback and to its subsequent dispatch
from try_to_generate_entropy() is now taken into account when calculating
the actual credit at dispatch. In consequence, the amount of new entropy
dispatched to the pool will potentially be lowered if said period happens
to overlap with the pool extraction from an initial crng_reseed() on the
primary_crng. However, as getting the primary_crng seeded is the whole
point of the try_to_generate_entropy() exercise, this won't matter.

Note that instead of calling queue_entropy() from the timer callback,
an alternative would have been to maintain an invocation counter and queue
that up from try_to_generate_entropy() right before the mix operation.
This would have reduced the described effect of the pool's entropy
watermark and in fact matched the intended queue_entropy() API usage
better. However, in this particular case of try_to_generate_entropy(),
jitter is desired and invoking queue_entropy() with its buffer locking etc.
from the timer callback could potentially contribute to that.

Signed-off-by: Nicolai Stange 
---
 drivers/char/random.c | 42 +-
 1 file changed, 29 insertions(+), 13 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index bd3774c6be4b..dfbe49fdbcf1 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -1911,6 +1911,12 @@ void get_random_bytes(void *buf, int nbytes)
 EXPORT_SYMBOL(get_random_bytes);
 
 
+struct try_to_generate_entropy_stack {
+   unsigned long now;
+   struct timer_list timer;
+   struct queued_entropy q;
};
+
 /*
  * Each time the timer fires, we expect that we got an unpredictable
  * jump in the cycle counter. Even if the timer is running on another
@@ -1926,14 +1932,10 @@ EXPORT_SYMBOL(get_random_bytes);
  */
 static void entropy_timer(struct timer_list *t)
 {
-   bool reseed;
-   unsigned long flags;
+   struct try_to_generate_entropy_stack *stack;
 
-   spin_lock_irqsave(&input_pool.lock, flags);
-   reseed = __credit_entropy_bits_fast(&input_pool, 1);
-   spin_unlock_irqrestore(&input_pool.lock, flags);
-   if (reseed)
-   crng_reseed(&primary_crng, &input_pool);
+   stack = container_of(t, struct try_to_generate_entropy_stack, timer);
+   queue_entropy(&input_pool, &stack->q, 1 << ENTROPY_SHIFT);
 }
 
 /*
@@ -1942,10 +1944,9 @@ static void entropy_timer(struct timer_list *t)
  */
 static void try_to_generate_entropy(void)
 {
-   struct {
-   unsigned long now;
-   struct timer_list timer;
-   } stack;
+   struct try_to_generate_entropy_stack stack = { 0 };
+   unsigned long flags;
+   bool reseed;
 
stack.now = random_get_entropy();
 
@@ -1957,14 +1958,29 @@ static void try_to_generate_entropy(void)
while (!crng_ready()) {
if (!timer_pending(&stack.timer))
mod_timer(&stack.timer, jiffies+1);
-   mix_pool_bytes(&input_pool, &stack.now, sizeof(stack.now));
+   spin_lock_irqsave(&input_pool.lock, flags);
+   __mix_pool_bytes(&input_pool, &stack.now, sizeof(stack.now));
+   reseed = __dispatch_queued_entropy_fast(&input_pool, &stack.q);
+   spin_unlock_irqrestore(&input_pool.lock, flags);
+
+   if (reseed)
+   crng_reseed(&primary_crng, &input_pool);
+
schedule();
stack.now = random_get_entropy();
}
 
del_timer_sync(&stack.timer);
destroy_timer_on_stack(&stack.timer);
-   mix_pool_bytes(

[RFC PATCH 35/41] random: improve the APT's statistical power

2020-09-21 Thread Nicolai Stange
Implement health_apt_presearch_finalize() for restoring the
candidate from the presearch bit counters. Call it from health_test_apt()
once the n1'th event in a sequence has been processed and the presearch
phase is to be concluded.

Make health_test_apt() search for the candidate value as determined by
the presearch phase among the sequence's remaining n2 = n - n1 samples.
Adapt the failure thresholds to the now slightly smaller n2 values.

Signed-off-by: Nicolai Stange 
---
 drivers/char/random.c | 58 +--
 1 file changed, 50 insertions(+), 8 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 131302cbc495..75a103f24fea 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -881,8 +881,13 @@ static void discard_queued_entropy(struct entropy_store *r,
 
 struct health_test {
unsigned short apt_event_count;
-   unsigned short apt_candidate_count;
-   u8 apt_candidate;
+   union {
+   u8 apt_presearch_bit_counters[8];
+   struct {
+   unsigned short apt_candidate_count;
+   u8 apt_candidate;
+   };
+   };
 
u8 previous_sample;
 };
@@ -895,9 +900,44 @@ enum health_result {
 };
 
 /* Adaptive Proportion Test */
+#define HEALTH_APT_PRESEARCH_EVENT_COUNT 7
+
+static void health_apt_presearch_update(struct health_test *h, u8 sample_delta)
+{
+   int i;
+
+   for (i = 0; i < 8; ++i) {
+   h->apt_presearch_bit_counters[i] += sample_delta & 0x1;
+   sample_delta >>= 1;
+   }
+}
+
+static void health_apt_presearch_finalize(struct health_test *h)
+{
+   int i;
+
+   /*
+* If some event octet occurred more than half of the time,
+* i.e. more than HEALTH_APT_PRESEARCH_EVENT_COUNT / 2 times,
+* then its value can be restored unambiguously from the eight
+* ->apt_presearch_bit_counters each holding the count of 1s
+* encountered at the corresponding bit positions.
+*/
+   h->apt_candidate = 0;
+   for (i = 0; i < 8; ++i) {
+   if (h->apt_presearch_bit_counters[i] >=
+   (HEALTH_APT_PRESEARCH_EVENT_COUNT + 1) / 2) {
+   h->apt_candidate |= 1 << i;
+   }
+   }
+   h->apt_candidate_count = 0;
+};
+
 static void health_apt_reset(struct health_test *h)
 {
h->apt_event_count = 0;
+   memset(h->apt_presearch_bit_counters, 0,
+   sizeof(h->apt_presearch_bit_counters));
 }
 
 static enum health_result
@@ -911,16 +951,18 @@ health_test_apt(struct health_test *h, unsigned int 
event_entropy_shift,
 * values of event_entropy_shift each, should have probability
 * <= 2^-16.
 */
-   static const unsigned int c[] = {87, 210, 463, 973, 1997, 4044, 8140};
+   static const unsigned int c[] = {83, 205, 458, 968, 1991, 4038, 8134};
+
+   BUILD_BUG_ON(HEALTH_APT_PRESEARCH_EVENT_COUNT != 7);
 
-   if (!h->apt_event_count) {
-   h->apt_event_count = 1;
-   h->apt_candidate = sample_delta;
-   h->apt_candidate_count = 0;
+   ++h->apt_event_count;
+   if (unlikely(h->apt_event_count <= HEALTH_APT_PRESEARCH_EVENT_COUNT)) {
+   health_apt_presearch_update(h, sample_delta);
+   if (h->apt_event_count == HEALTH_APT_PRESEARCH_EVENT_COUNT)
+   health_apt_presearch_finalize(h);
return health_queue;
}
 
-   ++h->apt_event_count;
if (unlikely(h->apt_candidate == sample_delta &&
 ++h->apt_candidate_count == c[event_entropy_shift])) {
health_apt_reset(h);
-- 
2.26.2



[RFC PATCH 34/41] random: implement the "Adaptive Proportion" NIST SP800-90B health test

2020-09-21 Thread Nicolai Stange
NIST SP800-90B requires an implementation of the "Adaptive Proportion"
health test (APT) or similar for detecting noise source entropy
degradations.

This test works by counting how many times the first sample value in a
sequence of n events occurs among the remaining n-1 samples. The test will
reject if this number exceeds a certain threshold.

With a min-entropy estimate of H=2^-event_entropy_shift per IRQ event, the
probability of observing any particular sample value is bounded by
p <= 2^-H. Assuming i.i.d., the number of occurrences of such a sample
value among n - 1 events follows the binomial distribution with parameters
n - 1 and p. The probability to observe up to k occurrences of a given
sample value is not less than that distribution's CDF F(n - 1, p, k) at
point k, per the definition of CDFs and the fact that
F(n - 1, p1, k) >= F(n - 1, p2, k) for p1 <= p2 in the particular case of
Binomial distributions. It follows that an upper bound on the probability
of observing the same value c or more times among n - 1 consecutive samples
is given by 1 - F(n - 1, p, c - 1). In conclusion, the probability of false
positives is <= p * (1 - F(n - 1, p, c - 1)) for the Adaptive Proportion
test.
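
Restating the above bound compactly (with p = 2^-H the assumed per-event
probability bound and F the binomial CDF):

P_{\mathrm{false\ positive}} \;\le\; p \cdot \bigl(1 - F(n - 1,\, p,\, c - 1)\bigr),
\qquad
F(n - 1, p, k) \;=\; \sum_{i=0}^{k} \binom{n - 1}{i} p^{i} (1 - p)^{\,n - 1 - i}.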

NIST SP800-90B recommends to set n to either 512 or 1024 and to choose a
cut-off value c such that the probability of false positives is <= 2^-20.
However, assuming an estimated per-IRQ entropy of 1 bit, it would take
1024/128 == 8 minimum crng seed sizes worth of entropy before the APT
eventually completes and the accumulated entropy may get released to the
global reserve. Thus, it is desirable to set n such that the APT will
complete within 128 bits worth of entropy, i.e. to n = 128 / H. However,
for such relatively small values of n, an upper bound as small as 2^-20
for the false positives probability would make the test's statistical
power, i.e. the capability to detect degraded noise sources, plummet to
uselessness. Note that add_interrupt_randomness() will continue to
unconditionally mix all events into the fast_pools, independent of the
APT's outcome. Thus, allowing for a higher probability of false positives
cannot change the output distribution, but only potentially affect the
entropy accounting. Choose an upper bound of 2^-16 for the probability of
false positives.

The resulting cut-off values for the different supported values of per-IRQ
entropy estimates are tabulated below. The "power" column lists the
probabilities (again for i.i.d.) that the APT would report a failure in
case the actual entropy has degraded to one half of the assumed estimate.

      H     n     c   power
  ----------------------------
      1   128    87   52.5%
    1/2   256   210   67.5%
    1/4   512   463   76.7%
    1/8  1024   973   82.8%
   1/16  2048  1997   82.6%
   1/32  4096  4044   85.8%
   1/64  8192  8140   85.8%
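
For anyone who wants to re-derive these cut-offs, a small user-space helper
(not kernel code, link with -lm) evaluating the bound
p * (1 - F(n - 1, p, c - 1)) from above for a single table row; the row is
hardcoded as an example and can be edited as needed:

#include <math.h>
#include <stdio.h>

/* log of the Binomial(n, p) probability mass at k */
static long double log_binom_pmf(unsigned int n, long double p, unsigned int k)
{
	return lgammal(n + 1.0L) - lgammal(k + 1.0L) - lgammal(n - k + 1.0L) +
	       k * logl(p) + (n - k) * logl(1.0L - p);
}

/* 1 - F(n, p, c - 1), i.e. P[X >= c] for X ~ Binomial(n, p) */
static long double binom_tail(unsigned int n, long double p, unsigned int c)
{
	long double tail = 0.0L;
	unsigned int k;

	for (k = c; k <= n; k++)
		tail += expl(log_binom_pmf(n, p, k));
	return tail;
}

int main(void)
{
	/* example row from the table: H = 1 bit, n = 128, c = 87 */
	long double H = 1.0L;
	long double p = powl(2.0L, -H);
	unsigned int n = 128, c = 87;

	printf("p * (1 - F(n-1, p, c-1)) = %Lg (target <= %Lg)\n",
	       p * binom_tail(n - 1, p, c), powl(2.0L, -16.0L));
	return 0;
}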

Add a couple of new fields to struct health_test for storing the required
APT state:
 - ->apt_event_count: total number of samples processed by the currently
 pending APT,
 - ->apt_candidate: the sample value whose number of occurrences the
 currently pending APT is counting,
 - ->apt_candidate_count: the number of occurrences of ->apt_candidate
 the currently pending APT has encountered so far.

Implement the APT logic and wrap it in a new function, health_test_apt().
Invoke it from health_test_process().

Signed-off-by: Nicolai Stange 
---
 drivers/char/random.c | 56 +--
 1 file changed, 54 insertions(+), 2 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 33f9b7b59f92..131302cbc495 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -880,6 +880,10 @@ static void discard_queued_entropy(struct entropy_store *r,
 }
 
 struct health_test {
+   unsigned short apt_event_count;
+   unsigned short apt_candidate_count;
+   u8 apt_candidate;
+
u8 previous_sample;
 };
 
@@ -890,8 +894,56 @@ enum health_result {
health_discard,
 };
 
+/* Adaptive Proportion Test */
+static void health_apt_reset(struct health_test *h)
+{
+   h->apt_event_count = 0;
+}
+
+static enum health_result
+health_test_apt(struct health_test *h, unsigned int event_entropy_shift,
+   u8 sample_delta)
+{
+   unsigned int n = 128 << event_entropy_shift;
+   /*
+* Observing some particular sample value more often than
+* these thresholds, specified for the different possible
+* values of event_entropy_shift each, should have probability
+* <= 2^-16.
+*/
+   static const unsigned int c[] = {87, 210, 463, 973, 1997, 4044, 8140};
+
+   if (!h->apt_event_count) {
+   h->apt_event_count = 1;
+   h->apt_candidate = sample_delta;
+   h->apt_candidate_count = 0;
+   return health_queue;
+   }
+
+   ++h->apt_event_count;
+   if (unlikely(h->apt_ca

[RFC PATCH 36/41] random: optimize the APT's presearch

2020-09-21 Thread Nicolai Stange
The Adaptive Proportion Test's presearch phase is supposed to determine the
sample value occurring more than half of the time among the first n1=7
events in a sequence, if there is any such value.

To this end, it maintains eight counters at struct health_test for counting
the numbers of ones observed at each of the eight resp. bit positions
within the sample values' octets. The idea is that if any sample value had
been encountered more than half of the time, it would dominate all these
counters and its value could be restored unambiguously from them.

For better reviewability, this had been implemented in the most
straightforward way:
- the counters had been represented as an array of eight u8s at struct
  health_test and
- both the counter updating as well as the candidate extraction code had
  been implemented by means of a loop over the eight bit positions.

As this is all accessed from add_interrupt_randomness(), optimizations
won't harm. In particular, using a total of eight bytes for the bit
counters is wasteful and can be reduced to half of that, optimizing the
size of struct health_test, which in turn is stored as part of the per-CPU
struct fast_pool.

For counts up to n1=7, 3 bits would suffice. Rather than using an array of
eight u8s, store the bit counters within a u32's eight four-bit nibbles.

Even though it probably won't matter on average in practice, because the
APT presearch is run only for a small fraction of all IRQ events, reduce
the number of instructions issued from the bit counter maintenance and
candidate extraction code as well. If nothing else, this can potentially
reduce the maximum IRQ latency by a few cycles.

Namely, make the bit counter updating code in health_apt_presearch_update()
spread the eight bits from the input sample evenly across an u32. That is,
the bit at position i will end up at position 4*i. This can be achieved
within five binops and four shifts. This intermediate value can then get
subsequently added in a single operation to the "packed" bit counters kept
in struct health_test in order to conclude the operation.

As for the final candidate extraction in health_apt_presearch_finalize(),
remember that the i'th counter needs to get compared against
n1 / 2 = 7 / 2 in order to restore the i'th bit from the resulting
candidate value. The condition that one such bit counter is >= 4 is
equivalent to testing its bit at the 2nd position, counted from zero.
Thus, (h->apt_presearch_bit_counters & 0x44444444) >> 2
will yield a value where the LSB from the i'th nibble equals the i'th bit
from the result and everything else is unset. The final result can then be
obtained by "shrinking" this intermediate representation back into an u8.
In total, the candidate extraction can be achieved within a sequence of
seven binops and six shifts.
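
To convince myself that the widening/shrinking actually round-trips, here's a
stand-alone user-space check; the exact shift/mask sequence is an illustration
of the scheme described above, not necessarily instruction-for-instruction
identical to the patch:

#include <stdint.h>
#include <stdio.h>

/* bit i of the input octet ends up at bit 4*i of the result */
static uint32_t widen(uint8_t b)
{
	uint32_t enc = b;

	enc = (enc & 0xf) | ((enc >> 4) << 16);
	enc |= enc << 6;
	enc |= enc << 3;
	return enc & 0x11111111u;
}

/* inverse direction: gather bit 4*i back into bit i */
static uint8_t shrink(uint32_t enc)
{
	enc |= enc >> 3;
	enc |= enc >> 6;
	return (uint8_t)((enc & 0xf) | (((enc >> 16) & 0xf) << 4));
}

int main(void)
{
	unsigned int b;

	for (b = 0; b < 256; b++) {
		if (shrink(widen((uint8_t)b)) != b) {
			printf("mismatch at 0x%02x\n", b);
			return 1;
		}
	}
	printf("round trip ok\n");
	return 0;
}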

Signed-off-by: Nicolai Stange 
---
 drivers/char/random.c | 71 ---
 1 file changed, 53 insertions(+), 18 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 75a103f24fea..2c744d2a9b26 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -882,7 +882,7 @@ static void discard_queued_entropy(struct entropy_store *r,
 struct health_test {
unsigned short apt_event_count;
union {
-   u8 apt_presearch_bit_counters[8];
+   u32 apt_presearch_bit_counters;
struct {
unsigned short apt_candidate_count;
u8 apt_candidate;
@@ -904,40 +904,75 @@ enum health_result {
 
 static void health_apt_presearch_update(struct health_test *h, u8 sample_delta)
 {
-   int i;
+   u32 encoded;
 
-   for (i = 0; i < 8; ++i) {
-   h->apt_presearch_bit_counters[i] += sample_delta & 0x1;
-   sample_delta >>= 1;
-   }
+   /*
+* Encode the sample octet by "widening" it into 8
+* nibbles. That is, bit i from the source will be assigned to
+* bit 4*i in the result. All other bits are set to zero.
+*/
+   encoded = sample_delta;
+   encoded = (encoded & 0xf) | ((encoded >> 4) << 16);
+   encoded |= (encoded << 6);
+   encoded |= (encoded << 3);
+   encoded &= 0x11111111;
+
+   /*
+* The nibbles from ->apt_presearch_bit_counters, each
+* counting the number of occurrences of 1s at the
+* corresponding bit position, don't overflow into each other.
+*/
+   BUILD_BUG_ON(ilog2(HEALTH_APT_PRESEARCH_EVENT_COUNT) >= 4);
+   h->apt_presearch_bit_counters += encoded;
 }
 
 static void health_apt_presearch_finalize(struct health_test *h)
 {
-   int i;
+   u32 majority, decoded;
 
/*
 * If some event octet occurred more than half of the time,
 * i.e. more than HEALTH_APT_PRESEARCH_EVENT_COUNT / 2 times,
 * then its value can be res

[RFC PATCH 38/41] random: enable NIST SP800-90B startup tests

2020-09-21 Thread Nicolai Stange
NIST SP800-90B, section 4.3 requires an entropy source to inhibit output
until the so-called "startup" tests have completed. These "startup" tests
shall process at least 1024 consecutive samples by means of the continuous
health tests, i.e. the already implemented Repetition Count Test (RCT) and
Adaptive Proportion Test (APT).

Introduce a new field ->warmup to struct health_test. Initialize it to 1024
from health_test_reset(). Make health_test_process() decrement ->warmup
once per event processed without test failure, but reset ->warmup to the
intitial value upon failure. Prevent health_test_process() from returning
health_dispatch as long as ->warmup hasn't dropped to zero. This will cause
the caller, i.e. add_interrupt_randomness(), to not dispatch any entropy to
the global balance until the startup tests have finished.

Note that this change will delay the initial seeding of the primary_crng,
especially for those values of the estimated per-IRQ min-entropy H where
the minimum of 1024 events from above is by several factors larger than
128/H, the number of events to be processed by a single APT run. That
would only affect systems running with fips_enabled though and there's
simply no way to avoid it without violating the specs.
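
To put numbers on that: assuming all tests pass, the first dispatch can happen
no earlier than the later of the 1024-sample startup requirement and the first
completed APT window of 128/H samples. A quick user-space calculation (not
kernel code):

#include <stdio.h>

int main(void)
{
	int shift;

	for (shift = 0; shift <= 6; shift++) {
		unsigned int apt_window = 128u << shift; /* samples per APT run */
		unsigned int first_dispatch =
			apt_window > 1024u ? apt_window : 1024u;

		printf("shift %d: APT window %4u samples, earliest dispatch after %4u samples\n",
		       shift, apt_window, first_dispatch);
	}
	return 0;
}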

Signed-off-by: Nicolai Stange 
---
 drivers/char/random.c | 25 +++--
 1 file changed, 23 insertions(+), 2 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 54ee082ca4a8..bd8c24e433d0 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -880,6 +880,7 @@ static void discard_queued_entropy(struct entropy_store *r,
 }
 
 struct health_test {
+   unsigned short warmup;
unsigned short apt_event_count;
union {
u32 apt_presearch_bit_counters;
@@ -1059,6 +1060,13 @@ health_test_apt(struct health_test *h, unsigned int 
event_entropy_shift,
 
 static void health_test_reset(struct health_test *h)
 {
+   /*
+* Don't dispatch until at least 1024 events have been
+* processed by the continuous health tests as required by
+* NIST SP800-90B for the startup tests.
+*/
+   h->warmup = 1024;
+
health_apt_reset(h);
 }
 
@@ -1067,7 +1075,7 @@ health_test_process(struct health_test *h, unsigned int 
event_entropy_shift,
u8 sample)
 {
u8 sample_delta;
-   enum health_result rct;
+   enum health_result rct, apt;
 
/*
 * The min-entropy estimate has been made for the lower eight
@@ -1083,6 +1091,8 @@ health_test_process(struct health_test *h, unsigned int 
event_entropy_shift,
 * Something is really off, get_cycles() has become
 * (or always been) a constant.
 */
+   if (h->warmup)
+   health_test_reset(h);
return health_discard;
}
 
@@ -1091,7 +1101,18 @@ health_test_process(struct health_test *h, unsigned int 
event_entropy_shift,
 * don't care about whether the RCT needs to consume more
 * samples to complete.
 */
-   return health_test_apt(h, event_entropy_shift, sample_delta);
+   apt = health_test_apt(h, event_entropy_shift, sample_delta);
+   if (unlikely(h->warmup) && --h->warmup) {
+   if (apt == health_discard)
+   health_test_reset(h);
+   /*
+* Don't allow the caller to dispatch until warmup
+* has completed.
+*/
+   return apt == health_dispatch ? health_queue : apt;
+   }
+
+   return apt;
 }
 
 struct fast_pool {
-- 
2.26.2



[RFC PATCH 15/41] random: convert add_hwgenerator_randomness() to queued_entropy API

2020-09-21 Thread Nicolai Stange
In an effort to drop credit_entropy_bits() in favor of the new
queue_entropy()/dispatch_queued_entropy() API, convert
add_hwgenerator_randomness() from the former to the latter.

As a side effect, the pool entropy watermark as tracked over the duration
of the mix_pool_bytes() operation is now correctly taken into account
when calculating the amount of new entropy to dispatch to the pool based on
the latter's fill level.

Signed-off-by: Nicolai Stange 
---
 drivers/char/random.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 60ce185d7b2d..78e65367ea86 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -2640,6 +2640,7 @@ void add_hwgenerator_randomness(const char *buffer, 
size_t count,
size_t entropy)
 {
struct entropy_store *poolp = &input_pool;
+   struct queued_entropy q = { 0 };
 
if (unlikely(crng_init == 0)) {
crng_fast_load(buffer, count);
@@ -2652,8 +2653,9 @@ void add_hwgenerator_randomness(const char *buffer, 
size_t count,
 */
wait_event_interruptible(random_write_wait, kthread_should_stop() ||
ENTROPY_BITS(&input_pool) <= random_write_wakeup_bits);
+   queue_entropy(poolp, &q, entropy << ENTROPY_SHIFT);
mix_pool_bytes(poolp, buffer, count);
-   credit_entropy_bits(poolp, entropy);
+   dispatch_queued_entropy(poolp, &q);
 }
 EXPORT_SYMBOL_GPL(add_hwgenerator_randomness);
 
-- 
2.26.2



[RFC PATCH 16/41] random: convert random_ioctl() to queued_entropy API

2020-09-21 Thread Nicolai Stange
In an effort to drop credit_entropy_bits_safe() in favor of the new
queue_entropy()/dispatch_queued_entropy() API, convert random_ioctl() from
the former to the latter.

Implement two helpers:
- queue_entropy_bits_safe(), which checks the entropy passed from userspace
  for extreme values in analogy to what credit_entropy_bits_safe() did
- discard_queued_entropy(), which is invoked from random_ioctl() to discard
  the entropy queued prior to the write_pool() call in case the latter
  fails.

Use them to convert the two call sites of credit_entropy_bits_safe()
in random_ioctl() to the new API.

As a side effect, the pool entropy watermark as tracked over the duration
of the write_pool() operation is now correctly taken into account
when calculating the amount of new entropy to dispatch to the pool based on
the latter's fill level.

Signed-off-by: Nicolai Stange 
---
 drivers/char/random.c | 57 ---
 1 file changed, 53 insertions(+), 4 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 78e65367ea86..03eadefabbca 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -737,7 +737,9 @@ struct queued_entropy {
  * dispatch. However, any such sequence of invocations must eventually
  * be followed by exactly one call to either of __dequeue_entropy(),
  * __dispatch_queued_entropy_fast() or dispatch_queued_entropy()
- * when the actual pool mixing has completed.
+ * when the actual pool mixing has completed. Alternatively,
+ * discard_queued_entropy() may be called in case the mixing has
+ * failed.
  * __queue_entropy() must be called with r->lock held.
  *
  * Entropy extraction is a two-step process:
@@ -813,6 +815,26 @@ static void queue_entropy(struct entropy_store *r, struct 
queued_entropy *q,
spin_unlock_irqrestore(&r->lock, flags);
 }
 
+/*
+ * Queue entropy which comes from userspace and might take extreme
+ * values.
+ */
+static int queue_entropy_bits_safe(struct entropy_store *r,
+  struct queued_entropy *q,
+  int nbits)
+{
+   const int nbits_max = r->poolinfo->poolwords * 32;
+
+   if (nbits < 0)
+   return -EINVAL;
+
+   /* Cap the value to avoid overflows */
+   nbits = min(nbits,  nbits_max);
+
+   queue_entropy(r, q, nbits << ENTROPY_SHIFT);
+   return 0;
+}
+
 /*
  * Dequeue previously queued entropy and return the pool entropy
  * watermark to be used in pool_entropy_delta().
@@ -950,6 +972,22 @@ static void dispatch_queued_entropy(struct entropy_store 
*r,
}
 }
 
+/*
+ * Discard queued entropy. May be called when e.g. a write_pool()
+ * operation failed and the corresponding previously queued entropy
+ * should not get dispatched to the pool.
+ */
+static void discard_queued_entropy(struct entropy_store *r,
+  struct queued_entropy *q)
+{
+   unsigned long flags;
+   int pool_watermark;
+
+   spin_lock_irqsave(&r->lock, flags);
+   __dequeue_entropy(r, q, &pool_watermark);
+   spin_unlock_irqrestore(&r->lock, flags);
+}
+
 /*
  * Credit the entropy store with n bits of entropy.
  * Use credit_entropy_bits_safe() if the value comes from userspace
@@ -2272,6 +2310,7 @@ static long random_ioctl(struct file *f, unsigned int 
cmd, unsigned long arg)
int size, ent_count;
int __user *p = (int __user *)arg;
int retval;
+   struct queued_entropy q = { 0 };
 
switch (cmd) {
case RNDGETENTCNT:
@@ -2285,7 +2324,11 @@ static long random_ioctl(struct file *f, unsigned int 
cmd, unsigned long arg)
return -EPERM;
if (get_user(ent_count, p))
return -EFAULT;
-   return credit_entropy_bits_safe(&input_pool, ent_count);
+   retval = queue_entropy_bits_safe(&input_pool, &q, ent_count);
+   if (retval < 0)
+   return retval;
+   dispatch_queued_entropy(&input_pool, &q);
+   return 0;
case RNDADDENTROPY:
if (!capable(CAP_SYS_ADMIN))
return -EPERM;
@@ -2295,11 +2338,17 @@ static long random_ioctl(struct file *f, unsigned int 
cmd, unsigned long arg)
return -EINVAL;
if (get_user(size, p++))
return -EFAULT;
+   retval = queue_entropy_bits_safe(&input_pool, &q, ent_count);
+   if (retval < 0)
+   return retval;
retval = write_pool(&input_pool, (const char __user *)p,
size);
-   if (retval < 0)
+   if (retval < 0) {
+   discard_queued_entropy(&input_pool, &q);
return retval;
-   return credit_entropy_bits_safe(&input_pool, ent_count);
+   }
+   dispatch_queued_entropy(&input_pool, &q);
+   return 0;

[RFC PATCH 18/41] random: move arch_get_random_seed() calls in crng_reseed() into own loop

2020-09-21 Thread Nicolai Stange
x86's RDSEED/RDRAND insns have reportedly been slowed down significantly
due to the ucode update required to mitigate against the "Special Register
Buffer Data Sampling" vulnerability (CVE-2020-0543) and should not get
invoked from the interrupt path anymore.

In preparation for getting rid of that arch_get_random_long() call currently
found in add_interrupt_randomness(), move those arch_get_random_long()
calls in crng_reseed() into a separate loop and outside of the crng->lock.

There is no functional change.

Signed-off-by: Nicolai Stange 
---
 drivers/char/random.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index a49805d0d23c..1945249597e0 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -1200,14 +1200,18 @@ static void crng_reseed(struct crng_state *crng, struct 
entropy_store *r)
_crng_backtrack_protect(_crng, buf.block,
CHACHA_KEY_SIZE);
}
-   spin_lock_irqsave(&crng->lock, flags);
+
for (i = 0; i < 8; i++) {
unsigned long   rv;
if (!arch_get_random_seed_long(&rv) &&
!arch_get_random_long(&rv))
rv = random_get_entropy();
-   crng->state[i+4] ^= buf.key[i] ^ rv;
+   buf.key[i] ^= rv;
}
+
+   spin_lock_irqsave(&crng->lock, flags);
+   for (i = 0; i < 8; i++)
+   crng->state[i+4] ^= buf.key[i];
memzero_explicit(&buf, sizeof(buf));
crng->init_time = jiffies;
spin_unlock_irqrestore(&crng->lock, flags);
-- 
2.26.2



[RFC PATCH 22/41] random: introduce arch_has_sp800_90b_random_seed()

2020-09-21 Thread Nicolai Stange
NIST SP800-90C allows for the combination of multiple SP800-90B entropy
sources by concatenating their individual outputs together, c.f.
sec. 5.3.4 and also the constructions in 10.3.1 from the second draft.

We're already doing exactly that when reseeding the primary CRNG from the
input pool and possibly from arch_get_random_seed_long() +
arch_get_random_long(). The input pool will be moved gradually towards
SP800-90B compliance with future patches. Provide a means for the random
driver to check whether arch_get_random_seed_long() is also a full entropy
source conforming to NIST SP800-90B.

Note that I couldn't find any explicit statement in the specs that would
allow for using NRBGs as defined by SP800-90C as a drop-in replacement for
"entropy sources" in the sense of SP800-90B. In particular there is no
statement that NRBGs may be combined with other SP800-90B entropy sources.
However, NRBGs always provide stronger guarantees in that they provide
certain protection against silent failure of backing entropy sources. Thus,
I think it would be perfectly acceptable to combine SP800-90C NRBGs, i.e.
ARMv8's RNDRRS or x86's RDSEED with other entropy sources.

Introduce arch_has_sp800_90b_random_seed().
- Make the generic stub return false.
- Make the arm64 variant return false as well: the current
  arch_get_random_seed_long() is based on RNDR, not RNDRRS.
- Make it return false on powerpc and s390, too.
- Let arch_has_sp800_90b_random_seed() return true on x86 if the CPU has
  RDSEED support.
Yes, I know, one change per patch, but this is part of an RFC series.

Signed-off-by: Nicolai Stange 
---
 arch/arm64/include/asm/archrandom.h   | 10 +-
 arch/powerpc/include/asm/archrandom.h |  5 +
 arch/s390/include/asm/archrandom.h|  5 +
 arch/x86/include/asm/archrandom.h |  8 
 include/linux/random.h|  9 +
 5 files changed, 36 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/archrandom.h 
b/arch/arm64/include/asm/archrandom.h
index 055d18713db7..db7813c79b3e 100644
--- a/arch/arm64/include/asm/archrandom.h
+++ b/arch/arm64/include/asm/archrandom.h
@@ -42,7 +42,15 @@ static inline bool arch_has_random_seed(void)
 */
return cpus_have_const_cap(ARM64_HAS_RNG);
 }
-
+static inline bool arch_has_sp800_90b_random_seed(void)
+{
+   /*
+* Section K.12.1 from the "Arm Architecture Reference Manual
+* Armv8" (DDI0487F) sounds like RNDRRS could qualify as a
+* NIST SP800-90C NRBG, but we're currently using RNDR only.
+*/
+   return false;
+}
 static inline bool __must_check arch_get_random_long(unsigned long *v)
 {
return false;
diff --git a/arch/powerpc/include/asm/archrandom.h 
b/arch/powerpc/include/asm/archrandom.h
index 47c2d74e7244..ba0f816d9750 100644
--- a/arch/powerpc/include/asm/archrandom.h
+++ b/arch/powerpc/include/asm/archrandom.h
@@ -16,6 +16,11 @@ static inline bool arch_has_random_seed(void)
return ppc_md.get_random_seed;
 }
 
+static inline bool arch_has_sp800_90b_random_seed(void)
+{
+   return false;
+}
+
 static inline bool __must_check arch_get_random_long(unsigned long *v)
 {
return false;
diff --git a/arch/s390/include/asm/archrandom.h 
b/arch/s390/include/asm/archrandom.h
index 18973845634c..1ee7f9e4b255 100644
--- a/arch/s390/include/asm/archrandom.h
+++ b/arch/s390/include/asm/archrandom.h
@@ -31,6 +31,11 @@ static inline bool arch_has_random_seed(void)
return static_branch_likely(_arch_random_available);
 }
 
+static inline bool arch_has_sp800_90b_random_seed(void)
+{
+   return false;
+}
+
 static inline bool __must_check arch_get_random_long(unsigned long *v)
 {
return false;
diff --git a/arch/x86/include/asm/archrandom.h 
b/arch/x86/include/asm/archrandom.h
index 030f46c9e310..94d4ee8c9e45 100644
--- a/arch/x86/include/asm/archrandom.h
+++ b/arch/x86/include/asm/archrandom.h
@@ -80,6 +80,14 @@ static inline bool arch_has_random_seed(void)
return static_cpu_has(X86_FEATURE_RDSEED);
 }
 
+static inline bool arch_has_sp800_90b_random_seed(void)
+{
+   /*
+* According to the Intel SDM, rdseed is NIST SP800-90B
+* compliant.
+*/
+   return arch_has_random_seed();
+}
 static inline bool __must_check arch_get_random_long(unsigned long *v)
 {
return arch_has_random() ? rdrand_long(v) : false;
diff --git a/include/linux/random.h b/include/linux/random.h
index d4653422a0c7..933f5daa4a1c 100644
--- a/include/linux/random.h
+++ b/include/linux/random.h
@@ -128,6 +128,15 @@ static inline bool arch_has_random_seed(void)
 {
return false;
 }
+/*
+ * Whether or not arch_get_random_seed_long() qualifies as a NIST
+ * SP800-90B compliant entropy source providing full entropy output.
+ * NIST SP800-90C NRBG's are probably fine, too.
+ */
+static inline bool arch_has_sp800_90b_random_seed(void)
+{
+   return false;
+}
 static inline bool __must_check arch_get_random_long(u

[RFC PATCH 23/41] random: don't award entropy to non-SP800-90B arch RNGs in FIPS mode

2020-09-21 Thread Nicolai Stange
It is required by SP800-90C that only SP800-90B compliant entropy sources
may be used for seeding DRBGs.

Don't award any entropy to arch_get_random_long() if fips_enabled is
true. Don't award any entropy to arch_get_random_seed_long() if
fips_enabled && !arch_has_sp800_90b_random_seed().

This is achieved by making min_crng_reseed_pool_entropy() return the
full minimum seed size if fips_enabled && !arch_has_sp800_90b_random_seed()
is true. This prevents crng_reseed() from attempting to make up for any
lack of entropy in the input_pool by reading from the architectural RNG.

Make crng_reseed() bail out in FIPS mode if the input_pool provides
insufficient entropy and any of the arch_get_random_seed_long()
invocations fails: there's no statement regarding SP800-90B compliance of
arch_get_random_long() and so it can't be used as a backup.

Signed-off-by: Nicolai Stange 
---
 drivers/char/random.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 7712b4464ef5..aaddee4e4ab1 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -1195,9 +1195,13 @@ static int min_crng_reseed_pool_entropy(void)
 * up to one half of the minimum entropy needed for
 * reseeding. That way it won't dominate the entropy
 * collected by other means at input_pool.
+* If in FIPS mode, restrict this to SP800-90B compliant
+* architectural RNGs.
 */
-   if (arch_has_random() || arch_has_random_seed())
+   if (arch_has_sp800_90b_random_seed() ||
+   (!fips_enabled && (arch_has_random() || arch_has_random_seed()))) {
return 8;
+   }
return 16;
 }
 
@@ -1233,7 +1237,8 @@ static void crng_reseed(struct crng_state *crng, struct 
entropy_store *r)
for (i = 0; i < 8; i++) {
unsigned long   rv;
if (!arch_get_random_seed_long(&rv) &&
-   !arch_get_random_long(&rv)) {
+   ((arch_randomness_required && fips_enabled) ||
+    !arch_get_random_long(&rv))) {
if (arch_randomness_required) {
/*
 * The input_pool failed to provide
-- 
2.26.2



[RFC PATCH 26/41] random: implement support for evaluating larger fast_pool entropies

2020-09-21 Thread Nicolai Stange
The fast_pools, mixed into from add_interrupt_randomness(), are 128 bits
wide and get awarded at most an entropy value as low as one bit in total.

An upcoming patch will significantly increase the estimated entropy per
event and this will make the fast_pools receive much larger values
of successively mixed in event entropy.

In analogy to the reasoning in commit 30e37ec516ae ("random: account for
entropy loss due to overwrites"), probabilistic collisions will have to get
accounted for when calculating the total fast_pool entropy.

From said commit, the final fast_pool entropy equals

  e = pool_size * (1 - exp(-num_events * entropy_per_event / pool_size))

Where pool_size is 128 bits in this case and
num_events * entropy_per_event equals the sum of all estimated entropy from
the IRQ events previously mixed in.

Disclaimer: I'm cargo-culting here. That probabilistic overwrites are in
fact an issue sounds plausible after having read e.g. "Balls into Bins" by
M.Raab and A. Steger. But I haven't managed to derive this equation by
myself, nor have I found it in any literature.

Anyway, implement the new fast_pool_entropy() for evaluating the equation
given above by means of a suitable approximation.

add_interrupt_randomness() empties its fast_pool into the global input_pool
whenever the number of accumulated events has reached a threshold of 64
and the input_pool's ->lock is uncontended. Thus, the number of IRQ events
accumulated at the fast_pool can be assumed to be unlikely to exceed
larger factors of 64. The maximum estimate supported for per-IRQ entropy
will be 1 bit and thus, this sets an upper bound on the range where the
approximation is supposed to work well.

At the same time, estimates for the per-IRQ entropy as low as 1/64 bits
should be supported and the approximation should not be too coarse in these
lower regions in order to avoid excessive loss when entropy is likely a
scarce resource anyway.

Apply a piecewise linear approximation to the fast_pool entropy, with
the lengths of the resp. intervals getting doubled with increasing input
values. That is, let the first interval cover 32 bits worth of input
entropy, the next one 64 bits and stop after a final one of length
128 bits. Any input entropy beyond 32 + 64 + 128 bits gets discarded in
order to limit the computations done from interrupt context, but as
outlined above, this is unlikely to matter in practice. The shorter
intervals for the regions of smaller values will improve the accuracy of
the approximation in these areas, i.e. for small estimates for the per-IRQ
entropy.
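
For reference, the three anchor values quoted below can be reproduced with a
few lines of user-space C (not kernel code, link with -lm); they're simply
p * (1 - exp(-x / p)) for p = 128 bits, evaluated at x = 32, 96 and 224 bits
and scaled to 2^-ENTROPY_SHIFT units:

#include <math.h>
#include <stdio.h>

int main(void)
{
	const double pool_bits = 128.0; /* fast_pool size in bits */
	const double unit = 8.0;        /* 2^ENTROPY_SHIFT units per bit */
	const double x[] = { 32.0, 96.0, 224.0 };
	unsigned int i;

	for (i = 0; i < sizeof(x) / sizeof(x[0]); i++) {
		double e = pool_bits * (1.0 - exp(-x[i] / pool_bits));

		/* prints roughly 226, 540 and 846 */
		printf("x = %3.0f bits -> %.1f units\n", x[i], e * unit);
	}
	return 0;
}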

Note that the new fast_pool_entropy() is not being used anywhere yet, it
will be wired up in an upcoming commit.

Signed-off-by: Nicolai Stange 
---
 drivers/char/random.c | 52 +++
 1 file changed, 52 insertions(+)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index a985ceb22c7c..ac36c56dd135 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -1547,6 +1547,58 @@ static __u32 get_reg(struct fast_pool *f, struct pt_regs 
*regs)
return *ptr;
 }
 
+/*
+ * Calculate the entropy of a fast_pool after num_events IRQ events of
+ * assessed entropy 2^-event_entropy_shift each have been mixed in.
+ */
+static unsigned int fast_pool_entropy(unsigned int num_events,
+ int event_entropy_shift)
+{
+   unsigned int result, cur, interval_len;
+
+   /*
+* Assume that each event fed into the fast_pool
+* carries h = 2^-event_entropy_shift bits of entropy.
+* In analogy to how entropy deltas are calculated
+* in pool_entropy_delta() for the struct entropy_store
+* input_pool, a fast_pool which received num_events
+* of total entropy num_events * h will contain
+* p * (1 - exp(-num_events * h / p))
+* bits of entropy, where p equals the poolsize of 128 bits.
+*
+* Note that most of the time num_events will be ~64, c.f.
+* add_interrupt_randomness.  Approximate the resulting
+* fast_pool entropy in a piecewise linear manner from below:
+* from 0 to 32, from 32 to 96 and from 96 to 224.
+* Event entropy above 224 gets simply discarded. For p = 128,
+* the calculated fast_pool entropy is ~226 at
+* num_events * h == 32, ~540 at 96 and ~846 at 224, all given
+* in units of 2^-ENTROPY_SHIFT.
+*/
+   BUILD_BUG_ON(sizeof(((struct fast_pool *)NULL)->pool) != 16);
+   BUILD_BUG_ON(ENTROPY_SHIFT != 3);
+
+   /* Interval from 0 to 32. */
+   interval_len = 32 << event_entropy_shift;
+   cur = min_t(unsigned int, num_events, interval_len);
+   result = (226 * cur) >> 5; /* shift is for /32 */
+   num_events -= cur;
+
+   /* Interval of length 64 from 32 to 96. */
+   interval_len <<= 1;
+   cur = min_t(unsigned int, num_events, interval_len);
+

[RFC PATCH 25/41] random: probe cycle counter resolution at initialization

2020-09-21 Thread Nicolai Stange
An upcoming patch will change the entropy estimate per
add_interrupt_randomness() event for fips_enabled based on whether
random_get_entropy() resp. get_cycles() is able to capture individual
instructions.

For example, x86's TSC would qualify, whereas I've seen cycle counters on
e.g. a Raspberry PI 2B with an advertised resolution of only 52ns even
though the CPU had been clocked at 1GHz. And then there's possibly hardware
which doesn't have a cycle counter at all and where get_cycles() would
always return the same constant.

Make rand_initialize() probe the cycle counter resolution.

Introduce a new static_key have_highres_cycle_ctr, indicating whether
or not the system's cycle counter is able to capture individual
instructions. Initially it's set to true. Introduce
probe_cycle_ctr_resolution() and call it from rand_initialize().
Make probe_cycle_ctr_resolution() compare 16 successive
random_get_entropy() values and disable have_highres_cycle_ctr in case
the same value has been read two times in a row. As have_highres_cycle_ctr
will only be accessed if fips_enabled is true, make probe_cycle_ctr_resolution()
return early in case fips_enabled is not set.
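
The same probing idea can be tried in user space with clock_gettime() standing
in for random_get_entropy(); this is only an analogy (a nanosecond clock will
typically pass the check), not what the patch does:

#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <time.h>

static long long read_ctr(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

int main(void)
{
	long long prev = read_ctr();
	int i, highres = 1;

	/* two identical readings in a row => counter too coarse */
	for (i = 0; i < 16; i++) {
		long long next = read_ctr();

		if (next == prev) {
			highres = 0;
			break;
		}
		prev = next;
	}
	printf("high resolution: %s\n", highres ? "yes" : "no");
	return 0;
}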

Signed-off-by: Nicolai Stange 
---
 drivers/char/random.c | 29 +
 1 file changed, 29 insertions(+)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index aaddee4e4ab1..a985ceb22c7c 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -335,6 +335,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -478,6 +479,8 @@ static struct ratelimit_state urandom_warning =
 
 static int ratelimit_disable __read_mostly;
 
+static DEFINE_STATIC_KEY_TRUE(have_highres_cycle_ctr);
+
 module_param_named(ratelimit_disable, ratelimit_disable, int, 0644);
 MODULE_PARM_DESC(ratelimit_disable, "Disable random ratelimit suppression");
 
@@ -2170,6 +2173,31 @@ static void __init init_std_data(struct entropy_store *r)
mix_pool_bytes(r, utsname(), sizeof(*(utsname(;
 }
 
+static void probe_cycle_ctr_resolution(void)
+{
+   cycles_t prev;
+   int i;
+
+   if (!fips_enabled)
+   return;
+
+   /*
+* Check if the cycle counter has insn granularity (or at
+* least close to).
+*/
+   prev = random_get_entropy();
+   for (i = 0; i < 16; ++i) {
+   cycles_t next;
+
+   next = random_get_entropy();
+   if (next == prev) {
+   static_branch_disable(_highres_cycle_ctr);
+   return;
+   }
+   prev = next;
+   }
+}
+
 /*
  * Note that setup_arch() may call add_device_randomness()
  * long before we get here. This allows seeding of the pools
@@ -2182,6 +2210,7 @@ static void __init init_std_data(struct entropy_store *r)
  */
 int __init rand_initialize(void)
 {
+   probe_cycle_ctr_resolution();
init_std_data(_pool);
crng_initialize_primary(_crng);
crng_global_init_time = jiffies;
-- 
2.26.2



[RFC PATCH 20/41] random: provide min_crng_reseed_pool_entropy()

2020-09-21 Thread Nicolai Stange
Currently, the minimum entropy required from the input_pool for
reseeding the primary_crng is 16 bytes == 128 bits. A future patch will
introduce support for obtaining up to a certain fraction thereof from the
architecture's RNG, if available.

This will effectively lower the minimum input_pool ->entropy_count required
for a successful reseed of the primary_crng.

As this value is used at a couple of places, namely crng_reseed() itself
as well as dispatch_queued_entropy() and __dispatch_queued_entropy_fast(),
introduce min_crng_reseed_pool_entropy() to ensure consistency among
these.

min_crng_reseed_pool_entropy() returns the minimum amount of entropy in
bytes required from the input_pool for a successful reseed of the
primary_crng. Currently it's hardcoded to 16.

Use it in place of the hardcoded constants in crng_reseed(),
dispatch_queued_entropy() and __dispatch_queued_entropy_fast().

Signed-off-by: Nicolai Stange 
---
 drivers/char/random.c | 18 +++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 1945249597e0..424de1565927 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -516,6 +516,8 @@ static ssize_t extract_entropy(struct entropy_store *r, 
void *buf,
 static ssize_t _extract_entropy(struct entropy_store *r, void *buf,
size_t nbytes, int fips);
 
+static int min_crng_reseed_pool_entropy(void);
+
 static void crng_reseed(struct crng_state *crng, struct entropy_store *r);
 static __u32 input_pool_data[INPUT_POOL_WORDS] __latent_entropy;
 
@@ -916,7 +918,7 @@ static bool __dispatch_queued_entropy_fast(struct 
entropy_store *r,
if (unlikely(r == _pool && crng_init < 2)) {
const int entropy_bits = entropy_count >> ENTROPY_SHIFT;
 
-   return (entropy_bits >= 128);
+   return (entropy_bits >= min_crng_reseed_pool_entropy() * 8);
}
 
return false;
@@ -965,7 +967,7 @@ static void dispatch_queued_entropy(struct entropy_store *r,
if (crng_init < 2) {
const int entropy_bits = entropy_count >> ENTROPY_SHIFT;
 
-   if (entropy_bits < 128)
+   if (entropy_bits < min_crng_reseed_pool_entropy() * 8)
return;
crng_reseed(_crng, r);
}
@@ -1182,6 +1184,15 @@ static int crng_slow_load(const char *cp, size_t len)
return 1;
 }
 
+/*
+ * Minimum amount of entropy in bytes required from the input_pool for
+ * a successful reseed of the primary_crng.
+ */
+static int min_crng_reseed_pool_entropy(void)
+{
+   return 16;
+}
+
 static void crng_reseed(struct crng_state *crng, struct entropy_store *r)
 {
unsigned long   flags;
@@ -1192,7 +1203,8 @@ static void crng_reseed(struct crng_state *crng, struct 
entropy_store *r)
} buf;
 
if (r) {
-   num = extract_entropy(r, &buf, 32, 16);
+   num = extract_entropy(r, &buf, 32,
+ min_crng_reseed_pool_entropy());
if (num == 0)
return;
} else {
-- 
2.26.2



[RFC PATCH 17/41] random: drop credit_entropy_bits() and credit_entropy_bits_safe()

2020-09-21 Thread Nicolai Stange
All former call sites of credit_entropy_bits() and
credit_entropy_bits_safe() respectively have been converted to the new
dispatch_queued_entropy() API. Drop the now unused functions.

Signed-off-by: Nicolai Stange 
---
 drivers/char/random.c | 29 +
 1 file changed, 1 insertion(+), 28 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 03eadefabbca..a49805d0d23c 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -533,7 +533,7 @@ static __u32 const twist_table[8] = {
 /*
  * This function adds bytes into the entropy "pool".  It does not
  * update the entropy estimate.  The caller should call
- * credit_entropy_bits if this is appropriate.
+ * queue_entropy()+dispatch_queued_entropy() if this is appropriate.
  *
  * The pool is stirred with a primitive polynomial of the appropriate
  * degree, and then twisted.  We twist by three bits at a time because
@@ -988,33 +988,6 @@ static void discard_queued_entropy(struct entropy_store *r,
spin_unlock_irqrestore(&r->lock, flags);
 }
 
-/*
- * Credit the entropy store with n bits of entropy.
- * Use credit_entropy_bits_safe() if the value comes from userspace
- * or otherwise should be checked for extreme values.
- */
-static void credit_entropy_bits(struct entropy_store *r, int nbits)
-{
-   struct queued_entropy q = { 0 };
-
-   queue_entropy(r, &q, nbits << ENTROPY_SHIFT);
-   dispatch_queued_entropy(r, &q);
-}
-
-static int credit_entropy_bits_safe(struct entropy_store *r, int nbits)
-{
-   const int nbits_max = r->poolinfo->poolwords * 32;
-
-   if (nbits < 0)
-   return -EINVAL;
-
-   /* Cap the value to avoid overflows */
-   nbits = min(nbits,  nbits_max);
-
-   credit_entropy_bits(r, nbits);
-   return 0;
-}
-
 /*
  *
  * CRNG using CHACHA20
-- 
2.26.2



[RFC PATCH 05/41] random: don't reset entropy to zero on overflow

2020-09-21 Thread Nicolai Stange
credit_entropy_bits() adds one or more positive values to the signed
entropy_count and checks if the result is negative afterwards. Note that
because the initial value of entropy_count is positive, a negative result
can happen only on overflow.

However, if the final entropy_count is found to have overflown, a WARN()
is emitted and the entropy_store's entropy count reset to zero. Even
though this case should never happen, it is better to retain previously
available entropy as this will facilitate a future change factoring out
that approximation of the exponential.

Make credit_entropy_bits() to reset entropy_count to the original value
rather than zero on overflow.

Signed-off-by: Nicolai Stange 
---
 drivers/char/random.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 35e381be20fe..6adac462aa0d 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -706,7 +706,7 @@ static void credit_entropy_bits(struct entropy_store *r, 
int nbits)
if (WARN_ON(entropy_count < 0)) {
pr_warn("negative entropy/overflow: pool %s count %d\n",
r->name, entropy_count);
-   entropy_count = 0;
+   entropy_count = orig;
} else if (entropy_count > pool_size)
entropy_count = pool_size;
if (cmpxchg(>entropy_count, orig, entropy_count) != orig)
-- 
2.26.2



[RFC PATCH 14/41] random: drop __credit_entropy_bits_fast()

2020-09-21 Thread Nicolai Stange
All former call sites of __credit_entropy_bits_fast() have been converted
to the new __dispatch_queued_entropy_fast() API. Drop the now unused
__credit_entropy_bits_fast().

Signed-off-by: Nicolai Stange 
---
 drivers/char/random.c | 14 --
 1 file changed, 14 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index dfbe49fdbcf1..60ce185d7b2d 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -900,20 +900,6 @@ static bool __dispatch_queued_entropy_fast(struct 
entropy_store *r,
return false;
 }
 
-/*
- * Credit the entropy store with n bits of entropy.
- * To be used from hot paths when it is either known that nbits is
- * smaller than one half of the pool size or losing anything beyond that
- * doesn't matter. Must be called with r->lock being held.
- */
-static bool __credit_entropy_bits_fast(struct entropy_store *r, int nbits)
-{
-   struct queued_entropy q = { 0 };
-
-   __queue_entropy(r, &q, nbits << ENTROPY_SHIFT);
-   return __dispatch_queued_entropy_fast(r, &q);
-}
-
 /*
  * Credit the pool with previously queued entropy.
  *
-- 
2.26.2



[RFC PATCH 11/41] random: convert add_timer_randomness() to queued_entropy API

2020-09-21 Thread Nicolai Stange
In an effort to drop __credit_entropy_bits_fast() in favor of the new
__queue_entropy()/__dispatch_queued_entropy_fast() API, convert
add_timer_randomness() from the former to the latter.

There is no change in functionality at this point, because
__credit_entropy_bits_fast() has already been reimplemented on top of the
new API before.

Signed-off-by: Nicolai Stange 
---
 drivers/char/random.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index b91d1fc08ac5..e8c86abde901 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -1400,6 +1400,7 @@ static void add_timer_randomness(struct timer_rand_state 
*state, unsigned num)
long delta, delta2, delta3;
bool reseed;
unsigned long flags;
+   struct queued_entropy q = { 0 };
 
sample.jiffies = jiffies;
sample.cycles = random_get_entropy();
@@ -1432,13 +1433,14 @@ static void add_timer_randomness(struct 
timer_rand_state *state, unsigned num)
 
r = &input_pool;
spin_lock_irqsave(&r->lock, flags);
-   __mix_pool_bytes(r, &sample, sizeof(sample));
/*
 * delta is now minimum absolute delta.
 * Round down by 1 bit on general principles,
 * and limit entropy estimate to 12 bits.
 */
-   reseed = __credit_entropy_bits_fast(r, min_t(int, fls(delta>>1), 11));
+   __queue_entropy(r, &q, min_t(int, fls(delta>>1), 11) << ENTROPY_SHIFT);
+   __mix_pool_bytes(r, &sample, sizeof(sample));
+   reseed = __dispatch_queued_entropy_fast(r, &q);
spin_unlock_irqrestore(&r->lock, flags);
if (reseed)
crng_reseed(&primary_crng, r);
-- 
2.26.2



[RFC PATCH 21/41] random: don't invoke arch_get_random_long() from add_interrupt_randomness()

2020-09-21 Thread Nicolai Stange
writing to the target buffer area following the
bytes obtained from the input_pool.

Thus, should failing arch_get_random_long()s in combination with a set
arch_randomness_required ever become a problem in the future, it would be
better to improve the error path and simply return the unused entropy
extracted from the input_pool.

Signed-off-by: Nicolai Stange 
---
 drivers/char/random.c | 49 +--
 1 file changed, 29 insertions(+), 20 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 424de1565927..7712b4464ef5 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -1190,6 +1190,14 @@ static int crng_slow_load(const char *cp, size_t len)
  */
 static int min_crng_reseed_pool_entropy(void)
 {
+   /*
+* If there's an architecture provided RNG, use it for
+* up to one half of the minimum entropy needed for
+* reseeding. That way it won't dominate the entropy
+* collected by other means at input_pool.
+*/
+   if (arch_has_random() || arch_has_random_seed())
+   return 8;
return 16;
 }
 
@@ -1197,6 +1205,7 @@ static void crng_reseed(struct crng_state *crng, struct 
entropy_store *r)
 {
unsigned long   flags;
int i, num;
+   boolarch_randomness_required = false;
union {
__u8block[CHACHA_BLOCK_SIZE];
__u32   key[8];
@@ -1205,8 +1214,16 @@ static void crng_reseed(struct crng_state *crng, struct 
entropy_store *r)
if (r) {
num = extract_entropy(r, &buf, 32,
  min_crng_reseed_pool_entropy());
-   if (num == 0)
+   if (num == 0) {
return;
+   } else if (num < 16) {
+   /*
+* The input_pool did not provide sufficient
+* entropy for reseeding and the architecture
+* provided RNG will have to make up for it.
+*/
+   arch_randomness_required = true;
+   }
} else {
_extract_crng(&primary_crng, buf.block);
_crng_backtrack_protect(&primary_crng, buf.block,
@@ -1216,8 +1233,17 @@ static void crng_reseed(struct crng_state *crng, struct 
entropy_store *r)
for (i = 0; i < 8; i++) {
unsigned long   rv;
if (!arch_get_random_seed_long(&rv) &&
-   !arch_get_random_long(&rv))
+   !arch_get_random_long(&rv)) {
+   if (arch_randomness_required) {
+   /*
+* The input_pool failed to provide
+* sufficient entropy and the arch RNG
+* could not make up for that either.
+*/
+   return;
+   }
rv = random_get_entropy();
+   }
buf.key[i] ^= rv;
}
 
@@ -1522,8 +1548,6 @@ void add_interrupt_randomness(int irq, int irq_flags)
cycles_tcycles = random_get_entropy();
__u32   c_high, j_high;
__u64   ip;
-   unsigned long   seed;
-   int credit = 0;
boolreseed;
struct queued_entropy   q = { 0 };
 
@@ -1560,26 +1584,11 @@ void add_interrupt_randomness(int irq, int irq_flags)
if (!spin_trylock(&r->lock))
return;
 
-   /*
-* If we have architectural seed generator, produce a seed and
-* add it to the pool further below. For the sake of paranoia
-* don't let the architectural seed generator dominate the
-* input from the interrupt noise.
-*/
-   credit = !!arch_get_random_long(&seed);
-
fast_pool->last = now;
fast_pool->count = 0;
/* award one bit for the contents of the fast pool */
-   __queue_entropy(r, &q, (credit + 1) << ENTROPY_SHIFT);
+   __queue_entropy(r, &q, 1 << ENTROPY_SHIFT);
__mix_pool_bytes(r, &fast_pool->pool, sizeof(fast_pool->pool));
-   if (credit) {
-   /*
-* A seed has been obtained from
-* arch_get_random_seed_long() above, mix it in.
-*/
-   __mix_pool_bytes(r, &seed, sizeof(seed));
-   }
reseed = __dispatch_queued_entropy_fast(r, &q);
spin_unlock(&r->lock);
if (reseed)
-- 
2.26.2



[RFC PATCH 19/41] random: reintroduce arch_has_random() + arch_has_random_seed()

2020-09-21 Thread Nicolai Stange
A future patch will introduce support for making up for a certain amount
of lacking entropy in crng_reseed() by means of arch_get_random_long()
or arch_get_random_seed_long() respectively.

However, before even the tiniest bit of precious entropy is withdrawn from
the input_pool, it should be checked whether the current arch even
has support for these.

Reintroduce arch_has_random() + arch_has_random_seed() and implement
them for arm64, powerpc, s390 and x86 as appropriate (yeah, I know this
should go in separate commits, but this is part of an RFC series).

Note that this more or less reverts commits
  647f50d5d9d9 ("linux/random.h: Remove arch_has_random,
 arch_has_random_seed")
  cbac004995a0 ("powerpc: Remove arch_has_random, arch_has_random_seed")
  5e054c820f59 ("s390: Remove arch_has_random, arch_has_random_seed")
  5f2ed7f5b99b ("x86: Remove arch_has_random, arch_has_random_seed")

Signed-off-by: Nicolai Stange 
---
 arch/arm64/include/asm/archrandom.h   | 25 ++---
 arch/powerpc/include/asm/archrandom.h | 12 +++-
 arch/s390/include/asm/archrandom.h| 14 --
 arch/x86/include/asm/archrandom.h | 18 ++
 include/linux/random.h|  8 
 5 files changed, 63 insertions(+), 14 deletions(-)

diff --git a/arch/arm64/include/asm/archrandom.h 
b/arch/arm64/include/asm/archrandom.h
index 44209f6146aa..055d18713db7 100644
--- a/arch/arm64/include/asm/archrandom.h
+++ b/arch/arm64/include/asm/archrandom.h
@@ -26,17 +26,13 @@ static inline bool __arm64_rndr(unsigned long *v)
return ok;
 }
 
-static inline bool __must_check arch_get_random_long(unsigned long *v)
-{
-   return false;
-}
 
-static inline bool __must_check arch_get_random_int(unsigned int *v)
+static inline bool arch_has_random(void)
 {
return false;
 }
 
-static inline bool __must_check arch_get_random_seed_long(unsigned long *v)
+static inline bool arch_has_random_seed(void)
 {
/*
 * Only support the generic interface after we have detected
@@ -44,7 +40,22 @@ static inline bool __must_check 
arch_get_random_seed_long(unsigned long *v)
 * cpufeature code and with potential scheduling between CPUs
 * with and without the feature.
 */
-   if (!cpus_have_const_cap(ARM64_HAS_RNG))
+   return cpus_have_const_cap(ARM64_HAS_RNG);
+}
+
+static inline bool __must_check arch_get_random_long(unsigned long *v)
+{
+   return false;
+}
+
+static inline bool __must_check arch_get_random_int(unsigned int *v)
+{
+   return false;
+}
+
+static inline bool __must_check arch_get_random_seed_long(unsigned long *v)
+{
+   if (!arch_has_random_seed())
return false;
 
return __arm64_rndr(v);
diff --git a/arch/powerpc/include/asm/archrandom.h 
b/arch/powerpc/include/asm/archrandom.h
index 9a53e29680f4..47c2d74e7244 100644
--- a/arch/powerpc/include/asm/archrandom.h
+++ b/arch/powerpc/include/asm/archrandom.h
@@ -6,6 +6,16 @@
 
 #include 
 
+static inline bool arch_has_random(void)
+{
+   return false;
+}
+
+static inline bool arch_has_random_seed(void)
+{
+   return ppc_md.get_random_seed;
+}
+
 static inline bool __must_check arch_get_random_long(unsigned long *v)
 {
return false;
@@ -18,7 +28,7 @@ static inline bool __must_check arch_get_random_int(unsigned 
int *v)
 
 static inline bool __must_check arch_get_random_seed_long(unsigned long *v)
 {
-   if (ppc_md.get_random_seed)
+   if (arch_has_random_seed())
return ppc_md.get_random_seed(v);
 
return false;
diff --git a/arch/s390/include/asm/archrandom.h 
b/arch/s390/include/asm/archrandom.h
index de61ce562052..18973845634c 100644
--- a/arch/s390/include/asm/archrandom.h
+++ b/arch/s390/include/asm/archrandom.h
@@ -21,6 +21,16 @@ extern atomic64_t s390_arch_random_counter;
 
 bool s390_arch_random_generate(u8 *buf, unsigned int nbytes);
 
+static inline bool arch_has_random(void)
+{
+   return false;
+}
+
+static inline bool arch_has_random_seed(void)
+{
+   return static_branch_likely(&s390_arch_random_available);
+}
+
 static inline bool __must_check arch_get_random_long(unsigned long *v)
 {
return false;
@@ -33,7 +43,7 @@ static inline bool __must_check arch_get_random_int(unsigned 
int *v)
 
 static inline bool __must_check arch_get_random_seed_long(unsigned long *v)
 {
-   if (static_branch_likely(&s390_arch_random_available)) {
+   if (arch_has_random_seed()) {
return s390_arch_random_generate((u8 *)v, sizeof(*v));
}
return false;
@@ -41,7 +51,7 @@ static inline bool __must_check 
arch_get_random_seed_long(unsigned long *v)
 
 static inline bool __must_check arch_get_random_seed_int(unsigned int *v)
 {
-   if (static_branch_likely(&s390_arch_random_available)) {
+   if (arch_has_random_seed()) {
return s390_arch_random_generate((u8 *)v, sizeof(*v));
}
   

[RFC PATCH 24/41] init: call time_init() before rand_initialize()

2020-09-21 Thread Nicolai Stange
Commit d55535232c3d ("random: move rand_initialize() earlier") moved the
rand_initialize() invocation from the early initcalls to right after
timekeeping_init(), but before time_init().

However, rand_initialize() would indirectly invoke random_get_entropy(),
which is an alias to get_cycles() on most archs, in case an architectural
RNG is not available. Problem is that on some archs, e.g. ARM,
get_cycles() can only be relied upon when time_init() has completed.

Move the invocation of time_init() a couple of lines up in start_kernel()
so that it gets called before rand_initialize().

Note that random_get_entropy() data doesn't get any entropy credit and
thus, this issue is not to be considered a bug, but more of an
inconsistency.

Fixes: d55535232c3d ("random: move rand_initialize() earlier")
Signed-off-by: Nicolai Stange 
---
 init/main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/init/main.c b/init/main.c
index ae78fb68d231..30892675f48e 100644
--- a/init/main.c
+++ b/init/main.c
@@ -942,6 +942,7 @@ asmlinkage __visible void __init __no_sanitize_address 
start_kernel(void)
hrtimers_init();
softirq_init();
timekeeping_init();
+   time_init();
 
/*
 * For best initial stack canary entropy, prepare it after:
@@ -956,7 +957,6 @@ asmlinkage __visible void __init __no_sanitize_address 
start_kernel(void)
add_device_randomness(command_line, strlen(command_line));
boot_init_stack_canary();
 
-   time_init();
perf_event_init();
profile_init();
call_function_init();
-- 
2.26.2



[RFC PATCH 07/41] random: let pool_entropy_delta() take nbits in units of 2^-ENTROPY_SHIFT

2020-09-21 Thread Nicolai Stange
Currently pool_entropy_delta() expects its nbits argument to be given in
units of integral bits. Using fractional bits for processing intermediate
entropy counts consistently throughout the code will facilitate upcoming
changes to the entropy accounting logic in add_interrupt_randomness().
Replace pool_entropy_delta()'s nbits argument with nfrac, which used to be
a local variable and is expected to be given in units of 2^-ENTROPY_SHIFT.
Adapt the single caller, credit_entropy_bits(), accordingly.
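
A quick worked example of the new unit (just for illustration, not part of
the change itself): with the driver's ENTROPY_SHIFT of 3, one bit of entropy
corresponds to nfrac == 1 << 3 == 8, so e.g. the at most 11 bits credited
from add_timer_randomness() become nfrac == 11 << ENTROPY_SHIFT == 88.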

Signed-off-by: Nicolai Stange 
---
 drivers/char/random.c | 19 ++-
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 15dd22d74029..08caa7a691a5 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -655,21 +655,20 @@ static void process_random_ready_list(void)
 
 /*
  * Based on the pool's current entropy fill level, specified as
- * base_entropy_count, and the number of new entropy bits to add,
- * return the amount of new entropy to credit. If the 'fast'
- * parameter is set to true, the calculation will be guaranteed to
- * terminate quickly, but this comes at the expense of capping
- * nbits to one half of the pool size.
+ * base_entropy_count, and the number of new entropy bits in units of
+ * 2^-ENTROPY_SHIFT to add, return the amount of new entropy to
+ * credit. If the 'fast' parameter is set to true, the calculation
+ * will be guaranteed to terminate quickly, but this comes at the
+ * expense of capping nbits to one half of the pool size.
  */
 static unsigned int pool_entropy_delta(struct entropy_store *r,
   int base_entropy_count,
-  int nbits, bool fast)
+  int nfrac, bool fast)
 {
const int pool_size = r->poolinfo->poolfracbits;
int entropy_count = base_entropy_count;
-   int nfrac = nbits << ENTROPY_SHIFT;
 
-   if (!nbits)
+   if (!nfrac)
return 0;
 
/*
@@ -729,7 +728,9 @@ static void credit_entropy_bits(struct entropy_store *r, 
int nbits)
 
 retry:
orig = READ_ONCE(r->entropy_count);
-   entropy_count = orig + pool_entropy_delta(r, orig, nbits, false);
+   entropy_count = orig + pool_entropy_delta(r, orig,
+ nbits << ENTROPY_SHIFT,
+ false);
if (cmpxchg(&r->entropy_count, orig, entropy_count) != orig)
goto retry;
 
-- 
2.26.2



[RFC PATCH 06/41] random: factor the exponential approximation in credit_entropy_bits() out

2020-09-21 Thread Nicolai Stange
In the course of calculating the actual amount of new entropy to credit,
credit_entropy_bits() applies a linear approximation to
exp(-nbits/pool_size) (neglecting scaling factors in the exponent for
the sake of simplicity).

In order to limit approximation errors for large nbits, nbits is divided
into chunks of maximum value pool_size/2 each and said approximation is
applied to these individually in a loop. That loop has a theoretic upper
bound of 2*log2(pool_size), which, with the given pool_size of 128 * 32
bits, equals 24.

However, in practice nbits hardly ever exceeds values as large as
pool_size/2 == 2048, especially not when called from interrupt context,
i.e. from add_interrupt_randomness() and alike. Thus, imposing a limit of
one single iteration in these contexts would yield a good guarantee with
respect to runtime while not losing any entropy.

In preparation to enabling that, move the approximation code in
credit_entropy_bits() into a separate function, pool_entropy_delta().
Based on the initial pool entropy count and the number of new entropy bits
to credit, it calculates and returns a (positive) delta to add to the
former. In case the 'fast' parameter is set to true, the calculation
will be terminated after the first iteration, effectively capping the input
nbits to one half of the pool size.

There is no functional change; callers with 'fast' set to true will be
introduced in a future patch.
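
For reference (not part of the patch, just spelling out the approximation
being factored out): each loop iteration credits roughly

  add = 3/4 * anfrac * (pool_size - entropy_count) / pool_size

for a chunk anfrac capped at pool_size/2, which is the linear approximation
of the exact asymptotic fill

  entropy_count' = entropy_count
                   + (pool_size - entropy_count) * (1 - exp(-anfrac/pool_size))

(scaling by ENTROPY_SHIFT omitted here, as above).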

Signed-off-by: Nicolai Stange 
---
 drivers/char/random.c | 53 ++-
 1 file changed, 37 insertions(+), 16 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 6adac462aa0d..15dd22d74029 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -366,7 +366,7 @@
  * denominated in units of 1/8th bits.
  *
  * 2*(ENTROPY_SHIFT + poolbitshift) must <= 31, or the multiply in
- * credit_entropy_bits() needs to be 64 bits wide.
+ * pool_entropy_delta() needs to be 64 bits wide.
  */
 #define ENTROPY_SHIFT 3
 #define ENTROPY_BITS(r) ((r)->entropy_count >> ENTROPY_SHIFT)
@@ -654,22 +654,24 @@ static void process_random_ready_list(void)
 }
 
 /*
- * Credit the entropy store with n bits of entropy.
- * Use credit_entropy_bits_safe() if the value comes from userspace
- * or otherwise should be checked for extreme values.
+ * Based on the pool's current entropy fill level, specified as
+ * base_entropy_count, and the number of new entropy bits to add,
+ * return the amount of new entropy to credit. If the 'fast'
+ * parameter is set to true, the calculation will be guaranteed to
+ * terminate quickly, but this comes at the expense of capping
+ * nbits to one half of the pool size.
  */
-static void credit_entropy_bits(struct entropy_store *r, int nbits)
+static unsigned int pool_entropy_delta(struct entropy_store *r,
+  int base_entropy_count,
+  int nbits, bool fast)
 {
-   int entropy_count, orig;
const int pool_size = r->poolinfo->poolfracbits;
+   int entropy_count = base_entropy_count;
int nfrac = nbits << ENTROPY_SHIFT;
-   int pnfrac;
 
if (!nbits)
-   return;
+   return 0;
 
-retry:
-   entropy_count = orig = READ_ONCE(r->entropy_count);
/*
 * Credit: we have to account for the possibility of
 * overwriting already present entropy.  Even in the
@@ -691,24 +693,43 @@ static void credit_entropy_bits(struct entropy_store *r, 
int nbits)
 * arbitrarily long; this limits the loop to log2(pool_size)*2
 * turns no matter how large nbits is.
 */
-   pnfrac = nfrac;
do {
/* The +2 corresponds to the /4 in the denominator */
const int s = r->poolinfo->poolbitshift + ENTROPY_SHIFT + 2;
-   unsigned int anfrac = min(pnfrac, pool_size/2);
+   unsigned int anfrac = min(nfrac, pool_size/2);
unsigned int add =
((pool_size - entropy_count)*anfrac*3) >> s;
 
entropy_count += add;
-   pnfrac -= anfrac;
-   } while (unlikely(entropy_count < pool_size-2 && pnfrac));
+   nfrac -= anfrac;
+   } while (unlikely(!fast && entropy_count < pool_size-2 && nfrac));
 
if (WARN_ON(entropy_count < 0)) {
pr_warn("negative entropy/overflow: pool %s count %d\n",
r->name, entropy_count);
-   entropy_count = orig;
-   } else if (entropy_count > pool_size)
+   entropy_count = base_entropy_count;
+   } else if (entropy_count > pool_size) {
entropy_count = pool_size;
+   }
+
+   return entropy_count - base_entropy_count;
+}
+
+/*
+ * Credit the entropy store with n bits of entropy.
+ * Use credit_entropy_bits_safe() if the value co

[RFC PATCH 01/41] random: remove dead code in credit_entropy_bits()

2020-09-21 Thread Nicolai Stange
Since commit 90ea1c6436d2 ("random: remove the blocking pool") the local
has_initialized in credit_entropy_bits() won't get set anymore and
the corresponding if-clause became dead code. Remove it as well as the
has_initialized variable itself from credit_entropy_bits().

Signed-off-by: Nicolai Stange 
---
 drivers/char/random.c | 7 +--
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index d20ba1b104ca..0580968fd28c 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -660,7 +660,7 @@ static void process_random_ready_list(void)
  */
 static void credit_entropy_bits(struct entropy_store *r, int nbits)
 {
-   int entropy_count, orig, has_initialized = 0;
+   int entropy_count, orig;
const int pool_size = r->poolinfo->poolfracbits;
int nfrac = nbits << ENTROPY_SHIFT;
 
@@ -717,11 +717,6 @@ static void credit_entropy_bits(struct entropy_store *r, 
int nbits)
if (cmpxchg(&r->entropy_count, orig, entropy_count) != orig)
goto retry;
 
-   if (has_initialized) {
-   r->initialized = 1;
-   kill_fasync(&fasync, SIGIO, POLL_IN);
-   }
-
trace_credit_entropy_bits(r->name, nbits,
  entropy_count >> ENTROPY_SHIFT, _RET_IP_);
 
-- 
2.26.2



[DISCUSSION PATCH 00/41] random: possible ways towards NIST SP800-90B compliance

2020-09-21 Thread Nicolai Stange
MP respectively.

Apart from those well controlled experiments on a RPi, I also did some
lax benchmarking on my x86 desktop (which has some Intel i9, IIRC).
More specifically, I simply didn't touch the system and ftraced
add_interrupt_randomness() for 15 mins. The number of captured events had
been about 2000 in each configuration. Here the add_interrupt_randomness()
performance improved greatly: from 4.3 us on average w/o the patches down
to 2.0 us with the patches applied and fips_enabled. However, I suppose
this gain was due to the removal of RDSEED from add_interrupt_randomness().
Indeed, when inspecting the distribution of add_interrupt_randomness()
runtimes on plain v5.9-rc4 more closely, it can be seen that there's a
good portion of events (about 1/4th) where add_interrupt_randomness() took
about 10us. So I think that this comparison isn't really a fair one...


To the best of my knowledge, these are the remaining open questions/items
towards full SP800-90[A-C] compliance:
- There's no (debugfs?) interface for accessing raw samples for validation
  purposes yet. That would be doable though.
- try_to_generate_entropy() should probably get wired up to the health
  tests as well. More or less straightforward to implement, too.
- Diverting fast_pool contents into net_rand_state is not allowed (for a
  related discussion on this topic see [7]).
- I've been told that SP800-90A is not a hard requirement yet, but I
  suppose it will eventually become one. This would mean that the chacha20
  RNG would have to get replaced by something approved for fips_enabled.
- The sequence of fast_pool -> input_pool -> extract_buf() operations
  is to be considered a "non-vetted conditioning component" in SP800-90B
  speak. It would follow that the output can't be estimated as having full
  entropy, but only 0.999 of its length at max. (c.f. sec. 3.1.5.2). This
  could be resolved by running a SP800-90A derivation function at CRNG
  reseeding for fips_enabled. extract_buf(), which is already SHA1 based,
  could perhaps be transformed into such one as well.
- The only mention of combining different noise sources I was able to find
  had been in SP800-90C, sec. 5.3.4 ("Using Multiple Entropy Sources"):
  it clearly states that the outputs have to be combined by concatenation.
  add_hwgenerator_randomness() mixes into the same input_pool as
  add_interrupt_randomness() though and I would expect that this isn't
  allowed, independent of whether the noise source backing the former
  is SP800-90B compliant or not. IIUC, Stephan solved this for his LRNG
  by maintaining a separate pool for the hw generator.
- SP800-90A sets an upper bound on how many bits may be drawn from a
  DRBG/crng before a reseed *must* take place ("reseed_interval"). In
  principle that shouldn't matter much in practice, at least not with
  CONFIG_NUMA: with reseed_interval == 2^32 bits, a single CRNG instance
  would be allowed to hand out only 500MB worth of randomness before
  reseeding, but a (single) numa crng chained to the primary_crng may
  produce as much as 8PB before the latter must eventually get reseeded
  from the input_pool (see the rough arithmetic after this list). But
  AFAICT, a SP800-90A conforming implementation
  would still have to provide provisions for a blocking extract_crng().
- It's entirely unclear to me whether support for "prediction resistance
  requests" is optional. It would be a pity if it weren't, because IIUC
  that would effectively imply a return to the former blocking_pool
  behaviour, which is obviously a no-no.
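
Rough arithmetic behind the 500MB and 8PB figures above (nothing
authoritative, just back-of-the-envelope): 2^32 bits == 2^29 bytes ==
512 MiB, i.e. the ~500MB per reseed for a single DRBG instance. For the
chained case, assuming each reseed of a numa crng draws on the order of a
256 bit key from the primary_crng, the primary can serve about
2^32 / 2^8 == 2^24 such reseeds before it must itself get reseeded from the
input_pool, and each of them allows for up to 2^32 bits of numa crng output,
i.e. roughly 2^56 bits or about 8PB in total.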


Nicolai Stange (41):
  random: remove dead code in credit_entropy_bits()
  random: remove dead code for nbits < 0 in credit_entropy_bits()
  random: prune dead assignment to entropy_bits in credit_entropy_bits()
  random: drop 'reserved' parameter from extract_entropy()
  random: don't reset entropy to zero on overflow
  random: factor the exponential approximation in credit_entropy_bits()
out
  random: let pool_entropy_delta() take nbits in units of
2^-ENTROPY_SHIFT
  random: introduce __credit_entropy_bits_fast() for hot paths
  random: protect ->entropy_count with the pool spinlock
  random: implement support for delayed entropy dispatching
  random: convert add_timer_randomness() to queued_entropy API
  random: convert add_interrupt_randomness() to queued_entropy API
  random: convert try_to_generate_entropy() to queued_entropy API
  random: drop __credit_entropy_bits_fast()
  random: convert add_hwgenerator_randomness() to queued_entropy API
  random: convert random_ioctl() to queued_entropy API
  random: drop credit_entropy_bits() and credit_entropy_bits_safe()
  random: move arch_get_random_seed() calls in crng_reseed() into own
loop
  random: reintroduce arch_has_random() + arch_has_random_seed()
  random: provide min_crng_reseed_pool_entropy()
  random: don't invoke arch_get_random_long() from
add_interrupt_randomness()
  random: introduce arch_has_sp800_90b_random_seed()

[RFC PATCH 08/41] random: introduce __credit_entropy_bits_fast() for hot paths

2020-09-21 Thread Nicolai Stange
When transferring entropy from the fast_pool into the global input_pool
from add_interrupt_randomness(), there are at least two atomic operations
involved: one when taking the input_pool's spinlock for the actual mixing
and another one in the cmpxchg loop in credit_entropy_bits() for
updating the pool's ->entropy_count. Because cmpxchg is potentially costly,
it would be nice if it could be avoided.

As said, the input_pool's spinlock is taken anyway, and I see no reason
why its scope should not be extended to protect ->entropy_count as well.
Performance considerations set aside, this will also facilitate future
changes introducing additional fields to input_pool which will also have to
get updated atomically from the consumer/producer sides.

The actual move to extend the spinlock's scope to cover ->entropy_count
will be the subject of a future patch. Prepare for that by putting
a limit on the work to be done with the lock being held.

In order to avoid releasing and regrabbing the lock from hot producer paths,
they'll keep it held while executing those calculations in
pool_entropy_delta().
The loop found in the latter has a theoretical upper bound of
2 * log2(pool_size) == 24 iterations. However, as all entropy increments
awarded from the interrupt path are less than pool_size/2 in magnitude,
it is safe to enforce a guaranteed limit of one on the iteration count
by setting pool_entropy_delta()'s 'fast' parameter.

Introduce __credit_entropy_bits_fast() doing exactly that. Currently
it resembles the behaviour from credit_entropy_bits() except that
- pool_entropy_delta() gets called with 'fast' set to true and
- that __credit_entropy_bits_fast() returns a bool indicating whether
  the caller should reseed the primary_crng.

Note that unlike with credit_entropy_bits(), reseeding won't be possible
from within __credit_entropy_bits_fast() anymore once it actually gets
invoked with the pool lock held in the future.

There is no functional change.
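
To illustrate where this is heading (a rough sketch only, not literal kernel
code: 'r', 'sample' and 'nbits' are placeholders and the ->entropy_count
protection by r->lock only comes with a later patch in this series), a hot
path producer would eventually look something like

	unsigned long flags;
	bool reseed;

	spin_lock_irqsave(&r->lock, flags);
	__mix_pool_bytes(r, &sample, sizeof(sample));
	reseed = __credit_entropy_bits_fast(r, nbits);
	spin_unlock_irqrestore(&r->lock, flags);
	if (reseed)
		/* reseeding is not possible under the pool lock */
		crng_reseed(&primary_crng, r);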

Signed-off-by: Nicolai Stange 
---
 drivers/char/random.c | 49 ---
 1 file changed, 46 insertions(+), 3 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 08caa7a691a5..d9e4dd27d45d 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -714,6 +714,39 @@ static unsigned int pool_entropy_delta(struct 
entropy_store *r,
return entropy_count - base_entropy_count;
 }
 
+/*
+ * Credit the entropy store with n bits of entropy.
+ * To be used from hot paths when it is either known that nbits is
+ * smaller than one half of the pool size or losing anything beyond that
+ * doesn't matter.
+ */
+static bool __credit_entropy_bits_fast(struct entropy_store *r, int nbits)
+{
+   int entropy_count, orig;
+
+   if (!nbits)
+   return false;
+
+retry:
+   orig = READ_ONCE(r->entropy_count);
+   entropy_count = orig + pool_entropy_delta(r, orig,
+ nbits << ENTROPY_SHIFT,
+ true);
+   if (cmpxchg(&r->entropy_count, orig, entropy_count) != orig)
+   goto retry;
+
+   trace_credit_entropy_bits(r->name, nbits,
+ entropy_count >> ENTROPY_SHIFT, _RET_IP_);
+
+   if (unlikely(r == &input_pool && crng_init < 2)) {
+   const int entropy_bits = entropy_count >> ENTROPY_SHIFT;
+
+   return (entropy_bits >= 128);
+   }
+
+   return false;
+}
+
 /*
  * Credit the entropy store with n bits of entropy.
  * Use credit_entropy_bits_safe() if the value comes from userspace
@@ -1169,6 +1202,7 @@ static void add_timer_randomness(struct timer_rand_state 
*state, unsigned num)
unsigned num;
} sample;
long delta, delta2, delta3;
+   bool reseed;
 
sample.jiffies = jiffies;
sample.cycles = random_get_entropy();
@@ -1206,7 +1240,9 @@ static void add_timer_randomness(struct timer_rand_state 
*state, unsigned num)
 * Round down by 1 bit on general principles,
 * and limit entropy estimate to 12 bits.
 */
-   credit_entropy_bits(r, min_t(int, fls(delta>>1), 11));
+   reseed = __credit_entropy_bits_fast(r, min_t(int, fls(delta>>1), 11));
+   if (reseed)
+   crng_reseed(&primary_crng, r);
 }
 
 void add_input_randomness(unsigned int type, unsigned int code,
@@ -1274,6 +1310,7 @@ void add_interrupt_randomness(int irq, int irq_flags)
__u64   ip;
unsigned long   seed;
int credit = 0;
+   boolreseed;
 
if (cycles == 0)
cycles = get_reg(fast_pool, regs);
@@ -1326,7 +1363,9 @@ void add_interrupt_randomness(int irq, int irq_flags)
fast_pool->count = 0;
 
/* award one bit for the contents of the fast pool */
-   credit_entropy_b

Re: [RFC PATCH 2/2] livepatch: Clear relocation targets on a module removal

2019-08-26 Thread Nicolai Stange
Josh Poimboeuf  writes:

> On Wed, Aug 14, 2019 at 01:06:09PM +0200, Miroslav Benes wrote:
>> > Really, we should be going in the opposite direction, by creating module
>> > dependencies, like all other kernel modules do, ensuring that a module
>> > is loaded *before* we patch it.  That would also eliminate this bug.
>> 
>> Yes, but it is not ideal either with cumulative one-fixes-all patch 
>> modules. It would load also modules which are not necessary for a 
>> customer and I know that at least some customers care about this. They 
>> want to deploy only things which are crucial for their systems.

Security concerns set aside, some of the patched modules might get
distributed separately from the main kernel through some sort of
kernel-*-extra packages and thus, not be found on some target system
at all. Or they might have been blacklisted.


> If you frame the question as "do you want to destabilize the live
> patching infrastucture" then the answer might be different.
>
> We should look at whether it makes sense to destabilize live patching
> for everybody, for a small minority of people who care about a small
> minority of edge cases.
>
> Or maybe there's some other solution we haven't thought about, which
> fits more in the framework of how kernel modules already work.
>
>> We could split patch modules as you proposed in the past, but that have 
>> issues as well.
>
> Right, I'm not really crazy about that solution either.
>
> Here's another idea: per-object patch modules.  Patches to vmlinux are
> in a vmlinux patch module.  Patches to kvm.ko are in a kvm patch module.
> That would require:
>
> - Careful management of dependencies between object-specific patches.
>   Maybe that just means that exported function ABIs shouldn't change.
>
> - Some kind of hooking into modprobe to ensure the patch module gets
>   loaded with the real one.
>
> - Changing 'atomic replace' to allow patch modules to be per-object.
>

Perhaps I'm misunderstanding, but supporting only per-object livepatch
modules would make livepatch creation for something like commit
15fab63e1e57 ("fs: prevent page refcount overflow in pipe_buf_get"),
CVE-2019-11487 really cumbersome (see the fuse part)?

I think I've seen similar interdependencies between e.g. kvm.ko <->
kvm_intel.ko, but can't find an example right now.


Thanks,

Nicolai

--
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany
(HRB 247165, AG München), GF: Felix Imendörffer


Re: [RFC 3/5] livepatch: Allow to distinguish different version of system state changes

2019-06-24 Thread Nicolai Stange
Petr Mladek  writes:

> ---
>  include/linux/livepatch.h |  2 ++
>  kernel/livepatch/core.c   |  8 
>  kernel/livepatch/state.c  | 40 +++-
>  kernel/livepatch/state.h  |  9 +
>  4 files changed, 58 insertions(+), 1 deletion(-)
>  create mode 100644 kernel/livepatch/state.h
>
> diff --git a/include/linux/livepatch.h b/include/linux/livepatch.h
> index 591abdee30d7..8bc4c6cc3f3f 100644
> --- a/include/linux/livepatch.h
> +++ b/include/linux/livepatch.h
> @@ -135,10 +135,12 @@ struct klp_object {
>  /**
>   * struct klp_state - state of the system modified by the livepatch
>   * @id:  system state identifier (non zero)
> + * @version: version of the change (non-zero)
>   * @data:custom data
>   */
>  struct klp_state {
>   int id;
> + int version;
>   void *data;
>  };
>  
> diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c
> index 24c4a13bd26c..614642719825 100644
> --- a/kernel/livepatch/core.c
> +++ b/kernel/livepatch/core.c
> @@ -21,6 +21,7 @@
>  #include 
>  #include "core.h"
>  #include "patch.h"
> +#include "state.h"
>  #include "transition.h"
>  
>  /*
> @@ -1003,6 +1004,13 @@ int klp_enable_patch(struct klp_patch *patch)
>  
>   mutex_lock(&klp_mutex);
>  
> + if(!klp_is_patch_compatible(patch)) {
> + pr_err("Livepatch patch (%s) is not compatible with the already 
> installed livepatches.\n",
> + patch->mod->name);
> + mutex_unlock(&klp_mutex);
> + return -EINVAL;
> + }
> +
>   ret = klp_init_patch_early(patch);
>   if (ret) {
>   mutex_unlock(&klp_mutex);


Just as a remark: klp_reverse_transition() could still transition back
to a !klp_is_patch_compatible() patch.

I don't think it's much of a problem, because for live patches
introducing completely new states to the system, it is reasonable
to assume that they'll start applying incompatible changes only from
their ->post_patch(), I guess.

For state "upgrades" to higher versions, it's not so clear though and
some care will be needed. But I think these could still be handled
safely at the cost of some complexity in the new live patch's
->post_patch().

Another detail is that ->post_unpatch() will be called for the new live
patch which has been unpatched due to transition reversal and one would
have to be careful not to free shared state from under the older, still
active live patch. How would ->post_unpatch() distinguish between
transition reversal and "normal" live patch disabling?  By
klp_get_prev_state() != NULL?
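
To make that concern concrete, a hypothetical sketch (assuming the
klp_get_prev_state() from this RFC takes the state id and returns non-NULL
iff an already installed livepatch still provides that state;
DEMO_STATE_ID and free_demo_shared_state() are made up):

	static void demo_post_unpatch(struct klp_object *obj)
	{
		/*
		 * Transition reversal: the older livepatch still relies
		 * on the shared state, so it must not be freed here.
		 */
		if (klp_get_prev_state(DEMO_STATE_ID))
			return;

		/* Normal disable: no older livepatch left, free it. */
		free_demo_shared_state();
	}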

Perhaps transition reversal should be mentioned in the documentation?

Thanks,

Nicolai


-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Mary Higgins, Sri Rasiah, HRB
21284 (AG Nürnberg)


Re: [RFC 2/5] livepatch: Basic API to track system state changes

2019-06-24 Thread Nicolai Stange
Petr Mladek  writes:

> ---
>  include/linux/livepatch.h | 15 +
>  kernel/livepatch/Makefile |  2 +-
>  kernel/livepatch/state.c  | 83 
> +++
>  3 files changed, 99 insertions(+), 1 deletion(-)
>  create mode 100644 kernel/livepatch/state.c
>
> diff --git a/include/linux/livepatch.h b/include/linux/livepatch.h
> index eeba421cc671..591abdee30d7 100644
> --- a/include/linux/livepatch.h
> +++ b/include/linux/livepatch.h
> @@ -132,10 +132,21 @@ struct klp_object {
>   bool patched;
>  };
>  
> +/**
> + * struct klp_state - state of the system modified by the livepatch
> + * @id:  system state identifier (non zero)
> + * @data:custom data
> + */
> +struct klp_state {
> + int id;

Can we make this an unsigned long please? It would be consistent with
shadow variable ids and would give more room for encoding bugzilla or
CVE numbers or so.

Nicolai

> + void *data;
> +};
> +


Re: [RFC 0/5] livepatch: new API to track system state changes

2019-06-24 Thread Nicolai Stange
Hi Petr,

> this is another piece in the puzzle that helps to maintain more
> livepatches.
>
> Especially pre/post (un)patch callbacks might change a system state.
> Any newly installed livepatch has to somehow deal with system state
> modifications done be already installed livepatches.
>
> This patchset provides, hopefully, a simple and generic API that
> helps to keep and pass information between the livepatches.
> It is also usable to prevent loading incompatible livepatches.

I like it a lot, many thanks for doing this!

Minor remarks/questions will follow inline.

Nicolai


Re: [PATCH 0/3] x86_64/ftrace: Emulate calls from int3 when patching functions

2019-05-08 Thread Nicolai Stange
Steven Rostedt  writes:

> [
>   This is the non-RFC version.
>
>   It went through and passed all my tests. If there's no objections
>   I'm going to include this in my pull request. I still have patches
>   in my INBOX that may still be included, so I need to run those through
>   my tests as well, so a pull request wont be immediate.
> ]



> Josh Poimboeuf (1):
>   x86_64: Add gap to int3 to allow for call emulation
>
> Peter Zijlstra (2):
>   x86_64: Allow breakpoints to emulate call instructions
>   ftrace/x86_64: Emulate call function while updating in breakpoint 
> handler

Reviewed-and-tested-by: Nicolai Stange 

for the whole series. Many, many thanks to everybody involved!

I'll resend that live patching selftest once this fix here has been
merged.

Nicolai

-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Mary Higgins, Sri Rasiah, HRB
21284 (AG Nürnberg)


Re: [RFC][PATCH v2] ftrace/x86: Emulate call function while updating in breakpoint handler

2019-05-01 Thread Nicolai Stange
Hi Steve,

many thanks for moving this forward!


Steven Rostedt  writes:

>
> diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
> index ef49517f6bb2..9160f5cc3b6d 100644
> --- a/arch/x86/kernel/ftrace.c
> +++ b/arch/x86/kernel/ftrace.c
> @@ -17,6 +17,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -232,6 +233,9 @@ int ftrace_modify_call(struct dyn_ftrace *rec, unsigned 
> long old_addr,
>  
>  static unsigned long ftrace_update_func;
>  
> +/* Used within inline asm below */
> +unsigned long ftrace_update_func_call;
> +
>  static int update_ftrace_func(unsigned long ip, void *new)
>  {
>   unsigned char old[MCOUNT_INSN_SIZE];
> @@ -259,6 +263,8 @@ int ftrace_update_ftrace_func(ftrace_func_t func)
>   unsigned char *new;
>   int ret;
>  
> + ftrace_update_func_call = (unsigned long)func;
> +
>   new = ftrace_call_replace(ip, (unsigned long)func);
>   ret = update_ftrace_func(ip, new);
>  
> @@ -280,6 +286,125 @@ static nokprobe_inline int is_ftrace_caller(unsigned 
> long ip)
>   return 0;
>  }
>  
> +/*
> + * We need to handle the "call func1" -> "call func2" case.
> + * Just skipping the call is not sufficient as it will be like
> + * changing to "nop" first and then updating the call. But some
> + * users of ftrace require calls never to be missed.
> + *
> + * To emulate the call while converting the call site with a breakpoint,
> + * some trampolines are used along with per CPU buffers.
> + * There are three trampolines for the call sites and three trampolines
> + * for the updating of the call in ftrace trampoline. The three
> + * trampolines are:
> + *
> + * 1) Interrupts are enabled when the breakpoint is hit
> + * 2) Interrupts are disabled when the breakpoint is hit
> + * 3) The breakpoint was hit in an NMI
> + *
> + * As per CPU data is used, interrupts must be disabled to prevent them
> + * from corrupting the data. A separate NMI trampoline is used for the
> + * NMI case. If interrupts are already disabled, then the return path
> + * of where the breakpoint was hit (saved in the per CPU data) is pushed
> + * on the stack and then a jump to either the ftrace_caller (which will
> + * loop through all registered ftrace_ops handlers depending on the ip
> + * address), or if its a ftrace trampoline call update, it will call
> + * ftrace_update_func_call which will hold the call that should be
> + * called.
> + */
> +extern asmlinkage void ftrace_emulate_call_irqon(void);
> +extern asmlinkage void ftrace_emulate_call_irqoff(void);
> +extern asmlinkage void ftrace_emulate_call_nmi(void);
> +extern asmlinkage void ftrace_emulate_call_update_irqoff(void);
> +extern asmlinkage void ftrace_emulate_call_update_irqon(void);
> +extern asmlinkage void ftrace_emulate_call_update_nmi(void);
> +
> +static DEFINE_PER_CPU(void *, ftrace_bp_call_return);
> +static DEFINE_PER_CPU(void *, ftrace_bp_call_nmi_return);

Andy mentioned #DB and #MC exceptions here:
https://lkml.kernel.org/r/c55ded25-c60d-4731-9a6b-92bda8771...@amacapital.net

I think that #DB won't be possible, provided the trampolines below get
tagged as NOKPROBE (do_int3() and ftrace_int3_handler() already have
it).

It's highly theoretical, but tracing do_machine_check() could clobber
ftrace_bp_call_return or ftrace_bp_call_nmi_return?


> +#ifdef CONFIG_SMP
> +#ifdef CONFIG_X86_64
> +# define BP_CALL_RETURN  "%gs:ftrace_bp_call_return"
> +# define BP_CALL_NMI_RETURN  "%gs:ftrace_bp_call_nmi_return"
> +#else
> +# define BP_CALL_RETURN  "%fs:ftrace_bp_call_return"
> +# define BP_CALL_NMI_RETURN  "%fs:ftrace_bp_call_nmi_return"
> +#endif
> +#else /* SMP */
> +# define BP_CALL_RETURN  "ftrace_bp_call_return"
> +# define BP_CALL_NMI_RETURN  "ftrace_bp_call_nmi_return"
> +#endif
> +
> +/* To hold the ftrace_caller address to push on the stack */
> +void *ftrace_caller_func = (void *)ftrace_caller;

The live patching ftrace_ops need ftrace_regs_caller.


> +
> +asm(
> + ".text\n"
> +
> + /* Trampoline for function update with interrupts enabled */
> + ".global ftrace_emulate_call_irqoff\n"
> + ".type ftrace_emulate_call_irqoff, @function\n"
> + "ftrace_emulate_call_irqoff:\n\t"
> + "push "BP_CALL_RETURN"\n\t"
> + "push ftrace_caller_func\n"
> + "sti\n\t"
> + "ret\n\t"
> + ".size ftrace_emulate_call_irqoff, .-ftrace_emulate_call_irqoff\n"
> +
> + /* Trampoline for function update with interrupts disabled*/
> + ".global ftrace_emulate_call_irqon\n"

The naming is perhaps a bit confusing, i.e. "update with interrupts
disabled" vs. "irqon"... How about swapping irqoff<->irqon?

Thanks,

Nicolai


> + ".type ftrace_emulate_call_irqon, @function\n"
> + "ftrace_emulate_call_irqon:\n\t"
> + "push "BP_CALL_RETURN"\n\t"
> + "push ftrace_caller_func\n"
> + "ret\n\t"
> + ".size ftrace_emulate_call_irqon, 

Re: [PATCH 3/4] x86/ftrace: make ftrace_int3_handler() not to skip fops invocation

2019-04-30 Thread Nicolai Stange
Steven Rostedt  writes:

> On Mon, 29 Apr 2019 14:38:35 -0700
> Linus Torvalds  wrote:
>
>> On Mon, Apr 29, 2019 at 1:30 PM Steven Rostedt  wrote:
>> >
>> > The update from "call custom_trampoline" to "call iterator_trampoline"
>> > is where we have an issue.  
>> 
>> So it has never worked. Just tell people that they have two chocies:
>
> The custom call to iterator translation never worked. One option is to
> always call the iterator, and just take the hit. There's another
> solution to to make permanent updates that would handle the live
> patching case, but not for the tracing case. It is more critical for
> live patching than tracing (missed traced function is not as critical
> as running buggy code unexpectedly). I could look to work on that
> instead.

Making the live patching case permanent would disable tracing of live
patched functions though?

For unconditionally calling the iterator: I don't have numbers, but
would expect that always searching through something like 50-70 live
patching ftrace_ops from some hot path will be noticeable.


> If Nicolai has time, perhaps he can try out the method you suggest and
> see if we can move this forward.

You mean making ftrace handle call->call transitions one by one in
non-batch mode, right? Sure, I can do that.

Another question though: is there anything that prevents us from calling
ftrace_ops_list_func() directly from ftrace_int3_handler()? We don't
have parent_ip, but if that is used for reporting only, maybe setting it
to some ftrace_is_in_int3_and_doesnt_know_the_parent() will work?
Graph tracing entries would still be lost, but well...

Thanks,

Nicolai


Re: [PATCH 1/4] x86/thread_info: introduce ->ftrace_int3_stack member

2019-04-28 Thread Nicolai Stange
Steven Rostedt  writes:

> On Sun, 28 Apr 2019 10:41:10 -0700
> Andy Lutomirski  wrote:
>
>
>> > Note that at any given point
>> > in time, there can be at most four such call insn emulations pending:
>> > namely at most one per "process", "irq", "softirq" and "nmi" context.
>> >   
>> 
>> That’s quite an assumption. I think your list should also contain
>> exception, exceptions nested inside that exception, and machine
>> check, at the very least.  I’m also wondering why irq and softirq are
>> treated separately.

You're right, I missed the machine check case.


> 4 has usually been the context count we choose. But I guess in theory,
> if we get exceptions then I could potentially be more.

After having seen the static_call discussion, I'm in no way defending
this limited approach here, but out of curiosity: can the code between
the push onto the stack from ftrace_int3_handler() and the subsequent
pop from the stub actually trigger a (non-#MC) exception? There's an
iret in between, but that can fault only when returning to user space,
correct?


> As for irq vs softirq, an interrupt can preempt a softirq. Interrupts
> are enabled while softirqs are running. When sofirqs run, softirqs are
> disabled to prevent nested softirqs. But interrupts are enabled again,
> and another interrupt may come in while a softirq is executing.
>

Thanks,

Nicolai


-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton,
HRB 21284 (AG Nürnberg)


[PATCH 4/4] selftests/livepatch: add "ftrace a live patched function" test

2019-04-27 Thread Nicolai Stange
There had been an issue with interactions between tracing and live patching
due to how x86' CONFIG_DYNAMIC_FTRACE used to handle the breakpoints at the
updated instructions from its ftrace_int3_handler().

More specifically, starting to trace a live patched function caused a short
period of time during which the live patching redirection became ineffective. In
particular, the guarantees from the consistency model couldn't be held up
in this situation.

Implement a testcase for verifying that a function's live patch replacement
is kept effective when enabling tracing on it.

Reuse the existing 'test_klp_livepatch' live patch module which patches
cmdline_proc_show(), the handler for /proc/cmdline.

Let the testcase in a loop
- apply this live patch,
- launch a background shell job enabling tracing on that function
- while continuously verifying that the contents of /proc/cmdline still
  match what would be expected when the live patch is applied.

Signed-off-by: Nicolai Stange 
---
 tools/testing/selftests/livepatch/Makefile |  3 +-
 .../livepatch/test-livepatch-vs-ftrace.sh  | 44 ++
 2 files changed, 46 insertions(+), 1 deletion(-)
 create mode 100755 
tools/testing/selftests/livepatch/test-livepatch-vs-ftrace.sh

diff --git a/tools/testing/selftests/livepatch/Makefile 
b/tools/testing/selftests/livepatch/Makefile
index af4aee79bebb..bfa5353f6d17 100644
--- a/tools/testing/selftests/livepatch/Makefile
+++ b/tools/testing/selftests/livepatch/Makefile
@@ -3,6 +3,7 @@
 TEST_GEN_PROGS := \
test-livepatch.sh \
test-callbacks.sh \
-   test-shadow-vars.sh
+   test-shadow-vars.sh \
+   test-livepatch-vs-ftrace.sh
 
 include ../lib.mk
diff --git a/tools/testing/selftests/livepatch/test-livepatch-vs-ftrace.sh 
b/tools/testing/selftests/livepatch/test-livepatch-vs-ftrace.sh
new file mode 100755
index ..5c982ec56373
--- /dev/null
+++ b/tools/testing/selftests/livepatch/test-livepatch-vs-ftrace.sh
@@ -0,0 +1,44 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (C) 2019 SUSE Linux GmbH
+
+. $(dirname $0)/functions.sh
+
+set -e
+
+MOD_LIVEPATCH=test_klp_livepatch
+
+# TEST: ftrace a live patched function
+# - load a livepatch that modifies the output from /proc/cmdline
+# - install a function tracer at the live patched function
+# - verify that the function is still patched by reading /proc/cmdline
+# - unload the livepatch and make sure the patch was removed
+
+echo -n "TEST: ftrace a live patched function ... "
+dmesg -C
+
+for i in $(seq 1 3); do
+   load_lp $MOD_LIVEPATCH
+
+   ( echo cmdline_proc_show > /sys/kernel/debug/tracing/set_ftrace_filter;
+ echo function > /sys/kernel/debug/tracing/current_tracer ) &
+
+   for j in $(seq 1 200); do
+   if [[ "$(cat /proc/cmdline)" != \
+   "$MOD_LIVEPATCH: this has been live patched" ]] ; then
+   echo -e "FAIL\n\n"
+   die "livepatch kselftest(s) failed"
+   fi
+   done
+
+   wait %1
+
+   echo nop > /sys/kernel/debug/tracing/current_tracer
+   echo > /sys/kernel/debug/tracing/set_ftrace_filter
+
+   disable_lp $MOD_LIVEPATCH
+   unload_lp $MOD_LIVEPATCH
+done
+
+echo "ok"
+exit 0
-- 
2.13.7



[PATCH 3/4] x86/ftrace: make ftrace_int3_handler() not to skip fops invocation

2019-04-27 Thread Nicolai Stange
With dynamic ftrace, ftrace patches call sites on x86 in three steps:
1. Put a breakpoint at the to be patched location,
2. update call site and
3. finally remove the breakpoint again.

Note that the breakpoint handler, ftrace_int3_handler(), simply makes
execution skip over the to-be-patched instruction.

This patching happens in the following circumstances:
a.) the global ftrace_trace_function changes and the call sites at
ftrace_call and ftrace_regs_call get patched,
b.) an ftrace_ops' ->func changes and the call site in its trampoline gets
patched,
c.) graph tracing gets enabled/disabled and the jump site at
ftrace_graph_call gets patched,
d.) a mcount site gets converted from nop -> call, call -> nop, or
call -> call.

The latter case, i.e. a mcount call getting redirected, is possible in e.g.
a transition from trampolined to not trampolined upon a user enabling
function tracing on a live patched function.

The ftrace_int3_handler() simply skipping over the updated insn is quite
problematic in the context of live patching, because it means that a live
patched function gets temporarily reverted to its unpatched original and
this breaks the live patching consistency model. But even without live
patching, it is desirable to avoid missing traces when making changes to
the tracing setup.

Make ftrace_int3_handler() not skip over the fops invocation, but instead
modify the interrupted control flow to issue a call as needed.

Case c.) from the list above can be ignored, because a jmp instruction gets
changed to a nop or vice versa.

The remaining cases a.), b.) and d.) all involve call instructions. For a.)
and b.), the call always goes to some ftrace_func_t and the generic
ftrace_ops_list_func() implementation will be able to demultiplex and do the
right thing. For case d.), the call target is either of ftrace_caller(),
ftrace_regs_caller() or some ftrace_ops' trampoline. Because providing the
register state won't cause any harm for !FTRACE_OPS_FL_SAVE_REGS
ftrace_ops, ftrace_regs_caller() would be a suitable target capable of
handling any case.

ftrace_int3_handler()'s context is different from the interrupted call
instruction's one, obviously. In order to be able to emulate the call
within the original context, make ftrace_int3_handler() set its iret
frame's ->ip to some helper stub. Upon return from the trap, this stub will
then mimic the call by pushing the return address onto the stack and
issuing a jmp to the target address. As described above, the jmp target
will be either of ftrace_ops_list_func() or ftrace_regs_caller(). Provide
one such stub implementation for each of the two cases.

Finally, the desired return address, which is derived from the trapping
->ip, must get passed from ftrace_int3_handler() to these stubs. Make
ftrace_int3_handler() push it onto the ftrace_int3_stack introduced by an
earlier patch and let the stubs consume it. Be careful to use proper
compiler barriers such that nested int3 handling from e.g. irqs won't
clobber entries owned by outer instances.
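
A rough C level sketch of that push (illustration only -- the actual code
is in the ftrace.c hunk below, which isn't fully quoted here, so the exact
bookkeeping and return address computation are assumptions; 'ip' is the
address of the patched call site):

	struct ftrace_int3_stack *stack;
	int depth;

	stack = &current_thread_info()->ftrace_int3_stack;
	/*
	 * Reserve the slot first so that a nested int3 from an irq
	 * won't pick the same index, then publish the emulated
	 * call's return address.
	 */
	depth = stack->depth;
	stack->depth = depth + 1;
	barrier();
	stack->slots[depth] = ip + MCOUNT_INSN_SIZE;
	/* Divert the iret to the stub consuming the entry. */
	regs->ip = (unsigned long)ftrace_int3_stub_list_func;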

Suggested-by: Steven Rostedt 
Signed-off-by: Nicolai Stange 
---
 arch/x86/kernel/Makefile|  1 +
 arch/x86/kernel/ftrace.c| 79 +++--
 arch/x86/kernel/ftrace_int3_stubs.S | 61 
 3 files changed, 130 insertions(+), 11 deletions(-)
 create mode 100644 arch/x86/kernel/ftrace_int3_stubs.S

diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 00b7e27bc2b7..0b63ae02b1f3 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -93,6 +93,7 @@ obj-$(CONFIG_LIVEPATCH)   += livepatch.o
 obj-$(CONFIG_FUNCTION_TRACER)  += ftrace_$(BITS).o
 obj-$(CONFIG_FUNCTION_GRAPH_TRACER) += ftrace.o
 obj-$(CONFIG_FTRACE_SYSCALLS)  += ftrace.o
+obj-$(CONFIG_DYNAMIC_FTRACE)   += ftrace_int3_stubs.o
 obj-$(CONFIG_X86_TSC)  += trace_clock.o
 obj-$(CONFIG_KEXEC_CORE)   += machine_kexec_$(BITS).o
 obj-$(CONFIG_KEXEC_CORE)   += relocate_kernel_$(BITS).o crash.o
diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
index ef49517f6bb2..917494f35cba 100644
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -280,25 +280,84 @@ static nokprobe_inline int is_ftrace_caller(unsigned long 
ip)
return 0;
 }
 
-/*
- * A breakpoint was added to the code address we are about to
- * modify, and this is the handle that will just skip over it.
- * We are either changing a nop into a trace call, or a trace
- * call to a nop. While the change is taking place, we treat
- * it just like it was a nop.
- */
+extern void ftrace_graph_call(void);
+
+asmlinkage void ftrace_int3_stub_regs_caller(void);
+asmlinkage void ftrace_int3_stub_list_func(void);
+
+/* A breakpoint was added to the code address we are about to modify. */
 int ftrace_int3_handler(struct pt_regs *regs)
 {
unsigned long ip;
+   bool is_ftrace_location;
+   struct ftrace_int3_stack *stack

[PATCH 2/4] ftrace: drop 'static' qualifier from ftrace_ops_list_func()

2019-04-27 Thread Nicolai Stange
With an upcoming patch improving x86' ftrace_int3_handler() not to simply
skip over the insn being updated, ftrace_ops_list_func() will have to get
referenced from arch/x86 code. Drop its 'static' qualifier.

Signed-off-by: Nicolai Stange 
---
 kernel/trace/ftrace.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index b920358dd8f7..ed3c20811d9a 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -125,8 +125,8 @@ ftrace_func_t ftrace_trace_function __read_mostly = 
ftrace_stub;
 struct ftrace_ops global_ops;
 
 #if ARCH_SUPPORTS_FTRACE_OPS
-static void ftrace_ops_list_func(unsigned long ip, unsigned long parent_ip,
-struct ftrace_ops *op, struct pt_regs *regs);
+void ftrace_ops_list_func(unsigned long ip, unsigned long parent_ip,
+ struct ftrace_ops *op, struct pt_regs *regs);
 #else
 /* See comment below, where ftrace_ops_list_func is defined */
 static void ftrace_ops_no_ops(unsigned long ip, unsigned long parent_ip);
@@ -6302,8 +6302,8 @@ __ftrace_ops_list_func(unsigned long ip, unsigned long 
parent_ip,
  * set the ARCH_SUPPORTS_FTRACE_OPS.
  */
 #if ARCH_SUPPORTS_FTRACE_OPS
-static void ftrace_ops_list_func(unsigned long ip, unsigned long parent_ip,
-struct ftrace_ops *op, struct pt_regs *regs)
+void ftrace_ops_list_func(unsigned long ip, unsigned long parent_ip,
+ struct ftrace_ops *op, struct pt_regs *regs)
 {
__ftrace_ops_list_func(ip, parent_ip, NULL, regs);
 }
-- 
2.13.7



[PATCH 1/4] x86/thread_info: introduce ->ftrace_int3_stack member

2019-04-27 Thread Nicolai Stange
Before actually rewriting an insn, x86' DYNAMIC_FTRACE implementation
places an int3 breakpoint on it. Currently, ftrace_int3_handler() simply
treats the insn in question as nop and advances %rip past it. An upcoming
patch will improve this by making the int3 trap handler emulate the call
insn. To this end, ftrace_int3_handler() will be made to change its iret
frame's ->ip to some stub which will then mimic the function call in the
original context.

Somehow the trapping ->ip address will have to get communicated from
ftrace_int3_handler() to these stubs though. Note that at any given point
in time, there can be at most four such call insn emulations pending:
namely at most one per "process", "irq", "softirq" and "nmi" context.

Introduce struct ftrace_int3_stack providing four entries for storing
the instruction pointer.

In principle, it could be made per-cpu, but this would require making
ftrace_int3_handler() return with preemption disabled and enabling it
again from those emulation stubs only after the stack's top entry has been
consumed. I've been told that this would "break a lot of norms" and that
making this stack part of struct thread_info instead would be less fragile.
Follow this advice and add a struct ftrace_int3_stack instance to x86's
struct thread_info. Note that these stacks will get only rarely accessed
(only during ftrace's code modifications) and thus, cache line dirtying
won't have any significant impact on the neighbouring fields.

Initialization will take place implicitly through INIT_THREAD_INFO as per
the rules for missing elements in initializers. The memcpy() in
arch_dup_task_struct() will propagate the initial state properly, because
it's always run in process context and won't ever see a non-zero ->depth
value.

Finally, add the necessary bits to asm-offsets for making struct
ftrace_int3_stack accessible from assembly.

Suggested-by: Steven Rostedt 
Signed-off-by: Nicolai Stange 
---
 arch/x86/include/asm/thread_info.h | 11 +++
 arch/x86/kernel/asm-offsets.c  |  8 
 2 files changed, 19 insertions(+)

diff --git a/arch/x86/include/asm/thread_info.h 
b/arch/x86/include/asm/thread_info.h
index e0eccbcb8447..83434a88cfbb 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -56,6 +56,17 @@ struct task_struct;
 struct thread_info {
unsigned long   flags;  /* low level flags */
u32 status; /* thread synchronous flags */
+#ifdef CONFIG_DYNAMIC_FTRACE
+   struct ftrace_int3_stack {
+   int depth;
+   /*
+* There can be at most one slot in use per context,
+* i.e. at most one for "normal", "irq", "softirq" and
+* "nmi" each.
+*/
+   unsigned long slots[4];
+   } ftrace_int3_stack;
+#endif
 };
 
 #define INIT_THREAD_INFO(tsk)  \
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index 168543d077d7..ca6ee24a0c6e 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -105,4 +105,12 @@ static void __used common(void)
OFFSET(TSS_sp0, tss_struct, x86_tss.sp0);
OFFSET(TSS_sp1, tss_struct, x86_tss.sp1);
OFFSET(TSS_sp2, tss_struct, x86_tss.sp2);
+
+#ifdef CONFIG_DYNAMIC_FTRACE
+   BLANK();
+   OFFSET(TASK_TI_ftrace_int3_depth, task_struct,
+  thread_info.ftrace_int3_stack.depth);
+   OFFSET(TASK_TI_ftrace_int3_slots, task_struct,
+  thread_info.ftrace_int3_stack.slots);
+#endif
 }
-- 
2.13.7



[PATCH 0/4] x86/ftrace: make ftrace_int3_handler() not to skip fops invocation

2019-04-27 Thread Nicolai Stange
Hi,

this series is the result of the discussion of the RFC patch found at [1].

The goal is to make x86' ftrace_int3_handler() not simply skip over
the trapping instruction, as this is problematic in the context of
the live patching consistency model. For details, c.f. the commit message
of [3/4] ("x86/ftrace: make ftrace_int3_handler() not to skip fops
invocation").

Everything is based on v5.1-rc6; please let me know in case you want me to
rebase on something else.

For x86_64, the live patching selftest added in [4/4] succeeds with this
series applied and fails without it. On 32 bits I only compile-tested.

checkpatch reports warnings about
- an overlong line in assembly -- I chose to ignore that
- MAINTAINERS perhaps needing updates due to the new files
  arch/x86/kernel/ftrace_int3_stubs.S and
  tools/testing/selftests/livepatch/test-livepatch-vs-ftrace.sh.
  As the existing arch/x86/kernel/ftrace_{32,64}.S haven't got an
  explicit entry either, this one is probably Ok? The selftest
  definitely is.

Changes to the RFC patch:
- s/trampoline/stub/ to avoid confusion with the ftrace_ops' trampolines,
- use a fixed size stack kept in struct thread_info for passing the
  (adjusted) ->ip values from ftrace_int3_handler() to the stubs,
- provide one stub for each of the two possible jump targets and hardcode
  those,
- add the live patching selftest.

Thanks,

Nicolai

Nicolai Stange (4):
  x86/thread_info: introduce ->ftrace_int3_stack member
  ftrace: drop 'static' qualifier from ftrace_ops_list_func()
  x86/ftrace: make ftrace_int3_handler() not to skip fops invocation
  selftests/livepatch: add "ftrace a live patched function" test

 arch/x86/include/asm/thread_info.h | 11 +++
 arch/x86/kernel/Makefile   |  1 +
 arch/x86/kernel/asm-offsets.c  |  8 +++
 arch/x86/kernel/ftrace.c   | 79 +++---
 arch/x86/kernel/ftrace_int3_stubs.S| 61 +
 kernel/trace/ftrace.c  |  8 +--
 tools/testing/selftests/livepatch/Makefile |  3 +-
 .../livepatch/test-livepatch-vs-ftrace.sh  | 44 
 8 files changed, 199 insertions(+), 16 deletions(-)
 create mode 100644 arch/x86/kernel/ftrace_int3_stubs.S
 create mode 100755 
tools/testing/selftests/livepatch/test-livepatch-vs-ftrace.sh

-- 
2.13.7



Re: [RFC PATCH 1/1] x86/ftrace: make ftrace_int3_handler() not to skip fops invocation

2019-04-24 Thread Nicolai Stange
Steven Rostedt  writes:
> On Tue, 23 Apr 2019 20:15:49 +0200
> Nicolai Stange  wrote:
>> Steven Rostedt  writes:
>> > For 32 bit, we could add 4 variables on the thread_info and make 4
>> > trampolines, one for each context (normal, softirq, irq and NMI), and
>> > have them use the variable stored in the thread_info for that location:
>> >
>> > ftrace_int3_tramp_normal
>> >push %eax # just for space
>> >push %eax
>> >mov whatever_to_get_thread_info, %eax
>> >mov normal(%eax), %eax
>> >mov %eax, 4(%esp)
>> >pop %eax
>> >jmp ftrace_caller
>> >
>> > ftrace_int3_tramp_softiqr
>> >push %eax # just for space
>> >push %eax
>> >mov whatever_to_get_thread_info, %eax
>> >mov softirq(%eax), %eax
>> >mov %eax, 4(%esp)
>> >pop %eax
>> >jmp ftrace_caller
>> >
>> > etc..
>> >
>> >
>> > A bit crazier for 32 bit but it can still be done. ;-)  
>> 
>> Ok, but currently CONFIG_HAVE_LIVEPATCH=n for x86 && !x86_64,
>> so I'd rather not invest too much energy into screwing 32 bit up ;)
>>
>
> Probably not something you care about, but something that I do. Which
> means it will have to go on my TODO list. I care about missed functions
> being called. This means, if you have something tracing a function, and
> then enable something else to trace that same function, you might miss
> the first one.

Alright, if there's a use case beyond live patching, I can try to handle
32 bits alongside, of course.

However, some care will be needed when determining the actual context
from ftrace_int3_handler(): tracing anything before the addition of
HARDIRQ_OFFSET to preempt_count in irq_enter(), or after its subtraction
in irq_exit(), could otherwise clobber the "normal" state in thread_info,
correct?

How about having a fixed size stack with three entries shared between
"normal", "irq" and "softirq" in thread_info, as well as a dedicated
slot for nmi context? (The stack would be modified/consumed only with
irqs disabled).
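
For illustration, pushing to such a stack could look like the following
sketch -- names are made up here and it assumes a thread_info member along
the lines of the struct ftrace_int3_stack from [PATCH 1/4] elsewhere in
this thread; the important point is just that the push runs with irqs
disabled and thus can't get interleaved with another push from the same
task:

        static void ftrace_int3_stack_push(unsigned long ip)
        {
                struct ftrace_int3_stack *stack;

                lockdep_assert_irqs_disabled();

                stack = &current_thread_info()->ftrace_int3_stack;
                BUG_ON(stack->depth >= ARRAY_SIZE(stack->slots));
                stack->slots[stack->depth++] = ip;
        }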

Which makes me wonder:
- if we disabled preemption from ftrace_int3_handler() and reenabled it
  again from the *_int3_tramp*s, these stacks could be made per-CPU,
  AFAICS,
- and why not use this strategy on x86_64, too? The advantages would be
  a common implementation between 32 and 64 bit and, more importantly, not
  relying on that %r11 hack. When doing the initial RFC patch, it
  wasn't clear to me that at most one stack slot per context would be
  needed, i.e. only four in total...

What do you think?

Nicolai

-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton,
HRB 21284 (AG Nürnberg)


Re: [RFC PATCH 1/1] x86/ftrace: make ftrace_int3_handler() not to skip fops invocation

2019-04-23 Thread Nicolai Stange
Hi Steven,

many thanks for having a look!

Steven Rostedt  writes:

> I just found this in my Inbox, and this looks to be a legit issue.
>
> On Thu, 26 Jul 2018 12:40:29 +0200
> Nicolai Stange  wrote:
>
> You still working on this?

Yes, this still needs to get fixed somehow, preferably at the ftrace
layer.

>
>> With dynamic ftrace, ftrace patches call sites in a three steps:
>> 1. Put a breakpoint at the to be patched location,
>> 2. update call site and
>> 3. finally remove the breakpoint again.
>> 
>> Note that the breakpoint ftrace_int3_handler() simply makes execution
>> to skip over the to be patched function.
>> 
>> This patching happens in the following circumstances:
>> - the global ftrace_trace_function changes and the call sites at
>>   ftrace_call and ftrace_regs_call get patched,
>> - an ftrace_ops' ->func changes and the call site in its trampoline gets
>>   patched,
>> - graph tracing gets enabled/disabled and the jump site at
>>   ftrace_graph_call gets patched
>> - a mcount site gets converted from nop -> call, call -> nop
>>   or call -> call.
>> 
>> The latter case, i.e. a mcount call getting redirected, happens for example
>> in a transition from trampolined to not trampolined upon a user enabling
>> function tracing on a live patched function.
>> 
>> The ftrace_int3_handler() simply skipping over the mcount callsite is
>> problematic here, because it means that a live patched function gets
>> temporarily reverted to its unpatched original and this breaks live
>> patching consistency.
>> 
>> Make ftrace_int3_handler not to skip over the fops invocation, but modify
>> the interrupted control flow to issue a call as needed.
>> 
>> For determining what the proper action actually is, modify
>> update_ftrace_func() to take an extra argument, func, to be called if the
>> corresponding breakpoint gets hit. Introduce a new global variable,
>> trace_update_func_dest and let update_ftrace_func() set it. For the special
>> case of patching the jmp insn at ftrace_graph_call, set it to zero and make
>> ftrace_int3_handler() just skip over the insn in this case as before.
>> 
>> Because there's no space left above the int3 handler's iret frame to stash
>> an extra call's return address in, this can't be mimicked from the
>> ftrace_int3_handler() itself.
>> 
>> Instead, make ftrace_int3_handler() change the iret frame's %rip slot to
>> point to the new ftrace_int3_call_trampoline to be executed immediately
>> after iret.
>> 
>> The original %rip gets communicated to ftrace_int3_call_trampoline which
>> can then take proper action. Note that ftrace_int3_call_trampoline can
>> nest, because of NMIs, for example. In order to avoid introducing another
>> stack, abuse %r11 for passing the original %rip. This is safe, because the
>> interrupted context is always at a call insn and according to the x86_64
>> ELF ABI allows %r11 is callee-clobbered. According to the glibc sources,
>> this is also true for the somewhat special mcount/fentry ABI.
>> 
>> OTOH, a spare register doesn't exist on i686 and thus, this is approach is
>> limited to x86_64.
>> 
>> For determining the call target address, let ftrace_int3_call_trampoline
>> compare ftrace_update_func against the interrupted %rip.
>
> I don't think we need to do the compare.
>
>> If they match, an update_ftrace_func() is currently pending and the
>> call site is either of ftrace_call, ftrace_regs_call or the call insn
>> within a fops' trampoline. Jump to the ftrace_update_func_dest location as
>> set by update_ftrace_func().
>> 
>> If OTOH the interrupted %rip doesn't equal ftrace_update_func, then
>> it points to an mcount call site. In this case, redirect control flow to
>> the most generic handler, ftrace_regs_caller, which will then do the right
>> thing.
>> 
>> Finally, reading ftrace_update_func and ftrace_update_func_dest from
>> outside of the int3 handler requires some care: inbetween adding and
>> removing the breakpoints, ftrace invokes run_sync() which basically
>> issues a couple of IPIs. Because the int3 handler has interrupts disabled,
>> it orders with run_sync().
>> 
>> To extend that ordering to also include ftrace_int3_call_trampoline, make
>> ftrace_int3_handler() disable interrupts within the iret frame returning to
>> it.
>> 
>> Store the original EFLAGS.IF into %r11's MSB which, because it represents
>> a kernel address, is redundant. Make ftrace_int3_call_trampoline restore
>> it when done.
>
> This 

Re: [PATCH] mm/mincore: allow for making sys_mincore() privileged

2019-02-20 Thread Nicolai Stange
Linus Torvalds  writes:



> So in order to use it as a signal, first you have to first scrub the
> cache (because if the page was already there, there's no signal at
> all), and then for the signal to be as useful as possible, you're also
> going to want to try to get out more than one bit of information: you
> are going to try to see the patterns and the timings of how it gets
> filled.
>
> And that's actually quite painful. You don't know the initial cache
> state, and you're not (in general) controlling the machine entirely,
> because there's also that actual other entity that you're trying to
> attack and see what it does.
>
> So what you want to do is basically to first make sure the cache is
> scrubbed (only for the pages you're interested in!), then trigger
> whatever behavior you are looking for, and then look how that affected
> the cache.
>
> In other words,  you want *multiple* residency status check - first to
> see what the cache state is (because you're going to want that for
> scrubbing), then to see that "yes, it's gone" when doing the
> scrubbing, and then to see the *pattern* and timings of how things are
> brought in.



In an attempt to gain a better understanding of the guided eviction part,
and of the relevance of mincore() & friends to it, I worked on
reproducing the results from [1], section 6.1 ("Efficient Page Cache
Eviction on Linux").

In case anybody wants to run their own experiments: the sources can
be found at [2].

Disclaimer: I don't have access to the sources used by the [1] paper's
authors, nor do I know anything about their experimental setup. So it
might very well be the case that my implementation is completely
different and/or inefficient.

Anyways, quoting from [1], section 6.1:

  "Eviction Set 1. These are pages already in the page cache,
   used by other processes. To keep them in the page cache,
   a thread continuously accesses these pages while also keep-
   ing the system load low by using sched yield and sleep.
   Consequently, they are among the most recently accessed
   pages of the system and eviction of these pages becomes
   highly unlikely."

I had two questions:
1.) Do the actual contents of "Eviction set 1" matter for the guided
eviction's performance or can they as well be arbitrary but fixed?
Because if the set's actual contents weren't of any importance,
then mincore() would not be needed to initialize it with "pages
already in the page cache".
2.) How does keeping some fixed set resident + doing some IO compare to
simply streaming a huge random file through the page cache?

(To make it explicit: I didn't look into the probe part of the attack
 or the checking of the victim page's residency status as a
 termination condition for the eviction run.)

Obviously, there are two primary factors affecting the victim page
eviction performance: the file page cache size and disk read
bandwidth.

Naively speaking, I would suppose that keeping a certain set resident is
a cheap and stable way to constrain IO to the remaining portion of the
page cache and thus, reduce the amount of data required to be read from
disk until the victim page gets evicted.


Results summary (details can be found at the end of this mail):
- The baseline benchmark of simply streaming random data
  through the page cache behaves as expected:

avg of "Inactive(file)" / avg of "victim page eviction time"

  yields ~480MB/s, which approx. matches my disk's read bandwidth
  (note: the victim page was mapped as !PROT_EXEC).

- I didn't do any sophisticated fine-tuning wrt. parameters, but
  for the configuration yielding the best result, the average victim
  page eviction time was 147ms (stddev(*): 69ms, stderr: 1ms) with the
  "random but fixed resident set method". That's an improvement by a
  factor of 2.6 over the baseline "streaming random data method" (with
  the same amount of anonymous memory, i.e. "Eviction set 3",
  allocated: 7GB out of a total of 8GB).

- In principle, question 1.) can't be answered by experiment without
  controlling the initial, legitimate system workload. I did some lax
  tests on my desktop running firefox, libreoffice etc. though and of
  course, overall responsiveness got a lot better if the "Eviction set
  1" had been populated with pages already resident at the time the
  "attack" started. But the victim page eviction times didn't seem to
  improve -- at least not by factors such that my biased mind would
  have been able to recognize any change.

In conclusion, keeping an arbitrary, but fixed "Eviction set 1" resident
improved the victim page eviction performance by some factor over the
"streaming" baseline, where "Eviction set 1" was populated from a
single, attacker-controlled file and mincore() was not needed for
determining its initial contents.
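
A minimal sketch of the kind of residency probe meant here, for
reference -- this is not the code from [2], just an illustration, and it
assumes the file in question has already been mmap()ed:

        #include <stdbool.h>
        #include <unistd.h>
        #include <sys/mman.h>

        static bool page_resident(const void *addr)
        {
                unsigned long page_size = sysconf(_SC_PAGESIZE);
                unsigned char vec;

                /* mincore() wants a page aligned address and fills in one
                 * status byte per page. */
                addr = (const void *)((unsigned long)addr & ~(page_size - 1));
                if (mincore((void *)addr, 1, &vec))
                        return false;

                return vec & 0x1;
        }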

To my surprise though, I needed to rely on mincore() at some other
place, namely *my* implementation of keeping the resident set
resident. My first naive approach was to have a 

Re: [PATCH 1/4] powerpc/64s: Clear on-stack exception marker upon exception return

2019-01-30 Thread Nicolai Stange
Michael Ellerman  writes:

> Joe Lawrence  writes:
>> From: Nicolai Stange 
>>
>> The ppc64 specific implementation of the reliable stacktracer,
>> save_stack_trace_tsk_reliable(), bails out and reports an "unreliable
>> trace" whenever it finds an exception frame on the stack. Stack frames
>> are classified as exception frames if the STACK_FRAME_REGS_MARKER magic,
>> as written by exception prologues, is found at a particular location.
>>
>> However, as observed by Joe Lawrence, it is possible in practice that
>> non-exception stack frames can alias with prior exception frames and thus,
>> that the reliable stacktracer can find a stale STACK_FRAME_REGS_MARKER on
>> the stack. It in turn falsely reports an unreliable stacktrace and blocks
>> any live patching transition to finish. Said condition lasts until the
>> stack frame is overwritten/initialized by function call or other means.
>>
>> In principle, we could mitigate this by making the exception frame
>> classification condition in save_stack_trace_tsk_reliable() stronger:
>> in addition to testing for STACK_FRAME_REGS_MARKER, we could also take into
>> account that for all exceptions executing on the kernel stack
>> - their stack frames's backlink pointers always match what is saved
>>   in their pt_regs instance's ->gpr[1] slot and that
>> - their exception frame size equals STACK_INT_FRAME_SIZE, a value
>>   uncommonly large for non-exception frames.
>>
>> However, while these are currently true, relying on them would make the
>> reliable stacktrace implementation more sensitive towards future changes in
>> the exception entry code. Note that false negatives, i.e. not detecting
>> exception frames, would silently break the live patching consistency model.
>>
>> Furthermore, certain other places (diagnostic stacktraces, perf, xmon)
>> rely on STACK_FRAME_REGS_MARKER as well.
>>
>> Make the exception exit code clear the on-stack STACK_FRAME_REGS_MARKER
>> for those exceptions running on the "normal" kernel stack and returning
>> to kernelspace: because the topmost frame is ignored by the reliable stack
>> tracer anyway, returns to userspace don't need to take care of clearing
>> the marker.
>>
>> Furthermore, as I don't have the ability to test this on Book 3E or
>> 32 bits, limit the change to Book 3S and 64 bits.
>>
>> Finally, make the HAVE_RELIABLE_STACKTRACE Kconfig option depend on
>> PPC_BOOK3S_64 for documentation purposes. Before this patch, it depended
>> on PPC64 && CPU_LITTLE_ENDIAN and because CPU_LITTLE_ENDIAN implies
>> PPC_BOOK3S_64, there's no functional change here.
>
> That has nothing to do with the fix and should really be in a separate
> patch.
>
> I can split it when applying.

If you don't mind, that would be nice! Or simply drop that
chunk... Otherwise, let me know if I shall send a split v2 for this
patch [1/4] only.

Thanks,

Nicolai

-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton,
HRB 21284 (AG Nürnberg)


Re: ppc64le reliable stack unwinder and scheduled tasks

2019-01-13 Thread Nicolai Stange
Joe Lawrence  writes:

> On Fri, Jan 11, 2019 at 08:51:54AM +0100, Nicolai Stange wrote:
>> Joe Lawrence  writes:
>> 
>> > On Fri, Jan 11, 2019 at 01:00:38AM +0100, Nicolai Stange wrote:
>
> [ ... snip ... ]
>
>> >> For testing, could you try whether clearing the word at STACK_FRAME_MARKER
>> >> from _switch() helps?
>> >> 
>> >> I.e. something like (completely untested):
>> >
>> > I'll kick off some builds tonight and try to get tests lined up
>> > tomorrow.  Unfortunately these take a bit of time to run setup, schedule
>> > and complete, so perhaps by next week.
>> 
>> Ok, that's probably still a good test for confirmation, but the solution
>> you proposed below is much better.
>
> Good news, 40 tests (RHEL-7 backport) with clearing out the
> STACK_FRAME_MARKER were all successful.  Previously I would see stalled
> livepatch transitions due to the unreliable report in ~10% of test
> cases.
>
>> >> diff --git a/arch/powerpc/kernel/entry_64.S 
>> >> b/arch/powerpc/kernel/entry_64.S
>> >> index 435927f549c4..b747d0647ec4 100644
>> >> --- a/arch/powerpc/kernel/entry_64.S
>> >> +++ b/arch/powerpc/kernel/entry_64.S
>> >> @@ -596,6 +596,10 @@ _GLOBAL(_switch)
>> >>   SAVE_8GPRS(14, r1)
>> >>   SAVE_10GPRS(22, r1)
>> >>   std r0,_NIP(r1) /* Return to switch caller */
>> >> +
>> >> + li  r23,0
>> >> + std r23,96(r1)  /* 96 == STACK_FRAME_MARKER * sizeof(long) */
>> >> +
>> >>   mfcrr23
>> >>   std r23,_CCR(r1)
>> >>   std r1,KSP(r3)  /* Set old stack pointer */
>> >> 
>> >> 
>> >
>> > This may be sufficient to avoid the condition, but if the contents of
>> > frame 0 are truely uninitialized (not to be trusted), should the
>> > unwinder be even looking at that frame (for STACK_FRAMES_REGS_MARKER),
>> > aside from the LR and other stack size geometry sanity checks?
>> 
>> That's a very good point: we'll only ever be examining scheduled out tasks
>> and current (which at that time is running klp_try_complete_transition()).
>> 
>> current won't be in an interrupt/exception when it's walking its own
>> stack. And scheduled out tasks can't have an exception/interrupt frame
>> as their frame 0, correct?
>> 
>> Thus, AFAICS, whenever klp_try_complete_transition() finds a
>> STACK_FRAMES_REGS_MARKER in frame 0, it is known to be garbage, as you
>> said.
>
> I gave this about 20 tests and they were also all successful, what do
> you think?  (The log could probably use some trimming and cleanup.)  I
> think either solution would fix the issue.
>
> -->8-- -->8-- -->8-- -->8-- -->8-- -->8-- -->8-- -->8-- -->8-- -->8--
>
> From edd3fb9e8a16d5a7ccc98759e9397f386f0e82ca Mon Sep 17 00:00:00 2001
> From: Joe Lawrence 
> Date: Fri, 11 Jan 2019 10:25:26 -0500
> Subject: [PATCH] powerpc/livepatch: relax reliable stack tracer exception
>  check
>
> The ppc64le reliable stack tracer iterates over a given task's stack,
> verifying a set of conditions to determine whether the trace is
> 'reliable'.  These checks include the presence of any exception frames.
>
> We should be careful when inspecting the bottom-most stack frame (the
> first to be unwound), particularly for scheduled-out tasks.  As Nicolai
> Stange explains, "If I'm reading the code in _switch() correctly, the
> first frame is completely uninitialized except for the pointer back to
> the caller's stack frame."  If a previous do_IRQ() invocation, for
> example, has left a residual exception-marker on the first frame, the
> stack tracer would incorrectly report this task's trace as unreliable.
>

FWIW, it wasn't do_IRQ() which wrote the exception marker, but its
caller hardware_interrupt_common(), more specifically the
EXCEPTION_PROLOG_COMMON_3 part thereof.


> save_stack_trace_tsk_reliable() already skips the first frame when
> verifying the saved Link Register.  It should do the same when looking
> for the STACK_FRAME_REGS_MARKER.  The task it is unwinding will be
> either 'current', in which case the tracer will have never been called
> from an exception path, or the task will be inactive and its first
> frame's contents will be uninitialized (aside from backchain pointer).

I thought about this a little more and can't see anything that would
prevent higher, i.e. non-_switch(), frames from also aliasing with prior
exception frames. That STACK_FRAME_REGS_MARKER is written to a stack
frame's "parameter area" and most functi

Re: ppc64le reliable stack unwinder and scheduled tasks

2019-01-10 Thread Nicolai Stange
Joe Lawrence  writes:

> On Fri, Jan 11, 2019 at 01:00:38AM +0100, Nicolai Stange wrote:
>> Hi Joe,
>> 
>> Joe Lawrence  writes:
>> 
>> > tl;dr: On ppc64le, what is top-most stack frame for scheduled tasks
>> >about?
>> 
>> If I'm reading the code in _switch() correctly, the first frame is
>> completely uninitialized except for the pointer back to the caller's
>> stack frame.
>> 
>> For completeness: _switch() saves the return address, i.e. the link
>> register into its parent's stack frame, as is mandated by the ABI and
>> consistent with your findings below: it's always the second stack frame
>> where the return address into __switch_to() is kept.
>>
>
> Hi Nicolai,
>
> Good, that makes a lot of sense.  I couldn't find any reference
> explaining the contents of frame 0, only unwinding code here and there
> (as in crash-utility) that stepped over it.

FWIW, I learned about general stack frame usage on ppc from part 4 of
the introductory series starting at [1]: it's a good read and I
can definitely recommend it.

Summary:
- Callers of other functions always allocate a stack frame and only
  set the pointer to the previous stack frame (that's the
  'stdu r1, -STACK_FRAME_OVERHEAD(r1)' insn).
- Callees save their stuff into the stack frame allocated by the caller
  if needed. Where "if needed" == callee in turn calls another function.

The insignificance of frame 0's contents follows from this ABI: the
caller might not have called any callee yet, the callee might be a leaf
and so on.

Finally, as I understand it, the only purpose of _switch() creating a
standard stack frame at the bottom of scheduled out tasks' stacks is that
the higher ones can be found (e.g. for backtracing): otherwise
there would be a pt_regs at the bottom of the stack. But I might be
wrong here.
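
Just to make the condition under discussion explicit, with the ppc64
values spelled out (an illustration only, not the actual
save_stack_trace_tsk_reliable() code):

        #define STACK_FRAME_MARKER      12      /* i.e. byte offset 96 on 64 bit */
        #define STACK_FRAME_REGS_MARKER 0x7265677368657265UL   /* "regshere" */

        static bool frame_has_exception_marker(const unsigned long *frame)
        {
                return frame[STACK_FRAME_MARKER] == STACK_FRAME_REGS_MARKER;
        }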



>> 
>> 
>> >
>> >
>> > Example 1 (RHEL-7)
>> > ==
>> >
>> > crash> struct task_struct.thread c0022fd015c0 | grep ksp
>> > ksp = 0xc000288af9c0
>> >
>> > crash> rd 0xc000288af9c0 -e 0xc000288b
>> >
>> >  - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
>> >
>> > sp[0]:
>> >
>> > c000288af9c0:  c000288afb90 00dd   ...(
>> > c000288af9d0:  c0002a94 c1c60a00   .*..
>> >
>> > crash> sym c0002a94
>> > c0002a94 (T) hardware_interrupt_common+0x114
>> 
>> So that c0002a94 certainly wasn't stored by _switch(). I think
>> what might have happened is that the switching frame aliased with some
>> prior interrupt frame as setup by hardware_interrupt_common().
>> 
>> The interrupt and switching frames seem to share a common layout as far
>> as the lower STACK_FRAME_OVERHEAD + sizeof(struct pt_regs) bytes are
>> concerned.
>> 
>> That address into hardware_interrupt_common() could have been written by
>> the do_IRQ() called from there.
>> 
>
> That was my initial theory, but then when I saw an ordinary scheduled
> task with a similarly strange frame 0, I thought that _switch() might
> have been doing something clever (or not).  But according your earlier
> explanation, it would line up that these values may be inherited from
> do_IRQ() or the like.
>
>> 
>> > c000288af9e0:  c1c60a80    
>> > c000288af9f0:  c000288afbc0 00dd   ...(
>> > c000288afa00:  c14322e0 c1c60a00   ."C.
>> > c000288afa10:  c002303ae380 c002303ae380   ..:0..:0
>> > c000288afa20:  7265677368657265 2200   erehsger."..
>> >
>> > Uh-oh...
>> >
>> > /* Mark stacktraces with exception frames as unreliable. */
>> > stack[STACK_FRAME_MARKER] == STACK_FRAME_REGS_MARKER
>> 
>> 
>> Aliasing of the switching stack frame with some prior interrupt stack
>> frame would explain why that STACK_FRAME_REGS_MARKER is still found on
>> the stack, i.e. it's a leftover.
>> 
>> For testing, could you try whether clearing the word at STACK_FRAME_MARKER
>> from _switch() helps?
>> 
>> I.e. something like (completely untested):
>
> I'll kick off some builds tonight and try to get tests lined up
> tomorrow.  Unfortunately these take a bit of time to run setup, schedule
> and complete, so perhaps by next week.

Ok, that's probably still a good test for confirmation, but the solution
you pro

Re: ppc64le reliable stack unwinder and scheduled tasks

2019-01-10 Thread Nicolai Stange
Hi Joe,

Joe Lawrence  writes:

> tl;dr: On ppc64le, what is top-most stack frame for scheduled tasks
>about?

If I'm reading the code in _switch() correctly, the first frame is
completely uninitialized except for the pointer back to the caller's
stack frame.

For completeness: _switch() saves the return address, i.e. the link
register into its parent's stack frame, as is mandated by the ABI and
consistent with your findings below: it's always the second stack frame
where the return address into __switch_to() is kept.



>
>
> Example 1 (RHEL-7)
> ==
>
> crash> struct task_struct.thread c0022fd015c0 | grep ksp
> ksp = 0xc000288af9c0
>
> crash> rd 0xc000288af9c0 -e 0xc000288b
>
>  - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
>
> sp[0]:
>
> c000288af9c0:  c000288afb90 00dd   ...(
> c000288af9d0:  c0002a94 c1c60a00   .*..
>
> crash> sym c0002a94
> c0002a94 (T) hardware_interrupt_common+0x114

So that c0002a94 certainly wasn't stored by _switch(). I think
what might have happened is that the switching frame aliased with some
prior interrupt frame as set up by hardware_interrupt_common().

The interrupt and switching frames seem to share a common layout as far
as the lower STACK_FRAME_OVERHEAD + sizeof(struct pt_regs) bytes are
concerned.

That address into hardware_interrupt_common() could have been written by
the do_IRQ() called from there.


> c000288af9e0:  c1c60a80    
> c000288af9f0:  c000288afbc0 00dd   ...(
> c000288afa00:  c14322e0 c1c60a00   ."C.
> c000288afa10:  c002303ae380 c002303ae380   ..:0..:0
> c000288afa20:  7265677368657265 2200   erehsger."..
>
> Uh-oh...
>
> /* Mark stacktraces with exception frames as unreliable. */
> stack[STACK_FRAME_MARKER] == STACK_FRAME_REGS_MARKER


Aliasing of the switching stack frame with some prior interrupt stack
frame would explain why that STACK_FRAME_REGS_MARKER is still found on
the stack, i.e. it's a leftover.

For testing, could you try whether clearing the word at STACK_FRAME_MARKER
from _switch() helps?

I.e. something like (completely untested):

diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
index 435927f549c4..b747d0647ec4 100644
--- a/arch/powerpc/kernel/entry_64.S
+++ b/arch/powerpc/kernel/entry_64.S
@@ -596,6 +596,10 @@ _GLOBAL(_switch)
SAVE_8GPRS(14, r1)
SAVE_10GPRS(22, r1)
std r0,_NIP(r1) /* Return to switch caller */
+
+   li  r23,0
+   std r23,96(r1)  /* 96 == STACK_FRAME_MARKER * sizeof(long) */
+
mfcrr23
std r23,_CCR(r1)
std r1,KSP(r3)  /* Set old stack pointer */




>
> save_stack_trace_tsk_reliable
> =
>
> arch/powerpc/kernel/stacktrace.c :: save_stack_trace_tsk_reliable() does
> take into account the first stackframe, but only to verify that the link
> register is indeed pointing at kernel code address.

It's actually the other way around:

if (!firstframe && !__kernel_text_address(ip))
return 1;


So the address gets sanity-checked only if it's _not_ coming from the first
frame.


Thanks,

Nicolai

-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton,
HRB 21284 (AG Nürnberg)


[RFC PATCH 1/1] x86/ftrace: make ftrace_int3_handler() not to skip fops invocation

2018-07-26 Thread Nicolai Stange
With dynamic ftrace, ftrace patches call sites in three steps:
1. Put a breakpoint at the to be patched location,
2. update call site and
3. finally remove the breakpoint again.

Note that the breakpoint handler, ftrace_int3_handler(), simply makes
execution skip over the to be patched function.

This patching happens in the following circumstances:
- the global ftrace_trace_function changes and the call sites at
  ftrace_call and ftrace_regs_call get patched,
- an ftrace_ops' ->func changes and the call site in its trampoline gets
  patched,
- graph tracing gets enabled/disabled and the jump site at
  ftrace_graph_call gets patched
- a mcount site gets converted from nop -> call, call -> nop
  or call -> call.

The latter case, i.e. a mcount call getting redirected, happens for example
in a transition from trampolined to not trampolined upon a user enabling
function tracing on a live patched function.

The ftrace_int3_handler() simply skipping over the mcount callsite is
problematic here, because it means that a live patched function gets
temporarily reverted to its unpatched original and this breaks live
patching consistency.

Make ftrace_int3_handler() not skip over the fops invocation, but modify
the interrupted control flow to issue a call as needed.

For determining what the proper action actually is, modify
update_ftrace_func() to take an extra argument, func, to be called if the
corresponding breakpoint gets hit. Introduce a new global variable,
ftrace_update_func_dest, and let update_ftrace_func() set it. For the special
case of patching the jmp insn at ftrace_graph_call, set it to zero and make
ftrace_int3_handler() just skip over the insn in this case as before.

Because there's no space left above the int3 handler's iret frame to stash
an extra call's return address in, this can't be mimicked from the
ftrace_int3_handler() itself.

Instead, make ftrace_int3_handler() change the iret frame's %rip slot to
point to the new ftrace_int3_call_trampoline to be executed immediately
after iret.

The original %rip gets communicated to ftrace_int3_call_trampoline which
can then take proper action. Note that ftrace_int3_call_trampoline can
nest, because of NMIs, for example. In order to avoid introducing another
stack, abuse %r11 for passing the original %rip. This is safe, because the
interrupted context is always at a call insn and, according to the x86_64
ELF ABI, %r11 is callee-clobbered. According to the glibc sources,
this is also true for the somewhat special mcount/fentry ABI.

OTOH, a spare register doesn't exist on i686 and thus, this approach is
limited to x86_64.

For determining the call target address, let ftrace_int3_call_trampoline
compare ftrace_update_func against the interrupted %rip.

If they match, an update_ftrace_func() is currently pending and the
call site is either of ftrace_call, ftrace_regs_call or the call insn
within a fops' trampoline. Jump to the ftrace_update_func_dest location as
set by update_ftrace_func().

If OTOH the interrupted %rip doesn't equal ftrace_update_func, then
it points to an mcount call site. In this case, redirect control flow to
the most generic handler, ftrace_regs_caller, which will then do the right
thing.
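
Expressed as C-like pseudocode, the decision taken by
ftrace_int3_call_trampoline thus amounts to the following ('orig_rip' and
'target' are made-up names, the actual implementation lives in assembly):

        if (orig_rip == ftrace_update_func)
                /* pending update_ftrace_func(): ftrace_call,
                 * ftrace_regs_call or a trampoline's call insn */
                target = ftrace_update_func_dest;
        else
                /* an mcount call site */
                target = ftrace_regs_caller;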

Finally, reading ftrace_update_func and ftrace_update_func_dest from
outside of the int3 handler requires some care: in between adding and
removing the breakpoints, ftrace invokes run_sync() which basically
issues a couple of IPIs. Because the int3 handler has interrupts disabled,
it orders with run_sync().

To extend that ordering to also include ftrace_int3_call_trampoline, make
ftrace_int3_handler() disable interrupts within the iret frame returning to
it.

Store the original EFLAGS.IF into %r11's MSB, which is redundant anyway
because %r11 holds a kernel address. Make ftrace_int3_call_trampoline
restore it when done.

Signed-off-by: Nicolai Stange 
---
 arch/x86/kernel/ftrace.c| 48 --
 arch/x86/kernel/ftrace_64.S | 56 +
 2 files changed, 97 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
index 01ebcb6f263e..3908a73370a8 100644
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -227,9 +227,11 @@ int ftrace_modify_call(struct dyn_ftrace *rec, unsigned 
long old_addr,
return -EINVAL;
 }
 
-static unsigned long ftrace_update_func;
+unsigned long ftrace_update_func;
+unsigned long ftrace_update_func_dest;
 
-static int update_ftrace_func(unsigned long ip, void *new)
+static int update_ftrace_func(unsigned long ip, void *new,
+ unsigned long func)
 {
unsigned char old[MCOUNT_INSN_SIZE];
int ret;
@@ -237,6 +239,8 @@ static int update_ftrace_func(unsigned long ip, void *new)
memcpy(old, (void *)ip, MCOUNT_INSN_SIZE);
 
ftrace_update_func = ip;
+   ftrace_update_func_dest = func;
+
/* Make sure the bre

[RFC PATCH 0/1] x86/ftrace: fix live patching vs. tracing race

2018-07-26 Thread Nicolai Stange
Hi,

if a user starts to trace a live patched function, its mcount call will get
redirected from a trampoline to ftrace_regs_caller.

In preparation for that, ftrace on x86 first installs an int3 insn at that
call site.

ftrace_int3_handler() in turn simply skips over the mcount call insn,
effectively reverting the livepatch for that function during
ftrace_replace_code().

This breaks KLP's consistency model.


There are two possible options for fixing this:
1.) At the ftrace level.
2.) Search for a matching klp_ops from ftrace_int3_handler() and
handle the redirection if needed.

Both have their drawbacks, hence the RFC mode for this patch implementing
1.).

The main disadvantage is that it doesn't work on 32 bits (c.f. the patch
description), but for KLP this would be fine.

OTOH, it keeps KLP specific code out of ftrace_int3_handler() and might
perhaps be beneficial in other contexts as well.

Thanks for your comments!

Nicolai

Nicolai Stange (1):
  x86/ftrace: make ftrace_int3_handler() not to skip fops invocation

 arch/x86/kernel/ftrace.c| 48 --
 arch/x86/kernel/ftrace_64.S | 56 +
 2 files changed, 97 insertions(+), 7 deletions(-)

-- 
2.13.7



Re: [kbuild-all] [PATCH] OPTIONAL: cpufreq/intel_pstate: fix debugfs_simple_attr.cocci warnings

2018-03-30 Thread Nicolai Stange
Julia Lawall <julia.law...@lip6.fr> writes:

> On Fri, 30 Mar 2018, Nicolai Stange wrote:
>
>> Julia Lawall <julia.law...@lip6.fr> writes:
>>
>> > On Thu, 29 Mar 2018, Fabio Estevam wrote:
>> >
>> >> Hi Julia,
>> >>
>> >> On Thu, Mar 29, 2018 at 4:12 PM, Julia Lawall <julia.law...@lip6.fr> 
>> >> wrote:
>> >> >  Use DEFINE_DEBUGFS_ATTRIBUTE rather than DEFINE_SIMPLE_ATTRIBUTE
>> >> >  for debugfs files.
>> >> >
>> >> > Semantic patch information:
>> >> >  Rationale: DEFINE_SIMPLE_ATTRIBUTE + debugfs_create_file()
>> >> >  imposes some significant overhead as compared to
>> >> >  DEFINE_DEBUGFS_ATTRIBUTE + debugfs_create_file_unsafe().
>> >>
>> >> Just curious: could you please expand on what "imposes some
>> >> significant overhead" means?
>> >
>> > I don't know.  I didn't write this rule.  Nicolai, can you explain?
>>
>> From commit 49d200deaa68 ("debugfs: prevent access to removed files' private
>> data"):
>>
>> Upon return of debugfs_remove()/debugfs_remove_recursive(), it might
>> still be attempted to access associated private file data through
>> previously opened struct file objects. If that data has been freed by
>> the caller of debugfs_remove*() in the meanwhile, the reading/writing
>> process would either encounter a fault or, if the memory address in
>> question has been reassigned again, unrelated data structures could get
>> overwritten.
>> [...]
>> Currently, there are ~1000 call sites of debugfs_create_file() spread
>> throughout the whole tree and touching all of those struct 
>> file_operations
>> in order to make them file removal aware by means of checking the result 
>> of
>> debugfs_use_file_start() from within their methods is unfeasible.
>>
>> Instead, wrap the struct file_operations by a lifetime managing proxy at
>> file open [...]
>>
>> The additional overhead comes in terms of additional memory needed: for
>> debugs files created through debugfs_create_file(), one such struct
>> file_operations proxy is allocated for each struct file instantiation,
>> c.f. full_proxy_open().
>>
>> This was needed to "repair" the ~1000 call sites without touching them.
>>
>> New debugfs users should make their file_operations removal aware
>> themselves by means of DEFINE_DEBUGFS_ATTRIBUTE() and signal that fact to
>> the debugfs core by instantiating them through
>> debugfs_create_file_unsafe().
>>
>> See commit c64688081490 ("debugfs: add support for self-protecting
>> attribute file fops") for further information.
>
> Thanks.  Perhaps it would be good to add a reference to this commit in
> the message generated by the semantic patch.

Thanks for doing this!


>
> Would it be sufficient to just apply the semantic patch everywhere and
> submit the patches?

In principle yes. But I'm not sure whether such a mass application is
worth it: the proxy allocation happens only at file open and the
expectation is that there aren't that many debugfs files kept open at a
time. OTOH, a struct file_operations consumes 256 bytes with
sizeof(long) == 8.

In my opinion, new users should avoid this overhead as it's easily
doable. For existing ones, I don't know.
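
To illustrate how little it takes for new code, a made-up minimal example
(foo_dir and foo_val are assumed to exist already):

        static int foo_val_get(void *data, u64 *val)
        {
                *val = *(u32 *)data;
                return 0;
        }

        static int foo_val_set(void *data, u64 val)
        {
                *(u32 *)data = val;
                return 0;
        }

        DEFINE_DEBUGFS_ATTRIBUTE(foo_val_fops, foo_val_get, foo_val_set,
                                 "%llu\n");

        /* ... and at initialization time: */
        debugfs_create_file_unsafe("foo_val", 0644, foo_dir, &foo_val,
                                   &foo_val_fops);

As opposed to DEFINE_SIMPLE_ATTRIBUTE() + debugfs_create_file(), no
per-open proxy fops gets allocated for files created this way.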

Thanks,

Nicolai

-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton,
HRB 21284 (AG Nürnberg)


Re: [kbuild-all] [PATCH] OPTIONAL: cpufreq/intel_pstate: fix debugfs_simple_attr.cocci warnings

2018-03-30 Thread Nicolai Stange
Julia Lawall  writes:

> On Thu, 29 Mar 2018, Fabio Estevam wrote:
>
>> Hi Julia,
>>
>> On Thu, Mar 29, 2018 at 4:12 PM, Julia Lawall  wrote:
>> >  Use DEFINE_DEBUGFS_ATTRIBUTE rather than DEFINE_SIMPLE_ATTRIBUTE
>> >  for debugfs files.
>> >
>> > Semantic patch information:
>> >  Rationale: DEFINE_SIMPLE_ATTRIBUTE + debugfs_create_file()
>> >  imposes some significant overhead as compared to
>> >  DEFINE_DEBUGFS_ATTRIBUTE + debugfs_create_file_unsafe().
>>
>> Just curious: could you please expand on what "imposes some
>> significant overhead" means?
>
> I don't know.  I didn't write this rule.  Nicolai, can you explain?

From commit 49d200deaa68 ("debugfs: prevent access to removed files' private
data"):

Upon return of debugfs_remove()/debugfs_remove_recursive(), it might
still be attempted to access associated private file data through
previously opened struct file objects. If that data has been freed by
the caller of debugfs_remove*() in the meanwhile, the reading/writing
process would either encounter a fault or, if the memory address in
question has been reassigned again, unrelated data structures could get
overwritten.
[...]
Currently, there are ~1000 call sites of debugfs_create_file() spread
throughout the whole tree and touching all of those struct file_operations
in order to make them file removal aware by means of checking the result of
debugfs_use_file_start() from within their methods is unfeasible.

Instead, wrap the struct file_operations by a lifetime managing proxy at
file open [...]

The additional overhead comes in terms of additional memory needed: for
debugs files created through debugfs_create_file(), one such struct
file_operations proxy is allocated for each struct file instantiation,
c.f. full_proxy_open().

This was needed to "repair" the ~1000 call sites without touching them.

New debugfs users should make their file_operations removal aware
themselves by means of DEFINE_DEBUGFS_ATTRIBUTE() and signal that fact to
the debugfs core by instantiating them through
debugfs_create_file_unsafe().

See commit c64688081490 ("debugfs: add support for self-protecting
attribute file fops") for further information.


Thanks,

Nicolai


-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton,
HRB 21284 (AG Nürnberg)


[PATCH v2] net: ipv4: emulate READ_ONCE() on ->hdrincl bit-field in raw_sendmsg()

2018-01-08 Thread Nicolai Stange
Commit 8f659a03a0ba ("net: ipv4: fix for a race condition in
raw_sendmsg") fixed the issue of possibly inconsistent ->hdrincl handling
due to concurrent updates by reading this bit-field member into a local
variable and using the thus stabilized value in subsequent tests.

However, aforementioned commit also adds the (correct) comment that

  /* hdrincl should be READ_ONCE(inet->hdrincl)
   * but READ_ONCE() doesn't work with bit fields
   */

because as it stands, the compiler is free to shortcut or even eliminate
the local variable at will.

Note that I have not seen anything like this happening in reality and thus,
the concern is a theoretical one.

However, in order to be on the safe side, emulate a READ_ONCE() on the
bit-field by doing it on the local 'hdrincl' variable itself:

int hdrincl = inet->hdrincl;
hdrincl = READ_ONCE(hdrincl);

This breaks the chain in the sense that the compiler is not allowed
to replace subsequent reads from hdrincl with reloads from inet->hdrincl.
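
For illustration, the same pattern applied to a made-up bit-field (not
the actual raw_sendmsg() code):

        struct s {
                unsigned int flag : 1;
        };

        static int read_flag_once(const struct s *s)
        {
                int flag = s->flag;

                /* subsequent uses of 'flag' can't be turned into reloads
                 * of s->flag anymore */
                flag = READ_ONCE(flag);
                return flag;
        }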

Fixes: 8f659a03a0ba ("net: ipv4: fix for a race condition in raw_sendmsg")
Signed-off-by: Nicolai Stange <nsta...@suse.de>
---
 Compile-tested only (with inspection of compiler output on x86_64).
 Note that there's still a ->hdrincl reload, but that's due to the
 inet_sk_flowi_flags() invocation and probably harmless.

 Applicable to linux-next-20180108.

Changes since v1:
 - Apply review from Stefano Brivio, i.e. drop the __hdrincl
   intermediate, update comment and commit message accordingly. 

 net/ipv4/raw.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 5b9bd5c33d9d..6717cf2008f0 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -520,9 +520,11 @@ static int raw_sendmsg(struct sock *sk, struct msghdr 
*msg, size_t len)
goto out;
 
/* hdrincl should be READ_ONCE(inet->hdrincl)
-* but READ_ONCE() doesn't work with bit fields
+* but READ_ONCE() doesn't work with bit fields.
+* Doing this indirectly yields the same result.
 */
hdrincl = inet->hdrincl;
+   hdrincl = READ_ONCE(hdrincl);
/*
 *  Check the flags.
 */
-- 
2.13.6



Re: [PATCH] net: ipv4: emulate READ_ONCE() on ->hdrincl bit-field in raw_sendmsg()

2018-01-03 Thread Nicolai Stange
Hi Stefano,

Stefano Brivio <sbri...@redhat.com> writes:

> On Tue,  2 Jan 2018 17:30:20 +0100
> Nicolai Stange <nsta...@suse.de> wrote:
>
>> [...]
>>
>> diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
>> index 5b9bd5c33d9d..e84290c28c0c 100644
>> --- a/net/ipv4/raw.c
>> +++ b/net/ipv4/raw.c
>> @@ -513,16 +513,18 @@ static int raw_sendmsg(struct sock *sk, struct msghdr 
>> *msg, size_t len)
>>  int err;
>>  struct ip_options_data opt_copy;
>>  struct raw_frag_vec rfv;
>> -int hdrincl;
>> +int hdrincl, __hdrincl;
>>  
>>  err = -EMSGSIZE;
>>  if (len > 0xFFFF)
>>  goto out;
>>  
>>  /* hdrincl should be READ_ONCE(inet->hdrincl)
>> - * but READ_ONCE() doesn't work with bit fields
>> + * but READ_ONCE() doesn't work with bit fields.
>> + * Emulate it by doing the READ_ONCE() from an intermediate int.
>>   */
>> -hdrincl = inet->hdrincl;
>> +__hdrincl = inet->hdrincl;
>> +hdrincl = READ_ONCE(__hdrincl);
>
> I guess you don't actually need to use a third variable. What about
> doing READ_ONCE() on hdrincl itself after the first assignment?
>
> Perhaps something like the patch below -- applies to net.git, yields
> same binary output as your version with gcc 6, looks IMHO more
> straightforward:
>
> diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
> index 125c1eab3eaa..8c2f783a95fc 100644
> --- a/net/ipv4/raw.c
> +++ b/net/ipv4/raw.c
> @@ -519,10 +519,12 @@ static int raw_sendmsg(struct sock *sk, struct msghdr 
> *msg, size_t len)
>   if (len > 0xFFFF)
>   goto out;
>  
> - /* hdrincl should be READ_ONCE(inet->hdrincl)
> -  * but READ_ONCE() doesn't work with bit fields
> + /* hdrincl should be READ_ONCE(inet->hdrincl) but READ_ONCE() doesn't
> +  * work with bit fields. Emulate it by adding a further sequence point.
>*/
>   hdrincl = inet->hdrincl;
> + hdrincl = READ_ONCE(hdrincl);
> +

Yes, this does also work. In fact, after having been lowered into SSA
form, it should be equivalent to what I posted.

So, it's a matter of preference/style and I'd leave the decision on
this to the maintainers -- for me, either way is fine.

I don't like the "sequence point" wording in the comment above though:
AFAICS, if taken in the meaning of C99, it's not any sequence point but
the volatile access in READ_ONCE() which ensures that there won't be any
reloads from ->hdrincl. If you don't mind, I'll adjust that comment if
asked to resend with your solution.
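
Purely as an illustration (nothing below is in the patch), the idiom can
be reproduced in a self-contained userspace snippet; READ_ONCE_INT() is a
made-up stand-in for the kernel's READ_ONCE(), reduced to the volatile
access that actually prevents the reloads:

  #include <stdio.h>

  /* Illustrative stand-in for READ_ONCE() on an int. */
  #define READ_ONCE_INT(x) (*(volatile int *)&(x))

  struct sock_flags {
          unsigned int hdrincl : 1;   /* bit-field, no READ_ONCE() on this */
  };

  static int build_packet(const struct sock_flags *sf)
  {
          int hdrincl = sf->hdrincl;  /* compiler may still reload sf->hdrincl... */

          hdrincl = READ_ONCE_INT(hdrincl); /* ...but not past this point */

          return hdrincl ? 1 : 0;     /* 1: caller supplies the IP header */
  }

  int main(void)
  {
          struct sock_flags sf = { .hdrincl = 1 };

          printf("hdrincl path: %d\n", build_packet(&sf));
          return 0;
  }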

Thanks,

Nicolai


[PATCH] net: ipv4: emulate READ_ONCE() on ->hdrincl bit-field in raw_sendmsg()

2018-01-02 Thread Nicolai Stange
Commit 8f659a03a0ba ("net: ipv4: fix for a race condition in
raw_sendmsg") fixed the issue of possibly inconsistent ->hdrincl handling
due to concurrent updates by reading this bit-field member into a local
variable and using the thus stabilized value in subsequent tests.

However, aforementioned commit also adds the (correct) comment that

  /* hdrincl should be READ_ONCE(inet->hdrincl)
   * but READ_ONCE() doesn't work with bit fields
   */

because as it stands, the compiler is free to shortcut or even eliminate
the local variable at its will.

Note that I have not seen anything like this happening in reality and thus,
the concern is a theoretical one.

However, in order to be on the safe side, emulate a READ_ONCE() on the
bit-field by introducing an intermediate local variable and doing a
READ_ONCE() from it:

int __hdrincl = inet->hdrincl;
int hdrincl = READ_ONCE(__hdrincl);

This breaks the chain in the sense that the compiler is not allowed
to replace subsequent reads from hdrincl with reloads from inet->hdrincl.

Fixes: 8f659a03a0ba ("net: ipv4: fix for a race condition in raw_sendmsg")
Signed-off-by: Nicolai Stange <nsta...@suse.de>
---
 Compile-tested only (with inspection of compiler output on x86_64).
 Applicable to linux-next-20180102.

 net/ipv4/raw.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 5b9bd5c33d9d..e84290c28c0c 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -513,16 +513,18 @@ static int raw_sendmsg(struct sock *sk, struct msghdr 
*msg, size_t len)
int err;
struct ip_options_data opt_copy;
struct raw_frag_vec rfv;
-   int hdrincl;
+   int hdrincl, __hdrincl;
 
err = -EMSGSIZE;
if (len > 0xFFFF)
goto out;
 
/* hdrincl should be READ_ONCE(inet->hdrincl)
-* but READ_ONCE() doesn't work with bit fields
+* but READ_ONCE() doesn't work with bit fields.
+* Emulate it by doing the READ_ONCE() from an intermediate int.
 */
-   hdrincl = inet->hdrincl;
+   __hdrincl = inet->hdrincl;
+   hdrincl = READ_ONCE(__hdrincl);
/*
 *  Check the flags.
 */
-- 
2.13.6



Re: [nohz_full/apic] multiple timer interrupts a second

2017-11-04 Thread Nicolai Stange
Hi Luiz,

[John Stultz added to CC]

On Fri, Nov 03 2017, Luiz Capitulino wrote:

> [CC'ing lkml this time]
>
> I've observed that smp_apic_timer_interrupt() is sometimes called
> two or more times a second on a nohz_full core which has a single
> task taking 100% of the core. In one of the calls, hrtimer_interrupt()
> runs tick_sched_timer(), but in others it doesn't call any handler.
> Here's an example (Linus HEAD f34157878):
>
> <...>-1831  [008]  1060.115578: funcgraph_entry:   |  
> smp_apic_timer_interrupt() {
> <...>-1831  [008]  1060.115578: funcgraph_entry:   |
> hrtimer_interrupt() {
> <...>-1831  [008]  1060.115578: hrtimer_expire_entry: 
> hrtimer=0x8edfefd12d60 function=tick_sched_timer now=1060079001980
> <...>-1831  [008]  1060.115578: funcgraph_entry:1.172 us   |  
> tick_sched_timer();
> <...>-1831  [008]  1060.115580: funcgraph_exit: 1.757 us   |}
> <...>-1831  [008]  1060.115580: hrtimer_start:
> hrtimer=0x8edfefd12d60 function=tick_sched_timer expires=106107900 
> softexpires=106107900
> <...>-1831  [008]  1060.115581: funcgraph_exit: 3.026 us   |  }
> <...>-1831  [008]  1061.115577: funcgraph_entry:   |  
> smp_apic_timer_interrupt() {
> <...>-1831  [008]  1061.115577: funcgraph_entry:0.261 us   |
> hrtimer_interrupt(); <-- NO handler called
> <...>-1831  [008]  1061.115578: funcgraph_exit: 1.349 us   |  }
> <...>-1831  [008]  1061.115579: funcgraph_entry:   |  
> smp_apic_timer_interrupt() {
> <...>-1831  [008]  1061.115579: funcgraph_entry:   |
> hrtimer_interrupt() {
> <...>-1831  [008]  1061.115579: hrtimer_expire_entry: 
> hrtimer=0x8edfefd12d60 function=tick_sched_timer now=1061079001473
> <...>-1831  [008]  1061.115580: funcgraph_entry:1.413 us   |  
> tick_sched_timer();
> <...>-1831  [008]  1061.115581: funcgraph_exit: 2.124 us   |}
> <...>-1831  [008]  1061.115582: hrtimer_start:
> hrtimer=0x8edfefd12d60 function=tick_sched_timer expires=106207900 
> softexpires=106207900
> <...>-1831  [008]  1061.115582: funcgraph_exit: 3.255 us   |  }
>
> Is this expected for some reason?
>
> I guess what's happening is that the deadline timer is firing
> earlier than expected. From a few dozen to a few hundreds
> nanoseconds earlier. When this happens, hrtimer_interrupt()
> skips calling the hrtimer handler (since it's early) and the
> apic is programmed to fire in the next microsecond.

Exactly.


> On further research I saw that Nicolai tried to fix a very similar
> problem last year:
>
> commit 1a9e4c564ab174e53ed86def922804a5ddc63e7d
> Author: Nicolai Stange <nicsta...@gmail.com>
> Date:   Thu Jul 14 17:22:54 2016 +0200
>
> x86/timers/apic: Fix imprecise timer interrupts by eliminating TSC 
> clockevents frequency roundoff error
> 
> I noticed the following bug/misbehavior on certain Intel systems: with a
> single task running on a NOHZ CPU on an Intel Haswell, I recognized
> that I did not only get the one expected local_timer APIC interrupt, but
> two per second at minimum. (!)

Note that there's also 6731b0d611a1 ("x86/timers/apic: Inform TSC
deadline clockevent device about recalibration").


> Maybe this issue is still present?

Yes it is. The reason is that NTP frequency adjustments would make it
into the mono clocksource's frequency but not into the
clock_event_device's.
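
As a rough, purely illustrative calculation (the numbers are made up): a
relative frequency mismatch x between the timekeeping clocksource and the
clockevent device shifts a programmed interval delta by about delta * x,
so even a sub-ppm mismatch is enough to make a 1 s nohz tick land a few
hundred ns off target and trigger the extra reprogramming pass seen in
the trace:

  #include <stdio.h>

  int main(void)
  {
          const double mismatch = 300e-9; /* 0.3 ppm, illustrative only */
          const double delta_s  = 1.0;    /* one nohz_full tick period  */

          printf("deadline lands ~%.0f ns off target\n",
                 delta_s * mismatch * 1e9);
          return 0;
  }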

I sent a series back then [1] addressing this.

However, in the meanwhile, I changed jobs and thus, got somehow
distracted. My bad :/

The result is that this got merged only up to what corresponds to
[10/28] in [1].

I'll pick this up again now and try to get the rest accepted for
inclusion.


Thanks,

Nicolai


[1] last time I sent this series as a whole can be found at
http://lkml.kernel.org/r/20161119160055.12491-2-nicsta...@gmail.com


[PATCH v3 3/8] debugfs: debugfs_real_fops(): drop __must_hold sparse annotation

2017-10-30 Thread Nicolai Stange
Currently, debugfs_real_fops() is annotated with a
__must_hold(_srcu) sparse annotation.

With the conversion of the SRCU based protection of users against
concurrent file removals to a per-file refcount based scheme, this becomes
wrong.

Drop this annotation.

Signed-off-by: Nicolai Stange <nicsta...@gmail.com>
---
 fs/debugfs/file.c   | 6 +-
 include/linux/debugfs.h | 3 +--
 2 files changed, 2 insertions(+), 7 deletions(-)

diff --git a/fs/debugfs/file.c b/fs/debugfs/file.c
index 6644bfdea2f8..08511678b782 100644
--- a/fs/debugfs/file.c
+++ b/fs/debugfs/file.c
@@ -98,13 +98,9 @@ EXPORT_SYMBOL_GPL(debugfs_use_file_finish);
 #define F_DENTRY(filp) ((filp)->f_path.dentry)
 
 const struct file_operations *debugfs_real_fops(const struct file *filp)
-   __must_hold(&debugfs_srcu)
 {
struct debugfs_fsdata *fsd = F_DENTRY(filp)->d_fsdata;
-   /*
-* Neither the pointer to the struct file_operations, nor its
-* contents ever change -- srcu_dereference() is not needed here.
-*/
+
return fsd->real_fops;
 }
 EXPORT_SYMBOL_GPL(debugfs_real_fops);
diff --git a/include/linux/debugfs.h b/include/linux/debugfs.h
index 3b914d588148..c5eda259b9d6 100644
--- a/include/linux/debugfs.h
+++ b/include/linux/debugfs.h
@@ -95,8 +95,7 @@ int debugfs_use_file_start(const struct dentry *dentry, int 
*srcu_idx)
 
void debugfs_use_file_finish(int srcu_idx) __releases(&debugfs_srcu);
 
-const struct file_operations *debugfs_real_fops(const struct file *filp)
-   __must_hold(&debugfs_srcu);
+const struct file_operations *debugfs_real_fops(const struct file *filp);
 
 int debugfs_file_get(struct dentry *dentry);
 void debugfs_file_put(struct dentry *dentry);
-- 
2.13.6



[PATCH v3 4/8] debugfs: convert to debugfs_file_get() and -put()

2017-10-30 Thread Nicolai Stange
Convert all calls to the now obsolete debugfs_use_file_start() and
debugfs_use_file_finish() from the debugfs core itself to the new
debugfs_file_get() and debugfs_file_put() API.

Fixes: 49d200deaa68 ("debugfs: prevent access to removed files' private
  data")
Signed-off-by: Nicolai Stange 
---
 fs/debugfs/file.c | 106 ++
 1 file changed, 50 insertions(+), 56 deletions(-)

diff --git a/fs/debugfs/file.c b/fs/debugfs/file.c
index 08511678b782..d3a972b45ff0 100644
--- a/fs/debugfs/file.c
+++ b/fs/debugfs/file.c
@@ -155,15 +155,12 @@ EXPORT_SYMBOL_GPL(debugfs_file_put);
 
 static int open_proxy_open(struct inode *inode, struct file *filp)
 {
-   const struct dentry *dentry = F_DENTRY(filp);
+   struct dentry *dentry = F_DENTRY(filp);
const struct file_operations *real_fops = NULL;
-   int srcu_idx, r;
+   int r = 0;
 
-   r = debugfs_use_file_start(dentry, &srcu_idx);
-   if (r) {
-   r = -ENOENT;
-   goto out;
-   }
+   if (debugfs_file_get(dentry))
+   return -ENOENT;
 
real_fops = debugfs_real_fops(filp);
real_fops = fops_get(real_fops);
@@ -180,7 +177,7 @@ static int open_proxy_open(struct inode *inode, struct file 
*filp)
r = real_fops->open(inode, filp);
 
 out:
-   debugfs_use_file_finish(srcu_idx);
+   debugfs_file_put(dentry);
return r;
 }
 
@@ -194,16 +191,16 @@ const struct file_operations 
debugfs_open_proxy_file_operations = {
 #define FULL_PROXY_FUNC(name, ret_type, filp, proto, args) \
 static ret_type full_proxy_ ## name(proto) \
 {  \
-   const struct dentry *dentry = F_DENTRY(filp);   \
+   struct dentry *dentry = F_DENTRY(filp); \
const struct file_operations *real_fops =   \
debugfs_real_fops(filp);\
-   int srcu_idx;   \
ret_type r; \
\
-   r = debugfs_use_file_start(dentry, &srcu_idx);  \
-   if (likely(!r)) \
-   r = real_fops->name(args);  \
-   debugfs_use_file_finish(srcu_idx);  \
+   r = debugfs_file_get(dentry);   \
+   if (unlikely(r))\
+   return r;   \
+   r = real_fops->name(args);  \
+   debugfs_file_put(dentry);   \
return r;   \
 }
 
@@ -228,18 +225,15 @@ FULL_PROXY_FUNC(unlocked_ioctl, long, filp,
 static unsigned int full_proxy_poll(struct file *filp,
struct poll_table_struct *wait)
 {
-   const struct dentry *dentry = F_DENTRY(filp);
const struct file_operations *real_fops = debugfs_real_fops(filp);
-   int srcu_idx;
+   struct dentry *dentry = F_DENTRY(filp);
unsigned int r = 0;
 
-   if (debugfs_use_file_start(dentry, &srcu_idx)) {
-   debugfs_use_file_finish(srcu_idx);
+   if (debugfs_file_get(dentry))
return POLLHUP;
-   }
 
r = real_fops->poll(filp, wait);
-   debugfs_use_file_finish(srcu_idx);
+   debugfs_file_put(dentry);
return r;
 }
 
@@ -283,16 +277,13 @@ static void __full_proxy_fops_init(struct file_operations 
*proxy_fops,
 
 static int full_proxy_open(struct inode *inode, struct file *filp)
 {
-   const struct dentry *dentry = F_DENTRY(filp);
+   struct dentry *dentry = F_DENTRY(filp);
const struct file_operations *real_fops = NULL;
struct file_operations *proxy_fops = NULL;
-   int srcu_idx, r;
+   int r = 0;
 
-   r = debugfs_use_file_start(dentry, &srcu_idx);
-   if (r) {
-   r = -ENOENT;
-   goto out;
-   }
+   if (debugfs_file_get(dentry))
+   return -ENOENT;
 
real_fops = debugfs_real_fops(filp);
real_fops = fops_get(real_fops);
@@ -330,7 +321,7 @@ static int full_proxy_open(struct inode *inode, struct file 
*filp)
kfree(proxy_fops);
fops_put(real_fops);
 out:
-   debugfs_use_file_finish(srcu_idx);
+   debugfs_file_put(dentry);
return r;
 }
 
@@ -341,13 +332,14 @@ const struct file_operations 
debugfs_full_proxy_file_operations = {
 ssize_t debugfs_attr_read(struct file *file, char __user *buf,
size_t len, loff_t *ppos)
 {
+   struct dentry *dentry = F_DENTRY(file

[PATCH v3 5/8] IB/hfi1: convert to debugfs_file_get() and -put()

2017-10-30 Thread Nicolai Stange
Convert all calls to the now obsolete debugfs_use_file_start() and
debugfs_use_file_finish() to the new debugfs_file_get() and
debugfs_file_put() API.

Fixes: 49d200deaa68 ("debugfs: prevent access to removed files' private
  data")
Signed-off-by: Nicolai Stange <nicsta...@gmail.com>
---
 drivers/infiniband/hw/hfi1/debugfs.c | 20 ++--
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/drivers/infiniband/hw/hfi1/debugfs.c 
b/drivers/infiniband/hw/hfi1/debugfs.c
index 24128f4e5748..ca03d8b23851 100644
--- a/drivers/infiniband/hw/hfi1/debugfs.c
+++ b/drivers/infiniband/hw/hfi1/debugfs.c
@@ -71,13 +71,13 @@ static ssize_t hfi1_seq_read(
loff_t *ppos)
 {
struct dentry *d = file->f_path.dentry;
-   int srcu_idx;
ssize_t r;
 
-   r = debugfs_use_file_start(d, &srcu_idx);
-   if (likely(!r))
-   r = seq_read(file, buf, size, ppos);
-   debugfs_use_file_finish(srcu_idx);
+   r = debugfs_file_get(d);
+   if (unlikely(r))
+   return r;
+   r = seq_read(file, buf, size, ppos);
+   debugfs_file_put(d);
return r;
 }
 
@@ -87,13 +87,13 @@ static loff_t hfi1_seq_lseek(
int whence)
 {
struct dentry *d = file->f_path.dentry;
-   int srcu_idx;
loff_t r;
 
-   r = debugfs_use_file_start(d, &srcu_idx);
-   if (likely(!r))
-   r = seq_lseek(file, offset, whence);
-   debugfs_use_file_finish(srcu_idx);
+   r = debugfs_file_get(d);
+   if (unlikely(r))
+   return r;
+   r = seq_lseek(file, offset, whence);
+   debugfs_file_put(d);
return r;
 }
 
-- 
2.13.6



[PATCH v3 6/8] debugfs: purge obsolete SRCU based removal protection

2017-10-30 Thread Nicolai Stange
Purge the SRCU based file removal race protection in favour of the new,
refcount based debugfs_file_get()/debugfs_file_put() API.

Fixes: 49d200deaa68 ("debugfs: prevent access to removed files' private
  data")
Signed-off-by: Nicolai Stange 
---
 fs/debugfs/file.c   | 48 
 fs/debugfs/inode.c  |  7 ---
 include/linux/debugfs.h | 19 ---
 lib/Kconfig.debug   |  1 -
 4 files changed, 75 deletions(-)

diff --git a/fs/debugfs/file.c b/fs/debugfs/file.c
index d3a972b45ff0..53f5c9a2af88 100644
--- a/fs/debugfs/file.c
+++ b/fs/debugfs/file.c
@@ -22,7 +22,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 
 #include "internal.h"
@@ -48,53 +47,6 @@ const struct file_operations debugfs_noop_file_operations = {
.llseek =   noop_llseek,
 };
 
-/**
- * debugfs_use_file_start - mark the beginning of file data access
- * @dentry: the dentry object whose data is being accessed.
- * @srcu_idx: a pointer to some memory to store a SRCU index in.
- *
- * Up to a matching call to debugfs_use_file_finish(), any
- * successive call into the file removing functions debugfs_remove()
- * and debugfs_remove_recursive() will block. Since associated private
- * file data may only get freed after a successful return of any of
- * the removal functions, you may safely access it after a successful
- * call to debugfs_use_file_start() without worrying about
- * lifetime issues.
- *
- * If -%EIO is returned, the file has already been removed and thus,
- * it is not safe to access any of its data. If, on the other hand,
- * it is allowed to access the file data, zero is returned.
- *
- * Regardless of the return code, any call to
- * debugfs_use_file_start() must be followed by a matching call
- * to debugfs_use_file_finish().
- */
-int debugfs_use_file_start(const struct dentry *dentry, int *srcu_idx)
-   __acquires(&debugfs_srcu)
-{
-   *srcu_idx = srcu_read_lock(&debugfs_srcu);
-   barrier();
-   if (d_unlinked(dentry))
-   return -EIO;
-   return 0;
-}
-EXPORT_SYMBOL_GPL(debugfs_use_file_start);
-
-/**
- * debugfs_use_file_finish - mark the end of file data access
- * @srcu_idx: the SRCU index "created" by a former call to
- *debugfs_use_file_start().
- *
- * Allow any ongoing concurrent call into debugfs_remove() or
- * debugfs_remove_recursive() blocked by a former call to
- * debugfs_use_file_start() to proceed and return to its caller.
- */
-void debugfs_use_file_finish(int srcu_idx) __releases(&debugfs_srcu)
-{
-   srcu_read_unlock(&debugfs_srcu, srcu_idx);
-}
-EXPORT_SYMBOL_GPL(debugfs_use_file_finish);
-
 #define F_DENTRY(filp) ((filp)->f_path.dentry)
 
 const struct file_operations *debugfs_real_fops(const struct file *filp)
diff --git a/fs/debugfs/inode.c b/fs/debugfs/inode.c
index 6449eb935540..f587aded46b5 100644
--- a/fs/debugfs/inode.c
+++ b/fs/debugfs/inode.c
@@ -27,14 +27,11 @@
 #include 
 #include 
 #include 
-#include 
 
 #include "internal.h"
 
 #define DEBUGFS_DEFAULT_MODE   0700
 
-DEFINE_SRCU(debugfs_srcu);
-
 static struct vfsmount *debugfs_mount;
 static int debugfs_mount_count;
 static bool debugfs_registered;
@@ -693,8 +690,6 @@ void debugfs_remove(struct dentry *dentry)
inode_unlock(d_inode(parent));
if (!ret)
simple_release_fs(&debugfs_mount, &debugfs_mount_count);
-
-   synchronize_srcu(&debugfs_srcu);
 }
 EXPORT_SYMBOL_GPL(debugfs_remove);
 
@@ -768,8 +763,6 @@ void debugfs_remove_recursive(struct dentry *dentry)
if (!__debugfs_remove(child, parent))
simple_release_fs(&debugfs_mount, &debugfs_mount_count);
inode_unlock(d_inode(parent));
-
-   synchronize_srcu(&debugfs_srcu);
 }
 EXPORT_SYMBOL_GPL(debugfs_remove_recursive);
 
diff --git a/include/linux/debugfs.h b/include/linux/debugfs.h
index c5eda259b9d6..9f821a43bb0d 100644
--- a/include/linux/debugfs.h
+++ b/include/linux/debugfs.h
@@ -23,7 +23,6 @@
 
 struct device;
 struct file_operations;
-struct srcu_struct;
 
 struct debugfs_blob_wrapper {
void *data;
@@ -43,8 +42,6 @@ struct debugfs_regset32 {
 
 extern struct dentry *arch_debugfs_dir;
 
-extern struct srcu_struct debugfs_srcu;
-
 #define DEFINE_DEBUGFS_ATTRIBUTE(__fops, __get, __set, __fmt)  \
 static int __fops ## _open(struct inode *inode, struct file *file) \
 {  \
@@ -90,11 +87,6 @@ struct dentry *debugfs_create_automount(const char *name,
 void debugfs_remove(struct dentry *dentry);
 void debugfs_remove_recursive(struct dentry *dentry);
 
-int debugfs_use_file_start(const struct dentry *dentry, int *srcu_idx)
-   __acquires(&debugfs_srcu);
-
-void debugfs_use_file_finish(int srcu_idx) __releases(&debugfs_srcu);
-
 const struct file_operations *debugfs_real_fops(const struct file *filp);
 
 int debugfs_file_get(struct dentry *dentry);
@@ -227,17 +219,6 @@ static inline void debugfs_remove(struct dentry *dentry)
 

[PATCH v3 1/8] debugfs: add support for more elaborate ->d_fsdata

2017-10-30 Thread Nicolai Stange
Currently, the user provided fops, "real_fops", are stored directly into
->d_fsdata.

In order to be able to store more per-file state and thus prepare for more
granular file removal protection, wrap the real_fops into a dynamically
allocated container struct, debugfs_fsdata.

A struct debugfs_fsdata gets allocated at file creation and freed from the
newly introduced ->d_release().

Finally, move the implementation of debugfs_real_fops() out of the public
debugfs header such that struct debugfs_fsdata's declaration can be kept
private.

Signed-off-by: Nicolai Stange <nicsta...@gmail.com>
---
 fs/debugfs/file.c   | 12 
 fs/debugfs/inode.c  | 22 +++---
 fs/debugfs/internal.h   |  4 
 include/linux/debugfs.h | 20 +++-
 4 files changed, 38 insertions(+), 20 deletions(-)

diff --git a/fs/debugfs/file.c b/fs/debugfs/file.c
index 6dabc4a10396..b6f5ddab66bf 100644
--- a/fs/debugfs/file.c
+++ b/fs/debugfs/file.c
@@ -97,6 +97,18 @@ EXPORT_SYMBOL_GPL(debugfs_use_file_finish);
 
 #define F_DENTRY(filp) ((filp)->f_path.dentry)
 
+const struct file_operations *debugfs_real_fops(const struct file *filp)
+   __must_hold(&debugfs_srcu)
+{
+   struct debugfs_fsdata *fsd = F_DENTRY(filp)->d_fsdata;
+   /*
+* Neither the pointer to the struct file_operations, nor its
+* contents ever change -- srcu_dereference() is not needed here.
+*/
+   return fsd->real_fops;
+}
+EXPORT_SYMBOL_GPL(debugfs_real_fops);
+
 static int open_proxy_open(struct inode *inode, struct file *filp)
 {
const struct dentry *dentry = F_DENTRY(filp);
diff --git a/fs/debugfs/inode.c b/fs/debugfs/inode.c
index c59f015f386e..a9c3d3e9af39 100644
--- a/fs/debugfs/inode.c
+++ b/fs/debugfs/inode.c
@@ -185,6 +185,11 @@ static const struct super_operations 
debugfs_super_operations = {
.evict_inode= debugfs_evict_inode,
 };
 
+static void debugfs_release_dentry(struct dentry *dentry)
+{
+   kfree(dentry->d_fsdata);
+}
+
 static struct vfsmount *debugfs_automount(struct path *path)
 {
debugfs_automount_t f;
@@ -194,6 +199,7 @@ static struct vfsmount *debugfs_automount(struct path *path)
 
 static const struct dentry_operations debugfs_dops = {
.d_delete = always_delete_dentry,
+   .d_release = debugfs_release_dentry,
.d_automount = debugfs_automount,
 };
 
@@ -341,24 +347,34 @@ static struct dentry *__debugfs_create_file(const char 
*name, umode_t mode,
 {
struct dentry *dentry;
struct inode *inode;
+   struct debugfs_fsdata *fsd;
+
+   fsd = kmalloc(sizeof(*fsd), GFP_KERNEL);
+   if (!fsd)
+   return NULL;
 
if (!(mode & S_IFMT))
mode |= S_IFREG;
BUG_ON(!S_ISREG(mode));
dentry = start_creating(name, parent);
 
-   if (IS_ERR(dentry))
+   if (IS_ERR(dentry)) {
+   kfree(fsd);
return NULL;
+   }
 
inode = debugfs_get_inode(dentry->d_sb);
-   if (unlikely(!inode))
+   if (unlikely(!inode)) {
+   kfree(fsd);
return failed_creating(dentry);
+   }
 
inode->i_mode = mode;
inode->i_private = data;
 
inode->i_fop = proxy_fops;
-   dentry->d_fsdata = (void *)real_fops;
+   fsd->real_fops = real_fops;
+   dentry->d_fsdata = fsd;
 
d_instantiate(dentry, inode);
fsnotify_create(d_inode(dentry->d_parent), dentry);
diff --git a/fs/debugfs/internal.h b/fs/debugfs/internal.h
index b3e8443a1f47..512601eed3ce 100644
--- a/fs/debugfs/internal.h
+++ b/fs/debugfs/internal.h
@@ -19,4 +19,8 @@ extern const struct file_operations 
debugfs_noop_file_operations;
 extern const struct file_operations debugfs_open_proxy_file_operations;
 extern const struct file_operations debugfs_full_proxy_file_operations;
 
+struct debugfs_fsdata {
+   const struct file_operations *real_fops;
+};
+
 #endif /* _DEBUGFS_INTERNAL_H_ */
diff --git a/include/linux/debugfs.h b/include/linux/debugfs.h
index b93efc8feecd..cbee5f4a02a3 100644
--- a/include/linux/debugfs.h
+++ b/include/linux/debugfs.h
@@ -45,23 +45,6 @@ extern struct dentry *arch_debugfs_dir;
 
 extern struct srcu_struct debugfs_srcu;
 
-/**
- * debugfs_real_fops - getter for the real file operation
- * @filp: a pointer to a struct file
- *
- * Must only be called under the protection established by
- * debugfs_use_file_start().
- */
-static inline const struct file_operations *debugfs_real_fops(const struct 
file *filp)
-   __must_hold(&debugfs_srcu)
-{
-   /*
-* Neither the pointer to the struct file_operations, nor its
-* contents ever change -- srcu_dereference() is not needed here.
-*/
-   return filp->f_path.dentry->d_fsdata;
-}
-
 #define DEFINE_DEBUGFS_ATTRIBUTE(__fops, __get, __set, __fmt)  \
 static int __fops ## _open(struct inode *inode, struct file *file) \
 { 

[PATCH v3 2/8] debugfs: implement per-file removal protection

2017-10-30 Thread Nicolai Stange
Since commit 49d200deaa68 ("debugfs: prevent access to removed files'
private data"), accesses to a file's private data are protected from
concurrent removal by covering all file_operations with a SRCU read section
and sychronizing with those before returning from debugfs_remove() by means
of synchronize_srcu().

As pointed out by Johannes Berg, there are debugfs files with forever
blocking file_operations. Their corresponding SRCU read side sections would
block any debugfs_remove() forever as well, even unrelated ones. This
results in a livelock. Because a remover can't cancel any indefinite
blocking within foreign files, this is a problem.

Resolve this by introducing support for more granular protection on a
per-file basis.

This is implemented by introducing an  'active_users' refcount_t to the
per-file struct debugfs_fsdata state. At file creation time, it is set to
one and a debugfs_remove() will drop that initial reference. The new
debugfs_file_get() and debugfs_file_put(), intended to be used in place of
former debugfs_use_file_start() and debugfs_use_file_finish(), increment
and decrement it respectively. Once the count drops to zero,
debugfs_file_put() will signal a completion which is possibly being waited
for from debugfs_remove().
Thus, as long as there is a debugfs_file_get() not yet matched by a
corresponding debugfs_file_put() around, debugfs_remove() will block.

Actual users of debugfs_use_file_start() and -finish() will get converted
to the new debugfs_file_get() and debugfs_file_put() by followup patches.
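
For illustration only (not part of this patch, names made up), a raw
debugfs file_operation converted to the new API would look roughly like
this:

  #include <linux/debugfs.h>
  #include <linux/fs.h>

  static char demo_msg[] = "hello from debugfs\n";

  static ssize_t demo_read(struct file *file, char __user *buf,
                           size_t count, loff_t *ppos)
  {
          struct dentry *dentry = file->f_path.dentry;
          ssize_t ret;

          ret = debugfs_file_get(dentry);     /* pin against removal */
          if (unlikely(ret))
                  return ret;                 /* already gone: -EIO  */

          ret = simple_read_from_buffer(buf, count, ppos,
                                        demo_msg, sizeof(demo_msg));

          debugfs_file_put(dentry);           /* unblock debugfs_remove() */
          return ret;
  }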

Fixes: 49d200deaa68 ("debugfs: prevent access to removed files' private
  data")
Reported-by: Johannes Berg <johan...@sipsolutions.net>
Signed-off-by: Nicolai Stange <nicsta...@gmail.com>
---
 fs/debugfs/file.c   | 48 
 fs/debugfs/inode.c  | 29 +++--
 fs/debugfs/internal.h   |  2 ++
 include/linux/debugfs.h | 11 +++
 4 files changed, 84 insertions(+), 6 deletions(-)

diff --git a/fs/debugfs/file.c b/fs/debugfs/file.c
index b6f5ddab66bf..6644bfdea2f8 100644
--- a/fs/debugfs/file.c
+++ b/fs/debugfs/file.c
@@ -109,6 +109,54 @@ const struct file_operations *debugfs_real_fops(const 
struct file *filp)
 }
 EXPORT_SYMBOL_GPL(debugfs_real_fops);
 
+/**
+ * debugfs_file_get - mark the beginning of file data access
+ * @dentry: the dentry object whose data is being accessed.
+ *
+ * Up to a matching call to debugfs_file_put(), any successive call
+ * into the file removing functions debugfs_remove() and
+ * debugfs_remove_recursive() will block. Since associated private
+ * file data may only get freed after a successful return of any of
+ * the removal functions, you may safely access it after a successful
+ * call to debugfs_file_get() without worrying about lifetime issues.
+ *
+ * If -%EIO is returned, the file has already been removed and thus,
+ * it is not safe to access any of its data. If, on the other hand,
+ * it is allowed to access the file data, zero is returned.
+ */
+int debugfs_file_get(struct dentry *dentry)
+{
+   struct debugfs_fsdata *fsd = dentry->d_fsdata;
+
+   /* Avoid starvation of removers. */
+   if (d_unlinked(dentry))
+   return -EIO;
+
+   if (!refcount_inc_not_zero(&fsd->active_users))
+   return -EIO;
+
+   return 0;
+}
+EXPORT_SYMBOL_GPL(debugfs_file_get);
+
+/**
+ * debugfs_file_put - mark the end of file data access
+ * @dentry: the dentry object formerly passed to
+ *  debugfs_file_get().
+ *
+ * Allow any ongoing concurrent call into debugfs_remove() or
+ * debugfs_remove_recursive() blocked by a former call to
+ * debugfs_file_get() to proceed and return to its caller.
+ */
+void debugfs_file_put(struct dentry *dentry)
+{
+   struct debugfs_fsdata *fsd = dentry->d_fsdata;
+
+   if (refcount_dec_and_test(&fsd->active_users))
+   complete(&fsd->active_users_drained);
+}
+EXPORT_SYMBOL_GPL(debugfs_file_put);
+
 static int open_proxy_open(struct inode *inode, struct file *filp)
 {
const struct dentry *dentry = F_DENTRY(filp);
diff --git a/fs/debugfs/inode.c b/fs/debugfs/inode.c
index a9c3d3e9af39..6449eb935540 100644
--- a/fs/debugfs/inode.c
+++ b/fs/debugfs/inode.c
@@ -374,6 +374,7 @@ static struct dentry *__debugfs_create_file(const char 
*name, umode_t mode,
 
inode->i_fop = proxy_fops;
fsd->real_fops = real_fops;
+   refcount_set(&fsd->active_users, 1);
dentry->d_fsdata = fsd;
 
d_instantiate(dentry, inode);
@@ -631,18 +632,34 @@ struct dentry *debugfs_create_symlink(const char *name, 
struct dentry *parent,
 }
 EXPORT_SYMBOL_GPL(debugfs_create_symlink);
 
+static void __debugfs_remove_file(struct dentry *dentry, struct dentry *parent)
+{
+   struct debugfs_fsdata *fsd;
+
+   simple_unlink(d_inode(parent), dentry);
+   d_delete(dentry);
+   fsd = dent

[PATCH v3 7/8] debugfs: call debugfs_real_fops() only after debugfs_file_get()

2017-10-30 Thread Nicolai Stange
The current implementation of debugfs_real_fops() relies on a
debugfs_fsdata instance to be installed at ->d_fsdata.

With future patches introducing lazy allocation of these, this requirement
will be guaranteed to be fulfilled only in between a
debugfs_file_get()/debugfs_file_put() pair.

The full proxies' fops implemented by debugfs happen to be the only
offenders. Fix them up by moving their debugfs_real_fops() calls past those
to debugfs_file_get().

full_proxy_release() is special as it doesn't invoke debugfs_file_get() at
all. Leave it alone for now.

Signed-off-by: Nicolai Stange <nicsta...@gmail.com>
---
 fs/debugfs/file.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/fs/debugfs/file.c b/fs/debugfs/file.c
index 53f5c9a2af88..bc3549c95574 100644
--- a/fs/debugfs/file.c
+++ b/fs/debugfs/file.c
@@ -144,13 +144,13 @@ const struct file_operations 
debugfs_open_proxy_file_operations = {
 static ret_type full_proxy_ ## name(proto) \
 {  \
struct dentry *dentry = F_DENTRY(filp); \
-   const struct file_operations *real_fops =   \
-   debugfs_real_fops(filp);\
+   const struct file_operations *real_fops;\
ret_type r; \
\
r = debugfs_file_get(dentry);   \
if (unlikely(r))\
return r;   \
+   real_fops = debugfs_real_fops(filp);\
r = real_fops->name(args);  \
debugfs_file_put(dentry);   \
return r;   \
@@ -177,13 +177,14 @@ FULL_PROXY_FUNC(unlocked_ioctl, long, filp,
 static unsigned int full_proxy_poll(struct file *filp,
struct poll_table_struct *wait)
 {
-   const struct file_operations *real_fops = debugfs_real_fops(filp);
struct dentry *dentry = F_DENTRY(filp);
unsigned int r = 0;
+   const struct file_operations *real_fops;
 
if (debugfs_file_get(dentry))
return POLLHUP;
 
+   real_fops = debugfs_real_fops(filp);
r = real_fops->poll(filp, wait);
debugfs_file_put(dentry);
return r;
-- 
2.13.6
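
For illustration, a method generated by the FULL_PROXY_FUNC() macro now
expands to roughly the following (hand-expanded sketch, not a verbatim
copy of the generated code):

static ssize_t full_proxy_read(struct file *filp, char __user *buf,
                               size_t count, loff_t *ppos)
{
        struct dentry *dentry = F_DENTRY(filp);
        const struct file_operations *real_fops;
        ssize_t r;

        r = debugfs_file_get(dentry);
        if (unlikely(r))
                return r;
        /* ->d_fsdata is guaranteed to be valid only from here on. */
        real_fops = debugfs_real_fops(filp);
        r = real_fops->read(filp, buf, count, ppos);
        debugfs_file_put(dentry);
        return r;
}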



[PATCH v3 2/8] debugfs: implement per-file removal protection

2017-10-30 Thread Nicolai Stange
Since commit 49d200deaa68 ("debugfs: prevent access to removed files'
private data"), accesses to a file's private data are protected from
concurrent removal by covering all file_operations with an SRCU read section
and synchronizing with those before returning from debugfs_remove() by means
of synchronize_srcu().

As pointed out by Johannes Berg, there are debugfs files with forever
blocking file_operations. Their corresponding SRCU read side sections would
block any debugfs_remove() forever as well, even unrelated ones. This
results in a livelock. Because a remover can't cancel any indefinite
blocking within foreign files, this is a problem.

Resolve this by introducing support for more granular protection on a
per-file basis.

This is implemented by introducing an 'active_users' refcount_t to the
per-file struct debugfs_fsdata state. At file creation time, it is set to
one and a debugfs_remove() will drop that initial reference. The new
debugfs_file_get() and debugfs_file_put(), intended to be used in place of
former debugfs_use_file_start() and debugfs_use_file_finish(), increment
and decrement it respectively. Once the count drops to zero,
debugfs_file_put() will signal a completion which is possibly being waited
for by debugfs_remove().
Thus, as long as there is a debugfs_file_get() not yet matched by a
corresponding debugfs_file_put() around, debugfs_remove() will block.

Actual users of debugfs_use_file_start() and -finish() will get converted
to the new debugfs_file_get() and debugfs_file_put() by followup patches.

Fixes: 49d200deaa68 ("debugfs: prevent access to removed files' private data")
Reported-by: Johannes Berg 
Signed-off-by: Nicolai Stange 
---
 fs/debugfs/file.c   | 48 
 fs/debugfs/inode.c  | 29 +++--
 fs/debugfs/internal.h   |  2 ++
 include/linux/debugfs.h | 11 +++
 4 files changed, 84 insertions(+), 6 deletions(-)

diff --git a/fs/debugfs/file.c b/fs/debugfs/file.c
index b6f5ddab66bf..6644bfdea2f8 100644
--- a/fs/debugfs/file.c
+++ b/fs/debugfs/file.c
@@ -109,6 +109,54 @@ const struct file_operations *debugfs_real_fops(const struct file *filp)
 }
 EXPORT_SYMBOL_GPL(debugfs_real_fops);
 
+/**
+ * debugfs_file_get - mark the beginning of file data access
+ * @dentry: the dentry object whose data is being accessed.
+ *
+ * Up to a matching call to debugfs_file_put(), any successive call
+ * into the file removing functions debugfs_remove() and
+ * debugfs_remove_recursive() will block. Since associated private
+ * file data may only get freed after a successful return of any of
+ * the removal functions, you may safely access it after a successful
+ * call to debugfs_file_get() without worrying about lifetime issues.
+ *
+ * If -%EIO is returned, the file has already been removed and thus,
+ * it is not safe to access any of its data. If, on the other hand,
+ * it is allowed to access the file data, zero is returned.
+ */
+int debugfs_file_get(struct dentry *dentry)
+{
+   struct debugfs_fsdata *fsd = dentry->d_fsdata;
+
+   /* Avoid starvation of removers. */
+   if (d_unlinked(dentry))
+   return -EIO;
+
+   if (!refcount_inc_not_zero(&fsd->active_users))
+   return -EIO;
+
+   return 0;
+}
+EXPORT_SYMBOL_GPL(debugfs_file_get);
+
+/**
+ * debugfs_file_put - mark the end of file data access
+ * @dentry: the dentry object formerly passed to
+ *  debugfs_file_get().
+ *
+ * Allow any ongoing concurrent call into debugfs_remove() or
+ * debugfs_remove_recursive() blocked by a former call to
+ * debugfs_file_get() to proceed and return to its caller.
+ */
+void debugfs_file_put(struct dentry *dentry)
+{
+   struct debugfs_fsdata *fsd = dentry->d_fsdata;
+
+   if (refcount_dec_and_test(&fsd->active_users))
+   complete(&fsd->active_users_drained);
+}
+EXPORT_SYMBOL_GPL(debugfs_file_put);
+
 static int open_proxy_open(struct inode *inode, struct file *filp)
 {
const struct dentry *dentry = F_DENTRY(filp);
diff --git a/fs/debugfs/inode.c b/fs/debugfs/inode.c
index a9c3d3e9af39..6449eb935540 100644
--- a/fs/debugfs/inode.c
+++ b/fs/debugfs/inode.c
@@ -374,6 +374,7 @@ static struct dentry *__debugfs_create_file(const char *name, umode_t mode,
 
inode->i_fop = proxy_fops;
fsd->real_fops = real_fops;
+   refcount_set(&fsd->active_users, 1);
dentry->d_fsdata = fsd;
 
d_instantiate(dentry, inode);
@@ -631,18 +632,34 @@ struct dentry *debugfs_create_symlink(const char *name, struct dentry *parent,
 }
 EXPORT_SYMBOL_GPL(debugfs_create_symlink);
 
+static void __debugfs_remove_file(struct dentry *dentry, struct dentry *parent)
+{
+   struct debugfs_fsdata *fsd;
+
+   simple_unlink(d_inode(parent), dentry);
+   d_delete(dentry);
+   fsd = dentry->d_fsdata;
+   init_completion(&fsd->active_users_drain
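
As a usage illustration only (not part of this patch; the file's private
data and its fields are made up), a converted handler would bracket its
accesses roughly like this:

static ssize_t my_debugfs_read(struct file *filp, char __user *ubuf,
                               size_t count, loff_t *ppos)
{
        struct my_state *st = filp->private_data;       /* hypothetical */
        struct dentry *dentry = filp->f_path.dentry;
        ssize_t ret;

        ret = debugfs_file_get(dentry);
        if (ret)
                return ret;     /* -EIO, the file has been removed */

        /*
         * st can't get freed by a concurrent debugfs_remove() before
         * the matching debugfs_file_put() below.
         */
        ret = simple_read_from_buffer(ubuf, count, ppos, st->buf, st->len);

        debugfs_file_put(dentry);
        return ret;
}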

[PATCH v3 8/8] debugfs: defer debugfs_fsdata allocation to first usage

2017-10-30 Thread Nicolai Stange
Currently, __debugfs_create_file allocates one struct debugfs_fsdata
instance for every file created. However, there are potentially many
debugfs files around, most of which are never touched by userspace.

Thus, defer the allocations to the first usage, i.e. to the first
debugfs_file_get().

A dentry's ->d_fsdata starts out pointing to the "real", user provided
fops. After a debugfs_fsdata instance has been allocated (and the real
fops pointer has been moved over into its ->real_fops member),
->d_fsdata is changed to point to it from then on. The two cases are
distinguished by setting BIT(0) for the real fops case.

struct debugfs_fsdata's foremost purpose is to track active users and to
make debugfs_remove() block until they are done. Since no debugfs_fsdata
instance means no active users, make debugfs_remove() return immediately
in this case.

Take care of possible races between debugfs_file_get() and
debugfs_remove(): either debugfs_remove() must see a debugfs_fsdata
instance and thus wait for possible active users or debugfs_file_get() must
see a dead dentry and return immediately.

Make a dentry's ->d_release(), i.e. debugfs_release_dentry(), check whether
->d_fsdata is actually a debugfs_fsdata instance before kfree()ing it.

Similarly, make debugfs_real_fops() check whether ->d_fsdata is actually
a debugfs_fsdata instance before returning it, otherwise emit a warning.

The set of possible error codes returned from debugfs_file_get() has grown
from -EIO to -EIO and -ENOMEM. Make open_proxy_open() and full_proxy_open()
pass the -ENOMEM onwards to their callers.

Signed-off-by: Nicolai Stange <nicsta...@gmail.com>
---
 fs/debugfs/file.c | 55 ++-
 fs/debugfs/inode.c| 36 +
 fs/debugfs/internal.h |  8 
 3 files changed, 73 insertions(+), 26 deletions(-)

diff --git a/fs/debugfs/file.c b/fs/debugfs/file.c
index bc3549c95574..65872340e301 100644
--- a/fs/debugfs/file.c
+++ b/fs/debugfs/file.c
@@ -53,6 +53,15 @@ const struct file_operations *debugfs_real_fops(const struct file *filp)
 {
struct debugfs_fsdata *fsd = F_DENTRY(filp)->d_fsdata;
 
+   if ((unsigned long)fsd & DEBUGFS_FSDATA_IS_REAL_FOPS_BIT) {
+   /*
+* Urgh, we've been called w/o a protecting
+* debugfs_file_get().
+*/
+   WARN_ON(1);
+   return NULL;
+   }
+
return fsd->real_fops;
 }
 EXPORT_SYMBOL_GPL(debugfs_real_fops);
@@ -74,9 +83,35 @@ EXPORT_SYMBOL_GPL(debugfs_real_fops);
  */
 int debugfs_file_get(struct dentry *dentry)
 {
-   struct debugfs_fsdata *fsd = dentry->d_fsdata;
+   struct debugfs_fsdata *fsd;
+   void *d_fsd;
+
+   d_fsd = READ_ONCE(dentry->d_fsdata);
+   if (!((unsigned long)d_fsd & DEBUGFS_FSDATA_IS_REAL_FOPS_BIT)) {
+   fsd = d_fsd;
+   } else {
+   fsd = kmalloc(sizeof(*fsd), GFP_KERNEL);
+   if (!fsd)
+   return -ENOMEM;
+
+   fsd->real_fops = (void *)((unsigned long)d_fsd &
+   ~DEBUGFS_FSDATA_IS_REAL_FOPS_BIT);
+   refcount_set(&fsd->active_users, 1);
+   init_completion(&fsd->active_users_drained);
+   if (cmpxchg(&dentry->d_fsdata, d_fsd, fsd) != d_fsd) {
+   kfree(fsd);
+   fsd = READ_ONCE(dentry->d_fsdata);
+   }
+   }
 
-   /* Avoid starvation of removers. */
+   /*
+* In case of a successful cmpxchg() above, this check is
+* strictly necessary and must follow it, see the comment in
+* __debugfs_remove_file().
+* OTOH, if the cmpxchg() hasn't been executed or wasn't
+* successful, this serves the purpose of not starving
+* removers.
+*/
if (d_unlinked(dentry))
return -EIO;
 
@@ -98,7 +133,7 @@ EXPORT_SYMBOL_GPL(debugfs_file_get);
  */
 void debugfs_file_put(struct dentry *dentry)
 {
-   struct debugfs_fsdata *fsd = dentry->d_fsdata;
+   struct debugfs_fsdata *fsd = READ_ONCE(dentry->d_fsdata);
 
   if (refcount_dec_and_test(&fsd->active_users))
   complete(&fsd->active_users_drained);
@@ -109,10 +144,11 @@ static int open_proxy_open(struct inode *inode, struct 
file *filp)
 {
struct dentry *dentry = F_DENTRY(filp);
const struct file_operations *real_fops = NULL;
-   int r = 0;
+   int r;
 
-   if (debugfs_file_get(dentry))
-   return -ENOENT;
+   r = debugfs_file_get(dentry);
+   if (r)
+   return r == -EIO ? -ENOENT : r;
 
real_fops = debugfs_real_fops(filp);
real_fops = fops_get(real_fops);
@@ -233,10 +269,11 @@ static int full_proxy_open(struct inode *inode, struct file *filp)
struct dentry *dentry = F_DENTRY(filp);
const struct
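
The ->d_fsdata encoding relied upon above can be summarized as follows
(sketch only, the helper names are made up; the trick works because both
pointer kinds are at least word aligned, so BIT(0) is otherwise always
clear):

/* At file creation, ->d_fsdata holds the real fops with BIT(0) set. */
static bool fsdata_is_real_fops(void *d_fsd)
{
        return (unsigned long)d_fsd & DEBUGFS_FSDATA_IS_REAL_FOPS_BIT;
}

static const struct file_operations *fsdata_real_fops(void *d_fsd)
{
        return (void *)((unsigned long)d_fsd &
                        ~DEBUGFS_FSDATA_IS_REAL_FOPS_BIT);
}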

[PATCH v3 0/8] debugfs: per-file removal protection

2017-10-30 Thread Nicolai Stange
Hi,

this is v3 of the per-file removal protection with the main change to v2
being that it's sent in non-RFC mode now.

For this, I dropped the questionable former [9/9] ("debugfs: free
debugfs_fsdata instances") for now. Perhaps I'll resend it later on
its own as a RFC again.

The problem this series attempts to address is that any indefinitely
blocking fops blocks _unrelated_ debugfs_remove()ers. This will be resolved
by introducing a per-file removal protection mechanism in place of the
former global debugfs_srcu shared among all debugfs files.

This problem was first spotted and debugged by Johannes Berg [1]
and Tyler Hall is now facing the same issue again [2].

There's one patch to non-debugfs code,
[5/8] ("IB/hfi1: convert to debugfs_file_get() and -put()"),
which should be taken together with the rest of the series because
it depends on prior patches and later patches depend on it.

Applies to next-20171018.

I reviewed the patches once again and did an allmodconfig as well as an 
allnoconfig build. I did more thorough testing on v2, including whether
the removal protection still works.

Thanks,

Nicolai


Changes to v2:
  [9/9] ("debugfs: free debugfs_fsdata instances")
   - dropped for now.

Changes to v1:
  [2/9] ("debugfs: implement per-file removal protection")
  - In an attempt to resolve the issue reported by the kernel test robot
for v1, restrict the "extended removal logic" to regular files in
__debugfs_remove().

  [8/9] ("debugfs: defer debugfs_fsdata allocation to first usage")
  - Following review from Johannes Berg, replace the WARN_ON in
debugfs_real_fops() by a WARN + 'return NULL'. The return NULL is
expected to crash current soon and serves as an alternative for a
BUG_ON here.
  - Mention the change in debugfs_real_fops() in the commit message.

  [9/9] ("debugfs: free debugfs_fsdata instances")
  - Following advice from Paul E. McKenney, make debugfs_file_get()
release the RCU read section in between retry loop iterations.
  - Fix a race in debugfs_file_get()'s path handling a concurrent
debugfs_file_put(): the former must not "help out resetting ->d_fsdata"
because this can wipe out another debugfs_file_get()'s achievements.


[1] http://lkml.kernel.org/r/1490280886.2766.4.ca...@sipsolutions.net
[2] 
https://lkml.kernel.org/r/CAOjnSCYGprej+vEEsSXwr=wo+ewle2d6shqytpp-dfpq3pm...@mail.gmail.com

Nicolai Stange (8):
  debugfs: add support for more elaborate ->d_fsdata
  debugfs: implement per-file removal protection
  debugfs: debugfs_real_fops(): drop __must_hold sparse annotation
  debugfs: convert to debugfs_file_get() and -put()
  IB/hfi1: convert to debugfs_file_get() and -put()
  debugfs: purge obsolete SRCU based removal protection
  debugfs: call debugfs_real_fops() only after debugfs_file_get()
  debugfs: defer debugfs_fsdata allocation to first usage

 drivers/infiniband/hw/hfi1/debugfs.c |  20 ++--
 fs/debugfs/file.c| 210 +--
 fs/debugfs/inode.c   |  56 +++---
 fs/debugfs/internal.h|  14 +++
 include/linux/debugfs.h  |  33 +-
 lib/Kconfig.debug|   1 -
 6 files changed, 196 insertions(+), 138 deletions(-)

-- 
2.13.6
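
For comparison, the SRCU based protection this series replaces worked
roughly as sketched below (simplified, not the verbatim pre-series code):

DEFINE_STATIC_SRCU(debugfs_srcu);       /* one domain for all of debugfs */

static int old_style_file_start(struct dentry *dentry, int *srcu_idx)
{
        *srcu_idx = srcu_read_lock(&debugfs_srcu);
        if (d_unlinked(dentry)) {
                srcu_read_unlock(&debugfs_srcu, *srcu_idx);
                return -EIO;
        }
        return 0;
}

static void old_style_remove_sync(void)
{
        /*
         * Waits for readers in *all* debugfs files: a single fops
         * blocking indefinitely stalls every unrelated removal.
         */
        synchronize_srcu(&debugfs_srcu);
}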



Re: [PATCH 2/3] PCI: don't use snprintf() in driver_override_show()

2017-09-26 Thread Nicolai Stange
Bjorn Helgaas <helg...@kernel.org> writes:

> On Mon, Sep 11, 2017 at 04:55:11AM -0700, Greg Kroah-Hartman wrote:
>> On Mon, Sep 11, 2017 at 09:45:41AM +0200, Nicolai Stange wrote:
>> > Quote from Documentation/filesystems/sysfs.txt:
>> > 
>> >   show() must not use snprintf() when formatting the value to be
>> >   returned to user space. If you can guarantee that an overflow
>> >   will never happen you can use sprintf() otherwise you must use
>> >   scnprintf().
>> > 
>> > Commit 4efe874aace5 ("PCI: Don't read past the end of sysfs
>> > "driver_override" buffer") introduced such a snprintf() usage from
>> > driver_override_show() while at the same time tweaking
>> > driver_override_store() such that the write buffer can't ever get
>> > overflowed.
>> > 
>> > Reasoning:
>> > Since aforementioned commit, driver_override_store() only accepts to be
>> > written buffers less than PAGE_SIZE - 1 in size.
>> > 
>> > The then kstrndup()'ed driver_override string will be at most PAGE_SIZE - 1
>> > in length, including the trailing '\0'.
>> > 
>> > After the addition of a '\n' in driver_override_show(), the result won't
>> > exceed PAGE_SIZE characters in length, again including the trailing '\0'.
>> > 
>> > Hence, snprintf(buf, PAGE_SIZE, ...) and sprintf(buf, ...) are equivalent
>> > at this point.
>> > 
>> > Replace the former by the latter in order to adhere to the rules in
>> > Documentation/filesystems/sysfs.txt.
>> > 
>> > This is a style fix only and there's no change in functionality.
>> > 
>> > Signed-off-by: Nicolai Stange <nsta...@suse.de>
>> > ---
>> >  drivers/pci/pci-sysfs.c | 2 +-
>> >  1 file changed, 1 insertion(+), 1 deletion(-)
>> > 
>> > diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
>> > index 8e075ea2743e..43f7fbede448 100644
>> > --- a/drivers/pci/pci-sysfs.c
>> > +++ b/drivers/pci/pci-sysfs.c
>> > @@ -722,7 +722,7 @@ static ssize_t driver_override_show(struct device *dev,
>> >ssize_t len;
>> >  
>> >device_lock(dev);
>> > -  len = snprintf(buf, PAGE_SIZE, "%s\n", pdev->driver_override);
>> > +  len = sprintf(buf, "%s\n", pdev->driver_override);
>> 
>> While I'm all for changes like this, it's an uphill battle to change
>> them, usually it's best to just catch them before they go into the tree.

I did this patch mostly to demonstrate why the exclusion of the
sprintf() -> snprintf() change from the port of

  3d713e0e382e ("driver core: platform: add device binding path 
'driver_override'")

to platform in [3/3] is safe. Because I'm pretty sure that [3/3] would
have been rejected if it had included that chunk -- due to the
non-conformity to Documentation/.


>> Anyway, nice summary, very good job with that.
>> 
>> Acked-by: Greg Kroah-Hartman <gre...@linuxfoundation.org>
>
> Why use snprintf() instead of scnprintf()?  

s/snprintf/sprintf/, right?


> It looks like snprintf() is probably safe, but requires a bunch of
> analysis to prove that, while scnprintf() would be obviously safe.

Personal preference, I guess. Something like "using sprintf() would
implicitly document that the buffer is always large enough", i.e. that
the size check in driver_override_store() is already strict enough / correct.

Anyway, I don't have a strong opinion on that. So if you want me to
change this, I'll do, of course. Just note that Greg K-H. has queued up
[3/3] somewhere already and thus, it would probably require two patches
rather than only this one to make PCI and platform look the same again,
provided that's what is wanted.

So, given that this patch has fulfilled its purpose already, I'm fine
with either of
- dropping it
- taking it
- s/sprintf/scnprintf/ and do the same for platform.


Thanks,

Nicolai
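
For reference, the scnprintf() variant Bjorn suggests would look roughly
like this (sketch only; behaviourally identical here because the string is
known to fit into one page):

static ssize_t driver_override_show(struct device *dev,
                                    struct device_attribute *attr, char *buf)
{
        struct pci_dev *pdev = to_pci_dev(dev);
        ssize_t len;

        device_lock(dev);
        /* scnprintf() returns the number of characters actually written. */
        len = scnprintf(buf, PAGE_SIZE, "%s\n", pdev->driver_override);
        device_unlock(dev);
        return len;
}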


[PATCH 0/3] make PCI's and platform's driver_override_store()/show() converge

2017-09-11 Thread Nicolai Stange
Hi,

both drivers/pci/pci-sysfs.c and drivers/base/platform.c implement
a driver_override_store/show() pair which used to coincide between these
two subsystems (up to ->driver_override's storage location).

In the meanwhile, both have received a fix which the other subsystem
still lacks:
  commit 6265539776a0 ("driver core: platform: fix race condition with
driver_override")
and
  commit 4efe874aace5 ("PCI: Don't read past the end of sysfs
   "driver_override" buffer")

Port each missing fix to the other subsystem.

There's also the style fixup [2/3] which isn't strictly needed but
will make both driver_override_show() variants look the same again.

Applies to next-20170908.

Note that checkpatch complains about the references to the PCI commit
mentioned above, which I believe is a false positive caused by the
quotation marks in the title.

Thanks,

Nicolai

Nicolai Stange (3):
  PCI: fix race condition with driver_override
  PCI: don't use snprintf() in driver_override_show()
  driver core: platform: Don't read past the end of "driver_override"
buffer

 drivers/base/platform.c |  3 ++-
 drivers/pci/pci-sysfs.c | 11 +--
 2 files changed, 11 insertions(+), 3 deletions(-)

-- 
2.13.5
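
For context, the race-free store() pattern that both subsystems converge on
looks roughly like this (simplified sketch of the PCI variant with [1/3]
applied, not the verbatim patch):

static ssize_t driver_override_store(struct device *dev,
                                     struct device_attribute *attr,
                                     const char *buf, size_t count)
{
        struct pci_dev *pdev = to_pci_dev(dev);
        char *driver_override, *old, *cp;

        /* Cap writes so that show()'s result always fits into one page. */
        if (count >= (PAGE_SIZE - 1))
                return -EINVAL;

        driver_override = kstrndup(buf, count, GFP_KERNEL);
        if (!driver_override)
                return -ENOMEM;

        cp = strchr(driver_override, '\n');
        if (cp)
                *cp = '\0';

        device_lock(dev);
        old = pdev->driver_override;
        if (strlen(driver_override)) {
                pdev->driver_override = driver_override;
        } else {
                kfree(driver_override);
                pdev->driver_override = NULL;
        }
        device_unlock(dev);

        kfree(old);

        return count;
}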


