[PATCH] random: fix possible sleeping allocation from irq context

2018-04-23 Thread Theodore Ts'o
We can do a sleeping allocation from an irq context when CONFIG_NUMA
is enabled.  Fix this by initializing the NUMA crng instances in a
workqueue.
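
For illustration only (not part of the patch), the general shape of the
deferral pattern being used here: schedule_work() is safe to call from
hard-irq context, and the work handler then runs in process context,
where sleeping GFP_KERNEL allocations are allowed.  The names below are
made up for the sketch.

static void do_deferred_init(struct work_struct *work)
{
	void *buf = kmalloc(PAGE_SIZE, GFP_KERNEL);	/* may sleep -- fine here */

	if (!buf)
		return;
	/* ... one-time initialization using buf ... */
	kfree(buf);
}
static DECLARE_WORK(deferred_init_work, do_deferred_init);

static void init_from_irq_path(void)	/* may be reached from irq context */
{
	schedule_work(&deferred_init_work);
}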

Reported-by: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
Reported-by: syzbot+9de458f6a5e713ee8...@syzkaller.appspotmail.com
Fixes: 8ef35c866f8862df ("random: set up the NUMA crng instances...")
Cc: sta...@vger.kernel.org
Signed-off-by: Theodore Ts'o <ty...@mit.edu>
---
 drivers/char/random.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 3cd3aae24d6d..e182cca7e6cd 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -789,7 +789,7 @@ static void crng_initialize(struct crng_state *crng)
 }
 
 #ifdef CONFIG_NUMA
-static void numa_crng_init(void)
+static void do_numa_crng_init(struct work_struct *work)
 {
int i;
struct crng_state *crng;
@@ -810,6 +810,13 @@ static void numa_crng_init(void)
kfree(pool);
}
 }
+
+DECLARE_WORK(numa_crng_init_work, do_numa_crng_init);
+
+static void numa_crng_init(void)
+{
+   schedule_work(&numa_crng_init_work);
+}
 #else
 static void numa_crng_init(void) {}
 #endif
-- 
2.16.1.72.g5be1f00a9a



[PATCH 1/5] random: fix crng_ready() test

2018-04-12 Thread Theodore Ts'o
The crng_init variable has three states:

0: The CRNG is not initialized at all
1: The CRNG has a small amount of entropy, hopefully good enough for
   early-boot, non-cryptographical use cases
2: The CRNG is fully initialized and we are sure it is safe for
   cryptographic use cases.

The crng_ready() function should only return true once we are in the
last state.
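
In code terms, the convention this patch establishes (a sketch mirroring
the hunks below): the early-boot paths that only need to know "has
anything been mixed in yet?" test crng_init == 0 explicitly, while
crng_ready() is reserved for the fully-initialized state:

	if (unlikely(crng_init == 0)) {
		/* state 0: nothing mixed in yet -- crng_fast_load() may run */
	} else if (!crng_ready()) {
		/* state 1: early-boot entropy only -- not yet safe for crypto */
	} else {
		/* state 2: fully initialized -- crng_ready() now returns true */
	}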

Reported-by: Jann Horn <ja...@google.com>
Fixes: e192be9d9a30 ("random: replace non-blocking pool...")
Cc: sta...@kernel.org # 4.8+
Signed-off-by: Theodore Ts'o <ty...@mit.edu>
Reviewed-by: Jann Horn <ja...@google.com>
---
 drivers/char/random.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index e027e7fa1472..c8ec1e70abde 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -427,7 +427,7 @@ struct crng_state primary_crng = {
  * its value (from 0->1->2).
  */
 static int crng_init = 0;
-#define crng_ready() (likely(crng_init > 0))
+#define crng_ready() (likely(crng_init > 1))
 static int crng_init_cnt = 0;
 #define CRNG_INIT_CNT_THRESH (2*CHACHA20_KEY_SIZE)
 static void _extract_crng(struct crng_state *crng,
@@ -794,7 +794,7 @@ static int crng_fast_load(const char *cp, size_t len)
 
if (!spin_trylock_irqsave(&primary_crng.lock, flags))
return 0;
-   if (crng_ready()) {
+   if (crng_init != 0) {
spin_unlock_irqrestore(&primary_crng.lock, flags);
return 0;
}
@@ -856,7 +856,7 @@ static void _extract_crng(struct crng_state *crng,
 {
unsigned long v, flags;
 
-   if (crng_init > 1 &&
+   if (crng_ready() &&
time_after(jiffies, crng->init_time + CRNG_RESEED_INTERVAL))
crng_reseed(crng, crng == &primary_crng ? &input_pool : NULL);
spin_lock_irqsave(&crng->lock, flags);
@@ -1139,7 +1139,7 @@ void add_interrupt_randomness(int irq, int irq_flags)
fast_mix(fast_pool);
add_interrupt_bench(cycles);
 
-   if (!crng_ready()) {
+   if (unlikely(crng_init == 0)) {
if ((fast_pool->count >= 64) &&
crng_fast_load((char *) fast_pool->pool,
   sizeof(fast_pool->pool))) {
@@ -2212,7 +2212,7 @@ void add_hwgenerator_randomness(const char *buffer, size_t count,
 {
struct entropy_store *poolp = &input_pool;
 
-   if (!crng_ready()) {
+   if (unlikely(crng_init == 0)) {
crng_fast_load(buffer, count);
return;
}
-- 
2.16.1.72.g5be1f00a9a



[PATCH 2/5] random: use a different mixing algorithm for add_device_randomness()

2018-04-12 Thread Theodore Ts'o
add_device_randomness() use of crng_fast_load() was highly
problematic.  Some callers of add_device_randomness() can pass in a
large amount of static information.  This would immediately promote
the crng_init state from 0 to 1, without really doing much to
initialize the primary_crng's internal state with something even
vaguely unpredictable.

Since we don't have the speed constraints of add_interrupt_randomness(),
we can do a better job mixing in whatever unpredictability a device
driver or architecture maintainer might see fit to give us, and do it
in a way which does not bump the crng_init_cnt variable.

Also, since add_device_randomness() doesn't bump any entropy
accounting in crng_init state 0, mix the device randomness into the
input_pool entropy pool as well.
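
For illustration, a hypothetical caller (the driver and field names are
made up); add_device_randomness() takes device-specific data that may
well be fixed for the machine, which is exactly why it must not be
credited as entropy:

	static int example_probe(struct example_dev *dev)
	{
		/* Possibly static, machine-unique data: mix it in, credit nothing */
		add_device_randomness(dev->serial, sizeof(dev->serial));
		add_device_randomness(&dev->hw_revision, sizeof(dev->hw_revision));
		return 0;
	}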

Reported-by: Jann Horn <ja...@google.com>
Fixes: ee7998c50c26 ("random: do not ignore early device randomness")
Cc: sta...@kernel.org # 4.13+
Signed-off-by: Theodore Ts'o <ty...@mit.edu>
---
 drivers/char/random.c | 55 +++
 1 file changed, 51 insertions(+), 4 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index c8ec1e70abde..2154a5fe4c81 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -787,6 +787,10 @@ static void crng_initialize(struct crng_state *crng)
crng->init_time = jiffies - CRNG_RESEED_INTERVAL - 1;
 }
 
+/*
+ * crng_fast_load() can be called by code in the interrupt service
+ * path.  So we can't afford to dilly-dally.
+ */
 static int crng_fast_load(const char *cp, size_t len)
 {
unsigned long flags;
@@ -813,6 +817,51 @@ static int crng_fast_load(const char *cp, size_t len)
return 1;
 }
 
+/*
+ * crng_slow_load() is called by add_device_randomness, which has two
+ * attributes.  (1) We can't trust the buffer passed to it is
+ * guaranteed to be unpredictable (so it might not have any entropy at
+ * all), and (2) it doesn't have the performance constraints of
+ * crng_fast_load().
+ *
+ * So we do something more comprehensive which is guaranteed to touch
+ * all of the primary_crng's state, and which uses a LFSR with a
+ * period of 255 as part of the mixing algorithm.  Finally, we do
+ * *not* advance crng_init_cnt since the buffer we get may be something
+ * like a fixed DMI table (for example), which might very well be
+ * unique to the machine, but is otherwise unvarying.
+ */
+static int crng_slow_load(const char *cp, size_t len)
+{
+   unsigned long   flags;
+   static unsigned char    lfsr = 1;
+   unsigned char   tmp;
+   unsigned    i, max = CHACHA20_KEY_SIZE;
+   const char *src_buf = cp;
+   char *  dest_buf = (char *) &primary_crng.state[4];
+
+   if (!spin_trylock_irqsave(&primary_crng.lock, flags))
+   return 0;
+   if (crng_init != 0) {
+   spin_unlock_irqrestore(&primary_crng.lock, flags);
+   return 0;
+   }
+   if (len > max)
+   max = len;
+
+   for (i = 0; i < max ; i++) {
+   tmp = lfsr;
+   lfsr >>= 1;
+   if (tmp & 1)
+   lfsr ^= 0xE1;
+   tmp = dest_buf[i % CHACHA20_KEY_SIZE];
+   dest_buf[i % CHACHA20_KEY_SIZE] ^= src_buf[i % len] ^ lfsr;
+   lfsr += (tmp << 3) | (tmp >> 5);
+   }
+   spin_unlock_irqrestore(&primary_crng.lock, flags);
+   return 1;
+}
+
 static void crng_reseed(struct crng_state *crng, struct entropy_store *r)
 {
unsigned long   flags;
@@ -981,10 +1030,8 @@ void add_device_randomness(const void *buf, unsigned int size)
unsigned long time = random_get_entropy() ^ jiffies;
unsigned long flags;
 
-   if (!crng_ready()) {
-   crng_fast_load(buf, size);
-   return;
-   }
+   if (!crng_ready())
+   crng_slow_load(buf, size);
 
trace_add_device_randomness(size, _RET_IP_);
spin_lock_irqsave(&input_pool.lock, flags);
-- 
2.16.1.72.g5be1f00a9a



[PATCH 3/5] random: set up the NUMA crng instances after the CRNG is fully initialized

2018-04-12 Thread Theodore Ts'o
Until the primary_crng is fully initialized, don't initialize the NUMA
crng nodes.  Otherwise users of /dev/urandom on NUMA systems before
the CRNG is fully initialized can get very bad quality randomness.  Of
course everyone should move to getrandom(2) where this won't be an
issue, but there's a lot of legacy code out there.
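
For comparison, a minimal userspace sketch of the getrandom(2) approach
recommended above (assuming a libc that exposes it via <sys/random.h>);
by default getrandom() blocks until the CRNG is fully initialized, so a
caller can never observe the early, low-quality state:

	#include <stdio.h>
	#include <sys/random.h>

	int main(void)
	{
		unsigned char key[32];

		/* Blocks until the CRNG is initialized, unlike reading /dev/urandom */
		if (getrandom(key, sizeof(key), 0) < 0) {
			perror("getrandom");
			return 1;
		}
		return 0;
	}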

Reported-by: Jann Horn <ja...@google.com>
Fixes: 1e7f583af67b ("random: make /dev/urandom scalable for silly...")
Cc: sta...@kernel.org # 4.8+
Signed-off-by: Theodore Ts'o <ty...@mit.edu>
---
 drivers/char/random.c | 46 +++---
 1 file changed, 27 insertions(+), 19 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 2154a5fe4c81..681ee0c0de24 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -787,6 +787,32 @@ static void crng_initialize(struct crng_state *crng)
crng->init_time = jiffies - CRNG_RESEED_INTERVAL - 1;
 }
 
+#ifdef CONFIG_NUMA
+static void numa_crng_init(void)
+{
+   int i;
+   struct crng_state *crng;
+   struct crng_state **pool;
+
+   pool = kcalloc(nr_node_ids, sizeof(*pool), GFP_KERNEL|__GFP_NOFAIL);
+   for_each_online_node(i) {
+   crng = kmalloc_node(sizeof(struct crng_state),
+   GFP_KERNEL | __GFP_NOFAIL, i);
+   spin_lock_init(&crng->lock);
+   crng_initialize(crng);
+   pool[i] = crng;
+   }
+   mb();
+   if (cmpxchg(&crng_node_pool, NULL, pool)) {
+   for_each_node(i)
+   kfree(pool[i]);
+   kfree(pool);
+   }
+}
+#else
+static void numa_crng_init(void) {}
+#endif
+
 /*
  * crng_fast_load() can be called by code in the interrupt service
  * path.  So we can't afford to dilly-dally.
@@ -893,6 +919,7 @@ static void crng_reseed(struct crng_state *crng, struct entropy_store *r)
spin_unlock_irqrestore(&primary_crng.lock, flags);
if (crng == &primary_crng && crng_init < 2) {
invalidate_batched_entropy();
+   numa_crng_init();
crng_init = 2;
process_random_ready_list();
wake_up_interruptible(&crng_init_wait);
@@ -1727,28 +1754,9 @@ static void init_std_data(struct entropy_store *r)
  */
 static int rand_initialize(void)
 {
-#ifdef CONFIG_NUMA
-   int i;
-   struct crng_state *crng;
-   struct crng_state **pool;
-#endif
-
init_std_data(&input_pool);
init_std_data(&blocking_pool);
crng_initialize(&primary_crng);
-
-#ifdef CONFIG_NUMA
-   pool = kcalloc(nr_node_ids, sizeof(*pool), GFP_KERNEL|__GFP_NOFAIL);
-   for_each_online_node(i) {
-   crng = kmalloc_node(sizeof(struct crng_state),
-   GFP_KERNEL | __GFP_NOFAIL, i);
-   spin_lock_init(&crng->lock);
-   crng_initialize(crng);
-   pool[i] = crng;
-   }
-   mb();
-   crng_node_pool = pool;
-#endif
return 0;
 }
 early_initcall(rand_initialize);
-- 
2.16.1.72.g5be1f00a9a



[PATCH 5/5] random: add new ioctl RNDRESEEDCRNG

2018-04-12 Thread Theodore Ts'o
Add a new ioctl which forces the crng to be reseeded.
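
A minimal userspace sketch (not part of the patch) of how a privileged
process might use the new ioctl once the uapi change below is in place;
per the code below, it fails with EPERM without CAP_SYS_ADMIN and with
ENODATA before the CRNG is fully initialized:

	#include <stdio.h>
	#include <fcntl.h>
	#include <unistd.h>
	#include <sys/ioctl.h>
	#include <linux/random.h>	/* provides RNDRESEEDCRNG after this patch */

	int main(void)
	{
		int fd = open("/dev/urandom", O_RDONLY);

		if (fd < 0) {
			perror("open");
			return 1;
		}
		if (ioctl(fd, RNDRESEEDCRNG) < 0)	/* needs CAP_SYS_ADMIN */
			perror("RNDRESEEDCRNG");
		close(fd);
		return 0;
	}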

Signed-off-by: Theodore Ts'o <ty...@mit.edu>
Cc: sta...@kernel.org
---
 drivers/char/random.c   | 13 ++++++++++++-
 include/uapi/linux/random.h |  3 +++
 2 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 6e7fa13b1a89..c552431587a7 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -429,6 +429,7 @@ struct crng_state primary_crng = {
 static int crng_init = 0;
 #define crng_ready() (likely(crng_init > 1))
 static int crng_init_cnt = 0;
+static unsigned long crng_global_init_time = 0;
 #define CRNG_INIT_CNT_THRESH (2*CHACHA20_KEY_SIZE)
 static void _extract_crng(struct crng_state *crng,
  __u32 out[CHACHA20_BLOCK_WORDS]);
@@ -933,7 +934,8 @@ static void _extract_crng(struct crng_state *crng,
unsigned long v, flags;
 
if (crng_ready() &&
-   time_after(jiffies, crng->init_time + CRNG_RESEED_INTERVAL))
+   (time_after(crng_global_init_time, crng->init_time) ||
+time_after(jiffies, crng->init_time + CRNG_RESEED_INTERVAL)))
crng_reseed(crng, crng == &primary_crng ? &input_pool : NULL);
spin_lock_irqsave(&crng->lock, flags);
if (arch_get_random_long(&v))
@@ -1757,6 +1759,7 @@ static int rand_initialize(void)
init_std_data(&input_pool);
init_std_data(&blocking_pool);
crng_initialize(&primary_crng);
+   crng_global_init_time = jiffies;
return 0;
 }
 early_initcall(rand_initialize);
@@ -1930,6 +1933,14 @@ static long random_ioctl(struct file *f, unsigned int cmd, unsigned long arg)
input_pool.entropy_count = 0;
blocking_pool.entropy_count = 0;
return 0;
+   case RNDRESEEDCRNG:
+   if (!capable(CAP_SYS_ADMIN))
+   return -EPERM;
+   if (crng_init < 2)
+   return -ENODATA;
+   crng_reseed(&primary_crng, NULL);
+   crng_global_init_time = jiffies - 1;
+   return 0;
default:
return -EINVAL;
}
diff --git a/include/uapi/linux/random.h b/include/uapi/linux/random.h
index c34f4490d025..26ee91300e3e 100644
--- a/include/uapi/linux/random.h
+++ b/include/uapi/linux/random.h
@@ -35,6 +35,9 @@
 /* Clear the entropy pool and associated counters.  (Superuser only.) */
 #define RNDCLEARPOOL   _IO( 'R', 0x06 )
 
+/* Reseed CRNG.  (Superuser only.) */
+#define RNDRESEEDCRNG  _IO( 'R', 0x07 )
+
 struct rand_pool_info {
int entropy_count;
int buf_size;
-- 
2.16.1.72.g5be1f00a9a



[PATCH 4/5] random: crng_reseed() should lock the crng instance that it is modifying

2018-04-12 Thread Theodore Ts'o
Reported-by: Jann Horn <ja...@google.com>
Fixes: 1e7f583af67b ("random: make /dev/urandom scalable for silly...")
Cc: sta...@kernel.org # 4.8+
Signed-off-by: Theodore Ts'o <ty...@mit.edu>
Reviewed-by: Jann Horn <ja...@google.com>
---
 drivers/char/random.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 681ee0c0de24..6e7fa13b1a89 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -906,7 +906,7 @@ static void crng_reseed(struct crng_state *crng, struct entropy_store *r)
_crng_backtrack_protect(&primary_crng, buf.block,
CHACHA20_KEY_SIZE);
}
-   spin_lock_irqsave(&primary_crng.lock, flags);
+   spin_lock_irqsave(&crng->lock, flags);
for (i = 0; i < 8; i++) {
unsigned long   rv;
if (!arch_get_random_seed_long(&rv) &&
@@ -916,7 +916,7 @@ static void crng_reseed(struct crng_state *crng, struct entropy_store *r)
}
memzero_explicit(&buf, sizeof(buf));
crng->init_time = jiffies;
-   spin_unlock_irqrestore(&primary_crng.lock, flags);
+   spin_unlock_irqrestore(&crng->lock, flags);
if (crng == &primary_crng && crng_init < 2) {
invalidate_batched_entropy();
numa_crng_init();
-- 
2.16.1.72.g5be1f00a9a



Re: EXT4 Oops (Re: [PATCH V15 06/22] mmc: block: Add blk-mq support)

2018-03-01 Thread Theodore Ts'o
On Thu, Mar 01, 2018 at 01:15:24AM -0800, Jose R R wrote:
> Probably it is not wise to place all your eggs (data) in one basket
> (ext4) and diversify to viable alternatives which won't be affected by
> UNIX 2038 year date problem, likewise?
> < 
> https://metztli.it/blog/index.php/amatl8/reiser-nahui/reiser4-filesystem-and-the-unix
> >

All of the modern file systems (btrfs, ext4, f2fs, xfs, etc.) are fine
with respect to the 2038 problem.

- Ted


Re: EXT4 Oops (Re: [PATCH V15 06/22] mmc: block: Add blk-mq support)

2018-03-01 Thread Theodore Ts'o
On Thu, Mar 01, 2018 at 10:55:37AM +0200, Adrian Hunter wrote:
> On 27/02/18 11:28, Adrian Hunter wrote:
> > On 26/02/18 23:48, Dmitry Osipenko wrote:
> >> But still something is wrong... I've been getting occasional EXT4 Ooops's, 
> >> like
> >> the one below, and __wait_on_bit() is always figuring in the stacktrace. It
> >> never happened with blk-mq disabled, though it could be a coincidence and
> >> actually unrelated to blk-mq patches.
> > 
> >> [ 6625.992337] Unable to handle kernel NULL pointer dereference at virtual
> >> address 001c
> >> [ 6625.993004] pgd = 00b30c03
> >> [ 6625.993257] [001c] *pgd=
> >> [ 6625.993594] Internal error: Oops: 5 [#1] PREEMPT SMP ARM
> >> [ 6625.994022] Modules linked in:
> >> [ 6625.994326] CPU: 1 PID: 19355 Comm: dpkg Not tainted
> >> 4.16.0-rc2-next-20180220-00095-ge9c9f5689a84-dirty #2090
> >> [ 6625.995078] Hardware name: NVIDIA Tegra SoC (Flattened Device Tree)
> >> [ 6625.995595] PC is aht dx_probe+0x68/0x684
> >> [ 6625.995947] LR is at __wait_on_bit+0xac/0xc8

This doesn't seem to make sense; the PC is where we are currently
executing, and LR is the "Link Register" where the flow of control
will be returning after the current function returns, right?  Well,
dx_probe should *not* be returning to __wait_on_bit().  So this just
seems weird.

Ignoring the LR register, this stack trace looks sane...  I can't see
which pointer could be NULL and getting dereferenced, though.  How
easily can you reproduce the problem?  Can you either (a) translate
the PC into a line number, or better yet, if you can reproduce, add a
series of BUG_ON's so we can see what's going on?

+   BUG_ON(frame == NULL);
memset(frame_in, 0, EXT4_HTREE_LEVEL * sizeof(frame_in[0]));
frame->bh = ext4_read_dirblock(dir, 0, INDEX);
if (IS_ERR(frame->bh))
return (struct dx_frame *) frame->bh;

+   BUG_ON(frame->bh == NULL);
+   BUG_ON(frame->bh->b_data == NULL);
root = (struct dx_root *) frame->bh->b_data;
if (root->info.hash_version != DX_HASH_TEA &&
root->info.hash_version != DX_HASH_HALF_MD4 &&
root->info.hash_version != DX_HASH_LEGACY) {

These are "could never" happen scenarios from looking at the code, but
that will help explain what is going on.

If this is reliably only happening with mq, the only way I could see
that is if something is returning an error when it previously wasn't.
This isn't a problem we're seeing with any of our testing, though.

Cheers,

- Ted



Re: [PATCH resend] drivers/char/random.c: remove unused dont_count_entropy

2018-02-28 Thread Theodore Ts'o
On Thu, Mar 01, 2018 at 12:22:47AM +0100, Rasmus Villemoes wrote:
> Ever since "random: kill dead extract_state struct" [1], the
> dont_count_entropy member of struct timer_rand_state has been
> effectively unused. Since it hasn't found a new use in 12 years, it's
> probably safe to finally kill it.
> 
> [1] Pre-git, https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/?id=c1c48e61c251f57e7a3f1bf11b3c462b2de9dcb5
> 
> Signed-off-by: Rasmus Villemoes 
> ---
>  drivers/char/random.c | 53 
> ---

Thanks, applied.

- Ted


Re: [PATCH] random: Optimize add_interrupt_randomness

2018-02-28 Thread Theodore Ts'o
Thanks, applied.

- Ted

On Wed, Feb 28, 2018 at 01:43:28PM -0800, Andi Kleen wrote:
> From: Andi Kleen 
> 
> add_interrupt_randomness always wakes up
> code blocking on /dev/random. This wake up is done
> unconditionally. Unfortunately this means all interrupts
> take the wait queue spinlock, which can be rather expensive
> on large systems processing lots of interrupts.
> 
> We saw 1% cpu time spinning on this on a large macro workload
> running on a large system.
> 
> I believe it's a recent regression (?)
> 
> Always check if there is a waiter on the wait queue
> before waking up. This check can be done without
> taking a spinlock.
> 
> 1.06% 10460  [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
>  |
>  ---native_queued_spin_lock_slowpath
> |
>  --0.57%--_raw_spin_lock_irqsave
>|
> --0.56%--__wake_up_common_lock
>   credit_entropy_bits
>   add_interrupt_randomness
>   handle_irq_event_percpu
>   handle_irq_event
>   handle_edge_irq
>   handle_irq
>   do_IRQ
>   common_interrupt
> 
> Signed-off-by: Andi Kleen 


Re: [PATCH] random: always fill buffer in get_random_bytes_wait

2018-02-28 Thread Theodore Ts'o
On Sun, Feb 04, 2018 at 11:07:46PM +0100, Jason A. Donenfeld wrote:
> In the unfortunate event that a developer fails to check the return
> value of get_random_bytes_wait, or simply wants to make a "best effort"
> attempt, for whatever that's worth, it's much better to still fill the
> buffer with _something_ rather than catastrophically failing in the case
> of an interruption. This is both a defense in depth measure against
> inevitable programming bugs, as well as a means of making the API a bit
> more useful.
> 
> Signed-off-by: Jason A. Donenfeld 

Thanks, applied.

- Ted


Re: [PATCH] linux/kernel.h: break long lines

2018-02-24 Thread Theodore Ts'o
On Sat, Feb 24, 2018 at 07:30:44PM +0100, Borislav Petkov wrote:
> 
> And, for the record, if we have to break a function signature, we align the
> arguments at the opening brace like this:
> 
> static inline int __must_check kstrtou64_from_user(const char __user *s,
>                                                    size_t count, unsigned int base, u64 *res)
>

An alternate approach is this:

static inline int __must_check
kstrtou64_from_user(const char __user *s, size_t count, unsigned int base, u64 *res)

Which yes, is still longer than 80 characters.  But this is where
blindly following coding guidelines as if they are fundamentalist
biblical doctrine is not really a great idea.  The goal is to make the
code easier to read, and very often it's important to apply _judgement_.
(Which is one of the reasons why I generally don't really like newbies
trying to apply checkpatch.pl to existing source files.)

Cheers,

- Ted


Re: backporting "ext4: inplace xattr block update fails to deduplicate blocks" to LTS kernels?

2018-02-21 Thread Theodore Ts'o
On Wed, Feb 21, 2018 at 12:40:00PM +0100, Greg Kroah-Hartman wrote:
> On Mon, Feb 19, 2018 at 03:26:37PM +0200, Tommi Rantala wrote:
> > 
> > OK to backport it?
> > I tested it briefly in 4.9, seems to work.

It looks sane, but it would be nice if I can get people who are
backporting ext4 patches to make sure there are no regressions using
one of kvm-xfstests[1] or gce-xfstests[2][3].

[1] https://github.com/tytso/xfstests-bld/blob/master/Documentation/kvm-xfstests.md
[2] https://github.com/tytso/xfstests-bld/blob/master/Documentation/gce-xfstests.md
[3] https://thunk.org/gce-xfstests

I do run regression tests[4] on stable kernels when I have time, but
it scales much better when other people can help.

[4] https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/tag/?h=ext4-4.9.54-1

> I need an ack from the ext4 maintainers before I can take this...

Greg, you can go ahead and take this, but in the future I'd appreciate
it if ext4 backporters could at least run a smoke test (which takes
less than 15 minutes on GCE) before and after the patch, and report no
test regressions.

More aggressive testers can and should run "{kvm,gce}-xfstests auto"
which take overnight, and the really adventurous can use the
lightweight test manager (ltm) which will break up the test run the
job across multiple VM's and do the job in about 2.5 hours or so.  :-)

 - Ted


Re: ext4 iomap SEEK broken [was: [GIT PULL] ext4 updates for 4.15]

2018-02-12 Thread Theodore Ts'o
On Mon, Feb 12, 2018 at 01:14:07PM +0100, Jiri Slaby wrote:
> On 02/12/2018, 11:02 AM, Jiri Slaby wrote:
> > Given this happens only on 32bit kernel, I assume some 32bit overflow.
> > But I am unable to see it (yet).
> 
> Just to add, a diff of strace in good and bad kernels:
> @@ -655,14 +655,4 @@
>  _llseek(3, 4275568640, [4286054400], SEEK_DATA) = 0
>  _llseek(3, 4286054400, [4288675840], SEEK_HOLE) = 0
>  _llseek(3, 4288675840, [4299161600], SEEK_DATA) = 0
> -_llseek(3, 4299161600, [4301783040], SEEK_HOLE) = 0
> +_llseek(3, 4299161600, [4299161600], SEEK_HOLE) = 2621440
> 
> llseek returns a very invalid value when it comes to 0x10040.

Thanks for the bugreport!  Can you send me the output of "filefrag -v"
on the file so I can make sure we can exactly replicate what you're
seeing?

Thanks,

- Ted


Re: [PATCHv2 1/1] ext4: don't put symlink in pagecache into highmem

2018-02-06 Thread Theodore Ts'o
On Tue, Feb 06, 2018 at 03:38:09PM -0800, Eric Biggers wrote:
> I don't think backporting this change for other filesystems is particularly
> important, since if I understand correctly, the reasons that Al made the 
> change
> originally were:
> 
> - to allow following symlinks in RCU mode, but that's not implemented in old
>   kernels

Yup.

> - to prevent a process from using up all kmaps and deadlocking the system, 
> which
>   I'm not sure is a real problem (someone would need to try to put together a
>   reproducer), but if so it would probably just be a local device of service.

.. and *that's* only a problem on 32-bit systems.  And aside from
Android, it's unclear to me how much we need to support 32-bit systems
on upstream LTS kernels.  I suppose there might be Raspberry Pi's which
are 32-bits and which might want to use btrfs.  Personally I'm not
sure we should care all that much, but others who care more about LTS
kernels and 32-bit systems might have a different opinion.

> Also if we actually backported the full commit there are follow-on fixes such 
> as
> e8ecde25f5e that would be needed as well but might be missed.

Good point.

- Ted


Re: [PATCHv2 1/1] ext4: don't put symlink in pagecache into highmem

2018-02-06 Thread Theodore Ts'o
On Tue, Feb 06, 2018 at 12:39:53PM -0800, Greg KH wrote:
> On Tue, Feb 06, 2018 at 11:09:37AM -0800, Jin Qian wrote:
> > From: Jin Qian 
> > 
> > partial backport from 21fc61c73c3903c4c312d0802da01ec2b323d174 upstream
> > to v4.4 to prevent virt_to_page on highmem.
>
> Ted, any objection to this patch?

No objections with my ext4 hat on.

It should be noted though that this is a partial backport because it
only fixes ext4, while Al's original upstream fix addressed a much
larger set of file systems.  In the Android kernel the f2fs fix had
been backported separately.  But for the upstream kernel, it *might*
be the case that we should try backporting the original commit so that
in case some other general purpose distribution decides (a)
to base their system on 4.4, and (b) support a 32-bit kernel, they get
the more general bug fixes which apply to btrfs, isofs, ocfs2, nfs,
etc.

I haven't been paying attention to what LTS kernels general purpose
distro's are using, so I don't know how important this would be.  And
if there are companies like Cloudflare which are using upstream LTS
kernel, it seems unlikely they would want to use a 32-bit kernel,
so shrug.  Greg, I'll let you decide if you want to backport the
full commit or not.

(We had a similar discussion on the AOSP kernel, and came to the
conclusion that we only needed to make the patch support ext4.  No one
was going to test the other file systems besides ext4 and f2fs,
anyway.  But the calculus might be different for
the general upstream LTS kernel.)

- Ted


Re: [PATCH 1/1] ext4: don't put symlink in pagecache into highmem

2018-02-06 Thread Theodore Ts'o
On Mon, Feb 05, 2018 at 11:57:39AM -0800, Jin Qian wrote:
> From: Jin Qian 
> 
> partial backport of 21fc61c73c3903c4c312d0802da01ec2b323d174.

Hi Jin,

When you do a partial backport like this, it would be helpful you
could indicate which kernel version(s) it is meant for, since there
are many stable kernel trees.

Thanks!!

- Ted


[GIT PULL] fscrypt updates for 4.16

2018-02-03 Thread Theodore Ts'o
Note: there will be a merge conflict; please just take the chunk
which calls fscrypt_encrypt_symlink() from the fscrypt tree.  This
will end up dropping the kzalloc() -> f2fs_kzalloc() change, which
means the fscrypt-specific allocation won't get tested by f2fs's
kmalloc error injection system; which is fine.

The ubifs and f2fs changes have been reviewed by their respective
maintainers and I got their approval to run all of these changes
through the fscrypt tree.

 - Ted
 
The following changes since commit 1291a0d5049dbc06baaaf66a9ff3f53db493b19b:

  Linux 4.15-rc4 (2017-12-17 18:59:59 -0800)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/tytso/fscrypt.git tags/fscrypt_for_linus

for you to fetch changes up to 0b1dfa4cc6c60052b2c30ead316fa84c46d3c43c:

  fscrypt: fix build with pre-4.6 gcc versions (2018-02-01 10:51:18 -0500)


Refactor support for encrypted symlinks to move common code to fscrypt.


Eric Biggers (26):
  fscrypt: move fscrypt_has_encryption_key() to supp/notsupp headers
  fscrypt: move fscrypt_control_page() to supp/notsupp headers
  fscrypt: move fscrypt_info_cachep declaration to fscrypt_private.h
  fscrypt: move fscrypt_ctx declaration to fscrypt_supp.h
  fscrypt: split fscrypt_dummy_context_enabled() into supp/notsupp versions
  fscrypt: move fscrypt_operations declaration to fscrypt_supp.h
  fscrypt: move fscrypt_valid_enc_modes() to fscrypt_private.h
  fscrypt: move fscrypt_is_dot_dotdot() to fs/crypto/fname.c
  fscrypt: trim down fscrypt.h includes
  fscrypt: new helper functions for ->symlink()
  fscrypt: new helper function - fscrypt_get_symlink()
  ext4: switch to fscrypt ->symlink() helper functions
  ext4: switch to fscrypt_get_symlink()
  f2fs: switch to fscrypt ->symlink() helper functions
  f2fs: switch to fscrypt_get_symlink()
  ubifs: free the encrypted symlink target
  ubifs: switch to fscrypt ->symlink() helper functions
  ubifs: switch to fscrypt_get_symlink()
  fscrypt: remove fscrypt_fname_usr_to_disk()
  fscrypt: move fscrypt_symlink_data to fscrypt_private.h
  fscrypt: calculate NUL-padding length in one place only
  fscrypt: define fscrypt_fname_alloc_buffer() to be for presented names
  fscrypt: fix up fscrypt_fname_encrypted_size() for internal use
  fscrypt: document symlink length restriction
  fscrypt: remove 'ci' parameter from fscrypt_put_encryption_info()
  fscrypt: fix build with pre-4.6 gcc versions

 Documentation/filesystems/fscrypt.rst |  10 -
 fs/crypto/crypto.c|   1 +
 fs/crypto/fname.c | 140 +++--
 fs/crypto/fscrypt_private.h   |  31 ++
 fs/crypto/hooks.c | 158 +
 fs/crypto/keyinfo.c   |  17 ++--
 fs/ext4/namei.c   |  58 +++---
 fs/ext4/super.c   |   4 +-
 fs/ext4/symlink.c |  43 +++
 fs/f2fs/inode.c   |   2 +-
 fs/f2fs/namei.c   | 132 +++--
 fs/ubifs/dir.c|  63 +++-
 fs/ubifs/file.c   |  36 +---
 fs/ubifs/super.c  |   4 +-
 include/linux/fscrypt.h   | 174 +---
 include/linux/fscrypt_notsupp.h   |  59 ++
 include/linux/fscrypt_supp.h  |  68 +++---
 17 files changed, 500 insertions(+), 500 deletions(-)


[GIT PULL] ext4 updates for 4.16

2018-02-03 Thread Theodore Ts'o
The following changes since commit 1291a0d5049dbc06baaaf66a9ff3f53db493b19b:

  Linux 4.15-rc4 (2017-12-17 18:59:59 -0800)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git tags/ext4_for_linus

for you to fetch changes up to 5dc397113d195a004899ee672c5a8a19809495ba:

  ext4: create ext4_kset dynamically (2018-01-11 15:34:04 -0500)


Only miscellaneous cleanups and bug fixes for ext4 this cycle.


Alexander Potapenko (1):
  mbcache: initialize entry->e_referenced in mb_cache_entry_create()

Colin Ian King (1):
  ext4: fix incorrect indentation of if statement

Eric Biggers (1):
  mbcache: revert "fs/mbcache.c: make count_objects() more robust"

Ernesto A. Fernández (1):
  ext4: correct documentation for grpid mount option

Harshad Shirwadkar (1):
  ext4: fix a race in the ext4 shutdown path

Jan Kara (2):
  dax: pass detailed error code from dax_iomap_fault()
  ext4: fix ENOSPC handling in DAX page fault handler

Jiang Biao (1):
  mbcache: make sure c_entry_count is not decremented past zero

Jun Piao (1):
  ext4: use 'sbi' instead of 'EXT4_SB(sb)'

Petros Koutoupis (1):
  ext4: fixed alignment and minor code cleanup in ext4.h

Riccardo Schirone (3):
  ext4: release kobject/kset even when init/register fail
  ext4: create ext4_feat kobject dynamically
  ext4: create ext4_kset dynamically

Theodore Ts'o (1):
  ext4: fix up remaining files with SPDX cleanups

Tobin C. Harding (1):
  jbd2: fix sphinx kernel-doc build warnings

Zhouyi Zhou (1):
  ext4: save error to disk in __ext4_grp_locked_error()

piaojun (1):
  ext4: no need flush workqueue before destroying it

 Documentation/filesystems/ext4.txt |   2 +-
 fs/dax.c   |   9 +-
 fs/ext2/file.c |   2 +-
 fs/ext4/acl.h  |   2 +-
 fs/ext4/balloc.c   |   4 +-
 fs/ext4/block_validity.c   |   6 +-
 fs/ext4/ext4.h |  22 ++--
 fs/ext4/ext4_extents.h |  14 +--
 fs/ext4/ext4_jbd2.h|   5 +-
 fs/ext4/extents.c  |  14 +--
 fs/ext4/extents_status.h   |   2 +-
 fs/ext4/file.c |  10 +-
 fs/ext4/fsmap.c|  15 +--
 fs/ext4/fsmap.h|  15 +--
 fs/ext4/hash.c |   6 +-
 fs/ext4/ialloc.c   |   4 +-
 fs/ext4/inline.c   |  10 +-
 fs/ext4/inode.c|  16 ++-
 fs/ext4/mballoc.c  |  28 ++
 fs/ext4/mballoc.h  |   2 +-
 fs/ext4/migrate.c  |   9 +-
 fs/ext4/move_extent.c  |  10 +-
 fs/ext4/namei.c|   6 +-
 fs/ext4/resize.c   |   2 +-
 fs/ext4/super.c|   9 +-
 fs/ext4/sysfs.c|  65 
 fs/ext4/truncate.h |   2 +-
 fs/ext4/xattr.h|   2 +-
 fs/jbd2/checkpoint.c   |   5 +-
 fs/jbd2/commit.c   |   5 +-
 fs/jbd2/journal.c  |   5 +-
 fs/jbd2/recovery.c |   5 +-
 fs/jbd2/revoke.c   |   5 +-
 fs/jbd2/transaction.c  |  10 +-
 fs/mbcache.c   |   8 +-
 fs/xfs/xfs_file.c  |   2 +-
 include/linux/dax.h|   2 +-
 include/linux/jbd2.h   | 431 +--
 38 files changed, 407 insertions(+), 364 deletions(-)


Re: [GIT PULL] inode->i_version rework for v4.16

2018-01-30 Thread Theodore Ts'o
On Tue, Jan 30, 2018 at 07:05:48AM -0500, Jeff Layton wrote:
> 
> I want to make sure I understand what's actually broken here thoug. Is
> it only broken when the two values are more than 2**63 apart, or is
> there something else more fundamentally wrong here?

The other problem is that returning a s64 (or a u64) is extremely
inefficient on a 32-bit platform.  So in general, it's better to avoid
returning a 64-bit type unless it's really necessary

- Ted


Re: Panic with ext4,nbd,qemu-img,block

2018-01-20 Thread Theodore Ts'o
On Sat, Jan 20, 2018 at 04:41:00PM +0800, Hongzhi, Song wrote:
> 
> 329.11 EXT4-fs (nbd0): mounted filesystem with ordered data mode. Opts:
> (null)
> 329.12 block nbd0: Connection timed out
> 329.13 block nbd0: shutting down sockets

That's your problem.  I'm guessing qemu-nbd is dying for some reason,
which is causing the nbd device to fail (hence the "connection timed
out").

- Ted


Re: dangers of bots on the mailing lists was Re: divide error in ___bpf_prog_run

2018-01-17 Thread Theodore Ts'o
On Wed, Jan 17, 2018 at 04:21:13PM -0800, Alexei Starovoitov wrote:
> 
> If syzkaller can only test one tree than linux-next should be the one.

Well, there's been some controversy about that.  The problem is that
it's often not clear if this is long-standing bug, or a bug which is
in a particular subsystem tree --- and if so, *which* subsystem tree,
etc.  So it gets blasted to linux-kernel, and to get_maintainer.pl,
which is often not accurate --- since the location of the crash
doesn't necessarily point out where the problem originated, and hence
who should look at the syzbot report.  And so this has caused
some irritation.

> There is some value of testing stable trees, but any developer
> will first ask for a reproducer in the latest, so usefulness of
> reporting such bugs will be limited.

What I suggested was to test Linus's tree, and then when a problem is
found, and syzkaller has a reliable repro, to *then* try to see if it
*also* shows up in the LTS kernels.

- Ted


Re: dangers of bots on the mailing lists was Re: divide error in ___bpf_prog_run

2018-01-17 Thread Theodore Ts'o
On Wed, Jan 17, 2018 at 12:09:18PM +0100, Dmitry Vyukov wrote:
> On Wed, Jan 17, 2018 at 10:49 AM, Daniel Borkmann  
> wrote:
> > Don't know if there's such a possibility, but it would be nice if we could
> > target fuzzing for specific subsystems in related subtrees directly (e.g.
> > for bpf in bpf and bpf-next trees as one example). Dmitry?
> 
> Hi Daniel,
> 
> It's doable.
> Let's start with one bpf tree. Will it be bpf or bpf-next? Which one
> contains more ongoing work? What's the exact git repo address/branch,
> so that I don't second guess?

As a suggestion, until the bpf subsystem is free from problems that
can be found by Syzkaller in Linus's upstream tree, maybe it's not
worth trying to test individual subsystem trees such as the bpf tree?
After all, there's no point trying to bisect our way checking to see
if the problem is with a newly added commit in a development tree, if
it turns out the problem was first introduced years ago in the 4.1 or
3.19 timeframe.

After all, finding these older problems is going to have much higher
value, since these are the sorts of potential security problems that
are worth backporting to real device kernels for Android/ChromeOS, and
for enterprise distro kernels.  So from an "impact to the industry"
perspective, focusing on Linus's tree is going to be far more
productive.  That's a win for the community, and it's a win for those
people on the Syzkaller team who might be going up for promo or
listing their achievements at performance review time.  :-)

This will also give the Syzkaller team more time to make the
automation more intelligent in terms of being able to do the automatic
bisection to find the first guilty commit, labelling the report with
the specific subsystem tree that it came from, etc., etc.

Cheers,

- Ted

P.S.  Something that might be *really* interesting is for those cases
where Syzkaller can find a repro, to test that repro on various stable
4.4, 4.9, 3.18, et. al. LTS kernels.  This will take less resources
than a full bisection, but it will add real value since knowledge that
it will trigger on a LTS kernel will help prioritize which reports
developers might be more interested in focusing upon, and it will give
them a head start in determining which fixes needed to be backported
to which stable kernels.


Re: [PATCH 3/3] encrypted-keys: document new fscrypt key format

2018-01-17 Thread Theodore Ts'o
On Wed, Jan 17, 2018 at 02:38:59PM +, André Draszik wrote:
> > > [...]
> > 
> > Please be very clear about exactly what security properties are achieved
> > by
> > using encrypted-keys.
> 
> I've left out all of this in the updated documentation, as any such
> information should probably be in Documentation/security/keys/trusted-
> encrypted.rst in the first place.

Where is this document going to be found / when will it be written?
It seems really odd to be requesting a do code review when the
specifications aren't available and/or haven't been written yet.  I
prefer to review the *design* first, as opposed to trying to both
review the code and try to guess at the design and review my guess of
the design at the same time

- Ted



Re: LKML admins (syzbot emails are not delivered)

2018-01-17 Thread Theodore Ts'o
On Wed, Jan 17, 2018 at 09:06:54AM +0100, Dmitry Vyukov wrote:
> I've also mailed a change to appengine that supports *.c, *.log and
> *.patch as text/plain extensions. There does not seem to be any major
> objects to it, but it will probably take some time to be deployed in
> prod. After that I will rename them back.

I would think the ideal change would be one where one could specify an
optional parameter to the appengine API which would override the
default content/type, but adding some additional defaults
for various magic extensions is probably a simpler change.

   - Ted


Re: LKML admins (syzbot emails are not delivered)

2018-01-16 Thread Theodore Ts'o
On Tue, Jan 16, 2018 at 09:31:26AM +0100, Dmitry Vyukov wrote:
> On Tue, Jan 16, 2018 at 8:12 AM, Theodore Ts'o <ty...@mit.edu> wrote:
> > I just checked a recent report from the Syzbot, and it's not fixed.
> > The raw.log file still uses a Content-Type of
> > Application/Octet-stream.  Worse the reproducer C source file has a
> > content type of Application/Octet-stream instead of the much more sane
> > Application/text.  
> 
> 
> I will look into using a different mailing system which allows more
> control over email contents.
> A quick fix for raw.log will be to rename it to raw.log.txt. Not sure
> if text/plain repro.c.txt is better than application/octet-stream
> repro.c...

My personal opinion is that if there is no way to force the content
type except by using magic extensions --- which is super-surprising to
me; that seems like a broken API and a feature request bug should be
filed against the relevant API --- using repro.c.txt would be the best
of bad alternatives.  Someone is going to have to exit to a shell to
compile the repro, and renaming the filename isn't a big deal.  The
Mail User Agent I use (mutt) allows me to specify the directory to
save the file, and gives me the opportunity to edit the filename,
before I save it.  So at least for me, it really isn't a big deal for
you to use repro.c.txt.

Cheers,

- Ted


Re: LKML admins (syzbot emails are not delivered)

2018-01-15 Thread Theodore Ts'o
On Mon, Jan 15, 2018 at 10:38:42AM -0600, Eric W. Biederman wrote:
> 
> Sometimes the branches on linux-next are experimental crap.  If someone
> adds an experimental memory allocator to linux-next before discovering
> it causes all kinds of problems I don't want bug reports about my code
> not being able to allocate memory because the memory allocator was bad.
> 
> If you don't have the resources to test the individual branches of
> linux-next please just test Linus's tree.   That will be much more
> meaningful and productive.

I have to agree with Eric here, the reason why Fengguang Wu's 0-day
testing robot is much better received by developers is that he does
not test linux-next, but rather individual subsystem git trees and
branches.  His test automation also does an automatic bisection
search, and can point at a specific commit --- at which point e-mail
goes out to owner of the subsystem git tree, and to the people who
authored and/or reviewed the guilty commit.

Dmitry, perhaps you could collaborate with Intel's 0-day testing
folks?  They have code which does all of this, and perhaps it can be
leveraged.

> 
> When I made the complaint it came to me and to messages on lkml as
> .log.  With Content-Type: Application/Octet-stream.
> 
> That is a bloody mess that wastes peoples time.  If it is fixed good,
> it certainly was not fixed at that point.

I just checked a recent report from the Syzbot, and it's not fixed.
The raw.log file still uses a Content-Type of
Application/Octet-stream.  Worse the reproducer C source file has a
content type of Application/Octet-stream instead of the much more sane
Application/text.  

Hint to Googlers --- many kernel developers do not use Gmail because
it does unspeakable things to whitespaces, which results in mangled
patches, or because they want real mail threading.  The Content-Type
really matters, because for many text-based Mail User Agents, if it is
Application/octet-stream, the MUA will assume that it can't be
displayed as text, and require that it be saved to a file, which the
developer then has to manually read by firing up a pager or an editor.
When you are getting hundreds or thousands of messages a day, having
critical data which darn well *could* be displayed as text require
manual processing adds friction and destroys developer productivity.
So *you* might think the Content-type is trivial, but for developers
who live in their MUA's (that's why many prefer to review patches in
their mail reader, and not have to switch to web browser to use
Gerrit), screwing up the Content-type is going to be a big deal to
them.

> Outside of the bugs being considered as considered as security issues,
> the bugs syzbot finds are generally things that don't affect anyone in
> practice.  So are very low on the priority of things to get fixed.

The real danger is that people will start auto-filing Syzbot reports
to "file 13" (e.g., the trash can) because they are too annoying
But that's Dmitry and the Syzkaller team's problem, not the kernel
developers.   :-)

- Ted


Re: [PATCH] ext4: fix incorrect indentation of if statement

2018-01-11 Thread Theodore Ts'o
On Thu, Jan 04, 2018 at 04:40:18PM +0100, Jan Kara wrote:
> On Wed 29-11-17 14:20:59, Colin King wrote:
> > From: Colin Ian King 
> > 
> > The indentation is incorrect and spaces need replacing with a tab
> > on the if statement.
> > 
> > Cleans up smatch warning:
> > fs/ext4/namei.c:3220 ext4_link() warn: inconsistent indenting
> > 
> > Signed-off-by: Colin Ian King 
> 
> This seems to fall through cracks. The patch looks good. You can add:
> 
> Reviewed-by: Jan Kara 

Thanks, applied.  There was actually another whitespace issue on the
following line, which I've fixed up and merged into this patch.

- Ted


Re: [PATCH] FS: EXT4: syn error in __ext4_grp_locked_error

2018-01-11 Thread Theodore Ts'o
On Fri, Dec 15, 2017 at 02:32:16AM +, zhouzho...@gmail.com wrote:
> From: Zhouyi Zhou 
> 
> In function __ext4_grp_locked_error,   __save_error_info
> is called to save error info in the super block, but it does not sync
> that information to disk to inform the subsequent fsck after reboot.
> 
> This patch syncs the error information to disk. After this patch,
> I think there is no obvious EXT4 error handling branch which leads to
> "Remounting filesystem read-only" that will leave the disk partition
> missing the subsequent fsck.
> 
> Compiled in x86_64 Linux
> Signed-off-by: Zhouyi Zhou 

Applied, thanks.

- Ted


Re: [PATCH 13/38] ext4: Define usercopy region in ext4_inode_cache slab cache

2018-01-11 Thread Theodore Ts'o
On Wed, Jan 10, 2018 at 06:02:45PM -0800, Kees Cook wrote:
> The ext4 symlink pathnames, stored in struct ext4_inode_info.i_data
> and therefore contained in the ext4_inode_cache slab cache, need
> to be copied to/from userspace.

Symlink operations to/from userspace aren't common or in the hot path,
and when they are in i_data, limited to at most 60 bytes.  Is it worth
it to copy through a bounce buffer so as to disallow any usercopies
into struct ext4_inode_info?

- Ted


Re: [PATCH] ext4: Remove repeated test in ext4_file_read_iter.

2018-01-10 Thread Theodore Ts'o
On Wed, Jan 10, 2018 at 10:01:49PM +0800, Sean Fu wrote:
>
> Correct, IOCB_NOWAIT read with zero count can return -EAGAIN, But I
> think that it is reasonable. while it got lock, zero would be returned
> in this case.

Returning -EAGAIN and 0 are not the same thing.  Specifically
returning 0 means "end of file".  Hence, it is NOT reasonable.

See the man page for read(2) or the relevant Single Unix Specification
standard:

RETURN VALUE
 On success, the number of bytes read is returned (zero indicates
 end of file)

Changing the behavior of the system in a way which will break userspace is
never acceptable, and even more so given that this is just a "cleanup
patch".

- Ted



Re: [PATCH v2] jbd2: fix sphinx kernel-doc build warnings

2018-01-09 Thread Theodore Ts'o
On Wed, Jan 10, 2018 at 12:42:29PM +1100, Tobin C. Harding wrote:
> Sphinx emits various (26) warnings when building make target 'htmldocs'.
> Currently struct definitions contain duplicate documentation, some as
> kernel-docs and some as standard c89 comments.  We can reduce
> duplication while cleaning up the kernel docs.
> 
> Move all kernel-docs to right above each struct member.  Use the set of
> all existing comments (kernel-doc and c89).  Add documentation for
> missing struct members and function arguments.
> 
> Signed-off-by: Tobin C. Harding 

Thanks, applied.  I did fix up a few of your newly added doc strings.

   - Ted


Re: Proposal: CAP_PAYLOAD to reduce Meltdown and Spectre mitigation costs

2018-01-07 Thread Theodore Ts'o
On Sun, Jan 07, 2018 at 02:51:59PM +0200, Avi Kivity wrote:
> 
> I don't see the connection. The browser wouldn't run with CAP_PAYLOAD set.
> 
> In a desktop system, only init retains CAP_PAYLOAD.
> 
> On a server that runs one application (and some supporting processes), only
> init and that one application have CAP_PAYLOAD (if the sysadmin makes it
> so).

In the classical (as defined by the withdrawn Posix draft spec)
capabilities model, if you have a setuid root process it gets all the
capabilities, and capabilities are used to limit what privileges a
root process has.  Hence using strict capabilities, any setuid root
process would have CAP_PAYLOAD.

Linux has extensions (capability bounding sets) which let you limit
which capabilities can be obtained by a process, so you _could_ make
it work, but it just seems like a bad fit, since it's not strictly
speaking a root-owned privilege.  It's more like a configuration
setting, and so modulating it via a cgroup attribute seems to make a
lot more sense --- it's certainly (IMHO) less confusing than trying to
ab(use) the capabilities system and its extensions in this fashion.

- Ted


Re: Proposal: CAP_PAYLOAD to reduce Meltdown and Spectre mitigation costs

2018-01-07 Thread Theodore Ts'o
On Sun, Jan 07, 2018 at 11:16:28AM +0200, Avi Kivity wrote:
> I think capabilities will work just as well with cgroups. The container
> manager will set CAP_PAYLOAD to payload containers; and if those run an init
> system or a container manager themselves, they'll drop CAP_PAYLOAD for all
> process/sub-containers but their payloads.

The reason why cgroups are better is Spectre can be used to steal
information from within the same privilege level --- e.g., you could
use Javascript to steal a user's Coindesk credentials or Lastpass
data, which is going to be *way* more lucrative than trying to mine
cryptocurrency on the sly in a user's browser.  :-)

As a result, you probably want Spectre mitigations to be enabled in a
root process --- which means capabilities aren't the right answer.

Regards,

- Ted


Re: ext4, quota, projectid: limits without capablity

2018-01-05 Thread Theodore Ts'o
On Fri, Jan 05, 2018 at 03:23:43PM +0800, 王元良 wrote:
> 
> hi,  all
> We use projectid for space restrictions inside the container, share the same 
> hard link between the host and multiple containers.
> But I encountered a problem, as follows, are the following modifications 
> feasible ?

I'm not sure what the problem is that you encountered, so it's hard to
judge the modification, especially since you didn't send it in a
proper unified diff format.

It looks like you are trying to allow root to avoid the restriction that an
entire directory hierarchy (all files and subdirectories) must have
the same project ID.  There are a couple of problems with this.  For
example, you only do this for the link case, but we enforce the same
restriction in other places, such as the rename case.

The current restrictions are to make life simpler for users.  The
restrictions are also part of the xfstests regression test, so I'm
fairly certain this will cause a test failure.

So this is something I wouldn't want to do without general agreement
across all of the file systems developers, since it breaks the
fundamental model of how project IDs are presented to users, and could
end up making things extremely confusing.  So the best way to do this
is for you to give much more details about why you want this change,
and what is the larger context of what is you are trying to do.

  - Ted


Re: LKML admins (syzbot emails are not delivered)

2018-01-04 Thread Theodore Ts'o
On Thu, Jan 04, 2018 at 12:04:34PM +0100, Dmitry Vyukov wrote:
> 
> The problem is that it's not _me_, it's a computer program which uses
> some mail delivery system which I don't have full visibility into. I
> don't know if it even gets bounce emails (as far as I understand it's
> not LKML that generates them, LKML SMTP server just returns some error
> code and then it's a responsibility of somebody else to represent it
> by a reply email). If the only way to probe the behavior is to send
> actual emails to LKML (which have high chances to be actually
> delivered to all subscribers), it makes testing somewhat problematic.

It looks like you're using the App Engine Mail API.  You can configure
it to get bounce e-mails[1].  From what I can tell looking at the mail
headers, the mail gets sent via an intermediate SMTP host, such as
mail-it0-f69.google.com before it is delievered to vger.  So if vger's
mailer bounces it via an SMTP error, it would be
mail-it0-f69.google.com's responsibility to generate a bounce message,
which you then should be able to catch.

[1] https://cloud.google.com/appengine/docs/standard/go/mail/bounce

You should be able to test this by having your app-engine send a
message to "invalid-addr...@vger.kernel.org".  I've verified that this
will cause an immediate SMTP error:

554 5.0.0 Hi [74.207.234.97], unresolvable address: 
; nosuchuser; invalid-addr...@vger.kernel.org

If it doesn't, you should file an internal bug report since that's
clearly a bug in the App Engine's mail infrastructure.  If you can't
get satisfaction that way, my recommendation would be to set up an
Linode server (you can't use GCE because GCE blocks outgoing SMTP
connections) to be your mail gateway, and send the notifications from
a domain such as syzkaller.org, where you have full control of your
mail infrastructure, and you don't have to deal with nonsense such as
DMARC.


It also seems to me, looking at other complaints on this thread, that
there is the opportunity for the syzbot to do much more.  For example,
you can see if it repro's on the last released mainline kernel (such
as 4.14) and if so, have the syzbot automatically do a bisection
search, so you can make sure the report goes to the best set of
developers to fix it, along with a pointer to the guilty commit.

Cheers,

- Ted


Re: About the try to remove cross-release feature entirely by Ingo

2018-01-02 Thread Theodore Ts'o
On Wed, Jan 03, 2018 at 11:10:37AM +0900, Byungchul Park wrote:
> > The point I was trying to drive home is that "all we have to do is
> > just classify everything well or just invalidate the right lock
> 
> Just to be sure, we don't have to invalidate lock objects at all but
> a problematic waiter only.

So essentially you are proposing that we have to play "whack-a-mole"
as we find false positives, and where we may have to put in ad-hoc
plumbing to only invalidate "a problematic waiter" when it's
problematic --- or to entirely suppress the problematic waiter
altogether.  And in that case, a file system developer might be forced
to invalidate a lock/"waiter"/"completion" in another subsystem.

I will also remind you that doing this will trigger a checkpatch.pl
*error*:

ERROR("LOCKDEP", "lockdep_no_validate class is reserved for device->mutex.\n" . 
$herecurr);

- Ted


Re: About the try to remove cross-release feature entirely by Ingo

2018-01-01 Thread Theodore Ts'o
On Mon, Jan 01, 2018 at 02:18:55AM -0800, Matthew Wilcox wrote:
> > Clarification: all TCP connections that are used by kernel code would
> > need to be in their own separate lock class.  All TCP connections used
> > only by userspace could be in their own shared lock class.  You can't
> > use a one lock class for all kernel-used TCP connections, because of
> > the Network Block Device mounted on a local file system which is then
> > exported via NFS and squirted out yet another TCP connection problem.
> 
> So the false positive you're concerned about is write-comes-in-over-NFS
> (with socket lock held), NFS sends a write request to local filesystem,
> local filesystem sends write to block device, block device sends a
> packet to a socket which takes that socket lock.

It's not just the socket lock, but any of the locks/mutexes/"waiters"
that might be taken in the TCP code path and below, including in the
NIC driver.

> I don't think we need to be as drastic as giving each socket its own lock
> class to solve this.  All NFS sockets can be in lock class A; all NBD
> sockets can be in lock class B; all user sockets can be in lock class
> C; etc.

But how do you know which of the locks taken in the networking stack
are for the NBD versus the NFS sockets?  What manner of horrific
abstraction violation is going to pass that information all the way
down to all of the locks that might be taken at the socket layer and
below?

How is this "proper clasification" supposed to happen?  It's the
repeated handwaving which claims this is easy which is rather
frustrating.  The simple thing is to use a unique ID which is bumped
for each struct sock, each struct super, struct block_device, struct
request_queue, struct bdi, etc, but that runs into lockdep scalability
issues.

Anything else means that you have to somehow pass down through the
layers so that, in the general case, the socket knows that it is "an
NFS socket" versus "an NBD socket" --- and remember, if there is any
kind of completion handling done in the NIC driver, it's going to have
to passed down well below the TCP layer all the way down to the
network device drivers.  Or is the plan to do this add a bit ad hoc of
plumbing for each false positive which cross-release lockdep failures
are reported?

> > Also, what to do with TCP connections which are created in userspace
> > (with some authentication exchanges happening in userspace), and then
> > passed into kernel space for use in kernel space, is an interesting
> > question.
> 
> Yes!  I'd love to have a lockdep expert weigh in here.  I believe it's
> legitimate to change a lock's class after it's been used, essentially
> destroying it and reinitialising it.  If not, it should be because it's
> a reasonable design for an object to need different lock classes for
> different phases of its existance.

We just also need to be destroy a lock class after the transient
object has been deleted.  This is especially true for file system
testing, since we are constantly mounting and unmounting file systems,
and creating and destroying loop devices, potentially hundreds or
thousands of times during a test run.  So if we have to create a
unique lock class for "proper classification" each time a file system
is mounted, or loop device or device-mapper device (dm-error, etc.) is
created, we'll run into lockdep scalability issues really quickly.

So this is yet another example where the handwaving, "all you have to
do is proper classification" just doesn't work.

> > So "all you have to do is classify the locks 'properly'" is much like
> > the apocryphal, "all you have to do is bell the cat"[1].  Or like the
> > saying, "colonizing the stars is *easy*; all you have to do is figure
> > out faster than light travel."
> 
> This is only computer programming, not rocket surgery :-)

Given the current state of the lockdep technology, merging cross-lock
certainly feels like requiring the use of sledgehammers to do rocket
surgery in order to avoid false positives --- sorry, "proper
classification".

- Ted


Re: About the try to remove cross-release feature entirely by Ingo

2017-12-30 Thread Theodore Ts'o
On Sat, Dec 30, 2017 at 05:40:28PM -0500, Theodore Ts'o wrote:
> On Sat, Dec 30, 2017 at 12:44:17PM -0800, Matthew Wilcox wrote:
> > 
> > I'm not sure I agree with this part.  What if we add a new TCP lock class
> > for connections which are used for filesystems/network block devices/...?
> > Yes, it'll be up to each user to set the lockdep classification correctly,
> > but that's a relatively small number of places to add annotations,
> > and I don't see why it wouldn't work.
> 
> I was exaggerating a bit for effect, I admit.  (but only a bit).
> 
> It can probably be for all TCP connections that are used by kernel
> code (as opposed to userspace-only TCP connections).  But it would
> probably have to be each and every device-mapper instance, each and
> every block device, each and every mounted file system, each and every
> bdi object, etc.

Clarification: all TCP connections that are used by kernel code would
need to be in their own separate lock class.  All TCP connections used
only by userspace could be in their own shared lock class.  You can't
use one lock class for all kernel-used TCP connections, because of
the case where a local file system is mounted on a Network Block
Device and then exported via NFS, squirting it out over yet another
TCP connection.

Also, what to do with TCP connections which are created in userspace
(with some authentication exchanges happening in userspace), and then
passed into kernel space for use in kernel space, is an interesting
question.

So "all you have to do is classify the locks 'properly'" is much like
the apocryphal, "all you have to do is bell the cat"[1].  Or like the
saying, "colonizing the stars is *easy*; all you have to do is figure
out faster than light travel."

[1] https://en.wikipedia.org/wiki/Belling_the_cat

- Ted


Re: About the try to remove cross-release feature entirely by Ingo

2017-12-30 Thread Theodore Ts'o
On Sat, Dec 30, 2017 at 12:44:17PM -0800, Matthew Wilcox wrote:
> 
> I'm not sure I agree with this part.  What if we add a new TCP lock class
> for connections which are used for filesystems/network block devices/...?
> Yes, it'll be up to each user to set the lockdep classification correctly,
> but that's a relatively small number of places to add annotations,
> and I don't see why it wouldn't work.

I was exaggerating a bit for effect, I admit.  (but only a bit).

It can probably be for all TCP connections that are used by kernel
code (as opposed to userspace-only TCP connections).  But it would
probably have to be each and every device-mapper instance, each and
every block device, each and every mounted file system, each and every
bdi object, etc.

The point I was trying to drive home is that "all we have to do is
just classify everything well or just invalidate the right lock
objects" is a massive understatement of the complexity level of what
would be required, or the number of locks/completion handlers that
would have to be blacklisted.

- Ted


Re: About the try to remove cross-release feature entirely by Ingo

2017-12-30 Thread Theodore Ts'o
On Fri, Dec 29, 2017 at 10:16:24PM -0800, Matthew Wilcox wrote:
> 
> I think this is a terminology problem.  To me (and, I suspect Ted), a
> waiter is a subject of a verb while a lock is an object.  So Ted is asking
> whether we have to classify the users, while I think you're saying we
> have extra objects to classify.

Exactly, the classification is applied when the {lock, mutex,
completion} object is initialized.  Not currently at the individual
call points to mutex_lock(), wait_for_completion(), down_write(), etc.

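For concreteness, here is a minimal sketch of where that hook lives
(the struct and key names are made up for the example):

#include <linux/lockdep.h>
#include <linux/mutex.h>

struct foo {
	struct mutex lock;
};

/* one static key per *class* of lock, not per instance */
static struct lock_class_key foo_lock_key;

static void foo_init(struct foo *f)
{
	mutex_init(&f->lock);
	/* the classification is attached here, at init time ... */
	lockdep_set_class(&f->lock, &foo_lock_key);
}

static void foo_do_work(struct foo *f)
{
	/* ... not here, at the acquisition site */
	mutex_lock(&f->lock);
	mutex_unlock(&f->lock);
}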

> > The problems come from wrong classification. Waiters either classified
> > well or invalidated properly won't bitrot.
> 
> I disagree here.  As Ted says, it's the interactions between the
> subsystems that lead to problems.  Everything's going to work great
> until somebody does something in a way that's never been tried before.

The question is: what does classified *well* mean?  At the extreme, we
could put the locks for every single TCP connection into their own
lockdep class.  But that would blow the limits on the number of locks
out of the water super-quickly --- and it would destroy the ability
for lockdep to learn what the proper locking order should be.  Yet
given lockdep's current implementation, the only way to guarantee that
there won't be any interactions between subsystems that cause false
positives would be to categorize the locks for each TCP connection
into their own class.

So this is why I get a little annoyed when you say, "it's just a
matter of classification".  NO IT IS NOT.  We cannot possibly
classify things "correctly" to completely eliminate false positives
without completely destroying lockdep's scalability as it is currently
designed.  Byungchul, you don't acknowledge this, and it makes the
"just classify everything" argument completely suspect as a result.

As far as "just invalidate the waiter" goes, the problem is that it
requires source-level changes to invalidate the waiter, and for
different use cases, we will need to invalidate different waiters.  For
example, in the case I gave, we would have to invalidate *all* TCP
waiters/locks in order to prevent false positives.  But that makes
lockdep useless for all TCP locks.  What's the solution?  I claim that
until lockdep is fundamentally fixed, there is no way to eliminate
*all* false positives without invalidating *all*
cross-release/cross-locks --- in which case you might as well leave
the cross-release patches as an out-of-tree patch.

So to claim that we can somehow fix the problem by making source-level
changes outside of lockdep, by "properly classifying" or "properly
invalidating" all locks, just doesn't make sense. 

The only way it can work is to either dump the responsibility onto
the people debugging lockdep reports --- making them push source-level
changes to other subsystems which they aren't the maintainers of, in
order to suppress false positives that arise from how the subsystems
are being used together in their particular configuration --- or to
claim that there is an "acceptable level" of false positives which we
can live with forever, and which cannot be fixed by "properly
classifying" the locks.

Or you can try to make lockdep scalable enough that we could put
every single lock for every single object into its own lock class
(e.g., each lock for every single TCP connection gets its own lock
class --- which is, after all, the only way we can "properly classify
everything") and still have lockdep be useful.

If you think that is doable, why don't you work on that, and once that
is done, maybe cross-release lockdep will be considered more acceptable
for mainline?

- Ted


Re: [patch V5 01/11] Documentation: Add license-rules.rst to describe how to properly identify file licenses

2017-12-29 Thread Theodore Ts'o
On Fri, Dec 29, 2017 at 11:17:54PM +0100, Philippe Ombredanne wrote:
> > As far as I know, none of the licenses explicitly say
> > copyright license must be on each file.  Just that the distribution of
> > source must include the copyright and license statement.  Exactly how
> > that is done is not explicitly specified.
> 
> This is also my take. What is done here is not much different than
> refactoring duplicated code so it leaves in a single place:
> 
> - by "value" at the root in COPYING and in the Documentation.
> - by "reference" in the code proper as SPDX ids.
> 
> Therefore essential and common requirements to include the license
> text is fulfilled in the kernel.
> 
> Note that there are a few offenders that will need to clean up their
> acts as they came up with both long and "un-removable and
> un-alterable" crazy legalese blurbs [1] prefix this:
> 
> "DO NOT ALTER OR REMOVE COPYRIGHT NOTICES OR THIS FILE HEADER"
> 
> These will have to be taken care on a case by case basis. These are
> pretty stupid and IMHO should have never been allowed to be added to
> the kernel in the first place and are ugly warts. It could very well
> be that these are not really GPL-compliant notices FWIW: keeping
> notices and copyrights is quite different from a restriction of
> altering things by moving them around which is exactly what is
> happening with the SPDX-ification here.
> 
> [1] 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/staging/lustre/include/linux/libcfs/libcfs.h?h=v4.15-rc5#n5

Lustre is now owned by Intel, so I suspect that some throat-clearing
noises in the right direction could easily take care of the issue with
those files.

- Ted


Re: [patch V5 01/11] Documentation: Add license-rules.rst to describe how to properly identify file licenses

2017-12-29 Thread Theodore Ts'o
On Fri, Dec 29, 2017 at 08:19:59AM -0800, Joe Perches wrote:
> 
> Has it been legally reviewed and accepted that removal
> of the BSD license text from individual source files is
> appropriate and meets the legal requirements of
> following the BSD license on a per-file basis?
> 
> And if so, who did this review?
> 
> Is there any license that does not allow removal of the
> license text and does not allow simple substitution of
> the SPDX license identifier in each individual file?

The work to use SPDX lines instead of individual licenses was done by
Greg K-H in close consultation with Linux Foundation counsels, so I
would assume that they did look at that particular issue.

IANAL, but I've talked to lawyers about this issue, and in my
experience if you talk to three lawyers you will easily get six
opinions.  As far as I know, none of the licenses explicitly say
copyright license must be on each file.  Just that the distribution of
source must include the copyright and license statement.  Exactly how
that is done is not explicitly specified.

- Ted


Re: About the try to remove cross-release feature entirely by Ingo

2017-12-28 Thread Theodore Ts'o
On Fri, Dec 29, 2017 at 10:47:36AM +0900, Byungchul Park wrote:
> 
>(1) The best way: To classify all waiters correctly.

It's really not all waiters, but all *locks*, no?

>   Ultimately the problems should be solved in this way. But it
>   takes a lot of time so it's not easy to use the way right away.
>   And I need helps from experts of other sub-systems.
> 
>   While talking about this way, I made a trouble.. I still believe
>   that each sub-system expert knows how to solve dependency problems
>   most, since each has own dependency rule, but it was not about
>   responsibility. I've never wanted to charge someone else it but me.

The problem is that it's not one subsystem, but *many*.  And it's the
interactions between the subsystems.

Consider the example I gave of a network block device, on which a
local disk file system is mounted, which is then exported over NFS.
So we have the Networking (TCP) stack involved, the NBD device driver,
the local disk file system, the NFS file system, and the networking
stack a second time.  That is *many* subsystem maintainers who have to
get involved.

In addition, the lock classification system is not documented at all,
so now you also need someone who understands the lockdep code.  And
since some of these classifications involve transient objects, and
lockdep doesn't have a way of dealing with transient locks, and has a
hard compile-time limit on the number of locks that it supports,
expecting a subsystem maintainer to figure out all of the interactions,
plus figure out lockdep, and work around lockdep's limitations is
simply not realistic.

(By the way, I've tried reading the crosslock and crossrelease
documentation --- and I'm lost.  Sorry, I'm just not smart enough to
understand how it works, at least not from reading the documentation
that was in the patch series.  And honestly, I don't care.  All I do
need is some practical instructions for how to "classify locks
properly", and how this interacts with lockdep's limitations.)

>(2) The 2nd way: To make cross-release off by default.
> 
>   At the beginning, I proposed cross-release being off by default.
>   Honestly, I was happy and did it when Ingo suggested it on by
>   default once lockdep on. But I shouldn't have done that but kept
>   it off by default. Cross-release can make some happy but some
>   unhappy until problems go away through (1) or (2).

Ingo's argument is that (1) is not going to be happening any time
soon, and in the meantime, code which is turned off will bitrot.

Given that once Lockdep reports a locking violation, it doesn't report
any more lockdep violations, if there are a large number of false
positives, people will not want to turn on cross-release, since it
will report the false positive and then turn itself off, so it won't
report anything useful.  So if no one turns it on because of the false
positives, how does the bitrot problem get resolved?

And if the answer is that some small number of lockdep experts will be
trying to figure out how to do (1) in a tractable way, then Ingo has
argued it can be handled via an out-of-tree patch.

>(3) The 3rd way: To invalidate waiters making trouble.

Hmm, can we make cross-release and cross-lock off by default on a
per-lock basis?  With a well-documented way to enable it?  I'm still
not sure how this works given the cross-subsystem problem, though.

So if networking enables it because there are no problems with their
TCP-only test, and then it blows up when someone is doing NBD or NFS
testing, what's the recourse?  The file system developer submitting a
patch against the networking subsystem to turn off the lockdep
tracking on that particular lock because it's causing pain for the
file system developer?  I can see that potentially causing all sorts
of inter-subsystem conflicts.

> Talking about what Ingo said in the commit msg.. I want to ask him back,
> if he did it with no false positives at the moment merging it in 2006,
> without using (2) or (3) method. I bet he know what it means.. And
> classifying locks/waiters correctly is not something uglifying code but
> a way to document code better. I've felt ill at ease because of the
> unnatural and forced explanation.

So I think the big difference is that the potential for
cross-subsystem false positives is dramatically higher with
cross-release than with traditional lockdep.  And I'm not sure
there is a clean solution --- how do you "cleanly classify" locks when
in some cases each object's locks need to be considered individual
locks, and when that must not be done lest there be an explosion in
the number of locks which lockdep needs to track (which is strictly
limited due to memory and CPU overhead, as I understand things)?  I
haven't seen an explanation for how to solve this in a clean, general
way --- and I strongly suspect it doesn't exist.

Regards,

- Ted


Re: [PATCH 04/12] ext2: drop unneeded newline

2017-12-27 Thread Theodore Ts'o
On Wed, Dec 27, 2017 at 03:51:37PM +0100, Julia Lawall wrote:
> ext2_msg prints a newline at the end of the message string, so the message
> string does not need to include a newline explicitly.  Done using
> Coccinelle.
> 
> Signed-off-by: Julia Lawall <julia.law...@lip6.fr>

Reviewed-by: Theodore Ts'o <ty...@mit.edu>

- Ted


Re: [PATCH 1/1] ext4: remove redundant assignment in ext4_iomap_begin()

2017-12-21 Thread Theodore Ts'o
On Mon, Nov 27, 2017 at 05:21:27PM +0800, sunqiuyang wrote:
> From: Qiuyang Sun 
> 
> This line will not change the value of map.m_lblk in any case.

I don't see that it is absolutely guaranteed.  And if you are
depending on what ext4_es_find_delayed_extent() returns, then (a) the
following line which adjusts map.m_len could also be dropped, and (b)
we should add a WARN_ON to make the code robust against future changes
to the above function.

- Ted

> 
> Signed-off-by: Qiuyang Sun 
> ---
>  fs/ext4/inode.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 9f836e2..d4a42b1 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3443,7 +3443,6 @@ static int ext4_iomap_begin(struct inode *inode, loff_t 
> offset, loff_t length,
>  
>   if (es.es_lblk < map.m_lblk)
>   offs = map.m_lblk - es.es_lblk;
> - map.m_lblk = es.es_lblk + offs;
>   map.m_len = es.es_len - offs;
>   delalloc = true;
>   }
> -- 
> 2.5.0
> 


[GIT PULL] ext4 bug fixes for 4.15

2017-12-16 Thread Theodore Ts'o
The following changes since commit ae64f9bd1d3621b5e60d7363bc20afb46aede215:

  Linux 4.15-rc2 (2017-12-03 11:01:47 -0500)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git 
tags/ext4_for_stable

for you to fetch changes up to 9d5afec6b8bd46d6ed821aa1579634437f58ef1f:

  ext4: fix crash when a directory's i_size is too small (2017-12-11 15:00:57 
-0500)


Fix a regression which caused us to fail to interpret symlinks in very
ancient ext3 file system images.  Also fix two xfstests failures, one
of which could cause an OOPS, plus an additional bug fix caught by fuzz
testing.


Andi Kleen (1):
  ext4: support fast symlinks from ext3 file systems

Chandan Rajendra (1):
  ext4: fix crash when a directory's i_size is too small

Eryu Guan (1):
  ext4: fix fdatasync(2) after fallocate(2) operation

Theodore Ts'o (1):
  ext4: add missing error check in __ext4_new_inode()

 fs/ext4/extents.c | 1 +
 fs/ext4/ialloc.c  | 2 ++
 fs/ext4/inode.c   | 9 +
 fs/ext4/namei.c   | 4 
 4 files changed, 16 insertions(+)


Re: [PATCH] ext4: delayed inode update for the consistency of file size after a crash

2017-12-16 Thread Theodore Ts'o
On Sat, Dec 16, 2017 at 01:33:26PM +0900, Seongbae Son wrote:
> > > Details can be found as follows.
> > >
> > > Son et al. "Guaranteeing the Metadata Update Atomicity in EXT4 
> > > Filesystem”,
> > > In Proc. of APSYS 2017, Mumbai, India
> 
> > This is behind a paywall, so I can't access it.  I am sorry I wasn't
> > on the program committee, or I would have pointed this out while the
> > paper was being reviewed.
> 
> Thanks for your quick answer.
> I am sorry about that. I could not think about the paywall.

If you want to send me a PDF of the paper, I'm happy to look at it.

> I have performed the above scenario to xfs, btrfs, f2fs, and zfs.
> As the test result, all of the four file systems does not have the problem
> that fileA in which fsync() was not executed has the wrong file size
> after a system crash. So, I think, the portability of applications might be
> okay even though EXT4 guarantees the consistency between the file size and
> the data blocks of the file that fsync() is not executed after a system crash.

So first of all, there are more file systems than xfs, btrfs, f2fs,
and zfs, and there are more operating systems than Linux and Solaris.

Secondly, your patch doesn't solve the problem.  Updating the
timestamps without putting the changes in a transaction is no
guarantee; if some other process does some operation such as chmod,
et al., the inode timestamps will get updated ahead of time.  Worse,
you are updating the i_size in a separate handle, right *before* the
write request is submitted.  So if a transaction commit gets started
immediately after the call to ext4_journal_stop() which was added in
mpage_process_page_bufs(), it's still possible for i_size to be
updated and be visible after a crash, without the data block being
updated.  It's a narrower race window, but it's still there.

Furthermore, the patch is huge because the introduction of the new
functions _ext4_update_time(), ext4_update_time(), and
_generic_file_write_iter() adds a large number of extra lines of code
which are copied from elsewhere --- this is "reverse code factoring"
--- and it makes the resulting source less maintainable.  And the fact
that it forces every single write which affects the last block in the
file to be written *twice* means that it has really unfortunate
performance impacts for workloads which are appending to a file (e.g.,
any kind of log file).

If you really want to solve this problem, what needs to be done is
that update of the timestamp and i_size needs to be moved to an I/O
completion handler.  We already do this to convert unwritten extents
to written in fs/ext4/page_io.c.  See ext4_put_io_end_defer() in
fs/ext4/page_io.c; if we need to convert unwritten extents the
EXT4_IO_END_UNWRITTEN flag is set, and ext4_add_complete_io() tacks
the io_end onto a list which is then processed from a workqueue.  This
infrastructure could be made
more general so that it can do other work after the I/O has been
completed, including the i_size update.
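
A rough, untested sketch of what that could look like
(EXT4_IO_END_SIZE and the new_size field are invented names here; this
is just to illustrate the shape of the change, not a patch):

static void ext4_end_io_update_size(ext4_io_end_t *io_end)
{
	struct inode *inode = io_end->inode;
	handle_t *handle;

	if (!(io_end->flag & EXT4_IO_END_SIZE))		/* invented flag */
		return;

	/* runs from the io_end workqueue, after the data has hit the media */
	handle = ext4_journal_start(inode, EXT4_HT_INODE, 1);
	if (IS_ERR(handle))
		return;
	if (io_end->new_size > i_size_read(inode))	/* invented field */
		i_size_write(inode, io_end->new_size);
	ext4_mark_inode_dirty(handle, inode);
	ext4_journal_stop(handle);
}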

I'm not really convinced it's worth it, but this would be a much more
efficient way of solving the problem, and it would avoid the need to
clone a large amount of code both in ext4 and in the generic
mm/filemap.c
file.

Best regards,

- Ted


Re: [PATCH] locking/lockdep: Remove the cross-release locking checks

2017-12-15 Thread Theodore Ts'o
On Fri, Dec 15, 2017 at 05:39:25PM +0900, Byungchul Park wrote:
> 
> All locks should belong to one class if each path of acquisition
> can be switchable each other within the class at any time.
> Otherwise, they should belong to a different class.

OK, so let's go back to my case of a Network Block Device with a local
file system mounted on it, which is then exported via NFS.

So an incoming TCP packet can go into the NFS server subsystem, then
be processed by local disk file system, which then does an I/O
operation to the NBD device, which results in a TCP packet being sent
out.  Then the response will come back over TCP, into the NBD block
layer, then into the local disk file system, and that will result in
an outgoing response to the TCP connection for the NFS protocol.

In order to avoid cross-release problems, all locks associated with
the incoming TCP connection will need to be classified as belonging to
a different class as the outgoing TCP connection.  Correct?  One
solution might be to put every single TCP connection into a separate
class --- but that will explode the number of lock classes which
Lockdep will need to track, and there is a limited number of lock
classes (set at compile time) that Lockdep can track.  So if that
doesn't work, we will have to add something ugly which manually makes
certain TCP connections "magic" and requires them to be put into a
separate class than all other TCP connections, which will get
collapsed into a single class.  Basically, any TCP connection which is
either originated by the kernel, or passed in from userspace into the
kernel and used by some kernel subsystem, will have to be assigned its
own lockdep class.
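
For what it's worth, here is roughly what per-user reclassification
looks like today for just the socket locks, using the existing
sock_lock_init_class_and_name() helper (the NBD names are made up for
the example).  Note that the keys have to be static --- one per
*class* --- which is exactly why "a class per connection" doesn't
scale, and that this only covers sk_lock/slock, not all the other
locks further down the stack:

#include <net/sock.h>

static struct lock_class_key nbd_slock_key;
static struct lock_class_key nbd_sk_lock_key;

static void nbd_reclassify_sock(struct sock *sk)
{
	/* give kernel-owned (hypothetically, NBD) sockets their own class */
	sock_lock_init_class_and_name(sk,
			"slock-AF_INET-NBD", &nbd_slock_key,
			"sk_lock-AF_INET-NBD", &nbd_sk_lock_key);
}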

If the TCP connection gets closed, we don't need to track that lockdep
class any more.  (Or if a device mapper device is torn down, we
similarly don't need any unique lockdep classes created for that
device mapper device.)  Is there a way to tell lockdep that a set of
lockdep classes can be released so we can recover the kernel memory to
be used to track some new TCP connection or some new device mapper
device?

- Ted


Re: [PATCH] locking/lockdep: Remove the cross-release locking checks

2017-12-14 Thread Theodore Ts'o
On Fri, Dec 15, 2017 at 01:05:43PM +0900, Byungchul Park wrote:
> For example, in the case of fs issues, for now we can
> invalidate wait_for_completion() in submit_bio_wait()

And this will spawn a checkpatch.pl ERROR:

ERROR("LOCKDEP",
      "lockdep_no_validate class is reserved for device->mutex.\n" . $herecurr);

This mention in checkpatch.pl is the only documentation I've been able
to find about lockdep_set_novalidate_class(), by the way. 

> ... and re-validate it later, of course, I really want to find more
> fundamental solution though.

Oh, and I was finally able to find the quote saying that the *only*
people who are likely to be able to deal with lock annotations are the
subsystem maintainers.  But if the ways of dealing with lock
annotations are not documented, such that subsystem maintainers are
going to have a very hard time figuring out *how* to deal with it,
lock classification as a solution to cross-release false positives
seems unlikely to work:
   
   From: Byungchul Park 
   Date: Fri, 8 Dec 2017 18:27:45 +0900
   Subject: Re: [PATCH v4 72/73] xfs: Convert mru cache to XArray

   1) Firstly, it's hard to assign lock classes *properly*. By
   default, it relies on the caller site of lockdep_init_map(),
   but we need to assign another class manually, where ordering
   rules are complicated so cannot rely on the caller site. That
   *only* can be done by experts of the subsystem.

- Ted


Re: [PATCH 09/19] ext4: convert to new i_version API

2017-12-14 Thread Theodore Ts'o
On Wed, Dec 13, 2017 at 09:20:07AM -0500, Jeff Layton wrote:
> From: Jeff Layton <jlay...@redhat.com>
> 
> Signed-off-by: Jeff Layton <jlay...@redhat.com>

Acked-by: Theodore Ts'o <ty...@mit.edu>

- Ted


Re: FS: EXT4: should we sync error info in __ext4_grp_locked_error?

2017-12-14 Thread Theodore Ts'o
On Thu, Dec 14, 2017 at 03:13:15PM +0800, Zhouyi Zhou wrote:
> Hi,
> In function __ext4_grp_locked_error,   __save_error_info(sb, function, 
> line)
> is called to save error info in super block block, but does not sync
> that information
> to disk to info the subsequence fsck after reboot. The reason, I guess
> maybe it is
> in locked state.
> My question is why not make a call ext4_commit_super(sb, 1) after
> ext4_unlock_group(sb, grp) and  ext4_handle_error(sb), so that subsequence 
> fsck
> after reboot is sure to be well informed.

Adding ext4_commit_super(sb, 1) between the calls to
ext4_unlock_group() and ext4_handle_error() is a good idea; it's not a
naive suggestion at all.

Cheers,

- Ted


Re: About the try to remove cross-release feature entirely by Ingo

2017-12-13 Thread Theodore Ts'o
On Wed, Dec 13, 2017 at 04:13:07PM +0900, Byungchul Park wrote:
> 
> Therefore, I want to say the fundamental problem
> comes from classification, not cross-release
> specific.

You keep saying that it is "just" a matter of classification.

However, it is not obvious how to do the classification in a sane
manner.  And this is why I keep pointing out that there is no
documentation on how to do this, and somehow you never respond to this
point

In the case where you have multiple unrelated subsystems that can be
stacked in different ways, with potentially multiple instances stacked
on top of each other, it is not at all clear to me how this problem
should be solved.

It was said on one of these threads (perhaps by you, perhaps by
someone else), that we can't expect the lockdep maintainers to
understand all of the subsystems in the kernels, and so therefore it
must be up to the subsystem maintainers to classify the locks.  I
interpreted this as the lockdep maintainers saying, "hey, not my
fault, it's the subsystem maintainer's fault for not properly
classifying the locks" --- and thus dumping the responsibility in the
subsystem maintainers' laps.

I don't know if the situation is just that lockdep is insufficiently
documented, and with the proper tutorial, it would be obvious how to
solve the classification problem.

Or, if perhaps, there *is* no way to solve the classification problem,
at least not in a general form.

For example --- suppose we have a network block device on which there
is a btrfs file system, which is then exported via NFS.  Now all of
the TCP locks will be used twice for two different instances, once for
the TCP connection for the network block device, and then for the NFS
export.

How exactly are we supposed to classify the locks to make it all work?

Or the loop device built on top of an ext4 file system which is on an
LVM/device-mapper device.  And suppose the loop device is then layered
with a dm-error device for regression testing, and with another ext4
file system on top of that?

How exactly are we supposed to classify the locks in that situation?
Where's the documentation and tutorials which explain how to make this
work, if the responsibility is going to be dumped on the subsystem
maintainers' laps?  Or if the lockdep maintainers are expected to fix
and classify all of these locks, are you volunteering to do this?

How hard is it exactly to do all of this classification work, no
matter whose responsibility it will ultimately be?

And if the answer is that it is too hard, then let me gently suggest
to you that perhaps, if this is the case, this is a fundamental and
fatal flaw with the cross-release and completion lockdep feature?

Best regards,

- Ted


Re: [PATCH 1/9] generic/381: use username fsgqa-381

2017-12-12 Thread Theodore Ts'o
On Tue, Dec 12, 2017 at 04:45:11PM -0800, Luis R. Rodriguez wrote:
> Some systems are not allowing usernames prefixed with a number now.
> One can however use numbers as a postfix so use that.
> 
> Signed-off-by: Luis R. Rodriguez 
> ---
>  README|  2 +-
>  tests/generic/381 | 16 
>  tests/generic/381.out |  4 ++--
>  3 files changed, 11 insertions(+), 11 deletions(-)
> 
> diff --git a/README b/README
> index ed69332e774e..e05142be1a87 100644
> --- a/README
> +++ b/README
> @@ -20,7 +20,7 @@ ___
>  - run make
>  - run make install
>  - create fsgqa test user ("sudo useradd fsgqa")
> -- create 123456-fsgqa test user ("sudo useradd 123456-fsgqa")
> +- create fsgqa-381 test user ("sudo useradd fsgqa-381")

I'd suggest using the username "fsgqa2" so that other tests that want
to use a second username can do so using a more logically named username.

- Ted


Re: [PATCH] ext4: delayed inode update for the consistency of file size after a crash

2017-12-10 Thread Theodore Ts'o
On Sun, Dec 10, 2017 at 09:12:57PM +0900, seongbaeSon wrote:
> 1. Current file offset of fileA is 14 KB. An application appends 2 KB data to
> fileA by executing a write() system call. At this time, the file size in 
> the ext4_inode of fileA is updated to 16 KB by ext4_da_write_end().
> 2. Current file offset of fileB is 14 KB. An application appends 2 KB data to
> fileB by executing a write() system call. At this time, the file size in
> the ext4_inode of fileB is updated to 16 KB by ext4_da_write_end().
> 3. A fsync(fileB) is called before the kworker thread runs. At this time,
> the application thread transfers the data block of fileB to storage and
> wakes up the JBD2. Then, JBD2 writes the ext4_inodes of fileA and fileB in
> the running transaction to the journal area. The ext4_inode of fileA in 
> the journal area has the file size, 16 KB, even though the data block of
> fileA has not been written to storage.
> 4. Assume that a system crash occurs. The EXT4 recovery module recovers
> the inodes of fileA and fileB. The recovered inode of fileA has the updated
> file size, 16 KB, even though the data of fileA has not been made durable.
> The data block of fileA between 14 KB and 16 KB is seen as zeros.

There's nothing wrong with this.  The user space application called
fsync on fileB, and *not* on fileA.  Therefore, there is absolutely no
guarantee that fileA's data contents are valid.

Consider that the exact same thing would happen if the application had
written data to fileA at offsets 6k to 8k.  If those offsets were
previously zero, then after the crash those offsets *might* still be
zero, *unless* the application had first called fsync() or
fdatasync().
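
To put it in terms of code (a user-space sketch; the function name and
error handling are only illustrative): the only way to guarantee
durability for fileA is for the application itself to do something
like

#include <fcntl.h>
#include <unistd.h>

int append_durably(int fd, const void *buf, size_t len)
{
	if (write(fd, buf, len) != (ssize_t) len)
		return -1;
	/* without this, the appended data (and the new size) may be lost */
	return fsync(fd);
}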

> Details can be found as follows.
> 
> Son et al. "Guaranteeing the Metadata Update Atomicity in EXT4 Filesystem”,
> In Proc. of APSYS 2017, Mumbai, India

This is behind a paywall, so I can't access it.  I am sorry I wasn't
on the program committee, or I would have pointed this out while the
paper was being reviewed.

The problem with providing more guarantees than what is strictly
provided for by POSIX is that it degrades the performance of the file
system.  It can also encourage application writers to depend on
semantics which are non-portable, which can cause problems when they
try to run that program on other operating systems or other file
systems.

Cheers,

- Ted


Re: Lockdep is less useful than it was

2017-12-08 Thread Theodore Ts'o
On Thu, Dec 07, 2017 at 02:38:03PM -0800, Matthew Wilcox wrote:
> I think it was a mistake to force these on for everybody; they have a
> much higher false-positive rate than the rest of lockdep, so as you say
> forcing them on leads to fewer people using *any* of lockdep.
> 
> The bug you're hitting isn't Byungchul's fault; it's an annotation
> problem.  The same kind of annotation problem that we used to have with
> dozens of other places in the kernel which are now fixed.

The question is whose responsibility is it to annotate the code?  On
another thread it was suggested it was ext4's responsibility to add
annotations to avoid the false positives --- never mind the fact
that every single file system is going to have to add annotations.

Also note that the documentation for how to add annotations is
***horrible***.  It's mostly, "try to figure out how other people
added magic cargo cult code which is not well defined (look at the
definitions of lockdep_set_class, lockdep_set_class_and_name,
lockdep_set_class_and_subclass, and lockdep_set_subclass, and weep) in
other subsystems and hope and pray it works for you."

And the explanations when there are failures, whether false positives
or not, are completely opaque.  For example:

[   16.190198] ext4lazyinit/648 is trying to acquire lock:
[   16.191201]  ((gendisk_completion)1 << part_shift(NUMA_NO_NODE)){+.+.}, at: 
[<8a1ebe9d>] wait_for_completion_io+0x12/0x20

Just try to tell me that:

((gendisk_completion)1 << part_shift(NUMA_NO_NODE)){+.+.}

is human comprehensible with a straight face.  And since the messages
don't even include the subclass/class/name key annotations, as lockdep
tries to handle things that are more and more complex, I'd argue it
has already crossed the boundary where unless you are a lockdep
developer, good luck trying to understand what is going on or how to
add annotations.

So if you are adding complexity to the kernel with the argument,
"lockdep will save us", I'm with Dave --- it's just not a believable
argument.

Cheers,

- Ted


Re: [PATCH v4 72/73] xfs: Convert mru cache to XArray

2017-12-07 Thread Theodore Ts'o
On Wed, Dec 06, 2017 at 06:06:48AM -0800, Matthew Wilcox wrote:
> > Unfortunately for you, I don't find arguments along the lines of
> > "lockdep will save us" at all convincing.  lockdep already throws
> > too many false positives to be useful as a tool that reliably and
> > accurately points out rare, exciting, complex, intricate locking
> > problems.
> 
> But it does reliably and accurately point out "dude, you forgot to take
> the lock".  It's caught a number of real problems in my own testing that
> you never got to see.

The problem is that if it has too many false positives --- and it's
gotten *way* worse with the completion callback "feature" --- people
will just stop using lockdep as being too annoying and a waste of
developer time when trying to figure out what is a legitimate locking
bug versus lockdep getting confused.

I can't even disable the new Lockdep feature which is throwing
lots of new false positives --- it's just all or nothing.

Dave has just said he's already stopped using Lockdep, as a result.

  - Ted


Re: [PATCH] ext4: Support fast symlinks from ext3 file systems

2017-12-03 Thread Theodore Ts'o
On Mon, Nov 27, 2017 at 11:08:53AM -0700, Andreas Dilger wrote:
> On Nov 27, 2017, at 10:37 AM, Andi Kleen  wrote:
> > 
> > From: Andi Kleen 
> > 
> > dbb3c27f5b91c4 (ext4: change fast symlink test to not rely on i_blocks)
> > broke ~10 years old ext3 file systems created by 2.6.17. Any ELF
> > executable fails because the /lib/ld-linux.so.2 fast symlink
> > cannot be read anymore.
> > 
> > The patch assumed fast symlinks were created in a specific way,
> > but that's not true on these really old file systems.
> > 
> > The new behavior is apparently needed only with the large EA inode
> > feature.
> > 
> > Revert to the old behavior if the large EA inode feature is not set.
> > 
> > This makes my old VM boot again.
> > 
> > Fixes: dbb3c27f5b91c4 (ext4: change fast symlink test to not rely ...)
> > Signed-off-by: Andi Kleen 
> 
> Reviewed-by: Andreas Dilger 

Thanks, applied.

- Ted


Re: [kernel-hardening] Re: [PATCH v5 next 5/5] net: modules: use request_module_cap() to load 'netdev-%s' modules

2017-11-30 Thread Theodore Ts'o
On Thu, Nov 30, 2017 at 09:50:27AM +0100, Djalal Harouni wrote:
> In embedded systems we can't maintain a SELinux policy, distro man
> power hardly manage. We have abstracted seccomp etc, but the kernel
> inherited the difficult multiplex things, plus all other paths that
> trigger this.

> Yes, but it is hard to maintain a whitelist policy, the code is hardly
> maintained...

So this is the part that scares me to death about IOT, and why I tell
everyone to ***never*** trust an IOT device on their home network, and
***never*** trust it with anything you don't mind splattered all over
the front page of the NY Times and RT / Sputnik news.

You're saying that you want to use modules (as opposed to compiling
everything tightly down to just what you need for the embedded
system), and that the code is "hardly maintained".  And yet we're
supposed to consider it trustworthy?

If that's the case, turning off implicit module loading and thinking
that this will somehow be a magic wand sounds crazy.

 - Ted


Re: [PATCH v5 next 1/5] modules:capabilities: add request_module_cap()

2017-11-29 Thread Theodore Ts'o
On Wed, Nov 29, 2017 at 11:28:52AM -0600, Serge E. Hallyn wrote:
> 
> Just to be clear, module loading requires - and must always continue to
> require - CAP_SYS_MODULE against the initial user namespace.  Containers
> in user namespaces do not have that.
> 
> I don't believe anyone has ever claimed that containers which are not in
> a user namespace are in any way secure.

Unless the container performs some action which causes the kernel to
call request_module(), which then loads some kernel module,
potentially containing cr*p unmaintained code which was included when
the distro compiled the world, into the host kernel.

This is an attack vector that doesn't exist if you are using VM's.
And in general, the attack surface of the entire Linux
kernel<->userspace API is far larger than that which is exposed by the
guest<->host interface.

For that reason, containers are *far* more insecure than VM's, since
with a VM, once the attacker gets root on the guest, they then have to
attack the hypervisor interface.  And if you compare the attack
surface of the two, it's pretty clear which is larger, and it's not
the hypervisor interface.

- Ted


Re: [PATCH v5 next 1/5] modules:capabilities: add request_module_cap()

2017-11-29 Thread Theodore Ts'o
On Wed, Nov 29, 2017 at 10:58:16AM -0500, David Miller wrote:
> That's not what we're talking about.
> 
> We're talking about making sure that loading "ppp.ko" really gets
> ppp.ko rather than some_other_module.ko renamed to ppp.ko via some
> other mechanism.

Right, and the best solution to this problem is to include in the
signature the name of the module.  (One way of doing this is adding
the module name into the cryptographic checksum which is then
digitally signed by the kernel module key; we would have to bump the
version number of the signature format, obviously.)  This binds the
name of the module into the digital signature so that it can't be
renamed.

- Ted


Re: [PATCH v5 next 1/5] modules:capabilities: add request_module_cap()

2017-11-29 Thread Theodore Ts'o
On Wed, Nov 29, 2017 at 09:50:14AM -0500, David Miller wrote:
> From: Alan Cox 
> Date: Wed, 29 Nov 2017 13:46:12 +
> 
> > I really don't care what the module loading rules end up with and
> > whether we add CAP_SYS_YET_ANOTHER_MEANINGLESS_FLAG but what is
> > actually needed is to properly incorporate it into securiy ruiles
> > for whatever LSM you are using.
> 
> I'm surprised we're not using the SHA1 hashes or whatever we compute
> for the modules to make sure we are loading the foo.ko that we expect
> to be.

We do have signed modules.  But this won't help us if the user is
using a distro kernel which has compiled some module that is known to
be unmaintained and that everyone in the know *expects* to have 0-day
bugs, such as DCCP.  That's because the DCCP module is signed.

We could fix this by extending the signature used for module signing
to include the module name, so that the bad guy can't rename dccp.ko
to be ppp.ko, I suppose...

> All of this capability stuff seems to dance a circle around the
> problem rather than fix it.

Half the problem here is that with containers, people are changing the
security model, because they want to let untrusted users have "root",
without really having "root".  Part of the fundamental problem is that
there are some well-meaning, but fundamentally misguided people, who
have been asserting: "Containers are just as secure as VM's".

Well, they are not.  And the sooner people get past this, the better
off they'll be

- Ted


Re: [kernel-hardening] Re: [PATCH v5 next 5/5] net: modules: use request_module_cap() to load 'netdev-%s' modules

2017-11-28 Thread Theodore Ts'o
On Tue, Nov 28, 2017 at 04:18:59PM -0800, Kees Cook wrote:
> There's also a difference between immutable CONFIG options that cannot
> be disabled at runtime, those that can, global sysctls, per-namespace
> controls, etc etc. The kernel is all about providing admins with knobs
> to tweak their performance and security. Suddenly being told that we
> can't create optional improvements is very odd.

I just think that tweakable knobs are mostly pointless.  From my
experience the number of sysadmins that adjust knobs is ***tiny***[1].
Put another way, the effort to determine whether tweaking a knob will
result in breakages or will be safe is as much work as creating a
white list of modules that are allowed to be loaded.

[1] And I say that having providing a lot of knobs for ext4.  :-)

This is why some on the kernel-hardening list have argued for making
the default to be opt-out, which means some users will be broken (and
their answer to that seems to be, "oh well --- gotta break some eggs
to make an omelette".  Sucks if you're one of the eggs, though.)

And I don't see how systemd magically means no one will be broken.  If
you have a non-root process trying to invoke a line discipline which
has to be loaded as a module, and you flip the switch, that process
will be broken.  How does using systemd make the problem go away?

- Ted


Re: [kernel-hardening] Re: [PATCH v5 next 5/5] net: modules: use request_module_cap() to load 'netdev-%s' modules

2017-11-28 Thread Theodore Ts'o
On Tue, Nov 28, 2017 at 03:29:01PM -0800, Kees Cook wrote:
> > So in these two cases, if the kernel was built w/o modules, and HDLC
> > and DCCP was built-in, you'd be screwed, then?
> 
> Sure, but that's not the common situation.
> 
> > Is the goal here to protect people using distro kernels which build
> > the world as modules, including dodgy pieces of kernel code that are
> > bug-ridden?
> 
> The bulk of the risk comes from distro kernels, yes. (Though "bug
> ridden" is a strong statement. There are and will be bugs, scattered
> across a wide portion of the kernel, it's just that modules tend to
> cover most of that attack surface.)

OK, but if the goal is to protect users who are running distro
kernels, then a kernel config that breaks some percentage of users is
highly unlikely to be enabled by Red Hat and SuSE, right?  And
many of these users either can't (from a skills perspective) or won't
(because they lose all support from the enterprise distro's help desk)
recompile their kernel to enable a "might break 3% of all users ---
oh, well" config option.

Which argues for being extremely conservative, and only shipping
something that has an extremely low probability of breaking users; and
it points out why Geo Kozey's "who cares about breaking users;
security is IMPORTANT" argument is so wrong-headed.

If the goal is to protect distro kernels, but a sloppy approach
guarantees that distro kernels will never enable it, is it really
worth it?

 - Ted

P.S.  This is where it might be useful to get some input from the Red
Hat and SuSE support teams.  How many angry user calls to their help
desk are they willing to field before they'll just turn off the kernel
config option for their kernels?


Re: [kernel-hardening] Re: [PATCH v5 next 5/5] net: modules: use request_module_cap() to load 'netdev-%s' modules

2017-11-28 Thread Theodore Ts'o
On Tue, Nov 28, 2017 at 01:33:40PM -0800, Kees Cook wrote:
> As I've said before, this isn't a theoretical attack surface. This
> year alone there have been three known-exploitable flaws exposed by
> autoloading:
> 
> The exploit for CVE-2017-2636 uses int n_hdlc = N_HDLC; ioctl(fd,
> TIOCSETD, &n_hdlc) [1]. This is using the existing "tty-ldisc-"
> prefix, and is intentionally unprivileged.
> 
> The exploit for CVE-2017-6074 uses socket(PF_INET6, SOCK_DCCP,
> IPPROTO_IP) [2]. This is using the existing proto prefix, and is
> intentionally unprivileged.

So in these two cases, if the kernel was built w/o modules, and HDLC
and DCCP were built in, you'd be screwed, then?

Is the goal here to protect people using distro kernels which build
the world as modules, including dodgy pieces of kernel code that are
bug-ridden?

If so, then presumably 90% of the problem you've cited can be solved
by a script which takes a look at the modules that are normally in use
once the machine is in production, and then deletes everything else?
Correct?

And yes, this will potentially break some users, but the security
folks who are advocating for the more aggressive version of this
change seem to be OK with breaking users, so they can do this without
making kernel changes.  Good luck getting Red Hat and SuSE to accept
such a change, though

- Ted


Re: [kernel-hardening] Re: [PATCH v5 next 5/5] net: modules: use request_module_cap() to load 'netdev-%s' modules

2017-11-28 Thread Theodore Ts'o
On Tue, Nov 28, 2017 at 01:16:59PM +0100, Geo Kozey wrote:
> 
> Userspace can be configured in a way which is compatible with those
> changes being on the same as it can be configured to work with
> selinux. That means on distro level or sysadmin level it's a
> valuable tool. It's better than nothing and it's better than using
> some out-of-tree patches instead. Switching one sysctl would make
> their life easier.

If *selinux* can opt in to something more stringent, such that when
you upgrade to a new version of selinux it enables something which
breaks all modules unless you set up the rules correctly, I don't see a
problem with it.  It might force distributions not to go to the latest
version of SELinux, because users get cranky when their systems get
broken.  As for people like me, who *still* don't use SELinux: every
few years I try to enable it on my development laptop running Debian,
watch ***far*** too much stuff break, and then turn it off again.  So
tying it to SELinux (as far as I am concerned) reduces it to a
previously unsolved problem.  :-)

But that's different from turning it on by default for non-SELinux
users.  To which I can only say, "Please, No."

  - Ted


Re: regression: 4.13 cannot follow symlinks on some ext3 fs

2017-11-27 Thread Theodore Ts'o
On Mon, Nov 27, 2017 at 08:14:27AM +1100, Dave Chinner wrote:
> Of course. I've done that every time I've come acros these sorts of
> problems.

The most recent report I was able to find was against 4.7-rc6, in July
2016.  Have you been able to reproduce it more recently than that?

Cheers,

- Ted


Re: regression: 4.13 cannot follow symlinks on some ext3 fs

2017-11-26 Thread Theodore Ts'o
On Sun, Nov 26, 2017 at 09:32:02AM +1100, Dave Chinner wrote:
> 
> They don't have any whacky symlinks around, but the modern ext4 code
> does try to eat these filesystems every so often. Extended operation
> at ENOSPC will eventually corrupt the rootfs and crash the kernel,
> and then I play the "e2fsck doesn't detect corruption, kernel does"
> game to get them fixed up and working again

If you have stack dumps or file system images where e2fsck doesn't
detect any problems but the kernel does, please do feel free to send
reports to the ext4 mailing list.

> I'm running with everything up to date (debian unstable) on these
> VMs, they are just an old filesystem because some distros have had
> reliable rolling updates for the entire life of these VMs. :P

Or if you can make the VM's available and tell me how you are
using/exercising them, I can try to see if I can repro the problem.

I am wondering how you are running into ENOSPC on the root file
systems; I take it this is much more than running xfstests?  Are you
running some benchmarks that are logging to the root file system, and
that's triggering the ENOSPC condition?

Thanks,

- Ted


Re: regression: 4.13 cannot follow symlinks on some ext3 fs

2017-11-24 Thread Theodore Ts'o
On Fri, Nov 24, 2017 at 08:51:02AM -0800, Andi Kleen wrote:
> > I think e2fsck can fix this quite easily, and there really isn't
> > an easy way to revert to the old method if the large xattr feature
> > is enabled.  If you are willing to run a new kernel, you should also
> > be willing to run a new e2fsck.
> 
> It's obviously not enabled on ext3.

Yes, I think it's clear we need to enable backwards compatibility
support for ext3 file systems, or indeed for all ext4 file systems that
don't have the large xattr feature.

We could have e2fsck offer to fix it, so long as it is being run
manually (e.g., not in preen mode), since it does have the benefit of
releasing unnecessarily allocated 4k blocks for symlinks which are <
60 bytes.

- Ted


Re: regression: 4.13 cannot follow symlinks on some ext3 fs

2017-11-23 Thread Theodore Ts'o
On Thu, Nov 23, 2017 at 12:33:30PM -0800, Andi Kleen wrote:
> 
> I have an older qemu VM image that i sometimes use for testing. It
> stopped booting with 4.13-4.14 because it couldn't run init.  
> It uses ext3 for the root file system.

Hmm, do you know roughly when (i.e., with what kernel version) this
image was created?  We had done quite a lot of research, and the
belief was that kernels would never create a "slow" symlink which was
less than 60 bytes.

Or was this image something that was created manually (e.g., using debugfs)?

  - Ted


Re: [GIT PULL] ext4 updates for 4.15

2017-11-14 Thread Theodore Ts'o
On Tue, Nov 14, 2017 at 12:59:17PM -0800, Linus Torvalds wrote:
> Of course,
> 
> (flags & EXT4_ENCRYPT_FL)
> 
> _should_ be the same as
> 
> ext4_test_inode_flag(inode, EXT4_INODE_ENCRYPT);

And the second is the preferred way to do things, actually.

> I'll do that suggested resolution, but I have to say that the ext4 bit
> testing is incredibly broken and non-obvious. Just as an example:
> 
>   fs/ext4/ext4.h:#define EXT4_ENCRYPT_FL  0x0800
> /* encrypted file */
>   fs/ext4/ext4.h: EXT4_INODE_ENCRYPT  = 11,   /* Encrypted file */
> 
> yeah, it's the same bit, but it sure as hell isn't obvious. Why the
> two totally different ways to define that data?

Yes, it's non-obvious and ugly.  Sorry about that.

We originally used EXT4_*_FL, and we needed to use the bit number
encoding so we could use test_bit().  We just never converted all the
way over.

We do have a way to make sure the two ways of defining a bit position
are in sync; see ext4_check_flag_values() and CHECK_FLAG_VALUE in
ext4.h.  It's a bit gross, and we probably should clean this up, at
least in the kernel.  The e2fsprogs user space libraries all use
EXT4_*_FL, and we can't change that without breaking userspace
applications that depend on it, but we can keep things consistent in
the kernel, and that probably means completely converting away from
EXT4_*_FL, if possible.
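
Paraphrasing (not verbatim) what that check looks like, for anyone who
hasn't seen it: each flag exists both as a mask and as a bit number,
and a build-time assertion keeps the two encodings from drifting
apart:

#include <linux/bug.h>

#define EXT4_ENCRYPT_FL		0x00000800	/* mask encoding */

enum {
	EXT4_INODE_ENCRYPT = 11,		/* bit-number encoding */
};

#define CHECK_FLAG_VALUE(FLAG) \
	BUILD_BUG_ON(EXT4_##FLAG##_FL != (1U << EXT4_INODE_##FLAG))

static inline void ext4_check_flag_values(void)
{
	CHECK_FLAG_VALUE(ENCRYPT);	/* fails to compile if they ever diverge */
}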

  - Ted


Re: [btrfs] WARNING: CPU: 0 PID: 6379 at fs/direct-io.c:293 dio_complete+0x1d4/0x220

2017-11-14 Thread Theodore Ts'o
On Mon, Nov 13, 2017 at 02:01:36PM -0800, Linus Torvalds wrote:
> 
> And in the end, maybe even the warning is pointless. You used
> direct-IO and cached IO at the same time, and you got coherency
> issues. What did you expect? directio is fundamentally broken.

A single warning per inode, "page cache coherency broken due to direct
I/O; userspace did a dumb thing; oh, well" is probably more than
sufficient.  The reason to have the warning is so that when the user
complains about the file system bug, we can point at the "userspace
did a dumb thing" warning message.

- Ted


Re: [GIT PULL] ext4 updates for 4.15

2017-11-13 Thread Theodore Ts'o
I forgot to mention, there's a merge conflict when pulling the ext4
and fscrypt trees.  The fixup is relatively straightforward:

commit daf886f04e60eda3bbc957e79d81d72965afd947
Merge: a0b3bc855374 232530680290
Author: Theodore Ts'o <ty...@mit.edu>
Date:   Sun Nov 12 22:03:15 2017 -0500

Merge tag 'ext4_for_linus' into test

Add support for online resizing of file systems with bigalloc.  Fix
two data corruption bugs involving DAX, as well as a corruption bug
after a crash during a racing fallocate and delayed allocation.
Finally, a number of cleanups and optimizations.

diff --cc fs/ext4/inode.c
index 617c7feced24,9f836e2ec18c..737c43d724fb
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@@ -4587,15 -4640,10 +4640,13 @@@ void ext4_set_inode_flags(struct inode 
new_fl |= S_NOATIME;
if (flags & EXT4_DIRSYNC_FL)
new_fl |= S_DIRSYNC;
-   if (test_opt(inode->i_sb, DAX) && S_ISREG(inode->i_mode) &&
-   !ext4_should_journal_data(inode) && !ext4_has_inline_data(inode) &&
-   !(flags & EXT4_ENCRYPT_FL))
+   if (ext4_should_use_dax(inode))
new_fl |= S_DAX;
 +  if (flags & EXT4_ENCRYPT_FL)
 +  new_fl |= S_ENCRYPTED;
inode_set_flags(inode, new_fl,
 -  S_SYNC|S_APPEND|S_IMMUTABLE|S_NOATIME|S_DIRSYNC|S_DAX);
 +  S_SYNC|S_APPEND|S_IMMUTABLE|S_NOATIME|S_DIRSYNC|S_DAX|
 +  S_ENCRYPTED);
  }
  
  static blkcnt_t ext4_inode_blocks(struct ext4_inode *raw_inode,


[GIT PULL] ext4 updates for 4.15

2017-11-12 Thread Theodore Ts'o
The following changes since commit 9e66317d3c92ddaab330c125dfe9d06eee268aff:

  Linux 4.14-rc3 (2017-10-01 14:54:54 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git 
tags/ext4_for_linus

for you to fetch changes up to 232530680290ba94ca37852ab10d9556ea28badf:

  ext4: improve smp scalability for inode generation (2017-11-08 22:23:20 -0500)


Add support for online resizing of file systems with bigalloc.  Fix
two data corruption bugs involving DAX, as well as a corruption bug
after a crash during a racing fallocate and delayed allocation.
Finally, a number of cleanups and optimizations.


Andreas Gruenbacher (3):
  iomap: Switch from blkno to disk offset
  iomap: Add IOMAP_F_DATA_INLINE flag
  ext4: Add iomap support for inline data

Christoph Hellwig (1):
  ext4: Switch to iomap for SEEK_HOLE / SEEK_DATA

Kees Cook (2):
  jbd2: convert timers to use timer_setup()
  ext4: convert timers to use timer_setup()

Pavel Machek (1):
  Documentation: fix little inconsistencies

Ross Zwisler (5):
  ext4: prevent data corruption with inline data + DAX
  ext4: prevent data corruption with journaling + DAX
  ext4: add sanity check for encryption + DAX
  ext4: add ext4_should_use_dax()
  ext4: remove duplicate extended attributes defs

Simon Ruderich (1):
  ext4: mention noload when recovering on read-only device

Theodore Ts'o (3):
  ext4: retry allocations conservatively
  ext4: fix interaction between i_size, fallocate, and delalloc after a 
crash
  ext4: improve smp scalability for inode generation

harshads (1):
  ext4: add support for online resizing with bigalloc

 Documentation/filesystems/ext4.txt |   8 +--
 fs/buffer.c|   4 +-
 fs/dax.c   |   2 +-
 fs/ext2/inode.c|   4 +-
 fs/ext4/Kconfig|   1 +
 fs/ext4/balloc.c   |  15 ++--
 fs/ext4/ext4.h |  50 ++---
 fs/ext4/extents.c  |   6 +-
 fs/ext4/file.c | 263 
-
 fs/ext4/ialloc.c   |   4 +-
 fs/ext4/inline.c   |  43 +---
 fs/ext4/inode.c| 153 
++--
 fs/ext4/ioctl.c|  30 
 fs/ext4/mballoc.c  |  28 
 fs/ext4/resize.c   | 104 +--
 fs/ext4/super.c|  27 +++
 fs/iomap.c |  13 ++--
 fs/jbd2/journal.c  |   9 ++-
 fs/nfsd/blocklayout.c  |   4 +-
 fs/xfs/xfs_iomap.c |   6 +-
 include/linux/iomap.h  |  15 ++--
 21 files changed, 278 insertions(+), 511 deletions(-)


[GIT PULL] fscrypt updates for 4.15

2017-11-12 Thread Theodore Ts'o
The following changes since commit 9e66317d3c92ddaab330c125dfe9d06eee268aff:

  Linux 4.14-rc3 (2017-10-01 14:54:54 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/tytso/fscrypt.git 
tags/fscrypt-for-linus

for you to fetch changes up to a0b3bc855374c50b5ea85273553485af48caf2f7:

  fscrypt: lock mutex before checking for bounce page pool (2017-10-31 13:49:25 -0400)


fscrypt: lots of cleanups, mostly courtesy of Eric Biggers


Dave Chinner (1):
  fscrypt: clean up include file mess

Eric Biggers (17):
  fs, fscrypt: add an S_ENCRYPTED inode flag
  fscrypt: switch from ->is_encrypted() to IS_ENCRYPTED()
  fscrypt: remove ->is_encrypted()
  fscrypt: remove unneeded empty fscrypt_operations structs
  fscrypt: new helper function - fscrypt_require_key()
  fscrypt: new helper function - fscrypt_file_open()
  fscrypt: new helper function - fscrypt_prepare_link()
  fscrypt: new helper function - fscrypt_prepare_rename()
  fscrypt: new helper function - fscrypt_prepare_lookup()
  fscrypt: new helper function - fscrypt_prepare_setattr()
  ext4: switch to fscrypt_file_open()
  ext4: switch to fscrypt_prepare_link()
  ext4: switch to fscrypt_prepare_rename()
  ext4: switch to fscrypt_prepare_lookup()
  ext4: switch to fscrypt_prepare_setattr()
  fscrypt: add a documentation file for filesystem-level encryption
  fscrypt: lock mutex before checking for bounce page pool

 Documentation/filesystems/fscrypt.rst | 610 ++
 Documentation/filesystems/index.rst   |  11 ++
 MAINTAINERS   |   1 +
 fs/crypto/Makefile|   2 +-
 fs/crypto/crypto.c|   9 +-
 fs/crypto/fname.c |   3 +-
 fs/crypto/fscrypt_private.h   |   3 +-
 fs/crypto/hooks.c | 112 
 fs/crypto/keyinfo.c   |   2 +-
 fs/crypto/policy.c|   6 +-
 fs/ext4/ext4.h|   8 +-
 fs/ext4/file.c|  23 +--
 fs/ext4/inode.c   |  19 +--
 fs/ext4/namei.c   |  62 ++-
 fs/ext4/super.c   |  15 +-
 fs/f2fs/f2fs.h|   9 +-
 fs/f2fs/inode.c   |   5 +-
 fs/f2fs/super.c   |   7 +-
 fs/ubifs/crypto.c |   1 -
 fs/ubifs/ioctl.c  |   5 +-
 fs/ubifs/super.c  |   8 +-
 fs/ubifs/ubifs.h  |  18 +-
 fs/ubifs/xattr.c  |   1 +
 include/linux/fs.h|   2 +
 include/linux/fscrypt.h   | 293 
 include/linux/fscrypt_common.h| 141 
 include/linux/fscrypt_notsupp.h   |  39 -
 include/linux/fscrypt_supp.h  |  17 +-
 28 files changed, 1154 insertions(+), 278 deletions(-)
 create mode 100644 Documentation/filesystems/fscrypt.rst
 create mode 100644 fs/crypto/hooks.c
 create mode 100644 include/linux/fscrypt.h
 delete mode 100644 include/linux/fscrypt_common.h


Re: [pmem_attach_disk] WARNING: CPU: 46 PID: 518 at kernel/memremap.c:363 devm_memremap_pages+0x350/0x4b0

2017-11-11 Thread Theodore Ts'o
On Mon, Oct 30, 2017 at 05:24:42PM -0700, Dan Williams wrote:
> 
> Something is going wrong with memmap= because you are not getting 1G
> aligned address ranges. I think you would have better luck switching
> to the official nvdimm emulation in qemu-kvm rather than relying on
> memmap= which is just a fragile / unreliable interface. In fact we
> should look to deprecate it and point everyone to use the standard
> methods. We just have a problem of legacy pre-ACPI6 platforms that
> have no other way than a kernel command line to identify persistent
> memory ranges.

Why is memmap fragile/unreliable?  I'm not using qemu-kvm for most of
my testing and right now memmap is the only way I can test ext4 DAX
codepaths.   :-/

- Ted


Re: Legal question: Author, Sign-off, Company Copyright and gmail

2017-11-09 Thread Theodore Ts'o
On Thu, Nov 09, 2017 at 10:45:27AM +0100, Daniel Lezcano wrote:
> 
> Hi all,
> 
> I noticed a practice when the patches are submitted where I'm a bit
> confused about how it fits with the DCO.
> 
> People are creating gmail accounts to send patches on behalf of their
> company because the company's email configuration does not allow to send
> patches or adds extra infos, or whatever...
> 
> That ends up with patches submitted by a gmail account with no history
> and verifiable origin and new files containing a company copyright [1].
> 
> At the first glance I would say, it is not allowed, and if a company is
> willing to do opensource, it should provide the tools to its employees
> to do so. But I don't want block patch submission if this practice is
> tolerated.

Note that git send-email will automatically take care of this case by
taking the From field from the author field of git, and moving it into
the body.  So you might have an e-mail message which looks like this:

From: @gmail.com
Subject: [PATCH] some patch
Date: April 1, 2017
Other-mail-headers: etc. etc.

From: User Lastname 

This is some patch!

...

And then git am will automatically use the first From field
in the "pseudo-header" instead of the From field in the mail header.

Other people will do this because they have a usern...@kernel.org
account, and they want all of their Kernel git commits to have
usern...@kernel.org in the author field, so they'll do the same thing
of putting "From: usern...@kernel.org" as the first line of the body
--- either manually, or by setting the appropriate git configuration
variables so that usern...@kernel.org gets used as the author
information in the git commit.

So what e-mail you use when you interact with the kernel mailing
lists, what e-mail address you want to show up in the git commit's
author field, and which company should get credit for your
contributions for a given period of time in the Linux Foundation's
"who writes Linux" report are three separable issues.

You could be using gmail.com for patch submission, use your kernel.org
address for git author fields, while working for nokia.com.  And
that's just fine.

Cheers,

- Ted


Re: Linux & FAT32 label

2017-11-09 Thread Theodore Ts'o
On Thu, Nov 09, 2017 at 10:01:45AM +0100, Pali Rohár wrote:
> > > You would have stored LABEL42 in boot sector and no label in root
> > > directory. Windows handle this situation as there is no label.
> > 
> > But why should we *care*?
> 
> FAT is Microsoft's filesystem and the only usage of it on Linux is due
> to interoperability with different non-Linux systems. So here we should
> implement FAT in the similar/same way as other systems. It does not make
> sense to implement it differently and specially in non-compatible way.
> Because it lost reason what is primary usage of the FAT on Linux.

The primary usage of FAT on Linux is for data interchange, primarily
with USB sticks.  I'm still failing to see how it might "spoil some
vast eternal plan"[1] if blkid/udev showed a USB stick as having the
label "LABEL42" when Windows would show it as having no label?

[1] to quote the song "If I were a rich man"

I'm just trying to understand why this specific detail of bit-for-bit
compatibility is so important.

- Ted


Re: WTF? Re: [PATCH] License cleanup: add SPDX GPL-2.0 license identifier to files with no license

2017-11-07 Thread Theodore Ts'o
On Tue, Nov 07, 2017 at 06:46:58PM +, Alan Cox wrote:
> > Given that it had no license text on it at all, it "defaults" to GPLv2,
> > so the GPLv2 SPDX identifier was added to it.
> > 
> > No copyright was changed, nothing at all happened except we explicitly
> > list the license of the file, instead of it being "implicit" before.
> 
> Well if Christoph owns the copyright (if there is one) and he has stated
> he believes it is too trivial to copyright then it needs an SPDX tag that
> indicates the rightsholder has stated it's too trivial to copyright and
> (by estoppel) revoked any right they might have to pursue a claim.

If Christoph has revoked any right to pursue a claim, then he's also
legally given up the right to complain if, say, Bradley Kuhn started
distributing a version with a GPLv3 permission statement --- or if Greg
K-H adds a GPLv2 SPDX identifier.  :-)

- Ted

> I'm sure there's a correct SPDX tag for that ;-)
> 
> Alan


Re: Linux & FAT32 label

2017-11-07 Thread Theodore Ts'o
On Sun, Nov 05, 2017 at 10:12:49PM +0100, Pali Rohár wrote:
> Easy way how to achieve this situation:
> 
> 1. use mkdosfs to format hard disk to FAT32 with label LABEL42
> 
> 2. boot Windows 10 (or XP) and set label of that FAT32 partition to
> empty (via Explorer GUI)
> 
> 3. profit
> 
> You would have stored LABEL42 in boot sector and no label in root
> directory. Windows handle this situation as there is no label.

But why should we *care*?

Puzzled,

- Ted


Re: Linux & FAT32 label

2017-11-05 Thread Theodore Ts'o
On Sun, Nov 05, 2017 at 03:07:45PM +0100, Pali Rohár wrote:
> 
> Current behavior of the last blkid and fatlabel tools is: Try to read
> label from the root directory. If it does not exist, then fallback to
> label stored in boot sector. And when fatlabel is changing label it
> updates both locations.
> 
> So tools which already uses fatlabel for get & set operations should not
> be affected as setting new label makes boot and root in sync.
> 
> New proposed behavior is: Try to read label from the root directory. If
> not exist, then treat disk as without label.

Why is it important to ignore the label from the boot sector?  What is
the situation where if there is not a label in the root directory, and
there is a label in the boot sector, it is the Wrong Thing to return it?

For that matter, aside from a diskette from DOS 3.x (where using the
label from the boot sector *is* the right thing), why/when would we
ever have a label in the boot sector and not in the root directory?

   - Ted


Re: Is 115200 still the maximum baudrate?

2017-11-02 Thread Theodore Ts'o
On Thu, Nov 02, 2017 at 04:42:56PM +0100, Paul Menzel wrote:
> 
> The Linux serial console documentation [1] says that 115200 is the maximum
> supported baudrate.
> 
> > The maximum baudrate is 115200.
> 
> Is that still accurate? If yes, where should I look to support higher
> values?

See the setserial man page and the spd_* options:

spd_hi
Use 57.6kb when the application requests 38.4kb. This parameter
may be specified by a non-privileged user.
 
spd_vhi
Use 115kb when the application requests 38.4kb. This parameter may
be specified by a non-privileged user.

spd_shi
Use 230kb when the application requests 38.4kb. This parameter may
be specified by a non-privileged user.

spd_warp
Use 460kb when the application requests 38.4kb. This parameter may
be specified by a non-privileged user.

- Ted


Re: [PATCH RFC] random: fix syzkaller fuzzer test int overflow

2017-10-30 Thread Theodore Ts'o
On Mon, Oct 30, 2017 at 08:39:56AM +0100, Greg KH wrote:
> 
> No "Reported-by:"?

Good point, fixed in my tree.

- Ted


Re: [PATCH RFC] random: fix syzkaller fuzzer test int overflow

2017-10-29 Thread Theodore Ts'o
On Sat, Oct 28, 2017 at 11:22:00AM +0800, Chen Feng wrote:
> 
> I checked the ioctl. What's the purpose of RNDADDTOENTCNT ioctl to
> userspace?

It's a legacy ioctl which is probably not used anywhere; it's been
replaced by RNDADDENTROPY.  It allowed root to bump the entropy
estimate, but the right way for rngd to do this is to atomically add
entropy to the pool and bump the entropy estimate at the same time.
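
For illustration, a minimal userspace sketch of the RNDADDENTROPY usage
described above (editorial example, not from this thread; the helper name
and error handling are made up):

#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/random.h>       /* RNDADDENTROPY, struct rand_pool_info */

/* Add len bytes to the input pool and credit "bits" of entropy atomically. */
static int add_entropy(const void *data, int len, int bits)
{
        struct rand_pool_info *req;
        int fd, ret;

        req = malloc(sizeof(*req) + len);
        if (!req)
                return -1;
        req->entropy_count = bits;      /* credited entropy, in bits */
        req->buf_size = len;            /* payload length, in bytes */
        memcpy(req->buf, data, len);

        fd = open("/dev/random", O_WRONLY);     /* requires CAP_SYS_ADMIN */
        if (fd < 0) {
                free(req);
                return -1;
        }
        ret = ioctl(fd, RNDADDENTROPY, req);
        close(fd);
        free(req);
        return ret;
}

The point, per the reply above, is that the payload and the credited bits
travel in one ioctl, so the pool contents and the entropy estimate cannot
get out of sync.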

The UBSAN is harmless.  The ioctl requires root, and the entropy_total
field, which is involved in the UBSAN, is only used in the first few
seconds of boot, to determine when the entropy pool has been
initialized.  In general, on desktops and servers this happens before
userspace has a chance to run.

In any case, here's a fix for this.

- Ted

commit 6f7034d0c52e21f30002b95126b6b98e4618dc57
Author: Theodore Ts'o <ty...@mit.edu>
Date:   Sun Oct 29 14:17:26 2017 -0400

random: use a tighter cap in credit_entropy_bits_safe()

This fixes a harmless UBSAN where root could potentially end up
causing an overflow while bumping the entropy_total field (which is
ignored once the entropy pool has been initialized, and this generally
is completed during the boot sequence).

This is marginal for the stable kernel series, but it's a really
trivial patch, and it fixes a UBSAN warning that might cause security
folks to get overly excited for no reason.

Signed-off-by: Theodore Ts'o <ty...@mit.edu>
Cc: sta...@vger.kernel.org

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 8ad92707e45f..ae8a2f829890 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -733,7 +733,7 @@ static void credit_entropy_bits(struct entropy_store *r, int nbits)
 
 static int credit_entropy_bits_safe(struct entropy_store *r, int nbits)
 {
-   const int nbits_max = (int)(~0U >> (ENTROPY_SHIFT + 1));
+   const int nbits_max = r->poolinfo->poolwords * 32;
 
if (nbits < 0)
return -EINVAL;


Re: [PATCH] fscrypt: lock mutex before checking for bounce page pool

2017-10-29 Thread Theodore Ts'o
On Thu, Jul 06, 2017 at 10:57:48AM -0700, Eric Biggers wrote:
> From: Eric Biggers 
> 
> fscrypt_initialize(), which allocates the global bounce page pool when
> an encrypted file is first accessed, uses "double-checked locking" to
> try to avoid locking fscrypt_init_mutex.  However, it doesn't use any
> memory barriers, so it's theoretically possible for a thread to observe
> a bounce page pool which has not been fully initialized.  This is a
> classic bug with "double-checked locking".
> 
> While "only a theoretical issue" in the latest kernel, in pre-4.8
> kernels the pointer that was checked was not even the last to be
> initialized, so it was easily possible for a crash (NULL pointer
> dereference) to happen.  This was changed only incidentally by the large
> refactor to use fs/crypto/.
> 
> Solve both problems in a trivial way that can easily be backported: just
> always take the mutex.  It's theoretically less efficient, but it
> shouldn't be noticeable in practice as the mutex is only acquired very
> briefly once per encrypted file.
> 
> Later I'd like to make this use a helper macro like DO_ONCE().  However,
> DO_ONCE() runs in atomic context, so we'd need to add a new macro that
> allows blocking.
> 
> Cc: sta...@vger.kernel.org # v4.1+
> Signed-off-by: Eric Biggers 

Applied, thanks.  Sorry for the delay; this slipped through the
cracks, and then I had a crazy travel/conference schedule.

 - Ted
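
For illustration, an editorial sketch of the pattern the applied patch
adopts (not the actual fs/crypto code; the type and allocator below are
hypothetical): always take the mutex, and publish the pointer only once
the object is fully built.

#include <linux/errno.h>
#include <linux/mutex.h>

struct bounce_pool;                                     /* hypothetical type */
extern struct bounce_pool *alloc_bounce_pool(void);     /* hypothetical helper */

static struct bounce_pool *pool;
static DEFINE_MUTEX(init_mutex);

static int pool_init(void)
{
        int err = 0;

        /* Always take the mutex; no unsynchronized "already set up?" peek. */
        mutex_lock(&init_mutex);
        if (!pool) {
                struct bounce_pool *p = alloc_bounce_pool();

                if (!p)
                        err = -ENOMEM;
                else
                        pool = p;       /* publish only the fully built object */
        }
        mutex_unlock(&init_mutex);
        return err;
}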


Re: [PATCH RFC] random: fix syzkaller fuzzer test int overflow

2017-10-26 Thread Theodore Ts'o
On Thu, Oct 26, 2017 at 04:25:15PM +0800, Chen Feng wrote:
> 
> 
> On 2017/10/25 16:49, Theodore Ts'o wrote:
> > Other people who have sent me fuzzer test reproducers are able to
> > turn syzkaller logs into a simple C program.  Can you explain to
> > me what the heck:
> > 
> >> r3 = syz_open_dev$urandom(&(0x7f00a000)="2f6465762f7572616e646f6d00", 
> >> 0x0, 0x0)
> > 
> > means?
> 
> Take a look at this:
> 
> https://github.com/google/syzkaller/blob/master/sys/linux/random.txt

Sorry, this *still* looks like gobbledygook.

What ioctls are you executing, and with what arguments?

*Please*, give me a C program I can compile.

 -Ted


Re: [PATCH RFC] random: fix syzkaller fuzzer test int overflow

2017-10-25 Thread Theodore Ts'o
Other people who have sent me fuzzer test reproducers are able to
turn syzkaller logs into a simple C program.  Can you explain to
me what the heck:

> r3 = syz_open_dev$urandom(&(0x7f00a000)="2f6465762f7572616e646f6d00", 
> 0x0, 0x0)

means?

- Ted


Re: [PATCH RFC] random: fix syzkaller fuzzer test int overflow

2017-10-24 Thread Theodore Ts'o
On Tue, Oct 24, 2017 at 11:09:27AM +0200, Greg KH wrote:
> On Tue, Oct 24, 2017 at 03:44:17PM +0800, Chen Feng wrote:
> > [pid:11940,cpu6,syz-executor][flp_ioctl]cmd[0x1]
> > Restart is not permit
> > =
> > UBSAN: Undefined behaviour in
> > kernel/linux-4.4/drivers/char/random.c:676:19
> > signed integer overflow:
> > 2147483645 + 268435455 cannot be represented in type 'int'
> > CPU: 4 PID: 11941 Comm: syz-executor Not tainted 4.4.76+ #2
> 
> Does this also happen on 4.14-rc6?

No.  It was fixed in 4.8, by commit 86a574de4590: "random: strengthen
input validation for RNDADDTOENTCNT".

- Ted


Re: [GIT PULL] Documentation: Add a file explaining the requested Linux kernel license enforcement policy

2017-10-23 Thread Theodore Ts'o
On Mon, Oct 23, 2017 at 04:11:14PM +0300, Pavel Nikulin wrote:
> The people signing there effectively say: "we, to big extend, limit
> our options to call for expedient permanent license revocation" - the
> only thing that will ever tickle a commercial entity.

This makes no sense.  If a commercial entity is permanently in
violation, then the license stays revoked.  The whole *point* of
ethical copyright enforcement is that if the commercial entity comes
into compliance, they also have a way to get their ability to use the
GPL'ed code in question restored.

> When I first heard news of this and had a glancing look on the
> document, it looked to me almost as if it is a change to terms made on
> behalf of the "community," only after I read the document through few
> times over, I got an understanding of the semantics there.

So you're not a lawyer and have not consulted a lawyer, and yet you
feel comfortable making pronouncements about legal implications?  Just
trying to be clear...

- Ted


Re: [PATCH] ext4: Log inode exhaustion to dmesg

2017-10-23 Thread Theodore Ts'o
Hi,

The Signed-off-by needs to contain your real name (sorry, no
pseudonyms or anonymous contributions.)  That's because it has a
formal legal meaning.  See Section 11: "Sign your work - the
Developer’s Certificate of Origin" of the Submitting Patches
documentation:

https://www.kernel.org/doc/html/latest/process/submitting-patches.html

(Also, you should use ext4_warning() and not a bare printk).

- Ted
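
For illustration, a hedged sketch of the direction suggested above
(hypothetical placement at the tail of ext4_create(); the exact condition
and wording are editorial, not a reviewed patch):

        if (err == -ENOSPC && ext4_should_retry_alloc(dir->i_sb, &retries))
                goto retry;
        if (err == -ENOSPC && ext4_count_free_inodes(dir->i_sb) == 0)
                /* report via the fs-aware helper instead of a bare printk() */
                ext4_warning(dir->i_sb,
                             "inode table full, cannot create new files");
        return err;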

On Mon, Oct 23, 2017 at 05:32:47PM +0530, Team Athena wrote:
> Make a log in dmesg when file creation fails due to no free inodes.
> The error code for both "out of disk space" and "out of inode" is the same.
> This is misleading to the user. Logging the exact reason helps to find and
> correct the issue from the users' side.
> 
> Fix bug 197335 - https://bugzilla.kernel.org/show_bug.cgi?id=197335
> 
> Signed-off-by: Team Athena 
> ---
>  fs/ext4/namei.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
> index c1cf020d..c3990d2d 100644
> --- a/fs/ext4/namei.c
> +++ b/fs/ext4/namei.c
> @@ -2463,6 +2463,8 @@ static int ext4_create(struct inode *dir, struct dentry 
> *dentry, umode_t mode,
>   ext4_journal_stop(handle);
>   if (err == -ENOSPC && ext4_should_retry_alloc(dir->i_sb, ))
>   goto retry;
> + else if (err == -ENOSPC && printk_ratelimited())
> + printk(pr_warning "ext4: No space on disk, inode usage full");
>   return err;
>  }
>  
> -- 
> 2.11.0
> 


Re: [PATCH] fs: fix xattr permission checking error

2017-10-21 Thread Theodore Ts'o
On Sat, Oct 21, 2017 at 03:39:47PM +0200, Nicolas Belouin wrote:
> Fix an issue making trusted xattr world readable and other
> cap_sys_admin only
>

NACK.  It is *documented* that trusted xattrs are only supposed to be
readable by root.

- Ted


Re: [PATCH] fs: check for DAC_READ_SEARCH instead of SYS_ADMIN

2017-10-21 Thread Theodore Ts'o
On Sat, Oct 21, 2017 at 03:24:46PM +0200, Nicolas Belouin wrote:
> These checks are meant to prevent leaks or attacks via directory
> traversal, the use of CAP_SYS_ADMIN here is a misuse,
> CAP_DAC_READ_SEARCH being way more appropriate as a process
> with CAP_DAC_READ_SEARCH is entrusted with going trough all directories.
> CAP_SYS_ADMIN is not meant to flag such a process.
> 
> Signed-off-by: Nicolas Belouin 

No.  lookup_dcookie() is a horrid, horrid hack which is
*spectacularly* dangerous.  We should not be trying to encourage its
use for anything besides its single legacy user, oprofile(8), for which
CAP_SYS_ADMIN is appropriate.

- Ted


Re: Documentation: fix little inconsistencies

2017-10-18 Thread Theodore Ts'o
On Sat, Sep 16, 2017 at 01:48:37PM +0200, Pavel Machek wrote:
> Fix little inconsistencies in Documentation: make case and spacing
> match surrounding text.
> 
> Signed-off-by: Pavel Machek 
> Reviewed-by: Darrick J. Wong 

Applied, thanks.

- Ted


Re: [PATCH] ext4: Convert timers to use timer_setup()

2017-10-18 Thread Theodore Ts'o
On Mon, Oct 16, 2017 at 05:00:19PM -0700, Kees Cook wrote:
> In preparation for unconditionally passing the struct timer_list pointer to
> all timer callbacks, switch to using the new timer_setup() and from_timer()
> to pass the timer pointer explicitly.
> 
> Cc: "Theodore Ts'o" <ty...@mit.edu>
> Cc: Andreas Dilger <adilger.ker...@dilger.ca>
> Cc: linux-e...@vger.kernel.org
> Signed-off-by: Kees Cook <keesc...@chromium.org>
> Reviewed-by: Jan Kara <j...@suse.cz>

Applied, thanks.

- Ted
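
For illustration, a minimal editorial sketch of the timer_setup()/from_timer()
conversion pattern these patches apply (structure and field names are made up):

#include <linux/jiffies.h>
#include <linux/timer.h>

struct my_dev {                         /* hypothetical structure */
        struct timer_list timer;
        int pending;
};

static void my_timeout(struct timer_list *t)
{
        /* from_timer() recovers the containing structure from the timer */
        struct my_dev *dev = from_timer(dev, t, timer);

        dev->pending = 0;
}

static void my_start(struct my_dev *dev)
{
        /* replaces setup_timer(&dev->timer, fn, (unsigned long)dev) */
        timer_setup(&dev->timer, my_timeout, 0);
        mod_timer(&dev->timer, jiffies + HZ);
}

The callback no longer receives an opaque unsigned long; it gets the
timer_list pointer and derives its context with from_timer().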


Re: [PATCH] jbd2: Convert timers to use timer_setup()

2017-10-18 Thread Theodore Ts'o
On Tue, Oct 10, 2017 at 01:20:50PM +0200, Jan Kara wrote:
> On Wed 04-10-17 17:48:46, Kees Cook wrote:
> > In preparation for unconditionally passing the struct timer_list pointer to
> > all timer callbacks, switch to using the new timer_setup() and from_timer()
> > to pass the timer pointer explicitly.
> > 
> > Cc: "Theodore Ts'o" <ty...@mit.edu>
> > Cc: Jan Kara <j...@suse.com>
> > Cc: linux-e...@vger.kernel.org
> > Cc: Thomas Gleixner <t...@linutronix.de>
> > Signed-off-by: Kees Cook <keesc...@chromium.org>
> 
> The patch looks good to me. You can add:
> 
> Reviewed-by: Jan Kara <j...@suse.cz>

Applied, thanks.

- Ted


Re: [kernel-hardening] Re: [PATCH v2] printk: hash addresses printed with %p

2017-10-18 Thread Theodore Ts'o
On Wed, Oct 18, 2017 at 01:28:05PM +1100, Tobin C. Harding wrote:
> > >> Does %p[FfSs] leak addresses? Well, I guess it does if they are not
> > >> found in kallsyms, but otherwise you have:
> > >>
> > >>   function+0x
> > >
> > 
> > They haven't traditionally been a big deal. If they turn out to be
> > problematic, we can deal with it then, IMO.

If it's not in kallsyms, the raw address is probably not going to be
terribly useful --- so even if it hasn't traditionally been a big deal,
why not just hash the address if it's not going to be printed as "function+0x"?

If nothing else, it will help correlate the random address with other
places where it was printed via %p.

- Ted
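
For illustration, an editorial sketch of the difference being discussed
(hypothetical function, not from the thread):

#include <linux/printk.h>

static void show(void *obj, void (*fn)(void))
{
        /* %pS resolves through kallsyms: "function+0x<offset>/<size>" if known */
        pr_info("callback: %pS\n", (void *)fn);
        /* a bare %p prints the raw address; that value is what would be hashed */
        pr_info("object:   %p\n", obj);
}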


Re: [PATCH v2 5/5] ext4: remove duplicate extended attributes defs

2017-10-12 Thread Theodore Ts'o
On Mon, Sep 11, 2017 at 11:05:26PM -0600, Ross Zwisler wrote:
> The following commit:
> 
> commit 9b7365fc1c82 ("ext4: add FS_IOC_FSSETXATTR/FS_IOC_FSGETXATTR
> interface support")
> 
> added several defines related to extended attributes to ext4.h.  They were
> added within an #ifndef FS_IOC_FSGETXATTR block with the comment:
> 
> /* Until the uapi changes get merged for project quota... */
> 
> Those uapi changes were merged by this commit:
> 
> commit 334e580a6f97 ("fs: XFS_IOC_FS[SG]SETXATTR to FS_IOC_FS[SG]ETXATTR
> promotion")
> 
> so all the definitions needed by ext4 are available in
> include/uapi/linux/fs.h.  Remove the duplicates from ext4.h.
> 
> Signed-off-by: Ross Zwisler <ross.zwis...@linux.intel.com>
> Reviewed-by: Jan Kara <j...@suse.cz>
> Cc: Li Xi <l...@ddn.com>
> Cc: Theodore Ts'o <ty...@mit.edu>
> Cc: Andreas Dilger <adil...@dilger.ca>
> Cc: Jan Kara <j...@suse.cz>
> Cc: Dave Chinner <dchin...@redhat.com>

Thanks, applied.

- Ted


Re: [PATCH v2 4/5] ext4: add ext4_should_use_dax()

2017-10-12 Thread Theodore Ts'o
On Tue, Sep 12, 2017 at 08:46:12AM +0200, Jan Kara wrote:
> On Mon 11-09-17 23:05:25, Ross Zwisler wrote:
> > This helper, in the spirit of ext4_should_dioread_nolock() et al., replaces
> > the complex conditional in ext4_set_inode_flags().
> > 
> > Signed-off-by: Ross Zwisler 
> 
> Yeah, makes sense to me. You can add:
> 
> Reviewed-by: Jan Kara 

Thanks, applied.

- Ted
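
For reference, an editorial sketch of such a helper, reconstructed from the
old inline conditional visible in the merge diff earlier in this archive
(details are approximate):

static bool ext4_should_use_dax(struct inode *inode)
{
        if (!test_opt(inode->i_sb, DAX))
                return false;
        if (!S_ISREG(inode->i_mode))
                return false;
        if (ext4_should_journal_data(inode))
                return false;
        if (ext4_has_inline_data(inode))
                return false;
        if (ext4_encrypted_inode(inode))
                return false;
        return true;
}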


Re: [PATCH v2 2/5] ext4: prevent data corruption with journaling + DAX

2017-10-12 Thread Theodore Ts'o
On Mon, Sep 11, 2017 at 11:05:23PM -0600, Ross Zwisler wrote:
> The current code has the potential for data corruption when changing an
> inode's journaling mode, as that can result in a subsequent unsafe change
> in S_DAX.
> 
> I've captured an instance of this data corruption in the following fstest:
> 
> https://patchwork.kernel.org/patch/9948377/
> 
> Prevent this data corruption from happening by disallowing changes to the
> journaling mode if the '-o dax' mount option was used.  This means that for
> a given filesystem we could have a mix of inodes using either DAX or
> data journaling, but whatever state the inodes are in will be held for the
> duration of the mount.
> 
> Signed-off-by: Ross Zwisler 
> Suggested-by: Jan Kara 

Thanks, applied.

- Ted


Re: [PATCH v2 1/5] ext4: prevent data corruption with inline data + DAX

2017-10-12 Thread Theodore Ts'o
On Mon, Sep 11, 2017 at 11:05:22PM -0600, Ross Zwisler wrote:
> If an inode has inline data it is currently prevented from using DAX by a
> check in ext4_set_inode_flags().  When the inode grows inline data via
> ext4_create_inline_data() or removes its inline data via
> ext4_destroy_inline_data_nolock(), the value of S_DAX can change.
> 
> Currently these changes are unsafe because we don't hold off page faults
> and I/O, write back dirty radix tree entries and invalidate all mappings.
> There are also issues with mm-level races when changing the value of S_DAX,
> as well as issues with the VM_MIXEDMAP flag:
> 
> https://www.spinics.net/lists/linux-xfs/msg09859.html
> 
> The unsafe transition of S_DAX can reliably cause data corruption, as shown
> by the following fstest:
> 
> https://patchwork.kernel.org/patch/9948381/
> 
> Fix this issue by preventing the DAX mount option from being used on
> filesystems that were created to support inline data.  Inline data is an
> option given to mkfs.ext4.
> 
> Signed-off-by: Ross Zwisler 
> CC: sta...@vger.kernel.org

Thanks, applied.

- Ted
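
For illustration, a rough editorial sketch of the kind of mount-time check
the patch describes (placement and message are approximations, not the
actual commit):

        /* in ext4_fill_super(), in the block handling "-o dax" */
        if (sbi->s_mount_opt & EXT4_MOUNT_DAX) {
                if (ext4_has_feature_inline_data(sb)) {
                        ext4_msg(sb, KERN_ERR,
                                 "Cannot use DAX on a filesystem that may contain inline data");
                        goto failed_mount;
                }
        }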

