Re: [PATCH v2 08/92] fs: new helper: simple_rename_timestamp

2023-08-29 Thread Al Viro
On Wed, Jul 05, 2023 at 02:58:11PM -0400, Jeff Layton wrote:

> + * POSIX mandates that the old and new parent directories have their ctime and
> + * mtime updated, and that inodes of @old_dentry and @new_dentry (if any), have
> + * their ctime updated.

APPLICATION USAGE
Some implementations mark for update the last file status change timestamp
of renamed files and some do not. Applications which make use of the
last file status change timestamp may behave differently with respect
to renamed files unless they are designed to allow for either behavior.

So for children POSIX permits rather than mandates.  Doesn't really matter;
Linux behaviour had been to touch ctime on children since way back, if
not since the very beginning.


Re: [PATCH] net: unexport csum_and_copy_{from,to}_user

2022-05-17 Thread Al Viro
On Thu, Apr 21, 2022 at 09:04:40AM +0200, Christoph Hellwig wrote:
> csum_and_copy_from_user and csum_and_copy_to_user are exported by
> a few architectures, but not actually used in modular code.  Drop
> the exports.
> 
> Signed-off-by: Christoph Hellwig 

Acked-by: Al Viro 

Not sure which tree it should go through - Arnd's, perhaps?


Re: [PATCH v2 00/18] clean up asm/uaccess.h, kill set_fs for good

2022-02-17 Thread Al Viro
On Thu, Feb 17, 2022 at 08:49:59AM +0100, Arnd Bergmann wrote:

> Same here: architectures can already provide a __put_user_fn()
> and __get_user_fn(), to get the generic versions of the interface,
> but few architectures use that. You can actually get all the interfaces
> by just providing raw_copy_from_user() and raw_copy_to_user(),
> but the get_user/put_user versions you get from that are fairly
> inefficient.

FWIW, __{get,put}_user_{8,16,32,64} would probably make it easier to
unify.  That's where the really variable part tends to be, anyway.
IMO __get_user_fn() had been a mistake.
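
FWIW, the size dispatch in the generic layer would then be trivial - a rough
sketch, with the per-size helpers being the hypothetical arch-provided bits:

#define __get_user(x, ptr)						\
({									\
	long __gu_err;							\
	switch (sizeof(*(ptr))) {					\
	case 1: __gu_err = __get_user_8((x), (ptr)); break;		\
	case 2: __gu_err = __get_user_16((x), (ptr)); break;		\
	case 4: __gu_err = __get_user_32((x), (ptr)); break;		\
	case 8: __gu_err = __get_user_64((x), (ptr)); break;		\
	default: BUILD_BUG(); __gu_err = -EFAULT; break;		\
	}								\
	__gu_err;							\
})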

One thing I somewhat dislike about the series is the boilerplate in
asm/uaccess.h instances - #include  in
a lot of them might make sense as a transitory state, but getting
stuck with those indefinitely...

BTW, do we need user_addr_max() anymore?  The definition in
asm-generic/access-ok.h is the only one, so ifndef around it is pointless.


Re: [PATCH v2 00/18] clean up asm/uaccess.h, kill set_fs for good

2022-02-17 Thread Al Viro
On Thu, Feb 17, 2022 at 07:20:11AM +, Christophe Leroy wrote:

> And we have also 
> user_access_begin()/user_read_access_begin()/user_write_access_begin() 
> which call access_ok() then do the real work. Could be made generic with 
> call to some arch specific __user_access_begin() and friends after the 
> access_ok() and eventually the might_fault().

Not a good idea, considering the fact that we do not want to invite
uses of "faster" variants...
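
For context, the usual shape of a user_access_begin() user is (sketch; the
function name is made up):

static int bump_user_counter(int __user *uptr)
{
	int val;

	if (!user_access_begin(uptr, sizeof(*uptr)))	/* this does the access_ok() */
		return -EFAULT;
	unsafe_get_user(val, uptr, Efault);		/* no per-access checking */
	unsafe_put_user(val + 1, uptr, Efault);
	user_access_end();
	return 0;
Efault:
	user_access_end();
	return -EFAULT;
}

The cheap unsafe_*_user() accesses are only safe because that access_ok() has
been done; that's exactly why a "skip the check" variant is not something to
advertise.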


Re: [PATCH 09/14] m68k: drop custom __access_ok()

2022-02-14 Thread Al Viro
On Tue, Feb 15, 2022 at 07:29:42AM +0100, Christoph Hellwig wrote:
> On Tue, Feb 15, 2022 at 12:37:41AM +0000, Al Viro wrote:
> > Perhaps simply wrap that sucker into #ifdef CONFIG_CPU_HAS_ADDRESS_SPACES
> > (and trim the comment down to "coldfire and 68000 will pick generic
> > variant")?
> 
> I wonder if we should invert CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE,
> select the separate address space config for s390, sparc64, non-coldfire
> m68k and mips with EVA and then just have one single access_ok for
> overlapping address space (as added by Arnd) and non-overlapping ones
> (always return true).

parisc is also such...  How about

select ALTERNATE_SPACE_USERLAND

for that bunch?  While we are at it, how many unusual access_ok() instances are
left after this series?  arm64, itanic, um, anything else?
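
FWIW, with such a symbol the generic header could boil down to something like
this (sketch only; the config name above is just a strawman):

#ifdef CONFIG_ALTERNATE_SPACE_USERLAND
static inline int __access_ok(const void __user *ptr, unsigned long size)
{
	return 1;	/* separate user/kernel address spaces - nothing to check */
}
#else
static inline int __access_ok(const void __user *ptr, unsigned long size)
{
	unsigned long limit = TASK_SIZE_MAX;
	unsigned long addr = (unsigned long)ptr;

	return (size <= limit) && (addr <= (limit - size));
}
#endif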

FWIW, sparc32 has a slightly unusual instance (see uaccess_32.h there); it's
obviously cheaper than generic and I wonder if the trick is legitimate (and
applicable elsewhere, perhaps)...


Re: [PATCH 14/14] uaccess: drop set_fs leftovers

2022-02-14 Thread Al Viro
On Mon, Feb 14, 2022 at 05:34:52PM +0100, Arnd Bergmann wrote:
> diff --git a/arch/parisc/include/asm/futex.h b/arch/parisc/include/asm/futex.h
> index b5835325d44b..2f4a1b1ef387 100644
> --- a/arch/parisc/include/asm/futex.h
> +++ b/arch/parisc/include/asm/futex.h
> @@ -99,7 +99,7 @@ futex_atomic_cmpxchg_inatomic(u32 *uval, u32 __user *uaddr,
>   /* futex.c wants to do a cmpxchg_inatomic on kernel NULL, which is
>    * our gateway page, and causes no end of trouble...
>    */
> - if (uaccess_kernel() && !uaddr)
> + if (!uaddr)
>   return -EFAULT;

Huh?  uaccess_kernel() is being removed because it's always false now, so
keeping the !uaddr check on its own looks odd.

AFAICS, the comment above that check refers to futex_detect_cmpxchg()
-> cmpxchg_futex_value_locked() -> futex_atomic_cmpxchg_inatomic() call chain.
Which had been gone since commit 3297481d688a (futex: Remove futex_cmpxchg
detection).  The comment *and* the check should've been killed off back
then.
Let's make sure to get both now...


Re: [PATCH 04/14] x86: use more conventional access_ok() definition

2022-02-14 Thread Al Viro
On Mon, Feb 14, 2022 at 08:17:07PM +, Al Viro wrote:
> On Mon, Feb 14, 2022 at 12:01:05PM -0800, Linus Torvalds wrote:
> > On Mon, Feb 14, 2022 at 11:46 AM Arnd Bergmann  wrote:
> > >
> > > As Al pointed out, they turned out to be necessary on sparc64, but the only
> > > definitions are on sparc64 and x86, so it's possible that they serve a similar
> > > purpose here, in which case changing the limit from TASK_SIZE to
> > > TASK_SIZE_MAX is probably wrong as well.
> > 
> > x86-64 has always(*) used TASK_SIZE_MAX for access_ok(), and the
> > get_user() assembler implementation does the same.
> > 
> > I think any __range_not_ok() users that use TASK_SIZE are entirely
> > historical, and should be just fixed.
> 
> IIRC, that was mostly userland stack trace collection in perf.
> I'll try to dig in archives and see what shows up - it's been
> a while ago...

After some digging:

access_ok() needs only to make sure that MMU won't go anywhere near
the kernel page tables; address limit for 32bit threads is none of its
concern, so TASK_SIZE_MAX is right for it.

valid_user_frame() in arch/x86/events/core.c: used while walking
the userland call chain.  The reason it's not access_ok() is only that
perf_callchain_user() might've been called from an interrupt that came while
we'd been under KERNEL_DS.  That had been true back in 2015, but it has been
obsolete since 2017, commit 88b0193d9418 (perf/callchain: Force USER_DS when
invoking perf_callchain_user()).  We've been guaranteed USER_DS ever since.
IOW, it could've reverted to use of access_ok() at any point after that.
TASK_SIZE vs TASK_SIZE_MAX is pretty much an accident there - might've been
TASK_SIZE_MAX from the very beginning.
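
IOW, something like this (untested sketch) would do there nowadays:

static bool valid_user_frame(const void __user *fp, unsigned long size)
{
	return access_ok(fp, size);
}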

copy_stack_frame() in arch/x86/kernel/stacktrace.c: similar story,
except that the commit which made sure callers will have USER_DS is
cac9b9a4b083 (stacktrace: Force USER_DS for stack_trace_save_user()) in this
case.  Also could've been using access_ok() just fine.  Amusingly, access_ok()
used to be there, until it was replaced with an explicit check on
Jul 22 2019 - 4 days after the fix in the caller had made that useless...

copy_from_user_nmi().  That one is a bit more interesting.
We have a call chain from perf_output_sample_ustack() (covered by
force_uaccess_begin() these days, not that it mattered for x86 now),
there's something odd in dumpstack.c:copy_code() (with explicit check
for TASK_SIZE_MAX in the caller) and there's a couple of callers in
Intel PMU code.
AFAICS, there's no reason whatsoever to use TASK_SIZE
in that one - the point is to prevent copyin from the kernel
memory, and in that respect TASK_SIZE_MAX isn't any worse.
The check in copy_code() probably should go.

So all of those guys should be simply switched to access_ok().
Might be worth making that a preliminary patch - it's independent
from everything else and there's no point folding it into any of the
patches in the series.


Re: [PATCH 11/14] sparc64: remove CONFIG_SET_FS support

2022-02-14 Thread Al Viro
On Mon, Feb 14, 2022 at 05:34:49PM +0100, Arnd Bergmann wrote:

> -/*
> - * Sparc64 is segmented, though more like the M68K than the I386.
> - * We use the secondary ASI to address user memory, which references a
> - * completely different VM map, thus there is zero chance of the user
> - * doing something queer and tricking us into poking kernel memory.

Actually, this part of the comment probably ought to stay - it is relevant
for understanding what's going on (e.g. why access_ok() is always true, etc.)


Re: [PATCH 09/14] m68k: drop custom __access_ok()

2022-02-14 Thread Al Viro
On Mon, Feb 14, 2022 at 05:34:47PM +0100, Arnd Bergmann wrote:
> From: Arnd Bergmann 
> 
> While most m68k platforms use separate address spaces for user
> and kernel space, at least coldfire does not, and the other
> ones have a TASK_SIZE that is less than the entire 4GB address
> range.
> 
> Using the generic implementation of __access_ok() stops coldfire
> user space from trivially accessing kernel memory, and is probably
> the right thing elsewhere for consistency as well.

Perhaps simply wrap that sucker into #ifdef CONFIG_CPU_HAS_ADDRESS_SPACES
(and trim the comment down to "coldfire and 68000 will pick generic
variant")?

> Signed-off-by: Arnd Bergmann 
> ---
>  arch/m68k/include/asm/uaccess.h | 13 -
>  1 file changed, 13 deletions(-)
> 
> diff --git a/arch/m68k/include/asm/uaccess.h b/arch/m68k/include/asm/uaccess.h
> index d6bb5720365a..64914872a5c9 100644
> --- a/arch/m68k/include/asm/uaccess.h
> +++ b/arch/m68k/include/asm/uaccess.h
> @@ -10,19 +10,6 @@
>  #include 
>  #include 
>  #include 
> -
> -/* We let the MMU do all checking */
> -static inline int __access_ok(const void __user *addr,
> - unsigned long size)
> -{
> - /*
> -  * XXX: for !CONFIG_CPU_HAS_ADDRESS_SPACES this really needs to check
> -  * for TASK_SIZE!
> -  * Removing this helper is probably sufficient.
> -  */
> - return 1;
> -}
> -#define __access_ok __access_ok
>  #include 
>  
>  /*
> -- 
> 2.29.2
> 


Re: [PATCH 05/14] uaccess: add generic __{get,put}_kernel_nofault

2022-02-14 Thread Al Viro
On Mon, Feb 14, 2022 at 05:34:43PM +0100, Arnd Bergmann wrote:
> From: Arnd Bergmann 
> 
> All architectures that don't provide __{get,put}_kernel_nofault() yet
> can implement this on top of __{get,put}_user.
> 
> Add a generic version that lets everything use the normal
> copy_{from,to}_kernel_nofault() code based on these, removing the last
> use of get_fs()/set_fs() from architecture-independent code.

I'd put the list of those architectures (AFAICS, that's alpha, ia64,
microblaze, nds32, nios2, openrisc, sh, sparc32, xtensa) into commit
message - it's not that hard to find out, but...

And AFAICS, you've missed nios2 - see
#define __put_user(x, ptr) put_user(x, ptr)
in there.  nds32 oddities are dealt with earlier in the series, this
one is not...
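
For reference, the generic fallback in question is essentially of this shape
(sketch, not the exact patch text):

#define __get_kernel_nofault(dst, src, type, err_label)			\
do {									\
	type __user *p = (type __force __user *)(src);			\
	type data;							\
	if (__get_user(data, p))					\
		goto err_label;						\
	*(type *)(dst) = data;						\
} while (0)

which is presumably why a nios2-style "#define __put_user(x, ptr) put_user(x, ptr)"
gets in the way - the kernel-nofault path must not end up going through the
access_ok() in put_user().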


Re: [PATCH 04/14] x86: use more conventional access_ok() definition

2022-02-14 Thread Al Viro
On Mon, Feb 14, 2022 at 12:01:05PM -0800, Linus Torvalds wrote:
> On Mon, Feb 14, 2022 at 11:46 AM Arnd Bergmann  wrote:
> >
> > As Al pointed out, they turned out to be necessary on sparc64, but the only
> > definitions are on sparc64 and x86, so it's possible that they serve a similar
> > purpose here, in which case changing the limit from TASK_SIZE to
> > TASK_SIZE_MAX is probably wrong as well.
> 
> x86-64 has always(*) used TASK_SIZE_MAX for access_ok(), and the
> get_user() assembler implementation does the same.
> 
> I think any __range_not_ok() users that use TASK_SIZE are entirely
> historical, and should be just fixed.

IIRC, that was mostly userland stack trace collection in perf.
I'll try to dig in archives and see what shows up - it's been
a while ago...


Re: [PATCH 07/14] uaccess: generalize access_ok()

2022-02-14 Thread Al Viro
On Mon, Feb 14, 2022 at 05:34:45PM +0100, Arnd Bergmann wrote:

> diff --git a/arch/csky/kernel/signal.c b/arch/csky/kernel/signal.c
> index c7b763d2f526..8867ddf3e6c7 100644
> --- a/arch/csky/kernel/signal.c
> +++ b/arch/csky/kernel/signal.c
> @@ -136,7 +136,7 @@ static inline void __user *get_sigframe(struct ksignal *ksig,
>  static int
>  setup_rt_frame(struct ksignal *ksig, sigset_t *set, struct pt_regs *regs)
>  {
> - struct rt_sigframe *frame;
> + struct rt_sigframe __user *frame;
>   int err = 0;
>  
>   frame = get_sigframe(ksig, regs, sizeof(*frame));

Minor nit: might make sense to separate annotations (here, on nios2, etc.) from 
the rest...

This, OTOH,

> diff --git a/arch/sparc/include/asm/uaccess_64.h b/arch/sparc/include/asm/uaccess_64.h
> index 5c12fb46bc61..000bac67cf31 100644
> --- a/arch/sparc/include/asm/uaccess_64.h
> +++ b/arch/sparc/include/asm/uaccess_64.h
...
> -static inline bool __chk_range_not_ok(unsigned long addr, unsigned long size, unsigned long limit)
> -{
> - if (__builtin_constant_p(size))
> - return addr > limit - size;
> -
> - addr += size;
> - if (addr < size)
> - return true;
> -
> - return addr > limit;
> -}
> -
> -#define __range_not_ok(addr, size, limit)   \
> -({  \
> - __chk_user_ptr(addr);   \
> - __chk_range_not_ok((unsigned long __force)(addr), size, limit); \
> -})
> -
> -static inline int __access_ok(const void __user * addr, unsigned long size)
> -{
> - return 1;
> -}
> -
> -static inline int access_ok(const void __user * addr, unsigned long size)
> -{
> - return 1;
> -}
> +#define __range_not_ok(addr, size, limit) (!__access_ok(addr, size))

is really wrong.  For sparc64, access_ok() should always be true.
This __range_not_ok() thing is used *only* for valid_user_frame() in
arch/sparc/kernel/perf_event.c - it's not a part of normal access_ok()
there.

sparc64 has separate address spaces for kernel and for userland; access_ok()
had never been useful there.  
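
FWIW, a sketch of what keeping that check independent of access_ok() could
look like (untested, just to show the shape):

/* used only by valid_user_frame() in arch/sparc/kernel/perf_event.c */
#define __range_not_ok(addr, size, limit)				\
({									\
	unsigned long __a = (unsigned long)(addr);			\
	unsigned long __s = (unsigned long)(size);			\
	__a + __s < __a || __a + __s > (unsigned long)(limit);		\
})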


Re: [PATCH 3/4] exec: simplify the compat syscall handling

2021-03-26 Thread Al Viro
On Fri, Mar 26, 2021 at 03:38:30PM +0100, Christoph Hellwig wrote:

> -static const char __user *get_user_arg_ptr(struct user_arg_ptr argv, int nr)
> +static const char __user *
> +get_user_arg_ptr(const char __user *const __user *argv, int nr)
>  {
> - const char __user *native;
> -
> -#ifdef CONFIG_COMPAT
> - if (unlikely(argv.is_compat)) {
> + if (in_compat_syscall()) {
> + const compat_uptr_t __user *compat_argv =
> + compat_ptr((unsigned long)argv);
>   compat_uptr_t compat;
>  
> - if (get_user(compat, argv.ptr.compat + nr))
> + if (get_user(compat, compat_argv + nr))
>   return ERR_PTR(-EFAULT);
> -
>   return compat_ptr(compat);
> - }
> -#endif
> -
> - if (get_user(native, argv.ptr.native + nr))
> - return ERR_PTR(-EFAULT);
> + } else {
> + const char __user *native;
>  
> - return native;
> + if (get_user(native, argv + nr))
> + return ERR_PTR(-EFAULT);
> + return native;
> + }
>  }

Yecchhh...  So you have in_compat_syscall() called again and again, for
each argument in the list?  I agree that the current version is fucking ugly,
but I really hate that approach ;-/
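
FWIW, the obvious alternative is to evaluate it once and pass the result down -
a rough sketch (the extra argument is just for illustration):

static const char __user *
get_user_arg_ptr(const char __user *const __user *argv, int nr, bool compat)
{
	if (compat) {
		const compat_uptr_t __user *compat_argv =
			compat_ptr((unsigned long)argv);
		compat_uptr_t p;

		if (get_user(p, compat_argv + nr))
			return ERR_PTR(-EFAULT);
		return compat_ptr(p);
	} else {
		const char __user *native;

		if (get_user(native, argv + nr))
			return ERR_PTR(-EFAULT);
		return native;
	}
}

with the callers doing "bool compat = in_compat_syscall();" once, outside the
loop over the argument list.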


Re: [PATCH 01/10] alpha: use libata instead of the legacy ide driver

2021-03-18 Thread Al Viro
On Thu, Mar 18, 2021 at 05:56:57AM +0100, Christoph Hellwig wrote:
> Switch the alpha defconfig from the legacy ide driver to libata.

Umm...  I don't have an IDE alpha box in a usable shape (fans on
CPU module shat themselves), and it would take a while to resurrect
it, but I remember the joy it used to cause in some versions.

Do you have reports of libata variants of drivers actually tested on
those?


Re: [PATCH 3/9] powerpc/pseries: remove the ppc-cmm file system

2021-03-10 Thread Al Viro
On Tue, Mar 09, 2021 at 04:53:42PM +0100, Christoph Hellwig wrote:
> Just use the generic anon_inode file system.

Umm...  The only problem I see here is the lifetime rules for
that module, and that's not something introduced in this patchset.
Said that, looks like the logic around that place is duplicated in
cmm.c, vmw_balloon.c and virtio_balloon.c and I wonder if it would
be better off with a helper in mm/balloon.c to be used for that setup...


Re: [PATCH 4/9] drm: remove the drm file system

2021-03-10 Thread Al Viro
On Tue, Mar 09, 2021 at 04:53:43PM +0100, Christoph Hellwig wrote:
> Just use the generic anon_inode file system.

Are you changing the lifetime rules for that module?


Re: [REGRESSION] mm: process_vm_readv testcase no longer works after compat_prcoess_vm_readv removed

2020-10-26 Thread Al Viro
On Mon, Oct 26, 2020 at 05:56:11PM -0600, Jens Axboe wrote:
> On 10/26/20 4:55 PM, Kyle Huey wrote:
> > A test program from the rr[0] test suite, vm_readv_writev[1], no
> > longer works on 5.10-rc1 when compiled as a 32 bit binary and executed
> > on a 64 bit kernel. The first process_vm_readv call (on line 35) now
> > fails with EFAULT. I have bisected this to
> > c3973b401ef2b0b8005f8074a10e96e3ea093823.
> > 
> > It should be fairly straightforward to extract the test case from our
> > repository into a standalone program.
> 
> Can you check with this applied?
> 
> diff --git a/mm/process_vm_access.c b/mm/process_vm_access.c
> index fd12da80b6f2..05676722d9cd 100644
> --- a/mm/process_vm_access.c
> +++ b/mm/process_vm_access.c
> @@ -273,7 +273,8 @@ static ssize_t process_vm_rw(pid_t pid,
>   return rc;
>   if (!iov_iter_count(&iter))
>   goto free_iov_l;
> - iov_r = iovec_from_user(rvec, riovcnt, UIO_FASTIOV, iovstack_r, false);
> + iov_r = iovec_from_user(rvec, riovcnt, UIO_FASTIOV, iovstack_r,
> + in_compat_syscall());

_ouch_

There's a bug, all right, but I'm not sure that this is all there is to it.
For now it's probably the right fix, but...  Consider the fun trying to
use that from 32bit process to access the memory of 64bit one.  IOW, we
might want to add an explicit flag for "force 64bit addresses/sizes
in rvec".


Re: Buggy commit tracked to: "Re: [PATCH 2/9] iov_iter: move rw_copy_check_uvector() into lib/iov_iter.c"

2020-10-23 Thread Al Viro
On Fri, Oct 23, 2020 at 03:09:30PM +0200, David Hildenbrand wrote:

> Now, I am not a compiler expert, but as I already cited, at least on
> x86-64 clang expects that the high bits were cleared by the caller - in
> contrast to gcc. I suspect it's the same on arm64, but again, I am no
> compiler expert.
> 
> If what I said and cites for x86-64 is correct, if the function expects
> an "unsigned int", it will happily use 64bit operations without further
> checks where valid when assuming high bits are zero. That's why even
> converting everything to "unsigned int" as proposed by me won't work on
> clang - it assumes high bits are zero (as indicated by Nick).
> 
> As I am neither a compiler experts (did I mention that already? ;) ) nor
> an arm64 experts, I can't tell if this is a compiler BUG or not.

On arm64 when callee expects a 32bit argument, the caller is *not* responsible
for clearing the upper half of 64bit register used to pass the value - it only
needs to store the actual value into the lower half.  The callee must consider
the contents of the upper half of that register as undefined.  See AAPCS64 (e.g.
https://github.com/ARM-software/abi-aa/blob/master/aapcs64/aapcs64.rst#parameter-passing-rules
); AFAICS, the relevant bit is
"Unlike in the 32-bit AAPCS, named integral values must be narrowed by
the callee rather than the caller."


Re: Buggy commit tracked to: "Re: [PATCH 2/9] iov_iter: move rw_copy_check_uvector() into lib/iov_iter.c"

2020-10-22 Thread Al Viro
On Thu, Oct 22, 2020 at 01:59:32PM -0700, Eric Biggers wrote:

> Also note the following program succeeds on Linux 5.9 on x86_64.  On kernels
> that have this bug, it should fail.  (I couldn't get it to actually fail, so it
> must depend on the compiler and/or the kernel config...)

It doesn't.  See https://www.spinics.net/lists/linux-scsi/msg147836.html for
discussion of that mess.

ssize_t vfs_readv(struct file *file, const struct iovec __user *vec,
  unsigned long vlen, loff_t *pos, rwf_t flags)
{
struct iovec iovstack[UIO_FASTIOV];
struct iovec *iov = iovstack;
struct iov_iter iter;
ssize_t ret;

ret = import_iovec(READ, vec, vlen, ARRAY_SIZE(iovstack), &iov, &iter);
if (ret >= 0) {
ret = do_iter_read(file, &iter, pos, flags);
kfree(iov);
}

return ret;
}

and import_iovec() takes unsigned int as the third argument, so it *will*
truncate to 32 bits, no matter what.  Has done so since 0504c074b546
"switch {compat_,}do_readv_writev() to {compat_,}import_iovec()" back in
March 2015.  Yes, it was an incompatible userland ABI change, even though
nothing that used glibc/uclibc/dietlibc would've noticed.
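
To illustrate the truncation itself (userland toy, not kernel code):

#include <stdio.h>

/* models "unsigned long from syscall glue" -> "unsigned int in import_iovec()" */
static unsigned int as_seen_by_import_iovec(unsigned int nr_segs)
{
	return nr_segs;
}

int main(void)
{
	unsigned long vlen = 0x100000001UL;	/* 4G + 1 */

	printf("%u\n", as_seen_by_import_iovec(vlen));	/* prints 1 */
	return 0;
}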

Better yet, up until 2.1.90pre1 passing a 64bit value as the _first_ argument
of readv(2) used to fail with -EBADF if it was too large; at that point it
started to get quietly truncated to 32bit first.  And again, no libc users
would've noticed (neither would anything except deliberate regression test
looking for that specific behaviour).

Note that we also have process_madvise(2) with size_t for vlen (huh?  It's
a number of array elements, not an object size) and process_vm_{read,write}v(2),
that have unsigned long for the same thing.  And the last two *are* using
the same unsigned long from glibc POV.


Re: Buggy commit tracked to: "Re: [PATCH 2/9] iov_iter: move rw_copy_check_uvector() into lib/iov_iter.c"

2020-10-22 Thread Al Viro
On Thu, Oct 22, 2020 at 09:06:29PM +0100, Al Viro wrote:
> On Thu, Oct 22, 2020 at 08:24:58PM +0100, Al Viro wrote:
> 
> > Depending upon the calling conventions, compiler might do truncation in
> > caller or in a callee, but it must be done _somewhere_.
> 
> Unless I'm misreading AAPCS64,
>   "Unlike in the 32-bit AAPCS, named integral values must be narrowed by 
> the callee
>rather than the caller"
> in 6.4.2 means that callee must not _not_ expect the upper 32 bits of 
> %x0..%x7 to contain

Sorry, artefact of editing - that's

"in 6.4.2 means that callee must _not_ expect the upper 32 bits of %x0..%x7 to 
contain"

obviously.


Re: Buggy commit tracked to: "Re: [PATCH 2/9] iov_iter: move rw_copy_check_uvector() into lib/iov_iter.c"

2020-10-22 Thread Al Viro
On Thu, Oct 22, 2020 at 08:24:58PM +0100, Al Viro wrote:

> Depending upon the calling conventions, compiler might do truncation in
> caller or in a callee, but it must be done _somewhere_.

Unless I'm misreading AAPCS64,
"Unlike in the 32-bit AAPCS, named integral values must be narrowed by 
the callee
 rather than the caller"
in 6.4.2 means that callee must not _not_ expect the upper 32 bits of %x0..%x7 
to contain
anything valid for 32bit arguments and it must zero-extend %w0..%w7 when
passing that to something that expects a 64bit argument.  On inlining it
should be the same situation as storing unsigned int argument into unsigned
long local variable and working with that - if

void f(unsigned int w)
{
unsigned long x = w;
printf("%lx\n", x);
}

ends up passing %x0 to printf, it's an obvious bug - it must do something like
uxtw x0, w0
first.

What am I missing here?


Re: Buggy commit tracked to: "Re: [PATCH 2/9] iov_iter: move rw_copy_check_uvector() into lib/iov_iter.c"

2020-10-22 Thread Al Viro
On Thu, Oct 22, 2020 at 08:24:58PM +0100, Al Viro wrote:
> On Thu, Oct 22, 2020 at 12:04:52PM -0700, Nick Desaulniers wrote:
> 
> > Passing an `unsigned long` as an `unsigned int` does no such
> > narrowing: https://godbolt.org/z/TvfMxe (same vice-versa, just tail
> > calls, no masking instructions).
> > So if rw_copy_check_uvector() is inlined into import_iovec() (looking
> > at the mainline@1028ae406999), then children calls of
> > `rw_copy_check_uvector()` will be interpreting the `nr_segs` register
> > unmodified, ie. garbage in the upper 32b.
> 
> FWIW,
> 
> void f(unsinged long v)

s/unsinged/unsigned/, obviously...


Re: Buggy commit tracked to: "Re: [PATCH 2/9] iov_iter: move rw_copy_check_uvector() into lib/iov_iter.c"

2020-10-22 Thread Al Viro
On Thu, Oct 22, 2020 at 12:04:52PM -0700, Nick Desaulniers wrote:

> Passing an `unsigned long` as an `unsigned int` does no such
> narrowing: https://godbolt.org/z/TvfMxe (same vice-versa, just tail
> calls, no masking instructions).
> So if rw_copy_check_uvector() is inlined into import_iovec() (looking
> at the mainline@1028ae406999), then children calls of
> `rw_copy_check_uvector()` will be interpreting the `nr_segs` register
> unmodified, ie. garbage in the upper 32b.

FWIW,

void f(unsinged long v)
{
if (v != 1)
printf("failed\n");
}

void g(unsigned int v)
{
f(v);
}

void h(unsigned long v)
{
g(v);
}

main()
{
h(0x100000001);
}

must not produce any output on a host with 32bit int and 64bit long, regardless
of the inlining, having functions live in different compilation units, etc.

Depending upon the calling conventions, compiler might do truncation in caller
or in a callee, but it must be done _somewhere_.


Re: Buggy commit tracked to: "Re: [PATCH 2/9] iov_iter: move rw_copy_check_uvector() into lib/iov_iter.c"

2020-10-22 Thread Al Viro
On Thu, Oct 22, 2020 at 05:40:40PM +0100, Matthew Wilcox wrote:
> On Thu, Oct 22, 2020 at 04:35:17PM +, David Laight wrote:
> > Wait...
> > readv(2) defines:
> > ssize_t readv(int fd, const struct iovec *iov, int iovcnt);
> 
> It doesn't really matter what the manpage says.  What does the AOSP
> libc header say?

FWIW, see https://www.spinics.net/lists/linux-scsi/msg147836.html and
subthread from there on...


Re: Buggy commit tracked to: "Re: [PATCH 2/9] iov_iter: move rw_copy_check_uvector() into lib/iov_iter.c"

2020-10-21 Thread Al Viro
On Wed, Oct 21, 2020 at 06:13:01PM +0200, Greg KH wrote:
> On Fri, Sep 25, 2020 at 06:51:39AM +0200, Christoph Hellwig wrote:
> > From: David Laight 
> > 
> > This lets the compiler inline it into import_iovec() generating
> > much better code.
> > 
> > Signed-off-by: David Laight 
> > Signed-off-by: Christoph Hellwig 
> > ---
> >  fs/read_write.c | 179 
> >  lib/iov_iter.c  | 176 +++
> >  2 files changed, 176 insertions(+), 179 deletions(-)
> 
> Strangely, this commit causes a regression in Linus's tree right now.
> 
> I can't really figure out what the regression is, only that this commit
> triggers a "large Android system binary" from working properly.  There's
> no kernel log messages anywhere, and I don't have any way to strace the
> thing in the testing framework, so any hints that people can provide
> would be most appreciated.

It's a pure move - modulo changed line breaks in the argument lists
the functions involved are identical before and after that (just checked
that directly, by checking out the trees before and after, extracting the two
functions in question from fs/read_write.c and lib/iov_iter.c (before and
after, resp.) and checking the diff between those).

How certain is your bisection?


Re: [PATCH RFC PKS/PMEM 33/58] fs/cramfs: Utilize new kmap_thread()

2020-10-13 Thread Al Viro
On Tue, Oct 13, 2020 at 08:36:43PM +0100, Matthew Wilcox wrote:

> static inline void copy_to_highpage(struct page *to, void *vfrom, unsigned int size)
> {
>   char *vto = kmap_atomic(to);
> 
>   memcpy(vto, vfrom, size);
>   kunmap_atomic(vto);
> }
> 
> in linux/highmem.h ?

You mean, like
static void memcpy_from_page(char *to, struct page *page, size_t offset, size_t len)
{
char *from = kmap_atomic(page);
memcpy(to, from + offset, len);
kunmap_atomic(from);
}

static void memcpy_to_page(struct page *page, size_t offset, const char *from, size_t len)
{
char *to = kmap_atomic(page);
memcpy(to + offset, from, len);
kunmap_atomic(to);
}

static void memzero_page(struct page *page, size_t offset, size_t len)
{
char *addr = kmap_atomic(page);
memset(addr + offset, 0, len);
kunmap_atomic(addr);
}

in lib/iov_iter.c?  FWIW, I don't like that "highpage" in the name and
highmem.h as location - these make perfect sense regardless of highmem;
they are normal memory operations with page + offset used instead of
a pointer...
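
With those in place, the copy_to_highpage() quoted above would simply be (sketch):

static inline void copy_to_highpage(struct page *to, void *vfrom, unsigned int size)
{
	memcpy_to_page(to, 0, vfrom, size);
}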


Re: [PATCH 05/14] fs: don't allow kernel reads and writes without iter ops

2020-10-01 Thread Al Viro
On Thu, Oct 01, 2020 at 03:38:52PM -0700, Eric Biggers wrote:

>   mutex_lock(&sbi->pipe_mutex);
>   while (bytes) {
> - wr = __kernel_write(file, data, bytes, NULL);
> + wr = __kernel_write(file, data, bytes, &file->f_pos);

Better
loff_t dummy = 0;
...
wr = __kernel_write(file, data, bytes, &dummy);


Re: let import_iovec deal with compat_iovecs as well v4

2020-09-25 Thread Al Viro
On Fri, Sep 25, 2020 at 06:51:37AM +0200, Christoph Hellwig wrote:
> Hi Al,
> 
> this series changes import_iovec to transparently deal with compat iovec
> structures, and then cleans up a lot of code duplication.

OK, I can live with that.  Applied, let's see if it passes smoke tests -
then into -next it goes.


Re: [PATCH 5/9] fs: remove various compat readv/writev helpers

2020-09-23 Thread Al Viro
On Wed, Sep 23, 2020 at 08:45:51PM +0200, Arnd Bergmann wrote:
> On Wed, Sep 23, 2020 at 6:38 PM Al Viro  wrote:
> >
> > I wonder if we should do something like
> >
> > SYSCALL_DECLARE3(readv, unsigned long, fd, const struct iovec __user *, vec,
> >  unsigned long, vlen);
> > in syscalls.h instead, and not under that ifdef.
> >
> > Let it expand to declaration of sys_...() in generic case and, on x86, into
> > __do_sys_...() and __ia32_sys_...()/__x64_sys_...(), with types matching
> > what SYSCALL_DEFINE ends up using.
> >
> > Similar macro would cover compat_sys_...() declarations.  That would
> > restore mismatch checking for x86 and friends.  AFAICS, the cost wouldn't
> > be terribly high - cpp would have more to chew through in syscalls.h,
> > but it shouldn't be all that costly.  Famous last words, of course...
> >
> > Does anybody see fundamental problems with that?
> 
> I've had some ideas along those lines in the past and I think it should work.
> 
> As a variation of this, the SYSCALL_DEFINEx() macros could go away
> entirely, leaving only the macro instantiations from the header to
> require that syntax. It would require first changing the remaining
> architectures to build the syscall table from C code instead of
> assembler though.
> 
> Regardless of that, another advantage of having the SYSCALL_DECLAREx()
> would be the ability to include that header file from elsewhere with a different
> macro definition to create a machine-readable version of the interface when
> combined with the syscall.tbl files. This could be used to create a user
> space stub for calling into the low-level syscall regardless of the
> libc interfaces,
> or for synchronizing the interfaces with strace, qemu-user, or anything that
> needs to deal with the low-level interface.

FWIW, after playing with that for a while...  Do we really want the
compat_sys_...() declarations to live in linux/compat.h?  Most of
the users of that file don't want those; why not move them to
linux/syscalls.h?

Reason: there's a lot more users of linux/compat.h than those of
linux/syscalls.h - it's pulled by everything in the networking stack,
for starters...


Re: [PATCH 5/9] fs: remove various compat readv/writev helpers

2020-09-23 Thread Al Viro
On Wed, Sep 23, 2020 at 05:38:31PM +0100, Al Viro wrote:
> On Wed, Sep 23, 2020 at 03:59:01PM +0100, Al Viro wrote:
> 
> > > That's a very good question.  But it does not just compile but actually
> > > works.  Probably because all the syscall wrappers mean that we don't
> > > actually generate the normal names.  I just tried this:
> > > 
> > > --- a/include/linux/syscalls.h
> > > +++ b/include/linux/syscalls.h
> > > @@ -468,7 +468,7 @@ asmlinkage long sys_lseek(unsigned int fd, off_t offset,
> > >  asmlinkage long sys_read(unsigned int fd, char __user *buf, size_t count);
> > >  asmlinkage long sys_write(unsigned int fd, const char __user *buf,
> > > size_t count);
> > > -asmlinkage long sys_readv(unsigned long fd,
> > > +asmlinkage long sys_readv(void *fd,
> > > 
> > > for fun, and the compiler doesn't care either..
> > 
> > Try to build it for sparc or ppc...
> 
> FWIW, declarations in syscalls.h used to serve 4 purposes:
>   1) syscall table initializers needed symbols declared
>   2) direct calls needed the same
>   3) catching mismatches between the declarations and definitions
>   4) centralized list of all syscalls
> 
> (2) has been (thankfully) reduced for some time; in any case, ksys_... is
> used for the remaining ones.
> 
> (1) and (3) are served by syscalls.h in architectures other than x86, arm64
> and s390.  On those 3 (1) is done otherwise (near the syscall table 
> initializer)
> and (3) is not done at all.
> 
> I wonder if we should do something like
> 
> SYSCALL_DECLARE3(readv, unsigned long, fd, const struct iovec __user *, vec,
>unsigned long, vlen);
> in syscalls.h instead, and not under that ifdef.
> 
> Let it expand to declaration of sys_...() in generic case and, on x86, into
> __do_sys_...() and __ia32_sys_...()/__x64_sys_...(), with types matching
> what SYSCALL_DEFINE ends up using.
> 
> Similar macro would cover compat_sys_...() declarations.  That would
> restore mismatch checking for x86 and friends.  AFAICS, the cost wouldn't
> be terribly high - cpp would have more to chew through in syscalls.h,
> but it shouldn't be all that costly.  Famous last words, of course...
> 
> Does anybody see fundamental problems with that?

Just to make it clear - I do not propose to fold that into this series;
there we just need to keep those declarations in sync with fs/read_write.c


Re: [PATCH 5/9] fs: remove various compat readv/writev helpers

2020-09-23 Thread Al Viro
On Wed, Sep 23, 2020 at 03:59:01PM +0100, Al Viro wrote:

> > That's a very good question.  But it does not just compile but actually
> > works.  Probably because all the syscall wrappers mean that we don't
> > actually generate the normal names.  I just tried this:
> > 
> > --- a/include/linux/syscalls.h
> > +++ b/include/linux/syscalls.h
> > @@ -468,7 +468,7 @@ asmlinkage long sys_lseek(unsigned int fd, off_t offset,
> >  asmlinkage long sys_read(unsigned int fd, char __user *buf, size_t count);
> >  asmlinkage long sys_write(unsigned int fd, const char __user *buf,
> > size_t count);
> > -asmlinkage long sys_readv(unsigned long fd,
> > +asmlinkage long sys_readv(void *fd,
> > 
> > for fun, and the compiler doesn't care either..
> 
> Try to build it for sparc or ppc...

FWIW, declarations in syscalls.h used to serve 4 purposes:
1) syscall table initializers needed symbols declared
2) direct calls needed the same
3) catching mismatches between the declarations and definitions
4) centralized list of all syscalls

(2) has been (thankfully) reduced for some time; in any case, ksys_... is
used for the remaining ones.

(1) and (3) are served by syscalls.h in architectures other than x86, arm64
and s390.  On those 3 (1) is done otherwise (near the syscall table initializer)
and (3) is not done at all.

I wonder if we should do something like

SYSCALL_DECLARE3(readv, unsigned long, fd, const struct iovec __user *, vec,
 unsigned long, vlen);
in syscalls.h instead, and not under that ifdef.

Let it expand to declaration of sys_...() in generic case and, on x86, into
__do_sys_...() and __ia32_sys_...()/__x64_sys_...(), with types matching
what SYSCALL_DEFINE ends up using.

Similar macro would cover compat_sys_...() declarations.  That would
restore mismatch checking for x86 and friends.  AFAICS, the cost wouldn't
be terribly high - cpp would have more to chew through in syscalls.h,
but it shouldn't be all that costly.  Famous last words, of course...

Does anybody see fundamental problems with that?
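
For concreteness, a strawman of what such a macro could expand to (purely
illustrative - the x86 side would have to match whatever SYSCALL_DEFINE3
actually generates there):

#ifndef CONFIG_ARCH_HAS_SYSCALL_WRAPPER
#define SYSCALL_DECLARE3(name, t1, a1, t2, a2, t3, a3)			\
	asmlinkage long sys_##name(t1 a1, t2 a2, t3 a3)
#else	/* x86-ish: declare the wrappers SYSCALL_DEFINE3 emits */
#define SYSCALL_DECLARE3(name, t1, a1, t2, a2, t3, a3)			\
	long __do_sys_##name(t1 a1, t2 a2, t3 a3);			\
	asmlinkage long __x64_sys_##name(const struct pt_regs *regs);	\
	asmlinkage long __ia32_sys_##name(const struct pt_regs *regs)
#endif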


Re: [PATCH 5/9] fs: remove various compat readv/writev helpers

2020-09-23 Thread Al Viro
On Wed, Sep 23, 2020 at 04:32:51PM +0200, Christoph Hellwig wrote:
> On Wed, Sep 23, 2020 at 03:25:49PM +0100, Al Viro wrote:
> > On Wed, Sep 23, 2020 at 08:05:43AM +0200, Christoph Hellwig wrote:
> > >  COMPAT_SYSCALL_DEFINE3(readv, compat_ulong_t, fd,
> > > - const struct compat_iovec __user *,vec,
> > > + const struct iovec __user *, vec,
> > 
> > Um...  Will it even compile?
> > 
> > >  #ifdef __ARCH_WANT_COMPAT_SYS_PREADV64
> > >  COMPAT_SYSCALL_DEFINE4(preadv64, unsigned long, fd,
> > > - const struct compat_iovec __user *,vec,
> > > + const struct iovec __user *, vec,
> > 
> > Ditto.  Look into include/linux/compat.h and you'll see
> > 
> > asmlinkage long compat_sys_preadv64(unsigned long fd,
> > const struct compat_iovec __user *vec,
> > unsigned long vlen, loff_t pos);
> > 
> > How does that manage to avoid the compiler screaming bloody
> > murder?
> 
> That's a very good question.  But it does not just compile but actually
> works.  Probably because all the syscall wrappers mean that we don't
> actually generate the normal names.  I just tried this:
> 
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -468,7 +468,7 @@ asmlinkage long sys_lseek(unsigned int fd, off_t offset,
>  asmlinkage long sys_read(unsigned int fd, char __user *buf, size_t count);
>  asmlinkage long sys_write(unsigned int fd, const char __user *buf,
> size_t count);
> -asmlinkage long sys_readv(unsigned long fd,
> +asmlinkage long sys_readv(void *fd,
> 
> for fun, and the compiler doesn't care either..

Try to build it for sparc or ppc...


Re: [PATCH 3/9] iov_iter: refactor rw_copy_check_uvector and import_iovec

2020-09-23 Thread Al Viro
On Wed, Sep 23, 2020 at 02:38:24PM +, David Laight wrote:
> From: Al Viro
> > Sent: 23 September 2020 15:17
> > 
> > On Wed, Sep 23, 2020 at 08:05:41AM +0200, Christoph Hellwig wrote:
> > 
> > > +struct iovec *iovec_from_user(const struct iovec __user *uvec,
> > > + unsigned long nr_segs, unsigned long fast_segs,
> > 
> > Hmm...  For fast_segs unsigned long had always been ridiculous
> > (4G struct iovec on caller stack frame?), but that got me wondering about
> > nr_segs and I wish I'd thought of that when introducing import_iovec().
> > 
> > The thing is, import_iovec() takes unsigned int there.  Which is fine
> > (hell, the maximal value that can be accepted is 1024), except that
> > we do pass unsigned long syscall argument to it in some places.
> 
> It will make diddly-squit difference.
> The parameters end up in registers on most calling conventions.
> Plausibly you get an extra 'REX' byte on x86 for the 64bit value.
> What you want to avoid is explicit sign/zero extension and value
> masking after arithmetic.

Don't tell me what I want; your telepathic abilities are consistently sucky.

I am *NOT* talking about microoptimization here.  I have described
the behaviour change of syscall caused by commit 5 years ago.  Which is
generally considered a problem.  Then I asked whether that behaviour
change would fall under the "if nobody noticed, it's not a userland ABI
breakage" exception.

Could you show me the point where I have expressed concerns about
the quality of amd64 code generated for that thing, before or after
the change in question?


Re: [PATCH 3/9] iov_iter: refactor rw_copy_check_uvector and import_iovec

2020-09-23 Thread Al Viro
On Wed, Sep 23, 2020 at 03:16:54PM +0100, Al Viro wrote:
> On Wed, Sep 23, 2020 at 08:05:41AM +0200, Christoph Hellwig wrote:
> 
> > +struct iovec *iovec_from_user(const struct iovec __user *uvec,
> > +   unsigned long nr_segs, unsigned long fast_segs,
> 
> Hmm...  For fast_segs unsigned long had always been ridiculous
> (4G struct iovec on caller stack frame?), but that got me wondering about
> nr_segs and I wish I'd thought of that when introducing import_iovec().
> 
> The thing is, import_iovec() takes unsigned int there.  Which is fine
> (hell, the maximal value that can be accepted is 1024), except that
> we do pass unsigned long syscall argument to it in some places.
> 
> E.g. vfs_readv() quietly truncates vlen to 32 bits, and vlen can
> come unchanged through sys_readv() -> do_readv() -> vfs_readv().
> With unsigned long passed by syscall glue.
> 
> AFAICS, passing 4G+1 as the third argument to readv(2) on 64bit box
> will be quietly treated as 1 these days.  Which would be fine, except
> that before "switch {compat_,}do_readv_writev() to {compat_,}import_iovec()"
> it used to fail with -EINVAL.
> 
> Userland, BTW, describes readv(2) iovcnt as int; process_vm_readv(),
> OTOH, has these counts unsigned long from the userland POV...
> 
> I suppose we ought to switch import_iovec() to unsigned long for nr_segs ;-/
> Strictly speaking that had been a userland ABI change, even though nothing
> except regression tests checking for expected errors would've been likely
> to notice.  And it looks like no regression tests covered that one...
> 
> Linus, does that qualify for your "if no userland has noticed the change,
> it's not a breakage"?

Egads...  We have sys_readv() with unsigned long for file descriptor, since
1.3.31 when it had been introduced.  And originally it did comparison with
NR_OPEN right in sys_readv().  Then in 2.1.60 it had been switched to
fget(), which used to take unsigned long at that point.  And in 2.1.90pre1
it went unsigned int, so non-zero upper 32 bits in readv(2) first argument
ceased to cause EBADF...

Of course, libc had it as int fd all along.


Re: [PATCH 5/9] fs: remove various compat readv/writev helpers

2020-09-23 Thread Al Viro
On Wed, Sep 23, 2020 at 08:05:43AM +0200, Christoph Hellwig wrote:
>  COMPAT_SYSCALL_DEFINE3(readv, compat_ulong_t, fd,
> - const struct compat_iovec __user *,vec,
> + const struct iovec __user *, vec,

Um...  Will it even compile?

>  #ifdef __ARCH_WANT_COMPAT_SYS_PREADV64
>  COMPAT_SYSCALL_DEFINE4(preadv64, unsigned long, fd,
> - const struct compat_iovec __user *,vec,
> + const struct iovec __user *, vec,

Ditto.  Look into include/linux/compat.h and you'll see

asmlinkage long compat_sys_preadv64(unsigned long fd,
const struct compat_iovec __user *vec,
unsigned long vlen, loff_t pos);

How does that manage to avoid the compiler screaming bloody
murder?


Re: [PATCH 3/9] iov_iter: refactor rw_copy_check_uvector and import_iovec

2020-09-23 Thread Al Viro
On Wed, Sep 23, 2020 at 08:05:41AM +0200, Christoph Hellwig wrote:

> +struct iovec *iovec_from_user(const struct iovec __user *uvec,
> + unsigned long nr_segs, unsigned long fast_segs,

Hmm...  For fast_segs unsigned long had always been ridiculous
(4G struct iovec on caller stack frame?), but that got me wondering about
nr_segs and I wish I'd thought of that when introducing import_iovec().

The thing is, import_iovec() takes unsigned int there.  Which is fine
(hell, the maximal value that can be accepted is 1024), except that
we do pass unsigned long syscall argument to it in some places.

E.g. vfs_readv() quietly truncates vlen to 32 bits, and vlen can
come unchanged through sys_readv() -> do_readv() -> vfs_readv().
With unsigned long passed by syscall glue.

AFAICS, passing 4G+1 as the third argument to readv(2) on 64bit box
will be quietly treated as 1 these days.  Which would be fine, except
that before "switch {compat_,}do_readv_writev() to {compat_,}import_iovec()"
it used to fail with -EINVAL.

Userland, BTW, describes readv(2) iovcnt as int; process_vm_readv(),
OTOH, has these counts unsigned long from the userland POV...

I suppose we ought to switch import_iovec() to unsigned long for nr_segs ;-/
Strictly speaking that had been a userland ABI change, even though nothing
except regression tests checking for expected errors would've been likely
to notice.  And it looks like no regression tests covered that one...

Linus, does that qualify for your "if no userland has noticed the change,
it's not a breakage"?


Re: [PATCH 1/9] kernel: add a PF_FORCE_COMPAT flag

2020-09-23 Thread Al Viro
On Wed, Sep 23, 2020 at 11:01:34AM +0300, Pavel Begunkov wrote:

> > I'm not following why that would be considered a valid option,
> > as that clearly breaks existing users that update from a 32-bit
> > kernel to a 64-bit one.
> 
> Do you mean users who move 32-bit binaries (without recompiling) to a
> new x64 kernel? Does the kernel guarantees that to work?

Yes.

No further (printable) comments for now...


Re: [PATCH 02/11] mm: call import_iovec() instead of rw_copy_check_uvector() in process_vm_rw()

2020-09-21 Thread Al Viro
On Mon, Sep 21, 2020 at 03:44:00PM +, David Laight wrote:
> From: Al Viro
> > Sent: 21 September 2020 16:30
> > 
> > On Mon, Sep 21, 2020 at 03:21:35PM +, David Laight wrote:
> > 
> > > You really don't want to be looping through the array twice.
> > 
> > Profiles, please.
> 
> I did some profiling of send() v sendmsg() much earlier in the year.
> I can't remember the exact details but the extra cost of sendmsg()
> is far more than you might expect.
> (I was timing sending fully built IPv4 UDP packets using a raw socket.)
> 
> About half the difference does away if you change the
> copy_from_user() to __copy_from_user() when reading the struct msghdr
> and iov[] from userspace (user copy hardening is expensive).

Wha...?  So the difference is within 4 times the overhead of the
hardening checks done for one call of copy_from_user()?

> The rest is just code path, my gut feeling is that a lot of that
> is in import_iovec().
> 
> Remember semdmsg() is likely to be called with an iov count of 1.

... which makes that loop unlikely to be noticeable in the entire
mess, whether you pass it once or twice.  IOW, unless you can show
profiles where that loop is sufficiently hot or if you can show
the timings change from splitting it in two, I'll remain very
sceptical about that assertion.


Re: [PATCH 02/11] mm: call import_iovec() instead of rw_copy_check_uvector() in process_vm_rw()

2020-09-21 Thread Al Viro
On Mon, Sep 21, 2020 at 03:21:35PM +, David Laight wrote:

> You really don't want to be looping through the array twice.

Profiles, please.

> I think the 'length' check can be optimised to do something like:
>   for (...) {
>   ssize_t len = (ssize_t)iov[seg].iov_len;
>   ret += len;
>   len_hi += (unsigned long)len >> 20;
>   }
>   if (len_hi) {
>   /* Something potentially odd in the lengths.
>    * Might just be a very long fragment.
>    * Check the individual values. */
>   Add the existing loop here.
>   }

Far too ugly to live.


Re: [PATCH 06/11] iov_iter: handle the compat case in import_iovec

2020-09-21 Thread Al Viro
On Mon, Sep 21, 2020 at 04:34:29PM +0200, Christoph Hellwig wrote:
> Use in compat_syscall to import either native or the compat iovecs, and
> remove the now superflous compat_import_iovec, which removes the need for
> special compat logic in most callers.  Only io_uring needs special
> treatment given that it can call import_iovec from kernel threads acting
> on behalf of native or compat syscalls.  Expose the low-level
> __import_iovec helper and use it in io_uring to explicitly pick a iovec
> layout.

fs/aio.c part is not obvious...  Might be better to use __import_iovec()
there as well and leave the rest of aio.c changes to followup.

> --- a/lib/iov_iter.c
> +++ b/lib/iov_iter.c
> @@ -1683,7 +1683,7 @@ static int compat_copy_iovecs_from_user(struct iovec *iov,
>   return ret;
>  }
>  
> -static ssize_t __import_iovec(int type, const struct iovec __user *uvector,
> +ssize_t __import_iovec(int type, const struct iovec __user *uvector,
>   unsigned nr_segs, unsigned fast_segs, struct iovec **iovp,
>   struct iov_iter *i, bool compat)
>  {

Don't make it static in the first place, perhaps?


Re: [PATCH 05/11] iov_iter: merge the compat case into rw_copy_check_uvector

2020-09-21 Thread Al Viro
On Mon, Sep 21, 2020 at 04:34:28PM +0200, Christoph Hellwig wrote:
> +static int compat_copy_iovecs_from_user(struct iovec *iov,
> + const struct iovec __user *uvector, unsigned long nr_segs)
> +{
> + const struct compat_iovec __user *uiov =
> + (const struct compat_iovec __user *)uvector;
> + unsigned long i;
 
Huh?


Re: [PATCH 04/11] iov_iter: explicitly check for CHECK_IOVEC_ONLY in rw_copy_check_uvector

2020-09-21 Thread Al Viro
On Mon, Sep 21, 2020 at 03:05:32PM +, David Laight wrote:

> I've actually no idea:
> 1) Why there is an access_ok() check here.
>    It will be repeated by the user copy functions.

Early sanity check.

> 2) Why it isn't done when called from mm/process_vm_access.c.
>    Ok, the addresses refer to a different process, but they
>    must still be valid user addresses.
> 
> Is 2 a legacy from when access_ok() actually checked that the
> addresses were mapped into the process's address space?

It never did.  2 is for the situation when a 32bit process
accesses 64bit one; addresses that are perfectly legitimate
for 64bit userland (and fitting into the first 4Gb of address
space, so they can be represented by 32bit pointers just fine)
might be rejected by access_ok() if the caller is 32bit.


Re: [PATCH 04/11] iov_iter: explicitly check for CHECK_IOVEC_ONLY in rw_copy_check_uvector

2020-09-21 Thread Al Viro
On Mon, Sep 21, 2020 at 04:34:27PM +0200, Christoph Hellwig wrote:
> Explicitly check for the magic value insted of implicitly relying on
> its numeric representation.   Also drop the rather pointless unlikely
> annotation.

See above - I would rather have CHECK_IOVEC_ONLY gone.

The reason for doing these access_ok() in the same loop is fairly weak,
especially since you are checking type on each iteration.  Might as
well do that in a separate loop afterwards.


Re: [PATCH 02/11] mm: call import_iovec() instead of rw_copy_check_uvector() in process_vm_rw()

2020-09-21 Thread Al Viro
On Mon, Sep 21, 2020 at 04:34:25PM +0200, Christoph Hellwig wrote:
> From: David Laight 
> 
> This is the only direct call of rw_copy_check_uvector().  Removing it
> will allow rw_copy_check_uvector() to be inlined into import_iovec(),
> while only paying a minor price by setting up an otherwise unused
> iov_iter in the process_vm_readv/process_vm_writev syscalls that aren't
> in a super hot path.

> @@ -443,7 +443,7 @@ void iov_iter_init(struct iov_iter *i, unsigned int direction,
>   const struct iovec *iov, unsigned long nr_segs,
>   size_t count)
>  {
> - WARN_ON(direction & ~(READ | WRITE));
> + WARN_ON(direction & ~(READ | WRITE | CHECK_IOVEC_ONLY));
>   direction &= READ | WRITE;

Ugh...

> - rc = rw_copy_check_uvector(CHECK_IOVEC_ONLY, rvec, riovcnt, UIO_FASTIOV,
> -iovstack_r, &iov_r);
> + rc = import_iovec(CHECK_IOVEC_ONLY, rvec, riovcnt, UIO_FASTIOV, &iov_r,
> +   &iter_r);
>   if (rc <= 0)
>   goto free_iovecs;
>  
> - rc = process_vm_rw_core(pid, &iter, iov_r, riovcnt, flags, vm_write);
> + rc = process_vm_rw_core(pid, &iter_l, iter_r.iov, iter_r.nr_segs,
> + flags, vm_write);

... and ugh^2, since now you are not only setting a meaningless iov_iter,
you are creating a new place that pokes directly into struct iov_iter
guts.

Sure, moving rw_copy_check_uvector() over to lib/iov_iter.c makes sense.
But I would rather split the access_ok()-related checks out of that thing
and bury CHECK_IOVEC_ONLY.

Step 1: move the damn thing to lib/iov_iter.c (same as you do, but without
making it static)

Step 2: split it in two:

ssize_t rw_copy_check_uvector(const struct iovec __user * uvector,
  unsigned long nr_segs, unsigned long fast_segs,
  struct iovec *fast_pointer,
  struct iovec **ret_pointer)
{
unsigned long seg;
ssize_t ret;
struct iovec *iov = fast_pointer;

*ret_pointer = fast_pointer;
/*
 * SuS says "The readv() function *may* fail if the iovcnt argument
 * was less than or equal to 0, or greater than {IOV_MAX}.  Linux has
 * traditionally returned zero for zero segments, so...
 */
if (nr_segs == 0)
return 0;

/*
 * First get the "struct iovec" from user memory and
 * verify all the pointers
 */
if (nr_segs > UIO_MAXIOV)
return -EINVAL;

if (nr_segs > fast_segs) {
iov = kmalloc_array(nr_segs, sizeof(struct iovec), GFP_KERNEL);
if (!iov)
return -ENOMEM;
*ret_pointer = iov;
}
if (copy_from_user(iov, uvector, nr_segs*sizeof(*uvector)))
return -EFAULT;

/*
 * According to the Single Unix Specification we should return EINVAL
 * if an element length is < 0 when cast to ssize_t or if the
 * total length would overflow the ssize_t return value of the
 * system call.
 *
 * Linux caps all read/write calls to MAX_RW_COUNT, and avoids the
 * overflow case.
 */
ret = 0;
for (seg = 0; seg < nr_segs; seg++) {
void __user *buf = iov[seg].iov_base;
ssize_t len = (ssize_t)iov[seg].iov_len;

/* see if we're about to use an invalid len or if
 * it's about to overflow ssize_t */
if (len < 0)
return -EINVAL;
if (len > MAX_RW_COUNT - ret) {
len = MAX_RW_COUNT - ret;
iov[seg].iov_len = len;
}
ret += len;
}
return ret;
}

/*
 *  This is merely an early sanity check; we do _not_ rely upon
 *  it when we get to the actual memory accesses.
 */
static bool check_iovecs(const struct iovec *iov, int nr_segs)
{
int seg;

for (seg = 0; seg < nr_segs; seg++) {
void __user *buf = iov[seg].iov_base;
ssize_t len = (ssize_t)iov[seg].iov_len;

if (unlikely(!access_ok(buf, len)))
return false;
}
return true;
}

ssize_t import_iovec(int type, const struct iovec __user * uvector,
 unsigned nr_segs, unsigned fast_segs,
 struct iovec **iov, struct iov_iter *i)
{
struct iovec *p;
ssize_t n;

n = rw_copy_check_uvector(uvector, nr_segs, fast_segs, *iov, &p);
if (n > 0 && !check_iovecs(p, nr_segs))
n = -EFAULT;
if (n < 0) {
if (p != *iov)
kfree(p);
*iov = NULL;
return n;
}
iov_iter_init(i, type, p, nr_segs, n);
*iov = p == *iov ? NULL : p;
return n;
}

kill CHECK_IOVEC_ONLY and use 

Re: [PATCH 1/9] kernel: add a PF_FORCE_COMPAT flag

2020-09-20 Thread Al Viro
On Sun, Sep 20, 2020 at 08:22:59PM +0100, Matthew Wilcox wrote:
> On Sun, Sep 20, 2020 at 08:10:31PM +0100, Al Viro wrote:
> > IMO it's much saner to mark those and refuse to touch them from io_uring...
> 
> Simpler solution is to remove io_uring from the 32-bit syscall list.
> If you're a 32-bit process, you don't get to use io_uring.  Would
> any real users actually care about that?

What for?  I mean, is there any reason to try and keep those bugs as
first-class citizens?  IDGI...  Yes, we have several special files
(out of thousands) that have read()/write() user-visible semantics
broken wrt 32bit/64bit.  And we have to keep them working that way
for existing syscalls.  Why would we want to pretend that their
behaviour is normal and isn't an ABI bug, not to be repeated for
anything new?


Re: [PATCH 1/9] kernel: add a PF_FORCE_COMPAT flag

2020-09-20 Thread Al Viro
On Sun, Sep 20, 2020 at 08:01:59PM +0100, Matthew Wilcox wrote:
> On Sun, Sep 20, 2020 at 07:07:42PM +0100, Al Viro wrote:
> > 2) a few drivers are really fucked in head.  They use different
> > *DATA* layouts for reads/writes, depending upon the calling process.
> > IOW, if you fork/exec a 32bit binary and your stdin is one of those,
> > reads from stdin in parent and child will yield different data layouts.
> > On the same struct file.
> > That's what Christoph worries about (/dev/sg he'd mentioned is
> > one of those).
> > 
> > IMO we should simply have that dozen or so of pathological files
> > marked with FMODE_SHITTY_ABI; it's not about how they'd been opened -
> > it describes the userland ABI provided by those.  And it's cast in stone.
> > 
> > Any in_compat_syscall() in ->read()/->write() instances is an ABI
> > bug, plain and simple.  Some are unfixable for compatibility reasons, but
> > any new caller like that should be a big red flag.
> 
> So an IOCB_COMPAT flag would let us know whether the caller is expecting
> a 32-bit or 64-bit layout?  And io_uring could set it based on the
> ctx->compat flag.

So you want to convert all that crap to a form that would see iocb
(IOW, ->read_iter()/->write_iter())?  And, since quite a few are sysfs
attributes, propagate that through sysfs, changing the method signatures
to match that and either modifying fuckloads of instances or introducing
new special methods for the ones that are sensitive to that crap?

IMO it's much saner to mark those and refuse to touch them from io_uring...
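
I.e. something along these lines (strawman only - the flag value and the exact
placement of the check are just for illustration):

/* fs.h: some currently unused fmode_t bit */
#define FMODE_SHITTY_ABI	((__force fmode_t)(1 << 30))

/* io_uring, before accepting a read/write-style request on that file */
if (unlikely(file->f_mode & FMODE_SHITTY_ABI))
	return -EINVAL;

with the handful of offending drivers setting the flag at open time.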


Re: [PATCH 1/9] kernel: add a PF_FORCE_COMPAT flag

2020-09-20 Thread Al Viro
On Sun, Sep 20, 2020 at 07:07:42PM +0100, Al Viro wrote:

> /proc/bus/input/devices (fucked bitmap-to-text representation)

To illustrate the, er, beauty of that stuff:

; cat32 /proc/bus/input/devices >/tmp/a
; cat /proc/bus/input/devices >/tmp/b
; diff -u /tmp/a /tmp/b|grep '^[-+]'
--- /tmp/a  2020-09-20 14:28:43.442560691 -0400
+++ /tmp/b  2020-09-20 14:28:49.018543230 -0400
-B: KEY=1100f 2902000 8380307c f910f001 fedf ffef  fffe
+B: KEY=1100f02902000 8380307cf910f001 fedfffef fffe
-B: KEY=7 0 0 0 0 0 0 0 0
+B: KEY=7 0 0 0 0
-B: KEY=420 0 7 0 0 0 0 0 0 0 0
+B: KEY=420 7 0 0 0 0
-B: KEY=10 0 0 0
+B: KEY=10 0
-B: KEY=4000 0 0 0 0
+B: KEY=4000 0 0
-B: KEY=8000 0 0 0 0 0 1100b 800 2 20 0 0 0 0
+B: KEY=8000 0 0 1100b0800 20020 0 0
-B: KEY=3e000b 0 0 0 0 0 0 0
+B: KEY=3e000b 0 0 0
-B: KEY=       fffe
+B: KEY=   fffe
; 
(cat32 being a 32bit binary of cat)
All the differences are due to homegrown bitmap-to-text conversion.

Note that feeding that into a pipe leaves the recipient with a lovely problem -
you can't go by the width of words (they are not zero-padded) and you can't
go by the number of words either - it varies from device to device.

And there's nothing we can do with that - it's a userland ABI, can't be
changed without breaking stuff.  I would prefer to avoid additional examples
like that, TYVM...
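
For illustration, a minimal userspace sketch of the problem (the helper and the
sample bitmap are made up for the example; this is not the kernel's code): the
bitmap gets printed as space-separated unsigned-long-sized words, most
significant word first, with no zero-padding, so a 32bit reader sees every
64bit word split in two.

#include <stdio.h>

/* print a 128-bit bitmap the way the homegrown conversion above does:
 * one hex word per BITS_PER_LONG, most significant word first, no
 * zero-padding */
static void print_key_bitmap(const unsigned long long *bits, int bits_per_long)
{
        int words = 128 / bits_per_long;

        for (int i = words - 1; i >= 0; i--) {
                unsigned long long w = bits[i / (64 / bits_per_long)];

                if (bits_per_long == 32)
                        w = (w >> (32 * (i & 1))) & 0xffffffffULL;
                printf("%llx%c", w, i ? ' ' : '\n');
        }
}

int main(void)
{
        /* sample KEY bitmap, two 64-bit words, picked to match the first
         * hunk of the diff above */
        unsigned long long key[2] = { 0x8380307cf910f001ULL, 0x1100f02902000ULL };

        print_key_bitmap(key, 64);      /* 64bit reader: 2 words */
        print_key_bitmap(key, 32);      /* 32bit reader: 4 words */
        return 0;
}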


Re: [PATCH 1/9] kernel: add a PF_FORCE_COMPAT flag

2020-09-20 Thread Al Viro
On Sun, Sep 20, 2020 at 09:59:36AM -0700, Andy Lutomirski wrote:

> As one example, look at __sys_setsockopt().  It's called for the
> native and compat versions, and it contains an in_compat_syscall()
> check.  (This particularly check looks dubious to me, but that's
> another story.)  If this were to be done with equivalent semantics
> without a separate COMPAT_DEFINE_SYSCALL and without
> in_compat_syscall(), there would need to be some indication as to
> whether this is compat or native setsockopt.  There are other
> setsockopt implementations in the net stack with more
> legitimate-seeming uses of in_compat_syscall() that would need some
> other mechanism if in_compat_syscall() were to go away.
> 
> setsockopt is (I hope!) out of scope for io_uring, but the situation
> isn't fundamentally different from read and write.

Except that setsockopt() had that crap very widespread; for read()
and write() those are very rare exceptions.

Andy, please RTFS.  Or dig through archives.  The situation
with setsockopt() is *NOT* a good thing - it's (probably) the least
of the evils.  The last thing we need is making that the norm.


Re: [PATCH 1/9] kernel: add a PF_FORCE_COMPAT flag

2020-09-20 Thread Al Viro
On Sun, Sep 20, 2020 at 04:15:10PM +0100, Matthew Wilcox wrote:
> On Fri, Sep 18, 2020 at 02:45:25PM +0200, Christoph Hellwig wrote:
> > Add a flag to force processing a syscall as a compat syscall.  This is
> > required so that in_compat_syscall() works for I/O submitted by io_uring
> > helper threads on behalf of compat syscalls.
> 
> Al doesn't like this much, but my suggestion is to introduce two new
> opcodes -- IORING_OP_READV32 and IORING_OP_WRITEV32.  The compat code
> can translate IORING_OP_READV to IORING_OP_READV32 and then the core
> code can know what that user pointer is pointing to.

Let's separate two issues:
1) compat syscalls want 32bit iovecs.  Nothing to do with the
drivers, dealt with just fine.
2) a few drivers are really fucked in head.  They use different
*DATA* layouts for reads/writes, depending upon the calling process.
IOW, if you fork/exec a 32bit binary and your stdin is one of those,
reads from stdin in parent and child will yield different data layouts.
On the same struct file.
That's what Christoph worries about (/dev/sg he'd mentioned is
one of those).

IMO we should simply have that dozen or so of pathological files
marked with FMODE_SHITTY_ABI; it's not about how they'd been opened -
it describes the userland ABI provided by those.  And it's cast in stone.

Any in_compat_syscall() in ->read()/->write() instances is an ABI
bug, plain and simple.  Some are unfixable for compatibility reasons, but
any new caller like that should be a big red flag.

How we import iovec array is none of the drivers' concern; we do
not need to mess with in_compat_syscall() reporting the matching value,
etc. for that.  It's about the instances that want in_compat_syscall() to
decide between the 32bit and 64bit data layouts.  And I believe that
we should simply have them marked as such and rejected by io_uring.  With
any new occurrences getting slapped down hard.

Current list of those turds:
/dev/sg (pointer-chasing, generally insane)
/sys/firmware/efi/vars/*/raw_var (fucked binary structure)
/sys/firmware/efi/vars/new_var (fucked binary structure)
/sys/firmware/efi/vars/del_var (fucked binary structure)
/dev/uhid   (pointer-chasing for one obsolete command)
/dev/input/event* (timestamps)
/dev/uinput (timestamps)
/proc/bus/input/devices (fucked bitmap-to-text representation)
/sys/class/input/*/capabilities/* (fucked bitmap-to-text representation)
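
A minimal sketch of what that marking could look like (the flag name is the one
proposed above; the bit value, the helper and the check site are invented for
illustration and are not existing kernel interfaces):

/* fs.h: value purely illustrative, would need a free fmode_t bit */
#define FMODE_SHITTY_ABI        ((__force fmode_t)0x8000000)

/* each of the files listed above sets it at open time, e.g.
 *      filp->f_mode |= FMODE_SHITTY_ABI;
 * in the corresponding ->open() instance. */

/* io_uring then refuses to touch such files: */
static bool io_file_abi_sane(struct file *file)
{
        return !(file->f_mode & FMODE_SHITTY_ABI);
}
/* ...and fails the read/write request with -EINVAL if that returns false. */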


Re: [PATCH 1/9] kernel: add a PF_FORCE_COMPAT flag

2020-09-20 Thread Al Viro
On Sun, Sep 20, 2020 at 03:55:47PM +0200, Arnd Bergmann wrote:
> On Sun, Sep 20, 2020 at 12:09 AM Al Viro  wrote:
> > On Fri, Sep 18, 2020 at 05:16:15PM +0200, Christoph Hellwig wrote:
> > > On Fri, Sep 18, 2020 at 02:58:22PM +0100, Al Viro wrote:
> > > > Said that, why not provide a variant that would take an explicit
> > > > "is it compat" argument and use it there?  And have the normal
> > > > one pass in_compat_syscall() to that...
> > >
> > > That would help to not introduce a regression with this series yes.
> > > But it wouldn't fix existing bugs when io_uring is used to access
> > > read or write methods that use in_compat_syscall().  One example that
> > > I recently ran into is drivers/scsi/sg.c.
> >
> > So screw such read/write methods - don't use them with io_uring.
> > That, BTW, is one of the reasons I'm sceptical about burying the
> > decisions deep into the callchain - we don't _want_ different
> > data layouts on read/write depending upon the 32bit vs. 64bit
> > caller, let alone the pointer-chasing garbage that is /dev/sg.
> 
> Would it be too late to limit what kind of file descriptors we allow
> io_uring to read/write on?
> 
> If io_uring can get changed to return -EINVAL on trying to
> read/write something other than S_IFREG file descriptors,
> that particular problem space gets a lot simpler, but this
> is of course only possible if nobody actually relies on it yet.

S_IFREG is almost certainly too heavy as a restriction.  Looking through
the stuff sensitive to 32bit/64bit, we seem to have
* /dev/sg - pointer-chasing horror
* sysfs files for efivar - different layouts for compat and native,
shitty userland ABI design (
struct efi_variable {
efi_char16_t  VariableName[EFI_VAR_NAME_LEN/sizeof(efi_char16_t)];
efi_guid_tVendorGuid;
unsigned long DataSize;
__u8  Data[1024];
efi_status_t  Status;
__u32 Attributes;
} __attribute__((packed));
) is the piece of crap in question; 'DataSize' is where the headache comes
from.  Regular files, BTW...
* uhid - character device, milder pointer-chasing horror.  Trouble
comes from this:
/* Obsolete! Use UHID_CREATE2. */
struct uhid_create_req {
__u8 name[128];
__u8 phys[64];
__u8 uniq[64];
__u8 __user *rd_data;
__u16 rd_size;

__u16 bus;
__u32 vendor;
__u32 product;
__u32 version;
__u32 country;
} __attribute__((__packed__));
and suggested replacement doesn't do any pointer-chasing (rd_data is an
embedded array in the end of struct uhid_create2_req).
* evdev, uinput - bitness-sensitive layout, due to timestamps
* /proc/bus/input/devices - weird crap with printing bitmap, different
_text_ layouts seen by 32bit and 64bit readers.  Binary structures are PITA,
but with sufficient effort you can screw the text just as hard...  Oh, and it's
a regular file.
* similar in sysfs analogue

And AFAICS, that's it for read/write-related method instances.
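
To make the DataSize headache concrete, here is a standalone sketch (the type
stand-ins and the 1024-byte name length are assumptions for the example)
showing how the packed layout still shifts with sizeof(unsigned long):

#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

/* illustrative stand-ins for the kernel-side types */
typedef uint16_t efi_char16_t;
typedef struct { uint8_t b[16]; } efi_guid_t;
typedef unsigned long efi_status_t;     /* 4 bytes on 32bit, 8 on 64bit */
#define EFI_VAR_NAME_LEN 1024

struct efi_variable {
        efi_char16_t  VariableName[EFI_VAR_NAME_LEN/sizeof(efi_char16_t)];
        efi_guid_t    VendorGuid;
        unsigned long DataSize;         /* <- the bitness-sensitive field */
        uint8_t       Data[1024];
        efi_status_t  Status;
        uint32_t      Attributes;
} __attribute__((packed));

int main(void)
{
        printf("sizeof=%zu Data@%zu Status@%zu\n",
               sizeof(struct efi_variable),
               offsetof(struct efi_variable, Data),
               offsetof(struct efi_variable, Status));
        /* a 32bit and a 64bit build print different numbers, so a blob
         * written with one layout can't be parsed with the other without
         * an explicit compat variant */
        return 0;
}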


Re: [PATCH 1/9] kernel: add a PF_FORCE_COMPAT flag

2020-09-19 Thread Al Viro
On Sat, Sep 19, 2020 at 05:14:41PM -0700, Andy Lutomirski wrote:

> > 2) have you counted the syscalls that do and do not need that?
> 
> No.

Might be illuminating...

> > 3) how many of those realistically *can* be unified with their
> > compat counterparts?  [hint: ioctl(2) cannot]
> 
> There would be no requirement to unify anything.  The idea is that
> we'd get rid of all the global state flags.

_What_ global state flags?  When you have separate SYSCALL_DEFINE3(ioctl...)
and COMPAT_SYSCALL_DEFINE3(ioctl...), there's no flags at all, global or
local.  They only come into the play when you try to share the same function
for both, right on the top level.

> For ioctl, we'd have a new file_operation:
> 
> long ioctl(struct file *, unsigned int, unsigned long, enum syscall_arch);
> 
> I'm not saying this is easy, but I think it's possible and the result
> would be more obviously correct than what we have now.

No, it would not.  Seriously, from time to time a bit of RTFS before grand
proposals turns out to be useful.


Re: [PATCH 1/9] kernel: add a PF_FORCE_COMPAT flag

2020-09-19 Thread Al Viro
On Sat, Sep 19, 2020 at 03:53:40PM -0700, Andy Lutomirski wrote:

> > It would not be a win - most of the syscalls don't give a damn
> > about 32bit vs. 64bit...
> 
> Any reasonable implementation would optimize it out for syscalls that don’t 
> care.  Or it could be explicit:
> 
> DEFINE_MULTIARCH_SYSCALL(...)

1) what would that look like?
2) have you counted the syscalls that do and do not need that?
3) how many of those realistically *can* be unified with their
compat counterparts?  [hint: ioctl(2) cannot]


Re: [PATCH 1/9] kernel: add a PF_FORCE_COMPAT flag

2020-09-19 Thread Al Viro
On Sat, Sep 19, 2020 at 03:23:54PM -0700, Andy Lutomirski wrote:
> 
> > On Sep 19, 2020, at 3:09 PM, Al Viro  wrote:
> > 
> > On Fri, Sep 18, 2020 at 05:16:15PM +0200, Christoph Hellwig wrote:
> >>> On Fri, Sep 18, 2020 at 02:58:22PM +0100, Al Viro wrote:
> >>> Said that, why not provide a variant that would take an explicit
> >>> "is it compat" argument and use it there?  And have the normal
> >>> one pass in_compat_syscall() to that...
> >> 
> >> That would help to not introduce a regression with this series yes.
> >> But it wouldn't fix existing bugs when io_uring is used to access
> >> read or write methods that use in_compat_syscall().  One example that
> >> I recently ran into is drivers/scsi/sg.c.
> > 
> > So screw such read/write methods - don't use them with io_uring.
> > That, BTW, is one of the reasons I'm sceptical about burying the
> > decisions deep into the callchain - we don't _want_ different
> > data layouts on read/write depending upon the 32bit vs. 64bit
> > caller, let alone the pointer-chasing garbage that is /dev/sg.
> 
> Well, we could remove in_compat_syscall(), etc and instead have an implicit 
> parameter in DEFINE_SYSCALL.  Then everything would have to be explicit.  
> This would probably be a win, although it could be quite a bit of work.

It would not be a win - most of the syscalls don't give a damn
about 32bit vs. 64bit...


Re: [PATCH 1/9] kernel: add a PF_FORCE_COMPAT flag

2020-09-19 Thread Al Viro
On Fri, Sep 18, 2020 at 05:16:15PM +0200, Christoph Hellwig wrote:
> On Fri, Sep 18, 2020 at 02:58:22PM +0100, Al Viro wrote:
> > Said that, why not provide a variant that would take an explicit
> > "is it compat" argument and use it there?  And have the normal
> > one pass in_compat_syscall() to that...
> 
> That would help to not introduce a regression with this series yes.
> But it wouldn't fix existing bugs when io_uring is used to access
> read or write methods that use in_compat_syscall().  One example that
> I recently ran into is drivers/scsi/sg.c.

So screw such read/write methods - don't use them with io_uring.
That, BTW, is one of the reasons I'm sceptical about burying the
decisions deep into the callchain - we don't _want_ different
data layouts on read/write depending upon the 32bit vs. 64bit
caller, let alone the pointer-chasing garbage that is /dev/sg.


Re: [PATCH 1/9] kernel: add a PF_FORCE_COMPAT flag

2020-09-18 Thread Al Viro
On Fri, Sep 18, 2020 at 03:44:06PM +0200, Christoph Hellwig wrote:
> On Fri, Sep 18, 2020 at 02:40:12PM +0100, Al Viro wrote:
> > >   /* Vector 0x110 is LINUX_32BIT_SYSCALL_TRAP */
> > > - return pt_regs_trap_type(current_pt_regs()) == 0x110;
> > > + return pt_regs_trap_type(current_pt_regs()) == 0x110 ||
> > > + (current->flags & PF_FORCE_COMPAT);
> > 
> > Can't say I like that approach ;-/  Reasoning about the behaviour is much
> > harder when it's controlled like that - witness set_fs() shite...
> 
> I don't particularly like it either.  But do you have a better idea
> how to deal with io_uring vs compat tasks?

 git rm fs/io_uring.c would make a good starting point 
Yes, I know it's not going to happen, but one can dream...

Said that, why not provide a variant that would take an explicit
"is it compat" argument and use it there?  And have the normal
one pass in_compat_syscall() to that...
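
Roughly this shape, taking the iovec import as an example (a sketch of the
calling convention only; the __import_iovec variant and its exact signature
are illustrative):

/* the worker takes the decision explicitly... */
ssize_t __import_iovec(int type, const struct iovec __user *uvec,
                       unsigned nr_segs, unsigned fast_segs,
                       struct iovec **iovp, struct iov_iter *i,
                       bool compat);

/* ...the normal entry point keeps the old behaviour... */
static inline ssize_t import_iovec(int type, const struct iovec __user *uvec,
                                   unsigned nr_segs, unsigned fast_segs,
                                   struct iovec **iovp, struct iov_iter *i)
{
        return __import_iovec(type, uvec, nr_segs, fast_segs, iovp, i,
                              in_compat_syscall());
}

/* ...and io_uring passes the ctx->compat bit it recorded at ring setup
 * instead of poking at per-task state. */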


Re: [PATCH 1/9] kernel: add a PF_FORCE_COMPAT flag

2020-09-18 Thread Al Viro
On Fri, Sep 18, 2020 at 02:45:25PM +0200, Christoph Hellwig wrote:
> Add a flag to force processing a syscall as a compat syscall.  This is
> required so that in_compat_syscall() works for I/O submitted by io_uring
> helper threads on behalf of compat syscalls.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  arch/sparc/include/asm/compat.h | 3 ++-
>  arch/x86/include/asm/compat.h   | 2 +-
>  fs/io_uring.c   | 9 +
>  include/linux/compat.h  | 5 -
>  include/linux/sched.h   | 1 +
>  5 files changed, 17 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/sparc/include/asm/compat.h b/arch/sparc/include/asm/compat.h
> index 40a267b3bd5208..fee6c51d36e869 100644
> --- a/arch/sparc/include/asm/compat.h
> +++ b/arch/sparc/include/asm/compat.h
> @@ -211,7 +211,8 @@ static inline int is_compat_task(void)
>  static inline bool in_compat_syscall(void)
>  {
>   /* Vector 0x110 is LINUX_32BIT_SYSCALL_TRAP */
> - return pt_regs_trap_type(current_pt_regs()) == 0x110;
> + return pt_regs_trap_type(current_pt_regs()) == 0x110 ||
> + (current->flags & PF_FORCE_COMPAT);

Can't say I like that approach ;-/  Reasoning about the behaviour is much
harder when it's controlled like that - witness set_fs() shite...


Re: [PATCH 12/14] x86: remove address space overrides using set_fs()

2020-09-03 Thread Al Viro
On Fri, Sep 04, 2020 at 03:55:10AM +0100, Al Viro wrote:
> On Thu, Sep 03, 2020 at 04:22:40PM +0200, Christoph Hellwig wrote:
> 
> > diff --git a/arch/x86/lib/getuser.S b/arch/x86/lib/getuser.S
> > index c8a85b512796e1..94f7be4971ed04 100644
> > --- a/arch/x86/lib/getuser.S
> > +++ b/arch/x86/lib/getuser.S
> > @@ -35,10 +35,19 @@
> >  #include 
> >  #include 
> >  
> > +#ifdef CONFIG_X86_5LEVEL
> > +#define LOAD_TASK_SIZE_MINUS_N(n) \
> > +   ALTERNATIVE "mov $((1 << 47) - 4096 - (n)),%rdx", \
> > +   "mov $((1 << 56) - 4096 - (n)),%rdx", X86_FEATURE_LA57
> > +#else
> > +#define LOAD_TASK_SIZE_MINUS_N(n) \
> > +   mov $(TASK_SIZE_MAX - (n)),%_ASM_DX
> > +#endif
> 
> Wait a sec... how is that supposed to build with X86_5LEVEL?  Do you mean
> 
> #define LOAD_TASK_SIZE_MINUS_N(n) \
>   ALTERNATIVE __stringify(mov $((1 << 47) - 4096 - (n)),%rdx), \
>   __stringify(mov $((1 << 56) - 4096 - (n)),%rdx), X86_FEATURE_LA57
> 
> there?

Pushed out with the following folded in.

diff --git a/arch/x86/lib/getuser.S b/arch/x86/lib/getuser.S
index 94f7be4971ed..2f052bc96866 100644
--- a/arch/x86/lib/getuser.S
+++ b/arch/x86/lib/getuser.S
@@ -37,8 +37,8 @@
 
 #ifdef CONFIG_X86_5LEVEL
 #define LOAD_TASK_SIZE_MINUS_N(n) \
-   ALTERNATIVE "mov $((1 << 47) - 4096 - (n)),%rdx", \
-   "mov $((1 << 56) - 4096 - (n)),%rdx", X86_FEATURE_LA57
+   ALTERNATIVE __stringify(mov $((1 << 47) - 4096 - (n)),%rdx), \
+   __stringify(mov $((1 << 56) - 4096 - (n)),%rdx), X86_FEATURE_LA57
 #else
 #define LOAD_TASK_SIZE_MINUS_N(n) \
mov $(TASK_SIZE_MAX - (n)),%_ASM_DX
diff --git a/arch/x86/lib/putuser.S b/arch/x86/lib/putuser.S
index 445374885153..358239d77dff 100644
--- a/arch/x86/lib/putuser.S
+++ b/arch/x86/lib/putuser.S
@@ -33,8 +33,8 @@
 
 #ifdef CONFIG_X86_5LEVEL
 #define LOAD_TASK_SIZE_MINUS_N(n) \
-   ALTERNATIVE "mov $((1 << 47) - 4096 - (n)),%rbx", \
-   "mov $((1 << 56) - 4096 - (n)),%rbx", X86_FEATURE_LA57
+   ALTERNATIVE __stringify(mov $((1 << 47) - 4096 - (n)),%rbx), \
+   __stringify(mov $((1 << 56) - 4096 - (n)),%rbx), X86_FEATURE_LA57
 #else
 #define LOAD_TASK_SIZE_MINUS_N(n) \
mov $(TASK_SIZE_MAX - (n)),%_ASM_BX


Re: [PATCH 12/14] x86: remove address space overrides using set_fs()

2020-09-03 Thread Al Viro
On Thu, Sep 03, 2020 at 04:22:40PM +0200, Christoph Hellwig wrote:

> diff --git a/arch/x86/lib/getuser.S b/arch/x86/lib/getuser.S
> index c8a85b512796e1..94f7be4971ed04 100644
> --- a/arch/x86/lib/getuser.S
> +++ b/arch/x86/lib/getuser.S
> @@ -35,10 +35,19 @@
>  #include 
>  #include 
>  
> +#ifdef CONFIG_X86_5LEVEL
> +#define LOAD_TASK_SIZE_MINUS_N(n) \
> + ALTERNATIVE "mov $((1 << 47) - 4096 - (n)),%rdx", \
> + "mov $((1 << 56) - 4096 - (n)),%rdx", X86_FEATURE_LA57
> +#else
> +#define LOAD_TASK_SIZE_MINUS_N(n) \
> + mov $(TASK_SIZE_MAX - (n)),%_ASM_DX
> +#endif

Wait a sec... how is that supposed to build with X86_5LEVEL?  Do you mean

#define LOAD_TASK_SIZE_MINUS_N(n) \
ALTERNATIVE __stringify(mov $((1 << 47) - 4096 - (n)),%rdx), \
__stringify(mov $((1 << 56) - 4096 - (n)),%rdx), X86_FEATURE_LA57

there?


Re: remove the last set_fs() in common code, and remove it for x86 and powerpc v3

2020-09-03 Thread Al Viro
On Thu, Sep 03, 2020 at 04:30:03PM +0200, Christoph Hellwig wrote:
> On Thu, Sep 03, 2020 at 03:28:03PM +0100, Al Viro wrote:
> > On Thu, Sep 03, 2020 at 04:22:28PM +0200, Christoph Hellwig wrote:
> > 
> > > Besides x86 and powerpc I plan to eventually convert all other
> > > architectures, although this will be a slow process, starting with the
> > > easier ones once the infrastructure is merged.  The process to convert
> > > architectures is roughly:
> > > 
> > >  (1) ensure there is no set_fs(KERNEL_DS) left in arch specific code
> > >  (2) implement __get_kernel_nofault and __put_kernel_nofault
> > >  (3) remove the arch specific address limitation functionality
> > 
> > The one to really watch out for is sparc; I have something in that
> > direction, will resurrect as soon as I'm done with eventpoll analysis...
> > 
> > I can live with this series; do you want that in vfs.git#for-next?
> 
> Either that or a separate tree is fine with me.  It would be good to
> eventually have a non-rebased stable tree so that other arch trees
> can work from it, though.

FWIW, vfs.git#for-next is always a merge of independent branches; I don't
put stuff directly into #for-next - too easy to lose that way.

IOW, that would be something like #base.set_fs, included into #for-next
merge set.  And I've no problem with never-rebased branches...

Where in the mainline are the most recent prereqs of this series?


Re: remove the last set_fs() in common code, and remove it for x86 and powerpc v3

2020-09-03 Thread Al Viro
On Thu, Sep 03, 2020 at 04:22:28PM +0200, Christoph Hellwig wrote:

> Besides x86 and powerpc I plan to eventually convert all other
> architectures, although this will be a slow process, starting with the
> easier ones once the infrastructure is merged.  The process to convert
> architectures is roughly:
> 
>  (1) ensure there is no set_fs(KERNEL_DS) left in arch specific code
>  (2) implement __get_kernel_nofault and __put_kernel_nofault
>  (3) remove the arch specific address limitation functionality

The one to really watch out for is sparc; I have something in that
direction, will resurrect as soon as I'm done with eventpoll analysis...

I can live with this series; do you want that in vfs.git#for-next?


Re: remove the last set_fs() in common code, and remove it for x86 and powerpc v2

2020-09-01 Thread Al Viro
On Tue, Sep 01, 2020 at 07:13:00PM +0200, Christophe Leroy wrote:

> 10.92%  dd   [kernel.kallsyms]  [k] iov_iter_zero

Interesting...  Could you get an instruction-level profile inside iov_iter_zero(),
along with the disassembly of that sucker?


Re: [GIT PULL] Please pull powerpc/linux.git powerpc-5.9-1 tag

2020-08-07 Thread Al Viro
On Fri, Aug 07, 2020 at 10:46:13AM -0700, Linus Torvalds wrote:
> On Fri, Aug 7, 2020 at 6:14 AM Michael Ellerman  wrote:
> >
> > Just one minor conflict, in a comment in drivers/misc/ocxl/config.c.
> 
> Well, this morning I merged the ptrace ->regset_get() updates from Al,
> and that brought in a different conflict.
> 
> I _think_ I resolved it correctly, but while the new model is fairly
> readable, the old one sure wasn't, and who knows how messed up my
> attempt to sort it out was. I don't know the pkey details on powerpc..
> 
> So I'd appreciate it if both Al and Aneesh Kumar would check that what
> I did to pkey_get() in arch/powerpc/kernel/ptrace/ptrace-view.c makes
> sense and works..

Grabbing...

Looks sane and yes, 3 membuf_store() instead of membuf_write() + membuf_store()
would make sense (might even yield better code).  Up to ppc folks...

> Side note - it might have been cleaner to just make it do
> 
> membuf_store(&to, target->thread.amr);
> membuf_store(&to, target->thread.iamr);
> return membuf_store(&to, default_uamor);
> 
> instead of doing that membuf_write() for the first two ones and then
> the membuf_store() for the uamor field, but I did what I did to keep
> the logic as close to what it used to be as possible.
> 
> If I messed up, I apologize.
> 
> And if you agree that making it three membuf_store() instead of that
> odd "depend on the exact order of the thread struct and pick two
> consecutive values", I'll leave that to you as a separate cleanup.
> 
>Linus


Re: [PATCH] arch/{mips,sparc,microblaze,powerpc}: Don't enable pagefault/preempt twice

2020-05-21 Thread Al Viro
On Fri, May 22, 2020 at 02:29:50AM +0100, Al Viro wrote:
> On Thu, May 21, 2020 at 06:11:08PM -0700, Guenter Roeck wrote:
> 
> > Mainline, with:
> > 
> > qemu-system-sparc -M SS-4 -kernel arch/sparc/boot/zImage -no-reboot \
> > -snapshot -drive file=rootfs.ext2,format=raw,if=scsi \
> > -append "panic=-1 slub_debug=FZPUA root=/dev/sda console=ttyS0"
> > -nographic -monitor none
> > 
> > The machine doesn't really matter, though.
> 
> It does, unfortunately - try that with SS-10 and watch what happens ;-/

Ugh...  It's actually something in -m handling: -m 256 passes, -m 512
leads to that panic.


Re: [PATCH] arch/{mips,sparc,microblaze,powerpc}: Don't enable pagefault/preempt twice

2020-05-21 Thread Al Viro
On Thu, May 21, 2020 at 06:11:08PM -0700, Guenter Roeck wrote:

> Mainline, with:
> 
> qemu-system-sparc -M SS-4 -kernel arch/sparc/boot/zImage -no-reboot \
>   -snapshot -drive file=rootfs.ext2,format=raw,if=scsi \
>   -append "panic=-1 slub_debug=FZPUA root=/dev/sda console=ttyS0"
>   -nographic -monitor none
> 
> The machine doesn't really matter, though.

It does, unfortunately - try that with SS-10 and watch what happens ;-/


Re: [PATCH] arch/{mips,sparc,microblaze,powerpc}: Don't enable pagefault/preempt twice

2020-05-21 Thread Al Viro
On Thu, May 21, 2020 at 11:46:12PM +0100, Al Viro wrote:
> On Thu, May 21, 2020 at 03:20:46PM -0700, Guenter Roeck wrote:
> > On 5/21/20 10:27 AM, Al Viro wrote:
> > > On Tue, May 19, 2020 at 09:54:22AM -0700, Guenter Roeck wrote:
> > >> On Mon, May 18, 2020 at 11:48:43AM -0700, ira.we...@intel.com wrote:
> > >>> From: Ira Weiny 
> > >>>
> > >>> The kunmap_atomic clean up failed to remove one set of pagefault/preempt
> > >>> enables when vaddr is not in the fixmap.
> > >>>
> > >>> Fixes: bee2128a09e6 ("arch/kunmap_atomic: consolidate duplicate code")
> > >>> Signed-off-by: Ira Weiny 
> > >>
> > >> microblazeel works with this patch, as do the nosmp sparc32 boot tests,
> > >> but sparc32 boot tests with SMP enabled still fail with lots of messages
> > >> such as:
> > > 
> > > BTW, what's your setup for sparc32 boot tests?  IOW, how do you manage to
> > > shrink the damn thing enough to have the loader cope with it?  I hadn't
> > > been able to do that for the current mainline ;-/
> > > 
> > 
> > defconfig seems to work just fine, even after enabling various debug
> > and file system options.
> 
> The hell?  How do you manage to get the kernel in?  sparc32_defconfig
> ends up with 5316876 bytes unpacked...

Incidentally, trying to load it via -kernel/-initrd leads to
Configuration device id QEMU version 1 machine id 64
Probing SBus slot 0 offset 0
Probing SBus slot 1 offset 0
Probing SBus slot 2 offset 0
Probing SBus slot 3 offset 0
Probing SBus slot 15 offset 0
Invalid FCode start byte
CPUs: 1 x TI,TMS390Z55
UUID: ----
Welcome to OpenBIOS v1.1 built on Dec 27 2018 19:17
  Type 'help' for detailed information
[sparc] Kernel already loaded
switching to new context:
PROMLIB: obio_ranges 1
PROMLIB: Sun Boot Prom Version 3 Revision 2
Linux version 5.7.0-rc1-2-gcf51e129b968 (al@duke) (gcc version 6.3.0 20170516 (Debian 6.3.0-18), GNU ld (GNU Binutils for Debian) 2.28) #32 Thu May 21 18:36:07 EDT 2020
printk: bootconsole [earlyprom0] enabled
ARCH: SUN4M
TYPE: Sun4m SparcStation10/20
Ethernet address: 52:54:00:12:34:56
303MB HIGHMEM available.
OF stdout device is: /obio/zs@0,10:a
PROM: Built device tree with 30051 bytes of memory.
Booting Linux...
Power off control detected.
Kernel panic - not syncing: Failed to allocate memory for percpu areas.
CPU: 0 PID: 0 Comm: swapper Not tainted 5.7.0-rc1-2-gcf51e129b968 #32
[f04f92a8 : setup_per_cpu_areas+0x58/0x90 ]
[f04edbf4 : start_kernel+0xc0/0x4a0 ]
[f04ed43c : continue_boot+0x324/0x334 ]
[ : 0x0 ]

Press Stop-A (L1-A) from sun keyboard or send break
twice on console to return to the boot prom
---[ end Kernel panic - not syncing: Failed to allocate memory for percpu areas. ]---

Giving guest more RAM doesn't change the outcome (well, the number in the
HIGHMEM line is obviously higher, but that's it).

So which sparc32 kernel have you booted with defconfig and how have you done
that?


Re: [PATCH] arch/{mips,sparc,microblaze,powerpc}: Don't enable pagefault/preempt twice

2020-05-21 Thread Al Viro
On Thu, May 21, 2020 at 03:20:46PM -0700, Guenter Roeck wrote:
> On 5/21/20 10:27 AM, Al Viro wrote:
> > On Tue, May 19, 2020 at 09:54:22AM -0700, Guenter Roeck wrote:
> >> On Mon, May 18, 2020 at 11:48:43AM -0700, ira.we...@intel.com wrote:
> >>> From: Ira Weiny 
> >>>
> >>> The kunmap_atomic clean up failed to remove one set of pagefault/preempt
> >>> enables when vaddr is not in the fixmap.
> >>>
> >>> Fixes: bee2128a09e6 ("arch/kunmap_atomic: consolidate duplicate code")
> >>> Signed-off-by: Ira Weiny 
> >>
> >> microblazeel works with this patch, as do the nosmp sparc32 boot tests,
> >> but sparc32 boot tests with SMP enabled still fail with lots of messages
> >> such as:
> > 
> > BTW, what's your setup for sparc32 boot tests?  IOW, how do you manage to
> > shrink the damn thing enough to have the loader cope with it?  I hadn't
> > been able to do that for the current mainline ;-/
> > 
> 
> defconfig seems to work just fine, even after enabling various debug
> and file system options.

The hell?  How do you manage to get the kernel in?  sparc32_defconfig
ends up with 5316876 bytes unpacked...


Re: [PATCH] arch/{mips,sparc,microblaze,powerpc}: Don't enable pagefault/preempt twice

2020-05-21 Thread Al Viro
On Tue, May 19, 2020 at 09:54:22AM -0700, Guenter Roeck wrote:
> On Mon, May 18, 2020 at 11:48:43AM -0700, ira.we...@intel.com wrote:
> > From: Ira Weiny 
> > 
> > The kunmap_atomic clean up failed to remove one set of pagefault/preempt
> > enables when vaddr is not in the fixmap.
> > 
> > Fixes: bee2128a09e6 ("arch/kunmap_atomic: consolidate duplicate code")
> > Signed-off-by: Ira Weiny 
> 
> microblazeel works with this patch, as do the nosmp sparc32 boot tests,
> but sparc32 boot tests with SMP enabled still fail with lots of messages
> such as:

BTW, what's your setup for sparc32 boot tests?  IOW, how do you manage to
shrink the damn thing enough to have the loader cope with it?  I hadn't
been able to do that for the current mainline ;-/


Re: remove set_fs calls from the coredump code v6

2020-05-05 Thread Al Viro
On Tue, May 05, 2020 at 10:42:58PM +0200, Christoph Hellwig wrote:
> On Tue, May 05, 2020 at 09:34:46PM +0100, Al Viro wrote:
> > Looks good.  Want me to put it into vfs.git?  #work.set_fs-exec, perhaps?
> 
> Sounds good.

Applied, pushed and added into #for-next


Re: remove set_fs calls from the coredump code v6

2020-05-05 Thread Al Viro
On Tue, May 05, 2020 at 12:12:49PM +0200, Christoph Hellwig wrote:
> Hi all,
> 
> this series gets rid of playing with the address limit in the exec and
> coredump code.  Most of this was fairly trivial, the biggest changes are
> those to the spufs coredump code.
> 
> Changes since v5:
>  - fix uaccess under spinlock in spufs (Jeremy)
>  - remove use of access_ok in spufs
> 
> Changes since v4:
>  - change some goto names as suggested by Linus
> 
> Changes since v3:
>  - fix x86 compilation with x32 in the new version of the signal code
>  - split the exec patches into a new series
> 
> Changes since v2:
>  - don't cleanup the compat siginfo calling conventions, use the patch
>variant from Eric with slight coding style fixes instead.
> 
> Changes since v1:
>  - properly spell NUL
>  - properly handle the compat siginfo case in ELF coredumps

Looks good.  Want me to put it into vfs.git?  #work.set_fs-exec, perhaps?


Re: [PATCH V2 00/11] Subject: Remove duplicated kmap code

2020-05-04 Thread Al Viro
On Mon, May 04, 2020 at 01:17:41PM -0700, Ira Weiny wrote:

> > || * arm: much, much worse.  We have several files that pull linux/highmem.h:
> > || arch/arm/mm/cache-feroceon-l2.c, arch/arm/mm/cache-xsc3l2.c,
> > || arch/arm/mm/copypage-*.c, arch/arm/mm/dma-mapping.c, arch/arm/mm/flush.c,
> > || arch/arm/mm/highmem.c, arch/arm/probes/uprobes/core.c,
> > || arch/arm/include/asm/kvm_mmu.h (kmap_atomic_pfn()).
> > || Those are fine, but we also have this:
> > || arch/arm/include/asm/pgtable.h:200:#define __pte_map(pmd)   (pte_t *)kmap_atomic(pmd_page(*(pmd)))
> > || arch/arm/include/asm/pgtable.h:208:#define pte_offset_map(pmd,addr) (__pte_map(pmd) + pte_index(addr))
> > || and sure as hell, asm/pgtable.h does *NOT* pull linux/highmem.h.
> 
> It does not pull asm/highmem.h either...

No, but the users of those macros need to be considered.

> > || #define pte_offset_map(dir, addr)   \
> > || ((pte_t *) kmap_atomic(pmd_page(*(dir))) + pte_index(addr))
> > || One pte_offset_map user in arch/microblaze:
> > || arch/microblaze/kernel/signal.c:207:ptep = pte_offset_map(pmdp, address);
> > || Messy, but doesn't require any changes (we have asm/pgalloc.h included
> > || there, and that pull linux/highmem.h).
> 
> AFAICS asm/pgtable.h does not include asm/highmem.h here...
> 
> So looks like arch/microblaze/kernel/signal.c will need linux/highmem.h

See above - line 39 in there is
#include 
and line 14 in arch/microblaze/include/asm/pgalloc.h is
#include 
It's conditional upon CONFIG_MMU in there, but so's the use of
pte_offset_map() in arch/microblaze/kernel/signal.c 

So it shouldn't be a problem.

> > || * xtensa: users in arch/xtensa/kernel/pci-dma.c, arch/xtensa/mm/highmem.c,
> > || arch/xtensa/mm/cache.c and arch/xtensa/platforms/iss/simdisk.c (all pull
> > || linux/highmem.h).
> 
> Actually
> 
> arch/xtensa/mm/cache.c gets linux/highmem.h from linux/pagemap.h
> 
> arch/xtensa/platforms/iss/simdisk.c may have an issue?
>   linux/blkdev.h -> CONFIG_BLOCK -> linux/pagemap.h -> linux/highmem.h
>   But simdisk.c requires BLK_DEV_SIMDISK -> CONFIG_BLOCK...
>   

Yep - see above re major chain of indirect includes conditional upon CONFIG_BLOCK
and its uses in places that only build with such configs.  There's plenty of
similar considerations outside of arch/*, unfortunately...


Re: [PATCH V2 00/11] Subject: Remove duplicated kmap code

2020-05-03 Thread Al Viro
On Sun, May 03, 2020 at 10:04:47PM -0700, Ira Weiny wrote:

> Grepping for 'asm/highmem.h' and investigations don't reveal any issues...  But
> you do have me worried.  That said 0-day has been crunching on multiple
> versions of this series without issues such as this (save the mips issue
> above).
> 
> I have to say it would be nice if the relation between linux/highmem.h and
> asm/highmem.h was more straightforward.

IIRC, the headache was due to asm/pgtable.h on several architectures and
asm/cacheflush.h on parisc.



|| IOW, there's one in linux/highmem.h (conditional on !CONFIG_HIGHMEM,
|| !ARCH_HAS_KMAP) and several per-architecture variants, usually declared in
|| their asm/highmem.h.  In three of those (microblaze, parisc and powerpc) these
|| are inlines (parisc one identical to linux/highmem.h, lives in asm/cacheflush.h,
|| powerpc and microblaze ones calling kmap_atomic_prot() which is defined in
|| arch/$ARCH/mm/highmem.c).
|| 
|| parisc case is weird - essentially, they want to call 
|| flush_kernel_dcache_page_addr() at the places where kunmap/kunmap_atomic
|| is done.  And they do so despite not selecting HIGHMEM, with definitions
|| in usual place.  They do have ARCH_HAS_KMAP defined, which prevents
|| getting buggered in linux/highmem.h.  ARCH_HAS_KMAP is parisc-unique,
|| BTW, and checked only in linux/highmem.h.
|| 
|| All genuine arch-specific variants are defined in (or call functions
|| defined in) translation units that are only included CONFIG_HIGHMEM builds.
|| 
|| It would be tempting to consolidate those, e.g. by adding __kmap_atomic()
|| and __kmap_atomic_prot() without that boilerplate, with universal kmap_atomic()
|| and kmap_atomic_prot() in linux/highmem.h.  Potential problem with that would
|| be something that pulls ash/highmem.h (or asm/cacheflush.h in case of parisc)
|| directly and uses kmap_atomic/kmap_atomic_prot.  There's not a lot places
|| pulling asm/highmem.h, and many of them are not even in includes:
|| 
|| arch/arm/include/asm/efi.h:13:#include 
|| arch/arm/mm/dma-mapping.c:31:#include 
|| arch/arm/mm/flush.c:14:#include 
|| arch/arm/mm/mmu.c:27:#include 
|| arch/mips/include/asm/pgtable-32.h:22:#include 
|| arch/mips/mm/cache.c:19:#include 
|| arch/mips/mm/fault.c:28:#include /* For VMALLOC_END */
|| arch/nds32/include/asm/pgtable.h:60:#include 
|| arch/x86/kernel/setup_percpu.c:20:#include 
|| include/linux/highmem.h:35:#include 
|| 
|| Users of asm/cacheflush.h are rather more numerous; however, anything
|| outside of parisc-specific code has to pull linux/highmem.h, or it won't see
|| the definitions of kmap_atomic/kmap_atomic_prot at all.  arch/parisc itself
|| has no callers of those.
|| 
|| Outside of arch/* there is a plenty of callers.  However, the following is
|| true: all instances of kmap_atomic or kmap_atomic_prot outside of arch/*
|| are either inside the linux/highmem.h or are preceded by include of
|| linux/highmem.h on any build that sees them (there is a common include
|| chain that is conditional upon CONFIG_BLOCK, but it's only needed in
|| drivers that are BLOCK-dependent).  It was not fun to verify, to put
|| it mildly...
|| 
|| So for parisc we have no problem - leaving __kmap_atomic()/__kmap_atomic_prot()
|| in asm/cacheflush.h and adding universal wrappers in linux/highmem.h will be
|| fine.  For other architectures the things might be trickier.
|| 
|| * arc: all users in arch/arc/ are within arch/arc/mm/{cache,highmem}.c;
|| both pull linux/highmem.h.  We are fine.
|| 
|| * arm: much, much worse.  We have several files that pull linux/highmem.h:
|| arch/arm/mm/cache-feroceon-l2.c, arch/arm/mm/cache-xsc3l2.c,
|| arch/arm/mm/copypage-*.c, arch/arm/mm/dma-mapping.c, arch/arm/mm/flush.c,
|| arch/arm/mm/highmem.c, arch/arm/probes/uprobes/core.c,
|| arch/arm/include/asm/kvm_mmu.h (kmap_atomic_pfn()).
|| Those are fine, but we also have this:
|| arch/arm/include/asm/pgtable.h:200:#define __pte_map(pmd)   (pte_t *)kmap_atomic(pmd_page(*(pmd)))
|| arch/arm/include/asm/pgtable.h:208:#define pte_offset_map(pmd,addr) (__pte_map(pmd) + pte_index(addr))
|| and sure as hell, asm/pgtable.h does *NOT* pull linux/highmem.h.
|| 
|| Fortunately, the users of pte_offset_map() (__pte_map() has no other users)
|| are few, both in arch/arm and outside of arch.  All arm ones are pulling
|| linux/highmem (arch/arm/mm/{pgd,fault*}.c).  Outside of arch we have several
|| that pull highmem.h (by way of rmap.h or pagemap.h, usually):
|| fs/userfaultfd.c, mm/gup.c, mm/hmm.c, mm/huge_memory.c,
|| mm/khugepaged.c, mm/memory-failure.c, mm/memory.c, mm/migrate.c,
|| mm/mremap.c, mm/page_vma_mapped.c, mm/swap_state.c, mm/swapfile.c,
|| mm/vmalloc.c
|| and then there are these in linux/mm.h:
|| 
|| #define pte_offset_map_lock(mm, pmd, address, ptlp) \
|| ({  \
|| spinlock_t *__ptl = pte_lockptr(mm, 

Re: [PATCH V2 00/11] Subject: Remove duplicated kmap code

2020-05-03 Thread Al Viro
On Sun, May 03, 2020 at 06:09:01PM -0700, ira.we...@intel.com wrote:
> From: Ira Weiny 
> 
> The kmap infrastructure has been copied almost verbatim to every architecture.
> This series consolidates obvious duplicated code by defining core functions
> which call into the architectures only when needed.
> 
> Some of the k[un]map_atomic() implementations have some similarities but the
> similarities were not sufficient to warrant further changes.
> 
> In addition we remove a duplicate implementation of kmap() in DRM.
> 
> Testing was done by 0day to cover all the architectures I can't readily
> build/test.

OK...  Looking through my old notes on kmap unification (this winter, never
went anywhere),

* arch/mips/mm/cache.c ought to use linux/highmem.h, not asm/highmem.h
I suspect that your series doesn't build on some configs there.  Hadn't
verified that, though.

* kmap_atomic_to_page() is dead, but not quite gone - csky and nds32 brought
the damn thing back (nds32 - only an extern).  It needs killin'...

* parisc is (arguably) abusing kunmap()/kunmap_atomic() for cache flushing.
Replace the bulk of its highmem.h with
#define ARCH_HAS_FLUSH_ON_KUNMAP
#define arch_before_kunmap flush_kernel_dcache_page_addr
and have default kunmap()/kunmap_atomic() do
#ifdef ARCH_HAS_FLUSH_ON_KUNMAP
arch_before_kunmap(page_address(page));
#endif
and
#ifdef ARCH_HAS_FLUSH_ON_KUNMAP
arch_before_kunmap(addr);
#endif
resp.  Kills ARCH_HAS_KMAP along with ifdefs on it, makes parisc use somewhat
less hacky.
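
IOW, the generic side would then shrink to something like (simplified sketch,
not the actual linux/highmem.h implementation; only the kunmap_atomic leg
shown, helper name made up):

static inline void kunmap_atomic_sketch(void *addr)
{
#ifdef ARCH_HAS_FLUSH_ON_KUNMAP
        arch_before_kunmap(addr);
#endif
        pagefault_enable();
        preempt_enable();
}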

I'd suggest checking various configs on mips - that's likely to cause headache.
Said that, my analysis of include chains back then is pretty much worthless
by now - I really hate the amount of indirect include chains leading to that
sucker on some, but not all configs ;-/  IIRC, the proof that everything
using kmap*/kunmap* would pull linux/highmem.h regardless of config took several
hours of digging, ran for several pages and had been hopelessly brittle.
arch/mips/mm/cache.c was the only exception caught by it, but these days
there might be more.


Re: [PATCH V1 09/10] arch/kmap: Define kmap_atomic_prot() for all arch's

2020-04-30 Thread Al Viro
On Fri, May 01, 2020 at 03:37:34AM +0100, Al Viro wrote:
> On Thu, Apr 30, 2020 at 01:38:44PM -0700, ira.we...@intel.com wrote:
> 
> > -static inline void *kmap_atomic(struct page *page)
> > +static inline void *kmap_atomic_prot(struct page *page, pgprot_t prot)
> >  {
> > preempt_disable();
> > pagefault_disable();
> > if (!PageHighMem(page))
> > return page_address(page);
> > -   return kmap_atomic_high(page);
> > +   return kmap_atomic_high_prot(page, prot);
> >  }
> > +#define kmap_atomic(page)  kmap_atomic_prot(page, kmap_prot)
> 
> OK, so it *was* just a bisect hazard - you return to original semantics
> wrt preempt_disable()...

FWIW, how about doing the following: just before #5/10 have a patch
that would touch only microblaze, ppc and x86 splitting their
kmap_atomic_prot() into an inline helper + kmap_atomic_high_prot().
Then your #5 would leave their kmap_atomic_prot() as-is (it would
use kmap_atomic_prot_high() instead).  The rest of the series plays
out pretty much the same way it does now, and wrappers on those
3 architectures would go away when an identical generic one is
introduced in this commit (#9/10).

AFAICS, that would avoid the bisect hazard and might even end
up with less noise in the patches...
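
Concretely, the preparatory patch would leave each of the three with something
like this in its asm/highmem.h (sketch; the out-of-line part stays in
arch/$ARCH/mm/highmem.c):

extern void *kmap_atomic_high_prot(struct page *page, pgprot_t prot);

static inline void *kmap_atomic_prot(struct page *page, pgprot_t prot)
{
        preempt_disable();
        pagefault_disable();
        if (!PageHighMem(page))
                return page_address(page);
        return kmap_atomic_high_prot(page, prot);
}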


Re: [PATCH V1 09/10] arch/kmap: Define kmap_atomic_prot() for all arch's

2020-04-30 Thread Al Viro
On Thu, Apr 30, 2020 at 01:38:44PM -0700, ira.we...@intel.com wrote:

> -static inline void *kmap_atomic(struct page *page)
> +static inline void *kmap_atomic_prot(struct page *page, pgprot_t prot)
>  {
>   preempt_disable();
>   pagefault_disable();
>   if (!PageHighMem(page))
>   return page_address(page);
> - return kmap_atomic_high(page);
> + return kmap_atomic_high_prot(page, prot);
>  }
> +#define kmap_atomic(page)  kmap_atomic_prot(page, kmap_prot)

OK, so it *was* just a bisect hazard - you return to original semantics
wrt preempt_disable()...


Re: [PATCH V1 08/10] arch/kmap: Don't hard code kmap_prot values

2020-04-30 Thread Al Viro
On Thu, Apr 30, 2020 at 01:38:43PM -0700, ira.we...@intel.com wrote:
> From: Ira Weiny 
> 
> To support kmap_atomic_prot() on all architectures each arch must
> support protections passed in to them.
> 
> Change csky, mips, nds32 and xtensa to use their global kmap_prot value
> rather than a hard coded value which was equal.

Minor nitpick: it's probably worth pointing out that kmap_prot on those
is a constant...


Re: [PATCH V1 05/10] arch/kmap_atomic: Consolidate duplicate code

2020-04-30 Thread Al Viro
On Thu, Apr 30, 2020 at 01:38:40PM -0700, ira.we...@intel.com wrote:
> From: Ira Weiny 
> 
> Every arch has the same code to ensure atomic operations and a check for
> !HIGHMEM page.
> 
> Remove the duplicate code by defining a core kmap_atomic() which only
> calls the arch specific kmap_atomic_high() when the page is high memory.

Err  AFAICS, you've just silently changed the semantics for
kmap_atomic_prot() here.  And while most of the callers are converted,
drivers/gpu/drm/ttm/ttm_bo_util.c one is not, so at the very least it's
a bisect hazard...

And I would argue that having kmap_atomic() differ from kmap_atomic_prot()
wrt disabling preempt is asking for trouble.


Re: [PATCH 1/5] powerpc/spufs: simplify spufs core dumping

2020-04-27 Thread Al Viro
On Mon, Apr 27, 2020 at 10:06:21PM +0200, Christoph Hellwig wrote:

> @@ -1988,7 +1984,12 @@ static ssize_t spufs_mbox_info_read(struct file *file, char __user *buf,
>   if (ret)
>   return ret;
>   spin_lock(&ctx->csa.register_lock);
> - ret = __spufs_mbox_info_read(ctx, buf, len, pos);
> + /* EOF if there's no entry in the mbox */
> + if (ctx->csa.prob.mb_stat_R & 0xff) {
> + ret = simple_read_from_buffer(buf, len, pos,
> + &ctx->csa.prob.pu_mb_R,
> + sizeof(ctx->csa.prob.pu_mb_R));
> + }
>   spin_unlock(&ctx->csa.register_lock);
>   spu_release_saved(ctx);

Again, this really needs fixing.  Preferably - as a separate commit preceding
this series, so that it could be backported.  simple_read_from_buffer() is
a blocking operation.  Yes, I understand that mainline has the same bug;
it really does need to be fixed and having to backport this series is not
a good idea, for obvious reasons.
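
For the mbox case the fix would be along these lines (sketch: snapshot the
registers under the lock, do the possibly-faulting copy only after dropping
it):

static ssize_t spufs_mbox_info_read(struct file *file, char __user *buf,
                                    size_t len, loff_t *pos)
{
        struct spu_context *ctx = file->private_data;
        u32 stat, data;
        int ret;

        ret = spu_acquire_saved(ctx);
        if (ret)
                return ret;
        spin_lock(&ctx->csa.register_lock);
        stat = ctx->csa.prob.mb_stat_R;
        data = ctx->csa.prob.pu_mb_R;
        spin_unlock(&ctx->csa.register_lock);
        spu_release_saved(ctx);

        /* EOF if there's no entry in the mbox */
        if (!(stat & 0xff))
                return 0;
        return simple_read_from_buffer(buf, len, pos, &data, sizeof(data));
}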


Re: [PATCH 1/7] powerpc/spufs: simplify spufs core dumping

2020-04-21 Thread Al Viro
On Tue, Apr 21, 2020 at 08:19:09PM +0100, Al Viro wrote:
> On Tue, Apr 21, 2020 at 09:01:48PM +0200, Christoph Hellwig wrote:
> > On Tue, Apr 21, 2020 at 07:49:41PM +0100, Al Viro wrote:
> > > > spin_lock(&ctx->csa.register_lock);
> > > > -   ret = __spufs_proxydma_info_read(ctx, buf, len, pos);
> > > > +   __spufs_proxydma_info_read(ctx, &info);
> > > > +   ret = simple_read_from_buffer(buf, len, pos, &info, sizeof(info));
> > > 
> > > IDGI...  What's that access_ok() for?  If you are using 
> > > simple_read_from_buffer(),
> > > the damn thing goes through copy_to_user().  Why bother with separate 
> > > access_ok()
> > > here?
> > 
> > I have no idea at all, this just refactors the code.
> 
> Wait a bloody minute, it's doing *what* under a spinlock?

... and yes, I realize that it's been broken the same way.  It still needs fixing.
AFAICS, that got broken in commit bf1ab978be23 "[POWERPC] coredump: Add SPU elf
notes to coredump." when spufs_proxydma_info_read() had copy_to_user() (until
then done after dropping the spinlock) moved into an area where blocking is very
much *not* allowed.


Re: [PATCH 1/7] powerpc/spufs: simplify spufs core dumping

2020-04-21 Thread Al Viro
On Tue, Apr 21, 2020 at 09:01:48PM +0200, Christoph Hellwig wrote:
> On Tue, Apr 21, 2020 at 07:49:41PM +0100, Al Viro wrote:
> > >   spin_lock(&ctx->csa.register_lock);
> > > - ret = __spufs_proxydma_info_read(ctx, buf, len, pos);
> > > + __spufs_proxydma_info_read(ctx, &info);
> > > + ret = simple_read_from_buffer(buf, len, pos, &info, sizeof(info));
> > 
> > IDGI...  What's that access_ok() for?  If you are using 
> > simple_read_from_buffer(),
> > the damn thing goes through copy_to_user().  Why bother with separate 
> > access_ok()
> > here?
> 
> I have no idea at all, this just refactors the code.

Wait a bloody minute, it's doing *what* under a spinlock?


Re: [PATCH 1/7] powerpc/spufs: simplify spufs core dumping

2020-04-21 Thread Al Viro
On Tue, Apr 21, 2020 at 05:41:58PM +0200, Christoph Hellwig wrote:

>  static ssize_t spufs_proxydma_info_read(struct file *file, char __user *buf,
>  size_t len, loff_t *pos)
>  {
>   struct spu_context *ctx = file->private_data;
> + struct spu_proxydma_info info;
>   int ret;
>  
> + if (len < sizeof(info))
> + return -EINVAL;
> + if (!access_ok(buf, len))
> + return -EFAULT;
> +
>   ret = spu_acquire_saved(ctx);
>   if (ret)
>   return ret;
>   spin_lock(&ctx->csa.register_lock);
> - ret = __spufs_proxydma_info_read(ctx, buf, len, pos);
> + __spufs_proxydma_info_read(ctx, &info);
> + ret = simple_read_from_buffer(buf, len, pos, &info, sizeof(info));

IDGI...  What's that access_ok() for?  If you are using 
simple_read_from_buffer(),
the damn thing goes through copy_to_user().  Why bother with separate 
access_ok()
here?


Re: [PATCH v2 5/5] uaccess: Rename user_access_begin/end() to user_full_access_begin/end()

2020-04-20 Thread Al Viro
[rmk Cc'd]
On Fri, Apr 03, 2020 at 09:52:05PM +0100, Al Viro wrote:

> I can do a 5.7-rc1-based branch with that; depending upon what we end
> up doing for arm and s390 we can always change the calling conventions
> come next cycle ;-/
> 
> My impressions after digging through arm side of things:
> 
> 1) the only instance of nesting I'd found there (so far) is a mistake.
> The rule should be "no fucking nesting, TYVM".

OK, after quite a bit of digging:
1) everything outside of arm is quite happy with not passing
anything to user_access_end().  s390 is a red herring in that respect.
2) on arm we definitely can get rid of nesting.  However,
there are some unpleasant sides of the logics in there.  What we have
is an MMU register; everything except for two 2bit fields in it is
constant.  One of those fields is a function of get_fs(), another might
serve an analogue of x86 EFLAGS.AC.  Rules:

DACR.USER is 0 if CONFIG_SW_DOMAIN_PAN is enabled and we are
*not* in uaccess section; otherwise it's 1.
DACR.KERNEL is 3 if CONFIG_USE_DOMAINS is enabled and we are
under KERNEL_DS; otherwise it's 1.

[USE_DOMAINS is forced to "yes" on v5 and earlier, configurable on v6+]
[SW_DOMAIN_PAN is forced to "no" on v7 if we want support of huge physical
space, configurable with default to "yes" otherwise]

On entry into kernel we get into USER_DS state before we get
out of asm glue.  Original settings are restored on return.  That goes
both for ->addr_limit (get_fs() value) and for DACR.KERNEL contents.
DACR.USER ("uaccess allowed") is switched to "disabled" state before
we reach C code and restored on return from kernel.

The costs are interesting; setting the register is costly, in
the same manner STAC/CLAC is.  Reading it... hell knows; I don't see any
explicit information about that.  As it is, both set_fs() and starting
uaccess block (uaccess_save_and_enable() - the thing that would've
gone into user_access_begin()) do both read and write to register; with
minimal massage we could get rid of reading the damn thing in set_fs().
user_access_end() candidate does a plain write to register, with value
kept around since the beginning of uaccess block.

*IF* read from that register is cheap, we can trivially get rid
of passing the cookie there - it's a matter of reading the register
and clearing one bit in it before writing it back.  If that is costly,
though...  We can easily calculate it from ->addr_limit, which we already
have in cache at that point, or will need shortly anyway.  In that
case it would probably make sense to do the same to user_access_begin()
and set_fs().  Note that I'm not suggesting to do anything of that sort
in switch_to() - existing mechanism doesn't need any changes, and neither
does the asm glue in entry*.S.
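
To make that concrete: with nesting gone, the register value to restore is a
pure function of the config and of the current segment, along the lines of
(sketch; uses the existing asm/domain.h macros, ignores the constant
DOMAIN_IO/DOMAIN_VECTORS fields, helper name made up):

static inline unsigned int dacr_for(bool inside_uaccess_block)
{
        unsigned int d = domain_val(DOMAIN_USER, DOMAIN_CLIENT) |
                         domain_val(DOMAIN_KERNEL, DOMAIN_CLIENT);

        if (IS_ENABLED(CONFIG_CPU_SW_DOMAIN_PAN) && !inside_uaccess_block)
                d = (d & ~domain_mask(DOMAIN_USER)) |
                    domain_val(DOMAIN_USER, DOMAIN_NOACCESS);
        if (IS_ENABLED(CONFIG_CPU_USE_DOMAINS) && uaccess_kernel())
                d = (d & ~domain_mask(DOMAIN_KERNEL)) |
                    domain_val(DOMAIN_KERNEL, DOMAIN_MANAGER);
        return d;
}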

The only source I'd been able to find speeks of >= 60 cycles
(and possibly much more) for non-pipelined coprocessor instructions;
the list of such does contain loads and stores to a bunch of registers.
However, the register in question (p15/c3) has only store mentioned there,
so loads might be cheap; no obvious reasons for those to be slow.
That's a question to arm folks, I'm afraid...  rmk?

Note that we can keep the current variant (i.e.
user_access_begin() being just the check for access_ok(), user_access_end()
being empty and uaccess_save_and_enable()/uaccess_restore() done manually
inside the primitives); after all, a lot of architectures don't _have_
anything of that sort.  It's just that decisions regarding the calling
conventions for these primitives will be much harder to change later on...

Again, arm (32bit one) is the only architectures that has something
of that sort and needs to pass cookie from beginning to the end of uaccess
blocks.  Everything else splits into several classes:

1) has MMU, shared address space for kernel/userland, no stac analogues.
alpha, arc, csky, hexagon, itanic, nds32, nios32, openrisc, sh,
sparc32, unicore32, xtensa/MMU, microblaze/MMU, mips/MMU,
m68k/MMU/COLDFIRE.
No way to do anything other than plain access_ok() for user_access_begin().

2) has MMU, shared address space for kernel/userland, has stac analogue,
possibly with separate "for read" and "for write" variants.  Can live
without passing any cookies.
arm64, powerpc, riscv, x86
Current variant with changes in this patchset covers those.

3) non-MMU, uses memcpy() for everything, or at least ought to:
c6x, h8300, m68k/!MMU, xtensa/!MMU(?), microblaze/!MMU(?), mips/!MMU(?),
arm/!MMU
No memory protection of any sort...

4) sparc-like: MMU, separate address spaces for userland and kernel,
has explicit insns for uaccess + some register(s) to choose what those
insns actually hit.
sparc64, parisc, m68k/MMU/!COLDFIRE
No stac/clac analogue would make sense.

5) s390

Re: [PATCH 3/6] binfmt_elf: remove the set_fs(KERNEL_DS) in elf_core_dump

2020-04-06 Thread Al Viro
On Mon, Apr 06, 2020 at 02:03:09PM +0200, Christoph Hellwig wrote:
> There is no logic in elf_core_dump itself that uses uaccess routines
> on kernel pointers, the file writes are nicely encapsulated in dump_emit
> which does its own set_fs.

... assuming you've checked the asm/elf.h to see that nobody is playing
silly buggers in these forests of macros and the stuff called from those.
Which is a feat that ought to be mentioned in commit message...


Re: [PATCH v2 5/5] uaccess: Rename user_access_begin/end() to user_full_access_begin/end()

2020-04-03 Thread Al Viro
On Fri, Apr 03, 2020 at 11:01:10AM -0700, Linus Torvalds wrote:
> On Fri, Apr 3, 2020 at 12:21 AM Christophe Leroy
>  wrote:
> >
> > Now we have user_read_access_begin() and user_write_access_begin()
> > in addition to user_access_begin().
> 
> I realize Al asked for this, but I don't think it really adds anything
> to the series.
> 
> The "full" makes the names longer, but not really any more legible.
> 
> So I like 1-4, but am unconvinced about 5 and would prefer that to be
> dropped. Sorry for the bikeshedding.
> 
> And I like this series much better without the cookie that was
> discussed, and just making the hard rule be that they can't nest.
> 
> Some architecture may obviously use a cookie internally if they have
> some nesting behavior of their own, but it doesn't look like we have
> any major reason to expose that as the actual interface.
> 
> The only other question is how to synchronize this? I'm ok with it
> going through the ppc tree, for example, and just let others build on
> that.  Maybe using a shared immutable branch with 5.6 as a base?

I can do a 5.7-rc1-based branch with that; depending upon what we end
up doing for arm and s390 we can always change the calling conventions
come next cycle ;-/

My impressions after digging through arm side of things:

1) the only instance of nesting I'd found there (so far) is a mistake.
The rule should be "no fucking nesting, TYVM".

2) I'm really unhappy about the uaccess_with_memcpy thing.  Besides
being fucking ugly, it kills any hope of lifting user_access_begin/end
out of raw_copy_to_user() - the things done in that bastard are
obviously *NOT* fit for uaccess block.  Including the wonders like
/* the mmap semaphore is taken only if not in an atomic context */
atomic = faulthandler_disabled();

if (!atomic)
down_read(>mm->mmap_sem);
which, IMO, deserves to be taken out and shot.  Not that pin_page_for_write()
in the same file (arch/arm/lib/uaccess_with_memcpy.c) did not deserve the
same treatment...  As far as I can tell, the whole point of that thing
is that well, memcpy() is optimized better than raw_copy_to_user()...
So what's wrong with taking the damn optimized memcpy and using it for
raw_copy_to_user() instead?

Is that the lack of STRT analogue that would store several registers?
Because AFAICS commit f441882a5229ffaef61a47bccd4518f7e2749cbc
Author: Vincent Whitchurch 
Date:   Fri Nov 9 10:09:48 2018 +0100
ARM: 8812/1: Optimise copy_{from/to}_user for !CPU_USE_DOMAINS
makes for much saner solution...  I realize that it's v6+ and this
thing is specifically for a v5 variant, but...

3) how much do we need to keep the old DACR value in a register for
uaccess_restore()?  AFAICS, if we prohibit nesting it becomes
a function of USER_DS/KERNEL_DS setting (and that - only for
CPU_USE_DOMAINS), doesn't it?  And we had to have fetched it
recently, back when access_ok() had been done, so shouldn't
it be in cache?


Re: [PATCH RESEND 1/4] uaccess: Add user_read_access_begin/end and user_write_access_begin/end

2020-04-03 Thread Al Viro
On Fri, Apr 03, 2020 at 02:37:19PM +0100, Russell King - ARM Linux admin wrote:

> > I think uaccess_enable() could indeed switch the kernel domain if
> > KERNEL_DS is set and move this out of set_fs(). It would reduce the
> > window the kernel domain permissions are overridden. Anyway,
> > uaccess_enable() appeared much later on arm when Russell introduced PAN
> > (SMAP) like support by switching the user domain.
> 
> Yes, that would be a possibility.  Another possibility would be to
> eliminate as much usage of KERNEL_DS as possible

That's definitely worth doing, but that's another long-term project ;-/

> - I've just found
> one instance in sys_oabi-compat.c that can be eliminated (epoll_ctl)
> but there's several there that can't with the current code structure,
> and re-coding the contents of some fs/* functions to work around that
> is a very bad idea.  If there's some scope for rejigging some of the
> fs/* code, it may be possible to elimate some other cases in there.

Well, your do_locks() definitely can be converted.  epoll_wait()...
not sure, need to look into that.  Is that about the layout mismatch
between struct oabi_epoll_event and struct epoll_event?  In case of
semtimedop...  Hell knows, I would probably consider moving that thing
into ipc/sem.c under ifdef...


Re: [PATCH RESEND 1/4] uaccess: Add user_read_access_begin/end and user_write_access_begin/end

2020-04-02 Thread Al Viro
On Thu, Apr 02, 2020 at 11:35:57AM -0700, Kees Cook wrote:

> Yup, I think it's a weakness of the ARM implementation and I'd like to
> not extend it further. AFAIK we should never nest, but I would not be
> surprised at all if we did.
> 
> If we were looking at a design goal for all architectures, I'd like
> to be doing what the public PaX patchset did for their memory access
> switching, which is to alarm if calling into "enable" found the access
> already enabled, etc. Such a condition would show an unexpected nesting
> (like we've seen with similar constructs with set_fs() not getting reset
> during an exception handler, etc etc).

FWIW, maybe I'm misreading the ARM uaccess logics, but... it smells like
KERNEL_DS is somewhat more dangerous there than on e.g. x86.

Look: with CONFIG_CPU_DOMAINS, set_fs(KERNEL_DS) tells MMU to ignore
per-page permission bits in DOMAIN_KERNEL (i.e. for kernel address
ranges), allowing them even if they would normally be denied.  We need
that for actual uaccess loads/stores, since those use insns that pretend
to be done in user mode and we want them to access the kernel pages.
But that affects the normal loads/stores as well; unless I'm misreading
that code, it will ignore (supervisor) r/o on a page.  And that's not
just for the code inside the uaccess blocks; *everything* done under
KERNEL_DS is subject to that.

Why do we do that (modify_domain(), that is) inside set_fs() and not
in uaccess_enable() et.al.?


Re: [PATCH RESEND 1/4] uaccess: Add user_read_access_begin/end and user_write_access_begin/end

2020-04-02 Thread Al Viro
On Thu, Apr 02, 2020 at 07:03:28PM +0200, Christophe Leroy wrote:

> user_access_begin() grants both read and write.
> 
> This patch adds user_read_access_begin() and user_write_access_begin() but
> it doesn't remove user_access_begin()

Ouch...  So the most generic name is for the rarest case?
 
> > What should we do about that?  Do we prohibit such blocks outside
> > of arch?
> > 
> > What should we do about arm and s390?  There we want a cookie passed
> > from beginning of block to its end; should that be a return value?
> 
> That was the way I implemented it in January, see
> https://patchwork.ozlabs.org/patch/1227926/
> 
> There was some discussion around that and most noticeable was:
> 
> H. Peter (hpa) said about it: "I have *deep* concern with carrying state in
> a "key" variable: it's a direct attack vector for a crowbar attack,
> especially since it is by definition live inside a user access region."

> This patch minimises the change by just adding user_read_access_begin() and
> user_write_access_begin() keeping the same parameters as the existing
> user_access_begin().

Umm...  What about the arm situation?  The same concerns would apply there,
wouldn't they?  Currently we have
static __always_inline unsigned int uaccess_save_and_enable(void)
{
#ifdef CONFIG_CPU_SW_DOMAIN_PAN
unsigned int old_domain = get_domain();

/* Set the current domain access to permit user accesses */
set_domain((old_domain & ~domain_mask(DOMAIN_USER)) |
   domain_val(DOMAIN_USER, DOMAIN_CLIENT));

return old_domain;
#else
return 0;
#endif
}
and
static __always_inline void uaccess_restore(unsigned int flags)
{
#ifdef CONFIG_CPU_SW_DOMAIN_PAN
/* Restore the user access mask */
set_domain(flags);
#endif
}

How much do we need nesting on those, anyway?  rmk?
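
A toy userspace model (the helper names are borrowed from the arm code quoted
above, everything else is made up) of why returning a cookie is what makes
these sections nest: each restore puts back exactly what its matching save
found, so an inner pair does not tear down the outer one.

#include <stdio.h>

static unsigned int hw_state;	/* stands in for the domain/PAN register */

static unsigned int uaccess_save_and_enable(void)
{
	unsigned int old = hw_state;

	hw_state = 1;		/* "user access permitted" */
	return old;
}

static void uaccess_restore(unsigned int old)
{
	hw_state = old;		/* put back whatever the caller found */
}

static void inner(void)
{
	unsigned int i = uaccess_save_and_enable();

	/* ... unsafe accesses ... */
	uaccess_restore(i);	/* the outer section stays open here */
}

int main(void)
{
	unsigned int o = uaccess_save_and_enable();

	inner();		/* nests cleanly: no counter, no global flag */
	printf("still enabled: %u\n", hw_state);	/* prints 1 */
	uaccess_restore(o);
	printf("restored: %u\n", hw_state);		/* prints 0 */
	return 0;
}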


Re: [PATCH RESEND 1/4] uaccess: Add user_read_access_begin/end and user_write_access_begin/end

2020-04-02 Thread Al Viro
On Thu, Apr 02, 2020 at 07:34:16AM +, Christophe Leroy wrote:
> Some architectures like powerpc64 have the capability to separate
> read access and write access protection.
> For get_user() and copy_from_user(), powerpc64 only open read access.
> For put_user() and copy_to_user(), powerpc64 only open write access.
> But when using unsafe_get_user() or unsafe_put_user(),
> user_access_begin open both read and write.
> 
> Other architectures like powerpc book3s 32 bits only allow write
> access protection. And on this architecture protection is an heavy
> operation as it requires locking/unlocking per segment of 256Mbytes.
> On those architecture it is therefore desirable to do the unlocking
> only for write access. (Note that book3s/32 ranges from very old
> powermac from the 90's with powerpc 601 processor, till modern
> ADSL boxes with PowerQuicc II modern processors for instance so it
> is still worth considering)
> 
> In order to avoid any risk based of hacking some variable parameters
> passed to user_access_begin/end that would allow hacking and
> leaving user access open or opening too much, it is preferable to
> use dedicated static functions that can't be overridden.
> 
> Add a user_read_access_begin and user_read_access_end to only open
> read access.
> 
> Add a user_write_access_begin and user_write_access_end to only open
> write access.
> 
> By default, when undefined, those new access helpers default on the
> existing user_access_begin and user_access_end.

The only problem I have is that we'd better choose the calling
conventions that work for other architectures as well.

AFAICS, aside of ppc and x86 we have (at least) this:
arm:
unsigned int __ua_flags = uaccess_save_and_enable();
...
uaccess_restore(__ua_flags);
arm64:
uaccess_enable_not_uao();
...
uaccess_disable_not_uao();
riscv:
__enable_user_access();
...
__disable_user_access();
s390/mvc:
old_fs = enable_sacf_uaccess();
...
disable_sacf_uaccess(old_fs);

arm64 and riscv are easy - they map well on what we have now.
The interesting ones are ppc, arm and s390.

You wants to specify the kind of access; OK, but...  it's not just read
vs. write - there's read-write as well.  AFAICS, there are 3 users of
that:
* copy_in_user()
* arch_futex_atomic_op_inuser()
* futex_atomic_cmpxchg_inatomic()
The former is of dubious utility (all users outside of arch are in
the badly done compat code), but the other two are not going to go
away.

What should we do about that?  Do we prohibit such blocks outside
of arch?

What should we do about arm and s390?  There we want a cookie passed
from beginning of block to its end; should that be a return value?

And at least on arm that thing nests (see e.g. __clear_user_memset()
there), so "stash that cookie in current->something" is not a solution...

Folks, let's sort that out while we still have few users of that
interface; changing the calling conventions later will be much harder.
Comments?
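
For reference, the shape of the fallback being discussed, as a sketch rather
than the actual patch: architectures that cannot distinguish read from write
simply get the existing all-access helpers for both flavours, so generic code
can use the narrower names unconditionally.

#ifndef user_read_access_begin
#define user_read_access_begin(ptr, len)	user_access_begin(ptr, len)
#define user_read_access_end()			user_access_end()
#endif
#ifndef user_write_access_begin
#define user_write_access_begin(ptr, len)	user_access_begin(ptr, len)
#define user_write_access_end()			user_access_end()
#endif

The read-write case (copy_in_user(), the futex ops) is the part this does not
answer, which is what the questions above are about.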


Re: [powerpc] Intermittent crashes ( link_path_walk) with linux-next

2020-03-26 Thread Al Viro
On Thu, Mar 26, 2020 at 10:40:06PM +1100, Michael Ellerman wrote:

> > The code in question (link_path_walk() in fs/namei.c) was recently changed
> > by the following commit:
> >
> > commit 881386f7e46a: 
> >   link_path_walk(): sample parent's i_uid and i_mode for the last component
> 
> That and about 10 other commits.
> 
> Unless Al can give us a clue we'll need to bisect.

Already fixed yesterday.  It's not link_path_walk(), it's handle_dots()
ignoring an error returned by step_into().

commit 5e3c3570ec97 is the broken one; commit 20971012f63e is its variant with
the fix folded in.  So next-20200325 has the bug and next-20200326 should have
it fixed.  Could you check the current -next and see if you still observe that
crap?


Re: [PATCH v1 1/6] fs/readdir: Fix filldir() and filldir64() use of user_access_begin()

2020-01-22 Thread Al Viro
On Wed, Jan 22, 2020 at 08:13:12AM -0800, Linus Torvalds wrote:
> On Wed, Jan 22, 2020 at 5:00 AM Christophe Leroy
>  wrote:
> >
> > Modify filldir() and filldir64() to request the real area they need
> > to get access to.
> 
> Not like this.
> 
> This makes the situation for architectures like x86 much worse, since
> you now use "put_user()" for the previous dirent filling. Which does
> that expensive user access setup/teardown twice again.
> 
> So either you need to cover both the dirent's with one call, or you
> just need to cover the whole (original) user buffer passed in. But not
> this unholy mixing of both unsafe_put_user() and regular put_user().

I would suggest simply covering the range from dirent->d_off to
buf->current_dir->d_name[namelen]; they are going to be close to
each other and we need those addresses anyway...
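
A sketch of what that suggestion looks like in filldir64() terms (simplified:
the first-entry case where there is no previous dirent is glossed over, and the
field/helper names are taken from fs/readdir.c of that era, not from a real
patch): one user_access_begin() spanning from the previous entry's d_off to the
end of the new name, with everything inside done via unsafe_put_user() and
unsafe_copy_to_user().

	struct linux_dirent64 __user *prev = buf->previous;	/* may be NULL: not handled here */
	struct linux_dirent64 __user *d = buf->current_dir;

	if (!user_access_begin(&prev->d_off,
			       (char __user *)d->d_name + namlen + 1 -
			       (char __user *)&prev->d_off))
		goto efault;
	unsafe_put_user(offset, &prev->d_off, efault_end);	/* back-patch previous entry */
	unsafe_put_user(ino, &d->d_ino, efault_end);
	unsafe_put_user(reclen, &d->d_reclen, efault_end);
	unsafe_put_user(d_type, &d->d_type, efault_end);
	unsafe_put_user(0, d->d_name + namlen, efault_end);
	unsafe_copy_to_user(d->d_name, name, namlen, efault_end);
	user_access_end();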


Re: [PATCH v18 00/13] open: introduce openat2(2) syscall

2019-12-07 Thread Al Viro
On Sat, Dec 07, 2019 at 01:13:25AM +1100, Aleksa Sarai wrote:
> This patchset is being developed here:
>   

See #work.openat2; rebased on top of #fixes (there's a minor
conflict in follow_managed()).  I'll either leave that as-is
or rebase to -rc1 once it appears; the thing goes into #for-next
right after -rc1 in any case.


Re: [PATCH v17 00/13] open: introduce openat2(2) syscall

2019-11-24 Thread Al Viro
Generally I can live with that - the only serious issue is
with that WARN_ON() and the logics of the check in it.


Re: [PATCH v17 11/13] open: introduce openat2(2) syscall

2019-11-24 Thread Al Viro
On Sun, Nov 17, 2019 at 12:17:11PM +1100, Aleksa Sarai wrote:

> +/* how->upgrade flags for openat2(2). */
> +/* First bit is reserved for a future UPGRADE_NOEXEC flag. */
> +#define UPGRADE_NOREAD   0x02 /* Block re-opening with MAY_READ. */
> +#define UPGRADE_NOWRITE  0x04 /* Block re-opening with MAY_WRITE. */

I'd drop that chunk, for the lack of ->upgrade...


Re: [PATCH v17 10/13] namei: LOOKUP_{IN_ROOT,BENEATH}: permit limited ".." resolution

2019-11-24 Thread Al Viro
On Sun, Nov 17, 2019 at 12:17:10PM +1100, Aleksa Sarai wrote:
> + if (unlikely(nd->flags & LOOKUP_IS_SCOPED)) {
> + /*
> +  * If there was a racing rename or mount along our
> +  * path, then we can't be sure that ".." hasn't jumped
> +  * above nd->root (and so userspace should retry or use
> +  * some fallback).
> +  */
> + if (unlikely(read_seqretry(&mount_lock, nd->m_seq)))
> + return -EAGAIN;
> + if (unlikely(read_seqretry(&rename_lock, nd->r_seq)))
> + return -EAGAIN;
> + }

Looks like excessive barriers to me - it's
rmb
check mount_lock.sequence
rmb
check rename_lock.sequence

> @@ -2266,6 +2274,10 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
>   nd->last_type = LAST_ROOT; /* if there are only slashes... */
>   nd->flags = flags | LOOKUP_JUMPED | LOOKUP_PARENT;
>   nd->depth = 0;
> +
> + nd->m_seq = read_seqbegin(&mount_lock);
> + nd->r_seq = read_seqbegin(&rename_lock);

Same here, pretty much (fetch/rmb/fetch/rmb)
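
In code form, the objection is roughly this (illustrative only, not a proposed
patch): read_seqretry() implies a read barrier per call, so the two checks back
to back cost rmb/check/rmb/check, whereas a single barrier covering both loads
would do, e.g. via the barrier-less __read_seqcount_retry():

	smp_rmb();
	if (unlikely(__read_seqcount_retry(&mount_lock.seqcount, nd->m_seq) ||
		     __read_seqcount_retry(&rename_lock.seqcount, nd->r_seq)))
		return -EAGAIN;

The same consideration applies to the read_seqbegin() pair quoted from
path_init() above.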


Re: [PATCH v17 08/13] namei: LOOKUP_BENEATH: O_BENEATH-like scoped resolution

2019-11-24 Thread Al Viro
On Sun, Nov 17, 2019 at 12:17:08PM +1100, Aleksa Sarai wrote:

> + if (unlikely(nd->flags & LOOKUP_IS_SCOPED)) {
> + /*
> +  * Do a final check to ensure that the path didn't escape. Note
> +  * that this should already be guaranteed by all of the other
> +  * LOOKUP_IS_SCOPED checks (and delaying this check this late
> +  * does open the door to some possible timing-based attacks).
> +  */
> + if (WARN_ON(!path_is_under(&nd->path, &nd->root)))
> + return -EXDEV;

I don't like that.  What it gives is an ability to race that with
rename(), with user-triggered WARN_ON.  You *can't* promise that result of
lookup is in a subtree, simply because it can get moved just as you've
declared it to be in the clear.

Anyone who relies upon that is delusional; it really can't be done.
What warranties is LOOKUP_IS_SCOPED really supposed to provide?  That we do not
attempt to walk out of the subtree rooted at the start point?  Fine, but this
is not what this test does.  What are you trying to achieve there?  If it's
"what we'd got was at one point in our subtree", the test is more or less
right, but WARN_ON isn't.
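
For what it's worth, the same test without the WARN_ON would just be the sketch
below; that removes the user-triggerable warning, but it does not answer the
underlying objection that rename() can move the result out of the subtree the
moment the check has passed.

	if (unlikely(nd->flags & LOOKUP_IS_SCOPED) &&
	    unlikely(!path_is_under(&nd->path, &nd->root)))
		return -EXDEV;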


Re: [PATCH v16 09/12] namei: LOOKUP_{IN_ROOT,BENEATH}: permit limited ".." resolution

2019-11-15 Thread Al Viro
On Sat, Nov 16, 2019 at 11:27:59AM +1100, Aleksa Sarai wrote:

> + if (unlikely(nd->flags & LOOKUP_IS_SCOPED)) {
> + bool m_retry = read_seqretry(&mount_lock, nd->m_seq);
> + bool r_retry = read_seqretry(&rename_lock, nd->r_seq);
> +
> + /*
> +  * If there was a racing rename or mount along our
> +  * path, then we can't be sure that ".." hasn't jumped
> +  * above nd->root (and so userspace should retry or use
> +  * some fallback).
> +  */
> + if (unlikely(m_retry || r_retry))
> + return -EAGAIN;
> + }
>   }
>   return 0;

Elaborate...  Do these boolean variables make any sense now, really?


Re: [PATCH v16 06/12] namei: LOOKUP_NO_XDEV: block mountpoint crossing

2019-11-15 Thread Al Viro
On Sat, Nov 16, 2019 at 11:27:56AM +1100, Aleksa Sarai wrote:

> @@ -1383,6 +1398,8 @@ static int follow_dotdot_rcu(struct nameidata *nd)
>   return -ECHILD;
>   if (&mparent->mnt == nd->path.mnt)
>   break;
> + if (unlikely(nd->flags & LOOKUP_NO_XDEV))
> + return -EXDEV;
>   /* we know that mountpoint was pinned */
>   nd->path.dentry = mountpoint;
>   nd->path.mnt = &mparent->mnt;
> @@ -1397,6 +1414,8 @@ static int follow_dotdot_rcu(struct nameidata *nd)
>   return -ECHILD;
>   if (!mounted)
>   break;
> + if (unlikely(nd->flags & LOOKUP_NO_XDEV))
> + return -EXDEV;
>   nd->path.mnt = &mounted->mnt;
>   nd->path.dentry = mounted->mnt.mnt_root;
>   inode = nd->path.dentry->d_inode;

I really don't think we should return hard errors from that function.
Let the caller redo it in refwalk mode.

It's not the fast path, especially for this kind of errors.  Matter of
fact, I'm not sure about -ENOENT returned in another failure case
there - it's probably OK, but again, -ECHILD would be just as good.
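
In other words, a sketch of the suggested behaviour (not a patch): the RCU-walk
side would punt to refwalk instead of failing hard, and the non-RCU
follow_dotdot() would be the one to report -EXDEV.

		/* in follow_dotdot_rcu(): bail out of RCU mode, retry in refwalk */
		if (unlikely(nd->flags & LOOKUP_NO_XDEV))
			return -ECHILD;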


Re: [PATCH v16 02/12] namei: allow nd_jump_link() to produce errors

2019-11-15 Thread Al Viro
On Sat, Nov 16, 2019 at 11:27:52AM +1100, Aleksa Sarai wrote:
> + error = nd_jump_link(&path);
> + if (error)
> + path_put(&path);

> + error = nd_jump_link(&ns_path);
> + if (error)
> + path_put(&ns_path);

> + error = nd_jump_link(&path);
> + if (error)
> + path_put(&path);

3 calls.  Exact same boilerplate in each to handle a failure case.
Which spells "wrong calling conventions"; it's absolutely clear
that we want that path_put() inside nd_jump_link().

The rule should be this: reference that used to be held in
*path is consumed in any case.  On success it goes into
nd->path, on error it's just dropped, but in any case, the
caller has the same refcounting environment to deal with.

If you need the same boilerplate cleanup on failure again and again,
the calling conventions are wrong and need to be fixed.

And I'm not sure that int is the right return type here, to be honest.
void * might be better - return ERR_PTR() or NULL, so that the value
could be used as return value of ->get_link() that calls that thing.
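
A sketch of the calling convention being asked for (the LOOKUP_NO_MAGICLINKS
check is only there as an example of an in-helper failure; this is not the
version that was eventually merged): the reference held in *path is consumed
either way, and the return value can be handed straight back from ->get_link().

const char *nd_jump_link(struct path *path)
{
	struct nameidata *nd = current->nameidata;

	if (unlikely(nd->flags & LOOKUP_NO_MAGICLINKS)) {
		path_put(path);			/* failure: reference simply dropped */
		return ERR_PTR(-ELOOP);
	}

	path_put(&nd->path);
	nd->path = *path;			/* success: reference moves into nd->path */
	nd->inode = nd->path.dentry->d_inode;
	nd->flags |= LOOKUP_JUMPED;
	return NULL;
}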


Re: [PATCH v15 3/9] namei: LOOKUP_NO_XDEV: block mountpoint crossing

2019-11-13 Thread Al Viro
On Thu, Nov 14, 2019 at 03:49:45PM +1100, Aleksa Sarai wrote:
> On 2019-11-13, Al Viro  wrote:
> > On Tue, Nov 05, 2019 at 08:05:47PM +1100, Aleksa Sarai wrote:
> > 
> > > @@ -862,6 +870,8 @@ static int nd_jump_root(struct nameidata *nd)
> > >  void nd_jump_link(struct path *path)
> > >  {
> > >   struct nameidata *nd = current->nameidata;
> > > +
> > > + nd->last_magiclink.same_mnt = (nd->path.mnt == path->mnt);
> > > + path_put(&nd->path);
> > >  
> > >   nd->path = *path;
> > > @@ -1082,6 +1092,10 @@ const char *get_link(struct nameidata *nd)
> > >   if (nd->flags & LOOKUP_MAGICLINK_JUMPED) {
> > >   if (unlikely(nd->flags & LOOKUP_NO_MAGICLINKS))
> > >   return ERR_PTR(-ELOOP);
> > > + if (unlikely(nd->flags & LOOKUP_NO_XDEV)) {
> > > + if (!nd->last_magiclink.same_mnt)
> > > + return ERR_PTR(-EXDEV);
> > > + }
> > >   }
> > 
> > Ugh...  Wouldn't it be better to take that logics (some equivalent thereof)
> > into nd_jump_link()?  Or just have nd_jump_link() return an error...
> 
> This could be done, but the reason for stashing it away in
> last_magiclink is because of the future magic-link re-opening patches
> which can't be implemented like that without putting the open_flags
> inside nameidata (which was decided to be too ugly a while ago).
> 
> My point being that I could implement it this way for this series, but
> I'd have to implement something like last_magiclink when I end up
> re-posting the magic-link stuff in a few weeks.
> 
> Looking at all the nd_jump_link() users, the other option is to just
> disallow magic-link crossings entirely for LOOKUP_NO_XDEV. The only
> thing allowing them permits is to resolve file descriptors that are
> pointing to the same procfs mount -- and it's unclear to me how useful
> that really is (apparmorfs and nsfs will always give -EXDEV because
> aafs_mnt and nsfs_mnt are internal kernel vfsmounts).

I would rather keep the entire if (nd->flags & LOOKUP_MAGICLINK_JUMPED)
out of the get_link().  If you want to generate some error if
nd_jump_link() has been called, just do it right there.  The fewer
pieces of state need to be carried around, the better...

And as for opening them...  Why would you need full open_flags in there?
Details, please...


Re: [PATCH v15 5/9] namei: LOOKUP_IN_ROOT: chroot-like scoped resolution

2019-11-12 Thread Al Viro
On Wed, Nov 13, 2019 at 01:44:14PM +1100, Aleksa Sarai wrote:
> On 2019-11-13, Al Viro  wrote:
> > On Tue, Nov 05, 2019 at 08:05:49PM +1100, Aleksa Sarai wrote:
> > 
> > > @@ -2277,12 +2277,20 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
> > >  
> > >   nd->m_seq = read_seqbegin(&mount_lock);
> > >  
> > > - /* Figure out the starting path and root (if needed). */
> > > - if (*s == '/') {
> > > + /* Absolute pathname -- fetch the root. */
> > > + if (flags & LOOKUP_IN_ROOT) {
> > > + /* With LOOKUP_IN_ROOT, act as a relative path. */
> > > + while (*s == '/')
> > > + s++;
> > 
> > Er...  Why bother skipping slashes?  I mean, not only link_path_walk()
> > will skip them just fine, you are actually risking breakage in this:
> > if (*s && unlikely(!d_can_lookup(dentry))) {
> > fdput(f);
> > return ERR_PTR(-ENOTDIR);
> > }
> > which is downstream from there with you patch, AFAICS.
> 
> I switched to stripping the slashes at your suggestion a few revisions
> ago[1], and had (wrongly) assumed we needed to handle "/" somehow in
> path_init(). But you're quite right about link_path_walk() -- and I'd be
> more than happy to drop it.

That, IIRC, was about untangling the weirdness around multiple calls of
dirfd_path_init() and basically went "we might want just strip the slashes
in case of that flag very early in the entire thing, so that later the
normal logics for absolute/relative would DTRT".  Since your check is
right next to checking for absolute pathnames (and not in the very
beginning of path_init()), we might as well turn the check for
absolute pathname into *s == '/' && !(flags & LOOKUP_IN_ROOT) and be
done with that.
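
Which would make the start of that branch in path_init() look roughly like this
(surrounding lines approximated from the kernel of that era, not a real diff):

	/* Absolute pathname -- fetch the root, unless scoped by LOOKUP_IN_ROOT. */
	if (*s == '/' && !(flags & LOOKUP_IN_ROOT)) {
		set_root(nd);
		if (likely(!nd_jump_root(nd)))
			return s;
		return ERR_PTR(-ECHILD);
	}
	/* leading slashes of a LOOKUP_IN_ROOT name are skipped by link_path_walk() */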


Re: [PATCH v15 7/9] open: introduce openat2(2) syscall

2019-11-12 Thread Al Viro
On Tue, Nov 05, 2019 at 08:05:51PM +1100, Aleksa Sarai wrote:
  
> +/*
> + * Arguments for how openat2(2) should open the target path. If @resolve is
> + * zero, then openat2(2) operates very similarly to openat(2).
> + *
> + * However, unlike openat(2), unknown bits in @flags result in -EINVAL rather
> + * than being silently ignored. @mode must be zero unless one of {O_CREAT,
> + * O_TMPFILE} are set, and @upgrade_mask must be zero unless O_PATH is set.
> + *
> + * @flags: O_* flags.
> + * @mode: O_CREAT/O_TMPFILE file mode.
> + * @upgrade_mask: UPGRADE_* flags (to restrict O_PATH re-opening).

???

> + * @resolve: RESOLVE_* flags.
> + */
> +struct open_how {
> + __aligned_u64 flags;
> + __u16 mode;
> + __u16 __padding[3]; /* must be zeroed */
> + __aligned_u64 resolve;
> +};
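
For comparison, a minimal userspace example against the ABI that eventually
landed in 5.6, which differs from the v15 layout quoted above (flags, mode and
resolve all became __u64, and the padding/upgrade_mask fields went away along
with UPGRADE_*); the struct and RESOLVE_* values below are copied from the
merged uapi header rather than from this patch:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_openat2
#define __NR_openat2 437
#endif

struct open_how {			/* as merged: three __u64 fields */
	uint64_t flags;
	uint64_t mode;
	uint64_t resolve;
};
#define RESOLVE_NO_MAGICLINKS	0x02
#define RESOLVE_IN_ROOT		0x10

int main(void)
{
	struct open_how how;
	int dirfd, fd;

	memset(&how, 0, sizeof(how));
	how.flags = O_RDONLY;
	how.resolve = RESOLVE_IN_ROOT | RESOLVE_NO_MAGICLINKS;

	dirfd = open("/tmp", O_PATH | O_DIRECTORY);
	if (dirfd < 0)
		return 1;

	/* "/etc/passwd" is resolved as if /tmp were the root, so this
	 * normally fails with ENOENT instead of escaping the subtree. */
	fd = (int)syscall(__NR_openat2, dirfd, "/etc/passwd", &how, sizeof(how));
	if (fd < 0)
		perror("openat2");
	else
		close(fd);
	close(dirfd);
	return 0;
}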

