RE: [PATCH] Raise the minimum GCC version to 5.2

2021-05-03 Thread David Laight
From: Arnd Bergmann
> Sent: 03 May 2021 10:25
...
> One scenario that I've seen previously is where user space and
> kernel are built together as a source based distribution (OE, buildroot,
> openwrt, ...), and the compiler is picked to match the original sources
> of the user space because that is best tested, but the same compiler
> then gets used to build the kernel as well because that is the default
> in the build environment.

If you are building programs for release to customers who might
be running then on old distributions then you need a system with
the original userspace headers and almost certainly a similar
vintage compiler.
Never mind RHEL7 we have customers running RHEL6.
(We've managed to get everyone off RHEL5.)
So the build machine is running a 10+ year old distro.

I did try to build on a newer system (only 5 years old)
but the complete fubar of memcpy() makes it impossible
to compile C programs that will run on an older libc.
And don't even mention C++, the 'character traits' is just
plain horrid - enough to make me want to remove every
reference to CString from the small amount of C++ we have.

To quote our makefile:
# C++ is fighting back.
# I'd like to be able to compile on a 'new' system and still be able to run
# the binaries on RHEL 6 (2.6.32 kernel 2011 era libraries).
# But even linking libstdc++ static still leaves
# an undefined C++ symbol that the dynamic loader barfs on.
# The static libstdc++ also references memcpy@GLIBC_2.14 - but that can be
# 'solved' by adding an extra .so that defines the symbol (and calls memmove()).
# I've also tried pulling a single .o out of libstc++.a. This might work if
# the .o is small and self contained.
#
# For now we statically link libstc++ and continue to build on an old system.
C++LDLIBS := -Wl,-Bstatic -lstdc++ -Wl,-Bdynamic

It would be nice to be able to build current kernels (for local
use) on the 'new' system - but gcc is already too old.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)


RE: [PATCH] powerpc: Initialize local variable fdt to NULL in elf64_load()

2021-04-23 Thread David Laight
From: Michael Ellerman
> Sent: 23 April 2021 14:51
...
> > (Does anyone - and by anyone I mean any large distro - compile with
> > local variables inited by the compiler?)
> 
> This is where I say, "yes, Android" and you say "ugh no I meant a real
> distro", and I say "well ...".
> 
> But yeah doesn't help us much.

And it doesn't seem to stop my phone crashing either :-)

Of course, I've absolutely no way of finding out where it is crashing.
Nor where the massive memory leaks are that means it need rebooting
every few days.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)



RE: [PATCH] powerpc: Initialize local variable fdt to NULL in elf64_load()

2021-04-22 Thread David Laight
From: Daniel Axtens
> Sent: 22 April 2021 03:21
> 
> > Hi Lakshmi,
> >
> >> On 4/15/21 12:14 PM, Lakshmi Ramasubramanian wrote:
> >>
> >> Sorry - missed copying device-tree and powerpc mailing lists.
> >>
> >>> There are a few "goto out;" statements before the local variable "fdt"
> >>> is initialized through the call to of_kexec_alloc_and_setup_fdt() in
> >>> elf64_load(). This will result in an uninitialized "fdt" being passed
> >>> to kvfree() in this function if there is an error before the call to
> >>> of_kexec_alloc_and_setup_fdt().
> >>>
> >>> Initialize the local variable "fdt" to NULL.
> >>>
> > I'm a huge fan of initialising local variables! But I'm struggling to
> > find the code path that will lead to an uninit fdt being returned...
> 
> OK, so perhaps this was putting it too strongly. I have been bitten
> by uninitialised things enough in C that I may have taken a slightly
> overly-agressive view of fixing them in the source rather than the
> compiler. I do think compiler-level mitigations are better, and I take
> the point that we don't want to defeat compiler checking.
> 
> (Does anyone - and by anyone I mean any large distro - compile with
> local variables inited by the compiler?)

There are compilers that initialise locals to zero for 'debug' builds
and leave the 'random' for optimised 'release' builds.
Lets not test what we are releasing!

I also think there is a new option to gcc (or clang?) to initialise
on-stack structures and arrays to ensure garbage isn't passed.
That seems to be a horrid performance hit!
Especially in userspace where large stack allocations are almost free.

Any auto-initialise ought to be with a semi-random value
(especially not zero) so that it is never right and doesn't
lead to lazy coding.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)



RE: [PATCH 1/2] mm: Fix struct page layout on 32-bit systems

2021-04-21 Thread David Laight
From: Arnd Bergmann
> Sent: 20 April 2021 22:20
> 
> On Tue, Apr 20, 2021 at 11:14 PM Vineet Gupta
>  wrote:
> > On 4/20/21 12:07 AM, Arnd Bergmann wrote:
> 
> > >
> > > which means that half the 32-bit architectures do this. This may
> > > cause more problems when arc and/or microblaze want to support
> > > 64-bit kernels and compat mode in the future on their latest hardware,
> > > as that means duplicating the x86 specific hacks we have for compat.
> > >
> > > What is alignof(u64) on 64-bit arc?
> >
> > $ echo 'int a = __alignof__(long long);' | arc64-linux-gnu-gcc -xc -
> > -Wall -S -o - | grep -A1 a: | tail -n 1 | cut -f 3
> > 8
> 
> Ok, good.

That test doesn't prove anything.
Try running on x86:
$ echo 'int a = __alignof__(long long);' | gcc -xc - -Wall -S -o - -m32
.file   ""
.globl  a
.data
.align 4
.type   a, @object
.size   a, 4
a:
.long   8
.ident  "GCC: (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609"
.section.note.GNU-stack,"",@progbits

Using '__alignof__(struct {long long x;})' does give the expected 4.

__alignof__() returns the preferred alignment, not the enforced
alignmnet - go figure.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)


RE: [PATCH 1/2] mm: Fix struct page layout on 32-bit systems

2021-04-20 Thread David Laight
From: Geert Uytterhoeven
> Sent: 20 April 2021 08:40
> 
> Hi Willy,
> 
> On Sat, Apr 17, 2021 at 4:49 AM Matthew Wilcox  wrote:
> > Replacement patch to fix compiler warning.
> >
> > 32-bit architectures which expect 8-byte alignment for 8-byte integers
> > and need 64-bit DMA addresses (arc, arm, mips, ppc) had their struct
> > page inadvertently expanded in 2019.  When the dma_addr_t was added,
> > it forced the alignment of the union to 8 bytes, which inserted a 4 byte
> > gap between 'flags' and the union.
> >
> > Fix this by storing the dma_addr_t in one or two adjacent unsigned longs.
> > This restores the alignment to that of an unsigned long, and also fixes a
> > potential problem where (on a big endian platform), the bit used to denote
> > PageTail could inadvertently get set, and a racing get_user_pages_fast()
> > could dereference a bogus compound_head().
> >
> > Fixes: c25fff7171be ("mm: add dma_addr_t to struct page")
> > Signed-off-by: Matthew Wilcox (Oracle) 
> 
> Thanks for your patch!
> 
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -97,10 +97,10 @@ struct page {
> > };
> > struct {/* page_pool used by netstack */
> > /**
> > -* @dma_addr: might require a 64-bit value even on
> > +* @dma_addr: might require a 64-bit value on
> >  * 32-bit architectures.
> >  */
> > -   dma_addr_t dma_addr;
> > +   unsigned long dma_addr[2];
> 
> So we get two 64-bit words on 64-bit platforms, while only one is
> needed?
> 
> Would
> 
> unsigned long _dma_addr[sizeof(dma_addr_t) / sizeof(unsigned long)];
> 
> work?
> 
> Or will the compiler become too overzealous, and warn about the use
> of ...[1] below, even when unreachable?
> I wouldn't mind an #ifdef instead of an if () in the code below, though.

You could use [ARRAY_SIZE()-1] instead of [1].
Or, since IIRC it is the last member of that specific struct, define as:
unsigned long dma_addr[];

...
> > -   return page->dma_addr;
> > +   dma_addr_t ret = page->dma_addr[0];
> > +   if (sizeof(dma_addr_t) > sizeof(unsigned long))
> > +   ret |= (dma_addr_t)page->dma_addr[1] << 16 << 16;
> 
> We don't seem to have a handy macro for a 32-bit left shift yet...
> 
> But you can also avoid the warning using
> 
> ret |= (u64)page->dma_addr[1] << 32;

Or:
ret |= page->dma_addr[1] + 0ull << 32;

Which relies in integer promotion rather than a cast.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)


RE: [PATCH 1/2] mm: Fix struct page layout on 32-bit systems

2021-04-17 Thread David Laight
From: Matthew Wilcox 
> Sent: 17 April 2021 03:45
> 
> Replacement patch to fix compiler warning.
...
>  static inline dma_addr_t page_pool_get_dma_addr(struct page *page)
>  {
> - return page->dma_addr;
> + dma_addr_t ret = page->dma_addr[0];
> + if (sizeof(dma_addr_t) > sizeof(unsigned long))
> + ret |= (dma_addr_t)page->dma_addr[1] << 16 << 16;

Ugly as well.

Why not just replace the (dma_addr_t) cast with a (u64) one?
Looks better than the double shift.

Same could be done for the '>> 32'.
Is there an upper_32_bits() that could be used??

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)



RE: [PATCH 2/2] mm: Indicate pfmemalloc pages in compound_head

2021-04-17 Thread David Laight
From: Matthew Wilcox (Oracle) 
> Sent: 17 April 2021 00:07
> 
> The net page_pool wants to use a magic value to identify page pool pages.
> The best place to put it is in the first word where it can be clearly a
> non-pointer value.  That means shifting dma_addr up to alias with ->index,
> which means we need to find another way to indicate page_is_pfmemalloc().
> Since page_pool doesn't want to set its magic value on pages which are
> pfmemalloc, we can use bit 1 of compound_head to indicate that the page
> came from the memory reserves.
> 
...
>   struct {/* page_pool used by netstack */
> - /**
> -  * @dma_addr: might require a 64-bit value on
> -  * 32-bit architectures.
> -  */
> + unsigned long pp_magic;
> + unsigned long xmi;
> + unsigned long _pp_mapping_pad;
>   unsigned long dma_addr[2];
>   };

You've deleted the comment.

I also think there should be a comment that dma_addr[0]
must be aliased to ->index.
(Or whatever all the exact requirements are.)

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)



RE: Bogus struct page layout on 32-bit

2021-04-17 Thread David Laight
From: Grygorii Strashko
> Sent: 16 April 2021 10:27
...
> Sry, for delayed reply.
> 
> The TI platforms am3/4/5 (cpsw) and Keystone 2 (netcp) can do only 32bit DMA 
> even in case of LPAE
> (dma-ranges are used).
> Originally, as I remember, CONFIG_ARCH_DMA_ADDR_T_64BIT has not been selected 
> for the LPAE case
> on TI platforms and the fact that it became set is the result of 
> multi-paltform/allXXXconfig/DMA
> optimizations and unification.
> (just checked - not set in 4.14)
> 
> Probable commit 4965a68780c5 ("arch: define the ARCH_DMA_ADDR_T_64BIT config 
> symbol in lib/Kconfig").
> 
> The TI drivers have been updated, finally to accept ARCH_DMA_ADDR_T_64BIT=y 
> by using things like
> (__force u32)
> for example.

Hmmm using (__force u32) is probably wrong.
If an address +length >= 2**32 can get passed then the IO request
needs to be errored (or a bounce buffer used).

Otherwise you can get particularly horrid corruptions.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)


RE: [PATCH 1/1] mm: Fix struct page layout on 32-bit systems

2021-04-17 Thread David Laight
From: Matthew Wilcox
> Sent: 16 April 2021 16:28
> 
> On Thu, Apr 15, 2021 at 08:08:32PM +0200, Jesper Dangaard Brouer wrote:
> > See below patch.  Where I swap32 the dma address to satisfy
> > page->compound having bit zero cleared. (It is the simplest fix I could
> > come up with).
> 
> I think this is slightly simpler, and as a bonus code that assumes the
> old layout won't compile.

Always a good plan.

...
>  static inline dma_addr_t page_pool_get_dma_addr(struct page *page)
>  {
> - return page->dma_addr;
> + dma_addr_t ret = page->dma_addr[0];
> + if (sizeof(dma_addr_t) > sizeof(unsigned long))
> + ret |= (dma_addr_t)page->dma_addr[1] << 32;
> + return ret;
> +}

Won't some compiler/option combinations generate an
error for the '<< 32' when dma_addr_t is 32bit?

You might need to use a (u64) cast.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)



RE: [PATCH 1/1] mm: Fix struct page layout on 32-bit systems

2021-04-16 Thread David Laight
From: Matthew Wilcox 
> Sent: 15 April 2021 23:22
> 
> On Thu, Apr 15, 2021 at 09:11:56PM +0000, David Laight wrote:
> > Isn't it possible to move the field down one long?
> > This might require an explicit zero - but this is not a common
> > code path - the extra write will be noise.
> 
> Then it overlaps page->mapping.  See emails passim.

The rules on overlaps make be wonder if every 'long'
should be in its own union.
The comments would need to say when each field is used.
It would, at least, make these errors less common.

That doesn't solve the 64bit dma_addr though.

Actually rather that word-swapping dma_addr on 32bit BE
could you swap over the two fields it overlays with.
That might look messy in the .h, but it doesn't require
an accessor function to do the swap - easily missed.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)



RE: [PATCH 1/1] mm: Fix struct page layout on 32-bit systems

2021-04-15 Thread David Laight
From: Matthew Wilcox 
> Sent: 15 April 2021 19:22
> 
> On Thu, Apr 15, 2021 at 08:08:32PM +0200, Jesper Dangaard Brouer wrote:
> > +static inline
> > +dma_addr_t page_pool_dma_addr_read(dma_addr_t dma_addr)
> > +{
> > +   /* Workaround for storing 64-bit DMA-addr on 32-bit machines in struct
> > +* page.  The page->dma_addr share area with page->compound_head which
> > +* use bit zero to mark compound pages. This is okay, as DMA-addr are
> > +* aligned pointers which have bit zero cleared.
> > +*
> > +* In the 32-bit case, page->compound_head is 32-bit.  Thus, when
> > +* dma_addr_t is 64-bit it will be located in top 32-bit.  Solve by
> > +* swapping dma_addr 32-bit segments.
> > +*/
> > +#ifdef CONFIG_ARCH_DMA_ADDR_T_64BIT
> 
> #if defined(CONFIG_ARCH_DMA_ADDR_T_64BIT) && defined(__BIG_ENDIAN)
> otherwise you'll create the problem on ARM that you're avoiding on PPC ...
> 
> I think you want to delete the word '_read' from this function name because
> you're using it for both read and write.

I think I'd use explicit dma_addr_hi and dma_addr_lo and
separate read/write functions just to make absolutely sure
nothing picks up the swapped value.

Isn't it possible to move the field down one long?
This might require an explicit zero - but this is not a common
code path - the extra write will be noise.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)



RE: [PATCH 1/1] mm: Fix struct page layout on 32-bit systems

2021-04-14 Thread David Laight
From: Matthew Wilcox
> Sent: 14 April 2021 22:36
> 
> On Wed, Apr 14, 2021 at 09:13:22PM +0200, Jesper Dangaard Brouer wrote:
> > (If others want to reproduce).  First I could not reproduce on ARM32.
> > Then I found out that enabling CONFIG_XEN on ARCH=arm was needed to
> > cause the issue by enabling CONFIG_ARCH_DMA_ADDR_T_64BIT.
> 
> hmmm ... you should be able to provoke it by enabling ARM_LPAE,
> which selects PHYS_ADDR_T_64BIT, and
> 
> config ARCH_DMA_ADDR_T_64BIT
> def_bool 64BIT || PHYS_ADDR_T_64BIT
> 
> >  struct page {
> > long unsigned int  flags;/* 0 4 */
> >
> > /* XXX 4 bytes hole, try to pack */
> >
> > union {
> > struct {
> > struct list_head lru;/* 8 8 */
> > struct address_space * mapping;  /*16 4 */
> > long unsigned int index; /*20 4 */
> > long unsigned int private;   /*24 4 */
> > };   /* 820 */
> > struct {
> > dma_addr_t dma_addr

Adding __packed here will remove the 4 byte hole before the union
and the compiler seems clever enough to know that anything following
a 'long' must also be 'long' aligned.
So you don't get anything horrid like byte accesses.
On 64bit dma_addr will remain 64bit aligned.
On arm32 dma_addr will be 32bit aligned - but forcing two 32bit access
won't make any difference.

So definitely the only simple fix.

David

> >; /* 8 8 */
> > };   /* 8 8 */
> [...]
> > } __attribute__((__aligned__(8)));   /* 824 */
> > union {
> > atomic_t   _mapcount;/*32 4 */
> > unsigned int   page_type;/*32 4 */
> > unsigned int   active;   /*32 4 */
> > intunits;/*32 4 */
> > };   /*32 4 */
> > atomic_t   _refcount;/*36 4 */
> >
> > /* size: 40, cachelines: 1, members: 4 */
> > /* sum members: 36, holes: 1, sum holes: 4 */
> > /* forced alignments: 1, forced holes: 1, sum forced holes: 4 */
> > /* last cacheline: 40 bytes */
> > } __attribute__((__aligned__(8)));
> 
> If you also enable CONFIG_MEMCG or enough options to make
> LAST_CPUPID_NOT_IN_PAGE_FLAGS true, you'll end up with another 4-byte
> hole at the end.

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)



RE: [PATCH 1/1] mm: Fix struct page layout on 32-bit systems

2021-04-14 Thread David Laight
> Doing this fixes it:
> 
> +++ b/include/linux/types.h
> @@ -140,7 +140,7 @@ typedef u64 blkcnt_t;
>   * so they don't care about the size of the actual bus addresses.
>   */
>  #ifdef CONFIG_ARCH_DMA_ADDR_T_64BIT
> -typedef u64 dma_addr_t;
> +typedef u64 __attribute__((aligned(sizeof(void * dma_addr_t;
>  #else
>  typedef u32 dma_addr_t;
>  #endif

I hate __packed so much I've been checking what it does!

If you add __packed to the dma_addr_t field inside the union
then gcc (at least) removes the pad from before it, but also
'remembers' the alignment that is enforced by other members
of the structure.

So you don't need the extra aligned(sizeof (void *)) since
that is implicit.

So in this case __packed probably has no side effects.
(Unless a 32bit arch has instructions for a 64bit read
that must not be on an 8n+4 boundary and the address is taken).

It also doesn't affect 64bit - since the previous field
forces 64bit alignment.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)



RE: [PATCH v1 1/2] powerpc/bitops: Use immediate operand when possible

2021-04-14 Thread David Laight
From: Segher Boessenkool
> Sent: 14 April 2021 16:19
...
> > Could the kernel use GCC builtin atomic functions instead ?
> >
> > https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html
> 
> Certainly that should work fine for the simpler cases that the atomic
> operations are meant to provide.  But esp. for not-so-simple cases the
> kernel may require some behaviour provided by the existing assembler
> implementation, and not by the atomic builtins.
> 
> I'm not saying this cannot work, just that some serious testing will be
> needed.  If it works it should be the best of all worlds, so then it is
> a really good idea yes :-)

I suspect they just add an extra layer of abstraction that makes it
even more difficult to verify and could easily get broken by a compiler
update (etc).

The other issue is that the code needs to be correct with compiled
with (for example) -O0.
That could very easily break anything except the asm implementation
if additional memory accesses and/or increased code size cause grief.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)



RE: [PATCH 1/1] mm: Fix struct page layout on 32-bit systems

2021-04-13 Thread David Laight
From: Matthew Wilcox 
> Sent: 12 April 2021 19:24
> 
> On Sun, Apr 11, 2021 at 11:43:07AM +0200, Jesper Dangaard Brouer wrote:
> > Could you explain your intent here?
> > I worry about @index.
> >
> > As I mentioned in other thread[1] netstack use page_is_pfmemalloc()
> > (code copy-pasted below signature) which imply that the member @index
> > have to be kept intact. In above, I'm unsure @index is untouched.
> 
> Well, I tried three different approaches.  Here's the one I hated the least.
> 
> From: "Matthew Wilcox (Oracle)" 
> Date: Sat, 10 Apr 2021 16:12:06 -0400
> Subject: [PATCH] mm: Fix struct page layout on 32-bit systems
> 
> 32-bit architectures which expect 8-byte alignment for 8-byte integers
> and need 64-bit DMA addresses (arc, arm, mips, ppc) had their struct
> page inadvertently expanded in 2019.  When the dma_addr_t was added,
> it forced the alignment of the union to 8 bytes, which inserted a 4 byte
> gap between 'flags' and the union.
> 
> We could fix this by telling the compiler to use a smaller alignment
> for the dma_addr, but that seems a little fragile.  Instead, move the
> 'flags' into the union.  That causes dma_addr to shift into the same
> bits as 'mapping', which causes problems with page_mapping() called from
> set_page_dirty() in the munmap path.  To avoid this, insert three words
> of padding and use the same bits as ->index and ->private, neither of
> which have to be cleared on free.

This all looks horribly fragile and is bound to get broken again.
Are there two problems?
1) The 'folio' structure needs to match 'rcu' part of the page
   so that it can use the same rcu list to free items.
2) Various uses of 'struct page' need to overlay fields to save space.

For (1) the rcu bit should probably be a named structure in an
anonymous union - probably in both structures.

For (2) is it worth explicitly defining the word number for each field?
So you end up with something like:
#define F(offset, member) struct { long _pad_##offset[offset]; member; }
struct page [
union {
struct page_rcu;
unsigned long flags;
F(1, unsigned long xxx);
F(2, unsigned long yyy);
etc.



...
>   struct {/* page_pool used by netstack */
> - /**
> -  * @dma_addr: might require a 64-bit value even on
> -  * 32-bit architectures.
> -  */
> - dma_addr_t dma_addr;
> + unsigned long _pp_flags;
> + unsigned long pp_magic;
> + unsigned long xmi;
> + unsigned long _pp_mapping_pad;
> + dma_addr_t dma_addr;/* might be one or two words */
>   };

Isn't that 6 words?

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)



RE: [PATCH 5/5] compat: consolidate the compat_flock{,64} definition

2021-04-12 Thread David Laight
From: Arnd Bergmann
> Sent: 12 April 2021 12:26
> 
> On Mon, Apr 12, 2021 at 12:54 PM David Laight  wrote:
> > From: David Laight > Sent: 12 April 2021 10:37
> > ...
> > > I'm guessing that compat_pid_t is 16 bits?
> > > So the native 32bit version has an unnamed 2 byte structure pad.
> > > The 'packed' removes this pad from the compat structure.
> > >
> > > AFAICT (apart from mips) the __ARCH_COMPAT_FLOCK_PAD is just
> > > adding an explicit pad for the implicit pad the compiler
> > > would generate because compat_pid_t is 16 bits.
> >
> > I've just looked at the header.
> > compat_pid_t is 32 bits.
> > So Linux must have gained 32bit pids at some earlier time.
> > (Historically Unix pids were 16 bit - even on 32bit systems.)
> >
> > Which makes the explicit pad in 'sparc' rather 'interesting'.
> 
> I saw it was there since the sparc kernel support got merged in
> linux-1.3, possibly copied from an older sunos version.

Which had a 16bit pid when I used it.
So this is a bug in the sparc merge!

The explicit 'short' pad could be removed from the 64bit variant
because there are always 4 bytes of pad after l_pid.
But it does extend the application structure on 32bit sparc so must
remain in the uapi header.
It doesn't need to be in the 'compat' definition.

> > oh - compat_loff_t is only used in a couple of other places.
> > neither care in any way about the alignment.
> > (Provided get_user() doesn't fault on a 8n+4 aligned address.)
> 
> Ah right, I also see that after this series it's only used in to other
> places:  compat_resume_swap_area, which could also lose the
> __packed annotation,

That structure just defines 0 and 8, the structure size doesn't
matter and the offsets are 'passed to' get_user() so byte
accesses aren't performed.

> and in the declaration of
> compat_sys_sendfile64, where it makes no difference.

Which should probably use get_user() rather than copy_from_user().

Although some architectures may need fallback code for
misaligned get_user() ?
Or is there a general 'cop out' that structures passed to the
kernel are required to be correctly aligned.
They should be aligned unless the kernel is 'playing games'
like reading 'struct pollfd' as a 64bit item.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)


RE: [PATCH 5/5] compat: consolidate the compat_flock{,64} definition

2021-04-12 Thread David Laight
From: David Laight
> Sent: 12 April 2021 10:37
...
> I'm guessing that compat_pid_t is 16 bits?
> So the native 32bit version has an unnamed 2 byte structure pad.
> The 'packed' removes this pad from the compat structure.
> 
> AFAICT (apart from mips) the __ARCH_COMPAT_FLOCK_PAD is just
> adding an explicit pad for the implicit pad the compiler
> would generate because compat_pid_t is 16 bits.

I've just looked at the header.
compat_pid_t is 32 bits.
So Linux must have gained 32bit pids at some earlier time.
(Historically Unix pids were 16 bit - even on 32bit systems.)

Which makes the explicit pad in 'sparc' rather 'interesting'.

Actually the tail pad can just be removed from the compat
structures.
Just a comment that mips and sparc have extra fields
in the uapi header is enough.

The kernel never needs to read/write the pad.
userspace must provide the pad in case the kernel writes it.

oh - compat_loff_t is only used in a couple of other places.
neither care in any way about the alignment.
(Provided get_user() doesn't fault on a 8n+4 aligned address.)

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)



RE: consolidate the flock uapi definitions

2021-04-12 Thread David Laight
From: Arnd Bergmann
> Sent: 12 April 2021 11:04
> 
> On Mon, Apr 12, 2021 at 10:55 AM Christoph Hellwig  wrote:
> >
> > Hi all,
> >
> > currently we deal with the slight differents in the various architecture
> > variants of the flock and flock64 stuctures in a very cruft way.  This
> > series switches to just use small arch hooks and define the rest in
> > asm-generic and linux/compat.h instead.
> 
> Nice cleanup. I can merge it through the asm-generic tree if you like,
> though it's a little late just ahead of the merge window.
> 
> I would not want to change the compat_loff_t definition to compat_s64
> to avoid the padding at this time, though that might be a useful cleanup
> for a future cycle.

Is x86 the only architecture that has 32bit and 64bit variants where
the 32bit variant aligns 64bit items on 32bit boundaries?

I've just checked MIPS and ARM, and I'm fairly sure sparc 64bit
aligns them.

Are there any others?

Might also be interesting to check whether compat_loff_t gets
used anywhere else - where the x64-64 compat code will get it
wrong.

ISTM that fixing compat_loff_t shouldn't have any fallout.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)


RE: [PATCH 5/5] compat: consolidate the compat_flock{,64} definition

2021-04-12 Thread David Laight
From: Christoph Hellwig
> Sent: 12 April 2021 09:56
> 
> Provide a single common definition for the compat_flock and
> compat_flock64 structures using the same tricks as for the native
> variants.  An extra define is added for the packing required on x86.
> 
...
>  /*
> - * IA32 uses 4 byte alignment for 64 bit quantities,
> - * so we need to pack this structure.
> + * IA32 uses 4 byte alignment for 64 bit quantities, so we need to pack the
> + * compat flock64 structure.
>   */
> -struct compat_flock64 {
> - short   l_type;
> - short   l_whence;
> - compat_loff_t   l_start;
> - compat_loff_t   l_len;
> - compat_pid_tl_pid;
> -} __attribute__((packed));
> +#define __ARCH_NEED_COMPAT_FLOCK64_PACKED

That shouldn't need to be packed at all.
(Since the 32bit variant isn't packed.)

compat_loff_t should itself have __attribute__((aligned(4)))
probably inherited from compat_s64.
So l_start will be at offset 4 without the __packed.

I'm guessing that compat_pid_t is 16 bits?
So the native 32bit version has an unnamed 2 byte structure pad.
The 'packed' removes this pad from the compat structure.

AFAICT (apart from mips) the __ARCH_COMPAT_FLOCK_PAD is just
adding an explicit pad for the implicit pad the compiler
would generate because compat_pid_t is 16 bits.

If the padding need not be named for the 64bit system calls.
(Where there is probably rather more padding all over the place.)
then it doesn't need to be named for the compat variants.

Even the mips extra padding could be removed.
F_GETLK might be expected to do a read-write of them, so
return EFAULT if not mapped.
But nothing should be testing the EFAULT is returned!

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)



RE: Bogus struct page layout on 32-bit

2021-04-10 Thread David Laight
From: Matthew Wilcox
> Sent: 10 April 2021 03:43
> On Sat, Apr 10, 2021 at 06:45:35AM +0800, kernel test robot wrote:
> > >> include/linux/mm_types.h:274:1: error: static_assert failed due to 
> > >> requirement
> '__builtin_offsetof(struct page, lru) == __builtin_offsetof(struct folio, 
> lru)' "offsetof(struct page,
> lru) == offsetof(struct folio, lru)"
> >FOLIO_MATCH(lru, lru);
> >include/linux/mm_types.h:272:2: note: expanded from macro 'FOLIO_MATCH'
> >static_assert(offsetof(struct page, pg) == offsetof(struct 
> > folio, fl))
> 
> Well, this is interesting.  pahole reports:
> 
> struct page {
> long unsigned int  flags;/* 0 4 */
> /* XXX 4 bytes hole, try to pack */
> union {
> struct {
> struct list_head lru;/* 8 8 */
> ...
> struct folio {
> union {
> struct {
> long unsigned int flags; /* 0 4 */
> struct list_head lru;/* 4 8 */
> 
> so this assert has absolutely done its job.
> 
> But why has this assert triggered?  Why is struct page layout not what
> we thought it was?  Turns out it's the dma_addr added in 2019 by commit
> c25fff7171be ("mm: add dma_addr_t to struct page").  On this particular
> config, it's 64-bit, and ppc32 requires alignment to 64-bit.  So
> the whole union gets moved out by 4 bytes.
> 
> Unfortunately, we can't just fix this by putting an 'unsigned long pad'
> in front of it.  It still aligns the entire union to 8 bytes, and then
> it skips another 4 bytes after the pad.
> 
> We can fix it like this ...
> 
> +++ b/include/linux/mm_types.h
> @@ -96,11 +96,12 @@ struct page {
> unsigned long private;
> };
> struct {/* page_pool used by netstack */
> +   unsigned long _page_pool_pad;
> /**
>  * @dma_addr: might require a 64-bit value even on
>  * 32-bit architectures.
>  */
> -   dma_addr_t dma_addr;
> +   dma_addr_t dma_addr __packed;
> };
> struct {/* slab, slob and slub */
> union {
> 
> but I don't know if GCC is smart enough to realise that dma_addr is now
> on an 8 byte boundary and it can use a normal instruction to access it,
> or whether it'll do something daft like use byte loads to access it.

I'm betting it will use byte accesses.
Checked - it does seem to use 4-byte accesses.
(godbolt is rather short of 32 bit compilers...)

> 
> We could also do:
> 
> +   dma_addr_t dma_addr __packed __aligned(sizeof(void 
> *));

I wonder if __aligned(n) should be defined as
__attribute__((packed,aligned(n))
I don't think you ever want the 'unpacked' variant.

But explicitly reducing the alignment of single member is much
better than the habit of marking the structure 'packed'.

(Never mind the habit of adding __packed 'because we don't want
the compiler to add random padding.)

> 
> and I see pahole, at least sees this correctly:
> 
> struct {
> long unsigned int _page_pool_pad; /* 4 4 */
> dma_addr_t dma_addr __attribute__((__aligned__(4))); 
> /* 8 8 */
> } __attribute__((__packed__)) __attribute__((__aligned__(4)));

Is the attribute on the struct an artifact of pahole?
It should just have an alignment of 4 without anything special.

> 
> This presumably affects any 32-bit architecture with a 64-bit phys_addr_t
> / dma_addr_t.  Advice, please?

Only those where a 64-bit value is 64-bit aligned.
So that excludes x86 (which can have 64-bit dma) but includes sparc32
(which probably doesn't).

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)



RE: [PATCH 3/4] exec: simplify the compat syscall handling

2021-03-26 Thread David Laight
From: Al Viro
> Sent: 26 March 2021 16:12
> 
> On Fri, Mar 26, 2021 at 03:38:30PM +0100, Christoph Hellwig wrote:
> 
> > +static const char __user *
> > +get_user_arg_ptr(const char __user *const __user *argv, int nr)
> >  {
> > +   if (in_compat_syscall()) {
> > +   const compat_uptr_t __user *compat_argv =
> > +   compat_ptr((unsigned long)argv);
> > compat_uptr_t compat;
> >
> > +   if (get_user(compat, compat_argv + nr))
> > return ERR_PTR(-EFAULT);
> > return compat_ptr(compat);
> > +   } else {
> > +   const char __user *native;
> >
> > +   if (get_user(native, argv + nr))
> > +   return ERR_PTR(-EFAULT);
> > +   return native;
> > +   }
> >  }
> 
> Yecchhh  So you have in_compat_syscall() called again and again, for
> each argument in the list?  I agree that current version is fucking ugly,
> but I really hate that approach ;-/

Especially since in_compat_syscall() isn't entirely trivial on x86-64.
Probably all in the noise for 'exec', but all the bits do add up.

You may not want separate get_user() on some architectures either.
The user_access_begin/end aren't cheap.

OTOH if you call copy_from_user() you get hit by the stupid
additional costs of 'user copy hardening'.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)



RE: [PATCH] powerpc/vdso32: Add missing _restgpr_31_x to fix build failure

2021-03-16 Thread David Laight
From: Segher Boessenkool
> Sent: 16 March 2021 00:00
...
> > Although you may need to disable loop unrolling (often dubious at best)
> > and either force or disable some function inlining.
> 
> The cases where GCC does loop unrolling at -O2 always help quite a lot.
> Or, do you have a counter-example?  We'd love to see one.

The real problem with loop unrolling is that quite often a modern
out-of-order superscaler processor actually has 'spare' execution
cycles where the loop control can be done 'for free'.
Sometimes you do need to unroll (or interleave) a couple of
times to get enough spare execution cycles.

But the unrolled loop has to read a lot more code into cache
- so unless the code is 'hot cache' (that is usually arranged
for benchmarking) those delays apply as well.
The larger code footprint also displaces other code.

My real annoyance with gcc is unrolling (and vectorizing)
loops that I know are never executed as many times as even one
copy of the unrolled loop.

As an example intel (ivy bridge onwards) cpu execute the
following code (the middle of the ip checksum) at 8 bytes/clock.
(Limited by the carry flag.)
It just doesn't need any further unrolling.

+   "10:jecxz 20f\n"
+   "   adc   (%[buff], %[len]), %[sum_0]\n"
+   "   adc   8(%[buff], %[len]), %[sum_1]\n"
+   "   lea   32(%[len]), %[len_tmp]\n"
+   "   adc   16(%[buff], %[len]), %[sum_0]\n"
+   "   adc   24(%[buff], %[len]), %[sum_1]\n"
+   "   mov   %[len_tmp], %[len]\n"
+   "   jmp   10b\n"

Annoyingly that loop is slow on my 8-core atom. 
The existing code only does 4 bytes/clock on intel cpu prior
to either broadwell or haswell (forgotten which) in spite
of much more unroling.


> And yup, inlining is hard.  GCC's heuristics there are very good
> nowadays, but any single decision has big effects.  Doing the important
> spots manually (always_inline or noinline) has good payoff.

Latest inline gripe was a function replicated about 20 times
when the non-inline version was a register load and 'tail call'.
The inlining is just bloat.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)



RE: [PATCH] powerpc/vdso32: Add missing _restgpr_31_x to fix build failure

2021-03-15 Thread David Laight
From: Rasmus Villemoes
> Sent: 15 March 2021 16:24
> 
> On 12/03/2021 03.29, Segher Boessenkool wrote:
> > Hi!
> >
> > On Tue, Mar 09, 2021 at 06:19:30AM +, Christophe Leroy wrote:
> >> With some defconfig including CONFIG_CC_OPTIMIZE_FOR_SIZE,
> >> (for instance mvme5100_defconfig and ps3_defconfig), gcc 5
> >> generates a call to _restgpr_31_x.
> >
> >> I don't know if there is a way to tell GCC not to emit that call, because 
> >> at the end we get more
> instructions than needed.
> >
> > The function is required by the ABI, you need to have it.
> >
> > You get *fewer* insns statically, and that is what -Os is about: reduce
> > the size of the binaries.
> 
> Is there any reason to not just always build the vdso with -O2? It's one
> page/one VMA either way, and the vdso is about making certain system
> calls cheaper, so if unconditional -O2 could save a few cycles compared
> to -Os, why not? (And if, as it seems, there's only one user within the
> DSO of _restgpr_31_x, yes, the overall size of the .text segment
> probably increases slightly).

Sometimes -Os generates such horrid code you really never want to use it.
A classic is on x86 where it replaces 'load register with byte constant'
with 'push byte' 'pop register'.
The code is actually smaller but the execution time is horrid.

There are also cases where -O2 actually generates smaller code.

Although you may need to disable loop unrolling (often dubious at best)
and either force or disable some function inlining.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)



RE: Errant readings on LM81 with T2080 SoC

2021-03-15 Thread David Laight
From: Chris Packham
> Sent: 14 March 2021 21:26
> 
> On 12/03/21 10:25 pm, David Laight wrote:
> > From: Linuxppc-dev Guenter Roeck
> >> Sent: 11 March 2021 21:35
> >>
> >> On 3/11/21 1:17 PM, Chris Packham wrote:
> >>> On 11/03/21 9:18 pm, Wolfram Sang wrote:
> >>>>> Bummer. What is really weird is that you see clock stretching under
> >>>>> CPU load. Normally clock stretching is triggered by the device, not
> >>>>> by the host.
> >>>> One example: Some hosts need an interrupt per byte to know if they
> >>>> should send ACK or NACK. If that interrupt is delayed, they stretch the
> >>>> clock.
> >>>>
> >>> It feels like something like that is happening. Looking at the T2080
> >>> Reference manual there is an interesting timing diagram (Figure 14-2 if
> >>> someone feels like looking it up). It shows SCL low between the ACK for
> >>> the address and the data byte. I think if we're delayed in sending the
> >>> next byte we could violate Ttimeout or Tlow:mext from the SMBUS spec.
> >>>
> >> I think that really leaves you only two options that I can see:
> >> Rework the driver to handle critical actions (such as setting TXAK,
> >> and everything else that might result in clock stretching) in the
> >> interrupt handler, or rework the driver to handle everything in
> >> a high priority kernel thread.
> >
> > I'm not sure a high priority kernel thread will help.
> > Without CONFIG_PREEMPT (which has its own set of nasties)
> > a RT process won't be scheduled until the processor it last
> > ran on does a reschedule.
> > I don't think a kernel thread will be any different from a
> > user process running under the RT scheduler.
> >
> > I'm trying to remember the smbus spec (without remembering the I2C one).

> For those following along the spec is available here[0]. I know there's
> a 3.0 version[1] as well but the devices I'm dealing with are from a 2.0
> vintage.
> > While basically a clock+data bit-bang the slave is allowed to drive
> > the clock low to extend a cycle.
> > It may be allowed to do this at any point?
>
>  From what I can see it's actually the master extending the clock. Or
> more accurately holding it low between the address and data bytes (which
> from the T2080 reference manual looks expected). I think this may cause
> a strictly compliant SMBUS device to determine that Tlow:mext has been
> violated.

Yes, the spec does seem to assume that is a signal is stable
for 20ms something has gone 'horribly wrong'.
I wasn't worries about that, our fpga does the whole transaction
as a single command.
None of our slaves generate interrupts - so it is purely master/slave.

If you run your process under the RT scheduler it is unlikely
that pre-emption will be delayed by long enough to stop the process
running for 10ms.
I've seen >1ms delays (testing RTP audio), but most of the long
loops have a cond_resched() in them.

...

> Probably depends on the device implementation. I've got multiple other
> I2C/SMBUS devices and the LM81 seems to be the one that objects.

I bet most don't implement any of the timeouts.

I found one interesting pmbus device.
Sometimes it would detect a STOP condition because the data line
went high when it tri-stated its output driver in response to the
rising clock edge!
So it saw the same clock edge twice.

> [0] - http://www.smbus.org/specs/smbus20.pdf
> [1] - https://pmbus.org/Assets/PDFS/Public/SMBus_3_0_20141220.pdf

I should have both those - I've copied them to the directory where
I'd look for them first!

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)


RE: Errant readings on LM81 with T2080 SoC

2021-03-12 Thread David Laight
From: Linuxppc-dev Guenter Roeck
> Sent: 11 March 2021 21:35
> 
> On 3/11/21 1:17 PM, Chris Packham wrote:
> >
> > On 11/03/21 9:18 pm, Wolfram Sang wrote:
> >>> Bummer. What is really weird is that you see clock stretching under
> >>> CPU load. Normally clock stretching is triggered by the device, not
> >>> by the host.
> >> One example: Some hosts need an interrupt per byte to know if they
> >> should send ACK or NACK. If that interrupt is delayed, they stretch the
> >> clock.
> >>
> > It feels like something like that is happening. Looking at the T2080
> > Reference manual there is an interesting timing diagram (Figure 14-2 if
> > someone feels like looking it up). It shows SCL low between the ACK for
> > the address and the data byte. I think if we're delayed in sending the
> > next byte we could violate Ttimeout or Tlow:mext from the SMBUS spec.
> >
> 
> I think that really leaves you only two options that I can see:
> Rework the driver to handle critical actions (such as setting TXAK,
> and everything else that might result in clock stretching) in the
> interrupt handler, or rework the driver to handle everything in
> a high priority kernel thread.

I'm not sure a high priority kernel thread will help.
Without CONFIG_PREEMPT (which has its own set of nasties)
a RT process won't be scheduled until the processor it last
ran on does a reschedule.
I don't think a kernel thread will be any different from a
user process running under the RT scheduler.

I'm trying to remember the smbus spec (without remembering the I2C one).
While basically a clock+data bit-bang the slave is allowed to drive
the clock low to extend a cycle.
It may be allowed to do this at any point?
The master can generate the data at almost any rate (below the maximum)
but I don't think it can go down to zero.
But I do remember one of the specs having a timeout.

But I'd have thought the slave should answer the cycle correctly
regardless of any 'random' delays the master adds in.
Unless you are getting away with de-asserting chipselect?

The only implementation I've done is one an FPGA so doesn't have
worry about interrupt latencies.
It doesn't actually support clock stretching; it wasn't in the
code I started from and none of the slaves we need to connect to
ever does it.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)



RE: [PATCH v5 20/22] powerpc/syscall: Avoid storing 'current' in another pointer

2021-02-09 Thread David Laight
From: Christophe Leroy 
> Sent: 09 February 2021 17:04
> 
> Le 09/02/2021 à 15:31, David Laight a écrit :
> > From: Segher Boessenkool
> >> Sent: 09 February 2021 13:51
> >>
> >> On Tue, Feb 09, 2021 at 12:36:20PM +1000, Nicholas Piggin wrote:
> >>> What if you did this?
> >>
> >>> +static inline struct task_struct *get_current(void)
> >>> +{
> >>> + register struct task_struct *task asm ("r2");
> >>> +
> >>> + return task;
> >>> +}
> >>
> >> Local register asm variables are *only* guaranteed to live in that
> >> register as operands to an asm.  See
> >>
> >> https://gcc.gnu.org/onlinedocs/gcc/Local-Register-Variables.html#Local-Register-Variables
> >> ("The only supported use" etc.)
> >>
> >> You can do something like
> >>
> >> static inline struct task_struct *get_current(void)
> >> {
> >>register struct task_struct *task asm ("r2");
> >>
> >>asm("" : "+r"(task));
> >>
> >>return task;
> >> }
> >>
> >> which makes sure that "task" actually is in r2 at the point of that asm.
> >
> > If "r2" always contains current (and is never assigned by the compiler)
> > why not use a global register variable for it?
> >
> 
> 
> The change proposed by Nick doesn't solve the issue.
> 
> The problem is that at the begining of the function we have:
> 
>   unsigned long *ti_flagsp = _thread_info()->flags;
> 
> When the function uses ti_flagsp for the first time, it does use 112(r2)
> 
> Then the function calls some other functions.
> 
> Most likely because the function could update 'current', GCC copies r2 into 
> r30, so that if r2 get
> changed by the called function, ti_flagsp is still based on the previous 
> value of current.
> 
> Allthough we know r2 wont change, GCC doesn't know it. And in order to save 
> r2 into r30, it needs to
> save r30 in the stack.
> 
> 
> By using _thread_info()->flags directly instead of this intermediaite 
> ti_flagsp pointer, GCC
> uses r2 instead instead of doing a copy.

Does marking current_thread_info() 'pure' (I think that the right one)
work - so that gcc knows its result doesn't depend on external data
and that it doesn't change external data.

Although I'm not 100% how well those attributes actually work.

> Nick, I don't understand the reason why you need that 'ti_flagsp' local var.

Probably to save typing.

I sometimes reload locals after function calls.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)


RE: [PATCH v5 20/22] powerpc/syscall: Avoid storing 'current' in another pointer

2021-02-09 Thread David Laight
From: Segher Boessenkool
> Sent: 09 February 2021 13:51
> 
> On Tue, Feb 09, 2021 at 12:36:20PM +1000, Nicholas Piggin wrote:
> > What if you did this?
> 
> > +static inline struct task_struct *get_current(void)
> > +{
> > +   register struct task_struct *task asm ("r2");
> > +
> > +   return task;
> > +}
> 
> Local register asm variables are *only* guaranteed to live in that
> register as operands to an asm.  See
>   
> https://gcc.gnu.org/onlinedocs/gcc/Local-Register-Variables.html#Local-Register-Variables
> ("The only supported use" etc.)
> 
> You can do something like
> 
> static inline struct task_struct *get_current(void)
> {
>   register struct task_struct *task asm ("r2");
> 
>   asm("" : "+r"(task));
> 
>   return task;
> }
> 
> which makes sure that "task" actually is in r2 at the point of that asm.

If "r2" always contains current (and is never assigned by the compiler)
why not use a global register variable for it?

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)



RE: [PATCH v2 1/3] powerpc: sstep: Fix load and update emulation

2021-02-04 Thread David Laight
From: Segher Boessenkool
> Sent: 03 February 2021 21:18
...
> Power9 does:
> 
>   Load with Update Instructions (RA = 0)
> EA is placed into R0.

Does that change the value of 0?
Rather reminds me of some 1960s era systems that had the small integers
at fixed (global) addresses.
FORTRAN always passes by reference, pass 0 and the address of the global
zero location was passed, the called function could change 0 to 1 for
the entire computer!

>   Load with Update Instructions (RA = RT)
> The storage operand addressed by EA is accessed. The displacement
> field is added to the data returned by the load and placed into RT.

Shame that isn't standard - could be used to optimise some code.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)



RE: [PATCH v4 02/10] powerpc/signal: Add unsafe_copy_{vsx, fpr}_from_user()

2021-02-01 Thread David Laight
From: Christopher M. Riedl
> Sent: 01 February 2021 16:55
...
> > > > > + int i;  
> > > > > \
> > > > > + 
> > > > > \
> > > > > + unsafe_copy_from_user(buf, __f, ELF_NFPREG * sizeof(double),
> > > > > \
> > > > > + label); 
> > > > > \
> > > > > + for (i = 0; i < ELF_NFPREG - 1; i++)
> > > > > \
> > > > > + __t->thread.TS_FPR(i) = buf[i]; 
> > > > > \
> > > > > + __t->thread.fp_state.fpscr = buf[i];
> > > > > \
> > > > > +} while (0)
> >
> > On further reflection, since you immediately loop through the buffer
> > why not just use user_access_begin() and unsafe_get_user() in the loop.
> 
> Christophe had suggested this a few revisions ago as well. When I tried
> this approach, the signal handling performance took a pretty big hit:
> https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-October/219351.html
> 
> I included some numbers on v3 as well but decided to drop the approach
> altogether for this one since it just didn't seem worth the hit.

Was that using unsafe_get_user (which relies on user_access_begin()
having 'opened up' user accesses) or just get_user() that does
it for every access?

The former should be ok, the latter will be horrid.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)


RE: [PATCH v4 02/10] powerpc/signal: Add unsafe_copy_{vsx, fpr}_from_user()

2021-02-01 Thread David Laight
From: Christopher M. Riedl
> Sent: 01 February 2021 15:56
> 
> On Thu Jan 28, 2021 at 4:38 AM CST, David Laight wrote:
> > From: Christopher M. Riedl
> > > Sent: 28 January 2021 04:04
> > >
> > > Reuse the "safe" implementation from signal.c except for calling
> > > unsafe_copy_from_user() to copy into a local buffer.
> > >
> > > Signed-off-by: Christopher M. Riedl 
> > > ---
> > >  arch/powerpc/kernel/signal.h | 33 +
> > >  1 file changed, 33 insertions(+)
> > >
> > > diff --git a/arch/powerpc/kernel/signal.h b/arch/powerpc/kernel/signal.h
> > > index 2559a681536e..c18402d625f1 100644
> > > --- a/arch/powerpc/kernel/signal.h
> > > +++ b/arch/powerpc/kernel/signal.h
> > > @@ -53,6 +53,33 @@ unsigned long copy_ckfpr_from_user(struct task_struct 
> > > *task, void __user *from);
> > >   [i], label);\
> > >  } while (0)
> > >
> > > +#define unsafe_copy_fpr_from_user(task, from, label) do {
> > > \
> > > + struct task_struct *__t = task; \
> > > + u64 __user *__f = (u64 __user *)from;   \
> > > + u64 buf[ELF_NFPREG];\
> >
> > How big is that buffer?
> > Isn't is likely to be reasonably large compared to a reasonable
> > kernel stack frame.
> > Especially since this isn't even a leaf function.
> >
> 
> I think Christophe answered this - I don't really have an opinion either
> way. What would be a 'reasonable' kernel stack frame for reference?

Zero :-)

> 
> > > + int i;  \
> > > + \
> > > + unsafe_copy_from_user(buf, __f, ELF_NFPREG * sizeof(double),\
> > > + label); \
> > > + for (i = 0; i < ELF_NFPREG - 1; i++)\
> > > + __t->thread.TS_FPR(i) = buf[i]; \
> > > + __t->thread.fp_state.fpscr = buf[i];\
> > > +} while (0)

On further reflection, since you immediately loop through the buffer
why not just use user_access_begin() and unsafe_get_user() in the loop.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)


RE: [PATCH v4 02/10] powerpc/signal: Add unsafe_copy_{vsx, fpr}_from_user()

2021-01-28 Thread David Laight
From: Christopher M. Riedl
> Sent: 28 January 2021 04:04
> 
> Reuse the "safe" implementation from signal.c except for calling
> unsafe_copy_from_user() to copy into a local buffer.
> 
> Signed-off-by: Christopher M. Riedl 
> ---
>  arch/powerpc/kernel/signal.h | 33 +
>  1 file changed, 33 insertions(+)
> 
> diff --git a/arch/powerpc/kernel/signal.h b/arch/powerpc/kernel/signal.h
> index 2559a681536e..c18402d625f1 100644
> --- a/arch/powerpc/kernel/signal.h
> +++ b/arch/powerpc/kernel/signal.h
> @@ -53,6 +53,33 @@ unsigned long copy_ckfpr_from_user(struct task_struct 
> *task, void __user *from);
>   [i], label);\
>  } while (0)
> 
> +#define unsafe_copy_fpr_from_user(task, from, label) do {\
> + struct task_struct *__t = task; \
> + u64 __user *__f = (u64 __user *)from;   \
> + u64 buf[ELF_NFPREG];\

How big is that buffer?
Isn't is likely to be reasonably large compared to a reasonable
kernel stack frame.
Especially since this isn't even a leaf function.

> + int i;  \
> + \
> + unsafe_copy_from_user(buf, __f, ELF_NFPREG * sizeof(double),\

That really ought to be sizeof(buf).

David


> + label); \
> + for (i = 0; i < ELF_NFPREG - 1; i++)\
> + __t->thread.TS_FPR(i) = buf[i]; \
> + __t->thread.fp_state.fpscr = buf[i];\
> +} while (0)
> +
> +#define unsafe_copy_vsx_from_user(task, from, label) do {\
> + struct task_struct *__t = task; \
> + u64 __user *__f = (u64 __user *)from;   \
> + u64 buf[ELF_NVSRHALFREG];   \
> + int i;  \
> + \
> + unsafe_copy_from_user(buf, __f, \
> + ELF_NVSRHALFREG * sizeof(double),   \
> + label); \
> + for (i = 0; i < ELF_NVSRHALFREG ; i++)  \
> + __t->thread.fp_state.fpr[i][TS_VSRLOWOFFSET] = buf[i];  \
> +} while (0)
> +
> +
>  #ifdef CONFIG_PPC_TRANSACTIONAL_MEM
>  #define unsafe_copy_ckfpr_to_user(to, task, label)   do {\
>   struct task_struct *__t = task; \
> @@ -80,6 +107,10 @@ unsigned long copy_ckfpr_from_user(struct task_struct 
> *task, void __user *from);
>   unsafe_copy_to_user(to, (task)->thread.fp_state.fpr,\
>   ELF_NFPREG * sizeof(double), label)
> 
> +#define unsafe_copy_fpr_from_user(task, from, label) \
> + unsafe_copy_from_user((task)->thread.fp_state.fpr, from,\
> + ELF_NFPREG * sizeof(double), label)
> +
>  static inline unsigned long
>  copy_fpr_to_user(void __user *to, struct task_struct *task)
>  {
> @@ -115,6 +146,8 @@ copy_ckfpr_from_user(struct task_struct *task, void 
> __user *from)
>  #else
>  #define unsafe_copy_fpr_to_user(to, task, label) do { } while (0)
> 
> +#define unsafe_copy_fpr_from_user(task, from, label) do { } while (0)
> +
>  static inline unsigned long
>  copy_fpr_to_user(void __user *to, struct task_struct *task)
>  {
> --
> 2.26.1

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)



RE: [PATCH v4 11/23] powerpc/syscall: Rename syscall_64.c into syscall.c

2021-01-26 Thread David Laight
From: Nicholas Piggin
> Sent: 26 January 2021 10:21
> 
> Excerpts from Christophe Leroy's message of January 26, 2021 12:48 am:
> > syscall_64.c will be reused almost as is for PPC32.
> >
> > Rename it syscall.c
> 
> Could you rename it to interrupt.c instead? A system call is an
> interrupt, and the file now also has code to return from other
> interrupts as well, and it matches the new asm/interrupt.h from
> the interrupts series.

Hmmm

That might make it harder for someone looking for the system call
entry code to find it.

In some sense interrupts are the simpler case.

Especially when comparing with other architectures which have
special instructions for syscall entry.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)


RE: [PATCH v10 11/12] mm/vmalloc: Hugepage vmalloc mappings

2021-01-25 Thread David Laight
From: Christophe Leroy
> Sent: 25 January 2021 09:15
> 
> Le 24/01/2021 à 09:22, Nicholas Piggin a écrit :
> > Support huge page vmalloc mappings. Config option HAVE_ARCH_HUGE_VMALLOC
> > enables support on architectures that define HAVE_ARCH_HUGE_VMAP and
> > supports PMD sized vmap mappings.
> >
> > vmalloc will attempt to allocate PMD-sized pages if allocating PMD size
> > or larger, and fall back to small pages if that was unsuccessful.
> >
> > Architectures must ensure that any arch specific vmalloc allocations
> > that require PAGE_SIZE mappings (e.g., module allocations vs strict
> > module rwx) use the VM_NOHUGE flag to inhibit larger mappings.
> >
> > When hugepage vmalloc mappings are enabled in the next patch, this
> > reduces TLB misses by nearly 30x on a `git diff` workload on a 2-node
> > POWER9 (59,800 -> 2,100) and reduces CPU cycles by 0.54%.
> >
> > This can result in more internal fragmentation and memory overhead for a
> > given allocation, an option nohugevmalloc is added to disable at boot.
> >
> > Signed-off-by: Nicholas Piggin 
> > ---
> >   arch/Kconfig|  10 +++
> >   include/linux/vmalloc.h |  18 
> >   mm/page_alloc.c |   5 +-
> >   mm/vmalloc.c| 192 ++--
> >   4 files changed, 177 insertions(+), 48 deletions(-)
> >
> 
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index 0377e1d059e5..eef61e0f5170 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> 
> > @@ -2691,15 +2746,18 @@ EXPORT_SYMBOL_GPL(vmap_pfn);
> >   #endif /* CONFIG_VMAP_PFN */
> >
> >   static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
> > -pgprot_t prot, int node)
> > +pgprot_t prot, unsigned int page_shift,
> > +int node)
> >   {
> > const gfp_t nested_gfp = (gfp_mask & GFP_RECLAIM_MASK) | __GFP_ZERO;
> > -   unsigned int nr_pages = get_vm_area_size(area) >> PAGE_SHIFT;
> > -   unsigned long array_size;
> > -   unsigned int i;
> > +   unsigned int page_order = page_shift - PAGE_SHIFT;
> > +   unsigned long addr = (unsigned long)area->addr;
> > +   unsigned long size = get_vm_area_size(area);
> > +   unsigned int nr_small_pages = size >> PAGE_SHIFT;
> > struct page **pages;
> > +   unsigned int i;
> >
> > -   array_size = (unsigned long)nr_pages * sizeof(struct page *);
> > +   array_size = (unsigned long)nr_small_pages * sizeof(struct page *);
> 
> array_size() is a function in include/linux/overflow.h
> 
> For some reason, it breaks the build with your series.

I can't see the replacement definition for array_size.
The old local variable is deleted.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)


RE: [PATCH 02/13] module: add a module_loaded helper

2021-01-21 Thread David Laight
> 
> On Thu, Jan 21, 2021 at 11:00:20AM +0100, Christophe Leroy wrote:
> > > +bool module_loaded(const char *name);
> >
> > Maybe module_is_loaded() would be a better name.
> 
> Fine with me.

It does look like both callers aren't 'unsafe'.
But it is a bit strange returning a stale value.

It did make be look a bit more deeply at try_module_get().
For THIS_MODULE (eg to get an extra reference for a kthread)
I doubt it can ever fail.

But the other cases require a 'module' structure be passed in.
ISTM that unloading the module could invalidate the pointer
that has just been read from some other structure.

I'm guessing that some other lock must always be held.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)



RE: [PATCH 02/13] module: add a module_loaded helper

2021-01-21 Thread David Laight
From: Christophe Leroy
> Sent: 21 January 2021 10:00
> 
> Le 21/01/2021 à 08:49, Christoph Hellwig a écrit :
> > Add a helper that takes modules_mutex and uses find_module to check if a
> > given module is loaded.  This provides a better abstraction for the two
> > callers, and allows to unexport modules_mutex and find_module.
> >
> > Signed-off-by: Christoph Hellwig 
> > ---
> >   drivers/gpu/drm/drm_fb_helper.c |  7 +--
> >   include/linux/module.h  |  3 +++
> >   kernel/module.c | 14 --
> >   kernel/trace/trace_kprobe.c |  4 +---
> >   4 files changed, 17 insertions(+), 11 deletions(-)
> >
> 
> > diff --git a/include/linux/module.h b/include/linux/module.h
> > index 7a0bcb5b1ffccd..b4654f8a408134 100644
> > --- a/include/linux/module.h
> > +++ b/include/linux/module.h
> > @@ -589,6 +589,9 @@ static inline bool within_module(unsigned long addr, 
> > const struct module *mod)
> >   /* Search for module by name: must hold module_mutex. */
> >   struct module *find_module(const char *name);
> >
> > +/* Check if a module is loaded. */
> > +bool module_loaded(const char *name);
> 
> Maybe module_is_loaded() would be a better name.

I can't see the original patch.

What is the point of the function.
By the time it returns the information is stale - so mostly useless.

Surely you need to use try_module_get() instead?

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)


RE: [PATCH 2/2] powerpc/44x: Remove STDBINUTILS kconfig option

2021-01-20 Thread David Laight
From: Christophe Leroy
> Sent: 20 January 2021 07:49
> 
> STDBINUTILS is just a toggle to allow 256k page size
> to appear in the possible page sizes list for the 44x.
> 
> Make 256k page size appear all the time with an
> explicit warning on binutils, and remove this unneccessary
> STDBINUTILS config option.
> 
> Signed-off-by: Christophe Leroy 
> ---
>  arch/powerpc/Kconfig | 27 +++
>  1 file changed, 7 insertions(+), 20 deletions(-)
> 
> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> index a685e42d3993..3e29995540a7 100644
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -720,18 +720,6 @@ config ARCH_MEMORY_PROBE
>   def_bool y
>   depends on MEMORY_HOTPLUG
> 
> -config STDBINUTILS
> - bool "Using standard binutils settings"
> - depends on 44x
> - default y
> - help
> -   Turning this option off allows you to select 256KB PAGE_SIZE on 44x.
> -   Note, that kernel will be able to run only those applications,
> -   which had been compiled using binutils later than 2.17.50.0.3 with
> -   '-zmax-page-size' set to 256K (the default is 64K). Or, if using
> -   the older binutils, you can patch them with a trivial patch, which
> -   changes the ELF_MAXPAGESIZE definition from 0x1 to 0x4.
> -
>  choice
>   prompt "Page size"
>   default PPC_4K_PAGES
> @@ -771,17 +759,16 @@ config PPC_64K_PAGES
>   select HAVE_ARCH_SOFT_DIRTY if PPC_BOOK3S_64
> 
>  config PPC_256K_PAGES
> - bool "256k page size"
> - depends on 44x && !STDBINUTILS && !PPC_47x
> + bool "256k page size (Requires non-standard binutils settings)"
> + depends on 44x && !PPC_47x
>   help
> Make the page size 256k.
> 
> -   As the ELF standard only requires alignment to support page
> -   sizes up to 64k, you will need to compile all of your user
> -   space applications with a non-standard binutils settings
> -   (see the STDBINUTILS description for details).
> -
> -   Say N unless you know what you are doing.
> +   That kernel will be able to run only those applications,
> +   which had been compiled using binutils later than 2.17.50.0.3 with
> +   '-zmax-page-size' set to 256K (the default is 64K). Or, if using
> +   the older binutils, you can patch them with a trivial patch, which
> +   changes the ELF_MAXPAGESIZE definition from 0x1 to 0x4.


The kernel will only be able to run applications that have been
compiled with '-zmax-page-size' set to 256K (the default is 64K)
using binutils later than 2.17.50.0.3, or by patching the
ELF_MAXPAGESIZE definition from 0x1 to 0x4 in older versions.

> 
>  endchoice
> 
> --
> 2.25.0

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)



RE: [PATCH] Adds a new ioctl32 syscall for backwards compatibility layers

2021-01-16 Thread David Laight
...
> He's already doing the system call emulation for all the system
> calls other than ioctl in user space though. In my experience,
> there are actually fairly few ioctl commands that are different
> between architectures -- most of them have no misaligned
> or architecture-defined struct members at all.

Aren't there also some intractable issues with socket options?
IIRC the kernel code that tried to change them to 64bit was
horribly broken in some obscure cases.

Pushing the conversion down the stack not only identified the
issues, it also made them easier to fix.

If you change the kernel so a 64bit process can execute 32bit
system calls then a lot of the problems do go away.
This is probably easiest done by setting a high bit on the
system call number - as x86_64 does for x32 calls.

You still have to solve the different alignment of 64bit data
on i386.

Of course the system call numbers are different - but that is
just a lookup.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)


RE: [PATCH] Adds a new ioctl32 syscall for backwards compatibility layers

2021-01-16 Thread David Laight
From: sonicadvan...@gmail.com
> Sent: 15 January 2021 07:03
> Problem presented:
> A backwards compatibility layer that allows running x86-64 and x86
> processes inside of an AArch64 process.
>   - CPU is emulated
>   - Syscall interface is mostly passthrough
>   - Some syscalls require patching or emulation depending on behaviour
>   - Not viable from the emulator design to use an AArch32 host process
> 

You are going to need to add all the x86 compatibility code into
your arm64 kernel.
This is likely to be different from the 32bit arm compatibility
because 64bit items are only aligned on 32bit boundaries.
The x86 x32 compatibility will be more like the 32bit arm 'compat'
code - I'm pretty sure arm32 64bit aligned 64bit data.

You'll then need to remember how the process entered the kernel
to work out which compatibility code to invoke.
This is what x86 does.
It allows a single process to do all three types of system call.

Trying to 'patch up' structures outside the kernel, or in the
syscall interface code will always cause grief somewhere.
The only sane place is in the code that uses the structures.
Which, for ioctls, means inside the driver that parses them.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)



RE: [PATCH 05/11] iov_iter: merge the compat case into rw_copy_check_uvector

2021-01-08 Thread David Laight
From: Christoph Hellwig 
> Sent: 21 September 2020 15:34
> 
> Stop duplicating the iovec verify code, and instead add add a
> __import_iovec helper that does the whole verify and import, but takes
> a bool compat to decided on the native or compat layout.  This also
> ends up massively simplifying the calling conventions.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  lib/iov_iter.c | 195 ++---
>  1 file changed, 70 insertions(+), 125 deletions(-)
> 
> diff --git a/lib/iov_iter.c b/lib/iov_iter.c
> index a64867501a7483..8bfa47b63d39aa 100644
> --- a/lib/iov_iter.c
> +++ b/lib/iov_iter.c
> @@ -10,6 +10,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
> 
>  #define PIPE_PARANOIA /* for now */
> 
> @@ -1650,43 +1651,76 @@ const void *dup_iter(struct iov_iter *new, struct 
> iov_iter *old, gfp_t flags)
>  }
>  EXPORT_SYMBOL(dup_iter);
> 
> -static ssize_t rw_copy_check_uvector(int type,
> - const struct iovec __user *uvector, unsigned long nr_segs,
> - unsigned long fast_segs, struct iovec *fast_pointer,
> - struct iovec **ret_pointer)
> +static int compat_copy_iovecs_from_user(struct iovec *iov,
> + const struct iovec __user *uvector, unsigned long nr_segs)
> +{
> + const struct compat_iovec __user *uiov =
> + (const struct compat_iovec __user *)uvector;
> + unsigned long i;
> + int ret = -EFAULT;
> +
> + if (!user_access_begin(uvector, nr_segs * sizeof(*uvector)))
> + return -EFAULT;

I little bit late, but the above isn't quite right.
It should be sizeof(*iouv) - the length is double what it should be.

Not that access_ok() can fail for compat addresses
and the extra length won't matter for architectures that
need the address/length to open an address hole into userspace.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)



RE: [RFC please help] membarrier: Rewrite sync_core_before_usermode()

2021-01-01 Thread David Laight
From: Andy Lutomirski
> Sent: 29 December 2020 00:36
...
> I mean that the mapping from the name "sync_core" to its semantics is
> x86 only.  The string "sync_core" appears in the kernel only in
> arch/x86, membarrier code, membarrier docs, and a single SGI driver
> that is x86-only.  Sure, the idea of serializing things is fairly
> generic, but exactly what operations serialize what, when things need
> serialization, etc is quite architecture specific.
> 
> Heck, on 486 you serialize the instruction stream with JMP.

Did the 486 even have a memory cache?
Never mind separate I caches.
Without branch prediction or an I$ a jmp is enough.
No idea how the dual 486 box we had actually behaved.

For non-SMP the x86 cpus tend to still be compatible with
the original 8086 - so are pretty much fully coherent.
ISTR the memory writes will invalidate I$ lines.

But there was some hardware compatibility that meant a load
of Pentium-75 systems were 'scavenged' from development for
a customer - we got faster P-266 boxes as replacements.

OTOH we never did work out how to do the required 'barrier'
when switching a Via C3 to and from 16-bit mode.
Sometimes it worked, other times the cpu went AWOL.
Best guess was that it sometimes executed pre-decoded
instructions for the wrong mode when returning from the
function call that flipped modes.

Then there is the P-Pro era Intel doc that says that IOR/IOW
aren't sequenced wrt memory accesses.
Fortunately all x86 processors have sequenced them.
Which is what the current docs say.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)


RE: [PATCH] powerpc:Don't print raw EIP/LR hex values in dump_stack() and show_regs()

2020-12-21 Thread David Laight
From: Segher Boessenkool
> Sent: 21 December 2020 16:32
> 
> On Mon, Dec 21, 2020 at 04:17:21PM +0100, Christophe Leroy wrote:
> > Le 21/12/2020 à 04:27, Xiaoming Ni a écrit :
> > >Since the commit 2b0e86cc5de6 ("powerpc/fsl_booke/32: implement KASLR
> > >infrastructure"), the powerpc system is ready to support KASLR.
> > >To reduces the risk of invalidating address randomization, don't print the
> > >EIP/LR hex values in dump_stack() and show_regs().
> 
> > I think your change is not enough to hide EIP address, see below a dump
> > with you patch, you get "Faulting instruction address: 0xc03a0c14"
> 
> As far as I can see the patch does nothing to the GPR printout.  Often
> GPRs contain code addresses.  As one example, the LR is moved via a GPR
> (often GPR0, but not always) for storing on the stack.
> 
> So this needs more work.

If the dump_stack() is from an oops you need the real EIP value
on order to stand any chance of making headway.

Otherwise you might just as well just print 'borked - tough luck'.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)



RE: [PATCH] net: ethernet: fs-enet: remove casting dma_alloc_coherent

2020-12-11 Thread David Laight
From: Christophe Leroy
> Sent: 11 December 2020 15:22
> 
> Le 11/12/2020 à 09:52, Xu Wang a écrit :
> > Remove casting the values returned by dma_alloc_coherent.
> 
> Can you explain more in the commit log ?
> 
> As far as I can see, dma_alloc_coherent() doesn't return __iomem, and 
> ring_base member is __iomem

Which is probably wrong - that is the kernel address of kernel memory.
So it shouldn't have the __iomem marker.

I wonder what else is wrong

David

> 
> Christophe
> 
> >
> > Signed-off-by: Xu Wang 
> > ---
> >   drivers/net/ethernet/freescale/fs_enet/mac-fec.c | 2 +-
> >   1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/drivers/net/ethernet/freescale/fs_enet/mac-fec.c
> b/drivers/net/ethernet/freescale/fs_enet/mac-fec.c
> > index 99fe2c210d0f..3ae345676e50 100644
> > --- a/drivers/net/ethernet/freescale/fs_enet/mac-fec.c
> > +++ b/drivers/net/ethernet/freescale/fs_enet/mac-fec.c
> > @@ -131,7 +131,7 @@ static int allocate_bd(struct net_device *dev)
> > struct fs_enet_private *fep = netdev_priv(dev);
> > const struct fs_platform_info *fpi = fep->fpi;
> >
> > -   fep->ring_base = (void __force __iomem *)dma_alloc_coherent(fep->dev,
> > +   fep->ring_base = dma_alloc_coherent(fep->dev,
> > (fpi->tx_ring + fpi->rx_ring) *
> > sizeof(cbd_t), >ring_mem_addr,
> > GFP_KERNEL);
> >

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)


RE: [PATCH] net: ethernet: fs-enet: remove casting dma_alloc_coherent

2020-12-11 Thread David Laight
From: Christophe Leroy
> Sent: 11 December 2020 16:43
> 
> Le 11/12/2020 à 17:07, David Laight a écrit :
> > From: Christophe Leroy
> >> Sent: 11 December 2020 15:22
> >>
> >> Le 11/12/2020 à 09:52, Xu Wang a écrit :
> >>> Remove casting the values returned by dma_alloc_coherent.
> >>
> >> Can you explain more in the commit log ?
> >>
> >> As far as I can see, dma_alloc_coherent() doesn't return __iomem, and 
> >> ring_base member is __iomem
> >
> > Which is probably wrong - that is the kernel address of kernel memory.
> > So it shouldn't have the __iomem marker.
> 
> That's where the buffer descriptors are, the driver accesses to the content 
> of the buffer
> descriptors using the IO accessors in_be16()/out_be16(). Is it not correct ?

I've just been looking at the crap in there.
My understanding is that IO accessors are for IO devices (eg addresses
from io_remap() etc).

Buffers allocated by dma_alloc_coherent() are normal kernel memory
and don't need any accessors.
Now you might need some barriers - mostly because an ethernet chip
can typically read a ring entry without being prodded.
IIRC there is a barrier in writel() to ensure the dma master will
'see' all memory writes done before the IO write that kicks it into
doing some processing.

The fact that the driver contains so many __iomem casts (eg in
tx_restart) is an indication that something is badly awry.
__iomem exists to check you are using the correct type of pointer.
Any __iomem casts are dubious.

David

> 
> Christophe
> 
> >
> > I wonder what else is wrong
> >
> > David
> >
> >>
> >> Christophe
> >>
> >>>
> >>> Signed-off-by: Xu Wang 
> >>> ---
> >>>drivers/net/ethernet/freescale/fs_enet/mac-fec.c | 2 +-
> >>>1 file changed, 1 insertion(+), 1 deletion(-)
> >>>
> >>> diff --git a/drivers/net/ethernet/freescale/fs_enet/mac-fec.c
> >> b/drivers/net/ethernet/freescale/fs_enet/mac-fec.c
> >>> index 99fe2c210d0f..3ae345676e50 100644
> >>> --- a/drivers/net/ethernet/freescale/fs_enet/mac-fec.c
> >>> +++ b/drivers/net/ethernet/freescale/fs_enet/mac-fec.c
> >>> @@ -131,7 +131,7 @@ static int allocate_bd(struct net_device *dev)
> >>>   struct fs_enet_private *fep = netdev_priv(dev);
> >>>   const struct fs_platform_info *fpi = fep->fpi;
> >>>
> >>> - fep->ring_base = (void __force __iomem *)dma_alloc_coherent(fep->dev,
> >>> + fep->ring_base = dma_alloc_coherent(fep->dev,
> >>>   (fpi->tx_ring + 
> >>> fpi->rx_ring) *
> >>>   sizeof(cbd_t), 
> >>> >ring_mem_addr,
> >>>   GFP_KERNEL);
> >>>
> >
> > -
> > Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 
> > 1PT, UK
> > Registration No: 1397386 (Wales)
> >

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)


RE: [PATCH v6 4/5] PCI: vmd: Update type of the __iomem pointers

2020-11-30 Thread David Laight
From: Krzysztof Wilczynski
> Sent: 29 November 2020 23:08
> 
> Use "void __iomem" instead "char __iomem" pointer type when working with
> the accessor functions (with names like readb() or writel(), etc.) to
> better match a given accessor function signature where commonly the
> address pointing to an I/O memory region would be a "void __iomem"
> pointer.

ISTM that is heading in the wrong direction.

I think (form the variable names etc) that these are pointers
to specific registers.

So what you ought to have is a type for that register block.
Typically this is actually a structure - to give some type
checking that the offsets are being used with the correct
base address.

If the code is using numeric offsets (hardware engineers like
numeric offsets) then you can get some type protection by using
a structure that only contains a single field (char in this case).

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)


RE: Buggy commit tracked to: "Re: [PATCH 2/9] iov_iter: move rw_copy_check_uvector() into lib/iov_iter.c"

2020-11-02 Thread David Laight
From: 'Greg KH'
> Sent: 02 November 2020 13:52
> 
> On Mon, Nov 02, 2020 at 09:06:38AM +0000, David Laight wrote:
> > From: 'Greg KH'
> > > Sent: 23 October 2020 15:47
> > >
> > > On Fri, Oct 23, 2020 at 02:39:24PM +, David Laight wrote:
> > > > From: David Hildenbrand
> > > > > Sent: 23 October 2020 15:33
> > > > ...
> > > > > I just checked against upstream code generated by clang 10 and it
> > > > > properly discards the upper 32bit via a mov w23 w2.
> > > > >
> > > > > So at least clang 10 indeed properly assumes we could have garbage and
> > > > > masks it off.
> > > > >
> > > > > Maybe the issue is somewhere else, unrelated to nr_pages ... or clang 
> > > > > 11
> > > > > behaves differently.
> > > >
> > > > We'll need the disassembly from a failing kernel image.
> > > > It isn't that big to hand annotate.
> > >
> > > I've worked around the merge at the moment in the android tree, but it
> > > is still quite reproducable, and will try to get a .o file to
> > > disassemble on Monday or so...
> >
> > Did this get properly resolved?
> 
> For some reason, 5.10-rc2 fixed all of this up.  I backed out all of the
> patches I had to revert to get 5.10-rc1 to work properly, and then did
> the merge and all is well.
> 
> It must have been something to do with the compat changes in this same
> area that went in after 5.10-rc1, and something got reorganized in the
> files somehow.  I really do not know, and at the moment, don't have the
> time to track it down anymore.  So for now, I'd say it's all good, sorry
> for the noise.

Hopefully it won't appear again.

Saved me spending a day off reading arm64 assembler.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)



RE: Buggy commit tracked to: "Re: [PATCH 2/9] iov_iter: move rw_copy_check_uvector() into lib/iov_iter.c"

2020-11-02 Thread David Laight
From: 'Greg KH'
> Sent: 23 October 2020 15:47
> 
> On Fri, Oct 23, 2020 at 02:39:24PM +0000, David Laight wrote:
> > From: David Hildenbrand
> > > Sent: 23 October 2020 15:33
> > ...
> > > I just checked against upstream code generated by clang 10 and it
> > > properly discards the upper 32bit via a mov w23 w2.
> > >
> > > So at least clang 10 indeed properly assumes we could have garbage and
> > > masks it off.
> > >
> > > Maybe the issue is somewhere else, unrelated to nr_pages ... or clang 11
> > > behaves differently.
> >
> > We'll need the disassembly from a failing kernel image.
> > It isn't that big to hand annotate.
> 
> I've worked around the merge at the moment in the android tree, but it
> is still quite reproducable, and will try to get a .o file to
> disassemble on Monday or so...

Did this get properly resolved?

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)



RE: Buggy commit tracked to: "Re: [PATCH 2/9] iov_iter: move rw_copy_check_uvector() into lib/iov_iter.c"

2020-10-24 Thread David Laight
From: Segher Boessenkool
> Sent: 24 October 2020 18:29
> 
> On Fri, Oct 23, 2020 at 09:28:59PM +0000, David Laight wrote:
> > From: Segher Boessenkool
> > > Sent: 23 October 2020 19:27
> > > On Fri, Oct 23, 2020 at 06:58:57PM +0100, Al Viro wrote:
> > > > On Fri, Oct 23, 2020 at 03:09:30PM +0200, David Hildenbrand wrote:
> > > > On arm64 when callee expects a 32bit argument, the caller is *not* 
> > > > responsible
> > > > for clearing the upper half of 64bit register used to pass the value - 
> > > > it only
> > > > needs to store the actual value into the lower half.  The callee must 
> > > > consider
> > > > the contents of the upper half of that register as undefined.  See 
> > > > AAPCS64 (e.g.
> > > > https://github.com/ARM-software/abi-aa/blob/master/aapcs64/aapcs64.rst#parameter-passing-rules
> > > > ); AFAICS, the relevant bit is
> > > > "Unlike in the 32-bit AAPCS, named integral values must be 
> > > > narrowed by
> > > > the callee rather than the caller."
> > >
> > > Or the formal rule:
> > >
> > > C.9   If the argument is an Integral or Pointer Type, the size of the
> > >   argument is less than or equal to 8 bytes and the NGRN is less
> > >   than 8, the argument is copied to the least significant bits in
> > >   x[NGRN]. The NGRN is incremented by one. The argument has now
> > >   been allocated.
> >
> > So, in essence, if the value is in a 64bit register the calling
> > code is independent of the actual type of the formal parameter.
> > Clearly a value might need explicit widening.
> 
> No, this says that if you pass a 32-bit integer in a 64-bit register,
> then the top 32 bits of that register hold an undefined value.

That's sort of what I meant.
The 'normal' junk in the hight bits will there because the variable
in the calling code is wider.

> > I've found a copy of the 64 bit arm instruction set.
> > Unfortunately it is alpha sorted and repetitive so shows none
> > of the symmetry and makes things difficult to find.
> 
> All of this is ABI, not ISA.  Look at the AAPCS64 pointed to above.
> 
> > But, contrary to what someone suggested most register writes
> > (eg from arithmetic) seem to zero/extend the high bits.
> 
> Everything that writes a "w" does, yes.  But that has nothing to do with
> the parameter passing rules, that is ABI.  It just means that very often
> a 32-bit integer will be passed zero-extended in a 64-bit register, but
> that is just luck (or not, it makes finding bugs harder ;-) )

Working out why the code is wrong is more of an ISA issue than an ABI one.
It may be an ABI one, but the analysis is ISA.

I've written a lot of asm over the years - decoding compiler generated
asm isn't that hard.
At least ARM doesn't have annulled delay slots.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)



RE: Buggy commit tracked to: "Re: [PATCH 2/9] iov_iter: move rw_copy_check_uvector() into lib/iov_iter.c"

2020-10-23 Thread David Laight
From: Segher Boessenkool
> Sent: 23 October 2020 19:27
> 
> On Fri, Oct 23, 2020 at 06:58:57PM +0100, Al Viro wrote:
> > On Fri, Oct 23, 2020 at 03:09:30PM +0200, David Hildenbrand wrote:
> >
> > > Now, I am not a compiler expert, but as I already cited, at least on
> > > x86-64 clang expects that the high bits were cleared by the caller - in
> > > contrast to gcc. I suspect it's the same on arm64, but again, I am no
> > > compiler expert.
> > >
> > > If what I said and cites for x86-64 is correct, if the function expects
> > > an "unsigned int", it will happily use 64bit operations without further
> > > checks where valid when assuming high bits are zero. That's why even
> > > converting everything to "unsigned int" as proposed by me won't work on
> > > clang - it assumes high bits are zero (as indicated by Nick).
> > >
> > > As I am neither a compiler experts (did I mention that already? ;) ) nor
> > > an arm64 experts, I can't tell if this is a compiler BUG or not.
> >
> > On arm64 when callee expects a 32bit argument, the caller is *not* 
> > responsible
> > for clearing the upper half of 64bit register used to pass the value - it 
> > only
> > needs to store the actual value into the lower half.  The callee must 
> > consider
> > the contents of the upper half of that register as undefined.  See AAPCS64 
> > (e.g.
> > https://github.com/ARM-software/abi-aa/blob/master/aapcs64/aapcs64.rst#parameter-passing-rules
> > ); AFAICS, the relevant bit is
> > "Unlike in the 32-bit AAPCS, named integral values must be narrowed by
> > the callee rather than the caller."
> 
> Or the formal rule:
> 
> C.9   If the argument is an Integral or Pointer Type, the size of the
>   argument is less than or equal to 8 bytes and the NGRN is less
>   than 8, the argument is copied to the least significant bits in
>   x[NGRN]. The NGRN is incremented by one. The argument has now
>   been allocated.

So, in essence, if the value is in a 64bit register the calling
code is independent of the actual type of the formal parameter.
Clearly a value might need explicit widening.

I've found a copy of the 64 bit arm instruction set.
Unfortunately it is alpha sorted and repetitive so shows none
of the symmetry and makes things difficult to find.
But, contrary to what someone suggested most register writes
(eg from arithmetic) seem to zero/extend the high bits.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)



RE: Buggy commit tracked to: "Re: [PATCH 2/9] iov_iter: move rw_copy_check_uvector() into lib/iov_iter.c"

2020-10-23 Thread David Laight
From: David Hildenbrand
> Sent: 23 October 2020 15:33
...
> I just checked against upstream code generated by clang 10 and it
> properly discards the upper 32bit via a mov w23 w2.
> 
> So at least clang 10 indeed properly assumes we could have garbage and
> masks it off.
> 
> Maybe the issue is somewhere else, unrelated to nr_pages ... or clang 11
> behaves differently.

We'll need the disassembly from a failing kernel image.
It isn't that big to hand annotate.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)


RE: Buggy commit tracked to: "Re: [PATCH 2/9] iov_iter: move rw_copy_check_uvector() into lib/iov_iter.c"

2020-10-23 Thread David Laight
From: Arnd Bergmann
> Sent: 23 October 2020 14:23
> 
> On Fri, Oct 23, 2020 at 2:46 PM David Laight  wrote:
> >
> > From: Greg KH 
> > > Sent: 22 October 2020 14:51
> >
> > I've rammed the code into godbolt.
> >
> > https://godbolt.org/z/9v5PPW
> >
> > Definitely a clang bug.
> >
> > Search for [wx]24 in the clang output.
> > nr_segs comes in as w2 and the initial bound checks are done on w2.
> > w24 is loaded from w2 - I don't believe this changes the high bits.
> 
> You believe wrong, "mov w24, w2" is a zero-extending operation.

Ah ok, but gcc uses utxw for the same task.
I guess they could be the same opcode.

Last time I wrote ARM thumb didn't really exist - never mind 64bit

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)


RE: Buggy commit tracked to: "Re: [PATCH 2/9] iov_iter: move rw_copy_check_uvector() into lib/iov_iter.c"

2020-10-23 Thread David Laight
From: Greg KH 
> Sent: 22 October 2020 14:51

I've rammed the code into godbolt.

https://godbolt.org/z/9v5PPW

Definitely a clang bug.

Search for [wx]24 in the clang output.
nr_segs comes in as w2 and the initial bound checks are done on w2.
w24 is loaded from w2 - I don't believe this changes the high bits.
There are no references to w24, just x24.
So the kmalloc_array() is passed 'huge' and will fail.
The iov_iter_init also gets the 64bit value.

Note that the gcc code has a sign-extend copy of w2.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)



RE: Buggy commit tracked to: "Re: [PATCH 2/9] iov_iter: move rw_copy_check_uvector() into lib/iov_iter.c"

2020-10-22 Thread David Laight
From: Al Viro
> Sent: 22 October 2020 20:25
> 
> On Thu, Oct 22, 2020 at 12:04:52PM -0700, Nick Desaulniers wrote:
> 
> > Passing an `unsigned long` as an `unsigned int` does no such
> > narrowing: https://godbolt.org/z/TvfMxe (same vice-versa, just tail
> > calls, no masking instructions).
> > So if rw_copy_check_uvector() is inlined into import_iovec() (looking
> > at the mainline@1028ae406999), then children calls of
> > `rw_copy_check_uvector()` will be interpreting the `nr_segs` register
> > unmodified, ie. garbage in the upper 32b.
> 
> FWIW,
> 
> void f(unsinged long v)
> {
>   if (v != 1)
>   printf("failed\n");
> }
> 
> void g(unsigned int v)
> {
>   f(v);
> }
> 
> void h(unsigned long v)
> {
>   g(v);
> }
> 
> main()
> {
>   h(0x10001);
> }
> 
> must not produce any output on a host with 32bit int and 64bit long, 
> regardless of
> the inlining, having functions live in different compilation units, etc.
> 
> Depending upon the calling conventions, compiler might do truncation in 
> caller or
> in a callee, but it must be done _somewhere_.

Put g() in a separate compilation unit and use the 'wrong' type
in the prototypes t() used to call g() and g() uses to call f().

Then you might see where and masking does (or does not) happen.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)



RE: Buggy commit tracked to: "Re: [PATCH 2/9] iov_iter: move rw_copy_check_uvector() into lib/iov_iter.c"

2020-10-22 Thread David Laight
From: Nick Desaulniers
> Sent: 22 October 2020 20:05
> 
...
> Passing an `unsigned long` as an `unsigned int` does no such
> narrowing: https://godbolt.org/z/TvfMxe (same vice-versa, just tail
> calls, no masking instructions).

Right but is the called function going to use 32bit ops
and/or mask the register?
Certainly that is likely on x86-64.

I've rather lost track of where the masking is expected
to happen.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)


RE: Buggy commit tracked to: "Re: [PATCH 2/9] iov_iter: move rw_copy_check_uvector() into lib/iov_iter.c"

2020-10-22 Thread David Laight
From: Matthew Wilcox 
> Sent: 22 October 2020 17:41
> 
> On Thu, Oct 22, 2020 at 04:35:17PM +0000, David Laight wrote:
> > Wait...
> > readv(2) defines:
> > ssize_t readv(int fd, const struct iovec *iov, int iovcnt);
> 
> It doesn't really matter what the manpage says.  What does the AOSP
> libc header say?

The only copy I can find is:

/usr/include/x86_64-linux-gnu/sys/uio.h:extern ssize_t readv (int __fd, const 
struct iovec *__iovec, int __count)
/usr/include/x86_64-linux-gnu/sys/uio.h-  __wur;

and not surprisingly agrees.
POSIX and/or TOG will (more or less) mandate the prototype.

> > But the syscall is defined as:
> >
> > SYSCALL_DEFINE3(readv, unsigned long, fd, const struct iovec __user *, vec,
> > unsigned long, vlen)
> > {
> > return do_readv(fd, vec, vlen, 0);
> > }

I wonder when the high bits of 'fd' get zapped?

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)



RE: Buggy commit tracked to: "Re: [PATCH 2/9] iov_iter: move rw_copy_check_uvector() into lib/iov_iter.c"

2020-10-22 Thread David Laight
From: Christoph Hellwig
> Sent: 22 October 2020 14:24
> 
> On Thu, Oct 22, 2020 at 11:36:40AM +0200, David Hildenbrand wrote:
> > My thinking: if the compiler that calls import_iovec() has garbage in
> > the upper 32 bit
> >
> > a) gcc will zero it out and not rely on it being zero.
> > b) clang will not zero it out, assuming it is zero.
> >
> > But
> >
> > a) will zero it out when calling the !inlined variant
> > b) clang will zero it out when calling the !inlined variant
> >
> > When inlining, b) strikes. We access garbage. That would mean that we
> > have calling code that's not generated by clang/gcc IIUC.
> 
> Most callchains of import_iovec start with the assembly syscall wrappers.

Wait...
readv(2) defines:
ssize_t readv(int fd, const struct iovec *iov, int iovcnt);

But the syscall is defined as:

SYSCALL_DEFINE3(readv, unsigned long, fd, const struct iovec __user *, vec,
unsigned long, vlen)
{
return do_readv(fd, vec, vlen, 0);
}

I'm guessing that nothing actually masks the high bits that come
from an application that is compiled with clang?

The vlen is 'unsigned long' through the first few calls.
So unless there is a non-inlined function than takes vlen
as 'int' the high garbage bits from userspace are kept.

Which makes it a bug in the kernel C syscall wrappers.
They need to explicitly mask the high bits of 32bit
arguments on arm64 but not x86-64.

What does the ARM EABI say about register parameters?

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)



RE: Buggy commit tracked to: "Re: [PATCH 2/9] iov_iter: move rw_copy_check_uvector() into lib/iov_iter.c"

2020-10-22 Thread David Laight
From: Greg KH
> Sent: 22 October 2020 15:40
> 
> On Thu, Oct 22, 2020 at 04:28:20PM +0200, Arnd Bergmann wrote:
...
> > Can you attach the iov_iter.s files from the broken build, plus the
> > one with 'noinline' for comparison? Maybe something can be seen
> > in there.
> 
> I don't know how to extract the .s files easily from the AOSP build
> system, I'll look into that.  I'm also now testing by downgrading to an
> older version of clang (10 instead of 11), to see if that matters at all
> or not...

Back from a day out - after it stopped raining.
Trying to use up leave before the end of the year.

Can you use objdump on the kernel binary itself and cut out
the single function?

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)



RE: Buggy commit tracked to: "Re: [PATCH 2/9] iov_iter: move rw_copy_check_uvector() into lib/iov_iter.c"

2020-10-22 Thread David Laight
From: David Hildenbrand
> Sent: 22 October 2020 10:25
...
> ... especially because I recall that clang and gcc behave slightly
> differently:
> 
> https://github.com/hjl-tools/x86-psABI/issues/2
> 
> "Function args are different: narrow types are sign or zero extended to
> 32 bits, depending on their type. clang depends on this for incoming
> args, but gcc doesn't make that assumption. But both compilers do it
> when calling, so gcc code can call clang code.

It really is best to use 'int' (or even 'long') for all numeric
arguments (and results) regardless of the domain of the value.

Related, I've always worried about 'bool'

> The upper 32 bits of registers are always undefined garbage for types
> smaller than 64 bits."

On x86-64 the high bits are zeroed by all 32bit loads.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)


RE: Buggy commit tracked to: "Re: [PATCH 2/9] iov_iter: move rw_copy_check_uvector() into lib/iov_iter.c"

2020-10-22 Thread David Laight
From: David Hildenbrand
> Sent: 22 October 2020 10:19
> 
> On 22.10.20 11:01, Greg KH wrote:
> > On Thu, Oct 22, 2020 at 10:48:59AM +0200, David Hildenbrand wrote:
> >> On 22.10.20 10:40, David Laight wrote:
> >>> From: David Hildenbrand
> >>>> Sent: 22 October 2020 09:35
> >>>>
> >>>> On 22.10.20 10:26, Greg KH wrote:
> >>>>> On Thu, Oct 22, 2020 at 12:39:14AM +0100, Al Viro wrote:
> >>>>>> On Wed, Oct 21, 2020 at 06:13:01PM +0200, Greg KH wrote:
> >>>>>>> On Fri, Sep 25, 2020 at 06:51:39AM +0200, Christoph Hellwig wrote:
> >>>>>>>> From: David Laight 
> >>>>>>>>
> >>>>>>>> This lets the compiler inline it into import_iovec() generating
> >>>>>>>> much better code.
> >>>>>>>>
> >>>>>>>> Signed-off-by: David Laight 
> >>>>>>>> Signed-off-by: Christoph Hellwig 
> >>>>>>>> ---
> >>>>>>>>  fs/read_write.c | 179 
> >>>>>>>> 
> >>>>>>>>  lib/iov_iter.c  | 176 
> >>>>>>>> +++
> >>>>>>>>  2 files changed, 176 insertions(+), 179 deletions(-)
> >>>>>>>
> >>>>>>> Strangely, this commit causes a regression in Linus's tree right now.
> >>>>>>>
> >>>>>>> I can't really figure out what the regression is, only that this 
> >>>>>>> commit
> >>>>>>> triggers a "large Android system binary" from working properly.  
> >>>>>>> There's
> >>>>>>> no kernel log messages anywhere, and I don't have any way to strace 
> >>>>>>> the
> >>>>>>> thing in the testing framework, so any hints that people can provide
> >>>>>>> would be most appreciated.
> >>>>>>
> >>>>>> It's a pure move - modulo changed line breaks in the argument lists
> >>>>>> the functions involved are identical before and after that (just 
> >>>>>> checked
> >>>>>> that directly, by checking out the trees before and after, extracting 
> >>>>>> two
> >>>>>> functions in question from fs/read_write.c and lib/iov_iter.c (before 
> >>>>>> and
> >>>>>> after, resp.) and checking the diff between those.
> >>>>>>
> >>>>>> How certain is your bisection?
> >>>>>
> >>>>> The bisection is very reproducable.
> >>>>>
> >>>>> But, this looks now to be a compiler bug.  I'm using the latest version
> >>>>> of clang and if I put "noinline" at the front of the function,
> >>>>> everything works.
> >>>>
> >>>> Well, the compiler can do more invasive optimizations when inlining. If
> >>>> you have buggy code that relies on some unspecified behavior, inlining
> >>>> can change the behavior ... but going over that code, there isn't too
> >>>> much action going on. At least nothing screamed at me.
> >>>
> >>> Apart from all the optimisations that get rid off the 'pass be reference'
> >>> parameters and strange conditional tests.
> >>> Plenty of scope for the compiler getting it wrong.
> >>> But nothing even vaguely illegal.
> >>
> >> Not the first time that people blame the compiler to then figure out
> >> that something else is wrong ... but maybe this time is different :)
> >
> > I agree, I hate to blame the compiler, that's almost never the real
> > problem, but this one sure "feels" like it.
> >
> > I'm running some more tests, trying to narrow things down as just adding
> > a "noinline" to the function that got moved here doesn't work on Linus's
> > tree at the moment because the function was split into multiple
> > functions.
> >
> > Give me a few hours...
> 
> I might be wrong but
> 
> a) import_iovec() uses:
> - unsigned nr_segs -> int
> - unsigned fast_segs -> int
> b) rw_copy_check_uvector() uses:
> - unsigned long nr_segs -> long
> - unsigned long fast_seg -> long
> 
> So when calling rw_copy_check_uvector(), we have to zero-extend the
> registers used for passing the arguments. That's definitely done when
> calling the function explicitly. Maybe when inlining something is messed up?

That's also not needed on x86-64 - the high bits get cleared by 32bit writes.
But, IIRC, arm64 leaves them unchanged or undefined.

I guessing that every array access uses a *(Rx + Ry) addressing
mode. So indexing an array even with 'unsigned int' requires
an explicit zero-extend on arm64?
(x86-64 ends up with an explicit sign extend when indexing an
array with 'signed int'.)

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)


RE: Buggy commit tracked to: "Re: [PATCH 2/9] iov_iter: move rw_copy_check_uvector() into lib/iov_iter.c"

2020-10-22 Thread David Laight
From: Greg KH
> Sent: 22 October 2020 10:02
...
> I'm running some more tests, trying to narrow things down as just adding
> a "noinline" to the function that got moved here doesn't work on Linus's
> tree at the moment because the function was split into multiple
> functions.

I was going to look at that once rc2 is in - and the kernel is
likely to work.

I suspect the split version doesn't get inlined the same way?
Which leaves the horrid argument conversion the inline got
rid of back there again.

Which all rather begs the question of why the compiler doesn't
generate the expected code.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)



RE: Buggy commit tracked to: "Re: [PATCH 2/9] iov_iter: move rw_copy_check_uvector() into lib/iov_iter.c"

2020-10-22 Thread David Laight
From: David Hildenbrand
> Sent: 22 October 2020 09:49
...
> >>> But, this looks now to be a compiler bug.  I'm using the latest version
> >>> of clang and if I put "noinline" at the front of the function,
> >>> everything works.
> >>
> >> Well, the compiler can do more invasive optimizations when inlining. If
> >> you have buggy code that relies on some unspecified behavior, inlining
> >> can change the behavior ... but going over that code, there isn't too
> >> much action going on. At least nothing screamed at me.
> >
> > Apart from all the optimisations that get rid off the 'pass be reference'
> > parameters and strange conditional tests.
> > Plenty of scope for the compiler getting it wrong.
> > But nothing even vaguely illegal.
> 
> Not the first time that people blame the compiler to then figure out
> that something else is wrong ... but maybe this time is different :)

Usually down to missing asm 'memory' constraints...

Need to read the obj file to see what the compiler did.

The code must be 'approximately right' or nothing would run.
So I'd guess it has to do with > 8 fragments.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)


RE: Buggy commit tracked to: "Re: [PATCH 2/9] iov_iter: move rw_copy_check_uvector() into lib/iov_iter.c"

2020-10-22 Thread David Laight
From: David Hildenbrand
> Sent: 22 October 2020 09:35
> 
> On 22.10.20 10:26, Greg KH wrote:
> > On Thu, Oct 22, 2020 at 12:39:14AM +0100, Al Viro wrote:
> >> On Wed, Oct 21, 2020 at 06:13:01PM +0200, Greg KH wrote:
> >>> On Fri, Sep 25, 2020 at 06:51:39AM +0200, Christoph Hellwig wrote:
> >>>> From: David Laight 
> >>>>
> >>>> This lets the compiler inline it into import_iovec() generating
> >>>> much better code.
> >>>>
> >>>> Signed-off-by: David Laight 
> >>>> Signed-off-by: Christoph Hellwig 
> >>>> ---
> >>>>  fs/read_write.c | 179 
> >>>>  lib/iov_iter.c  | 176 +++
> >>>>  2 files changed, 176 insertions(+), 179 deletions(-)
> >>>
> >>> Strangely, this commit causes a regression in Linus's tree right now.
> >>>
> >>> I can't really figure out what the regression is, only that this commit
> >>> triggers a "large Android system binary" from working properly.  There's
> >>> no kernel log messages anywhere, and I don't have any way to strace the
> >>> thing in the testing framework, so any hints that people can provide
> >>> would be most appreciated.
> >>
> >> It's a pure move - modulo changed line breaks in the argument lists
> >> the functions involved are identical before and after that (just checked
> >> that directly, by checking out the trees before and after, extracting two
> >> functions in question from fs/read_write.c and lib/iov_iter.c (before and
> >> after, resp.) and checking the diff between those.
> >>
> >> How certain is your bisection?
> >
> > The bisection is very reproducable.
> >
> > But, this looks now to be a compiler bug.  I'm using the latest version
> > of clang and if I put "noinline" at the front of the function,
> > everything works.
> 
> Well, the compiler can do more invasive optimizations when inlining. If
> you have buggy code that relies on some unspecified behavior, inlining
> can change the behavior ... but going over that code, there isn't too
> much action going on. At least nothing screamed at me.

Apart from all the optimisations that get rid off the 'pass be reference'
parameters and strange conditional tests.
Plenty of scope for the compiler getting it wrong.
But nothing even vaguely illegal.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)


RE: Buggy commit tracked to: "Re: [PATCH 2/9] iov_iter: move rw_copy_check_uvector() into lib/iov_iter.c"

2020-10-21 Thread David Laight
From: Greg KH
> Sent: 21 October 2020 17:13
> 
> On Fri, Sep 25, 2020 at 06:51:39AM +0200, Christoph Hellwig wrote:
> > From: David Laight 
> >
> > This lets the compiler inline it into import_iovec() generating
> > much better code.
> >
> > Signed-off-by: David Laight 
> > Signed-off-by: Christoph Hellwig 
> > ---
> >  fs/read_write.c | 179 
> >  lib/iov_iter.c  | 176 +++
> >  2 files changed, 176 insertions(+), 179 deletions(-)
> 
> Strangely, this commit causes a regression in Linus's tree right now.
> 
> I can't really figure out what the regression is, only that this commit
> triggers a "large Android system binary" from working properly.  There's
> no kernel log messages anywhere, and I don't have any way to strace the
> thing in the testing framework, so any hints that people can provide
> would be most appreciated.

My original commit just moved the function source from one file to another.
So it is odd that it makes any difference.
I don't even know if it gets inlined by Christoph's actual patch.
(I have another patch that depended on it that I need to resubmit.)

Some of the other changes from Christoph's same patch set might
make a difference though.

Might be worth forcing it to be not inlined - so it is no change.
Or try adding a kernel log to import_iovec() or the associated
copy failing.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)



RE: [PATCH 5/9] fs: remove various compat readv/writev helpers

2020-09-23 Thread David Laight
From: Arnd Bergmann
> Sent: 23 September 2020 19:46
...
> Regardless of that, another advantage of having the SYSCALL_DECLAREx()
> would be the ability to include that header file from elsewhere with a 
> different
> macro definition to create a machine-readable version of the interface when
> combined with the syscall.tbl files. This could be used to create a user
> space stub for calling into the low-level syscall regardless of the
> libc interfaces,
> or for synchronizing the interfaces with strace, qemu-user, or anything that
> needs to deal with the low-level interface.

A similar 'trick' (that probably won't work here) is to pass
the name of a #define function as a parameter to another define.
Useful for defining constants and error strings together. eg:
#define TRAFFIC_LIGHTS(x) \
x(RED, 0, "Red") \
x(YELLOW, 1, "Yellow) \
x(GREEN, 2, "GREEN)

You can then do thing like:
#define x(token, value, string) token = value,
enum {TRAFFIC_LIGHTS(x) NUM_LIGHTS};
#undef x
#define x(token, value, string) [value] = string,
const char *colours[] = {TRAFFIC_LIGHTS(x)};
#undef x
to initialise constants and a name table that are always in sync.

It is also a good way to generate source lines that are over 1MB.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)


RE: [PATCH 3/9] iov_iter: refactor rw_copy_check_uvector and import_iovec

2020-09-23 Thread David Laight
From: Al Viro
> Sent: 23 September 2020 15:17
> 
> On Wed, Sep 23, 2020 at 08:05:41AM +0200, Christoph Hellwig wrote:
> 
> > +struct iovec *iovec_from_user(const struct iovec __user *uvec,
> > +   unsigned long nr_segs, unsigned long fast_segs,
> 
> Hmm...  For fast_segs unsigned long had always been ridiculous
> (4G struct iovec on caller stack frame?), but that got me wondering about
> nr_segs and I wish I'd thought of that when introducing import_iovec().
> 
> The thing is, import_iovec() takes unsigned int there.  Which is fine
> (hell, the maximal value that can be accepted in 1024), except that
> we do pass unsigned long syscall argument to it in some places.

It will make diddly-squit difference.
The parameters end up in registers on most calling conventions.
Plausibly you get an extra 'REX' byte on x86 for the 64bit value.
What you want to avoid is explicit sign/zero extension and value
masking after arithmetic.

On x86-64 the 'horrid' type is actually 'signed int'.
It often needs sign extending to 64bits (eg when being
used as an array subscript).

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)



RE: [PATCH 02/11] mm: call import_iovec() instead of rw_copy_check_uvector() in process_vm_rw()

2020-09-21 Thread David Laight
From: Al Viro
> Sent: 21 September 2020 16:30
> 
> On Mon, Sep 21, 2020 at 03:21:35PM +0000, David Laight wrote:
> 
> > You really don't want to be looping through the array twice.
> 
> Profiles, please.

I did some profiling of send() v sendmsg() much earlier in the year.
I can't remember the exact details but the extra cost of sendmsg()
is far more than you might expect.
(I was timing sending fully built IPv4 UDP packets using a raw socket.)

About half the difference does away if you change the
copy_from_user() to __copy_from_user() when reading the struct msghdr
and iov[] from userspace (user copy hardening is expensive).

The rest is just code path, my gut feeling is that a lot of that
is in import_iovec().

Remember semdmsg() is likely to be called with an iov count of 1.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)



RE: [PATCH 04/11] iov_iter: explicitly check for CHECK_IOVEC_ONLY in rw_copy_check_uvector

2020-09-21 Thread David Laight
From: Al Viro
> Sent: 21 September 2020 16:11
> On Mon, Sep 21, 2020 at 03:05:32PM +, David Laight wrote:
> 
> > I've actually no idea:
> > 1) Why there is an access_ok() check here.
> >It will be repeated by the user copy functions.
> 
> Early sanity check.
> 
> > 2) Why it isn't done when called from mm/process_vm_access.c.
> >Ok, the addresses refer to a different process, but they
> >must still be valid user addresses.
> >
> > Is 2 a legacy from when access_ok() actually checked that the
> > addresses were mapped into the process's address space?
> 
> It never did.  2 is for the situation when a 32bit process
> accesses 64bit one; addresses that are perfectly legitimate
> for 64bit userland (and fitting into the first 4Gb of address
> space, so they can be represented by 32bit pointers just fine)
> might be rejected by access_ok() if the caller is 32bit.

Can't 32 bit processes on a 64bit system access all the way to 4GB?
Mapping things by default above 3GB will probably break things.
But there is no reason to disallow explicit maps.
And in any case access_ok() can use the same limit as it does for
64bit processes - the page fault handler will sort it all out.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)



RE: [PATCH 02/11] mm: call import_iovec() instead of rw_copy_check_uvector() in process_vm_rw()

2020-09-21 Thread David Laight
From: Al Viro
> Sent: 21 September 2020 16:02
> 
> On Mon, Sep 21, 2020 at 04:34:25PM +0200, Christoph Hellwig wrote:
> > From: David Laight 
> >
> > This is the only direct call of rw_copy_check_uvector().  Removing it
> > will allow rw_copy_check_uvector() to be inlined into import_iovec(),
> > while only paying a minor price by setting up an otherwise unused
> > iov_iter in the process_vm_readv/process_vm_writev syscalls that aren't
> > in a super hot path.
> 
> > @@ -443,7 +443,7 @@ void iov_iter_init(struct iov_iter *i, unsigned int 
> > direction,
> > const struct iovec *iov, unsigned long nr_segs,
> > size_t count)
> >  {
> > -   WARN_ON(direction & ~(READ | WRITE));
> > +   WARN_ON(direction & ~(READ | WRITE | CHECK_IOVEC_ONLY));
> > direction &= READ | WRITE;
> 
> Ugh...
> 
> > -   rc = rw_copy_check_uvector(CHECK_IOVEC_ONLY, rvec, riovcnt, UIO_FASTIOV,
> > -  iovstack_r, _r);
> > +   rc = import_iovec(CHECK_IOVEC_ONLY, rvec, riovcnt, UIO_FASTIOV, _r,
> > + _r);
> > if (rc <= 0)
> > goto free_iovecs;
> >
> > -   rc = process_vm_rw_core(pid, , iov_r, riovcnt, flags, vm_write);
> > +   rc = process_vm_rw_core(pid, _l, iter_r.iov, iter_r.nr_segs,
> > +   flags, vm_write);
> 
> ... and ugh^2, since now you are not only setting a meaningless iov_iter,
> you are creating a new place that pokes directly into struct iov_iter
> guts.
> 
> Sure, moving rw_copy_check_uvector() over to lib/iov_iter.c makes sense.
> But I would rather split the access_ok()-related checks out of that thing
> and bury CHECK_IOVEC_ONLY.
> 
> Step 1: move the damn thing to lib/iov_iter.c (same as you do, but without
> making it static)
> 
> Step 2: split it in two:
> 
> ssize_t rw_copy_check_uvector(const struct iovec __user * uvector,
>   unsigned long nr_segs, unsigned long fast_segs,
>   struct iovec *fast_pointer,
>   struct iovec **ret_pointer)
> {
>   unsigned long seg;
...
>   ret = 0;
>   for (seg = 0; seg < nr_segs; seg++) {
>   void __user *buf = iov[seg].iov_base;
>   ssize_t len = (ssize_t)iov[seg].iov_len;
> 
>   /* see if we we're about to use an invalid len or if
>* it's about to overflow ssize_t */
>   if (len < 0)
>   return -EINVAL;
>   if (len > MAX_RW_COUNT - ret) {
>   len = MAX_RW_COUNT - ret;
>   iov[seg].iov_len = len;
>   }
>   ret += len;
>   }
>   return ret;
> }
> 
> /*
>  *  This is merely an early sanity check; we do _not_ rely upon
>  *  it when we get to the actual memory accesses.
>  */
> static bool check_iovecs(const struct iovec *iov, int nr_segs)
> {
> for (seg = 0; seg < nr_segs; seg++) {
> void __user *buf = iov[seg].iov_base;
> ssize_t len = (ssize_t)iov[seg].iov_len;
> 
> if (unlikely(!access_ok(buf, len)))
> return false;
> }
>   return true;
> }

You really don't want to be looping through the array twice.
In fact you don't really want to be doing all those tests at all.
This code makes a significant fraction of the not-insignificant
difference between the 'costs' of send() and sendmsg().

I think the 'length' check can be optimised to do something like:
for (...) {
ssize_t len = (ssize_t)iov[seg].iov_len;
ret += len;
len_hi += (unsigned long)len >> 20;
}
if (len_hi) {
/* Something potentially odd in the lengths.
 * Might just be a very long fragment.
 * Check the individial values. */
Add the exiting loop here.
}

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)



RE: [PATCH 04/11] iov_iter: explicitly check for CHECK_IOVEC_ONLY in rw_copy_check_uvector

2020-09-21 Thread David Laight
From: Christoph Hellwig
> Sent: 21 September 2020 15:34
> 
> Explicitly check for the magic value insted of implicitly relying on
> its numeric representation.   Also drop the rather pointless unlikely
> annotation.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  lib/iov_iter.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/lib/iov_iter.c b/lib/iov_iter.c
> index d7e72343c360eb..a64867501a7483 100644
> --- a/lib/iov_iter.c
> +++ b/lib/iov_iter.c
> @@ -1709,8 +1709,7 @@ static ssize_t rw_copy_check_uvector(int type,
>   ret = -EINVAL;
>   goto out;
>   }
> - if (type >= 0
> - && unlikely(!access_ok(buf, len))) {
> + if (type != CHECK_IOVEC_ONLY && !access_ok(buf, len)) {
>   ret = -EFAULT;
>   goto out;
>   }
> @@ -1824,7 +1823,7 @@ static ssize_t compat_rw_copy_check_uvector(int type,
>   }
>   if (len < 0)/* size_t not fitting in compat_ssize_t .. */
>   goto out;
> - if (type >= 0 &&
> + if (type != CHECK_IOVEC_ONLY &&
>   !access_ok(compat_ptr(buf), len)) {
>   ret = -EFAULT;
>   goto out;
> --
> 2.28.0

I've actually no idea:
1) Why there is an access_ok() check here.
   It will be repeated by the user copy functions.
2) Why it isn't done when called from mm/process_vm_access.c.
   Ok, the addresses refer to a different process, but they
   must still be valid user addresses.

Is 2 a legacy from when access_ok() actually checked that the
addresses were mapped into the process's address space?

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)



RE: let import_iovec deal with compat_iovecs as well

2020-09-21 Thread David Laight
> On Sat, Sep 19, 2020 at 02:24:10PM +0000, David Laight wrote:
> > I thought about that change while writing my import_iovec() => 
> > iovec_import()
> > patch - and thought that the io_uring code would (as usual) cause grief.
> >
> > Christoph - did you see those patches?

Link to cover email.

https://lkml.org/lkml/2020/9/15/661

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)



RE: [PATCH 1/9] kernel: add a PF_FORCE_COMPAT flag

2020-09-20 Thread David Laight
From: Arnd Bergmann
> Sent: 20 September 2020 21:49
> 
> On Sun, Sep 20, 2020 at 9:28 PM Andy Lutomirski  wrote:
> > On Sun, Sep 20, 2020 at 12:23 PM Matthew Wilcox  wrote:
> > >
> > > On Sun, Sep 20, 2020 at 08:10:31PM +0100, Al Viro wrote:
> > > > IMO it's much saner to mark those and refuse to touch them from 
> > > > io_uring...
> > >
> > > Simpler solution is to remove io_uring from the 32-bit syscall list.
> > > If you're a 32-bit process, you don't get to use io_uring.  Would
> > > any real users actually care about that?
> >
> > We could go one step farther and declare that we're done adding *any*
> > new compat syscalls :)
> 
> Would you also stop adding system calls to native 32-bit systems then?
> 
> On memory constrained systems (less than 2GB a.t.m.), there is still a
> strong demand for running 32-bit user space, but all of the recent Arm
> cores (after Cortex-A55) dropped the ability to run 32-bit kernels, so
> that compat mode may eventually become the primary way to run
> Linux on cheap embedded systems.
> 
> I don't think there is any chance we can realistically take away io_uring
> from the 32-bit ABI any more now.

Can't it just run requests from 32bit apps in a kernel thread that has
the 'in_compat_syscall' flag set?
Not that i recall seeing the code where it saves the 'compat' nature
of any requests.

It is already completely f*cked if you try to pass the command ring
to a child process - it uses the wrong 'mm'.
I suspect there are some really horrid security holes in that area.

David.

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)


RE: [PATCH 1/9] kernel: add a PF_FORCE_COMPAT flag

2020-09-19 Thread David Laight
From: Al Viro
> Sent: 18 September 2020 14:58
> 
> On Fri, Sep 18, 2020 at 03:44:06PM +0200, Christoph Hellwig wrote:
> > On Fri, Sep 18, 2020 at 02:40:12PM +0100, Al Viro wrote:
> > > > /* Vector 0x110 is LINUX_32BIT_SYSCALL_TRAP */
> > > > -   return pt_regs_trap_type(current_pt_regs()) == 0x110;
> > > > +   return pt_regs_trap_type(current_pt_regs()) == 0x110 ||
> > > > +   (current->flags & PF_FORCE_COMPAT);
> > >
> > > Can't say I like that approach ;-/  Reasoning about the behaviour is much
> > > harder when it's controlled like that - witness set_fs() shite...
> >
> > I don't particularly like it either.  But do you have a better idea
> > how to deal with io_uring vs compat tasks?
> 
>  git rm fs/io_uring.c would make a good starting point 
> Yes, I know it's not going to happen, but one can dream...

Maybe the io_uring code needs some changes to make it vaguely safe.
- No support for 32-bit compat mixed working (or at all?).
  Plausibly a special worker could do 32bit work.
- ring structure (I'm assuming mapped by mmap()) never mapped
  in more than one process (not cloned by fork()).
- No implicit handover of files to another process.
  Would need an munmap, handover, mmap sequence.

In any case the io_ring rather abuses the import_iovec() interface.

The canonical sequence is (types from memory):
struct iovec cache[8], *iov = cache;
struct iter iter;
...
rval = import_iovec(..., , 8, );
// Do read/write user using 'iter'
free(iov);

I don't think there is any strict requirement that iter.iov
is set to either 'cache' or 'iov' (it probably must point
into one of them.)
But the io_uring code will make that assumption because the
actual copies can be done much later and it doesn't save 'iter'.
It gets itself in a right mess because it doesn't separate
the 'address I need to free' from 'the iov[] for any transfers'.

io_uring is also the only code that relies on import_iovec()
returning the iter.count on success.
It would be much better to have:
iov = import_iovec(..., , ...);
free(iov);
and use ERR_PTR() et al for error detectoion.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)



RE: let import_iovec deal with compat_iovecs as well

2020-09-19 Thread David Laight
From: Christoph Hellwig
> Sent: 18 September 2020 13:45
> 
> this series changes import_iovec to transparently deal with comat iovec
> structures, and then cleanups up a lot of code dupliation.  But to get
> there it first has to fix the pre-existing bug that io_uring compat
> contexts don't trigger the in_compat_syscall() check.  This has so far
> been relatively harmless as very little code callable from io_uring used
> the check, and even that code that could be called usually wasn't.

I thought about that change while writing my import_iovec() => iovec_import()
patch - and thought that the io_uring code would (as usual) cause grief.

Christoph - did you see those patches?
David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)



RE: [PATCH] powerpc: Select HAVE_FUTEX_CMPXCHG

2020-09-19 Thread David Laight
From: Samuel Holland
> Sent: 19 September 2020 04:20
> 
> On powerpc, access_ok() succeeds for the NULL pointer. This breaks the
> dynamic check in futex_detect_cmpxchg(), which expects -EFAULT. As a
> result, robust futex operations are not functional on powerpc.

access_ok(NULL, sane_count) will succeed on all (maybe most) architectures.
All access_ok() does is check that kernel addresses aren't referenced.
(access_ok(kernel_adress, 0) is also likely to succeed.)

It is the access to user-address 0 that is expected to fault.
If this isn't faulting something else is wrong.

Historically (at least pre-elf, if not before) user programs
were linked to address zero - so the page was mapped.
(Linux may be too new to actually require it.)
Not sure what 'wine' requires for win-32 execuatbles.

ISTR there are also some 'crazy' ARM? cpu that read the interrupt
vectors from address 0 in user-space.

So assuming:

static void __init futex_detect_cmpxchg(void)
{
#ifndef CONFIG_HAVE_FUTEX_CMPXCHG
u32 curval;

/*
 * This will fail and we want it. Some arch implementations do
 * runtime detection of the futex_atomic_cmpxchg_inatomic()
 * functionality. We want to know that before we call in any
 * of the complex code paths. Also we want to prevent
 * registration of robust lists in that case. NULL is
 * guaranteed to fault and we get -EFAULT on functional
 * implementation, the non-functional ones will return
 * -ENOSYS.
 */
if (cmpxchg_futex_value_locked(, NULL, 0, 0) == -EFAULT)
futex_cmpxchg_enabled = 1;
#endif
}

will fail -EFAULT because user address 0 is invalid seems hopeful.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)



RE: remove the last set_fs() in common code, and remove it for x86 and powerpc v3

2020-09-10 Thread David Laight



> -Original Message-
> From: Segher Boessenkool 
> Sent: 10 September 2020 16:21
> To: David Laight 
> Cc: 'Christophe Leroy' ; 'Linus Torvalds' 
>  foundation.org>; linux-arch ; Kees Cook 
> ; the
> arch/x86 maintainers ; Nick Desaulniers 
> ; Linux Kernel
> Mailing List ; Alexey Dobriyan 
> ; Luis Chamberlain
> ; Al Viro ; linux-fsdevel 
> ;
> linuxppc-dev ; Christoph Hellwig 
> Subject: Re: remove the last set_fs() in common code, and remove it for x86 
> and powerpc v3
> 
> On Thu, Sep 10, 2020 at 12:26:53PM +, David Laight wrote:
> > Actually this is pretty sound:
> > __label__ label;
> > register int eax asm ("eax");
> > // Ensure eax can't be reloaded from anywhere
> > // In particular it can't be reloaded after the asm goto line
> > asm volatile ("" : "=r" (eax));
> 
> This asm is fine.  It says it writes the "eax" variable, which lives in
> the eax register *in that asm* (so *not* guaranteed after it!).
> 
> > // Provided gcc doesn't save eax here...
> > asm volatile goto ("x" ::: "eax" : label);
> 
> So this is incorrect.

>From the other email:

> It is neither input nor output operand here!  Only *then* is a local
> register asm guaranteed to be in the given reg: as input or output to an
> inline asm.

Ok, so adding '"r" (eax)' to the input section helps a bit.

> > // ... and reload the saved value here.
> > // The input value here will be that modified by the 'asm goto'.
> > // Since this modifies eax it can't be moved before the 'asm goto'.
> > asm volatile ("" : "+r" (eax));
> > // So here eax must contain the value set by the "x" instructions.
> 
> No, the register eax will contain the value of the eax variable.  In the
> asm; it might well be there before or after the asm as well, but none of
> that is guaranteed.

Perhaps not 'guaranteed', but very unlikely to be wrong.
It doesn't give gcc much scope for not generating the desired code.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)



RE: remove the last set_fs() in common code, and remove it for x86 and powerpc v3

2020-09-10 Thread David Laight
From: David Laight
> Sent: 10 September 2020 10:26
...
> > > I had an 'interesting' idea.
> > >
> > > Can you use a local asm register variable as an input and output to
> > > an 'asm volatile goto' statement?
> > >
> > > Well you can - but is it guaranteed to work :-)
> > >
> >
> > With gcc at least it should work according to
> > https://gcc.gnu.org/onlinedocs/gcc/Local-Register-Variables.html
> >
> > They even explicitely tell: "The only supported use for this feature is
> > to specify registers for input and output operands when calling Extended
> > asm "
> 
> A quick test isn't good
> 
> int bar(char *z)
> {
> __label__ label;
> register int eax asm ("eax") = 6;
> asm volatile goto (" mov $1, %%eax" ::: "eax" : label);
> label:
> return eax;
> }
> 
> 0040 :
>   40:   b8 01 00 00 00  mov$0x1,%eax
>   45:   b8 06 00 00 00  mov$0x6,%eax
>   4a:   c3  retq
> 
> although adding:
> asm volatile ("" : "+r" (eax));
> either side of the 'asm volatile goto' does fix it.

Actually this is pretty sound:
__label__ label;
register int eax asm ("eax");
// Ensure eax can't be reloaded from anywhere
// In particular it can't be reloaded after the asm goto line
asm volatile ("" : "=r" (eax));
// Provided gcc doesn't save eax here...
asm volatile goto ("x" ::: "eax" : label);
// ... and reload the saved value here.
// The input value here will be that modified by the 'asm goto'.
// Since this modifies eax it can't be moved before the 'asm goto'.
asm volatile ("" : "+r" (eax));
// So here eax must contain the value set by the "x" instructions.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)


RE: remove the last set_fs() in common code, and remove it for x86 and powerpc v3

2020-09-10 Thread David Laight
From: Christophe Leroy
> Sent: 10 September 2020 09:14
> 
> Le 10/09/2020 à 10:04, David Laight a écrit :
> > From: Linus Torvalds
> >> Sent: 09 September 2020 22:34
> >> On Wed, Sep 9, 2020 at 11:42 AM Segher Boessenkool
> >>  wrote:
> >>>
> >>> It will not work like this in GCC, no.  The LLVM people know about that.
> >>> I do not know why they insist on pushing this, being incompatible and
> >>> everything.
> >>
> >> Umm. Since they'd be the ones supporting this, *gcc* would be the
> >> incompatible one, not clang.
> >
> > I had an 'interesting' idea.
> >
> > Can you use a local asm register variable as an input and output to
> > an 'asm volatile goto' statement?
> >
> > Well you can - but is it guaranteed to work :-)
> >
> 
> With gcc at least it should work according to
> https://gcc.gnu.org/onlinedocs/gcc/Local-Register-Variables.html
> 
> They even explicitely tell: "The only supported use for this feature is
> to specify registers for input and output operands when calling Extended
> asm "

A quick test isn't good

int bar(char *z)
{
__label__ label;
register int eax asm ("eax") = 6;
asm volatile goto (" mov $1, %%eax" ::: "eax" : label);

label:
return eax;
}

0040 :
  40:   b8 01 00 00 00  mov$0x1,%eax
  45:   b8 06 00 00 00  mov$0x6,%eax
  4a:   c3  retq

although adding:
asm volatile ("" : "+r" (eax));
either side of the 'asm volatile goto' does fix it.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)


RE: remove the last set_fs() in common code, and remove it for x86 and powerpc v3

2020-09-10 Thread David Laight
From: Linus Torvalds
> Sent: 09 September 2020 22:34
> On Wed, Sep 9, 2020 at 11:42 AM Segher Boessenkool
>  wrote:
> >
> > It will not work like this in GCC, no.  The LLVM people know about that.
> > I do not know why they insist on pushing this, being incompatible and
> > everything.
> 
> Umm. Since they'd be the ones supporting this, *gcc* would be the
> incompatible one, not clang.

I had an 'interesting' idea.

Can you use a local asm register variable as an input and output to
an 'asm volatile goto' statement?

Well you can - but is it guaranteed to work :-)

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)


RE: remove the last set_fs() in common code, and remove it for x86 and powerpc v3

2020-09-05 Thread David Laight
From: Christophe Leroy
> Sent: 05 September 2020 08:16
> 
> Le 04/09/2020 à 23:01, David Laight a écrit :
> > From: Alexey Dobriyan
> >> Sent: 04 September 2020 18:58
...
> > What is this strange %fs register you are talking about.
> > Figure 2-4 only has CS, DS, SS and ES.
> >
> 
> Intel added registers FS and GS in the i386

I know, I've got both the 'iAPX 286 Programmer's Reference Manual'
and the '80386 Programmer's Reference Manual' on my shelf.

I don't have the 8088 book though - which I used in 1982.

The old books are a lot easier to read if, for instance,
you are trying to work out how to back and forth to real mode
to do bios calls.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)


RE: remove the last set_fs() in common code, and remove it for x86 and powerpc v3

2020-09-04 Thread David Laight
From: Alexey Dobriyan
> Sent: 04 September 2020 18:58
> 
> On Fri, Sep 04, 2020 at 08:00:24AM +0200, Ingo Molnar wrote:
> > * Christoph Hellwig  wrote:
> > > this series removes the last set_fs() used to force a kernel address
> > > space for the uaccess code in the kernel read/write/splice code, and then
> > > stops implementing the address space overrides entirely for x86 and
> > > powerpc.
> >
> > Cool! For the x86 bits:
> >
> >   Acked-by: Ingo Molnar 
> 
> set_fs() is older than some kernel hackers!
> 
>   $ cd linux-0.11/
>   $ find . -type f -name '*.h' | xargs grep -e set_fs -w -n -A3
>   ./include/asm/segment.h:61:extern inline void set_fs(unsigned long val)
>   ./include/asm/segment.h-62-{
>   ./include/asm/segment.h-63- __asm__("mov %0,%%fs"::"a" ((unsigned 
> short) val));
>   ./include/asm/segment.h-64-}

What is this strange %fs register you are talking about.
Figure 2-4 only has CS, DS, SS and ES.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)


RE: [PATCH 12/14] x86: remove address space overrides using set_fs()

2020-09-04 Thread David Laight
From: Linus Torvalds
> Sent: 04 September 2020 00:26
> 
> On Thu, Sep 3, 2020 at 2:30 PM David Laight  wrote:
> >
> > A non-canonical (is that the right term) address between the highest
> > valid user address and the lowest valid kernel address (7ffe to fffe?)
> > will fault anyway.
> 
> Yes.
> 
> But we actually warn against that fault, because it's been a good way
> to catch places that didn't use the proper "access_ok()" pattern.
> 
> See ex_handler_uaccess() and the
> 
> WARN_ONCE(trapnr == X86_TRAP_GP, "General protection fault in
> user access. Non-canonical address?");
> 
> warning. It's been good for randomized testing - a missing range check
> on a user address will often hit this.
> 
> Of course, you should never see it in real life (and hopefully not in
> testing either any more). But belt-and-suspenders..

That could still be effective, just pick an address limit that is
appropriate for the one access_ok() is using.

Even if access_ok() uses 1<<63 there are plenty of addresses above it that 
fault.
But the upper limit for 5-level page tables could be used all the time.

One option is to test '(address | length) < (3<<62)' in access_ok().
That is also moderately suitable for masking invalid addresses to 0.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)


RE: [PATCH 12/14] x86: remove address space overrides using set_fs()

2020-09-03 Thread David Laight
From: Christoph Hellwig
> Sent: 03 September 2020 15:23
> 
> Stop providing the possibility to override the address space using
> set_fs() now that there is no need for that any more.  To properly
> handle the TASK_SIZE_MAX checking for 4 vs 5-level page tables on
> x86 a new alternative is introduced, which just like the one in
> entry_64.S has to use the hardcoded virtual address bits to escape
> the fact that TASK_SIZE_MAX isn't actually a constant when 5-level
> page tables are enabled.

Why does it matter whether 4 or 5 level page tables are in use?
Surely all access_ok() needs to do is ensure that a valid kernel
address isn't supplied.
A non-canonical (is that the right term) address between the highest
valid user address and the lowest valid kernel address (7ffe to fffe?)
will fault anyway.
So any limit between the valid user and kernel addresses should
work?
So a limit of 1<<63 would seem appropriate.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)



RE: [PATCH 10/10] powerpc: remove address space overrides using set_fs()

2020-09-02 Thread David Laight
From: Christophe Leroy
> Sent: 02 September 2020 15:13
> 
> 
> Le 02/09/2020 à 15:51, David Laight a écrit :
> > From: Christophe Leroy
> >> Sent: 02 September 2020 14:25
> >> Le 02/09/2020 à 15:13, David Laight a écrit :
> >>> From: Christoph Hellwig
> >>>> Sent: 02 September 2020 13:37
> >>>>
> >>>> On Wed, Sep 02, 2020 at 08:15:12AM +0200, Christophe Leroy wrote:
> >>>>>> -  return 0;
> >>>>>> -  return (size == 0 || size - 1 <= seg.seg - addr);
> >>>>>> +  if (addr >= TASK_SIZE_MAX)
> >>>>>> +  return false;
> >>>>>> +  if (size == 0)
> >>>>>> +  return false;
> >>>>>
> >>>>> __access_ok() was returning true when size == 0 up to now. Any reason to
> >>>>> return false now ?
> >>>>
> >>>> No, this is accidental and broken.  Can you re-run your benchmark with
> >>>> this fixed?
> >>>
> >>> Is TASK_SIZE_MASK defined such that you can do:
> >>>
> >>>   return (addr | size) < TASK_SIZE_MAX) || !size;
> >>
> >> TASK_SIZE_MAX will usually be 0xc000
> >>
> >> With:
> >> addr = 0x8000;
> >> size = 0x8000;
> >>
> >> I expect it to fail 
> >>
> >> With the formula you propose it will succeed, won't it ?
> >
> > Hmmm... Was i getting confused about some comments for 64bit
> > about there being such a big hole between valid user and kernel
> > addresses that it was enough to check that 'size < TASK_SIZE_MAX'.
> >
> > That would be true for 64bit x86 (and probably ppc (& arm??))
> > if TASK_SIZE_MAX were 0x4 << 60.
> > IIUC the highest user address is (much) less than 0x0 << 60
> > and the lowest kernel address (much) greater than 0xf << 60
> > on all these 64bit platforms.
> >
> > Actually if doing access_ok() inside get_user() you don't
> > need to check the size at all.
> 
> You mean on 64 bit or on any platform ?

64bit and 32bit

> What about a word write to 0xbffe, won't it overwrite 0xc000 ?
> 
> > You don't even need to in copy_to/from_user() provided
> > it always does a forwards copy.
> 
> Do you mean due to the gap ?
> Is it garantied to be a gap ? Even on a 32 bits having TASK_SIZE set to
> 0xc000 and PAGE_OFFSET set to the same ?

I read somewhere (I won't find it again) that the last 4k page
(below 0xc000) must not be allocated on i386 because some
cpu (both intel and amd) do 'horrid things' if they try to
(IIRC) do instruction prefetches across the boundary.
So the accesses to 0xbffe will fault and the one to 0xc000
won't happen (in any useful way at least).

I'd suspect that not allocating the 3G-4k page would be a safe
bet on all architectures - even 68k.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)


RE: [PATCH 10/10] powerpc: remove address space overrides using set_fs()

2020-09-02 Thread David Laight
From: Christophe Leroy
> Sent: 02 September 2020 14:25
> Le 02/09/2020 à 15:13, David Laight a écrit :
> > From: Christoph Hellwig
> >> Sent: 02 September 2020 13:37
> >>
> >> On Wed, Sep 02, 2020 at 08:15:12AM +0200, Christophe Leroy wrote:
> >>>> -return 0;
> >>>> -return (size == 0 || size - 1 <= seg.seg - addr);
> >>>> +if (addr >= TASK_SIZE_MAX)
> >>>> +return false;
> >>>> +if (size == 0)
> >>>> +return false;
> >>>
> >>> __access_ok() was returning true when size == 0 up to now. Any reason to
> >>> return false now ?
> >>
> >> No, this is accidental and broken.  Can you re-run your benchmark with
> >> this fixed?
> >
> > Is TASK_SIZE_MASK defined such that you can do:
> >
> > return (addr | size) < TASK_SIZE_MAX) || !size;
> 
> TASK_SIZE_MAX will usually be 0xc000
> 
> With:
> addr = 0x8000;
> size = 0x8000;
> 
> I expect it to fail 
> 
> With the formula you propose it will succeed, won't it ?

Hmmm... Was i getting confused about some comments for 64bit
about there being such a big hole between valid user and kernel
addresses that it was enough to check that 'size < TASK_SIZE_MAX'.

That would be true for 64bit x86 (and probably ppc (& arm??))
if TASK_SIZE_MAX were 0x4 << 60.
IIUC the highest user address is (much) less than 0x0 << 60
and the lowest kernel address (much) greater than 0xf << 60
on all these 64bit platforms.

Actually if doing access_ok() inside get_user() you don't
need to check the size at all.
You don't even need to in copy_to/from_user() provided
it always does a forwards copy.
(Rather that copying the last word first for misaligned lengths.)

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)


RE: [PATCH 10/10] powerpc: remove address space overrides using set_fs()

2020-09-02 Thread David Laight
From: Christoph Hellwig
> Sent: 02 September 2020 13:37
> 
> On Wed, Sep 02, 2020 at 08:15:12AM +0200, Christophe Leroy wrote:
> >> -  return 0;
> >> -  return (size == 0 || size - 1 <= seg.seg - addr);
> >> +  if (addr >= TASK_SIZE_MAX)
> >> +  return false;
> >> +  if (size == 0)
> >> +  return false;
> >
> > __access_ok() was returning true when size == 0 up to now. Any reason to
> > return false now ?
> 
> No, this is accidental and broken.  Can you re-run your benchmark with
> this fixed?

Is TASK_SIZE_MASK defined such that you can do:

return (addr | size) < TASK_SIZE_MAX) || !size;

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)



RE: [PATCH 01/10] fs: don't allow kernel reads and writes without iter ops

2020-08-27 Thread David Laight
From: Christoph Hellwig
> Sent: 27 August 2020 16:00
> 
> Don't allow calling ->read or ->write with set_fs as a preparation for
> killing off set_fs.  All the instances that we use kernel_read/write on
> are using the iter ops already.
> 
> If a file has both the regular ->read/->write methods and the iter
> variants those could have different semantics for messed up enough
> drivers.  Also fails the kernel access to them in that case.

Is there a real justification for that?
For system calls supplying both methods makes sense to avoid
the extra code paths for a simple read/write.

Any one stupid enough to make them behave differently gets
what they deserve.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)



RE: Re:Re: [PATCH] powerpc: Fix a bug in __div64_32 if divisor is zero

2020-08-24 Thread David Laight
From: Guohua Zhong
> Sent: 24 August 2020 14:26
> 
> >> >In generic version in lib/math/div64.c, there is no checking of 'base'
> >> >either.
> >> >Do we really want to add this check in the powerpc version only ?
> >>
> >> >The only user of __div64_32() is do_div() in
> >> >include/asm-generic/div64.h. Wouldn't it be better to do the check there ?
> >>
> >> >Christophe
> >>
> >> Yet, I have noticed that there is no checking of 'base' in these functions.
> >> But I am not sure how to check is better.As we know that the result is
> >> undefined when divisor is zero. It maybe good to print error and dump 
> >> stack.

I thought that the onus was put on the caller to avoid divide by zero.

On x86 divide by zero causes an exception which (I'm pretty sure)
leads to a oops/panic.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)


RE: [PATCH 09/11] x86: remove address space overrides using set_fs()

2020-08-17 Thread David Laight
From: Christoph Hellwig
> Sent: 17 August 2020 08:32
>
> Stop providing the possibility to override the address space using
> set_fs() now that there is no need for that any more.  To properly
> handle the TASK_SIZE_MAX checking for 4 vs 5-level page tables on
> x86 a new alternative is introduced, which just like the one in
> entry_64.S has to use the hardcoded virtual address bits to escape
> the fact that TASK_SIZE_MAX isn't actually a constant when 5-level
> page tables are enabled.

> @@ -93,7 +69,7 @@ static inline bool pagefault_disabled(void);
>  #define access_ok(addr, size)\
>  ({   \
>   WARN_ON_IN_IRQ();   \
> - likely(!__range_not_ok(addr, size, user_addr_max()));   \
> + likely(!__range_not_ok(addr, size, TASK_SIZE_MAX)); \
>  })

Can't that always compare against a constant even when 5-levl
page tables are enabled on x86-64?

On x86-64 it can (probably) reduce to (addr | (addr + size)) < 0.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)



RE: [PATCH v2] selftests/powerpc: Fix pkey syscall redefinitions

2020-08-03 Thread David Laight
> > +#undef SYS_pkey_mprotect
> >  #define SYS_pkey_mprotect  386
> 
> We shouldn't undef them.
> 
> They should obviously never change, but if the system headers already
> have a definition then we should use that, so I think it should be:
> 
> #ifndef SYS_pkey_mprotect
> #define SYS_pkey_mprotect 386
> #endif

If the definitions are identical the compiler won't complain.
So you probably actually want a matching definition so that,
provided at least one compile picks up both headers, you know
that the definitions actually match.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)



RE: [RFC PATCH 00/35] Move all PCIBIOS* definitions into arch/x86

2020-07-16 Thread David Laight
From: Bjorn Helgaas
> Sent: 15 July 2020 23:02
> 
> On Wed, Jul 15, 2020 at 02:24:21PM +0000, David Laight wrote:
> > From: Arnd Bergmann
> > > Sent: 15 July 2020 07:47
> > > On Wed, Jul 15, 2020 at 1:46 AM Bjorn Helgaas  wrote:
> > >
> > >  So something like:
> > > >
> > > >   void pci_read_config_word(struct pci_dev *dev, int where, u16 *val)
> > > >
> > > > and where we used to return anything non-zero, we just set *val = ~0
> > > > instead?  I think we do that already in most, maybe all, cases.
> > >
> > > Right, this is what I had in mind. If we start by removing the handling
> > > of the return code in all files that clearly don't need it, looking at
> > > whatever remains will give a much better idea of what a good interface
> > > should be.
> >
> > It would be best to get rid of that nasty 'u16 *' parameter.
> 
> Do you mean nasty because it's basically a return value, but not
> returned as the *function's* return value?  I agree that if we were
> starting from scratch it would nicer to have:
> 
>   u16 pci_read_config_word(struct pci_dev *dev, int where)
> 
> but I don't think it's worth changing the thousands of callers just
> for that.

It'll shrink the kernel text size somewhat.
It could also be 'fixed' with a static inline.

Actually you don't even want the result to be u16.
Even though the domain of the value is 0..65535 keeping
the type as int (or unsigned int) will save the compiler
having to generate lots of masking instructions.

Code performance here will be overwhelmed by the time taken
for the config space access.
But more generally all local variables should really be
the size of cpu registers.

On x86-64 you need to use 'unsigned int' for anything used
as array subscripts to avoid the 'sign extend' instructions.
In some code paths it may matter...

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)



RE: [RFC PATCH 00/35] Move all PCIBIOS* definitions into arch/x86

2020-07-16 Thread David Laight
From: Benjamin Herrenschmidt
> Sent: 15 July 2020 23:49
> On Wed, 2020-07-15 at 17:12 -0500, Bjorn Helgaas wrote:
> > > I've 'played' with PCIe error handling - without much success.
> > > What might be useful is for a driver that has just read ~0u to
> > > be able to ask 'has there been an error signalled for this device?'.
> >
> > In many cases a driver will know that ~0 is not a valid value for the
> > register it's reading.  But if ~0 *could* be valid, an interface like
> > you suggest could be useful.  I don't think we have anything like that
> > today, but maybe we could.  It would certainly be nice if the PCI core
> > noticed, logged, and cleared errors.  We have some of that for AER,
> > but that's an optional feature, and support for the error bits in the
> > garden-variety PCI_STATUS register is pretty haphazard.  As you note
> > below, this sort of SERR/PERR reporting is frequently hard-wired in
> > ways that takes it out of our purview.
> 
> We do have pci_channel_state (via pci_channel_offline()) which covers
> the cases where the underlying error handling (such as EEH or unplug)
> results in the device being offlined though this tend to be
> asynchronous so it might take a few ~0's before you get it.

On one of my systems I don't think the error TLP from the target
made its way past the first bridge - I could see the error in it's
status registers.
But I couldn't find any of the AER status registers in the root bridge.
So I think you'd need a software poll of the bridge registers to
find out (and clear) the error.

The NMI on the dell system (which is supposed to meet some special
NEBS? server requirements) is just stupid.
Too late to be synchronous and impossible for the OS to handle.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)


RE: [RFC PATCH 00/35] Move all PCIBIOS* definitions into arch/x86

2020-07-15 Thread David Laight
From: Oliver O'Halloran
> Sent: 15 July 2020 05:19
> 
> On Wed, Jul 15, 2020 at 8:03 AM Arnd Bergmann  wrote:
...
> > - config space accesses are very rare compared to memory
> >   space access and on the hardware side the error handling
> >   would be similar, but readl/writel don't return errors, they just
> >   access wrong registers or return 0x.
> >   arch/powerpc/kernel/eeh.c has a ton extra code written to
> >   deal with it, but no other architectures do.
> 
> TBH the EEH MMIO hooks were probably a mistake to begin with. Errors
> detected via MMIO are almost always asynchronous to the error itself
> so you usually just wind up with a misleading stack trace rather than
> any kind of useful synchronous error reporting. It seems like most
> drivers don't bother checking for 0xFFs either and rely on the
> asynchronous reporting via .error_detected() instead, so I have to
> wonder what the point is. I've been thinking of removing the MMIO
> hooks and using a background poller to check for errors on each PHB
> periodically (assuming we don't have an EEH interrupt) instead. That
> would remove the requirement for eeh_dev_check_failure() to be
> interrupt safe too, so it might even let us fix all the godawful races
> in EEH.

I've 'played' with PCIe error handling - without much success.
What might be useful is for a driver that has just read ~0u to
be able to ask 'has there been an error signalled for this device?'.

I got an error generated by doing an MMIO access that was inside
the address range forwarded to the slave, but outside any of its BARs.
(Two BARs of different sizes leaves a nice gap.)
This got reported up to the bridge nearest the slave (which supported
error handling), but not to the root bridge (which I don't think does).
ISTR a message about EEH being handled by the hardware (the machine
is up but dmesg is full of messages from a bouncing USB mouse).

With such partial error reporting useful info can still be extracted.

Of course, what actually happens on a PCIe error is that the signal
gets routed to some 'board support logic' and then passed back into
the kernel as an NMI - which then crashes the kernel!
This even happens when the PCIe link goes down after we've done a
soft-remove of the device itself!
Rather makes updating the board's FPGA without a reboot tricky.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)


RE: [RFC PATCH 00/35] Move all PCIBIOS* definitions into arch/x86

2020-07-15 Thread David Laight
From: Arnd Bergmann
> Sent: 15 July 2020 07:47
> On Wed, Jul 15, 2020 at 1:46 AM Bjorn Helgaas  wrote:
> 
>  So something like:
> >
> >   void pci_read_config_word(struct pci_dev *dev, int where, u16 *val)
> >
> > and where we used to return anything non-zero, we just set *val = ~0
> > instead?  I think we do that already in most, maybe all, cases.
> 
> Right, this is what I had in mind. If we start by removing the handling
> of the return code in all files that clearly don't need it, looking at
> whatever remains will give a much better idea of what a good interface
> should be.

It would be best to get rid of that nasty 'u16 *' parameter.
Make the return int and return the read value or -1 on error.
(Or maybe 0x on error??)

For a 32bit read (there must be one for the BARs) returning
a 64bit signed integer would work even for 32bit systems.

If code cares about the error, and it can be detected then
it can check. Otherwise the it all 'just works'.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)


RE: [PATCH v2 1/3] module: Rename module_alloc() to text_alloc() and move to kernel proper

2020-07-15 Thread David Laight
From: Jarkko Sakkinen
> Sent: 14 July 2020 17:31
..
> There is one arch (nios2), which uses a regular kzalloc(). This means
> that both text_alloc() and text_memfree() need to be weaks symbols and
> nios2 needs to have overriding text.c to do its thing.

That could be handled by an arch-specific inline wrapper in a .h file.

Although I actually suspect the nios2 code is just being lazy.
The processor is a very simple in-order 32bit soft cpu for Intel (Altera)
FPGA.

We have four nios2 cpu on one of our fpga images - all running small
(almost single function) code loops.
I can't imagine actually trying to run Linux on one to do anything
useful - clock speed is unlikely to be much above 100MHz.
We can't meet timing at 125MHZ (I've tried quite hard) so have
to run at 62.5MHz because of the PCIe clock.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)



RE: [PATCH] x86: Fix early boot crash on gcc-10, next try

2020-04-27 Thread David Laight
From: Borislav Petkov
> Sent: 25 April 2020 18:53
...
> IOW, something like this (ontop) which takes care of the xen case too.
> If it needs to be used by all arches, then I'll split the patch:
.
> - asm ("");
> + prevent_tail_call_optimization();
>  }

One obvious implementation would be a real function call.
Which the compiler would convert into a tail call.
Just to confuse matters :-)

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)


RE: [musl] Powerpc Linux 'scv' system call ABI proposal take 2

2020-04-21 Thread David Laight
From: Nicholas Piggin
> Sent: 20 April 2020 02:10
...
> >> Yes, but does it really matter to optimize this specific usage case
> >> for size? glibc, for instance, tries to leverage the syscall mechanism
> >> by adding some complex pre-processor asm directives.  It optimizes
> >> the syscall code size in most cases.  For instance, kill in static case
> >> generates on x86_64:
> >>
> >>  <__kill>:
> >>0:   b8 3e 00 00 00  mov$0x3e,%eax
> >>5:   0f 05   syscall
> >>7:   48 3d 01 f0 ff ff   cmp$0xf001,%rax
> >>d:   0f 83 00 00 00 00   jae13 <__kill+0x13>

Hmmm... that cmp + jae is unnecessary here.
It is also a 32bit offset jump.
I also suspect it gets predicted very badly.

> >>   13:   c3  retq
> >>
> >> While on musl:
> >>
> >>  :
> >>0:  48 83 ec 08 sub$0x8,%rsp
> >>4:  48 63 ffmovslq %edi,%rdi
> >>7:  48 63 f6movslq %esi,%rsi
> >>a:  b8 3e 00 00 00  mov$0x3e,%eax
> >>f:  0f 05   syscall
> >>   11:  48 89 c7mov%rax,%rdi
> >>   14:  e8 00 00 00 00  callq  19 
> >>   19:  5a  pop%rdx
> >>   1a:  c3  retq
> >
> > Wow that's some extraordinarily bad codegen going on by gcc... The
> > sign-extension is semantically needed and I don't see a good way
> > around it (glibc's asm is kinda a hack taking advantage of kernel not
> > looking at high bits, I think), but the gratuitous stack adjustment
> > and refusal to generate a tail call isn't. I'll see if we can track
> > down what's going on and get it fixed.

A suitable cast might get rid of the sign extension.
Possibly just (unsigned int).

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)


RE: [musl] Powerpc Linux 'scv' system call ABI proposal take 2

2020-04-21 Thread David Laight
From: Adhemerval Zanella
> Sent: 21 April 2020 16:01
> 
> On 21/04/2020 11:39, Rich Felker wrote:
> > On Tue, Apr 21, 2020 at 12:28:25PM +0000, David Laight wrote:
> >> From: Nicholas Piggin
> >>> Sent: 20 April 2020 02:10
> >> ...
> >>>>> Yes, but does it really matter to optimize this specific usage case
> >>>>> for size? glibc, for instance, tries to leverage the syscall mechanism
> >>>>> by adding some complex pre-processor asm directives.  It optimizes
> >>>>> the syscall code size in most cases.  For instance, kill in static case
> >>>>> generates on x86_64:
> >>>>>
> >>>>>  <__kill>:
> >>>>>0:   b8 3e 00 00 00  mov$0x3e,%eax
> >>>>>5:   0f 05   syscall
> >>>>>7:   48 3d 01 f0 ff ff   cmp$0xf001,%rax
> >>>>>d:   0f 83 00 00 00 00   jae13 <__kill+0x13>
> >>
> >> Hmmm... that cmp + jae is unnecessary here.
> >
> > It's not.. Rather the objdump was just mistakenly done without -r so
> > it looks like a nop jump rather than a conditional tail call to the
> > function that sets errno.
> >
> 
> Indeed, the output with -r is:
> 
>  <__kill>:
>0:   b8 3e 00 00 00  mov$0x3e,%eax
>5:   0f 05   syscall
>7:   48 3d 01 f0 ff ff   cmp$0xf001,%rax
>d:   0f 83 00 00 00 00   jae13 <__kill+0x13>
> f: R_X86_64_PLT32   __syscall_error-0x4
>   13:   c3  retq

Yes, I probably should have remembered it looked like that :-)
...
> >> I also suspect it gets predicted very badly.
> >
> > I doubt that. This is a very standard idiom and the size of the offset
> > (which is necessarily 32-bit because it has a relocation on it) is
> > orthogonal to the condition on the jump.

Yes, it only gets mispredicted as badly as any other conditional jump.
I believe modern intel x86 will randomly predict it taken (regardless
of the direction) and then hit a TLB fault on text.unlikely :-)

> > FWIW a syscall like kill takes global kernel-side locks to be able to
> > address a target process by pid, and the rate of meaningful calls you
> > can make to it is very low (since it's bounded by time for target
> > process to act on the signal). Trying to optimize it for speed is
> > pointless, and even size isn't important locally (although in
> > aggregate, lots of wasted small size can add up to more pages = more
> > TLB entries = ...).
> 
> I agree and I would prefer to focus on code simplicity to have a
> platform neutral way to handle error and let the compiler optimize
> it than messy with assembly macros to squeeze this kind of
> micro-optimizations.

syscall entry does get micro-optimised.
Real speed-ups can probably be found by optimising other places.
I've a patch i need to resumbit that should improve the reading
of iov[] from user space.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)


RE: hardcoded SIGSEGV in __die() ?

2020-03-25 Thread David Laight
From: Joakim Tjernlund
> Sent: 23 March 2020 15:45
...
> > > I tried to follow that chain thinking it would end up sending a signal to 
> > > user space but I cannot
> see
> > > that happens. Seems to be related to debugging.
> > >
> > > In short, I cannot see any signal being delivered to user space. If so 
> > > that would explain why
> > > our user space process never dies.
> > > Is there a signal hidden in machine_check handler for SIGBUS I cannot see?
> > >
> >
> > Isn't it done in do_exit(), called from oops_end() ?
> 
> hmm, so it seems. The odd thing though is that do_exit takes an exit code, 
> not signal number.
> Also, feels a bit odd to force an exit(that we haven't seen happening) rather 
> than just a signal.

Isn't there something 'magic' that converts EFAULT into SIGSEGV?

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)


RE: [RESEND PATCH v2 9/9] ath5k: Constify ioreadX() iomem argument (as in generic implementation)

2020-02-24 Thread David Laight
From: Geert Uytterhoeven
> Sent: 24 February 2020 12:54
> To: Krzysztof Kozlowski 
...
> > > > diff --git a/drivers/net/wireless/ath/ath5k/ahb.c 
> > > > b/drivers/net/wireless/ath/ath5k/ahb.c
> > > > index 2c9cec8b53d9..8bd01df369fb 100644
> > > > --- a/drivers/net/wireless/ath/ath5k/ahb.c
> > > > +++ b/drivers/net/wireless/ath/ath5k/ahb.c
> > > > @@ -138,18 +138,18 @@ static int ath_ahb_probe(struct platform_device 
> > > > *pdev)
> > > >
> > > > if (bcfg->devid >= AR5K_SREV_AR2315_R6) {
> > > > /* Enable WMAC AHB arbitration */
> > > > -   reg = ioread32((void __iomem *) AR5K_AR2315_AHB_ARB_CTL);
> > > > +   reg = ioread32((const void __iomem *) 
> > > > AR5K_AR2315_AHB_ARB_CTL);
> > >
> > > While I understand why the parameter of ioread32 should be const, I
> > > don't see a reason for these casts on the users' side. What does it
> > > bring except longer code to read?
> >
> > Because the argument is an int:
> >
> > drivers/net/wireless/ath/ath5k/ahb.c: In function ‘ath_ahb_probe’:
> > drivers/net/wireless/ath/ath5k/ahb.c:141:18: warning: passing argument 1 of 
> > ‘ioread32’ makes pointer
> from integer without a cast [-Wint-conversion]
> >reg = ioread32(AR5K_AR2315_AHB_ARB_CTL);
> 
> That's an argument for keeping the cast to "void __iomem *", not for
> adding the "const", right?

Or more likely change the definitions to use a struct for the layout.
That also stops the constants being used in the wrong place.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)


RE: [PATCH v6 08/10] mm/memory_hotplug: Don't check for "all holes" in shrink_zone_span()

2020-02-05 Thread David Laight
From: Wei Yang
> Sent: 05 February 2020 09:59
...
> If it is me, I would like to take out these two similar logic out.
> 
> For example:
> 
>   if () {
>   } else if () {
>   } else {
>   goto out;
>   }

I'm pretty sure the kernel layout rules disallow 'else if'.
It is also pretty horrid unless the conditionals are all related
(so it is almost a switch statement).

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)



RE: z constraint in powerpc inline assembly ?

2020-01-16 Thread David Laight
> You mean the mpc8xx , but I'm also using the mpc832x which has a e300c2
> core and is capable of executing 2 insns in parallel if not in the same
> Unit.

That should let you do a memory read and an add.
(I can't remember if the ppc has 'add from memory' but that is
likely to use both units anyway.)
An infinitely unrolled loop will then be 4 clocks/byte (for 32bit).
If you get to 3 for a real loop you are doing ok.

Remember, unroll too much and you displace other code from
the i-cache. Also the i-cache loads themselves kill you.
(A hot-cache benchmark won't see this...)

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)


  1   2   3   4   5   >