Re: [PATCH v2] scsi: ibmvfc: interface updates for future FPIN and MQ support

2020-09-02 Thread Tyrel Datwyler
On 9/2/20 7:11 PM, Martin K. Petersen wrote:
> 
> Tyrel,
> 
>>  Fixup compiler errors from neglected commit --amend
> 
> Bunch of formatting-related checkpatch warnings. Please fix.
> 
> Thanks!
> 

So, I stuck to the existing style already in that header. If I'm going to fixup
to make checkpatch happy I imagine it makes sense to send a prerequisite patch
that fixes up the rest of the header.

-Tyrel


Re: [PATCH v1 02/10] powerpc/kernel/iommu: Align size for IOMMU_PAGE_SIZE on iommu_*_coherent()

2020-09-02 Thread Alexey Kardashevskiy




On 02/09/2020 08:34, Leonardo Bras wrote:

On Mon, 2020-08-31 at 10:47 +1000, Alexey Kardashevskiy wrote:


Maybe testing with host 64k pagesize and IOMMU 16MB pagesize in qemu
should be enough, is there any chance to get indirect mapping in qemu
like this? (DDW but with smaller DMA window available)


You will have to hack the guest kernel to always do indirect mapping or
hack QEMU's rtas_ibm_query_pe_dma_window() to return a small number of
available TCEs. But you will be testing QEMU/KVM which behave quite
differently to pHyp in this particular case.



As you suggested before, building for 4k cpu pagesize should be the
best approach. It would allow testing for both pHyp and qemu scenarios.


Because if we want the former (==support), then we'll have to align the
size up to the bigger page size when allocating/zeroing system pages,
etc.


This part I don't understand. Why do we need to align everything to the
bigger pagesize?

I mean, isn't it enough that the range [ret, ret + size[ is both
allocated by mm and mapped in an IOMMU range?

Suppose an iommu_alloc_coherent() of 16kB with PAGE_SIZE = 4k and
IOMMU_PAGE_SIZE() == 64k.
Why are 4 cpu pages mapped by a 64k IOMMU page not enough?
All the space the user asked for is allocated and mapped for DMA.


The user asked to map 16K, the rest - 48K - is used for something else
(maybe even mapped to another device) but you are making all 64K
accessible by the device, which should only be able to access 16K.

In practice, if this happens, H_PUT_TCE will simply fail.
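In code terms, roughly (a sketch of the idea rather than the actual patch; IOMMU_PAGE_SIZE()/IOMMU_PAGE_ALIGN() are the existing helpers from asm/iommu.h):

	/*
	 * With PAGE_SIZE = 4K and a 64K IOMMU page, a 16K request has to be
	 * rounded up on both the allocation and the mapping side, otherwise
	 * the remaining 48K of the last IOMMU page stays visible to the device.
	 */
	size = IOMMU_PAGE_ALIGN(size, tbl);		/* 16K -> 64K */
	order = get_order(size);			/* allocate the full 64K ... */
	page = alloc_pages_node(node, flag, order);	/* ... so nothing else lands in it */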


I have noticed mlx5 driver getting a few bytes in a buffer, and using
iommu_map_page(). It does map a whole page for as few bytes as the user


Whole 4K system page or whole 64K iommu page?


I tested it in 64k system page + 64k iommu page.

The 64K system page may be used for anything, and a small portion of it
(say 128 bytes) needs to be used for DMA.
The whole page is mapped by IOMMU, and the driver gets info of the
memory range it should access / modify.
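A rough illustration of the case being described (not the actual mlx5 code; buf and dev are placeholders):

	buf = kmalloc(128, GFP_KERNEL);		/* 128 bytes somewhere inside a 64K page */
	dma_addr = dma_map_single(dev, buf, 128, DMA_TO_DEVICE);
	/*
	 * The IOMMU maps the whole IOMMU page containing buf; the device is
	 * only told, via dma_addr and the length, which 128 bytes to touch.
	 * The rest of that page is still reachable by a misbehaving device.
	 */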



This works because the whole system page belongs to the same memory 
context and IOMMU allows a device to access that page. You can still 
have problems if there is a bug within the page but it will go mostly 
unnoticed as it will be memory corruption.


If your system page is smaller (4K) than the IOMMU page (64K), then the
device gets wider access than it should, but it is still going to be
silent memory corruption.








wants mapped, and the other bytes get used for something else, or just
mapped on another DMA page.
It seems to work fine.



With 4K system page and 64K IOMMU page? In practice it would take an
effort or/and bad luck to see it crashing. Thanks,


I haven't tested it yet. On a 64k system page and 4k/64k iommu page, it
works as described above.

I am new to this, so I am trying to understand how a memory page mapped
for DMA and used for something else could be a problem.


From the device perspective, there is PCI space and everything from 0
till 1<<64 is accessible, and what that is mapped to - the device does
not know. The PHB's IOMMU is the thing to notice invalid access and raise
EEH, but the PHB only knows about the PCI->physical memory mapping (with IOMMU
pages) and nothing about the host kernel pages. Does this help? Thanks,





Thanks!






Bigger pages are not the case here as I understand it.


I did not get this part, what do you mean?


Possible IOMMU page sizes are 4K, 64K, 2M, 16M, 256M, 1GB, and the
supported set of sizes is different for P8/P9 and type of IO (PHB,
NVLink/CAPI).



Update those functions to guarantee alignment with requested size
using IOMMU_PAGE_ALIGN() before doing iommu_alloc() / iommu_free().

Also, on iommu_range_alloc(), replace ALIGN(n, 1 << tbl->it_page_shift)
with IOMMU_PAGE_ALIGN(n, tbl), which seems easier to read.

Signed-off-by: Leonardo Bras 
---
  arch/powerpc/kernel/iommu.c | 17 +
  1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 9704f3f76e63..d7086087830f 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -237,10 +237,9 @@ static unsigned long iommu_range_alloc(struct device *dev,
}
  
  	if (dev)

-   boundary_size = ALIGN(dma_get_seg_boundary(dev) + 1,
- 1 << tbl->it_page_shift);
+   boundary_size = IOMMU_PAGE_ALIGN(dma_get_seg_boundary(dev) + 1, tbl);


Run checkpatch.pl, should complain about a long line.


It's 86 columns long, which is less than the new limit of 100 columns
Linus announced a few weeks ago. checkpatch.pl was updated too:
https://www.phoronix.com/scan.php?page=news_item=Linux-Kernel-Deprecates-80-Col


Yay finally :) Thanks,


:)




else
-   boundary_size = ALIGN(1UL << 32, 1 << tbl->it_page_shift);
+   boundary_size = IOMMU_PAGE_ALIGN(1UL << 32, tbl);
/* 4GB boundary for iseries_hv_alloc and iseries_hv_map */
  
  	n = 

Re: [PATCH v1 01/10] powerpc/pseries/iommu: Replace hard-coded page shift

2020-09-02 Thread Alexey Kardashevskiy




On 02/09/2020 07:38, Leonardo Bras wrote:

On Mon, 2020-08-31 at 13:48 +1000, Alexey Kardashevskiy wrote:

Well, I created this TCE_RPN_BITS = 52 because the previous mask was a
hardcoded 40-bit mask (0xfffffffffful) for a hard-coded 12-bit (4k)
pagesize, and PAPR+/LoPAR also defines the TCE as having bits 0-51
described as the RPN, as described before.

IODA3 Revision 3.0_prd1 (OpenPowerFoundation), Figures 3.4 and 3.5
show system memory mapping into a TCE, and the TCE also has bits 0-51
for the RPN (52 bits). "Table 3.6. TCE Definition" also shows it.
In fact, by the looks of those figures, the RPN_MASK should always be a
52-bit mask, and RPN = (page >> tceshift) & RPN_MASK.


I suspect the mask is there in the first place for extra protection
against too big addresses going to the TCE table (or/and for virtual vs
physical addresses). Using a 52-bit mask makes no sense for anything, you
could just drop the mask and let the C compiler deal with the 64-bit "uint" as it
is basically a 4K page address anywhere in the 64-bit space. Thanks,


Assuming 4K pages you need 52 RPN bits to cover the whole 64-bit
physical address space. The IODA3 spec does explicitly say the upper
bits are optional and the implementation only needs to support enough
to cover up to the physical address limit, which is 56 bits on P9 /
PHB4. If you want to validate that the address will fit inside
MAX_PHYSMEM_BITS then fine, but I think that should be done as a
WARN_ON or similar rather than just silently masking off the bits.
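A sketch of that suggestion (illustrative names only, not the actual pseries/powernv code):

	static u64 make_tce(unsigned long paddr, unsigned int tceshift, u64 proto_tce)
	{
		/* complain loudly instead of silently masking the high bits off */
		WARN_ON(paddr >> MAX_PHYSMEM_BITS);

		return proto_tce | ((paddr >> tceshift) << tceshift);
	}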


We can do this and probably should anyway but I am also pretty sure we
can just ditch the mask and have the hypervisor return an error which
will show up in dmesg.


Ok then, ditching the mask.



Well, you could run a little experiment and set some bits above that old 
mask and see how phyp reacts :)




Thanks!



--
Alexey


Re: [PATCH 1/2] dma-mapping: introduce dma_get_seg_boundary_nr_pages()

2020-09-02 Thread Michael Ellerman
Nicolin Chen  writes:
> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
> index 9704f3f76e63..cbc2e62db597 100644
> --- a/arch/powerpc/kernel/iommu.c
> +++ b/arch/powerpc/kernel/iommu.c
> @@ -236,15 +236,10 @@ static unsigned long iommu_range_alloc(struct device 
> *dev,
>   }
>   }
>  
> - if (dev)
> - boundary_size = ALIGN(dma_get_seg_boundary(dev) + 1,
> -   1 << tbl->it_page_shift);
> - else
> - boundary_size = ALIGN(1UL << 32, 1 << tbl->it_page_shift);
> - /* 4GB boundary for iseries_hv_alloc and iseries_hv_map */
> + boundary_size = dma_get_seg_boundary_nr_pages(dev, tbl->it_page_shift);
>  
>   n = iommu_area_alloc(tbl->it_map, limit, start, npages, tbl->it_offset,
> -  boundary_size >> tbl->it_page_shift, align_mask);
> +  boundary_size, align_mask);

This has changed the units of boundary_size, but it's unused elsewhere
in the function so that's OK.
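My reading of the new helper (a sketch, not a copy of PATCH-1) is that it already returns the boundary in units of IOMMU pages, which is why the explicit shift disappears from the caller:

	static unsigned long seg_boundary_nr_pages(struct device *dev,
						   unsigned int page_shift)
	{
		if (!dev)
			return (U32_MAX >> page_shift) + 1;	/* old 4GB default */
		return (dma_get_seg_boundary(dev) >> page_shift) + 1;
	}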

If you need to do a v2 for any other reason, then I'd just drop
boundary_size and call dma_get_seg_boundary_nr_pages() directly in the
function call.

Acked-by: Michael Ellerman  (powerpc)

cheers


Re: [PATCH] ASoC: fsl_sai: Set SAI Channel Mode to Output Mode

2020-09-02 Thread Nicolin Chen
On Thu, Sep 03, 2020 at 11:09:15AM +0800, Shengjiu Wang wrote:
> Transmit data pins will output zero when slots are masked or channels
> are disabled. In CHMOD TDM mode, transmit data pins are tri-stated when
> slots are masked or channels are disabled. When data pins are tri-stated,
> there is noise on some channels when FS clock value is high and data is
> read while fsclk is transitioning from high to low.
> 
> Signed-off-by: Cosmin-Gabriel Samoila 
> Signed-off-by: Shengjiu Wang 

Acked-by: Nicolin Chen 

Though one nit inline:

> ---
>  sound/soc/fsl/fsl_sai.c | 12 ++--
>  sound/soc/fsl/fsl_sai.h |  2 ++
>  2 files changed, 12 insertions(+), 2 deletions(-)
> 
> diff --git a/sound/soc/fsl/fsl_sai.c b/sound/soc/fsl/fsl_sai.c
> index 62c5fdb678fc..33b194a5c1dc 100644
> --- a/sound/soc/fsl/fsl_sai.c
> +++ b/sound/soc/fsl/fsl_sai.c
> @@ -486,6 +486,12 @@ static int fsl_sai_hw_params(struct snd_pcm_substream 
> *substream,
>  
>   val_cr4 |= FSL_SAI_CR4_FRSZ(slots);
>  
> + /* Output Mode - data pins transmit 0 when slots are masked
> +  * or channels are disabled
> +  */

Coding style for multi-line comments. Yet, probably can simplify?

/* Set to output mode to avoid tri-stated data pins */
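For illustration, either form below would satisfy checkpatch; FSL_SAI_CR4_CHMOD is only my guess at the bit name the patch sets, so treat it as a placeholder:

	/* Set to output mode to avoid tri-stated data pins */
	val_cr4 |= FSL_SAI_CR4_CHMOD;

or, if the longer explanation is kept, the usual multi-line layout:

	/*
	 * Output Mode - data pins transmit 0 when slots are masked
	 * or channels are disabled.
	 */
	val_cr4 |= FSL_SAI_CR4_CHMOD;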


Re: [PATCH 0/2] dma-mapping: update default segment_boundary_mask

2020-09-02 Thread Nicolin Chen
On Wed, Sep 02, 2020 at 10:13:12AM +0200, Niklas Schnelle wrote:
> On 9/2/20 12:16 AM, Nicolin Chen wrote:
> > These two patches are to update default segment_boundary_mask.
> > 
> > PATCH-1 fixes overflow issues in callers of dma_get_seg_boundary.
> > Previous version was a series: https://lkml.org/lkml/2020/8/31/1026
> > 
> > Then PATCH-2 sets default segment_boundary_mask to ULONG_MAX.
> > 
> > Nicolin Chen (2):
> >   dma-mapping: introduce dma_get_seg_boundary_nr_pages()
> >   dma-mapping: set default segment_boundary_mask to ULONG_MAX
> 
> I gave both of your patches a quick test ride on a couple of dev mainframes,
> both NVMe, ConnectX and virtio-pci devices all seems to work fine.
> I already commented on Christoph's mail that I like the helper approach,
> so as for s390 you can add my
> 
> Acked-by: Niklas Schnelle 
 
Thanks for testing and the ack! 


Re: [PATCH v2] scsi: ibmvfc: interface updates for future FPIN and MQ support

2020-09-02 Thread Martin K. Petersen


Tyrel,

>   Fixup compiler errors from neglected commit --amend

Bunch of formatting-related checkpatch warnings. Please fix.

Thanks!

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH] powerpc/64s: handle ISA v3.1 local copy-paste context switches

2020-09-02 Thread Paul Mackerras
On Tue, Aug 25, 2020 at 05:55:35PM +1000, Nicholas Piggin wrote:
> In ISA v3.1 the copy-paste facility has a new memory move functionality
> which allows the copy buffer to be pasted to domestic memory (RAM) as
> opposed to foreign memory (accelerator).
> 
> This means the POWER9 trick of avoiding the cp_abort on context switch if
> the process had not mapped foreign memory does not work on POWER10. Do the
> cp_abort unconditionally there.
> 
> KVM must also cp_abort on guest exit to prevent copy buffer state leaking
> between contexts.
> 
> Signed-off-by: Nicholas Piggin 

For the KVM part:

Acked-by: Paul Mackerras 


Re: [PATCH 10/10] powerpc: remove address space overrides using set_fs()

2020-09-02 Thread Linus Torvalds
On Wed, Sep 2, 2020 at 8:17 AM Christophe Leroy
 wrote:
>
>
> With this fix, I get
>
> root@vgoippro:~# time dd if=/dev/zero of=/dev/null count=1M
> 536870912 bytes (512.0MB) copied, 6.776327 seconds, 75.6MB/s
>
> That's still far from the 91.7MB/s I get with 5.9-rc2, but better than
> the 65.8MB/s I got yesterday with your series. Still some way to go though.

I don't see why this change would make any difference.

And btw, why do the 32-bit and 64-bit checks even differ? It's not
like the extra (single) instruction should even matter. I think the
main reason is that the simpler 64-bit case could stay as a macro
(because it only uses "addr" and "size" once), but honestly, that
"simplification" doesn't help when you then need to have that #ifdef
for the 32-bit case and an inline function anyway.

So why isn't it just

  static inline int __access_ok(unsigned long addr, unsigned long size)
  { return addr <= TASK_SIZE_MAX && size <= TASK_SIZE_MAX-addr; }

for both and be done with it?

The "size=0" check is only relevant for the "addr == TASK_SIZE_MAX"
case, and existed in the old code because it had that "-1" thing
becasue "seg.seg" was actually TASK_SIZE-1.

Now that we don't have any TASK_SIZE-1, zero isn't special any more.
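A tiny user-space sanity check of that version (a sketch; 0xc0000000 stands in for TASK_SIZE_MAX on a 32-bit 3G/1G split):

	#include <assert.h>
	#define TASK_SIZE_MAX 0xc0000000UL

	static int ok(unsigned long addr, unsigned long size)
	{
		return addr <= TASK_SIZE_MAX && size <= TASK_SIZE_MAX - addr;
	}

	int main(void)
	{
		assert(ok(TASK_SIZE_MAX, 0));			/* size 0 at the very top is fine */
		assert(!ok(TASK_SIZE_MAX, 1));			/* one byte past the top fails */
		assert(!ok(0x80000000UL, 0x80000000UL));	/* wrapping range fails */
		return 0;
	}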

However, I suspect a bigger reason for the actual performance
degradation would be the patch that makes things use "write_iter()"
for writing, even when a simpler "write()" exists.

For writing to /dev/null, the cost of setting up iterators and all the
pointless indirection is all kinds of stupid.

So I think "write()" should just go back to default to using
"->write()" rather than "->write_iter()" if the simpler case exists.

   Linus


Re: [PATCH 1/2] powerpc/vdso64: link vdso64 with linker

2020-09-02 Thread Christophe Leroy




Le 02/09/2020 à 19:41, Nick Desaulniers a écrit :

On Wed, Sep 2, 2020 at 5:14 AM Michael Ellerman  wrote:


Nick Desaulniers  writes:

Fixes: commit f2af201002a8 ("powerpc/build: vdso linker warning for orphan 
sections")


I think I'll just revert that for v5.9 ?


SGTM; you'll probably still want these changes with some modifications
at some point; vdso32 did have at least one orphaned section, and will
be important for hermetic builds.  Seeing crashes in supported
versions of the tools ties our hands at the moment.



Keeping the tool problem with binutils 2.26 aside, do you have a way to
really link an elf32ppc object when building vdso32 for PPC64 ?


Christophe


Re: [PATCH 2/2] powerpc/vdso32: link vdso64 with linker

2020-09-02 Thread Segher Boessenkool
On Wed, Sep 02, 2020 at 05:43:03PM +0200, Christophe Leroy wrote:
> >Try with a newer ld?  If it still happens with current versions, please
> >open a bug report?  https://sourceware.org/bugzilla
> 
> Yes it works with 2.30 and 2.34

Ah okay, I missed this part.

> But minimum for building kernel is supposed to be 2.23

Sure.  That could be upgraded to 2.24 -- you should use a binutils at
least as new as your GCC, and that requires 4.9 now -- but that
probably doesn't help you here.


Segher


Re: ptrace_syscall_32 is failing

2020-09-02 Thread Andy Lutomirski
On Wed, Sep 2, 2020 at 1:29 AM Thomas Gleixner  wrote:
>
> On Tue, Sep 01 2020 at 17:09, Andy Lutomirski wrote:
> > On Tue, Sep 1, 2020 at 4:50 PM Thomas Gleixner  wrote:
> >> > I think that they almost work for x86, but not quite as
> >> > indicated by this bug.  Even if we imagine we can somehow hack around
> >> > this bug, I imagine we're going to find other problems with this
> >> > model, e.g. the potential upcoming exit problem I noted in my review.
> >>
> >> What's the upcoming problem?
> >
> > If we ever want to get single-stepping fully correct across syscalls,
> > we might need to inject SIGTRAP on syscall return. This would be more
> > awkward if we can't run instrumentable code after the syscall part of
> > the syscall is done.
>
> We run a lot of instrumentable code after sys_foo() returns. Otherwise
> all the TIF work would not be possible at all.
>
> But you might tell me where exactly you want to inject the SIGTRAP in
> the syscall exit code flow.

It would be a bit complicated.  Definitely after any signals from the
syscall are delivered.  Right now, I think that we don't deliver a
SIGTRAP on the instruction boundary after SYSCALL while
single-stepping.  (I think we used to, but only sometimes, and now we
are at least consistent.)  This is because IRET will not trap if it
starts with TF clear and ends up setting it.  (I asked Intel to
document this, and I think they finally did, although I haven't gotten
around to reading the new docs.  Certainly the old docs as of a year
or two ago had no description whatsoever of how TF changes worked.)

Deciding exactly *when* a trap should occur would be nontrivial -- we
can't trap on sigreturn() from a SIGTRAP, for example.

So this isn't fully worked out.

>
> >> I don't think we want that in general. The current variant is perfectly
> >> fine for everything except the 32bit fast syscall nonsense. Also
> >> irqentry_entry/exit is not equivalent to the syscall_enter/exit
> >> counterparts.
> >
> > If there are any architectures in which actual work is needed to
> > figure out whether something is a syscall in the first place, they'll
> > want to do the usual kernel entry work before the syscall entry work.
>
> That's low level entry code which does not require RCU, lockdep, tracing
> or whatever muck we setup before actual work can be done.
>
> arch_asm_entry()
>   ...
>   arch_c_entry(cause) {
> switch(cause) {
>   case EXCEPTION: arch_c_exception(...);
>   case SYSCALL: arch_c_syscall(...);
>   ...
> }

You're assuming that figuring out the cause doesn't need the kernel
entry code to run first.  In the case of the 32-bit vDSO fast
syscalls, we arguably don't know whether an entry is a syscall until
we have done a user memory access.  Logically, we're doing:

if (get_user() < 0) {
  /* Not a syscall.  This is actually a silly operation that sets AX =
-EFAULT and returns.  Do not audit or invoke ptrace. */
} else {
  /* This actually is a syscall. */
}

So we really do want to stick arch code between the
enter_from_user_mode() and the audit check.  We *can't* audit because
we don't know the syscall args.  Now maybe we could invent new
semantics for this in which a fault here is still somehow a syscall,
but I think that would be a real ABI change and would want very
careful thought.  And it would be weird -- syscalls are supposed to
actually call the syscall handler, aren't they?  (Arguably we should
go back in time and make this a SIGSEGV.  We have the infrastructure
to do this cleanly, but when I wrote the code I just copied the ABI
from code that was before my time.  Even so, it would be an exception,
not a syscall.)

>
> You really want to differentiate between exception and syscall
> entry/exit.
>

Why do we want to distinguish between exception and syscall
entry/exit?  For the enter part, AFAICS the exception case boils down
to enter_from_user_mode() and the syscall case is:

enter_from_user_mode(regs);
instrumentation_begin();

local_irq_enable();
ti_work = READ_ONCE(current_thread_info()->flags);
if (ti_work & SYSCALL_ENTER_WORK)
syscall = syscall_trace_enter(regs, syscall, ti_work);
instrumentation_end();

Which would decompose quite nicely as a regular (non-syscall) entry
plus the syscall part later.

> The splitting of syscall_enter_from_user_mode() is only necessary for
> that 32bit fast syscall thing on x86 and there is no point to open code
> it with two calls for e.g. do_syscall_64().
>
> > Maybe your patch actually makes this possible -- I haven't digested
> > all the details yet.
> >
> > Who advised you to drop the arch parameter?
>
> Kees, IIRC, but I would have to search through the gazillions of mail
> threads to be sure.
>
> >> +   syscall_enter_from_user_mode_prepare(regs);
> >
> > I'm getting lost in all these "enter" functions...
>
> It's not that hard.
>
>  syscall_enter_from_user_mode_prepare()
> +

Re: [PATCH 2/2] powerpc/vdso32: link vdso64 with linker

2020-09-02 Thread Christophe Leroy

Hi,

Le 02/09/2020 à 16:14, Segher Boessenkool a écrit :

Hi!

On Wed, Sep 02, 2020 at 06:46:45AM +, Christophe Leroy wrote:

ld crashes:

   LD  arch/powerpc/kernel/vdso32/vdso32.so.dbg
/bin/sh: line 1: 23780 Segmentation fault  (core dumped)
ppc-linux-ld -EB -m elf32ppc -shared -soname linux-vdso32.so.1
--eh-frame-hdr --orphan-handling=warn -T
arch/powerpc/kernel/vdso32/vdso32.lds
arch/powerpc/kernel/vdso32/sigtramp.o
arch/powerpc/kernel/vdso32/gettimeofday.o
arch/powerpc/kernel/vdso32/datapage.o
arch/powerpc/kernel/vdso32/cacheflush.o
arch/powerpc/kernel/vdso32/note.o arch/powerpc/kernel/vdso32/getcpu.o -o
arch/powerpc/kernel/vdso32/vdso32.so.dbg
make[4]: *** [arch/powerpc/kernel/vdso32/vdso32.so.dbg] Error 139


[root@localhost linux-powerpc]# ppc-linux-ld --version
GNU ld (GNU Binutils) 2.26.20160125


[ Don't build as root :-P ]

Try with a newer ld?  If it still happens with current versions, please
open a bug report?  https://sourceware.org/bugzilla


Yes it works with 2.30 and 2.34
But minimum for building kernel is supposed to be 2.23

Christophe


Re: [PATCH 10/10] powerpc: remove address space overrides using set_fs()

2020-09-02 Thread Christophe Leroy




Le 02/09/2020 à 14:36, Christoph Hellwig a écrit :

On Wed, Sep 02, 2020 at 08:15:12AM +0200, Christophe Leroy wrote:

-   return 0;
-   return (size == 0 || size - 1 <= seg.seg - addr);
+   if (addr >= TASK_SIZE_MAX)
+   return false;
+   if (size == 0)
+   return false;


__access_ok() was returning true when size == 0 up to now. Any reason to
return false now ?


No, this is accidental and broken.  Can you re-run your benchmark with
this fixed?



With this fix, I get

root@vgoippro:~# time dd if=/dev/zero of=/dev/null count=1M
1048576+0 records in
1048576+0 records out
536870912 bytes (512.0MB) copied, 6.776327 seconds, 75.6MB/s
real0m 6.78s
user0m 1.64s
sys 0m 5.13s

That's still far from the 91.7MB/s I get with 5.9-rc2, but better than 
the 65.8MB/s I got yesterday with your series. Still some way to go though.


Christophe


RE: [PATCH 10/10] powerpc: remove address space overrides using set_fs()

2020-09-02 Thread David Laight
From: Christophe Leroy
> Sent: 02 September 2020 15:13
> 
> 
> Le 02/09/2020 à 15:51, David Laight a écrit :
> > From: Christophe Leroy
> >> Sent: 02 September 2020 14:25
> >> Le 02/09/2020 à 15:13, David Laight a écrit :
> >>> From: Christoph Hellwig
>  Sent: 02 September 2020 13:37
> 
>  On Wed, Sep 02, 2020 at 08:15:12AM +0200, Christophe Leroy wrote:
> >> -  return 0;
> >> -  return (size == 0 || size - 1 <= seg.seg - addr);
> >> +  if (addr >= TASK_SIZE_MAX)
> >> +  return false;
> >> +  if (size == 0)
> >> +  return false;
> >
> > __access_ok() was returning true when size == 0 up to now. Any reason to
> > return false now ?
> 
>  No, this is accidental and broken.  Can you re-run your benchmark with
>  this fixed?
> >>>
> >>> Is TASK_SIZE_MASK defined such that you can do:
> >>>
> >>>   return ((addr | size) < TASK_SIZE_MAX) || !size;
> >>
> >> TASK_SIZE_MAX will usually be 0xc0000000
> >>
> >> With:
> >> addr = 0x80000000;
> >> size = 0x80000000;
> >>
> >> I expect it to fail 
> >>
> >> With the formula you propose it will succeed, won't it ?
> >
> > Hmmm... Was i getting confused about some comments for 64bit
> > about there being such a big hole between valid user and kernel
> > addresses that it was enough to check that 'size < TASK_SIZE_MAX'.
> >
> > That would be true for 64bit x86 (and probably ppc (& arm??))
> > if TASK_SIZE_MAX were 0x4 << 60.
> > IIUC the highest user address is (much) less than 0x0 << 60
> > and the lowest kernel address (much) greater than 0xf << 60
> > on all these 64bit platforms.
> >
> > Actually if doing access_ok() inside get_user() you don't
> > need to check the size at all.
> 
> You mean on 64 bit or on any platform ?

64bit and 32bit

> What about a word write to 0xbffffffe, won't it overwrite 0xc0000000 ?
> 
> > You don't even need to in copy_to/from_user() provided
> > it always does a forwards copy.
> 
> Do you mean due to the gap ?
> Is it guaranteed to be a gap ? Even on 32 bits having TASK_SIZE set to
> 0xc0000000 and PAGE_OFFSET set to the same ?

I read somewhere (I won't find it again) that the last 4k page
(below 0xc0000000) must not be allocated on i386 because some
cpu (both intel and amd) do 'horrid things' if they try to
(IIRC) do instruction prefetches across the boundary.
So the accesses to 0xbffffffe will fault and the one to 0xc0000000
won't happen (in any useful way at least).

I'd suspect that not allocating the 3G-4k page would be a safe
bet on all architectures - even 68k.

David



[Bug 206203] kmemleak reports various leaks in drivers/of/unittest.c

2020-09-02 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=206203

Erhard F. (erhar...@mailbox.org) changed:

   What|Removed |Added

 Attachment #288187|0   |1
is obsolete||

--- Comment #20 from Erhard F. (erhar...@mailbox.org) ---
Created attachment 292289
  --> https://bugzilla.kernel.org/attachment.cgi?id=292289=edit
kernel .config (kernel 5.9-rc3, Talos II)

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

[Bug 206203] kmemleak reports various leaks in drivers/of/unittest.c

2020-09-02 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=206203

Erhard F. (erhar...@mailbox.org) changed:

   What|Removed |Added

 Attachment #288189|0   |1
is obsolete||

--- Comment #19 from Erhard F. (erhar...@mailbox.org) ---
Created attachment 292287
  --> https://bugzilla.kernel.org/attachment.cgi?id=292287=edit
kmemleak output (kernel 5.9-rc3, Talos II)

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

[Bug 206203] kmemleak reports various leaks in drivers/of/unittest.c

2020-09-02 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=206203

Erhard F. (erhar...@mailbox.org) changed:

   What|Removed |Added

 Attachment #288185|0   |1
is obsolete||

--- Comment #18 from Erhard F. (erhar...@mailbox.org) ---
Created attachment 292285
  --> https://bugzilla.kernel.org/attachment.cgi?id=292285=edit
dmesg (kernel 5.9-rc3, Talos II)

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

Re: [PATCH 2/2] powerpc/vdso32: link vdso64 with linker

2020-09-02 Thread Segher Boessenkool
Hi!

On Wed, Sep 02, 2020 at 06:46:45AM +, Christophe Leroy wrote:
> ld crashes:
> 
>   LD  arch/powerpc/kernel/vdso32/vdso32.so.dbg
> /bin/sh: line 1: 23780 Segmentation fault  (core dumped) 
> ppc-linux-ld -EB -m elf32ppc -shared -soname linux-vdso32.so.1 
> --eh-frame-hdr --orphan-handling=warn -T 
> arch/powerpc/kernel/vdso32/vdso32.lds 
> arch/powerpc/kernel/vdso32/sigtramp.o 
> arch/powerpc/kernel/vdso32/gettimeofday.o 
> arch/powerpc/kernel/vdso32/datapage.o 
> arch/powerpc/kernel/vdso32/cacheflush.o 
> arch/powerpc/kernel/vdso32/note.o arch/powerpc/kernel/vdso32/getcpu.o -o 
> arch/powerpc/kernel/vdso32/vdso32.so.dbg
> make[4]: *** [arch/powerpc/kernel/vdso32/vdso32.so.dbg] Error 139
> 
> 
> [root@localhost linux-powerpc]# ppc-linux-ld --version
> GNU ld (GNU Binutils) 2.26.20160125

[ Don't build as root :-P ]

Try with a newer ld?  If it still happens with current versions, please
open a bug report?  https://sourceware.org/bugzilla


Segher


Re: [PATCH 10/10] powerpc: remove address space overrides using set_fs()

2020-09-02 Thread Christophe Leroy




Le 02/09/2020 à 15:51, David Laight a écrit :

From: Christophe Leroy

Sent: 02 September 2020 14:25
Le 02/09/2020 à 15:13, David Laight a écrit :

From: Christoph Hellwig

Sent: 02 September 2020 13:37

On Wed, Sep 02, 2020 at 08:15:12AM +0200, Christophe Leroy wrote:

-   return 0;
-   return (size == 0 || size - 1 <= seg.seg - addr);
+   if (addr >= TASK_SIZE_MAX)
+   return false;
+   if (size == 0)
+   return false;


__access_ok() was returning true when size == 0 up to now. Any reason to
return false now ?


No, this is accidental and broken.  Can you re-run your benchmark with
this fixed?


Is TASK_SIZE_MASK defined such that you can do:

return ((addr | size) < TASK_SIZE_MAX) || !size;


TASK_SIZE_MAX will usually be 0xc0000000

With:
addr = 0x80000000;
size = 0x80000000;

I expect it to fail 

With the formula you propose it will succeed, won't it ?


Hmmm... Was i getting confused about some comments for 64bit
about there being such a big hole between valid user and kernel
addresses that it was enough to check that 'size < TASK_SIZE_MAX'.

That would be true for 64bit x86 (and probably ppc (& arm??))
if TASK_SIZE_MAX were 0x4 << 60.
IIUC the highest user address is (much) less than 0x0 << 60
and the lowest kernel address (much) greater than 0xf << 60
on all these 64bit platforms.

Actually if doing access_ok() inside get_user() you don't
need to check the size at all.


You mean on 64 bit or on any platform ?

What about a word write to 0xbffffffe, won't it overwrite 0xc0000000 ?


You don't even need to in copy_to/from_user() provided
it always does a forwards copy.


Do you mean due to the gap ?
Is it guaranteed to be a gap ? Even on 32 bits having TASK_SIZE set to
0xc0000000 and PAGE_OFFSET set to the same ?



Christophe


RE: [PATCH 10/10] powerpc: remove address space overrides using set_fs()

2020-09-02 Thread David Laight
From: Christophe Leroy
> Sent: 02 September 2020 14:25
> Le 02/09/2020 à 15:13, David Laight a écrit :
> > From: Christoph Hellwig
> >> Sent: 02 September 2020 13:37
> >>
> >> On Wed, Sep 02, 2020 at 08:15:12AM +0200, Christophe Leroy wrote:
>  -return 0;
>  -return (size == 0 || size - 1 <= seg.seg - addr);
>  +if (addr >= TASK_SIZE_MAX)
>  +return false;
>  +if (size == 0)
>  +return false;
> >>>
> >>> __access_ok() was returning true when size == 0 up to now. Any reason to
> >>> return false now ?
> >>
> >> No, this is accidental and broken.  Can you re-run your benchmark with
> >> this fixed?
> >
> > Is TASK_SIZE_MASK defined such that you can do:
> >
> > return ((addr | size) < TASK_SIZE_MAX) || !size;
> 
> TASK_SIZE_MAX will usually be 0xc0000000
> 
> With:
> addr = 0x80000000;
> size = 0x80000000;
> 
> I expect it to fail 
> 
> With the formula you propose it will succeed, won't it ?

Hmmm... Was i getting confused about some comments for 64bit
about there being such a big hole between valid user and kernel
addresses that it was enough to check that 'size < TASK_SIZE_MAX'.

That would be true for 64bit x86 (and probably ppc (& arm??))
if TASK_SIZE_MAX were 0x4 << 60.
IIUC the highest user address is (much) less than 0x0 << 60
and the lowest kernel address (much) greater than 0xf << 60
on all these 64bit platforms.

Actually if doing access_ok() inside get_user() you don't
need to check the size at all.
You don't even need to in copy_to/from_user() provided
it always does a forwards copy.
(Rather than copying the last word first for misaligned lengths.)

David



Re: [PATCH 10/10] powerpc: remove address space overrides using set_fs()

2020-09-02 Thread Christophe Leroy




Le 02/09/2020 à 15:13, David Laight a écrit :

From: Christoph Hellwig

Sent: 02 September 2020 13:37

On Wed, Sep 02, 2020 at 08:15:12AM +0200, Christophe Leroy wrote:

-   return 0;
-   return (size == 0 || size - 1 <= seg.seg - addr);
+   if (addr >= TASK_SIZE_MAX)
+   return false;
+   if (size == 0)
+   return false;


__access_ok() was returning true when size == 0 up to now. Any reason to
return false now ?


No, this is accidental and broken.  Can you re-run your benchmark with
this fixed?


Is TASK_SIZE_MASK defined such that you can do:

return ((addr | size) < TASK_SIZE_MAX) || !size;


TASK_SIZE_MAX will usually be 0xc0000000

With:
addr = 0x80000000;
size = 0x80000000;

I expect it to fail 

With the formula you propose it will succeed, won't it ?
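A quick user-space check of that counterexample (a sketch; 0xc0000000 assumed for the 32-bit TASK_SIZE_MAX):

	#include <stdio.h>
	#define TASK_SIZE_MAX 0xc0000000UL

	int main(void)
	{
		unsigned long addr = 0x80000000UL, size = 0x80000000UL;

		/* the proposed check accepts this range ... */
		printf("passes: %d\n", ((addr | size) < TASK_SIZE_MAX) || !size);
		/* ... although it ends at 0x100000000, far beyond TASK_SIZE_MAX */
		printf("end of range: %#llx\n", (unsigned long long)addr + size);
		return 0;
	}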

Christophe


Re: [PATCH v3 12/13] mm/debug_vm_pgtable/hugetlb: Disable hugetlb test on ppc64

2020-09-02 Thread Aneesh Kumar K.V
Anshuman Khandual  writes:

> On 09/01/2020 12:00 PM, Aneesh Kumar K.V wrote:
>> On 9/1/20 9:33 AM, Anshuman Khandual wrote:
>>>
>>>
>>> On 08/27/2020 01:34 PM, Aneesh Kumar K.V wrote:
 The test seems to be missing quite a lot of details w.r.t. allocating
 the correct pgtable_t page (huge_pte_alloc()), holding the right
 lock (huge_pte_lock()) etc. The vma used is also not a hugetlb VMA.

 ppc64 does have runtime checks within CONFIG_DEBUG_VM for most of these.
 Hence disable the test on ppc64.
>>>
>>> Would really like this to get resolved in an uniform and better way
>>> instead, i.e a modified hugetlb_advanced_tests() which works on all
>>> platforms including ppc64.
>>>
>>> In absence of a modified version, I do realize the situation here,
>>> where DEBUG_VM_PGTABLE test either runs on ppc64 or just completely
>>> remove hugetlb_advanced_tests() from other platforms as well.
>>>

 Signed-off-by: Aneesh Kumar K.V 
 ---
   mm/debug_vm_pgtable.c | 4 
   1 file changed, 4 insertions(+)

 diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
 index a188b6e4e37e..21329c7d672f 100644
 --- a/mm/debug_vm_pgtable.c
 +++ b/mm/debug_vm_pgtable.c
 @@ -813,6 +813,7 @@ static void __init hugetlb_basic_tests(unsigned long 
 pfn, pgprot_t prot)
   #endif /* CONFIG_ARCH_WANT_GENERAL_HUGETLB */
   }
   +#ifndef CONFIG_PPC_BOOK3S_64
   static void __init hugetlb_advanced_tests(struct mm_struct *mm,
     struct vm_area_struct *vma,
     pte_t *ptep, unsigned long pfn,
 @@ -855,6 +856,7 @@ static void __init hugetlb_advanced_tests(struct 
 mm_struct *mm,
   pte = huge_ptep_get(ptep);
   WARN_ON(!(huge_pte_write(pte) && huge_pte_dirty(pte)));
   }
 +#endif
>>>
>>> In the worst case if we could not get a new hugetlb_advanced_tests() test
>>> that works on all platforms, this might be the last fallback option. In
>>> which case, it will require a proper comment section with a "FIXME: ",
>>> explaining the current situation (and that #ifdef is temporary in nature)
>>> and a hugetlb_advanced_tests() stub when CONFIG_PPC_BOOK3S_64 is enabled.
>>>
   #else  /* !CONFIG_HUGETLB_PAGE */
   static void __init hugetlb_basic_tests(unsigned long pfn, pgprot_t prot) 
 { }
   static void __init hugetlb_advanced_tests(struct mm_struct *mm,
 @@ -1065,7 +1067,9 @@ static int __init debug_vm_pgtable(void)
   pud_populate_tests(mm, pudp, saved_pmdp);
   spin_unlock(ptl);
   +#ifndef CONFIG_PPC_BOOK3S_64
   hugetlb_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot);
 +#endif
>>>
>> 
>> I actually wanted to add #ifdef BROKEN. That test is completely broken.
>> In fact I would suggest removing that test completely.
>> 
>> 
>> 
>>> #ifdef will not be required here as there would be a stub definition
>>> for hugetlb_advanced_tests() when CONFIG_PPC_BOOK3S_64 is enabled.
>>>
    spin_lock(&mm->page_table_lock);
   p4d_clear_tests(mm, p4dp);

>>>
>>> But again, we should really try and avoid taking this path.
>>>
>> 
>> To be frank I am kind of frustrated with how this patch series is being
>> looked at. We pushed a completely broken test upstream and right now we
>> have code in upstream that crashes when booted on ppc64. My attempt has been
>> to make progress here and you definitely seem to not be in agreement with
>> that.
>> 
>
> I am afraid, this does not accurately represent the situation.
>
> - The second set patch series got merged in it's V5 after accommodating almost
>   all reviews and objections during previous discussion cycles. For a complete
>   development log, please refer https://patchwork.kernel.org/cover/11658627/.
>
> - The series has been repeatedly tested on arm64 and x86 platforms for 
> multiple
>   configurations but build tested on all other enabled platforms. I have 
> always
>   been dependent on voluntary help from folks on the list to get this tested 
> on
>   other enabled platforms as I dont have access to such systems. Always 
> assumed
>   that is the way to go for anything which runs on multiple platforms. So, am 
> I
>   expected to test on platforms that I dont have access to ? But I am ready to
>   be corrected here, if the community protocol is not what I have always 
> assumed
>   it to be.
>
> - Each and every version of the series had appropriately copied all the 
> enabled
>   platform's mailing list. Also, I had explicitly asked for volunteers to test
>   this out on platforms apart from x86 and arm64. We had positive response 
> from
>   all platforms i.e arc, s390, ppc32 but except for ppc64.
>
>   https://patchwork.kernel.org/cover/11644771/
>   https://patchwork.kernel.org/cover/11603713/
>
> - The development cycle provided sufficient time window for any detailed 
> review
>   and test. I have always been willing to address almost all the issues 
> 

RE: [PATCH 10/10] powerpc: remove address space overrides using set_fs()

2020-09-02 Thread David Laight
From: Christoph Hellwig
> Sent: 02 September 2020 13:37
> 
> On Wed, Sep 02, 2020 at 08:15:12AM +0200, Christophe Leroy wrote:
> >> -  return 0;
> >> -  return (size == 0 || size - 1 <= seg.seg - addr);
> >> +  if (addr >= TASK_SIZE_MAX)
> >> +  return false;
> >> +  if (size == 0)
> >> +  return false;
> >
> > __access_ok() was returning true when size == 0 up to now. Any reason to
> > return false now ?
> 
> No, this is accidental and broken.  Can you re-run your benchmark with
> this fixed?

Is TASK_SIZE_MASK defined such that you can do:

return ((addr | size) < TASK_SIZE_MAX) || !size;

David




Re: [PATCH v3 12/13] mm/debug_vm_pgtable/hugetlb: Disable hugetlb test on ppc64

2020-09-02 Thread Anshuman Khandual



On 09/01/2020 12:00 PM, Aneesh Kumar K.V wrote:
> On 9/1/20 9:33 AM, Anshuman Khandual wrote:
>>
>>
>> On 08/27/2020 01:34 PM, Aneesh Kumar K.V wrote:
>>> The test seems to be missing quite a lot of details w.r.t. allocating
>>> the correct pgtable_t page (huge_pte_alloc()), holding the right
>>> lock (huge_pte_lock()) etc. The vma used is also not a hugetlb VMA.
>>>
>>> ppc64 does have runtime checks within CONFIG_DEBUG_VM for most of these.
>>> Hence disable the test on ppc64.
>>
>> Would really like this to get resolved in an uniform and better way
>> instead, i.e a modified hugetlb_advanced_tests() which works on all
>> platforms including ppc64.
>>
>> In absence of a modified version, I do realize the situation here,
>> where DEBUG_VM_PGTABLE test either runs on ppc64 or just completely
>> remove hugetlb_advanced_tests() from other platforms as well.
>>
>>>
>>> Signed-off-by: Aneesh Kumar K.V 
>>> ---
>>>   mm/debug_vm_pgtable.c | 4 
>>>   1 file changed, 4 insertions(+)
>>>
>>> diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
>>> index a188b6e4e37e..21329c7d672f 100644
>>> --- a/mm/debug_vm_pgtable.c
>>> +++ b/mm/debug_vm_pgtable.c
>>> @@ -813,6 +813,7 @@ static void __init hugetlb_basic_tests(unsigned long 
>>> pfn, pgprot_t prot)
>>>   #endif /* CONFIG_ARCH_WANT_GENERAL_HUGETLB */
>>>   }
>>>   +#ifndef CONFIG_PPC_BOOK3S_64
>>>   static void __init hugetlb_advanced_tests(struct mm_struct *mm,
>>>     struct vm_area_struct *vma,
>>>     pte_t *ptep, unsigned long pfn,
>>> @@ -855,6 +856,7 @@ static void __init hugetlb_advanced_tests(struct 
>>> mm_struct *mm,
>>>   pte = huge_ptep_get(ptep);
>>>   WARN_ON(!(huge_pte_write(pte) && huge_pte_dirty(pte)));
>>>   }
>>> +#endif
>>
>> In the worst case if we could not get a new hugetlb_advanced_tests() test
>> that works on all platforms, this might be the last fallback option. In
>> which case, it will require a proper comment section with a "FIXME: ",
>> explaining the current situation (and that #ifdef is temporary in nature)
>> and a hugetlb_advanced_tests() stub when CONFIG_PPC_BOOK3S_64 is enabled.
>>
>>>   #else  /* !CONFIG_HUGETLB_PAGE */
>>>   static void __init hugetlb_basic_tests(unsigned long pfn, pgprot_t prot) 
>>> { }
>>>   static void __init hugetlb_advanced_tests(struct mm_struct *mm,
>>> @@ -1065,7 +1067,9 @@ static int __init debug_vm_pgtable(void)
>>>   pud_populate_tests(mm, pudp, saved_pmdp);
>>>   spin_unlock(ptl);
>>>   +#ifndef CONFIG_PPC_BOOK3S_64
>>>   hugetlb_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot);
>>> +#endif
>>
> 
> I actually wanted to add #ifdef BROKEN. That test is completely broken.
> In fact I would suggest removing that test completely.
> 
> 
> 
>> #ifdef will not be required here as there would be a stub definition
>> for hugetlb_advanced_tests() when CONFIG_PPC_BOOK3S_64 is enabled.
>>
>>>     spin_lock(&mm->page_table_lock);
>>>   p4d_clear_tests(mm, p4dp);
>>>
>>
>> But again, we should really try and avoid taking this path.
>>
> 
> To be frank I am kind of frustrated with how this patch series is being
> looked at. We pushed a completely broken test upstream and right now we
> have code in upstream that crashes when booted on ppc64. My attempt has been
> to make progress here and you definitely seem to not be in agreement with that.
> 

I am afraid, this does not accurately represent the situation.

- The second set patch series got merged in its v5 after accommodating almost
  all reviews and objections during previous discussion cycles. For a complete
  development log, please refer https://patchwork.kernel.org/cover/11658627/.

- The series has been repeatedly tested on arm64 and x86 platforms for multiple
  configurations but build tested on all other enabled platforms. I have always
  been dependent on voluntary help from folks on the list to get this tested on
  other enabled platforms as I don't have access to such systems. Always assumed
  that is the way to go for anything which runs on multiple platforms. So, am I
  expected to test on platforms that I don't have access to ? But I am ready to
  be corrected here, if the community protocol is not what I have always assumed
  it to be.

- Each and every version of the series had appropriately copied all the enabled
  platform's mailing list. Also, I had explicitly asked for volunteers to test
  this out on platforms apart from x86 and arm64. We had positive response from
  all platforms, i.e. arc, s390 and ppc32, except for ppc64.

  https://patchwork.kernel.org/cover/11644771/
  https://patchwork.kernel.org/cover/11603713/

- The development cycle provided sufficient time window for any detailed review
  and test. I have always been willing to address almost all the issues brought
  forward during these discussions. From past experience on this test, there is
  an inherent need to understand platform specific details while trying to come
  up with something generic 

Re: [PATCH v4 04/13] mm/debug_vm_pgtables/hugevmap: Use the arch helper to identify huge vmap support.

2020-09-02 Thread Aneesh Kumar K.V

On 9/2/20 6:10 PM, Christophe Leroy wrote:



Le 02/09/2020 à 13:42, Aneesh Kumar K.V a écrit :
ppc64 supports huge vmap only with radix translation. Hence use arch helper
to determine the huge vmap support.

Signed-off-by: Aneesh Kumar K.V 
---
  mm/debug_vm_pgtable.c | 14 --
  1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
index 00649b47f6e0..4c73e63b4ceb 100644
--- a/mm/debug_vm_pgtable.c
+++ b/mm/debug_vm_pgtable.c
@@ -28,6 +28,7 @@
  #include 
  #include 
  #include 
+#include 
  #include 
  #include 
@@ -206,11 +207,12 @@ static void __init pmd_leaf_tests(unsigned long 
pfn, pgprot_t prot)

  WARN_ON(!pmd_leaf(pmd));
  }
+#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
  static void __init pmd_huge_tests(pmd_t *pmdp, unsigned long pfn, 
pgprot_t prot)

  {
  pmd_t pmd;
-    if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMAP))
+    if (!arch_ioremap_pmd_supported())


What about moving ioremap_pmd_enabled() from mm/ioremap.c to some .h, 
and using it ?

As ioremap_pmd_enabled() is defined at all time, no need of #ifdef



yes. This was discussed earlier too. IMHO we should do that outside this 
series. I guess figuring out ioremap_pmd/pud support can definitely be 
simplified. With a generic version like


#ifndef arch_ioremap_pmd_supported
static inline bool arch_ioremap_pmd_supported(void)
{
return false;
}
#endif



  return;
  pr_debug("Validating PMD huge\n");
@@ -224,6 +226,9 @@ static void __init pmd_huge_tests(pmd_t *pmdp, 
unsigned long pfn, pgprot_t prot)

  pmd = READ_ONCE(*pmdp);
  WARN_ON(!pmd_none(pmd));
  }
+#else /* CONFIG_HAVE_ARCH_HUGE_VMAP */
+static void __init pmd_huge_tests(pmd_t *pmdp, unsigned long pfn, 
pgprot_t prot) { }

+#endif /* CONFIG_HAVE_ARCH_HUGE_VMAP */
  static void __init pmd_savedwrite_tests(unsigned long pfn, pgprot_t 
prot)

  {
@@ -320,11 +325,12 @@ static void __init pud_leaf_tests(unsigned long 
pfn, pgprot_t prot)

  WARN_ON(!pud_leaf(pud));
  }
+#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
  static void __init pud_huge_tests(pud_t *pudp, unsigned long pfn, 
pgprot_t prot)

  {
  pud_t pud;
-    if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMAP))
+    if (!arch_ioremap_pud_supported())


What about moving ioremap_pud_enabled() from mm/ioremap.c to some .h, 
and using it ?

As ioremap_pud_enabled() is defined at all time, no need of #ifdef


  return;
  pr_debug("Validating PUD huge\n");
@@ -338,6 +344,10 @@ static void __init pud_huge_tests(pud_t *pudp, 
unsigned long pfn, pgprot_t prot)

  pud = READ_ONCE(*pudp);
  WARN_ON(!pud_none(pud));
  }
+#else /* !CONFIG_HAVE_ARCH_HUGE_VMAP */
+static void __init pud_huge_tests(pud_t *pudp, unsigned long pfn, 
pgprot_t prot) { }

+#endif /* !CONFIG_HAVE_ARCH_HUGE_VMAP */
+
  #else  /* !CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
  static void __init pud_basic_tests(unsigned long pfn, pgprot_t prot) 
{ }

  static void __init pud_advanced_tests(struct mm_struct *mm,



Christophe




Re: [PATCH v4 04/13] mm/debug_vm_pgtables/hugevmap: Use the arch helper to identify huge vmap support.

2020-09-02 Thread Christophe Leroy




Le 02/09/2020 à 13:42, Aneesh Kumar K.V a écrit :

ppc64 supports huge vmap only with radix translation. Hence use arch helper
to determine the huge vmap support.

Signed-off-by: Aneesh Kumar K.V 
---
  mm/debug_vm_pgtable.c | 14 --
  1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
index 00649b47f6e0..4c73e63b4ceb 100644
--- a/mm/debug_vm_pgtable.c
+++ b/mm/debug_vm_pgtable.c
@@ -28,6 +28,7 @@
  #include 
  #include 
  #include 
+#include 
  #include 
  #include 
  
@@ -206,11 +207,12 @@ static void __init pmd_leaf_tests(unsigned long pfn, pgprot_t prot)

WARN_ON(!pmd_leaf(pmd));
  }
  
+#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP

  static void __init pmd_huge_tests(pmd_t *pmdp, unsigned long pfn, pgprot_t 
prot)
  {
pmd_t pmd;
  
-	if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMAP))

+   if (!arch_ioremap_pmd_supported())


What about moving ioremap_pmd_enabled() from mm/ioremap.c to some .h, 
and using it ?

As ioremap_pmd_enabled() is defined at all time, no need of #ifdef


return;
  
  	pr_debug("Validating PMD huge\n");

@@ -224,6 +226,9 @@ static void __init pmd_huge_tests(pmd_t *pmdp, unsigned 
long pfn, pgprot_t prot)
pmd = READ_ONCE(*pmdp);
WARN_ON(!pmd_none(pmd));
  }
+#else /* CONFIG_HAVE_ARCH_HUGE_VMAP */
+static void __init pmd_huge_tests(pmd_t *pmdp, unsigned long pfn, pgprot_t 
prot) { }
+#endif /* CONFIG_HAVE_ARCH_HUGE_VMAP */
  
  static void __init pmd_savedwrite_tests(unsigned long pfn, pgprot_t prot)

  {
@@ -320,11 +325,12 @@ static void __init pud_leaf_tests(unsigned long pfn, 
pgprot_t prot)
WARN_ON(!pud_leaf(pud));
  }
  
+#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP

  static void __init pud_huge_tests(pud_t *pudp, unsigned long pfn, pgprot_t 
prot)
  {
pud_t pud;
  
-	if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMAP))

+   if (!arch_ioremap_pud_supported())


What about moving ioremap_pud_enabled() from mm/ioremap.c to some .h, 
and using it ?

As ioremap_pud_enabled() is defined at all time, no need of #ifdef


return;
  
  	pr_debug("Validating PUD huge\n");

@@ -338,6 +344,10 @@ static void __init pud_huge_tests(pud_t *pudp, unsigned 
long pfn, pgprot_t prot)
pud = READ_ONCE(*pudp);
WARN_ON(!pud_none(pud));
  }
+#else /* !CONFIG_HAVE_ARCH_HUGE_VMAP */
+static void __init pud_huge_tests(pud_t *pudp, unsigned long pfn, pgprot_t 
prot) { }
+#endif /* !CONFIG_HAVE_ARCH_HUGE_VMAP */
+
  #else  /* !CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
  static void __init pud_basic_tests(unsigned long pfn, pgprot_t prot) { }
  static void __init pud_advanced_tests(struct mm_struct *mm,



Christophe


Re: [PATCH 10/10] powerpc: remove address space overrides using set_fs()

2020-09-02 Thread Christoph Hellwig
On Wed, Sep 02, 2020 at 08:15:12AM +0200, Christophe Leroy wrote:
>> -return 0;
>> -return (size == 0 || size - 1 <= seg.seg - addr);
>> +if (addr >= TASK_SIZE_MAX)
>> +return false;
>> +if (size == 0)
>> +return false;
>
> __access_ok() was returning true when size == 0 up to now. Any reason to 
> return false now ?

No, this is accidental and broken.  Can you re-run your benchmark with
this fixed?


Re: [PATCH v4 02/13] powerpc/mm: Move setting pte specific flags to pfn_pte

2020-09-02 Thread Christophe Leroy




Le 02/09/2020 à 13:42, Aneesh Kumar K.V a écrit :

powerpc used to set the pte specific flags in set_pte_at(). This is
different from other architectures. To be consistent with other
architectures, update pfn_pte to set _PAGE_PTE on ppc64. Also, drop the now
unused pte_mkpte.

We add a VM_WARN_ON() to catch the usage of calling set_pte_at()
without setting the _PAGE_PTE bit. We will remove that after a few releases.

With respect to huge pmd entries, pmd_mkhuge() takes care of adding the
_PAGE_PTE bit.

Signed-off-by: Aneesh Kumar K.V 


Reviewed-by: Christophe Leroy 

Small nit below.


---
  arch/powerpc/include/asm/book3s/64/pgtable.h | 15 +--
  arch/powerpc/include/asm/nohash/pgtable.h|  5 -
  arch/powerpc/mm/pgtable.c|  5 -
  3 files changed, 9 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 079211968987..2382fd516f6b 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -823,6 +818,14 @@ static inline int pte_none(pte_t pte)
  static inline void __set_pte_at(struct mm_struct *mm, unsigned long addr,
pte_t *ptep, pte_t pte, int percpu)
  {
+
+   VM_WARN_ON(!(pte_raw(pte) & cpu_to_be64(_PAGE_PTE)));
+   /*
+* Keep the _PAGE_PTE added till we are sure we handle _PAGE_PTE
+* in all the callers.
+*/
+pte = __pte_raw(pte_raw(pte) | cpu_to_be64(_PAGE_PTE));


Wrong alignment, there is a leading space.


+
if (radix_enabled())
return radix__set_pte_at(mm, addr, ptep, pte, percpu);
return hash__set_pte_at(mm, addr, ptep, pte, percpu);
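(For context, the pfn_pte() side of the change is not quoted above; presumably it ends up roughly like this - a sketch, not the actual hunk:)

	static inline pte_t pfn_pte(unsigned long pfn, pgprot_t pgprot)
	{
		VM_BUG_ON(pfn >> (64 - PAGE_SHIFT));
		VM_BUG_ON((pfn << PAGE_SHIFT) & ~PTE_RPN_MASK);

		/* _PAGE_PTE is now set here rather than in set_pte_at() */
		return __pte(((pte_basic_t)pfn << PAGE_SHIFT) |
			     pgprot_val(pgprot) | _PAGE_PTE);
	}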


Christophe


Re: [PATCH 1/2] powerpc/vdso64: link vdso64 with linker

2020-09-02 Thread Michael Ellerman
Nick Desaulniers  writes:
> Rather than invoke the compiler as the driver, use the linker. That way
> we can check --orphan-handling=warn support correctly, as cc-ldoption
> was removed in
> commit 055efab3120b ("kbuild: drop support for cc-ldoption").

Ouch.

Seems make is quite happy to $(call deadbeef, ...) and not print a
warning, which I guess is probably a feature.

> Painstakingly compared the output between `objdump -a` before and after
> this change. Now function symbols have the correct type of FUNC rather
> than NONE, and the entry is slightly different (which doesn't matter for
> the vdso). Binary size is the same.
>
> Fixes: commit f2af201002a8 ("powerpc/build: vdso linker warning for orphan 
> sections")

I think I'll just revert that for v5.9 ?

cheers

> Link: 
> https://lore.kernel.org/lkml/cakwvodnn3wxydjomvnveyd_njwrku3fabwt_bs92duihhyw...@mail.gmail.com/
> Signed-off-by: Nick Desaulniers 
> ---
>  arch/powerpc/include/asm/vdso.h | 17 ++---
>  arch/powerpc/kernel/vdso64/Makefile |  8 ++--
>  arch/powerpc/kernel/vdso64/vdso64.lds.S |  1 -
>  3 files changed, 8 insertions(+), 18 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/vdso.h b/arch/powerpc/include/asm/vdso.h
> index 2ff884853f97..11b2ecf49f79 100644
> --- a/arch/powerpc/include/asm/vdso.h
> +++ b/arch/powerpc/include/asm/vdso.h
> @@ -24,19 +24,7 @@ int vdso_getcpu_init(void);
>  
>  #else /* __ASSEMBLY__ */
>  
> -#ifdef __VDSO64__
> -#define V_FUNCTION_BEGIN(name)   \
> - .globl name;\
> - name:   \
> -
> -#define V_FUNCTION_END(name) \
> - .size name,.-name;
> -
> -#define V_LOCAL_FUNC(name) (name)
> -#endif /* __VDSO64__ */
> -
> -#ifdef __VDSO32__
> -
> +#if defined(__VDSO32__) || defined (__VDSO64__)
>  #define V_FUNCTION_BEGIN(name)   \
>   .globl name;\
>   .type name,@function;   \
> @@ -46,8 +34,7 @@ int vdso_getcpu_init(void);
>   .size name,.-name;
>  
>  #define V_LOCAL_FUNC(name) (name)
> -
> -#endif /* __VDSO32__ */
> +#endif /* __VDSO{32|64}__ */
>  
>  #endif /* __ASSEMBLY__ */
>  
> diff --git a/arch/powerpc/kernel/vdso64/Makefile 
> b/arch/powerpc/kernel/vdso64/Makefile
> index 38c317f25141..7ea3ce537d0a 100644
> --- a/arch/powerpc/kernel/vdso64/Makefile
> +++ b/arch/powerpc/kernel/vdso64/Makefile
> @@ -32,9 +32,13 @@ $(obj)/%.so: OBJCOPYFLAGS := -S
>  $(obj)/%.so: $(obj)/%.so.dbg FORCE
>   $(call if_changed,objcopy)
>  
> +ldflags-y := -shared -soname linux-vdso64.so.1 \
> + $(call ld-option, --eh-frame-hdr) \
> + $(call ld-option, --orphan-handling=warn) -T
> +
>  # actual build commands
> -quiet_cmd_vdso64ld = VDSO64L $@
> -  cmd_vdso64ld = $(CC) $(c_flags) -o $@ -Wl,-T$(filter %.lds,$^) 
> $(filter %.o,$^) $(call cc-ldoption, -Wl$(comma)--orphan-handling=warn)
> +quiet_cmd_vdso64ld = LD  $@
> +  cmd_vdso64ld = $(cmd_ld)
>  
>  # install commands for the unstripped file
>  quiet_cmd_vdso_install = INSTALL $@
> diff --git a/arch/powerpc/kernel/vdso64/vdso64.lds.S 
> b/arch/powerpc/kernel/vdso64/vdso64.lds.S
> index 4e3a8d4ee614..58c33b704b6a 100644
> --- a/arch/powerpc/kernel/vdso64/vdso64.lds.S
> +++ b/arch/powerpc/kernel/vdso64/vdso64.lds.S
> @@ -11,7 +11,6 @@ OUTPUT_FORMAT("elf64-powerpcle", "elf64-powerpcle", 
> "elf64-powerpcle")
>  OUTPUT_FORMAT("elf64-powerpc", "elf64-powerpc", "elf64-powerpc")
>  #endif
>  OUTPUT_ARCH(powerpc:common64)
> -ENTRY(_start)
>  
>  SECTIONS
>  {
> -- 
> 2.28.0.402.g5ffc5be6b7-goog


[PATCH v4 13/13] mm/debug_vm_pgtable: Avoid none pte in pte_clear_test

2020-09-02 Thread Aneesh Kumar K.V
pte_clear_tests operates on an existing pte entry. Make sure that
it is not a none pte entry.

Signed-off-by: Aneesh Kumar K.V 
---
 mm/debug_vm_pgtable.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
index 9afa1354326b..c36530c69e33 100644
--- a/mm/debug_vm_pgtable.c
+++ b/mm/debug_vm_pgtable.c
@@ -542,9 +542,10 @@ static void __init pgd_populate_tests(struct mm_struct 
*mm, pgd_t *pgdp,
 #endif /* PAGETABLE_P4D_FOLDED */
 
 static void __init pte_clear_tests(struct mm_struct *mm, pte_t *ptep,
-  unsigned long vaddr)
+  unsigned long pfn, unsigned long vaddr,
+  pgprot_t prot)
 {
-   pte_t pte = ptep_get(ptep);
+   pte_t pte = pfn_pte(pfn, prot);
 
pr_debug("Validating PTE clear\n");
pte = __pte(pte_val(pte) | RANDOM_ORVALUE);
@@ -1049,7 +1050,7 @@ static int __init debug_vm_pgtable(void)
 
ptl = pte_lockptr(mm, pmdp);
spin_lock(ptl);
-   pte_clear_tests(mm, ptep, vaddr);
+   pte_clear_tests(mm, ptep, pte_aligned, vaddr, prot);
pte_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot);
pte_unmap_unlock(ptep, ptl);
 
-- 
2.26.2



[PATCH v4 12/13] mm/debug_vm_pgtable/hugetlb: Disable hugetlb test on ppc64

2020-09-02 Thread Aneesh Kumar K.V
The test seems to be missing quite a lot of details w.r.t. allocating
the correct pgtable_t page (huge_pte_alloc()), holding the right
lock (huge_pte_lock()) etc. The vma used is also not a hugetlb VMA.

ppc64 does have runtime checks within CONFIG_DEBUG_VM for most of these.
Hence disable the test on ppc64.

Signed-off-by: Aneesh Kumar K.V 
---
 mm/debug_vm_pgtable.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
index b53903fdee85..9afa1354326b 100644
--- a/mm/debug_vm_pgtable.c
+++ b/mm/debug_vm_pgtable.c
@@ -811,6 +811,7 @@ static void __init hugetlb_basic_tests(unsigned long pfn, 
pgprot_t prot)
 #endif /* CONFIG_ARCH_WANT_GENERAL_HUGETLB */
 }
 
+#ifndef CONFIG_PPC_BOOK3S_64
 static void __init hugetlb_advanced_tests(struct mm_struct *mm,
  struct vm_area_struct *vma,
  pte_t *ptep, unsigned long pfn,
@@ -853,6 +854,7 @@ static void __init hugetlb_advanced_tests(struct mm_struct 
*mm,
pte = huge_ptep_get(ptep);
WARN_ON(!(huge_pte_write(pte) && huge_pte_dirty(pte)));
 }
+#endif
 #else  /* !CONFIG_HUGETLB_PAGE */
 static void __init hugetlb_basic_tests(unsigned long pfn, pgprot_t prot) { }
 static void __init hugetlb_advanced_tests(struct mm_struct *mm,
@@ -1065,7 +1067,9 @@ static int __init debug_vm_pgtable(void)
pud_populate_tests(mm, pudp, saved_pmdp);
spin_unlock(ptl);
 
+#ifndef CONFIG_PPC_BOOK3S_64
hugetlb_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot);
+#endif
 
spin_lock(>page_table_lock);
p4d_clear_tests(mm, p4dp);
-- 
2.26.2



[PATCH v4 11/13] mm/debug_vm_pgtable/pmd_clear: Don't use pmd/pud_clear on pte entries

2020-09-02 Thread Aneesh Kumar K.V
pmd_clear() should not be used to clear pmd level pte entries.

Signed-off-by: Aneesh Kumar K.V 
---
 mm/debug_vm_pgtable.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
index 26023d990bd0..b53903fdee85 100644
--- a/mm/debug_vm_pgtable.c
+++ b/mm/debug_vm_pgtable.c
@@ -196,6 +196,8 @@ static void __init pmd_advanced_tests(struct mm_struct *mm,
pmd = READ_ONCE(*pmdp);
WARN_ON(pmd_young(pmd));
 
+   /*  Clear the pte entries  */
+   pmdp_huge_get_and_clear(mm, vaddr, pmdp);
pgtable = pgtable_trans_huge_withdraw(mm, pmdp);
 }
 
@@ -319,6 +321,8 @@ static void __init pud_advanced_tests(struct mm_struct *mm,
pudp_test_and_clear_young(vma, vaddr, pudp);
pud = READ_ONCE(*pudp);
WARN_ON(pud_young(pud));
+
+   pudp_huge_get_and_clear(mm, vaddr, pudp);
 }
 
 static void __init pud_leaf_tests(unsigned long pfn, pgprot_t prot)
@@ -442,8 +446,6 @@ static void __init pud_populate_tests(struct mm_struct *mm, 
pud_t *pudp,
 * This entry points to next level page table page.
 * Hence this must not qualify as pud_bad().
 */
-   pmd_clear(pmdp);
-   pud_clear(pudp);
pud_populate(mm, pudp, pmdp);
pud = READ_ONCE(*pudp);
WARN_ON(pud_bad(pud));
@@ -575,7 +577,6 @@ static void __init pmd_populate_tests(struct mm_struct *mm, 
pmd_t *pmdp,
 * This entry points to next level page table page.
 * Hence this must not qualify as pmd_bad().
 */
-   pmd_clear(pmdp);
pmd_populate(mm, pmdp, pgtable);
pmd = READ_ONCE(*pmdp);
WARN_ON(pmd_bad(pmd));
-- 
2.26.2



[PATCH v4 10/13] mm/debug_vm_pgtable/thp: Use page table deposit/withdraw with THP

2020-09-02 Thread Aneesh Kumar K.V
Architectures like ppc64 use deposited page table while updating the
huge pte entries.

Signed-off-by: Aneesh Kumar K.V 
---
 mm/debug_vm_pgtable.c | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
index 2bc1952e5f83..26023d990bd0 100644
--- a/mm/debug_vm_pgtable.c
+++ b/mm/debug_vm_pgtable.c
@@ -154,7 +154,7 @@ static void __init pmd_basic_tests(unsigned long pfn, 
pgprot_t prot)
 static void __init pmd_advanced_tests(struct mm_struct *mm,
  struct vm_area_struct *vma, pmd_t *pmdp,
  unsigned long pfn, unsigned long vaddr,
- pgprot_t prot)
+ pgprot_t prot, pgtable_t pgtable)
 {
pmd_t pmd;
 
@@ -165,6 +165,8 @@ static void __init pmd_advanced_tests(struct mm_struct *mm,
/* Align the address wrt HPAGE_PMD_SIZE */
vaddr = (vaddr & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE;
 
+   pgtable_trans_huge_deposit(mm, pmdp, pgtable);
+
pmd = pmd_mkhuge(pfn_pmd(pfn, prot));
set_pmd_at(mm, vaddr, pmdp, pmd);
pmdp_set_wrprotect(mm, vaddr, pmdp);
@@ -193,6 +195,8 @@ static void __init pmd_advanced_tests(struct mm_struct *mm,
pmdp_test_and_clear_young(vma, vaddr, pmdp);
pmd = READ_ONCE(*pmdp);
WARN_ON(pmd_young(pmd));
+
+   pgtable = pgtable_trans_huge_withdraw(mm, pmdp);
 }
 
 static void __init pmd_leaf_tests(unsigned long pfn, pgprot_t prot)
@@ -371,7 +375,7 @@ static void __init pud_basic_tests(unsigned long pfn, 
pgprot_t prot) { }
 static void __init pmd_advanced_tests(struct mm_struct *mm,
  struct vm_area_struct *vma, pmd_t *pmdp,
  unsigned long pfn, unsigned long vaddr,
- pgprot_t prot)
+ pgprot_t prot, pgtable_t pgtable)
 {
 }
 static void __init pud_advanced_tests(struct mm_struct *mm,
@@ -1048,7 +1052,7 @@ static int __init debug_vm_pgtable(void)
 
ptl = pmd_lock(mm, pmdp);
pmd_clear_tests(mm, pmdp);
-   pmd_advanced_tests(mm, vma, pmdp, pmd_aligned, vaddr, prot);
+   pmd_advanced_tests(mm, vma, pmdp, pmd_aligned, vaddr, prot, saved_ptep);
pmd_huge_tests(pmdp, pmd_aligned, prot);
pmd_populate_tests(mm, pmdp, saved_ptep);
spin_unlock(ptl);
-- 
2.26.2



[PATCH v4 09/13] mm/debug_vm_pgtable/locks: Take correct page table lock

2020-09-02 Thread Aneesh Kumar K.V
Make sure we call pte accessors with correct lock held.

Signed-off-by: Aneesh Kumar K.V 
---
 mm/debug_vm_pgtable.c | 35 ++-
 1 file changed, 22 insertions(+), 13 deletions(-)

diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
index f59cf6a9b05e..2bc1952e5f83 100644
--- a/mm/debug_vm_pgtable.c
+++ b/mm/debug_vm_pgtable.c
@@ -1035,30 +1035,39 @@ static int __init debug_vm_pgtable(void)
 
hugetlb_basic_tests(pte_aligned, prot);
 
-   pte_clear_tests(mm, ptep, vaddr);
-   pmd_clear_tests(mm, pmdp);
-   pud_clear_tests(mm, pudp);
-   p4d_clear_tests(mm, p4dp);
-   pgd_clear_tests(mm, pgdp);
+   /*
+* Page table modifying tests. They need to hold
+* proper page table lock.
+*/
 
ptl = pte_lockptr(mm, pmdp);
spin_lock(ptl);
-
+   pte_clear_tests(mm, ptep, vaddr);
pte_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot);
-   pmd_advanced_tests(mm, vma, pmdp, pmd_aligned, vaddr, prot);
-   pud_advanced_tests(mm, vma, pudp, pud_aligned, vaddr, prot);
-   hugetlb_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot);
-
+   pte_unmap_unlock(ptep, ptl);
 
+   ptl = pmd_lock(mm, pmdp);
+   pmd_clear_tests(mm, pmdp);
+   pmd_advanced_tests(mm, vma, pmdp, pmd_aligned, vaddr, prot);
pmd_huge_tests(pmdp, pmd_aligned, prot);
+   pmd_populate_tests(mm, pmdp, saved_ptep);
+   spin_unlock(ptl);
+
+   ptl = pud_lock(mm, pudp);
+   pud_clear_tests(mm, pudp);
+   pud_advanced_tests(mm, vma, pudp, pud_aligned, vaddr, prot);
pud_huge_tests(pudp, pud_aligned, prot);
+   pud_populate_tests(mm, pudp, saved_pmdp);
+   spin_unlock(ptl);
 
-   pte_unmap_unlock(ptep, ptl);
+   hugetlb_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot);
 
-   pmd_populate_tests(mm, pmdp, saved_ptep);
-   pud_populate_tests(mm, pudp, saved_pmdp);
+   spin_lock(>page_table_lock);
+   p4d_clear_tests(mm, p4dp);
+   pgd_clear_tests(mm, pgdp);
p4d_populate_tests(mm, p4dp, saved_pudp);
pgd_populate_tests(mm, pgdp, saved_p4dp);
+   spin_unlock(>page_table_lock);
 
p4d_free(mm, saved_p4dp);
pud_free(mm, saved_pudp);
-- 
2.26.2



[PATCH v4 08/13] mm/debug_vm_pgtable/locks: Move non page table modifying test together

2020-09-02 Thread Aneesh Kumar K.V
This will help in adding proper locks in a later patch

Signed-off-by: Aneesh Kumar K.V 
---
 mm/debug_vm_pgtable.c | 51 ---
 1 file changed, 28 insertions(+), 23 deletions(-)

diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
index de333871f407..f59cf6a9b05e 100644
--- a/mm/debug_vm_pgtable.c
+++ b/mm/debug_vm_pgtable.c
@@ -986,7 +986,7 @@ static int __init debug_vm_pgtable(void)
p4dp = p4d_alloc(mm, pgdp, vaddr);
pudp = pud_alloc(mm, p4dp, vaddr);
pmdp = pmd_alloc(mm, pudp, vaddr);
-   ptep = pte_alloc_map_lock(mm, pmdp, vaddr, );
+   ptep = pte_alloc_map(mm, pmdp, vaddr);
 
/*
 * Save all the page table page addresses as the page table
@@ -1006,33 +1006,12 @@ static int __init debug_vm_pgtable(void)
p4d_basic_tests(p4d_aligned, prot);
pgd_basic_tests(pgd_aligned, prot);
 
-   pte_clear_tests(mm, ptep, vaddr);
-   pmd_clear_tests(mm, pmdp);
-   pud_clear_tests(mm, pudp);
-   p4d_clear_tests(mm, p4dp);
-   pgd_clear_tests(mm, pgdp);
-
-   pte_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot);
-   pmd_advanced_tests(mm, vma, pmdp, pmd_aligned, vaddr, prot);
-   pud_advanced_tests(mm, vma, pudp, pud_aligned, vaddr, prot);
-   hugetlb_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot);
-
pmd_leaf_tests(pmd_aligned, prot);
pud_leaf_tests(pud_aligned, prot);
 
-   pmd_huge_tests(pmdp, pmd_aligned, prot);
-   pud_huge_tests(pudp, pud_aligned, prot);
-
pte_savedwrite_tests(pte_aligned, protnone);
pmd_savedwrite_tests(pmd_aligned, protnone);
 
-   pte_unmap_unlock(ptep, ptl);
-
-   pmd_populate_tests(mm, pmdp, saved_ptep);
-   pud_populate_tests(mm, pudp, saved_pmdp);
-   p4d_populate_tests(mm, p4dp, saved_pudp);
-   pgd_populate_tests(mm, pgdp, saved_p4dp);
-
pte_special_tests(pte_aligned, prot);
pte_protnone_tests(pte_aligned, protnone);
pmd_protnone_tests(pmd_aligned, protnone);
@@ -1050,11 +1029,37 @@ static int __init debug_vm_pgtable(void)
pmd_swap_tests(pmd_aligned, prot);
 
swap_migration_tests();
-   hugetlb_basic_tests(pte_aligned, prot);
 
pmd_thp_tests(pmd_aligned, prot);
pud_thp_tests(pud_aligned, prot);
 
+   hugetlb_basic_tests(pte_aligned, prot);
+
+   pte_clear_tests(mm, ptep, vaddr);
+   pmd_clear_tests(mm, pmdp);
+   pud_clear_tests(mm, pudp);
+   p4d_clear_tests(mm, p4dp);
+   pgd_clear_tests(mm, pgdp);
+
+   ptl = pte_lockptr(mm, pmdp);
+   spin_lock(ptl);
+
+   pte_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot);
+   pmd_advanced_tests(mm, vma, pmdp, pmd_aligned, vaddr, prot);
+   pud_advanced_tests(mm, vma, pudp, pud_aligned, vaddr, prot);
+   hugetlb_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot);
+
+
+   pmd_huge_tests(pmdp, pmd_aligned, prot);
+   pud_huge_tests(pudp, pud_aligned, prot);
+
+   pte_unmap_unlock(ptep, ptl);
+
+   pmd_populate_tests(mm, pmdp, saved_ptep);
+   pud_populate_tests(mm, pudp, saved_pmdp);
+   p4d_populate_tests(mm, p4dp, saved_pudp);
+   pgd_populate_tests(mm, pgdp, saved_p4dp);
+
p4d_free(mm, saved_p4dp);
pud_free(mm, saved_pudp);
pmd_free(mm, saved_pmdp);
-- 
2.26.2



[PATCH v4 07/13] mm/debug_vm_pgtable/set_pte/pmd/pud: Don't use set_*_at to update an existing pte entry

2020-09-02 Thread Aneesh Kumar K.V
set_pte_at() should not be used to set a pte entry at locations that
already hold a valid pte entry. Architectures like ppc64 don't do TLB
invalidate in set_pte_at() and hence expect it to be used to set locations
that are not a valid PTE.

Signed-off-by: Aneesh Kumar K.V 
---
 mm/debug_vm_pgtable.c | 35 +++
 1 file changed, 15 insertions(+), 20 deletions(-)

diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
index 9cafed39c236..de333871f407 100644
--- a/mm/debug_vm_pgtable.c
+++ b/mm/debug_vm_pgtable.c
@@ -79,15 +79,18 @@ static void __init pte_advanced_tests(struct mm_struct *mm,
 {
pte_t pte = pfn_pte(pfn, prot);
 
+   /*
+* Architectures optimize set_pte_at by avoiding TLB flush.
+* This requires set_pte_at to be not used to update an
+* existing pte entry. Clear pte before we do set_pte_at
+*/
+
pr_debug("Validating PTE advanced\n");
pte = pfn_pte(pfn, prot);
set_pte_at(mm, vaddr, ptep, pte);
ptep_set_wrprotect(mm, vaddr, ptep);
pte = ptep_get(ptep);
WARN_ON(pte_write(pte));
-
-   pte = pfn_pte(pfn, prot);
-   set_pte_at(mm, vaddr, ptep, pte);
ptep_get_and_clear(mm, vaddr, ptep);
pte = ptep_get(ptep);
WARN_ON(!pte_none(pte));
@@ -101,13 +104,11 @@ static void __init pte_advanced_tests(struct mm_struct 
*mm,
ptep_set_access_flags(vma, vaddr, ptep, pte, 1);
pte = ptep_get(ptep);
WARN_ON(!(pte_write(pte) && pte_dirty(pte)));
-
-   pte = pfn_pte(pfn, prot);
-   set_pte_at(mm, vaddr, ptep, pte);
ptep_get_and_clear_full(mm, vaddr, ptep, 1);
pte = ptep_get(ptep);
WARN_ON(!pte_none(pte));
 
+   pte = pfn_pte(pfn, prot);
pte = pte_mkyoung(pte);
set_pte_at(mm, vaddr, ptep, pte);
ptep_test_and_clear_young(vma, vaddr, ptep);
@@ -169,9 +170,6 @@ static void __init pmd_advanced_tests(struct mm_struct *mm,
pmdp_set_wrprotect(mm, vaddr, pmdp);
pmd = READ_ONCE(*pmdp);
WARN_ON(pmd_write(pmd));
-
-   pmd = pmd_mkhuge(pfn_pmd(pfn, prot));
-   set_pmd_at(mm, vaddr, pmdp, pmd);
pmdp_huge_get_and_clear(mm, vaddr, pmdp);
pmd = READ_ONCE(*pmdp);
WARN_ON(!pmd_none(pmd));
@@ -185,13 +183,11 @@ static void __init pmd_advanced_tests(struct mm_struct 
*mm,
pmdp_set_access_flags(vma, vaddr, pmdp, pmd, 1);
pmd = READ_ONCE(*pmdp);
WARN_ON(!(pmd_write(pmd) && pmd_dirty(pmd)));
-
-   pmd = pmd_mkhuge(pfn_pmd(pfn, prot));
-   set_pmd_at(mm, vaddr, pmdp, pmd);
pmdp_huge_get_and_clear_full(vma, vaddr, pmdp, 1);
pmd = READ_ONCE(*pmdp);
WARN_ON(!pmd_none(pmd));
 
+   pmd = pmd_mkhuge(pfn_pmd(pfn, prot));
pmd = pmd_mkyoung(pmd);
set_pmd_at(mm, vaddr, pmdp, pmd);
pmdp_test_and_clear_young(vma, vaddr, pmdp);
@@ -292,17 +288,9 @@ static void __init pud_advanced_tests(struct mm_struct *mm,
WARN_ON(pud_write(pud));
 
 #ifndef __PAGETABLE_PMD_FOLDED
-   pud = pud_mkhuge(pfn_pud(pfn, prot));
-   set_pud_at(mm, vaddr, pudp, pud);
pudp_huge_get_and_clear(mm, vaddr, pudp);
pud = READ_ONCE(*pudp);
WARN_ON(!pud_none(pud));
-
-   pud = pud_mkhuge(pfn_pud(pfn, prot));
-   set_pud_at(mm, vaddr, pudp, pud);
-   pudp_huge_get_and_clear_full(mm, vaddr, pudp, 1);
-   pud = READ_ONCE(*pudp);
-   WARN_ON(!pud_none(pud));
 #endif /* __PAGETABLE_PMD_FOLDED */
 
pud = pud_mkhuge(pfn_pud(pfn, prot));
@@ -315,6 +303,13 @@ static void __init pud_advanced_tests(struct mm_struct *mm,
pud = READ_ONCE(*pudp);
WARN_ON(!(pud_write(pud) && pud_dirty(pud)));
 
+#ifndef __PAGETABLE_PMD_FOLDED
+   pudp_huge_get_and_clear_full(mm, vaddr, pudp, 1);
+   pud = READ_ONCE(*pudp);
+   WARN_ON(!pud_none(pud));
+#endif /* __PAGETABLE_PMD_FOLDED */
+
+   pud = pud_mkhuge(pfn_pud(pfn, prot));
pud = pud_mkyoung(pud);
set_pud_at(mm, vaddr, pudp, pud);
pudp_test_and_clear_young(vma, vaddr, pudp);
-- 
2.26.2



[PATCH v4 06/13] mm/debug_vm_pgtable/THP: Mark the pte entry huge before using set_pmd/pud_at

2020-09-02 Thread Aneesh Kumar K.V
The kernel expects entries to be marked huge before we use
set_pmd_at()/set_pud_at().

Signed-off-by: Aneesh Kumar K.V 
---
 mm/debug_vm_pgtable.c | 20 +++-
 1 file changed, 11 insertions(+), 9 deletions(-)

diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
index 8704901f6bd8..9cafed39c236 100644
--- a/mm/debug_vm_pgtable.c
+++ b/mm/debug_vm_pgtable.c
@@ -155,7 +155,7 @@ static void __init pmd_advanced_tests(struct mm_struct *mm,
  unsigned long pfn, unsigned long vaddr,
  pgprot_t prot)
 {
-   pmd_t pmd = pfn_pmd(pfn, prot);
+   pmd_t pmd;
 
if (!has_transparent_hugepage())
return;
@@ -164,19 +164,19 @@ static void __init pmd_advanced_tests(struct mm_struct 
*mm,
/* Align the address wrt HPAGE_PMD_SIZE */
vaddr = (vaddr & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE;
 
-   pmd = pfn_pmd(pfn, prot);
+   pmd = pmd_mkhuge(pfn_pmd(pfn, prot));
set_pmd_at(mm, vaddr, pmdp, pmd);
pmdp_set_wrprotect(mm, vaddr, pmdp);
pmd = READ_ONCE(*pmdp);
WARN_ON(pmd_write(pmd));
 
-   pmd = pfn_pmd(pfn, prot);
+   pmd = pmd_mkhuge(pfn_pmd(pfn, prot));
set_pmd_at(mm, vaddr, pmdp, pmd);
pmdp_huge_get_and_clear(mm, vaddr, pmdp);
pmd = READ_ONCE(*pmdp);
WARN_ON(!pmd_none(pmd));
 
-   pmd = pfn_pmd(pfn, prot);
+   pmd = pmd_mkhuge(pfn_pmd(pfn, prot));
pmd = pmd_wrprotect(pmd);
pmd = pmd_mkclean(pmd);
set_pmd_at(mm, vaddr, pmdp, pmd);
@@ -236,7 +236,7 @@ static void __init pmd_huge_tests(pmd_t *pmdp, unsigned 
long pfn, pgprot_t prot)
 
 static void __init pmd_savedwrite_tests(unsigned long pfn, pgprot_t prot)
 {
-   pmd_t pmd = pfn_pmd(pfn, prot);
+   pmd_t pmd = pmd_mkhuge(pfn_pmd(pfn, prot));
 
if (!IS_ENABLED(CONFIG_NUMA_BALANCING))
return;
@@ -276,7 +276,7 @@ static void __init pud_advanced_tests(struct mm_struct *mm,
  unsigned long pfn, unsigned long vaddr,
  pgprot_t prot)
 {
-   pud_t pud = pfn_pud(pfn, prot);
+   pud_t pud;
 
if (!has_transparent_hugepage())
return;
@@ -285,25 +285,27 @@ static void __init pud_advanced_tests(struct mm_struct 
*mm,
/* Align the address wrt HPAGE_PUD_SIZE */
vaddr = (vaddr & HPAGE_PUD_MASK) + HPAGE_PUD_SIZE;
 
+   pud = pud_mkhuge(pfn_pud(pfn, prot));
set_pud_at(mm, vaddr, pudp, pud);
pudp_set_wrprotect(mm, vaddr, pudp);
pud = READ_ONCE(*pudp);
WARN_ON(pud_write(pud));
 
 #ifndef __PAGETABLE_PMD_FOLDED
-   pud = pfn_pud(pfn, prot);
+   pud = pud_mkhuge(pfn_pud(pfn, prot));
set_pud_at(mm, vaddr, pudp, pud);
pudp_huge_get_and_clear(mm, vaddr, pudp);
pud = READ_ONCE(*pudp);
WARN_ON(!pud_none(pud));
 
-   pud = pfn_pud(pfn, prot);
+   pud = pud_mkhuge(pfn_pud(pfn, prot));
set_pud_at(mm, vaddr, pudp, pud);
pudp_huge_get_and_clear_full(mm, vaddr, pudp, 1);
pud = READ_ONCE(*pudp);
WARN_ON(!pud_none(pud));
 #endif /* __PAGETABLE_PMD_FOLDED */
-   pud = pfn_pud(pfn, prot);
+
+   pud = pud_mkhuge(pfn_pud(pfn, prot));
pud = pud_wrprotect(pud);
pud = pud_mkclean(pud);
set_pud_at(mm, vaddr, pudp, pud);
-- 
2.26.2



[PATCH v4 05/13] mm/debug_vm_pgtable/savedwrite: Enable savedwrite test with CONFIG_NUMA_BALANCING

2020-09-02 Thread Aneesh Kumar K.V
Saved write support was added to track the write bit of a pte after
marking the pte protnone. This was done so that AUTONUMA can convert
a write pte to protnone and still track the old write bit. When converting
it back we set the pte write bit correctly thereby avoiding a write fault
again. Hence enable the test only when CONFIG_NUMA_BALANCING is enabled and
use protnone protflags.

Signed-off-by: Aneesh Kumar K.V 
---
 mm/debug_vm_pgtable.c | 11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
index 4c73e63b4ceb..8704901f6bd8 100644
--- a/mm/debug_vm_pgtable.c
+++ b/mm/debug_vm_pgtable.c
@@ -119,10 +119,14 @@ static void __init pte_savedwrite_tests(unsigned long 
pfn, pgprot_t prot)
 {
pte_t pte = pfn_pte(pfn, prot);
 
+   if (!IS_ENABLED(CONFIG_NUMA_BALANCING))
+   return;
+
pr_debug("Validating PTE saved write\n");
WARN_ON(!pte_savedwrite(pte_mk_savedwrite(pte_clear_savedwrite(pte;
WARN_ON(pte_savedwrite(pte_clear_savedwrite(pte_mk_savedwrite(pte;
 }
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 static void __init pmd_basic_tests(unsigned long pfn, pgprot_t prot)
 {
@@ -234,6 +238,9 @@ static void __init pmd_savedwrite_tests(unsigned long pfn, 
pgprot_t prot)
 {
pmd_t pmd = pfn_pmd(pfn, prot);
 
+   if (!IS_ENABLED(CONFIG_NUMA_BALANCING))
+   return;
+
pr_debug("Validating PMD saved write\n");
WARN_ON(!pmd_savedwrite(pmd_mk_savedwrite(pmd_clear_savedwrite(pmd;
WARN_ON(pmd_savedwrite(pmd_clear_savedwrite(pmd_mk_savedwrite(pmd;
@@ -1019,8 +1026,8 @@ static int __init debug_vm_pgtable(void)
pmd_huge_tests(pmdp, pmd_aligned, prot);
pud_huge_tests(pudp, pud_aligned, prot);
 
-   pte_savedwrite_tests(pte_aligned, prot);
-   pmd_savedwrite_tests(pmd_aligned, prot);
+   pte_savedwrite_tests(pte_aligned, protnone);
+   pmd_savedwrite_tests(pmd_aligned, protnone);
 
pte_unmap_unlock(ptep, ptl);
 
-- 
2.26.2



[PATCH v4 04/13] mm/debug_vm_pgtables/hugevmap: Use the arch helper to identify huge vmap support.

2020-09-02 Thread Aneesh Kumar K.V
ppc64 supports huge vmap only with radix translation. Hence use arch helper
to determine the huge vmap support.

Signed-off-by: Aneesh Kumar K.V 
---
 mm/debug_vm_pgtable.c | 14 --
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
index 00649b47f6e0..4c73e63b4ceb 100644
--- a/mm/debug_vm_pgtable.c
+++ b/mm/debug_vm_pgtable.c
@@ -28,6 +28,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -206,11 +207,12 @@ static void __init pmd_leaf_tests(unsigned long pfn, 
pgprot_t prot)
WARN_ON(!pmd_leaf(pmd));
 }
 
+#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
 static void __init pmd_huge_tests(pmd_t *pmdp, unsigned long pfn, pgprot_t 
prot)
 {
pmd_t pmd;
 
-   if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMAP))
+   if (!arch_ioremap_pmd_supported())
return;
 
pr_debug("Validating PMD huge\n");
@@ -224,6 +226,9 @@ static void __init pmd_huge_tests(pmd_t *pmdp, unsigned 
long pfn, pgprot_t prot)
pmd = READ_ONCE(*pmdp);
WARN_ON(!pmd_none(pmd));
 }
+#else /* CONFIG_HAVE_ARCH_HUGE_VMAP */
+static void __init pmd_huge_tests(pmd_t *pmdp, unsigned long pfn, pgprot_t 
prot) { }
+#endif /* CONFIG_HAVE_ARCH_HUGE_VMAP */
 
 static void __init pmd_savedwrite_tests(unsigned long pfn, pgprot_t prot)
 {
@@ -320,11 +325,12 @@ static void __init pud_leaf_tests(unsigned long pfn, 
pgprot_t prot)
WARN_ON(!pud_leaf(pud));
 }
 
+#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
 static void __init pud_huge_tests(pud_t *pudp, unsigned long pfn, pgprot_t 
prot)
 {
pud_t pud;
 
-   if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMAP))
+   if (!arch_ioremap_pud_supported())
return;
 
pr_debug("Validating PUD huge\n");
@@ -338,6 +344,10 @@ static void __init pud_huge_tests(pud_t *pudp, unsigned 
long pfn, pgprot_t prot)
pud = READ_ONCE(*pudp);
WARN_ON(!pud_none(pud));
 }
+#else /* !CONFIG_HAVE_ARCH_HUGE_VMAP */
+static void __init pud_huge_tests(pud_t *pudp, unsigned long pfn, pgprot_t 
prot) { }
+#endif /* !CONFIG_HAVE_ARCH_HUGE_VMAP */
+
 #else  /* !CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
 static void __init pud_basic_tests(unsigned long pfn, pgprot_t prot) { }
 static void __init pud_advanced_tests(struct mm_struct *mm,
-- 
2.26.2



[PATCH v4 03/13] mm/debug_vm_pgtable/ppc64: Avoid setting top bits in random value

2020-09-02 Thread Aneesh Kumar K.V
ppc64 uses bit 62 to indicate a pte entry (_PAGE_PTE). Avoid setting
that bit in random value.

Signed-off-by: Aneesh Kumar K.V 
---
 mm/debug_vm_pgtable.c | 13 ++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
index 086309fb9b6f..00649b47f6e0 100644
--- a/mm/debug_vm_pgtable.c
+++ b/mm/debug_vm_pgtable.c
@@ -44,10 +44,17 @@
  * entry type. But these bits might affect the ability to clear entries with
  * pxx_clear() because of how dynamic page table folding works on s390. So
  * while loading up the entries do not change the lower 4 bits. It does not
- * have affect any other platform.
+ * have affect any other platform. Also avoid the 62nd bit on ppc64 that is
+ * used to mark a pte entry.
  */
-#define S390_MASK_BITS 4
-#define RANDOM_ORVALUE GENMASK(BITS_PER_LONG - 1, S390_MASK_BITS)
+#define S390_SKIP_MASK GENMASK(3, 0)
+#if __BITS_PER_LONG == 64
+#define PPC64_SKIP_MASKGENMASK(62, 62)
+#else
+#define PPC64_SKIP_MASK0x0
+#endif
+#define ARCH_SKIP_MASK (S390_SKIP_MASK | PPC64_SKIP_MASK)
+#define RANDOM_ORVALUE (GENMASK(BITS_PER_LONG - 1, 0) & ~ARCH_SKIP_MASK)
 #define RANDOM_NZVALUE GENMASK(7, 0)
 
 static void __init pte_basic_tests(unsigned long pfn, pgprot_t prot)
-- 
2.26.2



[PATCH v4 01/13] powerpc/mm: Add DEBUG_VM WARN for pmd_clear

2020-09-02 Thread Aneesh Kumar K.V
With the hash page table, the kernel should not use pmd_clear for clearing
huge pte entries. Add a DEBUG_VM WARN to catch the wrong usage.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/book3s/64/pgtable.h | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 6de56c3b33c4..079211968987 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -868,6 +868,13 @@ static inline bool pte_ci(pte_t pte)
 
 static inline void pmd_clear(pmd_t *pmdp)
 {
+   if (IS_ENABLED(CONFIG_DEBUG_VM) && !radix_enabled()) {
+   /*
+* Don't use this if we can possibly have a hash page table
+* entry mapping this.
+*/
+   WARN_ON((pmd_val(*pmdp) & (H_PAGE_HASHPTE | _PAGE_PTE)) == 
(H_PAGE_HASHPTE | _PAGE_PTE));
+   }
*pmdp = __pmd(0);
 }
 
@@ -916,6 +923,13 @@ static inline int pmd_bad(pmd_t pmd)
 
 static inline void pud_clear(pud_t *pudp)
 {
+   if (IS_ENABLED(CONFIG_DEBUG_VM) && !radix_enabled()) {
+   /*
+* Don't use this if we can possibly have a hash page table
+* entry mapping this.
+*/
+   WARN_ON((pud_val(*pudp) & (H_PAGE_HASHPTE | _PAGE_PTE)) == 
(H_PAGE_HASHPTE | _PAGE_PTE));
+   }
*pudp = __pud(0);
 }
 
-- 
2.26.2



[PATCH v4 02/13] powerpc/mm: Move setting pte specific flags to pfn_pte

2020-09-02 Thread Aneesh Kumar K.V
powerpc used to set the pte specific flags in set_pte_at(). This is
different from other architectures. To be consistent with other
architectures, update pfn_pte() to set _PAGE_PTE on ppc64. Also, drop the now
unused pte_mkpte().

We add a VM_WARN_ON() to catch the usage of calling set_pte_at()
without setting _PAGE_PTE bit. We will remove that after a few releases.

With respect to huge pmd entries, pmd_mkhuge() takes care of adding the
_PAGE_PTE bit.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/book3s/64/pgtable.h | 15 +--
 arch/powerpc/include/asm/nohash/pgtable.h|  5 -
 arch/powerpc/mm/pgtable.c|  5 -
 3 files changed, 9 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 079211968987..2382fd516f6b 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -619,7 +619,7 @@ static inline pte_t pfn_pte(unsigned long pfn, pgprot_t 
pgprot)
VM_BUG_ON(pfn >> (64 - PAGE_SHIFT));
VM_BUG_ON((pfn << PAGE_SHIFT) & ~PTE_RPN_MASK);
 
-   return __pte(((pte_basic_t)pfn << PAGE_SHIFT) | pgprot_val(pgprot));
+   return __pte(((pte_basic_t)pfn << PAGE_SHIFT) | pgprot_val(pgprot) | 
_PAGE_PTE);
 }
 
 static inline unsigned long pte_pfn(pte_t pte)
@@ -655,11 +655,6 @@ static inline pte_t pte_mkexec(pte_t pte)
return __pte_raw(pte_raw(pte) | cpu_to_be64(_PAGE_EXEC));
 }
 
-static inline pte_t pte_mkpte(pte_t pte)
-{
-   return __pte_raw(pte_raw(pte) | cpu_to_be64(_PAGE_PTE));
-}
-
 static inline pte_t pte_mkwrite(pte_t pte)
 {
/*
@@ -823,6 +818,14 @@ static inline int pte_none(pte_t pte)
 static inline void __set_pte_at(struct mm_struct *mm, unsigned long addr,
pte_t *ptep, pte_t pte, int percpu)
 {
+
+   VM_WARN_ON(!(pte_raw(pte) & cpu_to_be64(_PAGE_PTE)));
+   /*
+* Keep the _PAGE_PTE added till we are sure we handle _PAGE_PTE
+* in all the callers.
+*/
+pte = __pte_raw(pte_raw(pte) | cpu_to_be64(_PAGE_PTE));
+
if (radix_enabled())
return radix__set_pte_at(mm, addr, ptep, pte, percpu);
return hash__set_pte_at(mm, addr, ptep, pte, percpu);
diff --git a/arch/powerpc/include/asm/nohash/pgtable.h 
b/arch/powerpc/include/asm/nohash/pgtable.h
index 4b7c3472eab1..6277e7596ae5 100644
--- a/arch/powerpc/include/asm/nohash/pgtable.h
+++ b/arch/powerpc/include/asm/nohash/pgtable.h
@@ -140,11 +140,6 @@ static inline pte_t pte_mkold(pte_t pte)
return __pte(pte_val(pte) & ~_PAGE_ACCESSED);
 }
 
-static inline pte_t pte_mkpte(pte_t pte)
-{
-   return pte;
-}
-
 static inline pte_t pte_mkspecial(pte_t pte)
 {
return __pte(pte_val(pte) | _PAGE_SPECIAL);
diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index 9c0547d77af3..ab57b07ef39a 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -184,9 +184,6 @@ void set_pte_at(struct mm_struct *mm, unsigned long addr, 
pte_t *ptep,
 */
VM_WARN_ON(pte_hw_valid(*ptep) && !pte_protnone(*ptep));
 
-   /* Add the pte bit when trying to set a pte */
-   pte = pte_mkpte(pte);
-
/* Note: mm->context.id might not yet have been assigned as
 * this context might not have been activated yet when this
 * is called.
@@ -275,8 +272,6 @@ void set_huge_pte_at(struct mm_struct *mm, unsigned long 
addr, pte_t *ptep, pte_
 */
VM_WARN_ON(pte_hw_valid(*ptep) && !pte_protnone(*ptep));
 
-   pte = pte_mkpte(pte);
-
pte = set_pte_filter(pte);
 
val = pte_val(pte);
-- 
2.26.2



[PATCH v4 00/13] mm/debug_vm_pgtable fixes

2020-09-02 Thread Aneesh Kumar K.V
This patch series includes fixes for the debug_vm_pgtable test code so that
it follows page table update rules correctly. The first two patches introduce
changes w.r.t. ppc64. The patches are included in this series for completeness.
We can merge them via the ppc64 tree if required.

The hugetlb test is disabled on ppc64 because it needs a larger change to satisfy
page table update rules.

These tests are broken w.r.t. page table update rules and result in kernel
crashes as below.

[   21.083519] kernel BUG at arch/powerpc/mm/pgtable.c:304!
cpu 0x0: Vector: 700 (Program Check) at [c00c6d1e76c0]
pc: c009a5ec: assert_pte_locked+0x14c/0x380
lr: c05c: pte_update+0x11c/0x190
sp: c00c6d1e7950
   msr: 82029033
  current = 0xc00c6d172c80
  paca= 0xc3ba   irqmask: 0x03   irq_happened: 0x01
pid   = 1, comm = swapper/0
kernel BUG at arch/powerpc/mm/pgtable.c:304!
[link register   ] c05c pte_update+0x11c/0x190
[c00c6d1e7950] 0001 (unreliable)
[c00c6d1e79b0] c05eee14 pte_update+0x44/0x190
[c00c6d1e7a10] c1a2ca9c pte_advanced_tests+0x160/0x3d8
[c00c6d1e7ab0] c1a2d4fc debug_vm_pgtable+0x7e8/0x1338
[c00c6d1e7ba0] c00116ec do_one_initcall+0xac/0x5f0
[c00c6d1e7c80] c19e4fac kernel_init_freeable+0x4dc/0x5a4
[c00c6d1e7db0] c0012474 kernel_init+0x24/0x160
[c00c6d1e7e20] c000cbd0 ret_from_kernel_thread+0x5c/0x6c

With DEBUG_VM disabled

[   20.530152] BUG: Kernel NULL pointer dereference on read at 0x
[   20.530183] Faulting instruction address: 0xc00df330
cpu 0x33: Vector: 380 (Data SLB Access) at [c00c6d19f700]
pc: c00df330: memset+0x68/0x104
lr: c009f6d8: hash__pmdp_huge_get_and_clear+0xe8/0x1b0
sp: c00c6d19f990
   msr: 82009033
   dar: 0
  current = 0xc00c6d177480
  paca= 0xc0001ec4f400   irqmask: 0x03   irq_happened: 0x01
pid   = 1, comm = swapper/0
[link register   ] c009f6d8 hash__pmdp_huge_get_and_clear+0xe8/0x1b0
[c00c6d19f990] c009f748 hash__pmdp_huge_get_and_clear+0x158/0x1b0 
(unreliable)
[c00c6d19fa10] c19ebf30 pmd_advanced_tests+0x1f0/0x378
[c00c6d19fab0] c19ed088 debug_vm_pgtable+0x79c/0x1244
[c00c6d19fba0] c00116ec do_one_initcall+0xac/0x5f0
[c00c6d19fc80] c19a4fac kernel_init_freeable+0x4dc/0x5a4
[c00c6d19fdb0] c0012474 kernel_init+0x24/0x160
[c00c6d19fe20] c000cbd0 ret_from_kernel_thread+0x5c/0x6c

Changes from v3:
* Address review feedback
* Move page table deposit and withdraw patch after adding pmdlock to avoid
bisect failure.

Changes from v2:
* Fix build failure with different configs and architecture.

Changes from v1:
* Address review feedback
* drop test specific pfn_pte and pfn_pmd.
* Update ppc64 page table helper to add _PAGE_PTE 


Aneesh Kumar K.V (13):
  powerpc/mm: Add DEBUG_VM WARN for pmd_clear
  powerpc/mm: Move setting pte specific flags to pfn_pte
  mm/debug_vm_pgtable/ppc64: Avoid setting top bits in random value
  mm/debug_vm_pgtables/hugevmap: Use the arch helper to identify huge
vmap support.
  mm/debug_vm_pgtable/savedwrite: Enable savedwrite test with
CONFIG_NUMA_BALANCING
  mm/debug_vm_pgtable/THP: Mark the pte entry huge before using
set_pmd/pud_at
  mm/debug_vm_pgtable/set_pte/pmd/pud: Don't use set_*_at to update an
existing pte entry
  mm/debug_vm_pgtable/locks: Move non page table modifying test together
  mm/debug_vm_pgtable/locks: Take correct page table lock
  mm/debug_vm_pgtable/thp: Use page table deposit/withdraw with THP
  mm/debug_vm_pgtable/pmd_clear: Don't use pmd/pud_clear on pte entries
  mm/debug_vm_pgtable/hugetlb: Disable hugetlb test on ppc64
  mm/debug_vm_pgtable: Avoid none pte in pte_clear_test

 arch/powerpc/include/asm/book3s/64/pgtable.h |  29 +++-
 arch/powerpc/include/asm/nohash/pgtable.h|   5 -
 arch/powerpc/mm/pgtable.c|   5 -
 mm/debug_vm_pgtable.c| 171 ---
 4 files changed, 131 insertions(+), 79 deletions(-)

-- 
2.26.2



[Bug 209029] kernel 5.9-rc2 fails to boot on a PowerMac G5 11,2 - BUG: Kernel NULL pointer dereference on read at 0x00000020

2020-09-02 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=209029

--- Comment #4 from Erhard F. (erhar...@mailbox.org) ---
(In reply to Christophe Leroy from comment #3)
> Did you try without CONFIG_DEBUG_VM_PGTABLE ?
Without CONFIG_DEBUG_VM_PGTABLE the G5 boots fine. Thanks!

> If you want CONFIG_DEBUG_VM_PGTABLE, the following series aims at fixing it
> for PPC64:
> https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=197961
Did not check the series as current ozlabs patches indicate that the
CONFIG_DEBUG_VM_PGTABLE option is removed for the time being.

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

Re: [PATCH 2/2] powerpc/vdso32: link vdso64 with linker

2020-09-02 Thread Christophe Leroy




On 9/1/20 10:25 PM, Nick Desaulniers wrote:

Rather than invoke the compiler as the driver, use the linker. That way
we can check --orphan-handling=warn support correctly, as cc-ldoption
was removed in
commit 055efab3120b ("kbuild: drop support for cc-ldoption").

Requires dropping the .got section.  I couldn't find how it was used in
the vdso32.

Fixes: commit f2af201002a8 ("powerpc/build: vdso linker warning for orphan 
sections")
Link: 
https://lore.kernel.org/lkml/cakwvodnn3wxydjomvnveyd_njwrku3fabwt_bs92duihhyw...@mail.gmail.com/
Signed-off-by: Nick Desaulniers 
---
Not sure removing .got is a good idea or not.  Otherwise I observe the
following link error:
powerpc-linux-gnu-ld: warning: orphan section `.got' from 
`arch/powerpc/kernel/vdso32/sigtramp.o' being placed in section `.got'
powerpc-linux-gnu-ld: _GLOBAL_OFFSET_TABLE_ not defined in linker created .got
powerpc-linux-gnu-ld: final link failed: bad value


Finally I spotted it I think:

make arch/powerpc/kernel/vdso32/ V=1

powerpc64-linux-ld  -EB -m elf64ppc -shared -soname linux-vdso32.so.1 
--eh-frame-hdr  --orphan-handling=warn -T 
arch/powerpc/kernel/vdso32/vdso32.lds 
arch/powerpc/kernel/vdso32/sigtramp.o 
arch/powerpc/kernel/vdso32/gettimeofday.o 
arch/powerpc/kernel/vdso32/datapage.o 
arch/powerpc/kernel/vdso32/cacheflush.o 
arch/powerpc/kernel/vdso32/note.o arch/powerpc/kernel/vdso32/getcpu.o -o 
arch/powerpc/kernel/vdso32/vdso32.so.dbg




If I do the same manually but with -m elf32ppc instead of -m elf64ppc, 
there is no failure.


Adding -m elf32ppc to ldflags-y also works, although I don't like too
much having "-m elf64ppc -m elf32ppc" on the line.


Christophe


Re: [PATCH 4/4] powerpc/64s/radix: Fix mm_cpumask trimming race vs kthread_use_mm

2020-09-02 Thread Nicholas Piggin
Excerpts from Michael Ellerman's message of September 1, 2020 10:00 pm:
> Nicholas Piggin  writes:
>> Commit 0cef77c7798a7 ("powerpc/64s/radix: flush remote CPUs out of
>> single-threaded mm_cpumask") added a mechanism to trim the mm_cpumask of
>> a process under certain conditions. One of the assumptions is that
>> mm_users would not be incremented via a reference outside the process
>> context with mmget_not_zero() then go on to kthread_use_mm() via that
>> reference.
>>
>> That invariant was broken by io_uring code (see previous sparc64 fix),
>> but I'll point Fixes: to the original powerpc commit because we are
>> changing that assumption going forward, so this will make backports
>> match up.
>>
>> Fix this by no longer relying on that assumption, but by having each CPU
>> check the mm is not being used, and clearing their own bit from the mask
>> if it's okay. This fix relies on commit 38cf307c1f20 ("mm: fix
>> kthread_use_mm() vs TLB invalidate") to disable irqs over the mm switch,
>> and ARCH_WANT_IRQS_OFF_ACTIVATE_MM to be enabled.
> 
> You could use:
> 
> Depends-on: 38cf307c1f20 ("mm: fix kthread_use_mm() vs TLB invalidate")

Good idea, I will.

>> Fixes: 0cef77c7798a7 ("powerpc/64s/radix: flush remote CPUs out of 
>> single-threaded mm_cpumask")
>> Signed-off-by: Nicholas Piggin 
>> ---
>>  arch/powerpc/include/asm/tlb.h   | 13 -
>>  arch/powerpc/mm/book3s64/radix_tlb.c | 23 ---
>>  2 files changed, 16 insertions(+), 20 deletions(-)
> 
> One minor nit below if you're respinning anyway.
> 
> You know this stuff better than me, but I still reviewed it and it seems
> good to me.
> 
> Reviewed-by: Michael Ellerman 

Thanks.

> 
>> diff --git a/arch/powerpc/include/asm/tlb.h b/arch/powerpc/include/asm/tlb.h
>> index fbc6f3002f23..d97f061fecac 100644
>> --- a/arch/powerpc/include/asm/tlb.h
>> +++ b/arch/powerpc/include/asm/tlb.h
>> @@ -66,19 +66,6 @@ static inline int mm_is_thread_local(struct mm_struct *mm)
>>  return false;
>>  return cpumask_test_cpu(smp_processor_id(), mm_cpumask(mm));
>>  }
>> -static inline void mm_reset_thread_local(struct mm_struct *mm)
>> -{
>> -WARN_ON(atomic_read(>context.copros) > 0);
>> -/*
>> - * It's possible for mm_access to take a reference on mm_users to
>> - * access the remote mm from another thread, but it's not allowed
>> - * to set mm_cpumask, so mm_users may be > 1 here.
>> - */
>> -WARN_ON(current->mm != mm);
>> -atomic_set(>context.active_cpus, 1);
>> -cpumask_clear(mm_cpumask(mm));
>> -cpumask_set_cpu(smp_processor_id(), mm_cpumask(mm));
>> -}
>>  #else /* CONFIG_PPC_BOOK3S_64 */
>>  static inline int mm_is_thread_local(struct mm_struct *mm)
>>  {
>> diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c 
>> b/arch/powerpc/mm/book3s64/radix_tlb.c
>> index 0d233763441f..a421a0e3f930 100644
>> --- a/arch/powerpc/mm/book3s64/radix_tlb.c
>> +++ b/arch/powerpc/mm/book3s64/radix_tlb.c
>> @@ -645,19 +645,29 @@ static void do_exit_flush_lazy_tlb(void *arg)
>>  struct mm_struct *mm = arg;
>>  unsigned long pid = mm->context.id;
>>  
>> +/*
>> + * A kthread could have done a mmget_not_zero() after the flushing CPU
>> + * checked mm_users == 1, and be in the process of kthread_use_mm when
> ^
> in mm_is_singlethreaded()
> 
> Adding that reference would help join the dots for a new reader I think.

Yes you're right I can change that.

Thanks,
Nick
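
For readers without the full hunk in front of them, the shape of the fix
described in this thread (each CPU re-checking, with interrupts off in the
IPI, that it really is not using the mm before clearing its own mm_cpumask
bit) is roughly the sketch below. It is paraphrased from the commit message
and review discussion, not copied from the patch, so details may differ:

	static void do_exit_flush_lazy_tlb(void *arg)
	{
		struct mm_struct *mm = arg;

		/*
		 * A kthread may have done mmget_not_zero() after the flushing
		 * CPU checked mm_users == 1 and be inside kthread_use_mm()
		 * here, so check locally instead of trusting the caller.
		 */
		if (current->mm == mm)
			return;		/* this CPU really is using the mm */

		if (current->active_mm == mm) {
			/* lazy-tlb user: switch to init_mm before dropping the bit */
			mmgrab(&init_mm);
			current->active_mm = &init_mm;
			switch_mm_irqs_off(mm, &init_mm, current);
			mmdrop(mm);
		}

		atomic_dec(&mm->context.active_cpus);
		cpumask_clear_cpu(smp_processor_id(), mm_cpumask(mm));

		/* ... local flush of the mm's PID, as in the original code ... */
	}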


Re: [RFC PATCH 1/2] KVM: PPC: Use the ppc_inst type

2020-09-02 Thread Paul Mackerras
On Wed, Sep 02, 2020 at 06:00:24PM +1000, Jordan Niethe wrote:
> On Wed, Sep 2, 2020 at 4:18 PM Paul Mackerras  wrote:
> >
> > On Thu, Aug 20, 2020 at 01:39:21PM +1000, Jordan Niethe wrote:
> > > The ppc_inst type was added to help cope with the addition of prefixed
> > > instructions to the ISA. Convert KVM to use this new type for dealing
> > > wiht instructions. For now do not try to add further support for
> > > prefixed instructions.
> >
> > This change does seem to splatter itself across a lot of code that
> > mostly or exclusively runs on machines which are not POWER10 and will
> > never need to handle prefixed instructions, unfortunately.  I wonder
> > if there is a less invasive way to approach this.
> Something less invasive would be good.
> >
> > In particular we are inflicting this 64-bit struct on 32-bit platforms
> > unnecessarily (I assume, correct me if I am wrong here).
> No, that is something that I wanted to avoid; on 32-bit platforms
> it is a 32-bit struct:
> 
> struct ppc_inst {
> u32 val;
> #ifdef CONFIG_PPC64
> u32 suffix;
> #endif
> } __packed;
> >
> > How would it be to do something like:
> >
> > typedef unsigned long ppc_inst_t;
> >
> > so it is 32 bits on 32-bit platforms and 64 bits on 64-bit platforms,
> > and then use that instead of 'struct ppc_inst'?  You would still need
> > to change the function declarations but I think most of the function
> > bodies would not need to be changed.  In particular you would avoid a
> > lot of the churn related to having to add ppc_inst_val() and suchlike.
> 
> Would the idea be to get rid of `struct ppc_inst` entirely or just not
> use it in kvm?
> In an earlier series I did something similar (at least code shared
> between 32bit and 64bit would need helpers, but 32bit only code need
> not change):
> 
> #ifdef __powerpc64__
> 
> typedef struct ppc_inst {
> union {
> struct {
> u32 word;
> u32 pad;
> } __packed;
> struct {
> u32 prefix;
> u32 suffix;
> } __packed;
> };
> } ppc_inst;
> 
> #else /* !__powerpc64__ */
> 
> typedef u32 ppc_inst;
> #endif
> 
> However mpe wanted to avoid using a typedef
> (https://patchwork.ozlabs.org/comment/2391845/)

Well it doesn't have to be typedef'd, it could just be "unsigned
long", which is used in other places for things that want to be 32-bit
on 32-bit machines and 64-bit on 64-bit machines.

I do however think that it should be a numeric type so that we can
mask, shift and compare it more easily.  I know that's less "abstract"
but it's also a lot less obfuscated and I think that will lead to
clearer code.  If you got the opposite advice from Michael Ellerman or
Nick Piggin then I will discuss it with them.

> We did also talk about just using a u64 for instructions
> (https://lore.kernel.org/linuxppc-dev/1585028462.t27rstc2uf.astr...@bobo.none/)
> but the concern was that as prefixed instructions act as two separate
> u32s (prefix is always before the suffix regardless of endianness)
> keeping it as a u64 would lead to a lot of macros and potential
> confusion.
> But it does seem if that can avoid a lot of needless churn it might
> worth the trade off.

u32 *ip;

instr = *ip++;
if (is_prefix(instr) && is_suitably_aligned(ip))
instr = (instr << 32) | *ip++;

would avoid the endian issues pretty cleanly I think.  In other words
the prefix would always be the high half of the 64-bit value, so you
can't just do a single 64-bit load of the instruction on little-endian
platforms; but you can't do a single 64-bit load for other reasons as
well, such as alignment.

Paul.
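
As a rough illustration of the layout described above - the prefix always in
the high half and the suffix in the low half, independent of endianness
because each word is fetched separately - the fetch could look like the
sketch below. The helper name is made up for illustration and is not from
the series:

	/* u32/u64 as in <linux/types.h>; major opcode 1 is the ISA v3.1 prefix */
	static u64 fetch_instr(const u32 *ip)
	{
		u64 instr = ip[0];			/* prefix, or a full word instruction */

		if ((instr >> 26) == 1)			/* prefixed? */
			instr = (instr << 32) | ip[1];	/* suffix goes in the low half */

		return instr;
	}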


Re: [RFC PATCH 2/2] KVM: PPC: Book3S HV: Support prefixed instructions

2020-09-02 Thread Jordan Niethe
On Wed, Sep 2, 2020 at 4:18 PM Paul Mackerras  wrote:
>
> On Thu, Aug 20, 2020 at 01:39:22PM +1000, Jordan Niethe wrote:
> > There are two main places where instructions are loaded from the guest:
> > * Emulate loadstore - such as when performing MMIO emulation
> >   triggered by an HDSI
> > * After an HV emulation assistance interrupt (e40)
> >
> > If it is a prefixed instruction that triggers these cases, its suffix
> > must be loaded. Use the SRR1_PREFIX bit to decide if a suffix needs to
> > be loaded. Make sure if this bit is set inject_interrupt() also sets it
> > when giving an interrupt to the guest.
> >
> > ISA v3.10 extends the Hypervisor Emulation Instruction Register (HEIR)
> > to 64 bits long to accommodate prefixed instructions. For interrupts
> > caused by a word instruction the instruction is loaded into bits 32:63
> > and bits 0:31 are zeroed. When caused by a prefixed instruction the
> > prefix and suffix are loaded into bits 0:63.
> >
> > Signed-off-by: Jordan Niethe 
> > ---
> >  arch/powerpc/kvm/book3s.c   | 15 +--
> >  arch/powerpc/kvm/book3s_64_mmu_hv.c | 10 +++---
> >  arch/powerpc/kvm/book3s_hv_builtin.c|  3 +++
> >  arch/powerpc/kvm/book3s_hv_rmhandlers.S | 14 ++
> >  4 files changed, 37 insertions(+), 5 deletions(-)
> >
> > diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
> > index 70d8967acc9b..18b1928a571b 100644
> > --- a/arch/powerpc/kvm/book3s.c
> > +++ b/arch/powerpc/kvm/book3s.c
> > @@ -456,13 +456,24 @@ int kvmppc_load_last_inst(struct kvm_vcpu *vcpu,
> >  {
> >   ulong pc = kvmppc_get_pc(vcpu);
> >   u32 word;
> > + u64 doubleword;
> >   int r;
> >
> >   if (type == INST_SC)
> >   pc -= 4;
> >
> > - r = kvmppc_ld(vcpu, , sizeof(u32), , false);
> > - *inst = ppc_inst(word);
> > + if ((kvmppc_get_msr(vcpu) & SRR1_PREFIXED)) {
> > + r = kvmppc_ld(vcpu, , sizeof(u64), , false);
>
> Should we also have a check here that the doubleword is not crossing a
> page boundary?  I can't think of a way to get this code to cross a
> page boundary, assuming the hardware is working correctly, but it
> makes me just a little nervous.
I didn't think it could happen but I will add a check to be safe.
>
> > +#ifdef CONFIG_CPU_LITTLE_ENDIAN
> > + *inst = ppc_inst_prefix(doubleword & 0x, doubleword 
> > >> 32);
> > +#else
> > + *inst = ppc_inst_prefix(doubleword >> 32, doubleword & 
> > 0x);
> > +#endif
>
> Ick.  Is there a cleaner way to do this?
Would it be nicer to read the prefix as u32 then the suffix as a u32 too?
>
> > + } else {
> > + r = kvmppc_ld(vcpu, , sizeof(u32), , false);
> > + *inst = ppc_inst(word);
> > + }
> > +
> >   if (r == EMULATE_DONE)
> >   return r;
> >   else
> > diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c 
> > b/arch/powerpc/kvm/book3s_64_mmu_hv.c
> > index 775ce41738ce..0802471f4856 100644
> > --- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
> > +++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
> > @@ -411,9 +411,13 @@ static int instruction_is_store(struct ppc_inst instr)
> >   unsigned int mask;
> >
> >   mask = 0x1000;
> > - if ((ppc_inst_val(instr) & 0xfc00) == 0x7c00)
> > - mask = 0x100;   /* major opcode 31 */
> > - return (ppc_inst_val(instr) & mask) != 0;
> > + if (ppc_inst_prefixed(instr)) {
> > + return (ppc_inst_suffix(instr) & mask) != 0;
> > + } else {
> > + if ((ppc_inst_val(instr) & 0xfc00) == 0x7c00)
> > + mask = 0x100;   /* major opcode 31 */
> > + return (ppc_inst_val(instr) & mask) != 0;
> > + }
>
> The way the code worked before, the mask depended on whether the
> instruction was a D-form (or DS-form or other variant) instruction,
> where you can tell loads and stores apart by looking at the major
> opcode, or an X-form instruction, where you look at the minor opcode.
>
> Now we are only looking at the minor opcode if it is not a prefixed
> instruction.  Are there no X-form prefixed loads or stores?
I could not see any X-form loads/stores so I went with just that.
But checking the ISA it does mention  "..X-form instructions that are
preceded by an MLS-form or MMLS-form prefix..." so I shall use the
other mask too.
>
> Paul.
Thank you for the comments and suggestions.
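
One possible shape for the "read the prefix as a u32 then the suffix as a
u32" idea floated above, using the helpers already present in the series
(kvmppc_ld(), ppc_inst_prefix()). This is only an illustration, not the code
that was posted, and it assumes kvmppc_ld() leaves the passed-in address
unchanged:

	if (kvmppc_get_msr(vcpu) & SRR1_PREFIXED) {
		u32 prefix, suffix;
		ulong spc = pc + 4;	/* suffix follows the prefix word */

		r = kvmppc_ld(vcpu, &pc, sizeof(u32), &prefix, false);
		if (r == EMULATE_DONE)
			r = kvmppc_ld(vcpu, &spc, sizeof(u32), &suffix, false);
		/* each word is byte-swapped on its own, so no endian #ifdef */
		*inst = ppc_inst_prefix(prefix, suffix);
	} else {
		r = kvmppc_ld(vcpu, &pc, sizeof(u32), &word, false);
		*inst = ppc_inst(word);
	}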


[PATCH] mm: check for memory's node later during boot

2020-09-02 Thread Laurent Dufour
register_mem_sect_under_node() is checking the memory block's node id only
if the system state is "SYSTEM_BOOTING". On PowerPC, the memory blocks are
registered while the system state is "SYSTEM_SCHEDULING", the one before
SYSTEM_RUNNING.

The consequence on PowerPC guest with interleaved memory node's ranges is
that some memory blocks could be assigned to multiple nodes in sysfs. This
later prevents some memory hot-plug and hot-unplug operations from succeeding
because links remain. Such a panic is then displayed:

[ cut here ]
kernel BUG at /Users/laurent/src/linux-ppc/mm/memory_hotplug.c:1084!
Oops: Exception in kernel mode, sig: 5 [#1]
LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
Modules linked in: rpadlpar_io rpaphp pseries_rng rng_core vmx_crypto gf128mul 
binfmt_misc ip_tables x_tables xfs libcrc32c crc32c_vpmsum autofs4
CPU: 8 PID: 10256 Comm: drmgr Not tainted 5.9.0-rc1+ #25
NIP:  c0403f34 LR: c0403f2c CTR: 
REGS: c004876e3660 TRAP: 0700   Not tainted  (5.9.0-rc1+)
MSR:  8282b033   CR: 24000448  XER: 
2004
CFAR: c0846d20 IRQMASK: 0
GPR00: c0403f2c c004876e38f0 c12f6f00 ffef
GPR04: 0227 c004805ae680  0004886f
GPR08: 0226 0003 0002 fffd
GPR12: 88000484 c0001ec96280  
GPR16:   0004 0003
GPR20: c0047814ffe0 c0077c08 0010 c13332c8
GPR24:  c11f6cc0  
GPR28: ffef 0001 00015000 1000
NIP [c0403f34] add_memory_resource+0x244/0x340
LR [c0403f2c] add_memory_resource+0x23c/0x340
Call Trace:
[c004876e38f0] [c0403f2c] add_memory_resource+0x23c/0x340 
(unreliable)
[c004876e39c0] [c040408c] __add_memory+0x5c/0xf0
[c004876e39f0] [c00e2b94] dlpar_add_lmb+0x1b4/0x500
[c004876e3ad0] [c00e3888] dlpar_memory+0x1f8/0xb80
[c004876e3b60] [c00dc0d0] handle_dlpar_errorlog+0xc0/0x190
[c004876e3bd0] [c00dc398] dlpar_store+0x198/0x4a0
[c004876e3c90] [c072e630] kobj_attr_store+0x30/0x50
[c004876e3cb0] [c051f954] sysfs_kf_write+0x64/0x90
[c004876e3cd0] [c051ee40] kernfs_fop_write+0x1b0/0x290
[c004876e3d20] [c0438dd8] vfs_write+0xe8/0x290
[c004876e3d70] [c04391ac] ksys_write+0xdc/0x130
[c004876e3dc0] [c0034e40] system_call_exception+0x160/0x270
[c004876e3e20] [c000d740] system_call_common+0xf0/0x27c
Instruction dump:
48442e35 6000 0b03 3cbe0001 7fa3eb78 7bc48402 38a5fffe 7ca5fa14
78a58402 48442db1 6000 7c7c1b78 <0b03> 7f23cb78 4bda371d 6000
---[ end trace 562fd6c109cd0fb2 ]---

To prevent this multiple links, make the node checking done for states
prior to SYSTEM_RUNNING.

Signed-off-by: Laurent Dufour 
Cc: Greg Kroah-Hartman 
Cc: "Rafael J. Wysocki" 
Cc: Andrew Morton 
Fixes: 4fbce633910e ("mm/memory_hotplug.c: make register_mem_sect_under_node() 
a callback of walk_memory_range()")
---
 drivers/base/node.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 508b80f6329b..8e9f39b562ef 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -789,7 +789,7 @@ static int register_mem_sect_under_node(struct memory_block 
*mem_blk,
 * case, during hotplug we know that all pages in the memory
 * block belong to the same node.
 */
-   if (system_state == SYSTEM_BOOTING) {
+   if (system_state < SYSTEM_RUNNING) {
page_nid = get_nid_for_pfn(pfn);
if (page_nid < 0)
continue;
-- 
2.28.0



Re: [PATCH] cpuidle-pseries: Fix CEDE latency conversion from tb to us

2020-09-02 Thread Gautham R Shenoy
Hello Joel,

On Wed, Sep 02, 2020 at 01:08:35AM +, Joel Stanley wrote:
> On Tue, 1 Sep 2020 at 14:09, Gautham R. Shenoy  
> wrote:
> >
> > From: "Gautham R. Shenoy" 
> >
> > commit d947fb4c965c ("cpuidle: pseries: Fixup exit latency for
> > CEDE(0)") sets the exit latency of CEDE(0) based on the latency values
> > of the Extended CEDE states advertised by the platform. The values
> > advertised by the platform are in timebase ticks. However the cpuidle
> > framework requires the latency values in microseconds.
> >
> > If the tb-ticks value advertised by the platform corresponds to a value
> > smaller than 1us, during the conversion from tb-ticks to microseconds,
> > in the current code, the result becomes zero. This is incorrect as it
> > puts a CEDE state on par with the snooze state.
> >
> > This patch fixes this by rounding up the result obtained while
> > converting the latency value from tb-ticks to microseconds.
> >
> > Fixes: commit d947fb4c965c ("cpuidle: pseries: Fixup exit latency for
> > CEDE(0)")
> >
> > Signed-off-by: Gautham R. Shenoy 
> 
> Reviewed-by: Joel Stanley 
>

Thanks for reviewing the fix.

> Should you check for the zero case and print a warning?

Yes, that would be better. I will post a v2 with that.

> 
> > ---
> >  drivers/cpuidle/cpuidle-pseries.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/drivers/cpuidle/cpuidle-pseries.c 
> > b/drivers/cpuidle/cpuidle-pseries.c
> > index ff6d99e..9043358 100644
> > --- a/drivers/cpuidle/cpuidle-pseries.c
> > +++ b/drivers/cpuidle/cpuidle-pseries.c
> > @@ -361,7 +361,7 @@ static void __init fixup_cede0_latency(void)
> > for (i = 0; i < nr_xcede_records; i++) {
> > struct xcede_latency_record *record = >records[i];
> > u64 latency_tb = be64_to_cpu(record->latency_ticks);
> > -   u64 latency_us = tb_to_ns(latency_tb) / NSEC_PER_USEC;
> > +   u64 latency_us = DIV_ROUND_UP_ULL(tb_to_ns(latency_tb), 
> > NSEC_PER_USEC);
> >
> > if (latency_us < min_latency_us)
> > min_latency_us = latency_us;
> > --
> > 1.9.4
> >
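
To make the truncation concrete, a tiny stand-alone example (the 195 ns value
is just an illustration of a sub-microsecond CEDE latency, not a number taken
from any platform):

	#include <stdio.h>

	#define NSEC_PER_USEC		1000ULL
	#define DIV_ROUND_UP_ULL(n, d)	(((n) + (d) - 1) / (d))

	int main(void)
	{
		unsigned long long latency_ns = 195;	/* ~100 timebase ticks at 512 MHz */

		/* plain division truncates to 0us, putting CEDE(0) on par with snooze */
		printf("truncated : %lluus\n", latency_ns / NSEC_PER_USEC);
		/* rounding up keeps CEDE(0) strictly more expensive than snooze */
		printf("rounded up: %lluus\n", DIV_ROUND_UP_ULL(latency_ns, NSEC_PER_USEC));
		return 0;
	}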


Re: ptrace_syscall_32 is failing

2020-09-02 Thread Thomas Gleixner
On Tue, Sep 01 2020 at 17:09, Andy Lutomirski wrote:
> On Tue, Sep 1, 2020 at 4:50 PM Thomas Gleixner  wrote:
>> > I think that they almost work for x86, but not quite as
>> > indicated by this bug.  Even if we imagine we can somehow hack around
>> > this bug, I imagine we're going to find other problems with this
>> > model, e.g. the potential upcoming exit problem I noted in my review.
>>
>> What's the upcoming problem?
>
> If we ever want to get single-stepping fully correct across syscalls,
> we might need to inject SIGTRAP on syscall return. This would be more
> awkward if we can't run instrumentable code after the syscall part of
> the syscall is done.

We run a lot of instrumentable code after sys_foo() returns. Otherwise
all the TIF work would not be possible at all.

But you might tell me where exactly you want to inject the SIGTRAP in
the syscall exit code flow.

>> I don't think we want that in general. The current variant is perfectly
>> fine for everything except the 32bit fast syscall nonsense. Also
>> irqentry_entry/exit is not equivalent to the syscall_enter/exit
>> counterparts.
>
> If there are any architectures in which actual work is needed to
> figure out whether something is a syscall in the first place, they'll
> want to do the usual kernel entry work before the syscall entry work.

That's low level entry code which does not require RCU, lockdep, tracing
or whatever muck we setup before actual work can be done.

arch_asm_entry()
  ...
  arch_c_entry(cause) {
switch(cause) {
  case EXCEPTION: arch_c_exception(...);
  case SYSCALL: arch_c_syscall(...);
  ...
}

You really want to differentiate between exception and syscall
entry/exit.

The splitting of syscall_enter_from_user_mode() is only necessary for
that 32bit fast syscall thing on x86 and there is no point to open code
it with two calls for e.g. do_syscall_64().

> Maybe your patch actually makes this possible -- I haven't digested
> all the details yet.
>
> Who advised you to drop the arch parameter?

Kees, IIRC, but I would have to search through the gazillions of mail
threads to be sure.

>> +   syscall_enter_from_user_mode_prepare(regs);
>
> I'm getting lost in all these "enter" functions...

It's not that hard.

 syscall_enter_from_user_mode_prepare()
+syscall_enter_from_user_mode_work()
=syscall_enter_from_user_mode()

That's exactly what you suggested just with the difference that it is
explicit for syscalls and not using irqentry_enter/exit().

If we would do that then instead of having a single call for sane
syscall paths:

  arch_c_entry()
 nr = syscall_enter_from_user_mode();

or for that 32bit fast syscall nonsense the split variant:

  arch_c_entry()
 syscall_enter_from_user_mode_prepare();
 do_fast_syscall_muck();
 nr = syscall_enter_from_user_mode_work();

we'd have:

  arch_c_entry()
 irqentry_enter();
     local_irq_enable();
 nr = syscall_enter_from_user_mode_work();
 ...

which enforces two calls for sane entries and more code in arch/

Thanks,

tglx
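
In code, the relation between the three helpers spelled out above amounts to
roughly the following; this is an approximation for illustration, not a
verbatim copy of the generic entry code:

	/* combined helper used by the sane (non 32-bit fast) syscall paths */
	static long syscall_enter_from_user_mode(struct pt_regs *regs, long syscall)
	{
		/* low-level state: context tracking, RCU, lockdep, irqs on */
		syscall_enter_from_user_mode_prepare(regs);
		/* syscall work: ptrace, seccomp, tracepoints, audit */
		return syscall_enter_from_user_mode_work(regs, syscall);
	}

The 32-bit fast syscall path calls the two pieces separately, with its extra
work (the do_fast_syscall_muck() step in the sketch above) sandwiched in
between.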


Re: [PATCH] powerpc: Fix random segfault when freeing hugetlb range

2020-09-02 Thread Aneesh Kumar K.V

On 9/2/20 1:41 PM, Christophe Leroy wrote:



On 02/09/2020 at 05:23, Aneesh Kumar K.V wrote:

Christophe Leroy  writes:


The following random segfault is observed from time to time with
map_hugetlb selftest:

root@localhost:~# ./map_hugetlb 1 19
524288 kB hugepages
Mapping 1 Mbytes
Segmentation fault

[   31.219972] map_hugetlb[365]: segfault (11) at 117 nip 77974f8c lr 
779a6834 code 1 in ld-2.23.so[77966000+21000]
[   31.220192] map_hugetlb[365]: code: 9421ffc0 480318d1 93410028 
90010044 9361002c 93810030 93a10034 93c10038
[   31.220307] map_hugetlb[365]: code: 93e1003c 93210024 8123007c 
81430038 <80e90004> 814a0004 7f443a14 813a0004
[   31.221911] BUG: Bad rss-counter state mm:(ptrval) 
type:MM_FILEPAGES val:33
[   31.229362] BUG: Bad rss-counter state mm:(ptrval) 
type:MM_ANONPAGES val:5


This fault is due to hugetlb_free_pgd_range() freeing page tables
that are also used by regular pages.

As explained in the comment at the beginning of
hugetlb_free_pgd_range(), the verification done in free_pgd_range()
on floor and ceiling is not done here, which means
hugetlb_free_pte_range() can free outside the expected range.

As the verification cannot be done in hugetlb_free_pgd_range(), it
must be done in hugetlb_free_pte_range().



Reviewed-by: Aneesh Kumar K.V 

Fixes: b250c8c08c79 ("powerpc/8xx: Manage 512k huge pages as standard 
pages.")

Cc: sta...@vger.kernel.org
Signed-off-by: Christophe Leroy 
---
  arch/powerpc/mm/hugetlbpage.c | 18 --
  1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/mm/hugetlbpage.c 
b/arch/powerpc/mm/hugetlbpage.c

index 26292544630f..e7ae2a2c4545 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -330,10 +330,24 @@ static void free_hugepd_range(struct mmu_gather 
*tlb, hugepd_t *hpdp, int pdshif

   get_hugepd_cache_index(pdshift - shift));
  }
-static void hugetlb_free_pte_range(struct mmu_gather *tlb, pmd_t 
*pmd, unsigned long addr)

+static void hugetlb_free_pte_range(struct mmu_gather *tlb, pmd_t *pmd,
+   unsigned long addr, unsigned long end,
+   unsigned long floor, unsigned long ceiling)
  {
+    unsigned long start = addr;
  pgtable_t token = pmd_pgtable(*pmd);
+    start &= PMD_MASK;
+    if (start < floor)
+    return;
+    if (ceiling) {
+    ceiling &= PMD_MASK;
+    if (!ceiling)
+    return;
+    }
+    if (end - 1 > ceiling - 1)
+    return;
+


We do repeat that for pud/pmd/pte hugetlb_free_range. Can we consolidate
that with a comment explaining we are checking if the pgtable entry is
mapping outside the range?


I was thinking about refactoring that into a helper and adding all the
necessary comments to explain what it does.


Will do that in a followup series if you are OK. This patch is a bug fix 
and also have to go through stable.




agreed.

Thanks.
-aneesh
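
For reference, a rough sketch of the kind of helper discussed above
(hypothetical name, not the actual follow-up patch): it centralises the
floor/ceiling check so the pte/pmd/pud variants only differ in the mask
they pass.

/*
 * Hypothetical helper: returns true when the page-table page spanning
 * [addr, end) extends outside the [floor, ceiling) limits and therefore
 * must not be freed.  Callers would pass PMD_MASK, PUD_MASK, and so on.
 */
static bool range_is_outside_limits(unsigned long addr, unsigned long end,
				    unsigned long floor, unsigned long ceiling,
				    unsigned long mask)
{
	addr &= mask;
	if (addr < floor)
		return true;
	if (ceiling) {
		ceiling &= mask;
		if (!ceiling)
			return true;
	}
	return end - 1 > ceiling - 1;
}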


Re: [PATCH 0/2] dma-mapping: update default segment_boundary_mask

2020-09-02 Thread Niklas Schnelle



On 9/2/20 12:16 AM, Nicolin Chen wrote:
> These two patches are to update default segment_boundary_mask.
> 
> PATCH-1 fixes overflow issues in callers of dma_get_seg_boundary.
> Previous version was a series: https://lkml.org/lkml/2020/8/31/1026
> 
> Then PATCH-2 sets default segment_boundary_mask to ULONG_MAX.
> 
> Nicolin Chen (2):
>   dma-mapping: introduce dma_get_seg_boundary_nr_pages()
>   dma-mapping: set default segment_boundary_mask to ULONG_MAX

I gave both of your patches a quick test ride on a couple of dev mainframes;
NVMe, ConnectX and virtio-pci devices all seem to work fine.
I already commented on Christoph's mail that I like the helper approach,
so as for s390 you can add my

Acked-by: Niklas Schnelle 

> 
>  arch/alpha/kernel/pci_iommu.c|  7 +--
>  arch/ia64/hp/common/sba_iommu.c  |  3 +--
>  arch/powerpc/kernel/iommu.c  |  9 ++---
>  arch/s390/pci/pci_dma.c  |  6 ++
>  arch/sparc/kernel/iommu-common.c | 10 +++---
>  arch/sparc/kernel/iommu.c|  3 +--
>  arch/sparc/kernel/pci_sun4v.c|  3 +--
>  arch/x86/kernel/amd_gart_64.c|  3 +--
>  drivers/parisc/ccio-dma.c|  3 +--
>  drivers/parisc/sba_iommu.c   |  3 +--
>  include/linux/dma-mapping.h  | 21 -
>  11 files changed, 34 insertions(+), 37 deletions(-)
> 
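
For context, a sketch of the shape the new helper from PATCH-1 could take
(based on the subject line only, not necessarily the submitted code): it
converts the byte-granular boundary mask into a page count for the IOMMU
allocators listed in the diffstat.

/*
 * Sketch: number of pages below the device's segment boundary, falling
 * back to a 32-bit boundary when no device is supplied.  Relies on the
 * existing dma_get_seg_boundary() accessor.
 */
static inline unsigned long dma_get_seg_boundary_nr_pages(struct device *dev,
		unsigned int page_shift)
{
	if (!dev)
		return (U32_MAX >> page_shift) + 1;
	return (dma_get_seg_boundary(dev) >> page_shift) + 1;
}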


Re: [PATCH] powerpc: Fix random segfault when freeing hugetlb range

2020-09-02 Thread Christophe Leroy




On 02/09/2020 at 05:23, Aneesh Kumar K.V wrote:

Christophe Leroy  writes:


The following random segfault is observed from time to time with
map_hugetlb selftest:

root@localhost:~# ./map_hugetlb 1 19
524288 kB hugepages
Mapping 1 Mbytes
Segmentation fault

[   31.219972] map_hugetlb[365]: segfault (11) at 117 nip 77974f8c lr 779a6834 code 1 in ld-2.23.so[77966000+21000]
[   31.220192] map_hugetlb[365]: code: 9421ffc0 480318d1 93410028 90010044 9361002c 93810030 93a10034 93c10038
[   31.220307] map_hugetlb[365]: code: 93e1003c 93210024 8123007c 81430038 <80e90004> 814a0004 7f443a14 813a0004
[   31.221911] BUG: Bad rss-counter state mm:(ptrval) type:MM_FILEPAGES val:33
[   31.229362] BUG: Bad rss-counter state mm:(ptrval) type:MM_ANONPAGES val:5

This fault is due to hugetlb_free_pgd_range() freeing page tables
that are also used by regular pages.

As explained in the comment at the beginning of
hugetlb_free_pgd_range(), the verification done in free_pgd_range()
on floor and ceiling is not done here, which means
hugetlb_free_pte_range() can free outside the expected range.

As the verification cannot be done in hugetlb_free_pgd_range(), it
must be done in hugetlb_free_pte_range().



Reviewed-by: Aneesh Kumar K.V 


Fixes: b250c8c08c79 ("powerpc/8xx: Manage 512k huge pages as standard pages.")
Cc: sta...@vger.kernel.org
Signed-off-by: Christophe Leroy 
---
  arch/powerpc/mm/hugetlbpage.c | 18 --
  1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 26292544630f..e7ae2a2c4545 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -330,10 +330,24 @@ static void free_hugepd_range(struct mmu_gather *tlb, hugepd_t *hpdp, int pdshif
 				 get_hugepd_cache_index(pdshift - shift));
 }
 
-static void hugetlb_free_pte_range(struct mmu_gather *tlb, pmd_t *pmd, unsigned long addr)
+static void hugetlb_free_pte_range(struct mmu_gather *tlb, pmd_t *pmd,
+				   unsigned long addr, unsigned long end,
+				   unsigned long floor, unsigned long ceiling)
 {
+	unsigned long start = addr;
 	pgtable_t token = pmd_pgtable(*pmd);
 
+	start &= PMD_MASK;
+	if (start < floor)
+		return;
+	if (ceiling) {
+		ceiling &= PMD_MASK;
+		if (!ceiling)
+			return;
+	}
+	if (end - 1 > ceiling - 1)
+		return;
+


We do repeat that for the pud/pmd/pte hugetlb free range functions. Can we
consolidate that, with a comment explaining that we are checking whether the
pgtable entry is mapping outside the range?


I was thinking about refactoring that into a helper and adding all the
necessary comments to explain what it does.


Will do that in a followup series if you are OK. This patch is a bug fix
and also has to go through stable.


Christophe


Re: remove the last set_fs() in common code, and remove it for x86 and powerpc v2

2020-09-02 Thread Christoph Hellwig
On Tue, Sep 01, 2020 at 06:25:12PM +0100, Al Viro wrote:
> On Tue, Sep 01, 2020 at 07:13:00PM +0200, Christophe Leroy wrote:
> 
> > 10.92%  dd   [kernel.kallsyms]  [k] iov_iter_zero
> 
> Interesting...  Could you get an instruction-level profile inside 
> iov_iter_zero(),
> along with the disassembly of that sucker?

So the interesting thing here is that none of these code paths
should have changed at all, and the biggest items on the profile look
the same modulo some minor reordering.


Re: [PATCH 05/10] lkdtm: disable set_fs-based tests for !CONFIG_SET_FS

2020-09-02 Thread Christoph Hellwig
On Tue, Sep 01, 2020 at 11:57:37AM -0700, Kees Cook wrote:
> On Sat, Aug 29, 2020 at 11:24:06AM +0200, Christoph Hellwig wrote:
> > On Thu, Aug 27, 2020 at 11:06:28AM -0700, Linus Torvalds wrote:
> > > On Thu, Aug 27, 2020 at 8:00 AM Christoph Hellwig  wrote:
> > > >
> > > > Once we can't manipulate the address limit, we also can't test what
> > > > happens when the manipulation is abused.
> > > 
> > > Just remove these tests entirely.
> > > 
> > > Once set_fs() doesn't exist on x86, the tests no longer make any sense
> > > what-so-ever, because test coverage will be basically zero.
> > > 
> > > So don't make the code uglier just to maintain a fiction that
> > > something is tested when it isn't really.
> > 
> > Sure fine with me unless Kees screams.
> 
> To clarify: if any of x86, arm64, arm, powerpc, riscv, and s390 are
> using set_fs(), I want to keep this test. "ugly" is fine in lkdtm. :)

And Linus wants them gone entirely, so I'll need a stage fight between
the two of you.  At least for this merge window I'm only planning on
x86 and power, plus maybe riscv if I get the work done in time.  Although
help from the maintainers would be welcome.  s390 has a driver that
still uses set_fs that will need some surgery, although it shouldn't
be too bad, but arm will be a piece of work.  Unless I get help it will
take a while.


Re: [RFC PATCH 1/2] KVM: PPC: Use the ppc_inst type

2020-09-02 Thread Jordan Niethe
On Wed, Sep 2, 2020 at 4:18 PM Paul Mackerras  wrote:
>
> On Thu, Aug 20, 2020 at 01:39:21PM +1000, Jordan Niethe wrote:
> > The ppc_inst type was added to help cope with the addition of prefixed
> > instructions to the ISA. Convert KVM to use this new type for dealing
> > with instructions. For now do not try to add further support for
> > prefixed instructions.
>
> This change does seem to splatter itself across a lot of code that
> mostly or exclusively runs on machines which are not POWER10 and will
> never need to handle prefixed instructions, unfortunately.  I wonder
> if there is a less invasive way to approach this.
Something less invasive would be good.
>
> In particular we are inflicting this 64-bit struct on 32-bit platforms
> unnecessarily (I assume, correct me if I am wrong here).
No, that is something that I wanted to avoid; on 32-bit platforms
it is a 32-bit struct:

struct ppc_inst {
u32 val;
#ifdef CONFIG_PPC64
u32 suffix;
#endif
} __packed;
>
> How would it be to do something like:
>
> typedef unsigned long ppc_inst_t;
>
> so it is 32 bits on 32-bit platforms and 64 bits on 64-bit platforms,
> and then use that instead of 'struct ppc_inst'?  You would still need
> to change the function declarations but I think most of the function
> bodies would not need to be changed.  In particular you would avoid a
> lot of the churn related to having to add ppc_inst_val() and suchlike.

Would the idea be to get rid of `struct ppc_inst` entirely or just not
use it in kvm?
In an earlier series I did something similar (at least code shared
between 32-bit and 64-bit would need helpers, but 32-bit-only code need
not change):

#ifdef __powerpc64__

typedef struct ppc_inst {
union {
struct {
u32 word;
u32 pad;
} __packed;
struct {
u32 prefix;
u32 suffix;
} __packed;
};
} ppc_inst;

#else /* !__powerpc64__ */

typedef u32 ppc_inst;
#endif

However mpe wanted to avoid using a typedef
(https://patchwork.ozlabs.org/comment/2391845/)

We did also talk about just using a u64 for instructions
(https://lore.kernel.org/linuxppc-dev/1585028462.t27rstc2uf.astr...@bobo.none/)
but the concern was that, as prefixed instructions act as two separate
u32s (the prefix is always before the suffix regardless of endianness),
keeping it as a u64 would lead to a lot of macros and potential
confusion.
But it does seem that if it can avoid a lot of needless churn it might
be worth the trade-off.
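
For illustration, the u64 alternative mentioned above might look something
like this (the names are made up; this is also where the endianness
bookkeeping the thread worries about would live):

/* Hypothetical encoding: prefix in the upper 32 bits, suffix in the lower. */
typedef u64 ppc_inst_t;

static inline u32 ppc_inst_prefix_word(ppc_inst_t inst)
{
	return inst >> 32;
}

static inline u32 ppc_inst_suffix_word(ppc_inst_t inst)
{
	return inst & 0xffffffff;
}

static inline bool ppc_inst_is_prefixed(ppc_inst_t inst)
{
	/* ISA v3.1 prefixes use primary opcode 1 */
	return (ppc_inst_prefix_word(inst) >> 26) == 1;
}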
>
> > -static inline unsigned make_dsisr(unsigned instr)
> > +static inline unsigned make_dsisr(struct ppc_inst instr)
> >  {
> >   unsigned dsisr;
> > + u32 word = ppc_inst_val(instr);
> >
> >
> >   /* bits  6:15 --> 22:31 */
> > - dsisr = (instr & 0x03ff0000) >> 16;
> > + dsisr = (word & 0x03ff0000) >> 16;
> >
> >   if (IS_XFORM(instr)) {
> >   /* bits 29:30 --> 15:16 */
> > - dsisr |= (instr & 0x00000006) << 14;
> > + dsisr |= (word & 0x00000006) << 14;
> >   /* bit 25 -->17 */
> > - dsisr |= (instr & 0x00000040) << 8;
> > + dsisr |= (word & 0x00000040) << 8;
> >   /* bits 21:24 --> 18:21 */
> > - dsisr |= (instr & 0x00000780) << 3;
> > + dsisr |= (word & 0x00000780) << 3;
> >   } else {
> >   /* bit  5 -->17 */
> > - dsisr |= (instr & 0x04000000) >> 12;
> > + dsisr |= (word & 0x04000000) >> 12;
> >   /* bits  1: 4 --> 18:21 */
> > - dsisr |= (instr & 0x78000000) >> 17;
> > + dsisr |= (word & 0x78000000) >> 17;
> >   /* bits 30:31 --> 12:13 */
> >   if (IS_DSFORM(instr))
> > - dsisr |= (instr & 0x00000003) << 18;
> > + dsisr |= (word & 0x00000003) << 18;
>
> Here I would have done something like:
>
> > -static inline unsigned make_dsisr(unsigned instr)
> > +static inline unsigned make_dsisr(struct ppc_inst pi)
> >  {
> >   unsigned dsisr;
> > + u32 instr = ppc_inst_val(pi);
>
> and left the rest of the function unchanged.
That is better.
>
> At first I wondered why we still had that function, since IBM Power
> CPUs have not set DSISR on an alignment interrupt since POWER3 days.
> It turns out this function is used by PR KVM when it is emulating
> one of the old 32-bit PowerPC CPUs (601, 603, 604, 750, 7450 etc.).
>
> > diff --git a/arch/powerpc/kernel/kvm.c b/arch/powerpc/kernel/kvm.c
>
> Despite the file name, this code is not used on IBM Power servers.
> It is for platforms which run under an ePAPR (not server PAPR)
> hypervisor (which would be a KVM variant, but generally book E KVM not
> book 3S).
>
> Paul.


Re: [PATCH 2/2] powerpc/vdso32: link vdso64 with linker

2020-09-02 Thread Christophe Leroy




On 9/1/20 10:25 PM, Nick Desaulniers wrote:

Rather than invoke the compiler as the driver, use the linker. That way
we can check --orphan-handling=warn support correctly, as cc-ldoption
was removed in
commit 055efab3120b ("kbuild: drop support for cc-ldoption").

Requires dropping the .got section.  I couldn't find how it was used in
the vdso32.

Fixes: commit f2af201002a8 ("powerpc/build: vdso linker warning for orphan 
sections")
Link: 
https://lore.kernel.org/lkml/cakwvodnn3wxydjomvnveyd_njwrku3fabwt_bs92duihhyw...@mail.gmail.com/
Signed-off-by: Nick Desaulniers 
---
Not sure removing .got is a good idea or not.  Otherwise I observe the
following link error:
powerpc-linux-gnu-ld: warning: orphan section `.got' from 
`arch/powerpc/kernel/vdso32/sigtramp.o' being placed in section `.got'
powerpc-linux-gnu-ld: _GLOBAL_OFFSET_TABLE_ not defined in linker created .got
powerpc-linux-gnu-ld: final link failed: bad value

sigtramp.c doesn't mention anything from the GOT AFAICT, and doesn't
look like it contains relocations that do, so I'm not sure where
references to _GLOBAL_OFFSET_TABLE_ are coming from.


I'm getting the same but only when building for PPC64.
I don't get any reference to sigtramp.o though:

  CALLscripts/checksyscalls.sh
  CALLscripts/atomic/check-atomics.sh
  VDSO32A arch/powerpc/kernel/vdso32/sigtramp.o
  VDSO32A arch/powerpc/kernel/vdso32/gettimeofday.o
  VDSO32A arch/powerpc/kernel/vdso32/datapage.o
  VDSO32A arch/powerpc/kernel/vdso32/cacheflush.o
  VDSO32A arch/powerpc/kernel/vdso32/note.o
  VDSO32A arch/powerpc/kernel/vdso32/getcpu.o
  LD  arch/powerpc/kernel/vdso32/vdso32.so.dbg
powerpc64-linux-ld: _GLOBAL_OFFSET_TABLE_ not defined in linker created .got
powerpc64-linux-ld: final link failed: Bad value

(GCC 8.1, Binutils 2.30)

So it seems that the .got section is being created by the linker. I don't
know why, though.



With GCC 10.1, binutils 2.34 I get:

  LDS arch/powerpc/kernel/vdso32/vdso32.lds
  VDSO32A arch/powerpc/kernel/vdso32/sigtramp.o
  VDSO32A arch/powerpc/kernel/vdso32/gettimeofday.o
  VDSO32A arch/powerpc/kernel/vdso32/datapage.o
  VDSO32A arch/powerpc/kernel/vdso32/cacheflush.o
  VDSO32A arch/powerpc/kernel/vdso32/note.o
  VDSO32A arch/powerpc/kernel/vdso32/getcpu.o
  LD  arch/powerpc/kernel/vdso32/vdso32.so.dbg
powerpc64-linux-ld: warning: orphan section `.branch_lt' from 
`arch/powerpc/kernel/vdso32/sigtramp.o' being placed in section `.branch_lt'

powerpc64-linux-ld: _GLOBAL_OFFSET_TABLE_ not defined in linker created .got
powerpc64-linux-ld: final link failed: bad value

I can't see any .branch_lt section when objdumping sigtramp.o or any 
other .o


When I move sigtramp.o to the end of the definition of obj-vdso32 in the
Makefile, I then get:


powerpc64-linux-ld: warning: orphan section `.branch_lt' from 
`arch/powerpc/kernel/vdso32/gettimeofday.o' being placed in section 
`.branch_lt'

powerpc64-linux-ld: _GLOBAL_OFFSET_TABLE_ not defined in linker created .got
powerpc64-linux-ld: final link failed: bad value


gettimeofday.o now being the first object in obj-vdso32


Christophe



  arch/powerpc/kernel/vdso32/Makefile | 7 +--
  arch/powerpc/kernel/vdso32/vdso32.lds.S | 3 ++-
  2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/vdso32/Makefile 
b/arch/powerpc/kernel/vdso32/Makefile
index 87ab1152d5ce..611a5951945a 100644
--- a/arch/powerpc/kernel/vdso32/Makefile
+++ b/arch/powerpc/kernel/vdso32/Makefile
@@ -27,6 +27,9 @@ UBSAN_SANITIZE := n
  ccflags-y := -shared -fno-common -fno-builtin -nostdlib \
-Wl,-soname=linux-vdso32.so.1 -Wl,--hash-style=both
  asflags-y := -D__VDSO32__ -s
+ldflags-y := -shared -soname linux-vdso32.so.1 \
+   $(call ld-option, --eh-frame-hdr) \
+   $(call ld-option, --orphan-handling=warn) -T
  
  obj-y += vdso32_wrapper.o

  extra-y += vdso32.lds
@@ -49,8 +52,8 @@ $(obj-vdso32): %.o: %.S FORCE
$(call if_changed_dep,vdso32as)
  
  # actual build commands

-quiet_cmd_vdso32ld = VDSO32L $@
-  cmd_vdso32ld = $(VDSOCC) $(c_flags) $(CC32FLAGS) -o $@ $(call 
cc-ldoption, -Wl$(comma)--orphan-handling=warn) -Wl,-T$(filter %.lds,$^) 
$(filter %.o,$^)
+quiet_cmd_vdso32ld = LD  $@
+  cmd_vdso32ld = $(cmd_ld)
  quiet_cmd_vdso32as = VDSO32A $@
cmd_vdso32as = $(VDSOCC) $(a_flags) $(CC32FLAGS) -c -o $@ $<
  
diff --git a/arch/powerpc/kernel/vdso32/vdso32.lds.S b/arch/powerpc/kernel/vdso32/vdso32.lds.S

index 4c985467a668..0ccdebad18b8 100644
--- a/arch/powerpc/kernel/vdso32/vdso32.lds.S
+++ b/arch/powerpc/kernel/vdso32/vdso32.lds.S
@@ -61,7 +61,6 @@ SECTIONS
.fixup  : { *(.fixup) }
  
  	.dynamic	: { *(.dynamic) }		:text	:dynamic

-   .got: { *(.got) }   :text
.plt: { *(.plt) }
  
  	_end = .;

@@ -108,7 +107,9 @@ SECTIONS
.debug_varnames  0 : { *(.debug_varnames) }
  
  	/DISCARD/	: {

+   *(.got)
*(.note.GNU-stack)
+   *(.branch_lt)

Re: [PATCH 2/2] powerpc/vdso32: link vdso64 with linker

2020-09-02 Thread Christophe Leroy




On 09/01/2020 10:25 PM, Nick Desaulniers wrote:

Rather than invoke the compiler as the driver, use the linker. That way
we can check --orphan-handling=warn support correctly, as cc-ldoption
was removed in
commit 055efab3120b ("kbuild: drop support for cc-ldoption").

Requires dropping the .got section.  I couldn't find how it was used in
the vdso32.


ld crashes:

  LD  arch/powerpc/kernel/vdso32/vdso32.so.dbg
/bin/sh: line 1: 23780 Segmentation fault  (core dumped) 
ppc-linux-ld -EB -m elf32ppc -shared -soname linux-vdso32.so.1 
--eh-frame-hdr --orphan-handling=warn -T 
arch/powerpc/kernel/vdso32/vdso32.lds 
arch/powerpc/kernel/vdso32/sigtramp.o 
arch/powerpc/kernel/vdso32/gettimeofday.o 
arch/powerpc/kernel/vdso32/datapage.o 
arch/powerpc/kernel/vdso32/cacheflush.o 
arch/powerpc/kernel/vdso32/note.o arch/powerpc/kernel/vdso32/getcpu.o -o 
arch/powerpc/kernel/vdso32/vdso32.so.dbg

make[4]: *** [arch/powerpc/kernel/vdso32/vdso32.so.dbg] Error 139


[root@localhost linux-powerpc]# ppc-linux-ld --version
GNU ld (GNU Binutils) 2.26.20160125


Christophe



Fixes: commit f2af201002a8 ("powerpc/build: vdso linker warning for orphan 
sections")
Link: 
https://lore.kernel.org/lkml/cakwvodnn3wxydjomvnveyd_njwrku3fabwt_bs92duihhyw...@mail.gmail.com/
Signed-off-by: Nick Desaulniers 
---
Not sure removing .got is a good idea or not.  Otherwise I observe the
following link error:
powerpc-linux-gnu-ld: warning: orphan section `.got' from 
`arch/powerpc/kernel/vdso32/sigtramp.o' being placed in section `.got'
powerpc-linux-gnu-ld: _GLOBAL_OFFSET_TABLE_ not defined in linker created .got
powerpc-linux-gnu-ld: final link failed: bad value

sigtramp.c doesn't mention anything from the GOT AFAICT, and doesn't
look like it contains relocations that do, so I'm not sure where
references to _GLOBAL_OFFSET_TABLE_ are coming from.

  arch/powerpc/kernel/vdso32/Makefile | 7 +--
  arch/powerpc/kernel/vdso32/vdso32.lds.S | 3 ++-
  2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/vdso32/Makefile 
b/arch/powerpc/kernel/vdso32/Makefile
index 87ab1152d5ce..611a5951945a 100644
--- a/arch/powerpc/kernel/vdso32/Makefile
+++ b/arch/powerpc/kernel/vdso32/Makefile
@@ -27,6 +27,9 @@ UBSAN_SANITIZE := n
  ccflags-y := -shared -fno-common -fno-builtin -nostdlib \
-Wl,-soname=linux-vdso32.so.1 -Wl,--hash-style=both
  asflags-y := -D__VDSO32__ -s
+ldflags-y := -shared -soname linux-vdso32.so.1 \
+   $(call ld-option, --eh-frame-hdr) \
+   $(call ld-option, --orphan-handling=warn) -T
  
  obj-y += vdso32_wrapper.o

  extra-y += vdso32.lds
@@ -49,8 +52,8 @@ $(obj-vdso32): %.o: %.S FORCE
$(call if_changed_dep,vdso32as)
  
  # actual build commands

-quiet_cmd_vdso32ld = VDSO32L $@
-  cmd_vdso32ld = $(VDSOCC) $(c_flags) $(CC32FLAGS) -o $@ $(call 
cc-ldoption, -Wl$(comma)--orphan-handling=warn) -Wl,-T$(filter %.lds,$^) 
$(filter %.o,$^)
+quiet_cmd_vdso32ld = LD  $@
+  cmd_vdso32ld = $(cmd_ld)
  quiet_cmd_vdso32as = VDSO32A $@
cmd_vdso32as = $(VDSOCC) $(a_flags) $(CC32FLAGS) -c -o $@ $<
  
diff --git a/arch/powerpc/kernel/vdso32/vdso32.lds.S b/arch/powerpc/kernel/vdso32/vdso32.lds.S

index 4c985467a668..0ccdebad18b8 100644
--- a/arch/powerpc/kernel/vdso32/vdso32.lds.S
+++ b/arch/powerpc/kernel/vdso32/vdso32.lds.S
@@ -61,7 +61,6 @@ SECTIONS
.fixup  : { *(.fixup) }
  
  	.dynamic	: { *(.dynamic) }		:text	:dynamic

-   .got: { *(.got) }   :text
.plt: { *(.plt) }
  
  	_end = .;

@@ -108,7 +107,9 @@ SECTIONS
.debug_varnames  0 : { *(.debug_varnames) }
  
  	/DISCARD/	: {

+   *(.got)
*(.note.GNU-stack)
+   *(.branch_lt)
*(.data .data.* .gnu.linkonce.d.* .sdata*)
*(.bss .sbss .dynbss .dynsbss)
*(.glink .iplt .plt .rela*)



Re: [RFC PATCH 2/2] KVM: PPC: Book3S HV: Support prefixed instructions

2020-09-02 Thread Paul Mackerras
On Thu, Aug 20, 2020 at 01:39:22PM +1000, Jordan Niethe wrote:
> There are two main places where instructions are loaded from the guest:
> * Emulate loadstore - such as when performing MMIO emulation
>   triggered by an HDSI
> * After an HV emulation assistance interrupt (e40)
> 
> If it is a prefixed instruction that triggers these cases, its suffix
> must be loaded. Use the SRR1_PREFIX bit to decide if a suffix needs to
> be loaded. Make sure if this bit is set inject_interrupt() also sets it
> when giving an interrupt to the guest.
> 
> ISA v3.10 extends the Hypervisor Emulation Instruction Register (HEIR)
> to 64 bits long to accommodate prefixed instructions. For interrupts
> caused by a word instruction the instruction is loaded into bits 32:63
> and bits 0:31 are zeroed. When caused by a prefixed instruction the
> prefix and suffix are loaded into bits 0:63.
> 
> Signed-off-by: Jordan Niethe 
> ---
>  arch/powerpc/kvm/book3s.c   | 15 +--
>  arch/powerpc/kvm/book3s_64_mmu_hv.c | 10 +++---
>  arch/powerpc/kvm/book3s_hv_builtin.c|  3 +++
>  arch/powerpc/kvm/book3s_hv_rmhandlers.S | 14 ++
>  4 files changed, 37 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
> index 70d8967acc9b..18b1928a571b 100644
> --- a/arch/powerpc/kvm/book3s.c
> +++ b/arch/powerpc/kvm/book3s.c
> @@ -456,13 +456,24 @@ int kvmppc_load_last_inst(struct kvm_vcpu *vcpu,
>  {
>   ulong pc = kvmppc_get_pc(vcpu);
>   u32 word;
> + u64 doubleword;
>   int r;
>  
>   if (type == INST_SC)
>   pc -= 4;
>  
> - r = kvmppc_ld(vcpu, &pc, sizeof(u32), &word, false);
> - *inst = ppc_inst(word);
> + if ((kvmppc_get_msr(vcpu) & SRR1_PREFIXED)) {
> + r = kvmppc_ld(vcpu, &pc, sizeof(u64), &doubleword, false);

Should we also have a check here that the doubleword is not crossing a
page boundary?  I can't think of a way to get this code to cross a
page boundary, assuming the hardware is working correctly, but it
makes me just a little nervous.

> +#ifdef CONFIG_CPU_LITTLE_ENDIAN
> + *inst = ppc_inst_prefix(doubleword & 0xffffffff, doubleword >> 32);
> +#else
> + *inst = ppc_inst_prefix(doubleword >> 32, doubleword & 0xffffffff);
> +#endif

Ick.  Is there a cleaner way to do this?

> + } else {
> + r = kvmppc_ld(vcpu, &pc, sizeof(u32), &word, false);
> + *inst = ppc_inst(word);
> + }
> +
>   if (r == EMULATE_DONE)
>   return r;
>   else
> diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c 
> b/arch/powerpc/kvm/book3s_64_mmu_hv.c
> index 775ce41738ce..0802471f4856 100644
> --- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
> +++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
> @@ -411,9 +411,13 @@ static int instruction_is_store(struct ppc_inst instr)
>   unsigned int mask;
>  
>   mask = 0x10000000;
> - if ((ppc_inst_val(instr) & 0xfc000000) == 0x7c000000)
> - mask = 0x100;   /* major opcode 31 */
> - return (ppc_inst_val(instr) & mask) != 0;
> + if (ppc_inst_prefixed(instr)) {
> + return (ppc_inst_suffix(instr) & mask) != 0;
> + } else {
> + if ((ppc_inst_val(instr) & 0xfc000000) == 0x7c000000)
> + mask = 0x100;   /* major opcode 31 */
> + return (ppc_inst_val(instr) & mask) != 0;
> + }

The way the code worked before, the mask depended on whether the
instruction was a D-form (or DS-form or other variant) instruction,
where you can tell loads and stores apart by looking at the major
opcode, or an X-form instruction, where you look at the minor opcode.

Now we are only looking at the minor opcode if it is not a prefixed
instruction.  Are there no X-form prefixed loads or stores?

Paul.


Re: [RFC PATCH 1/2] KVM: PPC: Use the ppc_inst type

2020-09-02 Thread Paul Mackerras
On Thu, Aug 20, 2020 at 01:39:21PM +1000, Jordan Niethe wrote:
> The ppc_inst type was added to help cope with the addition of prefixed
> instructions to the ISA. Convert KVM to use this new type for dealing
> with instructions. For now do not try to add further support for
> prefixed instructions.

This change does seem to splatter itself across a lot of code that
mostly or exclusively runs on machines which are not POWER10 and will
never need to handle prefixed instructions, unfortunately.  I wonder
if there is a less invasive way to approach this.

In particular we are inflicting this 64-bit struct on 32-bit platforms
unnecessarily (I assume, correct me if I am wrong here).

How would it be to do something like:

typedef unsigned long ppc_inst_t;

so it is 32 bits on 32-bit platforms and 64 bits on 64-bit platforms,
and then use that instead of 'struct ppc_inst'?  You would still need
to change the function declarations but I think most of the function
bodies would not need to be changed.  In particular you would avoid a
lot of the churn related to having to add ppc_inst_val() and suchlike.

> -static inline unsigned make_dsisr(unsigned instr)
> +static inline unsigned make_dsisr(struct ppc_inst instr)
>  {
>   unsigned dsisr;
> + u32 word = ppc_inst_val(instr);
>  
>  
>   /* bits  6:15 --> 22:31 */
> - dsisr = (instr & 0x03ff0000) >> 16;
> + dsisr = (word & 0x03ff0000) >> 16;
>  
>   if (IS_XFORM(instr)) {
>   /* bits 29:30 --> 15:16 */
> - dsisr |= (instr & 0x00000006) << 14;
> + dsisr |= (word & 0x00000006) << 14;
>   /* bit 25 -->17 */
> - dsisr |= (instr & 0x00000040) << 8;
> + dsisr |= (word & 0x00000040) << 8;
>   /* bits 21:24 --> 18:21 */
> - dsisr |= (instr & 0x00000780) << 3;
> + dsisr |= (word & 0x00000780) << 3;
>   } else {
>   /* bit  5 -->17 */
> - dsisr |= (instr & 0x04000000) >> 12;
> + dsisr |= (word & 0x04000000) >> 12;
>   /* bits  1: 4 --> 18:21 */
> - dsisr |= (instr & 0x78000000) >> 17;
> + dsisr |= (word & 0x78000000) >> 17;
>   /* bits 30:31 --> 12:13 */
>   if (IS_DSFORM(instr))
> - dsisr |= (instr & 0x00000003) << 18;
> + dsisr |= (word & 0x00000003) << 18;

Here I would have done something like:

> -static inline unsigned make_dsisr(unsigned instr)
> +static inline unsigned make_dsisr(struct ppc_inst pi)
>  {
>   unsigned dsisr;
> + u32 instr = ppc_inst_val(pi);

and left the rest of the function unchanged.

At first I wondered why we still had that function, since IBM Power
CPUs have not set DSISR on an alignment interrupt since POWER3 days.
It turns out this function is used by PR KVM when it is emulating
one of the old 32-bit PowerPC CPUs (601, 603, 604, 750, 7450 etc.).

> diff --git a/arch/powerpc/kernel/kvm.c b/arch/powerpc/kernel/kvm.c

Despite the file name, this code is not used on IBM Power servers.
It is for platforms which run under an ePAPR (not server PAPR)
hypervisor (which would be a KVM variant, but generally book E KVM not
book 3S).

Paul.


Re: [PATCH 10/10] powerpc: remove address space overrides using set_fs()

2020-09-02 Thread Christophe Leroy




On 27/08/2020 at 17:00, Christoph Hellwig wrote:

Stop providing the possibility to override the address space using
set_fs() now that there is no need for that any more.

Signed-off-by: Christoph Hellwig 
---
  arch/powerpc/Kconfig   |  1 -
  arch/powerpc/include/asm/processor.h   |  7 ---
  arch/powerpc/include/asm/thread_info.h |  5 +--
  arch/powerpc/include/asm/uaccess.h | 62 --
  arch/powerpc/kernel/signal.c   |  3 --
  arch/powerpc/lib/sstep.c   |  6 +--
  6 files changed, 22 insertions(+), 62 deletions(-)

diff --git a/arch/powerpc/include/asm/uaccess.h 
b/arch/powerpc/include/asm/uaccess.h
index 7fe3531ad36a77..39727537d39701 100644
--- a/arch/powerpc/include/asm/uaccess.h
+++ b/arch/powerpc/include/asm/uaccess.h
@@ -8,62 +8,36 @@
  #include 
  #include 
  
-/*

- * The fs value determines whether argument validity checking should be
- * performed or not.  If get_fs() == USER_DS, checking is performed, with
- * get_fs() == KERNEL_DS, checking is bypassed.
- *
- * For historical reasons, these macros are grossly misnamed.
- *
- * The fs/ds values are now the highest legal address in the "segment".
- * This simplifies the checking in the routines below.
- */
-
-#define MAKE_MM_SEG(s)  ((mm_segment_t) { (s) })
-
-#define KERNEL_DS  MAKE_MM_SEG(~0UL)
  #ifdef __powerpc64__
  /* We use TASK_SIZE_USER64 as TASK_SIZE is not constant */
-#define USER_DSMAKE_MM_SEG(TASK_SIZE_USER64 - 1)
-#else
-#define USER_DSMAKE_MM_SEG(TASK_SIZE - 1)
-#endif
-
-#define get_fs()   (current->thread.addr_limit)
+#define TASK_SIZE_MAX  TASK_SIZE_USER64
  
-static inline void set_fs(mm_segment_t fs)

+static inline bool __access_ok(unsigned long addr, unsigned long size)
  {
-   current->thread.addr_limit = fs;
-   /* On user-mode return check addr_limit (fs) is correct */
-   set_thread_flag(TIF_FSCHECK);
+   if (addr >= TASK_SIZE_MAX)
+   return false;
+   /*
+* This check is sufficient because there is a large enough gap between
+* user addresses and the kernel addresses.
+*/
+   return size <= TASK_SIZE_MAX;
  }
-
-#define uaccess_kernel() (get_fs().seg == KERNEL_DS.seg)
-#define user_addr_max()(get_fs().seg)
-
-#ifdef __powerpc64__
-/*
- * This check is sufficient because there is a large enough
- * gap between user addresses and the kernel addresses
- */
-#define __access_ok(addr, size, segment)   \
-   (((addr) <= (segment).seg) && ((size) <= (segment).seg))
-
  #else
+#define TASK_SIZE_MAX  TASK_SIZE
  
-static inline int __access_ok(unsigned long addr, unsigned long size,

-   mm_segment_t seg)
+static inline bool __access_ok(unsigned long addr, unsigned long size)
  {
-   if (addr > seg.seg)
-   return 0;
-   return (size == 0 || size - 1 <= seg.seg - addr);
+   if (addr >= TASK_SIZE_MAX)
+   return false;
+   if (size == 0)
+   return false;


__access_ok() was returning true when size == 0 up to now. Any reason to
return false now?



+   return size <= TASK_SIZE_MAX - addr;
  }
-
-#endif
+#endif /* __powerpc64__ */
  
  #define access_ok(addr, size)		\

(__chk_user_ptr(addr),  \
-__access_ok((__force unsigned long)(addr), (size), get_fs()))
+__access_ok((unsigned long)(addr), (size)))
  
  /*

   * These are the main single-value transfer routines.  They automatically


Christophe
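
As an aside on the size == 0 question above, here is a standalone mock-up of
the behavioural difference between the two 32-bit variants quoted from the
diff (the TASK_SIZE_MAX value is arbitrary, purely for illustration):

#include <stdbool.h>
#include <stdio.h>

#define TASK_SIZE_MAX 0xb0000000UL	/* arbitrary stand-in for the limit */

/* Old behaviour: size == 0 is accepted as long as addr is in range. */
static bool old_access_ok(unsigned long addr, unsigned long size)
{
	if (addr > TASK_SIZE_MAX)
		return false;
	return size == 0 || size - 1 <= TASK_SIZE_MAX - addr;
}

/* New behaviour from the patch: size == 0 is rejected. */
static bool new_access_ok(unsigned long addr, unsigned long size)
{
	if (addr >= TASK_SIZE_MAX)
		return false;
	if (size == 0)
		return false;
	return size <= TASK_SIZE_MAX - addr;
}

int main(void)
{
	/* prints "old: 1, new: 0" */
	printf("old: %d, new: %d\n", old_access_ok(0x1000, 0), new_access_ok(0x1000, 0));
	return 0;
}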


Re: [PATCH v1 09/10] powerpc/pseries/iommu: Make use of DDW even if it does not map the partition

2020-09-02 Thread Leonardo Bras
On Mon, 2020-08-31 at 14:35 +1000, Alexey Kardashevskiy wrote:
> 
> On 29/08/2020 04:36, Leonardo Bras wrote:
> > On Mon, 2020-08-24 at 15:17 +1000, Alexey Kardashevskiy wrote:
> > > On 18/08/2020 09:40, Leonardo Bras wrote:
> > > > As of today, if the biggest DDW that can be created can't map the whole
> > > > partition, its creation is skipped and the default DMA window
> > > > "ibm,dma-window" is used instead.
> > > > 
> > > > DDW is 16x bigger than the default DMA window,
> > > 
> > > 16x only under very specific circumstances which are
> > > 1. phyp
> > > 2. sriov
> > > 3. device class in hmc (or what that priority number is in the lpar 
> > > config).
> > 
> > Yeah, missing details.
> > 
> > > > having the same amount of
> > > > pages, but increasing the page size to 64k.
> > > > Besides larger DMA window,
> > > 
> > > "Besides being larger"?
> > 
> > You are right there.
> > 
> > > > it performs better for allocations over 4k,
> > > 
> > > Better how?
> > 
> > I was thinking for allocations larger than (512 * 4k), since >2
> > hypercalls are needed here, while for 64k pages it would still be just 1
> > hypercall up to (512 * 64k). 
> > But yeah, not the usual case anyway.
> 
> Yup.
> 
> 
> > > > so it would be nice to use it instead.
> > > 
> > > I'd rather say something like:
> > > ===
> > > So far we assumed we can map the guest RAM 1:1 to the bus which worked
> > > with a small number of devices. SRIOV changes it as the user can
> > > configure hundreds VFs and since phyp preallocates TCEs and does not
> > > allow IOMMU pages bigger than 64K, it has to limit the number of TCEs
> > > per a PE to limit waste of physical pages.
> > > ===
> > 
> > I mixed this in my commit message, it looks like this:
> > 
> > ===
> > powerpc/pseries/iommu: Make use of DDW for indirect mapping
> > 
> > So far it's assumed possible to map the guest RAM 1:1 to the bus, which
> > works with a small number of devices. SRIOV changes it as the user can
> > configure hundreds of VFs and since phyp preallocates TCEs and does not
> > allow IOMMU pages bigger than 64K, it has to limit the number of TCEs
> > per a PE to limit waste of physical pages.
> > 
> > As of today, if the assumed direct mapping is not possible, DDW
> > creation is skipped and the default DMA window "ibm,dma-window" is used
> > instead.
> > 
> > The default DMA window uses 4k pages instead of 64k pages, and since
> > the amount of pages is the same,
> 
> Is the amount really the same? I thought you can prioritize some VFs
> over others (== allocate different number of TCEs). Does it really
> matter if it is the same?

In a conversation with Travis Pizel, he explained how it's supposed to
work, and I understood this:

When a VF is created, it will be assigned a capacity, like 4%, 20%, and
so on. The number of 'TCE entries' that are available to that partition
is proportional to that capacity. 

If we use the default DMA window, the IOMMU pagesize/entry will be 4k,
and if we use DDW, we will get 64k pagesize. As the number of entries
will be the same (for the same capacity), the total space that can be
addressed by the IOMMU will be 16 times bigger. This sometimes enables
direct mapping, but sometimes it's still not enough.

On Travis words :
"A low capacity VF, with less resources available, will certainly have
less DMA window capability than a high capacity VF. But, an 8GB DMA
window (with 64k pages) is still 16x larger than an 512MB window (with
4K pages).
A high capacity VF - for example, one that Leonardo has in his scenario
- will go from 8GB (using 4K pages) to 128GB (using 64K pages) - again,
16x larger - but it's obviously still possible to create a partition
that exceeds 128GB of memory in size."
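
A back-of-the-envelope illustration of that 16x point (the TCE count is a
made-up example, not a number from the thread):

#include <stdio.h>

int main(void)
{
	unsigned long long tces = 131072;	/* hypothetical TCE count fixed by the VF's capacity */
	unsigned long long mb = 1024 * 1024;

	/* Same number of TCEs, only the IOMMU page size changes. */
	printf("4K  pages: %llu MB addressable\n", tces * 4096 / mb);	/* 512 MB */
	printf("64K pages: %llu MB addressable\n", tces * 65536 / mb);	/* 8192 MB */
	return 0;
}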

> 
> 
> > making use of DDW instead of the
> > default DMA window for indirect mapping will expand by 16x the amount
> > of memory that can be mapped for DMA.
> 
> Stop saying "16x", it is not guaranteed by anything :)
> 
> 
> > The DDW created will be used for direct mapping by default. [...]
> > ===
> > 
> > What do you think?
> > 
> > > > The DDW created will be used for direct mapping by default.
> > > > If it's not available, indirect mapping will be used instead.
> > > > 
> > > > For indirect mapping, it's necessary to update the iommu_table so
> > > > iommu_alloc() can use the DDW created. For this,
> > > > iommu_table_update_window() is called when everything else succeeds
> > > > at enable_ddw().
> > > > 
> > > > Removing the default DMA window for using DDW with indirect mapping
> > > > is only allowed if there is no current IOMMU memory allocated in
> > > > the iommu_table. enable_ddw() is aborted otherwise.
> > > > 
> > > > As there will never be both direct and indirect mappings at the same
> > > > time, the same property name can be used for the created DDW.
> > > > 
> > > > So renaming
> > > > define DIRECT64_PROPNAME "linux,direct64-ddr-window-info"
> > > > to
> > > > define DMA64_PROPNAME "linux,dma64-ddr-window-info"
> > > > looks the right