Re: powerpc: remove unnecessary unlikely()

2019-01-23 Thread Michael Ellerman
On Fri, 2018-09-07 at 15:35:26 UTC, Igor Stoppa wrote:
> WARN_ON() already contains an unlikely(), so it's not necessary to
> wrap it into another.
> 
> Signed-off-by: Igor Stoppa 
> Cc: Arseny Solokha 
> Cc: Benjamin Herrenschmidt 
> Cc: Paul Mackerras 
> Cc: Michael Ellerman 
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux-ker...@vger.kernel.org

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/63da6caeb84cfad3d1e5774b7049dd1d

cheers


Re: powerpc: Allow CPU selection of G4/74xx variant

2019-01-23 Thread Michael Ellerman
On Mon, 2019-01-14 at 20:13:04 UTC, Mathieu Malaterre wrote:
> GCC supports -mcpu=G4
> 
> This patch gives the opportunity to select ALTIVEC for this variant.
> 
> Signed-off-by: Mathieu Malaterre 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/9bd10b649826774bb5e1e7fb67544e65

cheers


Re: [kernel] powerpc/powernv/npu: Remove obsolete comment about TCE_KILL_INVAL_ALL

2019-01-23 Thread Michael Ellerman
On Mon, 2019-01-14 at 00:41:38 UTC, Alexey Kardashevskiy wrote:
> TCE_KILL_INVAL_ALL has moved long ago but the comment was forgotted so
> finish the move and remove the comment.
> 
> Fixes: 0bbcdb437da0c4a "powerpc/powernv/npu: TCE Kill helpers cleanup"
> Signed-off-by: Alexey Kardashevskiy 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/797eadd9c80ca3b3f913ccde29f8a601

cheers


Re: [kernel] powerpc/powernv: Remove never used pnv_power9_force_smt4

2019-01-23 Thread Michael Ellerman
On Mon, 2019-01-14 at 00:40:27 UTC, Alexey Kardashevskiy wrote:
> This removes never used symbol - pnv_power9_force_smt4.
> 
> Note that we might still want to add stubs for:
>   void pnv_power9_force_smt4_catch(void);
>   void pnv_power9_force_smt4_release(void);
> 
> Fixes: 7672691a08c88 "powerpc/powernv: Provide a way to force a core into 
> SMT4 mode"
> Signed-off-by: Alexey Kardashevskiy 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/c35f78d7a422750917029d20d9e57000

cheers


Re: [kernel] KVM: PPC: Fix compile when CONFIG_PPC_RADIX_MMU is not defined

2019-01-23 Thread Michael Ellerman
On Mon, 2019-01-14 at 00:38:49 UTC, Alexey Kardashevskiy wrote:
> This adds some stubs for hash only configs.
> 
> Signed-off-by: Alexey Kardashevskiy 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/cd6b8a631c5de3a6b7c8ef30337fd02b

cheers


Re: [1/2] powerpc: wii.dts: Add interrupt-related properties to GPIO node

2019-01-23 Thread Michael Ellerman
On Sat, 2019-01-12 at 16:21:23 UTC, =?utf-8?q?Jonathan_Neusch=C3=A4fer?= wrote:
> The Hollywood GPIO controller is connected to the Hollywood PIC ()
> at IRQs 10 and 11; IRQ 10 for GPIO lines that are configured for access
> by the PPC, 11 for GPIO lines that are configured for access by the
> ARM926.
> 
> Signed-off-by: Jonathan Neuschäfer 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/f4ddc19a711736eb54fb8259499faf5a

cheers


Re: [1/2] powerpc: wii.dts: Add interrupt-related properties to GPIO node

2019-01-23 Thread Michael Ellerman
On Sat, 2019-01-12 at 16:21:23 UTC, =?utf-8?q?Jonathan_Neusch=C3=A4fer?= wrote:
> The Hollywood GPIO controller is connected to the Hollywood PIC ()
> at IRQs 10 and 11; IRQ 10 for GPIO lines that are configured for access
> by the PPC, 11 for GPIO lines that are configured for access by the
> ARM926.
> 
> Signed-off-by: Jonathan Neuschäfer 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/8de7547e03059281fda075355c114694

cheers


Re: powerpc: Use ALIGN instead of BLOCK

2019-01-23 Thread Michael Ellerman
On Fri, 2019-01-11 at 23:50:56 UTC, Joel Stanley wrote:
> In the ld documentation under Builtin Functions:
> 
>   BLOCK(exp)
> 
> This is a synonym for ALIGN, for compatibility with older linker scripts.
> 
> Clang's linker (lld) doesn't know about BLOCK so remove this use of it.
> 
> Link: https://github.com/ClangBuiltLinux/linux/issues/253
> Signed-off-by: Joel Stanley 
> Reviewed-by: Nick Desaulniers 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/a652758ac1475f69d28d11b3528c4f48

cheers


Re: [1/3] KVM: powerpc: remove -I. header search paths

2019-01-23 Thread Michael Ellerman
On Fri, 2019-01-11 at 03:22:31 UTC, Masahiro Yamada wrote:
> The header search path -I. in kernel Makefiles is very suspicious;
> it allows the compiler to search for headers in the top of $(srctree),
> where obviously no header file exists.
> 
> Commit 46f43c6ee022 ("KVM: powerpc: convert marker probes to event
> trace") first added these options, but they are completely useless.
> 
> Signed-off-by: Masahiro Yamada 

Series applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/c142e9741e61577c45f2441214c999f2

cheers


Re: [v2,1/3] powerpc: Stop using pr_cont() in __die()

2019-01-23 Thread Michael Ellerman
On Thu, 2019-01-10 at 11:57:35 UTC, Michael Ellerman wrote:
> Using pr_cont() risks having our output interleaved with other output
> from other CPUs. Instead print everything in a single printk() call.
> 
> Signed-off-by: Michael Ellerman 
> Reviewed-by: Christophe Leroy 

Series applied to powerpc next.

https://git.kernel.org/powerpc/c/782274434d6f2e8aa8c573cb24fef94a

cheers


Re: [-next] powerpc/mm: Fix debugfs_simple_attr.cocci warnings

2019-01-23 Thread Michael Ellerman
On Wed, 2019-01-09 at 12:10:58 UTC, YueHaibing wrote:
> Use DEFINE_DEBUGFS_ATTRIBUTE rather than DEFINE_SIMPLE_ATTRIBUTE
> for debugfs files.
> 
> Semantic patch information:
> Rationale: DEFINE_SIMPLE_ATTRIBUTE + debugfs_create_file()
> imposes some significant overhead as compared to
> DEFINE_DEBUGFS_ATTRIBUTE + debugfs_create_file_unsafe().
> 
> Generated by: scripts/coccinelle/api/debugfs/debugfs_simple_attr.cocci
> 
> Signed-off-by: YueHaibing 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/7cd4774ff7a495839f49a582be729959

cheers


Re: powerpc/ps3: Use struct_size() in kzalloc()

2019-01-23 Thread Michael Ellerman
On Tue, 2019-01-08 at 21:00:10 UTC, "Gustavo A. R. Silva" wrote:
> One of the more common cases of allocation size calculations is finding the
> size of a structure that has a zero-sized array at the end, along with memory
> for some number of elements for that array. For example:
> 
> struct foo {
> int stuff;
> void *entry[];
> };
> 
> instance = kzalloc(sizeof(struct foo) + sizeof(void *) * count, GFP_KERNEL);
> 
> Instead of leaving these open-coded and prone to type mistakes, we can now
> use the new struct_size() helper:
> 
> instance = kzalloc(struct_size(instance, entry, count), GFP_KERNEL);
> 
> This code was detected with the help of Coccinelle.
> 
> Signed-off-by: Gustavo A. R. Silva 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/31367b9a01d6a3f4f77694bd44f547d6

cheers


Re: powerpc/spufs: use struct_size() in kmalloc()

2019-01-23 Thread Michael Ellerman
On Tue, 2019-01-08 at 18:37:20 UTC, "Gustavo A. R. Silva" wrote:
> One of the more common cases of allocation size calculations is finding
> the size of a structure that has a zero-sized array at the end, along
> with memory for some number of elements for that array. For example:
> 
> struct foo {
> int stuff;
> void *entry[];
> };
> 
> instance = kmalloc(sizeof(struct foo) + sizeof(void *) * count, GFP_KERNEL);
> 
> Instead of leaving these open-coded and prone to type mistakes, we can
> now use the new struct_size() helper:
> 
> instance = kmalloc(struct_size(instance, entry, count), GFP_KERNEL);
> 
> This code was detected with the help of Coccinelle.
> 
> Signed-off-by: Gustavo A. R. Silva 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/00def7130af8b3fad1bdef98429c94a6

cheers


Re: powerpc/ipic: drop unused functions

2019-01-23 Thread Michael Ellerman
On Tue, 2019-01-08 at 15:08:00 UTC, Christophe Leroy wrote:
> ipic_set_highest_priority(), ipic_enable_mcp() and ipic_disable_mcp()
> are unused. This patch drops them.
> 
> Signed-off-by: Christophe Leroy 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/8acb88682cc00a41a677c2455a7c992d

cheers


Re: powerpc: build virtex dtb

2019-01-23 Thread Michael Ellerman
On Tue, 2019-01-08 at 12:52:50 UTC, Corentin Labbe wrote:
> I wanted to test the virtex440-ml507 qemu machine and found that the dtb
> for it was not builded.
> All powerpc DTB are only built when CONFIG_OF_ALL_DTBS is set which depend on
> COMPILE_TEST.
> 
> This patchs adds build of virtex dtbs depending on
> CONFIG_XILINX_VIRTEX440_GENERIC_BOARD option.
> 
> Signed-off-by: Corentin Labbe 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/acef5e0165912c459a9ae98a25f0f879

cheers


Re: powerpc/irq: drop arch_early_irq_init()

2019-01-23 Thread Michael Ellerman
On Tue, 2019-01-08 at 11:37:19 UTC, Christophe Leroy wrote:
> arch_early_irq_init() does nothing different than
> the weak arch_early_irq_init() in kernel/softirq.c
> 
> Fixes: 089fb442f301 ("powerpc: Use ARCH_IRQ_INIT_FLAGS")
> Cc: Thomas Gleixner 
> Signed-off-by: Christophe Leroy 
> Acked-by: Thomas Gleixner 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/607ea5090b3fb61fea1d0bc5278e6c1d

cheers


Re: [v4] selftests/powerpc: New TM signal self test

2019-01-23 Thread Michael Ellerman
On Tue, 2019-01-08 at 11:31:21 UTC, Breno Leitao wrote:
> A new self test that forces MSR[TS] to be set without calling any TM
> instruction. This test also tries to cause a page fault at a signal
> handler, exactly between MSR[TS] set and tm_recheckpoint(), forcing
> thread->texasr to be rewritten with TEXASR[FS] = 0, which will cause a BUG
> when tm_recheckpoint() is called.
> 
> This test is not deterministic, since it is hard to guarantee that the page
> access will cause a page fault. In order to force more page faults at
> signal context, the signal handler and the ucontext are being mapped into a
> MADV_DONTNEED memory chunks.
> 
> Tests have shown that the bug could be exposed with few interactions in a
> buggy kernel. This test is configured to loop 5000x, having a good chance
> to hit the kernel issue in just one run.  This self test takes less than
> two seconds to run.
> 
> This test uses set/getcontext because the kernel will recheckpoint
> zeroed structures, causing the test to segfault, which is undesired because
> the test needs to rerun, so, there is a signal handler for SIGSEGV which
> will restart the test.
> 
> v2: Uses the MADV_DONTNEED memory advice
> v3: Fix memcpy and 32-bits compilation
> v4: Does not define unused macros
> 
> Signed-off-by: Breno Leitao 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/a65329aa7d613288626275546074f1aa

cheers


Re: powerpc: use a CONSOLE_LOGLEVEL_DEBUG macro

2019-01-23 Thread Michael Ellerman
On Mon, 2019-01-07 at 09:57:20 UTC, Sergey Senozhatsky wrote:
> Use a CONSOLE_LOGLEVEL_DEBUG macro for console_loglevel rather
> than a naked number.
> 
> Signed-off-by: Sergey Senozhatsky 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/fae1383b38a105a0454acab19b094c51

cheers


Re: [02/12] powerpc/hvsi: fix typo

2019-01-23 Thread Michael Ellerman
On Fri, 2019-01-04 at 21:31:52 UTC, Matteo Croce wrote:
> Fix spelling mistake: "lenght" -> "length"
> 
> Signed-off-by: Matteo Croce 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/3b702ddd066813952154c22dd76d3b0c

cheers


Re: [PATCH -next] usb: host: drop pointless static qualifier

2019-01-23 Thread YueHaibing
On 2019/1/24 0:09, Alan Stern wrote:
> On Wed, 23 Jan 2019, YueHaibing wrote:
> 
>> There is no need to have the 'dummy_mask' variable static since new
>> value always be assigned before use it.
>>
>> Signed-off-by: YueHaibing 
>> ---
>>  drivers/usb/host/ehci-ps3.c | 2 +-
>>  drivers/usb/host/ohci-ps3.c | 2 +-
>>  2 files changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/usb/host/ehci-ps3.c b/drivers/usb/host/ehci-ps3.c
>> index 454d8c6..91cee02 100644
>> --- a/drivers/usb/host/ehci-ps3.c
>> +++ b/drivers/usb/host/ehci-ps3.c
>> @@ -86,7 +86,7 @@ static int ps3_ehci_probe(struct ps3_system_bus_device 
>> *dev)
>>  int result;
>>  struct usb_hcd *hcd;
>>  unsigned int virq;
>> -static u64 dummy_mask;
>> +u64 dummy_mask;
>>  
>>  if (usb_disabled()) {
>>  result = -ENODEV;
>> diff --git a/drivers/usb/host/ohci-ps3.c b/drivers/usb/host/ohci-ps3.c
>> index 395f9d3..a1c1bdf 100644
>> --- a/drivers/usb/host/ohci-ps3.c
>> +++ b/drivers/usb/host/ohci-ps3.c
>> @@ -69,7 +69,7 @@ static int ps3_ohci_probe(struct ps3_system_bus_device 
>> *dev)
>>  int result;
>>  struct usb_hcd *hcd;
>>  unsigned int virq;
>> -static u64 dummy_mask;
>> +u64 dummy_mask;
>>  
>>  if (usb_disabled()) {
>>  result = -ENODEV;
> 
> No.  You need to read the code and understand how a variable is used
> before you decide to modify it.

Sorry, I misread the code, so just leave it as is.

> 
> In this case, a suitable approach would be to change the declaration 
> so that it says:
> 
>   status u64 dummy_mask = DMA_BIT_MASK(32);
> 
> and remove the line that does the assignment dynamically.
> 
> Alan Stern
> 
> 
> .
> 



Re: [PATCH v13 00/10] powerpc: Switch to CONFIG_THREAD_INFO_IN_TASK

2019-01-23 Thread Michael Ellerman
Christophe Leroy  writes:
> Le 12/01/2019 à 10:55, Christophe Leroy a écrit :
>> The purpose of this serie is to activate CONFIG_THREAD_INFO_IN_TASK which
>> moves the thread_info into task_struct.
>> 
>> Moving thread_info into task_struct has the following advantages:
>> - It protects thread_info from corruption in the case of stack
>> overflows.
>> - Its address is harder to determine if stack addresses are
>> leaked, making a number of attacks more difficult.
>
> I ran null_syscall and context_switch benchmark selftests and the result 
> is surprising. There is slight degradation in context_switch and a 
> significant one on null_syscall:
>
> Without the serie:
>
> ~# chrt -f 98 ./context_switch --no-altivec --no-vector --no-fp
> 55542
> 55562
> 55564
> 55562
> 55568
> ...
>
> ~# ./null_syscall
> 2546.71 ns 336.17 cycles
>
>
> With the serie:
>
> ~# chrt -f 98 ./context_switch --no-altivec --no-vector --no-fp
> 55138
> 55142
> 55152
> 55144
> 55142
>
> ~# ./null_syscall
> 3479.54 ns 459.30 cycles
>
> So 0,8% less context switches per second and 37% more time for one syscall ?
>
> Any idea ?

What platform is that on?

On 64-bit we have to turn one mtmsrd into two and that's obviously a
slow down. But I don't see that you've done anything similar in 32-bit
code.

I assume it's patch 8 that causes the slow down?

cheers


Re: [PATCH v13 00/10] powerpc: Switch to CONFIG_THREAD_INFO_IN_TASK

2019-01-23 Thread Michael Ellerman
Christophe Leroy  writes:
> Le 19/01/2019 à 11:23, Michael Ellerman a écrit :
>> Christophe Leroy  writes:
>> 
>>> The purpose of this serie is to activate CONFIG_THREAD_INFO_IN_TASK which
>>> moves the thread_info into task_struct.
>>>
>>> Moving thread_info into task_struct has the following advantages:
>>> - It protects thread_info from corruption in the case of stack
>>> overflows.
>>> - Its address is harder to determine if stack addresses are
>>> leaked, making a number of attacks more difficult.
>>>
>>> Changes since v12:
>>>   - Patch 1: Taken comment from Mike (re-introduced the 'panic' in case 
>>> memblock allocation fails in setup_64.c
>>>   - Patch 1: Added alloc_stack() function in setup_32.c to also panic in 
>>> case of allocation failure.
>> 
>> Hi Christophe,
>> 
>> I can't get this series to boot on qemu mac99. I'm getting eg:
>> 
>> [0.981514] NFS: Registering the id_resolver key type
>> [0.981752] Key type id_resolver registered
>> [0.981868] Key type id_legacy registered
>> [0.995711] Unrecoverable exception 0 at 0 (msr=0)
>> [0.996091] Oops: Unrecoverable exception, sig: 6 [#1]
>> [0.996314] BE PAGE_SIZE=4K MMU=Hash PowerMac
>> [0.996617] Modules linked in:
>> [0.996869] CPU: 0 PID: 416 Comm: modprobe Not tainted 
>> 5.0.0-rc2-gcc-7.3.0-00043-g53f2de798792 #342
>> [0.997138] NIP:   LR:  CTR: 
>> [0.997309] REGS: ef237f50 TRAP:    Not tainted  
>> (5.0.0-rc2-gcc-7.3.0-00043-g53f2de798792)
>> [0.997508] MSR:   <>  CR:   XER: 
>> [0.997712]
>> [0.997712] GPR00:  ef238000     
>>  
>> [0.997712] GPR08:       
>> c006477c ef13d8c0
>> [0.997712] GPR16:       
>>  
>> [0.997712] GPR24:       
>>  
>> [0.998671] NIP []   (null)
>> [0.998774] LR []   (null)
>> [0.998895] Call Trace:
>> [0.999030] Instruction dump:
>> [0.999320]       
>>  
>> [0.999546]     6000  
>>  
>> [1.23] ---[ end trace 925ea3419844fe68 ]---
>
> No such issue on my side. Do you have a ramdisk with anythink special or 
> a special config ? I see your kernel is modprobing something, know what 
> it is ?

It's just a debian installer image, nothing special AFAIK.

> Especially, what is the amount of memory in your config ? On my side 
> there is 128M:

I have 1G.

But today I can't reproduce the crash :/

So I guess it must have been something else in my config.

cheers


Re: [PATCH v13 00/10] powerpc: Switch to CONFIG_THREAD_INFO_IN_TASK

2019-01-23 Thread Michael Ellerman
LEROY Christophe  writes:
> Michael Ellerman  a écrit :
>
>> Christophe Leroy  writes:
>>
>>> The purpose of this serie is to activate CONFIG_THREAD_INFO_IN_TASK which
>>> moves the thread_info into task_struct.
>>>
>>> Moving thread_info into task_struct has the following advantages:
>>> - It protects thread_info from corruption in the case of stack
>>> overflows.
>>> - Its address is harder to determine if stack addresses are
>>> leaked, making a number of attacks more difficult.
>>>
>>> Changes since v12:
>>>  - Patch 1: Taken comment from Mike (re-introduced the 'panic' in  
>>> case memblock allocation fails in setup_64.c
>>>  - Patch 1: Added alloc_stack() function in setup_32.c to also  
>>> panic in case of allocation failure.
>>
>> Hi Christophe,
>>
>> I can't get this series to boot on qemu mac99. I'm getting eg:
>
> Problem new with version 13 or it is the first time you test ?

I did test a previous version of the series, but I'm not sure if I
tested pmac32 before. So I don't have a known good version.

>> [0.981514] NFS: Registering the id_resolver key type
>> [0.981752] Key type id_resolver registered
>> [0.981868] Key type id_legacy registered
>> [0.995711] Unrecoverable exception 0 at 0 (msr=0)
>> [0.996091] Oops: Unrecoverable exception, sig: 6 [#1]
>> [0.996314] BE PAGE_SIZE=4K MMU=Hash PowerMac
>> [0.996617] Modules linked in:
>> [0.996869] CPU: 0 PID: 416 Comm: modprobe Not tainted  
>> 5.0.0-rc2-gcc-7.3.0-00043-g53f2de798792 #342
>
> Comm:modprobe  ==> Something wrong with modules ? I never tested with  
> CONFIG_MODULES.

Yep good clue.

cheers


Re: [PATCH kernel] vfio-pci/nvlink2: Fix ancient gcc warnings

2019-01-23 Thread Michael Ellerman
Geert Uytterhoeven  writes:
> Hi Alex,
>
> On Wed, Jan 23, 2019 at 5:30 AM Alex Williamson
>  wrote:
>> The below patch comes about from the build regressions and improvements
>> list you've sent out, but something doesn't add up that we'd be testing
>> with an old compiler where initialization with { 0 } generates a
>> "missing braces around initialization" warning.  Is this really the
>> case or are we missing something here?  There's no harm that I can see
>> with Alexey's fix, but are these really just false positives from a
>> compiler bug that we should selectively ignore if the "fix" is less
>> clean?  Thanks,
>
> Yes, they are false positives, AFAIK.
>
>> On Wed, 23 Jan 2019 15:07:11 +1100
>> Alexey Kardashevskiy  wrote:
>>
>> > Using the {0} construct as a generic initializer is perfectly fine in C,
>> > however due to a bug in old gcc there is a warning:
>> >
>> >   + /kisskb/src/drivers/vfio/pci/vfio_pci_nvlink2.c: warning: (near
>> > initialization for 'cap.header') [-Wmissing-braces]:  => 181:9
>
> These all seem to come from an old gcc 4.6, which is the oldest still
> supported version for compiling Linux
> http://kisskb.ellerman.id.au/kisskb/buildresult/13663641/
>
> Note that kisskb is also using gcc 4.6.3 for s390x and mips, which are the 
> only
> other builds showing missing braces warnings.

As documented here:

  
https://www.kernel.org/doc/html/latest/process/changes.html#current-minimal-requirements

x86 has effectively dropped support for 4.6 because it doesn't support
retpoline and CONFIG_RETPOLINE is default y.

So it might be time to stop supporting 4.6, but I'd rather that happened
by someone sending a patch to change the requirements doc above, and
then kisskb can stop doing the builds with 4.6.

cheers


Re: [PATCH -next] usb: host: drop pointless static qualifier

2019-01-23 Thread Alan Stern
On Wed, 23 Jan 2019, YueHaibing wrote:

> There is no need to have the 'dummy_mask' variable static since new
> value always be assigned before use it.
> 
> Signed-off-by: YueHaibing 
> ---
>  drivers/usb/host/ehci-ps3.c | 2 +-
>  drivers/usb/host/ohci-ps3.c | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/usb/host/ehci-ps3.c b/drivers/usb/host/ehci-ps3.c
> index 454d8c6..91cee02 100644
> --- a/drivers/usb/host/ehci-ps3.c
> +++ b/drivers/usb/host/ehci-ps3.c
> @@ -86,7 +86,7 @@ static int ps3_ehci_probe(struct ps3_system_bus_device *dev)
>   int result;
>   struct usb_hcd *hcd;
>   unsigned int virq;
> - static u64 dummy_mask;
> + u64 dummy_mask;
>  
>   if (usb_disabled()) {
>   result = -ENODEV;
> diff --git a/drivers/usb/host/ohci-ps3.c b/drivers/usb/host/ohci-ps3.c
> index 395f9d3..a1c1bdf 100644
> --- a/drivers/usb/host/ohci-ps3.c
> +++ b/drivers/usb/host/ohci-ps3.c
> @@ -69,7 +69,7 @@ static int ps3_ohci_probe(struct ps3_system_bus_device *dev)
>   int result;
>   struct usb_hcd *hcd;
>   unsigned int virq;
> - static u64 dummy_mask;
> + u64 dummy_mask;
>  
>   if (usb_disabled()) {
>   result = -ENODEV;

No.  You need to read the code and understand how a variable is used
before you decide to modify it.

In this case, a suitable approach would be to change the declaration 
so that it says:

status u64 dummy_mask = DMA_BIT_MASK(32);

and remove the line that does the assignment dynamically.

Alan Stern



Re: [PATCH 18/19] KVM: PPC: Book3S HV: add passthrough support

2019-01-23 Thread Benjamin Herrenschmidt
On Wed, 2019-01-23 at 21:30 +1100, Paul Mackerras wrote:
> > Afaik bcs we change the mapping to point to the real HW irq ESB page
> > instead of the "IPI" that was there at VM init time.
> 
> So that makes it sound like there is a whole lot going on that hasn't
> even been hinted at in the patch descriptions...  It sounds like we
> need a good description of how all this works and fits together
> somewhere under Documentation/.
> 
> In any case we need much more informative patch descriptions.  I
> realize that it's all currently in Cedric's head, but I bet that in
> two or three years' time when we come to try to debug something, it
> won't be in anyone's head...

The main problem is understanding XIVE itself. It's not realistic to
ask Cedric to write a proper documentation for XIVE as part of the
patch series, but sadly IBM doesn't have a good one to provide either.

Ben.




[PATCH] powerpc/pseries: Check for ceded CPU's during LPAR migration

2019-01-23 Thread Michael Bringmann
This patch is to check for cede'ed CPUs during LPM.  Some extreme
tests encountered a problem ehere Linux has put some threads to
sleep (possibly to save energy or something), LPM was attempted,
and the Linux kernel didn't awaken the sleeping threads, but issued
the H_JOIN for the active threads.  Since the sleeping threads
are not awake, they can not issue the expected H_JOIN, and the
partition would never suspend.  This patch wakes the sleeping
threads back up.

Signed-off-by: Nathan Fontenot 
Signed-off-by: Gustavo Walbon 
---
 arch/powerpc/include/asm/plpar_wrappers.h |6 ++
 arch/powerpc/kernel/rtas.c|6 ++
 arch/powerpc/platforms/pseries/setup.c|   18 ++
 3 files changed, 26 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/plpar_wrappers.h 
b/arch/powerpc/include/asm/plpar_wrappers.h
index cff5a41..8292eff 100644
--- a/arch/powerpc/include/asm/plpar_wrappers.h
+++ b/arch/powerpc/include/asm/plpar_wrappers.h
@@ -26,10 +26,8 @@ static inline void set_cede_latency_hint(u8 latency_hint)
get_lppaca()->cede_latency_hint = latency_hint;
 }
 
-static inline long cede_processor(void)
-{
-   return plpar_hcall_norets(H_CEDE);
-}
+int cpu_is_ceded(int cpu);
+long cede_processor(void);
 
 static inline long extended_cede_processor(unsigned long latency_hint)
 {
diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
index de35bd8f..9d9d08d 100644
--- a/arch/powerpc/kernel/rtas.c
+++ b/arch/powerpc/kernel/rtas.c
@@ -44,6 +44,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /* This is here deliberately so it's only used in this file */
 void enter_rtas(unsigned long);
@@ -991,6 +992,11 @@ int rtas_ibm_suspend_me(u64 handle)
goto out_hotplug_enable;
}
 
+   for_each_present_cpu(cpu) {
+   if (cpu_is_ceded(cpu))
+   plpar_hcall_norets(H_PROD, 
get_hard_smp_processor_id(cpu));
+   }
+
/* Call function on all CPUs.  One of us will make the
 * rtas call
 */
diff --git a/arch/powerpc/platforms/pseries/setup.c 
b/arch/powerpc/platforms/pseries/setup.c
index 4078a05..0106668 100644
--- a/arch/powerpc/platforms/pseries/setup.c
+++ b/arch/powerpc/platforms/pseries/setup.c
@@ -331,6 +331,24 @@ static int alloc_dispatch_log_kmem_cache(void)
 }
 machine_early_initcall(pseries, alloc_dispatch_log_kmem_cache);
 
+static DEFINE_PER_CPU(int, cpu_ceded);
+
+int cpu_is_ceded(int cpu)
+{
+   return per_cpu(cpu_ceded, cpu);
+}
+
+long cede_processor(void)
+{
+   long rc;
+
+   per_cpu(cpu_ceded, raw_smp_processor_id()) = 1;
+   rc = plpar_hcall_norets(H_CEDE);
+   per_cpu(cpu_ceded, raw_smp_processor_id()) = 0;
+
+   return rc;
+}
+
 static void pseries_lpar_idle(void)
 {
/*



Re: [PATCH kernel] vfio-pci/nvlink2: Fix ancient gcc warnings

2019-01-23 Thread Alex Williamson
On Wed, 23 Jan 2019 15:07:11 +1100
Alexey Kardashevskiy  wrote:

> Using the {0} construct as a generic initializer is perfectly fine in C,
> however due to a bug in old gcc there is a warning:
> 
>   + /kisskb/src/drivers/vfio/pci/vfio_pci_nvlink2.c: warning: (near
> initialization for 'cap.header') [-Wmissing-braces]:  => 181:9
> 
> Since for whatever reason we still want to compile the modern kernel
> with such an old gcc without warnings, this changes the capabilities
> initialization.
> 
> The gcc bugzilla: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53119
> 
> Signed-off-by: Alexey Kardashevskiy 

Added Fixes: and Reported-by: tags.

Applied to for-linus branch for v5.0.  Thanks,

Alex

> ---
>  drivers/vfio/pci/vfio_pci_nvlink2.c | 30 ++---
>  1 file changed, 15 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_nvlink2.c 
> b/drivers/vfio/pci/vfio_pci_nvlink2.c
> index 054a2cf..91d945b 100644
> --- a/drivers/vfio/pci/vfio_pci_nvlink2.c
> +++ b/drivers/vfio/pci/vfio_pci_nvlink2.c
> @@ -178,11 +178,11 @@ static int vfio_pci_nvgpu_add_capability(struct 
> vfio_pci_device *vdev,
>   struct vfio_pci_region *region, struct vfio_info_cap *caps)
>  {
>   struct vfio_pci_nvgpu_data *data = region->data;
> - struct vfio_region_info_cap_nvlink2_ssatgt cap = { 0 };
> -
> - cap.header.id = VFIO_REGION_INFO_CAP_NVLINK2_SSATGT;
> - cap.header.version = 1;
> - cap.tgt = data->gpu_tgt;
> + struct vfio_region_info_cap_nvlink2_ssatgt cap = {
> + .header.id = VFIO_REGION_INFO_CAP_NVLINK2_SSATGT,
> + .header.version = 1,
> + .tgt = data->gpu_tgt
> + };
>  
>   return vfio_info_add_capability(caps, , sizeof(cap));
>  }
> @@ -365,18 +365,18 @@ static int vfio_pci_npu2_add_capability(struct 
> vfio_pci_device *vdev,
>   struct vfio_pci_region *region, struct vfio_info_cap *caps)
>  {
>   struct vfio_pci_npu2_data *data = region->data;
> - struct vfio_region_info_cap_nvlink2_ssatgt captgt = { 0 };
> - struct vfio_region_info_cap_nvlink2_lnkspd capspd = { 0 };
> + struct vfio_region_info_cap_nvlink2_ssatgt captgt = {
> + .header.id = VFIO_REGION_INFO_CAP_NVLINK2_SSATGT,
> + .header.version = 1,
> + .tgt = data->gpu_tgt
> + };
> + struct vfio_region_info_cap_nvlink2_lnkspd capspd = {
> + .header.id = VFIO_REGION_INFO_CAP_NVLINK2_LNKSPD,
> + .header.version = 1,
> + .link_speed = data->link_speed
> + };
>   int ret;
>  
> - captgt.header.id = VFIO_REGION_INFO_CAP_NVLINK2_SSATGT;
> - captgt.header.version = 1;
> - captgt.tgt = data->gpu_tgt;
> -
> - capspd.header.id = VFIO_REGION_INFO_CAP_NVLINK2_LNKSPD;
> - capspd.header.version = 1;
> - capspd.link_speed = data->link_speed;
> -
>   ret = vfio_info_add_capability(caps, , sizeof(captgt));
>   if (ret)
>   return ret;



Re: [PATCH 00/19] KVM: PPC: Book3S HV: add XIVE native exploitation mode

2019-01-23 Thread Cédric Le Goater
On 1/22/19 5:46 AM, Paul Mackerras wrote:
> On Mon, Jan 07, 2019 at 07:43:12PM +0100, Cédric Le Goater wrote:
>> Hello,
>>
>> On the POWER9 processor, the XIVE interrupt controller can control
>> interrupt sources using MMIO to trigger events, to EOI or to turn off
>> the sources. Priority management and interrupt acknowledgment is also
>> controlled by MMIO in the CPU presenter subengine.
>>
>> PowerNV/baremetal Linux runs natively under XIVE but sPAPR guests need
>> special support from the hypervisor to do the same. This is called the
>> XIVE native exploitation mode and today, it can be activated under the
>> PowerPC Hypervisor, pHyp. However, Linux/KVM lacks XIVE native support
>> and still offers the old interrupt mode interface using a
>> XICS-over-XIVE glue which implements the XICS hcalls.
>>
>> The following series is proposal to add the same support under KVM.
>>
>> A new KVM device is introduced for the XIVE native exploitation
>> mode. It reuses most of the XICS-over-XIVE glue implementation
>> structures which are internal to KVM but has a completely different
>> interface. A set of Hypervisor calls configures the sources and the
>> event queues and from there, all control is done by the guest through
>> MMIOs.
>>
>> These MMIO regions (ESB and TIMA) are exposed to guests in QEMU,
>> similarly to VFIO, and the associated VMAs are populated dynamically
>> with the appropriate pages using a fault handler. This is implemented
>> with a couple of KVM device ioctls.
>>
>> On a POWER9 sPAPR machine, the Client Architecture Support (CAS)
>> negotiation process determines whether the guest operates with a
>> interrupt controller using the XICS legacy model, as found on POWER8,
>> or in XIVE exploitation mode. Which means that the KVM interrupt
>> device should be created at runtime, after the machine as started.
>> This requires extra KVM support to create/destroy KVM devices. The
>> last patches are an attempt to solve that problem.
>>
>> Migration has its own specific needs. The patchset provides the
>> necessary routines to quiesce XIVE, to capture and restore the state
>> of the different structures used by KVM, OPAL and HW. Extra OPAL
>> support is required for these.
> 
> Thanks for the patchset.  It mostly looks good, but there are some
> more things we need to consider, and I think a v2 will be needed.
>> One general comment I have is that there are a lot of acronyms in this
> code and you mostly seem to assume that people will know what they all
> mean.  It would make the code more readable if you provide the
> expansion of the acronym on first use in a comment or whatever.  For
> example, one of the patches in this series talks about the "EAS"

 Event Assignment Structure, a.k.a IVE (Interrupt Virtualization Entry)

All the names changed somewhere between XIVE v1 and XIVE v2. OPAL and
Linux should be adjusted ...

> without ever expanding it in any comment or in the patch description,
> and I have forgotten just at the moment what EAS stands for (I just
> know that understanding the XIVE is not eas-y. :)
ah ! yes. But we have great documentation :)

We pushed some high level description of XIVE in QEMU :

  
https://git.qemu.org/?p=qemu.git;a=blob;f=include/hw/ppc/xive.h;h=ec23253ba448e25c621356b55a119a738f8e;hb=HEAD

I should do the same for Linux with a KVM section to explain the 
interfaces which do not directly expose the underlying XIVE concepts. 
It's better to understand a little what is happening under the hood.

> Another general comment is that you seem to have written all this
> code assuming we are using HV KVM in a host running bare-metal.

Yes. I didn't look at the other configurations. I thought that we could
use the kernel_irqchip=off option to begin with. A couple of checks
are indeed missing.

> However, we could be using PR KVM (either in a bare-metal host or in a
> guest), or we could be doing nested HV KVM where we are using the
> kvm_hv module inside a KVM guest and using special hypercalls for
> controlling our guests.

Yes. 

It would be good to talk a little about the nested support (offline 
may be) to make sure that we are not missing some major interface that 
would require a lot of change. If we need to prepare ground, I think
the timing is good.

The size of the IRQ number space might be a problem. It seems we 
would need to increase it considerably to support multiple nested 
guests. That said I haven't look much how nested is designed.  

> It would be perfectly acceptable for now to say that we don't yet
> support XIVE exploitation in those scenarios, as long as we then make
> sure that the new KVM capability reports false in those scenarios, and
> any attempt to use the XIVE exploitation interfaces fails cleanly.

ok. That looks the best approach for now.

> I don't see that either of those is true in the patch set as it
> stands, so that is one area that needs to be fixed.
> 
> A third general comment is that the new KVM interfaces you have added
> 

Re: [PATCH 19/19] KVM: introduce a KVM_DELETE_DEVICE ioctl

2019-01-23 Thread Cédric Le Goater
On 1/22/19 6:42 AM, Paul Mackerras wrote:
> On Mon, Jan 07, 2019 at 08:10:06PM +0100, Cédric Le Goater wrote:
>> This will be used to destroy the KVM XICS or XIVE device when the
>> sPAPR machine is reseted. When the VM boots, the CAS negotiation
>> process will determine which interrupt mode to use and the appropriate
>> KVM device will then be created.
> 
> What would be the consequence if we didn't destroy the device?

So, if we don't destroy the device, it would mean that we are 
maintaining its availability under the KVM PPC structures, VM and
vCPUs, I think the changes would be significant to have two interrupt 
devices unde the VM. We would also need a way to activate one or 
the other depending on the interrupt mode chosen by CAS. In other 
words, it's moving all the interrupt mode politics from QEMU to KVM. 
It's possible of course but I would prefer to leave the ugly details 
in QEMU.  

Let's suppose now that we keep the device alive but disconnect the 
presenters from it, and from the VM also. We would have an unused 
device in the VM. We would need way to keep an handle on it (fd 
certainly) and a KVM interface to soft reset a KVM device partially 
initialized. That's one other option.

It seemed easier to do an hard reset : create/destroy.  

> The reason I ask is that we will have to be much more careful about
> memory allocation lifetimes with this patch. 

yes. bad refcounting will lead the host kernel to a crash. 

> Having KVM devices last
> until the KVM instance is destroyed means that we generally avoid
> use-after-free bugs.  With this patch we will have to do a careful
> analysis of the lifetime of the xive structures vs. possible accesses
> on other threads to prove there are no use-after-free bugs.
> 
> For example, it is not sufficient to set any pointers in struct kvm or
> struct kvm_vcpu that point into xive structures to NULL before freeing
> the structures.  There could be code on another CPU that has read the
> pointer value before you set it to NULL and then goes and accesses it
> after you have freed it.  You need to prove that can't happen,
> possibly using some sort of explicit synchronization that ensures that
> no other CPU could still be accessing the structure at the time when
> you free it.  RCU can help with this, but in general means you need
> RCU synchronization primitives (rcu_read_lock() etc.) at all the
> places where you use the pointer, which I don't think you currently
> have.

no. indeed. I have overlooked the synchronization aspect.

> If there is a good fundamental reason why this can't happen, even
> though you don't have explicit synchronization, then at a minimum you
> need to explain that in the patch description, and ideally also in
> code comments.

OK. I did leave that patch at the end for one reason. It needs more care.

Thanks,

C.
 



Re: [PATCH 06/19] KVM: PPC: Book3S HV: add a GET_ESB_FD control to the XIVE native device

2019-01-23 Thread Cédric Le Goater
On 1/22/19 6:09 AM, Paul Mackerras wrote:
> On Mon, Jan 07, 2019 at 07:43:18PM +0100, Cédric Le Goater wrote:
>> This will let the guest create a memory mapping to expose the ESB MMIO
>> regions used to control the interrupt sources, to trigger events, to
>> EOI or to turn off the sources.
>>
>> Signed-off-by: Cédric Le Goater 
>> ---
>>  arch/powerpc/include/uapi/asm/kvm.h   |  4 ++
>>  arch/powerpc/kvm/book3s_xive_native.c | 97 +++
>>  2 files changed, 101 insertions(+)
>>
>> diff --git a/arch/powerpc/include/uapi/asm/kvm.h 
>> b/arch/powerpc/include/uapi/asm/kvm.h
>> index 8c876c166ef2..6bb61ba141c2 100644
>> --- a/arch/powerpc/include/uapi/asm/kvm.h
>> +++ b/arch/powerpc/include/uapi/asm/kvm.h
>> @@ -675,4 +675,8 @@ struct kvm_ppc_cpu_char {
>>  #define  KVM_XICS_PRESENTED (1ULL << 43)
>>  #define  KVM_XICS_QUEUED(1ULL << 44)
>>  
>> +/* POWER9 XIVE Native Interrupt Controller */
>> +#define KVM_DEV_XIVE_GRP_CTRL   1
>> +#define   KVM_DEV_XIVE_GET_ESB_FD   1
>> +
>>  #endif /* __LINUX_KVM_POWERPC_H */
>> diff --git a/arch/powerpc/kvm/book3s_xive_native.c 
>> b/arch/powerpc/kvm/book3s_xive_native.c
>> index 115143e76c45..e20081f0c8d4 100644
>> --- a/arch/powerpc/kvm/book3s_xive_native.c
>> +++ b/arch/powerpc/kvm/book3s_xive_native.c
>> @@ -153,6 +153,85 @@ int kvmppc_xive_native_connect_vcpu(struct kvm_device 
>> *dev,
>>  return rc;
>>  }
>>  
>> +static int xive_native_esb_fault(struct vm_fault *vmf)
>> +{
>> +struct vm_area_struct *vma = vmf->vma;
>> +struct kvmppc_xive *xive = vma->vm_file->private_data;
>> +struct kvmppc_xive_src_block *sb;
>> +struct kvmppc_xive_irq_state *state;
>> +struct xive_irq_data *xd;
>> +u32 hw_num;
>> +u16 src;
>> +u64 page;
>> +unsigned long irq;
>> +
>> +/*
>> + * Linux/KVM uses a two pages ESB setting, one for trigger and
>> + * one for EOI
>> + */
>> +irq = vmf->pgoff / 2;
>> +
>> +sb = kvmppc_xive_find_source(xive, irq, );
>> +if (!sb) {
>> +pr_err("%s: source %lx not found !\n", __func__, irq);
> 
> In general it's a bad idea to have a printk that userspace can trigger
> at will without any rate-limiting.  Is there a real reason why this
> printk is needed (and can't be pr_devel)?

yes. It should. The SIGBUS is enough to know what's going on. 

> 
>> +return VM_FAULT_SIGBUS;
>> +}
>> +
>> +state = >irq_state[src];
>> +kvmppc_xive_select_irq(state, _num, );
>> +
>> +arch_spin_lock(>lock);
>> +
>> +/*
>> + * first/even page is for trigger
>> + * second/odd page is for EOI and management.
>> + */
>> +page = vmf->pgoff % 2 ? xd->eoi_page : xd->trig_page;
>> +arch_spin_unlock(>lock);
>> +
>> +if (!page) {
>> +pr_err("%s: acessing invalid ESB page for source %lx !\n",
>> +   __func__, irq);
> 
> Does this represent a exceptional condition that userspace can't just
> trigger at will (i.e. it implies the presence of a kernel bug)?  If
> not then the same comment as above applies.

Not having an ESB page (trigger or EOI) implies that the xive_irq_data 
for the source is bogus. This probably deserves a WARN().

Thanks,

C. 



Re: [PATCH 08/19] KVM: PPC: Book3S HV: add a VC_BASE control to the XIVE native device

2019-01-23 Thread Cédric Le Goater
On 1/22/19 6:14 AM, Paul Mackerras wrote:
> On Mon, Jan 07, 2019 at 07:43:20PM +0100, Cédric Le Goater wrote:
>> The ESB MMIO region controls the interrupt sources of the guest. QEMU
>> will query an fd (GET_ESB_FD ioctl) and map this region at a specific
>> address for the guest to use. The guest will obtain this information
>> using the H_INT_GET_SOURCE_INFO hcall. To inform KVM of the address
>> setting used by QEMU, add a VC_BASE control to the KVM XIVE device
> 
> This needs a little more explanation.  I *think* the only way this
> gets used is that it gets returned to the guest by the new
> hypercalls.  If that is indeed the case it would be useful to mention
> that in the patch description, because otherwise taking a value that
> userspace provides and which looks like it is an address, and not
> doing any validation on it, looks a bit scary.

I think we have solved this problem in another email thread. 

The H_INT_GET_SOURCE_INFO hcall does not need to be implemented in KVM
as all the source information should already be available in QEMU. In
that case, there is no need to inform KVM of where the ESB pages are 
mapped in the guest address space. So we don't need that extra control
on the KVM device. This is good news. 

Thanks,

C. 





Re: [PATCH 05/19] KVM: PPC: Book3S HV: add a new KVM device for the XIVE native exploitation mode

2019-01-23 Thread Cédric Le Goater
On 1/22/19 6:05 AM, Paul Mackerras wrote:
> On Mon, Jan 07, 2019 at 07:43:17PM +0100, Cédric Le Goater wrote:
>> This is the basic framework for the new KVM device supporting the XIVE
>> native exploitation mode. The user interface exposes a new capability
>> and a new KVM device to be used by QEMU.
> 
> [snip]
>> @@ -1039,7 +1039,10 @@ static int kvmppc_book3s_init(void)
>>  #ifdef CONFIG_KVM_XIVE
>>  if (xive_enabled()) {
>>  kvmppc_xive_init_module();
>> +kvmppc_xive_native_init_module();
>>  kvm_register_device_ops(_xive_ops, KVM_DEV_TYPE_XICS);
>> +kvm_register_device_ops(_xive_native_ops,
>> +KVM_DEV_TYPE_XIVE);
> 
> I think we want tighter conditions on initializing the xive_native
> stuff and creating the xive device class.  We could have
> xive_enabled() returning true in a guest, and this code will get
> called both by PR KVM and HV KVM (and HV KVM no longer implies that we
> are running bare metal).

Ah yes, I agree. I haven't addressed at all the nested flavor. I have 
some questions about this that I will ask in summary email you sent. 

Thanks,

C. 
  

> 
>> @@ -1050,8 +1053,10 @@ static int kvmppc_book3s_init(void)
>>  static void kvmppc_book3s_exit(void)
>>  {
>>  #ifdef CONFIG_KVM_XICS
>> -if (xive_enabled())
>> +if (xive_enabled()) {
>>  kvmppc_xive_exit_module();
>> +kvmppc_xive_native_exit_module();
> 
> Same comment here.
> 
> Paul.
> 



Re: [PATCH 03/19] KVM: PPC: Book3S HV: check the IRQ controller type

2019-01-23 Thread Cédric Le Goater
On 1/22/19 5:56 AM, Paul Mackerras wrote:
> On Mon, Jan 07, 2019 at 07:43:15PM +0100, Cédric Le Goater wrote:
>> We will have different KVM devices for interrupts, one for the
>> XICS-over-XIVE mode and one for the XIVE native exploitation
>> mode. Let's add some checks to make sure we are not mixing the
>> interfaces in KVM.
>>
>> Signed-off-by: Cédric Le Goater 
>> ---
>>  arch/powerpc/kvm/book3s_xive.c | 6 ++
>>  1 file changed, 6 insertions(+)
>>
>> diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
>> index f78d002f0fe0..8a4fa45f07f8 100644
>> --- a/arch/powerpc/kvm/book3s_xive.c
>> +++ b/arch/powerpc/kvm/book3s_xive.c
>> @@ -819,6 +819,9 @@ u64 kvmppc_xive_get_icp(struct kvm_vcpu *vcpu)
>>  {
>>  struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu;
>>  
>> +if (!kvmppc_xics_enabled(vcpu))
>> +return -EPERM;
>> +
>>  if (!xc)
>>  return 0;
>>  
>> @@ -835,6 +838,9 @@ int kvmppc_xive_set_icp(struct kvm_vcpu *vcpu, u64 
>> icpval)
>>  u8 cppr, mfrr;
>>  u32 xisr;
>>  
>> +if (!kvmppc_xics_enabled(vcpu))
>> +return -EPERM;
>> +
>>  if (!xc || !xive)
>>  return -ENOENT;
> 
> I can't see how these new checks could ever trigger in the code as it
> stands.  Is there a way at present? 

It would require some custom QEMU doing silly things : create the XICS 
KVM device, and then call kvm_get_one_reg(KVM_REG_PPC_ICP_STATE) or 
kvm_set_one_reg(icp->cs, KVM_REG_PPC_ICP_STATE) without connecting the
vCPU to its presenter. 

Today, you get a ENOENT.

> Do following patches ever add a path where the new checks could trigger, 
> or is this just an excess of caution? 

With the following patches, QEMU could to do something even more silly,
which is to mix the interrupt mode interfaces : create a KVM XICS device
and call KVM CPU ioctls of the KVM XIVE device, or the opposite.  

> (Your patch description should ideally have answered these questions > for 
> me.)

Yes. I also think that I introduced this patch to early in the series.
It make more sense when the XICS and the XIVE KVM devices are available.  

Thanks,

C.


Re: [PATCH v4] kbuild: Add support for DT binding schema checks

2019-01-23 Thread Geert Uytterhoeven
Hi Rob,

On Tue, Dec 11, 2018 at 9:24 PM Rob Herring  wrote:
> This adds the build infrastructure for checking DT binding schema
> documents and validating dts files using the binding schema.
>
> Check DT binding schema documents:
> make dt_binding_check
>
> Build dts files and check using DT binding schema:
> make dtbs_check
>
> Optionally, DT_SCHEMA_FILES can be passed in with a schema file(s) to
> use for validation. This makes it easier to find and fix errors
> generated by a specific schema.
>
> Currently, the validation targets are separate from a normal build to
> avoid a hard dependency on the external DT schema project and because
> there are lots of warnings generated.

Thanks, I'm giving this a try, and get errors like:

  DTC arch/arm/boot/dts/emev2-kzm9d.dt.yaml
FATAL ERROR: No markers present in property 'cpu0' value

and

  DTC arch/arm64/boot/dts/renesas/r8a7795-salvator-x.dt.yaml
FATAL ERROR: No markers present in property 'audio_clk_a' value

Do you have a clue?
Thanks!

Gr{oetje,eeting}s,

Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds


Re: [PATCH] tools/testing/selftests/powerpc: Add Anton's null_syscall benchmark to the selftests

2019-01-23 Thread Christophe Leroy




Le 27/09/2016 à 16:10, Rui Teng a écrit :

From: Anton Blanchard 

Pull in a version of Anton's null_syscall benchmark:
http://ozlabs.org/~anton/junkcode/null_syscall.c
Into tools/testing/selftests/powerpc/benchmarks.

Suggested-by: Michael Ellerman 
Signed-off-by: Anton Blanchard 
Signed-off-by: Rui Teng 
---
  .../testing/selftests/powerpc/benchmarks/Makefile  |   2 +-
  .../selftests/powerpc/benchmarks/null_syscall.c| 157 +
  2 files changed, 158 insertions(+), 1 deletion(-)
  create mode 100644 tools/testing/selftests/powerpc/benchmarks/null_syscall.c



[...]


+
+static void do_null_syscall(unsigned long nr)
+{
+   unsigned long i;
+
+   for (i = 0; i < nr; i++)
+   getppid();
+}
+


Looks like getppid() performs a rcu_read_lock(). Is that what we want ?

Shouldn't we use getpid() instead for a lighter syscall ?

Christophe


Re: use generic DMA mapping code in powerpc V4

2019-01-23 Thread Christian Zigotzky

Hi Christoph,

I also compiled a kernel (zImage) for the X1000  from your Git 
'powerpc-dma.6-debug' (both patches) today.


It boots and the P.A. Semi Ethernet works!

I will test just the first patch tomorrow.

Thanks,
Christian


On 21 January 2019 at 3:38PM, Christian Zigotzky wrote:

Hello Christoph,

Thanks for your reply. I successfully compiled a kernel (uImage) for 
the X5000 from your Git 'powerpc-dma.6-debug' (both patches) today.


It detects the SATA hard disk drive and boots without any problems.





Re: [PATCH 18/19] KVM: PPC: Book3S HV: add passthrough support

2019-01-23 Thread Cédric Le Goater
On 1/23/19 11:30 AM, Paul Mackerras wrote:
> On Wed, Jan 23, 2019 at 05:45:24PM +1100, Benjamin Herrenschmidt wrote:
>> On Tue, 2019-01-22 at 16:26 +1100, Paul Mackerras wrote:
>>> On Mon, Jan 07, 2019 at 08:10:05PM +0100, Cédric Le Goater wrote:
 Clear the ESB pages from the VMA of the IRQ being pass through to the
 guest and let the fault handler repopulate the VMA when the ESB pages
 are accessed for an EOI or for a trigger.
>>>
>>> Why do we want to do this?
>>>
>>> I don't see any possible advantage to removing the PTEs from the
>>> userspace mapping.  You'll need to explain further.
>>
>> Afaik bcs we change the mapping to point to the real HW irq ESB page
>> instead of the "IPI" that was there at VM init time.

yes exactly. You need to clean up the pages each time.
 
> So that makes it sound like there is a whole lot going on that hasn't
> even been hinted at in the patch descriptions...  It sounds like we
> need a good description of how all this works and fits together
> somewhere under Documentation/.

OK. I have started doing so for the models merged in QEMU but not yet 
for KVM. I will work on it.

> In any case we need much more informative patch descriptions.  I
> realize that it's all currently in Cedric's head, but I bet that in
> two or three years' time when we come to try to debug something, it
> won't be in anyone's head...

I agree. 


So, storing the ESB VMA under the KVM device is not shocking anyone ?  

Thanks,

C.



Re: [PATCH 11/19] KVM: PPC: Book3S HV: add support for the XIVE native exploitation mode hcalls

2019-01-23 Thread Cédric Le Goater
On 1/23/19 11:26 AM, Paul Mackerras wrote:
> On Wed, Jan 23, 2019 at 09:48:31AM +0100, Cédric Le Goater wrote:
>> On 1/23/19 7:44 AM, Benjamin Herrenschmidt wrote:
>>> On Tue, 2019-01-22 at 16:23 +1100, Paul Mackerras wrote:
 Why do we need to provide real-mode versions of these hypercall
 handlers?  I thought these hypercalls would only get called
 infrequently, and in any case certainly much less frequently than once
 per interrupt delivered.  If they are infrequent, then let's leave out
 the real-mode version and just handle them in book3s_hv.c.
>>>
>>> Agreed with the exception maybe of H_INT_ESB
>>
>> ok. 
>>
>> Some of these hcalls are really simple and only getting local info from 
>> the host (h_int_get_*). I thought handling the hcall ASAP was a preferred 
>> practice, even if the hcall is not called frequently. Isn't it ?
> 
> If we are going to handle a given hcall in the kernel at all, then we
> have to have a virtual mode handler.  If we have a real-mode handler
> as well then we in general incur a certain amount of code duplication
> with consequent maintenance costs and possibility of bugs.  So we
> generally only have real-mode handlers for the hcalls where it is
> critical to minimize the latency.  From what Ben is saying that would
> only be H_INT_ESB, and maybe not even that.

ok. and yes, even the H_INT_ESB is questionable as this is really a rare
configuration. 

> If H_INT_ESB is only used for LSIs, then is a guest going to be using
> it at all?  My understanding was that with XIVE, only a small number
> of interrupts that are to do with system management functions are
> LSIs; all of the interrupts relating to PCI-e devices are MSIs.  So do
> we actually have a real high-frequency use case for LSIs in a guest?

The guest should be using a rtl8139 or a e1000 NIC under QEMU/KVM, which 
is not the common scenario. 

> For now I would prefer that you remove all the real-mode hcall
> handlers.  We can add them later if we get performance data showing
> that they are needed.

ok. I will.

> Regarding whether or not to have a given hcall handler in the kernel
> at all - if there is for example an hcall which is just called once
> on guest startup, and its function is just to provide information to
> the guest, and QEMU has that information, then why not have that hcall
> implemented by QEMU?  Are any of the hcalls like that?
> 
> For example, if H_INT_GET_SOURCE_INFO was implemented in QEMU, could
> we then remove the VC_BASE thing from the xive device?

Yes. H_INT_GET_SOURCE_INFO looks like a good candidate, all info should
be in QEMU and there are no OPAL calls, and we would get rid of the 
VC_BASE kvm device ioctl at the same time.

Thanks,

C.


Re: [PATCH 11/19] KVM: PPC: Book3S HV: add support for the XIVE native exploitation mode hcalls

2019-01-23 Thread Paul Mackerras
On Wed, Jan 23, 2019 at 09:48:31AM +0100, Cédric Le Goater wrote:
> On 1/23/19 7:44 AM, Benjamin Herrenschmidt wrote:
> > On Tue, 2019-01-22 at 16:23 +1100, Paul Mackerras wrote:
> >> Why do we need to provide real-mode versions of these hypercall
> >> handlers?  I thought these hypercalls would only get called
> >> infrequently, and in any case certainly much less frequently than once
> >> per interrupt delivered.  If they are infrequent, then let's leave out
> >> the real-mode version and just handle them in book3s_hv.c.
> > 
> > Agreed with the exception maybe of H_INT_ESB
> 
> ok. 
> 
> Some of these hcalls are really simple and only getting local info from 
> the host (h_int_get_*). I thought handling the hcall ASAP was a preferred 
> practice, even if the hcall is not called frequently. Isn't it ?

If we are going to handle a given hcall in the kernel at all, then we
have to have a virtual mode handler.  If we have a real-mode handler
as well then we in general incur a certain amount of code duplication
with consequent maintenance costs and possibility of bugs.  So we
generally only have real-mode handlers for the hcalls where it is
critical to minimize the latency.  From what Ben is saying that would
only be H_INT_ESB, and maybe not even that.

If H_INT_ESB is only used for LSIs, then is a guest going to be using
it at all?  My understanding was that with XIVE, only a small number
of interrupts that are to do with system management functions are
LSIs; all of the interrupts relating to PCI-e devices are MSIs.  So do
we actually have a real high-frequency use case for LSIs in a guest?

For now I would prefer that you remove all the real-mode hcall
handlers.  We can add them later if we get performance data showing
that they are needed.

Regarding whether or not to have a given hcall handler in the kernel
at all - if there is for example an hcall which is just called once
on guest startup, and its function is just to provide information to
the guest, and QEMU has that information, then why not have that hcall
implemented by QEMU?  Are any of the hcalls like that?

For example, if H_INT_GET_SOURCE_INFO was implemented in QEMU, could
we then remove the VC_BASE thing from the xive device?

Paul.


Re: [PATCH 18/19] KVM: PPC: Book3S HV: add passthrough support

2019-01-23 Thread Paul Mackerras
On Wed, Jan 23, 2019 at 05:45:24PM +1100, Benjamin Herrenschmidt wrote:
> On Tue, 2019-01-22 at 16:26 +1100, Paul Mackerras wrote:
> > On Mon, Jan 07, 2019 at 08:10:05PM +0100, Cédric Le Goater wrote:
> > > Clear the ESB pages from the VMA of the IRQ being pass through to the
> > > guest and let the fault handler repopulate the VMA when the ESB pages
> > > are accessed for an EOI or for a trigger.
> > 
> > Why do we want to do this?
> > 
> > I don't see any possible advantage to removing the PTEs from the
> > userspace mapping.  You'll need to explain further.
> 
> Afaik bcs we change the mapping to point to the real HW irq ESB page
> instead of the "IPI" that was there at VM init time.

So that makes it sound like there is a whole lot going on that hasn't
even been hinted at in the patch descriptions...  It sounds like we
need a good description of how all this works and fits together
somewhere under Documentation/.

In any case we need much more informative patch descriptions.  I
realize that it's all currently in Cedric's head, but I bet that in
two or three years' time when we come to try to debug something, it
won't be in anyone's head...

Paul.


Re: [PATCH v13 00/10] powerpc: Switch to CONFIG_THREAD_INFO_IN_TASK

2019-01-23 Thread Christophe Leroy




Le 12/01/2019 à 10:55, Christophe Leroy a écrit :

The purpose of this serie is to activate CONFIG_THREAD_INFO_IN_TASK which
moves the thread_info into task_struct.

Moving thread_info into task_struct has the following advantages:
- It protects thread_info from corruption in the case of stack
overflows.
- Its address is harder to determine if stack addresses are
leaked, making a number of attacks more difficult.


I ran null_syscall and context_switch benchmark selftests and the result 
is surprising. There is slight degradation in context_switch and a 
significant one on null_syscall:


Without the serie:

~# chrt -f 98 ./context_switch --no-altivec --no-vector --no-fp
55542
55562
55564
55562
55568
...

~# ./null_syscall
   2546.71 ns 336.17 cycles


With the serie:

~# chrt -f 98 ./context_switch --no-altivec --no-vector --no-fp
55138
55142
55152
55144
55142

~# ./null_syscall
   3479.54 ns 459.30 cycles

So 0,8% less context switches per second and 37% more time for one syscall ?


Any idea ?

Christophe




Changes since v12:
  - Patch 1: Taken comment from Mike (re-introduced the 'panic' in case 
memblock allocation fails in setup_64.c
  - Patch 1: Added alloc_stack() function in setup_32.c to also panic in case 
of allocation failure.

Changes since v11:
  - Rebased on 81775f5563fa ("Automatic merge of branches 'master', 'next' and 
'fixes' into merge")
  - Added a first patch to change memblock allocs to functions returning 
virtual addrs. This removes
the memset() which were the only remaining stuff in irq_ctx_init() and 
exc_lvl_ctx_init() at the end.
  - dropping irq_ctx_init() and exc_lvl_ctx_init() in patch 5 (powerpc: 
Activate CONFIG_THREAD_INFO_IN_TASK)
  - A few cosmetic changes in commit log and code.

Changes since v10:
  - Rebased on 21622a0d2023 ("Automatic merge of branches 'master', 'next' and 
'fixes' into merge")
   ==> Fixed conflict in setup_32.S

Changes since v9:
  - Rebased on 183cbf93be88 ("Automatic merge of branches 'master', 'next' and 
'fixes' into merge")
   ==> Fixed conflict on xmon

Changes since v8:
  - Rebased on e589b79e40d9 ("Automatic merge of branches 'master', 'next' and 
'fixes' into merge")
   ==> Main impact was conflicts due to commit 9a8dd708d547 ("memblock: rename 
memblock_alloc{_nid,_try_nid} to memblock_phys_alloc*")

Changes since v7:
  - Rebased on fb6c6ce7907d ("Automatic merge of branches 'master', 'next' and 
'fixes' into merge")

Changes since v6:
  - Fixed validate_sp() to exclude NULL sp in 'regain entire stack space' patch 
(early crash with CONFIG_KMEMLEAK)

Changes since v5:
  - Fixed livepatch_sp setup by using end_of_stack() instead of hardcoding
  - Fixed PPC_BPF_LOAD_CPU() macro

Changes since v4:
  - Fixed a build failure on 32bits SMP when include/generated/asm-offsets.h is 
not
  already existing, was due to spaces instead of a tab in the Makefile

Changes since RFC v3: (based on Nick's review)
  - Renamed task_size.h to task_size_user64.h to better relate to what it 
contains.
  - Handling of the isolation of thread_info cpu field inside CONFIG_SMP 
#ifdefs moved to a separate patch.
  - Removed CURRENT_THREAD_INFO macro completely.
  - Added a guard in asm/smp.h to avoid build failure before _TASK_CPU is 
defined.
  - Added a patch at the end to rename 'tp' pointers to 'sp' pointers
  - Renamed 'tp' into 'sp' pointers in preparation patch when relevant
  - Fixed a few commit logs
  - Fixed checkpatch report.

Changes since RFC v2:
  - Removed the modification of names in asm-offsets
  - Created a rule in arch/powerpc/Makefile to append the offset of 
current->cpu in CFLAGS
  - Modified asm/smp.h to use the offset set in CFLAGS
  - Squashed the renaming of THREAD_INFO to TASK_STACK in the preparation patch
  - Moved the modification of current_pt_regs in the patch activating 
CONFIG_THREAD_INFO_IN_TASK

Changes since RFC v1:
  - Removed the first patch which was modifying header inclusion order in timer
  - Modified some names in asm-offsets to avoid conflicts when including 
asm-offsets in C files
  - Modified asm/smp.h to avoid having to include linux/sched.h (using 
asm-offsets instead)
  - Moved some changes from the activation patch to the preparation patch.

Christophe Leroy (10):
   powerpc/irq: use memblock functions returning virtual address
   book3s/64: avoid circular header inclusion in mmu-hash.h
   powerpc: Only use task_struct 'cpu' field on SMP
   powerpc: Prepare for moving thread_info into task_struct
   powerpc: Activate CONFIG_THREAD_INFO_IN_TASK
   powerpc: regain entire stack space
   powerpc: 'current_set' is now a table of task_struct pointers
   powerpc/32: Remove CURRENT_THREAD_INFO and rename TI_CPU
   powerpc/64: Remove CURRENT_THREAD_INFO
   powerpc: clean stack pointers naming

  arch/powerpc/Kconfig   |   1 +
  arch/powerpc/Makefile  |   7 ++
  arch/powerpc/include/asm/asm-prototypes.h  |   4 +-
  

Re: [PATCH 4/4] powerpc/64s: Support shrinking the SLB for debugging

2019-01-23 Thread Aneesh Kumar K.V
Michael Ellerman  writes:

> On machines with 1TB segments and a 32-entry SLB it's quite hard to
> cause sufficient SLB pressure to trigger bugs caused due to badly
> timed SLB faults.
>
> We have seen this in the past and a few years ago added the
> disable_1tb_segments command line option to force the use of 256MB
> segments. However even this allows some bugs to slip through testing
> if the SLB entry in question was recently accessed.
>
> So add a new command line parameter for debugging which shrinks the
> SLB to the minimum size we can support. Currently that size is 3, two
> bolted SLBs and 1 for dynamic use. This creates the maximal SLB
> pressure while still allowing the kernel to operate.
>

Should we put this within DEBUG_VM? 

Reviewed-by: Aneesh Kumar K.V  


> Signed-off-by: Michael Ellerman 
> ---
>  arch/powerpc/mm/slb.c | 14 ++
>  1 file changed, 14 insertions(+)
>
> diff --git a/arch/powerpc/mm/slb.c b/arch/powerpc/mm/slb.c
> index 61450a9cf30d..0f33e28f97da 100644
> --- a/arch/powerpc/mm/slb.c
> +++ b/arch/powerpc/mm/slb.c
> @@ -506,10 +506,24 @@ void switch_slb(struct task_struct *tsk, struct 
> mm_struct *mm)
>   asm volatile("isync" : : : "memory");
>  }
>  
> +static bool shrink_slb = false;
> +
> +static int __init parse_shrink_slb(char *p)
> +{
> + shrink_slb = true;
> + slb_set_size(0);

Why do we need call slb_set_size(0) here? htab_dt_scan_seg_sizes should 
find the shirnk_slb = true?

> +
> + return 0;
> +}
> +early_param("shrink_slb", parse_shrink_slb);
> +
>  static u32 slb_full_bitmap;
>  
>  void slb_set_size(u16 size)
>  {
> + if (shrink_slb)
> + size = SLB_NUM_BOLTED + 1;
> +
>   mmu_slb_size = size;
>  
>   if (size >= 32)
> -- 
> 2.20.1

-aneesh



Re: [PATCH 3/4] powerpc/64s: Move SLB init into hash_utils_64.c

2019-01-23 Thread Aneesh Kumar K.V
Michael Ellerman  writes:

> The SLB initialisation code is spread around a bit between prom.c and
> hash_utils_64.c. Consolidate it all in hash_utils_64.c.
>
> This slightly changes the timing of when mmu_slb_size is initialised,
> but that should have no effect.

Reviewed-by: Aneesh Kumar K.V 

>
> Signed-off-by: Michael Ellerman 
> ---
>  arch/powerpc/kernel/prom.c  | 16 
>  arch/powerpc/mm/hash_utils_64.c | 15 ++-
>  2 files changed, 10 insertions(+), 21 deletions(-)
>
> diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
> index 14693f8ccb80..018ededd1948 100644
> --- a/arch/powerpc/kernel/prom.c
> +++ b/arch/powerpc/kernel/prom.c
> @@ -229,21 +229,6 @@ static void __init check_cpu_pa_features(unsigned long 
> node)
> ibm_pa_features, ARRAY_SIZE(ibm_pa_features));
>  }
>  
> -#ifdef CONFIG_PPC_BOOK3S_64
> -static void __init init_mmu_slb_size(unsigned long node)
> -{
> - const __be32 *slb_size_ptr;
> -
> - slb_size_ptr = of_get_flat_dt_prop(node, "slb-size", NULL) ? :
> - of_get_flat_dt_prop(node, "ibm,slb-size", NULL);
> -
> - if (slb_size_ptr)
> - slb_set_size(be32_to_cpup(slb_size_ptr));
> -}
> -#else
> -#define init_mmu_slb_size(node) do { } while(0)
> -#endif
> -
>  static struct feature_property {
>   const char *name;
>   u32 min_value;
> @@ -379,7 +364,6 @@ static int __init early_init_dt_scan_cpus(unsigned long 
> node,
>   }
>  
>   identical_pvr_fixup(node);
> - init_mmu_slb_size(node);
>  
>  #ifdef CONFIG_PPC64
>   if (nthreads == 1)
> diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
> index 4aa0797000f7..33ce76be17de 100644
> --- a/arch/powerpc/mm/hash_utils_64.c
> +++ b/arch/powerpc/mm/hash_utils_64.c
> @@ -344,9 +344,8 @@ static int __init parse_disable_1tb_segments(char *p)
>  }
>  early_param("disable_1tb_segments", parse_disable_1tb_segments);
>  
> -static int __init htab_dt_scan_seg_sizes(unsigned long node,
> -  const char *uname, int depth,
> -  void *data)
> +static int __init htab_dt_scan_slb(unsigned long node, const char *uname,
> +int depth, void *data)
>  {
>   const char *type = of_get_flat_dt_prop(node, "device_type", NULL);
>   const __be32 *prop;
> @@ -356,6 +355,12 @@ static int __init htab_dt_scan_seg_sizes(unsigned long 
> node,
>   if (type == NULL || strcmp(type, "cpu") != 0)
>   return 0;
>  
> + prop = of_get_flat_dt_prop(node, "slb-size", NULL);
> + if (!prop)
> + prop = of_get_flat_dt_prop(node, "ibm,slb-size", NULL);
> + if (prop)
> + slb_set_size(be32_to_cpup(prop));
> +
>   prop = of_get_flat_dt_prop(node, "ibm,processor-segment-sizes", );
>   if (prop == NULL)
>   return 0;
> @@ -954,8 +959,8 @@ static void __init htab_initialize(void)
>  
>  void __init hash__early_init_devtree(void)
>  {
> - /* Initialize segment sizes */
> - of_scan_flat_dt(htab_dt_scan_seg_sizes, NULL);
> + /* Initialize SLB size and segment sizes */
> + of_scan_flat_dt(htab_dt_scan_slb, NULL);
>  
>   /* Initialize page sizes */
>   htab_scan_page_sizes();
> -- 
> 2.20.1



Re: [PATCH 2/4] powerpc/64s: Add slb_full_bitmap rather than hard-coding U32_MAX

2019-01-23 Thread Aneesh Kumar K.V
Michael Ellerman  writes:

> The recent rewrite of the SLB code into C included the assumption that
> all CPUs we run on have at least 32 SLB entries. This is currently
> true but a bit fragile as the SLB size is actually defined by the
> device tree and so could theoretically change at any time.
>
> The assumption is encoded in the fact that we use U32_MAX as the value
> for a full SLB bitmap. Instead, calculate what the full bitmap would
> be based on the SLB size we're given and store it. This still requires
> the SLB size to be a power of 2.

So if it is less than 32 we want to make sure we don't allocate an index
above 32. Is this the reason for that radom assert_slb_presence?

Reviewed-by: Aneesh Kumar K.V  

>
> Fixes: 126b11b294d1 ("powerpc/64s/hash: Add SLB allocation status bitmaps")
> Signed-off-by: Michael Ellerman 
> ---
>  arch/powerpc/mm/slb.c | 9 -
>  1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/mm/slb.c b/arch/powerpc/mm/slb.c
> index bc3914d54e26..61450a9cf30d 100644
> --- a/arch/powerpc/mm/slb.c
> +++ b/arch/powerpc/mm/slb.c
> @@ -506,9 +506,16 @@ void switch_slb(struct task_struct *tsk, struct 
> mm_struct *mm)
>   asm volatile("isync" : : : "memory");
>  }
>  
> +static u32 slb_full_bitmap;
> +
>  void slb_set_size(u16 size)
>  {
>   mmu_slb_size = size;
> +
> + if (size >= 32)
> + slb_full_bitmap = U32_MAX;
> + else
> + slb_full_bitmap = (1ul << size) - 1;
>  }
>  
>  void slb_initialize(void)
> @@ -611,7 +618,7 @@ static enum slb_index alloc_slb_index(bool kernel)
>* POWER7/8/9 have 32 SLB entries, this could be expanded if a
>* future CPU has more.
>*/
> - if (local_paca->slb_used_bitmap != U32_MAX) {
> + if (local_paca->slb_used_bitmap != slb_full_bitmap) {
>   index = ffz(local_paca->slb_used_bitmap);
>   local_paca->slb_used_bitmap |= 1U << index;
>   if (kernel)
> -- 
> 2.20.1



Re: [PATCH 1/4] powerpc/64s: Always set mmu_slb_size using slb_set_size()

2019-01-23 Thread Aneesh Kumar K.V
Michael Ellerman  writes:

> It's easier to reason about the code if we only set mmu_slb_size in
> one place, so convert open-coded assignments to use slb_set_size().
>

Reviewed-by: Aneesh Kumar K.V 

> Signed-off-by: Michael Ellerman 
> ---
>  arch/powerpc/kernel/prom.c  | 2 +-
>  arch/powerpc/mm/pgtable-radix.c | 3 ++-
>  2 files changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
> index 4181ec715f88..14693f8ccb80 100644
> --- a/arch/powerpc/kernel/prom.c
> +++ b/arch/powerpc/kernel/prom.c
> @@ -238,7 +238,7 @@ static void __init init_mmu_slb_size(unsigned long node)
>   of_get_flat_dt_prop(node, "ibm,slb-size", NULL);
>  
>   if (slb_size_ptr)
> - mmu_slb_size = be32_to_cpup(slb_size_ptr);
> + slb_set_size(be32_to_cpup(slb_size_ptr));
>  }
>  #else
>  #define init_mmu_slb_size(node) do { } while(0)
> diff --git a/arch/powerpc/mm/pgtable-radix.c b/arch/powerpc/mm/pgtable-radix.c
> index 931156069a81..949fbc96b237 100644
> --- a/arch/powerpc/mm/pgtable-radix.c
> +++ b/arch/powerpc/mm/pgtable-radix.c
> @@ -328,7 +328,8 @@ void __init radix_init_pgtable(void)
>   struct memblock_region *reg;
>  
>   /* We don't support slb for radix */
> - mmu_slb_size = 0;
> + slb_set_size(0);
> +
>   /*
>* Create the linear mapping, using standard page size for now
>*/
> -- 
> 2.20.1



Re: [PATCH] powerpc/traps: fix recoverability of machine check handling on book3s/32

2019-01-23 Thread Nicholas Piggin
Christophe Leroy's on January 23, 2019 12:11 am:
> Looks like book3s/32 doesn't set RI on machine check, so
> checking RI before calling die() will always be fatal
> allthought this is not an issue in most cases.

Oh good catch, this is a fix for powerpc/64 as well. I think actually 
the panic was supposed to go in the bail: path, it's missing from
there.

But you may have a point, I don't know if we need to panic the box if 
we have !RI in process context... Possible we might be able to remove
those panics and just turn them into die.

Thanks,
Nick


Re: [PATCH 11/19] KVM: PPC: Book3S HV: add support for the XIVE native exploitation mode hcalls

2019-01-23 Thread Cédric Le Goater
On 1/23/19 7:44 AM, Benjamin Herrenschmidt wrote:
> On Tue, 2019-01-22 at 16:23 +1100, Paul Mackerras wrote:
>>
>> Which ones of these could be implemented in QEMU?  Are there any that
>> can't possibly be implemented in QEMU because they need to do things
>> that require calling internal interfaces that userspace doesn't have
>> access to?
> 
> Implementing them in qemu doesn't make a lot of sense. Qemu doesn't
> have access to most of the XIVE HW state. There's a XIVE model for full
> emulation but when using the real thing, almost none of it is visible.
> A lot of those hcalls effectively turn into OPAL calls.
> 
>> How often do we expect each of these hypercalls to be called?
> 
> It depends, I can't tell for AIX. For Linux, not often with one
> exception: H_INT_ESB which will be used whenever you EOI an emulated
> LSI.

yes. This one is only doing loads and stores.

>  .../...
> 
>> Why do we need to provide real-mode versions of these hypercall
>> handlers?  I thought these hypercalls would only get called
>> infrequently, and in any case certainly much less frequently than once
>> per interrupt delivered.  If they are infrequent, then let's leave out
>> the real-mode version and just handle them in book3s_hv.c.
> 
> Agreed with the exception maybe of H_INT_ESB

ok. 

Some of these hcalls are really simple and only getting local info from 
the host (h_int_get_*). I thought handling the hcall ASAP was a preferred 
practice, even if the hcall is not called frequently. Isn't it ?

 
>>> @@ -5153,6 +5169,19 @@ static unsigned int default_hcall_list[] = {
>>>H_IPOLL,
>>>H_XIRR,
>>>H_XIRR_X,
>>> +#endif
>>> +#ifdef CONFIG_KVM_XIVE
>>> + H_INT_GET_SOURCE_INFO,
>>> + H_INT_SET_SOURCE_CONFIG,
>>> + H_INT_GET_SOURCE_CONFIG,
>>> + H_INT_GET_QUEUE_INFO,
>>> + H_INT_SET_QUEUE_CONFIG,
>>> + H_INT_GET_QUEUE_CONFIG,
>>> + H_INT_SET_OS_REPORTING_LINE,
>>> + H_INT_GET_OS_REPORTING_LINE,
>>> + H_INT_ESB,
>>> + H_INT_SYNC,
>>> + H_INT_RESET,
>>>   #endif
>>
>> The policy is not to add new hcalls to default_hcall_list[].  Is there
>> a strong reason for adding them here?

I don't remember. I will check for v2.

Thanks, 

C.



Re: [PATCH 2/2] selftests/powerpc: benchmarks/context_switch.c use vector instructions/types

2019-01-23 Thread Christophe Leroy

Hi Cyril,

On 03/03/2016 11:06 PM, Cyril Bur wrote:

Currently it doesn't appear the resulting binary actually uses any Altivec
or VSX instructions the solution is to explicitly tell GCC to use vector
instructions and use vector types in the code.

Part of this this issue can be GCC version specific:

GCC 4.9.x is happy to use Altivec and VSX instructions if altivec.h is
includedi (and possibly if vector types are used), this also means that
4.9.x will use VSX instructions even if only -maltivec is passed. It is
also possible that Altivec instructions will be used even without -maltivec
or -mabi=altivec.

GCC 5.2.x complains about the lack of -maltivec parameter if altivec.h is
included and will not use VSX unless -mvsx is present on commandline.

GCC 5.3.0 has a regression that means __attribute__((__target__("no-vsx"))
fails to build. A fix is targeted for 5.4.

Furthermore LTO (Link Time Optimisation) doesn't play well with
__attribute__((__target__("no-vsx")), LTO can cause GCC to forget about the
attribute and compile with VSX instructions regardless. Be weary when
enabling -flfo for this test.

Signed-off-by: Cyril Bur 


This patch breaks the build on my setup:

ppc-linux-gcc -std=gnu99 -O2 -Wall -Werror 
-DGIT_VERSION='"v5.0-rc3-560-ge0ce62731d77"' 
-I/root/linux-powerpc/tools/testing/selftests/powerpc/include  -O2 
-maltivec -mvsx -mabi=altiveccontext_switch.c ../harness.c 
../utils.c -lpthread -o 
/root/linux-powerpc/tools/testing/selftests/powerpc/benchmarks/context_switch
context_switch.c:1:0: error: -mvsx requires hardware floating point 
[-Werror]

 /*
 ^
cc1: all warnings being treated as errors
../harness.c:1:0: error: -mvsx requires hardware floating point [-Werror]
 /*
 ^
cc1: all warnings being treated as errors
../utils.c:1:0: error: -mvsx requires hardware floating point [-Werror]
 /*
 ^
cc1: all warnings being treated as errors
make[1]: *** 
[/root/linux-powerpc/tools/testing/selftests/powerpc/benchmarks/context_switch] 
Error 1



By removing the -mvsx option, it compiles just fine.

Is that option really required ? According to gcc doc, it is 
automatically selected when compiling for cpus that support it.


I'm using:

[root@po16846vm linux-powerpc]# ppc-linux-gcc -v
Using built-in specs.
COLLECT_GCC=ppc-linux-gcc
COLLECT_LTO_WRAPPER=/opt/cldk-1.4.0/libexec/gcc/ppc-linux/5.4.0/lto-wrapper
Target: ppc-linux
Configured with: /root/cldk/gcc-5.4.0/configure --target=ppc-linux 
--with-headers=yes --with-cpu=860 --prefix=/opt/cldk-1.4.0 
--bindir=/opt/cldk-1.4.0/bin --sbindir=/opt/cldk-1.4.0/sbin 
--libexecdir=/opt/cldk-1.4.0/libexec --datadir=/opt/cldk-1.4.0/share 
--sysconfdir=/opt/cldk-1.4.0/etc --libdir=/opt/cldk-1.4.0/lib 
--includedir=/opt/cldk-1.4.0/usr/include 
--oldincludedir=/opt/cldk-1.4.0/usr/include 
--infodir=/opt/cldk-1.4.0/share/info --mandir=/opt/cldk-1.4.0/share/man 
--with-glibc-version=2.18 --enable-languages=c,c++

Thread model: posix
gcc version 5.4.0 (GCC)


Christophe


---
  tools/testing/selftests/powerpc/benchmarks/Makefile |  1 +
  tools/testing/selftests/powerpc/benchmarks/context_switch.c | 11 ---
  2 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/powerpc/benchmarks/Makefile 
b/tools/testing/selftests/powerpc/benchmarks/Makefile
index 912445f..6816fc2 100644
--- a/tools/testing/selftests/powerpc/benchmarks/Makefile
+++ b/tools/testing/selftests/powerpc/benchmarks/Makefile
@@ -7,6 +7,7 @@ all: $(TEST_PROGS)
  $(TEST_PROGS): ../harness.c
  
  context_switch: ../utils.c

+context_switch: CFLAGS += -maltivec -mvsx -mabi=altivec
  context_switch: LDLIBS += -lpthread
  
  include ../../lib.mk

diff --git a/tools/testing/selftests/powerpc/benchmarks/context_switch.c 
b/tools/testing/selftests/powerpc/benchmarks/context_switch.c
index e6af382..a36883a 100644
--- a/tools/testing/selftests/powerpc/benchmarks/context_switch.c
+++ b/tools/testing/selftests/powerpc/benchmarks/context_switch.c
@@ -25,7 +25,9 @@
  #include 
  #include 
  #include 
-
+#ifdef __powerpc__
+#include 
+#endif
  #include "../utils.h"
  
  static unsigned int timeout = 30;

@@ -37,12 +39,15 @@ static int touch_fp = 1;
  double fp;
  
  static int touch_vector = 1;

-typedef int v4si __attribute__ ((vector_size (16)));
-v4si a, b, c;
+vector int a, b, c;
  
  #ifdef __powerpc__

  static int touch_altivec = 1;
  
+/*

+ * Note: LTO (Link Time Optimisation) doesn't play well with this function
+ * attribute. Be very careful enabling LTO for this test.
+ */
  static void __attribute__((__target__("no-vsx"))) altivec_touch_fn(void)
  {
c = a + b;



Re: [PATCH 4/4] powerpc/64s: Support shrinking the SLB for debugging

2019-01-23 Thread Nicholas Piggin
Michael Ellerman's on January 19, 2019 8:27 pm:
> Michal Suchánek  writes:
> 
>> On Thu, 17 Jan 2019 23:13:28 +1100
>> Michael Ellerman  wrote:
>>
>>> On machines with 1TB segments and a 32-entry SLB it's quite hard to
>>> cause sufficient SLB pressure to trigger bugs caused due to badly
>>> timed SLB faults.
>>> 
>>> We have seen this in the past and a few years ago added the
>>> disable_1tb_segments command line option to force the use of 256MB
>>> segments. However even this allows some bugs to slip through testing
>>> if the SLB entry in question was recently accessed.
>>> 
>>> So add a new command line parameter for debugging which shrinks the
>>> SLB to the minimum size we can support. Currently that size is 3, two
>>> bolted SLBs and 1 for dynamic use. This creates the maximal SLB
>>
>> Doesn't this violate the power of 2 requirement stated in 2/4?
> 
> Yes. Good point. This was originally a hack patch in my tree, back when
> SLB_NUM_BOLTED was 3 and before Nick even added the slb_used_bitmap, so
> back then it was a power of 2 but it also didn't matter :)
> 
> I think I'll rework the slb_full_bitmap patch anyway and remove the
> power of 2 requirement, so then this patch will be OK.

Thanks, good patch and really will help testing. Sometimes even if you 
can get enough pressure with the 256MB segments, you actually want to 
test 1TB segment code paths too.

Thanks,
Nick



[PATCH -next] usb: host: drop pointless static qualifier

2019-01-23 Thread YueHaibing
There is no need to have the 'dummy_mask' variable static since new
value always be assigned before use it.

Signed-off-by: YueHaibing 
---
 drivers/usb/host/ehci-ps3.c | 2 +-
 drivers/usb/host/ohci-ps3.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/usb/host/ehci-ps3.c b/drivers/usb/host/ehci-ps3.c
index 454d8c6..91cee02 100644
--- a/drivers/usb/host/ehci-ps3.c
+++ b/drivers/usb/host/ehci-ps3.c
@@ -86,7 +86,7 @@ static int ps3_ehci_probe(struct ps3_system_bus_device *dev)
int result;
struct usb_hcd *hcd;
unsigned int virq;
-   static u64 dummy_mask;
+   u64 dummy_mask;
 
if (usb_disabled()) {
result = -ENODEV;
diff --git a/drivers/usb/host/ohci-ps3.c b/drivers/usb/host/ohci-ps3.c
index 395f9d3..a1c1bdf 100644
--- a/drivers/usb/host/ohci-ps3.c
+++ b/drivers/usb/host/ohci-ps3.c
@@ -69,7 +69,7 @@ static int ps3_ohci_probe(struct ps3_system_bus_device *dev)
int result;
struct usb_hcd *hcd;
unsigned int virq;
-   static u64 dummy_mask;
+   u64 dummy_mask;
 
if (usb_disabled()) {
result = -ENODEV;
-- 
2.7.0




Re: [PATCH] powerpc/64s: Remove MSR_RI optimisation in system_call_exit()

2019-01-23 Thread Nicholas Piggin
Michael Ellerman's on January 17, 2019 9:35 pm:
> Currently in system_call_exit() we have an optimisation where we
> disable MSR_RI (recoverable interrupt) and MSR_EE (external interrupt
> enable) in a single mtmsrd instruction.
> 
> Unfortunately this will no longer work with THREAD_INFO_IN_TASK,
> because then the load of TI_FLAGS might fault and faulting with MSR_RI
> clear is treated as an unrecoverable exception which leads to a
> panic().
> 
> So change the code to only clear MSR_EE prior to loading TI_FLAGS,
> leaving the clear of MSR_RI until later. We have some latitude in
> where do the clear of MSR_RI. A bit of experimentation has shown that
> this location gives the least slow down.
> 
> This still causes a noticeable slow down in our null_syscall
> performance. On a Power9 DD2.2:
> 
>   BeforeAfter Delta Delta %
>   955 cycles999 cycles-44 -4.6%
> 
> On the plus side this does simplify the code somewhat, because we
> don't have to reenable MSR_RI on the restore_math() or
> syscall_exit_work() paths which was necessitated previously by the
> optimisation.
> 
> Signed-off-by: Michael Ellerman 

Reviewed-by: Nicholas Piggin 

But only because spectre and meltdown broke my spirit.



Re: [PATCH kernel] vfio-pci/nvlink2: Fix ancient gcc warnings

2019-01-23 Thread Geert Uytterhoeven
Hi Alex,

On Wed, Jan 23, 2019 at 5:30 AM Alex Williamson
 wrote:
> The below patch comes about from the build regressions and improvements
> list you've sent out, but something doesn't add up that we'd be testing
> with an old compiler where initialization with { 0 } generates a
> "missing braces around initialization" warning.  Is this really the
> case or are we missing something here?  There's no harm that I can see
> with Alexey's fix, but are these really just false positives from a
> compiler bug that we should selectively ignore if the "fix" is less
> clean?  Thanks,

Yes, they are false positives, AFAIK.

> On Wed, 23 Jan 2019 15:07:11 +1100
> Alexey Kardashevskiy  wrote:
>
> > Using the {0} construct as a generic initializer is perfectly fine in C,
> > however due to a bug in old gcc there is a warning:
> >
> >   + /kisskb/src/drivers/vfio/pci/vfio_pci_nvlink2.c: warning: (near
> > initialization for 'cap.header') [-Wmissing-braces]:  => 181:9

These all seem to come from an old gcc 4.6, which is the oldest still
supported version for compiling Linux
http://kisskb.ellerman.id.au/kisskb/buildresult/13663641/

Note that kisskb is also using gcc 4.6.3 for s390x and mips, which are the only
other builds showing missing braces warnings.

> > Since for whatever reason we still want to compile the modern kernel
> > with such an old gcc without warnings, this changes the capabilities
> > initialization.
> >
> > The gcc bugzilla: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53119
> >
> > Signed-off-by: Alexey Kardashevskiy 
> > ---
> >  drivers/vfio/pci/vfio_pci_nvlink2.c | 30 ++---
> >  1 file changed, 15 insertions(+), 15 deletions(-)
> >
> > diff --git a/drivers/vfio/pci/vfio_pci_nvlink2.c 
> > b/drivers/vfio/pci/vfio_pci_nvlink2.c
> > index 054a2cf..91d945b 100644
> > --- a/drivers/vfio/pci/vfio_pci_nvlink2.c
> > +++ b/drivers/vfio/pci/vfio_pci_nvlink2.c
> > @@ -178,11 +178,11 @@ static int vfio_pci_nvgpu_add_capability(struct 
> > vfio_pci_device *vdev,
> >   struct vfio_pci_region *region, struct vfio_info_cap *caps)
> >  {
> >   struct vfio_pci_nvgpu_data *data = region->data;
> > - struct vfio_region_info_cap_nvlink2_ssatgt cap = { 0 };
> > -
> > - cap.header.id = VFIO_REGION_INFO_CAP_NVLINK2_SSATGT;
> > - cap.header.version = 1;
> > - cap.tgt = data->gpu_tgt;
> > + struct vfio_region_info_cap_nvlink2_ssatgt cap = {
> > + .header.id = VFIO_REGION_INFO_CAP_NVLINK2_SSATGT,
> > + .header.version = 1,
> > + .tgt = data->gpu_tgt
> > + };

I think the simpler change

-   struct vfio_region_info_cap_nvlink2_ssatgt cap = { 0 };
+   struct vfio_region_info_cap_nvlink2_ssatgt cap = { { 0 } };

should also work (tested with gcc 4.1).
That's how most of these were fixed in the past.

Gr{oetje,eeting}s,

Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds


Re: [PATCH] arch/powerpc/radix: Fix kernel crash with mremap

2019-01-23 Thread Aneesh Kumar K.V
"Aneesh Kumar K.V"  writes:

> With support for split pmd lock, we use pmd page pmd_huge_pte pointer to store
> the deposited page table. In those config when we move page tables we need to
> make sure we move the depoisted page table to the right pmd page. Otherwise 
> this
> can result in crash when we withdraw of deposited page table because we can 
> find
> the pmd_huge_pte NULL.
>
> c04a1230 __split_huge_pmd+0x1070/0x1940
> c04a0ff4 __split_huge_pmd+0xe34/0x1940 (unreliable)
> c04a4000 vma_adjust_trans_huge+0x110/0x1c0
> c042fe04 __vma_adjust+0x2b4/0x9b0
> c04316e8 __split_vma+0x1b8/0x280
> c043192c __do_munmap+0x13c/0x550
> c0439390 sys_mremap+0x220/0x7e0
> c000b488 system_call+0x5c/0x70
>

This was done on top of "NestMMU pte upgrade workaround for mprotect" series

> Fixes: 675d995297d4 ("powerpc/book3s64: Enable split pmd ptlock.")
> Signed-off-by: Aneesh Kumar K.V 

For applying to merge branch you can use the below patch

>From 33d28f144d41ecbba074ceefd234983422469e4b Mon Sep 17 00:00:00 2001
From: "Aneesh Kumar K.V" 
Date: Mon, 21 Jan 2019 16:43:17 +0530
Subject: [PATCH] arch/powerpc/radix: Fix kernel crash with mremap

With support for split pmd lock, we use pmd page pmd_huge_pte pointer to store
the deposited page table. In those config when we move page tables we need to
make sure we move the depoisted page table to the right pmd page. Otherwise this
can result in crash when we withdraw of deposited page table because we can find
the pmd_huge_pte NULL.

c04a1230 __split_huge_pmd+0x1070/0x1940
c04a0ff4 __split_huge_pmd+0xe34/0x1940 (unreliable)
c04a4000 vma_adjust_trans_huge+0x110/0x1c0
c042fe04 __vma_adjust+0x2b4/0x9b0
c04316e8 __split_vma+0x1b8/0x280
c043192c __do_munmap+0x13c/0x550
c0439390 sys_mremap+0x220/0x7e0
c000b488 system_call+0x5c/0x70

Fixes: 675d995297d4 ("powerpc/book3s64: Enable split pmd ptlock.")
Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/book3s/64/pgtable.h | 22 +++-
 arch/powerpc/mm/pgtable-book3s64.c   | 22 
 2 files changed, 29 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 2e6ada28da64..c9bfe526ca9d 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -1258,21 +1258,13 @@ extern pmd_t pmdp_invalidate(struct vm_area_struct 
*vma, unsigned long address,
 
 #define pmd_move_must_withdraw pmd_move_must_withdraw
 struct spinlock;
-static inline int pmd_move_must_withdraw(struct spinlock *new_pmd_ptl,
-struct spinlock *old_pmd_ptl,
-struct vm_area_struct *vma)
-{
-   if (radix_enabled())
-   return false;
-   /*
-* Archs like ppc64 use pgtable to store per pmd
-* specific information. So when we switch the pmd,
-* we should also withdraw and deposit the pgtable
-*/
-   return true;
-}
-
-
+extern int pmd_move_must_withdraw(struct spinlock *new_pmd_ptl,
+ struct spinlock *old_pmd_ptl,
+ struct vm_area_struct *vma);
+/*
+ * Hash translation mode use the deposited table to store hash pte
+ * slot information.
+ */
 #define arch_needs_pgtable_deposit arch_needs_pgtable_deposit
 static inline bool arch_needs_pgtable_deposit(void)
 {
diff --git a/arch/powerpc/mm/pgtable-book3s64.c 
b/arch/powerpc/mm/pgtable-book3s64.c
index f3c31f5e1026..65eaf784f866 100644
--- a/arch/powerpc/mm/pgtable-book3s64.c
+++ b/arch/powerpc/mm/pgtable-book3s64.c
@@ -400,3 +400,25 @@ void arch_report_meminfo(struct seq_file *m)
   atomic_long_read(_pages_count[MMU_PAGE_1G]) << 20);
 }
 #endif /* CONFIG_PROC_FS */
+
+/* For hash translation mode, we use the deposited table to store
+ * hash slot information and they are stored at PTRS_PER_PMD
+ * offset from related pmd location. Hence a pmd move requires
+ * deposit and withdraw.
+
+ * For radix translation with split pmd ptl, we store the deposited
+ * table in the pmd page. Hence if we have different pmd page we need
+ * to withdraw during pmd move.
+
+ * With hash we use deposited table always irrespective of anon or not.
+ * With radix we use deposited table only for anonymous mapping.
+ */
+int pmd_move_must_withdraw(struct spinlock *new_pmd_ptl,
+  struct spinlock *old_pmd_ptl,
+  struct vm_area_struct *vma)
+{
+   if (radix_enabled())
+   return (new_pmd_ptl != old_pmd_ptl) && vma_is_anonymous(vma);
+
+   return true;
+}
-- 
2.20.1