Re: [PATCH 2/2] s390/pci: skip VF scanning

2018-12-18 Thread Christoph Hellwig
On Tue, Dec 18, 2018 at 11:16:50AM +0100, Sebastian Ott wrote:
> Set the flag to skip scanning for VFs after SRIOV enablement.
> VF creation will be triggered by the hotplug code.
> 
> Signed-off-by: Sebastian Ott 

Looks good,

Reviewed-by: Christoph Hellwig 


Re: [PATCH 1/2] PCI/IOV: provide flag to skip VF scanning

2018-12-18 Thread Christoph Hellwig
On Tue, Dec 18, 2018 at 11:16:49AM +0100, Sebastian Ott wrote:
> Provide a flag to skip scanning for new VFs after SRIOV enablement.
> This can be set by implementations for which the VFs are already
> reported by other means.
> 
> Signed-off-by: Sebastian Ott 

Looks good,

Reviewed-by: Christoph Hellwig 


[PATCH v4] powerpc: implement CONFIG_DEBUG_VIRTUAL

2018-12-18 Thread Christophe Leroy
This patch implements CONFIG_DEBUG_VIRTUAL to warn about
incorrect use of virt_to_phys() and page_to_phys().

Below is the result of test_debug_virtual:

[1.438746] WARNING: CPU: 0 PID: 1 at ./arch/powerpc/include/asm/io.h:808 
test_debug_virtual_init+0x3c/0xd4
[1.448156] CPU: 0 PID: 1 Comm: swapper Not tainted 
4.20.0-rc5-00560-g6bfb52e23a00-dirty #532
[1.457259] NIP:  c066c550 LR: c0650ccc CTR: c066c514
[1.462257] REGS: c900bdb0 TRAP: 0700   Not tainted  
(4.20.0-rc5-00560-g6bfb52e23a00-dirty)
[1.471184] MSR:  00029032   CR: 48000422  XER: 2000
[1.477811]
[1.477811] GPR00: c0650ccc c900be60 c60d  006000c0 c900 
9032 c7fa0020
[1.477811] GPR08: 2400 0001 0900  c07b5d04  
c00037d8 
[1.477811] GPR16:     c076 c074 
0092 c0685bb0
[1.477811] GPR24: c065042c c068a734 c0685b8c 0006  c076 
c075c3c0 
[1.512711] NIP [c066c550] test_debug_virtual_init+0x3c/0xd4
[1.518315] LR [c0650ccc] do_one_initcall+0x8c/0x1cc
[1.523163] Call Trace:
[1.525595] [c900be60] [c0567340] 0xc0567340 (unreliable)
[1.530954] [c900be90] [c0650ccc] do_one_initcall+0x8c/0x1cc
[1.536551] [c900bef0] [c0651000] kernel_init_freeable+0x1f4/0x2cc
[1.542658] [c900bf30] [c00037ec] kernel_init+0x14/0x110
[1.547913] [c900bf40] [c000e1d0] ret_from_kernel_thread+0x14/0x1c
[1.553971] Instruction dump:
[1.556909] 3ca50100 bfa10024 54a5000e 3fa0c076 7c0802a6 3d454000 813dc204 
554893be
[1.564566] 7d294010 7d294910 90010034 39290001 <0f09> 7c3e0b78 955e0008 
3fe0c062
[1.572425] ---[ end trace 6f6984225b280ad6 ]---
[1.577467] PA: 0x0900 for VA: 0xc900
[1.581799] PA: 0x061e8f50 for VA: 0xc61e8f50
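
For reference, the two PA lines correspond to the two probes in
lib/test_debug_virtual.c: translating a vmalloc-space address (invalid, hence
the warning above) and a kmalloc'ed lowmem address (valid). A condensed sketch
of that test logic, assuming the upstream test module rather than code from
this patch:

	void *va = (void *)VMALLOC_START;	/* not a linear-map address */
	phys_addr_t pa = virt_to_phys(va);	/* WARNs when CONFIG_DEBUG_VIRTUAL=y */

	pr_info("PA: %pa for VA: 0x%lx\n", &pa, (unsigned long)va);

	foo = kzalloc(sizeof(*foo), GFP_KERNEL);	/* lowmem allocation: valid */
	pa = virt_to_phys(foo);				/* translates without warning */
	pr_info("PA: %pa for VA: 0x%lx\n", &pa, (unsigned long)foo);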

Signed-off-by: Christophe Leroy 
---
 v4: revised the verification in __ioremap_caller(): we keep the verification
 based on virt_to_phys() but do it on (high_memory - 1) instead of high_memory,
 because high_memory itself is not a valid virtual address (see the sketch
 after this changelog).

 v3: Added missing linux/mm.h.
 I realised that a driver may use DMA on the stack after checking with
 virt_addr_valid(), so the new verification might induce false positives.
 I removed it for now; I will add it again later in a more controlled way.

 v2: Using asm/pgtable.h to avoid a build failure on ppc64e.
 Added a verification that the object is not on the stack, to catch problems
 before activating VMAP_STACK.
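
 A sketch of why the v4 form is needed: high_memory is the first byte past the
 linear map, so with CONFIG_DEBUG_VIRTUAL enabled virt_to_phys(high_memory)
 itself fails virt_addr_valid() and warns, while probing the last mapped byte
 does not:

	if (p <= virt_to_phys(high_memory - 1))	/* ok: last valid lowmem byte */
		...
	if (p < virt_to_phys(high_memory))	/* may WARN: high_memory itself
						   is not a valid address */
		...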

 arch/powerpc/Kconfig  |  1 +
 arch/powerpc/include/asm/io.h | 13 -
 arch/powerpc/mm/pgtable_32.c  |  2 +-
 3 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index e312e92e3381..94b46624068d 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -128,6 +128,7 @@ config PPC
#
# Please keep this list sorted alphabetically.
#
+   select ARCH_HAS_DEBUG_VIRTUAL
select ARCH_HAS_DEVMEM_IS_ALLOWED
select ARCH_HAS_DMA_SET_COHERENT_MASK
select ARCH_HAS_ELF_RANDOMIZE
diff --git a/arch/powerpc/include/asm/io.h b/arch/powerpc/include/asm/io.h
index e746becd9d6f..7f19fbd3ba55 100644
--- a/arch/powerpc/include/asm/io.h
+++ b/arch/powerpc/include/asm/io.h
@@ -29,12 +29,14 @@ extern struct pci_dev *isa_bridge_pcidev;
 
 #include <linux/device.h>
 #include <linux/compiler.h>
+#include <linux/mm.h>
 #include <asm/page.h>
 #include <asm/byteorder.h>
 #include <asm/synch.h>
 #include <asm/delay.h>
 #include <asm/mmu.h>
 #include <asm/ppc_asm.h>
+#include <asm/pgtable.h>
 
 #ifdef CONFIG_PPC64
 #include <asm/paca.h>
@@ -804,6 +806,8 @@ extern void __iounmap_at(void *ea, unsigned long size);
  */
 static inline unsigned long virt_to_phys(volatile void * address)
 {
+   WARN_ON(IS_ENABLED(CONFIG_DEBUG_VIRTUAL) && !virt_addr_valid(address));
+
return __pa((unsigned long)address);
 }
 
@@ -827,7 +831,14 @@ static inline void * phys_to_virt(unsigned long address)
 /*
  * Change "struct page" to physical address.
  */
-#define page_to_phys(page) ((phys_addr_t)page_to_pfn(page) << PAGE_SHIFT)
+static inline phys_addr_t page_to_phys(struct page *page)
+{
+   unsigned long pfn = page_to_pfn(page);
+
+   WARN_ON(IS_ENABLED(CONFIG_DEBUG_VIRTUAL) && !pfn_valid(pfn));
+
+   return PFN_PHYS(pfn);
+}
 
 /*
  * 32 bits still uses virt_to_bus() for it's implementation of DMA
diff --git a/arch/powerpc/mm/pgtable_32.c b/arch/powerpc/mm/pgtable_32.c
index 4fc77a99c9bf..d67215248d82 100644
--- a/arch/powerpc/mm/pgtable_32.c
+++ b/arch/powerpc/mm/pgtable_32.c
@@ -143,7 +143,7 @@ __ioremap_caller(phys_addr_t addr, unsigned long size, 
pgprot_t prot, void *call
 * Don't allow anybody to remap normal RAM that we're using.
 * mem_init() sets high_memory so only do the check after that.
 */
-   if (slab_is_available() && (p < virt_to_phys(high_memory)) &&
+   if (slab_is_available() && p <= virt_to_phys(high_memory - 1) &&
page_is_ram(__phys_to_pfn(p))) {
printk("__ioremap(): phys addr 0x%llx is RAM lr %ps\n",
   (unsigned long long)p, __builtin_return_address(0));

Re: [PATCH v3] powerpc: implement CONFIG_DEBUG_VIRTUAL

2018-12-18 Thread Christophe Leroy




On 12/19/2018 06:57 AM, Christophe Leroy wrote:



Le 19/12/2018 à 01:26, Michael Ellerman a écrit :

Michael Ellerman  writes:

Christophe Leroy  writes:


This patch implements CONFIG_DEBUG_VIRTUAL to warn about
incorrect use of virt_to_phys() and page_to_phys()


This commit is breaking my p5020ds booting a 32-bit kernel with:

   smp: Bringing up secondary CPUs ...
   __ioremap(): phys addr 0x7fef5000 is RAM lr ioremap_coherent
   Unable to handle kernel paging request for data at address 0x
   Faulting instruction address: 0xc002e950
   Oops: Kernel access of bad area, sig: 11 [#1]
   BE SMP NR_CPUS=24 CoreNet Generic
   Modules linked in:
   CPU: 0 PID: 1 Comm: swapper/0 Not tainted 
4.20.0-rc2-gcc-7.0.1-00138-g9a0380d299e9 #148

   NIP:  c002e950 LR: c002eb20 CTR: 0001
   REGS: e804bd20 TRAP: 0300   Not tainted  
(4.20.0-rc2-gcc-7.0.1-00138-g9a0380d299e9)

   MSR:  00021002   CR: 28004222  XER: 
   DEAR:  ESR: 
   GPR00: c002eb20 e804bdd0 e805  00021002  
0050 00021002
   GPR08: 2d3f 0001  0004 24000842  
c00026d0 
   GPR16:       
 0001
   GPR24: 00029002 7fef5140 3000   0040 
0001 

   NIP [c002e950] smp_85xx_kick_cpu+0x120/0x410
   LR [c002eb20] smp_85xx_kick_cpu+0x2f0/0x410
   Call Trace:
   [e804bdd0] [c002eb20] smp_85xx_kick_cpu+0x2f0/0x410 (unreliable)
   [e804be20] [c0012e38] __cpu_up+0xc8/0x230
   [e804be50] [c0040b34] bringup_cpu+0x34/0x110
   [e804be70] [c00418a8] cpu_up+0x128/0x250
   [e804beb0] [c0b84b14] smp_init+0xc4/0x10c
   [e804bee0] [c0b75c1c] kernel_init_freeable+0xc8/0x250
   [e804bf20] [c00026e8] kernel_init+0x18/0x120
   [e804bf40] [c0011298] ret_from_kernel_thread+0x14/0x1c
   Instruction dump:
   7fb3e850 57bdd1be 2e1d 41d20250 57bd3032 393dffc0 7e6a9b78 
5529d1be
   39290001 7d2903a6 6000 6000 <7c0050ac> 394a0040 4200fff8 
7c0004ac

   ---[ end trace edcab2a1dfd5b38c ]---


Which is obviously this hunk:

diff --git a/arch/powerpc/mm/pgtable_32.c 
b/arch/powerpc/mm/pgtable_32.c

index 4fc77a99c9bf..68d204a45cd0 100644
--- a/arch/powerpc/mm/pgtable_32.c
+++ b/arch/powerpc/mm/pgtable_32.c
@@ -143,7 +143,7 @@ __ioremap_caller(phys_addr_t addr, unsigned long 
size, pgprot_t prot, void *call

   * Don't allow anybody to remap normal RAM that we're using.
   * mem_init() sets high_memory so only do the check after that.
   */
-    if (slab_is_available() && (p < virt_to_phys(high_memory)) &&
+    if (slab_is_available() && virt_addr_valid(p) &&
  page_is_ram(__phys_to_pfn(p))) {
  printk("__ioremap(): phys addr 0x%llx is RAM lr %ps\n",
 (unsigned long long)p, __builtin_return_address(0));



I'll try and come up with a fix tomorrow.


Actually I think that change is just wrong. virt_addr_valid() takes a
virtual address, but p is a physical address.

So I'll drop this hunk for now, which makes the patch a no-op when
DEBUG_VIRTUAL is n which is probably the way it should be.


The hunk is obviously wrong for sure. Anyway there's a problem, most 
likely high_memory is not a valid virtual address, so without this hunk 
I get the following warning at every ioremap():


[    0.00] WARNING: CPU: 0 PID: 0 at 
./arch/powerpc/include/asm/io.h:809 __ioremap_caller+0x9c/0x180
[    0.00] CPU: 0 PID: 0 Comm: swapper Not tainted 
4.20.0-rc6-s3k-dev-00677-g9c98dcab6203-dirty #615

[    0.00] NIP:  c000fcd0 LR: c000fc64 CTR: 
[    0.00] REGS: c073de50 TRAP: 0700   Not tainted 
(4.20.0-rc6-s3k-dev-00677-g9c98dcab6203-dirty)

[    0.00] MSR:  00021032   CR: 28944422  XER: f940
[    0.00]
[    0.00] GPR00: c000fe04 c073df00 c06e1450 0001 4023 
c073df38 c0018f50 0001
[    0.00] GPR08: 2000 0800 2000  88944224 
0060  07ff9580
[    0.00] GPR16:  07ffb94c    
  
[    0.00] GPR24:  c076 019f ff00 ff00 
c000fe04 4000 c0018f50

[    0.00] NIP [c000fcd0] __ioremap_caller+0x9c/0x180
[    0.00] LR [c000fc64] __ioremap_caller+0x30/0x180
[    0.00] Call Trace:
[    0.00] [c073df00] [c02fc23c] of_address_to_resource+0x114/0x154 
(unreliable)

[    0.00] [c073df30] [c000fe04] ioremap_wt+0x20/0x30
[    0.00] [c073df40] [c0018f50] mpc8xx_pic_init+0x70/0xf8
[    0.00] [c073df80] [c0655b84] mpc8xx_pics_init+0x10/0x6c
[    0.00] [c073df90] [c0675080] cmpc885_pics_init+0x14/0x118
[    0.00] [c073dfa0] [c0652eb0] init_IRQ+0x24/0x38
[    0.00] [c073dfb0] [c0650b10] start_kernel+0x2a8/0x3d4
[    0.00] [c073dff0] [c0002258] start_here+0x44/0x98
[    0.00] Instruction dump:
[    0.00] 419e00b8 7f83e378 480013fd 7c7d1b79 41820030 576304be 
7c63ea14 80010034
[    0.00] bb410018 7c0803a6 38210030 4e800020 <0fe0> 7f9c4840 
409cffc4 48a8

Re: [PATCH kernel v5 14/20] powerpc/powernv/npu: Add compound IOMMU groups

2018-12-18 Thread Alexey Kardashevskiy



On 19/12/2018 11:17, Michael Ellerman wrote:
> Alexey Kardashevskiy  writes:
>> diff --git a/arch/powerpc/platforms/powernv/npu-dma.c 
>> b/arch/powerpc/platforms/powernv/npu-dma.c
>> index dc629ee..3468eaa 100644
>> --- a/arch/powerpc/platforms/powernv/npu-dma.c
>> +++ b/arch/powerpc/platforms/powernv/npu-dma.c
>> @@ -372,8 +358,263 @@ struct npu {
> ...
>> +
>> +static void pnv_comp_attach_table_group(struct npu_comp *npucomp,
>> +struct pnv_ioda_pe *pe)
>> +{
>> +if (WARN_ON(npucomp->pe_num == NV_NPU_MAX_PE_NUM))
>> +return;
>> +
>> +npucomp->pe[npucomp->pe_num] = pe;
>> +++npucomp->pe_num;
>> +}
>> +
>> +struct iommu_table_group *pnv_try_setup_npu_table_group(struct pnv_ioda_pe 
>> *pe)
>> +{
>> +struct iommu_table_group *table_group;
>> +struct npu_comp *npucomp;
>> +struct pci_dev *gpdev = NULL;
>> +struct pci_controller *hose;
>> +struct pci_dev *npdev;
>> +
>> +list_for_each_entry(gpdev, &pe->pbus->devices, bus_list) {
>> +npdev = pnv_pci_get_npu_dev(gpdev, 0);
>> +if (npdev)
>> +break;
>> +}
>> +
>> +if (!npdev)
>> +/* It is not an NPU attached device, skip */
>> +return NULL;
> 
> This breaks some configs with:
> 
>   arch/powerpc/platforms/powernv/npu-dma.c:550:5: error: 'npdev' may be used 
> uninitialized in this function [-Werror=uninitialized]


gcc 5, 7 and 8 do not warn about this; I have to disable the
list_for_each_entry() above to recreate it.

I even compiled gcc 5.5, which some of your build machines use, and still
got no error on this:

make O=/home/aik/pbuild/kernel-le/ KCFLAGS=-Werror=all ARCH=powerpc
CROSS_COMPILE=/opt/cross/gcc-powerpc64le-linux-5.5.0-nolibc/bin/powerpc64le-linux-
arch/powerpc/platforms/powernv/npu-dma.o



I only get an error when I do:

@@ -525,6 +525,7 @@ struct iommu_table_group
*pnv_try_setup_npu_table_group(struct pnv_ioda_pe *pe)
struct pci_controller *hose;
struct pci_dev *npdev;

+   if (0)
list_for_each_entry(gpdev, &pe->pbus->devices, bus_list) {

npdev = pnv_pci_get_npu_dev(gpdev, 0);
if (npdev)



How do you compile?
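
Whichever compiler is right, the flagged path is real: if the bus device list
is empty, the loop body never runs and npdev is read uninitialized. A minimal
sketch of the usual fix (an assumption for illustration, not necessarily the
fix that was applied):

	struct pci_dev *npdev = NULL;	/* defined value for the empty-list case */

	list_for_each_entry(gpdev, &pe->pbus->devices, bus_list) {
		npdev = pnv_pci_get_npu_dev(gpdev, 0);
		if (npdev)
			break;
	}

	if (!npdev)
		/* It is not an NPU attached device, skip */
		return NULL;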


-- 
Alexey


Re: [PATCH V4 0/5] NestMMU pte upgrade workaround for mprotect

2018-12-18 Thread Benjamin Herrenschmidt
On Tue, 2018-12-18 at 09:17 -0800, Christoph Hellwig wrote:
> This series seems to miss patches 1 and 2.

Odd, I got them...

Ben.




Re: [PATCH V4 5/5] arch/powerpc/mm/hugetlb: NestMMU workaround for hugetlb mprotect RW upgrade

2018-12-18 Thread Christoph Hellwig
On Wed, Dec 19, 2018 at 08:50:57AM +0530, Aneesh Kumar K.V wrote:
> That was done considering that ptep_modify_prot_start/commit was defined
> in asm-generic/pgtable.h. I was trying to make sure I didn't break
> anything with the patch. Also s390 does have that EXPORT_SYMBOL() for the
> same; hugetlb just inherited it.
> 
> If you feel strongly about it, I can drop the EXPORT_SYMBOL().

Yes.  And we should probably remove the s390 one as well, as it isn't used
either.


Re: [PATCH kernel v5 10/20] powerpc/iommu_api: Move IOMMU groups setup to a single place

2018-12-18 Thread Alexey Kardashevskiy



On 19/12/2018 10:35, Michael Ellerman wrote:
> Alexey Kardashevskiy  writes:
> 
>> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
>> b/arch/powerpc/platforms/powernv/pci-ioda.c
>> index b86a6e0..1168b185 100644
>> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
>> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
>> @@ -2735,12 +2733,68 @@ static struct iommu_table_group_ops 
>> pnv_pci_ioda2_npu_ops = {
>>  .release_ownership = pnv_ioda2_release_ownership,
>>  };
>>  
>> +static void pnv_ioda_setup_bus_iommu_group_add_devices(struct pnv_ioda_pe 
>> *pe,
>> +struct pci_bus *bus)
>> +{
>> +struct pci_dev *dev;
>> +
>> +list_for_each_entry(dev, &bus->devices, bus_list) {
>> +iommu_add_device(&pe->table_group, &dev->dev);
>> +
>> +if ((pe->flags & PNV_IODA_PE_BUS_ALL) && dev->subordinate)
>> +pnv_ioda_setup_bus_iommu_group_add_devices(pe,
>> +dev->subordinate);
>> +}
>> +}
>> +
>> +static void pnv_ioda_setup_bus_iommu_group(struct pnv_ioda_pe *pe)
>> +{
>> +if (!pnv_pci_ioda_pe_dma_weight(pe))
>> +return;
>> +
>> +iommu_register_group(&pe->table_group, pe->phb->hose->global_number,
>> +pe->pe_number);
>> +
>> +/*
>> + * set_iommu_table_base(>pdev->dev, tbl) should have been called
>> + * by now
>> + */
>> +if (pe->flags & PNV_IODA_PE_DEV)
>> +iommu_add_device(&pe->table_group, &pe->pdev->dev);
>> +else if (pe->flags & (PNV_IODA_PE_BUS | PNV_IODA_PE_BUS_ALL))
>> +pnv_ioda_setup_bus_iommu_group_add_devices(pe, pe->pbus);
>> +}
>> +
> 
> This breaks skiroot_defconfig with:
> 
> arch/powerpc/platforms/powernv/pci-ioda.c:2731:13: error: 
> 'pnv_ioda_setup_bus_iommu_group' defined but not used 
> [-Werror=unused-function]
> 
>   http://kisskb.ellerman.id.au/kisskb/buildresult/13623033/


How do you enable these warnings? I do not get them no matter what I do.


-- 
Alexey


Re: [PATCH v2 5/5] powerpc/perf: Trace imc PMU functions

2018-12-18 Thread Madhavan Srinivasan



On 14/12/18 2:41 PM, Anju T Sudhakar wrote:

Add PMU functions to support trace-imc.


Reviewed-by: Madhavan Srinivasan 



Signed-off-by: Anju T Sudhakar 
---
  arch/powerpc/perf/imc-pmu.c | 175 
  1 file changed, 175 insertions(+)

diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c
index 1f09265c8fb0..32ff0e449fca 100644
--- a/arch/powerpc/perf/imc-pmu.c
+++ b/arch/powerpc/perf/imc-pmu.c
@@ -1120,6 +1120,173 @@ static int trace_imc_cpu_init(void)
  ppc_trace_imc_cpu_offline);
  }

+static u64 get_trace_imc_event_base_addr(void)
+{
+   return (u64)per_cpu(trace_imc_mem, smp_processor_id());
+}
+
+/*
+ * Function to parse trace-imc data obtained
+ * and to prepare the perf sample.
+ */
+static int trace_imc_prepare_sample(struct trace_imc_data *mem,
+   struct perf_sample_data *data,
+   u64 *prev_tb,
+   struct perf_event_header *header,
+   struct perf_event *event)
+{
+   /* Sanity checks for a valid record */
+   if (be64_to_cpu(READ_ONCE(mem->tb1)) > *prev_tb)
+   *prev_tb = be64_to_cpu(READ_ONCE(mem->tb1));
+   else
+   return -EINVAL;
+
+   if ((be64_to_cpu(READ_ONCE(mem->tb1)) & IMC_TRACE_RECORD_TB1_MASK) !=
+be64_to_cpu(READ_ONCE(mem->tb2)))
+   return -EINVAL;
+
+   /* Prepare perf sample */
+   data->ip =  be64_to_cpu(READ_ONCE(mem->ip));
+   data->period = event->hw.last_period;
+
+   header->type = PERF_RECORD_SAMPLE;
+   header->size = sizeof(*header) + event->header_size;
+   header->misc = 0;
+
+   if (is_kernel_addr(data->ip))
+   header->misc |= PERF_RECORD_MISC_KERNEL;
+   else
+   header->misc |= PERF_RECORD_MISC_USER;
+
+   perf_event_header__init_id(header, data, event);
+
+   return 0;
+}
+
+static void dump_trace_imc_data(struct perf_event *event)
+{
+   struct trace_imc_data *mem;
+   int i, ret;
+   u64 prev_tb = 0;
+
+   mem = (struct trace_imc_data *)get_trace_imc_event_base_addr();
+   for (i = 0; i < (trace_imc_mem_size / sizeof(struct trace_imc_data));
+   i++, mem++) {
+   struct perf_sample_data data;
+   struct perf_event_header header;
+
+   ret = trace_imc_prepare_sample(mem, &data, &prev_tb, &header, event);
+   if (ret) /* Exit, if not a valid record */
+   break;
+   else {
+   /* If this is a valid record, create the sample */
+   struct perf_output_handle handle;
+
+   if (perf_output_begin(&handle, event, header.size))
+   return;
+
+   perf_output_sample(&handle, &header, &data, event);
+   perf_output_end(&handle);
+   }
+   }
+}
+
+static int trace_imc_event_add(struct perf_event *event, int flags)
+{
+   /* Enable the sched_task to start the engine */
+   perf_sched_cb_inc(event->ctx->pmu);
+   return 0;
+}
+
+static void trace_imc_event_read(struct perf_event *event)
+{
+   dump_trace_imc_data(event);
+}
+
+static void trace_imc_event_stop(struct perf_event *event, int flags)
+{
+   trace_imc_event_read(event);
+}
+
+static void trace_imc_event_start(struct perf_event *event, int flags)
+{
+   return;
+}
+
+static void trace_imc_event_del(struct perf_event *event, int flags)
+{
+   perf_sched_cb_dec(event->ctx->pmu);
+}
+
+void trace_imc_pmu_sched_task(struct perf_event_context *ctx,
+   bool sched_in)
+{
+   int core_id = smp_processor_id() / threads_per_core;
+   struct imc_pmu_ref *ref;
+   u64 local_mem, ldbar_value;
+
+   /* Set trace-imc bit in ldbar and load ldbar with per-thread memory 
address */
+   local_mem = get_trace_imc_event_base_addr();
+   ldbar_value = ((u64)local_mem & THREAD_IMC_LDBAR_MASK) | 
TRACE_IMC_ENABLE;
+
+   ref = &core_imc_refc[core_id];
+   if (!ref)
+   return;
+
+   if (sched_in) {
+   mtspr(SPRN_LDBAR, ldbar_value);
+   mutex_lock(&ref->lock);
+   if (ref->refc == 0) {
+   if (opal_imc_counters_start(OPAL_IMC_COUNTERS_TRACE,
+   get_hard_smp_processor_id(smp_processor_id()))) {
+   mutex_unlock(&ref->lock);
+   pr_err("trace-imc: Unable to start the counters for core %d\n", core_id);
+   mtspr(SPRN_LDBAR, 0);
+   return;
+   }
+   }
+   ++ref->refc;
+   mutex_unlock(&ref->lock);
+   } else {
+   mtspr(SPRN_LDBAR, 0);
+   mutex_lock(&ref->lock);
+   ref->refc--;
+   if (ref->refc == 0) {
+ 
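
The record-validity logic in trace_imc_prepare_sample() above boils down to
two checks; a sketch paraphrasing the hunk, not additional code from the
patch:

	u64 tb1 = be64_to_cpu(READ_ONCE(mem->tb1));
	u64 tb2 = be64_to_cpu(READ_ONCE(mem->tb2));

	if (tb1 > *prev_tb)
		*prev_tb = tb1;		/* timebase advanced: a new record */
	else
		return -EINVAL;		/* stale or already-consumed slot */

	if ((tb1 & IMC_TRACE_RECORD_TB1_MASK) != tb2)
		return -EINVAL;		/* torn/incomplete 64-byte record */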

Re: [PATCH v2 4/5] powerpc/perf: Trace imc events detection and cpuhotplug

2018-12-18 Thread Madhavan Srinivasan



On 14/12/18 2:41 PM, Anju T Sudhakar wrote:

Patch detects trace-imc events, does memory initializations for each online
cpu, and registers cpuhotplug call-backs.


Reviewed-by: Madhavan Srinivasan 


Signed-off-by: Anju T Sudhakar 
---
  arch/powerpc/perf/imc-pmu.c   | 91 +++
  arch/powerpc/platforms/powernv/opal-imc.c |  3 +
  include/linux/cpuhotplug.h|  1 +
  3 files changed, 95 insertions(+)

diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c
index 5ca80545a849..1f09265c8fb0 100644
--- a/arch/powerpc/perf/imc-pmu.c
+++ b/arch/powerpc/perf/imc-pmu.c
@@ -43,6 +43,10 @@ static DEFINE_PER_CPU(u64 *, thread_imc_mem);
  static struct imc_pmu *thread_imc_pmu;
  static int thread_imc_mem_size;

+/* Trace IMC data structures */
+static DEFINE_PER_CPU(u64 *, trace_imc_mem);
+static int trace_imc_mem_size;
+
  static struct imc_pmu *imc_event_to_pmu(struct perf_event *event)
  {
return container_of(event->pmu, struct imc_pmu, pmu);
@@ -1068,6 +1072,54 @@ static void thread_imc_event_del(struct perf_event 
*event, int flags)
imc_event_update(event);
  }

+/*
+ * Allocate a page of memory for each cpu, and load LDBAR with 0.
+ */
+static int trace_imc_mem_alloc(int cpu_id, int size)
+{
+   u64 *local_mem = per_cpu(trace_imc_mem, cpu_id);
+   int phys_id = cpu_to_node(cpu_id), rc = 0;
+
+   if (!local_mem) {
+   local_mem = page_address(alloc_pages_node(phys_id,
+   GFP_KERNEL | __GFP_ZERO | 
__GFP_THISNODE |
+   __GFP_NOWARN, get_order(size)));
+   if (!local_mem)
+   return -ENOMEM;
+   per_cpu(trace_imc_mem, cpu_id) = local_mem;
+
+   /* Initialise the counters for trace mode */
+   rc = opal_imc_counters_init(OPAL_IMC_COUNTERS_TRACE, __pa((void 
*)local_mem),
+   get_hard_smp_processor_id(cpu_id));
+   if (rc) {
+   pr_info("IMC:opal init failed for trace imc\n");
+   return rc;
+   }
+   }
+
+   mtspr(SPRN_LDBAR, 0);
+   return 0;
+}
+
+static int ppc_trace_imc_cpu_online(unsigned int cpu)
+{
+   return trace_imc_mem_alloc(cpu, trace_imc_mem_size);
+}
+
+static int ppc_trace_imc_cpu_offline(unsigned int cpu)
+{
+   mtspr(SPRN_LDBAR, 0);
+   return 0;
+}
+
+static int trace_imc_cpu_init(void)
+{
+   return cpuhp_setup_state(CPUHP_AP_PERF_POWERPC_TRACE_IMC_ONLINE,
+ "perf/powerpc/imc_trace:online",
+ ppc_trace_imc_cpu_online,
+ ppc_trace_imc_cpu_offline);
+}
+
  /* update_pmu_ops : Populate the appropriate operations for "pmu" */
  static int update_pmu_ops(struct imc_pmu *pmu)
  {
@@ -1189,6 +1241,17 @@ static void cleanup_all_thread_imc_memory(void)
}
  }

+static void cleanup_all_trace_imc_memory(void)
+{
+   int i, order = get_order(trace_imc_mem_size);
+
+   for_each_online_cpu(i) {
+   if (per_cpu(trace_imc_mem, i))
+   free_pages((u64)per_cpu(trace_imc_mem, i), order);
+
+   }
+}
+
  /* Function to free the attr_groups which are dynamically allocated */
  static void imc_common_mem_free(struct imc_pmu *pmu_ptr)
  {
@@ -1230,6 +1293,11 @@ static void imc_common_cpuhp_mem_free(struct imc_pmu 
*pmu_ptr)
cpuhp_remove_state(CPUHP_AP_PERF_POWERPC_THREAD_IMC_ONLINE);
cleanup_all_thread_imc_memory();
}
+
+   if (pmu_ptr->domain == IMC_DOMAIN_TRACE) {
+   cpuhp_remove_state(CPUHP_AP_PERF_POWERPC_TRACE_IMC_ONLINE);
+   cleanup_all_trace_imc_memory();
+   }
  }

  /*
@@ -1312,6 +1380,21 @@ static int imc_mem_init(struct imc_pmu *pmu_ptr, struct 
device_node *parent,

thread_imc_pmu = pmu_ptr;
break;
+   case IMC_DOMAIN_TRACE:
+   /* Update the pmu name */
+   pmu_ptr->pmu.name = kasprintf(GFP_KERNEL, "%s%s", s, "_imc");
+   if (!pmu_ptr->pmu.name)
+   return -ENOMEM;
+
+   trace_imc_mem_size = pmu_ptr->counter_mem_size;
+   for_each_online_cpu(cpu) {
+   res = trace_imc_mem_alloc(cpu, trace_imc_mem_size);
+   if (res) {
+   cleanup_all_trace_imc_memory();
+   goto err;
+   }
+   }
+   break;
default:
return -EINVAL;
}
@@ -1384,6 +1467,14 @@ int init_imc_pmu(struct device_node *parent, struct 
imc_pmu *pmu_ptr, int pmu_id
goto err_free_mem;
}

+   break;
+   case IMC_DOMAIN_TRACE:
+   ret = trace_imc_cpu_init();
+   if (ret) {
+   cleanup_all_trace_imc_memory();
+  

Re: [PATCH v2 2/5] powerpc/perf: Rearrange setting of ldbar for thread-imc

2018-12-18 Thread Madhavan Srinivasan



On 14/12/18 2:41 PM, Anju T Sudhakar wrote:

LDBAR holds the memory address allocated for each cpu. For thread-imc
the mode bit (i.e. bit 1) of LDBAR is set to accumulation.
Currently, ldbar is loaded with per cpu memory address and mode set to
accumulation at boot time.

To enable trace-imc, the mode bit of ldbar should be set to 'trace'. So to
accommodate trace-mode of IMC, reposition setting of ldbar for thread-imc
to thread_imc_event_add(). Also reset ldbar at thread_imc_event_del().
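
As a sketch, using the macros from patch 1/5 of this series, the two LDBAR
modes differ only in which mode bit is OR-ed into the per-thread buffer
address (per_cpu_mem below stands in for the per-cpu allocation):

	u64 base = (u64)per_cpu_mem & THREAD_IMC_LDBAR_MASK;

	mtspr(SPRN_LDBAR, base | THREAD_IMC_ENABLE);	/* accumulation mode */
	mtspr(SPRN_LDBAR, base | TRACE_IMC_ENABLE);	/* trace mode */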


Changes look fine to me.

Reviewed-by: Madhavan Srinivasan 


Signed-off-by: Anju T Sudhakar 
---
  arch/powerpc/perf/imc-pmu.c | 28 +---
  1 file changed, 17 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c
index f292a3f284f1..3bef46f8417d 100644
--- a/arch/powerpc/perf/imc-pmu.c
+++ b/arch/powerpc/perf/imc-pmu.c
@@ -806,8 +806,11 @@ static int core_imc_event_init(struct perf_event *event)
  }

  /*
- * Allocates a page of memory for each of the online cpus, and write the
- * physical base address of that page to the LDBAR for that cpu.
+ * Allocates a page of memory for each of the online cpus, and load
+ * LDBAR with 0.
+ * The physical base address of the page allocated for a cpu will be
+ * written to the LDBAR for that cpu, when the thread-imc event
+ * is added.
   *
   * LDBAR Register Layout:
   *
@@ -825,7 +828,7 @@ static int core_imc_event_init(struct perf_event *event)
   */
  static int thread_imc_mem_alloc(int cpu_id, int size)
  {
-   u64 ldbar_value, *local_mem = per_cpu(thread_imc_mem, cpu_id);
+   u64 *local_mem = per_cpu(thread_imc_mem, cpu_id);
int nid = cpu_to_node(cpu_id);

if (!local_mem) {
@@ -842,9 +845,7 @@ static int thread_imc_mem_alloc(int cpu_id, int size)
per_cpu(thread_imc_mem, cpu_id) = local_mem;
}

-   ldbar_value = ((u64)local_mem & THREAD_IMC_LDBAR_MASK) | 
THREAD_IMC_ENABLE;
-
-   mtspr(SPRN_LDBAR, ldbar_value);
+   mtspr(SPRN_LDBAR, 0);
return 0;
  }

@@ -995,6 +996,7 @@ static int thread_imc_event_add(struct perf_event *event, 
int flags)
  {
int core_id;
struct imc_pmu_ref *ref;
+   u64 ldbar_value, *local_mem = per_cpu(thread_imc_mem, 
smp_processor_id());

if (flags & PERF_EF_START)
imc_event_start(event, flags);
@@ -1003,6 +1005,9 @@ static int thread_imc_event_add(struct perf_event *event, 
int flags)
return -EINVAL;

core_id = smp_processor_id() / threads_per_core;
+   ldbar_value = ((u64)local_mem & THREAD_IMC_LDBAR_MASK) | 
THREAD_IMC_ENABLE;
+   mtspr(SPRN_LDBAR, ldbar_value);
+
/*
 * imc pmus are enabled only when it is used.
 * See if this is triggered for the first time.
@@ -1034,11 +1039,7 @@ static void thread_imc_event_del(struct perf_event 
*event, int flags)
int core_id;
struct imc_pmu_ref *ref;

-   /*
-* Take a snapshot and calculate the delta and update
-* the event counter values.
-*/
-   imc_event_update(event);
+   mtspr(SPRN_LDBAR, 0);

core_id = smp_processor_id() / threads_per_core;
ref = &core_imc_refc[core_id];
@@ -1057,6 +1058,11 @@ static void thread_imc_event_del(struct perf_event 
*event, int flags)
ref->refc = 0;
}
mutex_unlock(&ref->lock);
+   /*
+* Take a snapshot and calculate the delta and update
+* the event counter values.
+*/
+   imc_event_update(event);
  }

  /* update_pmu_ops : Populate the appropriate operations for "pmu" */




Re: [PATCH v2 1/5] powerpc/include: Add data structures and macros for IMC trace mode

2018-12-18 Thread Madhavan Srinivasan



On 14/12/18 2:41 PM, Anju T Sudhakar wrote:

Add the macros needed for IMC (In-Memory Collection Counters) trace-mode
and data structure to hold the trace-imc record data.
Also, add the new type "OPAL_IMC_COUNTERS_TRACE" in 'opal-api.h', since
there is a new switch case added in the opal-calls for IMC.


Reviewed-by: Madhavan Srinivasan 


Signed-off-by: Anju T Sudhakar 
---
  arch/powerpc/include/asm/imc-pmu.h  | 39 +
  arch/powerpc/include/asm/opal-api.h |  1 +
  2 files changed, 40 insertions(+)

diff --git a/arch/powerpc/include/asm/imc-pmu.h 
b/arch/powerpc/include/asm/imc-pmu.h
index 69f516ecb2fd..7c2ef0e42661 100644
--- a/arch/powerpc/include/asm/imc-pmu.h
+++ b/arch/powerpc/include/asm/imc-pmu.h
@@ -33,6 +33,7 @@
   */
   #define THREAD_IMC_LDBAR_MASK   0x0003ffffffffe000ULL
   #define THREAD_IMC_ENABLE   0x8000000000000000ULL
+#define TRACE_IMC_ENABLE   0x4000000000000000ULL

  /*
   * For debugfs interface for imc-mode and imc-command
@@ -59,6 +60,34 @@ struct imc_events {
char *scale;
  };

+/*
+ * Trace IMC hardware updates a 64bytes record on
+ * Core Performance Monitoring Counter (CPMC)
+ * overflow. Here is the layout for the trace imc record
+ *
+ * DW 0 : Timebase
+ * DW 1 : Program Counter
+ * DW 2 : PIDR information
+ * DW 3 : CPMC1
+ * DW 4 : CPMC2
+ * DW 5 : CPMC3
+ * Dw 6 : CPMC4
+ * DW 7 : Timebase
+ * .
+ *
+ * The following is the data structure to hold trace imc data.
+ */
+struct trace_imc_data {
+   u64 tb1;
+   u64 ip;
+   u64 val;
+   u64 cpmc1;
+   u64 cpmc2;
+   u64 cpmc3;
+   u64 cpmc4;
+   u64 tb2;
+};
+
  /* Event attribute array index */
  #define IMC_FORMAT_ATTR   0
  #define IMC_EVENT_ATTR1
@@ -68,6 +97,13 @@ struct imc_events {
  /* PMU Format attribute macros */
   #define IMC_EVENT_OFFSET_MASK 0xffffffffULL

+/*
+ * Macro to mask bits 0:21 of first double word(which is the timebase) to
+ * compare with 8th double word (timebase) of trace imc record data.
+ */
+#define IMC_TRACE_RECORD_TB1_MASK  0x3ffffffffffULL
+
+
  /*
   * Device tree parser code detects IMC pmu support and
   * registers new IMC pmus. This structure will hold the
@@ -113,6 +149,7 @@ struct imc_pmu_ref {

  enum {
IMC_TYPE_THREAD = 0x1,
+   IMC_TYPE_TRACE  = 0x2,
IMC_TYPE_CORE   = 0x4,
IMC_TYPE_CHIP   = 0x10,
  };
@@ -123,6 +160,8 @@ enum {
  #define IMC_DOMAIN_NEST   1
  #define IMC_DOMAIN_CORE   2
  #define IMC_DOMAIN_THREAD 3
+/* For trace-imc the domain is still thread but it operates in trace-mode */
+#define IMC_DOMAIN_TRACE   4

  extern int init_imc_pmu(struct device_node *parent,
struct imc_pmu *pmu_ptr, int pmu_id);
diff --git a/arch/powerpc/include/asm/opal-api.h 
b/arch/powerpc/include/asm/opal-api.h
index 870fb7b239ea..a4130b21b159 100644
--- a/arch/powerpc/include/asm/opal-api.h
+++ b/arch/powerpc/include/asm/opal-api.h
@@ -1118,6 +1118,7 @@ enum {
  enum {
OPAL_IMC_COUNTERS_NEST = 1,
OPAL_IMC_COUNTERS_CORE = 2,
+   OPAL_IMC_COUNTERS_TRACE = 3,
  };
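
Since the hardware writes exactly one 64-byte record per CPMC overflow, the
eight u64 fields of struct trace_imc_data must stay exactly 64 bytes. A
consumer could assert that at build time; an illustrative line, not part of
the patch:

	BUILD_BUG_ON(sizeof(struct trace_imc_data) != 64);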






Re: [PATCH v2] powerpc/perf: Fix loop exit condition in nest_imc_event_init

2018-12-18 Thread Madhavan Srinivasan



On 18/12/18 11:50 AM, Anju T Sudhakar wrote:

The data structure (i.e struct imc_mem_info) to hold the memory address
information for nest imc units is allocated based on the number of nodes
in the system.

nest_imc_event_init() traverses this struct array to calculate the memory
base address for the event-cpu. If we fail to find a match for the event
cpu's chip-id in the imc_mem_info struct array, then the do-while loop will
iterate until we crash.

Fix this by changing the loop exit condition to stop at the first zero vbase
element in the array, since the allocation is done for nr_chips + 1 entries.
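
With the kcalloc(nr_chips + 1, ...) allocation below, the extra element stays
zero-filled and acts as a sentinel, so the traversal is guaranteed to
terminate. A sketch of the fixed loop shape:

	struct imc_mem_info *pcni = pmu->mem_info;

	do {
		if (pcni->id == chip_id) {	/* found the event cpu's chip */
			flag = true;
			break;
		}
		pcni++;
	} while (pcni->vbase != 0);	/* stop at the zeroed sentinel */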


Reviewed-by: Madhavan Srinivasan 

Btw, we will also need this patch to go along with
https://patchwork.ozlabs.org/patch/1003669/

These two fixes need to go to stable as well.


Reported-by: Dan Carpenter 
Fixes: 885dcd709ba91 ( powerpc/perf: Add nest IMC PMU support)
Signed-off-by: Anju T Sudhakar 
---
  arch/powerpc/perf/imc-pmu.c   | 2 +-
  arch/powerpc/platforms/powernv/opal-imc.c | 2 +-
  2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c
index 4f34c75..d1009fe 100644
--- a/arch/powerpc/perf/imc-pmu.c
+++ b/arch/powerpc/perf/imc-pmu.c
@@ -508,7 +508,7 @@ static int nest_imc_event_init(struct perf_event *event)
break;
}
pcni++;
-   } while (pcni);
+   } while (pcni->vbase != 0);

if (!flag)
return -ENODEV;
diff --git a/arch/powerpc/platforms/powernv/opal-imc.c 
b/arch/powerpc/platforms/powernv/opal-imc.c
index 58a0794..3d27f02 100644
--- a/arch/powerpc/platforms/powernv/opal-imc.c
+++ b/arch/powerpc/platforms/powernv/opal-imc.c
@@ -127,7 +127,7 @@ static int imc_get_mem_addr_nest(struct device_node *node,
nr_chips))
goto error;

-   pmu_ptr->mem_info = kcalloc(nr_chips, sizeof(*pmu_ptr->mem_info),
+   pmu_ptr->mem_info = kcalloc(nr_chips + 1, sizeof(*pmu_ptr->mem_info),
GFP_KERNEL);
if (!pmu_ptr->mem_info)
goto error;




Re: [RFC PATCH kernel] vfio/spapr_tce: Get rid of possible infinite loop

2018-12-18 Thread Alexey Kardashevskiy



On 08/10/2018 21:18, Michael Ellerman wrote:
> Serhii Popovych  writes:
>> Alexey Kardashevskiy wrote:
>>> As a part of cleanup, the SPAPR TCE IOMMU subdriver releases preregistered
>>> memory. If there is a bug in memory release, the loop in
>>> tce_iommu_release() becomes infinite; this actually happened to me.
>>>
>>> This makes the loop finite and prints a warning on every failure, making
>>> bugs easier to spot.
>>>
>>> Signed-off-by: Alexey Kardashevskiy 
>>> ---
>>>  drivers/vfio/vfio_iommu_spapr_tce.c | 10 +++---
>>>  1 file changed, 3 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
>>> b/drivers/vfio/vfio_iommu_spapr_tce.c
>>> index b1a8ab3..ece0651 100644
>>> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
>>> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
>>> @@ -393,13 +394,8 @@ static void tce_iommu_release(void *iommu_data)
>>> tce_iommu_free_table(container, tbl);
>>> }
>>>  
>>> -   while (!list_empty(>prereg_list)) {
>>> -   struct tce_iommu_prereg *tcemem;
>>> -
>>> -   tcemem = list_first_entry(&container->prereg_list,
>>> -   struct tce_iommu_prereg, next);
>>> -   WARN_ON_ONCE(tce_iommu_prereg_free(container, tcemem));
>>> -   }
>>> +   list_for_each_entry_safe(tcemem, tmtmp, &container->prereg_list, next)
>>> +   WARN_ON(tce_iommu_prereg_free(container, tcemem));
>>
>> I'm not sure that calling tce_iommu_prereg_free() under WARN_ON() is a good
>> idea, because WARN_ON() is a preprocessor macro:
>>
>>   if CONFIG_WARN=n is ever added, by analogy with CONFIG_BUG=n defining
>>   WARN_ON() as empty, we will lose the call to tce_iommu_prereg_free(),
>>   leaking resources.
> 
> I don't think that's likely to ever happen though, we have a large
> number of uses that would need to be checked one-by-one:
> 
>   $ git grep "if (WARN_ON(" | wc -l
>   2853
> 
> 
> So if we ever did add CONFIG_WARN, I think it would still need to
> evaluate the condition, just not emit a warning.
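
That is in fact how the existing CONFIG_BUG=n fallback behaves: the condition
is still evaluated, so side effects such as the free keep running and only
the report is dropped. A sketch along the lines of include/asm-generic/bug.h:

	/* report-less WARN_ON that preserves side effects */
	#define WARN_ON(condition) ({			\
		int __ret_warn_on = !!(condition);	\
		unlikely(__ret_warn_on);		\
	})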

Is anyone taking this?


-- 
Alexey


Re: [PATCH] powerpc/prom: move the device tree if not in declared memory.

2018-12-18 Thread Michael Ellerman
Christophe Leroy  writes:

> If the device tree doesn't reside in the memory which is declared
> inside it, it has to be moved as well, since that memory will not be
> mapped by the kernel.

I worry this will break some obscure platform, but I'll merge it anyway
and we'll see :)

cheers

> diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
> index 87a68e2dc531..4181ec715f88 100644
> --- a/arch/powerpc/kernel/prom.c
> +++ b/arch/powerpc/kernel/prom.c
> @@ -124,8 +124,8 @@ static void __init move_device_tree(void)
>   size = fdt_totalsize(initial_boot_params);
>  
>   if ((memory_limit && (start + size) > PHYSICAL_START + memory_limit) ||
> - overlaps_crashkernel(start, size) ||
> - overlaps_initrd(start, size)) {
> + !memblock_is_memory(start + size - 1) ||
> + overlaps_crashkernel(start, size) || overlaps_initrd(start, size)) {
>   p = __va(memblock_phys_alloc(size, PAGE_SIZE));
>   memcpy(p, initial_boot_params, size);
>   initial_boot_params = p;
> -- 
> 2.13.3


Re: [PATCH kernel v5 20/20] vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] subdriver

2018-12-18 Thread Alexey Kardashevskiy



On 19/12/2018 09:37, Alex Williamson wrote:
> On Thu, 13 Dec 2018 17:17:34 +1100
> Alexey Kardashevskiy  wrote:
> 
>> POWER9 Witherspoon machines come with 4 or 6 V100 GPUs which are not
>> pluggable PCIe devices but still have PCIe links which are used
>> for config space and MMIO. In addition to that the GPUs have 6 NVLinks
>> which are connected to other GPUs and the POWER9 CPU. POWER9 chips
>> have a special unit on a die called an NPU which is an NVLink2 host bus
>> adapter with p2p connections to 2 to 3 GPUs, 3 or 2 NVLinks to each.
>> These systems also support ATS (address translation services) which is
>> a part of the NVLink2 protocol. Such GPUs also share on-board RAM
>> (16GB or 32GB) to the system via the same NVLink2 so a CPU has
>> cache-coherent access to a GPU RAM.
>>
>> This exports GPU RAM to the userspace as a new VFIO device region. This
>> preregisters the new memory as device memory as it might be used for DMA.
>> This inserts pfns from the fault handler as the GPU memory is not onlined
>> until the vendor driver is loaded and trained the NVLinks so doing this
>> earlier causes low level errors which we fence in the firmware so
>> it does not hurt the host system but still better be avoided; for the same
>> reason this does not map GPU RAM into the host kernel (usual thing for
>> emulated access otherwise).
>>
>> This exports an ATSD (Address Translation Shootdown) register of NPU which
>> allows TLB invalidations inside GPU for an operating system. The register
>> conveniently occupies a single 64k page. It is also presented to
>> the userspace as a new VFIO device region. One NPU has 8 ATSD registers,
>> each of them can be used for TLB invalidation in a GPU linked to this NPU.
>> This allocates one ATSD register per an NVLink bridge allowing passing
>> up to 6 registers. Due to the host firmware bug (just recently fixed),
>> only 1 ATSD register per NPU was actually advertised to the host system
>> so this passes that alone register via the first NVLink bridge device in
>> the group which is still enough as QEMU collects them all back and
>> presents to the guest via vPHB to mimic the emulated NPU PHB on the host.
>>
>> In order to provide the userspace with the information about GPU-to-NVLink
>> connections, this exports an additional capability called "tgt"
>> (which is an abbreviated host system bus address). The "tgt" property
>> tells the GPU its own system address and allows the guest driver to
>> conglomerate the routing information so each GPU knows how to get directly
>> to the other GPUs.
>>
>> For ATS to work, the nest MMU (an NVIDIA block in a P9 CPU) needs to
>> know LPID (a logical partition ID or a KVM guest hardware ID in other
>> words) and PID (a memory context ID of a userspace process, not to be
>> confused with a linux pid). This assigns a GPU to LPID in the NPU and
>> this is why this adds a listener for KVM on an IOMMU group. A PID comes
>> via NVLink from a GPU and NPU uses a PID wildcard to pass it through.
>>
>> This requires coherent memory and ATSD to be available on the host as
>> the GPU vendor only supports configurations with both features enabled
>> and other configurations are known not to work. Because of this and
>> because of the ways the features are advertised to the host system
>> (which is a device tree with very platform specific properties),
>> this requires enabled POWERNV platform.
>>
>> The V100 GPUs do not advertise any of these capabilities via the config
>> space and there are more than just one device ID so this relies on
>> the platform to tell whether these GPUs have special abilities such as
>> NVLinks.
>>
>> Signed-off-by: Alexey Kardashevskiy 
>> ---
>> Changes:
>> v5:
>> * do not memremap GPU RAM for emulation, map it only when it is needed
>> * allocate 1 ATSD register per NVLink bridge, if none left, then expose
>> the region with a zero size
>> * separate caps per device type
>> * addressed AW review comments
>>
>> v4:
>> * added nvlink-speed to the NPU bridge capability as this turned out to
>> be not a constant value
>> * instead of looking at the exact device ID (which also changes from system
>> to system), now this (indirectly) looks at the device tree to know
>> if GPU and NPU support NVLink
>>
>> v3:
>> * reworded the commit log about tgt
>> * added tracepoints (do we want them enabled for entire vfio-pci?)
>> * added code comments
>> * added write|mmap flags to the new regions
>> * auto enabled VFIO_PCI_NVLINK2 config option
>> * added 'tgt' capability to a GPU so QEMU can recreate ibm,npu and ibm,gpu
>> references; there are required by the NVIDIA driver
>> * keep notifier registered only for short time
>> ---
>>  drivers/vfio/pci/Makefile   |   1 +
>>  drivers/vfio/pci/trace.h| 102 ++
>>  drivers/vfio/pci/vfio_pci_private.h |  14 +
>>  include/uapi/linux/vfio.h   |  39 +++
>>  drivers/vfio/pci/vfio_pci.c |  27 +-
>>  drivers/vfio/pci/vfio_pci_nvlink2.c | 473 

[PATCH V5 3/3] powerpc/mm/iommu: Allow large IOMMU page size only for hugetlb backing

2018-12-18 Thread Aneesh Kumar K.V
THP pages can get split via different code paths. An incremented reference
count does imply we will not split the compound page, but the pmd entry can
still be converted to level 4 pte entries. Keep the code simpler by allowing a
large IOMMU page size only if the guest RAM is backed by hugetlb pages.
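
A worked example of what the hunk below permits: guest RAM backed by 16MB
hugetlb pages on a 64K-page kernel gives compound_order(head) == 8, so
pageshift becomes 8 + 16 = 24 and the IOMMU may use 16MB pages; any page that
is not PageHuge() stays at PAGE_SHIFT. Condensed:

	if ((mem->pageshift > PAGE_SHIFT) && PageHuge(page))
		pageshift = compound_order(compound_head(page)) + PAGE_SHIFT;
	else
		pageshift = PAGE_SHIFT;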

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/mmu_context_iommu.c | 24 +++-
 1 file changed, 7 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/mm/mmu_context_iommu.c 
b/arch/powerpc/mm/mmu_context_iommu.c
index 1d5161f93ce6..0741d905ed04 100644
--- a/arch/powerpc/mm/mmu_context_iommu.c
+++ b/arch/powerpc/mm/mmu_context_iommu.c
@@ -95,8 +95,6 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, 
unsigned long entries,
struct mm_iommu_table_group_mem_t *mem;
long i, ret = 0, locked_entries = 0;
unsigned int pageshift;
-   unsigned long flags;
-   unsigned long cur_ua;
 
mutex_lock(&mem_list_mutex);
 
@@ -159,23 +157,15 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, 
unsigned long entries,
pageshift = PAGE_SHIFT;
for (i = 0; i < entries; ++i) {
struct page *page = mem->hpages[i];
-   cur_ua = ua + (i << PAGE_SHIFT);
 
-   if (mem->pageshift > PAGE_SHIFT && PageCompound(page)) {
-   pte_t *pte;
+   /*
+* Allow to use larger than 64k IOMMU pages. Only do that
+* if we are backed by hugetlb.
+*/
+   if ((mem->pageshift > PAGE_SHIFT) && PageHuge(page)) {
struct page *head = compound_head(page);
-   unsigned int compshift = compound_order(head);
-   unsigned int pteshift;
-
-   local_irq_save(flags); /* disables as well */
-   pte = find_linux_pte(mm->pgd, cur_ua, NULL, &pteshift);
-
-   /* Double check it is still the same pinned page */
-   if (pte && pte_page(*pte) == head &&
-   pteshift == compshift + PAGE_SHIFT)
-   pageshift = max_t(unsigned int, pteshift,
-   PAGE_SHIFT);
-   local_irq_restore(flags);
+
+   pageshift = compound_order(head) + PAGE_SHIFT;
}
mem->pageshift = min(mem->pageshift, pageshift);
/*
-- 
2.19.2



[PATCH V5 2/3] powerpc/mm/iommu: Allow migration of cma allocated pages during mm_iommu_get

2018-12-18 Thread Aneesh Kumar K.V
Current code doesn't do page migration if the page allocated is a compound page.
With HugeTLB migration support, we can end up allocating hugetlb pages from
the CMA region. THP pages can also be allocated from the CMA region. This patch
updates the code to handle compound pages correctly.

This uses the new helper get_user_pages_cma_migrate. It does one get_user_pages
with the right count, instead of doing one get_user_pages per page. That avoids
reading the page table multiple times.

The patch also converts the hpas member of mm_iommu_table_group_mem_t to a union.
We use the same storage location to store pointers to struct page. We cannot
update all the code paths to use struct page *, because we access hpas in real
mode and we can't do the struct page * to pfn conversion in real mode.
Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/mmu_context_iommu.c | 120 
 1 file changed, 35 insertions(+), 85 deletions(-)

diff --git a/arch/powerpc/mm/mmu_context_iommu.c 
b/arch/powerpc/mm/mmu_context_iommu.c
index 56c2234cc6ae..1d5161f93ce6 100644
--- a/arch/powerpc/mm/mmu_context_iommu.c
+++ b/arch/powerpc/mm/mmu_context_iommu.c
@@ -21,6 +21,7 @@
 #include 
 #include 
 #include 
+#include 
 
 static DEFINE_MUTEX(mem_list_mutex);
 
@@ -34,8 +35,18 @@ struct mm_iommu_table_group_mem_t {
atomic64_t mapped;
unsigned int pageshift;
u64 ua; /* userspace address */
-   u64 entries;/* number of entries in hpas[] */
-   u64 *hpas;  /* vmalloc'ed */
+   u64 entries;/* number of entries in hpages[] */
+   /*
+* in mm_iommu_get we temporarily use this to store
+* struct page address.
+*
+* We need to convert ua to hpa in real mode. Make it
+* simpler by storing physical address.
+*/
+   union {
+   struct page **hpages;   /* vmalloc'ed */
+   phys_addr_t *hpas;
+   };
 };
 
 static long mm_iommu_adjust_locked_vm(struct mm_struct *mm,
@@ -78,63 +89,14 @@ bool mm_iommu_preregistered(struct mm_struct *mm)
 }
 EXPORT_SYMBOL_GPL(mm_iommu_preregistered);
 
-/*
- * Taken from alloc_migrate_target with changes to remove CMA allocations
- */
-struct page *new_iommu_non_cma_page(struct page *page, unsigned long private)
-{
-   gfp_t gfp_mask = GFP_USER;
-   struct page *new_page;
-
-   if (PageCompound(page))
-   return NULL;
-
-   if (PageHighMem(page))
-   gfp_mask |= __GFP_HIGHMEM;
-
-   /*
-* We don't want the allocation to force an OOM if possibe
-*/
-   new_page = alloc_page(gfp_mask | __GFP_NORETRY | __GFP_NOWARN);
-   return new_page;
-}
-
-static int mm_iommu_move_page_from_cma(struct page *page)
-{
-   int ret = 0;
-   LIST_HEAD(cma_migrate_pages);
-
-   /* Ignore huge pages for now */
-   if (PageCompound(page))
-   return -EBUSY;
-
-   lru_add_drain();
-   ret = isolate_lru_page(page);
-   if (ret)
-   return ret;
-
-   list_add(&page->lru, &cma_migrate_pages);
-   put_page(page); /* Drop the gup reference */
-
-   ret = migrate_pages(&cma_migrate_pages, new_iommu_non_cma_page,
-   NULL, 0, MIGRATE_SYNC, MR_CONTIG_RANGE);
-   if (ret) {
-   if (!list_empty(&cma_migrate_pages))
-   putback_movable_pages(&cma_migrate_pages);
-   }
-
-   return 0;
-}
-
 long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long 
entries,
struct mm_iommu_table_group_mem_t **pmem)
 {
struct mm_iommu_table_group_mem_t *mem;
-   long i, j, ret = 0, locked_entries = 0;
+   long i, ret = 0, locked_entries = 0;
unsigned int pageshift;
unsigned long flags;
unsigned long cur_ua;
-   struct page *page = NULL;
 
mutex_lock(&mem_list_mutex);
 
@@ -181,41 +143,24 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, 
unsigned long entries,
goto unlock_exit;
}
 
+   ret = get_user_pages_cma_migrate(ua, entries, 1, mem->hpages);
+   if (ret != entries) {
+   /* free the reference taken */
+   for (i = 0; i < ret; i++)
+   put_page(mem->hpages[i]);
+
+   vfree(mem->hpas);
+   kfree(mem);
+   ret = -EFAULT;
+   goto unlock_exit;
+   } else
+   ret = 0;
+
+   pageshift = PAGE_SHIFT;
for (i = 0; i < entries; ++i) {
+   struct page *page = mem->hpages[i];
cur_ua = ua + (i << PAGE_SHIFT);
-   if (1 != get_user_pages_fast(cur_ua,
-   1/* pages */, 1/* iswrite */, &page)) {
-   ret = -EFAULT;
-   for (j = 0; j < i; ++j)
-   put_page(pfn_to_page(mem->hpas[j] >>
-   PAGE_SHIFT));
- 

[PATCH V5 1/3] mm: Add get_user_pages_cma_migrate

2018-12-18 Thread Aneesh Kumar K.V
This helper does a get_user_pages_fast and if it finds pages in the CMA area
it will try to migrate them before taking the page reference. This makes sure
that we don't keep non-movable pages (due to the page reference count) in the
CMA area. Not being able to move pages out of the CMA area results in CMA
allocation failures.
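
A usage sketch mirroring the caller added in patch 2/3 of this series: the
return value is the number of pages actually pinned, so a short count means
the caller must drop whatever was pinned before failing:

	ret = get_user_pages_cma_migrate(ua, entries, 1 /* write */, pages);
	if (ret != entries) {
		while (ret > 0)			/* undo the partial pin */
			put_page(pages[--ret]);
		return -EFAULT;
	}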

Signed-off-by: Aneesh Kumar K.V 
---
 include/linux/hugetlb.h |   2 +
 include/linux/migrate.h |   3 +
 mm/hugetlb.c|   4 +-
 mm/migrate.c| 139 
 4 files changed, 146 insertions(+), 2 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 087fd5f48c91..1eed0cdaec0e 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -371,6 +371,8 @@ struct page *alloc_huge_page_nodemask(struct hstate *h, int 
preferred_nid,
nodemask_t *nmask);
 struct page *alloc_huge_page_vma(struct hstate *h, struct vm_area_struct *vma,
unsigned long address);
+struct page *alloc_migrate_huge_page(struct hstate *h, gfp_t gfp_mask,
+int nid, nodemask_t *nmask);
 int huge_add_to_page_cache(struct page *page, struct address_space *mapping,
pgoff_t idx);
 
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index f2b4abbca55e..d82b35afd2eb 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -286,6 +286,9 @@ static inline int migrate_vma(const struct migrate_vma_ops 
*ops,
 }
 #endif /* IS_ENABLED(CONFIG_MIGRATE_VMA_HELPER) */
 
+extern int get_user_pages_cma_migrate(unsigned long start, int nr_pages, int 
write,
+ struct page **pages);
+
 #endif /* CONFIG_MIGRATION */
 
 #endif /* _LINUX_MIGRATE_H */
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 7f2a28ab46d5..faf3102ae45e 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1585,8 +1585,8 @@ static struct page *alloc_surplus_huge_page(struct hstate 
*h, gfp_t gfp_mask,
return page;
 }
 
-static struct page *alloc_migrate_huge_page(struct hstate *h, gfp_t gfp_mask,
-   int nid, nodemask_t *nmask)
+struct page *alloc_migrate_huge_page(struct hstate *h, gfp_t gfp_mask,
+int nid, nodemask_t *nmask)
 {
struct page *page;
 
diff --git a/mm/migrate.c b/mm/migrate.c
index f7e4bfdc13b7..d564558fba03 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2946,3 +2946,142 @@ int migrate_vma(const struct migrate_vma_ops *ops,
 }
 EXPORT_SYMBOL(migrate_vma);
 #endif /* defined(MIGRATE_VMA_HELPER) */
+
+static struct page *new_non_cma_page(struct page *page, unsigned long private)
+{
+   /*
+* We want to make sure we allocate the new page from the same node
+* as the source page.
+*/
+   int nid = page_to_nid(page);
+   /*
+* Trying to allocate a page for migration. Ignore allocation
+* failure warnings
+*/
+   gfp_t gfp_mask = GFP_USER | __GFP_THISNODE | __GFP_NOWARN;
+
+   if (PageHighMem(page))
+   gfp_mask |= __GFP_HIGHMEM;
+
+#ifdef CONFIG_HUGETLB_PAGE
+   if (PageHuge(page)) {
+   struct hstate *h = page_hstate(page);
+   /*
+* We don't want to dequeue from the pool because pool pages 
will
+* mostly be from the CMA region.
+*/
+   return alloc_migrate_huge_page(h, gfp_mask, nid, NULL);
+   }
+#endif
+   if (PageTransHuge(page)) {
+   struct page *thp;
+   /*
+* ignore allocation failure warnings
+*/
+   gfp_t thp_gfpmask = GFP_TRANSHUGE | __GFP_THISNODE | 
__GFP_NOWARN;
+
+   /*
+* Remove the movable mask so that we don't allocate from
+* CMA area again.
+*/
+   thp_gfpmask &= ~__GFP_MOVABLE;
+   thp = __alloc_pages_node(nid, thp_gfpmask, HPAGE_PMD_ORDER);
+   if (!thp)
+   return NULL;
+   prep_transhuge_page(thp);
+   return thp;
+   }
+
+   return __alloc_pages_node(nid, gfp_mask, 0);
+}
+
+/**
+ * get_user_pages_cma_migrate() - pin user pages in memory by migrating pages 
in CMA region
+ * @start: starting user address
+ * @nr_pages:  number of pages from start to pin
+ * @write: whether pages will be written to
+ * @pages: array that receives pointers to the pages pinned.
+ * Should be at least nr_pages long.
+ *
+ * Attempt to pin user pages in memory without taking mm->mmap_sem.
+ * If not successful, it will fall back to taking the lock and
+ * calling get_user_pages().
+ *
+ * If the pinned pages are backed by CMA region, we migrate those pages out,
+ * allocating new pages from non-CMA region. This helps in avoiding keeping
+ * pages pinned in the CMA region for a long time thereby resulting in
+ * CMA allocation failures.
+ *
+ 

[PATCH V5 0/3] mm/kvm/vfio/ppc64: Migrate compound pages out of CMA region

2018-12-18 Thread Aneesh Kumar K.V
ppc64 uses the CMA area for the allocation of the guest page table (hash page
table). We won't be able to start a guest if we fail to allocate the hash page
table. We have observed hash table allocation failures because we failed to
migrate pages out of the CMA region because they were pinned. This happens when
we are using VFIO. VFIO on ppc64 pins the entire guest RAM. If the guest RAM
pages get allocated out of the CMA region, we won't be able to migrate those
pages. The pages are also pinned for the lifetime of the guest.

Currently we support migration of non-compound pages. With THP and with the
addition of hugetlb migration we can end up allocating compound pages from the
CMA region. This patch series adds support for migrating compound pages. The
first patch adds the helper get_user_pages_cma_migrate(), which pins the pages,
making sure we migrate them out of the CMA region before incrementing the
reference count.

Changes from V4:
* use __GFP_NOWARN when allocating pages to avoid page allocation failure 
warnings.

Changes from V3:
* Move the hugetlb check before transhuge check
* Use compound head page when isolating hugetlb page



Aneesh Kumar K.V (3):
  mm: Add get_user_pages_cma_migrate
  powerpc/mm/iommu: Allow migration of cma allocated pages during
mm_iommu_get
  powerpc/mm/iommu: Allow large IOMMU page size only for hugetlb backing

 arch/powerpc/mm/mmu_context_iommu.c | 140 
 include/linux/hugetlb.h |   2 +
 include/linux/migrate.h |   3 +
 mm/hugetlb.c|   4 +-
 mm/migrate.c| 139 +++
 5 files changed, 186 insertions(+), 102 deletions(-)

-- 
2.19.2



Re: [PATCH V4 5/5] arch/powerpc/mm/hugetlb: NestMMU workaround for hugetlb mprotect RW upgrade

2018-12-18 Thread Aneesh Kumar K.V
Christoph Hellwig  writes:

> On Tue, Dec 18, 2018 at 03:11:37PM +0530, Aneesh Kumar K.V wrote:
>> +EXPORT_SYMBOL(huge_ptep_modify_prot_start);
>
> The only user of this function is the one you added in the last patch
> in mm/hugetlb.c, so there is no need to export this function.
>
>> +
>> +void huge_ptep_modify_prot_commit(struct vm_area_struct *vma, unsigned long 
>> addr,
>> +  pte_t *ptep, pte_t old_pte, pte_t pte)
>> +{
>> +
>> +if (radix_enabled())
>> +return radix__huge_ptep_modify_prot_commit(vma, addr, ptep,
>> +   old_pte, pte);
>> +set_huge_pte_at(vma->vm_mm, addr, ptep, pte);
>> +}
>> +EXPORT_SYMBOL(huge_ptep_modify_prot_commit);
>
> Same here.

That was done considering that ptep_modify_prot_start/commit was defined
in asm-generic/pgtable.h; I was trying to make sure I didn't break
anything with the patch. Also, s390 does have that EXPORT_SYMBOL() for the
same functions, and hugetlb just inherited it.

If you feel strongly about it, I can drop the EXPORT_SYMBOL().

-aneesh
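
For what it's worth, dropping the exports as suggested is mechanical: the
definitions stay and only the export lines go away, since the sole caller
is built-in code in mm/hugetlb.c (a sketch, not the actual diff):

	void huge_ptep_modify_prot_commit(struct vm_area_struct *vma,
					  unsigned long addr, pte_t *ptep,
					  pte_t old_pte, pte_t pte)
	{
		/* body unchanged */
	}
	/* EXPORT_SYMBOL(huge_ptep_modify_prot_commit); removed */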



Re: [PATCH V4 0/5] NestMMU pte upgrade workaround for mprotect

2018-12-18 Thread Aneesh Kumar K.V
Christoph Hellwig  writes:

> This series seems to miss patches 1 and 2.

https://lore.kernel.org/linuxppc-dev/20181218094137.13732-2-aneesh.ku...@linux.ibm.com/
https://lore.kernel.org/linuxppc-dev/20181218094137.13732-3-aneesh.ku...@linux.ibm.com/

-aneesh



Re: [PATCH v3] powerpc: implement CONFIG_DEBUG_VIRTUAL

2018-12-18 Thread Michael Ellerman
Michael Ellerman  writes:
> Christophe Leroy  writes:
>
>> This patch implements CONFIG_DEBUG_VIRTUAL to warn about
>> incorrect use of virt_to_phys() and page_to_phys()
>
> This commit is breaking my p5020ds booting a 32-bit kernel with:
>
>   smp: Bringing up secondary CPUs ...
>   __ioremap(): phys addr 0x7fef5000 is RAM lr ioremap_coherent
>   Unable to handle kernel paging request for data at address 0x
>   Faulting instruction address: 0xc002e950
>   Oops: Kernel access of bad area, sig: 11 [#1]
>   BE SMP NR_CPUS=24 CoreNet Generic
>   Modules linked in:
>   CPU: 0 PID: 1 Comm: swapper/0 Not tainted 
> 4.20.0-rc2-gcc-7.0.1-00138-g9a0380d299e9 #148
>   NIP:  c002e950 LR: c002eb20 CTR: 0001
>   REGS: e804bd20 TRAP: 0300   Not tainted  
> (4.20.0-rc2-gcc-7.0.1-00138-g9a0380d299e9)
>   MSR:  00021002   CR: 28004222  XER: 
>   DEAR:  ESR:  
>   GPR00: c002eb20 e804bdd0 e805  00021002  0050 
> 00021002 
>   GPR08: 2d3f 0001  0004 24000842  c00026d0 
>  
>   GPR16:        
> 0001 
>   GPR24: 00029002 7fef5140 3000   0040 0001 
>  
>   NIP [c002e950] smp_85xx_kick_cpu+0x120/0x410
>   LR [c002eb20] smp_85xx_kick_cpu+0x2f0/0x410
>   Call Trace:
>   [e804bdd0] [c002eb20] smp_85xx_kick_cpu+0x2f0/0x410 (unreliable)
>   [e804be20] [c0012e38] __cpu_up+0xc8/0x230
>   [e804be50] [c0040b34] bringup_cpu+0x34/0x110
>   [e804be70] [c00418a8] cpu_up+0x128/0x250
>   [e804beb0] [c0b84b14] smp_init+0xc4/0x10c
>   [e804bee0] [c0b75c1c] kernel_init_freeable+0xc8/0x250
>   [e804bf20] [c00026e8] kernel_init+0x18/0x120
>   [e804bf40] [c0011298] ret_from_kernel_thread+0x14/0x1c
>   Instruction dump:
>   7fb3e850 57bdd1be 2e1d 41d20250 57bd3032 393dffc0 7e6a9b78 5529d1be 
>   39290001 7d2903a6 6000 6000 <7c0050ac> 394a0040 4200fff8 7c0004ac 
>   ---[ end trace edcab2a1dfd5b38c ]---
>
>
> Which is obviously this hunk:
>
>> diff --git a/arch/powerpc/mm/pgtable_32.c b/arch/powerpc/mm/pgtable_32.c
>> index 4fc77a99c9bf..68d204a45cd0 100644
>> --- a/arch/powerpc/mm/pgtable_32.c
>> +++ b/arch/powerpc/mm/pgtable_32.c
>> @@ -143,7 +143,7 @@ __ioremap_caller(phys_addr_t addr, unsigned long size, 
>> pgprot_t prot, void *call
>>   * Don't allow anybody to remap normal RAM that we're using.
>>   * mem_init() sets high_memory so only do the check after that.
>>   */
>> -if (slab_is_available() && (p < virt_to_phys(high_memory)) &&
>> +if (slab_is_available() && virt_addr_valid(p) &&
>>  page_is_ram(__phys_to_pfn(p))) {
>>  printk("__ioremap(): phys addr 0x%llx is RAM lr %ps\n",
>> (unsigned long long)p, __builtin_return_address(0));
>
>
> I'll try and come up with a fix tomorrow.

Actually I think that change is just wrong. virt_addr_valid() takes a
virtual address, but p is a physical address.

So I'll drop this hunk for now, which makes the patch a no-op when
DEBUG_VIRTUAL is n, which is probably the way it should be.

cheers


Re: [PATCH] powerpc/ptrace: fix empty-body warning

2018-12-18 Thread Michael Ellerman
Hi Mathieu,

Mathieu Malaterre  writes:
> In commit a225f1567405 ("powerpc/ptrace: replace ptrace_report_syscall()
> with a tracehook call") an empty body if(); was added.
>
> Replace ; with {} to remove a warning (treated as error) reported by gcc
> using W=1:
>
>   arch/powerpc/kernel/ptrace.c: In function ‘do_syscall_trace_enter’:
>   arch/powerpc/kernel/ptrace.c:3281:4: error: suggest braces around empty body in an ‘if’ statement [-Werror=empty-body]
>
> Fixes: a225f1567405 ("powerpc/ptrace: replace ptrace_report_syscall() with a tracehook call")
> Signed-off-by: Mathieu Malaterre 

Thanks for the fix, but this code is being refactored already in next,
see:

https://patchwork.ozlabs.org/patch/1014179/

cheers


Re: [PATCH kernel v5 14/20] powerpc/powernv/npu: Add compound IOMMU groups

2018-12-18 Thread Michael Ellerman
Alexey Kardashevskiy  writes:
> diff --git a/arch/powerpc/platforms/powernv/npu-dma.c 
> b/arch/powerpc/platforms/powernv/npu-dma.c
> index dc629ee..3468eaa 100644
> --- a/arch/powerpc/platforms/powernv/npu-dma.c
> +++ b/arch/powerpc/platforms/powernv/npu-dma.c
> @@ -372,8 +358,263 @@ struct npu {
...
> +
> +static void pnv_comp_attach_table_group(struct npu_comp *npucomp,
> + struct pnv_ioda_pe *pe)
> +{
> + if (WARN_ON(npucomp->pe_num == NV_NPU_MAX_PE_NUM))
> + return;
> +
> + npucomp->pe[npucomp->pe_num] = pe;
> + ++npucomp->pe_num;
> +}
> +
> +struct iommu_table_group *pnv_try_setup_npu_table_group(struct pnv_ioda_pe 
> *pe)
> +{
> + struct iommu_table_group *table_group;
> + struct npu_comp *npucomp;
> + struct pci_dev *gpdev = NULL;
> + struct pci_controller *hose;
> + struct pci_dev *npdev;
> +
> + list_for_each_entry(gpdev, &pe->pbus->devices, bus_list) {
> + npdev = pnv_pci_get_npu_dev(gpdev, 0);
> + if (npdev)
> + break;
> + }
> +
> + if (!npdev)
> + /* It is not an NPU attached device, skip */
> + return NULL;

This breaks some configs with:

  arch/powerpc/platforms/powernv/npu-dma.c:550:5: error: 'npdev' may be used uninitialized in this function [-Werror=uninitialized]

cheers
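
For reference, a minimal fix for this warning (a sketch only, assuming the
intent is that npdev must end up NULL when the bus carries no NPU device;
not necessarily the change that was applied) is to initialise the loop
result:

	struct pci_dev *npdev = NULL;	/* stays NULL if no NPU device is found */

	list_for_each_entry(gpdev, &pe->pbus->devices, bus_list) {
		npdev = pnv_pci_get_npu_dev(gpdev, 0);
		if (npdev)
			break;
	}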


Re: [PATCH v2 0/2] of: phandle_cache, fix refcounts, remove stale entry

2018-12-18 Thread Michael Ellerman
Rob Herring  writes:
> On Mon, Dec 17, 2018 at 1:56 AM  wrote:
>>
>> From: Frank Rowand 
>>
>> Non-overlay dynamic devicetree node removal may leave the node in
>> the phandle cache.  Subsequent calls to of_find_node_by_phandle()
>> will incorrectly find the stale entry.  This bug exposed the following
>> phandle cache refcount bug.
>>
>> The refcount of phandle_cache entries is not incremented while in
>> the cache, allowing use after free error after kfree() of the
>> cached entry.
>>
>> Changes since v1:
>>   - make __of_free_phandle_cache() static
>>   - add WARN_ON(1) for unexpected condition in of_find_node_by_phandle()
>>
>> Frank Rowand (2):
>>   of: of_node_get()/of_node_put() nodes held in phandle cache
>>   of: __of_detach_node() - remove node from phandle cache
>
> I'll send this to Linus this week if I get a tested by. Otherwise, it
> will go in for 4.21.

I think it can wait to go into 4.21, it's not super critical and it's
not a regression since 4.19.

cheers


Re: [PATCH v2 2/2] of: __of_detach_node() - remove node from phandle cache

2018-12-18 Thread Michael Ellerman
Rob Herring  writes:
> On Tue, Dec 18, 2018 at 2:33 PM Frank Rowand  wrote:
>>
>> On 12/18/18 12:09 PM, Frank Rowand wrote:
>> > On 12/18/18 12:01 PM, Rob Herring wrote:
>> >> On Tue, Dec 18, 2018 at 12:57 PM Frank Rowand  
>> >> wrote:
>> >>>
>> >>> On 12/17/18 2:52 AM, Michael Ellerman wrote:
>>  Hi Frank,
>> 
>>  frowand.l...@gmail.com writes:
>> > From: Frank Rowand 
>> >
>> > Non-overlay dynamic devicetree node removal may leave the node in
>> > the phandle cache.  Subsequent calls to of_find_node_by_phandle()
>> > will incorrectly find the stale entry.  Remove the node from the
>> > cache.
>> >
>> > Add paranoia checks in of_find_node_by_phandle() as a second level
>> > of defense (do not return cached node if detached, do not add node
>> > to cache if detached).
>> >
>> > Reported-by: Michael Bringmann 
>> > Signed-off-by: Frank Rowand 
>> > ---
>> 
>>  Similarly here can we add:
>> 
>>  Fixes: 0b3ce78e90fc ("of: cache phandle nodes to reduce cost of 
>>  of_find_node_by_phandle()")
>> >>>
>> >>> Yes, thanks.
>> >>>
>> >>>
>>  Cc: sta...@vger.kernel.org # v4.17+
>> >>>
>> >>> Nope, 0b3ce78e90fc does not belong in stable (it is a feature, not a bug
>> >>> fix).  So the bug will not be in stable.
>> >>
>> >> 0b3ce78e90fc landed in v4.17, so Michael's line above is correct.
>> >> Annotating it with 4.17 only saves Greg from trying and then emailing
>> >> us to backport this patch as it wouldn't apply.
>> >
>> > Thanks for the correction.  I was both under-thinking and over-thinking,
>> > ending up with an incorrect answer.
>> >
>> > Can you add the Cc: to version 3 patch comments (both 1/2 and 2/2) or do
>> > you want me to re-spin?
>>
>> Now that my thinking has been straightened out, a little bit more checking
>> for the other pre-requisite patches show:
>>
>>   v4.18: commit b9952b5218ad ("of: overlay: update phandle cache on overlay 
>> apply and remove")
>>   v4.19: commit e54192b48da7 ("of: fix phandle cache creation for DTs with 
>> no phandles")
>>
>> These can be addressed by changing the "Cc:" to ... # v4.19+
>> because stable v4.17.* and v4.18.* are end of life.
>
> EOL shouldn't factor into it. There's always the possibility someone
> else picks any kernel version.

Yeah, there are other stable branches out there, so the tag should point
to the correct version regardless of whether it's currently EOL on
kernel.org.

>> Or the pre-requisites can be listed:
>>
>># v4.17: b9952b5218ad of: overlay: update phandle cache
>># v4.17: e54192b48da7 of: fix phandle cache creation
>># v4.17
>>
>># v4.18: e54192b48da7 of: fix phandle cache creation
>># v4.18
>>
>># v4.19+
>>
>> Do you have a preference?
>
> I think we just list v4.17 and be done with it.

Yep, anyone who wants to backport it can work it out, or ask us.

cheers


Re: [PATCH kernel v5 10/20] powerpc/iommu_api: Move IOMMU groups setup to a single place

2018-12-18 Thread Michael Ellerman
Alexey Kardashevskiy  writes:

> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
> b/arch/powerpc/platforms/powernv/pci-ioda.c
> index b86a6e0..1168b185 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -2735,12 +2733,68 @@ static struct iommu_table_group_ops 
> pnv_pci_ioda2_npu_ops = {
>   .release_ownership = pnv_ioda2_release_ownership,
>  };
>  
> +static void pnv_ioda_setup_bus_iommu_group_add_devices(struct pnv_ioda_pe 
> *pe,
> + struct pci_bus *bus)
> +{
> + struct pci_dev *dev;
> +
> + list_for_each_entry(dev, &bus->devices, bus_list) {
> + iommu_add_device(&pe->table_group, &dev->dev);
> +
> + if ((pe->flags & PNV_IODA_PE_BUS_ALL) && dev->subordinate)
> + pnv_ioda_setup_bus_iommu_group_add_devices(pe,
> + dev->subordinate);
> + }
> +}
> +
> +static void pnv_ioda_setup_bus_iommu_group(struct pnv_ioda_pe *pe)
> +{
> + if (!pnv_pci_ioda_pe_dma_weight(pe))
> + return;
> +
> + iommu_register_group(&pe->table_group, pe->phb->hose->global_number,
> + pe->pe_number);
> +
> + /*
> +  * set_iommu_table_base(>pdev->dev, tbl) should have been called
> +  * by now
> +  */
> + if (pe->flags & PNV_IODA_PE_DEV)
> + iommu_add_device(&pe->table_group, &pe->pdev->dev);
> + else if (pe->flags & (PNV_IODA_PE_BUS | PNV_IODA_PE_BUS_ALL))
> + pnv_ioda_setup_bus_iommu_group_add_devices(pe, pe->pbus);
> +}
> +

This breaks skiroot_defconfig with:

arch/powerpc/platforms/powernv/pci-ioda.c:2731:13: error: 'pnv_ioda_setup_bus_iommu_group' defined but not used [-Werror=unused-function]

  http://kisskb.ellerman.id.au/kisskb/buildresult/13623033/

cheers
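
One way to silence this (a sketch only, assuming the sole callers in this
file sit under CONFIG_IOMMU_API; the actual fix may differ) is to compile
the helper conditionally:

	#ifdef CONFIG_IOMMU_API
	static void pnv_ioda_setup_bus_iommu_group(struct pnv_ioda_pe *pe)
	{
		/* body unchanged */
	}
	#endif /* CONFIG_IOMMU_API */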


Re: [PATCH kernel v5 03/20] powerpc/vfio/iommu/kvm: Do not pin device memory

2018-12-18 Thread Michael Ellerman
Alexey Kardashevskiy  writes:
> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
> index f0dc680..cbcc615 100644
> --- a/arch/powerpc/kernel/iommu.c
> +++ b/arch/powerpc/kernel/iommu.c
> @@ -993,15 +994,19 @@ int iommu_tce_check_gpa(unsigned long page_shift, 
> unsigned long gpa)
>  }
>  EXPORT_SYMBOL_GPL(iommu_tce_check_gpa);
>  
> -long iommu_tce_xchg(struct iommu_table *tbl, unsigned long entry,
> - unsigned long *hpa, enum dma_data_direction *direction)
> +long iommu_tce_xchg(struct mm_struct *mm, struct iommu_table *tbl,
> + unsigned long entry, unsigned long *hpa,
> + enum dma_data_direction *direction)
>  {
>   long ret;
> + unsigned long size = 0;
>  
>   ret = tbl->it_ops->exchange(tbl, entry, hpa, direction);
>  
>   if (!ret && ((*direction == DMA_FROM_DEVICE) ||
> - (*direction == DMA_BIDIRECTIONAL)))
> + (*direction == DMA_BIDIRECTIONAL)) &&
> + !mm_iommu_is_devmem(mm, *hpa, tbl->it_page_shift,
> + &size))

This is breaking a bunch of configs with:

  arch/powerpc/kernel/iommu.c:1008:5: error: implicit declaration of function 'mm_iommu_is_devmem'; did you mean 'iommu_del_device'? [-Werror=implicit-function-declaration]

eg: http://kisskb.ellerman.id.au/kisskb/buildresult/13623063/

> diff --git a/arch/powerpc/mm/mmu_context_iommu.c 
> b/arch/powerpc/mm/mmu_context_iommu.c
> index 25a4b7f7..06fdbd3 100644
> --- a/arch/powerpc/mm/mmu_context_iommu.c
> +++ b/arch/powerpc/mm/mmu_context_iommu.c
> @@ -384,6 +432,33 @@ extern void mm_iommu_ua_mark_dirty_rm(struct mm_struct 
> *mm, unsigned long ua)
>   *pa |= MM_IOMMU_TABLE_GROUP_PAGE_DIRTY;
>  }
>  
> +extern bool mm_iommu_is_devmem(struct mm_struct *mm, unsigned long hpa,
> + unsigned int pageshift, unsigned long *size)
> +{

You shouldn't need extern in a C file.

cheers
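
A sketch of one way to resolve the build error (hypothetical placement,
mirroring how the other mm_iommu_* helpers are declared; the real fix may
differ): declare the function in a header for CONFIG_SPAPR_TCE_IOMMU
builds and provide an inline stub otherwise, which also makes the 'extern'
on the definition unnecessary:

	#ifdef CONFIG_SPAPR_TCE_IOMMU
	bool mm_iommu_is_devmem(struct mm_struct *mm, unsigned long hpa,
				unsigned int pageshift, unsigned long *size);
	#else
	static inline bool mm_iommu_is_devmem(struct mm_struct *mm,
					      unsigned long hpa,
					      unsigned int pageshift,
					      unsigned long *size)
	{
		return false;
	}
	#endif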


Re: [PATCH kernel v5 20/20] vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] subdriver

2018-12-18 Thread Alex Williamson
On Thu, 13 Dec 2018 17:17:34 +1100
Alexey Kardashevskiy  wrote:

> POWER9 Witherspoon machines come with 4 or 6 V100 GPUs which are not
> pluggable PCIe devices but still have PCIe links which are used
> for config space and MMIO. In addition to that the GPUs have 6 NVLinks
> which are connected to other GPUs and the POWER9 CPU. POWER9 chips
> have a special unit on a die called an NPU which is an NVLink2 host bus
> adapter with p2p connections to 2 to 3 GPUs, 3 or 2 NVLinks to each.
> These systems also support ATS (address translation services) which is
> a part of the NVLink2 protocol. Such GPUs also share on-board RAM
> (16GB or 32GB) to the system via the same NVLink2 so a CPU has
> cache-coherent access to a GPU RAM.
> 
> This exports GPU RAM to the userspace as a new VFIO device region. This
> preregisters the new memory as device memory as it might be used for DMA.
> This inserts pfns from the fault handler, as the GPU memory is not onlined
> until the vendor driver is loaded and has trained the NVLinks; doing this
> earlier causes low level errors which we fence in the firmware, so it does
> not hurt the host system, but it is still better avoided. For the same
> reason this does not map GPU RAM into the host kernel (the usual thing for
> emulated access otherwise).
> 
> This exports an ATSD (Address Translation Shootdown) register of NPU which
> allows TLB invalidations inside GPU for an operating system. The register
> conveniently occupies a single 64k page. It is also presented to
> the userspace as a new VFIO device region. One NPU has 8 ATSD registers,
> each of them can be used for TLB invalidation in a GPU linked to this NPU.
> This allocates one ATSD register per NVLink bridge, allowing passing
> up to 6 registers. Due to the host firmware bug (just recently fixed),
> only 1 ATSD register per NPU was actually advertised to the host system
> so this passes that alone register via the first NVLink bridge device in
> the group which is still enough as QEMU collects them all back and
> presents to the guest via vPHB to mimic the emulated NPU PHB on the host.
> 
> In order to provide the userspace with the information about GPU-to-NVLink
> connections, this exports an additional capability called "tgt"
> (which is an abbreviated host system bus address). The "tgt" property
> tells the GPU its own system address and allows the guest driver to
> conglomerate the routing information so each GPU knows how to get directly
> to the other GPUs.
> 
> For ATS to work, the nest MMU (an NVIDIA block in a P9 CPU) needs to
> know LPID (a logical partition ID or a KVM guest hardware ID in other
> words) and PID (a memory context ID of a userspace process, not to be
> confused with a linux pid). This assigns a GPU to LPID in the NPU and
> this is why this adds a listener for KVM on an IOMMU group. A PID comes
> via NVLink from a GPU and NPU uses a PID wildcard to pass it through.
> 
> This requires coherent memory and ATSD to be available on the host as
> the GPU vendor only supports configurations with both features enabled
> and other configurations are known not to work. Because of this and
> because of the ways the features are advertised to the host system
> (which is a device tree with very platform specific properties),
> this requires enabled POWERNV platform.
> 
> The V100 GPUs do not advertise any of these capabilities via the config
> space and there are more than just one device ID so this relies on
> the platform to tell whether these GPUs have special abilities such as
> NVLinks.
> 
> Signed-off-by: Alexey Kardashevskiy 
> ---
> Changes:
> v5:
> * do not memremap GPU RAM for emulation, map it only when it is needed
> * allocate 1 ATSD register per NVLink bridge, if none left, then expose
> the region with a zero size
> * separate caps per device type
> * addressed AW review comments
> 
> v4:
> * added nvlink-speed to the NPU bridge capability as this turned out to
> be not a constant value
> * instead of looking at the exact device ID (which also changes from system
> to system), now this (indirectly) looks at the device tree to know
> if GPU and NPU support NVLink
> 
> v3:
> * reworded the commit log about tgt
> * added tracepoints (do we want them enabled for entire vfio-pci?)
> * added code comments
> * added write|mmap flags to the new regions
> * auto enabled VFIO_PCI_NVLINK2 config option
> * added 'tgt' capability to a GPU so QEMU can recreate ibm,npu and ibm,gpu
> references; there are required by the NVIDIA driver
> * keep notifier registered only for short time
> ---
>  drivers/vfio/pci/Makefile   |   1 +
>  drivers/vfio/pci/trace.h| 102 ++
>  drivers/vfio/pci/vfio_pci_private.h |  14 +
>  include/uapi/linux/vfio.h   |  39 +++
>  drivers/vfio/pci/vfio_pci.c |  27 +-
>  drivers/vfio/pci/vfio_pci_nvlink2.c | 473 
>  drivers/vfio/pci/Kconfig|   6 +
>  7 files changed, 660 insertions(+), 2 deletions(-)
>  create mode 

[RFC PATCH] ASoC: fsl: Add Audio Mixer CPU DAI driver

2018-12-18 Thread Viorel Suman
This patch implements an Audio Mixer CPU DAI driver for NXP iMX8 SoCs.
The Audio Mixer is an on-chip functional module that allows mixing of
two audio streams into a single audio stream.

Audio Mixer datasheet is available here:
https://www.nxp.com/docs/en/reference-manual/IMX8DQXPRM.pdf

Signed-off-by: Viorel Suman 
---
 .../devicetree/bindings/sound/fsl,amix.txt |  45 ++
 sound/soc/fsl/Kconfig  |   7 +
 sound/soc/fsl/Makefile |   3 +
 sound/soc/fsl/fsl_amix.c   | 554 +
 sound/soc/fsl/fsl_amix.h   | 101 
 5 files changed, 710 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/sound/fsl,amix.txt
 create mode 100644 sound/soc/fsl/fsl_amix.c
 create mode 100644 sound/soc/fsl/fsl_amix.h

diff --git a/Documentation/devicetree/bindings/sound/fsl,amix.txt 
b/Documentation/devicetree/bindings/sound/fsl,amix.txt
new file mode 100644
index 000..049144f
--- /dev/null
+++ b/Documentation/devicetree/bindings/sound/fsl,amix.txt
@@ -0,0 +1,45 @@
+NXP Audio Mixer (AMIX).
+
+The Audio Mixer is an on-chip functional module that allows mixing of two
+audio streams into a single audio stream. Audio Mixer has two input serial
+audio interfaces. These are driven by two Synchronous Audio interface
+modules (SAI). Each input serial interface carries 8 audio channels in its
+frame in TDM manner. Mixer mixes audio samples of corresponding channels
+from two interfaces into a single sample. Before mixing, audio samples of
+two inputs can be attenuated based on configuration. The output of the
+Audio Mixer is also a serial audio interface. Like input interfaces it has
+the same TDM frame format. This output is used to drive the serial DAC TDM
+interface of audio codec and also sent to the external pins along with the
+receive path of normal audio SAI module for readback by the CPU.
+
+The output of Audio mixer can be selected from any of the three streams
+ - serial audio input 1
+ - serial audio input 2
+ - Mixed audio
+
+The mixing operation is independent of the audio sample rate, but the two
+audio input streams must have the same sample rate and the same number of
+channels in their TDM frames to be eligible for mixing.
+
+Device driver required properties:
+=
+  - compatible : Compatible list, contains "fsl,imx8qm-amix"
+
+  - reg: Offset and length of the register set for the 
device.
+
+  - clocks : Must contain an entry for each entry in clock-names.
+
+  - clock-names: Must include the "ipg" for register access.
+
+  - power-domains  : Must contain the phandle to the AMIX power domain node
+
+Device driver configuration example:
+==
+  amix: amix@5984 {
+compatible = "fsl,imx8qm-amix";
+reg = <0x0 0x5984 0x0 0x1>;
+clocks = <&clk IMX8QXP_AUD_AMIX_IPG>;
+clock-names = "ipg";
+power-domains = <&pd_amix>;
+  };
+
diff --git a/sound/soc/fsl/Kconfig b/sound/soc/fsl/Kconfig
index 2e75b5bc..0b08eb2 100644
--- a/sound/soc/fsl/Kconfig
+++ b/sound/soc/fsl/Kconfig
@@ -24,6 +24,13 @@ config SND_SOC_FSL_SAI
  This option is only useful for out-of-tree drivers since
  in-tree drivers select it automatically.
 
+config SND_SOC_FSL_AMIX
+   tristate "Audio Mixer (AMIX) module support"
+   select REGMAP_MMIO
+   help
+ Say Y if you want to add Audio Mixer (AMIX)
+ support for the NXP iMX CPUs.
+
 config SND_SOC_FSL_SSI
tristate "Synchronous Serial Interface module (SSI) support"
select SND_SOC_IMX_PCM_DMA if SND_IMX_SOC != n
diff --git a/sound/soc/fsl/Makefile b/sound/soc/fsl/Makefile
index de94fa0..3263634 100644
--- a/sound/soc/fsl/Makefile
+++ b/sound/soc/fsl/Makefile
@@ -12,6 +12,7 @@ snd-soc-p1022-rdk-objs := p1022_rdk.o
 obj-$(CONFIG_SND_SOC_P1022_RDK) += snd-soc-p1022-rdk.o
 
 # Freescale SSI/DMA/SAI/SPDIF Support
+snd-soc-fsl-amix-objs := fsl_amix.o
 snd-soc-fsl-asoc-card-objs := fsl-asoc-card.o
 snd-soc-fsl-asrc-objs := fsl_asrc.o fsl_asrc_dma.o
 snd-soc-fsl-sai-objs := fsl_sai.o
@@ -21,6 +22,8 @@ snd-soc-fsl-spdif-objs := fsl_spdif.o
 snd-soc-fsl-esai-objs := fsl_esai.o
 snd-soc-fsl-utils-objs := fsl_utils.o
 snd-soc-fsl-dma-objs := fsl_dma.o
+
+obj-$(CONFIG_SND_SOC_FSL_AMIX) += snd-soc-fsl-amix.o
 obj-$(CONFIG_SND_SOC_FSL_ASOC_CARD) += snd-soc-fsl-asoc-card.o
 obj-$(CONFIG_SND_SOC_FSL_ASRC) += snd-soc-fsl-asrc.o
 obj-$(CONFIG_SND_SOC_FSL_SAI) += snd-soc-fsl-sai.o
diff --git a/sound/soc/fsl/fsl_amix.c b/sound/soc/fsl/fsl_amix.c
new file mode 100644
index 000..d752029
--- /dev/null
+++ b/sound/soc/fsl/fsl_amix.c
@@ -0,0 +1,554 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * NXP AMIX ALSA SoC Digital Audio Interface (DAI) driver
+ *
+ * Copyright 2017 NXP
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "fsl_amix.h"
+
+#define 

Re: [PATCH v2 2/2] of: __of_detach_node() - remove node from phandle cache

2018-12-18 Thread Rob Herring
On Tue, Dec 18, 2018 at 2:33 PM Frank Rowand  wrote:
>
> On 12/18/18 12:09 PM, Frank Rowand wrote:
> > On 12/18/18 12:01 PM, Rob Herring wrote:
> >> On Tue, Dec 18, 2018 at 12:57 PM Frank Rowand  
> >> wrote:
> >>>
> >>> On 12/17/18 2:52 AM, Michael Ellerman wrote:
>  Hi Frank,
> 
>  frowand.l...@gmail.com writes:
> > From: Frank Rowand 
> >
> > Non-overlay dynamic devicetree node removal may leave the node in
> > the phandle cache.  Subsequent calls to of_find_node_by_phandle()
> > will incorrectly find the stale entry.  Remove the node from the
> > cache.
> >
> > Add paranoia checks in of_find_node_by_phandle() as a second level
> > of defense (do not return cached node if detached, do not add node
> > to cache if detached).
> >
> > Reported-by: Michael Bringmann 
> > Signed-off-by: Frank Rowand 
> > ---
> 
>  Similarly here can we add:
> 
>  Fixes: 0b3ce78e90fc ("of: cache phandle nodes to reduce cost of 
>  of_find_node_by_phandle()")
> >>>
> >>> Yes, thanks.
> >>>
> >>>
>  Cc: sta...@vger.kernel.org # v4.17+
> >>>
> >>> Nope, 0b3ce78e90fc does not belong in stable (it is a feature, not a bug
> >>> fix).  So the bug will not be in stable.
> >>
> >> 0b3ce78e90fc landed in v4.17, so Michael's line above is correct.
> >> Annotating it with 4.17 only saves Greg from trying and then emailing
> >> us to backport this patch as it wouldn't apply.
> >
> > Thanks for the correction.  I was both under-thinking and over-thinking,
> > ending up with an incorrect answer.
> >
> > Can you add the Cc: to version 3 patch comments (both 1/2 and 2/2) or do
> > you want me to re-spin?
>
> Now that my thinking has been straightened out, a little bit more checking
> for the other pre-requisite patches show:
>
>   v4.18: commit b9952b5218ad ("of: overlay: update phandle cache on overlay 
> apply and remove")
>   v4.19: commit e54192b48da7 ("of: fix phandle cache creation for DTs with no 
> phandles")
>
> These can be addressed by changing the "Cc:" to ... # v4.19+
> because stable v4.17.* and v4.18.* are end of life.

EOL shouldn't factor into it. There's always the possibility someone
else picks any kernel version.

> Or the pre-requisites can be listed:
>
># v4.17: b9952b5218ad of: overlay: update phandle cache
># v4.17: e54192b48da7 of: fix phandle cache creation
># v4.17
>
># v4.18: e54192b48da7 of: fix phandle cache creation
># v4.18
>
># v4.19+
>
> Do you have a preference?

I think we just list v4.17 and be done with it.

Rob


[PATCH] powerpc/ptrace: fix empty-body warning

2018-12-18 Thread Mathieu Malaterre
In commit a225f1567405 ("powerpc/ptrace: replace ptrace_report_syscall()
with a tracehook call") an empty body if(); was added.

Replace ; with {} to remove a warning (treated as error) reported by gcc
using W=1:

  arch/powerpc/kernel/ptrace.c: In function ‘do_syscall_trace_enter’:
  arch/powerpc/kernel/ptrace.c:3281:4: error: suggest braces around empty body in an ‘if’ statement [-Werror=empty-body]

Fixes: a225f1567405 ("powerpc/ptrace: replace ptrace_report_syscall() with a tracehook call")
Signed-off-by: Mathieu Malaterre 
---
 arch/powerpc/kernel/ptrace.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/ptrace.c b/arch/powerpc/kernel/ptrace.c
index 8314e8fed0ee..e1988892b3e7 100644
--- a/arch/powerpc/kernel/ptrace.c
+++ b/arch/powerpc/kernel/ptrace.c
@@ -3277,8 +3277,8 @@ long do_syscall_trace_enter(struct pt_regs *regs)
 * avoid clobbering any register also, thus, not 'gotoing'
 * skip label.
 */
-   if (tracehook_report_syscall_entry(regs))
-   ;
+   if (tracehook_report_syscall_entry(regs)) {
+   }
return -1;
}
 
-- 
2.19.2



Re: [PATCH v2 2/2] of: __of_detach_node() - remove node from phandle cache

2018-12-18 Thread Frank Rowand
On 12/18/18 12:09 PM, Frank Rowand wrote:
> On 12/18/18 12:01 PM, Rob Herring wrote:
>> On Tue, Dec 18, 2018 at 12:57 PM Frank Rowand  wrote:
>>>
>>> On 12/17/18 2:52 AM, Michael Ellerman wrote:
 Hi Frank,

 frowand.l...@gmail.com writes:
> From: Frank Rowand 
>
> Non-overlay dynamic devicetree node removal may leave the node in
> the phandle cache.  Subsequent calls to of_find_node_by_phandle()
> will incorrectly find the stale entry.  Remove the node from the
> cache.
>
> Add paranoia checks in of_find_node_by_phandle() as a second level
> of defense (do not return cached node if detached, do not add node
> to cache if detached).
>
> Reported-by: Michael Bringmann 
> Signed-off-by: Frank Rowand 
> ---

 Similarly here can we add:

 Fixes: 0b3ce78e90fc ("of: cache phandle nodes to reduce cost of 
 of_find_node_by_phandle()")
>>>
>>> Yes, thanks.
>>>
>>>
 Cc: sta...@vger.kernel.org # v4.17+
>>>
>>> Nope, 0b3ce78e90fc does not belong in stable (it is a feature, not a bug
>>> fix).  So the bug will not be in stable.
>>
>> 0b3ce78e90fc landed in v4.17, so Michael's line above is correct.
>> Annotating it with 4.17 only saves Greg from trying and then emailing
>> us to backport this patch as it wouldn't apply.
> 
> Thanks for the correction.  I was both under-thinking and over-thinking,
> ending up with an incorrect answer.
> 
> Can you add the Cc: to version 3 patch comments (both 1/2 and 2/2) or do
> you want me to re-spin?

Now that my thinking has been straightened out, a little bit more checking
for the other pre-requisite patches show:

  v4.18: commit b9952b5218ad ("of: overlay: update phandle cache on overlay 
apply and remove")
  v4.19: commit e54192b48da7 ("of: fix phandle cache creation for DTs with no 
phandles")

These can be addressed by changing the "Cc:" to ... # v4.19+
because stable v4.17.* and v4.18.* are end of life.

Or the pre-requisites can be listed:

   # v4.17: b9952b5218ad of: overlay: update phandle cache
   # v4.17: e54192b48da7 of: fix phandle cache creation
   # v4.17

   # v4.18: e54192b48da7 of: fix phandle cache creation
   # v4.18

   # v4.19+

Do you have a preference?

-Frank
> 
> -Frank
> 
>>
>> Rob
>>
> 
> 



Re: [PATCH v2 2/2] of: __of_detach_node() - remove node from phandle cache

2018-12-18 Thread Frank Rowand
On 12/18/18 12:01 PM, Rob Herring wrote:
> On Tue, Dec 18, 2018 at 12:57 PM Frank Rowand  wrote:
>>
>> On 12/17/18 2:52 AM, Michael Ellerman wrote:
>>> Hi Frank,
>>>
>>> frowand.l...@gmail.com writes:
 From: Frank Rowand 

 Non-overlay dynamic devicetree node removal may leave the node in
 the phandle cache.  Subsequent calls to of_find_node_by_phandle()
 will incorrectly find the stale entry.  Remove the node from the
 cache.

 Add paranoia checks in of_find_node_by_phandle() as a second level
 of defense (do not return cached node if detached, do not add node
 to cache if detached).

 Reported-by: Michael Bringmann 
 Signed-off-by: Frank Rowand 
 ---
>>>
>>> Similarly here can we add:
>>>
>>> Fixes: 0b3ce78e90fc ("of: cache phandle nodes to reduce cost of 
>>> of_find_node_by_phandle()")
>>
>> Yes, thanks.
>>
>>
>>> Cc: sta...@vger.kernel.org # v4.17+
>>
>> Nope, 0b3ce78e90fc does not belong in stable (it is a feature, not a bug
>> fix).  So the bug will not be in stable.
> 
> 0b3ce78e90fc landed in v4.17, so Michael's line above is correct.
> Annotating it with 4.17 only saves Greg from trying and then emailing
> us to backport this patch as it wouldn't apply.

Thanks for the correction.  I was both under-thinking and over-thinking,
ending up with an incorrect answer.

Can you add the Cc: to version 3 patch comments (both 1/2 and 2/2) or do
you want me to re-spin?

-Frank

> 
> Rob
> 



Re: [PATCH v2 2/2] of: __of_detach_node() - remove node from phandle cache

2018-12-18 Thread Rob Herring
On Tue, Dec 18, 2018 at 12:57 PM Frank Rowand  wrote:
>
> On 12/17/18 2:52 AM, Michael Ellerman wrote:
> > Hi Frank,
> >
> > frowand.l...@gmail.com writes:
> >> From: Frank Rowand 
> >>
> >> Non-overlay dynamic devicetree node removal may leave the node in
> >> the phandle cache.  Subsequent calls to of_find_node_by_phandle()
> >> will incorrectly find the stale entry.  Remove the node from the
> >> cache.
> >>
> >> Add paranoia checks in of_find_node_by_phandle() as a second level
> >> of defense (do not return cached node if detached, do not add node
> >> to cache if detached).
> >>
> >> Reported-by: Michael Bringmann 
> >> Signed-off-by: Frank Rowand 
> >> ---
> >
> > Similarly here can we add:
> >
> > Fixes: 0b3ce78e90fc ("of: cache phandle nodes to reduce cost of 
> > of_find_node_by_phandle()")
>
> Yes, thanks.
>
>
> > Cc: sta...@vger.kernel.org # v4.17+
>
> Nope, 0b3ce78e90fc does not belong in stable (it is a feature, not a bug
> fix).  So the bug will not be in stable.

0b3ce78e90fc landed in v4.17, so Michael's line above is correct.
Annotating it with 4.17 only saves Greg from trying and then emailing
us to backport this patch as it wouldn't apply.

Rob


Re: [PATCH v3 2/2] of: __of_detach_node() - remove node from phandle cache

2018-12-18 Thread Frank Rowand
On 12/18/18 11:40 AM, frowand.l...@gmail.com wrote:
> From: Frank Rowand 
> 
> Non-overlay dynamic devicetree node removal may leave the node in
> the phandle cache.  Subsequent calls to of_find_node_by_phandle()
> will incorrectly find the stale entry.  Remove the node from the
> cache.
> 
> Add paranoia checks in of_find_node_by_phandle() as a second level
> of defense (do not return cached node if detached, do not add node
> to cache if detached).
> 
> Fixes: 0b3ce78e90fc ("of: cache phandle nodes to reduce cost of 
> of_find_node_by_phandle()")
> Reported-by: Michael Bringmann 
> Signed-off-by: Frank Rowand 
> ---
> 
> do not "cc: stable", unless the following commits are also in stable:
> 
>   commit e54192b48da7 ("of: fix phandle cache creation for DTs with no 
> phandles")
>   commit b9952b5218ad ("of: overlay: update phandle cache on overlay apply 
> and remove")
>   commit 0b3ce78e90fc ("of: cache phandle nodes to reduce cost of 
> of_find_node_by_phandle()")
> 
> 
> Changes since v2:
>   - add temporary variable np in __of_free_phandle_cache_entry() to improve
> readability
>   - explain reason for WARN_ON() in comment
>   - add Fixes tag in patch comment

I should have carried this forward:

changes since v1:
  - add WARN_ON(1) for unexpected condition in of_find_node_by_phandle()

-Frank

> 
>  drivers/of/base.c   | 31 ++-
>  drivers/of/dynamic.c|  3 +++
>  drivers/of/of_private.h |  4 
>  3 files changed, 37 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/of/base.c b/drivers/of/base.c
> index 6c33d63361b8..6d20b6dcf034 100644
> --- a/drivers/of/base.c
> +++ b/drivers/of/base.c
> @@ -162,6 +162,28 @@ int of_free_phandle_cache(void)
>  late_initcall_sync(of_free_phandle_cache);
>  #endif
>  
> +/*
> + * Caller must hold devtree_lock.
> + */
> +void __of_free_phandle_cache_entry(phandle handle)
> +{
> + phandle masked_handle;
> + struct device_node *np;
> +
> + if (!handle)
> + return;
> +
> + masked_handle = handle & phandle_cache_mask;
> +
> + if (phandle_cache) {
> + np = phandle_cache[masked_handle];
> + if (np && handle == np->phandle) {
> + of_node_put(np);
> + phandle_cache[masked_handle] = NULL;
> + }
> + }
> +}
> +
>  void of_populate_phandle_cache(void)
>  {
>   unsigned long flags;
> @@ -1209,11 +1231,18 @@ struct device_node *of_find_node_by_phandle(phandle 
> handle)
>   if (phandle_cache[masked_handle] &&
>   handle == phandle_cache[masked_handle]->phandle)
>   np = phandle_cache[masked_handle];
> + if (np && of_node_check_flag(np, OF_DETACHED)) {
> + WARN_ON(1); /* did not uncache np on node removal */
> + of_node_put(np);
> + phandle_cache[masked_handle] = NULL;
> + np = NULL;
> + }
>   }
>  
>   if (!np) {
>   for_each_of_allnodes(np)
> - if (np->phandle == handle) {
> + if (np->phandle == handle &&
> + !of_node_check_flag(np, OF_DETACHED)) {
>   if (phandle_cache) {
>   /* will put when removed from cache */
>   of_node_get(np);
> diff --git a/drivers/of/dynamic.c b/drivers/of/dynamic.c
> index f4f8ed9b5454..ecea92f68c87 100644
> --- a/drivers/of/dynamic.c
> +++ b/drivers/of/dynamic.c
> @@ -268,6 +268,9 @@ void __of_detach_node(struct device_node *np)
>   }
>  
>   of_node_set_flag(np, OF_DETACHED);
> +
> + /* race with of_find_node_by_phandle() prevented by devtree_lock */
> + __of_free_phandle_cache_entry(np->phandle);
>  }
>  
>  /**
> diff --git a/drivers/of/of_private.h b/drivers/of/of_private.h
> index 5d1567025358..24786818e32e 100644
> --- a/drivers/of/of_private.h
> +++ b/drivers/of/of_private.h
> @@ -84,6 +84,10 @@ static inline void __of_detach_node_sysfs(struct 
> device_node *np) {}
>  int of_resolve_phandles(struct device_node *tree);
>  #endif
>  
> +#if defined(CONFIG_OF_DYNAMIC)
> +void __of_free_phandle_cache_entry(phandle handle);
> +#endif
> +
>  #if defined(CONFIG_OF_OVERLAY)
>  void of_overlay_mutex_lock(void);
>  void of_overlay_mutex_unlock(void);
> 



Re: [PATCH v3 1/2] of: of_node_get()/of_node_put() nodes held in phandle cache

2018-12-18 Thread Frank Rowand
On 12/18/18 11:40 AM, frowand.l...@gmail.com wrote:
> From: Frank Rowand 
> 
> The phandle cache contains struct device_node pointers.  The refcount
> of the pointers was not incremented while in the cache, allowing use
> after free error after kfree() of the node.  Add the proper increment
> and decrement of the use count.
> 
> Fixes: 0b3ce78e90fc ("of: cache phandle nodes to reduce cost of 
> of_find_node_by_phandle()")
> 
> Signed-off-by: Frank Rowand 
> ---
> 
> do not "cc: stable", unless the following commits are also in stable:
> 
>   commit e54192b48da7 ("of: fix phandle cache creation for DTs with no 
> phandles")
>   commit b9952b5218ad ("of: overlay: update phandle cache on overlay apply 
> and remove")
>   commit 0b3ce78e90fc ("of: cache phandle nodes to reduce cost of 
> of_find_node_by_phandle()")

I should have carried this forward:

changes since v1
  - make __of_free_phandle_cache() static

-Frank

> 
> 
>  drivers/of/base.c | 70 
> ---
>  1 file changed, 46 insertions(+), 24 deletions(-)
> 
> diff --git a/drivers/of/base.c b/drivers/of/base.c
> index 09692c9b32a7..6c33d63361b8 100644
> --- a/drivers/of/base.c
> +++ b/drivers/of/base.c
> @@ -116,9 +116,6 @@ int __weak of_node_to_nid(struct device_node *np)
>  }
>  #endif
>  
> -static struct device_node **phandle_cache;
> -static u32 phandle_cache_mask;
> -
>  /*
>   * Assumptions behind phandle_cache implementation:
>   *   - phandle property values are in a contiguous range of 1..n
> @@ -127,6 +124,44 @@ int __weak of_node_to_nid(struct device_node *np)
>   *   - the phandle lookup overhead reduction provided by the cache
>   * will likely be less
>   */
> +
> +static struct device_node **phandle_cache;
> +static u32 phandle_cache_mask;
> +
> +/*
> + * Caller must hold devtree_lock.
> + */
> +static void __of_free_phandle_cache(void)
> +{
> + u32 cache_entries = phandle_cache_mask + 1;
> + u32 k;
> +
> + if (!phandle_cache)
> + return;
> +
> + for (k = 0; k < cache_entries; k++)
> + of_node_put(phandle_cache[k]);
> +
> + kfree(phandle_cache);
> + phandle_cache = NULL;
> +}
> +
> +int of_free_phandle_cache(void)
> +{
> + unsigned long flags;
> +
> + raw_spin_lock_irqsave(&devtree_lock, flags);
> +
> + __of_free_phandle_cache();
> +
> + raw_spin_unlock_irqrestore(&devtree_lock, flags);
> +
> + return 0;
> +}
> +#if !defined(CONFIG_MODULES)
> +late_initcall_sync(of_free_phandle_cache);
> +#endif
> +
>  void of_populate_phandle_cache(void)
>  {
>   unsigned long flags;
> @@ -136,8 +171,7 @@ void of_populate_phandle_cache(void)
>  
>   raw_spin_lock_irqsave(&devtree_lock, flags);
>  
> - kfree(phandle_cache);
> - phandle_cache = NULL;
> + __of_free_phandle_cache();
>  
>   for_each_of_allnodes(np)
>   if (np->phandle && np->phandle != OF_PHANDLE_ILLEGAL)
> @@ -155,30 +189,15 @@ void of_populate_phandle_cache(void)
>   goto out;
>  
>   for_each_of_allnodes(np)
> - if (np->phandle && np->phandle != OF_PHANDLE_ILLEGAL)
> + if (np->phandle && np->phandle != OF_PHANDLE_ILLEGAL) {
> + of_node_get(np);
>   phandle_cache[np->phandle & phandle_cache_mask] = np;
> + }
>  
>  out:
>   raw_spin_unlock_irqrestore(&devtree_lock, flags);
>  }
>  
> -int of_free_phandle_cache(void)
> -{
> - unsigned long flags;
> -
> - raw_spin_lock_irqsave(&devtree_lock, flags);
> -
> - kfree(phandle_cache);
> - phandle_cache = NULL;
> -
> - raw_spin_unlock_irqrestore(&devtree_lock, flags);
> -
> - return 0;
> -}
> -#if !defined(CONFIG_MODULES)
> -late_initcall_sync(of_free_phandle_cache);
> -#endif
> -
>  void __init of_core_init(void)
>  {
>   struct device_node *np;
> @@ -1195,8 +1214,11 @@ struct device_node *of_find_node_by_phandle(phandle 
> handle)
>   if (!np) {
>   for_each_of_allnodes(np)
>   if (np->phandle == handle) {
> - if (phandle_cache)
> + if (phandle_cache) {
> + /* will put when removed from cache */
> + of_node_get(np);
>   phandle_cache[masked_handle] = np;
> + }
>   break;
>   }
>   }
> 



[PATCH v3 0/2] of: phandle_cache, fix refcounts, remove stale entry

2018-12-18 Thread frowand . list
From: Frank Rowand 

Non-overlay dynamic devicetree node removal may leave the node in
the phandle cache.  Subsequent calls to of_find_node_by_phandle()
will incorrectly find the stale entry.  This bug exposed the following
phandle cache refcount bug.

The refcount of phandle_cache entries is not incremented while in
the cache, allowing use after free error after kfree() of the
cached entry.

Changes since v2:
  - patch 2/2: add temporary variable np in __of_free_phandle_cache_entry()
to improve readability
  - patch 2/2: explain reason for WARN_ON() in comment
  - patch 2/2: add Fixes tag in patch comment

Changes since v1:
  - make __of_free_phandle_cache() static
  - add WARN_ON(1) for unexpected condition in of_find_node_by_phandle()
  
Frank Rowand (2):
  of: of_node_get()/of_node_put() nodes held in phandle cache
  of: __of_detach_node() - remove node from phandle cache

 drivers/of/base.c   | 101 
 drivers/of/dynamic.c|   3 ++
 drivers/of/of_private.h |   4 ++
 3 files changed, 83 insertions(+), 25 deletions(-)
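
The invariant the two patches establish can be summarised in a short
sketch (illustrative only, not a literal hunk from the series): a node
takes a reference when it enters the cache and drops it when it leaves,
whatever the reason for leaving.

	/* insert: the cache owns a reference */
	of_node_get(np);
	phandle_cache[masked_handle] = np;

	/* evict (free, repopulate or detach): release that reference */
	of_node_put(phandle_cache[masked_handle]);
	phandle_cache[masked_handle] = NULL;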

-- 
Frank Rowand 



[PATCH v3 1/2] of: of_node_get()/of_node_put() nodes held in phandle cache

2018-12-18 Thread frowand . list
From: Frank Rowand 

The phandle cache contains struct device_node pointers.  The refcount
of the pointers was not incremented while in the cache, allowing use
after free error after kfree() of the node.  Add the proper increment
and decrement of the use count.

Fixes: 0b3ce78e90fc ("of: cache phandle nodes to reduce cost of 
of_find_node_by_phandle()")

Signed-off-by: Frank Rowand 
---

do not "cc: stable", unless the following commits are also in stable:

  commit e54192b48da7 ("of: fix phandle cache creation for DTs with no 
phandles")
  commit b9952b5218ad ("of: overlay: update phandle cache on overlay apply and 
remove")
  commit 0b3ce78e90fc ("of: cache phandle nodes to reduce cost of 
of_find_node_by_phandle()")


 drivers/of/base.c | 70 ---
 1 file changed, 46 insertions(+), 24 deletions(-)

diff --git a/drivers/of/base.c b/drivers/of/base.c
index 09692c9b32a7..6c33d63361b8 100644
--- a/drivers/of/base.c
+++ b/drivers/of/base.c
@@ -116,9 +116,6 @@ int __weak of_node_to_nid(struct device_node *np)
 }
 #endif
 
-static struct device_node **phandle_cache;
-static u32 phandle_cache_mask;
-
 /*
  * Assumptions behind phandle_cache implementation:
  *   - phandle property values are in a contiguous range of 1..n
@@ -127,6 +124,44 @@ int __weak of_node_to_nid(struct device_node *np)
  *   - the phandle lookup overhead reduction provided by the cache
  * will likely be less
  */
+
+static struct device_node **phandle_cache;
+static u32 phandle_cache_mask;
+
+/*
+ * Caller must hold devtree_lock.
+ */
+static void __of_free_phandle_cache(void)
+{
+   u32 cache_entries = phandle_cache_mask + 1;
+   u32 k;
+
+   if (!phandle_cache)
+   return;
+
+   for (k = 0; k < cache_entries; k++)
+   of_node_put(phandle_cache[k]);
+
+   kfree(phandle_cache);
+   phandle_cache = NULL;
+}
+
+int of_free_phandle_cache(void)
+{
+   unsigned long flags;
+
+   raw_spin_lock_irqsave(&devtree_lock, flags);
+
+   __of_free_phandle_cache();
+
+   raw_spin_unlock_irqrestore(&devtree_lock, flags);
+
+   return 0;
+}
+#if !defined(CONFIG_MODULES)
+late_initcall_sync(of_free_phandle_cache);
+#endif
+
 void of_populate_phandle_cache(void)
 {
unsigned long flags;
@@ -136,8 +171,7 @@ void of_populate_phandle_cache(void)
 
	raw_spin_lock_irqsave(&devtree_lock, flags);
 
-   kfree(phandle_cache);
-   phandle_cache = NULL;
+   __of_free_phandle_cache();
 
for_each_of_allnodes(np)
if (np->phandle && np->phandle != OF_PHANDLE_ILLEGAL)
@@ -155,30 +189,15 @@ void of_populate_phandle_cache(void)
goto out;
 
for_each_of_allnodes(np)
-   if (np->phandle && np->phandle != OF_PHANDLE_ILLEGAL)
+   if (np->phandle && np->phandle != OF_PHANDLE_ILLEGAL) {
+   of_node_get(np);
phandle_cache[np->phandle & phandle_cache_mask] = np;
+   }
 
 out:
	raw_spin_unlock_irqrestore(&devtree_lock, flags);
 }
 
-int of_free_phandle_cache(void)
-{
-   unsigned long flags;
-
-   raw_spin_lock_irqsave(&devtree_lock, flags);
-
-   kfree(phandle_cache);
-   phandle_cache = NULL;
-
-   raw_spin_unlock_irqrestore(&devtree_lock, flags);
-
-   return 0;
-}
-#if !defined(CONFIG_MODULES)
-late_initcall_sync(of_free_phandle_cache);
-#endif
-
 void __init of_core_init(void)
 {
struct device_node *np;
@@ -1195,8 +1214,11 @@ struct device_node *of_find_node_by_phandle(phandle 
handle)
if (!np) {
for_each_of_allnodes(np)
if (np->phandle == handle) {
-   if (phandle_cache)
+   if (phandle_cache) {
+   /* will put when removed from cache */
+   of_node_get(np);
phandle_cache[masked_handle] = np;
+   }
break;
}
}
-- 
Frank Rowand 



[PATCH v3 2/2] of: __of_detach_node() - remove node from phandle cache

2018-12-18 Thread frowand . list
From: Frank Rowand 

Non-overlay dynamic devicetree node removal may leave the node in
the phandle cache.  Subsequent calls to of_find_node_by_phandle()
will incorrectly find the stale entry.  Remove the node from the
cache.

Add paranoia checks in of_find_node_by_phandle() as a second level
of defense (do not return cached node if detached, do not add node
to cache if detached).

Fixes: 0b3ce78e90fc ("of: cache phandle nodes to reduce cost of 
of_find_node_by_phandle()")
Reported-by: Michael Bringmann 
Signed-off-by: Frank Rowand 
---

do not "cc: stable", unless the following commits are also in stable:

  commit e54192b48da7 ("of: fix phandle cache creation for DTs with no 
phandles")
  commit b9952b5218ad ("of: overlay: update phandle cache on overlay apply and 
remove")
  commit 0b3ce78e90fc ("of: cache phandle nodes to reduce cost of 
of_find_node_by_phandle()")


Changes since v2:
  - add temporary variable np in __of_free_phandle_cache_entry() to improve
readability
  - explain reason for WARN_ON() in comment
  - add Fixes tag in patch comment

 drivers/of/base.c   | 31 ++-
 drivers/of/dynamic.c|  3 +++
 drivers/of/of_private.h |  4 
 3 files changed, 37 insertions(+), 1 deletion(-)

diff --git a/drivers/of/base.c b/drivers/of/base.c
index 6c33d63361b8..6d20b6dcf034 100644
--- a/drivers/of/base.c
+++ b/drivers/of/base.c
@@ -162,6 +162,28 @@ int of_free_phandle_cache(void)
 late_initcall_sync(of_free_phandle_cache);
 #endif
 
+/*
+ * Caller must hold devtree_lock.
+ */
+void __of_free_phandle_cache_entry(phandle handle)
+{
+   phandle masked_handle;
+   struct device_node *np;
+
+   if (!handle)
+   return;
+
+   masked_handle = handle & phandle_cache_mask;
+
+   if (phandle_cache) {
+   np = phandle_cache[masked_handle];
+   if (np && handle == np->phandle) {
+   of_node_put(np);
+   phandle_cache[masked_handle] = NULL;
+   }
+   }
+}
+
 void of_populate_phandle_cache(void)
 {
unsigned long flags;
@@ -1209,11 +1231,18 @@ struct device_node *of_find_node_by_phandle(phandle 
handle)
if (phandle_cache[masked_handle] &&
handle == phandle_cache[masked_handle]->phandle)
np = phandle_cache[masked_handle];
+   if (np && of_node_check_flag(np, OF_DETACHED)) {
+   WARN_ON(1); /* did not uncache np on node removal */
+   of_node_put(np);
+   phandle_cache[masked_handle] = NULL;
+   np = NULL;
+   }
}
 
if (!np) {
for_each_of_allnodes(np)
-   if (np->phandle == handle) {
+   if (np->phandle == handle &&
+   !of_node_check_flag(np, OF_DETACHED)) {
if (phandle_cache) {
/* will put when removed from cache */
of_node_get(np);
diff --git a/drivers/of/dynamic.c b/drivers/of/dynamic.c
index f4f8ed9b5454..ecea92f68c87 100644
--- a/drivers/of/dynamic.c
+++ b/drivers/of/dynamic.c
@@ -268,6 +268,9 @@ void __of_detach_node(struct device_node *np)
}
 
of_node_set_flag(np, OF_DETACHED);
+
+   /* race with of_find_node_by_phandle() prevented by devtree_lock */
+   __of_free_phandle_cache_entry(np->phandle);
 }
 
 /**
diff --git a/drivers/of/of_private.h b/drivers/of/of_private.h
index 5d1567025358..24786818e32e 100644
--- a/drivers/of/of_private.h
+++ b/drivers/of/of_private.h
@@ -84,6 +84,10 @@ static inline void __of_detach_node_sysfs(struct device_node 
*np) {}
 int of_resolve_phandles(struct device_node *tree);
 #endif
 
+#if defined(CONFIG_OF_DYNAMIC)
+void __of_free_phandle_cache_entry(phandle handle);
+#endif
+
 #if defined(CONFIG_OF_OVERLAY)
 void of_overlay_mutex_lock(void);
 void of_overlay_mutex_unlock(void);
-- 
Frank Rowand 



Re: [PATCH v2 2/2] of: __of_detach_node() - remove node from phandle cache

2018-12-18 Thread Frank Rowand
On 12/17/18 2:52 AM, Michael Ellerman wrote:
> Hi Frank,
> 
> frowand.l...@gmail.com writes:
>> From: Frank Rowand 
>>
>> Non-overlay dynamic devicetree node removal may leave the node in
>> the phandle cache.  Subsequent calls to of_find_node_by_phandle()
>> will incorrectly find the stale entry.  Remove the node from the
>> cache.
>>
>> Add paranoia checks in of_find_node_by_phandle() as a second level
>> of defense (do not return cached node if detached, do not add node
>> to cache if detached).
>>
>> Reported-by: Michael Bringmann 
>> Signed-off-by: Frank Rowand 
>> ---
> 
> Similarly here can we add:
> 
> Fixes: 0b3ce78e90fc ("of: cache phandle nodes to reduce cost of 
> of_find_node_by_phandle()")

Yes, thanks.


> Cc: sta...@vger.kernel.org # v4.17+

Nope, 0b3ce78e90fc does not belong in stable (it is a feature, not a bug
fix).  So the bug will not be in stable.

I've debated with myself over this, because there is a possibility that
0b3ce78e90fc could somehow be put into a stable despite not being a
bug fix.  We can always explicitly request this patch series be added
to stable in that case.


> Thanks for doing this series.
> 
> Some minor comments below.
> 
>> diff --git a/drivers/of/base.c b/drivers/of/base.c
>> index 6c33d63361b8..ad71864cecf5 100644
>> --- a/drivers/of/base.c
>> +++ b/drivers/of/base.c
>> @@ -162,6 +162,27 @@ int of_free_phandle_cache(void)
>>  late_initcall_sync(of_free_phandle_cache);
>>  #endif
>>  
>> +/*
>> + * Caller must hold devtree_lock.
>> + */
>> +void __of_free_phandle_cache_entry(phandle handle)
>> +{
>> +phandle masked_handle;
>> +
>> +if (!handle)
>> +return;
> 
> We could fold the phandle_cache check into that if and return early for
> both cases couldn't we?

We could, but that would make the reason for checking phandle_cache
less obvious.  I would rather leave that check as it is.
> 
>> +masked_handle = handle & phandle_cache_mask;
>> +
>> +if (phandle_cache) {
> 
> Meaning this wouldn't be necessary.
> 
>> +if (phandle_cache[masked_handle] &&
>> +handle == phandle_cache[masked_handle]->phandle) {
>> +of_node_put(phandle_cache[masked_handle]);
>> +phandle_cache[masked_handle] = NULL;
>> +}
> 
> A temporary would help the readability here I think, eg:
> 
>   struct device_node *np;
> np = phandle_cache[masked_handle];
> 
>   if (np && handle == np->phandle) {
>   of_node_put(np);
>   phandle_cache[masked_handle] = NULL;
>   }

Yes, much cleaner.


>> @@ -1209,11 +1230,18 @@ struct device_node *of_find_node_by_phandle(phandle 
>> handle)
>>  if (phandle_cache[masked_handle] &&
>>  handle == phandle_cache[masked_handle]->phandle)
>>  np = phandle_cache[masked_handle];
>> +if (np && of_node_check_flag(np, OF_DETACHED)) {
>> +WARN_ON(1);
>> +of_node_put(np);
>
> Do we really want to do the put here?
> 
> We're here because something has gone wrong, possibly even memory
> corruption such that np is not even pointing at a device node anymore.
> So it seems like it would be safer to just leave the ref count alone,
> possibly leak a small amount of memory, and NULL out the reference.

I like the concept of the code being a little bit paranoid.

But the bug that this check is likely to catch is the bug that led
to this series -- removing a devicetree node, but failing to remove
it from the cache as part of the removal.  So I think I'll leave
it as is.

> 
> 
> cheers
> 

Thanks for the thoughts and suggestions!

-Frank



Re: [PATCH v1 03/13] powerpc/mm/32s: rework mmu_mapin_ram()

2018-12-18 Thread Christophe Leroy

Le 18/12/2018 à 18:04, Jonathan Neuschäfer a écrit :

On Tue, Dec 18, 2018 at 04:04:42PM +0100, Christophe Leroy wrote:

Stupid of me. In fact, for the time being, BATs cover both RO and RW data
areas, so they definitely cannot be mapped with PAGE_KERNEL_ROX.

In fact, as I have CONFIG_BDI_SWITCH in my setup, PAGE_KERNEL_TEXT is
PAGE_KERNEL_X on my side. That's the reason why I missed it.

With this change being done to patch 3, does the overall series work for you?


Yes, with the PAGE_KERNEL_X change, the whole series boots on the Wii.


That's great, many thanks for testing.

Christophe




Jonathan



Re: [PATCH V4 5/5] arch/powerpc/mm/hugetlb: NestMMU workaround for hugetlb mprotect RW upgrade

2018-12-18 Thread Christoph Hellwig
On Tue, Dec 18, 2018 at 03:11:37PM +0530, Aneesh Kumar K.V wrote:
> +EXPORT_SYMBOL(huge_ptep_modify_prot_start);

The only user of this function is the one you added in the last patch
in mm/hugetlb.c, so there is no need to export this function.

> +
> +void huge_ptep_modify_prot_commit(struct vm_area_struct *vma, unsigned long 
> addr,
> +   pte_t *ptep, pte_t old_pte, pte_t pte)
> +{
> +
> + if (radix_enabled())
> + return radix__huge_ptep_modify_prot_commit(vma, addr, ptep,
> +old_pte, pte);
> + set_huge_pte_at(vma->vm_mm, addr, ptep, pte);
> +}
> +EXPORT_SYMBOL(huge_ptep_modify_prot_commit);

Same here.


Re: [PATCH V3 3/5] arch/powerpc/mm: Nest MMU workaround for mprotect RW upgrade.

2018-12-18 Thread Christoph Hellwig
On Wed, Dec 05, 2018 at 08:39:29AM +0530, Aneesh Kumar K.V wrote:
> +pte_t ptep_modify_prot_start(struct vm_area_struct *vma, unsigned long addr,
> +  pte_t *ptep)
> +{
> + unsigned long pte_val;
> +
> + /*
> +  * Clear the _PAGE_PRESENT so that no hardware parallel update is
> +  * possible. Also keep the pte_present true so that we don't take
> +  * wrong fault.
> +  */
> + pte_val = pte_update(vma->vm_mm, addr, ptep, _PAGE_PRESENT, 
> _PAGE_INVALID, 0);
> +
> + return __pte(pte_val);
> +
> +}
> +EXPORT_SYMBOL(ptep_modify_prot_start);

As far as I can tell this is only called from mm/memory.c, mm/mprotect.c
and fs/proc/task_mmu.c, so there should be no need to export the
function.


Re: [PATCH V4 0/5] NestMMU pte upgrade workaround for mprotect

2018-12-18 Thread Christoph Hellwig
This series seems to miss patches 1 and 2.


Re: [PATCH v1 03/13] powerpc/mm/32s: rework mmu_mapin_ram()

2018-12-18 Thread Jonathan Neuschäfer
On Tue, Dec 18, 2018 at 04:04:42PM +0100, Christophe Leroy wrote:
> Stupid of me. In fact, for the time being, BATs cover both RO and RW data
> areas, so they definitely cannot be mapped with PAGE_KERNEL_ROX.
> 
> In fact, as I have CONFIG_BDI_SWITCH in my setup, PAGE_KERNEL_TEXT is
> PAGE_KERNEL_X on my side. That's the reason why I missed it.
> 
> With this change being done to patch 3, does the overall series work for you?

Yes, with the PAGE_KERNEL_X change, the whole series boots on the Wii.


Jonathan


signature.asc
Description: PGP signature


Re: [PATCH] powerpc/83xx: handle machine check caused by watchdog timer

2018-12-18 Thread kbuild test robot
Hi Christophe,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on powerpc/next]
[also build test ERROR on v4.20-rc7 next-20181218]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Christophe-Leroy/powerpc-83xx-handle-machine-check-caused-by-watchdog-timer/20181210-222612
base:   https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next
config: powerpc-mpc83xx_defconfig (attached as .config)
compiler: powerpc-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
wget 
https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
GCC_VERSION=7.2.0 make.cross ARCH=powerpc 

All errors (new ones prefixed by >>):

   arch/powerpc//platforms/83xx/misc.c: In function 'machine_check_83xx':
>> arch/powerpc//platforms/83xx/misc.c:162:6: error: implicit declaration of function 'debugger_fault_handler' [-Werror=implicit-function-declaration]
 if (debugger_fault_handler(regs))
 ^~
   cc1: all warnings being treated as errors

vim +/debugger_fault_handler +162 arch/powerpc//platforms/83xx/misc.c

   153  
   154  int machine_check_83xx(struct pt_regs *regs)
   155  {
   156  u32 mask = 1 << (31 - IPIC_MCP_WDT);
   157  
   158  if (!(regs->msr & SRR1_MCE_MCP) || !(ipic_get_mcp_status() & 
mask))
   159  return machine_check_generic(regs);
   160  ipic_clear_mcp_status(mask);
   161  
 > 162  if (debugger_fault_handler(regs))
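
(For what it's worth, debugger_fault_handler() is declared in
arch/powerpc/include/asm/debug.h, which also provides a static inline
stub when CONFIG_DEBUGGER is off, so the likely fix -- my assumption,
not verified against this tree -- is simply adding the include to
misc.c:)

	#include <asm/debug.h>	/* for debugger_fault_handler() */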

---
0-DAY kernel test infrastructure            Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation




Re: [PATCH v2 0/2] of: phandle_cache, fix refcounts, remove stale entry

2018-12-18 Thread Rob Herring
On Mon, Dec 17, 2018 at 1:56 AM  wrote:
>
> From: Frank Rowand 
>
> Non-overlay dynamic devicetree node removal may leave the node in
> the phandle cache.  Subsequent calls to of_find_node_by_phandle()
> will incorrectly find the stale entry.  This bug exposed the following
> phandle cache refcount bug.
>
> The refcount of phandle_cache entries is not incremented while in
> the cache, allowing a use-after-free error after kfree() of the
> cached entry.
>
> Changes since v1:
>   - make __of_free_phandle_cache() static
>   - add WARN_ON(1) for unexpected condition in of_find_node_by_phandle()
>
> Frank Rowand (2):
>   of: of_node_get()/of_node_put() nodes held in phandle cache
>   of: __of_detach_node() - remove node from phandle cache

I'll send this to Linus this week if I get a tested-by. Otherwise, it
will go in for 4.21.

Rob


Re: [PATCH v1 03/13] powerpc/mm/32s: rework mmu_mapin_ram()

2018-12-18 Thread Christophe Leroy




On 18/12/2018 at 15:55, Christophe Leroy wrote:



On 18/12/2018 at 15:15, Christophe Leroy wrote:



On 18/12/2018 at 15:07, Jonathan Neuschäfer wrote:

On Tue, Dec 18, 2018 at 09:18:42AM +, Christophe Leroy wrote:

The only difference I see then is the flags. Everything else seems
identical.

I know you tried already, but would you mind trying once more with the
following change ?


[...]

-    setbat(idx, PAGE_OFFSET + base, base, size, PAGE_KERNEL_TEXT);
+    setbat(idx, PAGE_OFFSET + base, base, size, PAGE_KERNEL_X);


Good call, with this workaround on top of patches 1-3, it boots again:

# mount -t debugfs d /sys/kernel/debug
# cat /sys/kernel/debug/powerpc/block_address_translation
---[ Instruction Block Address Translation ]---
0: 0xc0000000-0xc0ffffff 0x00000000 Kernel EXEC
1: -
2: 0xc1000000-0xc17fffff 0x01000000 Kernel EXEC
3: -
4: 0xd0000000-0xd1ffffff 0x10000000 Kernel EXEC
5: -
6: -
7: -

---[ Data Block Address Translation ]---
0: 0xc0000000-0xc0ffffff 0x00000000 Kernel RW
1: 0xfffe0000-0xffffffff 0x0d000000 Kernel RW no cache guarded
2: 0xc1000000-0xc17fffff 0x01000000 Kernel RW
3: -
4: 0xd0000000-0xd1ffffff 0x10000000 Kernel RW
5: -
6: -
7: -

I think we may have some code trying to modify the kernel text without
using code patching functions.


Is there any faster way than to sprinkle some printks in setup_kernel
and try to find the guilty piece of code this way?


Can you start with the series
https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=75072 ?


Ok, the thing I was thinking about was the MMU_init_hw() but it is 
called before mapin_ram() so it should not be a problem. Not sure that 
series improves anything at all here.


So there must be something else, pretty early (before the system is able
to properly handle and display an Oops for a write to a RO area).


Does anybody have an idea of what it can be ?



Stupid of me. In fact at the time being, BATS cover both RO and RW data
areas, so it can definitely not be mapped with PAGE_KERNEL_ROX.


In fact, as I have CONFIG_BDI_SWITCH in my setup, PAGE_KERNEL_TEXT is 
PAGE_KERNEL_X on my side. That's the reason why I missed it.


With this change done to patch 3, does the overall series work for
you?


Thanks
Christophe





Christophe



Christophe




Jonathan



Re: [PATCH v1 03/13] powerpc/mm/32s: rework mmu_mapin_ram()

2018-12-18 Thread Christophe Leroy




On 18/12/2018 at 15:15, Christophe Leroy wrote:



On 18/12/2018 at 15:07, Jonathan Neuschäfer wrote:

On Tue, Dec 18, 2018 at 09:18:42AM +, Christophe Leroy wrote:

The only difference I see then is the flags. Everything else seems
identical.

I know you tried already, but would you mind trying once more with the
following change ?


[...]

-    setbat(idx, PAGE_OFFSET + base, base, size, PAGE_KERNEL_TEXT);
+    setbat(idx, PAGE_OFFSET + base, base, size, PAGE_KERNEL_X);


Good call, with this workaround on top of patches 1-3, it boots again:

# mount -t debugfs d /sys/kernel/debug
# cat /sys/kernel/debug/powerpc/block_address_translation
---[ Instruction Block Address Translation ]---
0: 0xc0000000-0xc0ffffff 0x00000000 Kernel EXEC
1: -
2: 0xc1000000-0xc17fffff 0x01000000 Kernel EXEC
3: -
4: 0xd0000000-0xd1ffffff 0x10000000 Kernel EXEC
5: -
6: -
7: -

---[ Data Block Address Translation ]---
0: 0xc0000000-0xc0ffffff 0x00000000 Kernel RW
1: 0xfffe0000-0xffffffff 0x0d000000 Kernel RW no cache guarded
2: 0xc1000000-0xc17fffff 0x01000000 Kernel RW
3: -
4: 0xd0000000-0xd1ffffff 0x10000000 Kernel RW
5: -
6: -
7: -

I think we may have some code trying to modify the kernel text without
using code patching functions.


Is there any faster way than to sprinkle some printks in setup_kernel
and try to find the guilty piece of code this way?


Can you start with the series
https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=75072 ?


Ok, the thing I was thinking about was the MMU_init_hw() but it is 
called before mapin_ram() so it should not be a problem. Not sure that 
series improves anything at all here.


So there must be something else, pretty early (before the system is able
to properly handle and display an Oops for a write to a RO area).


Does anybody have an idea of what it can be ?

Christophe



Christophe




Jonathan



Re: [PATCH V4 0/3] * mm/kvm/vfio/ppc64: Migrate compound pages out of CMA region

2018-12-18 Thread Michal Hocko
On Fri 07-12-18 15:12:26, Andrew Morton wrote:
> On Wed, 21 Nov 2018 14:52:56 +0530 "Aneesh Kumar K.V" 
>  wrote:
> 
> > Subject: [PATCH V4 0/3] * mm/kvm/vfio/ppc64: Migrate compound pages out of 
> > CMA region
> 
> Asterisk in title is strange?
> 
> > ppc64 uses the CMA area for the allocation of the guest page table (hash
> > page table). We won't be able to start a guest if we fail to allocate the
> > hash page table. We have observed hash table allocation failures because
> > we failed to migrate pages out of the CMA region because they were pinned.
> > This happens when we are using VFIO. VFIO on ppc64 pins the entire guest
> > RAM. If the guest RAM pages get allocated out of the CMA region, we won't
> > be able to migrate those pages. The pages are also pinned for the lifetime
> > of the guest.
> > 
> > Currently we support migration of non-compound pages. With THP and with
> > the addition of hugetlb migration we can end up allocating compound pages
> > from the CMA region. This patch series adds support for migrating compound
> > pages. The first patch adds the helper get_user_pages_cma_migrate(), which
> > pins the pages, making sure we migrate them out of the CMA region before
> > incrementing the reference count.
> 
> Very little review activity.  Perhaps Andrey and/or Michal can find the
> time..

I am unlikely to find time before the end of the year. Sorry about
that.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v1 03/13] powerpc/mm/32s: rework mmu_mapin_ram()

2018-12-18 Thread Christophe Leroy




On 18/12/2018 at 15:07, Jonathan Neuschäfer wrote:

On Tue, Dec 18, 2018 at 09:18:42AM +, Christophe Leroy wrote:

The only difference I see then is the flags. Everything else seems
identical.

I know you tried already, but would you mind trying once more with the
following change ?


[...]

-   setbat(idx, PAGE_OFFSET + base, base, size, PAGE_KERNEL_TEXT);
+   setbat(idx, PAGE_OFFSET + base, base, size, PAGE_KERNEL_X);


Good call, with this workaround on top of patches 1-3, it boots again:

# mount -t debugfs d /sys/kernel/debug
# cat /sys/kernel/debug/powerpc/block_address_translation
---[ Instruction Block Address Translation ]---
0: 0xc0000000-0xc0ffffff 0x00000000 Kernel EXEC
1: -
2: 0xc1000000-0xc17fffff 0x01000000 Kernel EXEC
3: -
4: 0xd0000000-0xd1ffffff 0x10000000 Kernel EXEC
5: -
6: -
7: -

---[ Data Block Address Translation ]---
0: 0xc0000000-0xc0ffffff 0x00000000 Kernel RW
1: 0xfffe0000-0xffffffff 0x0d000000 Kernel RW no cache guarded
2: 0xc1000000-0xc17fffff 0x01000000 Kernel RW
3: -
4: 0xd0000000-0xd1ffffff 0x10000000 Kernel RW
5: -
6: -
7: -


I think we may have some code trying to modify the kernel text without using
code patching functions.


Is there any faster way than to sprinkle some printks in setup_kernel
and try to find the guilty piece of code this way?


Can you start with the series
https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=75072 ?


Christophe




Jonathan



Re: [PATCH v2] crypto: talitos - fix ablkcipher for CONFIG_VMAP_STACK

2018-12-18 Thread Horia Geanta
On 12/13/2018 9:34 AM, Christophe Leroy wrote:
> [2.364486] WARNING: CPU: 0 PID: 60 at ./arch/powerpc/include/asm/io.h:837 
> dma_nommu_map_page+0x44/0xd4
> [2.373579] CPU: 0 PID: 60 Comm: cryptomgr_test Tainted: GW
>  4.20.0-rc5-00560-g6bfb52e23a00-dirty #531
> [2.384740] NIP:  c000c540 LR: c000c584 CTR: 
> [2.389743] REGS: c95abab0 TRAP: 0700   Tainted: GW  
> (4.20.0-rc5-00560-g6bfb52e23a00-dirty)
> [2.400042] MSR:  00029032   CR: 24042204  XER: 
> [2.406669]
> [2.406669] GPR00: c02f2244 c95abb60 c6262990 c95abd80 256a 0001 
> 0001 0001
> [2.406669] GPR08:  2000 0010 0010 24042202  
> 0100 c95abd88
> [2.406669] GPR16:  c05569d4 0001 0010 c95abc88 c0615664 
> 0004 
> [2.406669] GPR24: 0010 c95abc88 c95abc88  c61ae210 c7ff6d40 
> c61ae210 3d68
> [2.441559] NIP [c000c540] dma_nommu_map_page+0x44/0xd4
> [2.446720] LR [c000c584] dma_nommu_map_page+0x88/0xd4
> [2.451762] Call Trace:
> [2.454195] [c95abb60] [82000808] 0x82000808 (unreliable)
> [2.459572] [c95abb80] [c02f2244] talitos_edesc_alloc+0xbc/0x3c8
> [2.465493] [c95abbb0] [c02f2600] ablkcipher_edesc_alloc+0x4c/0x5c
> [2.471606] [c95abbd0] [c02f4ed0] ablkcipher_encrypt+0x20/0x64
> [2.477389] [c95abbe0] [c02023b0] __test_skcipher+0x4bc/0xa08
> [2.483049] [c95abe00] [c0204b60] test_skcipher+0x2c/0xcc
> [2.488385] [c95abe20] [c0204c48] alg_test_skcipher+0x48/0xbc
> [2.494064] [c95abe40] [c0205cec] alg_test+0x164/0x2e8
> [2.499142] [c95abf00] [c0200dec] cryptomgr_test+0x48/0x50
> [2.504558] [c95abf10] [c0039ff4] kthread+0xe4/0x110
> [2.509471] [c95abf40] [c000e1d0] ret_from_kernel_thread+0x14/0x1c
> [2.515532] Instruction dump:
> [2.518468] 7c7e1b78 7c9d2378 7cbf2b78 41820054 3d20c076 8089c200 3d20c076 
> 7c84e850
> [2.526127] 8129c204 7c842e70 7f844840 419c0008 <0fe0> 2f9e 
> 54847022 7c84fa14
> [2.533960] ---[ end trace bf78d94af73fe3b8 ]---
> [2.539123] talitos ff02.crypto: master data transfer error
> [2.544775] talitos ff02.crypto: TEA error: ISR 0x2000_0040
> [2.551625] alg: skcipher: encryption failed on test 1 for 
> ecb-aes-talitos: ret=22
> 
> IV cannot be on stack when CONFIG_VMAP_STACK is selected because the stack
> cannot be DMA mapped anymore.
> 
Same failure could happen for aead.

> This patch copies the IV from areq->info into the request context.
> 
There is already a per-request structure - talitos_edesc - that should be used
to save the IV.

The best approach to fix the issue (both for ablkcipher and aead) would be to
update talitos_edesc_alloc().
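
Roughly along these lines inside talitos_edesc_alloc(), as a sketch only
(the tail placement of the IV and the use of edesc->iv_dma are my
assumption of the direction, not a tested patch):

	/* allocate ivsize extra bytes at the tail of the edesc and copy the
	 * caller's IV (possibly on the stack) into that DMA-able memory */
	alloc_len += ivsize;
	edesc = kmalloc(alloc_len, GFP_DMA | flags);
	if (!edesc)
		return ERR_PTR(-ENOMEM);
	if (ivsize) {
		iv = memcpy(((u8 *)edesc) + alloc_len - ivsize, iv, ivsize);
		edesc->iv_dma = dma_map_single(dev, iv, ivsize, DMA_TO_DEVICE);
	}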

Thanks,
Horia


Re: [PATCH v1 03/13] powerpc/mm/32s: rework mmu_mapin_ram()

2018-12-18 Thread Jonathan Neuschäfer
On Tue, Dec 18, 2018 at 09:18:42AM +, Christophe Leroy wrote:
> The only difference I see then is the flags. Everything else seems
> identical.
> 
> I know you tried already, but would you mind trying once more with the
> following change ?
> 
[...]
> - setbat(idx, PAGE_OFFSET + base, base, size, PAGE_KERNEL_TEXT);
> + setbat(idx, PAGE_OFFSET + base, base, size, PAGE_KERNEL_X);

Good call, with this workaround on top of patches 1-3, it boots again:

# mount -t debugfs d /sys/kernel/debug
# cat /sys/kernel/debug/powerpc/block_address_translation
---[ Instruction Block Address Translation ]---
0: 0xc0000000-0xc0ffffff 0x00000000 Kernel EXEC
1: -
2: 0xc1000000-0xc17fffff 0x01000000 Kernel EXEC
3: -
4: 0xd0000000-0xd1ffffff 0x10000000 Kernel EXEC
5: -
6: -
7: -

---[ Data Block Address Translation ]---
0: 0xc0000000-0xc0ffffff 0x00000000 Kernel RW
1: 0xfffe0000-0xffffffff 0x0d000000 Kernel RW no cache guarded
2: 0xc1000000-0xc17fffff 0x01000000 Kernel RW
3: -
4: 0xd0000000-0xd1ffffff 0x10000000 Kernel RW
5: -
6: -
7: -

> I think we may have some code trying to modify the kernel text without using
> code patching functions.

Is there any faster way than to sprinkle some printks in setup_kernel
and try to find the guilty piece of code this way?


Jonathan




Re: [PATCH v3] powerpc: implement CONFIG_DEBUG_VIRTUAL

2018-12-18 Thread Michael Ellerman
Christophe Leroy  writes:

> This patch implements CONFIG_DEBUG_VIRTUAL to warn about
> incorrect use of virt_to_phys() and page_to_phys()

This commit is breaking my p5020ds booting a 32-bit kernel with:

  smp: Bringing up secondary CPUs ...
  __ioremap(): phys addr 0x7fef5000 is RAM lr ioremap_coherent
  Unable to handle kernel paging request for data at address 0x00000000
  Faulting instruction address: 0xc002e950
  Oops: Kernel access of bad area, sig: 11 [#1]
  BE SMP NR_CPUS=24 CoreNet Generic
  Modules linked in:
  CPU: 0 PID: 1 Comm: swapper/0 Not tainted 
4.20.0-rc2-gcc-7.0.1-00138-g9a0380d299e9 #148
  NIP:  c002e950 LR: c002eb20 CTR: 0001
  REGS: e804bd20 TRAP: 0300   Not tainted  
(4.20.0-rc2-gcc-7.0.1-00138-g9a0380d299e9)
  MSR:  00021002   CR: 28004222  XER: 
  DEAR:  ESR:  
  GPR00: c002eb20 e804bdd0 e805  00021002  0050 
00021002 
  GPR08: 2d3f 0001  0004 24000842  c00026d0 
 
  GPR16:        
0001 
  GPR24: 00029002 7fef5140 3000   0040 0001 
 
  NIP [c002e950] smp_85xx_kick_cpu+0x120/0x410
  LR [c002eb20] smp_85xx_kick_cpu+0x2f0/0x410
  Call Trace:
  [e804bdd0] [c002eb20] smp_85xx_kick_cpu+0x2f0/0x410 (unreliable)
  [e804be20] [c0012e38] __cpu_up+0xc8/0x230
  [e804be50] [c0040b34] bringup_cpu+0x34/0x110
  [e804be70] [c00418a8] cpu_up+0x128/0x250
  [e804beb0] [c0b84b14] smp_init+0xc4/0x10c
  [e804bee0] [c0b75c1c] kernel_init_freeable+0xc8/0x250
  [e804bf20] [c00026e8] kernel_init+0x18/0x120
  [e804bf40] [c0011298] ret_from_kernel_thread+0x14/0x1c
  Instruction dump:
  7fb3e850 57bdd1be 2e1d 41d20250 57bd3032 393dffc0 7e6a9b78 5529d1be 
  39290001 7d2903a6 6000 6000 <7c0050ac> 394a0040 4200fff8 7c0004ac 
  ---[ end trace edcab2a1dfd5b38c ]---


Which is obviously this hunk:

> diff --git a/arch/powerpc/mm/pgtable_32.c b/arch/powerpc/mm/pgtable_32.c
> index 4fc77a99c9bf..68d204a45cd0 100644
> --- a/arch/powerpc/mm/pgtable_32.c
> +++ b/arch/powerpc/mm/pgtable_32.c
> @@ -143,7 +143,7 @@ __ioremap_caller(phys_addr_t addr, unsigned long size, 
> pgprot_t prot, void *call
>* Don't allow anybody to remap normal RAM that we're using.
>* mem_init() sets high_memory so only do the check after that.
>*/
> - if (slab_is_available() && (p < virt_to_phys(high_memory)) &&
> + if (slab_is_available() && virt_addr_valid(p) &&
>   page_is_ram(__phys_to_pfn(p))) {
>   printk("__ioremap(): phys addr 0x%llx is RAM lr %ps\n",
>  (unsigned long long)p, __builtin_return_address(0));


I'll try and come up with a fix tomorrow.

cheers



Re: [PATCH V4 1/3] mm: Add get_user_pages_cma_migrate

2018-12-18 Thread Michael Ellerman
"Aneesh Kumar K.V"  writes:

> This helper does a get_user_pages_fast and if it finds pages in the CMA area
> it will try to migrate them before taking a page reference. This makes sure that
> we don't keep non-movable pages (due to page reference count) in the CMA area.
> Not being able to move pages out of the CMA area results in CMA allocation failures.
>
> Signed-off-by: Aneesh Kumar K.V 
> ---
>  include/linux/hugetlb.h |   2 +
>  include/linux/migrate.h |   3 +
>  mm/hugetlb.c|   4 +-
>  mm/migrate.c| 132 
>  4 files changed, 139 insertions(+), 2 deletions(-)

I'd rather not merge this much mm/ code via the powerpc tree without
acks.

Anyone?

cheers


> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 087fd5f48c91..1eed0cdaec0e 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -371,6 +371,8 @@ struct page *alloc_huge_page_nodemask(struct hstate *h, 
> int preferred_nid,
>   nodemask_t *nmask);
>  struct page *alloc_huge_page_vma(struct hstate *h, struct vm_area_struct 
> *vma,
>   unsigned long address);
> +struct page *alloc_migrate_huge_page(struct hstate *h, gfp_t gfp_mask,
> +  int nid, nodemask_t *nmask);
>  int huge_add_to_page_cache(struct page *page, struct address_space *mapping,
>   pgoff_t idx);
>  
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index f2b4abbca55e..d82b35afd2eb 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -286,6 +286,9 @@ static inline int migrate_vma(const struct 
> migrate_vma_ops *ops,
>  }
>  #endif /* IS_ENABLED(CONFIG_MIGRATE_VMA_HELPER) */
>  
> +extern int get_user_pages_cma_migrate(unsigned long start, int nr_pages, int 
> write,
> +   struct page **pages);
> +
>  #endif /* CONFIG_MIGRATION */
>  
>  #endif /* _LINUX_MIGRATE_H */
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 7f2a28ab46d5..faf3102ae45e 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1585,8 +1585,8 @@ static struct page *alloc_surplus_huge_page(struct 
> hstate *h, gfp_t gfp_mask,
>   return page;
>  }
>  
> -static struct page *alloc_migrate_huge_page(struct hstate *h, gfp_t gfp_mask,
> - int nid, nodemask_t *nmask)
> +struct page *alloc_migrate_huge_page(struct hstate *h, gfp_t gfp_mask,
> +  int nid, nodemask_t *nmask)
>  {
>   struct page *page;
>  
> diff --git a/mm/migrate.c b/mm/migrate.c
> index f7e4bfdc13b7..b0e47e2c5347 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -2946,3 +2946,135 @@ int migrate_vma(const struct migrate_vma_ops *ops,
>  }
>  EXPORT_SYMBOL(migrate_vma);
>  #endif /* defined(MIGRATE_VMA_HELPER) */
> +
> +static struct page *new_non_cma_page(struct page *page, unsigned long 
> private)
> +{
> + /*
> +  * We want to make sure we allocate the new page from the same node
> +  * as the source page.
> +  */
> + int nid = page_to_nid(page);
> + gfp_t gfp_mask = GFP_USER | __GFP_THISNODE;
> +
> + if (PageHighMem(page))
> + gfp_mask |= __GFP_HIGHMEM;
> +
> +#ifdef CONFIG_HUGETLB_PAGE
> + if (PageHuge(page)) {
> + struct hstate *h = page_hstate(page);
> + /*
> +  * We don't want to dequeue from the pool because pool pages 
> will
> +  * mostly be from the CMA region.
> +  */
> + return alloc_migrate_huge_page(h, gfp_mask, nid, NULL);
> + }
> +#endif
> + if (PageTransHuge(page)) {
> + struct page *thp;
> + gfp_t thp_gfpmask = GFP_TRANSHUGE | __GFP_THISNODE;
> +
> + /*
> +  * Remove the movable mask so that we don't allocate from
> +  * CMA area again.
> +  */
> + thp_gfpmask &= ~__GFP_MOVABLE;
> + thp = __alloc_pages_node(nid, thp_gfpmask, HPAGE_PMD_ORDER);
> + if (!thp)
> + return NULL;
> + prep_transhuge_page(thp);
> + return thp;
> + }
> +
> + return __alloc_pages_node(nid, gfp_mask, 0);
> +}
> +
> +/**
> + * get_user_pages_cma_migrate() - pin user pages in memory by migrating 
> pages in CMA region
> + * @start:   starting user address
> + * @nr_pages:number of pages from start to pin
> + * @write:   whether pages will be written to
> + * @pages:   array that receives pointers to the pages pinned.
> + *   Should be at least nr_pages long.
> + *
> + * Attempt to pin user pages in memory without taking mm->mmap_sem.
> + * If not successful, it will fall back to taking the lock and
> + * calling get_user_pages().
> + *
> + * If the pinned pages are backed by CMA region, we migrate those pages out,
> + * allocating new pages from non-CMA region. This helps in avoiding keeping
> + * pages pinned in the CMA region for a long time 

powerpc syscall_set_return_value() is confused (was Re: [PATCH v6 18/27] powerpc: define syscall_get_error())

2018-12-18 Thread Michael Ellerman
Hi Dmitry,

"Dmitry V. Levin"  writes:
> syscall_get_error() is required to be implemented on this
> architecture in addition to already implemented syscall_get_nr(),
> syscall_get_arguments(), syscall_get_return_value(), and
> syscall_get_arch() functions in order to extend the generic
> ptrace API with PTRACE_GET_SYSCALL_INFO request.
>
> Cc: Michael Ellerman 
> Cc: Benjamin Herrenschmidt 
> Cc: Paul Mackerras 
> Cc: Oleg Nesterov 
> Cc: Andy Lutomirski 
> Cc: Elvira Khabirova 
> Cc: Eugene Syromyatnikov 
> Cc: linuxppc-dev@lists.ozlabs.org
> Signed-off-by: Dmitry V. Levin 
> ---
>
> Notes:
> v6: unchanged
> 
> v5:
> This change has been tested with
> tools/testing/selftests/ptrace/get_syscall_info.c and strace,
> so it's correct from PTRACE_GET_SYSCALL_INFO point of view.
> 
> This casts doubt on commit v4.3-rc1~86^2~81, which changed
> syscall_set_return_value() in a way that doesn't quite match
> syscall_get_error(), but syscall_set_return_value() is out
> of scope of this series, so I'll just let you know my concerns.

Sorry I only just saw this comment.

It's going to take me a while to page this stuff back into my brain, but
I think you may have a point.

I think the way it's written now *works* but only because it's only used
by seccomp, and we rely on the fact that the syscall exit path will
negate the value before returning to userspace or calling ptrace etc.

eg. we do:

syscall_set_return_value()
if (error) {
regs->ccr |= 0x10000000L;
regs->gpr[3] = error;

then the asm does:

/* Return code is already in r3 thanks to do_syscall_trace_enter() */
b   .Lsyscall_exit
...

.Lsyscall_exit:
std r3,RESULT(r1)
...

3:  cmpld   r3,r11
ld  r5,_CCR(r1)
bge-.Lsyscall_error
...

.Lsyscall_error:
oris	r5,r5,0x1000	/* Set SO bit in CR */
neg r3,r3
std r5,_CCR(r1)

And we do the same before calling do_syscall_trace_leave().


Still it's a bit confused: in the C code we're setting r3 and CCR, but
we're not negating the value in r3, and we're not setting RESULT at
all.

I'll test a patch to fix it up.
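
Something along these lines perhaps -- an untested sketch only, and the
asm exit path's negation would need adjusting to match, so don't read
this as the final patch:

	static inline void syscall_set_return_value(struct task_struct *task,
						    struct pt_regs *regs,
						    int error, long val)
	{
		if (error) {
			/* error is a negative errno; store what userspace
			 * will actually see: positive errno with SO set */
			regs->ccr |= 0x10000000L;
			regs->gpr[3] = -error;
			regs->result = -error;
		} else {
			regs->ccr &= ~0x10000000L;
			regs->gpr[3] = val;
			regs->result = val;
		}
	}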

cheers


[PATCH 2/2] s390/pci: skip VF scanning

2018-12-18 Thread Sebastian Ott
Set the flag to skip scanning for VFs after SRIOV enablement.
VF creation will be triggered by the hotplug code.

Signed-off-by: Sebastian Ott 
---
 arch/s390/pci/pci.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/s390/pci/pci.c b/arch/s390/pci/pci.c
index 9f6f392a4461..4266a4de3160 100644
--- a/arch/s390/pci/pci.c
+++ b/arch/s390/pci/pci.c
@@ -651,6 +651,9 @@ int pcibios_add_device(struct pci_dev *pdev)
struct resource *res;
int i;
 
+   if (pdev->is_physfn)
+   pdev->no_vf_scan = 1;
+
pdev->dev.groups = zpci_attr_groups;
pdev->dev.dma_ops = &s390_pci_dma_ops;
zpci_map_resources(pdev);
-- 
2.13.4



[PATCH 1/2] PCI/IOV: provide flag to skip VF scanning

2018-12-18 Thread Sebastian Ott
Provide a flag to skip scanning for new VFs after SRIOV enablement.
This can be set by implementations for which the VFs are already
reported by other means.
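
For example, an implementation can set the flag from its
pcibios_add_device() hook, roughly like this (patch 2/2 does exactly
this for s390):

	int pcibios_add_device(struct pci_dev *pdev)
	{
		if (pdev->is_physfn)
			pdev->no_vf_scan = 1;	/* VFs are reported via hotplug */

		/* ... existing per-arch setup ... */
		return 0;
	}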

Signed-off-by: Sebastian Ott 
---
 drivers/pci/iov.c   | 48 
 include/linux/pci.h |  1 +
 2 files changed, 37 insertions(+), 12 deletions(-)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index 9616eca3182f..3aa115ed3a65 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -252,6 +252,27 @@ int __weak pcibios_sriov_disable(struct pci_dev *pdev)
return 0;
 }
 
+static int sriov_add_vfs(struct pci_dev *dev, u16 num_vfs)
+{
+   unsigned int i;
+   int rc;
+
+   if (dev->no_vf_scan)
+   return 0;
+
+   for (i = 0; i < num_vfs; i++) {
+   rc = pci_iov_add_virtfn(dev, i);
+   if (rc)
+   goto failed;
+   }
+   return 0;
+failed:
+   while (i--)
+   pci_iov_remove_virtfn(dev, i);
+
+   return rc;
+}
+
 static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
 {
int rc;
@@ -337,21 +358,15 @@ static int sriov_enable(struct pci_dev *dev, int 
nr_virtfn)
msleep(100);
pci_cfg_access_unlock(dev);
 
-   for (i = 0; i < initial; i++) {
-   rc = pci_iov_add_virtfn(dev, i);
-   if (rc)
-   goto failed;
-   }
+   rc = sriov_add_vfs(dev, initial);
+   if (rc)
+   goto err_pcibios;
 
kobject_uevent(&dev->dev.kobj, KOBJ_CHANGE);
iov->num_VFs = nr_virtfn;
 
return 0;
 
-failed:
-   while (i--)
-   pci_iov_remove_virtfn(dev, i);
-
 err_pcibios:
iov->ctrl &= ~(PCI_SRIOV_CTRL_VFE | PCI_SRIOV_CTRL_MSE);
pci_cfg_access_lock(dev);
@@ -368,17 +383,26 @@ static int sriov_enable(struct pci_dev *dev, int 
nr_virtfn)
return rc;
 }
 
-static void sriov_disable(struct pci_dev *dev)
+static void sriov_del_vfs(struct pci_dev *dev)
 {
-   int i;
struct pci_sriov *iov = dev->sriov;
+   int i;
 
-   if (!iov->num_VFs)
+   if (dev->no_vf_scan)
return;
 
for (i = 0; i < iov->num_VFs; i++)
pci_iov_remove_virtfn(dev, i);
+}
+
+static void sriov_disable(struct pci_dev *dev)
+{
+   struct pci_sriov *iov = dev->sriov;
+
+   if (!iov->num_VFs)
+   return;
 
+   sriov_del_vfs(dev);
iov->ctrl &= ~(PCI_SRIOV_CTRL_VFE | PCI_SRIOV_CTRL_MSE);
pci_cfg_access_lock(dev);
pci_write_config_word(dev, iov->pos + PCI_SRIOV_CTRL, iov->ctrl);
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 11c71c4ecf75..f70b9ccd3e86 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -405,6 +405,7 @@ struct pci_dev {
unsigned int	non_compliant_bars:1;	/* Broken BARs; ignore them */
unsigned int	is_probed:1;		/* Device probing in progress */
unsigned int	link_active_reporting:1; /* Device capable of reporting link active */
+	unsigned int	no_vf_scan:1;	/* Don't scan for VF's after VF enablement */
pci_dev_flags_t	dev_flags;
atomic_t	enable_cnt;	/* pci_enable_device has been called */
 
-- 
2.13.4



[PATCH V3 2/2] Tools: Replace open encodings for NUMA_NO_NODE

2018-12-18 Thread Anshuman Khandual
From: Stephen Rothwell 

This replaces all open encodings in tools with NUMA_NO_NODE.
Also linux/numa.h is now needed for the perf build.

Signed-off-by: Anshuman Khandual 
Signed-off-by: Stephen Rothwell 
---
 tools/include/linux/numa.h | 16 
 tools/perf/bench/numa.c|  6 +++---
 2 files changed, 19 insertions(+), 3 deletions(-)
 create mode 100644 tools/include/linux/numa.h

diff --git a/tools/include/linux/numa.h b/tools/include/linux/numa.h
new file mode 100644
index 000..110b0e5
--- /dev/null
+++ b/tools/include/linux/numa.h
@@ -0,0 +1,16 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_NUMA_H
+#define _LINUX_NUMA_H
+
+
+#ifdef CONFIG_NODES_SHIFT
+#define NODES_SHIFT CONFIG_NODES_SHIFT
+#else
+#define NODES_SHIFT 0
+#endif
+
+#define MAX_NUMNODES	(1 << NODES_SHIFT)
+
+#define	NUMA_NO_NODE	(-1)
+
+#endif /* _LINUX_NUMA_H */
diff --git a/tools/perf/bench/numa.c b/tools/perf/bench/numa.c
index 4419551..e0ad5f1 100644
--- a/tools/perf/bench/numa.c
+++ b/tools/perf/bench/numa.c
@@ -298,7 +298,7 @@ static cpu_set_t bind_to_node(int target_node)
 
CPU_ZERO(&mask);
 
-   if (target_node == -1) {
+   if (target_node == NUMA_NO_NODE) {
for (cpu = 0; cpu < g->p.nr_cpus; cpu++)
CPU_SET(cpu, &mask);
} else {
@@ -339,7 +339,7 @@ static void bind_to_memnode(int node)
unsigned long nodemask;
int ret;
 
-   if (node == -1)
+   if (node == NUMA_NO_NODE)
return;
 
BUG_ON(g->p.nr_nodes > (int)sizeof(nodemask)*8);
@@ -1363,7 +1363,7 @@ static void init_thread_data(void)
int cpu;
 
/* Allow all nodes by default: */
-   td->bind_node = -1;
+   td->bind_node = NUMA_NO_NODE;
 
/* Allow all CPUs by default: */
CPU_ZERO(&td->bind_cpumask);
-- 
2.7.4



[PATCH V3 1/2] mm: Replace all open encodings for NUMA_NO_NODE

2018-12-18 Thread Anshuman Khandual
At present there are multiple places where invalid node number is encoded
as -1. Even though implicitly understood it is always better to have macros
in there. Replace these open encodings for an invalid node number with the
global macro NUMA_NO_NODE. This helps remove NUMA related assumptions like
'invalid node' from various places redirecting them to a common definition.

Reviewed-by: David Hildenbrand 
Acked-by: Jeff Kirsher [ixgbe]
Acked-by: Jens Axboe   [mtip32xx]
Acked-by: Vinod Koul  [dmaengine.c]
Acked-by: Michael Ellerman [powerpc]
Acked-by: Doug Ledford [drivers/infiniband]
Signed-off-by: Anshuman Khandual 
---
 arch/alpha/include/asm/topology.h |  3 ++-
 arch/ia64/kernel/numa.c   |  2 +-
 arch/ia64/mm/discontig.c  |  6 +++---
 arch/powerpc/include/asm/pci-bridge.h |  3 ++-
 arch/powerpc/kernel/paca.c|  3 ++-
 arch/powerpc/kernel/pci-common.c  |  3 ++-
 arch/powerpc/mm/numa.c| 14 +++---
 arch/powerpc/platforms/powernv/memtrace.c |  5 +++--
 arch/sparc/kernel/pci_fire.c  |  3 ++-
 arch/sparc/kernel/pci_schizo.c|  3 ++-
 arch/sparc/kernel/psycho_common.c |  3 ++-
 arch/sparc/kernel/sbus.c  |  3 ++-
 arch/sparc/mm/init_64.c   |  6 +++---
 arch/x86/include/asm/pci.h|  3 ++-
 arch/x86/kernel/apic/x2apic_uv_x.c|  7 ---
 arch/x86/kernel/smpboot.c |  3 ++-
 drivers/block/mtip32xx/mtip32xx.c |  5 +++--
 drivers/dma/dmaengine.c   |  4 +++-
 drivers/infiniband/hw/hfi1/affinity.c |  3 ++-
 drivers/infiniband/hw/hfi1/init.c |  3 ++-
 drivers/iommu/dmar.c  |  5 +++--
 drivers/iommu/intel-iommu.c   |  3 ++-
 drivers/misc/sgi-xp/xpc_uv.c  |  3 ++-
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |  5 +++--
 include/linux/device.h|  2 +-
 init/init_task.c  |  3 ++-
 kernel/kthread.c  |  3 ++-
 kernel/sched/fair.c   | 15 ---
 lib/cpumask.c |  3 ++-
 mm/huge_memory.c  | 13 +++--
 mm/hugetlb.c  |  3 ++-
 mm/ksm.c  |  2 +-
 mm/memory.c   |  7 ---
 mm/memory_hotplug.c   | 12 ++--
 mm/mempolicy.c|  2 +-
 mm/page_alloc.c   |  4 ++--
 mm/page_ext.c |  2 +-
 net/core/pktgen.c |  3 ++-
 net/qrtr/qrtr.c   |  3 ++-
 39 files changed, 104 insertions(+), 74 deletions(-)

diff --git a/arch/alpha/include/asm/topology.h 
b/arch/alpha/include/asm/topology.h
index e6e13a8..5a77a40 100644
--- a/arch/alpha/include/asm/topology.h
+++ b/arch/alpha/include/asm/topology.h
@@ -4,6 +4,7 @@
 
 #include 
 #include 
+#include 
 #include 
 
 #ifdef CONFIG_NUMA
@@ -29,7 +30,7 @@ static const struct cpumask *cpumask_of_node(int node)
 {
int cpu;
 
-   if (node == -1)
+   if (node == NUMA_NO_NODE)
return cpu_all_mask;
 
cpumask_clear(&node_to_cpumask_map[node]);
diff --git a/arch/ia64/kernel/numa.c b/arch/ia64/kernel/numa.c
index 92c3762..1315da6 100644
--- a/arch/ia64/kernel/numa.c
+++ b/arch/ia64/kernel/numa.c
@@ -74,7 +74,7 @@ void __init build_cpu_to_node_map(void)
cpumask_clear(&node_to_cpu_mask[node]);
 
for_each_possible_early_cpu(cpu) {
-   node = -1;
+   node = NUMA_NO_NODE;
for (i = 0; i < NR_CPUS; ++i)
if (cpu_physical_id(cpu) == node_cpuid[i].phys_id) {
node = node_cpuid[i].nid;
diff --git a/arch/ia64/mm/discontig.c b/arch/ia64/mm/discontig.c
index 8a96578..f9c3675 100644
--- a/arch/ia64/mm/discontig.c
+++ b/arch/ia64/mm/discontig.c
@@ -227,7 +227,7 @@ void __init setup_per_cpu_areas(void)
 * CPUs are put into groups according to node.  Walk cpu_map
 * and create new groups at node boundaries.
 */
-   prev_node = -1;
+   prev_node = NUMA_NO_NODE;
ai->nr_groups = 0;
for (unit = 0; unit < nr_units; unit++) {
cpu = cpu_map[unit];
@@ -435,7 +435,7 @@ static void __init *memory_less_node_alloc(int nid, 
unsigned long pernodesize)
 {
void *ptr = NULL;
u8 best = 0xff;
-   int bestnode = -1, node, anynode = 0;
+   int bestnode = NUMA_NO_NODE, node, anynode = 0;
 
for_each_online_node(node) {
if (node_isset(node, memory_less_mask))
@@ -447,7 +447,7 @@ static void __init *memory_less_node_alloc(int 

[PATCH V3 0/2] Replace all open encodings for NUMA_NO_NODE

2018-12-18 Thread Anshuman Khandual
Changes in V3:

- Dropped all references to NUMA_NO_NODE as per Lubomir Rintel
- Split the patch into two creating a new one specifically for tools
- Folded Stephen's linux-next build fix into the second patch

Changes in V2: (https://patchwork.kernel.org/patch/10698089/)

- Added inclusion of 'numa.h' header at various places per Andrew
- Updated 'dev_to_node' to use NUMA_NO_NODE instead per Vinod

Changes in V1: (https://lkml.org/lkml/2018/11/23/485)

- Dropped OCFS2 changes per Joseph
- Dropped media/video drivers changes per Hans

RFC - https://patchwork.kernel.org/patch/10678035/

Build tested this with multiple cross compiler options like alpha, sparc,
arm64, x86, powerpc, powerpc64le etc. with their default configs, which might
not have compile-tested all driver-related changes. I will appreciate
folks giving this a test in their respective build environments.

All these places for replacement were found by running the following grep
patterns on the entire kernel code. Please let me know if this might have
missed some instances. This might also have replaced some false positives.
I will appreciate suggestions, inputs and review.

1. git grep "nid == -1"
2. git grep "node == -1"
3. git grep "nid = -1"
4. git grep "node = -1"

NOTE: I can still split the first patch into multiple ones - one for each
subsystem as suggested by Lubomir if that would be better.

Anshuman Khandual (1):
  mm: Replace all open encodings for NUMA_NO_NODE

Stephen Rothwell (1):
  Tools: Replace open encodings for NUMA_NO_NODE

 arch/alpha/include/asm/topology.h |  3 ++-
 arch/ia64/kernel/numa.c   |  2 +-
 arch/ia64/mm/discontig.c  |  6 +++---
 arch/powerpc/include/asm/pci-bridge.h |  3 ++-
 arch/powerpc/kernel/paca.c|  3 ++-
 arch/powerpc/kernel/pci-common.c  |  3 ++-
 arch/powerpc/mm/numa.c| 14 +++---
 arch/powerpc/platforms/powernv/memtrace.c |  5 +++--
 arch/sparc/kernel/pci_fire.c  |  3 ++-
 arch/sparc/kernel/pci_schizo.c|  3 ++-
 arch/sparc/kernel/psycho_common.c |  3 ++-
 arch/sparc/kernel/sbus.c  |  3 ++-
 arch/sparc/mm/init_64.c   |  6 +++---
 arch/x86/include/asm/pci.h|  3 ++-
 arch/x86/kernel/apic/x2apic_uv_x.c|  7 ---
 arch/x86/kernel/smpboot.c |  3 ++-
 drivers/block/mtip32xx/mtip32xx.c |  5 +++--
 drivers/dma/dmaengine.c   |  4 +++-
 drivers/infiniband/hw/hfi1/affinity.c |  3 ++-
 drivers/infiniband/hw/hfi1/init.c |  3 ++-
 drivers/iommu/dmar.c  |  5 +++--
 drivers/iommu/intel-iommu.c   |  3 ++-
 drivers/misc/sgi-xp/xpc_uv.c  |  3 ++-
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |  5 +++--
 include/linux/device.h|  2 +-
 init/init_task.c  |  3 ++-
 kernel/kthread.c  |  3 ++-
 kernel/sched/fair.c   | 15 ---
 lib/cpumask.c |  3 ++-
 mm/huge_memory.c  | 13 +++--
 mm/hugetlb.c  |  3 ++-
 mm/ksm.c  |  2 +-
 mm/memory.c   |  7 ---
 mm/memory_hotplug.c   | 12 ++--
 mm/mempolicy.c|  2 +-
 mm/page_alloc.c   |  4 ++--
 mm/page_ext.c |  2 +-
 net/core/pktgen.c |  3 ++-
 net/qrtr/qrtr.c   |  3 ++-
 tools/include/linux/numa.h| 16 
 tools/perf/bench/numa.c   |  6 +++---
 41 files changed, 123 insertions(+), 77 deletions(-)
 create mode 100644 tools/include/linux/numa.h

-- 
2.7.4



[PATCH V4 5/5] arch/powerpc/mm/hugetlb: NestMMU workaround for hugetlb mprotect RW upgrade

2018-12-18 Thread Aneesh Kumar K.V
NestMMU requires us to mark the pte invalid and flush the tlb when we do a
RW upgrade of pte. We fixed a variant of this in the fault path in commit
bd5050e38aec ("powerpc/mm/radix: Change pte relax sequence to handle
nest MMU hang").

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/book3s/64/hugetlb.h | 12 +
 arch/powerpc/mm/hugetlbpage-hash64.c | 27 
 arch/powerpc/mm/hugetlbpage-radix.c  | 17 
 3 files changed, 56 insertions(+)

diff --git a/arch/powerpc/include/asm/book3s/64/hugetlb.h 
b/arch/powerpc/include/asm/book3s/64/hugetlb.h
index 5b0177733994..66c1e4f88d65 100644
--- a/arch/powerpc/include/asm/book3s/64/hugetlb.h
+++ b/arch/powerpc/include/asm/book3s/64/hugetlb.h
@@ -13,6 +13,10 @@ radix__hugetlb_get_unmapped_area(struct file *file, unsigned 
long addr,
unsigned long len, unsigned long pgoff,
unsigned long flags);
 
+extern void radix__huge_ptep_modify_prot_commit(struct vm_area_struct *vma,
+   unsigned long addr, pte_t *ptep,
+   pte_t old_pte, pte_t pte);
+
 static inline int hstate_get_psize(struct hstate *hstate)
 {
unsigned long shift;
@@ -42,4 +46,12 @@ static inline bool gigantic_page_supported(void)
 /* hugepd entry valid bit */
#define HUGEPD_VAL_BITS	(0x8000000000000000UL)
 
+#define huge_ptep_modify_prot_start huge_ptep_modify_prot_start
+extern pte_t huge_ptep_modify_prot_start(struct vm_area_struct *vma,
+unsigned long addr, pte_t *ptep);
+
+#define huge_ptep_modify_prot_commit huge_ptep_modify_prot_commit
+extern void huge_ptep_modify_prot_commit(struct vm_area_struct *vma,
+unsigned long addr, pte_t *ptep,
+pte_t old_pte, pte_t new_pte);
 #endif
diff --git a/arch/powerpc/mm/hugetlbpage-hash64.c 
b/arch/powerpc/mm/hugetlbpage-hash64.c
index 2e6a8f9345d3..48fe74bfeab1 100644
--- a/arch/powerpc/mm/hugetlbpage-hash64.c
+++ b/arch/powerpc/mm/hugetlbpage-hash64.c
@@ -121,3 +121,30 @@ int __hash_page_huge(unsigned long ea, unsigned long 
access, unsigned long vsid,
*ptep = __pte(new_pte & ~H_PAGE_BUSY);
return 0;
 }
+
+pte_t huge_ptep_modify_prot_start(struct vm_area_struct *vma,
+ unsigned long addr, pte_t *ptep)
+{
+   unsigned long pte_val;
+   /*
+* Clear the _PAGE_PRESENT so that no hardware parallel update is
+* possible. Also keep the pte_present true so that we don't take
+* wrong fault.
+*/
+   pte_val = pte_update(vma->vm_mm, addr, ptep,
+_PAGE_PRESENT, _PAGE_INVALID, 1);
+
+   return __pte(pte_val);
+}
+EXPORT_SYMBOL(huge_ptep_modify_prot_start);
+
+void huge_ptep_modify_prot_commit(struct vm_area_struct *vma, unsigned long 
addr,
+ pte_t *ptep, pte_t old_pte, pte_t pte)
+{
+
+   if (radix_enabled())
+   return radix__huge_ptep_modify_prot_commit(vma, addr, ptep,
+  old_pte, pte);
+   set_huge_pte_at(vma->vm_mm, addr, ptep, pte);
+}
+EXPORT_SYMBOL(huge_ptep_modify_prot_commit);
diff --git a/arch/powerpc/mm/hugetlbpage-radix.c 
b/arch/powerpc/mm/hugetlbpage-radix.c
index 2486bee0f93e..11d9ea28a816 100644
--- a/arch/powerpc/mm/hugetlbpage-radix.c
+++ b/arch/powerpc/mm/hugetlbpage-radix.c
@@ -90,3 +90,20 @@ radix__hugetlb_get_unmapped_area(struct file *file, unsigned 
long addr,
 
return vm_unmapped_area(&info);
 }
+
+void radix__huge_ptep_modify_prot_commit(struct vm_area_struct *vma,
+unsigned long addr, pte_t *ptep,
+pte_t old_pte, pte_t pte)
+{
+   struct mm_struct *mm = vma->vm_mm;
+
+   /*
+* To avoid NMMU hang while relaxing access we need to flush the tlb 
before
+* we set the new value.
+*/
+   if (is_pte_rw_upgrade(pte_val(old_pte), pte_val(pte)) &&
+   (atomic_read(&mm->context.copros) > 0))
+   radix__flush_hugetlb_page(vma, addr);
+
+   set_huge_pte_at(vma->vm_mm, addr, ptep, pte);
+}
-- 
2.19.2



[PATCH V4 4/5] mm/hugetlb: Add prot_modify_start/commit sequence for hugetlb update

2018-12-18 Thread Aneesh Kumar K.V
Architectures like ppc64 require a conditional tlb flush based on the old
and new value of the pte. Follow the regular pte change protection sequence for
hugetlb too. This allows the architectures to override the update sequence.
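
An architecture opts in by defining the helpers ahead of the generic
#ifndef fallbacks added below, e.g. (this mirrors what the next patch
does for ppc64):

	#define huge_ptep_modify_prot_start huge_ptep_modify_prot_start
	extern pte_t huge_ptep_modify_prot_start(struct vm_area_struct *vma,
						 unsigned long addr, pte_t *ptep);

	#define huge_ptep_modify_prot_commit huge_ptep_modify_prot_commit
	extern void huge_ptep_modify_prot_commit(struct vm_area_struct *vma,
						 unsigned long addr, pte_t *ptep,
						 pte_t old_pte, pte_t new_pte);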

Signed-off-by: Aneesh Kumar K.V 
---
 include/linux/hugetlb.h | 20 
 mm/hugetlb.c|  8 +---
 2 files changed, 25 insertions(+), 3 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 087fd5f48c91..39e78b80375c 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -543,6 +543,26 @@ static inline void set_huge_swap_pte_at(struct mm_struct 
*mm, unsigned long addr
set_huge_pte_at(mm, addr, ptep, pte);
 }
 #endif
+
+#ifndef huge_ptep_modify_prot_start
+#define huge_ptep_modify_prot_start huge_ptep_modify_prot_start
+static inline pte_t huge_ptep_modify_prot_start(struct vm_area_struct *vma,
+   unsigned long addr, pte_t *ptep)
+{
+   return huge_ptep_get_and_clear(vma->vm_mm, addr, ptep);
+}
+#endif
+
+#ifndef huge_ptep_modify_prot_commit
+#define huge_ptep_modify_prot_commit huge_ptep_modify_prot_commit
+static inline void huge_ptep_modify_prot_commit(struct vm_area_struct *vma,
+   unsigned long addr, pte_t *ptep,
+   pte_t old_pte, pte_t pte)
+{
+   set_huge_pte_at(vma->vm_mm, addr, ptep, pte);
+}
+#endif
+
 #else  /* CONFIG_HUGETLB_PAGE */
 struct hstate {};
 #define alloc_huge_page(v, a, r) NULL
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 705a3e9cc910..353bff385595 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4388,10 +4388,12 @@ unsigned long hugetlb_change_protection(struct 
vm_area_struct *vma,
continue;
}
if (!huge_pte_none(pte)) {
-   pte = huge_ptep_get_and_clear(mm, address, ptep);
-   pte = pte_mkhuge(huge_pte_modify(pte, newprot));
+   pte_t old_pte;
+
+   old_pte = huge_ptep_modify_prot_start(vma, address, 
ptep);
+   pte = pte_mkhuge(huge_pte_modify(old_pte, newprot));
pte = arch_make_huge_pte(pte, vma, NULL, 0);
-   set_huge_pte_at(mm, address, ptep, pte);
+   huge_ptep_modify_prot_commit(vma, address, ptep, 
old_pte, pte);
pages++;
}
spin_unlock(ptl);
-- 
2.19.2



[PATCH V4 3/5] arch/powerpc/mm: Nest MMU workaround for mprotect RW upgrade.

2018-12-18 Thread Aneesh Kumar K.V
NestMMU requires us to mark the pte invalid and flush the tlb when we do a
RW upgrade of pte. We fixed a variant of this in the fault path in commit
bd5050e38aec ("powerpc/mm/radix: Change pte relax sequence to handle
nest MMU hang").

Do the same for mprotect upgrades.

Hugetlb is handled in the next patch.
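
To make the condition concrete, a few illustrative cases for the
is_pte_rw_upgrade() helper added below (my own examples, not from the
patch):

	is_pte_rw_upgrade(_PAGE_READ, _PAGE_READ | _PAGE_WRITE);  /* true: R -> RW */
	is_pte_rw_upgrade(_PAGE_READ | _PAGE_WRITE,
			  _PAGE_READ | _PAGE_WRITE);              /* false: already RW */
	is_pte_rw_upgrade(0, _PAGE_READ | _PAGE_WRITE);           /* false: no read before */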

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/book3s/64/pgtable.h | 18 +
 arch/powerpc/include/asm/book3s/64/radix.h   |  4 +++
 arch/powerpc/mm/pgtable-book3s64.c   | 27 
 arch/powerpc/mm/pgtable-radix.c  | 18 +
 4 files changed, 67 insertions(+)

diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 2e6ada28da64..92eaea164700 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -1314,6 +1314,24 @@ static inline int pud_pfn(pud_t pud)
BUILD_BUG();
return 0;
 }
+#define __HAVE_ARCH_PTEP_MODIFY_PROT_TRANSACTION
+pte_t ptep_modify_prot_start(struct vm_area_struct *, unsigned long, pte_t *);
+void ptep_modify_prot_commit(struct vm_area_struct *, unsigned long,
+pte_t *, pte_t, pte_t);
+
+/*
+ * Returns true for a R -> RW upgrade of pte
+ */
+static inline bool is_pte_rw_upgrade(unsigned long old_val, unsigned long 
new_val)
+{
+   if (!(old_val & _PAGE_READ))
+   return false;
+
+   if ((!(old_val & _PAGE_WRITE)) && (new_val & _PAGE_WRITE))
+   return true;
+
+   return false;
+}
 
 #endif /* __ASSEMBLY__ */
 #endif /* _ASM_POWERPC_BOOK3S_64_PGTABLE_H_ */
diff --git a/arch/powerpc/include/asm/book3s/64/radix.h 
b/arch/powerpc/include/asm/book3s/64/radix.h
index 7d1a3d1543fc..5ab134eeed20 100644
--- a/arch/powerpc/include/asm/book3s/64/radix.h
+++ b/arch/powerpc/include/asm/book3s/64/radix.h
@@ -127,6 +127,10 @@ extern void radix__ptep_set_access_flags(struct 
vm_area_struct *vma, pte_t *ptep
 pte_t entry, unsigned long address,
 int psize);
 
+extern void radix__ptep_modify_prot_commit(struct vm_area_struct *vma,
+  unsigned long addr, pte_t *ptep,
+  pte_t old_pte, pte_t pte);
+
 static inline unsigned long __radix_pte_update(pte_t *ptep, unsigned long clr,
   unsigned long set)
 {
diff --git a/arch/powerpc/mm/pgtable-book3s64.c 
b/arch/powerpc/mm/pgtable-book3s64.c
index f3c31f5e1026..d6ff1f99ccfc 100644
--- a/arch/powerpc/mm/pgtable-book3s64.c
+++ b/arch/powerpc/mm/pgtable-book3s64.c
@@ -400,3 +400,30 @@ void arch_report_meminfo(struct seq_file *m)
   atomic_long_read(&direct_pages_count[MMU_PAGE_1G]) << 20);
 }
 #endif /* CONFIG_PROC_FS */
+
+pte_t ptep_modify_prot_start(struct vm_area_struct *vma, unsigned long addr,
+pte_t *ptep)
+{
+   unsigned long pte_val;
+
+   /*
+* Clear the _PAGE_PRESENT so that no hardware parallel update is
+* possible. Also keep the pte_present true so that we don't take
+* wrong fault.
+*/
+   pte_val = pte_update(vma->vm_mm, addr, ptep, _PAGE_PRESENT, 
_PAGE_INVALID, 0);
+
+   return __pte(pte_val);
+
+}
+EXPORT_SYMBOL(ptep_modify_prot_start);
+
+void ptep_modify_prot_commit(struct vm_area_struct *vma, unsigned long addr,
+pte_t *ptep, pte_t old_pte, pte_t pte)
+{
+   if (radix_enabled())
+   return radix__ptep_modify_prot_commit(vma, addr,
+ ptep, old_pte, pte);
+   set_pte_at(vma->vm_mm, addr, ptep, pte);
+}
+EXPORT_SYMBOL(ptep_modify_prot_commit);
diff --git a/arch/powerpc/mm/pgtable-radix.c b/arch/powerpc/mm/pgtable-radix.c
index 931156069a81..dced3cd241c2 100644
--- a/arch/powerpc/mm/pgtable-radix.c
+++ b/arch/powerpc/mm/pgtable-radix.c
@@ -1063,3 +1063,21 @@ void radix__ptep_set_access_flags(struct vm_area_struct 
*vma, pte_t *ptep,
}
/* See ptesync comment in radix__set_pte_at */
 }
+
+void radix__ptep_modify_prot_commit(struct vm_area_struct *vma,
+   unsigned long addr, pte_t *ptep,
+   pte_t old_pte, pte_t pte)
+{
+   struct mm_struct *mm = vma->vm_mm;
+
+   /*
+* To avoid NMMU hang while relaxing access we need to flush the tlb 
before
+* we set the new value. We need to do this only for radix, because hash
+* translation does flush when updating the linux pte.
+*/
+   if (is_pte_rw_upgrade(pte_val(old_pte), pte_val(pte)) &&
+   (atomic_read(&mm->context.copros) > 0))
+   radix__flush_tlb_page(vma, addr);
+
+   set_pte_at(mm, addr, ptep, pte);
+}
-- 
2.19.2



[PATCH V4 2/5] mm: update ptep_modify_prot_commit to take old pte value as arg

2018-12-18 Thread Aneesh Kumar K.V
Architectures like ppc64 require a conditional tlb flush based on the old
and new value of the pte. Enable that by passing the old pte value as an arg.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/s390/include/asm/pgtable.h | 3 ++-
 arch/s390/mm/pgtable.c  | 2 +-
 arch/x86/include/asm/paravirt.h | 2 +-
 fs/proc/task_mmu.c  | 8 +---
 include/asm-generic/pgtable.h   | 2 +-
 mm/memory.c | 8 
 mm/mprotect.c   | 6 +++---
 7 files changed, 17 insertions(+), 14 deletions(-)

diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 5d730199e37b..76dc344edb8c 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -1070,7 +1070,8 @@ static inline pte_t ptep_get_and_clear(struct mm_struct 
*mm,
 
 #define __HAVE_ARCH_PTEP_MODIFY_PROT_TRANSACTION
 pte_t ptep_modify_prot_start(struct vm_area_struct *, unsigned long, pte_t *);
-void ptep_modify_prot_commit(struct vm_area_struct *, unsigned long, pte_t *, 
pte_t);
+void ptep_modify_prot_commit(struct vm_area_struct *, unsigned long,
+pte_t *, pte_t, pte_t);
 
 #define __HAVE_ARCH_PTEP_CLEAR_FLUSH
 static inline pte_t ptep_clear_flush(struct vm_area_struct *vma,
diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index 29c0a21cd34a..b283b92722cc 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -322,7 +322,7 @@ pte_t ptep_modify_prot_start(struct vm_area_struct *vma, 
unsigned long addr,
 EXPORT_SYMBOL(ptep_modify_prot_start);
 
 void ptep_modify_prot_commit(struct vm_area_struct *vma, unsigned long addr,
-pte_t *ptep, pte_t pte)
+pte_t *ptep, pte_t old_pte, pte_t pte)
 {
pgste_t pgste;
struct mm_struct *mm = vma->vm_mm;
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index a1d0ee5c5c51..28152236a65b 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -428,7 +428,7 @@ static inline pte_t ptep_modify_prot_start(struct 
vm_area_struct *vma, unsigned
 }
 
 static inline void ptep_modify_prot_commit(struct vm_area_struct *vma, 
unsigned long addr,
-  pte_t *ptep, pte_t pte)
+  pte_t *ptep, pte_t old_pte, pte_t 
pte)
 {
 
if (sizeof(pteval_t) > sizeof(long))
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 9952d7185170..8d62891d38a8 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -940,10 +940,12 @@ static inline void clear_soft_dirty(struct vm_area_struct 
*vma,
pte_t ptent = *pte;
 
if (pte_present(ptent)) {
-   ptent = ptep_modify_prot_start(vma, addr, pte);
-   ptent = pte_wrprotect(ptent);
+   pte_t old_pte;
+
+   old_pte = ptep_modify_prot_start(vma, addr, pte);
+   ptent = pte_wrprotect(old_pte);
ptent = pte_clear_soft_dirty(ptent);
-   ptep_modify_prot_commit(vma, addr, pte, ptent);
+   ptep_modify_prot_commit(vma, addr, pte, old_pte, ptent);
} else if (is_swap_pte(ptent)) {
ptent = pte_swp_clear_soft_dirty(ptent);
set_pte_at(vma->vm_mm, addr, pte, ptent);
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index d28683ada357..3ff8b1c3f003 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -619,7 +619,7 @@ static inline pte_t ptep_modify_prot_start(struct 
vm_area_struct *vma,
  */
 static inline void ptep_modify_prot_commit(struct vm_area_struct *vma,
   unsigned long addr,
-  pte_t *ptep, pte_t pte)
+  pte_t *ptep, pte_t old_pte, pte_t 
pte)
 {
__ptep_modify_prot_commit(vma, addr, ptep, pte);
 }
diff --git a/mm/memory.c b/mm/memory.c
index d36b0eaa7862..4f3ddaedc764 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3568,7 +3568,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
int last_cpupid;
int target_nid;
bool migrated = false;
-   pte_t pte;
+   pte_t pte, old_pte;
bool was_writable = pte_savedwrite(vmf->orig_pte);
int flags = 0;
 
@@ -3588,12 +3588,12 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 * Make it present again, Depending on how arch implementes non
 * accessible ptes, some can allow access by kernel mode.
 */
-   pte = ptep_modify_prot_start(vma, vmf->address, vmf->pte);
-   pte = pte_modify(pte, vma->vm_page_prot);
+   old_pte = ptep_modify_prot_start(vma, vmf->address, vmf->pte);
+   pte = pte_modify(old_pte, vma->vm_page_prot);
pte = pte_mkyoung(pte);
if (was_writable)
pte = pte_mkwrite(pte);
-   ptep_modify_prot_commit(vma, vmf->address, vmf->pte, pte);
+ 

[PATCH V4 1/5] mm: Update ptep_modify_prot_start/commit to take vm_area_struct as arg

2018-12-18 Thread Aneesh Kumar K.V
Some architectures may want to call flush_tlb_range from these helpers.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/s390/include/asm/pgtable.h   |  4 ++--
 arch/s390/mm/pgtable.c|  6 --
 arch/x86/include/asm/paravirt.h   | 11 ++-
 arch/x86/include/asm/paravirt_types.h |  5 +++--
 arch/x86/xen/mmu.h|  4 ++--
 arch/x86/xen/mmu_pv.c |  8 
 fs/proc/task_mmu.c|  4 ++--
 include/asm-generic/pgtable.h | 16 
 mm/memory.c   |  4 ++--
 mm/mprotect.c |  4 ++--
 10 files changed, 35 insertions(+), 31 deletions(-)

diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 063732414dfb..5d730199e37b 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -1069,8 +1069,8 @@ static inline pte_t ptep_get_and_clear(struct mm_struct 
*mm,
 }
 
 #define __HAVE_ARCH_PTEP_MODIFY_PROT_TRANSACTION
-pte_t ptep_modify_prot_start(struct mm_struct *, unsigned long, pte_t *);
-void ptep_modify_prot_commit(struct mm_struct *, unsigned long, pte_t *, 
pte_t);
+pte_t ptep_modify_prot_start(struct vm_area_struct *, unsigned long, pte_t *);
+void ptep_modify_prot_commit(struct vm_area_struct *, unsigned long, pte_t *, 
pte_t);
 
 #define __HAVE_ARCH_PTEP_CLEAR_FLUSH
 static inline pte_t ptep_clear_flush(struct vm_area_struct *vma,
diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index f2cc7da473e4..29c0a21cd34a 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -301,12 +301,13 @@ pte_t ptep_xchg_lazy(struct mm_struct *mm, unsigned long 
addr,
 }
 EXPORT_SYMBOL(ptep_xchg_lazy);
 
-pte_t ptep_modify_prot_start(struct mm_struct *mm, unsigned long addr,
+pte_t ptep_modify_prot_start(struct vm_area_struct *vma, unsigned long addr,
 pte_t *ptep)
 {
pgste_t pgste;
pte_t old;
int nodat;
+   struct mm_struct *mm = vma->vm_mm;
 
preempt_disable();
pgste = ptep_xchg_start(mm, addr, ptep);
@@ -320,10 +321,11 @@ pte_t ptep_modify_prot_start(struct mm_struct *mm, 
unsigned long addr,
 }
 EXPORT_SYMBOL(ptep_modify_prot_start);
 
-void ptep_modify_prot_commit(struct mm_struct *mm, unsigned long addr,
+void ptep_modify_prot_commit(struct vm_area_struct *vma, unsigned long addr,
 pte_t *ptep, pte_t pte)
 {
pgste_t pgste;
+   struct mm_struct *mm = vma->vm_mm;
 
if (!MACHINE_HAS_NX)
pte_val(pte) &= ~_PAGE_NOEXEC;
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 4bf42f9e4eea..a1d0ee5c5c51 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -417,25 +417,26 @@ static inline pgdval_t pgd_val(pgd_t pgd)
 }
 
 #define  __HAVE_ARCH_PTEP_MODIFY_PROT_TRANSACTION
-static inline pte_t ptep_modify_prot_start(struct mm_struct *mm, unsigned long 
addr,
+static inline pte_t ptep_modify_prot_start(struct vm_area_struct *vma, 
unsigned long addr,
   pte_t *ptep)
 {
pteval_t ret;
 
-   ret = PVOP_CALL3(pteval_t, mmu.ptep_modify_prot_start, mm, addr, ptep);
+   ret = PVOP_CALL3(pteval_t, mmu.ptep_modify_prot_start, vma, addr, ptep);
 
return (pte_t) { .pte = ret };
 }
 
-static inline void ptep_modify_prot_commit(struct mm_struct *mm, unsigned long 
addr,
+static inline void ptep_modify_prot_commit(struct vm_area_struct *vma, 
unsigned long addr,
   pte_t *ptep, pte_t pte)
 {
+
if (sizeof(pteval_t) > sizeof(long))
/* 5 arg words */
-   pv_ops.mmu.ptep_modify_prot_commit(mm, addr, ptep, pte);
+   pv_ops.mmu.ptep_modify_prot_commit(vma, addr, ptep, pte);
else
PVOP_VCALL4(mmu.ptep_modify_prot_commit,
-   mm, addr, ptep, pte.pte);
+   vma, addr, ptep, pte.pte);
 }
 
 static inline void set_pte(pte_t *ptep, pte_t pte)
diff --git a/arch/x86/include/asm/paravirt_types.h 
b/arch/x86/include/asm/paravirt_types.h
index 26942ad63830..609a728ec809 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -55,6 +55,7 @@ struct task_struct;
 struct cpumask;
 struct flush_tlb_info;
 struct mmu_gather;
+struct vm_area_struct;
 
 /*
  * Wrapper type for pointers to code which uses the non-standard
@@ -254,9 +255,9 @@ struct pv_mmu_ops {
   pte_t *ptep, pte_t pteval);
void (*set_pmd)(pmd_t *pmdp, pmd_t pmdval);
 
-   pte_t (*ptep_modify_prot_start)(struct mm_struct *mm, unsigned long 
addr,
+   pte_t (*ptep_modify_prot_start)(struct vm_area_struct *vma, unsigned 
long addr,
pte_t *ptep);
-   void (*ptep_modify_prot_commit)(struct mm_struct *mm, unsigned long 
addr,
+   void 

[PATCH V4 0/5] NestMMU pte upgrade workaround for mprotect

2018-12-18 Thread Aneesh Kumar K.V
We can upgrade pte access (R -> RW transition) via mprotect. We need
to make sure we follow the recommended pte update sequence as outlined in
commit bd5050e38aec ("powerpc/mm/radix: Change pte relax sequence to handle 
nest MMU hang")
for such updates. This patch series does that.
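
In rough pseudo-C, the sequence the series implements for each pte being
changed looks like this (simplified from patches 1-5, not the literal
kernel code):

	old_pte = ptep_modify_prot_start(vma, addr, ptep);  /* clears _PAGE_PRESENT,
							       sets _PAGE_INVALID */
	pte = pte_modify(old_pte, newprot);
	/* flush before relaxing access, only for R->RW with a nest MMU user */
	if (is_pte_rw_upgrade(pte_val(old_pte), pte_val(pte)) &&
	    atomic_read(&vma->vm_mm->context.copros) > 0)
		radix__flush_tlb_page(vma, addr);
	ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);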

Changes from V3:
* Build fix for x86

Changes from V2:
* Update commit message for patch 4
* use radix tlb flush routines directly.

Changes from V1:
* Restrict this only for R->RW upgrade. We don't need to do this for Autonuma
* Restrict this only for radix translation mode.


Aneesh Kumar K.V (5):
  mm: Update ptep_modify_prot_start/commit to take vm_area_struct as arg
  mm: update ptep_modify_prot_commit to take old pte value as arg
  arch/powerpc/mm: Nest MMU workaround for mprotect RW upgrade.
  mm/hugetlb: Add prot_modify_start/commit sequence for hugetlb update
  arch/powerpc/mm/hugetlb: NestMMU workaround for hugetlb mprotect RW
upgrade

 arch/powerpc/include/asm/book3s/64/hugetlb.h | 12 +
 arch/powerpc/include/asm/book3s/64/pgtable.h | 18 +
 arch/powerpc/include/asm/book3s/64/radix.h   |  4 +++
 arch/powerpc/mm/hugetlbpage-hash64.c | 27 
 arch/powerpc/mm/hugetlbpage-radix.c  | 17 
 arch/powerpc/mm/pgtable-book3s64.c   | 27 
 arch/powerpc/mm/pgtable-radix.c  | 18 +
 arch/s390/include/asm/pgtable.h  |  5 ++--
 arch/s390/mm/pgtable.c   |  8 +++---
 arch/x86/include/asm/paravirt.h  | 13 +-
 arch/x86/include/asm/paravirt_types.h|  5 ++--
 arch/x86/xen/mmu.h   |  4 +--
 arch/x86/xen/mmu_pv.c|  8 +++---
 fs/proc/task_mmu.c   |  8 +++---
 include/asm-generic/pgtable.h| 18 ++---
 include/linux/hugetlb.h  | 20 +++
 mm/hugetlb.c |  8 +++---
 mm/memory.c  |  8 +++---
 mm/mprotect.c|  6 ++---
 19 files changed, 193 insertions(+), 41 deletions(-)

-- 
2.19.2



Re: [PATCH v1 03/13] powerpc/mm/32s: rework mmu_mapin_ram()

2018-12-18 Thread Christophe Leroy




On 12/18/2018 03:05 AM, Jonathan Neuschäfer wrote:

On Mon, Dec 17, 2018 at 10:29:18AM +0100, Christophe Leroy wrote:

With patches 1-3:
[    0.000000] setbat(0, c0000000, 00000000, 01000000, 311)
[    0.000000] setbat(2, c1000000, 01000000, 00800000, 311)
[    0.000000] setbat(4, d0000000, 10000000, 02000000, 791)


What we see is that BAT0 is not used originally. I have always wondered
why; maybe there is something odd behind it and BAT0 shall not be used.

Could you try and modify find_free_bat() so that it starts at b = 1 instead
of b = 0 ?


In this case, setbat is called with index 2, 3, and 4, but the Wii still
doesn't boot.


According to arch/powerpc/include/asm/book3s/32/hash.h,
   - 0x591 = _PAGE_RW | _PAGE_ACCESSED | _PAGE_DIRTY | _PAGE_COHERENT | _PAGE_PRESENT
   - 0x311 = _PAGE_EXEC | _PAGE_ACCESSED | _PAGE_COHERENT | _PAGE_PRESENT
   - 0x791 = _PAGE_RW | _PAGE_EXEC | _PAGE_ACCESSED | _PAGE_DIRTY | _PAGE_COHERENT | _PAGE_PRESENT



Yes, patch 1 added _PAGE_EXEC which explains this 0x200.
Do you confirm it still works well with only patch 1 ?


Patch 1 alone boots to userspace.



Ok, thanks for testing.

The only difference I see then is the flags. Everything else seems
identical.


I know you tried already, but would you mind trying once more with the 
following change ?


diff --git b/arch/powerpc/mm/ppc_mmu_32.c a/arch/powerpc/mm/ppc_mmu_32.c
index 61c10ee00ba2..628fba23 100644
--- b/arch/powerpc/mm/ppc_mmu_32.c
+++ a/arch/powerpc/mm/ppc_mmu_32.c
@@ -119,7 +119,7 @@ unsigned long __init mmu_mapin_ram(unsigned long 
base, unsigned long top)


if (size < 128 << 10)
break;
-   setbat(idx, PAGE_OFFSET + base, base, size, PAGE_KERNEL_TEXT);
+   setbat(idx, PAGE_OFFSET + base, base, size, PAGE_KERNEL_X);
base += size;
}

I think we may have some code trying to modify the kernel text without 
using code patching functions.


Thanks,
Christophe




Jonathan