Re: [PATCH v3 08/13] mm/debug_vm_pgtable/thp: Use page table deposit/withdraw with THP

2020-08-31 Thread Christophe Leroy




On 01/09/2020 at 08:25, Aneesh Kumar K.V wrote:

On 9/1/20 8:52 AM, Anshuman Khandual wrote:




There is a checkpatch.pl warning here.

WARNING: Possible unwrapped commit description (prefer a maximum 75 
chars per line)

#7:
Architectures like ppc64 use deposited page table while updating the 
huge pte


total: 0 errors, 1 warnings, 40 lines checked



I will ignore all these, because they are not really important IMHO.



When doing a git log in an 80-character terminal window, having wrapped
lines is not really convenient. It should be easy to avoid that.


Christophe


Re: [PATCH v3 09/13] mm/debug_vm_pgtable/locks: Move non page table modifying test together

2020-08-31 Thread Aneesh Kumar K.V

On 9/1/20 9:11 AM, Anshuman Khandual wrote:



On 08/27/2020 01:34 PM, Aneesh Kumar K.V wrote:

This will help in adding proper locks in a later patch


It really makes sense to classify these tests here as static and dynamic.
Static tests are the ones that only operate on page table entry values,
without changing anything in the actual page table itself, whereas the
dynamic tests do change the page table. Static tests will probably never
require any lock synchronization, but the dynamic ones will.
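For illustration, a minimal sketch of the two classes (made-up helper names,
not from the series): the static one only manipulates a pte value, while the
dynamic one writes through the page table pointer and therefore needs the
appropriate lock held around it.

static void __init pte_static_example(unsigned long pfn, pgprot_t prot)
{
	/* Builds and checks an entry value, never touches the page table */
	pte_t pte = pfn_pte(pfn, prot);

	WARN_ON(!pte_write(pte_mkwrite(pte)));
}

static void __init pte_dynamic_example(struct mm_struct *mm, pte_t *ptep,
				       unsigned long vaddr, unsigned long pfn,
				       pgprot_t prot)
{
	/* Writes the page table itself, so the pte lock must be held */
	set_pte_at(mm, vaddr, ptep, pfn_pte(pfn, prot));
	pte_clear(mm, vaddr, ptep);
	WARN_ON(!pte_none(ptep_get(ptep)));
}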



Signed-off-by: Aneesh Kumar K.V 
---
  mm/debug_vm_pgtable.c | 52 ---
  1 file changed, 29 insertions(+), 23 deletions(-)

diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
index 0ce5c6a24c5b..78c8af3445ac 100644
--- a/mm/debug_vm_pgtable.c
+++ b/mm/debug_vm_pgtable.c
@@ -992,7 +992,7 @@ static int __init debug_vm_pgtable(void)
p4dp = p4d_alloc(mm, pgdp, vaddr);
pudp = pud_alloc(mm, p4dp, vaddr);
pmdp = pmd_alloc(mm, pudp, vaddr);
-   ptep = pte_alloc_map_lock(mm, pmdp, vaddr, &ptl);
+   ptep = pte_alloc_map(mm, pmdp, vaddr);
  
  	/*

 * Save all the page table page addresses as the page table
@@ -1012,33 +1012,12 @@ static int __init debug_vm_pgtable(void)
p4d_basic_tests(p4d_aligned, prot);
pgd_basic_tests(pgd_aligned, prot);
  
-	pte_clear_tests(mm, ptep, vaddr);

-   pmd_clear_tests(mm, pmdp);
-   pud_clear_tests(mm, pudp);
-   p4d_clear_tests(mm, p4dp);
-   pgd_clear_tests(mm, pgdp);
-
-   pte_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot);
-   pmd_advanced_tests(mm, vma, pmdp, pmd_aligned, vaddr, prot, saved_ptep);
-   pud_advanced_tests(mm, vma, pudp, pud_aligned, vaddr, prot);
-   hugetlb_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot);
-
pmd_leaf_tests(pmd_aligned, prot);
pud_leaf_tests(pud_aligned, prot);
  
-	pmd_huge_tests(pmdp, pmd_aligned, prot);

-   pud_huge_tests(pudp, pud_aligned, prot);
-
pte_savedwrite_tests(pte_aligned, protnone);
pmd_savedwrite_tests(pmd_aligned, protnone);
  
-	pte_unmap_unlock(ptep, ptl);

-
-   pmd_populate_tests(mm, pmdp, saved_ptep);
-   pud_populate_tests(mm, pudp, saved_pmdp);
-   p4d_populate_tests(mm, p4dp, saved_pudp);
-   pgd_populate_tests(mm, pgdp, saved_p4dp);
-
pte_special_tests(pte_aligned, prot);
pte_protnone_tests(pte_aligned, protnone);
pmd_protnone_tests(pmd_aligned, protnone);
@@ -1056,11 +1035,38 @@ static int __init debug_vm_pgtable(void)
pmd_swap_tests(pmd_aligned, prot);
  
  	swap_migration_tests();

-   hugetlb_basic_tests(pte_aligned, prot);
  
  	pmd_thp_tests(pmd_aligned, prot);

pud_thp_tests(pud_aligned, prot);
  
+	/*

+* Page table modifying tests
+*/


In this comment, please do add some more details about how these tests
need a proper locking mechanism, unlike the previous static ones. Also add
a similar comment section for the static tests that don't really change the
page table and hence do not require a corresponding locking mechanism. Both
comment sections will make this classification clear.
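For illustration, the two requested comment sections could read roughly as
follows (the wording is only a suggestion):

	/*
	 * Static tests: these only build and check page table entry values;
	 * the page table itself is never written, so no locking is needed.
	 */

	/*
	 * Dynamic (page table modifying) tests: these write through the page
	 * table pointers, so each group below must run with the page table
	 * lock appropriate for its level held.
	 */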


+   pte_clear_tests(mm, ptep, vaddr);
+   pmd_clear_tests(mm, pmdp);
+   pud_clear_tests(mm, pudp);
+   p4d_clear_tests(mm, p4dp);
+   pgd_clear_tests(mm, pgdp);
+
+   ptep = pte_alloc_map_lock(mm, pmdp, vaddr, &ptl);
+   pte_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot);
+   pmd_advanced_tests(mm, vma, pmdp, pmd_aligned, vaddr, prot, saved_ptep);
+   pud_advanced_tests(mm, vma, pudp, pud_aligned, vaddr, prot);
+   hugetlb_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot);
+
+
+   pmd_huge_tests(pmdp, pmd_aligned, prot);
+   pud_huge_tests(pudp, pud_aligned, prot);
+
+   pte_unmap_unlock(ptep, ptl);
+
+   pmd_populate_tests(mm, pmdp, saved_ptep);
+   pud_populate_tests(mm, pudp, saved_pmdp);
+   p4d_populate_tests(mm, p4dp, saved_pudp);
+   pgd_populate_tests(mm, pgdp, saved_p4dp);
+
+   hugetlb_basic_tests(pte_aligned, prot);


hugetlb_basic_tests() is a non page table modifying static test and
should be classified accordingly.


+
p4d_free(mm, saved_p4dp);
pud_free(mm, saved_pudp);
pmd_free(mm, saved_pmdp);



Changes up to this patch fail to boot on the arm64 platform. This should be
folded into the next patch.

[   17.080644] [ cut here ]
[   17.081342] kernel BUG at mm/pgtable-generic.c:164!
[   17.082091] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
[   17.082977] Modules linked in:
[   17.083481] CPU: 79 PID: 1 Comm: swapper/0 Tainted: GW 
5.9.0-rc2-00105-g740e72cd6717 #14
[   17.084998] Hardware name: linux,dummy-virt (DT)
[   17.085745] pstate: 6045 (nZCv daif +PAN -UAO BTYPE=--)
[   17.086645] pc : pgtable_trans_huge_deposit+0x58/0x60
[   17.087462] lr : debug_vm_pgtable+0x4dc/0x8f0
[   17.088168] sp : 80001219bcf0
[   17.08

Re: [PATCH v3 13/13] mm/debug_vm_pgtable: populate a pte entry before fetching it

2020-08-31 Thread Aneesh Kumar K.V

On 9/1/20 8:55 AM, Anshuman Khandual wrote:



On 08/27/2020 01:34 PM, Aneesh Kumar K.V wrote:

pte_clear_tests operate on an existing pte entry. Make sure that is not a none
pte entry.

Signed-off-by: Aneesh Kumar K.V 
---
  mm/debug_vm_pgtable.c | 6 --
  1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
index 21329c7d672f..8527ebb75f2c 100644
--- a/mm/debug_vm_pgtable.c
+++ b/mm/debug_vm_pgtable.c
@@ -546,7 +546,7 @@ static void __init pgd_populate_tests(struct mm_struct *mm, 
pgd_t *pgdp,
  static void __init pte_clear_tests(struct mm_struct *mm, pte_t *ptep,
   unsigned long vaddr)
  {
-   pte_t pte = ptep_get(ptep);
+   pte_t pte =  ptep_get_and_clear(mm, vaddr, ptep);


Seems like ptep_get_and_clear() here just clears the entry in preparation
for a following set_pte_at(), which otherwise would have been a problem on
ppc64 as you had pointed out earlier, i.e. set_pte_at() should not update an
existing valid entry. So the commit message here is a bit misleading.



and it also fetches the pte value, which is used further down.


  
  	pr_debug("Validating PTE clear\n");

pte = __pte(pte_val(pte) | RANDOM_ORVALUE);
@@ -944,7 +944,7 @@ static int __init debug_vm_pgtable(void)
p4d_t *p4dp, *saved_p4dp;
pud_t *pudp, *saved_pudp;
pmd_t *pmdp, *saved_pmdp, pmd;
-   pte_t *ptep;
+   pte_t *ptep, pte;
pgtable_t saved_ptep;
pgprot_t prot, protnone;
phys_addr_t paddr;
@@ -1049,6 +1049,8 @@ static int __init debug_vm_pgtable(void)
 */
  
  	ptep = pte_alloc_map_lock(mm, pmdp, vaddr, &ptl);

+   pte = pfn_pte(pte_aligned, prot);
+   set_pte_at(mm, vaddr, ptep, pte);


Not here; creating and populating an entry must be done in the respective
test functions themselves. Besides, this seems a bit redundant as well. The
test pte_clear_tests(), with the above change added, already

- Clears the PTEP entry with ptep_get_and_clear()


and fetches the old value set previously.


- Creates and populates the PTEP with set_pte_at()
- Clears with pte_clear()
- Checks for pte_none()

If the objective is to clear the PTEP entry before calling set_pte_at(),
then only the first chunk is required and it should also be merged with
a previous patch.

[PATCH v3 07/13] mm/debug_vm_pgtable/set_pte/pmd/pud: Don't use set_*_at to 
update an existing pte entry
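For reference, a minimal sketch of a self-contained pte_clear_tests() along
the lines listed above (assuming pfn and prot are passed in; illustrative
only, not the actual patch):

static void __init pte_clear_tests(struct mm_struct *mm, pte_t *ptep,
				   unsigned long pfn, unsigned long vaddr,
				   pgprot_t prot)
{
	pte_t pte;

	pr_debug("Validating PTE clear\n");
	/* Start from a cleared slot so set_pte_at() never overwrites a valid entry */
	ptep_get_and_clear(mm, vaddr, ptep);
	pte = __pte(pte_val(pfn_pte(pfn, prot)) | RANDOM_ORVALUE);
	set_pte_at(mm, vaddr, ptep, pte);
	pte_clear(mm, vaddr, ptep);		/* the operation under test */
	pte = ptep_get(ptep);
	WARN_ON(!pte_none(pte));
}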



pte_clear_tests(mm, ptep, vaddr);
pte_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot);
pte_unmap_unlock(ptep, ptl);



There is a checkpatch.pl warning here.

WARNING: Possible unwrapped commit description (prefer a maximum 75 chars per 
line)
#7:
pte_clear_tests operate on an existing pte entry. Make sure that is not a none

total: 0 errors, 1 warnings, 24 lines checked

There is also a build failure on x86 reported by the kernel test robot.





Re: [PATCH v3 12/13] mm/debug_vm_pgtable/hugetlb: Disable hugetlb test on ppc64

2020-08-31 Thread Aneesh Kumar K.V

On 9/1/20 9:33 AM, Anshuman Khandual wrote:



On 08/27/2020 01:34 PM, Aneesh Kumar K.V wrote:

The test seems to be missing quite a lot of details w.r.t. allocating
the correct pgtable_t page (huge_pte_alloc()), holding the right
lock (huge_pte_lock()) etc. The vma used is also not a hugetlb VMA.

ppc64 does have runtime checks within CONFIG_DEBUG_VM for most of these.
Hence disable the test on ppc64.


Would really like this to get resolved in a uniform and better way
instead, i.e. a modified hugetlb_advanced_tests() which works on all
platforms including ppc64.

In the absence of a modified version, I do realize the situation here,
where the DEBUG_VM_PGTABLE test either skips this on ppc64 or
hugetlb_advanced_tests() gets completely removed from the other platforms
as well.



Signed-off-by: Aneesh Kumar K.V 
---
  mm/debug_vm_pgtable.c | 4 ++++
  1 file changed, 4 insertions(+)

diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
index a188b6e4e37e..21329c7d672f 100644
--- a/mm/debug_vm_pgtable.c
+++ b/mm/debug_vm_pgtable.c
@@ -813,6 +813,7 @@ static void __init hugetlb_basic_tests(unsigned long pfn, 
pgprot_t prot)
  #endif /* CONFIG_ARCH_WANT_GENERAL_HUGETLB */
  }
  
+#ifndef CONFIG_PPC_BOOK3S_64

  static void __init hugetlb_advanced_tests(struct mm_struct *mm,
  struct vm_area_struct *vma,
  pte_t *ptep, unsigned long pfn,
@@ -855,6 +856,7 @@ static void __init hugetlb_advanced_tests(struct mm_struct 
*mm,
pte = huge_ptep_get(ptep);
WARN_ON(!(huge_pte_write(pte) && huge_pte_dirty(pte)));
  }
+#endif


In the worst case if we could not get a new hugetlb_advanced_tests() test
that works on all platforms, this might be the last fallback option. In
which case, it will require a proper comment section with a "FIXME: ",
explaining the current situation (and that #ifdef is temporary in nature)
and a hugetlb_advanced_tests() stub when CONFIG_PPC_BOOK3S_64 is enabled.
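For illustration, the suggested stub arrangement could look roughly like this
(sketch only, argument list taken from the existing function):

#ifdef CONFIG_PPC_BOOK3S_64
/*
 * FIXME: hugetlb_advanced_tests() currently uses the wrong pgtable_t page,
 * the wrong lock and a non-hugetlb VMA, which trips the ppc64 book3s
 * runtime checks. Stub it out here until a version that works on all
 * platforms is available; this #ifdef is temporary in nature.
 */
static void __init hugetlb_advanced_tests(struct mm_struct *mm,
					  struct vm_area_struct *vma,
					  pte_t *ptep, unsigned long pfn,
					  unsigned long vaddr, pgprot_t prot)
{
}
#else
/* ... existing hugetlb_advanced_tests() implementation ... */
#endif /* CONFIG_PPC_BOOK3S_64 */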


  #else  /* !CONFIG_HUGETLB_PAGE */
  static void __init hugetlb_basic_tests(unsigned long pfn, pgprot_t prot) { }
  static void __init hugetlb_advanced_tests(struct mm_struct *mm,
@@ -1065,7 +1067,9 @@ static int __init debug_vm_pgtable(void)
pud_populate_tests(mm, pudp, saved_pmdp);
spin_unlock(ptl);
  
+#ifndef CONFIG_PPC_BOOK3S_64

hugetlb_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot);
+#endif




I actually wanted to add #ifdef BROKEN. That test is completely broken.
In fact, I would suggest removing that test completely.





#ifdef will not be required here as there would be a stub definition
for hugetlb_advanced_tests() when CONFIG_PPC_BOOK3S_64 is enabled.

  
  	spin_lock(&mm->page_table_lock);

p4d_clear_tests(mm, p4dp);



But again, we should really try and avoid taking this path.



To be frank, I am kind of frustrated with how this patch series is being
looked at. We pushed a completely broken test upstream, and right now we
have code upstream that crashes when booted on ppc64. My attempt has been
to make progress here, and you definitely seem to not be in agreement with
that.


At this point I am tempted to suggest we remove the DEBUG_VM_PGTABLE 
support on ppc64 because AFAIU it doesn't add any value.



-aneesh


Re: [PATCH v3 08/13] mm/debug_vm_pgtable/thp: Use page table deposit/withdraw with THP

2020-08-31 Thread Aneesh Kumar K.V

On 9/1/20 8:52 AM, Anshuman Khandual wrote:



On 08/27/2020 01:34 PM, Aneesh Kumar K.V wrote:

Architectures like ppc64 use deposited page table while updating the huge pte
entries.

Signed-off-by: Aneesh Kumar K.V 
---
  mm/debug_vm_pgtable.c | 10 +++---
  1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
index f9f6358899a8..0ce5c6a24c5b 100644
--- a/mm/debug_vm_pgtable.c
+++ b/mm/debug_vm_pgtable.c
@@ -154,7 +154,7 @@ static void __init pmd_basic_tests(unsigned long pfn, 
pgprot_t prot)
  static void __init pmd_advanced_tests(struct mm_struct *mm,
  struct vm_area_struct *vma, pmd_t *pmdp,
  unsigned long pfn, unsigned long vaddr,
- pgprot_t prot)
+ pgprot_t prot, pgtable_t pgtable)
  {
pmd_t pmd;
  
@@ -165,6 +165,8 @@ static void __init pmd_advanced_tests(struct mm_struct *mm,

/* Align the address wrt HPAGE_PMD_SIZE */
vaddr = (vaddr & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE;
  
+	pgtable_trans_huge_deposit(mm, pmdp, pgtable);

+
pmd = pmd_mkhuge(pfn_pmd(pfn, prot));
set_pmd_at(mm, vaddr, pmdp, pmd);
pmdp_set_wrprotect(mm, vaddr, pmdp);
@@ -193,6 +195,8 @@ static void __init pmd_advanced_tests(struct mm_struct *mm,
pmdp_test_and_clear_young(vma, vaddr, pmdp);
pmd = READ_ONCE(*pmdp);
WARN_ON(pmd_young(pmd));
+
+   pgtable = pgtable_trans_huge_withdraw(mm, pmdp);


Should the call sites here be wrapped with __HAVE_ARCH_PGTABLE_DEPOSIT and
__HAVE_ARCH_PGTABLE_WITHDRAW respectively. Though there are generic fallback
definitions, wondering whether they are indeed essential for all platforms.



No, because any page table helper operating on a PMD-level THP entry can
expect a deposited page table.


When __HAVE_ARCH_PGTABLE_DEPOSIT is not defined, the generic version is used as the fallback.
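For reference, the header pattern behind that argument, roughly as it appears
in include/linux/pgtable.h (the generic versions live in mm/pgtable-generic.c):

#ifndef __HAVE_ARCH_PGTABLE_DEPOSIT
extern void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
				       pgtable_t pgtable);
#endif

#ifndef __HAVE_ARCH_PGTABLE_WITHDRAW
extern pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp);
#endif

Either way every platform ends up with a definition, so the call sites in the
test do not need to be wrapped in #ifdefs.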


  }
  
  static void __init pmd_leaf_tests(unsigned long pfn, pgprot_t prot)

@@ -373,7 +377,7 @@ static void __init pud_basic_tests(unsigned long pfn, 
pgprot_t prot) { }
  static void __init pmd_advanced_tests(struct mm_struct *mm,
  struct vm_area_struct *vma, pmd_t *pmdp,
  unsigned long pfn, unsigned long vaddr,
- pgprot_t prot)
+ pgprot_t prot, pgtable_t pgtable)
  {
  }
  static void __init pud_advanced_tests(struct mm_struct *mm,
@@ -1015,7 +1019,7 @@ static int __init debug_vm_pgtable(void)
pgd_clear_tests(mm, pgdp);
  
  	pte_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot);

-   pmd_advanced_tests(mm, vma, pmdp, pmd_aligned, vaddr, prot);
+   pmd_advanced_tests(mm, vma, pmdp, pmd_aligned, vaddr, prot, saved_ptep);
pud_advanced_tests(mm, vma, pudp, pud_aligned, vaddr, prot);
hugetlb_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot);
  



There is a checkpatch.pl warning here.

WARNING: Possible unwrapped commit description (prefer a maximum 75 chars per 
line)
#7:
Architectures like ppc64 use deposited page table while updating the huge pte

total: 0 errors, 1 warnings, 40 lines checked



I will ignore all these, because they are not really important IMHO.

-aneesh


Re: [PATCH v3 06/13] mm/debug_vm_pgtable/THP: Mark the pte entry huge before using set_pmd/pud_at

2020-08-31 Thread Aneesh Kumar K.V

On 9/1/20 8:51 AM, Anshuman Khandual wrote:



On 08/27/2020 01:34 PM, Aneesh Kumar K.V wrote:

kernel expects entries to be marked huge before we use 
set_pmd_at()/set_pud_at().

Signed-off-by: Aneesh Kumar K.V 
---
  mm/debug_vm_pgtable.c | 21 -
  1 file changed, 12 insertions(+), 9 deletions(-)

diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
index 5c0680836fe9..de83a20c1b30 100644
--- a/mm/debug_vm_pgtable.c
+++ b/mm/debug_vm_pgtable.c
@@ -155,7 +155,7 @@ static void __init pmd_advanced_tests(struct mm_struct *mm,
  unsigned long pfn, unsigned long vaddr,
  pgprot_t prot)
  {
-   pmd_t pmd = pfn_pmd(pfn, prot);
+   pmd_t pmd;
  
  	if (!has_transparent_hugepage())

return;
@@ -164,19 +164,19 @@ static void __init pmd_advanced_tests(struct mm_struct 
*mm,
/* Align the address wrt HPAGE_PMD_SIZE */
vaddr = (vaddr & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE;
  
-	pmd = pfn_pmd(pfn, prot);

+   pmd = pmd_mkhuge(pfn_pmd(pfn, prot));
set_pmd_at(mm, vaddr, pmdp, pmd);
pmdp_set_wrprotect(mm, vaddr, pmdp);
pmd = READ_ONCE(*pmdp);
WARN_ON(pmd_write(pmd));
  
-	pmd = pfn_pmd(pfn, prot);

+   pmd = pmd_mkhuge(pfn_pmd(pfn, prot));
set_pmd_at(mm, vaddr, pmdp, pmd);
pmdp_huge_get_and_clear(mm, vaddr, pmdp);
pmd = READ_ONCE(*pmdp);
WARN_ON(!pmd_none(pmd));
  
-	pmd = pfn_pmd(pfn, prot);

+   pmd = pmd_mkhuge(pfn_pmd(pfn, prot));
pmd = pmd_wrprotect(pmd);
pmd = pmd_mkclean(pmd);
set_pmd_at(mm, vaddr, pmdp, pmd);
@@ -237,7 +237,7 @@ static void __init pmd_huge_tests(pmd_t *pmdp, unsigned 
long pfn, pgprot_t prot)
  
  static void __init pmd_savedwrite_tests(unsigned long pfn, pgprot_t prot)

  {
-   pmd_t pmd = pfn_pmd(pfn, prot);
+   pmd_t pmd = pmd_mkhuge(pfn_pmd(pfn, prot));


There is no set_pmd_at() in this particular test, why change ?



Because if you are building a hugepage entry you should use pmd_mkhuge().
That is what sets _PAGE_PTE with this series. We don't make pfn_pmd() set
_PAGE_PTE.
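A simplified sketch of what pmd_mkhuge() boils down to on ppc64 book3s with
this series (illustrative; the real definition carries more detail):

static inline pmd_t pmd_mkhuge(pmd_t pmd)
{
	/* Mark the entry as a (huge) pte so set_pmd_at() will accept it */
	return __pmd(pmd_val(pmd) | _PAGE_PTE);
}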



-aneesh


Re: [PATCH v3 03/13] mm/debug_vm_pgtable/ppc64: Avoid setting top bits in random value

2020-08-31 Thread Aneesh Kumar K.V

On 9/1/20 8:45 AM, Anshuman Khandual wrote:



On 08/27/2020 01:34 PM, Aneesh Kumar K.V wrote:

ppc64 use bit 62 to indicate a pte entry (_PAGE_PTE). Avoid setting that bit in
random value.

Signed-off-by: Aneesh Kumar K.V 
---
  mm/debug_vm_pgtable.c | 13 ++---
  1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
index 086309fb9b6f..bbf9df0e64c6 100644
--- a/mm/debug_vm_pgtable.c
+++ b/mm/debug_vm_pgtable.c
@@ -44,10 +44,17 @@
   * entry type. But these bits might affect the ability to clear entries with
   * pxx_clear() because of how dynamic page table folding works on s390. So
   * while loading up the entries do not change the lower 4 bits. It does not
- * have affect any other platform.
+ * have affect any other platform. Also avoid the 62nd bit on ppc64 that is
+ * used to mark a pte entry.
   */
-#define S390_MASK_BITS 4
-#define RANDOM_ORVALUE GENMASK(BITS_PER_LONG - 1, S390_MASK_BITS)
+#define S390_SKIP_MASK GENMASK(3, 0)
+#ifdef CONFIG_PPC_BOOK3S_64
+#define PPC64_SKIP_MASK	GENMASK(62, 62)
+#else
+#define PPC64_SKIP_MASK	0x0
+#endif


Please drop the #ifdef CONFIG_PPC_BOOK3S_64 here. We already accommodate skip
bits for a s390 platform requirement and can also do so for ppc64 as well. As
mentioned before, please avoid adding any platform specific constructs in the
test.




That is needed so that it can be built on 32-bit architectures. I did
face build errors with arch-linux.



+#define ARCH_SKIP_MASK (S390_SKIP_MASK | PPC64_SKIP_MASK)
+#define RANDOM_ORVALUE (GENMASK(BITS_PER_LONG - 1, 0) & ~ARCH_SKIP_MASK)
  #define RANDOM_NZVALUE	GENMASK(7, 0)


Please fix the alignments here. Feel free to consider following changes after
this patch.

diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
index 122416464e0f..f969031152bb 100644
--- a/mm/debug_vm_pgtable.c
+++ b/mm/debug_vm_pgtable.c
@@ -48,14 +48,11 @@
   * have affect any other platform. Also avoid the 62nd bit on ppc64 that is
   * used to mark a pte entry.
   */
-#define S390_SKIP_MASK GENMASK(3, 0)
-#ifdef CONFIG_PPC_BOOK3S_64
-#define PPC64_SKIP_MASK	GENMASK(62, 62)
-#else
-#define PPC64_SKIP_MASK	0x0
-#endif
-#define ARCH_SKIP_MASK (S390_SKIP_MASK | PPC64_SKIP_MASK)
-#define RANDOM_ORVALUE (GENMASK(BITS_PER_LONG - 1, 0) & ~ARCH_SKIP_MASK)
+#define S390_SKIP_MASK GENMASK(3, 0)
+#define PPC64_SKIP_MASK	GENMASK(62, 62)
+#define ARCH_SKIP_MASK (S390_SKIP_MASK | PPC64_SKIP_MASK)
+#define RANDOM_ORVALUE (GENMASK(BITS_PER_LONG - 1, 0) & ~ARCH_SKIP_MASK)
+
  #define RANDOM_NZVALUE GENMASK(7, 0)
  
  
  static void __init pte_basic_tests(unsigned long pfn, pgprot_t prot)
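As a quick sanity check of the combined mask, a standalone snippet (userspace,
illustrative only, not part of the patch) evaluating the same expression on a
64-bit build:

#include <stdio.h>

int main(void)
{
	unsigned long s390_skip  = 0xfUL;	/* GENMASK(3, 0)   */
	unsigned long ppc64_skip = 1UL << 62;	/* GENMASK(62, 62) */
	unsigned long orvalue    = ~0UL & ~(s390_skip | ppc64_skip);

	/* Prints 0xbffffffffffffff0: bits 3..0 and bit 62 are never set */
	printf("RANDOM_ORVALUE = %#lx\n", orvalue);
	return 0;
}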




Besides, there is also one checkpatch.pl warning here.

WARNING: Possible unwrapped commit description (prefer a maximum 75 chars per 
line)
#7:
ppc64 use bit 62 to indicate a pte entry (_PAGE_PTE). Avoid setting that bit in

total: 0 errors, 1 warnings, 20 lines checked




These warnings are not valid. They are mostly long lines (up to 100 chars),
or some details mentioned in the parentheses as above.


-aneesh


Re: fsl_espi errors on v5.7.15

2020-08-31 Thread Nicholas Piggin
Excerpts from Chris Packham's message of September 1, 2020 11:25 am:
> 
> On 1/09/20 12:33 am, Heiner Kallweit wrote:
>> On 30.08.2020 23:59, Chris Packham wrote:
>>> On 31/08/20 9:41 am, Heiner Kallweit wrote:
 On 30.08.2020 23:00, Chris Packham wrote:
> On 31/08/20 12:30 am, Nicholas Piggin wrote:
>> Excerpts from Chris Packham's message of August 28, 2020 8:07 am:
> 
>
>>> I've also now seen the RX FIFO not empty error on the T2080RDB
>>>
>>> fsl_espi ffe11.spi: Transfer done but SPIE_DON isn't set!
>>> fsl_espi ffe11.spi: Transfer done but SPIE_DON isn't set!
>>> fsl_espi ffe11.spi: Transfer done but SPIE_DON isn't set!
>>> fsl_espi ffe11.spi: Transfer done but SPIE_DON isn't set!
>>> fsl_espi ffe11.spi: Transfer done but rx/tx fifo's aren't empty!
>>> fsl_espi ffe11.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>
>>> With my current workaround of emptying the RX FIFO. It seems
>>> survivable. Interestingly it only ever seems to be 1 extra byte in 
>>> the
>>> RX FIFO and it seems to be after either a READ_SR or a READ_FSR.
>>>
>>> fsl_espi ffe11.spi: tx 70
>>> fsl_espi ffe11.spi: rx 03
>>> fsl_espi ffe11.spi: Extra RX 00
>>> fsl_espi ffe11.spi: Transfer done but SPIE_DON isn't set!
>>> fsl_espi ffe11.spi: Transfer done but rx/tx fifo's aren't empty!
>>> fsl_espi ffe11.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>> fsl_espi ffe11.spi: tx 05
>>> fsl_espi ffe11.spi: rx 00
>>> fsl_espi ffe11.spi: Extra RX 03
>>> fsl_espi ffe11.spi: Transfer done but SPIE_DON isn't set!
>>> fsl_espi ffe11.spi: Transfer done but rx/tx fifo's aren't empty!
>>> fsl_espi ffe11.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>> fsl_espi ffe11.spi: tx 05
>>> fsl_espi ffe11.spi: rx 00
>>> fsl_espi ffe11.spi: Extra RX 03
>>>
>>> From all the Micron SPI-NOR datasheets I've got access to it is
>>> possible to continually read the SR/FSR. But I've no idea why it
>>> happens some times and not others.
>> So I think I've got a reproduction and I think I've bisected the 
>> problem
>> to commit 3282a3da25bd ("powerpc/64: Implement soft interrupt replay 
>> in
>> C"). My day is just finishing now so I haven't applied too much 
>> scrutiny
>> to this result. Given the various rabbit holes I've been down on this
>> issue already I'd take this information with a good degree of 
>> skepticism.
>>
> OK, so an easy test should be to re-test with a 5.4 kernel.
> It doesn't have yet the change you're referring to, and the fsl-espi 
> driver
> is basically the same as in 5.7 (just two small changes in 5.7).
 There's 6cc0c16d82f88 and maybe also other interrupt related patches
 around this time that could affect book E, so it's good if that exact
 patch is confirmed.
>>> My confirmation is basically that I can induce the issue in a 5.4 kernel
>>> by cherry-picking 3282a3da25bd. I'm also able to "fix" the issue in
>>> 5.9-rc2 by reverting that one commit.
>>>
>>> In both cases it's not exactly a clean cherry-pick/revert so I also
>>> confirmed the bisection result by building at 3282a3da25bd (which sees
>>> the issue) and the commit just before (which does not).
>> Thanks for testing, that confirms it well.
>>
>> [snip patch]
>>
>>> I still saw the issue with this change applied. PPC_IRQ_SOFT_MASK_DEBUG
>>> didn't report anything (either with or without the change above).
>> Okay, it was a bit of a shot in the dark. I still can't see what
>> else has changed.
>>
>> What would cause this, a lost interrupt? A spurious interrupt? Or
>> higher interrupt latency?
>>
>> I don't think the patch should cause significantly worse latency,
>> (it's supposed to be a bit better if anything because it doesn't set
>> up the full interrupt frame). But it's possible.
> My working theory is that the SPI_DON indication is all about the TX
> direction and now that the interrupts are faster we're hitting an error
> because there is still RX activity going on. Heiner disagrees with my
> interpretation of the SPI_DON indication and the fact that it doesn't
> happen every time does throw doubt on it.
>
 It's right that the eSPI spec can be interpreted that SPI_DON refers to
 TX only. However this wouldn't really make sense, because also for RX
 we program the frame length, and therefore want to be notified once the
 full frame was received. Also practical experience shows that SPI_DON
 is set also after RX-only transfers.
 Typical SPI NOR use case is that you write

[PATCH v11] Fixup for "powerpc/vdso: Provide __kernel_clock_gettime64() on vdso32"

2020-08-31 Thread Christophe Leroy
Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/vdso/gettimeofday.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/include/asm/vdso/gettimeofday.h 
b/arch/powerpc/include/asm/vdso/gettimeofday.h
index 59a609a48b63..8da84722729b 100644
--- a/arch/powerpc/include/asm/vdso/gettimeofday.h
+++ b/arch/powerpc/include/asm/vdso/gettimeofday.h
@@ -186,6 +186,8 @@ int __c_kernel_clock_getres(clockid_t clock_id, struct 
__kernel_timespec *res,
 #else
 int __c_kernel_clock_gettime(clockid_t clock, struct old_timespec32 *ts,
 const struct vdso_data *vd);
+int __c_kernel_clock_gettime64(clockid_t clock, struct __kernel_timespec *ts,
+  const struct vdso_data *vd);
 int __c_kernel_clock_getres(clockid_t clock_id, struct old_timespec32 *res,
const struct vdso_data *vd);
 #endif
-- 
2.25.0



[Bug 209029] kernel 5.9-rc2 fails to boot on a PowerMac G5 11,2 - BUG: Kernel NULL pointer dereference on read at 0x00000020

2020-08-31 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=209029

Christophe Leroy (christophe.le...@csgroup.eu) changed:

   What|Removed |Added

 CC||christophe.le...@csgroup.eu

--- Comment #3 from Christophe Leroy (christophe.le...@csgroup.eu) ---
Did you try without CONFIG_DEBUG_VM_PGTABLE ?

If you want CONFIG_DEBUG_VM_PGTABLE, the following series aims at fixing it for
PPC64: https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=197961

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

Re: [PATCH v3 12/13] mm/debug_vm_pgtable/hugetlb: Disable hugetlb test on ppc64

2020-08-31 Thread Anshuman Khandual



On 08/27/2020 01:34 PM, Aneesh Kumar K.V wrote:
> The test seems to be missing quite a lot of details w.r.t. allocating
> the correct pgtable_t page (huge_pte_alloc()), holding the right
> lock (huge_pte_lock()) etc. The vma used is also not a hugetlb VMA.
> 
> ppc64 does have runtime checks within CONFIG_DEBUG_VM for most of these.
> Hence disable the test on ppc64.

Would really like this to get resolved in a uniform and better way
instead, i.e. a modified hugetlb_advanced_tests() which works on all
platforms including ppc64.

In the absence of a modified version, I do realize the situation here,
where the DEBUG_VM_PGTABLE test either skips this on ppc64 or
hugetlb_advanced_tests() gets completely removed from the other platforms
as well.

> 
> Signed-off-by: Aneesh Kumar K.V 
> ---
>  mm/debug_vm_pgtable.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
> index a188b6e4e37e..21329c7d672f 100644
> --- a/mm/debug_vm_pgtable.c
> +++ b/mm/debug_vm_pgtable.c
> @@ -813,6 +813,7 @@ static void __init hugetlb_basic_tests(unsigned long pfn, 
> pgprot_t prot)
>  #endif /* CONFIG_ARCH_WANT_GENERAL_HUGETLB */
>  }
>  
> +#ifndef CONFIG_PPC_BOOK3S_64
>  static void __init hugetlb_advanced_tests(struct mm_struct *mm,
> struct vm_area_struct *vma,
> pte_t *ptep, unsigned long pfn,
> @@ -855,6 +856,7 @@ static void __init hugetlb_advanced_tests(struct 
> mm_struct *mm,
>   pte = huge_ptep_get(ptep);
>   WARN_ON(!(huge_pte_write(pte) && huge_pte_dirty(pte)));
>  }
> +#endif

In the worst case if we could not get a new hugetlb_advanced_tests() test
that works on all platforms, this might be the last fallback option. In
which case, it will require a proper comment section with a "FIXME: ",
explaining the current situation (and that #ifdef is temporary in nature)
and a hugetlb_advanced_tests() stub when CONFIG_PPC_BOOK3S_64 is enabled.

>  #else  /* !CONFIG_HUGETLB_PAGE */
>  static void __init hugetlb_basic_tests(unsigned long pfn, pgprot_t prot) { }
>  static void __init hugetlb_advanced_tests(struct mm_struct *mm,
> @@ -1065,7 +1067,9 @@ static int __init debug_vm_pgtable(void)
>   pud_populate_tests(mm, pudp, saved_pmdp);
>   spin_unlock(ptl);
>  
> +#ifndef CONFIG_PPC_BOOK3S_64
>   hugetlb_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot);
> +#endif

#ifdef will not be required here as there would be a stub definition
for hugetlb_advanced_tests() when CONFIG_PPC_BOOK3S_64 is enabled.

>  
>   spin_lock(&mm->page_table_lock);
>   p4d_clear_tests(mm, p4dp);
> 

But again, we should really try and avoid taking this path.


Re: [PATCH v3 10/13] mm/debug_vm_pgtable/locks: Take correct page table lock

2020-08-31 Thread Anshuman Khandual



On 08/27/2020 01:34 PM, Aneesh Kumar K.V wrote:
> Make sure we call pte accessors with correct lock held.
> 
> Signed-off-by: Aneesh Kumar K.V 
> ---
>  mm/debug_vm_pgtable.c | 34 --
>  1 file changed, 20 insertions(+), 14 deletions(-)
> 
> diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
> index 78c8af3445ac..0a6e771ebd13 100644
> --- a/mm/debug_vm_pgtable.c
> +++ b/mm/debug_vm_pgtable.c
> @@ -1039,33 +1039,39 @@ static int __init debug_vm_pgtable(void)
>   pmd_thp_tests(pmd_aligned, prot);
>   pud_thp_tests(pud_aligned, prot);
>  
> + hugetlb_basic_tests(pte_aligned, prot);
> +

This moved back again. As pointed out earlier, there will be a bisect
problem and so this patch must be folded back with the previous one.

>   /*
>* Page table modifying tests
>*/
> - pte_clear_tests(mm, ptep, vaddr);
> - pmd_clear_tests(mm, pmdp);
> - pud_clear_tests(mm, pudp);
> - p4d_clear_tests(mm, p4dp);
> - pgd_clear_tests(mm, pgdp);
>  
>   ptep = pte_alloc_map_lock(mm, pmdp, vaddr, &ptl);
> + pte_clear_tests(mm, ptep, vaddr);
>   pte_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot);
> - pmd_advanced_tests(mm, vma, pmdp, pmd_aligned, vaddr, prot, saved_ptep);
> - pud_advanced_tests(mm, vma, pudp, pud_aligned, vaddr, prot);
> - hugetlb_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot);
> -
> + pte_unmap_unlock(ptep, ptl);
>  
> + ptl = pmd_lock(mm, pmdp);
> + pmd_clear_tests(mm, pmdp);
> + pmd_advanced_tests(mm, vma, pmdp, pmd_aligned, vaddr, prot, saved_ptep);
>   pmd_huge_tests(pmdp, pmd_aligned, prot);
> + pmd_populate_tests(mm, pmdp, saved_ptep);
> + spin_unlock(ptl);
> +
> + ptl = pud_lock(mm, pudp);
> + pud_clear_tests(mm, pudp);
> + pud_advanced_tests(mm, vma, pudp, pud_aligned, vaddr, prot);
>   pud_huge_tests(pudp, pud_aligned, prot);
> + pud_populate_tests(mm, pudp, saved_pmdp);
> + spin_unlock(ptl);
>  
> - pte_unmap_unlock(ptep, ptl);
> + hugetlb_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot);
>  
> - pmd_populate_tests(mm, pmdp, saved_ptep);
> - pud_populate_tests(mm, pudp, saved_pmdp);
> + spin_lock(&mm->page_table_lock);
> + p4d_clear_tests(mm, p4dp);
> + pgd_clear_tests(mm, pgdp);
>   p4d_populate_tests(mm, p4dp, saved_pudp);
>   pgd_populate_tests(mm, pgdp, saved_p4dp);
> -
> - hugetlb_basic_tests(pte_aligned, prot);
> + spin_unlock(&mm->page_table_lock);
>  
>   p4d_free(mm, saved_p4dp);
>   pud_free(mm, saved_pudp);
> 

Overall, grouping together dynamic tests at the various page table levels and
taking the corresponding lock makes sense. The commit message for the resultant
patch should include (a) the test classification, (b) the grouping, by function
for the static tests and by page table level for the dynamic tests, and (c) the
locking requirement for the dynamic tests at each level.
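For quick reference, the lock taken around each group of dynamic tests in the
diff above (summary of the proposed layout only):

/*
 *   pte level       : pte_alloc_map_lock() / pte_unmap_unlock()
 *   pmd level       : pmd_lock() / spin_unlock()
 *   pud level       : pud_lock() / spin_unlock()
 *   p4d / pgd levels: mm->page_table_lock
 */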


Re: [PATCH v3 09/13] mm/debug_vm_pgtable/locks: Move non page table modifying test together

2020-08-31 Thread Anshuman Khandual



On 08/27/2020 01:34 PM, Aneesh Kumar K.V wrote:
> This will help in adding proper locks in a later patch

It really makes sense to classify these tests here as static and dynamic.
Static tests are the ones that only operate on page table entry values,
without changing anything in the actual page table itself, whereas the
dynamic tests do change the page table. Static tests will probably never
require any lock synchronization, but the dynamic ones will.

> 
> Signed-off-by: Aneesh Kumar K.V 
> ---
>  mm/debug_vm_pgtable.c | 52 ---
>  1 file changed, 29 insertions(+), 23 deletions(-)
> 
> diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
> index 0ce5c6a24c5b..78c8af3445ac 100644
> --- a/mm/debug_vm_pgtable.c
> +++ b/mm/debug_vm_pgtable.c
> @@ -992,7 +992,7 @@ static int __init debug_vm_pgtable(void)
>   p4dp = p4d_alloc(mm, pgdp, vaddr);
>   pudp = pud_alloc(mm, p4dp, vaddr);
>   pmdp = pmd_alloc(mm, pudp, vaddr);
> - ptep = pte_alloc_map_lock(mm, pmdp, vaddr, &ptl);
> + ptep = pte_alloc_map(mm, pmdp, vaddr);
>  
>   /*
>* Save all the page table page addresses as the page table
> @@ -1012,33 +1012,12 @@ static int __init debug_vm_pgtable(void)
>   p4d_basic_tests(p4d_aligned, prot);
>   pgd_basic_tests(pgd_aligned, prot);
>  
> - pte_clear_tests(mm, ptep, vaddr);
> - pmd_clear_tests(mm, pmdp);
> - pud_clear_tests(mm, pudp);
> - p4d_clear_tests(mm, p4dp);
> - pgd_clear_tests(mm, pgdp);
> -
> - pte_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot);
> - pmd_advanced_tests(mm, vma, pmdp, pmd_aligned, vaddr, prot, saved_ptep);
> - pud_advanced_tests(mm, vma, pudp, pud_aligned, vaddr, prot);
> - hugetlb_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot);
> -
>   pmd_leaf_tests(pmd_aligned, prot);
>   pud_leaf_tests(pud_aligned, prot);
>  
> - pmd_huge_tests(pmdp, pmd_aligned, prot);
> - pud_huge_tests(pudp, pud_aligned, prot);
> -
>   pte_savedwrite_tests(pte_aligned, protnone);
>   pmd_savedwrite_tests(pmd_aligned, protnone);
>  
> - pte_unmap_unlock(ptep, ptl);
> -
> - pmd_populate_tests(mm, pmdp, saved_ptep);
> - pud_populate_tests(mm, pudp, saved_pmdp);
> - p4d_populate_tests(mm, p4dp, saved_pudp);
> - pgd_populate_tests(mm, pgdp, saved_p4dp);
> -
>   pte_special_tests(pte_aligned, prot);
>   pte_protnone_tests(pte_aligned, protnone);
>   pmd_protnone_tests(pmd_aligned, protnone);
> @@ -1056,11 +1035,38 @@ static int __init debug_vm_pgtable(void)
>   pmd_swap_tests(pmd_aligned, prot);
>  
>   swap_migration_tests();
> - hugetlb_basic_tests(pte_aligned, prot);
>  
>   pmd_thp_tests(pmd_aligned, prot);
>   pud_thp_tests(pud_aligned, prot);
>  
> + /*
> +  * Page table modifying tests
> +  */

In this comment, please do add some more details about how these tests
need a proper locking mechanism, unlike the previous static ones. Also add
a similar comment section for the static tests that don't really change the
page table and hence do not require a corresponding locking mechanism. Both
comment sections will make this classification clear.

> + pte_clear_tests(mm, ptep, vaddr);
> + pmd_clear_tests(mm, pmdp);
> + pud_clear_tests(mm, pudp);
> + p4d_clear_tests(mm, p4dp);
> + pgd_clear_tests(mm, pgdp);
> +
> + ptep = pte_alloc_map_lock(mm, pmdp, vaddr, &ptl);
> + pte_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot);
> + pmd_advanced_tests(mm, vma, pmdp, pmd_aligned, vaddr, prot, saved_ptep);
> + pud_advanced_tests(mm, vma, pudp, pud_aligned, vaddr, prot);
> + hugetlb_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot);
> +
> +
> + pmd_huge_tests(pmdp, pmd_aligned, prot);
> + pud_huge_tests(pudp, pud_aligned, prot);
> +
> + pte_unmap_unlock(ptep, ptl);
> +
> + pmd_populate_tests(mm, pmdp, saved_ptep);
> + pud_populate_tests(mm, pudp, saved_pmdp);
> + p4d_populate_tests(mm, p4dp, saved_pudp);
> + pgd_populate_tests(mm, pgdp, saved_p4dp);
> +
> + hugetlb_basic_tests(pte_aligned, prot);

hugetlb_basic_tests() is a non page table modifying static test and
should be classified accordingly.

> +
>   p4d_free(mm, saved_p4dp);
>   pud_free(mm, saved_pudp);
>   pmd_free(mm, saved_pmdp);
> 

Changes up to this patch fail to boot on the arm64 platform. This should be
folded into the next patch.

[   17.080644] [ cut here ]
[   17.081342] kernel BUG at mm/pgtable-generic.c:164!
[   17.082091] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
[   17.082977] Modules linked in:
[   17.083481] CPU: 79 PID: 1 Comm: swapper/0 Tainted: GW 
5.9.0-rc2-00105-g740e72cd6717 #14
[   17.084998] Hardware name: linux,dummy-virt (DT)
[   17.085745] pstate: 6045 (nZCv daif +PAN -UAO BTYPE=--)
[   17.086645] pc : pgtable_trans_huge_deposit+0x58/0x60
[   17.087462] lr : debug_vm_pgtable+0x4dc/0x8f0
[   17.0

Re: [PATCH v3 13/13] mm/debug_vm_pgtable: populate a pte entry before fetching it

2020-08-31 Thread Anshuman Khandual



On 08/27/2020 01:34 PM, Aneesh Kumar K.V wrote:
> pte_clear_tests operate on an existing pte entry. Make sure that is not a none
> pte entry.
> 
> Signed-off-by: Aneesh Kumar K.V 
> ---
>  mm/debug_vm_pgtable.c | 6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
> index 21329c7d672f..8527ebb75f2c 100644
> --- a/mm/debug_vm_pgtable.c
> +++ b/mm/debug_vm_pgtable.c
> @@ -546,7 +546,7 @@ static void __init pgd_populate_tests(struct mm_struct 
> *mm, pgd_t *pgdp,
>  static void __init pte_clear_tests(struct mm_struct *mm, pte_t *ptep,
>  unsigned long vaddr)
>  {
> - pte_t pte = ptep_get(ptep);
> + pte_t pte =  ptep_get_and_clear(mm, vaddr, ptep);

Seems like ptep_get_and_clear() here just clears the entry in preparation
for a following set_pte_at(), which otherwise would have been a problem on
ppc64 as you had pointed out earlier, i.e. set_pte_at() should not update an
existing valid entry. So the commit message here is a bit misleading.

>  
>   pr_debug("Validating PTE clear\n");
>   pte = __pte(pte_val(pte) | RANDOM_ORVALUE);
> @@ -944,7 +944,7 @@ static int __init debug_vm_pgtable(void)
>   p4d_t *p4dp, *saved_p4dp;
>   pud_t *pudp, *saved_pudp;
>   pmd_t *pmdp, *saved_pmdp, pmd;
> - pte_t *ptep;
> + pte_t *ptep, pte;
>   pgtable_t saved_ptep;
>   pgprot_t prot, protnone;
>   phys_addr_t paddr;
> @@ -1049,6 +1049,8 @@ static int __init debug_vm_pgtable(void)
>*/
>  
>   ptep = pte_alloc_map_lock(mm, pmdp, vaddr, &ptl);
> + pte = pfn_pte(pte_aligned, prot);
> + set_pte_at(mm, vaddr, ptep, pte);

Not here; creating and populating an entry must be done in the respective
test functions themselves. Besides, this seems a bit redundant as well. The
test pte_clear_tests(), with the above change added, already

- Clears the PTEP entry with ptep_get_and_clear()
- Creates and populates the PTEP with set_pte_at()
- Clears with pte_clear()
- Checks for pte_none()

If the objective is to clear the PTEP entry before calling set_pte_at(),
then only the first chunk is required and it should also be merged with
a previous patch. 

[PATCH v3 07/13] mm/debug_vm_pgtable/set_pte/pmd/pud: Don't use set_*_at to 
update an existing pte entry 


>   pte_clear_tests(mm, ptep, vaddr);
>   pte_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot);
>   pte_unmap_unlock(ptep, ptl);
> 

There is a checkpatch.pl warning here.

WARNING: Possible unwrapped commit description (prefer a maximum 75 chars per 
line)
#7: 
pte_clear_tests operate on an existing pte entry. Make sure that is not a none

total: 0 errors, 1 warnings, 24 lines checked

There is also a build failure on x86 reported by the kernel test robot.


Re: [PATCH v3 08/13] mm/debug_vm_pgtable/thp: Use page table deposit/withdraw with THP

2020-08-31 Thread Anshuman Khandual



On 08/27/2020 01:34 PM, Aneesh Kumar K.V wrote:
> Architectures like ppc64 use deposited page table while updating the huge pte
> entries.
> 
> Signed-off-by: Aneesh Kumar K.V 
> ---
>  mm/debug_vm_pgtable.c | 10 +++---
>  1 file changed, 7 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
> index f9f6358899a8..0ce5c6a24c5b 100644
> --- a/mm/debug_vm_pgtable.c
> +++ b/mm/debug_vm_pgtable.c
> @@ -154,7 +154,7 @@ static void __init pmd_basic_tests(unsigned long pfn, 
> pgprot_t prot)
>  static void __init pmd_advanced_tests(struct mm_struct *mm,
> struct vm_area_struct *vma, pmd_t *pmdp,
> unsigned long pfn, unsigned long vaddr,
> -   pgprot_t prot)
> +   pgprot_t prot, pgtable_t pgtable)
>  {
>   pmd_t pmd;
>  
> @@ -165,6 +165,8 @@ static void __init pmd_advanced_tests(struct mm_struct 
> *mm,
>   /* Align the address wrt HPAGE_PMD_SIZE */
>   vaddr = (vaddr & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE;
>  
> + pgtable_trans_huge_deposit(mm, pmdp, pgtable);
> +
>   pmd = pmd_mkhuge(pfn_pmd(pfn, prot));
>   set_pmd_at(mm, vaddr, pmdp, pmd);
>   pmdp_set_wrprotect(mm, vaddr, pmdp);
> @@ -193,6 +195,8 @@ static void __init pmd_advanced_tests(struct mm_struct 
> *mm,
>   pmdp_test_and_clear_young(vma, vaddr, pmdp);
>   pmd = READ_ONCE(*pmdp);
>   WARN_ON(pmd_young(pmd));
> +
> + pgtable = pgtable_trans_huge_withdraw(mm, pmdp);

Should the call sites here be wrapped with __HAVE_ARCH_PGTABLE_DEPOSIT and
__HAVE_ARCH_PGTABLE_WITHDRAW respectively. Though there are generic fallback
definitions, wondering whether they are indeed essential for all platforms.

>  }
>  
>  static void __init pmd_leaf_tests(unsigned long pfn, pgprot_t prot)
> @@ -373,7 +377,7 @@ static void __init pud_basic_tests(unsigned long pfn, 
> pgprot_t prot) { }
>  static void __init pmd_advanced_tests(struct mm_struct *mm,
> struct vm_area_struct *vma, pmd_t *pmdp,
> unsigned long pfn, unsigned long vaddr,
> -   pgprot_t prot)
> +   pgprot_t prot, pgtable_t pgtable)
>  {
>  }
>  static void __init pud_advanced_tests(struct mm_struct *mm,
> @@ -1015,7 +1019,7 @@ static int __init debug_vm_pgtable(void)
>   pgd_clear_tests(mm, pgdp);
>  
>   pte_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot);
> - pmd_advanced_tests(mm, vma, pmdp, pmd_aligned, vaddr, prot);
> + pmd_advanced_tests(mm, vma, pmdp, pmd_aligned, vaddr, prot, saved_ptep);
>   pud_advanced_tests(mm, vma, pudp, pud_aligned, vaddr, prot);
>   hugetlb_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot);
>  
> 

There is a checkpatch.pl warning here.

WARNING: Possible unwrapped commit description (prefer a maximum 75 chars per 
line)
#7: 
Architectures like ppc64 use deposited page table while updating the huge pte

total: 0 errors, 1 warnings, 40 lines checked


Re: [PATCH v3 06/13] mm/debug_vm_pgtable/THP: Mark the pte entry huge before using set_pmd/pud_at

2020-08-31 Thread Anshuman Khandual



On 08/27/2020 01:34 PM, Aneesh Kumar K.V wrote:
> kernel expects entries to be marked huge before we use 
> set_pmd_at()/set_pud_at().
> 
> Signed-off-by: Aneesh Kumar K.V 
> ---
>  mm/debug_vm_pgtable.c | 21 -
>  1 file changed, 12 insertions(+), 9 deletions(-)
> 
> diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
> index 5c0680836fe9..de83a20c1b30 100644
> --- a/mm/debug_vm_pgtable.c
> +++ b/mm/debug_vm_pgtable.c
> @@ -155,7 +155,7 @@ static void __init pmd_advanced_tests(struct mm_struct 
> *mm,
> unsigned long pfn, unsigned long vaddr,
> pgprot_t prot)
>  {
> - pmd_t pmd = pfn_pmd(pfn, prot);
> + pmd_t pmd;
>  
>   if (!has_transparent_hugepage())
>   return;
> @@ -164,19 +164,19 @@ static void __init pmd_advanced_tests(struct mm_struct 
> *mm,
>   /* Align the address wrt HPAGE_PMD_SIZE */
>   vaddr = (vaddr & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE;
>  
> - pmd = pfn_pmd(pfn, prot);
> + pmd = pmd_mkhuge(pfn_pmd(pfn, prot));
>   set_pmd_at(mm, vaddr, pmdp, pmd);
>   pmdp_set_wrprotect(mm, vaddr, pmdp);
>   pmd = READ_ONCE(*pmdp);
>   WARN_ON(pmd_write(pmd));
>  
> - pmd = pfn_pmd(pfn, prot);
> + pmd = pmd_mkhuge(pfn_pmd(pfn, prot));
>   set_pmd_at(mm, vaddr, pmdp, pmd);
>   pmdp_huge_get_and_clear(mm, vaddr, pmdp);
>   pmd = READ_ONCE(*pmdp);
>   WARN_ON(!pmd_none(pmd));
>  
> - pmd = pfn_pmd(pfn, prot);
> + pmd = pmd_mkhuge(pfn_pmd(pfn, prot));
>   pmd = pmd_wrprotect(pmd);
>   pmd = pmd_mkclean(pmd);
>   set_pmd_at(mm, vaddr, pmdp, pmd);
> @@ -237,7 +237,7 @@ static void __init pmd_huge_tests(pmd_t *pmdp, unsigned 
> long pfn, pgprot_t prot)
>  
>  static void __init pmd_savedwrite_tests(unsigned long pfn, pgprot_t prot)
>  {
> - pmd_t pmd = pfn_pmd(pfn, prot);
> + pmd_t pmd = pmd_mkhuge(pfn_pmd(pfn, prot));

There is no set_pmd_at() in this particular test, why change ?

>  
>   if (!IS_ENABLED(CONFIG_NUMA_BALANCING))
>   return;
> @@ -277,7 +277,7 @@ static void __init pud_advanced_tests(struct mm_struct 
> *mm,
> unsigned long pfn, unsigned long vaddr,
> pgprot_t prot)
>  {
> - pud_t pud = pfn_pud(pfn, prot);
> + pud_t pud;
>  
>   if (!has_transparent_hugepage())
>   return;
> @@ -286,25 +286,28 @@ static void __init pud_advanced_tests(struct mm_struct 
> *mm,
>   /* Align the address wrt HPAGE_PUD_SIZE */
>   vaddr = (vaddr & HPAGE_PUD_MASK) + HPAGE_PUD_SIZE;
>  
> + pud = pud_mkhuge(pfn_pud(pfn, prot));
>   set_pud_at(mm, vaddr, pudp, pud);
>   pudp_set_wrprotect(mm, vaddr, pudp);
>   pud = READ_ONCE(*pudp);
>   WARN_ON(pud_write(pud));
>  
>  #ifndef __PAGETABLE_PMD_FOLDED
> - pud = pfn_pud(pfn, prot);
> +

Please drop the extra line here.

> + pud = pud_mkhuge(pfn_pud(pfn, prot));
>   set_pud_at(mm, vaddr, pudp, pud);
>   pudp_huge_get_and_clear(mm, vaddr, pudp);
>   pud = READ_ONCE(*pudp);
>   WARN_ON(!pud_none(pud));
>  
> - pud = pfn_pud(pfn, prot);
> + pud = pud_mkhuge(pfn_pud(pfn, prot));
>   set_pud_at(mm, vaddr, pudp, pud);
>   pudp_huge_get_and_clear_full(mm, vaddr, pudp, 1);
>   pud = READ_ONCE(*pudp);
>   WARN_ON(!pud_none(pud));
>  #endif /* __PAGETABLE_PMD_FOLDED */
> - pud = pfn_pud(pfn, prot);
> +

Please drop the extra line here.

> + pud = pud_mkhuge(pfn_pud(pfn, prot));
>   pud = pud_wrprotect(pud);
>   pud = pud_mkclean(pud);
>   set_pud_at(mm, vaddr, pudp, pud);
>

There is a checkpatch.pl warning here.

WARNING: Possible unwrapped commit description (prefer a maximum 75 chars per 
line)
#7: 
kernel expects entries to be marked huge before we use 
set_pmd_at()/set_pud_at().

total: 0 errors, 1 warnings, 77 lines checked


Re: [PATCH v3 05/13] mm/debug_vm_pgtable/savedwrite: Enable savedwrite test with CONFIG_NUMA_BALANCING

2020-08-31 Thread Anshuman Khandual



On 08/27/2020 01:34 PM, Aneesh Kumar K.V wrote:
> Saved write support was added to track the write bit of a pte after marking 
> the
> pte protnone. This was done so that AUTONUMA can convert a write pte to 
> protnone
> and still track the old write bit. When converting it back we set the pte 
> write
> bit correctly thereby avoiding a write fault again. Hence enable the test only
> when CONFIG_NUMA_BALANCING is enabled and use protnone protflags.
> 
> Signed-off-by: Aneesh Kumar K.V 
> ---
>  mm/debug_vm_pgtable.c | 11 +--
>  1 file changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
> index 28f9d0558c20..5c0680836fe9 100644
> --- a/mm/debug_vm_pgtable.c
> +++ b/mm/debug_vm_pgtable.c
> @@ -119,10 +119,14 @@ static void __init pte_savedwrite_tests(unsigned long 
> pfn, pgprot_t prot)
>  {
>   pte_t pte = pfn_pte(pfn, prot);
>  
> + if (!IS_ENABLED(CONFIG_NUMA_BALANCING))
> + return;
> +
>   pr_debug("Validating PTE saved write\n");
>   WARN_ON(!pte_savedwrite(pte_mk_savedwrite(pte_clear_savedwrite(pte;
>   WARN_ON(pte_savedwrite(pte_clear_savedwrite(pte_mk_savedwrite(pte;
>  }
> +
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>  static void __init pmd_basic_tests(unsigned long pfn, pgprot_t prot)
>  {
> @@ -235,6 +239,9 @@ static void __init pmd_savedwrite_tests(unsigned long 
> pfn, pgprot_t prot)
>  {
>   pmd_t pmd = pfn_pmd(pfn, prot);
>  
> + if (!IS_ENABLED(CONFIG_NUMA_BALANCING))
> + return;
> +
>   pr_debug("Validating PMD saved write\n");
>   WARN_ON(!pmd_savedwrite(pmd_mk_savedwrite(pmd_clear_savedwrite(pmd;
>   WARN_ON(pmd_savedwrite(pmd_clear_savedwrite(pmd_mk_savedwrite(pmd;
> @@ -1020,8 +1027,8 @@ static int __init debug_vm_pgtable(void)
>   pmd_huge_tests(pmdp, pmd_aligned, prot);
>   pud_huge_tests(pudp, pud_aligned, prot);
>  
> - pte_savedwrite_tests(pte_aligned, prot);
> - pmd_savedwrite_tests(pmd_aligned, prot);
> + pte_savedwrite_tests(pte_aligned, protnone);
> + pmd_savedwrite_tests(pmd_aligned, protnone);
>  
>   pte_unmap_unlock(ptep, ptl);
>  
> 

There is a checkpatch.pl warning here.

WARNING: Possible unwrapped commit description (prefer a maximum 75 chars per 
line)
#7: 
Saved write support was added to track the write bit of a pte after marking the

total: 0 errors, 1 warnings, 33 lines checked

Otherwise this looks good. With the checkpatch.pl warning fixed

Reviewed-by: Anshuman Khandual 


Re: [PATCH v3 04/13] mm/debug_vm_pgtables/hugevmap: Use the arch helper to identify huge vmap support.

2020-08-31 Thread Anshuman Khandual



On 08/27/2020 01:34 PM, Aneesh Kumar K.V wrote:
> ppc64 supports huge vmap only with radix translation. Hence use arch helper
> to determine the huge vmap support.
> 
> Signed-off-by: Aneesh Kumar K.V 
> ---
>  mm/debug_vm_pgtable.c | 15 +--
>  1 file changed, 13 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
> index bbf9df0e64c6..28f9d0558c20 100644
> --- a/mm/debug_vm_pgtable.c
> +++ b/mm/debug_vm_pgtable.c
> @@ -28,6 +28,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  
> @@ -206,11 +207,12 @@ static void __init pmd_leaf_tests(unsigned long pfn, 
> pgprot_t prot)
>   WARN_ON(!pmd_leaf(pmd));
>  }
>  
> +#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
>  static void __init pmd_huge_tests(pmd_t *pmdp, unsigned long pfn, pgprot_t 
> prot)
>  {
>   pmd_t pmd;
>  
> - if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMAP))
> + if (!arch_ioremap_pmd_supported())
>   return;
>  
>   pr_debug("Validating PMD huge\n");
> @@ -224,6 +226,10 @@ static void __init pmd_huge_tests(pmd_t *pmdp, unsigned 
> long pfn, pgprot_t prot)
>   pmd = READ_ONCE(*pmdp);
>   WARN_ON(!pmd_none(pmd));
>  }
> +#else /* !CONFIG_HAVE_ARCH_HUGE_VMAP */
> +static void __init pmd_huge_tests(pmd_t *pmdp, unsigned long pfn, pgprot_t 
> prot) { }
> +#endif /* !CONFIG_HAVE_ARCH_HUGE_VMAP */
> +

Some small nits here.

- s/!CONFIG_HAVE_ARCH_HUGE_VMAP/CONFIG_HAVE_ARCH_HUGE_VMAP
- Drop the extra line at the end
- Move the { } down just to be consistent with existing stub for 
pmd_huge_tests()
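With those nits applied, one possible reading of the resulting stub is the
following (sketch only; the exact preferred layout is for the author to
decide):

#else  /* CONFIG_HAVE_ARCH_HUGE_VMAP */
static void __init pmd_huge_tests(pmd_t *pmdp, unsigned long pfn, pgprot_t prot)
{
}
#endif /* CONFIG_HAVE_ARCH_HUGE_VMAP */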

>  
>  static void __init pmd_savedwrite_tests(unsigned long pfn, pgprot_t prot)
>  {
> @@ -320,11 +326,12 @@ static void __init pud_leaf_tests(unsigned long pfn, 
> pgprot_t prot)
>   WARN_ON(!pud_leaf(pud));
>  }
>  
> +#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
>  static void __init pud_huge_tests(pud_t *pudp, unsigned long pfn, pgprot_t 
> prot)
>  {
>   pud_t pud;
>  
> - if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMAP))
> + if (!arch_ioremap_pud_supported())
>   return;
>  
>   pr_debug("Validating PUD huge\n");
> @@ -338,6 +345,10 @@ static void __init pud_huge_tests(pud_t *pudp, unsigned 
> long pfn, pgprot_t prot)
>   pud = READ_ONCE(*pudp);
>   WARN_ON(!pud_none(pud));
>  }
> +#else /* !CONFIG_HAVE_ARCH_HUGE_VMAP */
> +static void __init pud_huge_tests(pud_t *pudp, unsigned long pfn, pgprot_t 
> prot) { }
> +#endif /* !CONFIG_HAVE_ARCH_HUGE_VMAP */
> +

Some small nits here.

- s/!CONFIG_HAVE_ARCH_HUGE_VMAP/CONFIG_HAVE_ARCH_HUGE_VMAP
- Drop the extra line at the end
- Move the { } down just to be consistent with existing stub for 
pud_huge_tests()

>  #else  /* !CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
>  static void __init pud_basic_tests(unsigned long pfn, pgprot_t prot) { }
>  static void __init pud_advanced_tests(struct mm_struct *mm,
>


Re: [PATCH v3 03/13] mm/debug_vm_pgtable/ppc64: Avoid setting top bits in random value

2020-08-31 Thread Anshuman Khandual



On 08/27/2020 01:34 PM, Aneesh Kumar K.V wrote:
> ppc64 use bit 62 to indicate a pte entry (_PAGE_PTE). Avoid setting that bit 
> in
> random value.
> 
> Signed-off-by: Aneesh Kumar K.V 
> ---
>  mm/debug_vm_pgtable.c | 13 ++---
>  1 file changed, 10 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
> index 086309fb9b6f..bbf9df0e64c6 100644
> --- a/mm/debug_vm_pgtable.c
> +++ b/mm/debug_vm_pgtable.c
> @@ -44,10 +44,17 @@
>   * entry type. But these bits might affect the ability to clear entries with
>   * pxx_clear() because of how dynamic page table folding works on s390. So
>   * while loading up the entries do not change the lower 4 bits. It does not
> - * have affect any other platform.
> + * have affect any other platform. Also avoid the 62nd bit on ppc64 that is
> + * used to mark a pte entry.
>   */
> -#define S390_MASK_BITS   4
> -#define RANDOM_ORVALUE   GENMASK(BITS_PER_LONG - 1, S390_MASK_BITS)
> +#define S390_SKIP_MASK   GENMASK(3, 0)
> +#ifdef CONFIG_PPC_BOOK3S_64
> +#define PPC64_SKIP_MASK  GENMASK(62, 62)
> +#else
> +#define PPC64_SKIP_MASK  0x0
> +#endif

Please drop the #ifdef CONFIG_PPC_BOOK3S_64 here. We already accommodate skip
bits for a s390 platform requirement and can also do so for ppc64 as well. As
mentioned before, please avoid adding any platform specific constructs in the
test.

> +#define ARCH_SKIP_MASK (S390_SKIP_MASK | PPC64_SKIP_MASK)
> +#define RANDOM_ORVALUE (GENMASK(BITS_PER_LONG - 1, 0) & ~ARCH_SKIP_MASK)
>  #define RANDOM_NZVALUE   GENMASK(7, 0)

Please fix the alignments here. Feel free to consider following changes after
this patch.

diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
index 122416464e0f..f969031152bb 100644
--- a/mm/debug_vm_pgtable.c
+++ b/mm/debug_vm_pgtable.c
@@ -48,14 +48,11 @@
  * have affect any other platform. Also avoid the 62nd bit on ppc64 that is
  * used to mark a pte entry.
  */
-#define S390_SKIP_MASK GENMASK(3, 0)
-#ifdef CONFIG_PPC_BOOK3S_64
-#define PPC64_SKIP_MASK	GENMASK(62, 62)
-#else
-#define PPC64_SKIP_MASK	0x0
-#endif
-#define ARCH_SKIP_MASK (S390_SKIP_MASK | PPC64_SKIP_MASK)
-#define RANDOM_ORVALUE (GENMASK(BITS_PER_LONG - 1, 0) & ~ARCH_SKIP_MASK)
+#define S390_SKIP_MASK GENMASK(3, 0)
+#define PPC64_SKIP_MASK	GENMASK(62, 62)
+#define ARCH_SKIP_MASK (S390_SKIP_MASK | PPC64_SKIP_MASK)
+#define RANDOM_ORVALUE (GENMASK(BITS_PER_LONG - 1, 0) & ~ARCH_SKIP_MASK)
+
 #define RANDOM_NZVALUE GENMASK(7, 0)
 
>  
>  static void __init pte_basic_tests(unsigned long pfn, pgprot_t prot)
> 

Besides, there is also one checkpatch.pl warning here.

WARNING: Possible unwrapped commit description (prefer a maximum 75 chars per 
line)
#7: 
ppc64 use bit 62 to indicate a pte entry (_PAGE_PTE). Avoid setting that bit in

total: 0 errors, 1 warnings, 20 lines checked


Re: [PATCH v3 02/13] powerpc/mm: Move setting pte specific flags to pfn_pte

2020-08-31 Thread Anshuman Khandual



On 08/27/2020 01:34 PM, Aneesh Kumar K.V wrote:
> powerpc used to set the pte specific flags in set_pte_at(). This is different
> from other architectures. To be consistent with other architecture update
> pfn_pte to set _PAGE_PTE on ppc64. Also, drop now unused pte_mkpte.
> 
> We add a VM_WARN_ON() to catch the usage of calling set_pte_at() without 
> setting
> _PAGE_PTE bit. We will remove that after a few releases.
> 
> With respect to huge pmd entries, pmd_mkhuge() takes care of adding the
> _PAGE_PTE bit.
> 
> Signed-off-by: Aneesh Kumar K.V 
> ---
>  arch/powerpc/include/asm/book3s/64/pgtable.h | 15 +--
>  arch/powerpc/include/asm/nohash/pgtable.h|  5 -
>  arch/powerpc/mm/pgtable.c|  5 -
>  3 files changed, 9 insertions(+), 16 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
> b/arch/powerpc/include/asm/book3s/64/pgtable.h
> index 079211968987..2382fd516f6b 100644
> --- a/arch/powerpc/include/asm/book3s/64/pgtable.h
> +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
> @@ -619,7 +619,7 @@ static inline pte_t pfn_pte(unsigned long pfn, pgprot_t 
> pgprot)
>   VM_BUG_ON(pfn >> (64 - PAGE_SHIFT));
>   VM_BUG_ON((pfn << PAGE_SHIFT) & ~PTE_RPN_MASK);
>  
> - return __pte(((pte_basic_t)pfn << PAGE_SHIFT) | pgprot_val(pgprot));
> + return __pte(((pte_basic_t)pfn << PAGE_SHIFT) | pgprot_val(pgprot) | 
> _PAGE_PTE);
>  }
>  
>  static inline unsigned long pte_pfn(pte_t pte)
> @@ -655,11 +655,6 @@ static inline pte_t pte_mkexec(pte_t pte)
>   return __pte_raw(pte_raw(pte) | cpu_to_be64(_PAGE_EXEC));
>  }
>  
> -static inline pte_t pte_mkpte(pte_t pte)
> -{
> - return __pte_raw(pte_raw(pte) | cpu_to_be64(_PAGE_PTE));
> -}
> -
>  static inline pte_t pte_mkwrite(pte_t pte)
>  {
>   /*
> @@ -823,6 +818,14 @@ static inline int pte_none(pte_t pte)
>  static inline void __set_pte_at(struct mm_struct *mm, unsigned long addr,
>   pte_t *ptep, pte_t pte, int percpu)
>  {
> +
> + VM_WARN_ON(!(pte_raw(pte) & cpu_to_be64(_PAGE_PTE)));
> + /*
> +  * Keep the _PAGE_PTE added till we are sure we handle _PAGE_PTE
> +  * in all the callers.
> +  */
> +  pte = __pte_raw(pte_raw(pte) | cpu_to_be64(_PAGE_PTE));
> +
>   if (radix_enabled())
>   return radix__set_pte_at(mm, addr, ptep, pte, percpu);
>   return hash__set_pte_at(mm, addr, ptep, pte, percpu);
> diff --git a/arch/powerpc/include/asm/nohash/pgtable.h 
> b/arch/powerpc/include/asm/nohash/pgtable.h
> index 4b7c3472eab1..6277e7596ae5 100644
> --- a/arch/powerpc/include/asm/nohash/pgtable.h
> +++ b/arch/powerpc/include/asm/nohash/pgtable.h
> @@ -140,11 +140,6 @@ static inline pte_t pte_mkold(pte_t pte)
>   return __pte(pte_val(pte) & ~_PAGE_ACCESSED);
>  }
>  
> -static inline pte_t pte_mkpte(pte_t pte)
> -{
> - return pte;
> -}
> -
>  static inline pte_t pte_mkspecial(pte_t pte)
>  {
>   return __pte(pte_val(pte) | _PAGE_SPECIAL);
> diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
> index 9c0547d77af3..ab57b07ef39a 100644
> --- a/arch/powerpc/mm/pgtable.c
> +++ b/arch/powerpc/mm/pgtable.c
> @@ -184,9 +184,6 @@ void set_pte_at(struct mm_struct *mm, unsigned long addr, 
> pte_t *ptep,
>*/
>   VM_WARN_ON(pte_hw_valid(*ptep) && !pte_protnone(*ptep));
>  
> - /* Add the pte bit when trying to set a pte */
> - pte = pte_mkpte(pte);
> -
>   /* Note: mm->context.id might not yet have been assigned as
>* this context might not have been activated yet when this
>* is called.
> @@ -275,8 +272,6 @@ void set_huge_pte_at(struct mm_struct *mm, unsigned long 
> addr, pte_t *ptep, pte_
>*/
>   VM_WARN_ON(pte_hw_valid(*ptep) && !pte_protnone(*ptep));
>  
> - pte = pte_mkpte(pte);
> -
>   pte = set_pte_filter(pte);
>  
>   val = pte_val(pte);
> 

There is one checkpatch.pl warning for this patch.

WARNING: Possible unwrapped commit description (prefer a maximum 75 chars per 
line)
#6: 
powerpc used to set the pte specific flags in set_pte_at(). This is different

total: 0 errors, 1 warnings, 61 lines checked


Re: [PATCH v3 01/13] powerpc/mm: Add DEBUG_VM WARN for pmd_clear

2020-08-31 Thread Anshuman Khandual



On 08/27/2020 01:34 PM, Aneesh Kumar K.V wrote:
> With the hash page table, the kernel should not use pmd_clear for clearing
> huge pte entries. Add a DEBUG_VM WARN to catch the wrong usage.
> 
> Signed-off-by: Aneesh Kumar K.V 
> ---
>  arch/powerpc/include/asm/book3s/64/pgtable.h | 14 ++
>  1 file changed, 14 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
> b/arch/powerpc/include/asm/book3s/64/pgtable.h
> index 6de56c3b33c4..079211968987 100644
> --- a/arch/powerpc/include/asm/book3s/64/pgtable.h
> +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
> @@ -868,6 +868,13 @@ static inline bool pte_ci(pte_t pte)
>  
>  static inline void pmd_clear(pmd_t *pmdp)
>  {
> + if (IS_ENABLED(CONFIG_DEBUG_VM) && !radix_enabled()) {
> + /*
> +  * Don't use this if we can possibly have a hash page table
> +  * entry mapping this.
> +  */
> + WARN_ON((pmd_val(*pmdp) & (H_PAGE_HASHPTE | _PAGE_PTE)) == 
> (H_PAGE_HASHPTE | _PAGE_PTE));
> + }
>   *pmdp = __pmd(0);
>  }
>  
> @@ -916,6 +923,13 @@ static inline int pmd_bad(pmd_t pmd)
>  
>  static inline void pud_clear(pud_t *pudp)
>  {
> + if (IS_ENABLED(CONFIG_DEBUG_VM) && !radix_enabled()) {
> + /*
> +  * Don't use this if we can possibly have a hash page table
> +  * entry mapping this.
> +  */
> + WARN_ON((pud_val(*pudp) & (H_PAGE_HASHPTE | _PAGE_PTE)) == 
> (H_PAGE_HASHPTE | _PAGE_PTE));
> + }
>   *pudp = __pud(0);
>  }

There are two checkpatch.pl warnings for this patch.

WARNING: line length of 105 exceeds 100 columns
#27: FILE: arch/powerpc/include/asm/book3s/64/pgtable.h:876:
+   WARN_ON((pmd_val(*pmdp) & (H_PAGE_HASHPTE | _PAGE_PTE)) == 
(H_PAGE_HASHPTE | _PAGE_PTE));

WARNING: line length of 105 exceeds 100 columns
#41: FILE: arch/powerpc/include/asm/book3s/64/pgtable.h:931:
+   WARN_ON((pud_val(*pudp) & (H_PAGE_HASHPTE | _PAGE_PTE)) == 
(H_PAGE_HASHPTE | _PAGE_PTE));

total: 0 errors, 2 warnings, 26 lines checked


Re: fsl_espi errors on v5.7.15

2020-08-31 Thread Chris Packham

On 1/09/20 12:33 am, Heiner Kallweit wrote:
> On 30.08.2020 23:59, Chris Packham wrote:
>> On 31/08/20 9:41 am, Heiner Kallweit wrote:
>>> On 30.08.2020 23:00, Chris Packham wrote:
 On 31/08/20 12:30 am, Nicholas Piggin wrote:
> Excerpts from Chris Packham's message of August 28, 2020 8:07 am:
 

>> I've also now seen the RX FIFO not empty error on the T2080RDB
>>
>> fsl_espi ffe11.spi: Transfer done but SPIE_DON isn't set!
>> fsl_espi ffe11.spi: Transfer done but SPIE_DON isn't set!
>> fsl_espi ffe11.spi: Transfer done but SPIE_DON isn't set!
>> fsl_espi ffe11.spi: Transfer done but SPIE_DON isn't set!
>> fsl_espi ffe11.spi: Transfer done but rx/tx fifo's aren't empty!
>> fsl_espi ffe11.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>
>> With my current workaround of emptying the RX FIFO. It seems
>> survivable. Interestingly it only ever seems to be 1 extra byte in 
>> the
>> RX FIFO and it seems to be after either a READ_SR or a READ_FSR.
>>
>> fsl_espi ffe11.spi: tx 70
>> fsl_espi ffe11.spi: rx 03
>> fsl_espi ffe11.spi: Extra RX 00
>> fsl_espi ffe11.spi: Transfer done but SPIE_DON isn't set!
>> fsl_espi ffe11.spi: Transfer done but rx/tx fifo's aren't empty!
>> fsl_espi ffe11.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>> fsl_espi ffe11.spi: tx 05
>> fsl_espi ffe11.spi: rx 00
>> fsl_espi ffe11.spi: Extra RX 03
>> fsl_espi ffe11.spi: Transfer done but SPIE_DON isn't set!
>> fsl_espi ffe11.spi: Transfer done but rx/tx fifo's aren't empty!
>> fsl_espi ffe11.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>> fsl_espi ffe11.spi: tx 05
>> fsl_espi ffe11.spi: rx 00
>> fsl_espi ffe11.spi: Extra RX 03
>>
>> From all the Micron SPI-NOR datasheets I've got access to it is
>> possible to continually read the SR/FSR. But I've no idea why it
>> happens some times and not others.
> So I think I've got a reproduction and I think I've bisected the 
> problem
> to commit 3282a3da25bd ("powerpc/64: Implement soft interrupt replay 
> in
> C"). My day is just finishing now so I haven't applied too much 
> scrutiny
> to this result. Given the various rabbit holes I've been down on this
> issue already I'd take this information with a good degree of 
> skepticism.
>
 OK, so an easy test should be to re-test with a 5.4 kernel.
 It doesn't have yet the change you're referring to, and the fsl-espi 
 driver
 is basically the same as in 5.7 (just two small changes in 5.7).
>>> There's 6cc0c16d82f88 and maybe also other interrupt related patches
>>> around this time that could affect book E, so it's good if that exact
>>> patch is confirmed.
>> My confirmation is basically that I can induce the issue in a 5.4 kernel
>> by cherry-picking 3282a3da25bd. I'm also able to "fix" the issue in
>> 5.9-rc2 by reverting that one commit.
>>
>> I both cases it's not exactly a clean cherry-pick/revert so I also
>> confirmed the bisection result by building at 3282a3da25bd (which sees
>> the issue) and the commit just before (which does not).
> Thanks for testing, that confirms it well.
>
> [snip patch]
>
>> I still saw the issue with this change applied. PPC_IRQ_SOFT_MASK_DEBUG
>> didn't report anything (either with or without the change above).
> Okay, it was a bit of a shot in the dark. I still can't see what
> else has changed.
>
> What would cause this, a lost interrupt? A spurious interrupt? Or
> higher interrupt latency?
>
> I don't think the patch should cause significantly worse latency,
> (it's supposed to be a bit better if anything because it doesn't set
> up the full interrupt frame). But it's possible.
 My working theory is that the SPI_DON indication is all about the TX
 direction and now that the interrupts are faster we're hitting an error
 because there is still RX activity going on. Heiner disagrees with my
 interpretation of the SPI_DON indication and the fact that it doesn't
 happen every time does throw doubt on it.

>>> It's right that the eSPI spec can be interpreted that SPI_DON refers to
>>> TX only. However this wouldn't really make sense, because also for RX
>>> we program the frame length, and therefore want to be notified once the
>>> full frame was received. Also practical experience shows that SPI_DON
>>> is set also after RX-only transfers.
>>> Typical SPI NOR use case is that you write read command + start address,
>>> followed by a longer read. If the TX-only interpretation would be right,
>>> we'd always end up with SPI_DON not being set.
>>>
>

Re: [PATCH] soc: fsl: Remove bogus packed attributes from qman.h

2020-08-31 Thread Herbert Xu
On Tue, Sep 01, 2020 at 01:50:38AM +, Leo Li wrote:
>
> Sorry for the late response.  I missed this email previously.
> 
> These structures are descriptors used by hardware, so we cannot have _ANY_
> padding from the compiler.  The compiled result might be the same with or
> without the __packed attribute for now, but I think keeping it there is
> probably safer for dealing with unexpected alignment requirements from the
> compiler in the future.
>
> Having a conflicting-alignment-requirements warning might mean something is
> wrong with the structure in a certain scenario.  I just tried an ARM64 build
> but didn't see the warnings.  Could you share the warning you got and the
> build setup?  Thanks.

Just do a COMPILE_TEST build on x86-64:

In file included from ../drivers/crypto/caam/qi.c:12:
../include/soc/fsl/qman.h:259:1: warning: alignment 1 of ‘struct qm_dqrr_entry’ 
is less than 8 [-Wpacked-not-aligned]
 } __packed;
 ^
../include/soc/fsl/qman.h:292:2: warning: alignment 1 of ‘struct ’ 
is less than 8 [-Wpacked-not-aligned]
  } __packed ern;
  ^

In any case, those packed markers are completely unnecessary because
those structs contain no holes.
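
For what it's worth, a stand-alone sketch (my own reduction, not the actual
qman.h layout) should reproduce the same class of warning when compiled with
gcc -c -Wall (or -Wpacked-not-aligned), and dropping __packed does not change
the layout because there are no holes to begin with:

/* packed-demo.c: hypothetical reduction of the qm_dqrr_entry pattern */
struct fd {
	unsigned long long data;	/* stand-in for struct qm_fd */
} __attribute__((aligned(8)));

struct entry {
	unsigned char verb;
	unsigned char reserved[7];	/* no hole: fd already lands at offset 8 */
	struct fd fd;			/* member demands 8-byte alignment */
} __attribute__((packed));		/* expected: "alignment 1 of 'struct entry'
					 * is less than 8"; sizeof is 16 with or
					 * without the packed attribute */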

Cheers,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


RE: [PATCH] soc: fsl: Remove bogus packed attributes from qman.h

2020-08-31 Thread Leo Li



> -Original Message-
> From: Herbert Xu 
> Sent: Thursday, July 30, 2020 7:53 AM
> To: Leo Li ; linuxppc-dev@lists.ozlabs.org; linux-arm-
> ker...@lists.infradead.org
> Subject: [PATCH] soc: fsl: Remove bogus packed attributes from qman.h
> 
> There are two __packed attributes in qman.h that are both unnecessary
> and causing compiler warnings because they're conflicting with
> explicit alignment requirements set on members within the structure.

Sorry for the late response.  I missed this email previously.

These structures are descriptors used by hardware, so we cannot have _ANY_ padding
from the compiler.  The compiled result might be the same with or without the
__packed attribute for now, but I think keeping it there is probably safer for
dealing with unexpected alignment requirements from the compiler in the future.

Having a conflicting-alignment-requirements warning might mean something is
wrong with the structure in a certain scenario.  I just tried an ARM64 build but
didn't see the warnings.  Could you share the warning you got and the build
setup?  Thanks.

Regards,
Leo 
> 
> This patch removes them both.
> 
> Signed-off-by: Herbert Xu 
> 
> diff --git a/include/soc/fsl/qman.h b/include/soc/fsl/qman.h
> index cfe00e08e85b..d81ff185dc0b 100644
> --- a/include/soc/fsl/qman.h
> +++ b/include/soc/fsl/qman.h
> @@ -256,7 +256,7 @@ struct qm_dqrr_entry {
>   __be32 context_b;
>   struct qm_fd fd;
>   u8 __reserved4[32];
> -} __packed;
> +};
>  #define QM_DQRR_VERB_VBIT0x80
>  #define QM_DQRR_VERB_MASK0x7f/* where the verb
> contains; */
>  #define QM_DQRR_VERB_FRAME_DEQUEUE   0x60/* "this format" */
> @@ -289,7 +289,7 @@ union qm_mr_entry {
>   __be32 tag;
>   struct qm_fd fd;
>   u8 __reserved1[32];
> - } __packed ern;
> + } ern;
>   struct {
>   u8 verb;
>   u8 fqs; /* Frame Queue Status */
> --
> Email: Herbert Xu 
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [PATCH] soc: fsl: Remove bogus packed attributes from qman.h

2020-08-31 Thread Herbert Xu
On Thu, Jul 30, 2020 at 10:52:59PM +1000, Herbert Xu wrote:
> There are two __packed attributes in qman.h that are both unnecessary
> and causing compiler warnings because they're conflicting with
> explicit alignment requirements set on members within the structure.
> 
> This patch removes them both.
> 
> Signed-off-by: Herbert Xu 

Ping.
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


[PATCH v2] scsi: ibmvfc: interface updates for future FPIN and MQ support

2020-08-31 Thread Tyrel Datwyler
VIOS partitions with SLI-4 enabled Emulex adapters will be capable of
driving IO in parallel through multiple work queues or channels, and
with new hypervisor firmware that supports multiple interrupt sources
an ibmvfc NPIV single initiator can be modified to exploit end to end
channelization in a PowerVM environment.

VIOS hosts will also be able to expose fabric performance impact
notifications (FPIN) via a new asynchronous event to ibmvfc clients that
advertise support via IBMVFC_CAN_HANDLE_FPIN in their capabilities flag
during NPIV_LOGIN.

This patch introduces three new Management Datagrams (MADs) for
channelization support negotiation as well as the FPIN asynchronous
event and FPIN status flags. Follow up work is required to plumb the
ibmvfc client driver to use these new interfaces.

Signed-off-by: Tyrel Datwyler 
---
v1 -> v2:
Fixup compiler errors from a neglected commit --amend

---
 drivers/scsi/ibmvscsi/ibmvfc.h | 66 +-
 1 file changed, 65 insertions(+), 1 deletion(-)

diff --git a/drivers/scsi/ibmvscsi/ibmvfc.h b/drivers/scsi/ibmvscsi/ibmvfc.h
index 907889f1fa9d..fe55cfc79421 100644
--- a/drivers/scsi/ibmvscsi/ibmvfc.h
+++ b/drivers/scsi/ibmvscsi/ibmvfc.h
@@ -124,6 +124,9 @@ enum ibmvfc_mad_types {
IBMVFC_PASSTHRU = 0x0200,
IBMVFC_TMF_MAD  = 0x0100,
IBMVFC_NPIV_LOGOUT  = 0x0800,
+   IBMVFC_CHANNEL_ENQUIRY  = 0x1000,
+   IBMVFC_CHANNEL_SETUP= 0x2000,
+   IBMVFC_CONNECTION_INFO  = 0x4000,
 };
 
 struct ibmvfc_mad_common {
@@ -162,6 +165,8 @@ struct ibmvfc_npiv_login {
__be32 max_cmds;
__be64 capabilities;
 #define IBMVFC_CAN_MIGRATE 0x01
+#define IBMVFC_CAN_USE_CHANNELS0x02
+#define IBMVFC_CAN_HANDLE_FPIN 0x04
__be64 node_name;
struct srp_direct_buf async;
u8 partition_name[IBMVFC_MAX_NAME];
@@ -204,6 +209,7 @@ struct ibmvfc_npiv_login_resp {
__be64 capabilities;
 #define IBMVFC_CAN_FLUSH_ON_HALT   0x08
 #define IBMVFC_CAN_SUPPRESS_ABTS   0x10
+#define IBMVFC_CAN_SUPPORT_CHANNELS0x20
__be32 max_cmds;
__be32 scsi_id_sz;
__be64 max_dma_len;
@@ -482,6 +488,52 @@ struct ibmvfc_passthru_mad {
struct ibmvfc_passthru_fc_iu fc_iu;
 }__attribute__((packed, aligned (8)));
 
+struct ibmvfc_channel_enquiry {
+   struct ibmvfc_mad_common common;
+   __be32 flags;
+#define IBMVFC_NO_CHANNELS_TO_CRQ_SUPPORT  0x01
+#define IBMVFC_SUPPORT_VARIABLE_SUBQ_MSG   0x02
+#define IBMVFC_NO_N_TO_M_CHANNELS_SUPPORT  0x04
+   __be32 num_scsi_subq_channels;
+   __be32 num_nvmeof_subq_channels;
+   __be32 num_scsi_vas_channels;
+   __be32 num_nvmeof_vas_channels;
+}__attribute__((packed, aligned (8)));
+
+struct ibmvfc_channel_setup_mad {
+   struct ibmvfc_mad_common common;
+   struct srp_direct_buf buffer;
+}__attribute__((packed, aligned (8)));
+
+#define IBMVFC_MAX_CHANNELS502
+
+struct ibmvfc_channel_setup {
+   __be32 flags;
+#define IBMVFC_CANCEL_CHANNELS 0x01
+#define IBMVFC_USE_BUFFER  0x02
+#define IBMVFC_CHANNELS_CANCELED   0x04
+   __be32 reserved;
+   __be32 num_scsi_subq_channels;
+   __be32 num_nvmeof_subq_channels;
+   __be32 num_scsi_vas_channels;
+   __be32 num_nvmeof_vas_channels;
+   struct srp_direct_buf buffer;
+   __be64 reserved2[5];
+   __be64 channel_handles[IBMVFC_MAX_CHANNELS];
+}__attribute__((packed, aligned (8)));
+
+struct ibmvfc_connection_info {
+   struct ibmvfc_mad_common common;
+   __be64 information_bits;
+#define IBMVFC_NO_FC_IO_CHANNEL0x01
+#define IBMVFC_NO_PHYP_VAS 0x02
+#define IBMVFC_NO_PHYP_SUBQ0x04
+#define IBMVFC_PHYP_DEPRECATED_SUBQ0x08
+#define IBMVFC_PHYP_PRESERVED_SUBQ 0x10
+#define IBMVFC_PHYP_FULL_SUBQ  0x20
+   __be64 reserved[16];
+}__attribute__((packed, aligned (8)));
+
 struct ibmvfc_trace_start_entry {
u32 xfer_len;
 }__attribute__((packed));
@@ -532,6 +584,7 @@ enum ibmvfc_async_event {
IBMVFC_AE_HALT  = 0x0400,
IBMVFC_AE_RESUME= 0x0800,
IBMVFC_AE_ADAPTER_FAILED= 0x1000,
+   IBMVFC_AE_FPIN  = 0x2000,
 };
 
 struct ibmvfc_async_desc {
@@ -560,10 +613,18 @@ enum ibmvfc_ae_link_state {
IBMVFC_AE_LS_LINK_DEAD  = 0x08,
 };
 
+enum ibmvfc_ae_fpin_status {
+   IBMVFC_AE_FPIN_LINK_CONGESTED   = 0x1,
+   IBMVFC_AE_FPIN_PORT_CONGESTED   = 0x2,
+   IBMVFC_AE_FPIN_PORT_CLEARED = 0x3,
+   IBMVFC_AE_FPIN_PORT_DEGRADED= 0x4,
+};
+
 struct ibmvfc_async_crq {
volatile u8 valid;
u8 link_state;
-   u8 pad[2];
+   u8 fpin_status;
+   u8 pad;
__be32 pad2;
volatile __be64 event;
volatile __be64 scsi_id;
@@ -590,6 +651,9 @@ union ibmvfc_iu {
struct ibmvfc_tmf tmf;
struct ibmvfc_cmd cmd;
  

Re: [PATCH] scsi: ibmvfc: interface updates for future FPIN and MQ support

2020-08-31 Thread kernel test robot
Hi Tyrel,

I love your patch! Yet something to improve:

[auto build test ERROR on powerpc/next]
[also build test ERROR on mkp-scsi/for-next scsi/for-next v5.9-rc3 
next-20200828]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:
https://github.com/0day-ci/linux/commits/Tyrel-Datwyler/scsi-ibmvfc-interface-updates-for-future-FPIN-and-MQ-support/20200901-012005
base:   https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next
config: powerpc-defconfig (attached as .config)
compiler: powerpc64-linux-gcc (GCC) 9.3.0
reproduce (this is a W=1 build):
wget 
https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross 
ARCH=powerpc 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot 

All errors (new ones prefixed by >>):

   In file included from drivers/scsi/ibmvscsi/ibmvfc.c:32:
>> drivers/scsi/ibmvscsi/ibmvfc.h:500:13: error: expected ':', ',', ';', '}' or 
>> '__attribute__' before 'nvmeof_vas_channels'
 500 |  __be32 num nvmeof_vas_channels;
 | ^~~
>> drivers/scsi/ibmvscsi/ibmvfc.h:506:17: error: expected declaration 
>> specifiers or '...' before '(' token
 506 | }__attrribute__((packed, aligned (8)));
 | ^
>> drivers/scsi/ibmvscsi/ibmvfc.h:526:28: error: field 'common' has incomplete 
>> type
 526 |  struct ibmvfc_madd_common common;
 |^~
>> drivers/scsi/ibmvscsi/ibmvfc.h:627:2: error: expected ':', ',', ';', '}' or 
>> '__attribute__' before 'u8'
 627 |  u8 pad;
 |  ^~
   In file included from include/linux/byteorder/big_endian.h:5,
from arch/powerpc/include/uapi/asm/byteorder.h:14,
from include/asm-generic/bitops/le.h:6,
from arch/powerpc/include/asm/bitops.h:246,
from include/linux/bitops.h:29,
from include/linux/kernel.h:12,
from include/linux/list.h:9,
from include/linux/module.h:12,
from drivers/scsi/ibmvscsi/ibmvfc.c:10:
   drivers/scsi/ibmvscsi/ibmvfc.c: In function 'ibmvfc_handle_async':
>> drivers/scsi/ibmvscsi/ibmvfc.c:2665:75: error: 'struct ibmvfc_async_crq' has 
>> no member named 'event'
2665 |  const struct ibmvfc_async_desc *desc = 
ibmvfc_get_ae_desc(be64_to_cpu(crq->event));
 |  
 ^~
   include/uapi/linux/byteorder/big_endian.h:38:51: note: in definition of 
macro '__be64_to_cpu'
  38 | #define __be64_to_cpu(x) ((__force __u64)(__be64)(x))
 |   ^
   drivers/scsi/ibmvscsi/ibmvfc.c:2665:60: note: in expansion of macro 
'be64_to_cpu'
2665 |  const struct ibmvfc_async_desc *desc = 
ibmvfc_get_ae_desc(be64_to_cpu(crq->event));
 |
^~~
>> drivers/scsi/ibmvscsi/ibmvfc.c:2669:57: error: 'struct ibmvfc_async_crq' has 
>> no member named 'scsi_id'
2669 |  " node_name: %llx%s\n", desc->desc, be64_to_cpu(crq->scsi_id),
 | ^~
   include/uapi/linux/byteorder/big_endian.h:38:51: note: in definition of 
macro '__be64_to_cpu'
  38 | #define __be64_to_cpu(x) ((__force __u64)(__be64)(x))
 |   ^
   drivers/scsi/ibmvscsi/ibmvfc.h:818:4: note: in expansion of macro 'dev_err'
 818 |dev_err((vhost)->dev, ##__VA_ARGS__); \
 |^~~
   drivers/scsi/ibmvscsi/ibmvfc.c:2668:2: note: in expansion of macro 
'ibmvfc_log'
2668 |  ibmvfc_log(vhost, desc->log_level, "%s event received. scsi_id: 
%llx, wwpn: %llx,"
 |  ^~
>> drivers/scsi/ibmvscsi/ibmvfc.c:2670:21: error: 'struct ibmvfc_async_crq' has 
>> no member named 'wwpn'
2670 |  be64_to_cpu(crq->wwpn), be64_to_cpu(crq->node_name),
 | ^~
   include/uapi/linux/byteorder/big_endian.h:38:51: note: in definition of 
macro '__be64_to_cpu'
  38 | #define __be64_to_cpu(x) ((__force __u64)(__be64)(x))
 |   ^
   drivers/scsi/ibmvscsi/ibmvfc.h:818:4: note: in expansion of macro 'dev_err'
 818 |dev_err((vhost)->dev, ##__VA_ARGS__); \
 |^~~
   drivers/scsi/ibmvscsi/ibmvfc.c:2668:2: note: in expansion of macro 
'ibmvfc_log'
2668 |  ibmvfc_log(vhost, desc->log_level, "%s event received. scsi_id: 
%llx, wwpn: %llx,"
 |  ^~
>> drivers/scsi/ibmvscsi/ibmvfc.c:2670:45: error: 'struct i

[RESEND][PATCH 7/7] parisc: Avoid overflow at boundary_size

2020-08-31 Thread Nicolin Chen
The boundary_size might be as large as ULONG_MAX, which means
that a device has no specific boundary limit. So either "+ 1"
or passing it to ALIGN() would potentially overflow.

According to kernel defines:
#define ALIGN_MASK(x, mask) (((x) + (mask)) & ~(mask))
#define ALIGN(x, a) ALIGN_MASK(x, (typeof(x))(a) - 1)

We can simplify the logic here:
  ALIGN(boundary + 1, 1 << shift) >> shift
= ALIGN_MASK(b + 1, (1 << s) - 1) >> s
= {[b + 1 + (1 << s) - 1] & ~[(1 << s) - 1]} >> s
= [b + 1 + (1 << s) - 1] >> s
= [b + (1 << s)] >> s
= (b >> s) + 1

So fixing a potential overflow with the safer shortcut.

Signed-off-by: Nicolin Chen 
Cc: Christoph Hellwig 
---
 drivers/parisc/ccio-dma.c  | 4 ++--
 drivers/parisc/sba_iommu.c | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/parisc/ccio-dma.c b/drivers/parisc/ccio-dma.c
index a5507f75b524..c667d6aba764 100644
--- a/drivers/parisc/ccio-dma.c
+++ b/drivers/parisc/ccio-dma.c
@@ -356,8 +356,8 @@ ccio_alloc_range(struct ioc *ioc, struct device *dev, 
size_t size)
** ggg sacrifices another 710 to the computer gods.
*/
 
-   boundary_size = ALIGN((unsigned long long)dma_get_seg_boundary(dev) + 1,
- 1ULL << IOVP_SHIFT) >> IOVP_SHIFT;
+   /* Overflow-free shortcut for: ALIGN(b + 1, 1 << s) >> s */
+   boundary_size = (dma_get_seg_boundary(dev) >> IOVP_SHIFT) + 1;
 
if (pages_needed <= 8) {
/*
diff --git a/drivers/parisc/sba_iommu.c b/drivers/parisc/sba_iommu.c
index d4314fba0269..96bc2c617cbd 100644
--- a/drivers/parisc/sba_iommu.c
+++ b/drivers/parisc/sba_iommu.c
@@ -342,8 +342,8 @@ sba_search_bitmap(struct ioc *ioc, struct device *dev,
unsigned long shift;
int ret;
 
-   boundary_size = ALIGN((unsigned long long)dma_get_seg_boundary(dev) + 1,
- 1ULL << IOVP_SHIFT) >> IOVP_SHIFT;
+   /* Overflow-free shortcut for: ALIGN(b + 1, 1 << s) >> s */
+   boundary_size = (dma_get_seg_boundary(dev) >> IOVP_SHIFT) + 1;
 
 #if defined(ZX1_SUPPORT)
BUG_ON(ioc->ibase & ~IOVP_MASK);
-- 
2.17.1



[RESEND][PATCH 6/7] x86/amd_gart: Avoid overflow at boundary_size

2020-08-31 Thread Nicolin Chen
The boundary_size might be as large as ULONG_MAX, which means
that a device has no specific boundary limit. So either "+ 1"
or passing it to ALIGN() would potentially overflow.

According to kernel defines:
#define ALIGN_MASK(x, mask) (((x) + (mask)) & ~(mask))
#define ALIGN(x, a) ALIGN_MASK(x, (typeof(x))(a) - 1)

We can simplify the logic here:
  ALIGN(boundary + 1, 1 << shift) >> shift
= ALIGN_MASK(b + 1, (1 << s) - 1) >> s
= {[b + 1 + (1 << s) - 1] & ~[(1 << s) - 1]} >> s
= [b + 1 + (1 << s) - 1] >> s
= [b + (1 << s)] >> s
= (b >> s) + 1

So fixing a potential overflow with the safer shortcut.

Signed-off-by: Nicolin Chen 
Cc: Christoph Hellwig 
---
 arch/x86/kernel/amd_gart_64.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/amd_gart_64.c b/arch/x86/kernel/amd_gart_64.c
index e89031e9c847..7fa0bb490065 100644
--- a/arch/x86/kernel/amd_gart_64.c
+++ b/arch/x86/kernel/amd_gart_64.c
@@ -96,8 +96,8 @@ static unsigned long alloc_iommu(struct device *dev, int size,
 
base_index = ALIGN(iommu_bus_base & dma_get_seg_boundary(dev),
   PAGE_SIZE) >> PAGE_SHIFT;
-   boundary_size = ALIGN((u64)dma_get_seg_boundary(dev) + 1,
- PAGE_SIZE) >> PAGE_SHIFT;
+   /* Overflow-free shortcut for: ALIGN(b + 1, 1 << s) >> s */
+   boundary_size = (dma_get_seg_boundary(dev) >> PAGE_SHIFT) + 1;
 
spin_lock_irqsave(&iommu_bitmap_lock, flags);
offset = iommu_area_alloc(iommu_gart_bitmap, iommu_pages, next_bit,
-- 
2.17.1



[RESEND][PATCH 5/7] sparc: Avoid overflow at boundary_size

2020-08-31 Thread Nicolin Chen
The boundary_size might be as large as ULONG_MAX, which means
that a device has no specific boundary limit. So either "+ 1"
or passing it to ALIGN() would potentially overflow.

According to kernel defines:
#define ALIGN_MASK(x, mask) (((x) + (mask)) & ~(mask))
#define ALIGN(x, a) ALIGN_MASK(x, (typeof(x))(a) - 1)

We can simplify the logic here:
  ALIGN(boundary + 1, 1 << shift) >> shift
= ALIGN_MASK(b + 1, (1 << s) - 1) >> s
= {[b + 1 + (1 << s) - 1] & ~[(1 << s) - 1]} >> s
= [b + 1 + (1 << s) - 1] >> s
= [b + (1 << s)] >> s
= (b >> s) + 1

So fixing a potential overflow with the safer shortcut.

Signed-off-by: Nicolin Chen 
Cc: Christoph Hellwig 
---
 arch/sparc/kernel/iommu-common.c | 9 +++--
 arch/sparc/kernel/iommu.c| 4 ++--
 arch/sparc/kernel/pci_sun4v.c| 4 ++--
 3 files changed, 7 insertions(+), 10 deletions(-)

diff --git a/arch/sparc/kernel/iommu-common.c b/arch/sparc/kernel/iommu-common.c
index 59cb16691322..843e71894d04 100644
--- a/arch/sparc/kernel/iommu-common.c
+++ b/arch/sparc/kernel/iommu-common.c
@@ -166,13 +166,10 @@ unsigned long iommu_tbl_range_alloc(struct device *dev,
}
}
 
-   if (dev)
-   boundary_size = ALIGN(dma_get_seg_boundary(dev) + 1,
- 1 << iommu->table_shift);
-   else
-   boundary_size = ALIGN(1ULL << 32, 1 << iommu->table_shift);
+   boundary_size = dev ? dma_get_seg_boundary(dev) : U32_MAX;
 
-   boundary_size = boundary_size >> iommu->table_shift;
+   /* Overflow-free shortcut for: ALIGN(b + 1, 1 << s) >> s */
+   boundary_size = (boundary_size >> iommu->table_shift) + 1;
/*
 * if the skip_span_boundary_check had been set during init, we set
 * things up so that iommu_is_span_boundary() merely checks if the
diff --git a/arch/sparc/kernel/iommu.c b/arch/sparc/kernel/iommu.c
index 4ae7388b1bff..d981c37305ae 100644
--- a/arch/sparc/kernel/iommu.c
+++ b/arch/sparc/kernel/iommu.c
@@ -472,8 +472,8 @@ static int dma_4u_map_sg(struct device *dev, struct 
scatterlist *sglist,
outs->dma_length = 0;
 
max_seg_size = dma_get_max_seg_size(dev);
-   seg_boundary_size = ALIGN(dma_get_seg_boundary(dev) + 1,
- IO_PAGE_SIZE) >> IO_PAGE_SHIFT;
+   /* Overflow-free shortcut for: ALIGN(b + 1, 1 << s) >> s */
+   seg_boundary_size = (dma_get_seg_boundary(dev) >> IO_PAGE_SHIFT) + 1;
base_shift = iommu->tbl.table_map_base >> IO_PAGE_SHIFT;
for_each_sg(sglist, s, nelems, i) {
unsigned long paddr, npages, entry, out_entry = 0, slen;
diff --git a/arch/sparc/kernel/pci_sun4v.c b/arch/sparc/kernel/pci_sun4v.c
index 14b93c5564e3..233fe8a20cec 100644
--- a/arch/sparc/kernel/pci_sun4v.c
+++ b/arch/sparc/kernel/pci_sun4v.c
@@ -508,8 +508,8 @@ static int dma_4v_map_sg(struct device *dev, struct 
scatterlist *sglist,
iommu_batch_start(dev, prot, ~0UL);
 
max_seg_size = dma_get_max_seg_size(dev);
-   seg_boundary_size = ALIGN(dma_get_seg_boundary(dev) + 1,
- IO_PAGE_SIZE) >> IO_PAGE_SHIFT;
+   /* Overflow-free shortcut for: ALIGN(b + 1, 1 << s) >> s */
+   seg_boundary_size = (dma_get_seg_boundary(dev) >> IO_PAGE_SHIFT) + 1;
 
mask = *dev->dma_mask;
if (!iommu_use_atu(iommu, mask))
-- 
2.17.1



[RESEND][PATCH 4/7] s390/pci_dma: Avoid overflow at boundary_size

2020-08-31 Thread Nicolin Chen
The boundary_size might be as large as ULONG_MAX, which means
that a device has no specific boundary limit. So either "+ 1"
or passing it to ALIGN() would potentially overflow.

According to kernel defines:
#define ALIGN_MASK(x, mask) (((x) + (mask)) & ~(mask))
#define ALIGN(x, a) ALIGN_MASK(x, (typeof(x))(a) - 1)

We can simplify the logic here:
  ALIGN(boundary + 1, 1 << shift) >> shift
= ALIGN_MASK(b + 1, (1 << s) - 1) >> s
= {[b + 1 + (1 << s) - 1] & ~[(1 << s) - 1]} >> s
= [b + 1 + (1 << s) - 1] >> s
= [b + (1 << s)] >> s
= (b >> s) + 1

So fixing a potential overflow with the safer shortcut.

Signed-off-by: Nicolin Chen 
Cc: Christoph Hellwig 
---
 arch/s390/pci/pci_dma.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/s390/pci/pci_dma.c b/arch/s390/pci/pci_dma.c
index 64b1399a73f0..ecb067acc6d5 100644
--- a/arch/s390/pci/pci_dma.c
+++ b/arch/s390/pci/pci_dma.c
@@ -263,8 +263,8 @@ static unsigned long __dma_alloc_iommu(struct device *dev,
struct zpci_dev *zdev = to_zpci(to_pci_dev(dev));
unsigned long boundary_size;
 
-   boundary_size = ALIGN(dma_get_seg_boundary(dev) + 1,
- PAGE_SIZE) >> PAGE_SHIFT;
+   /* Overflow-free shortcut for: ALIGN(b + 1, 1 << s) >> s */
+   boundary_size = (dma_get_seg_boundary(dev) >> PAGE_SHIFT) + 1;
return iommu_area_alloc(zdev->iommu_bitmap, zdev->iommu_pages,
start, size, zdev->start_dma >> PAGE_SHIFT,
boundary_size, 0);
-- 
2.17.1



[RESEND][PATCH 0/7] Avoid overflow at boundary_size

2020-08-31 Thread Nicolin Chen
 For this resend 
The original series has not had any patch acked. So I am
resending it, as suggested by Niklas.

 Coverletter 
We are extending the default DMA segmentation boundary to its
possible maximum value (ULONG_MAX) to indicate that a device
doesn't specify a boundary limit. So all dma_get_seg_boundary()
callers should take precautions with the return value, since it
can easily overflow.

I scanned the entire kernel tree for all the existing callers
and found that most of the callers may overflow in two ways:
either "+ 1" or passing it to ALIGN(), which does "+ mask".

According to kernel defines:
#define ALIGN_MASK(x, mask) (((x) + (mask)) & ~(mask))
#define ALIGN(x, a) ALIGN_MASK(x, (typeof(x))(a) - 1)

We can simplify the logic here:
  ALIGN(boundary + 1, 1 << shift) >> shift
= ALIGN_MASK(b + 1, (1 << s) - 1) >> s
= {[b + 1 + (1 << s) - 1] & ~[(1 << s) - 1]} >> s
= [b + 1 + (1 << s) - 1] >> s
= [b + (1 << s)] >> s
= (b >> s) + 1

So this series of patches fixes the potential overflow with this
overflow-free shortcut.
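
For illustration, a minimal userspace sketch (assuming the ALIGN() definition
quoted above and a 12-bit shift as a stand-in for the various page shifts)
shows the wrap-around and the shortcut side by side:

#include <stdio.h>
#include <limits.h>

#define ALIGN_MASK(x, mask)	(((x) + (mask)) & ~(mask))
#define ALIGN(x, a)		ALIGN_MASK(x, (__typeof__(x))(a) - 1)
#define SHIFT			12	/* e.g. a 4K page shift */

int main(void)
{
	unsigned long boundary = ULONG_MAX;	/* "no boundary limit" */
	unsigned long old   = ALIGN(boundary + 1, 1UL << SHIFT) >> SHIFT;
	unsigned long fixed = (boundary >> SHIFT) + 1;

	printf("old:   %lu\n", old);	/* 0 -- boundary + 1 wrapped around */
	printf("fixed: %lu\n", fixed);	/* the intended number of pages */
	return 0;
}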

As I don't have these platforms, testing/comments are welcome.

Thanks
Nic

Nicolin Chen (7):
  powerpc/iommu: Avoid overflow at boundary_size
  alpha: Avoid overflow at boundary_size
  ia64/sba_iommu: Avoid overflow at boundary_size
  s390/pci_dma: Avoid overflow at boundary_size
  sparc: Avoid overflow at boundary_size
  x86/amd_gart: Avoid overflow at boundary_size
  parisc: Avoid overflow at boundary_size

 arch/alpha/kernel/pci_iommu.c| 10 --
 arch/ia64/hp/common/sba_iommu.c  |  4 ++--
 arch/powerpc/kernel/iommu.c  | 11 +--
 arch/s390/pci/pci_dma.c  |  4 ++--
 arch/sparc/kernel/iommu-common.c |  9 +++--
 arch/sparc/kernel/iommu.c|  4 ++--
 arch/sparc/kernel/pci_sun4v.c|  4 ++--
 arch/x86/kernel/amd_gart_64.c|  4 ++--
 drivers/parisc/ccio-dma.c|  4 ++--
 drivers/parisc/sba_iommu.c   |  4 ++--
 10 files changed, 26 insertions(+), 32 deletions(-)

-- 
2.17.1



[RESEND][PATCH 3/7] ia64/sba_iommu: Avoid overflow at boundary_size

2020-08-31 Thread Nicolin Chen
The boundary_size might be as large as ULONG_MAX, which means
that a device has no specific boundary limit. So either "+ 1"
or passing it to ALIGN() would potentially overflow.

According to kernel defines:
#define ALIGN_MASK(x, mask) (((x) + (mask)) & ~(mask))
#define ALIGN(x, a) ALIGN_MASK(x, (typeof(x))(a) - 1)

We can simplify the logic here:
  ALIGN(boundary + 1, 1 << shift) >> shift
= ALIGN_MASK(b + 1, (1 << s) - 1) >> s
= {[b + 1 + (1 << s) - 1] & ~[(1 << s) - 1]} >> s
= [b + 1 + (1 << s) - 1] >> s
= [b + (1 << s)] >> s
= (b >> s) + 1

So fixing a potential overflow with the safer shortcut.

Signed-off-by: Nicolin Chen 
Cc: Christoph Hellwig 
---
 arch/ia64/hp/common/sba_iommu.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/ia64/hp/common/sba_iommu.c b/arch/ia64/hp/common/sba_iommu.c
index 656a4888c300..945954903bb0 100644
--- a/arch/ia64/hp/common/sba_iommu.c
+++ b/arch/ia64/hp/common/sba_iommu.c
@@ -485,8 +485,8 @@ sba_search_bitmap(struct ioc *ioc, struct device *dev,
ASSERT(((unsigned long) ioc->res_hint & (sizeof(unsigned long) - 1UL)) 
== 0);
ASSERT(res_ptr < res_end);
 
-   boundary_size = (unsigned long long)dma_get_seg_boundary(dev) + 1;
-   boundary_size = ALIGN(boundary_size, 1ULL << iovp_shift) >> iovp_shift;
+   /* Overflow-free shortcut for: ALIGN(b + 1, 1 << s) >> s */
+   boundary_size = (dma_get_seg_boundary(dev) >> iovp_shift) + 1;
 
BUG_ON(ioc->ibase & ~iovp_mask);
shift = ioc->ibase >> iovp_shift;
-- 
2.17.1



[RESEND][PATCH 1/7] powerpc/iommu: Avoid overflow at boundary_size

2020-08-31 Thread Nicolin Chen
The boundary_size might be as large as ULONG_MAX, which means
that a device has no specific boundary limit. So either "+ 1"
or passing it to ALIGN() would potentially overflow.

According to kernel defines:
#define ALIGN_MASK(x, mask) (((x) + (mask)) & ~(mask))
#define ALIGN(x, a) ALIGN_MASK(x, (typeof(x))(a) - 1)

We can simplify the logic here:
  ALIGN(boundary + 1, 1 << shift) >> shift
= ALIGN_MASK(b + 1, (1 << s) - 1) >> s
= {[b + 1 + (1 << s) - 1] & ~[(1 << s) - 1]} >> s
= [b + 1 + (1 << s) - 1] >> s
= [b + (1 << s)] >> s
= (b >> s) + 1

So fixing a potential overflow with the safer shortcut.

Reported-by: Stephen Rothwell 
Signed-off-by: Nicolin Chen 
Cc: Christoph Hellwig 
---
 arch/powerpc/kernel/iommu.c | 11 +--
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 9704f3f76e63..c01ccbf8afdd 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -236,15 +236,14 @@ static unsigned long iommu_range_alloc(struct device *dev,
}
}
 
-   if (dev)
-   boundary_size = ALIGN(dma_get_seg_boundary(dev) + 1,
- 1 << tbl->it_page_shift);
-   else
-   boundary_size = ALIGN(1UL << 32, 1 << tbl->it_page_shift);
/* 4GB boundary for iseries_hv_alloc and iseries_hv_map */
+   boundary_size = dev ? dma_get_seg_boundary(dev) : U32_MAX;
+
+   /* Overflow-free shortcut for: ALIGN(b + 1, 1 << s) >> s */
+   boundary_size = (boundary_size >> tbl->it_page_shift) + 1;
 
n = iommu_area_alloc(tbl->it_map, limit, start, npages, tbl->it_offset,
-boundary_size >> tbl->it_page_shift, align_mask);
+boundary_size, align_mask);
if (n == -1) {
if (likely(pass == 0)) {
/* First try the pool from the start */
-- 
2.17.1



[RESEND][PATCH 2/7] alpha: Avoid overflow at boundary_size

2020-08-31 Thread Nicolin Chen
The boundary_size might be as large as ULONG_MAX, which means
that a device has no specific boundary limit. So "+ 1" would
potentially overflow.

Also, by following other places in the kernel, boundary_size
should align with the PAGE_SIZE before right shifting by the
PAGE_SHIFT. However, passing it to ALIGN() would potentially
overflow too.

According to kernel defines:
#define ALIGN_MASK(x, mask) (((x) + (mask)) & ~(mask))
#define ALIGN(x, a) ALIGN_MASK(x, (typeof(x))(a) - 1)

We can simplify the logic here:
  ALIGN(boundary + 1, 1 << shift) >> shift
= ALIGN_MASK(b + 1, (1 << s) - 1) >> s
= {[b + 1 + (1 << s) - 1] & ~[(1 << s) - 1]} >> s
= [b + 1 + (1 << s) - 1] >> s
= [b + (1 << s)] >> s
= (b >> s) + 1

So fixing a potential overflow with the safer shortcut.

Signed-off-by: Nicolin Chen 
Cc: Christoph Hellwig 
---
 arch/alpha/kernel/pci_iommu.c | 10 --
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/arch/alpha/kernel/pci_iommu.c b/arch/alpha/kernel/pci_iommu.c
index 81037907268d..1ef2c647bd3e 100644
--- a/arch/alpha/kernel/pci_iommu.c
+++ b/arch/alpha/kernel/pci_iommu.c
@@ -141,12 +141,10 @@ iommu_arena_find_pages(struct device *dev, struct 
pci_iommu_arena *arena,
unsigned long boundary_size;
 
base = arena->dma_base >> PAGE_SHIFT;
-   if (dev) {
-   boundary_size = dma_get_seg_boundary(dev) + 1;
-   boundary_size >>= PAGE_SHIFT;
-   } else {
-   boundary_size = 1UL << (32 - PAGE_SHIFT);
-   }
+
+   boundary_size = dev ? dma_get_seg_boundary(dev) : U32_MAX;
+   /* Overflow-free shortcut for: ALIGN(b + 1, 1 << s) >> s */
+   boundary_size = (boundary_size >> PAGE_SHIFT) + 1;
 
/* Search forward for the first mask-aligned sequence of N free ptes */
ptes = arena->ptes;
-- 
2.17.1



[Bug 209029] kernel 5.9-rc2 fails to boot on a PowerMac G5 11,2 - BUG: Kernel NULL pointer dereference on read at 0x00000020

2020-08-31 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=209029

--- Comment #2 from Erhard F. (erhar...@mailbox.org) ---
No change with 5.9-rc3.

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

Re: [PATCH net-next] net/wan/fsl_ucc_hdlc: Add MODULE_DESCRIPTION

2020-08-31 Thread David Miller
From: YueHaibing 
Date: Sat, 29 Aug 2020 19:58:23 +0800

> Add missing MODULE_DESCRIPTION.
> 
> Signed-off-by: YueHaibing 

Applied.


[RFC PATCH 2/2] powerpc/kernel/iommu: Introduce IOMMU DMA pagecache

2020-08-31 Thread Leonardo Bras
In pseries, mapping a DMA page for a cpu memory range requires a
H_PUT_TCE* hypercall, and unmapping requires a H_STUFF_TCE hypercall.
When doing a lot of I/O, a thread can spend a lot of time doing such
hcalls, especially when DMA mappings don't get reused.

The purpose of this change is to introduce a mechanism similar to
a pagecache, but for reusing DMA mappings, improving performance and
avoiding multiple DMA mappings of the same cpupage.

This works based on a few current behaviors:
- Page reuse:
It's common for a userspace process to reuse the same page for several
allocations. This is probably caused by page buffering behavior that
comes from libc, since getting pages from the kernel is more expensive.

- A lot of small allocations are used:
Some workloads do a lot of allocations that do not fill a whole page
causing multiple DMA mappings for a single page.

How it works (a simplified sketch follows this list):
- When a DMA mapping is required for a given allocation, it first searches
  the DMA pagecache for a matching mapping. When found, the refcount is
  incremented; when not found, a new mapping is created.
- Every time a new mapping is created, it's added to the DMA pagecache,
  with refcount = 1.
- When the mapping is not needed anymore (iommu_free()), it looks up the
  entry in the DMA pagecache and decrements its refcount.
- When there are more than IOMMU_MAP_LIST_MAX entries in the
  DMA pagecache, it removes the oldest ones.
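
Below is a simplified, self-contained userspace model of that flow. It is an
illustration only: the map_page()/unmap_page() helpers are hypothetical and
stand in for the real iommu_dmacache_* code, which uses two XArrays, a
lockless FIFO and also keys on the DMA direction.

#include <stdio.h>

#define CACHE_SLOTS	8

struct mapping {
	unsigned long cpu_page;		/* lookup key */
	unsigned long dma_page;		/* 0 == unused slot */
	int refcount;
};

static struct mapping cache[CACHE_SLOTS];
static unsigned long next_dma = 0x1000;	/* stand-in for a real H_PUT_TCE mapping */

static unsigned long map_page(unsigned long cpu_page)
{
	int i;

	for (i = 0; i < CACHE_SLOTS; i++)
		if (cache[i].dma_page && cache[i].cpu_page == cpu_page) {
			cache[i].refcount++;	/* hit: reuse the existing mapping */
			return cache[i].dma_page;
		}
	for (i = 0; i < CACHE_SLOTS; i++)
		if (!cache[i].dma_page) {	/* miss: map and insert with refcount = 1 */
			cache[i] = (struct mapping){ cpu_page, next_dma, 1 };
			return next_dma++;
		}
	return 0;	/* full: the real code evicts old refcount == 0 entries (FIFO) */
}

static void unmap_page(unsigned long dma_page)
{
	int i;

	for (i = 0; i < CACHE_SLOTS; i++)
		if (cache[i].dma_page == dma_page)
			cache[i].refcount--;	/* entry lingers so it can be reused */
}

int main(void)
{
	unsigned long a = map_page(0xaaaa);	/* miss: creates the mapping */
	unsigned long b = map_page(0xaaaa);	/* hit: same DMA page, refcount = 2 */

	printf("%lx %lx\n", a, b);		/* prints the same address twice */
	unmap_page(a);
	unmap_page(b);
	return 0;
}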

What is inside:
- 1 XArray which indexes the DMA page addresses, used for removing
  mappings and decreasing refcounts.
- 1 XArray which indexes CPU page addresses, for finding matching mappings.
  - As there can be multiple mappings (directions) for the same cpupage,
   this one keeps a llist for looking into the entries.
- Every entry (page) in the XArray points to the mapping struct.

- The mapping struct has:
  - DMA & CPU page addresses, size (pages), direction and refcount.
  - 1 llist used for multiple mappings at the same cpupage
  - 1 llist used for the FIFO (removing old unused entries)

- The cache struct, added to iommu_table, has:
  - Both XArrays,
  - 2 llist for entry/removal point on FIFO,
  - 1 Atomic Cachesize, to manage the FIFO.

Signed-off-by: Leonardo Bras 
---
 arch/powerpc/include/asm/iommu-cache.h |  31 
 arch/powerpc/include/asm/iommu.h   |   4 +
 arch/powerpc/kernel/Makefile   |   2 +-
 arch/powerpc/kernel/iommu-cache.c  | 247 +
 arch/powerpc/kernel/iommu.c|  15 +-
 5 files changed, 293 insertions(+), 6 deletions(-)
 create mode 100644 arch/powerpc/include/asm/iommu-cache.h
 create mode 100644 arch/powerpc/kernel/iommu-cache.c

diff --git a/arch/powerpc/include/asm/iommu-cache.h 
b/arch/powerpc/include/asm/iommu-cache.h
new file mode 100644
index ..ad298a4cd9c9
--- /dev/null
+++ b/arch/powerpc/include/asm/iommu-cache.h
@@ -0,0 +1,31 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _IOMMU_CACHE_H
+#define _IOMMU_CACHE_H
+#ifdef __KERNEL__
+
+#include 
+#include 
+#include 
+
+struct dmacache {
+   struct llist_head fifo_add;
+   struct llist_head fifo_del;
+   struct xarray cpupages;
+   struct xarray dmapages;
+   atomic64_t cachesize;
+};
+
+#include 
+
+void iommu_cache_init(struct iommu_table *tbl);
+void iommu_dmacache_add(struct iommu_table *tbl, void *page, unsigned int 
npages, dma_addr_t addr,
+   enum dma_data_direction direction);
+dma_addr_t iommu_dmacache_use(struct iommu_table *tbl, void *page, unsigned 
int npages,
+ enum dma_data_direction direction);
+void iommu_dmacache_free(struct iommu_table *tbl, dma_addr_t dma_handle, 
unsigned int npages);
+
+#define IOMMU_MAP_LIST_MAX 8192
+#define IOMMU_MAP_LIST_THRES   128
+
+#endif /* __KERNEL__ */
+#endif /* _IOMMU_CACHE_H */
diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 2913e5c8b1f8..51a2f5503f8e 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -18,6 +18,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define IOMMU_PAGE_SHIFT_4K  12
 #define IOMMU_PAGE_SIZE_4K   (ASM_CONST(1) << IOMMU_PAGE_SHIFT_4K)
@@ -114,6 +115,7 @@ struct iommu_table {
int it_nid;
unsigned long it_reserved_start; /* Start of not-DMA-able (MMIO) area */
unsigned long it_reserved_end;
+   struct dmacache cache;
 };
 
 #define IOMMU_TABLE_USERSPACE_ENTRY_RO(tbl, entry) \
@@ -317,6 +319,8 @@ extern void iommu_release_ownership(struct iommu_table 
*tbl);
 extern enum dma_data_direction iommu_tce_direction(unsigned long tce);
 extern unsigned long iommu_direction_to_tce_perm(enum dma_data_direction dir);
 
+void __iommu_free(struct iommu_table *tbl, dma_addr_t dma_addr, unsigned int 
npages);
+
 #ifdef CONFIG_PPC_CELL_NATIVE
 extern bool iommu_fixed_is_weak;
 #else
diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
index cbf41fb4ee89..62f6c90076d9 100644
---

[RFC PATCH 1/2] dma-direction: Add DMA_DIR_COMPAT() macro to test direction compability

2020-08-31 Thread Leonardo Bras
Given an existing mapping with 'current' direction, and a 'wanted'
direction for using that mapping, check if 'wanted' is satisfied by
'current'.

current accepts
DMA_BIDIRECTIONAL   DMA_BIDIRECTIONAL, DMA_TO_DEVICE,
DMA_FROM_DEVICE, DMA_NONE
DMA_TO_DEVICE   DMA_TO_DEVICE, DMA_NONE
DMA_FROM_DEVICE DMA_FROM_DEVICE, DMA_NONE
DMA_NONEDMA_NONE

This macro is useful for checking if a DMA mapping can be reused.
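
For illustration, a small userspace sketch (enum values mirroring
include/linux/dma-direction.h) exercises the table above:

#include <stdio.h>

enum dma_data_direction {
	DMA_BIDIRECTIONAL = 0,
	DMA_TO_DEVICE = 1,
	DMA_FROM_DEVICE = 2,
	DMA_NONE = 3,
};

#define DMA_DIR_COMPAT(current, wanted)	(((current) & ~(wanted)) == 0)

int main(void)
{
	/* A bidirectional mapping satisfies any wanted direction. */
	printf("%d\n", DMA_DIR_COMPAT(DMA_BIDIRECTIONAL, DMA_FROM_DEVICE));	/* 1 */
	/* A TO_DEVICE mapping can be reused for TO_DEVICE or DMA_NONE... */
	printf("%d\n", DMA_DIR_COMPAT(DMA_TO_DEVICE, DMA_NONE));		/* 1 */
	/* ...but not for FROM_DEVICE. */
	printf("%d\n", DMA_DIR_COMPAT(DMA_TO_DEVICE, DMA_FROM_DEVICE));	/* 0 */
	return 0;
}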

Signed-off-by: Leonardo Bras 
---
 include/linux/dma-direction.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/include/linux/dma-direction.h b/include/linux/dma-direction.h
index 9c96e30e6a0b..caf3943a21f4 100644
--- a/include/linux/dma-direction.h
+++ b/include/linux/dma-direction.h
@@ -9,4 +9,7 @@ enum dma_data_direction {
DMA_NONE = 3,
 };
 
+/* Checks if wanted direction is satisfied by current mapping direction*/
+#define DMA_DIR_COMPAT(current, wanted)(((current) & ~(wanted)) == 0)
+
 #endif
-- 
2.25.4



[RFC PATCH 0/2] DMA pagecache

2020-08-31 Thread Leonardo Bras
This RFC improves the performance of indirect mapping in all tested DMA
usages, based on an mlx5 device, ranging from 64k packets to 1-byte
packets, and from 1 thread to 64 threads.

In all workloads tested, the performance of indirect mapping gets very
close to the direct mapping case.

The whole thing is designed to have as much performance as possible, so
the impact of the pagecache is not too big.

As I am not very experienced with XArray usage, nor with lockless
algorithms, I would especially appreciate feedback on possible
failures in its usage, missing barriers, and so on.

Also, this size for the FIFO is just for testing purposes.
It's also very possible that it will not be a good idea on platforms
other than pseries (I have not tested them).
I can plan a bypass for those cases without much work.

Thank you!

Leonardo Bras (2):
  dma-direction: Add DMA_DIR_COMPAT() macro to test direction
compability
  powerpc/kernel/iommu: Introduce IOMMU DMA pagecache

 arch/powerpc/include/asm/iommu-cache.h |  31 
 arch/powerpc/include/asm/iommu.h   |   4 +
 arch/powerpc/kernel/Makefile   |   2 +-
 arch/powerpc/kernel/iommu-cache.c  | 247 +
 arch/powerpc/kernel/iommu.c|  15 +-
 include/linux/dma-direction.h  |   3 +
 6 files changed, 296 insertions(+), 6 deletions(-)
 create mode 100644 arch/powerpc/include/asm/iommu-cache.h
 create mode 100644 arch/powerpc/kernel/iommu-cache.c

-- 
2.25.4



[PATCH] arch: vdso: add vdso linker script to 'targets' instead of extra-y

2020-08-31 Thread Masahiro Yamada
The vdso linker script is preprocessed on demand.
Adding it to 'targets' is enough to include the .cmd file.

Signed-off-by: Masahiro Yamada 
---

 arch/arm64/kernel/vdso/Makefile | 2 +-
 arch/arm64/kernel/vdso32/Makefile   | 2 +-
 arch/nds32/kernel/vdso/Makefile | 2 +-
 arch/powerpc/kernel/vdso32/Makefile | 2 +-
 arch/powerpc/kernel/vdso64/Makefile | 2 +-
 arch/s390/kernel/vdso64/Makefile| 2 +-
 6 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/kernel/vdso/Makefile b/arch/arm64/kernel/vdso/Makefile
index 45d5cfe46429..7cd8aafbe96e 100644
--- a/arch/arm64/kernel/vdso/Makefile
+++ b/arch/arm64/kernel/vdso/Makefile
@@ -54,7 +54,7 @@ endif
 GCOV_PROFILE := n
 
 obj-y += vdso.o
-extra-y += vdso.lds
+targets += vdso.lds
 CPPFLAGS_vdso.lds += -P -C -U$(ARCH)
 
 # Force dependency (incbin is bad)
diff --git a/arch/arm64/kernel/vdso32/Makefile 
b/arch/arm64/kernel/vdso32/Makefile
index d6adb4677c25..572475b7b7ed 100644
--- a/arch/arm64/kernel/vdso32/Makefile
+++ b/arch/arm64/kernel/vdso32/Makefile
@@ -155,7 +155,7 @@ asm-obj-vdso := $(addprefix $(obj)/, $(asm-obj-vdso))
 obj-vdso := $(c-obj-vdso) $(c-obj-vdso-gettimeofday) $(asm-obj-vdso)
 
 obj-y += vdso.o
-extra-y += vdso.lds
+targets += vdso.lds
 CPPFLAGS_vdso.lds += -P -C -U$(ARCH)
 
 # Force dependency (vdso.s includes vdso.so through incbin)
diff --git a/arch/nds32/kernel/vdso/Makefile b/arch/nds32/kernel/vdso/Makefile
index 7c3c1ccb196e..55df25ef0057 100644
--- a/arch/nds32/kernel/vdso/Makefile
+++ b/arch/nds32/kernel/vdso/Makefile
@@ -20,7 +20,7 @@ GCOV_PROFILE := n
 
 
 obj-y += vdso.o
-extra-y += vdso.lds
+targets += vdso.lds
 CPPFLAGS_vdso.lds += -P -C -U$(ARCH)
 
 # Force dependency
diff --git a/arch/powerpc/kernel/vdso32/Makefile 
b/arch/powerpc/kernel/vdso32/Makefile
index 87ab1152d5ce..fd5072a4c73c 100644
--- a/arch/powerpc/kernel/vdso32/Makefile
+++ b/arch/powerpc/kernel/vdso32/Makefile
@@ -29,7 +29,7 @@ ccflags-y := -shared -fno-common -fno-builtin -nostdlib \
 asflags-y := -D__VDSO32__ -s
 
 obj-y += vdso32_wrapper.o
-extra-y += vdso32.lds
+targets += vdso32.lds
 CPPFLAGS_vdso32.lds += -P -C -Upowerpc
 
 # Force dependency (incbin is bad)
diff --git a/arch/powerpc/kernel/vdso64/Makefile 
b/arch/powerpc/kernel/vdso64/Makefile
index 38c317f25141..c737b3ea3207 100644
--- a/arch/powerpc/kernel/vdso64/Makefile
+++ b/arch/powerpc/kernel/vdso64/Makefile
@@ -17,7 +17,7 @@ ccflags-y := -shared -fno-common -fno-builtin -nostdlib \
 asflags-y := -D__VDSO64__ -s
 
 obj-y += vdso64_wrapper.o
-extra-y += vdso64.lds
+targets += vdso64.lds
 CPPFLAGS_vdso64.lds += -P -C -U$(ARCH)
 
 # Force dependency (incbin is bad)
diff --git a/arch/s390/kernel/vdso64/Makefile b/arch/s390/kernel/vdso64/Makefile
index 4a66a1cb919b..d0d406cfffa9 100644
--- a/arch/s390/kernel/vdso64/Makefile
+++ b/arch/s390/kernel/vdso64/Makefile
@@ -25,7 +25,7 @@ $(targets:%=$(obj)/%.dbg): KBUILD_CFLAGS = $(KBUILD_CFLAGS_64)
 $(targets:%=$(obj)/%.dbg): KBUILD_AFLAGS = $(KBUILD_AFLAGS_64)
 
 obj-y += vdso64_wrapper.o
-extra-y += vdso64.lds
+targets += vdso64.lds
 CPPFLAGS_vdso64.lds += -P -C -U$(ARCH)
 
 # Disable gcov profiling, ubsan and kasan for VDSO code
-- 
2.25.1



[PATCH] scsi: ibmvfc: interface updates for future FPIN and MQ support

2020-08-31 Thread Tyrel Datwyler
VIOS partitions with SLI-4 enabled Emulex adapters will be capable of
driving IO in parallel through multiple work queues or channels, and
with new hypervisor firmware that supports multiple interrupt sources
an ibmvfc NPIV single initiator can be modified to exploit end to end
channelization in a PowerVM environment.

VIOS hosts will also be able to expose fabric performance impact
notifications (FPIN) via a new asynchronous event to ibmvfc clients that
advertise support via IBMVFC_CAN_HANDLE_FPIN in their capabilities flag
during NPIV_LOGIN.

This patch introduces three new Management Datagrams (MADs) for
channelization support negotiation as well as the FPIN asynchronous
event and FPIN status flags. Follow up work is required to plumb the
ibmvfc client driver to use these new interfaces.

Signed-off-by: Tyrel Datwyler 
---
 drivers/scsi/ibmvscsi/ibmvfc.h | 66 +-
 1 file changed, 65 insertions(+), 1 deletion(-)

diff --git a/drivers/scsi/ibmvscsi/ibmvfc.h b/drivers/scsi/ibmvscsi/ibmvfc.h
index 907889f1fa9d..801681b63daa 100644
--- a/drivers/scsi/ibmvscsi/ibmvfc.h
+++ b/drivers/scsi/ibmvscsi/ibmvfc.h
@@ -124,6 +124,9 @@ enum ibmvfc_mad_types {
IBMVFC_PASSTHRU = 0x0200,
IBMVFC_TMF_MAD  = 0x0100,
IBMVFC_NPIV_LOGOUT  = 0x0800,
+   IBMVFC_CHANNEL_ENQUIRY  = 0x1000,
+   IBMVFC_CHANNEL_SETUP= 0x2000,
+   IBMVFC_CONNECTION_INFO  = 0x4000,
 };
 
 struct ibmvfc_mad_common {
@@ -162,6 +165,8 @@ struct ibmvfc_npiv_login {
__be32 max_cmds;
__be64 capabilities;
 #define IBMVFC_CAN_MIGRATE 0x01
+#define IBMVFC_CAN_USE_CHANNELS0x02
+#define IBMVFC_CAN_HANDLE_FPIN 0x04
__be64 node_name;
struct srp_direct_buf async;
u8 partition_name[IBMVFC_MAX_NAME];
@@ -204,6 +209,7 @@ struct ibmvfc_npiv_login_resp {
__be64 capabilities;
 #define IBMVFC_CAN_FLUSH_ON_HALT   0x08
 #define IBMVFC_CAN_SUPPRESS_ABTS   0x10
+#define IBMVFC_CAN_SUPPORT_CHANNELS0x20
__be32 max_cmds;
__be32 scsi_id_sz;
__be64 max_dma_len;
@@ -482,6 +488,52 @@ struct ibmvfc_passthru_mad {
struct ibmvfc_passthru_fc_iu fc_iu;
 }__attribute__((packed, aligned (8)));
 
+struct ibmvfc_channel_enquiry {
+   struct ibmvfc_mad_common common;
+   __be32 flags;
+#define IBMVFC_NO_CHANNELS_TO_CRQ_SUPPORT  0x01
+#define IBMVFC_SUPPORT_VARIABLE_SUBQ_MSG   0x02
+#define IBMVFC_NO_N_TO_M_CHANNELS_SUPPORT  0x04
+   __be32 num_scsi_subq_channels;
+   __be32 num_nvmeof_subq_channels;
+   __be32 num_scsi_vas_channels;
+   __be32 num nvmeof_vas_channels;
+}__attribute__((packed, aligned (8)));
+
+struct ibmvfc_channel_setup_mad {
+   struct ibmvfc_mad_common common;
+   struct srp_direct_buf buffer;
+}__attrribute__((packed, aligned (8)));
+
+#define IBMVFC_MAX_CHANNELS502
+
+struct ibmvfc_channel_setup {
+   __be32 flags;
+#define IBMVFC_CANCEL_CHANNELS 0x01
+#define IBMVFC_USE_BUFFER  0x02
+#define IBMVFC_CHANNELS_CANCELED   0x04
+   __be32 reserved;
+   __be32 num_scsi_subq_channels;
+   __be32 num_nvmeof_subq_channels;
+   __be32 num_scsi_vas_channels;
+   __be32 num_nvmeof_vas_channels;
+   struct srp_direct_buf buffer;
+   __be64 reserved2[5];
+   __be64 channel_handles[IBMVFC_MAX_CHANNELS];
+}__attribute__((packed, aligned (8)));
+
+struct ibmvfc_connection_info {
+   struct ibmvfc_madd_common common;
+   __be64 information_bits;
+#define IBMVFC_NO_FC_IO_CHANNEL0x01
+#define IBMVFC_NO_PHYP_VAS 0x02
+#define IBMVFC_NO_PHYP_SUBQ0x04
+#define IBMVFC_PHYP_DEPRECATED_SUBQ0x08
+#define IBMVFC_PHYP_PRESERVED_SUBQ 0x10
+#define IBMVFC_PHYP_FULL_SUBQ  0x20
+   __be64 reserved[16];
+}__attribute__((packed, aligned (8)));
+
 struct ibmvfc_trace_start_entry {
u32 xfer_len;
 }__attribute__((packed));
@@ -532,6 +584,7 @@ enum ibmvfc_async_event {
IBMVFC_AE_HALT  = 0x0400,
IBMVFC_AE_RESUME= 0x0800,
IBMVFC_AE_ADAPTER_FAILED= 0x1000,
+   IBMVFC_AE_FPIN  = 0x2000,
 };
 
 struct ibmvfc_async_desc {
@@ -560,10 +613,18 @@ enum ibmvfc_ae_link_state {
IBMVFC_AE_LS_LINK_DEAD  = 0x08,
 };
 
+enum ibmvfc_ae_fpin_status {
+   IBMVFC_AE_FPIN_LINK_CONGESTED   = 0x1,
+   IBMVFC_AE_FPIN_PORT_CONGESTED   = 0x2,
+   IBMVFC_AE_FPIN_PORT_CLEARED = 0x3,
+   IBMVFC_AE_FPIN_PORT_DEGRADED= 0x4,
+};
+
 struct ibmvfc_async_crq {
volatile u8 valid;
u8 link_state;
-   u8 pad[2];
+   u8 fpin_status
+   u8 pad;
__be32 pad2;
volatile __be64 event;
volatile __be64 scsi_id;
@@ -590,6 +651,9 @@ union ibmvfc_iu {
struct ibmvfc_tmf tmf;
struct ibmvfc_cmd cmd;
struct ibmvfc_passthru_mad passthru;
+   struct ibmvfc_channel_e

[PATCH AUTOSEL 5.4 23/23] fsldma: fix very broken 32-bit ppc ioread64 functionality

2020-08-31 Thread Sasha Levin
From: Linus Torvalds 

[ Upstream commit 0a4c56c80f90797e9b9f8426c6aae4c0cf1c9785 ]

Commit ef91bb196b0d ("kernel.h: Silence sparse warning in
lower_32_bits") caused new warnings to show in the fsldma driver, but
that commit was not to blame: it only exposed some very incorrect code
that tried to take the low 32 bits of an address.

That made no sense for multiple reasons, the most notable one being that
that code was intentionally limited to only 32-bit ppc builds, so "only
low 32 bits of an address" was completely nonsensical.  There were no
high bits to mask off to begin with.

But even more importantly from a correctness standpoint, turning the
address into an integer then caused the subsequent address arithmetic to
be completely wrong too, and the "+1" actually incremented the address
by one, rather than by four.

Which again was incorrect, since the code was reading two 32-bit values
and trying to make a 64-bit end result of it all.  Surprisingly, the
iowrite64() did not suffer from the same odd and incorrect model.

This code has never worked, but it's questionable whether anybody cared:
of the two users that actually read the 64-bit value (by way of some C
preprocessor hackery and eventually the 'get_cdar()' inline function),
one of them explicitly ignored the value, and the other one might just
happen to work despite the incorrect value being read.

This patch at least makes it not fail the build any more, and makes the
logic superficially sane.  Whether it makes any difference to the code
_working_ or not shall remain a mystery.

Compile-tested-by: Guenter Roeck 
Reviewed-by: Guenter Roeck 
Signed-off-by: Linus Torvalds 
Signed-off-by: Sasha Levin 
---
 drivers/dma/fsldma.h | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/dma/fsldma.h b/drivers/dma/fsldma.h
index 56f18ae992332..308bed0a560ac 100644
--- a/drivers/dma/fsldma.h
+++ b/drivers/dma/fsldma.h
@@ -205,10 +205,10 @@ struct fsldma_chan {
 #else
 static u64 fsl_ioread64(const u64 __iomem *addr)
 {
-   u32 fsl_addr = lower_32_bits(addr);
-   u64 fsl_addr_hi = (u64)in_le32((u32 *)(fsl_addr + 1)) << 32;
+   u32 val_lo = in_le32((u32 __iomem *)addr);
+   u32 val_hi = in_le32((u32 __iomem *)addr + 1);
 
-   return fsl_addr_hi | in_le32((u32 *)fsl_addr);
+   return ((u64)val_hi << 32) + val_lo;
 }
 
 static void fsl_iowrite64(u64 val, u64 __iomem *addr)
@@ -219,10 +219,10 @@ static void fsl_iowrite64(u64 val, u64 __iomem *addr)
 
 static u64 fsl_ioread64be(const u64 __iomem *addr)
 {
-   u32 fsl_addr = lower_32_bits(addr);
-   u64 fsl_addr_hi = (u64)in_be32((u32 *)fsl_addr) << 32;
+   u32 val_hi = in_be32((u32 __iomem *)addr);
+   u32 val_lo = in_be32((u32 __iomem *)addr + 1);
 
-   return fsl_addr_hi | in_be32((u32 *)(fsl_addr + 1));
+   return ((u64)val_hi << 32) + val_lo;
 }
 
 static void fsl_iowrite64be(u64 val, u64 __iomem *addr)
-- 
2.25.1
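
To make the pointer-arithmetic pitfall described in the commit message concrete, here is a minimal stand-alone sketch (user-space C, not driver code): once the address is cast to an integer, "+ 1" advances by one byte, whereas on a u32 pointer it advances by four bytes, which is what the fixed fsl_ioread64() relies on.

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint32_t words[2] = { 0x11223344, 0x55667788 };	/* two halves of a 64-bit register */
	uint32_t *addr = words;

	/* Broken pattern from the old code: do the arithmetic on an integer. */
	uintptr_t as_int = (uintptr_t)addr;
	printf("integer + 1 advances by %lu byte(s)\n",
	       (unsigned long)((as_int + 1) - as_int));			/* 1 */

	/* Fixed pattern: keep it a pointer, so + 1 advances by sizeof(uint32_t). */
	printf("pointer + 1 advances by %lu byte(s)\n",
	       (unsigned long)((uintptr_t)(addr + 1) - (uintptr_t)addr));	/* 4 */

	/* Combine the two 32-bit halves the way the fixed fsl_ioread64() does. */
	uint32_t val_lo = *addr;
	uint32_t val_hi = *(addr + 1);
	printf("combined: 0x%016llx\n", ((unsigned long long)val_hi << 32) + val_lo);
	return 0;
}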



[PATCH AUTOSEL 5.8 42/42] fsldma: fix very broken 32-bit ppc ioread64 functionality

2020-08-31 Thread Sasha Levin
From: Linus Torvalds 

[ Upstream commit 0a4c56c80f90797e9b9f8426c6aae4c0cf1c9785 ]

Commit ef91bb196b0d ("kernel.h: Silence sparse warning in
lower_32_bits") caused new warnings to show in the fsldma driver, but
that commit was not to blame: it only exposed some very incorrect code
that tried to take the low 32 bits of an address.

That made no sense for multiple reasons, the most notable one being that
that code was intentionally limited to only 32-bit ppc builds, so "only
low 32 bits of an address" was completely nonsensical.  There were no
high bits to mask off to begin with.

But even more importantly from a correctness standpoint, turning the
address into an integer then caused the subsequent address arithmetic to
be completely wrong too, and the "+1" actually incremented the address
by one, rather than by four.

Which again was incorrect, since the code was reading two 32-bit values
and trying to make a 64-bit end result of it all.  Surprisingly, the
iowrite64() did not suffer from the same odd and incorrect model.

This code has never worked, but it's questionable whether anybody cared:
of the two users that actually read the 64-bit value (by way of some C
preprocessor hackery and eventually the 'get_cdar()' inline function),
one of them explicitly ignored the value, and the other one might just
happen to work despite the incorrect value being read.

This patch at least makes it not fail the build any more, and makes the
logic superficially sane.  Whether it makes any difference to the code
_working_ or not shall remain a mystery.

Compile-tested-by: Guenter Roeck 
Reviewed-by: Guenter Roeck 
Signed-off-by: Linus Torvalds 
Signed-off-by: Sasha Levin 
---
 drivers/dma/fsldma.h | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/dma/fsldma.h b/drivers/dma/fsldma.h
index 56f18ae992332..308bed0a560ac 100644
--- a/drivers/dma/fsldma.h
+++ b/drivers/dma/fsldma.h
@@ -205,10 +205,10 @@ struct fsldma_chan {
 #else
 static u64 fsl_ioread64(const u64 __iomem *addr)
 {
-   u32 fsl_addr = lower_32_bits(addr);
-   u64 fsl_addr_hi = (u64)in_le32((u32 *)(fsl_addr + 1)) << 32;
+   u32 val_lo = in_le32((u32 __iomem *)addr);
+   u32 val_hi = in_le32((u32 __iomem *)addr + 1);
 
-   return fsl_addr_hi | in_le32((u32 *)fsl_addr);
+   return ((u64)val_hi << 32) + val_lo;
 }
 
 static void fsl_iowrite64(u64 val, u64 __iomem *addr)
@@ -219,10 +219,10 @@ static void fsl_iowrite64(u64 val, u64 __iomem *addr)
 
 static u64 fsl_ioread64be(const u64 __iomem *addr)
 {
-   u32 fsl_addr = lower_32_bits(addr);
-   u64 fsl_addr_hi = (u64)in_be32((u32 *)fsl_addr) << 32;
+   u32 val_hi = in_be32((u32 __iomem *)addr);
+   u32 val_lo = in_be32((u32 __iomem *)addr + 1);
 
-   return fsl_addr_hi | in_be32((u32 *)(fsl_addr + 1));
+   return ((u64)val_hi << 32) + val_lo;
 }
 
 static void fsl_iowrite64be(u64 val, u64 __iomem *addr)
-- 
2.25.1



RE: [PATCH] fsldma: fsl_ioread64*() do not need lower_32_bits()

2020-08-31 Thread Leo Li


> -Original Message-
> From: Linus Torvalds 
> Sent: Saturday, August 29, 2020 4:20 PM
> To: Guenter Roeck 
> Cc: Luc Van Oostenryck ; Herbert Xu
> ; Andrew Morton  foundation.org>; Joerg Roedel ; Leo Li
> ; Zhang Wei ; Dan Williams
> ; Vinod Koul ; linuxppc-dev
> ; dma ; Linux
> Kernel Mailing List 
> Subject: Re: [PATCH] fsldma: fsl_ioread64*() do not need lower_32_bits()
> 
> On Sat, Aug 29, 2020 at 1:40 PM Guenter Roeck  wrote:
> >
> > Except for
> >
> > CHECK: spaces preferred around that '+' (ctx:VxV)
> > #29: FILE: drivers/dma/fsldma.h:223:
> > +   u32 val_lo = in_be32((u32 __iomem *)addr+1);
> 
> Added spaces.
> 
> > I don't see anything wrong with it either, so
> >
> > Reviewed-by: Guenter Roeck 
> >
> > Since I didn't see the real problem with the original code, I'd take
> > that with a grain of salt, though.
> 
> Well, honestly, the old code was so confused that just making it build is
> clearly already an improvement even if everything else were to be wrong.
> 
> So I committed my "fix". If it turns out there's more wrong in there and
> somebody tests it, we can fix it again. But now it hopefully compiles, at 
> least.
> 
> My bet is that if that driver ever worked on ppc32, it will continue to work
> whatever we do to that function.
> 
> I _think_ the old code happened to - completely by mistake - get the value
> right for the case of "little endian access, with dma_addr_t being 32-bit".
> Because then it would still read the upper bits wrong, but the cast to
> dma_addr_t would then throw those bits away. And the lower bits would be
> right.
> 
> But for big-endian accesses or for ARCH_DMA_ADDR_T_64BIT it really looks
> like it always returned a completely incorrect value.
> 
> And again - the driver may have worked even with that completely incorrect
> value, since the use of it seems to be very incidental.
> 
> In either case ("it didn't work before" or "it worked because the value
> doesn't really matter"), I don't think I could possibly have made things 
> worse.
> 
> Famous last words.

Thanks for the patch.  

Acked-by: Li Yang 

We have periodic automated regression tests covering ppc32 platforms, but it
looks like they missed this issue. I will ask the test team to investigate
why the test cases are not sufficient.

Regards,
Leo


[Bug 205183] PPC64: Signal delivery fails with SIGSEGV if between about 1KB and 4KB bytes of stack remain

2020-08-31 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=205183

Michael Ellerman (mich...@ellerman.id.au) changed:

   What|Removed |Added

 Status|RESOLVED|CLOSED

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

Re: KVM on POWER8 host lock up since 10d91611f426 ("powerpc/64s: Reimplement book3s idle code in C")

2020-08-31 Thread Ruediger Oertel
On 31.08.20 at 14:58, Michael Ellerman wrote:
[...]
>> The machines are running in OPAL/PowerNV mode, with "ppc64_cpu --smt=off".
>> The number of VMs varies across the machines:
>> obs-power8-01: 18 instances, "-smp 16,threads=8"
>> obs-power8-02: 20 instances, "-smp 8,threads=8"
>> obs-power8-03: 30 instances, "-smp 8,threads=8"
>> obs-power8-04: 20 instances, "-smp 8,threads=8"
> 
> Can you send us the output of:
> 
> # grep . /sys/module/kvm_hv/parameters/*

of course, the current values are:

/sys/module/kvm_hv/parameters/dynamic_mt_modes:6
/sys/module/kvm_hv/parameters/h_ipi_redirect:1
/sys/module/kvm_hv/parameters/indep_threads_mode:Y
/sys/module/kvm_hv/parameters/kvm_irq_bypass:1
/sys/module/kvm_hv/parameters/nested:Y
/sys/module/kvm_hv/parameters/one_vm_per_core:N
/sys/module/kvm_hv/parameters/target_smt_mode:0

(actually identical on all 5 above)

-- 
with kind regards (mit freundlichem Grinsen),
  Ruediger Oertel (r...@suse.com,r...@suse.de,bugfin...@t-online.de)
Do-Not-Accept-Binary-Blobs.Ever.From-Anyone.
Key fingerprint   =   17DC 6553 86A7 384B 53C5  CA5C 3CE4 F2E7 23F2 B417
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg,
  Germany, (HRB 36809, AG Nürnberg), Geschäftsführer: Felix Imendörffer



signature.asc
Description: OpenPGP digital signature


Re: KVM on POWER8 host lock up since 10d91611f426 ("powerpc/64s: Reimplement book3s idle code in C")

2020-08-31 Thread Michael Ellerman
Ruediger Oertel  writes:
> On 31.08.20 at 03:14, Nicholas Piggin wrote:
>> Excerpts from Michal Suchánek's message of August 31, 2020 6:11 am:
>>> Hello,
>>>
>>> on POWER8 KVM hosts lock up since commit 10d91611f426 ("powerpc/64s:
>>> Reimplement book3s idle code in C").
>>>
>>> The symptom is host locking up completely after some hours of KVM
>>> workload with messages like
>>>
>>> 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 47
>>> 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 71
>>> 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 47
>>> 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 71
>>> 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 47
>>>
>>> printed before the host locks up.
>>>
>>> The machines run sandboxed builds which is a mixed workload resulting in
>>> IO/single core/multiple core load over time and there are periods of no
>>> activity and no VMs running as well. The VMs are short-lived so VM
>>> setup/teardown is somewhat exercised as well.
>>>
>>> POWER9 with the new guest entry fast path does not seem to be affected.
>>>
>>> Reverted the patch and the followup idle fixes on top of 5.2.14 and
>>> re-applied commit a3f3072db6ca ("powerpc/powernv/idle: Restore IAMR
>>> after idle") which gives same idle code as 5.1.16 and the kernel seems
>>> stable.
>>>
>>> Config is attached.
>>>
>>> I cannot easily revert this commit, especially if I want to use the same
>>> kernel on POWER8 and POWER9 - many of the POWER9 fixes are applicable
>>> only to the new idle code.
>>>
>>> Any idea what can be the problem?
>> 
>> So hwthread_state is never getting back to HWTHREAD_IN_IDLE on
>> those threads. I wonder what they are doing. POWER8 doesn't have a good
>> NMI IPI and I don't know if it supports pdbg dumping registers from the
>> BMC unfortunately. Do the messages always come in pairs of CPUs?
>> 
>> I'm not sure where to start with reproducing, I'll have to try. How many
>> vCPUs in the guests? Do you have several guests running at once?
>
> Hello all,
>
> some details on the setup...
> these machines are buildservice workers, (build.opensuse.org) and all they
> do is spawn new VMs, run a package building job inside (rpmbuild, 
> debbuild,...)
>
> The machines are running in OPAL/PowerNV mode, with "ppc64_cpu --smt=off".
> The number of VMs varies across the machines:
> obs-power8-01: 18 instances, "-smp 16,threads=8"
> obs-power8-02: 20 instances, "-smp 8,threads=8"
> obs-power8-03: 30 instances, "-smp 8,threads=8"
> obs-power8-04: 20 instances, "-smp 8,threads=8"

Can you send us the output of:

# grep . /sys/module/kvm_hv/parameters/*

cheers



Re: fsl_espi errors on v5.7.15

2020-08-31 Thread Heiner Kallweit
On 30.08.2020 23:59, Chris Packham wrote:
> 
> On 31/08/20 9:41 am, Heiner Kallweit wrote:
>> On 30.08.2020 23:00, Chris Packham wrote:
>>> On 31/08/20 12:30 am, Nicholas Piggin wrote:
 Excerpts from Chris Packham's message of August 28, 2020 8:07 am:
>>> 
>>>
> I've also now seen the RX FIFO not empty error on the T2080RDB
>
> fsl_espi ffe11.spi: Transfer done but SPIE_DON isn't set!
> fsl_espi ffe11.spi: Transfer done but SPIE_DON isn't set!
> fsl_espi ffe11.spi: Transfer done but SPIE_DON isn't set!
> fsl_espi ffe11.spi: Transfer done but SPIE_DON isn't set!
> fsl_espi ffe11.spi: Transfer done but rx/tx fifo's aren't empty!
> fsl_espi ffe11.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>
> With my current workaround of emptying the RX FIFO. It seems
> survivable. Interestingly it only ever seems to be 1 extra byte in the
> RX FIFO and it seems to be after either a READ_SR or a READ_FSR.
>
> fsl_espi ffe11.spi: tx 70
> fsl_espi ffe11.spi: rx 03
> fsl_espi ffe11.spi: Extra RX 00
> fsl_espi ffe11.spi: Transfer done but SPIE_DON isn't set!
> fsl_espi ffe11.spi: Transfer done but rx/tx fifo's aren't empty!
> fsl_espi ffe11.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
> fsl_espi ffe11.spi: tx 05
> fsl_espi ffe11.spi: rx 00
> fsl_espi ffe11.spi: Extra RX 03
> fsl_espi ffe11.spi: Transfer done but SPIE_DON isn't set!
> fsl_espi ffe11.spi: Transfer done but rx/tx fifo's aren't empty!
> fsl_espi ffe11.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
> fsl_espi ffe11.spi: tx 05
> fsl_espi ffe11.spi: rx 00
> fsl_espi ffe11.spi: Extra RX 03
>
> From all the Micron SPI-NOR datasheets I've got access to, it is
> possible to continually read the SR/FSR. But I've no idea why it
> happens sometimes and not others.
 So I think I've got a reproduction and I think I've bisected the problem
 to commit 3282a3da25bd ("powerpc/64: Implement soft interrupt replay in
 C"). My day is just finishing now so I haven't applied too much scrutiny
 to this result. Given the various rabbit holes I've been down on this
 issue already I'd take this information with a good degree of skepticism.

>>> OK, so an easy test should be to re-test with a 5.4 kernel.
>>> It doesn't yet have the change you're referring to, and the fsl-espi driver
>>> is basically the same as in 5.7 (just two small changes in 5.7).
>> There's 6cc0c16d82f88 and maybe also other interrupt related patches
>> around this time that could affect book E, so it's good if that exact
>> patch is confirmed.
> My confirmation is basically that I can induce the issue in a 5.4 kernel
> by cherry-picking 3282a3da25bd. I'm also able to "fix" the issue in
> 5.9-rc2 by reverting that one commit.
>
> In both cases it's not exactly a clean cherry-pick/revert, so I also
> confirmed the bisection result by building at 3282a3da25bd (which sees
> the issue) and the commit just before (which does not).
 Thanks for testing, that confirms it well.

 [snip patch]

> I still saw the issue with this change applied. PPC_IRQ_SOFT_MASK_DEBUG
> didn't report anything (either with or without the change above).
 Okay, it was a bit of a shot in the dark. I still can't see what
 else has changed.

 What would cause this, a lost interrupt? A spurious interrupt? Or
 higher interrupt latency?

 I don't think the patch should cause significantly worse latency,
 (it's supposed to be a bit better if anything because it doesn't set
 up the full interrupt frame). But it's possible.
>>> My working theory is that the SPI_DON indication is all about the TX
>>> direction an now that the interrupts are faster we're hitting an error
>>> because there is still RX activity going on. Heiner disagrees with my
>>> interpretation of the SPI_DON indication and the fact that it doesn't
>>> happen every time does throw doubt on it.
>>>
>> It's right that the eSPI spec can be interpreted that SPI_DON refers to
>> TX only. However this wouldn't really make sense, because also for RX
>> we program the frame length, and therefore want to be notified once the
>> full frame was received. Also practical experience shows that SPI_DON
>> is set also after RX-only transfers.
>> Typical SPI NOR use case is that you write read command + start address,
>> followed by a longer read. If the TX-only interpretation would be right,
>> we'd always end up with SPI_DON not being set.
>>
>>> I can't really explain the extra RX byte in the fifo. We know how many
>>> bytes to expect and we pull that many from the fifo so it's not as if
>>> we're m

Re: KVM on POWER8 host lock up since 10d91611f426 ("powerpc/64s: Reimplement book3s idle code in C")

2020-08-31 Thread Michael Ellerman
Michal Suchánek  writes:
> On Mon, Aug 31, 2020 at 11:14:18AM +1000, Nicholas Piggin wrote:
>> Excerpts from Michal Suchánek's message of August 31, 2020 6:11 am:
>> > Hello,
>> > 
>> > on POWER8 KVM hosts lock up since commit 10d91611f426 ("powerpc/64s:
>> > Reimplement book3s idle code in C").
>> > 
>> > The symptom is host locking up completely after some hours of KVM
>> > workload with messages like
>> > 
>> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 47
>> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 71
>> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 47
>> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 71
>> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 47
>> > 
>> > printed before the host locks up.
>> > 
>> > The machines run sandboxed builds which is a mixed workload resulting in
>> > IO/single core/multiple core load over time and there are periods of no
>> > activity and no VMs running as well. The VMs are short-lived so VM
>> > setup/teardown is somewhat exercised as well.
>> > 
>> > POWER9 with the new guest entry fast path does not seem to be affected.
>> > 
>> > Reverted the patch and the followup idle fixes on top of 5.2.14 and
>> > re-applied commit a3f3072db6ca ("powerpc/powernv/idle: Restore IAMR
>> > after idle") which gives same idle code as 5.1.16 and the kernel seems
>> > stable.
>> > 
>> > Config is attached.
>> > 
>> > I cannot easily revert this commit, especially if I want to use the same
>> > kernel on POWER8 and POWER9 - many of the POWER9 fixes are applicable
>> > only to the new idle code.
>> > 
>> > Any idea what can be the problem?
>> 
>> So hwthread_state is never getting back to HWTHREAD_IN_IDLE on
>> those threads. I wonder what they are doing. POWER8 doesn't have a good
>> NMI IPI and I don't know if it supports pdbg dumping registers from the
>> BMC unfortunately.
>
> It may be possible to set up fadump with a later kernel version that
> supports it on powernv and dump the whole kernel.

Your firmware won't support it AFAIK.

You could try kdump, but if we have CPUs stuck in KVM then there's a
good chance it won't work :/

cheers


Re: KVM on POWER8 host lock up since 10d91611f426 ("powerpc/64s: Reimplement book3s idle code in C")

2020-08-31 Thread Ruediger Oertel
On 31.08.20 at 03:14, Nicholas Piggin wrote:

> So hwthread_state is never getting back to HWTHREAD_IN_IDLE on
> those threads. I wonder what they are doing. POWER8 doesn't have a good
> NMI IPI and I don't know if it supports pdbg dumping registers from the
> BMC unfortunately.


> Do the messages always come in pairs of CPUs?

well,
- one problem is that at some point the machine just locks up completely,
  so I cannot tell whether further lines would have been printed, and in some
  cases all I get is a single line
- looking at the stats, in general it's either one cpu printed several
  times or a pair alternating (though not strictly)

2020-07-30T03:16:16+00:00 obs-power8-03 kernel: [51284.029821] KVM: couldn't grab cpu 92
2020-07-30T03:16:16+00:00 obs-power8-03 kernel: [51284.058630] KVM: couldn't grab cpu 92
2020-07-30T03:16:16+00:00 obs-power8-03 kernel: [51284.108268] KVM: couldn't grab cpu 92
2020-07-30T03:16:16+00:00 obs-power8-03 kernel: [51284.210206] KVM: couldn't grab cpu 92
2020-07-30T03:16:16+00:00 obs-power8-03 kernel: [51284.323465] KVM: couldn't grab cpu 92
2020-07-30T03:16:16+00:00 obs-power8-03 kernel: [51284.334420] KVM: couldn't grab cpu 92
2020-07-30T03:16:16+00:00 obs-power8-03 kernel: [51284.345470] KVM: couldn't grab cpu 92
2020-07-30T03:16:16+00:00 obs-power8-03 kernel: [51284.395185] KVM: couldn't grab cpu 92
2020-07-30T03:16:16+00:00 obs-power8-03 kernel: [51284.517182] KVM: couldn't grab cpu 92
2020-07-30T03:16:17+00:00 obs-power8-03 kernel: [51284.600716] KVM: couldn't grab cpu 92
2020-07-30T03:16:18+00:00 obs-power8-03 kernel: [51286.201589] KVM: couldn't grab cpu 92
2020-07-30T03:16:19+00:00 obs-power8-03 kernel: [51286.627273] KVM: couldn't grab cpu 92

2020-07-30T16:44:16+00:00 obs-power8-04 kernel: [30099.726288] KVM: couldn't grab cpu 61
2020-07-30T16:44:16+00:00 obs-power8-04 kernel: [30099.736843] KVM: couldn't grab cpu 125
2020-07-30T16:44:17+00:00 obs-power8-04 kernel: [30099.747429] KVM: couldn't grab cpu 125
2020-07-30T16:44:17+00:00 obs-power8-04 kernel: [30099.877138] KVM: couldn't grab cpu 61
2020-07-30T16:44:17+00:00 obs-power8-04 kernel: [30099.916422] KVM: couldn't grab cpu 125
2020-07-30T16:44:17+00:00 obs-power8-04 kernel: [30099.931755] KVM: couldn't grab cpu 61
2020-07-30T16:44:17+00:00 obs-power8-04 kernel: [30100.029003] KVM: couldn't grab cpu 61
2020-07-30T16:44:17+00:00 obs-power8-04 kernel: [30100.334895] KVM: couldn't grab cpu 125
2020-07-30T16:44:17+00:00 obs-power8-04 kernel: [30100.392713] KVM: couldn't grab cpu 61
2020-07-30T16:44:17+00:00 obs-power8-04 kernel: [30100.569011] KVM: couldn't grab cpu 125
2020-07-30T16:44:17+00:00 obs-power8-04 kernel: [30100.617048] KVM: couldn't grab cpu 125
2020-07-30T16:44:17+00:00 obs-power8-04 kernel: [30100.628107] KVM: couldn't grab cpu 125
2020-07-30T16:44:18+00:00 obs-power8-04 kernel: [30100.809046] KVM: couldn't grab cpu 125
2020-07-30T16:44:18+00:00 obs-power8-04 kernel: [30101.001097] KVM: couldn't grab cpu 61
2020-07-30T16:44:19+00:00 obs-power8-04 kernel: [30102.109007] KVM: couldn't grab cpu 125
2020-07-30T16:44:19+00:00 obs-power8-04 kernel: [30102.254470] KVM: couldn't grab cpu 61



> 
> I'm not sure where to start with reproducing, I'll have to try. How many
> vCPUs in the guests? Do you have several guests running at once?
> 
> Thanks,
> Nick
> 


-- 
with kind regards (mit freundlichem Grinsen),
  Ruediger Oertel (r...@suse.com,r...@suse.de,bugfin...@t-online.de)
Do-Not-Accept-Binary-Blobs.Ever.From-Anyone.
Key fingerprint   =   17DC 6553 86A7 384B 53C5  CA5C 3CE4 F2E7 23F2 B417
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg,
  Germany, (HRB 36809, AG Nürnberg), Geschäftsführer: Felix Imendörffer



signature.asc
Description: OpenPGP digital signature


Re: KVM on POWER8 host lock up since 10d91611f426 ("powerpc/64s: Reimplement book3s idle code in C")

2020-08-31 Thread Ruediger Oertel
On 31.08.20 at 03:14, Nicholas Piggin wrote:
> Excerpts from Michal Suchánek's message of August 31, 2020 6:11 am:
>> Hello,
>>
>> on POWER8 KVM hosts lock up since commit 10d91611f426 ("powerpc/64s:
>> Reimplement book3s idle code in C").
>>
>> The symptom is host locking up completely after some hours of KVM
>> workload with messages like
>>
>> 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 47
>> 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 71
>> 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 47
>> 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 71
>> 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 47
>>
>> printed before the host locks up.
>>
>> The machines run sandboxed builds which is a mixed workload resulting in
>> IO/single core/multiple core load over time and there are periods of no
>> activity and no VMs running as well. The VMs are short-lived so VM
>> setup/teardown is somewhat exercised as well.
>>
>> POWER9 with the new guest entry fast path does not seem to be affected.
>>
>> Reverted the patch and the followup idle fixes on top of 5.2.14 and
>> re-applied commit a3f3072db6ca ("powerpc/powernv/idle: Restore IAMR
>> after idle") which gives same idle code as 5.1.16 and the kernel seems
>> stable.
>>
>> Config is attached.
>>
>> I cannot easily revert this commit, especially if I want to use the same
>> kernel on POWER8 and POWER9 - many of the POWER9 fixes are applicable
>> only to the new idle code.
>>
>> Any idea what can be the problem?
> 
> So hwthread_state is never getting back to HWTHREAD_IN_IDLE on
> those threads. I wonder what they are doing. POWER8 doesn't have a good
> NMI IPI and I don't know if it supports pdbg dumping registers from the
> BMC unfortunately. Do the messages always come in pairs of CPUs?
> 
> I'm not sure where to start with reproducing, I'll have to try. How many
> vCPUs in the guests? Do you have several guests running at once?

Hello all,

some details on the setup...
these machines are buildservice workers (build.opensuse.org), and all they
do is spawn new VMs and run a package building job inside (rpmbuild, debbuild, ...)

The machines are running in OPAL/PowerNV mode, with "ppc64_cpu --smt=off".
The number of VMs varies across the machines:
obs-power8-01: 18 instances, "-smp 16,threads=8"
obs-power8-02: 20 instances, "-smp 8,threads=8"
obs-power8-03: 30 instances, "-smp 8,threads=8"
obs-power8-04: 20 instances, "-smp 8,threads=8"
obs-power8-05: 36 instances, "-smp 4,threads=2" (this one with "ppc64_cpu --subcores-per-core=4")

but anyway the stalls can be seen on all of them, sometimes after 4 hours,
sometimes only after about a day. The 01 machine, with more cpu overcommit,
seems to reproduce the problem a little faster, but that's more gut feeling
than anything backed by real numbers.


-- 
with kind regards (mit freundlichem Grinsen),
  Ruediger Oertel (r...@suse.com,r...@suse.de,bugfin...@t-online.de)
Do-Not-Accept-Binary-Blobs.Ever.From-Anyone.
Key fingerprint   =   17DC 6553 86A7 384B 53C5  CA5C 3CE4 F2E7 23F2 B417
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg,
  Germany, (HRB 36809, AG Nürnberg), Geschäftsführer: Felix Imendörffer



signature.asc
Description: OpenPGP digital signature


Re: [PATCH 5/5] powerpc: use the generic dma_ops_bypass mode

2020-08-31 Thread Cédric Le Goater
On 8/31/20 8:40 AM, Christoph Hellwig wrote:
> On Sun, Aug 30, 2020 at 11:04:21AM +0200, Cédric Le Goater wrote:
>> Hello,
>>
>> On 7/8/20 5:24 PM, Christoph Hellwig wrote:
>>> Use the DMA API bypass mechanism for direct window mappings.  This uses
>>> common code and speeds up the direct mapping case by avoiding indirect
>>> calls just when not using dma ops at all.  It also fixes a problem where
>>> the sync_* methods were using the bypass check for DMA allocations, but
>>> those are part of the streaming ops.
>>>
>>> Note that this patch loses the DMA_ATTR_WEAK_ORDERING override, which
>>> has never been well defined, and is only used by a few drivers, which
>>> IIRC never showed up in the typical Cell blade setups that are affected
>>> by the ordering workaround.
>>>
>>> Fixes: efd176a04bef ("powerpc/pseries/dma: Allow SWIOTLB")
>>> Signed-off-by: Christoph Hellwig 
>>> ---
>>>  arch/powerpc/Kconfig  |  1 +
>>>  arch/powerpc/include/asm/device.h |  5 --
>>>  arch/powerpc/kernel/dma-iommu.c   | 90 ---
>>>  3 files changed, 10 insertions(+), 86 deletions(-)
>>
>> I am seeing corruptions on a couple of POWER9 systems (boston) when
>> stressed with IO. stress-ng gives some results, but I first saw it
>> when compiling the kernel in a guest and this is still the best way
>> to trigger the issue.
>>
>> These systems have a SAS Adaptec controller:
>>
>>   0003:01:00.0 Serial Attached SCSI controller: Adaptec Series 8 12G SAS/PCIe 3 (rev 01)
>>
>> When the failure occurs, the POWERPC EEH interrupt fires and dumps
>> low-level PHB4 registers, among which:
>>
>>   [ 2179.251069490,3] PHB#0003[0:3]:   phbErrorStatus = 0280
>>   [ 2179.251117476,3] PHB#0003[0:3]:  phbFirstErrorStatus = 0200
>>
>> The bits raised identify a PPC 'TCE' error, which means it is related
>> to DMAs. See below for more details.
>>
>>
>> Reverting this patch "fixes" the issue but the problem is probably elsewhere,
>> in some other layer or in the aacraid driver. How should I proceed
>> to get more information?
> 
> The aacraid DMA masks look like a mess.  Can you try the hack
> below and see if it helps?

No effect. The system crashes the same. But Alexey spotted some issue 
with swiotlb.

C. 


Re: [PATCH 5/5] powerpc: use the generic dma_ops_bypass mode

2020-08-31 Thread Cédric Le Goater
Hello,

On 7/8/20 5:24 PM, Christoph Hellwig wrote:
> Use the DMA API bypass mechanism for direct window mappings.  This uses
> common code and speeds up the direct mapping case by avoiding indirect
> calls just when not using dma ops at all.  It also fixes a problem where
> the sync_* methods were using the bypass check for DMA allocations, but
> those are part of the streaming ops.
> 
> Note that this patch loses the DMA_ATTR_WEAK_ORDERING override, which
> has never been well defined, and is only used by a few drivers, which
> IIRC never showed up in the typical Cell blade setups that are affected
> by the ordering workaround.
> 
> Fixes: efd176a04bef ("powerpc/pseries/dma: Allow SWIOTLB")
> Signed-off-by: Christoph Hellwig 
> ---
>  arch/powerpc/Kconfig  |  1 +
>  arch/powerpc/include/asm/device.h |  5 --
>  arch/powerpc/kernel/dma-iommu.c   | 90 ---
>  3 files changed, 10 insertions(+), 86 deletions(-)

I am seeing corruptions on a couple of POWER9 systems (boston) when
stressed with IO. stress-ng gives some results, but I first saw it
when compiling the kernel in a guest and this is still the best way
to trigger the issue.

These systems have a SAS Adaptec controller:

  0003:01:00.0 Serial Attached SCSI controller: Adaptec Series 8 12G SAS/PCIe 3 (rev 01)

When the failure occurs, the POWERPC EEH interrupt fires and dumps
low-level PHB4 registers, among which:
  
  [ 2179.251069490,3] PHB#0003[0:3]:   phbErrorStatus = 0280
  [ 2179.251117476,3] PHB#0003[0:3]:  phbFirstErrorStatus = 0200

The bits raised identify a PPC 'TCE' error, which means it is related
to DMAs. See below for more details.


Reverting this patch "fixes" the issue but the problem is probably elsewhere,
in some other layer or in the aacraid driver. How should I proceed
to get more information?

Thanks,

C.


[ 2054.970339] EEH: Frozen PE#1fd on PHB#3 detected
[ 2054.970375] EEH: PE location: UOPWR.BOS0019-Node0-Onboard SAS, PHB location: N/A
[ 2179.249415973,3] PHB#0003[0:3]:  brdgCtl = 0002
[ 2179.249515795,3] PHB#0003[0:3]: deviceStatus = 00060040
[ 2179.249596452,3] PHB#0003[0:3]:   slotStatus = 00402000
[ 2179.249633728,3] PHB#0003[0:3]:   linkStatus = a0830008
[ 2179.249674807,3] PHB#0003[0:3]: devCmdStatus = 00100107
[ 2179.249725974,3] PHB#0003[0:3]: devSecStatus = 00100107
[ 2179.249773550,3] PHB#0003[0:3]:  rootErrorStatus = 
[ 2179.249809823,3] PHB#0003[0:3]:  corrErrorStatus = 
[ 2179.249850439,3] PHB#0003[0:3]:uncorrErrorStatus = 
[ 2179.249887411,3] PHB#0003[0:3]:   devctl = 0040
[ 2179.249928677,3] PHB#0003[0:3]:  devStat = 0006
[ 2179.249967150,3] PHB#0003[0:3]:  tlpHdr1 = 
[ 2179.250054987,3] PHB#0003[0:3]:  tlpHdr2 = 
[ 2179.250146600,3] PHB#0003[0:3]:  tlpHdr3 = 
[ 2179.250262780,3] PHB#0003[0:3]:  tlpHdr4 = 
[ 2179.250343101,3] PHB#0003[0:3]: sourceId = 
[ 2179.250514264,3] PHB#0003[0:3]: nFir = 
[ 2179.250717971,3] PHB#0003[0:3]: nFirMask = 0030001c
[ 2179.250791474,3] PHB#0003[0:3]:  nFirWOF = 
[ 2179.250842054,3] PHB#0003[0:3]: phbPlssr = 001c
[ 2179.250886003,3] PHB#0003[0:3]:   phbCsr = 001c
[ 2179.250929859,3] PHB#0003[0:3]:   lemFir = 00010080
[ 2179.250973720,3] PHB#0003[0:3]: lemErrorMask = 
[ 2179.251018622,3] PHB#0003[0:3]:   lemWOF = 0080
[ 2179.251069490,3] PHB#0003[0:3]:   phbErrorStatus = 0280
[ 2179.251117476,3] PHB#0003[0:3]:  phbFirstErrorStatus = 0200
[ 2179.251162218,3] PHB#0003[0:3]: phbErrorLog0 = 214898000240
[ 2179.251206251,3] PHB#0003[0:3]: phbErrorLog1 = a0084000
[ 2179.251253096,3] PHB#0003[0:3]:phbTxeErrorStatus = 
[ 2179.265087656,3] PHB#0003[0:3]:   phbTxeFirstErrorStatus = 
[ 2179.265142247,3] PHB#0003[0:3]:  phbTxeErrorLog0 = 
[ 2179.265189734,3] PHB#0003[0:3]:  phbTxeErrorLog1 = 
[ 2179.266335296,3] PHB#0003[0:3]: phbRxeArbErrorStatus = 
[ 2179.266380366,3] PHB#0003[0:3]: phbRxeArbFrstErrorStatus = 
[ 2179.266426523,3] PHB#0003[0:3]:   phbRxeArbErrorLog0 = 
[ 2179.267537283,3] PHB#0003[0:3]:   phbRxeArbErrorLog1 = 
[ 2179.267581708,3] PHB#0003[0:3]: phbRxeMrgErrorStatus = 
[ 2179.267628324,3] PHB#0003[0:3]: phbRxeMrgFrstErrorStatus = 
[ 2179.268748771,3] PHB#0003[0:3]:   phbRxeMrgErrorLog

Re: KVM on POWER8 host lock up since 10d91611f426 ("powerpc/64s: Reimplement book3s idle code in C")

2020-08-31 Thread Michal Suchánek
On Mon, Aug 31, 2020 at 11:14:18AM +1000, Nicholas Piggin wrote:
> Excerpts from Michal Suchánek's message of August 31, 2020 6:11 am:
> > Hello,
> > 
> > on POWER8 KVM hosts lock up since commit 10d91611f426 ("powerpc/64s:
> > Reimplement book3s idle code in C").
> > 
> > The symptom is host locking up completely after some hours of KVM
> > workload with messages like
> > 
> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 47
> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 71
> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 47
> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 71
> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 47
> > 
> > printed before the host locks up.
> > 
> > The machines run sandboxed builds which is a mixed workload resulting in
> > IO/single core/multiple core load over time and there are periods of no
> > activity and no VMs running as well. The VMs are short-lived so VM
> > setup/teardown is somewhat exercised as well.
> > 
> > POWER9 with the new guest entry fast path does not seem to be affected.
> > 
> > Reverted the patch and the followup idle fixes on top of 5.2.14 and
> > re-applied commit a3f3072db6ca ("powerpc/powernv/idle: Restore IAMR
> > after idle") which gives same idle code as 5.1.16 and the kernel seems
> > stable.
> > 
> > Config is attached.
> > 
> > I cannot easily revert this commit, especially if I want to use the same
> > kernel on POWER8 and POWER9 - many of the POWER9 fixes are applicable
> > only to the new idle code.
> > 
> > Any idea what can be the problem?
> 
> So hwthread_state is never getting back to HWTHREAD_IN_IDLE on
> those threads. I wonder what they are doing. POWER8 doesn't have a good
> NMI IPI and I don't know if it supports pdbg dumping registers from the
> BMC unfortunately.
It may be possible to set up fadump with a later kernel version that
supports it on powernv and dump the whole kernel.

Thanks

Michal


Re: KVM on POWER8 host lock up since 10d91611f426 ("powerpc/64s: Reimplement book3s idle code in C")

2020-08-31 Thread Michal Suchánek
On Mon, Aug 31, 2020 at 11:14:18AM +1000, Nicholas Piggin wrote:
> Excerpts from Michal Suchánek's message of August 31, 2020 6:11 am:
> > Hello,
> > 
> > on POWER8 KVM hosts lock up since commit 10d91611f426 ("powerpc/64s:
> > Reimplement book3s idle code in C").
> > 
> > The symptom is host locking up completely after some hours of KVM
> > workload with messages like
> > 
> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 47
> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 71
> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 47
> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 71
> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 47
> > 
> > printed before the host locks up.
> > 
> > The machines run sandboxed builds which is a mixed workload resulting in
> > IO/single core/multiple core load over time and there are periods of no
> > activity and no VMs running as well. The VMs are short-lived so VM
> > setup/teardown is somewhat exercised as well.
> > 
> > POWER9 with the new guest entry fast path does not seem to be affected.
> > 
> > Reverted the patch and the followup idle fixes on top of 5.2.14 and
> > re-applied commit a3f3072db6ca ("powerpc/powernv/idle: Restore IAMR
> > after idle") which gives same idle code as 5.1.16 and the kernel seems
> > stable.
> > 
> > Config is attached.
> > 
> > I cannot easily revert this commit, especially if I want to use the same
> > kernel on POWER8 and POWER9 - many of the POWER9 fixes are applicable
> > only to the new idle code.
> > 
> > Any idea what can be the problem?
> 
> So hwthread_state is never getting back to HWTHREAD_IN_IDLE on
> those threads. I wonder what they are doing. POWER8 doesn't have a good
> NMI IPI and I don't know if it supports pdbg dumping registers from the
> BMC unfortunately. Do the messages always come in pairs of CPUs?
> 
> I'm not sure where to start with reproducing, I'll have to try. How many
> vCPUs in the guests? Do you have several guests running at once?

The guests are spawned on demand - there are like 20-30 'slots'
configured where a VM may be running or it may be idle with no VM
spawned when there are no jobs available.

Thanks

Michal


[PATCH 1/2] powerpc/8xx: Refactor calculation of number of entries per PTE in page tables

2020-08-31 Thread Christophe Leroy
On 8xx, the number of entries occupied by a PTE in the page tables
depends on the size of the page. At the moment, this calculation
is done in two places: in pte_update() and in set_huge_pte_at().

Refactor this calculation into a helper called
number_of_cells_per_pte(). For the time being, the val param is
unused. It will be used by the following patch.

Instead of opencoding is_hugepd(), use hugepd_ok() with a forward
declaration.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/nohash/32/pgtable.h | 18 --
 arch/powerpc/mm/pgtable.c|  6 --
 2 files changed, 16 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/include/asm/nohash/32/pgtable.h 
b/arch/powerpc/include/asm/nohash/32/pgtable.h
index b9e134d0f03a..80bbc21b87f0 100644
--- a/arch/powerpc/include/asm/nohash/32/pgtable.h
+++ b/arch/powerpc/include/asm/nohash/32/pgtable.h
@@ -227,6 +227,17 @@ static inline void pmd_clear(pmd_t *pmdp)
  */
 #ifdef CONFIG_PPC_8xx
 static pmd_t *pmd_off(struct mm_struct *mm, unsigned long addr);
+static int hugepd_ok(hugepd_t hpd);
+
+static int number_of_cells_per_pte(pmd_t *pmd, pte_basic_t val, int huge)
+{
+   if (!huge)
+   return PAGE_SIZE / SZ_4K;
+   else if (hugepd_ok(*((hugepd_t *)pmd)))
+   return 1;
+   else
+   return SZ_512K / SZ_4K;
+}
 
 static inline pte_basic_t pte_update(struct mm_struct *mm, unsigned long addr, 
pte_t *p,
 unsigned long clr, unsigned long set, int 
huge)
@@ -237,12 +248,7 @@ static inline pte_basic_t pte_update(struct mm_struct *mm, 
unsigned long addr, p
int num, i;
pmd_t *pmd = pmd_off(mm, addr);
 
-   if (!huge)
-   num = PAGE_SIZE / SZ_4K;
-   else if ((pmd_val(*pmd) & _PMD_PAGE_MASK) != _PMD_PAGE_8M)
-   num = SZ_512K / SZ_4K;
-   else
-   num = 1;
+   num = number_of_cells_per_pte(pmd, new, huge);
 
for (i = 0; i < num; i++, entry++, new += SZ_4K)
*entry = new;
diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index 9c0547d77af3..2dcad640b869 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -266,8 +266,7 @@ void set_huge_pte_at(struct mm_struct *mm, unsigned long 
addr, pte_t *ptep, pte_
pmd_t *pmd = pmd_off(mm, addr);
pte_basic_t val;
pte_basic_t *entry = &ptep->pte;
-   int num = is_hugepd(*((hugepd_t *)pmd)) ? 1 : SZ_512K / SZ_4K;
-   int i;
+   int num, i;
 
/*
 * Make sure hardware valid bit is not set. We don't do
@@ -280,6 +279,9 @@ void set_huge_pte_at(struct mm_struct *mm, unsigned long 
addr, pte_t *ptep, pte_
pte = set_pte_filter(pte);
 
val = pte_val(pte);
+
+   num = number_of_cells_per_pte(pmd, val, 1);
+
for (i = 0; i < num; i++, entry++, val += SZ_4K)
*entry = val;
 }
-- 
2.25.0
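
As a quick illustration of the three cases the new helper distinguishes, here is a stand-alone user-space sketch (not kernel code); the page size and the hugepd test are plain parameters here because this runs outside the kernel, and the numbers simply mirror the logic in the patch above.

#include <stdio.h>

#define SZ_4K	0x1000UL
#define SZ_512K	0x80000UL

/* Mirror of number_of_cells_per_pte(): a "cell" is one 4k slot in the page table. */
static int cells_per_pte(unsigned long page_size, int huge, int is_8m_hugepd)
{
	if (!huge)
		return page_size / SZ_4K;	/* standard page */
	else if (is_8m_hugepd)
		return 1;			/* 8M pages live in a hugepd */
	else
		return SZ_512K / SZ_4K;		/* 512k huge page */
}

int main(void)
{
	printf("16k standard page: %d cells\n", cells_per_pte(0x4000, 0, 0));	/* 4 */
	printf("8M huge page:      %d cells\n", cells_per_pte(0x4000, 1, 1));	/* 1 */
	printf("512k huge page:    %d cells\n", cells_per_pte(0x4000, 1, 0));	/* 128 */
	return 0;
}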



[PATCH 2/2] powerpc/8xx: Support 16k hugepages with 4k pages

2020-08-31 Thread Christophe Leroy
The 8xx has 4 page sizes: 4k, 16k, 512k and 8M

4k and 16k can be selected at build time as standard page sizes,
and 512k and 8M are hugepages.

When 4k standard pages are selected, 16k pages are not available.

Allow 16k pages as hugepages when 4k pages are used.

To allow that, implement arch_make_huge_pte() which receives
the necessary arguments to allow setting the PTE in accordance
with the page size:
- 512 k pages must have _PAGE_HUGE and _PAGE_SPS. They are set
by pte_mkhuge(). arch_make_huge_pte() does nothing.
- 16 k pages must have only _PAGE_SPS. arch_make_huge_pte() clears
_PAGE_HUGE.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h | 14 ++
 arch/powerpc/include/asm/nohash/32/pgtable.h |  2 ++
 arch/powerpc/mm/hugetlbpage.c|  2 +-
 arch/powerpc/mm/nohash/tlb.c |  4 
 arch/powerpc/mm/ptdump/8xx.c |  5 +
 include/uapi/asm-generic/hugetlb_encode.h|  1 +
 include/uapi/linux/mman.h|  1 +
 7 files changed, 24 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h 
b/arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h
index e752a5807a59..39be9aea86db 100644
--- a/arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h
+++ b/arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h
@@ -65,4 +65,18 @@ static inline void huge_ptep_set_wrprotect(struct mm_struct 
*mm,
pte_update(mm, addr, ptep, clr, set, 1);
 }
 
+#ifdef CONFIG_PPC_4K_PAGES
+static inline pte_t arch_make_huge_pte(pte_t entry, struct vm_area_struct *vma,
+  struct page *page, int writable)
+{
+   size_t size = huge_page_size(hstate_vma(vma));
+
+   if (size == SZ_16K)
+   return __pte(pte_val(entry) & ~_PAGE_HUGE);
+   else
+   return entry;
+}
+#define arch_make_huge_pte arch_make_huge_pte
+#endif
+
 #endif /* _ASM_POWERPC_NOHASH_32_HUGETLB_8XX_H */
diff --git a/arch/powerpc/include/asm/nohash/32/pgtable.h 
b/arch/powerpc/include/asm/nohash/32/pgtable.h
index 80bbc21b87f0..ee2243ba96cf 100644
--- a/arch/powerpc/include/asm/nohash/32/pgtable.h
+++ b/arch/powerpc/include/asm/nohash/32/pgtable.h
@@ -235,6 +235,8 @@ static int number_of_cells_per_pte(pmd_t *pmd, pte_basic_t 
val, int huge)
return PAGE_SIZE / SZ_4K;
else if (hugepd_ok(*((hugepd_t *)pmd)))
return 1;
+   else if (IS_ENABLED(CONFIG_PPC_4K_PAGES) && !(val & _PAGE_HUGE))
+   return SZ_16K / SZ_4K;
else
return SZ_512K / SZ_4K;
 }
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index e7ae2a2c4545..36c3800769fb 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -180,7 +180,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long 
addr, unsigned long sz
if (!hpdp)
return NULL;
 
-   if (IS_ENABLED(CONFIG_PPC_8xx) && sz == SZ_512K)
+   if (IS_ENABLED(CONFIG_PPC_8xx) && pshift < PMD_SHIFT)
return pte_alloc_map(mm, (pmd_t *)hpdp, addr);
 
BUG_ON(!hugepd_none(*hpdp) && !hugepd_ok(*hpdp));
diff --git a/arch/powerpc/mm/nohash/tlb.c b/arch/powerpc/mm/nohash/tlb.c
index 14514585db98..5872f69141d5 100644
--- a/arch/powerpc/mm/nohash/tlb.c
+++ b/arch/powerpc/mm/nohash/tlb.c
@@ -83,16 +83,12 @@ struct mmu_psize_def mmu_psize_defs[MMU_PAGE_COUNT] = {
 };
 #elif defined(CONFIG_PPC_8xx)
 struct mmu_psize_def mmu_psize_defs[MMU_PAGE_COUNT] = {
-   /* we only manage 4k and 16k pages as normal pages */
-#ifdef CONFIG_PPC_4K_PAGES
[MMU_PAGE_4K] = {
.shift  = 12,
},
-#else
[MMU_PAGE_16K] = {
.shift  = 14,
},
-#endif
[MMU_PAGE_512K] = {
.shift  = 19,
},
diff --git a/arch/powerpc/mm/ptdump/8xx.c b/arch/powerpc/mm/ptdump/8xx.c
index 8a797dcbf475..86da2a669680 100644
--- a/arch/powerpc/mm/ptdump/8xx.c
+++ b/arch/powerpc/mm/ptdump/8xx.c
@@ -11,8 +11,13 @@
 
 static const struct flag_info flag_array[] = {
{
+#ifdef CONFIG_PPC_16K_PAGES
.mask   = _PAGE_HUGE,
.val= _PAGE_HUGE,
+#else
+   .mask   = _PAGE_SPS,
+   .val= _PAGE_SPS,
+#endif
.set= "huge",
.clear  = "",
}, {
diff --git a/include/uapi/asm-generic/hugetlb_encode.h 
b/include/uapi/asm-generic/hugetlb_encode.h
index b0f8e87235bd..4f3d5aaa11f5 100644
--- a/include/uapi/asm-generic/hugetlb_encode.h
+++ b/include/uapi/asm-generic/hugetlb_encode.h
@@ -20,6 +20,7 @@
 #define HUGETLB_FLAG_ENCODE_SHIFT  26
 #define HUGETLB_FLAG_ENCODE_MASK   0x3f
 
+#define HUGETLB_FLAG_ENCODE_16KB   (14 << HUGETLB_FLAG_ENCODE_SHIFT)
 #define HUGETLB_FLAG_ENCODE_64KB   (16 << HUGETLB_FLAG_ENCODE_SHIFT)
 #define HUGETLB_FLAG_ENCODE_512KB  (19 << HUGETLB_FLAG_ENCODE_SHIFT)
 #define HUGETLB_FLAG_ENCOD
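
For completeness, here is a minimal user-space sketch (not part of the patch) of how a 16k hugepage mapping could be requested once this series is applied on a PPC_4K_PAGES 8xx kernel. The MAP_HUGE_16KB value below is an assumption derived directly from the page-size shift of 14, mirroring HUGETLB_FLAG_ENCODE_16KB added above; the mapping only succeeds if 16k hugepages have been reserved in the pool.

#include <stdio.h>
#include <sys/mman.h>

#ifndef MAP_HUGE_SHIFT
#define MAP_HUGE_SHIFT	26
#endif
#define MAP_HUGE_16KB	(14 << MAP_HUGE_SHIFT)	/* assumed encoding, shift 14 = 16k */

int main(void)
{
	size_t len = 4 * 16 * 1024;	/* four 16k huge pages */
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_16KB,
		       -1, 0);

	if (p == MAP_FAILED) {
		perror("mmap");		/* fails if no 16k hugepages are available */
		return 1;
	}
	((volatile char *)p)[0] = 1;	/* touch the mapping to fault in a huge page */
	munmap(p, len);
	return 0;
}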

[PATCH] selftests/vm: Fix display of page size in map_hugetlb

2020-08-31 Thread Christophe Leroy
The displayed size is in bytes while the text says it is in kB.

Shift it by 10 to really display kBytes.

Fixes: fa7b9a805c79 ("tools/selftest/vm: allow choosing mem size and page size 
in map_hugetlb")
Cc: sta...@vger.kernel.org
Signed-off-by: Christophe Leroy 
---
 tools/testing/selftests/vm/map_hugetlb.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/vm/map_hugetlb.c 
b/tools/testing/selftests/vm/map_hugetlb.c
index 6af951900aa3..312889edb84a 100644
--- a/tools/testing/selftests/vm/map_hugetlb.c
+++ b/tools/testing/selftests/vm/map_hugetlb.c
@@ -83,7 +83,7 @@ int main(int argc, char **argv)
}
 
if (shift)
-   printf("%u kB hugepages\n", 1 << shift);
+   printf("%u kB hugepages\n", 1 << (shift - 10));
else
printf("Default size hugepages\n");
printf("Mapping %lu Mbytes\n", (unsigned long)length >> 20);
-- 
2.25.0
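
As a concrete check of the fix: with "./map_hugetlb 1 19" (shift = 19, i.e. 512 kB hugepages), the old code printed 1 << 19 = 524288 "kB hugepages" even though the hugepage size is really 512 kB; with the shift reduced by 10 it prints 1 << (19 - 10) = 512 kB, matching the actual page size.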



[PATCH] powerpc: Fix random segfault when freeing hugetlb range

2020-08-31 Thread Christophe Leroy
The following random segfault is observed from time to time with
map_hugetlb selftest:

root@localhost:~# ./map_hugetlb 1 19
524288 kB hugepages
Mapping 1 Mbytes
Segmentation fault

[   31.219972] map_hugetlb[365]: segfault (11) at 117 nip 77974f8c lr 779a6834 
code 1 in ld-2.23.so[77966000+21000]
[   31.220192] map_hugetlb[365]: code: 9421ffc0 480318d1 93410028 90010044 
9361002c 93810030 93a10034 93c10038
[   31.220307] map_hugetlb[365]: code: 93e1003c 93210024 8123007c 81430038 
<80e90004> 814a0004 7f443a14 813a0004
[   31.221911] BUG: Bad rss-counter state mm:(ptrval) type:MM_FILEPAGES val:33
[   31.229362] BUG: Bad rss-counter state mm:(ptrval) type:MM_ANONPAGES val:5

This fault is due to hugetlb_free_pgd_range() freeing page tables
that are also used by regular pages.

As explained in the comment at the beginning of
hugetlb_free_pgd_range(), the verification done in free_pgd_range()
on floor and ceiling is not done here, which means
hugetlb_free_pte_range() can free outside the expected range.

As the verification cannot be done in hugetlb_free_pgd_range(), it
must be done in hugetlb_free_pte_range().

Fixes: b250c8c08c79 ("powerpc/8xx: Manage 512k huge pages as standard pages.")
Cc: sta...@vger.kernel.org
Signed-off-by: Christophe Leroy 
---
 arch/powerpc/mm/hugetlbpage.c | 18 --
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 26292544630f..e7ae2a2c4545 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -330,10 +330,24 @@ static void free_hugepd_range(struct mmu_gather *tlb, 
hugepd_t *hpdp, int pdshif
 get_hugepd_cache_index(pdshift - shift));
 }
 
-static void hugetlb_free_pte_range(struct mmu_gather *tlb, pmd_t *pmd, 
unsigned long addr)
+static void hugetlb_free_pte_range(struct mmu_gather *tlb, pmd_t *pmd,
+  unsigned long addr, unsigned long end,
+  unsigned long floor, unsigned long ceiling)
 {
+   unsigned long start = addr;
pgtable_t token = pmd_pgtable(*pmd);
 
+   start &= PMD_MASK;
+   if (start < floor)
+   return;
+   if (ceiling) {
+   ceiling &= PMD_MASK;
+   if (!ceiling)
+   return;
+   }
+   if (end - 1 > ceiling - 1)
+   return;
+
pmd_clear(pmd);
pte_free_tlb(tlb, token, addr);
mm_dec_nr_ptes(tlb->mm);
@@ -363,7 +377,7 @@ static void hugetlb_free_pmd_range(struct mmu_gather *tlb, 
pud_t *pud,
 */
WARN_ON(!IS_ENABLED(CONFIG_PPC_8xx));
 
-   hugetlb_free_pte_range(tlb, pmd, addr);
+   hugetlb_free_pte_range(tlb, pmd, addr, end, floor, 
ceiling);
 
continue;
}
-- 
2.25.0
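
To illustrate the added check with concrete numbers (assuming a 4 MB range per PMD, as with 4k base pages): if a 512 kB hugetlb mapping at 0x00580000 is torn down while a neighbouring normal mapping still ends at, say, 0x00410000 inside the same 0x00400000-0x00800000 PMD range, the unmap path passes that neighbour's end as the floor. The new code computes start = addr & PMD_MASK = 0x00400000, sees start < floor, and returns without freeing the PTE page, so the neighbouring normal mapping (ld-2.23.so in the reported segfault) keeps its page table. Previously the page was freed unconditionally, which is exactly the random segfault described above.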