Re: [PATCH v3 08/13] mm/debug_vm_pgtable/thp: Use page table depost/withdraw with THP
On 01/09/2020 at 08:25, Aneesh Kumar K.V wrote:
> On 9/1/20 8:52 AM, Anshuman Khandual wrote:
>> There is a checkpatch.pl warning here.
>>
>> WARNING: Possible unwrapped commit description (prefer a maximum 75 chars per line)
>> #7: Architectures like ppc64 use deposited page table while updating the huge pte
>>
>> total: 0 errors, 1 warnings, 40 lines checked
>
> I will ignore all these, because they are not really important IMHO.

When doing a git log in a 80 chars terminal window, having wrapping lines is
not really convenient. It should be easy to avoid it.

Christophe
Re: [PATCH v3 09/13] mm/debug_vm_pgtable/locks: Move non page table modifying test together
On 9/1/20 9:11 AM, Anshuman Khandual wrote:
> On 08/27/2020 01:34 PM, Aneesh Kumar K.V wrote:
>> This will help in adding proper locks in a later patch
>
> It really makes sense to classify these tests here as static and dynamic.
> Static are the ones that test via page table entry values modification
> without changing anything on the actual page table itself. Where as the
> dynamic tests do change the page table. Static tests would probably never
> require any lock synchronization but the dynamic ones will do.
>
>> Signed-off-by: Aneesh Kumar K.V
>> ---
>> mm/debug_vm_pgtable.c | 52 ---
>> 1 file changed, 29 insertions(+), 23 deletions(-)
>>
>> diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
>> index 0ce5c6a24c5b..78c8af3445ac 100644
>> --- a/mm/debug_vm_pgtable.c
>> +++ b/mm/debug_vm_pgtable.c
>> @@ -992,7 +992,7 @@ static int __init debug_vm_pgtable(void)
>> p4dp = p4d_alloc(mm, pgdp, vaddr);
>> pudp = pud_alloc(mm, p4dp, vaddr);
>> pmdp = pmd_alloc(mm, pudp, vaddr);
>> - ptep = pte_alloc_map_lock(mm, pmdp, vaddr, &ptl);
>> + ptep = pte_alloc_map(mm, pmdp, vaddr);
>>
>> /*
>> * Save all the page table page addresses as the page table
>> @@ -1012,33 +1012,12 @@ static int __init debug_vm_pgtable(void)
>> p4d_basic_tests(p4d_aligned, prot);
>> pgd_basic_tests(pgd_aligned, prot);
>>
>> - pte_clear_tests(mm, ptep, vaddr);
>> - pmd_clear_tests(mm, pmdp);
>> - pud_clear_tests(mm, pudp);
>> - p4d_clear_tests(mm, p4dp);
>> - pgd_clear_tests(mm, pgdp);
>> -
>> - pte_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot);
>> - pmd_advanced_tests(mm, vma, pmdp, pmd_aligned, vaddr, prot, saved_ptep);
>> - pud_advanced_tests(mm, vma, pudp, pud_aligned, vaddr, prot);
>> - hugetlb_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot);
>> -
>> pmd_leaf_tests(pmd_aligned, prot);
>> pud_leaf_tests(pud_aligned, prot);
>>
>> - pmd_huge_tests(pmdp, pmd_aligned, prot);
>> - pud_huge_tests(pudp, pud_aligned, prot);
>> -
>> pte_savedwrite_tests(pte_aligned, protnone);
>> pmd_savedwrite_tests(pmd_aligned, protnone);
>>
>> - pte_unmap_unlock(ptep, ptl);
>> -
>> - pmd_populate_tests(mm, pmdp, saved_ptep);
>> - pud_populate_tests(mm, pudp, saved_pmdp);
>> - p4d_populate_tests(mm, p4dp, saved_pudp);
>> - pgd_populate_tests(mm, pgdp, saved_p4dp);
>> -
>> pte_special_tests(pte_aligned, prot);
>> pte_protnone_tests(pte_aligned, protnone);
>> pmd_protnone_tests(pmd_aligned, protnone);
>> @@ -1056,11 +1035,38 @@ static int __init debug_vm_pgtable(void)
>> pmd_swap_tests(pmd_aligned, prot);
>>
>> swap_migration_tests();
>> - hugetlb_basic_tests(pte_aligned, prot);
>>
>> pmd_thp_tests(pmd_aligned, prot);
>> pud_thp_tests(pud_aligned, prot);
>>
>> + /*
>> + * Page table modifying tests
>> + */
>
> In this comment, please do add some more details about how these tests
> need proper locking mechanism unlike the previous static ones. Also add
> a similar comment section for the static tests that dont really change
> page table and need not require corresponding locking mechanism. Both
> comment sections will make this classification clear.
>
>> + pte_clear_tests(mm, ptep, vaddr);
>> + pmd_clear_tests(mm, pmdp);
>> + pud_clear_tests(mm, pudp);
>> + p4d_clear_tests(mm, p4dp);
>> + pgd_clear_tests(mm, pgdp);
>> +
>> + ptep = pte_alloc_map_lock(mm, pmdp, vaddr, &ptl);
>> + pte_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot);
>> + pmd_advanced_tests(mm, vma, pmdp, pmd_aligned, vaddr, prot, saved_ptep);
>> + pud_advanced_tests(mm, vma, pudp, pud_aligned, vaddr, prot);
>> + hugetlb_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot);
>> +
>> +
>> + pmd_huge_tests(pmdp, pmd_aligned, prot);
>> + pud_huge_tests(pudp, pud_aligned, prot);
>> +
>> + pte_unmap_unlock(ptep, ptl);
>> +
>> + pmd_populate_tests(mm, pmdp, saved_ptep);
>> + pud_populate_tests(mm, pudp, saved_pmdp);
>> + p4d_populate_tests(mm, p4dp, saved_pudp);
>> + pgd_populate_tests(mm, pgdp, saved_p4dp);
>> +
>> + hugetlb_basic_tests(pte_aligned, prot);
>
> hugetlb_basic_tests() is a non page table modifying static test and
> should be classified accordingly.
>
>> +
>> p4d_free(mm, saved_p4dp);
>> pud_free(mm, saved_pudp);
>> pmd_free(mm, saved_pmdp);
>
> Changes till this patch fails to boot on arm64 platform. This should be
> folded with the next patch.
> [ 17.080644] [ cut here ]
> [ 17.081342] kernel BUG at mm/pgtable-generic.c:164!
> [ 17.082091] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
> [ 17.082977] Modules linked in:
> [ 17.083481] CPU: 79 PID: 1 Comm: swapper/0 Tainted: GW 5.9.0-rc2-00105-g740e72cd6717 #14
> [ 17.084998] Hardware name: linux,dummy-virt (DT)
> [ 17.085745] pstate: 6045 (nZCv daif +PAN -UAO BTYPE=--)
> [ 17.086645] pc : pgtable_trans_huge_deposit+0x58/0x60
> [ 17.087462] lr : debug_vm_pgtable+0x4dc/0x8f0
> [ 17.088168] sp : 80001219bcf0
> [ 17.08
Re: [PATCH v3 13/13] mm/debug_vm_pgtable: populate a pte entry before fetching it
On 9/1/20 8:55 AM, Anshuman Khandual wrote:
> On 08/27/2020 01:34 PM, Aneesh Kumar K.V wrote:
>> pte_clear_tests operate on an existing pte entry. Make sure that is not
>> a none pte entry.
>>
>> Signed-off-by: Aneesh Kumar K.V
>> ---
>> mm/debug_vm_pgtable.c | 6 --
>> 1 file changed, 4 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
>> index 21329c7d672f..8527ebb75f2c 100644
>> --- a/mm/debug_vm_pgtable.c
>> +++ b/mm/debug_vm_pgtable.c
>> @@ -546,7 +546,7 @@ static void __init pgd_populate_tests(struct mm_struct *mm, pgd_t *pgdp,
>> static void __init pte_clear_tests(struct mm_struct *mm, pte_t *ptep,
>> unsigned long vaddr)
>> {
>> - pte_t pte = ptep_get(ptep);
>> + pte_t pte = ptep_get_and_clear(mm, vaddr, ptep);
>
> Seems like ptep_get_and_clear() here just clears the entry in preparation
> for a following set_pte_at() which otherwise would have been a problem on
> ppc64 as you had pointed out earlier i.e set_pte_at() should not update an
> existing valid entry. So the commit message here is bit misleading.

and also fetch the pte value which is used further.

>> pr_debug("Validating PTE clear\n");
>> pte = __pte(pte_val(pte) | RANDOM_ORVALUE);
>> @@ -944,7 +944,7 @@ static int __init debug_vm_pgtable(void)
>> p4d_t *p4dp, *saved_p4dp;
>> pud_t *pudp, *saved_pudp;
>> pmd_t *pmdp, *saved_pmdp, pmd;
>> - pte_t *ptep;
>> + pte_t *ptep, pte;
>> pgtable_t saved_ptep;
>> pgprot_t prot, protnone;
>> phys_addr_t paddr;
>> @@ -1049,6 +1049,8 @@ static int __init debug_vm_pgtable(void)
>> */
>> ptep = pte_alloc_map_lock(mm, pmdp, vaddr, &ptl);
>> + pte = pfn_pte(pte_aligned, prot);
>> + set_pte_at(mm, vaddr, ptep, pte);
>
> Not here, creating and populating an entry must be done in respective
> test functions itself. Besides, this seems bit redundant as well. The
> test pte_clear_tests() with the above change added, already
>
> - Clears the PTEP entry with ptep_get_and_clear() and fetch the old
>   value set previously.
> - Creates and populates the PTEP with set_pte_at()
> - Clears with pte_clear()
> - Checks for pte_none()
>
> If the objective is to clear the PTEP entry before calling set_pte_at(),
> then only the first chunk is required and it should also be merged with
> a previous patch.
>
> [PATCH v3 07/13] mm/debug_vm_pgtable/set_pte/pmd/pud: Don't use set_*_at to update an existing pte entry
>
>> pte_clear_tests(mm, ptep, vaddr);
>> pte_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot);
>> pte_unmap_unlock(ptep, ptl);
>
> There is a checkpatch.pl warning here.
>
> WARNING: Possible unwrapped commit description (prefer a maximum 75 chars per line)
> #7: pte_clear_tests operate on an existing pte entry. Make sure that is not a none
>
> total: 0 errors, 1 warnings, 24 lines checked
>
> There is also a build failure on x86 reported from kernel test robot.
Re: [PATCH v3 12/13] mm/debug_vm_pgtable/hugetlb: Disable hugetlb test on ppc64
On 9/1/20 9:33 AM, Anshuman Khandual wrote:
> On 08/27/2020 01:34 PM, Aneesh Kumar K.V wrote:
>> The seems to be missing quite a lot of details w.r.t allocating
>> the correct pgtable_t page (huge_pte_alloc()), holding the right
>> lock (huge_pte_lock()) etc. The vma used is also not a hugetlb VMA.
>>
>> ppc64 do have runtime checks within CONFIG_DEBUG_VM for most of these.
>> Hence disable the test on ppc64.
>
> Would really like this to get resolved in an uniform and better way
> instead, i.e a modified hugetlb_advanced_tests() which works on all
> platforms including ppc64. In absence of a modified version, I do
> realize the situation here, where DEBUG_VM_PGTABLE test either runs on
> ppc64 or just completely remove hugetlb_advanced_tests() from other
> platforms as well.
>
>> Signed-off-by: Aneesh Kumar K.V
>> ---
>> mm/debug_vm_pgtable.c | 4
>> 1 file changed, 4 insertions(+)
>>
>> diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
>> index a188b6e4e37e..21329c7d672f 100644
>> --- a/mm/debug_vm_pgtable.c
>> +++ b/mm/debug_vm_pgtable.c
>> @@ -813,6 +813,7 @@ static void __init hugetlb_basic_tests(unsigned long pfn, pgprot_t prot)
>> #endif /* CONFIG_ARCH_WANT_GENERAL_HUGETLB */
>> }
>>
>> +#ifndef CONFIG_PPC_BOOK3S_64
>> static void __init hugetlb_advanced_tests(struct mm_struct *mm,
>> struct vm_area_struct *vma,
>> pte_t *ptep, unsigned long pfn,
>> @@ -855,6 +856,7 @@ static void __init hugetlb_advanced_tests(struct mm_struct *mm,
>> pte = huge_ptep_get(ptep);
>> WARN_ON(!(huge_pte_write(pte) && huge_pte_dirty(pte)));
>> }
>> +#endif
>
> In the worst case if we could not get a new hugetlb_advanced_tests()
> test that works on all platforms, this might be the last fallback
> option. In which case, it will require a proper comment section with a
> "FIXME: ", explaining the current situation (and that #ifdef is
> temporary in nature) and a hugetlb_advanced_tests() stub when
> CONFIG_PPC_BOOK3S_64 is enabled.
>
>> #else /* !CONFIG_HUGETLB_PAGE */
>> static void __init hugetlb_basic_tests(unsigned long pfn, pgprot_t prot) { }
>> static void __init hugetlb_advanced_tests(struct mm_struct *mm,
>> @@ -1065,7 +1067,9 @@ static int __init debug_vm_pgtable(void)
>> pud_populate_tests(mm, pudp, saved_pmdp);
>> spin_unlock(ptl);
>>
>> +#ifndef CONFIG_PPC_BOOK3S_64
>> hugetlb_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot);
>> +#endif

I actually wanted to add #ifdef BROKEN. That test is completely broken.
Infact I would suggest to remove that test completely.

> #ifdef will not be required here as there would be a stub definition
> for hugetlb_advanced_tests() when CONFIG_PPC_BOOK3S_64 is enabled.
>
>> spin_lock(&mm->page_table_lock);
>> p4d_clear_tests(mm, p4dp);
>
> But again, we should really try and avoid taking this path.

To be frank i am kind of frustrated with how this patch series is being
looked at. We pushed a completely broken test to upstream and right now
we have a code in upstream that crash when booted on ppc64. My attempt
has been to make progress here and you definitely seems to be not in
agreement to that.

At this point I am tempted to suggest we remove the DEBUG_VM_PGTABLE
support on ppc64 because AFAIU it doesn't add any value.

-aneesh
Re: [PATCH v3 08/13] mm/debug_vm_pgtable/thp: Use page table depost/withdraw with THP
On 9/1/20 8:52 AM, Anshuman Khandual wrote:
> On 08/27/2020 01:34 PM, Aneesh Kumar K.V wrote:
>> Architectures like ppc64 use deposited page table while updating the
>> huge pte entries.
>>
>> Signed-off-by: Aneesh Kumar K.V
>> ---
>> mm/debug_vm_pgtable.c | 10 +++---
>> 1 file changed, 7 insertions(+), 3 deletions(-)
>>
>> diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
>> index f9f6358899a8..0ce5c6a24c5b 100644
>> --- a/mm/debug_vm_pgtable.c
>> +++ b/mm/debug_vm_pgtable.c
>> @@ -154,7 +154,7 @@ static void __init pmd_basic_tests(unsigned long pfn, pgprot_t prot)
>> static void __init pmd_advanced_tests(struct mm_struct *mm,
>> struct vm_area_struct *vma, pmd_t *pmdp,
>> unsigned long pfn, unsigned long vaddr,
>> - pgprot_t prot)
>> + pgprot_t prot, pgtable_t pgtable)
>> {
>> pmd_t pmd;
>>
>> @@ -165,6 +165,8 @@ static void __init pmd_advanced_tests(struct mm_struct *mm,
>> /* Align the address wrt HPAGE_PMD_SIZE */
>> vaddr = (vaddr & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE;
>>
>> + pgtable_trans_huge_deposit(mm, pmdp, pgtable);
>> +
>> pmd = pmd_mkhuge(pfn_pmd(pfn, prot));
>> set_pmd_at(mm, vaddr, pmdp, pmd);
>> pmdp_set_wrprotect(mm, vaddr, pmdp);
>> @@ -193,6 +195,8 @@ static void __init pmd_advanced_tests(struct mm_struct *mm,
>> pmdp_test_and_clear_young(vma, vaddr, pmdp);
>> pmd = READ_ONCE(*pmdp);
>> WARN_ON(pmd_young(pmd));
>> +
>> + pgtable = pgtable_trans_huge_withdraw(mm, pmdp);
>
> Should the call sites here be wrapped with __HAVE_ARCH_PGTABLE_DEPOSIT
> and __HAVE_ARCH_PGTABLE_WITHDRAW respectively. Though there are generic
> fallback definitions, wondering whether they are indeed essential for
> all platforms.

No, because any page table helpers operating on pmd level THP can expect
a deposited page table. __HAVE_ARCH_PGTABLE_DEPOSIT only indicates
whether we fall back to the generic version.

>> }
>>
>> static void __init pmd_leaf_tests(unsigned long pfn, pgprot_t prot)
>> @@ -373,7 +377,7 @@ static void __init pud_basic_tests(unsigned long pfn, pgprot_t prot) { }
>> static void __init pmd_advanced_tests(struct mm_struct *mm,
>> struct vm_area_struct *vma, pmd_t *pmdp,
>> unsigned long pfn, unsigned long vaddr,
>> - pgprot_t prot)
>> + pgprot_t prot, pgtable_t pgtable)
>> {
>> }
>> static void __init pud_advanced_tests(struct mm_struct *mm,
>> @@ -1015,7 +1019,7 @@ static int __init debug_vm_pgtable(void)
>> pgd_clear_tests(mm, pgdp);
>>
>> pte_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot);
>> - pmd_advanced_tests(mm, vma, pmdp, pmd_aligned, vaddr, prot);
>> + pmd_advanced_tests(mm, vma, pmdp, pmd_aligned, vaddr, prot, saved_ptep);
>> pud_advanced_tests(mm, vma, pudp, pud_aligned, vaddr, prot);
>> hugetlb_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot);
>
> There is a checkpatch.pl warning here.
>
> WARNING: Possible unwrapped commit description (prefer a maximum 75 chars per line)
> #7: Architectures like ppc64 use deposited page table while updating the huge pte
>
> total: 0 errors, 1 warnings, 40 lines checked

I will ignore all these, because they are not really important IMHO.

-aneesh
Re: [PATCH v3 06/13] mm/debug_vm_pgtable/THP: Mark the pte entry huge before using set_pmd/pud_at
On 9/1/20 8:51 AM, Anshuman Khandual wrote:
> On 08/27/2020 01:34 PM, Aneesh Kumar K.V wrote:
>> kernel expects entries to be marked huge before we use
>> set_pmd_at()/set_pud_at().
>>
>> Signed-off-by: Aneesh Kumar K.V
>> ---
>> mm/debug_vm_pgtable.c | 21 -
>> 1 file changed, 12 insertions(+), 9 deletions(-)
>>
>> diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
>> index 5c0680836fe9..de83a20c1b30 100644
>> --- a/mm/debug_vm_pgtable.c
>> +++ b/mm/debug_vm_pgtable.c
>> @@ -155,7 +155,7 @@ static void __init pmd_advanced_tests(struct mm_struct *mm,
>> unsigned long pfn, unsigned long vaddr,
>> pgprot_t prot)
>> {
>> - pmd_t pmd = pfn_pmd(pfn, prot);
>> + pmd_t pmd;
>>
>> if (!has_transparent_hugepage())
>> return;
>> @@ -164,19 +164,19 @@ static void __init pmd_advanced_tests(struct mm_struct *mm,
>> /* Align the address wrt HPAGE_PMD_SIZE */
>> vaddr = (vaddr & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE;
>>
>> - pmd = pfn_pmd(pfn, prot);
>> + pmd = pmd_mkhuge(pfn_pmd(pfn, prot));
>> set_pmd_at(mm, vaddr, pmdp, pmd);
>> pmdp_set_wrprotect(mm, vaddr, pmdp);
>> pmd = READ_ONCE(*pmdp);
>> WARN_ON(pmd_write(pmd));
>>
>> - pmd = pfn_pmd(pfn, prot);
>> + pmd = pmd_mkhuge(pfn_pmd(pfn, prot));
>> set_pmd_at(mm, vaddr, pmdp, pmd);
>> pmdp_huge_get_and_clear(mm, vaddr, pmdp);
>> pmd = READ_ONCE(*pmdp);
>> WARN_ON(!pmd_none(pmd));
>>
>> - pmd = pfn_pmd(pfn, prot);
>> + pmd = pmd_mkhuge(pfn_pmd(pfn, prot));
>> pmd = pmd_wrprotect(pmd);
>> pmd = pmd_mkclean(pmd);
>> set_pmd_at(mm, vaddr, pmdp, pmd);
>> @@ -237,7 +237,7 @@ static void __init pmd_huge_tests(pmd_t *pmdp, unsigned long pfn, pgprot_t prot)
>>
>> static void __init pmd_savedwrite_tests(unsigned long pfn, pgprot_t prot)
>> {
>> - pmd_t pmd = pfn_pmd(pfn, prot);
>> + pmd_t pmd = pmd_mkhuge(pfn_pmd(pfn, prot));
>
> There is no set_pmd_at() in this particular test, why change ?

Because if you are building a hugepage you should use pmd_mkhuge(). That
is what is setting _PAGE_PTE with this series. We don't make pfn_pmd()
set _PAGE_PTE.

-aneesh
Re: [PATCH v3 03/13] mm/debug_vm_pgtable/ppc64: Avoid setting top bits in radom value
On 9/1/20 8:45 AM, Anshuman Khandual wrote:
> On 08/27/2020 01:34 PM, Aneesh Kumar K.V wrote:
>> ppc64 use bit 62 to indicate a pte entry (_PAGE_PTE). Avoid setting
>> that bit in random value.
>>
>> Signed-off-by: Aneesh Kumar K.V
>> ---
>> mm/debug_vm_pgtable.c | 13 ++---
>> 1 file changed, 10 insertions(+), 3 deletions(-)
>>
>> diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
>> index 086309fb9b6f..bbf9df0e64c6 100644
>> --- a/mm/debug_vm_pgtable.c
>> +++ b/mm/debug_vm_pgtable.c
>> @@ -44,10 +44,17 @@
>> * entry type. But these bits might affect the ability to clear entries with
>> * pxx_clear() because of how dynamic page table folding works on s390. So
>> * while loading up the entries do not change the lower 4 bits. It does not
>> - * have affect any other platform.
>> + * have affect any other platform. Also avoid the 62nd bit on ppc64 that is
>> + * used to mark a pte entry.
>> */
>> -#define S390_MASK_BITS 4
>> -#define RANDOM_ORVALUE GENMASK(BITS_PER_LONG - 1, S390_MASK_BITS)
>> +#define S390_SKIP_MASK GENMASK(3, 0)
>> +#ifdef CONFIG_PPC_BOOK3S_64
>> +#define PPC64_SKIP_MASK GENMASK(62, 62)
>> +#else
>> +#define PPC64_SKIP_MASK 0x0
>> +#endif
>
> Please drop the #ifdef CONFIG_PPC_BOOK3S_64 here. We already accommodate
> skip bits for a s390 platform requirement and can also do so for ppc64
> as well. As mentioned before, please avoid adding any platform specific
> constructs in the test.

That is needed so that it can be built on 32 bit architectures. I did
face build errors with arch-linux.

>> +#define ARCH_SKIP_MASK (S390_SKIP_MASK | PPC64_SKIP_MASK)
>> +#define RANDOM_ORVALUE (GENMASK(BITS_PER_LONG - 1, 0) & ~ARCH_SKIP_MASK)
>> #define RANDOM_NZVALUE GENMASK(7, 0)
>
> Please fix the alignments here. Feel free to consider following changes
> after this patch.
>
> diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
> index 122416464e0f..f969031152bb 100644
> --- a/mm/debug_vm_pgtable.c
> +++ b/mm/debug_vm_pgtable.c
> @@ -48,14 +48,11 @@
> * have affect any other platform. Also avoid the 62nd bit on ppc64 that is
> * used to mark a pte entry.
> */
> -#define S390_SKIP_MASK GENMASK(3, 0)
> -#ifdef CONFIG_PPC_BOOK3S_64
> -#define PPC64_SKIP_MASK GENMASK(62, 62)
> -#else
> -#define PPC64_SKIP_MASK 0x0
> -#endif
> -#define ARCH_SKIP_MASK (S390_SKIP_MASK | PPC64_SKIP_MASK)
> -#define RANDOM_ORVALUE (GENMASK(BITS_PER_LONG - 1, 0) & ~ARCH_SKIP_MASK)
> +#define S390_SKIP_MASK GENMASK(3, 0)
> +#define PPC64_SKIP_MASK GENMASK(62, 62)
> +#define ARCH_SKIP_MASK (S390_SKIP_MASK | PPC64_SKIP_MASK)
> +#define RANDOM_ORVALUE (GENMASK(BITS_PER_LONG - 1, 0) & ~ARCH_SKIP_MASK)
> +
> #define RANDOM_NZVALUE GENMASK(7, 0)
>
> static void __init pte_basic_tests(unsigned long pfn, pgprot_t prot)
>
> Besides, there is also one checkpatch.pl warning here.
>
> WARNING: Possible unwrapped commit description (prefer a maximum 75 chars per line)
> #7: ppc64 use bit 62 to indicate a pte entry (_PAGE_PTE). Avoid setting that bit in
>
> total: 0 errors, 1 warnings, 20 lines checked

These warnings are not valid. They are mostly long lines (up to 100), or
some details mentioned in the () as above.

-aneesh
Re: fsl_espi errors on v5.7.15
Excerpts from Chris Packham's message of September 1, 2020 11:25 am:
>
> On 1/09/20 12:33 am, Heiner Kallweit wrote:
>> On 30.08.2020 23:59, Chris Packham wrote:
>>> On 31/08/20 9:41 am, Heiner Kallweit wrote:
>>>> On 30.08.2020 23:00, Chris Packham wrote:
>>>>> On 31/08/20 12:30 am, Nicholas Piggin wrote:
>>>>>> Excerpts from Chris Packham's message of August 28, 2020 8:07 am:
>>>
>>> I've also now seen the RX FIFO not empty error on the T2080RDB
>>>
>>> fsl_espi ffe11.spi: Transfer done but SPIE_DON isn't set!
>>> fsl_espi ffe11.spi: Transfer done but SPIE_DON isn't set!
>>> fsl_espi ffe11.spi: Transfer done but SPIE_DON isn't set!
>>> fsl_espi ffe11.spi: Transfer done but SPIE_DON isn't set!
>>> fsl_espi ffe11.spi: Transfer done but rx/tx fifo's aren't empty!
>>> fsl_espi ffe11.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>
>>> With my current workaround of emptying the RX FIFO. It seems
>>> survivable. Interestingly it only ever seems to be 1 extra byte in the
>>> RX FIFO and it seems to be after either a READ_SR or a READ_FSR.
>>>
>>> fsl_espi ffe11.spi: tx 70
>>> fsl_espi ffe11.spi: rx 03
>>> fsl_espi ffe11.spi: Extra RX 00
>>> fsl_espi ffe11.spi: Transfer done but SPIE_DON isn't set!
>>> fsl_espi ffe11.spi: Transfer done but rx/tx fifo's aren't empty!
>>> fsl_espi ffe11.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>> fsl_espi ffe11.spi: tx 05
>>> fsl_espi ffe11.spi: rx 00
>>> fsl_espi ffe11.spi: Extra RX 03
>>> fsl_espi ffe11.spi: Transfer done but SPIE_DON isn't set!
>>> fsl_espi ffe11.spi: Transfer done but rx/tx fifo's aren't empty!
>>> fsl_espi ffe11.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>> fsl_espi ffe11.spi: tx 05
>>> fsl_espi ffe11.spi: rx 00
>>> fsl_espi ffe11.spi: Extra RX 03
>>>
>>> From all the Micron SPI-NOR datasheets I've got access to it is
>>> possible to continually read the SR/FSR. But I've no idea why it
>>> happens some times and not others.
>>
>> So I think I've got a reproduction and I think I've bisected the problem
>> to commit 3282a3da25bd ("powerpc/64: Implement soft interrupt replay in
>> C"). My day is just finishing now so I haven't applied too much scrutiny
>> to this result. Given the various rabbit holes I've been down on this
>> issue already I'd take this information with a good degree of skepticism.
>>
> OK, so an easy test should be to re-test with a 5.4 kernel.
> It doesn't have yet the change you're referring to, and the fsl-espi driver
> is basically the same as in 5.7 (just two small changes in 5.7).

There's 6cc0c16d82f88 and maybe also other interrupt related patches
around this time that could affect book E, so it's good if that exact
patch is confirmed.

>>> My confirmation is basically that I can induce the issue in a 5.4 kernel
>>> by cherry-picking 3282a3da25bd. I'm also able to "fix" the issue in
>>> 5.9-rc2 by reverting that one commit.
>>>
>>> I both cases it's not exactly a clean cherry-pick/revert so I also
>>> confirmed the bisection result by building at 3282a3da25bd (which sees
>>> the issue) and the commit just before (which does not).
>>
>> Thanks for testing, that confirms it well.
>>
>> [snip patch]
>>
>>> I still saw the issue with this change applied. PPC_IRQ_SOFT_MASK_DEBUG
>>> didn't report anything (either with or without the change above).
>>
>> Okay, it was a bit of a shot in the dark. I still can't see what
>> else has changed.
>>
>> What would cause this, a lost interrupt? A spurious interrupt? Or
>> higher interrupt latency?
>>
>> I don't think the patch should cause significantly worse latency,
>> (it's supposed to be a bit better if anything because it doesn't set
>> up the full interrupt frame). But it's possible.
>
> My working theory is that the SPI_DON indication is all about the TX
> direction an now that the interrupts are faster we're hitting an error
> because there is still RX activity going on. Heiner disagrees with my
> interpretation of the SPI_DON indication and the fact that it doesn't
> happen every time does throw doubt on it.
>

It's right that the eSPI spec can be interpreted that SPI_DON refers to
TX only. However this wouldn't really make sense, because also for RX we
program the frame length, and therefore want to be notified once the
full frame was received. Also practical experience shows that SPI_DON is
set also after RX-only transfers. Typical SPI NOR use case is that you
write
[PATCH v11] Fixup for "powerpc/vdso: Provide __kernel_clock_gettime64() on vdso32"
Signed-off-by: Christophe Leroy
---
 arch/powerpc/include/asm/vdso/gettimeofday.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/include/asm/vdso/gettimeofday.h b/arch/powerpc/include/asm/vdso/gettimeofday.h
index 59a609a48b63..8da84722729b 100644
--- a/arch/powerpc/include/asm/vdso/gettimeofday.h
+++ b/arch/powerpc/include/asm/vdso/gettimeofday.h
@@ -186,6 +186,8 @@ int __c_kernel_clock_getres(clockid_t clock_id, struct __kernel_timespec *res,
 #else
 int __c_kernel_clock_gettime(clockid_t clock, struct old_timespec32 *ts,
 			     const struct vdso_data *vd);
+int __c_kernel_clock_gettime64(clockid_t clock, struct __kernel_timespec *ts,
+			       const struct vdso_data *vd);
 int __c_kernel_clock_getres(clockid_t clock_id, struct old_timespec32 *res,
 			    const struct vdso_data *vd);
 #endif
-- 
2.25.0
[Bug 209029] kernel 5.9-rc2 fails to boot on a PowerMac G5 11,2 - BUG: Kernel NULL pointer dereference on read at 0x00000020
https://bugzilla.kernel.org/show_bug.cgi?id=209029

Christophe Leroy (christophe.le...@csgroup.eu) changed:

           What    |Removed |Added
                 CC|        |christophe.le...@csgroup.eu

--- Comment #3 from Christophe Leroy (christophe.le...@csgroup.eu) ---
Did you try without CONFIG_DEBUG_VM_PGTABLE ?

If you want CONFIG_DEBUG_VM_PGTABLE, the following series aims at fixing
it for PPC64:
https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=197961

-- 
You are receiving this mail because:
You are watching the assignee of the bug.
Re: [PATCH v3 12/13] mm/debug_vm_pgtable/hugetlb: Disable hugetlb test on ppc64
On 08/27/2020 01:34 PM, Aneesh Kumar K.V wrote:
> The seems to be missing quite a lot of details w.r.t allocating
> the correct pgtable_t page (huge_pte_alloc()), holding the right
> lock (huge_pte_lock()) etc. The vma used is also not a hugetlb VMA.
>
> ppc64 do have runtime checks within CONFIG_DEBUG_VM for most of these.
> Hence disable the test on ppc64.

Would really like this to get resolved in an uniform and better way
instead, i.e a modified hugetlb_advanced_tests() which works on all
platforms including ppc64. In absence of a modified version, I do
realize the situation here, where DEBUG_VM_PGTABLE test either runs on
ppc64 or just completely remove hugetlb_advanced_tests() from other
platforms as well.

> Signed-off-by: Aneesh Kumar K.V
> ---
>  mm/debug_vm_pgtable.c | 4
>  1 file changed, 4 insertions(+)
>
> diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
> index a188b6e4e37e..21329c7d672f 100644
> --- a/mm/debug_vm_pgtable.c
> +++ b/mm/debug_vm_pgtable.c
> @@ -813,6 +813,7 @@ static void __init hugetlb_basic_tests(unsigned long pfn, pgprot_t prot)
>  #endif /* CONFIG_ARCH_WANT_GENERAL_HUGETLB */
>  }
>
> +#ifndef CONFIG_PPC_BOOK3S_64
>  static void __init hugetlb_advanced_tests(struct mm_struct *mm,
>  					  struct vm_area_struct *vma,
>  					  pte_t *ptep, unsigned long pfn,
> @@ -855,6 +856,7 @@ static void __init hugetlb_advanced_tests(struct mm_struct *mm,
>  	pte = huge_ptep_get(ptep);
>  	WARN_ON(!(huge_pte_write(pte) && huge_pte_dirty(pte)));
>  }
> +#endif

In the worst case if we could not get a new hugetlb_advanced_tests()
test that works on all platforms, this might be the last fallback
option. In which case, it will require a proper comment section with a
"FIXME: ", explaining the current situation (and that #ifdef is
temporary in nature) and a hugetlb_advanced_tests() stub when
CONFIG_PPC_BOOK3S_64 is enabled.

> #else /* !CONFIG_HUGETLB_PAGE */
> static void __init hugetlb_basic_tests(unsigned long pfn, pgprot_t prot) { }
> static void __init hugetlb_advanced_tests(struct mm_struct *mm,
> @@ -1065,7 +1067,9 @@ static int __init debug_vm_pgtable(void)
>  	pud_populate_tests(mm, pudp, saved_pmdp);
>  	spin_unlock(ptl);
>
> +#ifndef CONFIG_PPC_BOOK3S_64
>  	hugetlb_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot);
> +#endif

#ifdef will not be required here as there would be a stub definition for
hugetlb_advanced_tests() when CONFIG_PPC_BOOK3S_64 is enabled.

>
>  	spin_lock(&mm->page_table_lock);
>  	p4d_clear_tests(mm, p4dp);
>

But again, we should really try and avoid taking this path.
Re: [PATCH v3 10/13] mm/debug_vm_pgtable/locks: Take correct page table lock
On 08/27/2020 01:34 PM, Aneesh Kumar K.V wrote:
> Make sure we call pte accessors with correct lock held.
>
> Signed-off-by: Aneesh Kumar K.V
> ---
>  mm/debug_vm_pgtable.c | 34 --
>  1 file changed, 20 insertions(+), 14 deletions(-)
>
> diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
> index 78c8af3445ac..0a6e771ebd13 100644
> --- a/mm/debug_vm_pgtable.c
> +++ b/mm/debug_vm_pgtable.c
> @@ -1039,33 +1039,39 @@ static int __init debug_vm_pgtable(void)
>  	pmd_thp_tests(pmd_aligned, prot);
>  	pud_thp_tests(pud_aligned, prot);
>
> +	hugetlb_basic_tests(pte_aligned, prot);
> +

This moved back again. As pointed out earlier, there will be a bisect
problem and so this patch must be folded back with the previous one.

>  	/*
>  	 * Page table modifying tests
>  	 */
> -	pte_clear_tests(mm, ptep, vaddr);
> -	pmd_clear_tests(mm, pmdp);
> -	pud_clear_tests(mm, pudp);
> -	p4d_clear_tests(mm, p4dp);
> -	pgd_clear_tests(mm, pgdp);
>
>  	ptep = pte_alloc_map_lock(mm, pmdp, vaddr, &ptl);
> +	pte_clear_tests(mm, ptep, vaddr);
>  	pte_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot);
> -	pmd_advanced_tests(mm, vma, pmdp, pmd_aligned, vaddr, prot, saved_ptep);
> -	pud_advanced_tests(mm, vma, pudp, pud_aligned, vaddr, prot);
> -	hugetlb_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot);
> -
> +	pte_unmap_unlock(ptep, ptl);
>
> +	ptl = pmd_lock(mm, pmdp);
> +	pmd_clear_tests(mm, pmdp);
> +	pmd_advanced_tests(mm, vma, pmdp, pmd_aligned, vaddr, prot, saved_ptep);
>  	pmd_huge_tests(pmdp, pmd_aligned, prot);
> +	pmd_populate_tests(mm, pmdp, saved_ptep);
> +	spin_unlock(ptl);
> +
> +	ptl = pud_lock(mm, pudp);
> +	pud_clear_tests(mm, pudp);
> +	pud_advanced_tests(mm, vma, pudp, pud_aligned, vaddr, prot);
>  	pud_huge_tests(pudp, pud_aligned, prot);
> +	pud_populate_tests(mm, pudp, saved_pmdp);
> +	spin_unlock(ptl);
>
> -	pte_unmap_unlock(ptep, ptl);
> +	hugetlb_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot);
>
> -	pmd_populate_tests(mm, pmdp, saved_ptep);
> -	pud_populate_tests(mm, pudp, saved_pmdp);
> +	spin_lock(&mm->page_table_lock);
> +	p4d_clear_tests(mm, p4dp);
> +	pgd_clear_tests(mm, pgdp);
>  	p4d_populate_tests(mm, p4dp, saved_pudp);
>  	pgd_populate_tests(mm, pgdp, saved_p4dp);
> -
> -	hugetlb_basic_tests(pte_aligned, prot);
> +	spin_unlock(&mm->page_table_lock);
>
>  	p4d_free(mm, saved_p4dp);
>  	pud_free(mm, saved_pudp);
>

Overall, grouping together dynamic tests at various page table levels
and taking a corresponding lock, makes sense. Commit message for the
resultant patch should include (a) Test classification (b) Grouping by
function for the static tests, by page table level for the dynamic tests
(c) Locking requirement for the dynamic tests each level etc.
Re: [PATCH v3 09/13] mm/debug_vm_pgtable/locks: Move non page table modifying test together
On 08/27/2020 01:34 PM, Aneesh Kumar K.V wrote: > This will help in adding proper locks in a later patch It really makes sense to classify these tests here as static and dynamic. Static are the ones that test via page table entry values modification without changing anything on the actual page table itself. Where as the dynamic tests do change the page table. Static tests would probably never require any lock synchronization but the dynamic ones will do. > > Signed-off-by: Aneesh Kumar K.V > --- > mm/debug_vm_pgtable.c | 52 --- > 1 file changed, 29 insertions(+), 23 deletions(-) > > diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c > index 0ce5c6a24c5b..78c8af3445ac 100644 > --- a/mm/debug_vm_pgtable.c > +++ b/mm/debug_vm_pgtable.c > @@ -992,7 +992,7 @@ static int __init debug_vm_pgtable(void) > p4dp = p4d_alloc(mm, pgdp, vaddr); > pudp = pud_alloc(mm, p4dp, vaddr); > pmdp = pmd_alloc(mm, pudp, vaddr); > - ptep = pte_alloc_map_lock(mm, pmdp, vaddr, &ptl); > + ptep = pte_alloc_map(mm, pmdp, vaddr); > > /* >* Save all the page table page addresses as the page table > @@ -1012,33 +1012,12 @@ static int __init debug_vm_pgtable(void) > p4d_basic_tests(p4d_aligned, prot); > pgd_basic_tests(pgd_aligned, prot); > > - pte_clear_tests(mm, ptep, vaddr); > - pmd_clear_tests(mm, pmdp); > - pud_clear_tests(mm, pudp); > - p4d_clear_tests(mm, p4dp); > - pgd_clear_tests(mm, pgdp); > - > - pte_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot); > - pmd_advanced_tests(mm, vma, pmdp, pmd_aligned, vaddr, prot, saved_ptep); > - pud_advanced_tests(mm, vma, pudp, pud_aligned, vaddr, prot); > - hugetlb_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot); > - > pmd_leaf_tests(pmd_aligned, prot); > pud_leaf_tests(pud_aligned, prot); > > - pmd_huge_tests(pmdp, pmd_aligned, prot); > - pud_huge_tests(pudp, pud_aligned, prot); > - > pte_savedwrite_tests(pte_aligned, protnone); > pmd_savedwrite_tests(pmd_aligned, protnone); > > - pte_unmap_unlock(ptep, ptl); > - > - 
pmd_populate_tests(mm, pmdp, saved_ptep); > - pud_populate_tests(mm, pudp, saved_pmdp); > - p4d_populate_tests(mm, p4dp, saved_pudp); > - pgd_populate_tests(mm, pgdp, saved_p4dp); > - > pte_special_tests(pte_aligned, prot); > pte_protnone_tests(pte_aligned, protnone); > pmd_protnone_tests(pmd_aligned, protnone); > @@ -1056,11 +1035,38 @@ static int __init debug_vm_pgtable(void) > pmd_swap_tests(pmd_aligned, prot); > > swap_migration_tests(); > - hugetlb_basic_tests(pte_aligned, prot); > > pmd_thp_tests(pmd_aligned, prot); > pud_thp_tests(pud_aligned, prot); > > + /* > + * Page table modifying tests > + */ In this comment, please do add some more details about how these tests need a proper locking mechanism, unlike the previous static ones. Also add a similar comment section for the static tests that don't really change the page table and need not require a corresponding locking mechanism. Both comment sections will make this classification clear. > + pte_clear_tests(mm, ptep, vaddr); > + pmd_clear_tests(mm, pmdp); > + pud_clear_tests(mm, pudp); > + p4d_clear_tests(mm, p4dp); > + pgd_clear_tests(mm, pgdp); > + > + ptep = pte_alloc_map_lock(mm, pmdp, vaddr, &ptl); > + pte_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot); > + pmd_advanced_tests(mm, vma, pmdp, pmd_aligned, vaddr, prot, saved_ptep); > + pud_advanced_tests(mm, vma, pudp, pud_aligned, vaddr, prot); > + hugetlb_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot); > + > + > + pmd_huge_tests(pmdp, pmd_aligned, prot); > + pud_huge_tests(pudp, pud_aligned, prot); > + > + pte_unmap_unlock(ptep, ptl); > + > + pmd_populate_tests(mm, pmdp, saved_ptep); > + pud_populate_tests(mm, pudp, saved_pmdp); > + p4d_populate_tests(mm, p4dp, saved_pudp); > + pgd_populate_tests(mm, pgdp, saved_p4dp); > + > + hugetlb_basic_tests(pte_aligned, prot); hugetlb_basic_tests() is a non page table modifying static test and should be classified accordingly. 
> + > p4d_free(mm, saved_p4dp); > pud_free(mm, saved_pudp); > pmd_free(mm, saved_pmdp); > Changes till this patch fails to boot on arm64 platform. This should be folded with the next patch. [ 17.080644] [ cut here ] [ 17.081342] kernel BUG at mm/pgtable-generic.c:164! [ 17.082091] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP [ 17.082977] Modules linked in: [ 17.083481] CPU: 79 PID: 1 Comm: swapper/0 Tainted: GW 5.9.0-rc2-00105-g740e72cd6717 #14 [ 17.084998] Hardware name: linux,dummy-virt (DT) [ 17.085745] pstate: 6045 (nZCv daif +PAN -UAO BTYPE=--) [ 17.086645] pc : pgtable_trans_huge_deposit+0x58/0x60 [ 17.087462] lr : debug_vm_pgtable+0x4dc/0x8f0 [ 17.0
Re: [PATCH v3 13/13] mm/debug_vm_pgtable: populate a pte entry before fetching it
On 08/27/2020 01:34 PM, Aneesh Kumar K.V wrote: > pte_clear_tests operate on an existing pte entry. Make sure that is not a none > pte entry. > > Signed-off-by: Aneesh Kumar K.V > --- > mm/debug_vm_pgtable.c | 6 -- > 1 file changed, 4 insertions(+), 2 deletions(-) > > diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c > index 21329c7d672f..8527ebb75f2c 100644 > --- a/mm/debug_vm_pgtable.c > +++ b/mm/debug_vm_pgtable.c > @@ -546,7 +546,7 @@ static void __init pgd_populate_tests(struct mm_struct > *mm, pgd_t *pgdp, > static void __init pte_clear_tests(struct mm_struct *mm, pte_t *ptep, > unsigned long vaddr) > { > - pte_t pte = ptep_get(ptep); > + pte_t pte = ptep_get_and_clear(mm, vaddr, ptep); Seems like ptep_get_and_clear() here just clears the entry in preparation for a following set_pte_at() which otherwise would have been a problem on ppc64 as you had pointed out earlier i.e set_pte_at() should not update an existing valid entry. So the commit message here is bit misleading. > > pr_debug("Validating PTE clear\n"); > pte = __pte(pte_val(pte) | RANDOM_ORVALUE); > @@ -944,7 +944,7 @@ static int __init debug_vm_pgtable(void) > p4d_t *p4dp, *saved_p4dp; > pud_t *pudp, *saved_pudp; > pmd_t *pmdp, *saved_pmdp, pmd; > - pte_t *ptep; > + pte_t *ptep, pte; > pgtable_t saved_ptep; > pgprot_t prot, protnone; > phys_addr_t paddr; > @@ -1049,6 +1049,8 @@ static int __init debug_vm_pgtable(void) >*/ > > ptep = pte_alloc_map_lock(mm, pmdp, vaddr, &ptl); > + pte = pfn_pte(pte_aligned, prot); > + set_pte_at(mm, vaddr, ptep, pte); Not here, creating and populating an entry must be done in respective test functions itself. Besides, this seems bit redundant as well. 
The test pte_clear_tests(), with the above change added, already:

- Clears the PTEP entry with ptep_get_and_clear()
- Creates and populates the PTEP with set_pte_at()
- Clears with pte_clear()
- Checks for pte_none()

If the objective is to clear the PTEP entry before calling set_pte_at(), then only the first chunk is required and it should also be merged with a previous patch. [PATCH v3 07/13] mm/debug_vm_pgtable/set_pte/pmd/pud: Don't use set_*_at to update an existing pte entry > pte_clear_tests(mm, ptep, vaddr); > pte_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot); > pte_unmap_unlock(ptep, ptl); > There is a checkpatch.pl warning here. WARNING: Possible unwrapped commit description (prefer a maximum 75 chars per line) #7: pte_clear_tests operate on an existing pte entry. Make sure that is not a none total: 0 errors, 1 warnings, 24 lines checked There is also a build failure on x86 reported from kernel test robot.
Re: [PATCH v3 08/13] mm/debug_vm_pgtable/thp: Use page table depost/withdraw with THP
On 08/27/2020 01:34 PM, Aneesh Kumar K.V wrote: > Architectures like ppc64 use deposited page table while updating the huge pte > entries. > > Signed-off-by: Aneesh Kumar K.V > --- > mm/debug_vm_pgtable.c | 10 +++--- > 1 file changed, 7 insertions(+), 3 deletions(-) > > diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c > index f9f6358899a8..0ce5c6a24c5b 100644 > --- a/mm/debug_vm_pgtable.c > +++ b/mm/debug_vm_pgtable.c > @@ -154,7 +154,7 @@ static void __init pmd_basic_tests(unsigned long pfn, > pgprot_t prot) > static void __init pmd_advanced_tests(struct mm_struct *mm, > struct vm_area_struct *vma, pmd_t *pmdp, > unsigned long pfn, unsigned long vaddr, > - pgprot_t prot) > + pgprot_t prot, pgtable_t pgtable) > { > pmd_t pmd; > > @@ -165,6 +165,8 @@ static void __init pmd_advanced_tests(struct mm_struct > *mm, > /* Align the address wrt HPAGE_PMD_SIZE */ > vaddr = (vaddr & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE; > > + pgtable_trans_huge_deposit(mm, pmdp, pgtable); > + > pmd = pmd_mkhuge(pfn_pmd(pfn, prot)); > set_pmd_at(mm, vaddr, pmdp, pmd); > pmdp_set_wrprotect(mm, vaddr, pmdp); > @@ -193,6 +195,8 @@ static void __init pmd_advanced_tests(struct mm_struct > *mm, > pmdp_test_and_clear_young(vma, vaddr, pmdp); > pmd = READ_ONCE(*pmdp); > WARN_ON(pmd_young(pmd)); > + > + pgtable = pgtable_trans_huge_withdraw(mm, pmdp); Should the call sites here be wrapped with __HAVE_ARCH_PGTABLE_DEPOSIT and __HAVE_ARCH_PGTABLE_WITHDRAW respectively. Though there are generic fallback definitions, wondering whether they are indeed essential for all platforms. 
> } > > static void __init pmd_leaf_tests(unsigned long pfn, pgprot_t prot) > @@ -373,7 +377,7 @@ static void __init pud_basic_tests(unsigned long pfn, > pgprot_t prot) { } > static void __init pmd_advanced_tests(struct mm_struct *mm, > struct vm_area_struct *vma, pmd_t *pmdp, > unsigned long pfn, unsigned long vaddr, > - pgprot_t prot) > + pgprot_t prot, pgtable_t pgtable) > { > } > static void __init pud_advanced_tests(struct mm_struct *mm, > @@ -1015,7 +1019,7 @@ static int __init debug_vm_pgtable(void) > pgd_clear_tests(mm, pgdp); > > pte_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot); > - pmd_advanced_tests(mm, vma, pmdp, pmd_aligned, vaddr, prot); > + pmd_advanced_tests(mm, vma, pmdp, pmd_aligned, vaddr, prot, saved_ptep); > pud_advanced_tests(mm, vma, pudp, pud_aligned, vaddr, prot); > hugetlb_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot); > > There is a checkpatch.pl warning here. WARNING: Possible unwrapped commit description (prefer a maximum 75 chars per line) #7: Architectures like ppc64 use deposited page table while updating the huge pte total: 0 errors, 1 warnings, 40 lines checked
Re: [PATCH v3 06/13] mm/debug_vm_pgtable/THP: Mark the pte entry huge before using set_pmd/pud_at
On 08/27/2020 01:34 PM, Aneesh Kumar K.V wrote: > kernel expects entries to be marked huge before we use > set_pmd_at()/set_pud_at(). > > Signed-off-by: Aneesh Kumar K.V > --- > mm/debug_vm_pgtable.c | 21 - > 1 file changed, 12 insertions(+), 9 deletions(-) > > diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c > index 5c0680836fe9..de83a20c1b30 100644 > --- a/mm/debug_vm_pgtable.c > +++ b/mm/debug_vm_pgtable.c > @@ -155,7 +155,7 @@ static void __init pmd_advanced_tests(struct mm_struct > *mm, > unsigned long pfn, unsigned long vaddr, > pgprot_t prot) > { > - pmd_t pmd = pfn_pmd(pfn, prot); > + pmd_t pmd; > > if (!has_transparent_hugepage()) > return; > @@ -164,19 +164,19 @@ static void __init pmd_advanced_tests(struct mm_struct > *mm, > /* Align the address wrt HPAGE_PMD_SIZE */ > vaddr = (vaddr & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE; > > - pmd = pfn_pmd(pfn, prot); > + pmd = pmd_mkhuge(pfn_pmd(pfn, prot)); > set_pmd_at(mm, vaddr, pmdp, pmd); > pmdp_set_wrprotect(mm, vaddr, pmdp); > pmd = READ_ONCE(*pmdp); > WARN_ON(pmd_write(pmd)); > > - pmd = pfn_pmd(pfn, prot); > + pmd = pmd_mkhuge(pfn_pmd(pfn, prot)); > set_pmd_at(mm, vaddr, pmdp, pmd); > pmdp_huge_get_and_clear(mm, vaddr, pmdp); > pmd = READ_ONCE(*pmdp); > WARN_ON(!pmd_none(pmd)); > > - pmd = pfn_pmd(pfn, prot); > + pmd = pmd_mkhuge(pfn_pmd(pfn, prot)); > pmd = pmd_wrprotect(pmd); > pmd = pmd_mkclean(pmd); > set_pmd_at(mm, vaddr, pmdp, pmd); > @@ -237,7 +237,7 @@ static void __init pmd_huge_tests(pmd_t *pmdp, unsigned > long pfn, pgprot_t prot) > > static void __init pmd_savedwrite_tests(unsigned long pfn, pgprot_t prot) > { > - pmd_t pmd = pfn_pmd(pfn, prot); > + pmd_t pmd = pmd_mkhuge(pfn_pmd(pfn, prot)); There is no set_pmd_at() in this particular test, why change ? 
> > if (!IS_ENABLED(CONFIG_NUMA_BALANCING)) > return; > @@ -277,7 +277,7 @@ static void __init pud_advanced_tests(struct mm_struct > *mm, > unsigned long pfn, unsigned long vaddr, > pgprot_t prot) > { > - pud_t pud = pfn_pud(pfn, prot); > + pud_t pud; > > if (!has_transparent_hugepage()) > return; > @@ -286,25 +286,28 @@ static void __init pud_advanced_tests(struct mm_struct > *mm, > /* Align the address wrt HPAGE_PUD_SIZE */ > vaddr = (vaddr & HPAGE_PUD_MASK) + HPAGE_PUD_SIZE; > > + pud = pud_mkhuge(pfn_pud(pfn, prot)); > set_pud_at(mm, vaddr, pudp, pud); > pudp_set_wrprotect(mm, vaddr, pudp); > pud = READ_ONCE(*pudp); > WARN_ON(pud_write(pud)); > > #ifndef __PAGETABLE_PMD_FOLDED > - pud = pfn_pud(pfn, prot); > + Please drop the extra line here. > + pud = pud_mkhuge(pfn_pud(pfn, prot)); > set_pud_at(mm, vaddr, pudp, pud); > pudp_huge_get_and_clear(mm, vaddr, pudp); > pud = READ_ONCE(*pudp); > WARN_ON(!pud_none(pud)); > > - pud = pfn_pud(pfn, prot); > + pud = pud_mkhuge(pfn_pud(pfn, prot)); > set_pud_at(mm, vaddr, pudp, pud); > pudp_huge_get_and_clear_full(mm, vaddr, pudp, 1); > pud = READ_ONCE(*pudp); > WARN_ON(!pud_none(pud)); > #endif /* __PAGETABLE_PMD_FOLDED */ > - pud = pfn_pud(pfn, prot); > + Please drop the extra line here. > + pud = pud_mkhuge(pfn_pud(pfn, prot)); > pud = pud_wrprotect(pud); > pud = pud_mkclean(pud); > set_pud_at(mm, vaddr, pudp, pud); > There is a checkpatch.pl warning here. WARNING: Possible unwrapped commit description (prefer a maximum 75 chars per line) #7: kernel expects entries to be marked huge before we use set_pmd_at()/set_pud_at(). total: 0 errors, 1 warnings, 77 lines checked
Re: [PATCH v3 05/13] mm/debug_vm_pgtable/savedwrite: Enable savedwrite test with CONFIG_NUMA_BALANCING
On 08/27/2020 01:34 PM, Aneesh Kumar K.V wrote: > Saved write support was added to track the write bit of a pte after marking > the > pte protnone. This was done so that AUTONUMA can convert a write pte to > protnone > and still track the old write bit. When converting it back we set the pte > write > bit correctly thereby avoiding a write fault again. Hence enable the test only > when CONFIG_NUMA_BALANCING is enabled and use protnone protflags. > > Signed-off-by: Aneesh Kumar K.V > --- > mm/debug_vm_pgtable.c | 11 +-- > 1 file changed, 9 insertions(+), 2 deletions(-) > > diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c > index 28f9d0558c20..5c0680836fe9 100644 > --- a/mm/debug_vm_pgtable.c > +++ b/mm/debug_vm_pgtable.c > @@ -119,10 +119,14 @@ static void __init pte_savedwrite_tests(unsigned long > pfn, pgprot_t prot) > { > pte_t pte = pfn_pte(pfn, prot); > > + if (!IS_ENABLED(CONFIG_NUMA_BALANCING)) > + return; > + > pr_debug("Validating PTE saved write\n"); > WARN_ON(!pte_savedwrite(pte_mk_savedwrite(pte_clear_savedwrite(pte)))); > WARN_ON(pte_savedwrite(pte_clear_savedwrite(pte_mk_savedwrite(pte)))); > } > + > #ifdef CONFIG_TRANSPARENT_HUGEPAGE > static void __init pmd_basic_tests(unsigned long pfn, pgprot_t prot) > { > @@ -235,6 +239,9 @@ static void __init pmd_savedwrite_tests(unsigned long > pfn, pgprot_t prot) > { > pmd_t pmd = pfn_pmd(pfn, prot); > > + if (!IS_ENABLED(CONFIG_NUMA_BALANCING)) > + return; > + > pr_debug("Validating PMD saved write\n"); > WARN_ON(!pmd_savedwrite(pmd_mk_savedwrite(pmd_clear_savedwrite(pmd)))); > WARN_ON(pmd_savedwrite(pmd_clear_savedwrite(pmd_mk_savedwrite(pmd)))); > @@ -1020,8 +1027,8 @@ static int __init debug_vm_pgtable(void) > pmd_huge_tests(pmdp, pmd_aligned, prot); > pud_huge_tests(pudp, pud_aligned, prot); > > - pte_savedwrite_tests(pte_aligned, prot); > - pmd_savedwrite_tests(pmd_aligned, prot); > + pte_savedwrite_tests(pte_aligned, protnone); > + pmd_savedwrite_tests(pmd_aligned, protnone); > > pte_unmap_unlock(ptep, ptl); > 
> There is a checkpatch.pl warning here. WARNING: Possible unwrapped commit description (prefer a maximum 75 chars per line) #7: Saved write support was added to track the write bit of a pte after marking the total: 0 errors, 1 warnings, 33 lines checked Otherwise this looks good. With the checkpatch.pl warning fixed Reviewed-by: Anshuman Khandual
Re: [PATCH v3 04/13] mm/debug_vm_pgtables/hugevmap: Use the arch helper to identify huge vmap support.
On 08/27/2020 01:34 PM, Aneesh Kumar K.V wrote: > ppc64 supports huge vmap only with radix translation. Hence use arch helper > to determine the huge vmap support. > > Signed-off-by: Aneesh Kumar K.V > --- > mm/debug_vm_pgtable.c | 15 +-- > 1 file changed, 13 insertions(+), 2 deletions(-) > > diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c > index bbf9df0e64c6..28f9d0558c20 100644 > --- a/mm/debug_vm_pgtable.c > +++ b/mm/debug_vm_pgtable.c > @@ -28,6 +28,7 @@ > #include > #include > #include > +#include > #include > #include > > @@ -206,11 +207,12 @@ static void __init pmd_leaf_tests(unsigned long pfn, > pgprot_t prot) > WARN_ON(!pmd_leaf(pmd)); > } > > +#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP > static void __init pmd_huge_tests(pmd_t *pmdp, unsigned long pfn, pgprot_t > prot) > { > pmd_t pmd; > > - if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMAP)) > + if (!arch_ioremap_pmd_supported()) > return; > > pr_debug("Validating PMD huge\n"); > @@ -224,6 +226,10 @@ static void __init pmd_huge_tests(pmd_t *pmdp, unsigned > long pfn, pgprot_t prot) > pmd = READ_ONCE(*pmdp); > WARN_ON(!pmd_none(pmd)); > } > +#else /* !CONFIG_HAVE_ARCH_HUGE_VMAP */ > +static void __init pmd_huge_tests(pmd_t *pmdp, unsigned long pfn, pgprot_t > prot) { } > +#endif /* !CONFIG_HAVE_ARCH_HUGE_VMAP */ > + Some small nits here. 
- s/!CONFIG_HAVE_ARCH_HUGE_VMAP/CONFIG_HAVE_ARCH_HUGE_VMAP - Drop the extra line at the end - Move the { } down just to be consistent with existing stub for pmd_huge_tests() > > static void __init pmd_savedwrite_tests(unsigned long pfn, pgprot_t prot) > { > @@ -320,11 +326,12 @@ static void __init pud_leaf_tests(unsigned long pfn, > pgprot_t prot) > WARN_ON(!pud_leaf(pud)); > } > > +#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP > static void __init pud_huge_tests(pud_t *pudp, unsigned long pfn, pgprot_t > prot) > { > pud_t pud; > > - if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMAP)) > + if (!arch_ioremap_pud_supported()) > return; > > pr_debug("Validating PUD huge\n"); > @@ -338,6 +345,10 @@ static void __init pud_huge_tests(pud_t *pudp, unsigned > long pfn, pgprot_t prot) > pud = READ_ONCE(*pudp); > WARN_ON(!pud_none(pud)); > } > +#else /* !CONFIG_HAVE_ARCH_HUGE_VMAP */ > +static void __init pud_huge_tests(pud_t *pudp, unsigned long pfn, pgprot_t > prot) { } > +#endif /* !CONFIG_HAVE_ARCH_HUGE_VMAP */ > + Some small nits here. - s/!CONFIG_HAVE_ARCH_HUGE_VMAP/CONFIG_HAVE_ARCH_HUGE_VMAP - Drop the extra line at the end - Move the { } down just to be consistent with existing stub for pud_huge_tests() > #else /* !CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */ > static void __init pud_basic_tests(unsigned long pfn, pgprot_t prot) { } > static void __init pud_advanced_tests(struct mm_struct *mm, >
Re: [PATCH v3 03/13] mm/debug_vm_pgtable/ppc64: Avoid setting top bits in radom value
On 08/27/2020 01:34 PM, Aneesh Kumar K.V wrote: > ppc64 use bit 62 to indicate a pte entry (_PAGE_PTE). Avoid setting that bit > in > random value. > > Signed-off-by: Aneesh Kumar K.V > --- > mm/debug_vm_pgtable.c | 13 ++--- > 1 file changed, 10 insertions(+), 3 deletions(-) > > diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c > index 086309fb9b6f..bbf9df0e64c6 100644 > --- a/mm/debug_vm_pgtable.c > +++ b/mm/debug_vm_pgtable.c > @@ -44,10 +44,17 @@ > * entry type. But these bits might affect the ability to clear entries with > * pxx_clear() because of how dynamic page table folding works on s390. So > * while loading up the entries do not change the lower 4 bits. It does not > - * have affect any other platform. > + * have affect any other platform. Also avoid the 62nd bit on ppc64 that is > + * used to mark a pte entry. > */ > -#define S390_MASK_BITS 4 > -#define RANDOM_ORVALUE GENMASK(BITS_PER_LONG - 1, S390_MASK_BITS) > +#define S390_SKIP_MASK GENMASK(3, 0) > +#ifdef CONFIG_PPC_BOOK3S_64 > +#define PPC64_SKIP_MASK GENMASK(62, 62) > +#else > +#define PPC64_SKIP_MASK 0x0 > +#endif Please drop the #ifdef CONFIG_PPC_BOOK3S_64 here. We already accommodate skip bits for a s390 platform requirement and can also do so for ppc64 as well. As mentioned before, please avoid adding any platform specific constructs in the test. > +#define ARCH_SKIP_MASK (S390_SKIP_MASK | PPC64_SKIP_MASK) > +#define RANDOM_ORVALUE (GENMASK(BITS_PER_LONG - 1, 0) & ~ARCH_SKIP_MASK) > #define RANDOM_NZVALUE GENMASK(7, 0) Please fix the alignments here. Feel free to consider following changes after this patch. diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c index 122416464e0f..f969031152bb 100644 --- a/mm/debug_vm_pgtable.c +++ b/mm/debug_vm_pgtable.c @@ -48,14 +48,11 @@ * have affect any other platform. Also avoid the 62nd bit on ppc64 that is * used to mark a pte entry. 
*/ -#define S390_SKIP_MASK GENMASK(3, 0) -#ifdef CONFIG_PPC_BOOK3S_64 -#define PPC64_SKIP_MASK GENMASK(62, 62) -#else -#define PPC64_SKIP_MASK 0x0 -#endif -#define ARCH_SKIP_MASK (S390_SKIP_MASK | PPC64_SKIP_MASK) -#define RANDOM_ORVALUE (GENMASK(BITS_PER_LONG - 1, 0) & ~ARCH_SKIP_MASK) +#define S390_SKIP_MASK GENMASK(3, 0) +#define PPC64_SKIP_MASK GENMASK(62, 62) +#define ARCH_SKIP_MASK (S390_SKIP_MASK | PPC64_SKIP_MASK) +#define RANDOM_ORVALUE (GENMASK(BITS_PER_LONG - 1, 0) & ~ARCH_SKIP_MASK) + #define RANDOM_NZVALUE GENMASK(7, 0) > > static void __init pte_basic_tests(unsigned long pfn, pgprot_t prot) > Besides, there is also one checkpatch.pl warning here. WARNING: Possible unwrapped commit description (prefer a maximum 75 chars per line) #7: ppc64 use bit 62 to indicate a pte entry (_PAGE_PTE). Avoid setting that bit in total: 0 errors, 1 warnings, 20 lines checked
Re: [PATCH v3 02/13] powerpc/mm: Move setting pte specific flags to pfn_pte
On 08/27/2020 01:34 PM, Aneesh Kumar K.V wrote: > powerpc used to set the pte specific flags in set_pte_at(). This is different > from other architectures. To be consistent with other architecture update > pfn_pte to set _PAGE_PTE on ppc64. Also, drop now unused pte_mkpte. > > We add a VM_WARN_ON() to catch the usage of calling set_pte_at() without > setting > _PAGE_PTE bit. We will remove that after a few releases. > > With respect to huge pmd entries, pmd_mkhuge() takes care of adding the > _PAGE_PTE bit. > > Signed-off-by: Aneesh Kumar K.V > --- > arch/powerpc/include/asm/book3s/64/pgtable.h | 15 +-- > arch/powerpc/include/asm/nohash/pgtable.h| 5 - > arch/powerpc/mm/pgtable.c| 5 - > 3 files changed, 9 insertions(+), 16 deletions(-) > > diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h > b/arch/powerpc/include/asm/book3s/64/pgtable.h > index 079211968987..2382fd516f6b 100644 > --- a/arch/powerpc/include/asm/book3s/64/pgtable.h > +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h > @@ -619,7 +619,7 @@ static inline pte_t pfn_pte(unsigned long pfn, pgprot_t > pgprot) > VM_BUG_ON(pfn >> (64 - PAGE_SHIFT)); > VM_BUG_ON((pfn << PAGE_SHIFT) & ~PTE_RPN_MASK); > > - return __pte(((pte_basic_t)pfn << PAGE_SHIFT) | pgprot_val(pgprot)); > + return __pte(((pte_basic_t)pfn << PAGE_SHIFT) | pgprot_val(pgprot) | > _PAGE_PTE); > } > > static inline unsigned long pte_pfn(pte_t pte) > @@ -655,11 +655,6 @@ static inline pte_t pte_mkexec(pte_t pte) > return __pte_raw(pte_raw(pte) | cpu_to_be64(_PAGE_EXEC)); > } > > -static inline pte_t pte_mkpte(pte_t pte) > -{ > - return __pte_raw(pte_raw(pte) | cpu_to_be64(_PAGE_PTE)); > -} > - > static inline pte_t pte_mkwrite(pte_t pte) > { > /* > @@ -823,6 +818,14 @@ static inline int pte_none(pte_t pte) > static inline void __set_pte_at(struct mm_struct *mm, unsigned long addr, > pte_t *ptep, pte_t pte, int percpu) > { > + > + VM_WARN_ON(!(pte_raw(pte) & cpu_to_be64(_PAGE_PTE))); > + /* > + * Keep the _PAGE_PTE added till we are sure 
we handle _PAGE_PTE > + * in all the callers. > + */ > + pte = __pte_raw(pte_raw(pte) | cpu_to_be64(_PAGE_PTE)); > + > if (radix_enabled()) > return radix__set_pte_at(mm, addr, ptep, pte, percpu); > return hash__set_pte_at(mm, addr, ptep, pte, percpu); > diff --git a/arch/powerpc/include/asm/nohash/pgtable.h > b/arch/powerpc/include/asm/nohash/pgtable.h > index 4b7c3472eab1..6277e7596ae5 100644 > --- a/arch/powerpc/include/asm/nohash/pgtable.h > +++ b/arch/powerpc/include/asm/nohash/pgtable.h > @@ -140,11 +140,6 @@ static inline pte_t pte_mkold(pte_t pte) > return __pte(pte_val(pte) & ~_PAGE_ACCESSED); > } > > -static inline pte_t pte_mkpte(pte_t pte) > -{ > - return pte; > -} > - > static inline pte_t pte_mkspecial(pte_t pte) > { > return __pte(pte_val(pte) | _PAGE_SPECIAL); > diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c > index 9c0547d77af3..ab57b07ef39a 100644 > --- a/arch/powerpc/mm/pgtable.c > +++ b/arch/powerpc/mm/pgtable.c > @@ -184,9 +184,6 @@ void set_pte_at(struct mm_struct *mm, unsigned long addr, > pte_t *ptep, >*/ > VM_WARN_ON(pte_hw_valid(*ptep) && !pte_protnone(*ptep)); > > - /* Add the pte bit when trying to set a pte */ > - pte = pte_mkpte(pte); > - > /* Note: mm->context.id might not yet have been assigned as >* this context might not have been activated yet when this >* is called. > @@ -275,8 +272,6 @@ void set_huge_pte_at(struct mm_struct *mm, unsigned long > addr, pte_t *ptep, pte_ >*/ > VM_WARN_ON(pte_hw_valid(*ptep) && !pte_protnone(*ptep)); > > - pte = pte_mkpte(pte); > - > pte = set_pte_filter(pte); > > val = pte_val(pte); > There is one checkpatch.pl warning for this patch. WARNING: Possible unwrapped commit description (prefer a maximum 75 chars per line) #6: powerpc used to set the pte specific flags in set_pte_at(). This is different total: 0 errors, 1 warnings, 61 lines checked
Re: [PATCH v3 01/13] powerpc/mm: Add DEBUG_VM WARN for pmd_clear
On 08/27/2020 01:34 PM, Aneesh Kumar K.V wrote: > With the hash page table, the kernel should not use pmd_clear for clearing > huge pte entries. Add a DEBUG_VM WARN to catch the wrong usage. > > Signed-off-by: Aneesh Kumar K.V > --- > arch/powerpc/include/asm/book3s/64/pgtable.h | 14 ++ > 1 file changed, 14 insertions(+) > > diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h > b/arch/powerpc/include/asm/book3s/64/pgtable.h > index 6de56c3b33c4..079211968987 100644 > --- a/arch/powerpc/include/asm/book3s/64/pgtable.h > +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h > @@ -868,6 +868,13 @@ static inline bool pte_ci(pte_t pte) > > static inline void pmd_clear(pmd_t *pmdp) > { > + if (IS_ENABLED(CONFIG_DEBUG_VM) && !radix_enabled()) { > + /* > + * Don't use this if we can possibly have a hash page table > + * entry mapping this. > + */ > + WARN_ON((pmd_val(*pmdp) & (H_PAGE_HASHPTE | _PAGE_PTE)) == > (H_PAGE_HASHPTE | _PAGE_PTE)); > + } > *pmdp = __pmd(0); > } > > @@ -916,6 +923,13 @@ static inline int pmd_bad(pmd_t pmd) > > static inline void pud_clear(pud_t *pudp) > { > + if (IS_ENABLED(CONFIG_DEBUG_VM) && !radix_enabled()) { > + /* > + * Don't use this if we can possibly have a hash page table > + * entry mapping this. > + */ > + WARN_ON((pud_val(*pudp) & (H_PAGE_HASHPTE | _PAGE_PTE)) == > (H_PAGE_HASHPTE | _PAGE_PTE)); > + } > *pudp = __pud(0); > } There are two checkpatch.pl warnings for this patch. WARNING: line length of 105 exceeds 100 columns #27: FILE: arch/powerpc/include/asm/book3s/64/pgtable.h:876: + WARN_ON((pmd_val(*pmdp) & (H_PAGE_HASHPTE | _PAGE_PTE)) == (H_PAGE_HASHPTE | _PAGE_PTE)); WARNING: line length of 105 exceeds 100 columns #41: FILE: arch/powerpc/include/asm/book3s/64/pgtable.h:931: + WARN_ON((pud_val(*pudp) & (H_PAGE_HASHPTE | _PAGE_PTE)) == (H_PAGE_HASHPTE | _PAGE_PTE)); total: 0 errors, 2 warnings, 26 lines checked
Re: fsl_espi errors on v5.7.15
On 1/09/20 12:33 am, Heiner Kallweit wrote: > On 30.08.2020 23:59, Chris Packham wrote: >> On 31/08/20 9:41 am, Heiner Kallweit wrote: >>> On 30.08.2020 23:00, Chris Packham wrote: On 31/08/20 12:30 am, Nicholas Piggin wrote: > Excerpts from Chris Packham's message of August 28, 2020 8:07 am: >> I've also now seen the RX FIFO not empty error on the T2080RDB >> >> fsl_espi ffe11.spi: Transfer done but SPIE_DON isn't set! >> fsl_espi ffe11.spi: Transfer done but SPIE_DON isn't set! >> fsl_espi ffe11.spi: Transfer done but SPIE_DON isn't set! >> fsl_espi ffe11.spi: Transfer done but SPIE_DON isn't set! >> fsl_espi ffe11.spi: Transfer done but rx/tx fifo's aren't empty! >> fsl_espi ffe11.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32 >> >> With my current workaround of emptying the RX FIFO. It seems >> survivable. Interestingly it only ever seems to be 1 extra byte in >> the >> RX FIFO and it seems to be after either a READ_SR or a READ_FSR. >> >> fsl_espi ffe11.spi: tx 70 >> fsl_espi ffe11.spi: rx 03 >> fsl_espi ffe11.spi: Extra RX 00 >> fsl_espi ffe11.spi: Transfer done but SPIE_DON isn't set! >> fsl_espi ffe11.spi: Transfer done but rx/tx fifo's aren't empty! >> fsl_espi ffe11.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32 >> fsl_espi ffe11.spi: tx 05 >> fsl_espi ffe11.spi: rx 00 >> fsl_espi ffe11.spi: Extra RX 03 >> fsl_espi ffe11.spi: Transfer done but SPIE_DON isn't set! >> fsl_espi ffe11.spi: Transfer done but rx/tx fifo's aren't empty! >> fsl_espi ffe11.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32 >> fsl_espi ffe11.spi: tx 05 >> fsl_espi ffe11.spi: rx 00 >> fsl_espi ffe11.spi: Extra RX 03 >> >> From all the Micron SPI-NOR datasheets I've got access to it is >> possible to continually read the SR/FSR. But I've no idea why it >> happens some times and not others. > So I think I've got a reproduction and I think I've bisected the > problem > to commit 3282a3da25bd ("powerpc/64: Implement soft interrupt replay > in > C"). 
My day is just finishing now so I haven't applied too much > scrutiny > to this result. Given the various rabbit holes I've been down on this > issue already I'd take this information with a good degree of > skepticism. > OK, so an easy test should be to re-test with a 5.4 kernel. It doesn't have yet the change you're referring to, and the fsl-espi driver is basically the same as in 5.7 (just two small changes in 5.7). >>> There's 6cc0c16d82f88 and maybe also other interrupt related patches >>> around this time that could affect book E, so it's good if that exact >>> patch is confirmed. >> My confirmation is basically that I can induce the issue in a 5.4 kernel >> by cherry-picking 3282a3da25bd. I'm also able to "fix" the issue in >> 5.9-rc2 by reverting that one commit. >> >> I both cases it's not exactly a clean cherry-pick/revert so I also >> confirmed the bisection result by building at 3282a3da25bd (which sees >> the issue) and the commit just before (which does not). > Thanks for testing, that confirms it well. > > [snip patch] > >> I still saw the issue with this change applied. PPC_IRQ_SOFT_MASK_DEBUG >> didn't report anything (either with or without the change above). > Okay, it was a bit of a shot in the dark. I still can't see what > else has changed. > > What would cause this, a lost interrupt? A spurious interrupt? Or > higher interrupt latency? > > I don't think the patch should cause significantly worse latency, > (it's supposed to be a bit better if anything because it doesn't set > up the full interrupt frame). But it's possible. My working theory is that the SPI_DON indication is all about the TX direction an now that the interrupts are faster we're hitting an error because there is still RX activity going on. Heiner disagrees with my interpretation of the SPI_DON indication and the fact that it doesn't happen every time does throw doubt on it. >>> It's right that the eSPI spec can be interpreted that SPI_DON refers to >>> TX only. 
However this wouldn't really make sense, because also for RX >>> we program the frame length, and therefore want to be notified once the >>> full frame was received. Also practical experience shows that SPI_DON >>> is set also after RX-only transfers. >>> Typical SPI NOR use case is that you write read command + start address, >>> followed by a longer read. If the TX-only interpretation would be right, >>> we'd always end up with SPI_DON not being set. >>> >
Re: [PATCH] soc: fsl: Remove bogus packed attributes from qman.h
On Tue, Sep 01, 2020 at 01:50:38AM +, Leo Li wrote: > > Sorry for the late response. I missed this email previously. > > These structures are descriptors used by hardware, we cannot have _ANY_ > padding from the compiler. The compiled result might be the same with or > without the __packed attribute for now, but I think keep it there probably is > safer for dealing with unexpected alignment requirements from the compiler in > the future. > > Having conflicting alignment requirements warning might means something is > wrong with the structure in certain scenario. I just tried a ARM64 build but > didn't see the warnings. Could you share the warning you got and the build > setup? Thanks. Just do a COMPILE_TEST build on x86-64: In file included from ../drivers/crypto/caam/qi.c:12: ../include/soc/fsl/qman.h:259:1: warning: alignment 1 of ‘struct qm_dqrr_entry’ is less than 8 [-Wpacked-not-aligned] } __packed; ^ ../include/soc/fsl/qman.h:292:2: warning: alignment 1 of ‘struct <anonymous>’ is less than 8 [-Wpacked-not-aligned] } __packed ern; ^ In any case, those packed markers are completely unnecessary because those structs contain no holes. Cheers, -- Email: Herbert Xu Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
RE: [PATCH] soc: fsl: Remove bogus packed attributes from qman.h
> -Original Message- > From: Herbert Xu > Sent: Thursday, July 30, 2020 7:53 AM > To: Leo Li ; linuxppc-dev@lists.ozlabs.org; linux-arm- > ker...@lists.infradead.org > Subject: [PATCH] soc: fsl: Remove bogus packed attributes from qman.h > > There are two __packed attributes in qman.h that are both unnecessary > and causing compiler warnings because they're conflicting with > explicit alignment requirements set on members within the structure. Sorry for the late response. I missed this email previously. These structures are descriptors used by hardware, we cannot have _ANY_ padding from the compiler. The compiled result might be the same with or without the __packed attribute for now, but I think keep it there probably is safer for dealing with unexpected alignment requirements from the compiler in the future. Having conflicting alignment requirements warning might means something is wrong with the structure in certain scenario. I just tried a ARM64 build but didn't see the warnings. Could you share the warning you got and the build setup? Thanks. Regards, Leo > > This patch removes them both. 
> > Signed-off-by: Herbert Xu > > diff --git a/include/soc/fsl/qman.h b/include/soc/fsl/qman.h > index cfe00e08e85b..d81ff185dc0b 100644 > --- a/include/soc/fsl/qman.h > +++ b/include/soc/fsl/qman.h > @@ -256,7 +256,7 @@ struct qm_dqrr_entry { > __be32 context_b; > struct qm_fd fd; > u8 __reserved4[32]; > -} __packed; > +}; > #define QM_DQRR_VERB_VBIT0x80 > #define QM_DQRR_VERB_MASK0x7f/* where the verb > contains; */ > #define QM_DQRR_VERB_FRAME_DEQUEUE 0x60/* "this format" */ > @@ -289,7 +289,7 @@ union qm_mr_entry { > __be32 tag; > struct qm_fd fd; > u8 __reserved1[32]; > - } __packed ern; > + } ern; > struct { > u8 verb; > u8 fqs; /* Frame Queue Status */ > -- > Email: Herbert Xu > Home Page: http://gondor.apana.org.au/~herbert/ > PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: fsl_espi errors on v5.7.15
On 1/09/20 12:33 am, Heiner Kallweit wrote: > On 30.08.2020 23:59, Chris Packham wrote: >> On 31/08/20 9:41 am, Heiner Kallweit wrote: >>> On 30.08.2020 23:00, Chris Packham wrote: On 31/08/20 12:30 am, Nicholas Piggin wrote: > Excerpts from Chris Packham's message of August 28, 2020 8:07 am: >> I've also now seen the RX FIFO not empty error on the T2080RDB >> >> fsl_espi ffe11.spi: Transfer done but SPIE_DON isn't set! >> fsl_espi ffe11.spi: Transfer done but SPIE_DON isn't set! >> fsl_espi ffe11.spi: Transfer done but SPIE_DON isn't set! >> fsl_espi ffe11.spi: Transfer done but SPIE_DON isn't set! >> fsl_espi ffe11.spi: Transfer done but rx/tx fifo's aren't empty! >> fsl_espi ffe11.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32 >> >> With my current workaround of emptying the RX FIFO. It seems >> survivable. Interestingly it only ever seems to be 1 extra byte in >> the >> RX FIFO and it seems to be after either a READ_SR or a READ_FSR. >> >> fsl_espi ffe11.spi: tx 70 >> fsl_espi ffe11.spi: rx 03 >> fsl_espi ffe11.spi: Extra RX 00 >> fsl_espi ffe11.spi: Transfer done but SPIE_DON isn't set! >> fsl_espi ffe11.spi: Transfer done but rx/tx fifo's aren't empty! >> fsl_espi ffe11.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32 >> fsl_espi ffe11.spi: tx 05 >> fsl_espi ffe11.spi: rx 00 >> fsl_espi ffe11.spi: Extra RX 03 >> fsl_espi ffe11.spi: Transfer done but SPIE_DON isn't set! >> fsl_espi ffe11.spi: Transfer done but rx/tx fifo's aren't empty! >> fsl_espi ffe11.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32 >> fsl_espi ffe11.spi: tx 05 >> fsl_espi ffe11.spi: rx 00 >> fsl_espi ffe11.spi: Extra RX 03 >> >> From all the Micron SPI-NOR datasheets I've got access to it is >> possible to continually read the SR/FSR. But I've no idea why it >> happens some times and not others. > So I think I've got a reproduction and I think I've bisected the > problem > to commit 3282a3da25bd ("powerpc/64: Implement soft interrupt replay > in > C"). 
My day is just finishing now so I haven't applied too much > scrutiny > to this result. Given the various rabbit holes I've been down on this > issue already I'd take this information with a good degree of > skepticism. > OK, so an easy test should be to re-test with a 5.4 kernel. It doesn't yet have the change you're referring to, and the fsl-espi driver is basically the same as in 5.7 (just two small changes in 5.7). >>> There's 6cc0c16d82f88 and maybe also other interrupt related patches >>> around this time that could affect book E, so it's good if that exact >>> patch is confirmed. >> My confirmation is basically that I can induce the issue in a 5.4 kernel >> by cherry-picking 3282a3da25bd. I'm also able to "fix" the issue in >> 5.9-rc2 by reverting that one commit. >> >> In both cases it's not exactly a clean cherry-pick/revert so I also >> confirmed the bisection result by building at 3282a3da25bd (which sees >> the issue) and the commit just before (which does not). > Thanks for testing, that confirms it well. > > [snip patch] > >> I still saw the issue with this change applied. PPC_IRQ_SOFT_MASK_DEBUG >> didn't report anything (either with or without the change above). > Okay, it was a bit of a shot in the dark. I still can't see what > else has changed. > > What would cause this, a lost interrupt? A spurious interrupt? Or > higher interrupt latency? > > I don't think the patch should cause significantly worse latency, > (it's supposed to be a bit better if anything because it doesn't set > up the full interrupt frame). But it's possible. My working theory is that the SPI_DON indication is all about the TX direction and now that the interrupts are faster we're hitting an error because there is still RX activity going on. Heiner disagrees with my interpretation of the SPI_DON indication and the fact that it doesn't happen every time does throw doubt on it. >>> It's right that the eSPI spec can be interpreted that SPI_DON refers to >>> TX only. 
However this wouldn't really make sense, because also for RX >>> we program the frame length, and therefore want to be notified once the >>> full frame was received. Also practical experience shows that SPI_DON >>> is set also after RX-only transfers. >>> Typical SPI NOR use case is that you write read command + start address, >>> followed by a longer read. If the TX-only interpretation would be right, >>> we'd always end up with SPI_DON not being set. >>> >
Re: [PATCH] soc: fsl: Remove bogus packed attributes from qman.h
On Thu, Jul 30, 2020 at 10:52:59PM +1000, Herbert Xu wrote: > There are two __packed attributes in qman.h that are both unnecessary > and causing compiler warnings because they're conflicting with > explicit alignment requirements set on members within the structure. > > This patch removes them both. > > Signed-off-by: Herbert Xu Ping. -- Email: Herbert Xu Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
[PATCH v2] scsi: ibmvfc: interface updates for future FPIN and MQ support
VIOS partitions with SLI-4 enabled Emulex adapters will be capable of driving IO in parallel through multiple work queues or channels, and with new hypervisor firmware that supports multiple interrupt sources an ibmvfc NPIV single initiator can be modified to exploit end-to-end channelization in a PowerVM environment. VIOS hosts will also be able to expose fabric performance impact notifications (FPIN) via a new asynchronous event to ibmvfc clients that advertise support via IBMVFC_CAN_HANDLE_FPIN in their capabilities flag during NPIV_LOGIN. This patch introduces three new Management Datagrams (MADs) for channelization support negotiation as well as the FPIN asynchronous event and FPIN status flags. Follow-up work is required to plumb the ibmvfc client driver to use these new interfaces. Signed-off-by: Tyrel Datwyler --- v1 -> v2: Fixup compiler errors from neglected commit --amend --- drivers/scsi/ibmvscsi/ibmvfc.h | 66 +- 1 file changed, 65 insertions(+), 1 deletion(-) diff --git a/drivers/scsi/ibmvscsi/ibmvfc.h b/drivers/scsi/ibmvscsi/ibmvfc.h index 907889f1fa9d..fe55cfc79421 100644 --- a/drivers/scsi/ibmvscsi/ibmvfc.h +++ b/drivers/scsi/ibmvscsi/ibmvfc.h @@ -124,6 +124,9 @@ enum ibmvfc_mad_types { IBMVFC_PASSTHRU = 0x0200, IBMVFC_TMF_MAD = 0x0100, IBMVFC_NPIV_LOGOUT = 0x0800, + IBMVFC_CHANNEL_ENQUIRY = 0x1000, + IBMVFC_CHANNEL_SETUP= 0x2000, + IBMVFC_CONNECTION_INFO = 0x4000, }; struct ibmvfc_mad_common { @@ -162,6 +165,8 @@ struct ibmvfc_npiv_login { __be32 max_cmds; __be64 capabilities; #define IBMVFC_CAN_MIGRATE 0x01 +#define IBMVFC_CAN_USE_CHANNELS0x02 +#define IBMVFC_CAN_HANDLE_FPIN 0x04 __be64 node_name; struct srp_direct_buf async; u8 partition_name[IBMVFC_MAX_NAME]; @@ -204,6 +209,7 @@ struct ibmvfc_npiv_login_resp { __be64 capabilities; #define IBMVFC_CAN_FLUSH_ON_HALT 0x08 #define IBMVFC_CAN_SUPPRESS_ABTS 0x10 +#define IBMVFC_CAN_SUPPORT_CHANNELS0x20 __be32 max_cmds; __be32 scsi_id_sz; __be64 max_dma_len; @@ -482,6 +488,52 @@ struct 
ibmvfc_passthru_mad { struct ibmvfc_passthru_fc_iu fc_iu; }__attribute__((packed, aligned (8))); +struct ibmvfc_channel_enquiry { + struct ibmvfc_mad_common common; + __be32 flags; +#define IBMVFC_NO_CHANNELS_TO_CRQ_SUPPORT 0x01 +#define IBMVFC_SUPPORT_VARIABLE_SUBQ_MSG 0x02 +#define IBMVFC_NO_N_TO_M_CHANNELS_SUPPORT 0x04 + __be32 num_scsi_subq_channels; + __be32 num_nvmeof_subq_channels; + __be32 num_scsi_vas_channels; + __be32 num_nvmeof_vas_channels; +}__attribute__((packed, aligned (8))); + +struct ibmvfc_channel_setup_mad { + struct ibmvfc_mad_common common; + struct srp_direct_buf buffer; +}__attribute__((packed, aligned (8))); + +#define IBMVFC_MAX_CHANNELS502 + +struct ibmvfc_channel_setup { + __be32 flags; +#define IBMVFC_CANCEL_CHANNELS 0x01 +#define IBMVFC_USE_BUFFER 0x02 +#define IBMVFC_CHANNELS_CANCELED 0x04 + __be32 reserved; + __be32 num_scsi_subq_channels; + __be32 num_nvmeof_subq_channels; + __be32 num_scsi_vas_channels; + __be32 num_nvmeof_vas_channels; + struct srp_direct_buf buffer; + __be64 reserved2[5]; + __be64 channel_handles[IBMVFC_MAX_CHANNELS]; +}__attribute__((packed, aligned (8))); + +struct ibmvfc_connection_info { + struct ibmvfc_mad_common common; + __be64 information_bits; +#define IBMVFC_NO_FC_IO_CHANNEL0x01 +#define IBMVFC_NO_PHYP_VAS 0x02 +#define IBMVFC_NO_PHYP_SUBQ0x04 +#define IBMVFC_PHYP_DEPRECATED_SUBQ0x08 +#define IBMVFC_PHYP_PRESERVED_SUBQ 0x10 +#define IBMVFC_PHYP_FULL_SUBQ 0x20 + __be64 reserved[16]; +}__attribute__((packed, aligned (8))); + struct ibmvfc_trace_start_entry { u32 xfer_len; }__attribute__((packed)); @@ -532,6 +584,7 @@ enum ibmvfc_async_event { IBMVFC_AE_HALT = 0x0400, IBMVFC_AE_RESUME= 0x0800, IBMVFC_AE_ADAPTER_FAILED= 0x1000, + IBMVFC_AE_FPIN = 0x2000, }; struct ibmvfc_async_desc { @@ -560,10 +613,18 @@ enum ibmvfc_ae_link_state { IBMVFC_AE_LS_LINK_DEAD = 0x08, }; +enum ibmvfc_ae_fpin_status { + IBMVFC_AE_FPIN_LINK_CONGESTED = 0x1, + IBMVFC_AE_FPIN_PORT_CONGESTED = 0x2, + IBMVFC_AE_FPIN_PORT_CLEARED = 
0x3, + IBMVFC_AE_FPIN_PORT_DEGRADED= 0x4, +}; + struct ibmvfc_async_crq { volatile u8 valid; u8 link_state; - u8 pad[2]; + u8 fpin_status; + u8 pad; __be32 pad2; volatile __be64 event; volatile __be64 scsi_id; @@ -590,6 +651,9 @@ union ibmvfc_iu { struct ibmvfc_tmf tmf; struct ibmvfc_cmd cmd;
Re: [PATCH] scsi: ibmvfc: interface updates for future FPIN and MQ support
Hi Tyrel, I love your patch! Yet something to improve: [auto build test ERROR on powerpc/next] [also build test ERROR on mkp-scsi/for-next scsi/for-next v5.9-rc3 next-20200828] [If your patch is applied to the wrong git tree, kindly drop us a note. And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch] url: https://github.com/0day-ci/linux/commits/Tyrel-Datwyler/scsi-ibmvfc-interface-updates-for-future-FPIN-and-MQ-support/20200901-012005 base: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next config: powerpc-defconfig (attached as .config) compiler: powerpc64-linux-gcc (GCC) 9.3.0 reproduce (this is a W=1 build): wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross # save the attached .config to linux build tree COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross ARCH=powerpc If you fix the issue, kindly add following tag as appropriate Reported-by: kernel test robot All errors (new ones prefixed by >>): In file included from drivers/scsi/ibmvscsi/ibmvfc.c:32: >> drivers/scsi/ibmvscsi/ibmvfc.h:500:13: error: expected ':', ',', ';', '}' or >> '__attribute__' before 'nvmeof_vas_channels' 500 | __be32 num nvmeof_vas_channels; | ^~~ >> drivers/scsi/ibmvscsi/ibmvfc.h:506:17: error: expected declaration >> specifiers or '...' 
before '(' token 506 | }__attrribute__((packed, aligned (8))); | ^ >> drivers/scsi/ibmvscsi/ibmvfc.h:526:28: error: field 'common' has incomplete >> type 526 | struct ibmvfc_madd_common common; |^~ >> drivers/scsi/ibmvscsi/ibmvfc.h:627:2: error: expected ':', ',', ';', '}' or >> '__attribute__' before 'u8' 627 | u8 pad; | ^~ In file included from include/linux/byteorder/big_endian.h:5, from arch/powerpc/include/uapi/asm/byteorder.h:14, from include/asm-generic/bitops/le.h:6, from arch/powerpc/include/asm/bitops.h:246, from include/linux/bitops.h:29, from include/linux/kernel.h:12, from include/linux/list.h:9, from include/linux/module.h:12, from drivers/scsi/ibmvscsi/ibmvfc.c:10: drivers/scsi/ibmvscsi/ibmvfc.c: In function 'ibmvfc_handle_async': >> drivers/scsi/ibmvscsi/ibmvfc.c:2665:75: error: 'struct ibmvfc_async_crq' has >> no member named 'event' 2665 | const struct ibmvfc_async_desc *desc = ibmvfc_get_ae_desc(be64_to_cpu(crq->event)); | ^~ include/uapi/linux/byteorder/big_endian.h:38:51: note: in definition of macro '__be64_to_cpu' 38 | #define __be64_to_cpu(x) ((__force __u64)(__be64)(x)) | ^ drivers/scsi/ibmvscsi/ibmvfc.c:2665:60: note: in expansion of macro 'be64_to_cpu' 2665 | const struct ibmvfc_async_desc *desc = ibmvfc_get_ae_desc(be64_to_cpu(crq->event)); | ^~~ >> drivers/scsi/ibmvscsi/ibmvfc.c:2669:57: error: 'struct ibmvfc_async_crq' has >> no member named 'scsi_id' 2669 | " node_name: %llx%s\n", desc->desc, be64_to_cpu(crq->scsi_id), | ^~ include/uapi/linux/byteorder/big_endian.h:38:51: note: in definition of macro '__be64_to_cpu' 38 | #define __be64_to_cpu(x) ((__force __u64)(__be64)(x)) | ^ drivers/scsi/ibmvscsi/ibmvfc.h:818:4: note: in expansion of macro 'dev_err' 818 |dev_err((vhost)->dev, ##__VA_ARGS__); \ |^~~ drivers/scsi/ibmvscsi/ibmvfc.c:2668:2: note: in expansion of macro 'ibmvfc_log' 2668 | ibmvfc_log(vhost, desc->log_level, "%s event received. 
scsi_id: %llx, wwpn: %llx," | ^~ >> drivers/scsi/ibmvscsi/ibmvfc.c:2670:21: error: 'struct ibmvfc_async_crq' has >> no member named 'wwpn' 2670 | be64_to_cpu(crq->wwpn), be64_to_cpu(crq->node_name), | ^~ include/uapi/linux/byteorder/big_endian.h:38:51: note: in definition of macro '__be64_to_cpu' 38 | #define __be64_to_cpu(x) ((__force __u64)(__be64)(x)) | ^ drivers/scsi/ibmvscsi/ibmvfc.h:818:4: note: in expansion of macro 'dev_err' 818 |dev_err((vhost)->dev, ##__VA_ARGS__); \ |^~~ drivers/scsi/ibmvscsi/ibmvfc.c:2668:2: note: in expansion of macro 'ibmvfc_log' 2668 | ibmvfc_log(vhost, desc->log_level, "%s event received. scsi_id: %llx, wwpn: %llx," | ^~ >> drivers/scsi/ibmvscsi/ibmvfc.c:2670:45: error: 'struct i
[RESEND][PATCH 7/7] parisc: Avoid overflow at boundary_size
The boundary_size might be as large as ULONG_MAX, which means that a device has no specific boundary limit. So either "+ 1" or passing it to ALIGN() would potentially overflow. According to kernel defines: #define ALIGN_MASK(x, mask) (((x) + (mask)) & ~(mask)) #define ALIGN(x, a) ALIGN_MASK(x, (typeof(x))(a) - 1) We can simplify the logic here: ALIGN(boundary + 1, 1 << shift) >> shift = ALIGN_MASK(b + 1, (1 << s) - 1) >> s = {[b + 1 + (1 << s) - 1] & ~[(1 << s) - 1]} >> s = [b + 1 + (1 << s) - 1] >> s = [b + (1 << s)] >> s = (b >> s) + 1 So fix the potential overflow with the safer shortcut. Signed-off-by: Nicolin Chen Cc: Christoph Hellwig --- drivers/parisc/ccio-dma.c | 4 ++-- drivers/parisc/sba_iommu.c | 4 ++-- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/drivers/parisc/ccio-dma.c b/drivers/parisc/ccio-dma.c index a5507f75b524..c667d6aba764 100644 --- a/drivers/parisc/ccio-dma.c +++ b/drivers/parisc/ccio-dma.c @@ -356,8 +356,8 @@ ccio_alloc_range(struct ioc *ioc, struct device *dev, size_t size) ** ggg sacrifices another 710 to the computer gods. */ - boundary_size = ALIGN((unsigned long long)dma_get_seg_boundary(dev) + 1, - 1ULL << IOVP_SHIFT) >> IOVP_SHIFT; + /* Overflow-free shortcut for: ALIGN(b + 1, 1 << s) >> s */ + boundary_size = (dma_get_seg_boundary(dev) >> IOVP_SHIFT) + 1; if (pages_needed <= 8) { /* diff --git a/drivers/parisc/sba_iommu.c b/drivers/parisc/sba_iommu.c index d4314fba0269..96bc2c617cbd 100644 --- a/drivers/parisc/sba_iommu.c +++ b/drivers/parisc/sba_iommu.c @@ -342,8 +342,8 @@ sba_search_bitmap(struct ioc *ioc, struct device *dev, unsigned long shift; int ret; - boundary_size = ALIGN((unsigned long long)dma_get_seg_boundary(dev) + 1, - 1ULL << IOVP_SHIFT) >> IOVP_SHIFT; + /* Overflow-free shortcut for: ALIGN(b + 1, 1 << s) >> s */ + boundary_size = (dma_get_seg_boundary(dev) >> IOVP_SHIFT) + 1; #if defined(ZX1_SUPPORT) BUG_ON(ioc->ibase & ~IOVP_MASK); -- 2.17.1
[RESEND][PATCH 6/7] x86/amd_gart: Avoid overflow at boundary_size
The boundary_size might be as large as ULONG_MAX, which means that a device has no specific boundary limit. So either "+ 1" or passing it to ALIGN() would potentially overflow. According to kernel defines: #define ALIGN_MASK(x, mask) (((x) + (mask)) & ~(mask)) #define ALIGN(x, a) ALIGN_MASK(x, (typeof(x))(a) - 1) We can simplify the logic here: ALIGN(boundary + 1, 1 << shift) >> shift = ALIGN_MASK(b + 1, (1 << s) - 1) >> s = {[b + 1 + (1 << s) - 1] & ~[(1 << s) - 1]} >> s = [b + 1 + (1 << s) - 1] >> s = [b + (1 << s)] >> s = (b >> s) + 1 So fix the potential overflow with the safer shortcut. Signed-off-by: Nicolin Chen Cc: Christoph Hellwig --- arch/x86/kernel/amd_gart_64.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/x86/kernel/amd_gart_64.c b/arch/x86/kernel/amd_gart_64.c index e89031e9c847..7fa0bb490065 100644 --- a/arch/x86/kernel/amd_gart_64.c +++ b/arch/x86/kernel/amd_gart_64.c @@ -96,8 +96,8 @@ static unsigned long alloc_iommu(struct device *dev, int size, base_index = ALIGN(iommu_bus_base & dma_get_seg_boundary(dev), PAGE_SIZE) >> PAGE_SHIFT; - boundary_size = ALIGN((u64)dma_get_seg_boundary(dev) + 1, - PAGE_SIZE) >> PAGE_SHIFT; + /* Overflow-free shortcut for: ALIGN(b + 1, 1 << s) >> s */ + boundary_size = (dma_get_seg_boundary(dev) >> PAGE_SHIFT) + 1; spin_lock_irqsave(&iommu_bitmap_lock, flags); offset = iommu_area_alloc(iommu_gart_bitmap, iommu_pages, next_bit, -- 2.17.1
[RESEND][PATCH 5/7] sparc: Avoid overflow at boundary_size
The boundary_size might be as large as ULONG_MAX, which means that a device has no specific boundary limit. So either "+ 1" or passing it to ALIGN() would potentially overflow. According to kernel defines: #define ALIGN_MASK(x, mask) (((x) + (mask)) & ~(mask)) #define ALIGN(x, a) ALIGN_MASK(x, (typeof(x))(a) - 1) We can simplify the logic here: ALIGN(boundary + 1, 1 << shift) >> shift = ALIGN_MASK(b + 1, (1 << s) - 1) >> s = {[b + 1 + (1 << s) - 1] & ~[(1 << s) - 1]} >> s = [b + 1 + (1 << s) - 1] >> s = [b + (1 << s)] >> s = (b >> s) + 1 So fix the potential overflow with the safer shortcut. Signed-off-by: Nicolin Chen Cc: Christoph Hellwig --- arch/sparc/kernel/iommu-common.c | 9 +++-- arch/sparc/kernel/iommu.c| 4 ++-- arch/sparc/kernel/pci_sun4v.c| 4 ++-- 3 files changed, 7 insertions(+), 10 deletions(-) diff --git a/arch/sparc/kernel/iommu-common.c b/arch/sparc/kernel/iommu-common.c index 59cb16691322..843e71894d04 100644 --- a/arch/sparc/kernel/iommu-common.c +++ b/arch/sparc/kernel/iommu-common.c @@ -166,13 +166,10 @@ unsigned long iommu_tbl_range_alloc(struct device *dev, } } - if (dev) - boundary_size = ALIGN(dma_get_seg_boundary(dev) + 1, - 1 << iommu->table_shift); - else - boundary_size = ALIGN(1ULL << 32, 1 << iommu->table_shift); + boundary_size = dev ? 
dma_get_seg_boundary(dev) : U32_MAX; - boundary_size = boundary_size >> iommu->table_shift; + /* Overflow-free shortcut for: ALIGN(b + 1, 1 << s) >> s */ + boundary_size = (boundary_size >> iommu->table_shift) + 1; /* * if the skip_span_boundary_check had been set during init, we set * things up so that iommu_is_span_boundary() merely checks if the diff --git a/arch/sparc/kernel/iommu.c b/arch/sparc/kernel/iommu.c index 4ae7388b1bff..d981c37305ae 100644 --- a/arch/sparc/kernel/iommu.c +++ b/arch/sparc/kernel/iommu.c @@ -472,8 +472,8 @@ static int dma_4u_map_sg(struct device *dev, struct scatterlist *sglist, outs->dma_length = 0; max_seg_size = dma_get_max_seg_size(dev); - seg_boundary_size = ALIGN(dma_get_seg_boundary(dev) + 1, - IO_PAGE_SIZE) >> IO_PAGE_SHIFT; + /* Overflow-free shortcut for: ALIGN(b + 1, 1 << s) >> s */ + seg_boundary_size = (dma_get_seg_boundary(dev) >> IO_PAGE_SHIFT) + 1; base_shift = iommu->tbl.table_map_base >> IO_PAGE_SHIFT; for_each_sg(sglist, s, nelems, i) { unsigned long paddr, npages, entry, out_entry = 0, slen; diff --git a/arch/sparc/kernel/pci_sun4v.c b/arch/sparc/kernel/pci_sun4v.c index 14b93c5564e3..233fe8a20cec 100644 --- a/arch/sparc/kernel/pci_sun4v.c +++ b/arch/sparc/kernel/pci_sun4v.c @@ -508,8 +508,8 @@ static int dma_4v_map_sg(struct device *dev, struct scatterlist *sglist, iommu_batch_start(dev, prot, ~0UL); max_seg_size = dma_get_max_seg_size(dev); - seg_boundary_size = ALIGN(dma_get_seg_boundary(dev) + 1, - IO_PAGE_SIZE) >> IO_PAGE_SHIFT; + /* Overflow-free shortcut for: ALIGN(b + 1, 1 << s) >> s */ + seg_boundary_size = (dma_get_seg_boundary(dev) >> IO_PAGE_SHIFT) + 1; mask = *dev->dma_mask; if (!iommu_use_atu(iommu, mask)) -- 2.17.1
[RESEND][PATCH 4/7] s390/pci_dma: Avoid overflow at boundary_size
The boundary_size might be as large as ULONG_MAX, which means that a device has no specific boundary limit. So either "+ 1" or passing it to ALIGN() would potentially overflow. According to kernel defines: #define ALIGN_MASK(x, mask) (((x) + (mask)) & ~(mask)) #define ALIGN(x, a) ALIGN_MASK(x, (typeof(x))(a) - 1) We can simplify the logic here: ALIGN(boundary + 1, 1 << shift) >> shift = ALIGN_MASK(b + 1, (1 << s) - 1) >> s = {[b + 1 + (1 << s) - 1] & ~[(1 << s) - 1]} >> s = [b + 1 + (1 << s) - 1] >> s = [b + (1 << s)] >> s = (b >> s) + 1 So fix the potential overflow with the safer shortcut. Signed-off-by: Nicolin Chen Cc: Christoph Hellwig --- arch/s390/pci/pci_dma.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/s390/pci/pci_dma.c b/arch/s390/pci/pci_dma.c index 64b1399a73f0..ecb067acc6d5 100644 --- a/arch/s390/pci/pci_dma.c +++ b/arch/s390/pci/pci_dma.c @@ -263,8 +263,8 @@ static unsigned long __dma_alloc_iommu(struct device *dev, struct zpci_dev *zdev = to_zpci(to_pci_dev(dev)); unsigned long boundary_size; - boundary_size = ALIGN(dma_get_seg_boundary(dev) + 1, - PAGE_SIZE) >> PAGE_SHIFT; + /* Overflow-free shortcut for: ALIGN(b + 1, 1 << s) >> s */ + boundary_size = (dma_get_seg_boundary(dev) >> PAGE_SHIFT) + 1; return iommu_area_alloc(zdev->iommu_bitmap, zdev->iommu_pages, start, size, zdev->start_dma >> PAGE_SHIFT, boundary_size, 0); -- 2.17.1
[RESEND][PATCH 0/7] Avoid overflow at boundary_size
For this resend: The original series has not been acked on any patch, so I am resending it, as suggested by Niklas. Coverletter: We are expanding the default DMA segmentation boundary to its possible maximum value (ULONG_MAX) to indicate that a device doesn't specify a boundary limit. So all dma_get_seg_boundary callers should take a precaution with the return value, since it can easily overflow. I scanned the entire kernel tree for all the existing callers and found that most of the callers may overflow it in two ways: either "+ 1" or passing it to ALIGN(), which does "+ mask". According to kernel defines: #define ALIGN_MASK(x, mask) (((x) + (mask)) & ~(mask)) #define ALIGN(x, a) ALIGN_MASK(x, (typeof(x))(a) - 1) We can simplify the logic here: ALIGN(boundary + 1, 1 << shift) >> shift = ALIGN_MASK(b + 1, (1 << s) - 1) >> s = {[b + 1 + (1 << s) - 1] & ~[(1 << s) - 1]} >> s = [b + 1 + (1 << s) - 1] >> s = [b + (1 << s)] >> s = (b >> s) + 1 So this series of patches fixes the potential overflow with this overflow-free shortcut. As I don't have these platforms, testing/comments are welcome. Thanks Nic Nicolin Chen (7): powerpc/iommu: Avoid overflow at boundary_size alpha: Avoid overflow at boundary_size ia64/sba_iommu: Avoid overflow at boundary_size s390/pci_dma: Avoid overflow at boundary_size sparc: Avoid overflow at boundary_size x86/amd_gart: Avoid overflow at boundary_size parisc: Avoid overflow at boundary_size arch/alpha/kernel/pci_iommu.c| 10 -- arch/ia64/hp/common/sba_iommu.c | 4 ++-- arch/powerpc/kernel/iommu.c | 11 +-- arch/s390/pci/pci_dma.c | 4 ++-- arch/sparc/kernel/iommu-common.c | 9 +++-- arch/sparc/kernel/iommu.c| 4 ++-- arch/sparc/kernel/pci_sun4v.c| 4 ++-- arch/x86/kernel/amd_gart_64.c| 4 ++-- drivers/parisc/ccio-dma.c| 4 ++-- drivers/parisc/sba_iommu.c | 4 ++-- 10 files changed, 26 insertions(+), 32 deletions(-) -- 2.17.1
[RESEND][PATCH 3/7] ia64/sba_iommu: Avoid overflow at boundary_size
The boundary_size might be as large as ULONG_MAX, which means that a device has no specific boundary limit. So either "+ 1" or passing it to ALIGN() would potentially overflow. According to kernel defines: #define ALIGN_MASK(x, mask) (((x) + (mask)) & ~(mask)) #define ALIGN(x, a) ALIGN_MASK(x, (typeof(x))(a) - 1) We can simplify the logic here: ALIGN(boundary + 1, 1 << shift) >> shift = ALIGN_MASK(b + 1, (1 << s) - 1) >> s = {[b + 1 + (1 << s) - 1] & ~[(1 << s) - 1]} >> s = [b + 1 + (1 << s) - 1] >> s = [b + (1 << s)] >> s = (b >> s) + 1 So fix the potential overflow with the safer shortcut. Signed-off-by: Nicolin Chen Cc: Christoph Hellwig --- arch/ia64/hp/common/sba_iommu.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/ia64/hp/common/sba_iommu.c b/arch/ia64/hp/common/sba_iommu.c index 656a4888c300..945954903bb0 100644 --- a/arch/ia64/hp/common/sba_iommu.c +++ b/arch/ia64/hp/common/sba_iommu.c @@ -485,8 +485,8 @@ sba_search_bitmap(struct ioc *ioc, struct device *dev, ASSERT(((unsigned long) ioc->res_hint & (sizeof(unsigned long) - 1UL)) == 0); ASSERT(res_ptr < res_end); - boundary_size = (unsigned long long)dma_get_seg_boundary(dev) + 1; - boundary_size = ALIGN(boundary_size, 1ULL << iovp_shift) >> iovp_shift; + /* Overflow-free shortcut for: ALIGN(b + 1, 1 << s) >> s */ + boundary_size = (dma_get_seg_boundary(dev) >> iovp_shift) + 1; BUG_ON(ioc->ibase & ~iovp_mask); shift = ioc->ibase >> iovp_shift; -- 2.17.1
[RESEND][PATCH 1/7] powerpc/iommu: Avoid overflow at boundary_size
The boundary_size might be as large as ULONG_MAX, which means that a device has no specific boundary limit. So either "+ 1" or passing it to ALIGN() would potentially overflow. According to kernel defines: #define ALIGN_MASK(x, mask) (((x) + (mask)) & ~(mask)) #define ALIGN(x, a) ALIGN_MASK(x, (typeof(x))(a) - 1) We can simplify the logic here: ALIGN(boundary + 1, 1 << shift) >> shift = ALIGN_MASK(b + 1, (1 << s) - 1) >> s = {[b + 1 + (1 << s) - 1] & ~[(1 << s) - 1]} >> s = [b + 1 + (1 << s) - 1] >> s = [b + (1 << s)] >> s = (b >> s) + 1 So fix the potential overflow with the safer shortcut. Reported-by: Stephen Rothwell Signed-off-by: Nicolin Chen Cc: Christoph Hellwig --- arch/powerpc/kernel/iommu.c | 11 +-- 1 file changed, 5 insertions(+), 6 deletions(-) diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c index 9704f3f76e63..c01ccbf8afdd 100644 --- a/arch/powerpc/kernel/iommu.c +++ b/arch/powerpc/kernel/iommu.c @@ -236,15 +236,14 @@ static unsigned long iommu_range_alloc(struct device *dev, } } - if (dev) - boundary_size = ALIGN(dma_get_seg_boundary(dev) + 1, - 1 << tbl->it_page_shift); - else - boundary_size = ALIGN(1UL << 32, 1 << tbl->it_page_shift); /* 4GB boundary for iseries_hv_alloc and iseries_hv_map */ + boundary_size = dev ? dma_get_seg_boundary(dev) : U32_MAX; + + /* Overflow-free shortcut for: ALIGN(b + 1, 1 << s) >> s */ + boundary_size = (boundary_size >> tbl->it_page_shift) + 1; n = iommu_area_alloc(tbl->it_map, limit, start, npages, tbl->it_offset, -boundary_size >> tbl->it_page_shift, align_mask); +boundary_size, align_mask); if (n == -1) { if (likely(pass == 0)) { /* First try the pool from the start */ -- 2.17.1
[RESEND][PATCH 2/7] alpha: Avoid overflow at boundary_size
The boundary_size might be as large as ULONG_MAX, which means that a device has no specific boundary limit. So "+ 1" would potentially overflow. Also, following other places in the kernel, boundary_size should be aligned to PAGE_SIZE before being right-shifted by PAGE_SHIFT. However, passing it to ALIGN() would potentially overflow too. According to kernel defines: #define ALIGN_MASK(x, mask) (((x) + (mask)) & ~(mask)) #define ALIGN(x, a) ALIGN_MASK(x, (typeof(x))(a) - 1) We can simplify the logic here: ALIGN(boundary + 1, 1 << shift) >> shift = ALIGN_MASK(b + 1, (1 << s) - 1) >> s = {[b + 1 + (1 << s) - 1] & ~[(1 << s) - 1]} >> s = [b + 1 + (1 << s) - 1] >> s = [b + (1 << s)] >> s = (b >> s) + 1 So fix the potential overflow with the safer shortcut. Signed-off-by: Nicolin Chen Cc: Christoph Hellwig --- arch/alpha/kernel/pci_iommu.c | 10 -- 1 file changed, 4 insertions(+), 6 deletions(-) diff --git a/arch/alpha/kernel/pci_iommu.c b/arch/alpha/kernel/pci_iommu.c index 81037907268d..1ef2c647bd3e 100644 --- a/arch/alpha/kernel/pci_iommu.c +++ b/arch/alpha/kernel/pci_iommu.c @@ -141,12 +141,10 @@ iommu_arena_find_pages(struct device *dev, struct pci_iommu_arena *arena, unsigned long boundary_size; base = arena->dma_base >> PAGE_SHIFT; - if (dev) { - boundary_size = dma_get_seg_boundary(dev) + 1; - boundary_size >>= PAGE_SHIFT; - } else { - boundary_size = 1UL << (32 - PAGE_SHIFT); - } + + boundary_size = dev ? dma_get_seg_boundary(dev) : U32_MAX; + /* Overflow-free shortcut for: ALIGN(b + 1, 1 << s) >> s */ + boundary_size = (boundary_size >> PAGE_SHIFT) + 1; /* Search forward for the first mask-aligned sequence of N free ptes */ ptes = arena->ptes; -- 2.17.1
[Bug 209029] kernel 5.9-rc2 fails to boot on a PowerMac G5 11,2 - BUG: Kernel NULL pointer dereference on read at 0x00000020
https://bugzilla.kernel.org/show_bug.cgi?id=209029 --- Comment #2 from Erhard F. (erhar...@mailbox.org) --- No change with 5.9-rc3. -- You are receiving this mail because: You are watching the assignee of the bug.
Re: [PATCH net-next] net/wan/fsl_ucc_hdlc: Add MODULE_DESCRIPTION
From: YueHaibing Date: Sat, 29 Aug 2020 19:58:23 +0800 > Add missing MODULE_DESCRIPTION. > > Signed-off-by: YueHaibing Applied.
[RFC PATCH 2/2] powerpc/kernel/iommu: Introduce IOMMU DMA pagecache
In pseries, mapping a DMA page for a cpu memory range requires a H_PUT_TCE* hypercall, and unmapping requires a H_STUFF_TCE hypercall. When doing a lot of I/O, a thread can spend a lot of time doing such hcalls, especially when a DMA mapping doesn't get reused. The purpose of this change is to introduce a mechanism similar to a pagecache, but for reusing DMA mappings, improving performance and avoiding multiple DMA mappings of the same cpupage. This works based on a few current behaviors: - Page reuse: It's common for userspace processes to reuse the same page for several allocations. This is probably caused by page buffering behavior that comes from libc, so getting pages from the kernel is less expensive. - A lot of small allocations are used: Some workloads do a lot of allocations that do not fill a whole page, causing multiple DMA mappings for a single page. How it works: - When a DMA mapping is required for a given allocation, it first searches the DMA pagecache for a matching mapping. When found, the refcount is incremented; when not found, a new mapping is created. - Every time a new mapping is created, it's added to the DMA pagecache, with refcount = 1; - When the mapping is not needed anymore (iommu_free()), it looks in the DMA pagecache for the entry, and then decrements its refcount. - When there are more than IOMMU_MAP_LIST_MAX entries in the DMA pagecache, it removes the older ones. What is inside: - 1 XArray which indexes the DMA page addresses, used for removing mappings and decreasing refcounts. - 1 XArray which indexes CPU page addresses, for finding matching mappings. - As there can be multiple mappings (directions) for the same cpupage, this one keeps a llist for looking into the entries. - Every entry (page) in the XArray points to the mapping struct. - The mapping struct has: - DMA & CPU page addresses, size (pages), direction and refcount. 
- 1 llist used for multiple mappings at the same CPU page - 1 llist used for the FIFO (removing old unused entries) - The cache struct, added to iommu_table, has: - Both XArrays, - 2 llists for the entry/removal points of the FIFO, - 1 atomic cachesize, to manage the FIFO. Signed-off-by: Leonardo Bras --- arch/powerpc/include/asm/iommu-cache.h | 31 arch/powerpc/include/asm/iommu.h | 4 + arch/powerpc/kernel/Makefile | 2 +- arch/powerpc/kernel/iommu-cache.c | 247 + arch/powerpc/kernel/iommu.c| 15 +- 5 files changed, 293 insertions(+), 6 deletions(-) create mode 100644 arch/powerpc/include/asm/iommu-cache.h create mode 100644 arch/powerpc/kernel/iommu-cache.c diff --git a/arch/powerpc/include/asm/iommu-cache.h b/arch/powerpc/include/asm/iommu-cache.h new file mode 100644 index ..ad298a4cd9c9 --- /dev/null +++ b/arch/powerpc/include/asm/iommu-cache.h @@ -0,0 +1,31 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _IOMMU_CACHE_H +#define _IOMMU_CACHE_H +#ifdef __KERNEL__ + +#include +#include +#include + +struct dmacache { + struct llist_head fifo_add; + struct llist_head fifo_del; + struct xarray cpupages; + struct xarray dmapages; + atomic64_t cachesize; +}; + +#include + +void iommu_cache_init(struct iommu_table *tbl); +void iommu_dmacache_add(struct iommu_table *tbl, void *page, unsigned int npages, dma_addr_t addr, + enum dma_data_direction direction); +dma_addr_t iommu_dmacache_use(struct iommu_table *tbl, void *page, unsigned int npages, + enum dma_data_direction direction); +void iommu_dmacache_free(struct iommu_table *tbl, dma_addr_t dma_handle, unsigned int npages); + +#define IOMMU_MAP_LIST_MAX 8192 +#define IOMMU_MAP_LIST_THRES 128 + +#endif /* __KERNEL__ */ +#endif /* _IOMMU_CACHE_H */ diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h index 2913e5c8b1f8..51a2f5503f8e 100644 --- a/arch/powerpc/include/asm/iommu.h +++ b/arch/powerpc/include/asm/iommu.h @@ -18,6 +18,7 @@ #include #include #include +#include #define IOMMU_PAGE_SHIFT_4K 12 
#define IOMMU_PAGE_SIZE_4K (ASM_CONST(1) << IOMMU_PAGE_SHIFT_4K) @@ -114,6 +115,7 @@ struct iommu_table { int it_nid; unsigned long it_reserved_start; /* Start of not-DMA-able (MMIO) area */ unsigned long it_reserved_end; + struct dmacache cache; }; #define IOMMU_TABLE_USERSPACE_ENTRY_RO(tbl, entry) \ @@ -317,6 +319,8 @@ extern void iommu_release_ownership(struct iommu_table *tbl); extern enum dma_data_direction iommu_tce_direction(unsigned long tce); extern unsigned long iommu_direction_to_tce_perm(enum dma_data_direction dir); +void __iommu_free(struct iommu_table *tbl, dma_addr_t dma_addr, unsigned int npages); + #ifdef CONFIG_PPC_CELL_NATIVE extern bool iommu_fixed_is_weak; #else diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile index cbf41fb4ee89..62f6c90076d9 100644 ---
[RFC PATCH 1/2] dma-direction: Add DMA_DIR_COMPAT() macro to test direction compatibility
Given an existing mapping with 'current' direction, and a 'wanted' direction for using that mapping, check if 'wanted' is satisfied by 'current'.

current            accepts
DMA_BIDIRECTIONAL  DMA_BIDIRECTIONAL, DMA_TO_DEVICE, DMA_FROM_DEVICE, DMA_NONE
DMA_TO_DEVICE      DMA_TO_DEVICE, DMA_NONE
DMA_FROM_DEVICE    DMA_FROM_DEVICE, DMA_NONE
DMA_NONE           DMA_NONE

This macro is useful for checking if a DMA mapping can be reused. Signed-off-by: Leonardo Bras --- include/linux/dma-direction.h | 3 +++ 1 file changed, 3 insertions(+) diff --git a/include/linux/dma-direction.h b/include/linux/dma-direction.h index 9c96e30e6a0b..caf3943a21f4 100644 --- a/include/linux/dma-direction.h +++ b/include/linux/dma-direction.h @@ -9,4 +9,7 @@ enum dma_data_direction { DMA_NONE = 3, }; +/* Checks if wanted direction is satisfied by current mapping direction */ +#define DMA_DIR_COMPAT(current, wanted) (((current) & ~(wanted)) == 0) + #endif -- 2.25.4
[RFC PATCH 0/2] DMA pagecache
This RFC improves the performance of indirect mapping on all tested DMA usages, based on an mlx5 device, ranging from 64KB packets to 1-byte packets, from 1 thread to 64 threads. In all workloads tested, the performance of indirect mapping gets very near to the direct mapping case. The whole thing is designed to have as much performance as possible, so the impact of the pagecache is not too big. As I am not very experienced in XArrays usage, nor in lockless algorithms, I would especially appreciate feedback on possible failures in its usage, missing barriers, and so on. Also, this size for the FIFO is just for testing purposes. It's also very possible that it will not be a good idea on platforms other than pseries (I have not tested them). I can plan a bypass for those cases without much work. Thank you! Leonardo Bras (2): dma-direction: Add DMA_DIR_COMPAT() macro to test direction compatibility powerpc/kernel/iommu: Introduce IOMMU DMA pagecache arch/powerpc/include/asm/iommu-cache.h | 31 arch/powerpc/include/asm/iommu.h | 4 + arch/powerpc/kernel/Makefile | 2 +- arch/powerpc/kernel/iommu-cache.c | 247 + arch/powerpc/kernel/iommu.c| 15 +- include/linux/dma-direction.h | 3 + 6 files changed, 296 insertions(+), 6 deletions(-) create mode 100644 arch/powerpc/include/asm/iommu-cache.h create mode 100644 arch/powerpc/kernel/iommu-cache.c -- 2.25.4
[PATCH] arch: vdso: add vdso linker script to 'targets' instead of extra-y
The vdso linker script is preprocessed on demand. Adding it to 'targets' is enough to include the .cmd file. Signed-off-by: Masahiro Yamada --- arch/arm64/kernel/vdso/Makefile | 2 +- arch/arm64/kernel/vdso32/Makefile | 2 +- arch/nds32/kernel/vdso/Makefile | 2 +- arch/powerpc/kernel/vdso32/Makefile | 2 +- arch/powerpc/kernel/vdso64/Makefile | 2 +- arch/s390/kernel/vdso64/Makefile| 2 +- 6 files changed, 6 insertions(+), 6 deletions(-) diff --git a/arch/arm64/kernel/vdso/Makefile b/arch/arm64/kernel/vdso/Makefile index 45d5cfe46429..7cd8aafbe96e 100644 --- a/arch/arm64/kernel/vdso/Makefile +++ b/arch/arm64/kernel/vdso/Makefile @@ -54,7 +54,7 @@ endif GCOV_PROFILE := n obj-y += vdso.o -extra-y += vdso.lds +targets += vdso.lds CPPFLAGS_vdso.lds += -P -C -U$(ARCH) # Force dependency (incbin is bad) diff --git a/arch/arm64/kernel/vdso32/Makefile b/arch/arm64/kernel/vdso32/Makefile index d6adb4677c25..572475b7b7ed 100644 --- a/arch/arm64/kernel/vdso32/Makefile +++ b/arch/arm64/kernel/vdso32/Makefile @@ -155,7 +155,7 @@ asm-obj-vdso := $(addprefix $(obj)/, $(asm-obj-vdso)) obj-vdso := $(c-obj-vdso) $(c-obj-vdso-gettimeofday) $(asm-obj-vdso) obj-y += vdso.o -extra-y += vdso.lds +targets += vdso.lds CPPFLAGS_vdso.lds += -P -C -U$(ARCH) # Force dependency (vdso.s includes vdso.so through incbin) diff --git a/arch/nds32/kernel/vdso/Makefile b/arch/nds32/kernel/vdso/Makefile index 7c3c1ccb196e..55df25ef0057 100644 --- a/arch/nds32/kernel/vdso/Makefile +++ b/arch/nds32/kernel/vdso/Makefile @@ -20,7 +20,7 @@ GCOV_PROFILE := n obj-y += vdso.o -extra-y += vdso.lds +targets += vdso.lds CPPFLAGS_vdso.lds += -P -C -U$(ARCH) # Force dependency diff --git a/arch/powerpc/kernel/vdso32/Makefile b/arch/powerpc/kernel/vdso32/Makefile index 87ab1152d5ce..fd5072a4c73c 100644 --- a/arch/powerpc/kernel/vdso32/Makefile +++ b/arch/powerpc/kernel/vdso32/Makefile @@ -29,7 +29,7 @@ ccflags-y := -shared -fno-common -fno-builtin -nostdlib \ asflags-y := -D__VDSO32__ -s obj-y += vdso32_wrapper.o 
-extra-y += vdso32.lds +targets += vdso32.lds CPPFLAGS_vdso32.lds += -P -C -Upowerpc # Force dependency (incbin is bad) diff --git a/arch/powerpc/kernel/vdso64/Makefile b/arch/powerpc/kernel/vdso64/Makefile index 38c317f25141..c737b3ea3207 100644 --- a/arch/powerpc/kernel/vdso64/Makefile +++ b/arch/powerpc/kernel/vdso64/Makefile @@ -17,7 +17,7 @@ ccflags-y := -shared -fno-common -fno-builtin -nostdlib \ asflags-y := -D__VDSO64__ -s obj-y += vdso64_wrapper.o -extra-y += vdso64.lds +targets += vdso64.lds CPPFLAGS_vdso64.lds += -P -C -U$(ARCH) # Force dependency (incbin is bad) diff --git a/arch/s390/kernel/vdso64/Makefile b/arch/s390/kernel/vdso64/Makefile index 4a66a1cb919b..d0d406cfffa9 100644 --- a/arch/s390/kernel/vdso64/Makefile +++ b/arch/s390/kernel/vdso64/Makefile @@ -25,7 +25,7 @@ $(targets:%=$(obj)/%.dbg): KBUILD_CFLAGS = $(KBUILD_CFLAGS_64) $(targets:%=$(obj)/%.dbg): KBUILD_AFLAGS = $(KBUILD_AFLAGS_64) obj-y += vdso64_wrapper.o -extra-y += vdso64.lds +targets += vdso64.lds CPPFLAGS_vdso64.lds += -P -C -U$(ARCH) # Disable gcov profiling, ubsan and kasan for VDSO code -- 2.25.1
[PATCH] scsi: ibmvfc: interface updates for future FPIN and MQ support
VIOS partitions with SLI-4 enabled Emulex adapters will be capable of driving IO in parallel through multiple work queues or channels, and with new hypervisor firmware that supports multiple interrupt sources an ibmvfc NPIV single initiator can be modified to exploit end-to-end channelization in a PowerVM environment. VIOS hosts will also be able to expose fabric performance impact notifications (FPIN) via a new asynchronous event to ibmvfc clients that advertise support via IBMVFC_CAN_HANDLE_FPIN in their capabilities flag during NPIV_LOGIN. This patch introduces three new Management Datagrams (MADs) for channelization support negotiation as well as the FPIN asynchronous event and FPIN status flags. Follow-up work is required to plumb the ibmvfc client driver to use these new interfaces. Signed-off-by: Tyrel Datwyler --- drivers/scsi/ibmvscsi/ibmvfc.h | 66 +- 1 file changed, 65 insertions(+), 1 deletion(-) diff --git a/drivers/scsi/ibmvscsi/ibmvfc.h b/drivers/scsi/ibmvscsi/ibmvfc.h index 907889f1fa9d..801681b63daa 100644 --- a/drivers/scsi/ibmvscsi/ibmvfc.h +++ b/drivers/scsi/ibmvscsi/ibmvfc.h @@ -124,6 +124,9 @@ enum ibmvfc_mad_types { IBMVFC_PASSTHRU = 0x0200, IBMVFC_TMF_MAD = 0x0100, IBMVFC_NPIV_LOGOUT = 0x0800, + IBMVFC_CHANNEL_ENQUIRY = 0x1000, + IBMVFC_CHANNEL_SETUP= 0x2000, + IBMVFC_CONNECTION_INFO = 0x4000, }; struct ibmvfc_mad_common { @@ -162,6 +165,8 @@ struct ibmvfc_npiv_login { __be32 max_cmds; __be64 capabilities; #define IBMVFC_CAN_MIGRATE 0x01 +#define IBMVFC_CAN_USE_CHANNELS0x02 +#define IBMVFC_CAN_HANDLE_FPIN 0x04 __be64 node_name; struct srp_direct_buf async; u8 partition_name[IBMVFC_MAX_NAME]; @@ -204,6 +209,7 @@ struct ibmvfc_npiv_login_resp { __be64 capabilities; #define IBMVFC_CAN_FLUSH_ON_HALT 0x08 #define IBMVFC_CAN_SUPPRESS_ABTS 0x10 +#define IBMVFC_CAN_SUPPORT_CHANNELS0x20 __be32 max_cmds; __be32 scsi_id_sz; __be64 max_dma_len; @@ -482,6 +488,52 @@ struct ibmvfc_passthru_mad { struct ibmvfc_passthru_fc_iu fc_iu; }__attribute__((packed, 
aligned (8))); +struct ibmvfc_channel_enquiry { + struct ibmvfc_mad_common common; + __be32 flags; +#define IBMVFC_NO_CHANNELS_TO_CRQ_SUPPORT 0x01 +#define IBMVFC_SUPPORT_VARIABLE_SUBQ_MSG 0x02 +#define IBMVFC_NO_N_TO_M_CHANNELS_SUPPORT 0x04 + __be32 num_scsi_subq_channels; + __be32 num_nvmeof_subq_channels; + __be32 num_scsi_vas_channels; + __be32 num_nvmeof_vas_channels; +}__attribute__((packed, aligned (8))); + +struct ibmvfc_channel_setup_mad { + struct ibmvfc_mad_common common; + struct srp_direct_buf buffer; +}__attribute__((packed, aligned (8))); + +#define IBMVFC_MAX_CHANNELS502 + +struct ibmvfc_channel_setup { + __be32 flags; +#define IBMVFC_CANCEL_CHANNELS 0x01 +#define IBMVFC_USE_BUFFER 0x02 +#define IBMVFC_CHANNELS_CANCELED 0x04 + __be32 reserved; + __be32 num_scsi_subq_channels; + __be32 num_nvmeof_subq_channels; + __be32 num_scsi_vas_channels; + __be32 num_nvmeof_vas_channels; + struct srp_direct_buf buffer; + __be64 reserved2[5]; + __be64 channel_handles[IBMVFC_MAX_CHANNELS]; +}__attribute__((packed, aligned (8))); + +struct ibmvfc_connection_info { + struct ibmvfc_mad_common common; + __be64 information_bits; +#define IBMVFC_NO_FC_IO_CHANNEL0x01 +#define IBMVFC_NO_PHYP_VAS 0x02 +#define IBMVFC_NO_PHYP_SUBQ0x04 +#define IBMVFC_PHYP_DEPRECATED_SUBQ0x08 +#define IBMVFC_PHYP_PRESERVED_SUBQ 0x10 +#define IBMVFC_PHYP_FULL_SUBQ 0x20 + __be64 reserved[16]; +}__attribute__((packed, aligned (8))); + struct ibmvfc_trace_start_entry { u32 xfer_len; }__attribute__((packed)); @@ -532,6 +584,7 @@ enum ibmvfc_async_event { IBMVFC_AE_HALT = 0x0400, IBMVFC_AE_RESUME= 0x0800, IBMVFC_AE_ADAPTER_FAILED= 0x1000, + IBMVFC_AE_FPIN = 0x2000, }; struct ibmvfc_async_desc { @@ -560,10 +613,18 @@ enum ibmvfc_ae_link_state { IBMVFC_AE_LS_LINK_DEAD = 0x08, }; +enum ibmvfc_ae_fpin_status { + IBMVFC_AE_FPIN_LINK_CONGESTED = 0x1, + IBMVFC_AE_FPIN_PORT_CONGESTED = 0x2, + IBMVFC_AE_FPIN_PORT_CLEARED = 0x3, + IBMVFC_AE_FPIN_PORT_DEGRADED= 0x4, +}; + struct ibmvfc_async_crq { volatile 
u8 valid; u8 link_state; - u8 pad[2]; + u8 fpin_status; + u8 pad; __be32 pad2; volatile __be64 event; volatile __be64 scsi_id; @@ -590,6 +651,9 @@ union ibmvfc_iu { struct ibmvfc_tmf tmf; struct ibmvfc_cmd cmd; struct ibmvfc_passthru_mad passthru; + struct ibmvfc_channel_e
[PATCH AUTOSEL 5.4 23/23] fsldma: fix very broken 32-bit ppc ioread64 functionality
From: Linus Torvalds [ Upstream commit 0a4c56c80f90797e9b9f8426c6aae4c0cf1c9785 ] Commit ef91bb196b0d ("kernel.h: Silence sparse warning in lower_32_bits") caused new warnings to show in the fsldma driver, but that commit was not to blame: it only exposed some very incorrect code that tried to take the low 32 bits of an address. That made no sense for multiple reasons, the most notable one being that that code was intentionally limited to only 32-bit ppc builds, so "only low 32 bits of an address" was completely nonsensical. There were no high bits to mask off to begin with. But even more importantly from a correctness standpoint, turning the address into an integer then caused the subsequent address arithmetic to be completely wrong too, and the "+1" actually incremented the address by one, rather than by four. Which again was incorrect, since the code was reading two 32-bit values and trying to make a 64-bit end result of it all. Surprisingly, the iowrite64() did not suffer from the same odd and incorrect model. This code has never worked, but it's questionable whether anybody cared: of the two users that actually read the 64-bit value (by way of some C preprocessor hackery and eventually the 'get_cdar()' inline function), one of them explicitly ignored the value, and the other one might just happen to work despite the incorrect value being read. This patch at least makes it not fail the build any more, and makes the logic superficially sane. Whether it makes any difference to the code _working_ or not shall remain a mystery. 
Compile-tested-by: Guenter Roeck Reviewed-by: Guenter Roeck Signed-off-by: Linus Torvalds Signed-off-by: Sasha Levin --- drivers/dma/fsldma.h | 12 ++-- 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/drivers/dma/fsldma.h b/drivers/dma/fsldma.h index 56f18ae992332..308bed0a560ac 100644 --- a/drivers/dma/fsldma.h +++ b/drivers/dma/fsldma.h @@ -205,10 +205,10 @@ struct fsldma_chan { #else static u64 fsl_ioread64(const u64 __iomem *addr) { - u32 fsl_addr = lower_32_bits(addr); - u64 fsl_addr_hi = (u64)in_le32((u32 *)(fsl_addr + 1)) << 32; + u32 val_lo = in_le32((u32 __iomem *)addr); + u32 val_hi = in_le32((u32 __iomem *)addr + 1); - return fsl_addr_hi | in_le32((u32 *)fsl_addr); + return ((u64)val_hi << 32) + val_lo; } static void fsl_iowrite64(u64 val, u64 __iomem *addr) @@ -219,10 +219,10 @@ static void fsl_iowrite64(u64 val, u64 __iomem *addr) static u64 fsl_ioread64be(const u64 __iomem *addr) { - u32 fsl_addr = lower_32_bits(addr); - u64 fsl_addr_hi = (u64)in_be32((u32 *)fsl_addr) << 32; + u32 val_hi = in_be32((u32 __iomem *)addr); + u32 val_lo = in_be32((u32 __iomem *)addr + 1); - return fsl_addr_hi | in_be32((u32 *)(fsl_addr + 1)); + return ((u64)val_hi << 32) + val_lo; } static void fsl_iowrite64be(u64 val, u64 __iomem *addr) -- 2.25.1
[PATCH AUTOSEL 5.8 42/42] fsldma: fix very broken 32-bit ppc ioread64 functionality
From: Linus Torvalds [ Upstream commit 0a4c56c80f90797e9b9f8426c6aae4c0cf1c9785 ] Commit ef91bb196b0d ("kernel.h: Silence sparse warning in lower_32_bits") caused new warnings to show in the fsldma driver, but that commit was not to blame: it only exposed some very incorrect code that tried to take the low 32 bits of an address. That made no sense for multiple reasons, the most notable one being that that code was intentionally limited to only 32-bit ppc builds, so "only low 32 bits of an address" was completely nonsensical. There were no high bits to mask off to begin with. But even more importantly from a correctness standpoint, turning the address into an integer then caused the subsequent address arithmetic to be completely wrong too, and the "+1" actually incremented the address by one, rather than by four. Which again was incorrect, since the code was reading two 32-bit values and trying to make a 64-bit end result of it all. Surprisingly, the iowrite64() did not suffer from the same odd and incorrect model. This code has never worked, but it's questionable whether anybody cared: of the two users that actually read the 64-bit value (by way of some C preprocessor hackery and eventually the 'get_cdar()' inline function), one of them explicitly ignored the value, and the other one might just happen to work despite the incorrect value being read. This patch at least makes it not fail the build any more, and makes the logic superficially sane. Whether it makes any difference to the code _working_ or not shall remain a mystery. 
Compile-tested-by: Guenter Roeck Reviewed-by: Guenter Roeck Signed-off-by: Linus Torvalds Signed-off-by: Sasha Levin --- drivers/dma/fsldma.h | 12 ++-- 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/drivers/dma/fsldma.h b/drivers/dma/fsldma.h index 56f18ae992332..308bed0a560ac 100644 --- a/drivers/dma/fsldma.h +++ b/drivers/dma/fsldma.h @@ -205,10 +205,10 @@ struct fsldma_chan { #else static u64 fsl_ioread64(const u64 __iomem *addr) { - u32 fsl_addr = lower_32_bits(addr); - u64 fsl_addr_hi = (u64)in_le32((u32 *)(fsl_addr + 1)) << 32; + u32 val_lo = in_le32((u32 __iomem *)addr); + u32 val_hi = in_le32((u32 __iomem *)addr + 1); - return fsl_addr_hi | in_le32((u32 *)fsl_addr); + return ((u64)val_hi << 32) + val_lo; } static void fsl_iowrite64(u64 val, u64 __iomem *addr) @@ -219,10 +219,10 @@ static void fsl_iowrite64(u64 val, u64 __iomem *addr) static u64 fsl_ioread64be(const u64 __iomem *addr) { - u32 fsl_addr = lower_32_bits(addr); - u64 fsl_addr_hi = (u64)in_be32((u32 *)fsl_addr) << 32; + u32 val_hi = in_be32((u32 __iomem *)addr); + u32 val_lo = in_be32((u32 __iomem *)addr + 1); - return fsl_addr_hi | in_be32((u32 *)(fsl_addr + 1)); + return ((u64)val_hi << 32) + val_lo; } static void fsl_iowrite64be(u64 val, u64 __iomem *addr) -- 2.25.1
RE: [PATCH] fsldma: fsl_ioread64*() do not need lower_32_bits()
> -Original Message- > From: Linus Torvalds > Sent: Saturday, August 29, 2020 4:20 PM > To: Guenter Roeck > Cc: Luc Van Oostenryck ; Herbert Xu > ; Andrew Morton foundation.org>; Joerg Roedel ; Leo Li > ; Zhang Wei ; Dan Williams > ; Vinod Koul ; linuxppc-dev > ; dma ; Linux > Kernel Mailing List > Subject: Re: [PATCH] fsldma: fsl_ioread64*() do not need lower_32_bits() > > On Sat, Aug 29, 2020 at 1:40 PM Guenter Roeck wrote: > > > > Except for > > > > CHECK: spaces preferred around that '+' (ctx:VxV) > > #29: FILE: drivers/dma/fsldma.h:223: > > + u32 val_lo = in_be32((u32 __iomem *)addr+1); > > Added spaces. > > > I don't see anything wrong with it either, so > > > > Reviewed-by: Guenter Roeck > > > > Since I didn't see the real problem with the original code, I'd take > > that with a grain of salt, though. > > Well, honestly, the old code was so confused that just making it build is > clearly already an improvement even if everything else were to be wrong. > > So I committed my "fix". If it turns out there's more wrong in there and > somebody tests it, we can fix it again. But now it hopefully compiles, at > least. > > My bet is that if that driver ever worked on ppc32, it will continue to work > whatever we do to that function. > > I _think_ the old code happened to - completely by mistake - get the value > right for the case of "little endian access, with dma_addr_t being 32-bit". > Because then it would still read the upper bits wrong, but the cast to > dma_addr_t would then throw those bits away. And the lower bits would be > right. > > But for big-endian accesses or for ARCH_DMA_ADDR_T_64BIT it really looks > like it always returned a completely incorrect value. > > And again - the driver may have worked even with that completely incorrect > value, since the use of it seems to be very incidental. > > In either case ("it didn't work before" or "it worked because the value > doesn't really matter"), I don't think I could possibly have made things > worse. 
> > Famous last words. Thanks for the patch. Acked-by: Li Yang We run periodic automated regression tests covering ppc32 platforms, but it looks like they missed this issue. I will ask the test team to investigate why the test cases are not sufficient. Regards, Leo
[Bug 205183] PPC64: Signal delivery fails with SIGSEGV if between about 1KB and 4KB bytes of stack remain
https://bugzilla.kernel.org/show_bug.cgi?id=205183 Michael Ellerman (mich...@ellerman.id.au) changed: What|Removed |Added Status|RESOLVED|CLOSED -- You are receiving this mail because: You are watching the assignee of the bug.
Re: KVM on POWER8 host lock up since 10d91611f426 ("powerpc/64s: Reimplement book3s idle code in C")
On 31.08.20 14:58, Michael Ellerman wrote: [...] >> The machines are running in OPAL/PowerNV mode, with "ppc64_cpu --smt=off". >> The number of VMs varies across the machines: >> obs-power8-01: 18 instances, "-smp 16,threads=8" >> obs-power8-02: 20 instances, "-smp 8,threads=8" >> obs-power8-03: 30 instances, "-smp 8,threads=8" >> obs-power8-04: 20 instances, "-smp 8,threads=8" > > Can you send us the output of: > > # grep . /sys/module/kvm_hv/parameters/* of course, the current values are: /sys/module/kvm_hv/parameters/dynamic_mt_modes:6 /sys/module/kvm_hv/parameters/h_ipi_redirect:1 /sys/module/kvm_hv/parameters/indep_threads_mode:Y /sys/module/kvm_hv/parameters/kvm_irq_bypass:1 /sys/module/kvm_hv/parameters/nested:Y /sys/module/kvm_hv/parameters/one_vm_per_core:N /sys/module/kvm_hv/parameters/target_smt_mode:0 (actually identical on all 5 above) -- with kind regards (mit freundlichem Grinsen), Ruediger Oertel (r...@suse.com,r...@suse.de,bugfin...@t-online.de) Do-Not-Accept-Binary-Blobs.Ever.From-Anyone. Key fingerprint = 17DC 6553 86A7 384B 53C5 CA5C 3CE4 F2E7 23F2 B417 SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany, (HRB 36809, AG Nürnberg), Geschäftsführer: Felix Imendörffer
Re: KVM on POWER8 host lock up since 10d91611f426 ("powerpc/64s: Reimplement book3s idle code in C")
Ruediger Oertel writes: > On 31.08.20 03:14, Nicholas Piggin wrote: >> Excerpts from Michal Suchánek's message of August 31, 2020 6:11 am: >>> Hello, >>> >>> on POWER8 KVM hosts lock up since commit 10d91611f426 ("powerpc/64s: >>> Reimplement book3s idle code in C"). >>> >>> The symptom is the host locking up completely after some hours of KVM >>> workload with messages like >>> >>> 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 47 >>> 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 71 >>> 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 47 >>> 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 71 >>> 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 47 >>> >>> printed before the host locks up. >>> >>> The machines run sandboxed builds which is a mixed workload resulting in >>> IO/single core/multiple core load over time and there are periods of no >>> activity and no VMs running as well. The VMs are short-lived so VM >>> setup/teardown is somewhat exercised as well. >>> >>> POWER9 with the new guest entry fast path does not seem to be affected. >>> >>> Reverted the patch and the followup idle fixes on top of 5.2.14 and >>> re-applied commit a3f3072db6ca ("powerpc/powernv/idle: Restore IAMR >>> after idle") which gives the same idle code as 5.1.16 and the kernel seems >>> stable. >>> >>> Config is attached. >>> >>> I cannot easily revert this commit, especially if I want to use the same >>> kernel on POWER8 and POWER9 - many of the POWER9 fixes are applicable >>> only to the new idle code. >>> >>> Any idea what can be the problem? >> >> So hwthread_state is never getting back to HWTHREAD_IN_IDLE on >> those threads. I wonder what they are doing. POWER8 doesn't have a good >> NMI IPI and I don't know if it supports pdbg dumping registers from the >> BMC unfortunately. Do the messages always come in pairs of CPUs? 
>> >> I'm not sure where to start with reproducing, I'll have to try. How many >> vCPUs in the guests? Do you have several guests running at once? > > Hello all, > > some details on the setup... > these machines are buildservice workers, (build.opensuse.org) and all they > do is spawn new VMs, run a package building job inside (rpmbuild, > debbuild,...) > > The machines are running in OPAL/PowerNV mode, with "ppc64_cpu --smt=off". > The number of VMs varies across the machines: > obs-power8-01: 18 instances, "-smp 16,threads=8" > obs-power8-02: 20 instances, "-smp 8,threads=8" > obs-power8-03: 30 instances, "-smp 8,threads=8" > obs-power8-04: 20 instances, "-smp 8,threads=8" Can you send us the output of: # grep . /sys/module/kvm_hv/parameters/* cheers
Re: fsl_espi errors on v5.7.15
On 30.08.2020 23:59, Chris Packham wrote: > > On 31/08/20 9:41 am, Heiner Kallweit wrote: >> On 30.08.2020 23:00, Chris Packham wrote: >>> On 31/08/20 12:30 am, Nicholas Piggin wrote: Excerpts from Chris Packham's message of August 28, 2020 8:07 am: >>> >>> > I've also now seen the RX FIFO not empty error on the T2080RDB > > fsl_espi ffe11.spi: Transfer done but SPIE_DON isn't set! > fsl_espi ffe11.spi: Transfer done but SPIE_DON isn't set! > fsl_espi ffe11.spi: Transfer done but SPIE_DON isn't set! > fsl_espi ffe11.spi: Transfer done but SPIE_DON isn't set! > fsl_espi ffe11.spi: Transfer done but rx/tx fifo's aren't empty! > fsl_espi ffe11.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32 > > With my current workaround of emptying the RX FIFO. It seems > survivable. Interestingly it only ever seems to be 1 extra byte in the > RX FIFO and it seems to be after either a READ_SR or a READ_FSR. > > fsl_espi ffe11.spi: tx 70 > fsl_espi ffe11.spi: rx 03 > fsl_espi ffe11.spi: Extra RX 00 > fsl_espi ffe11.spi: Transfer done but SPIE_DON isn't set! > fsl_espi ffe11.spi: Transfer done but rx/tx fifo's aren't empty! > fsl_espi ffe11.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32 > fsl_espi ffe11.spi: tx 05 > fsl_espi ffe11.spi: rx 00 > fsl_espi ffe11.spi: Extra RX 03 > fsl_espi ffe11.spi: Transfer done but SPIE_DON isn't set! > fsl_espi ffe11.spi: Transfer done but rx/tx fifo's aren't empty! > fsl_espi ffe11.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32 > fsl_espi ffe11.spi: tx 05 > fsl_espi ffe11.spi: rx 00 > fsl_espi ffe11.spi: Extra RX 03 > >From all the Micron SPI-NOR datasheets I've got access to it is > possible to continually read the SR/FSR. But I've no idea why it > happens some times and not others. So I think I've got a reproduction and I think I've bisected the problem to commit 3282a3da25bd ("powerpc/64: Implement soft interrupt replay in C"). My day is just finishing now so I haven't applied too much scrutiny to this result. 
Given the various rabbit holes I've been down on this issue already I'd take this information with a good degree of skepticism. >>> OK, so an easy test should be to re-test with a 5.4 kernel. >>> It doesn't yet have the change you're referring to, and the fsl-espi >>> driver >>> is basically the same as in 5.7 (just two small changes in 5.7). >> There's 6cc0c16d82f88 and maybe also other interrupt related patches >> around this time that could affect book E, so it's good if that exact >> patch is confirmed. > My confirmation is basically that I can induce the issue in a 5.4 kernel > by cherry-picking 3282a3da25bd. I'm also able to "fix" the issue in > 5.9-rc2 by reverting that one commit. > > In both cases it's not exactly a clean cherry-pick/revert so I also > confirmed the bisection result by building at 3282a3da25bd (which sees > the issue) and the commit just before (which does not). Thanks for testing, that confirms it well. [snip patch] > I still saw the issue with this change applied. PPC_IRQ_SOFT_MASK_DEBUG > didn't report anything (either with or without the change above). Okay, it was a bit of a shot in the dark. I still can't see what else has changed. What would cause this, a lost interrupt? A spurious interrupt? Or higher interrupt latency? I don't think the patch should cause significantly worse latency (it's supposed to be a bit better if anything because it doesn't set up the full interrupt frame). But it's possible. >>> My working theory is that the SPI_DON indication is all about the TX >>> direction and now that the interrupts are faster we're hitting an error >>> because there is still RX activity going on. Heiner disagrees with my >>> interpretation of the SPI_DON indication and the fact that it doesn't >>> happen every time does throw doubt on it. >>> >> It's right that the eSPI spec can be interpreted that SPI_DON refers to >> TX only. 
However this wouldn't really make sense, because also for RX >> we program the frame length, and therefore want to be notified once the >> full frame was received. Also practical experience shows that SPI_DON >> is set also after RX-only transfers. >> The typical SPI NOR use case is that you write a read command + start address, >> followed by a longer read. If the TX-only interpretation were right, >> we'd always end up with SPI_DON not being set. >> >>> I can't really explain the extra RX byte in the fifo. We know how many >>> bytes to expect and we pull that many from the fifo so it's not as if >>> we're m
Re: KVM on POWER8 host lock up since 10d91611f426 ("powerpc/64s: Reimplement book3s idle code in C")
Michal Suchánek writes: > On Mon, Aug 31, 2020 at 11:14:18AM +1000, Nicholas Piggin wrote: >> Excerpts from Michal Suchánek's message of August 31, 2020 6:11 am: >> > Hello, >> > >> > on POWER8 KVM hosts lock up since commit 10d91611f426 ("powerpc/64s: >> > Reimplement book3s idle code in C"). >> > >> > The symptom is the host locking up completely after some hours of KVM >> > workload with messages like >> > >> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 47 >> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 71 >> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 47 >> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 71 >> > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 47 >> > >> > printed before the host locks up. >> > >> > The machines run sandboxed builds which is a mixed workload resulting in >> > IO/single core/multiple core load over time and there are periods of no >> > activity and no VMs running as well. The VMs are short-lived so VM >> > setup/teardown is somewhat exercised as well. >> > >> > POWER9 with the new guest entry fast path does not seem to be affected. >> > >> > Reverted the patch and the followup idle fixes on top of 5.2.14 and >> > re-applied commit a3f3072db6ca ("powerpc/powernv/idle: Restore IAMR >> > after idle") which gives the same idle code as 5.1.16 and the kernel seems >> > stable. >> > >> > Config is attached. >> > >> > I cannot easily revert this commit, especially if I want to use the same >> > kernel on POWER8 and POWER9 - many of the POWER9 fixes are applicable >> > only to the new idle code. >> > >> > Any idea what can be the problem? >> >> So hwthread_state is never getting back to HWTHREAD_IN_IDLE on >> those threads. I wonder what they are doing. POWER8 doesn't have a good >> NMI IPI and I don't know if it supports pdbg dumping registers from the >> BMC unfortunately. 
> > It may be possible to set up fadump with a later kernel version that > supports it on powernv and dump the whole kernel. Your firmware won't support it AFAIK. You could try kdump, but if we have CPUs stuck in KVM then there's a good chance it won't work :/ cheers
Re: KVM on POWER8 host lock up since 10d91611f426 ("powerpc/64s: Reimplement book3s idle code in C")
On 31.08.20 03:14, Nicholas Piggin wrote: > So hwthread_state is never getting back to HWTHREAD_IN_IDLE on > those threads. I wonder what they are doing. POWER8 doesn't have a good > NMI IPI and I don't know if it supports pdbg dumping registers from the > BMC unfortunately. > Do the messages always come in pairs of CPUs? Well, one problem is that at some point the machine just locks up completely, so I cannot tell whether more lines would have been printed, and in some cases all I get is a single line. Looking at the stats, in general it is either one CPU printed several times, or a pair (not strictly) alternating: 2020-07-30T03:16:16+00:00 obs-power8-03 kernel: [51284.029821] KVM: couldn't grab cpu 92 2020-07-30T03:16:16+00:00 obs-power8-03 kernel: [51284.058630] KVM: couldn't grab cpu 92 2020-07-30T03:16:16+00:00 obs-power8-03 kernel: [51284.108268] KVM: couldn't grab cpu 92 2020-07-30T03:16:16+00:00 obs-power8-03 kernel: [51284.210206] KVM: couldn't grab cpu 92 2020-07-30T03:16:16+00:00 obs-power8-03 kernel: [51284.323465] KVM: couldn't grab cpu 92 2020-07-30T03:16:16+00:00 obs-power8-03 kernel: [51284.334420] KVM: couldn't grab cpu 92 2020-07-30T03:16:16+00:00 obs-power8-03 kernel: [51284.345470] KVM: couldn't grab cpu 92 2020-07-30T03:16:16+00:00 obs-power8-03 kernel: [51284.395185] KVM: couldn't grab cpu 92 2020-07-30T03:16:16+00:00 obs-power8-03 kernel: [51284.517182] KVM: couldn't grab cpu 92 2020-07-30T03:16:17+00:00 obs-power8-03 kernel: [51284.600716] KVM: couldn't grab cpu 92 2020-07-30T03:16:18+00:00 obs-power8-03 kernel: [51286.201589] KVM: couldn't grab cpu 92 2020-07-30T03:16:19+00:00 obs-power8-03 kernel: [51286.627273] KVM: couldn't grab cpu 92 2020-07-30T16:44:16+00:00 obs-power8-04 kernel: [30099.726288] KVM: couldn't grab cpu 61 2020-07-30T16:44:16+00:00 obs-power8-04 kernel: [30099.736843] KVM: couldn't grab cpu 125 2020-07-30T16:44:17+00:00 obs-power8-04 kernel: [30099.747429] KVM: couldn't grab cpu 125 2020-07-30T16:44:17+00:00 
obs-power8-04 kernel: [30099.877138] KVM: couldn't grab cpu 61 2020-07-30T16:44:17+00:00 obs-power8-04 kernel: [30099.916422] KVM: couldn't grab cpu 125 2020-07-30T16:44:17+00:00 obs-power8-04 kernel: [30099.931755] KVM: couldn't grab cpu 61 2020-07-30T16:44:17+00:00 obs-power8-04 kernel: [30100.029003] KVM: couldn't grab cpu 61 2020-07-30T16:44:17+00:00 obs-power8-04 kernel: [30100.334895] KVM: couldn't grab cpu 125 2020-07-30T16:44:17+00:00 obs-power8-04 kernel: [30100.392713] KVM: couldn't grab cpu 61 2020-07-30T16:44:17+00:00 obs-power8-04 kernel: [30100.569011] KVM: couldn't grab cpu 125 2020-07-30T16:44:17+00:00 obs-power8-04 kernel: [30100.617048] KVM: couldn't grab cpu 125 2020-07-30T16:44:17+00:00 obs-power8-04 kernel: [30100.628107] KVM: couldn't grab cpu 125 2020-07-30T16:44:18+00:00 obs-power8-04 kernel: [30100.809046] KVM: couldn't grab cpu 125 2020-07-30T16:44:18+00:00 obs-power8-04 kernel: [30101.001097] KVM: couldn't grab cpu 61 2020-07-30T16:44:19+00:00 obs-power8-04 kernel: [30102.109007] KVM: couldn't grab cpu 125 2020-07-30T16:44:19+00:00 obs-power8-04 kernel: [30102.254470] KVM: couldn't grab cpu 61 > > I'm not sure where to start with reproducing, I'll have to try. How many > vCPUs in the guests? Do you have several guests running at once? > > Thanks, > Nick > -- with kind regards (mit freundlichem Grinsen), Ruediger Oertel (r...@suse.com,r...@suse.de,bugfin...@t-online.de) Do-Not-Accept-Binary-Blobs.Ever.From-Anyone. Key fingerprint = 17DC 6553 86A7 384B 53C5 CA5C 3CE4 F2E7 23F2 B417 SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany, (HRB 36809, AG Nürnberg), Geschäftsführer: Felix Imendörffer signature.asc Description: OpenPGP digital signature
Re: KVM on POWER8 host lock up since 10d91611f426 ("powerpc/64s: Reimplement book3s idle code in C")
On 31.08.20 03:14, Nicholas Piggin wrote: > Excerpts from Michal Suchánek's message of August 31, 2020 6:11 am: >> Hello, >> >> on POWER8 KVM hosts lock up since commit 10d91611f426 ("powerpc/64s: >> Reimplement book3s idle code in C"). >> >> The symptom is the host locking up completely after some hours of KVM >> workload with messages like >> >> 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 47 >> 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 71 >> 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 47 >> 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 71 >> 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 47 >> >> printed before the host locks up. >> >> The machines run sandboxed builds which is a mixed workload resulting in >> IO/single core/multiple core load over time and there are periods of no >> activity and no VMs running as well. The VMs are short-lived so VM >> setup/teardown is somewhat exercised as well. >> >> POWER9 with the new guest entry fast path does not seem to be affected. >> >> Reverted the patch and the followup idle fixes on top of 5.2.14 and >> re-applied commit a3f3072db6ca ("powerpc/powernv/idle: Restore IAMR >> after idle") which gives the same idle code as 5.1.16 and the kernel seems >> stable. >> >> Config is attached. >> >> I cannot easily revert this commit, especially if I want to use the same >> kernel on POWER8 and POWER9 - many of the POWER9 fixes are applicable >> only to the new idle code. >> >> Any idea what can be the problem? > > So hwthread_state is never getting back to HWTHREAD_IN_IDLE on > those threads. I wonder what they are doing. POWER8 doesn't have a good > NMI IPI and I don't know if it supports pdbg dumping registers from the > BMC unfortunately. Do the messages always come in pairs of CPUs? > > I'm not sure where to start with reproducing, I'll have to try. How many > vCPUs in the guests? 
Do you have several guests running at once? Hello all, some details on the setup... these machines are buildservice workers (build.opensuse.org), and all they do is spawn new VMs and run a package building job inside (rpmbuild, debbuild, ...). The machines are running in OPAL/PowerNV mode, with "ppc64_cpu --smt=off". The number of VMs varies across the machines: obs-power8-01: 18 instances, "-smp 16,threads=8" obs-power8-02: 20 instances, "-smp 8,threads=8" obs-power8-03: 30 instances, "-smp 8,threads=8" obs-power8-04: 20 instances, "-smp 8,threads=8" obs-power8-05: 36 instances, "-smp 4,threads=2" (this one with "ppc64_cpu --subcores-per-core=4"). In any case the stalls can be seen on all of them, sometimes after 4 hours, sometimes only after about a day. The 01 machine with more cpu overcommit seems a little faster at reproducing the problem, but that's more gut feeling than anything backed by real numbers. -- with kind regards (mit freundlichem Grinsen), Ruediger Oertel (r...@suse.com,r...@suse.de,bugfin...@t-online.de) Do-Not-Accept-Binary-Blobs.Ever.From-Anyone. Key fingerprint = 17DC 6553 86A7 384B 53C5 CA5C 3CE4 F2E7 23F2 B417 SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany, (HRB 36809, AG Nürnberg), Geschäftsführer: Felix Imendörffer signature.asc Description: OpenPGP digital signature
Re: [PATCH 5/5] powerpc: use the generic dma_ops_bypass mode
On 8/31/20 8:40 AM, Christoph Hellwig wrote: > On Sun, Aug 30, 2020 at 11:04:21AM +0200, Cédric Le Goater wrote: >> Hello, >> >> On 7/8/20 5:24 PM, Christoph Hellwig wrote: >>> Use the DMA API bypass mechanism for direct window mappings. This uses >>> common code and speeds up the direct mapping case by avoiding indirect >>> calls just when not using dma ops at all. It also fixes a problem where >>> the sync_* methods were using the bypass check for DMA allocations, but >>> those are part of the streaming ops. >>> >>> Note that this patch loses the DMA_ATTR_WEAK_ORDERING override, which >>> has never been well defined, and is only used by a few drivers, which >>> IIRC never showed up in the typical Cell blade setups that are affected >>> by the ordering workaround. >>> >>> Fixes: efd176a04bef ("powerpc/pseries/dma: Allow SWIOTLB") >>> Signed-off-by: Christoph Hellwig >>> --- >>> arch/powerpc/Kconfig | 1 + >>> arch/powerpc/include/asm/device.h | 5 -- >>> arch/powerpc/kernel/dma-iommu.c | 90 --- >>> 3 files changed, 10 insertions(+), 86 deletions(-) >> >> I am seeing corruptions on a couple of POWER9 systems (boston) when >> stressed with IO. stress-ng gives some results but I first saw >> it when compiling the kernel in a guest and this is still the best way >> to raise the issue. >> >> These systems have a SAS Adaptec controller: >> >> 0003:01:00.0 Serial Attached SCSI controller: Adaptec Series 8 12G >> SAS/PCIe 3 (rev 01) >> >> When the failure occurs, the POWERPC EEH interrupt fires and dumps >> lowlevel PHB4 registers among which: >> >> [ 2179.251069490,3] PHB#0003[0:3]: phbErrorStatus = >> 0280 >> [ 2179.251117476,3] PHB#0003[0:3]: phbFirstErrorStatus = >> 0200 >> >> The bits raised identify a PPC 'TCE' error, which means it is related >> to DMAs. See below for more details. >> >> >> Reverting this patch "fixes" the issue but the real cause is probably >> elsewhere, in some other layer or in the aacraid driver. How should I >> proceed to get more information? 
> > The aacraid DMA masks look like a mess. Can you try the hack > below and see if it helps? No effect. The system crashes the same way. But Alexey spotted an issue with swiotlb. C.
Re: [PATCH 5/5] powerpc: use the generic dma_ops_bypass mode
Hello, On 7/8/20 5:24 PM, Christoph Hellwig wrote: > Use the DMA API bypass mechanism for direct window mappings. This uses > common code and speeds up the direct mapping case by avoiding indirect > calls just when not using dma ops at all. It also fixes a problem where > the sync_* methods were using the bypass check for DMA allocations, but > those are part of the streaming ops. > > Note that this patch loses the DMA_ATTR_WEAK_ORDERING override, which > has never been well defined, and is only used by a few drivers, which > IIRC never showed up in the typical Cell blade setups that are affected > by the ordering workaround. > > Fixes: efd176a04bef ("powerpc/pseries/dma: Allow SWIOTLB") > Signed-off-by: Christoph Hellwig > --- > arch/powerpc/Kconfig | 1 + > arch/powerpc/include/asm/device.h | 5 -- > arch/powerpc/kernel/dma-iommu.c | 90 --- > 3 files changed, 10 insertions(+), 86 deletions(-) I am seeing corruptions on a couple of POWER9 systems (boston) when stressed with IO. stress-ng gives some results but I first saw it when compiling the kernel in a guest and this is still the best way to raise the issue. These systems have a SAS Adaptec controller: 0003:01:00.0 Serial Attached SCSI controller: Adaptec Series 8 12G SAS/PCIe 3 (rev 01) When the failure occurs, the POWERPC EEH interrupt fires and dumps lowlevel PHB4 registers among which: [ 2179.251069490,3] PHB#0003[0:3]: phbErrorStatus = 0280 [ 2179.251117476,3] PHB#0003[0:3]: phbFirstErrorStatus = 0200 The bits raised identify a PPC 'TCE' error, which means it is related to DMAs. See below for more details. Reverting this patch "fixes" the issue but the real cause is probably elsewhere, in some other layer or in the aacraid driver. How should I proceed to get more information? Thanks, C. 
[ 2054.970339] EEH: Frozen PE#1fd on PHB#3 detected [ 2054.970375] EEH: PE location: UOPWR.BOS0019-Node0-Onboard SAS, PHB location: N/A [ 2179.249415973,3] PHB#0003[0:3]: brdgCtl = 0002 [ 2179.249515795,3] PHB#0003[0:3]: deviceStatus = 00060040 [ 2179.249596452,3] PHB#0003[0:3]: slotStatus = 00402000 [ 2179.249633728,3] PHB#0003[0:3]: linkStatus = a0830008 [ 2179.249674807,3] PHB#0003[0:3]: devCmdStatus = 00100107 [ 2179.249725974,3] PHB#0003[0:3]: devSecStatus = 00100107 [ 2179.249773550,3] PHB#0003[0:3]: rootErrorStatus = [ 2179.249809823,3] PHB#0003[0:3]: corrErrorStatus = [ 2179.249850439,3] PHB#0003[0:3]:uncorrErrorStatus = [ 2179.249887411,3] PHB#0003[0:3]: devctl = 0040 [ 2179.249928677,3] PHB#0003[0:3]: devStat = 0006 [ 2179.249967150,3] PHB#0003[0:3]: tlpHdr1 = [ 2179.250054987,3] PHB#0003[0:3]: tlpHdr2 = [ 2179.250146600,3] PHB#0003[0:3]: tlpHdr3 = [ 2179.250262780,3] PHB#0003[0:3]: tlpHdr4 = [ 2179.250343101,3] PHB#0003[0:3]: sourceId = [ 2179.250514264,3] PHB#0003[0:3]: nFir = [ 2179.250717971,3] PHB#0003[0:3]: nFirMask = 0030001c [ 2179.250791474,3] PHB#0003[0:3]: nFirWOF = [ 2179.250842054,3] PHB#0003[0:3]: phbPlssr = 001c [ 2179.250886003,3] PHB#0003[0:3]: phbCsr = 001c [ 2179.250929859,3] PHB#0003[0:3]: lemFir = 00010080 [ 2179.250973720,3] PHB#0003[0:3]: lemErrorMask = [ 2179.251018622,3] PHB#0003[0:3]: lemWOF = 0080 [ 2179.251069490,3] PHB#0003[0:3]: phbErrorStatus = 0280 [ 2179.251117476,3] PHB#0003[0:3]: phbFirstErrorStatus = 0200 [ 2179.251162218,3] PHB#0003[0:3]: phbErrorLog0 = 214898000240 [ 2179.251206251,3] PHB#0003[0:3]: phbErrorLog1 = a0084000 [ 2179.251253096,3] PHB#0003[0:3]:phbTxeErrorStatus = [ 2179.265087656,3] PHB#0003[0:3]: phbTxeFirstErrorStatus = [ 2179.265142247,3] PHB#0003[0:3]: phbTxeErrorLog0 = [ 2179.265189734,3] PHB#0003[0:3]: phbTxeErrorLog1 = [ 2179.266335296,3] PHB#0003[0:3]: phbRxeArbErrorStatus = [ 2179.266380366,3] PHB#0003[0:3]: phbRxeArbFrstErrorStatus = [ 2179.266426523,3] PHB#0003[0:3]: phbRxeArbErrorLog0 = [ 
2179.267537283,3] PHB#0003[0:3]: phbRxeArbErrorLog1 = [ 2179.267581708,3] PHB#0003[0:3]: phbRxeMrgErrorStatus = [ 2179.267628324,3] PHB#0003[0:3]: phbRxeMrgFrstErrorStatus = [ 2179.268748771,3] PHB#0003[0:3]: phbRxeMrgErrorLog
Re: KVM on POWER8 host lock up since 10d91611f426 ("powerpc/64s: Reimplement book3s idle code in C")
On Mon, Aug 31, 2020 at 11:14:18AM +1000, Nicholas Piggin wrote: > Excerpts from Michal Suchánek's message of August 31, 2020 6:11 am: > > Hello, > > > > on POWER8 KVM hosts lock up since commit 10d91611f426 ("powerpc/64s: > > Reimplement book3s idle code in C"). > > > > The symptom is the host locking up completely after some hours of KVM > > workload with messages like > > > > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 47 > > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 71 > > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 47 > > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 71 > > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 47 > > > > printed before the host locks up. > > > > The machines run sandboxed builds which is a mixed workload resulting in > > IO/single core/multiple core load over time and there are periods of no > > activity and no VMs running as well. The VMs are short-lived so VM > > setup/teardown is somewhat exercised as well. > > > > POWER9 with the new guest entry fast path does not seem to be affected. > > > > Reverted the patch and the followup idle fixes on top of 5.2.14 and > > re-applied commit a3f3072db6ca ("powerpc/powernv/idle: Restore IAMR > > after idle") which gives the same idle code as 5.1.16 and the kernel seems > > stable. > > > > Config is attached. > > > > I cannot easily revert this commit, especially if I want to use the same > > kernel on POWER8 and POWER9 - many of the POWER9 fixes are applicable > > only to the new idle code. > > > > Any idea what can be the problem? > > So hwthread_state is never getting back to HWTHREAD_IN_IDLE on > those threads. I wonder what they are doing. POWER8 doesn't have a good > NMI IPI and I don't know if it supports pdbg dumping registers from the > BMC unfortunately. 
It may be possible to set up fadump with a later kernel version that supports it on powernv and dump the whole kernel. Thanks Michal
Re: KVM on POWER8 host lock up since 10d91611f426 ("powerpc/64s: Reimplement book3s idle code in C")
On Mon, Aug 31, 2020 at 11:14:18AM +1000, Nicholas Piggin wrote: > Excerpts from Michal Suchánek's message of August 31, 2020 6:11 am: > > Hello, > > > > on POWER8 KVM hosts lock up since commit 10d91611f426 ("powerpc/64s: > > Reimplement book3s idle code in C"). > > > > The symptom is the host locking up completely after some hours of KVM > > workload with messages like > > > > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 47 > > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 71 > > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 47 > > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 71 > > 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 47 > > > > printed before the host locks up. > > > > The machines run sandboxed builds which is a mixed workload resulting in > > IO/single core/multiple core load over time and there are periods of no > > activity and no VMs running as well. The VMs are short-lived so VM > > setup/teardown is somewhat exercised as well. > > > > POWER9 with the new guest entry fast path does not seem to be affected. > > > > Reverted the patch and the followup idle fixes on top of 5.2.14 and > > re-applied commit a3f3072db6ca ("powerpc/powernv/idle: Restore IAMR > > after idle") which gives the same idle code as 5.1.16 and the kernel seems > > stable. > > > > Config is attached. > > > > I cannot easily revert this commit, especially if I want to use the same > > kernel on POWER8 and POWER9 - many of the POWER9 fixes are applicable > > only to the new idle code. > > > > Any idea what can be the problem? > > So hwthread_state is never getting back to HWTHREAD_IN_IDLE on > those threads. I wonder what they are doing. POWER8 doesn't have a good > NMI IPI and I don't know if it supports pdbg dumping registers from the > BMC unfortunately. Do the messages always come in pairs of CPUs? 
> > I'm not sure where to start with reproducing, I'll have to try. How many > vCPUs in the guests? Do you have several guests running at once? The guests are spawned on demand - there are like 20-30 'slots' configured where a VM may be running or it may be idle with no VM spawned when there are no jobs available. Thanks Michal
[PATCH 1/2] powerpc/8xx: Refactor calculation of number of entries per PTE in page tables
On 8xx, the number of entries occupied by a PTE in the page tables depends on the size of the page. At the time being, this calculation is done in two places: in pte_update() and in set_huge_pte_at(). Refactor this calculation into a helper called number_of_cells_per_pte(). For the time being, the val param is unused. It will be used by the following patch. Instead of open-coding is_hugepd(), use hugepd_ok() with a forward declaration. Signed-off-by: Christophe Leroy --- arch/powerpc/include/asm/nohash/32/pgtable.h | 18 -- arch/powerpc/mm/pgtable.c| 6 -- 2 files changed, 16 insertions(+), 8 deletions(-) diff --git a/arch/powerpc/include/asm/nohash/32/pgtable.h b/arch/powerpc/include/asm/nohash/32/pgtable.h index b9e134d0f03a..80bbc21b87f0 100644 --- a/arch/powerpc/include/asm/nohash/32/pgtable.h +++ b/arch/powerpc/include/asm/nohash/32/pgtable.h @@ -227,6 +227,17 @@ static inline void pmd_clear(pmd_t *pmdp) */ #ifdef CONFIG_PPC_8xx static pmd_t *pmd_off(struct mm_struct *mm, unsigned long addr); +static int hugepd_ok(hugepd_t hpd); + +static int number_of_cells_per_pte(pmd_t *pmd, pte_basic_t val, int huge) +{ + if (!huge) + return PAGE_SIZE / SZ_4K; + else if (hugepd_ok(*((hugepd_t *)pmd))) + return 1; + else + return SZ_512K / SZ_4K; +} static inline pte_basic_t pte_update(struct mm_struct *mm, unsigned long addr, pte_t *p, unsigned long clr, unsigned long set, int huge) @@ -237,12 +248,7 @@ static inline pte_basic_t pte_update(struct mm_struct *mm, unsigned long addr, p int num, i; pmd_t *pmd = pmd_off(mm, addr); - if (!huge) - num = PAGE_SIZE / SZ_4K; - else if ((pmd_val(*pmd) & _PMD_PAGE_MASK) != _PMD_PAGE_8M) - num = SZ_512K / SZ_4K; - else - num = 1; + num = number_of_cells_per_pte(pmd, new, huge); for (i = 0; i < num; i++, entry++, new += SZ_4K) *entry = new; diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c index 9c0547d77af3..2dcad640b869 100644 --- a/arch/powerpc/mm/pgtable.c +++ b/arch/powerpc/mm/pgtable.c @@ -266,8 +266,7 @@ void 
set_huge_pte_at(struct mm_struct *mm, unsigned long addr, pte_t *ptep, pte_ pmd_t *pmd = pmd_off(mm, addr); pte_basic_t val; pte_basic_t *entry = &ptep->pte; - int num = is_hugepd(*((hugepd_t *)pmd)) ? 1 : SZ_512K / SZ_4K; - int i; + int num, i; /* * Make sure hardware valid bit is not set. We don't do @@ -280,6 +279,9 @@ void set_huge_pte_at(struct mm_struct *mm, unsigned long addr, pte_t *ptep, pte_ pte = set_pte_filter(pte); val = pte_val(pte); + + num = number_of_cells_per_pte(pmd, val, 1); + for (i = 0; i < num; i++, entry++, val += SZ_4K) *entry = val; } -- 2.25.0
[PATCH 2/2] powerpc/8xx: Support 16k hugepages with 4k pages
The 8xx has 4 page sizes: 4k, 16k, 512k and 8M. 4k and 16k can be selected at build time as standard page sizes, and 512k and 8M are hugepages. When 4k standard pages are selected, 16k pages are not available. Allow 16k pages as hugepages when 4k pages are used. To allow that, implement arch_make_huge_pte() which receives the necessary arguments to allow setting the PTE in accordance with the page size: - 512k pages must have _PAGE_HUGE and _PAGE_SPS. They are set by pte_mkhuge(). arch_make_huge_pte() does nothing. - 16k pages must have only _PAGE_SPS. arch_make_huge_pte() clears _PAGE_HUGE. Signed-off-by: Christophe Leroy --- arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h | 14 ++ arch/powerpc/include/asm/nohash/32/pgtable.h | 2 ++ arch/powerpc/mm/hugetlbpage.c| 2 +- arch/powerpc/mm/nohash/tlb.c | 4 arch/powerpc/mm/ptdump/8xx.c | 5 + include/uapi/asm-generic/hugetlb_encode.h| 1 + include/uapi/linux/mman.h| 1 + 7 files changed, 24 insertions(+), 5 deletions(-) diff --git a/arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h b/arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h index e752a5807a59..39be9aea86db 100644 --- a/arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h +++ b/arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h @@ -65,4 +65,18 @@ static inline void huge_ptep_set_wrprotect(struct mm_struct *mm, pte_update(mm, addr, ptep, clr, set, 1); } +#ifdef CONFIG_PPC_4K_PAGES +static inline pte_t arch_make_huge_pte(pte_t entry, struct vm_area_struct *vma, + struct page *page, int writable) +{ + size_t size = huge_page_size(hstate_vma(vma)); + + if (size == SZ_16K) + return __pte(pte_val(entry) & ~_PAGE_HUGE); + else + return entry; +} +#define arch_make_huge_pte arch_make_huge_pte +#endif + #endif /* _ASM_POWERPC_NOHASH_32_HUGETLB_8XX_H */ diff --git a/arch/powerpc/include/asm/nohash/32/pgtable.h b/arch/powerpc/include/asm/nohash/32/pgtable.h index 80bbc21b87f0..ee2243ba96cf 100644 --- a/arch/powerpc/include/asm/nohash/32/pgtable.h +++ 
b/arch/powerpc/include/asm/nohash/32/pgtable.h @@ -235,6 +235,8 @@ static int number_of_cells_per_pte(pmd_t *pmd, pte_basic_t val, int huge) return PAGE_SIZE / SZ_4K; else if (hugepd_ok(*((hugepd_t *)pmd))) return 1; + else if (IS_ENABLED(CONFIG_PPC_4K_PAGES) && !(val & _PAGE_HUGE)) + return SZ_16K / SZ_4K; else return SZ_512K / SZ_4K; } diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c index e7ae2a2c4545..36c3800769fb 100644 --- a/arch/powerpc/mm/hugetlbpage.c +++ b/arch/powerpc/mm/hugetlbpage.c @@ -180,7 +180,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, unsigned long sz if (!hpdp) return NULL; - if (IS_ENABLED(CONFIG_PPC_8xx) && sz == SZ_512K) + if (IS_ENABLED(CONFIG_PPC_8xx) && pshift < PMD_SHIFT) return pte_alloc_map(mm, (pmd_t *)hpdp, addr); BUG_ON(!hugepd_none(*hpdp) && !hugepd_ok(*hpdp)); diff --git a/arch/powerpc/mm/nohash/tlb.c b/arch/powerpc/mm/nohash/tlb.c index 14514585db98..5872f69141d5 100644 --- a/arch/powerpc/mm/nohash/tlb.c +++ b/arch/powerpc/mm/nohash/tlb.c @@ -83,16 +83,12 @@ struct mmu_psize_def mmu_psize_defs[MMU_PAGE_COUNT] = { }; #elif defined(CONFIG_PPC_8xx) struct mmu_psize_def mmu_psize_defs[MMU_PAGE_COUNT] = { - /* we only manage 4k and 16k pages as normal pages */ -#ifdef CONFIG_PPC_4K_PAGES [MMU_PAGE_4K] = { .shift = 12, }, -#else [MMU_PAGE_16K] = { .shift = 14, }, -#endif [MMU_PAGE_512K] = { .shift = 19, }, diff --git a/arch/powerpc/mm/ptdump/8xx.c b/arch/powerpc/mm/ptdump/8xx.c index 8a797dcbf475..86da2a669680 100644 --- a/arch/powerpc/mm/ptdump/8xx.c +++ b/arch/powerpc/mm/ptdump/8xx.c @@ -11,8 +11,13 @@ static const struct flag_info flag_array[] = { { +#ifdef CONFIG_PPC_16K_PAGES .mask = _PAGE_HUGE, .val= _PAGE_HUGE, +#else + .mask = _PAGE_SPS, + .val= _PAGE_SPS, +#endif .set= "huge", .clear = "", }, { diff --git a/include/uapi/asm-generic/hugetlb_encode.h b/include/uapi/asm-generic/hugetlb_encode.h index b0f8e87235bd..4f3d5aaa11f5 100644 --- a/include/uapi/asm-generic/hugetlb_encode.h 
+++ b/include/uapi/asm-generic/hugetlb_encode.h @@ -20,6 +20,7 @@ #define HUGETLB_FLAG_ENCODE_SHIFT 26 #define HUGETLB_FLAG_ENCODE_MASK 0x3f +#define HUGETLB_FLAG_ENCODE_16KB (14 << HUGETLB_FLAG_ENCODE_SHIFT) #define HUGETLB_FLAG_ENCODE_64KB (16 << HUGETLB_FLAG_ENCODE_SHIFT) #define HUGETLB_FLAG_ENCODE_512KB (19 << HUGETLB_FLAG_ENCODE_SHIFT) #define HUGETLB_FLAG_ENCOD
[PATCH] selftests/vm: Fix display of page size in map_hugetlb
The displayed size is in bytes while the text says it is in kB. Shift it by 10 to really display kBytes. Fixes: fa7b9a805c79 ("tools/selftest/vm: allow choosing mem size and page size in map_hugetlb") Cc: sta...@vger.kernel.org Signed-off-by: Christophe Leroy --- tools/testing/selftests/vm/map_hugetlb.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tools/testing/selftests/vm/map_hugetlb.c b/tools/testing/selftests/vm/map_hugetlb.c index 6af951900aa3..312889edb84a 100644 --- a/tools/testing/selftests/vm/map_hugetlb.c +++ b/tools/testing/selftests/vm/map_hugetlb.c @@ -83,7 +83,7 @@ int main(int argc, char **argv) } if (shift) - printf("%u kB hugepages\n", 1 << shift); + printf("%u kB hugepages\n", 1 << (shift - 10)); else printf("Default size hugepages\n"); printf("Mapping %lu Mbytes\n", (unsigned long)length >> 20); -- 2.25.0
[PATCH] powerpc: Fix random segfault when freeing hugetlb range
The following random segfault is observed from time to time with the map_hugetlb selftest: root@localhost:~# ./map_hugetlb 1 19 524288 kB hugepages Mapping 1 Mbytes Segmentation fault [ 31.219972] map_hugetlb[365]: segfault (11) at 117 nip 77974f8c lr 779a6834 code 1 in ld-2.23.so[77966000+21000] [ 31.220192] map_hugetlb[365]: code: 9421ffc0 480318d1 93410028 90010044 9361002c 93810030 93a10034 93c10038 [ 31.220307] map_hugetlb[365]: code: 93e1003c 93210024 8123007c 81430038 <80e90004> 814a0004 7f443a14 813a0004 [ 31.221911] BUG: Bad rss-counter state mm:(ptrval) type:MM_FILEPAGES val:33 [ 31.229362] BUG: Bad rss-counter state mm:(ptrval) type:MM_ANONPAGES val:5 This fault is due to hugetlb_free_pgd_range() freeing page tables that are also used by regular pages. As explained in the comment at the beginning of hugetlb_free_pgd_range(), the verification done in free_pgd_range() on floor and ceiling is not done here, which means hugetlb_free_pte_range() can free outside the expected range. As the verification cannot be done in hugetlb_free_pgd_range(), it must be done in hugetlb_free_pte_range(). 
Fixes: b250c8c08c79 ("powerpc/8xx: Manage 512k huge pages as standard pages.") Cc: sta...@vger.kernel.org Signed-off-by: Christophe Leroy --- arch/powerpc/mm/hugetlbpage.c | 18 -- 1 file changed, 16 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c index 26292544630f..e7ae2a2c4545 100644 --- a/arch/powerpc/mm/hugetlbpage.c +++ b/arch/powerpc/mm/hugetlbpage.c @@ -330,10 +330,24 @@ static void free_hugepd_range(struct mmu_gather *tlb, hugepd_t *hpdp, int pdshif get_hugepd_cache_index(pdshift - shift)); } -static void hugetlb_free_pte_range(struct mmu_gather *tlb, pmd_t *pmd, unsigned long addr) +static void hugetlb_free_pte_range(struct mmu_gather *tlb, pmd_t *pmd, + unsigned long addr, unsigned long end, + unsigned long floor, unsigned long ceiling) { + unsigned long start = addr; pgtable_t token = pmd_pgtable(*pmd); + start &= PMD_MASK; + if (start < floor) + return; + if (ceiling) { + ceiling &= PMD_MASK; + if (!ceiling) + return; + } + if (end - 1 > ceiling - 1) + return; + pmd_clear(pmd); pte_free_tlb(tlb, token, addr); mm_dec_nr_ptes(tlb->mm); @@ -363,7 +377,7 @@ static void hugetlb_free_pmd_range(struct mmu_gather *tlb, pud_t *pud, */ WARN_ON(!IS_ENABLED(CONFIG_PPC_8xx)); - hugetlb_free_pte_range(tlb, pmd, addr); + hugetlb_free_pte_range(tlb, pmd, addr, end, floor, ceiling); continue; } -- 2.25.0