Re: [PATCH V2] powerpc/mm: Fix Multi hit ERAT cause by recent THP update
On Mon, Feb 08, 2016 at 11:44:22AM +0530, Aneesh Kumar K.V wrote:
> With ppc64 we use the deposited pgtable_t to store the hash pte slot
> information. We should not withdraw the deposited pgtable_t without
> marking the pmd none. This ensures that low level hash fault handling
> will skip this huge pte and we will handle them at upper levels.
>
> Recent change to pmd splitting changed the above in order to handle the
> race between pmd split and exit_mmap. The race is explained below.
>
> Consider following race:
>
>         CPU0                            CPU1
> shrink_page_list()
>   add_to_swap()
>     split_huge_page_to_list()
>       __split_huge_pmd_locked()
>         pmdp_huge_clear_flush_notify()
>         // pmd_none() == true
>                                         exit_mmap()
>                                           unmap_vmas()
>                                             zap_pmd_range()
>                                               // no action on pmd since
>                                               pmd_none() == true
>         pmd_populate()
>
> As result the THP will not be freed. The leak is detected by check_mm():
>
>         BUG: Bad rss-counter state mm:880058d2e580 idx:1 val:512
>
> The above required us to not mark pmd none during a pmd split.
>
> The fix for ppc is to clear the huge pte of _PAGE_USER, so that low
> level fault handling code skips this pte. At higher level we do take ptl
> lock. That should serialize us against the pmd split. Once the lock is
> acquired we do check the pmd again using pmd_same. That should always
> return false for us and hence we should retry the access.

I guess it is worth mentioning that this serialization against ptl happens
in huge_pmd_set_accessed(), if I didn't miss anything.

> Also make sure we wait for irq disable section in other cpus to finish
> before flipping a huge pte entry with a regular pmd entry. Code paths
> like find_linux_pte_or_hugepte depend on irq disable to get
> a stable pte_t pointer. A parallel thp split needs to make sure we
> don't convert a pmd pte to a regular pmd entry without waiting for the
> irq disable section to finish.
>
> Signed-off-by: Aneesh Kumar K.V
> ---
>  arch/powerpc/include/asm/book3s/64/pgtable.h |  4
>  arch/powerpc/mm/pgtable_64.c                 | 35 +++-
>  include/asm-generic/pgtable.h                |  8 +++
>  mm/huge_memory.c                             |  1 +
>  4 files changed, 47 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
> index 8d1c41d28318..0415856941e0 100644
> --- a/arch/powerpc/include/asm/book3s/64/pgtable.h
> +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
> @@ -281,6 +281,10 @@ extern pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp);
>  extern void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
>  			    pmd_t *pmdp);
>
> +#define __HAVE_ARCH_PMDP_HUGE_SPLITTING_FLUSH
> +extern void pmdp_huge_splitting_flush(struct vm_area_struct *vma,
> +				      unsigned long address, pmd_t *pmdp);
> +
>  #define pmd_move_must_withdraw pmd_move_must_withdraw
>  struct spinlock;
>  static inline int pmd_move_must_withdraw(struct spinlock *new_pmd_ptl,
> diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c
> index 3124a20d0fab..e8214b7f2210 100644
> --- a/arch/powerpc/mm/pgtable_64.c
> +++ b/arch/powerpc/mm/pgtable_64.c
> @@ -646,6 +646,30 @@ pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
>  	return pgtable;
>  }
>
> +void pmdp_huge_splitting_flush(struct vm_area_struct *vma,
> +			       unsigned long address, pmd_t *pmdp)
> +{
> +	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> +
> +#ifdef CONFIG_DEBUG_VM
> +	BUG_ON(REGION_ID(address) != USER_REGION_ID);
> +#endif
> +	/*
> +	 * We can't mark the pmd none here, because that will cause a race
> +	 * against exit_mmap. We need to continue mark pmd TRANS HUGE, while
> +	 * we spilt, but at the same time we wan't rest of the ppc64 code
> +	 * not to insert hash pte on this, because we will be modifying
> +	 * the deposited pgtable in the caller of this function. Hence
> +	 * clear the _PAGE_USER so that we move the fault handling to
> +	 * higher level function and that will serialize against ptl.
> +	 * We need to flush existing hash pte entries here even though,
> +	 * the translation is still valid, because we will withdraw
> +	 * pgtable_t after this.
> +	 */
> +	pmd_hugepage_update(vma->vm_mm, address, pmdp, _PAGE_USER, 0);
> +}
> +
> +
>  /*
>   * set a new huge pmd. We should not be called for updating
>   * an existing pmd entry. That should go via pmd_hugepage_update.
> @@ -663,10 +687,19 @@ void set_pmd_at(struct mm_struct *mm, unsigned long addr,
>  	return set_pte_at(mm, addr,
Re: [PATCH v10 8/8] numa, mm, cleanup: remove redundant NODE_DATA macro from asm header files.
Hi Ganapatrao,

[auto build test ERROR on arm64/for-next/core]
[also build test ERROR on v4.5-rc2 next-20160205]
[if your patch is applied to the wrong git tree, please drop us a note to help improving the system]

url:    https://github.com/0day-ci/linux/commits/Ganapatrao-Kulkarni/arm64-numa-adding-numa-support-for-arm64-platforms/20160202-181522
base:   https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux for-next/core
config: i386-randconfig-sb0-02030124 (attached as .config)
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386

All error/warnings (new ones prefixed by >>):

   In file included from include/linux/gfp.h:5:0,
                    from include/linux/slab.h:14,
                    from include/linux/crypto.h:24,
                    from arch/x86/kernel/asm-offsets.c:8:
   arch/x86/include/asm/mmzone_32.h: In function 'pfn_valid':
>> include/linux/mmzone.h:704:41: error: implicit declaration of function 'NODE_DATA' [-Werror=implicit-function-declaration]
    #define node_end_pfn(nid) pgdat_end_pfn(NODE_DATA(nid))
                                            ^
>> arch/x86/include/asm/mmzone_32.h:42:17: note: in expansion of macro 'node_end_pfn'
     return (pfn < node_end_pfn(nid));
                   ^
>> include/linux/mmzone.h:704:41: warning: passing argument 1 of 'pgdat_end_pfn' makes pointer from integer without a cast [-Wint-conversion]
    #define node_end_pfn(nid) pgdat_end_pfn(NODE_DATA(nid))
                                            ^
>> arch/x86/include/asm/mmzone_32.h:42:17: note: in expansion of macro 'node_end_pfn'
     return (pfn < node_end_pfn(nid));
                   ^
   include/linux/mmzone.h:706:29: note: expected 'pg_data_t * {aka struct pglist_data *}' but argument is of type 'int'
    static inline unsigned long pgdat_end_pfn(pg_data_t *pgdat)
                                ^
   cc1: some warnings being treated as errors
   make[2]: *** [arch/x86/kernel/asm-offsets.s] Error 1
   make[2]: Target '__build' not remade because of errors.
   make[1]: *** [prepare0] Error 2
   make[1]: Target 'prepare' not remade because of errors.
   make: *** [sub-make] Error 2

vim +/NODE_DATA +704 include/linux/mmzone.h

d41dee369 Andy Whitcroft    2005-06-23  698  #else
d41dee369 Andy Whitcroft    2005-06-23  699  #define pgdat_page_nr(pgdat, pagenr) pfn_to_page((pgdat)->node_start_pfn + (pagenr))
d41dee369 Andy Whitcroft    2005-06-23  700  #endif
408fde81c Dave Hansen       2005-06-23  701  #define nid_page_nr(nid, pagenr) pgdat_page_nr(NODE_DATA(nid),(pagenr))
^1da177e4 Linus Torvalds    2005-04-16  702
c6830c226 KAMEZAWA Hiroyuki 2011-06-16  703  #define node_start_pfn(nid) (NODE_DATA(nid)->node_start_pfn)
da3649e13 Cody P Schafer    2013-02-22 @704  #define node_end_pfn(nid) pgdat_end_pfn(NODE_DATA(nid))
c6830c226 KAMEZAWA Hiroyuki 2011-06-16  705
da3649e13 Cody P Schafer    2013-02-22  706  static inline unsigned long pgdat_end_pfn(pg_data_t *pgdat)
da3649e13 Cody P Schafer    2013-02-22  707  {

:: The code at line 704 was first introduced by commit
:: da3649e133948d8b7d8c57b05a33faf62ac2cc7e mmzone: add pgdat_{end_pfn,is_empty}() helpers & consolidate.

:: TO: Cody P Schafer
:: CC: Linus Torvalds

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

.config.gz
Description: Binary data
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v5 06/23] powerpc32: refactor x_mapped_by_bats() and x_mapped_by_tlbcam() together
Hi Christophe,

[auto build test ERROR on powerpc/next]
[also build test ERROR on v4.5-rc2 next-20160205]
[if your patch is applied to the wrong git tree, please drop us a note to help improving the system]

url:    https://github.com/0day-ci/linux/commits/Christophe-Leroy/powerpc-8xx-Use-large-pages-for-RAM-and-IMMR-and-other-improvments/20160204-071322
base:   https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next
config: powerpc-ppc64e_defconfig (attached as .config)
reproduce:
        wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=powerpc

All errors (new ones prefixed by >>):

>> arch/powerpc/mm/fsl_booke_mmu.c:78:13: error: redefinition of 'v_block_mapped'
    phys_addr_t v_block_mapped(unsigned long va)
                ^
   In file included from arch/powerpc/mm/fsl_booke_mmu.c:57:0:
   arch/powerpc/mm/mmu_decl.h:168:27: note: previous definition of 'v_block_mapped' was here
    static inline phys_addr_t v_block_mapped(unsigned long va) { return 0; }
                              ^
>> arch/powerpc/mm/fsl_booke_mmu.c:90:15: error: redefinition of 'p_block_mapped'
    unsigned long p_block_mapped(phys_addr_t pa)
                  ^
   In file included from arch/powerpc/mm/fsl_booke_mmu.c:57:0:
   arch/powerpc/mm/mmu_decl.h:169:29: note: previous definition of 'p_block_mapped' was here
    static inline unsigned long p_block_mapped(phys_addr_t pa) { return 0; }
                                ^

vim +/v_block_mapped +78 arch/powerpc/mm/fsl_booke_mmu.c

    72		return tlbcam_addrs[idx].limit - tlbcam_addrs[idx].start + 1;
    73	}
    74
    75	/*
    76	 * Return PA for this VA if it is mapped by a CAM, or 0
    77	 */
  > 78	phys_addr_t v_block_mapped(unsigned long va)
    79	{
    80		int b;
    81		for (b = 0; b < tlbcam_index; ++b)
    82			if (va >= tlbcam_addrs[b].start && va < tlbcam_addrs[b].limit)
    83				return tlbcam_addrs[b].phys + (va - tlbcam_addrs[b].start);
    84		return 0;
    85	}
    86
    87	/*
    88	 * Return VA for a given PA or 0 if not mapped
    89	 */
  > 90	unsigned long p_block_mapped(phys_addr_t pa)
    91	{
    92		int b;
    93		for (b = 0; b < tlbcam_index; ++b)

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

.config.gz
Description: Binary data
[PATCH 4/4] powerpc/ps3: gelic_udbg: use struct udphdr from
Instead of defining a local version of struct udphdr use the standard
definition from . The 'src' field is named 'source' in the
definition.

Signed-off-by: Luis Henriques
---
 arch/powerpc/platforms/ps3/gelic_udbg.c | 10 ++
 1 file changed, 2 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/platforms/ps3/gelic_udbg.c b/arch/powerpc/platforms/ps3/gelic_udbg.c
index 01d274fcbe51..b8f90a8465b9 100644
--- a/arch/powerpc/platforms/ps3/gelic_udbg.c
+++ b/arch/powerpc/platforms/ps3/gelic_udbg.c
@@ -16,6 +16,7 @@
 #include
 #include
 #include
+#include
 #include
 #include

@@ -60,13 +61,6 @@ struct debug_block {
 	u8 pkt[1520];
 } __packed;

-struct udphdr {
-	u16 src;
-	u16 dest;
-	u16 len;
-	u16 checksum;
-} __packed;
-
 static __iomem struct ethhdr *h_eth;
 static __iomem struct vlan_hdr *h_vlan;
 static __iomem struct iphdr *h_ip;
@@ -185,7 +179,7 @@ static void gelic_debug_init(void)
 	header_size += sizeof(struct udphdr);
 	h_udp = (struct udphdr *)(h_ip + 1);
-	h_udp->src = GELIC_DEBUG_PORT;
+	h_udp->source = GELIC_DEBUG_PORT;
 	h_udp->dest = GELIC_DEBUG_PORT;

 	pmsgc = pmsg = (char *)(h_udp + 1);
[PATCH 0/4] powerpc/ps3: gelic_udbg: drop local versions of network data structs
Several network-related data structures are defined in gelic_udbg. These
could be easily dropped and the standard ones defined in network headers
could be used instead.

The 4 patches that follow replace ethernet, vlan, ip and udp structures in
gelic_udbg. Note that this has been compile-tested only.

Luis Henriques (4):
  powerpc/ps3: gelic_udbg: use struct ethhdr from
  powerpc/ps3: gelic_udbg: use struct vlan_hdr from
  powerpc/ps3: gelic_udbg: use struct iphdr from
  powerpc/ps3: gelic_udbg: use struct udphdr from

 arch/powerpc/platforms/ps3/gelic_udbg.c | 71 +++--
 1 file changed, 23 insertions(+), 48 deletions(-)
[PATCH 2/4] powerpc/ps3: gelic_udbg: use struct vlan_hdr from
Instead of defining the local struct vlantag use the standard definition
of vlan_hdr from . The fields in the definition have different names:
 - vlan -> h_vlan_TCI
 - subtype -> h_vlan_encapsulated_proto

Signed-off-by: Luis Henriques
---
 arch/powerpc/platforms/ps3/gelic_udbg.c | 16 ++--
 1 file changed, 6 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/platforms/ps3/gelic_udbg.c b/arch/powerpc/platforms/ps3/gelic_udbg.c
index ac87811e8b4e..4d6e827edfde 100644
--- a/arch/powerpc/platforms/ps3/gelic_udbg.c
+++ b/arch/powerpc/platforms/ps3/gelic_udbg.c
@@ -14,6 +14,7 @@
  */

 #include
+#include
 #include
 #include

@@ -58,11 +59,6 @@ struct debug_block {
 	u8 pkt[1520];
 } __packed;

-struct vlantag {
-	u16 vlan;
-	u16 subtype;
-} __packed;
-
 struct iphdr {
 	u8 ver_len;
 	u8 dscp_ecn;
@@ -84,7 +80,7 @@ struct udphdr {
 } __packed;

 static __iomem struct ethhdr *h_eth;
-static __iomem struct vlantag *h_vlan;
+static __iomem struct vlan_hdr *h_vlan;
 static __iomem struct iphdr *h_ip;
 static __iomem struct udphdr *h_udp;

@@ -181,10 +177,10 @@ static void gelic_debug_init(void)
 	if (!result) {
 		h_eth->h_proto= 0x8100;

-		header_size += sizeof(struct vlantag);
-		h_vlan = (struct vlantag *)(h_eth + 1);
-		h_vlan->vlan = vlan_id;
-		h_vlan->subtype = 0x0800;
+		header_size += sizeof(struct vlan_hdr);
+		h_vlan = (struct vlan_hdr *)(h_eth + 1);
+		h_vlan->h_vlan_TCI = vlan_id;
+		h_vlan->h_vlan_encapsulated_proto = 0x0800;
 		h_ip = (struct iphdr *)(h_vlan + 1);
 	} else {
 		h_eth->h_proto= 0x0800;
[PATCH 3/4] powerpc/ps3: gelic_udbg: use struct iphdr from
Instead of defining a local version of struct iphdr use the standard
definition from . Several fields in the definition have different names:
 - proto -> protocol
 - src -> saddr
 - dest -> daddr
 - total_length -> tot_len
 - checksum -> check

Also, 'ver_len' is composed of 'version' and 'ihl' in .

Signed-off-by: Luis Henriques
---
 arch/powerpc/platforms/ps3/gelic_udbg.c | 29 +
 1 file changed, 9 insertions(+), 20 deletions(-)

diff --git a/arch/powerpc/platforms/ps3/gelic_udbg.c b/arch/powerpc/platforms/ps3/gelic_udbg.c
index 4d6e827edfde..01d274fcbe51 100644
--- a/arch/powerpc/platforms/ps3/gelic_udbg.c
+++ b/arch/powerpc/platforms/ps3/gelic_udbg.c
@@ -15,6 +15,7 @@
 #include
 #include
+#include
 #include
 #include

@@ -59,19 +60,6 @@ struct debug_block {
 	u8 pkt[1520];
 } __packed;

-struct iphdr {
-	u8 ver_len;
-	u8 dscp_ecn;
-	u16 total_length;
-	u16 ident;
-	u16 frag_off_flags;
-	u8 ttl;
-	u8 proto;
-	u16 checksum;
-	u32 src;
-	u32 dest;
-} __packed;
-
 struct udphdr {
 	u16 src;
 	u16 dest;
@@ -188,11 +176,12 @@ static void gelic_debug_init(void)
 	}

 	header_size += sizeof(struct iphdr);
-	h_ip->ver_len = 0x45;
+	h_ip->version = 4;
+	h_ip->ihl = 5;
 	h_ip->ttl = 10;
-	h_ip->proto = 0x11;
-	h_ip->src = 0x;
-	h_ip->dest = 0x;
+	h_ip->protocol = 0x11;
+	h_ip->saddr = 0x;
+	h_ip->daddr = 0x;

 	header_size += sizeof(struct udphdr);
 	h_udp = (struct udphdr *)(h_ip + 1);
@@ -217,16 +206,16 @@ static void gelic_sendbuf(int msgsize)
 	int i;

 	dbg.descr.buf_size = header_size + msgsize;
-	h_ip->total_length = msgsize + sizeof(struct udphdr) +
+	h_ip->tot_len = msgsize + sizeof(struct udphdr) +
 			     sizeof(struct iphdr);
 	h_udp->len = msgsize + sizeof(struct udphdr);

-	h_ip->checksum = 0;
+	h_ip->check = 0;
 	sum = 0;
 	p = (u16 *)h_ip;
 	for (i = 0; i < 5; i++)
 		sum += *p++;
-	h_ip->checksum = ~(sum + (sum >> 16));
+	h_ip->check = ~(sum + (sum >> 16));

 	dbg.descr.dmac_cmd_status = GELIC_DESCR_DMA_CMD_NO_CHKSUM |
 				    GELIC_DESCR_TX_DMA_FRAME_TAIL;
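For readers following the 3/4 patch, here is a small user-space sketch of the two IPv4 details it touches: packing 'version' and 'ihl' into the old 'ver_len' byte, and the one's-complement header checksum. This is not the driver's code. The function names ipv4_ver_ihl() and ipv4_header_checksum() are made up for illustration, and unlike the loop quoted in the diff (which adds five 16-bit words) the sketch sums every 16-bit word covered by the IHL field, which is how the checksum is usually described in RFC 1071.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: pack version and IHL into one byte, (4, 5) -> 0x45. */
static uint8_t ipv4_ver_ihl(unsigned int version, unsigned int ihl)
{
	return (uint8_t)((version << 4) | (ihl & 0x0f));
}

/*
 * Hypothetical helper: one's-complement checksum over an IPv4 header.
 * hdr points at the raw header bytes with the checksum field (bytes 10
 * and 11) already zeroed. The header length comes from the IHL nibble.
 * The return value is the checksum as a host integer whose high byte is
 * the first byte on the wire.
 */
static uint16_t ipv4_header_checksum(const uint8_t *hdr)
{
	size_t words = (size_t)(hdr[0] & 0x0f) * 2; /* IHL counts 32-bit words */
	uint32_t sum = 0;
	size_t i;

	/* Add the header as big-endian 16-bit words, independent of host order. */
	for (i = 0; i < words; i++)
		sum += ((uint32_t)hdr[2 * i] << 8) | hdr[2 * i + 1];

	/* Fold the carries back into the low 16 bits. */
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);

	return (uint16_t)~sum;
}
```

Working on bytes rather than on a u16 pointer keeps the sketch independent of host endianness; the driver itself runs on big-endian PS3 hardware and can read the words directly.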
[PATCH 1/4] powerpc/ps3: gelic_udbg: use struct ethhdr from
Instead of defining a local version of struct ethhdr use the standard
definition from . The fields in the definition have different names:
 - dest -> h_dest
 - src -> h_source
 - type -> h_proto

Signed-off-by: Luis Henriques
---
 arch/powerpc/platforms/ps3/gelic_udbg.c | 16 ++--
 1 file changed, 6 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/platforms/ps3/gelic_udbg.c b/arch/powerpc/platforms/ps3/gelic_udbg.c
index 20b46a19a48f..ac87811e8b4e 100644
--- a/arch/powerpc/platforms/ps3/gelic_udbg.c
+++ b/arch/powerpc/platforms/ps3/gelic_udbg.c
@@ -13,6 +13,8 @@
  *
  */

+#include
+
 #include
 #include
 #include
@@ -56,12 +58,6 @@ struct debug_block {
 	u8 pkt[1520];
 } __packed;

-struct ethhdr {
-	u8 dest[6];
-	u8 src[6];
-	u16 type;
-} __packed;
-
 struct vlantag {
 	u16 vlan;
 	u16 subtype;
@@ -173,8 +169,8 @@ static void gelic_debug_init(void)

 	h_eth = (struct ethhdr *)dbg.pkt;

-	memset(&h_eth->dest, 0xff, 6);
-	memcpy(&h_eth->src, , 6);
+	memset(&h_eth->h_dest, 0xff, 6);
+	memcpy(&h_eth->h_source, , 6);

 	header_size = sizeof(struct ethhdr);

@@ -183,7 +179,7 @@ static void gelic_debug_init(void)
 			 GELIC_LV1_VLAN_TX_ETHERNET_0, 0, 0,
 			 _id, );
 	if (!result) {
-		h_eth->type = 0x8100;
+		h_eth->h_proto= 0x8100;

 		header_size += sizeof(struct vlantag);
 		h_vlan = (struct vlantag *)(h_eth + 1);
@@ -191,7 +187,7 @@ static void gelic_debug_init(void)
 		h_vlan->subtype = 0x0800;
 		h_ip = (struct iphdr *)(h_vlan + 1);
 	} else {
-		h_eth->type = 0x0800;
+		h_eth->h_proto= 0x0800;
 		h_ip = (struct iphdr *)(h_eth + 1);
 	}
Re: [PATCH 1/4] powerpc/ps3: gelic_udbg: use struct ethhdr from
On Sun, 2016-02-07 at 17:38 +, Luis Henriques wrote:
> Instead of defining a local version of struct ethhdr use the standard
> definition from .

trivia:

> diff --git a/arch/powerpc/platforms/ps3/gelic_udbg.c b/arch/powerpc/platforms/ps3/gelic_udbg.c
[]
> @@ -173,8 +169,8 @@ static void gelic_debug_init(void)
>
>  	h_eth = (struct ethhdr *)dbg.pkt;
>
> -	memset(&h_eth->dest, 0xff, 6);
> -	memcpy(&h_eth->src, , 6);
> +	memset(&h_eth->h_dest, 0xff, 6);
> +	memcpy(&h_eth->h_source, , 6);

Be nice to use ETH_ALEN and eth_broadcast_addr.
Maybe ether_addr_copy too.

> @@ -183,7 +179,7 @@ static void gelic_debug_init(void)
>  			 GELIC_LV1_VLAN_TX_ETHERNET_0, 0, 0,
>  			 _id, );
>  	if (!result) {
> -		h_eth->type = 0x8100;
> +		h_eth->h_proto= 0x8100;

ETH_P_8021Q

>  		header_size += sizeof(struct vlantag);
>  		h_vlan = (struct vlantag *)(h_eth + 1);
> @@ -191,7 +187,7 @@ static void gelic_debug_init(void)
>  		h_vlan->subtype = 0x0800;
>  		h_ip = (struct iphdr *)(h_vlan + 1);
>  	} else {
> -		h_eth->type = 0x0800;
> +		h_eth->h_proto= 0x0800;

ETH_P_IP
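What the review suggests could look roughly like the user-space sketch below. It is only an illustration under assumptions: struct ethhdr, ETH_ALEN, ETH_P_8021Q and ETH_P_IP come from the exported <linux/if_ether.h> header, while eth_broadcast_addr() and ether_addr_copy() are in-kernel helpers from <linux/etherdevice.h>, so the sketch_* functions are hypothetical stand-ins for them. fill_eth_header() and its use_vlan parameter are also invented for the example and do not exist in gelic_udbg.c. htons() is used so the sketch behaves the same on little-endian hosts; the driver stores the raw constants because the PS3 is big-endian.

```c
#include <arpa/inet.h>      /* htons(), ntohs() */
#include <linux/if_ether.h> /* struct ethhdr, ETH_ALEN, ETH_P_IP, ETH_P_8021Q */
#include <string.h>

/* Hypothetical user-space stand-in for the kernel's eth_broadcast_addr(). */
static void sketch_broadcast_addr(unsigned char *addr)
{
	memset(addr, 0xff, ETH_ALEN);
}

/* Hypothetical user-space stand-in for the kernel's ether_addr_copy(). */
static void sketch_addr_copy(unsigned char *dst, const unsigned char *src)
{
	memcpy(dst, src, ETH_ALEN);
}

/*
 * Illustrative helper, not taken from the driver: fill a broadcast
 * Ethernet header using named constants instead of 6, 0x8100 and 0x0800.
 */
static void fill_eth_header(struct ethhdr *eth, const unsigned char *mac,
			    int use_vlan)
{
	sketch_broadcast_addr(eth->h_dest);
	sketch_addr_copy(eth->h_source, mac);
	eth->h_proto = htons(use_vlan ? ETH_P_8021Q : ETH_P_IP);
}
```

The named constants carry the same values as the literals in the patch (6, 0x8100, 0x0800), so a change along these lines would be cosmetic.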
Re: [RFCv2 6/9] pseries: Add hypercall wrappers for hash page table resizing
On Thu, Feb 04, 2016 at 04:41:10PM +0530, Anshuman Khandual wrote:
> On 02/02/2016 06:28 AM, David Gibson wrote:
> > On Mon, Feb 01, 2016 at 12:41:31PM +0530, Anshuman Khandual wrote:
> >> On 01/29/2016 10:54 AM, David Gibson wrote:
> >>> This adds the hypercall numbers and wrapper functions for the hash page
> >>> table resizing hypercalls.
> >>>
> >>> These are experimental "platform specific" values for now, until we have a
> >>> formal PAPR update.
> >>>
> >>> It also adds a new firmware feature flat to track the presence of the
> >>> HPT resizing calls.
> >>
> >> Its a flag ... ^^^ here.
> >
> > Oops, thanks.
> >
> >>
> >>>
> >>> Signed-off-by: David Gibson
> >>> ---
> >>>  arch/powerpc/include/asm/firmware.h       |  5 +++--
> >>>  arch/powerpc/include/asm/hvcall.h         |  2 ++
> >>>  arch/powerpc/include/asm/plpar_wrappers.h | 12
> >>>  arch/powerpc/platforms/pseries/firmware.c |  1 +
> >>>  4 files changed, 18 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/arch/powerpc/include/asm/firmware.h b/arch/powerpc/include/asm/firmware.h
> >>> index b062924..32435d2 100644
> >>> --- a/arch/powerpc/include/asm/firmware.h
> >>> +++ b/arch/powerpc/include/asm/firmware.h
> >>> @@ -42,7 +42,7 @@
> >>>  #define FW_FEATURE_SPLPAR	ASM_CONST(0x0010)
> >>>  #define FW_FEATURE_LPAR	ASM_CONST(0x0040)
> >>>  #define FW_FEATURE_PS3_LV1	ASM_CONST(0x0080)
> >>> -/* Free			ASM_CONST(0x0100) */
> >>> +#define FW_FEATURE_HPT_RESIZE	ASM_CONST(0x0100)
> >>>  #define FW_FEATURE_CMO	ASM_CONST(0x0200)
> >>>  #define FW_FEATURE_VPHN	ASM_CONST(0x0400)
> >>>  #define FW_FEATURE_XCMO	ASM_CONST(0x0800)
> >>> @@ -66,7 +66,8 @@ enum {
> >>>  		FW_FEATURE_MULTITCE | FW_FEATURE_SPLPAR | FW_FEATURE_LPAR |
> >>>  		FW_FEATURE_CMO | FW_FEATURE_VPHN | FW_FEATURE_XCMO |
> >>>  		FW_FEATURE_SET_MODE | FW_FEATURE_BEST_ENERGY |
> >>> -		FW_FEATURE_TYPE1_AFFINITY | FW_FEATURE_PRRN,
> >>> +		FW_FEATURE_TYPE1_AFFINITY | FW_FEATURE_PRRN |
> >>> +		FW_FEATURE_HPT_RESIZE,
> >>>  	FW_FEATURE_PSERIES_ALWAYS = 0,
> >>>  	FW_FEATURE_POWERNV_POSSIBLE = FW_FEATURE_OPAL,
> >>>  	FW_FEATURE_POWERNV_ALWAYS = 0,
> >>> diff --git a/arch/powerpc/include/asm/hvcall.h b/arch/powerpc/include/asm/hvcall.h
> >>> index e3b54dd..195e080 100644
> >>> --- a/arch/powerpc/include/asm/hvcall.h
> >>> +++ b/arch/powerpc/include/asm/hvcall.h
> >>> @@ -293,6 +293,8 @@
> >>>
> >>>  /* Platform specific hcalls, used by KVM */
> >>>  #define H_RTAS			0xf000
> >>> +#define H_RESIZE_HPT_PREPARE	0xf003
> >>> +#define H_RESIZE_HPT_COMMIT	0xf004
> >>
> >> This sound better and matches FW_FEATURE_HPT_RESIZE ?
> >
> > I'm not quite sure what you're suggesting here.
> >
> >> #define H_HPT_RESIZE_PREPARE	0xf003
> >> #define H_HPT_RESIZE_COMMIT	0xf004
>
> Just little bit of change of name of the macro like this
>
> H_RESIZE_HPT_PREPARE --> H_HPT_RESIZE_PREPARE
> H_RESIZE_HPT_COMMIT  --> H_HPT_RESIZE_COMMIT

Oh, I see. Actually, I'm trying to standardize on "resize hpt" rather
than "hpt resize" everywhere.

--
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
[PATCH] powerpc/eeh: fix incorrect function name in comment
The comment block above pcibios_set_pcie_reset_state() incorrectly refers
to pcibios_set_pcie_slot_reset(). Fix the comment accordingly.

Signed-off-by: Andrew Donnellan
---
 arch/powerpc/kernel/eeh.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
index 40e4d4a..8c6005c 100644
--- a/arch/powerpc/kernel/eeh.c
+++ b/arch/powerpc/kernel/eeh.c
@@ -739,7 +739,7 @@ static void *eeh_restore_dev_state(void *data, void *userdata)
 }

 /**
- * pcibios_set_pcie_slot_reset - Set PCI-E reset state
+ * pcibios_set_pcie_reset_state - Set PCI-E reset state
  * @dev: pci device struct
  * @state: reset state to enter
  *
--
Andrew Donnellan              Software Engineer, OzLabs
andrew.donnel...@au1.ibm.com  Australia Development Lab, Canberra
+61 2 6201 8874 (work)        IBM Australia Limited
Re: [PATCH v3 13/18] cxl: sysfs support for guests
Frederic Barrat writes:
> --- a/Documentation/ABI/testing/sysfs-class-cxl
> +++ b/Documentation/ABI/testing/sysfs-class-cxl
> @@ -183,7 +183,7 @@ Description:    read only
>                  Identifies the revision level of the PSL.
>  Users:          https://github.com/ibm-capi/libcxl
>
> -What:           /sys/class/cxl//base_image
> +What:           /sys/class/cxl//base_image (not in a guest)

Is this going to be the case for KVM guest as well as PowerVM guest?

--
Stewart Smith
OPAL Architect, IBM.
Re: [PATCH] powerpc/eeh: fix incorrect function name in comment
On Mon, Feb 08, 2016 at 02:39:19PM +1100, Andrew Donnellan wrote:
>The comment block above pcibios_set_pcie_reset_state() incorrectly refers
>to pcibios_set_pcie_slot_reset(). Fix the comment accordingly.
>
>Signed-off-by: Andrew Donnellan

Acked-by: Gavin Shan

Thanks,
Gavin

>---
> arch/powerpc/kernel/eeh.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
>diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
>index 40e4d4a..8c6005c 100644
>--- a/arch/powerpc/kernel/eeh.c
>+++ b/arch/powerpc/kernel/eeh.c
>@@ -739,7 +739,7 @@ static void *eeh_restore_dev_state(void *data, void *userdata)
> }
>
> /**
>- * pcibios_set_pcie_slot_reset - Set PCI-E reset state
>+ * pcibios_set_pcie_reset_state - Set PCI-E reset state
>  * @dev: pci device struct
>  * @state: reset state to enter
>  *
>--
>Andrew Donnellan              Software Engineer, OzLabs
>andrew.donnel...@au1.ibm.com  Australia Development Lab, Canberra
>+61 2 6201 8874 (work)        IBM Australia Limited
[PATCH V3] powerpc/powernv: Remove support for p5ioc2
"p5ioc2 is used by approximately 2 machines in the world, and has never ever been a supported configuration." The code for p5ioc2 is essentially unused and complicates what is already a very complicated codebase. Its removal is essentially a "free win" in the effort to simplify the powernv PCI code. In addition, support for p5ioc2 has been dropped from skiboot. There's no reason to keep it around in the kernel. Signed-off-by: Russell Currey--- V3: Remove now-useless variable "found_ioda" thanks to Gavin V2: Remove pointless union and rebase on -next thanks to Andrew Tested on a P7IOC machine and a PHB3 machine. Skiboot p5ioc2 removal patch: https://patchwork.ozlabs.org/patch/544898/ --- arch/powerpc/platforms/powernv/Makefile | 2 +- arch/powerpc/platforms/powernv/pci-p5ioc2.c | 271 arch/powerpc/platforms/powernv/pci.c| 17 +- arch/powerpc/platforms/powernv/pci.h| 152 4 files changed, 74 insertions(+), 368 deletions(-) delete mode 100644 arch/powerpc/platforms/powernv/pci-p5ioc2.c diff --git a/arch/powerpc/platforms/powernv/Makefile b/arch/powerpc/platforms/powernv/Makefile index f1516b5..cd9711e 100644 --- a/arch/powerpc/platforms/powernv/Makefile +++ b/arch/powerpc/platforms/powernv/Makefile @@ -5,7 +5,7 @@ obj-y += opal-msglog.o opal-hmi.o opal-power.o opal-irqchip.o obj-y += opal-kmsg.o obj-$(CONFIG_SMP) += smp.o subcore.o subcore-asm.o -obj-$(CONFIG_PCI) += pci.o pci-p5ioc2.o pci-ioda.o npu-dma.o +obj-$(CONFIG_PCI) += pci.o pci-ioda.o npu-dma.o obj-$(CONFIG_EEH) += eeh-powernv.o obj-$(CONFIG_PPC_SCOM) += opal-xscom.o obj-$(CONFIG_MEMORY_FAILURE) += opal-memory-errors.o diff --git a/arch/powerpc/platforms/powernv/pci-p5ioc2.c b/arch/powerpc/platforms/powernv/pci-p5ioc2.c deleted file mode 100644 index f2bdfea..000 --- a/arch/powerpc/platforms/powernv/pci-p5ioc2.c +++ /dev/null @@ -1,271 +0,0 @@ -/* - * Support PCI/PCIe on PowerNV platforms - * - * Currently supports only P5IOC2 - * - * Copyright 2011 Benjamin Herrenschmidt, IBM Corp. 
- * - * This program is free software; you can redistribute it and/or - * modify it under the terms of the GNU General Public License - * as published by the Free Software Foundation; either version - * 2 of the License, or (at your option) any later version. - */ - -#include -#include -#include -#include -#include -#include -#include -#include -#include - -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include - -#include "powernv.h" -#include "pci.h" - -/* For now, use a fixed amount of TCE memory for each p5ioc2 - * hub, 16M will do - */ -#define P5IOC2_TCE_MEMORY 0x0100 - -#ifdef CONFIG_PCI_MSI -static int pnv_pci_p5ioc2_msi_setup(struct pnv_phb *phb, struct pci_dev *dev, - unsigned int hwirq, unsigned int virq, - unsigned int is_64, struct msi_msg *msg) -{ - if (WARN_ON(!is_64)) - return -ENXIO; - msg->data = hwirq - phb->msi_base; - msg->address_hi = 0x1000; - msg->address_lo = 0; - - return 0; -} - -static void pnv_pci_init_p5ioc2_msis(struct pnv_phb *phb) -{ - unsigned int count; - const __be32 *prop = of_get_property(phb->hose->dn, -"ibm,opal-msi-ranges", NULL); - if (!prop) - return; - - /* Don't do MSI's on p5ioc2 PCI-X are they are not properly -* verified in HW -*/ - if (of_device_is_compatible(phb->hose->dn, "ibm,p5ioc2-pcix")) - return; - phb->msi_base = be32_to_cpup(prop); - count = be32_to_cpup(prop + 1); - if (msi_bitmap_alloc(>msi_bmp, count, phb->hose->dn)) { - pr_err("PCI %d: Failed to allocate MSI bitmap !\n", - phb->hose->global_number); - return; - } - phb->msi_setup = pnv_pci_p5ioc2_msi_setup; - phb->msi32_support = 0; - pr_info(" Allocated bitmap for %d MSIs (base IRQ 0x%x)\n", - count, phb->msi_base); -} -#else -static void pnv_pci_init_p5ioc2_msis(struct pnv_phb *phb) { } -#endif /* CONFIG_PCI_MSI */ - -static struct iommu_table_ops pnv_p5ioc2_iommu_ops = { - .set = pnv_tce_build, -#ifdef CONFIG_IOMMU_API - .exchange = pnv_tce_xchg, -#endif - .clear = pnv_tce_free, - .get = pnv_tce_get, -}; - 
-static void pnv_pci_p5ioc2_dma_dev_setup(struct pnv_phb *phb, -struct pci_dev *pdev) -{ - struct iommu_table *tbl = phb->p5ioc2.table_group.tables[0]; - - if (!tbl->it_map) { - tbl->it_ops = _p5ioc2_iommu_ops; - iommu_init_table(tbl, phb->hose->node); - iommu_register_group(>p5ioc2.table_group, - pci_domain_nr(phb->hose->bus),
Re: [PATCH V3] powerpc/powernv: Remove support for p5ioc2
On Mon, Feb 08, 2016 at 03:08:20PM +1100, Russell Currey wrote: >"p5ioc2 is used by approximately 2 machines in the world, and has never >ever been a supported configuration." > >The code for p5ioc2 is essentially unused and complicates what is already >a very complicated codebase. Its removal is essentially a "free win" in >the effort to simplify the powernv PCI code. > >In addition, support for p5ioc2 has been dropped from skiboot. There's no >reason to keep it around in the kernel. > >Signed-off-by: Russell CurreyAcked-by: Gavin Shan Note: I plan to rebase my next hotplug patchset revision (v8) on top of this one. Thanks, Gavin >--- >V3: Remove now-useless variable "found_ioda" thanks to Gavin >V2: Remove pointless union and rebase on -next thanks to Andrew > >Tested on a P7IOC machine and a PHB3 machine. > >Skiboot p5ioc2 removal patch: https://patchwork.ozlabs.org/patch/544898/ >--- > arch/powerpc/platforms/powernv/Makefile | 2 +- > arch/powerpc/platforms/powernv/pci-p5ioc2.c | 271 > arch/powerpc/platforms/powernv/pci.c| 17 +- > arch/powerpc/platforms/powernv/pci.h| 152 > 4 files changed, 74 insertions(+), 368 deletions(-) > delete mode 100644 arch/powerpc/platforms/powernv/pci-p5ioc2.c > >diff --git a/arch/powerpc/platforms/powernv/Makefile >b/arch/powerpc/platforms/powernv/Makefile >index f1516b5..cd9711e 100644 >--- a/arch/powerpc/platforms/powernv/Makefile >+++ b/arch/powerpc/platforms/powernv/Makefile >@@ -5,7 +5,7 @@ obj-y += opal-msglog.o opal-hmi.o >opal-power.o opal-irqchip.o > obj-y += opal-kmsg.o > > obj-$(CONFIG_SMP) += smp.o subcore.o subcore-asm.o >-obj-$(CONFIG_PCI) += pci.o pci-p5ioc2.o pci-ioda.o npu-dma.o >+obj-$(CONFIG_PCI) += pci.o pci-ioda.o npu-dma.o > obj-$(CONFIG_EEH) += eeh-powernv.o > obj-$(CONFIG_PPC_SCOM)+= opal-xscom.o > obj-$(CONFIG_MEMORY_FAILURE) += opal-memory-errors.o >diff --git a/arch/powerpc/platforms/powernv/pci-p5ioc2.c >b/arch/powerpc/platforms/powernv/pci-p5ioc2.c >deleted file mode 100644 >index f2bdfea..000 >--- 
a/arch/powerpc/platforms/powernv/pci-p5ioc2.c >+++ /dev/null >@@ -1,271 +0,0 @@ >-/* >- * Support PCI/PCIe on PowerNV platforms >- * >- * Currently supports only P5IOC2 >- * >- * Copyright 2011 Benjamin Herrenschmidt, IBM Corp. >- * >- * This program is free software; you can redistribute it and/or >- * modify it under the terms of the GNU General Public License >- * as published by the Free Software Foundation; either version >- * 2 of the License, or (at your option) any later version. >- */ >- >-#include >-#include >-#include >-#include >-#include >-#include >-#include >-#include >-#include >- >-#include >-#include >-#include >-#include >-#include >-#include >-#include >-#include >-#include >-#include >- >-#include "powernv.h" >-#include "pci.h" >- >-/* For now, use a fixed amount of TCE memory for each p5ioc2 >- * hub, 16M will do >- */ >-#define P5IOC2_TCE_MEMORY 0x0100 >- >-#ifdef CONFIG_PCI_MSI >-static int pnv_pci_p5ioc2_msi_setup(struct pnv_phb *phb, struct pci_dev *dev, >- unsigned int hwirq, unsigned int virq, >- unsigned int is_64, struct msi_msg *msg) >-{ >- if (WARN_ON(!is_64)) >- return -ENXIO; >- msg->data = hwirq - phb->msi_base; >- msg->address_hi = 0x1000; >- msg->address_lo = 0; >- >- return 0; >-} >- >-static void pnv_pci_init_p5ioc2_msis(struct pnv_phb *phb) >-{ >- unsigned int count; >- const __be32 *prop = of_get_property(phb->hose->dn, >- "ibm,opal-msi-ranges", NULL); >- if (!prop) >- return; >- >- /* Don't do MSI's on p5ioc2 PCI-X are they are not properly >- * verified in HW >- */ >- if (of_device_is_compatible(phb->hose->dn, "ibm,p5ioc2-pcix")) >- return; >- phb->msi_base = be32_to_cpup(prop); >- count = be32_to_cpup(prop + 1); >- if (msi_bitmap_alloc(>msi_bmp, count, phb->hose->dn)) { >- pr_err("PCI %d: Failed to allocate MSI bitmap !\n", >- phb->hose->global_number); >- return; >- } >- phb->msi_setup = pnv_pci_p5ioc2_msi_setup; >- phb->msi32_support = 0; >- pr_info(" Allocated bitmap for %d MSIs (base IRQ 0x%x)\n", >- count, 
phb->msi_base); >-} >-#else >-static void pnv_pci_init_p5ioc2_msis(struct pnv_phb *phb) { } >-#endif /* CONFIG_PCI_MSI */ >- >-static struct iommu_table_ops pnv_p5ioc2_iommu_ops = { >- .set = pnv_tce_build, >-#ifdef CONFIG_IOMMU_API >- .exchange = pnv_tce_xchg, >-#endif >- .clear = pnv_tce_free, >- .get = pnv_tce_get, >-}; >- >-static void pnv_pci_p5ioc2_dma_dev_setup(struct pnv_phb *phb, >- struct pci_dev *pdev) >-{ >-
[PATCH 2/2] powerpc/eeh: Reworked eeh_pe_bus_get()
The original implementation is ugly: unnecessary if statements and an
"out" tag. This reworks the function to avoid the above weaknesses. No
functional changes introduced.

Signed-off-by: Gavin Shan
---
 arch/powerpc/kernel/eeh_pe.c | 28 ++++++++++++----------------
 1 file changed, 12 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/kernel/eeh_pe.c b/arch/powerpc/kernel/eeh_pe.c
index 8654cb1..1d64e60 100644
--- a/arch/powerpc/kernel/eeh_pe.c
+++ b/arch/powerpc/kernel/eeh_pe.c
@@ -923,25 +923,21 @@ out:
  */
 struct pci_bus *eeh_pe_bus_get(struct eeh_pe *pe)
 {
-	struct pci_bus *bus = NULL;
 	struct eeh_dev *edev;
 	struct pci_dev *pdev;
 
-	if (pe->type & EEH_PE_PHB) {
-		bus = pe->phb->bus;
-	} else if (pe->type & EEH_PE_BUS ||
-		   pe->type & EEH_PE_DEVICE) {
-		if (pe->bus) {
-			bus = pe->bus;
-			goto out;
-		}
+	if (pe->type & EEH_PE_PHB)
+		return pe->phb->bus;
 
-		edev = list_first_entry(&pe->edevs, struct eeh_dev, list);
-		pdev = eeh_dev_to_pci_dev(edev);
-		if (pdev)
-			bus = pdev->bus;
-	}
+	/* The primary bus might be cached during probe time */
+	if (pe->bus)
+		return pe->bus;
 
-out:
-	return bus;
+	/* Retrieve the parent PCI bus of first (top) PCI device */
+	edev = list_first_entry_or_null(&pe->edevs, struct eeh_dev, list);
+	pdev = eeh_dev_to_pci_dev(edev);
+	if (pdev)
+		return pdev->bus;
+
+	return NULL;
 }
-- 
2.1.0
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev
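The early-return shape of the reworked eeh_pe_bus_get() can be tried
outside the kernel with a small model. This is only a sketch: every
demo_* name and DEMO_PE_PHB is hypothetical, and the first_dev pointer
stands in for the first entry of the PE's device list (NULL when the
list is empty). It is not the kernel's data layout.

```c
#include <stddef.h>

#define DEMO_PE_PHB 0x1	/* hypothetical flag, modeled on EEH_PE_PHB */

struct demo_bus { int number; };
struct demo_dev { struct demo_bus *bus; };

/* Hypothetical stand-in for struct eeh_pe, reduced to what the lookup needs. */
struct demo_pe {
	int type;
	struct demo_bus *phb_bus;	/* root bus of the owning PHB */
	struct demo_bus *bus;		/* bus cached at probe time, may be NULL */
	struct demo_dev *first_dev;	/* first device, NULL if there is none */
};

/* Return the bus for a PE, taking the cheapest answer first. */
struct demo_bus *demo_pe_bus_get(const struct demo_pe *pe)
{
	if (pe->type & DEMO_PE_PHB)
		return pe->phb_bus;

	/* The primary bus might have been cached earlier */
	if (pe->bus)
		return pe->bus;

	/* Fall back to the parent bus of the first device, if any */
	if (pe->first_dev)
		return pe->first_dev->bus;

	return NULL;
}
```

Each case returns as soon as it has an answer, so no local "bus"
variable or "out" label is needed.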
Re: [RFCv2 1/9] memblock: Don't mark memblock_phys_mem_size() as __init
On Fri, Jan 29, 2016 at 04:23:55PM +1100, David Gibson wrote:
> At the moment memblock_phys_mem_size() is marked as __init, and so is
> discarded after boot. This is different from most of the memblock
> functions which are marked __init_memblock, and are only discarded after
> boot if memory hotplug is not configured.
>
> To allow for upcoming code which will need memblock_phys_mem_size() in the
> hotplug path, change it from __init to __init_memblock.
>
> Signed-off-by: David Gibson

Reviewed-by: Paul Mackerras
Re: [RFCv2 2/9] arch/powerpc: Clean up error handling for htab_remove_mapping
On Fri, Jan 29, 2016 at 04:23:56PM +1100, David Gibson wrote:
> Currently, the only error that htab_remove_mapping() can report is -EINVAL,
> if removal of bolted HPTEs isn't implemented for this platform. We make
> a few clean ups to the handling of this:
>
>  * EINVAL isn't really the right code - there's nothing wrong with the
>    function's arguments - use ENODEV instead
>  * We were also printing a warning message, but that's a decision better
>    left up to the callers, so remove it
>  * One caller is vmemmap_remove_mapping(), which will just BUG_ON() on
>    error, making the warning message irrelevant, so no change is needed
>    there.
>  * The other caller is remove_section_mapping(). This is called in the
>    memory hot remove path at a point after vmemmap_remove_mapping() so
>    if hpte_removebolted isn't implemented, we'd expect to have already
>    BUG()ed anyway. Put a WARN_ON() here, in lieu of a printk() since this
>    really shouldn't be happening.
>
> Signed-off-by: David Gibson

Reviewed-by: Paul Mackerras
Re: [RFCv2 5/9] arch/powerpc: Split hash page table sizing heuristic into a helper
On Thu, Feb 04, 2016 at 04:26:20PM +0530, Anshuman Khandual wrote:
> On 02/02/2016 06:34 AM, David Gibson wrote:
> > On Mon, Feb 01, 2016 at 12:34:32PM +0530, Anshuman Khandual wrote:
> >> On 01/29/2016 10:53 AM, David Gibson wrote:
> >>> htab_get_table_size() either retrieve the size of the hash page table (HPT)
> >>> from the device tree - if the HPT size is determined by firmware - or
> >>> uses a heuristic to determine a good size based on RAM size if the kernel
> >>> is responsible for allocating the HPT.
> >>>
> >>> To support a PAPR extension allowing resizing of the HPT, we're going to
> >>> want the memory size -> HPT size logic elsewhere, so split it out into a
> >>> helper function.
> >>>
> >>> Signed-off-by: David Gibson
> >>> ---
> >>>  arch/powerpc/include/asm/mmu-hash64.h |  3 +++
> >>>  arch/powerpc/mm/hash_utils_64.c       | 30 +++++++++++++++++-------------
> >>>  2 files changed, 20 insertions(+), 13 deletions(-)
> >>>
> >>> diff --git a/arch/powerpc/include/asm/mmu-hash64.h b/arch/powerpc/include/asm/mmu-hash64.h
> >>> index 7352d3f..cf070fd 100644
> >>> --- a/arch/powerpc/include/asm/mmu-hash64.h
> >>> +++ b/arch/powerpc/include/asm/mmu-hash64.h
> >>> @@ -607,6 +607,9 @@ static inline unsigned long get_kernel_vsid(unsigned long ea, int ssize)
> >>>  	context = (MAX_USER_CONTEXT) + ((ea >> 60) - 0xc) + 1;
> >>>  	return get_vsid(context, ea, ssize);
> >>>  }
> >>> +
> >>> +unsigned htab_shift_for_mem_size(unsigned long mem_size);
> >>> +
> >>>  #endif /* __ASSEMBLY__ */
> >>>
> >>>  #endif /* _ASM_POWERPC_MMU_HASH64_H_ */
> >>> diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
> >>> index e88a86e..d63f7dc 100644
> >>> --- a/arch/powerpc/mm/hash_utils_64.c
> >>> +++ b/arch/powerpc/mm/hash_utils_64.c
> >>> @@ -606,10 +606,24 @@ static int __init htab_dt_scan_pftsize(unsigned long node,
> >>>  	return 0;
> >>>  }
> >>>
> >>> -static unsigned long __init htab_get_table_size(void)
> >>> +unsigned htab_shift_for_mem_size(unsigned long mem_size)
> >>>  {
> >>> -	unsigned long mem_size, rnd_mem_size, pteg_count, psize;
> >>> +	unsigned memshift = __ilog2(mem_size);
> >>> +	unsigned pshift = mmu_psize_defs[mmu_virtual_psize].shift;
> >>> +	unsigned pteg_shift;
> >>> +
> >>> +	/* round mem_size up to next power of 2 */
> >>> +	if ((1UL << memshift) < mem_size)
> >>> +		memshift += 1;
> >>> +
> >>> +	/* aim for 2 pages / pteg */
> >>
> >> While here I guess its a good opportunity to write couple of lines
> >> about why one PTE group for every two physical pages on the system,
> >
> > Well, that don't really know, it's just copied from the existing code.
>
> Aneesh, would you know why ?

1 PTEG per 2 pages means 4 HPTEs per page, which means you can map
each page to an average of 4 different virtual addresses. It's a
heuristic that has been around for a long time and dates back to the
early days of AIX. For Linux, running on machines which typically
have quite a lot of memory, it's probably overkill.

> >
> >> why minimum (1UL << 11 = 2048) number of PTE groups required,
>
> Aneesh, would you know why ?

It's in the architecture, which specifies the minimum size of the HPT
as 256kB. The reason is because not all of the virtual address bits
are present in the HPT. That's OK because some of the virtual address
bits are implied by the HPTEG index in the hash table. If the HPT was
less than 256kB (2048 HPTEGs) there would be the possibility of
collisions where two different virtual addresses could hash to the
same HPTEG and their HPTEs would be impossible to tell apart.

> >
> > Ok.
> >
> >> why
> >> (1U << 7 = 128) entries per PTE group
> >
> > Um.. what? Because that's how big a PTEG is, I don't think
> > re-explaining the HPT structure here is useful.
>
> Agreed, though think some where these things should be macros not used
> as hard coded numbers like this.

Using symbols instead of constant numbers is not always clearer. The
symbol name can give some context (but so can a suitable comment) but
has the cost of obscuring the actual numeric value.

Paul.
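The sizing rule discussed in this thread can be tried as a standalone
sketch. Assumptions are flagged here and in the comments: the function
name hpt_shift_for_mem_size() and the explicit page_shift parameter are
hypothetical (the patch reads the page shift from mmu_psize_defs), and
the final clamp is inferred from the discussion (at least 2^11 PTEGs of
2^7 bytes each, so 2^18 bytes) because the tail of the function is not
shown in the quoted hunk.

```c
#include <assert.h>

/* Floor of log2(x) for x > 0; stands in for the kernel's __ilog2(). */
static unsigned ilog2_floor(unsigned long long x)
{
	unsigned shift = 0;

	while (x >>= 1)
		shift++;
	return shift;
}

/*
 * Return log2 of the HPT size in bytes for a given RAM size.
 * Hypothetical helper, not the kernel function.
 */
unsigned hpt_shift_for_mem_size(unsigned long long mem_size,
				unsigned page_shift)
{
	unsigned memshift;

	assert(mem_size > 0);

	/* round mem_size up to the next power of 2 */
	memshift = ilog2_floor(mem_size);
	if ((1ULL << memshift) < mem_size)
		memshift += 1;

	/*
	 * One PTEG per 2 pages gives memshift - (page_shift + 1) as the
	 * PTEG count shift. Anything below 2^11 PTEGs is clamped to the
	 * architected minimum of 2^18 bytes (checked first so the
	 * unsigned subtraction cannot wrap).
	 */
	if (memshift < page_shift + 1 + 11)
		return 18;

	/* each PTEG is 2^7 = 128 bytes */
	return memshift - (page_shift + 1) + 7;
}
```

With 4K pages (page_shift 12) and 1GB of RAM this gives a shift of 24,
a 16MB table, which is 1/64 of RAM.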
Re: [RFCv2 7/9] pseries: Add support for hash table resizing
On Fri, Jan 29, 2016 at 04:24:01PM +1100, David Gibson wrote:
> This adds support for using experimental hypercalls to change the size
> of the main hash page table while running as a PAPR guest. For now these
> hypercalls are only in experimental qemu versions.
>
> The interface is two part: first H_RESIZE_HPT_PREPARE is used to allocate
> and prepare the new hash table. This may be slow, but can be done
> asynchronously. Then, H_RESIZE_HPT_COMMIT is used to switch to the new
> hash table. This requires that no CPUs be concurrently updating the HPT,
> and so must be run under stop_machine().
>
> This also adds a debugfs file which can be used to manually control
> HPT resizing for testing purposes.
>
> Signed-off-by: David Gibson

Reviewed-by: Paul Mackerras
Re: [RFCv2 6/9] pseries: Add hypercall wrappers for hash page table resizing
On Fri, Jan 29, 2016 at 04:24:00PM +1100, David Gibson wrote:
> This adds the hypercall numbers and wrapper functions for the hash page
> table resizing hypercalls.
>
> These are experimental "platform specific" values for now, until we have a
> formal PAPR update.
>
> It also adds a new firmware feature flag to track the presence of the
> HPT resizing calls.
>
> Signed-off-by: David Gibson

Reviewed-by: Paul Mackerras
Re: [RFCv2 9/9] pseries: Automatically resize HPT for memory hot add/remove
On Fri, Jan 29, 2016 at 04:24:03PM +1100, David Gibson wrote:
> We've now implemented code in the pseries platform to use the new PAPR
> interface to allow resizing the hash page table (HPT) at runtime.
>
> This patch uses that interface to automatically attempt to resize the HPT
> when memory is hot added or removed. This tries to always keep the HPT at
> a reasonable size for our current memory size.
>
> Signed-off-by: David Gibson

Reviewed-by: Paul Mackerras
[PATCH V2] powerpc/mm: Fix Multi hit ERAT cause by recent THP update
With ppc64 we use the deposited pgtable_t to store the hash pte slot
information. We should not withdraw the deposited pgtable_t without
marking the pmd none. This ensures that low level hash fault handling
will skip this huge pte and we will handle them at upper levels.

Recent change to pmd splitting changed the above in order to handle the
race between pmd split and exit_mmap. The race is explained below.

Consider following race:

        CPU0                            CPU1
shrink_page_list()
  add_to_swap()
    split_huge_page_to_list()
      __split_huge_pmd_locked()
        pmdp_huge_clear_flush_notify()
        // pmd_none() == true
                                        exit_mmap()
                                          unmap_vmas()
                                            zap_pmd_range()
                                              // no action on pmd since
                                              // pmd_none() == true
        pmd_populate()

As result the THP will not be freed. The leak is detected by check_mm():

        BUG: Bad rss-counter state mm:880058d2e580 idx:1 val:512

The above required us to not mark pmd none during a pmd split.

The fix for ppc is to clear the huge pte of _PAGE_USER, so that low
level fault handling code skips this pte. At higher level we do take ptl
lock. That should serialize us against the pmd split. Once the lock is
acquired we do check the pmd again using pmd_same. That should always
return false for us and hence we should retry the access.

Also make sure we wait for irq disable section in other cpus to finish
before flipping a huge pte entry with a regular pmd entry. Code paths
like find_linux_pte_or_hugepte depend on irq disable to get
a stable pte_t pointer. A parallel thp split needs to make sure we
don't convert a pmd pte to a regular pmd entry without waiting for the
irq disable section to finish.
Signed-off-by: Aneesh Kumar K.V--- arch/powerpc/include/asm/book3s/64/pgtable.h | 4 arch/powerpc/mm/pgtable_64.c | 35 +++- include/asm-generic/pgtable.h| 8 +++ mm/huge_memory.c | 1 + 4 files changed, 47 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h index 8d1c41d28318..0415856941e0 100644 --- a/arch/powerpc/include/asm/book3s/64/pgtable.h +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h @@ -281,6 +281,10 @@ extern pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp); extern void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address, pmd_t *pmdp); +#define __HAVE_ARCH_PMDP_HUGE_SPLITTING_FLUSH +extern void pmdp_huge_splitting_flush(struct vm_area_struct *vma, + unsigned long address, pmd_t *pmdp); + #define pmd_move_must_withdraw pmd_move_must_withdraw struct spinlock; static inline int pmd_move_must_withdraw(struct spinlock *new_pmd_ptl, diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c index 3124a20d0fab..e8214b7f2210 100644 --- a/arch/powerpc/mm/pgtable_64.c +++ b/arch/powerpc/mm/pgtable_64.c @@ -646,6 +646,30 @@ pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp) return pgtable; } +void pmdp_huge_splitting_flush(struct vm_area_struct *vma, + unsigned long address, pmd_t *pmdp) +{ + VM_BUG_ON(address & ~HPAGE_PMD_MASK); + +#ifdef CONFIG_DEBUG_VM + BUG_ON(REGION_ID(address) != USER_REGION_ID); +#endif + /* +* We can't mark the pmd none here, because that will cause a race +* against exit_mmap. We need to continue mark pmd TRANS HUGE, while +* we spilt, but at the same time we wan't rest of the ppc64 code +* not to insert hash pte on this, because we will be modifying +* the deposited pgtable in the caller of this function. Hence +* clear the _PAGE_USER so that we move the fault handling to +* higher level function and that will serialize against ptl. 
+* We need to flush existing hash pte entries here even though, +* the translation is still valid, because we will withdraw +* pgtable_t after this. +*/ + pmd_hugepage_update(vma->vm_mm, address, pmdp, _PAGE_USER, 0); +} + + /* * set a new huge pmd. We should not be called for updating * an existing pmd entry. That should go via pmd_hugepage_update. @@ -663,10 +687,19 @@ void set_pmd_at(struct mm_struct *mm, unsigned long addr, return set_pte_at(mm, addr, pmdp_ptep(pmdp), pmd_pte(pmd)); } +/* + * We use this to invalidate a pmdp entry before switching from a + * hugepte to regular pmd entry. + */ void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address, pmd_t *pmdp) { - pmd_hugepage_update(vma->vm_mm, address, pmdp, _PAGE_PRESENT, 0); +
Re: [PATCH V3] powerpc/powernv: Remove support for p5ioc2
Russell Currey writes:
> "p5ioc2 is used by approximately 2 machines in the world, and has never
> ever been a supported configuration."
>
> The code for p5ioc2 is essentially unused and complicates what is already
> a very complicated codebase. Its removal is essentially a "free win" in
> the effort to simplify the powernv PCI code.
>
> In addition, support for p5ioc2 has been dropped from skiboot. There's no
> reason to keep it around in the kernel.
>
> Signed-off-by: Russell Currey

Yep, it's gone from firmware and there was only ever a handful of
machines inside development labs inside IBM that had it. We may still
have one in the lab, but I agree - it's not worth maintaining it.

Acked-by: Stewart Smith

-- 
Stewart Smith
OPAL Architect, IBM.
[PATCH 1/2] powerpc/powernv: Simplify definitions of EEH debugfs handlers
The EEH debugfs handlers have the same prototype. This introduces a macro
to define them and so simplifies the code. No logical changes.

Signed-off-by: Gavin Shan
---
 arch/powerpc/platforms/powernv/eeh-powernv.c | 60 ++--
 1 file changed, 22 insertions(+), 38 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c b/arch/powerpc/platforms/powernv/eeh-powernv.c
index 5f152b9..3f1cb35 100644
--- a/arch/powerpc/platforms/powernv/eeh-powernv.c
+++ b/arch/powerpc/platforms/powernv/eeh-powernv.c
@@ -167,42 +167,26 @@ static int pnv_eeh_dbgfs_get(void *data, int offset, u64 *val)
 	return 0;
 }
 
-static int pnv_eeh_outb_dbgfs_set(void *data, u64 val)
-{
-	return pnv_eeh_dbgfs_set(data, 0xD10, val);
-}
-
-static int pnv_eeh_outb_dbgfs_get(void *data, u64 *val)
-{
-	return pnv_eeh_dbgfs_get(data, 0xD10, val);
-}
-
-static int pnv_eeh_inbA_dbgfs_set(void *data, u64 val)
-{
-	return pnv_eeh_dbgfs_set(data, 0xD90, val);
-}
-
-static int pnv_eeh_inbA_dbgfs_get(void *data, u64 *val)
-{
-	return pnv_eeh_dbgfs_get(data, 0xD90, val);
-}
-
-static int pnv_eeh_inbB_dbgfs_set(void *data, u64 val)
-{
-	return pnv_eeh_dbgfs_set(data, 0xE10, val);
-}
-
-static int pnv_eeh_inbB_dbgfs_get(void *data, u64 *val)
-{
-	return pnv_eeh_dbgfs_get(data, 0xE10, val);
-}
+#define PNV_EEH_DBGFS_ENTRY(name, reg)				\
+static int pnv_eeh_dbgfs_set_##name(void *data, u64 val)	\
+{								\
+	return pnv_eeh_dbgfs_set(data, reg, val);		\
+}								\
+								\
+static int pnv_eeh_dbgfs_get_##name(void *data, u64 *val)	\
+{								\
+	return pnv_eeh_dbgfs_get(data, reg, val);		\
+}								\
+								\
+DEFINE_SIMPLE_ATTRIBUTE(pnv_eeh_dbgfs_ops_##name,		\
+			pnv_eeh_dbgfs_get_##name,		\
+			pnv_eeh_dbgfs_set_##name,		\
+			"0x%llx\n")
+
+PNV_EEH_DBGFS_ENTRY(outb, 0xD10);
+PNV_EEH_DBGFS_ENTRY(inbA, 0xD90);
+PNV_EEH_DBGFS_ENTRY(inbB, 0xE10);
 
-DEFINE_SIMPLE_ATTRIBUTE(pnv_eeh_outb_dbgfs_ops, pnv_eeh_outb_dbgfs_get,
-			pnv_eeh_outb_dbgfs_set, "0x%llx\n");
-DEFINE_SIMPLE_ATTRIBUTE(pnv_eeh_inbA_dbgfs_ops, pnv_eeh_inbA_dbgfs_get,
-			pnv_eeh_inbA_dbgfs_set, "0x%llx\n");
-DEFINE_SIMPLE_ATTRIBUTE(pnv_eeh_inbB_dbgfs_ops, pnv_eeh_inbB_dbgfs_get,
-			pnv_eeh_inbB_dbgfs_set, "0x%llx\n");
 #endif /* CONFIG_DEBUG_FS */
 
 /**
@@ -268,13 +252,13 @@ static int pnv_eeh_post_init(void)
 
 		debugfs_create_file("err_injct_outbound", 0600,
 				    phb->dbgfs, hose,
-				    &pnv_eeh_outb_dbgfs_ops);
+				    &pnv_eeh_dbgfs_ops_outb);
 		debugfs_create_file("err_injct_inboundA", 0600,
 				    phb->dbgfs, hose,
-				    &pnv_eeh_inbA_dbgfs_ops);
+				    &pnv_eeh_dbgfs_ops_inbA);
 		debugfs_create_file("err_injct_inboundB", 0600,
 				    phb->dbgfs, hose,
-				    &pnv_eeh_inbB_dbgfs_ops);
+				    &pnv_eeh_dbgfs_ops_inbB);
 #endif /* CONFIG_DEBUG_FS */
 	}
 
-- 
2.1.0
Re: [RFCv2 3/9] arch/powerpc: Handle removing maybe-present bolted HPTEs
On Fri, Jan 29, 2016 at 04:23:57PM +1100, David Gibson wrote:
> At the moment the hpte_removebolted callback in ppc_md returns void and
> will BUG_ON() if the hpte it's asked to remove doesn't exist in the first
> place. This is awkward for the case of cleaning up a mapping which was
> partially made before failing.
>
> So, we add a return value to hpte_removebolted, and have it return ENOENT
> in the case that the HPTE to remove didn't exist in the first place.
>
> In the (sole) caller, we propagate errors in hpte_removebolted to its
> caller to handle. However, we handle ENOENT specially, continuing to
> complete the unmapping over the specified range before returning the error
> to the caller.
>
> This means that htab_remove_mapping() will work sanely on a partially
> present mapping, removing any HPTEs which are present, while also returning
> ENOENT to its caller in case it's important there.
>
> There are two callers of htab_remove_mapping():
>    - In remove_section_mapping() we already WARN_ON() any error return,
>      which is reasonable - in this case the mapping should be fully
>      present
>    - In vmemmap_remove_mapping() we BUG_ON() any error. We change that to
>      just a WARN_ON() in the case of ENOENT, since failing to remove a
>      mapping that wasn't there in the first place probably shouldn't be
>      fatal.
>
> Signed-off-by: David Gibson

[snip]

> --- a/arch/powerpc/mm/hash_utils_64.c
> +++ b/arch/powerpc/mm/hash_utils_64.c
> @@ -269,6 +269,7 @@ int htab_remove_mapping(unsigned long vstart, unsigned long vend,
>  {
>  	unsigned long vaddr;
>  	unsigned int step, shift;
> +	int rc = 0;
>
>  	shift = mmu_psize_defs[psize].shift;
>  	step = 1 << shift;
> @@ -276,10 +277,13 @@ int htab_remove_mapping(unsigned long vstart, unsigned long vend,
>  	if (!ppc_md.hpte_removebolted)
>  		return -ENODEV;
>
> -	for (vaddr = vstart; vaddr < vend; vaddr += step)
> -		ppc_md.hpte_removebolted(vaddr, psize, ssize);
> +	for (vaddr = vstart; vaddr < vend; vaddr += step) {
> +		rc = ppc_md.hpte_removebolted(vaddr, psize, ssize);
> +		if ((rc < 0) && (rc != -ENOENT))
> +			return rc;
> +	}
>
> -	return 0;
> +	return rc;

This will return the rc from the last hpte_removebolted call, which
might be 0 even if earlier calls had returned -ENOENT. Or, if the
last call fails with -ENOENT, this will return -ENOENT. Is that
exactly what you meant?

In the case where some calls to hpte_removebolted return -ENOENT, I
would think we would want a consistent return value, which could be
either 0 or -ENOENT, but it shouldn't depend on which specific calls
fail with -ENOENT, in my opinion.

Paul.
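One way to get the consistent return value asked for in this review is
to remember -ENOENT in a separate variable instead of returning the
last call's rc. The sketch below is a standalone model, not the kernel
code: demo_remove_range() and the demo_* callbacks are hypothetical, and
the callback takes only an address where ppc_md.hpte_removebolted also
takes psize and ssize. Choosing -ENOENT over 0 as the sticky result is
also an assumption; the review allows either.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the hpte_removebolted callback. */
typedef int (*demo_remove_fn)(unsigned long vaddr);

int demo_remove_range(unsigned long vstart, unsigned long vend,
		      unsigned long step, demo_remove_fn remove)
{
	unsigned long vaddr;
	int ret = 0;

	if (!remove)
		return -ENODEV;

	for (vaddr = vstart; vaddr < vend; vaddr += step) {
		int rc = remove(vaddr);

		if (rc == -ENOENT) {
			/* remember it, but keep unmapping the rest */
			ret = -ENOENT;
			continue;
		}
		if (rc < 0)
			return rc;	/* any other error stops the walk */
	}

	/* -ENOENT if any entry was missing, regardless of which one */
	return ret;
}

/* Demo callbacks: one entry missing at the start, at the end, or none. */
int demo_first_missing(unsigned long vaddr) { return vaddr == 0 ? -ENOENT : 0; }
int demo_last_missing(unsigned long vaddr)  { return vaddr == 3 ? -ENOENT : 0; }
int demo_all_present(unsigned long vaddr)   { (void)vaddr; return 0; }
```

With this shape a missing entry at the start of the range and one at
the end both produce -ENOENT, which is the property the review asks for.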
Re: [RFCv2 4/9] arch/powerpc: Clean up memory hotplug failure paths
On Fri, Jan 29, 2016 at 04:23:58PM +1100, David Gibson wrote:
> This makes a number of cleanups to handling of mapping failures during
> memory hotplug on Power:
>
> For errors creating the linear mapping for the hot-added region:
>   * This is now reported with EFAULT which is more appropriate than the
>     previous EINVAL (the failure is unlikely to be related to the
>     function's parameters)
>   * An error in this path now prints a warning message, rather than just
>     silently failing to add the extra memory.
>   * Previously a failure here could result in the region being partially
>     mapped. We now clean up any partial mapping before failing.
>
> For errors creating the vmemmap for the hot-added region:
>   * This is now reported with EFAULT instead of causing a BUG() - this
>     could happen for external reasons (e.g. full hash table) so it's better
>     to handle this non-fatally
>   * An error message is also printed, so the failure won't be silent
>   * As above a failure could cause a partially mapped region, we now
>     clean this up.
>
> Signed-off-by: David Gibson

Reviewed-by: Paul Mackerras
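The "clean up any partial mapping before failing" behaviour described
above can be illustrated with a generic rollback loop. This is a sketch
under stated assumptions, not the patch's implementation:
demo_map_range() and the demo_* helpers are hypothetical, and the
rollback here walks back over exactly the entries that were mapped,
which may differ from how the kernel patch undoes the mapping.

```c
#include <errno.h>

typedef int (*demo_map_fn)(unsigned long addr);
typedef void (*demo_unmap_fn)(unsigned long addr);

/*
 * Map [start, end) in steps. On the first failure, undo every step that
 * already succeeded and report -EFAULT, so the caller never sees a
 * partially mapped range.
 */
int demo_map_range(unsigned long start, unsigned long end, unsigned long step,
		   demo_map_fn map, demo_unmap_fn unmap)
{
	unsigned long addr;

	for (addr = start; addr < end; addr += step) {
		if (map(addr) < 0) {
			/* roll back the entries mapped so far */
			while (addr > start) {
				addr -= step;
				unmap(addr);
			}
			return -EFAULT;
		}
	}
	return 0;
}

/* Demo backend: counts live mappings and fails at one chosen address. */
int demo_mapped;
unsigned long demo_fail_at;

int demo_map_one(unsigned long addr)
{
	if (addr == demo_fail_at)
		return -1;
	demo_mapped++;
	return 0;
}

void demo_unmap_one(unsigned long addr)
{
	(void)addr;
	demo_mapped--;
}
```

After a failed call the live-mapping count is back to its starting
value, which is the invariant the cleanup is meant to provide.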
Re: [PATCH] powerpc/mm: Fix Multi hit ERAT cause by recent THP update
"Kirill A. Shutemov" writes:

> On Fri, Feb 05, 2016 at 11:41:40PM +0530, Aneesh Kumar K.V wrote:
>> With ppc64 we use the deposited pgtable_t to store the hash pte slot
>> information. We should not withdraw the deposited pgtable_t without
>> marking the pmd none. This ensures that low level hash fault handling
>> will skip this huge pte and we will handle them at upper levels. We
>> do take page table lock there and we can serialize against a parallel
>> THP split there. Hence mark the pte none (ie, remove __PAGE_USER) before
>> splitting the huge pmd.
>>
>> Also make sure we wait for irq disable section in other cpus to finish
>> before flipping a huge pte entry with a regular pmd entry. Code paths
>> like find_linux_pte_or_hugepte depend on irq disable to get
>> a stable pte_t pointer. A parallel thp split needs to make sure we
>> don't convert a pmd pte to a regular pmd entry without waiting for the
>> irq disable section to finish.
>>
>> Signed-off-by: Aneesh Kumar K.V
>
> Cc list is too short. At least akpm@ and linux-mm@ should be there.
> Probably numa balancing folks.

Will add them in the next iteration.

>
> Have you tested it with CONFIG_NUMA_BALANCING disabled?

yes.

>
> I would expect some additional changes in this area would be required.
> pmd_protnone() is always zero without numa balancing compiled in and
> therefore I don't see where we will get this serialization against ptl on
> fault side.

I am not really depending on the pmd_protnone definition here. The thing
that I am depending on with respect to the core code is that after taking
ptl, all the code paths should check the pmd using pmd_same. If it is found
not matching they should force a retry. All code paths within the
pmd_trans_huge() check seem to do so.

> >> --- >> arch/powerpc/include/asm/book3s/64/pgtable.h | 4 >> arch/powerpc/mm/pgtable_64.c | 36 >> +++- >> include/asm-generic/pgtable.h| 8 +++ >> mm/huge_memory.c | 1 + >> 4 files changed, 48 insertions(+), 1 deletion(-) >> >> diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h >> b/arch/powerpc/include/asm/book3s/64/pgtable.h >> index 8d1c41d28318..0415856941e0 100644 >> --- a/arch/powerpc/include/asm/book3s/64/pgtable.h >> +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h >> @@ -281,6 +281,10 @@ extern pgtable_t pgtable_trans_huge_withdraw(struct >> mm_struct *mm, pmd_t *pmdp); >> extern void pmdp_invalidate(struct vm_area_struct *vma, unsigned long >> address, >> pmd_t *pmdp); >> >> +#define __HAVE_ARCH_PMDP_HUGE_SPLITTING_FLUSH >> +extern void pmdp_huge_splitting_flush(struct vm_area_struct *vma, >> + unsigned long address, pmd_t *pmdp); > > I don't really like the name, but cannot think of anything better. same here. I will keep this as it is for now. ? > >> + >> #define pmd_move_must_withdraw pmd_move_must_withdraw >> struct spinlock; >> static inline int pmd_move_must_withdraw(struct spinlock *new_pmd_ptl, >> diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c >> index 3124a20d0fab..d80a23a92f95 100644 >> --- a/arch/powerpc/mm/pgtable_64.c >> +++ b/arch/powerpc/mm/pgtable_64.c >> @@ -646,6 +646,31 @@ pgtable_t pgtable_trans_huge_withdraw(struct mm_struct >> *mm, pmd_t *pmdp) >> return pgtable; >> } -aneesh ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
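The "take ptl, recheck with pmd_same, retry on mismatch" rule that this
reply relies on can be modelled in a few lines. This is a single-threaded
sketch only: the demo_* names and DEMO_ACCESSED are hypothetical, the
"lock" is a plain flag rather than a real spinlock, and the entry
comparison stands in for pmd_same(). It shows the control flow, not the
kernel's fault path.

```c
#include <stdbool.h>

#define DEMO_ACCESSED 0x1UL	/* hypothetical "accessed" bit */

enum demo_fault_result { DEMO_FAULT_HANDLED, DEMO_FAULT_RETRY };

struct demo_pmd_slot {
	unsigned long entry;	/* current pmd value */
	bool locked;		/* stands in for the page table lock */
};

static void demo_lock(struct demo_pmd_slot *slot)   { slot->locked = true; }
static void demo_unlock(struct demo_pmd_slot *slot) { slot->locked = false; }

/*
 * orig_entry is the value the fault path sampled before taking the lock.
 * If the slot changed in the meantime (for example a split rewrote it),
 * do nothing and ask the caller to retry the access.
 */
enum demo_fault_result demo_huge_fault(struct demo_pmd_slot *slot,
				       unsigned long orig_entry)
{
	enum demo_fault_result ret = DEMO_FAULT_RETRY;

	demo_lock(slot);
	if (slot->entry == orig_entry) {	/* the pmd_same() step */
		slot->entry |= DEMO_ACCESSED;
		ret = DEMO_FAULT_HANDLED;
	}
	demo_unlock(slot);

	return ret;
}
```

If the entry is modified between the sample and the locked recheck, the
handler leaves it untouched and returns DEMO_FAULT_RETRY, which is the
behaviour the patch depends on once the split has cleared _PAGE_USER.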
Re: [PATCH 2/2] powerpc/eeh: Reworked eeh_pe_bus_get()
On 08/02/16 16:35, Gavin Shan wrote:
> The original implementation is ugly: unnecessary if statements and
> "out" tag. This reworks the function to avoid above weaknesses. No
> functional changes introduced.
>
> Signed-off-by: Gavin Shan

This is definitely a lot nicer to read and doesn't appear to have any
functional changes.

Reviewed-by: Andrew Donnellan

-- 
Andrew Donnellan              Software Engineer, OzLabs
andrew.donnel...@au1.ibm.com  Australia Development Lab, Canberra
+61 2 6201 8874 (work)        IBM Australia Limited
Re: [RFCv2 8/9] pseries: Advertise HPT resizing support via CAS
On Fri, Jan 29, 2016 at 04:24:02PM +1100, David Gibson wrote:
> The hypervisor needs to know a guest is capable of using the HPT resizing
> PAPR extension in order to take full advantage of it for memory hotplug.
>
> If the hypervisor knows the guest is HPT resize aware, it can size the
> initial HPT based on the initial guest RAM size, relying on the guest to
> resize the HPT when more memory is hot-added. Without this, the hypervisor
> must size the HPT for the maximum possible guest RAM, which can lead to
> a huge waste of space if the guest never actually expands to that maximum
> size.
>
> This patch advertises the guest's support for HPT resizing via the
> ibm,client-architecture-support OF interface. Obviously, the actual
> encoding in the CAS vector is tentative until the extension is officially
> incorporated into PAPR. For now we use bit 0 of (previously unused) byte 8
> of option vector 5.
>
> Signed-off-by: David Gibson

Reviewed-by: Paul Mackerras