Re: [PATCH 1/2] mm: add notifier in pageblock isolation for balloon drivers
On Thu, Oct 01, 2009 at 02:53:11PM -0500, Robert Jennings wrote:
> Memory balloon drivers can allocate a large amount of memory which is
> not movable but could be freed to accommodate memory hotplug remove.
> Prior to calling the memory hotplug notifier chain the memory in the
> pageblock is isolated. If the migrate type is not MIGRATE_MOVABLE the
> isolation will not proceed, causing the memory removal for that page
> range to fail.
>
> Rather than immediately failing pageblock isolation if the migratetype
> is not MIGRATE_MOVABLE, this patch checks if all of the pages in the
> pageblock are owned by a registered balloon driver using a notifier
> chain. If all of the non-movable pages are owned by a balloon, they
> can be freed later through the memory notifier chain and the range can
> still be isolated in set_migratetype_isolate().
>
> Signed-off-by: Robert Jennings r...@linux.vnet.ibm.com
>
> ---
>  drivers/base/memory.c  | 19 +++
>  include/linux/memory.h | 22 ++
>  mm/page_alloc.c        | 49 +
>  3 files changed, 82 insertions(+), 8 deletions(-)
>
> Index: b/drivers/base/memory.c
> ===
> --- a/drivers/base/memory.c
> +++ b/drivers/base/memory.c
> @@ -63,6 +63,20 @@ void unregister_memory_notifier(struct n
>  }
>  EXPORT_SYMBOL(unregister_memory_notifier);
>
> +static BLOCKING_NOTIFIER_HEAD(memory_isolate_chain);
> +
> +int register_memory_isolate_notifier(struct notifier_block *nb)
> +{
> +	return blocking_notifier_chain_register(&memory_isolate_chain, nb);
> +}
> +EXPORT_SYMBOL(register_memory_isolate_notifier);
> +
> +void unregister_memory_isolate_notifier(struct notifier_block *nb)
> +{
> +	blocking_notifier_chain_unregister(&memory_isolate_chain, nb);
> +}
> +EXPORT_SYMBOL(unregister_memory_isolate_notifier);
> +
>  /*
>   * register_memory - Setup a sysfs device for a memory block
>   */
> @@ -157,6 +171,11 @@ int memory_notify(unsigned long val, voi
>  	return blocking_notifier_call_chain(&memory_chain, val, v);
>  }
>
> +int memory_isolate_notify(unsigned long val, void *v)
> +{
> +	return blocking_notifier_call_chain(&memory_isolate_chain, val, v);
> +}
> +
>  /*
>   * MEMORY_HOTPLUG depends on SPARSEMEM in mm/Kconfig, so it is
>   * OK to have direct references to sparsemem variables in here.
>
> Index: b/include/linux/memory.h
> ===
> --- a/include/linux/memory.h
> +++ b/include/linux/memory.h
> @@ -50,6 +50,14 @@ struct memory_notify {
>  	int status_change_nid;
>  };
>
> +#define MEM_ISOLATE_COUNT	(1<<0)
> +

This needs a comment explaining that this is an action to count the
number of pages that have been isolated within a range of pages, and
not a default value for nr_pages in the next structure.

> +struct memory_isolate_notify {
> +	unsigned long start_addr;
> +	unsigned int nr_pages;
> +	unsigned int pages_found;
> +};

Is there any particular reason you used the virtual address of the
mapped page instead of the PFN? I am guessing at this point that the
balloon driver is based on addresses, but the code that populates the
structure more commonly deals with PFNs. Outside of debugging code,
page_address is rarely used in mm/page_alloc.c.

It's picky, but it feels more natural to me to have the structure take
start_pfn and nr_pages, or start_addr and end_addr, but not a mix of
both.

> +
>  struct notifier_block;
>  struct mem_section;
>
> @@ -76,14 +84,28 @@ static inline int memory_notify(unsigned
>  {
>  	return 0;
>  }
> +static inline int register_memory_isolate_notifier(struct notifier_block *nb)
> +{
> +	return 0;
> +}
> +static inline void unregister_memory_isolate_notifier(struct notifier_block *nb)
> +{
> +}
> +static inline int memory_isolate_notify(unsigned long val, void *v)
> +{
> +	return 0;
> +}
>  #else
>  extern int register_memory_notifier(struct notifier_block *nb);
>  extern void unregister_memory_notifier(struct notifier_block *nb);
> +extern int register_memory_isolate_notifier(struct notifier_block *nb);
> +extern void unregister_memory_isolate_notifier(struct notifier_block *nb);
>  extern int register_new_memory(int, struct mem_section *);
>  extern int unregister_memory_section(struct mem_section *);
>  extern int memory_dev_init(void);
>  extern int remove_memory_block(unsigned long, struct mem_section *, int);
>  extern int memory_notify(unsigned long val, void *v);
> +extern int memory_isolate_notify(unsigned long val, void *v);
>  extern struct memory_block *find_memory_block(struct mem_section *);
>  #define CONFIG_MEM_BLOCK_SIZE	(PAGES_PER_SECTION<<PAGE_SHIFT)
>  enum mem_add_context { BOOT, HOTPLUG };
>
> Index: b/mm/page_alloc.c
> ===
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -48,6 +48,7 @@
>  #include <linux/page_cgroup.h>
>  #include <linux/debugobjects.h>
>  #include <linux/kmemleak.h>
>  +#include
Is volatile always verboten for FSL QE structures?
Just wondering - is there a case where using volatile for UCC parameter RAM for example will not work, or is the use of I/O accessors everywhere an attempt to be portable to other architectures? I'm asking because I really want to know ;) -- Michael Barkowski 905-482-4577 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: Is volatile always verboten for FSL QE structures?
Michael Barkowski wrote: Just wondering - is there a case where using volatile for UCC parameter RAM for example will not work, or is the use of I/O accessors everywhere an attempt to be portable to other architectures? 'volatile' just doesn't really do what you think it should do. The PowerPC architecture is too complicated w.r.t. ordering of reads and writes. In other words, you can't trust it. No one should be using 'volatile' to access I/O registers. -- Timur Tabi Linux kernel developer at Freescale ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: linux-next: tree build failure
On Wed, 2009-09-30 at 07:35 +0100, Jan Beulich wrote:
> Hollis Blanchard <holl...@us.ibm.com> 30.09.09 01:39:
> > On Tue, 2009-09-29 at 10:28 +0100, Jan Beulich wrote:
> > > Hollis Blanchard 09/29/09 2:00 AM:
> > > > First, I think there is a real bug here, and the code should
> > > > read like this (to match the comment):
> > > >
> > > > 	/* type has to be known at build time for optimization */
> > > > -	BUILD_BUG_ON(__builtin_constant_p(type));
> > > > +	BUILD_BUG_ON(!__builtin_constant_p(type));
> > > >
> > > > However, I get the same build error *both* ways, i.e.
> > > > __builtin_constant_p(type) evaluates to both 0 and 1? Either
> > > > that, or the new BUILD_BUG_ON() macro isn't working...
> > >
> > > No, at this point of the compilation process it's neither zero nor
> > > one, it's simply considered non-constant by the compiler at that
> > > stage (this builtin is used for optimization, not during parsing,
> > > and the error gets generated when the body of the function gets
> > > parsed, not when code gets generated from it).
> >
> > I think I see what you're saying. Do you have a fix to suggest?
>
> The one Rusty suggested the other day may help here. I don't like it
> as a drop-in replacement for BUILD_BUG_ON() though (due to it
> deferring the error generation to the linking stage); I'd rather view
> this as an improvement to MAYBE_BUILD_BUG_ON() (which should then be
> used here).

Can you be more specific? I have no idea what Rusty suggested where. I
can't even guess what MAYBE_BUILD_BUG_ON() is supposed to do (sounds
like a terrible name). All I know is that this used to build...

-- 
Hollis Blanchard
IBM Linux Technology Center
Re: Is volatile always verboten for FSL QE structures?
On Oct 2, 2009, at 9:46 AM, Timur Tabi wrote: Michael Barkowski wrote: Just wondering - is there a case where using volatile for UCC parameter RAM for example will not work, or is the use of I/O accessors everywhere an attempt to be portable to other architectures? 'volatile' just doesn't really do what you think it should do. The PowerPC architecture is too complicated w.r.t. ordering of reads and writes. In other words, you can't trust it. No one should be using 'volatile' to access I/O registers. See Documentation/volatile-considered-harmful.txt - k ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: Is volatile always verboten for FSL QE structures?
Kumar Gala wrote: On Oct 2, 2009, at 9:46 AM, Timur Tabi wrote: Michael Barkowski wrote: Just wondering - is there a case where using volatile for UCC parameter RAM for example will not work, or is the use of I/O accessors everywhere an attempt to be portable to other architectures? 'volatile' just doesn't really do what you think it should do. The PowerPC architecture is too complicated w.r.t. ordering of reads and writes. In other words, you can't trust it. No one should be using 'volatile' to access I/O registers. See Documentation/volatile-considered-harmful.txt I'm happy to adopt your interpretation of it, and I appreciate the explanation. from Documentation/volatile-considered-harmful.txt: - The above-mentioned accessor functions might use volatile on architectures where direct I/O memory access does work. Essentially, each accessor call becomes a little critical section on its own and ensures that the access happens as expected by the programmer. Part of it was that I wondered if this was one of those architectures. I guess not. -- Michael Barkowski 905-482-4577 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: linux booting fails on ppc440x5 with SRAM
It seems like a RAM initialization problem. Try to use an ICE or your
bootloader to test the initialized RAM with write/read operations; for
example, use mtest in U-Boot to check memory. An ICE should have a
detailed memory test function, like a hardware diagnostic.

2009/9/24 Benjamin Herrenschmidt <b...@kernel.crashing.org>:
> On Wed, 2009-09-23 at 20:19 +0530, Vineeth _ wrote:
> > I am trying to port linux on IBM powerpc-440x5. I have this board
> > which has this processor, a 16MB SRAM sits on location 0x0, uart and
> > a flash. I have a simple bootloader which does the following.
> > 1. Initialize the processor (as part of it, we generate the tlbs
> >    for UART, 16MB flash, 16MB SRAM)
> > 2. Initialize the UART
> > 3. Copy the simple-boot linux_image (binary file) from flash to
> >    0x40 location of SRAM.
> > 4. Kernel entry to 0x40
>
> Not sure yet, looks like things are being overwritten during boot.
> Interestingly enough, the DEAR value looks like an MSR value.. Hard
> to tell what's up. You'll have to dig a bit more yourself to figure
> out how come the kernel's getting that bad pointer. All I can tell
> you is that things work fine on 440x5 cores, though I haven't tested
> a configuration with so little memory in a while.
>
> Ben.
>
> > zImage starting: loaded at 0x0040 (sp: 0x004deeb0)
> > Allocating 0x1dad84 bytes for kernel ...
> > gunzipping (0x - 0x0040c000:0x004dd3f1)...done 0x1c31cc bytes
> > Linux/PowerPC load: console=ttyS0 root=/dev/ram
> > Finalizing device tree... flat tree at 0x4eb300
> > id mach(): done
> > inside skybeam_register_console function
> > MMU:enter MMU:hw init MMU:mapin MMU:setio MMU:exit
> > inside _setup_arch-begining inside _setup_arch-1 inside _setup_arch-2
> > setup_arch: bootmem
> > arch: exit
> > 7Top of RAM: 0x100, Total RAM: 0x100
> > Zone PFN ranges:
> >   DMA     0x - 0x1000
> >   Normal  0x1000 - 0x1000
> > Movable zone start PFN for each node
> > early_node_map[1] active PFN ranges
> >   0: 0x - 0x1000
> > MMU: Allocated 1088 bytes of context maps for 255 contexts
> > Built 1 zonelists in Zone order, mobility grouping off.
> > Total pages: 4064
> > Kernel command line: console=ttyS0 root=/dev/ram
> > Unable to handle kernel paging request for data at address 0x00021000
> > Faulting instruction address: 0xc010a7c4
> > Oops: Kernel access of bad area, sig: 11 [#1]
> > PREEMPT PowerPC 44x Platform
> > Modules linked in:
> > NIP: c010a7c4 LR: c010dc50 CTR:
> > REGS: c01bfeb0 TRAP: 0300 Not tainted (2.6.30)
> > MSR: 00021000 ME,CE CR: 2444 XER:
> > DEAR: 00021000, ESR:
> > TASK = c01a94b8[0] 'swapper' THREAD: c01be000
> > GPR00: 1180 c01bff60 c01a94b8 00021000 0025 0008 c0104968
> > GPR08: 2f72616d c011 c0155938 c01a 2224 f104
> > GPR16: fff8 08b8 c010d758 c0104968
> > GPR24: 1198 1190 c018a001 c01c5498 08c0 1188 00021000 c01c42f0
> > NIP [c010a7c4] strchr+0x0/0x3c
> > LR [c010dc50] match_token+0x138/0x228
> > Call Trace:
> > [c01bff60] [c016b99c] 0xc016b99c (unreliable)
> > [c01bffa0] [c0104a00] sort_extable+0x28/0x38
> > [c01bffb0] [c01938ec] sort_main_extable+0x20/0x30
> > [c01bffc0] [c018c734] start_kernel+0x140/0x288
> > [c01bfff0] [c200] skpinv+0x190/0x1cc
> > Instruction dump:
> > 7ca903a6 8804 38a5 38840001 2f80 9809 39290001 419e0010
> > 4200ffe4 98a9 4e800020 4e800020 8803 5484063e 7f802000 4d9e0020
> > ---[ end trace 31fd0ba7d8756001 ]---
> > Kernel panic - not syncing: Attempted to kill the idle task!
> > Call Trace:
> > [c01bfd90] [c0005d5c] show_stack+0x4c/0x16c (unreliable)
> > [c01bfdd0] [c002f17c] panic+0xa0/0x168
> > [c01bfe20] [c0032eb8] do_exit+0x61c/0x638
> > [c01bfe60] [c000b60c] kernel_bad_stack+0x0/0x4c
> > [c01bfe90] [c000f310] bad_page_fault+0x90/0xd8
> > [c01bfea0] [c000e184] handle_page_fault+0x7c/0x80
> > [c01bff60] [c016b99c] 0xc016b99c
> > [c01bffa0] [c0104a00] sort_extable+0x28/0x38
> > [c01bffb0] [c01938ec] sort_main_extable+0x20/0x30
> > [c01bffc0] [c018c734] start_kernel+0x140/0x288
> > [c01bfff0] [c200] skpinv+0x190/0x1cc
> > Rebooting in 180 seconds..
___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH] powerpc/8xx: fix regression introduced by cache coherency rewrite
Benjamin Herrenschmidt b...@kernel.crashing.org wrote on 01/10/2009 00:35:59: Had a look at linus tree and there is something I don't understand. Your fix, e0908085fc2391c85b85fb814ae1df377c8e0dcb, fixes a problem that was introduced by 8d30c14cab30d405a05f2aaceda1e9ad57800f36 but 8d30c14cab30d405a05f2aaceda1e9ad57800f36 was included in .31 and .31 works and top of tree does not, how can that be? To me it seems more likely that some mm change introduced between .31 and top of tree is the culprit. My patch addresses the problem described in the comment: /* On 8xx, cache control instructions (particularly * dcbst from flush_dcache_icache) fault as write * operation if there is an unpopulated TLB entry * for the address in question. To workaround that, * we invalidate the TLB here, thus avoiding dcbst * misbehaviour. */ Now you are using this old fix to paper over some other bug or so I think. There is something fishy with the TLB status, looking into the mpc862 manual I don't see how it can work reliably. Need to dwell some more. Anyhow, I have incorporated some of my findings into a new patch, perhaps this will be the golden one? Well, I still wonder about what whole unpopulated entry business. From what I can see, the TLB miss code will check _PAGE_PRESENT, and when not set, it will -still- insert something into the TLB (unlike all other CPU types that go straight to data access faults from there). So we end up populating with those unpopulated entries that will then cause us to take a DSI (or ISI) instead of a TLB miss the next time around ... and so we need to remove them once we do that, no ? IE. Once we have actually put a valid PTE in. At least that's my understanding and why I believe we should always have tlbil_va() in set_pte() and ptep_set_access_flags(), which boils down in putting it into the 2 filter functions in the new code. Well.. actually, the ptep_set_access_flags() will already do a flush_tlb_page_nohash() when the PTE is changed. 
So I suppose all we really need here would be, in set_pte_filter(), to
do the same if we are writing a PTE that has _PAGE_PRESENT, at least on
8xx. But just unconditionally doing a tlbil_va() in both the filter
functions shouldn't harm and would also fix the problem, though Rex
seems to indicate that is not the case. Now, we -might- have something
else broken on 8xx, hard to tell. You may want to try to bisect, adding
back the removed tlbil_va() as you go backward, between .31 and
upstream...

Found something odd, perhaps Rex can test? I tested this on my old 2.4
and it worked well. That was not quite correct, sorry. Try this
instead:

From c5f1a70561730b8a443f7081fbd9c5b023147043 Mon Sep 17 00:00:00 2001
From: Joakim Tjernlund joakim.tjernl...@transmode.se
Date: Fri, 2 Oct 2009 14:59:21 +0200
Subject: [PATCH] powerpc, 8xx: DTLB Error must check for more errors.

DataTLBError currently does:
	if ((err & 0x0200) == 0)
		DSI();
This won't handle a store with no valid translation. Change this to:
	if ((err & 0x4800) != 0)
		DSI();
that is, branch to DSI if either !permission or !translation.
---
 arch/powerpc/kernel/head_8xx.S | 4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/head_8xx.S b/arch/powerpc/kernel/head_8xx.S
index 52ff8c5..118bb05 100644
--- a/arch/powerpc/kernel/head_8xx.S
+++ b/arch/powerpc/kernel/head_8xx.S
@@ -472,8 +472,8 @@ DataTLBError:
 	/* First, make sure this was a store operation.
 	 */
 	mfspr	r10, SPRN_DSISR
-	andis.	r11, r10, 0x0200	/* If set, indicates store op */
-	beq	2f
+	andis.	r11, r10, 0x4800	/* no translation, no permission. */
+	bne	2f			/* branch if either is set */
 
 	/* The EA of a data TLB miss is automatically stored in the MD_EPN
 	 * register. The EA of a data TLB error is automatically stored in
-- 
1.6.4.4
Re: Is volatile always verboten for FSL QE structures?
Michael Barkowski wrote: Kumar Gala wrote: On Oct 2, 2009, at 9:46 AM, Timur Tabi wrote: Michael Barkowski wrote: Just wondering - is there a case where using volatile for UCC parameter RAM for example will not work, or is the use of I/O accessors everywhere an attempt to be portable to other architectures? 'volatile' just doesn't really do what you think it should do. The PowerPC architecture is too complicated w.r.t. ordering of reads and writes. In other words, you can't trust it. No one should be using 'volatile' to access I/O registers. See Documentation/volatile-considered-harmful.txt I'm happy to adopt your interpretation of it, and I appreciate the explanation. from Documentation/volatile-considered-harmful.txt: - The above-mentioned accessor functions might use volatile on architectures where direct I/O memory access does work. Essentially, each accessor call becomes a little critical section on its own and ensures that the access happens as expected by the programmer. Part of it was that I wondered if this was one of those architectures. I guess not. I guess this could only work on architectures having a totally ordered memory model. Definitely not the case for Power. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 1/2] mm: add notifier in pageblock isolation for balloon drivers
* Nathan Fontenot (nf...@austin.ibm.com) wrote: Robert Jennings wrote: Memory balloon drivers can allocate a large amount of memory which is not movable but could be freed to accommodate memory hotplug remove. Prior to calling the memory hotplug notifier chain the memory in the pageblock is isolated. If the migrate type is not MIGRATE_MOVABLE the isolation will not proceed, causing the memory removal for that page range to fail. Rather than immediately failing pageblock isolation if the the migrateteype is not MIGRATE_MOVABLE, this patch checks if all of the pages in the pageblock are owned by a registered balloon driver using a notifier chain. If all of the non-movable pages are owned by a balloon, they can be freed later through the memory notifier chain and the range can still be isolated in set_migratetype_isolate(). Signed-off-by: Robert Jennings r...@linux.vnet.ibm.com --- drivers/base/memory.c | 19 +++ include/linux/memory.h | 22 ++ mm/page_alloc.c| 49 + 3 files changed, 82 insertions(+), 8 deletions(-) snip Index: b/mm/page_alloc.c === --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -48,6 +48,7 @@ #include linux/page_cgroup.h #include linux/debugobjects.h #include linux/kmemleak.h +#include linux/memory.h #include trace/events/kmem.h #include asm/tlbflush.h @@ -4985,23 +4986,55 @@ void set_pageblock_flags_group(struct pa int set_migratetype_isolate(struct page *page) { struct zone *zone; -unsigned long flags; +unsigned long flags, pfn, iter; +long immobile = 0; +struct memory_isolate_notify arg; +int notifier_ret; int ret = -EBUSY; int zone_idx; zone = page_zone(page); zone_idx = zone_idx(zone); + +pfn = page_to_pfn(page); +arg.start_addr = (unsigned long)page_address(page); +arg.nr_pages = pageblock_nr_pages; +arg.pages_found = 0; + spin_lock_irqsave(zone-lock, flags); /* * In future, more migrate types will be able to be isolation target. 
*/ -if (get_pageblock_migratetype(page) != MIGRATE_MOVABLE -zone_idx != ZONE_MOVABLE) -goto out; -set_pageblock_migratetype(page, MIGRATE_ISOLATE); -move_freepages_block(zone, page, MIGRATE_ISOLATE); -ret = 0; -out: +do { +if (get_pageblock_migratetype(page) == MIGRATE_MOVABLE +zone_idx == ZONE_MOVABLE) { +ret = 0; +break; +} + +/* + * If all of the pages in a zone are used by a balloon, + * the range can be still be isolated. The balloon will + * free these pages from the memory notifier chain. + */ +notifier_ret = memory_isolate_notify(MEM_ISOLATE_COUNT, arg); +notifier_ret = notifier_to_errno(ret); Should this be notifier_ret = notifier_to_errno(notifier_ret); -Nathan I'll correct this. Thanks ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 1/2] mm: add notifier in pageblock isolation for balloon drivers
* Mel Gorman (m...@csn.ul.ie) wrote: On Thu, Oct 01, 2009 at 02:53:11PM -0500, Robert Jennings wrote: Memory balloon drivers can allocate a large amount of memory which is not movable but could be freed to accommodate memory hotplug remove. Prior to calling the memory hotplug notifier chain the memory in the pageblock is isolated. If the migrate type is not MIGRATE_MOVABLE the isolation will not proceed, causing the memory removal for that page range to fail. Rather than immediately failing pageblock isolation if the the migrateteype is not MIGRATE_MOVABLE, this patch checks if all of the pages in the pageblock are owned by a registered balloon driver using a notifier chain. If all of the non-movable pages are owned by a balloon, they can be freed later through the memory notifier chain and the range can still be isolated in set_migratetype_isolate(). Signed-off-by: Robert Jennings r...@linux.vnet.ibm.com --- drivers/base/memory.c | 19 +++ include/linux/memory.h | 22 ++ mm/page_alloc.c| 49 + 3 files changed, 82 insertions(+), 8 deletions(-) Index: b/drivers/base/memory.c === --- a/drivers/base/memory.c +++ b/drivers/base/memory.c @@ -63,6 +63,20 @@ void unregister_memory_notifier(struct n } EXPORT_SYMBOL(unregister_memory_notifier); +static BLOCKING_NOTIFIER_HEAD(memory_isolate_chain); + +int register_memory_isolate_notifier(struct notifier_block *nb) +{ + return blocking_notifier_chain_register(memory_isolate_chain, nb); +} +EXPORT_SYMBOL(register_memory_isolate_notifier); + +void unregister_memory_isolate_notifier(struct notifier_block *nb) +{ + blocking_notifier_chain_unregister(memory_isolate_chain, nb); +} +EXPORT_SYMBOL(unregister_memory_isolate_notifier); + /* * register_memory - Setup a sysfs device for a memory block */ @@ -157,6 +171,11 @@ int memory_notify(unsigned long val, voi return blocking_notifier_call_chain(memory_chain, val, v); } +int memory_isolate_notify(unsigned long val, void *v) +{ + return blocking_notifier_call_chain(memory_isolate_chain, 
val, v); +} + /* * MEMORY_HOTPLUG depends on SPARSEMEM in mm/Kconfig, so it is * OK to have direct references to sparsemem variables in here. Index: b/include/linux/memory.h === --- a/include/linux/memory.h +++ b/include/linux/memory.h @@ -50,6 +50,14 @@ struct memory_notify { int status_change_nid; }; +#define MEM_ISOLATE_COUNT (10) + This needs a comment explaining that that this is an action to count the number of pages within a range that have been isolated within a range of pages and not a default value for nr_pages in the next structure. I'll provide a clear explanation for this. +struct memory_isolate_notify { + unsigned long start_addr; + unsigned int nr_pages; + unsigned int pages_found; +}; Is there any particular reason you used virtual address of the mapped page instead of PFN? I am guessing at this point that the balloon driver is based on addresses but the code that populates the structure more commonly deals with PFNs. Outside of debugging code, page_address is rarely used in mm/page_alloc.c . It's picky but it feels more natural to me to have the structure have start_pfn and nr_pages or start_addr and end_addr but not a mix of both. Changing this to use start_pfn and nr_pages, this will also match struct memory_notify. Thanks for the review of this patch. 
+ struct notifier_block; struct mem_section; @@ -76,14 +84,28 @@ static inline int memory_notify(unsigned { return 0; } +static inline int register_memory_isolate_notifier(struct notifier_block *nb) +{ + return 0; +} +static inline void unregister_memory_isolate_notifier(struct notifier_block *nb) +{ +} +static inline int memory_isolate_notify(unsigned long val, void *v) +{ + return 0; +} #else extern int register_memory_notifier(struct notifier_block *nb); extern void unregister_memory_notifier(struct notifier_block *nb); +extern int register_memory_isolate_notifier(struct notifier_block *nb); +extern void unregister_memory_isolate_notifier(struct notifier_block *nb); extern int register_new_memory(int, struct mem_section *); extern int unregister_memory_section(struct mem_section *); extern int memory_dev_init(void); extern int remove_memory_block(unsigned long, struct mem_section *, int); extern int memory_notify(unsigned long val, void *v); +extern int memory_isolate_notify(unsigned long val, void *v); extern struct memory_block *find_memory_block(struct mem_section *); #define CONFIG_MEM_BLOCK_SIZE (PAGES_PER_SECTIONPAGE_SHIFT)
[PATCH 1/2][v2] mm: add notifier in pageblock isolation for balloon drivers
Memory balloon drivers can allocate a large amount of memory which is
not movable but could be freed to accommodate memory hotplug remove.
Prior to calling the memory hotplug notifier chain the memory in the
pageblock is isolated. If the migrate type is not MIGRATE_MOVABLE the
isolation will not proceed, causing the memory removal for that page
range to fail.

Rather than failing pageblock isolation if the migratetype is not
MIGRATE_MOVABLE, this patch checks if all of the pages in the pageblock
are owned by a registered balloon driver (or other entity) using a
notifier chain. If all of the non-movable pages are owned by a balloon,
they can be freed later through the memory notifier chain and the range
can still be isolated in set_migratetype_isolate().

Signed-off-by: Robert Jennings r...@linux.vnet.ibm.com

---
 drivers/base/memory.c  | 19 +++
 include/linux/memory.h | 26 ++
 mm/page_alloc.c        | 45 ++---
 3 files changed, 83 insertions(+), 7 deletions(-)

Index: b/drivers/base/memory.c
===
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -63,6 +63,20 @@ void unregister_memory_notifier(struct n
 }
 EXPORT_SYMBOL(unregister_memory_notifier);
 
+static BLOCKING_NOTIFIER_HEAD(memory_isolate_chain);
+
+int register_memory_isolate_notifier(struct notifier_block *nb)
+{
+	return blocking_notifier_chain_register(&memory_isolate_chain, nb);
+}
+EXPORT_SYMBOL(register_memory_isolate_notifier);
+
+void unregister_memory_isolate_notifier(struct notifier_block *nb)
+{
+	blocking_notifier_chain_unregister(&memory_isolate_chain, nb);
+}
+EXPORT_SYMBOL(unregister_memory_isolate_notifier);
+
 /*
  * register_memory - Setup a sysfs device for a memory block
  */
@@ -157,6 +171,11 @@ int memory_notify(unsigned long val, voi
 	return blocking_notifier_call_chain(&memory_chain, val, v);
 }
 
+int memory_isolate_notify(unsigned long val, void *v)
+{
+	return blocking_notifier_call_chain(&memory_isolate_chain, val, v);
+}
+
 /*
  * MEMORY_HOTPLUG depends on SPARSEMEM in mm/Kconfig, so it is
  * OK to have direct references to sparsemem variables in here.

Index: b/include/linux/memory.h
===
--- a/include/linux/memory.h
+++ b/include/linux/memory.h
@@ -50,6 +50,18 @@ struct memory_notify {
 	int status_change_nid;
 };
 
+/*
+ * During pageblock isolation, count the number of pages in the
+ * range [start_pfn, start_pfn + nr_pages)
+ */
+#define MEM_ISOLATE_COUNT	(1<<0)
+
+struct memory_isolate_notify {
+	unsigned long start_pfn;
+	unsigned int nr_pages;
+	unsigned int pages_found;
+};
+
 struct notifier_block;
 struct mem_section;
 
@@ -76,14 +88,28 @@ static inline int memory_notify(unsigned
 {
 	return 0;
 }
+static inline int register_memory_isolate_notifier(struct notifier_block *nb)
+{
+	return 0;
+}
+static inline void unregister_memory_isolate_notifier(struct notifier_block *nb)
+{
+}
+static inline int memory_isolate_notify(unsigned long val, void *v)
+{
+	return 0;
+}
 #else
 extern int register_memory_notifier(struct notifier_block *nb);
 extern void unregister_memory_notifier(struct notifier_block *nb);
+extern int register_memory_isolate_notifier(struct notifier_block *nb);
+extern void unregister_memory_isolate_notifier(struct notifier_block *nb);
 extern int register_new_memory(int, struct mem_section *);
 extern int unregister_memory_section(struct mem_section *);
 extern int memory_dev_init(void);
 extern int remove_memory_block(unsigned long, struct mem_section *, int);
 extern int memory_notify(unsigned long val, void *v);
+extern int memory_isolate_notify(unsigned long val, void *v);
 extern struct memory_block *find_memory_block(struct mem_section *);
 #define CONFIG_MEM_BLOCK_SIZE	(PAGES_PER_SECTION<<PAGE_SHIFT)
 enum mem_add_context { BOOT, HOTPLUG };

Index: b/mm/page_alloc.c
===
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -48,6 +48,7 @@
 #include <linux/page_cgroup.h>
 #include <linux/debugobjects.h>
 #include <linux/kmemleak.h>
+#include <linux/memory.h>
 #include <trace/events/kmem.h>
 
 #include <asm/tlbflush.h>
@@ -4985,23 +4986,53 @@ void set_pageblock_flags_group(struct pa
 int set_migratetype_isolate(struct page *page)
 {
 	struct zone *zone;
-	unsigned long flags;
+	unsigned long flags, pfn, iter;
+	unsigned long immobile = 0;
+	struct memory_isolate_notify arg;
+	int notifier_ret;
 	int ret = -EBUSY;
 	int zone_idx;
 
 	zone = page_zone(page);
 	zone_idx = zone_idx(zone);
+
 	spin_lock_irqsave(&zone->lock, flags);
+	if (get_pageblock_migratetype(page) == MIGRATE_MOVABLE ||
+	    zone_idx == ZONE_MOVABLE) {
+		ret = 0;
+		goto out;
+	}
+
+
[PATCH 2/2][v2] powerpc: Make the CMM memory hotplug aware
The Collaborative Memory Manager (CMM) module allocates individual pages over time that are not migratable. On a long running system this can severely impact the ability to find enough pages to support a hotplug memory remove operation. This patch adds a memory isolation notifier and a memory hotplug notifier. The memory isolation notifier will return the number of pages found in the range specified. This is used to determine if all of the used pages in a pageblock are owned by the balloon (or other entities in the notifier chain). The hotplug notifier will free pages in the range which is to be removed. The priority of this hotplug notifier is low so that it will be called near last, this helps avoids removing loaned pages in operations that fail due to other handlers. CMM activity will be halted when hotplug remove operations are active and resume activity after a delay period to allow the hypervisor time to adjust. Signed-off-by: Robert Jennings r...@linux.vnet.ibm.com --- Minor update to cmm_count_pages() to account for changes in struct memory_isolate_notify. --- arch/powerpc/platforms/pseries/cmm.c | 207 ++- 1 file changed, 201 insertions(+), 6 deletions(-) Index: b/arch/powerpc/platforms/pseries/cmm.c === --- a/arch/powerpc/platforms/pseries/cmm.c +++ b/arch/powerpc/platforms/pseries/cmm.c @@ -38,19 +38,28 @@ #include asm/mmu.h #include asm/pgalloc.h #include asm/uaccess.h +#include linux/memory.h #include plpar_wrappers.h #define CMM_DRIVER_VERSION 1.0.0 #define CMM_DEFAULT_DELAY 1 +#define CMM_HOTPLUG_DELAY 5 #define CMM_DEBUG 0 #define CMM_DISABLE0 #define CMM_OOM_KB 1024 #define CMM_MIN_MEM_MB 256 #define KB2PAGES(_p) ((_p)(PAGE_SHIFT-10)) #define PAGES2KB(_p) ((_p)(PAGE_SHIFT-10)) +/* + * The priority level tries to ensure that this notifier is called as + * late as possible to reduce thrashing in the shared memory pool. 
+ */
+#define CMM_MEM_HOTPLUG_PRI	1
+#define CMM_MEM_ISOLATE_PRI	15
 
 static unsigned int delay = CMM_DEFAULT_DELAY;
+static unsigned int hotplug_delay = CMM_HOTPLUG_DELAY;
 static unsigned int oom_kb = CMM_OOM_KB;
 static unsigned int cmm_debug = CMM_DEBUG;
 static unsigned int cmm_disabled = CMM_DISABLE;
@@ -65,6 +74,10 @@ MODULE_VERSION(CMM_DRIVER_VERSION);
 module_param_named(delay, delay, uint, S_IRUGO | S_IWUSR);
 MODULE_PARM_DESC(delay, "Delay (in seconds) between polls to query hypervisor paging requests. "
		 "[Default=" __stringify(CMM_DEFAULT_DELAY) "]");
+module_param_named(hotplug_delay, hotplug_delay, uint, S_IRUGO | S_IWUSR);
+MODULE_PARM_DESC(hotplug_delay, "Delay (in seconds) after memory hotplug remove "
+		 "before activity resumes. "
+		 "[Default=" __stringify(CMM_HOTPLUG_DELAY) "]");
 module_param_named(oom_kb, oom_kb, uint, S_IRUGO | S_IWUSR);
 MODULE_PARM_DESC(oom_kb, "Amount of memory in kb to free on OOM. "
		 "[Default=" __stringify(CMM_OOM_KB) "]");
@@ -88,6 +101,8 @@ struct cmm_page_array {
 static unsigned long loaned_pages;
 static unsigned long loaned_pages_target;
 static unsigned long oom_freed_pages;
+static atomic_t hotplug_active = ATOMIC_INIT(0);
+static atomic_t hotplug_occurred = ATOMIC_INIT(0);
 static struct cmm_page_array *cmm_page_list;
 static DEFINE_SPINLOCK(cmm_lock);
@@ -110,6 +125,9 @@ static long cmm_alloc_pages(long nr)
 	cmm_dbg("Begin request for %ld pages\n", nr);
 
 	while (nr) {
+		if (atomic_read(&hotplug_active))
+			break;
+
 		addr = __get_free_page(GFP_NOIO | __GFP_NOWARN |
				       __GFP_NORETRY | __GFP_NOMEMALLOC);
 		if (!addr)
@@ -119,8 +137,10 @@ static long cmm_alloc_pages(long nr)
 		if (!pa || pa->index >= CMM_NR_PAGES) {
 			/* Need a new page for the page list.
			 */
 			spin_unlock(&cmm_lock);
-			npa = (struct cmm_page_array *)__get_free_page(GFP_NOIO | __GFP_NOWARN |
-								       __GFP_NORETRY | __GFP_NOMEMALLOC);
+			npa = (struct cmm_page_array *)__get_free_page(
+					GFP_NOIO | __GFP_NOWARN |
+					__GFP_NORETRY | __GFP_NOMEMALLOC |
+					__GFP_MOVABLE);
 			if (!npa) {
 				pr_info("%s: Can not allocate new page list\n", __func__);
 				free_page(addr);
@@ -273,9 +293,23 @@ static int cmm_thread(void *dummy)
 	while (1) {
 		timeleft = msleep_interruptible(delay * 1000);
 
-		if (kthread_should_stop() || timeleft) {
-			loaned_pages_target = loaned_pages;
+		if
Re: [patch 1/3] powerpc: Move 64bit VDSO to improve context switch performance
Anton Blanchard an...@samba.org writes:

> On 64bit applications the VDSO is the only thing in segment 0. Since
> the VDSO is position independent we can remove the hint and let
> get_unmapped_area pick an area.

This breaks gdb. The section table in the VDSO image, when mapped into
the process, no longer contains meaningful values, and gdb rejects it.

Andreas.

-- 
Andreas Schwab, sch...@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5
"And now for something completely different."

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH] powerpc/8xx: fix regression introduced by cache coherency rewrite
On Thu, Oct 01, 2009 at 08:35:59AM +1000, Benjamin Herrenschmidt wrote:
> From what I can see, the TLB miss code will check _PAGE_PRESENT, and
> when not set, it will -still- insert something into the TLB (unlike all
> other CPU types that go straight to data access faults from there). So
> we end up populating with those unpopulated entries that will then
> cause us to take a DSI (or ISI) instead of a TLB miss the next time
> around ... and so we need to remove them once we do that, no? I.e.,
> once we have actually put a valid PTE in.
>
> At least that's my understanding, and why I believe we should always
> have tlbil_va() in set_pte() and ptep_set_access_flags(), which boils
> down to putting it into the two filter functions in the new code.
>
> Well... actually, ptep_set_access_flags() will already do a
> flush_tlb_page_nohash() when the PTE is changed. So I suppose all we
> really need here would be, in set_pte_filter(), to do the same if we
> are writing a PTE that has _PAGE_PRESENT, at least on 8xx. But just
> unconditionally doing a tlbil_va() in both filter functions shouldn't
> harm and should also fix the problem, though Rex seems to indicate
> that is not the case.

Adding a tlbil_va to do_page_fault makes the problem go away for me (on
top of your merge branch) -- none of the other changes in this thread
do (assuming I didn't miss any).

FWIW, when it gets stuck on a fault, DSISR is 0xc000, and
handle_mm_fault returns zero.

-Scott
Re: [PATCH] powerpc/8xx: fix regression introduced by cache coherency rewrite
On Fri, 2009-10-02 at 16:49 -0500, Scott Wood wrote:
> Adding a tlbil_va to do_page_fault makes the problem go away for me (on
> top of your merge branch) -- none of the other changes in this thread
> do (assuming I didn't miss any).
>
> FWIW, when it gets stuck on a fault, DSISR is 0xc000, and
> handle_mm_fault returns zero.

But in that case, it should hit ptep_set_access_flags() (via
handle_mm_fault) and from there call tlbil_va() (provided we add a call
to it in the relevant filter function, as I proposed initially), no?

Ben.