Re: [External] Re: [PATCH v11 1/2] memory tier: dax/kmem: introduce an abstract layer for finding, allocating, and putting memory types
On Tue, 9 Apr 2024 12:00:06 -0700 "Ho-Ren (Jack) Chuang" wrote:

> Hi Jonathan,
>
> On Fri, Apr 5, 2024 at 6:56 AM Jonathan Cameron wrote:
> >
> > On Fri, 5 Apr 2024 00:07:05 +0000 "Ho-Ren (Jack) Chuang" wrote:
> >
> > > Since different memory devices require finding, allocating, and putting
> > > memory types, these common steps are abstracted in this patch,
> > > enhancing the scalability and conciseness of the code.
> > >
> > > Signed-off-by: Ho-Ren (Jack) Chuang
> > > Reviewed-by: "Huang, Ying"
> >
> > Reviewed-by: Jonathan Cameron
>
> Thank you for reviewing and for adding your "Reviewed-by"!
> I was wondering if I need to send a v12 and manually add
> this to the commit description, or if this is sufficient.

I had added Jonathan's r-b to the mm.git copy of this patch.
Re: [Qemu-devel] [PATCH 1/1] mm: gup: teach get_user_pages_unlocked to handle FOLL_NOWAIT
On Fri, 2 Mar 2018 18:43:43 +0100 Andrea Arcangeli wrote:

> KVM is hanging during postcopy live migration with userfaultfd because
> get_user_pages_unlocked is not capable to handle FOLL_NOWAIT.
>
> Earlier FOLL_NOWAIT was only ever passed to get_user_pages.
>
> Specifically faultin_page (the callee of get_user_pages_unlocked
> caller) doesn't know that if FAULT_FLAG_RETRY_NOWAIT was set in the
> page fault flags, when VM_FAULT_RETRY is returned, the mmap_sem wasn't
> actually released (even if nonblocking is not NULL). So it sets
> *nonblocking to zero and the caller won't release the mmap_sem
> thinking it was already released, but it wasn't because of
> FOLL_NOWAIT.
>
> Reported-by: Dr. David Alan Gilbert
> Tested-by: Dr. David Alan Gilbert
> Signed-off-by: Andrea Arcangeli

I added

Fixes: ce53053ce378c ("kvm: switch get_user_page_nowait() to get_user_pages_unlocked()")
Cc:
Re: [Qemu-devel] [PATCH v9 3/5] mm: function to offer a page block on the free list
On Thu, 13 Apr 2017 17:35:06 +0800 Wei Wang wrote:

> Add a function to find a page block on the free list specified by the
> caller. Pages from the page block may be used immediately after the
> function returns. The caller is responsible for detecting or preventing
> the use of such pages.
>
> ...
>
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4498,6 +4498,93 @@ void show_free_areas(unsigned int filter)
>  	show_swap_cache_info();
>  }
>  
> +/**
> + * Heuristically get a page block in the system that is unused.
> + * It is possible that pages from the page block are used immediately after
> + * inquire_unused_page_block() returns. It is the caller's responsibility
> + * to either detect or prevent the use of such pages.
> + *
> + * The free list to check: zone->free_area[order].free_list[migratetype].
> + *
> + * If the caller supplied page block (i.e. **page) is on the free list, offer
> + * the next page block on the list to the caller. Otherwise, offer the first
> + * page block on the list.
> + *
> + * Return 0 when a page block is found on the caller specified free list.
> + */
> +int inquire_unused_page_block(struct zone *zone, unsigned int order,
> +			unsigned int migratetype, struct page **page)
> +{

Perhaps we can wrap this in the appropriate ifdef so the kernels which
won't be using virtio-balloon don't carry the added overhead.
Re: [Qemu-devel] [PATCH kernel v8 3/4] mm: add interface to offer info about unused pages
On Thu, 16 Mar 2017 15:08:46 +0800 Wei Wang wrote:

> From: Liang Li
>
> This patch adds a function to provide a snapshot of the present system
> unused pages. An important usage of this function is to provide the
> unused pages to the live migration thread, which skips the transfer of
> those unused pages. Newly used pages can be re-tracked by the dirty
> page logging mechanisms.

I don't think this will be useful for anything other than virtio-balloon.
I guess it would be better to keep this code in the virtio-balloon driver
if possible, even though that's rather a layering violation :(  What would
have to be done to make that possible?  Perhaps we can put some *small*
helpers into page_alloc.c to prevent things from becoming too ugly.

> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4498,6 +4498,120 @@ void show_free_areas(unsigned int filter)
>  	show_swap_cache_info();
>  }
>  
> +static int __record_unused_pages(struct zone *zone, int order,
> +				__le64 *buf, unsigned int size,
> +				unsigned int *offset, bool part_fill)
> +{
> +	unsigned long pfn, flags;
> +	int t, ret = 0;
> +	struct list_head *curr;
> +	__le64 *chunk;
> +
> +	if (zone_is_empty(zone))
> +		return 0;
> +
> +	spin_lock_irqsave(&zone->lock, flags);
> +
> +	if (*offset + zone->free_area[order].nr_free > size && !part_fill) {
> +		ret = -ENOSPC;
> +		goto out;
> +	}
> +	for (t = 0; t < MIGRATE_TYPES; t++) {
> +		list_for_each(curr, &zone->free_area[order].free_list[t]) {
> +			pfn = page_to_pfn(list_entry(curr, struct page, lru));
> +			chunk = buf + *offset;
> +			if (*offset + 2 > size) {
> +				ret = -ENOSPC;
> +				goto out;
> +			}
> +			/* Align to the chunk format used in virtio-balloon */
> +			*chunk = cpu_to_le64(pfn << 12);
> +			*(chunk + 1) = cpu_to_le64((1 << order) << 12);
> +			*offset += 2;
> +		}
> +	}
> +
> +out:
> +	spin_unlock_irqrestore(&zone->lock, flags);
> +
> +	return ret;
> +}

This looks like it could disable interrupts for a long time.  Too long?
> +/*
> + * The record_unused_pages() function is used to record the system unused
> + * pages. The unused pages can be skipped to transfer during live migration.
> + * Though the unused pages are dynamically changing, dirty page logging
> + * mechanisms are able to capture the newly used pages though they were
> + * recorded as unused pages via this function.
> + *
> + * This function scans the free page list of the specified order to record
> + * the unused pages, and chunks those continuous pages following the chunk
> + * format below:
> + * ---------------------------------------
> + * |  Base (52-bit)      | Rsvd (12-bit) |
> + * ---------------------------------------
> + * ---------------------------------------
> + * |  Size (52-bit)      | Rsvd (12-bit) |
> + * ---------------------------------------
> + *
> + * @start_zone: zone to start the record operation.
> + * @order: order of the free page list to record.
> + * @buf: buffer to record the unused page info in chunks.
> + * @size: size of the buffer in __le64 to record.
> + * @offset: offset in the buffer to record.
> + * @part_fill: indicate if partial fill is used.
> + *
> + * return -EINVAL if parameter is invalid
> + * return -ENOSPC when the buffer is too small to record all the unused pages
> + * return 0 when success
> + */

It's a strange thing - it returns information which will instantly become
incorrect.
> +int record_unused_pages(struct zone **start_zone, int order,
> +			__le64 *buf, unsigned int size,
> +			unsigned int *offset, bool part_fill)
> +{
> +	struct zone *zone;
> +	int ret = 0;
> +	bool skip_check = false;
> +
> +	/* Make sure all the parameters are valid */
> +	if (buf == NULL || offset == NULL || order >= MAX_ORDER)
> +		return -EINVAL;
> +
> +	if (*start_zone != NULL) {
> +		bool found = false;
> +
> +		for_each_populated_zone(zone) {
> +			if (zone != *start_zone)
> +				continue;
> +			found = true;
> +			break;
> +		}
> +		if (!found)
> +			return -EINVAL;
> +	} else
> +		skip_check = true;
> +
> +	for_each_populated_zone(zone) {
> +		/* Start from *start_zone if it's not NULL */
> +		if (!skip_check) {
> +			if (*start_zone != zone)
> +				continue;
> +			else
> +				skip_check = true;
> +		}
> +		ret = __record_unused_pages(zone, order, buf,
Re: [Qemu-devel] [PATCH 22/23] userfaultfd: avoid mmap_sem read recursion in mcopy_atomic
There's a more serious failure with i386 allmodconfig:

fs/userfaultfd.c:145:2: note: in expansion of macro 'BUILD_BUG_ON'
  BUILD_BUG_ON(sizeof(struct uffd_msg) != 32);

I'm surprised the feature is even reachable on i386 builds?
Re: [Qemu-devel] [PATCH 22/23] userfaultfd: avoid mmap_sem read recursion in mcopy_atomic
On Thu, 14 May 2015 19:31:19 +0200 Andrea Arcangeli aarca...@redhat.com wrote:

> If the rwsem starves writers it wasn't strictly a bug but lockdep
> doesn't like it and this avoids depending on lowlevel implementation
> details of the lock.
>
> ...
>
> @@ -229,13 +246,33 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
>  
>  		if (!zeropage)
>  			err = mcopy_atomic_pte(dst_mm, dst_pmd, dst_vma,
> -					       dst_addr, src_addr);
> +					       dst_addr, src_addr, page);
>  		else
>  			err = mfill_zeropage_pte(dst_mm, dst_pmd, dst_vma,
>  						 dst_addr);
>  
>  		cond_resched();
>  
> +		if (unlikely(err == -EFAULT)) {
> +			void *page_kaddr;
> +
> +			BUILD_BUG_ON(zeropage);

I'm not sure what this is trying to do.  BUILD_BUG_ON(local_variable)?
It goes bang in my build.  I'll just delete it.

> +			up_read(&dst_mm->mmap_sem);
> +			BUG_ON(!page);
> +
> +			page_kaddr = kmap(page);
> +			err = copy_from_user(page_kaddr,
> +					     (const void __user *) src_addr,
> +					     PAGE_SIZE);
> +			kunmap(page);
> +			if (unlikely(err)) {
> +				err = -EFAULT;
> +				goto out;
> +			}
> +			goto retry;
> +		} else
> +			BUG_ON(page);
> +
Re: [Qemu-devel] [PATCH 00/23] userfaultfd v4
On Thu, 14 May 2015 19:30:57 +0200 Andrea Arcangeli aarca...@redhat.com wrote:

> This is the latest userfaultfd patchset against mm-v4.1-rc3
> 2015-05-14-10:04.

It would be useful to have some userfaultfd testcases in
tools/testing/selftests/.  Partly as an aid to arch maintainers when
enabling this.  And also as a standalone thing to give people a practical
way of exercising this interface.

What are your thoughts on enabling userfaultfd for other architectures,
btw?  Are there good use cases, are people working on it, etc?

Also, I assume a manpage is in the works?  Sooner rather than later would
be good - Michael's review of proposed kernel interfaces has often been
valuable.
Re: [Qemu-devel] [PATCH 8/9] mm: compaction: Cache if a pageblock was scanned and no pages were isolated
On Tue, 25 Sep 2012 10:12:07 +0100 Mel Gorman mgor...@suse.de wrote:

> First, we'd introduce a variant of get_pageblock_migratetype() that
> returns all the bits for the pageblock flags and then helpers to extract
> either the migratetype or the PG_migrate_skip. We already are incurring
> the cost of get_pageblock_migratetype() so it will not be much more
> expensive than what is already there.
>
> If there is an allocation or free within a pageblock that has the
> PG_migrate_skip bit set then we increment a counter. When the counter
> reaches some to-be-decided threshold then compaction may clear all the
> bits. This would match the criteria of the clearing being based on
> activity.
>
> There are four potential problems with this
>
> 1. The logic to retrieve all the bits and split them up will be a little
>    convoluted but maybe it would not be that bad.
>
> 2. The counter is a shared-writable cache line but obviously it could be
>    moved to vmstat and incremented with inc_zone_page_state to offset
>    the cost a little.
>
> 3. The biggest weakness is that there is no way to know if the counter
>    is incremented based on activity in a small subset of blocks.
>
> 4. What should the threshold be?
>
> The first problem is minor but the other three are potentially a mess.
> Adding another vmstat counter is bad enough in itself but if the counter
> is incremented based on a small subset of pageblocks, the hint becomes
> potentially useless.
>
> However, does this match what you have in mind or am I over-complicating
> things?

Sounds complicated.

Using wall time really does suck.  Are you sure you can't think of
something more logical?

How would we demonstrate the suckage?  What would be the observable
downside of switching that 5 seconds to 5 hours?

> +	for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
> +		struct page *page;
> +
> +		if (!pfn_valid(pfn))
> +			continue;
> +
> +		page = pfn_to_page(pfn);
> +		if (zone != page_zone(page))
> +			continue;
> +
> +		clear_pageblock_skip(page);
> +	}

What's the worst-case loop count here?
> > > zone->spanned_pages >> pageblock_order

What's the worst-case value of (zone->spanned_pages >> pageblock_order) :)

> Lets take an unlikely case - 128G single-node machine. That loop count
> on x86-64 would be 65536. It'll be fast enough, particularly in this
> path.

That could easily exceed a millisecond.  Can/should we stick a
cond_resched() in there?
Re: [Qemu-devel] [PATCH 5/9] mm: compaction: Acquire the zone->lru_lock as late as possible
On Tue, 25 Sep 2012 17:13:27 +0900 Minchan Kim minc...@kernel.org wrote:

> I see. To me, your saying is better than current comment.
> I hope comment could be more explicit.
>
> diff --git a/mm/compaction.c b/mm/compaction.c
> index df01b4e..f1d2cc7 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -542,8 +542,9 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
>  		 * splitting and collapsing (collapsing has already happened
>  		 * if PageLRU is set) but the lock is not necessarily taken
>  		 * here and it is wasteful to take it just to check transhuge.
> -		 * Check transhuge without lock and skip if it's either a
> -		 * transhuge or hugetlbfs page.
> +		 * Check transhuge without lock and *skip* if it's either a
> +		 * transhuge or hugetlbfs page because it's not safe to call
> +		 * compound_order.
>  		 */
>  		if (PageTransHuge(page)) {
>  			if (!locked)

Going a bit further:

--- a/mm/compaction.c~mm-compaction-acquire-the-zone-lru_lock-as-late-as-possible-fix
+++ a/mm/compaction.c
@@ -415,7 +415,8 @@ isolate_migratepages_range(struct zone *
 		 * if PageLRU is set) but the lock is not necessarily taken
 		 * here and it is wasteful to take it just to check transhuge.
 		 * Check transhuge without lock and skip if it's either a
-		 * transhuge or hugetlbfs page.
+		 * transhuge or hugetlbfs page because calling compound_order()
+		 * requires lru_lock to exclude isolation and splitting.
 		 */
 		if (PageTransHuge(page)) {
 			if (!locked)
_

but... the requirement to hold lru_lock for compound_order() is news to
me.  It doesn't seem to be written down or explained anywhere, and one
wonders why the cheerily undocumented compound_lock() doesn't have this
effect.

What's going on here??
Re: [Qemu-devel] [PATCH 8/9] mm: compaction: Cache if a pageblock was scanned and no pages were isolated
On Mon, 24 Sep 2012 10:39:38 +0100 Mel Gorman mgor...@suse.de wrote:

> On Fri, Sep 21, 2012 at 02:36:56PM -0700, Andrew Morton wrote:
> > Also, what has to be done to avoid the polling altogether?  eg/ie, zap
> > a pageblock's PB_migrate_skip synchronously, when something was done to
> > that pageblock which justifies repolling it?
>
> The "something" event you are looking for is pages being freed or
> allocated in the page allocator. A movable page being allocated in block
> or a page being freed should clear the PB_migrate_skip bit if it's set.
> Unfortunately this would impact the fast path of the alloc and free
> paths of the page allocator. I felt that that was too high a price to
> pay.

We already do a similar thing in the page allocator: clearing of
->all_unreclaimable and ->pages_scanned.  But that isn't on the fast path
really - it happens once per pcp unload.  Can we do something like that?

Drop some hint into the zone without having to visit each page?

> ...
>
> +static void reset_isolation_suitable(struct zone *zone)
> +{
> +	unsigned long start_pfn = zone->zone_start_pfn;
> +	unsigned long end_pfn = zone->zone_start_pfn + zone->spanned_pages;
> +	unsigned long pfn;
> +
> +	/*
> +	 * Do not reset more than once every five seconds. If allocations are
> +	 * failing sufficiently quickly to allow this to happen then continually
> +	 * scanning for compaction is not going to help. The choice of five
> +	 * seconds is arbitrary but will mitigate excessive scanning.
> +	 */
> +	if (time_before(jiffies, zone->compact_blockskip_expire))
> +		return;
> +	zone->compact_blockskip_expire = jiffies + (HZ * 5);
> +
> +	/* Walk the zone and mark every pageblock as suitable for isolation */
> +	for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
> +		struct page *page;
> +
> +		if (!pfn_valid(pfn))
> +			continue;
> +
> +		page = pfn_to_page(pfn);
> +		if (zone != page_zone(page))
> +			continue;
> +
> +		clear_pageblock_skip(page);
> +	}

What's the worst-case loop count here?
> zone->spanned_pages >> pageblock_order

What's the worst-case value of (zone->spanned_pages >> pageblock_order) :)
Re: [Qemu-devel] [PATCH 4/9] mm: compaction: Abort compaction loop if lock is contended or run too long
On Fri, 21 Sep 2012 11:46:18 +0100 Mel Gorman mgor...@suse.de wrote:

> Changelog since V2
> o Fix BUG_ON triggered due to pages left on cc.migratepages
> o Make compact_zone_order() require non-NULL arg `contended'
>
> Changelog since V1
> o only abort the compaction if lock is contended or run too long
> o Rearranged the code by Andrea Arcangeli.
>
> isolate_migratepages_range() might isolate no pages if for example when
> zone->lru_lock is contended and running asynchronous compaction. In this
> case, we should abort compaction, otherwise, compact_zone will run a
> useless loop and make zone->lru_lock even more contended.

hm, this appears to be identical to

mm-compaction-abort-compaction-loop-if-lock-is-contended-or-run-too-long.patch
mm-compaction-abort-compaction-loop-if-lock-is-contended-or-run-too-long-fix.patch
mm-compaction-abort-compaction-loop-if-lock-is-contended-or-run-too-long-fix-2.patch

so I simply omitted patches 2, 3 and 4.
Re: [Qemu-devel] [PATCH 8/9] mm: compaction: Cache if a pageblock was scanned and no pages were isolated
On Fri, 21 Sep 2012 11:46:22 +0100 Mel Gorman mgor...@suse.de wrote:

> When compaction was implemented it was known that scanning could
> potentially be excessive. The ideal was that a counter be maintained
> for each pageblock but maintaining this information would incur a
> severe penalty due to a shared writable cache line. It has reached the
> point where the scanning costs are a serious problem, particularly on
> long-lived systems where a large process starts and allocates a large
> number of THPs at the same time.
>
> Instead of using a shared counter, this patch adds another bit to the
> pageblock flags called PG_migrate_skip. If a pageblock is scanned by
> either migrate or free scanner and 0 pages were isolated, the pageblock
> is marked to be skipped in the future. When scanning, this bit is
> checked before any scanning takes place and the block skipped if set.
>
> The main difficulty with a patch like this is "when to ignore the
> cached information?" If it's ignored too often, the scanning rates will
> still be excessive. If the information is too stale then allocations
> will fail that might have otherwise succeeded. In this patch
>
> o CMA always ignores the information
> o If the migrate and free scanner meet then the cached information will
>   be discarded if it's at least 5 seconds since the last time the cache
>   was discarded
> o If there are a large number of allocation failures, discard the cache.
>
> The time-based heuristic is very clumsy but there are few choices for a
> better event. Depending solely on multiple allocation failures still
> allows excessive scanning when THP allocations are failing in quick
> succession due to memory pressure. Waiting until memory pressure is
> relieved would cause compaction to continually fail instead of using
> reclaim/compaction to try allocate the page. The time-based mechanism
> is clumsy but a better option is not obvious.

ick.

Wall time has sooo little relationship to what's happening in there.
If we *have* to use polling, cannot we clock the poll with some metric
which is at least vaguely related to the amount of activity?  Number (or
proportion) of pages scanned, for example?  Or reset everything on the
Nth trip around the zone?  Or even a combination of one of these *and*
of wall time, so the system will at least work harder when MM is under
load.

Also, what has to be done to avoid the polling altogether?  eg/ie, zap a
pageblock's PB_migrate_skip synchronously, when something was done to
that pageblock which justifies repolling it?

> ...
>
> +static void reset_isolation_suitable(struct zone *zone)
> +{
> +	unsigned long start_pfn = zone->zone_start_pfn;
> +	unsigned long end_pfn = zone->zone_start_pfn + zone->spanned_pages;
> +	unsigned long pfn;
> +
> +	/*
> +	 * Do not reset more than once every five seconds. If allocations are
> +	 * failing sufficiently quickly to allow this to happen then continually
> +	 * scanning for compaction is not going to help. The choice of five
> +	 * seconds is arbitrary but will mitigate excessive scanning.
> +	 */
> +	if (time_before(jiffies, zone->compact_blockskip_expire))
> +		return;
> +	zone->compact_blockskip_expire = jiffies + (HZ * 5);
> +
> +	/* Walk the zone and mark every pageblock as suitable for isolation */
> +	for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
> +		struct page *page;
> +
> +		if (!pfn_valid(pfn))
> +			continue;
> +
> +		page = pfn_to_page(pfn);
> +		if (zone != page_zone(page))
> +			continue;
> +
> +		clear_pageblock_skip(page);
> +	}

What's the worst-case loop count here?

> +}
> +
>
> ...
Re: [Qemu-devel] [PATCH 6/9] mm: compaction: Acquire the zone->lock as late as possible
On Fri, 21 Sep 2012 11:46:20 +0100 Mel Gorman mgor...@suse.de wrote:

> Compaction's free scanner acquires the zone->lock when checking for
> PageBuddy pages and isolating them. It does this even if there are no
> PageBuddy pages in the range.
>
> This patch defers acquiring the zone lock for as long as possible. In
> the event there are no free pages in the pageblock then the lock will
> not be acquired at all which reduces contention on zone->lock.
>
> ...
>
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -93,6 +93,28 @@ static inline bool compact_trylock_irqsave(spinlock_t *lock,
>  	return compact_checklock_irqsave(lock, flags, false, cc);
>  }
>  
> +/* Returns true if the page is within a block suitable for migration to */
> +static bool suitable_migration_target(struct page *page)
> +{
> +

stray newline

> +	int migratetype = get_pageblock_migratetype(page);
> +
> +	/* Don't interfere with memory hot-remove or the min_free_kbytes blocks */
> +	if (migratetype == MIGRATE_ISOLATE || migratetype == MIGRATE_RESERVE)
> +		return false;
> +
> +	/* If the page is a large free page, then allow migration */
> +	if (PageBuddy(page) && page_order(page) >= pageblock_order)
> +		return true;
> +
> +	/* If the block is MIGRATE_MOVABLE or MIGRATE_CMA, allow migration */
> +	if (migrate_async_suitable(migratetype))
> +		return true;
> +
> +	/* Otherwise skip the block */
> +	return false;
> +}
> +
> ...
>
> @@ -168,23 +193,38 @@ static unsigned long isolate_freepages_block(unsigned long blockpfn,
>  		int isolated, i;
>  		struct page *page = cursor;
>  
> -		if (!pfn_valid_within(blockpfn)) {
> -			if (strict)
> -				return 0;
> -			continue;
> -		}
> +		if (!pfn_valid_within(blockpfn))
> +			goto strict_check;
>  		nr_scanned++;
>  
> -		if (!PageBuddy(page)) {
> -			if (strict)
> -				return 0;
> -			continue;
> -		}
> +		if (!PageBuddy(page))
> +			goto strict_check;
> +
> +		/*
> +		 * The zone lock must be held to isolate freepages. This
> +		 * unfortunately is a very coarse lock and can be heavily
> +		 * contended if there are parallel allocations or parallel
> +		 * compactions. For async compaction do not spin on the
> +		 * lock and we acquire the lock as late as possible.
> +		 */
> +		locked = compact_checklock_irqsave(&cc->zone->lock, flags,
> +								locked, cc);
> +		if (!locked)
> +			break;
> +
> +		/* Recheck this is a suitable migration target under lock */
> +		if (!strict && !suitable_migration_target(page))
> +			break;
> +
> +		/* Recheck this is a buddy page under lock */
> +		if (!PageBuddy(page))
> +			goto strict_check;
>  
>  		/* Found a free page, break it into order-0 pages */
>  		isolated = split_free_page(page);
>  		if (!isolated && strict)
> -			return 0;
> +			goto strict_check;
>  		total_isolated += isolated;
>  		for (i = 0; i < isolated; i++) {
>  			list_add(&page->lru, freelist);
> @@ -196,9 +236,23 @@ static unsigned long isolate_freepages_block(unsigned long blockpfn,
>  			blockpfn += isolated - 1;
>  			cursor += isolated - 1;
>  		}
> +
> +		continue;
> +
> +strict_check:
> +		/* Abort isolation if the caller requested strict isolation */
> +		if (strict) {
> +			total_isolated = 0;
> +			goto out;
> +		}
>  	}
>  
>  	trace_mm_compaction_isolate_freepages(nr_scanned, total_isolated);
> +
> +out:
> +	if (locked)
> +		spin_unlock_irqrestore(&cc->zone->lock, flags);
> +
>  	return total_isolated;
>  }

A few things about this function.

Would it be cleaner if we did

	if (!strict) {
		if (!suitable_migration_target(page))
			break;
	} else {
		if (!PageBuddy(page))
			goto strict_check;
	}

and then remove the test of `strict' from strict_check (which then
should be renamed)?

Which perhaps means that the whole `strict_check' block can go away:

	if (!strict) {
		if (!suitable_migration_target(page))
			break;
	} else {
		if (!PageBuddy(page)) {
			total_isolated = 0;
			goto out;
		}
	}

Have a think about it?  The function is a little straggly.

Secondly, is it correct/desirable to skip the (now misnamed
Re: [Qemu-devel] [PATCH 0/2] core dump: re-purpose VM_ALWAYSDUMP to user controlled VM_DONTDUMP
On Wed, 7 Mar 2012 12:00:46 -0500 Jason Baron jba...@redhat.com wrote:

> Hi,
>
> The motivation for this change was that I was looking at a way for a
> qemu-kvm process, to exclude the guest memory from its core dump, which
> can be quite large. There are already a number of filter flags in
> /proc/<pid>/coredump_filter, however, these allow one to specify 'types'
> of kernel memory, not specific address ranges (which is needed in this
> case).
>
> Since there are no more vma flags available, the first patch eliminates
> the need for the 'VM_ALWAYSDUMP' flag. The flag is used internally by
> the kernel to mark vdso and vsyscall pages. However, it is simple
> enough to check if a vma covers a vdso or vsyscall page without the
> need for this flag.

Gee, we ran out?  That makes it pretty inevitable that we will grow the
vma by four bytes.  Once we have done that, your always_dump_vma()
trickery becomes unneeded and undesirable, yes?  If so, we may as well
recognise reality and grow the vma now.

> The second patch then replaces the 'VM_ALWAYSDUMP' flag with a new
> 'VM_DONTDUMP' flag, which can be set by userspace using new madvise
> flags: 'MADV_DONTDUMP', and unset via 'MADV_DUMP'. The core dump
> filters continue to work the same as before unless 'MADV_DONTDUMP' is
> set on the region.
>
> The qemu code which implements this feature is at:
>
> http://people.redhat.com/~jbaron/qemu-dump/qemu-dump.patch
>
> In my testing the qemu core dump shrunk from 383MB -> 13MB with this
> patch.
>
> I also believe that the 'MADV_DONTDUMP' flag might be useful for
> security sensitive apps, which might want to select which areas are
> dumped.

Is there any way for userspace to query the state of the flag?