Re: [External] Re: [PATCH v11 1/2] memory tier: dax/kmem: introduce an abstract layer for finding, allocating, and putting memory types

2024-04-09 Thread Andrew Morton
On Tue, 9 Apr 2024 12:00:06 -0700 "Ho-Ren (Jack) Chuang" wrote:

> Hi Jonathan,
> 
> On Fri, Apr 5, 2024 at 6:56 AM Jonathan Cameron wrote:
> >
> > On Fri,  5 Apr 2024 00:07:05 +
> > "Ho-Ren (Jack) Chuang"  wrote:
> >
> > > Since different memory devices require finding, allocating, and putting
> > > memory types, these common steps are abstracted in this patch,
> > > enhancing the scalability and conciseness of the code.
> > >
> > > Signed-off-by: Ho-Ren (Jack) Chuang 
> > > Reviewed-by: "Huang, Ying" 
> > Reviewed-by: Jonathan Cameron 
> >
> Thank you for reviewing and for adding your "Reviewed-by"!
> I was wondering if I need to send a v12 and manually add
> this to the commit description, or if this is sufficient.

I had added Jonathan's r-b to the mm.git copy of this patch.



Re: [Qemu-devel] [PATCH 1/1] mm: gup: teach get_user_pages_unlocked to handle FOLL_NOWAIT

2018-03-02 Thread Andrew Morton
On Fri,  2 Mar 2018 18:43:43 +0100 Andrea Arcangeli  wrote:

> KVM is hanging during postcopy live migration with userfaultfd because
> get_user_pages_unlocked is not capable of handling FOLL_NOWAIT.
> 
> Earlier FOLL_NOWAIT was only ever passed to get_user_pages.
> 
> Specifically faultin_page (the callee of get_user_pages_unlocked
> caller) doesn't know that if FAULT_FLAG_RETRY_NOWAIT was set in the
> page fault flags, when VM_FAULT_RETRY is returned, the mmap_sem wasn't
> actually released (even if nonblocking is not NULL). So it sets
> *nonblocking to zero and the caller won't release the mmap_sem
> thinking it was already released, but it wasn't because of
> FOLL_NOWAIT.
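
To make the locking handshake concrete, here is a minimal sketch (not the posted fix) of a caller driving GUP with FOLL_NOWAIT through the *_locked variant; it assumes the era's get_user_pages_locked() signature (no mm argument, an int *locked output) and that the variant accepts FOLL_NOWAIT:

/*
 * Hedged sketch, not from this patch: shows why *locked must stay set
 * when FOLL_NOWAIT prevents mmap_sem from being dropped.
 */
static long gup_one_page_nowait(struct mm_struct *mm, unsigned long addr,
				struct page **page)
{
	int locked = 1;
	long ret;

	down_read(&mm->mmap_sem);
	ret = get_user_pages_locked(addr, 1, FOLL_NOWAIT, page, &locked);
	/*
	 * With FOLL_NOWAIT the fault path may return VM_FAULT_RETRY
	 * without releasing mmap_sem.  If the GUP code cleared `locked`
	 * anyway (the bug described above), this caller would skip the
	 * up_read() below and leave the semaphore held forever.
	 */
	if (locked)
		up_read(&mm->mmap_sem);
	return ret;
}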
> 
> Reported-by: Dr. David Alan Gilbert 
> Tested-by: Dr. David Alan Gilbert 
> Signed-off-by: Andrea Arcangeli 

I added

Fixes: ce53053ce378c ("kvm: switch get_user_page_nowait() to get_user_pages_unlocked()")
Cc: 



Re: [Qemu-devel] [PATCH v9 3/5] mm: function to offer a page block on the free list

2017-04-13 Thread Andrew Morton
On Thu, 13 Apr 2017 17:35:06 +0800 Wei Wang  wrote:

> Add a function to find a page block on the free list specified by the
> caller. Pages from the page block may be used immediately after the
> function returns. The caller is responsible for detecting or preventing
> the use of such pages.
> 
> ...
>
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4498,6 +4498,93 @@ void show_free_areas(unsigned int filter)
>   show_swap_cache_info();
>  }
>  
> +/**
> + * Heuristically get a page block in the system that is unused.
> + * It is possible that pages from the page block are used immediately after
> + * inquire_unused_page_block() returns. It is the caller's responsibility
> + * to either detect or prevent the use of such pages.
> + *
> + * The free list to check: zone->free_area[order].free_list[migratetype].
> + *
> + * If the caller supplied page block (i.e. **page) is on the free list, offer
> + * the next page block on the list to the caller. Otherwise, offer the first
> + * page block on the list.
> + *
> + * Return 0 when a page block is found on the caller specified free list.
> + */
> +int inquire_unused_page_block(struct zone *zone, unsigned int order,
> +   unsigned int migratetype, struct page **page)
> +{

Perhaps we can wrap this in the appropriate ifdef so the kernels which
won't be using virtio-balloon don't carry the added overhead.
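
For illustration, a minimal sketch of the kind of guard being asked for; CONFIG_VIRTIO_BALLOON as the symbol and the export are assumptions, not taken from the patch:

/* Hedged sketch: compile the helper out unless something selects it. */
#ifdef CONFIG_VIRTIO_BALLOON
int inquire_unused_page_block(struct zone *zone, unsigned int order,
			      unsigned int migratetype, struct page **page)
{
	/* ... body as posted above ... */
	return -ENOSPC;	/* placeholder so the sketch stands alone */
}
EXPORT_SYMBOL(inquire_unused_page_block);	/* needed if the balloon driver is modular */
#endif /* CONFIG_VIRTIO_BALLOON */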





Re: [Qemu-devel] [PATCH kernel v8 3/4] mm: add inerface to offer info about unused pages

2017-03-16 Thread Andrew Morton
On Thu, 16 Mar 2017 15:08:46 +0800 Wei Wang  wrote:

> From: Liang Li 
> 
> This patch adds a function to provide a snapshot of the system's
> currently unused pages. An important usage of this function is to provide
> the unused pages to the live migration thread, which skips the transfer
> of those unused pages. Newly used pages can be re-tracked by the dirty
> page logging mechanisms.

I don't think this will be useful for anything other than
virtio-balloon.  I guess it would be better to keep this code in the
virtio-balloon driver if possible, even though that's rather a layering
violation :( What would have to be done to make that possible?  Perhaps
we can put some *small* helpers into page_alloc.c to prevent things
from becoming too ugly.

> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4498,6 +4498,120 @@ void show_free_areas(unsigned int filter)
>   show_swap_cache_info();
>  }
>  
> +static int __record_unused_pages(struct zone *zone, int order,
> +  __le64 *buf, unsigned int size,
> +  unsigned int *offset, bool part_fill)
> +{
> + unsigned long pfn, flags;
> + int t, ret = 0;
> + struct list_head *curr;
> + __le64 *chunk;
> +
> + if (zone_is_empty(zone))
> + return 0;
> +
> + spin_lock_irqsave(&zone->lock, flags);
> +
> + if (*offset + zone->free_area[order].nr_free > size && !part_fill) {
> + ret = -ENOSPC;
> + goto out;
> + }
> + for (t = 0; t < MIGRATE_TYPES; t++) {
> + list_for_each(curr, &zone->free_area[order].free_list[t]) {
> + pfn = page_to_pfn(list_entry(curr, struct page, lru));
> + chunk = buf + *offset;
> + if (*offset + 2 > size) {
> + ret = -ENOSPC;
> + goto out;
> + }
> + /* Align to the chunk format used in virtio-balloon */
> + *chunk = cpu_to_le64(pfn << 12);
> + *(chunk + 1) = cpu_to_le64((1 << order) << 12);
> + *offset += 2;
> + }
> + }
> +
> +out:
> + spin_unlock_irqrestore(&zone->lock, flags);
> +
> + return ret;
> +}

This looks like it could disable interrupts for a long time.  Too long?
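
One way to address that concern, sketched here purely for illustration (not part of the posted series), is to cap how many free-list entries are visited per lock hold and let the caller retry:

/*
 * Hedged sketch: bound the irq-off window by capping the entries walked
 * under one lock hold.  Resuming an interrupted walk is left to the
 * caller; the free list may change once the lock is dropped.
 */
#define RECORD_UNUSED_BATCH	256	/* arbitrary illustrative cap */

static int record_one_batch(struct zone *zone, int order, int t,
			    __le64 *buf, unsigned int size,
			    unsigned int *offset)
{
	unsigned long flags;
	unsigned int visited = 0;
	struct page *page;
	int ret = 0;

	spin_lock_irqsave(&zone->lock, flags);
	list_for_each_entry(page, &zone->free_area[order].free_list[t], lru) {
		if (++visited > RECORD_UNUSED_BATCH) {
			ret = -EAGAIN;
			break;
		}
		if (*offset + 2 > size) {
			ret = -ENOSPC;
			break;
		}
		buf[(*offset)++] = cpu_to_le64(page_to_pfn(page) << PAGE_SHIFT);
		buf[(*offset)++] = cpu_to_le64((1UL << order) << PAGE_SHIFT);
	}
	spin_unlock_irqrestore(&zone->lock, flags);

	return ret;
}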

> +/*
> + * The record_unused_pages() function is used to record the system unused
> + * pages. The unused pages can be skipped during live migration transfer.
> + * Though the unused pages are dynamically changing, dirty page logging
> + * mechanisms are able to capture the newly used pages even if they were
> + * recorded as unused pages via this function.
> + *
> + * This function scans the free page list of the specified order to record
> + * the unused pages, and chunks those continuous pages following the chunk
> + * format below:
> + * -----------------------------------
> + * | Base (52-bit)   | Rsvd (12-bit) |
> + * -----------------------------------
> + * -----------------------------------
> + * | Size (52-bit)   | Rsvd (12-bit) |
> + * -----------------------------------
> + *
> + * @start_zone: zone to start the record operation.
> + * @order: order of the free page list to record.
> + * @buf: buffer to record the unused page info in chunks.
> + * @size: size of the buffer in __le64 to record
> + * @offset: offset in the buffer to record.
> + * @part_fill: indicate if partial fill is used.
> + *
> + * return -EINVAL if parameter is invalid
> + * return -ENOSPC when the buffer is too small to record all the unused pages
> + * return 0 on success
> + */

It's a strange thing - it returns information which will instantly
become incorrect.
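
For what it's worth, decoding one chunk pair written in the format described above looks like this (a hedged sketch; PAGE_SHIFT == 12 is assumed, matching the hard-coded shifts in the quoted code):

static void decode_unused_chunk(__le64 base_le, __le64 size_le,
				unsigned long *pfn, unsigned long *nr_pages)
{
	u64 base = le64_to_cpu(base_le);	/* byte address, low 12 bits reserved */
	u64 size = le64_to_cpu(size_le);	/* byte length, low 12 bits reserved */

	*pfn = base >> 12;
	*nr_pages = size >> 12;
}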

> +int record_unused_pages(struct zone **start_zone, int order,
> + __le64 *buf, unsigned int size,
> + unsigned int *offset, bool part_fill)
> +{
> + struct zone *zone;
> + int ret = 0;
> + bool skip_check = false;
> +
> + /* Make sure all the parameters are valid */
> + if (buf == NULL || offset == NULL || order >= MAX_ORDER)
> + return -EINVAL;
> +
> + if (*start_zone != NULL) {
> + bool found = false;
> +
> + for_each_populated_zone(zone) {
> + if (zone != *start_zone)
> + continue;
> + found = true;
> + break;
> + }
> + if (!found)
> + return -EINVAL;
> + } else
> + skip_check = true;
> +
> + for_each_populated_zone(zone) {
> + /* Start from *start_zone if it's not NULL */
> + if (!skip_check) {
> + if (*start_zone != zone)
> + continue;
> + else
> + skip_check = true;
> + }
> + ret = __record_unused_pages(zone, order, buf, 

Re: [Qemu-devel] [PATCH 22/23] userfaultfd: avoid mmap_sem read recursion in mcopy_atomic

2015-05-22 Thread Andrew Morton

There's a more serious failure with i386 allmodconfig:

fs/userfaultfd.c:145:2: note: in expansion of macro 'BUILD_BUG_ON'
  BUILD_BUG_ON(sizeof(struct uffd_msg) != 32);

I'm surprised the feature is even reachable on i386 builds?



Re: [Qemu-devel] [PATCH 22/23] userfaultfd: avoid mmap_sem read recursion in mcopy_atomic

2015-05-22 Thread Andrew Morton
On Thu, 14 May 2015 19:31:19 +0200 Andrea Arcangeli aarca...@redhat.com wrote:

 If the rwsem starves writers it wasn't strictly a bug but lockdep
 doesn't like it and this avoids depending on lowlevel implementation
 details of the lock.
 
 ...

 @@ -229,13 +246,33 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
  
   if (!zeropage)
   err = mcopy_atomic_pte(dst_mm, dst_pmd, dst_vma,
 -dst_addr, src_addr);
 +dst_addr, src_addr, page);
   else
   err = mfill_zeropage_pte(dst_mm, dst_pmd, dst_vma,
dst_addr);
  
   cond_resched();
  
 + if (unlikely(err == -EFAULT)) {
 + void *page_kaddr;
 +
 + BUILD_BUG_ON(zeropage);

I'm not sure what this is trying to do.  BUILD_BUG_ON(local_variable)?

It goes bang in my build.  I'll just delete it.

 + up_read(&dst_mm->mmap_sem);
 + BUG_ON(!page);
 +
 + page_kaddr = kmap(page);
 + err = copy_from_user(page_kaddr,
 +  (const void __user *) src_addr,
 +  PAGE_SIZE);
 + kunmap(page);
 + if (unlikely(err)) {
 + err = -EFAULT;
 + goto out;
 + }
 + goto retry;
 + } else
 + BUG_ON(page);
 +




Re: [Qemu-devel] [PATCH 00/23] userfaultfd v4

2015-05-19 Thread Andrew Morton
On Thu, 14 May 2015 19:30:57 +0200 Andrea Arcangeli aarca...@redhat.com wrote:

 This is the latest userfaultfd patchset against mm-v4.1-rc3
 2015-05-14-10:04.

It would be useful to have some userfaultfd testcases in
tools/testing/selftests/.  Partly as an aid to arch maintainers when
enabling this.  And also as a standalone thing to give people a
practical way of exercising this interface.
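
A hedged sketch of the sort of minimal selftest being suggested (error handling and the faulting thread a real test needs are omitted):

#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

int main(void)
{
	long page = sysconf(_SC_PAGESIZE);
	int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
	char *area = mmap(NULL, page, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	struct uffdio_api api = { .api = UFFD_API };
	struct uffdio_register reg = {
		.range = { .start = (unsigned long)area, .len = page },
		.mode = UFFDIO_REGISTER_MODE_MISSING,
	};

	if (uffd < 0 || area == MAP_FAILED)
		return 1;
	if (ioctl(uffd, UFFDIO_API, &api) || ioctl(uffd, UFFDIO_REGISTER, &reg))
		return 1;

	/*
	 * A real test would touch `area` in another thread, read the
	 * resulting struct uffd_msg from uffd, and resolve the fault
	 * with UFFDIO_COPY.
	 */
	return 0;
}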

What are your thoughts on enabling userfaultfd for other architectures,
btw?  Are there good use cases, are people working on it, etc?


Also, I assume a manpage is in the works?  Sooner rather than later
would be good - Michael's review of proposed kernel interfaces has
often been valuable.




Re: [Qemu-devel] [PATCH 8/9] mm: compaction: Cache if a pageblock was scanned and no pages were isolated

2012-09-25 Thread Andrew Morton
On Tue, 25 Sep 2012 10:12:07 +0100
Mel Gorman mgor...@suse.de wrote:

 First, we'd introduce a variant of get_pageblock_migratetype() that returns
 all the bits for the pageblock flags and then helpers to extract either the
 migratetype or the PG_migrate_skip. We already are incurring the cost of
 get_pageblock_migratetype() so it will not be much more expensive than what
 is already there. If there is an allocation or free within a pageblock that
 has the PG_migrate_skip bit set then we increment a counter. When the counter
 reaches some to-be-decided threshold then compaction may clear all the
 bits. This would match the criteria of the clearing being based on activity.
 
 There are four potential problems with this
 
 1. The logic to retrieve all the bits and split them up will be a little
    convoluted but maybe it would not be that bad.
 
 2. The counter is a shared-writable cache line but obviously it could
be moved to vmstat and incremented with inc_zone_page_state to offset
the cost a little.
 
 3. The biggest weakness is that there is no way to know if the
counter is incremented based on activity in a small subset of blocks.
 
 4. What should the threshold be?
 
 The first problem is minor but the other three are potentially a mess.
 Adding another vmstat counter is bad enough in itself but if the counter
 is incremented based on a small subset of pageblocks, the hint becomes
 potentially useless.
 
 However, does this match what you have in mind or am I over-complicating
 things?

Sounds complicated.

Using wall time really does suck.  Are you sure you can't think of
something more logical?

How would we demonstrate the suckage?  What would be the observable downside of
switching that 5 seconds to 5 hours?

 + for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
 + struct page *page;
 + if (!pfn_valid(pfn))
 + continue;
 +
 + page = pfn_to_page(pfn);
 + if (zone != page_zone(page))
 + continue;
 +
 + clear_pageblock_skip(page);
 + }

What's the worst-case loop count here?

   
   zone->spanned_pages >> pageblock_order
  
  What's the worst-case value of (zone->spanned_pages >> pageblock_order) :)
 
 Lets take an unlikely case - 128G single-node machine. That loop count
 on x86-64 would be 65536. It'll be fast enough, particularly in this
 path.

That could easily exceed a millisecond.  Can/should we stick a
cond_resched() in there?
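
For illustration, the same walk with a periodic cond_resched() dropped in -- a hedged sketch; the stride is arbitrary and it assumes the reset path runs without a spinlock held, as the quoted code appears to:

	for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
		struct page *page;

		/* Yield every 32 pageblocks; the stride is arbitrary. */
		if (!((pfn - start_pfn) % (32 * pageblock_nr_pages)))
			cond_resched();

		if (!pfn_valid(pfn))
			continue;

		page = pfn_to_page(pfn);
		if (zone != page_zone(page))
			continue;

		clear_pageblock_skip(page);
	}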



Re: [Qemu-devel] [PATCH 5/9] mm: compaction: Acquire the zone->lru_lock as late as possible

2012-09-25 Thread Andrew Morton
On Tue, 25 Sep 2012 17:13:27 +0900
Minchan Kim minc...@kernel.org wrote:

 I see. To me, your wording is better than the current comment.
 I hope the comment could be more explicit.
 
 diff --git a/mm/compaction.c b/mm/compaction.c
 index df01b4e..f1d2cc7 100644
 --- a/mm/compaction.c
 +++ b/mm/compaction.c
 @@ -542,8 +542,9 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
  * splitting and collapsing (collapsing has already happened
  * if PageLRU is set) but the lock is not necessarily taken
  * here and it is wasteful to take it just to check transhuge.
 -* Check transhuge without lock and skip if it's either a
 -* transhuge or hugetlbfs page.
 +* Check transhuge without lock and *skip* if it's either a
 +* transhuge or hugetlbfs page because it's not safe to call
 +* compound_order.
  */
 if (PageTransHuge(page)) {
 if (!locked)

Going a bit further:

--- a/mm/compaction.c~mm-compaction-acquire-the-zone-lru_lock-as-late-as-possible-fix
+++ a/mm/compaction.c
@@ -415,7 +415,8 @@ isolate_migratepages_range(struct zone *
 * if PageLRU is set) but the lock is not necessarily taken
 * here and it is wasteful to take it just to check transhuge.
 * Check transhuge without lock and skip if it's either a
-* transhuge or hugetlbfs page.
+* transhuge or hugetlbfs page because calling compound_order()
+* requires lru_lock to exclude isolation and splitting.
 */
if (PageTransHuge(page)) {
if (!locked)
_


but...  the requirement to hold lru_lock for compound_order() is news
to me.  It doesn't seem to be written down or explained anywhere, and
one wonders why the cheerily undocumented compound_lock() doesn't have
this effect.  What's going on here??




Re: [Qemu-devel] [PATCH 8/9] mm: compaction: Cache if a pageblock was scanned and no pages were isolated

2012-09-24 Thread Andrew Morton
On Mon, 24 Sep 2012 10:39:38 +0100
Mel Gorman mgor...@suse.de wrote:

 On Fri, Sep 21, 2012 at 02:36:56PM -0700, Andrew Morton wrote:
 
  Also, what has to be done to avoid the polling altogether?  eg/ie, zap
  a pageblock's PB_migrate_skip synchronously, when something was done to
  that pageblock which justifies repolling it?
  
 
 The "something" event you are looking for is pages being freed or
 allocated in the page allocator. A movable page being allocated in the
 block or a page being freed should clear the PB_migrate_skip bit if it's set.
 Unfortunately this would impact the fast path of the alloc and free paths
 of the page allocator. I felt that that was too high a price to pay.

We already do a similar thing in the page allocator: clearing of
->all_unreclaimable and ->pages_scanned.  But that isn't on the fast
path really - it happens once per pcp unload.  Can we do something
like that?  Drop some hint into the zone without having to visit each
page?
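
Purely as an illustration of that suggestion (the field and helper names below are hypothetical, not from the series): a single per-zone flag set from a slow path such as the pcp drain and consumed once by compaction, so no pageblock bits are touched on the allocator fast path.

static inline void zone_note_free_activity(struct zone *zone)
{
	/* Called from a slow path (e.g. pcp drain), not per page. */
	zone->compact_repoll_hint = true;	/* hypothetical field */
}

static inline bool compaction_should_repoll(struct zone *zone)
{
	if (!zone->compact_repoll_hint)
		return false;
	zone->compact_repoll_hint = false;	/* consume the hint */
	return true;	/* caller then clears the PB_migrate_skip bits */
}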

  
   ...
  
   +static void reset_isolation_suitable(struct zone *zone)
   +{
   + unsigned long start_pfn = zone->zone_start_pfn;
   + unsigned long end_pfn = zone->zone_start_pfn + zone->spanned_pages;
   + unsigned long pfn;
   +
   + /*
   +  * Do not reset more than once every five seconds. If allocations are
   +  * failing sufficiently quickly to allow this to happen then continually
   +  * scanning for compaction is not going to help. The choice of five
   +  * seconds is arbitrary but will mitigate excessive scanning.
   +  */
   + if (time_before(jiffies, zone->compact_blockskip_expire))
   + return;
   + zone->compact_blockskip_expire = jiffies + (HZ * 5);
   +
   + /* Walk the zone and mark every pageblock as suitable for isolation */
   + for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
   + struct page *page;
   + if (!pfn_valid(pfn))
   + continue;
   +
   + page = pfn_to_page(pfn);
   + if (zone != page_zone(page))
   + continue;
   +
   + clear_pageblock_skip(page);
   + }
  
  What's the worst-case loop count here?
  
 
 zone->spanned_pages >> pageblock_order

What's the worst-case value of (zone->spanned_pages >> pageblock_order) :)



Re: [Qemu-devel] [PATCH 4/9] mm: compaction: Abort compaction loop if lock is contended or run too long

2012-09-21 Thread Andrew Morton
On Fri, 21 Sep 2012 11:46:18 +0100
Mel Gorman mgor...@suse.de wrote:

 Changelog since V2
 o Fix BUG_ON triggered due to pages left on cc.migratepages
 o Make compact_zone_order() require non-NULL arg `contended'
 
 Changelog since V1
 o only abort the compaction if lock is contended or run too long
 o Rearranged the code by Andrea Arcangeli.
 
 isolate_migratepages_range() might isolate no pages if, for example,
 zone->lru_lock is contended and asynchronous compaction is running. In this
 case, we should abort compaction; otherwise, compact_zone will run a
 useless loop and make zone->lru_lock even more contended.

hm, this appears to be identical to

mm-compaction-abort-compaction-loop-if-lock-is-contended-or-run-too-long.patch
mm-compaction-abort-compaction-loop-if-lock-is-contended-or-run-too-long-fix.patch
mm-compaction-abort-compaction-loop-if-lock-is-contended-or-run-too-long-fix-2.patch

so I simply omitted patches 2, 3 and 4.



Re: [Qemu-devel] [PATCH 8/9] mm: compaction: Cache if a pageblock was scanned and no pages were isolated

2012-09-21 Thread Andrew Morton
On Fri, 21 Sep 2012 11:46:22 +0100
Mel Gorman mgor...@suse.de wrote:

 When compaction was implemented it was known that scanning could potentially
 be excessive. The ideal was that a counter be maintained for each pageblock
 but maintaining this information would incur a severe penalty due to a
 shared writable cache line. It has reached the point where the scanning
 costs are a serious problem, particularly on long-lived systems where a
 large process starts and allocates a large number of THPs at the same time.
 
 Instead of using a shared counter, this patch adds another bit to the
 pageblock flags called PG_migrate_skip. If a pageblock is scanned by
 either migrate or free scanner and 0 pages were isolated, the pageblock
 is marked to be skipped in the future. When scanning, this bit is checked
 before any scanning takes place and the block skipped if set.
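
In code terms the check amounts to something like this hedged sketch (get/set_pageblock_skip being the accessors for the new bit described above):

static bool skip_cached(struct page *page)
{
	/* A previous scan of this pageblock isolated nothing. */
	return get_pageblock_skip(page);
}

static void update_skip_cache(struct page *page, unsigned long nr_isolated)
{
	if (!nr_isolated)
		set_pageblock_skip(page);
}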
 
 The main difficulty with a patch like this is when to ignore the cached
 information. If it's ignored too often, the scanning rates will still
 be excessive. If the information is too stale then allocations will fail
 that might have otherwise succeeded. In this patch
 
 o CMA always ignores the information
 o If the migrate and free scanner meet then the cached information will
   be discarded if it's at least 5 seconds since the last time the cache
   was discarded
 o If there are a large number of allocation failures, discard the cache.
 
 The time-based heuristic is very clumsy but there are few choices for a
 better event. Depending solely on multiple allocation failures still allows
 excessive scanning when THP allocations are failing in quick succession
 due to memory pressure. Waiting until memory pressure is relieved would
 cause compaction to continually fail instead of using reclaim/compaction
 to try to allocate the page. The time-based mechanism is clumsy but a better
 option is not obvious.

ick.

Wall time has sooo little relationship to what's happening in there. 
If we *have* to use polling, cannot we clock the poll with some metric
which is at least vaguely related to the amount of activity?  Number
(or proportion) of pages scanned, for example?  Or reset everything on
the Nth trip around the zone?  Or even a combination of one of these
*and* of wall time, so the system will at least work harder when MM is
under load.
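
A hedged sketch of what that could look like, clocking the reset on scanner activity rather than wall time (the counter field and threshold are hypothetical, purely to illustrate the suggestion):

#define COMPACT_RESET_SCAN_PASSES	4	/* arbitrary */

static bool should_reset_blockskip(struct zone *zone,
				   unsigned long nr_blocks_scanned)
{
	unsigned long zone_blocks = zone->spanned_pages >> pageblock_order;

	/* compact_blocks_scanned is a hypothetical per-zone counter. */
	zone->compact_blocks_scanned += nr_blocks_scanned;
	if (zone->compact_blocks_scanned < COMPACT_RESET_SCAN_PASSES * zone_blocks)
		return false;

	zone->compact_blocks_scanned = 0;
	return true;	/* caller walks the zone and clears the skip bits */
}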

Also, what has to be done to avoid the polling altogether?  eg/ie, zap
a pageblock's PB_migrate_skip synchronously, when something was done to
that pageblock which justifies repolling it?


 ...

 +static void reset_isolation_suitable(struct zone *zone)
 +{
 + unsigned long start_pfn = zone->zone_start_pfn;
 + unsigned long end_pfn = zone->zone_start_pfn + zone->spanned_pages;
 + unsigned long pfn;
 +
 + /*
 +  * Do not reset more than once every five seconds. If allocations are
 +  * failing sufficiently quickly to allow this to happen then continually
 +  * scanning for compaction is not going to help. The choice of five
 +  * seconds is arbitrary but will mitigate excessive scanning.
 +  */
 + if (time_before(jiffies, zone->compact_blockskip_expire))
 + return;
 + zone->compact_blockskip_expire = jiffies + (HZ * 5);
 +
 + /* Walk the zone and mark every pageblock as suitable for isolation */
 + for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
 + struct page *page;
 + if (!pfn_valid(pfn))
 + continue;
 +
 + page = pfn_to_page(pfn);
 + if (zone != page_zone(page))
 + continue;
 +
 + clear_pageblock_skip(page);
 + }

What's the worst-case loop count here?

 +}
 +

 ...





Re: [Qemu-devel] [PATCH 6/9] mm: compaction: Acquire the zone->lock as late as possible

2012-09-21 Thread Andrew Morton
On Fri, 21 Sep 2012 11:46:20 +0100
Mel Gorman mgor...@suse.de wrote:

 Compaction's free scanner acquires the zone->lock when checking for PageBuddy
 pages and isolating them. It does this even if there are no PageBuddy pages
 in the range.
 
 This patch defers acquiring the zone lock for as long as possible. In the
 event there are no free pages in the pageblock then the lock will not be
 acquired at all, which reduces contention on zone->lock.

 ...

 --- a/mm/compaction.c
 +++ b/mm/compaction.c
 @@ -93,6 +93,28 @@ static inline bool compact_trylock_irqsave(spinlock_t *lock,
   return compact_checklock_irqsave(lock, flags, false, cc);
  }
  
 +/* Returns true if the page is within a block suitable for migration to */
 +static bool suitable_migration_target(struct page *page)
 +{
 +

stray newline

 + int migratetype = get_pageblock_migratetype(page);
 +
 + /* Don't interfere with memory hot-remove or the min_free_kbytes blocks */
 + if (migratetype == MIGRATE_ISOLATE || migratetype == MIGRATE_RESERVE)
 + return false;
 +
 + /* If the page is a large free page, then allow migration */
 + if (PageBuddy(page) && page_order(page) >= pageblock_order)
 + return true;
 +
 + /* If the block is MIGRATE_MOVABLE or MIGRATE_CMA, allow migration */
 + if (migrate_async_suitable(migratetype))
 + return true;
 +
 + /* Otherwise skip the block */
 + return false;
 +}
 +

 ...

 @@ -168,23 +193,38 @@ static unsigned long isolate_freepages_block(unsigned long blockpfn,
   int isolated, i;
   struct page *page = cursor;
  
 - if (!pfn_valid_within(blockpfn)) {
 - if (strict)
 - return 0;
 - continue;
 - }
 + if (!pfn_valid_within(blockpfn))
 + goto strict_check;
   nr_scanned++;
  
 - if (!PageBuddy(page)) {
 - if (strict)
 - return 0;
 - continue;
 - }
 + if (!PageBuddy(page))
 + goto strict_check;
 +
 + /*
 +  * The zone lock must be held to isolate freepages. This
 +  * unfortunately this is a very coarse lock and can be

this this

 +  * heavily contended if there are parallel allocations
 +  * or parallel compactions. For async compaction do not
 +  * spin on the lock and we acquire the lock as late as
 +  * possible.
 +  */
 + locked = compact_checklock_irqsave(&cc->zone->lock, flags,
 + locked, cc);
 + if (!locked)
 + break;
 +
 + /* Recheck this is a suitable migration target under lock */
 + if (!strict  !suitable_migration_target(page))
 + break;
 +
 + /* Recheck this is a buddy page under lock */
 + if (!PageBuddy(page))
 + goto strict_check;
  
   /* Found a free page, break it into order-0 pages */
   isolated = split_free_page(page);
   if (!isolated  strict)
 - return 0;
 + goto strict_check;
   total_isolated += isolated;
   for (i = 0; i < isolated; i++) {
   list_add(&page->lru, freelist);
 @@ -196,9 +236,23 @@ static unsigned long isolate_freepages_block(unsigned long blockpfn,
   blockpfn += isolated - 1;
   cursor += isolated - 1;
   }
 +
 + continue;
 +
 +strict_check:
 + /* Abort isolation if the caller requested strict isolation */
 + if (strict) {
 + total_isolated = 0;
 + goto out;
 + }
   }
  
   trace_mm_compaction_isolate_freepages(nr_scanned, total_isolated);
 +
 +out:
 + if (locked)
 + spin_unlock_irqrestore(&cc->zone->lock, flags);
 +
   return total_isolated;
  }

A few things about this function.

Would it be cleaner if we did

	if (!strict) {
		if (!suitable_migration_target(page))
			break;
	} else {
		if (!PageBuddy(page))
			goto strict_check;
	}

and then remove the test of `strict' from strict_check (which then
should be renamed)?

Which perhaps means that the whole `strict_check' block can go away:

	if (!strict) {
		if (!suitable_migration_target(page))
			break;
	} else {
		if (!PageBuddy(page)) {
			total_isolated = 0;
			goto out;
		}

Have a think about it?  The function is a little straggly.

Secondly, it is correct/desirable to skip the (now misnamed

Re: [Qemu-devel] [PATCH 0/2] core dump: re-purpose VM_ALWAYSDUMP to user controlled VM_DONTDUMP

2012-03-07 Thread Andrew Morton
On Wed, 7 Mar 2012 12:00:46 -0500
Jason Baron jba...@redhat.com wrote:

 Hi,
 
 The motivation for this change was that I was looking at a way for a qemu-kvm
 process to exclude the guest memory from its core dump, which can be quite
 large. There are already a number of filter flags in
 /proc/pid/coredump_filter, however, these allow one to specify 'types' of
 kernel memory, not specific address ranges (which is needed in this case).
 
 Since there are no more vma flags available, the first patch eliminates the
 need for the 'VM_ALWAYSDUMP' flag. The flag is used internally by the kernel 
 to
 mark vdso and vsyscall pages. However, it is simple enough to check if a vma
 covers a vdso or vsyscall page without the need for this flag.

Gee, we ran out?

That makes it pretty inevitable that we will grow the vma by four
bytes.  Once we have done that, your always_dump_vma() trickery becomes
unneeded and undesirable, yes?  If so, we may as well recognise reality
and grow the vma now.

 The second patch then replaces the 'VM_ALWAYSDUMP' flag with a new 
 'VM_DONTDUMP' flag, which can be set by userspace using new madvise flags:
 'MADV_DONTDUMP', and unset via 'MADV_DUMP'. The core dump filters continue to
 work the same as before unless 'MADV_DONTDUMP' is set on the region.
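
For reference, the userspace side described above boils down to a single madvise() call -- a hedged sketch, not the qemu patch linked just below; note that mainline eventually named the clearing flag MADV_DODUMP rather than MADV_DUMP:

#include <sys/mman.h>

/* Mark a guest-RAM mapping as excluded from core dumps. */
static int exclude_from_coredump(void *guest_ram, size_t len)
{
	/* Reverse with madvise(guest_ram, len, MADV_DODUMP). */
	return madvise(guest_ram, len, MADV_DONTDUMP);
}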
 
 The qemu code which implements this features is at:
 http://people.redhat.com/~jbaron/qemu-dump/qemu-dump.patch
 
 In my testing the qemu core dump shrunk from 383MB to 13MB with this patch.
 
 I also believe that the 'MADV_DONTDUMP' flag might be useful for security
 sensitive apps, which might want to select which areas are dumped.
 

Is there any way for userspace to query the state of the flag?