Re: [patch 09/20] rename struct paravirt_patch to paravirt_patch_site for clarity

2007-04-06 Thread Andrew Morton
On Wed, 04 Apr 2007 12:12:00 -0700 Jeremy Fitzhardinge [EMAIL PROTECTED] 
wrote:

 Rename struct paravirt_patch to paravirt_patch_site, so that it
 clearly refers to a callsite, and not the patch which may be applied
 to that callsite.

Needed a bit of updating for Andi's recently-merged
x86_64-mm-vmi-backend-for-paravirt-ops.patch but I think I got it
right.  We'll see if it compiles.
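For reference, the structure being renamed describes a patchable callsite, not
the patch applied there.  A sketch of its shape, assuming the layout of that
era's asm-i386/paravirt.h:

struct paravirt_patch_site {	/* formerly struct paravirt_patch */
	u8 *instr;		/* address of the patchable callsite */
	u8 instrtype;		/* which paravirt op is called there */
	u8 len;			/* length of the original instructions */
	u16 clobbers;		/* registers the patch may clobber */
};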



Re: [patch 18/20] clean up tsc-based sched_clock

2007-04-06 Thread Andrew Morton
On Wed, 04 Apr 2007 12:12:09 -0700 Jeremy Fitzhardinge [EMAIL PROTECTED] 
wrote:

 Three cleanups:
  - change instable -> unstable
  - it's better to use get_cpu_var for getting this cpu's variables
  - change cycles_2_ns to do the full computation rather than just the
tsc->ns scaling.  It's a simpler interface, and it makes the function
more generally useful.
 
 Signed-off-by: Jeremy Fitzhardinge [EMAIL PROTECTED]
 
 ---
  arch/i386/kernel/sched-clock.c |   35 +--
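For context, the tsc->ns scaling both the old and new cycles_2_ns() rely on is
a fixed-point multiply; a minimal sketch, assuming the CYC2NS_SCALE_FACTOR
convention of arch/i386/kernel/tsc.c:

#define CYC2NS_SCALE_FACTOR 10 /* 2^10; precision/range tradeoff */

static unsigned long cyc2ns_scale __read_mostly;

static inline void set_cyc2ns_scale(unsigned long cpu_khz)
{
	/* ns per cycle, kept as a fixed-point number */
	cyc2ns_scale = (1000000 << CYC2NS_SCALE_FACTOR) / cpu_khz;
}

static inline unsigned long long cycles_2_ns(unsigned long long cyc)
{
	return (cyc * cyc2ns_scale) >> CYC2NS_SCALE_FACTOR;
}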

I'm dropping the relevant patch from Andi's tree due to it causing
mysterious hangs when initscripts start ondemand.  So I'll need to drop
this patch and [patch 19/20] Add a sched_clock paravirt_op.


I still need to work out why that hang is happening - it is very
mysterious.  I got as far as working out that it was hanging on
write_seqlock_irqsave(&xtime_lock), then remembered that it's with
CONFIG_SMP=n so I stomped off to bed in disgust.  Later.


Re: [patch 07/20] Allow paravirt backend to choose kernel PMD sharing

2007-04-06 Thread Andrew Morton
On Wed, 04 Apr 2007 12:11:58 -0700 Jeremy Fitzhardinge [EMAIL PROTECTED] 
wrote:

 Normally when running in PAE mode, the 4th PMD maps the kernel address
 space, which can be shared among all processes (since they all need
 the same kernel mappings).
 
 Xen, however, does not allow guests to have the kernel pmd shared
 between page tables, so parameterize pgtable.c to allow both modes of
 operation.
 
 There are several side-effects of this.  One is that vmalloc will
 update the kernel address space mappings, and those updates need to be
 propagated into all processes if the kernel mappings are not
 intrinsically shared.  In the non-PAE case, this is done by
 maintaining a pgd_list of all processes; this list is used when all
 process pagetables must be updated.  pgd_list is threaded via
 otherwise unused entries in the page structure for the pgd, which
 means that the pgd must be page-sized for this to work.
 
 Normally the PAE pgd is only 4x64-bit entries large, but Xen requires
 the PAE pgd to be page aligned anyway, so this patch forces the pgd to be
 page aligned+sized when the kernel pmd is unshared, to accommodate both
 these requirements.
 
 Also, since there may be several distinct kernel pmds (if the
 user/kernel split is below 3G), there's no point in allocating them
 from a slab cache; they're just allocated with get_free_page and
 initialized appropriately.  (Of course they could be cached if there is
 just a single kernel pmd - which is the default with a 3G user/kernel
 split - but it doesn't seem worthwhile to add yet another case into
 this code).
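The pgd_list walk described above looks roughly like this (a hedged sketch
following 2.6.21-era arch/i386/mm/fault.c; sync_pgds is a hypothetical helper
name):

/* Propagate kernel mappings at 'address' into every process pagetable.
 * pgd_list is threaded through page->index of each pgd's struct page. */
static void sync_pgds(unsigned long address)
{
	unsigned long flags;
	struct page *page;

	spin_lock_irqsave(&pgd_lock, flags);
	for (page = pgd_list; page; page = (struct page *)page->index)
		vmalloc_sync_one(page_address(page), address);
	spin_unlock_irqrestore(&pgd_lock, flags);
}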

All this paravirt stuff isn't making the kernel any prettier, is it?

 ...
  
 -#ifndef CONFIG_X86_PAE
 -void vmalloc_sync_all(void)
 +void _vmalloc_sync_all(void)
  {
   /*
* Note that races in the updates of insync and start aren't
 @@ -600,6 +599,8 @@ void vmalloc_sync_all(void)
   static DECLARE_BITMAP(insync, PTRS_PER_PGD);
   static unsigned long start = TASK_SIZE;
   unsigned long address;
 +
 + BUG_ON(SHARED_KERNEL_PMD);
  
  BUILD_BUG_ON(TASK_SIZE & ~PGDIR_MASK);
  for (address = start; address >= TASK_SIZE; address += PGDIR_SIZE) {
 @@ -623,4 +624,3 @@ void vmalloc_sync_all(void)
   start = address + PGDIR_SIZE;
   }
  }

This is a functional change for non-paravirt kernels.  Non-PAE kernels now
get a vmalloc_sync_all().  How come?

We normally use double-underscore for things like this.

Your change clashes pretty fundamentally with
ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.21-rc5/2.6.21-rc5-mm4/broken-out/move-die-notifier-handling-to-common-code-fix-vmalloc_sync_all.patch,
and
ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.21-rc5/2.6.21-rc5-mm4/broken-out/move-die-notifier-handling-to-common-code.patch
_does_ make the kernel prettier.

But I'm a bit reluctant to rework
move-die-notifier-handling-to-common-code-fix-vmalloc_sync_all.patch
(somehow) until I understand why your patch is a) futzing with non-PAE,
non-paravirt code and b) overengineered.

Why didn't you just stick a

if (SHARED_KERNEL_PMD)
return;

into vmalloc_sync_all()?




Re: [patch 07/20] Allow paravirt backend to choose kernel PMD sharing

2007-04-06 Thread Andrew Morton
On Fri, 06 Apr 2007 17:02:58 -0700 Jeremy Fitzhardinge [EMAIL PROTECTED] 
wrote:

 Andrew Morton wrote:
  All this paravirt stuff isn't making the kernel any prettier, is it?

 
 You're too kind.  wli's comment on the first version of this patch was
 something along the lines of "this patch causes a surprising amount of
 damage for what little it achieves".

Damn, I wish I'd said that.

  ...
   
  -#ifndef CONFIG_X86_PAE
  -void vmalloc_sync_all(void)
  +void _vmalloc_sync_all(void)
   {
 /*
  * Note that races in the updates of insync and start aren't
  @@ -600,6 +599,8 @@ void vmalloc_sync_all(void)
 static DECLARE_BITMAP(insync, PTRS_PER_PGD);
 static unsigned long start = TASK_SIZE;
 unsigned long address;
  +
  +  BUG_ON(SHARED_KERNEL_PMD);
   
 BUILD_BUG_ON(TASK_SIZE & ~PGDIR_MASK);
 for (address = start; address >= TASK_SIZE; address += PGDIR_SIZE) {
  @@ -623,4 +624,3 @@ void vmalloc_sync_all(void)
 start = address + PGDIR_SIZE;
 }
   }
  
 
  This is a functional change for non-paravirt kernels.  Non-PAE kernels now
  get a vmalloc_sync_all().  How come?

 
 You mean PAE kernels get a vmalloc_sync_all?

err, yes.

 When we're in PAE mode, but SHARED_KERNEL_PMD is false (which is true
 for Xen, but not for native execution), then the kernel mappings are not
 implicitly shared between processes.  This means that the vmalloc
 mappings are not shared, and so need to be explicitly synchronized
 between pagetables, like in the !PAE case.

head spins.

  Your change clashes pretty fundamentally with
  ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.21-rc5/2.6.21-rc5-mm4/broken-out/move-die-notifier-handling-to-common-code-fix-vmalloc_sync_all.patch,
  and
  ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.21-rc5/2.6.21-rc5-mm4/broken-out/move-die-notifier-handling-to-common-code.patch
  _does_ make the kernel prettier.

 
 Hm, it doesn't look like a deep clash.  Dropping the inline function and
 just putting the "if (SHARED_KERNEL_PMD) return;" at the start of the
 real vmalloc_sync_all() would work fine.

Something like that.  I don't want to redo my patch if we're going to change
your patch ;)

 And I like vmalloc_sync_all() being a non-arch-specific interface; it
 cleans up another of the xen patches.

OK.

  But I'm a bit reluctant to rework
  move-die-notifier-handling-to-common-code-fix-vmalloc_sync_all.patch
  (somehow) until I understand why your patch is a) futzing with non-PAE,
  non-paravirt code
 
 There should be no functional difference for non-paravirt code, PAE or
 non-PAE.
 
   and b) overengineered.

 
 Overall, or just this bit?

this bit.

  Why didn't you just stick a
 
  if (SHARED_KERNEL_PMD)
  return;
 
  into vmalloc_sync_all()?

 
 That would work, but when building !PARAVIRT && PAE, SHARED_KERNEL_PMD
 is just constant 1, so it would end up making a pointless function
 call.  With the wrapper, the call disappears entirely.  It probably
 doesn't matter, but I didn't want anyone to complain about making the
 !PARAVIRT generated code worse (hi, Ingo!).
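Jeremy's constant-folding argument, sketched (the PARAVIRT definition of
SHARED_KERNEL_PMD here is illustrative; the !PARAVIRT && PAE case is the
point):

#ifdef CONFIG_PARAVIRT
#define SHARED_KERNEL_PMD	(paravirt_ops.shared_kernel_pmd)
#else
#define SHARED_KERNEL_PMD	1	/* constant when !PARAVIRT && PAE */
#endif

/* With SHARED_KERNEL_PMD == 1 the compiler folds the test away and no
 * call to _vmalloc_sync_all() is emitted at all. */
static inline void vmalloc_sync_all(void)
{
	if (!SHARED_KERNEL_PMD)
		_vmalloc_sync_all();
}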

vmalloc_sync_all() is a) tremendously slow and b) only called by
register_die_notifier().  We can afford to add a few cycles to it.

 However, if you're making vmalloc_sync_all a weak function anyway, then
 there's no difference with the paravirt patches in place.  The
 
   if (SHARED_KERNEL_PMD)
   return;
 
 will evaluate to
 
   if (1)
   return;
 



Re: [patch 07/20] Allow paravirt backend to choose kernel PMD sharing

2007-04-06 Thread Andrew Morton
On Fri, 06 Apr 2007 17:40:13 -0700 Jeremy Fitzhardinge [EMAIL PROTECTED] 
wrote:

 Andrew Morton wrote:
  Something like that.  I don't want to redo my patch if we're going to change
  your patch ;)

 
 OK.  I won't specifically redo it on top of your patches, but I'll
 rework it to remove the inline function and add the if() statement.  Do
 you want an incremental update or a complete replacement?
 

Thanks.  A replacement would suit.


Re: [patch 2/4] clean up identify_cpu

2007-04-07 Thread Andrew Morton
On Fri, 06 Apr 2007 15:41:54 -0700 Jeremy Fitzhardinge [EMAIL PROTECTED] 
wrote:

 identify_cpu() is used to identify both the boot CPU and secondary
 CPUs, but it performs some actions which only apply to the boot CPU.
 Those functions are therefore really __init functions, but because
 they're called by identify_cpu(), they must be marked __cpuinit.
 
 This patch splits identify_cpu() into identify_boot_cpu() and
 identify_secondary_cpu(), and calls the appropriate init functions
 from each.  Also, identify_boot_cpu() and all the functions it
 dominates are marked __init.
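The split looks roughly like this (a sketch assuming the shape of the i386
patch; note the mtrr_bp_init() call, which is exactly what trips the x86_64
warning below):

/* Boot CPU only: free to call __init functions such as mtrr_bp_init(). */
void __init identify_boot_cpu(void)
{
	identify_cpu(&boot_cpu_data);
	sysenter_setup();
	enable_sep_cpu();
	mtrr_bp_init();
}

/* Secondary CPUs: must remain __cpuinit, so no __init calls allowed. */
void __cpuinit identify_secondary_cpu(struct cpuinfo_x86 *c)
{
	BUG_ON(c == &boot_cpu_data);
	identify_cpu(c);
	enable_sep_cpu();
}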

x86_64 uses this too.

WARNING: arch/x86_64/kernel/built-in.o - Section mismatch: reference to 
.init.text:mtrr_bp_init from .text.identify_cpu after 'identify_cpu' (at offset 
0x655)


Re: [patch 2/4] clean up identify_cpu

2007-04-07 Thread Andrew Morton
On Sat, 07 Apr 2007 10:20:17 -0700 Jeremy Fitzhardinge [EMAIL PROTECTED] 
wrote:

  I don't have an x86-64 compile environment on
 hand, so the 64 bits are completely untested

http://userweb.kernel.org/~akpm/cross-compilers/


Re: [patch 4/6] Rename the parainstructions symbols to be consistent with the others

2007-04-10 Thread Andrew Morton
On Tue, 10 Apr 2007 20:07:48 +0200 Andi Kleen [EMAIL PROTECTED] wrote:

 On Wednesday 04 April 2007 03:06:59 Jeremy Fitzhardinge wrote:
  The other symbols used to delineate the alt-instructions sections have
  the form __foo/__foo_end.  Rename parainstructions to match.
 
 This patch breaks x86-64
 

Fixed version is in 2.6.21-rc6-mm1.

Please let me know when you've uploaded your new tree...


lguest re-review

2007-05-09 Thread Andrew Morton

Some concern was expressed over the lguest review status, so I shall send
the patches out again for people to review, to test, to make observations
about the author's personal appearance, etc.

I'll plan on sending these patches off to Linus in a week's time, assuming
all goes well.

Thanks.


Re: lguest re-review

2007-05-09 Thread Andrew Morton
On Thu, 10 May 2007 10:24:08 +1000
Rusty Russell [EMAIL PROTECTED] wrote:

 On Wed, 2007-05-09 at 02:51 -0700, Andrew Morton wrote:
  Some concern was expressed over the lguest review status, so I shall send
  the patches out again for people to review, to test, to make observations
  about the author's personal appearance, etc.
 
 Thanks Andrew,
 
   This means I can finally ack this patch (thanks Eric):
 
 From: [EMAIL PROTECTED] (Eric W. Biederman)
 Subject: [PATCH] Revert "[PATCH] paravirt: Add startup infrastructure for
 paravirtualization"
 
 This reverts commit c9ccf30d77f04064fe5436027ab9d2230c7cdd94.

I don't get it.  Does lguest no longer need this code, or will
it be reintroduced with an lguest merge, or something else?


Re: [PATCH 2/3] fbdev: Make deferred I/O work as advertized

2008-03-19 Thread Andrew Morton
On Wed, 19 Mar 2008 19:24:25 +0800 Jaya Kumar [EMAIL PROTECTED] wrote:

 On Wed, Mar 19, 2008 at 6:45 PM, Markus Armbruster [EMAIL PROTECTED] wrote:
 
   Any progress on this?
 
   Considering that fb_defio is utterly broken without the fix (writing
   the frame buffer makes the VM endlessly invoke vm_ops.page_mkwrite()),
   wouldn't it make sense to merge the fix even if it still has issues?
 
 
 Yes, Andrew has merged the fix into -mm. I see it in mmotm
 
 http://userweb.kernel.org/~akpm/mmotm/broken-out/fbdev-defio-and-metronomefb-v4.patch
 

Do we need some or all of that patch in 2.6.25?  I wasn't aware of such a
need.

If the drivers/video/fb_defio.c hunk of that patch fixes something then I'd
be inclined to merge the whole thing - adding a new driver won't hurt
anyone.


Re: [PATCH 2/3] fbdev: Make deferred I/O work as advertized

2008-03-19 Thread Andrew Morton
On Thu, 20 Mar 2008 06:30:37 +0800
Jaya Kumar [EMAIL PROTECTED] wrote:

 On Thu, Mar 20, 2008 at 2:37 AM, Andrew Morton
 [EMAIL PROTECTED] wrote:
 
   Do we need some or all of that patch in 2.6.25?  I wasn't aware of such a
   need.
 
   If the drivers/video/fb_defio.c hunk of that patch fixes something then I'd
   be inclined to merge the whole thing - adding a new driver won't hurt
   anyone.
 
 
 Yes, the defio hunk is a bugfix. I agree with pushing the whole thing
 to 2.6.25 if there are people who need it in 2.6.25. I believe there
 is no risk of impact on anyone else.
 

Well all we have in the changelog is

  Implement support for the E-Ink Metronome controller.  It provides an
  mmapable interface to the controller using defio support.  It was tested
  with a gumstix pxa255 with Vizplex media using Xfbdev and various X
  clients such as xeyes, xpdf, xloadimage.

grump.  This says nothing at all about the change to drivers/video/fb_defio.c.

a) the changelog _should_ document this change.  Please send a paragraph and
   I'll add it.

b) If this discussion hadn't happened, that bug (whatever it is) would
   have remained unfixed in 2.6.25.  As a direct consequence of poor
   changelogging.

c) This bugfix (whatever it is) should have been in a separate patch.

Thanks.


Re: [RFC/PATCH 01/15 v2] preparation: provide hook to enable pgstes in user pagetable

2008-03-24 Thread Andrew Morton
On Sat, 22 Mar 2008 18:02:37 +0100
Carsten Otte [EMAIL PROTECTED] wrote:

 From: Martin Schwidefsky [EMAIL PROTECTED]
 
 The SIE instruction on s390 uses the 2nd half of the page table page to
 virtualize the storage keys of a guest. This patch offers the s390_enable_sie
 function, which reorganizes the page tables of a single-threaded process to
 reserve space in the page table:
 s390_enable_sie makes sure that the process is single threaded and then uses
 dup_mm to create a new mm with reorganized page tables. The old mm is freed
 and the process now has a page status extended field after every page table.
 
 Code that wants to exploit pgstes should SELECT CONFIG_PGSTE.
 
 This patch has a small common code hit, namely making dup_mm non-static.
 
 Edit (Carsten): I've modified Martin's patch, following Jeremy Fitzhardinge's
 review feedback. Now we do have the prototype for dup_mm in
 include/linux/sched.h.
 
 ...

 --- linux-host.orig/kernel/fork.c
 +++ linux-host/kernel/fork.c
 @@ -498,7 +498,7 @@ void mm_release(struct task_struct *tsk,
   * Allocate a new mm structure and copy contents from the
   * mm structure of the passed in task structure.
   */
 -static struct mm_struct *dup_mm(struct task_struct *tsk)
 +struct mm_struct *dup_mm(struct task_struct *tsk)
  {
  struct mm_struct *mm, *oldmm = current->mm;
   int err;

ack

 --- linux-host.orig/include/linux/sched.h
 +++ linux-host/include/linux/sched.h
 @@ -1758,6 +1758,8 @@ extern void mmput(struct mm_struct *);
  extern struct mm_struct *get_task_mm(struct task_struct *task);
  /* Remove the current tasks stale references to the old mm_struct */
  extern void mm_release(struct task_struct *, struct mm_struct *);
 +/* Allocate a new mm structure and copy contents from tsk->mm */
 +extern struct mm_struct *dup_mm(struct task_struct *tsk);
  
  extern int  copy_thread(int, unsigned long, unsigned long, unsigned long, 
 struct task_struct *, struct pt_regs *);
  extern void flush_thread(void);
 

hm, why did we put these in sched.h?

oh well - acked-by-me.


Re: [PATCH 2/5] /dev/vring: simple userspace-kernel ringbuffer interface.

2008-04-18 Thread Andrew Morton
On Fri, 18 Apr 2008 14:39:48 +1000 Rusty Russell [EMAIL PROTECTED] wrote:

 virtio introduced a ring structure ABI for guest<->host communications
 (currently used by lguest and kvm).  Using this same ABI, we can
 create a nice fd version.
 
 This is useful for efficiently passing packets to and from the tun,
 for example.
 
 ...

 +static int vring_mmap(struct file *filp, struct vm_area_struct *vma)
 +{
 + unsigned long size, num_descs;
 + struct vring_info *vr = filp->private_data;
 + int err;
 +
 + /* We overload mmap's offset to hold the ring number. */
 + num_descs = vma->vm_pgoff;
 +
 + /* Must be a power of two, and limit indices to a u16. */
 + if (!num_descs || (num_descs & (num_descs-1)) || num_descs > 65536)

We have an is_power_of_2().
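That is, the condition could read (a hedged rewrite using is_power_of_2()
from linux/log2.h):

	if (!num_descs || !is_power_of_2(num_descs) || num_descs > 65536)
		return -EINVAL;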

 + return -EINVAL;
 +
 + /* mmap size must be what we expect for such a ring. */
 + size = vma->vm_end - vma->vm_start;
 + if (size != ALIGN(vring_size(num_descs, PAGE_SIZE), PAGE_SIZE))
 + return -EINVAL;
 +
 + /* We only let them map this in one place. */
 + mutex_lock(&vr->lock);
 + if (vr->ring.num != 0) {
 + err = -EBUSY;
 + goto unlock;
 + }
 +
 + vring_init(&vr->ring, num_descs, (void *)vma->vm_start, PAGE_SIZE);
 +
 + vr->mask = num_descs - 1;
 + err = 0;
 +
 +unlock:
 + mutex_unlock(&vr->lock);
 + return err;
 +}

 ...

 +/**
 + * vring_get - check out a vring file descriptor
 + * @filp: the file structure to attach to (eg. from fget()).
 + *
 + * Userspace opens /dev/vring and mmaps it, then hands that fd to the
 + * kernel subsystem it wants to communicate with.  That subsystem uses
 + * this routine and vring_set_ops() to attach to it.
 + *
 + * This simply checks that it really is a vring fd (otherwise it
 + * returns NULL), the other routine checks that it's not already
 + * attached.
 + */

hm, I don't understand the big picture here yet.

Isn't this kinda-sorta like what a relayfs file does?  The oprofile
buffers?  etc?  Nothing in common at all, no hope?

 +struct vring_info *vring_get(struct file *filp)
 +{
 + /* Must be one of ours. */
 + if (filp->f_op != &vring_fops)
 + return NULL;
 +
 + return filp->private_data;
 +}
 +EXPORT_SYMBOL_GPL(vring_get);
 +
 +/**
 + * vring_set_ops - attach operations to a vring file descriptor.
 + * @vr: the vring_info returned from vring_get.
 + * @ops: the operations to attach.
 + * @ops_data: the argument to the ops callbacks.
 + *
 + * This is called after vring_get(): the reason for the two-part
 + * process is that the ops can be called before vring_set_ops returns
 + * (we don't do locking), so you really need to set things up before
 + * this call.
 + *
 + * This simply checks that the ring is not already attached to something,
 + * then sets the ops.
 + */
 +int vring_set_ops(struct vring_info *vr,
 +   const struct vring_ops *ops, void *ops_data)
 +{
 + int err;
 +
 + mutex_lock(&vr->lock);
 + if (vr->ops) {
 + err = -EBUSY;
 + goto unlock;
 + }
 +
 + /* We don't lock, so make sure we get this in the right order. */
 + vr->ops_data = ops_data;
 + wmb();
 + vr->ops = ops;
 +
 + err = 0;
 +unlock:
 + mutex_unlock(&vr->lock);
 + local_irq_enable();

what's this doing here?

 + return err;
 +}
 +EXPORT_SYMBOL_GPL(vring_set_ops);
 +
 +/**
 + * vring_unset_ops - remove operations to a vring file descriptor.
 + * @vr: the vring_info previously successfully vring_set_ops'd
 + */
 +void vring_unset_ops(struct vring_info *vr)
 +{
 + BUG_ON(!vr->ops);
 + mutex_lock(&vr->lock);
 + vr->ops = NULL;
 + mutex_unlock(&vr->lock);
 +}
 +EXPORT_SYMBOL_GPL(vring_unset_ops);

Isn't this just vring_set_ops(vr, NULL, NULL)?

 +static struct miscdevice vring_dev = {
 + .minor = MISC_DYNAMIC_MINOR,
 + .name = KBUILD_MODNAME,
 + .fops = &vring_fops,
 +};
 +
 +static int __init init(void)
 +{
 + return misc_register(&vring_dev);
 +}
 +
 +static void __exit fini(void)
 +{
 + misc_deregister(&vring_dev);
 +}
 +
 +module_init(init);
 +module_exit(fini);
 diff -r b2d9869d338f include/linux/vring.h
 --- /dev/null Thu Jan 01 00:00:00 1970 +
 +++ b/include/linux/vring.h   Fri Apr 18 13:35:16 2008 +1000
 @@ -0,0 +1,58 @@
 +/* Ring-buffer file descriptor implementation.
 + *
 + *  Copyright 2008 Rusty Russell IBM Corporation
 + *
 + *  This program is free software; you can redistribute it and/or modify
 + *  it under the terms of the GNU General Public License as published by
 + *  the Free Software Foundation; either version 2 of the License, or
 + *  (at your option) any later version.
 + *
 + *  This program is distributed in the hope that it will be useful,
 + *  but WITHOUT ANY WARRANTY; without even the implied warranty of
 + *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 + *  GNU General Public License for more details.
 + *
 + *  You should have received a copy of the GNU 

Re: [PATCH 5/5] tun: vringfd xmit support.

2008-04-18 Thread Andrew Morton
On Fri, 18 Apr 2008 14:43:24 +1000 Rusty Russell [EMAIL PROTECTED] wrote:

 This patch modifies tun to allow a vringfd to specify the send
 buffer.  The user does a write to push out packets from the buffer.
 
 Again we use the 'struct virtio_net_hdr' to allow userspace to send
 GSO packets.  In this case, it can hint how much to copy, and the
 other pages will be made into skb fragments.
 
 
 ...

 +/* We don't consolidate consecutive iovecs, so huge iovecs can break here.
 + * Users will learn not to do that. */
 +static int get_user_skb_frags(const struct iovec *iv, size_t len,
 +   struct skb_frag_struct *f)
 +{
 + unsigned int i, j, num_pg = 0;
 + int err;
 + struct page *pages[MAX_SKB_FRAGS];
 +
 + down_read(&current->mm->mmap_sem);
 + while (len) {
 + int n, npages;
 + unsigned long base, len;
 + base = (unsigned long)iv->iov_base;
 + len = (unsigned long)iv->iov_len;
 +
 + if (len == 0) {
 + iv++;
 + continue;
 + }
 +
 + /* How many pages will this take? */
 + npages = 1 + (base + len - 1)/PAGE_SIZE - base/PAGE_SIZE;

Brain hurts.  I hope you got that right.
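A worked check of the formula, assuming PAGE_SIZE == 4096:

	/* base = 4000, len = 200: the range straddles pages 0 and 1 */
	npages = 1 + (4000 + 200 - 1)/4096 - 4000/4096 = 1 + 1 - 0 = 2;
	/* base = 0, len = 4096: exactly one page */
	npages = 1 + (0 + 4096 - 1)/4096 - 0/4096 = 1 + 0 - 0 = 1;

so it does come out right for both straddling and aligned ranges.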

 + if (unlikely(num_pg + npages > MAX_SKB_FRAGS)) {
 + err = -ENOSPC;
 + goto fail;
 + }
 + n = get_user_pages(current, current->mm, base, npages,
 +0, 0, pages, NULL);

What is the maximum number of pages which an unprivileged user can
concurrently pin with this code?

 + if (unlikely(n < 0)) {
 + err = n;
 + goto fail;
 + }
 +
 + /* Transfer pages to the frag array */
 + for (j = 0; j < n; j++) {
 + f[num_pg].page = pages[j];
 + if (j == 0) {
 + f[num_pg].page_offset = offset_in_page(base);
 + f[num_pg].size = min(len, PAGE_SIZE -
 +  f[num_pg].page_offset);
 + } else {
 + f[num_pg].page_offset = 0;
 + f[num_pg].size = min(len, PAGE_SIZE);
 + }
 + len -= f[num_pg].size;
 + base += f[num_pg].size;
 + num_pg++;
 + }

This loop is a fancy way of doing

num_pg = n;

 + if (unlikely(n != npages)) {
 + err = -EFAULT;
 + goto fail;
 + }

why not do this immediately after running get_user_pages()?
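A sketch of that reordering (not the author's code): the partial-pin case then
has to release the pages this call pinned itself, since they have not yet been
moved into f[] and the common fail path only knows about f[]:

	n = get_user_pages(current, current->mm, base, npages,
			   0, 0, pages, NULL);
	if (unlikely(n < 0)) {
		err = n;
		goto fail;
	}
	if (unlikely(n != npages)) {
		while (n--)
			put_page(pages[n]);	/* not yet in f[] */
		err = -EFAULT;
		goto fail;
	}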

 + }
 + up_read(&current->mm->mmap_sem);
 + return num_pg;
 +
 +fail:
 + for (i = 0; i < num_pg; i++)
 + put_page(f[i].page);

release_pages() could be a tad more efficient, but it's only error-path.

 + up_read(&current->mm->mmap_sem);
 + return err;
 +}
 +
  



Re: [PATCH 2/5] /dev/vring: simple userspace-kernel ringbuffer interface.

2008-04-18 Thread Andrew Morton
On Sat, 19 Apr 2008 00:32:39 +1000 Rusty Russell [EMAIL PROTECTED] wrote:

  Isn't this kinda-sorta like what a relayfs file does?  The oprofile
  buffers?  etc?  Nothing in common at all, no hope?
 
 An excellent question, but I thought the modern kernel etiquette was to only 
 comment on whitespace and formatting, and call it review? :)
 
 Yes, kinda-sorta in that it's a ring buffer.  No, in that it's bidir and 
 consumption can be out-of-order (kind of important for I/O buffers).
 
 But the reason I'm not proposing it as a syscall is that I'm not convinced 
 it's the One True Solution which everyone should be using.  Time will tell: 
 it's clearly not tied to tun and it's been generically useful for virtual 
 I/O, but history has not been kind to new userspace interfaces.

This may be our third high-bandwidth user/kernel interface to transport
bulk data (hbukittbd) which was implemented because its predecessors
weren't quite right.  In a year or two's time someone else will need a
hbukittbd and will find that the existing three aren't quite right and will
give us another one.  One day we need to stop doing this ;)

It could be that this person will look at Rusty's hbukittbd and find that
it _could_ be tweaked to do what he wants, but it's already shipping and
it's part of the kernel API and hence can't be made to do what he wants.

So I think it would be good to plonk the proposed interface on the table
and have a poke at it.  Is it compat-safe?  Is it extensible in a
backward-compatible fashion?  Are there future-safe changes we should make
to it?  Can Michael Kerrisk understand, review and document it?  etc.

You know what I'm saying ;)  What is the proposed interface?


Re: [PATCH 5/5] tun: vringfd xmit support.

2008-04-18 Thread Andrew Morton
On Sat, 19 Apr 2008 01:15:15 +1000 Rusty Russell [EMAIL PROTECTED] wrote:

 
  What is the maximum number of pages which an unprivileged user can
  concurrently pin with this code?
 
 Since only root can open the tun device, it's currently OK.  The old code
 kmalloced and copied: is there some mm-fu reason why pinning userspace memory
 is worse?

We generally try to avoid it - it allows users to dos the box.  Although I
suspect that direct-io presently permits users to transiently pin an amount
of memory which is proportional to the number of disks upon which they can
open files.

 Subject: Export release_pages; nice undo for get_user_pages.
 
 Andrew Morton suggests tun/tap use release_pages, but it's not
 exported.  It's not clear to me why this is in swap.c, but it exists
 even without CONFIG_SWAP, so that's OK.
 
 Signed-off-by: Rusty Russell [EMAIL PROTECTED]
 
 diff -r abd2ad431e5c mm/swap.c
 --- a/mm/swap.c   Sat Apr 19 00:34:54 2008 +1000
 +++ b/mm/swap.c   Sat Apr 19 01:11:40 2008 +1000
 @@ -346,6 +346,7 @@ void release_pages(struct page **pages, int nr, int cold)
  
  pagevec_free(&pages_to_free);
  }
 +EXPORT_SYMBOL(release_pages);
  
  /*
   * The pages which we're about to release may be in the deferred lru-addition

acked-by: me.


Re: [PATCH 5/5] tun: vringfd xmit support.

2008-04-18 Thread Andrew Morton
On Sat, 19 Apr 2008 01:15:15 +1000 Rusty Russell [EMAIL PROTECTED] wrote:

 It's not clear to me why this is in swap.c, but it exists
 even without CONFIG_SWAP, so that's OK.

We should have done mv mm/swap.c mm/mmlib.c years ago.


Re: [PATCH 5/5] tun: vringfd xmit support.

2008-04-19 Thread Andrew Morton
 On Sun, 20 Apr 2008 00:41:43 +1000 Rusty Russell [EMAIL PROTECTED] wrote:
 On Saturday 19 April 2008 05:06:34 Andrew Morton wrote:
  On Sat, 19 Apr 2008 01:15:15 +1000 Rusty Russell [EMAIL PROTECTED] 
 wrote:
    What is the maximum number of pages which an unprivileged user can
    concurrently pin with this code?
  
   Since only root can open the tun device, it's currently OK.  The old code
   kmalloced and copied: is there some mm-fu reason why pinning userspace
   memory is worse?
 
  We generally try to avoid it - it allows users to dos the box.
 
 My question is: is pinning a page worse than allocating a (kernel) page in 
 some way?
 

I guess pinning is not as bad as straight-out allocating.

Pinning is limited to the size of the program's VM, and it at
least pins something which is accounted and is exposed
to admin tools.

But they're both pretty similar in effect and risk.


Re: [PATCH 2/2] dm-ioband: I/O bandwidth controller v0.0.4: Document

2008-04-26 Thread Andrew Morton
On Thu, 24 Apr 2008 20:22:19 +0900 (JST) Ryo Tsuruta [EMAIL PROTECTED] wrote:

 +What's dm-ioband all about?
 +
 + dm-ioband is an I/O bandwidth controller implemented as a device-mapper
 +   driver. Several jobs using the same physical device have to share the
 +   bandwidth of the device. dm-ioband gives bandwidth to each job according
 +   to its weight, which each job can set for itself.
 +
 + At this time, a job is a group of processes with the same pid or pgrp or
 +   uid. There is also a plan to make it support cgroup. A job can also be a
 +   virtual machine such as KVM or Xen.

Most writes are performed by pdflush, kswapd, etc.  This will lead to large
inaccuracy.

It isn't trivial to fix.  We'd need deep, long tracking of ownership
probably all the way up to the pagecache page.  The same infrastructure
would be needed to make Sergey's BSD acct: disk I/O accounting vaguely
accurate.  Other proposals need it, but I forget what they are.

Much more minor points: when merge-time comes, the patches should have the
LINUX_VERSION_CODE stuff removed.  And probably all of the many `inline's
should be removed.




Re: [PATCHv2 1/2] mm: export use_mm/unuse_mm to modules

2009-08-11 Thread Andrew Morton
On Wed, 12 Aug 2009 00:27:52 +0300
Michael S. Tsirkin m...@redhat.com wrote:

 vhost net module wants to do copy to/from user from a kernel thread,
 which needs use_mm (like what fs/aio has).  Move that into mm/ and
 export to modules.
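The pattern in question, sketched (mirroring what fs/aio.c does;
worker_copy_from is a hypothetical vhost-style helper):

/* A kernel thread copying from a user process's address space. */
static int worker_copy_from(struct mm_struct *mm, void *buf,
			    void __user *uaddr, size_t len)
{
	int err = 0;

	use_mm(mm);			/* temporarily adopt the user mm */
	if (copy_from_user(buf, uaddr, len))
		err = -EFAULT;
	unuse_mm(mm);			/* drop it before servicing others */
	return err;
}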

OK by me.  Please include this change in the virtio patchset.  Which I
shall cheerfully not be looking at :)



Re: [PATCHv2 1/2] mm: export use_mm/unuse_mm to modules

2009-09-16 Thread Andrew Morton
On Thu, 17 Sep 2009 08:38:18 +0300 Michael S. Tsirkin m...@redhat.com wrote:

 Hi Andrew,
 On Tue, Aug 11, 2009 at 03:10:10PM -0700, Andrew Morton wrote:
  On Wed, 12 Aug 2009 00:27:52 +0300
  Michael S. Tsirkin m...@redhat.com wrote:
  
   vhost net module wants to do copy to/from user from a kernel thread,
   which needs use_mm (like what fs/aio has).  Move that into mm/ and
   export to modules.
  
  OK by me.  Please include this change in the virtio patchset.  Which I
  shall cheerfully not be looking at :)
 
 The virtio patches are somewhat delayed as we are ironing out the
 kernel/user interface with Rusty. Can the patch moving use_mm to mm/ be
 applied without exporting to modules for now? This will make it easier
 for virtio which will only have to patch in the EXPORT line.

That was 10,000 patches ago.

 I also have a small patch optimizing atomic usage in use_mm (which I did for
 virtio) and it's easier to apply it if the code is in the new place.
 
 If ok, pls let me know and I'll post the patch without the EXPORT line.

Please just send them all out.


Re: [PATCH] change acquire/release_console_sem() to console_lock/unlock()

2011-01-20 Thread Andrew Morton
On Thu, 20 Jan 2011 17:55:02 +0100
torbenh torb...@gmx.de wrote:

 On Thu, Jan 20, 2011 at 08:34:48AM -0800, Greg KH wrote:
  On Thu, Jan 20, 2011 at 04:58:13PM +0100, Torben Hohn wrote:
   the -rt patches change the console_semaphore to console_mutex.
   so a quite large chunk of the patches changes all
   acquire/release_console_sem() to acquire/release_console_mutex()
  
  Why not just change the functionality of the existing function to be a
  mutex in the rt patches, instead of having to rename it everywhere?
 
 i hope that Thomas already did this in his upcoming -rt series.
 
  
   this commit makes things use more neutral function names
   which don't make implications about the underlying lock.
   
   the only real change is the return value of console_trylock
   which is inverted from try_acquire_console_sem()
   
   Signed-off-by: Torben Hohn torb...@gmx.de
   CC: Thomas Gleixner t...@tglx.de
  
  I don't mind this rename, but is it really going to help anything out?
  What's the odds of the -rt portion of this patch ever making it to
  mainline?
 
 the -rt portion only changes the semaphore to a mutex.
 since the console_sem is used with mutex semantics, i don't see any
 reason not to merge that portion too.
 
 i am just trying to shrink the -rt patch to make it more maintainable :)
 

Yeah, I think it's a better name and if we can indeed switch that
semaphore to a mutex then that's a good thing to do.
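For reference, the rename amounts to this mapping (a sketch; the inverted
trylock return value is what the next message picks apart):

	acquire_console_sem()      ->  console_lock()
	release_console_sem()      ->  console_unlock()
	try_acquire_console_sem()  ->  console_trylock()  /* return inverted */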


Re: [PATCH] change acquire/release_console_sem() to console_lock/unlock()

2011-01-21 Thread Andrew Morton
On Fri, 21 Jan 2011 09:10:06 +0100 Geert Uytterhoeven ge...@linux-m68k.org 
wrote:

 include/linux/mutex.h:
 
 /*
  * NOTE: mutex_trylock() follows the spin_trylock() convention,
  *   not the down_trylock() convention!
  *
  * Returns 1 if the mutex has been acquired successfully, and 0 on contention.
  */
 extern int mutex_trylock(struct mutex *lock);
 
 So that's why the return value was inverted (when treating it as a boolean).
 I can understand that.
 
 However:
 
 +/**
 + * console_trylock - try to lock the console system for exclusive use.
 + *
 + * Tried to acquire a lock which guarantees that the caller has
 + * exclusive access to the console system and the console_drivers list.
 + *
 + * returns -1 on success, and 0 on failure to acquire the lock.
 + */
 +int console_trylock(void)
 
 So this one returns -1 on success, not 1? Why?

Yup.  All callers just test for non-zero, so...

--- a/kernel/printk.c~change-acquire-release_console_sem-to-console_lock-unlock-fix-2
+++ a/kernel/printk.c
@@ -1058,7 +1058,7 @@ EXPORT_SYMBOL(console_lock);
  * Tried to acquire a lock which guarantees that the caller has
  * exclusive access to the console system and the console_drivers list.
  *
- * returns -1 on success, and 0 on failure to acquire the lock.
+ * returns 1 on success, and 0 on failure to acquire the lock.
  */
 int console_trylock(void)
 {
@@ -1070,7 +1070,7 @@ int console_trylock(void)
}
console_locked = 1;
console_may_schedule = 0;
-   return -1;
+   return 1;
 }
 EXPORT_SYMBOL(console_trylock);
 
_



Re: [PATCH] xen-blkfront: use bitmap_set() and bitmap_clear()

2012-01-20 Thread Andrew Morton
On Fri, 20 Jan 2012 11:09:38 -0500
Konrad Rzeszutek Wilk konrad.w...@oracle.com wrote:

  
   Did you test this patch with a large amount of minors?
  
  Sorry I didn't do runtime test.
 
 Please do.

The poor guy probably doesn't know how to test it and surely it would
be quite a lot of work for him to do so.

Overall, it would be much more efficient if the tester of this code is
someone who is set up to easily apply the patch and test it.  ie: the
code maintainer(s).



Re: [PATCH] xen-blkfront: use bitmap_set() and bitmap_clear()

2012-01-20 Thread Andrew Morton
On Fri, 20 Jan 2012 18:55:08 -0500
Konrad Rzeszutek Wilk konrad.w...@oracle.com wrote:

 On Fri, Jan 20, 2012 at 03:11:49PM -0800, Andrew Morton wrote:
  On Fri, 20 Jan 2012 11:09:38 -0500
  Konrad Rzeszutek Wilk konrad.w...@oracle.com wrote:
  

 Did you test this patch with a large amount of minors?

Sorry I didn't do runtime test.
   
   Please do.
  
  The poor guy probably doesn't know how to test it and surely it would
  be quite a lot of work for him to do so.
  
  Overall, it would be much more efficient if the tester of this code is
  someone who is set up to easily apply the patch and test it.  ie: the
  code maintainer(s).
 
 grins
 
 This patch aside - Andrew how do you deal with a large amount of patches
 and make sure they are tested?

Not very well :(

 Is there a weekly Test Tuesday where you 
 kick off your automated tests and dilligently look for any variations?
 Or is it more of compile the kernel with X features and run it for a week
 doing normal (and abnormal!) things to see if it falls over?

I build all the patches I have along with all of linux-next and boot
the thing, then use the resulting kernel for a few hours compilation
testing, then shove it all into linux-next where I naively hope that
someone else is testing things.

The coverage which is obtained this way is pretty poor.  


Re: [PATCH v4 1/3] mm: introduce compaction and migration for virtio ballooned pages

2012-07-18 Thread Andrew Morton
On Tue, 17 Jul 2012 13:50:41 -0300
Rafael Aquini aqu...@redhat.com wrote:

 This patch introduces the helper functions as well as the necessary changes
 to teach compaction and migration bits how to cope with pages which are
 part of a guest memory balloon, in order to make them movable by memory
 compaction procedures.
 
 ...

 --- a/include/linux/mm.h
 +++ b/include/linux/mm.h
 @@ -1629,5 +1629,20 @@ static inline unsigned int debug_guardpage_minorder(void) { return 0; }
  static inline bool page_is_guard(struct page *page) { return false; }
  #endif /* CONFIG_DEBUG_PAGEALLOC */
  
 +#if (defined(CONFIG_VIRTIO_BALLOON) || \
 + defined(CONFIG_VIRTIO_BALLOON_MODULE)) && defined(CONFIG_COMPACTION)
 +extern bool putback_balloon_page(struct page *);
 +extern struct address_space *balloon_mapping;
 +
 +static inline bool is_balloon_page(struct page *page)
 +{
 + return (page->mapping == balloon_mapping) ? true : false;

You can simply do

return page->mapping == balloon_mapping;

 +}
 +#else
 +static inline bool is_balloon_page(struct page *page)   { return false; }
 +static inline bool isolate_balloon_page(struct page *page)  { return false; }
 +static inline bool putback_balloon_page(struct page *page)  { return false; }
 +#endif /* (VIRTIO_BALLOON || VIRTIO_BALLOON_MODULE) && COMPACTION */

This means that if CONFIG_VIRTIO_BALLOON=y and CONFIG_COMPACTION=n,
is_balloon_page() will always return false.  IOW, no pages are balloon
pages!  This is wrong.

I'm not sure what to do about this, apart from renaming the function to
is_compactible_balloon_page() or something similarly awkward.




Re: [PATCH v4 2/3] virtio_balloon: introduce migration primitives to balloon pages

2012-07-18 Thread Andrew Morton
On Tue, 17 Jul 2012 13:50:42 -0300
Rafael Aquini aqu...@redhat.com wrote:

 Besides making balloon pages movable at allocation time and introducing
 the necessary primitives to perform balloon page migration/compaction,
 this patch also introduces the following locking scheme to provide the
 proper synchronization and protection for struct virtio_balloon elements
 against concurrent accesses due to parallel operations introduced by
 memory compaction / page migration.
  - balloon_lock (mutex) : synchronizes the access demand to elements of
 struct virtio_balloon and its queue operations;
  - pages_lock (spinlock): special protection to balloon pages list against
 concurrent list handling operations;
 
 ...

 + balloon_mapping->a_ops = &virtio_balloon_aops;
 + balloon_mapping->backing_dev_info = (void *)vb;

hoo boy.  We're making page-mapping-backing_dev_info point at a
struct which does not have type `struct backing_dev_info'.  And then we
are exposing that page to core MM functions.  So we're hoping that core
MM will never walk down page->mapping->backing_dev_info and explode.

That's nasty, hacky and fragile.


Re: [PATCH v4 1/3] mm: introduce compaction and migration for virtio ballooned pages

2012-07-18 Thread Andrew Morton
On Wed, 18 Jul 2012 20:07:07 -0300
Rafael Aquini aqu...@redhat.com wrote:

  
   +}
   +#else
   +static inline bool is_balloon_page(struct page *page)	{ return false; }
   +static inline bool isolate_balloon_page(struct page *page)	{ return false; }
   +static inline bool putback_balloon_page(struct page *page)	{ return false; }
   +#endif /* (VIRTIO_BALLOON || VIRTIO_BALLOON_MODULE) && COMPACTION */
  
  This means that if CONFIG_VIRTIO_BALLOON=y and CONFIG_COMPACTION=n,
  is_balloon_page() will always return NULL.  IOW, no pages are balloon
  pages!  This is wrong.
  
 I believe it's right, actually, as we can see CONFIG_COMPACTION=n associated 
 with
 CONFIG_MIGRATION=y (and  CONFIG_VIRTIO_BALLOON=y).
 For such a config case we cannot perform the is_balloon_page() test branches
 placed in mm/migrate.c

No, it isn't right.  Look at the name: is_balloon_page.  If a caller
runs is_balloon_page() against a balloon page with CONFIG_COMPACTION=n
then they will get false, which is incorrect.

So the function needs a better name - one which communicates that it is
a balloon page *for the purposes of processing by the compaction code*. 
Making the function private to compaction.c would help with that, if
feasible.




Re: [PATCH v4 1/3] mm: introduce compaction and migration for virtio ballooned pages

2012-07-18 Thread Andrew Morton
On Wed, 18 Jul 2012 22:00:48 -0300 Rafael Aquini aqu...@redhat.com wrote:

  So the function needs a better name - one which communicates that it is
  a balloon page *for the purposes of processing by the compaction code*. 
  Making the function private to compaction.c would help with that, if
  feasible.
  
 
 How about this (adjusted) approach:

it fails checkpatch ;)

 --- a/include/linux/mm.h
 +++ b/include/linux/mm.h
 @@ -1629,8 +1629,7 @@ static inline unsigned int debug_guardpage_minorder(void) { return 0; }
  static inline bool page_is_guard(struct page *page) { return false; }
  #endif /* CONFIG_DEBUG_PAGEALLOC */
  
 -#if (defined(CONFIG_VIRTIO_BALLOON) || \
 - defined(CONFIG_VIRTIO_BALLOON_MODULE)) && defined(CONFIG_COMPACTION)
 +#if (defined(CONFIG_VIRTIO_BALLOON) || defined(CONFIG_VIRTIO_BALLOON_MODULE))
  extern bool putback_balloon_page(struct page *);
  extern struct address_space *balloon_mapping;
  
 @@ -1638,11 +1637,13 @@ static inline bool is_balloon_page(struct page *page)
  {
  return (page->mapping && page->mapping == balloon_mapping);
  }
 +#if defined(CONFIG_COMPACTION)
 +static inline bool balloon_compaction_enabled(void) { return true; }
  #else
 -static inline bool is_balloon_page(struct page *page)   { return false; }
 -static inline bool isolate_balloon_page(struct page *page)  { return false; }
 -static inline bool putback_balloon_page(struct page *page)  { return false; }
 -#endif /* (VIRTIO_BALLOON || VIRTIO_BALLOON_MODULE) && COMPACTION */
 +static inline bool putback_balloon_page(struct page *page) { return false; }
 +static inline bool balloon_compaction_enabled(void) { return false; }
 +#endif /* CONFIG_COMPACTION */
 +#endif /* (CONFIG_VIRTIO_BALLOON || CONFIG_VIRTIO_BALLOON_MODULE) */
  
  #endif /* __KERNEL__ */
  #endif /* _LINUX_MM_H */
 diff --git a/mm/migrate.c b/mm/migrate.c
 index 59c7bc5..f5f6a7d 100644
 --- a/mm/migrate.c
 +++ b/mm/migrate.c
 @@ -78,7 +78,8 @@ void putback_lru_pages(struct list_head *l)
 list_del(&page->lru);
   dec_zone_page_state(page, NR_ISOLATED_ANON +
   page_is_file_cache(page));
 - if (unlikely(is_balloon_page(page)))
 + if (unlikely(is_balloon_page(page)) &&
 + balloon_compaction_enabled())

well, that helps readability.  But what does is_balloon_page() return
when invoked on a balloon page when CONFIG_COMPACTION=n?  False,
methinks.

I think the code as you previously had it was OK, but the
is_balloon_page() name is misleading.  It really wants to be called
is_potentially_compactible_balloon_page() :( Maybe rename it to
compactible_balloon_page()?


Re: [PATCH v10 0/5] make balloon pages movable by compaction

2012-09-17 Thread Andrew Morton
On Mon, 17 Sep 2012 13:38:15 -0300
Rafael Aquini aqu...@redhat.com wrote:

 Memory fragmentation introduced by ballooning might reduce significantly
 the number of 2MB contiguous memory blocks that can be used within a guest,
 thus imposing performance penalties associated with the reduced number of
 transparent huge pages that could be used by the guest workload.
 
 This patch-set follows the main idea discussed at 2012 LSFMMS session:
 Ballooning for transparent huge pages -- http://lwn.net/Articles/490114/
 to introduce the required changes to the virtio_balloon driver, as well as
 the changes to the core compaction & migration bits, in order to make those
 subsystems aware of ballooned pages and allow memory balloon pages to become
 movable within a guest, thus avoiding the aforementioned fragmentation issue.
 
 Following are numbers that prove this patch benefits on allowing compaction
 to be more effective at memory ballooned guests.
 
 Results for STRESS-HIGHALLOC benchmark, from Mel Gorman's mmtests suite,
 running on a 4gB RAM KVM guest which was ballooning 1gB RAM in 256mB chunks,
 at every minute (inflating/deflating), while test was running:

How can a patchset reach v10 and have zero Reviewed-by's?

The patchset looks reasonable to me and your empirical results look
good.  But I don't feel that I'm in a position to decide on its overall
desirability, either in a standalone sense or in comparison to any
alternative schemes which anyone has proposed.

IOW, Rusty and KVM folks: please consider thyself poked.

I looked through the code and have some comments which are minor in the
overall scheme of things.  I'll be more comfortable when a compaction
expert has had a go over it.  IOW, Mel joins the pokee list ;)

(The question of overall desirability is the big one here.  Do we
actually want to add this to Linux?  The rest is details which we can
work out).




Re: [PATCH v10 3/5] virtio_balloon: introduce migration primitives to balloon pages

2012-09-17 Thread Andrew Morton
On Mon, 17 Sep 2012 13:38:18 -0300
Rafael Aquini aqu...@redhat.com wrote:

 Memory fragmentation introduced by ballooning might reduce significantly
 the number of 2MB contiguous memory blocks that can be used within a guest,
 thus imposing performance penalties associated with the reduced number of
 transparent huge pages that could be used by the guest workload.
 
 Besides making balloon pages movable at allocation time and introducing
 the necessary primitives to perform balloon page migration/compaction,
 this patch also introduces the following locking scheme, in order to
 enhance the synchronization methods for accessing elements of struct
 virtio_balloon, thus providing protection against concurrent access
 introduced by parallel memory compaction threads.
 
  - balloon_lock (mutex) : synchronizes the access demand to elements of
   struct virtio_balloon and its queue operations;
  - pages_lock (spinlock): special protection to balloon's pages bookmarking
   elements (list and atomic counters) against the
   potential memory compaction concurrency;
 

 ...

  struct virtio_balloon
  {
 @@ -46,11 +48,24 @@ struct virtio_balloon
   /* The thread servicing the balloon. */
   struct task_struct *thread;
  
 + /* balloon special page->mapping */
 + struct address_space *mapping;
 +
 + /* Synchronize access/update to this struct virtio_balloon elements */
 + struct mutex balloon_lock;
 +
   /* Waiting for host to ack the pages we released. */
   wait_queue_head_t acked;
  
 + /* Protect pages list, and pages bookkeeping counters */
 + spinlock_t pages_lock;
 +
 + /* Number of balloon pages isolated from 'pages' list for compaction */
 + unsigned int num_isolated_pages;

Is it utterly inconceivable that this counter could exceed 4G, ever?

   /* Number of balloon pages we've told the Host we're not using. */
   unsigned int num_pages;
 +
   /*
* The pages we've told the Host we're not using.
* Each page on this list adds VIRTIO_BALLOON_PAGES_PER_PAGE
 @@ -60,7 +75,7 @@ struct virtio_balloon
  
   /* The array of pfns we tell the Host about. */
   unsigned int num_pfns;
 - u32 pfns[256];
 + u32 pfns[VIRTIO_BALLOON_ARRAY_PFNS_MAX];
  
   /* Memory statistics */
   int need_stats_update;
 @@ -122,13 +137,17 @@ static void set_page_pfns(u32 pfns[], struct page *page)
  
  static void fill_balloon(struct virtio_balloon *vb, size_t num)
  {
 + /* Get the proper GFP alloc mask from vb->mapping flags */
 + gfp_t vb_gfp_mask = mapping_gfp_mask(vb->mapping);
 +
   /* We can only do one array worth at a time. */
   num = min(num, ARRAY_SIZE(vb->pfns));
  
 + mutex_lock(&vb->balloon_lock);
   for (vb->num_pfns = 0; vb->num_pfns < num;
	vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
 - struct page *page = alloc_page(GFP_HIGHUSER | __GFP_NORETRY |
 - __GFP_NOMEMALLOC | __GFP_NOWARN);
 + struct page *page = alloc_page(vb_gfp_mask | __GFP_NORETRY |
 +__GFP_NOWARN | __GFP_NOMEMALLOC);

That looks like an allocation which could easily fail.

   if (!page) {
   if (printk_ratelimit())
 dev_printk(KERN_INFO, &vb->vdev->dev,

Strangely, we suppressed the core page allocator's warning and
substituted this less useful one.

Also, it would be nice if someone could get that printk_ratelimit() out
of there, for reasons described at the printk_ratelimit() definition
site.
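What the definition site recommends instead is a per-callsite ratelimit
state; a sketch, with fill_rs as a hypothetical name:

	static DEFINE_RATELIMIT_STATE(fill_rs, DEFAULT_RATELIMIT_INTERVAL,
				      DEFAULT_RATELIMIT_BURST);

	if (__ratelimit(&fill_rs))
		dev_printk(KERN_INFO, &vb->vdev->dev,
			   "balloon: page allocation failed\n");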

 @@ -139,9 +158,15 @@ static void fill_balloon(struct virtio_balloon *vb, size_t num)
   break;
   }
 set_page_pfns(vb->pfns + vb->num_pfns, page);
 - vb->num_pages += VIRTIO_BALLOON_PAGES_PER_PAGE;
   totalram_pages--;
 +
 + BUG_ON(!trylock_page(page));
 + spin_lock(&vb->pages_lock);
   list_add(&page->lru, &vb->pages);
 + assign_balloon_mapping(page, vb->mapping);
 + vb->num_pages += VIRTIO_BALLOON_PAGES_PER_PAGE;
 + spin_unlock(&vb->pages_lock);
 + unlock_page(page);
   }

 ...




Re: [PATCH v10 1/5] mm: introduce a common interface for balloon pages mobility

2012-09-17 Thread Andrew Morton
On Mon, 17 Sep 2012 13:38:16 -0300
Rafael Aquini aqu...@redhat.com wrote:

 Memory fragmentation introduced by ballooning might reduce significantly
 the number of 2MB contiguous memory blocks that can be used within a guest,
 thus imposing performance penalties associated with the reduced number of
 transparent huge pages that could be used by the guest workload.
 
 This patch introduces a common interface to help a balloon driver on
 making its page set movable to compaction, and thus allowing the system
 to better leverage the compaction efforts on memory defragmentation.
 

 ...

 +#ifndef _LINUX_BALLOON_COMPACTION_H
 +#define _LINUX_BALLOON_COMPACTION_H
 +
 +#include linux/rcupdate.h
 +#include linux/pagemap.h
 +#include linux/gfp.h
 +#include linux/err.h
 +
 +/* return code to identify when a ballooned page has been migrated */
 +#define BALLOON_MIGRATION_RETURN 0xba1100

I didn't really spend enough time to work out why this was done this
way, but I know a hack when I see one!

We forgot to document the a_ops.migratepage() return value.  Perhaps
it's time to work out what it should be.

 +#ifdef CONFIG_BALLOON_COMPACTION
 +#define count_balloon_event(e)   count_vm_event(e)
 +#define free_balloon_mapping(m)  kfree(m)

It would be better to write these in C please.  That way we get
typechecking, even when CONFIG_BALLOON_COMPACTION=n.
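i.e. something like (a sketch of typed equivalents):

static inline void count_balloon_event(enum vm_event_item item)
{
	count_vm_event(item);
}

static inline void free_balloon_mapping(struct address_space *mapping)
{
	kfree(mapping);
}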

 +extern bool isolate_balloon_page(struct page *);
 +extern void putback_balloon_page(struct page *);
 +extern int migrate_balloon_page(struct page *newpage,
 + struct page *page, enum migrate_mode mode);
 +extern struct address_space *alloc_balloon_mapping(void *balloon_device,
 + const struct address_space_operations *a_ops);

There's a useful convention that interface identifiers are prefixed by
their interface's name.  IOW, everything in this file would start with
balloon_.  balloon_page_isolate, balloon_page_putback, etc.  I think
we could follow that convention here?

 +static inline void assign_balloon_mapping(struct page *page,
 +   struct address_space *mapping)
 +{
 + page->mapping = mapping;
 + smp_wmb();
 +}
 +
 +static inline void clear_balloon_mapping(struct page *page)
 +{
 + page->mapping = NULL;
 + smp_wmb();
 +}
 +
 +static inline gfp_t balloon_mapping_gfp_mask(void)
 +{
 + return GFP_HIGHUSER_MOVABLE;
 +}
 +
 +static inline bool __is_movable_balloon_page(struct page *page)
 +{
 + struct address_space *mapping = ACCESS_ONCE(page->mapping);
 + smp_read_barrier_depends();
 + return mapping_balloon(mapping);
 +}

hm.  Are these barrier tricks copied from somewhere else, or home-made?
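Read side by side, the intended pairing appears to be (a hedged reading, not
text from the patch):

	/* writer (assign_balloon_mapping)   reader (__is_movable_balloon_page)
	 *   page->mapping = mapping;          mapping = ACCESS_ONCE(page->mapping);
	 *   smp_wmb();                        smp_read_barrier_depends();
	 *                                     return mapping_balloon(mapping);
	 *
	 * The smp_wmb() publishes the mapping before any later list insertion;
	 * smp_read_barrier_depends() orders the dependent mapping->flags load
	 * against the pointer load on Alpha-style architectures. */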

 +/*
 + * movable_balloon_page - test page-mapping-flags to identify balloon pages
 + * that can be moved by compaction/migration.
 + *
 + * This function is used at core compaction's page isolation scheme and so it's
 + * exposed to several system pages which may, or may not, be part of a memory
 + * balloon, and thus we cannot afford to hold a page locked to perform tests.

I don't understand this.  What is a system page?  If I knew that, I
might perhaps understand why we cannot lock such a page.

 + * Therefore, as we might return false positives in the case a balloon page
 + * is just released under us, the page-mapping-flags need to be retested
 + * with the proper page lock held, on the functions that will cope with the
 + * balloon page later.
 + */
 +static inline bool movable_balloon_page(struct page *page)
 +{
 + /*
 +  * Before dereferencing and testing mapping-flags, lets make sure
 +  * this is not a page that uses -mapping in a different way
 +  */
 + if (!PageSlab(page) && !PageSwapCache(page) && !PageAnon(page) &&
 +     !page_mapped(page))
 + return __is_movable_balloon_page(page);
 +
 + return false;
 +}
 +
 +/*
 + * __page_balloon_device - get the balloon device that owns the given page.
 + *
 + * This shall only be used at driver callbacks under proper page lock,
 + * to get access to the balloon device which @page belongs.
 + */
 +static inline void *__page_balloon_device(struct page *page)
 +{
 + struct address_space *mapping = page->mapping;
 + if (mapping)
 + mapping = mapping->assoc_mapping;
 +
 + return mapping;
 +}

So you've repurposed address_space.assoc_mapping in new and unexpected
ways.

I don't immediately see a problem with doing this, but we should do it
properly.  Something like:

- rename address_space.assoc_mapping to private_data
- it has type void*
- document its ownership rules
- convert fs/buffer.c

all done as a standalone preparatory patch.

Also, your usage of -private_data should minimise its use of void* -
use more specific types wherever possible.  So this function should
return a struct virtio_balloon *.

It is unobvious why this interface function is prefixed with __.
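
Something along these lines, perhaps (a sketch only, assuming the
->private_data rename has been done):

static inline struct virtio_balloon *page_balloon_device(struct page *page)
{
	struct address_space *mapping = page->mapping;

	return mapping ? mapping->private_data : NULL;
}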

 +/*
 + * 

Re: [PATCH v10 1/5] mm: introduce a common interface for balloon pages mobility

2012-09-18 Thread Andrew Morton
On Tue, 18 Sep 2012 13:24:21 -0300
Rafael Aquini aqu...@redhat.com wrote:

 On Mon, Sep 17, 2012 at 03:15:43PM -0700, Andrew Morton wrote:
   +/* return code to identify when a ballooned page has been migrated */
   +#define BALLOON_MIGRATION_RETURN 0xba1100
  
  I didn't really spend enough time to work out why this was done this
  way, but I know a hack when I see one!
 
 Yes, I'm afraid it's a hack, but, unfortunately, it's a necessary one (IMHO).
 
 This 'distinct' return code is used to flag a successful balloon page migration
 at the following unmap_and_move() snippet (patch 2).
 If for any reason we fail to identify a successful balloon page migration, we
 will cause a page leak, as the old 'page' won't be properly released.
 .
 rc = __unmap_and_move(page, newpage, force, offlining, mode);
 +
 +if (unlikely(rc == BALLOON_MIGRATION_RETURN)) {
 +/*
 + * A ballooned page has been migrated already.
 + * Now, it's the time to remove the old page from the isolated
 + * pageset list and handle it back to Buddy, wrap-up counters
 + * and return.
 + */
 ..
 
 By reaching that point in code, we cannot rely on testing page->mapping flags
 anymore for both 'page' and 'newpage' because:
 a) migration has already finished and 'page'->mapping is wiped out;
 b) balloon might have started to deflate, and 'newpage' might be released
already;
 
 If the return code approach is unacceptable, we might defer the 'page'->mapping
 wipe-out step to that point in code for the balloon page case.
 That, however, tends to be a little bit heavier, IMHO, as it will require us to
 acquire the page lock once more to proceed with the mapping wipe-out, thus
 potentially introducing overhead by lock contention (specially when several
 parallel compaction threads are scanning pages for isolation)

I think the return code approach _is_ acceptable, but the
implementation could be improved.

As it stands, a naive caller could be expecting either 0 (success) or a
negative errno.  A large positive return value could trigger havoc.  We
can defend against such programming mistakes with code commentary, but
a better approach would be to enumerate the return values.  Along the
lines of

/*
 * Return values from address_space_operations.migratepage().  Returns a
 * negative errno on failure.
 */
#define MIGRATEPAGE_SUCCESS 0
#define MIGRATEPAGE_BALLOON_THINGY  1   /* nice comment goes here */

and convert all callers to explicitly check for MIGRATEPAGE_SUCCESS,
not literal zero.  We should be particularly careful to look for
codesites which are unprepared for positive return values, such as

ret = migratepage();
if (ret < 0)
return ret;
...
return ret; /* success!! */



If we wanted to be really vigilant about this, we could do

#define MIGRATEPAGE_SUCCESS 1
#define MIGRATEPAGE_BALLOON_THINGY  2

so any naive code which tests for literal zero will nicely explode early
in testing.
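
A typical conversion would then look like this (illustrative only):

	rc = mapping->a_ops->migratepage(mapping, newpage, page, mode);
	if (rc != MIGRATEPAGE_SUCCESS)
		goto fail;	/* negative errno */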

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v2 2/3] virtio: 9p: correctly pass physical address to userspace for high pages

2012-10-22 Thread Andrew Morton
On Fri, 19 Oct 2012 14:03:32 +0100
Will Deacon will.dea...@arm.com wrote:

 When using a virtio transport, the 9p net device may pass the physical
 address of a kernel buffer to userspace via a scatterlist inside a
 virtqueue. If the kernel buffer is mapped outside of the linear mapping
 (e.g. highmem), then virt_to_page will return a bogus value and we will
 populate the scatterlist with junk.
 
 This patch uses kmap_to_page when populating the page array for a kernel
 buffer.
 
 ...

 --- a/net/9p/trans_virtio.c
 +++ b/net/9p/trans_virtio.c
 @@ -39,6 +39,7 @@
  #include linux/inet.h
  #include linux/idr.h
  #include linux/file.h
 +#include linux/highmem.h
  #include linux/slab.h
  #include net/9p/9p.h
  #include linux/parser.h
 @@ -325,7 +326,7 @@ static int p9_get_mapped_pages(struct virtio_chan *chan,
   int count = nr_pages;
   while (nr_pages) {
   s = rest_of_page(data);
 - pages[index++] = virt_to_page(data);
 + pages[index++] = kmap_to_page(data);
   data += s;
   nr_pages--;

I am suspecting that this code has been busted for a while on x86
highmem, but nobody noticed.  True or false?  If true then I expect
that a -stable backport is appropriate?
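
For the record, a minimal illustration of the difference, assuming
CONFIG_HIGHMEM and a page which really lands in highmem:

	struct page *pg = alloc_page(GFP_HIGHUSER);
	void *va = kmap(pg);	/* va is outside the linear mapping */

	/*
	 * virt_to_page(va) computes garbage here; kmap_to_page(va)
	 * consults the kmap tables and recovers the right page.
	 */
	WARN_ON(kmap_to_page(va) != pg);	/* never fires */

	kunmap(pg);
	__free_page(pg);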

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v2 1/3] mm: highmem: export kmap_to_page for modules

2012-10-22 Thread Andrew Morton
On Fri, 19 Oct 2012 14:03:31 +0100
Will Deacon will.dea...@arm.com wrote:

 Some virtio device drivers (9p) need to translate high virtual addresses
 to physical addresses, which are inserted into the virtqueue for
 processing by userspace.
 
 This patch exports the kmap_to_page symbol, so that the affected drivers
 can be compiled as modules.
 
 Signed-off-by: Will Deacon will.dea...@arm.com
 ---
  mm/highmem.c |1 +
  1 files changed, 1 insertions(+), 0 deletions(-)
 
 diff --git a/mm/highmem.c b/mm/highmem.c
 index d517cd1..2a07f97 100644
 --- a/mm/highmem.c
 +++ b/mm/highmem.c
 @@ -105,6 +105,7 @@ struct page *kmap_to_page(void *vaddr)
  
   return virt_to_page(addr);
  }
 +EXPORT_SYMBOL(kmap_to_page);
  

Looks OK to me.  Would generally prefer that an exported-to-modules
symbol have some documentation, but I guess this one is obvious enough.

Rusty, please include this if you grab the rest of the series.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v11 1/7] mm: adjust address_space_operations.migratepage() return code

2012-11-07 Thread Andrew Morton
On Wed,  7 Nov 2012 01:05:48 -0200
Rafael Aquini aqu...@redhat.com wrote:

 This patch introduces MIGRATEPAGE_SUCCESS as the default return code
 for address_space_operations.migratepage() method and documents the
 expected return code for the same method in failure cases.

I hit a large number of rejects applying this against linux-next.  Due
to the increasingly irritating sched/numa code in there.

I attempted to fix it up and also converted some (but not all) of the
implicit tests of `rc' against zero.

Please check the result very carefully - more changes will be needed.

All those

-   if (rc)
+   if (rc != MIGRATEPAGE_SUCCESS)

changes are a pain.  Perhaps we shouldn't bother.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v11 5/7] virtio_balloon: introduce migration primitives to balloon pages

2012-11-07 Thread Andrew Morton
On Wed,  7 Nov 2012 01:05:52 -0200
Rafael Aquini aqu...@redhat.com wrote:

 Memory fragmentation introduced by ballooning might reduce significantly
 the number of 2MB contiguous memory blocks that can be used within a guest,
 thus imposing performance penalties associated with the reduced number of
 transparent huge pages that could be used by the guest workload.
 
 Besides making balloon pages movable at allocation time and introducing
 the necessary primitives to perform balloon page migration/compaction,
 this patch also introduces the following locking scheme, in order to
 enhance the synchronization methods for accessing elements of struct
 virtio_balloon, thus providing protection against concurrent access
 introduced by parallel memory migration threads.
 
 ...

 @@ -122,18 +128,25 @@ static void set_page_pfns(u32 pfns[], struct page *page)
  
  static void fill_balloon(struct virtio_balloon *vb, size_t num)
  {
 + struct balloon_dev_info *vb_dev_info = &vb->vb_dev_info;
 +
 + static DEFINE_RATELIMIT_STATE(fill_balloon_rs,
 +   DEFAULT_RATELIMIT_INTERVAL,
 +   DEFAULT_RATELIMIT_BURST);
 +
   /* We can only do one array worth at a time. */
   num = min(num, ARRAY_SIZE(vb->pfns));
  
 + mutex_lock(&vb->balloon_lock);
   for (vb->num_pfns = 0; vb->num_pfns < num;
        vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
 - struct page *page = alloc_page(GFP_HIGHUSER | __GFP_NORETRY |
 - __GFP_NOMEMALLOC | __GFP_NOWARN);
 + struct page *page = balloon_page_enqueue(vb_dev_info);
 +
   if (!page) {
 - if (printk_ratelimit())
 + if (__ratelimit(&fill_balloon_rs))
   dev_printk(KERN_INFO, &vb->vdev->dev,
  "Out of puff! Can't get %zu pages\n",
 -num);
 +VIRTIO_BALLOON_PAGES_PER_PAGE);
   /* Sleep for at least 1/5 of a second before retry. */
   msleep(200);
   break;

linux-next's fill_balloon() has already been converted to
dev_info_ratelimited().  I fixed everything up.  Please check the result.
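
For reference, the linux-next form is along these lines (illustrative):

		if (!page) {
			dev_info_ratelimited(&vb->vdev->dev,
					     "Out of puff! Can't get %u pages\n",
					     VIRTIO_BALLOON_PAGES_PER_PAGE);
			/* Sleep for at least 1/5 of a second before retry. */
			msleep(200);
			break;
		}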

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v11 3/7] mm: introduce a common interface for balloon pages mobility

2012-11-07 Thread Andrew Morton
On Wed,  7 Nov 2012 01:05:50 -0200
Rafael Aquini aqu...@redhat.com wrote:

 Memory fragmentation introduced by ballooning might reduce significantly
 the number of 2MB contiguous memory blocks that can be used within a guest,
 thus imposing performance penalties associated with the reduced number of
 transparent huge pages that could be used by the guest workload.
 
 This patch introduces a common interface to help a balloon driver make
 its page set movable for compaction, and thus allowing the system
 to better leverage compaction's efforts at memory defragmentation.


mm/migrate.c: In function 'unmap_and_move':
mm/migrate.c:899: error: 'COMPACTBALLOONRELEASED' undeclared (first use in this function)
mm/migrate.c:899: error: (Each undeclared identifier is reported only once
mm/migrate.c:899: error: for each function it appears in.)

You've been bad - you didn't test with your feature disabled. 
Please do that.  And not just compilation testing.


We can fix this one with a sucky macro.  I think that's better than
unconditionally defining the enums.

--- a/include/linux/balloon_compaction.h~mm-introduce-a-common-interface-for-balloon-pages-mobility-fix
+++ a/include/linux/balloon_compaction.h
@@ -207,10 +207,8 @@ static inline gfp_t balloon_mapping_gfp_
return GFP_HIGHUSER;
 }
 
-static inline void balloon_event_count(enum vm_event_item item)
-{
-   return;
-}
+/* A macro, to avoid generating references to the undefined COMPACTBALLOON* */
+#define balloon_event_count(item) do { } while (0)
 
 static inline bool balloon_compaction_check(void)
 {

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v11 5/7] virtio_balloon: introduce migration primitives to balloon pages

2012-11-07 Thread Andrew Morton
On Thu, 08 Nov 2012 09:32:18 +1030
Rusty Russell ru...@rustcorp.com.au wrote:

 Rafael Aquini aqu...@redhat.com writes:
  + * virtballoon_migratepage - perform the balloon page migration on behalf of
  + *  a compaction thread. (called under page lock)
 
  +   if (!mutex_trylock(&vb->balloon_lock))
  +   return -EAGAIN;
 
 Erk, OK...

Not really.  As is almost always the case with a trylock, it needs a
comment explaining why we couldn't use the far superior mutex_lock(). 
Data: this reader doesn't know!

  +   /* balloon's page migration 1st step  -- inflate newpage */
  +   spin_lock_irqsave(&vb_dev_info->pages_lock, flags);
  +   balloon_page_insert(newpage, mapping, &vb_dev_info->pages);
  +   vb_dev_info->isolated_pages--;
  +   spin_unlock_irqrestore(&vb_dev_info->pages_lock, flags);
  +   vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
  +   set_page_pfns(vb->pfns, newpage);
  +   tell_host(vb, vb->inflate_vq);
 
 tell_host does wait_event(), so you can't call it under the page_lock.
 Right?

Sleeping inside lock_page() is OK.  More problematic is that GFP_KERNEL
allocation.  iirc it _should_ be OK.  Core VM uses trylock_page() and
the filesystems shouldn't be doing a synchronous lock_page() in the
pageout path.  But I suspect it isn't a well-tested area.

 You probably get away with it because current qemu will service you
 immediately.  You could spin here in this case for the moment.
 
 There's a second call to tell_host():
 
  +   /*
  +* balloon's page migration 2nd step -- deflate page
  +*
  +* It's safe to delete page->lru here because this page is at
  +* an isolated migration list, and this step is expected to happen here
  +*/
  +   balloon_page_delete(page);
  +   vb->num_pfns = VIRTIO_BALLOON_PAGES_PER_PAGE;
  +   set_page_pfns(vb->pfns, page);
  +   tell_host(vb, vb->deflate_vq);
 
 The first one can be delayed, the second one can be delayed if the host
 didn't ask for VIRTIO_BALLOON_F_MUST_TELL_HOST (qemu doesn't).
 
 We could implement a proper request queue for these, and return -EAGAIN
 if the queue fills.  Though in practice, it's not important (it might
 help performance).

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH -next] virtio: balloon: fix missing unlock on error in fill_balloon()

2012-11-12 Thread Andrew Morton
On Mon, 12 Nov 2012 21:50:40 +0800
Wei Yongjun weiyj...@gmail.com wrote:

 From: Wei Yongjun yongjun_...@trendmicro.com.cn
 
 Add the missing unlock before return from function fill_balloon()
 in the error handling case.
 
 Introduced by commit 9864a8 ("virtio_balloon: introduce migration primitives
 to balloon pages")
 
 dpatch engine is used to auto generate this patch.
 (https://github.com/weiyj/dpatch)
 
 Signed-off-by: Wei Yongjun yongjun_...@trendmicro.com.cn
 ---
  drivers/virtio/virtio_balloon.c | 4 +++-
  1 file changed, 3 insertions(+), 1 deletion(-)
 
 diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
 index f70151b..72e8dcb 100644
 --- a/drivers/virtio/virtio_balloon.c
 +++ b/drivers/virtio/virtio_balloon.c
 @@ -152,8 +152,10 @@ static void fill_balloon(struct virtio_balloon *vb, size_t num)
   }
  
   /* Didn't get any?  Oh well. */
 - if (vb->num_pfns == 0)
 + if (vb->num_pfns == 0) {
 + mutex_unlock(&vb->balloon_lock);
   return;
 + }
  
   tell_host(vb, vb->inflate_vq);
   mutex_unlock(&vb->balloon_lock);

Well.  Why did this bug occur in the first place?  Because
fill_balloon() has multiple return points and one of those was
overlooked when adding new locking.

This (and the addition of memory leaks) is quite common.  It is part of
the reason why so much kernel code uses goto's to avoid multiple return
points.  A maintainability thing.

And we can fix this shortcoming in fill_balloon() without even needing a
goto:

--- a/drivers/virtio/virtio_balloon.c~virtio_balloon-introduce-migration-primitives-to-balloon-pages-fix-fix
+++ a/drivers/virtio/virtio_balloon.c
@@ -151,13 +151,9 @@ static void fill_balloon(struct virtio_b
totalram_pages--;
}
 
-   /* Didn't get any?  Oh well. */
-   if (vb->num_pfns == 0) {
-   mutex_unlock(&vb->balloon_lock);
-   return;
-   }
-
-   tell_host(vb, vb->inflate_vq);
+   /* Did we get any? */
+   if (vb->num_pfns != 0)
+   tell_host(vb, vb->inflate_vq);
 mutex_unlock(&vb->balloon_lock);
 }
 
_

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v11 7/7] mm: add vm event counters for balloon pages compaction

2012-11-20 Thread Andrew Morton
On Fri, 9 Nov 2012 12:58:29 -0200
Rafael Aquini aqu...@redhat.com wrote:

 On Fri, Nov 09, 2012 at 12:20:33PM +, Mel Gorman wrote:
  On Wed, Nov 07, 2012 at 01:05:54AM -0200, Rafael Aquini wrote:
   This patch introduces a new set of vm event counters to keep track of
   ballooned pages compaction activity.
   
   Signed-off-by: Rafael Aquini aqu...@redhat.com
  
  Other than confirming the thing actually works, can any meaningful
  conclusions be drawn from these counters?
  
  I know I have been inconsistent on this myself in the past but recently
  I've been taking the attitude that the counters can be used to fit into
  some other metric. I'm looking to change the compaction counters to be
  able to build a basic cost model for example. The same idea could be
  used for balloons of course but it's a less critical path than
  compaction for THP for example.
  
  Assuming it builds and all the defines are correct when the feature is
  not configured (I didn't check) then there is nothing wrong with the
  patch. However, if it was dropped would it make life very hard or would
  you notice?
  
 
 Originally, I proposed this patch as droppable (and it's still droppable)
 because its major purpose was solely to show the thing working consistently
 
 OTOH, it might make life easier to spot breakages if it remains with the
 merged bits, and per a reviewer request I removed its 'DROP BEFORE MERGE'
 disclaimer.
 
https://lkml.org/lkml/2012/8/8/616

There's a lot to be said for not merging things.

I think I'll maintain this as a mm-only patch.  That way it's
available in linux-next and we can merge it later if a need arises.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v11 4/7] mm: introduce compaction and migration for ballooned pages

2012-11-20 Thread Andrew Morton
On Fri, 9 Nov 2012 12:16:02 +
Mel Gorman m...@csn.ul.ie wrote:

 On Wed, Nov 07, 2012 at 01:05:51AM -0200, Rafael Aquini wrote:
  Memory fragmentation introduced by ballooning might reduce significantly
  the number of 2MB contiguous memory blocks that can be used within a guest,
  thus imposing performance penalties associated with the reduced number of
  transparent huge pages that could be used by the guest workload.
  
  This patch introduces the helper functions as well as the necessary changes
  to teach compaction and migration bits how to cope with pages which are
  part of a guest memory balloon, in order to make them movable by memory
  compaction procedures.
  

 ...

  --- a/mm/compaction.c
  +++ b/mm/compaction.c
  @@ -14,6 +14,7 @@
   #include <linux/backing-dev.h>
   #include <linux/sysctl.h>
   #include <linux/sysfs.h>
  +#include <linux/balloon_compaction.h>
   #include "internal.h"
   
   #if defined CONFIG_COMPACTION || defined CONFIG_CMA
  @@ -565,9 +566,24 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
  goto next_pageblock;
  }
   
  -   /* Check may be lockless but that's ok as we recheck later */
  -   if (!PageLRU(page))
  +   /*
  +* Check may be lockless but that's ok as we recheck later.
  +* It's possible to migrate LRU pages and balloon pages
  +* Skip any other type of page
  +*/
  +   if (!PageLRU(page)) {
  +   if (unlikely(balloon_page_movable(page))) {
 
 Because it's lockless, it really seems that the barrier stuck down there
 is unnecessary. At worst you get a temporarily incorrect answer that you
 recheck later under page lock in balloon_page_isolate.

What happened with this?

Also: what barrier?
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH] drivers: virtio: Use PTR_RET function

2013-03-25 Thread Andrew Morton
On Tue, 26 Mar 2013 13:57:09 +1030 Rusty Russell ru...@rustcorp.com.au wrote:

 Alexandru Gheorghiu gheorghiuan...@gmail.com writes:
 
  Used PTR_RET function instead of IS_ERR and PTR_ERR.
  Patch found using coccinelle.
 
 WTF is PTR_RET?  PTR_RET doesn't return anything.  Why is it called
 that?  It doesn't even make sense.
 
 ZERO_OR_PTR_ERR() maybe.
 
 But what problem are we solving?  Insufficient churn in the tree?  Code
 being too readable?  This isn't some hard-to-get right corner case, or a
 missed optimization.
 
 Andrew, what am I missing here?

It seemed like a good idea at the time.  Merged it two years ago and
have since been keenly awaiting an opportunity to use it.

It seems that people _have_ been using it, but mainly netfilter people
and we know they're all crazy ;)

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH-v3 1/4] idr: Percpu ida

2013-08-20 Thread Andrew Morton
On Fri, 16 Aug 2013 23:09:06 + Nicholas A. Bellinger 
n...@linux-iscsi.org wrote:

 From: Kent Overstreet k...@daterainc.com
 
 Percpu frontend for allocating ids. With percpu allocation (that works),
 it's impossible to guarantee it will always be possible to allocate all
 nr_tags - typically, some will be stuck on a remote percpu freelist
 where the current job can't get to them.
 
 We do guarantee that it will always be possible to allocate at least
 (nr_tags / 2) tags - this is done by keeping track of which and how many
 cpus have tags on their percpu freelists. On allocation failure if
 enough cpus have tags that there could potentially be (nr_tags / 2) tags
 stuck on remote percpu freelists, we then pick a remote cpu at random to
 steal from.
 
 Note that there's no cpu hotplug notifier - we don't care, because
 steal_tags() will eventually get the down cpu's tags. We _could_ satisfy
 more allocations if we had a notifier - but we'll still meet our
 guarantees and it's absolutely not a correctness issue, so I don't think
 it's worth the extra code.

 ...

  include/linux/idr.h |   53 +
  lib/idr.c   |  316 +--

I don't think this should be in idr.[ch] at all.  It has no
relationship with the existing code.  Apart from duplicating its
functionality :(

 
 ...

 @@ -243,4 +245,55 @@ static inline int ida_get_new(struct ida *ida, int *p_id)
  
  void __init idr_init_cache(void);
  
 +/* Percpu IDA/tag allocator */
 +
 +struct percpu_ida_cpu;
 +
 +struct percpu_ida {
 + /*
 +  * number of tags available to be allocated, as passed to
 +  * percpu_ida_init()
 +  */
 + unsigned        nr_tags;
 +
 + struct percpu_ida_cpu __percpu  *tag_cpu;
 +
 + /*
 +  * Bitmap of cpus that (may) have tags on their percpu freelists:
 +  * steal_tags() uses this to decide when to steal tags, and which cpus
 +  * to try stealing from.
 +  *
 +  * It's ok for a freelist to be empty when its bit is set - steal_tags()
 +  * will just keep looking - but the bitmap _must_ be set whenever a
 +  * percpu freelist does have tags.
 +  */
 + unsigned long   *cpus_have_tags;

Why not cpumask_t?

 + struct {
 + spinlock_t  lock;
 + /*
 +  * When we go to steal tags from another cpu (see steal_tags()),
 +  * we want to pick a cpu at random. Cycling through them every
 +  * time we steal is a bit easier and more or less equivalent:
 +  */
 + unsigned        cpu_last_stolen;
 +
 + /* For sleeping on allocation failure */
 + wait_queue_head_t   wait;
 +
 + /*
 +  * Global freelist - it's a stack where nr_free points to the
 +  * top
 +  */
 + unsigned        nr_free;
 + unsigned        *freelist;
 + } ____cacheline_aligned_in_smp;

Why the ____cacheline_aligned_in_smp?

 +};
 
 ...

 +
 +/* Percpu IDA */
 +
 +/*
 + * Number of tags we move between the percpu freelist and the global freelist
 + * at a time

"between a percpu freelist" would be more accurate?

 + */
 +#define IDA_PCPU_BATCH_MOVE  32U
 +
 +/* Max size of percpu freelist, */
 +#define IDA_PCPU_SIZE  ((IDA_PCPU_BATCH_MOVE * 3) / 2)
 +
 +struct percpu_ida_cpu {
 + spinlock_t  lock;
 + unsigned        nr_free;
 + unsigned        freelist[];
 +};

Data structure needs documentation.  There's one of these per cpu.  I
guess nr_free and freelist are clear enough.  The presence of a lock
in a percpu data structure is a surprise.  It's for cross-cpu stealing,
I assume?
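
For other readers: a minimal sketch of the stealing path, which is what the
per-cpu lock exists for (names follow the patch; the details here are
illustrative, not the actual implementation):

static void steal_tags(struct percpu_ida *pool, struct percpu_ida_cpu *tags)
{
	unsigned cpu;

	for_each_set_bit(cpu, pool->cpus_have_tags, nr_cpu_ids) {
		struct percpu_ida_cpu *remote =
			per_cpu_ptr(pool->tag_cpu, cpu);

		/*
		 * The per-cpu lock exists for exactly this moment: the
		 * owning cpu and the thief can race on remote->freelist.
		 */
		spin_lock(&remote->lock);
		if (remote->nr_free) {
			memcpy(tags->freelist, remote->freelist,
			       sizeof(unsigned) * remote->nr_free);
			tags->nr_free = remote->nr_free;
			remote->nr_free = 0;
			clear_bit(cpu, pool->cpus_have_tags);
		}
		spin_unlock(&remote->lock);

		if (tags->nr_free)
			break;
	}
}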

 +static inline void move_tags(unsigned *dst, unsigned *dst_nr,
 +  unsigned *src, unsigned *src_nr,
 +  unsigned nr)
 +{
 + *src_nr -= nr;
 + memcpy(dst + *dst_nr, src + *src_nr, sizeof(unsigned) * nr);
 + *dst_nr += nr;
 +}
 +
 
 ...

 +static inline void alloc_global_tags(struct percpu_ida *pool,
 +  struct percpu_ida_cpu *tags)
 +{
 + move_tags(tags->freelist, &tags->nr_free,
 +   pool->freelist, &pool->nr_free,
 +   min(pool->nr_free, IDA_PCPU_BATCH_MOVE));
 +}

Document this function?

 +static inline unsigned alloc_local_tag(struct percpu_ida *pool,
 +struct percpu_ida_cpu *tags)
 +{
 + int tag = -ENOSPC;
 +
 + spin_lock(&tags->lock);
 + if (tags->nr_free)
 + tag = tags->freelist[--tags->nr_free];
 + spin_unlock(&tags->lock);
 +
 + return tag;
 +}

I guess this one's clear enough, if the data structure relationships are
understood.

 +/**
 + * percpu_ida_alloc - allocate a tag
 + * @pool: pool to allocate from
 + * @gfp: gfp flags
 + *
 + * Returns a tag - an integer in the 

Re: [PATCH-v3 1/4] idr: Percpu ida

2013-08-28 Thread Andrew Morton
On Wed, 28 Aug 2013 12:53:17 -0700 Kent Overstreet k...@daterainc.com wrote:

    + while (1) {
    + spin_lock(&pool->lock);
    +
    + /*
    +  * prepare_to_wait() must come before steal_tags(), in case
    +  * percpu_ida_free() on another cpu flips a bit in
    +  * cpus_have_tags
    +  *
    +  * global lock held and irqs disabled, don't need percpu lock
    +  */
    + prepare_to_wait(&pool->wait, &wait, TASK_UNINTERRUPTIBLE);
    +
    + if (!tags->nr_free)
    + alloc_global_tags(pool, tags);
    + if (!tags->nr_free)
    + steal_tags(pool, tags);
    +
    + if (tags->nr_free) {
    + tag = tags->freelist[--tags->nr_free];
    + if (tags->nr_free)
    + set_bit(smp_processor_id(),
    + pool->cpus_have_tags);
    + }
    +
    + spin_unlock(&pool->lock);
    + local_irq_restore(flags);
    +
    + if (tag >= 0 || !(gfp & __GFP_WAIT))
    + break;
    +
    + schedule();
    +
    + local_irq_save(flags);
    + tags = this_cpu_ptr(pool->tag_cpu);
    + }
  
  What guarantees that this wait will terminate?
 
 It seems fairly clear to me from the break statement a couple lines up;
 if we were passed __GFP_WAIT we terminate iff we succesfully allocated a
 tag. If we weren't passed __GFP_WAIT we never actually sleep.

OK ;)  Let me rephrase.  What guarantees that a tag will become available?

If what we have here is an open-coded __GFP_NOFAIL then that is
potentially problematic.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH] percpu ida: Switch to cpumask_t, add some comments

2013-08-28 Thread Andrew Morton
On Wed, 28 Aug 2013 12:55:17 -0700 Kent Overstreet k...@daterainc.com wrote:

 Fixup patch, addressing Andrew's review feedback:

Looks reasonable.

  lib/idr.c   | 38 +-

I still don't think it should be in this file.

You say that some as-yet-unmerged patches will tie the new code into
the old ida code.  But will it do it in a manner which requires that
the two reside in the same file?

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH-v3 1/4] idr: Percpu ida

2013-08-28 Thread Andrew Morton
On Wed, 28 Aug 2013 13:44:54 -0700 Kent Overstreet k...@daterainc.com wrote:

What guarantees that this wait will terminate?
   
   It seems fairly clear to me from the break statement a couple lines up;
   if we were passed __GFP_WAIT we terminate iff we succesfully allocated a
   tag. If we weren't passed __GFP_WAIT we never actually sleep.
  
  OK ;)  Let me rephrase.  What guarantees that a tag will become available?
  
  If what we have here is an open-coded __GFP_NOFAIL then that is
  potentially problematic.
 
 It's the same semantics as a mempool, really - it'll succeed when a tag
 gets freed.

OK, that's reasonable if the code is being used to generate IO tags -
we expect the in-flight tags to eventually be returned.

But if a client of this code is using the allocator for something
totally different, there is no guarantee that the act of waiting will
result in any tags being returned.

(These are core design principles/constraints which should be
explicitly documented in a place where future readers will see them!)

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH] percpu ida: Switch to cpumask_t, add some comments

2013-08-28 Thread Andrew Morton
On Wed, 28 Aug 2013 14:00:10 -0700 Kent Overstreet k...@daterainc.com wrote:

 On Wed, Aug 28, 2013 at 01:25:50PM -0700, Andrew Morton wrote:
  On Wed, 28 Aug 2013 12:55:17 -0700 Kent Overstreet k...@daterainc.com 
  wrote:
  
   Fixup patch, addressing Andrew's review feedback:
  
  Looks reasonable.
  
lib/idr.c   | 38 +-
  
  I still don't think it should be in this file.
  
  You say that some as-yet-unmerged patches will tie the new code into
  the old ida code.  But will it do it in a manner which requires that
  the two reside in the same file?
 
 Not require, no - but it's just intimate enough with my ida rewrite that
 I think it makes sense; it makes some use of stuff that should be
 internal to the ida code.
 
 Mostly just sharing the lock though, since I got rid of the ida
 interfaces that don't do locking, but percpu ida needs a lock that also
 covers what ida needs.
 
 It also makes use of a ganged allocation interface, but there's no real
 reason ida can't expose that, it's just unlikely to be useful to
 anything but percpu ida.
 
 The other reason I think it makes sense to live in idr.c is more for
 users of the code; as you pointed out as far as the user's perspective
 percpu ida isn't doing anything fundamentally different from ida, so I
 think it makes sense for the code to live in the same place as a
 kindness to future kernel developers who are trying to find their way
 around the various library code.

I found things to be quite the opposite - it took 5 minutes of staring,
head-scratching, double-checking and penny-dropping before I was
confident that the newly-added code actually has nothing at all to do
with the current code.  Putting it in the same file was misleading, and
I got misled.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH-v3 1/4] idr: Percpu ida

2013-08-28 Thread Andrew Morton
On Wed, 28 Aug 2013 14:12:17 -0700 Kent Overstreet k...@daterainc.com wrote:

 How's this look?
 
 diff --git a/lib/idr.c b/lib/idr.c
 index 15c021c..a3f8e9a 100644
 --- a/lib/idr.c
 +++ b/lib/idr.c
 @@ -1288,6 +1288,11 @@ static inline unsigned alloc_local_tag(struct percpu_ida *pool,
   * Safe to be called from interrupt context (assuming it isn't passed
   * __GFP_WAIT, of course).
   *
  + * @gfp indicates whether or not to wait until a free id is available (it's not
  + * used for internal memory allocations); thus if passed __GFP_WAIT we may sleep
  + * however long it takes until another thread frees an id (same semantics as a
  + * mempool).

Looks good.  Mentioning the mempool thing is effective - people
understand that.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH] percpu ida: Switch to cpumask_t, add some comments

2013-08-28 Thread Andrew Morton
On Wed, 28 Aug 2013 14:23:58 -0700 Kent Overstreet k...@daterainc.com wrote:

  I found things to be quite the opposite - it took 5 minutes of staring,
  head-scratching, double-checking and penny-dropping before I was
  confident that the newly-added code actually has nothing at all to do
  with the current code.  Putting it in the same file was misleading, and
  I got misled.
 
 Ok... and I could see how the fact that it currently _doesn't_ have
 anything to do with the existing code would be confusing...
 
 Do you think that if/when it's making use of the ida rewrite it'll be
 ok? Or would you still prefer to have it in a new file

I'm constitutionally reluctant to ever assume that any out-of-tree code
will be merged.  Maybe you'll get hit by a bus, and maybe the code
sucks ;)

Are you sure that the two things are so tangled together that they must
live in the same file?  If there's some nice layering between ida and
percpu_ida then perhaps such a physical separation would remain
appropriate?

 (and if so, any preference on the naming?)

percpu_ida.c?
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [RFCv3 0/5] enable migration of driver pages

2015-07-07 Thread Andrew Morton
On Tue,  7 Jul 2015 13:36:20 +0900 Gioh Kim gioh@lge.com wrote:

 From: Gioh Kim guru...@hanmail.net
 
 Hello,
 
 This series try to enable migration of non-LRU pages, such as driver's page.
 
 My ARM-based platform ran into a severe fragmentation problem after long-term
 (several days) testing. Sometimes even order-3 page allocation failed. It has
 memory size 512MB ~ 1024MB. 30% ~ 40% of memory is consumed for graphics
 processing and 20~30% is reserved for zram.
 
 I found that many pages of the GPU driver and zram are non-movable pages. So I
 reported it to Minchan Kim, the maintainer of zram, and he made the internal
 compaction logic of zram. And I made the internal compaction of the GPU driver.
 
 They reduced some fragmentation but they are not effective enough.
 They are activated by their own interface, /sys, so they are not cooperative
 with kernel compaction. If there is too much fragmentation and the kernel starts
 compaction, zram and the GPU driver cannot work with the kernel compaction.

 ...

 This patch set is tested:
 - turn on Ubuntu 14.04 with 1G memory on qemu.
 - do kernel building
 - after several seconds check more than 512MB is used with free command
 - command "balloon 512" in the qemu monitor
 - check hundreds MB of pages are migrated

OK, but what happens if the balloon driver is not used to force
compaction?  Does your test machine successfully compact pages on
demand, so those order-3 allocations now succeed?

Why are your changes to the GPU driver not included in this patch series?


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [RFCv3 0/5] enable migration of driver pages

2015-07-07 Thread Andrew Morton
On Wed, 08 Jul 2015 09:02:59 +0900 Gioh Kim gioh@lge.com wrote:

 
 
 On 2015-07-08 7:37, Andrew Morton wrote:
  On Tue,  7 Jul 2015 13:36:20 +0900 Gioh Kim gioh@lge.com wrote:
 
  From: Gioh Kim guru...@hanmail.net
 
  Hello,
 
  This series try to enable migration of non-LRU pages, such as driver's 
  page.
 
  My ARM-based platform ran into a severe fragmentation problem after long-term
  (several days) testing. Sometimes even order-3 page allocation failed. It has
  memory size 512MB ~ 1024MB. 30% ~ 40% of memory is consumed for graphics
  processing and 20~30% is reserved for zram.
 
  I found that many pages of the GPU driver and zram are non-movable pages. So I
  reported it to Minchan Kim, the maintainer of zram, and he made the internal
  compaction logic of zram. And I made the internal compaction of the GPU driver.
 
  They reduced some fragmentation but they are not effective enough.
  They are activated by their own interface, /sys, so they are not cooperative
  with kernel compaction. If there is too much fragmentation and the kernel starts
  compaction, zram and the GPU driver cannot work with the kernel compaction.
 
  ...
 
  This patch set is tested:
  - turn on Ubuntu 14.04 with 1G memory on qemu.
  - do kernel building
  - after several seconds check more than 512MB is used with free command
  - command "balloon 512" in the qemu monitor
  - check hundreds MB of pages are migrated
 
  OK, but what happens if the balloon driver is not used to force
  compaction?  Does your test machine successfully compact pages on
  demand, so those order-3 allocations now succeed?
 
 If any driver that has many pages like the balloon driver is forced to 
 compact,
 the system can get free high-order pages.
 
 I have to show how this patch works with a driver existing in the kernel
 source, for kernel developers' understanding. So I selected the balloon driver
 because it has already compaction and working with kernel compaction.
 I can show how driver pages is compacted with lru-pages together.
 
 Actually balloon driver is not best example to show how this patch compacts 
 pages.
 The balloon driver compaction is decreasing page consumption, for instance
 1024MB -> 512MB.
 I think it is not compaction precisely. It frees pages.
 Of course there will be many high-order pages after 512MB is freed.

Can the various in-kernel GPU drivers benefit from this?  If so, wiring
up one or more of those would be helpful?


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v7 00/12] Support non-lru page migration

2016-06-01 Thread Andrew Morton
On Wed,  1 Jun 2016 08:21:09 +0900 Minchan Kim  wrote:

> Recently, I got many reports about performance degradation in embedded
> system(Android mobile phone, webOS TV and so on) and easy fork fail.
> 
> The problem was fragmentation caused by zram and GPU driver mainly.
> With memory pressure, their pages were spread out over all pageblocks and
> it cannot be migrated with current compaction algorithm which supports
> only LRU pages. In the end, compaction cannot work well so reclaimer
> shrinks all of working set pages. It made system very slow and even to
> fail to fork easily which requires order-[2 or 3] allocations.
> 
> Other pain point is that they cannot use CMA memory space so when OOM
> kill happens, I can see many free pages in CMA area, which is not
> memory efficient. In our product which has big CMA memory, it reclaims
> zones too excessively to allocate GPU and zram page although there are
> lots of free space in CMA so system becomes very slow easily.

But this isn't presently implemented for GPU drivers or for CMA, yes?

What's the story there?
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v3 00/16] Support non-lru page migration

2016-03-30 Thread Andrew Morton
On Wed, 30 Mar 2016 16:11:59 +0900 Minchan Kim  wrote:

> Recently, I got many reports about performance degradation
> in embedded system(Android mobile phone, webOS TV and so on)
> and failed to fork easily.
> 
> The problem was fragmentation caused by zram and GPU driver
> pages. Their pages cannot be migrated so compaction cannot
> work well, either so reclaimer ends up shrinking all of working
> set pages. It made system very slow and even to fail to fork
> easily.
> 
> Other pain point is that they cannot work with CMA.
> Most of CMA memory space could be idle(ie, it could be used
> for movable pages unless driver is using) but if driver(i.e.,
> zram) cannot migrate his page, that memory space could be
> wasted. In our product which has big CMA memory, it reclaims
> zones too excessively although there are lots of free space
> in CMA so system was very slow easily.
> 
> To solve these problem, this patch try to add facility to
> migrate non-lru pages via introducing new friend functions
> of migratepage in address_space_operation and new page flags.
> 
>   (isolate_page, putback_page)
>   (PG_movable, PG_isolated)
> 
> For details, please read description in
> "mm/compaction: support non-lru movable page migration".

OK, I grabbed all these.

I wonder about testing coverage during the -next period.  How many
people are likely to exercise these code paths in a serious way before
it all hits mainline?
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v4 00/13] Support non-lru page migration

2016-04-27 Thread Andrew Morton
On Wed, 27 Apr 2016 16:48:13 +0900 Minchan Kim  wrote:

> Recently, I got many reports about performance degradation in embedded
> system(Android mobile phone, webOS TV and so on) and easy fork fail.
> 
> The problem was fragmentation caused by zram and GPU driver mainly.
> With memory pressure, their pages were spread out over all pageblocks and
> it cannot be migrated with current compaction algorithm which supports
> only LRU pages. In the end, compaction cannot work well so reclaimer
> shrinks all of working set pages. It made system very slow and even to
> fail to fork easily which requires order-[2 or 3] allocations.
> 
> Other pain point is that they cannot use CMA memory space so when OOM
> kill happens, I can see many free pages in CMA area, which is not
> memory efficient. In our product which has big CMA memory, it reclaims
> zones too excessively to allocate GPU and zram page although there are
> lots of free space in CMA so system becomes very slow easily.
> 
> To solve these problem, this patch tries to add facility to migrate
> non-lru pages via introducing new functions and page flags to help
> migration.

I'm seeing some rejects here against Mel's changes and our patch
bandwidth is getting waaay way ahead of our review bandwidth.  So I
think I'll loadshed this patchset at this time, sorry.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH kernel v8 3/4] mm: add inerface to offer info about unused pages

2017-03-16 Thread Andrew Morton
On Thu, 16 Mar 2017 15:08:46 +0800 Wei Wang  wrote:

> From: Liang Li 
> 
> This patch adds a function to provide a snapshot of the present system's
> unused pages. An important usage of this function is to provide the
> unused pages to the live migration thread, which skips the transfer of
> those unused pages. Newly used pages can be re-tracked by the dirty
> page logging mechanisms.

I don't think this will be useful for anything other than
virtio-balloon.  I guess it would be better to keep this code in the
virtio-balloon driver if possible, even though that's rather a layering
violation :( What would have to be done to make that possible?  Perhaps
we can put some *small* helpers into page_alloc.c to prevent things
from becoming too ugly.

> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4498,6 +4498,120 @@ void show_free_areas(unsigned int filter)
>   show_swap_cache_info();
>  }
>  
> +static int __record_unused_pages(struct zone *zone, int order,
> +  __le64 *buf, unsigned int size,
> +  unsigned int *offset, bool part_fill)
> +{
> + unsigned long pfn, flags;
> + int t, ret = 0;
> + struct list_head *curr;
> + __le64 *chunk;
> +
> + if (zone_is_empty(zone))
> + return 0;
> +
> + spin_lock_irqsave(&zone->lock, flags);
> +
> + if (*offset + zone->free_area[order].nr_free > size && !part_fill) {
> + ret = -ENOSPC;
> + goto out;
> + }
> + for (t = 0; t < MIGRATE_TYPES; t++) {
> + list_for_each(curr, &zone->free_area[order].free_list[t]) {
> + pfn = page_to_pfn(list_entry(curr, struct page, lru));
> + chunk = buf + *offset;
> + if (*offset + 2 > size) {
> + ret = -ENOSPC;
> + goto out;
> + }
> + /* Align to the chunk format used in virtio-balloon */
> + *chunk = cpu_to_le64(pfn << 12);
> + *(chunk + 1) = cpu_to_le64((1 << order) << 12);
> + *offset += 2;
> + }
> + }
> +
> +out:
> + spin_unlock_irqrestore(&zone->lock, flags);
> +
> + return ret;
> +}

This looks like it could disable interrupts for a long time.  Too long?
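
If so, one way to bound the irqs-off window would be to take the lock per
free list rather than around the whole walk - a sketch only; the snapshot
becomes even less atomic, and a single list can still be long:

	for (t = 0; t < MIGRATE_TYPES; t++) {
		spin_lock_irqsave(&zone->lock, flags);
		list_for_each(curr, &zone->free_area[order].free_list[t]) {
			/* record the chunk, as above */
		}
		spin_unlock_irqrestore(&zone->lock, flags);
		cond_resched();
	}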

> +/*
> + * The record_unused_pages() function is used to record the system unused
> + * pages. The unused pages can be skipped to transfer during live migration.
> + * Though the unused pages are dynamically changing, dirty page logging
> + * mechanisms are able to capture the newly used pages though they were
> + * recorded as unused pages via this function.
> + *
> + * This function scans the free page list of the specified order to record
> + * the unused pages, and chunks those continuous pages following the chunk
> + * format below:
> + * --
> + * | Base (52-bit)   | Rsvd (12-bit) |
> + * --
> + * --
> + * | Size (52-bit)   | Rsvd (12-bit) |
> + * --
> + *
> + * @start_zone: zone to start the record operation.
> + * @order: order of the free page list to record.
> + * @buf: buffer to record the unused page info in chunks.
> + * @size: size of the buffer in __le64 to record
> + * @offset: offset in the buffer to record.
> + * @part_fill: indicate if partial fill is used.
> + *
> + * return -EINVAL if parameter is invalid
> + * return -ENOSPC when the buffer is too small to record all the unused pages
> + * return 0 on success
> + */

It's a strange thing - it returns information which will instantly
become incorrect.

> +int record_unused_pages(struct zone **start_zone, int order,
> + __le64 *buf, unsigned int size,
> + unsigned int *offset, bool part_fill)
> +{
> + struct zone *zone;
> + int ret = 0;
> + bool skip_check = false;
> +
> + /* Make sure all the parameters are valid */
> + if (buf == NULL || offset == NULL || order >= MAX_ORDER)
> + return -EINVAL;
> +
> + if (*start_zone != NULL) {
> + bool found = false;
> +
> + for_each_populated_zone(zone) {
> + if (zone != *start_zone)
> + continue;
> + found = true;
> + break;
> + }
> + if (!found)
> + return -EINVAL;
> + } else
> + skip_check = true;
> +
> + for_each_populated_zone(zone) {
> + /* Start from *start_zone if it's not NULL */
> + if (!skip_check) {
> + if (*start_zone != zone)
> + continue;
> + else
> + skip_check = true;
> + }
> + ret = __record_unused_pages(zone, order, buf, 

Re: [PATCH v9 3/5] mm: function to offer a page block on the free list

2017-04-13 Thread Andrew Morton
On Thu, 13 Apr 2017 17:35:06 +0800 Wei Wang  wrote:

> Add a function to find a page block on the free list specified by the
> caller. Pages from the page block may be used immediately after the
> function returns. The caller is responsible for detecting or preventing
> the use of such pages.
> 
> ...
>
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4498,6 +4498,93 @@ void show_free_areas(unsigned int filter)
>   show_swap_cache_info();
>  }
>  
> +/**
> + * Heuristically get a page block in the system that is unused.
> + * It is possible that pages from the page block are used immediately after
> + * inquire_unused_page_block() returns. It is the caller's responsibility
> + * to either detect or prevent the use of such pages.
> + *
> + * The free list to check: zone->free_area[order].free_list[migratetype].
> + *
> + * If the caller supplied page block (i.e. **page) is on the free list, offer
> + * the next page block on the list to the caller. Otherwise, offer the first
> + * page block on the list.
> + *
> + * Return 0 when a page block is found on the caller specified free list.
> + */
> +int inquire_unused_page_block(struct zone *zone, unsigned int order,
> +   unsigned int migratetype, struct page **page)
> +{

Perhaps we can wrap this in the appropriate ifdef so the kernels which
won't be using virtio-balloon don't carry the added overhead.
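
i.e. something like this in the header (illustrative; use whatever config
symbol the balloon feature ends up depending on):

#ifdef CONFIG_VIRTIO_BALLOON
int inquire_unused_page_block(struct zone *zone, unsigned int order,
			      unsigned int migratetype, struct page **page);
#else
static inline int inquire_unused_page_block(struct zone *zone,
					    unsigned int order,
					    unsigned int migratetype,
					    struct page **page)
{
	return -ENOSYS;
}
#endif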


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v13 1/5] Introduce xbitmap

2017-08-09 Thread Andrew Morton
On Thu,  3 Aug 2017 14:38:15 +0800 Wei Wang  wrote:

> From: Matthew Wilcox 
> 
> The eXtensible Bitmap is a sparse bitmap representation which is
> efficient for set bits which tend to cluster.  It supports up to
> 'unsigned long' worth of bits, and this commit adds the bare bones --
> xb_set_bit(), xb_clear_bit() and xb_test_bit().

Would like to see some additional details here justifying the change. 
The sole user is virtio-balloon, yes?  What alternatives were examined
and what are the benefits of this approach?

Have you identified any other subsystems which could utilize this?

>
> ...
>
> --- a/lib/radix-tree.c
> +++ b/lib/radix-tree.c
> @@ -37,6 +37,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  
>  /* Number of nodes in fully populated tree of given height */
> @@ -78,6 +79,14 @@ static struct kmem_cache *radix_tree_node_cachep;
>  #define IDA_PRELOAD_SIZE (IDA_MAX_PATH * 2 - 1)
>  
>  /*
> + * The XB can go up to unsigned long, but also uses a bitmap.

This comment is hard to understand.

> + */
> +#define XB_INDEX_BITS  (BITS_PER_LONG - ilog2(IDA_BITMAP_BITS))
> +#define XB_MAX_PATH  (DIV_ROUND_UP(XB_INDEX_BITS, \
> +   RADIX_TREE_MAP_SHIFT))
> +#define XB_PRELOAD_SIZE  (XB_MAX_PATH * 2 - 1)
> +
>
> ...
>  
> +void xb_preload(gfp_t gfp)
> +{
> + __radix_tree_preload(gfp, XB_PRELOAD_SIZE);
> + if (!this_cpu_read(ida_bitmap)) {
> + struct ida_bitmap *bitmap = kmalloc(sizeof(*bitmap), gfp);
> +
> + if (!bitmap)
> + return;
> + bitmap = this_cpu_cmpxchg(ida_bitmap, NULL, bitmap);
> + kfree(bitmap);
> + }
> +}
> +EXPORT_SYMBOL(xb_preload);

Please document the exported API.  It's conventional to do this in
kerneldoc but for some reason kerneldoc makes people write
uninteresting and unuseful documentation.  Be sure to cover the
*useful* stuff: what it does, why it does it, under which circumstances
it should be used, what the caller-provided locking should look like,
what the return values mean, etc.  Stuff which programmers actually
will benefit from knowing.
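
For example, something along these lines (illustrative only - please verify
the preemption claim against __radix_tree_preload() before copying it):

/**
 * xb_preload - preload for xb_set_bit()
 * @gfp: allocation mask to use for preloading
 *
 * Allocate the radix-tree nodes and ida_bitmap which xb_set_bit() may
 * need, so that the bit-setting itself cannot fail with -ENOMEM.  The
 * caller must call xb_preload_end() afterwards and must not sleep in
 * between, since preloading leaves preemption disabled.
 */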

> +int xb_set_bit(struct xb *xb, unsigned long bit)
>
> ...
>
> +int xb_clear_bit(struct xb *xb, unsigned long bit)

There's quite a lot of common code here.  Did you investigate factoring
that out in some fashion?

> +bool xb_test_bit(const struct xb *xb, unsigned long bit)
> +{
> + unsigned long index = bit / IDA_BITMAP_BITS;
> + const struct radix_tree_root *root = &xb->xbrt;
> + struct ida_bitmap *bitmap = radix_tree_lookup(root, index);
> +
> + bit %= IDA_BITMAP_BITS;
> +
> + if (!bitmap)
> + return false;
> + if (radix_tree_exception(bitmap)) {
> + bit += RADIX_TREE_EXCEPTIONAL_SHIFT;
> + if (bit > BITS_PER_LONG)
> + return false;
> + return (unsigned long)bitmap & (1UL << bit);
> + }
> + return test_bit(bit, bitmap->bitmap);
> +}
> +

Missing EXPORT_SYMBOL?


Perhaps all this code should go into a new lib/xbitmap.c.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v18 01/10] idr: add #include <linux/bug.h>

2017-11-30 Thread Andrew Morton
On Wed, 29 Nov 2017 16:58:17 -0800 Matthew Wilcox  wrote:

> On Wed, Nov 29, 2017 at 09:55:17PM +0800, Wei Wang wrote:
> > The <linux/bug.h> was removed from radix-tree.h by the following commit:
> > f5bba9d11a256ad2a1c2f8e7fc6aabe6416b7890.
> > 
> > Since that commit, tools/testing/radix-tree/ couldn't pass compilation
> > due to: tools/testing/radix-tree/idr.c:17: undefined reference to
> > WARN_ON_ONCE. This patch adds the bug.h header to idr.h to solve the
> > issue.
> 
> Thanks; I sent this same patch out yesterday.
> 
> Unfortunately, you didn't cc the author of this breakage, Masahiro Yamada.
> I want to highlight that these kinds of header cleanups are risky,
> and very low reward.  I really don't want to see patches going all over
> the tree randomly touching header files.  If we've got a real problem
> to solve, then sure.  But I want to see a strong justification for any
> more header file cleanups.

I tend to disagree.  We accumulate more and more cruft over time so it
is good to be continually hacking away at it.  These little build
breaks happen occasionally but they are trivially and quickly fixed. 
If a small minority of these cleanups require a followup patch which
consumes a global ten person minutes then that seems an acceptable
price to pay.  Says the guy who pays most of that price :)

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH] kvmalloc: always use vmalloc if CONFIG_DEBUG_VM

2018-05-01 Thread Andrew Morton
On Tue, 24 Apr 2018 12:33:01 -0400 (EDT) Mikulas Patocka  
wrote:

> 
> 
> On Tue, 24 Apr 2018, Michal Hocko wrote:
> 
> > On Tue 24-04-18 11:30:40, Mikulas Patocka wrote:
> > > 
> > > 
> > > On Tue, 24 Apr 2018, Michal Hocko wrote:
> > > 
> > > > On Mon 23-04-18 20:25:15, Mikulas Patocka wrote:
> > > > 
> > > > > Fixing __vmalloc code 
> > > > > is easy and it doesn't require cooperation with maintainers.
> > > > 
> > > > But it is a hack against the intention of the scope api.
> > > 
> > > It is not!
> > 
> > This discussion simply doesn't make much sense it seems. The scope API
> > is to document the scope of the reclaim recursion critical section. That
> > certainly is not a utility function like vmalloc.
> 
> That 15-line __vmalloc bugfix doesn't prevent you (or any other kernel 
> developer) from converting the code to the scope API. You make nonsensical 
> excuses.
> 

Fun thread!

Winding back to the original problem, I'd state it as

- Caller uses kvmalloc() but passes the address into vmalloc-naive
  DMA API and

- Caller uses kvmalloc() but passes the address into kfree()

Yes?

If so, then...

Is there a way in which, in the kvmalloc-called-kmalloc path, we can
tag the slab-allocated memory with a "this memory was allocated with
kvmalloc()" flag?  I *think* there's extra per-object storage available
with suitable slab/slub debugging options?  Perhaps we could steal one
bit from the redzone, dunno.

If so then we can

a) set that flag in kvmalloc() if the kmalloc() call succeeded

b) check for that flag in the DMA code, WARN if it is set.

c) in kvfree(), clear that flag before calling kfree()

d) in kfree(), check for that flag and go WARN() if set.

So both potential bugs are detected all the time, dependent upon
CONFIG_SLUB_DEBUG (and perhaps other slub config options).
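
In kvfree() that would look something like this (a sketch only;
kvmalloc_tag_clear() is a made-up name for whatever per-object debug
storage we end up using):

void kvfree(const void *addr)
{
	if (is_vmalloc_addr(addr)) {
		vfree(addr);
	} else {
		kvmalloc_tag_clear(addr);	/* step c) above */
		kfree(addr);
	}
}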

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v25 1/2 RESEND] mm: support reporting free page blocks

2018-01-25 Thread Andrew Morton
On Thu, 25 Jan 2018 17:38:27 +0800 Wei Wang  wrote:

> This patch adds support to walk through the free page blocks in the
> system and report them via a callback function. Some page blocks may
> leave the free list after zone->lock is released, so it is the caller's
> responsibility to either detect or prevent the use of such pages.
> 
> One use example of this patch is to accelerate live migration by skipping
> the transfer of free pages reported from the guest. A popular method used
> by the hypervisor to track which part of memory is written during live
> migration is to write-protect all the guest memory. So, those pages that
> are reported as free pages but are written after the report function
> returns will be captured by the hypervisor, and they will be added to the
> next round of memory transfer.

It would be useful if we had some quantitative testing results, so we
can see the real-world benefits from this change?

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v29 1/4] mm: support reporting free page blocks

2018-04-10 Thread Andrew Morton
On Tue, 10 Apr 2018 21:19:31 +0300 "Michael S. Tsirkin"  wrote:

> 
> Andrew, were your questions answered? If yes could I bother you for an ack on 
> this?
> 

Still not very happy that readers are told that "this function may
sleep" when it clearly doesn't do so.  If we wish to be able to change
it to sleep in the future then that should be mentioned.  And even put a
might_sleep() in there, to catch people who didn't read the comments...

Otherwise it looks OK.
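
i.e. put this at the top of the walker (illustrative):

	/* catch callers which couldn't tolerate this function sleeping later */
	might_sleep();
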
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH] kvmalloc: always use vmalloc if CONFIG_DEBUG_VM

2018-04-19 Thread Andrew Morton
On Thu, 19 Apr 2018 12:12:38 -0400 (EDT) Mikulas Patocka  
wrote:

> The kvmalloc function tries to use kmalloc and falls back to vmalloc if
> kmalloc fails.
> 
> Unfortunately, some kernel code has bugs - it uses kvmalloc and then
> uses DMA-API on the returned memory or frees it with kfree. Such bugs were
> found in the virtio-net driver, dm-integrity or RHEL7 powerpc-specific
> code.
> 
> These bugs are hard to reproduce because vmalloc falls back to kmalloc
> only if memory is fragmented.

Yes, that's nasty.

> In order to detect these bugs reliably I submit this patch that changes
> kvmalloc to always use vmalloc if CONFIG_DEBUG_VM is turned on.
> 
> ...
>
> --- linux-2.6.orig/mm/util.c  2018-04-18 15:46:23.0 +0200
> +++ linux-2.6/mm/util.c   2018-04-18 16:00:43.0 +0200
> @@ -395,6 +395,7 @@ EXPORT_SYMBOL(vm_mmap);
>   */
>  void *kvmalloc_node(size_t size, gfp_t flags, int node)
>  {
> +#ifndef CONFIG_DEBUG_VM
>   gfp_t kmalloc_flags = flags;
>   void *ret;
>  
> @@ -426,6 +427,7 @@ void *kvmalloc_node(size_t size, gfp_t f
>*/
>   if (ret || size <= PAGE_SIZE)
>   return ret;
> +#endif
>  
>   return __vmalloc_node_flags_caller(size, node, flags,
>   __builtin_return_address(0));

Well, it doesn't have to be done at compile-time, does it?  We could
add a knob (in debugfs, presumably) which enables this at runtime. 
That's far more user-friendly.
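
A minimal sketch of such a knob, with an illustrative name (this is not an
existing interface), kvmalloc_node() testing the flag at runtime in place of
the compile-time #ifndef above:

static bool kvmalloc_always_vmalloc;

static int __init kvmalloc_debug_init(void)
{
	debugfs_create_bool("kvmalloc_always_vmalloc", 0644, NULL,
			    &kvmalloc_always_vmalloc);
	return 0;
}
late_initcall(kvmalloc_debug_init);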


Re: [PATCH] kvmalloc: always use vmalloc if CONFIG_DEBUG_VM

2018-04-19 Thread Andrew Morton
On Thu, 19 Apr 2018 17:19:20 -0400 (EDT) Mikulas Patocka  
wrote:

> > > In order to detect these bugs reliably I submit this patch that changes
> > > kvmalloc to always use vmalloc if CONFIG_DEBUG_VM is turned on.
> > > 
> > > ...
> > >
> > > --- linux-2.6.orig/mm/util.c  2018-04-18 15:46:23.0 +0200
> > > +++ linux-2.6/mm/util.c   2018-04-18 16:00:43.0 +0200
> > > @@ -395,6 +395,7 @@ EXPORT_SYMBOL(vm_mmap);
> > >   */
> > >  void *kvmalloc_node(size_t size, gfp_t flags, int node)
> > >  {
> > > +#ifndef CONFIG_DEBUG_VM
> > >   gfp_t kmalloc_flags = flags;
> > >   void *ret;
> > >  
> > > @@ -426,6 +427,7 @@ void *kvmalloc_node(size_t size, gfp_t f
> > >*/
> > >   if (ret || size <= PAGE_SIZE)
> > >   return ret;
> > > +#endif
> > >  
> > >   return __vmalloc_node_flags_caller(size, node, flags,
> > >   __builtin_return_address(0));
> > 
> > Well, it doesn't have to be done at compile-time, does it?  We could
> > add a knob (in debugfs, presumably) which enables this at runtime. 
> > That's far more user-friendly.
> 
> But who will turn it on in debugfs?

But who will turn it on in Kconfig?  Just a handful of developers.  We
could add CONFIG_DEBUG_VM to the list in
Documentation/process/submit-checklist.rst, but nobody reads that.

Also, a whole bunch of defconfigs set CONFIG_DEBUG_VM=y and some
googling indicates that they aren't the only ones...

> It should be the default for debugging 
> kernels, so that users using them would report the error.

Well.  This isn't the first time we've wanted to enable expensive (or
noisy) debugging things in debug kernels, by any means.

So how could we define a debug kernel in which it's OK to enable such
things?

- Could be "it's an -rc kernel".  But then we'd be enabling a bunch of
  untested code when Linus cuts a release.

- Could be "it's an -rc kernel with SUBLEVEL <= 5".  But then we risk
  unexpected things happening when Linus cuts -rc6, which still isn't
  good.

- How about "it's an -rc kernel with odd-numbered SUBLEVEL and
  SUBLEVEL <= 5".  That way everybody who runs -rc1, -rc3 and -rc5 will
  have kvmalloc debugging enabled.  That's potentially nasty because
  vmalloc is much slower than kmalloc.  But kvmalloc() is only used for
  large and probably infrequent allocations, so it's probably OK.

I wonder how we get at SUBLEVEL from within .c.  
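
SUBLEVEL is folded into LINUX_VERSION_CODE via KERNEL_VERSION(), but for
mainline -rc kernels SUBLEVEL is 0 and the -rc number lives in EXTRAVERSION,
which C code cannot see today.  Assuming a hypothetical -DKERNEL_RC=<n>
exported by Kbuild (parsed out of EXTRAVERSION), the odd-numbered rule above
would reduce to a compile-time check:

#if defined(KERNEL_RC) && (KERNEL_RC & 1) && (KERNEL_RC <= 5)
#define KVMALLOC_FORCE_VMALLOC	1
#else
#define KVMALLOC_FORCE_VMALLOC	0
#endif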


Re: [PATCH v29 3/4] mm/page_poison: expose page_poisoning_enabled to kernel modules

2018-03-26 Thread Andrew Morton
On Mon, 26 Mar 2018 10:39:53 +0800 Wei Wang <wei.w.w...@intel.com> wrote:

> In some usages, e.g. virtio-balloon, a kernel module needs to know if
> page poisoning is in use. This patch exposes the page_poisoning_enabled
> function to kernel modules.

Acked-by: Andrew Morton <a...@linux-foundation.org>


Re: [PATCH v29 1/4] mm: support reporting free page blocks

2018-03-26 Thread Andrew Morton
On Mon, 26 Mar 2018 10:39:51 +0800 Wei Wang  wrote:

> This patch adds support to walk through the free page blocks in the
> system and report them via a callback function. Some page blocks may
> leave the free list after zone->lock is released, so it is the caller's
> responsibility to either detect or prevent the use of such pages.
> 
> One use example of this patch is to accelerate live migration by skipping
> the transfer of free pages reported from the guest. A popular method used
> by the hypervisor to track which part of memory is written during live
> migration is to write-protect all the guest memory. So, those pages that
> are reported as free pages but are written after the report function
> returns will be captured by the hypervisor, and they will be added to the
> next round of memory transfer.
> 
> ...
>
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4912,6 +4912,102 @@ void show_free_areas(unsigned int filter, nodemask_t 
> *nodemask)
>   show_swap_cache_info();
>  }
>  
> +/*
> + * Walk through a free page list and report the found pfn range via the
> + * callback.
> + *
> + * Return 0 if it completes the reporting. Otherwise, return the non-zero
> + * value returned from the callback.
> + */
> +static int walk_free_page_list(void *opaque,
> +			       struct zone *zone,
> +			       int order,
> +			       enum migratetype mt,
> +			       int (*report_pfn_range)(void *,
> +						       unsigned long,
> +						       unsigned long))
> +{
> +	struct page *page;
> +	struct list_head *list;
> +	unsigned long pfn, flags;
> +	int ret = 0;
> +
> +	spin_lock_irqsave(&zone->lock, flags);
> +	list = &zone->free_area[order].free_list[mt];
> +	list_for_each_entry(page, list, lru) {
> +		pfn = page_to_pfn(page);
> +		ret = report_pfn_range(opaque, pfn, 1 << order);
> +		if (ret)
> +			break;
> +	}
> +	spin_unlock_irqrestore(&zone->lock, flags);
> +
> +	return ret;
> +}
> +
> +/**
> + * walk_free_mem_block - Walk through the free page blocks in the system
> + * @opaque: the context passed from the caller
> + * @min_order: the minimum order of free lists to check
> + * @report_pfn_range: the callback to report the pfn range of the free pages
> + *
> + * If the callback returns a non-zero value, stop iterating the list of free
> + * page blocks. Otherwise, continue to report.
> + *
> + * Please note that there are no locking guarantees for the callback and
> + * that the reported pfn range might be freed or disappear after the
> + * callback returns so the caller has to be very careful how it is used.
> + *
> + * The callback itself must not sleep or perform any operations which would
> + * require any memory allocations directly (not even GFP_NOWAIT/GFP_ATOMIC)
> + * or via any lock dependency. It is generally advisable to implement
> + * the callback as simple as possible and defer any heavy lifting to a
> + * different context.
> + *
> + * There is no guarantee that each free range will be reported only once
> + * during one walk_free_mem_block invocation.
> + *
> + * pfn_to_page on the given range is strongly discouraged and if there is
> + * an absolute need for that make sure to contact MM people to discuss
> + * potential problems.
> + *
> + * The function itself might sleep so it cannot be called from atomic
> + * contexts.

I don't see how walk_free_mem_block() can sleep.

> + * In general low orders tend to be very volatile and so it makes more
> + * sense to query larger ones first for various optimizations which like
> + * ballooning etc... This will reduce the overhead as well.
> + *
> + * Return 0 if it completes the reporting. Otherwise, return the non-zero
> + * value returned from the callback.
> + */
> +int walk_free_mem_block(void *opaque,
> +			int min_order,
> +			int (*report_pfn_range)(void *opaque,
> +						unsigned long pfn,
> +						unsigned long num))
> +{
> +	struct zone *zone;
> +	int order;
> +	enum migratetype mt;
> +	int ret;
> +
> +	for_each_populated_zone(zone) {
> +		for (order = MAX_ORDER - 1; order >= min_order; order--) {
> +			for (mt = 0; mt < MIGRATE_TYPES; mt++) {
> +				ret = walk_free_page_list(opaque, zone,
> +							  order, mt,
> +							  report_pfn_range);
> +				if (ret)
> +					return ret;
> +			}
> +		}
> +	}
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(walk_free_mem_block);

This looks like it could take a long time.  Will we end up needing to
add cond_resched() in there somewhere?
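
Both review comments together would look something like this sketch, based
on the quoted patch rather than the merged code: a might_sleep() to make the
documented contract enforceable, and a cond_resched() between free-list
walks, where zone->lock is not held (it is only taken inside
walk_free_page_list()):

int walk_free_mem_block(void *opaque,
			int min_order,
			int (*report_pfn_range)(void *opaque,
						unsigned long pfn,
						unsigned long num))
{
	struct zone *zone;
	int order;
	enum migratetype mt;
	int ret;

	/* Enforce the documented "might sleep" contract from day one. */
	might_sleep();

	for_each_populated_zone(zone) {
		for (order = MAX_ORDER - 1; order >= min_order; order--) {
			for (mt = 0; mt < MIGRATE_TYPES; mt++) {
				ret = walk_free_page_list(opaque, zone,
							  order, mt,
							  report_pfn_range);
				if (ret)
					return ret;
				/* zone->lock dropped here; safe to reschedule */
				cond_resched();
			}
		}
	}

	return 0;
}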

Re: [PATCH v2 0/3] kcov: collect coverage from usb and vhost

2019-10-24 Thread Andrew Morton
On Thu, 24 Oct 2019 14:47:31 +0200 Andrey Konovalov  
wrote:

> > is it expected that the new kcov feature will be used elsewhere in the
> > kernel?
> >
> > If the latter, which are the expected subsystems?
> 
> Currently we encountered two cases where this is useful: USB and vhost
> workers. Most probably there are more subsystems that will benefit
> from this kcov extension to get better fuzzing coverage. I don't have
> a list of them, but the provided interface should be easy to use when
> more of such cases are encountered.

It would be helpful to add such a list to the changelog.  Best-effort
and approximate is OK - just to help people understand the eventual
usefulness of the proposal.




Re: [PATCH v2 1/3] kcov: remote coverage support

2019-10-23 Thread Andrew Morton
On Wed, 23 Oct 2019 17:24:29 +0200 Andrey Konovalov  
wrote:

> This patch adds background thread coverage collection ability to kcov.
> 
> With KCOV_ENABLE coverage is collected only for syscalls that are issued
> from the current process. With KCOV_REMOTE_ENABLE it's possible to collect
> coverage for arbitrary parts of the kernel code, provided that those parts
> are annotated with kcov_remote_start()/kcov_remote_stop().
> 
> This allows collecting coverage from two types of kernel background
> threads: the global ones, that are spawned during kernel boot and are
> always running (e.g. USB hub_event()); and the local ones, that are
> spawned when a user interacts with some kernel interface (e.g. vhost
> workers).
> 
> To enable collecting coverage from a global background thread, a unique
> global handle must be assigned and passed to the corresponding
> kcov_remote_start() call. Then a userspace process can pass a list of such
> handles to the KCOV_REMOTE_ENABLE ioctl in the handles array field of the
> kcov_remote_arg struct. This will attach the used kcov device to the code
> sections, that are referenced by those handles.
> 
> Since there might be many local background threads spawned from different
> userspace processes, we can't use a single global handle per annotation.
> Instead, the userspace process passes a non-zero handle through the
> common_handle field of the kcov_remote_arg struct. This common handle gets
> saved to the kcov_handle field in the current task_struct and needs to be
> passed to the newly spawned threads via custom annotations. Those threads
> should in turn be annotated with kcov_remote_start()/kcov_remote_stop().
> 
> Internally kcov stores handles as u64 integers. The top byte of a handle
> is used to denote the id of a subsystem that this handle belongs to, and
> the lower 4 bytes are used to denote a handle id within that subsystem.
> A reserved value 0 is used as a subsystem id for common handles as they
> don't belong to a particular subsystem. The bytes 4-7 are currently
> reserved and must be zero. In the future the number of bytes used for the
> subsystem or handle ids might be increased.
> 
> When a particular userspace process collects coverage via a common
> handle, kcov will collect coverage for each code section that is annotated
> to use the common handle obtained as kcov_handle from the current
> task_struct. However, non-common handles allow collecting coverage
> selectively from different subsystems.
> 
> ...
>
> +static struct kcov_remote *kcov_remote_add(struct kcov *kcov, u64 handle)
> +{
> +	struct kcov_remote *remote;
> +
> +	if (kcov_remote_find(handle))
> +		return ERR_PTR(-EEXIST);
> +	remote = kmalloc(sizeof(*remote), GFP_ATOMIC);
> +	if (!remote)
> +		return ERR_PTR(-ENOMEM);
> +	remote->handle = handle;
> +	remote->kcov = kcov;
> +	hash_add(kcov_remote_map, &remote->hnode, handle);
> +	return remote;
> +}
> +
>
> ...
>
> +	spin_lock(&kcov_remote_lock);
> +	for (i = 0; i < remote_arg->num_handles; i++) {
> +		kcov_debug("handle %llx\n", remote_arg->handles[i]);
> +		if (!kcov_check_handle(remote_arg->handles[i],
> +					false, true, false)) {
> +			spin_unlock(&kcov_remote_lock);
> +			kcov_disable(t, kcov);
> +			return -EINVAL;
> +		}
> +		remote = kcov_remote_add(kcov, remote_arg->handles[i]);
> +		if (IS_ERR(remote)) {
> +			spin_unlock(&kcov_remote_lock);
> +			kcov_disable(t, kcov);
> +			return PTR_ERR(remote);
> +		}
> +	}

It's worrisome that this code can perform up to 65536 GFP_ATOMIC
allocations without coming up for air.  The possibility of ENOMEM or of
causing collateral problems is significant.  It doesn't look too hard
to change this to use GFP_KERNEL?
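
A sketch of that approach -- preallocate with GFP_KERNEL outside the lock,
insert under it; kcov_remote_insert() and the remotes[] array are
illustrative, not the actual kcov code:

	struct kcov_remote **remotes;
	unsigned int i;

	remotes = kcalloc(remote_arg->num_handles, sizeof(*remotes),
			  GFP_KERNEL);
	if (!remotes)
		return -ENOMEM;
	for (i = 0; i < remote_arg->num_handles; i++) {
		/* may sleep: no spinlock held yet */
		remotes[i] = kmalloc(sizeof(**remotes), GFP_KERNEL);
		if (!remotes[i]) {
			while (i--)
				kfree(remotes[i]);
			kfree(remotes);
			return -ENOMEM;
		}
	}

	spin_lock(&kcov_remote_lock);
	for (i = 0; i < remote_arg->num_handles; i++)
		kcov_remote_insert(kcov, remotes[i], remote_arg->handles[i]);
	spin_unlock(&kcov_remote_lock);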

> +u64 kcov_common_handle(void)
> +{
> +	return current->kcov_handle;
> +}

I don't immediately understand what this "common handle" thing is all about. 
Code is rather lacking in this sort of high-level commentary?
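
For what it's worth, the handle layout the changelog describes reduces to
something like this sketch (macro names illustrative, not the actual uapi):

#define KCOV_SUBSYS_SHIFT	56
#define KCOV_HANDLE(subsys, inst)	\
	((((u64)(subsys)) << KCOV_SUBSYS_SHIFT) | (u64)(u32)(inst))

	/* subsystem id 1, instance 7; bytes 4-6 stay zero (reserved) */
	u64 handle = KCOV_HANDLE(1, 7);

A common handle would then use the reserved subsystem id 0.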




Re: [PATCH v2 0/3] kcov: collect coverage from usb and vhost

2019-10-23 Thread Andrew Morton
On Wed, 23 Oct 2019 17:24:28 +0200 Andrey Konovalov  
wrote:

> This patchset extends kcov to allow collecting coverage from the USB
> subsystem and vhost workers. See the first patch description for details
> about the kcov extension. The other two patches apply this kcov extension
> to USB and vhost.
> 
> These patches have been used to enable coverage-guided USB fuzzing with
> syzkaller for the last few years

I find it surprising that this material is so focused on USB.  Is
there something unique about USB that gave rise to this situation, or
is it expected that the new kcov feature will be used elsewhere in the
kernel?

If the latter, which are the expected subsystems?


Re: [PATCH v2 07/10] mm/memory_hotplug: Introduce offline_and_remove_memory()

2020-04-14 Thread Andrew Morton
On Tue, 14 Apr 2020 12:35:02 -0400 "Michael S. Tsirkin"  wrote:

> On Wed, Mar 11, 2020 at 06:19:04PM +0100, David Hildenbrand wrote:
> > On 11.03.20 18:14, David Hildenbrand wrote:
> > > virtio-mem wants to offline and remove a memory block once it unplugged
> > > all subblocks (e.g., using alloc_contig_range()). Let's provide
> > > an interface to do that from a driver. virtio-mem already supports
> > > offlining partially unplugged memory blocks. Offlining a fully unplugged
> > > memory block will not require to migrate any pages. All unplugged
> > > subblocks are PageOffline() and have a reference count of 0 - so
> > > offlining code will simply skip them.
> > > 
> > > All we need is an interface to offline and remove the memory from kernel
> > > module context, where we don't have access to the memory block devices
> > > (esp. find_memory_block() and device_offline()) and the device hotplug
> > > lock.
> > > 
> > > To keep things simple, allow working on only a single memory block.
> > > 
> > 
> > Lost the ACK from Michael
> > 
> > Acked-by: Michal Hocko  [1]
> > 
> > [1] https://lkml.kernel.org/r/20200302142737.gp4...@dhcp22.suse.cz
> 
> 
> Andrew, could you pls ack merging this through the vhost tree,
> with the rest of the patchset?

I wish the device_offline() return value was documented :(

Yes, please go ahead and merge.

Acked-by: Andrew Morton 



Re: [PATCH v2 05/10] mm: Allow to offline unmovable PageOffline() pages via MEM_GOING_OFFLINE

2020-04-14 Thread Andrew Morton
On Tue, 14 Apr 2020 12:34:26 -0400 "Michael S. Tsirkin"  wrote:

> Andrew, could you please ack merging this through the vhost tree
> together with the rest of the patches?

Acked-by: Andrew Morton 



Re: [PATCH v2 2/3] mm/memory_hotplug: Introduce MHP_NO_FIRMWARE_MEMMAP

2020-04-30 Thread Andrew Morton
On Thu, 30 Apr 2020 20:43:39 +0200 David Hildenbrand  wrote:

> > 
> > Why does the firmware map support hotplug entries?
> 
> I assume:
> 
> The firmware memmap was added primarily for x86-64 kexec (and still, is
> mostly used on x86-64 only IIRC). There, we had ACPI hotplug. When DIMMs
> get hotplugged on real HW, they get added to e820. Same applies to
> memory added via HyperV balloon (unless memory is unplugged via
> ballooning and you reboot ... the the e820 is changed as well). I assume
> we wanted to be able to reflect that, to make kexec look like a real reboot.
> 
> This worked for a while. Then came dax/kmem. Now comes virtio-mem.
> 
> 
> But I assume only Andrew can enlighten us.
> 
> @Andrew, any guidance here? Should we really add all memory to the
> firmware memmap, even if this contradicts with the existing
> documentation? (especially, if the actual firmware memmap will *not*
> contain that memory after a reboot)

For some reason that patch is misattributed - it was authored by
Shaohui Zheng , who hasn't been heard from in
a decade.  I looked through the email discussion from that time and I'm
not seeing anything useful.  But I wasn't able to locate Dave Hansen's
review comments.




Re: [PATCH v5 0/6] mm / virtio-mem: support ZONE_MOVABLE

2020-08-21 Thread Andrew Morton
On Fri, 21 Aug 2020 10:45:33 +0200 David Hildenbrand  wrote:

> On 21.08.20 10:31, David Hildenbrand wrote:
> > On 16.08.20 14:53, David Hildenbrand wrote:
> >> For 5.10. Patch #1-#4,#6 have RBs or ACKs, patch #5 is virtio-mem stuff
> >> maintained by me. This should go via the -mm tree.
> >>
> > 
> > @Andrew, can we give this a churn if there are no further comments? Thanks!
> 
> ... I just spotted the patches in -next, strange I didn't get an email
> notification. Thanks :)

https://lore.kernel.org/mm-commits/20200819025501.gjhzlolfc%25a...@linux-foundation.org/

akpm!=spam :)


Re: [PATCH v4 0/6] mm / virtio-mem: support ZONE_MOVABLE

2020-08-10 Thread Andrew Morton
On Mon, 10 Aug 2020 09:56:32 +0200 David Hildenbrand  wrote:

> On 04.08.20 21:41, David Hildenbrand wrote:
> > @Andrew can we give this a churn and consider it for v5.9 in case there
> > are no more comments?
> 
> @Andrew, Ping, so I assume we'll target v5.10?

Yep, sorry.  Merging a significant patch series during the merge
window(!) would be quite extraordinary and I don't think that anything
in this patchset justifies such an action?



Re: [PATCH v5 2/2] mm, treewide: Rename kzfree() to kfree_sensitive()

2020-06-16 Thread Andrew Morton
On Tue, 16 Jun 2020 11:43:11 -0400 Waiman Long  wrote:

> As said by Linus:
> 
>   A symmetric naming is only helpful if it implies symmetries in use.
>   Otherwise it's actively misleading.
> 
>   In "kzalloc()", the z is meaningful and an important part of what the
>   caller wants.
> 
>   In "kzfree()", the z is actively detrimental, because maybe in the
>   future we really _might_ want to use that "memfill(0xdeadbeef)" or
>   something. The "zero" part of the interface isn't even _relevant_.
> 
> The main reason that kzfree() exists is to clear sensitive information
> that should not be leaked to other future users of the same memory
> objects.
> 
> Rename kzfree() to kfree_sensitive() to follow the example of the
> recently added kvfree_sensitive() and make the intention of the API
> more explicit. In addition, memzero_explicit() is used to clear the
> memory to make sure that it won't get optimized away by the compiler.
> 
> The renaming is done by using the command sequence:
> 
>   git grep -w --name-only kzfree |\
>   xargs sed -i 's/\bkzfree\b/kfree_sensitive/'
> 
> followed by some editing of the kfree_sensitive() kerneldoc and adding
> a kzfree backward compatibility macro in slab.h.
> 
> ...
>
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -186,10 +186,12 @@ void memcg_deactivate_kmem_caches(struct mem_cgroup *, 
> struct mem_cgroup *);
>   */
>  void * __must_check krealloc(const void *, size_t, gfp_t);
>  void kfree(const void *);
> -void kzfree(const void *);
> +void kfree_sensitive(const void *);
>  size_t __ksize(const void *);
>  size_t ksize(const void *);
>  
> +#define kzfree(x)	kfree_sensitive(x)	/* For backward compatibility */
> +

What was the thinking here?  Is this really necessary?

I suppose we could keep this around for a while to ease migration.  But
not for too long, please.
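
For reference, the renamed helper as the changelog describes it -- a sketch
close to, though not necessarily byte-identical with, the merged code:

void kfree_sensitive(const void *p)
{
	size_t ks;
	void *mem = (void *)p;

	ks = ksize(mem);	/* 0 for NULL / ZERO_SIZE_PTR */
	if (ks)
		memzero_explicit(mem, ks);	/* the compiler cannot elide this */
	kfree(mem);
}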


Re: [PATCH v2 27/29] mm/memory_hotplug: extend offline_and_remove_memory() to handle more than one memory block

2020-11-17 Thread Andrew Morton
On Thu, 12 Nov 2020 14:38:13 +0100 David Hildenbrand  wrote:

> virtio-mem soon wants to use offline_and_remove_memory() on memory that
> exceeds a single Linux memory block (memory_block_size_bytes()). Let's
> remove that restriction.
> 
> Let's remember the old state and try to restore that if anything goes
> wrong. While re-onlining can, in general, fail, it's highly unlikely to
> happen (usually only when a notifier fails to allocate memory, and these
> are rather rare).
> 
> This will be used by virtio-mem to offline+remove memory ranges that are
> bigger than a single memory block - for example, with a device block
> size of 1 GiB (e.g., gigantic pages in the hypervisor) and a Linux memory
> block size of 128MB.
> 
> While we could compress the state into 2 bits, using 8 bits is much
> easier.
> 
> This handling is similar, but different to acpi_scan_try_to_offline():
> 
> a) We don't try to offline twice. I am not sure if this CONFIG_MEMCG
> optimization is still relevant - it should only apply to ZONE_NORMAL
> (where we have no guarantees). If relevant, we can always add it.
> 
> b) acpi_scan_try_to_offline() simply onlines all memory in case
> something goes wrong. It doesn't restore previous online type. Let's do
> that, so we won't overwrite what e.g., user space configured.
> 
> ...
>

uint8_t is a bit of a mouthful.  u8 is less typing ;)  Doesn't matter.

Acked-by: Andrew Morton 
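
An illustrative sketch of the remember-and-restore scheme from the
changelog; the helpers named here are made up, not the actual patch:

	const u64 mb_size = memory_block_size_bytes();
	const int nr_blocks = size / mb_size;
	u8 *online_types;	/* one state byte per memory block */
	int i, ret = 0;

	online_types = kmalloc_array(nr_blocks, sizeof(*online_types),
				     GFP_KERNEL);
	if (!online_types)
		return -ENOMEM;

	for (i = 0; i < nr_blocks; i++) {
		online_types[i] = memory_block_online_type(start + i * mb_size);
		ret = offline_memory_block(start + i * mb_size);
		if (ret)
			break;
	}
	if (ret) {
		/* restore what was offlined; re-onlining rarely fails */
		while (i--)
			online_memory_block(start + i * mb_size,
					    online_types[i]);
	}
	kfree(online_types);
	return ret;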


Re: [PATCH v1 1/4] mm/memory_hotplug: use "unsigned long" for PFN in zone_for_pfn_range()

2021-07-15 Thread Andrew Morton
On Thu, 15 Jul 2021 11:42:21 +0200 David Hildenbrand  wrote:

> > I'd propose to add Cc:  since I actually had
> > the fun of trying to debug something like this a couple of years ago:
> > 6cdb18ad98a4 ("mm/vmstat: fix overflow in mod_zone_page_state()")
> > 
> 
> Good point, and thinking again what can go wrong, I tend to agree. We 
> are trying to keep zones contiguous and it could happen that we end up 
> with something like ZONE_DMA here (via default_kernel_zone_for_pfn()) 
> and would consequently online something to ZONE_DMA that doesn't belong 
> there, resulting in crashes.
> 
> @Andrew can you add  Cc:  and
> 
> "As we will search for a fitting zone using the wrong pfn, we might end 
> up onlining memory to one of the special kernel zones, such as ZONE_DMA, 
> which can end badly as the onlined memory does not satisfy properties of 
> these zones."

Yep, all done.


Re: [PATCH 1/3] sched/headers: Fix compilation error with GCC 12

2022-04-14 Thread Andrew Morton
On Thu, 14 Apr 2022 17:21:01 +0200 Peter Zijlstra  wrote:

> > +/* The + 1 below places the pointers within the range of their array */
> >  #define for_class_range(class, _from, _to) \
> > -   for (class = (_from); class != (_to); class--)
> > +   for (class = (_from); class + 1 != (_to) + 1; class--)
> 
> Urgh, so now we get less readable code, just because GCC is being
> stupid?
> 
> What's wrong with negative array indexes? memory is memory, stuff works.

What's more, C is C.  Glorified assembly language in which people do odd
stuff.

But this is presumably a released gcc version and we need to do
something.  And presumably, we need to do a backportable something, so
people can compile older kernels with gcc-12.

Is it possible to suppress just this warning with a gcc option?  And if
so, are we confident that this warning will never be useful in other
places in the kernel?

If no||no then we'll need to add workarounds such as these?
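
If the diagnostic here is GCC 12's -Warray-bounds (an assumption, not
confirmed above), one backportable option is to silence it around just the
affected use sites, keeping the warning live everywhere else; endpoints
below are illustrative:

static void walk_classes_quietly(const struct sched_class *from,
				 const struct sched_class *to)
{
	const struct sched_class *class;

#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Warray-bounds"
	for_class_range(class, from, to)
		;	/* body elided; only the pragma placement matters */
#pragma GCC diagnostic pop
}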


Re: [linux-next:master] BUILD REGRESSION 8cb8311e95e3bb58bd84d6350365f14a718faa6d

2022-05-25 Thread Andrew Morton
On Wed, 25 May 2022 23:07:35 +0100 Jessica Clarke  wrote:

> This is i386, so an unsigned long is 32-bit, but i_blocks is a blkcnt_t
> i.e. a u64, which makes the shift without a cast of the LHS fishy.

Ah, of course, thanks.  I remember 32 bits ;)

--- a/mm/shmem.c~mm-shmemc-suppress-shift-warning
+++ a/mm/shmem.c
@@ -1945,7 +1945,7 @@ alloc_nohuge:
 
 	spin_lock_irq(&info->lock);
 	info->alloced += folio_nr_pages(folio);
-	inode->i_blocks += BLOCKS_PER_PAGE << folio_order(folio);
+	inode->i_blocks += (blkcnt_t)BLOCKS_PER_PAGE << folio_order(folio);
 	shmem_recalc_inode(inode);
 	spin_unlock_irq(&info->lock);
 	alloced = true;
_
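
The underlying 32-bit wraparound is easy to demonstrate in userspace (the
order is exaggerated just to show the effect):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint32_t blocks_per_page = 4096 / 512;	/* 8; "unsigned long" on i386 */
	unsigned int order = 30;		/* exaggerated folio order */
	uint64_t i_blocks;			/* models blkcnt_t */

	i_blocks = blocks_per_page << order;	/* shifts in 32 bits, wraps */
	printf("without cast: %llu\n", (unsigned long long)i_blocks);	/* 0 */

	i_blocks = (uint64_t)blocks_per_page << order;	/* widen first */
	printf("with cast:    %llu\n", (unsigned long long)i_blocks);	/* 8589934592 */
	return 0;
}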



Re: [linux-next:master] BUILD REGRESSION 8cb8311e95e3bb58bd84d6350365f14a718faa6d

2022-05-25 Thread Andrew Morton
On Thu, 26 May 2022 05:35:20 +0800 kernel test robot  wrote:

> tree/branch: 
> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git master
> branch HEAD: 8cb8311e95e3bb58bd84d6350365f14a718faa6d  Add linux-next 
> specific files for 20220525
> 
> Error/Warning reports:
> 
> ...
>
> Unverified Error/Warning (likely false positive, please contact us if 
> interested):

Could be so.

> mm/shmem.c:1948 shmem_getpage_gfp() warn: should '(((1) << 12) / 512) << 
> folio_order(folio)' be a 64 bit type?

I've been seeing this one for a while.  And from this report I can't
figure out what tool emitted it.  Clang?

>
> ...
>
> |-- i386-randconfig-m021
> |   `-- 
> mm-shmem.c-shmem_getpage_gfp()-warn:should-((()-)-)-folio_order(folio)-be-a-bit-type

If you're going to use randconfig then shouldn't you make the config
available?  Or maybe quote the KCONFIG_SEED - presumably there's a way
for others to regenerate.

Anyway, the warning seems wrong to me.


#define PAGE_SIZE   (_AC(1,UL) << PAGE_SHIFT)

#define BLOCKS_PER_PAGE  (PAGE_SIZE/512)

inode->i_blocks += BLOCKS_PER_PAGE << folio_order(folio);

so the RHS here should have unsigned long type.  Being able to generate
the cpp output would be helpful.  That requires the .config.



Re: [PATCH] mm: fix a potential infinite loop in start_isolate_page_range().

2022-05-24 Thread Andrew Morton
On Tue, 24 May 2022 15:47:56 -0400 Zi Yan  wrote:

> From: Zi Yan 
> 
> In isolate_single_pageblock() called by start_isolate_page_range(),
> there are some pageblock isolation issues causing a potential
> infinite loop when isolating a page range. This is reported by Qian Cai.
> 
> 1. the pageblock was isolated by just changing pageblock migratetype
>    without checking unmovable pages. Calling set_migratetype_isolate() to
>    isolate pageblock properly.
> 2. an off-by-one error caused migrating pages unnecessarily, since the page
>    is not crossing pageblock boundary.
> 3. migrating a compound page across pageblock boundary then splitting the
>    free page later has a small race window that the free page might be
>    allocated again, so that the code will try again, causing a potential
>    infinite loop. Temporarily set the to-be-migrated page's pageblock to
>    MIGRATE_ISOLATE to prevent that and bail out early if no free page is
>    found after page migration.
> 
> An additional fix to split_free_page() aims to avoid crashing in
> __free_one_page(). When the free page is split at the specified
> split_pfn_offset, free_page_order should check both the first bit of
> free_page_pfn and the last bit of split_pfn_offset and use the smaller one.
> For example, if free_page_pfn=0x10000, split_pfn_offset=0xc000,
> free_page_order should first be 0x8000 then 0x4000, instead of 0x4000 then
> 0x8000, which the original algorithm did.
> 
> ...
>
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1114,13 +1114,16 @@ void split_free_page(struct page *free_page,
>   unsigned long flags;
>   int free_page_order;
>  
> +	if (split_pfn_offset == 0)
> +		return;
> +
> 	spin_lock_irqsave(&zone->lock, flags);
> 	del_page_from_free_list(free_page, zone, order);
> 	for (pfn = free_page_pfn;
> 	     pfn < free_page_pfn + (1UL << order);) {
> 		int mt = get_pfnblock_migratetype(pfn_to_page(pfn), pfn);
> 
> -		free_page_order = ffs(split_pfn_offset) - 1;
> +		free_page_order = min(pfn ? __ffs(pfn) : order,
> +				      __fls(split_pfn_offset));

Why is it testing the zeroness of `pfn' here?  Can pfn==0 even happen? 
If so, it's a legitimate value so why does it get special-cased?
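
One plausible answer, sketched: __ffs() (like the bsf instruction behind it)
is undefined for a zero argument, so a block starting at pfn 0 cannot be fed
into it and the page's own order is the natural fallback bound:

		/* __ffs(0) is undefined, hence the special case */
		free_page_order = min(pfn ? __ffs(pfn) : order,
				      __fls(split_pfn_offset));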





Re: [PATCH v11 0/6] Use pageblock_order for cma and alloc_contig_range alignment.

2022-05-09 Thread Andrew Morton
On Mon, 25 Apr 2022 10:31:12 -0400 Zi Yan  wrote:

> This patchset tries to remove the MAX_ORDER-1 alignment requirement for CMA
> and alloc_contig_range(). It prepares for my upcoming changes to make
> MAX_ORDER adjustable at boot time[1].

I'm thinking this looks ready to be merged into mm-stable later this week, for
the 5.19-rc1 merge window.

I believe the build error at
https://lkml.kernel.org/r/ca+g9fyvemf-nu-rvrsbaora2g2qwxrkf7awviudrjyn9mns...@mail.gmail.com
was addressed in ARM?

I have one -fix to be squashed,
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/mm-make-alloc_contig_range-work-at-pageblock-granularity-fix.patch



Re: [linux-next:master] BUILD REGRESSION a54df7622717a40ddec95fd98086aff8ba7839a6

2023-01-24 Thread Andrew Morton
On Wed, 25 Jan 2023 00:37:05 +0800 kernel test robot  wrote:

> tree/branch: 
> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git master
> branch HEAD: a54df7622717a40ddec95fd98086aff8ba7839a6  Add linux-next 
> specific files for 20230124
> 
> Error/Warning: (recently discovered and may have been fixed)
> 
> ERROR: modpost: "devm_platform_ioremap_resource" [drivers/dma/fsl-edma.ko] 
> undefined!
> ERROR: modpost: "devm_platform_ioremap_resource" [drivers/dma/idma64.ko] 
> undefined!
> drivers/gpu/drm/amd/amdgpu/../display/dc/link/link_dp_training.c:1585:38: 
> warning: variable 'result' set but not used [-Wunused-but-set-variable]
> 
> Unverified Error/Warning (likely false positive, please contact us if 
> interested):
> 
> ...
>
> mm/hugetlb.c:3100 alloc_hugetlb_folio() error: uninitialized symbol 'h_cg'.

hugetlb_cgroup_commit_charge(idx, pages_per_huge_page(h), h_cg, folio);

The warning looks to be bogus.  I guess we could put a "= NULL" in
there to keep the compiler quiet?
