On Fri, Jan 08, 2016 at 02:56:14PM -0500, Rafael Aquini wrote:
> On Fri, Jan 01, 2016 at 11:36:13AM +0200, Michael S. Tsirkin wrote:
> > On Mon, Dec 28, 2015 at 08:35:13AM +0900, Minchan Kim wrote:
> > > In balloon_page_dequeue, pages_lock should cover the loop
> > > (i.e., list_for_each_entry_safe). Otherwise, the cursor page could
> > > be isolated by compaction, and the list_del done by isolation could
> > > poison page->lru.{prev,next}, so the loop could end up accessing
> > > a wrong address like this. This patch fixes the bug.
> > > 
> > > general protection fault: 0000 [#1] SMP
> > > Dumping ftrace buffer:
> > >    (ftrace buffer empty)
> > > Modules linked in:
> > > CPU: 2 PID: 82 Comm: vballoon Not tainted 4.4.0-rc5-mm1-access_bit+ #1906
> > > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
> > > task: ffff8800a7ff0000 ti: ffff8800a7fec000 task.ti: ffff8800a7fec000
> > > RIP: 0010:[<ffffffff8115e754>]  [<ffffffff8115e754>] balloon_page_dequeue+0x54/0x130
> > > RSP: 0018:ffff8800a7fefdc0  EFLAGS: 00010246
> > > RAX: ffff88013fff9a70 RBX: ffffea000056fe00 RCX: 0000000000002b7d
> > > RDX: ffff88013fff9a70 RSI: ffffea000056fe00 RDI: ffff88013fff9a68
> > > RBP: ffff8800a7fefde8 R08: ffffea000056fda0 R09: 0000000000000000
> > > R10: ffff8800a7fefd90 R11: 0000000000000001 R12: dead0000000000e0
> > > R13: ffffea000056fe20 R14: ffff880138809070 R15: ffff880138809060
> > > FS:  0000000000000000(0000) GS:ffff88013fc40000(0000) knlGS:0000000000000000
> > > CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> > > CR2: 00007f229c10e000 CR3: 00000000b8b53000 CR4: 00000000000006a0
> > > Stack:
> > >  0000000000000100 ffff880138809088 ffff880138809000 ffff880138809060
> > >  0000000000000046 ffff8800a7fefe28 ffffffff812c86d3 ffff880138809020
> > >  ffff880138809000 fffffffffff91900 0000000000000100 ffff880138809060
> > > Call Trace:
> > >  [<ffffffff812c86d3>] leak_balloon+0x93/0x1a0
> > >  [<ffffffff812c8bc7>] balloon+0x217/0x2a0
> > >  [<ffffffff8143739e>] ? __schedule+0x31e/0x8b0
> > >  [<ffffffff81078160>] ? abort_exclusive_wait+0xb0/0xb0
> > >  [<ffffffff812c89b0>] ? update_balloon_stats+0xf0/0xf0
> > >  [<ffffffff8105b6e9>] kthread+0xc9/0xe0
> > >  [<ffffffff8105b620>] ? kthread_park+0x60/0x60
> > >  [<ffffffff8143b4af>] ret_from_fork+0x3f/0x70
> > >  [<ffffffff8105b620>] ? kthread_park+0x60/0x60
> > > Code: 8d 60 e0 0f 84 af 00 00 00 48 8b 43 20 a8 01 75 3b 48 89 d8 f0 0f ba 28 00 72 10 48 8b 03 f6 c4 08 75 2f 48 89 df e8 8c 83 f9 ff <49> 8b 44 24 20 4d 8d 6c 24 20 48 83 e8 20 4d 39 f5 74 7a 4c 89
> > > RIP  [<ffffffff8115e754>] balloon_page_dequeue+0x54/0x130
> > >  RSP <ffff8800a7fefdc0>
> > > ---[ end trace 43cf28060d708d5f ]---
> > > Kernel panic - not syncing: Fatal exception
> > > Dumping ftrace buffer:
> > >    (ftrace buffer empty)
> > > Kernel Offset: disabled
> > > 
> > > Cc: <[email protected]>
> > > Signed-off-by: Minchan Kim <[email protected]>
> > > ---
> > >  mm/balloon_compaction.c | 4 ++--
> > >  1 file changed, 2 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/mm/balloon_compaction.c b/mm/balloon_compaction.c
> > > index d3116be5a00f..300117f1a08f 100644
> > > --- a/mm/balloon_compaction.c
> > > +++ b/mm/balloon_compaction.c
> > > @@ -61,6 +61,7 @@ struct page *balloon_page_dequeue(struct balloon_dev_info *b_dev_info)
> > >   bool dequeued_page;
> > >  
> > >   dequeued_page = false;
> > > + spin_lock_irqsave(&b_dev_info->pages_lock, flags);
> > >   list_for_each_entry_safe(page, tmp, &b_dev_info->pages, lru) {
> > >           /*
> > >            * Block others from accessing the 'page' while we get around
> > > @@ -75,15 +76,14 @@ struct page *balloon_page_dequeue(struct balloon_dev_info *b_dev_info)
> > >                           continue;
> > >                   }
> > >  #endif
> > > -                 spin_lock_irqsave(&b_dev_info->pages_lock, flags);
> > >                   balloon_page_delete(page);
> > >                   __count_vm_event(BALLOON_DEFLATE);
> > > -                 spin_unlock_irqrestore(&b_dev_info->pages_lock, flags);
> > >                   unlock_page(page);
> > >                   dequeued_page = true;
> > >                   break;
> > >           }
> > >   }
> > > + spin_unlock_irqrestore(&b_dev_info->pages_lock, flags);
> > >  
> > >   if (!dequeued_page) {
> > >           /*
> > 
> > I think this will cause deadlocks.
> > 
> > pages_lock now nests within page lock, balloon_page_putback
> > nests them in the reverse order.
> > 
> > Did you test this with lockdep? You really should for
> > locking changes, and I'd expect it to warn about this.
> > 
> > Also, I think there's another issue there: after isolation, the page
> > could also get freed before we try to lock it.
> > 
> > We really must take a page reference before touching
> > the page.
> > 
> > I think we need something like the below to fix this issue.
> > Could you please try this out, and send Tested-by?
> > I will repost as a proper patch if this works for you.
> >
> 
> Nice catch! Thanks for spotting it. I just have one minor nit. See
> below.
>  
> > 
> > diff --git a/mm/balloon_compaction.c b/mm/balloon_compaction.c
> > index d3116be..66d69c5 100644
> > --- a/mm/balloon_compaction.c
> > +++ b/mm/balloon_compaction.c
> > @@ -56,12 +56,34 @@ EXPORT_SYMBOL_GPL(balloon_page_enqueue);
> >   */
> >  struct page *balloon_page_dequeue(struct balloon_dev_info *b_dev_info)
> >  {
> > -   struct page *page, *tmp;
> > +   struct page *page;
> >     unsigned long flags;
> >     bool dequeued_page;
> > +   LIST_HEAD(processed); /* protected by b_dev_info->pages_lock */
> >  
> >     dequeued_page = false;
> > -   list_for_each_entry_safe(page, tmp, &b_dev_info->pages, lru) {
> > +   /*
> > +    * We need to go over b_dev_info->pages and lock each page,
> > +    * but b_dev_info->pages_lock must nest within page lock.
> > +    *
> > +    * To make this safe, remove each page from b_dev_info->pages list
> > +    * under b_dev_info->pages_lock, then drop this lock. Once list is
> > +    * empty, re-add them also under b_dev_info->pages_lock.
> > +    */
> > +   spin_lock_irqsave(&b_dev_info->pages_lock, flags);
> > +   while (!list_empty(&b_dev_info->pages)) {
> > +           page = list_first_entry(&b_dev_info->pages, typeof(*page), lru);
> > +           /* move to processed list to avoid going over it another time */
> > +           list_move(&page->lru, &processed);
> > +
> > +           if (!get_page_unless_zero(page))
> > +                   continue;
> > +           /*
> > +            * pages_lock nests within page lock,
> > +            * so drop it before trylock_page
> > +            */
> > +           spin_unlock_irqrestore(&b_dev_info->pages_lock, flags);
> > +
> >             /*
> >              * Block others from accessing the 'page' while we get around
> >              * establishing additional references and preparing the 'page'
> > @@ -72,6 +94,7 @@ struct page *balloon_page_dequeue(struct balloon_dev_info *b_dev_info)
> >                     if (!PagePrivate(page)) {
> >                             /* raced with isolation */
> >                             unlock_page(page);
> > +                           put_page(page);
> >                             continue;
> >                     }
> >  #endif
> > @@ -80,11 +103,18 @@ struct page *balloon_page_dequeue(struct balloon_dev_info *b_dev_info)
> >                     __count_vm_event(BALLOON_DEFLATE);
> >                     spin_unlock_irqrestore(&b_dev_info->pages_lock, flags);
> >                     unlock_page(page);
> > +                   put_page(page);
> >                     dequeued_page = true;
> >                     break;
>                         ^^^^[1]
> 
> >             }
> > +           put_page(page);
> > +           spin_lock_irqsave(&b_dev_info->pages_lock, flags);
> >     }
> >  
> > +   /* re-add remaining entries */
> > +   list_splice(&processed, &b_dev_info->pages);
> 
> By breaking out of the loop at its ordinary, expected exit [1],
> we'll hit list_splice without holding b_dev_info->pages_lock, won't we?

Ouch, right.

> Perhaps adding the following on top of your patch would address the
> aforementioned pickle:

I'd rather just goto out of the loop or return.
But maybe Minchan is right and the original patch is ok.
I still need to go into this.
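
To make the goto idea concrete, here is a minimal userspace sketch of
one possible shape. It is not kernel code and not a proposed patch:
struct item, struct dev, dev_lock, dev_unlock, splice_back and
dev_dequeue are all made-up names, and the "lock" is only a flag with
asserts, so it checks lock discipline in a single thread rather than
providing real mutual exclusion. The point it tries to show is that
if the lock is re-taken before leaving the loop, the splice runs with
it held on every path and no extra "locked" flag is needed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
struct item {
	struct item *next;
	bool busy;		/* models a page whose trylock_page() fails */
};

struct dev {
	bool locked;		/* stands in for b_dev_info->pages_lock */
	struct item *pages;	/* stands in for b_dev_info->pages */
};

static void dev_lock(struct dev *d)   { assert(!d->locked); d->locked = true; }
static void dev_unlock(struct dev *d) { assert(d->locked); d->locked = false; }

/* Put every parked item back at the front of d->pages; lock must be held. */
static void splice_back(struct dev *d, struct item *processed)
{
	struct item *tail = processed;

	assert(d->locked);
	if (!processed)
		return;
	while (tail->next)
		tail = tail->next;
	tail->next = d->pages;
	d->pages = processed;
}

/* Return the first non-busy item, or NULL; skipped items stay queued. */
struct item *dev_dequeue(struct dev *d)
{
	struct item *processed = NULL, *it, *found = NULL;
	bool claimed;

	dev_lock(d);
	while ((it = d->pages) != NULL) {
		/* park the item so it is not visited a second time */
		d->pages = it->next;
		it->next = processed;
		processed = it;

		dev_unlock(d);		/* the item lock must not nest inside */
		claimed = !it->busy;	/* models trylock_page() */
		dev_lock(d);		/* re-take before touching either list */

		if (claimed) {
			/* 'it' was pushed last, so it heads 'processed' */
			processed = it->next;
			it->next = NULL;
			found = it;
			goto out;
		}
	}
out:
	splice_back(d, processed);	/* lock is held on every path */
	dev_unlock(d);
	return found;
}
```

In this model the skipped items are re-added at the head of the
list in reverse visiting order, which I believe mirrors what
list_move() followed by list_splice() does in the patch above.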

> Cheers!
> Rafael
> --
> 
> diff --git a/mm/balloon_compaction.c b/mm/balloon_compaction.c
> index 66d69c5..74b3e9c 100644
> --- a/mm/balloon_compaction.c
> +++ b/mm/balloon_compaction.c
> @@ -58,7 +58,7 @@ struct page *balloon_page_dequeue(struct
> balloon_dev_info *b_dev_info)
>  {
>         struct page *page;
>         unsigned long flags;
> -       bool dequeued_page;
> +       bool dequeued_page, locked;
>         LIST_HEAD(processed); /* protected by b_dev_info->pages_lock */
>  
>         dequeued_page = false;
> @@ -105,13 +105,17 @@ struct page *balloon_page_dequeue(struct
> balloon_dev_info *b_dev_info)
>                         unlock_page(page);
>                         put_page(page);
>                         dequeued_page = true;
> +                       locked = false;
>                         break;
>                 }
>                 put_page(page);
>                 spin_lock_irqsave(&b_dev_info->pages_lock, flags);
> +               locked = true;
>         }
>  
>         /* re-add remaining entries */
> +       if (!locked)
> +               spin_lock_irqsave(&b_dev_info->pages_lock, flags);
>         list_splice(&processed, &b_dev_info->pages);
>         spin_unlock_irqrestore(&b_dev_info->pages_lock, flags);
_______________________________________________
Virtualization mailing list
[email protected]
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
