Re: [Xen-devel] null domains after xl destroy

Juergen Gross Mon, 15 May 2017 02:59:03 -0700

On 13/05/17 06:02, Glenn Enright wrote:
> On 09/05/17 21:24, Roger Pau Monné wrote:
>> On Mon, May 08, 2017 at 11:10:24AM +0200, Juergen Gross wrote:
>>> On 04/05/17 00:17, Glenn Enright wrote:
>>>> On 04/05/17 04:58, Steven Haigh wrote:
>>>>> On 04/05/17 01:53, Juergen Gross wrote:
>>>>>> On 03/05/17 12:45, Steven Haigh wrote:
>>>>>>> Just wanted to give this a little nudge now people seem to be
>>>>>>> back on
>>>>>>> deck...
>>>>>>
>>>>>> Glenn, could you please give the attached patch a try?
>>>>>>
>>>>>> It should be applied on top of the other correction, the old debug
>>>>>> patch should not be applied.
>>>>>>
>>>>>> I have added some debug output to make sure we see what is happening.
>>>>>
>>>>> This patch is included in kernel-xen-4.9.26-1
>>>>>
>>>>> It should be in the repos now.
>>>>>
>>>>
>>>> Still seeing the same issue. Without the extra debug patch all I see in
>>>> the logs after destroy is this...
>>>>
>>>> xen-blkback: xen_blkif_disconnect: busy
>>>> xen-blkback: xen_blkif_free: delayed = 0
>>>
>>> Hmm, to me it seems as if some grant isn't being unmapped.
>>>
>>> Looking at gnttab_unmap_refs_async() I wonder how this is supposed to
>>> work:
>>>
>>> I don't see how a grant would ever be unmapped in case of
>>> page_count(item->pages[pc]) > 1 in __gnttab_unmap_refs_async(). All it
>>> does is deferring the call to the unmap operation again and again. Or
>>> am I missing something here?
>>
>> No, I don't think you are missing anything, but I cannot see how this
>> can be
>> solved in a better way, unmapping a page that's still referenced is
>> certainly
>> not the best option, or else we risk triggering a page-fault elsewhere.
>>
>> IMHO, gnttab_unmap_refs_async should have a timeout, and return an
>> error at
>> some point. Also, I'm wondering whether there's a way to keep track of
>> who has
>> references on a specific page, but so far I haven't been able to
>> figure out how
>> to get this information from Linux.
>>
>> Also, I've noticed that __gnttab_unmap_refs_async uses page_count,
>> shouldn't it
>> use page_ref_count instead?
>>
>> Roger.
>>
> 
> In case it helps, I have continued to work on this. I notices processed
> left behind (under 4.9.27). The same issue is ongoing.
> 
> # ps auxf | grep [x]vda
> root      2983  0.0  0.0      0     0 ?        S    01:44   0:00  \_
> [1.xvda1-1]
> root      5457  0.0  0.0      0     0 ?        S    02:06   0:00  \_
> [3.xvda1-1]
> root      7382  0.0  0.0      0     0 ?        S    02:36   0:00  \_
> [4.xvda1-1]
> root      9668  0.0  0.0      0     0 ?        S    02:51   0:00  \_
> [6.xvda1-1]
> root     11080  0.0  0.0      0     0 ?        S    02:57   0:00  \_
> [7.xvda1-1]
> 
> # xl list
> Name                              ID   Mem VCPUs      State   Time(s)
> Domain-0                          0  1512     2     r-----     118.5
> (null)                            1     8     4     --p--d      43.8
> (null)                            3     8     4     --p--d       6.3
> (null)                            4     8     4     --p--d      73.4
> (null)                            6     8     4     --p--d      14.7
> (null)                            7     8     4     --p--d      30
> 
> Those all have...
> 
> [root 11080]# cat wchan
> xen_blkif_schedule
> 
> [root 11080]# cat stack
> [<ffffffff814eaee8>] xen_blkif_schedule+0x418/0xb40
> [<ffffffff810a0555>] kthread+0xe5/0x100
> [<ffffffff816f1c45>] ret_from_fork+0x25/0x30
> [<ffffffffffffffff>] 0xffffffffffffffff


And found another reference count bug. Would you like to give the
attached patch (to be applied additionally to the previous ones) a try?


Juergen

>From ef37fe77b7b1b1d733011df5615634e543e21c4f Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgr...@suse.com>
Date: Mon, 15 May 2017 11:41:18 +0200
Subject: [PATCH 3/3] xen/blkback: don't use xen_blkif_get() in xen-blkback
 kthread

There is no need to use xen_blkif_get()/xen_blkif_put() in the kthread
of xen-blkback. Thread stopping is synchronous and using the blkif
reference counting in the kthread will avoid to ever let the reference
count drop to zero at the end of an I/O running concurrent to
disconnecting and multiple rings.

Setting ring->xenblkd to NULL after stopping the kthread isn't needed
as the kthread does this already.

Signed-off-by: Juergen Gross <jgr...@suse.com>
---
 drivers/block/xen-blkback/blkback.c | 3 ---
 drivers/block/xen-blkback/xenbus.c  | 1 -
 2 files changed, 4 deletions(-)

diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c
index 726c32e35db9..6b14c509f3c7 100644
--- a/drivers/block/xen-blkback/blkback.c
+++ b/drivers/block/xen-blkback/blkback.c
@@ -609,8 +609,6 @@ int xen_blkif_schedule(void *arg)
 	unsigned long timeout;
 	int ret;
 
-	xen_blkif_get(blkif);
-
 	set_freezable();
 	while (!kthread_should_stop()) {
 		if (try_to_freeze())
@@ -665,7 +663,6 @@ int xen_blkif_schedule(void *arg)
 		print_stats(ring);
 
 	ring->xenblkd = NULL;
-	xen_blkif_put(blkif);
 
 	return 0;
 }
diff --git a/drivers/block/xen-blkback/xenbus.c b/drivers/block/xen-blkback/xenbus.c
index ec7eaeee3765..9f8d136d7636 100644
--- a/drivers/block/xen-blkback/xenbus.c
+++ b/drivers/block/xen-blkback/xenbus.c
@@ -255,7 +255,6 @@ static int xen_blkif_disconnect(struct xen_blkif *blkif)
 		if (ring->xenblkd) {
 			kthread_stop(ring->xenblkd);
 			wake_up(&ring->shutdown_wq);
-			ring->xenblkd = NULL;
 		}
 
 		/* The above kthread_stop() guarantees that at this point we
-- 
2.12.0

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

Re: [Xen-devel] null domains after xl destroy

Reply via email to