On 2017-05-16 10:49, Glenn Enright wrote:
On 15/05/17 21:57, Juergen Gross wrote:
On 13/05/17 06:02, Glenn Enright wrote:
On 09/05/17 21:24, Roger Pau Monné wrote:
On Mon, May 08, 2017 at 11:10:24AM +0200, Juergen Gross wrote:
On 04/05/17 00:17, Glenn Enright wrote:
On 04/05/17 04:58, Steven Haigh wrote:
On 04/05/17 01:53, Juergen Gross wrote:
On 03/05/17 12:45, Steven Haigh wrote:
Just wanted to give this a little nudge now people seem to be back on deck...
Glenn, could you please give the attached patch a try? It should be applied on top of the other correction; the old debug patch should not be applied. I have added some debug output to make sure we see what is happening.
This patch is included in kernel-xen-4.9.26-1
It should be in the repos now.
Still seeing the same issue. Without the extra debug patch, all I see in the logs after destroy is this...
xen-blkback: xen_blkif_disconnect: busy
xen-blkback: xen_blkif_free: delayed = 0
Hmm, to me it seems as if some grant isn't being unmapped. Looking at gnttab_unmap_refs_async() I wonder how this is supposed to work:

I don't see how a grant would ever be unmapped in the case of page_count(item->pages[pc]) > 1 in __gnttab_unmap_refs_async(). All it does is defer the call to the unmap operation again and again. Or am I missing something here?
No, I don't think you are missing anything, but I cannot see how this can be solved in a better way; unmapping a page that's still referenced is certainly not the best option, or else we risk triggering a page fault elsewhere.
IMHO, gnttab_unmap_refs_async should have a timeout, and return an error at some point. Also, I'm wondering whether there's a way to keep track of who has references on a specific page, but so far I haven't been able to figure out how to get this information from Linux.
Also, I've noticed that __gnttab_unmap_refs_async uses page_count; shouldn't it use page_ref_count instead?
Roger.
In case it helps, I have continued to work on this. I noticed processes left behind (under 4.9.27). The same issue is ongoing.
# ps auxf | grep [x]vda
root      2983  0.0  0.0      0     0 ?        S    01:44   0:00  \_ [1.xvda1-1]
root      5457  0.0  0.0      0     0 ?        S    02:06   0:00  \_ [3.xvda1-1]
root      7382  0.0  0.0      0     0 ?        S    02:36   0:00  \_ [4.xvda1-1]
root      9668  0.0  0.0      0     0 ?        S    02:51   0:00  \_ [6.xvda1-1]
root     11080  0.0  0.0      0     0 ?        S    02:57   0:00  \_ [7.xvda1-1]
# xl list
Name ID Mem VCPUs State Time(s)
Domain-0 0 1512 2 r----- 118.5
(null) 1 8 4 --p--d 43.8
(null) 3 8 4 --p--d 6.3
(null) 4 8 4 --p--d 73.4
(null) 6 8 4 --p--d 14.7
(null) 7 8 4 --p--d 30
Those all have...
[root 11080]# cat wchan
xen_blkif_schedule
[root 11080]# cat stack
[<ffffffff814eaee8>] xen_blkif_schedule+0x418/0xb40
[<ffffffff810a0555>] kthread+0xe5/0x100
[<ffffffff816f1c45>] ret_from_fork+0x25/0x30
[<ffffffffffffffff>] 0xffffffffffffffff
And I found another reference count bug. Would you like to give the attached patch (to be applied additionally to the previous ones) a try?
Juergen
This seems to have solved the issue in 4.9.28, with all three patches applied. Awesome!
On my main test machine I can no longer replicate what I was originally seeing, and in dmesg I now see this flow...
xen-blkback: xen_blkif_disconnect: busy
xen-blkback: xen_blkif_free: delayed = 1
xen-blkback: xen_blkif_free: delayed = 0
xl list is clean, xenstore looks right. No extraneous processes left over.
Thank you so much, Juergen. I really appreciate your persistence with this. Anything I can do to help push this upstream, please let me know. Feel free to add a Reported-by line with my name if you think it appropriate.
This is good news.
Juergen, can I request a full patch set posted to the list (please CC me)? I'll ensure we can build the kernel with all 3 (?) patches applied and test properly.

I'll build up a complete kernel with those patches and give a Tested-by if all goes well.
--
Steven Haigh
Email: net...@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel