Re: [Xen-devel] xen-netfront crash when detaching network while some network activity

2016-01-22 Thread Marek Marczykowski-Górecki
On Thu, Jan 21, 2016 at 12:30:48PM +, Joao Martins wrote:
> 
> 
> On 01/20/2016 09:59 PM, Konrad Rzeszutek Wilk wrote:
> > On Tue, Dec 01, 2015 at 11:32:58PM +0100, Marek Marczykowski-Górecki wrote:
> >> On Tue, Dec 01, 2015 at 05:00:42PM -0500, Konrad Rzeszutek Wilk wrote:
> >>> On Tue, Nov 17, 2015 at 03:45:15AM +0100, Marek Marczykowski-Górecki 
> >>> wrote:
>  On Wed, Oct 21, 2015 at 08:57:34PM +0200, Marek Marczykowski-Górecki 
>  wrote:
> > On Wed, May 27, 2015 at 12:03:12AM +0200, Marek Marczykowski-Górecki 
> > wrote:
> >> On Tue, May 26, 2015 at 11:56:00AM +0100, David Vrabel wrote:
> >>> On 22/05/15 12:49, Marek Marczykowski-Górecki wrote:
>  Hi all,
> 
>  I'm experiencing xen-netfront crash when doing xl network-detach 
>  while
>  some network activity is going on at the same time. It happens only 
>  when
>  domU has more than one vcpu. Not sure if this matters, but the 
>  backend
>  is in another domU (not dom0). I'm using Xen 4.2.2. It happens on 
>  kernel
>  3.9.4 and 4.1-rc1 as well.
> 
>  Steps to reproduce:
>  1. Start the domU with some network interface
>  2. Call there 'ping -f some-IP'
>  3. Call 'xl network-detach NAME 0'
> >>>
> >>> Do you see this all the time or just on occasion?
> >>
> >> Using the above procedure - all the time.
> >>
> >>> I tried to reproduce it and couldn't see it. Is your VM PV or HVM?
> >>
> >> PV, started by libvirt. This may have something to do with it; the
> >> problem didn't exist on older Xen (4.1) with domains started by xl. I'm
> >> not sure about the kernel version there, but I think I've tried 3.18
> >> there too, which has this problem.
> >>
> >> But I don't see anything special in the domU config file (neither
> >> backend nor frontend) - it may be some libvirt default, if that's
> >> really the cause. Can I get any useful information about that (and how)?
> > 
> > libvirt naturally does some libxl calls, and they may be different.
> > 
> > Any chance you could give me an idea of:
> >  - What commands you use in libvirt?
> >  - Do you use a bond or bridge?
> >  - What version of libvirt you are using?
> > 
> > Thanks!
> > CC-ing Joao just in case he has seen this.
> >>
> Hm, so far I couldn't reproduce the issue with upstream Xen/linux/libvirt,
> using either libvirt or plain xl (both on a bridge setup), irrespective of
> the load and direction of traffic (be it a ping flood, pktgen with
> min-sized packets, or iperf).

I've run the test again, on vanilla 4.4, and collected some info:
 - xenstore dump of frontend (xs-frontend-before.txt)
 - xenstore dump of backend (xs-backend-before.txt)
 - kernel messages (console output) (console.log)
 - kernel config (config-4.4)
 - libvirt config of that domain (netdebug.conf)

Versions:
 - kernel 4.4 (frontend), 4.2.8 (backend)
 - libvirt 1.2.20
 - xen 4.6.0

In backend domain there is no bridge or anything like that - only
routing. The same in frontend - nothing fancy - just IP set on eth0
there.

Steps to reproduce were the same:
 - start frontend domain (virsh create ...)
 - call ping -f
 - xl network-detach NAME 0

Note that the crash doesn't happen with the attached patch applied (as noted
in the mail on Oct 21), but I have no idea whether it is a proper fix, or
just prevents the crash by coincidence.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
[0.00] x86/PAT: Configuration [0-7]: WB  WT  UC- UC  WC  WP  UC  UC  
[0.00] Initializing cgroup subsys cpuset
[0.00] Initializing cgroup subsys cpu
[0.00] Initializing cgroup subsys cpuacct
[0.00] Linux version 4.4.0-1.pvops.qubes.x86_64 (user@devel-3rdparty) 
(gcc version 4.9.2 20150212 (Red Hat 4.9.2-6) (GCC) ) #20 SMP Fri Jan 22 
00:39:29 CET 2016
[0.00] Command line: root=/dev/mapper/dmroot ro nomodeset console=hvc0 
rd_NO_PLYMOUTH 3 rd.break
[0.00] x86/fpu: Legacy x87 FPU detected.
[0.00] x86/fpu: Using 'lazy' FPU context switches.
[0.00] ACPI in unprivileged domain disabled
[0.00] Released 0 page(s)
[0.00] e820: BIOS-provided physical RAM map:
[0.00] Xen: [mem 0x-0x0009] usable
[0.00] Xen: [mem 0x000a-0x000f] reserved
[0.00] Xen: [mem 0x0010-0xf9ff] usable
[0.00] NX (Execute Disable) protection: active
[0.00] DMI not present or invalid.
[0.00] Hypervisor detected: Xen
[0.00] e820: last_pfn = 0xfa000 max_arch_pfn = 0x4
[0.00] MTRR: Disabled
[0.00] RAMDISK: [mem 0x0203-0x027c6fff]
[0.00] NUMA turned off
[0.00] Faking a node at [mem 0x-0xf9ff]
[0.00] NODE_DATA(0) allocated [mem 

Re: [Xen-devel] xen-netfront crash when detaching network while some network activity

2016-01-21 Thread Joao Martins


On 01/20/2016 09:59 PM, Konrad Rzeszutek Wilk wrote:
> On Tue, Dec 01, 2015 at 11:32:58PM +0100, Marek Marczykowski-Górecki wrote:
>> On Tue, Dec 01, 2015 at 05:00:42PM -0500, Konrad Rzeszutek Wilk wrote:
>>> On Tue, Nov 17, 2015 at 03:45:15AM +0100, Marek Marczykowski-Górecki wrote:
 On Wed, Oct 21, 2015 at 08:57:34PM +0200, Marek Marczykowski-Górecki wrote:
> On Wed, May 27, 2015 at 12:03:12AM +0200, Marek Marczykowski-Górecki 
> wrote:
>> On Tue, May 26, 2015 at 11:56:00AM +0100, David Vrabel wrote:
>>> On 22/05/15 12:49, Marek Marczykowski-Górecki wrote:
 Hi all,

 I'm experiencing xen-netfront crash when doing xl network-detach while
 some network activity is going on at the same time. It happens only 
 when
 domU has more than one vcpu. Not sure if this matters, but the backend
 is in another domU (not dom0). I'm using Xen 4.2.2. It happens on 
 kernel
 3.9.4 and 4.1-rc1 as well.

 Steps to reproduce:
 1. Start the domU with some network interface
 2. Call there 'ping -f some-IP'
 3. Call 'xl network-detach NAME 0'
>>>
>>> Do you see this all the time or just on occasion?
>>
>> Using the above procedure - all the time.
>>
>>> I tried to reproduce it and couldn't see it. Is your VM PV or HVM?
>>
>> PV, started by libvirt. This may have something to do with it; the
>> problem didn't exist on older Xen (4.1) with domains started by xl. I'm
>> not sure about the kernel version there, but I think I've tried 3.18
>> there too, which has this problem.
>>
>> But I don't see anything special in the domU config file (neither
>> backend nor frontend) - it may be some libvirt default, if that's
>> really the cause. Can I get any useful information about that (and how)?
> 
> libvirt naturally does some libxl calls, and they may be different.
> 
> Any chance you could give me an idea of:
>  - What commands you use in libvirt?
>  - Do you use a bond or bridge?
>  - What version of libvirt you are using?
> 
> Thanks!
> CC-ing Joao just in case he has seen this.
>>
Hm, so far I couldn't reproduce the issue with upstream Xen/linux/libvirt,
using either libvirt or plain xl (both on a bridge setup), irrespective of
the load and direction of traffic (be it a ping flood, pktgen with
min-sized packets, or iperf).

>>
>> -- 
>> Best Regards,
>> Marek Marczykowski-Górecki
>> Invisible Things Lab
>> A: Because it messes up the order in which people normally read text.
>> Q: Why is top-posting such a bad thing?
> 
> 

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] xen-netfront crash when detaching network while some network activity

2016-01-20 Thread Konrad Rzeszutek Wilk
On Tue, Dec 01, 2015 at 11:32:58PM +0100, Marek Marczykowski-Górecki wrote:
> On Tue, Dec 01, 2015 at 05:00:42PM -0500, Konrad Rzeszutek Wilk wrote:
> > On Tue, Nov 17, 2015 at 03:45:15AM +0100, Marek Marczykowski-Górecki wrote:
> > > On Wed, Oct 21, 2015 at 08:57:34PM +0200, Marek Marczykowski-Górecki 
> > > wrote:
> > > > On Wed, May 27, 2015 at 12:03:12AM +0200, Marek Marczykowski-Górecki 
> > > > wrote:
> > > > > On Tue, May 26, 2015 at 11:56:00AM +0100, David Vrabel wrote:
> > > > > > On 22/05/15 12:49, Marek Marczykowski-Górecki wrote:
> > > > > > > Hi all,
> > > > > > > 
> > > > > > > I'm experiencing xen-netfront crash when doing xl network-detach 
> > > > > > > while
> > > > > > > some network activity is going on at the same time. It happens 
> > > > > > > only when
> > > > > > > domU has more than one vcpu. Not sure if this matters, but the 
> > > > > > > backend
> > > > > > > is in another domU (not dom0). I'm using Xen 4.2.2. It happens on 
> > > > > > > kernel
> > > > > > > 3.9.4 and 4.1-rc1 as well.
> > > > > > > 
> > > > > > > Steps to reproduce:
> > > > > > > 1. Start the domU with some network interface
> > > > > > > 2. Call there 'ping -f some-IP'
> > > > > > > 3. Call 'xl network-detach NAME 0'
> > 
> > Do you see this all the time or just on occasion?
> 
> Using the above procedure - all the time.
> 
> > I tried to reproduce it and couldn't see it. Is your VM PV or HVM?
> 
> PV, started by libvirt. This may have something to do with it; the
> problem didn't exist on older Xen (4.1) with domains started by xl. I'm
> not sure about the kernel version there, but I think I've tried 3.18
> there too, which has this problem.
> 
> But I don't see anything special in the domU config file (neither
> backend nor frontend) - it may be some libvirt default, if that's
> really the cause. Can I get any useful information about that (and how)?

libvirt naturally does some libxl calls, and they may be different.

Any chance you could give me an idea of:
 - What commands you use in libvirt?
 - Do you use a bond or bridge?
 - What version of libvirt you are using?

Thanks!
CC-ing Joao just in case he has seen this.
> 
> 
> -- 
> Best Regards,
> Marek Marczykowski-Górecki
> Invisible Things Lab
> A: Because it messes up the order in which people normally read text.
> Q: Why is top-posting such a bad thing?





Re: [Xen-devel] xen-netfront crash when detaching network while some network activity

2015-12-01 Thread Marek Marczykowski-Górecki
On Tue, Dec 01, 2015 at 05:00:42PM -0500, Konrad Rzeszutek Wilk wrote:
> On Tue, Nov 17, 2015 at 03:45:15AM +0100, Marek Marczykowski-Górecki wrote:
> > On Wed, Oct 21, 2015 at 08:57:34PM +0200, Marek Marczykowski-Górecki wrote:
> > > On Wed, May 27, 2015 at 12:03:12AM +0200, Marek Marczykowski-Górecki 
> > > wrote:
> > > > On Tue, May 26, 2015 at 11:56:00AM +0100, David Vrabel wrote:
> > > > > On 22/05/15 12:49, Marek Marczykowski-Górecki wrote:
> > > > > > Hi all,
> > > > > > 
> > > > > > I'm experiencing xen-netfront crash when doing xl network-detach 
> > > > > > while
> > > > > > some network activity is going on at the same time. It happens only 
> > > > > > when
> > > > > > domU has more than one vcpu. Not sure if this matters, but the 
> > > > > > backend
> > > > > > is in another domU (not dom0). I'm using Xen 4.2.2. It happens on 
> > > > > > kernel
> > > > > > 3.9.4 and 4.1-rc1 as well.
> > > > > > 
> > > > > > Steps to reproduce:
> > > > > > 1. Start the domU with some network interface
> > > > > > 2. Call there 'ping -f some-IP'
> > > > > > 3. Call 'xl network-detach NAME 0'
> 
> Do you see this all the time or just on occasion?

Using the above procedure - all the time.

> I tried to reproduce it and couldn't see it. Is your VM PV or HVM?

PV, started by libvirt. This may have something to do with it; the
problem didn't exist on older Xen (4.1) with domains started by xl. I'm
not sure about the kernel version there, but I think I've tried 3.18
there too, which has this problem.

But I don't see anything special in the domU config file (neither
backend nor frontend) - it may be some libvirt default, if that's
really the cause. Can I get any useful information about that (and how)?


-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?




Re: [Xen-devel] xen-netfront crash when detaching network while some network activity

2015-12-01 Thread Konrad Rzeszutek Wilk
On Tue, Nov 17, 2015 at 03:45:15AM +0100, Marek Marczykowski-Górecki wrote:
> On Wed, Oct 21, 2015 at 08:57:34PM +0200, Marek Marczykowski-Górecki wrote:
> > On Wed, May 27, 2015 at 12:03:12AM +0200, Marek Marczykowski-Górecki wrote:
> > > On Tue, May 26, 2015 at 11:56:00AM +0100, David Vrabel wrote:
> > > > On 22/05/15 12:49, Marek Marczykowski-Górecki wrote:
> > > > > Hi all,
> > > > > 
> > > > > I'm experiencing xen-netfront crash when doing xl network-detach while
> > > > > some network activity is going on at the same time. It happens only 
> > > > > when
> > > > > domU has more than one vcpu. Not sure if this matters, but the backend
> > > > > is in another domU (not dom0). I'm using Xen 4.2.2. It happens on 
> > > > > kernel
> > > > > 3.9.4 and 4.1-rc1 as well.
> > > > > 
> > > > > Steps to reproduce:
> > > > > 1. Start the domU with some network interface
> > > > > 2. Call there 'ping -f some-IP'
> > > > > 3. Call 'xl network-detach NAME 0'

Do you see this all the time or just on occasion?

I tried to reproduce it and couldn't see it. Is your VM PV or HVM?



Re: [Xen-devel] xen-netfront crash when detaching network while some network activity

2015-11-17 Thread David Vrabel
On 21/10/15 19:57, Marek Marczykowski-Górecki wrote:
> 
> Any ideas?

No, sorry.  Netfront looks correct to me.

We take an additional ref for the ref released by
gnttab_release_grant_reference().  The get_page() here is safe since we
haven't freed the page yet (this is done in the subsequent call to
dev_kfree_skb_irq()).

get_page()/put_page() also look fine when used with tail pages.

David



Re: [Xen-devel] xen-netfront crash when detaching network while some network activity

2015-11-16 Thread Marek Marczykowski-Górecki
On Wed, Oct 21, 2015 at 08:57:34PM +0200, Marek Marczykowski-Górecki wrote:
> On Wed, May 27, 2015 at 12:03:12AM +0200, Marek Marczykowski-Górecki wrote:
> > On Tue, May 26, 2015 at 11:56:00AM +0100, David Vrabel wrote:
> > > On 22/05/15 12:49, Marek Marczykowski-Górecki wrote:
> > > > Hi all,
> > > > 
> > > > I'm experiencing xen-netfront crash when doing xl network-detach while
> > > > some network activity is going on at the same time. It happens only when
> > > > domU has more than one vcpu. Not sure if this matters, but the backend
> > > > is in another domU (not dom0). I'm using Xen 4.2.2. It happens on kernel
> > > > 3.9.4 and 4.1-rc1 as well.
> > > > 
> > > > Steps to reproduce:
> > > > 1. Start the domU with some network interface
> > > > 2. Call there 'ping -f some-IP'
> > > > 3. Call 'xl network-detach NAME 0'
> > > 
> > > There's a use-after-free in xennet_remove().  Does this patch fix it?
> > 
> Unfortunately not. Note that the crash is in xennet_disconnect_backend,
> which is called before xennet_destroy_queues in xennet_remove.
> I've tried to add napi_disable and even netif_napi_del just after
> napi_synchronize in xennet_disconnect_backend (which would probably
> cause a crash when trying to clean up the same things again later), but
> it doesn't help - the crash is the same (still in
> gnttab_end_foreign_access called from xennet_disconnect_backend).
> 
> Finally I've found some more time to debug this... All tests redone on
> v4.3-rc6 frontend and 3.18.17 backend.
> 
> Looking at xennet_tx_buf_gc(), I have the impression that the shared page
> (queue->grant_tx_page[id]) is (or should be) freed by some means other
> than (indirectly) calling free_page via gnttab_end_foreign_access. Maybe
> the bug is that the page _is_ actually freed somewhere else already? At
> least changing gnttab_end_foreign_access to gnttab_end_foreign_access_ref
> makes the crash go away.
> 
> Relevant xennet_tx_buf_gc fragment:
> 	gnttab_end_foreign_access_ref(
> 		queue->grant_tx_ref[id], GNTMAP_readonly);
> 	gnttab_release_grant_reference(
> 		&queue->gref_tx_head, queue->grant_tx_ref[id]);
> 	queue->grant_tx_ref[id] = GRANT_INVALID_REF;
> 	queue->grant_tx_page[id] = NULL;
> 	add_id_to_freelist(&queue->tx_skb_freelist, queue->tx_skbs, id);
> 	dev_kfree_skb_irq(skb);
> 
> And similar fragment from xennet_release_tx_bufs:
> 	get_page(queue->grant_tx_page[i]);
> 	gnttab_end_foreign_access(queue->grant_tx_ref[i],
> 				  GNTMAP_readonly,
> 				  (unsigned long)page_address(queue->grant_tx_page[i]));
> 	queue->grant_tx_page[i] = NULL;
> 	queue->grant_tx_ref[i] = GRANT_INVALID_REF;
> 	add_id_to_freelist(&queue->tx_skb_freelist, queue->tx_skbs, i);
> 	dev_kfree_skb_irq(skb);
> 
> Note that both have dev_kfree_skb_irq, but the former uses
> gnttab_end_foreign_access_ref, while the latter uses
> gnttab_end_foreign_access. Also note that the crash is in
> gnttab_end_foreign_access, so before dev_kfree_skb_irq. If it were a
> double free, I'd expect the crash in the latter.
> 
> This change was introduced by cefe007 "xen-netfront: fix resource leak in
> netfront". I'm not sure if changing gnttab_end_foreign_access back to
> gnttab_end_foreign_access_ref would not (re)introduce some memory leak.
> 
> Let me paste again the error message:
> [   73.718636] page:ea43b1c0 count:0 mapcount:0 mapping:  
> (null) index:0x0
> [   73.718661] flags: 0x3ffc008000(tail)
> [   73.718684] page dumped because: VM_BUG_ON_PAGE(atomic_read(&page->_count)
> == 0)
> [   73.718725] [ cut here ]
> [   73.718743] kernel BUG at include/linux/mm.h:338!
> 
> Also it all looks quite strange - there is a get_page() call just before
> gnttab_end_foreign_access, but page->_count is still 0. Maybe it has
> something to do with how get_page() works on "tail" pages (whatever that
> means)?
> 
> static inline void get_page(struct page *page)
> {
> if (unlikely(PageTail(page)))
> if (likely(__get_page_tail(page)))
> return;
> /*
>  * Getting a normal page or the head of a compound page
>  * requires to already have an elevated page->_count.
>  */
> 	VM_BUG_ON_PAGE(atomic_read(&page->_count) <= 0, page);
> 	atomic_inc(&page->_count);
> }
> 
> which (I think) ends up in:
> 
> static inline void __get_page_tail_foll(struct page *page,
> bool get_page_head)
> {
> /*
>  * If we're getting a tail page, the elevated page->_count is
>  * required only in the head page and we will elevate the head
>  * page->_count and tail page->_mapcount.
>  *
>  * We elevate page_tail->_mapcount for tail pages to force
>  * page_tail->_count to be zero at all times to avoid getting
>  * false positives from get_page_unless_zero() with
>  

Re: [Xen-devel] xen-netfront crash when detaching network while some network activity

2015-10-21 Thread Marek Marczykowski-Górecki
On Wed, May 27, 2015 at 12:03:12AM +0200, Marek Marczykowski-Górecki wrote:
> On Tue, May 26, 2015 at 11:56:00AM +0100, David Vrabel wrote:
> > On 22/05/15 12:49, Marek Marczykowski-Górecki wrote:
> > > Hi all,
> > > 
> > > I'm experiencing xen-netfront crash when doing xl network-detach while
> > > some network activity is going on at the same time. It happens only when
> > > domU has more than one vcpu. Not sure if this matters, but the backend
> > > is in another domU (not dom0). I'm using Xen 4.2.2. It happens on kernel
> > > 3.9.4 and 4.1-rc1 as well.
> > > 
> > > Steps to reproduce:
> > > 1. Start the domU with some network interface
> > > 2. Call there 'ping -f some-IP'
> > > 3. Call 'xl network-detach NAME 0'
> > 
> > There's a use-after-free in xennet_remove().  Does this patch fix it?
> 
> Unfortunately not. Note that the crash is in xennet_disconnect_backend,
> which is called before xennet_destroy_queues in xennet_remove.
> I've tried to add napi_disable and even netif_napi_del just after
> napi_synchronize in xennet_disconnect_backend (which would probably
> cause a crash when trying to clean up the same things again later), but
> it doesn't help - the crash is the same (still in
> gnttab_end_foreign_access called from xennet_disconnect_backend).

Finally I've found some more time to debug this... All tests redone on
v4.3-rc6 frontend and 3.18.17 backend.

Looking at xennet_tx_buf_gc(), I have the impression that the shared page
(queue->grant_tx_page[id]) is (or should be) freed by some means other
than (indirectly) calling free_page via gnttab_end_foreign_access. Maybe
the bug is that the page _is_ actually freed somewhere else already? At
least changing gnttab_end_foreign_access to gnttab_end_foreign_access_ref
makes the crash go away.

Relevant xennet_tx_buf_gc fragment:
	gnttab_end_foreign_access_ref(
		queue->grant_tx_ref[id], GNTMAP_readonly);
	gnttab_release_grant_reference(
		&queue->gref_tx_head, queue->grant_tx_ref[id]);
	queue->grant_tx_ref[id] = GRANT_INVALID_REF;
	queue->grant_tx_page[id] = NULL;
	add_id_to_freelist(&queue->tx_skb_freelist, queue->tx_skbs, id);
	dev_kfree_skb_irq(skb);

And similar fragment from xennet_release_tx_bufs:
	get_page(queue->grant_tx_page[i]);
	gnttab_end_foreign_access(queue->grant_tx_ref[i],
				  GNTMAP_readonly,
				  (unsigned long)page_address(queue->grant_tx_page[i]));
	queue->grant_tx_page[i] = NULL;
	queue->grant_tx_ref[i] = GRANT_INVALID_REF;
	add_id_to_freelist(&queue->tx_skb_freelist, queue->tx_skbs, i);
	dev_kfree_skb_irq(skb);

Note that both have dev_kfree_skb_irq, but the former uses
gnttab_end_foreign_access_ref, while the latter uses
gnttab_end_foreign_access. Also note that the crash is in
gnttab_end_foreign_access, so before dev_kfree_skb_irq. If it were a
double free, I'd expect the crash in the latter.

This change was introduced by cefe007 "xen-netfront: fix resource leak in
netfront". I'm not sure if changing gnttab_end_foreign_access back to
gnttab_end_foreign_access_ref would not (re)introduce some memory leak.

Let me paste again the error message:
[   73.718636] page:ea43b1c0 count:0 mapcount:0 mapping:  
(null) index:0x0
[   73.718661] flags: 0x3ffc008000(tail)
[   73.718684] page dumped because: VM_BUG_ON_PAGE(atomic_read(&page->_count)
== 0)
[   73.718725] [ cut here ]
[   73.718743] kernel BUG at include/linux/mm.h:338!

Also it all looks quite strange - there is a get_page() call just before
gnttab_end_foreign_access, but page->_count is still 0. Maybe it has
something to do with how get_page() works on "tail" pages (whatever that
means)?

static inline void get_page(struct page *page)
{
if (unlikely(PageTail(page)))
if (likely(__get_page_tail(page)))
return;
/*
 * Getting a normal page or the head of a compound page
 * requires to already have an elevated page->_count.
 */
	VM_BUG_ON_PAGE(atomic_read(&page->_count) <= 0, page);
	atomic_inc(&page->_count);
}

which (I think) ends up in:

static inline void __get_page_tail_foll(struct page *page,
bool get_page_head)
{
/*
 * If we're getting a tail page, the elevated page->_count is
 * required only in the head page and we will elevate the head
 * page->_count and tail page->_mapcount.
 *
 * We elevate page_tail->_mapcount for tail pages to force
 * page_tail->_count to be zero at all times to avoid getting
 * false positives from get_page_unless_zero() with
 * speculative page access (like in
 * page_cache_get_speculative()) on tail pages.
 */
	VM_BUG_ON_PAGE(atomic_read(&page->first_page->_count) <= 0, page);
	if (get_page_head)
		atomic_inc(&page->first_page->_count);
get_huge_page_tail(page);
}

So 

Re: [Xen-devel] xen-netfront crash when detaching network while some network activity

2015-05-26 Thread Marek Marczykowski-Górecki
On Tue, May 26, 2015 at 11:56:00AM +0100, David Vrabel wrote:
 On 22/05/15 12:49, Marek Marczykowski-Górecki wrote:
  Hi all,
  
  I'm experiencing xen-netfront crash when doing xl network-detach while
  some network activity is going on at the same time. It happens only when
  domU has more than one vcpu. Not sure if this matters, but the backend
  is in another domU (not dom0). I'm using Xen 4.2.2. It happens on kernel
  3.9.4 and 4.1-rc1 as well.
  
  Steps to reproduce:
  1. Start the domU with some network interface
  2. Call there 'ping -f some-IP'
  3. Call 'xl network-detach NAME 0'
 
 There's a use-after-free in xennet_remove().  Does this patch fix it?

Unfortunately not. Note that the crash is in xennet_disconnect_backend,
which is called before xennet_destroy_queues in xennet_remove.
I've tried to add napi_disable and even netif_napi_del just after
napi_synchronize in xennet_disconnect_backend (which would probably
cause a crash when trying to clean up the same things again later), but
it doesn't help - the crash is the same (still in
gnttab_end_foreign_access called from xennet_disconnect_backend).


 8<-------------------------------------------------------
 xen-netfront: properly destroy queues when removing device
 
 xennet_remove() freed the queues before freeing the netdevice which
 results in a use-after-free when free_netdev() tries to delete the
 napi instances that have already been freed.
 
 Fix this by fully destroying the queues (which includes deleting the
 napi instances) before freeing the netdevice.
 
 Reported-by: Marek Marczykowski marma...@invisiblethingslab.com
 Signed-off-by: David Vrabel david.vra...@citrix.com
 ---
  drivers/net/xen-netfront.c |   15 ++-
  1 file changed, 2 insertions(+), 13 deletions(-)
 
 diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
 index 3f45afd..e031c94 100644
 --- a/drivers/net/xen-netfront.c
 +++ b/drivers/net/xen-netfront.c
 @@ -1698,6 +1698,7 @@ static void xennet_destroy_queues(struct netfront_info *info)
 
 		if (netif_running(info->netdev))
 			napi_disable(&queue->napi);
 +		del_timer_sync(&queue->rx_refill_timer);
 		netif_napi_del(&queue->napi);
 	}
 
 @@ -2102,9 +2103,6 @@ static const struct attribute_group xennet_dev_group = {
  static int xennet_remove(struct xenbus_device *dev)
  {
 	struct netfront_info *info = dev_get_drvdata(&dev->dev);
 -	unsigned int num_queues = info->netdev->real_num_tx_queues;
 -	struct netfront_queue *queue = NULL;
 -	unsigned int i = 0;
 
 	dev_dbg(&dev->dev, "%s\n", dev->nodename);
 
 @@ -2112,16 +2110,7 @@ static int xennet_remove(struct xenbus_device *dev)
 
 	unregister_netdev(info->netdev);
 
 -	for (i = 0; i < num_queues; ++i) {
 -		queue = info->queues[i];
 -		del_timer_sync(&queue->rx_refill_timer);
 -	}
 -
 -	if (num_queues) {
 -		kfree(info->queues);
 -		info->queues = NULL;
 -	}
 -
 +	xennet_destroy_queues(info);
 	xennet_free_netdev(info->netdev);
 
 	return 0;

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?




[Xen-devel] xen-netfront crash when detaching network while some network activity

2015-05-22 Thread Marek Marczykowski-Górecki
Hi all,

I'm experiencing xen-netfront crash when doing xl network-detach while
some network activity is going on at the same time. It happens only when
domU has more than one vcpu. Not sure if this matters, but the backend
is in another domU (not dom0). I'm using Xen 4.2.2. It happens on kernel
3.9.4 and 4.1-rc1 as well.

Steps to reproduce:
1. Start the domU with some network interface
2. Call there 'ping -f some-IP'
3. Call 'xl network-detach NAME 0'

The crash message:
[   54.163670] page:ea4bddc0 count:0 mapcount:0 mapping:
(null) index:0x0
[   54.163692] flags: 0x3fff808000(tail)
[   54.163704] page dumped because:
VM_BUG_ON_PAGE(atomic_read(&page->_count) == 0)
[   54.163726] [ cut here ]
[   54.163734] kernel BUG at include/linux/mm.h:343!
[   54.163742] invalid opcode:  [#1] SMP 
[   54.163752] Modules linked in:
[   54.163762] CPU: 1 PID: 24 Comm: xenwatch Not tainted
4.1.0-rc1-1.pvops.qubes.x86_64 #4
[   54.163773] task: 8800133c4c00 ti: 880012c94000 task.ti:
880012c94000
[   54.163782] RIP: e030:[811843cc]  [811843cc]
__free_pages+0x4c/0x50
[   54.163800] RSP: e02b:880012c97be8  EFLAGS: 00010292
[   54.163808] RAX: 0044 RBX: 77ff8000 RCX:
0044
[   54.163817] RDX:  RSI:  RDI:
880013d0ea00
[   54.163826] RBP: 880012c97be8 R08: 00f2 R09:

[   54.163835] R10: 00f2 R11: 8185efc0 R12:

[   54.163844] R13: 880011814200 R14: 880012f77000 R15:
0004
[   54.163860] FS:  7f735f0d8740() GS:880013d0()
knlGS:
[   54.163870] CS:  e033 DS:  ES:  CR0: 8005003b
[   54.163878] CR2: 01652c50 CR3: 12112000 CR4:
2660
[   54.163892] Stack:
[   54.163901]  880012c97c08 81184430 0011
0004
[   54.163922]  880012c97c38 814100c6 87ff
880011f20d88
[   54.163943]  880011814200 880011f2 880012c97ca8
814d34e6
[   54.163964] Call Trace:
[   54.163977]  [81184430] free_pages+0x60/0x70
[   54.163994]  [814100c6]
gnttab_end_foreign_access+0x136/0x170
[   54.164012]  [814d34e6]
xennet_disconnect_backend.isra.24+0x166/0x390
[   54.164030]  [814d37a8] xennet_remove+0x38/0xd0
[   54.164045]  [8141a009] xenbus_dev_remove+0x59/0xc0
[   54.164059]  [81479d27] __device_release_driver+0x87/0x120
[   54.164528]  [81479de3] device_release_driver+0x23/0x30
[   54.164528]  [81479658] bus_remove_device+0x108/0x180
[   54.164528]  [81475861] device_del+0x141/0x270
[   54.164528]  [814186a0] ?
unregister_xenbus_watch+0x1d0/0x1d0
[   54.164528]  [814759b2] device_unregister+0x22/0x80
[   54.164528]  [81419e5f] xenbus_dev_changed+0xaf/0x200
[   54.164528]  [816ad346] ?
_raw_spin_unlock_irqrestore+0x16/0x20
[   54.164528]  [814186a0] ?
unregister_xenbus_watch+0x1d0/0x1d0
[   54.164528]  [8141bdb9] frontend_changed+0x29/0x60
[   54.164528]  [814186a0] ?
unregister_xenbus_watch+0x1d0/0x1d0
[   54.164528]  [8141872e] xenwatch_thread+0x8e/0x150
[   54.164528]  [810be2b0] ? wait_woken+0x90/0x90
[   54.164528]  [81099958] kthread+0xd8/0xf0
[   54.164528]  [81099880] ?
kthread_create_on_node+0x1b0/0x1b0
[   54.164528]  [816adde2] ret_from_fork+0x42/0x70
[   54.164528]  [81099880] ?
kthread_create_on_node+0x1b0/0x1b0
[   54.164528] Code: f6 74 0c e8 67 f5 ff ff 5d c3 0f 1f 44 00 00 31 f6
e8 99 fd ff ff 5d c3 0f 1f 80 00 00 00 00 48 c7 c6 78 29 a1 81 e8 d4 37
02 00 0f 0b 66 90 66 66 66 66 90 48 85 ff 75 06 f3 c3 0f 1f 40 00 55 
[   54.164528] RIP  [811843cc] __free_pages+0x4c/0x50
[   54.164528]  RSP 880012c97be8
[   54.166002] ---[ end trace 6b847bc27fec6d36 ]---

Any ideas how to fix this? I guess xennet_disconnect_backend should take
some lock.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?




Re: [Xen-devel] xen-netfront crash when detaching network while some network activity

2015-05-22 Thread David Vrabel
On 22/05/15 17:42, Marek Marczykowski-Górecki wrote:
 On Fri, May 22, 2015 at 05:25:44PM +0100, David Vrabel wrote:
 On 22/05/15 12:49, Marek Marczykowski-Górecki wrote:
 Hi all,

 I'm experiencing xen-netfront crash when doing xl network-detach while
 some network activity is going on at the same time. It happens only when
 domU has more than one vcpu. Not sure if this matters, but the backend
 is in another domU (not dom0). I'm using Xen 4.2.2. It happens on kernel
 3.9.4 and 4.1-rc1 as well.

 Steps to reproduce:
 1. Start the domU with some network interface
 2. Call there 'ping -f some-IP'
 3. Call 'xl network-detach NAME 0'

 I tried this about 10 times without a crash.  How reproducible is it?

 I used a 4.1-rc4 frontend and a 4.0 backend.
 
 It happens every time for me... Do you have at least two vcpus in that
 domU? With one vcpu it doesn't crash. The IP for ping I've used one in
 backend domU, but it shouldn't matter.
 
 Backend is 3.19.6 here. I don't see any changes there between rc1 and
 rc4, so stayed with rc1. With 4.1-rc1 backend it also crashes for me.

Doesn't repro for me with 4 VCPU PV or PVHVM guests.  Is your guest
kernel vanilla or does it have some qubes specific patches on top?

David



Re: [Xen-devel] xen-netfront crash when detaching network while some network activity

2015-05-22 Thread Marek Marczykowski-Górecki
On Fri, May 22, 2015 at 05:58:41PM +0100, David Vrabel wrote:
 On 22/05/15 17:42, Marek Marczykowski-Górecki wrote:
  On Fri, May 22, 2015 at 05:25:44PM +0100, David Vrabel wrote:
  On 22/05/15 12:49, Marek Marczykowski-Górecki wrote:
  Hi all,
 
  I'm experiencing xen-netfront crash when doing xl network-detach while
  some network activity is going on at the same time. It happens only when
  domU has more than one vcpu. Not sure if this matters, but the backend
  is in another domU (not dom0). I'm using Xen 4.2.2. It happens on kernel
  3.9.4 and 4.1-rc1 as well.
 
  Steps to reproduce:
  1. Start the domU with some network interface
  2. Call there 'ping -f some-IP'
  3. Call 'xl network-detach NAME 0'
 
  I tried this about 10 times without a crash.  How reproducible is it?
 
  I used a 4.1-rc4 frontend and a 4.0 backend.
  
  It happens every time for me... Do you have at least two vcpus in that
  domU? With one vcpu it doesn't crash. The IP for ping I've used one in
  backend domU, but it shouldn't matter.
  
  Backend is 3.19.6 here. I don't see any changes there between rc1 and
  rc4, so stayed with rc1. With 4.1-rc1 backend it also crashes for me.
 
 Doesn't repro for me with 4 VCPU PV or PVHVM guests.

I've tried with exactly 2 vcpus in frontend domU (PV), but I guess it
shouldn't matter. Backend is also PV.

 Is your guest
 kernel vanilla or does it have some qubes specific patches on top?

This one was from vanilla - both frontend and backend (just qubes
config).
Maybe something about device configuration? Here is xenstore dump:
frontend:
0 = ""
 backend = "/local/domain/66/backend/vif/69/0"
 backend-id = "66"
 state = "4"
 handle = "0"
 mac = "00:16:3e:5e:6c:07"
 multi-queue-num-queues = "2"
 queue-0 = ""
  tx-ring-ref = "1280"
  rx-ring-ref = "1281"
  event-channel-tx = "19"
  event-channel-rx = "20"
 queue-1 = ""
  tx-ring-ref = "1282"
  rx-ring-ref = "1283"
  event-channel-tx = "21"
  event-channel-rx = "22"
 request-rx-copy = "1"
 feature-rx-notify = "1"
 feature-sg = "1"
 feature-gso-tcpv4 = "1"
 feature-gso-tcpv6 = "1"
 feature-ipv6-csum-offload = "1"

backend:
69 = ""
 0 = ""
  frontend = "/local/domain/69/device/vif/0"
  frontend-id = "69"
  online = "1"
  state = "4"
  script = "/etc/xen/scripts/vif-route-qubes"
  mac = "00:16:3e:5e:6c:07"
  ip = "10.137.3.9"
  handle = "0"
  type = "vif"
  feature-sg = "1"
  feature-gso-tcpv4 = "1"
  feature-gso-tcpv6 = "1"
  feature-ipv6-csum-offload = "1"
  feature-rx-copy = "1"
  feature-rx-flip = "0"
  feature-split-event-channels = "1"
  multi-queue-max-queues = "2"
  hotplug-status = "connected"


-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?

