Re: [Xen-devel] Xen 4.7 crash

2016-06-14 Thread Aaron Cornelius

On 6/14/2016 9:26 AM, Aaron Cornelius wrote:

On 6/14/2016 9:15 AM, Wei Liu wrote:

On Tue, Jun 14, 2016 at 09:11:47AM -0400, Aaron Cornelius wrote:

On 6/9/2016 7:14 AM, Ian Jackson wrote:

Aaron Cornelius writes ("Re: [Xen-devel] Xen 4.7 crash"):

I am not that familiar with the xenstored code, but as far as I can tell
the grant mapping will be held by the xenstore until the xs_release()
function is called (which is not called by libxl, and I do not
explicitly call it in my software, although I might now just to be
safe), or until the last reference to a domain is released and the
registered destructor (destroy_domain), set by talloc_set_destructor(),
is called.


I'm not sure I follow.  Or maybe I disagree.  ISTM that:

The grant mapping is released by destroy_domain, which is called via
the talloc destructor as a result of talloc_free(domain->conn) in
domain_cleanup.  I don't see other references to domain->conn.

domain_cleanup calls talloc_free on domain->conn when it sees the
domain marked as dying in domain_cleanup.

So I still think that your acl reference ought not to keep the grant
mapping alive.


It took a while to complete the testing, but we've finished trying to
reproduce the error using oxenstored instead of the C xenstored.  When the
condition occurs that caused the error with the C xenstored (on
4.7.0-rc4/8478c9409a2c6726208e8dbc9f3e455b76725a33), oxenstored does not
cause the crash.

So for whatever reason, it would appear that the C xenstored does keep the
grant allocations open, but oxenstored does not.



Can you provide some easy to follow steps to reproduce this issue?

AFAICT your environment is very specialised, but we should be able to
trigger the issue with plain xenstore-* utilities?


I am not sure if the plain xenstore-* utilities will work, but here are
the steps to follow:

1. Create a non-standard xenstore path: /tool/test
2. Create a domU (mini-os/mirage/something small)
3. Add the new domU to the /tool/test permissions list (I'm not 100%
sure how to do this with the xenstore-* utilities; a C sketch of what our
application does is below)
   a. call xs_get_permissions()
   b. realloc() the permissions block to add the new domain
   c. call xs_set_permissions()
4. Delete the domU from step 2
5. Repeat steps 2-4

Eventually the xs_set_permissions() function will return an E2BIG error
because the list of domains has grown too large.  Sometime after that is
when the crash occurs with the C xenstored and the 4.7.0-rc4 version of
Xen.  It usually takes around 1200 or so iterations for the crash to occur.
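
Steps 3a-3c correspond roughly to the following C sketch against libxenstore
(xenstore.h); the helper name and error handling here are only for
illustration, not the exact code our application uses:

#include <stdbool.h>
#include <stdlib.h>
#include <xenstore.h>

/* Append read-only access for a newly created domU to an existing path.
 * xsh comes from xs_open(0); path is e.g. "/tool/test". */
static bool grant_read_perm(struct xs_handle *xsh, const char *path,
                            unsigned int domid)
{
    unsigned int num = 0;
    struct xs_permissions *perms =
        xs_get_permissions(xsh, XBT_NULL, path, &num);
    if (!perms)
        return false;

    /* realloc() the permissions block to make room for the new domain. */
    struct xs_permissions *grown = realloc(perms, (num + 1) * sizeof(*perms));
    if (!grown) {
        free(perms);
        return false;
    }
    grown[num].id = domid;
    grown[num].perms = XS_PERM_READ;

    /* Eventually fails with E2BIG once the list has grown too large. */
    bool ok = xs_set_permissions(xsh, XBT_NULL, path, grown, num + 1);
    free(grown);
    return ok;
}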


After writing up those steps I suddenly realized that I think I have a 
bug in my test that might have been causing the crash in the first place. 
Once errors are returned from xs_set_permissions() I was not properly 
cleaning up the created domains.  So I think this was just a simple case 
of VMID exhaustion from creating more than 255 domUs at the same time.


In which case this is completely unrelated to xenstored holding on to 
grant allocations, and the C xenstored most likely behaves correctly.


- Aaron Cornelius




Re: [Xen-devel] Xen 4.7 crash

2016-06-14 Thread Aaron Cornelius

On 6/14/2016 9:15 AM, Wei Liu wrote:

On Tue, Jun 14, 2016 at 09:11:47AM -0400, Aaron Cornelius wrote:

On 6/9/2016 7:14 AM, Ian Jackson wrote:

Aaron Cornelius writes ("Re: [Xen-devel] Xen 4.7 crash"):

I am not that familiar with the xenstored code, but as far as I can tell
the grant mapping will be held by the xenstore until the xs_release()
function is called (which is not called by libxl, and I do not
explicitly call it in my software, although I might now just to be
safe), or until the last reference to a domain is released and the
registered destructor (destroy_domain), set by talloc_set_destructor(),
is called.


I'm not sure I follow.  Or maybe I disagree.  ISTM that:

The grant mapping is released by destroy_domain, which is called via
the talloc destructor as a result of talloc_free(domain->conn) in
domain_cleanup.  I don't see other references to domain->conn.

domain_cleanup calls talloc_free on domain->conn when it sees the
domain marked as dying in domain_cleanup.

So I still think that your acl reference ought not to keep the grant
mapping alive.


It took a while to complete the testing, but we've finished trying to
reproduce the error using oxenstored instead of the C xenstored.  When the
condition occurs that caused the error with the C xenstored (on
4.7.0-rc4/8478c9409a2c6726208e8dbc9f3e455b76725a33), oxenstored does not
cause the crash.

So for whatever reason, it would appear that the C xenstored does keep the
grant allocations open, but oxenstored does not.



Can you provide some easy to follow steps to reproduce this issue?

AFAICT your environment is very specialised, but we should be able to
trigger the issue with plain xenstore-* utilities?


I am not sure if the plain xenstore-* utilities will work, but here are 
the steps to follow:


1. Create a non-standard xenstore path: /tool/test
2. Create a domU (mini-os/mirage/something small)
3. Add the new domU to the /tool/test permissions list (I'm not 100% 
sure how to do this with the xenstore-* utilities)

   a. call xs_get_permissions()
   b. realloc() the permissions block to add the new domain
   c. call xs_set_permissions()
4. Delete the domU from step 2
5. Repeat steps 2-4

Eventually the xs_set_permissions() function will return an E2BIG error 
because the list of domains has grown too large.  Sometime after that is 
when the crash occurs with the C xenstored and the 4.7.0-rc4 version of 
Xen.  It usually takes around 1200 or so iterations for the crash to occur.


- Aaron Cornelius



Re: [Xen-devel] Xen 4.7 crash

2016-06-14 Thread Aaron Cornelius

On 6/9/2016 7:14 AM, Ian Jackson wrote:

Aaron Cornelius writes ("Re: [Xen-devel] Xen 4.7 crash"):

I am not that familiar with the xenstored code, but as far as I can tell
the grant mapping will be held by the xenstore until the xs_release()
function is called (which is not called by libxl, and I do not
explicitly call it in my software, although I might now just to be
safe), or until the last reference to a domain is released and the
registered destructor (destroy_domain), set by talloc_set_destructor(),
is called.


I'm not sure I follow.  Or maybe I disagree.  ISTM that:

The grant mapping is released by destroy_domain, which is called via
the talloc destructor as a result of talloc_free(domain->conn) in
domain_cleanup.  I don't see other references to domain->conn.

domain_cleanup calls talloc_free on domain->conn when it sees the
domain marked as dying in domain_cleanup.

So I still think that your acl reference ought not to keep the grant
mapping alive.


It took a while to complete the testing, but we've finished trying to 
reproduce the error using oxenstored instead of the C xenstored.  When 
the condition occurs that caused the error with the C xenstored (on 
4.7.0-rc4/8478c9409a2c6726208e8dbc9f3e455b76725a33), oxenstored does not 
cause the crash.


So for whatever reason, it would appear that the C xenstored does keep 
the grant allocations open, but oxenstored does not.


- Aaron Cornelius




Re: [Xen-devel] Xen 4.7 crash

2016-06-07 Thread Aaron Cornelius

On 6/7/2016 9:40 AM, Aaron Cornelius wrote:

On 6/7/2016 5:53 AM, Ian Jackson wrote:

Aaron Cornelius writes ("Re: [Xen-devel] Xen 4.7 crash"):

We realized that we had forgotten to remove the domain from the
permissions list when the domain is deleted (which would cause the error
we saw).  The application was updated to remove the domain from the
permissions list:
1. retrieve the permissions with xs_get_permissions()
2. find the domain ID that is being deleted
3. memmove() the remaining domains down by 1 to "delete" the old domain
from the permissions list
4. update the permissions with xs_set_permissions()

After we made that change, a load test over the weekend confirmed that
the Xen crash no longer happens.  We checked this morning first thing
and confirmed that without this change the crash reliably occurs.


This is rather odd behaviour.  I don't think xenstored should hang
onto the domain's xs ring page just because the domain is still
mentioned in a permission list.

But it may do.  I haven't checked the code.  Are you using the
ocaml xenstored (oxenstored) or the C one ?


I don't remember specifying anything special when building the Xen
tools, but I did run into trouble where the ocaml tools appeared to
conflict with the opam-installed Mirage packages and libraries. Running
the "sudo make dist-install" command installs the ocaml libraries as root,
which made using opam difficult.  So I did disable the ocaml tools
during my build.

I double checked and confirmed that the C version of xenstored was
built.  We will try to test the failure scenario with oxenstored to see
if it behaves any differently.


I am not that familiar with the xenstored code, but as far as I can tell 
the grant mapping will be held by the xenstore until the xs_release() 
function is called (which is not called by libxl, and I do not 
explicitly call it in my software, although I might now just to be 
safe), or until the last reference to a domain is released and the 
registered destructor (destroy_domain), set by talloc_set_destructor(), 
is called.


I tried to follow the oxenstored code, but I certainly don't consider 
myself an expert at OCaml.  The oxenstored code does not appear to 
allocate grant mappings at all, which makes me think I am probably 
misunderstanding the code :)


- Aaron



Re: [Xen-devel] Xen 4.7 crash

2016-06-07 Thread Aaron Cornelius

On 6/7/2016 5:53 AM, Ian Jackson wrote:

Aaron Cornelius writes ("Re: [Xen-devel] Xen 4.7 crash"):

We realized that we had forgotten to remove the domain from the
permissions list when the domain is deleted (which would cause the error
we saw).  The application was updated to remove the domain from the
permissions list:
1. retrieve the permissions with xs_get_permissions()
2. find the domain ID that is being deleted
3. memmove() the remaining domains down by 1 to "delete" the old domain
from the permissions list
4. update the permissions with xs_set_permissions()

After we made that change, a load test over the weekend confirmed that
the Xen crash no longer happens.  We checked this morning first thing
and confirmed that without this change the crash reliably occurs.


This is rather odd behaviour.  I don't think xenstored should hang
onto the domain's xs ring page just because the domain is still
mentioned in a permission list.

But it may do.  I haven't checked the code.  Are you using the
ocaml xenstored (oxenstored) or the C one ?


I don't remember specifying anything special when building the Xen 
tools, but I did run into trouble where the ocaml tools appeared to 
conflict with the opam-installed Mirage packages and libraries. Running 
the "sudo make dist-install" command installs the ocaml libraries as root, 
which made using opam difficult.  So I did disable the ocaml tools 
during my build.


I double checked and confirmed that the C version of xenstored was 
built.  We will try to test the failure scenario with oxenstored to see 
if it behaves any differently.


- Aaron



Re: [Xen-devel] Xen 4.7 crash

2016-06-06 Thread Aaron Cornelius

On 6/6/2016 10:19 AM, Wei Liu wrote:

On Mon, Jun 06, 2016 at 03:05:47PM +0100, Julien Grall wrote:

(CC Ian, Stefano and Wei)

Hello Aaron,

On 06/06/16 14:58, Aaron Cornelius wrote:

On 6/2/2016 5:07 AM, Julien Grall wrote:

Hello Aaron,

On 02/06/2016 02:32, Aaron Cornelius wrote:

This is with a custom application, we use the libxl APIs to interact
with Xen.  Domains are created using the libxl_domain_create_new()
function, and domains are destroyed using the libxl_domain_destroy()
function.

The test in this case creates a domain, waits a minute, then
deletes/creates the next domain, waits a minute, and so on.  So I
wouldn't be surprised to see the VMID occasionally indicate there are 2
active domains since there could be one being created and one being
destroyed in a very short time.  However, I wouldn't expect to ever have
256 domains.


Your log has:

(XEN) grant_table.c:3288:d0v1 Grant release (0) ref:(9) flags:(2) dom:(0)
(XEN) grant_table.c:3288:d0v1 Grant release (1) ref:(11) flags:(2)
dom:(0)

Which suggests that some grants are still mapped in DOM0.



The CubieTruck only has 2GB of RAM, I allocate 512MB for dom0 which
means that only 48 of the Mirage domains (with 32MB of RAM) would
work at the same time anyway.  Which doesn't account for the various
inter-domain resources or the RAM used by Xen itself.


All the pages that belong to the domain could have been freed except
those referenced by DOM0. So the footprint of this domain will be limited
at that point.

I would recommend that you check how many domains are running at this time
and whether DOM0 has effectively released all the resources.


If the p2m_teardown() function checked for NULL it would prevent the
crash, but I suspect Xen would be just as broken since all of my
resources have leaked away.  More broken in fact, since if the board
reboots at least the applications will restart and domains can be
recreated.

It certainly appears that some resources are leaking when domains are
deleted (possibly only on the ARM or ARM32 platforms).  We will try to
add some debug prints and see if we can discover exactly what is
going on.


The leakage could also happen from DOM0. FWIW, I have been able to cycle
2000 guests overnight on an ARM platform.



We've done some more testing regarding this issue.  And further testing
shows that it doesn't matter if we delete the vchans before the domains
are deleted.  Those appear to be cleaned up correctly when the domain is
destroyed.

What does stop this issue from happening (using the same version of Xen
that the issue was detected on) is removing any non-standard xenstore
references before deleting the domain.  In this case our application
allocates permissions for created domains to non-standard xenstore
paths.  Making sure to remove those domain permissions before deleting
the domain prevents this issue from happening.


I am not sure I understand what you mean here. Could you give a quick
example?


So we have a custom xenstore path for our tool (/tool/custom/ for the 
sake of this example), and we then allow every domain created using this 
tool to read that path.  When the domain is created, the domain is 
explicitly given read permissions using xs_set_permissions().  More 
precisely we:

1. retrieve the current list of permissions with xs_get_permissions()
2. realloc the permissions list to increase it by 1
3. update the list of permissions to give the new domain read only access
4. then set the new permissions list with xs_set_permissions()

We saw errors logged because this list of permissions was getting 
prohibitively large, but this error did not appear to be directly 
connected to the Xen crash I submitted last week.  Or so we thought at 
the time.


We realized that we had forgotten to remove the domain from the 
permissions list when the domain is deleted (which would cause the error 
we saw).  The application was updated to remove the domain from the 
permissions list:

1. retrieve the permissions with xs_get_permissions()
2. find the domain ID that is being deleted
3. memmove() the remaining domains down by 1 to "delete" the old domain
   from the permissions list
4. update the permissions with xs_set_permissions()
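
In code, the removal is roughly the following (again only a sketch; the
helper name is illustrative and it assumes the same libxenstore handle as
in the earlier example, with simplified error handling):

#include <stdbool.h>
#include <stdlib.h>
#include <string.h>
#include <xenstore.h>

/* Drop a destroyed domain's entry from a path's permission list. */
static bool revoke_perm(struct xs_handle *xsh, const char *path,
                        unsigned int domid)
{
    unsigned int num = 0;
    struct xs_permissions *perms =
        xs_get_permissions(xsh, XBT_NULL, path, &num);
    if (!perms)
        return false;

    bool ok = true;
    for (unsigned int i = 0; i < num; i++) {
        if (perms[i].id == domid) {
            /* "Delete" the entry by shifting the tail down by one slot. */
            memmove(&perms[i], &perms[i + 1],
                    (num - i - 1) * sizeof(*perms));
            ok = xs_set_permissions(xsh, XBT_NULL, path, perms, num - 1);
            break;
        }
    }
    free(perms);
    return ok;
}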

After we made that change, a load test over the weekend confirmed that 
the Xen crash no longer happens.  We checked this morning first thing 
and confirmed that without this change the crash reliably occurs.



It does not appear to matter if we delete the standard domain xenstore
path (/local/domain/<domid>) since libxl handles removing this path when
the domain is destroyed.

Based on this I would guess that the xenstore is hanging onto the VMID.




This is a somewhat strange conclusion. I guess the root cause is still
unclear at this point.


We originally tested a fix that explicitly cleaned up the vchans 
(created to communicate with the domains) before the 
xen_domain_destroy() function is called and there was no change.  We 
have confirmed that the vchans do get cleaned up correctly when the domain 
is destroyed.

Re: [Xen-devel] Xen 4.7 crash

2016-06-06 Thread Aaron Cornelius

On 6/2/2016 5:07 AM, Julien Grall wrote:

Hello Aaron,

On 02/06/2016 02:32, Aaron Cornelius wrote:

This is with a custom application, we use the libxl APIs to interact
with Xen.  Domains are created using the libxl_domain_create_new()
function, and domains are destroyed using the libxl_domain_destroy()
function.

The test in this case creates a domain, waits a minute, then
deletes/creates the next domain, waits a minute, and so on.  So I
wouldn't be surprised to see the VMID occasionally indicate there are 2
active domains since there could be one being created and one being
destroyed in a very short time.  However, I wouldn't expect to ever have
256 domains.


Your log has:

(XEN) grant_table.c:3288:d0v1 Grant release (0) ref:(9) flags:(2) dom:(0)
(XEN) grant_table.c:3288:d0v1 Grant release (1) ref:(11) flags:(2) dom:(0)

Which suggests that some grants are still mapped in DOM0.



The CubieTruck only has 2GB of RAM, I allocate 512MB for dom0 which
means that only 48 of the Mirage domains (with 32MB of RAM) would
work at the same time anyway.  Which doesn't account for the various
inter-domain resources or the RAM used by Xen itself.


All the pages that belong to the domain could have been freed except
those referenced by DOM0. So the footprint of this domain will be limited
at that point.

I would recommend that you check how many domains are running at this time
and whether DOM0 has effectively released all the resources.


If the p2m_teardown() function checked for NULL it would prevent the
crash, but I suspect Xen would be just as broken since all of my
resources have leaked away.  More broken in fact, since if the board
reboots at least the applications will restart and domains can be
recreated.

It certainly appears that some resources are leaking when domains are
deleted (possibly only on the ARM or ARM32 platforms).  We will try to
add some debug prints and see if we can discover exactly what is going on.


The leakage could also happen from DOM0. FWIW, I have been able to cycle
2000 guests overnight on an ARM platform.



We've done some more testing regarding this issue.  And further testing 
shows that it doesn't matter if we delete the vchans before the domains 
are deleted.  Those appear to be cleaned up correctly when the domain is 
destroyed.


What does stop this issue from happening (using the same version of Xen 
that the issue was detected on) is removing any non-standard xenstore 
references before deleting the domain.  In this case our application 
allocates permissions for created domains to non-standard xenstore 
paths.  Making sure to remove those domain permissions before deleting 
the domain prevents this issue from happening.


It does not appear to matter if we delete the standard domain xenstore 
path (/local/domain/<domid>) since libxl handles removing this path when 
the domain is destroyed.


Based on this I would guess that the xenstore is hanging onto the VMID.

- Aaron Cornelius



Re: [Xen-devel] Xen 4.7 crash

2016-06-01 Thread Aaron Cornelius

On 6/1/2016 6:35 PM, Julien Grall wrote:

Hello Aaron,

On 01/06/2016 20:54, Aaron Cornelius wrote:



I'm not 100% sure, from the "VMID pool exhausted" message it would
appear that the p2m_init() function failed to allocate a VM ID, which
caused domain creation to fail, and the NULL pointer dereference when
trying to clean up the not-fully-created domain.

However, since I only have 1 domain active at a time, I'm not sure why
I should run out of VM IDs.


arch_domain_destroy (and p2m_teardown) is only called when all the
references on the given domain are released.

It may take a while to release all the resources. So if you launch a new
domain at the same time as you destroy the previous guest, you will have
more than 1 domain active.

Can you detail how you create/destroy guest?



This is with a custom application, we use the libxl APIs to interact 
with Xen.  Domains are created using the libxl_domain_create_new() 
function, and domains are destroyed using the libxl_domain_destroy() 
function.


The test in this case creates a domain, waits a minute, then 
deletes/creates the next domain, waits a minute, and so on.  So I 
wouldn't be surprised to see the VMID occasionally indicate there are 2 
active domains since there could be one being created and one being 
destroyed in a very short time.  However, I wouldn't expect to ever have 
256 domains.
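
In case it helps, the test loop is essentially the following (a simplified
sketch; populating the Mirage guest's libxl_domain_config is omitted, and
the overall structure is only for illustration):

#include <stdio.h>
#include <unistd.h>
#include <libxl.h>
#include <xentoollog.h>

int main(void)
{
    xentoollog_logger *lg = (xentoollog_logger *)
        xtl_createlogger_stdiostream(stderr, XTL_PROGRESS, 0);
    libxl_ctx *ctx = NULL;

    if (libxl_ctx_alloc(&ctx, LIBXL_VERSION, 0, lg))
        return 1;

    for (int i = 0; i < 2000; i++) {
        libxl_domain_config d_config;
        uint32_t domid = 0;              /* filled in on successful create */

        libxl_domain_config_init(&d_config);
        /* ... fill in d_config for the 32MB mini-os/Mirage guest ... */

        if (libxl_domain_create_new(ctx, &d_config, &domid, NULL, NULL) == 0) {
            sleep(60);                   /* wait a minute */
            libxl_domain_destroy(ctx, domid, NULL);
        }
        libxl_domain_config_dispose(&d_config);
    }

    libxl_ctx_free(ctx);
    return 0;
}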


The CubieTruck only has 2GB of RAM, I allocate 512MB for dom0 which 
means that only 48 of the Mirage domains (with 32MB of RAM) would 
work at the same time anyway.  Which doesn't account for the various 
inter-domain resources or the RAM used by Xen itself.


If the p2m_teardown() function checked for NULL it would prevent the 
crash, but I suspect Xen would be just as broken since all of my 
resources have leaked away.  More broken in fact, since if the board 
reboots at least the applications will restart and domains can be recreated.
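
To illustrate the NULL check I have in mind (a sketch of the idea only, not
the actual Xen 4.7 source, which may be structured differently): if
p2m_init() bailed out before allocating the root pages, the teardown path
would need to tolerate a half-initialised p2m, e.g.:

    struct p2m_domain *p2m = &d->arch.p2m;

    /* Tolerate a p2m that was never fully initialised because p2m_init()
     * failed (e.g. "VMID pool exhausted"): only free the root if it was
     * actually allocated. */
    if ( p2m->root != NULL )
        free_domheap_pages(p2m->root, P2M_ROOT_ORDER);
    p2m->root = NULL;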


It certainly appears that some resources are leaking when domains are 
deleted (possibly only on the ARM or ARM32 platforms).  We will try to 
add some debug prints and see if we can discover exactly what is going on.


- Aaron Cornelius




Re: [Xen-devel] Xen 4.7 crash

2016-06-01 Thread Aaron Cornelius
> -Original Message-
> From: Andrew Cooper [mailto:am...@hermes.cam.ac.uk] On Behalf Of
> Andrew Cooper
> Sent: Wednesday, June 1, 2016 4:01 PM
> To: Aaron Cornelius <aaron.cornel...@dornerworks.com>; Xen-devel <xen-de...@lists.xenproject.org>
> Subject: Re: [Xen-devel] Xen 4.7 crash
> 
> On 01/06/2016 20:54, Aaron Cornelius wrote:
> > I am doing some work with Xen 4.7 on the cubietruck (ARM32).  I've
> noticed some strange behavior after I create/destroy enough domains and
> put together a script to do the add/remove for me.  For this particular test I
> am creating a small mini-os (Mirage) domain with 32MB of RAM, deleting it,
> creating the new one, and so on.
> >
> > After running this for a while, I get the following error (with version
> 8478c9409a2c6726208e8dbc9f3e455b76725a33):
> >
> > (d846) Virtual -> physical offset = 3fc0
> > (d846) Checking DTB at 023ff000...
> > (d846) MirageOS booting...
> > (d846) Initialising console ... done.
> > (d846) gnttab_stubs.c: initialised mini-os gntmap
> > (d846) allocate_ondemand(1, 1) returning 230
> > (d846) allocate_ondemand(1, 1) returning 2301000
> > (XEN) grant_table.c:3288:d0v1 Grant release (0) ref:(9) flags:(2)
> > dom:(0)
> > (XEN) grant_table.c:3288:d0v1 Grant release (1) ref:(11) flags:(2)
> > dom:(0)
> > (XEN) p2m.c: dom1101: VMID pool exhausted
> > (XEN) CPU0: Unexpected Trap: Data Abort 
> >
> > I'm not 100% sure, from the "VMID pool exhausted" message it would
> appear that the p2m_init() function failed to allocate a VM ID, which caused
> domain creation to fail, and the NULL pointer dereference when trying to
> clean up the not-fully-created domain.
> >
> > However, since I only have 1 domain active at a time, I'm not sure why I
> should run out of VM IDs.
> 
> Sounds like a VMID resource leak.  Check to see whether it is freed properly
> in domain_destroy().
> 
> ~Andrew

That would be my assumption.  But as far as I can tell, arch_domain_destroy() 
calls p2m_teardown() which calls p2m_free_vmid(), and none of the functionality 
related to freeing a VM ID appears to have changed in years.

- Aaron


[Xen-devel] Xen 4.7 crash

2016-06-01 Thread Aaron Cornelius
I am doing some work with Xen 4.7 on the cubietruck (ARM32).  I've noticed some 
strange behavior after I create/destroy enough domains and put together a 
script to do the add/remove for me.  For this particular test I am creating a 
small mini-os (Mirage) domain with 32MB of RAM, deleting it, creating the new 
one, and so on.

After running this for a while, I get the following error (with version 
8478c9409a2c6726208e8dbc9f3e455b76725a33):

(d846) Virtual -> physical offset = 3fc0
(d846) Checking DTB at 023ff000...
(d846) MirageOS booting...
(d846) Initialising console ... done.
(d846) gnttab_stubs.c: initialised mini-os gntmap
(d846) allocate_ondemand(1, 1) returning 230
(d846) allocate_ondemand(1, 1) returning 2301000
(XEN) grant_table.c:3288:d0v1 Grant release (0) ref:(9) flags:(2) dom:(0)
(XEN) grant_table.c:3288:d0v1 Grant release (1) ref:(11) flags:(2) dom:(0)
(XEN) p2m.c: dom1101: VMID pool exhausted
(XEN) CPU0: Unexpected Trap: Data Abort
(XEN) [ Xen-4.7.0-rc  arm32  debug=y  Not tainted ]
(XEN) CPU:0
(XEN) PC: 0021fdd4 free_domheap_pages+0x1c/0x324
(XEN) CPSR:   6001011a MODE:Hypervisor
(XEN)  R0:  R1: 0001 R2: 0003 R3: 00304320
(XEN)  R4: 41c57000 R5: 41c57188 R6: 00200200 R7: 00100100
(XEN)  R8: 41c57180 R9: 43fdfe60 R10: R11:43fdfd5c R12:
(XEN) HYP: SP: 43fdfd2c LR: 0025b0cc
(XEN)
(XEN)   VTCR_EL2: 80003558
(XEN)  VTTBR_EL2: 0001bfb0e000
(XEN)
(XEN)  SCTLR_EL2: 30cd187f
(XEN)HCR_EL2: 0038663f
(XEN)  TTBR0_EL2: bfafc000
(XEN)
(XEN)ESR_EL2: 9406
(XEN)  HPFAR_EL2: 0001c810
(XEN)  HDFAR: 0014
(XEN)  HIFAR: 84e37182
(XEN)
(XEN) Xen stack trace from sp=43fdfd2c:
(XEN)002cf1b7 43fdfd64 41c57000 0100 41c57000 41c57188 00200200 00100100
(XEN)41c57180 43fdfe60  43fdfd7c 0025b0cc 41c57000 fff0 43fdfe60
(XEN)001f 044d 43fdfe60 43fdfd8c 0024f668 41c57000 fff0 43fdfda4
(XEN)0024f8f0 41c57000   001f 43fdfddc 0020854c 43fdfddc
(XEN) cccd 00304600 002822bc  b6f20004 044d 00304600
(XEN)00304320 d767a000  43fdfeec 00206d6c 43fdfe6c 00218f8c 
(XEN)0007 43fdfe30 43fdfe34  43fdfe20 0002 43fdfe48 43fdfe78
(XEN)   7622 2b0e 40023000  43fdfec8
(XEN)0002 43fdfebc 00218f8c 0001 000b  b6eba880 000b
(XEN)5abab87d f34aab2c 6adc50b8 e1713cd0    
(XEN)b6eba8d8  50043f00 b6eb5038 b6effba8 003e  000c3034
(XEN)000b9cb8 000bda30 000bda30  b6eba56c 003e b6effba8 b6effdb0
(XEN)be9558d4 00d0 be9558d4 0071 b6effba8 b6effd6c b6ed6fb4 4a000ea1
(XEN)c01077f8 43fdff58 002067b8 00305000 be9557c8 d767a000  43fdff54
(XEN)00260130  43fdfefc 43fdff1c 200f019a 400238f4 0004 0004
(XEN)002c9f00  00304600 c094c240  00305000 be9557a0 d767a000
(XEN) 43fdff44  c094c240  00305000 be9557c8 d767a000
(XEN) 43fdff58 00263b10 b6f20004    
(XEN)c094c240  00305000 be9557c8 d767a000  0001 0024
(XEN) b691ab34 c01077f8 60010013  be9557c4 c0a38600 c010c400
(XEN) Xen call trace:
(XEN)[<0021fdd4>] free_domheap_pages+0x1c/0x324 (PC)
(XEN)[<0025b0cc>] p2m_teardown+0xa0/0x108 (LR)
(XEN)[<0025b0cc>] p2m_teardown+0xa0/0x108
(XEN)[<0024f668>] arch_domain_destroy+0x20/0x50
(XEN)[<0024f8f0>] arch_domain_create+0x258/0x284
(XEN)[<0020854c>] domain_create+0x2dc/0x510
(XEN)[<00206d6c>] do_domctl+0x5b4/0x1928
(XEN)[<00260130>] do_trap_hypervisor+0x1170/0x15b0
(XEN)[<00263b10>] entry.o#return_from_trap+0/0x4
(XEN)
(XEN)
(XEN) 
(XEN) Panic on CPU 0:
(XEN) CPU0: Unexpected Trap: Data Abort
(XEN)
(XEN) 
(XEN)
(XEN) Reboot in five seconds...

I'm not 100% sure, but from the "VMID pool exhausted" message it would appear that 
the p2m_init() function failed to allocate a VM ID, which caused domain 
creation to fail and led to the NULL pointer dereference when trying to clean up 
the not-fully-created domain.

However, since I only have 1 domain active at a time, I'm not sure why I should 
run out of VM IDs.

- Aaron Cornelius



[Xen-devel] Xen crash in cpupool_assign_cpu_locked spinlock

2016-04-09 Thread Aaron Cornelius
I am not really sure where bugs on the staging branch should be reported.  I
just found this one and couldn't find anywhere that it had been reported yet.

I have a Xen/Ubuntu 14.04 VM (in VirtualBox) with 2 CPUs allocated to it. When
it boots I remove a CPU from the default pool (because it's allocated to a new
credit-scheduler pool in the new cpupool config file), and when I create the
cpupool, Xen crashes.

I just pulled the staging branch from git://xenbits.xen.org/xen.git a few hours
ago.  I ran this with a fresh pull of master yesterday and I did not experience
this issue.

Here are the commands:
$ sudo xl cpupool-cpu-remove Pool-0 1
$ sudo xl cpupool-create /etc/xen/newpool


Here is the new cpupool config file:
$ cat /etc/xen/newpool
name = "newpool"
sched = "credit"
cpus = ["1"]

And here is the xen console output:
(XEN) Xen BUG at spinlock.c:48
(XEN) [ Xen-4.7-unstable  x86_64  debug=y  Not tainted ]
(XEN) CPU:0
(XEN) RIP:e008:[] spinlock.c#check_lock+0x3c/0x40
(XEN) RFLAGS: 00010246   CONTEXT: hypervisor (d0v1)
(XEN) rax: 0001   rbx: 83007ffe4148   rcx: 
(XEN) rdx: 0001   rsi: 82cfb300   rdi: 83007ffe414e
(XEN) rbp: 83007fd07c90   rsp: 83007fd07c90   r8:  830072629ed0
(XEN) r9:  deadbeef   r10: 82d08025edc0   r11: 0286
(XEN) r12: 830072629c60   r13: 830072629840   r14: 82d0802e4a40
(XEN) r15: 82d08033bd40   cr0: 80050033   cr4: 000406a0
(XEN) cr3: 4c536000   cr2: 7fc1e682bab5
(XEN) ds:    es:    fs:    gs:    ss: e010   cs: e008
(XEN) Xen code around  (spinlock.c#check_lock+0x3c/0x40):
(XEN)  00 98 83 f2 01 39 d0 75 <02> 0b 5d c3 55 48 89 e5 f0 ff 05 49 4d 1b 00 5d
(XEN) Xen stack trace from rsp=83007fd07c90:
(XEN)83007fd07ca8 82d08012fe1b 0001 83007fd07d18
(XEN)82d08012ec04 0001 830072629ed0 830072629e70
(XEN)0001 83007ffe6000 83007ffe4148 830072629840
(XEN)0001 830072629840 82d0802f961c 82d080312980
(XEN) 83007fd07d38 82d080101b77 83007fd07e48
(XEN)0001 83007fd07d88 82d080102385 0006c0dd
(XEN)fffe 83007fd07d80 83007fd0 880025cb82f8
(XEN)82d0802f961c 82d080312980  83007fd07ef8
(XEN)82d0801312e8 0001  0001
(XEN) 83007fd07ef8 82d080184d21 83007faee0c0
(XEN)7fc1e6eec004 83007fd07e48 82d08010acf7 07000206
(XEN)83007ffea010 83007fd07ea8 83007fd07ea4 0006c0dd
(XEN)83007ffb6000 0006c0dd  820040005000
(XEN)820040005178 0003 000d0012 00010004
(XEN)0001 0001 00e2 7fc1e59b67e8
(XEN) 7ffc3a785990 00408000 7fc1e6ce3515
(XEN)00da09e0 0001 0001 0001
(XEN)00da09e0  ffc5ac7667a4 0033
(XEN)83007ffb6000 880025cb82f8 7ffc3a785760 00305000
(XEN)7ffc3a785760 7cff802f80c7 82d0802418a2 8100146a
(XEN) Xen call trace:
(XEN)[] spinlock.c#check_lock+0x3c/0x40
(XEN)[] _spin_lock+0x11/0x52
(XEN)[] schedule_cpu_switch+0x17f/0x24a
(XEN)[] cpupool.c#cpupool_assign_cpu_locked+0x2b/0x113
(XEN)[] cpupool_do_sysctl+0x1e2/0x6b4
(XEN)[] do_sysctl+0x625/0x1088
(XEN)[] lstar_enter+0xe2/0x13c
(XEN)
(XEN)
(XEN) 
(XEN) Panic on CPU 0:
(XEN) Xen BUG at spinlock.c:48
(XEN) ****

- Aaron Cornelius
