Re: [Xen-devel] Xen 4.7 crash

2016-06-14 Thread Wei Liu
On Tue, Jun 14, 2016 at 09:38:22AM -0400, Aaron Cornelius wrote:
> On 6/14/2016 9:26 AM, Aaron Cornelius wrote:
> >On 6/14/2016 9:15 AM, Wei Liu wrote:
> >>On Tue, Jun 14, 2016 at 09:11:47AM -0400, Aaron Cornelius wrote:
> >>>On 6/9/2016 7:14 AM, Ian Jackson wrote:
> >>>>Aaron Cornelius writes ("Re: [Xen-devel] Xen 4.7 crash"):
> >>>>>I am not that familiar with the xenstored code, but as far as I can tell
> >>>>>the grant mapping will be held by the xenstore until the xs_release()
> >>>>>function is called (which is not called by libxl, and I do not
> >>>>>explicitly call it in my software, although I might now just to be
> >>>>>safe), or until the last reference to a domain is released and the
> >>>>>registered destructor (destroy_domain), set by talloc_set_destructor(),
> >>>>>is called.
> >>>>
> >>>>I'm not sure I follow.  Or maybe I disagree.  ISTM that:
> >>>>
> >>>>The grant mapping is released by destroy_domain, which is called via
> >>>>the talloc destructor as a result of talloc_free(domain->conn) in
> >>>>domain_cleanup.  I don't see other references to domain->conn.
> >>>>
> >>>>domain_cleanup calls talloc_free on domain->conn when it sees the
> >>>>domain marked as dying in domain_cleanup.
> >>>>
> >>>>So I still think that your acl reference ought not to keep the grant
> >>>>mapping alive.
> >>>
> >>>It took a while to complete the testing, but we've finished trying to
> >>>reproduce the error using oxenstored instead of the C xenstored.  When the
> >>>condition occurs that caused the error with the C xenstored (on
> >>>4.7.0-rc4/8478c9409a2c6726208e8dbc9f3e455b76725a33), oxenstored does not
> >>>cause the crash.
> >>>
> >>>So for whatever reason, it would appear that the C xenstored does keep the
> >>>grant allocations open, but oxenstored does not.
> >>>
> >>
> >>Can you provide some easy to follow steps to reproduce this issue?
> >>
> >>AFAICT your environment is very specialised, but we should be able to
> >>trigger the issue with plain xenstore-* utilities?
> >
> >I am not sure if the plain xenstore-* utilities will work, but here are
> >the steps to follow:
> >
> >1. Create a non-standard xenstore path: /tool/test
> >2. Create a domU (mini-os/mirage/something small)
> >3. Add the new domU to the /tool/test permissions list (I'm not 100%
> >sure how to do this with the xenstore-* utilities)
> >a. call xs_get_permissions()
> >b. realloc() the permissions block to add the new domain
> >c. call xs_set_permissions()
> >4. Delete the domU from step 2
> >5. Repeat steps 2-4
> >
> >Eventually the xs_set_permissions() function will return an E2BIG error
> >because the list of domains has grown too large.  Sometime after that is
> >when the crash occurs with the C xenstored and the 4.7.0-rc4 version of
> >Xen.  It usually takes around 1200 or so iterations for the crash to occur.
> 
> After writing up those steps I suddenly realized that I think I have a bug
> in my test that might have been causing the problem in the first place. Once I
> got errors returned from xs_set_permissions(), I was not properly cleaning up
> the created domains.  So I think this was just a simple case of VMID
> exhaustion by creating more than 255 domUs at the same time.
> 
> In which case this is completely unrelated to xenstore holding on to grant
> allocations, and the C xenstore most likely behaves correctly.
> 

OK, so I will treat this issue as resolved for now. Let us know if you
discover something new.

Wei.

> - Aaron Cornelius
> 

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] Xen 4.7 crash

2016-06-14 Thread Aaron Cornelius

On 6/14/2016 9:26 AM, Aaron Cornelius wrote:

On 6/14/2016 9:15 AM, Wei Liu wrote:

On Tue, Jun 14, 2016 at 09:11:47AM -0400, Aaron Cornelius wrote:

On 6/9/2016 7:14 AM, Ian Jackson wrote:

Aaron Cornelius writes ("Re: [Xen-devel] Xen 4.7 crash"):

I am not that familiar with the xenstored code, but as far as I can tell
the grant mapping will be held by the xenstore until the xs_release()
function is called (which is not called by libxl, and I do not
explicitly call it in my software, although I might now just to be
safe), or until the last reference to a domain is released and the
registered destructor (destroy_domain), set by talloc_set_destructor(),
is called.


I'm not sure I follow.  Or maybe I disagree.  ISTM that:

The grant mapping is released by destroy_domain, which is called via
the talloc destructor as a result of talloc_free(domain->conn) in
domain_cleanup.  I don't see other references to domain->conn.

domain_cleanup calls talloc_free on domain->conn when it sees the
domain marked as dying in domain_cleanup.

So I still think that your acl reference ought not to keep the grant
mapping alive.


It took a while to complete the testing, but we've finished trying to
reproduce the error using oxenstored instead of the C xenstored.  When the
condition occurs that caused the error with the C xenstored (on
4.7.0-rc4/8478c9409a2c6726208e8dbc9f3e455b76725a33), oxenstored does not
cause the crash.

So for whatever reason, it would appear that the C xenstored does keep the
grant allocations open, but oxenstored does not.



Can you provide some easy to follow steps to reproduce this issue?

AFAICT your environment is very specialised, but we should be able to
trigger the issue with plain xenstore-* utilities?


I am not sure if the plain xenstore-* utilities will work, but here are
the steps to follow:

1. Create a non-standard xenstore path: /tool/test
2. Create a domU (mini-os/mirage/something small)
3. Add the new domU to the /tool/test permissions list (I'm not 100%
sure how to do this with the xenstore-* utilities)
a. call xs_get_permissions()
b. realloc() the permissions block to add the new domain
c. call xs_set_permissions()
4. Delete the domU from step 2
5. Repeat steps 2-4

Eventually the xs_set_permissions() function will return an E2BIG error
because the list of domains has grown too large.  Sometime after that is
when the crash occurs with the C xenstored and the 4.7.0-rc4 version of
Xen.  It usually takes around 1200 or so iterations for the crash to occur.


After writing up those steps I suddenly realized that I think I have a
bug in my test that might have been causing the problem in the first place.
Once I got errors returned from xs_set_permissions(), I was not properly
cleaning up the created domains.  So I think this was just a simple case
of VMID exhaustion by creating more than 255 domUs at the same time.


In which case this is completely unrelated to xenstore holding on to 
grant allocations, and the C xenstore most likely behaves correctly.


- Aaron Cornelius


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] Xen 4.7 crash

2016-06-14 Thread Aaron Cornelius

On 6/14/2016 9:15 AM, Wei Liu wrote:

On Tue, Jun 14, 2016 at 09:11:47AM -0400, Aaron Cornelius wrote:

On 6/9/2016 7:14 AM, Ian Jackson wrote:

Aaron Cornelius writes ("Re: [Xen-devel] Xen 4.7 crash"):

I am not that familiar with the xenstored code, but as far as I can tell
the grant mapping will be held by the xenstore until the xs_release()
function is called (which is not called by libxl, and I do not
explicitly call it in my software, although I might now just to be
safe), or until the last reference to a domain is released and the
registered destructor (destroy_domain), set by talloc_set_destructor(),
is called.


I'm not sure I follow.  Or maybe I disagree.  ISTM that:

The grant mapping is released by destroy_domain, which is called via
the talloc destructor as a result of talloc_free(domain->conn) in
domain_cleanup.  I don't see other references to domain->conn.

domain_cleanup calls talloc_free on domain->conn when it sees the
domain marked as dying in domain_cleanup.

So I still think that your acl reference ought not to keep the grant
mapping alive.


It took a while to complete the testing, but we've finished trying to
reproduce the error using oxenstored instead of the C xenstored.  When the
condition occurs that caused the error with the C xenstored (on
4.7.0-rc4/8478c9409a2c6726208e8dbc9f3e455b76725a33), oxenstored does not
cause the crash.

So for whatever reason, it would appear that the C xenstored does keep the
grant allocations open, but oxenstored does not.



Can you provide some easy to follow steps to reproduce this issue?

AFAICT your environment is very specialised, but we should be able to
trigger the issue with plain xenstore-* utilities?


I am not sure if the plain xenstore-* utilities will work, but here are 
the steps to follow:


1. Create a non-standard xenstore path: /tool/test
2. Create a domU (mini-os/mirage/something small)
3. Add the new domU to the /tool/test permissions list (I'm not 100% 
sure how to do this with the xenstore-* utilities)

   a. call xs_get_permissions()
   b. realloc() the permissions block to add the new domain
   c. call xs_set_permissions()
4. Delete the domU from step 2
5. Repeat steps 2-4

Eventually the xs_set_permissions() function will return an E2BIG error 
because the list of domains has grown too large.  Sometime after that is 
when the crash occurs with the C xenstored and the 4.7.0-rc4 version of 
Xen.  It usually takes around 1200 or so iterations for the crash to occur.
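
For reference, steps 3a-3c might look roughly like the sketch below.  This is
not the actual test code, just an illustration against the libxenstore C API
(xs_get_permissions()/xs_set_permissions()); the helper name and the minimal
error handling are purely illustrative.

#include <stdbool.h>
#include <stdlib.h>
#include <xenstore.h>   /* xs.h on older trees */

/* Grant read-only access on @path to @domid (illustrative sketch only). */
static bool grant_read(struct xs_handle *xsh, const char *path,
                       unsigned int domid)
{
    unsigned int num = 0;
    struct xs_permissions *perms, *bigger;
    bool ok;

    /* 3a. fetch the node's current permission list */
    perms = xs_get_permissions(xsh, XBT_NULL, path, &num);
    if (!perms)
        return false;

    /* 3b. grow the list by one entry for the new domain */
    bigger = realloc(perms, (num + 1) * sizeof(*perms));
    if (!bigger) {
        free(perms);
        return false;
    }
    perms = bigger;
    perms[num].id = domid;
    perms[num].perms = XS_PERM_READ;

    /* 3c. write the extended list back */
    ok = xs_set_permissions(xsh, XBT_NULL, path, perms, num + 1);
    free(perms);
    return ok;
}

Note that the list written back in step 3c only ever grows unless the dying
domain is also removed from it when the domU is destroyed, which is what
eventually triggers the E2BIG error mentioned above.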


- Aaron Cornelius

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] Xen 4.7 crash

2016-06-14 Thread Wei Liu
On Tue, Jun 14, 2016 at 09:11:47AM -0400, Aaron Cornelius wrote:
> On 6/9/2016 7:14 AM, Ian Jackson wrote:
> >Aaron Cornelius writes ("Re: [Xen-devel] Xen 4.7 crash"):
> >>I am not that familiar with the xenstored code, but as far as I can tell
> >>the grant mapping will be held by the xenstore until the xs_release()
> >>function is called (which is not called by libxl, and I do not
> >>explicitly call it in my software, although I might now just to be
> >>safe), or until the last reference to a domain is released and the
> >>registered destructor (destroy_domain), set by talloc_set_destructor(),
> >>is called.
> >
> >I'm not sure I follow.  Or maybe I disagree.  ISTM that:
> >
> >The grant mapping is released by destroy_domain, which is called via
> >the talloc destructor as a result of talloc_free(domain->conn) in
> >domain_cleanup.  I don't see other references to domain->conn.
> >
> >domain_cleanup calls talloc_free on domain->conn when it sees the
> >domain marked as dying in domain_cleanup.
> >
> >So I still think that your acl reference ought not to keep the grant
> >mapping alive.
> 
> It took a while to complete the testing, but we've finished trying to
> reproduce the error using oxenstored instead of the C xenstored.  When the
> condition occurs that caused the error with the C xenstored (on
> 4.7.0-rc4/8478c9409a2c6726208e8dbc9f3e455b76725a33), oxenstored does not
> cause the crash.
> 
> So for whatever reason, it would appear that the C xenstored does keep the
> grant allocations open, but oxenstored does not.
> 

Can you provide some easy to follow steps to reproduce this issue?

AFAICT your environment is very specialised, but we should be able to
trigger the issue with plain xenstore-* utilities?

Wei.

> - Aaron Cornelius
> 

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] Xen 4.7 crash

2016-06-14 Thread Aaron Cornelius

On 6/9/2016 7:14 AM, Ian Jackson wrote:

Aaron Cornelius writes ("Re: [Xen-devel] Xen 4.7 crash"):

I am not that familiar with the xenstored code, but as far as I can tell
the grant mapping will be held by the xenstore until the xs_release()
function is called (which is not called by libxl, and I do not
explicitly call it in my software, although I might now just to be
safe), or until the last reference to a domain is released and the
registered destructor (destroy_domain), set by talloc_set_destructor(),
is called.


I'm not sure I follow.  Or maybe I disagree.  ISTM that:

The grant mapping is released by destroy_domain, which is called via
the talloc destructor as a result of talloc_free(domain->conn) in
domain_cleanup.  I don't see other references to domain->conn.

domain_cleanup calls talloc_free on domain->conn when it sees the
domain marked as dying in domain_cleanup.

So I still think that your acl reference ought not to keep the grant
mapping alive.


It took a while to complete the testing, but we've finished trying to 
reproduce the error using oxenstored instead of the C xenstored.  When 
the condition occurs that caused the error with the C xenstored (on 
4.7.0-rc4/8478c9409a2c6726208e8dbc9f3e455b76725a33), oxenstored does not 
cause the crash.


So for whatever reason, it would appear that the C xenstored does keep 
the grant allocations open, but oxenstored does not.


- Aaron Cornelius


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] Xen 4.7 crash

2016-06-09 Thread Ian Jackson
Aaron Cornelius writes ("Re: [Xen-devel] Xen 4.7 crash"):
> I am not that familiar with the xenstored code, but as far as I can tell 
> the grant mapping will be held by the xenstore until the xs_release() 
> function is called (which is not called by libxl, and I do not 
> explicitly call it in my software, although I might now just to be 
> safe), or until the last reference to a domain is released and the 
> registered destructor (destroy_domain), set by talloc_set_destructor(), 
> is called.

I'm not sure I follow.  Or maybe I disagree.  ISTM that:

The grant mapping is released by destroy_domain, which is called via
the talloc destructor as a result of talloc_free(domain->conn) in
domain_cleanup.  I don't see other references to domain->conn.

domain_cleanup calls talloc_free on domain->conn when it sees the
domain marked as dying in domain_cleanup.

So I still think that your acl reference ought not to keep the grant
mapping alive.

Ian.

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] Xen 4.7 crash

2016-06-07 Thread Aaron Cornelius

On 6/7/2016 9:40 AM, Aaron Cornelius wrote:

On 6/7/2016 5:53 AM, Ian Jackson wrote:

Aaron Cornelius writes ("Re: [Xen-devel] Xen 4.7 crash"):

We realized that we had forgotten to remove the domain from the
permissions list when the domain is deleted (which would cause the error
we saw).  The application was updated to remove the domain from the
permissions list:
1. retrieve the permissions with xs_get_permissions()
2. find the domain ID that is being deleted
3. memmove() the remaining domains down by 1 to "delete" the old domain
from the permissions list
4. update the permissions with xs_set_permissions()

After we made that change, a load test over the weekend confirmed that
the Xen crash no longer happens.  We checked this morning first thing
and confirmed that without this change the crash reliably occurs.


This is rather odd behaviour.  I don't think xenstored should hang
onto the domain's xs ring page just because the domain is still
mentioned in a permission list.

But it may do.  I haven't checked the code.  Are you using the
ocaml xenstored (oxenstored) or the C one ?


I don't recall specifying anything special when building the Xen
tools, but I did run into trouble where the ocaml tools appeared to
conflict with the opam-installed Mirage packages and libraries. Running
the "sudo make dist-install" command installs the ocaml libraries as root,
which made using opam difficult.  So I did disable the ocaml tools
during my build.

I double checked and confirmed that the C version of xenstored was
built.  We will try to test the failure scenario with oxenstored to see
if it behaves any differently.


I am not that familiar with the xenstored code, but as far as I can tell 
the grant mapping will be held by the xenstore until the xs_release() 
function is called (which is not called by libxl, and I do not 
explicitly call it in my software, although I might now just to be 
safe), or until the last reference to a domain is released and the 
registered destructor (destroy_domain), set by talloc_set_destructor(), 
is called.


I tried to follow the oxenstored code, but I certainly don't consider 
myself an expert at OCaml.  The oxenstored code does not appear to 
allocate grant mappings at all, which makes me think I am probably 
misunderstanding the code :)


- Aaron

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] Xen 4.7 crash

2016-06-07 Thread Aaron Cornelius

On 6/7/2016 5:53 AM, Ian Jackson wrote:

Aaron Cornelius writes ("Re: [Xen-devel] Xen 4.7 crash"):

We realized that we had forgotten to remove the domain from the
permissions list when the domain is deleted (which would cause the error
we saw).  The application was updated to remove the domain from the
permissions list:
1. retrieve the permissions with xs_get_permissions()
2. find the domain ID that is being deleted
3. memmove() the remaining domains down by 1 to "delete" the old domain
from the permissions list
4. update the permissions with xs_set_permissions()

After we made that change, a load test over the weekend confirmed that
the Xen crash no longer happens.  We checked this morning first thing
and confirmed that without this change the crash reliably occurs.


This is rather odd behaviour.  I don't think xenstored should hang
onto the domain's xs ring page just because the domain is still
mentioned in a permission list.

But it may do.  I haven't checked the code.  Are you using the
ocaml xenstored (oxenstored) or the C one ?


I don't recall specifying anything special when building the Xen
tools, but I did run into trouble where the ocaml tools appeared to
conflict with the opam-installed Mirage packages and libraries. Running
the "sudo make dist-install" command installs the ocaml libraries as root,
which made using opam difficult.  So I did disable the ocaml tools
during my build.


I double checked and confirmed that the C version of xenstored was 
built.  We will try to test the failure scenario with oxenstored to see 
if it behaves any differently.


- Aaron

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] Xen 4.7 crash

2016-06-07 Thread Ian Jackson
Aaron Cornelius writes ("Re: [Xen-devel] Xen 4.7 crash"):
> We realized that we had forgotten to remove the domain from the 
> permissions list when the domain is deleted (which would cause the error 
> we saw).  The application was updated to remove the domain from the 
> permissions list:
> 1. retrieve the permissions with xs_get_permissions()
> 2. find the domain ID that is being deleted
> 3. memmove() the remaining domains down by 1 to "delete" the old domain 
> from the permissions list
> 4. update the permissions with xs_set_permissions()
> 
> After we made that change, a load test over the weekend confirmed that 
> the Xen crash no longer happens.  We checked this morning first thing 
> and confirmed that without this change the crash reliably occurs.

This is rather odd behaviour.  I don't think xenstored should hang
onto the domain's xs ring page just because the domain is still
mentioned in a permission list.

But it may do.  I haven't checked the code.  Are you using the
ocaml xenstored (oxenstored) or the C one ?

Ian.

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] Xen 4.7 crash

2016-06-06 Thread Aaron Cornelius

On 6/6/2016 10:19 AM, Wei Liu wrote:

On Mon, Jun 06, 2016 at 03:05:47PM +0100, Julien Grall wrote:

(CC Ian, Stefano and Wei)

Hello Aaron,

On 06/06/16 14:58, Aaron Cornelius wrote:

On 6/2/2016 5:07 AM, Julien Grall wrote:

Hello Aaron,

On 02/06/2016 02:32, Aaron Cornelius wrote:

This is with a custom application, we use the libxl APIs to interact
with Xen.  Domains are created using the libxl_domain_create_new()
function, and domains are destroyed using the libxl_domain_destroy()
function.

The test in this case creates a domain, waits a minute, then
deletes/creates the next domain, waits a minute, and so on.  So I
wouldn't be surprised to see the VMID occasionally indicate there are 2
active domains since there could be one being created and one being
destroyed in a very short time.  However, I wouldn't expect to ever have
256 domains.


Your log has:

(XEN) grant_table.c:3288:d0v1 Grant release (0) ref:(9) flags:(2) dom:(0)
(XEN) grant_table.c:3288:d0v1 Grant release (1) ref:(11) flags:(2)
dom:(0)

Which suggests that some grants are still mapped in DOM0.



The CubieTruck only has 2GB of RAM, I allocate 512MB for dom0 which
means that only 48 of the Mirage domains (with 32MB of RAM) would
work at the same time anyway.  Which doesn't account for the various
inter-domain resources or the RAM used by Xen itself.


All the pages that belong to the domain could have been freed except the
one referenced by DOM0, so the footprint of this domain will be limited
at that point.

I would recommend you check how many domains are running at this time
and whether DOM0 has effectively released all the resources.


If the p2m_teardown() function checked for NULL it would prevent the
crash, but I suspect Xen would be just as broken since all of my
resources have leaked away.  More broken in fact, since if the board
reboots at least the applications will restart and domains can be
recreated.

It certainly appears that some resources are leaking when domains are
deleted (possibly only on the ARM or ARM32 platforms).  We will try to
add some debug prints and see if we can discover exactly what is
going on.


The leakage could also happen from DOM0. FWIW, I have been able to cycle
2000 guests overnight on an ARM platform.



We've done some more testing regarding this issue.  And further testing
shows that it doesn't matter if we delete the vchans before the domains
are deleted.  Those appear to be cleaned up correctly when the domain is
destroyed.

What does stop this issue from happening (using the same version of Xen
that the issue was detected on) is removing any non-standard xenstore
references before deleting the domain.  In this case our application
allocates permissions for created domains to non-standard xenstore
paths.  Making sure to remove those domain permissions before deleting
the domain prevents this issue from happening.


I am not sure I understand what you mean here. Could you give a quick
example?


So we have a custom xenstore path for our tool (/tool/custom/ for the 
sake of this example), and we then allow every domain created using this 
tool to read that path.  When the domain is created, the domain is 
explicitly given read permissions using xs_set_permissions().  More 
precisely we:

1. retrieve the current list of permissions with xs_get_permissions()
2. realloc the permissions list to increase it by 1
3. update the list of permissions to give the new domain read only access
4. then set the new permissions list with xs_set_permissions()

We saw errors logged because this list of permissions was getting 
prohibitively large, but this error did not appear to be directly 
connected to the Xen crash I submitted last week.  Or so we thought at 
the time.


We realized that we had forgotten to remove the domain from the 
permissions list when the domain is deleted (which would cause the error 
we saw).  The application was updated to remove the domain from the 
permissions list:

1. retrieve the permissions with xs_get_permissions()
2. find the domain ID that is being deleted
3. memmove() the remaining domains down by 1 to "delete" the old domain 
from the permissions list

4. update the permissions with xs_set_permissions()
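
A rough sketch of those four steps follows (again against the libxenstore C
API, with the helper name and error handling purely illustrative; entry 0 of
a node's permission array is its owner/default entry, so the dying domU is
never entry 0 here):

#include <stdbool.h>
#include <stdlib.h>
#include <string.h>
#include <xenstore.h>   /* xs.h on older trees */

/* Remove @domid from @path's permission list (illustrative sketch only). */
static bool revoke_access(struct xs_handle *xsh, const char *path,
                          unsigned int domid)
{
    unsigned int num = 0, i;
    struct xs_permissions *perms;
    bool ok = true;

    /* 1. retrieve the permissions */
    perms = xs_get_permissions(xsh, XBT_NULL, path, &num);
    if (!perms)
        return false;

    for (i = 1; i < num; i++) {
        /* 2. find the domain ID that is being deleted */
        if (perms[i].id == domid) {
            /* 3. memmove() the remaining entries down by one */
            memmove(&perms[i], &perms[i + 1],
                    (num - i - 1) * sizeof(*perms));
            /* 4. write the shortened list back */
            ok = xs_set_permissions(xsh, XBT_NULL, path, perms, num - 1);
            break;
        }
    }

    free(perms);
    return ok;
}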

After we made that change, a load test over the weekend confirmed that 
the Xen crash no longer happens.  We checked this morning first thing 
and confirmed that without this change the crash reliably occurs.



It does not appear to matter if we delete the standard domain xenstore
path (/local/domain/<domid>) since libxl handles removing this path when
the domain is destroyed.

Based on this I would guess that the xenstore is hanging onto the VMID.




This is a somewhat strange conclusion. I guess the root cause is still
unclear at this point.


We originally tested a fix that explicitly cleaned up the vchans 
(created to communicate with the domains) before the 
xen_domain_destroy() function is called and there was no change.  We 
have confirmed that the vchans do not appear 

Re: [Xen-devel] Xen 4.7 crash

2016-06-06 Thread Wei Liu
On Mon, Jun 06, 2016 at 03:05:47PM +0100, Julien Grall wrote:
> (CC Ian, Stefano and Wei)
> 
> Hello Aaron,
> 
> On 06/06/16 14:58, Aaron Cornelius wrote:
> >On 6/2/2016 5:07 AM, Julien Grall wrote:
> >>Hello Aaron,
> >>
> >>On 02/06/2016 02:32, Aaron Cornelius wrote:
> >>>This is with a custom application, we use the libxl APIs to interact
> >>>with Xen.  Domains are created using the libxl_domain_create_new()
> >>>function, and domains are destroyed using the libxl_domain_destroy()
> >>>function.
> >>>
> >>>The test in this case creates a domain, waits a minute, then
> >>>deletes/creates the next domain, waits a minute, and so on.  So I
> >>>wouldn't be surprised to see the VMID occasionally indicate there are 2
> >>>active domains since there could be one being created and one being
> >>>destroyed in a very short time.  However, I wouldn't expect to ever have
> >>>256 domains.
> >>
> >>Your log has:
> >>
> >>(XEN) grant_table.c:3288:d0v1 Grant release (0) ref:(9) flags:(2) dom:(0)
> >>(XEN) grant_table.c:3288:d0v1 Grant release (1) ref:(11) flags:(2)
> >>dom:(0)
> >>
> >>Which suggests that some grants are still mapped in DOM0.
> >>
> >>>
> >>>The CubieTruck only has 2GB of RAM, I allocate 512MB for dom0 which
> >>>means that only 48 of the Mirage domains (with 32MB of RAM) would
> >>>work at the same time anyway.  Which doesn't account for the various
> >>>inter-domain resources or the RAM used by Xen itself.
> >>
> >>All the pages that belong to the domain could have been freed except the
> >>one referenced by DOM0, so the footprint of this domain will be limited
> >>at that point.
> >>
> >>I would recommend you check how many domains are running at this time
> >>and whether DOM0 has effectively released all the resources.
> >>
> >>>If the p2m_teardown() function checked for NULL it would prevent the
> >>>crash, but I suspect Xen would be just as broken since all of my
> >>>resources have leaked away.  More broken in fact, since if the board
> >>>reboots at least the applications will restart and domains can be
> >>>recreated.
> >>>
> >>>It certainly appears that some resources are leaking when domains are
> >>>deleted (possibly only on the ARM or ARM32 platforms).  We will try to
> >>>add some debug prints and see if we can discover exactly what is
> >>>going on.
> >>
> >>The leakage could also happen from DOM0. FWIW, I have been able to cycle
> >>2000 guests overnight on an ARM platform.
> >>
> >
> >We've done some more testing regarding this issue.  And further testing
> >shows that it doesn't matter if we delete the vchans before the domains
> >are deleted.  Those appear to be cleaned up correctly when the domain is
> >destroyed.
> >
> >What does stop this issue from happening (using the same version of Xen
> >that the issue was detected on) is removing any non-standard xenstore
> >references before deleting the domain.  In this case our application
> >allocates permissions for created domains to non-standard xenstore
> >paths.  Making sure to remove those domain permissions before deleting
> >the domain prevents this issue from happening.
> 
> I am not sure I understand what you mean here. Could you give a quick
> example?
> 
> >
> >It does not appear to matter if we delete the standard domain xenstore
> >path (/local/domain/<domid>) since libxl handles removing this path when
> >the domain is destroyed.
> >
> >Based on this I would guess that the xenstore is hanging onto the VMID.
> 

This is a somewhat strange conclusion. I guess the root cause is still
unclear at this point.

Is it possible that something else relies on those xenstore nodes to
free up resources?

Wei.

> Regards,
> 
> -- 
> Julien Grall

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] Xen 4.7 crash

2016-06-06 Thread Julien Grall

(CC Ian, Stefano and Wei)

Hello Aaron,

On 06/06/16 14:58, Aaron Cornelius wrote:

On 6/2/2016 5:07 AM, Julien Grall wrote:

Hello Aaron,

On 02/06/2016 02:32, Aaron Cornelius wrote:

This is with a custom application, we use the libxl APIs to interact
with Xen.  Domains are created using the libxl_domain_create_new()
function, and domains are destroyed using the libxl_domain_destroy()
function.

The test in this case creates a domain, waits a minute, then
deletes/creates the next domain, waits a minute, and so on.  So I
wouldn't be surprised to see the VMID occasionally indicate there are 2
active domains since there could be one being created and one being
destroyed in a very short time.  However, I wouldn't expect to ever have
256 domains.


Your log has:

(XEN) grant_table.c:3288:d0v1 Grant release (0) ref:(9) flags:(2) dom:(0)
(XEN) grant_table.c:3288:d0v1 Grant release (1) ref:(11) flags:(2)
dom:(0)

Which suggests that some grants are still mapped in DOM0.



The CubieTruck only has 2GB of RAM, I allocate 512MB for dom0 which
means that only 48 of the Mirage domains (with 32MB of RAM) would
work at the same time anyway.  Which doesn't account for the various
inter-domain resources or the RAM used by Xen itself.


All the pages that belong to the domain could have been freed except the
one referenced by DOM0, so the footprint of this domain will be limited
at that point.

I would recommend you check how many domains are running at this time
and whether DOM0 has effectively released all the resources.


If the p2m_teardown() function checked for NULL it would prevent the
crash, but I suspect Xen would be just as broken since all of my
resources have leaked away.  More broken in fact, since if the board
reboots at least the applications will restart and domains can be
recreated.

It certainly appears that some resources are leaking when domains are
deleted (possibly only on the ARM or ARM32 platforms).  We will try to
add some debug prints and see if we can discover exactly what is
going on.


The leakage could also happen from DOM0. FWIW, I have been able to cycle
2000 guests overnight on an ARM platform.



We've done some more testing regarding this issue.  And further testing
shows that it doesn't matter if we delete the vchans before the domains
are deleted.  Those appear to be cleaned up correctly when the domain is
destroyed.

What does stop this issue from happening (using the same version of Xen
that the issue was detected on) is removing any non-standard xenstore
references before deleting the domain.  In this case our application
allocates permissions for created domains to non-standard xenstore
paths.  Making sure to remove those domain permissions before deleting
the domain prevents this issue from happening.


I am not sure I understand what you mean here. Could you give a quick
example?




It does not appear to matter if we delete the standard domain xenstore
path (/local/domain/<domid>) since libxl handles removing this path when
the domain is destroyed.

Based on this I would guess that the xenstore is hanging onto the VMID.


Regards,

--
Julien Grall

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] Xen 4.7 crash

2016-06-06 Thread Aaron Cornelius

On 6/2/2016 5:07 AM, Julien Grall wrote:

Hello Aaron,

On 02/06/2016 02:32, Aaron Cornelius wrote:

This is with a custom application, we use the libxl APIs to interact
with Xen.  Domains are created using the libxl_domain_create_new()
function, and domains are destroyed using the libxl_domain_destroy()
function.

The test in this case creates a domain, waits a minute, then
deletes/creates the next domain, waits a minute, and so on.  So I
wouldn't be surprised to see the VMID occasionally indicate there are 2
active domains since there could be one being created and one being
destroyed in a very short time.  However, I wouldn't expect to ever have
256 domains.


Your log has:

(XEN) grant_table.c:3288:d0v1 Grant release (0) ref:(9) flags:(2) dom:(0)
(XEN) grant_table.c:3288:d0v1 Grant release (1) ref:(11) flags:(2) dom:(0)

Which suggests that some grants are still mapped in DOM0.



The CubieTruck only has 2GB of RAM, I allocate 512MB for dom0 which
means that only 48 of the Mirage domains (with 32MB of RAM) would
work at the same time anyway.  Which doesn't account for the various
inter-domain resources or the RAM used by Xen itself.


All the pages that belong to the domain could have been freed except the
one referenced by DOM0, so the footprint of this domain will be limited
at that point.

I would recommend you check how many domains are running at this time
and whether DOM0 has effectively released all the resources.


If the p2m_teardown() function checked for NULL it would prevent the
crash, but I suspect Xen would be just as broken since all of my
resources have leaked away.  More broken in fact, since if the board
reboots at least the applications will restart and domains can be
recreated.

It certainly appears that some resources are leaking when domains are
deleted (possibly only on the ARM or ARM32 platforms).  We will try to
add some debug prints and see if we can discover exactly what is going on.


The leakage could also happen from DOM0. FWIW, I have been able to cycle
2000 guests overnight on an ARM platform.



We've done some more testing regarding this issue.  And further testing 
shows that it doesn't matter if we delete the vchans before the domains 
are deleted.  Those appear to be cleaned up correctly when the domain is 
destroyed.


What does stop this issue from happening (using the same version of Xen 
that the issue was detected on) is removing any non-standard xenstore 
references before deleting the domain.  In this case our application 
allocates permissions for created domains to non-standard xenstore 
paths.  Making sure to remove those domain permissions before deleting 
the domain prevents this issue from happening.


It does not appear to matter if we delete the standard domain xenstore 
path (/local/domain/<domid>) since libxl handles removing this path when
the domain is destroyed.


Based on this I would guess that the xenstore is hanging onto the VMID.

- Aaron Cornelius

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] Xen 4.7 crash

2016-06-02 Thread Jan Beulich
>>> On 02.06.16 at 10:53,  wrote:
> On 02/06/16 09:47, Jan Beulich wrote:
> On 02.06.16 at 00:31,  wrote:
>>> On 01/06/2016 23:24, Julien Grall wrote:
 free_xenheap_pages already tolerates NULL (even if an order != 0). Is
 there any reason to not do the same for free_domheap_pages?
>>> The xenheap allocation functions deal in terms of plain virtual
>>> addresses, while the domheap functions deal in terms of struct page_info *.
>>>
>>> Overall, this means that the domheap functions have a more restricted
>>> input/output set than their xenheap variants.
>>>
>>> As there is already precedent with xenheap, making domheap tolerate NULL
>>> is probably fine, and indeed the preferred course of action.
>> I disagree, for the very reason you mention above.
> 
Which?  Dealing with a struct page_info pointer?  It's still just a
> pointer, whose value is expected to be NULL if not allocated.

Yes, but it still makes the interface not malloc()-like, unlike - as
you say yourself - e.g. the xenheap one. Just look at Linux for
comparison: __free_pages() also doesn't accept NULL, while
free_pages() does. I think we should stick to that distinction.
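
(For reference, the Linux distinction being drawn is roughly the following,
quoted from memory and simplified: free_pages() takes a virtual address and
tolerates 0, while __free_pages() takes a struct page pointer and does not.)

void free_pages(unsigned long addr, unsigned int order)
{
    if (addr != 0) {
        VM_BUG_ON(!virt_addr_valid((void *)addr));
        __free_pages(virt_to_page((void *)addr), order);
    }
}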

Jan


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] Xen 4.7 crash

2016-06-02 Thread Julien Grall

Hello Aaron,

On 02/06/2016 02:32, Aaron Cornelius wrote:

This is with a custom application, we use the libxl APIs to interact
with Xen.  Domains are created using the libxl_domain_create_new()
function, and domains are destroyed using the libxl_domain_destroy()
function.

The test in this case creates a domain, waits a minute, then
deletes/creates the next domain, waits a minute, and so on.  So I
wouldn't be surprised to see the VMID occasionally indicate there are 2
active domains since there could be one being created and one being
destroyed in a very short time.  However, I wouldn't expect to ever have
256 domains.


Your log has:

(XEN) grant_table.c:3288:d0v1 Grant release (0) ref:(9) flags:(2) dom:(0)
(XEN) grant_table.c:3288:d0v1 Grant release (1) ref:(11) flags:(2) dom:(0)

Which suggests that some grants are still mapped in DOM0.



The CubieTruck only has 2GB of RAM, I allocate 512MB for dom0 which
means that only 48 of the Mirage domains (with 32MB of RAM) would
work at the same time anyway.  Which doesn't account for the various
inter-domain resources or the RAM used by Xen itself.


All the pages that belong to the domain could have been freed except the
one referenced by DOM0, so the footprint of this domain will be limited
at that point.


I would recommend you check how many domains are running at this time
and whether DOM0 has effectively released all the resources.



If the p2m_teardown() function checked for NULL it would prevent the
crash, but I suspect Xen would be just as broken since all of my
resources have leaked away.  More broken in fact, since if the board
reboots at least the applications will restart and domains can be
recreated.

It certainly appears that some resources are leaking when domains are
deleted (possibly only on the ARM or ARM32 platforms).  We will try to
add some debug prints and see if we can discover exactly what is going on.


The leakage could also happen from DOM0. FWIW, I have been able to cycle
2000 guests overnight on an ARM platform.


Regards,

--
Julien Grall

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] Xen 4.7 crash

2016-06-02 Thread Andrew Cooper
On 02/06/16 09:47, Jan Beulich wrote:
 On 02.06.16 at 00:31,  wrote:
>> On 01/06/2016 23:24, Julien Grall wrote:
>>> free_xenheap_pages already tolerates NULL (even if an order != 0). Is
>>> there any reason to not do the same for free_domheap_pages?
>> The xenheap allocation functions deal in terms of plain virtual
>> addresses, while the domheap functions deal in terms of struct page_info *.
>>
>> Overall, this means that the domheap functions have a more restricted
>> input/output set than their xenheap variants.
>>
>> As there is already precedent with xenheap, making domheap tolerate NULL
>> is probably fine, and indeed the preferred course of action.
> I disagree, for the very reason you mention above.

Which?  Dealing with a struct page_info pointer?  It's still just a
pointer, whose value is expected to be NULL if not allocated.

~Andrew

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] Xen 4.7 crash

2016-06-02 Thread Jan Beulich
>>> On 02.06.16 at 03:32,  wrote:
> The test in this case creates a domain, waits a minute, then 
> deletes/creates the next domain, waits a minute, and so on.  So I 
> wouldn't be surprised to see the VMID occasionally indicate there are 2 
> active domains since there could be one being created and one being 
> destroyed in a very short time.  However, I wouldn't expect to ever have 
> 256 domains.

But - did you check? Things may pile up over time...

Jan


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] Xen 4.7 crash

2016-06-02 Thread Jan Beulich
>>> On 02.06.16 at 00:31,  wrote:
> On 01/06/2016 23:24, Julien Grall wrote:
>> free_xenheap_pages already tolerates NULL (even if an order != 0). Is
>> there any reason to not do the same for free_domheap_pages?
> 
> The xenheap allocation functions deal in terms of plain virtual
> addresses, while the domheap functions deal in terms of struct page_info *.
> 
> Overall, this means that the domheap functions have a more restricted
> input/output set than their xenheap variants.
> 
> As there is already precedent with xenheap, making domheap tolerate NULL
> is probably fine, and indeed the preferred course of action.

I disagree, for the very reason you mention above.

Jan


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] Xen 4.7 crash

2016-06-01 Thread Aaron Cornelius

On 6/1/2016 6:35 PM, Julien Grall wrote:

Hello Aaron,

On 01/06/2016 20:54, Aaron Cornelius wrote:



I'm not 100% sure, but from the "VMID pool exhausted" message it would
appear that the p2m_init() function failed to allocate a VM ID, which
caused domain creation to fail and led to the NULL pointer dereference when
trying to clean up the not-fully-created domain.

However, since I only have 1 domain active at a time, I'm not sure why
I should run out of VM IDs.


arch_domain_destroy (and p2m_teardown) is only called when all the
references on the given domain are released.

It may take a while to release all the resources, so if you launch the
new domain at the same time as you destroy the previous guest, you will
have more than 1 domain active.

Can you detail how you create/destroy guest?



This is with a custom application, we use the libxl APIs to interact 
with Xen.  Domains are created using the libxl_domain_create_new() 
function, and domains are destroyed using the libxl_domain_destroy() 
function.


The test in this case creates a domain, waits a minute, then 
deletes/creates the next domain, waits a minute, and so on.  So I 
wouldn't be surprised to see the VMID occasionally indicate there are 2 
active domains since there could be one being created and one being 
destroyed in a very short time.  However, I wouldn't expect to ever have 
256 domains.
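
In outline, the create/wait/destroy cycle looks roughly like the sketch
below.  It assumes the Xen 4.7-era libxl C API (libxl_domain_create_new()
and libxl_domain_destroy()); the real domain configuration (kernel image,
32MB of RAM, vchan setup, xenstore permissions) is elided, and the loop
bound is arbitrary.

#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <xentoollog.h>
#include <libxl.h>

int main(void)
{
    xentoollog_logger_stdiostream *logger;
    libxl_ctx *ctx = NULL;
    int i;

    logger = xtl_createlogger_stdiostream(stderr, XTL_PROGRESS, 0);
    if (libxl_ctx_alloc(&ctx, LIBXL_VERSION, 0, (xentoollog_logger *)logger))
        return 1;

    for (i = 0; i < 2000; i++) {
        libxl_domain_config cfg;
        uint32_t domid = ~0u;            /* invalid until created */

        libxl_domain_config_init(&cfg);
        /* ... fill in name, kernel, memory, etc. (elided) ... */

        if (libxl_domain_create_new(ctx, &cfg, &domid, NULL, NULL) == 0) {
            sleep(60);                               /* wait a minute */
            libxl_domain_destroy(ctx, domid, NULL);  /* then delete it */
        }
        libxl_domain_config_dispose(&cfg);
    }

    libxl_ctx_free(ctx);
    xtl_logger_destroy((xentoollog_logger *)logger);
    return 0;
}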


The CubieTruck only has 2GB of RAM, I allocate 512MB for dom0 which 
means that only 48 of the Mirage domains (with 32MB of RAM) would
work at the same time anyway.  Which doesn't account for the various 
inter-domain resources or the RAM used by Xen itself.


If the p2m_teardown() function checked for NULL it would prevent the 
crash, but I suspect Xen would be just as broken since all of my 
resources have leaked away.  More broken in fact, since if the board 
reboots at least the applications will restart and domains can be recreated.


It certainly appears that some resources are leaking when domains are 
deleted (possibly only on the ARM or ARM32 platforms).  We will try to 
add some debug prints and see if we can discover exactly what is going on.


- Aaron Cornelius


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] Xen 4.7 crash

2016-06-01 Thread Julien Grall

Hello Aaron,

On 01/06/2016 20:54, Aaron Cornelius wrote:

I am doing some work with Xen 4.7 on the cubietruck (ARM32).  I've noticed some 
strange behavior after I create/destroy enough domains, so I put together a
script to do the add/remove for me.  For this particular test I am creating a 
small mini-os (Mirage) domain with 32MB of RAM, deleting it, creating the new 
one, and so on.

After running this for a while, I get the following error (with version 
8478c9409a2c6726208e8dbc9f3e455b76725a33):

(d846) Virtual -> physical offset = 3fc0
(d846) Checking DTB at 023ff000...
(d846) MirageOS booting...
(d846) Initialising console ... done.
(d846) gnttab_stubs.c: initialised mini-os gntmap
(d846) allocate_ondemand(1, 1) returning 230
(d846) allocate_ondemand(1, 1) returning 2301000
(XEN) grant_table.c:3288:d0v1 Grant release (0) ref:(9) flags:(2) dom:(0)
(XEN) grant_table.c:3288:d0v1 Grant release (1) ref:(11) flags:(2) dom:(0)
(XEN) p2m.c: dom1101: VMID pool exhausted
(XEN) CPU0: Unexpected Trap: Data Abort
(XEN) [ Xen-4.7.0-rc  arm32  debug=y  Not tainted ]
(XEN) CPU:0
(XEN) PC: 0021fdd4 free_domheap_pages+0x1c/0x324
(XEN) CPSR:   6001011a MODE:Hypervisor
(XEN)  R0:  R1: 0001 R2: 0003 R3: 00304320
(XEN)  R4: 41c57000 R5: 41c57188 R6: 00200200 R7: 00100100
(XEN)  R8: 41c57180 R9: 43fdfe60 R10: R11:43fdfd5c R12:
(XEN) HYP: SP: 43fdfd2c LR: 0025b0cc
(XEN)
(XEN)   VTCR_EL2: 80003558
(XEN)  VTTBR_EL2: 0001bfb0e000
(XEN)
(XEN)  SCTLR_EL2: 30cd187f
(XEN)HCR_EL2: 0038663f
(XEN)  TTBR0_EL2: bfafc000
(XEN)
(XEN)ESR_EL2: 9406
(XEN)  HPFAR_EL2: 0001c810
(XEN)  HDFAR: 0014
(XEN)  HIFAR: 84e37182
(XEN)
(XEN) Xen stack trace from sp=43fdfd2c:
(XEN)002cf1b7 43fdfd64 41c57000 0100 41c57000 41c57188 00200200 00100100
(XEN)41c57180 43fdfe60  43fdfd7c 0025b0cc 41c57000 fff0 43fdfe60
(XEN)001f 044d 43fdfe60 43fdfd8c 0024f668 41c57000 fff0 43fdfda4
(XEN)0024f8f0 41c57000   001f 43fdfddc 0020854c 43fdfddc
(XEN) cccd 00304600 002822bc  b6f20004 044d 00304600
(XEN)00304320 d767a000  43fdfeec 00206d6c 43fdfe6c 00218f8c 
(XEN)0007 43fdfe30 43fdfe34  43fdfe20 0002 43fdfe48 43fdfe78
(XEN)   7622 2b0e 40023000  43fdfec8
(XEN)0002 43fdfebc 00218f8c 0001 000b  b6eba880 000b
(XEN)5abab87d f34aab2c 6adc50b8 e1713cd0    
(XEN)b6eba8d8  50043f00 b6eb5038 b6effba8 003e  000c3034
(XEN)000b9cb8 000bda30 000bda30  b6eba56c 003e b6effba8 b6effdb0
(XEN)be9558d4 00d0 be9558d4 0071 b6effba8 b6effd6c b6ed6fb4 4a000ea1
(XEN)c01077f8 43fdff58 002067b8 00305000 be9557c8 d767a000  43fdff54
(XEN)00260130  43fdfefc 43fdff1c 200f019a 400238f4 0004 0004
(XEN)002c9f00  00304600 c094c240  00305000 be9557a0 d767a000
(XEN) 43fdff44  c094c240  00305000 be9557c8 d767a000
(XEN) 43fdff58 00263b10 b6f20004    
(XEN)c094c240  00305000 be9557c8 d767a000  0001 0024
(XEN) b691ab34 c01077f8 60010013  be9557c4 c0a38600 c010c400
(XEN) Xen call trace:
(XEN)[<0021fdd4>] free_domheap_pages+0x1c/0x324 (PC)
(XEN)[<0025b0cc>] p2m_teardown+0xa0/0x108 (LR)
(XEN)[<0025b0cc>] p2m_teardown+0xa0/0x108
(XEN)[<0024f668>] arch_domain_destroy+0x20/0x50
(XEN)[<0024f8f0>] arch_domain_create+0x258/0x284
(XEN)[<0020854c>] domain_create+0x2dc/0x510
(XEN)[<00206d6c>] do_domctl+0x5b4/0x1928
(XEN)[<00260130>] do_trap_hypervisor+0x1170/0x15b0
(XEN)[<00263b10>] entry.o#return_from_trap+0/0x4
(XEN)
(XEN)
(XEN) 
(XEN) Panic on CPU 0:
(XEN) CPU0: Unexpected Trap: Data Abort
(XEN)
(XEN) 
(XEN)
(XEN) Reboot in five seconds...

I'm not 100% sure, but from the "VMID pool exhausted" message it would appear that
the p2m_init() function failed to allocate a VM ID, which caused domain creation to fail
and led to the NULL pointer dereference when trying to clean up the not-fully-created domain.

However, since I only have 1 domain active at a time, I'm not sure why I should 
run out of VM IDs.


arch_domain_destroy (and p2m_teardown) is only called when all the
references on the given domain are released.


It may take a while to release all the resources, so if you launch the
new domain at the same time as you destroy the previous guest, you will
have more than 1 domain active.


Can you detail how you create/destroy guest?

Regards,

--
Julien Grall

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] Xen 4.7 crash

2016-06-01 Thread Andrew Cooper
On 01/06/2016 23:24, Julien Grall wrote:
> Hi,
>
> On 01/06/2016 22:35, Andrew Cooper wrote:
>> On 01/06/2016 20:54, Aaron Cornelius wrote:
>>> 
>>> (XEN) Xen call trace:
>>> (XEN)[<0021fdd4>] free_domheap_pages+0x1c/0x324 (PC)
>>> (XEN)[<0025b0cc>] p2m_teardown+0xa0/0x108 (LR)
>>> (XEN)[<0025b0cc>] p2m_teardown+0xa0/0x108
>>> (XEN)[<0024f668>] arch_domain_destroy+0x20/0x50
>>> (XEN)[<0024f8f0>] arch_domain_create+0x258/0x284
>>> (XEN)[<0020854c>] domain_create+0x2dc/0x510
>>> (XEN)[<00206d6c>] do_domctl+0x5b4/0x1928
>>> (XEN)[<00260130>] do_trap_hypervisor+0x1170/0x15b0
>>> (XEN)[<00263b10>] entry.o#return_from_trap+0/0x4
>>> (XEN)
>>> (XEN)
>>> (XEN) 
>>> (XEN) Panic on CPU 0:
>>> (XEN) CPU0: Unexpected Trap: Data Abort
>>> (XEN)
>>> (XEN) 
>>> (XEN)
>>> (XEN) Reboot in five seconds...
>>
>> As for this specific crash itself: in the case of an early error path,
>> p2m->root can be NULL in p2m_teardown(), in which case
>> free_domheap_pages() will fall over in a heap.  This patch should
>> resolve it.
>
> Good catch!
>
>>
>> @@ -1408,7 +1411,8 @@ void p2m_teardown(struct domain *d)
>>  while ( (pg = page_list_remove_head(&p2m->pages)) )
>>  free_domheap_page(pg);
>>
>> -free_domheap_pages(p2m->root, P2M_ROOT_ORDER);
>> +if ( p2m->root )
>> +free_domheap_pages(p2m->root, P2M_ROOT_ORDER);
>>
>>  p2m->root = NULL;
>>
>> I would be tempted to suggest making free_domheap_pages() tolerate NULL
>> pointers, except that would only be a safe thing to do if we assert that
>> the order parameter is 0, which won't help this specific case.
>
> free_xenheap_pages already tolerates NULL (even if an order != 0). Is
> there any reason to not do the same for free_domheap_pages?

The xenheap allocation functions deal in terms of plain virtual
addresses, while the domheap functions deal in terms of struct page_info *.

Overall, this means that the domheap functions have a more restricted
input/output set than their xenheap variants.

As there is already precedent with xenheap, making domheap tolerate NULL
is probably fine, and indeed the preferred course of action.

~Andrew

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] Xen 4.7 crash

2016-06-01 Thread Andrew Cooper
On 01/06/2016 23:18, Julien Grall wrote:
> Hi Andrew,
>
> On 01/06/2016 22:24, Andrew Cooper wrote:
>> On 01/06/2016 21:45, Aaron Cornelius wrote:

> However, since I only have 1 domain active at a time, I'm not sure
> why I
 should run out of VM IDs.

 Sounds like a VMID resource leak.  Check to see whether it is freed
 properly
 in domain_destroy().

 ~Andrew
>>> That would be my assumption.  But as far as I can tell,
>>> arch_domain_destroy() calls p2m_teardown() which calls
>>> p2m_free_vmid(), and none of the functionality related to freeing a
>>> VM ID appears to have changed in years.
>>
>> The VMID handling looks suspect.  It can be called repeatedly during
>> domain destruction, and it will repeatedly clear the same bit out of the
>> vmid_mask.
>
> Can you explain how the p2m_free_vmid can be called multiple times?
>
> We have the following path:
>arch_domain_destroy -> p2m_teardown -> p2m_free_vmid.
>
> And I can find only 3 callers of arch_domain_destroy, which should only be
> done once per domain.
>
> If arch_domain_destroy is called multiple times, p2m_free_vmid will not
> be the only place where Xen will be in trouble.

You are correct.  I was getting my phases of domain destruction mixed
up.  arch_domain_destroy() is called strictly once, after the RCU reference of
the domain has dropped to 0.

>
>> diff --git a/xen/arch/arm/p2m.c b/xen/arch/arm/p2m.c
>> index 838d004..7adb39a 100644
>> --- a/xen/arch/arm/p2m.c
>> +++ b/xen/arch/arm/p2m.c
>> @@ -1393,7 +1393,10 @@ static void p2m_free_vmid(struct domain *d)
>>  struct p2m_domain *p2m = &d->arch.p2m;
>>  spin_lock(&vmid_alloc_lock);
>>  if ( p2m->vmid != INVALID_VMID )
>> -clear_bit(p2m->vmid, vmid_mask);
>> +{
>> +ASSERT(test_and_clear_bit(p2m->vmid, vmid_mask));
>> +p2m->vmid = INVALID_VMID;
>> +}
>>
>>  spin_unlock(&vmid_alloc_lock);
>>  }
>>
>> Having said that, I can't explain why that bug would result in the
>> symptoms you are seeing.  It is also possible that your issue is memory
>> corruption from a separate source.
>>
>> Can you see about instrumenting p2m_alloc_vmid()/p2m_free_vmid() (with
>> vmid_alloc_lock held) to see which vmid is being allocated/freed ?
>> After the initial boot of the system, you should see the same vmid being
>> allocated and freed for each of your domains.
>
> Looking quickly at the log, the domain is dom1101. However, the
> maximum number of VMIDs supported is 256, so the exhaustion might be a
> race somewhere.
>
> I would be interested to get a reproducer. I wrote a script to cycle a
> domain (create/destroy) in a loop, and I have not seen any issue after
> 1200 cycles (and counting).

Given that my previous thought was wrong, I am going to suggest that
some other form of memory corruption is a more likely cause.

~Andrew

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] Xen 4.7 crash

2016-06-01 Thread Julien Grall

Hi,

On 01/06/2016 22:35, Andrew Cooper wrote:

On 01/06/2016 20:54, Aaron Cornelius wrote:


(XEN) Xen call trace:
(XEN)[<0021fdd4>] free_domheap_pages+0x1c/0x324 (PC)
(XEN)[<0025b0cc>] p2m_teardown+0xa0/0x108 (LR)
(XEN)[<0025b0cc>] p2m_teardown+0xa0/0x108
(XEN)[<0024f668>] arch_domain_destroy+0x20/0x50
(XEN)[<0024f8f0>] arch_domain_create+0x258/0x284
(XEN)[<0020854c>] domain_create+0x2dc/0x510
(XEN)[<00206d6c>] do_domctl+0x5b4/0x1928
(XEN)[<00260130>] do_trap_hypervisor+0x1170/0x15b0
(XEN)[<00263b10>] entry.o#return_from_trap+0/0x4
(XEN)
(XEN)
(XEN) 
(XEN) Panic on CPU 0:
(XEN) CPU0: Unexpected Trap: Data Abort
(XEN)
(XEN) 
(XEN)
(XEN) Reboot in five seconds...


As for this specific crash itself: in the case of an early error path,
p2m->root can be NULL in p2m_teardown(), in which case
free_domheap_pages() will fall over in a heap.  This patch should
resolve it.


Good catch!



@@ -1408,7 +1411,8 @@ void p2m_teardown(struct domain *d)
 while ( (pg = page_list_remove_head(&p2m->pages)) )
 free_domheap_page(pg);

-free_domheap_pages(p2m->root, P2M_ROOT_ORDER);
+if ( p2m->root )
+free_domheap_pages(p2m->root, P2M_ROOT_ORDER);

 p2m->root = NULL;

I would be tempted to suggest making free_domheap_pages() tolerate NULL
pointers, except that would only be a safe thing to do if we assert that
the order parameter is 0, which won't help this specific case.


free_xenheap_pages already tolerates NULL (even if an order != 0). Is 
there any reason to not do the same for free_domheap_pages?


Regards,

--
Julien Grall

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] Xen 4.7 crash

2016-06-01 Thread Julien Grall

Hi Andrew,

On 01/06/2016 22:24, Andrew Cooper wrote:

On 01/06/2016 21:45, Aaron Cornelius wrote:



However, since I only have 1 domain active at a time, I'm not sure why I

should run out of VM IDs.

Sounds like a VMID resource leak.  Check to see whether it is freed properly
in domain_destroy().

~Andrew

That would be my assumption.  But as far as I can tell, arch_domain_destroy() 
calls p2m_teardown() which calls p2m_free_vmid(), and none of the functionality
related to freeing a VM ID appears to have changed in years.


The VMID handling looks suspect.  It can be called repeatedly during
domain destruction, and it will repeatedly clear the same bit out of the
vmid_mask.


Can you explain how the p2m_free_vmid can be called multiple times?

We have the following path:
   arch_domain_destroy -> p2m_teardown -> p2m_free_vmid.

And I can find only 3 callers of arch_domain_destroy, which should only be done
once per domain.


If arch_domain_destroy is called multiple times, p2m_free_vmid will not
be the only place where Xen will be in trouble.



diff --git a/xen/arch/arm/p2m.c b/xen/arch/arm/p2m.c
index 838d004..7adb39a 100644
--- a/xen/arch/arm/p2m.c
+++ b/xen/arch/arm/p2m.c
@@ -1393,7 +1393,10 @@ static void p2m_free_vmid(struct domain *d)
 struct p2m_domain *p2m = &d->arch.p2m;
 spin_lock(&vmid_alloc_lock);
 if ( p2m->vmid != INVALID_VMID )
-clear_bit(p2m->vmid, vmid_mask);
+{
+ASSERT(test_and_clear_bit(p2m->vmid, vmid_mask));
+p2m->vmid = INVALID_VMID;
+}

 spin_unlock(&vmid_alloc_lock);
 }

Having said that, I can't explain why that bug would result in the
symptoms you are seeing.  It is also possible that your issue is memory
corruption from a separate source.

Can you see about instrumenting p2m_alloc_vmid()/p2m_free_vmid() (with
vmid_alloc_lock held) to see which vmid is being allocated/freed ?
After the initial boot of the system, you should see the same vmid being
allocated and freed for each of your domains.


Looking quickly at the log, the domain is dom1101. However, the
maximum number of VMIDs supported is 256, so the exhaustion might be a
race somewhere.


I would be interested to get a reproducer. I wrote a script to cycle a 
domain (create/destroy) in a loop, and I have not seen any issue after 1200
cycles (and counting).


Cheers,

--
Julien Grall

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] Xen 4.7 crash

2016-06-01 Thread Andrew Cooper
On 01/06/2016 20:54, Aaron Cornelius wrote:
> 
> (XEN) Xen call trace:
> (XEN)[<0021fdd4>] free_domheap_pages+0x1c/0x324 (PC)
> (XEN)[<0025b0cc>] p2m_teardown+0xa0/0x108 (LR)
> (XEN)[<0025b0cc>] p2m_teardown+0xa0/0x108
> (XEN)[<0024f668>] arch_domain_destroy+0x20/0x50
> (XEN)[<0024f8f0>] arch_domain_create+0x258/0x284
> (XEN)[<0020854c>] domain_create+0x2dc/0x510
> (XEN)[<00206d6c>] do_domctl+0x5b4/0x1928
> (XEN)[<00260130>] do_trap_hypervisor+0x1170/0x15b0
> (XEN)[<00263b10>] entry.o#return_from_trap+0/0x4
> (XEN)
> (XEN)
> (XEN) 
> (XEN) Panic on CPU 0:
> (XEN) CPU0: Unexpected Trap: Data Abort
> (XEN)
> (XEN) 
> (XEN)
> (XEN) Reboot in five seconds...

As for this specific crash itself: in the case of an early error path,
p2m->root can be NULL in p2m_teardown(), in which case
free_domheap_pages() will fall over in a heap.  This patch should
resolve it.

@@ -1408,7 +1411,8 @@ void p2m_teardown(struct domain *d)
     while ( (pg = page_list_remove_head(&p2m->pages)) )
         free_domheap_page(pg);

-    free_domheap_pages(p2m->root, P2M_ROOT_ORDER);
+    if ( p2m->root )
+        free_domheap_pages(p2m->root, P2M_ROOT_ORDER);

     p2m->root = NULL;

I would be tempted to suggest making free_domheap_pages() tolerate NULL
pointers, except that would only be a safe thing to do if we assert that
the order parameter is 0, which won't help this specific case.
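
To make the failure mode concrete, here is a hedged, self-contained sketch
(stand-in names and types, not the hypervisor code) of why teardown has to
cope with a partially-initialised p2m when construction bails out early:

#include <stdio.h>
#include <stdlib.h>

/* Stand-in structure; the real p2m_domain is far richer. */
struct demo_p2m {
    void *root;                  /* NULL until the root table is allocated */
    int   vmid;                  /* -1 until a VMID has been assigned      */
};

/* Construction that can fail early, mimicking "VMID pool exhausted". */
static int demo_init(struct demo_p2m *p2m, int vmid_available)
{
    p2m->root = NULL;
    p2m->vmid = -1;

    if ( !vmid_available )
        return -1;               /* fail before p2m->root is allocated */

    p2m->root = malloc(4096);
    return p2m->root ? 0 : -1;
}

/* Teardown must tolerate a partially-initialised p2m. */
static void demo_teardown(struct demo_p2m *p2m)
{
    if ( p2m->root )             /* the guard the patch above adds */
        free(p2m->root);
    p2m->root = NULL;
}

int main(void)
{
    struct demo_p2m p2m;

    if ( demo_init(&p2m, 0) != 0 )   /* simulate a VMID allocation failure */
        demo_teardown(&p2m);         /* cleanup path sees root == NULL     */

    printf("teardown survived a partially-initialised p2m\n");
    return 0;
}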

~Andrew

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] Xen 4.7 crash

2016-06-01 Thread Andrew Cooper
On 01/06/2016 21:45, Aaron Cornelius wrote:
>>
>>> However, since I only have 1 domain active at a time, I'm not sure why I
>> should run out of VM IDs.
>>
>> Sounds like a VMID resource leak.  Check to see whether it is freed properly
>> in domain_destroy().
>>
>> ~Andrew
> That would be my assumption.  But as far as I can tell, arch_domain_destroy() 
> calls p2m_teardown() which calls p2m_free_vmid(), and none of the 
> functionality related to freeing a VM ID appears to have changed in years.

The VMID handling looks suspect.  It can be called repeatedly during
domain destruction, and it will repeatedly clear the same bit out of the
vmid_mask.

diff --git a/xen/arch/arm/p2m.c b/xen/arch/arm/p2m.c
index 838d004..7adb39a 100644
--- a/xen/arch/arm/p2m.c
+++ b/xen/arch/arm/p2m.c
@@ -1393,7 +1393,10 @@ static void p2m_free_vmid(struct domain *d)
     struct p2m_domain *p2m = &d->arch.p2m;
     spin_lock(&vmid_alloc_lock);
     if ( p2m->vmid != INVALID_VMID )
-        clear_bit(p2m->vmid, vmid_mask);
+    {
+        ASSERT(test_and_clear_bit(p2m->vmid, vmid_mask));
+        p2m->vmid = INVALID_VMID;
+    }

     spin_unlock(&vmid_alloc_lock);
 }

Having said that, I can't explain why that bug would result in the
symptoms you are seeing.  It is also possible that your issue is memory
corruption from a separate source.

Can you see about instrumenting p2m_alloc_vmid()/p2m_free_vmid() (with
vmid_alloc_lock held) to see which vmid is being allocated/freed ? 
After the initial boot of the system, you should see the same vmid being
allocated and freed for each of your domains.
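
Coming back to the point above about repeatedly clearing the same bit: here is
a hedged, self-contained sketch (plain C, no locking, not Xen code) of a bitmap
ID allocator whose free is made idempotent by invalidating the stored ID after
clearing its bit, so a repeated free becomes a no-op rather than clearing a bit
that may since have been handed to another domain:

#include <assert.h>
#include <stdio.h>

#define MAX_ID      256          /* 256, the maximum mentioned above for VMIDs */
#define INVALID_ID  MAX_ID

static unsigned char id_in_use[MAX_ID];   /* stand-in for vmid_mask */

static unsigned int demo_alloc_id(void)
{
    unsigned int i;

    for ( i = 0; i < MAX_ID; i++ )
        if ( !id_in_use[i] )
        {
            id_in_use[i] = 1;
            return i;
        }

    return INVALID_ID;                    /* pool exhausted */
}

static void demo_free_id(unsigned int *id)
{
    if ( *id == INVALID_ID )
        return;                           /* already freed: no-op */

    assert(id_in_use[*id]);               /* must still be marked in use */
    id_in_use[*id] = 0;
    *id = INVALID_ID;                     /* makes a repeated free harmless */
}

int main(void)
{
    unsigned int a = demo_alloc_id();
    unsigned int b;

    demo_free_id(&a);
    demo_free_id(&a);                     /* second free is now a no-op */

    b = demo_alloc_id();                  /* safely reuses the freed slot */
    printf("a=%u b=%u\n", a, b);

    return 0;
}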

~Andrew


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] Xen 4.7 crash

2016-06-01 Thread Aaron Cornelius
> -Original Message-
> From: Andrew Cooper [mailto:am...@hermes.cam.ac.uk] On Behalf Of
> Andrew Cooper
> Sent: Wednesday, June 1, 2016 4:01 PM
> To: Aaron Cornelius <aaron.cornel...@dornerworks.com>; Xen-devel <xen-de...@lists.xenproject.org>
> Subject: Re: [Xen-devel] Xen 4.7 crash
> 
> On 01/06/2016 20:54, Aaron Cornelius wrote:
> > I am doing some work with Xen 4.7 on the cubietruck (ARM32).  I've
> noticed some strange behavior after I create/destroy enough domains and
> put together a script to do the add/remove for me.  For this particular test I
> am creating a small mini-os (Mirage) domain with 32MB of RAM, deleting it,
> creating the new one, and so on.
> >
> > After running this for a while, I get the following error (with version
> 8478c9409a2c6726208e8dbc9f3e455b76725a33):
> >
> > (d846) Virtual -> physical offset = 3fc0
> > (d846) Checking DTB at 023ff000...
> > (d846) [32;1mMirageOS booting...[0m
> > (d846) Initialising console ... done.
> > (d846) gnttab_stubs.c: initialised mini-os gntmap
> > (d846) allocate_ondemand(1, 1) returning 230
> > (d846) allocate_ondemand(1, 1) returning 2301000
> > (XEN) grant_table.c:3288:d0v1 Grant release (0) ref:(9) flags:(2)
> > dom:(0)
> > (XEN) grant_table.c:3288:d0v1 Grant release (1) ref:(11) flags:(2)
> > dom:(0)
> > (XEN) p2m.c: dom1101: VMID pool exhausted
> > (XEN) CPU0: Unexpected Trap: Data Abort 
> >
> > I'm not 100% sure, from the "VMID pool exhausted" message it would
> appear that the p2m_init() function failed to allocate a VM ID, which caused
> domain creation to fail, and the NULL pointer dereference when trying to
> clean up the not-fully-created domain.
> >
> > However, since I only have 1 domain active at a time, I'm not sure why I
> should run out of VM IDs.
> 
> Sounds like a VMID resource leak.  Check to see whether it is freed properly
> in domain_destroy().
> 
> ~Andrew

That would be my assumption.  But as far as I can tell, arch_domain_destroy() 
calls p2m_teardown() which calls p2m_free_vmid(), and none of the functionality 
related to freeing a VM ID appears to have changed in years.

- Aaron
___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] Xen 4.7 crash

2016-06-01 Thread Andrew Cooper
On 01/06/2016 20:54, Aaron Cornelius wrote:
> I am doing some work with Xen 4.7 on the cubietruck (ARM32).  I've noticed 
> some strange behavior after I create/destroy enough domains and put together 
> a script to do the add/remove for me.  For this particular test I am creating 
> a small mini-os (Mirage) domain with 32MB of RAM, deleting it, creating the 
> new one, and so on.
>
> After running this for a while, I get the following error (with version 
> 8478c9409a2c6726208e8dbc9f3e455b76725a33):
>
> (d846) Virtual -> physical offset = 3fc0
> (d846) Checking DTB at 023ff000...
> (d846) [32;1mMirageOS booting...[0m
> (d846) Initialising console ... done.
> (d846) gnttab_stubs.c: initialised mini-os gntmap
> (d846) allocate_ondemand(1, 1) returning 230
> (d846) allocate_ondemand(1, 1) returning 2301000
> (XEN) grant_table.c:3288:d0v1 Grant release (0) ref:(9) flags:(2) dom:(0)
> (XEN) grant_table.c:3288:d0v1 Grant release (1) ref:(11) flags:(2) dom:(0)
> (XEN) p2m.c: dom1101: VMID pool exhausted
> (XEN) CPU0: Unexpected Trap: Data Abort
> 
>
> I'm not 100% sure, from the "VMID pool exhausted" message it would appear 
> that the p2m_init() function failed to allocate a VM ID, which caused domain 
> creation to fail, and the NULL pointer dereference when trying to clean up 
> the not-fully-created domain.
>
> However, since I only have 1 domain active at a time, I'm not sure why I 
> should run out of VM IDs.

Sounds like a VMID resource leak.  Check to see whether it is freed
properly in domain_destroy().

~Andrew

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


[Xen-devel] Xen 4.7 crash

2016-06-01 Thread Aaron Cornelius
I am doing some work with Xen 4.7 on the cubietruck (ARM32).  I've noticed some 
strange behavior after I create/destroy enough domains and put together a 
script to do the add/remove for me.  For this particular test I am creating a 
small mini-os (Mirage) domain with 32MB of RAM, deleting it, creating the new 
one, and so on.

After running this for a while, I get the following error (with version 
8478c9409a2c6726208e8dbc9f3e455b76725a33):

(d846) Virtual -> physical offset = 3fc0
(d846) Checking DTB at 023ff000...
(d846) [32;1mMirageOS booting...[0m
(d846) Initialising console ... done.
(d846) gnttab_stubs.c: initialised mini-os gntmap
(d846) allocate_ondemand(1, 1) returning 230
(d846) allocate_ondemand(1, 1) returning 2301000
(XEN) grant_table.c:3288:d0v1 Grant release (0) ref:(9) flags:(2) dom:(0)
(XEN) grant_table.c:3288:d0v1 Grant release (1) ref:(11) flags:(2) dom:(0)
(XEN) p2m.c: dom1101: VMID pool exhausted
(XEN) CPU0: Unexpected Trap: Data Abort
(XEN) [ Xen-4.7.0-rc  arm32  debug=y  Not tainted ]
(XEN) CPU:0
(XEN) PC: 0021fdd4 free_domheap_pages+0x1c/0x324
(XEN) CPSR:   6001011a MODE:Hypervisor
(XEN)  R0:  R1: 0001 R2: 0003 R3: 00304320
(XEN)  R4: 41c57000 R5: 41c57188 R6: 00200200 R7: 00100100
(XEN)  R8: 41c57180 R9: 43fdfe60 R10: R11:43fdfd5c R12:
(XEN) HYP: SP: 43fdfd2c LR: 0025b0cc
(XEN)
(XEN)   VTCR_EL2: 80003558
(XEN)  VTTBR_EL2: 0001bfb0e000
(XEN)
(XEN)  SCTLR_EL2: 30cd187f
(XEN)HCR_EL2: 0038663f
(XEN)  TTBR0_EL2: bfafc000
(XEN)
(XEN)ESR_EL2: 9406
(XEN)  HPFAR_EL2: 0001c810
(XEN)  HDFAR: 0014
(XEN)  HIFAR: 84e37182
(XEN)
(XEN) Xen stack trace from sp=43fdfd2c:
(XEN)002cf1b7 43fdfd64 41c57000 0100 41c57000 41c57188 00200200 00100100
(XEN)41c57180 43fdfe60  43fdfd7c 0025b0cc 41c57000 fff0 43fdfe60
(XEN)001f 044d 43fdfe60 43fdfd8c 0024f668 41c57000 fff0 43fdfda4
(XEN)0024f8f0 41c57000   001f 43fdfddc 0020854c 43fdfddc
(XEN) cccd 00304600 002822bc  b6f20004 044d 00304600
(XEN)00304320 d767a000  43fdfeec 00206d6c 43fdfe6c 00218f8c 
(XEN)0007 43fdfe30 43fdfe34  43fdfe20 0002 43fdfe48 43fdfe78
(XEN)   7622 2b0e 40023000  43fdfec8
(XEN)0002 43fdfebc 00218f8c 0001 000b  b6eba880 000b
(XEN)5abab87d f34aab2c 6adc50b8 e1713cd0    
(XEN)b6eba8d8  50043f00 b6eb5038 b6effba8 003e  000c3034
(XEN)000b9cb8 000bda30 000bda30  b6eba56c 003e b6effba8 b6effdb0
(XEN)be9558d4 00d0 be9558d4 0071 b6effba8 b6effd6c b6ed6fb4 4a000ea1
(XEN)c01077f8 43fdff58 002067b8 00305000 be9557c8 d767a000  43fdff54
(XEN)00260130  43fdfefc 43fdff1c 200f019a 400238f4 0004 0004
(XEN)002c9f00  00304600 c094c240  00305000 be9557a0 d767a000
(XEN) 43fdff44  c094c240  00305000 be9557c8 d767a000
(XEN) 43fdff58 00263b10 b6f20004    
(XEN)c094c240  00305000 be9557c8 d767a000  0001 0024
(XEN) b691ab34 c01077f8 60010013  be9557c4 c0a38600 c010c400
(XEN) Xen call trace:
(XEN)[<0021fdd4>] free_domheap_pages+0x1c/0x324 (PC)
(XEN)[<0025b0cc>] p2m_teardown+0xa0/0x108 (LR)
(XEN)[<0025b0cc>] p2m_teardown+0xa0/0x108
(XEN)[<0024f668>] arch_domain_destroy+0x20/0x50
(XEN)[<0024f8f0>] arch_domain_create+0x258/0x284
(XEN)[<0020854c>] domain_create+0x2dc/0x510
(XEN)[<00206d6c>] do_domctl+0x5b4/0x1928
(XEN)[<00260130>] do_trap_hypervisor+0x1170/0x15b0
(XEN)[<00263b10>] entry.o#return_from_trap+0/0x4
(XEN)
(XEN)
(XEN) 
(XEN) Panic on CPU 0:
(XEN) CPU0: Unexpected Trap: Data Abort
(XEN)
(XEN) 
(XEN)
(XEN) Reboot in five seconds...

I'm not 100% sure, but from the "VMID pool exhausted" message it would appear that 
the p2m_init() function failed to allocate a VM ID, which caused domain 
creation to fail, and the NULL pointer dereference when trying to clean up the 
not-fully-created domain.

However, since I only have 1 domain active at a time, I'm not sure why I should 
run out of VM IDs.

- Aaron Cornelius

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel