Re: [openstack-dev] [Nova] nova-compute deadlock

2014-08-04 Thread Qin Zhao
Dear stackers,

FYI.  Eventually I reported this problem to libguestfs, and a workaround has
been included in the libguestfs code to fix it. Thanks for your support!
https://bugzilla.redhat.com/show_bug.cgi?id=1123007


On Sat, Jun 7, 2014 at 3:27 AM, Qin Zhao chaoc...@gmail.com wrote:

 Yuriy,

 And I think that if we use a multiprocessing proxy object, the green thread
 will not switch while we are calling libguestfs.  Is that correct?


 On Fri, Jun 6, 2014 at 2:44 AM, Qin Zhao chaoc...@gmail.com wrote:

 Hi Yuriy,

 I read the multiprocessing source code just now.  Now I feel it may not
 solve this problem very easily.  For example, let us assume that we will
 use a proxy object to call libguestfs in the Manager's process. In
 managers.py, I see that it needs to create a pipe before forking the child
 process. The write end of this pipe is required by the child process.


 http://sourcecodebrowser.com/python-multiprocessing/2.6.2.1/classmultiprocessing_1_1managers_1_1_base_manager.html#a57fe9abe7a3d281286556c4bf3fbf4d5

 And in Process._bootstrap(), I think we will need to register a function
 to be called by _run_after_forkers(), in order to close the fds inherited
 from the Nova process.


 http://sourcecodebrowser.com/python-multiprocessing/2.6.2.1/classmultiprocessing_1_1process_1_1_process.html#ae594800e7bdef288d9bfbf8b79019d2e

 But we also cannot close the write-end fd created by the Manager in
 _run_after_forkers(). One feasible way may be to get that fd from the 5th
 element of the _args attribute of the Process object, and then skip
 closing that fd...  I have not investigated whether the Manager needs to
 use any other fds besides this pipe. Personally, I feel such an
 implementation would be a little tricky and risky, because it depends
 tightly on the Manager code. If the Manager opens other files, or changes
 the argument order, our code will fail to run. Am I wrong?  Is there any
 other, safer way?


 On Thu, Jun 5, 2014 at 11:40 PM, Yuriy Taraday yorik@gmail.com
 wrote:

 Please take a look at
 https://docs.python.org/2.7/library/multiprocessing.html#managers -
 everything is already implemented there.
 All you need is to start one manager that would serve all your requests
 to libguestfs. The implementation in the stdlib will provide you with all
 exceptions and return values, with minimal code changes on the Nova side.
 Create a new Manager, register a libguestfs endpoint in it, and call
 start(). It will spawn a separate process that will speak with the calling
 process over a very simple RPC.
 From the looks of it, all you need to do is replace the tpool.Proxy calls
 in the VFSGuestFS.setup method with calls to this new Manager.


 On Thu, Jun 5, 2014 at 7:21 PM, Qin Zhao chaoc...@gmail.com wrote:

 Hi Yuriy,

 Thanks for reading my bug!  You are right: Python 3.3 and 3.4 should not
 have this issue, since they can secure the file descriptors. But until
 OpenStack moves to Python 3, we still need a solution. Calling libguestfs
 in a separate process seems to be one way. That way, Nova code can close
 those fds by itself, without depending on CLOEXEC. However, that would be
 an expensive solution, since it requires a lot of code change. At the very
 least, we would need to write code to pass return values and exceptions
 between the two processes. That would make this solution very complex. Do
 you agree?


 On Thu, Jun 5, 2014 at 9:39 PM, Yuriy Taraday yorik@gmail.com
 wrote:

 This behavior of os.pipe() has changed in Python 3.x, so it won't be an
 issue on newer Python (if only it were accessible to us).

 From the looks of it, you can mitigate the problem by running libguestfs
 requests in a separate process (multiprocessing.managers comes to mind).
 This way the only descriptors the child process could theoretically
 inherit would be the long-lived pipes to the main process, although those
 won't leak, because they should be marked with CLOEXEC before any
 libguestfs request is run. The other benefit is that this separate process
 won't be busy opening and closing tons of fds, so the problem with
 inheriting will be avoided.


 On Thu, Jun 5, 2014 at 2:17 PM, laserjetyang laserjety...@gmail.com
 wrote:

   Will this patch of Python fix your problem?
 http://bugs.python.org/issue7213


 On Wed, Jun 4, 2014 at 10:41 PM, Qin Zhao chaoc...@gmail.com wrote:

  Hi Zhu Zhu,

 Thank you for reading my diagram!  I need to clarify that this problem
 does not occur during data injection.  Before creating the ISO, the
 driver code will extend the disk. Libguestfs is invoked in that time
 frame.

 And now I think this problem may occur at any time, if the code uses
 tpool to invoke libguestfs while an external command is executed in
 another green thread simultaneously.  Please correct me if I am wrong.

 I think one simple solution for this issue is to call the libguestfs
 routine in a green thread, rather than in another native thread. But that
 would hurt performance very much, so I do not think it is an acceptable
 solution.



  On Wed, Jun 4, 2014 at 12:00 PM, 

Re: [openstack-dev] [Nova] nova-compute deadlock

2014-06-07 Thread Richard W.M. Jones
On Sat, May 31, 2014 at 01:25:04AM +0800, Qin Zhao wrote:
 Hi all,
 
 When I run Icehouse code, I encountered a strange problem: the nova-compute
 service becomes stuck when I boot instances. I reported this bug in
 https://bugs.launchpad.net/nova/+bug/1313477.

 After thinking about it for several days, I feel I know the root cause.
 This bug should be a deadlock caused by pipe fd leaking.  I drew a diagram
 to illustrate the problem:
 https://docs.google.com/drawings/d/1pItX9urLd6fmjws3BVovXQvRg_qMdTHS-0JhYfSkkVc/pub?w=960&h=720

 However, I have not found a very good solution to prevent this deadlock.
 The problem involves the Python runtime, libguestfs, and eventlet, so the
 situation is a little complicated. Is there any expert who can help me
 look for a solution? I would appreciate your help!

Thanks for the useful diagram.  libguestfs itself is very careful to
open all file descriptors with O_CLOEXEC (atomically if the OS
supports that), so I'm fairly confident that the bug is in Python 2,
not in libguestfs.

Another thing to say is that g.shutdown() sends a kill 9 signal to the
subprocess.  Furthermore you can obtain the qemu PID (g.get_pid()) and
send any signal you want to the process.
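
A minimal sketch of that, assuming the standard libguestfs Python bindings
and an already-launched appliance:

    import os
    import signal
    import guestfs

    g = guestfs.GuestFS()
    g.add_drive_opts('/tmp/disk.img', readonly=1)
    g.launch()

    pid = g.get_pid()             # PID of the qemu subprocess
    os.kill(pid, signal.SIGTERM)  # any signal; g.shutdown() sends SIGKILL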

I wonder if a simpler way to fix this wouldn't be something like
adding a tiny C extension to the Python code to use pipe2 to open the
Python pipe with O_CLOEXEC atomically?  Are we allowed Python
extensions in OpenStack?
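
Short of a compiled extension, the same effect can be sketched from Python 2
with ctypes -- a rough sketch, assuming Linux and glibc (the O_CLOEXEC value
below is the Linux asm-generic one, so it is not portable):

    import ctypes
    import os

    O_CLOEXEC = 0o2000000  # Linux asm-generic value; assumption: Linux only

    libc = ctypes.CDLL('libc.so.6', use_errno=True)

    def pipe2_cloexec():
        # Create both pipe fds with O_CLOEXEC set atomically, so a child
        # forked concurrently from another thread cannot inherit them.
        fds = (ctypes.c_int * 2)()
        if libc.pipe2(fds, O_CLOEXEC) != 0:
            err = ctypes.get_errno()
            raise OSError(err, os.strerror(err))
        return fds[0], fds[1]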

BTW do feel free to CC libgues...@redhat.com on any libguestfs
problems you have.  You don't need to subscribe to the list.

Rich.


-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-p2v converts physical machines to virtual machines.  Boot with a
live CD or over the network (PXE) and turn machines into KVM guests.
http://libguestfs.org/virt-v2v

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Nova] nova-compute deadlock

2014-06-06 Thread Qin Zhao
Yuriy,

And I think that if we use a multiprocessing proxy object, the green thread
will not switch while we are calling libguestfs.  Is that correct?


On Fri, Jun 6, 2014 at 2:44 AM, Qin Zhao chaoc...@gmail.com wrote:

 Hi Yuriy,

 I read the multiprocessing source code just now.  Now I feel it may not
 solve this problem very easily.  For example, let us assume that we will
 use a proxy object to call libguestfs in the Manager's process. In
 managers.py, I see that it needs to create a pipe before forking the child
 process. The write end of this pipe is required by the child process.


 http://sourcecodebrowser.com/python-multiprocessing/2.6.2.1/classmultiprocessing_1_1managers_1_1_base_manager.html#a57fe9abe7a3d281286556c4bf3fbf4d5

 And in Process._bootstrap(), I think we will need to register a function
 to be called by _run_after_forkers(), in order to close the fds inherited
 from the Nova process.


 http://sourcecodebrowser.com/python-multiprocessing/2.6.2.1/classmultiprocessing_1_1process_1_1_process.html#ae594800e7bdef288d9bfbf8b79019d2e

 But we also cannot close the write-end fd created by the Manager in
 _run_after_forkers(). One feasible way may be to get that fd from the 5th
 element of the _args attribute of the Process object, and then skip
 closing that fd...  I have not investigated whether the Manager needs to
 use any other fds besides this pipe. Personally, I feel such an
 implementation would be a little tricky and risky, because it depends
 tightly on the Manager code. If the Manager opens other files, or changes
 the argument order, our code will fail to run. Am I wrong?  Is there any
 other, safer way?


 On Thu, Jun 5, 2014 at 11:40 PM, Yuriy Taraday yorik@gmail.com
 wrote:

 Please take a look at
 https://docs.python.org/2.7/library/multiprocessing.html#managers -
 everything is already implemented there.
 All you need is to start one manager that would serve all your requests
 to libguestfs. The implementation in the stdlib will provide you with all
 exceptions and return values, with minimal code changes on the Nova side.
 Create a new Manager, register a libguestfs endpoint in it, and call
 start(). It will spawn a separate process that will speak with the calling
 process over a very simple RPC.
 From the looks of it, all you need to do is replace the tpool.Proxy calls
 in the VFSGuestFS.setup method with calls to this new Manager.


 On Thu, Jun 5, 2014 at 7:21 PM, Qin Zhao chaoc...@gmail.com wrote:

 Hi Yuriy,

 Thanks for reading my bug!  You are right: Python 3.3 and 3.4 should not
 have this issue, since they can secure the file descriptors. But until
 OpenStack moves to Python 3, we still need a solution. Calling libguestfs
 in a separate process seems to be one way. That way, Nova code can close
 those fds by itself, without depending on CLOEXEC. However, that would be
 an expensive solution, since it requires a lot of code change. At the very
 least, we would need to write code to pass return values and exceptions
 between the two processes. That would make this solution very complex. Do
 you agree?


 On Thu, Jun 5, 2014 at 9:39 PM, Yuriy Taraday yorik@gmail.com
 wrote:

 This behavior of os.pipe() has changed in Python 3.x, so it won't be an
 issue on newer Python (if only it were accessible to us).

 From the looks of it, you can mitigate the problem by running libguestfs
 requests in a separate process (multiprocessing.managers comes to mind).
 This way the only descriptors the child process could theoretically
 inherit would be the long-lived pipes to the main process, although those
 won't leak, because they should be marked with CLOEXEC before any
 libguestfs request is run. The other benefit is that this separate process
 won't be busy opening and closing tons of fds, so the problem with
 inheriting will be avoided.


 On Thu, Jun 5, 2014 at 2:17 PM, laserjetyang laserjety...@gmail.com
 wrote:

   Will this patch of Python fix your problem?
 http://bugs.python.org/issue7213


 On Wed, Jun 4, 2014 at 10:41 PM, Qin Zhao chaoc...@gmail.com wrote:

  Hi Zhu Zhu,

 Thank you for reading my diagram!  I need to clarify that this problem
 does not occur during data injection.  Before creating the ISO, the
 driver code will extend the disk. Libguestfs is invoked in that time
 frame.

 And now I think this problem may occur at any time, if the code uses
 tpool to invoke libguestfs while an external command is executed in
 another green thread simultaneously.  Please correct me if I am wrong.

 I think one simple solution for this issue is to call the libguestfs
 routine in a green thread, rather than in another native thread. But that
 would hurt performance very much, so I do not think it is an acceptable
 solution.



  On Wed, Jun 4, 2014 at 12:00 PM, Zhu Zhu bjzzu...@gmail.com wrote:

   Hi Qin Zhao,

 Thanks for raising this issue and the analysis. According to the issue
 description and the scenario in which it happens (
 https://docs.google.com/drawings/d/1pItX9urLd6fmjws3BVovXQvRg_qMdTHS-0JhYfSkkVc/pub?w=960&h=720
 ), if that's the case, concurrent 

Re: [openstack-dev] [Nova] nova-compute deadlock

2014-06-05 Thread laserjetyang
  Will this patch of Python fix your problem?
http://bugs.python.org/issue7213

On Wed, Jun 4, 2014 at 10:41 PM, Qin Zhao chaoc...@gmail.com wrote:

  Hi Zhu Zhu,

 Thank you for reading my diagram!  I need to clarify that this problem
 does not occur during data injection.  Before creating the ISO, the
 driver code will extend the disk. Libguestfs is invoked in that time
 frame.

 And now I think this problem may occur at any time, if the code uses
 tpool to invoke libguestfs while an external command is executed in
 another green thread simultaneously.  Please correct me if I am wrong.

 I think one simple solution for this issue is to call the libguestfs
 routine in a green thread, rather than in another native thread. But that
 would hurt performance very much, so I do not think it is an acceptable
 solution.



  On Wed, Jun 4, 2014 at 12:00 PM, Zhu Zhu bjzzu...@gmail.com wrote:

   Hi Qin Zhao,

 Thanks for raising this issue and the analysis. According to the issue
 description and the scenario in which it happens (
 https://docs.google.com/drawings/d/1pItX9urLd6fmjws3BVovXQvRg_qMdTHS-0JhYfSkkVc/pub?w=960&h=720
 ), if that's the case, the issue is very likely to happen when multiple
 KVM instances are spawned concurrently (*with both config drive and data
 injection enabled*).
 As in the libvirt/driver.py _create_image method, right after the ISO is
 made by cdb.make_drive, the driver will attempt data injection, which
 will launch libguestfs in another thread.

 It looks like there were also a couple of libguestfs hang issues on
 Launchpad, listed below. I am not sure whether libguestfs itself could
 have some mechanism to free/close the fds inherited from the parent
 process, instead of requiring an explicit tear-down call. Maybe open a
 defect against libguestfs to see what they think?

  https://bugs.launchpad.net/nova/+bug/1286256
 https://bugs.launchpad.net/nova/+bug/1270304

 --
  Zhu Zhu
 Best Regards


  *From:* Qin Zhao chaoc...@gmail.com
 *Date:* 2014-05-31 01:25
  *To:* OpenStack Development Mailing List (not for usage questions)
 openstack-dev@lists.openstack.org
 *Subject:* [openstack-dev] [Nova] nova-compute deadlock
Hi all,

 When I run Icehouse code, I encountered a strange problem: the
 nova-compute service becomes stuck when I boot instances. I reported this
 bug in https://bugs.launchpad.net/nova/+bug/1313477.

 After thinking about it for several days, I feel I know the root cause.
 This bug should be a deadlock caused by pipe fd leaking.  I drew a diagram
 to illustrate the problem:
 https://docs.google.com/drawings/d/1pItX9urLd6fmjws3BVovXQvRg_qMdTHS-0JhYfSkkVc/pub?w=960&h=720

 However, I have not found a very good solution to prevent this deadlock.
 The problem involves the Python runtime, libguestfs, and eventlet, so the
 situation is a little complicated. Is there any expert who can help me
 look for a solution? I would appreciate your help!

 --
 Qin Zhao


 ___
 OpenStack-dev mailing list
 OpenStack-dev@lists.openstack.org
 http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev




 --
 Qin Zhao

 ___
 OpenStack-dev mailing list
 OpenStack-dev@lists.openstack.org
 http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Nova] nova-compute deadlock

2014-06-05 Thread Yuriy Taraday
This behavior of os.pipe() has changed in Python 3.x, so it won't be an
issue on newer Python (if only it were accessible to us).

From the looks of it, you can mitigate the problem by running libguestfs
requests in a separate process (multiprocessing.managers comes to mind).
This way the only descriptors the child process could theoretically inherit
would be the long-lived pipes to the main process, although those won't
leak, because they should be marked with CLOEXEC before any libguestfs
request is run. The other benefit is that this separate process won't be
busy opening and closing tons of fds, so the problem with inheriting will
be avoided.
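
Marking an fd close-on-exec from Python 2 is a one-liner with fcntl, though
(unlike pipe2) it is not atomic with the pipe creation -- a minimal sketch:

    import fcntl
    import os

    def set_cloexec(fd):
        # Set FD_CLOEXEC so the fd is closed in any child after exec().
        flags = fcntl.fcntl(fd, fcntl.F_GETFD)
        fcntl.fcntl(fd, fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)

    r, w = os.pipe()   # inheritable by default on Python 2
    set_cloexec(r)
    set_cloexec(w)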


On Thu, Jun 5, 2014 at 2:17 PM, laserjetyang laserjety...@gmail.com wrote:

   Will this patch of Python fix your problem?
 http://bugs.python.org/issue7213


 On Wed, Jun 4, 2014 at 10:41 PM, Qin Zhao chaoc...@gmail.com wrote:

  Hi Zhu Zhu,

 Thank you for reading my diagram!  I need to clarify that this problem
 does not occur during data injection.  Before creating the ISO, the
 driver code will extend the disk. Libguestfs is invoked in that time
 frame.

 And now I think this problem may occur at any time, if the code uses
 tpool to invoke libguestfs while an external command is executed in
 another green thread simultaneously.  Please correct me if I am wrong.

 I think one simple solution for this issue is to call the libguestfs
 routine in a green thread, rather than in another native thread. But that
 would hurt performance very much, so I do not think it is an acceptable
 solution.



  On Wed, Jun 4, 2014 at 12:00 PM, Zhu Zhu bjzzu...@gmail.com wrote:

   Hi Qin Zhao,

 Thanks for raising this issue and the analysis. According to the issue
 description and the scenario in which it happens (
 https://docs.google.com/drawings/d/1pItX9urLd6fmjws3BVovXQvRg_qMdTHS-0JhYfSkkVc/pub?w=960&h=720
 ), if that's the case, the issue is very likely to happen when multiple
 KVM instances are spawned concurrently (*with both config drive and data
 injection enabled*).
 As in the libvirt/driver.py _create_image method, right after the ISO is
 made by cdb.make_drive, the driver will attempt data injection, which
 will launch libguestfs in another thread.

 It looks like there were also a couple of libguestfs hang issues on
 Launchpad, listed below. I am not sure whether libguestfs itself could
 have some mechanism to free/close the fds inherited from the parent
 process, instead of requiring an explicit tear-down call. Maybe open a
 defect against libguestfs to see what they think?

  https://bugs.launchpad.net/nova/+bug/1286256
 https://bugs.launchpad.net/nova/+bug/1270304

 --
  Zhu Zhu
 Best Regards


  *From:* Qin Zhao chaoc...@gmail.com
 *Date:* 2014-05-31 01:25
  *To:* OpenStack Development Mailing List (not for usage questions)
 openstack-dev@lists.openstack.org
 *Subject:* [openstack-dev] [Nova] nova-compute deadlock
Hi all,

 When I run Icehouse code, I encountered a strange problem: the
 nova-compute service becomes stuck when I boot instances. I reported this
 bug in https://bugs.launchpad.net/nova/+bug/1313477.

 After thinking about it for several days, I feel I know the root cause.
 This bug should be a deadlock caused by pipe fd leaking.  I drew a diagram
 to illustrate the problem:
 https://docs.google.com/drawings/d/1pItX9urLd6fmjws3BVovXQvRg_qMdTHS-0JhYfSkkVc/pub?w=960&h=720

 However, I have not found a very good solution to prevent this deadlock.
 The problem involves the Python runtime, libguestfs, and eventlet, so the
 situation is a little complicated. Is there any expert who can help me
 look for a solution? I would appreciate your help!

 --
 Qin Zhao


 ___
 OpenStack-dev mailing list
 OpenStack-dev@lists.openstack.org
 http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev




 --
 Qin Zhao

 ___
 OpenStack-dev mailing list
 OpenStack-dev@lists.openstack.org
 http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev



 ___
 OpenStack-dev mailing list
 OpenStack-dev@lists.openstack.org
 http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev




-- 

Kind regards, Yuriy.
___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Nova] nova-compute deadlock

2014-06-05 Thread Qin Zhao
Hi, thanks for reading my bug!  I think this patch cannot fix the problem
for us right now, because pipe2() requires Python 3.3.


On Thu, Jun 5, 2014 at 6:17 PM, laserjetyang laserjety...@gmail.com wrote:

   Will this patch of Python fix your problem?
 http://bugs.python.org/issue7213


 On Wed, Jun 4, 2014 at 10:41 PM, Qin Zhao chaoc...@gmail.com wrote:

  Hi Zhu Zhu,

 Thank you for reading my diagram!  I need to clarify that this problem
 does not occur during data injection.  Before creating the ISO, the
 driver code will extend the disk. Libguestfs is invoked in that time
 frame.

 And now I think this problem may occur at any time, if the code uses
 tpool to invoke libguestfs while an external command is executed in
 another green thread simultaneously.  Please correct me if I am wrong.

 I think one simple solution for this issue is to call the libguestfs
 routine in a green thread, rather than in another native thread. But that
 would hurt performance very much, so I do not think it is an acceptable
 solution.



  On Wed, Jun 4, 2014 at 12:00 PM, Zhu Zhu bjzzu...@gmail.com wrote:

   Hi Qin Zhao,

 Thanks for raising this issue and the analysis. According to the issue
 description and the scenario in which it happens (
 https://docs.google.com/drawings/d/1pItX9urLd6fmjws3BVovXQvRg_qMdTHS-0JhYfSkkVc/pub?w=960&h=720
 ), if that's the case, the issue is very likely to happen when multiple
 KVM instances are spawned concurrently (*with both config drive and data
 injection enabled*).
 As in the libvirt/driver.py _create_image method, right after the ISO is
 made by cdb.make_drive, the driver will attempt data injection, which
 will launch libguestfs in another thread.

 It looks like there were also a couple of libguestfs hang issues on
 Launchpad, listed below. I am not sure whether libguestfs itself could
 have some mechanism to free/close the fds inherited from the parent
 process, instead of requiring an explicit tear-down call. Maybe open a
 defect against libguestfs to see what they think?

  https://bugs.launchpad.net/nova/+bug/1286256
 https://bugs.launchpad.net/nova/+bug/1270304

 --
  Zhu Zhu
 Best Regards


  *From:* Qin Zhao chaoc...@gmail.com
 *Date:* 2014-05-31 01:25
  *To:* OpenStack Development Mailing List (not for usage questions)
 openstack-dev@lists.openstack.org
 *Subject:* [openstack-dev] [Nova] nova-compute deadlock
Hi all,

 When I run Icehouse code, I encountered a strange problem: the
 nova-compute service becomes stuck when I boot instances. I reported this
 bug in https://bugs.launchpad.net/nova/+bug/1313477.

 After thinking about it for several days, I feel I know the root cause.
 This bug should be a deadlock caused by pipe fd leaking.  I drew a diagram
 to illustrate the problem:
 https://docs.google.com/drawings/d/1pItX9urLd6fmjws3BVovXQvRg_qMdTHS-0JhYfSkkVc/pub?w=960&h=720

 However, I have not found a very good solution to prevent this deadlock.
 The problem involves the Python runtime, libguestfs, and eventlet, so the
 situation is a little complicated. Is there any expert who can help me
 look for a solution? I would appreciate your help!

 --
 Qin Zhao


 ___
 OpenStack-dev mailing list
 OpenStack-dev@lists.openstack.org
 http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev




 --
 Qin Zhao

 ___
 OpenStack-dev mailing list
 OpenStack-dev@lists.openstack.org
 http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev



 ___
 OpenStack-dev mailing list
 OpenStack-dev@lists.openstack.org
 http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev




-- 
Qin Zhao
___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Nova] nova-compute deadlock

2014-06-05 Thread Qin Zhao
Hi Yuriy,

Thanks for reading my bug!  You are right: Python 3.3 and 3.4 should not
have this issue, since they can secure the file descriptors. But until
OpenStack moves to Python 3, we still need a solution. Calling libguestfs
in a separate process seems to be one way. That way, Nova code can close
those fds by itself, without depending on CLOEXEC. However, that would be
an expensive solution, since it requires a lot of code change. At the very
least, we would need to write code to pass return values and exceptions
between the two processes. That would make this solution very complex. Do
you agree?


On Thu, Jun 5, 2014 at 9:39 PM, Yuriy Taraday yorik@gmail.com wrote:

 This behavior of os.pipe() has changed in Python 3.x, so it won't be an
 issue on newer Python (if only it were accessible to us).

 From the looks of it, you can mitigate the problem by running libguestfs
 requests in a separate process (multiprocessing.managers comes to mind).
 This way the only descriptors the child process could theoretically
 inherit would be the long-lived pipes to the main process, although those
 won't leak, because they should be marked with CLOEXEC before any
 libguestfs request is run. The other benefit is that this separate process
 won't be busy opening and closing tons of fds, so the problem with
 inheriting will be avoided.


 On Thu, Jun 5, 2014 at 2:17 PM, laserjetyang laserjety...@gmail.com
 wrote:

   Will this patch of Python fix your problem?
 http://bugs.python.org/issue7213


 On Wed, Jun 4, 2014 at 10:41 PM, Qin Zhao chaoc...@gmail.com wrote:

  Hi Zhu Zhu,

 Thank you for reading my diagram!  I need to clarify that this problem
 does not occur during data injection.  Before creating the ISO, the
 driver code will extend the disk. Libguestfs is invoked in that time
 frame.

 And now I think this problem may occur at any time, if the code uses
 tpool to invoke libguestfs while an external command is executed in
 another green thread simultaneously.  Please correct me if I am wrong.

 I think one simple solution for this issue is to call the libguestfs
 routine in a green thread, rather than in another native thread. But that
 would hurt performance very much, so I do not think it is an acceptable
 solution.



  On Wed, Jun 4, 2014 at 12:00 PM, Zhu Zhu bjzzu...@gmail.com wrote:

   Hi Qin Zhao,

 Thanks for raising this issue and the analysis. According to the issue
 description and the scenario in which it happens (
 https://docs.google.com/drawings/d/1pItX9urLd6fmjws3BVovXQvRg_qMdTHS-0JhYfSkkVc/pub?w=960&h=720
 ), if that's the case, the issue is very likely to happen when multiple
 KVM instances are spawned concurrently (*with both config drive and data
 injection enabled*).
 As in the libvirt/driver.py _create_image method, right after the ISO is
 made by cdb.make_drive, the driver will attempt data injection, which
 will launch libguestfs in another thread.

 It looks like there were also a couple of libguestfs hang issues on
 Launchpad, listed below. I am not sure whether libguestfs itself could
 have some mechanism to free/close the fds inherited from the parent
 process, instead of requiring an explicit tear-down call. Maybe open a
 defect against libguestfs to see what they think?

  https://bugs.launchpad.net/nova/+bug/1286256
 https://bugs.launchpad.net/nova/+bug/1270304

 --
  Zhu Zhu
 Best Regards


  *From:* Qin Zhao chaoc...@gmail.com
 *Date:* 2014-05-31 01:25
  *To:* OpenStack Development Mailing List (not for usage questions)
 openstack-dev@lists.openstack.org
 *Subject:* [openstack-dev] [Nova] nova-compute deadlock
Hi all,

 When I run Icehouse code, I encountered a strange problem: the
 nova-compute service becomes stuck when I boot instances. I reported this
 bug in https://bugs.launchpad.net/nova/+bug/1313477.

 After thinking about it for several days, I feel I know the root cause.
 This bug should be a deadlock caused by pipe fd leaking.  I drew a diagram
 to illustrate the problem:
 https://docs.google.com/drawings/d/1pItX9urLd6fmjws3BVovXQvRg_qMdTHS-0JhYfSkkVc/pub?w=960&h=720

 However, I have not found a very good solution to prevent this deadlock.
 The problem involves the Python runtime, libguestfs, and eventlet, so the
 situation is a little complicated. Is there any expert who can help me
 look for a solution? I would appreciate your help!

 --
 Qin Zhao


 ___
 OpenStack-dev mailing list
 OpenStack-dev@lists.openstack.org
 http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev




 --
 Qin Zhao

 ___
 OpenStack-dev mailing list
 OpenStack-dev@lists.openstack.org
 http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev



 ___
 OpenStack-dev mailing list
 OpenStack-dev@lists.openstack.org
 http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev




 --

 Kind regards, Yuriy.

 

Re: [openstack-dev] [Nova] nova-compute deadlock

2014-06-05 Thread Yuriy Taraday
Please take a look at
https://docs.python.org/2.7/library/multiprocessing.html#managers -
everything is already implemented there.
All you need is to start one manager that would serve all your requests to
libguestfs. The implementation in the stdlib will provide you with all
exceptions and return values, with minimal code changes on the Nova side.
Create a new Manager, register a libguestfs endpoint in it, and call
start(). It will spawn a separate process that will speak with the calling
process over a very simple RPC.
From the looks of it, all you need to do is replace the tpool.Proxy calls
in the VFSGuestFS.setup method with calls to this new Manager.
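
A rough sketch of what that could look like; the names here are illustrative,
and the real change would go through VFSGuestFS.setup:

    from multiprocessing.managers import BaseManager

    import guestfs

    class GuestFSManager(BaseManager):
        pass

    # Proxy objects forward method calls to the server process over the
    # manager's connection and re-raise remote exceptions in the caller.
    GuestFSManager.register('GuestFS', guestfs.GuestFS)

    manager = GuestFSManager()
    manager.start()           # fork the server process once, up front

    g = manager.GuestFS()     # proxy; the real handle lives in the server
    g.add_drive_opts('/path/to/disk.img', readonly=0)
    g.launch()                # qemu is forked from the manager process,
                              # not from the fd-busy nova-compute process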


On Thu, Jun 5, 2014 at 7:21 PM, Qin Zhao chaoc...@gmail.com wrote:

 Hi Yuriy,

 Thanks for reading my bug!  You are right: Python 3.3 and 3.4 should not
 have this issue, since they can secure the file descriptors. But until
 OpenStack moves to Python 3, we still need a solution. Calling libguestfs
 in a separate process seems to be one way. That way, Nova code can close
 those fds by itself, without depending on CLOEXEC. However, that would be
 an expensive solution, since it requires a lot of code change. At the very
 least, we would need to write code to pass return values and exceptions
 between the two processes. That would make this solution very complex. Do
 you agree?


 On Thu, Jun 5, 2014 at 9:39 PM, Yuriy Taraday yorik@gmail.com wrote:

 This behavior of os.pipe() has changed in Python 3.x, so it won't be an
 issue on newer Python (if only it were accessible to us).

 From the looks of it, you can mitigate the problem by running libguestfs
 requests in a separate process (multiprocessing.managers comes to mind).
 This way the only descriptors the child process could theoretically
 inherit would be the long-lived pipes to the main process, although those
 won't leak, because they should be marked with CLOEXEC before any
 libguestfs request is run. The other benefit is that this separate process
 won't be busy opening and closing tons of fds, so the problem with
 inheriting will be avoided.


 On Thu, Jun 5, 2014 at 2:17 PM, laserjetyang laserjety...@gmail.com
 wrote:

   Will this patch of Python fix your problem?
 http://bugs.python.org/issue7213


 On Wed, Jun 4, 2014 at 10:41 PM, Qin Zhao chaoc...@gmail.com wrote:

  Hi Zhu Zhu,

 Thank you for reading my diagram!  I need to clarify that this problem
 does not occur during data injection.  Before creating the ISO, the
 driver code will extend the disk. Libguestfs is invoked in that time
 frame.

 And now I think this problem may occur at any time, if the code uses
 tpool to invoke libguestfs while an external command is executed in
 another green thread simultaneously.  Please correct me if I am wrong.

 I think one simple solution for this issue is to call the libguestfs
 routine in a green thread, rather than in another native thread. But that
 would hurt performance very much, so I do not think it is an acceptable
 solution.



  On Wed, Jun 4, 2014 at 12:00 PM, Zhu Zhu bjzzu...@gmail.com wrote:

   Hi Qin Zhao,

 Thanks for raising this issue and the analysis. According to the issue
 description and the scenario in which it happens (
 https://docs.google.com/drawings/d/1pItX9urLd6fmjws3BVovXQvRg_qMdTHS-0JhYfSkkVc/pub?w=960&h=720
 ), if that's the case, the issue is very likely to happen when multiple
 KVM instances are spawned concurrently (*with both config drive and data
 injection enabled*).
 As in the libvirt/driver.py _create_image method, right after the ISO is
 made by cdb.make_drive, the driver will attempt data injection, which
 will launch libguestfs in another thread.

 It looks like there were also a couple of libguestfs hang issues on
 Launchpad, listed below. I am not sure whether libguestfs itself could
 have some mechanism to free/close the fds inherited from the parent
 process, instead of requiring an explicit tear-down call. Maybe open a
 defect against libguestfs to see what they think?

  https://bugs.launchpad.net/nova/+bug/1286256
 https://bugs.launchpad.net/nova/+bug/1270304

 --
  Zhu Zhu
 Best Regards


  *From:* Qin Zhao chaoc...@gmail.com
 *Date:* 2014-05-31 01:25
  *To:* OpenStack Development Mailing List (not for usage questions)
 openstack-dev@lists.openstack.org
 *Subject:* [openstack-dev] [Nova] nova-compute deadlock
Hi all,

 When I run Icehouse code, I encountered a strange problem: the
 nova-compute service becomes stuck when I boot instances. I reported this
 bug in https://bugs.launchpad.net/nova/+bug/1313477.

 After thinking about it for several days, I feel I know the root cause.
 This bug should be a deadlock caused by pipe fd leaking.  I drew a diagram
 to illustrate the problem:
 https://docs.google.com/drawings/d/1pItX9urLd6fmjws3BVovXQvRg_qMdTHS-0JhYfSkkVc/pub?w=960&h=720

 However, I have not found a very good solution to prevent this deadlock.
 The problem involves the Python runtime, libguestfs, and eventlet, so the
 situation is a little complicated. Is there any 

Re: [openstack-dev] [Nova] nova-compute deadlock

2014-06-05 Thread Qin Zhao
Hi Yuriy,

I read the multiprocessing source code just now.  Now I feel it may not
solve this problem very easily.  For example, let us assume that we will
use a proxy object to call libguestfs in the Manager's process. In
managers.py, I see that it needs to create a pipe before forking the child
process. The write end of this pipe is required by the child process.

http://sourcecodebrowser.com/python-multiprocessing/2.6.2.1/classmultiprocessing_1_1managers_1_1_base_manager.html#a57fe9abe7a3d281286556c4bf3fbf4d5

And in Process._bootstrap(), I think we will need to register a function to
be called by _run_after_forkers(), in order to close the fds inherited
from the Nova process.

http://sourcecodebrowser.com/python-multiprocessing/2.6.2.1/classmultiprocessing_1_1process_1_1_process.html#ae594800e7bdef288d9bfbf8b79019d2e

But we also cannot close the write-end fd created by the Manager in
_run_after_forkers(). One feasible way may be to get that fd from the 5th
element of the _args attribute of the Process object, and then skip
closing that fd...  I have not investigated whether the Manager needs to
use any other fds besides this pipe. Personally, I feel such an
implementation would be a little tricky and risky, because it depends
tightly on the Manager code. If the Manager opens other files, or changes
the argument order, our code will fail to run. Am I wrong?  Is there any
other, safer way?
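
One possibly safer alternative, at least on Python 2.7: BaseManager.start()
accepts an initializer that runs in the manager's child process right after
the fork, so we could close only the fds we know about, without touching
_run_after_forkers() or the Manager's own pipe. A sketch, assuming Nova
tracks its sensitive fds itself (the registry below is hypothetical):

    import os
    from multiprocessing.managers import BaseManager

    # Hypothetical registry: fds Nova has opened that must not leak into
    # the manager process (e.g. pipes created by processutils.execute).
    SENSITIVE_FDS = set()

    def close_tracked_fds():
        # Runs once in the manager's server process, after fork but
        # before it serves requests; the Manager's own pipe is untouched.
        for fd in list(SENSITIVE_FDS):
            try:
                os.close(fd)
            except OSError:
                pass

    class GuestFSManager(BaseManager):
        pass

    manager = GuestFSManager()
    manager.start(initializer=close_tracked_fds)

Starting the manager early, before nova-compute has many fds open, would
also shrink the set of descriptors that could leak in the first place.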


On Thu, Jun 5, 2014 at 11:40 PM, Yuriy Taraday yorik@gmail.com wrote:

 Please take a look at
 https://docs.python.org/2.7/library/multiprocessing.html#managers -
 everything is already implemented there.
 All you need is to start one manager that would serve all your requests to
 libguestfs. The implementation in the stdlib will provide you with all
 exceptions and return values, with minimal code changes on the Nova side.
 Create a new Manager, register a libguestfs endpoint in it, and call
 start(). It will spawn a separate process that will speak with the calling
 process over a very simple RPC.
 From the looks of it, all you need to do is replace the tpool.Proxy calls
 in the VFSGuestFS.setup method with calls to this new Manager.


 On Thu, Jun 5, 2014 at 7:21 PM, Qin Zhao chaoc...@gmail.com wrote:

 Hi Yuriy,

 Thanks for reading my bug!  You are right: Python 3.3 and 3.4 should not
 have this issue, since they can secure the file descriptors. But until
 OpenStack moves to Python 3, we still need a solution. Calling libguestfs
 in a separate process seems to be one way. That way, Nova code can close
 those fds by itself, without depending on CLOEXEC. However, that would be
 an expensive solution, since it requires a lot of code change. At the very
 least, we would need to write code to pass return values and exceptions
 between the two processes. That would make this solution very complex. Do
 you agree?


 On Thu, Jun 5, 2014 at 9:39 PM, Yuriy Taraday yorik@gmail.com
 wrote:

 This behavior of os.pipe() has changed in Python 3.x, so it won't be an
 issue on newer Python (if only it were accessible to us).

 From the looks of it, you can mitigate the problem by running libguestfs
 requests in a separate process (multiprocessing.managers comes to mind).
 This way the only descriptors the child process could theoretically
 inherit would be the long-lived pipes to the main process, although those
 won't leak, because they should be marked with CLOEXEC before any
 libguestfs request is run. The other benefit is that this separate process
 won't be busy opening and closing tons of fds, so the problem with
 inheriting will be avoided.


 On Thu, Jun 5, 2014 at 2:17 PM, laserjetyang laserjety...@gmail.com
 wrote:

   Will this patch of Python fix your problem?
 http://bugs.python.org/issue7213


 On Wed, Jun 4, 2014 at 10:41 PM, Qin Zhao chaoc...@gmail.com wrote:

  Hi Zhu Zhu,

 Thank you for reading my diagram!  I need to clarify that this problem
 does not occur during data injection.  Before creating the ISO, the
 driver code will extend the disk. Libguestfs is invoked in that time
 frame.

 And now I think this problem may occur at any time, if the code uses
 tpool to invoke libguestfs while an external command is executed in
 another green thread simultaneously.  Please correct me if I am wrong.

 I think one simple solution for this issue is to call the libguestfs
 routine in a green thread, rather than in another native thread. But that
 would hurt performance very much, so I do not think it is an acceptable
 solution.



  On Wed, Jun 4, 2014 at 12:00 PM, Zhu Zhu bjzzu...@gmail.com wrote:

   Hi Qin Zhao,

 Thanks for raising this issue and the analysis. According to the issue
 description and the scenario in which it happens (
 https://docs.google.com/drawings/d/1pItX9urLd6fmjws3BVovXQvRg_qMdTHS-0JhYfSkkVc/pub?w=960&h=720
 ), if that's the case, the issue is very likely to happen when multiple
 KVM instances are spawned concurrently (*with both config drive and data
 injection enabled*).
 As in the libvirt/driver.py _create_image method, right after the ISO is
 made by cdb.make_drive, the driver will 

Re: [openstack-dev] [Nova] nova-compute deadlock

2014-06-03 Thread Zhu Zhu
Hi Qin Zhao,

Thanks for raising this issue and the analysis. According to the issue
description and the scenario in which it happens
(https://docs.google.com/drawings/d/1pItX9urLd6fmjws3BVovXQvRg_qMdTHS-0JhYfSkkVc/pub?w=960&h=720),
if that's the case, the issue is very likely to happen when multiple KVM
instances are spawned concurrently (with both config drive and data
injection enabled).
As in the libvirt/driver.py _create_image method, right after the ISO is
made by cdb.make_drive, the driver will attempt data injection, which will
launch libguestfs in another thread, as the sketch below illustrates.
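
A minimal sketch of that failure mode, independent of Nova -- assuming
Python 2, where pipe fds are inheritable and subprocess defaults to
close_fds=False:

    import os
    import subprocess
    import threading

    r, w = os.pipe()      # Python 2: no CLOEXEC, inheritable by default

    def fork_long_lived_child():
        # Simulates libguestfs launching qemu from another native thread;
        # the child inherits the still-open write end of the pipe.
        subprocess.Popen(['sleep', '1000'])

    t = threading.Thread(target=fork_long_lived_child)
    t.start()
    t.join()

    os.close(w)           # the parent closes its copy of the write end...
    os.read(r, 1)         # ...but blocks forever: 'sleep' still holds w,
                          # so EOF never arrives -- the hang described here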

It looks like there were also a couple of libguestfs hang issues on
Launchpad, listed below. I am not sure whether libguestfs itself could have
some mechanism to free/close the fds inherited from the parent process,
instead of requiring an explicit tear-down call. Maybe open a defect against
libguestfs to see what they think?

https://bugs.launchpad.net/nova/+bug/1286256
https://bugs.launchpad.net/nova/+bug/1270304 



Zhu Zhu
Best Regards
 
From: Qin Zhao
Date: 2014-05-31 01:25
To: OpenStack Development Mailing List (not for usage questions)
Subject: [openstack-dev] [Nova] nova-compute deadlock
Hi all,

When I run Icehouse code, I encountered a strange problem: the nova-compute
service becomes stuck when I boot instances. I reported this bug in
https://bugs.launchpad.net/nova/+bug/1313477.

After thinking about it for several days, I feel I know the root cause. This
bug should be a deadlock caused by pipe fd leaking.  I drew a diagram to
illustrate the problem:
https://docs.google.com/drawings/d/1pItX9urLd6fmjws3BVovXQvRg_qMdTHS-0JhYfSkkVc/pub?w=960&h=720

However, I have not found a very good solution to prevent this deadlock. The
problem involves the Python runtime, libguestfs, and eventlet, so the
situation is a little complicated. Is there any expert who can help me look
for a solution? I would appreciate your help!

-- 
Qin Zhao
___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev