Re: [openstack-dev] [Nova] nova-compute deadlock
On Sat, May 31, 2014 at 01:25:04AM +0800, Qin Zhao wrote:
> Hi all,
>
> When I run Icehouse code, I encountered a strange problem. The
> nova-compute service becomes stuck when I boot instances. I reported this
> bug in https://bugs.launchpad.net/nova/+bug/1313477.
>
> After thinking about it for several days, I believe I know its root
> cause. This bug should be a deadlock caused by pipe fd leaking. I drew a
> diagram to illustrate the problem:
> https://docs.google.com/drawings/d/1pItX9urLd6fmjws3BVovXQvRg_qMdTHS-0JhYfSkkVc/pub?w=960&h=720
>
> However, I have not found a very good solution to prevent this deadlock.
> The problem involves the Python runtime, libguestfs, and eventlet, so the
> situation is a little complicated. Is there any expert who can help me
> look for a solution? I would appreciate your help!

Thanks for the useful diagram.

libguestfs itself is very careful to open all file descriptors with
O_CLOEXEC (atomically, if the OS supports that), so I'm fairly confident
that the bug is in Python 2, not in libguestfs.

Another thing to note is that g.shutdown() sends a kill -9 (SIGKILL) to
the subprocess. Furthermore, you can obtain the qemu PID (g.get_pid())
and send any signal you want to the process.

I wonder if a simpler way to fix this wouldn't be something like adding a
tiny C extension to the Python code that uses pipe2 to open the Python
pipe with O_CLOEXEC atomically? Are we allowed Python extensions in
OpenStack?

BTW, do feel free to CC libgues...@redhat.com on any libguestfs problems
you have. You don't need to subscribe to the list.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat
http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-p2v converts physical machines to virtual machines. Boot with a
live CD or over the network (PXE) and turn machines into KVM guests.
http://libguestfs.org/virt-v2v

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
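The pipe2 idea Rich mentions can be prototyped without writing a compiled C extension by calling libc's pipe2(2) through ctypes. This is only a minimal sketch under the assumption of a Linux libc (the O_CLOEXEC value hard-coded below is the Linux one and is not portable), not Nova code or libguestfs code:

```python
import ctypes
import fcntl
import os

O_CLOEXEC = 0o2000000  # Linux value of O_CLOEXEC (asm-generic/fcntl.h); not portable

def pipe_cloexec():
    """Create a pipe whose two fds are atomically marked close-on-exec."""
    libc = ctypes.CDLL(None, use_errno=True)  # resolve pipe2 from the loaded libc
    fds = (ctypes.c_int * 2)()
    if libc.pipe2(fds, O_CLOEXEC) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return fds[0], fds[1]

r, w = pipe_cloexec()
# FD_CLOEXEC is set at creation time, so a child forked by another thread
# can never inherit these descriptors and hold the pipe open.
r_flags = fcntl.fcntl(r, fcntl.F_GETFD)
w_flags = fcntl.fcntl(w, fcntl.F_GETFD)
```

Because the flag is applied inside the single pipe2() system call, there is no window in which a concurrent fork/exec (e.g. of qemu) can capture the fds, which is the property the thread is after.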
Re: [openstack-dev] [Nova] nova-compute deadlock
Please take a look at
https://docs.python.org/2.7/library/multiprocessing.html#managers --
everything is already implemented there. All you need is to start one
manager that serves all your requests to libguestfs. The stdlib
implementation will give you return values and exceptions with minimal
code changes on the Nova side.

Create a new Manager, register a libguestfs "endpoint" in it, and call
start(). It will spawn a separate process that speaks with the calling
process over a very simple RPC.

From the looks of it, all you need to do is replace the tpool.Proxy calls
in the VFSGuestFS.setup method with calls to this new Manager.

On Thu, Jun 5, 2014 at 7:21 PM, Qin Zhao wrote:

> Hi Yuriy,
>
> Thanks for reading my bug! You are right: Python 3.3 or 3.4 should not
> have this issue, since they can secure the file descriptor. Before
> OpenStack moves to Python 3, though, we may still need a solution.
> Calling libguestfs in a separate process seems to be a way. That way,
> Nova code can close those fds by itself, without depending upon CLOEXEC.
> However, it would be an expensive solution, since it requires a lot of
> code changes. At the least, we need to write code to pass return values
> and exceptions between the two processes, which makes the solution very
> complex. Do you agree?
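The Manager approach described above can be sketched as follows. The names here are illustrative, not Nova's: FakeGuestFS is a stand-in for a real guestfs.GuestFS handle, and the fork start method is assumed (POSIX only):

```python
import multiprocessing
from multiprocessing.managers import BaseManager

class FakeGuestFS(object):
    """Illustrative stand-in for a guestfs.GuestFS handle (not the real API)."""
    def __init__(self):
        self._files = {}

    def write(self, path, content):
        self._files[path] = content

    def cat(self, path):
        return self._files[path]

class GuestFSManager(BaseManager):
    """Manager whose server process owns all libguestfs work."""

# Register the endpoint; callers receive a proxy that speaks simple RPC.
GuestFSManager.register('GuestFS', FakeGuestFS)

manager = GuestFSManager(ctx=multiprocessing.get_context('fork'))
manager.start()  # spawn the server process; qemu would inherit *its* fds, not ours

g = manager.GuestFS()             # proxy object; each call is an RPC round trip
g.write('/etc/hostname', 'vm-1')
restored = g.cat('/etc/hostname')
manager.shutdown()
```

Return values come back over the proxy, and exceptions raised inside the server process are re-raised in the caller, which addresses the concern about shuttling results and errors between the two processes with little extra code.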
Re: [openstack-dev] [Nova] nova-compute deadlock
Hi Yuriy,

Thanks for reading my bug! You are right: Python 3.3 or 3.4 should not
have this issue, since they can secure the file descriptor. Before
OpenStack moves to Python 3, though, we may still need a solution. Calling
libguestfs in a separate process seems to be a way. That way, Nova code
can close those fds by itself, without depending upon CLOEXEC. However, it
would be an expensive solution, since it requires a lot of code changes.
At the least, we need to write code to pass return values and exceptions
between the two processes, which makes the solution very complex. Do you
agree?

On Thu, Jun 5, 2014 at 9:39 PM, Yuriy Taraday wrote:

> This behavior of os.pipe() has changed in Python 3.x, so it won't be an
> issue on newer Python (if only it were accessible to us).
>
> From the looks of it, you can mitigate the problem by running libguestfs
> requests in a separate process (multiprocessing.managers comes to mind).
> That way, the only descriptors a child process could theoretically
> inherit would be the long-lived pipes to the main process, and those
> won't leak, because they should be marked with CLOEXEC before any
> libguestfs request is run. The other benefit is that this separate
> process won't be busy opening and closing tons of fds, so the problem
> with inheriting is avoided.

-- 
Qin Zhao
Re: [openstack-dev] [Nova] nova-compute deadlock
Hi,

Thanks for reading my bug! I think this patch cannot fix the problem for
now, because pipe2() requires Python 3.3.

On Thu, Jun 5, 2014 at 6:17 PM, laserjetyang wrote:

> Will this patch of Python fix your problem?
> http://bugs.python.org/issue7213

-- 
Qin Zhao
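For context on the Python 3 behavior discussed here: issue7213 added os.pipe2() in Python 3.3, and PEP 446 (Python 3.4) went further and made the descriptors returned by plain os.pipe() non-inheritable by default. A quick check, runnable only on Python 3.4 or later:

```python
import fcntl
import os

r, w = os.pipe()  # Python 3.4+ (PEP 446): created non-inheritable by default

# "Non-inheritable" is implemented with close-on-exec (atomically where the
# OS provides pipe2), so an exec'd child such as qemu cannot keep the pipe open.
r_inheritable = os.get_inheritable(r)
cloexec_set = bool(fcntl.fcntl(r, fcntl.F_GETFD) & fcntl.FD_CLOEXEC)
```

This is exactly the fd-securing behavior the thread notes is unavailable to Python 2 deployments.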
Re: [openstack-dev] [Nova] nova-compute deadlock
This behavior of os.pipe() has changed in Python 3.x, so it won't be an
issue on newer Python (if only it were accessible to us).

From the looks of it, you can mitigate the problem by running libguestfs
requests in a separate process (multiprocessing.managers comes to mind).
That way, the only descriptors a child process could theoretically inherit
would be the long-lived pipes to the main process, and those won't leak,
because they should be marked with CLOEXEC before any libguestfs request
is run. The other benefit is that this separate process won't be busy
opening and closing tons of fds, so the problem with inheriting is
avoided.

On Thu, Jun 5, 2014 at 2:17 PM, laserjetyang wrote:

> Will this patch of Python fix your problem?
> http://bugs.python.org/issue7213

-- 
Kind regards, Yuriy.
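Marking the long-lived pipe ends with CLOEXEC, as suggested above, can be done with fcntl. A small sketch, with the caveat that setting the flag after os.pipe() on Python 2 still leaves a short race window (a thread that forks between the two calls can inherit the fds); only an atomic pipe2() closes that window completely:

```python
import fcntl
import os

def mark_cloexec(fd):
    """Set FD_CLOEXEC so the fd is closed automatically across exec()."""
    flags = fcntl.fcntl(fd, fcntl.F_GETFD)
    fcntl.fcntl(fd, fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)

r, w = os.pipe()   # on Python 2 these fds are inheritable by any exec'd child
for fd in (r, w):
    mark_cloexec(fd)  # racy on Python 2, but fine for long-lived fds set up
                      # once, before any libguestfs request runs

flags_r = fcntl.fcntl(r, fcntl.F_GETFD)
flags_w = fcntl.fcntl(w, fcntl.F_GETFD)
```

For the manager's own pipes this is sufficient, since they are created once at startup before any concurrent fork activity begins.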
Re: [openstack-dev] [Nova] nova-compute deadlock
Will this Python patch fix your problem? http://bugs.python.org/issue7213
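For context, issue7213 concerns subprocess leaking inheritable fds into spawned children. A minimal, self-contained sketch (not Nova code) of the behavior that `close_fds=True` gives you -- the child closes every fd above 2, so it cannot keep a pipe alive:

```python
import os
import subprocess

# A pipe whose reader should see EOF as soon as every write end is closed.
r, w = os.pipe()

# Start a long-lived child with close_fds=True, so the child closes all
# fds above 2 and cannot keep a copy of the pipe's write end.  (On the
# Python 2.x this thread targets, close_fds defaulted to False, so every
# spawned command silently inherited fds like w.)
child = subprocess.Popen(["sleep", "5"], close_fds=True)

os.close(w)              # the parent's copy was the only remaining one
data = os.read(r, 1)     # EOF immediately; with a leaked copy in the
                         # child, this read would block for 5 seconds
os.close(r)
child.terminate()
child.wait()
assert data == b""
```

On modern Python 3 this is the default behavior, which is why the patch was seen as the fix for exactly this class of hang.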
Re: [openstack-dev] [Nova] nova-compute deadlock
Hi Zhu Zhu,

Thank you for reading my diagram! I need to clarify that this problem does not occur during data injection. Before creating the ISO, the driver code extends the disk, and libguestfs is invoked in that time frame.

I now think this problem may occur at any time, if the code uses tpool to invoke libguestfs while an external command is executed in another green thread simultaneously. Please correct me if I am wrong.

One simple solution for this issue would be to call the libguestfs routine in a green thread, rather than in another native thread. But that would hurt performance very much, so I do not think it is an acceptable solution.

--
Qin Zhao
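The "may occur at any time" claim can be demonstrated without libguestfs at all: fork(2) copies the whole process-wide fd table, so a fork issued from one thread (standing in here for a green thread spawning an external command) captures pipe fds that a concurrent native thread (standing in for the tpool-launched libguestfs) has just opened. A minimal sketch of the race, assuming a Unix platform:

```python
import os
import threading
import time

fds = []

def opener():
    # Stands in for libguestfs launched from a tpool native thread: it
    # creates pipe fds in the process-wide fd table and keeps them open.
    fds.extend(os.pipe())
    time.sleep(1.0)

t = threading.Thread(target=opener)
t.start()
while not fds:               # wait until the other thread owns the pipe
    time.sleep(0.01)

pid = os.fork()              # stands in for a green thread running an
                             # external command at the wrong moment
if pid == 0:
    try:
        os.fstat(fds[1])     # did we inherit the other thread's write end?
        os._exit(0)          # yes: the fd leaked into this child
    except OSError:
        os._exit(1)

_, status = os.waitpid(pid, 0)
t.join()
assert status == 0           # the child did inherit the leaked fd
```

Nothing synchronizes the two threads' view of the fd table, which is why the window is not limited to data injection.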
Re: [openstack-dev] [Nova] nova-compute deadlock
Hi Qin Zhao,

Thanks for raising this issue and for the analysis. According to the issue description and the scenario in the diagram (https://docs.google.com/drawings/d/1pItX9urLd6fmjws3BVovXQvRg_qMdTHS-0JhYfSkkVc/pub?w=960&h=720), if that is the case, the issue is very likely to happen whenever multiple KVM instances are spawned concurrently with both config drive and data injection enabled. In the libvirt/driver.py _create_image method, right after the ISO is made with "cdb.make_drive", the driver attempts data injection, which launches libguestfs in another thread.

There also appear to have been a couple of libguestfs hang issues on Launchpad:

https://bugs.launchpad.net/nova/+bug/1286256
https://bugs.launchpad.net/nova/+bug/1270304

I am not sure whether libguestfs itself could have some mechanism to free/close the fds inherited from the parent process, instead of requiring an explicit tear-down call. Maybe open a defect against libguestfs to see what they think?

--
Zhu Zhu
Best Regards
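One generic answer to "free the fds without an explicit tear-down" is the close-on-exec flag: an fd marked FD_CLOEXEC is dropped automatically by any exec'd child, with no cooperation from the spawning code. (Python 3.4's PEP 446 later made this the default for every fd Python creates, so on current interpreters the fcntl calls below are redundant.) A sketch of the mechanism, not libguestfs code:

```python
import fcntl
import os
import subprocess

r, w = os.pipe()

# Mark the write end close-on-exec: any exec'd child drops it
# automatically, so no explicit tear-down call is ever needed.
flags = fcntl.fcntl(w, fcntl.F_GETFD)
fcntl.fcntl(w, fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)

# Spawn a long-lived child without asking subprocess to close fds.
child = subprocess.Popen(["sleep", "5"], close_fds=False)
os.close(w)

data = os.read(r, 1)   # EOF at once: exec dropped the child's copy of w
os.close(r)
child.terminate()
child.wait()
assert data == b""
```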
[openstack-dev] [Nova] nova-compute deadlock
Hi all,

When I run Icehouse code, I encounter a strange problem: the nova-compute service becomes stuck when I boot instances. I reported this bug in https://bugs.launchpad.net/nova/+bug/1313477.

After thinking about it for several days, I believe I know its root cause. This bug should be a deadlock caused by pipe fd leaking. I drew a diagram to illustrate the problem: https://docs.google.com/drawings/d/1pItX9urLd6fmjws3BVovXQvRg_qMdTHS-0JhYfSkkVc/pub?w=960&h=720

However, I have not found a good solution to prevent this deadlock. The problem involves the Python runtime, libguestfs, and eventlet, so the situation is a little complicated. Is there any expert who can help me look for a solution? I would appreciate your help!

--
Qin Zhao
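The failure mode in the diagram can be reproduced in a few lines: once a pipe's write end leaks into a long-lived child process, the reader sees neither data nor EOF and a blocking read hangs forever. A minimal sketch of the mechanism (the leak is made deliberate here with `pass_fds`; this is not the Nova code path itself):

```python
import os
import select
import subprocess

# The deadlock mechanism from the bug: a pipe's write end leaks into a
# child process, so the reader never sees EOF even after the parent
# closes its own copy.
r, w = os.pipe()

# Deliberately leak the write end into a long-lived child
# (pass_fds models the accidental inheritance in the bug).
child = subprocess.Popen(["sleep", "5"], pass_fds=(w,))
os.close(w)                      # the parent's copy is gone...

# ...but the read end still sees no EOF: the child's leaked copy keeps
# the pipe open.  A blocking read here would hang -- the deadlock.
ready, _, _ = select.select([r], [], [], 0.3)
assert ready == []               # no data, no EOF: a reader would block

child.terminate()
child.wait()                     # child dies, its fd table is closed
ready, _, _ = select.select([r], [], [], 2.0)
assert ready == [r]              # now EOF finally arrives
eof = os.read(r, 1)
os.close(r)
assert eof == b""
```

In the bug, the "child" is one green thread's external command and the pipe belongs to the tpool-launched libguestfs, so neither side ever exits and the hang is permanent.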