Re: [openstack-dev] [neutron] [nova] non-deterministic gate failures due to unclosed eventlet Timeouts
On Mon, 8 Sep 2014 05:25:22 PM Jay Pipes wrote:
> On 09/07/2014 10:43 AM, Matt Riedemann wrote:
>> On 9/7/2014 8:39 AM, John Schwarz wrote:
>>> Hi,
>>>
>>> Long story short: for future reference, if you initialize an eventlet Timeout, make sure you close it (either with a context manager or simply with timeout.cancel()), and be extra careful when writing tests that use eventlet Timeouts, because these timeouts are not cancelled implicitly and will cause unexpected behaviours (see [1]) such as gate failures. In our case this caused non-deterministic failures in the dsvm-functional test suite.
>>>
>>> Late last week, a bug was found ([2]) in which an eventlet Timeout object was initialized but not closed. The instance was left inside eventlet's inner workings and triggered non-deterministic "Timeout: 10 seconds" errors and failures in the dsvm-functional tests.
>>>
>>> As mentioned earlier, initializing a new eventlet.timeout.Timeout instance also registers it with mechanisms inside the library, and the reference remains there until it is explicitly removed (not merely until the enclosing function's scope ends, as some might expect). Thus the old code, which simply created an instance without assigning it to a variable, left no way to close the timeout object. The reference remains for the life of the worker, so it can (and did) affect other tests and procedures using eventlet in the same process. Obviously this could easily affect production-grade systems under very high load.
>>>
>>> For future reference:
>>>
>>> 1) If you run into a "Timeout: %d seconds" exception whose traceback includes hub.switch() and self.greenlet.switch() calls, there might be a latent Timeout somewhere in the code, and a search for all eventlet.timeout.Timeout instances will probably turn up the culprit.
>>>
>>> 2) The setup used to reproduce this error for debugging purposes is a baremetal machine running a VM with devstack. On the baremetal machine I ran about six "dd if=/dev/zero of=/dev/null" processes to simulate high CPU load (the full command can be found at [3]), and in the VM I ran the dsvm-functional suite. Simulating similar CPU load inside the VM alone fails to reproduce the failure.
>>>
>>> [1] http://eventlet.net/doc/modules/timeout.html#eventlet.timeout.eventlet.timeout.Timeout.Timeout.cancel
>>> [2] https://review.openstack.org/#/c/119001/
>>> [3] http://stackoverflow.com/questions/2925606/how-to-create-a-cpu-spike-with-a-bash-command
>>>
>>> --
>>> John Schwarz,
>>> Software Engineer, Red Hat.
>>
>> Thanks, that might be what's causing this timeout/gate failure in the nova unit tests. [1]
>>
>> [1] https://bugs.launchpad.net/nova/+bug/1357578
>
> Indeed, there are a couple of places where eventlet.timeout.Timeout() seems to be used in the test suite without a context manager or an explicit cancel() call:
>
> tests/virt/libvirt/test_driver.py
> 8925:raise eventlet.timeout.Timeout()
>
> tests/virt/hyperv/test_vmops.py
> 196:mock_with_timeout.side_effect = etimeout.Timeout()

If it's useful for anyone, I wrote a quick pylint test that will catch all the above cases of misused context managers. (Indeed, it will currently trigger on the raise Timeout() case, which is probably too eager but can be disabled in the usual #pylint meta-comment way.)

Here: https://review.openstack.org/#/c/120320/

--
 - Gus

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
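The pylint check mentioned above lives in the linked review and is not reproduced here. As a rough, independent sketch of the same idea, the snippet below uses only Python's standard ast module to flag Timeout(...) calls whose result is discarded, which is the pattern behind the dsvm-functional failures. The names find_discarded_timeouts and SAMPLE are hypothetical, and unlike the check described above this sketch does not flag the raise Timeout() case.

```python
import ast


def _call_name(call):
    """Return the dotted name being called, e.g. 'eventlet.timeout.Timeout'."""
    parts = []
    func = call.func
    while isinstance(func, ast.Attribute):
        parts.append(func.attr)
        func = func.value
    if isinstance(func, ast.Name):
        parts.append(func.id)
        return ".".join(reversed(parts))
    return None  # e.g. a call made on a subscript or on another call's result


def find_discarded_timeouts(source, filename="<string>"):
    """Return sorted (line, name) pairs for Timeout(...) calls used as bare statements."""
    hits = []
    for node in ast.walk(ast.parse(source, filename=filename)):
        # An ast.Expr wrapping an ast.Call is a call whose result is thrown away.
        if isinstance(node, ast.Expr) and isinstance(node.value, ast.Call):
            name = _call_name(node.value)
            if name and name.split(".")[-1] == "Timeout":
                hits.append((node.lineno, name))
    return sorted(hits)


# Hypothetical input: one leaked Timeout, one safe use, one raise.
SAMPLE = '''\
import eventlet

def leaky():
    eventlet.timeout.Timeout(10)

def safe():
    with eventlet.timeout.Timeout(10):
        pass

def raises():
    raise eventlet.timeout.Timeout()
'''

if __name__ == "__main__":
    for line, name in find_discarded_timeouts(SAMPLE):
        print("line %d: result of %s(...) is discarded" % (line, name))
```

Only the call inside leaky() is reported; the with block and the raise statement are different AST node types and are left alone.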
Re: [openstack-dev] [neutron] [nova] non-deterministic gate failures due to unclosed eventlet Timeouts
Good catch John, and good work Angus! ;) This will save a lot of headaches.
Re: [openstack-dev] [neutron] [nova] non-deterministic gate failures due to unclosed eventlet Timeouts
On Mon, 2014-09-08 at 17:25 -0400, Jay Pipes wrote:
>> Thanks, that might be what's causing this timeout/gate failure in the nova unit tests. [1]
>>
>> [1] https://bugs.launchpad.net/nova/+bug/1357578
>
> Indeed, there are a couple of places where eventlet.timeout.Timeout() seems to be used in the test suite without a context manager or an explicit cancel() call:
>
> tests/virt/libvirt/test_driver.py
> 8925:raise eventlet.timeout.Timeout()
>
> tests/virt/hyperv/test_vmops.py
> 196:mock_with_timeout.side_effect = etimeout.Timeout()

I looked into that too, but the docs for Timeout indicate that it's an Exception subclass, and passing it no args doesn't seem to start the timer running. I think you have to explicitly pass a duration value for Timeout to enable its timeout behavior, but that's just a guess on my part at this point…

--
Kevin L. Mitchell kevin.mitch...@rackspace.com
Rackspace
Re: [openstack-dev] [neutron] [nova] non-deterministic gate failures due to unclosed eventlet Timeouts
On 09/07/2014 10:43 AM, Matt Riedemann wrote:
> Thanks, that might be what's causing this timeout/gate failure in the nova unit tests. [1]
>
> [1] https://bugs.launchpad.net/nova/+bug/1357578

Indeed, there are a couple of places where eventlet.timeout.Timeout() seems to be used in the test suite without a context manager or an explicit cancel() call:

tests/virt/libvirt/test_driver.py
8925:raise eventlet.timeout.Timeout()

tests/virt/hyperv/test_vmops.py
196:mock_with_timeout.side_effect = etimeout.Timeout()

Best,
-jay
Re: [openstack-dev] [neutron] [nova] non-deterministic gate failures due to unclosed eventlet Timeouts
On 9/7/2014 8:39 AM, John Schwarz wrote:
> Long story short: for future reference, if you initialize an eventlet Timeout, make sure you close it (either with a context manager or simply with timeout.cancel()), and be extra careful when writing tests that use eventlet Timeouts, because these timeouts are not cancelled implicitly and will cause unexpected behaviours such as gate failures.

Thanks, that might be what's causing this timeout/gate failure in the nova unit tests. [1]

[1] https://bugs.launchpad.net/nova/+bug/1357578

--
Thanks,
Matt Riedemann