Re: [openstack-dev] [gate] The gate: a failure analysis

2014-07-28 Thread Angus Lees
On Mon, 21 Jul 2014 04:39:49 PM David Kranz wrote:
 On 07/21/2014 04:13 PM, Jay Pipes wrote:
  On 07/21/2014 02:03 PM, Clint Byrum wrote:
  Thanks Matthew for the analysis.
  
  I think you missed something though.
  
  Right now the frustration is that unrelated intermittent bugs stop your
  presumably good change from getting in.
  
  Without gating, the result would be that even more bugs, many of them
  not
  intermittent at all, would get in. Right now, the one random developer
  who has to hunt down the rechecks and do them is inconvenienced. But
  without a gate, _every single_ developer will be inconvenienced until
  the fix is merged.
  
  The false negative rate is _way_ too high. Nobody would disagree there.
  However, adding more false negatives and allowing more people to ignore
  the ones we already have, seems like it would have the opposite effect:
  Now instead of annoying the people who hit the random intermittent bugs,
  we'll be annoying _everybody_ as they hit the non-intermittent ones.
  
  +10
 
 Right, but perhaps there is a middle ground. We must not allow changes
 in that can't pass through the gate, but we can separate the problems
 of constant rechecks using too many resources, and of constant rechecks
 causing developer pain. If failures were deterministic we would skip the
 failing tests until they were fixed. Unfortunately many of the common
 failures can blow up any test, or even the whole process. Following on
 what Sam said, what if we automatically reran jobs that failed in a
 known way, and disallowed "recheck/reverify no bug"? Developers would
 then have to track down what bug caused a failure or file a new one. But
 they would have to do so much less frequently, and as more common
 failures were catalogued it would become less and less frequent.
 
 Some might (reasonably) argue that this would be a bad thing because it
 would reduce the incentive for people to fix bugs if there were less
 pain being inflicted. But given how hard it is to track down these race
 bugs, and that we as a community have no way to force time to be spent
 on them, and that it does not appear that these bugs are causing real
 systems to fall down (only our gating process), perhaps something
 different should be considered?

So to pick an example dear to my heart, I've been working on removing these 
gate failures:
http://logstash.openstack.org/#eyJzZWFyY2giOiJcIkxvY2sgd2FpdCB0aW1lb3V0IGV4Y2VlZGVkOyB0cnkgcmVzdGFydGluZyB0cmFuc2FjdGlvblwiIiwiZmllbGRzIjpbXSwib2Zmc2V0IjowLCJ0aW1lZnJhbWUiOiI2MDQ4MDAiLCJncmFwaG1vZGUiOiJjb3VudCIsInRpbWUiOnsidXNlcl9pbnRlcnZhbCI6MH0sInN0YW1wIjoxNDA2NTI3OTA3NzkzfQ==

.. caused by a bad interaction between eventlet and our default choice of 
mysql driver.  It would also affect any real world deployment using mysql.

The problem has been identified and the fix proposed for almost a month now, 
but actually fixing the gate jobs is still nowhere in sight.  The fix is (pretty 
much) as easy as a pip install and a slightly modified database connection 
string.
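
Concretely, the change is along these lines (illustrative values only: the
credentials and engine setup below are invented, and I'm assuming the
replacement driver is the pure-Python MySQL Connector/Python discussed
elsewhere in this thread):

    # pip install the alternative driver, then point SQLAlchemy at its dialect.
    from sqlalchemy import create_engine

    # Old: the default MySQLdb C driver, the one implicated in the
    # "Lock wait timeout exceeded" failures under eventlet.
    # engine = create_engine("mysql://nova:secret@127.0.0.1/nova")

    # New: same database, different driver selected via the connection string.
    engine = create_engine("mysql+mysqlconnector://nova:secret@127.0.0.1/nova")
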
I look forward to a discussion of the meta-issues surrounding this, but it is 
not because no one tracked down or fixed the bug :(

 - Gus


Re: [openstack-dev] [gate] The gate: a failure analysis

2014-07-28 Thread Ihar Hrachyshka

On 28/07/14 08:52, Angus Lees wrote:
 On Mon, 21 Jul 2014 04:39:49 PM David Kranz wrote:
 On 07/21/2014 04:13 PM, Jay Pipes wrote:
 On 07/21/2014 02:03 PM, Clint Byrum wrote:
 Thanks Matthew for the analysis.
 
 I think you missed something though.
 
 Right now the frustration is that unrelated intermittent bugs
 stop your presumably good change from getting in.
 
 Without gating, the result would be that even more bugs, many
 of them not intermittent at all, would get in. Right now, the
 one random developer who has to hunt down the rechecks and do
 them is inconvenienced. But without a gate, _every single_
 developer will be inconvenienced until the fix is merged.
 
 The false negative rate is _way_ too high. Nobody would
 disagree there. However, adding more false negatives and
 allowing more people to ignore the ones we already have,
 seems like it would have the opposite effect: Now instead of
 annoying the people who hit the random intermittent bugs, 
 we'll be annoying _everybody_ as they hit the
 non-intermittent ones.
 
 +10
 
 Right, but perhaps there is a middle ground. We must not allow
 changes in that can't pass through the gate, but we can separate
 the problems of constant rechecks using too many resources, and
 of constant rechecks causing developer pain. If failures were
 deterministic we would skip the failing tests until they were
 fixed. Unfortunately many of the common failures can blow up any
 test, or even the whole process. Following on what Sam said, what
 if we automatically reran jobs that failed in a known way, and
 disallowed "recheck/reverify no bug"? Developers would then have
 to track down what bug caused a failure or file a new one. But 
 they would have to do so much less frequently, and as more
 common failures were catalogued it would become less and less
 frequent.
 
 Some might (reasonably) argue that this would be a bad thing
 because it would reduce the incentive for people to fix bugs if
 there were less pain being inflicted. But given how hard it is to
 track down these race bugs, and that we as a community have no
 way to force time to be spent on them, and that it does not
 appear that these bugs are causing real systems to fall down
 (only our gating process), perhaps something different should be
 considered?
 
 So to pick an example dear to my heart, I've been working on
 removing these gate failures: 
 http://logstash.openstack.org/#eyJzZWFyY2giOiJcIkxvY2sgd2FpdCB0aW1lb3V0IGV4Y2VlZGVkOyB0cnkgcmVzdGFydGluZyB0cmFuc2FjdGlvblwiIiwiZmllbGRzIjpbXSwib2Zmc2V0IjowLCJ0aW1lZnJhbWUiOiI2MDQ4MDAiLCJncmFwaG1vZGUiOiJjb3VudCIsInRpbWUiOnsidXNlcl9pbnRlcnZhbCI6MH0sInN0YW1wIjoxNDA2NTI3OTA3NzkzfQ==

  .. caused by a bad interaction between eventlet and our default
 choice of mysql driver.  It would also affect any real world
 deployment using mysql.
 
 The problem has been identified and the fix proposed for almost a
 month now, but actually fixing the gate jobs is still nowhere in
 sight.  The fix is (pretty much) as easy as a pip install and a
 slightly modified database connection string.

[And to hijack the thread even more] The fix Angus has kindly
referred to is:

- spec: https://review.openstack.org/#/c/108355/
- devstack: https://review.openstack.org/#/c/105209/ (plus several
tiny fixes in multiple projects to make sure the patch succeeds in db
migration).

 I look forward to a discussion of the meta-issues surrounding this,
 but it is not because no one tracked down or fixed the bug :(
 
 - Gus
 

Re: [openstack-dev] [gate] The gate: a failure analysis

2014-07-28 Thread Doug Hellmann

On Jul 28, 2014, at 2:52 AM, Angus Lees g...@inodes.org wrote:

 On Mon, 21 Jul 2014 04:39:49 PM David Kranz wrote:
 On 07/21/2014 04:13 PM, Jay Pipes wrote:
 On 07/21/2014 02:03 PM, Clint Byrum wrote:
 Thanks Matthew for the analysis.
 
 I think you missed something though.
 
 Right now the frustration is that unrelated intermittent bugs stop your
 presumably good change from getting in.
 
 Without gating, the result would be that even more bugs, many of them
 not
 intermittent at all, would get in. Right now, the one random developer
 who has to hunt down the rechecks and do them is inconvenienced. But
 without a gate, _every single_ developer will be inconvenienced until
 the fix is merged.
 
 The false negative rate is _way_ too high. Nobody would disagree there.
 However, adding more false negatives and allowing more people to ignore
 the ones we already have, seems like it would have the opposite effect:
 Now instead of annoying the people who hit the random intermittent bugs,
 we'll be annoying _everybody_ as they hit the non-intermittent ones.
 
 +10
 
 Right, but perhaps there is a middle ground. We must not allow changes
 in that can't pass through the gate, but we can separate the problems
 of constant rechecks using too many resources, and of constant rechecks
 causing developer pain. If failures were deterministic we would skip the
 failing tests until they were fixed. Unfortunately many of the common
 failures can blow up any test, or even the whole process. Following on
 what Sam said, what if we automatically reran jobs that failed in a
 known way, and disallowed "recheck/reverify no bug"? Developers would
 then have to track down what bug caused a failure or file a new one. But
 they would have to do so much less frequently, and as more common
 failures were catalogued it would become less and less frequent.
 
 Some might (reasonably) argue that this would be a bad thing because it
 would reduce the incentive for people to fix bugs if there were less
 pain being inflicted. But given how hard it is to track down these race
 bugs, and that we as a community have no way to force time to be spent
 on them, and that it does not appear that these bugs are causing real
 systems to fall down (only our gating process), perhaps something
 different should be considered?
 
 So to pick an example dear to my heart, I've been working on removing these 
 gate failures:
 http://logstash.openstack.org/#eyJzZWFyY2giOiJcIkxvY2sgd2FpdCB0aW1lb3V0IGV4Y2VlZGVkOyB0cnkgcmVzdGFydGluZyB0cmFuc2FjdGlvblwiIiwiZmllbGRzIjpbXSwib2Zmc2V0IjowLCJ0aW1lZnJhbWUiOiI2MDQ4MDAiLCJncmFwaG1vZGUiOiJjb3VudCIsInRpbWUiOnsidXNlcl9pbnRlcnZhbCI6MH0sInN0YW1wIjoxNDA2NTI3OTA3NzkzfQ==
 
 .. caused by a bad interaction between eventlet and our default choice of 
 mysql driver.  It would also affect any real world deployment using mysql.
 
 The problem has been identified and the fix proposed for almost a month now, 
 but actually fixing the gate jobs is still nowhere in sight.  The fix is (pretty 
 much) as easy as a pip install and a slightly modified database connection 
 string.
 I look forward to a discussion of the meta-issues surrounding this, but it is 
 not because no one tracked down or fixed the bug :(

I believe the main blocking issue right now is that Oracle doesn’t upload that 
library to PyPI, and so our build-chain won’t be able to download it as it is 
currently configured. I think the last I saw, someone was going to talk to 
Oracle about uploading the source. Have we heard back?

Doug

 

Re: [openstack-dev] [gate] The gate: a failure analysis

2014-07-28 Thread Ihar Hrachyshka

On 28/07/14 16:22, Doug Hellmann wrote:
 
 On Jul 28, 2014, at 2:52 AM, Angus Lees g...@inodes.org wrote:
 
 On Mon, 21 Jul 2014 04:39:49 PM David Kranz wrote:
 On 07/21/2014 04:13 PM, Jay Pipes wrote:
 On 07/21/2014 02:03 PM, Clint Byrum wrote:
 Thanks Matthew for the analysis.
 
 I think you missed something though.
 
 Right now the frustration is that unrelated intermittent
 bugs stop your presumably good change from getting in.
 
 Without gating, the result would be that even more bugs,
 many of them not intermittent at all, would get in. Right
 now, the one random developer who has to hunt down the
 rechecks and do them is inconvenienced. But without a gate,
 _every single_ developer will be inconvenienced until the
 fix is merged.
 
 The false negative rate is _way_ too high. Nobody would
 disagree there. However, adding more false negatives and
 allowing more people to ignore the ones we already have,
 seems like it would have the opposite effect: Now instead
 of annoying the people who hit the random intermittent
 bugs, we'll be annoying _everybody_ as they hit the
 non-intermittent ones.
 
 +10
 
 Right, but perhaps there is a middle ground. We must not allow
 changes in that can't pass through the gate, but we can
 separate the problems of constant rechecks using too many
 resources, and of constant rechecks causing developer pain. If
 failures were deterministic we would skip the failing tests
 until they were fixed. Unfortunately many of the common 
 failures can blow up any test, or even the whole process.
 Following on what Sam said, what if we automatically reran jobs
 that failed in a known way, and disallowed "recheck/reverify no bug"?
 Developers would then have to track down what bug caused
 a failure or file a new one. But they would have to do so much
 less frequently, and as more common failures were catalogued it
 would become less and less frequent.
 
 Some might (reasonably) argue that this would be a bad thing
 because it would reduce the incentive for people to fix bugs if
 there were less pain being inflicted. But given how hard it is
 to track down these race bugs, and that we as a community have
 no way to force time to be spent on them, and that it does not
 appear that these bugs are causing real systems to fall down
 (only our gating process), perhaps something different should
 be considered?
 
 So to pick an example dear to my heart, I've been working on
 removing these gate failures: 
 http://logstash.openstack.org/#eyJzZWFyY2giOiJcIkxvY2sgd2FpdCB0aW1lb3V0IGV4Y2VlZGVkOyB0cnkgcmVzdGFydGluZyB0cmFuc2FjdGlvblwiIiwiZmllbGRzIjpbXSwib2Zmc2V0IjowLCJ0aW1lZnJhbWUiOiI2MDQ4MDAiLCJncmFwaG1vZGUiOiJjb3VudCIsInRpbWUiOnsidXNlcl9pbnRlcnZhbCI6MH0sInN0YW1wIjoxNDA2NTI3OTA3NzkzfQ==


 
.. caused by a bad interaction between eventlet and our default choice of
 mysql driver.  It would also affect any real world deployment
 using mysql.
 
 The problem has been identified and the fix proposed for almost a
 month now, but actually fixing the gate jobs is still nowhere in
 sight.  The fix is (pretty much) as easy as a pip install and a
 slightly modified database connection string. I look forward to a
 discussion of the meta-issues surrounding this, but it is not
 because no one tracked down or fixed the bug :(
 
 I believe the main blocking issue right now is that Oracle doesn’t
 upload that library to PyPI, and so our build-chain won’t be able
 to download it as it is currently configured. I think the last I
 saw someone was going to talk to Oracle about uploading the source.
 Have we heard back?

Yes, the guy in charge of the module told me he's working on
publishing it on PyPI. I guess it's just a matter of more push from
our side, and we'll be able to clean that up in a timely manner.

/Ihar



Re: [openstack-dev] [gate] The gate: a failure analysis

2014-07-28 Thread Angus Lees
On Mon, 28 Jul 2014 10:22:07 AM Doug Hellmann wrote:
 On Jul 28, 2014, at 2:52 AM, Angus Lees g...@inodes.org wrote:
  On Mon, 21 Jul 2014 04:39:49 PM David Kranz wrote:
  On 07/21/2014 04:13 PM, Jay Pipes wrote:
  On 07/21/2014 02:03 PM, Clint Byrum wrote:
  Thanks Matthew for the analysis.
  
  I think you missed something though.
  
  Right now the frustration is that unrelated intermittent bugs stop your
  presumably good change from getting in.
  
  Without gating, the result would be that even more bugs, many of them
  not
  intermittent at all, would get in. Right now, the one random developer
  who has to hunt down the rechecks and do them is inconvenienced. But
  without a gate, _every single_ developer will be inconvenienced until
  the fix is merged.
  
  The false negative rate is _way_ too high. Nobody would disagree there.
  However, adding more false negatives and allowing more people to ignore
  the ones we already have, seems like it would have the opposite effect:
  Now instead of annoying the people who hit the random intermittent
  bugs,
  we'll be annoying _everybody_ as they hit the non-intermittent ones.
  
  +10
  
  Right, but perhaps there is a middle ground. We must not allow changes
  in that can't pass through the gate, but we can separate the problems
  of constant rechecks using too many resources, and of constant rechecks
  causing developer pain. If failures were deterministic we would skip the
  failing tests until they were fixed. Unfortunately many of the common
  failures can blow up any test, or even the whole process. Following on
  what Sam said, what if we automatically reran jobs that failed in a
  known way, and disallowed "recheck/reverify no bug"? Developers would
  then have to track down what bug caused a failure or file a new one. But
  they would have to do so much less frequently, and as more common
  failures were catalogued it would become less and less frequent.
  
  Some might (reasonably) argue that this would be a bad thing because it
  would reduce the incentive for people to fix bugs if there were less
  pain being inflicted. But given how hard it is to track down these race
  bugs, and that we as a community have no way to force time to be spent
  on them, and that it does not appear that these bugs are causing real
  systems to fall down (only our gating process), perhaps something
  different should be considered?
  
  So to pick an example dear to my heart, I've been working on removing
  these
  gate failures:
  http://logstash.openstack.org/#eyJzZWFyY2giOiJcIkxvY2sgd2FpdCB0aW1lb3V0IGV4Y2VlZGVkOyB0cnkgcmVzdGFydGluZyB0cmFuc2FjdGlvblwiIiwiZmllbGRzIjpbXSwib2Zmc2V0IjowLCJ0aW1lZnJhbWUiOiI2MDQ4MDAiLCJncmFwaG1vZGUiOiJjb3VudCIsInRpbWUiOnsidXNlcl9pbnRlcnZhbCI6MH0sInN0YW1wIjoxNDA2NTI3OTA3NzkzfQ==
  
  .. caused by a bad interaction between eventlet and our default choice of
  mysql driver.  It would also affect any real world deployment using mysql.
  
  The problem has been identified and the fix proposed for almost a month
  now, but actually fixing the gate jobs is still nowhere in sight.  The
  fix is (pretty much) as easy as a pip install and a slightly modified
  database connection string.
  I look forward to a discussion of the meta-issues surrounding this, but it
  is not because no one tracked down or fixed the bug :(
 
 I believe the main blocking issue right now is that Oracle doesn’t upload
 that library to PyPI, and so our build-chain won’t be able to download it
 as it is currently configured. I think the last I saw someone was going to
 talk to Oracle about uploading the source. Have we heard back?

Yes, positive conversations are underway and we'll get there eventually.  My 
point was also about apparent priorities, however.  If addressing gate 
failures was *urgent*, we wouldn't wait for such a conversation to complete 
before making our own workarounds(*).  I don't feel we (as a group) are 
sufficiently terrified of false negatives.

(*) Indeed, the affected devstack gate tests install mysqlconnector via 
debs/rpms.  I think only the oslo.db opportunistic tests talk to mysql via 
pip-installed packages, and these don't also use eventlet.

 - Gus

 

Re: [openstack-dev] [gate] The gate: a failure analysis

2014-07-28 Thread Monty Taylor

On 07/28/2014 02:32 PM, Angus Lees wrote:

On Mon, 28 Jul 2014 10:22:07 AM Doug Hellmann wrote:

On Jul 28, 2014, at 2:52 AM, Angus Lees g...@inodes.org wrote:

On Mon, 21 Jul 2014 04:39:49 PM David Kranz wrote:

On 07/21/2014 04:13 PM, Jay Pipes wrote:

On 07/21/2014 02:03 PM, Clint Byrum wrote:

Thanks Matthew for the analysis.

I think you missed something though.

Right now the frustration is that unrelated intermittent bugs stop your
presumably good change from getting in.

Without gating, the result would be that even more bugs, many of them
not
intermittent at all, would get in. Right now, the one random developer
who has to hunt down the rechecks and do them is inconvenienced. But
without a gate, _every single_ developer will be inconvenienced until
the fix is merged.

The false negative rate is _way_ too high. Nobody would disagree there.
However, adding more false negatives and allowing more people to ignore
the ones we already have, seems like it would have the opposite effect:
Now instead of annoying the people who hit the random intermittent
bugs,
we'll be annoying _everybody_ as they hit the non-intermittent ones.


+10


Right, but perhaps there is a middle ground. We must not allow changes
in that can't pass through the gate, but we can separate the problems
of constant rechecks using too many resources, and of constant rechecks
causing developer pain. If failures were deterministic we would skip the
failing tests until they were fixed. Unfortunately many of the common
failures can blow up any test, or even the whole process. Following on
what Sam said, what if we automatically reran jobs that failed in a
known way, and disallowed "recheck/reverify no bug"? Developers would
then have to track down what bug caused a failure or file a new one. But
they would have to do so much less frequently, and as more common
failures were catalogued it would become less and less frequent.

Some might (reasonably) argue that this would be a bad thing because it
would reduce the incentive for people to fix bugs if there were less
pain being inflicted. But given how hard it is to track down these race
bugs, and that we as a community have no way to force time to be spent
on them, and that it does not appear that these bugs are causing real
systems to fall down (only our gating process), perhaps something
different should be considered?


So to pick an example dear to my heart, I've been working on removing
these
gate failures:
http://logstash.openstack.org/#eyJzZWFyY2giOiJcIkxvY2sgd2FpdCB0aW1lb3V0IGV4Y2VlZGVkOyB0cnkgcmVzdGFydGluZyB0cmFuc2FjdGlvblwiIiwiZmllbGRzIjpbXSwib2Zmc2V0IjowLCJ0aW1lZnJhbWUiOiI2MDQ4MDAiLCJncmFwaG1vZGUiOiJjb3VudCIsInRpbWUiOnsidXNlcl9pbnRlcnZhbCI6MH0sInN0YW1wIjoxNDA2NTI3OTA3NzkzfQ==

.. caused by a bad interaction between eventlet and our default choice of
mysql driver.  It would also affect any real world deployment using mysql.

The problem has been identified and the fix proposed for almost a month
now, but actually fixing the gate jobs is still nowhere in sight.  The
fix is (pretty much) as easy as a pip install and a slightly modified
database connection string.
I look forward to a discussion of the meta-issues surrounding this, but it
is not because no one tracked down or fixed the bug :(


I believe the main blocking issue right now is that Oracle doesn’t upload
that library to PyPI, and so our build-chain won’t be able to download it
as it is currently configured. I think the last I saw someone was going to
talk to Oracle about uploading the source. Have we heard back?


Yes, positive conversations are underway and we'll get there eventually.  My
point was also about apparent priorities, however.  If addressing gate
failures was *urgent*, we wouldn't wait for such a conversation to complete
before making our own workarounds(*).  I don't feel we (as a group) are
sufficiently terrified of false negatives.

(*) Indeed, the affected devstack gate tests install mysqlconnector via
debs/rpms.  I think only the oslo.db opportunistic tests talk to mysql via
pip-installed packages, and these don't also use eventlet.


Honestly, I think devstack installing it from apt/yum is fine.




Re: [openstack-dev] [gate] The gate: a failure analysis

2014-07-22 Thread Chris Friesen

On 07/21/2014 12:03 PM, Clint Byrum wrote:

Thanks Matthew for the analysis.

I think you missed something though.

Right now the frustration is that unrelated intermittent bugs stop your
presumably good change from getting in.

Without gating, the result would be that even more bugs, many of them not
intermittent at all, would get in. Right now, the one random developer
who has to hunt down the rechecks and do them is inconvenienced. But
without a gate, _every single_ developer will be inconvenienced until
the fix is merged.


The problem I see with this is that it's fundamentally not a fair system.

If someone is trying to fix a bug in the libvirt driver, it's wrong to 
expect them to try to debug issues with neutron being unstable.  They 
likely don't have the skillset to do it, and we shouldn't expect them to 
do so.  It's a waste of developer time.


Chris



Re: [openstack-dev] [gate] The gate: a failure analysis

2014-07-22 Thread Jay Pipes

On 07/22/2014 10:48 AM, Chris Friesen wrote:

On 07/21/2014 12:03 PM, Clint Byrum wrote:

Thanks Matthew for the analysis.

I think you missed something though.

Right now the frustration is that unrelated intermittent bugs stop your
presumably good change from getting in.

Without gating, the result would be that even more bugs, many of them not
intermittent at all, would get in. Right now, the one random developer
who has to hunt down the rechecks and do them is inconvenienced. But
without a gate, _every single_ developer will be inconvenienced until
the fix is merged.


The problem I see with this is that it's fundamentally not a fair system.

If someone is trying to fix a bug in the libvirt driver, it's wrong to
expect them to try to debug issues with neutron being unstable.  They
likely don't have the skillset to do it, and we shouldn't expect them to
do so.  It's a waste of developer time.


Who is expecting the developer to debug issues with Neutron? It may be a 
waste of developer time to constantly recheck certain bugs (or no bug), 
but nobody is saying to the contributor of a libvirt fix "Hey, this 
unrelated Neutron bug is causing a failure, so go fix it."


The point of the gate is specifically to provide the sort of rigidity 
that unfortunately manifests itself in discomfort from developers. 
Perhaps you don't have the history of when we had no strict gate, and it 
was a frequent source of frustration that code would sail through to 
master that would routinely break master and branches of other OpenStack 
projects. I, for one, don't want to revisit the bad old days. As much of 
a pain as it is, the gate failures are a thorn in the side of folks 
precisely to push folks to fix the valid bugs that they highlight. What 
we need, like Sean said, is more folks fixing bugs and fewer folks 
working on features and vendor drivers.


Perhaps we, as a community, should make the bug triaging and fixing days 
a much more common thing? Maybe make Thursdays or Fridays dedicated bug 
days? How about monetary bug bounties being paid out by the OpenStack 
Foundation, with a payout scale based on the bug severity and 
importance? How about having dedicated bug-squashing teams that focus on 
a particular area of the code, that share their status reports at weekly 
meetings and on the ML?


best,
-jay



Re: [openstack-dev] [gate] The gate: a failure analysis

2014-07-22 Thread Sean Dague
On 07/22/2014 11:51 AM, Jay Pipes wrote:
 On 07/22/2014 10:48 AM, Chris Friesen wrote:
 On 07/21/2014 12:03 PM, Clint Byrum wrote:
 Thanks Matthew for the analysis.

 I think you missed something though.

 Right now the frustration is that unrelated intermittent bugs stop your
 presumably good change from getting in.

 Without gating, the result would be that even more bugs, many of them
 not
 intermittent at all, would get in. Right now, the one random developer
 who has to hunt down the rechecks and do them is inconvenienced. But
 without a gate, _every single_ developer will be inconvenienced until
 the fix is merged.

 The problem I see with this is that it's fundamentally not a fair system.

 If someone is trying to fix a bug in the libvirt driver, it's wrong to
 expect them to try to debug issues with neutron being unstable.  They
 likely don't have the skillset to do it, and we shouldn't expect them to
 do so.  It's a waste of developer time.
 
 Who is expecting the developer to debug issues with Neutron? It may be a
 waste of developer time to constantly recheck certain bugs (or no bug),
 but nobody is saying to the contributor of a libvirt fix "Hey, this
 unrelated Neutron bug is causing a failure, so go fix it."
 
 The point of the gate is specifically to provide the sort of rigidity
 that unfortunately manifests itself in discomfort from developers.
 Perhaps you don't have the history of when we had no strict gate, and it
 was a frequent source of frustration that code would sail through to
 master that would routinely break master and branches of other OpenStack
 projects. I, for one, don't want to revisit the bad old days. As much of
 a pain as it is, the gate failures are a thorn in the side of folks
 precisely to push folks to fix the valid bugs that they highlight. What
 we need, like Sean said, is more folks fixing bugs and fewer folks
 working on features and vendor drivers.
 
 Perhaps we, as a community, should make the bug triaging and fixing days
 a much more common thing? Maybe make Thursdays or Fridays dedicated bug
 days? How about monetary bug bounties being paid out by the OpenStack
 Foundation, with a payout scale based on the bug severity and
 importance? How about having dedicated bug-squashing teams that focus on
 a particular area of the code, that share their status reports at weekly
 meetings and on the ML?

Something that's somewhat relevant to this discussion is one that we had
last week in Darmstadt at the Infra / QA Sprint; it even has a pretty
picture (#notverypretty) - https://dague.net/2014/07/22/openstack-failures/

I think fairness is one of those things that's hard to figure out here.
Because while it might not seem fair to a developer that they can't land
their patch, let's consider the alternative, where we turned off all the
testing (or limited it to only things we were 100% sure would not false
negative). In that environment the review teams would have to be far
more careful about what they approved, as there was no backstop. Which
means I'd expect the review queue to grow by many integer multiples, and
land time for patches to actually increase.

An alternative to the current space of "man, it's annoying that my patch
gets killed by bugs sometimes" isn't "yay, I'm landing all the codes!",
it's probably "hmmm, how do I get anyone to look at my code? It's been
up for review for 6 months." Especially for newer developers without a
track record who haven't built up trust.

This is basically what you see in Linux. We could always evolve the
community in that direction, but I'm not sure it's what people actually
want. But in Linux if you show up as a new person the chance of anyone
reviewing your code is effectively 0%.

Every systemic change we've ever had to the gating system has 2nd and
3rd order effects, some we predict, and some we don't. Aren't emergent
systems fun? :)

For instance, when we implemented "clean check", which demonstrably
decreased the gate queue length during rush times, many people now felt
like the system was punishing them because their code had to make more
round trips in the system. But so does everyone else's, which means some
really dubious behavior by some of the core teams (approving code that
hadn't been tested recently) was now blocked. That was one of the
contributing factors to the January backup. So while it means that if
you hit a bug, your patch spends longer in the system, it actually means if
you don't, it is less likely to be stuck behind a ton of other failing code.

-Sean

-- 
Sean Dague
http://dague.net



[openstack-dev] [gate] The gate: a failure analysis

2014-07-21 Thread Matthew Booth
On Friday evening I had a dependent series of 5 changes all with
approval waiting to be merged. These were all refactor changes in the
VMware driver. The changes were:

* VMware: DatastorePath join() and __eq__()
https://review.openstack.org/#/c/103949/

* VMware: use datastore classes get_allowed_datastores/_sub_folder
https://review.openstack.org/#/c/103950/

* VMware: use datastore classes in file_move/delete/exists, mkdir
https://review.openstack.org/#/c/103951/

* VMware: Trivial indentation cleanups in vmops
https://review.openstack.org/#/c/104149/

* VMware: Convert vmops to use instance as an object
https://review.openstack.org/#/c/104144/

The last change merged this morning.

In order to merge these changes, over the weekend I manually submitted:

* 35 rechecks due to false negatives, an average of 7 per change
* 19 resubmissions after a change passed, but its dependency did not

Other interesting numbers:

* 16 unique bugs
* An 87% false negative rate
* 0 bugs found in the change under test

Because we don't fail fast, that is an average of at least 7.3 hours in
the gate. Much more in fact, because some runs fail on the second pass,
not the first. Because we don't resubmit automatically, that is only if
a developer is actively monitoring the process continuously, and
resubmits immediately on failure. In practice this is much longer,
because sometimes we have to sleep.
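
As a rough sanity check (and assuming failed runs are independent), an 87%
false negative rate by itself predicts about 1 / (1 - 0.87), i.e. roughly 7.7
gate runs per merged change, which lines up with the ~7 rechecks per change
above:

    # Back-of-envelope only; the rate and the independence assumption are as above.
    false_negative_rate = 0.87
    expected_runs_per_merge = 1 / (1 - false_negative_rate)
    print(round(expected_runs_per_merge, 1))   # -> 7.7
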

All of the above numbers are counted from the change receiving an
approval +2 until final merging. There were far more failures than this
during the approval process.

Why do we test individual changes in the gate? The purpose is to find
errors *in the change under test*. By the above numbers, it has failed
to achieve this at least 16 times previously.

Probability of finding a bug in the change under test: Small
Cost of testing:   High
Opportunity cost of slowing development:   High

and for comparison:

Cost of reverting rare false positives:Small

The current process expends a lot of resources, and does not achieve its
goal of finding bugs *in the changes under test*. In addition to using a
lot of technical resources, it also prevents good change from making its
way into the project and, not unimportantly, saps the will to live of
its victims. The cost of the process is overwhelmingly greater than its
benefits. The gate process as it stands is a significant net negative to
the project.

Does this mean that it is worthless to run these tests? Absolutely not!
These tests are vital to highlight a severe quality deficiency in
OpenStack. Not addressing this is, imho, an existential risk to the
project. However, the current approach is to pick contributors from the
community at random and hold them personally responsible for project
bugs selected at random. Not only has this approach failed, it is
impractical, unreasonable, and poisonous to the community at large. It
is also unrelated to the purpose of gate testing, which is to find bugs
*in the changes under test*.

I would like to make the radical proposal that we stop gating on CI
failures. We will continue to run them on every change, but only after
the change has been successfully merged.

Benefits:
* Without rechecks, the gate will use 8 times fewer resources.
* Log analysis is still available to indicate the emergence of races.
* Fixes can be merged quicker.
* Vastly less developer time spent monitoring gate failures.

Costs:
* A rare class of merge bug will make it into master.

Note that the benefits above will also offset the cost of resolving this
rare class of merge bug.

Of course, we still have the problem of finding resources to monitor and
fix CI failures. An additional benefit of not gating on CI will be that
we can no longer pretend that picking developers for project-affecting
bugs by lottery is likely to achieve results. As a project we need to
understand the importance of CI failures. We need a proper negotiation
with contributors to staff a team dedicated to the problem. We can then
use the review process to ensure that the right people have an incentive
to prioritise bug fixes.

Matt
-- 
Matthew Booth
Red Hat Engineering, Virtualisation Team

Phone: +442070094448 (UK)
GPG ID:  D33C3490
GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490



Re: [openstack-dev] [gate] The gate: a failure analysis

2014-07-21 Thread Chris Friesen

On 07/21/2014 04:38 AM, Matthew Booth wrote:


I would like to make the radical proposal that we stop gating on CI
failures. We will continue to run them on every change, but only after
the change has been successfully merged.

Benefits:
* Without rechecks, the gate will use 8 times fewer resources.
* Log analysis is still available to indicate the emergence of races.
* Fixes can be merged quicker.
* Vastly less developer time spent monitoring gate failures.

Costs:
* A rare class of merge bug will make it into master.

Note that the benefits above will also offset the cost of resolving this
rare class of merge bug.

Of course, we still have the problem of finding resources to monitor and
fix CI failures. An additional benefit of not gating on CI will be that
we can no longer pretend that picking developers for project-affecting
bugs by lottery is likely to achieve results. As a project we need to
understand the importance of CI failures. We need a proper negotiation
with contributors to staff a team dedicated to the problem. We can then
use the review process to ensure that the right people have an incentive
to prioritise bug fixes.


I'm generally in favour of this idea...I've only submitted a relatively 
small number of changes, but each time has involved gate bugs unrelated 
to the change being made.


Would there be value in doing unit tests at the time of submission?  We 
should all be doing this already, but it seems like it shouldn't be too 
expensive and might be reasonable insurance.


Chris



Re: [openstack-dev] [gate] The gate: a failure analysis

2014-07-21 Thread Samuel Merritt

On 7/21/14, 3:38 AM, Matthew Booth wrote:

[snip]

I would like to make the radical proposal that we stop gating on CI
failures. We will continue to run them on every change, but only after
the change has been successfully merged.

Benefits:
* Without rechecks, the gate will use 8 times fewer resources.
* Log analysis is still available to indicate the emergence of races.
* Fixes can be merged quicker.
* Vastly less developer time spent monitoring gate failures.

Costs:
* A rare class of merge bug will make it into master.

Note that the benefits above will also offset the cost of resolving this
rare class of merge bug.


I think this is definitely a move in the right direction, but I'd like 
to propose a slight modification: let's cease blocking changes on 
*known* CI failures.


More precisely, if Elastic Recheck knows about all the failures that 
happened on a test run, treat that test run as successful.
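
For illustration, a rough sketch of that policy (the helper names and the
signature patterns here are invented, not Elastic Recheck's actual API):

    import re

    # One pattern per catalogued intermittent gate bug, e.g. the DB race
    # discussed earlier in this thread.
    KNOWN_FAILURE_SIGNATURES = [
        re.compile(r"Lock wait timeout exceeded; try restarting transaction"),
    ]

    def run_should_block(failed_job_logs):
        """Block the change only if some failure is NOT a known, filed bug."""
        for log_text in failed_job_logs:
            if not any(sig.search(log_text) for sig in KNOWN_FAILURE_SIGNATURES):
                return True   # unrecognised failure: keep the change out of master
        return False          # every failure matched a known bug: let it merge
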


I think this will gain virtually all the benefits you name while still 
retaining most of the gate's ability to keep breaking changes out.


As a bonus, it'll encourage people to make Elastic Recheck better. 
Currently, the easy path is to just type "recheck no bug" and click 
submit; it takes a lot less time than scrutinizing log files to guess 
at what went wrong. If failures identified by E-R don't block 
developers' changes, then the easy path is to improve E-R's checks, 
which benefits everyone.




Re: [openstack-dev] [gate] The gate: a failure analysis

2014-07-21 Thread Clint Byrum
Thanks Matthew for the analysis.

I think you missed something though.

Right now the frustration is that unrelated intermittent bugs stop your
presumably good change from getting in.

Without gating, the result would be that even more bugs, many of them not
intermittent at all, would get in. Right now, the one random developer
who has to hunt down the rechecks and do them is inconvenienced. But
without a gate, _every single_ developer will be inconvenienced until
the fix is merged.

The false negative rate is _way_ too high. Nobody would disagree there.
However, adding more false negatives and allowing more people to ignore
the ones we already have, seems like it would have the opposite effect:
Now instead of annoying the people who hit the random intermittent bugs,
we'll be annoying _everybody_ as they hit the non-intermittent ones.


Re: [openstack-dev] [gate] The gate: a failure analysis

2014-07-21 Thread Jay Pipes

On 07/21/2014 02:03 PM, Clint Byrum wrote:

Thanks Matthew for the analysis.

I think you missed something though.

Right now the frustration is that unrelated intermittent bugs stop your
presumably good change from getting in.

Without gating, the result would be that even more bugs, many of them not
intermittent at all, would get in. Right now, the one random developer
who has to hunt down the rechecks and do them is inconvenienced. But
without a gate, _every single_ developer will be inconvenienced until
the fix is merged.

The false negative rate is _way_ too high. Nobody would disagree there.
However, adding more false negatives and allowing more people to ignore
the ones we already have, seems like it would have the opposite effect:
Now instead of annoying the people who hit the random intermittent bugs,
we'll be annoying _everybody_ as they hit the non-intermittent ones.


+10

Best,
-jay



Re: [openstack-dev] [gate] The gate: a failure analysis

2014-07-21 Thread David Kranz

On 07/21/2014 04:13 PM, Jay Pipes wrote:

On 07/21/2014 02:03 PM, Clint Byrum wrote:

Thanks Matthew for the analysis.

I think you missed something though.

Right now the frustration is that unrelated intermittent bugs stop your
presumably good change from getting in.

Without gating, the result would be that even more bugs, many of them not
intermittent at all, would get in. Right now, the one random developer
who has to hunt down the rechecks and do them is inconvenienced. But
without a gate, _every single_ developer will be inconvenienced until
the fix is merged.

The false negative rate is _way_ too high. Nobody would disagree there.
However, adding more false negatives and allowing more people to ignore
the ones we already have, seems like it would have the opposite effect:
Now instead of annoying the people who hit the random intermittent bugs,
we'll be annoying _everybody_ as they hit the non-intermittent ones.


+10

Right, but perhaps there is a middle ground. We must not allow changes 
in that can't pass through the gate, but we can separate the problems
of constant rechecks using too many resources, and of constant rechecks 
causing developer pain. If failures were deterministic we would skip the 
failing tests until they were fixed. Unfortunately many of the common 
failures can blow up any test, or even the whole process. Following on 
what Sam said, what if we automatically reran jobs that failed in a 
known way, and disallowed recheck/reverify no bug? Developers would 
then have to track down what bug caused a failure or file a new one. But 
they would have to do so much less frequently, and as more common 
failures were catalogued it would become less and less frequent.


Some might (reasonably) argue that this would be a bad thing because it 
would reduce the incentive for people to fix bugs if there were less 
pain being inflicted. But given how hard it is to track down these race 
bugs, and that we as a community have no way to force time to be spent 
on them, and that it does not appear that these bugs are causing real 
systems to fall down (only our gating process), perhaps something 
different should be considered?


 -David


Best,
-jay


Excerpts from Matthew Booth's message of 2014-07-21 03:38:07 -0700:

On Friday evening I had a dependent series of 5 changes all with
approval waiting to be merged. These were all refactor changes in the
VMware driver. The changes were:

* VMware: DatastorePath join() and __eq__()
https://review.openstack.org/#/c/103949/

* VMware: use datastore classes get_allowed_datastores/_sub_folder
https://review.openstack.org/#/c/103950/

* VMware: use datastore classes in file_move/delete/exists, mkdir
https://review.openstack.org/#/c/103951/

* VMware: Trivial indentation cleanups in vmops
https://review.openstack.org/#/c/104149/

* VMware: Convert vmops to use instance as an object
https://review.openstack.org/#/c/104144/

The last change merged this morning.

In order to merge these changes, over the weekend I manually submitted:

* 35 rechecks due to false negatives, an average of 7 per change
* 19 resubmissions after a change passed, but its dependency did not

Other interesting numbers:

* 16 unique bugs
* An 87% false negative rate
* 0 bugs found in the change under test

Because we don't fail fast, that is an average of at least 7.3 hours in
the gate. Much more in fact, because some runs fail on the second pass,
not the first. Because we don't resubmit automatically, that is only if
a developer is actively monitoring the process continuously, and
resubmits immediately on failure. In practise this is much longer,
because sometimes we have to sleep.

All of the above numbers are counted from the change receiving an
approval +2 until final merging. There were far more failures than this
during the approval process.

Why do we test individual changes in the gate? The purpose is to find
errors *in the change under test*. By the above numbers, it has failed
to achieve this at least 16 times previously.

Probability of finding a bug in the change under test: Small
Cost of testing:   High
Opportunity cost of slowing development:   High

and for comparison:

Cost of reverting rare false positives:Small

The current process expends a lot of resources, and does not achieve 
its
goal of finding bugs *in the changes under test*. In addition to 
using a
lot of technical resources, it also prevents good change from making 
its

way into the project and, not unimportantly, saps the will to live of
its victims. The cost of the process is overwhelmingly greater than its
benefits. The gate process as it stands is a significant net 
negative to

the project.

Does this mean that it is worthless to run these tests? Absolutely not!
These tests are vital to highlight a severe quality deficiency in
OpenStack. Not addressing this is, imho, an existential risk to the
project. However, the current approach is 

Re: [openstack-dev] [gate] The gate: a failure analysis

2014-07-21 Thread Sean Dague
On 07/21/2014 04:39 PM, David Kranz wrote:
 On 07/21/2014 04:13 PM, Jay Pipes wrote:
 On 07/21/2014 02:03 PM, Clint Byrum wrote:
 Thanks Matthew for the analysis.

 I think you missed something though.

 Right now the frustration is that unrelated intermittent bugs stop your
 presumably good change from getting in.

 Without gating, the result would be that even more bugs, many of them not
 intermittent at all, would get in. Right now, the one random developer
 who has to hunt down the rechecks and do them is inconvenienced. But
 without a gate, _every single_ developer will be inconvenienced until
 the fix is merged.

 The false negative rate is _way_ too high. Nobody would disagree there.
 However, adding more false negatives and allowing more people to ignore
 the ones we already have, seems like it would have the opposite effect:
 Now instead of annoying the people who hit the random intermittent bugs,
 we'll be annoying _everybody_ as they hit the non-intermittent ones.

 +10

 Right, but perhaps there is a middle ground. We must not allow changes
 in that can't pass through the gate, but we can separate the problems
 of constant rechecks using too many resources, and of constant rechecks
 causing developer pain. If failures were deterministic we would skip the
 failing tests until they were fixed. Unfortunately many of the common
 failures can blow up any test, or even the whole process. Following on
 what Sam said, what if we automatically reran jobs that failed in a
 known way, and disallowed recheck/reverify no bug? Developers would
 then have to track down what bug caused a failure or file a new one. But
 they would have to do so much less frequently, and as more common
 failures were catalogued it would become less and less frequent.

Elastic Recheck was never meant for this purpose. It doesn't tell you
all the bugs that were in your job, it just tells you possibly one bug
that might have caused something to go wrong. There is no guarantee
there weren't other bugs in there as well. Consider it a fail-open solution.
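
To spell that out with a toy example (made-up signatures and bug names,
not elastic recheck's actual queries or code), signature matching can
only ever confirm that one known bug was present:

import re

# Illustrative signatures only.
KNOWN_SIGNATURES = {
    "example-bug-1": r"Lock wait timeout exceeded",
    "example-bug-2": r"timed out waiting for .* to become ACTIVE",
}

def classify_failure(console_log):
    """Return the first known bug whose signature matches the log, or None."""
    for bug, pattern in KNOWN_SIGNATURES.items():
        if re.search(pattern, console_log, re.IGNORECASE):
            return bug   # at most one "likely" bug is reported
    return None          # no match: the failure goes unclassified

# A hit only shows that one known bug was present; it does not rule out
# others in the same run, and a miss tells you nothing at all.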

 Some might (reasonably) argue that this would be a bad thing because it
 would reduce the incentive for people to fix bugs if there were less
 pain being inflicted. But given how hard it is to track down these race
 bugs, and that we as a community have no way to force time to be spent
 on them, and that it does not appear that these bugs are causing real
 systems to fall down (only our gating process), perhaps something
 different should be considered?

I really beg to differ on that point. The Infra team will tell you how
terribly unreliable our cloud providers can be at times, hitting many of
the same issues that we expose in elastic recheck.

Lightly loaded / basically static environments will hit some of these
issues at a far lower rate. They are still out there though, probably
just masked by massive retry loops around our stuff.
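
(A hypothetical illustration of the kind of retry loop in question; the
helper below is made up, not real deployment code, but the pattern
quietly hides exactly the races the gate exposes:)

import time

def retry(call, attempts=5, delay=2.0):
    """Keep retrying a flaky cloud API call until it succeeds or we give up."""
    last_exc = None
    for _ in range(attempts):
        try:
            return call()
        except Exception as exc:   # deliberately broad: "just make it work"
            last_exc = exc
            time.sleep(delay)
    raise last_exc

# e.g. retry(lambda: client.servers.create(...)): the underlying race is
# still there, the operator just never sees it fail.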

Allocating a compute server that you can ssh to a dozen times in a test
run shouldn't be considered a moon shot level of function. That's kind
of table stakes for IaaS. :)

And yes, it's hard to debug, but seriously, if the development community
can't figure out why OpenStack doesn't work, can anyone?

-Sean

-- 
Sean Dague
http://dague.net


