Re: [openstack-dev] [third-party-ci][neutron] What is Success exactly?

2014-07-07 Thread Kyle Mestery
On Thu, Jul 3, 2014 at 6:12 AM, Salvatore Orlando sorla...@nicira.com wrote:
 Apologies for quoting again the top post of the thread.

 Comments inline (mostly thinking aloud)
 Salvatore


 On 30 June 2014 22:22, Jay Pipes jaypi...@gmail.com wrote:

 Hi Stackers,

 Some recent ML threads [1] and a hot IRC meeting today [2] brought up some
 legitimate questions around how a newly-proposed Stackalytics report page
 for Neutron External CI systems [2] represented the results of an external
 CI system as successful or not.

 First, I want to say that Ilya and all those involved in the Stackalytics
 program simply want to provide the most accurate information to developers
 in a format that is easily consumed. While there need to be some changes in
 how data is shown (and the wording of things like Tests Succeeded), I hope
 that the community knows there isn't any ill intent on the part of Mirantis
 or anyone who works on Stackalytics. OK, so let's keep the conversation
 civil -- we're all working towards the same goals of transparency and
 accuracy. :)

 Alright, now, Anita and Kurt Taylor were asking a very poignant question:

 "But what does CI tested really mean? just running tests? or tested to
 pass some level of requirements?"

 In this nascent world of external CI systems, we have a set of issues that
 we need to resolve:

 1) All of the CI systems are different.

 Some run Bash scripts. Some run Jenkins slaves and devstack-gate scripts.
 Others run custom Python code that spawns VMs and publishes logs to some
 public domain.

 As a community, we need to decide whether it is worth putting in the
 effort to create a single, unified, installable and runnable CI system, so
 that we can legitimately say all of the external systems are identical,
 with the exception of the driver code for vendor X being substituted in the
 Neutron codebase.


 I think such a system already exists, and it's documented here:
 http://ci.openstack.org/
 Still, understanding it involves quite a learning curve, and running it is not
 exactly straightforward. But I guess that's pretty much understandable given
 the complexity of the system, isn't it?



 If the goal of the external CI systems is to produce reliable, consistent
 results, I feel the answer to the above is yes, but I'm interested to hear
 what others think. Frankly, in the world of benchmarks, it would be
 unthinkable to say "go ahead and everyone run your own benchmark suite",
 because you would get wildly different results. A similar problem has
 emerged here.


 I don't think the particular infrastructure - which might range from an
 openstack-ci clone to a 100-line bash script - would have an impact on the
 reliability of the quality assessment regarding a particular driver or
 plugin. This is determined, in my opinion, by the quantity and nature of
 tests one runs on a specific driver. In Neutron, for instance, there is a
 wide range of choices - from a few test cases in tempest.api.network to the
 full smoketest job. As long as there is no minimal standard here, it will
 be difficult to assess the quality of the evaluation from a CI system,
 unless we explicitly take coverage into account in the evaluation.

 On the other hand, different CI infrastructures will differ in the % of
 patches they test and the % of infrastructure failures they hit. I think it
 might not be a terrible idea to use these parameters to evaluate how good a
 CI is from an infra standpoint. However, there are still open questions. For
 instance, a CI might have a low patch % score because it only needs to test
 patches affecting a given driver.


 2) There is no mediation or verification that the external CI system is
 actually testing anything at all

 As a community, we need to decide whether the current system of
 self-policing should continue. If it should, then language on reports like
 [3] should be very clear that any numbers derived from such systems should
 be taken with a grain of salt. Use of the word "Success" should be avoided,
 as it has connotations (in English, at least) that the result has been
 verified, which is simply not the case as long as no verification or
 mediation occurs for any external CI system.





 3) There is no clear indication of what tests are being run, and therefore
 there is no clear indication of what success is

 I think we can all agree that a test has three possible outcomes: pass,
 fail, and skip. The results of a test suite run therefore is nothing more
 than the aggregation of which tests passed, which failed, and which were
 skipped.
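
For illustration only, a minimal Python sketch of that aggregation, assuming a
hypothetical list of (test_name, outcome) pairs rather than any particular CI
system's real output format:

    # Reduce per-test outcomes to the pass/fail/skip totals described above.
    from collections import Counter

    def summarize(results):
        # results: iterable of (test_name, outcome),
        # where outcome is 'pass', 'fail' or 'skip'.
        counts = Counter(outcome for _name, outcome in results)
        return {'passed': counts.get('pass', 0),
                'failed': counts.get('fail', 0),
                'skipped': counts.get('skip', 0)}

    # summarize([('test_a', 'pass'), ('test_b', 'skip')])
    # -> {'passed': 1, 'failed': 0, 'skipped': 1}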

 As a community, we must document, for each project, the expected set
 of tests that must be run for each patch merged into the project's source
 tree. This documentation should be discoverable so that reports like [3] can
 be crystal-clear on what the data shown actually means. The report is simply
 displaying the data it receives from Gerrit. The community needs to be
 proactive in saying this is what is expected 

Re: [openstack-dev] [third-party-ci][neutron] What is Success exactly?

2014-07-07 Thread Joe Gordon
On Jul 3, 2014 8:57 AM, Anita Kuno ante...@anteaya.info wrote:

 On 07/03/2014 06:22 AM, Sullivan, Jon Paul wrote:
  -Original Message-
  From: Anita Kuno [mailto:ante...@anteaya.info]
  Sent: 01 July 2014 14:42
  To: openstack-dev@lists.openstack.org
  Subject: Re: [openstack-dev] [third-party-ci][neutron] What is
Success
  exactly?
 
  On 06/30/2014 09:13 PM, Jay Pipes wrote:
  On 06/30/2014 07:08 PM, Anita Kuno wrote:
  On 06/30/2014 04:22 PM, Jay Pipes wrote:
  Hi Stackers,
 
  Some recent ML threads [1] and a hot IRC meeting today [2] brought
  up some legitimate questions around how a newly-proposed
  Stackalytics report page for Neutron External CI systems [2]
  represented the results of an external CI system as successful or
  not.
 
  First, I want to say that Ilya and all those involved in the
  Stackalytics program simply want to provide the most accurate
  information to developers in a format that is easily consumed. While
  there need to be some changes in how data is shown (and the wording
  of things like Tests Succeeded), I hope that the community knows
  there isn't any ill intent on the part of Mirantis or anyone who
  works on Stackalytics. OK, so let's keep the conversation civil --
  we're all working towards the same goals of transparency and
  accuracy. :)
 
  Alright, now, Anita and Kurt Taylor were asking a very poignant
  question:
 
  But what does CI tested really mean? just running tests? or tested
  to pass some level of requirements?
 
  In this nascent world of external CI systems, we have a set of
  issues that we need to resolve:
 
  1) All of the CI systems are different.
 
  Some run Bash scripts. Some run Jenkins slaves and devstack-gate
  scripts. Others run custom Python code that spawns VMs and publishes
  logs to some public domain.
 
  As a community, we need to decide whether it is worth putting in the
  effort to create a single, unified, installable and runnable CI
  system, so that we can legitimately say all of the external systems
  are identical, with the exception of the driver code for vendor X
  being substituted in the Neutron codebase.
 
  If the goal of the external CI systems is to produce reliable,
  consistent results, I feel the answer to the above is yes, but I'm
  interested to hear what others think. Frankly, in the world of
  benchmarks, it would be unthinkable to say go ahead and everyone
  run your own benchmark suite, because you would get wildly
  different results. A similar problem has emerged here.
 
  2) There is no mediation or verification that the external CI system
  is actually testing anything at all
 
  As a community, we need to decide whether the current system of
  self-policing should continue. If it should, then language on
  reports like [3] should be very clear that any numbers derived from
  such systems should be taken with a grain of salt. Use of the word
  Success should be avoided, as it has connotations (in English, at
  least) that the result has been verified, which is simply not the
  case as long as no verification or mediation occurs for any external
  CI system.
 
  3) There is no clear indication of what tests are being run, and
  therefore there is no clear indication of what success is
 
  I think we can all agree that a test has three possible outcomes:
  pass, fail, and skip. The results of a test suite run therefore is
  nothing more than the aggregation of which tests passed, which
  failed, and which were skipped.
 
  As a community, we must document, for each project, what are
  expected set of tests that must be run for each merged patch into
  the project's source tree. This documentation should be discoverable
  so that reports like [3] can be crystal-clear on what the data shown
  actually means. The report is simply displaying the data it receives
  from Gerrit. The community needs to be proactive in saying this is
  what is expected to be tested. This alone would allow the report to
  give information such as External CI system ABC performed the
  expected tests. X tests passed.
  Y tests failed. Z tests were skipped. Likewise, it would also make
  it possible for the report to give information such as External CI
  system DEF did not perform the expected tests., which is excellent
  information in and of itself.
 
  ===
 
  In thinking about the likely answers to the above questions, I
  believe it would be prudent to change the Stackalytics report in
  question [3] in the following ways:
 
  a. Change the Success % column header to % Reported +1 Votes
  b. Change the phrase  Green cell - tests ran successfully, red cell
  - tests failed to Green cell - System voted +1, red cell - System
  voted -1
 
  and then, when we have more and better data (for example, # tests
  passed, failed, skipped, etc), we can provide more detailed
  information than just reported +1 or not.
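
For illustration, one way a "% Reported +1 Votes" figure could be computed once
the votes a CI account left on patch sets have been collected (the list-of-votes
input is hypothetical; a real report would pull it from Gerrit):

    def reported_plus_one_pct(votes):
        # votes: one integer per reviewed patch set, each +1 or -1.
        if not votes:
            return 0.0
        return 100.0 * sum(1 for v in votes if v == 1) / len(votes)

    # reported_plus_one_pct([1, 1, -1, 1]) -> 75.0

Such a percentage says nothing about whether the tests behind the vote were
meaningful, which is exactly why the wording change matters.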
 
  Thoughts?
 
  Best,
  -jay
 
  [1]
  http://lists.openstack.org/pipermail/openstack-dev/2014-June/038933

Re: [openstack-dev] [third-party-ci][neutron] What is Success exactly?

2014-07-06 Thread Jay Pipes

On 07/03/2014 02:41 PM, Fawad Khaliq wrote:


On Thu, Jul 3, 2014 at 10:27 AM, Kevin Benton blak...@gmail.com wrote:

 This allows the viewer to see categories of reviews based upon their
divergence from OpenStack's Jenkins results. I think evaluating
divergence from Jenkins might be a metric worth consideration.

I think the only thing this really reflects though is how much the
third party CI system is mirroring Jenkins.
A system that frequently diverges may be functioning perfectly fine
and just has a vastly different code path that it is integration
testing so it is legitimately detecting failures the OpenStack CI
cannot.

Exactly. +1


Unfortunately, there's no good way to prove that.
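
The raw divergence itself is straightforward to compute once both systems'
votes are collected; what cannot be automated is judging whether a
disagreement is a legitimate catch. A sketch, assuming a hypothetical mapping
of patch-set id to vote for each system:

    def divergence_pct(jenkins_votes, ci_votes):
        # Both arguments: dict mapping patch-set id -> +1 or -1.
        common = set(jenkins_votes) & set(ci_votes)
        if not common:
            return 0.0
        disagree = sum(1 for p in common if jenkins_votes[p] != ci_votes[p])
        return 100.0 * disagree / len(common)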

-jay



Re: [openstack-dev] [third-party-ci][neutron] What is Success exactly?

2014-07-03 Thread Sullivan, Jon Paul
 -Original Message-
 From: Anita Kuno [mailto:ante...@anteaya.info]
 Sent: 01 July 2014 14:42
 To: openstack-dev@lists.openstack.org
 Subject: Re: [openstack-dev] [third-party-ci][neutron] What is Success
 exactly?
 
 On 06/30/2014 09:13 PM, Jay Pipes wrote:
  On 06/30/2014 07:08 PM, Anita Kuno wrote:
  On 06/30/2014 04:22 PM, Jay Pipes wrote:
  Hi Stackers,
 
  Some recent ML threads [1] and a hot IRC meeting today [2] brought
  up some legitimate questions around how a newly-proposed
  Stackalytics report page for Neutron External CI systems [2]
  represented the results of an external CI system as successful or
 not.
 
  First, I want to say that Ilya and all those involved in the
  Stackalytics program simply want to provide the most accurate
  information to developers in a format that is easily consumed. While
  there need to be some changes in how data is shown (and the wording
  of things like Tests Succeeded), I hope that the community knows
  there isn't any ill intent on the part of Mirantis or anyone who
  works on Stackalytics. OK, so let's keep the conversation civil --
  we're all working towards the same goals of transparency and
  accuracy. :)
 
  Alright, now, Anita and Kurt Taylor were asking a very poignant
  question:
 
  But what does CI tested really mean? just running tests? or tested
  to pass some level of requirements?
 
  In this nascent world of external CI systems, we have a set of
  issues that we need to resolve:
 
  1) All of the CI systems are different.
 
  Some run Bash scripts. Some run Jenkins slaves and devstack-gate
  scripts. Others run custom Python code that spawns VMs and publishes
  logs to some public domain.
 
  As a community, we need to decide whether it is worth putting in the
  effort to create a single, unified, installable and runnable CI
  system, so that we can legitimately say all of the external systems
  are identical, with the exception of the driver code for vendor X
  being substituted in the Neutron codebase.
 
  If the goal of the external CI systems is to produce reliable,
  consistent results, I feel the answer to the above is yes, but I'm
  interested to hear what others think. Frankly, in the world of
  benchmarks, it would be unthinkable to say go ahead and everyone
  run your own benchmark suite, because you would get wildly
  different results. A similar problem has emerged here.
 
  2) There is no mediation or verification that the external CI system
  is actually testing anything at all
 
  As a community, we need to decide whether the current system of
  self-policing should continue. If it should, then language on
  reports like [3] should be very clear that any numbers derived from
  such systems should be taken with a grain of salt. Use of the word
  Success should be avoided, as it has connotations (in English, at
  least) that the result has been verified, which is simply not the
  case as long as no verification or mediation occurs for any external
 CI system.
 
  3) There is no clear indication of what tests are being run, and
  therefore there is no clear indication of what success is
 
  I think we can all agree that a test has three possible outcomes:
  pass, fail, and skip. The results of a test suite run therefore is
  nothing more than the aggregation of which tests passed, which
  failed, and which were skipped.
 
  As a community, we must document, for each project, what are
  expected set of tests that must be run for each merged patch into
  the project's source tree. This documentation should be discoverable
  so that reports like [3] can be crystal-clear on what the data shown
  actually means. The report is simply displaying the data it receives
  from Gerrit. The community needs to be proactive in saying this is
  what is expected to be tested. This alone would allow the report to
  give information such as External CI system ABC performed the
 expected tests. X tests passed.
  Y tests failed. Z tests were skipped. Likewise, it would also make
  it possible for the report to give information such as External CI
  system DEF did not perform the expected tests., which is excellent
  information in and of itself.
 
  ===
 
  In thinking about the likely answers to the above questions, I
  believe it would be prudent to change the Stackalytics report in
  question [3] in the following ways:
 
  a. Change the Success % column header to % Reported +1 Votes
  b. Change the phrase  Green cell - tests ran successfully, red cell
  - tests failed to Green cell - System voted +1, red cell - System
  voted -1
 
  and then, when we have more and better data (for example, # tests
  passed, failed, skipped, etc), we can provide more detailed
  information than just reported +1 or not.
 
  Thoughts?
 
  Best,
  -jay
 
  [1]
  http://lists.openstack.org/pipermail/openstack-dev/2014-June/038933.
  html
  [2]
  http://eavesdrop.openstack.org/meetings/third_party/2014/third_party
  .2014-06-30-18.01.log.html
 
 
  [3] http

Re: [openstack-dev] [third-party-ci][neutron] What is Success exactly?

2014-07-03 Thread Salvatore Orlando
Apologies for quoting again the top post of the thread.

Comments inline (mostly thinking aloud)
Salvatore


On 30 June 2014 22:22, Jay Pipes jaypi...@gmail.com wrote:

 Hi Stackers,

 Some recent ML threads [1] and a hot IRC meeting today [2] brought up some
 legitimate questions around how a newly-proposed Stackalytics report page
 for Neutron External CI systems [2] represented the results of an external
 CI system as successful or not.

 First, I want to say that Ilya and all those involved in the Stackalytics
 program simply want to provide the most accurate information to developers
 in a format that is easily consumed. While there need to be some changes in
 how data is shown (and the wording of things like Tests Succeeded), I
 hope that the community knows there isn't any ill intent on the part of
 Mirantis or anyone who works on Stackalytics. OK, so let's keep the
 conversation civil -- we're all working towards the same goals of
 transparency and accuracy. :)

 Alright, now, Anita and Kurt Taylor were asking a very poignant question:

 But what does CI tested really mean? just running tests? or tested to
 pass some level of requirements?

 In this nascent world of external CI systems, we have a set of issues that
 we need to resolve:

 1) All of the CI systems are different.

 Some run Bash scripts. Some run Jenkins slaves and devstack-gate scripts.
 Others run custom Python code that spawns VMs and publishes logs to some
 public domain.

 As a community, we need to decide whether it is worth putting in the
 effort to create a single, unified, installable and runnable CI system, so
 that we can legitimately say all of the external systems are identical,
 with the exception of the driver code for vendor X being substituted in the
 Neutron codebase.


I think such a system already exists, and it's documented here:
http://ci.openstack.org/
Still, understanding it involves quite a learning curve, and running it is not
exactly straightforward. But I guess that's pretty much understandable
given the complexity of the system, isn't it?



 If the goal of the external CI systems is to produce reliable, consistent
 results, I feel the answer to the above is yes, but I'm interested to
 hear what others think. Frankly, in the world of benchmarks, it would be
 unthinkable to say go ahead and everyone run your own benchmark suite,
 because you would get wildly different results. A similar problem has
 emerged here.


I don't think the particular infrastructure - which might range from an
openstack-ci clone to a 100-line bash script - would have an impact on the
reliability of the quality assessment regarding a particular driver or
plugin. This is determined, in my opinion, by the quantity and nature of
tests one runs on a specific driver. In Neutron, for instance, there is a
wide range of choices - from a few test cases in tempest.api.network to the
full smoketest job. As long as there is no minimal standard here, it
will be difficult to assess the quality of the evaluation from a CI
system, unless we explicitly take coverage into account in the evaluation.

On the other hand, different CI infrastructures will differ in the % of
patches they test and the % of infrastructure failures they hit. I think
it might not be a terrible idea to use these parameters to evaluate how
good a CI is from an infra standpoint. However, there are still open
questions. For instance, a CI might have a low patch % score because it
only needs to test patches affecting a given driver.
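
As a sketch of those two parameters, with entirely hypothetical inputs (the set
of patches a CI was expected to test, the set it actually reported on, and a
count of runs lost to infrastructure problems):

    def ci_infra_metrics(expected, reported, total_runs, infra_failures):
        tested_pct = (100.0 * len(set(reported) & set(expected)) / len(expected)
                      if expected else 0.0)
        infra_failure_pct = (100.0 * infra_failures / total_runs
                             if total_runs else 0.0)
        return tested_pct, infra_failure_pct

The denominator is the open question above: for a driver-specific CI,
"expected" should only contain the patches that touch that driver, otherwise
the patch % score is unfairly low.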


 2) There is no mediation or verification that the external CI system is
 actually testing anything at all

 As a community, we need to decide whether the current system of
 self-policing should continue. If it should, then language on reports like
 [3] should be very clear that any numbers derived from such systems should
 be taken with a grain of salt. Use of the word Success should be avoided,
 as it has connotations (in English, at least) that the result has been
 verified, which is simply not the case as long as no verification or
 mediation occurs for any external CI system.





 3) There is no clear indication of what tests are being run, and therefore
 there is no clear indication of what success is

 I think we can all agree that a test has three possible outcomes: pass,
 fail, and skip. The results of a test suite run therefore is nothing more
 than the aggregation of which tests passed, which failed, and which were
 skipped.

 As a community, we must document, for each project, what are expected set
 of tests that must be run for each merged patch into the project's source
 tree. This documentation should be discoverable so that reports like [3]
 can be crystal-clear on what the data shown actually means. The report is
 simply displaying the data it receives from Gerrit. The community needs to
 be proactive in saying this is what is expected to be tested. This alone
 would allow the report to give information such as External CI system ABC
 

Re: [openstack-dev] [third-party-ci][neutron] What is Success exactly?

2014-07-03 Thread Anita Kuno
On 07/03/2014 06:22 AM, Sullivan, Jon Paul wrote:
 -Original Message-
 From: Anita Kuno [mailto:ante...@anteaya.info]
 Sent: 01 July 2014 14:42
 To: openstack-dev@lists.openstack.org
 Subject: Re: [openstack-dev] [third-party-ci][neutron] What is Success
 exactly?

 On 06/30/2014 09:13 PM, Jay Pipes wrote:
 On 06/30/2014 07:08 PM, Anita Kuno wrote:
 On 06/30/2014 04:22 PM, Jay Pipes wrote:
 Hi Stackers,

 Some recent ML threads [1] and a hot IRC meeting today [2] brought
 up some legitimate questions around how a newly-proposed
 Stackalytics report page for Neutron External CI systems [2]
 represented the results of an external CI system as successful or
 not.

 First, I want to say that Ilya and all those involved in the
 Stackalytics program simply want to provide the most accurate
 information to developers in a format that is easily consumed. While
 there need to be some changes in how data is shown (and the wording
 of things like Tests Succeeded), I hope that the community knows
 there isn't any ill intent on the part of Mirantis or anyone who
 works on Stackalytics. OK, so let's keep the conversation civil --
 we're all working towards the same goals of transparency and
 accuracy. :)

 Alright, now, Anita and Kurt Taylor were asking a very poignant
 question:

 But what does CI tested really mean? just running tests? or tested
 to pass some level of requirements?

 In this nascent world of external CI systems, we have a set of
 issues that we need to resolve:

 1) All of the CI systems are different.

 Some run Bash scripts. Some run Jenkins slaves and devstack-gate
 scripts. Others run custom Python code that spawns VMs and publishes
 logs to some public domain.

 As a community, we need to decide whether it is worth putting in the
 effort to create a single, unified, installable and runnable CI
 system, so that we can legitimately say all of the external systems
 are identical, with the exception of the driver code for vendor X
 being substituted in the Neutron codebase.

 If the goal of the external CI systems is to produce reliable,
 consistent results, I feel the answer to the above is yes, but I'm
 interested to hear what others think. Frankly, in the world of
 benchmarks, it would be unthinkable to say go ahead and everyone
 run your own benchmark suite, because you would get wildly
 different results. A similar problem has emerged here.

 2) There is no mediation or verification that the external CI system
 is actually testing anything at all

 As a community, we need to decide whether the current system of
 self-policing should continue. If it should, then language on
 reports like [3] should be very clear that any numbers derived from
 such systems should be taken with a grain of salt. Use of the word
 Success should be avoided, as it has connotations (in English, at
 least) that the result has been verified, which is simply not the
 case as long as no verification or mediation occurs for any external
 CI system.

 3) There is no clear indication of what tests are being run, and
 therefore there is no clear indication of what success is

 I think we can all agree that a test has three possible outcomes:
 pass, fail, and skip. The results of a test suite run therefore is
 nothing more than the aggregation of which tests passed, which
 failed, and which were skipped.

 As a community, we must document, for each project, what are
 expected set of tests that must be run for each merged patch into
 the project's source tree. This documentation should be discoverable
 so that reports like [3] can be crystal-clear on what the data shown
 actually means. The report is simply displaying the data it receives
 from Gerrit. The community needs to be proactive in saying this is
 what is expected to be tested. This alone would allow the report to
 give information such as External CI system ABC performed the
 expected tests. X tests passed.
 Y tests failed. Z tests were skipped. Likewise, it would also make
 it possible for the report to give information such as External CI
 system DEF did not perform the expected tests., which is excellent
 information in and of itself.

 ===

 In thinking about the likely answers to the above questions, I
 believe it would be prudent to change the Stackalytics report in
 question [3] in the following ways:

 a. Change the Success % column header to % Reported +1 Votes
 b. Change the phrase  Green cell - tests ran successfully, red cell
 - tests failed to Green cell - System voted +1, red cell - System
 voted -1

 and then, when we have more and better data (for example, # tests
 passed, failed, skipped, etc), we can provide more detailed
 information than just reported +1 or not.

 Thoughts?

 Best,
 -jay

 [1]
 http://lists.openstack.org/pipermail/openstack-dev/2014-June/038933.
 html
 [2]
 http://eavesdrop.openstack.org/meetings/third_party/2014/third_party
 .2014-06-30-18.01.log.html


 [3] http://stackalytics.com/report/ci/neutron/7

Re: [openstack-dev] [third-party-ci][neutron] What is Success exactly?

2014-07-03 Thread Anita Kuno
On 07/03/2014 07:12 AM, Salvatore Orlando wrote:
 Apologies for quoting again the top post of the thread.
 
 Comments inline (mostly thinking aloud)
 Salvatore
 
 
 On 30 June 2014 22:22, Jay Pipes jaypi...@gmail.com wrote:
 
 Hi Stackers,

 Some recent ML threads [1] and a hot IRC meeting today [2] brought up some
 legitimate questions around how a newly-proposed Stackalytics report page
 for Neutron External CI systems [2] represented the results of an external
 CI system as successful or not.

 First, I want to say that Ilya and all those involved in the Stackalytics
 program simply want to provide the most accurate information to developers
 in a format that is easily consumed. While there need to be some changes in
 how data is shown (and the wording of things like Tests Succeeded), I
 hope that the community knows there isn't any ill intent on the part of
 Mirantis or anyone who works on Stackalytics. OK, so let's keep the
 conversation civil -- we're all working towards the same goals of
 transparency and accuracy. :)

 Alright, now, Anita and Kurt Taylor were asking a very poignant question:

 But what does CI tested really mean? just running tests? or tested to
 pass some level of requirements?

 In this nascent world of external CI systems, we have a set of issues that
 we need to resolve:

 1) All of the CI systems are different.

 Some run Bash scripts. Some run Jenkins slaves and devstack-gate scripts.
 Others run custom Python code that spawns VMs and publishes logs to some
 public domain.

 As a community, we need to decide whether it is worth putting in the
 effort to create a single, unified, installable and runnable CI system, so
 that we can legitimately say all of the external systems are identical,
 with the exception of the driver code for vendor X being substituted in the
 Neutron codebase.

 
 I think such system already exists, and it's documented here:
 http://ci.openstack.org/
 Still, understanding it is quite a learning curve, and running it is not
 exactly straightforward. But I guess that's pretty much understandable
 given the complexity of the system, isn't it?
 
 

 If the goal of the external CI systems is to produce reliable, consistent
 results, I feel the answer to the above is yes, but I'm interested to
 hear what others think. Frankly, in the world of benchmarks, it would be
 unthinkable to say go ahead and everyone run your own benchmark suite,
 because you would get wildly different results. A similar problem has
 emerged here.

 
 I don't think the particular infrastructure which might range from an
 openstack-ci clone to a 100-line bash script would have an impact on the
 reliability of the quality assessment regarding a particular driver or
 plugin. This is determined, in my opinion, by the quantity and nature of
 tests one runs on a specific driver. In Neutron for instance, there is a
 wide range of choices - from a few test cases in tempest.api.network to the
 full smoketest job. As long there is no minimal standard here, then it
 would be difficult to assess the quality of the evaluation from a CI
 system, unless we explicitly keep into account coverage into the evaluation.
 
 On the other hand, different CI infrastructures will have different levels
 in terms of % of patches tested and % of infrastructure failures. I think
 it might not be a terrible idea to use these parameters to evaluate how
 good a CI is from an infra standpoint. However, there are still open
 questions. For instance, a CI might have a low patch % score because it
 only needs to test patches affecting a given driver.
 
 
 2) There is no mediation or verification that the external CI system is
 actually testing anything at all

 As a community, we need to decide whether the current system of
 self-policing should continue. If it should, then language on reports like
 [3] should be very clear that any numbers derived from such systems should
 be taken with a grain of salt. Use of the word Success should be avoided,
 as it has connotations (in English, at least) that the result has been
 verified, which is simply not the case as long as no verification or
 mediation occurs for any external CI system.

 
 
 
 
 3) There is no clear indication of what tests are being run, and therefore
 there is no clear indication of what success is

 I think we can all agree that a test has three possible outcomes: pass,
 fail, and skip. The results of a test suite run therefore is nothing more
 than the aggregation of which tests passed, which failed, and which were
 skipped.

 As a community, we must document, for each project, what are expected set
 of tests that must be run for each merged patch into the project's source
 tree. This documentation should be discoverable so that reports like [3]
 can be crystal-clear on what the data shown actually means. The report is
 simply displaying the data it receives from Gerrit. The community needs to
 be proactive in saying this is what is expected to be tested. 

Re: [openstack-dev] [third-party-ci][neutron] What is Success exactly?

2014-07-03 Thread Sullivan, Jon Paul
 -Original Message-
 From: Anita Kuno [mailto:ante...@anteaya.info]
 Sent: 03 July 2014 13:53
 To: openstack-dev@lists.openstack.org
 Subject: Re: [openstack-dev] [third-party-ci][neutron] What is Success
 exactly?
 
 On 07/03/2014 06:22 AM, Sullivan, Jon Paul wrote:
  -Original Message-
  From: Anita Kuno [mailto:ante...@anteaya.info]
  Sent: 01 July 2014 14:42
  To: openstack-dev@lists.openstack.org
  Subject: Re: [openstack-dev] [third-party-ci][neutron] What is
 Success
  exactly?
 
  On 06/30/2014 09:13 PM, Jay Pipes wrote:
  On 06/30/2014 07:08 PM, Anita Kuno wrote:
  On 06/30/2014 04:22 PM, Jay Pipes wrote:
  Hi Stackers,
 
  Some recent ML threads [1] and a hot IRC meeting today [2] brought
  up some legitimate questions around how a newly-proposed
  Stackalytics report page for Neutron External CI systems [2]
  represented the results of an external CI system as successful
  or
  not.
 
  First, I want to say that Ilya and all those involved in the
  Stackalytics program simply want to provide the most accurate
  information to developers in a format that is easily consumed.
  While there need to be some changes in how data is shown (and the
  wording of things like Tests Succeeded), I hope that the
  community knows there isn't any ill intent on the part of Mirantis
  or anyone who works on Stackalytics. OK, so let's keep the
  conversation civil -- we're all working towards the same goals of
  transparency and accuracy. :)
 
  Alright, now, Anita and Kurt Taylor were asking a very poignant
  question:
 
  But what does CI tested really mean? just running tests? or
  tested to pass some level of requirements?
 
  In this nascent world of external CI systems, we have a set of
  issues that we need to resolve:
 
  1) All of the CI systems are different.
 
  Some run Bash scripts. Some run Jenkins slaves and devstack-gate
  scripts. Others run custom Python code that spawns VMs and
  publishes logs to some public domain.
 
  As a community, we need to decide whether it is worth putting in
  the effort to create a single, unified, installable and runnable
  CI system, so that we can legitimately say all of the external
  systems are identical, with the exception of the driver code for
  vendor X being substituted in the Neutron codebase.
 
  If the goal of the external CI systems is to produce reliable,
  consistent results, I feel the answer to the above is yes, but
  I'm interested to hear what others think. Frankly, in the world of
  benchmarks, it would be unthinkable to say go ahead and everyone
  run your own benchmark suite, because you would get wildly
  different results. A similar problem has emerged here.
 
  2) There is no mediation or verification that the external CI
  system is actually testing anything at all
 
  As a community, we need to decide whether the current system of
  self-policing should continue. If it should, then language on
  reports like [3] should be very clear that any numbers derived
  from such systems should be taken with a grain of salt. Use of the
  word Success should be avoided, as it has connotations (in
  English, at
  least) that the result has been verified, which is simply not the
  case as long as no verification or mediation occurs for any
  external
  CI system.
 
  3) There is no clear indication of what tests are being run, and
  therefore there is no clear indication of what success is
 
  I think we can all agree that a test has three possible outcomes:
  pass, fail, and skip. The results of a test suite run therefore is
  nothing more than the aggregation of which tests passed, which
  failed, and which were skipped.
 
  As a community, we must document, for each project, what are
  expected set of tests that must be run for each merged patch into
  the project's source tree. This documentation should be
  discoverable so that reports like [3] can be crystal-clear on what
  the data shown actually means. The report is simply displaying the
  data it receives from Gerrit. The community needs to be proactive
  in saying this is what is expected to be tested. This alone
  would allow the report to give information such as External CI
  system ABC performed the
  expected tests. X tests passed.
  Y tests failed. Z tests were skipped. Likewise, it would also
  make it possible for the report to give information such as
  External CI system DEF did not perform the expected tests.,
  which is excellent information in and of itself.
 
  ===
 
  In thinking about the likely answers to the above questions, I
  believe it would be prudent to change the Stackalytics report in
  question [3] in the following ways:
 
  a. Change the Success % column header to % Reported +1 Votes
  b. Change the phrase  Green cell - tests ran successfully, red
  cell
  - tests failed to Green cell - System voted +1, red cell -
  System voted -1
 
  and then, when we have more and better data (for example, # tests
  passed, failed, skipped, etc), we can provide

Re: [openstack-dev] [third-party-ci][neutron] What is Success exactly?

2014-07-03 Thread Anita Kuno
On 07/03/2014 09:52 AM, Sullivan, Jon Paul wrote:
 -Original Message-
 From: Anita Kuno [mailto:ante...@anteaya.info]
 Sent: 03 July 2014 13:53
 To: openstack-dev@lists.openstack.org
 Subject: Re: [openstack-dev] [third-party-ci][neutron] What is Success
 exactly?

 On 07/03/2014 06:22 AM, Sullivan, Jon Paul wrote:
 -Original Message-
 From: Anita Kuno [mailto:ante...@anteaya.info]
 Sent: 01 July 2014 14:42
 To: openstack-dev@lists.openstack.org
 Subject: Re: [openstack-dev] [third-party-ci][neutron] What is
 Success
 exactly?

 On 06/30/2014 09:13 PM, Jay Pipes wrote:
 On 06/30/2014 07:08 PM, Anita Kuno wrote:
 On 06/30/2014 04:22 PM, Jay Pipes wrote:
 Hi Stackers,

 Some recent ML threads [1] and a hot IRC meeting today [2] brought
 up some legitimate questions around how a newly-proposed
 Stackalytics report page for Neutron External CI systems [2]
 represented the results of an external CI system as successful
 or
 not.

 First, I want to say that Ilya and all those involved in the
 Stackalytics program simply want to provide the most accurate
 information to developers in a format that is easily consumed.
 While there need to be some changes in how data is shown (and the
 wording of things like Tests Succeeded), I hope that the
 community knows there isn't any ill intent on the part of Mirantis
 or anyone who works on Stackalytics. OK, so let's keep the
 conversation civil -- we're all working towards the same goals of
 transparency and accuracy. :)

 Alright, now, Anita and Kurt Taylor were asking a very poignant
 question:

 But what does CI tested really mean? just running tests? or
 tested to pass some level of requirements?

 In this nascent world of external CI systems, we have a set of
 issues that we need to resolve:

 1) All of the CI systems are different.

 Some run Bash scripts. Some run Jenkins slaves and devstack-gate
 scripts. Others run custom Python code that spawns VMs and
 publishes logs to some public domain.

 As a community, we need to decide whether it is worth putting in
 the effort to create a single, unified, installable and runnable
 CI system, so that we can legitimately say all of the external
 systems are identical, with the exception of the driver code for
 vendor X being substituted in the Neutron codebase.

 If the goal of the external CI systems is to produce reliable,
 consistent results, I feel the answer to the above is yes, but
 I'm interested to hear what others think. Frankly, in the world of
 benchmarks, it would be unthinkable to say go ahead and everyone
 run your own benchmark suite, because you would get wildly
 different results. A similar problem has emerged here.

 2) There is no mediation or verification that the external CI
 system is actually testing anything at all

 As a community, we need to decide whether the current system of
 self-policing should continue. If it should, then language on
 reports like [3] should be very clear that any numbers derived
 from such systems should be taken with a grain of salt. Use of the
 word Success should be avoided, as it has connotations (in
 English, at
 least) that the result has been verified, which is simply not the
 case as long as no verification or mediation occurs for any
 external
 CI system.

 3) There is no clear indication of what tests are being run, and
 therefore there is no clear indication of what success is

 I think we can all agree that a test has three possible outcomes:
 pass, fail, and skip. The results of a test suite run therefore is
 nothing more than the aggregation of which tests passed, which
 failed, and which were skipped.

 As a community, we must document, for each project, what are
 expected set of tests that must be run for each merged patch into
 the project's source tree. This documentation should be
 discoverable so that reports like [3] can be crystal-clear on what
 the data shown actually means. The report is simply displaying the
 data it receives from Gerrit. The community needs to be proactive
 in saying this is what is expected to be tested. This alone
 would allow the report to give information such as External CI
 system ABC performed the
 expected tests. X tests passed.
 Y tests failed. Z tests were skipped. Likewise, it would also
 make it possible for the report to give information such as
 External CI system DEF did not perform the expected tests.,
 which is excellent information in and of itself.

 ===

 In thinking about the likely answers to the above questions, I
 believe it would be prudent to change the Stackalytics report in
 question [3] in the following ways:

 a. Change the Success % column header to % Reported +1 Votes
 b. Change the phrase  Green cell - tests ran successfully, red
 cell
 - tests failed to Green cell - System voted +1, red cell -
 System voted -1

 and then, when we have more and better data (for example, # tests
 passed, failed, skipped, etc), we can provide more detailed
 information than just reported +1

Re: [openstack-dev] [third-party-ci][neutron] What is Success exactly?

2014-07-03 Thread Anita Kuno
On 07/03/2014 10:31 AM, Sullivan, Jon Paul wrote:
 -Original Message-
 From: Anita Kuno [mailto:ante...@anteaya.info]
 Sent: 03 July 2014 15:06
 To: openstack-dev@lists.openstack.org
 Subject: Re: [openstack-dev] [third-party-ci][neutron] What is Success
 exactly?
 
 I guess you missed this last time - the mail had gotten quite long :D
 
I had yes, thanks for drawing my attention to it.
 Hi Jon Paul: (Is it Jon Paul or Jon?)

 Hi Anita - it's Jon-Paul or JP.

Ah, thanks JP.


 But there is a second side to what you were saying which was the
 developer feedback.  I guess I am suggesting that if you are putting a
 system in place for developers to vote on the 3rd party CI, should that
 same system be in effect for the Openstack check/gate jobs?

 It already is, it is called #openstack-infra. All day long (the 24 hour
 day) developers drop in and tell us exactly how they feel about any
 aspect of OpenStack Infrastructure. They let us know when documentation
 is confusing, when things are broken, when a patch should have been
 merged and failed to be, when Zuul is caught in a retest loop and
 occasionally when we get something right.
 
 I had presumed this to be the case, and I guess this is the first port of
 call when developers have questions on 3rd-party CI? If so, then a very
 interesting metric that would speak to the reliability of the 3rd-party CI
 might be responsiveness to IRC questions?
 
Yes, developers ask questions about what specific 3rd party accounts are
doing when commenting on their patches all the time. Often some version
of "Why is systemx-ci commenting on my patch?" Many of them ask in infra
and many of them ping me directly.

Then we move into some variation of "Systemx-ci is {some behaviour that
does not meet requirements}. {What do I do? | Can someone do something
to fix this? | Can we disable this system?}"
Requirements: http://ci.openstack.org/third_party.html#requirements
Open Patches:
https://review.openstack.org/#/q/status:open+project:openstack-infra/config+branch:master+topic:third-party,n,z
and
https://review.openstack.org/#/c/104565/

Sure, responsiveness to IRC questions would be an interesting metric. Now,
how to collect the data? I suppose you could scrape IRC logs - I don't want
to see the regex to parse what is considered to be IRC responsiveness.
You could ask the infra team if you like, but that is a subset of
what I have already suggested for all developers, plus it puts more work on
infra, which I will not voluntarily do, not if we can avoid it. You could
ask me, but my response would be an aggregation of gut
reactions based on personal experience with individual admins for
different accounts; it doesn't scale, and while I feel it has some
credence, it should not be the sole source of information for any metric,
given the scope of the issue. We currently have 70 Gerrit CI accounts;
I'm not going to offer an opinion on accounts I have never interacted
with if everything has been running fine and they have had no reason to
interact with me.
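
Roughly, such scraping could look like the sketch below - the "HH:MM <nick>
message" log-line format and the notion of an answer are both drastic
simplifications (no date rollover, no threading), which rather illustrates how
crude the metric would be:

    import re

    LINE = re.compile(r'^(\d{2}):(\d{2})\s+<([^>]+)>\s+(.*)$')

    def minutes_to_first_reply(log_lines, ci_account):
        # Time from the first mention of a CI account by one nick until a
        # different nick mentions it - a very rough proxy for responsiveness.
        asked_at, asker = None, None
        for line in log_lines:
            m = LINE.match(line)
            if not m:
                continue
            minute_of_day = int(m.group(1)) * 60 + int(m.group(2))
            nick, text = m.group(3), m.group(4)
            if ci_account not in text:
                continue
            if asked_at is None:
                asked_at, asker = minute_of_day, nick
            elif nick != asker:
                return minute_of_day - asked_at
        return None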

By allowing the developers affected by the third-party systems to offer
their feedback, a more diverse source of data is collected. Keep in mind
that as a developer I have never had to splunk through logs from third-party
CI on my patches, since the majority of my patches are for infra, which has
very little testing by third-party CI. I'd like to have input from
developers who do interact with third-party CI artifacts.


 OpenStack Infra logs can be found here:
 http://eavesdrop.openstack.org/irclogs/%23openstack-infra/

 I don't think having an IRC channel for third party is practical, because
 it simply will split infra resources and I have my doubts about how
 responsive folks would be in it. Hence my suggestion of the pages to
 allow developers to share the kind of information they share in
 #openstack-infra all the time.
 
 Yes - I can understand your viewpoint on this, and it makes sense to have a
 forum where developers can raise comments or concerns and those responsible
 for the 3rd party CI can respond.
Thanks, and hopefully they will respond; at the very least it will be
a quick way of seeing how many developers have attempted to give
feedback and the speed, or lack thereof, of a response.

Some system admins are very responsive, and some are even
beginning to be proactive: sending an email to the ML (dev and/or
infra) informing us when their system is failing to build (we have
to get faster at disabling systems in those circumstances, but I
appreciate the proactiveness here), as well as posting when they move
their logs to a URL with a DNS name rather than a hard-coded IP address
and that change breaks backward compatibility. Thank you for being proactive.
http://lists.openstack.org/pipermail/openstack-infra/2014-July/001473.html
http://lists.openstack.org/pipermail/openstack-dev/2014-July/039270.html

Thanks JP,
Anita.
 


Re: [openstack-dev] [third-party-ci][neutron] What is Success exactly?

2014-07-03 Thread Anita Kuno
On 07/03/2014 01:27 PM, Kevin Benton wrote:
 This allows the viewer to see categories of reviews based upon their
 divergence from OpenStack's Jenkins results. I think evaluating
 divergence from Jenkins might be a metric worth consideration.
 
 I think the only thing this really reflects though is how much the third
 party CI system is mirroring Jenkins.
 A system that frequently diverges may be functioning perfectly fine and
 just has a vastly different code path that it is integration testing so it
 is legitimately detecting failures the OpenStack CI cannot.
Great.

How do we measure the degree to which it is legitimately detecting failures?

Thanks Kevin,
Anita.
 
 --
 Kevin Benton
 
 
 On Thu, Jul 3, 2014 at 6:49 AM, Anita Kuno ante...@anteaya.info wrote:
 
 On 07/03/2014 07:12 AM, Salvatore Orlando wrote:
 Apologies for quoting again the top post of the thread.

 Comments inline (mostly thinking aloud)
 Salvatore


 On 30 June 2014 22:22, Jay Pipes jaypi...@gmail.com wrote:

 Hi Stackers,

 Some recent ML threads [1] and a hot IRC meeting today [2] brought up
 some
 legitimate questions around how a newly-proposed Stackalytics report
 page
 for Neutron External CI systems [2] represented the results of an
 external
 CI system as successful or not.

 First, I want to say that Ilya and all those involved in the
 Stackalytics
 program simply want to provide the most accurate information to
 developers
 in a format that is easily consumed. While there need to be some
 changes in
 how data is shown (and the wording of things like Tests Succeeded), I
 hope that the community knows there isn't any ill intent on the part of
 Mirantis or anyone who works on Stackalytics. OK, so let's keep the
 conversation civil -- we're all working towards the same goals of
 transparency and accuracy. :)

 Alright, now, Anita and Kurt Taylor were asking a very poignant
 question:

 But what does CI tested really mean? just running tests? or tested to
 pass some level of requirements?

 In this nascent world of external CI systems, we have a set of issues
 that
 we need to resolve:

 1) All of the CI systems are different.

 Some run Bash scripts. Some run Jenkins slaves and devstack-gate
 scripts.
 Others run custom Python code that spawns VMs and publishes logs to some
 public domain.

 As a community, we need to decide whether it is worth putting in the
 effort to create a single, unified, installable and runnable CI system,
 so
 that we can legitimately say all of the external systems are identical,
 with the exception of the driver code for vendor X being substituted in
 the
 Neutron codebase.


 I think such system already exists, and it's documented here:
 http://ci.openstack.org/
 Still, understanding it is quite a learning curve, and running it is not
 exactly straightforward. But I guess that's pretty much understandable
 given the complexity of the system, isn't it?



 If the goal of the external CI systems is to produce reliable,
 consistent
 results, I feel the answer to the above is yes, but I'm interested to
 hear what others think. Frankly, in the world of benchmarks, it would be
 unthinkable to say go ahead and everyone run your own benchmark suite,
 because you would get wildly different results. A similar problem has
 emerged here.


 I don't think the particular infrastructure which might range from an
 openstack-ci clone to a 100-line bash script would have an impact on the
 reliability of the quality assessment regarding a particular driver or
 plugin. This is determined, in my opinion, by the quantity and nature of
 tests one runs on a specific driver. In Neutron for instance, there is a
 wide range of choices - from a few test cases in tempest.api.network to
 the
 full smoketest job. As long there is no minimal standard here, then it
 would be difficult to assess the quality of the evaluation from a CI
 system, unless we explicitly keep into account coverage into the
 evaluation.

 On the other hand, different CI infrastructures will have different
 levels
 in terms of % of patches tested and % of infrastructure failures. I think
 it might not be a terrible idea to use these parameters to evaluate how
 good a CI is from an infra standpoint. However, there are still open
 questions. For instance, a CI might have a low patch % score because it
 only needs to test patches affecting a given driver.


 2) There is no mediation or verification that the external CI system is
 actually testing anything at all

 As a community, we need to decide whether the current system of
 self-policing should continue. If it should, then language on reports
 like
 [3] should be very clear that any numbers derived from such systems
 should
 be taken with a grain of salt. Use of the word Success should be
 avoided,
 as it has connotations (in English, at least) that the result has been
 verified, which is simply not the case as long as no verification or
 mediation occurs for any external CI system.





 3) There is no 

Re: [openstack-dev] [third-party-ci][neutron] What is Success exactly?

2014-07-03 Thread Kevin Benton
Maybe we can require periodic checks against the head of the master
branch (which should always pass) and build statistics based on the results
of that. Otherwise it seems like we have to take a CI system's word for it
that a particular patch indeed broke that system.
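
As a sketch of the kind of statistic that could come out of that - a pass rate
over scheduled runs against current master, recorded as simple booleans (the
record format is hypothetical):

    def master_pass_rate(periodic_results):
        # periodic_results: True/False per scheduled run against tip of master.
        if not periodic_results:
            return None
        return 100.0 * sum(periodic_results) / len(periodic_results)

Since master is expected to always pass, a low rate here points at the CI
itself rather than at any particular patch it votes on.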

--
Kevin Benton


On Thu, Jul 3, 2014 at 11:07 AM, Anita Kuno ante...@anteaya.info wrote:

 On 07/03/2014 01:27 PM, Kevin Benton wrote:
  This allows the viewer to see categories of reviews based upon their
  divergence from OpenStack's Jenkins results. I think evaluating
  divergence from Jenkins might be a metric worth consideration.
 
  I think the only thing this really reflects though is how much the third
  party CI system is mirroring Jenkins.
  A system that frequently diverges may be functioning perfectly fine and
  just has a vastly different code path that it is integration testing so
 it
  is legitimately detecting failures the OpenStack CI cannot.
 Great.

 How do we measure the degree to which it is legitimately detecting
 failures?

 Thanks Kevin,
 Anita.
 
  --
  Kevin Benton
 
 
  On Thu, Jul 3, 2014 at 6:49 AM, Anita Kuno ante...@anteaya.info wrote:
 
  On 07/03/2014 07:12 AM, Salvatore Orlando wrote:
  Apologies for quoting again the top post of the thread.
 
  Comments inline (mostly thinking aloud)
  Salvatore
 
 
  On 30 June 2014 22:22, Jay Pipes jaypi...@gmail.com wrote:
 
  Hi Stackers,
 
  Some recent ML threads [1] and a hot IRC meeting today [2] brought up
  some
  legitimate questions around how a newly-proposed Stackalytics report
  page
  for Neutron External CI systems [2] represented the results of an
  external
  CI system as successful or not.
 
  First, I want to say that Ilya and all those involved in the
  Stackalytics
  program simply want to provide the most accurate information to
  developers
  in a format that is easily consumed. While there need to be some
  changes in
  how data is shown (and the wording of things like Tests Succeeded),
 I
  hope that the community knows there isn't any ill intent on the part
 of
  Mirantis or anyone who works on Stackalytics. OK, so let's keep the
  conversation civil -- we're all working towards the same goals of
  transparency and accuracy. :)
 
  Alright, now, Anita and Kurt Taylor were asking a very poignant
  question:
 
  But what does CI tested really mean? just running tests? or tested to
  pass some level of requirements?
 
  In this nascent world of external CI systems, we have a set of issues
  that
  we need to resolve:
 
  1) All of the CI systems are different.
 
  Some run Bash scripts. Some run Jenkins slaves and devstack-gate
  scripts.
  Others run custom Python code that spawns VMs and publishes logs to
 some
  public domain.
 
  As a community, we need to decide whether it is worth putting in the
  effort to create a single, unified, installable and runnable CI
 system,
  so
  that we can legitimately say all of the external systems are
 identical,
  with the exception of the driver code for vendor X being substituted
 in
  the
  Neutron codebase.
 
 
  I think such system already exists, and it's documented here:
  http://ci.openstack.org/
  Still, understanding it is quite a learning curve, and running it is
 not
  exactly straightforward. But I guess that's pretty much understandable
  given the complexity of the system, isn't it?
 
 
 
  If the goal of the external CI systems is to produce reliable,
  consistent
  results, I feel the answer to the above is yes, but I'm interested
 to
  hear what others think. Frankly, in the world of benchmarks, it would
 be
  unthinkable to say go ahead and everyone run your own benchmark
 suite,
  because you would get wildly different results. A similar problem has
  emerged here.
 
 
  I don't think the particular infrastructure which might range from an
  openstack-ci clone to a 100-line bash script would have an impact on
 the
  reliability of the quality assessment regarding a particular driver
 or
  plugin. This is determined, in my opinion, by the quantity and nature
 of
  tests one runs on a specific driver. In Neutron for instance, there is
 a
  wide range of choices - from a few test cases in tempest.api.network to
  the
  full smoketest job. As long there is no minimal standard here, then it
  would be difficult to assess the quality of the evaluation from a CI
  system, unless we explicitly keep into account coverage into the
  evaluation.
 
  On the other hand, different CI infrastructures will have different
  levels
  in terms of % of patches tested and % of infrastructure failures. I
 think
  it might not be a terrible idea to use these parameters to evaluate how
  good a CI is from an infra standpoint. However, there are still open
  questions. For instance, a CI might have a low patch % score because it
  only needs to test patches affecting a given driver.
 
 
  2) There is no mediation or verification that the external CI system
 is
  actually testing anything at all
 
  As a 

Re: [openstack-dev] [third-party-ci][neutron] What is Success exactly?

2014-07-03 Thread Kevin Benton
Yes, I can propose a spec for that. It probably won't be until Monday.
Is that okay?


On Thu, Jul 3, 2014 at 11:42 AM, Anita Kuno ante...@anteaya.info wrote:

 On 07/03/2014 02:33 PM, Kevin Benton wrote:
  Maybe we can require periodic checks against the head of the master
  branch (which should always pass) and build statistics based on the
  results of that.
 I like this suggestion. I really like this suggestion.

 Hmm, what to do with a good suggestion? I wonder if we could capture
 it in an infra-spec and work on it from there.

 Would you feel comfortable offering a draft as an infra-spec and then
 perhaps we can discuss the design through the spec?

 What do you think?

 Thanks Kevin,
 Anita.
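
 As a rough illustration of the periodic master-branch check proposed
 above: a CI operator could run its usual job against the tip of master
 on a timer and log the outcomes, so reviewers can separate "the CI is
 unhealthy" from "this patch broke the CI". Everything in the sketch
 below is an assumption -- the repository URL, the placeholder
 run_driver_tests.sh job, and the CSV log stand in for whatever a given
 third-party CI actually uses.

 #!/usr/bin/env python
 # Sketch only: run the CI's normal test job against the tip of master on a
 # timer (e.g. from cron) and record pass/fail over time.
 # MASTER_REPO and TEST_COMMAND are placeholders, not real CI configuration.

 import csv
 import datetime
 import subprocess
 import tempfile

 MASTER_REPO = 'https://git.openstack.org/openstack/neutron'  # assumed URL
 TEST_COMMAND = ['./run_driver_tests.sh']  # placeholder for the CI's real job
 HEALTH_LOG = 'master_health.csv'


 def check_master():
     """Clone current master, run the usual job, and log the outcome."""
     workdir = tempfile.mkdtemp(prefix='ci-master-check-')
     subprocess.check_call(['git', 'clone', '--depth', '1',
                            MASTER_REPO, workdir])
     passed = subprocess.call(TEST_COMMAND, cwd=workdir) == 0
     with open(HEALTH_LOG, 'a') as log:
         csv.writer(log).writerow(
             [datetime.datetime.utcnow().isoformat(), int(passed)])
     return passed


 def master_pass_rate():
     """Fraction of recorded master runs that passed -- a crude health score."""
     with open(HEALTH_LOG) as log:
         rows = [row for row in csv.reader(log) if row]
     if not rows:
         return None
     return sum(int(row[1]) for row in rows) / float(len(rows))


 if __name__ == '__main__':
     check_master()
     print('master pass rate: %.2f' % master_pass_rate())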

  Otherwise it seems like we have to take a CI system's word for it
  that a particular patch indeed broke that system.
 
  --
  Kevin Benton
 
 
  On Thu, Jul 3, 2014 at 11:07 AM, Anita Kuno ante...@anteaya.info
 wrote:
 
  On 07/03/2014 01:27 PM, Kevin Benton wrote:
  This allows the viewer to see categories of reviews based upon their
  divergence from OpenStack's Jenkins results. I think evaluating
  divergence from Jenkins might be a metric worth considering.
 
  I think the only thing this really reflects, though, is how much the
  third-party CI system is mirroring Jenkins.
  A system that frequently diverges may be functioning perfectly fine and
  simply be integration-testing a vastly different code path, so it is
  legitimately detecting failures that the OpenStack CI cannot.
  Great.
 
  How do we measure the degree to which it is legitimately detecting
  failures?
 
  Thanks Kevin,
  Anita.
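
  To make "divergence from Jenkins" measurable at all, one strawman is to
  compare the two systems' votes on the changes they both tested. As
  noted above, a high value is ambiguous on its own, so it would have to
  be read alongside other signals such as the periodic master checks. The
  vote dictionaries below are hypothetical inputs that would have to be
  harvested from Gerrit.

  # Strawman metric: fraction of commonly-voted changes on which the
  # third-party CI disagreed with Jenkins. Inputs map Gerrit change-id -> vote.

  def divergence_from_jenkins(jenkins_votes, third_party_votes):
      common = set(jenkins_votes) & set(third_party_votes)
      if not common:
          return None  # no overlap yet, nothing to compare
      disagreements = sum(
          1 for change in common
          if jenkins_votes[change] != third_party_votes[change])
      return float(disagreements) / len(common)

  # Example (hypothetical change-ids and votes):
  # divergence_from_jenkins({'I1': 1, 'I2': -1, 'I3': 1},
  #                         {'I1': 1, 'I2': 1}) == 0.5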
 
  --
  Kevin Benton
 
 

Re: [openstack-dev] [third-party-ci][neutron] What is Success exactly?

2014-06-30 Thread Franck Yelles
Hi Jay,

A couple of points.

I agree that we need to define what "success" is.
I believe the metrics that should be used are "Voted +1" and "Skipped".
Apart from certain valid cases, I would say that a "Voted -1" is mostly
a metric of bad CI health: most -1 votes are due to environment issues,
configuration problems, etc. In my case, the -1 votes are cast manually,
since I want to avoid creating extra work for the developers.

What are some possible solutions?

On the Jenkins side, I think we could develop a script that parses the
result HTML file. Jenkins will then vote (+1, 0, -1) on behalf of the
third-party CI:
- It would prevent abusive +1 votes.
- If the result HTML is empty, it would indicate that CI health is bad.
- If all the results are failing, it would also indicate that CI health
is bad.
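
A toy sketch of such a script, under invented assumptions: the report is
taken to contain one PASS/FAIL marker per test row, which a real CI's
HTML almost certainly will not match exactly. Only the voting policy
matters here -- never vote when the report is empty or when every test
failed.

# Toy vote-decision logic for the proposal above. Parsing is deliberately
# naive (a regex over the report); only the decision policy matters here.

import re
import sys


def decide_vote(report_html):
    outcomes = re.findall(r'\b(PASS|FAIL)\b', report_html)
    if not outcomes:
        return 0   # empty report: a CI health problem, abstain
    failures = outcomes.count('FAIL')
    if failures == len(outcomes):
        return 0   # everything failed: almost certainly the CI, not the patch
    return -1 if failures else 1


if __name__ == '__main__':
    with open(sys.argv[1]) as report:
        print('vote: %+d' % decide_vote(report.read()))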


Franck



Re: [openstack-dev] [third-party-ci][neutron] What is Success exactly?

2014-06-30 Thread Jay Pipes

On 06/30/2014 07:08 PM, Anita Kuno wrote:

On 06/30/2014 04:22 PM, Jay Pipes wrote:

Hi Stackers,

Some recent ML threads [1] and a hot IRC meeting today [2] brought up
some legitimate questions around how a newly-proposed Stackalytics
report page for Neutron External CI systems [2] represented the results
of an external CI system as successful or not.

First, I want to say that Ilya and all those involved in the
Stackalytics program simply want to provide the most accurate
information to developers in a format that is easily consumed. While
there need to be some changes in how data is shown (and the wording of
things like "Tests Succeeded"), I hope that the community knows there
isn't any ill intent on the part of Mirantis or anyone who works on
Stackalytics. OK, so let's keep the conversation civil -- we're all
working towards the same goals of transparency and accuracy. :)

Alright, now, Anita and Kurt Taylor were asking a very poignant question:

But what does "CI tested" really mean? just running tests? or tested to
pass some level of requirements?

In this nascent world of external CI systems, we have a set of issues
that we need to resolve:

1) All of the CI systems are different.

Some run Bash scripts. Some run Jenkins slaves and devstack-gate
scripts. Others run custom Python code that spawns VMs and publishes
logs to some public domain.

As a community, we need to decide whether it is worth putting in the
effort to create a single, unified, installable and runnable CI system,
so that we can legitimately say all of the external systems are
identical, with the exception of the driver code for vendor X being
substituted in the Neutron codebase.

If the goal of the external CI systems is to produce reliable,
consistent results, I feel the answer to the above is yes, but I'm
interested to hear what others think. Frankly, in the world of
benchmarks, it would be unthinkable to say "go ahead and everyone run
your own benchmark suite", because you would get wildly different
results. A similar problem has emerged here.

2) There is no mediation or verification that the external CI system is
actually testing anything at all

As a community, we need to decide whether the current system of
self-policing should continue. If it should, then language on reports
like [3] should be very clear that any numbers derived from such systems
should be taken with a grain of salt. Use of the word "Success" should
be avoided, as it has connotations (in English, at least) that the
result has been verified, which is simply not the case as long as no
verification or mediation occurs for any external CI system.

3) There is no clear indication of what tests are being run, and
therefore there is no clear indication of what success is

I think we can all agree that a test has three possible outcomes: pass,
fail, and skip. The results of a test suite run therefore is nothing
more than the aggregation of which tests passed, which failed, and which
were skipped.

As a community, we must document, for each project, the expected set of
tests that must be run for each merged patch into the project's source
tree. This documentation should be discoverable so that reports
like [3] can be crystal-clear on what the data shown actually means. The
report is simply displaying the data it receives from Gerrit. The
community needs to be proactive in saying "this is what is expected to
be tested". This alone would allow the report to give information such
as "External CI system ABC performed the expected tests. X tests passed.
Y tests failed. Z tests were skipped." Likewise, it would also make it
possible for the report to give information such as "External CI system
DEF did not perform the expected tests", which is excellent information
in and of itself.
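
As an illustration only, assuming a project had published its expected
test list, a report could compute exactly that kind of sentence. The
test names and the result format (test name mapped to 'pass', 'fail' or
'skip') below are invented.

# Sketch: compare one CI run against a project's published expected-test
# list and produce the kind of sentence described above. EXPECTED_TESTS and
# the result format are invented for illustration.

EXPECTED_TESTS = {
    'tempest.api.network.test_networks',
    'tempest.api.network.test_ports',
}


def summarize_run(results):
    """results: dict mapping test name -> 'pass', 'fail' or 'skip'."""
    missing = EXPECTED_TESTS - set(results)
    if missing:
        return ('did not perform the expected tests (missing: %s)'
                % ', '.join(sorted(missing)))
    counts = {'pass': 0, 'fail': 0, 'skip': 0}
    for outcome in results.values():
        counts[outcome] += 1
    return ('performed the expected tests. %(pass)d tests passed. '
            '%(fail)d tests failed. %(skip)d tests were skipped.' % counts)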

===

In thinking about the likely answers to the above questions, I believe
it would be prudent to change the Stackalytics report in question [3] in
the following ways:

a. Change the "Success %" column header to "% Reported +1 Votes"
b. Change the phrase "Green cell - tests ran successfully, red cell -
tests failed" to "Green cell - System voted +1, red cell - System voted -1"

and then, when we have more and better data (for example, # tests
passed, failed, skipped, etc), we can provide more detailed information
than just reported +1 or not.

Thoughts?

Best,
-jay

[1]
http://lists.openstack.org/pipermail/openstack-dev/2014-June/038933.html
[2]
http://eavesdrop.openstack.org/meetings/third_party/2014/third_party.2014-06-30-18.01.log.html

[3] http://stackalytics.com/report/ci/neutron/7

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

Hi Jay:

Thanks for starting this thread. You raise some interesting questions.

The question I had identified as needing definition is: what algorithm
do we use to assess the fitness of a third-party CI system?
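
Not an answer, but one possible shape for such an algorithm, combining
signals already discussed in this thread: master-branch health, coverage
of the patches the CI is supposed to test, and divergence from Jenkins.
The weights below are invented and only show that the inputs are
measurable.

# Strawman fitness score in [0, 1] for a third-party CI, combining signals
# discussed in this thread. The weights are arbitrary placeholders.

def ci_fitness(master_pass_rate, relevant_patch_coverage, jenkins_divergence):
    """All three inputs are fractions between 0.0 and 1.0."""
    weights = {'health': 0.4, 'coverage': 0.4, 'agreement': 0.2}
    return (weights['health'] * master_pass_rate
            + weights['coverage'] * relevant_patch_coverage
            + weights['agreement'] * (1.0 - jenkins_divergence))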