Re: [openstack-dev] [third-party-ci][neutron] What is Success exactly?
On Thu, Jul 3, 2014 at 6:12 AM, Salvatore Orlando sorla...@nicira.com wrote:
> Apologies for quoting again the top post of the thread. Comments inline (mostly thinking aloud)
> Salvatore
> [snip]
Re: [openstack-dev] [third-party-ci][neutron] What is Success exactly?
On Jul 3, 2014 8:57 AM, Anita Kuno ante...@anteaya.info wrote:
On 07/03/2014 06:22 AM, Sullivan, Jon Paul wrote:
-----Original Message-----
From: Anita Kuno [mailto:ante...@anteaya.info]
Sent: 01 July 2014 14:42
To: openstack-dev@lists.openstack.org
Subject: Re: [openstack-dev] [third-party-ci][neutron] What is Success exactly?

On 06/30/2014 09:13 PM, Jay Pipes wrote:
On 06/30/2014 07:08 PM, Anita Kuno wrote:
On 06/30/2014 04:22 PM, Jay Pipes wrote:

Hi Stackers,

Some recent ML threads [1] and a hot IRC meeting today [2] brought up some legitimate questions around how a newly-proposed Stackalytics report page for Neutron External CI systems [3] represented the results of an external CI system as successful or not.

First, I want to say that Ilya and all those involved in the Stackalytics program simply want to provide the most accurate information to developers in a format that is easily consumed. While there need to be some changes in how data is shown (and the wording of things like "Tests Succeeded"), I hope that the community knows there isn't any ill intent on the part of Mirantis or anyone who works on Stackalytics. OK, so let's keep the conversation civil -- we're all working towards the same goals of transparency and accuracy. :)

Alright, now, Anita and Kurt Taylor were asking a very poignant question: "But what does 'CI tested' really mean? Just running tests? Or tested to pass some level of requirements?"

In this nascent world of external CI systems, we have a set of issues that we need to resolve:

1) All of the CI systems are different. Some run Bash scripts. Some run Jenkins slaves and devstack-gate scripts. Others run custom Python code that spawns VMs and publishes logs to some public domain. As a community, we need to decide whether it is worth putting in the effort to create a single, unified, installable and runnable CI system, so that we can legitimately say all of the external systems are identical, with the exception of the driver code for vendor X being substituted in the Neutron codebase. If the goal of the external CI systems is to produce reliable, consistent results, I feel the answer to the above is yes, but I'm interested to hear what others think. Frankly, in the world of benchmarks, it would be unthinkable to say "go ahead and everyone run your own benchmark suite", because you would get wildly different results. A similar problem has emerged here.

2) There is no mediation or verification that the external CI system is actually testing anything at all. As a community, we need to decide whether the current system of self-policing should continue. If it should, then language on reports like [3] should be very clear that any numbers derived from such systems should be taken with a grain of salt. Use of the word "Success" should be avoided, as it has connotations (in English, at least) that the result has been verified, which is simply not the case as long as no verification or mediation occurs for any external CI system.

3) There is no clear indication of what tests are being run, and therefore there is no clear indication of what "success" is. I think we can all agree that a test has three possible outcomes: pass, fail, and skip. The results of a test suite run are therefore nothing more than the aggregation of which tests passed, which failed, and which were skipped. As a community, we must document, for each project, what the expected set of tests is that must be run for each merged patch into the project's source tree. This documentation should be discoverable so that reports like [3] can be crystal-clear on what the data shown actually means. The report is simply displaying the data it receives from Gerrit. The community needs to be proactive in saying "this is what is expected to be tested." This alone would allow the report to give information such as "External CI system ABC performed the expected tests. X tests passed. Y tests failed. Z tests were skipped." Likewise, it would also make it possible for the report to give information such as "External CI system DEF did not perform the expected tests", which is excellent information in and of itself.

===

In thinking about the likely answers to the above questions, I believe it would be prudent to change the Stackalytics report in question [3] in the following ways:

a. Change the "Success %" column header to "% Reported +1 Votes"

b. Change the phrase "Green cell - tests ran successfully, red cell - tests failed" to "Green cell - System voted +1, red cell - System voted -1"

and then, when we have more and better data (for example, # tests passed, failed, skipped, etc), we can provide more detailed information than just "reported +1" or not.

Thoughts?

Best,
-jay

[1] http://lists.openstack.org/pipermail/openstack-dev/2014-June/038933.html
[2] http://eavesdrop.openstack.org/meetings/third_party/2014/third_party.2014-06-30-18.01.log.html
[3] http://stackalytics.com/report/ci/neutron/7
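To make Jay's proposal (a) concrete: "% Reported +1 Votes" is nothing more than the share of a CI account's votes that were +1, with no implication that anything was verified. A minimal sketch of that calculation in Python, assuming the votes have already been exported from Gerrit into a simple list of dicts (the field names here are hypothetical, not Stackalytics' or Gerrit's actual schema):

```python
from collections import Counter

def reported_plus_one_percentage(votes):
    """Compute '% Reported +1 Votes' for one CI account.

    `votes` is an iterable of dicts like {"change": "I123...", "vote": 1},
    where vote is the +1 or -1 the system reported to Gerrit. A change the
    system never voted on simply does not appear here; nothing in this
    number verifies what, if anything, was actually tested.
    """
    counts = Counter(v["vote"] for v in votes)
    total = counts[1] + counts[-1]
    if total == 0:
        return None  # the account has not voted at all
    return 100.0 * counts[1] / total

# Example: three +1 votes and one -1 vote -> 75.0
sample = [{"change": "Iaaa", "vote": 1}, {"change": "Ibbb", "vote": 1},
          {"change": "Iccc", "vote": -1}, {"change": "Iddd", "vote": 1}]
print(reported_plus_one_percentage(sample))
```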
Re: [openstack-dev] [third-party-ci][neutron] What is Success exactly?
On 07/03/2014 02:41 PM, Fawad Khaliq wrote:
On Thu, Jul 3, 2014 at 10:27 AM, Kevin Benton blak...@gmail.com wrote:

This allows the viewer to see categories of reviews based upon their divergence from OpenStack's Jenkins results. I think evaluating divergence from Jenkins might be a metric worth consideration.

I think the only thing this really reflects though is how much the third party CI system is mirroring Jenkins. A system that frequently diverges may be functioning perfectly fine and just has a vastly different code path that it is integration testing, so it is legitimately detecting failures the OpenStack CI cannot.

Exactly. +1

Unfortunately, there's no good way to prove that.

-jay
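For what it's worth, the "divergence from Jenkins" number itself is easy to compute, even if, as Jay points out, it cannot distinguish a broken CI from one that legitimately catches failures Jenkins does not exercise. A rough sketch, assuming we already have each system's latest vote per change (the data shapes are assumed for illustration, not an existing Stackalytics query):

```python
def divergence_from_jenkins(jenkins_votes, ci_votes):
    """Return (% of shared changes where the votes differ, number compared).

    Both arguments map a Gerrit change ID to that system's latest vote
    (+1 or -1). Divergence alone cannot say whether the third-party
    system is malfunctioning or legitimately detecting failures that
    OpenStack's CI cannot reach.
    """
    shared = set(jenkins_votes) & set(ci_votes)
    if not shared:
        return None, 0
    differing = sum(1 for c in shared if jenkins_votes[c] != ci_votes[c])
    return 100.0 * differing / len(shared), len(shared)

jenkins = {"Iaaa": 1, "Ibbb": 1, "Iccc": -1}
vendor_ci = {"Iaaa": 1, "Ibbb": -1, "Iccc": -1, "Iddd": 1}  # Iddd not shared
print(divergence_from_jenkins(jenkins, vendor_ci))  # (33.33..., 3)
```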
Re: [openstack-dev] [third-party-ci][neutron] What is Success exactly?
-----Original Message-----
From: Anita Kuno [mailto:ante...@anteaya.info]
Sent: 01 July 2014 14:42
To: openstack-dev@lists.openstack.org
Subject: Re: [openstack-dev] [third-party-ci][neutron] What is Success exactly?
[snip]
Re: [openstack-dev] [third-party-ci][neutron] What is Success exactly?
Apologies for quoting again the top post of the thread. Comments inline (mostly thinking aloud).

Salvatore

On 30 June 2014 22:22, Jay Pipes jaypi...@gmail.com wrote:
> Hi Stackers,
>
> Some recent ML threads [1] and a hot IRC meeting today [2] brought up some legitimate questions around how a newly-proposed Stackalytics report page for Neutron External CI systems [3] represented the results of an external CI system as successful or not.
> [snip]
> 1) All of the CI systems are different. Some run Bash scripts. Some run Jenkins slaves and devstack-gate scripts. Others run custom Python code that spawns VMs and publishes logs to some public domain. As a community, we need to decide whether it is worth putting in the effort to create a single, unified, installable and runnable CI system, so that we can legitimately say all of the external systems are identical, with the exception of the driver code for vendor X being substituted in the Neutron codebase.

I think such a system already exists, and it's documented here: http://ci.openstack.org/ Still, understanding it is quite a learning curve, and running it is not exactly straightforward. But I guess that's pretty much understandable given the complexity of the system, isn't it?

> If the goal of the external CI systems is to produce reliable, consistent results, I feel the answer to the above is yes, but I'm interested to hear what others think. Frankly, in the world of benchmarks, it would be unthinkable to say "go ahead and everyone run your own benchmark suite", because you would get wildly different results. A similar problem has emerged here.

I don't think the particular infrastructure, which might range from an openstack-ci clone to a 100-line bash script, would have an impact on the reliability of the quality assessment regarding a particular driver or plugin. That is determined, in my opinion, by the quantity and nature of tests one runs on a specific driver. In Neutron, for instance, there is a wide range of choices - from a few test cases in tempest.api.network to the full smoketest job. As long as there is no minimal standard here, it will be difficult to assess the quality of the evaluation from a CI system, unless we explicitly take coverage into account in the evaluation.

On the other hand, different CI infrastructures will have different levels in terms of % of patches tested and % of infrastructure failures. I think it might not be a terrible idea to use these parameters to evaluate how good a CI is from an infra standpoint. However, there are still open questions. For instance, a CI might have a low patch % score because it only needs to test patches affecting a given driver.

> 2) There is no mediation or verification that the external CI system is actually testing anything at all [snip]
> 3) There is no clear indication of what tests are being run, and therefore there is no clear indication of what "success" is [snip]
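Salvatore's two infra-level indicators above (% of patches tested and % of infrastructure failures) could be computed from a CI system's own run log along these lines; the record layout below is assumed purely for illustration, and no thresholds are being proposed:

```python
def infra_metrics(candidate_changes, runs):
    """Compute '% of patches tested' and '% of infrastructure failures'.

    candidate_changes: set of change IDs the CI was expected to look at
        (for a driver-specific CI this may legitimately be a subset of
        all Neutron changes, which is the caveat raised above).
    runs: list of dicts like {"change": "Iaaa", "outcome": "passed"},
        where outcome is "passed", "failed", or "infra_error" (an error
        in the CI environment itself rather than in the patch under test).
    """
    tested = {r["change"] for r in runs}
    pct_tested = (100.0 * len(tested & candidate_changes) / len(candidate_changes)
                  if candidate_changes else 0.0)
    infra_errors = sum(1 for r in runs if r["outcome"] == "infra_error")
    pct_infra_failures = 100.0 * infra_errors / len(runs) if runs else 0.0
    return pct_tested, pct_infra_failures

runs = [{"change": "Iaaa", "outcome": "passed"},
        {"change": "Ibbb", "outcome": "infra_error"}]
print(infra_metrics({"Iaaa", "Ibbb", "Iccc"}, runs))  # (66.66..., 50.0)
```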
Re: [openstack-dev] [third-party-ci][neutron] What is Success exactly?
On 07/03/2014 06:22 AM, Sullivan, Jon Paul wrote:
> -----Original Message-----
> From: Anita Kuno [mailto:ante...@anteaya.info]
> Sent: 01 July 2014 14:42
> [snip]
Re: [openstack-dev] [third-party-ci][neutron] What is Success exactly?
On 07/03/2014 07:12 AM, Salvatore Orlando wrote:
> Apologies for quoting again the top post of the thread. Comments inline (mostly thinking aloud)
> Salvatore
> [snip]
Re: [openstack-dev] [third-party-ci][neutron] What is Success exactly?
-----Original Message-----
From: Anita Kuno [mailto:ante...@anteaya.info]
Sent: 03 July 2014 13:53
To: openstack-dev@lists.openstack.org
Subject: Re: [openstack-dev] [third-party-ci][neutron] What is Success exactly?
[snip]
Re: [openstack-dev] [third-party-ci][neutron] What is Success exactly?
On 07/03/2014 09:52 AM, Sullivan, Jon Paul wrote:
> -----Original Message-----
> From: Anita Kuno [mailto:ante...@anteaya.info]
> Sent: 03 July 2014 13:53
> [snip]
Re: [openstack-dev] [third-party-ci][neutron] What is Success exactly?
On 07/03/2014 10:31 AM, Sullivan, Jon Paul wrote:
-----Original Message-----
From: Anita Kuno [mailto:ante...@anteaya.info]
Sent: 03 July 2014 15:06
To: openstack-dev@lists.openstack.org
Subject: Re: [openstack-dev] [third-party-ci][neutron] What is Success exactly?

I guess you missed this last time - the mail had gotten quite long :D

I had yes, thanks for drawing my attention to it.

Hi Jon Paul: (Is it Jon Paul or Jon?)

Hi Anita - it's Jon-Paul or JP.

Ah, thanks JP.

But there is a second side to what you were saying which was the developer feedback. I guess I am suggesting that if you are putting a system in place for developers to vote on the 3rd party CI, should that same system be in effect for the OpenStack check/gate jobs?

It already is, it is called #openstack-infra. All day long (the 24 hour day) developers drop in and tell us exactly how they feel about any aspect of OpenStack Infrastructure. They let us know when documentation is confusing, when things are broken, when a patch should have been merged and failed to be, when Zuul is caught in a retest loop and occasionally when we get something right.

I had presumed this to be the case, and I guess this is the first port of call when developers have questions on 3rd-party CI? If so, then a very interesting metric that would speak to the reliability of the 3rd party CI might be responsiveness to irc questions?

Yes, developers ask questions about what specific 3rd party accounts are doing when commenting on their patches all the time. Often some version of "Why is systemx-ci commenting on my patch?" Many of them ask in infra and many of them ping me directly. Then we move into some variation of "Systemx-ci is {some behaviour that does not meet requirements}. {What do I do? | Can someone do something to fix this? | Can we disable this system?}"

Requirements: http://ci.openstack.org/third_party.html#requirements

Open patches: https://review.openstack.org/#/q/status:open+project:openstack-infra/config+branch:master+topic:third-party,n,z and https://review.openstack.org/#/c/104565/

Sure, responsiveness to irc questions would be an interesting metric. Now how to collect data. I suppose you could scrape irc logs - I don't want to see the regex to parse what is considered to be irc responsiveness. You could ask the infra team if you like, but then that is a subset of what I have already suggested for all developers, plus it puts more work on infra, which I will not voluntarily do, not if we can avoid it. You could ask me, but my response will be based on an aggregation of my gut responses based on personal experience with individual admins for different accounts; it doesn't scale, and while I feel it has some credence, it should not be the sole source of information for any metric given the scope of the issue. We currently have 70 gerrit ci accounts. I'm not going to offer an opinion on accounts I have never interacted with if everything has been running fine and they have had no reason to interact with me. By allowing the developers affected by the third party systems to offer their feedback, a more diverse source of data is collected. Keep in mind that as a developer I have never had to splunk logs from third party ci on my patches, since the majority of my patches are for infra, which has very little testing by third party ci. I'd like to have input from developers who do interact with third party ci artifacts.

OpenStack Infra logs can be found here: http://eavesdrop.openstack.org/irclogs/%23openstack-infra/

I don't think having an irc channel for third party is practical because it simply will split infra resources, and I have my doubts about how responsive folks would be in it. Hence my suggestion of the pages to allow developers to share the kind of information they share in openstack-infra all the time.

Yes - I can understand your viewpoint on this, and it makes sense to have a forum where developers can raise comments or concerns and those responsible for the 3rd party CI can respond.

Thanks, and hopefully they will respond, and at the very least it will be a quick way of seeing how many developers have attempted to give feedback and the speed or lack thereof of a response.

There are some system admins that are very responsive, and some are even beginning to be proactive, by sending an email to the ml (dev and/or infra) and informing us when their system is failing to build (we have to get faster at disabling systems in those circumstances, but I appreciate the proactiveness here), as well as posting when they move their logs to a url with a dns rather than a hard coded ip address and that breaks backward compatibility. Thank you for being proactive.

http://lists.openstack.org/pipermail/openstack-infra/2014-July/001473.html
http://lists.openstack.org/pipermail/openstack-dev/2014-July/039270.html

Thanks JP,
Anita.
Re: [openstack-dev] [third-party-ci][neutron] What is Success exactly?
On 07/03/2014 01:27 PM, Kevin Benton wrote:
> This allows the viewer to see categories of reviews based upon their divergence from OpenStack's Jenkins results. I think evaluating divergence from Jenkins might be a metric worth consideration.
>
> I think the only thing this really reflects though is how much the third party CI system is mirroring Jenkins. A system that frequently diverges may be functioning perfectly fine and just has a vastly different code path that it is integration testing, so it is legitimately detecting failures the OpenStack CI cannot.

Great. How do we measure the degree to which it is legitimately detecting failures?

Thanks Kevin,
Anita.

> --
> Kevin Benton
> [snip]
Re: [openstack-dev] [third-party-ci][neutron] What is Success exactly?
Maybe we can require periodic checks against the head of the master branch (which should always pass) and build statistics based on the results of that. Otherwise it seems like we have to take a CI system's word for it that a particular patch indeed broke that system.

--
Kevin Benton

On Thu, Jul 3, 2014 at 11:07 AM, Anita Kuno ante...@anteaya.info wrote:
> On 07/03/2014 01:27 PM, Kevin Benton wrote:
> [snip]
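Kevin's suggestion amounts to each CI periodically running its job against the tip of master (which is expected to pass) and publishing the outcomes, so reviewers get a baseline that does not depend on taking the CI's word about individual patches. A small sketch of the record-keeping side, with the actual test command left as a placeholder since every system runs something different:

```python
import json
import subprocess
import time
from pathlib import Path

RESULTS = Path("periodic_master_results.json")
# Placeholder for whatever the CI actually runs against master HEAD.
TEST_COMMAND = ["./run-ci-job.sh", "--branch", "master"]

def run_periodic_check():
    """Run the CI job against current master HEAD and append the outcome."""
    started = time.time()
    proc = subprocess.run(TEST_COMMAND)
    record = {"timestamp": started, "passed": proc.returncode == 0}
    history = json.loads(RESULTS.read_text()) if RESULTS.exists() else []
    history.append(record)
    RESULTS.write_text(json.dumps(history, indent=2))

def master_pass_rate(window=30):
    """Pass rate (%) over the last `window` periodic runs against master."""
    if not RESULTS.exists():
        return None
    history = json.loads(RESULTS.read_text())[-window:]
    if not history:
        return None
    return 100.0 * sum(r["passed"] for r in history) / len(history)
```

A published pass rate against master would also give Anita's earlier question ("how do we measure whether it is legitimately detecting failures?") at least a partial answer: a system that diverges on patches but consistently passes master looks more trustworthy than one that also fails master regularly.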
Re: [openstack-dev] [third-party-ci][neutron] What is Success exactly?
Yes, I can propose a spec for that. It probably won't be until Monday. Is that okay? On Thu, Jul 3, 2014 at 11:42 AM, Anita Kuno ante...@anteaya.info wrote: On 07/03/2014 02:33 PM, Kevin Benton wrote: Maybe we can require period checks against the head of the master branch (which should always pass) and build statistics based on the results of that. I like this suggestion. I really like this suggestion. H, what to do with a good suggestion? I wonder if we could capture it in an infra-spec and work on it from there. Would you feel comfortable offering a draft as an infra-spec and then perhaps we can discuss the design through the spec? What do you think? Thanks Kevin, Anita. Otherwise it seems like we have to take a CI system's word for it that a particular patch indeed broke that system. -- Kevin Benton On Thu, Jul 3, 2014 at 11:07 AM, Anita Kuno ante...@anteaya.info wrote: On 07/03/2014 01:27 PM, Kevin Benton wrote: This allows the viewer to see categories of reviews based upon their divergence from OpenStack's Jenkins results. I think evaluating divergence from Jenkins might be a metric worth consideration. I think the only thing this really reflects though is how much the third party CI system is mirroring Jenkins. A system that frequently diverges may be functioning perfectly fine and just has a vastly different code path that it is integration testing so it is legitimately detecting failures the OpenStack CI cannot. Great. How do we measure the degree to which it is legitimately detecting failures? Thanks Kevin, Anita. -- Kevin Benton On Thu, Jul 3, 2014 at 6:49 AM, Anita Kuno ante...@anteaya.info wrote: On 07/03/2014 07:12 AM, Salvatore Orlando wrote: Apologies for quoting again the top post of the thread. Comments inline (mostly thinking aloud) Salvatore On 30 June 2014 22:22, Jay Pipes jaypi...@gmail.com wrote: Hi Stackers, Some recent ML threads [1] and a hot IRC meeting today [2] brought up some legitimate questions around how a newly-proposed Stackalytics report page for Neutron External CI systems [2] represented the results of an external CI system as successful or not. First, I want to say that Ilya and all those involved in the Stackalytics program simply want to provide the most accurate information to developers in a format that is easily consumed. While there need to be some changes in how data is shown (and the wording of things like Tests Succeeded), I hope that the community knows there isn't any ill intent on the part of Mirantis or anyone who works on Stackalytics. OK, so let's keep the conversation civil -- we're all working towards the same goals of transparency and accuracy. :) Alright, now, Anita and Kurt Taylor were asking a very poignant question: But what does CI tested really mean? just running tests? or tested to pass some level of requirements? In this nascent world of external CI systems, we have a set of issues that we need to resolve: 1) All of the CI systems are different. Some run Bash scripts. Some run Jenkins slaves and devstack-gate scripts. Others run custom Python code that spawns VMs and publishes logs to some public domain. As a community, we need to decide whether it is worth putting in the effort to create a single, unified, installable and runnable CI system, so that we can legitimately say all of the external systems are identical, with the exception of the driver code for vendor X being substituted in the Neutron codebase. 
Re: [openstack-dev] [third-party-ci][neutron] What is Success exactly?
Hi Jay, A couple of points. I agree that we need to define what success is. I believe the metrics that should be used are "Voted +1" and "Skipped". In certain valid cases, though, I would say that "Voted -1" is really mostly a metric of bad CI health: most of the -1s are due to environment issues, configuration problems, etc. In my case, the -1s are cast manually, since I want to avoid giving extra work to the developer.

What are some possible solutions? On the Jenkins side, I think we could develop a script that parses the result HTML file; Jenkins would then vote (+1, 0, -1) on behalf of the third-party CI.
- It would prevent abusive +1s.
- If the result HTML is empty, it would indicate that the CI's health is bad.
- If all the results are failing, it would also indicate that the CI's health is bad.
Franck
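To illustrate the vote-deciding script Franck describes, here is a rough sketch assuming the job's results have already been scraped into a simple list of "PASS"/"FAIL" markers. The input format and function name are made up for illustration, not an existing Jenkins plugin:

    def decide_vote(results):
        """Return (vote, health_note) for one run, from a list of test outcomes.

        results is a list of "PASS"/"FAIL" strings scraped from the job's
        result page (placeholder format).
        """
        if not results:
            # An empty result page is a CI-health problem, not a patch problem:
            # abstain rather than vote, and flag the system as unhealthy.
            return 0, "unhealthy: no results published"
        if all(r == "FAIL" for r in results):
            # Everything failing usually means an environment or configuration
            # issue rather than a genuine driver regression.
            return 0, "unhealthy: all tests failing"
        if any(r == "FAIL" for r in results):
            return -1, "ok"
        return 1, "ok"


    print(decide_vote([]))                        # abstain, CI flagged unhealthy
    print(decide_vote(["FAIL", "FAIL"]))          # abstain, CI flagged unhealthy
    print(decide_vote(["PASS", "FAIL", "PASS"]))  # vote -1
    print(decide_vote(["PASS", "PASS"]))          # vote +1

The design choice here matches Franck's point: a -1 is only cast when at least some tests pass and others fail, so environment-wide breakage is reported as bad CI health rather than pushed onto the patch author.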
Re: [openstack-dev] [third-party-ci][neutron] What is Success exactly?
On 06/30/2014 07:08 PM, Anita Kuno wrote: On 06/30/2014 04:22 PM, Jay Pipes wrote:

2) There is no mediation or verification that the external CI system is actually testing anything at all. As a community, we need to decide whether the current system of self-policing should continue. If it should, then language on reports like [3] should be very clear that any numbers derived from such systems should be taken with a grain of salt. Use of the word "Success" should be avoided, as it has connotations (in English, at least) that the result has been verified, which is simply not the case as long as no verification or mediation occurs for any external CI system.

3) There is no clear indication of what tests are being run, and therefore there is no clear indication of what success is. I think we can all agree that a test has three possible outcomes: pass, fail, and skip. The results of a test suite run are therefore nothing more than the aggregation of which tests passed, which failed, and which were skipped. As a community, we must document, for each project, the expected set of tests that must be run for each patch merged into the project's source tree. This documentation should be discoverable so that reports like [3] can be crystal-clear about what the data shown actually means. The report is simply displaying the data it receives from Gerrit; the community needs to be proactive in saying this is what is expected to be tested. This alone would allow the report to give information such as "External CI system ABC performed the expected tests. X tests passed. Y tests failed.
Z tests were skipped." Likewise, it would also make it possible for the report to give information such as "External CI system DEF did not perform the expected tests," which is excellent information in and of itself.

===

In thinking about the likely answers to the above questions, I believe it would be prudent to change the Stackalytics report in question [3] in the following ways:

a. Change the "Success %" column header to "% Reported +1 Votes".
b. Change the phrase "Green cell - tests ran successfully, red cell - tests failed" to "Green cell - System voted +1, red cell - System voted -1".

And then, when we have more and better data (for example, # of tests passed, failed, skipped, etc.), we can provide more detailed information than just reported +1 or not. Thoughts? Best, -jay

[1] http://lists.openstack.org/pipermail/openstack-dev/2014-June/038933.html
[2] http://eavesdrop.openstack.org/meetings/third_party/2014/third_party.2014-06-30-18.01.log.html
[3] http://stackalytics.com/report/ci/neutron/7

Hi Jay: Thanks for starting this thread. You raise some interesting questions. The question I had identified as needing definition is: what algorithm do we use to assess the fitness of a third-party CI system?
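As a rough illustration of the reporting Jay outlines in point 3, here is a small sketch that checks one CI run against a documented list of expected tests and aggregates pass/fail/skip counts. The data shapes, CI names, and test names are illustrative only, not the actual Stackalytics data model:

    from collections import Counter


    def summarize(ci_name, expected_tests, outcomes):
        """Build the kind of sentence the report could show for one CI run.

        expected_tests is the documented set of tests the project expects;
        outcomes maps test name -> "pass" | "fail" | "skip" for this run.
        """
        missing = expected_tests - set(outcomes)
        if missing:
            return ("External CI system %s did not perform the expected tests "
                    "(%d of %d missing)."
                    % (ci_name, len(missing), len(expected_tests)))
        counts = Counter(outcomes[t] for t in expected_tests)
        return ("External CI system %s performed the expected tests. "
                "%d passed, %d failed, %d were skipped."
                % (ci_name, counts["pass"], counts["fail"], counts["skip"]))


    expected = {"tempest.api.network.test_networks",
                "tempest.api.network.test_ports"}
    print(summarize("ABC", expected,
                    {"tempest.api.network.test_networks": "pass",
                     "tempest.api.network.test_ports": "skip"}))
    print(summarize("DEF", expected,
                    {"tempest.api.network.test_networks": "pass"}))

Note that this kind of summary is only meaningful once the expected test set is actually documented and discoverable per project, which is the prerequisite Jay's point 3 calls for.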