Re: [openstack-dev] [tripleo] [ci] recheck impact on CI infrastructure

2017-01-02 Thread Sven Anderson
Hi Emilien and all,

On 16.12.2016 01:26, Emilien Macchi wrote:
> On Thu, Dec 15, 2016 at 12:22 PM, Sven Anderson  wrote:
>> Hi all,
>>
>> while I was waiting again for the CI to be fixed and didn't want to
>> torture it with additional rechecks, I wanted to find out how much of
>> our CI infrastructure we waste with rechecks. My assumption was that
>> every recheck is a waste of resources caused by a false negative, because
>> it renders the previous build useless. So I wrote a small script[1] to
>> calculate how many rechecks are made on average per built patch-set. It
>> calculates the number of patch-sets of merged changes that CI actually
>> tested (some patch-sets are never tested, because they were updated
>> before CI started testing them), the number of rechecks issued on these
>> patch-sets, and a value "CI-factor", which is the factor by which the
>> rechecks increased the number of CI runs: without rechecks it would be
>> 1, and if every tested patch-set had exactly one recheck it would be 2.
> 
> I see 2 different topics here.
> 
> # One is not related to $topic but still worth mentioning:
> "while I was waiting again for the CI to be fixed"
> 
> This week has been tough, and many of us spent our time resolving
> various complex problems in TripleO CI, mostly related to external
> dependencies (qemu upgrade, CentOS 7.3 upgrade, tripleo-ci infra,
> etc.).
> Resolving these problems is very challenging, and you'll notice that
> only a few of us actually work on this task, while a lot of people
> continue to push their features "hoping" that they will pass CI at
> some point, and if not, well, just 'recheck'.
> That is one way of working, I would say. I personally can't continue to
> code if the project I'm working on has broken CI.
>
> In a previous job, I worked in a team where everyone stopped regular
> work when CI was broken and focused on fixing it.
> I'm not saying everyone should stop their tasks and help, but this
> "wait and see" comment doesn't actually help us move forward.
> People need to get more involved in CI and be more helpful. I know
> it's difficult, but it's something anyone can learn, just as you would
> learn how to write Python code, for example.

I think you took my mail the wrong way. I didn't want to say that
anyone is not doing their job right, and I didn't want to complain. I
know how challenging this is. In my previous job I was the person
running the CI (among other things). I just wanted to share the
results, because I think it's interesting what percentage of our CI
infrastructure is "wasted" by rechecks. On the one hand, I wanted to
raise awareness that we should not just blindly "recheck until
verified"; on the other hand, I wanted to show how valuable it is to
keep CI stable.
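
As a rough illustration of the "wasted" percentage: the sketch below
converts a CI-factor into the share of all CI runs that were rechecks.
It assumes the CI-factor is defined as in the original mail (total runs
divided by tested patch-sets) and that a recheck costs about as much as
the initial run; the function name recheck_share is made up for this
example and is not part of the script[1].

```python
def recheck_share(ci_factor: float) -> float:
    """Fraction of all CI runs that were rechecks, given a CI-factor.

    With P tested patch-sets and R rechecks, the CI-factor is (P + R) / P,
    so the recheck share R / (P + R) equals 1 - 1 / CI-factor.
    """
    if ci_factor < 1:
        raise ValueError("a CI-factor cannot be below 1")
    return 1 - 1 / ci_factor


# 1.71 is the yearly CI-factor reported for tripleo-heat-templates.
print(f"{recheck_share(1.71):.1%}")
# A CI-factor of 2 means every second run was a recheck.
print(f"{recheck_share(2.0):.1%}")
```

Under these assumptions, "71% more resources" corresponds to a bit over
40% of all runs being rechecks.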

Is it really the case that more CI people would help here? I would have
expected that, as long as we don't do more modularized testing, it
doesn't scale. Would more CI people fix the problems more quickly? Or is
it more that the burden could be distributed across more shoulders, so
that it's not always the same people who have to interrupt their work?
The latter wouldn't improve the situation, but would just spread the
burden more fairly.

With my post I mainly wanted to provide reliable data and emphasize how
important a stable CI and the work on it are, and that we should all
refrain from blindly rechecking.


Happy New Year to everyone!

Sven

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [tripleo] [ci] recheck impact on CI infrastructure

2016-12-15 Thread Emilien Macchi
On Thu, Dec 15, 2016 at 12:22 PM, Sven Anderson  wrote:
> Hi all,
>
> while I was waiting again for the CI to be fixed and didn't want to
> torture it with additional rechecks, I wanted to find out how much of
> our CI infrastructure we waste with rechecks. My assumption was that
> every recheck is a waste of resources caused by a false negative, because
> it renders the previous build useless. So I wrote a small script[1] to
> calculate how many rechecks are made on average per built patch-set. It
> calculates the number of patch-sets of merged changes that CI actually
> tested (some patch-sets are never tested, because they were updated
> before CI started testing them), the number of rechecks issued on these
> patch-sets, and a value "CI-factor", which is the factor by which the
> rechecks increased the number of CI runs: without rechecks it would be
> 1, and if every tested patch-set had exactly one recheck it would be 2.

I see 2 different topics here.

# One is not related to $topic but still worth mentioning:
"while I was waiting again for the CI to be fixed"

This week has been tough, and many of us spent our time resolving
various complex problems in TripleO CI, mostly related to external
dependencies (qemu upgrade, CentOS 7.3 upgrade, tripleo-ci infra,
etc.).
Resolving these problems is very challenging, and you'll notice that
only a few of us actually work on this task, while a lot of people
continue to push their features "hoping" that they will pass CI at
some point, and if not, well, just 'recheck'.
That is one way of working, I would say. I personally can't continue to
code if the project I'm working on has broken CI.

In a previous job, I worked in a team where everyone stopped regular
work when CI was broken and focused on fixing it.
I'm not saying everyone should stop their tasks and help, but this
"wait and see" comment doesn't actually help us move forward.
People need to get more involved in CI and be more helpful. I know
it's difficult, but it's something anyone can learn, just as you would
learn how to write Python code, for example.


# The second one is about the actual $topic and your stats.
Yes, we have been thinking about a way to optimize how we restart
CI jobs, and this is under discussion:
https://review.openstack.org/#/c/411450/

As you can see, there is some pushback from Clark, who is infra-core,
so we might want to continue the discussion here and see how it goes.


In the long term, our goal is to have more consistency in the way we
test TripleO and to get more adoption of the tools we're developing for
CI, so that they are more consumable by anyone in our community. We
also hope to have more people involved when things are broken, and not
always the same folks spending days and evenings "extinguishing
fires". We are working hard on CI stabilization and consolidation with
multinode scenarios and OVB improvements, but it takes time and
iterations.

Any help is highly welcome here.



Re: [openstack-dev] [tripleo] [ci] recheck impact on CI infrastructure

2016-12-15 Thread Diana Clarke
Neat, thanks Sven!

Here are the nova stats:

http://paste.openstack.org/show/592551/

--diana




[openstack-dev] [tripleo] [ci] recheck impact on CI infrastructure

2016-12-15 Thread Sven Anderson
Hi all,

while I was waiting again for the CI to be fixed and didn't want to
torture it with additional rechecks, I wanted to find out how much of
our CI infrastructure we waste with rechecks. My assumption was that
every recheck is a waste of resources caused by a false negative, because
it renders the previous build useless. So I wrote a small script[1] to
calculate how many rechecks are made on average per built patch-set. It
calculates the number of patch-sets of merged changes that CI actually
tested (some patch-sets are never tested, because they were updated
before CI started testing them), the number of rechecks issued on these
patch-sets, and a value "CI-factor", which is the factor by which the
rechecks increased the number of CI runs: without rechecks it would be
1, and if every tested patch-set had exactly one recheck it would be 2.
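
A minimal sketch of the CI-factor as described above, assuming it is
simply 1 + rechecks / tested patch-sets (the monthly figures further
down are consistent with this reading). This is not the script[1]
itself; the function name is made up for illustration.

```python
def ci_factor(patch_sets: int, rechecks: int) -> float:
    """Factor by which rechecks multiply the number of CI runs.

    Assumed formula: every tested patch-set costs one run, and every
    recheck costs one additional run.
    """
    if patch_sets <= 0:
        raise ValueError("need at least one tested patch-set")
    return 1 + rechecks / patch_sets


print(ci_factor(100, 0))                # no rechecks at all
print(ci_factor(100, 100))              # one recheck per patch-set on average
print(round(ci_factor(221, 102), 2))    # January figures for tripleo-heat-templates
```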

The results were not as bad as I had feared: we are below 2 for most of
the projects I tested. :-) But still, on THT for instance, we use 71%
more resources because of the false negatives. I made monthly
breakdowns, so you can see a positive trend at least.


Here are the results:

Project: tripleo-heat-templates

 month  patches  rechecks  CI-factor
     1      221       102       1.46
     2      282       300       2.06
     3      588       567       1.96
     4      220       253       2.15
     5      333       242       1.73
     6      459       325       1.71
     7      612       390       1.64
     8      694       442       1.64
     9      717       440       1.61
    10      474       316       1.67
    11      358       189       1.53
    12      168        80       1.48
 total     5126      3646       1.71

Project: tripleo-common

 month  patches  rechecks  CI-factor
     1       73        29        1.4
     2       59        48       1.81
     3       92       101        2.1
     4       17        19       2.12
     5       47        27       1.57
     6       83        46       1.55
     7       66        26       1.39
     8      209       102       1.49
     9      261       129       1.49
    10      110        51       1.46
    11      121        47       1.39
    12       40        19       1.48
 total     1178       644       1.55

Project: tripleo-puppet-elements

 month  patches  rechecks  CI-factor
     1       24         9       1.38
     2        9        20       3.22
     3        7        16       3.29
     4        9        24       3.67
     5       14        17       2.21
     6       17        33       2.94
     7       12        16       2.33
     8       15        21        2.4
     9       10        14        2.4
    10       12         5       1.42
    11       34        25       1.74
    12       10        13        2.3
 total      173       213       2.23

Project: puppet-tripleo

 month  patches  rechecks  CI-factor
     1       29        23       1.79
     2       36        68       2.89
     3       40        44        2.1
     4       68        74       2.09
     5      129        43       1.33
     6      265       206       1.78
     7      235       118        1.5
     8      193       130       1.67
     9      147       123       1.84
    10      233       159       1.68
    11      137        86       1.63
    12       20         5       1.25
 total     1532      1079        1.7
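
As a quick sanity check on the totals, the sketch below recomputes the
per-project CI-factor and a combined factor over all four projects. It
assumes CI-factor = 1 + rechecks / patches, reads the last row of the
puppet-tripleo table as 1532 patches and 1079 rechecks, and uses
hypothetical names throughout; it is not taken from the script[1].

```python
# (patches, rechecks) from the "total" row of each table above.
totals = {
    "tripleo-heat-templates": (5126, 3646),
    "tripleo-common": (1178, 644),
    "tripleo-puppet-elements": (173, 213),
    "puppet-tripleo": (1532, 1079),
}


def ci_factor(patch_sets: int, rechecks: int) -> float:
    """Assumed definition: one run per patch-set plus one per recheck."""
    return 1 + rechecks / patch_sets


for project, (patch_sets, rechecks) in totals.items():
    print(f"{project:<26}{ci_factor(patch_sets, rechecks):>6.2f}")

# Combined factor, weighting each project by its number of patch-sets.
all_patch_sets = sum(p for p, _ in totals.values())
all_rechecks = sum(r for _, r in totals.values())
combined = ci_factor(all_patch_sets, all_rechecks)
print(f"{'all four projects':<26}{combined:>6.2f}")
```

The combined value comes out at roughly 1.70, dominated by
tripleo-heat-templates because it has by far the most patch-sets.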


[1] https://gist.github.com/ansiwen/e139cbf25bc243d30629e0157fc753ff
