Re: [openstack-dev] [tripleo] tripleo gate is blocked - please read

2018-06-16 Thread Wesley Hayutin
On Sat, Jun 16, 2018 at 10:21 AM Paul Belanger 
wrote:

> On Sat, Jun 16, 2018 at 12:47:10PM +, Jeremy Stanley wrote:
> > On 2018-06-15 23:15:01 -0700 (-0700), Emilien Macchi wrote:
> > [...]
> > > ## Dockerhub proxy issue
> > > Infra was using the wrong image layer object storage proxy for Dockerhub:
> > > https://review.openstack.org/#/c/575787/
> > > Huge thanks to the infra team, especially Clark, for fixing this so
> > > quickly; it clearly helped stabilize our container jobs, and I haven't
> > > seen timeouts since we merged your patch. Thanks a ton!
> > [...]
> >
> > As best we can tell from logs, the way Dockerhub served these images
> > changed a few weeks ago (at the end of May) leading to this problem.
> > --
> > Jeremy Stanley
>
> I should also note that what we are doing here is a terrible hack: we've
> only been able to learn the necessary information by sniffing the traffic
> to hub.docker.io for our reverse proxy cache configuration. It is also
> possible this will break again in the future, so keep that in the back of
> your mind.
>

Thanks, Paul, Jeremy, and the other infra folks involved. The TripleO CI
team is currently working on tracking the time spent on some of these
container tasks. Thanks for doing what you could given the circumstances.
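
To give a concrete idea of the kind of measurement we're after, here is a
minimal sketch; the image list and the plain `docker pull` invocation are
placeholders for illustration, not what the CI jobs actually run.

import subprocess
import time

# Placeholder image list; the real jobs pull the full TripleO container set.
IMAGES = [
    "docker.io/tripleomaster/centos-binary-keystone:current-tripleo",
    "docker.io/tripleomaster/centos-binary-nova-api:current-tripleo",
]

def timed_pull(image):
    """Pull one image and return the elapsed wall-clock time in seconds."""
    start = time.monotonic()
    subprocess.run(["docker", "pull", image], check=True)
    return time.monotonic() - start

if __name__ == "__main__":
    total = 0.0
    for image in IMAGES:
        elapsed = timed_pull(image)
        total += elapsed
        print("%8.1fs  %s" % (elapsed, image))
    print("%8.1fs  total" % total)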


>
> It would be great if docker tools just worked with HTTP proxies.
>
> -Paul
>


Re: [openstack-dev] [tripleo] tripleo gate is blocked - please read

2018-06-16 Thread Paul Belanger
On Sat, Jun 16, 2018 at 12:47:10PM +, Jeremy Stanley wrote:
> On 2018-06-15 23:15:01 -0700 (-0700), Emilien Macchi wrote:
> [...]
> > ## Dockerhub proxy issue
> > Infra was using the wrong image layer object storage proxy for Dockerhub:
> > https://review.openstack.org/#/c/575787/
> > Huge thanks to the infra team, especially Clark, for fixing this so quickly;
> > it clearly helped stabilize our container jobs, and I haven't seen
> > timeouts since we merged your patch. Thanks a ton!
> [...]
> 
> As best we can tell from logs, the way Dockerhub served these images
> changed a few weeks ago (at the end of May) leading to this problem.
> -- 
> Jeremy Stanley

I should also note that what we are doing here is a terrible hack: we've only
been able to learn the necessary information by sniffing the traffic to
hub.docker.io for our reverse proxy cache configuration. It is also possible
this will break again in the future, so keep that in the back of your mind.

It would be great if docker tools just worked with HTTP proxies.
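
For anyone curious what that sniffing boils down to, here is a rough sketch
(not our actual proxy configuration) of how one can see which object-storage
backend the layer blobs redirect to, using the public v2 registry API. The
image name is just an example, and the code assumes the registry answers
with a single schema-2 manifest for the default platform.

import requests

IMAGE = "library/centos"  # example image; any public repository works
TAG = "7"

# Anonymous pull token for the Docker Hub registry API.
token_url = ("https://auth.docker.io/token?service=registry.docker.io"
             "&scope=repository:%s:pull" % IMAGE)
token = requests.get(token_url).json()["token"]
headers = {
    "Authorization": "Bearer %s" % token,
    "Accept": "application/vnd.docker.distribution.manifest.v2+json",
}

# Fetch the manifest to get the digest of the first layer.
manifest = requests.get(
    "https://registry-1.docker.io/v2/%s/manifests/%s" % (IMAGE, TAG),
    headers=headers).json()
digest = manifest["layers"][0]["digest"]

# Request the blob but do not follow the redirect: the Location header shows
# which backend actually serves the layer bytes, i.e. what the reverse proxy
# cache has to be pointed at.
resp = requests.get(
    "https://registry-1.docker.io/v2/%s/blobs/%s" % (IMAGE, digest),
    headers=headers, allow_redirects=False)
print(resp.status_code, resp.headers.get("Location"))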

-Paul



Re: [openstack-dev] [tripleo] tripleo gate is blocked - please read

2018-06-16 Thread Jeremy Stanley
On 2018-06-15 23:15:01 -0700 (-0700), Emilien Macchi wrote:
[...]
> ## Dockerhub proxy issue
> Infra was using the wrong image layer object storage proxy for Dockerhub:
> https://review.openstack.org/#/c/575787/
> Huge thanks to the infra team, especially Clark, for fixing this so quickly;
> it clearly helped stabilize our container jobs, and I haven't seen
> timeouts since we merged your patch. Thanks a ton!
[...]

As best we can tell from logs, the way Dockerhub served these images
changed a few weeks ago (at the end of May) leading to this problem.
-- 
Jeremy Stanley




Re: [openstack-dev] [tripleo] tripleo gate is blocked - please read

2018-06-16 Thread Emilien Macchi
Sending an update before the weekend:

The gate was in very bad shape again today (long queue, lots of failures),
and it turns out we had a few more issues, which we tracked here:
https://etherpad.openstack.org/p/tripleo-gate-issues-june-2018

## scenario007 broke because of a patch in networking-ovn
https://bugs.launchpad.net/tripleo/+bug/1777168
We made the job non-voting and in the meantime managed to fix it:
https://review.rdoproject.org/r/#/c/14155/
The breaking commit was:
https://github.com/openstack/networking-ovn/commit/2365df1cc3e24deb2f3745c925d78d6d8e5bb5df
Kudos to Daniel Alvarez for having the patch ready!
Also thanks to Wes for making the job non-voting in the meantime.
I've reverted the non-voting change since the situation is fixed now, so we
can vote again on this one.

## Dockerhub proxy issue
Infra was using the wrong image layer object storage proxy for Dockerhub:
https://review.openstack.org/#/c/575787/
Huge thanks to the infra team, especially Clark, for fixing this so quickly;
it clearly helped stabilize our container jobs, and I haven't seen
timeouts since we merged your patch. Thanks a ton!

## RDO master wasn't consistent anymore, python-cloudkittyclient broke
The client was refactored:
https://git.openstack.org/cgit/openstack/python-cloudkittyclient/commit/?id=d070f6a68cddf51c57e77107f1b823a8f75770ba
which broke the RPM; we had to completely rewrite the dependencies so we
could build the package:
https://review.rdoproject.org/r/#/c/14265/
Many thanks to Heikel for your responsive help at 3am, so we could get back
to a consistent state and pick up our latest RPMs, which contained a bunch
of fixes.

## Where we are now

The gate looks stable now, and you can recheck and approve things. I went
ahead and rechecked everything and made sure nothing was left abandoned.
Steve's work has merged, so I think we could reconsider
https://review.openstack.org/#/c/575330/.
Special thanks to everyone involved in these issues, and to Alex & John,
who also stepped up to help.
Enjoy your weekend!

On Thu, Jun 14, 2018 at 6:40 AM, Emilien Macchi  wrote:

> It sounds like we merged a bunch last night thanks to the revert, so I
> went ahead and restored/rechecked everything that was out of the gate. I've
> checked and nothing was left over, but let me know in case I missed
> something.
> I'll keep updating this thread with the progress made to improve the
> situation.
> So from now on, the situation is back to "normal": recheck/+W is OK.
>
> Thanks again for your patience,
>
> On Wed, Jun 13, 2018 at 10:39 PM, Emilien Macchi 
> wrote:
>
>> https://review.openstack.org/575264 just landed (and didn't time out in
>> check or gate without a recheck, so it's a good sign it helped to mitigate).
>>
>> I've restored and rechecked some patches that I evacuated from the gate;
>> please do not restore others or recheck or approve anything for now, and
>> let's see how it goes with a few patches.
>> We're still working with Steve on his patches to optimize the way we
>> deploy containers on the registry and are investigating how we could make
>> it faster with a proxy.
>>
>> Stay tuned and thanks for your patience.
>>
>> On Wed, Jun 13, 2018 at 5:50 PM, Emilien Macchi 
>> wrote:
>>
>>> TL;DR: the gate queue was 25h+; we put all patches from the gate on
>>> standby. Do not restore/recheck until further announcement.
>>>
>>> We recently enabled the containerized undercloud for multinode jobs, and
>>> we believe this was a bit premature: the container download process isn't
>>> optimized yet, so it pulls the same containers from the mirrors multiple
>>> times.
>>> This increased the job runtime and probably also made the docker.io
>>> mirrors hosted by OpenStack Infra slower at serving the same containers
>>> over and over. The time taken to prepare containers on the undercloud and
>>> then for the overcloud caused the jobs to randomly time out and therefore
>>> the gate to fail a large fraction of the time, so we decided to remove
>>> all jobs from the gate by abandoning the patches temporarily (I have them
>>> in my browser and will restore them when things are stable again; please
>>> do not touch anything).
>>>
>>> Steve Baker has been working on a series of patches that optimize the
>>> way we prepare the containers but basically the workflow will be:
>>> - pull containers needed for the undercloud into a local registry, using
>>> infra mirror if available
>>> - deploy the containerized undercloud
>>> - pull containers needed for the overcloud minus the ones already pulled
>>> for the undercloud, using infra mirror if available
>>> - update containers on the overcloud
>>> - deploy the containerized undercloud
>>>
>>> With that process, we hope to reduce the runtime of the deployment and
>>> therefore reduce the timeouts in the gate.
>>> To enable it, we need to land in that order:
>>> https://review.openstack.org/#/c/571613/,
>>> https://review.openstack.org/#/c/574485/,
>>> https://review.openstack.org/#/c/571631/ and
>>> https://review.openstack.org/#/c/568403.

Re: [openstack-dev] [tripleo] tripleo gate is blocked - please read

2018-06-14 Thread Emilien Macchi
It sounds like we merged a bunch last night thanks to the revert, so I went
ahead and restored/rechecked everything that was out of the gate. I've
checked and nothing was left over, but let me know in case I missed
something.
I'll keep updating this thread with the progress made to improve the
situation.
So from now on, the situation is back to "normal": recheck/+W is OK.

Thanks again for your patience,

On Wed, Jun 13, 2018 at 10:39 PM, Emilien Macchi  wrote:

> https://review.openstack.org/575264 just landed (and didn't time out in
> check or gate without a recheck, so it's a good sign it helped to mitigate).
>
> I've restored and rechecked some patches that I evacuated from the gate;
> please do not restore others or recheck or approve anything for now, and
> let's see how it goes with a few patches.
> We're still working with Steve on his patches to optimize the way we
> deploy containers on the registry and are investigating how we could make
> it faster with a proxy.
>
> Stay tuned and thanks for your patience.
>
> On Wed, Jun 13, 2018 at 5:50 PM, Emilien Macchi 
> wrote:
>
>> TL;DR: the gate queue was 25h+; we put all patches from the gate on
>> standby. Do not restore/recheck until further announcement.
>>
>> We recently enabled the containerized undercloud for multinode jobs, and
>> we believe this was a bit premature: the container download process isn't
>> optimized yet, so it pulls the same containers from the mirrors multiple
>> times.
>> This increased the job runtime and probably also made the docker.io
>> mirrors hosted by OpenStack Infra slower at serving the same containers
>> over and over. The time taken to prepare containers on the undercloud and
>> then for the overcloud caused the jobs to randomly time out and therefore
>> the gate to fail a large fraction of the time, so we decided to remove all
>> jobs from the gate by abandoning the patches temporarily (I have them in
>> my browser and will restore them when things are stable again; please do
>> not touch anything).
>>
>> Steve Baker has been working on a series of patches that optimize the way
>> we prepare the containers but basically the workflow will be:
>> - pull containers needed for the undercloud into a local registry, using
>> infra mirror if available
>> - deploy the containerized undercloud
>> - pull containers needed for the overcloud minus the ones already pulled
>> for the undercloud, using infra mirror if available
>> - update containers on the overcloud
>> - deploy the containerized undercloud
>>
>> With that process, we hope to reduce the runtime of the deployment and
>> therefore reduce the timeouts in the gate.
>> To enable it, we need to land in that order:
>> https://review.openstack.org/#/c/571613/,
>> https://review.openstack.org/#/c/574485/,
>> https://review.openstack.org/#/c/571631/ and
>> https://review.openstack.org/#/c/568403.
>>
>> In the meantime, we are disabling the recently enabled containerized
>> undercloud on all scenarios (https://review.openstack.org/#/c/575264/) as
>> a mitigation, with the hope of stabilizing things until Steve's patches
>> land. Hopefully, we can merge Steve's work tonight/tomorrow and re-enable
>> the containerized undercloud on the scenarios after checking that we don't
>> have timeouts and that deployment runtimes are reasonable.
>>
>> That's the plan we came up with; if you have any questions or feedback,
>> please share.
>> --
>> Emilien, Steve and Wes
>>
>
>
>
> --
> Emilien Macchi
>



-- 
Emilien Macchi


Re: [openstack-dev] [tripleo] tripleo gate is blocked - please read

2018-06-14 Thread Monty Taylor

On 06/13/2018 07:50 PM, Emilien Macchi wrote:
TL;DR: the gate queue was 25h+; we put all patches from the gate on standby.
Do not restore/recheck until further announcement.


We recently enabled the containerized undercloud for multinode jobs, and we
believe this was a bit premature: the container download process isn't
optimized yet, so it pulls the same containers from the mirrors multiple
times.
This increased the job runtime and probably also made the docker.io mirrors
hosted by OpenStack Infra slower at serving the same containers over and
over. The time taken to prepare containers on the undercloud and then for
the overcloud caused the jobs to randomly time out and therefore the gate to
fail a large fraction of the time, so we decided to remove all jobs from the
gate by abandoning the patches temporarily (I have them in my browser and
will restore them when things are stable again; please do not touch
anything).


Steve Baker has been working on a series of patches that optimize the 
way we prepare the containers but basically the workflow will be:
- pull containers needed for the undercloud into a local registry, using 
infra mirror if available

- deploy the containerized undercloud
- pull containers needed for the overcloud minus the ones already pulled 
for the undercloud, using infra mirror if available

- update containers on the overcloud
- deploy the containerized undercloud


That sounds like a great improvement. Well done!

With that process, we hope to reduce the runtime of the deployment and 
therefore reduce the timeouts in the gate.
To enable it, we need to land in that order: 
https://review.openstack.org/#/c/571613/, 
https://review.openstack.org/#/c/574485/, 
https://review.openstack.org/#/c/571631/ and 
https://review.openstack.org/#/c/568403.


In the meantime, we are disabling the recently enabled containerized
undercloud on all scenarios (https://review.openstack.org/#/c/575264/) as a
mitigation, with the hope of stabilizing things until Steve's patches land.
Hopefully, we can merge Steve's work tonight/tomorrow and re-enable the
containerized undercloud on the scenarios after checking that we don't have
timeouts and that deployment runtimes are reasonable.


That's the plan we came up with; if you have any questions or feedback,
please share.

--
Emilien, Steve and Wes




Re: [openstack-dev] [tripleo] tripleo gate is blocked - please read

2018-06-14 Thread Bogdan Dobrelya

On 6/14/18 3:50 AM, Emilien Macchi wrote:
TL;DR: the gate queue was 25h+; we put all patches from the gate on standby.
Do not restore/recheck until further announcement.


We recently enabled the containerized undercloud for multinode jobs, and we
believe this was a bit premature: the container download process isn't
optimized yet, so it pulls the same containers from the mirrors multiple
times.
This increased the job runtime and probably also made the docker.io mirrors
hosted by OpenStack Infra slower at serving the same containers over and
over. The time taken to prepare containers on the undercloud and then for
the overcloud caused the jobs to randomly time out and therefore the gate to
fail a large fraction of the time, so we decided to remove all jobs from the
gate by abandoning the patches temporarily (I have them in my browser and
will restore them when things are stable again; please do not touch
anything).


Steve Baker has been working on a series of patches that optimize the 
way we prepare the containers but basically the workflow will be:
- pull containers needed for the undercloud into a local registry, using 
infra mirror if available

- deploy the containerized undercloud
- pull containers needed for the overcloud minus the ones already pulled 
for the undercloud, using infra mirror if available

- update containers on the overcloud
- deploy the containerized undercloud


Let me also note that it may be time to introduce job dependencies [0].
Dependencies might somewhat alleviate registry/mirror DoS issues, like the
one we have currently, by running jobs in batches instead of firing them
all off at once.


We still have options to think about. The undercloud deployment takes
longer than standalone, but it provides better coverage and therefore
better predicts (and cuts off) future overcloud failures for the dependent
jobs. Standalone is less stable yet, though. The containers update check
may also be an option for step 1 or step 2, before the remaining multinode
jobs execute.


Skipping those dependent jobs when an earlier stage fails, in turn, reduces
the DoS effect on registries and mirrors.


[0] 
https://review.openstack.org/#/q/status:open+project:openstack-infra/tripleo-ci+topic:ci_pipelines
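
As a toy sketch of the batching effect (plain Python, not Zuul
configuration; the job names and the dependency graph are invented),
dependent jobs only become eligible once everything they depend on has
finished, so they never all hammer the registry and mirrors at the same
moment:

# Toy scheduler: group jobs into batches so that a job only runs after all
# of its dependencies have completed. Job names and dependencies are made up.
deps = {
    "undercloud-containers": set(),
    "scenario001-multinode": {"undercloud-containers"},
    "scenario002-multinode": {"undercloud-containers"},
    "scenario003-multinode": {"undercloud-containers"},
}

def batches(deps):
    """Yield successive sets of jobs whose dependencies have all completed."""
    done = set()
    remaining = dict(deps)
    while remaining:
        ready = {job for job, needs in remaining.items() if needs <= done}
        if not ready:
            raise ValueError("dependency cycle among: %s" % sorted(remaining))
        yield ready
        done |= ready
        for job in ready:
            del remaining[job]

for number, batch in enumerate(batches(deps), 1):
    print("batch %d: %s" % (number, ", ".join(sorted(batch))))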




With that process, we hope to reduce the runtime of the deployment and 
therefore reduce the timeouts in the gate.
To enable it, we need to land in that order: 
https://review.openstack.org/#/c/571613/, 
https://review.openstack.org/#/c/574485/, 
https://review.openstack.org/#/c/571631/ and 
https://review.openstack.org/#/c/568403.


In the meantime, we are disabling the recently enabled containerized
undercloud on all scenarios (https://review.openstack.org/#/c/575264/) as a
mitigation, with the hope of stabilizing things until Steve's patches land.
Hopefully, we can merge Steve's work tonight/tomorrow and re-enable the
containerized undercloud on the scenarios after checking that we don't have
timeouts and that deployment runtimes are reasonable.


That's the plan we came up with; if you have any questions or feedback,
please share.

--
Emilien, Steve and Wes





--
Best regards,
Bogdan Dobrelya,
Irc #bogdando



Re: [openstack-dev] [tripleo] tripleo gate is blocked - please read

2018-06-13 Thread Emilien Macchi
https://review.openstack.org/575264 just landed (and didn't time out in
check or gate without a recheck, so it's a good sign it helped to mitigate).

I've restored and rechecked some patches that I evacuated from the gate;
please do not restore others or recheck or approve anything for now, and
let's see how it goes with a few patches.
We're still working with Steve on his patches to optimize the way we deploy
containers on the registry and are investigating how we could make it
faster with a proxy.

Stay tuned and thanks for your patience.

On Wed, Jun 13, 2018 at 5:50 PM, Emilien Macchi  wrote:

> TL;DR: the gate queue was 25h+; we put all patches from the gate on
> standby. Do not restore/recheck until further announcement.
>
> We recently enabled the containerized undercloud for multinode jobs, and
> we believe this was a bit premature: the container download process isn't
> optimized yet, so it pulls the same containers from the mirrors multiple
> times.
> This increased the job runtime and probably also made the docker.io
> mirrors hosted by OpenStack Infra slower at serving the same containers
> over and over. The time taken to prepare containers on the undercloud and
> then for the overcloud caused the jobs to randomly time out and therefore
> the gate to fail a large fraction of the time, so we decided to remove all
> jobs from the gate by abandoning the patches temporarily (I have them in
> my browser and will restore them when things are stable again; please do
> not touch anything).
>
> Steve Baker has been working on a series of patches that optimize the way
> we prepare the containers but basically the workflow will be:
> - pull containers needed for the undercloud into a local registry, using
> infra mirror if available
> - deploy the containerized undercloud
> - pull containers needed for the overcloud minus the ones already pulled
> for the undercloud, using infra mirror if available
> - update containers on the overcloud
> - deploy the containerized undercloud
>
> With that process, we hope to reduce the runtime of the deployment and
> therefore reduce the timeouts in the gate.
> To enable it, we need to land in that order:
> https://review.openstack.org/#/c/571613/,
> https://review.openstack.org/#/c/574485/,
> https://review.openstack.org/#/c/571631/ and
> https://review.openstack.org/#/c/568403.
>
> In the meantime, we are disabling the recently enabled containerized
> undercloud on all scenarios (https://review.openstack.org/#/c/575264/) as
> a mitigation, with the hope of stabilizing things until Steve's patches
> land. Hopefully, we can merge Steve's work tonight/tomorrow and re-enable
> the containerized undercloud on the scenarios after checking that we don't
> have timeouts and that deployment runtimes are reasonable.
>
> That's the plan we came up with; if you have any questions or feedback,
> please share.
> --
> Emilien, Steve and Wes
>



-- 
Emilien Macchi


[openstack-dev] [tripleo] tripleo gate is blocked - please read

2018-06-13 Thread Emilien Macchi
TL;DR: the gate queue was 25h+; we put all patches from the gate on standby.
Do not restore/recheck until further announcement.

We recently enabled the containerized undercloud for multinode jobs, and we
believe this was a bit premature: the container download process isn't
optimized yet, so it pulls the same containers from the mirrors multiple
times.
This increased the job runtime and probably also made the docker.io mirrors
hosted by OpenStack Infra slower at serving the same containers over and
over. The time taken to prepare containers on the undercloud and then for
the overcloud caused the jobs to randomly time out and therefore the gate to
fail a large fraction of the time, so we decided to remove all jobs from the
gate by abandoning the patches temporarily (I have them in my browser and
will restore them when things are stable again; please do not touch
anything).

Steve Baker has been working on a series of patches that optimize the way
we prepare the containers; basically, the workflow will be (a rough sketch
of the delta-pull step follows the list):
- pull containers needed for the undercloud into a local registry, using
infra mirror if available
- deploy the containerized undercloud
- pull containers needed for the overcloud minus the ones already pulled
for the undercloud, using infra mirror if available
- update containers on the overcloud
- deploy the containerized undercloud
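
A rough sketch of that delta-pull step, assuming plain `docker` CLI calls;
the image names and the mirror hostname are placeholders, and the real
logic lives in Steve's patches listed just below.

import subprocess

# Placeholder image sets; in the real workflow these come from the container
# image prepare step.
undercloud_images = {
    "docker.io/tripleomaster/centos-binary-keystone:current-tripleo",
    "docker.io/tripleomaster/centos-binary-mariadb:current-tripleo",
}
overcloud_images = undercloud_images | {
    "docker.io/tripleomaster/centos-binary-nova-compute:current-tripleo",
}

# Hypothetical infra mirror; jobs would only set this when a mirror exists.
MIRROR = "mirror.regionone.example.org:8082"

def pull(image, mirror=None):
    """Pull one image, rewriting docker.io references to the mirror if set."""
    source = image
    if mirror and image.startswith("docker.io/"):
        source = image.replace("docker.io", mirror, 1)
    subprocess.run(["docker", "pull", source], check=True)

# Only fetch what the overcloud needs beyond what the undercloud already has.
for image in sorted(overcloud_images - undercloud_images):
    pull(image, mirror=MIRROR)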

With that process, we hope to reduce the runtime of the deployment and
therefore reduce the timeouts in the gate.
To enable it, we need to land in that order:
https://review.openstack.org/#/c/571613/,
https://review.openstack.org/#/c/574485/,
https://review.openstack.org/#/c/571631/ and
https://review.openstack.org/#/c/568403.

In the meantime, we are disabling the recently enabled containerized
undercloud on all scenarios (https://review.openstack.org/#/c/575264/) as a
mitigation, with the hope of stabilizing things until Steve's patches land.
Hopefully, we can merge Steve's work tonight/tomorrow and re-enable the
containerized undercloud on the scenarios after checking that we don't have
timeouts and that deployment runtimes are reasonable.

That's the plan we came up with; if you have any questions or feedback,
please share.
-- 
Emilien, Steve and Wes