Re: [openstack-dev] [heat][infra] Help needed! high gate failure rate
On Thu, Aug 10, 2017 at 12:04 PM, Paul Belanger wrote:
> On Thu, Aug 10, 2017 at 07:22:42PM +0530, Rabi Mishra wrote:
>> [...]
>>
>> We download an image from a fedora mirror and it seems to take more
>> than 1hr:
>>
>> http://logs.openstack.org/41/484741/7/check/gate-heat-dsvm-functional-convg-mysql-lbaasv2-py35-ubuntu-xenial/a797010/logs/devstacklog.txt.gz#_2017-08-10_04_13_14_400
>>
>> Probably an issue with the specific mirror or some infra network
>> bandwidth issue. I've submitted a patch to change the mirror to see
>> if that helps.
>
> Today we mirror both fedora-26 [1] and fedora-25 (the latter to be
> removed shortly), so if you want to consider bumping your image for
> testing, you can fetch it from our AFS mirrors.
>
> You can source /etc/ci/mirror_info.sh to get information about the
> things we mirror.
>
> [1] http://mirror.regionone.infracloud-vanilla.openstack.org/fedora/releases/26/CloudImages/x86_64/images/

In order to make the gate happy, I've taken the time to submit this
patch; I'd appreciate it if it can be reviewed so we can reduce the
churn on our instances:

https://review.openstack.org/#/c/492634/

[...]

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
Re: [openstack-dev] [heat][infra] Help needed! high gate failure rate
On Thu, Aug 10, 2017 at 07:22:42PM +0530, Rabi Mishra wrote:
> On Thu, Aug 10, 2017 at 4:34 PM, Rabi Mishra wrote:
>> [...]
>
> We download an image from a fedora mirror and it seems to take more
> than 1hr:
>
> http://logs.openstack.org/41/484741/7/check/gate-heat-dsvm-functional-convg-mysql-lbaasv2-py35-ubuntu-xenial/a797010/logs/devstacklog.txt.gz#_2017-08-10_04_13_14_400
>
> Probably an issue with the specific mirror or some infra network
> bandwidth issue. I've submitted a patch to change the mirror to see
> if that helps.

Today we mirror both fedora-26 [1] and fedora-25 (the latter to be
removed shortly), so if you want to consider bumping your image for
testing, you can fetch it from our AFS mirrors.

You can source /etc/ci/mirror_info.sh to get information about the
things we mirror.

[1] http://mirror.regionone.infracloud-vanilla.openstack.org/fedora/releases/26/CloudImages/x86_64/images/

[...]
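Paul's suggestion to fetch the image from the in-region AFS mirror can be
sketched roughly as below. This is a hypothetical illustration only: the
variable name NODEPOOL_MIRROR_HOST and the helper function are my
assumptions, not something guaranteed by /etc/ci/mirror_info.sh; inspect
that file on a job node for the names it actually exports.

```shell
# Build the Fedora cloud-image directory URL for a given mirror host and
# release, mirroring the layout of the AFS mirrors linked above.
build_fedora_image_base() {
    local mirror_host=$1 release=$2
    echo "http://${mirror_host}/fedora/releases/${release}/CloudImages/x86_64/images/"
}

# On a job node it might look like (NODEPOOL_MIRROR_HOST is an assumed name):
#   source /etc/ci/mirror_info.sh
#   IMAGE_URL_BASE=$(build_fedora_image_base "$NODEPOOL_MIRROR_HOST" 26)
build_fedora_image_base mirror.regionone.infracloud-vanilla.openstack.org 26
```

The point of the indirection is that each region resolves the same
hostname to a local mirror, so the download never leaves the cloud
provider's network.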
Re: [openstack-dev] [heat][infra] Help needed! high gate failure rate
On Thu, Aug 10, 2017 at 4:34 PM, Rabi Mishra wrote:
> On Thu, Aug 10, 2017 at 2:51 PM, Ian Wienand wrote:
>> [...]
>>
>> The reality is you're just going to have to triage this and be a
>> *lot* more specific with issues.
>
> One of the issues we've seen recently is that many jobs are killed
> midway through the tests when the job times out (120 mins). It seems
> jobs are often scheduled to very slow nodes, where setting up
> devstack takes more than 80 mins [1].
>
> [1] http://logs.openstack.org/49/492149/2/check/gate-heat-dsvm-functional-orig-mysql-lbaasv2-ubuntu-xenial/03b05dd/console.html#_2017-08-10_05_55_49_035693

We download an image from a fedora mirror and it seems to take more
than 1hr:

http://logs.openstack.org/41/484741/7/check/gate-heat-dsvm-functional-convg-mysql-lbaasv2-py35-ubuntu-xenial/a797010/logs/devstacklog.txt.gz#_2017-08-10_04_13_14_400

Probably an issue with the specific mirror or some infra network
bandwidth issue. I've submitted a patch to change the mirror to see if
that helps.

>> [...]

--
Regards,
Rabi Mishra
Re: [openstack-dev] [heat][infra] Help needed! high gate failure rate
> The reality is you're just going to have to triage this and be a *lot*
> more specific with issues. I find opening an etherpad and going
> through the failures one-by-one helpful (e.g. I keep [2] for centos
> jobs I'm interested in).
>
> Looking at the top of the console.html log you'll have the host and
> provider/region stamped in there. If it's timeouts or network issues,
> reporting to infra the time, provider and region of failing jobs will
> help. Finding patterns is the first step to understanding what needs
> fixing.

Here [1] I've collected some failure records from the gate. As we can
tell, most of the environment set-ups become really slow and then fail
at some point with a timeout error. In [1] I also collect information
about the failed nodes; hope you can find a clue in it.

[1] https://etherpad.openstack.org/p/heat-gate-fail-2017-08

> If it's due to issues with remote transfers, we can look at either
> adding specific things to mirrors (containers, images, and packages
> are all things we've added recently) or adding a caching
> reverse-proxy for them ([3], [4] are some examples).
>
> Questions in #openstack-infra will usually get a helpful response too
>
> Good luck :)
>
> -i
>
> [1] https://bugs.launchpad.net/openstack-gate/+bug/1708707/
> [2] https://etherpad.openstack.org/p/centos7-dsvm-triage
> [3] https://review.openstack.org/491800
> [4] https://review.openstack.org/491466
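To make the pattern-finding Ian describes a bit more mechanical, the
per-provider tally over a pile of saved console logs could be sketched
like this. The `Provider: <name>` pattern is an assumption about what
the log header contains; adjust the regex to whatever your console.html
actually stamps near the top.

```shell
# Count failing jobs per provider/region across a set of saved console
# logs, most frequent first. Each log is assumed to contain one line
# matching "Provider: <name>" (an assumed format; adapt as needed).
count_failures_by_provider() {
    grep -h -o 'Provider: [A-Za-z0-9_-]*' "$@" | sort | uniq -c | sort -rn
}

# Demo on three fabricated log snippets:
printf 'Provider: rax-ord\nrest of log...\n' > /tmp/console1.txt
printf 'Provider: rax-ord\nrest of log...\n' > /tmp/console2.txt
printf 'Provider: ovh-bhs1\nrest of log...\n' > /tmp/console3.txt
count_failures_by_provider /tmp/console1.txt /tmp/console2.txt /tmp/console3.txt
```

A skew toward one provider/region in the output is exactly the kind of
evidence infra asks for when triaging timeouts.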
Re: [openstack-dev] [heat][infra] Help needed! high gate failure rate
On Thu, Aug 10, 2017 at 2:51 PM, Ian Wienand wrote:
> On 08/10/2017 06:18 PM, Rico Lin wrote:
>> We're facing a high failure rate in Heat's gates [1]; four of our
>> gates have been suffering failure rates from 6% to nearly 20% over
>> the last 14 days, which leaves most of our patches stuck in the gate.
>
> [...]
>
> The reality is you're just going to have to triage this and be a *lot*
> more specific with issues.

One of the issues we've seen recently is that many jobs are killed
midway through the tests when the job times out (120 mins). It seems
jobs are often scheduled to very slow nodes, where setting up devstack
takes more than 80 mins [1].

[1] http://logs.openstack.org/49/492149/2/check/gate-heat-dsvm-functional-orig-mysql-lbaasv2-ubuntu-xenial/03b05dd/console.html#_2017-08-10_05_55_49_035693

> I find opening an etherpad and going through the failures one-by-one
> helpful (e.g. I keep [2] for centos jobs I'm interested in).
>
> Looking at the top of the console.html log you'll have the host and
> provider/region stamped in there. If it's timeouts or network issues,
> reporting to infra the time, provider and region of failing jobs will
> help. Finding patterns is the first step to understanding what needs
> fixing.
>
> [...]

--
Regards,
Rabi Mishra
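The "devstack takes more than 80 mins" observation can be checked
quickly from the log timestamps. A rough sketch, assuming devstacklog
lines begin with "YYYY-MM-DD HH:MM:SS.mmm |" as in the logs linked
above, and GNU date is available:

```shell
# Rough elapsed wall-clock minutes for a devstack run, taken as the
# difference between the first and last timestamped lines of its log.
devstack_elapsed_minutes() {
    local first last
    first=$(grep -m1 -oE '^[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9:]{8}' "$1")
    last=$(grep -oE '^[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9:]{8}' "$1" | tail -n1)
    echo $(( ( $(date -d "$last" +%s) - $(date -d "$first" +%s) ) / 60 ))
}

# Demo with a fabricated two-line log representing a 95-minute run:
printf '2017-08-10 02:40:00.123 | start\n2017-08-10 04:15:00.456 | done\n' \
    > /tmp/devstacklog.txt
devstack_elapsed_minutes /tmp/devstacklog.txt   # prints 95
```

Running this over the devstacklog of each timed-out job would separate
"slow node" failures from jobs that stalled in the tests themselves.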
Re: [openstack-dev] [heat][infra] Help needed! high gate failure rate
On 08/10/2017 06:18 PM, Rico Lin wrote:
> We're facing a high failure rate in Heat's gates [1]; four of our
> gates have been suffering failure rates from 6% to nearly 20% over
> the last 14 days, which leaves most of our patches stuck in the gate.

There have been a confluence of things causing some problems recently.
The loss of OSIC has distributed more load over everything else, and
we have seen an increase in job timeouts and intermittent networking
issues (especially if you're downloading large things from remote
sites). There have also been some issues with the mirror in rax-ord
[1].

> gate-heat-dsvm-functional-convg-mysql-lbaasv2-ubuntu-xenial (19.67%)
> gate-heat-dsvm-functional-convg-mysql-lbaasv2-non-apache-ubuntu-xenial (9.09%)
> gate-heat-dsvm-functional-orig-mysql-lbaasv2-ubuntu-xenial (8.47%)
> gate-heat-dsvm-functional-convg-mysql-lbaasv2-py35-ubuntu-xenial (6.00%)

> We're still trying to find out the cause, but (IMO) it seems there
> might be something wrong with our infra. We need some help from the
> infra team: do you have any clues about this failure rate?

The reality is you're just going to have to triage this and be a *lot*
more specific with issues. I find opening an etherpad and going
through the failures one-by-one helpful (e.g. I keep [2] for centos
jobs I'm interested in).

Looking at the top of the console.html log you'll have the host and
provider/region stamped in there. If it's timeouts or network issues,
reporting to infra the time, provider and region of failing jobs will
help. Finding patterns is the first step to understanding what needs
fixing.

If it's due to issues with remote transfers, we can look at either
adding specific things to mirrors (containers, images, and packages
are all things we've added recently) or adding a caching reverse-proxy
for them ([3], [4] are some examples).

Questions in #openstack-infra will usually get a helpful response too.

Good luck :)

-i

[1] https://bugs.launchpad.net/openstack-gate/+bug/1708707/
[2] https://etherpad.openstack.org/p/centos7-dsvm-triage
[3] https://review.openstack.org/491800
[4] https://review.openstack.org/491466