Re: [openstack-dev] [tripleo] glance backend: replace swift by file in CI
On Wed, Jun 29, 2016 at 02:59:45PM +0200, Dmitry Tantsur wrote:
> On 06/28/2016 01:37 PM, Erno Kuvaja wrote:
> > TL;DR
> >
> > Makes absolute sense to run the file backend on a single-node undercloud in CI.
> >
> > Few more comments inline.
> >
> > On Mon, Jun 27, 2016 at 8:49 PM, Emilien Macchi wrote:
> > > On Mon, Jun 27, 2016 at 3:46 PM, Clay Gerrard wrote:
> > > > There's probably some minimal gain in cross compatibility testing to
> > > > sticking with the status quo. The Swift API is old and stable, but I
> > > > believe there was some bug in recent history where some return value in
> > > > swiftclient changed from an iterable to a generator or something and some
> > > > aggressive non-duck type checking broke something somewhere
> > > >
> > > > I find that bug report sorta interesting, the reported memory pressure
> > > > there doesn't make sense. Maybe there's some non-essential middleware
> > > > configured on that proxy that's causing the workers to
> > > > bloat up like that?
> > >
> > > Swift proxy pipeline:
> > > pipeline = catch_errors healthcheck cache ratelimit bulk tempurl
> > > formpost authtoken keystone staticweb proxy-logging proxy-server
> >
> > Some things I do not think we benefit from having there if we still want to
> > experiment with swift in the undercloud:
>
> I hope we're not removing it completely...

No, definitely not - we require Swift for several things other than backing
glance, including:

- Storing introspection data from ironic-inspector
- Signals/Metadata-polling for Heat using the tempurl transport
- Mistral deployment workflows, where plans are pushed into swift

> > staticweb - do we need containers being presented as webpages?
> > tempurl - I'd assume we can expect the user to have access to the needed
> > objects with their own credentials.
>
> Please leave it there, we need it to support the agent_* family of ironic
> drivers.

Yes, we need tempurl for heat metadata/signals and also upload of artefacts
such as puppet modules to the nodes, so we definitely need tempurl.

Steve

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
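To make the middleware discussion above concrete, here is a minimal Python sketch that trims a proxy pipeline string while refusing to drop entries that have to stay. The helper name `trim_pipeline` and the `REQUIRED` set are hypothetical and not part of Swift or TripleO. The pipeline string is the one quoted in the thread, and dropping staticweb and formpost only mirrors the candidates Erno raised; it is not an agreed change.

```python
# Hypothetical helper for illustration only; not part of Swift or TripleO.

# Pipeline exactly as quoted in the thread.
PIPELINE = (
    "catch_errors healthcheck cache ratelimit bulk tempurl "
    "formpost authtoken keystone staticweb proxy-logging proxy-server"
)

# Assumption: tempurl must stay (Heat signals/metadata and the ironic
# agent_* drivers, per the replies above); proxy-server is the app that
# terminates the pipeline, so it can never be removed.
REQUIRED = {"tempurl", "proxy-server"}


def trim_pipeline(pipeline, drop):
    """Return the pipeline with the names in `drop` removed, order preserved."""
    blocked = REQUIRED & set(drop)
    if blocked:
        raise ValueError("refusing to drop required entries: %s" % sorted(blocked))
    return " ".join(name for name in pipeline.split() if name not in drop)


if __name__ == "__main__":
    print("pipeline = " + trim_pipeline(PIPELINE, {"staticweb", "formpost"}))
```

Keeping the required entries in an explicit set means a later cleanup cannot silently remove tempurl again; which entries belong in that set is a judgment call for the deployment.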
Re: [openstack-dev] [tripleo] glance backend: replace swift by file in CI
On 06/28/2016 01:37 PM, Erno Kuvaja wrote:
> TL;DR
>
> Makes absolute sense to run the file backend on a single-node undercloud in CI.
>
> Few more comments inline.
>
> On Mon, Jun 27, 2016 at 8:49 PM, Emilien Macchi wrote:
> > On Mon, Jun 27, 2016 at 3:46 PM, Clay Gerrard wrote:
> > > There's probably some minimal gain in cross compatibility testing to
> > > sticking with the status quo. The Swift API is old and stable, but I
> > > believe there was some bug in recent history where some return value in
> > > swiftclient changed from an iterable to a generator or something and some
> > > aggressive non-duck type checking broke something somewhere
> > >
> > > I find that bug report sorta interesting, the reported memory pressure
> > > there doesn't make sense. Maybe there's some non-essential middleware
> > > configured on that proxy that's causing the workers to
> > > bloat up like that?
> >
> > Swift proxy pipeline:
> > pipeline = catch_errors healthcheck cache ratelimit bulk tempurl
> > formpost authtoken keystone staticweb proxy-logging proxy-server
>
> Some things I do not think we benefit from having there if we still want to
> experiment with swift in the undercloud:

I hope we're not removing it completely...

> staticweb - do we need containers being presented as webpages?
> tempurl - I'd assume we can expect the user to have access to the needed
> objects with their own credentials.

Please leave it there, we need it to support the agent_* family of ironic
drivers.

> formpost - likely we do not need http forms instead of PUT calls either.
> ratelimit - have we ever had a single time where something went crazy,
> ratelimit saved us, and the tests still did not fail?
> healthcheck - not likely used, but also really lightweight so
> shouldn't make any difference
> cache - Memcache is likely the thing that kills us.
>
> > Thanks for your help,
> >
> > > -clayg
> > >
> > > On Mon, Jun 27, 2016 at 12:30 PM, Emilien Macchi wrote:
> > > > Hi,
> > > >
> > > > Today we're re-investigating a CI failure that we had multiple times [1]:
> > > > Swift memory usage grows until it is OOM-killed.
> > > >
> > > > The scope of this thread is our CI and not production
> > > > environments.
> > > > Indeed, our CI is running with limited resources while production
> > > > environments should not hit this problem.
> > > >
> > > > After some investigation on #tripleo, we found out this scenario has been
> > > > happening almost every time recently:
> > > >
> > > > * undercloud is deployed, glance and swift are running. Glance is
> > > > configured with Swift backend to store images.
> > > > * tripleo CI uploads the overcloud image into Glance, image is successfully
> > > > uploaded.
> > > > * when overcloud starts deploying, some nodes randomly fail to deploy
> > > > because the undercloud OOM-kills swift-proxy-server that is still
> > > > sending the overcloud image requested by Glance API. Swift fails,
> > > > Glance fails, overcloud deployment fails with a "No valid hosts
> > > > found".
> > > >
> > > > It's likely due to performance issues in our CI, and there is nothing
> > > > we can do but add more resources or reduce the number of
> > > > environments, something we won't do at this time, because of our recent
> > > > improvements in our CI (more ram, SSD, etc).
>
> So was streamlining and optimizing swift for a small environment
> already tried?
>
> Another thing that comes to my mind based on the discussions lately.
> What is the core count on our CI uc node? Are all the services
> deployed there with their default worker values? Might be sensible
> (even for production use) to limit the number of workers our services
> kick up in an aio undercloud as that tends to have a huge impact on memory
> consumption.
>
> - Erno "jokke_" Kuvaja
>
> > > > As a first iteration, I propose [2] that we stop using Swift as a
> > > > backend for Glance. Indeed, our undercloud is currently single-node, I
> > > > see zero value in using Swift to store the overcloud image.
> > > > If there is a value, then we can add an option for whether or not to
> > > > use it (and set it to False in our CI to use file backend, which
> > > > won't lead to OOM).
> > > >
> > > > Note: on the overcloud: we currently support file, swift and rbd
> > > > backends, that you can easily select during your deployment.
> > > >
> > > > [1] https://bugs.launchpad.net/tripleo/+bug/1595916
> > > > [2] https://review.openstack.org/#/c/334555/
> > > > --
> > > > Emilien Macchi
> >
> > --
> > Emilien Macchi
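Erno's quoted suggestion about limiting worker counts on the all-in-one undercloud can be sketched roughly as follows. The function name `capped_workers` and the default cap of 2 are assumptions made for illustration; the thread does not propose a specific number. The idea is only that a service whose worker count follows the core count can be held to a small ceiling on a memory-constrained CI node.

```python
import os


def capped_workers(cpu_count=None, cap=2):
    """Pick a worker count that follows the core count but never exceeds `cap`.

    `cap` is a hypothetical tuning knob; 2 is an arbitrary default chosen
    for a memory-constrained CI node, not a value taken from the thread.
    """
    if cap < 1:
        raise ValueError("cap must be at least 1")
    if cpu_count is None:
        # os.cpu_count() may return None on some platforms.
        cpu_count = os.cpu_count() or 1
    # Always run at least one worker, even if the reported count is 0.
    return max(1, min(cpu_count, cap))


if __name__ == "__main__":
    print("workers = %d" % capped_workers())
```

The resulting value would then be written to whichever worker setting each service exposes, which has to be checked per service and release.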
Re: [openstack-dev] [tripleo] glance backend: replace swift by file in CI
TL;DR

Makes absolute sense to run the file backend on a single-node undercloud in CI.

Few more comments inline.

On Mon, Jun 27, 2016 at 8:49 PM, Emilien Macchi wrote:
> On Mon, Jun 27, 2016 at 3:46 PM, Clay Gerrard wrote:
>> There's probably some minimal gain in cross compatibility testing to
>> sticking with the status quo. The Swift API is old and stable, but I
>> believe there was some bug in recent history where some return value in
>> swiftclient changed from an iterable to a generator or something and some
>> aggressive non-duck type checking broke something somewhere
>>
>> I find that bug report sorta interesting, the reported memory pressure
>> there doesn't make sense. Maybe there's some non-essential middleware
>> configured on that proxy that's causing the workers to
>> bloat up like that?
>
> Swift proxy pipeline:
> pipeline = catch_errors healthcheck cache ratelimit bulk tempurl
> formpost authtoken keystone staticweb proxy-logging proxy-server

Some things I do not think we benefit from having there if we still want to
experiment with swift in the undercloud:

staticweb - do we need containers being presented as webpages?
tempurl - I'd assume we can expect the user to have access to the needed
objects with their own credentials.
formpost - likely we do not need http forms instead of PUT calls either.
ratelimit - have we ever had a single time where something went crazy,
ratelimit saved us, and the tests still did not fail?
healthcheck - not likely used, but also really lightweight so
shouldn't make any difference
cache - Memcache is likely the thing that kills us.

>
> Thanks for your help,
>
>> -clayg
>>
>> On Mon, Jun 27, 2016 at 12:30 PM, Emilien Macchi wrote:
>>>
>>> Hi,
>>>
>>> Today we're re-investigating a CI failure that we had multiple times [1]:
>>> Swift memory usage grows until it is OOM-killed.
>>>
>>> The scope of this thread is our CI and not production
>>> environments.
>>> Indeed, our CI is running with limited resources while production
>>> environments should not hit this problem.
>>>
>>> After some investigation on #tripleo, we found out this scenario has been
>>> happening almost every time recently:
>>>
>>> * undercloud is deployed, glance and swift are running. Glance is
>>> configured with Swift backend to store images.
>>> * tripleo CI uploads the overcloud image into Glance, image is successfully
>>> uploaded.
>>> * when overcloud starts deploying, some nodes randomly fail to deploy
>>> because the undercloud OOM-kills swift-proxy-server that is still
>>> sending the overcloud image requested by Glance API. Swift fails,
>>> Glance fails, overcloud deployment fails with a "No valid hosts
>>> found".
>>>
>>> It's likely due to performance issues in our CI, and there is nothing
>>> we can do but add more resources or reduce the number of
>>> environments, something we won't do at this time, because of our recent
>>> improvements in our CI (more ram, SSD, etc).

So was streamlining and optimizing swift for a small environment
already tried?

Another thing that comes to my mind based on the discussions lately.
What is the core count on our CI uc node? Are all the services
deployed there with their default worker values? Might be sensible
(even for production use) to limit the number of workers our services
kick up in an aio undercloud as that tends to have a huge impact on memory
consumption.

- Erno "jokke_" Kuvaja

>>>
>>> As a first iteration, I propose [2] that we stop using Swift as a
>>> backend for Glance. Indeed, our undercloud is currently single-node, I
>>> see zero value in using Swift to store the overcloud image.
>>> If there is a value, then we can add an option for whether or not to
>>> use it (and set it to False in our CI to use file backend, which
>>> won't lead to OOM).
>>>
>>> Note: on the overcloud: we currently support file, swift and rbd
>>> backends, that you can easily select during your deployment.
>>>
>>> [1] https://bugs.launchpad.net/tripleo/+bug/1595916
>>> [2] https://review.openstack.org/#/c/334555/
>>> --
>>> Emilien Macchi
>
> --
> Emilien Macchi
Re: [openstack-dev] [tripleo] glance backend: replace swift by file in CI
On Mon, Jun 27, 2016 at 3:46 PM, Clay Gerrard wrote:
> There's probably some minimal gain in cross compatibility testing to
> sticking with the status quo. The Swift API is old and stable, but I
> believe there was some bug in recent history where some return value in
> swiftclient changed from an iterable to a generator or something and some
> aggressive non-duck type checking broke something somewhere
>
> I find that bug report sorta interesting, the reported memory pressure
> there doesn't make sense. Maybe there's some non-essential middleware
> configured on that proxy that's causing the workers to
> bloat up like that?

Swift proxy pipeline:
pipeline = catch_errors healthcheck cache ratelimit bulk tempurl
formpost authtoken keystone staticweb proxy-logging proxy-server

Thanks for your help,

> -clayg
>
> On Mon, Jun 27, 2016 at 12:30 PM, Emilien Macchi wrote:
>>
>> Hi,
>>
>> Today we're re-investigating a CI failure that we had multiple times [1]:
>> Swift memory usage grows until it is OOM-killed.
>>
>> The scope of this thread is our CI and not production
>> environments.
>> Indeed, our CI is running with limited resources while production
>> environments should not hit this problem.
>>
>> After some investigation on #tripleo, we found out this scenario has been
>> happening almost every time recently:
>>
>> * undercloud is deployed, glance and swift are running. Glance is
>> configured with Swift backend to store images.
>> * tripleo CI uploads the overcloud image into Glance, image is successfully
>> uploaded.
>> * when overcloud starts deploying, some nodes randomly fail to deploy
>> because the undercloud OOM-kills swift-proxy-server that is still
>> sending the overcloud image requested by Glance API. Swift fails,
>> Glance fails, overcloud deployment fails with a "No valid hosts
>> found".
>>
>> It's likely due to performance issues in our CI, and there is nothing
>> we can do but add more resources or reduce the number of
>> environments, something we won't do at this time, because of our recent
>> improvements in our CI (more ram, SSD, etc).
>>
>> As a first iteration, I propose [2] that we stop using Swift as a
>> backend for Glance. Indeed, our undercloud is currently single-node, I
>> see zero value in using Swift to store the overcloud image.
>> If there is a value, then we can add an option for whether or not to
>> use it (and set it to False in our CI to use file backend, which
>> won't lead to OOM).
>>
>> Note: on the overcloud: we currently support file, swift and rbd
>> backends, that you can easily select during your deployment.
>>
>> [1] https://bugs.launchpad.net/tripleo/+bug/1595916
>> [2] https://review.openstack.org/#/c/334555/
>> --
>> Emilien Macchi

--
Emilien Macchi
Re: [openstack-dev] [tripleo] glance backend: replace swift by file in CI
There's probably some minimal gain in cross compatibility testing to
sticking with the status quo. The Swift API is old and stable, but I
believe there was some bug in recent history where some return value in
swiftclient changed from an iterable to a generator or something and some
aggressive non-duck type checking broke something somewhere

I find that bug report sorta interesting, the reported memory pressure
there doesn't make sense. Maybe there's some non-essential middleware
configured on that proxy that's causing the workers to bloat up like that?

-clayg

On Mon, Jun 27, 2016 at 12:30 PM, Emilien Macchi wrote:
> Hi,
>
> Today we're re-investigating a CI failure that we had multiple times [1]:
> Swift memory usage grows until it is OOM-killed.
>
> The scope of this thread is our CI and not production
> environments.
> Indeed, our CI is running with limited resources while production
> environments should not hit this problem.
>
> After some investigation on #tripleo, we found out this scenario has been
> happening almost every time recently:
>
> * undercloud is deployed, glance and swift are running. Glance is
> configured with Swift backend to store images.
> * tripleo CI uploads the overcloud image into Glance, image is successfully
> uploaded.
> * when overcloud starts deploying, some nodes randomly fail to deploy
> because the undercloud OOM-kills swift-proxy-server that is still
> sending the overcloud image requested by Glance API. Swift fails,
> Glance fails, overcloud deployment fails with a "No valid hosts
> found".
>
> It's likely due to performance issues in our CI, and there is nothing
> we can do but add more resources or reduce the number of
> environments, something we won't do at this time, because of our recent
> improvements in our CI (more ram, SSD, etc).
>
> As a first iteration, I propose [2] that we stop using Swift as a
> backend for Glance. Indeed, our undercloud is currently single-node, I
> see zero value in using Swift to store the overcloud image.
> If there is a value, then we can add an option for whether or not to
> use it (and set it to False in our CI to use file backend, which
> won't lead to OOM).
>
> Note: on the overcloud: we currently support file, swift and rbd
> backends, that you can easily select during your deployment.
>
> [1] https://bugs.launchpad.net/tripleo/+bug/1595916
> [2] https://review.openstack.org/#/c/334555/
> --
> Emilien Macchi
[openstack-dev] [tripleo] glance backend: replace swift by file in CI
Hi,

Today we're re-investigating a CI failure that we had multiple times [1]:
Swift memory usage grows until it is OOM-killed.

The scope of this thread is our CI and not production environments.
Indeed, our CI is running with limited resources while production
environments should not hit this problem.

After some investigation on #tripleo, we found out this scenario has been
happening almost every time recently:

* undercloud is deployed, glance and swift are running. Glance is
  configured with Swift backend to store images.
* tripleo CI uploads the overcloud image into Glance, image is successfully
  uploaded.
* when overcloud starts deploying, some nodes randomly fail to deploy
  because the undercloud OOM-kills swift-proxy-server that is still
  sending the overcloud image requested by Glance API. Swift fails,
  Glance fails, overcloud deployment fails with a "No valid hosts
  found".

It's likely due to performance issues in our CI, and there is nothing
we can do but add more resources or reduce the number of
environments, something we won't do at this time, because of our recent
improvements in our CI (more ram, SSD, etc).

As a first iteration, I propose [2] that we stop using Swift as a
backend for Glance. Indeed, our undercloud is currently single-node, I
see zero value in using Swift to store the overcloud image.
If there is a value, then we can add an option for whether or not to
use it (and set it to False in our CI to use file backend, which
won't lead to OOM).

Note: on the overcloud: we currently support file, swift and rbd
backends, that you can easily select during your deployment.

[1] https://bugs.launchpad.net/tripleo/+bug/1595916
[2] https://review.openstack.org/#/c/334555/
--
Emilien Macchi
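For readers unfamiliar with what "file backend" means in practice, the following Python sketch renders the kind of `[glance_store]` section that selects local file storage. It is an illustration under assumptions, not the content of the proposed change [2]: the function name `render_file_backend` is hypothetical, the option names (`stores`, `default_store`, `filesystem_store_datadir`) are the glance_store options as I understand them for releases of that period, and the data directory is a commonly used default that should be checked against the actual deployment.

```python
import configparser
import io


def render_file_backend(datadir="/var/lib/glance/images/"):
    """Render a [glance_store] section that stores images on local disk.

    Option names and the default path are assumptions to verify against
    the Glance release in use; this is not the patch proposed in the thread.
    """
    config = configparser.ConfigParser()
    config["glance_store"] = {
        "stores": "file,http",
        "default_store": "file",
        "filesystem_store_datadir": datadir,
    }
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


if __name__ == "__main__":
    print(render_file_backend())
```

With a single-node undercloud the images then sit on the local filesystem, so serving them no longer goes through swift-proxy-server.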