Re: [openstack-dev] Gate proposal - drop Postgresql configurations in the gate

2014-06-16 Thread Joe Gordon

On Jun 14, 2014 11:12 AM, "Robert Collins" wrote:
>
> You know it's bad when you can't sleep because you're redesigning gate
> workflows in your head so I apologise that this email is perhaps
> not as rational, nor as organised, as usual - but , . :)
>
> Obviously this is very important to address, and if we can come up
> with something systemic I'm going to devote my time both directly, and
> via resource-hunting within HP, to address it. And accordingly I'm
> going to feel free to say 'zuul this' with no regard for existing
> features. We need to get ahead of the problem and figure out how to
> stay there, and I think below I show why the current strategy just
> won't do that.
>
> On 13 June 2014 06:08, Sean Dague  wrote:
>
> > We're hitting a couple of inflection points.
> >
> > 1) We're basically at capacity for the unit work that we can do. Which
> > means it's time to start making decisions if we believe everything we
> > currently have running is more important than the things we aren't
> > currently testing.
> >
> > Everyone wants multinode testing in the gate. It would be impossible to
> > support that given current resources.
>
> How much of our capacity problems are due to waste - such as:
>  - tempest runs of code the author knows is broken
>  - tempest runs of code that doesn't pass unit tests
>  - tempest runs while the baseline is unstable - to expand on this
> one, if master only passes one commit in 4, no check job can have a
> higher success rate overall.
>
> Vs how much are an indication of the sheer volume of development being
> done?
>
> > 2) We're far past the inflection point of people actually debugging jobs
> > when they go wrong.
> >
> > The gate is backed up (currently to 24hrs) because there are bugs in
> > OpenStack. Those are popping up at a rate much faster than the number of
> > people who are willing to spend any time on them. And often they are
> > popping up in configurations that we're not all that familiar with.
>
> So, I *totally* appreciate that people fixing the jobs is the visible
> expendable resource, but I'm not sure it's the bottleneck. I think the
> bottleneck is our aggregate ability to a) detect the problem and b)
> resolve it.
>
> For instance - strawman - if when the gate goes bad, after a check for
> external issues like new SQLAlchemy releases etc, what if we just
> rolled trunk of every project that is in the integrated gate back to
> before the success rate nosedived ? I'm well aware of the DVCS issues
> that implies, but from a human debugging perspective that would
> massively increase the leverage we get from the folk that do dive in
> and help. It moves from 'figure out that there is a problem and it
> came in after X AND FIX IT' to 'figure out it came in after X'.
>
> Reverting is usually much faster and more robust than rolling forward,
> because rolling forward has more unknowns.
>
> I think we have a systematic problem, because this situation happens
> again and again. And the root cause is that our time to detect
> races/nondeterministic tests is a probability function, not a simple
> scalar. Sometimes we catch such tests within one patch in the gate,
> sometimes they slip through. If we want to land hundreds or thousands
> of patches a day, and we don't want this pain to happen, I don't see
> any way other than *either*:
> A - not doing this whole gating CI process at all
> B - making detection a whole lot more reliable (e.g. we want
> near-certainty that a given commit does not contain a race)
> C - making repair a whole lot faster (e.g. we want <= one test cycle
> in the gate to recover once we have determined that some commit is
> broken).
>
> Taking them in turn:
> A - yeah, no. We have lots of experience with the axiom that that
> which is not tested is broken. And that's the big concern about
> removing things from our matrix - when they are not tested, we can be
> sure that they will break and we will have to spend neurons fixing
> them - either directly or as reviews from people fixing it.
>
> B - this is really hard. Say we want to be quite sure that there are no
> new races that will occur with more than some probability in a given
> commit, and we assume that race codepaths might be run just once in
> the whole test matrix. A single test run can never tell us that - it
> just tells us it worked. What we need is some N trials where we don't
> observe a new race (but may observe old races), given a maximum risk
> of the introduction of a (say) 5% failure rate into the gate. [check
> my stats]
> (1-max risk)^trials = margin-of-error
> 0.95^N = 0.01
> log(0.01, base=0.95) = N
> N ~= 90
>
> So if we want to stop 5% races landing, and we may exercise any given
> possible race code path a minimum of 1 times in the test matrix, we
> need to exercise the whole test matrix 90 times to have that 1% margin
> sure we saw it. Raise that to a 1% race:
> log(0.01, base=0.99) ~= 458
> That's a lot of test runs. I don't think we can do that for each

Re: [openstack-dev] Gate proposal - drop Postgresql configurations in the gate

2014-06-16 Thread Robert Collins

On 16 Jun 2014 22:33, "Sean Dague"  wrote:
>
> On 06/16/2014 04:33 AM, Thierry Carrez wrote:
> > Robert Collins wrote:
> >> [...]
> >> C - If we can't make it harder to get races in, perhaps we can make it
> >> easier to get races out. We have pretty solid emergent statistics from
> >> every gate job that is run as check. What if we set a policy that when a
> >> gate queue gets a race:
> >>  - put a zuul stop all merges and checks on all involved branches
> >> (prevent further damage, free capacity for validation)
> >>  - figure out when it surfaced
> >>  - determine it's not an external event
> >>  - revert all involved branches back to the point where they looked
> >> good, as one large operation
> >>    - run that through jenkins N (e.g. 458) times in parallel.
> >>    - on success land it
> >>  - go through all the merges that have been reverted and either
> >> twiddle them to be back in review with a new patchset against the
> >> revert to restore their content, or alternatively generate new reviews
> >> if gerrit would make that too hard.
> >
> > One of the issues here is that "gate queue gets a race" is not a binary
> > state. There are always rare issues, you just can't find all the bugs
> > that happen 0.1% of the time. You add more such issues, and at some
> > point they either add up to an unacceptable level, or some other
> > environmental situation suddenly increases the odds of some old rare
> > issue to happen (think: new test cluster with slightly different
> > performance characteristics being thrown into our test resources). There
> > is no single incident you need to find and fix, and during which you can
> > clearly escalate to defCon 1. You can't even assume that a "gate
> > situation" was created in the set of commits around when it surfaced.
> >
> > So IMHO it's a continuous process : keep looking into rare issues all
> > the time, to maintain them under the level where they become a problem.
> > You can't just have a specific process that kicks in when "the gate
> > queue gets a race".
>
> Definitely agree. I also think part of the issue is we get emergent
> behavior once we tip past some cumulative failure rate. Much of that
> emergent behavior we are coming to understand over time. We've done
> corrections like clean check and sliding gate window to impact them.
>
> It's also that a new issue tends to take 12 hrs to see and figure out if
> it's a ZOMG issue, and 3 - 5 days to see if it's any lower level of
> severity. And given that we merge 50 - 100 patches a day, across 40
> projects, across branches, the rollback would be  'interesting'.

So zomg - 50 runs and lower issues between 150 and 500 test runs. That's
fitting my model pretty well for the ballpark failure rate and margin I was
using. That is, it sounds like the model isn't too far out from reality.
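
A minimal sketch of that comparison, assuming the (1 - rate)^N = margin model from earlier in the thread. It inverts the relation to ask what per-run failure rate a given number of clean runs can rule out. The function name and the 1% margin default are assumptions made for illustration.

```python
def detectable_rate(runs: int, margin: float = 0.01) -> float:
    """Smallest per-run failure rate that `runs` clean runs rule out.

    Solves (1 - rate) ** runs = margin for rate.
    """
    if runs < 1:
        raise ValueError("runs must be at least 1")
    if not 0.0 < margin < 1.0:
        raise ValueError("margin must be strictly between 0 and 1")
    return 1.0 - margin ** (1.0 / runs)


for runs in (50, 150, 500):
    print(f"{runs} clean runs rule out races above {detectable_rate(runs):.2%}")
```

Under these assumptions, 50 runs only rule out races of roughly 9% or worse, while 150 to 500 runs reach down to about 3% and 1% respectively.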

Yes revert would be hard... But what do you think of the model ... Is it
wrong? It implies several different points we can try to fix things and I
would love to know what folk think of the other possibilities I've raised
or raise some themselves.

-Rob
___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

Re: [openstack-dev] Gate proposal - drop Postgresql configurations in the gate

2014-06-16 Thread Kyle Mestery

On Mon, Jun 16, 2014 at 11:38 AM, Joe Gordon  wrote:
>
>
>
> On Sat, Jun 14, 2014 at 3:46 AM, Sean Dague  wrote:
>>
>> On 06/13/2014 06:47 PM, Joe Gordon wrote:
>> >
>> >
>> >
>> > On Thu, Jun 12, 2014 at 7:18 PM, Dan Prince wrote:
>> >
>> > On Thu, 2014-06-12 at 09:24 -0700, Joe Gordon wrote:
>> > >
>> > > On Jun 12, 2014 8:37 AM, "Sean Dague" wrote:
>> > > >
>> > > > On 06/12/2014 10:38 AM, Mike Bayer wrote:
>> > > > >
>> > > > > On 6/12/14, 8:26 AM, Julien Danjou wrote:
>> > > > >> On Thu, Jun 12 2014, Sean Dague wrote:
>> > > > >>
>> > > > >>> That's not catchable in unit or functional tests?
>> > > > >> Not in an accurate manner, no.
>> > > > >>
>> > > > >>> Keeping jobs alive based on the theory that they might one day be useful
>> > > > >>> is something we just don't have the liberty to do any more. We've not
>> > > > >>> seen an idle node in zuul in 2 days... and we're only at j-1. j-3 will
>> > > > >>> be at least +50% of this load.
>> > > > >> Sure, I'm not saying we don't have a problem. I'm just saying it's not a
>> > > > >> good solution to fix that problem IMHO.
>> > > > >
>> > > > > Just my 2c without having a full understanding of all of OpenStack's CI
>> > > > > environment, Postgresql is definitely different enough that MySQL
>> > > > > "strict mode" could still allow issues to slip through quite easily, and
>> > > > > also as far as capacity issues, this might be longer term but I'm hoping
>> > > > > to get database-related tests to be lots faster if we can move to a
>> > > > > model that spends much less time creating databases and schemas.
>> > > >
>> > > > This is what I mean by functional testing. If we were directly hitting a
>> > > > real database on a set of in tree project tests, I think you could
>> > > > discover issues like this. Neutron was headed down that path.
>> > > >
>> > > > But if we're talking about a devstack / tempest run, it's not really
>> > > > applicable.
>> > > >
>> > > > If someone can point me to a case where we've actually found this kind
>> > > > of bug with tempest / devstack, that would be great. I've just *never*
>> > > > seen it. I was the one that did most of the fixing for pg support in
>> > > > Nova, and have helped other projects as well, so I'm relatively familiar
>> > > > with the kinds of fails we can discover. The ones that Julien pointed
>> > > > really aren't likely to be exposed in our current system.
>> > > >
>> > > > Which is why I think we're mostly just burning cycles on the existing
>> > > > approach for no gain.
>> > >
>> > > Given all the points made above, I think dropping PostgreSQL is the
>> > > right choice; if only we had infinite cloud that would be another
>> > > story.
>> > >
>> > > What about converting one of our existing jobs (grenade partial ncpu,
>> > > large ops, regular grenade, tempest with nova network etc.) into a
>> > > PostgreSQL only job? We could get some level of PostgreSQL testing
>> > > without any additional jobs, although this is a tradeoff obviously.
>> >
>> > I'd be fine with this tradeoff if it allows us to keep PostgreSQL in the
>> > mix.
>> >
>> > Here is my proposed change to how we handle postgres in the gate:
>> >
>> > https://review.openstack.org/#/c/100033
>> >
>> > Merge postgres and neutron jobs in integrated-gate template
>> >
>> > Instead of having a separate job for postgres and neutron, combine them.
>> > In the integrated-gate we will only test postgres+neutron and not
>> > neutron/mysql or nova-network/postgres.
>> >
>> > * neutron/mysql is still tested in integrated-gate-neutron
>> > * nova-network/postgres is tested in nova
>>
>> Because neutron only runs smoke jobs, this actually drops all the
>> interesting testing of pg. The things I've actually seen catch
>> differences are the nova negative tests, which basically aren't run in
>> this job.
>
>
> I forgot about the smoke test only part when I originally proposed this.
> From a cursory look, neutron-full appears to be fairly stable, so if we move
> over to neutron-full in the near future that should address your concerns.
> Are there plans to move over to neutron-full in the near future?
>
This is on my radar for Juno-2. I'll sync up with some folks in-channel
on what the next steps would be to make this happen.

Kyle

>>
>>
>> So I think that's kind of the worst of all possible worlds, because it
>> would make people think the thing is tested interestingly, when it's not.
>>
>> -Sean
>>
>> --
>> Sean Dague
>

Re: [openstack-dev] Gate proposal - drop Postgresql configurations in the gate

2014-06-16 Thread Robert Collins

On 16 Jun 2014 20:33, "Thierry Carrez"  wrote:
>
> Robert Collins wrote:
> > [...]
> > C - If we can't make it harder to get races in, perhaps we can make it
> > easier to get races out. We have pretty solid emergent statistics from
> > every gate job that is run as check. What if we set a policy that when a
> > gate queue gets a race:
> >  - put a zuul stop all merges and checks on all involved branches
> > (prevent further damage, free capacity for validation)
> >  - figure out when it surfaced
> >  - determine it's not an external event
> >  - revert all involved branches back to the point where they looked
> > good, as one large operation
> >    - run that through jenkins N (e.g. 458) times in parallel.
> >    - on success land it
> >  - go through all the merges that have been reverted and either
> > twiddle them to be back in review with a new patchset against the
> > revert to restore their content, or alternatively generate new reviews
> > if gerrit would make that too hard.
>
> One of the issues here is that "gate queue gets a race" is not a binary
> state. There are always rare issues, you just can't find all the bugs
> that happen 0.1% of the time. You add more such issues, and at some
> point they either add up to an unacceptable level, or some other
> environmental situation suddenly increases the odds of some old rare
> issue to happen (think: new test cluster with slightly different
> performance characteristics being thrown into our test resources). There
> is no single incident you need to find and fix, and during which you can
> clearly escalate to defCon 1. You can't even assume that a "gate
> situation" was created in the set of commits around when it surfaced.
>
> So IMHO it's a continuous process : keep looking into rare issues all
> the time, to maintain them under the level where they become a problem.
> You can't just have a specific process that kicks in when "the gate
> queue gets a race".

You seem to be drawing different conclusions here but the emergent
behaviour is a shared model that we both have. In no part of my mail did I
suggest ignoring issues until we hit Defcon one. I suggested that what we
are doing is not working, and put forward a model to explain why it's not
working ... one which to me seems to fit the evidence. And finally
suggested a few different things which might help.

For the specific scenario you raise that might not fit... Adding a test
cluster is a change to our test config and certainly something we could
revert. That's the benefit of configuration as code.

Re: [openstack-dev] Gate proposal - drop Postgresql configurations in the gate

2014-06-16 Thread Mac Innes, Kiall

On Thu, 2014-06-12 at 11:36 -0400, Sean Dague wrote:
> If someone can point me to a case where we've actually found this kind
> of bug with tempest / devstack, that would be great. I've just *never*
> seen it. I was the one that did most of the fixing for pg support in
> Nova, and have helped other projects as well, so I'm relatively
> familiar
> with the kinds of fails we can discover. The ones that Julien pointed
> really aren't likely to be exposed in our current system.
> 
> Which is why I think we're mostly just burning cycles on the existing
> approach for no gain.
> 
> -Sean

I don't have links handy, but Designate has hit a couple of bugs that
prevented our database migrations from succeeding on PostgreSQL. Maybe
we need a new re-usable test slave type to exercise database migrations
and database interface code against? These would be much quicker to run
than a full devstack/tempest gate.
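
A minimal sketch of what such a migration check could look like. Everything here is hypothetical: the migration statements and table are invented for illustration, and sqlite3 is used only as a self-contained stand-in so the example runs anywhere. The point of the proposal would be to run the same kind of check against a real PostgreSQL server.

```python
import sqlite3

# Hypothetical migrations, oldest first. A real project would load its own.
MIGRATIONS = [
    "CREATE TABLE domains (id INTEGER PRIMARY KEY, name TEXT NOT NULL)",
    "ALTER TABLE domains ADD COLUMN ttl INTEGER DEFAULT 3600",
    "CREATE UNIQUE INDEX uq_domains_name ON domains (name)",
]


def apply_migrations(conn, migrations):
    """Apply each migration in order and return how many were applied."""
    cursor = conn.cursor()
    for statement in migrations:
        cursor.execute(statement)  # raises if the backend rejects the statement
    conn.commit()
    return len(migrations)


# Stand-in backend. Swapping in a PostgreSQL connection is the real goal.
conn = sqlite3.connect(":memory:")
applied = apply_migrations(conn, MIGRATIONS)

# Exercise the resulting schema through the database interface.
conn.execute("INSERT INTO domains (name) VALUES ('example.org.')")
row = conn.execute("SELECT name, ttl FROM domains").fetchone()
print(applied, row)
```

A check like this only needs a database server and the project's migration scripts, which is why it could run far faster than a full devstack/tempest job.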

Thanks,
Kiall

Re: [openstack-dev] Gate proposal - drop Postgresql configurations in the gate

2014-06-16 Thread Joe Gordon

On Sat, Jun 14, 2014 at 3:46 AM, Sean Dague  wrote:

> On 06/13/2014 06:47 PM, Joe Gordon wrote:
> >
> >
> >
> > On Thu, Jun 12, 2014 at 7:18 PM, Dan Prince wrote:
> >
> > On Thu, 2014-06-12 at 09:24 -0700, Joe Gordon wrote:
> > >
> > > On Jun 12, 2014 8:37 AM, "Sean Dague" wrote:
> > > >
> > > > On 06/12/2014 10:38 AM, Mike Bayer wrote:
> > > > >
> > > > > On 6/12/14, 8:26 AM, Julien Danjou wrote:
> > > > >> On Thu, Jun 12 2014, Sean Dague wrote:
> > > > >>
> > > > >>> That's not catchable in unit or functional tests?
> > > > >> Not in an accurate manner, no.
> > > > >>
> > > > >>> Keeping jobs alive based on the theory that they might one day be useful
> > > > >>> is something we just don't have the liberty to do any more. We've not
> > > > >>> seen an idle node in zuul in 2 days... and we're only at j-1. j-3 will
> > > > >>> be at least +50% of this load.
> > > > >> Sure, I'm not saying we don't have a problem. I'm just saying it's not a
> > > > >> good solution to fix that problem IMHO.
> > > > >
> > > > > Just my 2c without having a full understanding of all of OpenStack's CI
> > > > > environment, Postgresql is definitely different enough that MySQL
> > > > > "strict mode" could still allow issues to slip through quite easily, and
> > > > > also as far as capacity issues, this might be longer term but I'm hoping
> > > > > to get database-related tests to be lots faster if we can move to a
> > > > > model that spends much less time creating databases and schemas.
> > > >
> > > > This is what I mean by functional testing. If we were directly hitting a
> > > > real database on a set of in tree project tests, I think you could
> > > > discover issues like this. Neutron was headed down that path.
> > > >
> > > > But if we're talking about a devstack / tempest run, it's not really
> > > > applicable.
> > > >
> > > > If someone can point me to a case where we've actually found this kind
> > > > of bug with tempest / devstack, that would be great. I've just *never*
> > > > seen it. I was the one that did most of the fixing for pg support in
> > > > Nova, and have helped other projects as well, so I'm relatively familiar
> > > > with the kinds of fails we can discover. The ones that Julien pointed
> > > > really aren't likely to be exposed in our current system.
> > > >
> > > > Which is why I think we're mostly just burning cycles on the existing
> > > > approach for no gain.
> > >
> > > Given all the points made above, I think dropping PostgreSQL is the
> > > right choice; if only we had infinite cloud that would be another
> > > story.
> > >
> > > What about converting one of our existing jobs (grenade partial ncpu,
> > > large ops, regular grenade, tempest with nova network etc.) into a
> > > PostgreSQL only job? We could get some level of PostgreSQL testing
> > > without any additional jobs, although this is a tradeoff obviously.
> >
> > I'd be fine with this tradeoff if it allows us to keep PostgreSQL in the
> > mix.
> >
> > Here is my proposed change to how we handle postgres in the gate:
> >
> > https://review.openstack.org/#/c/100033
> >
> > Merge postgres and neutron jobs in integrated-gate template
> >
> > Instead of having a separate job for postgres and neutron, combine them.
> > In the integrated-gate we will only test postgres+neutron and not
> > neutron/mysql or nova-network/postgres.
> >
> > * neutron/mysql is still tested in integrated-gate-neutron
> > * nova-network/postgres is tested in nova
>
> Because neutron only runs smoke jobs, this actually drops all the
> interesting testing of pg. The things I've actually seen catch
> differences are the nova negative tests, which basically aren't run in
> this job.
>

I forgot about the smoke test only part when I originally proposed this.
From a cursory look, neutron-full appears to be fairly stable, so if we
move over to neutron-full in the near future that should address your
concerns. Are there plans to move over to neutron-full in the near future?


>
> So I think that's kind of the worst of all possible worlds, because it
> would make people think the thing is tested interestingly, when it's not.
>
> -Sean
>
> --
> Sean Dague
> http://dague.net

Re: [openstack-dev] Gate proposal - drop Postgresql configurations in the gate

2014-06-16 Thread Chris Dent


On Fri, 13 Jun 2014, Sean Dague wrote:

> So if we can't evolve the system back towards health, we need to just
> cut a bunch of stuff off until we can.

+1 This is kind of the crux of the biscuit. As things stand there's
so much noise that it's far too easy to think and act like it is
somebody else's problem.

--
Chris Dent


Re: [openstack-dev] Gate proposal - drop Postgresql configurations in the gate

2014-06-16 Thread Sean Dague

On 06/16/2014 04:33 AM, Thierry Carrez wrote:
> Robert Collins wrote:
>> [...]
>> C - If we can't make it harder to get races in, perhaps we can make it
>> easier to get races out. We have pretty solid emergent statistics from
>> every gate job that is run as check. What if we set a policy that when a
>> gate queue gets a race:
>>  - put a zuul stop all merges and checks on all involved branches
>> (prevent further damage, free capacity for validation)
>>  - figure out when it surfaced
>>  - determine it's not an external event
>>  - revert all involved branches back to the point where they looked
>> good, as one large operation
>>    - run that through jenkins N (e.g. 458) times in parallel.
>>    - on success land it
>>  - go through all the merges that have been reverted and either
>> twiddle them to be back in review with a new patchset against the
>> revert to restore their content, or alternatively generate new reviews
>> if gerrit would make that too hard.
> 
> One of the issues here is that "gate queue gets a race" is not a binary
> state. There are always rare issues, you just can't find all the bugs
> that happen 0.1% of the time. You add more such issues, and at some
> point they either add up to an unacceptable level, or some other
> environmental situation suddenly increases the odds of some old rare
> issue to happen (think: new test cluster with slightly different
> performance characteristics being thrown into our test resources). There
> is no single incident you need to find and fix, and during which you can
> clearly escalate to defCon 1. You can't even assume that a "gate
> situation" was created in the set of commits around when it surfaced.
> 
> So IMHO it's a continuous process : keep looking into rare issues all
> the time, to maintain them under the level where they become a problem.
> You can't just have a specific process that kicks in when "the gate
> queue gets a race".

Definitely agree. I also think part of the issue is we get emergent
behavior once we tip past some cumulative failure rate. Much of that
emergent behavior we are coming to understand over time. We've done
corrections like clean check and sliding gate window to impact them.

It's also that a new issue tends to take 12 hrs to see and figure out if
it's a ZOMG issue, and 3 - 5 days to see if it's any lower level of
severity. And given that we merge 50 - 100 patches a day, across 40
projects, across branches, the rollback would be  'interesting'.

-Sean.

-- 
Sean Dague
http://dague.net




Re: [openstack-dev] Gate proposal - drop Postgresql configurations in the gate

2014-06-16 Thread Thierry Carrez

Robert Collins wrote:
> [...]
> C - If we can't make it harder to get races in, perhaps we can make it
> easier to get races out. We have pretty solid emergent statistics from
> every gate job that is run as check. What if we set a policy that when a
> gate queue gets a race:
>  - put a zuul stop all merges and checks on all involved branches
> (prevent further damage, free capacity for validation)
>  - figure out when it surfaced
>  - determine it's not an external event
>  - revert all involved branches back to the point where they looked
> good, as one large operation
>    - run that through jenkins N (e.g. 458) times in parallel.
>    - on success land it
>  - go through all the merges that have been reverted and either
> twiddle them to be back in review with a new patchset against the
> revert to restore their content, or alternatively generate new reviews
> if gerrit would make that too hard.

One of the issues here is that "gate queue gets a race" is not a binary
state. There are always rare issues, you just can't find all the bugs
that happen 0.1% of the time. You add more such issues, and at some
point they either add up to an unacceptable level, or some other
environmental situation suddenly increases the odds of some old rare
issue to happen (think: new test cluster with slightly different
performance characteristics being thrown into our test resources). There
is no single incident you need to find and fix, and during which you can
clearly escalate to defCon 1. You can't even assume that a "gate
situation" was created in the set of commits around when it surfaced.

So IMHO it's a continuous process : keep looking into rare issues all
the time, to maintain them under the level where they become a problem.
You can't just have a specific process that kicks in when "the gate
queue gets a race".
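
To illustrate how rare issues add up, here is a minimal sketch assuming each issue fails a run independently of the others. The 0.1% figure comes from the paragraph above; the issue counts and the function name are assumptions chosen for illustration.

```python
import math


def cumulative_failure_rate(rates):
    """Probability that at least one of several independent issues hits a run."""
    return 1.0 - math.prod(1.0 - p for p in rates)


for count in (10, 50, 200):
    rate = cumulative_failure_rate([0.001] * count)
    print(f"{count} issues at 0.1% each: {rate:.1%} of runs fail")
```

No single 0.1% issue is visible on its own, yet under these assumptions a couple of hundred of them together fail close to one run in five.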

-- 
Thierry Carrez (ttx)


Re: [openstack-dev] Gate proposal - drop Postgresql configurations in the gate

2014-06-14 Thread Robert Collins

You know it's bad when you can't sleep because you're redesigning gate
workflows in your head so I apologise that this email is perhaps
not as rational, nor as organised, as usual - but , . :)

Obviously this is very important to address, and if we can come up
with something systemic I'm going to devote my time both directly, and
via resource-hunting within HP, to address it. And accordingly I'm
going to feel free to say 'zuul this' with no regard for existing
features. We need to get ahead of the problem and figure out how to
stay there, and I think below I show why the current strategy just
won't do that.

On 13 June 2014 06:08, Sean Dague  wrote:

> We're hitting a couple of inflection points.
>
> 1) We're basically at capacity for the unit work that we can do. Which
> means it's time to start making decisions if we believe everything we
> currently have running is more important than the things we aren't
> currently testing.
>
> Everyone wants multinode testing in the gate. It would be impossible to
> support that given current resources.

How much of our capacity problems are due to waste - such as:
 - tempest runs of code the author knows is broken
 - tempest runs of code that doesn't pass unit tests
 - tempest runs while the baseline is unstable - to expand on this
one, if master only passes one commit in 4, no check job can have a
higher success rate overall.

Vs how much are an indication of the sheer volume of development being done?
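
To make the unstable-baseline point concrete, here is a minimal sketch assuming failures in master and failures introduced by the patch are independent. The function names are illustrative only.

```python
def check_pass_rate(baseline_pass: float, patch_pass: float = 1.0) -> float:
    """Chance that one check run passes, given baseline and patch pass rates."""
    return baseline_pass * patch_pass


def expected_runs_per_pass(pass_rate: float) -> float:
    """Average number of runs needed to see one pass (geometric distribution)."""
    if not 0.0 < pass_rate <= 1.0:
        raise ValueError("pass_rate must be in (0, 1]")
    return 1.0 / pass_rate


# A flawless patch on a master that only passes one run in four.
rate = check_pass_rate(0.25)
print(rate, expected_runs_per_pass(rate))
```

Even a perfect patch is capped at the baseline's pass rate, so on a one-in-four master it needs four runs on average, and three of those are pure waste.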

> 2) We're far past the inflection point of people actually debugging jobs
> when they go wrong.
>
> The gate is backed up (currently to 24hrs) because there are bugs in
> OpenStack. Those are popping up at a rate much faster than the number of
> people who are willing to spend any time on them. And often they are
> popping up in configurations that we're not all that familiar with.

So, I *totally* appreciate that people fixing the jobs is the visible
expendable resource, but I'm not sure it's the bottleneck. I think the
bottleneck is our aggregate ability to a) detect the problem and b)
resolve it.

For instance - strawman - if when the gate goes bad, after a check for
external issues like new SQLAlchemy releases etc, what if we just
rolled trunk of every project that is in the integrated gate back to
before the success rate nosedived ? I'm well aware of the DVCS issues
that implies, but from a human debugging perspective that would
massively increase the leverage we get from the folk that do dive in
and help. It moves from 'figure out that there is a problem and it
came in after X AND FIX IT' to 'figure out it came in after X'.

Reverting is usually much faster and more robust than rolling forward,
because rolling forward has more unknowns.
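
A rough sketch of the 'figure out it came in after X' step, assuming we have an ordered list of pass/fail results for gate runs. The window size, threshold and function name are assumptions for illustration. Real gate data is noisier and would need more care than this.

```python
from typing import Optional, Sequence


def first_bad_window(results: Sequence[bool], window: int = 20,
                     threshold: float = 0.8) -> Optional[int]:
    """Start index of the first window whose pass rate is below threshold.

    Returns None if no window of the given size drops below the threshold.
    Rolling back to just before the returned index is deliberately
    conservative: the window starts dipping before it is fully bad.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    for start in range(len(results) - window + 1):
        passed = sum(1 for ok in results[start:start + window] if ok)
        if passed / window < threshold:
            return start
    return None


# Toy history: 20 passing runs, then the gate goes bad.
history = [True] * 20 + [False] * 20
print(first_bad_window(history, window=10))
```

The returned index only narrows down where to roll back to. It says nothing about which commit introduced the problem, which is the point of the strawman: the rollback does not require knowing that.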

I think we have a systematic problem, because this situation happens
again and again. And the root cause is that our time to detect
races/nondeterministic tests is a probability function, not a simple
scalar. Sometimes we catch such tests within one patch in the gate,
sometimes they slip through. If we want to land hundreds or thousands
of patches a day, and we don't want this pain to happen, I don't see
any way other than *either*:
A - not doing this whole gating CI process at all
B - making detection a whole lot more reliable (e.g. we want
near-certainty that a given commit does not contain a race)
C - making repair a whole lot faster (e.g. we want <= one test cycle
in the gate to recover once we have determined that some commit is
broken).

Taking them in turn:
A - yeah, no. We have lots of experience with the axiom that that
which is not tested is broken. And that's the big concern about
removing things from our matrix - when they are not tested, we can be
sure that they will break and we will have to spend neurons fixing
them - either directly or as reviews from people fixing it.

B - this is really hard. Say we want to be quite sure that there are no
new races that will occur with more than some probability in a given
commit, and we assume that race codepaths might be run just once in
the whole test matrix. A single test run can never tell us that - it
just tells us it worked. What we need is some N trials where we don't
observe a new race (but may observe old races), given a maximum risk
of the introduction of a (say) 5% failure rate into the gate. [check
my stats]
(1-max risk)^trials = margin-of-error
0.95^N = 0.01
log(0.01, base=0.95) = N
N ~= 90

So if we want to stop 5% races landing, and we may exercise any given
possible race code path a minimum of 1 time in the test matrix, we
need to exercise the whole test matrix 90 times to be sure, within that
1% margin, that we would have seen it. Raise that to a 1% race:
log(0.01, base=0.99) ~= 458
That's a lot of test runs. I don't think we can do that for each commit
with our current resources - and I'm not at all sure that asking for
enough resources to do that makes sense. Maybe it does.
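
The arithmetic above can be checked with a short Python sketch. It
assumes, as the estimate does, that each full run of the test matrix is
an independent trial that trips a given race with a fixed probability;
the function name trials_needed is only illustrative. Rounding up to a
whole number of runs gives 90 for a 5% race and 459 for a 1% race (the
unrounded value is about 458.2).

```python
import math


def trials_needed(race_failure_rate: float, margin_of_error: float) -> int:
    """Smallest N with (1 - race_failure_rate) ** N <= margin_of_error.

    Assumption: every run is independent and fails with the same
    probability whenever the race is present.
    """
    if not 0.0 < race_failure_rate < 1.0:
        raise ValueError("race_failure_rate must be between 0 and 1")
    if not 0.0 < margin_of_error < 1.0:
        raise ValueError("margin_of_error must be between 0 and 1")
    # Solve (1 - p)^N = margin for N, then round up to whole runs.
    exact = math.log(margin_of_error) / math.log(1.0 - race_failure_rate)
    return math.ceil(exact)


if __name__ == "__main__":
    for rate in (0.05, 0.01):
        runs = trials_needed(rate, 0.01)
        print(f"{rate:.0%} race, 1% margin: {runs} clean runs needed")
```

The cost grows roughly in inverse proportion to the failure rate we want
to rule out, which is why rarer races get expensive to exclude so quickly.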

Data point - our current risk, with 1% margin:
(1-max risk)^1 = 0.01
99% (that is, a single passing gat

Re: [openstack-dev] Gate proposal - drop Postgresql configurations in the gate

2014-06-14 Thread Sean Dague

On 06/13/2014 06:47 PM, Joe Gordon wrote:
> 
> 
> 
> On Thu, Jun 12, 2014 at 7:18 PM, Dan Prince wrote:
> 
> On Thu, 2014-06-12 at 09:24 -0700, Joe Gordon wrote:
> >
> > > On Jun 12, 2014 8:37 AM, "Sean Dague" wrote:
> > >
> > > On 06/12/2014 10:38 AM, Mike Bayer wrote:
> > > >
> > > > On 6/12/14, 8:26 AM, Julien Danjou wrote:
> > > >> On Thu, Jun 12 2014, Sean Dague wrote:
> > > >>
> > > >>> That's not cacthable in unit or functional tests?
> > > >> Not in an accurate manner, no.
> > > >>
> > > >>> Keeping jobs alive based on the theory that they might one day
> > be useful
> > > >>> is something we just don't have the liberty to do any more.
> > We've not
> > > >>> seen an idle node in zuul in 2 days... and we're only at j-1.
> > j-3 will
> > > >>> be at least +50% of this load.
> > > >> Sure, I'm not saying we don't have a problem. I'm just saying
> > it's not a
> > > >> good solution to fix that problem IMHO.
> > > >
> > > > Just my 2c without having a full understanding of all of
> > OpenStack's CI
> > > > environment, Postgresql is definitely different enough that MySQL
> > > > "strict mode" could still allow issues to slip through quite
> > easily, and
> > > > also as far as capacity issues, this might be longer term but I'm
> > hoping
> > > > to get database-related tests to be lots faster if we can move to
> > a
> > > > model that spends much less time creating databases and schemas.
> > >
> > > This is what I mean by functional testing. If we were directly
> > hitting a
> > > real database on a set of in tree project tests, I think you could
> > > discover issues like this. Neutron was headed down that path.
> > >
> > > But if we're talking about a devstack / tempest run, it's not really
> > > applicable.
> > >
> > > If someone can point me to a case where we've actually found this
> > kind
> > > of bug with tempest / devstack, that would be great. I've just
> > *never*
> > > seen it. I was the one that did most of the fixing for pg support in
> > > Nova, and have helped other projects as well, so I'm relatively
> > familiar
> > > with the kinds of fails we can discover. The ones that Julien
> > pointed
> > > really aren't likely to be exposed in our current system.
> > >
> > > Which is why I think we're mostly just burning cycles on the
> > existing
> > > approach for no gain.
> >
> > Given all the points made above, I think dropping PostgreSQL is the
> > right choice; if only we had infinite cloud that would be another
> > story.
> >
> > What about converting one of our existing jobs (grenade partial ncpu,
> > large ops, regular grenade, tempest with nova network etc.) Into a
> > PostgreSQL only job? We could get some level of PostgreSQL testing
> > without any additional jobs, although this is  tradeoff obviously.
> 
> I'd be fine with this tradeoff if it allows us to keep PostgreSQL in the
> mix.
> 
> 
> Here is my proposed change to how we handle postgres in the gate:
> 
> https://review.openstack.org/#/c/100033
> 
> 
> Merge postgres and neutron jobs in integrated-gate template
> 
> Instead of having a separate job for postgres and neutron, combine them.
> In the integrated-gate we will only test postgres+neutron and not
> neutron/mysql or nova-network/postgres.
> 
> * neutron/mysql is still tested in integrated-gate-neutron
> * nova-network/postgres is tested in nova

Because neutron only runs smoke jobs, this actually drops all the
interesting testing of pg. The things I've actually seen catch
differences are the nova negative tests, which basically aren't run in
this job.

So I think that's kind of the worst of all possible worlds, because it
would make people think the thing is tested interestingly, when it's not.

-Sean

-- 
Sean Dague
http://dague.net



___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

Re: [openstack-dev] Gate proposal - drop Postgresql configurations in the gate

2014-06-13 Thread Joe Gordon

On Thu, Jun 12, 2014 at 7:18 PM, Dan Prince  wrote:

> On Thu, 2014-06-12 at 09:24 -0700, Joe Gordon wrote:
> >
> > On Jun 12, 2014 8:37 AM, "Sean Dague"  wrote:
> > >
> > > On 06/12/2014 10:38 AM, Mike Bayer wrote:
> > > >
> > > > On 6/12/14, 8:26 AM, Julien Danjou wrote:
> > > >> On Thu, Jun 12 2014, Sean Dague wrote:
> > > >>
> > > >>> That's not cacthable in unit or functional tests?
> > > >> Not in an accurate manner, no.
> > > >>
> > > >>> Keeping jobs alive based on the theory that they might one day
> > be useful
> > > >>> is something we just don't have the liberty to do any more.
> > We've not
> > > >>> seen an idle node in zuul in 2 days... and we're only at j-1.
> > j-3 will
> > > >>> be at least +50% of this load.
> > > >> Sure, I'm not saying we don't have a problem. I'm just saying
> > it's not a
> > > >> good solution to fix that problem IMHO.
> > > >
> > > > Just my 2c without having a full understanding of all of
> > OpenStack's CI
> > > > environment, Postgresql is definitely different enough that MySQL
> > > > "strict mode" could still allow issues to slip through quite
> > easily, and
> > > > also as far as capacity issues, this might be longer term but I'm
> > hoping
> > > > to get database-related tests to be lots faster if we can move to
> > a
> > > > model that spends much less time creating databases and schemas.
> > >
> > > This is what I mean by functional testing. If we were directly
> > hitting a
> > > real database on a set of in tree project tests, I think you could
> > > discover issues like this. Neutron was headed down that path.
> > >
> > > But if we're talking about a devstack / tempest run, it's not really
> > > applicable.
> > >
> > > If someone can point me to a case where we've actually found this
> > kind
> > > of bug with tempest / devstack, that would be great. I've just
> > *never*
> > > seen it. I was the one that did most of the fixing for pg support in
> > > Nova, and have helped other projects as well, so I'm relatively
> > familiar
> > > with the kinds of fails we can discover. The ones that Julien
> > pointed
> > > really aren't likely to be exposed in our current system.
> > >
> > > Which is why I think we're mostly just burning cycles on the
> > existing
> > > approach for no gain.
> >
> > Given all the points made above, I think dropping PostgreSQL is the
> > right choice; if only we had infinite cloud that would be another
> > story.
> >
> > What about converting one of our existing jobs (grenade partial ncpu,
> > large ops, regular grenade, tempest with nova network etc.) Into a
> > PostgreSQL only job? We could get some level of PostgreSQL testing
> > without any additional jobs, although this is  tradeoff obviously.
>
> I'd be fine with this tradeoff if it allows us to keep PostgreSQL in the
> mix.
>
>
Here is my proposed change to how we handle postgres in the gate:

https://review.openstack.org/#/c/100033


Merge postgres and neutron jobs in integrated-gate template

Instead of having a separate job for postgres and neutron, combine them.
In the integrated-gate we will only test postgres+neutron and not
neutron/mysql or nova-network/postgres.

* neutron/mysql is still tested in integrated-gate-neutron
* nova-network/postgres is tested in nova



Re: [openstack-dev] Gate proposal - drop Postgresql configurations in the gate

2014-06-13 Thread Sean Dague

On 06/13/2014 08:13 AM, Mark McLoughlin wrote:
> On Fri, 2014-06-13 at 07:31 -0400, Sean Dague wrote:
>> On 06/13/2014 02:36 AM, Mark McLoughlin wrote:
>>> On Thu, 2014-06-12 at 22:10 -0400, Dan Prince wrote:
>>>> On Thu, 2014-06-12 at 08:06 -0400, Sean Dague wrote:
>>>>> We're definitely deep into capacity issues, so it's going to be time to
>>>>> start making tougher decisions about things we decide aren't different
>>>>> enough to bother testing on every commit.
>>>>
>>>> In order to save resources why not combine some of the jobs in different
>>>> ways. So for example instead of:
>>>>
>>>>  check-tempest-dsvm-full
>>>>  check-tempest-dsvm-postgres-full
>>>>
>>>> Couldn't we just drop the postgres-full job and run one of the Neutron
>>>> jobs w/ postgres instead? Or something similar, so long as at least one
>>>> of the jobs which runs most of Tempest is using PostgreSQL I think we'd
>>>> be mostly fine. Not shooting for 100% coverage for everything with our
>>>> limited resource pool is fine, lets just do the best we can.
>>>>
>>>> Ditto for gate jobs (not check).
>>>
>>> I think that's what Clark was suggesting in:
>>>
>>> https://etherpad.openstack.org/p/juno-test-maxtrices
>>>
>>>>> Previously we've been testing Postgresql in the gate because it has a
>>>>> stricter interpretation of SQL than MySQL. And when we didn't test
>>>>> Postgresql it regressed. I know, I chased it for about 4 weeks in grizzly.
>>>>>
>>>>> However Monty brought up a good point at Summit, that MySQL has a strict
>>>>> mode. That should actually enforce the same strictness.
>>>>>
>>>>> My proposal is that we land this change to devstack -
>>>>> https://review.openstack.org/#/c/97442/ and backport it to past devstack
>>>>> branches.
>>>>>
>>>>> Then we drop the pg jobs, as the differences between the 2 configs
>>>>> should then be very minimal. All the *actual* failures we've seen
>>>>> between the 2 were completely about this strict SQL mode interpretation.
>>>>
>>>>
>>>> I suppose I would like to see us keep it in the mix. Running SmokeStack
>>>> for almost 3 years I found many an issue dealing w/ PostgreSQL. I ran it
>>>> concurrently with many of the other jobs and I too had limited resources
>>>> (much less that what we have in infra today).
>>>>
>>>> Would MySQL strict SQL mode catch stuff like this (old bugs, but still
>>>> valid for this topic I think):
>>>>
>>>>  https://bugs.launchpad.net/nova/+bug/948066
>>>>
>>>>  https://bugs.launchpad.net/nova/+bug/1003756
>>>>
>>>>
>>>> Having support for and testing against at least 2 databases helps keep
>>>> our SQL queries and migrations cleaner... and is generally a good
>>>> practice given we have abstractions which are meant to support this sort
>>>> of thing anyway (so by all means let us test them!).
>>>>
>>>> Also, Having compacted the Nova migrations 3 times now I found many
>>>> issues by testing on multiple databases (MySQL and PostgreSQL). I'm
>>>> quite certain our migrations would be worse off if we just tested
>>>> against the single database.
>>>
>>> Certainly sounds like this testing is far beyond the "might one day be
>>> useful" level Sean talks about.
>>
>> The migration compaction is a good point. And I'm happy to see there
>> were some bugs exposed as well.
>>
>> Here is where I remain stuck
>>
>> We are now at a failure rate in which it's 3 days (minimum) to land a
>> fix that decreases our failure rate at all.
>>
>> The way we are currently solving this is by effectively building "manual
>> zuul" and taking smart humans in coordination to end run around our
>> system. We've merged 18 fixes so far -
>> https://etherpad.openstack.org/p/gatetriage-june2014 this way. Merging a
>> fix this way is at least an order of magnitude more expensive on people
>> time because of the analysis and coordination we need to go through to
>> make sure these things are the right things to jump the queue.
>>
>> That effort, over 8 days, has gotten us down to *only* a 24hr merge
>> delay. And there are no more smoking guns. What's left is a ton of
>> subtle things. I've got ~ 30 patches outstanding right now (a bunch are
>> things to clarify what's going on in the build runs especially in the
>> fail scenarios). Every single one of them has been failed by Jenkins at
>> least once. Almost every one was failed by a different unique issue.
>>
>> So I'd say at best we're 25% of the way towards solving this. That being
>> said, because of the deep queues, people are just recheck grinding (or
>> hitting the jackpot and landing something through that then fails a lot
>> after landing). That leads to bugs like this:
>>
>> https://bugs.launchpad.net/heat/+bug/1306029
>>
>> Which was seen early in the patch - https://review.openstack.org/#/c/97569/
>>
>> Then kind of destroyed us completely for a day -
>> http://status.openstack.org/elastic-recheck/ (it's the top graph).
>>
>> And, predictably, a week into a long gate queue everyone is now grumpy.
>> The sniping between projects, an

Re: [openstack-dev] Gate proposal - drop Postgresql configurations in the gate

2014-06-13 Thread Mark McLoughlin

On Fri, 2014-06-13 at 07:31 -0400, Sean Dague wrote:
> On 06/13/2014 02:36 AM, Mark McLoughlin wrote:
> > On Thu, 2014-06-12 at 22:10 -0400, Dan Prince wrote:
> >> On Thu, 2014-06-12 at 08:06 -0400, Sean Dague wrote:
> >>> We're definitely deep into capacity issues, so it's going to be time to
> >>> start making tougher decisions about things we decide aren't different
> >>> enough to bother testing on every commit.
> >>
> >> In order to save resources why not combine some of the jobs in different
> >> ways. So for example instead of:
> >>
> >>  check-tempest-dsvm-full
> >>  check-tempest-dsvm-postgres-full
> >>
> >> Couldn't we just drop the postgres-full job and run one of the Neutron
> >> jobs w/ postgres instead? Or something similar, so long as at least one
> >> of the jobs which runs most of Tempest is using PostgreSQL I think we'd
> >> be mostly fine. Not shooting for 100% coverage for everything with our
> >> limited resource pool is fine, lets just do the best we can.
> >>
> >> Ditto for gate jobs (not check).
> > 
> > I think that's what Clark was suggesting in:
> > 
> > https://etherpad.openstack.org/p/juno-test-maxtrices
> > 
> >>> Previously we've been testing Postgresql in the gate because it has a
> >>> stricter interpretation of SQL than MySQL. And when we didn't test
> >>> Postgresql it regressed. I know, I chased it for about 4 weeks in grizzly.
> >>>
> >>> However Monty brought up a good point at Summit, that MySQL has a strict
> >>> mode. That should actually enforce the same strictness.
> >>>
> >>> My proposal is that we land this change to devstack -
> >>> https://review.openstack.org/#/c/97442/ and backport it to past devstack
> >>> branches.
> >>>
> >>> Then we drop the pg jobs, as the differences between the 2 configs
> >>> should then be very minimal. All the *actual* failures we've seen
> >>> between the 2 were completely about this strict SQL mode interpretation.
> >>
> >>
> >> I suppose I would like to see us keep it in the mix. Running SmokeStack
> >> for almost 3 years I found many an issue dealing w/ PostgreSQL. I ran it
> >> concurrently with many of the other jobs and I too had limited resources
> >> (much less that what we have in infra today).
> >>
> >> Would MySQL strict SQL mode catch stuff like this (old bugs, but still
> >> valid for this topic I think):
> >>
> >>  https://bugs.launchpad.net/nova/+bug/948066
> >>
> >>  https://bugs.launchpad.net/nova/+bug/1003756
> >>
> >>
> >> Having support for and testing against at least 2 databases helps keep
> >> our SQL queries and migrations cleaner... and is generally a good
> >> practice given we have abstractions which are meant to support this sort
> >> of thing anyway (so by all means let us test them!).
> >>
> >> Also, Having compacted the Nova migrations 3 times now I found many
> >> issues by testing on multiple databases (MySQL and PostgreSQL). I'm
> >> quite certain our migrations would be worse off if we just tested
> >> against the single database.
> > 
> > Certainly sounds like this testing is far beyond the "might one day be
> > useful" level Sean talks about.
> 
> The migration compaction is a good point. And I'm happy to see there
> were some bugs exposed as well.
> 
> Here is where I remain stuck
> 
> We are now at a failure rate in which it's 3 days (minimum) to land a
> fix that decreases our failure rate at all.
> 
> The way we are currently solving this is by effectively building "manual
> zuul" and taking smart humans in coordination to end run around our
> system. We've merged 18 fixes so far -
> https://etherpad.openstack.org/p/gatetriage-june2014 this way. Merging a
> fix this way is at least an order of magnitude more expensive on people
> time because of the analysis and coordination we need to go through to
> make sure these things are the right things to jump the queue.
> 
> That effort, over 8 days, has gotten us down to *only* a 24hr merge
> delay. And there are no more smoking guns. What's left is a ton of
> subtle things. I've got ~ 30 patches outstanding right now (a bunch are
> things to clarify what's going on in the build runs especially in the
> fail scenarios). Every single one of them has been failed by Jenkins at
> least once. Almost every one was failed by a different unique issue.
> 
> So I'd say at best we're 25% of the way towards solving this. That being
> said, because of the deep queues, people are just recheck grinding (or
> hitting the jackpot and landing something through that then fails a lot
> after landing). That leads to bugs like this:
> 
> https://bugs.launchpad.net/heat/+bug/1306029
> 
> Which was seen early in the patch - https://review.openstack.org/#/c/97569/
> 
> Then kind of destroyed us completely for a day -
> http://status.openstack.org/elastic-recheck/ (it's the top graph).
> 
> And, predictably, a week into a long gate queue everyone is now grumpy.
> The sniping between projects, and within projects in assigning blame
> starts to spike at about day 4 of t

Re: [openstack-dev] Gate proposal - drop Postgresql configurations in the gate

2014-06-13 Thread Sean Dague

On 06/12/2014 10:10 PM, Dan Prince wrote:
> On Thu, 2014-06-12 at 08:06 -0400, Sean Dague wrote:
>> We're definitely deep into capacity issues, so it's going to be time to
>> start making tougher decisions about things we decide aren't different
>> enough to bother testing on every commit.
> 
> In order to save resources why not combine some of the jobs in different
> ways. So for example instead of:
> 
>  check-tempest-dsvm-full
>  check-tempest-dsvm-postgres-full
> 
> Couldn't we just drop the postgres-full job and run one of the Neutron
> jobs w/ postgres instead? Or something similar, so long as at least one
> of the jobs which runs most of Tempest is using PostgreSQL I think we'd
> be mostly fine. Not shooting for 100% coverage for everything with our
> limited resource pool is fine, lets just do the best we can.
> 
> Ditto for gate jobs (not check).
> 
> 
>>
>> Previously we've been testing Postgresql in the gate because it has a
>> stricter interpretation of SQL than MySQL. And when we didn't test
>> Postgresql it regressed. I know, I chased it for about 4 weeks in grizzly.
>>
>> However Monty brought up a good point at Summit, that MySQL has a strict
>> mode. That should actually enforce the same strictness.
>>
>> My proposal is that we land this change to devstack -
>> https://review.openstack.org/#/c/97442/ and backport it to past devstack
>> branches.
>>
>> Then we drop the pg jobs, as the differences between the 2 configs
>> should then be very minimal. All the *actual* failures we've seen
>> between the 2 were completely about this strict SQL mode interpretation.
> 
> 
> I suppose I would like to see us keep it in the mix. Running SmokeStack
> for almost 3 years I found many an issue dealing w/ PostgreSQL. I ran it
> concurrently with many of the other jobs and I too had limited resources
> (much less that what we have in infra today).
> 
> Would MySQL strict SQL mode catch stuff like this (old bugs, but still
> valid for this topic I think):
> 
>  https://bugs.launchpad.net/nova/+bug/948066
> 
>  https://bugs.launchpad.net/nova/+bug/1003756
> 
> 
> Having support for and testing against at least 2 databases helps keep
> our SQL queries and migrations cleaner... and is generally a good
> practice given we have abstractions which are meant to support this sort
> of thing anyway (so by all means let us test them!).
> 
> Also, Having compacted the Nova migrations 3 times now I found many
> issues by testing on multiple databases (MySQL and PostgreSQL). I'm
> quite certain our migrations would be worse off if we just tested
> against the single database.

Through Tempest? Or at a lower level?

Dropping this Tempest job doesn't mean we're going to remove the other
databases from unit tests, or stop keeping them available for functional
tests.

But my experience is that these sorts of things have not been found
through the API surface.

-Sean

-- 
Sean Dague
http://dague.net




Re: [openstack-dev] Gate proposal - drop Postgresql configurations in the gate

2014-06-13 Thread Sean Dague

On 06/13/2014 02:36 AM, Mark McLoughlin wrote:
> On Thu, 2014-06-12 at 22:10 -0400, Dan Prince wrote:
>> On Thu, 2014-06-12 at 08:06 -0400, Sean Dague wrote:
>>> We're definitely deep into capacity issues, so it's going to be time to
>>> start making tougher decisions about things we decide aren't different
>>> enough to bother testing on every commit.
>>
>> In order to save resources why not combine some of the jobs in different
>> ways. So for example instead of:
>>
>>  check-tempest-dsvm-full
>>  check-tempest-dsvm-postgres-full
>>
>> Couldn't we just drop the postgres-full job and run one of the Neutron
>> jobs w/ postgres instead? Or something similar, so long as at least one
>> of the jobs which runs most of Tempest is using PostgreSQL I think we'd
>> be mostly fine. Not shooting for 100% coverage for everything with our
>> limited resource pool is fine, lets just do the best we can.
>>
>> Ditto for gate jobs (not check).
> 
> I think that's what Clark was suggesting in:
> 
> https://etherpad.openstack.org/p/juno-test-maxtrices
> 
>>> Previously we've been testing Postgresql in the gate because it has a
>>> stricter interpretation of SQL than MySQL. And when we didn't test
>>> Postgresql it regressed. I know, I chased it for about 4 weeks in grizzly.
>>>
>>> However Monty brought up a good point at Summit, that MySQL has a strict
>>> mode. That should actually enforce the same strictness.
>>>
>>> My proposal is that we land this change to devstack -
>>> https://review.openstack.org/#/c/97442/ and backport it to past devstack
>>> branches.
>>>
>>> Then we drop the pg jobs, as the differences between the 2 configs
>>> should then be very minimal. All the *actual* failures we've seen
>>> between the 2 were completely about this strict SQL mode interpretation.
>>
>>
>> I suppose I would like to see us keep it in the mix. Running SmokeStack
>> for almost 3 years I found many an issue dealing w/ PostgreSQL. I ran it
>> concurrently with many of the other jobs and I too had limited resources
>> (much less that what we have in infra today).
>>
>> Would MySQL strict SQL mode catch stuff like this (old bugs, but still
>> valid for this topic I think):
>>
>>  https://bugs.launchpad.net/nova/+bug/948066
>>
>>  https://bugs.launchpad.net/nova/+bug/1003756
>>
>>
>> Having support for and testing against at least 2 databases helps keep
>> our SQL queries and migrations cleaner... and is generally a good
>> practice given we have abstractions which are meant to support this sort
>> of thing anyway (so by all means let us test them!).
>>
>> Also, Having compacted the Nova migrations 3 times now I found many
>> issues by testing on multiple databases (MySQL and PostgreSQL). I'm
>> quite certain our migrations would be worse off if we just tested
>> against the single database.
> 
> Certainly sounds like this testing is far beyond the "might one day be
> useful" level Sean talks about.

The migration compaction is a good point. And I'm happy to see there
were some bugs exposed as well.

Here is where I remain stuck:

We are now at a failure rate in which it's 3 days (minimum) to land a
fix that decreases our failure rate at all.

The way we are currently solving this is by effectively building "manual
zuul" and taking smart humans in coordination to end run around our
system. We've merged 18 fixes so far -
https://etherpad.openstack.org/p/gatetriage-june2014 this way. Merging a
fix this way is at least an order of magnitude more expensive on people
time because of the analysis and coordination we need to go through to
make sure these things are the right things to jump the queue.

That effort, over 8 days, has gotten us down to *only* a 24hr merge
delay. And there are no more smoking guns. What's left is a ton of
subtle things. I've got ~ 30 patches outstanding right now (a bunch are
things to clarify what's going on in the build runs especially in the
fail scenarios). Every single one of them has been failed by Jenkins at
least once. Almost every one was failed by a different unique issue.

So I'd say at best we're 25% of the way towards solving this. That being
said, because of the deep queues, people are just recheck grinding (or
hitting the jackpot and landing something through that then fails a lot
after landing). That leads to bugs like this:

https://bugs.launchpad.net/heat/+bug/1306029

Which was seen early in the patch - https://review.openstack.org/#/c/97569/

Then kind of destroyed us completely for a day -
http://status.openstack.org/elastic-recheck/ (it's the top graph).

And, predictably, a week into a long gate queue everyone is now grumpy.
The sniping between projects, and within projects in assigning blame
starts to spike at about day 4 of these events. Everyone assumes someone
else is to blame for these things.

So there is real community impact when we get to these states.

So, I'm kind of burnt out trying to figure out how to get us out of
this. As I do take it personally when we as a proje

Re: [openstack-dev] Gate proposal - drop Postgresql configurations in the gate

2014-06-13 Thread Sean Dague

On 06/12/2014 10:18 PM, Dan Prince wrote:
> On Thu, 2014-06-12 at 09:24 -0700, Joe Gordon wrote:
>>
>> On Jun 12, 2014 8:37 AM, "Sean Dague"  wrote:
>>>
>>> On 06/12/2014 10:38 AM, Mike Bayer wrote:

>>>> On 6/12/14, 8:26 AM, Julien Danjou wrote:
>>>>> On Thu, Jun 12 2014, Sean Dague wrote:
>>>>>
>>>>>> That's not cacthable in unit or functional tests?
>>>>> Not in an accurate manner, no.
>>>>>
>>>>>> Keeping jobs alive based on the theory that they might one day
>> be useful
>>>>>> is something we just don't have the liberty to do any more.
>> We've not
>>>>>> seen an idle node in zuul in 2 days... and we're only at j-1.
>> j-3 will
>>>>>> be at least +50% of this load.
>>>>> Sure, I'm not saying we don't have a problem. I'm just saying
>> it's not a
>>>>> good solution to fix that problem IMHO.
>>>>
>>>> Just my 2c without having a full understanding of all of
>> OpenStack's CI
>>>> environment, Postgresql is definitely different enough that MySQL
>>>> "strict mode" could still allow issues to slip through quite
>> easily, and
>>>> also as far as capacity issues, this might be longer term but I'm
>> hoping
>>>> to get database-related tests to be lots faster if we can move to
>> a
>>>> model that spends much less time creating databases and schemas.
>>>
>>> This is what I mean by functional testing. If we were directly
>> hitting a
>>> real database on a set of in tree project tests, I think you could
>>> discover issues like this. Neutron was headed down that path.
>>>
>>> But if we're talking about a devstack / tempest run, it's not really
>>> applicable.
>>>
>>> If someone can point me to a case where we've actually found this
>> kind
>>> of bug with tempest / devstack, that would be great. I've just
>> *never*
>>> seen it. I was the one that did most of the fixing for pg support in
>>> Nova, and have helped other projects as well, so I'm relatively
>> familiar
>>> with the kinds of fails we can discover. The ones that Julien
>> pointed
>>> really aren't likely to be exposed in our current system.
>>>
>>> Which is why I think we're mostly just burning cycles on the
>> existing
>>> approach for no gain.
>>
>> Given all the points made above, I think dropping PostgreSQL is the
>> right choice; if only we had infinite cloud that would be another
>> story.
>>
>> What about converting one of our existing jobs (grenade partial ncpu,
>> large ops, regular grenade, tempest with nova network etc.) Into a
>> PostgreSQL only job? We could get some level of PostgreSQL testing
>> without any additional jobs, although this is  tradeoff obviously.
> 
> I'd be fine with this tradeoff if it allows us to keep PostgreSQL in the
> mix.

The problem isn't just testing, it's people looking at the failures in
the different configurations.

I'm glad everyone loves having lots of configurations. :)

I'm less glad we've got a 24hr merge queue in the gate because very few
people are actually sifting through the failed results to figure out why
and fix them. :(

If we had more people looking through failures then it would just be a
machine capacity problem. But it's not, it's also a people capacity problem.

It's just not sustainable as a project. Pleading with people to help on
the failure side has not worked over the last year. So I really think
we're at a point where we need to start throwing out jobs until we reduce
the failure rate to one at which we can actually make forward progress.

Because right now we typically can't land fixes for the race conditions
in any timely manner because they get stomped by other races. I've got a
giant set of outstanding patches to make some of this stuff clearer,
which is all stuck.

So if we can't evolve the system back towards health, we need to just
cut a bunch of stuff off until we can.

-Sean

-- 
Sean Dague
http://dague.net




Re: [openstack-dev] Gate proposal - drop Postgresql configurations in the gate

2014-06-12 Thread Mark McLoughlin

On Thu, 2014-06-12 at 22:10 -0400, Dan Prince wrote:
> On Thu, 2014-06-12 at 08:06 -0400, Sean Dague wrote:
> > We're definitely deep into capacity issues, so it's going to be time to
> > start making tougher decisions about things we decide aren't different
> > enough to bother testing on every commit.
> 
> In order to save resources why not combine some of the jobs in different
> ways. So for example instead of:
> 
>  check-tempest-dsvm-full
>  check-tempest-dsvm-postgres-full
> 
> Couldn't we just drop the postgres-full job and run one of the Neutron
> jobs w/ postgres instead? Or something similar, so long as at least one
> of the jobs which runs most of Tempest is using PostgreSQL I think we'd
> be mostly fine. Not shooting for 100% coverage for everything with our
> limited resource pool is fine, lets just do the best we can.
> 
> Ditto for gate jobs (not check).

I think that's what Clark was suggesting in:

https://etherpad.openstack.org/p/juno-test-maxtrices

> > Previously we've been testing Postgresql in the gate because it has a
> > stricter interpretation of SQL than MySQL. And when we didn't test
> > Postgresql it regressed. I know, I chased it for about 4 weeks in grizzly.
> > 
> > However Monty brought up a good point at Summit, that MySQL has a strict
> > mode. That should actually enforce the same strictness.
> > 
> > My proposal is that we land this change to devstack -
> > https://review.openstack.org/#/c/97442/ and backport it to past devstack
> > branches.
> > 
> > Then we drop the pg jobs, as the differences between the 2 configs
> > should then be very minimal. All the *actual* failures we've seen
> > between the 2 were completely about this strict SQL mode interpretation.
> 
> 
> I suppose I would like to see us keep it in the mix. Running SmokeStack
> for almost 3 years I found many an issue dealing w/ PostgreSQL. I ran it
> concurrently with many of the other jobs and I too had limited resources
> (much less than what we have in infra today).
> 
> Would MySQL strict SQL mode catch stuff like this (old bugs, but still
> valid for this topic I think):
> 
>  https://bugs.launchpad.net/nova/+bug/948066
> 
>  https://bugs.launchpad.net/nova/+bug/1003756
> 
> 
> Having support for and testing against at least 2 databases helps keep
> our SQL queries and migrations cleaner... and is generally a good
> practice given we have abstractions which are meant to support this sort
> of thing anyway (so by all means let us test them!).
> 
> Also, having compacted the Nova migrations 3 times now, I found many
> issues by testing on multiple databases (MySQL and PostgreSQL). I'm
> quite certain our migrations would be worse off if we just tested
> against the single database.

Certainly sounds like this testing is far beyond the "might one day be
useful" level Sean talks about.

Mark.



Re: [openstack-dev] Gate proposal - drop Postgresql configurations in the gate

2014-06-12 Thread Dan Prince

On Thu, 2014-06-12 at 09:24 -0700, Joe Gordon wrote:
> 
> On Jun 12, 2014 8:37 AM, "Sean Dague"  wrote:
> >
> > On 06/12/2014 10:38 AM, Mike Bayer wrote:
> > >
> > > On 6/12/14, 8:26 AM, Julien Danjou wrote:
> > >> On Thu, Jun 12 2014, Sean Dague wrote:
> > >>
> > >>> That's not catchable in unit or functional tests?
> > >> Not in an accurate manner, no.
> > >>
> > >>> Keeping jobs alive based on the theory that they might one day
> > >>> be useful is something we just don't have the liberty to do any
> > >>> more. We've not seen an idle node in zuul in 2 days... and we're
> > >>> only at j-1. j-3 will be at least +50% of this load.
> > >> Sure, I'm not saying we don't have a problem. I'm just saying
> > >> it's not a good solution to fix that problem IMHO.
> > >
> > > Just my 2c without having a full understanding of all of
> > > OpenStack's CI environment, Postgresql is definitely different
> > > enough that MySQL "strict mode" could still allow issues to slip
> > > through quite easily, and also as far as capacity issues, this
> > > might be longer term but I'm hoping to get database-related tests
> > > to be lots faster if we can move to a model that spends much less
> > > time creating databases and schemas.
> >
> > This is what I mean by functional testing. If we were directly
> > hitting a real database on a set of in tree project tests, I think
> > you could discover issues like this. Neutron was headed down that
> > path.
> >
> > But if we're talking about a devstack / tempest run, it's not really
> > applicable.
> >
> > If someone can point me to a case where we've actually found this
> > kind of bug with tempest / devstack, that would be great. I've just
> > *never* seen it. I was the one that did most of the fixing for pg
> > support in Nova, and have helped other projects as well, so I'm
> > relatively familiar with the kinds of fails we can discover. The
> > ones that Julien pointed really aren't likely to be exposed in our
> > current system.
> >
> > Which is why I think we're mostly just burning cycles on the
> > existing approach for no gain.
> 
> Given all the points made above, I think dropping PostgreSQL is the
> right choice; if only we had infinite cloud that would be another
> story.
> 
> What about converting one of our existing jobs (grenade partial ncpu,
> large ops, regular grenade, tempest with nova network etc.) into a
> PostgreSQL only job? We could get some level of PostgreSQL testing
> without any additional jobs, although this is a tradeoff, obviously.

I'd be fine with this tradeoff if it allows us to keep PostgreSQL in the
mix.


> 
> >
> > -Sean
> >
> > --
> > Sean Dague
> > http://dague.net
> >




Re: [openstack-dev] Gate proposal - drop Postgresql configurations in the gate

2014-06-12 Thread Matt Riedemann




On 6/12/2014 5:11 PM, Michael Still wrote:

> On Thu, Jun 12, 2014 at 10:06 PM, Sean Dague  wrote:
>> We're definitely deep into capacity issues, so it's going to be time to
>> start making tougher decisions about things we decide aren't different
>> enough to bother testing on every commit.
>
> I think one of the criticisms that could be made about OpenStack at
> the moment is that we're not opinionated enough. We have a lot of bugs
> because we support huge numbers of drivers of varying quality and
> completeness. Do you think it's time for the gate to be an opinionated
> set of tests of how OpenStack can be deployed? Perhaps we should gate
> on only one permutation of a possible OpenStack cloud, and then let
> people who want to propose deviations from that permutation run their
> own CI as third parties.
>
> I'm not particularly advocating this stance, but it is an option and
> I'd like to see it explored a bit more.
>
> Michael



Yeah, I was sort of thinking along the same lines. Does any of the survey
data help here, i.e. what's the percentage of deployments using MySQL vs
PostgreSQL?

Another example is that we want testing for Ceph/RBD, but I don't expect
that to be in the upstream CI/gate. I more or less expect that from some
3rd party CI run by someone who uses it in production and really cares
about its quality and maintenance in the tree.


--

Thanks,

Matt Riedemann



Re: [openstack-dev] Gate proposal - drop Postgresql configurations in the gate

2014-06-12 Thread Dan Prince

On Thu, 2014-06-12 at 08:06 -0400, Sean Dague wrote:
> We're definitely deep into capacity issues, so it's going to be time to
> start making tougher decisions about things we decide aren't different
> enough to bother testing on every commit.

In order to save resources, why not combine some of the jobs in different
ways? So for example, instead of:

 check-tempest-dsvm-full
 check-tempest-dsvm-postgres-full

Couldn't we just drop the postgres-full job and run one of the Neutron
jobs w/ postgres instead? Or something similar: so long as at least one
of the jobs which runs most of Tempest is using PostgreSQL, I think we'd
be mostly fine. Not shooting for 100% coverage for everything with our
limited resource pool is fine; let's just do the best we can.

Ditto for gate jobs (not check).

> 
> Previously we've been testing Postgresql in the gate because it has a
> stricter interpretation of SQL than MySQL. And when we didn't test
> Postgresql it regressed. I know, I chased it for about 4 weeks in grizzly.
> 
> However Monty brought up a good point at Summit, that MySQL has a strict
> mode. That should actually enforce the same strictness.
> 
> My proposal is that we land this change to devstack -
> https://review.openstack.org/#/c/97442/ and backport it to past devstack
> branches.
> 
> Then we drop the pg jobs, as the differences between the 2 configs
> should then be very minimal. All the *actual* failures we've seen
> between the 2 were completely about this strict SQL mode interpretation.

I suppose I would like to see us keep it in the mix. Running SmokeStack
for almost 3 years, I found many an issue dealing w/ PostgreSQL. I ran it
concurrently with many of the other jobs and I too had limited resources
(much less than what we have in infra today).

Would MySQL strict SQL mode catch stuff like this (old bugs, but still
valid for this topic I think):

 https://bugs.launchpad.net/nova/+bug/948066

 https://bugs.launchpad.net/nova/+bug/1003756

Having support for and testing against at least 2 databases helps keep
our SQL queries and migrations cleaner... and is generally a good
practice given we have abstractions which are meant to support this sort
of thing anyway (so by all means let us test them!).

Also, having compacted the Nova migrations 3 times now, I found many
issues by testing on multiple databases (MySQL and PostgreSQL). I'm
quite certain our migrations would be worse off if we just tested
against a single database.

That said I'm all for the focus being on realistic use cases.  If nobody
is using PostgreSQL in production (or has interest in doing so) then
perhaps considering it as a drop candidate is a good idea. If we go this
route though I gotta say it puts some of our abstractions on the
"chopping block" in my opinion. SQLAlchemy, python-migrate, and the like
aren't free in terms of CPU cycles and if all we really need is the one
database then perhaps we really should consider going with some straight
up inline SQL code instead.

Dan

> 
>   -Sean

Re: [openstack-dev] Gate proposal - drop Postgresql configurations in the gate

2014-06-12 Thread Kyle Mestery

On Thu, Jun 12, 2014 at 5:11 PM, Michael Still  wrote:
> On Thu, Jun 12, 2014 at 10:06 PM, Sean Dague  wrote:
>> We're definitely deep into capacity issues, so it's going to be time to
>> start making tougher decisions about things we decide aren't different
>> enough to bother testing on every commit.
>
> I think one of the criticisms that could be made about OpenStack at
> the moment is that we're not opinionated enough. We have a lot of bugs
> because we support huge numbers of drivers of varying quality and
> completeness. Do you think it's time for the gate to be an opinionated
> set of tests of how OpenStack can be deployed? Perhaps we should gate
> on only one permutation of a possible OpenStack cloud, and then let
> people who want to propose deviations from that permutation run their
> own CI as third parties.
>
This is honestly the approach that we've taken with Neutron. We are
currently only testing the ML2 plugin with the Open vSwitch agent. We
don't (to my knowledge) even test the Linuxbridge agent or L2
population driver in ML2 in the gate, nor the other Open Source
drivers and plugins. We rely on 3rd party CI for all of those. It's
not unreasonable for a similar model to be followed for other pieces
of testing.

Kyle

> I'm not particularly advocating this stance, but it is an option and
> I'd like to see it explored a bit more.
>
> Michael
>
> --
> Rackspace Australia

Re: [openstack-dev] Gate proposal - drop Postgresql configurations in the gate

2014-06-12 Thread Michael Still

On Thu, Jun 12, 2014 at 10:06 PM, Sean Dague  wrote:
> We're definitely deep into capacity issues, so it's going to be time to
> start making tougher decisions about things we decide aren't different
> enough to bother testing on every commit.

I think one of the criticisms that could be made about OpenStack at
the moment is that we're not opinionated enough. We have a lot of bugs
because we support huge numbers of drivers of varying quality and
completeness. Do you think it's time for the gate to be an opinionated
set of tests of how OpenStack can be deployed? Perhaps we should gate
on only one permutation of a possible OpenStack cloud, and then let
people who want to propose deviations from that permutation run their
own CI as third parties.

I'm not particularly advocating this stance, but it is an option and
I'd like to see it explored a bit more.

Michael

-- 
Rackspace Australia


Re: [openstack-dev] Gate proposal - drop Postgresql configurations in the gate

2014-06-12 Thread Jay Pipes


On 06/12/2014 12:24 PM, Joe Gordon wrote:


> On Jun 12, 2014 8:37 AM, "Sean Dague" <s...@dague.net> wrote:
> >
> > On 06/12/2014 10:38 AM, Mike Bayer wrote:
> > >
> > > On 6/12/14, 8:26 AM, Julien Danjou wrote:
> > >> On Thu, Jun 12 2014, Sean Dague wrote:
> > >>
> > >>> That's not catchable in unit or functional tests?
> > >> Not in an accurate manner, no.
> > >>
> > >>> Keeping jobs alive based on the theory that they might one day
> > >>> be useful is something we just don't have the liberty to do any
> > >>> more. We've not seen an idle node in zuul in 2 days... and we're
> > >>> only at j-1. j-3 will be at least +50% of this load.
> > >> Sure, I'm not saying we don't have a problem. I'm just saying
> > >> it's not a good solution to fix that problem IMHO.
> > >
> > > Just my 2c without having a full understanding of all of
> > > OpenStack's CI environment, Postgresql is definitely different
> > > enough that MySQL "strict mode" could still allow issues to slip
> > > through quite easily, and also as far as capacity issues, this
> > > might be longer term but I'm hoping to get database-related tests
> > > to be lots faster if we can move to a model that spends much less
> > > time creating databases and schemas.
> >
> > This is what I mean by functional testing. If we were directly
> > hitting a real database on a set of in tree project tests, I think
> > you could discover issues like this. Neutron was headed down that
> > path.
> >
> > But if we're talking about a devstack / tempest run, it's not really
> > applicable.
> >
> > If someone can point me to a case where we've actually found this
> > kind of bug with tempest / devstack, that would be great. I've just
> > *never* seen it. I was the one that did most of the fixing for pg
> > support in Nova, and have helped other projects as well, so I'm
> > relatively familiar with the kinds of fails we can discover. The
> > ones that Julien pointed really aren't likely to be exposed in our
> > current system.
> >
> > Which is why I think we're mostly just burning cycles on the
> > existing approach for no gain.
>
> Given all the points made above, I think dropping PostgreSQL is the
> right choice; if only we had infinite cloud that would be another story.
>
> What about converting one of our existing jobs (grenade partial ncpu,
> large ops, regular grenade, tempest with nova network etc.) into a
> PostgreSQL only job? We could get some level of PostgreSQL testing
> without any additional jobs, although this is a tradeoff, obviously.


I was initially -1 on Sean's proposal. My reasoning echoed some of 
Julien's reasoning and all of Chris Friesen's rationale (and the bug 
report he mentioned was a perfect example of the types of things that 
would not, IMO, be caught by a MySQL strict mode configuration.)


That said, I recognize the resource capacity issues the gate is 
suffering from and I think Joe's suggestion above is actually a really 
good one.


Best,
-jay



Re: [openstack-dev] Gate proposal - drop Postgresql configurations in the gate

2014-06-12 Thread Gordon Chung

> If someone can point me to a case where we've actually found this kind
> of bug with tempest / devstack, that would be great. I've just *never*
> seen it. I was the one that did most of the fixing for pg support in
> Nova, and have helped other projects as well, so I'm relatively familiar
> with the kinds of fails we can discover. The ones that Julien pointed
> really aren't likely to be exposed in our current system.
>
> Which is why I think we're mostly just burning cycles on the existing
> approach for no gain.

not sure if this would get caught in mysql strict mode but we caught some 
differences between mysql/postgres in Ceilometer as well. ie. 
https://bugs.launchpad.net/ceilometer/+bug/1256318

personally, if resources weren't constrained i'd prefer both but out of 
curiosity, what was the reasoning for choosing to continue gating against 
mysql only rather than postgres only? is it known that mysql is the 
typical choice for openstack deployments?

cheers,
gordon chung
openstack, ibm software standards

Re: [openstack-dev] Gate proposal - drop Postgresql configurations in the gate

2014-06-12 Thread Sean Dague

On 06/12/2014 01:22 PM, Tim Bell wrote:
>> -----Original Message-----
>> From: Sean Dague [mailto:s...@dague.net]
>> Sent: 12 June 2014 17:37
>> To: OpenStack Development Mailing List (not for usage questions)
>> Subject: Re: [openstack-dev] Gate proposal - drop Postgresql
>> configurations in the gate
>>
> ...
>> But if we're talking about a devstack / tempest run, it's not really
>> applicable.
>>
>> If someone can point me to a case where we've actually found this kind
>> of bug with tempest / devstack, that would be great. I've just *never*
>> seen it. I was the one that did most of the fixing for pg support in
>> Nova, and have helped other projects as well, so I'm relatively
>> familiar with the kinds of fails we can discover. The ones that Julien
>> pointed really aren't likely to be exposed in our current system.
>>
>> Which is why I think we're mostly just burning cycles on the existing
>> approach for no gain.
>>
>
> In some cases, we've dropped support for drivers in OpenStack since
> they were not tested in the gate, on the grounds that if it is not
> tested, it is probably broken.
>
> From my understanding, this change proposes to drop Postgres testing
> from the default gate. Yet, there does not seem to be a proposal to
> drop Postgres support.
>
> Are these two positions consistent ?
>
> (Just seeking clarification, I fully understand the difficulties
> involved in multiple parallel testing at our scale)

We're hitting a couple of inflection points.

1) We're basically at capacity for the unit work that we can do, which
means it's time to start deciding whether we believe everything we
currently have running is more important than the things we aren't
currently testing.

Everyone wants multinode testing in the gate. It would be impossible to
support that given current resources.

2) We're far past the inflection point of people actually debugging jobs
when they go wrong.

The gate is backed up (currently to 24hrs) because there are bugs in
OpenStack. Those are popping up at a rate much faster than the number of
people who are willing to spend any time on them. And often they are
popping up in configurations that we're not all that familiar with.

Landing a gating job comes with maintenance. Maintenance in looking into
failures, and not just running recheck. So there is an overhead to
testing this many different configurations.

I think #2 is just as important to realize as #1. As such I think we
need to get to the point where there are a relatively small number of
configurations that Infra/QA support, and beyond that every job needs
sponsors. And if a job's success rate or # of uncategorized fails goes
past some threshold, we demote it to non-voting, and if it is non-voting
for > 1 month, it gets demoted to experimental (or some specific
timeline, details to be sorted).
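
To make the demotion rule concrete, here is a rough sketch in Python. The
thresholds and field names are all made up for illustration (the real
numbers and timeline are exactly the details still to be sorted), so treat
it as one possible reading of the policy and not as anything Infra/QA runs.

```python
from dataclasses import dataclass

# Hypothetical thresholds; the thread leaves the real values undecided.
MIN_SUCCESS_RATE = 0.90
MAX_UNCATEGORIZED_FAILS = 20
MAX_DAYS_NON_VOTING = 30


@dataclass
class JobHealth:
    status: str                # "voting", "non-voting", or "experimental"
    success_rate: float        # fraction of recent runs that passed
    uncategorized_fails: int   # recent failures with no bug attached
    days_non_voting: int = 0   # how long the job has been non-voting


def next_status(job: JobHealth) -> str:
    """Return the status the job should have after this review."""
    unhealthy = (job.success_rate < MIN_SUCCESS_RATE
                 or job.uncategorized_fails > MAX_UNCATEGORIZED_FAILS)
    if job.status == "voting" and unhealthy:
        return "non-voting"
    if job.status == "non-voting" and job.days_non_voting > MAX_DAYS_NON_VOTING:
        return "experimental"
    return job.status
```

A sponsored job that stays healthy keeps voting; one that crosses either
threshold drops a level, and only time spent non-voting pushes it down to
experimental.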

-Sean

-- 
Sean Dague
http://dague.net


Re: [openstack-dev] Gate proposal - drop Postgresql configurations in the gate

2014-06-12 Thread Chris Friesen


On 06/12/2014 09:36 AM, Sean Dague wrote:


> This is what I mean by functional testing. If we were directly hitting a
> real database on a set of in tree project tests, I think you could
> discover issues like this. Neutron was headed down that path.
>
> But if we're talking about a devstack / tempest run, it's not really
> applicable.
>
> If someone can point me to a case where we've actually found this kind
> of bug with tempest / devstack, that would be great. I've just *never*
> seen it. I was the one that did most of the fixing for pg support in
> Nova, and have helped other projects as well, so I'm relatively familiar
> with the kinds of fails we can discover. The ones that Julien pointed
> really aren't likely to be exposed in our current system.
>
> Which is why I think we're mostly just burning cycles on the existing
> approach for no gain.


What about

https://bugs.launchpad.net/nova/+bug/1292963

Would this have been caught by strict/traditional mode with mysql?  (Of 
course in this case we didn't actually have tempest testcases for server 
groups yet, not sure if they exist now.)


Also, while we're on the topic of testing databases...I opened a bug a 
while back for the fact that sqlite regexp() doesn't behave like 
mysql/postgres.  Having unit tests that don't behave like a real install 
seems like a bad move.
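
A minimal sketch of how that kind of divergence can arise, assuming only
Python's standard sqlite3 module (this is not the code from the bug, and
the table and helper names are invented). As far as I know SQLite parses
the REGEXP operator but normally ships no implementation of it, so a test
suite has to register its own, and the matching semantics are then
whatever that helper does.

```python
import re
import sqlite3


def regexp(pattern, value):
    """Back SQLite's REGEXP operator with Python's re.search (unanchored)."""
    return value is not None and re.search(pattern, value) is not None


def matching_names(pattern):
    conn = sqlite3.connect(":memory:")
    # SQLite rewrites "X REGEXP Y" as a call to regexp(Y, X), so the
    # pattern arrives as the first argument.
    conn.create_function("REGEXP", 2, regexp)
    conn.execute("CREATE TABLE instances (name TEXT)")
    conn.executemany("INSERT INTO instances VALUES (?)",
                     [("web-1",), ("db-1",), ("web-10",)])
    rows = conn.execute(
        "SELECT name FROM instances WHERE name REGEXP ? ORDER BY name",
        (pattern,)).fetchall()
    conn.close()
    return [name for (name,) in rows]
```

Swapping re.search for re.match in the helper would anchor every pattern
at the start of the string and silently change which rows come back, which
a real MySQL or PostgreSQL server would never do.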


Chris


Re: [openstack-dev] Gate proposal - drop Postgresql configurations in the gate

2014-06-12 Thread Clint Byrum

Excerpts from Matt Riedemann's message of 2014-06-12 08:15:46 -0700:
> 
> On 6/12/2014 9:38 AM, Mike Bayer wrote:
> >
> > On 6/12/14, 8:26 AM, Julien Danjou wrote:
> >> On Thu, Jun 12 2014, Sean Dague wrote:
> >>
> >>> That's not catchable in unit or functional tests?
> >> Not in an accurate manner, no.
> >>
> >>> Keeping jobs alive based on the theory that they might one day be useful
> >>> is something we just don't have the liberty to do any more. We've not
> >>> seen an idle node in zuul in 2 days... and we're only at j-1. j-3 will
> >>> be at least +50% of this load.
> >> Sure, I'm not saying we don't have a problem. I'm just saying it's not a
> >> good solution to fix that problem IMHO.
> >
> > Just my 2c without having a full understanding of all of OpenStack's CI
> > environment, Postgresql is definitely different enough that MySQL
> > "strict mode" could still allow issues to slip through quite easily, and
> > also as far as capacity issues, this might be longer term but I'm hoping
> > to get database-related tests to be lots faster if we can move to a
> > model that spends much less time creating databases and schemas.
> 
> Is there some organization out there that uses PostgreSQL in production 
> that could stand up 3rd party CI with it?
> 
> I know that at least for the DB2 support we're adding across the 
> projects we're doing 3rd party CI for that. Granted it's a proprietary 
> DB unlike PG but if we're talking about spending resources on testing 
> for something that's not widely used, but there is a niche set of users 
> that rely on it, we could/should move that to 3rd party CI.
> 
> I'd much rather see us spend our test resources on getting multi-node 
> testing running in the gate so we can test migrations in Nova.
> 

I think this is really the answer. To paraphrase the wise and well
experienced engineer, Beyoncé:

"If you like it then you shoulda put CI on it."

The project will succumb to a tragedy of the commons if it bends over
backwards for every deployment variation available. But 3rd parties who
care can always contribute resources and (if they play nice...) votes.

I think there are a tiny number of things that will cause corner case
bugs that could creep in, but as Sean says, we haven't actually seen
these.


Re: [openstack-dev] Gate proposal - drop Postgresql configurations in the gate

2014-06-12 Thread Tim Bell

> -----Original Message-----
> From: Sean Dague [mailto:s...@dague.net]
> Sent: 12 June 2014 17:37
> To: OpenStack Development Mailing List (not for usage questions)
> Subject: Re: [openstack-dev] Gate proposal - drop Postgresql configurations in
> the gate
>
...
> But if we're talking about a devstack / tempest run, it's not really
> applicable.
>
> If someone can point me to a case where we've actually found this kind of bug
> with tempest / devstack, that would be great. I've just *never* seen it. I was
> the one that did most of the fixing for pg support in Nova, and have helped
> other projects as well, so I'm relatively familiar with the kinds of fails we
> can discover. The ones that Julien pointed really aren't likely to be exposed
> in our current system.
>
> Which is why I think we're mostly just burning cycles on the existing approach
> for no gain.
> 

In some cases, we've dropped support for drivers in OpenStack since they were
not tested in the gate, on the grounds that if it is not tested, it is probably
broken.

From my understanding, this change proposes to drop Postgres testing from the
default gate. Yet, there does not seem to be a proposal to drop Postgres
support.

Are these two positions consistent?

(Just seeking clarification; I fully understand the difficulties involved in
multiple parallel testing at our scale)

Tim


Re: [openstack-dev] Gate proposal - drop Postgresql configurations in the gate

2014-06-12 Thread Matthew Treinish

On Thu, Jun 12, 2014 at 09:24:15AM -0700, Joe Gordon wrote:
> On Jun 12, 2014 8:37 AM, "Sean Dague"  wrote:
> >
> > On 06/12/2014 10:38 AM, Mike Bayer wrote:
> > >
> > > On 6/12/14, 8:26 AM, Julien Danjou wrote:
> > >> On Thu, Jun 12 2014, Sean Dague wrote:
> > >>
> > >>> That's not catchable in unit or functional tests?
> > >> Not in an accurate manner, no.
> > >>
> > >>> Keeping jobs alive based on the theory that they might one day be
> > >>> useful is something we just don't have the liberty to do any more.
> > >>> We've not seen an idle node in zuul in 2 days... and we're only at
> > >>> j-1. j-3 will be at least +50% of this load.
> > >> Sure, I'm not saying we don't have a problem. I'm just saying it's
> > >> not a good solution to fix that problem IMHO.
> > >
> > > Just my 2c without having a full understanding of all of OpenStack's CI
> > > environment, Postgresql is definitely different enough that MySQL
> > > "strict mode" could still allow issues to slip through quite easily, and
> > > also as far as capacity issues, this might be longer term but I'm hoping
> > > to get database-related tests to be lots faster if we can move to a
> > > model that spends much less time creating databases and schemas.
> >
> > This is what I mean by functional testing. If we were directly hitting a
> > real database on a set of in tree project tests, I think you could
> > discover issues like this. Neutron was headed down that path.
> >
> > But if we're talking about a devstack / tempest run, it's not really
> > applicable.
> >
> > If someone can point me to a case where we've actually found this kind
> > of bug with tempest / devstack, that would be great. I've just *never*
> > seen it. I was the one that did most of the fixing for pg support in
> > Nova, and have helped other projects as well, so I'm relatively familiar
> > with the kinds of fails we can discover. The ones that Julien pointed
> > really aren't likely to be exposed in our current system.
> >
> > Which is why I think we're mostly just burning cycles on the existing
> > approach for no gain.
> 
> Given all the points made above, I think dropping PostgreSQL is the right
> choice; if only we had infinite cloud that would be another story.

++

> 
> What about converting one of our existing jobs (grenade partial ncpu, large
> ops, regular grenade, tempest with nova network etc.) into a PostgreSQL
> only job? We could get some level of PostgreSQL testing without any
> additional jobs, although this is a tradeoff, obviously.
> 

I think that's a reasonable approach. Although, in doing this you'd have to be
careful about asymmetry between what's gating on all the projects. We don't
want to only run postgres on a job that doesn't hit every project. Just thinking
out loud, but maybe it makes sense to switch the integrated-gate's neutron job
over to postgres and then keep the neutron jobs with mysql in the
integrated-gate-neutron template.
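
The asymmetry concern can be checked mechanically. Below is a small
Python sketch; the job names and project lists are invented for the
example and are not the real gate layout, so read it as an illustration
of the check and not a description of our templates.

```python
def projects_missing_backend(job_projects, job_backend, backend):
    """Return projects that are not gated by any job using `backend`."""
    all_projects = set()
    covered = set()
    for job, projects in job_projects.items():
        all_projects.update(projects)
        if job_backend.get(job) == backend:
            covered.update(projects)
    return sorted(all_projects - covered)


# Hypothetical layout: job names and project lists are illustrative only.
JOB_PROJECTS = {
    "tempest-full": ["cinder", "glance", "keystone", "nova", "swift"],
    "tempest-neutron": ["cinder", "glance", "keystone", "neutron", "nova"],
}

# Moving the only neutron job to postgres leaves gaps on both sides.
one_sided = {"tempest-full": "mysql", "tempest-neutron": "postgresql"}
```

With that assignment, projects_missing_backend(JOB_PROJECTS, one_sided,
"postgresql") reports swift and the same call for "mysql" reports
neutron, which is the sort of gap to look for before flipping a job.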

-Matt Treinish



Re: [openstack-dev] Gate proposal - drop Postgresql configurations in the gate

2014-06-12 Thread Monty Taylor

On 06/12/2014 08:36 AM, Sean Dague wrote:
> On 06/12/2014 10:38 AM, Mike Bayer wrote:
>>
>> On 6/12/14, 8:26 AM, Julien Danjou wrote:
>>> On Thu, Jun 12 2014, Sean Dague wrote:
>>>
>>>> That's not catchable in unit or functional tests?
>>> Not in an accurate manner, no.
>>>
>>>> Keeping jobs alive based on the theory that they might one day be useful
>>>> is something we just don't have the liberty to do any more. We've not
>>>> seen an idle node in zuul in 2 days... and we're only at j-1. j-3 will
>>>> be at least +50% of this load.
>>> Sure, I'm not saying we don't have a problem. I'm just saying it's not a
>>> good solution to fix that problem IMHO.
>>
>> Just my 2c without having a full understanding of all of OpenStack's CI
>> environment, Postgresql is definitely different enough that MySQL
>> "strict mode" could still allow issues to slip through quite easily, and
>> also as far as capacity issues, this might be longer term but I'm hoping
>> to get database-related tests to be lots faster if we can move to a
>> model that spends much less time creating databases and schemas.
> 
> This is what I mean by functional testing. If we were directly hitting a
> real database on a set of in tree project tests, I think you could
> discover issues like this. Neutron was headed down that path.

We have MySQL and Postgres available on all of the unittest nodes. So if
someone wrote a functional test to test for Postgres-specific issues
like that, and put the standard trap on it "only run this if you find a
postgres database with an openstackci user" - then we should be able to
catch all of the specific things like this without incurring the cost of
a double run.
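
For anyone who hasn't seen the trap, a minimal sketch of the idea using
Python's unittest follows. It assumes psycopg2 as the driver, and the
host, password and database name are placeholders I made up; only the
"openstackci" user comes from the description above. The projects' real
helpers may look different.

```python
import unittest


def backend_available(connect):
    """Return True if connect() yields a usable connection, else False."""
    try:
        conn = connect()
    except Exception:
        # Driver missing, server down, or credentials rejected.
        return False
    conn.close()
    return True


def connect_postgres():
    # Assumption: psycopg2 is the driver, and the host, password and
    # database name below are placeholders, not real CI settings.
    import psycopg2
    return psycopg2.connect(host="localhost", user="openstackci",
                            password="placeholder", dbname="placeholder",
                            connect_timeout=2)


class PostgresOnlyTest(unittest.TestCase):
    def setUp(self):
        # The trap: skip quietly when no suitable database is present.
        if not backend_available(connect_postgres):
            self.skipTest("no PostgreSQL database with the CI user found")

    def test_select_one(self):
        conn = connect_postgres()
        try:
            cur = conn.cursor()
            cur.execute("SELECT 1")
            self.assertEqual(cur.fetchone(), (1,))
        finally:
            conn.close()
```

On a developer laptop without Postgres the test is reported as skipped;
on a node that has the database it runs as part of the normal unit test
job, so no second devstack run is needed.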

So, in general, +1 from me.

> But if we're talking about a devstack / tempest run, it's not really
> applicable.
> 
> If someone can point me to a case where we've actually found this kind
> of bug with tempest / devstack, that would be great. I've just *never*
> seen it. I was the one that did most of the fixing for pg support in
> Nova, and have helped other projects as well, so I'm relatively familiar
> with the kinds of fails we can discover. The ones that Julien pointed
> really aren't likely to be exposed in our current system.
> 
> Which is why I think we're mostly just burning cycles on the existing
> approach for no gain.
> 
>   -Sean

Re: [openstack-dev] Gate proposal - drop Postgresql configurations in the gate

2014-06-12 Thread Joe Gordon

On Jun 12, 2014 8:37 AM, "Sean Dague"  wrote:
>
> On 06/12/2014 10:38 AM, Mike Bayer wrote:
> >
> > On 6/12/14, 8:26 AM, Julien Danjou wrote:
> >> On Thu, Jun 12 2014, Sean Dague wrote:
> >>
> >>> That's not catchable in unit or functional tests?
> >> Not in an accurate manner, no.
> >>
> >>> Keeping jobs alive based on the theory that they might one day be useful
> >>> is something we just don't have the liberty to do any more. We've not
> >>> seen an idle node in zuul in 2 days... and we're only at j-1. j-3 will
> >>> be at least +50% of this load.
> >> Sure, I'm not saying we don't have a problem. I'm just saying it's not a
> >> good solution to fix that problem IMHO.
> >
> > Just my 2c without having a full understanding of all of OpenStack's CI
> > environment, Postgresql is definitely different enough that MySQL
> > "strict mode" could still allow issues to slip through quite easily, and
> > also as far as capacity issues, this might be longer term but I'm hoping
> > to get database-related tests to be lots faster if we can move to a
> > model that spends much less time creating databases and schemas.
>
> This is what I mean by functional testing. If we were directly hitting a
> real database on a set of in tree project tests, I think you could
> discover issues like this. Neutron was headed down that path.
>
> But if we're talking about a devstack / tempest run, it's not really
> applicable.
>
> If someone can point me to a case where we've actually found this kind
> of bug with tempest / devstack, that would be great. I've just *never*
> seen it. I was the one that did most of the fixing for pg support in
> Nova, and have helped other projects as well, so I'm relatively familiar
> with the kinds of fails we can discover. The ones that Julien pointed
> really aren't likely to be exposed in our current system.
>
> Which is why I think we're mostly just burning cycles on the existing
> approach for no gain.

Given all the points made above, I think dropping PostgreSQL is the right
choice; if only we had infinite cloud that would be another story.

What about converting one of our existing jobs (grenade partial ncpu, large
ops, regular grenade, tempest with nova network, etc.) into a PostgreSQL-only
job? We could get some level of PostgreSQL testing without any
additional jobs, although this is a tradeoff, obviously.

>
> -Sean
>
> --
> Sean Dague
> http://dague.net

Re: [openstack-dev] Gate proposal - drop Postgresql configurations in the gate

2014-06-12 Thread Sean Dague

On 06/12/2014 10:38 AM, Mike Bayer wrote:
> 
> On 6/12/14, 8:26 AM, Julien Danjou wrote:
>> On Thu, Jun 12 2014, Sean Dague wrote:
>>
>>> That's not catchable in unit or functional tests?
>> Not in an accurate manner, no.
>>
>>> Keeping jobs alive based on the theory that they might one day be useful
>>> is something we just don't have the liberty to do any more. We've not
>>> seen an idle node in zuul in 2 days... and we're only at j-1. j-3 will
>>> be at least +50% of this load.
>> Sure, I'm not saying we don't have a problem. I'm just saying it's not a
>> good solution to fix that problem IMHO.
> 
> Just my 2c without having a full understanding of all of OpenStack's CI
> environment, Postgresql is definitely different enough that MySQL
> "strict mode" could still allow issues to slip through quite easily, and
> also as far as capacity issues, this might be longer term but I'm hoping
> to get database-related tests to be lots faster if we can move to a
> model that spends much less time creating databases and schemas.

This is what I mean by functional testing. If we were directly hitting a
real database on a set of in tree project tests, I think you could
discover issues like this. Neutron was headed down that path.

But if we're talking about a devstack / tempest run, it's not really
applicable.

If someone can point me to a case where we've actually found this kind
of bug with tempest / devstack, that would be great. I've just *never*
seen it. I was the one that did most of the fixing for pg support in
Nova, and have helped other projects as well, so I'm relatively familiar
with the kinds of failures we can discover. The ones that Julien pointed
out really aren't likely to be exposed in our current system.

Which is why I think we're mostly just burning cycles on the existing
approach for no gain.

-Sean

-- 
Sean Dague
http://dague.net


Re: [openstack-dev] Gate proposal - drop Postgresql configurations in the gate

2014-06-12 Thread Matt Riedemann




On 6/12/2014 9:38 AM, Mike Bayer wrote:
>
> On 6/12/14, 8:26 AM, Julien Danjou wrote:
>> On Thu, Jun 12 2014, Sean Dague wrote:
>>
>>> That's not catchable in unit or functional tests?
>> Not in an accurate manner, no.
>>
>>> Keeping jobs alive based on the theory that they might one day be useful
>>> is something we just don't have the liberty to do any more. We've not
>>> seen an idle node in zuul in 2 days... and we're only at j-1. j-3 will
>>> be at least +50% of this load.
>> Sure, I'm not saying we don't have a problem. I'm just saying it's not a
>> good solution to fix that problem IMHO.
>
> Just my 2c without having a full understanding of all of OpenStack's CI
> environment, Postgresql is definitely different enough that MySQL
> "strict mode" could still allow issues to slip through quite easily, and
> also as far as capacity issues, this might be longer term but I'm hoping
> to get database-related tests to be lots faster if we can move to a
> model that spends much less time creating databases and schemas.



Is there some organization out there that uses PostgreSQL in production 
that could stand up 3rd party CI with it?


I know that at least for the DB2 support we're adding across the 
projects we're doing 3rd party CI for that. Granted it's a proprietary 
DB unlike PG but if we're talking about spending resources on testing 
for something that's not widely used, but there is a niche set of users 
that rely on it, we could/should move that to 3rd party CI.


I'd much rather see us spend our test resources on getting multi-node 
testing running in the gate so we can test migrations in Nova.


--

Thanks,

Matt Riedemann



Re: [openstack-dev] Gate proposal - drop Postgresql configurations in the gate

2014-06-12 Thread Mike Bayer

On 6/12/14, 8:26 AM, Julien Danjou wrote:
> On Thu, Jun 12 2014, Sean Dague wrote:
>
>> That's not catchable in unit or functional tests?
> Not in an accurate manner, no.
>
>> Keeping jobs alive based on the theory that they might one day be useful
>> is something we just don't have the liberty to do any more. We've not
>> seen an idle node in zuul in 2 days... and we're only at j-1. j-3 will
>> be at least +50% of this load.
> Sure, I'm not saying we don't have a problem. I'm just saying it's not a
> good solution to fix that problem IMHO.

Just my 2c without having a full understanding of all of OpenStack's CI
environment, Postgresql is definitely different enough that MySQL
"strict mode" could still allow issues to slip through quite easily, and
also as far as capacity issues, this might be longer term but I'm hoping
to get database-related tests to be lots faster if we can move to a
model that spends much less time creating databases and schemas.


Re: [openstack-dev] Gate proposal - drop Postgresql configurations in the gate

2014-06-12 Thread Julien Danjou

On Thu, Jun 12 2014, Sean Dague wrote:

> That's not catchable in unit or functional tests?

Not in an accurate manner, no.

> Keeping jobs alive based on the theory that they might one day be useful
> is something we just don't have the liberty to do any more. We've not
> seen an idle node in zuul in 2 days... and we're only at j-1. j-3 will
> be at least +50% of this load.

Sure, I'm not saying we don't have a problem. I'm just saying it's not a
good solution to fix that problem IMHO.

-- 
Julien Danjou
-- Free Software hacker
-- http://julien.danjou.info



Re: [openstack-dev] Gate proposal - drop Postgresql configurations in the gate

2014-06-12 Thread Sean Dague

On 06/12/2014 08:15 AM, Julien Danjou wrote:
> On Thu, Jun 12 2014, Sean Dague wrote:
> 
>> However Monty brought up a good point at Summit, that MySQL has a strict
>> mode. That should actually enforce the same strictness.
> 
> I would vote -1 on that, simply because using PostgreSQL should be about more
> than just strict SQL.
> 
> For example, in Ceilometer and Gnocchi we have custom SQL types that are
> implemented with different data types depending on the SQL engine that's
> being used. PostgreSQL offers better and more optimized data types in
> certain cases (timestamp or UUID, off the top of my head). Not gating
> against PostgreSQL would potentially introduce bugs in that support for
> us.
> 
> Oh sure, I can easily imagine that it's not the case currently in many
> other OpenStack projects. But that IMHO would be a terrible move towards
> leveling down the SQL usage in OpenStack, which is already pretty low
> IMHO.

That's not catchable in unit or functional tests?

My experience is that it's *really* hard to tickle stuff like that from
Tempest in any meaningful way that's not catchable at lower levels.
Especially as we're going through SQLA for all this access in the first
place.

Keeping jobs alive based on the theory that they might one day be useful
is something we just don't have the liberty to do any more. We've not
seen an idle node in zuul in 2 days... and we're only at j-1. j-3 will
be at least +50% of this load.

-Sean

-- 
Sean Dague
http://dague.net


Re: [openstack-dev] Gate proposal - drop Postgresql configurations in the gate

2014-06-12 Thread Julien Danjou

On Thu, Jun 12 2014, Sean Dague wrote:

> However Monty brought up a good point at Summit, that MySQL has a strict
> mode. That should actually enforce the same strictness.

I would vote -1 on that, simply because using PostgreSQL should be about more
than just strict SQL.

For example, in Ceilometer and Gnocchi we have custom SQL types that are
implemented with different data types depending on the SQL engine that's
being used. PostgreSQL offers better and more optimized data types in
certain cases (timestamp or UUID, off the top of my head). Not gating
against PostgreSQL would potentially introduce bugs in that support for
us.
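
To make the concern concrete, here is a rough sketch of engine-dependent type selection. It is not taken from Ceilometer or Gnocchi; the mapping tables and the helper names `column_type` and `bind_uuid` are hypothetical, and the fallback types are just one plausible choice:

```python
import uuid

# Hypothetical per-engine storage choices; not Ceilometer's or Gnocchi's code.
NATIVE_TYPES = {
    "postgresql": {"uuid": "UUID", "timestamp": "TIMESTAMP WITH TIME ZONE"},
}
# Assumed portable fallbacks for engines without a better native type.
FALLBACK_TYPES = {"uuid": "CHAR(32)", "timestamp": "DECIMAL(20, 6)"}


def column_type(logical_type, engine):
    """Pick the DDL type used for a logical type on a given engine."""
    native = NATIVE_TYPES.get(engine, {})
    return native.get(logical_type, FALLBACK_TYPES[logical_type])


def bind_uuid(value, engine):
    """Convert a uuid.UUID into the form stored on the given engine."""
    if engine == "postgresql":
        return str(value)  # canonical 36-character text for the native type
    return value.hex       # 32 hex characters for the CHAR(32) fallback
```

The `postgresql` branch in each function is a separate code path, and it only runs when the tests actually target PostgreSQL. A MySQL-only gate, strict mode or not, exercises just the fallback branch.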

Oh sure, I can easily imagine that it's not the case currently in many
other OpenStack projects. But that IMHO would be a terrible move towards
leveling down the SQL usage in OpenStack, which is already pretty low
IMHO.

My 2c,
-- 
Julien Danjou
# Free Software hacker
# http://julien.danjou.info


[openstack-dev] Gate proposal - drop Postgresql configurations in the gate

2014-06-12 Thread Sean Dague

We're definitely deep into capacity issues, so it's going to be time to
start making tougher decisions about things we decide aren't different
enough to bother testing on every commit.

Previously we've been testing Postgresql in the gate because it has a
stricter interpretation of SQL than MySQL. And when we didn't test
Postgresql it regressed. I know, I chased it for about 4 weeks in grizzly.

However Monty brought up a good point at Summit, that MySQL has a strict
mode. That should actually enforce the same strictness.

My proposal is that we land this change to devstack -
https://review.openstack.org/#/c/97442/ and backport it to past devstack
branches.

Then we drop the pg jobs, as the differences between the 2 configs
should then be very minimal. All the *actual* failures we've seen
between the 2 were completely about this strict SQL mode interpretation.
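
As an illustration of the kind of difference strict mode closes, here is a pure-Python simulation (no database involved) of one well-known case: an over-long string written to a VARCHAR column. Non-strict MySQL truncates the value and issues a warning, while strict mode and PostgreSQL reject it. The function and exception names are hypothetical:

```python
class DataTooLong(Exception):
    """Raised when a value does not fit its column on a strict engine."""


def store_varchar(value, max_len, strict):
    """Simulate writing `value` into a VARCHAR(max_len) column.

    Returns (stored_value, warning). With strict=False an over-long value
    is truncated and a warning string is returned; with strict=True the
    write is refused, as under MySQL strict mode or PostgreSQL.
    """
    if len(value) <= max_len:
        return value, None
    if strict:
        raise DataTooLong("value too long for VARCHAR(%d)" % max_len)
    return value[:max_len], "data truncated"
```

Code that silently depends on the truncating behaviour passes on a non-strict MySQL and fails on PostgreSQL; with strict mode enabled the same failure shows up on MySQL as well.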

-Sean

-- 
Sean Dague
http://dague.net


