Re: [openstack-dev] [neutron][oslo.db] db retry madness

2017-09-10 Thread Michel Peterson
Thanks for this most interesting and informative exchange.

On Fri, Sep 8, 2017 at 5:20 PM, Michael Bayer  wrote:
> On Fri, Sep 8, 2017 at 8:53 AM, Kevin Benton  wrote:
> > Since the goal of that patch is to deal with deadlocks, the retry can't
> > happen down at that level, it will need to be somewhere further up the call
> > stack since the deadlock will put the session in a rollback state.
>
>
> OK when I was talking to Michel about this, I first mentioned that to
> get retry within this in-a-transaction block would require adding a
> savepoint to the mix, but he seemed to indicate that the method can
> also run by itself, but since there's no session.begin() there then
> that matches with my initial impression that the retry has to be at
> the level which the transaction starts, and that's not here.


I think there is also a deal of madness on how things are implemented
in this particular project. I'll review the way it's handled and
adjust accordingly, and then implement the right retries in the
correct places. I think my strategy will be to first start with using
a decorator based on wrap_db_retry to fix the bug. After that I'll
start, in another set of patches, the migration from the "legacy"
pattern to the "new" enginefacade. What do you think? Or given that
enginefacade already diminishes the possibilities of deadlocks I
should already start with the migration to enginefacade and then with
that already in place

> > For the functions being called outside of an active session, they should
> > switch to accepting a context to handle the enginefacade switch.

ACK.

> >>  3a.  Is this a temporary measure to work around legacy patterns?
> >
> > Yes. When that went in we had very little adoption of the enginefacade. Even
> > now we haven't fully completed the transition so checking for is_active
> > covers the old code paths.
>
> OK, so this is where they are still calling session.begin()
> themselves.Are all these session.begin() calls in projects under
> the "openstack" umbrella so I can use codesearch ?

So, to make sure I understand correctly. With enginefacade instead of
calling session.begin() to handle the transactions, when using
'reader' o 'writer' we are already inside a transaction, right? If
that's correct, no matter if decorating or using a context manager, in
both cases the transaction lives for the context of the decorated
function or the with-stmt context, right?

> > As we continue migrating to the enginefacade, the cases where you can get a
> > reference to a session that actually has 'is_active==False' are going to
> > keep dwindling. Once we complete the switch and remove the legacy session
> > constructor, the decorator will just be adjusted to check if there is a
> > session attached to the context to determine if retries are safe since
> > is_active will always be True.

Therefore, on implementation of enginefacade retry_if_session_inactive
is rendered useless and because of the deepcopy retry_db_errors in our
usecase is already useless. As a result we are better off with using
wrap_db_retry to create a decorator parametrised according to our
needs (ie: max retries, exception checker, etc), right? And for
exception checker I should use the is_retriable from neutron-lib
instead of the neutron db api, right? (it's a shame that MAX_RETRIES
is not also on neutron-lib and only in neutron)

Thanks to you both.

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [neutron][oslo.db] db retry madness

2017-09-08 Thread Michael Bayer
On Fri, Sep 8, 2017 at 8:53 AM, Kevin Benton  wrote:
> Thanks for asking this. It's good to get this documented on the mailing list
> since most of it is spread across bug reports and commits over the last
> couple of years.
>
>>1. since the retry_if_session_inactive decorator assumes functions that
>> receive a "context", what does it mean that networking_odl uses functions
>> that receive a "session" and not a "context" ?   Should networking_odl be
>> modified such that it accepts the same "context" that Neutron passes around?
>>
>>2. Alternatively, should the retry_if_session_inactive decorator be
>> modified (or a new decorator added) that provides the identical behavior for
>> functions that receive a "session" argument rather than a "context" argument
>> ?  (or should this be done anyway even if networking_odl's pattern is
>> legacy?)
>
> So it really depends on how those particular functions are being called. For
> example, it seems like this function in the review is always going to be
> called within an active session:
> https://review.openstack.org/#/c/500584/3/networking_odl/db/db.py@113  If
> that's the case then the decorator is never going to trigger a retry since
> is_active will be True.

>
> Since the goal of that patch is to deal with deadlocks, the retry can't
> happen down at that level, it will need to be somewhere further up the call
> stack since the deadlock will put the session in a rollback state.


OK when I was talking to Michel about this, I first mentioned that to
get retry within this in-a-transaction block would require adding a
savepoint to the mix, but he seemed to indicate that the method can
also run by itself, but since there's no session.begin() there then
that matches with my initial impression that the retry has to be at
the level which the transaction starts, and that's not here.



>
> For the functions being called outside of an active session, they should
> switch to accepting a context to handle the enginefacade switch.
>
>>  3a.  Is this a temporary measure to work around legacy patterns?
>
> Yes. When that went in we had very little adoption of the enginefacade. Even
> now we haven't fully completed the transition so checking for is_active
> covers the old code paths.


OK, so this is where they are still calling session.begin()
themselves.Are all these session.begin() calls in projects under
the "openstack" umbrella so I can use codesearch ?



>
> As we continue migrating to the enginefacade, the cases where you can get a
> reference to a session that actually has 'is_active==False' are going to
> keep dwindling. Once we complete the switch and remove the legacy session
> constructor, the decorator will just be adjusted to check if there is a
> session attached to the context to determine if retries are safe since
> is_active will always be True.

neat

>
>> 3b.  If 3a., is the pattern in use by networking_odl, receiving a
>> "session" and not "context", the specific legacy pattern in question?
>
> If those functions are called outside of an active session, then yes. Based
> on the session.begin() call without substransactions=True, I'm guessing
> that's the case for at least some of them:
> https://review.openstack.org/#/c/500584/3/networking_odl/db/db.py@85

Is there guidance written down somewhere I can point plugin teams towards ?


>
>>3c.  if 3a and 3b, are we expecting all the various plugins to start using
>> "context" at some point ?  (and when they come to me with breakage, I can
>> push them in that direction?)
>
> Yeah, for any functions that expect to start their own transaction they
> should altered to accept contexts and use the reader/writer context
> managers. If it's a function that is called from within an ongoing
> transaction, it can obviously still accept a session argument but the
> 'retry_if_session_inactive' decorator would obviously be irrelevant in that
> scenario since the session would always be active.
>
>> 4a.  Is it strictly that simple mutable lists and dicts of simple types
>> (ints, strings, etc.) are preserved across retries, assuming the receiving
>> functions might be modifying them?
>
> Yes, this was to deal with functions that would mutate collections passed
> in. In particular, we have functions that alter the parsed API request as
> they process them. If they hit a deadlock on commit and we retry at that
> point the API request would be mangled without the copy. (e.g.
> https://bugs.launchpad.net/neutron/+bug/1584920)
>
>>4b. Does this deepcopy() approach have any support for functions that are
>> being passed not just simple structures but ORM-mapped objects?   Was this
>> use case considered?
>
> No, it doesn't support ORM objects. Given that the target use case of the
> retry decorator was for functions not in an ongoing session, we didn't have
> a pattern to support in the main tree where an ORM object would be passed as
> an argument to the function that needed to be protected.

right, 

Re: [openstack-dev] [neutron][oslo.db] db retry madness

2017-09-08 Thread Kevin Benton
Thanks for asking this. It's good to get this documented on the mailing
list since most of it is spread across bug reports and commits over the
last couple of years.

>1. since the retry_if_session_inactive decorator assumes functions that
receive a "context", what does it mean that networking_odl uses functions
that receive a "session" and not a "context" ?   Should networking_odl be
modified such that it accepts the same "context" that Neutron passes around?
>
>2. Alternatively, should the retry_if_session_inactive decorator be modified
(or a new decorator added) that provides the identical behavior for
functions that receive a "session" argument rather than a "context"
argument ?  (or should this be done anyway even if networking_odl's pattern
is legacy?)

So it really depends on how those particular functions are being called. For
example, it seems like this function in the review is always going to be
called within an active session:
https://review.openstack.org/#/c/500584/3/networking_odl/db/db.py@113  If
that's the case then the decorator is never going to trigger a retry since
is_active will be True.

Since the goal of that patch is to deal with deadlocks, the retry can't
happen down at that level, it will need to be somewhere further up the call
stack since the deadlock will put the session in a rollback state.

For the functions being called outside of an active session, they should
switch to accepting a context to handle the enginefacade switch.

>  3a.  Is this a temporary measure to work around legacy patterns?

Yes. When that went in we had very little adoption of the enginefacade.
Even now we haven't fully completed the transition so checking for
is_active covers the old code paths.

As we continue migrating to the enginefacade, the cases where you can get a
reference to a session that actually has 'is_active==False' are going to
keep dwindling. Once we complete the switch and remove the legacy session
constructor, the decorator will just be adjusted to check if there is a
session attached to the context to determine if retries are safe since
is_active will always be True.

> 3b.  If 3a., is the pattern in use by networking_odl, receiving a "session"
and not "context", the specific legacy pattern in question?

If those functions are called outside of an active session, then yes. Based
on the session.begin() call without substransactions=True, I'm guessing
that's the case for at least some of them:
https://review.openstack.org/#/c/500584/3/networking_odl/db/db.py@85

>3c.  if 3a and 3b, are we expecting all the various plugins to start using
"context" at some point ?  (and when they come to me with breakage, I can
push them in that direction?)

Yeah, for any functions that expect to start their own transaction they
should altered to accept contexts and use the reader/writer context
managers. If it's a function that is called from within an ongoing
transaction, it can obviously still accept a session argument but the
'retry_if_session_inactive' decorator would obviously be irrelevant in that
scenario since the session would always be active.

> 4a.  Is it strictly that simple mutable lists and dicts of simple types
(ints, strings, etc.) are preserved across retries, assuming the receiving
functions might be modifying them?

Yes, this was to deal with functions that would mutate collections passed
in. In particular, we have functions that alter the parsed API request as
they process them. If they hit a deadlock on commit and we retry at that
point the API request would be mangled without the copy. (e.g.
https://bugs.launchpad.net/neutron/+bug/1584920)

>4b. Does this deepcopy() approach have any support for functions that are
being passed not just simple structures but ORM-mapped objects?   Was this
use case considered?

No, it doesn't support ORM objects. Given that the target use case of the
retry decorator was for functions not in an ongoing session, we didn't have
a pattern to support in the main tree where an ORM object would be passed
as an argument to the function that needed to be protected.

>4c. Does Neutron's model base have any support for deepcopy()?  I notice
that the base classes in model_base do **not** seem to have a __deepcopy__
present.  This would mean that this deepcopy() will definitely break any
ORM mapped objects because you need to use an approach like what Nova uses
for copy(): https://github.com/openstack/nova/blob/master/nova/db/
sqlalchemy/models.py#L46 . This would be why the developer's objects are
all getting kicked
out of the session.   (this is the "is this a bug?" part)

That's definitely why the objects are being kicked out of the session. I'm
not sure if it makes sense to attempt to support copying an ORM object when
the decorator is meant for functions starting a completely new session. If
there is a use case I'm missing that we do want to support, it probably
makes more sense to use a different decorator than to adjust the existing
one.

>4d. How frequent 

[openstack-dev] [neutron][oslo.db] db retry madness

2017-09-07 Thread Michael Bayer
I'm trying to get a handle on neutron's retry decorators.I'm
assisting a developer in the networking_odl project who wishes to make
use of neutron's retry_if_session_inactive decorator.

The developer has run into several problems, including that
networking_odl methods here seem to accept a "session" directly,
rather than the "context" as is standard throughout neutron, and
retry_if_session_inactive assumes the context calling pattern.
Additionally, the retry_db_errors() decorator which is also used by
retry_if_session_inactive is doing a deepcopy operation on arguments,
which when applied to ORM-mapped entities, makes a copy of them which
are then detached from the Session and they fail to function beyond
that.

For context, the developer's patch is at:
https://review.openstack.org/#/c/500584/3

and the neutron features I'm looking at can be seen when they were
reviewed at: https://review.openstack.org/#/c/356530/22


So I have a bunch of questions, many are asking the same things: "is
networking_odl's pattern legacy or not?" and "is this retry decorator
buggy?"   There's overlap in these questions.  But "yes" / "no" to
each would still help me figure out that this is a giraffe and not a
duck :) ("does it have a long neck?" "does it quack?" etc)

1. since the retry_if_session_inactive decorator assumes functions
that receive a "context", what does it mean that networking_odl uses
functions that receive a "session" and not a "context" ?   Should
networking_odl be modified such that it accepts the same "context"
that Neutron passes around?

2. Alternatively, should the retry_if_session_inactive decorator be
modified (or a new decorator added) that provides the identical
behavior for functions that receive a "session" argument rather than a
"context" argument ?  (or should this be done anyway even if
networking_odl's pattern is legacy?)

3. I notice that the session.is_active flag is considered to be
meaningful.   Normally, this flag is of no use, as the enginefacade
pattern only provides for a Session that has had the begin() method
called.  However, it looks like in neutron_lib/context.py, the Context
class is providing for a default session
"db_api.get_writer_session()", which I would assume is how we get a
Session with is_active is false:

3a.  Is this a temporary measure to work around legacy patterns?
3b.  If 3a., is the pattern in use by networking_odl, receiving a
"session" and not "context", the specific legacy pattern in question?
3c.  if 3a and 3b, are we expecting all the various plugins to
start using "context" at some point ?  (and when they come to me with
breakage, I can push them in that direction?)

4. What is the bug we are fixing by using deepcopy() with the
arguments passed to the decorated function:

4a.  Is it strictly that simple mutable lists and dicts of simple
types (ints, strings, etc.) are preserved across retries, assuming the
receiving functions might be modifying them?

4b. Does this deepcopy() approach have any support for functions
that are being passed not just simple structures but ORM-mapped
objects?   Was this use case considered?

4c. Does Neutron's model base have any support for deepcopy()?  I
notice that the base classes in model_base do **not** seem to have a
__deepcopy__ present.  This would mean that this deepcopy() will
definitely break any ORM mapped objects because you need to use an
approach like what Nova uses for copy():
https://github.com/openstack/nova/blob/master/nova/db/sqlalchemy/models.py#L46
.  This would be why the developer's objects are all getting kicked
out of the session.   (this is the "is this a bug?" part)

4d. How frequent is the problem of mutable lists and dicts, and
should this complex and actually pretty expensive behavior be
optional, only for those functions that are known to receieve mutable
structures and change them in place?

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev