Re: [openstack-dev] [neutron][oslo.db] db retry madness
Thanks for this most interesting and informative exchange. On Fri, Sep 8, 2017 at 5:20 PM, Michael Bayerwrote: > On Fri, Sep 8, 2017 at 8:53 AM, Kevin Benton wrote: > > Since the goal of that patch is to deal with deadlocks, the retry can't > > happen down at that level, it will need to be somewhere further up the call > > stack since the deadlock will put the session in a rollback state. > > > OK when I was talking to Michel about this, I first mentioned that to > get retry within this in-a-transaction block would require adding a > savepoint to the mix, but he seemed to indicate that the method can > also run by itself, but since there's no session.begin() there then > that matches with my initial impression that the retry has to be at > the level which the transaction starts, and that's not here. I think there is also a deal of madness on how things are implemented in this particular project. I'll review the way it's handled and adjust accordingly, and then implement the right retries in the correct places. I think my strategy will be to first start with using a decorator based on wrap_db_retry to fix the bug. After that I'll start, in another set of patches, the migration from the "legacy" pattern to the "new" enginefacade. What do you think? Or given that enginefacade already diminishes the possibilities of deadlocks I should already start with the migration to enginefacade and then with that already in place > > For the functions being called outside of an active session, they should > > switch to accepting a context to handle the enginefacade switch. ACK. > >> 3a. Is this a temporary measure to work around legacy patterns? > > > > Yes. When that went in we had very little adoption of the enginefacade. Even > > now we haven't fully completed the transition so checking for is_active > > covers the old code paths. > > OK, so this is where they are still calling session.begin() > themselves.Are all these session.begin() calls in projects under > the "openstack" umbrella so I can use codesearch ? So, to make sure I understand correctly. With enginefacade instead of calling session.begin() to handle the transactions, when using 'reader' o 'writer' we are already inside a transaction, right? If that's correct, no matter if decorating or using a context manager, in both cases the transaction lives for the context of the decorated function or the with-stmt context, right? > > As we continue migrating to the enginefacade, the cases where you can get a > > reference to a session that actually has 'is_active==False' are going to > > keep dwindling. Once we complete the switch and remove the legacy session > > constructor, the decorator will just be adjusted to check if there is a > > session attached to the context to determine if retries are safe since > > is_active will always be True. Therefore, on implementation of enginefacade retry_if_session_inactive is rendered useless and because of the deepcopy retry_db_errors in our usecase is already useless. As a result we are better off with using wrap_db_retry to create a decorator parametrised according to our needs (ie: max retries, exception checker, etc), right? And for exception checker I should use the is_retriable from neutron-lib instead of the neutron db api, right? (it's a shame that MAX_RETRIES is not also on neutron-lib and only in neutron) Thanks to you both. __ OpenStack Development Mailing List (not for usage questions) Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
Re: [openstack-dev] [neutron][oslo.db] db retry madness
On Fri, Sep 8, 2017 at 8:53 AM, Kevin Bentonwrote: > Thanks for asking this. It's good to get this documented on the mailing list > since most of it is spread across bug reports and commits over the last > couple of years. > >>1. since the retry_if_session_inactive decorator assumes functions that >> receive a "context", what does it mean that networking_odl uses functions >> that receive a "session" and not a "context" ? Should networking_odl be >> modified such that it accepts the same "context" that Neutron passes around? >> >>2. Alternatively, should the retry_if_session_inactive decorator be >> modified (or a new decorator added) that provides the identical behavior for >> functions that receive a "session" argument rather than a "context" argument >> ? (or should this be done anyway even if networking_odl's pattern is >> legacy?) > > So it really depends on how those particular functions are being called. For > example, it seems like this function in the review is always going to be > called within an active session: > https://review.openstack.org/#/c/500584/3/networking_odl/db/db.py@113 If > that's the case then the decorator is never going to trigger a retry since > is_active will be True. > > Since the goal of that patch is to deal with deadlocks, the retry can't > happen down at that level, it will need to be somewhere further up the call > stack since the deadlock will put the session in a rollback state. OK when I was talking to Michel about this, I first mentioned that to get retry within this in-a-transaction block would require adding a savepoint to the mix, but he seemed to indicate that the method can also run by itself, but since there's no session.begin() there then that matches with my initial impression that the retry has to be at the level which the transaction starts, and that's not here. > > For the functions being called outside of an active session, they should > switch to accepting a context to handle the enginefacade switch. > >> 3a. Is this a temporary measure to work around legacy patterns? > > Yes. When that went in we had very little adoption of the enginefacade. Even > now we haven't fully completed the transition so checking for is_active > covers the old code paths. OK, so this is where they are still calling session.begin() themselves.Are all these session.begin() calls in projects under the "openstack" umbrella so I can use codesearch ? > > As we continue migrating to the enginefacade, the cases where you can get a > reference to a session that actually has 'is_active==False' are going to > keep dwindling. Once we complete the switch and remove the legacy session > constructor, the decorator will just be adjusted to check if there is a > session attached to the context to determine if retries are safe since > is_active will always be True. neat > >> 3b. If 3a., is the pattern in use by networking_odl, receiving a >> "session" and not "context", the specific legacy pattern in question? > > If those functions are called outside of an active session, then yes. Based > on the session.begin() call without substransactions=True, I'm guessing > that's the case for at least some of them: > https://review.openstack.org/#/c/500584/3/networking_odl/db/db.py@85 Is there guidance written down somewhere I can point plugin teams towards ? > >>3c. if 3a and 3b, are we expecting all the various plugins to start using >> "context" at some point ? (and when they come to me with breakage, I can >> push them in that direction?) > > Yeah, for any functions that expect to start their own transaction they > should altered to accept contexts and use the reader/writer context > managers. If it's a function that is called from within an ongoing > transaction, it can obviously still accept a session argument but the > 'retry_if_session_inactive' decorator would obviously be irrelevant in that > scenario since the session would always be active. > >> 4a. Is it strictly that simple mutable lists and dicts of simple types >> (ints, strings, etc.) are preserved across retries, assuming the receiving >> functions might be modifying them? > > Yes, this was to deal with functions that would mutate collections passed > in. In particular, we have functions that alter the parsed API request as > they process them. If they hit a deadlock on commit and we retry at that > point the API request would be mangled without the copy. (e.g. > https://bugs.launchpad.net/neutron/+bug/1584920) > >>4b. Does this deepcopy() approach have any support for functions that are >> being passed not just simple structures but ORM-mapped objects? Was this >> use case considered? > > No, it doesn't support ORM objects. Given that the target use case of the > retry decorator was for functions not in an ongoing session, we didn't have > a pattern to support in the main tree where an ORM object would be passed as > an argument to the function that needed to be protected. right,
Re: [openstack-dev] [neutron][oslo.db] db retry madness
Thanks for asking this. It's good to get this documented on the mailing list since most of it is spread across bug reports and commits over the last couple of years. >1. since the retry_if_session_inactive decorator assumes functions that receive a "context", what does it mean that networking_odl uses functions that receive a "session" and not a "context" ? Should networking_odl be modified such that it accepts the same "context" that Neutron passes around? > >2. Alternatively, should the retry_if_session_inactive decorator be modified (or a new decorator added) that provides the identical behavior for functions that receive a "session" argument rather than a "context" argument ? (or should this be done anyway even if networking_odl's pattern is legacy?) So it really depends on how those particular functions are being called. For example, it seems like this function in the review is always going to be called within an active session: https://review.openstack.org/#/c/500584/3/networking_odl/db/db.py@113 If that's the case then the decorator is never going to trigger a retry since is_active will be True. Since the goal of that patch is to deal with deadlocks, the retry can't happen down at that level, it will need to be somewhere further up the call stack since the deadlock will put the session in a rollback state. For the functions being called outside of an active session, they should switch to accepting a context to handle the enginefacade switch. > 3a. Is this a temporary measure to work around legacy patterns? Yes. When that went in we had very little adoption of the enginefacade. Even now we haven't fully completed the transition so checking for is_active covers the old code paths. As we continue migrating to the enginefacade, the cases where you can get a reference to a session that actually has 'is_active==False' are going to keep dwindling. Once we complete the switch and remove the legacy session constructor, the decorator will just be adjusted to check if there is a session attached to the context to determine if retries are safe since is_active will always be True. > 3b. If 3a., is the pattern in use by networking_odl, receiving a "session" and not "context", the specific legacy pattern in question? If those functions are called outside of an active session, then yes. Based on the session.begin() call without substransactions=True, I'm guessing that's the case for at least some of them: https://review.openstack.org/#/c/500584/3/networking_odl/db/db.py@85 >3c. if 3a and 3b, are we expecting all the various plugins to start using "context" at some point ? (and when they come to me with breakage, I can push them in that direction?) Yeah, for any functions that expect to start their own transaction they should altered to accept contexts and use the reader/writer context managers. If it's a function that is called from within an ongoing transaction, it can obviously still accept a session argument but the 'retry_if_session_inactive' decorator would obviously be irrelevant in that scenario since the session would always be active. > 4a. Is it strictly that simple mutable lists and dicts of simple types (ints, strings, etc.) are preserved across retries, assuming the receiving functions might be modifying them? Yes, this was to deal with functions that would mutate collections passed in. In particular, we have functions that alter the parsed API request as they process them. If they hit a deadlock on commit and we retry at that point the API request would be mangled without the copy. (e.g. https://bugs.launchpad.net/neutron/+bug/1584920) >4b. Does this deepcopy() approach have any support for functions that are being passed not just simple structures but ORM-mapped objects? Was this use case considered? No, it doesn't support ORM objects. Given that the target use case of the retry decorator was for functions not in an ongoing session, we didn't have a pattern to support in the main tree where an ORM object would be passed as an argument to the function that needed to be protected. >4c. Does Neutron's model base have any support for deepcopy()? I notice that the base classes in model_base do **not** seem to have a __deepcopy__ present. This would mean that this deepcopy() will definitely break any ORM mapped objects because you need to use an approach like what Nova uses for copy(): https://github.com/openstack/nova/blob/master/nova/db/ sqlalchemy/models.py#L46 . This would be why the developer's objects are all getting kicked out of the session. (this is the "is this a bug?" part) That's definitely why the objects are being kicked out of the session. I'm not sure if it makes sense to attempt to support copying an ORM object when the decorator is meant for functions starting a completely new session. If there is a use case I'm missing that we do want to support, it probably makes more sense to use a different decorator than to adjust the existing one. >4d. How frequent
[openstack-dev] [neutron][oslo.db] db retry madness
I'm trying to get a handle on neutron's retry decorators.I'm assisting a developer in the networking_odl project who wishes to make use of neutron's retry_if_session_inactive decorator. The developer has run into several problems, including that networking_odl methods here seem to accept a "session" directly, rather than the "context" as is standard throughout neutron, and retry_if_session_inactive assumes the context calling pattern. Additionally, the retry_db_errors() decorator which is also used by retry_if_session_inactive is doing a deepcopy operation on arguments, which when applied to ORM-mapped entities, makes a copy of them which are then detached from the Session and they fail to function beyond that. For context, the developer's patch is at: https://review.openstack.org/#/c/500584/3 and the neutron features I'm looking at can be seen when they were reviewed at: https://review.openstack.org/#/c/356530/22 So I have a bunch of questions, many are asking the same things: "is networking_odl's pattern legacy or not?" and "is this retry decorator buggy?" There's overlap in these questions. But "yes" / "no" to each would still help me figure out that this is a giraffe and not a duck :) ("does it have a long neck?" "does it quack?" etc) 1. since the retry_if_session_inactive decorator assumes functions that receive a "context", what does it mean that networking_odl uses functions that receive a "session" and not a "context" ? Should networking_odl be modified such that it accepts the same "context" that Neutron passes around? 2. Alternatively, should the retry_if_session_inactive decorator be modified (or a new decorator added) that provides the identical behavior for functions that receive a "session" argument rather than a "context" argument ? (or should this be done anyway even if networking_odl's pattern is legacy?) 3. I notice that the session.is_active flag is considered to be meaningful. Normally, this flag is of no use, as the enginefacade pattern only provides for a Session that has had the begin() method called. However, it looks like in neutron_lib/context.py, the Context class is providing for a default session "db_api.get_writer_session()", which I would assume is how we get a Session with is_active is false: 3a. Is this a temporary measure to work around legacy patterns? 3b. If 3a., is the pattern in use by networking_odl, receiving a "session" and not "context", the specific legacy pattern in question? 3c. if 3a and 3b, are we expecting all the various plugins to start using "context" at some point ? (and when they come to me with breakage, I can push them in that direction?) 4. What is the bug we are fixing by using deepcopy() with the arguments passed to the decorated function: 4a. Is it strictly that simple mutable lists and dicts of simple types (ints, strings, etc.) are preserved across retries, assuming the receiving functions might be modifying them? 4b. Does this deepcopy() approach have any support for functions that are being passed not just simple structures but ORM-mapped objects? Was this use case considered? 4c. Does Neutron's model base have any support for deepcopy()? I notice that the base classes in model_base do **not** seem to have a __deepcopy__ present. This would mean that this deepcopy() will definitely break any ORM mapped objects because you need to use an approach like what Nova uses for copy(): https://github.com/openstack/nova/blob/master/nova/db/sqlalchemy/models.py#L46 . This would be why the developer's objects are all getting kicked out of the session. (this is the "is this a bug?" part) 4d. How frequent is the problem of mutable lists and dicts, and should this complex and actually pretty expensive behavior be optional, only for those functions that are known to receieve mutable structures and change them in place? __ OpenStack Development Mailing List (not for usage questions) Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev