Re: [patch net-next 6/8] ipv4: fib: Add an API to request a FIB dump
On 17.11.2016 19:32, Ido Schimmel wrote: > On Thu, Nov 17, 2016 at 06:20:39PM +0100, Hannes Frederic Sowa wrote: >> On 17.11.2016 17:45, David Miller wrote: >>> From: Hannes Frederic Sowa>>> Date: Thu, 17 Nov 2016 15:36:48 +0100 >>> The other way is the journal idea I had, which uses an rb-tree with timestamps as keys (can be lamport timestamps). You insert into the tree until the dump is finished and use it as queue later to shuffle stuff into the hardware. >>> >>> If you have this "place" where pending inserts are stored, you have >>> a policy decision to make. >>> >>> First of all what do other lookups see when there are pending entires? >> >> I think this is a problem with the current approach already, as the >> delayed work queue already postpones the insert for an undecidable >> amount of time (and reorders depending on which CPU the entry was >> inserted and the fib notifier was called). > > The delayed work queue only postpones the insert into the listener's > table, but the entries will be present in the kernel's table as usual. > Therefore, other lookup made on the kernel's table will see the pending > entries. > > Note that both listeners that currently call the dump API do that during > their init, before any lookups can be made on their tables. If you think > it's critical, then we can make sure the workqueues are flushed before > the listeners register their netdevs and effectively expose their tables > to lookups. I do see routing anyway as a best-effort. ;) That means in not too long time the hardware needs to be fully synchronized to the software path and either have all correct entries or abort must have been called. > I'm looking into the reordering issues you mentioned. I belive it's a > valid point. Thanks! >> For user space queries we would still query the in-kernel table. > > Right. No changes here. > >>> If, once inserted into the pending queue, you return success to the >>> inserting entity, then you must make those pending entries visible >>> to lookups. >> >> I think this same problem is in this patch set already. The way I >> understood it, is, that if a problem during insert emerges, the driver >> calls abort and every packet will be send to user space, as the routing >> table cannot be offloaded and it won't try it again, Ido? > > First of all, this abort mechanism is already in place and isn't part of > this patchset. Secondly, why is this a problem? The all-or-nothing > approach is an hard requirement and current implementation is infinitely > better than previous approach in which the kernel's tables were flushed > upon route insertion failure. It rendered the system unusable. Current > implementation of abort mechanism keeps the system usable, but with > reduced performance. Yes, I argued below that I am toally fine with this approach. >> Probably this is an artifact of the mellanox implementation and we can >> implement this differently for other cards, but the schema to abort all >> if the modification doesn't work, seems to be fundamental (I think we >> require the all-or-nothing approach for now). > > Yes, it's an hard requirement for all implementations. mlxsw implements > it by evicting all routes from its tables and inserting a default route > that traps packets to CPU. Correct, I don't see how fib offloading can do something else, besides semantically looking at the routing table and figure out where to insert "exceptions" for subtrees but keep most packets flowing over the hw directly. But this for another time... ;) >>> If you block the inserting entity, well that doesn't make a lot of >>> sense. If blocking is a workable solution, then we can just block the >>> entire insert while this FIB dump to the device driver is happening. >> >> I don't think we should really block (as in kernel-block) at any time. >> >> I was suggesting something like: >> >> rtnl_lock(); >> synchronize_rcu_expedited(); // barrier, all rounting modifications are >> stable and no new can be added due to rtnl_lock >> register notifier(); // notifier adds entries also into journal(); >> rtnl_unlock(); >> walk_fib_tree_rcu_into_journal(); >> // walk finished >> start_syncing_journal_to_hw(); // if new entries show up we sync them >> asap after this point >> >> The journal would need a spin lock to protect its internal state and >> order events correctly. >> >>> But I am pretty sure the idea of blocking modifications for so long >>> was considered undesirable. >> >> Yes, this is also still my opinion. > > It was indeed rejected :) > https://marc.info/?l=linux-netdev=147794848224884=2 Bye, Hannes
Re: [patch net-next 6/8] ipv4: fib: Add an API to request a FIB dump
On 17.11.2016 19:16, David Miller wrote: > From: Hannes Frederic Sowa> Date: Thu, 17 Nov 2016 18:20:39 +0100 > >> Hi, >> >> On 17.11.2016 17:45, David Miller wrote: >>> From: Hannes Frederic Sowa >>> Date: Thu, 17 Nov 2016 15:36:48 +0100 >>> The other way is the journal idea I had, which uses an rb-tree with timestamps as keys (can be lamport timestamps). You insert into the tree until the dump is finished and use it as queue later to shuffle stuff into the hardware. >>> >>> If you have this "place" where pending inserts are stored, you have >>> a policy decision to make. >>> >>> First of all what do other lookups see when there are pending entires? >> >> I think this is a problem with the current approach already, as the >> delayed work queue already postpones the insert for an undecidable >> amount of time (and reorders depending on which CPU the entry was >> inserted and the fib notifier was called). >> >> For user space queries we would still query the in-kernel table. > > Ok, I think I might misunderstand something. > > What is going into this journal exactly? The actual full software and > hardware insert operation, or just the notification to the hardware > device driver notifiers? The journal is only used as a timely ordered queue for updating the hardware in correct order. The enqueue side is the fib notifier only. If no fib notifier is registered we don't use this code at all (and also don't hit the lock protecting this journal in fib insertion/deletion path - fast in-kernel path is untouched - otherwise just a spin_lock already under rtnl_lock in slow path). The fib notifier enqueues the packet with a timestamp into this journal and can also merge entries while they are in the queue, e.g. we got a delete from the fib notifier but the rcu walk indicated an addition of the entry, so we can merge that at this point and depending on the timestamp remove the entry or drop the deletion event. We start dequeueing the fib entries into the hardware as soon as the rcu dump is finished, at this point we are up-to-date in the queue with all events. New events can be added to the journal (with appropriate locking) during this time, as the queue was once in proper synced state we stay proper synchronized. We keep up with the queue in steady state after the dump, so syncing happens ASAP. Maybe we can also drop the journal then. Something alike this described queue is implemented here (haven't checked if it exactly matches the specification, certainly it provides more features): https://github.com/bus1/bus1/blob/master/ipc/bus1/util/queue.h https://github.com/bus1/bus1/blob/master/ipc/bus1/util/queue.c For this to work the config path needs to add timestamps to the fib_infos or fib_aliases. > The "lookup" I'm mostly concerned with is the fast path where the > packets being processed actually look up a route. This doesn't change at all. All code will be hidden in a library that gets attached to the fib notifier, which is configuration code path. > I do not think we can return success on the insert to the user yet > have the route lookup dataplace not return that route on a lookup. We don't change kernel fast path at all. If we add/delete a route in software and hardware, kernel indicates success as soon as software has the entry added. It also gets queued up in the journal. Journal will be lazily processed, if error happens during that (e.g. hardware signals table full), abort is called and all packets go to user space ASAP. User space will always show the route as it is added to in the first place and after the driver called abort also process the packets accordingly. I can imagine this can get very complicated. David's approach with a counter to check for modifications and a limited number of retries probably works too, especially because the hardware will probably be initialized before routing daemons start up and will be synced up hopefully all the time. So maybe this is over engineered, but I have no idea how long hardware needs to sync up a e.g. full IPv4 routing table into hardware (if that is actually the current goal of this). Bye, Hannes
Re: [patch net-next 6/8] ipv4: fib: Add an API to request a FIB dump
On Thu, Nov 17, 2016 at 06:20:39PM +0100, Hannes Frederic Sowa wrote: > On 17.11.2016 17:45, David Miller wrote: > > From: Hannes Frederic Sowa> > Date: Thu, 17 Nov 2016 15:36:48 +0100 > > > >> The other way is the journal idea I had, which uses an rb-tree with > >> timestamps as keys (can be lamport timestamps). You insert into the tree > >> until the dump is finished and use it as queue later to shuffle stuff > >> into the hardware. > > > > If you have this "place" where pending inserts are stored, you have > > a policy decision to make. > > > > First of all what do other lookups see when there are pending entires? > > I think this is a problem with the current approach already, as the > delayed work queue already postpones the insert for an undecidable > amount of time (and reorders depending on which CPU the entry was > inserted and the fib notifier was called). The delayed work queue only postpones the insert into the listener's table, but the entries will be present in the kernel's table as usual. Therefore, other lookup made on the kernel's table will see the pending entries. Note that both listeners that currently call the dump API do that during their init, before any lookups can be made on their tables. If you think it's critical, then we can make sure the workqueues are flushed before the listeners register their netdevs and effectively expose their tables to lookups. I'm looking into the reordering issues you mentioned. I belive it's a valid point. > For user space queries we would still query the in-kernel table. Right. No changes here. > > If, once inserted into the pending queue, you return success to the > > inserting entity, then you must make those pending entries visible > > to lookups. > > I think this same problem is in this patch set already. The way I > understood it, is, that if a problem during insert emerges, the driver > calls abort and every packet will be send to user space, as the routing > table cannot be offloaded and it won't try it again, Ido? First of all, this abort mechanism is already in place and isn't part of this patchset. Secondly, why is this a problem? The all-or-nothing approach is an hard requirement and current implementation is infinitely better than previous approach in which the kernel's tables were flushed upon route insertion failure. It rendered the system unusable. Current implementation of abort mechanism keeps the system usable, but with reduced performance. > Probably this is an artifact of the mellanox implementation and we can > implement this differently for other cards, but the schema to abort all > if the modification doesn't work, seems to be fundamental (I think we > require the all-or-nothing approach for now). Yes, it's an hard requirement for all implementations. mlxsw implements it by evicting all routes from its tables and inserting a default route that traps packets to CPU. > > If you block the inserting entity, well that doesn't make a lot of > > sense. If blocking is a workable solution, then we can just block the > > entire insert while this FIB dump to the device driver is happening. > > I don't think we should really block (as in kernel-block) at any time. > > I was suggesting something like: > > rtnl_lock(); > synchronize_rcu_expedited(); // barrier, all rounting modifications are > stable and no new can be added due to rtnl_lock > register notifier(); // notifier adds entries also into journal(); > rtnl_unlock(); > walk_fib_tree_rcu_into_journal(); > // walk finished > start_syncing_journal_to_hw(); // if new entries show up we sync them > asap after this point > > The journal would need a spin lock to protect its internal state and > order events correctly. > > > But I am pretty sure the idea of blocking modifications for so long > > was considered undesirable. > > Yes, this is also still my opinion. It was indeed rejected :) https://marc.info/?l=linux-netdev=147794848224884=2
Re: [patch net-next 6/8] ipv4: fib: Add an API to request a FIB dump
From: Hannes Frederic SowaDate: Thu, 17 Nov 2016 18:20:39 +0100 > Hi, > > On 17.11.2016 17:45, David Miller wrote: >> From: Hannes Frederic Sowa >> Date: Thu, 17 Nov 2016 15:36:48 +0100 >> >>> The other way is the journal idea I had, which uses an rb-tree with >>> timestamps as keys (can be lamport timestamps). You insert into the tree >>> until the dump is finished and use it as queue later to shuffle stuff >>> into the hardware. >> >> If you have this "place" where pending inserts are stored, you have >> a policy decision to make. >> >> First of all what do other lookups see when there are pending entires? > > I think this is a problem with the current approach already, as the > delayed work queue already postpones the insert for an undecidable > amount of time (and reorders depending on which CPU the entry was > inserted and the fib notifier was called). > > For user space queries we would still query the in-kernel table. Ok, I think I might misunderstand something. What is going into this journal exactly? The actual full software and hardware insert operation, or just the notification to the hardware device driver notifiers? The "lookup" I'm mostly concerned with is the fast path where the packets being processed actually look up a route. I do not think we can return success on the insert to the user yet have the route lookup dataplace not return that route on a lookup.
Re: [patch net-next 6/8] ipv4: fib: Add an API to request a FIB dump
Hi, On 17.11.2016 17:45, David Miller wrote: > From: Hannes Frederic Sowa> Date: Thu, 17 Nov 2016 15:36:48 +0100 > >> The other way is the journal idea I had, which uses an rb-tree with >> timestamps as keys (can be lamport timestamps). You insert into the tree >> until the dump is finished and use it as queue later to shuffle stuff >> into the hardware. > > If you have this "place" where pending inserts are stored, you have > a policy decision to make. > > First of all what do other lookups see when there are pending entires? I think this is a problem with the current approach already, as the delayed work queue already postpones the insert for an undecidable amount of time (and reorders depending on which CPU the entry was inserted and the fib notifier was called). For user space queries we would still query the in-kernel table. > If, once inserted into the pending queue, you return success to the > inserting entity, then you must make those pending entries visible > to lookups. I think this same problem is in this patch set already. The way I understood it, is, that if a problem during insert emerges, the driver calls abort and every packet will be send to user space, as the routing table cannot be offloaded and it won't try it again, Ido? Probably this is an artifact of the mellanox implementation and we can implement this differently for other cards, but the schema to abort all if the modification doesn't work, seems to be fundamental (I think we require the all-or-nothing approach for now). > If you block the inserting entity, well that doesn't make a lot of > sense. If blocking is a workable solution, then we can just block the > entire insert while this FIB dump to the device driver is happening. I don't think we should really block (as in kernel-block) at any time. I was suggesting something like: rtnl_lock(); synchronize_rcu_expedited(); // barrier, all rounting modifications are stable and no new can be added due to rtnl_lock register notifier(); // notifier adds entries also into journal(); rtnl_unlock(); walk_fib_tree_rcu_into_journal(); // walk finished start_syncing_journal_to_hw(); // if new entries show up we sync them asap after this point The journal would need a spin lock to protect its internal state and order events correctly. > But I am pretty sure the idea of blocking modifications for so long > was considered undesirable. Yes, this is also still my opinion. Bye, Hannes
Re: [patch net-next 6/8] ipv4: fib: Add an API to request a FIB dump
On 17.11.2016 14:10, Ido Schimmel wrote: > Hi Hannes, > > On Wed, Nov 16, 2016 at 08:43:25PM +0100, Hannes Frederic Sowa wrote: >> On 16.11.2016 19:51, Ido Schimmel wrote: >>> On Wed, Nov 16, 2016 at 06:35:45PM +0100, Hannes Frederic Sowa wrote: On 16.11.2016 16:18, Ido Schimmel wrote: > On Wed, Nov 16, 2016 at 03:51:01PM +0100, Hannes Frederic Sowa wrote: >> On 16.11.2016 15:09, Jiri Pirko wrote: >>> From: Ido Schimmel>>> >>> Commit b90eb7549499 ("fib: introduce FIB notification infrastructure") >>> introduced a new notification chain to notify listeners (f.e., switchdev >>> drivers) about addition and deletion of routes. >>> >>> However, upon registration to the chain the FIB tables can already be >>> populated, which means potential listeners will have an incomplete view >>> of the tables. >>> >>> Solve that by adding an API to request a FIB dump. The dump itself it >>> done using RCU in order not to starve consumers that need RTNL to make >>> progress. >>> >>> Signed-off-by: Ido Schimmel >>> Signed-off-by: Jiri Pirko >> >> Have you looked at potential inconsistencies resulting of RCU walking >> the table and having concurrent inserts? > > Yes. I did try to think about situations in which this approach will > fail, but I could only find problems with concurrent removals, which I > addressed in 5/8. In case of concurrent insertions, even if you missed > the node, you would still get the ENTRY_ADD event to your listener. Theoretically a node could still be installed while the deletion event fired before registering the notifier. E.g. a synchronize_net before dumping could help here? >>> >>> If the deletion event fired for some fib alias, then by 5/8 we are >>> guaranteed that it was already unlinked from the fib alias list in the >>> leaf in which it was contained. So, while it's possible we didn't >>> register our listener in time for the deletion event, we won't traverse >>> this fib alias while dumping the trie anyway. Did I understand you >>> correctly? >>> >> >> Theoretically we can have the same problem for insertion: >> >> You receive a delete event from the notifier that is queued up first but >> the dump will still see the entry in the fib due to being managed by RCU >> (the notifier running on another CPU). >> >> The problem is that the fib_remove_alias->hlist_del_rcu->WRITE_ONCE is >> still not strongly ordered against the local fib dump trie walk. > > OK, so I believe my analysis in 5/8 was wrong. Despite CPU A invoking > fib_remove_alias() in t0 for fa1, it's possible for CPU B doing the fib > dump to see fa1 in t1, which will lead to fa1 being permanently present > in the listener's table. Yep. :( Also the delayed work queue is only partial ordered (only on one CPU), so you don't know about when an event is processed from the dump and the notifier. I think you need to create your own workqueue for doing so. > Given the above, I think Dave's suggestion is the only applicable > solution. Do you agree? Any other suggestions? As I wrote before the problem with seqcounter is, that with a quagga running on top and pushing updates you end up never getting a stable view, so maybe some logic to abort in case it cannot be loaded after a few tries would be fine. The other way is the journal idea I had, which uses an rb-tree with timestamps as keys (can be lamport timestamps). You insert into the tree until the dump is finished and use it as queue later to shuffle stuff into the hardware. Because you should be able to hold refs to the fa_info or other structs directly, I don't expect gigantic memory overhead, just for a cell to store timetamp and pointer (rb_node). Bye, Hannes
Re: [patch net-next 6/8] ipv4: fib: Add an API to request a FIB dump
Hi Hannes, On Wed, Nov 16, 2016 at 08:43:25PM +0100, Hannes Frederic Sowa wrote: > On 16.11.2016 19:51, Ido Schimmel wrote: > > On Wed, Nov 16, 2016 at 06:35:45PM +0100, Hannes Frederic Sowa wrote: > >> On 16.11.2016 16:18, Ido Schimmel wrote: > >>> On Wed, Nov 16, 2016 at 03:51:01PM +0100, Hannes Frederic Sowa wrote: > On 16.11.2016 15:09, Jiri Pirko wrote: > > From: Ido Schimmel> > > > Commit b90eb7549499 ("fib: introduce FIB notification infrastructure") > > introduced a new notification chain to notify listeners (f.e., switchdev > > drivers) about addition and deletion of routes. > > > > However, upon registration to the chain the FIB tables can already be > > populated, which means potential listeners will have an incomplete view > > of the tables. > > > > Solve that by adding an API to request a FIB dump. The dump itself it > > done using RCU in order not to starve consumers that need RTNL to make > > progress. > > > > Signed-off-by: Ido Schimmel > > Signed-off-by: Jiri Pirko > > Have you looked at potential inconsistencies resulting of RCU walking > the table and having concurrent inserts? > >>> > >>> Yes. I did try to think about situations in which this approach will > >>> fail, but I could only find problems with concurrent removals, which I > >>> addressed in 5/8. In case of concurrent insertions, even if you missed > >>> the node, you would still get the ENTRY_ADD event to your listener. > >> > >> Theoretically a node could still be installed while the deletion event > >> fired before registering the notifier. E.g. a synchronize_net before > >> dumping could help here? > > > > If the deletion event fired for some fib alias, then by 5/8 we are > > guaranteed that it was already unlinked from the fib alias list in the > > leaf in which it was contained. So, while it's possible we didn't > > register our listener in time for the deletion event, we won't traverse > > this fib alias while dumping the trie anyway. Did I understand you > > correctly? > > > > Theoretically we can have the same problem for insertion: > > You receive a delete event from the notifier that is queued up first but > the dump will still see the entry in the fib due to being managed by RCU > (the notifier running on another CPU). > > The problem is that the fib_remove_alias->hlist_del_rcu->WRITE_ONCE is > still not strongly ordered against the local fib dump trie walk. OK, so I believe my analysis in 5/8 was wrong. Despite CPU A invoking fib_remove_alias() in t0 for fa1, it's possible for CPU B doing the fib dump to see fa1 in t1, which will lead to fa1 being permanently present in the listener's table. Given the above, I think Dave's suggestion is the only applicable solution. Do you agree? Any other suggestions? Thanks
Re: [patch net-next 6/8] ipv4: fib: Add an API to request a FIB dump
From: Hannes Frederic SowaDate: Thu, 17 Nov 2016 15:36:48 +0100 > The other way is the journal idea I had, which uses an rb-tree with > timestamps as keys (can be lamport timestamps). You insert into the tree > until the dump is finished and use it as queue later to shuffle stuff > into the hardware. If you have this "place" where pending inserts are stored, you have a policy decision to make. First of all what do other lookups see when there are pending entires? If, once inserted into the pending queue, you return success to the inserting entity, then you must make those pending entries visible to lookups. If you block the inserting entity, well that doesn't make a lot of sense. If blocking is a workable solution, then we can just block the entire insert while this FIB dump to the device driver is happening. But I am pretty sure the idea of blocking modifications for so long was considered undesirable.
Re: [patch net-next 6/8] ipv4: fib: Add an API to request a FIB dump
On Wed, Nov 16, 2016 at 08:43:25PM +0100, Hannes Frederic Sowa wrote: > On 16.11.2016 19:51, Ido Schimmel wrote: > > On Wed, Nov 16, 2016 at 06:35:45PM +0100, Hannes Frederic Sowa wrote: > >> On 16.11.2016 16:18, Ido Schimmel wrote: > >>> On Wed, Nov 16, 2016 at 03:51:01PM +0100, Hannes Frederic Sowa wrote: > On 16.11.2016 15:09, Jiri Pirko wrote: > > From: Ido Schimmel> > > > Commit b90eb7549499 ("fib: introduce FIB notification infrastructure") > > introduced a new notification chain to notify listeners (f.e., switchdev > > drivers) about addition and deletion of routes. > > > > However, upon registration to the chain the FIB tables can already be > > populated, which means potential listeners will have an incomplete view > > of the tables. > > > > Solve that by adding an API to request a FIB dump. The dump itself it > > done using RCU in order not to starve consumers that need RTNL to make > > progress. > > > > Signed-off-by: Ido Schimmel > > Signed-off-by: Jiri Pirko > > Have you looked at potential inconsistencies resulting of RCU walking > the table and having concurrent inserts? > >>> > >>> Yes. I did try to think about situations in which this approach will > >>> fail, but I could only find problems with concurrent removals, which I > >>> addressed in 5/8. In case of concurrent insertions, even if you missed > >>> the node, you would still get the ENTRY_ADD event to your listener. > >> > >> Theoretically a node could still be installed while the deletion event > >> fired before registering the notifier. E.g. a synchronize_net before > >> dumping could help here? > > > > If the deletion event fired for some fib alias, then by 5/8 we are > > guaranteed that it was already unlinked from the fib alias list in the > > leaf in which it was contained. So, while it's possible we didn't > > register our listener in time for the deletion event, we won't traverse > > this fib alias while dumping the trie anyway. Did I understand you > > correctly? > > > > Theoretically we can have the same problem for insertion: > > You receive a delete event from the notifier that is queued up first but > the dump will still see the entry in the fib due to being managed by RCU > (the notifier running on another CPU). > > The problem is that the fib_remove_alias->hlist_del_rcu->WRITE_ONCE is > still not strongly ordered against the local fib dump trie walk. It's pretty late here so I would have to check this out tomorrow morning. If this is indeed the case (not saying you're wrong, just want to verify for myself), then I guess 5/8 can be dropped and instead we should go with Dave's suggestion? I don't see any other way given the constraints... Thanks a lot Hannes!
Re: [patch net-next 6/8] ipv4: fib: Add an API to request a FIB dump
On 16.11.2016 19:51, Ido Schimmel wrote: > Hi, > > On Wed, Nov 16, 2016 at 06:35:45PM +0100, Hannes Frederic Sowa wrote: >> On 16.11.2016 16:18, Ido Schimmel wrote: >>> On Wed, Nov 16, 2016 at 03:51:01PM +0100, Hannes Frederic Sowa wrote: On 16.11.2016 15:09, Jiri Pirko wrote: > From: Ido Schimmel> > Commit b90eb7549499 ("fib: introduce FIB notification infrastructure") > introduced a new notification chain to notify listeners (f.e., switchdev > drivers) about addition and deletion of routes. > > However, upon registration to the chain the FIB tables can already be > populated, which means potential listeners will have an incomplete view > of the tables. > > Solve that by adding an API to request a FIB dump. The dump itself it > done using RCU in order not to starve consumers that need RTNL to make > progress. > > Signed-off-by: Ido Schimmel > Signed-off-by: Jiri Pirko Have you looked at potential inconsistencies resulting of RCU walking the table and having concurrent inserts? >>> >>> Yes. I did try to think about situations in which this approach will >>> fail, but I could only find problems with concurrent removals, which I >>> addressed in 5/8. In case of concurrent insertions, even if you missed >>> the node, you would still get the ENTRY_ADD event to your listener. >> >> Theoretically a node could still be installed while the deletion event >> fired before registering the notifier. E.g. a synchronize_net before >> dumping could help here? > > If the deletion event fired for some fib alias, then by 5/8 we are > guaranteed that it was already unlinked from the fib alias list in the > leaf in which it was contained. So, while it's possible we didn't > register our listener in time for the deletion event, we won't traverse > this fib alias while dumping the trie anyway. Did I understand you > correctly? > Theoretically we can have the same problem for insertion: You receive a delete event from the notifier that is queued up first but the dump will still see the entry in the fib due to being managed by RCU (the notifier running on another CPU). The problem is that the fib_remove_alias->hlist_del_rcu->WRITE_ONCE is still not strongly ordered against the local fib dump trie walk. >> I don't know how you prepare the data structures for inserting in into >> the hardware, but if ordering matters, the notifier for a delete event >> can be called before the dump installed the fib entry? > > Right. It's possible for the listener to receive a deletion event for a > fib entry it doesn't have, in which case it should just ignore it (as > current listeners do). Yep, for this specific case. Bye, Hannes
Re: [patch net-next 6/8] ipv4: fib: Add an API to request a FIB dump
Hi, On Wed, Nov 16, 2016 at 06:35:45PM +0100, Hannes Frederic Sowa wrote: > On 16.11.2016 16:18, Ido Schimmel wrote: > > On Wed, Nov 16, 2016 at 03:51:01PM +0100, Hannes Frederic Sowa wrote: > >> On 16.11.2016 15:09, Jiri Pirko wrote: > >>> From: Ido Schimmel> >>> > >>> Commit b90eb7549499 ("fib: introduce FIB notification infrastructure") > >>> introduced a new notification chain to notify listeners (f.e., switchdev > >>> drivers) about addition and deletion of routes. > >>> > >>> However, upon registration to the chain the FIB tables can already be > >>> populated, which means potential listeners will have an incomplete view > >>> of the tables. > >>> > >>> Solve that by adding an API to request a FIB dump. The dump itself it > >>> done using RCU in order not to starve consumers that need RTNL to make > >>> progress. > >>> > >>> Signed-off-by: Ido Schimmel > >>> Signed-off-by: Jiri Pirko > >> > >> Have you looked at potential inconsistencies resulting of RCU walking > >> the table and having concurrent inserts? > > > > Yes. I did try to think about situations in which this approach will > > fail, but I could only find problems with concurrent removals, which I > > addressed in 5/8. In case of concurrent insertions, even if you missed > > the node, you would still get the ENTRY_ADD event to your listener. > > Theoretically a node could still be installed while the deletion event > fired before registering the notifier. E.g. a synchronize_net before > dumping could help here? If the deletion event fired for some fib alias, then by 5/8 we are guaranteed that it was already unlinked from the fib alias list in the leaf in which it was contained. So, while it's possible we didn't register our listener in time for the deletion event, we won't traverse this fib alias while dumping the trie anyway. Did I understand you correctly? > I don't know how you prepare the data structures for inserting in into > the hardware, but if ordering matters, the notifier for a delete event > can be called before the dump installed the fib entry? Right. It's possible for the listener to receive a deletion event for a fib entry it doesn't have, in which case it should just ignore it (as current listeners do).
Re: [patch net-next 6/8] ipv4: fib: Add an API to request a FIB dump
On 16.11.2016 16:18, Ido Schimmel wrote: > Hi Hannes, > > On Wed, Nov 16, 2016 at 03:51:01PM +0100, Hannes Frederic Sowa wrote: >> On 16.11.2016 15:09, Jiri Pirko wrote: >>> From: Ido Schimmel>>> >>> Commit b90eb7549499 ("fib: introduce FIB notification infrastructure") >>> introduced a new notification chain to notify listeners (f.e., switchdev >>> drivers) about addition and deletion of routes. >>> >>> However, upon registration to the chain the FIB tables can already be >>> populated, which means potential listeners will have an incomplete view >>> of the tables. >>> >>> Solve that by adding an API to request a FIB dump. The dump itself it >>> done using RCU in order not to starve consumers that need RTNL to make >>> progress. >>> >>> Signed-off-by: Ido Schimmel >>> Signed-off-by: Jiri Pirko >> >> Have you looked at potential inconsistencies resulting of RCU walking >> the table and having concurrent inserts? > > Yes. I did try to think about situations in which this approach will > fail, but I could only find problems with concurrent removals, which I > addressed in 5/8. In case of concurrent insertions, even if you missed > the node, you would still get the ENTRY_ADD event to your listener. Theoretically a node could still be installed while the deletion event fired before registering the notifier. E.g. a synchronize_net before dumping could help here? I don't know how you prepare the data structures for inserting in into the hardware, but if ordering matters, the notifier for a delete event can be called before the dump installed the fib entry? E.g. you also don't check what events are already queued in the delayed work queue so far. I hope I understand the patches correctly? Thanks, Hannes
Re: [patch net-next 6/8] ipv4: fib: Add an API to request a FIB dump
On 16.11.2016 17:27, David Miller wrote: > From: Hannes Frederic Sowa> Date: Wed, 16 Nov 2016 15:51:01 +0100 > >> I don't see a way around doing a journal like in filesystems somehow, > > We really just need a sequence counter incremented for each insert/remove, > and restart the dump from the beginning if it changes mid-dump. I thought about this at first, too, but became afraid if we could end up looping endlessly because of routing changes happen constantly while trying to upload the fib into the hardware.
Re: [patch net-next 6/8] ipv4: fib: Add an API to request a FIB dump
Hi Dave, On Wed, Nov 16, 2016 at 11:27:35AM -0500, David Miller wrote: > From: Hannes Frederic Sowa> Date: Wed, 16 Nov 2016 15:51:01 +0100 > > > I don't see a way around doing a journal like in filesystems somehow, > > We really just need a sequence counter incremented for each insert/remove, > and restart the dump from the beginning if it changes mid-dump. When you dump the FIB table your listener is already registered to the chain, so any changes happening mid-dump will be propagated to it. I've noted this in 5/8 and in my reply to Hannes.
Re: [patch net-next 6/8] ipv4: fib: Add an API to request a FIB dump
From: Hannes Frederic SowaDate: Wed, 16 Nov 2016 15:51:01 +0100 > I don't see a way around doing a journal like in filesystems somehow, We really just need a sequence counter incremented for each insert/remove, and restart the dump from the beginning if it changes mid-dump.
Re: [patch net-next 6/8] ipv4: fib: Add an API to request a FIB dump
Hi Hannes, On Wed, Nov 16, 2016 at 03:51:01PM +0100, Hannes Frederic Sowa wrote: > On 16.11.2016 15:09, Jiri Pirko wrote: > > From: Ido Schimmel> > > > Commit b90eb7549499 ("fib: introduce FIB notification infrastructure") > > introduced a new notification chain to notify listeners (f.e., switchdev > > drivers) about addition and deletion of routes. > > > > However, upon registration to the chain the FIB tables can already be > > populated, which means potential listeners will have an incomplete view > > of the tables. > > > > Solve that by adding an API to request a FIB dump. The dump itself it > > done using RCU in order not to starve consumers that need RTNL to make > > progress. > > > > Signed-off-by: Ido Schimmel > > Signed-off-by: Jiri Pirko > > Have you looked at potential inconsistencies resulting of RCU walking > the table and having concurrent inserts? Yes. I did try to think about situations in which this approach will fail, but I could only find problems with concurrent removals, which I addressed in 5/8. In case of concurrent insertions, even if you missed the node, you would still get the ENTRY_ADD event to your listener.
Re: [patch net-next 6/8] ipv4: fib: Add an API to request a FIB dump
On 16.11.2016 15:09, Jiri Pirko wrote: > From: Ido Schimmel> > Commit b90eb7549499 ("fib: introduce FIB notification infrastructure") > introduced a new notification chain to notify listeners (f.e., switchdev > drivers) about addition and deletion of routes. > > However, upon registration to the chain the FIB tables can already be > populated, which means potential listeners will have an incomplete view > of the tables. > > Solve that by adding an API to request a FIB dump. The dump itself it > done using RCU in order not to starve consumers that need RTNL to make > progress. > > Signed-off-by: Ido Schimmel > Signed-off-by: Jiri Pirko Have you looked at potential inconsistencies resulting of RCU walking the table and having concurrent inserts? I don't see a way around doing a journal like in filesystems somehow, but maybe the effects are not that severe and it is not a problem after all. Bye, Hannes