> On 18 Jun 2016, at 1:53 AM, Martin Pieuchot <m...@openbsd.org> wrote:
> 
> On 15/06/16(Wed) 11:38, David Gwynne wrote:
>> this tweaks art_walk in preparation for a world where the table may
>> be updated on another cpu.
>> 
>> at the moment we're relying on the big lock to serialise updates,
>> so this adds big lock calls in the right place.
> 
> I don't understand.  rtable_walk() is always called with the
> KERNEL_LOCK() held, is this going to change?
rtable_walk may always be called with KERNEL_LOCK, but rtable_insert
and rtable_delete won't be, because we want them to be able to work
during packet processing, which is being moved out of the big lock.

walk works by taking refcnts on tables while recursing. the refcnt
changes aren't atomic and therefore need to be serialised by the same
lock used in the insert/delete paths.

> 
>> jmatthew@ has a
>> diff to serialise changes with a mutex, so those kernel lock calls
>> can be replaced with the mutex ones.
> 
> Why?  I assume the mutex will be used to serialize changes with the
> input path...  But taking/releasing the lock for each node doesn't
> sound clever to me.  When rtable_walk() is called you generally want
> the operation to complete on the whole table before allowing any other
> change.
> 
> The whole idea behind moving the EAGAIN check inside rtable_walk() is
> that we can grab the lock around art_walk() which would make sure the
> table has been completely updated before continuing.

taking and releasing the lock isn't particularly clever, but it's the
least worst thing i could come up with. i'll try to explain how i got
to this.

first, we're going with a mutex to serialise changes to the art
structures because changes are going to be made in an interrupt path
and mutexes are what we've got for that. as said above, we need to
take the mutex to change the refcnts.

secondly, from what i can tell rtable_walk is used for two main
things: to read the whole table (eg, to serialise it for netstat -nr
to read), and to filter/mutate routes (eg, if we take an interface
down we need to walk the rtable to take the associated routes down
too).

the callback for dumping the table to userland does so using
copyout(9), which may sleep, and holding a mutex over a sleep is not a
Good Idea(tm). we could avoid that by allocating a temporary kernel
buffer for the rtable_walk callback to copy into, and then copying
that out to userland outside the mutex.
however, rtables can get large, so allocating a bounce buffer may be
unreliable on some systems, esp with limited kva space.

while we're talking about reads, thrashing the mutex isn't the only
caveat. it is possible the rtable can change while the callback is
running without the mutex held. the consequence of that is we won't
get a strongly consistent view of the rtable. for example, you may be
halfway through reading a table when route A is inserted in the first
quarter, followed by an insertion of route B in the last quarter. in
that situation your table read will show route B and not A, even
though A happened first. i'm reasonably sure that this is ok though.
think of it as a relaxed memory model where writes can happen out of
order with reads and we don't have any barriers.

anyway, the other problem with holding the mutex while running
rtable_walk callbacks is the callbacks that filter entries in the
table. those end up calling rtable_delete, which in turn will try to
acquire the mutex. holding the mutex for the walk while calling those
would end up in a deadlock.

an alternative to that would be to create locked and unlocked variants
of rtable insert and delete, and push knowledge of the locking
protocol out to the callers. that would mean filters would call the
locked variants, and i'd argue that the read callbacks would then have
to explicitly give up the lock for the reasons discussed above.

i hope this explains how i got to this diff.

> 
>> the idea is that art walk will traverse the tables under the lock.
>> that allows us to bump the refcount on tables, which means they'll
>> still be there when we unwind after descending into them. it gives
>> up the lock when it calls the per route callback. this is so the
>> callback can decide to update the table while looking at it. that
>> route reference is done via the srp api.
> 
> That said, I like where this is going :)

me too. it'll be interesting to see how it scales.
> 
>> Index: art.c
>> ===================================================================
>> RCS file: /cvs/src/sys/net/art.c,v
>> retrieving revision 1.19
>> diff -u -p -r1.19 art.c
>> --- art.c	14 Jun 2016 04:42:02 -0000	1.19
>> +++ art.c	14 Jun 2016 12:36:25 -0000
>> @@ -78,10 +78,13 @@ struct art_node	*art_table_insert(struc
>> 			    int, struct art_node *);
>> struct art_node	*art_table_delete(struct art_root *, struct art_table *,
>> 			    int, struct art_node *);
>> -void			 art_table_ref(struct art_root *, struct art_table *);
>> +struct art_table	*art_table_ref(struct art_root *, struct art_table *);
>> int			 art_table_free(struct art_root *, struct art_table *);
>> int			 art_table_walk(struct art_root *, struct art_table *,
>> 			    int (*f)(struct art_node *, void *), void *);
>> +int			 art_walk_apply(struct art_root *,
>> +			    struct art_node *, struct art_node *,
>> +			    int (*f)(struct art_node *, void *), void *);
>> void			 art_table_gc(void *);
>> void			 art_gc(void *);
>> 
>> @@ -565,10 +568,11 @@ art_table_delete(struct art_root *ar, st
>> 	return (an);
>> }
>> 
>> -void
>> +struct art_table *
>> art_table_ref(struct art_root *ar, struct art_table *at)
>> {
>> 	at->at_refcnt++;
>> +	return (at);
>> }
>> 
>> static inline int
>> @@ -604,42 +608,44 @@ art_table_free(struct art_root *ar, stru
>> int
>> art_walk(struct art_root *ar, int (*f)(struct art_node *, void *), void *arg)
>> {
>> +	struct srp_ref		 sr;
>> 	struct art_table	*at;
>> 	struct art_node		*node;
>> -	int			 error;
>> -
>> -	KERNEL_ASSERT_LOCKED();
>> +	int			 error = 0;
>> 
>> +	KERNEL_LOCK();
>> 	at = srp_get_locked(&ar->ar_root);
>> -	if (at == NULL)
>> -		return (0);
>> +	if (at != NULL) {
>> +		art_table_ref(ar, at);
>> 
>> -	/*
>> -	 * The default route should be processed here because the root
>> -	 * table does not have a parent.
>> -	 */
>> -	node = srp_get_locked(&at->at_default);
>> -	if (node != NULL) {
>> -		error = (*f)(node, arg);
>> -		if (error)
>> -			return (error);
>> +		/*
>> +		 * The default route should be processed here because the root
>> +		 * table does not have a parent.
>> +		 */
>> +		node = srp_enter(&sr, &at->at_default);
>> +		error = art_walk_apply(ar, node, NULL, f, arg);
>> +		srp_leave(&sr);
>> +
>> +		if (error == 0)
>> +			error = art_table_walk(ar, at, f, arg);
>> +
>> +		art_table_free(ar, at);
>> 	}
>> +	KERNEL_UNLOCK();
>> 
>> -	return (art_table_walk(ar, at, f, arg));
>> +	return (error);
>> }
>> 
>> int
>> art_table_walk(struct art_root *ar, struct art_table *at,
>>     int (*f)(struct art_node *, void *), void *arg)
>> {
>> -	struct art_node		*next, *an = NULL;
>> -	struct art_node		*node;
>> +	struct srp_ref		 sr;
>> +	struct art_node		*node, *next;
>> +	struct art_table	*nat;
>> 	int			 i, j, error = 0;
>> 	uint32_t		 maxfringe = (at->at_minfringe << 1);
>> 
>> -	/* Prevent this table to be freed while we're manipulating it. */
>> -	art_table_ref(ar, at);
>> -
>> 	/*
>> 	 * Iterate non-fringe nodes in ``natural'' order.
>> 	 */
>> @@ -651,12 +657,13 @@ art_table_walk(struct art_root *ar, stru
>> 		 */
>> 		for (i = max(j, 2); i < at->at_minfringe; i <<= 1) {
>> 			next = srp_get_locked(&at->at_heap[i >> 1].node);
>> -			an = srp_get_locked(&at->at_heap[i].node);
>> -			if ((an != NULL) && (an != next)) {
>> -				error = (*f)(an, arg);
>> -				if (error)
>> -					goto out;
>> -			}
>> +
>> +			node = srp_enter(&sr, &at->at_heap[i].node);
>> +			error = art_walk_apply(ar, node, next, f, arg);
>> +			srp_leave(&sr);
>> +
>> +			if (error != 0)
>> +				return (error);
>> 		}
>> 	}
>> 
>> @@ -665,28 +672,47 @@ art_table_walk(struct art_root *ar, stru
>> 	 */
>> 	for (i = at->at_minfringe; i < maxfringe; i++) {
>> 		next = srp_get_locked(&at->at_heap[i >> 1].node);
>> -		node = srp_get_locked(&at->at_heap[i].node);
>> -		if (!ISLEAF(node))
>> -			an = srp_get_locked(&SUBTABLE(node)->at_default);
>> -		else
>> -			an = node;
>> 
>> -		if ((an != NULL) && (an != next)) {
>> -			error = (*f)(an, arg);
>> -			if (error)
>> -				goto out;
>> +		node = srp_enter(&sr, &at->at_heap[i].node);
>> +		if (!ISLEAF(node)) {
>> +			nat = art_table_ref(ar, SUBTABLE(node));
>> +			node = srp_follow(&sr, &nat->at_default);
>> +		} else
>> +			nat = NULL;
>> +
>> +		error = art_walk_apply(ar, node, next, f, arg);
>> +		srp_leave(&sr);
>> +
>> +		if (error != 0) {
>> +			art_table_free(ar, nat);
>> +			return (error);
>> 		}
>> 
>> -		if (ISLEAF(node))
>> -			continue;
>> +		if (nat != NULL) {
>> +			error = art_table_walk(ar, nat, f, arg);
>> +			art_table_free(ar, nat);
>> +			if (error != 0)
>> +				return (error);
>> +		}
>> +	}
>> 
>> -		error = art_table_walk(ar, SUBTABLE(node), f, arg);
>> -		if (error)
>> -			break;
>> +	return (0);
>> +}
>> +
>> +int
>> +art_walk_apply(struct art_root *ar,
>> +    struct art_node *an, struct art_node *next,
>> +    int (*f)(struct art_node *, void *), void *arg)
>> +{
>> +	int error = 0;
>> +
>> +	if ((an != NULL) && (an != next)) {
>> +		/* this assumes an->an_dst is not used by f */
>> +		KERNEL_UNLOCK();
>> +		error = (*f)(an, arg);
>> +		KERNEL_LOCK();
>> 	}
>> 
>> -out:
>> -	art_table_free(ar, at);
>> 	return (error);
>> }
>> 
>> Index: rtable.c
>> ===================================================================
>> RCS file: /cvs/src/sys/net/rtable.c,v
>> retrieving revision 1.47
>> diff -u -p -r1.47 rtable.c
>> --- rtable.c	14 Jun 2016 04:42:02 -0000	1.47
>> +++ rtable.c	14 Jun 2016 12:36:25 -0000
>> @@ -853,16 +853,16 @@ struct rtable_walk_cookie {
>> int
>> rtable_walk_helper(struct art_node *an, void *xrwc)
>> {
>> +	struct srp_ref			 sr;
>> 	struct rtable_walk_cookie	*rwc = xrwc;
>> -	struct rtentry			*rt, *nrt;
>> +	struct rtentry			*rt;
>> 	int				 error = 0;
>> 
>> -	KERNEL_ASSERT_LOCKED();
>> -
>> -	SRPL_FOREACH_SAFE_LOCKED(rt, &an->an_rtlist, rt_next, nrt) {
>> +	SRPL_FOREACH(rt, &sr, &an->an_rtlist, rt_next) {
>> 		if ((error = (*rwc->rwc_func)(rt, rwc->rwc_arg, rwc->rwc_rid)))
>> 			break;
>> 	}
>> +	SRPL_LEAVE(&sr);
>> 
>> 	return (error);
>> }
>>