> On 18 Jun 2016, at 1:53 AM, Martin Pieuchot <m...@openbsd.org> wrote:
> 
> On 15/06/16(Wed) 11:38, David Gwynne wrote:
>> this tweaks art_walk in preparation for a world where the table may
>> be updated on another cpu.
>> 
>> at the moment we're relying on the big lock to serialise updates,
>> so this adds big lock calls in the right place.
> 
> I don't understand.  rtable_walk() is always called with the
> KERNEL_LOCK() held, is this going to change?
rtable_walk may always be called with KERNEL_LOCK, but rtable_insert
and rtable_delete won't be, because we want them to be able to work
during packet processing, which is being moved out of the big lock.

walk works by taking refcnts on tables while recursing. the refcnt
changes aren't atomic and therefore need to be serialised by the same
lock used in the insert/delete paths.

> 
>> jmatthew@ has a
>> diff to serialise changes with a mutex, so those kernel lock calls
>> can be replaced with the mutex ones.
> 
> Why?  I assume the mutex will be used to serialize changes with the
> input path...  But taking/releasing the lock for each node doesn't
> sound clever to me.  When rtable_walk() is called you generally want
> the operation to complete on the whole table before allowing any other
> change.
> 
> The whole idea behind moving the EAGAIN check inside rtable_walk() is
> that we can grab the lock around art_walk() which would make sure the
> table has been completely updated before continuing.

taking and releasing the lock isn't particularly clever, but it's the
least worst thing i could come up with. i'll try to explain how i got
to this.

first, we're going with a mutex to serialise changes to the art
structures because changes are going to be made in an interrupt path
and mutexes are what we've got for that. as said above, we need to
take the mutex to change the refcnts.

secondly, from what i can tell rtable_walk is used for two main
things: to read the whole table (eg, to serialise it for netstat -nr
to read), and to filter/mutate routes (eg, if we take an interface
down we need to walk the rtable to take the associated routes down
too).

the callback for dumping the table to userland does so using
copyout(9), which may sleep, and holding a mutex over a sleep is not a
Good Idea(tm). we could avoid that by allocating a temporary kernel
buffer for the rtable_walk callback to copy into, and then copying
that out to userland outside the mutex.
however, rtables can get large, so allocating a bounce buffer may be
unreliable on some systems, esp with limited kva space.

while we're talking about reads, thrashing the mutex isn't the only
caveat. it is possible the rtable can change while the callback is
running without the mutex held. the consequence of that is we won't
get a strongly consistent view of the rtable. for example, you may be
halfway through reading a table when route A is inserted in the first
quarter, followed by an insertion of route B in the last quarter. in
that situation your table read will show route B and not A, even
though A happened first. i'm reasonably sure that this is ok though.
think of it as a relaxed memory model where writes can happen out of
order with reads and we don't have any barriers.

anyway, the other problem with holding the mutex while running
rtable_walk callbacks is the callbacks that filter entries in the
table. those end up calling rtable_delete, which in turn will try to
acquire the mutex. holding the mutex for the walk while calling those
would end up in a deadlock.

an alternative to that would be to create locked and unlocked variants
of rtable insert and delete, and push knowledge of the locking
protocol out to the callers. that would mean filters would call the
locked variants, and i'd argue that the read callbacks would then have
to explicitly give up the lock for the reasons discussed above.

i hope this explains how i got to this diff.

> 
>> the idea is that art walk will traverse the tables under the lock.
>> that allows us to bump the refcount on tables, which means they'll
>> still be there when we unwind after descending into them. it gives
>> up the lock when it calls the per route callback. this is so the
>> callback can decide to update the table while looking at it. that
>> route reference is done via the srp api.
> 
> That said, I like where this is going :)

me too. it'll be interesting to see how it scales.
> 
>> Index: art.c
>> ===================================================================
>> RCS file: /cvs/src/sys/net/art.c,v
>> retrieving revision 1.19
>> diff -u -p -r1.19 art.c
>> --- art.c	14 Jun 2016 04:42:02 -0000	1.19
>> +++ art.c	14 Jun 2016 12:36:25 -0000
>> @@ -78,10 +78,13 @@ struct art_node	*art_table_insert(struc
>> 			    int, struct art_node *);
>> struct art_node	*art_table_delete(struct art_root *, struct art_table *,
>> 			    int, struct art_node *);
>> -void			 art_table_ref(struct art_root *, struct art_table *);
>> +struct art_table	*art_table_ref(struct art_root *, struct art_table *);
>> int			 art_table_free(struct art_root *, struct art_table *);
>> int			 art_table_walk(struct art_root *, struct art_table *,
>> 			    int (*f)(struct art_node *, void *), void *);
>> +int			 art_walk_apply(struct art_root *,
>> +			    struct art_node *, struct art_node *,
>> +			    int (*f)(struct art_node *, void *), void *);
>> void			 art_table_gc(void *);
>> void			 art_gc(void *);
>> 
>> @@ -565,10 +568,11 @@ art_table_delete(struct art_root *ar, st
>> 	return (an);
>> }
>> 
>> -void
>> +struct art_table *
>> art_table_ref(struct art_root *ar, struct art_table *at)
>> {
>> 	at->at_refcnt++;
>> +	return (at);
>> }
>> 
>> static inline int
>> @@ -604,42 +608,44 @@ art_table_free(struct art_root *ar, stru
>> int
>> art_walk(struct art_root *ar, int (*f)(struct art_node *, void *), void *arg)
>> {
>> +	struct srp_ref		 sr;
>> 	struct art_table	*at;
>> 	struct art_node		*node;
>> -	int			 error;
>> -
>> -	KERNEL_ASSERT_LOCKED();
>> +	int			 error = 0;
>> 
>> +	KERNEL_LOCK();
>> 	at = srp_get_locked(&ar->ar_root);
>> -	if (at == NULL)
>> -		return (0);
>> +	if (at != NULL) {
>> +		art_table_ref(ar, at);
>> 
>> -	/*
>> -	 * The default route should be processed here because the root
>> -	 * table does not have a parent.
>> -	 */
>> -	node = srp_get_locked(&at->at_default);
>> -	if (node != NULL) {
>> -		error = (*f)(node, arg);
>> -		if (error)
>> -			return (error);
>> +		/*
>> +		 * The default route should be processed here because the root
>> +		 * table does not have a parent.
>> +		 */
>> +		node = srp_enter(&sr, &at->at_default);
>> +		error = art_walk_apply(ar, node, NULL, f, arg);
>> +		srp_leave(&sr);
>> +
>> +		if (error == 0)
>> +			error = art_table_walk(ar, at, f, arg);
>> +
>> +		art_table_free(ar, at);
>> 	}
>> +	KERNEL_UNLOCK();
>> 
>> -	return (art_table_walk(ar, at, f, arg));
>> +	return (error);
>> }
>> 
>> int
>> art_table_walk(struct art_root *ar, struct art_table *at,
>>     int (*f)(struct art_node *, void *), void *arg)
>> {
>> -	struct art_node		*next, *an = NULL;
>> -	struct art_node		*node;
>> +	struct srp_ref		 sr;
>> +	struct art_node		*node, *next;
>> +	struct art_table	*nat;
>> 	int			 i, j, error = 0;
>> 	uint32_t		 maxfringe = (at->at_minfringe << 1);
>> 
>> -	/* Prevent this table to be freed while we're manipulating it. */
>> -	art_table_ref(ar, at);
>> -
>> 	/*
>> 	 * Iterate non-fringe nodes in ``natural'' order.
>> 	 */
>> @@ -651,12 +657,13 @@ art_table_walk(struct art_root *ar, stru
>> 		 */
>> 		for (i = max(j, 2); i < at->at_minfringe; i <<= 1) {
>> 			next = srp_get_locked(&at->at_heap[i >> 1].node);
>> -			an = srp_get_locked(&at->at_heap[i].node);
>> -			if ((an != NULL) && (an != next)) {
>> -				error = (*f)(an, arg);
>> -				if (error)
>> -					goto out;
>> -			}
>> +
>> +			node = srp_enter(&sr, &at->at_heap[i].node);
>> +			error = art_walk_apply(ar, node, next, f, arg);
>> +			srp_leave(&sr);
>> +
>> +			if (error != 0)
>> +				return (error);
>> 		}
>> 	}
>> 
>> @@ -665,28 +672,47 @@ art_table_walk(struct art_root *ar, stru
>> 	 */
>> 	for (i = at->at_minfringe; i < maxfringe; i++) {
>> 		next = srp_get_locked(&at->at_heap[i >> 1].node);
>> -		node = srp_get_locked(&at->at_heap[i].node);
>> -		if (!ISLEAF(node))
>> -			an = srp_get_locked(&SUBTABLE(node)->at_default);
>> -		else
>> -			an = node;
>> 
>> -		if ((an != NULL) && (an != next)) {
>> -			error = (*f)(an, arg);
>> -			if (error)
>> -				goto out;
>> +		node = srp_enter(&sr, &at->at_heap[i].node);
>> +		if (!ISLEAF(node)) {
>> +			nat = art_table_ref(ar, SUBTABLE(node));
>> +			node = srp_follow(&sr, &nat->at_default);
>> +		} else
>> +			nat = NULL;
>> +
>> +		error = art_walk_apply(ar, node, next, f, arg);
>> +		srp_leave(&sr);
>> +
>> +		if (error != 0) {
>> +			art_table_free(ar, nat);
>> +			return (error);
>> 		}
>> 
>> -		if (ISLEAF(node))
>> -			continue;
>> +		if (nat != NULL) {
>> +			error = art_table_walk(ar, nat, f, arg);
>> +			art_table_free(ar, nat);
>> +			if (error != 0)
>> +				return (error);
>> +		}
>> +	}
>> 
>> -		error = art_table_walk(ar, SUBTABLE(node), f, arg);
>> -		if (error)
>> -			break;
>> +	return (0);
>> +}
>> +
>> +int
>> +art_walk_apply(struct art_root *ar,
>> +    struct art_node *an, struct art_node *next,
>> +    int (*f)(struct art_node *, void *), void *arg)
>> +{
>> +	int error = 0;
>> +
>> +	if ((an != NULL) && (an != next)) {
>> +		/* this assumes an->an_dst is not used by f */
>> +		KERNEL_UNLOCK();
>> +		error = (*f)(an, arg);
>> +		KERNEL_LOCK();
>> 	}
>> 
>> -out:
>> -	art_table_free(ar, at);
>> 	return (error);
>> }
>> 
>> Index: rtable.c
>> ===================================================================
>> RCS file: /cvs/src/sys/net/rtable.c,v
>> retrieving revision 1.47
>> diff -u -p -r1.47 rtable.c
>> --- rtable.c	14 Jun 2016 04:42:02 -0000	1.47
>> +++ rtable.c	14 Jun 2016 12:36:25 -0000
>> @@ -853,16 +853,16 @@ struct rtable_walk_cookie {
>> int
>> rtable_walk_helper(struct art_node *an, void *xrwc)
>> {
>> +	struct srp_ref			 sr;
>> 	struct rtable_walk_cookie	*rwc = xrwc;
>> -	struct rtentry			*rt, *nrt;
>> +	struct rtentry			*rt;
>> 	int				 error = 0;
>> 
>> -	KERNEL_ASSERT_LOCKED();
>> -
>> -	SRPL_FOREACH_SAFE_LOCKED(rt, &an->an_rtlist, rt_next, nrt) {
>> +	SRPL_FOREACH(rt, &sr, &an->an_rtlist, rt_next) {
>> 		if ((error = (*rwc->rwc_func)(rt, rwc->rwc_arg, rwc->rwc_rid)))
>> 			break;
>> 	}
>> +	SRPL_LEAVE(&sr);
>> 
>> 	return (error);
>> }
>>