Adam Thompson reported a bgpd crash he sees in 6.6. Using
kern.nosuidcoredump=3 he was able to get me a back trace of the crash.
The RDE chokes on the TAILQ_REMOVE in nexthop_runner() which indicates
that the nexthop_runners list gets corrupted.
After staring at the code for a while I realized that it is possible that
a nexthop is put on the runner list even though there are no objects to
process. In this case if nexthop_unref() is called before the
nexthop_runner() had a chance to run and remove that nexthop the queue
will be corrupt because of a use-after-free of that element.
Fix is simple, check before enqueuing a nexthop on the nexthop_runners
queue that next_prefix is not NULL.
The 2nd hunk just adds a debug log in the case where a prefix removal
actually completes the nexthop run.
OK?
--
:wq Claudio
Index: rde_rib.c
===================================================================
RCS file: /cvs/src/usr.sbin/bgpd/rde_rib.c,v
retrieving revision 1.214
diff -u -p -r1.214 rde_rib.c
--- rde_rib.c 10 Jan 2020 14:52:57 -0000 1.214
+++ rde_rib.c 23 Jan 2020 03:59:09 -0000
@@ -1800,8 +1800,11 @@ nexthop_update(struct kroute_nexthop *ms
nh->nexthop_netlen = msg->netlen;
nh->next_prefix = LIST_FIRST(&nh->prefix_h);
- TAILQ_INSERT_HEAD(&nexthop_runners, nh, runner_l);
- log_debug("nexthop %s update starting", log_addr(&nh->exit_nexthop));
+ if (nh->next_prefix != NULL) {
+ TAILQ_INSERT_HEAD(&nexthop_runners, nh, runner_l);
+ log_debug("nexthop %s update starting",
+ log_addr(&nh->exit_nexthop));
+ }
}
void
@@ -1860,8 +1863,11 @@ nexthop_unlink(struct prefix *p)
if (p == p->nexthop->next_prefix) {
p->nexthop->next_prefix = LIST_NEXT(p, entry.list.nexthop);
/* remove nexthop from list if no prefixes left to update */
- if (p->nexthop->next_prefix == NULL)
+ if (p->nexthop->next_prefix == NULL) {
TAILQ_REMOVE(&nexthop_runners, p->nexthop, runner_l);
+ log_debug("nexthop %s update finished",
+ log_addr(&p->nexthop->exit_nexthop));
+ }
}
p->flags &= ~PREFIX_NEXTHOP_LINKED;