Re: bgpd async nexthop update loop
ok!

Claudio Jeker(cje...@diehard.n-r-g.com) on 2019.06.17 23:08:37 +0200:
> On Mon, Jun 17, 2019 at 10:20:57PM +0200, Sebastian Benoit wrote:
> > reads ok, one comment.
>
> Fixed.
>
> [...]
Re: bgpd async nexthop update loop
On Mon, Jun 17, 2019 at 10:20:57PM +0200, Sebastian Benoit wrote:
> Claudio Jeker(cje...@diehard.n-r-g.com) on 2019.06.17 21:34:30 +0200:
> > [...]
>
> reads ok, one comment.

Fixed.

> > I'm running this on a few routers with good success. Please check if you see
> > a) worse convergence time (updates propagating much slower)
> > b) better responsiveness (to e.g. bgpctl commands)
>
> i see things much faster, actually too fast, i can't meaningfully test
> bgpctl responsiveness.

You just need around 500 peers which are affected by a big nexthop update :)

> > There is some potential for tuning but I would prefer to do this at a
> > later stage.
>
> sure. one question about that: is the TAILQ_INSERT_TAIL/TAILQ_INSERT_HEAD
> optimal?

The idea is to make this like a ring. Start with the latest arrival
first and then process one after the other. This makes sure that every
entry gets about the same amount of processing time. The TAILQ_INSERT_HEAD
in nexthop_update() will reduce initial latency for nexthops with only
a few prefixes, since they will be done in one nexthop_runner() call.

There are also additional interactions with the other runners
(imsg processing, rib_dump_runner and rde_update_queue_runner) which I
need to profile to understand them better. It looks like
rde_dispatch_imsg_session() is running ahead and produces a lot of work
that can't be absorbed that quickly by the other runners. At least that is
the case in the big route server setup that I'm working with.
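To make the ring behaviour described above concrete, here is a minimal sketch of the queue discipline. This is an illustration only, not the committed bgpd code: the rde_rib.c part of the diff is not quoted in this thread, so the queue head nexthop_runners, the helper nexthop_first_prefix() and the enqueue/requeue functions below are assumed names.

/*
 * Sketch of the runner-queue discipline: new work goes to the HEAD,
 * unfinished work rotates to the TAIL.  Names are assumptions.
 */
#include <sys/queue.h>
#include <stddef.h>

struct prefix;				/* opaque for this sketch */

struct nexthop {
	TAILQ_ENTRY(nexthop)	 runner_l;	/* matches the rde.h diff */
	struct prefix		*next_prefix;	/* runner's resume point */
	/* ... other members of struct nexthop in rde.h ... */
};

/* hypothetical helper standing in for bgpd's prefix list handling */
struct prefix	*nexthop_first_prefix(struct nexthop *);

TAILQ_HEAD(, nexthop) nexthop_runners = TAILQ_HEAD_INITIALIZER(nexthop_runners);

/*
 * Called when a nexthop changes state: queue it at the HEAD so a nexthop
 * with only a few prefixes is usually finished within one nexthop_runner()
 * call, keeping its initial latency low.  The real code tracks state
 * transitions (see oldstate in the rde.h diff); the next_prefix check here
 * is a simplification.
 */
static void
nexthop_enqueue(struct nexthop *nh)
{
	if (nh->next_prefix != NULL)
		return;		/* already queued, the runner will get to it */
	nh->next_prefix = nexthop_first_prefix(nh);
	if (nh->next_prefix != NULL)
		TAILQ_INSERT_HEAD(&nexthop_runners, nh, runner_l);
}

/*
 * When the runner exhausts its per-call budget on a nexthop that still has
 * prefixes left, it rotates that nexthop to the TAIL, so every queued
 * nexthop gets roughly the same share of processing time.
 */
static void
nexthop_requeue(struct nexthop *nh)
{
	TAILQ_REMOVE(&nexthop_runners, nh, runner_l);
	if (nh->next_prefix != NULL)
		TAILQ_INSERT_TAIL(&nexthop_runners, nh, runner_l);
}

The asymmetry (head on enqueue, tail on requeue) is what makes this behave like a ring: fresh, typically small jobs get low latency once, and anything long-running rotates fairly with the rest.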
Re: bgpd async nexthop update loop
Claudio Jeker(cje...@diehard.n-r-g.com) on 2019.06.17 21:34:30 +0200:
> Hi,
>
> Now that the community rewrite is in, here is another diff to make bgpd
> more awesome. There was one loop left that did not run asynchronously
> compared to the main event loop. This loop was in nexthop_update(). Now a
> route server or a big route reflector would be hit hard by this loop,
> since it will send out many updates when a nexthop becomes valid. I have
> seen times where the RDE was busy for 15min in nexthop_update().
>
> This diff changes nexthop_update() to do this work asynchronously and
> therefore returns to the main event loop much more often.
> Nexthops now have a prefix pointer next_prefix which is used for looping
> through all prefixes. The for loop is broken up into steps of
> RDE_RUNNER_ROUNDS prefixes which should hopefully return quickly.
>
> Additionally, prefixes belonging to a RIB that is not evaluated (no
> decision process running) are no longer linked into this nexthop list,
> which should also shorten the loop a bit (at least Adj-RIB-In and
> Adj-RIB-Out are now skipped). To make this work correctly the linking and
> unlinking of prefixes has been adjusted to always work in the right order
> (nexthop last on link and first on unlink).

reads ok, one comment.

> When testing please watch out for this message:
> prefix_updateall: prefix with F_RIB_NOEVALUATE hit
> This should never happen.
>
> I'm running this on a few routers with good success. Please check if you see
> a) worse convergence time (updates propagating much slower)
> b) better responsiveness (to e.g. bgpctl commands)

i see things much faster, actually too fast, i can't meaningfully test
bgpctl responsiveness.

> There is some potential for tuning but I would prefer to do this at a
> later stage.

sure. one question about that: is the TAILQ_INSERT_TAIL/TAILQ_INSERT_HEAD
optimal?

> --
> :wq Claudio
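The responsiveness difference reported here comes from how the diff (quoted in full in the original message below) wires the new runner into rde_main(): whenever any runner still has queued work, poll() is called with a zero timeout, so the RDE collects pending I/O without blocking and immediately gets back to the runners. A generic sketch of that pattern follows; work_pending() and work_runner() are placeholders, not bgpd functions.

/*
 * Generic "queued work => zero poll timeout" event-loop pattern, as used
 * by the rde_main() hunks in the diff below.  Placeholder names only.
 */
#include <errno.h>
#include <poll.h>

extern int	work_pending(void);	/* any deferred work queued? */
extern void	work_runner(void);	/* perform one bounded slice of it */

void
event_loop(struct pollfd *pfd, int nfds)
{
	int	timeout;

	for (;;) {
		/* block in poll() only when all work queues are drained */
		timeout = work_pending() ? 0 : -1;

		if (poll(pfd, nfds, timeout) == -1) {
			if (errno != EINTR)
				return;		/* bgpd would fatal() here */
		}

		/* ... handle ready file descriptors ... */

		/* then run a bounded slice of the deferred work */
		work_runner();
	}
}

bgpd applies this idea to three work queues (RIB dumps, update generation and now nexthop updates), which is why rde_main() checks rib_dump_pending(), rde_update_queue_pending() and nexthop_pending() before choosing the timeout.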
bgpd async nexthop update loop
Hi,

Now that the community rewrite is in, here is another diff to make bgpd
more awesome. There was one loop left that did not run asynchronously
compared to the main event loop. This loop was in nexthop_update(). Now a
route server or a big route reflector would be hit hard by this loop,
since it will send out many updates when a nexthop becomes valid. I have
seen times where the RDE was busy for 15min in nexthop_update().

This diff changes nexthop_update() to do this work asynchronously and
therefore returns to the main event loop much more often.
Nexthops now have a prefix pointer next_prefix which is used for looping
through all prefixes. The for loop is broken up into steps of
RDE_RUNNER_ROUNDS prefixes which should hopefully return quickly.

Additionally, prefixes belonging to a RIB that is not evaluated (no
decision process running) are no longer linked into this nexthop list,
which should also shorten the loop a bit (at least Adj-RIB-In and
Adj-RIB-Out are now skipped). To make this work correctly the linking and
unlinking of prefixes has been adjusted to always work in the right order
(nexthop last on link and first on unlink).

When testing please watch out for this message:
prefix_updateall: prefix with F_RIB_NOEVALUATE hit
This should never happen.

I'm running this on a few routers with good success. Please check if you see
a) worse convergence time (updates propagating much slower)
b) better responsiveness (to e.g. bgpctl commands)

There is some potential for tuning but I would prefer to do this at a
later stage.

--
:wq Claudio

Index: rde.c
===================================================================
RCS file: /cvs/src/usr.sbin/bgpd/rde.c,v
retrieving revision 1.470
diff -u -p -r1.470 rde.c
--- rde.c	17 Jun 2019 13:35:43 -0000	1.470
+++ rde.c	17 Jun 2019 13:56:16 -0000
@@ -256,9 +256,6 @@ rde_main(int debug, int verbose)
 		set_pollfd(&pfd[PFD_PIPE_SESSION], ibuf_se);
 		set_pollfd(&pfd[PFD_PIPE_SESSION_CTL], ibuf_se_ctl);
 
-		if (rib_dump_pending() || rde_update_queue_pending())
-			timeout = 0;
-
 		i = PFD_PIPE_COUNT;
 		for (mctx = LIST_FIRST(&rde_mrts); mctx != 0; mctx = xmctx) {
 			xmctx = LIST_NEXT(mctx, entry);
@@ -275,6 +272,10 @@ rde_main(int debug, int verbose)
 			}
 		}
 
+		if (rib_dump_pending() || rde_update_queue_pending() ||
+		    nexthop_pending())
+			timeout = 0;
+
 		if (poll(pfd, i, timeout) == -1) {
 			if (errno != EINTR)
 				fatal("poll error");
@@ -311,12 +312,13 @@ rde_main(int debug, int verbose)
 			mctx = LIST_NEXT(mctx, entry);
 		}
 
+		rib_dump_runner();
+		nexthop_runner();
 		if (ibuf_se && ibuf_se->w.queued < SESS_MSG_HIGH_MARK) {
 			rde_update_queue_runner();
 			for (aid = AID_INET6; aid < AID_MAX; aid++)
 				rde_update6_queue_runner(aid);
 		}
-		rib_dump_runner();
 	}
 
 	/* do not clean up on shutdown on production, it takes ages. */
Index: rde.h
===================================================================
RCS file: /cvs/src/usr.sbin/bgpd/rde.h,v
retrieving revision 1.214
diff -u -p -r1.214 rde.h
--- rde.h	17 Jun 2019 13:35:43 -0000	1.214
+++ rde.h	17 Jun 2019 13:56:16 -0000
@@ -228,12 +228,15 @@ struct rde_aspath {
 enum nexthop_state {
 	NEXTHOP_LOOKUP,
 	NEXTHOP_UNREACH,
-	NEXTHOP_REACH
+	NEXTHOP_REACH,
+	NEXTHOP_FLAPPED
 };
 
 struct nexthop {
 	LIST_ENTRY(nexthop)	nexthop_l;
+	TAILQ_ENTRY(nexthop)	runner_l;
 	struct prefix_list	prefix_h;
+	struct prefix		*next_prefix;
 	struct bgpd_addr	exit_nexthop;
 	struct bgpd_addr	true_nexthop;
 	struct bgpd_addr	nexthop_net;
@@ -246,6 +249,7 @@ struct nexthop {
 #endif
 	int			refcnt;
 	enum nexthop_state	state;
+	enum nexthop_state	oldstate;
 	u_int8_t		nexthop_netlen;
 	u_int8_t		flags;
 #define NEXTHOP_CONNECTED	0x01
@@ -593,6 +597,8 @@ prefix_vstate(struct prefix *p)
 
 void		 nexthop_init(u_int32_t);
 void		 nexthop_shutdown(void);
+int		 nexthop_pending(void);
+void		 nexthop_runner(void);
 void		 nexthop_modify(struct nexthop *, enum action_types, u_int8_t,
 		     struct nexthop **, u_int8_t *);
 void		 nexthop_link(struct prefix *);
Index: rde_rib.c
===================================================================
RCS file: /cvs/src/usr.sbin/bgpd/rde_rib.c,v
retrieving revision 1.191
diff -u -p -r1.191 rde_rib.c
--- rde_rib.c	17 Jun 2019 11:02:20 -0000	1.191
+++ rde_rib.c	17 Jun 2019
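The new nexthop_pending()/nexthop_runner() functions live in rde_rib.c, and the quoted diff is cut off before that file's hunks. Purely as an illustration of the mechanism described in the mail, here is a minimal sketch of a bounded runner that walks a nexthop's prefix list in RDE_RUNNER_ROUNDS-sized slices and remembers its position in next_prefix. The queue head, the budget value and the prefix helpers are assumptions, not the committed bgpd code.

/*
 * Sketch only: a bounded nexthop runner in the spirit of the diff above.
 * RDE_RUNNER_ROUNDS, the queue head and the prefix helpers are assumed.
 */
#include <sys/queue.h>
#include <stddef.h>

#define RDE_RUNNER_ROUNDS	100	/* assumed per-call work budget */

struct prefix;				/* opaque in this sketch */

struct nexthop {
	TAILQ_ENTRY(nexthop)	 runner_l;	/* as added to rde.h above */
	struct prefix		*next_prefix;	/* as added to rde.h above */
};

TAILQ_HEAD(, nexthop) nexthop_runners = TAILQ_HEAD_INITIALIZER(nexthop_runners);

/* hypothetical helpers standing in for bgpd's prefix handling */
struct prefix	*prefix_next_on_nexthop(struct prefix *);
void		 prefix_evaluate_nexthop(struct prefix *);

int
nexthop_pending(void)
{
	return !TAILQ_EMPTY(&nexthop_runners);
}

void
nexthop_runner(void)
{
	struct nexthop	*nh;
	struct prefix	*p;
	int		 i;

	nh = TAILQ_FIRST(&nexthop_runners);
	if (nh == NULL)
		return;

	/* work on at most RDE_RUNNER_ROUNDS prefixes, then yield to poll() */
	p = nh->next_prefix;
	for (i = 0; p != NULL && i < RDE_RUNNER_ROUNDS; i++) {
		prefix_evaluate_nexthop(p);
		p = prefix_next_on_nexthop(p);
	}
	nh->next_prefix = p;

	TAILQ_REMOVE(&nexthop_runners, nh, runner_l);
	if (nh->next_prefix != NULL)
		/* not finished: rotate to the tail so other nexthops run too */
		TAILQ_INSERT_TAIL(&nexthop_runners, nh, runner_l);
}

With the pending check wired into the poll() timeout as in the rde.c hunk above, the RDE keeps returning here until next_prefix has run off the end of every queued nexthop, instead of spending minutes inside a single nexthop_update() call.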