On 12/08/15(Wed) 17:03, Martin Pieuchot wrote: > I'm currently working on the routing table interface to make is safe > to use by multiple CPUs at the same time. The diff below is a big > step in this direction and I'd really appreciate if people could test > it with their usual network setup and report back.
Thanks to a nice report from Hrvoje Popovski, I found a bug in my diff. Please find a corrected version below. The changes with the previous diff are: - nhrt = rt_match(rt->rt_gateway, flags, rtableid); + nhrt = rt_match(rt->rt_gateway, flags | RT_RESOLVE, rtableid); if (nhrt == NULL) return (rt); [...] - if ((nhrt->rt_flags & (RTF_UP|RTF_GATEWAY)) != RTF_UP) { + if ((nhrt->rt_flags & (RTF_UP|RTF_CLONING|RTF_GATEWAY)) != RTF_UP) { rtfree(nhrt); return (rt); } > The goal of this diff is to "cache" the route corresponding to "your" > next hop as early as possible. Let's assume you're using a common > dhcp-based network: > > mpi@goiaba $ netstat -rnf inet|egrep "(default|Dest)" > Destination Gateway Flags Refs Use Mtu Prio Iface > default 192.168.0.1 UGS 5 508 - 8 em0 > > Here my default route points to a gateway (G) whose address is > 192.168.0.1. In such setup your computer generally sends most of the > packets to the internet through this gateway. But to do that it needs > more informations: > > mpi@goiaba $ netstat -rnf inet|egrep "(192.168.0.1.*L|Dest)" > Destination Gateway Flags Refs Use Mtu Prio Iface > 192.168.0.1 bc:05:43:bd:3e:29 UHLc 1 149 - 8 em0 > > Yes this is another route. This one contains link-layer informations > (L) and has been cloned (c). This route is what I described before as > "your" next hop. In this case, "your" is a shortcut for "the next hop > of your default route" but all of this is valid for any route pointing > to a gateway (G). > > In order to send packets via my default route, the kernel needs to know > the link-layer address corresponding to the IP address of the gateway. > This is called "Address Resolution" in network jargon. In OpenBSD, > resolved addresses appear in the routing table with a link-layer address > in the "Gateway" field, as shown previously. > > This resolution is done in the kernel by calling rtalloc(9) with the > RT_RESOLVE flag for the wanted destination, in my case 192.168.0.1. > Once the resolution is complete, a corresponding entry appears in the > routing table and there's no need to redo it for a certain period of > time. That is what I meant with "cache". > > Currently this resolution is done "late" in the journey of a packet and > that's fine since it is not done often. Late means that it is done when > the packet reaches a L2 output function, nd6_output() or arpresolve(). > > The problem is that having a proper reference count on route entries in > these functions is really complicated because you can end up using 3 > different routes. So this diff starts the resolution early: as soon as > a gateway route is returned by rtalloc(9). > > It also makes sense to do the resolution as soon as possible since we > need the link-layer address to send the packet. > > One important point: gateway routes (rt_gwroute) are only returned to > the stack in L2 functions and when that happens, their reference > counter is not incremented. That's why the reference count for such > routes is almost always 1. They are the simplest example of working > route caching in our kernel*. That means that when you purge your > cloned route, rt_gwroute will still be valid but marked as RTP_DOWN > until a new resolution is started. > > This diff changes rt_checkgate() to only do sanity checks (finally!). > > Do not hesitate to ask questions if something is not clear, I believe > it's important that more people understand this. > > Note that this diff includes other bits to be committed separately: > > - Deprecate the use of RTF_XRESOLVE in rtalloc(9) > - Remove PF_KEY-specific code & comments now that SPD lookups no > longer use rtalloc(9). > - Make rtfree(9) accept NULL > > > * That's why I'm slowly killing "struct route" & friends to use the > simplest route caching mechanism everywhere. Index: net/route.c =================================================================== RCS file: /cvs/src/sys/net/route.c,v retrieving revision 1.217 diff -u -p -r1.217 route.c --- net/route.c 18 Jul 2015 15:51:16 -0000 1.217 +++ net/route.c 13 Aug 2015 12:09:58 -0000 @@ -153,6 +153,7 @@ int rtable_alloc(void ***, u_int); int rtflushclone1(struct rtentry *, void *, u_int); void rtflushclone(unsigned int, struct rtentry *); int rt_if_remove_rtdelete(struct rtentry *, void *, u_int); +struct rtentry *rt_match(struct sockaddr *, int, unsigned int); struct ifaddr *ifa_ifwithroute(int, struct sockaddr *, struct sockaddr *, u_int); @@ -297,17 +298,31 @@ rtable_exists(u_int id) /* verify table return (1); } +/* + * Do the actual lookup for rtalloc(9), do not use directly! + * + * Return the best matching entry for the destination ``dst''. + * + * "RT_RESOLVE" means that a corresponding L2 entry should + * be added to the routing table and resolved (via ARP or + * NDP), if it does not exist. + * + * "RT_REPORT" indicates that a message should be sent to + * userland if no matching route has been found or if an + * error occured while adding a L2 entry. + */ struct rtentry * -rtalloc(struct sockaddr *dst, int flags, unsigned int tableid) +rt_match(struct sockaddr *dst, int flags, unsigned int tableid) { struct rtentry *rt; struct rtentry *newrt = 0; struct rt_addrinfo info; - int s = splsoftnet(), err = 0, msgtype = RTM_MISS; + int s, err = 0; bzero(&info, sizeof(info)); info.rti_info[RTAX_DST] = dst; + s = splsoftnet(); rt = rtable_match(tableid, dst); if (rt != NULL) { newrt = rt; @@ -319,28 +334,15 @@ rtalloc(struct sockaddr *dst, int flags, rt->rt_refcnt++; goto miss; } - if ((rt = newrt) && (rt->rt_flags & RTF_XRESOLVE)) { - msgtype = RTM_RESOLVE; - goto miss; - } /* Inform listeners of the new route */ rt_sendmsg(rt, RTM_ADD, tableid); } else rt->rt_refcnt++; } else { - if (dst->sa_family != PF_KEY) - rtstat.rts_unreach++; - /* - * IP encapsulation does lots of lookups where we don't need nor want - * the RTM_MISSes that would be generated. It causes RTM_MISS storms - * sent upward breaking user-level routing queries. - */ + rtstat.rts_unreach++; miss: - if (ISSET(flags, RT_REPORT) && dst->sa_family != PF_KEY) { - bzero((caddr_t)&info, sizeof(info)); - info.rti_info[RTAX_DST] = dst; - rt_missmsg(msgtype, &info, 0, NULL, err, tableid); - } + if (ISSET(flags, RT_REPORT)) + rt_missmsg(RTM_MISS, &info, 0, NULL, err, tableid); } splx(s); return (newrt); @@ -374,12 +376,82 @@ rtalloc_mpath(struct sockaddr *dst, uint } #endif /* SMALL_KERNEL */ +/* + * Look in the routing table for the best matching entry for + * ``dst''. + * + * If a route with a gateway is found and its next hop is no + * longer valid, try to cache it. + */ +struct rtentry * +rtalloc(struct sockaddr *dst, int flags, unsigned int rtableid) +{ + struct rtentry *rt, *nhrt; + + rt = rt_match(dst, flags, rtableid); + + /* No match or route to host? We're done. */ + if (rt == NULL || (rt->rt_flags & RTF_GATEWAY) == 0) + return (rt); + + nhrt = rt->rt_gwroute; + + /* Nothing to do if the next hop is valid. */ + if (nhrt != NULL && (nhrt->rt_flags & RTF_UP)) + return (rt); + + rtfree(rt->rt_gwroute); + rt->rt_gwroute = NULL; + + /* + * If we cannot find a valid next hop, return the route + * with a gateway. + * + * Some dragons hiding in the tree certainly depends on + * this behavior. + */ + nhrt = rt_match(rt->rt_gateway, flags | RT_RESOLVE, rtableid); + if (nhrt == NULL) + return (rt); + + /* + * Next hop must be reachable, this also prevents rtentry + * loops for example when rt->rt_gwroute points to rt. + */ + if ((nhrt->rt_flags & (RTF_UP|RTF_CLONING|RTF_GATEWAY)) != RTF_UP) { + rtfree(nhrt); + return (rt); + } + + /* + * Next hop entry MUST be on the same interface. + * + * XXX We could use a KASSERT() here if routes with dangling + * ``ifa'' pointers were dropped. + */ + if (nhrt->rt_ifp != rt->rt_ifp) { + rtfree(nhrt); + return (rt); + } + + /* + * If the MTU of next hop is 0, this will reset the MTU of the + * route to run PMTUD again from scratch. + */ + if (!ISSET(rt->rt_locks, RTV_MTU) && (rt->rt_mtu > nhrt->rt_mtu)) + rt->rt_mtu = nhrt->rt_mtu; + + rt->rt_gwroute = nhrt; + return (rt); +} + void rtfree(struct rtentry *rt) { struct ifaddr *ifa; - KASSERT(rt != NULL); + if (rt == NULL) + return; rt->rt_refcnt--; @@ -526,7 +598,7 @@ create: rt->rt_flags |= RTF_MODIFIED; flags |= RTF_MODIFIED; stat = &rtstat.rts_newgateway; - rt_setgate(rt, gateway, rdomain); + rt_setgate(rt, gateway); } } else error = EHOSTUNREACH; @@ -983,8 +1055,7 @@ rtrequest1(int req, struct rt_addrinfo * * the routing table because the radix MPATH code use * it to (re)order routes. */ - if ((error = rt_setgate(rt, info->rti_info[RTAX_GATEWAY], - tableid))) { + if ((error = rt_setgate(rt, info->rti_info[RTAX_GATEWAY]))) { free(ndst, M_RTABLE, dlen); pool_put(&rtentry_pool, rt); return (error); @@ -1035,7 +1106,7 @@ rtrequest1(int req, struct rt_addrinfo * } int -rt_setgate(struct rtentry *rt, struct sockaddr *gate, unsigned int tableid) +rt_setgate(struct rtentry *rt, struct sockaddr *gate) { int glen = ROUNDUP(gate->sa_len); struct sockaddr *sa; @@ -1053,22 +1124,7 @@ rt_setgate(struct rtentry *rt, struct so rtfree(rt->rt_gwroute); rt->rt_gwroute = NULL; } - if (rt->rt_flags & RTF_GATEWAY) { - /* XXX is this actually valid to cross tables here? */ - rt->rt_gwroute = rtalloc(gate, RT_REPORT|RT_RESOLVE, tableid); - /* - * If we switched gateways, grab the MTU from the new - * gateway route if the current MTU is 0 or greater - * than the MTU of gateway. - * Note that, if the MTU of gateway is 0, we will reset the - * MTU of the route to run PMTUD again from scratch. XXX - */ - if (rt->rt_gwroute && !(rt->rt_rmx.rmx_locks & RTV_MTU) && - rt->rt_rmx.rmx_mtu && - rt->rt_rmx.rmx_mtu > rt->rt_gwroute->rt_rmx.rmx_mtu) { - rt->rt_rmx.rmx_mtu = rt->rt_gwroute->rt_rmx.rmx_mtu; - } - } + return (0); } @@ -1080,28 +1136,21 @@ rt_checkgate(struct ifnet *ifp, struct r KASSERT(rt != NULL); - if ((rt->rt_flags & RTF_UP) == 0) { - rt = rtalloc(dst, RT_REPORT|RT_RESOLVE, rtableid); - if (rt == NULL) - return (EHOSTUNREACH); - rt->rt_refcnt--; - if (rt->rt_ifp != ifp) - return (EHOSTUNREACH); - } + if ((rt->rt_flags & RTF_UP) == 0) + return (EHOSTUNREACH); rt0 = rt; if (rt->rt_flags & RTF_GATEWAY) { - if (rt->rt_gwroute && !(rt->rt_gwroute->rt_flags & RTF_UP)) { + if (rt->rt_gwroute == NULL) + return (EHOSTUNREACH); + + if ((rt->rt_gwroute->rt_flags & RTF_UP) == 0) { rtfree(rt->rt_gwroute); rt->rt_gwroute = NULL; + return (EHOSTUNREACH); } - if (rt->rt_gwroute == NULL) { - rt->rt_gwroute = rtalloc(rt->rt_gateway, - RT_REPORT|RT_RESOLVE, rtableid); - if (rt->rt_gwroute == NULL) - return (EHOSTUNREACH); - } + /* * Next hop must be reachable, this also prevents rtentry * loops, for example when rt->rt_gwroute points to rt. Index: net/route.h =================================================================== RCS file: /cvs/src/sys/net/route.h,v retrieving revision 1.109 diff -u -p -r1.109 route.h --- net/route.h 18 Jul 2015 15:51:16 -0000 1.109 +++ net/route.h 12 Aug 2015 12:20:45 -0000 @@ -111,6 +111,8 @@ struct rtentry { }; #define rt_use rt_rmx.rmx_pksent #define rt_expire rt_rmx.rmx_expire +#define rt_locks rt_rmx.rmx_locks +#define rt_mtu rt_rmx.rmx_mtu #define RTF_UP 0x1 /* route usable */ #define RTF_GATEWAY 0x2 /* destination is a gateway */ @@ -354,7 +356,7 @@ void rt_sendmsg(struct rtentry *, int, void rt_sendaddrmsg(struct rtentry *, int); void rt_missmsg(int, struct rt_addrinfo *, int, struct ifnet *, int, u_int); -int rt_setgate(struct rtentry *, struct sockaddr *, unsigned int); +int rt_setgate(struct rtentry *, struct sockaddr *); int rt_checkgate(struct ifnet *, struct rtentry *, struct sockaddr *, unsigned int, struct rtentry **); void rt_setmetrics(u_long, struct rt_metrics *, struct rt_kmetrics *); Index: net/rtsock.c =================================================================== RCS file: /cvs/src/sys/net/rtsock.c,v retrieving revision 1.166 diff -u -p -r1.166 rtsock.c --- net/rtsock.c 18 Jul 2015 21:58:06 -0000 1.166 +++ net/rtsock.c 12 Aug 2015 12:20:45 -0000 @@ -748,9 +748,8 @@ report: info.rti_info[RTAX_GATEWAY]->sa_len)) { newgate = 1; } - if (info.rti_info[RTAX_GATEWAY] != NULL && - (error = rt_setgate(rt, info.rti_info[RTAX_GATEWAY], - tableid))) + if (info.rti_info[RTAX_GATEWAY] != NULL && (error = + rt_setgate(rt, info.rti_info[RTAX_GATEWAY]))) goto flush; /* * new gateway could require new ifaddr, ifp;