Re: [PATCH] Fixing soft NFS umount -f, round 5
On Wed, Jul 08, 2015 at 06:07:11PM +, Emmanuel Dreyfus wrote:
> http://ftp.espci.fr/shadow/manu/umount_f4.patch
>
> With default values, timeout = 3 and retrans = 10, we now wait
> for ages for the unmount to complete. In the umount -f case
> it does not make sense to wait for too long.

Here is an improved version that still uses your proposed timeouts:
http://ftp.espci.fr/shadow/manu/umount_f5.patch

I introduced an NFS mount private flag, NFSMNT_DISMNTFORCE, that informs
the NFS subsystem that we want to forcibly unmount the filesystem, which
can be done at the expense of cutting timeout corners.

An important point is that with the flag set, we do not attempt any NFS
server reconnect. This makes sense since dounmount() calls VFS_SYNC()
before VFS_UNMOUNT(). NFSMNT_DISMNTFORCE being set in VFS_UNMOUNT() /
nfs_unmount(), we already made reconnection attempts within VFS_SYNC(),
hence there is no benefit in doing it again.

Now with that change, in the "umount -f while the NFS server is gone"
case, we spend 20s in VFS_SYNC(), and VFS_UNMOUNT() returns in less than
a second.

I think it would be nice to also cut corners in VFS_SYNC() when the
force flag is used, but that touches filesystem-independent code, in
dounmount():

        if ((mp->mnt_flag & MNT_RDONLY) == 0) {
                error = VFS_SYNC(mp, MNT_WAIT, l->l_cred);
        }
        if (error == 0 || (flags & MNT_FORCE)) {
                error = VFS_UNMOUNT(mp, flags);
        }

The first VFS_SYNC() makes us wait even if MNT_FORCE is set. This could
be solved by adding an IMNT_UMOUNTFORCE to struct mount's mnt_iflag,
just like I did for NFS in this patch. The flag would instruct the
underlying filesystem that a forced unmount is required and that a fast
return is expected.

Opinions?

-- 
Emmanuel Dreyfus
m...@netbsd.org
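A minimal user-space sketch of the IMNT_UMOUNTFORCE idea proposed above. Only the flag name comes from the mail; the flag values and the reduced struct mount are made up for illustration:

```c
#include <assert.h>

/* Flag values are illustrative, not the real NetBSD ones. */
#define MNT_RDONLY        0x00000001
#define MNT_FORCE         0x00080000
#define IMNT_UMOUNTFORCE  0x00000100  /* proposed mnt_iflag bit */

struct mount {
	int mnt_flag;
	int mnt_iflag;
};

/*
 * Sketch of the proposed change to dounmount(): publish the force
 * request in mnt_iflag *before* VFS_SYNC() runs, so the filesystem
 * can cut timeout corners during the pre-unmount sync as well.
 */
static void
dounmount_set_force(struct mount *mp, int flags)
{
	if (flags & MNT_FORCE)
		mp->mnt_iflag |= IMNT_UMOUNTFORCE;
	/* ... then VFS_SYNC() and VFS_UNMOUNT() as in the quoted code ... */
}
```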
Re: Improving use of rt_refcnt
On Sun, Jul 05, 2015 at 10:33:02PM -0700, Dennis Ferguson wrote:
> If you don't want it to work this way then you'll need to replace the
> radix tree with something that permits changes while readers are
> concurrently operating. To take best advantage of a more modern data
> structure, however, you are still not going to want readers to ever
> write the shared data structure if that can be avoided. The two
> atomic operations needed to increment and decrement a reference count
> greatly exceed the cost of a (well-cached) route lookup.

Let me pick the discussion up at this point since David mentioned that
my last reply was somewhat terse.

I think the current radix tree serves three different purposes right now:

(1) Manage the view of the connectivity to the outside world in a way
    coherent with the administrator's intention or some routing
    protocol/daemon.
(2) Provide a mechanism for finding the next hop for traffic to
    networks that are not directly attached.
(3) Provide a mechanism for finding L2 addresses on directly attached
    networks.

Using a single data structure for this has the advantage of code
sharing and can make detailed accounting very easy. It has the problem
of overhead and of mixing data of different levels of volatility. I
would like to see the three mechanisms separated, with appropriate data
structures for each case.

The first point would be moved completely out of the hot path, the
actual packet handling. It would then no longer be as performance
sensitive, so storage options can focus more on size.

For finding the next hop, the problem is simplified. The number of
next-hop addresses is (normally) limited by the size of the network
neighborhood. Even a backbone router at one of the major Internet
exchange points will not have more than a few thousand next hops,
compared to having 200k routes or more. This can be exploited to reduce
the data size of the BMP (best matching prefix) lookup structure by
removing redundant entries, e.g. a longer prefix with the same next hop
as a shorter prefix.

As I mentioned in my earlier email, the next-hop entry can and should
store a reference to whatever L2 data is needed, so that no additional
search is required.

For the L3->L2 address mapping, the problem changes from a BMP search
to an exact-match search. If the mapping is managed correctly, it makes
sense to do this (cheap) search first and skip the whole BMP lookup as
redundant on a match. Hash tables and the like also have nice
properties for read-mostly updates and cache density.

Joerg
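The exact-match-first idea can be illustrated with a toy neighbor table. This is a sketch under assumed names only (single-slot hash buckets, no collision handling, no expiry); nothing here is an existing kernel structure:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical L3 -> L2 neighbor entry for a directly attached host. */
struct neigh {
	uint32_t ip;       /* IPv4 address, host byte order */
	uint8_t  mac[6];   /* resolved L2 address */
	bool     valid;
};

#define NEIGH_BUCKETS 256
static struct neigh neigh_tab[NEIGH_BUCKETS];

static void
neigh_insert(uint32_t ip, const uint8_t mac[6])
{
	struct neigh *n = &neigh_tab[ip % NEIGH_BUCKETS];

	n->ip = ip;
	memcpy(n->mac, mac, 6);
	n->valid = true;
}

/* Cheap exact-match search; a miss falls back to the BMP lookup. */
static const struct neigh *
neigh_lookup(uint32_t ip)
{
	const struct neigh *n = &neigh_tab[ip % NEIGH_BUCKETS];

	return (n->valid && n->ip == ip) ? n : NULL;
}
```

Since most forwarded traffic on an edge box is to directly attached hosts, a hit here skips the prefix tree entirely, which is the saving Joerg describes.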
Re: kgdb on amd64
On Thu, Jul 09, 2015 at 04:47:19PM +0100, Patrick Welche wrote:
> kvtopte() is defined in sys/arch/x86/include/pmap.h:

static __inline pt_entry_t * __unused
kvtopte(vaddr_t va)
{
        pd_entry_t *pde;

        KASSERT(va >= VM_MIN_KERNEL_ADDRESS);

        pde = L2_BASE + pl2_i(va);
        if (*pde & PG_PS)
                return ((pt_entry_t *)pde);
        return (PTE_BASE + pl1_i(va));
}

> #define pl1_i(VA)  (((VA_SIGN_POS(VA)) & L1_FRAME) >> L1_SHIFT)
>
> So, where is the lock? (or is the address illegal?)

(Sorry - sent vtopte instead of kvtopte.)
Re: kgdb on amd64
panic: lockdebug_lookup: uninitialized lock... (from kvtopte()) is what
you get if you try kgdb on amd64. On connecting, I see:

...
Sending packet: $qAttached#8f...Ack
Packet received: 
Packet qAttached (query-attached) is NOT supported
Sending packet: $g#67...Ack
Packet received: 009081810190818100908181d0078509d980080080070a1890e30f01189052ca
Sending packet: $m81819000,1#65...Ack

That $m81819000 leads to sys/arch/amd64/amd64/kgdb_machdep.c:kgdb_acc()
and the panic is in

        pte = kvtopte(va);

with va=0x81819000. kvtopte() is defined in
sys/arch/x86/include/pmap.h:

static __inline pt_entry_t * __unused
vtopte(vaddr_t va)
{

        KASSERT(va < VM_MIN_KERNEL_ADDRESS);

        return (PTE_BASE + pl1_i(va));
}

#define pl1_i(VA)  (((VA_SIGN_POS(VA)) & L1_FRAME) >> L1_SHIFT)

So, where is the lock? (or is the address illegal?)

Cheers,

Patrick
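One thing that may be worth checking here (my speculation, not something the mail establishes): the $m81819000 request carries a 32-bit address, while amd64 kernel addresses are the canonical sign-extended 64-bit form (0xffffffff81819000). If kgdb_acc() receives the truncated value, kvtopte() would index the wrong page-table slot. The arithmetic:

```c
#include <assert.h>
#include <stdint.h>

/* Sign-extend a 32-bit address to its canonical 64-bit form, which is
 * what an amd64 kernel virtual address is expected to look like. */
static uint64_t
canonical_va(uint32_t va32)
{
	return (uint64_t)(int64_t)(int32_t)va32;
}
```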
Re: Improving use of rt_refcnt
> The thing is that pretty much all the networks that were "normal" in
> 1980 had disappeared by about 1990, leaving only networks that worked
> like DIX ethernet. You would think the code would have been
> restructured for the new "normal" since then, but I guess old code
> dies hard.

Well, don't forget that there still are a few non-Ethernetty networks
left. While it would make sense to optimize for Ethernet and its ilk,
the generality still needs to be kept around. While this is hardly
impossible, I daresay it reduces the motivation to change the current
system.

/~\ The ASCII Mouse
\ / Ribbon Campaign
 X  Against HTML		mo...@rodents-montreal.org
/ \ Email!	     7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B
Re: mount_checkdirs
> My inclination is that it is wrong [...]

It looks strange to me, but that doesn't tell much. Could it have been
added as a quick fix for the behaviour of some (then) standard daemon
or the like?

> The logic was added to 4.4 by Kirk McKusick but without much in the
> way of rationale:

Perhaps the easiest thing would be someone asking Kirk Himself whether
he can remember?
Re: Improving use of rt_refcnt
On Thu, Jul 9, 2015 at 1:28 PM, Dennis Ferguson wrote:
>
> On 7 Jul, 2015, at 21:25 , Ryota Ozaki wrote:
>
>> BTW how do you think of separating L2 tables (ARP/NDP) from the L3
>> routing tables? The separation gets rid of cloning/cloned route
>> features and that makes it easy to introduce locks in route.c.
>> (Currently rtrequest1 can be called recursively to remove cloned
>> routes and that makes it hard to use locks.) I read your paper
>> (BSDNetworking.pdf) and it seems to suggest maintaining L2 routes
>> in the common routing table (I may misunderstand your opinion).
>
> I think it is worth stepping back and thinking about what the end
> result of the most common type of access to the route table (a
> forwarding operation, done by a reader who wants to know what to do
> with a packet it has) is going to be, since this is the operation you
> want to optimize. If the packet is to be sent out an interface then
> the result of the work you are doing is that an L2 header will be
> prepended to the packet and the packet will be queued to an interface
> for transmission.
>
> To make this direct and fast what you want is for the result of the
> route lookup to point directly at the thing that knows what L2 header
> needs to be added and which interface the packet needs to be
> delivered to. If you have that then all that remains to be done after
> the route lookup is to make space at the front of the packet for the
> L2 header, memcpy() it in and give the resulting frame to the
> interface. So you want the route lookup organized to get you from the
> addresses in the packet you are processing to the L2 header and
> interface you need to use to send a packet like that as directly as
> possible.
> While we could talk about how the route lookup might be structured
> to better get directly to the point (this involves splitting the
> rtentry into a "route" part and a "nexthop" part, the latter being
> the result of a lookup and having the data needed to deliver the
> packet with minimal extra work), this probably isn't relevant to
> your question. What I did want to point out, however, is that
> knowledge of the next hop IP address is (generally) entirely
> unnecessary to forward a packet. All a forwarding operation wants
> to know is the L2 header to add to the packet. Of course ARP or
> ND will have used the next hop IP address to determine the L2 header
> to attach to the packet, but once this is known all packet forwarding
> wants is the result, the L2 header, and doesn't care how that was
> arrived at. What this means is that your proposed use of the next
> hop IP address is a gratuitous indirection; you would be taking
> something which would be best done as
>
>     [route lookup] -> [L2 header + interface]
>
> and instead turning this into
>
>     [route lookup] -> [next hop IP] -> [L2 table lookup] -> [L2 header + interface]
>
> This will likely always be significantly more expensive than the
> direct alternative. The indirection is also easy to resolve up front,
> when a route is added, so there's no need to do it over and over
> again for each forwarded packet, and failing to do it when routes are
> installed moves yet another data structure (per-interface) into the
> forwarding path that will need to be dealt with if you eventually
> want to eliminate the locks. I think you shouldn't do this, or
> anything else that requires if_output() to look at the next hop IP
> address, since that indirection should go away.
>
> The neat thing about this is that the internal arrangement that makes
> one think that the next hop IP address is an important result of a
> route lookup (it is listed as one in the rtentry structure, and
> if_output() takes it as an argument) is actually a historical
> artifact. I think this code was written in about 1980.
> Then, as now, the point of the route lookup was to determine the L2
> header to prepend to the packet and the interface to queue it to, but
> what was different was the networks that existed then. Almost all of
> them did [next hop IP] -> [L2 header] mapping by storing the variable
> bits of the L2 header directly in the local bits of the IP address;
> see RFC796 and RFC895 for a whole bunch of examples (the
> all-zeros-host-part directed broadcast address that 4.2BSD used came
> from the mapping for experimental ethernet). This meant that the next
> hop IP address wasn't an indirection at all, it was directly the data
> you needed to construct the L2 header to add to the packet. The
> original exception to this was DIX Ethernet, with its 48 bit MAC
> addresses that were too big to store that way, so the idea of
> implementing an ARP cache in the interface code, and using the next
> hop IP address as a less efficient indirection to the L2 header data
> for that type of interface, was invented to make DIX Ethernet look
> like a "normal" interface where the next hop IP address directly and
> efficiently provided the L2 bits you needed to know to send the
> packet.
>
> The thing is that pretty much all the networks that were "normal"
> in 1980 had disappeared by about 1990, leaving only networks that
> worked like DIX ethernet.
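The route/nexthop split Dennis describes could be sketched roughly as follows. All names and sizes here are hypothetical user-space illustration, not proposed kernel code:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct ifnet;	/* opaque interface handle */

/*
 * "Nexthop" part: the result of a forwarding lookup.  It holds the
 * prebuilt L2 header and the output interface, so the hot path never
 * needs to look at the next-hop IP address.
 */
struct nexthop {
	struct ifnet *nh_ifp;
	uint8_t       nh_l2hdr[14];  /* e.g. a prebuilt Ethernet header */
	uint8_t       nh_l2len;
};

/* "Route" part: a prefix pointing at a (possibly shared) nexthop. */
struct route_entry {
	uint32_t        re_prefix;
	uint8_t         re_plen;
	struct nexthop *re_nh;
};

/* Forwarding fast path after the lookup: memcpy() the L2 header into
 * the headroom in front of the packet and return its length. */
static size_t
prepend_l2(const struct nexthop *nh, uint8_t *headroom)
{
	memcpy(headroom, nh->nh_l2hdr, nh->nh_l2len);
	return nh->nh_l2len;
}
```

Because many routes can share one nexthop, deduplicating nexthops is also what keeps the lookup structure small, as Joerg notes elsewhere in the thread.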
Re: Improving use of rt_refcnt
On Wed, Jul 08, 2015 at 09:28:01PM -0700, Dennis Ferguson wrote:
> What this means is that your proposed use of the next
> hop IP address is a gratuitous indirection; you would be taking
> something which would be best done as
>
>     [route lookup] -> [L2 header + interface]
>
> and instead turning this into
>
>     [route lookup] -> [next hop IP] -> [L2 table lookup] -> [L2 header + interface]

This is the part I disagree with. There are generally two cases here:

- the BMP is a local network
- the BMP is not a local network

In the second case, the route can store a direct reference to the L2
address without artificial entries in the table and without an
additional lookup. There are some potential issues to consider for
dealing with multiple interfaces sharing IP ranges, but that's a
different question.

For the first case, storing cloned routes or doing a hashed target
lookup is very likely to have similar performance, and often the
latter option is going to be faster.

Joerg