Re: [PATCH] Fixing soft NFS umount -f, round 5

2015-07-09 Thread Emmanuel Dreyfus
On Wed, Jul 08, 2015 at 06:07:11PM +, Emmanuel Dreyfus wrote:
> http://ftp.espci.fr/shadow/manu/umount_f4.patch
> 
> With default values, timeout = 3 and retrans = 10, we now wait
> for ages for the unmount to complete. In the umount -f case
> it does not make sense to wait for too long.

Here is an improved version that still uses your proposed timeouts:
http://ftp.espci.fr/shadow/manu/umount_f5.patch

I introduced an NFS mount private flag NFSMNT_DISMNTFORCE that informs
the NFS subsystem that we want to forcibly unmount the filesystem, 
which can be done at the expense of cutting timeout corners.

An important point is that with the flag set, we do not attempt any
NFS server reconnect. It makes sense since dounmount() calls 
VFS_SYNC() before VFS_UNMOUNT(). NFSMNT_DISMNTFORCE being set in 
VFS_UNMOUNT() / nfs_unmount(), we already made reconnection attempts
within VFS_SYNC(), hence there is no benefit in doing it again.

Now with that change, in the umount -f while NFS server is gone case,
we spend 20s in VFS_SYNC(), and VFS_UNMOUNT() returns in less than a
second.

Now I think it would be nice to also cut corners in VFS_SYNC() when
the force flag is used, but that touches filesystem-independent code,
in dounmount():
	if ((mp->mnt_flag & MNT_RDONLY) == 0) {
		error = VFS_SYNC(mp, MNT_WAIT, l->l_cred);
	}
	if (error == 0 || (flags & MNT_FORCE)) {
		error = VFS_UNMOUNT(mp, flags);
	}

The VFS_SYNC() call makes us wait even if MNT_FORCE is set. This could
be solved by adding an IMNT_UMOUNTFORCE flag to struct mount's mnt_iflag,
just like I did for NFS in this patch. The flag would tell the underlying
filesystem that a forced unmount is required and that a fast return is
expected.
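
To make the proposal concrete, here is a minimal user-space sketch of the
control flow, not kernel code. Everything in it is an assumption for
illustration: struct mount_sketch, the sketch_* functions, the flag values
and the slow_waits counter are hypothetical stand-ins, and IMNT_UMOUNTFORCE
is only the flag suggested above, not an existing one.

```c
#include <errno.h>
#include <stdbool.h>

/* Illustrative flag values only; the real kernel constants differ. */
#define MNT_RDONLY       0x01
#define MNT_FORCE        0x02
#define IMNT_UMOUNTFORCE 0x04	/* hypothetical, as proposed above */

/* Hypothetical stand-in for struct mount. */
struct mount_sketch {
	int  mnt_flag;
	int  mnt_iflag;
	bool server_reachable;	/* models whether the NFS server answers */
	int  slow_waits;	/* how many full timeout cycles we sat through */
};

/* Stand-in for VFS_SYNC(): give up at once when a forced unmount is flagged. */
static int
sketch_sync(struct mount_sketch *mp)
{
	if (mp->server_reachable)
		return 0;
	if ((mp->mnt_iflag & IMNT_UMOUNTFORCE) == 0)
		mp->slow_waits++;	/* models waiting out timeout * retrans */
	return ETIMEDOUT;
}

/* Stand-in for VFS_UNMOUNT(); always succeeds in this model. */
static int
sketch_unmount(struct mount_sketch *mp, int flags)
{
	(void)mp;
	(void)flags;
	return 0;
}

/* Same control flow as the dounmount() excerpt, plus setting the flag early. */
static int
sketch_dounmount(struct mount_sketch *mp, int flags)
{
	int error = 0;

	if (flags & MNT_FORCE)
		mp->mnt_iflag |= IMNT_UMOUNTFORCE;
	if ((mp->mnt_flag & MNT_RDONLY) == 0)
		error = sketch_sync(mp);
	if (error == 0 || (flags & MNT_FORCE))
		error = sketch_unmount(mp, flags);
	return error;
}
```

The point of the sketch is only that the flag has to be set before the sync
step so that the filesystem can see it there; how a real filesystem would
shorten its waits is left open.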

Opinions?

-- 
Emmanuel Dreyfus
m...@netbsd.org


Re: Improving use of rt_refcnt

2015-07-09 Thread Joerg Sonnenberger
On Sun, Jul 05, 2015 at 10:33:02PM -0700, Dennis Ferguson wrote:
> If you don't want it to work this way then you'll need to replace the
> radix tree with something that permits changes while readers are
> concurrently operating.  To take best advantage of a more modern data
> structure, however, you are still not going to want readers to ever
> write the shared data structure if that can be avoided.  The two
> atomic operations needed to increment and decrement a reference count
> greatly exceed the cost of a (well-cached) route lookup.

Let me pick the discussion up at this point since David mentioned that
my last reply was somewhat terse. I think the current radix tree serves
three different purposes right now:

(1) Manage the view of the connectivity to the outside world in a way
coherent with the administrator's intention or some routing
protocol/daemon.

(2) Provide a mechanism for finding the next-hop for traffic to not
directly attached networks.

(3) Provide a mechanism for finding L2 addresses on directly attached
networks.

Using a single data structure for this has the advantage of code sharing
and can make detailed accounting very easy. It has the problem of
overhead and mixing data of different levels of volatility. I would like
to see the three mechanisms separated, with appropriate data
structures for each case. The first point would be moved out completely
from the hot path, the actual packet handling case. It would then be no
longer as performance sensitive, so options for storage can be more
focused on size.

For finding the next-hop, the problem is simplified. The number of
next-hop addresses is (normally) limited by the size of the network
neighborhood. Even a backend router at one of the major Internet
exchange points will not have more than a few thousand next-hops,
compared to having 200k routes or more. This can be exploited to reduce
the data size of the BMP lookup data structure and to remove redundant
entries, e.g. a longer prefix with the same next-hop as a shorter
prefix. As I mentioned in my earlier email, the next-hop entry can and
should store a reference to whatever L2 data is needed, so that no
additional search is needed.
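
As a rough illustration of the redundancy argument, the sketch below marks
every route whose closest covering shorter prefix points at the same
next-hop entry. It is a toy over a flat IPv4 array, not a proposal for the
actual data structure; struct nexthop_sketch, struct route_sketch and
mark_redundant() are hypothetical names, and the 14-byte header size simply
assumes an Ethernet-style L2 header.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical next-hop entry carrying the L2 data it refers to. */
struct nexthop_sketch {
	uint8_t l2_header[14];	/* assumed Ethernet-style header */
	int     ifindex;
};

/* Hypothetical route: IPv4 prefix in host byte order plus its next-hop. */
struct route_sketch {
	uint32_t prefix;
	uint8_t  plen;
	const struct nexthop_sketch *nh;
	bool     redundant;
};

static uint32_t
prefix_mask(uint8_t plen)
{
	return plen == 0 ? 0 : (uint32_t)(~(uint32_t)0 << (32 - plen));
}

/* True if a is a strictly shorter prefix that contains b. */
static bool
covers(const struct route_sketch *a, const struct route_sketch *b)
{
	uint32_t m = prefix_mask(a->plen);

	return a->plen < b->plen && (b->prefix & m) == (a->prefix & m);
}

/*
 * Mark each route whose longest covering prefix has the same next-hop.
 * Dropping such a route does not change any lookup result.
 * Returns the number of routes marked.
 */
static size_t
mark_redundant(struct route_sketch *rt, size_t n)
{
	size_t count = 0;

	for (size_t i = 0; i < n; i++) {
		const struct route_sketch *parent = NULL;

		for (size_t j = 0; j < n; j++) {
			if (j != i && covers(&rt[j], &rt[i]) &&
			    (parent == NULL || rt[j].plen > parent->plen))
				parent = &rt[j];
		}
		rt[i].redundant = parent != NULL && parent->nh == rt[i].nh;
		if (rt[i].redundant)
			count++;
	}
	return count;
}
```

Comparing against the closest covering prefix matters: a /24 under a /16
with a different next-hop must stay even if some /8 above both shares its
next-hop.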

For the L3->L2 address mapping, the problem changes from BMP search to
an exact match search. If the mapping is managed correctly, it makes
sense to do this (cheap) search first and skip the whole BMP lookup on
a match as redundant. Hash tables and the like also have nice properties
for read-mostly updates and cache density.
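
The lookup order described here could look roughly like the following
sketch: an exact-match probe of a small neighbor hash table first, with the
BMP search only as the fallback. All names (neigh_table, bmp_route,
resolve_l2) are hypothetical, the linear scan merely stands in for a real
BMP structure, and locking and entry expiry are ignored.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NEIGH_SLOTS 64	/* must stay 64: the hash below yields 6 bits */

/* Hypothetical L3->L2 neighbor entry (IPv4 address to MAC). */
struct neigh_sketch { uint32_t addr; uint8_t mac[6]; int used; };
struct neigh_table  { struct neigh_sketch slot[NEIGH_SLOTS]; };

/* Hypothetical route whose next-hop L2 address is already resolved. */
struct bmp_route { uint32_t prefix; uint8_t plen; uint8_t gw_mac[6]; };

static unsigned
neigh_hash(uint32_t addr)
{
	/* Multiplicative hash, top 6 bits select one of 64 slots. */
	return (unsigned)((uint32_t)(addr * 2654435761u) >> 26);
}

/* Insert or update with linear probing; -1 if the table is full. */
static int
neigh_insert(struct neigh_table *t, uint32_t addr, const uint8_t mac[6])
{
	unsigned h = neigh_hash(addr);

	for (unsigned i = 0; i < NEIGH_SLOTS; i++) {
		struct neigh_sketch *s = &t->slot[(h + i) & (NEIGH_SLOTS - 1)];

		if (!s->used || s->addr == addr) {
			s->addr = addr;
			memcpy(s->mac, mac, 6);
			s->used = 1;
			return 0;
		}
	}
	return -1;
}

/* Exact-match search; NULL if the address is not a known neighbor. */
static const struct neigh_sketch *
neigh_lookup(const struct neigh_table *t, uint32_t addr)
{
	unsigned h = neigh_hash(addr);

	for (unsigned i = 0; i < NEIGH_SLOTS; i++) {
		const struct neigh_sketch *s = &t->slot[(h + i) & (NEIGH_SLOTS - 1)];

		if (!s->used)
			return NULL;
		if (s->addr == addr)
			return s;
	}
	return NULL;
}

/* Cheap exact match first; fall back to the best matching prefix. */
static const uint8_t *
resolve_l2(const struct neigh_table *t, const struct bmp_route *rt,
    size_t n, uint32_t dst, int *used_bmp)
{
	const struct neigh_sketch *ne = neigh_lookup(t, dst);
	const struct bmp_route *best = NULL;

	*used_bmp = 0;
	if (ne != NULL)
		return ne->mac;

	*used_bmp = 1;
	for (size_t i = 0; i < n; i++) {
		uint32_t m = rt[i].plen == 0 ? 0 :
		    (uint32_t)(~(uint32_t)0 << (32 - rt[i].plen));

		if ((dst & m) == (rt[i].prefix & m) &&
		    (best == NULL || rt[i].plen > best->plen))
			best = &rt[i];
	}
	return best != NULL ? best->gw_mac : NULL;
}
```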

Joerg


Re: kgdb on amd64

2015-07-09 Thread Patrick Welche
On Thu, Jul 09, 2015 at 04:47:19PM +0100, Patrick Welche wrote:
> kvtopte() is defined in sys/arch/x86/include/pmap.h:

static __inline pt_entry_t * __unused
kvtopte(vaddr_t va)
{
	pd_entry_t *pde;

	KASSERT(va >= VM_MIN_KERNEL_ADDRESS);

	pde = L2_BASE + pl2_i(va);
	if (*pde & PG_PS)
		return ((pt_entry_t *)pde);

	return (PTE_BASE + pl1_i(va));
}

> #define pl1_i(VA)   (((VA_SIGN_POS(VA)) & L1_FRAME) >> L1_SHIFT)
> 
> So, where is the lock? (or is the address illegal?)

(Sorry - sent vtopte instead of kvtopte)


Re: kgdb on amd64

2015-07-09 Thread Patrick Welche
panic: lockdebug_lookup: uninitialized lock... (from kvtopte())

is what you get if you try kgdb on amd64.

On connecting, I see:
...
Sending packet: $qAttached#8f...Ack
Packet received: 
Packet qAttached (query-attached) is NOT supported
Sending packet: $g#67...Ack
Packet received: 
009081810190818100908181d0078509d980080080070a1890e30f01189052ca
Sending packet: $m81819000,1#65...Ack

That $m81819000 leads to sys/arch/amd64/amd64/kgdb_machdep.c:kgdb_acc()
and the panic is in

  pte = kvtopte(va);

with va=0x81819000;

kvtopte() is defined in sys/arch/x86/include/pmap.h:

static __inline pt_entry_t * __unused
vtopte(vaddr_t va)
{

	KASSERT(va < VM_MIN_KERNEL_ADDRESS);

	return (PTE_BASE + pl1_i(va));
}

#define pl1_i(VA)   (((VA_SIGN_POS(VA)) & L1_FRAME) >> L1_SHIFT)

So, where is the lock? (or is the address illegal?)

Cheers,

Patrick


Re: Improving use of rt_refcnt

2015-07-09 Thread Mouse
> The thing is that pretty much all the networks that were "normal" in
> 1980 had disappeared by about 1990, leaving only networks that worked
> like DIX ethernet.  You would think the code would have been
> restructured for the new "normal" since then, but I guess old code
> dies hard.

Well, don't forget that there still are a few non-Ethernetty networks
left.  While it would make sense to optimize for Ethernet and its ilk,
the generality still needs to be kept around.

While this is hardly impossible, I daresay it reduces the motivation to
change the current system.

/~\ The ASCII                             Mouse
\ / Ribbon Campaign
 X  Against HTML                mo...@rodents-montreal.org
/ \ Email!           7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B


Re: mount_checkdirs

2015-07-09 Thread Edgar Fuß
> My inclination is that it is wrong [...]
It looks strange to me, but that doesn't tell much.
Could it have been added as a quick fix for the behaviour of some (then) 
standard daemon or the like?

> The logic was added to 4.4 by Kirk McKusick but without much in the
> way of rationale:
Perhaps the easiest thing would be someone asking Kirk Himself whether 
he can remember?


Re: Improving use of rt_refcnt

2015-07-09 Thread Ryota Ozaki
On Thu, Jul 9, 2015 at 1:28 PM, Dennis Ferguson wrote:
>
> On 7 Jul, 2015, at 21:25 , Ryota Ozaki  wrote:
>
>> BTW how do you think of separating L2 tables (ARP/NDP) from the L3
>> routing tables? The separation gets rid of cloning/cloned route
>> features and that makes it easy to introduce locks in route.c.
>> (Currently rtrequest1 can be called recursively to remove cloned
>> routes and that makes it hard to use locks.) I read your paper
>> (BSDNetworking.pdf) and it seems to suggest to maintain L2 routes
>> in the common routing table (I may misunderstand your opinion).
>
> I think it is worth stepping back and thinking about what the end
> result of the most common type of access to the route table (a
> forwarding operation, done by a reader who wants to know what to do
> with a packet it has) is going to be, since this is the operation you
> want to optimize.  If the packet is to be sent out an interface then
> the result of the work you are doing is that an L2 header will be
> prepended to the packet and the packet will be queued to an interface
> for transmission.
>
> To make this direct and fast what you want is for the result of the
> route lookup to point directly at the thing that knows what L2 header
> needs to be added and which interface the packet needs to be delivered
> to.  If you have that then all that remains to be done after the
> route lookup is to make space at the front of the packet for the L2
> header, memcpy() it in and give the resulting frame to the interface.
> So you want the route lookup organized to get you from the addresses
> in the packet you are processing to the L2 header and interface you
> need to use to send a packet like that as directly as possible.
>
> While we could talk about how the route lookup might be structured
> to better get directly to the point (this involves splitting the
> rtentry into a "route" part and a "nexthop" part, the latter being
> the result of a lookup and having the data needed to deliver the
> packet with minimal extra work), this probably isn't relevant to
> your question.  What I did want to point out, however, is that
> knowledge of the next hop IP address is (generally) entirely
> unnecessary to forward a packet.  All forwarding operations want
> to know is the L2 header to add to the packet.  Of course ARP or
> ND will have used the next hop IP address to determine the L2 header
> to attach to the packet, but once this is known all packet forwarding
> wants is the result, the L2 header, and doesn't care how that was
> arrived at.  What this means is that your proposed use of the next
> hop IP address is a gratuitous indirection; you would be taking
> something which would be best done as
>
>  -> 
>
> and instead turning this into
>
>  ->  ->  -> 
> 
>
> This will likely always be significantly more expensive than the direct
> alternative.  The indirection is also easy to resolve up front, when a route
> is added, so there's no need to do it over and over again for each forwarded
> packet, and failing to do it when routes are installed moves yet another
> data structure (per-interface) into the forwarding path that will need to
> be dealt with if you eventually want to eliminate the locks.  I think
> you shouldn't do this, or anything else that requires if_output() to
> look at the next hop IP address, since that indirection should go away.
>
> The neat thing about this is that the internal arrangement that makes
> one think that the next hop IP address is an important result of a route
> lookup (it is listed as one in the rtentry structure, and if_output()
> takes it as an argument) is actually a historical artifact.  I think
> this code was written in about 1980.  Then, as now, the point of the
> route lookup was to determine the L2 header to prepend to the packet
> and the interface to queue it to, but what was different was the networks
> that existed then.  Almost all of them did  -> 
> mapping by storing the variable bits of the L2 header directly in the
> local bits of the IP address; see RFC796 and RFC895 for a whole bunch of
> examples (the all-zeros-host-part directed broadcast address that 4.2BSD
> used came from the mapping for experimental ethernet).  This meant that
> the next hop IP address wasn't an indirection at all, it was directly
> the data you needed to construct the L2 header to add to the packet.
> The original exception to this was DIX Ethernet, with its 48 bit MAC
> addresses that were too big to store that way, so the idea of
> implementing an ARP cache in the interface code and using the next hop
> IP address as a less efficient indirection to the L2 header data for
> that type of interface, was invented to make DIX Ethernet look like a
> "normal" interface where the next hop IP address directly and efficiently
> provided the L2 bits you needed to know to send the packet.
>
> The thing is that pretty much all the networks that were "normal"
> in 1980 had disappeared by about 1990, leaving only ne

Re: Improving use of rt_refcnt

2015-07-09 Thread Joerg Sonnenberger
On Wed, Jul 08, 2015 at 09:28:01PM -0700, Dennis Ferguson wrote:
> What this means is that your proposed use of the next
> hop IP address is a gratuitous indirection; you would be taking
> something which would be best done as
> 
>  -> 
> 
> and instead turning this into
> 
>  ->  ->  -> 
> 

This is the part I disagree with. There are generally two cases here:
- the BMP is a local network
- the BMP is not a local network

In the second case, the route can store a direct reference to the L2
address without artificial entries in the table and without additional
lookup. There are some potential issues to consider for dealing with
multiple interfaces sharing IP ranges, but that's a different question.

For the first case, storing cloned routes or doing a hashed target
lookup is very likely to have similar performance and often the latter
option is going to be faster.

Joerg