Re: core statement on fexecve, O_EXEC, and O_SEARCH

2012-12-04 Thread Michael van Elst
m...@netbsd.org (Emmanuel Dreyfus) writes:

Alan Barrett a...@netbsd.org wrote:

 The fexecve function could be implemented entirely in libc, 
 via execve(2) on a file name of the form /proc/self/fd/N. 
 Any security concerns around fexecve() also apply to exec of 
 /proc/self/fd/N.

I gave a try to this approach. There is an unexpected issue:
for a reason I cannot figure out, namei() does not resolve
/proc/self/fd/N. Here is a ktrace:

   810  1 t_fexecve CALL  open(0x8048db6,0,0)
   810  1 t_fexecve NAMI  /usr/bin/touch
   810  1 t_fexecve RET   open 3
   810  1 t_fexecve CALL  getpid
   810  1 t_fexecve RET   getpid 810/0x32a, 924/0x39c
   810  1 t_fexecve CALL  execve(0xbfbfe66f,0xbfbfea98,0xbfbfeaa4)
   810  1 t_fexecve NAMI  /proc/self/fd/3
   810  1 t_fexecve RET   execve -1 errno 2 No such file or directory

The descriptor is probably already closed on exec before the syscall
tries to use it.
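
For illustration, a minimal sketch of this libc-only approach, assuming a
procfs mounted on /proc (the function name is made up; this is not NetBSD's
actual implementation):

#include <stdio.h>
#include <unistd.h>

int
my_fexecve(int fd, char *const argv[], char *const envp[])
{
	char path[32];

	/* Build the magic procfs name for the already-open descriptor. */
	snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);
	return execve(path, argv, envp);	/* returns only on error */
}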

-- 
-- 
Michael van Elst
Internet: mlel...@serpens.de
A potential Snark may lurk in every tree.


Re: core statement on fexecve, O_EXEC, and O_SEARCH

2012-12-04 Thread Alan Barrett

The fexecve function could be implemented entirely in libc,
via execve(2) on a file name of the form /proc/self/fd/N.
Any security concerns around fexecve() also apply to exec of
/proc/self/fd/N.

I gave a try to this approach. There is an unexpected issue:

The descriptor is probably already closed on exec before the syscall
tries to use it.


I believe that we should not fix that without a proper design 
of how all the parts will work together.


Some questions that I would like to see answered are: Should it 
be possible to exec a fd only if a special flag was used in the 
open(2) call?  Should the file's executability be checked at open 
time or at exec time, or both, or does it depend on open flags or 
on what happened to the fd in between open and exec?  Should the 
record of the fact that the fd may be eligible for exec be erased 
when the fd is passed from one process to another?  Always or only 
sometimes?  How can fds obtained from procfs be made to follow the 
rules?
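
For context, the POSIX.1-2008 interfaces under discussion would be used
roughly as follows (a usage sketch only; NetBSD does not provide O_EXEC or
fexecve() at this point):

#include <fcntl.h>
#include <unistd.h>

int
run_touch(char *const argv[], char *const envp[])
{
	int fd;

	/* O_EXEC would request execute-only access at open time. */
	fd = open("/usr/bin/touch", O_EXEC);
	if (fd == -1)
		return -1;
	return fexecve(fd, argv, envp);	/* returns only on error */
}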


--apb (Alan Barrett)


Re: core statement on fexecve, O_EXEC, and O_SEARCH

2012-12-04 Thread Julian Yon
On Tue, 4 Dec 2012 08:49:17 + (UTC)
mlel...@serpens.de (Michael van Elst) wrote:

 The descriptor is probably already closed on exec before the syscall
 tries to use it.

Nope. That happens later. I was looking through this code yesterday as
the topic interests me. The namei lookup happens pretty early on. I
haven't solved it, but the problem seems to be one of context - if you
try to execve /proc/self you'll also get ENOENT instead of the expected
EACCES.


Julian

-- 
3072D/F3A66B3A Julian Yon (2012 General Use) pgp.2...@jry.me




Re: Problem identified: WAPL/RAIDframe performance problems

2012-12-04 Thread David Holland
On Sat, Dec 01, 2012 at 11:38:55PM -0500, Mouse wrote:
   things.  What I care about is the largest size sector that will (in
   ^^^
   the ordinary course of things anyway) be written atomically.
   Then those are 512-byte-sector drives [...]
   No; because I can do 4K atomic writes, I want to know about that.
  
  And, can't you do that with traditional drives, drives which really do
  have 512-byte sectors?  Do a 4K transfer and you write 8 physical
  sectors with no opportunity for any other operation to see the write
  partially done.  Is that wrong, or am I missing something else?

Insert a kernel panic (or power failure(*)) after five sectors and
it's not atomic. One sector, at least in theory(*), is.

(*) let's ignore for now the various daft things that disks sometimes
do in practice.

-- 
David A. Holland
dholl...@netbsd.org


Re: Problem identified: WAPL/RAIDframe performance problems

2012-12-04 Thread David Holland
On Mon, Dec 03, 2012 at 12:19:58AM +, Julian Yon wrote:
  You appear to have just agreed with me, which makes me wonder what I'm
  missing, given you continue as though you disagree.

You asked why 4096-byte-sector disks accept 512-byte writes. I was
trying to explain.

   However, we're talking about hardware here, so you have to also
   consider the possibility that the drive firmware reports 512 because
   that's what someone coded up back in 1992 and nobody got around to
   fixing it.
  
  If that doesn't count as broken, what does? (Also, gosh, when did 1992
  become so long ago?)

By this standard, most hardware is broken.

-- 
David A. Holland
dholl...@netbsd.org


Re: Problem identified: WAPL/RAIDframe performance problems

2012-12-04 Thread Thor Lancelot Simon
On Tue, Dec 04, 2012 at 02:14:27PM +, David Holland wrote:
 On Sat, Dec 01, 2012 at 11:38:55PM -0500, Mouse wrote:
things.  What I care about is the largest size sector that will (in
^^^
the ordinary course of things anyway) be written atomically.
Then those are 512-byte-sector drives [...]
No; because I can do 4K atomic writes, I want to know about that.
   
   And, can't you do that with traditional drives, drives which really do
   have 512-byte sectors?  Do a 4K transfer and you write 8 physical
   sectors with no opportunity for any other operation to see the write
   partially done.  Is that wrong, or am I missing something else?
 
 Insert a kernel panic (or power failure(*)) after five sectors and

What's a kernel panic got to do with it?  If you hand the controller
and thus the drive a 4K write, the kernel panicking won't suddenly cause
you to reverse time and have issued 8 512-byte writes instead.

Given how drives actually write data, I would not be so sanguine
that any sector, of whatever size, in-flight when the power fails,
is actually written with the values you expect, or not written
at all.



Re: Problem identified: WAPL/RAIDframe performance problems

2012-12-04 Thread David Holland
On Tue, Dec 04, 2012 at 09:26:17AM -0500, Thor Lancelot Simon wrote:
 And, can't you do that with traditional drives, drives which really do
 have 512-byte sectors?  Do a 4K transfer and you write 8 physical
 sectors with no opportunity for any other operation to see the write
 partially done.  Is that wrong, or am I missing something else?
   
   Insert a kernel panic (or power failure(*)) after five sectors and
  
  What's a kernel panic got to do with it?  If you hand the controller
  and thus the drive a 4K write, the kernel panicking won't suddenly cause
  you to reverse time and have issued 8 512-byte writes instead.

That depends on additional properties of the pathway from the FS to
the drive firmware. It might have sent 1 of 2 2048-byte writes before
the panic, for example. Or it might be a vintage controller incapable
of handling more than one sector at a time.

Also, if there's a panic while the kernel is in the middle of talking
to the drive, such that the drive receives only part of the data you
intended to send, one can be reasonably certain it will reject a
partial sector... but if it's received 5 of 8 physical sectors and the
6th is partial, it may well write out those 5, which isn't what was
intended.

  Given how drives actually write data, I would not be so sanguine
  that any sector, of whatever size, in-flight when the power fails,
  is actually written with the values you expect, or not written
  at all.

Yes, I'm aware of that. It remains a useful approximation, especially
for already-existing FS code.

-- 
David A. Holland
dholl...@netbsd.org


Re: FFS write coalescing

2012-12-04 Thread David Holland
On Tue, Dec 04, 2012 at 09:59:46AM +0300, Alan Barrett wrote:
  the genfs code also never writes clean pages to disk, even though for
  RAID5 storage it would likely be more efficient to write clean pages
  that are in the same stripe as dirty pages if that would avoid issuing
  partial-stripe writes.  (which is basically another way of saying
  what david said.)
  
  Perhaps there should be a way for block devices to report at least three
  block sizes:
  
  a) smallest possible block size (512 for almost all disks)
  
  b) smallest efficient block size and alignment (4k for modern disks,
  stripe size for raid)
  
  c) largest possible size (a device and bus-dependent variant of MAXPHYS)
  
  Then the file system could use (b) to know when it's a good idea to
  combine dirty and clean pages into the same write.

As I was saying in the other thread, what filesystems really want to
know is the atomic write size. E.g. in ffs this affects the way
directories are laid out and is necessary (AFAIK including with wapbl)
for ~safe operation. This is not (a), and as far as I know it is also
not (b); see below.

I don't see (a) as useful. It is conceivable that a journaled FS might
want to know about it to allow packing journal records as tightly as
possible, but doing so is rather dubious from a recovery POV: the
point of flushing a journal is to get it physically onto disk safely,
and if you later let the disk rewrite part of what you thought was
safely on disk, it might cease to be safely on disk and break your
recovery scheme.

What guarantees do we actually get in practice for RAID5? Do you have
to commit journals in units of a whole stripe or stripe group to avoid
having them rewritten unsafely later? Or is the parity logging code
sufficient to make that safe? This matters for wapbl...

-- 
David A. Holland
dholl...@netbsd.org


Re: core statement on fexecve, O_EXEC, and O_SEARCH

2012-12-04 Thread David Holland
On Tue, Dec 04, 2012 at 01:58:13PM +, Julian Yon wrote:
   The descriptor is probably already closed on exec before the syscall
   tries to use it.
  
  Nope. That happens later. I was looking through this code yesterday as
  the topic interests me. The namei lookup happens pretty early on. I
  haven't solved it, but the problem seems to be one of context - if you
  try to execve /proc/self you'll also get ENOENT instead of the expected
  EACCES.

That doesn't make much sense... nor does the procfs_lookup code shed
any significant amount of light on it.

-- 
David A. Holland
dholl...@netbsd.org


Re: Making forced unmounts work

2012-12-04 Thread David Holland
On Sun, Dec 02, 2012 at 05:29:01PM +0100, J. Hannken-Illjes wrote:
  I'm convinced -- having fstrans_start() return ERESTART is the way to go.

Ok then :-)

   Also I wonder if there's any way to accomplish this that doesn't
   require adding fstrans calls to every operation in every fs.
   
   Not in a clean way. We would need some kind of reference counting for
   vnode operations and that is quite impossible as vnode operations on
   devices or fifos sometimes wait forever and are called from other fs
   like ufsspec_read() for example.  How could we protect UFS updating
   access times here?
   
   I'm not entirely convinced of that. There are basically three
   problems: (a) new incoming threads, (b) threads that are already in
   the fs and running, and (c) threads that are already in the fs and
   that are stuck more or less permanently because something broke.
   
   Admittedly I don't really understand how fstrans suspending works.
   Does it keep track of all the threads that are in the fs, so the (b)
   ones can be interrupted somehow, or so we at least can wait until all
   of them either leave the fs or enter fstrans somewhere and stall?
  
  Fstrans is a recursive rw-lock. Vnode operations take the reader lock
  on entry.  To suspend a fs we take the writer lock and therefore
  it will catch both (a) and (b).  New threads will block on first entry,
  threads already inside the fs will continue until they leave the fs
  and block on next entry.

Ok, fair enough, but...

  A suspended fs has the guarantee that no other thread will be inside
  fstrans_suspend / fstrans_done of any vnode operation.
  
  Threads stuck permanently as in (c) are impossible to catch.

...doesn't that mean the (c) threads are going to be holding read
locks, so the suspend will hang forever?

We have to assume there will be at least one such thread, as people
don't generally attempt umount -f unless something's wedged.
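
To make the reader/writer idea above concrete, a minimal userland sketch
using a plain pthread rwlock in place of the real (recursive) fstrans lock;
names are illustrative, not the kernel API:

#include <pthread.h>

static pthread_rwlock_t fs_gate = PTHREAD_RWLOCK_INITIALIZER;

void
vop_enter(void)		/* rough analogue of fstrans_start(): reader side */
{
	pthread_rwlock_rdlock(&fs_gate);
}

void
vop_leave(void)		/* rough analogue of fstrans_done() */
{
	pthread_rwlock_unlock(&fs_gate);
}

void
fs_suspend(void)	/* writer side: blocks (a) at entry, drains (b) */
{
	pthread_rwlock_wrlock(&fs_gate);
}

void
fs_resume(void)
{
	pthread_rwlock_unlock(&fs_gate);
}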

   If we're going to track that information we should really do it from
   vnode_if.c, both to avoid having to modify every fs and to make sure
   all fses support it correctly. (We also need to be careful about how
   it's done to avoid causing massive lock contention; that's why such
   logic doesn't already exist.)
  
  Some cases are nearly impossible to track at the vnode_if.c level:
  
  - Both VOP_GETPAGES() and VOP_PUTPAGES() cannot block or sleep here
as they run part or all of the operation with vnode interlock held.

Ugh. But this doesn't, for example, preclude atomically incrementing a
per-cpu counter or setting a flag in the lwp structure.

  - Accessing devices and fifos through a file system cannot be tracked
at the vnode_if.c level.  Take ufs_spec_read() for example:
  
   fstrans_start(...);
   VTOI(vp)->i_flag |= IN_ACCESS;
   fstrans_done(...);
   return VOCALL(spec_vnodeop_p, ...)
  
Here the VOCALL may sleep forever if the device has no data.

This I don't understand. To the extent it's in the fs while it's
doing this, vnode_if.c can know that, because it called ufs_spec_read
and ufs_spec_read hasn't returned yet. To the extent that it's in
specfs, specfs won't itself ever be umount -f'd, since it isn't
mounted, so there's no danger.

If umount -f currently blats over the data that specfs uses, then it's
currently not safe, but I don't see how it's different from any other
similar case with ufs data.

I expect I'm missing something.

-- 
David A. Holland
dholl...@netbsd.org


Re: core statement on fexecve, O_EXEC, and O_SEARCH

2012-12-04 Thread Robert Elz
Date:Tue, 4 Dec 2012 15:44:47 +0300
From:Alan Barrett a...@cequrux.com
Message-ID:  20121204124447.gf8...@apb-laptoy.apb.alt.za

  |  The fexecve function could be implemented entirely in libc,
  |  via execve(2) on a file name of the form /proc/self/fd/N.
  |  Any security concerns around fexecve() also apply to exec of
  |  /proc/self/fd/N.
  | I gave a try to this approach. There is an unexpected issue:
  | The descriptor is probably already closed on exec before the syscall
  | tries to use it.

I doubt that is the problem.  I took a quick look and couldn't see the
cause either, but I suspect the root cause is something related to the
files not truly existing, but just appearing when they're referenced
(or on a directory read).

  | I believe that we should not fix that without a proper design 
  | of how all the parts will work together.

First, I'm not sure it is really worth fixing at all; this doesn't
seem to be a particularly big problem in reality.  But, that said,
if a file exists, has x permission, and there's something executable
behind it, then exec should work on it, and there really should be
no need to look further than that.

  | Some questions that I would like to see answered are: Should it 
  | be possible to exec a fd only if a special flag was used in the 
  | open(2) call?

The question here isn't really execing a fd, it's execing a named file
(that happens to refer to an open fd, but that shouldn't be important).

  | Should the file's executability be checked at open 
  | time or at exec time, or both,

For this use, at exec time; the fd refers to a file, and this should
be no different than an exec of a symlink.   We just have a slightly
different way of getting a reference to the file to be exec'd.

Half of the issues here relate to how people see chroot(2), I think.
If we ignored chroot completely, most of what are seen as problems
would go away, and almost all of the issues could be resolved.

Even chroot isn't a problem, unless you're tempted to view it as some
kind of security mechanism.   It really isn't - it is just namespace
modification.   Sure, by modifying the filesystem namespace a bunch
of simple security attacks seem easy to avoid (and it does provide
some simple measure of protection) but as a true security mechanism
it really doesn't come close, and arguing against feature X or Y
because some tricky application of it can defeat chroot security
is just plain insane.

If true compartmentalisation is wanted for security purposes, we would
need something approaching true VM (like Xen DomU's or whatever) where
the whole environment can be protected, not just the filesystem.

chroot provides no protection at all to processes killing others,
reconfiguring the network, binding to random ports, thrashing the CPU,
altering sysctl data, rebooting the system, ...   There's almost no end
to the harm that a sufficiently inspired (and privileged) rogue process
can do, even if it is running in a chroot.

If we were willing to abandon the fiction that chroot is some kind of
security mechanism, then we can mostly just ignore it for almost all
purposes, and not worry about whether it would be possible to exec
a file via a fd passed through a socket to a process inside a chroot,
and all of that nonsense, as no-one would care one way or the other.

kre


Re: core statement on fexecve, O_EXEC, and O_SEARCH

2012-12-04 Thread Julian Yon
On Tue, 4 Dec 2012 15:30:36 +
David Holland dholland-t...@netbsd.org wrote:

 On Tue, Dec 04, 2012 at 01:58:13PM +, Julian Yon wrote:
The descriptor is probably already closed on exec before the
syscall tries to use it.
   
   Nope. That happens later. I was looking through this code
   yesterday as the topic interests me. The namei lookup happens
   pretty early on. I haven't solved it, but the problem seems to be
   one of context - if you try to execve /proc/self you'll also get
   ENOENT instead of the expected EACCES.
 
 That doesn't make much sense... nor does the procfs_lookup code shed
 any significant amount of light on it.
 

It's weird, isn't it? I've been staring at the code wondering what I'm
missing. Regardless of the pros & cons of being able to exec a fd, I
can't see how this inconsistency is correct behaviour.

-- 
3072D/F3A66B3A Julian Yon (2012 General Use) pgp.2...@jry.me




Re: Problem identified: WAPL/RAIDframe performance problems

2012-12-04 Thread David Laight
On Tue, Dec 04, 2012 at 02:57:52PM +, David Holland wrote:
   
   What's a kernel panic got to do with it?  If you hand the controller
   and thus the drive a 4K write, the kernel panicking won't suddenly cause
   you to reverse time and have issued 8 512-byte writes instead.
 
 That depends on additional properties of the pathway from the FS to
 the drive firmware. It might have sent 1 of 2 2048-byte writes before
 the panic, for example. Or it might be a vintage controller incapable
 of handling more than one sector at a time.

The ATA command set supports writes of multiple sectors and multi-sector
writes (probably not using those terms though!).

In the first case, although a single command is written the drive
will (effectively) loop through the sectors writing them 1 by 1.
All drives support this mode.

For multi-sector writes, the data transfer for each group of sectors
is done as a single burst. So if the drive supports 8 sector multi-sector
writes, and you are doing PIO transfers, you take a single 'data'
interrupt and then write all 4k bytes at once (assuming 512 byte sectors).
The drive identify response indicates whether multi-sector writes are
supported, and if so how many sectors can be written at once.
If the data transfer is DMA, it probably makes little difference to the
driver.

For quite a long time the NetBSD ata driver mixed them up - and would
only request writes of multiple sectors if the drive supported multi-sector
writes.

Multi-sector writes are probably quite difficult to kill part way through
since there is only one DMA transfer block.

   Given how drives actually write data, I would not be so sanguine
   that any sector, of whatever size, in-flight when the power fails,
   is actually written with the values you expect, or not written
   at all.
 
 Yes, I'm aware of that. It remains a useful approximation, especially
 for already-existing FS code.

Given that (AFAIK) a physical sector is not dissimilar from an HDLC frame,
once the write has started the old data is gone; if the write is actually
interrupted you'll get a (correctable) bad sector.
If you are really unlucky the write will be long - and trash the
following sector (I managed to power off a floppy controller before it
wrecked the rest of a track when I'd reset the writer with write enabled).
If you are really, really unlucky I think it is possible to destroy
adjacent tracks.

David

-- 
David Laight: da...@l8s.co.uk


Re: wapbl_flush() speedup

2012-12-04 Thread Michael van Elst
hann...@eis.cs.tu-bs.de (J. Hannken-Illjes) writes:

The attached diff tries to coalesce writes to the journal in MAXPHYS
sized and aligned blocks.
[...]
Comments or objections anyone?

+ * Write data to the log.
+ * Try to coalesce writes and emit MAXPHYS aligned blocks.

Looks fine, but I would prefer the code to use an arbitrarily sized
buffer in case we get individual per device transfer limits. Currently
that size would still be MAXPHYS, but then the code could query the driver
for a proper size.

-- 
-- 
Michael van Elst
Internet: mlel...@serpens.de
A potential Snark may lurk in every tree.


Re: wapbl_flush() speedup

2012-12-04 Thread Thor Lancelot Simon
On Tue, Dec 04, 2012 at 07:10:47PM +, Michael van Elst wrote:
 hann...@eis.cs.tu-bs.de (J. Hannken-Illjes) writes:
 
 The attached diff tries to coalesce writes to the journal in MAXPHYS
 sized and aligned blocks.
 [...]
 Comments or objections anyone?
 
 + * Write data to the log.
 + * Try to coalesce writes and emit MAXPHYS aligned blocks.
 
 Looks fine, but I would prefer the code to use an arbitrarily sized
 buffer in case we get individual per device transfer limits. Currently
 that size would still be MAXPHYS, but then the code could query the driver
 for a proper size.

In fact, this patch and the discussion that led up to it have me
planning to replace the simple per-device maximum transfer length
currently implemented on the tls-maxphys branch with at least three
different properties:

* An upper-bound (the maximum transfer length)
* A lower-bound (the minimum atomic transfer length)
* An optimal transfer alignment

Exposing all these to the right code elsewhere in the kernel won't be
so easy, but I'm working on it (slowly).
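
A rough sketch of what exposing those three per-device properties might
look like (field names are illustrative, not the tls-maxphys API):

#include <stddef.h>

struct disk_xfer_params {
	size_t	dxp_maxlen;	/* upper bound: largest single transfer */
	size_t	dxp_atomic;	/* lower bound: smallest atomic write */
	size_t	dxp_align;	/* optimal transfer alignment */
};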

Thor


Re: wapbl_flush() speedup

2012-12-04 Thread J. Hannken-Illjes

On Dec 4, 2012, at 8:10 PM, Michael van Elst mlel...@serpens.de wrote:

 hann...@eis.cs.tu-bs.de (J. Hannken-Illjes) writes:
 
 The attached diff tries to coalesce writes to the journal in MAXPHYS
 sized and aligned blocks.
 [...]
 Comments or objections anyone?
 
 + * Write data to the log.
 + * Try to coalesce writes and emit MAXPHYS aligned blocks.
 
 Looks fine, but I would prefer the code to use an arbitrarily sized
 buffer in case we get individual per device transfer limits. Currently
 that size would still be MAXPHYS, but then the code could query the driver
 for a proper size.

As `struct wapbl' is per-mount and I suppose this will be per-mount-static
it will be just a small `s/MAXPHYS/get-the-optimal-length/' as soon as
tls-maxphys comes to head.

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Re: wapbl_flush() speedup

2012-12-04 Thread David Laight
On Tue, Dec 04, 2012 at 09:53:11PM +0100, J. Hannken-Illjes wrote:
 
 On Dec 4, 2012, at 8:10 PM, Michael van Elst mlel...@serpens.de wrote:
 
  hann...@eis.cs.tu-bs.de (J. Hannken-Illjes) writes:
  
  The attached diff tries to coalesce writes to the journal in MAXPHYS
  sized and aligned blocks.
  [...]
  Comments or objections anyone?
  
  + * Write data to the log.
  + * Try to coalesce writes and emit MAXPHYS aligned blocks.
  
  Looks fine, but I would prefer the code to use an arbitrarily sized
  buffer in case we get individual per device transfer limits. Currently
  that size would still be MAXPHYS, but then the code could query the driver
  for a proper size.
 
 As `struct wapbl' is per-mount and I suppose this will be per-mount-static
 it will be just a small `s/MAXPHYS/get-the-optimal-length/' as soon as
 tls-maxphys comes to head.

Except that you want the writes to be preferably aligned to that length,
not just of that length.

David

-- 
David Laight: da...@l8s.co.uk


Re: core statement on fexecve, O_EXEC, and O_SEARCH

2012-12-04 Thread Roland C. Dowdeswell
On Tue, Dec 04, 2012 at 11:42:04PM +0700, Robert Elz wrote:


 Even chroot isn't a problem, unless you're tempted to view it as some
 kind of security mechanism.   It really isn't - it is just namespace
 modification.   Sure, by modifying the filesystem namespace a bunch
 of simple security attacks seem easy to avoid (and it does provide
 some simple measure of protection) but as a true security mechanism
 it really doesn't come close, and arguing against feature X or Y
 because some tricky application of it can defeat chroot security
 is just plain insane.

Let's not lose sight of the fact that chroot can most certainly
compromise security if used improperly even if you are only using
it as a namespace mechanism, though.  So, there are most definitely
security considerations that must be taken into account even if
you think that chroot is not a security mechanism.

--
Roland Dowdeswell  http://Imrryr.ORG/~elric/


Re: Broadcast traffic on vlans leaks into the parent interface on NetBSD-5.1

2012-12-04 Thread John Nemeth
On Apr 22,  5:50pm, Robert Elz wrote:
}
} Date:Thu, 29 Nov 2012 22:54:24 -0500 (EST)
} From:Mouse mo...@rodents-montreal.org
} Message-ID:  201211300354.waa22...@sparkle.rodents-montreal.org
} 
} On the general VLAN topic, I agree with all Dennis said - leave the VLAN tags
} alone and just deal with them.
} 
}   |  I believe every use of BPF by an application to send and receive
}   |  protocol traffic is a signal that something is missing
} 
} I think "was missing" might be a better characterisation.
} 
}   | ...in general, I agree, but in the case of DHCP, I'm not so sure.  It
}   | needs to send and receive packets to and from unusual IPs (0.0.0.0, I
}   | think it is), if nothing else.
} 
} But that's not it, the DHCP server has no real issue with 0 addresses,
} that's the client (the server might need to receive from 0.0.0.0 but
} there's no reason at all for the stack to object to that - sending to
} 0.0.0.0 would be a truly silly desire for any software, including DHCP
} servers).

 Obtaining an address via DHCP is a four step process, and the
client can't legitimately use the new address until the fourth step is
completed.  To what address would you like the DHCP server to send its
responses?  I suppose the DHCP server could send responses to the
broadcast address, but I couldn't guarantee that every client would be
listening for them there (it's been a while since I looked at the
details).

} The missing part used to be (I believe we now have APIs that solve this
} problem) that the DHCP server needs to know which interface the packet
} arrived on - that's vital.  The original BSD API had no way to convey that
} information to the application, other than via BPF, or via binding a socket
} to the specific address of each interface, and inferring the interface
} from the socket the packet arrived on.  The latter is used by some UDP apps
} (including BIND) but is useless for DHCP, as the packets being received
} aren't sent to the server's address, but are broadcast (or multicast for v6).
} 
} As the DHCP server needed to get the interface information, it had to
} go the BPF route.  Once that's written, and works, there's no real reason
} to change it, even given that a better API (or at least an API, by
} definition it is better than the nothing that existed before, even though
} it isn't really a great API) now exists.  Retaining use of the BPF code allows
} dhcpd to work on older systems, and newer ones, without needing config
} options to control which way it works, and duplicate code paths to maintain.

 We use ISC's DHCP server.  As third party software, it is designed
to be portable to many systems.  BPF is a fairly portable interface,
thus a reasonable interface for it to use.

}-- End of excerpt from Robert Elz
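
For reference, a sketch of the newer-API route alluded to above: on
BSD-derived stacks a UDP socket can ask for the receiving interface via
IP_RECVIF ancillary data (illustrative only; dhcpd keeps using BPF for the
portability reasons given):

#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <netinet/in.h>
#include <net/if_dl.h>
#include <string.h>

/*
 * Before calling this, enable the option once:
 *	int on = 1;
 *	setsockopt(s, IPPROTO_IP, IP_RECVIF, &on, sizeof(on));
 */
static ssize_t
recv_with_ifindex(int s, void *buf, size_t len, unsigned int *ifindex)
{
	struct iovec iov = { .iov_base = buf, .iov_len = len };
	unsigned char cbuf[CMSG_SPACE(sizeof(struct sockaddr_dl))];
	struct msghdr msg;
	struct cmsghdr *cm;
	ssize_t n;

	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = cbuf;
	msg.msg_controllen = sizeof(cbuf);

	if ((n = recvmsg(s, &msg, 0)) == -1)
		return -1;
	for (cm = CMSG_FIRSTHDR(&msg); cm != NULL;
	    cm = CMSG_NXTHDR(&msg, cm)) {
		if (cm->cmsg_level == IPPROTO_IP &&
		    cm->cmsg_type == IP_RECVIF) {
			const struct sockaddr_dl *sdl =
			    (const void *)CMSG_DATA(cm);
			*ifindex = sdl->sdl_index; /* receiving interface */
		}
	}
	return n;
}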


Re: wapbl_flush() speedup

2012-12-04 Thread J. Hannken-Illjes
On Dec 4, 2012, at 10:11 PM, David Laight da...@l8s.co.uk wrote:

 On Tue, Dec 04, 2012 at 09:53:11PM +0100, J. Hannken-Illjes wrote:
 
 On Dec 4, 2012, at 8:10 PM, Michael van Elst mlel...@serpens.de wrote:
 
 hann...@eis.cs.tu-bs.de (J. Hannken-Illjes) writes:
 
 The attached diff tries to coalesce writes to the journal in MAXPHYS
 sized and aligned blocks.
 [...]
 Comments or objections anyone?
 
 + * Write data to the log.
 + * Try to coalesce writes and emit MAXPHYS aligned blocks.
 
 Looks fine, but I would prefer the code to use an arbitrarily sized
 buffer in case we get individual per device transfer limits. Currently
 that size would still be MAXPHYS, but then the code could query the driver
 for a proper size.
 
 As `struct wapbl' is per-mount and I suppose this will be per-mount-static
 it will be just a small `s/MAXPHYS/get-the-optimal-length/' as soon as
 tls-maxphys comes to head.
 
 Except that you want the writes to be preferably aligned to that length,
 not just of that length.



That is exactly how the diff works:

/*
 * Remaining space so this buffer ends on a MAXPHYS boundary.
 */
resid = MAXPHYS - dbtob(wl->wl_buffer_addr % btodb(MAXPHYS)) -
wl->wl_buffer_len;

aligns the end of the write to MAXPHYS.  So the first write will be
of size <= MAXPHYS, the following writes will be of size MAXPHYS and
aligned to MAXPHYS, and the last write will be of size <= MAXPHYS.
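
A self-contained sketch of that boundary arithmetic, with dbtob/btodb
reduced to 512-byte shifts and MAXPHYS assumed to be 64k (names mirror the
patch but the code is illustrative, not the diff itself):

#include <sys/types.h>
#include <string.h>

#define MAXPHYS		(64 * 1024)	/* assumed transfer limit */
#define dbtob(x)	((x) << 9)	/* disk blocks -> bytes */
#define btodb(x)	((x) >> 9)	/* bytes -> disk blocks */

/*
 * Append log data to the staging buffer, but never past the next
 * MAXPHYS boundary on disk; the caller flushes the buffer and retries
 * with the remainder.  Returns the number of bytes consumed.
 */
static size_t
buffer_append(char *buf, daddr_t buffer_addr, size_t *buffer_len,
    const void *data, size_t len)
{
	size_t resid;

	/* Space left before the buffered region crosses a MAXPHYS boundary. */
	resid = MAXPHYS - dbtob(buffer_addr % btodb(MAXPHYS)) - *buffer_len;
	if (len > resid)
		len = resid;
	memcpy(buf + *buffer_len, data, len);
	*buffer_len += len;
	return len;
}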

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)