Re: core statement on fexecve, O_EXEC, and O_SEARCH
m...@netbsd.org (Emmanuel Dreyfus) writes:

> Alan Barrett a...@netbsd.org wrote:
> > The fexecve function could be implemented entirely in libc, via
> > execve(2) on a file name of the form /proc/self/fd/N.  Any security
> > concerns around fexecve() also apply to exec of /proc/self/fd/N.
>
> I gave a try to this approach.  There is an unexpected issue: for a
> reason I cannot figure out, namei() does not resolve /proc/self/fd/N.
> Here is a ktrace:
>
>    810      1 t_fexecve CALL  open(0x8048db6,0,0)
>    810      1 t_fexecve NAMI  /usr/bin/touch
>    810      1 t_fexecve RET   open 3
>    810      1 t_fexecve CALL  getpid
>    810      1 t_fexecve RET   getpid 810/0x32a, 924/0x39c
>    810      1 t_fexecve CALL  execve(0xbfbfe66f,0xbfbfea98,0xbfbfeaa4)
>    810      1 t_fexecve NAMI  /proc/self/fd/3
>    810      1 t_fexecve RET   execve -1 errno 2 No such file or directory

The descriptor is probably already closed on exec before the syscall
tries to use it.

-- 
                                Michael van Elst
Internet: mlel...@serpens.de
                                "A potential Snark may lurk in every tree."
Re: core statement on fexecve, O_EXEC, and O_SEARCH
> > > The fexecve function could be implemented entirely in libc, via
> > > execve(2) on a file name of the form /proc/self/fd/N.  Any
> > > security concerns around fexecve() also apply to exec of
> > > /proc/self/fd/N.
> >
> > I gave a try to this approach.  There is an unexpected issue:
>
> The descriptor is probably already closed on exec before the syscall
> tries to use it.

I believe that we should not fix that without a proper design of how
all the parts will work together.  Some questions that I would like to
see answered are:

Should it be possible to exec a fd only if a special flag was used in
the open(2) call?

Should the file's executability be checked at open time or at exec
time, or both, or does it depend on open flags or on what happened to
the fd in between open and exec?

Should the record of the fact that the fd may be eligible for exec be
erased when the fd is passed from one process to another?  Always or
only sometimes?

How can fds obtained from procfs be made to follow the rules?

--apb (Alan Barrett)
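For reference, the libc-only approach under discussion can be sketched as below. The wrapper name is mine, and the /proc/self/fd/N path layout is an assumption about how procfs exposes descriptors; as the ktrace in the previous message shows, namei() currently fails to resolve that path on NetBSD, so this is a sketch of the idea rather than a working implementation:

```c
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical libc-level fexecve(): exec the file behind an already
 * open descriptor by naming it through procfs.  Assumes /proc is
 * mounted and exposes per-process descriptors as /proc/self/fd/N.
 */
int
my_fexecve(int fd, char *const argv[], char *const envp[])
{
	char path[32];

	snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);
	return execve(path, argv, envp);	/* returns only on failure */
}
```

Any security check that applies to fexecve() then has to hold for a plain execve() of that path too, which is exactly the point made above.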
Re: core statement on fexecve, O_EXEC, and O_SEARCH
On Tue, 4 Dec 2012 08:49:17 +0000 (UTC) mlel...@serpens.de (Michael van
Elst) wrote:

> The descriptor is probably already closed on exec before the syscall
> tries to use it.

Nope.  That happens later.  I was looking through this code yesterday
as the topic interests me.  The namei lookup happens pretty early on.
I haven't solved it, but the problem seems to be one of context - if
you try to execve /proc/self you'll also get ENOENT instead of the
expected EACCES.

Julian

-- 
3072D/F3A66B3A Julian Yon (2012 General Use) pgp.2...@jry.me
Re: Problem identified: WAPL/RAIDframe performance problems
On Sat, Dec 01, 2012 at 11:38:55PM -0500, Mouse wrote:
 > >> things.  What I care about is the largest size sector that will
 > >> (in the ordinary course of things anyway) be written atomically.
 > > Then those are 512-byte-sector drives [...]
 > No; because I can do 4K atomic writes, I want to know about that.
 >
 > And, can't you do that with traditional drives, drives which really
 > do have 512-byte sectors?  Do a 4K transfer and you write 8 physical
 > sectors with no opportunity for any other operation to see the write
 > partially done.  Is that wrong, or am I missing something else?

Insert a kernel panic (or power failure(*)) after five sectors and it's
not atomic.  One sector, at least in theory(*), is.

(*) let's ignore for now the various daft things that disks sometimes
do in practice.

-- 
David A. Holland
dholl...@netbsd.org
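The torn-write scenario above can be illustrated with a toy model (the names and sizes are mine, not from any driver): a 4K logical write issued sector by sector that dies after five of eight leaves contents that match neither the old data nor the new:

```c
#include <string.h>

#define SECSIZE	512
#define NSEC	8	/* one 4K logical write = 8 physical sectors */

/*
 * Toy model, not driver code: apply a multi-sector write to 'disk'
 * one 512-byte sector at a time, stopping after 'completed' sectors
 * as a crash or power failure mid-transfer would.
 */
void
torn_write(unsigned char *disk, const unsigned char *newdata, int completed)
{
	int i;

	for (i = 0; i < completed; i++)
		memcpy(disk + i * SECSIZE, newdata + i * SECSIZE, SECSIZE);
}
```

After `torn_write(disk, newdata, 5)` the first five sectors hold new data and the last three hold old data, which is precisely the non-atomic outcome being argued about.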
Re: Problem identified: WAPL/RAIDframe performance problems
On Mon, Dec 03, 2012 at 12:19:58AM +, Julian Yon wrote: You appear to have just agreed with me, which makes me wonder what I'm missing, given you continue as though you disagree. You asked why 4096-byte-sector disks accept 512-byte writes. I was trying to explain. However, we're talking about hardware here, so you have to also consider the possibility that the drive firmware reports 512 because that's what someone coded up back in 1992 and nobody got around to fixing it. If that doesn't count as broken, what does? (Also, gosh, when did 1992 become so long ago?) By this standard, most hardware is broken. -- David A. Holland dholl...@netbsd.org
Re: Problem identified: WAPL/RAIDframe performance problems
On Tue, Dec 04, 2012 at 02:14:27PM +0000, David Holland wrote:
> On Sat, Dec 01, 2012 at 11:38:55PM -0500, Mouse wrote:
>  > >> things.  What I care about is the largest size sector that will
>  > >> (in the ordinary course of things anyway) be written atomically.
>  > > Then those are 512-byte-sector drives [...]
>  > No; because I can do 4K atomic writes, I want to know about that.
>  >
>  > And, can't you do that with traditional drives, drives which really
>  > do have 512-byte sectors?  Do a 4K transfer and you write 8
>  > physical sectors with no opportunity for any other operation to see
>  > the write partially done.  Is that wrong, or am I missing something
>  > else?
>
> Insert a kernel panic (or power failure(*)) after five sectors and

What's a kernel panic got to do with it?  If you hand the controller
and thus the drive a 4K write, the kernel panicking won't suddenly
cause you to reverse time and have issued 8 512-byte writes instead.

Given how drives actually write data, I would not be so sanguine that
any sector, of whatever size, in-flight when the power fails, is
actually written with the values you expect, or not written at all.
Re: Problem identified: WAPL/RAIDframe performance problems
On Tue, Dec 04, 2012 at 09:26:17AM -0500, Thor Lancelot Simon wrote:
 > > > And, can't you do that with traditional drives, drives which
 > > > really do have 512-byte sectors?  Do a 4K transfer and you write
 > > > 8 physical sectors with no opportunity for any other operation
 > > > to see the write partially done.  Is that wrong, or am I missing
 > > > something else?
 > >
 > > Insert a kernel panic (or power failure(*)) after five sectors and
 >
 > What's a kernel panic got to do with it?  If you hand the controller
 > and thus the drive a 4K write, the kernel panicking won't suddenly
 > cause you to reverse time and have issued 8 512-byte writes instead.

That depends on additional properties of the pathway from the FS to the
drive firmware.  It might have sent 1 of 2 2048-byte writes before the
panic, for example.  Or it might be a vintage controller incapable of
handling more than one sector at a time.

Also, if there's a panic while the kernel is in the middle of talking
to the drive, such that the drive receives only part of the data you
intended to send, one can be reasonably certain it will reject a
partial sector... but if it's received 5 of 8 physical sectors and the
6th is partial, it may well write out those 5, which isn't what was
intended.

 > Given how drives actually write data, I would not be so sanguine
 > that any sector, of whatever size, in-flight when the power fails,
 > is actually written with the values you expect, or not written at
 > all.

Yes, I'm aware of that.  It remains a useful approximation, especially
for already-existing FS code.

-- 
David A. Holland
dholl...@netbsd.org
Re: FFS write coalescing
On Tue, Dec 04, 2012 at 09:59:46AM +0300, Alan Barrett wrote:
 > the genfs code also never writes clean pages to disk, even though for
 > RAID5 storage it would likely be more efficient to write clean pages
 > that are in the same stripe as dirty pages if that would avoid
 > issuing partial-stripe writes.  (which is basically another way of
 > saying what david said.)
 >
 > Perhaps there should be a way for block devices to report at least
 > three block sizes:
 >
 > a) smallest possible block size (512 for almost all disks)
 > b) smallest efficient block size and alignment (4k for modern disks,
 >    stripe size for raid)
 > c) largest possible size (a device and bus-dependent variant of
 >    MAXPHYS)
 >
 > Then the file system could use (b) to know when it's a good idea to
 > combine dirty and clean pages into the same write.

As I was saying in the other thread, what filesystems really want to
know is the atomic write size.  E.g. in ffs this affects the way
directories are laid out and is necessary (AFAIK including with wapbl)
for ~safe operation.  This is not (a), and as far as I know it is also
not (b); see below.

I don't see (a) as useful.  It is conceivable that a journaled FS might
want to know about it to allow packing journal records as tightly as
possible, but doing so is rather dubious from a recovery POV: the point
of flushing a journal is to get it physically onto disk safely, and if
you later let the disk rewrite part of what you thought was safely on
disk, it might cease to be safely on disk and break your recovery
scheme.

What guarantees do we actually get in practice for RAID5?  Do you have
to commit journals in units of a whole stripe or stripe group to avoid
having them rewritten unsafely later?  Or is the parity logging code
sufficient to make that safe?  This matters for wapbl...

-- 
David A. Holland
dholl...@netbsd.org
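The three sizes proposed in the quoted message could be carried in something like the struct below; every name here is illustrative, not an existing kernel interface. The helper shows how a file system might round a write up to the efficient size (b) without exceeding the maximum (c):

```c
/*
 * Hypothetical per-device transfer limits, following Alan Barrett's
 * (a)/(b)/(c) list.  Field names are made up for this sketch.
 */
struct blk_limits {
	unsigned min_xfer;	/* (a) smallest possible block, e.g. 512 */
	unsigned opt_xfer;	/* (b) smallest efficient block/alignment:
				 * 4k on modern disks, stripe size on RAID */
	unsigned max_xfer;	/* (c) largest possible transfer, a
				 * MAXPHYS-like device/bus-dependent cap */
};

/* Round len up to a multiple of (b), capped at (c). */
unsigned
efficient_len(const struct blk_limits *bl, unsigned len)
{
	unsigned r = (len + bl->opt_xfer - 1) / bl->opt_xfer * bl->opt_xfer;

	return r > bl->max_xfer ? bl->max_xfer : r;
}
```

With limits of 512/4096/65536, a 1000-byte request would be padded to one efficient block (4096), while an oversized request is clamped to the device maximum. Note this deliberately says nothing about the atomic write size, which is the separate property argued for above.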
Re: core statement on fexecve, O_EXEC, and O_SEARCH
On Tue, Dec 04, 2012 at 01:58:13PM +0000, Julian Yon wrote:
 > > The descriptor is probably already closed on exec before the
 > > syscall tries to use it.
 >
 > Nope.  That happens later.  I was looking through this code yesterday
 > as the topic interests me.  The namei lookup happens pretty early on.
 > I haven't solved it, but the problem seems to be one of context - if
 > you try to execve /proc/self you'll also get ENOENT instead of the
 > expected EACCES.

That doesn't make much sense... nor does the procfs_lookup code shed
any significant amount of light on it.

-- 
David A. Holland
dholl...@netbsd.org
Re: Making forced unmounts work
On Sun, Dec 02, 2012 at 05:29:01PM +0100, J. Hannken-Illjes wrote:
 > I'm convinced -- having fstrans_start() return ERESTART is the way
 > to go.

Ok then :-)

 > > Also I wonder if there's any way to accomplish this that doesn't
 > > require adding fstrans calls to every operation in every fs.
 >
 > Not in a clean way.  We would need some kind of reference counting
 > for vnode operations and that is quite impossible as vnode
 > operations on devices or fifos sometimes wait forever and are called
 > from other fs like ufsspec_read() for example.  How could we protect
 > UFS updating access times here?

I'm not entirely convinced of that.  There are basically three
problems: (a) new incoming threads, (b) threads that are already in the
fs and running, and (c) threads that are already in the fs and that are
stuck more or less permanently because something broke.

 > > Admittedly I don't really understand how fstrans suspending works.
 > > Does it keep track of all the threads that are in the fs, so the
 > > (b) ones can be interrupted somehow, or so we at least can wait
 > > until all of them either leave the fs or enter fstrans somewhere
 > > and stall?
 >
 > Fstrans is a recursive rw-lock.  Vnode operations take the reader
 > lock on entry.  To suspend a fs we take the writer lock and
 > therefore it will catch both (a) and (b).  New threads will block on
 > first entry, threads already inside the fs will continue until they
 > leave the fs and block on next entry.

Ok, fair enough, but...

 > A suspended fs has the guarantee that no other thread will be inside
 > fstrans_suspend / fstrans_done of any vnode operation.  Threads
 > stuck permanently as in (c) are impossible to catch.

...doesn't that mean the (c) threads are going to be holding read
locks, so the suspend will hang forever?  We have to assume there will
be at least one such thread, as people don't generally attempt
umount -f unless something's wedged.
 > > If we're going to track that information we should really do it
 > > from vnode_if.c, both to avoid having to modify every fs and to
 > > make sure all fses support it correctly.  (We also need to be
 > > careful about how it's done to avoid causing massive lock
 > > contention; that's why such logic doesn't already exist.)
 >
 > Some cases are nearly impossible to track at the vnode_if.c level:
 >
 > - Both VOP_GETPAGES() and VOP_PUTPAGES() cannot block or sleep here
 >   as they run part or all of the operation with vnode interlock held.

Ugh.  But this doesn't, for example, preclude atomically incrementing a
per-cpu counter or setting a flag in the lwp structure.

 > - Accessing devices and fifos through a file system cannot be
 >   tracked at the vnode_if.c level.  Take ufs_spec_read() for
 >   example:
 >
 >	fstrans_start(...);
 >	VTOI(vp)->i_flag |= IN_ACCESS;
 >	fstrans_done(...);
 >	return VOCALL(spec_vnodeop_p, ...);
 >
 >   Here the VOCALL may sleep forever if the device has no data.

This I don't understand.  To the extent it's in the fs while it's doing
this, vnode_if.c can know that, because it called ufs_spec_read and
ufs_spec_read hasn't returned yet.  To the extent that it's in specfs,
specfs won't itself ever be umount -f'd, since it isn't mounted, so
there's no danger.  If umount -f currently blats over the data that
specfs uses, then it's currently not safe, but I don't see how it's
different from any other similar case with ufs data.

I expect I'm missing something.

-- 
David A. Holland
dholl...@netbsd.org
Re: core statement on fexecve, O_EXEC, and O_SEARCH
    Date:        Tue, 4 Dec 2012 15:44:47 +0300
    From:        Alan Barrett a...@cequrux.com
    Message-ID:  20121204124447.gf8...@apb-laptoy.apb.alt.za

  | The fexecve function could be implemented entirely in libc,
  | via execve(2) on a file name of the form /proc/self/fd/N.
  | Any security concerns around fexecve() also apply to exec of
  | /proc/self/fd/N.

  | I gave a try to this approach. There is an unexpected issue:
  | The descriptor is probably already closed on exec before the
  | syscall tries to use it.

I doubt that is the problem.  I took a quick look, and couldn't see the
cause either, but I suspect something related to the files really not
truly existing, but just appearing when they're referenced (or on a
directory read) is the root cause of the problem.

  | I believe that we should not fix that without a proper design
  | of how all the parts will work together.

First, I'm not sure it is really worth fixing at all; this doesn't seem
to be a particularly big problem in reality.  But, that said, if a file
exists, has x permission, and there's something executable behind it,
then exec should work on it, and there really should be no need to look
further than that.

  | Some questions that I would like to see answered are: Should it
  | be possible to exec a fd only if a special flag was used in the
  | open(2) call?

The question here isn't really execing a fd, it's execing a named file
(that happens to refer to an open fd, but that shouldn't be important).

  | Should the file's executability be checked at open
  | time or at exec time, or both,

For this use, at exec time; the fd refers to a file, and this should
not be different from an exec of a symlink.  We just have a slightly
different way of getting a reference to the file to be exec'd.

Half of the issues here relate to how people see chroot(2), I think.
If we ignored chroot completely, most of what are seen as problems
would go away, and almost all of the issues could be resolved.
Even chroot isn't a problem, unless you're tempted to view it as some
kind of security mechanism.  It really isn't - it is just namespace
modification.  Sure, by modifying the filesystem namespace a bunch of
simple security attacks seem easy to avoid (and it does provide some
simple measure of protection) but as a true security mechanism it
really doesn't come close, and arguing against feature X or Y because
some tricky application of it can defeat chroot security is just plain
insane.

If true compartmentalisation is wanted for security purposes, we would
need something approaching true VM (like Xen DomU's or whatever) where
the whole environment can be protected, not just the filesystem.
chroot provides no protection at all to processes killing others,
reconfiguring the network, binding to random ports, thrashing the CPU,
altering sysctl data, rebooting the system, ...  There's almost no end
to the harm that a sufficiently inspired (and privileged) rogue process
can do, even if it is running in a chroot.

If we were willing to abandon the fiction that chroot is some kind of
security mechanism, then we can mostly just ignore it for almost all
purposes, and not worry about whether it would be possible to exec a
file via a fd passed through a socket to a process inside a chroot, and
all of that nonsense, as no-one would care one way or the other.

kre
Re: core statement on fexecve, O_EXEC, and O_SEARCH
On Tue, 4 Dec 2012 15:30:36 +0000 David Holland
dholland-t...@netbsd.org wrote:

> On Tue, Dec 04, 2012 at 01:58:13PM +0000, Julian Yon wrote:
>  > > The descriptor is probably already closed on exec before the
>  > > syscall tries to use it.
>  >
>  > Nope.  That happens later.  I was looking through this code
>  > yesterday as the topic interests me.  The namei lookup happens
>  > pretty early on.  I haven't solved it, but the problem seems to be
>  > one of context - if you try to execve /proc/self you'll also get
>  > ENOENT instead of the expected EACCES.
>
> That doesn't make much sense... nor does the procfs_lookup code shed
> any significant amount of light on it.

It's weird, isn't it?  I've been staring at the code wondering what I'm
missing.  Regardless of the pros & cons of being able to exec a fd, I
can't see how this inconsistency is correct behaviour.

-- 
3072D/F3A66B3A Julian Yon (2012 General Use) pgp.2...@jry.me
Re: Problem identified: WAPL/RAIDframe performance problems
On Tue, Dec 04, 2012 at 02:57:52PM +0000, David Holland wrote:
> > What's a kernel panic got to do with it?  If you hand the controller
> > and thus the drive a 4K write, the kernel panicking won't suddenly
> > cause you to reverse time and have issued 8 512-byte writes instead.
>
> That depends on additional properties of the pathway from the FS to
> the drive firmware.  It might have sent 1 of 2 2048-byte writes
> before the panic, for example.  Or it might be a vintage controller
> incapable of handling more than one sector at a time.

The ATA command set supports writes of multiple sectors and
multi-sector writes (probably not using those terms though!).

In the first case, although a single command is written the drive will
(effectively) loop through the sectors writing them 1 by 1.  All drives
support this mode.

For multi-sector writes, the data transfer for each group of sectors is
done as a single burst.  So if the drive supports 8 sector multi-sector
writes, and you are doing PIO transfers, you take a single 'data'
interrupt and then write all 4k bytes at once (assuming 512 byte
sectors).  The drive identify response indicates whether multi-sector
writes are supported, and if so how many sectors can be written at
once.  If the data transfer is DMA, it probably makes little difference
to the driver.

For quite a long time the netbsd ata driver mixed them up - and would
only request writes of multiple sectors if the drive supported
multi-sector writes.

Multi-sector writes are probably quite difficult to kill part way
through since there is only one DMA transfer block.

> > Given how drives actually write data, I would not be so sanguine
> > that any sector, of whatever size, in-flight when the power fails,
> > is actually written with the values you expect, or not written at
> > all.
>
> Yes, I'm aware of that.  It remains a useful approximation,
> especially for already-existing FS code.
Given that (AFAIK) a physical sector is not dissimilar from an hdlc
frame, once the write has started the old data is gone; if the write is
actually interrupted you'll get a (correctable) bad sector.

If you are really unlucky the write will be long - and trash the
following sector (I managed to power off a floppy controller before it
wrecked the rest of a track when I'd reset the writer with write
enabled).  If you are really, really unlucky I think it is possible to
destroy adjacent tracks.

	David

-- 
David Laight: da...@l8s.co.uk
Re: wapbl_flush() speedup
hann...@eis.cs.tu-bs.de (J. Hannken-Illjes) writes: The attached diff tries to coalesce writes to the journal in MAXPHYS sized and aligned blocks. [...] Comments or objections anyone? + * Write data to the log. + * Try to coalesce writes and emit MAXPHYS aligned blocks. Looks fine, but I would prefer the code to use an arbitrarily sized buffer in case we get individual per device transfer limits. Currently that size would still be MAXPHYS, but then the code could query the driver for a proper size. -- -- Michael van Elst Internet: mlel...@serpens.de A potential Snark may lurk in every tree.
Re: wapbl_flush() speedup
On Tue, Dec 04, 2012 at 07:10:47PM +0000, Michael van Elst wrote:
> hann...@eis.cs.tu-bs.de (J. Hannken-Illjes) writes:
> > The attached diff tries to coalesce writes to the journal in
> > MAXPHYS sized and aligned blocks. [...] Comments or objections
> > anyone?
> >
> > + * Write data to the log.
> > + * Try to coalesce writes and emit MAXPHYS aligned blocks.
>
> Looks fine, but I would prefer the code to use an arbitrarily sized
> buffer in case we get individual per device transfer limits.
> Currently that size would still be MAXPHYS, but then the code could
> query the driver for a proper size.

In fact, this patch and the discussion that led up to it have me
planning to replace the simple per-device maximum transfer length
currently implemented on the tls-maxphys branch with at least three
different properties:

* An upper bound (the maximum transfer length)
* A lower bound (the minimum atomic transfer length)
* An optimal transfer alignment

Exposing all these to the right code elsewhere in the kernel won't be
so easy, but I'm working on it (slowly).

Thor
Re: wapbl_flush() speedup
On Dec 4, 2012, at 8:10 PM, Michael van Elst mlel...@serpens.de wrote: hann...@eis.cs.tu-bs.de (J. Hannken-Illjes) writes: The attached diff tries to coalesce writes to the journal in MAXPHYS sized and aligned blocks. [...] Comments or objections anyone? + * Write data to the log. + * Try to coalesce writes and emit MAXPHYS aligned blocks. Looks fine, but I would prefer the code to use an arbitrarily sized buffer in case we get individual per device transfer limits. Currently that size would still be MAXPHYS, but then the code could query the driver for a proper size. As `struct wapbl' is per-mount and I suppose this will be per-mount-static it will be just a small `s/MAXPHYS/get-the-optimal-length/' as soon as tls-maxphys comes to head. -- J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Re: wapbl_flush() speedup
On Tue, Dec 04, 2012 at 09:53:11PM +0100, J. Hannken-Illjes wrote: On Dec 4, 2012, at 8:10 PM, Michael van Elst mlel...@serpens.de wrote: hann...@eis.cs.tu-bs.de (J. Hannken-Illjes) writes: The attached diff tries to coalesce writes to the journal in MAXPHYS sized and aligned blocks. [...] Comments or objections anyone? + * Write data to the log. + * Try to coalesce writes and emit MAXPHYS aligned blocks. Looks fine, but I would prefer the code to use an arbitrarily sized buffer in case we get individual per device transfer limits. Currently that size would still be MAXPHYS, but then the code could query the driver for a proper size. As `struct wapbl' is per-mount and I suppose this will be per-mount-static it will be just a small `s/MAXPHYS/get-the-optimal-length/' as soon as tls-maxphys comes to head. Except that you want the writes to be preferably aligned to that length, not just of that length. David -- David Laight: da...@l8s.co.uk
Re: core statement on fexecve, O_EXEC, and O_SEARCH
On Tue, Dec 04, 2012 at 11:42:04PM +0700, Robert Elz wrote:
> Even chroot isn't a problem, unless you're tempted to view it as some
> kind of security mechanism.  It really isn't - it is just namespace
> modification.  Sure, by modifying the filesystem namespace a bunch of
> simple security attacks seem easy to avoid (and it does provide some
> simple measure of protection) but as a true security mechanism it
> really doesn't come close, and arguing against feature X or Y because
> some tricky application of it can defeat chroot security is just
> plain insane.

Let's not lose sight of the fact that chroot can most certainly
compromise security if used improperly even if you are only using it as
a namespace mechanism, though.  So, there are most definitely security
considerations that must be taken into account even if you think that
chroot is not a security mechanism.

-- 
Roland Dowdeswell                  http://Imrryr.ORG/~elric/
Re: Broadcast traffic on vlans leaks into the parent interface on NetBSD-5.1
On Apr 22, 5:50pm, Robert Elz wrote: } } Date:Thu, 29 Nov 2012 22:54:24 -0500 (EST) } From:Mouse mo...@rodents-montreal.org } Message-ID: 201211300354.waa22...@sparkle.rodents-montreal.org } } On the general VLAn topic, I agree with all Dennis said - leave the VLAN tags } alone and just deal with them. } } | I believe every use of BPF by an application to send and receive } | protocol traffic is a signal that something is missing } } I think was missing might be a better characterisation. } } | ...in general, I agree, but in the case of DHCP, I'm not so sure. It } | needs to send and receive packets to and from unusual IPs (0.0.0.0, I } | think it is), if nothing else. } } But that's not it, the DHCP server has no real issue with 0 addresses, } that's the client (the server might need to receive from 0.0.0.0 but } there's no reason at all for the stack to object to that - sending to } 0.0.0.0 would be a truly silly desire for any software, including DHCP } servers). Obtaining an address via DHCP is a four step process, and the client can't legitimately use the new address until the fourth step is completed. To what address would you like the DHCP server to send its responses? I suppose the DHCP server could send responses to the broadcast address, but I couldn't guarantee that every client would be listening for them there (it's been a while since I looked at the details). } The missing part used to be (I believe we now have APIs that solve this } probem) that the DHCP server needs to know which interface the packet } arrived on - that's vital. The original BSD API had no way to convey that } information to the application, other than via BPF, or via binding a socket } to the specific address of each interface, and inferring the interface } from the socket the packet arrived on. The latter is used by some UDP apps } (including BIND) but is useless for DHCP, as the packets being received } aren't sent to the server's address, but are broadcast (or multicast for v6). 
} } As the DHCP server needed to get the interface information, it had to } go the BPF route. Once that's written, and works, there's no real reason } to change it, even given that a better API (or at least an API, by } definition it is better than the nothing that existed before, even though } it isn't really a great API) now exists. Retaining use of the BPF code allows } dhcpd to work on older systems, and newer ones, without needing config } options to control which way it works, and duplicate code paths to maintain. We use ISC's DHCP server. As third party software, it is designed to be portable to many systems. BPF is a fairly portable interface, thus a reasonable interface for it to use. }-- End of excerpt from Robert Elz
Re: wapbl_flush() speedup
On Dec 4, 2012, at 10:11 PM, David Laight da...@l8s.co.uk wrote:

> On Tue, Dec 04, 2012 at 09:53:11PM +0100, J. Hannken-Illjes wrote:
> > As `struct wapbl' is per-mount and I suppose this will be
> > per-mount-static it will be just a small
> > `s/MAXPHYS/get-the-optimal-length/' as soon as tls-maxphys comes
> > to head.
>
> Except that you want the writes to be preferably aligned to that
> length, not just of that length.

That is exactly how the diff works:

	/*
	 * Remaining space so this buffer ends on a MAXPHYS boundary.
	 */
	resid = MAXPHYS - dbtob(wl->wl_buffer_addr % btodb(MAXPHYS)) -
	    wl->wl_buffer_len;

aligns the end of the write to MAXPHYS.  So the first write will be of
size <= MAXPHYS, the following writes will be of size MAXPHYS and
aligned to MAXPHYS and the last write will be of size <= MAXPHYS.

-- 
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)
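The boundary arithmetic in that snippet can be checked in isolation. This model works in bytes rather than disk blocks (so the dbtob/btodb conversions drop out) and uses an illustrative MAXPHYS value; the function name is mine:

```c
#define MAXPHYS	(64 * 1024)	/* illustrative; machine-dependent */

/*
 * Space remaining so that a buffer starting at byte offset 'off',
 * with 'len' bytes already queued, ends exactly on the next MAXPHYS
 * boundary.  Filling up to this residual each time makes every
 * interior write MAXPHYS-sized and MAXPHYS-aligned; only the first
 * and last writes of a flush may be short.
 */
unsigned long
buffer_resid(unsigned long off, unsigned long len)
{
	return MAXPHYS - off % MAXPHYS - len;
}
```

For example, a buffer starting at offset 4096 with nothing queued gets a 61440-byte residual, so its first (short) write ends exactly on the 64k boundary and subsequent writes proceed full-size and aligned.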