Re: [PATCH] ARM: ftrace: Ensure code modifications are synchronised across all cpus
Steven Rostedt wrote: > > Yes, and I think if you do use two 16-bit nops, you can even get rid of all > > the intermediate `sync' operations (I guess you might want one at the end if > > you want the call to become visible at a particular point). > > Wont work. We are replacing a 32bit call with a nop. That nop must also > be 32bits, because we could eventually replace the nop(s) with a 32bit > call. Basically, we can never allow the second 16bit part ever be the > next instruction. If the first 16bit nop is executed, and then the task > gets preempted. The nops get converted to a 32bit call. The task gets > scheduled again and now is executing the second 16bits of the 32bit call > and we get unexpected (probably crashing) results. > > By having either a 16bit breakpoint whose handler returns after the > second 16bit part, or a 16bit jump that simply jumps over the second > half, then all this should work. When the CPU processes a 32bit > instruction, it either processes all or non of it, correct? Sounds good, except what Will wrote a few days ago: On Fri, 2012-12-07 at 19:02 +, Will Deacon wrote: > For ARMv7, there are small subsets of instructions for ARM and Thumb which > are guaranteed to be atomic wrt concurrent modification and execution of > the instruction stream between different processors: > > Thumb: The 16-bit encodings of the B, NOP, BKPT, and SVC instructions. > ARM:The B, BL, NOP, BKPT, SVC, HVC, and SMC instructions. Thumb 32-bit ftrace call isn't in the above list. Questions: does the above concurrent modification guarantee require both the old instruction _and_ the new one to be among those listed, or is it enough to be just the new one (for example when setting a normal software breakpoint, that would be useful)? Can it be the old one and not the new (for example when removing a software breakpoint, that would be useful)? Does that subset mean replacing any of the listed instructions by any of the others is ok, or any of the listed with another of the same type? (I guess as a matter of architecture design, it makes sense to guarantee only a short list, because of occasions when the hardware, or a software emulation through traps, or a simulation, might read the instruction memory more than once.) This is what makes me wonder, if it's safe to replace the 32-bit mcount call with a 16-bit short jump: > On Mon, Dec 10, 2012 at 11:04:05AM +, Jon Medhurst (Tixy) wrote: > > So this means for things like kprobes which can modify arbitrary kernel > > code we are going to need to continue to always use some form of > > stop_the_whole_system() function? > > > > Also, kprobes currently uses patch_text() which only uses stop_machine > > for Thumb2 instructions which straddle a word boundary, so this needs > > changing? Will Deacon replied: > Yes; if you're modifying instructions other than those mentioned above, then > you'll need to synchronise the CPUs, update the instructions, perform > cache-maintenance on the writing CPU and then execute an isb on the > executing core (this last bit isn't needed if you're going to go through an > exception return to get back to the new code -- depends on how your > stop/resume code works). If I've understood that exchange, it implies that using patch_text() to replace an instruction not in the list of special ones, with a trap or jump, isn't ok? And so it's ok to replace the NOP with a short branch (since 16-bit "B" is in the list), but it's not ok to replace 16-bit "B" with the 32-bit ftrace call; and the same going the other way? Best, -- Jamie -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] ARM: ftrace: Ensure code modifications are synchronised across all cpus
Steven Rostedt wrote: Yes, and I think if you do use two 16-bit nops, you can even get rid of all the intermediate `sync' operations (I guess you might want one at the end if you want the call to become visible at a particular point). Wont work. We are replacing a 32bit call with a nop. That nop must also be 32bits, because we could eventually replace the nop(s) with a 32bit call. Basically, we can never allow the second 16bit part ever be the next instruction. If the first 16bit nop is executed, and then the task gets preempted. The nops get converted to a 32bit call. The task gets scheduled again and now is executing the second 16bits of the 32bit call and we get unexpected (probably crashing) results. By having either a 16bit breakpoint whose handler returns after the second 16bit part, or a 16bit jump that simply jumps over the second half, then all this should work. When the CPU processes a 32bit instruction, it either processes all or non of it, correct? Sounds good, except what Will wrote a few days ago: On Fri, 2012-12-07 at 19:02 +, Will Deacon wrote: For ARMv7, there are small subsets of instructions for ARM and Thumb which are guaranteed to be atomic wrt concurrent modification and execution of the instruction stream between different processors: Thumb: The 16-bit encodings of the B, NOP, BKPT, and SVC instructions. ARM:The B, BL, NOP, BKPT, SVC, HVC, and SMC instructions. Thumb 32-bit ftrace call isn't in the above list. Questions: does the above concurrent modification guarantee require both the old instruction _and_ the new one to be among those listed, or is it enough to be just the new one (for example when setting a normal software breakpoint, that would be useful)? Can it be the old one and not the new (for example when removing a software breakpoint, that would be useful)? Does that subset mean replacing any of the listed instructions by any of the others is ok, or any of the listed with another of the same type? (I guess as a matter of architecture design, it makes sense to guarantee only a short list, because of occasions when the hardware, or a software emulation through traps, or a simulation, might read the instruction memory more than once.) This is what makes me wonder, if it's safe to replace the 32-bit mcount call with a 16-bit short jump: On Mon, Dec 10, 2012 at 11:04:05AM +, Jon Medhurst (Tixy) wrote: So this means for things like kprobes which can modify arbitrary kernel code we are going to need to continue to always use some form of stop_the_whole_system() function? Also, kprobes currently uses patch_text() which only uses stop_machine for Thumb2 instructions which straddle a word boundary, so this needs changing? Will Deacon replied: Yes; if you're modifying instructions other than those mentioned above, then you'll need to synchronise the CPUs, update the instructions, perform cache-maintenance on the writing CPU and then execute an isb on the executing core (this last bit isn't needed if you're going to go through an exception return to get back to the new code -- depends on how your stop/resume code works). If I've understood that exchange, it implies that using patch_text() to replace an instruction not in the list of special ones, with a trap or jump, isn't ok? And so it's ok to replace the NOP with a short branch (since 16-bit B is in the list), but it's not ok to replace 16-bit B with the 32-bit ftrace call; and the same going the other way? Best, -- Jamie -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 22/25] Generic dynamic per cpu refcounting
Kent Overstreet wrote: > On Thu, Nov 29, 2012 at 09:54:47PM +0100, Andi Kleen wrote: > > > > The regular atomic_t is limited in ways that you are not. > > > > See my original mail. > > > > > > I don't follow, can you explain? > > > > For most cases the reference count is tied to some object, which are > > naturally limited by memory size or other physical resources. > > > > But in the assymetric CPU case with your ref count no such limiter > > exists. > > It's got exactly the same limit as the old code which used the atomic_t > - we're limited by the number of threads that can be issuing aio > syscalls at a time. > > The assymetry you're talking about _doesn't matter_, individual cpu > counters wrapping does not affect what the counters all sum to when we > go to tear down. > > A coworker at lunch actually pointed out to me that the reason this is > true is just that modular arithmatic is still associative with addition > and subtraction. It's just like jiffies. Everyone understands jiffies arithmetic I hope. -- Jamie -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 22/25] Generic dynamic per cpu refcounting
Kent Overstreet wrote: On Thu, Nov 29, 2012 at 09:54:47PM +0100, Andi Kleen wrote: The regular atomic_t is limited in ways that you are not. See my original mail. I don't follow, can you explain? For most cases the reference count is tied to some object, which are naturally limited by memory size or other physical resources. But in the assymetric CPU case with your ref count no such limiter exists. It's got exactly the same limit as the old code which used the atomic_t - we're limited by the number of threads that can be issuing aio syscalls at a time. The assymetry you're talking about _doesn't matter_, individual cpu counters wrapping does not affect what the counters all sum to when we go to tear down. A coworker at lunch actually pointed out to me that the reason this is true is just that modular arithmatic is still associative with addition and subtraction. It's just like jiffies. Everyone understands jiffies arithmetic I hope. -- Jamie -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] page-table walkers vs memory order
Paul E. McKenney wrote: > > Does some version of gcc, under the options which we insist upon, > > make such optimizations on any of the architectures which we support? > > Pretty much any production-quality compiler will do double-fetch > and old-value-reuse optimizations, the former especially on 32-bit > x86. I don't know of any production-quality compilers that do value > speculation, which would make the compiler act like DEC Alpha hardware, > and I would hope that if this does appear, (1) we would have warning > and (2) it could be turned off. But there has been a lot of work on > this topic, so we would be foolish to rule it out. GCC documentation for IA-64: -msched-ar-data-spec -mno-sched-ar-data-spec (En/Dis)able data speculative scheduling after reload. This results in generation of ld.a instructions and the corresponding check instructions (ld.c / chk.a). The default is 'enable'. I don't know if that results in value speculation of the relevant kind. -- Jamie -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] page-table walkers vs memory order
Paul E. McKenney wrote: Does some version of gcc, under the options which we insist upon, make such optimizations on any of the architectures which we support? Pretty much any production-quality compiler will do double-fetch and old-value-reuse optimizations, the former especially on 32-bit x86. I don't know of any production-quality compilers that do value speculation, which would make the compiler act like DEC Alpha hardware, and I would hope that if this does appear, (1) we would have warning and (2) it could be turned off. But there has been a lot of work on this topic, so we would be foolish to rule it out. GCC documentation for IA-64: -msched-ar-data-spec -mno-sched-ar-data-spec (En/Dis)able data speculative scheduling after reload. This results in generation of ld.a instructions and the corresponding check instructions (ld.c / chk.a). The default is 'enable'. I don't know if that results in value speculation of the relevant kind. -- Jamie -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Proposal for "proper" durable fsync() and fdatasync()
Jörn Engel wrote: > On Tue, 26 February 2008 15:28:10 +0000, Jamie Lokier wrote: > > > > > One interesting aspect of this comes with COW filesystems like btrfs or > > > logfs. Writing out data pages is not sufficient, because those will get > > > lost unless their referencing metadata is written as well. So either we > > > have to call fsync for those filesystems or add another callback and let > > > filesystems override the default implementation. > > > > Doesn't the ->fsync callback get called in the sys_fdatasync() case, > > with appropriate arguments? > > My paragraph above was aimed at the sync_file_range() case. fsync and > fdatasync do the right thing within the limitations you brought up in > this thread. sync_file_range() without further changes will only write > data pages, not the metadata required to actually access those data > pages. This works just fine for non-COW filesystems, which covers all > currently merged ones. > > With COW filesystems it is currently impossible to do sync_file_range() > properly. The problem is orthogonal to your's, I just brought it up > since you were already mentioning sync_file_range(). You're right. Though, doesn't normal page writeback enqueue the COW metadata changes? If not, how do they get written in a timely fashion? -- Jamie -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Proposal for "proper" durable fsync() and fdatasync()
Jeff Garzik wrote: > Nick Piggin wrote: > >Anyway, the idea of making fsync/fdatasync etc. safe by default is > >a good idea IMO, and is a bad bug that we don't do that :( > > Agreed... it's also disappointing that [unless I'm mistaken] you have > to hack each filesystem to support barriers. > > It seems far easier to make sync_blkdev() Do The Right Thing, and > magically make all filesystems data-safe. Well, you need ordered metadata writes, barriers _and_ flushes with some filesystems. Merely writing all the data pages than issuing a drive cache flush won't Do The Right Thing with those filesystems - someone already mentioned Btrfs, where it won't. But I agree that your suggestion would make a superb default, for filesystems which don't provide their own function. It's not optimal even then. Devices: On a software RAID, you ideally don't want to issue flushes to all drives if your database did a 1 block commit entry. (But they probably use O_DIRECT anyway, changing the rules again). But all that can be optimised in generic VFS code eventually. It doesn't need filesystem assistance in most cases. Apps: don't always want a full flush; sometimes a barrier would do. -- Jamie -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Proposal for "proper" durable fsync() and fdatasync()
Ric Wheeler wrote: > >>I was surprised that fsync() doesn't do this already. There was a lot > >>of effort put into block I/O write barriers during 2.5, so that > >>journalling filesystems can force correct write ordering, using disk > >>flush cache commands. > >> > >>After all that effort, I was very surprised to notice that Linux 2.6.x > >>doesn't use that capability to ensure fsync() flushes the disk cache > >>onto stable storage. > > > >It's surprising you are surprised, given that this [lame] fsync behavior > >has remaining consistently lame throughout Linux's history. > > Maybe I am confused, but isn't this is what fsync() does today whenever > barriers are enabled (the fsync() invalidates the drive's write cache). No, fsync() doesn't always flush the drive's write cache. It often does, any I think many people are under the impression it always does, but it doesn't. Try this code on ext3: fd = open ("test_file", O_RDWR | O_CREAT | O_TRUNC, 0666); while (1) { char byte; usleep (10); pwrite (fd, , 1, 0); fsync (fd); } It will do just over 10 write ops per second on an idle system (13 on mine), and 1 flush op per second. That's because ext3 fsync() only does a journal commit when the inode has changed. The inode mtime is changed by write only with 1 second granularity. Without a journal commit, there's no barrier, which translates to not flushing disk write cache. If you add "fchmod (fd, 0644); fchmod (fd, 0664);" between the write and fsync, you'll see at least 20 write ops and 20 flush ops per second, and you'll here the disk seeking more. That's because the fchmod dirties the inode, so fsync() writes the inode with a journal commit. It turns out even _that_ is not sufficient according to the kernel internals. A journal commit uses an ordered request, which isn't the same as a flush potentially, it just happens to use flush in this instance. I'm not sure if ordered requests are actually implemented by any drivers at the moment. If not now, they will be one day. We could change ext3 fsync() to always do a journal commit, and depend on the non-existence of block drivers which do ordered (not flush) barrier requests. But there's lots of things wrong with that. Not least, it sucks performance for database-like applications and virtual machines, a lot due to unnecessary seeks. That way lies wrongness. Rightness is to make fdatasync() work well, with a genuine flush (or equivalent (see FUA), only when required, and not a mere ordered barrier), no inode write, and to make sync_file_range()[*] offer the fancier applications finer controls which reflect what they actually need. [*] - or whatever. -- Jamie -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Proposal for "proper" durable fsync() and fdatasync()
Jörn Engel wrote: > On Tue, 26 February 2008 20:16:11 +1100, Nick Piggin wrote: > > Yeah, sync_file_range has slightly unusual semantics and introduce > > the new concept, "writeout", to userspace (does "writeout" include > > "in drive cache"? the kernel doesn't think so, but the only way to > > make sync_file_range "safe" is if you do consider it writeout). > > If sync_file_range isn't safe, it should get replaced by a noop > implementation. There really is no point in promising "a little" > safety. Sometimes there is a point in "a little" safety. There's a spectrum of durability (meaning how safely stored the data is). In the cases we're imagining, it's application -> main memory cache -> disk cache -> disk surface. There are others. _None_ of those provide perfect safety for your data. They are a spectrum, and how far along you want data to be committed before you say "fine, the data is safe enough for me" depends on your application. For example, there are users who like to turn _off_ fdatasync() with their SQL database of choice. They prefer speed over safety, and they don't mind losing an hour's data and doing regular backups (we assume ;-) Some blogs fall into this category; who cares if a rare crash costs you a comment or two and a restore from backup; it's acceptable for the speed. There's users who would really like fdatasync() to commit data to the drive platters, so after their database says "done", they are very confident that a power failure won't cause committed data to be lost. Accepting credit cards is more at this end. So should be anyone using a virtual machine of any kind without a journalling fs in the guest! And there's users who like it where it is right now: a compromise, where a system crash won't lose committed data; but a power failure might. (I'm making assumptions about drive behaviour on reset here.) My problem with fdatasync() at the moment is, I can't choose what I want from it, and there's no mechanism to give me the safest option. Most annoyingly, in-kernel filesystems _do_ have a mechanism; it just isn't exported to userspace. (A quick aside: fdatasync() et al. are actually used for two _different_ things. 1: A program says "I've written it", it can say so with confidence, e.g. announcing email receipt. 2: It's used for write ordering with write-ahead logging: write, fdatasync, write. When you tease at the details, efficient implementations of them are different... Think SCSI tagged commands versus cache flushes.) > One interesting aspect of this comes with COW filesystems like btrfs or > logfs. Writing out data pages is not sufficient, because those will get > lost unless their referencing metadata is written as well. So either we > have to call fsync for those filesystems or add another callback and let > filesystems override the default implementation. Doesn't the ->fsync callback get called in the sys_fdatasync() case, with appropriate arguments? With barriers/flushes it certainly makes those a bit more complicated. You have to flush not just the disks with data pages, but the _other_ disks in a software RAID with data pointer metadata pages, but ideally not all of them (think database journal commit). That can be implemented with per-buffer pending-barrier/flush flags (like I described for pages in the first mail), which are equally useful when a database-like application uses a block device. -- Jamie -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Proposal for "proper" durable fsync() and fdatasync()
Jörn Engel wrote: > On Tue, 26 February 2008 20:16:11 +1100, Nick Piggin wrote: > > > > Yeah, sync_file_range has slightly unusual semantics and introduce > > the new concept, "writeout", to userspace (does "writeout" include > > "in drive cache"? the kernel doesn't think so, but the only way to > > make sync_file_range "safe" is if you do consider it writeout). > > If sync_file_range isn't safe, it should get replaced by a noop > implementation. There really is no point in promising "a little" > safety. > > One interesting aspect of this comes with COW filesystems like btrfs or > logfs. Writing out data pages is not sufficient, because those will get > lost unless their referencing metadata is written as well. So either we > have to call fsync for those filesystems or add another callback and let > filesystems override the default implementation. fdatasync() is required to write data pages _and_ the necessary metadata to reference those changed pages (btrfs tree etc.), but not non-data metadata. It's the filesystem's responsibility to interpret that correctly. In-place writes don't need anything else. Phase-tree style writes do. Some kinds of logged writes don't. I'm under the impression that sync_file_range() is a sort of restricted-range asynchronous fdatasync(). By limiting the range of file date which must be written out, it becomes more refined for database and filesystem-in-a-file type applications. Just as fsync() is more refined than sync() - it's useful to sync less - same goes for syncing just part of a file. It's still the filesystem's responsibility to sync data access metadata appropriately. It can sync more if it wants, but not less. That's what I understand by sync_file_range(fd, start,length, SYNC_FILE_RANGE_WRITE_BEFORE | SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WRITE_AFTER); Largely because the manual says to use that combination of flags for an equivalent to fdatasync(). The concept of "write-out" is not defined in the manual. I'm assuming it to mean this, as a reasonable guess: SYNC_FILE_RANGE_WRITE scans all pages in the range, looking for dirty pages which aren't already queued for write-out. It marks those with a "write-out" flag, and starts write I/Os at some unspecified time in the near future; it can be assumed writes for all the pages will complete eventually if there's no errors. When I/O completes on a page, it cleans the page and also clears the write-out flag. SYNC_FILE_RANGE_WAIT_AFTER waits until all pages in the range don't have the "write-out" flag set. SYNC_FILE_RANGE_WAIT_BEFORE does the same wait, but before marking pages for write-out. I don't actually see the point in this. Isn't a preceding call with SYNC_FILE_RANGE_WAIT_AFTER equivalent, making BEFORE a redundant flag? The manual says it is something to do with data-integrity, but it's not clear to me what that means. All this implies that "write-out" flag is a concept userspace can rely on. That's not so peculiar: WRITE seems to be equivalent to AIO-style fdatasync() on a limited range of offsets, and WAIT_AFTER seems to be equivalent to waiting for any previously issued such ops to complete. Any data access metadata updates that btrfs must make for fdatasync(), it must also make for sync_file_range(), for the limited range of offsets. -- Jamie -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Proposal for "proper" durable fsync() and fdatasync()
Jeff Garzik wrote: > [snip huge long proposal] > > Rather than invent new APIs, we should fix the existing ones to _really_ > flush data to physical media. Btw, one reason for the length is the current block request API isn't sufficient even to make fsync() durable with _no_ new APIs. It offers ordering barriers only, which aren't enough. I tried to explain, discuss some changes and then suggest optimisations. -- Jamie -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Proposal for "proper" durable fsync() and fdatasync()
Andrew Morton wrote: > On Tue, 26 Feb 2008 07:26:50 +0000 Jamie Lokier <[EMAIL PROTECTED]> wrote: > > > (It would be nicer if sync_file_range() > > took a vector of ranges for better elevator scheduling, but let's > > ignore that :-) > > Two passes: > > Pass 1: shove each of the segments into the queue with > SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE > > Pass 2: wait for them all to complete and return accumulated result > with SYNC_FILE_RANGE_WAIT_AFTER Thanks. Seems ok, though being able to cork the I/O until the last one would be a bonus (like TCP_MORE... SYNC_FILE_RANGE_MORE?) I'm imagining I'd omit the SYNC_FILE_RANGE_WAIT_BEFORE. Is there a reason why you have it there? The man page isn't very enlightening. -- Jamie -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Proposal for proper durable fsync() and fdatasync()
Andrew Morton wrote: On Tue, 26 Feb 2008 07:26:50 + Jamie Lokier [EMAIL PROTECTED] wrote: (It would be nicer if sync_file_range() took a vector of ranges for better elevator scheduling, but let's ignore that :-) Two passes: Pass 1: shove each of the segments into the queue with SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE Pass 2: wait for them all to complete and return accumulated result with SYNC_FILE_RANGE_WAIT_AFTER Thanks. Seems ok, though being able to cork the I/O until the last one would be a bonus (like TCP_MORE... SYNC_FILE_RANGE_MORE?) I'm imagining I'd omit the SYNC_FILE_RANGE_WAIT_BEFORE. Is there a reason why you have it there? The man page isn't very enlightening. -- Jamie -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Proposal for proper durable fsync() and fdatasync()
Jeff Garzik wrote: [snip huge long proposal] Rather than invent new APIs, we should fix the existing ones to _really_ flush data to physical media. Btw, one reason for the length is the current block request API isn't sufficient even to make fsync() durable with _no_ new APIs. It offers ordering barriers only, which aren't enough. I tried to explain, discuss some changes and then suggest optimisations. -- Jamie -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Proposal for proper durable fsync() and fdatasync()
Jörn Engel wrote: On Tue, 26 February 2008 20:16:11 +1100, Nick Piggin wrote: Yeah, sync_file_range has slightly unusual semantics and introduce the new concept, writeout, to userspace (does writeout include in drive cache? the kernel doesn't think so, but the only way to make sync_file_range safe is if you do consider it writeout). If sync_file_range isn't safe, it should get replaced by a noop implementation. There really is no point in promising a little safety. One interesting aspect of this comes with COW filesystems like btrfs or logfs. Writing out data pages is not sufficient, because those will get lost unless their referencing metadata is written as well. So either we have to call fsync for those filesystems or add another callback and let filesystems override the default implementation. fdatasync() is required to write data pages _and_ the necessary metadata to reference those changed pages (btrfs tree etc.), but not non-data metadata. It's the filesystem's responsibility to interpret that correctly. In-place writes don't need anything else. Phase-tree style writes do. Some kinds of logged writes don't. I'm under the impression that sync_file_range() is a sort of restricted-range asynchronous fdatasync(). By limiting the range of file date which must be written out, it becomes more refined for database and filesystem-in-a-file type applications. Just as fsync() is more refined than sync() - it's useful to sync less - same goes for syncing just part of a file. It's still the filesystem's responsibility to sync data access metadata appropriately. It can sync more if it wants, but not less. That's what I understand by sync_file_range(fd, start,length, SYNC_FILE_RANGE_WRITE_BEFORE | SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WRITE_AFTER); Largely because the manual says to use that combination of flags for an equivalent to fdatasync(). The concept of write-out is not defined in the manual. I'm assuming it to mean this, as a reasonable guess: SYNC_FILE_RANGE_WRITE scans all pages in the range, looking for dirty pages which aren't already queued for write-out. It marks those with a write-out flag, and starts write I/Os at some unspecified time in the near future; it can be assumed writes for all the pages will complete eventually if there's no errors. When I/O completes on a page, it cleans the page and also clears the write-out flag. SYNC_FILE_RANGE_WAIT_AFTER waits until all pages in the range don't have the write-out flag set. SYNC_FILE_RANGE_WAIT_BEFORE does the same wait, but before marking pages for write-out. I don't actually see the point in this. Isn't a preceding call with SYNC_FILE_RANGE_WAIT_AFTER equivalent, making BEFORE a redundant flag? The manual says it is something to do with data-integrity, but it's not clear to me what that means. All this implies that write-out flag is a concept userspace can rely on. That's not so peculiar: WRITE seems to be equivalent to AIO-style fdatasync() on a limited range of offsets, and WAIT_AFTER seems to be equivalent to waiting for any previously issued such ops to complete. Any data access metadata updates that btrfs must make for fdatasync(), it must also make for sync_file_range(), for the limited range of offsets. -- Jamie -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Proposal for proper durable fsync() and fdatasync()
Jörn Engel wrote: On Tue, 26 February 2008 20:16:11 +1100, Nick Piggin wrote: Yeah, sync_file_range has slightly unusual semantics and introduce the new concept, writeout, to userspace (does writeout include in drive cache? the kernel doesn't think so, but the only way to make sync_file_range safe is if you do consider it writeout). If sync_file_range isn't safe, it should get replaced by a noop implementation. There really is no point in promising a little safety. Sometimes there is a point in a little safety. There's a spectrum of durability (meaning how safely stored the data is). In the cases we're imagining, it's application - main memory cache - disk cache - disk surface. There are others. _None_ of those provide perfect safety for your data. They are a spectrum, and how far along you want data to be committed before you say fine, the data is safe enough for me depends on your application. For example, there are users who like to turn _off_ fdatasync() with their SQL database of choice. They prefer speed over safety, and they don't mind losing an hour's data and doing regular backups (we assume ;-) Some blogs fall into this category; who cares if a rare crash costs you a comment or two and a restore from backup; it's acceptable for the speed. There's users who would really like fdatasync() to commit data to the drive platters, so after their database says done, they are very confident that a power failure won't cause committed data to be lost. Accepting credit cards is more at this end. So should be anyone using a virtual machine of any kind without a journalling fs in the guest! And there's users who like it where it is right now: a compromise, where a system crash won't lose committed data; but a power failure might. (I'm making assumptions about drive behaviour on reset here.) My problem with fdatasync() at the moment is, I can't choose what I want from it, and there's no mechanism to give me the safest option. Most annoyingly, in-kernel filesystems _do_ have a mechanism; it just isn't exported to userspace. (A quick aside: fdatasync() et al. are actually used for two _different_ things. 1: A program says I've written it, it can say so with confidence, e.g. announcing email receipt. 2: It's used for write ordering with write-ahead logging: write, fdatasync, write. When you tease at the details, efficient implementations of them are different... Think SCSI tagged commands versus cache flushes.) One interesting aspect of this comes with COW filesystems like btrfs or logfs. Writing out data pages is not sufficient, because those will get lost unless their referencing metadata is written as well. So either we have to call fsync for those filesystems or add another callback and let filesystems override the default implementation. Doesn't the -fsync callback get called in the sys_fdatasync() case, with appropriate arguments? With barriers/flushes it certainly makes those a bit more complicated. You have to flush not just the disks with data pages, but the _other_ disks in a software RAID with data pointer metadata pages, but ideally not all of them (think database journal commit). That can be implemented with per-buffer pending-barrier/flush flags (like I described for pages in the first mail), which are equally useful when a database-like application uses a block device. -- Jamie -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Proposal for proper durable fsync() and fdatasync()
Ric Wheeler wrote: I was surprised that fsync() doesn't do this already. There was a lot of effort put into block I/O write barriers during 2.5, so that journalling filesystems can force correct write ordering, using disk flush cache commands. After all that effort, I was very surprised to notice that Linux 2.6.x doesn't use that capability to ensure fsync() flushes the disk cache onto stable storage. It's surprising you are surprised, given that this [lame] fsync behavior has remaining consistently lame throughout Linux's history. Maybe I am confused, but isn't this is what fsync() does today whenever barriers are enabled (the fsync() invalidates the drive's write cache). No, fsync() doesn't always flush the drive's write cache. It often does, any I think many people are under the impression it always does, but it doesn't. Try this code on ext3: fd = open (test_file, O_RDWR | O_CREAT | O_TRUNC, 0666); while (1) { char byte; usleep (10); pwrite (fd, byte, 1, 0); fsync (fd); } It will do just over 10 write ops per second on an idle system (13 on mine), and 1 flush op per second. That's because ext3 fsync() only does a journal commit when the inode has changed. The inode mtime is changed by write only with 1 second granularity. Without a journal commit, there's no barrier, which translates to not flushing disk write cache. If you add fchmod (fd, 0644); fchmod (fd, 0664); between the write and fsync, you'll see at least 20 write ops and 20 flush ops per second, and you'll here the disk seeking more. That's because the fchmod dirties the inode, so fsync() writes the inode with a journal commit. It turns out even _that_ is not sufficient according to the kernel internals. A journal commit uses an ordered request, which isn't the same as a flush potentially, it just happens to use flush in this instance. I'm not sure if ordered requests are actually implemented by any drivers at the moment. If not now, they will be one day. We could change ext3 fsync() to always do a journal commit, and depend on the non-existence of block drivers which do ordered (not flush) barrier requests. But there's lots of things wrong with that. Not least, it sucks performance for database-like applications and virtual machines, a lot due to unnecessary seeks. That way lies wrongness. Rightness is to make fdatasync() work well, with a genuine flush (or equivalent (see FUA), only when required, and not a mere ordered barrier), no inode write, and to make sync_file_range()[*] offer the fancier applications finer controls which reflect what they actually need. [*] - or whatever. -- Jamie -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Proposal for proper durable fsync() and fdatasync()
Jeff Garzik wrote: Nick Piggin wrote: Anyway, the idea of making fsync/fdatasync etc. safe by default is a good idea IMO, and is a bad bug that we don't do that :( Agreed... it's also disappointing that [unless I'm mistaken] you have to hack each filesystem to support barriers. It seems far easier to make sync_blkdev() Do The Right Thing, and magically make all filesystems data-safe. Well, you need ordered metadata writes, barriers _and_ flushes with some filesystems. Merely writing all the data pages than issuing a drive cache flush won't Do The Right Thing with those filesystems - someone already mentioned Btrfs, where it won't. But I agree that your suggestion would make a superb default, for filesystems which don't provide their own function. It's not optimal even then. Devices: On a software RAID, you ideally don't want to issue flushes to all drives if your database did a 1 block commit entry. (But they probably use O_DIRECT anyway, changing the rules again). But all that can be optimised in generic VFS code eventually. It doesn't need filesystem assistance in most cases. Apps: don't always want a full flush; sometimes a barrier would do. -- Jamie -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Proposal for proper durable fsync() and fdatasync()
Jörn Engel wrote: On Tue, 26 February 2008 15:28:10 +, Jamie Lokier wrote: One interesting aspect of this comes with COW filesystems like btrfs or logfs. Writing out data pages is not sufficient, because those will get lost unless their referencing metadata is written as well. So either we have to call fsync for those filesystems or add another callback and let filesystems override the default implementation. Doesn't the -fsync callback get called in the sys_fdatasync() case, with appropriate arguments? My paragraph above was aimed at the sync_file_range() case. fsync and fdatasync do the right thing within the limitations you brought up in this thread. sync_file_range() without further changes will only write data pages, not the metadata required to actually access those data pages. This works just fine for non-COW filesystems, which covers all currently merged ones. With COW filesystems it is currently impossible to do sync_file_range() properly. The problem is orthogonal to your's, I just brought it up since you were already mentioning sync_file_range(). You're right. Though, doesn't normal page writeback enqueue the COW metadata changes? If not, how do they get written in a timely fashion? -- Jamie -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Proposal for "proper" durable fsync() and fdatasync()
Jeff Garzik wrote: > Jamie Lokier wrote: > >By durable, I mean that fsync() should actually commit writes to > >physical stable storage, > > Yes, it should. Glad we agree :-) > >I was surprised that fsync() doesn't do this already. There was a lot > >of effort put into block I/O write barriers during 2.5, so that > >journalling filesystems can force correct write ordering, using disk > >flush cache commands. > > > >After all that effort, I was very surprised to notice that Linux 2.6.x > >doesn't use that capability to ensure fsync() flushes the disk cache > >onto stable storage. > > It's surprising you are surprised, given that this [lame] fsync behavior > has remaining consistently lame throughout Linux's history. I was surprised because of the effort put into IDE write barriers to get it right for in-kernel filesystems, and the messages in 2004 telling concerned users that fsync would use barriers in 2.6, which it does sometimes but not always. > [snip huge long proposal] > > Rather than invent new APIs, we should fix the existing ones to _really_ > flush data to physical media. > > Linux should default to SAFE data storage, and permit users to retain > the older unsafe behavior via an option. It's completely ridiculous > that we default to an unsafe fsync. Well, I agree with you. Which is why the "new API" I suggested, being really just an extension of an existing one, allows fsync() to be SAFE if that's what people want. To be fair, fsync() is rather overkill for some apps. sync_file_range() is obviously the right place for fine tuning "less safe" variations. > And [anticipating a common response from others] it is completely > irrelevant that POSIX fsync(2) permits Linux's current behavior. The > current behavior is unsafe. > > Safety before performance -- ESPECIALLY when it comes to storing user data. Especially now that people work a lot in guest VMs, where the IDE barrier stuff doesn't work if the host fdatasync() doesn't work. Since it happened with Mac OS X, I wouldn't be surprised if changing fsync() and just that wasn't popular. Heck, you already get people asking "how to turn off fsync in PostGreSQL"... (Haven't those people heard of transactions...?) But with changes to sync_file_range() [or whatever... I don't care] to support database's finely tuned commit needs, and then adoption of that by database vendors, perhaps nobody will mind fsync() becoming safe then. Nobody seems bothered by it's performance for other things. -- Jamie -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Proposal for "proper" durable fsync() and fdatasync()
Dear kernel, This is a proposal to add "proper" durable fsync() and fdatasync() to Linux. First the problem, then a proposed solution "with benefits", so to speak. I need feedback on the details, before implementing anything. Or (hopefully) someone else thinks it's very important and does it themselves :-) By durable, I mean that fsync() should actually commit writes to physical stable storage, not just the disk write cache when that is enabled. Databases and guest VMs needs this, or an equivalent feature, if they aren't to face occasional corruption after power failure and perhaps some crashes. The alternative is to disable the disk write cache. But that isn't modern practice or recommendation, since I/O write barriers were implemented and they are much faster. I was surprised that fsync() doesn't do this already. There was a lot of effort put into block I/O write barriers during 2.5, so that journalling filesystems can force correct write ordering, using disk flush cache commands. After all that effort, I was very surprised to notice that Linux 2.6.x doesn't use that capability to ensure fsync() flushes the disk cache onto stable storage. I noticed this following up discussions on the Qemu mailing list, about guest VMs and how their IDE flush cache command should translate to fsync() to avoid data loss. (For guest VMs, fsync() isn't necessary if the host machine is fine, and it isn't enough (on Linux host) if the host machine loses power or the hard disk crashes another way.) Then I noticed it again, when I was designing a database engine with filesystem characteristics. I thought "how do I ensure ordered journal writes; can I use fdatasync()?" and was surprised to find the answer is no, I have to use hacks like calling hdparm, and the authors of major SQL databases seem to brush the problem under a carpet. (Interestingly, in the Linux 2.4 patches for write barriers, fsync() seems to be fine, if a bit slow.) It isn't the first time this topic has come up: http://groups.google.com.br/group/linux.kernel/browse_thread/thread/d343e51655b4ac7c/7ee9bca80977c2d1?#7ee9bca80977c2d1 ("True fsync() in Linux (on IDE)") In that thread, it was implied that would be fixed in 2.6. So I bet some people are under the illusion that it's fixed in 2.6... For a while, I've been meaning to bring it up on linux-kernel... The fsync problem - Chris Wedgwood wrote: > On Mon, Feb 25, 2008 at 08:50:40PM +, Jamie Lokier wrote: > > > On Linux (and other host OSes), fdatsync() and fsync() don't always > > commit data to hard storage; it sometimes only commits it to the hard > > drive cache. > > That's a filesystem bug IMO. People should be able to use f[data]sync > with some level onf confidence or else it's basically pointless. I agree, I consider it a serious bug, and I would be pleased if someone paid it some love and attention. Right now, if you want a reliable database on Linux, you _cannot_ properly depend on fsync() or fdatasync(). Considering how much Linux is used for critical databases, using these functions, this amazes me. Also, if you have a guest VM, then the guest's filesystem journalling is not reliable. Not only can it lose data on power loss, it can corrupt the guest filesystem too, due to reordering. This is contrary to what people expect, I think. I'm not sure if a system reset can cause similar loss; I don't know how disks react to that. Also, for the person porting ZFS to run on FUSE, same applies... Linux fsync is faulty in two ways: 1. Database commits aren't _durable_ against power failure, because fsync doesn't flush the disk's cache. This means data stored is not guaranteed to be stored at the expected durability. 2. It's unsafe for write-ahead logging, because it doesn't really guarantee any _ordering_ for the writes at the hard storage level. So aside from losing committed data, it can also corrupt structural metadata. With ext3 it's quite easy to verify that fsync/fdatasync don't always write a journal entry. (Apart from looking at the kernel code :-) Just write some data, fsync(), and observe the number of writes in /proc/diskstats. If the current mtime second _hasn't_ changed, the inode isn't written. If you write data, say, 10 times a second to the same place followed by fsync(), you'll see a little more than 10 write I/Os, and less than 20. By the way, this shows a trick for fixing #2 (ordering): use fchmod() to toggle the file attributes, and that will force the next fsync() to write a journal entry, which _does_ issue a write barrier. If you do that with each write as above (write, fchmod change, fsync 10 times a second), you will clearly see more write I/Os, and you'll hear the disk behaving differently: it's seeking more. However, even this ugly trick has problems: 3. Using the fchmod() trick or good fortun
Proposal for proper durable fsync() and fdatasync()
Dear kernel, This is a proposal to add proper durable fsync() and fdatasync() to Linux. First the problem, then a proposed solution with benefits, so to speak. I need feedback on the details, before implementing anything. Or (hopefully) someone else thinks it's very important and does it themselves :-) By durable, I mean that fsync() should actually commit writes to physical stable storage, not just the disk write cache when that is enabled. Databases and guest VMs needs this, or an equivalent feature, if they aren't to face occasional corruption after power failure and perhaps some crashes. The alternative is to disable the disk write cache. But that isn't modern practice or recommendation, since I/O write barriers were implemented and they are much faster. I was surprised that fsync() doesn't do this already. There was a lot of effort put into block I/O write barriers during 2.5, so that journalling filesystems can force correct write ordering, using disk flush cache commands. After all that effort, I was very surprised to notice that Linux 2.6.x doesn't use that capability to ensure fsync() flushes the disk cache onto stable storage. I noticed this following up discussions on the Qemu mailing list, about guest VMs and how their IDE flush cache command should translate to fsync() to avoid data loss. (For guest VMs, fsync() isn't necessary if the host machine is fine, and it isn't enough (on Linux host) if the host machine loses power or the hard disk crashes another way.) Then I noticed it again, when I was designing a database engine with filesystem characteristics. I thought how do I ensure ordered journal writes; can I use fdatasync()? and was surprised to find the answer is no, I have to use hacks like calling hdparm, and the authors of major SQL databases seem to brush the problem under a carpet. (Interestingly, in the Linux 2.4 patches for write barriers, fsync() seems to be fine, if a bit slow.) It isn't the first time this topic has come up: http://groups.google.com.br/group/linux.kernel/browse_thread/thread/d343e51655b4ac7c/7ee9bca80977c2d1?#7ee9bca80977c2d1 (True fsync() in Linux (on IDE)) In that thread, it was implied that would be fixed in 2.6. So I bet some people are under the illusion that it's fixed in 2.6... For a while, I've been meaning to bring it up on linux-kernel... The fsync problem - Chris Wedgwood wrote: On Mon, Feb 25, 2008 at 08:50:40PM +, Jamie Lokier wrote: On Linux (and other host OSes), fdatsync() and fsync() don't always commit data to hard storage; it sometimes only commits it to the hard drive cache. That's a filesystem bug IMO. People should be able to use f[data]sync with some level onf confidence or else it's basically pointless. I agree, I consider it a serious bug, and I would be pleased if someone paid it some love and attention. Right now, if you want a reliable database on Linux, you _cannot_ properly depend on fsync() or fdatasync(). Considering how much Linux is used for critical databases, using these functions, this amazes me. Also, if you have a guest VM, then the guest's filesystem journalling is not reliable. Not only can it lose data on power loss, it can corrupt the guest filesystem too, due to reordering. This is contrary to what people expect, I think. I'm not sure if a system reset can cause similar loss; I don't know how disks react to that. Also, for the person porting ZFS to run on FUSE, same applies... Linux fsync is faulty in two ways: 1. Database commits aren't _durable_ against power failure, because fsync doesn't flush the disk's cache. This means data stored is not guaranteed to be stored at the expected durability. 2. It's unsafe for write-ahead logging, because it doesn't really guarantee any _ordering_ for the writes at the hard storage level. So aside from losing committed data, it can also corrupt structural metadata. With ext3 it's quite easy to verify that fsync/fdatasync don't always write a journal entry. (Apart from looking at the kernel code :-) Just write some data, fsync(), and observe the number of writes in /proc/diskstats. If the current mtime second _hasn't_ changed, the inode isn't written. If you write data, say, 10 times a second to the same place followed by fsync(), you'll see a little more than 10 write I/Os, and less than 20. By the way, this shows a trick for fixing #2 (ordering): use fchmod() to toggle the file attributes, and that will force the next fsync() to write a journal entry, which _does_ issue a write barrier. If you do that with each write as above (write, fchmod change, fsync 10 times a second), you will clearly see more write I/Os, and you'll hear the disk behaving differently: it's seeking more. However, even this ugly trick has problems: 3. Using the fchmod() trick or good fortune, fsync() issues a write barrier. Right now, this does commit data (if the device can
Re: Proposal for proper durable fsync() and fdatasync()
Jeff Garzik wrote: Jamie Lokier wrote: By durable, I mean that fsync() should actually commit writes to physical stable storage, Yes, it should. Glad we agree :-) I was surprised that fsync() doesn't do this already. There was a lot of effort put into block I/O write barriers during 2.5, so that journalling filesystems can force correct write ordering, using disk flush cache commands. After all that effort, I was very surprised to notice that Linux 2.6.x doesn't use that capability to ensure fsync() flushes the disk cache onto stable storage. It's surprising you are surprised, given that this [lame] fsync behavior has remaining consistently lame throughout Linux's history. I was surprised because of the effort put into IDE write barriers to get it right for in-kernel filesystems, and the messages in 2004 telling concerned users that fsync would use barriers in 2.6, which it does sometimes but not always. [snip huge long proposal] Rather than invent new APIs, we should fix the existing ones to _really_ flush data to physical media. Linux should default to SAFE data storage, and permit users to retain the older unsafe behavior via an option. It's completely ridiculous that we default to an unsafe fsync. Well, I agree with you. Which is why the new API I suggested, being really just an extension of an existing one, allows fsync() to be SAFE if that's what people want. To be fair, fsync() is rather overkill for some apps. sync_file_range() is obviously the right place for fine tuning less safe variations. And [anticipating a common response from others] it is completely irrelevant that POSIX fsync(2) permits Linux's current behavior. The current behavior is unsafe. Safety before performance -- ESPECIALLY when it comes to storing user data. Especially now that people work a lot in guest VMs, where the IDE barrier stuff doesn't work if the host fdatasync() doesn't work. Since it happened with Mac OS X, I wouldn't be surprised if changing fsync() and just that wasn't popular. Heck, you already get people asking how to turn off fsync in PostGreSQL... (Haven't those people heard of transactions...?) But with changes to sync_file_range() [or whatever... I don't care] to support database's finely tuned commit needs, and then adoption of that by database vendors, perhaps nobody will mind fsync() becoming safe then. Nobody seems bothered by it's performance for other things. -- Jamie -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: jffs2: -ENOSPC when truncating file?!
Pavel Machek wrote: > > You need to write a log entry indicating the new length of the file. > > There is no space for new log entries. > > > > There is a special case for removal -- 'rm gps.nmea' would work. Perhaps > > we should add a special case for truncation too, so that it can also use > > the extra pool of free space. > > Yes, that would be nice. I somehow assumed that truncate can't fail > for -ENOSPC ... I was trying to actually free some space on the > filesystem... Same here! When I got ENOSPC from truncate, trying to free some space, I was so surprised (and a bit disappointed) that I assumed removal could fail too. So now I'm pleasantly surprised to learn I can at least remove a file. It does seem odd that truncate to zero length can fail. It is guaranteed to free up one or more data nodes, so there should be enough space for the size change node, provided GC is invoked when necessary and the node sizes are compatible for this in corner cases. -- Jamie -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: jffs2: -ENOSPC when truncating file?!
Pavel Machek wrote: You need to write a log entry indicating the new length of the file. There is no space for new log entries. There is a special case for removal -- 'rm gps.nmea' would work. Perhaps we should add a special case for truncation too, so that it can also use the extra pool of free space. Yes, that would be nice. I somehow assumed that truncate can't fail for -ENOSPC ... I was trying to actually free some space on the filesystem... Same here! When I got ENOSPC from truncate, trying to free some space, I was so surprised (and a bit disappointed) that I assumed removal could fail too. So now I'm pleasantly surprised to learn I can at least remove a file. It does seem odd that truncate to zero length can fail. It is guaranteed to free up one or more data nodes, so there should be enough space for the size change node, provided GC is invoked when necessary and the node sizes are compatible for this in corner cases. -- Jamie -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] LogFS take three
David Weinehall wrote: > > It is also the filesystem that tries to scale logarithmically, as Arnd > > has noted. Maybe I should call it Log2 to emphesize this point. Log1 > > would be horrible scalability. > > So, log2fs... Sounds great to me. Why Log2? Logarithmic scaling is just logarithmic scaling. Does the filesystem use 2-ary trees or anything else which gives particular meaning to 2? -- Jamie - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] LogFS take three
David Weinehall wrote: It is also the filesystem that tries to scale logarithmically, as Arnd has noted. Maybe I should call it Log2 to emphesize this point. Log1 would be horrible scalability. So, log2fs... Sounds great to me. Why Log2? Logarithmic scaling is just logarithmic scaling. Does the filesystem use 2-ary trees or anything else which gives particular meaning to 2? -- Jamie - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] LogFS take three
Jörn Engel wrote: > > Almost all your static functions start with logfs_, why not this one? > > Because after a while I discovered how silly it is to start every > function with logfs_. That prefix doesn't add much unless the function > has global scope. What I didn't do was remove the prefix from older > functions. It's handy when debugging or showing detailed backtraces. Not that I'm advocating it (or not), just something I've noticed in other programs. -- Jamie - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] LogFS take three
Jörn Engel wrote: Almost all your static functions start with logfs_, why not this one? Because after a while I discovered how silly it is to start every function with logfs_. That prefix doesn't add much unless the function has global scope. What I didn't do was remove the prefix from older functions. It's handy when debugging or showing detailed backtraces. Not that I'm advocating it (or not), just something I've noticed in other programs. -- Jamie - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] LogFS take three
Artem Bityutskiy wrote: > On Wed, 2007-05-16 at 12:34 +0100, Jamie Lokier wrote: > > Jörn Engel wrote: > > > On Wed, 16 May 2007 12:54:14 +0800, David Woodhouse wrote: > > > > Personally I'd just go for 'JFFS3'. After all, it has a better claim to > > > > the name than either of its predecessors :) > > > > > > Did you ever see akpm's facial expression when he tried to pronounce > > > "JFFS2"? ;) > > > > JFFS3 is a good, meaningful name to anyone familiar with JFFS2. > > > > But if akpm can't pronounce it, how about FFFS for faster flash > > filesystem ;-) > > The problem is that JFFS2 will always be faster in terms of I/O speed > anyway, just because it does not have to maintain on-flash indexing > data structures. But yes, it is slow in mount and in building big > inodes, so the "faster" is confusing. Is LogFS really slower than JFFS2 in practice? I would have guessed reads to be a similar speed, tree updates to be a similar speed to journal updates for sustained non-fsyncing writes, and the difference unimportant for tiny individual commits whose index updates are not merged with any other. I've not thought about it much though. -- Jamie - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] LogFS take three
Albert Cahalan wrote: > Please don't forget the immutable bit. ("man lsattr") > Having both, BSD-style, would be even better. > The immutable bit is important for working around > software bugs and "features" that damage files. > > I also can't find xattr support. Imho, Given that the filesystem is still 'experimental', I'd concentrate on getting it stable before worrying about immutable and xattrs unless they are easy. (Immutable is easy). In 13 years of using Linux in all sorts of ways I've never yet to use either feature. They would be good to have, but stability in a filesystem is much more useful. I'm biased of course: if LogFS were stable and well tested, I'd be using it right now in my embedded thingy - and that doesn't even bother with uids :-) -- Jamie - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] LogFS take three
Jörn Engel wrote: > On Wed, 16 May 2007 12:54:14 +0800, David Woodhouse wrote: > > Personally I'd just go for 'JFFS3'. After all, it has a better claim to > > the name than either of its predecessors :) > > Did you ever see akpm's facial expression when he tried to pronounce > "JFFS2"? ;) JFFS3 is a good, meaningful name to anyone familiar with JFFS2. But if akpm can't pronounce it, how about FFFS for faster flash filesystem ;-) -- Jamie - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] LogFS take three
Jörn Engel wrote: On Wed, 16 May 2007 12:54:14 +0800, David Woodhouse wrote: Personally I'd just go for 'JFFS3'. After all, it has a better claim to the name than either of its predecessors :) Did you ever see akpm's facial expression when he tried to pronounce JFFS2? ;) JFFS3 is a good, meaningful name to anyone familiar with JFFS2. But if akpm can't pronounce it, how about FFFS for faster flash filesystem ;-) -- Jamie - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] LogFS take three
Albert Cahalan wrote: Please don't forget the immutable bit. (man lsattr) Having both, BSD-style, would be even better. The immutable bit is important for working around software bugs and features that damage files. I also can't find xattr support. Imho, Given that the filesystem is still 'experimental', I'd concentrate on getting it stable before worrying about immutable and xattrs unless they are easy. (Immutable is easy). In 13 years of using Linux in all sorts of ways I've never yet to use either feature. They would be good to have, but stability in a filesystem is much more useful. I'm biased of course: if LogFS were stable and well tested, I'd be using it right now in my embedded thingy - and that doesn't even bother with uids :-) -- Jamie - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] LogFS take three
Artem Bityutskiy wrote: On Wed, 2007-05-16 at 12:34 +0100, Jamie Lokier wrote: Jörn Engel wrote: On Wed, 16 May 2007 12:54:14 +0800, David Woodhouse wrote: Personally I'd just go for 'JFFS3'. After all, it has a better claim to the name than either of its predecessors :) Did you ever see akpm's facial expression when he tried to pronounce JFFS2? ;) JFFS3 is a good, meaningful name to anyone familiar with JFFS2. But if akpm can't pronounce it, how about FFFS for faster flash filesystem ;-) The problem is that JFFS2 will always be faster in terms of I/O speed anyway, just because it does not have to maintain on-flash indexing data structures. But yes, it is slow in mount and in building big inodes, so the faster is confusing. Is LogFS really slower than JFFS2 in practice? I would have guessed reads to be a similar speed, tree updates to be a similar speed to journal updates for sustained non-fsyncing writes, and the difference unimportant for tiny individual commits whose index updates are not merged with any other. I've not thought about it much though. -- Jamie - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: dnotify/inotify and vfs questions
Ian Campbell wrote: > On Tue, 2005-08-23 at 16:23 +0100, Jamie Lokier wrote: > > ... > > if (any_dnotify_or_inotify_events_pending) { > > read_dnotify_or_inotify_events(); > > if (any_events_related_to(file)) { > > store_in_userspace_stat_cache(file, stat(file)); > > } > > } > > stat_info = lookup_userspace_stat_cache(file); > > > > Now that's a silly way to save one system call in the fast path by itself. > > I'm not that familiar with inotify internals but doesn't > read_dnotify_or_inotify_events() or > any_dnotify_or_inotify_events_pending() involve a syscall? The fast path is just any_dnotify_or_inotify_events_pending: there aren't any relevant events pending in the fast path. There's a few methods of doing this for free per individual stat cache check. 1. Signal handler. dnotify: Check a variable set by the signal handler. [ There's a small time window between dnotify sending the signal and the receiving thread noticing on an SMP system due to the IPI, during which the sending task might have another way to signal the recieving task that it's finished some operation, so this method of using dnotify to invalidate a stat cache only has the correct ordering properties on UP systems. ] This works because dnotify signals are thread-specific, so the checking thread will definitely have received the signal after the time another process modifies the file. inotify: Disappointingly inotify doesn't support SIGIO readiness :( 2. If you have to mix the test with a poll/select/epoll/rtsig fd waiting for some other purpose. For example: a file/web/local server, where the constraint is only that each stat() to revalidate a cached response appears to happen any time after the beginning of receiving the network request is known. dnotify: It's free if you were using sigtimedwait anyway for I/O events, provided you completely read the queue, or get the signal priority right. inotify: It's free if you were using poll/select/epoll anyway for I/O events, provided in the case of epoll that you completely read the queue, or use a two-level queue. 3. Amortising the test over many stat cache checks. Even if you must use a system call to check for any pending events, for revalidating an object which depends on multiple files, only one call is needed for all of the stat cache checks. More generally (this is more flexible), you can separate the notion of "cache time checkpoint" from "cache validation". It's enough to know that a stat result was valid any time between the checkpoint time, and the current time. That's how I'm implementing the file/web/local server case described above in step 2. Then the events only need to be checked once during that time interval, no matter how many complex objects are being revalidated. It gets slightly more efficient when you have multiple, overlapping checkpoint->validation time intervals due to multiple outstanding requests being processed concurrently. As I explained in the previous mail, all this is absolutely pointless to save one system call. It's a lot of work for negligable gain. The point is when it saves lots of calls and userspace logic together, for things like web page templates and compiled programs, which depend on many files which can be revalidated in a small number of operations. -- Jamie - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: dnotify/inotify and vfs questions
Ian Campbell wrote: On Tue, 2005-08-23 at 16:23 +0100, Jamie Lokier wrote: receive some request... if (any_dnotify_or_inotify_events_pending) { read_dnotify_or_inotify_events(); if (any_events_related_to(file)) { store_in_userspace_stat_cache(file, stat(file)); } } stat_info = lookup_userspace_stat_cache(file); Now that's a silly way to save one system call in the fast path by itself. I'm not that familiar with inotify internals but doesn't read_dnotify_or_inotify_events() or any_dnotify_or_inotify_events_pending() involve a syscall? The fast path is just any_dnotify_or_inotify_events_pending: there aren't any relevant events pending in the fast path. There's a few methods of doing this for free per individual stat cache check. 1. Signal handler. dnotify: Check a variable set by the signal handler. [ There's a small time window between dnotify sending the signal and the receiving thread noticing on an SMP system due to the IPI, during which the sending task might have another way to signal the recieving task that it's finished some operation, so this method of using dnotify to invalidate a stat cache only has the correct ordering properties on UP systems. ] This works because dnotify signals are thread-specific, so the checking thread will definitely have received the signal after the time another process modifies the file. inotify: Disappointingly inotify doesn't support SIGIO readiness :( 2. If you have to mix the test with a poll/select/epoll/rtsig fd waiting for some other purpose. For example: a file/web/local server, where the constraint is only that each stat() to revalidate a cached response appears to happen any time after the beginning of receiving the network request is known. dnotify: It's free if you were using sigtimedwait anyway for I/O events, provided you completely read the queue, or get the signal priority right. inotify: It's free if you were using poll/select/epoll anyway for I/O events, provided in the case of epoll that you completely read the queue, or use a two-level queue. 3. Amortising the test over many stat cache checks. Even if you must use a system call to check for any pending events, for revalidating an object which depends on multiple files, only one call is needed for all of the stat cache checks. More generally (this is more flexible), you can separate the notion of cache time checkpoint from cache validation. It's enough to know that a stat result was valid any time between the checkpoint time, and the current time. That's how I'm implementing the file/web/local server case described above in step 2. Then the events only need to be checked once during that time interval, no matter how many complex objects are being revalidated. It gets slightly more efficient when you have multiple, overlapping checkpoint-validation time intervals due to multiple outstanding requests being processed concurrently. As I explained in the previous mail, all this is absolutely pointless to save one system call. It's a lot of work for negligable gain. The point is when it saves lots of calls and userspace logic together, for things like web page templates and compiled programs, which depend on many files which can be revalidated in a small number of operations. -- Jamie - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: dnotify/inotify and vfs questions
Asser Femø wrote: > According to the fcntl manual you can cancel a notification by doing > fcntl(fd, F_NOTIFY, 0) (ie. sending 0 as the notification mask), but > looking in the kernel code fcntl_dirnotify() immediately calls > dnotify_flush() with neither telling the vfs module about it. Is there a > reason for this? Otherwise I'd propose calling > filp->f_op->dir_notify(filp, 0) at some point in this scenario. > > Regarding inotify, inotify_add_watch doesn't seem to pass on the request > either, which works fine for local filesystem operations as they call > fsnotify_* functions every time, but that isn't really feasible for > filesystems like cifs because we'd have to request change notification > on everything. Is there plans for implementing a mechanism to let vfs > modules get watch requests too? On a related note: dnotify and inotify on local filesystems appear to be synchronous, in the following rather useful sense: If you have previously registered for inotify/dnotify events that will catch a change to a file, and called stat() on the file, then the following operation: ... stat_info = stat(file) may be replaced in userspace code with: ... if (any_dnotify_or_inotify_events_pending) { read_dnotify_or_inotify_events(); if (any_events_related_to(file)) { store_in_userspace_stat_cache(file, stat(file)); } } stat_info = lookup_userspace_stat_cache(file); Now that's a silly way to save one system call in the fast path by itself. But when the stat_info is a prerequisite for validating cached data -- such as the contents of a file parsed into a data structure -- it can save a lot of system calls and logical work. For example, an Apache-style path walk which checks for .htaccess, or a Samba-style path walk which is checking for unsafe symbolic links, can be reduced from say 20 system calls to zero using this method. Pre-compiled or pre-parsed programs/scripts/templates/config-files where all the source files used are prerequisites for invalidating a cached compiled form, reduces from say 40 system calls to stat() all the source files, to zero that's quite a saving. It's not just reducing system calls. The logical tests in userspace are also skipped, if coded properly, facilitating very quick decisions about things that depend on files which mostly don't change. (Cascading structured cache prerequisites...mmm). Remote dnotify/inotify doesn't _necessarily_ have this synchronous property. It may do in some cases, depending on the implementation (this is subtle...). So, it would be nice if there was a way to query this... rather than the tedious method of testing the filesystem type and having a table of "known local filesystem types" where it's safe to depend on this property. Alternatively, a way to specify at dnotify/inotify creation type that synchronous notifications are required, and have the request rejected if those can't be provided. -- Jamie - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: dnotify/inotify and vfs questions
Asser Femø wrote: According to the fcntl manual you can cancel a notification by doing fcntl(fd, F_NOTIFY, 0) (ie. sending 0 as the notification mask), but looking in the kernel code fcntl_dirnotify() immediately calls dnotify_flush() with neither telling the vfs module about it. Is there a reason for this? Otherwise I'd propose calling filp-f_op-dir_notify(filp, 0) at some point in this scenario. Regarding inotify, inotify_add_watch doesn't seem to pass on the request either, which works fine for local filesystem operations as they call fsnotify_* functions every time, but that isn't really feasible for filesystems like cifs because we'd have to request change notification on everything. Is there plans for implementing a mechanism to let vfs modules get watch requests too? On a related note: dnotify and inotify on local filesystems appear to be synchronous, in the following rather useful sense: If you have previously registered for inotify/dnotify events that will catch a change to a file, and called stat() on the file, then the following operation: receive some request... stat_info = stat(file) may be replaced in userspace code with: receive some request... if (any_dnotify_or_inotify_events_pending) { read_dnotify_or_inotify_events(); if (any_events_related_to(file)) { store_in_userspace_stat_cache(file, stat(file)); } } stat_info = lookup_userspace_stat_cache(file); Now that's a silly way to save one system call in the fast path by itself. But when the stat_info is a prerequisite for validating cached data -- such as the contents of a file parsed into a data structure -- it can save a lot of system calls and logical work. For example, an Apache-style path walk which checks for .htaccess, or a Samba-style path walk which is checking for unsafe symbolic links, can be reduced from say 20 system calls to zero using this method. Pre-compiled or pre-parsed programs/scripts/templates/config-files where all the source files used are prerequisites for invalidating a cached compiled form, reduces from say 40 system calls to stat() all the source files, to zero that's quite a saving. It's not just reducing system calls. The logical tests in userspace are also skipped, if coded properly, facilitating very quick decisions about things that depend on files which mostly don't change. (Cascading structured cache prerequisites...mmm). Remote dnotify/inotify doesn't _necessarily_ have this synchronous property. It may do in some cases, depending on the implementation (this is subtle...). So, it would be nice if there was a way to query this... rather than the tedious method of testing the filesystem type and having a table of known local filesystem types where it's safe to depend on this property. Alternatively, a way to specify at dnotify/inotify creation type that synchronous notifications are required, and have the request rejected if those can't be provided. -- Jamie - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] FUSE permission modell (Was: fuse review bits)
Eric Van Hensbergen wrote: > I'd like to second that I think private-namespaces are the right way > to solve this sort of problem. It also helps not cluttering the > global namespace with user-local mounts > > > > > Shared subtrees and more support in userspace tools is needed before > > private namespaces can become really useful. > > > > I'd like to talk about this a bit more and start driving to a solution > here. I've been looking at the namespace code quite a bit and was > just about to dive in and start checking into adding/fixing certain > aspects such as stackable namespaces, optional inheritence (changes in > a parent namespace are reflected in the child but not vice-versa), > etc. > > One aspect I was thinking about here was a mount flag that would give > you a new private namespace (if you didn't already have one) for the > mount (and I guess that would impact any subsequent mounts from the > user in that shell). Another option would be a 'newns' style > system-call, but I'm generally against adding new system calls. > > Shared subtrees are a tricky one. I know how we would handle it in > V9FS, but not sure how well that would translate to others > (essentially we'd re-export the subtree so other user's could mount it > individually -- but that's a very Plan 9 solution and may not be what > more UNIX-minded folks would want -- we also need to improve our own > server infrastructure to more efficiently support such a re-export). > > So, to sum up I think private namespaces is the right solution, and > I'd rather put effort into making it more useful than work-around the > fact that its not practical right now. Have a chat with Al Viro, who has already done some work on shared mounts and subtrees I think. -- Jamie - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] FUSE permission modell (Was: fuse review bits)
Eric Van Hensbergen wrote: I'd like to second that I think private-namespaces are the right way to solve this sort of problem. It also helps not cluttering the global namespace with user-local mounts Shared subtrees and more support in userspace tools is needed before private namespaces can become really useful. I'd like to talk about this a bit more and start driving to a solution here. I've been looking at the namespace code quite a bit and was just about to dive in and start checking into adding/fixing certain aspects such as stackable namespaces, optional inheritence (changes in a parent namespace are reflected in the child but not vice-versa), etc. One aspect I was thinking about here was a mount flag that would give you a new private namespace (if you didn't already have one) for the mount (and I guess that would impact any subsequent mounts from the user in that shell). Another option would be a 'newns' style system-call, but I'm generally against adding new system calls. Shared subtrees are a tricky one. I know how we would handle it in V9FS, but not sure how well that would translate to others (essentially we'd re-export the subtree so other user's could mount it individually -- but that's a very Plan 9 solution and may not be what more UNIX-minded folks would want -- we also need to improve our own server infrastructure to more efficiently support such a re-export). So, to sum up I think private namespaces is the right solution, and I'd rather put effort into making it more useful than work-around the fact that its not practical right now. Have a chat with Al Viro, who has already done some work on shared mounts and subtrees I think. -- Jamie - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] FUSE permission modell (Was: fuse review bits)
Miklos Szeredi wrote: > > Yet, the results from stat() don't distinguish the number spaces, > > and "ls" doesn't map the numbers to names properly in the wrong > > space. > > Well you can use "ls -n". It's up to the tools to present the > information you want in the way you want it. If a tool can't do that, > tough, but you are not worse off than if the information is not > available _at_all_. Well, how do you currently provide access to the information that's not presentable through stat()? -- Jamie - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] FUSE permission modell (Was: fuse review bits)
Miklos Szeredi wrote: > I have a little project to imlement a "userloop" filesystem, which > works just like "mount -o loop", but you don't need root privs. This > is really simple to do with FUSE and UML. That would be a nice way to implement those rarely used old filesystems that aren't really needed in the kernel source tree any more, but which it would be nice to have access to as legacy filesystem formats. In other words, migrating old legacy filesystems out of the kernel tree, into FUSE. > I don't think that it's far feched, that in certain situations the > user _does_ have the right (and usefulness) to do otherwise privileged > filesystem operations. It's really a matter of philosophy, as to whether the results of stat() are just handy information for the user, or are always defined to mean what you can/can't do with a file. Local-ssh-into-UML makes more sense for this in some ways, because the uids/gids inside your tgz files or foreign loop filesystems are not related to the space of uids/gids of the host system. Yet, the results from stat() don't distinguish the number spaces, and "ls" doesn't map the numbers to names properly in the wrong space. -- Jamie - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] FUSE permission modell (Was: fuse review bits)
Miklos Szeredi wrote: > > Look up the rather large linux-kernel & linux-fsdevel thread "silent > > semantic changes with reiser4" and it's followup threads, from last > > year. > > Wow, it's 700+ messages. I got through the first 40, and already feel > dizzy :) It's easier if you skip the ones by Hans and their immediate followups :) (Nothing personal, it's that Hans is mostly justifying reiser4's behaviour, and the posts you really need to read aren't about reiser4). > > It's already been tried. You will also find sensible ideas on what > > semantics it should have to do it properly. > > OK, I understand the "slash -> directory, no-slash -> regular file" > semantics. > > How do you envision implementing this for "mount directory over file"? Somewhere deep in that thread is a discussion between Al Viro and Linus on it. > A new mount flag indicating that it's only to be followed down if > there's a slash after the mountpoint? The new flag would indicate more than that: These mounts should be detachable in the sense that deleting the file is possible, and perhaps renamable/linkable too. That's the stuff Al Viro discusses in some detail in the big thread. Ideally we'd like automounting, a bit like the Hurd's translators. Attached to files (using an xattr or something, and executed with the uid/gid of the file owner), and also per-user "pattern->action" options for matching files with a certain type (e.g. tgz/zip/deb/rpm/xml). But that can be added much later, as it's an orthogonal feature. -- Jamie - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] FUSE permission modell (Was: fuse review bits)
Miklos Szeredi wrote: > > > Aren't there some assumptions in VFS that currently make this > > > impossible? > > > > I believe it's OK with VFS, but applications would be confused to death. > > Well, there really is one issue -- dentries have exactly one parent, so > > what do you do when opening a file with hardlinks as a directory? (In > > fact IIRC that is what lead to all the funny talk about mountpoints, > > since they don't have this limitation) > > OK, that makes sense. > > It would be quite interesting to see how applications react. Maybe > I'll hack something up :) Look up the rather large linux-kernel & linux-fsdevel thread "silent semantic changes with reiser4" and it's followup threads, from last year. It's already been tried. You will also find sensible ideas on what semantics it should have to do it properly. -- Jamie - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] FUSE permission modell (Was: fuse review bits)
> > > A nice implemention of it in FUSE could push it along a bit :) > > > > Aren't there some assumptions in VFS that currently make this > > impossible? > > I believe it's OK with VFS, but applications would be confused to death. > Well, there really is one issue -- dentries have exactly one parent, so > what do you do when opening a file with hardlinks as a directory? (In > fact IIRC that is what lead to all the funny talk about mountpoints, > since they don't have this limitation) Hardlinks aren't a problem when entering a file as if it's a directory, provided the directory does not contain any hard links to a parent in the hierarchy. In other words, as long as it's a directed acyclic graph. This is trivially always true for virtual directories such as entering an archive file. And detachable/movable mountpoints are a nice and sensible way to implement it. Some work has actually been done on this. Experiments with the reiserfs file-as-directory extension showed that applications are generally ok with it. It looks like a file, but you can cd into it or follow a path lookup into it. Linus had some good ideas on the exact semantics to implement when doing path lookup on these objects. -- Jamie - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] FUSE permission modell (Was: fuse review bits)
A nice implemention of it in FUSE could push it along a bit :) Aren't there some assumptions in VFS that currently make this impossible? I believe it's OK with VFS, but applications would be confused to death. Well, there really is one issue -- dentries have exactly one parent, so what do you do when opening a file with hardlinks as a directory? (In fact IIRC that is what lead to all the funny talk about mountpoints, since they don't have this limitation) Hardlinks aren't a problem when entering a file as if it's a directory, provided the directory does not contain any hard links to a parent in the hierarchy. In other words, as long as it's a directed acyclic graph. This is trivially always true for virtual directories such as entering an archive file. And detachable/movable mountpoints are a nice and sensible way to implement it. Some work has actually been done on this. Experiments with the reiserfs file-as-directory extension showed that applications are generally ok with it. It looks like a file, but you can cd into it or follow a path lookup into it. Linus had some good ideas on the exact semantics to implement when doing path lookup on these objects. -- Jamie - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] FUSE permission modell (Was: fuse review bits)
Miklos Szeredi wrote: Aren't there some assumptions in VFS that currently make this impossible? I believe it's OK with VFS, but applications would be confused to death. Well, there really is one issue -- dentries have exactly one parent, so what do you do when opening a file with hardlinks as a directory? (In fact IIRC that is what lead to all the funny talk about mountpoints, since they don't have this limitation) OK, that makes sense. It would be quite interesting to see how applications react. Maybe I'll hack something up :) Look up the rather large linux-kernel linux-fsdevel thread silent semantic changes with reiser4 and it's followup threads, from last year. It's already been tried. You will also find sensible ideas on what semantics it should have to do it properly. -- Jamie - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] FUSE permission modell (Was: fuse review bits)
Miklos Szeredi wrote: Look up the rather large linux-kernel linux-fsdevel thread silent semantic changes with reiser4 and it's followup threads, from last year. Wow, it's 700+ messages. I got through the first 40, and already feel dizzy :) It's easier if you skip the ones by Hans and their immediate followups :) (Nothing personal, it's that Hans is mostly justifying reiser4's behaviour, and the posts you really need to read aren't about reiser4). It's already been tried. You will also find sensible ideas on what semantics it should have to do it properly. OK, I understand the slash - directory, no-slash - regular file semantics. How do you envision implementing this for mount directory over file? Somewhere deep in that thread is a discussion between Al Viro and Linus on it. A new mount flag indicating that it's only to be followed down if there's a slash after the mountpoint? The new flag would indicate more than that: These mounts should be detachable in the sense that deleting the file is possible, and perhaps renamable/linkable too. That's the stuff Al Viro discusses in some detail in the big thread. Ideally we'd like automounting, a bit like the Hurd's translators. Attached to files (using an xattr or something, and executed with the uid/gid of the file owner), and also per-user pattern-action options for matching files with a certain type (e.g. tgz/zip/deb/rpm/xml). But that can be added much later, as it's an orthogonal feature. -- Jamie - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] FUSE permission modell (Was: fuse review bits)
Miklos Szeredi wrote: I have a little project to imlement a userloop filesystem, which works just like mount -o loop, but you don't need root privs. This is really simple to do with FUSE and UML. That would be a nice way to implement those rarely used old filesystems that aren't really needed in the kernel source tree any more, but which it would be nice to have access to as legacy filesystem formats. In other words, migrating old legacy filesystems out of the kernel tree, into FUSE. I don't think that it's far feched, that in certain situations the user _does_ have the right (and usefulness) to do otherwise privileged filesystem operations. It's really a matter of philosophy, as to whether the results of stat() are just handy information for the user, or are always defined to mean what you can/can't do with a file. Local-ssh-into-UML makes more sense for this in some ways, because the uids/gids inside your tgz files or foreign loop filesystems are not related to the space of uids/gids of the host system. Yet, the results from stat() don't distinguish the number spaces, and ls doesn't map the numbers to names properly in the wrong space. -- Jamie - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] FUSE permission modell (Was: fuse review bits)
Miklos Szeredi wrote: Yet, the results from stat() don't distinguish the number spaces, and ls doesn't map the numbers to names properly in the wrong space. Well you can use ls -n. It's up to the tools to present the information you want in the way you want it. If a tool can't do that, tough, but you are not worse off than if the information is not available _at_all_. Well, how do you currently provide access to the information that's not presentable through stat()? -- Jamie - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] FUSE permission modell (Was: fuse review bits)
Anton Altaparmakov wrote: > > That said, I would _usually_ prefer that when I enter a tgz, that I > > see all component files having the same uid/gid/permissions as the tgz > > file itself - the same as I'd see if I entered a zip file. > > As you say _usually_, even you admit that sometimes you would want the > real owner/permissions to be shown. And that is the point Miklos is > trying to make I believe: it should be configurable not hard set to only > one which is what I understand HCH wants. > > There are uses for both. For example today I was updating the tar ball > which is used to create the var file system for a new chroot. I certainly > want to see corretly setup owner/permissions when I look into that tar > ball using a FUSE fs... If I'm updating a var filesystem for a new chroot, I'd need the ability to chmod and chown things in that filesystem. Does that work as an ordinary user? -- Jamie - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] FUSE permission modell (Was: fuse review bits)
Miklos Szeredi wrote: > Still can't find it :) > > Which kernel? Which file? I'm looking at linux-2.4.30/fs/nfs/dir.c. -- Jamie - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] FUSE permission modell (Was: fuse review bits)
Miklos Szeredi wrote: > > Indeed, if it can be done with namespaces _and_ mounting on a file > > (that file-as-directory concept), _and_ automounting, then you could > > cd into your tgz files and others could too :) > > There's still that little problem of accessing the tgz file both as a > file and a directory. But yes. Maybe in 10 years time :) There was a thread a few months ago where file-as-directory was discussed extensively, after Namesys implemented it. That's where the conversation on detachable mount points originated AFAIR. It will probably happen at some point. A nice implemention of it in FUSE could push it along a bit :) -- Jamie - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] FUSE permission modell (Was: fuse review bits)
Miklos Szeredi wrote: > The same can be true for tarfs. I mount it for my purpose, others can > mount it for theirs. Since the daemon providing the filesystem asways > runs with the same capabilities as the user who did the mount, I and > others will always get the permissions that we have on the actual tar > file. Fair enough. > Think of the "no permission for others" as "hiding", not as some > special permission rule. And if this hiding can be nicely done with > namespaces, all the better, I'll happily drop this feature at that > instant. Indeed, if it can be done with namespaces _and_ mounting on a file (that file-as-directory concept), _and_ automounting, then you could cd into your tgz files and others could too :) -- Jamie - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] FUSE permission modell (Was: fuse review bits)
Miklos Szeredi wrote: > > Yes, for NFSv2, this test in nfs_permssion(): > > > > if (!NFS_PROTO(inode)->access) > > goto out; > > I've seen that, I just thought that was for some broken servers not > for all NFSv2 servers. > > Anyway that's been fixed in NFSv3, so obviously the "permission > checking on both sides" wasn't optimal :) > > > And for either version of NFS, if the uid and gid are non-zero, and > > the permission bits indicate that an access is permitted, then the > > client does not consult the server for permission. > > Where's that? I see no such check. /* * Trust UNIX mode bits except: * * 1) When override capabilities may have been invoked * 2) When root squashing may be involved * 3) When ACLs may overturn a negative answer */ if (!capable(CAP_DAC_OVERRIDE) && !capable(CAP_DAC_READ_SEARCH) && (current->fsuid != 0) && (current->fsgid != 0) && error != -EACCES) goto out; -- Jamie - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] FUSE permission modell (Was: fuse review bits)
Miklos Szeredi wrote: > > Note that NFS checks the permissions on _both_ the client and server, > > for a reason. > > Does it? If I read the code correctly the client checks credentials > supplied by the server (or cached). But the server does the actual > checking of permissions. > > Am I missing something? Yes, for NFSv2, this test in nfs_permssion(): if (!NFS_PROTO(inode)->access) goto out; And for either version of NFS, if the uid and gid are non-zero, and the permission bits indicate that an access is permitted, then the client does not consult the server for permission. -- Jamie - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] FUSE permission modell (Was: fuse review bits)
Miklos Szeredi wrote: > > If the user wants to edit a read-only file in a tgz owned by himself, > > why can he not _chmod_ the file and _then_ edit it? > > > > That said, I would _usually_ prefer that when I enter a tgz, that I > > see all component files having the same uid/gid/permissions as the tgz > > file itself - the same as I'd see if I entered a zip file. > > I can accept that usually you are not interested in the stored > uid/gid. But doubt that you want to lose permission information when > you mount a tar file. Zip is a different kettle of fish since that > doesn't contain uid/gid/permissions. It's not about being not interested. It's because I'd want my programs, and other users, to have exactly the same access to the tgz contents through vfs as they have when accessing the tgz file directly. Not some baroque combination of unobvious hard-coded permission rules, that aren't even visible through stat(), and which both increase permissions for the user and decrease it for others at the same time. I see why you may want to hide certain things from other users sometimes - the sshfs example is a good one. I daresay Al Viro could come up with a per-user or per-keyring mount point for those occasions :) -- Jamie - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] FUSE permission modell (Was: fuse review bits)
Frank Sorenson wrote: > > That does not make sense. > > > > Are you saying you cannot trust your own sshfs userspace daemon? > > The user who wrote the userspace code may be able to, but the system > shouldn't trust the userspace daemon. Permissions will be enforced by > the ssh server. In fact that's wrong. Although permissions are enforced by the ssh server, the server assumes that once a conncection is established, all requests on it are from the same originating user. The server is not able to check which user is accessing the files on the client machine - which is why Miklos wants/needs restrictive permissions on the client machine too. -- Jamie - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] FUSE permission modell (Was: fuse review bits)
Bodo Eggert <[EMAIL PROTECTED]> wrote: > >> That is exactly the intended effect. If I'm at my work machine (where > >> I'm not an admin unfortunately) and I mount my home machine with sshfs > >> (because FUSE is installed fortunately :), then I bloody well don't > >> want the sysadmin or some automated script of his to go mucking under > >> the mountpoint. > > > > I think that would be _much_ nicer implemented as a mount which is > > invisible to other users, rather than one which causes the admin's > > scripts to spew error messages. Is the namespace mechanism at all > > suitable for that? > > This will require shared subtrees plus a way for new logins from the same > user to join an existing (previous login) namespace. Or "per-user namespaces". It's part of a more general problem of how you limit access to private data such as crypto keys, either per user, or more finely than that. Isn't that what all the keyring stuff is for? -- Jamie - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] FUSE permission modell (Was: fuse review bits)
Bodo Eggert <[EMAIL PROTECTED]> wrote: > Jamie Lokier <[EMAIL PROTECTED]> wrote: > > Miklos Szeredi wrote: > > >> 4) Access should not be further restricted for the owner of the > >> mount, even if permission bits, uid or gid would suggest > >> otherwise > > > > Why? Surely you want to prevent writing to files which don't have the > > writable bit set? A filesystem may also create append-only files - > > and all users including the mount owner should be bound by that. > > That will depend on the situation. If the user is mounting a tgz owned > by himself, FUSE should default to being a convenient hex-editor. If the user wants to edit a read-only file in a tgz owned by himself, why can he not _chmod_ the file and _then_ edit it? That said, I would _usually_ prefer that when I enter a tgz, that I see all component files having the same uid/gid/permissions as the tgz file itself - the same as I'd see if I entered a zip file. -- Jamie - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] FUSE permission modell (Was: fuse review bits)
Miklos Szeredi wrote: > The same is true for the case when you mount an sshfs. Since you > entered your password (or have a passwordless login to the server) you > are authorized to browse the files on the server, but only with the > capabilities you have there as a user. The server does the > authorization. The same is true for an NFS mount btw. It's not the > client that checks the permissions. > > So do you see why I argue in favor of having an option _not_ to check > permissions on the client by the kernel? Note that NFS checks the permissions on _both_ the client and server, for a reason. -- Jamie - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] FUSE permission modell (Was: fuse review bits)
Miklos Szeredi wrote: > > It would also be nice to generalise and have virtual filesystems which > > are able to present different views to different users. Can FUSE do > > that already - is the userspace part told which user is doing each > > operation? > > Yes. > > > With that, the desire for virtual filesystems which cannot be read > > by your sysadmin (by accident) is easy to satisfy - and that kind of > > mechanism would probably be acceptable to all. > > The problem is that this way the responsibility goes to the userspace > program, which can't be trusted. That does not make sense. Are you saying you cannot trust your own sshfs userspace daemon? -- Jamie - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] FUSE permission modell (Was: fuse review bits)
Miklos Szeredi wrote: It would also be nice to generalise and have virtual filesystems which are able to present different views to different users. Can FUSE do that already - is the userspace part told which user is doing each operation? Yes. With that, the desire for virtual filesystems which cannot be read by your sysadmin (by accident) is easy to satisfy - and that kind of mechanism would probably be acceptable to all. The problem is that this way the responsibility goes to the userspace program, which can't be trusted. That does not make sense. Are you saying you cannot trust your own sshfs userspace daemon? -- Jamie - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] FUSE permission modell (Was: fuse review bits)
Miklos Szeredi wrote: The same is true for the case when you mount an sshfs. Since you entered your password (or have a passwordless login to the server) you are authorized to browse the files on the server, but only with the capabilities you have there as a user. The server does the authorization. The same is true for an NFS mount btw. It's not the client that checks the permissions. So do you see why I argue in favor of having an option _not_ to check permissions on the client by the kernel? Note that NFS checks the permissions on _both_ the client and server, for a reason. -- Jamie - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] FUSE permission modell (Was: fuse review bits)
Bodo Eggert [EMAIL PROTECTED] wrote: Jamie Lokier [EMAIL PROTECTED] wrote: Miklos Szeredi wrote: 4) Access should not be further restricted for the owner of the mount, even if permission bits, uid or gid would suggest otherwise Why? Surely you want to prevent writing to files which don't have the writable bit set? A filesystem may also create append-only files - and all users including the mount owner should be bound by that. That will depend on the situation. If the user is mounting a tgz owned by himself, FUSE should default to being a convenient hex-editor. If the user wants to edit a read-only file in a tgz owned by himself, why can he not _chmod_ the file and _then_ edit it? That said, I would _usually_ prefer that when I enter a tgz, that I see all component files having the same uid/gid/permissions as the tgz file itself - the same as I'd see if I entered a zip file. -- Jamie - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] FUSE permission modell (Was: fuse review bits)
Bodo Eggert [EMAIL PROTECTED] wrote: That is exactly the intended effect. If I'm at my work machine (where I'm not an admin unfortunately) and I mount my home machine with sshfs (because FUSE is installed fortunately :), then I bloody well don't want the sysadmin or some automated script of his to go mucking under the mountpoint. I think that would be _much_ nicer implemented as a mount which is invisible to other users, rather than one which causes the admin's scripts to spew error messages. Is the namespace mechanism at all suitable for that? This will require shared subtrees plus a way for new logins from the same user to join an existing (previous login) namespace. Or per-user namespaces. It's part of a more general problem of how you limit access to private data such as crypto keys, either per user, or more finely than that. Isn't that what all the keyring stuff is for? -- Jamie - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] FUSE permission modell (Was: fuse review bits)
Frank Sorenson wrote: That does not make sense. Are you saying you cannot trust your own sshfs userspace daemon? The user who wrote the userspace code may be able to, but the system shouldn't trust the userspace daemon. Permissions will be enforced by the ssh server. In fact that's wrong. Although permissions are enforced by the ssh server, the server assumes that once a conncection is established, all requests on it are from the same originating user. The server is not able to check which user is accessing the files on the client machine - which is why Miklos wants/needs restrictive permissions on the client machine too. -- Jamie - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] FUSE permission modell (Was: fuse review bits)
Miklos Szeredi wrote: If the user wants to edit a read-only file in a tgz owned by himself, why can he not _chmod_ the file and _then_ edit it? That said, I would _usually_ prefer that when I enter a tgz, that I see all component files having the same uid/gid/permissions as the tgz file itself - the same as I'd see if I entered a zip file. I can accept that usually you are not interested in the stored uid/gid. But doubt that you want to lose permission information when you mount a tar file. Zip is a different kettle of fish since that doesn't contain uid/gid/permissions. It's not about being not interested. It's because I'd want my programs, and other users, to have exactly the same access to the tgz contents through vfs as they have when accessing the tgz file directly. Not some baroque combination of unobvious hard-coded permission rules, that aren't even visible through stat(), and which both increase permissions for the user and decrease it for others at the same time. I see why you may want to hide certain things from other users sometimes - the sshfs example is a good one. I daresay Al Viro could come up with a per-user or per-keyring mount point for those occasions :) -- Jamie - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] FUSE permission modell (Was: fuse review bits)
Miklos Szeredi wrote: Note that NFS checks the permissions on _both_ the client and server, for a reason. Does it? If I read the code correctly the client checks credentials supplied by the server (or cached). But the server does the actual checking of permissions. Am I missing something? Yes, for NFSv2, this test in nfs_permssion(): if (!NFS_PROTO(inode)-access) goto out; And for either version of NFS, if the uid and gid are non-zero, and the permission bits indicate that an access is permitted, then the client does not consult the server for permission. -- Jamie - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] FUSE permission modell (Was: fuse review bits)
Miklos Szeredi wrote: Yes, for NFSv2, this test in nfs_permssion(): if (!NFS_PROTO(inode)-access) goto out; I've seen that, I just thought that was for some broken servers not for all NFSv2 servers. Anyway that's been fixed in NFSv3, so obviously the permission checking on both sides wasn't optimal :) And for either version of NFS, if the uid and gid are non-zero, and the permission bits indicate that an access is permitted, then the client does not consult the server for permission. Where's that? I see no such check. /* * Trust UNIX mode bits except: * * 1) When override capabilities may have been invoked * 2) When root squashing may be involved * 3) When ACLs may overturn a negative answer */ if (!capable(CAP_DAC_OVERRIDE) !capable(CAP_DAC_READ_SEARCH) (current-fsuid != 0) (current-fsgid != 0) error != -EACCES) goto out; -- Jamie - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] FUSE permission modell (Was: fuse review bits)
Miklos Szeredi wrote: The same can be true for tarfs. I mount it for my purpose, others can mount it for theirs. Since the daemon providing the filesystem asways runs with the same capabilities as the user who did the mount, I and others will always get the permissions that we have on the actual tar file. Fair enough. Think of the no permission for others as hiding, not as some special permission rule. And if this hiding can be nicely done with namespaces, all the better, I'll happily drop this feature at that instant. Indeed, if it can be done with namespaces _and_ mounting on a file (that file-as-directory concept), _and_ automounting, then you could cd into your tgz files and others could too :) -- Jamie - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] FUSE permission modell (Was: fuse review bits)
Miklos Szeredi wrote: Still can't find it :) Which kernel? Which file? I'm looking at linux-2.4.30/fs/nfs/dir.c. -- Jamie - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] FUSE permission modell (Was: fuse review bits)
Miklos Szeredi wrote: Indeed, if it can be done with namespaces _and_ mounting on a file (that file-as-directory concept), _and_ automounting, then you could cd into your tgz files and others could too :) There's still that little problem of accessing the tgz file both as a file and a directory. But yes. Maybe in 10 years time :) There was a thread a few months ago where file-as-directory was discussed extensively, after Namesys implemented it. That's where the conversation on detachable mount points originated AFAIR. It will probably happen at some point. A nice implemention of it in FUSE could push it along a bit :) -- Jamie - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] FUSE permission modell (Was: fuse review bits)
Anton Altaparmakov wrote: That said, I would _usually_ prefer that when I enter a tgz, that I see all component files having the same uid/gid/permissions as the tgz file itself - the same as I'd see if I entered a zip file. As you say _usually_, even you admit that sometimes you would want the real owner/permissions to be shown. And that is the point Miklos is trying to make I believe: it should be configurable not hard set to only one which is what I understand HCH wants. There are uses for both. For example today I was updating the tar ball which is used to create the var file system for a new chroot. I certainly want to see corretly setup owner/permissions when I look into that tar ball using a FUSE fs... If I'm updating a var filesystem for a new chroot, I'd need the ability to chmod and chown things in that filesystem. Does that work as an ordinary user? -- Jamie - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] FUSE permission modell (Was: fuse review bits)
Miklos Szeredi wrote: > That is exactly the intended effect. If I'm at my work machine (where > I'm not an admin unfortunately) and I mount my home machine with sshfs > (because FUSE is installed fortunately :), then I bloody well don't > want the sysadmin or some automated script of his to go mucking under > the mountpoint. I think that would be _much_ nicer implemented as a mount which is invisible to other users, rather than one which causes the admin's scripts to spew error messages. Is the namespace mechanism at all suitable for that? It would also be nice to generalise and have virtual filesystems which are able to present different views to different users. Can FUSE do that already - is the userspace part told which user is doing each operation? With that, the desire for virtual filesystems which cannot be read by your sysadmin (by accident) is easy to satisfy - and that kind of mechanism would probably be acceptable to all. -- Jamie - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] FUSE permission modell (Was: fuse review bits)
Miklos Szeredi wrote: > 1) User must not be able to modify files or directories in a way > which he otherwise could not do (e.g. mount a filesystem over > /bin) > > 2) Suid and device semantics should be disabled within the mount > > 3) No other user should have access to files under the mount, not > even root[5] Why? I can see plenty of uses where I want a filesystem generated by one user to be accessible by other users. That policy should not be hard-coded into the kernel. It might be an option. > 4) Access should not be further restricted for the owner of the > mount, even if permission bits, uid or gid would suggest > otherwise Why? Surely you want to prevent writing to files which don't have the writable bit set? A filesystem may also create append-only files - and all users including the mount owner should be bound by that. > 5) As much of the available information should be exported via the > filesystem as possible This is the root of the conflict. You are trying to overload the permission bits and uid/gid to mean something different than they normally do. While it's convenient to see some "remote" information such as the uid/gid in a tar file, are you sure it's a good idea to break the unix permissions model - which will break some programs? (For example, try editing a file with the broken semantics in an editor which checks the uid/gid of the file against the current user). For most virtual filesystems, the "remote" information does not map to uid/gid in a particularly natural way anyway. So it seems odd to want to break the unix permissions model just so that a small _subset_ of virtual filesystems can use stat() as a way to get a bit of information out, when other virtual filesystems (e.g. webdavfs) can't put meaningful information in there, and would benefit from normal unix permissions instead. > 1) Only allow mount over a directory for which the user has write > access (and is not sticky) Seems good - but why not sticky? Mounting a user filesystem in /tmp/user-xxx/my-mount-point seems not unreasonable - provided the administrator can delete the directory (which is possible with detachable mount points). > 2) Use nosuid,nodev mount options Of course. Ideally, make sure they appear to be set in /proc/mounts. (root (or equivalent) should be able to create virtual filesystems without these options, but probably they should be set by default even for root, and clearable using suid,dev). > 3) In permission method of FUSE kernel module compare fsuid against > mounting user's ID, and return EACCES if they are not equal. Bad. How do I, user A, then create a virtual filesystem which I want user B to be able to access? > 4) The filesystem daemon does not run with elevated permissions. > The kernel doesn't check file more in the permission method. I like the idea that the fs daemon doesn't need elevated permissions. > 5) The filesystem daemon is free to fill in all file attributes to > any (sane) value, and the kernel won't modify these. Dangerous, because an administrative program might actually trust the attributes to mean what they normally mean in the unix permissions model. > The debated part is 3) and 4), namely that normal permission checking > based on file mode is bypassed, and the mounting user has full > permission to all files, while other users have none. > > This feature has been in FUSE from the start and thus has been very > well tested in real world scenarios. Also I have thought a lot about > how this could pose any kind of security threat, and as yet found no > such possiblity. Ok, but why do you prevent the useful behaviour of allowing access to other users, when I want that? For example, I might export my current project's database as a filesystem that I _want_ other users to be able to read. > Despite this Christoph feels this behavior is unacceptable for a > filesystem, and wants me to remove this feature before merging FUSE > into mainline. I try to be open to ideas, but also feel strongly that > this is the Right Way, so I won't give up easily. I agree with Christoph. It is a huge deviation from the unix permissions model -- and it seems to prevent some useful applications of FUSE so it's not clear why you want it. -- Jamie - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] FUSE permission modell (Was: fuse review bits)
Miklos Szeredi wrote: 1) User must not be able to modify files or directories in a way which he otherwise could not do (e.g. mount a filesystem over /bin) 2) Suid and device semantics should be disabled within the mount 3) No other user should have access to files under the mount, not even root[5] Why? I can see plenty of uses where I want a filesystem generated by one user to be accessible by other users. That policy should not be hard-coded into the kernel. It might be an option. 4) Access should not be further restricted for the owner of the mount, even if permission bits, uid or gid would suggest otherwise Why? Surely you want to prevent writing to files which don't have the writable bit set? A filesystem may also create append-only files - and all users including the mount owner should be bound by that. 5) As much of the available information should be exported via the filesystem as possible This is the root of the conflict. You are trying to overload the permission bits and uid/gid to mean something different than they normally do. While it's convenient to see some remote information such as the uid/gid in a tar file, are you sure it's a good idea to break the unix permissions model - which will break some programs? (For example, try editing a file with the broken semantics in an editor which checks the uid/gid of the file against the current user). For most virtual filesystems, the remote information does not map to uid/gid in a particularly natural way anyway. So it seems odd to want to break the unix permissions model just so that a small _subset_ of virtual filesystems can use stat() as a way to get a bit of information out, when other virtual filesystems (e.g. webdavfs) can't put meaningful information in there, and would benefit from normal unix permissions instead. 1) Only allow mount over a directory for which the user has write access (and is not sticky) Seems good - but why not sticky? Mounting a user filesystem in /tmp/user-xxx/my-mount-point seems not unreasonable - provided the administrator can delete the directory (which is possible with detachable mount points). 2) Use nosuid,nodev mount options Of course. Ideally, make sure they appear to be set in /proc/mounts. (root (or equivalent) should be able to create virtual filesystems without these options, but probably they should be set by default even for root, and clearable using suid,dev). 3) In permission method of FUSE kernel module compare fsuid against mounting user's ID, and return EACCES if they are not equal. Bad. How do I, user A, then create a virtual filesystem which I want user B to be able to access? 4) The filesystem daemon does not run with elevated permissions. The kernel doesn't check file more in the permission method. I like the idea that the fs daemon doesn't need elevated permissions. 5) The filesystem daemon is free to fill in all file attributes to any (sane) value, and the kernel won't modify these. Dangerous, because an administrative program might actually trust the attributes to mean what they normally mean in the unix permissions model. The debated part is 3) and 4), namely that normal permission checking based on file mode is bypassed, and the mounting user has full permission to all files, while other users have none. This feature has been in FUSE from the start and thus has been very well tested in real world scenarios. Also I have thought a lot about how this could pose any kind of security threat, and as yet found no such possiblity. Ok, but why do you prevent the useful behaviour of allowing access to other users, when I want that? For example, I might export my current project's database as a filesystem that I _want_ other users to be able to read. Despite this Christoph feels this behavior is unacceptable for a filesystem, and wants me to remove this feature before merging FUSE into mainline. I try to be open to ideas, but also feel strongly that this is the Right Way, so I won't give up easily. I agree with Christoph. It is a huge deviation from the unix permissions model -- and it seems to prevent some useful applications of FUSE so it's not clear why you want it. -- Jamie - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] FUSE permission modell (Was: fuse review bits)
Miklos Szeredi wrote: That is exactly the intended effect. If I'm at my work machine (where I'm not an admin unfortunately) and I mount my home machine with sshfs (because FUSE is installed fortunately :), then I bloody well don't want the sysadmin or some automated script of his to go mucking under the mountpoint. I think that would be _much_ nicer implemented as a mount which is invisible to other users, rather than one which causes the admin's scripts to spew error messages. Is the namespace mechanism at all suitable for that? It would also be nice to generalise and have virtual filesystems which are able to present different views to different users. Can FUSE do that already - is the userspace part told which user is doing each operation? With that, the desire for virtual filesystems which cannot be read by your sysadmin (by accident) is easy to satisfy - and that kind of mechanism would probably be acceptable to all. -- Jamie - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: kernel bug: futex_wait hang
Lee Revell wrote: > On Tue, 2005-03-22 at 04:48 +0000, Jamie Lokier wrote: > > I argued for fixing Glibc on the grounds that the changed kernel > > behaviour, or more exactly having Glibc depend on it, loses a certain > > semantic property useful for unusual operations on multiple futexes at > > the same time. But I appear to have lost the argument, and Jakub's > > latest patch does clean up some cruft quite nicely, with no expected > > performance hit. > > A glibc fix will take forever to get to users compared to a kernel fix. Interesting perspective. On my systems Glibc is upgraded more often than the kernel. -- Jamie - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: kernel bug: futex_wait hang
Lee Revell wrote: On Tue, 2005-03-22 at 04:48 +, Jamie Lokier wrote: I argued for fixing Glibc on the grounds that the changed kernel behaviour, or more exactly having Glibc depend on it, loses a certain semantic property useful for unusual operations on multiple futexes at the same time. But I appear to have lost the argument, and Jakub's latest patch does clean up some cruft quite nicely, with no expected performance hit. A glibc fix will take forever to get to users compared to a kernel fix. Interesting perspective. On my systems Glibc is upgraded more often than the kernel. -- Jamie - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: kernel bug: futex_wait hang
Andrew Morton wrote: > iirc we ended up deciding that the futex problems around that time were due > to userspace problems (a version of libc). But then, there's no discussion > around Seto's patch and it didn't get applied. So I don't know what > happened to that work - it's all a bit mysterious. It can be fixed _either_ in Glibc, or by changing the kernel. That problem is caused by differing assumptions between Glibc and the kernel about subtle futex semantics. Which goes to show they are really clever, or something. I provided pseudo-code for the Glibc fix, but not an actual patch because NPTL is quite complicated and I wanted to know the Glibc people were interested, but apparently they were too busy at the time - benchmarks would have made sense for such a patch. Scott Snyder started fixing part of Glibc, and that did fix his instance of this problem so we know the approach works. But a full patch for Glibc was not prepared. The most recent messages under "Futex queue_me/get_user ordering", with a patch from Jakub Jelinek will fix this problem by changing the kernel. Yes, you should apply Jakub's most recent patch, message-ID "<[EMAIL PROTECTED]>". I have not tested the patch, but it looks convincing. I argued for fixing Glibc on the grounds that the changed kernel behaviour, or more exactly having Glibc depend on it, loses a certain semantic property useful for unusual operations on multiple futexes at the same time. But I appear to have lost the argument, and Jakub's latest patch does clean up some cruft quite nicely, with no expected performance hit. -- Jamie - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: kernel bug: futex_wait hang
Lee Revell wrote: > > iirc we ended up deciding that the futex problems around that time were due > > to userspace problems (a version of libc). But then, there's no discussion > > around Seto's patch and it didn't get applied. So I don't know what > > happened to that work - it's all a bit mysterious. > > It does seem like it could be a different problem. Maybe Paul can > provide some more evidence that it's a kernel and not a glibc/NPTL bug. > I'm really just posting this on Paul's behalf; I don't claim to > understand the issue. ;-) Try applying the patch below, which was recently posted by Jakub Jelinek. If it fixes the problem, it's the same thing as Hidetoshi Seto noticed, although this patch is much improved thanks to "preempt_count technology" (tm). If not, it's a whole new problem. -- Jamie --- linux-2.6.11/kernel/futex.c.jj 2005-03-17 04:42:29.0 -0500 +++ linux-2.6.11/kernel/futex.c 2005-03-18 05:45:29.0 -0500 @@ -97,7 +97,6 @@ struct futex_q { */ struct futex_hash_bucket { spinlock_t lock; - unsigned intnqueued; struct list_head chain; }; @@ -265,7 +264,6 @@ static inline int get_futex_value_locked inc_preempt_count(); ret = __copy_from_user_inatomic(dest, from, sizeof(int)); dec_preempt_count(); - preempt_check_resched(); return ret ? -EFAULT : 0; } @@ -339,7 +337,6 @@ static int futex_requeue(unsigned long u struct list_head *head1; struct futex_q *this, *next; int ret, drop_count = 0; - unsigned int nqueued; retry: down_read(>mm->mmap_sem); @@ -354,23 +351,22 @@ static int futex_requeue(unsigned long u bh1 = hash_futex(); bh2 = hash_futex(); - nqueued = bh1->nqueued; + if (bh1 < bh2) + spin_lock(>lock); + spin_lock(>lock); + if (bh1 > bh2) + spin_lock(>lock); + if (likely(valp != NULL)) { int curval; - /* In order to avoid doing get_user while - holding bh1->lock and bh2->lock, nqueued - (monotonically increasing field) must be first - read, then *uaddr1 fetched from userland and - after acquiring lock nqueued field compared with - the stored value. The smp_mb () below - makes sure that bh1->nqueued is read from memory - before *uaddr1. */ - smp_mb(); - ret = get_futex_value_locked(, (int __user *)uaddr1); if (unlikely(ret)) { + spin_unlock(>lock); + if (bh1 != bh2) + spin_unlock(>lock); + /* If we would have faulted, release mmap_sem, fault * it in and start all over again. */ @@ -385,21 +381,10 @@ static int futex_requeue(unsigned long u } if (curval != *valp) { ret = -EAGAIN; - goto out; + goto out_unlock; } } - if (bh1 < bh2) - spin_lock(>lock); - spin_lock(>lock); - if (bh1 > bh2) - spin_lock(>lock); - - if (unlikely(nqueued != bh1->nqueued && valp != NULL)) { - ret = -EAGAIN; - goto out_unlock; - } - head1 = >chain; list_for_each_entry_safe(this, next, head1, list) { if (!match_futex (>key, )) @@ -435,13 +420,9 @@ out: return ret; } -/* - * queue_me and unqueue_me must be called as a pair, each - * exactly once. They are called with the hashed spinlock held. - */ - /* The key must be already stored in q->key. */ -static void queue_me(struct futex_q *q, int fd, struct file *filp) +static inline struct futex_hash_bucket * +queue_lock(struct futex_q *q, int fd, struct file *filp) { struct futex_hash_bucket *bh; @@ -455,11 +436,35 @@ static void queue_me(struct futex_q *q, q->lock_ptr = >lock; spin_lock(>lock); - bh->nqueued++; + return bh; +} + +static inline void __queue_me(struct futex_q *q, struct futex_hash_bucket *bh) +{ list_add_tail(>list, >chain); spin_unlock(>lock); } +static inline void +queue_unlock(struct futex_q *q, struct futex_hash_bucket *bh) +{ + spin_unlock(>lock); + drop_key_refs(>key); +} + +/* + * queue_me and unqueue_me must be called as a pair, each + * exactly once. They are called with the hashed spinlock held. + */ + +/* The key must be already stored in q->key. */ +static void queue_me(struct futex_q *q, int fd, struct file *filp) +{ + struct futex_hash_bucket *bh; + bh = queue_lock(q, fd, filp); + __queue_me(q, bh); +} + /* Return 1 if we were still queued (ie. 0 means we were woken) */ static int unqueue_me(struct futex_q *q) { @@
Re: kernel bug: futex_wait hang
Lee Revell wrote: iirc we ended up deciding that the futex problems around that time were due to userspace problems (a version of libc). But then, there's no discussion around Seto's patch and it didn't get applied. So I don't know what happened to that work - it's all a bit mysterious. It does seem like it could be a different problem. Maybe Paul can provide some more evidence that it's a kernel and not a glibc/NPTL bug. I'm really just posting this on Paul's behalf; I don't claim to understand the issue. ;-) Try applying the patch below, which was recently posted by Jakub Jelinek. If it fixes the problem, it's the same thing as Hidetoshi Seto noticed, although this patch is much improved thanks to preempt_count technology (tm). If not, it's a whole new problem. -- Jamie --- linux-2.6.11/kernel/futex.c.jj 2005-03-17 04:42:29.0 -0500 +++ linux-2.6.11/kernel/futex.c 2005-03-18 05:45:29.0 -0500 @@ -97,7 +97,6 @@ struct futex_q { */ struct futex_hash_bucket { spinlock_t lock; - unsigned intnqueued; struct list_head chain; }; @@ -265,7 +264,6 @@ static inline int get_futex_value_locked inc_preempt_count(); ret = __copy_from_user_inatomic(dest, from, sizeof(int)); dec_preempt_count(); - preempt_check_resched(); return ret ? -EFAULT : 0; } @@ -339,7 +337,6 @@ static int futex_requeue(unsigned long u struct list_head *head1; struct futex_q *this, *next; int ret, drop_count = 0; - unsigned int nqueued; retry: down_read(current-mm-mmap_sem); @@ -354,23 +351,22 @@ static int futex_requeue(unsigned long u bh1 = hash_futex(key1); bh2 = hash_futex(key2); - nqueued = bh1-nqueued; + if (bh1 bh2) + spin_lock(bh1-lock); + spin_lock(bh2-lock); + if (bh1 bh2) + spin_lock(bh1-lock); + if (likely(valp != NULL)) { int curval; - /* In order to avoid doing get_user while - holding bh1-lock and bh2-lock, nqueued - (monotonically increasing field) must be first - read, then *uaddr1 fetched from userland and - after acquiring lock nqueued field compared with - the stored value. The smp_mb () below - makes sure that bh1-nqueued is read from memory - before *uaddr1. */ - smp_mb(); - ret = get_futex_value_locked(curval, (int __user *)uaddr1); if (unlikely(ret)) { + spin_unlock(bh1-lock); + if (bh1 != bh2) + spin_unlock(bh2-lock); + /* If we would have faulted, release mmap_sem, fault * it in and start all over again. */ @@ -385,21 +381,10 @@ static int futex_requeue(unsigned long u } if (curval != *valp) { ret = -EAGAIN; - goto out; + goto out_unlock; } } - if (bh1 bh2) - spin_lock(bh1-lock); - spin_lock(bh2-lock); - if (bh1 bh2) - spin_lock(bh1-lock); - - if (unlikely(nqueued != bh1-nqueued valp != NULL)) { - ret = -EAGAIN; - goto out_unlock; - } - head1 = bh1-chain; list_for_each_entry_safe(this, next, head1, list) { if (!match_futex (this-key, key1)) @@ -435,13 +420,9 @@ out: return ret; } -/* - * queue_me and unqueue_me must be called as a pair, each - * exactly once. They are called with the hashed spinlock held. - */ - /* The key must be already stored in q-key. */ -static void queue_me(struct futex_q *q, int fd, struct file *filp) +static inline struct futex_hash_bucket * +queue_lock(struct futex_q *q, int fd, struct file *filp) { struct futex_hash_bucket *bh; @@ -455,11 +436,35 @@ static void queue_me(struct futex_q *q, q-lock_ptr = bh-lock; spin_lock(bh-lock); - bh-nqueued++; + return bh; +} + +static inline void __queue_me(struct futex_q *q, struct futex_hash_bucket *bh) +{ list_add_tail(q-list, bh-chain); spin_unlock(bh-lock); } +static inline void +queue_unlock(struct futex_q *q, struct futex_hash_bucket *bh) +{ + spin_unlock(bh-lock); + drop_key_refs(q-key); +} + +/* + * queue_me and unqueue_me must be called as a pair, each + * exactly once. They are called with the hashed spinlock held. + */ + +/* The key must be already stored in q-key. */ +static void queue_me(struct futex_q *q, int fd, struct file *filp) +{ + struct futex_hash_bucket *bh; + bh = queue_lock(q, fd, filp); + __queue_me(q, bh); +} + /* Return 1 if we were still queued (ie. 0 means we were woken) */ static int
Re: kernel bug: futex_wait hang
Andrew Morton wrote: iirc we ended up deciding that the futex problems around that time were due to userspace problems (a version of libc). But then, there's no discussion around Seto's patch and it didn't get applied. So I don't know what happened to that work - it's all a bit mysterious. It can be fixed _either_ in Glibc, or by changing the kernel. That problem is caused by differing assumptions between Glibc and the kernel about subtle futex semantics. Which goes to show they are really clever, or something. I provided pseudo-code for the Glibc fix, but not an actual patch because NPTL is quite complicated and I wanted to know the Glibc people were interested, but apparently they were too busy at the time - benchmarks would have made sense for such a patch. Scott Snyder started fixing part of Glibc, and that did fix his instance of this problem so we know the approach works. But a full patch for Glibc was not prepared. The most recent messages under Futex queue_me/get_user ordering, with a patch from Jakub Jelinek will fix this problem by changing the kernel. Yes, you should apply Jakub's most recent patch, message-ID [EMAIL PROTECTED]. I have not tested the patch, but it looks convincing. I argued for fixing Glibc on the grounds that the changed kernel behaviour, or more exactly having Glibc depend on it, loses a certain semantic property useful for unusual operations on multiple futexes at the same time. But I appear to have lost the argument, and Jakub's latest patch does clean up some cruft quite nicely, with no expected performance hit. -- Jamie - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Futex queue_me/get_user ordering
Ingo Molnar wrote: > > * Jakub Jelinek <[EMAIL PROTECTED]> wrote: > > > The futex man pages that have been around for years (certainly since > > mid 2002) certainly don't document FUTEX_WAIT as token passing > > operation, but as atomic operation: > > > > Say http://www.icewalkers.com/Linux/ManPages/futex-2.html > > besides this documented-behavior argument, i dont think futexes should > be degraded into waitqueues I give in... Depending on atomicity makes it impossible for an application, which is linked with NPTL and Glibc, to write an NPTL-compatible "wait on two locks" function. I'm not saying that's a very clean thing to want, but it's a conceptual loss and I'm disappointed I seem to be the only one noticing it. On the other hand, I was mistaken to think it makes it impossible to write an emulation of synchronous futex() in terms of asynchronous futex().* In fact it makes it impossible to do so using the existing FUTEX_FD, but it would be possible if there were a FUTEX_FD2 added somewhere down the line. * - The reason you would do this is if you were writing userspace-threading for any reason, and you had to include an emulation of synchronous futex() in terms of async futex because there are some libraries which might run on top of the userspace-threading which use futex in an application-dependent way. > - in fact, to solve some of the known > performance problems the opposite will have to happen: e.g. i believe > that in the future we'll need to enable the kernel-side futex code to > actually modify the futex variable. I.e. atomicity of the read in > FUTEX_WAIT is an absolute must, and is only the first step. Some of those performance problems can be solved already by better use of FUTEX_REQUEUE instead of FUTEX_WAKE. > [ the double-context-switch problem in cond_signal() that Jamie > mentioned is precisely one such case: pthread semantics force us that > the wakeup of the wakee _must_ happen while still holding the internal > lock. So we cannot just delay the wakeup to outside the glibc critical > section. This double context-switch could be avoided if the 'release > internal lock and wake up wakee' operation could be done atomically > within the kernel. (A sane default 'userspace unlock' operation on a > machine word could be defined .. e.g. decrement-to-zero.) ] Did you not see the solution I gave last November, using FUTEX_REQUEUE? See: http://lkml.org/lkml/2004/11/29/201 I spent a _lot_ of time figuring it out but everyone was too busy to confirm that it worked. It would improve performance in a number of cases. I hope that it does not get ignored yet again. There _may_ be cases where more complex futex operations are needed, but we should try the better algorithms that use the existing futex operations before adding new ones. -- Jamie - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Futex queue_me/get_user ordering
Ingo Molnar wrote: * Jakub Jelinek [EMAIL PROTECTED] wrote: The futex man pages that have been around for years (certainly since mid 2002) certainly don't document FUTEX_WAIT as token passing operation, but as atomic operation: Say http://www.icewalkers.com/Linux/ManPages/futex-2.html besides this documented-behavior argument, i dont think futexes should be degraded into waitqueues I give in... Depending on atomicity makes it impossible for an application, which is linked with NPTL and Glibc, to write an NPTL-compatible wait on two locks function. I'm not saying that's a very clean thing to want, but it's a conceptual loss and I'm disappointed I seem to be the only one noticing it. On the other hand, I was mistaken to think it makes it impossible to write an emulation of synchronous futex() in terms of asynchronous futex().* In fact it makes it impossible to do so using the existing FUTEX_FD, but it would be possible if there were a FUTEX_FD2 added somewhere down the line. * - The reason you would do this is if you were writing userspace-threading for any reason, and you had to include an emulation of synchronous futex() in terms of async futex because there are some libraries which might run on top of the userspace-threading which use futex in an application-dependent way. - in fact, to solve some of the known performance problems the opposite will have to happen: e.g. i believe that in the future we'll need to enable the kernel-side futex code to actually modify the futex variable. I.e. atomicity of the read in FUTEX_WAIT is an absolute must, and is only the first step. Some of those performance problems can be solved already by better use of FUTEX_REQUEUE instead of FUTEX_WAKE. [ the double-context-switch problem in cond_signal() that Jamie mentioned is precisely one such case: pthread semantics force us that the wakeup of the wakee _must_ happen while still holding the internal lock. So we cannot just delay the wakeup to outside the glibc critical section. This double context-switch could be avoided if the 'release internal lock and wake up wakee' operation could be done atomically within the kernel. (A sane default 'userspace unlock' operation on a machine word could be defined .. e.g. decrement-to-zero.) ] Did you not see the solution I gave last November, using FUTEX_REQUEUE? See: http://lkml.org/lkml/2004/11/29/201 I spent a _lot_ of time figuring it out but everyone was too busy to confirm that it worked. It would improve performance in a number of cases. I hope that it does not get ignored yet again. There _may_ be cases where more complex futex operations are needed, but we should try the better algorithms that use the existing futex operations before adding new ones. -- Jamie - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Futex queue_me/get_user ordering
Jakub Jelinek wrote: > http://www.ussg.iu.edu/hypermail/linux/kernel/0411.2/0953.html > > Your argument in November was that you don't want to slow down the > kernel and that userland must be able to cope with the > non-atomicity of futex syscall. Those were two of them. But my other main concern is conceptual. Right now, a futex_wait call is roughly equivalent to to add_wait_queue, which is quite versatile. It means anything you can do with one futex, you can extend to multiple futexes (e.g. waiting on more than one lock), and you can do asynchronously (e.g. futex_wait can be implemented in userspace as futex_fd[1] + poll[2], and therefore things like poll-driven state machines where one of the state machines wants to wait on a lock are possible). [1] Ulrich was mistaken in his paper to say futex_fd needs to check a word to be useful; userspace is supposed to check the word after futex_fd and before polling or waiting on it. This is more useful because it extends to multiple futexes. [2] actually it can't right now because of a flaw in futex_fd's poll function, but that could be fixed. The _principle_ is sound. If you change futex_wait to be "atomic", and then have userspace locks which _depend_ on that atomicity, it becomes impossible to wait on multiple of those locks, or make poll-driven state machines which can wait on those locks. There are applications and libraries which use futex, not just for threading but things like database locks in files. You can do userspace threading and simulate most blocking system calls by making them non-blocking and using poll). (I'm not saying anything against NPTL by this, by the way - NPTL is a very good general purpose library - but there are occasions when an application wants to do it's own equivalent of simulated blocking system calls for one reason or another. My favourite being research into inter-thread JIT-optimisation in an environment like valgrind). Right now, in principle, futex_wait is among the system calls which can be simulated by making it non-blocking (= futex_fd) and using poll()[2]. Which means programs using futex themselves can be subject to interesting thread optimisations by code which knows nothing about the program (similar to valgrind..) If you change futex_wait to be "atomic", then it would be _impossible_ to take a some random 3rd party library which is using that futex_wait, and convert it's blocking system calls to use poll-driven state machines instead. I think taking that away would be a great conceptual loss. It's not a _huge_ loss, but considering it's only Glibc which is demanding this and futexes have another property, token-passing, which Glibc could be using instead - why not use it? That said, let's look at your patch. > It would simplify requeue implementation (getting rid of the nqueued > field), The change to FUTEX_REQUEUE2 is an improvement :) nqueued is an abomination, like the rest of FUTEX_REQUEUE2 :) > @@ -265,7 +264,6 @@ static inline int get_futex_value_locked > inc_preempt_count(); > ret = __copy_from_user_inatomic(dest, from, sizeof(int)); > dec_preempt_count(); > - preempt_check_resched(); > > return ret ? -EFAULT : 0; > } inc_preempt_count() and dec_preempt_count() aren't needed, as preemption is disabled by the queue spinlocks. So get_futex_value_locked isn't needed any more: with the spinlocks held, __get_user will do. > [numerous instances of...] > + preempt_check_resched(); Not required. The spin unlocks will do this. > But with the recent changes to futex.c I think kernel can ensure > atomicity for free. I agree it would probably not slow the kernel, but I would _strongly_ prefer that Glibc were fixed to use the token-passing property, if Glibc is the driving intention behind this patch - instead of this becoming a semantic that application-level users of futex (like database and IPC libraries) come to depend on and which can't be decomposed into a multiple-waiting form. (I admit that the kernel code does look nicer with get_futex_value_locked gone, though). By the way, do you know of Scott Snyder's recent work on fixing Glibc in this way? He bumped into one of Glibc's currently broken corner cases, fixed it (according to the algorithm I gave in November), and reported that it works fine with the fix. -- Jamie - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Futex queue_me/get_user ordering
Jakub Jelinek wrote: http://www.ussg.iu.edu/hypermail/linux/kernel/0411.2/0953.html Your argument in November was that you don't want to slow down the kernel and that userland must be able to cope with the non-atomicity of futex syscall. Those were two of them. But my other main concern is conceptual. Right now, a futex_wait call is roughly equivalent to to add_wait_queue, which is quite versatile. It means anything you can do with one futex, you can extend to multiple futexes (e.g. waiting on more than one lock), and you can do asynchronously (e.g. futex_wait can be implemented in userspace as futex_fd[1] + poll[2], and therefore things like poll-driven state machines where one of the state machines wants to wait on a lock are possible). [1] Ulrich was mistaken in his paper to say futex_fd needs to check a word to be useful; userspace is supposed to check the word after futex_fd and before polling or waiting on it. This is more useful because it extends to multiple futexes. [2] actually it can't right now because of a flaw in futex_fd's poll function, but that could be fixed. The _principle_ is sound. If you change futex_wait to be atomic, and then have userspace locks which _depend_ on that atomicity, it becomes impossible to wait on multiple of those locks, or make poll-driven state machines which can wait on those locks. There are applications and libraries which use futex, not just for threading but things like database locks in files. You can do userspace threading and simulate most blocking system calls by making them non-blocking and using poll). (I'm not saying anything against NPTL by this, by the way - NPTL is a very good general purpose library - but there are occasions when an application wants to do it's own equivalent of simulated blocking system calls for one reason or another. My favourite being research into inter-thread JIT-optimisation in an environment like valgrind). Right now, in principle, futex_wait is among the system calls which can be simulated by making it non-blocking (= futex_fd) and using poll()[2]. Which means programs using futex themselves can be subject to interesting thread optimisations by code which knows nothing about the program (similar to valgrind..) If you change futex_wait to be atomic, then it would be _impossible_ to take a some random 3rd party library which is using that futex_wait, and convert it's blocking system calls to use poll-driven state machines instead. I think taking that away would be a great conceptual loss. It's not a _huge_ loss, but considering it's only Glibc which is demanding this and futexes have another property, token-passing, which Glibc could be using instead - why not use it? That said, let's look at your patch. It would simplify requeue implementation (getting rid of the nqueued field), The change to FUTEX_REQUEUE2 is an improvement :) nqueued is an abomination, like the rest of FUTEX_REQUEUE2 :) @@ -265,7 +264,6 @@ static inline int get_futex_value_locked inc_preempt_count(); ret = __copy_from_user_inatomic(dest, from, sizeof(int)); dec_preempt_count(); - preempt_check_resched(); return ret ? -EFAULT : 0; } inc_preempt_count() and dec_preempt_count() aren't needed, as preemption is disabled by the queue spinlocks. So get_futex_value_locked isn't needed any more: with the spinlocks held, __get_user will do. [numerous instances of...] + preempt_check_resched(); Not required. The spin unlocks will do this. But with the recent changes to futex.c I think kernel can ensure atomicity for free. I agree it would probably not slow the kernel, but I would _strongly_ prefer that Glibc were fixed to use the token-passing property, if Glibc is the driving intention behind this patch - instead of this becoming a semantic that application-level users of futex (like database and IPC libraries) come to depend on and which can't be decomposed into a multiple-waiting form. (I admit that the kernel code does look nicer with get_futex_value_locked gone, though). By the way, do you know of Scott Snyder's recent work on fixing Glibc in this way? He bumped into one of Glibc's currently broken corner cases, fixed it (according to the algorithm I gave in November), and reported that it works fine with the fix. -- Jamie - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] Futex mmap_sem deadlock
Olof Johansson wrote: > How's this? I went with get_val_no_fault(), since it isn't really a > get_user.*() any more (ptr being passed in), and no_paging is a little > misleading (not all faults are due to paging). How ironic: I deliberately didn't choose "no_fault" because that function *does* take page faults. It only disables paging operations! :) -- Jamie - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] Futex mmap_sem deadlock
Linus Torvalds wrote: > > I suggest putting it into futex.c, and make it an inline function > > which takes "u32 __user *". > > Agreed, except we've traditionally just made it "int __user *". The type signatures in futex.c are a bit mixed up - most places say "int __user *" but sys_futex() says "u32 __user *". get_futex_key uses sizeof(u32) to check the address. -- Jamie - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] Futex mmap_sem deadlock
Olof Johansson wrote: > On Wed, Feb 23, 2005 at 07:54:06AM -0800, Linus Torvalds wrote: > > > > Otherwise, a preempt attempt in get_user would not be seen > > > until some future preempt_enable was executed. > > > > True. I guess we should have a "preempt_check_resched()" there too. That's > > what "kunmap_atomic()" does too (which is what we rely on in the other > > case we do this..) > > Ok, this is getting complex enough to warrant get_user_inatomic(), > which means adding it to every arch's uaccess.h. > > Below patch does so. Unfortunately I don't have a Viro setup with cross > compilers for nearly every arch, so I can't make sure it doesn't break > anything. But since I pasted the same code everywhere it shouldn't. My turn to say uglee. Firstly, get_user_inatomic is the wrong name. "inatomic" in __copy_from_user_inatomic means it's called inside a non-premptable region (in atomic...). Your macro get_user_inatomic is _not_ called inside a non-preemptable region, so it shouldn't be called "inatomic". (A better name is get_user_no_paging). Secondly, does this _one_ use (it's not likely to be used elsewhere) justify copying & pasting the same code into every asm-*/uaccess, especially when the code is not in any way arch-specific? I suggest putting it into futex.c, and make it an inline function which takes "u32 __user *". -- Jamie - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] Futex mmap_sem deadlock
Olof Johansson wrote: On Wed, Feb 23, 2005 at 07:54:06AM -0800, Linus Torvalds wrote: Otherwise, a preempt attempt in get_user would not be seen until some future preempt_enable was executed. True. I guess we should have a preempt_check_resched() there too. That's what kunmap_atomic() does too (which is what we rely on in the other case we do this..) Ok, this is getting complex enough to warrant get_user_inatomic(), which means adding it to every arch's uaccess.h. Below patch does so. Unfortunately I don't have a Viro setup with cross compilers for nearly every arch, so I can't make sure it doesn't break anything. But since I pasted the same code everywhere it shouldn't. My turn to say uglee. Firstly, get_user_inatomic is the wrong name. inatomic in __copy_from_user_inatomic means it's called inside a non-premptable region (in atomic...). Your macro get_user_inatomic is _not_ called inside a non-preemptable region, so it shouldn't be called inatomic. (A better name is get_user_no_paging). Secondly, does this _one_ use (it's not likely to be used elsewhere) justify copying pasting the same code into every asm-*/uaccess, especially when the code is not in any way arch-specific? I suggest putting it into futex.c, and make it an inline function which takes u32 __user *. -- Jamie - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] Futex mmap_sem deadlock
Linus Torvalds wrote: I suggest putting it into futex.c, and make it an inline function which takes u32 __user *. Agreed, except we've traditionally just made it int __user *. The type signatures in futex.c are a bit mixed up - most places say int __user * but sys_futex() says u32 __user *. get_futex_key uses sizeof(u32) to check the address. -- Jamie - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] Futex mmap_sem deadlock
Olof Johansson wrote: How's this? I went with get_val_no_fault(), since it isn't really a get_user.*() any more (ptr being passed in), and no_paging is a little misleading (not all faults are due to paging). How ironic: I deliberately didn't choose no_fault because that function *does* take page faults. It only disables paging operations! :) -- Jamie - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] Futex mmap_sem deadlock
Linus Torvalds wrote: > > queue_me(...) etc. > > current->flags |= PF_MMAP_SEM; <- new > > ret = get_user(...); > > current->flags &= PF_MMAP_SEM; <- new > > /* the rest */ > > That is uglee. > > We really have this already, and it's called "current->preempt". It > handles any lock at all, and doesn't add yet another special case to all > the architectures. Ooh, I didn't know current->preempt did that (been away). > repeat: > down_read(>mm->mmap_sem); > get_futex_key(...) etc. > queue_me(...) etc. > inc_preempt_count(); > ret = get_user(...); > dec_preempt_count(); > if (unlikely(ret)) { > up_read(>mm->mmap_sem); > /* Re-do the access outside the lock */ > ret = get_user(...); > if (!ret) > goto repeat; > return ret; > } That would work. I like it. :) Page faults will enter the fault handler twice (i.e. slower), but that's not really a disadvantage, because a program always references the memory just before calling futex_wait anyway. A fault is rare. There is one small but important error: the "return ret" mustn't just return. It must call unqueue_me() just like the code at out_unqueue, _including_ the conditional "ret = 0", but _excluding_ the up_read(). Alternatively, since it's a rare case, just shuffle the loop around: down_read(>mm->mmap_sem); repeat: get_futex_key(...) etc. queue_me(...) etc. inc_preempt_count(); ret = get_user(...); dec_preempt_count(); if (unlikely(ret)) { up_read(>mm->mmap_sem); /* Re-do the access outside the lock */ ret = get_user(...); down_read(>mm->mmap_sem); if (!ret) goto repeat; goto out_unqueue; } -- Jamie - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] Futex mmap_sem deadlock
Olof Johansson wrote: > > That won't work because the vma lock must be help between key > > calculation and get_user() - otherwise futex is not reliable. It > > would work if the futex key calculation was inside the loop. > > Sure, but that's still true: It's just that the get_user() is done twice > instead. The semaphore is never released between the key calculation and > the "real" get_user(). Ah, I didn't look at where the loop is used and didn't think there'd be _two_ get_user() calls in the fast case. Not my instinct. > > A much simpler solution (and sorry for not offering it earlier, > > because Andrew Morton pointed out this bug long ago, but I was busy), is: > > Either way works for me. Andrew/Linus, got a preference? I'll either > post my refresh based on Andrews comments, or code up Jamie's > suggestion. Yours has a couple of problems. 1. It'll make futex waits somewhat slower. One of the nicer features of 2.6 futexes is that we got rid of the explicit page table lookup. 2. It's broken because a page can be paged out by another thread after you've forced it in and before the get_user(). We only take mmap_sem, not the page table lock. -- Jamie - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] Futex mmap_sem deadlock
Chris Friesen wrote: > > down_read(>mm->mmap_sem); > > get_futex_key(...) etc. > > queue_me(...) etc. > > current->flags |= PF_MMAP_SEM; <- new > > ret = get_user(...); > > current->flags &= PF_MMAP_SEM; <- new > > /* the rest */ > > Should the second new line be this (with the inverse)? > > current->flags &= ~PF_MMAP_SEM; Quiet! I was trying to sneak in a security hole! :) -- Jamie - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/RFC] Futex mmap_sem deadlock
Andrew Morton wrote: > > This will quickly lock up, since the futex_wait code dows a > > down_read(mmap_sem), then a get_user(). > > > > The do_page_fault code on ppc64 (as well as other architectures) needs > > to take the same semaphore for reading. This is all good until the > > second thread comes into play: Its mmap call tries to take the same > > semaphore for writing which causes in the do_page_fault down_read() > > to get stuck. Classic deadlock. > > Yup. Jamie says that the futex code _has_ to hold mmap_sem across the > get_user(). I forget (but could probably locate) the details. It does - the "key" which identifies a futex depends on a vma calculation, and the vma must not change between the calculation and the get_user(). > > One attempt to fix this is included below. It works, but I'm not entirely > > happy with the fact that it's a bit messy solution. If anyone has a > > better idea for how to solve it I'd be all ears. > > It's fairly sane. Style-wise I'd be inclined to turn this: > > down_read(>mm->mmap_sem); > while (!check_user_page_readable(current->mm, uaddr1)) { > up_read(>mm->mmap_sem); > /* Fault in the page through get_user() but discard result */ > if (get_user(curval, (int __user *)uaddr1) != 0) > return -EFAULT; > down_read(>mm->mmap_sem); > } That won't work because the vma lock must be help between key calculation and get_user() - otherwise futex is not reliable. It would work if the futex key calculation was inside the loop. A much simpler solution (and sorry for not offering it earlier, because Andrew Morton pointed out this bug long ago, but I was busy), is: In futex.c: down_read(>mm->mmap_sem); get_futex_key(...) etc. queue_me(...) etc. current->flags |= PF_MMAP_SEM; <- new ret = get_user(...); current->flags &= PF_MMAP_SEM; <- new /* the rest */ And in arch/*/mm/fault.c, replace every one of these: down_read(>mmap_sem); up_read(>mmap_sem); with these: if (!(current & PF_MMAP_SEM)) down_read(>mmap_sem); if (!(current & PF_MMAP_SEM)) up_read(>mmap_sem); -- Jamie - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/