Re: [PATCH] ARM: ftrace: Ensure code modifications are synchronised across all cpus

2012-12-10 Thread Jamie Lokier
Steven Rostedt wrote:
> > Yes, and I think if you do use two 16-bit nops, you can even get rid of all
> > the intermediate `sync' operations (I guess you might want one at the end if
> > you want the call to become visible at a particular point).
> 
> Wont work. We are replacing a 32bit call with a nop. That nop must also
> be 32bits, because we could eventually replace the nop(s) with a 32bit
> call. Basically, we can never allow the second 16bit part ever be the
> next instruction. If the first 16bit nop is executed, and then the task
> gets preempted. The nops get converted to a 32bit call. The task gets
> scheduled again and now is executing the second 16bits of the 32bit call
> and we get unexpected (probably crashing) results.
> 
> By having either a 16bit breakpoint whose handler returns after the
> second 16bit part, or a 16bit jump that simply jumps over the second
> half, then all this should work. When the CPU processes a 32bit
> instruction, it either processes all or non of it, correct?

Sounds good, except what Will wrote a few days ago:

On Fri, 2012-12-07 at 19:02 +, Will Deacon wrote:
> For ARMv7, there are small subsets of instructions for ARM and Thumb which
> are guaranteed to be atomic wrt concurrent modification and execution of
> the instruction stream between different processors:
>
> Thumb:  The 16-bit encodings of the B, NOP, BKPT, and SVC instructions.
> ARM:The B, BL, NOP, BKPT, SVC, HVC, and SMC instructions.

Thumb 32-bit ftrace call isn't in the above list.

Questions: does the above concurrent modification guarantee require
both the old instruction _and_ the new one to be among those listed,
or is it enough to be just the new one (for example when setting a
normal software breakpoint, that would be useful)?  Can it be the old
one and not the new (for example when removing a software breakpoint,
that would be useful)?  Does that subset mean replacing any of the
listed instructions by any of the others is ok, or any of the listed
with another of the same type?

(I guess as a matter of architecture design, it makes sense to
guarantee only a short list, because of occasions when the hardware,
or a software emulation through traps, or a simulation, might read the
instruction memory more than once.)

This is what makes me wonder, if it's safe to replace the 32-bit
mcount call with a 16-bit short jump:

> On Mon, Dec 10, 2012 at 11:04:05AM +, Jon Medhurst (Tixy) wrote:
> > So this means for things like kprobes which can modify arbitrary kernel
> > code we are going to need to continue to always use some form of
> > stop_the_whole_system() function?
> >
> > Also, kprobes currently uses patch_text() which only uses stop_machine
> > for Thumb2 instructions which straddle a word boundary, so this needs
> > changing?

Will Deacon replied:
> Yes; if you're modifying instructions other than those mentioned above, then
> you'll need to synchronise the CPUs, update the instructions, perform
> cache-maintenance on the writing CPU and then execute an isb on the
> executing core (this last bit isn't needed if you're going to go through an
> exception return to get back to the new code -- depends on how your
> stop/resume code works).

If I've understood that exchange, it implies that using patch_text()
to replace an instruction not in the list of special ones, with a trap
or jump, isn't ok?  And so it's ok to replace the NOP with a short
branch (since 16-bit "B" is in the list), but it's not ok to replace
16-bit "B" with the 32-bit ftrace call; and the same going the other way?

Best,
-- Jamie
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] ARM: ftrace: Ensure code modifications are synchronised across all cpus

2012-12-10 Thread Jamie Lokier
Steven Rostedt wrote:
  Yes, and I think if you do use two 16-bit nops, you can even get rid of all
  the intermediate `sync' operations (I guess you might want one at the end if
  you want the call to become visible at a particular point).
 
 Wont work. We are replacing a 32bit call with a nop. That nop must also
 be 32bits, because we could eventually replace the nop(s) with a 32bit
 call. Basically, we can never allow the second 16bit part ever be the
 next instruction. If the first 16bit nop is executed, and then the task
 gets preempted. The nops get converted to a 32bit call. The task gets
 scheduled again and now is executing the second 16bits of the 32bit call
 and we get unexpected (probably crashing) results.
 
 By having either a 16bit breakpoint whose handler returns after the
 second 16bit part, or a 16bit jump that simply jumps over the second
 half, then all this should work. When the CPU processes a 32bit
 instruction, it either processes all or non of it, correct?

Sounds good, except what Will wrote a few days ago:

On Fri, 2012-12-07 at 19:02 +, Will Deacon wrote:
 For ARMv7, there are small subsets of instructions for ARM and Thumb which
 are guaranteed to be atomic wrt concurrent modification and execution of
 the instruction stream between different processors:

 Thumb:  The 16-bit encodings of the B, NOP, BKPT, and SVC instructions.
 ARM:The B, BL, NOP, BKPT, SVC, HVC, and SMC instructions.

Thumb 32-bit ftrace call isn't in the above list.

Questions: does the above concurrent modification guarantee require
both the old instruction _and_ the new one to be among those listed,
or is it enough to be just the new one (for example when setting a
normal software breakpoint, that would be useful)?  Can it be the old
one and not the new (for example when removing a software breakpoint,
that would be useful)?  Does that subset mean replacing any of the
listed instructions by any of the others is ok, or any of the listed
with another of the same type?

(I guess as a matter of architecture design, it makes sense to
guarantee only a short list, because of occasions when the hardware,
or a software emulation through traps, or a simulation, might read the
instruction memory more than once.)

This is what makes me wonder, if it's safe to replace the 32-bit
mcount call with a 16-bit short jump:

 On Mon, Dec 10, 2012 at 11:04:05AM +, Jon Medhurst (Tixy) wrote:
  So this means for things like kprobes which can modify arbitrary kernel
  code we are going to need to continue to always use some form of
  stop_the_whole_system() function?
 
  Also, kprobes currently uses patch_text() which only uses stop_machine
  for Thumb2 instructions which straddle a word boundary, so this needs
  changing?

Will Deacon replied:
 Yes; if you're modifying instructions other than those mentioned above, then
 you'll need to synchronise the CPUs, update the instructions, perform
 cache-maintenance on the writing CPU and then execute an isb on the
 executing core (this last bit isn't needed if you're going to go through an
 exception return to get back to the new code -- depends on how your
 stop/resume code works).

If I've understood that exchange, it implies that using patch_text()
to replace an instruction not in the list of special ones, with a trap
or jump, isn't ok?  And so it's ok to replace the NOP with a short
branch (since 16-bit B is in the list), but it's not ok to replace
16-bit B with the 32-bit ftrace call; and the same going the other way?

Best,
-- Jamie
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 22/25] Generic dynamic per cpu refcounting

2012-11-29 Thread Jamie Lokier
Kent Overstreet wrote:
> On Thu, Nov 29, 2012 at 09:54:47PM +0100, Andi Kleen wrote:
> > > > The regular atomic_t is limited in ways that you are not.
> > > > See my original mail.
> > > 
> > > I don't follow, can you explain?
> > 
> > For most cases the reference count is tied to some object, which are
> > naturally limited by memory size or other physical resources.
> > 
> > But in the assymetric CPU case with your ref count no such limiter
> > exists.
> 
> It's got exactly the same limit as the old code which used the atomic_t
> - we're limited by the number of threads that can be issuing aio
> syscalls at a time.
> 
> The assymetry you're talking about _doesn't matter_, individual cpu
> counters wrapping does not affect what the counters all sum to when we
> go to tear down.
> 
> A coworker at lunch actually pointed out to me that the reason this is
> true is just that modular arithmatic is still associative with addition
> and subtraction.

It's just like jiffies.  Everyone understands jiffies arithmetic I hope.

-- Jamie
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 22/25] Generic dynamic per cpu refcounting

2012-11-29 Thread Jamie Lokier
Kent Overstreet wrote:
 On Thu, Nov 29, 2012 at 09:54:47PM +0100, Andi Kleen wrote:
The regular atomic_t is limited in ways that you are not.
See my original mail.
   
   I don't follow, can you explain?
  
  For most cases the reference count is tied to some object, which are
  naturally limited by memory size or other physical resources.
  
  But in the assymetric CPU case with your ref count no such limiter
  exists.
 
 It's got exactly the same limit as the old code which used the atomic_t
 - we're limited by the number of threads that can be issuing aio
 syscalls at a time.
 
 The assymetry you're talking about _doesn't matter_, individual cpu
 counters wrapping does not affect what the counters all sum to when we
 go to tear down.
 
 A coworker at lunch actually pointed out to me that the reason this is
 true is just that modular arithmatic is still associative with addition
 and subtraction.

It's just like jiffies.  Everyone understands jiffies arithmetic I hope.

-- Jamie
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] page-table walkers vs memory order

2012-07-30 Thread Jamie Lokier
Paul E. McKenney wrote:
> > Does some version of gcc, under the options which we insist upon,
> > make such optimizations on any of the architectures which we support?
> 
> Pretty much any production-quality compiler will do double-fetch
> and old-value-reuse optimizations, the former especially on 32-bit
> x86.  I don't know of any production-quality compilers that do value
> speculation, which would make the compiler act like DEC Alpha hardware,
> and I would hope that if this does appear, (1) we would have warning
> and (2) it could be turned off.  But there has been a lot of work on
> this topic, so we would be foolish to rule it out.

GCC documentation for IA-64:

   -msched-ar-data-spec
   -mno-sched-ar-data-spec
 (En/Dis)able data speculative scheduling after reload. This results
 in generation of ld.a instructions and the corresponding check
 instructions (ld.c / chk.a). The default is 'enable'.

I don't know if that results in value speculation of the relevant kind.

-- Jamie
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] page-table walkers vs memory order

2012-07-30 Thread Jamie Lokier
Paul E. McKenney wrote:
  Does some version of gcc, under the options which we insist upon,
  make such optimizations on any of the architectures which we support?
 
 Pretty much any production-quality compiler will do double-fetch
 and old-value-reuse optimizations, the former especially on 32-bit
 x86.  I don't know of any production-quality compilers that do value
 speculation, which would make the compiler act like DEC Alpha hardware,
 and I would hope that if this does appear, (1) we would have warning
 and (2) it could be turned off.  But there has been a lot of work on
 this topic, so we would be foolish to rule it out.

GCC documentation for IA-64:

   -msched-ar-data-spec
   -mno-sched-ar-data-spec
 (En/Dis)able data speculative scheduling after reload. This results
 in generation of ld.a instructions and the corresponding check
 instructions (ld.c / chk.a). The default is 'enable'.

I don't know if that results in value speculation of the relevant kind.

-- Jamie
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Proposal for "proper" durable fsync() and fdatasync()

2008-02-26 Thread Jamie Lokier
Jörn Engel wrote:
> On Tue, 26 February 2008 15:28:10 +0000, Jamie Lokier wrote:
> > 
> > > One interesting aspect of this comes with COW filesystems like btrfs or
> > > logfs.  Writing out data pages is not sufficient, because those will get
> > > lost unless their referencing metadata is written as well.  So either we
> > > have to call fsync for those filesystems or add another callback and let
> > > filesystems override the default implementation.
> > 
> > Doesn't the ->fsync callback get called in the sys_fdatasync() case,
> > with appropriate arguments?
> 
> My paragraph above was aimed at the sync_file_range() case.  fsync and
> fdatasync do the right thing within the limitations you brought up in
> this thread.  sync_file_range() without further changes will only write
> data pages, not the metadata required to actually access those data
> pages.  This works just fine for non-COW filesystems, which covers all
> currently merged ones.
> 
> With COW filesystems it is currently impossible to do sync_file_range()
> properly.  The problem is orthogonal to your's, I just brought it up
> since you were already mentioning sync_file_range().

You're right.  Though, doesn't normal page writeback enqueue the COW
metadata changes?  If not, how do they get written in a timely
fashion?

-- Jamie
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Proposal for "proper" durable fsync() and fdatasync()

2008-02-26 Thread Jamie Lokier
Jeff Garzik wrote:
> Nick Piggin wrote:
> >Anyway, the idea of making fsync/fdatasync etc. safe by default is
> >a good idea IMO, and is a bad bug that we don't do that :(
> 
> Agreed...  it's also disappointing that [unless I'm mistaken] you have 
> to hack each filesystem to support barriers.
> 
> It seems far easier to make sync_blkdev() Do The Right Thing, and 
> magically make all filesystems data-safe.

Well, you need ordered metadata writes, barriers _and_ flushes with
some filesystems.

Merely writing all the data pages than issuing a drive cache flush
won't Do The Right Thing with those filesystems - someone already
mentioned Btrfs, where it won't.

But I agree that your suggestion would make a superb default, for
filesystems which don't provide their own function.

It's not optimal even then.

  Devices: On a software RAID, you ideally don't want to issue flushes
  to all drives if your database did a 1 block commit entry.  (But they
  probably use O_DIRECT anyway, changing the rules again).  But all that
  can be optimised in generic VFS code eventually.  It doesn't need
  filesystem assistance in most cases.

  Apps: don't always want a full flush; sometimes a barrier would do.

-- Jamie
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Proposal for "proper" durable fsync() and fdatasync()

2008-02-26 Thread Jamie Lokier
Ric Wheeler wrote:
> >>I was surprised that fsync() doesn't do this already.  There was a lot
> >>of effort put into block I/O write barriers during 2.5, so that
> >>journalling filesystems can force correct write ordering, using disk
> >>flush cache commands.
> >>
> >>After all that effort, I was very surprised to notice that Linux 2.6.x
> >>doesn't use that capability to ensure fsync() flushes the disk cache
> >>onto stable storage.
> >
> >It's surprising you are surprised, given that this [lame] fsync behavior 
> >has remaining consistently lame throughout Linux's history.
> 
> Maybe I am confused, but isn't this is what fsync() does today whenever 
> barriers are enabled (the fsync() invalidates the drive's write cache).

No, fsync() doesn't always flush the drive's write cache.  It often
does, any I think many people are under the impression it always does,
but it doesn't.

Try this code on ext3:

fd = open ("test_file", O_RDWR | O_CREAT | O_TRUNC, 0666);
while (1) {
char byte;
usleep (10);
pwrite (fd, , 1, 0);
fsync (fd);
}

It will do just over 10 write ops per second on an idle system (13 on
mine), and 1 flush op per second.

That's because ext3 fsync() only does a journal commit when the inode
has changed.  The inode mtime is changed by write only with 1 second
granularity.  Without a journal commit, there's no barrier, which
translates to not flushing disk write cache.

If you add "fchmod (fd, 0644); fchmod (fd, 0664);" between the write
and fsync, you'll see at least 20 write ops and 20 flush ops per
second, and you'll here the disk seeking more.  That's because the
fchmod dirties the inode, so fsync() writes the inode with a journal
commit.

It turns out even _that_ is not sufficient according to the kernel
internals.  A journal commit uses an ordered request, which isn't the
same as a flush potentially, it just happens to use flush in this
instance.  I'm not sure if ordered requests are actually implemented
by any drivers at the moment.  If not now, they will be one day.

We could change ext3 fsync() to always do a journal commit, and depend
on the non-existence of block drivers which do ordered (not flush)
barrier requests.  But there's lots of things wrong with that.  Not
least, it sucks performance for database-like applications and virtual
machines, a lot due to unnecessary seeks.  That way lies wrongness.

Rightness is to make fdatasync() work well, with a genuine flush (or
equivalent (see FUA), only when required, and not a mere ordered
barrier), no inode write, and to make sync_file_range()[*] offer the
fancier applications finer controls which reflect what they actually
need.

[*] - or whatever.

-- Jamie
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Proposal for "proper" durable fsync() and fdatasync()

2008-02-26 Thread Jamie Lokier
Jörn Engel wrote:
> On Tue, 26 February 2008 20:16:11 +1100, Nick Piggin wrote:
> > Yeah, sync_file_range has slightly unusual semantics and introduce
> > the new concept, "writeout", to userspace (does "writeout" include
> > "in drive cache"? the kernel doesn't think so, but the only way to
> > make sync_file_range "safe" is if you do consider it writeout).
> 
> If sync_file_range isn't safe, it should get replaced by a noop
> implementation.  There really is no point in promising "a little"
> safety.

Sometimes there is a point in "a little" safety.

There's a spectrum of durability (meaning how safely stored the data
is).  In the cases we're imagining, it's application -> main memory
cache -> disk cache -> disk surface.  There are others.

_None_ of those provide perfect safety for your data.  They are a
spectrum, and how far along you want data to be committed before you
say "fine, the data is safe enough for me" depends on your application.

For example, there are users who like to turn _off_ fdatasync() with
their SQL database of choice.  They prefer speed over safety, and they
don't mind losing an hour's data and doing regular backups (we assume
;-) Some blogs fall into this category; who cares if a rare crash
costs you a comment or two and a restore from backup; it's acceptable
for the speed.

There's users who would really like fdatasync() to commit data to the
drive platters, so after their database says "done", they are very
confident that a power failure won't cause committed data to be lost.
Accepting credit cards is more at this end.  So should be anyone using
a virtual machine of any kind without a journalling fs in the guest!

And there's users who like it where it is right now: a compromise,
where a system crash won't lose committed data; but a power failure
might.  (I'm making assumptions about drive behaviour on reset here.)

My problem with fdatasync() at the moment is, I can't choose what I
want from it, and there's no mechanism to give me the safest option.
Most annoyingly, in-kernel filesystems _do_ have a mechanism; it just
isn't exported to userspace.

(A quick aside: fdatasync() et al. are actually used for two
_different_ things.  1: A program says "I've written it", it can say
so with confidence, e.g. announcing email receipt.  2: It's used for
write ordering with write-ahead logging: write, fdatasync, write.
When you tease at the details, efficient implementations of them are
different...  Think SCSI tagged commands versus cache flushes.)

> One interesting aspect of this comes with COW filesystems like btrfs or
> logfs.  Writing out data pages is not sufficient, because those will get
> lost unless their referencing metadata is written as well.  So either we
> have to call fsync for those filesystems or add another callback and let
> filesystems override the default implementation.

Doesn't the ->fsync callback get called in the sys_fdatasync() case,
with appropriate arguments?

With barriers/flushes it certainly makes those a bit more complicated.
You have to flush not just the disks with data pages, but the _other_
disks in a software RAID with data pointer metadata pages, but ideally
not all of them (think database journal commit).

That can be implemented with per-buffer pending-barrier/flush flags
(like I described for pages in the first mail), which are equally
useful when a database-like application uses a block device.

-- Jamie
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Proposal for "proper" durable fsync() and fdatasync()

2008-02-26 Thread Jamie Lokier
Jörn Engel wrote:
> On Tue, 26 February 2008 20:16:11 +1100, Nick Piggin wrote:
> > 
> > Yeah, sync_file_range has slightly unusual semantics and introduce
> > the new concept, "writeout", to userspace (does "writeout" include
> > "in drive cache"? the kernel doesn't think so, but the only way to
> > make sync_file_range "safe" is if you do consider it writeout).
> 
> If sync_file_range isn't safe, it should get replaced by a noop
> implementation.  There really is no point in promising "a little"
> safety.
> 
> One interesting aspect of this comes with COW filesystems like btrfs or
> logfs.  Writing out data pages is not sufficient, because those will get
> lost unless their referencing metadata is written as well.  So either we
> have to call fsync for those filesystems or add another callback and let
> filesystems override the default implementation.

fdatasync() is required to write data pages _and_ the necessary
metadata to reference those changed pages (btrfs tree etc.), but not
non-data metadata.

It's the filesystem's responsibility to interpret that correctly.
In-place writes don't need anything else.  Phase-tree style writes do.
Some kinds of logged writes don't.

I'm under the impression that sync_file_range() is a sort of
restricted-range asynchronous fdatasync().

By limiting the range of file date which must be written out, it
becomes more refined for database and filesystem-in-a-file type
applications.  Just as fsync() is more refined than sync() - it's
useful to sync less - same goes for syncing just part of a file.

It's still the filesystem's responsibility to sync data access
metadata appropriately.  It can sync more if it wants, but not less.

That's what I understand by
   sync_file_range(fd, start,length, SYNC_FILE_RANGE_WRITE_BEFORE
   | SYNC_FILE_RANGE_WRITE
   | SYNC_FILE_RANGE_WRITE_AFTER);
Largely because the manual says to use that combination of flags for
an equivalent to fdatasync().

The concept of "write-out" is not defined in the manual.  I'm assuming
it to mean this, as a reasonable guess:

SYNC_FILE_RANGE_WRITE scans all pages in the range, looking for dirty
pages which aren't already queued for write-out.  It marks those with
a "write-out" flag, and starts write I/Os at some unspecified time in
the near future; it can be assumed writes for all the pages will
complete eventually if there's no errors.  When I/O completes on a
page, it cleans the page and also clears the write-out flag.

SYNC_FILE_RANGE_WAIT_AFTER waits until all pages in the range don't
have the "write-out" flag set.

SYNC_FILE_RANGE_WAIT_BEFORE does the same wait, but before marking
pages for write-out.  I don't actually see the point in this.  Isn't a
preceding call with SYNC_FILE_RANGE_WAIT_AFTER equivalent, making
BEFORE a redundant flag?

The manual says it is something to do with data-integrity, but it's
not clear to me what that means.

All this implies that "write-out" flag is a concept userspace can rely
on.  That's not so peculiar: WRITE seems to be equivalent to AIO-style
fdatasync() on a limited range of offsets, and WAIT_AFTER seems to be
equivalent to waiting for any previously issued such ops to complete.

Any data access metadata updates that btrfs must make for fdatasync(),
it must also make for sync_file_range(), for the limited range of
offsets.

-- Jamie
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Proposal for "proper" durable fsync() and fdatasync()

2008-02-26 Thread Jamie Lokier
Jeff Garzik wrote:
> [snip huge long proposal]
> 
> Rather than invent new APIs, we should fix the existing ones to _really_ 
> flush data to physical media.

Btw, one reason for the length is the current block request API isn't
sufficient even to make fsync() durable with _no_ new APIs.

It offers ordering barriers only, which aren't enough.  I tried to
explain, discuss some changes and then suggest optimisations.

-- Jamie
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Proposal for "proper" durable fsync() and fdatasync()

2008-02-26 Thread Jamie Lokier
Andrew Morton wrote:
> On Tue, 26 Feb 2008 07:26:50 +0000 Jamie Lokier <[EMAIL PROTECTED]> wrote:
> 
> > (It would be nicer if sync_file_range()
> > took a vector of ranges for better elevator scheduling, but let's
> > ignore that :-)
> 
> Two passes:
> 
> Pass 1: shove each of the segments into the queue with
> SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE
> 
> Pass 2: wait for them all to complete and return accumulated result
> with SYNC_FILE_RANGE_WAIT_AFTER

Thanks.

Seems ok, though being able to cork the I/O until the last one would
be a bonus (like TCP_MORE...  SYNC_FILE_RANGE_MORE?)

I'm imagining I'd omit the SYNC_FILE_RANGE_WAIT_BEFORE.  Is there a
reason why you have it there?  The man page isn't very enlightening.

-- Jamie
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Proposal for proper durable fsync() and fdatasync()

2008-02-26 Thread Jamie Lokier
Andrew Morton wrote:
 On Tue, 26 Feb 2008 07:26:50 + Jamie Lokier [EMAIL PROTECTED] wrote:
 
  (It would be nicer if sync_file_range()
  took a vector of ranges for better elevator scheduling, but let's
  ignore that :-)
 
 Two passes:
 
 Pass 1: shove each of the segments into the queue with
 SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE
 
 Pass 2: wait for them all to complete and return accumulated result
 with SYNC_FILE_RANGE_WAIT_AFTER

Thanks.

Seems ok, though being able to cork the I/O until the last one would
be a bonus (like TCP_MORE...  SYNC_FILE_RANGE_MORE?)

I'm imagining I'd omit the SYNC_FILE_RANGE_WAIT_BEFORE.  Is there a
reason why you have it there?  The man page isn't very enlightening.

-- Jamie
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Proposal for proper durable fsync() and fdatasync()

2008-02-26 Thread Jamie Lokier
Jeff Garzik wrote:
 [snip huge long proposal]
 
 Rather than invent new APIs, we should fix the existing ones to _really_ 
 flush data to physical media.

Btw, one reason for the length is the current block request API isn't
sufficient even to make fsync() durable with _no_ new APIs.

It offers ordering barriers only, which aren't enough.  I tried to
explain, discuss some changes and then suggest optimisations.

-- Jamie
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Proposal for proper durable fsync() and fdatasync()

2008-02-26 Thread Jamie Lokier
Jörn Engel wrote:
 On Tue, 26 February 2008 20:16:11 +1100, Nick Piggin wrote:
  
  Yeah, sync_file_range has slightly unusual semantics and introduce
  the new concept, writeout, to userspace (does writeout include
  in drive cache? the kernel doesn't think so, but the only way to
  make sync_file_range safe is if you do consider it writeout).
 
 If sync_file_range isn't safe, it should get replaced by a noop
 implementation.  There really is no point in promising a little
 safety.
 
 One interesting aspect of this comes with COW filesystems like btrfs or
 logfs.  Writing out data pages is not sufficient, because those will get
 lost unless their referencing metadata is written as well.  So either we
 have to call fsync for those filesystems or add another callback and let
 filesystems override the default implementation.

fdatasync() is required to write data pages _and_ the necessary
metadata to reference those changed pages (btrfs tree etc.), but not
non-data metadata.

It's the filesystem's responsibility to interpret that correctly.
In-place writes don't need anything else.  Phase-tree style writes do.
Some kinds of logged writes don't.

I'm under the impression that sync_file_range() is a sort of
restricted-range asynchronous fdatasync().

By limiting the range of file date which must be written out, it
becomes more refined for database and filesystem-in-a-file type
applications.  Just as fsync() is more refined than sync() - it's
useful to sync less - same goes for syncing just part of a file.

It's still the filesystem's responsibility to sync data access
metadata appropriately.  It can sync more if it wants, but not less.

That's what I understand by
   sync_file_range(fd, start,length, SYNC_FILE_RANGE_WRITE_BEFORE
   | SYNC_FILE_RANGE_WRITE
   | SYNC_FILE_RANGE_WRITE_AFTER);
Largely because the manual says to use that combination of flags for
an equivalent to fdatasync().

The concept of write-out is not defined in the manual.  I'm assuming
it to mean this, as a reasonable guess:

SYNC_FILE_RANGE_WRITE scans all pages in the range, looking for dirty
pages which aren't already queued for write-out.  It marks those with
a write-out flag, and starts write I/Os at some unspecified time in
the near future; it can be assumed writes for all the pages will
complete eventually if there's no errors.  When I/O completes on a
page, it cleans the page and also clears the write-out flag.

SYNC_FILE_RANGE_WAIT_AFTER waits until all pages in the range don't
have the write-out flag set.

SYNC_FILE_RANGE_WAIT_BEFORE does the same wait, but before marking
pages for write-out.  I don't actually see the point in this.  Isn't a
preceding call with SYNC_FILE_RANGE_WAIT_AFTER equivalent, making
BEFORE a redundant flag?

The manual says it is something to do with data-integrity, but it's
not clear to me what that means.

All this implies that write-out flag is a concept userspace can rely
on.  That's not so peculiar: WRITE seems to be equivalent to AIO-style
fdatasync() on a limited range of offsets, and WAIT_AFTER seems to be
equivalent to waiting for any previously issued such ops to complete.

Any data access metadata updates that btrfs must make for fdatasync(),
it must also make for sync_file_range(), for the limited range of
offsets.

-- Jamie
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Proposal for proper durable fsync() and fdatasync()

2008-02-26 Thread Jamie Lokier
Jörn Engel wrote:
 On Tue, 26 February 2008 20:16:11 +1100, Nick Piggin wrote:
  Yeah, sync_file_range has slightly unusual semantics and introduce
  the new concept, writeout, to userspace (does writeout include
  in drive cache? the kernel doesn't think so, but the only way to
  make sync_file_range safe is if you do consider it writeout).
 
 If sync_file_range isn't safe, it should get replaced by a noop
 implementation.  There really is no point in promising a little
 safety.

Sometimes there is a point in a little safety.

There's a spectrum of durability (meaning how safely stored the data
is).  In the cases we're imagining, it's application - main memory
cache - disk cache - disk surface.  There are others.

_None_ of those provide perfect safety for your data.  They are a
spectrum, and how far along you want data to be committed before you
say fine, the data is safe enough for me depends on your application.

For example, there are users who like to turn _off_ fdatasync() with
their SQL database of choice.  They prefer speed over safety, and they
don't mind losing an hour's data and doing regular backups (we assume
;-) Some blogs fall into this category; who cares if a rare crash
costs you a comment or two and a restore from backup; it's acceptable
for the speed.

There's users who would really like fdatasync() to commit data to the
drive platters, so after their database says done, they are very
confident that a power failure won't cause committed data to be lost.
Accepting credit cards is more at this end.  So should be anyone using
a virtual machine of any kind without a journalling fs in the guest!

And there's users who like it where it is right now: a compromise,
where a system crash won't lose committed data; but a power failure
might.  (I'm making assumptions about drive behaviour on reset here.)

My problem with fdatasync() at the moment is, I can't choose what I
want from it, and there's no mechanism to give me the safest option.
Most annoyingly, in-kernel filesystems _do_ have a mechanism; it just
isn't exported to userspace.

(A quick aside: fdatasync() et al. are actually used for two
_different_ things.  1: A program says I've written it, it can say
so with confidence, e.g. announcing email receipt.  2: It's used for
write ordering with write-ahead logging: write, fdatasync, write.
When you tease at the details, efficient implementations of them are
different...  Think SCSI tagged commands versus cache flushes.)

 One interesting aspect of this comes with COW filesystems like btrfs or
 logfs.  Writing out data pages is not sufficient, because those will get
 lost unless their referencing metadata is written as well.  So either we
 have to call fsync for those filesystems or add another callback and let
 filesystems override the default implementation.

Doesn't the -fsync callback get called in the sys_fdatasync() case,
with appropriate arguments?

With barriers/flushes it certainly makes those a bit more complicated.
You have to flush not just the disks with data pages, but the _other_
disks in a software RAID with data pointer metadata pages, but ideally
not all of them (think database journal commit).

That can be implemented with per-buffer pending-barrier/flush flags
(like I described for pages in the first mail), which are equally
useful when a database-like application uses a block device.

-- Jamie
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Proposal for proper durable fsync() and fdatasync()

2008-02-26 Thread Jamie Lokier
Ric Wheeler wrote:
 I was surprised that fsync() doesn't do this already.  There was a lot
 of effort put into block I/O write barriers during 2.5, so that
 journalling filesystems can force correct write ordering, using disk
 flush cache commands.
 
 After all that effort, I was very surprised to notice that Linux 2.6.x
 doesn't use that capability to ensure fsync() flushes the disk cache
 onto stable storage.
 
 It's surprising you are surprised, given that this [lame] fsync behavior 
 has remaining consistently lame throughout Linux's history.
 
 Maybe I am confused, but isn't this is what fsync() does today whenever 
 barriers are enabled (the fsync() invalidates the drive's write cache).

No, fsync() doesn't always flush the drive's write cache.  It often
does, any I think many people are under the impression it always does,
but it doesn't.

Try this code on ext3:

fd = open (test_file, O_RDWR | O_CREAT | O_TRUNC, 0666);
while (1) {
char byte;
usleep (10);
pwrite (fd, byte, 1, 0);
fsync (fd);
}

It will do just over 10 write ops per second on an idle system (13 on
mine), and 1 flush op per second.

That's because ext3 fsync() only does a journal commit when the inode
has changed.  The inode mtime is changed by write only with 1 second
granularity.  Without a journal commit, there's no barrier, which
translates to not flushing disk write cache.

If you add fchmod (fd, 0644); fchmod (fd, 0664); between the write
and fsync, you'll see at least 20 write ops and 20 flush ops per
second, and you'll here the disk seeking more.  That's because the
fchmod dirties the inode, so fsync() writes the inode with a journal
commit.

It turns out even _that_ is not sufficient according to the kernel
internals.  A journal commit uses an ordered request, which isn't the
same as a flush potentially, it just happens to use flush in this
instance.  I'm not sure if ordered requests are actually implemented
by any drivers at the moment.  If not now, they will be one day.

We could change ext3 fsync() to always do a journal commit, and depend
on the non-existence of block drivers which do ordered (not flush)
barrier requests.  But there's lots of things wrong with that.  Not
least, it sucks performance for database-like applications and virtual
machines, a lot due to unnecessary seeks.  That way lies wrongness.

Rightness is to make fdatasync() work well, with a genuine flush (or
equivalent (see FUA), only when required, and not a mere ordered
barrier), no inode write, and to make sync_file_range()[*] offer the
fancier applications finer controls which reflect what they actually
need.

[*] - or whatever.

-- Jamie
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Proposal for proper durable fsync() and fdatasync()

2008-02-26 Thread Jamie Lokier
Jeff Garzik wrote:
 Nick Piggin wrote:
 Anyway, the idea of making fsync/fdatasync etc. safe by default is
 a good idea IMO, and is a bad bug that we don't do that :(
 
 Agreed...  it's also disappointing that [unless I'm mistaken] you have 
 to hack each filesystem to support barriers.
 
 It seems far easier to make sync_blkdev() Do The Right Thing, and 
 magically make all filesystems data-safe.

Well, you need ordered metadata writes, barriers _and_ flushes with
some filesystems.

Merely writing all the data pages than issuing a drive cache flush
won't Do The Right Thing with those filesystems - someone already
mentioned Btrfs, where it won't.

But I agree that your suggestion would make a superb default, for
filesystems which don't provide their own function.

It's not optimal even then.

  Devices: On a software RAID, you ideally don't want to issue flushes
  to all drives if your database did a 1 block commit entry.  (But they
  probably use O_DIRECT anyway, changing the rules again).  But all that
  can be optimised in generic VFS code eventually.  It doesn't need
  filesystem assistance in most cases.

  Apps: don't always want a full flush; sometimes a barrier would do.

-- Jamie
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Proposal for proper durable fsync() and fdatasync()

2008-02-26 Thread Jamie Lokier
Jörn Engel wrote:
 On Tue, 26 February 2008 15:28:10 +, Jamie Lokier wrote:
  
   One interesting aspect of this comes with COW filesystems like btrfs or
   logfs.  Writing out data pages is not sufficient, because those will get
   lost unless their referencing metadata is written as well.  So either we
   have to call fsync for those filesystems or add another callback and let
   filesystems override the default implementation.
  
  Doesn't the -fsync callback get called in the sys_fdatasync() case,
  with appropriate arguments?
 
 My paragraph above was aimed at the sync_file_range() case.  fsync and
 fdatasync do the right thing within the limitations you brought up in
 this thread.  sync_file_range() without further changes will only write
 data pages, not the metadata required to actually access those data
 pages.  This works just fine for non-COW filesystems, which covers all
 currently merged ones.
 
 With COW filesystems it is currently impossible to do sync_file_range()
 properly.  The problem is orthogonal to your's, I just brought it up
 since you were already mentioning sync_file_range().

You're right.  Though, doesn't normal page writeback enqueue the COW
metadata changes?  If not, how do they get written in a timely
fashion?

-- Jamie
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Proposal for "proper" durable fsync() and fdatasync()

2008-02-25 Thread Jamie Lokier
Jeff Garzik wrote:
> Jamie Lokier wrote:
> >By durable, I mean that fsync() should actually commit writes to
> >physical stable storage,
> 
> Yes, it should.

Glad we agree :-)

> >I was surprised that fsync() doesn't do this already.  There was a lot
> >of effort put into block I/O write barriers during 2.5, so that
> >journalling filesystems can force correct write ordering, using disk
> >flush cache commands.
> >
> >After all that effort, I was very surprised to notice that Linux 2.6.x
> >doesn't use that capability to ensure fsync() flushes the disk cache
> >onto stable storage.
> 
> It's surprising you are surprised, given that this [lame] fsync behavior 
> has remaining consistently lame throughout Linux's history.

I was surprised because of the effort put into IDE write barriers to
get it right for in-kernel filesystems, and the messages in 2004
telling concerned users that fsync would use barriers in 2.6, which it
does sometimes but not always.

> [snip huge long proposal]
> 
> Rather than invent new APIs, we should fix the existing ones to _really_ 
> flush data to physical media.
>
> Linux should default to SAFE data storage, and permit users to retain 
> the older unsafe behavior via an option.  It's completely ridiculous 
> that we default to an unsafe fsync.

Well, I agree with you.  Which is why the "new API" I suggested, being
really just an extension of an existing one, allows fsync() to be SAFE
if that's what people want.

To be fair, fsync() is rather overkill for some apps.
sync_file_range() is obviously the right place for fine tuning "less
safe" variations.

> And [anticipating a common response from others] it is completely 
> irrelevant that POSIX fsync(2) permits Linux's current behavior.  The 
> current behavior is unsafe.
> 
> Safety before performance -- ESPECIALLY when it comes to storing user data.

Especially now that people work a lot in guest VMs, where the IDE
barrier stuff doesn't work if the host fdatasync() doesn't work.

Since it happened with Mac OS X, I wouldn't be surprised if changing
fsync() and just that wasn't popular.  Heck, you already get people
asking "how to turn off fsync in PostGreSQL"...  (Haven't those people
heard of transactions...?)

But with changes to sync_file_range() [or whatever... I don't care] to
support database's finely tuned commit needs, and then adoption of
that by database vendors, perhaps nobody will mind fsync() becoming
safe then.

Nobody seems bothered by it's performance for other things.

-- Jamie
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Proposal for "proper" durable fsync() and fdatasync()

2008-02-25 Thread Jamie Lokier
Dear kernel,

This is a proposal to add "proper" durable fsync() and fdatasync() to Linux.

First the problem, then a proposed solution "with benefits", so to speak.

I need feedback on the details, before implementing anything.  Or
(hopefully) someone else thinks it's very important and does it
themselves :-)

By durable, I mean that fsync() should actually commit writes to
physical stable storage, not just the disk write cache when that is
enabled.  Databases and guest VMs needs this, or an equivalent
feature, if they aren't to face occasional corruption after power
failure and perhaps some crashes.

The alternative is to disable the disk write cache.  But that isn't
modern practice or recommendation, since I/O write barriers were
implemented and they are much faster.

I was surprised that fsync() doesn't do this already.  There was a lot
of effort put into block I/O write barriers during 2.5, so that
journalling filesystems can force correct write ordering, using disk
flush cache commands.

After all that effort, I was very surprised to notice that Linux 2.6.x
doesn't use that capability to ensure fsync() flushes the disk cache
onto stable storage.

I noticed this following up discussions on the Qemu mailing list,
about guest VMs and how their IDE flush cache command should translate
to fsync() to avoid data loss.  (For guest VMs, fsync() isn't
necessary if the host machine is fine, and it isn't enough (on Linux
host) if the host machine loses power or the hard disk crashes another
way.)

Then I noticed it again, when I was designing a database engine with
filesystem characteristics.  I thought "how do I ensure ordered
journal writes; can I use fdatasync()?" and was surprised to find the
answer is no, I have to use hacks like calling hdparm, and the authors
of major SQL databases seem to brush the problem under a carpet.

(Interestingly, in the Linux 2.4 patches for write barriers, fsync()
seems to be fine, if a bit slow.)

It isn't the first time this topic has come up:


http://groups.google.com.br/group/linux.kernel/browse_thread/thread/d343e51655b4ac7c/7ee9bca80977c2d1?#7ee9bca80977c2d1
("True fsync() in Linux (on IDE)")

In that thread, it was implied that would be fixed in 2.6.  So I bet
some people are under the illusion that it's fixed in 2.6...


For a while, I've been meaning to bring it up on linux-kernel...


The fsync problem
-

Chris Wedgwood wrote:
> On Mon, Feb 25, 2008 at 08:50:40PM +, Jamie Lokier wrote:
> 
> > On Linux (and other host OSes), fdatsync() and fsync() don't always
> > commit data to hard storage; it sometimes only commits it to the hard
> > drive cache.
> 
> That's a filesystem bug IMO.  People should be able to use f[data]sync
> with some level onf confidence or else it's basically pointless.

I agree, I consider it a serious bug, and I would be pleased if
someone paid it some love and attention.

Right now, if you want a reliable database on Linux, you _cannot_
properly depend on fsync() or fdatasync().  Considering how much Linux
is used for critical databases, using these functions, this amazes me.

Also, if you have a guest VM, then the guest's filesystem journalling
is not reliable.  Not only can it lose data on power loss, it can
corrupt the guest filesystem too, due to reordering.  This is contrary
to what people expect, I think.

I'm not sure if a system reset can cause similar loss; I don't know
how disks react to that.

Also, for the person porting ZFS to run on FUSE, same applies...

Linux fsync is faulty in two ways:

   1. Database commits aren't _durable_ against power failure, because
  fsync doesn't flush the disk's cache.  This means data stored
  is not guaranteed to be stored at the expected durability.

   2. It's unsafe for write-ahead logging, because it doesn't really
  guarantee any _ordering_ for the writes at the hard storage
  level.  So aside from losing committed data, it can also corrupt
  structural metadata.

With ext3 it's quite easy to verify that fsync/fdatasync don't always
write a journal entry.  (Apart from looking at the kernel code :-)

Just write some data, fsync(), and observe the number of writes in
/proc/diskstats.  If the current mtime second _hasn't_ changed, the
inode isn't written.  If you write data, say, 10 times a second to the
same place followed by fsync(), you'll see a little more than 10 write
I/Os, and less than 20.

By the way, this shows a trick for fixing #2 (ordering): use fchmod()
to toggle the file attributes, and that will force the next fsync() to
write a journal entry, which _does_ issue a write barrier.  If you do
that with each write as above (write, fchmod change, fsync 10 times a
second), you will clearly see more write I/Os, and you'll hear the
disk behaving differently: it's seeking more.

However, even this ugly trick has problems:

  3. Using the fchmod() trick or good fortun

Proposal for proper durable fsync() and fdatasync()

2008-02-25 Thread Jamie Lokier
Dear kernel,

This is a proposal to add proper durable fsync() and fdatasync() to Linux.

First the problem, then a proposed solution with benefits, so to speak.

I need feedback on the details, before implementing anything.  Or
(hopefully) someone else thinks it's very important and does it
themselves :-)

By durable, I mean that fsync() should actually commit writes to
physical stable storage, not just the disk write cache when that is
enabled.  Databases and guest VMs needs this, or an equivalent
feature, if they aren't to face occasional corruption after power
failure and perhaps some crashes.

The alternative is to disable the disk write cache.  But that isn't
modern practice or recommendation, since I/O write barriers were
implemented and they are much faster.

I was surprised that fsync() doesn't do this already.  There was a lot
of effort put into block I/O write barriers during 2.5, so that
journalling filesystems can force correct write ordering, using disk
flush cache commands.

After all that effort, I was very surprised to notice that Linux 2.6.x
doesn't use that capability to ensure fsync() flushes the disk cache
onto stable storage.

I noticed this following up discussions on the Qemu mailing list,
about guest VMs and how their IDE flush cache command should translate
to fsync() to avoid data loss.  (For guest VMs, fsync() isn't
necessary if the host machine is fine, and it isn't enough (on Linux
host) if the host machine loses power or the hard disk crashes another
way.)

Then I noticed it again, when I was designing a database engine with
filesystem characteristics.  I thought how do I ensure ordered
journal writes; can I use fdatasync()? and was surprised to find the
answer is no, I have to use hacks like calling hdparm, and the authors
of major SQL databases seem to brush the problem under a carpet.

(Interestingly, in the Linux 2.4 patches for write barriers, fsync()
seems to be fine, if a bit slow.)

It isn't the first time this topic has come up:


http://groups.google.com.br/group/linux.kernel/browse_thread/thread/d343e51655b4ac7c/7ee9bca80977c2d1?#7ee9bca80977c2d1
(True fsync() in Linux (on IDE))

In that thread, it was implied that would be fixed in 2.6.  So I bet
some people are under the illusion that it's fixed in 2.6...


For a while, I've been meaning to bring it up on linux-kernel...


The fsync problem
-

Chris Wedgwood wrote:
 On Mon, Feb 25, 2008 at 08:50:40PM +, Jamie Lokier wrote:
 
  On Linux (and other host OSes), fdatsync() and fsync() don't always
  commit data to hard storage; it sometimes only commits it to the hard
  drive cache.
 
 That's a filesystem bug IMO.  People should be able to use f[data]sync
 with some level onf confidence or else it's basically pointless.

I agree, I consider it a serious bug, and I would be pleased if
someone paid it some love and attention.

Right now, if you want a reliable database on Linux, you _cannot_
properly depend on fsync() or fdatasync().  Considering how much Linux
is used for critical databases, using these functions, this amazes me.

Also, if you have a guest VM, then the guest's filesystem journalling
is not reliable.  Not only can it lose data on power loss, it can
corrupt the guest filesystem too, due to reordering.  This is contrary
to what people expect, I think.

I'm not sure if a system reset can cause similar loss; I don't know
how disks react to that.

Also, for the person porting ZFS to run on FUSE, same applies...

Linux fsync is faulty in two ways:

   1. Database commits aren't _durable_ against power failure, because
  fsync doesn't flush the disk's cache.  This means data stored
  is not guaranteed to be stored at the expected durability.

   2. It's unsafe for write-ahead logging, because it doesn't really
  guarantee any _ordering_ for the writes at the hard storage
  level.  So aside from losing committed data, it can also corrupt
  structural metadata.

With ext3 it's quite easy to verify that fsync/fdatasync don't always
write a journal entry.  (Apart from looking at the kernel code :-)

Just write some data, fsync(), and observe the number of writes in
/proc/diskstats.  If the current mtime second _hasn't_ changed, the
inode isn't written.  If you write data, say, 10 times a second to the
same place followed by fsync(), you'll see a little more than 10 write
I/Os, and less than 20.

By the way, this shows a trick for fixing #2 (ordering): use fchmod()
to toggle the file attributes, and that will force the next fsync() to
write a journal entry, which _does_ issue a write barrier.  If you do
that with each write as above (write, fchmod change, fsync 10 times a
second), you will clearly see more write I/Os, and you'll hear the
disk behaving differently: it's seeking more.

However, even this ugly trick has problems:

  3. Using the fchmod() trick or good fortune, fsync() issues a write
 barrier.  Right now, this does commit data (if the device can

Re: Proposal for proper durable fsync() and fdatasync()

2008-02-25 Thread Jamie Lokier
Jeff Garzik wrote:
 Jamie Lokier wrote:
 By durable, I mean that fsync() should actually commit writes to
 physical stable storage,
 
 Yes, it should.

Glad we agree :-)

 I was surprised that fsync() doesn't do this already.  There was a lot
 of effort put into block I/O write barriers during 2.5, so that
 journalling filesystems can force correct write ordering, using disk
 flush cache commands.
 
 After all that effort, I was very surprised to notice that Linux 2.6.x
 doesn't use that capability to ensure fsync() flushes the disk cache
 onto stable storage.
 
 It's surprising you are surprised, given that this [lame] fsync behavior 
 has remaining consistently lame throughout Linux's history.

I was surprised because of the effort put into IDE write barriers to
get it right for in-kernel filesystems, and the messages in 2004
telling concerned users that fsync would use barriers in 2.6, which it
does sometimes but not always.

 [snip huge long proposal]
 
 Rather than invent new APIs, we should fix the existing ones to _really_ 
 flush data to physical media.

 Linux should default to SAFE data storage, and permit users to retain 
 the older unsafe behavior via an option.  It's completely ridiculous 
 that we default to an unsafe fsync.

Well, I agree with you.  Which is why the new API I suggested, being
really just an extension of an existing one, allows fsync() to be SAFE
if that's what people want.

To be fair, fsync() is rather overkill for some apps.
sync_file_range() is obviously the right place for fine tuning less
safe variations.

 And [anticipating a common response from others] it is completely 
 irrelevant that POSIX fsync(2) permits Linux's current behavior.  The 
 current behavior is unsafe.
 
 Safety before performance -- ESPECIALLY when it comes to storing user data.

Especially now that people work a lot in guest VMs, where the IDE
barrier stuff doesn't work if the host fdatasync() doesn't work.

Since it happened with Mac OS X, I wouldn't be surprised if changing
fsync() and just that wasn't popular.  Heck, you already get people
asking how to turn off fsync in PostGreSQL...  (Haven't those people
heard of transactions...?)

But with changes to sync_file_range() [or whatever... I don't care] to
support database's finely tuned commit needs, and then adoption of
that by database vendors, perhaps nobody will mind fsync() becoming
safe then.

Nobody seems bothered by it's performance for other things.

-- Jamie
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: jffs2: -ENOSPC when truncating file?!

2008-02-24 Thread Jamie Lokier
Pavel Machek wrote:
> > You need to write a log entry indicating the new length of the file.
> > There is no space for new log entries.
> > 
> > There is a special case for removal -- 'rm gps.nmea' would work. Perhaps
> > we should add a special case for truncation too, so that it can also use
> > the extra pool of free space.
> 
> Yes, that would be nice. I somehow assumed that truncate can't fail
> for -ENOSPC ... I was trying to actually free some space on the
> filesystem...

Same here!  When I got ENOSPC from truncate, trying to free some
space, I was so surprised (and a bit disappointed) that I assumed
removal could fail too.  So now I'm pleasantly surprised to learn I
can at least remove a file.

It does seem odd that truncate to zero length can fail.  It is
guaranteed to free up one or more data nodes, so there should be
enough space for the size change node, provided GC is invoked when
necessary and the node sizes are compatible for this in corner cases.

-- Jamie
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: jffs2: -ENOSPC when truncating file?!

2008-02-24 Thread Jamie Lokier
Pavel Machek wrote:
  You need to write a log entry indicating the new length of the file.
  There is no space for new log entries.
  
  There is a special case for removal -- 'rm gps.nmea' would work. Perhaps
  we should add a special case for truncation too, so that it can also use
  the extra pool of free space.
 
 Yes, that would be nice. I somehow assumed that truncate can't fail
 for -ENOSPC ... I was trying to actually free some space on the
 filesystem...

Same here!  When I got ENOSPC from truncate, trying to free some
space, I was so surprised (and a bit disappointed) that I assumed
removal could fail too.  So now I'm pleasantly surprised to learn I
can at least remove a file.

It does seem odd that truncate to zero length can fail.  It is
guaranteed to free up one or more data nodes, so there should be
enough space for the size change node, provided GC is invoked when
necessary and the node sizes are compatible for this in corner cases.

-- Jamie
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] LogFS take three

2007-05-19 Thread Jamie Lokier
David Weinehall wrote:
> > It is also the filesystem that tries to scale logarithmically, as Arnd
> > has noted.  Maybe I should call it Log2 to emphesize this point.  Log1
> > would be horrible scalability.
> 
> So, log2fs...  Sounds great to me.

Why Log2?  Logarithmic scaling is just logarithmic scaling.  Does the
filesystem use 2-ary trees or anything else which gives particular
meaning to 2?

-- Jamie
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] LogFS take three

2007-05-19 Thread Jamie Lokier
David Weinehall wrote:
  It is also the filesystem that tries to scale logarithmically, as Arnd
  has noted.  Maybe I should call it Log2 to emphesize this point.  Log1
  would be horrible scalability.
 
 So, log2fs...  Sounds great to me.

Why Log2?  Logarithmic scaling is just logarithmic scaling.  Does the
filesystem use 2-ary trees or anything else which gives particular
meaning to 2?

-- Jamie
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] LogFS take three

2007-05-17 Thread Jamie Lokier
Jörn Engel wrote:
> > Almost all your static functions start with logfs_, why not this one?
> 
> Because after a while I discovered how silly it is to start every
> function with logfs_.  That prefix doesn't add much unless the function
> has global scope.  What I didn't do was remove the prefix from older
> functions.

It's handy when debugging or showing detailed backtraces.  Not that
I'm advocating it (or not), just something I've noticed in other
programs.

-- Jamie
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] LogFS take three

2007-05-17 Thread Jamie Lokier
Jörn Engel wrote:
  Almost all your static functions start with logfs_, why not this one?
 
 Because after a while I discovered how silly it is to start every
 function with logfs_.  That prefix doesn't add much unless the function
 has global scope.  What I didn't do was remove the prefix from older
 functions.

It's handy when debugging or showing detailed backtraces.  Not that
I'm advocating it (or not), just something I've noticed in other
programs.

-- Jamie
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] LogFS take three

2007-05-16 Thread Jamie Lokier
Artem Bityutskiy wrote:
> On Wed, 2007-05-16 at 12:34 +0100, Jamie Lokier wrote:
> > Jörn Engel wrote:
> > > On Wed, 16 May 2007 12:54:14 +0800, David Woodhouse wrote:
> > > > Personally I'd just go for 'JFFS3'. After all, it has a better claim to
> > > > the name than either of its predecessors :)
> > > 
> > > Did you ever see akpm's facial expression when he tried to pronounce
> > > "JFFS2"?  ;)
> > 
> > JFFS3 is a good, meaningful name to anyone familiar with JFFS2.
> > 
> > But if akpm can't pronounce it, how about FFFS for faster flash
> > filesystem ;-)
> 
> The problem is that JFFS2 will always be faster in terms of I/O speed
> anyway, just because it does not have to maintain on-flash indexing
> data structures. But yes, it is slow in mount and in building big
> inodes, so the "faster" is confusing.

Is LogFS really slower than JFFS2 in practice?

I would have guessed reads to be a similar speed, tree updates to be a
similar speed  to journal  updates for sustained  non-fsyncing writes,
and the difference unimportant for tiny individual commits whose index
updates are not merged with any other.  I've not thought about it much
though.

-- Jamie
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] LogFS take three

2007-05-16 Thread Jamie Lokier
Albert Cahalan wrote:
> Please don't forget the immutable bit. ("man lsattr")
> Having both, BSD-style, would be even better.
> The immutable bit is important for working around
> software bugs and "features" that damage files.
> 
> I also can't find xattr support.

Imho,

Given that the filesystem is still 'experimental', I'd concentrate on
getting it stable before worrying about immutable and xattrs unless
they are easy.  (Immutable is easy).

In 13 years of using Linux in all sorts of ways I've never yet to use
either feature.  They would be good to have, but stability in a
filesystem is much more useful.

I'm biased of course: if LogFS were stable and well tested, I'd be
using it right now in my embedded thingy - and that doesn't even
bother with uids :-)

-- Jamie
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] LogFS take three

2007-05-16 Thread Jamie Lokier
Jörn Engel wrote:
> On Wed, 16 May 2007 12:54:14 +0800, David Woodhouse wrote:
> > Personally I'd just go for 'JFFS3'. After all, it has a better claim to
> > the name than either of its predecessors :)
> 
> Did you ever see akpm's facial expression when he tried to pronounce
> "JFFS2"?  ;)

JFFS3 is a good, meaningful name to anyone familiar with JFFS2.

But if akpm can't pronounce it, how about FFFS for faster flash
filesystem ;-)

-- Jamie
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] LogFS take three

2007-05-16 Thread Jamie Lokier
Jörn Engel wrote:
 On Wed, 16 May 2007 12:54:14 +0800, David Woodhouse wrote:
  Personally I'd just go for 'JFFS3'. After all, it has a better claim to
  the name than either of its predecessors :)
 
 Did you ever see akpm's facial expression when he tried to pronounce
 JFFS2?  ;)

JFFS3 is a good, meaningful name to anyone familiar with JFFS2.

But if akpm can't pronounce it, how about FFFS for faster flash
filesystem ;-)

-- Jamie
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] LogFS take three

2007-05-16 Thread Jamie Lokier
Albert Cahalan wrote:
 Please don't forget the immutable bit. (man lsattr)
 Having both, BSD-style, would be even better.
 The immutable bit is important for working around
 software bugs and features that damage files.
 
 I also can't find xattr support.

Imho,

Given that the filesystem is still 'experimental', I'd concentrate on
getting it stable before worrying about immutable and xattrs unless
they are easy.  (Immutable is easy).

In 13 years of using Linux in all sorts of ways I've never yet to use
either feature.  They would be good to have, but stability in a
filesystem is much more useful.

I'm biased of course: if LogFS were stable and well tested, I'd be
using it right now in my embedded thingy - and that doesn't even
bother with uids :-)

-- Jamie
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] LogFS take three

2007-05-16 Thread Jamie Lokier
Artem Bityutskiy wrote:
 On Wed, 2007-05-16 at 12:34 +0100, Jamie Lokier wrote:
  Jörn Engel wrote:
   On Wed, 16 May 2007 12:54:14 +0800, David Woodhouse wrote:
Personally I'd just go for 'JFFS3'. After all, it has a better claim to
the name than either of its predecessors :)
   
   Did you ever see akpm's facial expression when he tried to pronounce
   JFFS2?  ;)
  
  JFFS3 is a good, meaningful name to anyone familiar with JFFS2.
  
  But if akpm can't pronounce it, how about FFFS for faster flash
  filesystem ;-)
 
 The problem is that JFFS2 will always be faster in terms of I/O speed
 anyway, just because it does not have to maintain on-flash indexing
 data structures. But yes, it is slow in mount and in building big
 inodes, so the faster is confusing.

Is LogFS really slower than JFFS2 in practice?

I would have guessed reads to be a similar speed, tree updates to be a
similar speed  to journal  updates for sustained  non-fsyncing writes,
and the difference unimportant for tiny individual commits whose index
updates are not merged with any other.  I've not thought about it much
though.

-- Jamie
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: dnotify/inotify and vfs questions

2005-08-25 Thread Jamie Lokier
Ian Campbell wrote:
> On Tue, 2005-08-23 at 16:23 +0100, Jamie Lokier wrote:
> > ...
> > if (any_dnotify_or_inotify_events_pending) {
> > read_dnotify_or_inotify_events();
> > if (any_events_related_to(file)) {
> > store_in_userspace_stat_cache(file, stat(file));
> > }
> > }
> > stat_info = lookup_userspace_stat_cache(file);
> > 
> > Now that's a silly way to save one system call in the fast path by itself.
> 
> I'm not that familiar with inotify internals but doesn't
> read_dnotify_or_inotify_events() or
> any_dnotify_or_inotify_events_pending() involve a syscall?

The fast path is just any_dnotify_or_inotify_events_pending: there
aren't any relevant events pending in the fast path.

There's a few methods of doing this for free per individual stat cache check.

1. Signal handler.

dnotify: Check a variable set by the signal handler.

 [ There's a small time window between dnotify sending the
 signal and the receiving thread noticing on an SMP system due
 to the IPI, during which the sending task might have another
 way to signal the recieving task that it's finished some
 operation, so this method of using dnotify to invalidate a
 stat cache only has the correct ordering properties on UP systems. ]

 This works because dnotify signals are thread-specific,
 so the checking thread will definitely have received the
 signal after the time another process modifies the file.
 
inotify: Disappointingly inotify doesn't support SIGIO readiness :(

2. If you have to mix the test with a poll/select/epoll/rtsig fd waiting
   for some other purpose.  For example: a file/web/local server, where the
   constraint is only that each stat() to revalidate a cached response
   appears to happen any time after the beginning of receiving the
   network request is known.

dnotify: It's free if you were using sigtimedwait anyway for I/O events,
 provided you completely read the queue, or get the signal
 priority right.

inotify: It's free if you were using poll/select/epoll anyway for I/O
 events, provided in the case of epoll that you completely
 read the queue, or use a two-level queue.

3. Amortising the test over many stat cache checks.

   Even if you must use a system call to check for any pending events,
   for revalidating an object which depends on multiple files, only
   one call is needed for all of the stat cache checks.

   More generally (this is more flexible), you can separate the notion
   of "cache time checkpoint" from "cache validation".  It's enough to
   know that a stat result was valid any time between the checkpoint
   time, and the current time.  That's how I'm implementing the
   file/web/local server case described above in step 2.  Then the
   events only need to be checked once during that time interval, no
   matter how many complex objects are being revalidated.

   It gets slightly more efficient when you have multiple, overlapping
   checkpoint->validation time intervals due to multiple outstanding
   requests being processed concurrently.

As I explained in the previous mail, all this is absolutely pointless
to save one system call.  It's a lot of work for negligable gain.

The point is when it saves lots of calls and userspace logic together,
for things like web page templates and compiled programs, which depend
on many files which can be revalidated in a small number of operations.

-- Jamie
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: dnotify/inotify and vfs questions

2005-08-25 Thread Jamie Lokier
Ian Campbell wrote:
 On Tue, 2005-08-23 at 16:23 +0100, Jamie Lokier wrote:
  receive some request...
  if (any_dnotify_or_inotify_events_pending) {
  read_dnotify_or_inotify_events();
  if (any_events_related_to(file)) {
  store_in_userspace_stat_cache(file, stat(file));
  }
  }
  stat_info = lookup_userspace_stat_cache(file);
  
  Now that's a silly way to save one system call in the fast path by itself.
 
 I'm not that familiar with inotify internals but doesn't
 read_dnotify_or_inotify_events() or
 any_dnotify_or_inotify_events_pending() involve a syscall?

The fast path is just any_dnotify_or_inotify_events_pending: there
aren't any relevant events pending in the fast path.

There's a few methods of doing this for free per individual stat cache check.

1. Signal handler.

dnotify: Check a variable set by the signal handler.

 [ There's a small time window between dnotify sending the
 signal and the receiving thread noticing on an SMP system due
 to the IPI, during which the sending task might have another
 way to signal the recieving task that it's finished some
 operation, so this method of using dnotify to invalidate a
 stat cache only has the correct ordering properties on UP systems. ]

 This works because dnotify signals are thread-specific,
 so the checking thread will definitely have received the
 signal after the time another process modifies the file.
 
inotify: Disappointingly inotify doesn't support SIGIO readiness :(

2. If you have to mix the test with a poll/select/epoll/rtsig fd waiting
   for some other purpose.  For example: a file/web/local server, where the
   constraint is only that each stat() to revalidate a cached response
   appears to happen any time after the beginning of receiving the
   network request is known.

dnotify: It's free if you were using sigtimedwait anyway for I/O events,
 provided you completely read the queue, or get the signal
 priority right.

inotify: It's free if you were using poll/select/epoll anyway for I/O
 events, provided in the case of epoll that you completely
 read the queue, or use a two-level queue.

3. Amortising the test over many stat cache checks.

   Even if you must use a system call to check for any pending events,
   for revalidating an object which depends on multiple files, only
   one call is needed for all of the stat cache checks.

   More generally (this is more flexible), you can separate the notion
   of cache time checkpoint from cache validation.  It's enough to
   know that a stat result was valid any time between the checkpoint
   time, and the current time.  That's how I'm implementing the
   file/web/local server case described above in step 2.  Then the
   events only need to be checked once during that time interval, no
   matter how many complex objects are being revalidated.

   It gets slightly more efficient when you have multiple, overlapping
   checkpoint-validation time intervals due to multiple outstanding
   requests being processed concurrently.

As I explained in the previous mail, all this is absolutely pointless
to save one system call.  It's a lot of work for negligable gain.

The point is when it saves lots of calls and userspace logic together,
for things like web page templates and compiled programs, which depend
on many files which can be revalidated in a small number of operations.

-- Jamie
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: dnotify/inotify and vfs questions

2005-08-23 Thread Jamie Lokier
Asser Femø wrote:
> According to the fcntl manual you can cancel a notification by doing
> fcntl(fd, F_NOTIFY, 0) (ie. sending 0 as the notification mask), but
> looking in the kernel code fcntl_dirnotify() immediately calls
> dnotify_flush() with neither telling the vfs module about it. Is there a
> reason for this?  Otherwise I'd propose calling
> filp->f_op->dir_notify(filp, 0) at some point in this scenario.
> 
> Regarding inotify, inotify_add_watch doesn't seem to pass on the request
> either, which works fine for local filesystem operations as they call
> fsnotify_* functions every time, but that isn't really feasible for
> filesystems like cifs because we'd have to request change notification
> on everything. Is there plans for implementing a mechanism to let vfs
> modules get watch requests too?

On a related note:

dnotify and inotify on local filesystems appear to be synchronous, in
the following rather useful sense:

If you have previously registered for inotify/dnotify events that will
catch a change to a file, and called stat() on the file, then the
following operation:

...
stat_info = stat(file)

may be replaced in userspace code with:

...
if (any_dnotify_or_inotify_events_pending) {
read_dnotify_or_inotify_events();
if (any_events_related_to(file)) {
store_in_userspace_stat_cache(file, stat(file));
}
}
stat_info = lookup_userspace_stat_cache(file);

Now that's a silly way to save one system call in the fast path by itself.

But when the stat_info is a prerequisite for validating cached data --
such as the contents of a file parsed into a data structure -- it can
save a lot of system calls and logical work.

For example, an Apache-style path walk which checks for .htaccess, or
a Samba-style path walk which is checking for unsafe symbolic links,
can be reduced from say 20 system calls to zero using this method.

Pre-compiled or pre-parsed programs/scripts/templates/config-files
where all the source files used are prerequisites for invalidating a
cached compiled form, reduces from say 40 system calls to stat() all
the source files, to zero  that's quite a saving.

It's not just reducing system calls.  The logical tests in userspace
are also skipped, if coded properly, facilitating very quick decisions
about things that depend on files which mostly don't change.
(Cascading structured cache prerequisites...mmm).

Remote dnotify/inotify doesn't _necessarily_ have this synchronous
property.  It may do in some cases, depending on the implementation
(this is subtle...).

So, it would be nice if there was a way to query this... rather than
the tedious method of testing the filesystem type and having a table
of "known local filesystem types" where it's safe to depend on this
property.  Alternatively, a way to specify at dnotify/inotify creation
type that synchronous notifications are required, and have the request
rejected if those can't be provided.

-- Jamie


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: dnotify/inotify and vfs questions

2005-08-23 Thread Jamie Lokier
Asser Femø wrote:
 According to the fcntl manual you can cancel a notification by doing
 fcntl(fd, F_NOTIFY, 0) (ie. sending 0 as the notification mask), but
 looking in the kernel code fcntl_dirnotify() immediately calls
 dnotify_flush() with neither telling the vfs module about it. Is there a
 reason for this?  Otherwise I'd propose calling
 filp-f_op-dir_notify(filp, 0) at some point in this scenario.
 
 Regarding inotify, inotify_add_watch doesn't seem to pass on the request
 either, which works fine for local filesystem operations as they call
 fsnotify_* functions every time, but that isn't really feasible for
 filesystems like cifs because we'd have to request change notification
 on everything. Is there plans for implementing a mechanism to let vfs
 modules get watch requests too?

On a related note:

dnotify and inotify on local filesystems appear to be synchronous, in
the following rather useful sense:

If you have previously registered for inotify/dnotify events that will
catch a change to a file, and called stat() on the file, then the
following operation:

receive some request...
stat_info = stat(file)

may be replaced in userspace code with:

receive some request...
if (any_dnotify_or_inotify_events_pending) {
read_dnotify_or_inotify_events();
if (any_events_related_to(file)) {
store_in_userspace_stat_cache(file, stat(file));
}
}
stat_info = lookup_userspace_stat_cache(file);

Now that's a silly way to save one system call in the fast path by itself.

But when the stat_info is a prerequisite for validating cached data --
such as the contents of a file parsed into a data structure -- it can
save a lot of system calls and logical work.

For example, an Apache-style path walk which checks for .htaccess, or
a Samba-style path walk which is checking for unsafe symbolic links,
can be reduced from say 20 system calls to zero using this method.

Pre-compiled or pre-parsed programs/scripts/templates/config-files
where all the source files used are prerequisites for invalidating a
cached compiled form, reduces from say 40 system calls to stat() all
the source files, to zero  that's quite a saving.

It's not just reducing system calls.  The logical tests in userspace
are also skipped, if coded properly, facilitating very quick decisions
about things that depend on files which mostly don't change.
(Cascading structured cache prerequisites...mmm).

Remote dnotify/inotify doesn't _necessarily_ have this synchronous
property.  It may do in some cases, depending on the implementation
(this is subtle...).

So, it would be nice if there was a way to query this... rather than
the tedious method of testing the filesystem type and having a table
of known local filesystem types where it's safe to depend on this
property.  Alternatively, a way to specify at dnotify/inotify creation
type that synchronous notifications are required, and have the request
rejected if those can't be provided.

-- Jamie


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] FUSE permission modell (Was: fuse review bits)

2005-04-17 Thread Jamie Lokier
Eric Van Hensbergen wrote:
> I'd like to second that I think private-namespaces are the right way
> to solve this sort of problem.  It also helps not cluttering the
> global namespace with user-local mounts
> 
> >
> > Shared subtrees and more support in userspace tools is needed before
> > private namespaces can become really useful.
> > 
> 
> I'd like to talk about this a bit more and start driving to a solution
> here.  I've been looking at the namespace code quite a bit and was
> just about to dive in and start checking into adding/fixing certain
> aspects such as stackable namespaces, optional inheritence (changes in
> a parent namespace are reflected in the child but not vice-versa),
> etc.
> 
> One aspect I was thinking about here was a mount flag that would give
> you a new private namespace (if you didn't already have one) for the
> mount (and I guess that would impact any subsequent mounts from the
> user in that shell).  Another option would be a 'newns' style
> system-call, but I'm generally against adding new system calls.
> 
> Shared subtrees are a tricky one.  I know how we would handle it in
> V9FS, but not sure how well that would translate to others
> (essentially we'd re-export the subtree so other user's could mount it
> individually -- but that's a very Plan 9 solution and may not be what
> more UNIX-minded folks would want -- we also need to improve our own
> server infrastructure to more efficiently support such a re-export).
> 
> So, to sum up I think private namespaces is the right solution, and
> I'd rather put effort into making it more useful than work-around the
> fact that its not practical right now.

Have a chat with Al Viro, who has already done some work on shared
mounts and subtrees I think.

-- Jamie
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] FUSE permission modell (Was: fuse review bits)

2005-04-17 Thread Jamie Lokier
Eric Van Hensbergen wrote:
 I'd like to second that I think private-namespaces are the right way
 to solve this sort of problem.  It also helps not cluttering the
 global namespace with user-local mounts
 
 
  Shared subtrees and more support in userspace tools is needed before
  private namespaces can become really useful.
  
 
 I'd like to talk about this a bit more and start driving to a solution
 here.  I've been looking at the namespace code quite a bit and was
 just about to dive in and start checking into adding/fixing certain
 aspects such as stackable namespaces, optional inheritence (changes in
 a parent namespace are reflected in the child but not vice-versa),
 etc.
 
 One aspect I was thinking about here was a mount flag that would give
 you a new private namespace (if you didn't already have one) for the
 mount (and I guess that would impact any subsequent mounts from the
 user in that shell).  Another option would be a 'newns' style
 system-call, but I'm generally against adding new system calls.
 
 Shared subtrees are a tricky one.  I know how we would handle it in
 V9FS, but not sure how well that would translate to others
 (essentially we'd re-export the subtree so other user's could mount it
 individually -- but that's a very Plan 9 solution and may not be what
 more UNIX-minded folks would want -- we also need to improve our own
 server infrastructure to more efficiently support such a re-export).
 
 So, to sum up I think private namespaces is the right solution, and
 I'd rather put effort into making it more useful than work-around the
 fact that its not practical right now.

Have a chat with Al Viro, who has already done some work on shared
mounts and subtrees I think.

-- Jamie
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] FUSE permission modell (Was: fuse review bits)

2005-04-13 Thread Jamie Lokier
Miklos Szeredi wrote:
> > Yet, the results from stat() don't distinguish the number spaces,
> > and "ls" doesn't map the numbers to names properly in the wrong
> > space.
> 
> Well you can use "ls -n".  It's up to the tools to present the
> information you want in the way you want it.  If a tool can't do that,
> tough, but you are not worse off than if the information is not
> available _at_all_.

Well, how do you currently provide access to the information that's
not presentable through stat()?

-- Jamie
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] FUSE permission modell (Was: fuse review bits)

2005-04-13 Thread Jamie Lokier
Miklos Szeredi wrote:
> I have a little project to imlement a "userloop" filesystem, which
> works just like "mount -o loop", but you don't need root privs.  This
> is really simple to do with FUSE and UML.

That would be a nice way to implement those rarely used old
filesystems that aren't really needed in the kernel source tree any
more, but which it would be nice to have access to as legacy
filesystem formats.

In other words, migrating old legacy filesystems out of the kernel
tree, into FUSE.

> I don't think that it's far feched, that in certain situations the
> user _does_ have the right (and usefulness) to do otherwise privileged
> filesystem operations.

It's really a matter of philosophy, as to whether the results of
stat() are just handy information for the user, or are always defined
to mean what you can/can't do with a file.

Local-ssh-into-UML makes more sense for this in some ways, because the
uids/gids inside your tgz files or foreign loop filesystems are not
related to the space of uids/gids of the host system.  Yet, the
results from stat() don't distinguish the number spaces, and "ls"
doesn't map the numbers to names properly in the wrong space.

-- Jamie
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] FUSE permission modell (Was: fuse review bits)

2005-04-13 Thread Jamie Lokier
Miklos Szeredi wrote:
> > Look up the rather large linux-kernel & linux-fsdevel thread "silent
> > semantic changes with reiser4" and it's followup threads, from last
> > year.
> 
> Wow, it's 700+ messages.  I got through the first 40, and already feel
> dizzy :)

It's easier if you skip the ones by Hans and their immediate followups :)

(Nothing personal, it's that Hans is mostly justifying reiser4's
behaviour, and the posts you really need to read aren't about reiser4).

> > It's already been tried.  You will also find sensible ideas on what
> > semantics it should have to do it properly.
> 
> OK, I understand the "slash -> directory, no-slash -> regular file"
> semantics.
> 
> How do you envision implementing this for "mount directory over file"?

Somewhere deep in that thread is a discussion between Al Viro and
Linus on it.

> A new mount flag indicating that it's only to be followed down if
> there's a slash after the mountpoint?

The new flag would indicate more than that: These mounts should be
detachable in the sense that deleting the file is possible, and
perhaps renamable/linkable too.  That's the stuff Al Viro discusses in
some detail in the big thread.

Ideally we'd like automounting, a bit like the Hurd's translators.
Attached to files (using an xattr or something, and executed with the
uid/gid of the file owner), and also per-user "pattern->action"
options for matching files with a certain type (e.g. tgz/zip/deb/rpm/xml).

But that can be added much later, as it's an orthogonal feature.

-- Jamie
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] FUSE permission modell (Was: fuse review bits)

2005-04-13 Thread Jamie Lokier
Miklos Szeredi wrote:
> > > Aren't there some assumptions in VFS that currently make this
> > > impossible?
> > 
> > I believe it's OK with VFS, but applications would be confused to death.
> > Well, there really is one issue -- dentries have exactly one parent, so
> > what do you do when opening a file with hardlinks as a directory? (In
> > fact IIRC that is what lead to all the funny talk about mountpoints,
> > since they don't have this limitation)
> 
> OK, that makes sense.
> 
> It would be quite interesting to see how applications react.  Maybe
> I'll hack something up :)

Look up the rather large linux-kernel & linux-fsdevel thread "silent
semantic changes with reiser4" and it's followup threads, from last
year.

It's already been tried.  You will also find sensible ideas on what
semantics it should have to do it properly.

-- Jamie
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] FUSE permission modell (Was: fuse review bits)

2005-04-13 Thread Jamie Lokier
> > > A nice implemention of it in FUSE could push it along a bit :)
> > 
> > Aren't there some assumptions in VFS that currently make this
> > impossible?
> 
> I believe it's OK with VFS, but applications would be confused to death.
> Well, there really is one issue -- dentries have exactly one parent, so
> what do you do when opening a file with hardlinks as a directory? (In
> fact IIRC that is what lead to all the funny talk about mountpoints,
> since they don't have this limitation)

Hardlinks aren't a problem when entering a file as if it's a
directory, provided the directory does not contain any hard links to a
parent in the hierarchy.  In other words, as long as it's a directed
acyclic graph.

This is trivially always true for virtual directories such as entering
an archive file.  And detachable/movable mountpoints are a nice and
sensible way to implement it.  Some work has actually been done on this.

Experiments with the reiserfs file-as-directory extension showed that
applications are generally ok with it.  It looks like a file, but you
can cd into it or follow a path lookup into it.

Linus had some good ideas on the exact semantics to implement when
doing path lookup on these objects.

-- Jamie
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] FUSE permission modell (Was: fuse review bits)

2005-04-13 Thread Jamie Lokier
   A nice implemention of it in FUSE could push it along a bit :)
  
  Aren't there some assumptions in VFS that currently make this
  impossible?
 
 I believe it's OK with VFS, but applications would be confused to death.
 Well, there really is one issue -- dentries have exactly one parent, so
 what do you do when opening a file with hardlinks as a directory? (In
 fact IIRC that is what lead to all the funny talk about mountpoints,
 since they don't have this limitation)

Hardlinks aren't a problem when entering a file as if it's a
directory, provided the directory does not contain any hard links to a
parent in the hierarchy.  In other words, as long as it's a directed
acyclic graph.

This is trivially always true for virtual directories such as entering
an archive file.  And detachable/movable mountpoints are a nice and
sensible way to implement it.  Some work has actually been done on this.

Experiments with the reiserfs file-as-directory extension showed that
applications are generally ok with it.  It looks like a file, but you
can cd into it or follow a path lookup into it.

Linus had some good ideas on the exact semantics to implement when
doing path lookup on these objects.

-- Jamie
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] FUSE permission modell (Was: fuse review bits)

2005-04-13 Thread Jamie Lokier
Miklos Szeredi wrote:
   Aren't there some assumptions in VFS that currently make this
   impossible?
  
  I believe it's OK with VFS, but applications would be confused to death.
  Well, there really is one issue -- dentries have exactly one parent, so
  what do you do when opening a file with hardlinks as a directory? (In
  fact IIRC that is what lead to all the funny talk about mountpoints,
  since they don't have this limitation)
 
 OK, that makes sense.
 
 It would be quite interesting to see how applications react.  Maybe
 I'll hack something up :)

Look up the rather large linux-kernel  linux-fsdevel thread silent
semantic changes with reiser4 and it's followup threads, from last
year.

It's already been tried.  You will also find sensible ideas on what
semantics it should have to do it properly.

-- Jamie
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] FUSE permission modell (Was: fuse review bits)

2005-04-13 Thread Jamie Lokier
Miklos Szeredi wrote:
  Look up the rather large linux-kernel  linux-fsdevel thread silent
  semantic changes with reiser4 and it's followup threads, from last
  year.
 
 Wow, it's 700+ messages.  I got through the first 40, and already feel
 dizzy :)

It's easier if you skip the ones by Hans and their immediate followups :)

(Nothing personal, it's that Hans is mostly justifying reiser4's
behaviour, and the posts you really need to read aren't about reiser4).

  It's already been tried.  You will also find sensible ideas on what
  semantics it should have to do it properly.
 
 OK, I understand the slash - directory, no-slash - regular file
 semantics.
 
 How do you envision implementing this for mount directory over file?

Somewhere deep in that thread is a discussion between Al Viro and
Linus on it.

 A new mount flag indicating that it's only to be followed down if
 there's a slash after the mountpoint?

The new flag would indicate more than that: These mounts should be
detachable in the sense that deleting the file is possible, and
perhaps renamable/linkable too.  That's the stuff Al Viro discusses in
some detail in the big thread.

Ideally we'd like automounting, a bit like the Hurd's translators.
Attached to files (using an xattr or something, and executed with the
uid/gid of the file owner), and also per-user pattern-action
options for matching files with a certain type (e.g. tgz/zip/deb/rpm/xml).

But that can be added much later, as it's an orthogonal feature.

-- Jamie
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] FUSE permission modell (Was: fuse review bits)

2005-04-13 Thread Jamie Lokier
Miklos Szeredi wrote:
 I have a little project to imlement a userloop filesystem, which
 works just like mount -o loop, but you don't need root privs.  This
 is really simple to do with FUSE and UML.

That would be a nice way to implement those rarely used old
filesystems that aren't really needed in the kernel source tree any
more, but which it would be nice to have access to as legacy
filesystem formats.

In other words, migrating old legacy filesystems out of the kernel
tree, into FUSE.

 I don't think that it's far feched, that in certain situations the
 user _does_ have the right (and usefulness) to do otherwise privileged
 filesystem operations.

It's really a matter of philosophy, as to whether the results of
stat() are just handy information for the user, or are always defined
to mean what you can/can't do with a file.

Local-ssh-into-UML makes more sense for this in some ways, because the
uids/gids inside your tgz files or foreign loop filesystems are not
related to the space of uids/gids of the host system.  Yet, the
results from stat() don't distinguish the number spaces, and ls
doesn't map the numbers to names properly in the wrong space.

-- Jamie
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] FUSE permission modell (Was: fuse review bits)

2005-04-13 Thread Jamie Lokier
Miklos Szeredi wrote:
  Yet, the results from stat() don't distinguish the number spaces,
  and ls doesn't map the numbers to names properly in the wrong
  space.
 
 Well you can use ls -n.  It's up to the tools to present the
 information you want in the way you want it.  If a tool can't do that,
 tough, but you are not worse off than if the information is not
 available _at_all_.

Well, how do you currently provide access to the information that's
not presentable through stat()?

-- Jamie
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] FUSE permission modell (Was: fuse review bits)

2005-04-12 Thread Jamie Lokier
Anton Altaparmakov wrote:
> > That said, I would _usually_ prefer that when I enter a tgz, that I
> > see all component files having the same uid/gid/permissions as the tgz
> > file itself - the same as I'd see if I entered a zip file.
> 
> As you say _usually_, even you admit that sometimes you would want the 
> real owner/permissions to be shown.  And that is the point Miklos is 
> trying to make I believe: it should be configurable not hard set to only 
> one which is what I understand HCH wants.
> 
> There are uses for both.  For example today I was updating the tar ball 
> which is used to create the var file system for a new chroot.  I certainly 
> want to see corretly setup owner/permissions when I look into that tar 
> ball using a FUSE fs...

If I'm updating a var filesystem for a new chroot, I'd need the
ability to chmod and chown things in that filesystem.  Does that work
as an ordinary user?

-- Jamie
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] FUSE permission modell (Was: fuse review bits)

2005-04-12 Thread Jamie Lokier
Miklos Szeredi wrote:
> Still can't find it :)
> 
> Which kernel?  Which file?

I'm looking at linux-2.4.30/fs/nfs/dir.c.

-- Jamie
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] FUSE permission modell (Was: fuse review bits)

2005-04-12 Thread Jamie Lokier
Miklos Szeredi wrote:
> > Indeed, if it can be done with namespaces _and_ mounting on a file
> > (that file-as-directory concept), _and_ automounting, then you could
> > cd into your tgz files and others could too :)
> 
> There's still that little problem of accessing the tgz file both as a
> file and a directory.  But yes.  Maybe in 10 years time :)

There was a thread a few months ago where file-as-directory was
discussed extensively, after Namesys implemented it.  That's where the
conversation on detachable mount points originated AFAIR.  It will
probably happen at some point.

A nice implemention of it in FUSE could push it along a bit :)

-- Jamie

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] FUSE permission modell (Was: fuse review bits)

2005-04-12 Thread Jamie Lokier
Miklos Szeredi wrote:
> The same can be true for tarfs.  I mount it for my purpose, others can
> mount it for theirs.  Since the daemon providing the filesystem asways
> runs with the same capabilities as the user who did the mount, I and
> others will always get the permissions that we have on the actual tar
> file.

Fair enough.

> Think of the "no permission for others" as "hiding", not as some
> special permission rule.  And if this hiding can be nicely done with
> namespaces, all the better, I'll happily drop this feature at that
> instant.

Indeed, if it can be done with namespaces _and_ mounting on a file
(that file-as-directory concept), _and_ automounting, then you could
cd into your tgz files and others could too :)

-- Jamie
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] FUSE permission modell (Was: fuse review bits)

2005-04-12 Thread Jamie Lokier
Miklos Szeredi wrote:
> > Yes, for NFSv2, this test in nfs_permssion():
> > 
> > if (!NFS_PROTO(inode)->access)
> > goto out;
> 
> I've seen that, I just thought that was for some broken servers not
> for all NFSv2 servers.
> 
> Anyway that's been fixed in NFSv3, so obviously the "permission
> checking on both sides" wasn't optimal :)
> 
> > And for either version of NFS, if the uid and gid are non-zero, and
> > the permission bits indicate that an access is permitted, then the
> > client does not consult the server for permission.
> 
> Where's that?  I see no such check.

/*
 * Trust UNIX mode bits except:
 *
 * 1) When override capabilities may have been invoked
 * 2) When root squashing may be involved
 * 3) When ACLs may overturn a negative answer */
if (!capable(CAP_DAC_OVERRIDE) && !capable(CAP_DAC_READ_SEARCH)
&& (current->fsuid != 0) && (current->fsgid != 0)
&& error != -EACCES)
goto out;

-- Jamie
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] FUSE permission modell (Was: fuse review bits)

2005-04-12 Thread Jamie Lokier
Miklos Szeredi wrote:
> > Note that NFS checks the permissions on _both_ the client and server,
> > for a reason.
> 
> Does it?  If I read the code correctly the client checks credentials
> supplied by the server (or cached).  But the server does the actual
> checking of permissions.
> 
> Am I missing something?

Yes, for NFSv2, this test in nfs_permssion():

if (!NFS_PROTO(inode)->access)
goto out;

And for either version of NFS, if the uid and gid are non-zero, and
the permission bits indicate that an access is permitted, then the
client does not consult the server for permission.

-- Jamie
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] FUSE permission modell (Was: fuse review bits)

2005-04-12 Thread Jamie Lokier
Miklos Szeredi wrote:
> > If the user wants to edit a read-only file in a tgz owned by himself,
> > why can he not _chmod_ the file and _then_ edit it?
> > 
> > That said, I would _usually_ prefer that when I enter a tgz, that I
> > see all component files having the same uid/gid/permissions as the tgz
> > file itself - the same as I'd see if I entered a zip file.
> 
> I can accept that usually you are not interested in the stored
> uid/gid.  But doubt that you want to lose permission information when
> you mount a tar file.  Zip is a different kettle of fish since that
> doesn't contain uid/gid/permissions.

It's not about being not interested.

It's because I'd want my programs, and other users, to have exactly
the same access to the tgz contents through vfs as they have when
accessing the tgz file directly.  Not some baroque combination of
unobvious hard-coded permission rules, that aren't even visible
through stat(), and which both increase permissions for the user and
decrease it for others at the same time.

I see why you may want to hide certain things from other users
sometimes - the sshfs example is a good one.  I daresay Al Viro could
come up with a per-user or per-keyring mount point for those occasions :)

-- Jamie
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] FUSE permission modell (Was: fuse review bits)

2005-04-12 Thread Jamie Lokier
Frank Sorenson wrote:
> > That does not make sense.
> > 
> > Are you saying you cannot trust your own sshfs userspace daemon?
> 
> The user who wrote the userspace code may be able to, but the system
> shouldn't trust the userspace daemon.  Permissions will be enforced by
> the ssh server.

In fact that's wrong.  Although permissions are enforced by the ssh
server, the server assumes that once a conncection is established, all
requests on it are from the same originating user.  The server is not
able to check which user is accessing the files on the client machine
- which is why Miklos wants/needs restrictive permissions on the
client machine too.

-- Jamie
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] FUSE permission modell (Was: fuse review bits)

2005-04-12 Thread Jamie Lokier
Bodo Eggert <[EMAIL PROTECTED]> wrote:
> >> That is exactly the intended effect.  If I'm at my work machine (where
> >> I'm not an admin unfortunately) and I mount my home machine with sshfs
> >> (because FUSE is installed fortunately :), then I bloody well don't
> >> want the sysadmin or some automated script of his to go mucking under
> >> the mountpoint.
> > 
> > I think that would be _much_ nicer implemented as a mount which is
> > invisible to other users, rather than one which causes the admin's
> > scripts to spew error messages.  Is the namespace mechanism at all
> > suitable for that?
> 
> This will require shared subtrees plus a way for new logins from the same
> user to join an existing (previous login) namespace.

Or "per-user namespaces".

It's part of a more general problem of how you limit access to private
data such as crypto keys, either per user, or more finely than that.

Isn't that what all the keyring stuff is for?

-- Jamie
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] FUSE permission modell (Was: fuse review bits)

2005-04-12 Thread Jamie Lokier
Bodo Eggert <[EMAIL PROTECTED]> wrote:
> Jamie Lokier <[EMAIL PROTECTED]> wrote:
> > Miklos Szeredi wrote:
> 
> >>   4) Access should not be further restricted for the owner of the
> >>  mount, even if permission bits, uid or gid would suggest
> >>  otherwise
> > 
> > Why?  Surely you want to prevent writing to files which don't have the
> > writable bit set?  A filesystem may also create append-only files -
> > and all users including the mount owner should be bound by that.
> 
> That will depend on the situation. If the user is mounting a tgz owned
> by himself, FUSE should default to being a convenient hex-editor.

If the user wants to edit a read-only file in a tgz owned by himself,
why can he not _chmod_ the file and _then_ edit it?

That said, I would _usually_ prefer that when I enter a tgz, that I
see all component files having the same uid/gid/permissions as the tgz
file itself - the same as I'd see if I entered a zip file.

-- Jamie
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] FUSE permission modell (Was: fuse review bits)

2005-04-12 Thread Jamie Lokier
Miklos Szeredi wrote:
> The same is true for the case when you mount an sshfs.  Since you
> entered your password (or have a passwordless login to the server) you
> are authorized to browse the files on the server, but only with the
> capabilities you have there as a user.  The server does the
> authorization.  The same is true for an NFS mount btw.  It's not the
> client that checks the permissions.
> 
> So do you see why I argue in favor of having an option _not_ to check
> permissions on the client by the kernel?

Note that NFS checks the permissions on _both_ the client and server,
for a reason.

-- Jamie
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] FUSE permission modell (Was: fuse review bits)

2005-04-12 Thread Jamie Lokier
Miklos Szeredi wrote:
> > It would also be nice to generalise and have virtual filesystems which
> > are able to present different views to different users.  Can FUSE do
> > that already - is the userspace part told which user is doing each
> > operation?
> 
> Yes.
> 
> > With that, the desire for virtual filesystems which cannot be read
> > by your sysadmin (by accident) is easy to satisfy - and that kind of
> > mechanism would probably be acceptable to all.
> 
> The problem is that this way the responsibility goes to the userspace
> program, which can't be trusted.

That does not make sense.

Are you saying you cannot trust your own sshfs userspace daemon?

-- Jamie
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] FUSE permission modell (Was: fuse review bits)

2005-04-12 Thread Jamie Lokier
Miklos Szeredi wrote:
  It would also be nice to generalise and have virtual filesystems which
  are able to present different views to different users.  Can FUSE do
  that already - is the userspace part told which user is doing each
  operation?
 
 Yes.
 
  With that, the desire for virtual filesystems which cannot be read
  by your sysadmin (by accident) is easy to satisfy - and that kind of
  mechanism would probably be acceptable to all.
 
 The problem is that this way the responsibility goes to the userspace
 program, which can't be trusted.

That does not make sense.

Are you saying you cannot trust your own sshfs userspace daemon?

-- Jamie
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] FUSE permission modell (Was: fuse review bits)

2005-04-12 Thread Jamie Lokier
Miklos Szeredi wrote:
 The same is true for the case when you mount an sshfs.  Since you
 entered your password (or have a passwordless login to the server) you
 are authorized to browse the files on the server, but only with the
 capabilities you have there as a user.  The server does the
 authorization.  The same is true for an NFS mount btw.  It's not the
 client that checks the permissions.
 
 So do you see why I argue in favor of having an option _not_ to check
 permissions on the client by the kernel?

Note that NFS checks the permissions on _both_ the client and server,
for a reason.

-- Jamie
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] FUSE permission modell (Was: fuse review bits)

2005-04-12 Thread Jamie Lokier
Bodo Eggert [EMAIL PROTECTED] wrote:
 Jamie Lokier [EMAIL PROTECTED] wrote:
  Miklos Szeredi wrote:
 
    4) Access should not be further restricted for the owner of the
   mount, even if permission bits, uid or gid would suggest
   otherwise
  
  Why?  Surely you want to prevent writing to files which don't have the
  writable bit set?  A filesystem may also create append-only files -
  and all users including the mount owner should be bound by that.
 
 That will depend on the situation. If the user is mounting a tgz owned
 by himself, FUSE should default to being a convenient hex-editor.

If the user wants to edit a read-only file in a tgz owned by himself,
why can he not _chmod_ the file and _then_ edit it?

That said, I would _usually_ prefer that when I enter a tgz, that I
see all component files having the same uid/gid/permissions as the tgz
file itself - the same as I'd see if I entered a zip file.

-- Jamie
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] FUSE permission modell (Was: fuse review bits)

2005-04-12 Thread Jamie Lokier
Bodo Eggert [EMAIL PROTECTED] wrote:
  That is exactly the intended effect.  If I'm at my work machine (where
  I'm not an admin unfortunately) and I mount my home machine with sshfs
  (because FUSE is installed fortunately :), then I bloody well don't
  want the sysadmin or some automated script of his to go mucking under
  the mountpoint.
  
  I think that would be _much_ nicer implemented as a mount which is
  invisible to other users, rather than one which causes the admin's
  scripts to spew error messages.  Is the namespace mechanism at all
  suitable for that?
 
 This will require shared subtrees plus a way for new logins from the same
 user to join an existing (previous login) namespace.

Or per-user namespaces.

It's part of a more general problem of how you limit access to private
data such as crypto keys, either per user, or more finely than that.

Isn't that what all the keyring stuff is for?

-- Jamie
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] FUSE permission modell (Was: fuse review bits)

2005-04-12 Thread Jamie Lokier
Frank Sorenson wrote:
  That does not make sense.
  
  Are you saying you cannot trust your own sshfs userspace daemon?
 
 The user who wrote the userspace code may be able to, but the system
 shouldn't trust the userspace daemon.  Permissions will be enforced by
 the ssh server.

In fact that's wrong.  Although permissions are enforced by the ssh
server, the server assumes that once a conncection is established, all
requests on it are from the same originating user.  The server is not
able to check which user is accessing the files on the client machine
- which is why Miklos wants/needs restrictive permissions on the
client machine too.

-- Jamie
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] FUSE permission modell (Was: fuse review bits)

2005-04-12 Thread Jamie Lokier
Miklos Szeredi wrote:
  If the user wants to edit a read-only file in a tgz owned by himself,
  why can he not _chmod_ the file and _then_ edit it?
  
  That said, I would _usually_ prefer that when I enter a tgz, that I
  see all component files having the same uid/gid/permissions as the tgz
  file itself - the same as I'd see if I entered a zip file.
 
 I can accept that usually you are not interested in the stored
 uid/gid.  But doubt that you want to lose permission information when
 you mount a tar file.  Zip is a different kettle of fish since that
 doesn't contain uid/gid/permissions.

It's not about being not interested.

It's because I'd want my programs, and other users, to have exactly
the same access to the tgz contents through vfs as they have when
accessing the tgz file directly.  Not some baroque combination of
unobvious hard-coded permission rules, that aren't even visible
through stat(), and which both increase permissions for the user and
decrease it for others at the same time.

I see why you may want to hide certain things from other users
sometimes - the sshfs example is a good one.  I daresay Al Viro could
come up with a per-user or per-keyring mount point for those occasions :)

-- Jamie
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] FUSE permission modell (Was: fuse review bits)

2005-04-12 Thread Jamie Lokier
Miklos Szeredi wrote:
  Note that NFS checks the permissions on _both_ the client and server,
  for a reason.
 
 Does it?  If I read the code correctly the client checks credentials
 supplied by the server (or cached).  But the server does the actual
 checking of permissions.
 
 Am I missing something?

Yes, for NFSv2, this test in nfs_permssion():

if (!NFS_PROTO(inode)-access)
goto out;

And for either version of NFS, if the uid and gid are non-zero, and
the permission bits indicate that an access is permitted, then the
client does not consult the server for permission.

-- Jamie
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] FUSE permission modell (Was: fuse review bits)

2005-04-12 Thread Jamie Lokier
Miklos Szeredi wrote:
  Yes, for NFSv2, this test in nfs_permssion():
  
  if (!NFS_PROTO(inode)-access)
  goto out;
 
 I've seen that, I just thought that was for some broken servers not
 for all NFSv2 servers.
 
 Anyway that's been fixed in NFSv3, so obviously the permission
 checking on both sides wasn't optimal :)
 
  And for either version of NFS, if the uid and gid are non-zero, and
  the permission bits indicate that an access is permitted, then the
  client does not consult the server for permission.
 
 Where's that?  I see no such check.

/*
 * Trust UNIX mode bits except:
 *
 * 1) When override capabilities may have been invoked
 * 2) When root squashing may be involved
 * 3) When ACLs may overturn a negative answer */
if (!capable(CAP_DAC_OVERRIDE)  !capable(CAP_DAC_READ_SEARCH)
 (current-fsuid != 0)  (current-fsgid != 0)
 error != -EACCES)
goto out;

-- Jamie
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] FUSE permission modell (Was: fuse review bits)

2005-04-12 Thread Jamie Lokier
Miklos Szeredi wrote:
 The same can be true for tarfs.  I mount it for my purpose, others can
 mount it for theirs.  Since the daemon providing the filesystem asways
 runs with the same capabilities as the user who did the mount, I and
 others will always get the permissions that we have on the actual tar
 file.

Fair enough.

 Think of the no permission for others as hiding, not as some
 special permission rule.  And if this hiding can be nicely done with
 namespaces, all the better, I'll happily drop this feature at that
 instant.

Indeed, if it can be done with namespaces _and_ mounting on a file
(that file-as-directory concept), _and_ automounting, then you could
cd into your tgz files and others could too :)

-- Jamie
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] FUSE permission modell (Was: fuse review bits)

2005-04-12 Thread Jamie Lokier
Miklos Szeredi wrote:
 Still can't find it :)
 
 Which kernel?  Which file?

I'm looking at linux-2.4.30/fs/nfs/dir.c.

-- Jamie
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] FUSE permission modell (Was: fuse review bits)

2005-04-12 Thread Jamie Lokier
Miklos Szeredi wrote:
  Indeed, if it can be done with namespaces _and_ mounting on a file
  (that file-as-directory concept), _and_ automounting, then you could
  cd into your tgz files and others could too :)
 
 There's still that little problem of accessing the tgz file both as a
 file and a directory.  But yes.  Maybe in 10 years time :)

There was a thread a few months ago where file-as-directory was
discussed extensively, after Namesys implemented it.  That's where the
conversation on detachable mount points originated AFAIR.  It will
probably happen at some point.

A nice implemention of it in FUSE could push it along a bit :)

-- Jamie

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] FUSE permission modell (Was: fuse review bits)

2005-04-12 Thread Jamie Lokier
Anton Altaparmakov wrote:
  That said, I would _usually_ prefer that when I enter a tgz, that I
  see all component files having the same uid/gid/permissions as the tgz
  file itself - the same as I'd see if I entered a zip file.
 
 As you say _usually_, even you admit that sometimes you would want the 
 real owner/permissions to be shown.  And that is the point Miklos is 
 trying to make I believe: it should be configurable not hard set to only 
 one which is what I understand HCH wants.
 
 There are uses for both.  For example today I was updating the tar ball 
 which is used to create the var file system for a new chroot.  I certainly 
 want to see corretly setup owner/permissions when I look into that tar 
 ball using a FUSE fs...

If I'm updating a var filesystem for a new chroot, I'd need the
ability to chmod and chown things in that filesystem.  Does that work
as an ordinary user?

-- Jamie
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] FUSE permission modell (Was: fuse review bits)

2005-04-11 Thread Jamie Lokier
Miklos Szeredi wrote:
> That is exactly the intended effect.  If I'm at my work machine (where
> I'm not an admin unfortunately) and I mount my home machine with sshfs
> (because FUSE is installed fortunately :), then I bloody well don't
> want the sysadmin or some automated script of his to go mucking under
> the mountpoint.

I think that would be _much_ nicer implemented as a mount which is
invisible to other users, rather than one which causes the admin's
scripts to spew error messages.  Is the namespace mechanism at all
suitable for that?

It would also be nice to generalise and have virtual filesystems which
are able to present different views to different users.  Can FUSE do
that already - is the userspace part told which user is doing each
operation?  With that, the desire for virtual filesystems which cannot
be read by your sysadmin (by accident) is easy to satisfy - and that
kind of mechanism would probably be acceptable to all.

-- Jamie
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] FUSE permission modell (Was: fuse review bits)

2005-04-11 Thread Jamie Lokier
Miklos Szeredi wrote:
>   1) User must not be able to modify files or directories in a way
>  which he otherwise could not do (e.g. mount a filesystem over
>  /bin)
> 
>   2) Suid and device semantics should be disabled within the mount
> 
>   3) No other user should have access to files under the mount, not
>  even root[5]

Why?  I can see plenty of uses where I want a filesystem generated by
one user to be accessible by other users.  That policy should not be
hard-coded into the kernel.  It might be an option.

>   4) Access should not be further restricted for the owner of the
>  mount, even if permission bits, uid or gid would suggest
>  otherwise

Why?  Surely you want to prevent writing to files which don't have the
writable bit set?  A filesystem may also create append-only files -
and all users including the mount owner should be bound by that.

>   5) As much of the available information should be exported via the
>  filesystem as possible

This is the root of the conflict.  You are trying to overload the
permission bits and uid/gid to mean something different than they
normally do.

While it's convenient to see some "remote" information such as the
uid/gid in a tar file, are you sure it's a good idea to break the unix
permissions model - which will break some programs?  (For example, try
editing a file with the broken semantics in an editor which checks the
uid/gid of the file against the current user).

For most virtual filesystems, the "remote" information does not map to
uid/gid in a particularly natural way anyway.  So it seems odd to want
to break the unix permissions model just so that a small _subset_ of
virtual filesystems can use stat() as a way to get a bit of
information out, when other virtual filesystems (e.g. webdavfs) can't
put meaningful information in there, and would benefit from normal
unix permissions instead.

>   1) Only allow mount over a directory for which the user has write
>  access (and is not sticky)

Seems good - but why not sticky?  Mounting a user filesystem in
/tmp/user-xxx/my-mount-point seems not unreasonable - provided the
administrator can delete the directory (which is possible with
detachable mount points).

>   2) Use nosuid,nodev mount options

Of course.  Ideally, make sure they appear to be set in /proc/mounts.

(root (or equivalent) should be able to create virtual filesystems
without these options, but probably they should be set by default even
for root, and clearable using suid,dev).

>   3) In permission method of FUSE kernel module compare fsuid against
>  mounting user's ID, and return EACCES if they are not equal.

Bad.  How do I, user A, then create a virtual filesystem which I want
user B to be able to access?

>   4) The filesystem daemon does not run with elevated permissions.
>  The kernel doesn't check file more in the permission method.

I like the idea that the fs daemon doesn't need elevated permissions.

>   5) The filesystem daemon is free to fill in all file attributes to
>  any (sane) value, and the kernel won't modify these.

Dangerous, because an administrative program might actually trust the
attributes to mean what they normally mean in the unix permissions model.

> The debated part is 3) and 4), namely that normal permission checking
> based on file mode is bypassed, and the mounting user has full
> permission to all files, while other users have none.
> 
> This feature has been in FUSE from the start and thus has been very
> well tested in real world scenarios.  Also I have thought a lot about
> how this could pose any kind of security threat, and as yet found no
> such possiblity.

Ok, but why do you prevent the useful behaviour of allowing access to
other users, when I want that?  For example, I might export my current
project's database as a filesystem that I _want_ other users to be
able to read.

> Despite this Christoph feels this behavior is unacceptable for a
> filesystem, and wants me to remove this feature before merging FUSE
> into mainline.  I try to be open to ideas, but also feel strongly that
> this is the Right Way, so I won't give up easily.

I agree with Christoph.  It is a huge deviation from the unix
permissions model -- and it seems to prevent some useful applications
of FUSE so it's not clear why you want it.

-- Jamie
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] FUSE permission modell (Was: fuse review bits)

2005-04-11 Thread Jamie Lokier
Miklos Szeredi wrote:
   1) User must not be able to modify files or directories in a way
  which he otherwise could not do (e.g. mount a filesystem over
  /bin)
 
   2) Suid and device semantics should be disabled within the mount
 
   3) No other user should have access to files under the mount, not
  even root[5]

Why?  I can see plenty of uses where I want a filesystem generated by
one user to be accessible by other users.  That policy should not be
hard-coded into the kernel.  It might be an option.

   4) Access should not be further restricted for the owner of the
  mount, even if permission bits, uid or gid would suggest
  otherwise

Why?  Surely you want to prevent writing to files which don't have the
writable bit set?  A filesystem may also create append-only files -
and all users including the mount owner should be bound by that.

   5) As much of the available information should be exported via the
  filesystem as possible

This is the root of the conflict.  You are trying to overload the
permission bits and uid/gid to mean something different than they
normally do.

While it's convenient to see some remote information such as the
uid/gid in a tar file, are you sure it's a good idea to break the unix
permissions model - which will break some programs?  (For example, try
editing a file with the broken semantics in an editor which checks the
uid/gid of the file against the current user).

For most virtual filesystems, the remote information does not map to
uid/gid in a particularly natural way anyway.  So it seems odd to want
to break the unix permissions model just so that a small _subset_ of
virtual filesystems can use stat() as a way to get a bit of
information out, when other virtual filesystems (e.g. webdavfs) can't
put meaningful information in there, and would benefit from normal
unix permissions instead.

   1) Only allow mount over a directory for which the user has write
  access (and is not sticky)

Seems good - but why not sticky?  Mounting a user filesystem in
/tmp/user-xxx/my-mount-point seems not unreasonable - provided the
administrator can delete the directory (which is possible with
detachable mount points).

   2) Use nosuid,nodev mount options

Of course.  Ideally, make sure they appear to be set in /proc/mounts.

(root (or equivalent) should be able to create virtual filesystems
without these options, but probably they should be set by default even
for root, and clearable using suid,dev).

   3) In permission method of FUSE kernel module compare fsuid against
  mounting user's ID, and return EACCES if they are not equal.

Bad.  How do I, user A, then create a virtual filesystem which I want
user B to be able to access?

   4) The filesystem daemon does not run with elevated permissions.
  The kernel doesn't check file more in the permission method.

I like the idea that the fs daemon doesn't need elevated permissions.

   5) The filesystem daemon is free to fill in all file attributes to
  any (sane) value, and the kernel won't modify these.

Dangerous, because an administrative program might actually trust the
attributes to mean what they normally mean in the unix permissions model.

 The debated part is 3) and 4), namely that normal permission checking
 based on file mode is bypassed, and the mounting user has full
 permission to all files, while other users have none.
 
 This feature has been in FUSE from the start and thus has been very
 well tested in real world scenarios.  Also I have thought a lot about
 how this could pose any kind of security threat, and as yet found no
 such possiblity.

Ok, but why do you prevent the useful behaviour of allowing access to
other users, when I want that?  For example, I might export my current
project's database as a filesystem that I _want_ other users to be
able to read.

 Despite this Christoph feels this behavior is unacceptable for a
 filesystem, and wants me to remove this feature before merging FUSE
 into mainline.  I try to be open to ideas, but also feel strongly that
 this is the Right Way, so I won't give up easily.

I agree with Christoph.  It is a huge deviation from the unix
permissions model -- and it seems to prevent some useful applications
of FUSE so it's not clear why you want it.

-- Jamie
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] FUSE permission modell (Was: fuse review bits)

2005-04-11 Thread Jamie Lokier
Miklos Szeredi wrote:
 That is exactly the intended effect.  If I'm at my work machine (where
 I'm not an admin unfortunately) and I mount my home machine with sshfs
 (because FUSE is installed fortunately :), then I bloody well don't
 want the sysadmin or some automated script of his to go mucking under
 the mountpoint.

I think that would be _much_ nicer implemented as a mount which is
invisible to other users, rather than one which causes the admin's
scripts to spew error messages.  Is the namespace mechanism at all
suitable for that?

It would also be nice to generalise and have virtual filesystems which
are able to present different views to different users.  Can FUSE do
that already - is the userspace part told which user is doing each
operation?  With that, the desire for virtual filesystems which cannot
be read by your sysadmin (by accident) is easy to satisfy - and that
kind of mechanism would probably be acceptable to all.

-- Jamie
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: kernel bug: futex_wait hang

2005-03-22 Thread Jamie Lokier
Lee Revell wrote:
> On Tue, 2005-03-22 at 04:48 +0000, Jamie Lokier wrote:
> > I argued for fixing Glibc on the grounds that the changed kernel
> > behaviour, or more exactly having Glibc depend on it, loses a certain
> > semantic property useful for unusual operations on multiple futexes at
> > the same time.  But I appear to have lost the argument, and Jakub's
> > latest patch does clean up some cruft quite nicely, with no expected
> > performance hit.
> 
> A glibc fix will take forever to get to users compared to a kernel fix.

Interesting perspective.  On my systems Glibc is upgraded more often
than the kernel.

-- Jamie
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: kernel bug: futex_wait hang

2005-03-22 Thread Jamie Lokier
Lee Revell wrote:
 On Tue, 2005-03-22 at 04:48 +, Jamie Lokier wrote:
  I argued for fixing Glibc on the grounds that the changed kernel
  behaviour, or more exactly having Glibc depend on it, loses a certain
  semantic property useful for unusual operations on multiple futexes at
  the same time.  But I appear to have lost the argument, and Jakub's
  latest patch does clean up some cruft quite nicely, with no expected
  performance hit.
 
 A glibc fix will take forever to get to users compared to a kernel fix.

Interesting perspective.  On my systems Glibc is upgraded more often
than the kernel.

-- Jamie
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: kernel bug: futex_wait hang

2005-03-21 Thread Jamie Lokier
Andrew Morton wrote:
> iirc we ended up deciding that the futex problems around that time were due
> to userspace problems (a version of libc).  But then, there's no discussion
> around Seto's patch and it didn't get applied.  So I don't know what
> happened to that work - it's all a bit mysterious.

It can be fixed _either_ in Glibc, or by changing the kernel.

That problem is caused by differing assumptions between Glibc and the
kernel about subtle futex semantics.  Which goes to show they are
really clever, or something.

I provided pseudo-code for the Glibc fix, but not an actual patch
because NPTL is quite complicated and I wanted to know the Glibc
people were interested, but apparently they were too busy at the time
- benchmarks would have made sense for such a patch.

Scott Snyder started fixing part of Glibc, and that did fix his
instance of this problem so we know the approach works.  But a full
patch for Glibc was not prepared.

The most recent messages under "Futex queue_me/get_user ordering",
with a patch from Jakub Jelinek will fix this problem by changing the
kernel.  Yes, you should apply Jakub's most recent patch, message-ID
"<[EMAIL PROTECTED]>".

I have not tested the patch, but it looks convincing.

I argued for fixing Glibc on the grounds that the changed kernel
behaviour, or more exactly having Glibc depend on it, loses a certain
semantic property useful for unusual operations on multiple futexes at
the same time.  But I appear to have lost the argument, and Jakub's
latest patch does clean up some cruft quite nicely, with no expected
performance hit.

-- Jamie
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: kernel bug: futex_wait hang

2005-03-21 Thread Jamie Lokier
Lee Revell wrote:
> > iirc we ended up deciding that the futex problems around that time were due
> > to userspace problems (a version of libc).  But then, there's no discussion
> > around Seto's patch and it didn't get applied.  So I don't know what
> > happened to that work - it's all a bit mysterious.
> 
> It does seem like it could be a different problem.  Maybe Paul can
> provide some more evidence that it's a kernel and not a glibc/NPTL bug.
> I'm really just posting this on Paul's behalf; I don't claim to
> understand the issue. ;-)

Try applying the patch below, which was recently posted by Jakub Jelinek.

If it fixes the problem, it's the same thing as Hidetoshi Seto
noticed, although this patch is much improved thanks to
"preempt_count technology" (tm).  If not, it's a whole new problem.

-- Jamie

--- linux-2.6.11/kernel/futex.c.jj  2005-03-17 04:42:29.0 -0500
+++ linux-2.6.11/kernel/futex.c 2005-03-18 05:45:29.0 -0500
@@ -97,7 +97,6 @@ struct futex_q {
  */
 struct futex_hash_bucket {
spinlock_t  lock;
-   unsigned intnqueued;
struct list_head   chain;
 };
 
@@ -265,7 +264,6 @@ static inline int get_futex_value_locked
inc_preempt_count();
ret = __copy_from_user_inatomic(dest, from, sizeof(int));
dec_preempt_count();
-   preempt_check_resched();
 
return ret ? -EFAULT : 0;
 }
@@ -339,7 +337,6 @@ static int futex_requeue(unsigned long u
struct list_head *head1;
struct futex_q *this, *next;
int ret, drop_count = 0;
-   unsigned int nqueued;
 
  retry:
down_read(>mm->mmap_sem);
@@ -354,23 +351,22 @@ static int futex_requeue(unsigned long u
bh1 = hash_futex();
bh2 = hash_futex();
 
-   nqueued = bh1->nqueued;
+   if (bh1 < bh2)
+   spin_lock(>lock);
+   spin_lock(>lock);
+   if (bh1 > bh2)
+   spin_lock(>lock);
+
if (likely(valp != NULL)) {
int curval;
 
-   /* In order to avoid doing get_user while
-  holding bh1->lock and bh2->lock, nqueued
-  (monotonically increasing field) must be first
-  read, then *uaddr1 fetched from userland and
-  after acquiring lock nqueued field compared with
-  the stored value.  The smp_mb () below
-  makes sure that bh1->nqueued is read from memory
-  before *uaddr1.  */
-   smp_mb();
-
ret = get_futex_value_locked(, (int __user *)uaddr1);
 
if (unlikely(ret)) {
+   spin_unlock(>lock);
+   if (bh1 != bh2)
+   spin_unlock(>lock);
+
/* If we would have faulted, release mmap_sem, fault
 * it in and start all over again.
 */
@@ -385,21 +381,10 @@ static int futex_requeue(unsigned long u
}
if (curval != *valp) {
ret = -EAGAIN;
-   goto out;
+   goto out_unlock;
}
}
 
-   if (bh1 < bh2)
-   spin_lock(>lock);
-   spin_lock(>lock);
-   if (bh1 > bh2)
-   spin_lock(>lock);
-
-   if (unlikely(nqueued != bh1->nqueued && valp != NULL)) {
-   ret = -EAGAIN;
-   goto out_unlock;
-   }
-
head1 = >chain;
list_for_each_entry_safe(this, next, head1, list) {
if (!match_futex (>key, ))
@@ -435,13 +420,9 @@ out:
return ret;
 }
 
-/*
- * queue_me and unqueue_me must be called as a pair, each
- * exactly once.  They are called with the hashed spinlock held.
- */
-
 /* The key must be already stored in q->key. */
-static void queue_me(struct futex_q *q, int fd, struct file *filp)
+static inline struct futex_hash_bucket *
+queue_lock(struct futex_q *q, int fd, struct file *filp)
 {
struct futex_hash_bucket *bh;
 
@@ -455,11 +436,35 @@ static void queue_me(struct futex_q *q, 
q->lock_ptr = >lock;
 
spin_lock(>lock);
-   bh->nqueued++;
+   return bh;
+}
+
+static inline void __queue_me(struct futex_q *q, struct futex_hash_bucket *bh)
+{
list_add_tail(>list, >chain);
spin_unlock(>lock);
 }
 
+static inline void
+queue_unlock(struct futex_q *q, struct futex_hash_bucket *bh)
+{
+   spin_unlock(>lock);
+   drop_key_refs(>key);
+}
+
+/*
+ * queue_me and unqueue_me must be called as a pair, each
+ * exactly once.  They are called with the hashed spinlock held.
+ */
+
+/* The key must be already stored in q->key. */
+static void queue_me(struct futex_q *q, int fd, struct file *filp)
+{
+   struct futex_hash_bucket *bh;
+   bh = queue_lock(q, fd, filp);
+   __queue_me(q, bh);
+}
+
 /* Return 1 if we were still queued (ie. 0 means we were woken) */
 static int unqueue_me(struct futex_q *q)
 {
@@ 

Re: kernel bug: futex_wait hang

2005-03-21 Thread Jamie Lokier
Lee Revell wrote:
  iirc we ended up deciding that the futex problems around that time were due
  to userspace problems (a version of libc).  But then, there's no discussion
  around Seto's patch and it didn't get applied.  So I don't know what
  happened to that work - it's all a bit mysterious.
 
 It does seem like it could be a different problem.  Maybe Paul can
 provide some more evidence that it's a kernel and not a glibc/NPTL bug.
 I'm really just posting this on Paul's behalf; I don't claim to
 understand the issue. ;-)

Try applying the patch below, which was recently posted by Jakub Jelinek.

If it fixes the problem, it's the same thing as Hidetoshi Seto
noticed, although this patch is much improved thanks to
preempt_count technology (tm).  If not, it's a whole new problem.

-- Jamie

--- linux-2.6.11/kernel/futex.c.jj  2005-03-17 04:42:29.0 -0500
+++ linux-2.6.11/kernel/futex.c 2005-03-18 05:45:29.0 -0500
@@ -97,7 +97,6 @@ struct futex_q {
  */
 struct futex_hash_bucket {
spinlock_t  lock;
-   unsigned intnqueued;
struct list_head   chain;
 };
 
@@ -265,7 +264,6 @@ static inline int get_futex_value_locked
inc_preempt_count();
ret = __copy_from_user_inatomic(dest, from, sizeof(int));
dec_preempt_count();
-   preempt_check_resched();
 
return ret ? -EFAULT : 0;
 }
@@ -339,7 +337,6 @@ static int futex_requeue(unsigned long u
struct list_head *head1;
struct futex_q *this, *next;
int ret, drop_count = 0;
-   unsigned int nqueued;
 
  retry:
down_read(current-mm-mmap_sem);
@@ -354,23 +351,22 @@ static int futex_requeue(unsigned long u
bh1 = hash_futex(key1);
bh2 = hash_futex(key2);
 
-   nqueued = bh1-nqueued;
+   if (bh1  bh2)
+   spin_lock(bh1-lock);
+   spin_lock(bh2-lock);
+   if (bh1  bh2)
+   spin_lock(bh1-lock);
+
if (likely(valp != NULL)) {
int curval;
 
-   /* In order to avoid doing get_user while
-  holding bh1-lock and bh2-lock, nqueued
-  (monotonically increasing field) must be first
-  read, then *uaddr1 fetched from userland and
-  after acquiring lock nqueued field compared with
-  the stored value.  The smp_mb () below
-  makes sure that bh1-nqueued is read from memory
-  before *uaddr1.  */
-   smp_mb();
-
ret = get_futex_value_locked(curval, (int __user *)uaddr1);
 
if (unlikely(ret)) {
+   spin_unlock(bh1-lock);
+   if (bh1 != bh2)
+   spin_unlock(bh2-lock);
+
/* If we would have faulted, release mmap_sem, fault
 * it in and start all over again.
 */
@@ -385,21 +381,10 @@ static int futex_requeue(unsigned long u
}
if (curval != *valp) {
ret = -EAGAIN;
-   goto out;
+   goto out_unlock;
}
}
 
-   if (bh1  bh2)
-   spin_lock(bh1-lock);
-   spin_lock(bh2-lock);
-   if (bh1  bh2)
-   spin_lock(bh1-lock);
-
-   if (unlikely(nqueued != bh1-nqueued  valp != NULL)) {
-   ret = -EAGAIN;
-   goto out_unlock;
-   }
-
head1 = bh1-chain;
list_for_each_entry_safe(this, next, head1, list) {
if (!match_futex (this-key, key1))
@@ -435,13 +420,9 @@ out:
return ret;
 }
 
-/*
- * queue_me and unqueue_me must be called as a pair, each
- * exactly once.  They are called with the hashed spinlock held.
- */
-
 /* The key must be already stored in q-key. */
-static void queue_me(struct futex_q *q, int fd, struct file *filp)
+static inline struct futex_hash_bucket *
+queue_lock(struct futex_q *q, int fd, struct file *filp)
 {
struct futex_hash_bucket *bh;
 
@@ -455,11 +436,35 @@ static void queue_me(struct futex_q *q, 
q-lock_ptr = bh-lock;
 
spin_lock(bh-lock);
-   bh-nqueued++;
+   return bh;
+}
+
+static inline void __queue_me(struct futex_q *q, struct futex_hash_bucket *bh)
+{
list_add_tail(q-list, bh-chain);
spin_unlock(bh-lock);
 }
 
+static inline void
+queue_unlock(struct futex_q *q, struct futex_hash_bucket *bh)
+{
+   spin_unlock(bh-lock);
+   drop_key_refs(q-key);
+}
+
+/*
+ * queue_me and unqueue_me must be called as a pair, each
+ * exactly once.  They are called with the hashed spinlock held.
+ */
+
+/* The key must be already stored in q-key. */
+static void queue_me(struct futex_q *q, int fd, struct file *filp)
+{
+   struct futex_hash_bucket *bh;
+   bh = queue_lock(q, fd, filp);
+   __queue_me(q, bh);
+}
+
 /* Return 1 if we were still queued (ie. 0 means we were woken) */
 static int 

Re: kernel bug: futex_wait hang

2005-03-21 Thread Jamie Lokier
Andrew Morton wrote:
 iirc we ended up deciding that the futex problems around that time were due
 to userspace problems (a version of libc).  But then, there's no discussion
 around Seto's patch and it didn't get applied.  So I don't know what
 happened to that work - it's all a bit mysterious.

It can be fixed _either_ in Glibc, or by changing the kernel.

That problem is caused by differing assumptions between Glibc and the
kernel about subtle futex semantics.  Which goes to show they are
really clever, or something.

I provided pseudo-code for the Glibc fix, but not an actual patch
because NPTL is quite complicated and I wanted to know the Glibc
people were interested, but apparently they were too busy at the time
- benchmarks would have made sense for such a patch.

Scott Snyder started fixing part of Glibc, and that did fix his
instance of this problem so we know the approach works.  But a full
patch for Glibc was not prepared.

The most recent messages under Futex queue_me/get_user ordering,
with a patch from Jakub Jelinek will fix this problem by changing the
kernel.  Yes, you should apply Jakub's most recent patch, message-ID
[EMAIL PROTECTED].

I have not tested the patch, but it looks convincing.

I argued for fixing Glibc on the grounds that the changed kernel
behaviour, or more exactly having Glibc depend on it, loses a certain
semantic property useful for unusual operations on multiple futexes at
the same time.  But I appear to have lost the argument, and Jakub's
latest patch does clean up some cruft quite nicely, with no expected
performance hit.

-- Jamie
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Futex queue_me/get_user ordering

2005-03-20 Thread Jamie Lokier
Ingo Molnar wrote:
> 
> * Jakub Jelinek <[EMAIL PROTECTED]> wrote:
> 
> > The futex man pages that have been around for years (certainly since
> > mid 2002) certainly don't document FUTEX_WAIT as token passing
> > operation, but as atomic operation:
> > 
> > Say http://www.icewalkers.com/Linux/ManPages/futex-2.html
> 
> besides this documented-behavior argument, i dont think futexes should
> be degraded into waitqueues

I give in...

Depending on atomicity makes it impossible for an application, which
is linked with NPTL and Glibc, to write an NPTL-compatible "wait on
two locks" function.

I'm not saying that's a very clean thing to want, but it's a
conceptual loss and I'm disappointed I seem to be the only one
noticing it.

On the other hand, I was mistaken to think it makes it impossible to
write an emulation of synchronous futex() in terms of asynchronous
futex().* In fact it makes it impossible to do so using the existing
FUTEX_FD, but it would be possible if there were a FUTEX_FD2 added
somewhere down the line.

* - The reason you would do this is if you were writing userspace-threading
for any reason, and you had to include an emulation of synchronous
futex() in terms of async futex because there are some libraries
which might run on top of the userspace-threading which use futex
in an application-dependent way.

> - in fact, to solve some of the known
> performance problems the opposite will have to happen: e.g. i believe
> that in the future we'll need to enable the kernel-side futex code to
> actually modify the futex variable. I.e. atomicity of the read in
> FUTEX_WAIT is an absolute must, and is only the first step.

Some of those performance problems can be solved already by better use
of FUTEX_REQUEUE instead of FUTEX_WAKE.

> [ the double-context-switch problem in cond_signal() that Jamie
>   mentioned is precisely one such case: pthread semantics force us that
>   the wakeup of the wakee _must_ happen while still holding the internal
>   lock. So we cannot just delay the wakeup to outside the glibc critical
>   section. This double context-switch could be avoided if the 'release
>   internal lock and wake up wakee' operation could be done atomically
>   within the kernel. (A sane default 'userspace unlock' operation on a
>   machine word could be defined .. e.g. decrement-to-zero.) ]

Did you not see the solution I gave last November, using FUTEX_REQUEUE?

See:

http://lkml.org/lkml/2004/11/29/201

I spent a _lot_ of time figuring it out but everyone was too busy to
confirm that it worked.  It would improve performance in a number of cases.

I hope that it does not get ignored yet again.

There _may_ be cases where more complex futex operations are needed,
but we should try the better algorithms that use the existing futex
operations before adding new ones.

-- Jamie
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Futex queue_me/get_user ordering

2005-03-20 Thread Jamie Lokier
Ingo Molnar wrote:
 
 * Jakub Jelinek [EMAIL PROTECTED] wrote:
 
  The futex man pages that have been around for years (certainly since
  mid 2002) certainly don't document FUTEX_WAIT as token passing
  operation, but as atomic operation:
  
  Say http://www.icewalkers.com/Linux/ManPages/futex-2.html
 
 besides this documented-behavior argument, i dont think futexes should
 be degraded into waitqueues

I give in...

Depending on atomicity makes it impossible for an application, which
is linked with NPTL and Glibc, to write an NPTL-compatible wait on
two locks function.

I'm not saying that's a very clean thing to want, but it's a
conceptual loss and I'm disappointed I seem to be the only one
noticing it.

On the other hand, I was mistaken to think it makes it impossible to
write an emulation of synchronous futex() in terms of asynchronous
futex().* In fact it makes it impossible to do so using the existing
FUTEX_FD, but it would be possible if there were a FUTEX_FD2 added
somewhere down the line.

* - The reason you would do this is if you were writing userspace-threading
for any reason, and you had to include an emulation of synchronous
futex() in terms of async futex because there are some libraries
which might run on top of the userspace-threading which use futex
in an application-dependent way.

 - in fact, to solve some of the known
 performance problems the opposite will have to happen: e.g. i believe
 that in the future we'll need to enable the kernel-side futex code to
 actually modify the futex variable. I.e. atomicity of the read in
 FUTEX_WAIT is an absolute must, and is only the first step.

Some of those performance problems can be solved already by better use
of FUTEX_REQUEUE instead of FUTEX_WAKE.

 [ the double-context-switch problem in cond_signal() that Jamie
   mentioned is precisely one such case: pthread semantics force us that
   the wakeup of the wakee _must_ happen while still holding the internal
   lock. So we cannot just delay the wakeup to outside the glibc critical
   section. This double context-switch could be avoided if the 'release
   internal lock and wake up wakee' operation could be done atomically
   within the kernel. (A sane default 'userspace unlock' operation on a
   machine word could be defined .. e.g. decrement-to-zero.) ]

Did you not see the solution I gave last November, using FUTEX_REQUEUE?

See:

http://lkml.org/lkml/2004/11/29/201

I spent a _lot_ of time figuring it out but everyone was too busy to
confirm that it worked.  It would improve performance in a number of cases.

I hope that it does not get ignored yet again.

There _may_ be cases where more complex futex operations are needed,
but we should try the better algorithms that use the existing futex
operations before adding new ones.

-- Jamie
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Futex queue_me/get_user ordering

2005-03-17 Thread Jamie Lokier
Jakub Jelinek wrote:
> http://www.ussg.iu.edu/hypermail/linux/kernel/0411.2/0953.html
> 
> Your argument in November was that you don't want to slow down the
> kernel and that userland must be able to cope with the
> non-atomicity of futex syscall.

Those were two of them.

But my other main concern is conceptual.

Right now, a futex_wait call is roughly equivalent to to
add_wait_queue, which is quite versatile.

It means anything you can do with one futex, you can extend to
multiple futexes (e.g. waiting on more than one lock), and you can do
asynchronously (e.g. futex_wait can be implemented in userspace as
futex_fd[1] + poll[2], and therefore things like poll-driven state machines
where one of the state machines wants to wait on a lock are possible).

[1] Ulrich was mistaken in his paper to say futex_fd needs to check a word
to be useful; userspace is supposed to check the word after futex_fd
and before polling or waiting on it.  This is more useful because it
extends to multiple futexes.
[2] actually it can't right now because of a flaw in futex_fd's poll
function, but that could be fixed.  The _principle_ is sound.

If you change futex_wait to be "atomic", and then have userspace locks
which _depend_ on that atomicity, it becomes impossible to wait on
multiple of those locks, or make poll-driven state machines which can
wait on those locks.

There are applications and libraries which use futex, not just for
threading but things like database locks in files.

You can do userspace threading and simulate most blocking system calls
by making them non-blocking and using poll).

(I'm not saying anything against NPTL by this, by the way - NPTL is a
very good general purpose library - but there are occasions when an
application wants to do it's own equivalent of simulated blocking
system calls for one reason or another.  My favourite being research
into inter-thread JIT-optimisation in an environment like valgrind).

Right now, in principle, futex_wait is among the system calls which
can be simulated by making it non-blocking (= futex_fd) and using poll()[2].
Which means programs using futex themselves can be subject to interesting
thread optimisations by code which knows nothing about the program
(similar to valgrind..)

If you change futex_wait to be "atomic", then it would be _impossible_
to take a some random 3rd party library which is using that
futex_wait, and convert it's blocking system calls to use poll-driven
state machines instead.

I think taking that away would be a great conceptual loss.

It's not a _huge_ loss, but considering it's only Glibc which is
demanding this and futexes have another property, token-passing, which
Glibc could be using instead - why not use it?

That said, let's look at your patch.

> It would simplify requeue implementation (getting rid of the nqueued
> field),

The change to FUTEX_REQUEUE2 is an improvement :)
nqueued is an abomination, like the rest of FUTEX_REQUEUE2 :)

> @@ -265,7 +264,6 @@ static inline int get_futex_value_locked
>   inc_preempt_count();
>   ret = __copy_from_user_inatomic(dest, from, sizeof(int));
>   dec_preempt_count();
> - preempt_check_resched();
>  
>   return ret ? -EFAULT : 0;
>  }

inc_preempt_count() and dec_preempt_count() aren't needed, as
preemption is disabled by the queue spinlocks.  So
get_futex_value_locked isn't needed any more: with the spinlocks held,
__get_user will do.

> [numerous instances of...]
> + preempt_check_resched();

Not required.  The spin unlocks will do this.

> But with the recent changes to futex.c I think kernel can ensure
> atomicity for free.

I agree it would probably not slow the kernel, but I would _strongly_
prefer that Glibc were fixed to use the token-passing property, if
Glibc is the driving intention behind this patch - instead of this
becoming a semantic that application-level users of futex (like
database and IPC libraries) come to depend on and which can't be
decomposed into a multiple-waiting form.

(I admit that the kernel code does look nicer with
get_futex_value_locked gone, though).

By the way, do you know of Scott Snyder's recent work on fixing Glibc
in this way?  He bumped into one of Glibc's currently broken corner
cases, fixed it (according to the algorithm I gave in November), and
reported that it works fine with the fix.

-- Jamie
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Futex queue_me/get_user ordering

2005-03-17 Thread Jamie Lokier
Jakub Jelinek wrote:
 http://www.ussg.iu.edu/hypermail/linux/kernel/0411.2/0953.html
 
 Your argument in November was that you don't want to slow down the
 kernel and that userland must be able to cope with the
 non-atomicity of futex syscall.

Those were two of them.

But my other main concern is conceptual.

Right now, a futex_wait call is roughly equivalent to to
add_wait_queue, which is quite versatile.

It means anything you can do with one futex, you can extend to
multiple futexes (e.g. waiting on more than one lock), and you can do
asynchronously (e.g. futex_wait can be implemented in userspace as
futex_fd[1] + poll[2], and therefore things like poll-driven state machines
where one of the state machines wants to wait on a lock are possible).

[1] Ulrich was mistaken in his paper to say futex_fd needs to check a word
to be useful; userspace is supposed to check the word after futex_fd
and before polling or waiting on it.  This is more useful because it
extends to multiple futexes.
[2] actually it can't right now because of a flaw in futex_fd's poll
function, but that could be fixed.  The _principle_ is sound.

If you change futex_wait to be atomic, and then have userspace locks
which _depend_ on that atomicity, it becomes impossible to wait on
multiple of those locks, or make poll-driven state machines which can
wait on those locks.

There are applications and libraries which use futex, not just for
threading but things like database locks in files.

You can do userspace threading and simulate most blocking system calls
by making them non-blocking and using poll).

(I'm not saying anything against NPTL by this, by the way - NPTL is a
very good general purpose library - but there are occasions when an
application wants to do it's own equivalent of simulated blocking
system calls for one reason or another.  My favourite being research
into inter-thread JIT-optimisation in an environment like valgrind).

Right now, in principle, futex_wait is among the system calls which
can be simulated by making it non-blocking (= futex_fd) and using poll()[2].
Which means programs using futex themselves can be subject to interesting
thread optimisations by code which knows nothing about the program
(similar to valgrind..)

If you change futex_wait to be atomic, then it would be _impossible_
to take a some random 3rd party library which is using that
futex_wait, and convert it's blocking system calls to use poll-driven
state machines instead.

I think taking that away would be a great conceptual loss.

It's not a _huge_ loss, but considering it's only Glibc which is
demanding this and futexes have another property, token-passing, which
Glibc could be using instead - why not use it?

That said, let's look at your patch.

 It would simplify requeue implementation (getting rid of the nqueued
 field),

The change to FUTEX_REQUEUE2 is an improvement :)
nqueued is an abomination, like the rest of FUTEX_REQUEUE2 :)

 @@ -265,7 +264,6 @@ static inline int get_futex_value_locked
   inc_preempt_count();
   ret = __copy_from_user_inatomic(dest, from, sizeof(int));
   dec_preempt_count();
 - preempt_check_resched();
  
   return ret ? -EFAULT : 0;
  }

inc_preempt_count() and dec_preempt_count() aren't needed, as
preemption is disabled by the queue spinlocks.  So
get_futex_value_locked isn't needed any more: with the spinlocks held,
__get_user will do.

 [numerous instances of...]
 + preempt_check_resched();

Not required.  The spin unlocks will do this.

 But with the recent changes to futex.c I think kernel can ensure
 atomicity for free.

I agree it would probably not slow the kernel, but I would _strongly_
prefer that Glibc were fixed to use the token-passing property, if
Glibc is the driving intention behind this patch - instead of this
becoming a semantic that application-level users of futex (like
database and IPC libraries) come to depend on and which can't be
decomposed into a multiple-waiting form.

(I admit that the kernel code does look nicer with
get_futex_value_locked gone, though).

By the way, do you know of Scott Snyder's recent work on fixing Glibc
in this way?  He bumped into one of Glibc's currently broken corner
cases, fixed it (according to the algorithm I gave in November), and
reported that it works fine with the fix.

-- Jamie
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH/RFC] Futex mmap_sem deadlock

2005-02-23 Thread Jamie Lokier
Olof Johansson wrote:
> How's this? I went with get_val_no_fault(), since it isn't really a
> get_user.*() any more (ptr being passed in), and no_paging is a little
> misleading (not all faults are due to paging).

How ironic: I deliberately didn't choose "no_fault" because that
function *does* take page faults.  It only disables paging operations! :)

-- Jamie
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH/RFC] Futex mmap_sem deadlock

2005-02-23 Thread Jamie Lokier
Linus Torvalds wrote:
> > I suggest putting it into futex.c, and make it an inline function
> > which takes "u32 __user *".
> 
> Agreed, except we've traditionally just made it "int __user *".

The type signatures in futex.c are a bit mixed up - most places say
"int __user *" but sys_futex() says "u32 __user *".  get_futex_key
uses sizeof(u32) to check the address.

-- Jamie
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH/RFC] Futex mmap_sem deadlock

2005-02-23 Thread Jamie Lokier
Olof Johansson wrote:
> On Wed, Feb 23, 2005 at 07:54:06AM -0800, Linus Torvalds wrote:
> 
> > > Otherwise, a preempt attempt in get_user would not be seen
> > > until some future preempt_enable was executed.
> > 
> > True. I guess we should have a "preempt_check_resched()" there too. That's 
> > what "kunmap_atomic()" does too (which is what we rely on in the other 
> > case we do this..)
> 
> Ok, this is getting complex enough to warrant get_user_inatomic(),
> which means adding it to every arch's uaccess.h.
> 
> Below patch does so. Unfortunately I don't have a Viro setup with cross
> compilers for nearly every arch, so I can't make sure it doesn't break
> anything. But since I pasted the same code everywhere it shouldn't.

My turn to say uglee.

Firstly, get_user_inatomic is the wrong name.

"inatomic" in __copy_from_user_inatomic means it's called inside a
non-premptable region (in atomic...).

Your macro get_user_inatomic is _not_ called inside a
non-preemptable region, so it shouldn't be called "inatomic".

(A better name is get_user_no_paging).

Secondly, does this _one_ use (it's not likely to be used elsewhere)
justify copying & pasting the same code into every asm-*/uaccess,
especially when the code is not in any way arch-specific?

I suggest putting it into futex.c, and make it an inline function
which takes "u32 __user *".

-- Jamie
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH/RFC] Futex mmap_sem deadlock

2005-02-23 Thread Jamie Lokier
Olof Johansson wrote:
 On Wed, Feb 23, 2005 at 07:54:06AM -0800, Linus Torvalds wrote:
 
   Otherwise, a preempt attempt in get_user would not be seen
   until some future preempt_enable was executed.
  
  True. I guess we should have a preempt_check_resched() there too. That's 
  what kunmap_atomic() does too (which is what we rely on in the other 
  case we do this..)
 
 Ok, this is getting complex enough to warrant get_user_inatomic(),
 which means adding it to every arch's uaccess.h.
 
 Below patch does so. Unfortunately I don't have a Viro setup with cross
 compilers for nearly every arch, so I can't make sure it doesn't break
 anything. But since I pasted the same code everywhere it shouldn't.

My turn to say uglee.

Firstly, get_user_inatomic is the wrong name.

inatomic in __copy_from_user_inatomic means it's called inside a
non-premptable region (in atomic...).

Your macro get_user_inatomic is _not_ called inside a
non-preemptable region, so it shouldn't be called inatomic.

(A better name is get_user_no_paging).

Secondly, does this _one_ use (it's not likely to be used elsewhere)
justify copying  pasting the same code into every asm-*/uaccess,
especially when the code is not in any way arch-specific?

I suggest putting it into futex.c, and make it an inline function
which takes u32 __user *.

-- Jamie
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH/RFC] Futex mmap_sem deadlock

2005-02-23 Thread Jamie Lokier
Linus Torvalds wrote:
  I suggest putting it into futex.c, and make it an inline function
  which takes u32 __user *.
 
 Agreed, except we've traditionally just made it int __user *.

The type signatures in futex.c are a bit mixed up - most places say
int __user * but sys_futex() says u32 __user *.  get_futex_key
uses sizeof(u32) to check the address.

-- Jamie
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH/RFC] Futex mmap_sem deadlock

2005-02-23 Thread Jamie Lokier
Olof Johansson wrote:
 How's this? I went with get_val_no_fault(), since it isn't really a
 get_user.*() any more (ptr being passed in), and no_paging is a little
 misleading (not all faults are due to paging).

How ironic: I deliberately didn't choose no_fault because that
function *does* take page faults.  It only disables paging operations! :)

-- Jamie
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH/RFC] Futex mmap_sem deadlock

2005-02-22 Thread Jamie Lokier
Linus Torvalds wrote:
> > queue_me(...) etc.
> > current->flags |= PF_MMAP_SEM; <- new
> > ret = get_user(...);
> > current->flags &= PF_MMAP_SEM; <- new
> > /* the rest */
> 
> That is uglee. 
> 
> We really have this already, and it's called "current->preempt". It 
> handles any lock at all, and doesn't add yet another special case to all 
> the architectures.

Ooh, I didn't know current->preempt did that (been away).

>   repeat:
>   down_read(>mm->mmap_sem);
>   get_futex_key(...) etc.
>   queue_me(...) etc.
>   inc_preempt_count();
>   ret = get_user(...);
>   dec_preempt_count();
>   if (unlikely(ret)) {
>   up_read(>mm->mmap_sem);
>   /* Re-do the access outside the lock */
>   ret = get_user(...);
>   if (!ret)
>   goto repeat;
>   return ret;
>   }

That would work.  I like it. :)

Page faults will enter the fault handler twice (i.e. slower), but
that's not really a disadvantage, because a program always references
the memory just before calling futex_wait anyway.  A fault is rare.

There is one small but important error: the "return ret" mustn't just
return.  It must call unqueue_me() just like the code at out_unqueue,
_including_ the conditional "ret = 0", but _excluding_ the up_read().

Alternatively, since it's a rare case, just shuffle the loop around:

down_read(>mm->mmap_sem);
repeat:
get_futex_key(...) etc.
queue_me(...) etc.
inc_preempt_count();
ret = get_user(...);
dec_preempt_count();
if (unlikely(ret)) {
up_read(>mm->mmap_sem);
/* Re-do the access outside the lock */
ret = get_user(...);
down_read(>mm->mmap_sem);
if (!ret)
goto repeat;
goto out_unqueue;
}

-- Jamie
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH/RFC] Futex mmap_sem deadlock

2005-02-22 Thread Jamie Lokier
Olof Johansson wrote:
> > That won't work because the vma lock must be help between key
> > calculation and get_user() - otherwise futex is not reliable.  It
> > would work if the futex key calculation was inside the loop.
> 
> Sure, but that's still true: It's just that the get_user() is done twice
> instead. The semaphore is never released between the key calculation and
> the "real" get_user().

Ah, I didn't look at where the loop is used and didn't think there'd
be _two_ get_user() calls in the fast case.  Not my instinct.

> > A much simpler solution (and sorry for not offering it earlier,
> > because Andrew Morton pointed out this bug long ago, but I was busy), is:
> 
> Either way works for me. Andrew/Linus, got a preference? I'll either
> post my refresh based on Andrews comments, or code up Jamie's
> suggestion.

Yours has a couple of problems.

   1. It'll make futex waits somewhat slower.  One of the nicer features
  of 2.6 futexes is that we got rid of the explicit page table lookup.

   2. It's broken because a page can be paged out by another thread
  after you've forced it in and before the get_user().  We only
  take mmap_sem, not the page table lock.

-- Jamie
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH/RFC] Futex mmap_sem deadlock

2005-02-22 Thread Jamie Lokier
Chris Friesen wrote:
> > down_read(>mm->mmap_sem);
> > get_futex_key(...) etc.
> > queue_me(...) etc.
> > current->flags |= PF_MMAP_SEM; <- new
> > ret = get_user(...);
> > current->flags &= PF_MMAP_SEM; <- new
> > /* the rest */
> 
> Should the second new line be this (with the inverse)?
> 
>   current->flags &= ~PF_MMAP_SEM;

Quiet!  I was trying to sneak in a security hole! :)

-- Jamie
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH/RFC] Futex mmap_sem deadlock

2005-02-22 Thread Jamie Lokier
Andrew Morton wrote:
> > This will quickly lock up, since the futex_wait code dows a
> > down_read(mmap_sem), then a get_user().
> > 
> > The do_page_fault code on ppc64 (as well as other architectures) needs
> > to take the same semaphore for reading. This is all good until the
> > second thread comes into play: Its mmap call tries to take the same
> > semaphore for writing which causes in the do_page_fault down_read()
> > to get stuck. Classic deadlock.
> 
> Yup.  Jamie says that the futex code _has_ to hold mmap_sem across the
> get_user().  I forget (but could probably locate) the details.

It does - the "key" which identifies a futex depends on a vma
calculation, and the vma must not change between the calculation and
the get_user().

> > One attempt to fix this is included below. It works, but I'm not entirely
> > happy with the fact that it's a bit messy solution. If anyone has a
> > better idea for how to solve it I'd be all ears.
> 
> It's fairly sane.  Style-wise I'd be inclined to turn this:
> 
>   down_read(>mm->mmap_sem);
>   while (!check_user_page_readable(current->mm, uaddr1)) {
>   up_read(>mm->mmap_sem);
>   /* Fault in the page through get_user() but discard result */
>   if (get_user(curval, (int __user *)uaddr1) != 0)
>   return -EFAULT;
>   down_read(>mm->mmap_sem);
>   }

That won't work because the vma lock must be help between key
calculation and get_user() - otherwise futex is not reliable.  It
would work if the futex key calculation was inside the loop.

A much simpler solution (and sorry for not offering it earlier,
because Andrew Morton pointed out this bug long ago, but I was busy), is:

In futex.c:

down_read(>mm->mmap_sem);
get_futex_key(...) etc.
queue_me(...) etc.
current->flags |= PF_MMAP_SEM; <- new
ret = get_user(...);
current->flags &= PF_MMAP_SEM; <- new
/* the rest */

And in arch/*/mm/fault.c, replace every one of these:

down_read(>mmap_sem);

up_read(>mmap_sem);

with these:

if (!(current & PF_MMAP_SEM))
down_read(>mmap_sem);

if (!(current & PF_MMAP_SEM))
up_read(>mmap_sem);

-- Jamie
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


  1   2   3   4   5   6   >