from:"Jamie Lokier"

Re: [PATCH] ARM: ftrace: Ensure code modifications are synchronised across all cpus

2012-12-10 Thread Jamie Lokier

Steven Rostedt wrote:
> > Yes, and I think if you do use two 16-bit nops, you can even get rid of all
> > the intermediate `sync' operations (I guess you might want one at the end if
> > you want the call to become visible at a particular point).
> 
> Won't work. We are replacing a 32bit call with a nop. That nop must also
> be 32bits, because we could eventually replace the nop(s) with a 32bit
> call. Basically, we can never allow the second 16bit part ever be the
> next instruction. If the first 16bit nop is executed, and then the task
> gets preempted. The nops get converted to a 32bit call. The task gets
> scheduled again and now is executing the second 16bits of the 32bit call
> and we get unexpected (probably crashing) results.
> 
> By having either a 16bit breakpoint whose handler returns after the
> second 16bit part, or a 16bit jump that simply jumps over the second
> half, then all this should work. When the CPU processes a 32bit
> instruction, it either processes all or none of it, correct?

Sounds good, except what Will wrote a few days ago:

On Fri, 2012-12-07 at 19:02 +0000, Will Deacon wrote:
> For ARMv7, there are small subsets of instructions for ARM and Thumb which
> are guaranteed to be atomic wrt concurrent modification and execution of
> the instruction stream between different processors:
>
> Thumb:  The 16-bit encodings of the B, NOP, BKPT, and SVC instructions.
> ARM:    The B, BL, NOP, BKPT, SVC, HVC, and SMC instructions.

Thumb 32-bit ftrace call isn't in the above list.

Questions: does the above concurrent modification guarantee require
both the old instruction _and_ the new one to be among those listed,
or is it enough to be just the new one (for example when setting a
normal software breakpoint, that would be useful)?  Can it be the old
one and not the new (for example when removing a software breakpoint,
that would be useful)?  Does that subset mean replacing any of the
listed instructions by any of the others is ok, or any of the listed
with another of the same type?

(I guess as a matter of architecture design, it makes sense to
guarantee only a short list, because of occasions when the hardware,
or a software emulation through traps, or a simulation, might read the
instruction memory more than once.)

This is what makes me wonder whether it's safe to replace the 32-bit
mcount call with a 16-bit short jump:

> On Mon, Dec 10, 2012 at 11:04:05AM +0000, Jon Medhurst (Tixy) wrote:
> > So this means for things like kprobes which can modify arbitrary kernel
> > code we are going to need to continue to always use some form of
> > stop_the_whole_system() function?
> >
> > Also, kprobes currently uses patch_text() which only uses stop_machine
> > for Thumb2 instructions which straddle a word boundary, so this needs
> > changing?

Will Deacon replied:
> Yes; if you're modifying instructions other than those mentioned above, then
> you'll need to synchronise the CPUs, update the instructions, perform
> cache-maintenance on the writing CPU and then execute an isb on the
> executing core (this last bit isn't needed if you're going to go through an
> exception return to get back to the new code -- depends on how your
> stop/resume code works).

If I've understood that exchange, it implies that using patch_text()
to replace an instruction not in the list of special ones, with a trap
or jump, isn't ok?  And so it's ok to replace the NOP with a short
branch (since 16-bit "B" is in the list), but it's not ok to replace
16-bit "B" with the 32-bit ftrace call; and the same going the other way?
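
For reference, the two properties this thread keeps coming back to (whether
a Thumb instruction is 16 or 32 bits wide, and whether a 32-bit one
straddles a word boundary) can be tested as in the sketch below.  The
helper names are made up for illustration and are not the kernel's.  The
width test uses the Thumb-2 encoding rule: a first halfword whose bits
[15:11] are 0b11101, 0b11110 or 0b11111 starts a 32-bit instruction.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if this halfword starts a 32-bit Thumb-2
 * instruction, i.e. bits [15:11] are 0b11101, 0b11110 or 0b11111. */
static bool thumb_insn_is_wide(uint16_t first_halfword)
{
    return (first_halfword & 0xe000) == 0xe000 &&
           (first_halfword & 0x1800) != 0;
}

/* Hypothetical helper: true if a 32-bit instruction at addr crosses a
 * 4-byte boundary (halfword aligned but not word aligned). */
static bool thumb32_straddles_word(uintptr_t addr)
{
    return (addr & 3) == 2;
}
```

This says nothing about which replacements are safe; it only classifies
the encodings involved, which is the input to that question.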

Best,
-- Jamie
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 22/25] Generic dynamic per cpu refcounting

2012-11-29 Thread Jamie Lokier

Kent Overstreet wrote:
> On Thu, Nov 29, 2012 at 09:54:47PM +0100, Andi Kleen wrote:
> > > > The regular atomic_t is limited in ways that you are not.
> > > > See my original mail.
> > > 
> > > I don't follow, can you explain?
> > 
> > For most cases the reference count is tied to some object, which are
> > naturally limited by memory size or other physical resources.
> > 
> > But in the asymmetric CPU case with your ref count no such limiter
> > exists.
> 
> It's got exactly the same limit as the old code which used the atomic_t
> - we're limited by the number of threads that can be issuing aio
> syscalls at a time.
> 
> The asymmetry you're talking about _doesn't matter_, individual cpu
> counters wrapping does not affect what the counters all sum to when we
> go to tear down.
> 
> A coworker at lunch actually pointed out to me that the reason this is
> true is just that modular arithmetic is still associative with addition
> and subtraction.

It's just like jiffies.  Everyone understands jiffies arithmetic I hope.
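
To make the jiffies comparison concrete, here is a small sketch.  It is not
the code from the patch; the names and the CPU count are invented, and the
counters are deliberately only 8 bits wide so that wrapping is easy to
provoke.  Gets on one CPU and the matching puts on another leave each
counter far from the true count, yet the sum taken modulo 2^8 still comes
out right.

```c
#include <stdint.h>

#define NR_CPUS_SKETCH 4    /* assumed CPU count, for illustration only */

/* Hypothetical per-CPU counters.  Unsigned 8-bit values wrap modulo 2^8
 * when stored back, which is all the argument needs. */
static uint8_t percpu_count[NR_CPUS_SKETCH];

static void ref_get(int cpu) { percpu_count[cpu]++; }
static void ref_put(int cpu) { percpu_count[cpu]--; }

/* Sum all counters modulo 2^8; wraps of individual counters cancel. */
static uint8_t ref_sum(void)
{
    uint8_t sum = 0;

    for (int cpu = 0; cpu < NR_CPUS_SKETCH; cpu++)
        sum += percpu_count[cpu];
    return sum;
}
```

The only requirement is that the true total fits in the counter width,
which is the same limit a single shared counter of that width would have.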

-- Jamie

Re: [RFC] page-table walkers vs memory order

2012-07-30 Thread Jamie Lokier

Paul E. McKenney wrote:
> > Does some version of gcc, under the options which we insist upon,
> > make such optimizations on any of the architectures which we support?
> 
> Pretty much any production-quality compiler will do double-fetch
> and old-value-reuse optimizations, the former especially on 32-bit
> x86.  I don't know of any production-quality compilers that do value
> speculation, which would make the compiler act like DEC Alpha hardware,
> and I would hope that if this does appear, (1) we would have warning
> and (2) it could be turned off.  But there has been a lot of work on
> this topic, so we would be foolish to rule it out.

GCC documentation for IA-64:

   -msched-ar-data-spec
   -mno-sched-ar-data-spec
       (En/Dis)able data speculative scheduling after reload. This results
       in generation of ld.a instructions and the corresponding check
       instructions (ld.c / chk.a). The default is 'enable'.

I don't know if that results in value speculation of the relevant kind.
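
As an aside on the double-fetch case Paul mentions: the usual defence is to
force exactly one load through a volatile access and then work only on the
local copy.  The macro below is a sketch in the style of the kernel's
ACCESS_ONCE(), with invented names.  It relies on the GNU C __typeof__
extension, and it does nothing about hardware reordering or about value
speculation.

```c
#include <stddef.h>

/* Sketch of an ACCESS_ONCE()-style macro; the name is illustrative. */
#define LOAD_ONCE(x) (*(volatile __typeof__(x) *)&(x))

struct entry { int value; };

/* Shared pointer that another thread may change at any time. */
static struct entry *shared_entry;

/* Load the pointer once, then test and dereference the same copy, so
 * the compiler cannot re-fetch it between the check and the use. */
static int read_value_or(int fallback)
{
    struct entry *e = LOAD_ONCE(shared_entry);

    return e != NULL ? e->value : fallback;
}
```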

-- Jamie

Re: Proposal for "proper" durable fsync() and fdatasync()

2008-02-26 Thread Jamie Lokier

Jörn Engel wrote:
> On Tue, 26 February 2008 15:28:10 +0000, Jamie Lokier wrote:
> > 
> > > One interesting aspect of this comes with COW filesystems like btrfs or
> > > logfs.  Writing out data pages is not sufficient, because those will get
> > > lost unless their referencing metadata is written as well.  So either we
> > > have to call fsync for those filesystems or add another callback and let
> > > filesystems override the default implementation.
> > 
> > Doesn't the ->fsync callback get called in the sys_fdatasync() case,
> > with appropriate arguments?
> 
> My paragraph above was aimed at the sync_file_range() case.  fsync and
> fdatasync do the right thing within the limitations you brought up in
> this thread.  sync_file_range() without further changes will only write
> data pages, not the metadata required to actually access those data
> pages.  This works just fine for non-COW filesystems, which covers all
> currently merged ones.
> 
> With COW filesystems it is currently impossible to do sync_file_range()
> properly.  The problem is orthogonal to yours, I just brought it up
> since you were already mentioning sync_file_range().

You're right.  Though, doesn't normal page writeback enqueue the COW
metadata changes?  If not, how do they get written in a timely
fashion?

-- Jamie

Re: Proposal for "proper" durable fsync() and fdatasync()

2008-02-26 Thread Jamie Lokier

Jeff Garzik wrote:
> Nick Piggin wrote:
> >Anyway, the idea of making fsync/fdatasync etc. safe by default is
> >a good idea IMO, and is a bad bug that we don't do that :(
> 
> Agreed...  it's also disappointing that [unless I'm mistaken] you have 
> to hack each filesystem to support barriers.
> 
> It seems far easier to make sync_blkdev() Do The Right Thing, and 
> magically make all filesystems data-safe.

Well, you need ordered metadata writes, barriers _and_ flushes with
some filesystems.

Merely writing all the data pages then issuing a drive cache flush
won't Do The Right Thing with those filesystems - someone already
mentioned Btrfs, where it won't.

But I agree that your suggestion would make a superb default, for
filesystems which don't provide their own function.

It's not optimal even then.

  Devices: On a software RAID, you ideally don't want to issue flushes
  to all drives if your database did a 1 block commit entry.  (But they
  probably use O_DIRECT anyway, changing the rules again).  But all that
  can be optimised in generic VFS code eventually.  It doesn't need
  filesystem assistance in most cases.

  Apps: don't always want a full flush; sometimes a barrier would do.

-- Jamie

Re: Proposal for "proper" durable fsync() and fdatasync()

2008-02-26 Thread Jamie Lokier

Ric Wheeler wrote:
> >>I was surprised that fsync() doesn't do this already.  There was a lot
> >>of effort put into block I/O write barriers during 2.5, so that
> >>journalling filesystems can force correct write ordering, using disk
> >>flush cache commands.
> >>
> >>After all that effort, I was very surprised to notice that Linux 2.6.x
> >>doesn't use that capability to ensure fsync() flushes the disk cache
> >>onto stable storage.
> >
> >It's surprising you are surprised, given that this [lame] fsync behavior 
> >has remained consistently lame throughout Linux's history.
> 
> Maybe I am confused, but isn't this what fsync() does today whenever 
> barriers are enabled (the fsync() invalidates the drive's write cache).

No, fsync() doesn't always flush the drive's write cache.  It often
does, and I think many people are under the impression it always does,
but it doesn't.

Try this code on ext3:

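    #include <fcntl.h>
    #include <unistd.h>

    int main (void)
    {
        int fd = open ("test_file", O_RDWR | O_CREAT | O_TRUNC, 0666);
        while (1) {
            char byte = 0;
            usleep (100000);            /* 100 ms between writes */
            pwrite (fd, &byte, 1, 0);   /* rewrite byte 0 of the file */
            fsync (fd);
        }
    }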

It will do just over 10 write ops per second on an idle system (13 on
mine), and 1 flush op per second.

That's because ext3 fsync() only does a journal commit when the inode
has changed.  The inode mtime is changed by write only with 1 second
granularity.  Without a journal commit, there's no barrier, which
translates to not flushing disk write cache.

If you add "fchmod (fd, 0644); fchmod (fd, 0664);" between the write
and fsync, you'll see at least 20 write ops and 20 flush ops per
second, and you'll hear the disk seeking more.  That's because the
fchmod dirties the inode, so fsync() writes the inode with a journal
commit.
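
For anyone wanting to reproduce this, the loop and the fchmod variant can be
folded into one helper.  This is only a sketch: the function name, the
bounded iteration count and the error handling are additions for
illustration, and the write and flush op counts still have to be observed
outside the program, at the block layer.

```c
#define _XOPEN_SOURCE 500
#include <fcntl.h>
#include <stdbool.h>
#include <sys/stat.h>
#include <unistd.h>

/* Rewrite byte 0 of path and fsync it, iterations times.  If dirty_inode
 * is set, toggle the mode between write and fsync so the inode is dirty
 * every time.  Returns the number of successful fsync() calls, or -1 if
 * the file cannot be opened. */
static int fsync_loop(const char *path, int iterations, bool dirty_inode,
                      unsigned int delay_us)
{
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0666);
    int synced = 0;

    if (fd < 0)
        return -1;
    for (int i = 0; i < iterations; i++) {
        char byte = 0;

        if (delay_us)
            usleep(delay_us);
        if (pwrite(fd, &byte, 1, 0) != 1)
            break;
        if (dirty_inode) {
            fchmod(fd, 0644);
            fchmod(fd, 0664);
        }
        if (fsync(fd) == 0)
            synced++;
    }
    close(fd);
    return synced;
}
```

Calling fsync_loop("test_file", 100, false, 100000) corresponds to the
first experiment, and the same call with true to the fchmod one.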

It turns out even _that_ is not sufficient according to the kernel
internals.  A journal commit uses an ordered request, which isn't the
same as a flush potentially, it just happens to use flush in this
instance.  I'm not sure if ordered requests are actually implemented
by any drivers at the moment.  If not now, they will be one day.

We could change ext3 fsync() to always do a journal commit, and depend
on the non-existence of block drivers which do ordered (not flush)
barrier requests.  But there's lots of things wrong with that.  Not
least, it sucks performance for database-like applications and virtual
machines, a lot due to unnecessary seeks.  That way lies wrongness.

Rightness is to make fdatasync() work well, with a genuine flush (or
equivalent (see FUA), only when required, and not a mere ordered
barrier), no inode write, and to make sync_file_range()[*] offer the
fancier applications finer controls which reflect what they actually
need.

[*] - or whatever.

-- Jamie

Re: Proposal for "proper" durable fsync() and fdatasync()

2008-02-26 Thread Jamie Lokier

Jörn Engel wrote:
> On Tue, 26 February 2008 20:16:11 +1100, Nick Piggin wrote:
> > Yeah, sync_file_range has slightly unusual semantics and introduce
> > the new concept, "writeout", to userspace (does "writeout" include
> > "in drive cache"? the kernel doesn't think so, but the only way to
> > make sync_file_range "safe" is if you do consider it writeout).
> 
> If sync_file_range isn't safe, it should get replaced by a noop
> implementation.  There really is no point in promising "a little"
> safety.

Sometimes there is a point in "a little" safety.

There's a spectrum of durability (meaning how safely stored the data
is).  In the cases we're imagining, it's application -> main memory
cache -> disk cache -> disk surface.  There are others.

_None_ of those provide perfect safety for your data.  They are a
spectrum, and how far along you want data to be committed before you
say "fine, the data is safe enough for me" depends on your application.

For example, there are users who like to turn _off_ fdatasync() with
their SQL database of choice.  They prefer speed over safety, and they
don't mind losing an hour's data and doing regular backups (we assume
;-) Some blogs fall into this category; who cares if a rare crash
costs you a comment or two and a restore from backup; it's acceptable
for the speed.

There's users who would really like fdatasync() to commit data to the
drive platters, so after their database says "done", they are very
confident that a power failure won't cause committed data to be lost.
Accepting credit cards is more at this end.  So should be anyone using
a virtual machine of any kind without a journalling fs in the guest!

And there's users who like it where it is right now: a compromise,
where a system crash won't lose committed data; but a power failure
might.  (I'm making assumptions about drive behaviour on reset here.)

My problem with fdatasync() at the moment is, I can't choose what I
want from it, and there's no mechanism to give me the safest option.
Most annoyingly, in-kernel filesystems _do_ have a mechanism; it just
isn't exported to userspace.

(A quick aside: fdatasync() et al. are actually used for two
_different_ things.  1: A program says "I've written it", it can say
so with confidence, e.g. announcing email receipt.  2: It's used for
write ordering with write-ahead logging: write, fdatasync, write.
When you tease at the details, efficient implementations of them are
different...  Think SCSI tagged commands versus cache flushes.)

> One interesting aspect of this comes with COW filesystems like btrfs or
> logfs.  Writing out data pages is not sufficient, because those will get
> lost unless their referencing metadata is written as well.  So either we
> have to call fsync for those filesystems or add another callback and let
> filesystems override the default implementation.

Doesn't the ->fsync callback get called in the sys_fdatasync() case,
with appropriate arguments?

With barriers/flushes it certainly makes those a bit more complicated.
You have to flush not just the disks with data pages, but the _other_
disks in a software RAID with data pointer metadata pages, but ideally
not all of them (think database journal commit).

That can be implemented with per-buffer pending-barrier/flush flags
(like I described for pages in the first mail), which are equally
useful when a database-like application uses a block device.

-- Jamie

Re: Proposal for "proper" durable fsync() and fdatasync()

2008-02-26 Thread Jamie Lokier

Jörn Engel wrote:
> On Tue, 26 February 2008 20:16:11 +1100, Nick Piggin wrote:
> > 
> > Yeah, sync_file_range has slightly unusual semantics and introduce
> > the new concept, "writeout", to userspace (does "writeout" include
> > "in drive cache"? the kernel doesn't think so, but the only way to
> > make sync_file_range "safe" is if you do consider it writeout).
> 
> If sync_file_range isn't safe, it should get replaced by a noop
> implementation.  There really is no point in promising "a little"
> safety.
> 
> One interesting aspect of this comes with COW filesystems like btrfs or
> logfs.  Writing out data pages is not sufficient, because those will get
> lost unless their referencing metadata is written as well.  So either we
> have to call fsync for those filesystems or add another callback and let
> filesystems override the default implementation.

fdatasync() is required to write data pages _and_ the necessary
metadata to reference those changed pages (btrfs tree etc.), but not
non-data metadata.

It's the filesystem's responsibility to interpret that correctly.
In-place writes don't need anything else.  Phase-tree style writes do.
Some kinds of logged writes don't.

I'm under the impression that sync_file_range() is a sort of
restricted-range asynchronous fdatasync().

By limiting the range of file data which must be written out, it
becomes more refined for database and filesystem-in-a-file type
applications.  Just as fsync() is more refined than sync() - it's
useful to sync less - same goes for syncing just part of a file.

It's still the filesystem's responsibility to sync data access
metadata appropriately.  It can sync more if it wants, but not less.

That's what I understand by
   sync_file_range(fd, start, length, SYNC_FILE_RANGE_WAIT_BEFORE
   | SYNC_FILE_RANGE_WRITE
   | SYNC_FILE_RANGE_WAIT_AFTER);
Largely because the manual says to use that combination of flags for
an equivalent to fdatasync().
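
Spelled out as a compilable call, that combination looks like the sketch
below.  The wrapper name is invented.  The flags as spelled in <fcntl.h>
are SYNC_FILE_RANGE_WAIT_BEFORE, SYNC_FILE_RANGE_WRITE and
SYNC_FILE_RANGE_WAIT_AFTER; sync_file_range() is Linux-specific and needs
_GNU_SOURCE with glibc.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/types.h>

/* Hypothetical wrapper: wait for earlier write-out of the range, start
 * write-out of its dirty pages, then wait for that to finish.
 * Returns 0 on success, -1 with errno set on failure. */
static int sync_range_and_wait(int fd, off_t start, off_t length)
{
    return sync_file_range(fd, start, length,
                           SYNC_FILE_RANGE_WAIT_BEFORE
                           | SYNC_FILE_RANGE_WRITE
                           | SYNC_FILE_RANGE_WAIT_AFTER);
}
```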

The concept of "write-out" is not defined in the manual.  I'm assuming
it to mean this, as a reasonable guess:

SYNC_FILE_RANGE_WRITE scans all pages in the range, looking for dirty
pages which aren't already queued for write-out.  It marks those with
a "write-out" flag, and starts write I/Os at some unspecified time in
the near future; it can be assumed writes for all the pages will
complete eventually if there's no errors.  When I/O completes on a
page, it cleans the page and also clears the write-out flag.

SYNC_FILE_RANGE_WAIT_AFTER waits until all pages in the range don't
have the "write-out" flag set.

SYNC_FILE_RANGE_WAIT_BEFORE does the same wait, but before marking
pages for write-out.  I don't actually see the point in this.  Isn't a
preceding call with SYNC_FILE_RANGE_WAIT_AFTER equivalent, making
BEFORE a redundant flag?

The manual says it is something to do with data-integrity, but it's
not clear to me what that means.

All this implies that "write-out" flag is a concept userspace can rely
on.  That's not so peculiar: WRITE seems to be equivalent to AIO-style
fdatasync() on a limited range of offsets, and WAIT_AFTER seems to be
equivalent to waiting for any previously issued such ops to complete.

Any data access metadata updates that btrfs must make for fdatasync(),
it must also make for sync_file_range(), for the limited range of
offsets.

-- Jamie

Re: Proposal for "proper" durable fsync() and fdatasync()

2008-02-26 Thread Jamie Lokier

Jeff Garzik wrote:
> [snip huge long proposal]
> 
> Rather than invent new APIs, we should fix the existing ones to _really_ 
> flush data to physical media.

Btw, one reason for the length is the current block request API isn't
sufficient even to make fsync() durable with _no_ new APIs.

It offers ordering barriers only, which aren't enough.  I tried to
explain, discuss some changes and then suggest optimisations.

-- Jamie

Re: Proposal for "proper" durable fsync() and fdatasync()

2008-02-26 Thread Jamie Lokier

Andrew Morton wrote:
> On Tue, 26 Feb 2008 07:26:50 +0000 Jamie Lokier <[EMAIL PROTECTED]> wrote:
> 
> > (It would be nicer if sync_file_range()
> > took a vector of ranges for better elevator scheduling, but let's
> > ignore that :-)
> 
> Two passes:
> 
> Pass 1: shove each of the segments into the queue with
> SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE
> 
> Pass 2: wait for them all to complete and return accumulated result
> with SYNC_FILE_RANGE_WAIT_AFTER

Thanks.

Seems ok, though being able to cork the I/O until the last one would
be a bonus (like TCP_CORK or MSG_MORE...  SYNC_FILE_RANGE_MORE?)

I'm imagining I'd omit the SYNC_FILE_RANGE_WAIT_BEFORE.  Is there a
reason why you have it there?  The man page isn't very enlightening.
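
Written out, the two passes would look something like the sketch below.
The struct and function names are invented for illustration, and
WAIT_BEFORE is kept in pass 1 exactly as Andrew wrote it, pending an answer
to the question above.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>

struct file_range { off_t start; off_t length; };  /* illustrative */

/* Returns 0 if every call succeeded, -1 if any call failed. */
static int sync_ranges(int fd, const struct file_range *r, size_t n)
{
    int result = 0;

    /* Pass 1: queue write-out for each segment */
    for (size_t i = 0; i < n; i++)
        if (sync_file_range(fd, r[i].start, r[i].length,
                            SYNC_FILE_RANGE_WAIT_BEFORE
                            | SYNC_FILE_RANGE_WRITE) != 0)
            result = -1;

    /* Pass 2: wait for all of them and accumulate the result */
    for (size_t i = 0; i < n; i++)
        if (sync_file_range(fd, r[i].start, r[i].length,
                            SYNC_FILE_RANGE_WAIT_AFTER) != 0)
            result = -1;

    return result;
}
```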

-- Jamie
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Proposal for proper durable fsync() and fdatasync()

2008-02-26 Thread Jamie Lokier

Andrew Morton wrote:
 On Tue, 26 Feb 2008 07:26:50 + Jamie Lokier [EMAIL PROTECTED] wrote:
 
  (It would be nicer if sync_file_range()
  took a vector of ranges for better elevator scheduling, but let's
  ignore that :-)
 
 Two passes:
 
 Pass 1: shove each of the segments into the queue with
 SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE
 
 Pass 2: wait for them all to complete and return accumulated result
 with SYNC_FILE_RANGE_WAIT_AFTER

Thanks.

Seems ok, though being able to cork the I/O until the last one would
be a bonus (like TCP_MORE...  SYNC_FILE_RANGE_MORE?)

I'm imagining I'd omit the SYNC_FILE_RANGE_WAIT_BEFORE.  Is there a
reason why you have it there?  The man page isn't very enlightening.

-- Jamie
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Proposal for proper durable fsync() and fdatasync()

2008-02-26 Thread Jamie Lokier

Jeff Garzik wrote:
 [snip huge long proposal]
 
 Rather than invent new APIs, we should fix the existing ones to _really_ 
 flush data to physical media.

Btw, one reason for the length is the current block request API isn't
sufficient even to make fsync() durable with _no_ new APIs.

It offers ordering barriers only, which aren't enough.  I tried to
explain, discuss some changes and then suggest optimisations.

-- Jamie
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Proposal for proper durable fsync() and fdatasync()

2008-02-26 Thread Jamie Lokier

Jörn Engel wrote:
 On Tue, 26 February 2008 20:16:11 +1100, Nick Piggin wrote:
  
  Yeah, sync_file_range has slightly unusual semantics and introduce
  the new concept, writeout, to userspace (does writeout include
  in drive cache? the kernel doesn't think so, but the only way to
  make sync_file_range safe is if you do consider it writeout).
 
 If sync_file_range isn't safe, it should get replaced by a noop
 implementation.  There really is no point in promising a little
 safety.
 
 One interesting aspect of this comes with COW filesystems like btrfs or
 logfs.  Writing out data pages is not sufficient, because those will get
 lost unless their referencing metadata is written as well.  So either we
 have to call fsync for those filesystems or add another callback and let
 filesystems override the default implementation.

fdatasync() is required to write data pages _and_ the necessary
metadata to reference those changed pages (btrfs tree etc.), but not
non-data metadata.

It's the filesystem's responsibility to interpret that correctly.
In-place writes don't need anything else.  Phase-tree style writes do.
Some kinds of logged writes don't.

I'm under the impression that sync_file_range() is a sort of
restricted-range asynchronous fdatasync().

By limiting the range of file date which must be written out, it
becomes more refined for database and filesystem-in-a-file type
applications.  Just as fsync() is more refined than sync() - it's
useful to sync less - same goes for syncing just part of a file.

It's still the filesystem's responsibility to sync data access
metadata appropriately.  It can sync more if it wants, but not less.

That's what I understand by
   sync_file_range(fd, start, length, SYNC_FILE_RANGE_WAIT_BEFORE
   | SYNC_FILE_RANGE_WRITE
   | SYNC_FILE_RANGE_WAIT_AFTER);
Largely because the manual says to use that combination of flags for
an equivalent to fdatasync().
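
A minimal sketch of that call as a C helper, assuming Linux with glibc.
The name sync_range() and the fallback to fdatasync() are my own
illustrative assumptions, not something proposed in this thread; only
sync_file_range() and its three flags come from the manual.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: write out one byte range of a file and wait for
 * it, using the flag combination the manual gives for this purpose.
 * Falls back to fdatasync() on the whole file where sync_file_range()
 * is unavailable.  Returns 0 on success, -1 with errno set on failure.
 */
int sync_range (int fd, off_t start, off_t length)
{
	int rc = sync_file_range (fd, start, length,
				  SYNC_FILE_RANGE_WAIT_BEFORE
				  | SYNC_FILE_RANGE_WRITE
				  | SYNC_FILE_RANGE_WAIT_AFTER);

	/* Assumption: syncing more than asked is acceptable, less is not. */
	if (rc == -1 && (errno == ENOSYS || errno == EINVAL))
		rc = fdatasync (fd);
	return rc;
}
```

Whether this reaches the disk platters or only the drive cache is
exactly the open question; the helper only expresses the range and the
waiting, not the durability level.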

The concept of write-out is not defined in the manual.  I'm assuming
it to mean this, as a reasonable guess:

SYNC_FILE_RANGE_WRITE scans all pages in the range, looking for dirty
pages which aren't already queued for write-out.  It marks those with
a write-out flag, and starts write I/Os at some unspecified time in
the near future; it can be assumed writes for all the pages will
complete eventually if there's no errors.  When I/O completes on a
page, it cleans the page and also clears the write-out flag.

SYNC_FILE_RANGE_WAIT_AFTER waits until all pages in the range don't
have the write-out flag set.

SYNC_FILE_RANGE_WAIT_BEFORE does the same wait, but before marking
pages for write-out.  I don't actually see the point in this.  Isn't a
preceding call with SYNC_FILE_RANGE_WAIT_AFTER equivalent, making
BEFORE a redundant flag?

The manual says it is something to do with data-integrity, but it's
not clear to me what that means.

All this implies that write-out flag is a concept userspace can rely
on.  That's not so peculiar: WRITE seems to be equivalent to AIO-style
fdatasync() on a limited range of offsets, and WAIT_AFTER seems to be
equivalent to waiting for any previously issued such ops to complete.

Any data access metadata updates that btrfs must make for fdatasync(),
it must also make for sync_file_range(), for the limited range of
offsets.

-- Jamie

Re: Proposal for proper durable fsync() and fdatasync()

2008-02-26 Thread Jamie Lokier

Jörn Engel wrote:
> On Tue, 26 February 2008 20:16:11 +1100, Nick Piggin wrote:
> > Yeah, sync_file_range has slightly unusual semantics and introduce
> > the new concept, "writeout", to userspace (does "writeout" include
> > "in drive cache"? the kernel doesn't think so, but the only way to
> > make sync_file_range "safe" is if you do consider it writeout).
>
> If sync_file_range isn't safe, it should get replaced by a noop
> implementation.  There really is no point in promising a little
> safety.

Sometimes there is a point in a little safety.

There's a spectrum of durability (meaning how safely stored the data
is).  In the cases we're imagining, it's application - main memory
cache - disk cache - disk surface.  There are others.

_None_ of those provide perfect safety for your data.  They are a
spectrum, and how far along you want data to be committed before you
say "fine, the data is safe enough for me" depends on your application.

For example, there are users who like to turn _off_ fdatasync() with
their SQL database of choice.  They prefer speed over safety, and they
don't mind losing an hour's data and doing regular backups (we assume
;-) Some blogs fall into this category; who cares if a rare crash
costs you a comment or two and a restore from backup; it's acceptable
for the speed.

There are users who would really like fdatasync() to commit data to the
drive platters, so after their database says "done", they are very
confident that a power failure won't cause committed data to be lost.
Accepting credit cards is more at this end.  So should be anyone using
a virtual machine of any kind without a journalling fs in the guest!

And there are users who like it where it is right now: a compromise,
where a system crash won't lose committed data; but a power failure
might.  (I'm making assumptions about drive behaviour on reset here.)

My problem with fdatasync() at the moment is, I can't choose what I
want from it, and there's no mechanism to give me the safest option.
Most annoyingly, in-kernel filesystems _do_ have a mechanism; it just
isn't exported to userspace.

(A quick aside: fdatasync() et al. are actually used for two
_different_ things.  1: A program says "I've written it", and can say
so with confidence, e.g. announcing email receipt.  2: It's used for
write ordering with write-ahead logging: write, fdatasync, write.
When you tease at the details, efficient implementations of them are
different...  Think SCSI tagged commands versus cache flushes.)
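
To make the second use concrete, here is a rough sketch of the
write, fdatasync, write pattern.  The function name wal_commit() and
the two-file layout are hypothetical, chosen only for illustration; it
assumes fdatasync() really does order and commit the writes, which is
the very thing in question in this thread.

```c
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical write-ahead commit: the log record must reach storage
 * before the in-place update starts, hence the fdatasync() in between.
 * Returns 0 on success, -1 on any short write or sync failure.
 */
int wal_commit (int log_fd, int data_fd, off_t data_off,
		const void *rec, size_t len)
{
	/* 1. Write the intent record at the log's current offset. */
	if (write (log_fd, rec, len) != (ssize_t) len)
		return -1;
	/* 2. Ordering point: the log must be stored before the data. */
	if (fdatasync (log_fd) != 0)
		return -1;
	/* 3. Update the data file in place. */
	if (pwrite (data_fd, rec, len, data_off) != (ssize_t) len)
		return -1;
	/* 4. Durability point: only now may the caller say "done". */
	return fdatasync (data_fd);
}
```

Step 2 only needs ordering (a barrier would do), while step 4 needs a
real commit; that difference is why efficient implementations of the
two uses differ.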

> One interesting aspect of this comes with COW filesystems like btrfs or
> logfs.  Writing out data pages is not sufficient, because those will get
> lost unless their referencing metadata is written as well.  So either we
> have to call fsync for those filesystems or add another callback and let
> filesystems override the default implementation.

Doesn't the ->fsync callback get called in the sys_fdatasync() case,
with appropriate arguments?

With barriers/flushes it certainly makes those a bit more complicated.
You have to flush not just the disks with data pages, but the _other_
disks in a software RAID with data pointer metadata pages, but ideally
not all of them (think database journal commit).

That can be implemented with per-buffer pending-barrier/flush flags
(like I described for pages in the first mail), which are equally
useful when a database-like application uses a block device.

-- Jamie

Re: Proposal for proper durable fsync() and fdatasync()

2008-02-26 Thread Jamie Lokier

Ric Wheeler wrote:
>>> I was surprised that fsync() doesn't do this already.  There was a lot
>>> of effort put into block I/O write barriers during 2.5, so that
>>> journalling filesystems can force correct write ordering, using disk
>>> flush cache commands.
>>>
>>> After all that effort, I was very surprised to notice that Linux 2.6.x
>>> doesn't use that capability to ensure fsync() flushes the disk cache
>>> onto stable storage.
>>
>> It's surprising you are surprised, given that this [lame] fsync behavior
>> has remaining consistently lame throughout Linux's history.
>
> Maybe I am confused, but isn't this is what fsync() does today whenever
> barriers are enabled (the fsync() invalidates the drive's write cache).

No, fsync() doesn't always flush the drive's write cache.  It often
does, and I think many people are under the impression it always does,
but it doesn't.

Try this code on ext3:

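	#include <fcntl.h>
	#include <unistd.h>

	int main (void)
	{
		char byte = 0;
		int fd = open ("test_file", O_RDWR | O_CREAT | O_TRUNC, 0666);
		while (fd >= 0) {
			usleep (100000);	/* 100 ms between writes */
			pwrite (fd, &byte, 1, 0);
			fsync (fd);
		}
		return 1;	/* reached only if open() failed */
	}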

It will do just over 10 write ops per second on an idle system (13 on
mine), and 1 flush op per second.

That's because ext3 fsync() only does a journal commit when the inode
has changed.  The inode mtime is changed by write only with 1 second
granularity.  Without a journal commit, there's no barrier, which
translates to not flushing disk write cache.

If you add "fchmod (fd, 0644); fchmod (fd, 0664);" between the write
and fsync, you'll see at least 20 write ops and 20 flush ops per
second, and you'll hear the disk seeking more.  That's because the
fchmod dirties the inode, so fsync() writes the inode with a journal
commit.

It turns out even _that_ is not sufficient according to the kernel
internals.  A journal commit uses an ordered request, which isn't
necessarily the same as a flush; it just happens to use a flush in this
instance.  I'm not sure if ordered requests are actually implemented
by any drivers at the moment.  If not now, they will be one day.

We could change ext3 fsync() to always do a journal commit, and depend
on the non-existence of block drivers which do ordered (not flush)
barrier requests.  But there's lots of things wrong with that.  Not
least, it sucks performance for database-like applications and virtual
machines, a lot due to unnecessary seeks.  That way lies wrongness.

Rightness is to make fdatasync() work well: a genuine flush (or
equivalent, see FUA) only when required, not a mere ordered barrier,
and no inode write.  And to make sync_file_range()[*] offer the
fancier applications finer controls which reflect what they actually
need.

[*] - or whatever.

-- Jamie

Re: Proposal for proper durable fsync() and fdatasync()

2008-02-26 Thread Jamie Lokier

Jeff Garzik wrote:
> Nick Piggin wrote:
> > Anyway, the idea of making fsync/fdatasync etc. safe by default is
> > a good idea IMO, and is a bad bug that we don't do that :(
>
> Agreed...  it's also disappointing that [unless I'm mistaken] you have
> to hack each filesystem to support barriers.
>
> It seems far easier to make sync_blkdev() Do The Right Thing, and
> magically make all filesystems data-safe.

Well, you need ordered metadata writes, barriers _and_ flushes with
some filesystems.

Merely writing all the data pages then issuing a drive cache flush
won't Do The Right Thing with those filesystems - someone already
mentioned Btrfs, where it won't.

But I agree that your suggestion would make a superb default, for
filesystems which don't provide their own function.

It's not optimal even then.

  Devices: On a software RAID, you ideally don't want to issue flushes
  to all drives if your database did a 1 block commit entry.  (But they
  probably use O_DIRECT anyway, changing the rules again).  But all that
  can be optimised in generic VFS code eventually.  It doesn't need
  filesystem assistance in most cases.

  Apps: don't always want a full flush; sometimes a barrier would do.

-- Jamie

Re: Proposal for proper durable fsync() and fdatasync()

2008-02-26 Thread Jamie Lokier

Jörn Engel wrote:
> On Tue, 26 February 2008 15:28:10 +0000, Jamie Lokier wrote:
> >
> > > One interesting aspect of this comes with COW filesystems like btrfs or
> > > logfs.  Writing out data pages is not sufficient, because those will get
> > > lost unless their referencing metadata is written as well.  So either we
> > > have to call fsync for those filesystems or add another callback and let
> > > filesystems override the default implementation.
> >
> > Doesn't the ->fsync callback get called in the sys_fdatasync() case,
> > with appropriate arguments?
>
> My paragraph above was aimed at the sync_file_range() case.  fsync and
> fdatasync do the right thing within the limitations you brought up in
> this thread.  sync_file_range() without further changes will only write
> data pages, not the metadata required to actually access those data
> pages.  This works just fine for non-COW filesystems, which covers all
> currently merged ones.
>
> With COW filesystems it is currently impossible to do sync_file_range()
> properly.  The problem is orthogonal to your's, I just brought it up
> since you were already mentioning sync_file_range().

You're right.  Though, doesn't normal page writeback enqueue the COW
metadata changes?  If not, how do they get written in a timely
fashion?

-- Jamie

Re: Proposal for "proper" durable fsync() and fdatasync()

2008-02-25 Thread Jamie Lokier

Jeff Garzik wrote:
> Jamie Lokier wrote:
> >By durable, I mean that fsync() should actually commit writes to
> >physical stable storage,
> 
> Yes, it should.

Glad we agree :-)

> >I was surprised that fsync() doesn't do this already.  There was a lot
> >of effort put into block I/O write barriers during 2.5, so that
> >journalling filesystems can force correct write ordering, using disk
> >flush cache commands.
> >
> >After all that effort, I was very surprised to notice that Linux 2.6.x
> >doesn't use that capability to ensure fsync() flushes the disk cache
> >onto stable storage.
> 
> It's surprising you are surprised, given that this [lame] fsync behavior 
> has remaining consistently lame throughout Linux's history.

I was surprised because of the effort put into IDE write barriers to
get it right for in-kernel filesystems, and the messages in 2004
telling concerned users that fsync would use barriers in 2.6, which it
does sometimes but not always.

> [snip huge long proposal]
> 
> Rather than invent new APIs, we should fix the existing ones to _really_ 
> flush data to physical media.
>
> Linux should default to SAFE data storage, and permit users to retain 
> the older unsafe behavior via an option.  It's completely ridiculous 
> that we default to an unsafe fsync.

Well, I agree with you.  Which is why the "new API" I suggested, being
really just an extension of an existing one, allows fsync() to be SAFE
if that's what people want.

To be fair, fsync() is rather overkill for some apps.
sync_file_range() is obviously the right place for fine tuning "less
safe" variations.

> And [anticipating a common response from others] it is completely 
> irrelevant that POSIX fsync(2) permits Linux's current behavior.  The 
> current behavior is unsafe.
> 
> Safety before performance -- ESPECIALLY when it comes to storing user data.

Especially now that people work a lot in guest VMs, where the IDE
barrier stuff doesn't work if the host fdatasync() doesn't work.

Since it happened with Mac OS X, I wouldn't be surprised if changing
fsync() and just that wasn't popular.  Heck, you already get people
asking "how to turn off fsync in PostgreSQL"...  (Haven't those people
heard of transactions...?)

But with changes to sync_file_range() [or whatever... I don't care] to
support database's finely tuned commit needs, and then adoption of
that by database vendors, perhaps nobody will mind fsync() becoming
safe then.

Nobody seems bothered by its performance for other things.

-- Jamie

Proposal for "proper" durable fsync() and fdatasync()

2008-02-25 Thread Jamie Lokier

Dear kernel,

This is a proposal to add "proper" durable fsync() and fdatasync() to Linux.

First the problem, then a proposed solution "with benefits", so to speak.

I need feedback on the details, before implementing anything.  Or
(hopefully) someone else thinks it's very important and does it
themselves :-)

By durable, I mean that fsync() should actually commit writes to
physical stable storage, not just the disk write cache when that is
enabled.  Databases and guest VMs need this, or an equivalent
feature, if they aren't to face occasional corruption after power
failure and perhaps some crashes.

The alternative is to disable the disk write cache.  But that isn't
modern practice or recommendation, since I/O write barriers were
implemented and they are much faster.

I was surprised that fsync() doesn't do this already.  There was a lot
of effort put into block I/O write barriers during 2.5, so that
journalling filesystems can force correct write ordering, using disk
flush cache commands.

After all that effort, I was very surprised to notice that Linux 2.6.x
doesn't use that capability to ensure fsync() flushes the disk cache
onto stable storage.

I noticed this following up discussions on the Qemu mailing list,
about guest VMs and how their IDE flush cache command should translate
to fsync() to avoid data loss.  (For guest VMs, fsync() isn't
necessary if the host machine is fine, and it isn't enough (on Linux
host) if the host machine loses power or the hard disk crashes another
way.)

Then I noticed it again, when I was designing a database engine with
filesystem characteristics.  I thought "how do I ensure ordered
journal writes; can I use fdatasync()?" and was surprised to find the
answer is no, I have to use hacks like calling hdparm, and the authors
of major SQL databases seem to brush the problem under a carpet.

(Interestingly, in the Linux 2.4 patches for write barriers, fsync()
seems to be fine, if a bit slow.)

It isn't the first time this topic has come up:

http://groups.google.com.br/group/linux.kernel/browse_thread/thread/d343e51655b4ac7c/7ee9bca80977c2d1?#7ee9bca80977c2d1
("True fsync() in Linux (on IDE)")

In that thread, it was implied that would be fixed in 2.6.  So I bet
some people are under the illusion that it's fixed in 2.6...

For a while, I've been meaning to bring it up on linux-kernel...

The fsync problem
-----------------

Chris Wedgwood wrote:
> On Mon, Feb 25, 2008 at 08:50:40PM +0000, Jamie Lokier wrote:
>
> > On Linux (and other host OSes), fdatasync() and fsync() don't always
> > commit data to hard storage; they sometimes only commit it to the hard
> > drive cache.
>
> That's a filesystem bug IMO.  People should be able to use f[data]sync
> with some level of confidence or else it's basically pointless.

I agree, I consider it a serious bug, and I would be pleased if
someone paid it some love and attention.

Right now, if you want a reliable database on Linux, you _cannot_
properly depend on fsync() or fdatasync().  Considering how much Linux
is used for critical databases, using these functions, this amazes me.

Also, if you have a guest VM, then the guest's filesystem journalling
is not reliable.  Not only can it lose data on power loss, it can
corrupt the guest filesystem too, due to reordering.  This is contrary
to what people expect, I think.

I'm not sure if a system reset can cause similar loss; I don't know
how disks react to that.

Also, for the person porting ZFS to run on FUSE, same applies...

Linux fsync is faulty in two ways:

   1. Database commits aren't _durable_ against power failure, because
  fsync doesn't flush the disk's cache.  This means data stored
  is not guaranteed to be stored at the expected durability.

   2. It's unsafe for write-ahead logging, because it doesn't really
  guarantee any _ordering_ for the writes at the hard storage
  level.  So aside from losing committed data, it can also corrupt
  structural metadata.

With ext3 it's quite easy to verify that fsync/fdatasync don't always
write a journal entry.  (Apart from looking at the kernel code :-)

Just write some data, fsync(), and observe the number of writes in
/proc/diskstats.  If the current mtime second _hasn't_ changed, the
inode isn't written.  If you write data, say, 10 times a second to the
same place followed by fsync(), you'll see a little more than 10 write
I/Os, and less than 20.
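
If you'd rather not eyeball /proc/diskstats by hand, a small parser
along these lines can sample the write count before and after a test
run.  This is only a sketch: the function names are hypothetical, the
device name you pass in (e.g. "sda") is an assumption about your
system, and it assumes the whole-disk line layout where the eighth
column is the number of completed writes.

```c
#include <stdio.h>
#include <string.h>

/*
 * Parse one /proc/diskstats line.  The first three fields are major,
 * minor and device name; the eighth is the count of completed writes.
 * Returns 0 and stores the count if the line is for `dev`, else -1.
 */
int diskstats_writes (const char *line, const char *dev,
		      unsigned long long *writes)
{
	char name[64];
	unsigned long long w;

	if (sscanf (line, "%*u %*u %63s %*u %*u %*u %*u %llu", name, &w) != 2)
		return -1;
	if (strcmp (name, dev) != 0)
		return -1;
	*writes = w;
	return 0;
}

/* Look up one device in /proc/diskstats.  Returns 0 if found, else -1. */
int read_disk_writes (const char *dev, unsigned long long *writes)
{
	char line[512];
	int rc = -1;
	FILE *f = fopen ("/proc/diskstats", "r");

	if (!f)
		return -1;
	while (rc != 0 && fgets (line, sizeof line, f))
		rc = diskstats_writes (line, dev, writes);
	fclose (f);
	return rc;
}
```

Call read_disk_writes() before and after a run of writes followed by
fsync(), and subtract; on an otherwise idle disk the difference is the
number of write I/Os the test caused.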

By the way, this shows a trick for fixing #2 (ordering): use fchmod()
to toggle the file attributes, and that will force the next fsync() to
write a journal entry, which _does_ issue a write barrier.  If you do
that with each write as above (write, fchmod change, fsync 10 times a
second), you will clearly see more write I/Os, and you'll hear the
disk behaving differently: it's seeking more.

However, even this ugly trick has problems:

  3. Using the fchmod() trick or good fortune, fsync() issues a write
     barrier.  Right now, this does commit data (if the device can

Re: jffs2: -ENOSPC when truncating file?!

2008-02-24 Thread Jamie Lokier

Pavel Machek wrote:
> > You need to write a log entry indicating the new length of the file.
> > There is no space for new log entries.
> > 
> > There is a special case for removal -- 'rm gps.nmea' would work. Perhaps
> > we should add a special case for truncation too, so that it can also use
> > the extra pool of free space.
> 
> Yes, that would be nice. I somehow assumed that truncate can't fail
> for -ENOSPC ... I was trying to actually free some space on the
> filesystem...

Same here!  When I got ENOSPC from truncate, trying to free some
space, I was so surprised (and a bit disappointed) that I assumed
removal could fail too.  So now I'm pleasantly surprised to learn I
can at least remove a file.

It does seem odd that truncate to zero length can fail.  It is
guaranteed to free up one or more data nodes, so there should be
enough space for the size change node, provided GC is invoked when
necessary and the node sizes are compatible for this in corner cases.

-- Jamie

Re: [PATCH] LogFS take three

2007-05-19 Thread Jamie Lokier

David Weinehall wrote:
> > It is also the filesystem that tries to scale logarithmically, as Arnd
> > has noted.  Maybe I should call it Log2 to emphasize this point.  Log1
> > would be horrible scalability.
> 
> So, log2fs...  Sounds great to me.

Why Log2?  Logarithmic scaling is just logarithmic scaling.  Does the
filesystem use 2-ary trees or anything else which gives particular
meaning to 2?

-- Jamie

Re: [PATCH] LogFS take three

2007-05-17 Thread Jamie Lokier

Jörn Engel wrote:
> > Almost all your static functions start with logfs_, why not this one?
> 
> Because after a while I discovered how silly it is to start every
> function with logfs_.  That prefix doesn't add much unless the function
> has global scope.  What I didn't do was remove the prefix from older
> functions.

It's handy when debugging or showing detailed backtraces.  Not that
I'm advocating it (or not), just something I've noticed in other
programs.

-- Jamie

Re: [PATCH] LogFS take three

2007-05-16 Thread Jamie Lokier

Artem Bityutskiy wrote:
> On Wed, 2007-05-16 at 12:34 +0100, Jamie Lokier wrote:
> > Jörn Engel wrote:
> > > On Wed, 16 May 2007 12:54:14 +0800, David Woodhouse wrote:
> > > > Personally I'd just go for 'JFFS3'. After all, it has a better claim to
> > > > the name than either of its predecessors :)
> > > 
> > > Did you ever see akpm's facial expression when he tried to pronounce
> > > "JFFS2"?  ;)
> > 
> > JFFS3 is a good, meaningful name to anyone familiar with JFFS2.
> > 
> > But if akpm can't pronounce it, how about FFFS for faster flash
> > filesystem ;-)
> 
> The problem is that JFFS2 will always be faster in terms of I/O speed
> anyway, just because it does not have to maintain on-flash indexing
> data structures. But yes, it is slow in mount and in building big
> inodes, so the "faster" is confusing.

Is LogFS really slower than JFFS2 in practice?

I would have guessed reads to be a similar speed, tree updates to be a
similar speed to journal updates for sustained non-fsyncing writes,
and the difference unimportant for tiny individual commits whose index
updates are not merged with any other.  I've not thought about it much
though.

-- Jamie

Re: [PATCH] LogFS take three

2007-05-16 Thread Jamie Lokier

Albert Cahalan wrote:
> Please don't forget the immutable bit. ("man lsattr")
> Having both, BSD-style, would be even better.
> The immutable bit is important for working around
> software bugs and "features" that damage files.
> 
> I also can't find xattr support.

Imho,

Given that the filesystem is still 'experimental', I'd concentrate on
getting it stable before worrying about immutable and xattrs unless
they are easy.  (Immutable is easy).

In 13 years of using Linux in all sorts of ways I've yet to use
either feature.  They would be good to have, but stability in a
filesystem is much more useful.

I'm biased of course: if LogFS were stable and well tested, I'd be
using it right now in my embedded thingy - and that doesn't even
bother with uids :-)

-- Jamie

Re: [PATCH] LogFS take three

2007-05-16 Thread Jamie Lokier

Jörn Engel wrote:
> On Wed, 16 May 2007 12:54:14 +0800, David Woodhouse wrote:
> > Personally I'd just go for 'JFFS3'. After all, it has a better claim to
> > the name than either of its predecessors :)
> 
> Did you ever see akpm's facial expression when he tried to pronounce
> "JFFS2"?  ;)

JFFS3 is a good, meaningful name to anyone familiar with JFFS2.

But if akpm can't pronounce it, how about FFFS for faster flash
filesystem ;-)

-- Jamie

Re: dnotify/inotify and vfs questions

2005-08-25 Thread Jamie Lokier

Ian Campbell wrote:
> On Tue, 2005-08-23 at 16:23 +0100, Jamie Lokier wrote:
> > ...
> > if (any_dnotify_or_inotify_events_pending) {
> >     read_dnotify_or_inotify_events();
> >     if (any_events_related_to(file)) {
> >         store_in_userspace_stat_cache(file, stat(file));
> >     }
> > }
> > stat_info = lookup_userspace_stat_cache(file);
> > 
> > Now that's a silly way to save one system call in the fast path by itself.
> 
> I'm not that familiar with inotify internals but doesn't
> read_dnotify_or_inotify_events() or
> any_dnotify_or_inotify_events_pending() involve a syscall?

The fast path is just any_dnotify_or_inotify_events_pending: there
aren't any relevant events pending in the fast path.

There are a few methods of doing this for free per individual stat cache check.

1. Signal handler.

   dnotify: Check a variable set by the signal handler.

            [ There's a small time window between dnotify sending the
            signal and the receiving thread noticing on an SMP system due
            to the IPI, during which the sending task might have another
            way to signal the receiving task that it's finished some
            operation, so this method of using dnotify to invalidate a
            stat cache only has the correct ordering properties on UP
            systems. ]

            This works because dnotify signals are thread-specific,
            so the checking thread will definitely have received the
            signal after the time another process modifies the file.

   inotify: Disappointingly inotify doesn't support SIGIO readiness :(

2. If you have to mix the test with a poll/select/epoll/rtsig fd waiting
   for some other purpose.  For example: a file/web/local server, where the
   constraint is only that each stat() to revalidate a cached response
   appears to happen any time after the beginning of receiving the
   network request is known.

   dnotify: It's free if you were using sigtimedwait anyway for I/O events,
            provided you completely read the queue, or get the signal
            priority right.

   inotify: It's free if you were using poll/select/epoll anyway for I/O
            events, provided in the case of epoll that you completely
            read the queue, or use a two-level queue.

3. Amortising the test over many stat cache checks.

   Even if you must use a system call to check for any pending events,
   for revalidating an object which depends on multiple files, only
   one call is needed for all of the stat cache checks.

   More generally (this is more flexible), you can separate the notion
   of "cache time checkpoint" from "cache validation".  It's enough to
   know that a stat result was valid any time between the checkpoint
   time, and the current time.  That's how I'm implementing the
   file/web/local server case described above in step 2.  Then the
   events only need to be checked once during that time interval, no
   matter how many complex objects are being revalidated.

   It gets slightly more efficient when you have multiple, overlapping
   checkpoint->validation time intervals due to multiple outstanding
   requests being processed concurrently.
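
The inotify variant of method 2 above might look roughly like this.  It
is only a sketch: `wait_for_request`, `request_fd` and `cache_dirty` are
names made up for the illustration, and `inotify_fd` is assumed to have
been created non-blocking (for example with inotify_init1(IN_NONBLOCK)).
The server was going to call poll() anyway, so noticing pending file
events costs no extra system call when there are none.

```c
#include <assert.h>
#include <poll.h>
#include <sys/inotify.h>
#include <unistd.h>

/*
 * Wait for work on request_fd (hypothetical: a socket or pipe the server
 * already polls) while also watching inotify_fd.
 *
 * Returns 1 if a request is ready, 0 if not, -1 on poll() failure.
 * Sets *cache_dirty to 1 if any file events were pending; it is left
 * untouched otherwise.
 */
int wait_for_request(int request_fd, int inotify_fd, int timeout_ms,
                     int *cache_dirty)
{
    struct pollfd fds[2] = {
        { .fd = request_fd, .events = POLLIN },
        { .fd = inotify_fd, .events = POLLIN },
    };
    char buf[4096]
        __attribute__((aligned(__alignof__(struct inotify_event))));

    assert(cache_dirty != NULL);
    if (poll(fds, 2, timeout_ms) < 0)
        return -1;

    if (fds[1].revents & POLLIN) {
        /* Read the queue completely; inotify_fd must be non-blocking. */
        while (read(inotify_fd, buf, sizeof buf) > 0)
            ;
        *cache_dirty = 1;
    }
    return (fds[0].revents & POLLIN) ? 1 : 0;
}
```

A real server would look at each event to invalidate only the affected
cache entries; the single dirty flag just keeps the sketch short.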

As I explained in the previous mail, all this is absolutely pointless
to save one system call.  It's a lot of work for negligible gain.

The point is when it saves lots of calls and userspace logic together,
for things like web page templates and compiled programs, which depend
on many files which can be revalidated in a small number of operations.
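
A minimal sketch of the checkpoint idea from method 3, assuming a
logical clock rather than wall time.  All the names here (`struct dep`,
`struct notifier`, `checkpoint`, `object_valid`) are invented for the
example, and `drain_events` is a placeholder for whatever really reads
the dnotify or inotify queue.

```c
#include <assert.h>
#include <stddef.h>

struct dep {
    int dirty;                 /* set when an event names this file */
};

struct notifier {
    unsigned long now;         /* logical clock, advanced by checkpoint() */
    unsigned long drained_at;  /* clock value when events were last drained */
    unsigned long drains;      /* number of drains so far, for illustration */
};

/* Placeholder: a real version would read the event queue here and set
 * dep.dirty for every file named by an event. */
static void drain_events(struct notifier *n)
{
    n->drains++;
    n->drained_at = n->now;
}

/* Take a checkpoint, e.g. when a request starts to arrive. */
unsigned long checkpoint(struct notifier *n)
{
    return ++n->now;
}

/* An object is valid if, at some time at or after checkpoint cp, none of
 * the files it depends on had changed. */
int object_valid(struct notifier *n, unsigned long cp,
                 struct dep *const *deps, size_t ndeps)
{
    assert(cp <= n->now);
    if (n->drained_at < cp)    /* at most one drain per checkpoint */
        drain_events(n);
    for (size_t i = 0; i < ndeps; i++)
        if (deps[i]->dirty)
            return 0;
    return 1;
}
```

However many objects are revalidated against the same checkpoint, the
queue is drained once; a drain done for a later checkpoint also covers
every earlier one, which is where overlapping requests gain a little.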

-- Jamie

Re: dnotify/inotify and vfs questions

2005-08-23 Thread Jamie Lokier

Asser Femø wrote:
> According to the fcntl manual you can cancel a notification by doing
> fcntl(fd, F_NOTIFY, 0) (ie. sending 0 as the notification mask), but
> looking in the kernel code fcntl_dirnotify() immediately calls
> dnotify_flush() with neither telling the vfs module about it. Is there a
> reason for this?  Otherwise I'd propose calling
> filp->f_op->dir_notify(filp, 0) at some point in this scenario.
> 
> Regarding inotify, inotify_add_watch doesn't seem to pass on the request
> either, which works fine for local filesystem operations as they call
> fsnotify_* functions every time, but that isn't really feasible for
> filesystems like cifs because we'd have to request change notification
> on everything. Are there plans for implementing a mechanism to let vfs
> modules get watch requests too?

On a related note:

dnotify and inotify on local filesystems appear to be synchronous, in
the following rather useful sense:

If you have previously registered for inotify/dnotify events that will
catch a change to a file, and called stat() on the file, then the
following operation:

    ...
    stat_info = stat(file)

may be replaced in userspace code with:

    ...
    if (any_dnotify_or_inotify_events_pending) {
        read_dnotify_or_inotify_events();
        if (any_events_related_to(file)) {
            store_in_userspace_stat_cache(file, stat(file));
        }
    }
    stat_info = lookup_userspace_stat_cache(file);

Now that's a silly way to save one system call in the fast path by itself.

But when the stat_info is a prerequisite for validating cached data --
such as the contents of a file parsed into a data structure -- it can
save a lot of system calls and logical work.
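
For concreteness, here is one way the pseudocode above could be written
against inotify, for a single watched path.  It is a sketch under
assumptions: `struct cached_stat`, `cached_stat_init` and
`cached_stat_get` are hypothetical names, and the `refreshes` counter
exists only to make the behaviour visible.  Note that this version still
issues one non-blocking read() per lookup, so on its own it saves
nothing; the gain only appears when one such check covers many cached
results.

```c
#include <assert.h>
#include <errno.h>
#include <sys/inotify.h>
#include <sys/stat.h>
#include <unistd.h>

#define WATCH_MASK (IN_MODIFY | IN_ATTRIB | IN_CREATE | IN_DELETE | \
                    IN_MOVE | IN_MOVE_SELF | IN_DELETE_SELF)

struct cached_stat {
    int ifd;               /* non-blocking inotify descriptor */
    const char *path;      /* file or directory being cached */
    struct stat st;        /* last stat() result */
    int valid;             /* did the last stat() succeed? */
    unsigned refreshes;    /* number of real stat() calls, for illustration */
};

static void refresh(struct cached_stat *c)
{
    c->refreshes++;
    c->valid = (stat(c->path, &c->st) == 0);
}

/* Returns 0 on success, -1 on error. */
int cached_stat_init(struct cached_stat *c, const char *path)
{
    assert(c != NULL && path != NULL);
    c->path = path;
    c->valid = 0;
    c->refreshes = 0;
    c->ifd = inotify_init1(IN_NONBLOCK | IN_CLOEXEC);
    if (c->ifd < 0)
        return -1;
    /* Register the watch first and stat() second, so that a change
     * cannot slip in unnoticed between the two. */
    if (inotify_add_watch(c->ifd, path, WATCH_MASK) < 0) {
        close(c->ifd);
        c->ifd = -1;
        return -1;
    }
    refresh(c);
    return 0;
}

/* Returns the cached result, re-running stat() only if events arrived. */
const struct stat *cached_stat_get(struct cached_stat *c)
{
    char buf[4096]
        __attribute__((aligned(__alignof__(struct inotify_event))));
    int dirty = 0;
    ssize_t n;

    /* Drain the queue; with a single watch, any event means "stale". */
    do {
        n = read(c->ifd, buf, sizeof buf);
        if (n > 0)
            dirty = 1;
    } while (n > 0 || (n < 0 && errno == EINTR));

    if (dirty)
        refresh(c);
    return c->valid ? &c->st : NULL;
}
```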

For example, an Apache-style path walk which checks for .htaccess, or
a Samba-style path walk which is checking for unsafe symbolic links,
can be reduced from say 20 system calls to zero using this method.

Pre-compiled or pre-parsed programs/scripts/templates/config-files
where all the source files used are prerequisites for invalidating a
cached compiled form, reduces from say 40 system calls to stat() all
the source files, to zero -- that's quite a saving.

It's not just reducing system calls.  The logical tests in userspace
are also skipped, if coded properly, facilitating very quick decisions
about things that depend on files which mostly don't change.
(Cascading structured cache prerequisites...mmm).

Remote dnotify/inotify doesn't _necessarily_ have this synchronous
property.  It may do in some cases, depending on the implementation
(this is subtle...).

So, it would be nice if there was a way to query this... rather than
the tedious method of testing the filesystem type and having a table
of "known local filesystem types" where it's safe to depend on this
property.  Alternatively, a way to specify at dnotify/inotify creation
time that synchronous notifications are required, and have the request
rejected if those can't be provided.
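
The tedious table method would look something like the following.  The
magic numbers are the ones listed in statfs(2), but which filesystems
belong in the table is exactly the thing that can't be queried, so the
list below is an illustrative assumption, not a vetted one; the function
names are made up too.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/vfs.h>    /* statfs(2), Linux-specific */

/* Filesystem types assumed (for this sketch only) to deliver
 * notifications synchronously. */
static const unsigned long known_local_fs[] = {
    0xEF53,        /* ext2/ext3/ext4 */
    0x01021994,    /* tmpfs */
    0x58465342,    /* XFS */
    0x52654973,    /* reiserfs */
    0x3153464a,    /* JFS */
};

/* Returns 1 if f_type is in the table, 0 otherwise. */
int fs_type_is_known_local(unsigned long f_type)
{
    size_t n = sizeof known_local_fs / sizeof known_local_fs[0];

    for (size_t i = 0; i < n; i++)
        if (known_local_fs[i] == f_type)
            return 1;
    return 0;
}

/* Returns 1 or 0 as above, or -1 if statfs() fails. */
int path_is_on_known_local_fs(const char *path)
{
    struct statfs sfs;

    assert(path != NULL);
    if (statfs(path, &sfs) != 0)
        return -1;
    return fs_type_is_known_local((unsigned long)sfs.f_type);
}
```

Anything not in the table, NFS or cifs for example, has to be treated
as "unknown, use plain stat()", which is why a proper query or a
reject-at-registration option would be nicer.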

-- Jamie

Re: [RFC] FUSE permission modell (Was: fuse review bits)

2005-04-17 Thread Jamie Lokier

Eric Van Hensbergen wrote:
> I'd like to second that I think private-namespaces are the right way
> to solve this sort of problem.  It also helps not cluttering the
> global namespace with user-local mounts
> 
> >
> > Shared subtrees and more support in userspace tools is needed before
> > private namespaces can become really useful.
> > 
> 
> I'd like to talk about this a bit more and start driving to a solution
> here.  I've been looking at the namespace code quite a bit and was
> just about to dive in and start checking into adding/fixing certain
> aspects such as stackable namespaces, optional inheritance (changes in
> a parent namespace are reflected in the child but not vice-versa),
> etc.
> 
> One aspect I was thinking about here was a mount flag that would give
> you a new private namespace (if you didn't already have one) for the
> mount (and I guess that would impact any subsequent mounts from the
> user in that shell).  Another option would be a 'newns' style
> system-call, but I'm generally against adding new system calls.
> 
> Shared subtrees are a tricky one.  I know how we would handle it in
> V9FS, but not sure how well that would translate to others
> (essentially we'd re-export the subtree so other users could mount it
> individually -- but that's a very Plan 9 solution and may not be what
> more UNIX-minded folks would want -- we also need to improve our own
> server infrastructure to more efficiently support such a re-export).
> 
> So, to sum up I think private namespaces is the right solution, and
> I'd rather put effort into making it more useful than work-around the
> fact that it's not practical right now.

Have a chat with Al Viro, who has already done some work on shared
mounts and subtrees I think.

-- Jamie

Re: [RFC] FUSE permission modell (Was: fuse review bits)

2005-04-13 Thread Jamie Lokier

Miklos Szeredi wrote:
> > Yet, the results from stat() don't distinguish the number spaces,
> > and "ls" doesn't map the numbers to names properly in the wrong
> > space.
> 
> Well you can use "ls -n".  It's up to the tools to present the
> information you want in the way you want it.  If a tool can't do that,
> tough, but you are not worse off than if the information is not
> available _at_all_.

Well, how do you currently provide access to the information that's
not presentable through stat()?

-- Jamie

Re: [RFC] FUSE permission modell (Was: fuse review bits)

2005-04-13 Thread Jamie Lokier

Miklos Szeredi wrote:
> I have a little project to implement a "userloop" filesystem, which
> works just like "mount -o loop", but you don't need root privs.  This
> is really simple to do with FUSE and UML.

That would be a nice way to implement those rarely used old
filesystems that aren't really needed in the kernel source tree any
more, but which it would be nice to have access to as legacy
filesystem formats.

In other words, migrating old legacy filesystems out of the kernel
tree, into FUSE.

> I don't think that it's far fetched, that in certain situations the
> user _does_ have the right (and usefulness) to do otherwise privileged
> filesystem operations.

It's really a matter of philosophy, as to whether the results of
stat() are just handy information for the user, or are always defined
to mean what you can/can't do with a file.

Local-ssh-into-UML makes more sense for this in some ways, because the
uids/gids inside your tgz files or foreign loop filesystems are not
related to the space of uids/gids of the host system.  Yet, the
results from stat() don't distinguish the number spaces, and "ls"
doesn't map the numbers to names properly in the wrong space.
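
To illustrate the number-space problem: a tool that turns st_uid into a
name can only consult the host's user database, whatever space the
number really came from.  The helper below is a made-up example
(`format_owner` is not from any real tool); it mimics the difference
between "ls -l" and "ls -n".

```c
#include <assert.h>
#include <pwd.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Write the owner of *st into out.  With numeric == 0 the uid is looked
 * up in the host's passwd database, as "ls -l" would; with numeric != 0,
 * or if the lookup fails, the raw number is printed, as "ls -n" would.
 * Returns what snprintf() returns.
 */
int format_owner(const struct stat *st, int numeric, char *out, size_t outlen)
{
    const struct passwd *pw;

    assert(st != NULL && out != NULL);
    pw = numeric ? NULL : getpwuid(st->st_uid);
    if (pw != NULL)
        return snprintf(out, outlen, "%s", pw->pw_name);
    return snprintf(out, outlen, "%lu", (unsigned long)st->st_uid);
}
```

If uid 1000 inside a tgz or a foreign loop image belongs to someone
else entirely, the name printed here is simply whoever has uid 1000 on
the host; nothing in struct stat says otherwise.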

-- Jamie

Re: [RFC] FUSE permission modell (Was: fuse review bits)

2005-04-13 Thread Jamie Lokier

Miklos Szeredi wrote:
> > Look up the rather large linux-kernel & linux-fsdevel thread "silent
> > semantic changes with reiser4" and its followup threads, from last
> > year.
> 
> Wow, it's 700+ messages.  I got through the first 40, and already feel
> dizzy :)

It's easier if you skip the ones by Hans and their immediate followups :)

(Nothing personal, it's that Hans is mostly justifying reiser4's
behaviour, and the posts you really need to read aren't about reiser4).

> > It's already been tried.  You will also find sensible ideas on what
> > semantics it should have to do it properly.
> 
> OK, I understand the "slash -> directory, no-slash -> regular file"
> semantics.
> 
> How do you envision implementing this for "mount directory over file"?

Somewhere deep in that thread is a discussion between Al Viro and
Linus on it.

> A new mount flag indicating that it's only to be followed down if
> there's a slash after the mountpoint?

The new flag would indicate more than that: These mounts should be
detachable in the sense that deleting the file is possible, and
perhaps renamable/linkable too.  That's the stuff Al Viro discusses in
some detail in the big thread.

Ideally we'd like automounting, a bit like the Hurd's translators.
Attached to files (using an xattr or something, and executed with the
uid/gid of the file owner), and also per-user "pattern->action"
options for matching files with a certain type (e.g. tgz/zip/deb/rpm/xml).

But that can be added much later, as it's an orthogonal feature.
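
Just to make the per-user "pattern->action" idea concrete, a table like
this is roughly what I have in mind.  Everything in it is hypothetical:
the rule format, the action names and `action_for` are invented for the
sketch, and only fnmatch(3) is a real interface.

```c
#include <assert.h>
#include <fnmatch.h>
#include <stddef.h>

struct automount_rule {
    const char *pattern;   /* shell-style pattern matched against the name */
    const char *action;    /* hypothetical helper that would serve the mount */
};

/* Example rules only; a real list would come from per-user configuration. */
static const struct automount_rule rules[] = {
    { "*.tgz",    "mount-tar" },
    { "*.tar.gz", "mount-tar" },
    { "*.zip",    "mount-zip" },
    { "*.deb",    "mount-deb" },
    { "*.rpm",    "mount-rpm" },
};

/* Returns the action for the first matching rule, or NULL if none match. */
const char *action_for(const char *filename)
{
    size_t n = sizeof rules / sizeof rules[0];

    assert(filename != NULL);
    for (size_t i = 0; i < n; i++)
        if (fnmatch(rules[i].pattern, filename, 0) == 0)
            return rules[i].action;
    return NULL;
}
```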

-- Jamie

Re: [RFC] FUSE permission modell (Was: fuse review bits)

2005-04-13 Thread Jamie Lokier

Miklos Szeredi wrote:
> > > Aren't there some assumptions in VFS that currently make this
> > > impossible?
> > 
> > I believe it's OK with VFS, but applications would be confused to death.
> > Well, there really is one issue -- dentries have exactly one parent, so
> > what do you do when opening a file with hardlinks as a directory? (In
> > fact IIRC that is what led to all the funny talk about mountpoints,
> > since they don't have this limitation)
> 
> OK, that makes sense.
> 
> It would be quite interesting to see how applications react.  Maybe
> I'll hack something up :)

Look up the rather large linux-kernel & linux-fsdevel thread "silent
semantic changes with reiser4" and its followup threads, from last
year.

It's already been tried.  You will also find sensible ideas on what
semantics it should have to do it properly.

-- Jamie

Re: [RFC] FUSE permission modell (Was: fuse review bits)

2005-04-13 Thread Jamie Lokier

> > > A nice implementation of it in FUSE could push it along a bit :)
> > 
> > Aren't there some assumptions in VFS that currently make this
> > impossible?
> 
> I believe it's OK with VFS, but applications would be confused to death.
> Well, there really is one issue -- dentries have exactly one parent, so
> what do you do when opening a file with hardlinks as a directory? (In
> fact IIRC that is what led to all the funny talk about mountpoints,
> since they don't have this limitation)

Hardlinks aren't a problem when entering a file as if it's a
directory, provided the directory does not contain any hard links to a
parent in the hierarchy.  In other words, as long as it's a directed
acyclic graph.
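
As a sketch of that condition: adding an entry in directory `dir` that
points at `target` keeps the graph acyclic exactly when `dir` cannot
already be reached from `target`.  The tiny adjacency-matrix graph below
is a toy model invented for the example (the kernel does not store
directories this way), and the names are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define MAX_NODES 16

struct dir_graph {
    int nnodes;
    /* edge[a][b] != 0: directory a has an entry naming b */
    unsigned char edge[MAX_NODES][MAX_NODES];
};

/* Depth-first search: can `to` be reached by walking down from `from`? */
static int reachable(const struct dir_graph *g, int from, int to,
                     unsigned char *seen)
{
    if (from == to)
        return 1;
    if (seen[from])
        return 0;
    seen[from] = 1;
    for (int next = 0; next < g->nnodes; next++)
        if (g->edge[from][next] && reachable(g, next, to, seen))
            return 1;
    return 0;
}

/* Returns 1 if linking target into dir leaves the graph acyclic,
 * 0 if it would create a loop (a hard link to a parent). */
int link_keeps_dag(const struct dir_graph *g, int dir, int target)
{
    unsigned char seen[MAX_NODES];

    assert(g->nnodes <= MAX_NODES);
    memset(seen, 0, sizeof seen);
    return !reachable(g, target, dir, seen);
}
```

With a chain 0 -> 1 -> 2, a second link from 0 down to 2 is fine, while
a link from 2 back up to 0 is refused.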

This is trivially always true for virtual directories such as entering
an archive file.  And detachable/movable mountpoints are a nice and
sensible way to implement it.  Some work has actually been done on this.

Experiments with the reiserfs file-as-directory extension showed that
applications are generally ok with it.  It looks like a file, but you
can cd into it or follow a path lookup into it.

Linus had some good ideas on the exact semantics to implement when
doing path lookup on these objects.

-- Jamie

Re: [RFC] FUSE permission modell (Was: fuse review bits)

2005-04-13 Thread Jamie Lokier

Miklos Szeredi wrote:
> > Look up the rather large linux-kernel & linux-fsdevel thread "silent
> > semantic changes with reiser4" and its follow-up threads, from last
> > year.
> 
> Wow, it's 700+ messages.  I got through the first 40, and already feel
> dizzy :)

It's easier if you skip the ones by Hans and their immediate followups :)

(Nothing personal, it's that Hans is mostly justifying reiser4's
behaviour, and the posts you really need to read aren't about reiser4).

> > It's already been tried.  You will also find sensible ideas on what
> > semantics it should have to do it properly.
> 
> OK, I understand the slash -> directory, no-slash -> regular file
> semantics.
> 
> How do you envision implementing this for mount directory over file?

Somewhere deep in that thread is a discussion between Al Viro and
Linus on it.

> A new mount flag indicating that it's only to be followed down if
> there's a slash after the mountpoint?

The new flag would indicate more than that: These mounts should be
detachable in the sense that deleting the file is possible, and
perhaps renamable/linkable too.  That's the stuff Al Viro discusses in
some detail in the big thread.
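
To make the slash rule concrete, here is a rough sketch, not the semantics actually settled in that thread. It assumes paths are already normalised strings, and `view_of` is a hypothetical name: an object that is both file and directory is seen as a regular file when it is the last component with no slash after it, and as a directory as soon as a slash follows.

```python
def view_of(path, dual_object):
    """Say how `dual_object` is seen while looking up `path`.

    `dual_object` is the path of something that is both a file and a
    directory (for example an archive with a mount on top of it).
    Purely illustrative; real lookup works on dentries, not strings.
    """
    if path == dual_object:
        # No slash after the name: plain file semantics.
        return "file"
    if path.startswith(dual_object + "/"):
        # A slash follows the name: follow the mount, directory semantics.
        return "directory"
    return "not traversed"
```

So `view_of("/home/a.tgz", "/home/a.tgz")` gives "file", while both `/home/a.tgz/` and `/home/a.tgz/src/x.c` give "directory".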

Ideally we'd like automounting, a bit like the Hurd's translators.
Attached to files (using an xattr or something, and executed with the
uid/gid of the file owner), and also per-user pattern-action
options for matching files with a certain type (e.g. tgz/zip/deb/rpm/xml).
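
The per-user pattern-action idea might look something like the sketch below. Everything in it is hypothetical: the rule table, the translator names and `pick_translator` are made up for illustration, and nothing like this exists in FUSE or the kernel as far as this thread says.

```python
from fnmatch import fnmatchcase

# Hypothetical per-user rules: first matching pattern wins.
RULES = [
    ("*.tgz", "tar-translator"),
    ("*.tar.gz", "tar-translator"),
    ("*.zip", "zip-translator"),
    ("*.deb", "deb-translator"),
    ("*.rpm", "rpm-translator"),
    ("*.xml", "xml-translator"),
]


def pick_translator(filename, rules=RULES):
    """Return the action for the first pattern matching `filename`, or None."""
    for pattern, action in rules:
        if fnmatchcase(filename, pattern):
            return action
    return None
```

With these rules `pick_translator("backup.tgz")` selects the tar translator and `pick_translator("notes.txt")` selects nothing, so the file stays a plain file.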

But that can be added much later, as it's an orthogonal feature.

-- Jamie

Re: [RFC] FUSE permission modell (Was: fuse review bits)

2005-04-13 Thread Jamie Lokier

Miklos Szeredi wrote:
> I have a little project to implement a userloop filesystem, which
> works just like "mount -o loop", but you don't need root privs.  This
> is really simple to do with FUSE and UML.

That would be a nice way to implement those rarely used old
filesystems that aren't really needed in the kernel source tree any
more, but which it would be nice to have access to as legacy
filesystem formats.

In other words, migrating old legacy filesystems out of the kernel
tree, into FUSE.

> I don't think that it's far-fetched that in certain situations the
> user _does_ have the right (and usefulness) to do otherwise privileged
> filesystem operations.

It's really a matter of philosophy, as to whether the results of
stat() are just handy information for the user, or are always defined
to mean what you can/can't do with a file.

Local-ssh-into-UML makes more sense for this in some ways, because the
uids/gids inside your tgz files or foreign loop filesystems are not
related to the space of uids/gids of the host system.  Yet, the
results from stat() don't distinguish the number spaces, and ls
doesn't map the numbers to names properly in the wrong space.
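
A small sketch of the number-space problem, using made-up passwd data (the names `HOST_PASSWD`, `IMAGE_PASSWD`, `parse_passwd` and `owner_name` are all hypothetical). The same numeric uid names different users depending on which passwd table is consulted, and a bare number from stat() does not say which table applies.

```python
def parse_passwd(text):
    """Build a uid -> login name table from passwd-format text."""
    table = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        table[int(fields[2])] = fields[0]
    return table


def owner_name(uid, table):
    """Map a uid to a name in the given space, falling back to the number."""
    return table.get(uid, str(uid))


# Invented example data: the host and the loop-mounted image disagree
# about who uid 1000 is.
HOST_PASSWD = "root:x:0:0:root:/root:/bin/sh\nalice:x:1000:1000::/home/alice:/bin/sh\n"
IMAGE_PASSWD = "root:x:0:0:root:/root:/bin/sh\nbob:x:1000:1000::/home/bob:/bin/sh\n"

host = parse_passwd(HOST_PASSWD)
image = parse_passwd(IMAGE_PASSWD)
```

An ls run on the host would resolve uid 1000 through `host` and print "alice", although inside the image that uid belongs to "bob".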

-- Jamie

Re: [RFC] FUSE permission modell (Was: fuse review bits)

2005-04-13 Thread Jamie Lokier

Miklos Szeredi wrote:
> > Yet, the results from stat() don't distinguish the number spaces,
> > and ls doesn't map the numbers to names properly in the wrong
> > space.
> 
> Well you can use ls -n.  It's up to the tools to present the
> information you want in the way you want it.  If a tool can't do that,
> tough, but you are not worse off than if the information is not
> available _at_all_.

Well, how do you currently provide access to the information that's
not presentable through stat()?

-- Jamie