Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-31 Thread Jens Axboe
On Wed, May 30 2007, Phillip Susi wrote:
 That would be exactly how I understand Documentation/block/barrier.txt:
 
 In other words, I/O barrier requests have the following two properties.
 1. Request ordering
 ...
 2. Forced flushing to physical medium
 
 So, I/O barriers need to guarantee that requests actually get written
 to non-volatile medium in order. 
 
 I think you misinterpret this, and it probably could be worded a bit 
 better.  The barrier request is about constraining the order.  The 
 forced flushing is one means to implement that constraint.  The other 
 alternative mentioned there is to use ordered tags.  The key part there 
 is requests actually get written to non-volatile medium _in order_, 
 not before the request completes, which would be synchronous IO.

No, Stefan is right, the barrier is both an ordering and integrity
constraint. If a driver completes a barrier request before that request
and previously submitted requests are on STABLE storage, then it
violates that principle. Look at the code and the various ordering
options.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-31 Thread Jens Axboe
On Thu, May 31 2007, David Chinner wrote:
 IOWs, there are two parts to the problem:
 
   1 - guaranteeing I/O ordering
   2 - guaranteeing blocks are on persistent storage.
 
 Right now, a single barrier I/O is used to provide both of these
 guarantees. In most cases, all we really need to provide is 1); the
 need for 2) is a much rarer condition but still needs to be
 provided.
 
  if I am understanding it correctly, the big win for barriers is that you 
 do NOT have to stop and wait until the data is on persistent media before 
  you can continue.
 
 Yes, if we define a barrier to only guarantee 1), then yes this
 would be a big win (esp. for XFS). But that requires all filesystems
 to handle sync writes differently, and sync_blockdev() needs to
 call blkdev_issue_flush() as well
 
 So, what do we do here? Do we define a barrier I/O to only provide
 ordering, or do we define it to also provide persistent storage
 writeback? Whatever we decide, it needs to be documented

The block layer already has a notion of the two types of barriers, with
a very small amount of tweaking we could expose that. There's absolutely
zero reason we can't easily support both types of barriers.

-- 
Jens Axboe



Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-31 Thread David Chinner
On Thu, May 31, 2007 at 08:26:45AM +0200, Jens Axboe wrote:
 On Thu, May 31 2007, David Chinner wrote:
  IOWs, there are two parts to the problem:
  
  1 - guaranteeing I/O ordering
  2 - guaranteeing blocks are on persistent storage.
  
  Right now, a single barrier I/O is used to provide both of these
  guarantees. In most cases, all we really need to provide is 1); the
  need for 2) is a much rarer condition but still needs to be
  provided.
  
   if I am understanding it correctly, the big win for barriers is that you 
   do NOT have to stop and wait until the data is on persistent media before 
   you can continue.
  
  Yes, if we define a barrier to only guarantee 1), then yes this
  would be a big win (esp. for XFS). But that requires all filesystems
  to handle sync writes differently, and sync_blockdev() needs to
  call blkdev_issue_flush() as well
  
  So, what do we do here? Do we define a barrier I/O to only provide
  ordering, or do we define it to also provide persistent storage
  writeback? Whatever we decide, it needs to be documented
 
 The block layer already has a notion of the two types of barriers, with
 a very small amount of tweaking we could expose that. There's absolutely
 zero reason we can't easily support both types of barriers.

That sounds like a good idea - we can leave the existing
WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED
behaviour that only guarantees ordering. The filesystem can then
choose which to use where appropriate

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: [patch 12/41] fs: introduce write_begin, write_end, and perform_write aops

2007-05-31 Thread Andrew Morton
On Thu, 31 May 2007 07:15:39 +0200 Nick Piggin [EMAIL PROTECTED] wrote:

 If you can send that rollup, it would be good. I could try getting
 everything to compile and do some more testing on it too.


Single patch against 2.6.22-rc3: http://userweb.kernel.org/~akpm/np.gz

broken-out: 
ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/mm/broken-out-2007-05-30-09-30.tar.gz


Cross-chunk reference checking time estimates

2007-05-31 Thread Valerie Henson
Hey all,

I altered Karuna's cref tool to print the number of seconds it would
take to check the cross-references for a chunk.  The results look good
for chunkfs: on my laptop /home file system and a 1 GB chunk size, the
per-chunk cross-reference check time would be an average of 5 seconds
and a max of 160 seconds in 2013.  This is calculated assuming average
seek time and rotational latency delay for every cross-reference
checked; some simple batching of I/Os could significantly improve
that.

The tool is a little dodgy on error handling and other edge cases ATM,
but for now, here's the results and the code (attached):

[EMAIL PROTECTED]:~/chunkfs/cref_new$ sudo ./cref.sh /dev/hda3 dump /home 1024
Total size = 19535040 KB
Total data stored = 13998240 KB
Number of files = 445406
Number of directories = 31836
Number of special files = 12156
Size of block groups = 1048576 KB
Inodes per block group = 130304
Intra-file cross references = 63167
Directory-subdirectory references = 429
Directory-file references = 2381
Total directory cross references = 2810
Total cross references = 65977
Total cross references = 65977
Average cross references per group = 439
Maximum cross references in a group = 13997
Max group is 4 (0:3, 1:46, 2:282, 3:4996, 5:8445, 6:2, 7:1, 8:27, 9:1, 10:2, 
12:1, 13:51, 14:32, 15:99, 16:2, 17:5, 18:2, )
Average additional time to check cross references = 6.77 s
Max additional time to check cross references = 215.55 s
2013 average additional time to check cross references = 4.93 s
2013 max additional time to check cross references = 156.77 s

Questions?  Come talk on #linuxfs at irc.oftc.net.

-VAL


cref_new.tar.gz
Description: GNU Zip compressed data


Re: [PATCH] AFS: Implement file locking [try #2]

2007-05-31 Thread David Howells
J. Bruce Fields [EMAIL PROTECTED] wrote:

  Yes.  I need to get the server lock first, before going to the VFS locking
  routines.
 
 That doesn't really answer the question.  The NFS client has similar
 requirements, but it doesn't have to duplicate the per-inode lists of
 granted locks, for example.

Actually, it might...  It's just that they're in the lock manager server, not
in the NFS client.

As far as I can tell, NFS passes each lock request individually to the lock
manager server, which grants them individually.  AFS doesn't do that.

David


Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-31 Thread Stefan Bader

2007/5/30, Phillip Susi [EMAIL PROTECTED]:

Stefan Bader wrote:

 Since drive a supports barrier requests we don't get -EOPNOTSUPP, but
 the request with block y might get written before block x since the
 disks are independent. I guess the chances of this are quite low since
 at some point a barrier request will also hit drive b, but for the time
 being it might be better to indicate -EOPNOTSUPP right from
 device-mapper.

The device mapper needs to ensure that ALL underlying devices get a
barrier request when one comes down from above, even if it has to
construct zero length barriers to send to most of them.



And somehow also make sure all of the barriers have been processed
before returning the barrier that came in. Plus it would have to queue
all mapping requests until the barrier is done (if strictly acting
according to barrier.txt).

But I am wondering a bit whether the requirements on barriers are
really as tight as described in Tejun's document (a barrier request is
only started once everything before it is safe, the barrier itself
isn't completed until it is safe too, and no request after the barrier
is started before the barrier is done). Is it really necessary to
defer any further requests until the barrier has been written to safe
storage? Or would it be sufficient to guarantee that, if a barrier
request returns, everything up to and including the barrier is on safe
storage?

Stefan


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-31 Thread Bill Davidsen

Neil Brown wrote:

On Monday May 28, [EMAIL PROTECTED] wrote:
  

There are two things I'm not sure you covered.

First, disks which don't support flush but do have a cache dirty 
status bit you can poll at times like shutdown. If there are no drivers 
which support these, it can be ignored.



There are really devices like that?  So to implement a flush, you have
to stop sending writes and wait and poll - maybe poll every
millisecond?
  


Yes, there really are (or were). But I don't think that there are 
drivers, so it's not an issue.

That wouldn't be very good for performance ... maybe you just
wouldn't bother with barriers on that sort of device?
  


That is why there are no drivers...

Which reminds me:  What is the best way to turn off barriers?
Several filesystems have -o nobarriers or -o barriers=0,
or the inverse.
  


If they can function usefully without, the admin gets to make that choice.

md/raid currently uses barriers to write metadata, and there is no
way to turn that off.  I'm beginning to wonder if that is best.
  


I don't see how you can have reliable operation without it, particularly 
WRT bitmap.

Maybe barrier support should be a function of the device.  i.e. the
filesystem or whatever always sends barrier requests where it thinks
it is appropriate, and the block device tries to honour them to the
best of its ability, but if you run
   blockdev --enforce-barriers=no /dev/sda
then you lose some reliability guarantees, but gain some throughput (a
bit like the 'async' export option for nfsd).

  
Since this is device dependent, it really should be in the device 
driver, and requests should have status of success, failure, or feature 
unavailability.




Second, NAS (including nbd?). Is there enough information to handle this really right?



NAS means lots of things, including NFS and CIFS where this doesn't
apply.
  


Well, we're really talking about network attached devices rather than 
network filesystems. I guess people do lump them together.



For 'nbd', it is entirely up to the protocol.  If the protocol allows
a barrier flag to be sent to the server, then barriers should just
work.  If it doesn't, then either the server disables write-back
caching, or flushes every request, or you lose all barrier
guarantees. 
  


Pretty much agrees with what I said above, it's at a level closer to the 
device, and status should come back from the physical i/o request.

For 'iscsi', I guess it works just the same as SCSI...
  


Hopefully.

--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979



Re: [patch 0/2] i_version update

2007-05-31 Thread Trond Myklebust
On Thu, 2007-05-31 at 10:01 +1000, Neil Brown wrote:

 This will provide a change number that normally changes only when the
 file changes and doesn't require any extra storage on disk.
 The change number will change inappropriately only when the inode has
 fallen out of cache and is being reloaded, which is either after a crash
 (hopefully rare) or when a file hasn't been used for a while, implying
 that it is unlikely that any client has it in cache.

It will also change inappropriately if the server is under heavy load
and needs to reclaim memory by tossing out inodes that are cached and
still in use by the clients. That change will trigger clients to
invalidate their caches and to refetch the data from the server, further
cranking up the load.

Not an ideal solution...

Trond



Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-31 Thread Bill Davidsen

Jens Axboe wrote:

On Thu, May 31 2007, David Chinner wrote:
  

On Thu, May 31, 2007 at 08:26:45AM +0200, Jens Axboe wrote:


On Thu, May 31 2007, David Chinner wrote:
  

IOWs, there are two parts to the problem:

1 - guaranteeing I/O ordering
2 - guaranteeing blocks are on persistent storage.

Right now, a single barrier I/O is used to provide both of these
guarantees. In most cases, all we really need to provide is 1); the
need for 2) is a much rarer condition but still needs to be
provided.


if I am understanding it correctly, the big win for barriers is that you 
do NOT have to stop and wait until the data is on persistent media before 
you can continue.
  

Yes, if we define a barrier to only guarantee 1), then yes this
would be a big win (esp. for XFS). But that requires all filesystems
to handle sync writes differently, and sync_blockdev() needs to
call blkdev_issue_flush() as well

So, what do we do here? Do we define a barrier I/O to only provide
ordering, or do we define it to also provide persistent storage
writeback? Whatever we decide, it needs to be documented


The block layer already has a notion of the two types of barriers, with
a very small amount of tweaking we could expose that. There's absolutely
zero reason we can't easily support both types of barriers.
  

That sounds like a good idea - we can leave the existing
WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED
behaviour that only guarantees ordering. The filesystem can then
choose which to use where appropriate



Precisely. The current definition of barriers is what Chris and I came
up with many years ago, when solving the problem for reiserfs
originally. It is by no means the only feasible approach.

I'll add a WRITE_ORDERED command to the #barrier branch, it already
contains the empty-bio barrier support I posted yesterday (well a
slightly modified and cleaned up version).

  
Wait. Do filesystems expect (depend on) anything but ordering now? Does 
md? Having users of barriers as they currently behave suddenly getting 
SYNC behavior where they expect ORDERED is likely to have a negative 
effect on performance. Or do I misread what is actually guaranteed by 
WRITE_BARRIER now, and a flush is currently happening in all cases?


And will this also be available to user space f/s, since I just proposed 
a project which uses one? :-(
I think the goal is good, more choice is almost always better choice, I 
just want to be sure there won't be big disk performance regressions.


--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979



Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-31 Thread Jens Axboe
On Thu, May 31 2007, Bill Davidsen wrote:
 Jens Axboe wrote:
 On Thu, May 31 2007, David Chinner wrote:
   
 On Thu, May 31, 2007 at 08:26:45AM +0200, Jens Axboe wrote:
 
 On Thu, May 31 2007, David Chinner wrote:
   
 IOWs, there are two parts to the problem:
 
   1 - guaranteeing I/O ordering
   2 - guaranteeing blocks are on persistent storage.
 
 Right now, a single barrier I/O is used to provide both of these
 guarantees. In most cases, all we really need to provide is 1); the
 need for 2) is a much rarer condition but still needs to be
 provided.
 
 
 if I am understanding it correctly, the big win for barriers is that 
 you do NOT have to stop and wait until the data is on persistent media 
 before you can continue.
   
 Yes, if we define a barrier to only guarantee 1), then yes this
 would be a big win (esp. for XFS). But that requires all filesystems
 to handle sync writes differently, and sync_blockdev() needs to
 call blkdev_issue_flush() as well
 
 So, what do we do here? Do we define a barrier I/O to only provide
 ordering, or do we define it to also provide persistent storage
 writeback? Whatever we decide, it needs to be documented
 
 The block layer already has a notion of the two types of barriers, with
 a very small amount of tweaking we could expose that. There's absolutely
 zero reason we can't easily support both types of barriers.
   
 That sounds like a good idea - we can leave the existing
 WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED
 behaviour that only guarantees ordering. The filesystem can then
 choose which to use where appropriate
 
 
 Precisely. The current definition of barriers is what Chris and I came
 up with many years ago, when solving the problem for reiserfs
 originally. It is by no means the only feasible approach.
 
 I'll add a WRITE_ORDERED command to the #barrier branch, it already
 contains the empty-bio barrier support I posted yesterday (well a
 slightly modified and cleaned up version).
 
   
 Wait. Do filesystems expect (depend on) anything but ordering now? Does 
 md? Having users of barriers as they currently behave suddenly getting 
 SYNC behavior where they expect ORDERED is likely to have a negative 
 effect on performance. Or do I misread what is actually guaranteed by 
 WRITE_BARRIER now, and a flush is currently happening in all cases?

See the above stuff you quote, it's answered there. It's not a change,
this is how the Linux barrier write has always worked since I first
implemented it. What David and I are talking about is adding a more
relaxed version as well, that just implies ordering.

 And will this also be available to user space f/s, since I just proposed 
 a project which uses one? :-(

I see several uses for that, so I'd hope so.

 I think the goal is good, more choice is almost always better choice, I 
 just want to be sure there won't be big disk performance regressions.

We can't get more heavyweight than the current barrier; it's about as
conservative as you can get.

-- 
Jens Axboe



Re: [RFC] obsoleting /etc/mtab

2007-05-31 Thread Miklos Szeredi
  
  (2) needs work in the filesystems implicated.  I already have patches
  for ext2, ext3, tmpfs, devpts and hostfs, and it would be nice if the
  maintainers for others could help out.
  
 
 A lot of these could be fixed all at once by letting the filesystem tell
 the VFS to retain the string passed to the original mount.  That will
 solve *almost* all filesystems which take string options.

On remount some filesystems like ext[234] use the given options as a
delta, not as the new set of options.  Others just ignore some of the
options on remount.

Yes, /etc/mtab is broken wrt. remount too, but somehow I feel breaking
/proc/mounts this way too would be frowned upon.

 On the other hand, maybe it's cleaner to present a canonical view of the
 options.  Note that /etc/mtab does not, however.

Yes, we could emulate /etc/mtab like this for filesystems which don't
have a ->show_options(), but most filesystems do have
->show_options(); they are just lazy about updating it with all the
new mount options.

Miklos


Re: [PATCH resend] introduce I_SYNC

2007-05-31 Thread Dave Kleikamp
On Thu, 2007-05-31 at 16:25 +0200, Jörn Engel wrote:
 --- linux-2.6.21logfs/fs/jfs/jfs_txnmgr.c~I_LOCK   2007-05-07 10:28:55.0 +0200
 +++ linux-2.6.21logfs/fs/jfs/jfs_txnmgr.c   2007-05-29 13:10:32.0 +0200
 @@ -1286,7 +1286,14 @@ int txCommit(tid_t tid,  /* transaction
           * commit the transaction synchronously, so the last iput
           * will be done by the calling thread (or later)
           */
 -        if (tblk->u.ip->i_state & I_LOCK)
 +        /*
 +         * I believe this code is no longer needed.  Splitting I_LOCK
 +         * into two bits, I_LOCK and I_SYNC should prevent this
 +         * deadlock as well.  But since I don't have a JFS testload
 +         * to verify this, only a trivial s/I_LOCK/I_SYNC/ was done.
 +         * Joern
 +         */
 +        if (tblk->u.ip->i_state & I_SYNC)
              tblk->xflag &= ~COMMIT_LAZY;
      }

I think the code is still needed, and I think this change is correct.
The deadlock that this code is avoiding is caused by clear_inode()
calling wait_on_inode().  Since clear_inode() now calls
inode_sync_wait(inode), we want to avoid lazily committing this
transaction when the I_SYNC flag is set.

Unfortunately, recreating the deadlock is hard, and I haven't been able
to recreate it with this code commented out.

Thanks,
Shaggy
-- 
David Kleikamp
IBM Linux Technology Center



Re: [patch 0/2] i_version update

2007-05-31 Thread Mingming Cao
On Thu, 2007-05-31 at 10:33 +1000, David Chinner wrote:
 On Wed, May 30, 2007 at 04:32:57PM -0700, Mingming Cao wrote:
  On Wed, 2007-05-30 at 10:21 +1000, David Chinner wrote:
   On Fri, May 25, 2007 at 06:25:31PM +0200, Jean noel Cordenner wrote:
Hi,

This is an update of the i_version patch.
The i_version field is a 64bit counter that is set on every inode
creation and that is incremented every time the inode data is modified
(similarly to the ctime time-stamp).
   
   My understanding (please correct me if I'm wrong) is that the
   requirements are much more rigorous than simply incrementing an
   in-memory counter on every change.  i.e. this counter has to
   survive server crashes intact so clients never see the counter go
   backwards. That means version number changes need to be journalled
   along with the operation that caused the change of the version
   number.
   
  Yeah, the i_version is the in-memory counter. From the patch it looks
  like the counter is being updated inside ext4_mark_iloc_dirty(), so it
  is being journalled and being flush to on-disk ext4 inode structure
  immediately (On-disk ext4 inode structure is being modified/expanded to
  store the counter in the first patch). 
 
 Ok, that catches most things (I missed that), but the version number still
 needs to change on file data changes, right? So if we are overwriting the
 file, we're calling __mark_inode_dirty(I_DIRTY_PAGES) which means you don't
 get the callout and so the version number doesn't change or get logged. In
 that case, the version number is not doing what it needs to do, right?
 

Hmm, maybe I missed something... but looking at the code again, in the
case of overwrite (file data updated), it seems the ctime/mtime is being
updated and the inode is being dirtied, so the version number is being
updated.

 vfs_write() -> ...
   -> __generic_file_aio_write_nolock()
     -> file_update_time()
       -> mark_inode_dirty_sync()
         -> __mark_inode_dirty(I_DIRTY_SYNC)
           -> ext4_dirty_inode()
             -> ext4_mark_inode_dirty()

Regards,
Mingming



Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-31 Thread Phillip Susi

David Chinner wrote:
you are understanding barriers to be the same as synchronous writes (and 
therefore the data is on persistent media before the call returns)


No, I'm describing the high level behaviour that is expected by
a filesystem. The reasons for this are below


You say no, but then you go on to contradict yourself below.


Ok, that's my understanding of how *device based barriers* can work,
but there's more to it than that. As far as the filesystem is
concerned the barrier write needs to *behave* exactly like a sync
write because of the guarantees the filesystem has to provide
userspace. Specifically - sync, sync writes and fsync.


There, you just ascribed the synchronous property to barrier requests. 
This is false.  Barriers are about ordering, synchronous writes are 
another thing entirely.  The filesystem is supposed to use barriers to 
maintain ordering for journal data.  If you are trying to handle a 
synchronous write request, that's another flag.



This is the big problem, right? If we use barriers for commit
writes, the filesystem can return to userspace after a sync write or
fsync() and an *ordered barrier device implementation* may not have
written the blocks to persistent media. If we then pull the plug on
the box, we've just lost data that sync or fsync said was
successfully on disk. That's BAD.


That's why for synchronous writes, you set the flag to mark the request 
as synchronous, which has nothing at all to do with barriers.  You are 
trying to use barriers to solve two different problems.  Use one flag to 
indicate ordering, and another to indicate synchronicity.



Right now a barrier write on the last block of the fsync/sync write
is sufficient to prevent that because of the FUA on the barrier
block write. A purely ordered barrier implementation does not
provide this guarantee.


This is a side effect of the implementation of the barrier, not part of 
the semantics of barriers, so you shouldn't rely on this behavior.  You 
don't have to use FUA to handle the barrier request, and if you don't, 
then the request can be completed while the data is still in the write 
cache.  You just have to make sure to flush it before any subsequent 
requests.



IOWs, there are two parts to the problem:

1 - guaranteeing I/O ordering
2 - guaranteeing blocks are on persistent storage.

Right now, a single barrier I/O is used to provide both of these
guarantees. In most cases, all we really need to provide is 1); the
need for 2) is a much rarer condition but still needs to be
provided.


Yep... two problems... two flags.


Yes, if we define a barrier to only guarantee 1), then yes this
would be a big win (esp. for XFS). But that requires all filesystems
to handle sync writes differently, and sync_blockdev() needs to
call blkdev_issue_flush() as well

So, what do we do here? Do we define a barrier I/O to only provide
ordering, or do we define it to also provide persistent storage
writeback? Whatever we decide, it needs to be documented


We do the former or we end up in the same boat as O_DIRECT; where you 
have one flag that means several things, and no way to specify you only 
need some of those and not the others.





Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-31 Thread Phillip Susi

David Chinner wrote:

That sounds like a good idea - we can leave the existing
WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED
behaviour that only guarantees ordering. The filesystem can then
choose which to use where appropriate


So what if you want a synchronous write, but DON'T care about the order?
They need to be two completely different flags which you can choose
to combine, or use individually.




Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-31 Thread Phillip Susi

Jens Axboe wrote:

No, Stefan is right, the barrier is both an ordering and integrity
constraint. If a driver completes a barrier request before that request
and previously submitted requests are on STABLE storage, then it
violates that principle. Look at the code and the various ordering
options.


I am saying that is the wrong thing to do.  Barrier should be about 
ordering only.  So long as the order they hit the media is maintained, 
the order the requests are completed in can change.  barrier.txt bears 
this out:


Requests in ordered sequence are issued in order, but not required to 
finish in order.  Barrier implementation can handle out-of-order 
completion of ordered sequence.  IOW, the requests MUST be processed in 
order but the hardware/software completion paths are allowed to reorder 
completion notifications - eg. current SCSI midlayer doesn't preserve 
completion order during error handling.





Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-31 Thread Jens Axboe
On Thu, May 31 2007, Phillip Susi wrote:
 Jens Axboe wrote:
 No, Stefan is right, the barrier is both an ordering and integrity
 constraint. If a driver completes a barrier request before that request
 and previously submitted requests are on STABLE storage, then it
 violates that principle. Look at the code and the various ordering
 options.
 
 I am saying that is the wrong thing to do.  Barrier should be about 
 ordering only.  So long as the order they hit the media is maintained, 
 the order the requests are completed in can change.  barrier.txt bears 

But you can't guarantee ordering without flushing the data out as well.
It all depends on the type of cache on the device, of course. If you
look at an ordinary SATA/IDE drive with write-back caching, you can't
just issue the requests in order and pray that the drive cache will
make it to the platter.

If you don't have write-back caching, or if the cache is battery-backed
and thus guaranteed never to be lost, maintaining order is enough.

Or if the drive can do ordered queued commands, you can relax the
flushing (again depending on the cache type, you may need to take
different paths).

 Requests in ordered sequence are issued in order, but not required to 
 finish in order.  Barrier implementation can handle out-of-order 
 completion of ordered sequence.  IOW, the requests MUST be processed in 
 order but the hardware/software completion paths are allowed to reorder 
 completion notifications - eg. current SCSI midlayer doesn't preserve 
 completion order during error handling.

If you carefully re-read that paragraph, it just tells you that the
software implementation can deal with reordered completions. It doesn't
relax the constraints on ordering and integrity AT ALL.
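The distinction can be made concrete with a toy model (illustrative Python only; the names are invented and nothing here is kernel code): with a write-back cache, "completed" does not mean "on media", so a journal-then-commit sequence is only safe if a flush sits between the two writes.

```python
import random

class WriteBackDrive:
    """Toy drive: completed writes sit in a volatile cache;
    only flush() moves them to stable media."""
    def __init__(self):
        self.cache, self.media = [], []

    def write(self, block):
        self.cache.append(block)        # completes immediately, NOT durable

    def flush(self):                    # the integrity half of a barrier
        self.media += self.cache
        self.cache = []

    def power_loss(self):
        # An arbitrary subset of cached writes may have been destaged.
        self.media += [b for b in self.cache if random.random() < 0.5]
        self.cache = []

def commit_is_unsafe(drive, use_barrier):
    drive.write("journal")
    if use_barrier:
        drive.flush()                   # journal stable before commit issued
    drive.write("commit-record")
    drive.power_loss()
    # Disaster case: the commit record hit media but the journal did not.
    return "commit-record" in drive.media and "journal" not in drive.media

random.seed(0)
assert not any(commit_is_unsafe(WriteBackDrive(), True) for _ in range(1000))
assert any(commit_is_unsafe(WriteBackDrive(), False) for _ in range(1000))
```

With the flush, the disaster ordering is impossible by construction; without it, issue order alone proves nothing about media order after power loss.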

-- 
Jens Axboe



Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-31 Thread Jens Axboe
On Thu, May 31 2007, Phillip Susi wrote:
 David Chinner wrote:
 That sounds like a good idea - we can leave the existing
 WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED
 behaviour that only guarantees ordering. The filesystem can then
 choose which to use where appropriate
 
 So what if you want a synchronous write, but DON'T care about the order? 
   They need to be two completely different flags which you can choose 
 to combine, or use individually.

If you have a use case for that, we can easily support it as well...
Depending on the drive capabilities (FUA support or not), it may be
nearly as slow as a real barrier write.
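The semantic difference FUA brings can be sketched with a toy model (hypothetical Python, invented names; this says nothing about relative performance): a FUA write makes one request durable without draining the cache, whereas the flush a barrier implies drags everything else out with it.

```python
class Drive:
    """Toy write-back drive; FUA writes bypass the cache entirely."""
    def __init__(self):
        self.cache, self.media, self.flushes = [], [], 0

    def write(self, block, fua=False):
        (self.media if fua else self.cache).append(block)

    def flush(self):
        self.flushes += 1
        self.media += self.cache
        self.cache = []

d = Drive()
d.write("unrelated")
d.write("critical", fua=True)       # durable on completion, no flush needed
assert "critical" in d.media
assert "unrelated" in d.cache       # untouched: still volatile
assert d.flushes == 0

# Without FUA, making "critical" durable forces everything else out too:
d2 = Drive()
d2.write("unrelated"); d2.write("critical"); d2.flush()
assert d2.flushes == 1 and "unrelated" in d2.media
```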

-- 
Jens Axboe



Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-31 Thread david

On Thu, 31 May 2007, Jens Axboe wrote:


On Thu, May 31 2007, Phillip Susi wrote:

David Chinner wrote:

That sounds like a good idea - we can leave the existing
WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED
behaviour that only guarantees ordering. The filesystem can then
choose which to use where appropriate


So what if you want a synchronous write, but DON'T care about the order?
  They need to be two completely different flags which you can choose
to combine, or use individually.


If you have a use case for that, we can easily support it as well...
Depending on the drive capabilities (FUA support or not), it may be
nearly as slow as a real barrier write.


true, but a real barrier write could have significant side effects on 
other writes that wouldn't happen with a synchronous write (a sync write 
can have other, unrelated writes re-ordered around it; a barrier write 
can't)
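A toy elevator (hypothetical Python, invented names) makes this distinction concrete: a sync write is just another request that the scheduler may float around freely, while a barrier fences the whole queue.

```python
# Toy request scheduler: it may reorder requests freely for throughput,
# but nothing may cross a BARRIER in either direction.
def schedule(requests):
    dispatched, segment = [], []
    for req in requests:
        if req == "BARRIER":
            dispatched += sorted(segment)   # reorder only within the segment
            dispatched.append(req)
            segment = []
        else:
            segment.append(req)             # SYNC writes reorder like any other
    return dispatched + sorted(segment)

queue = ["w3", "w1", "SYNC:w2", "BARRIER", "w5", "w4"]
out = schedule(queue)
assert out.index("BARRIER") == 3            # everything before stays before
assert "SYNC:w2" in out[:3]                 # the sync write was reordered
assert set(out[4:]) == {"w4", "w5"}         # and nothing jumped the fence
```

A sync write only changes when its submitter is told "done"; it imposes no fence on its neighbours.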


David Lang


[PATCH] gfs2: stop giving out non-cluster-coherent leases

2007-05-31 Thread J. Bruce Fields
From: Marc Eshel [EMAIL PROTECTED]

Since gfs2 can't prevent conflicting opens or leases on other nodes, we
probably shouldn't allow it to give out leases at all.

Put the newly defined lease operation into use in gfs2 by turning off
leases, unless we're using the 'nolock' locking module (in which case all
locking is local anyway).

Signed-off-by: Marc Eshel [EMAIL PROTECTED]
---
 fs/gfs2/ops_file.c |   26 ++
 1 files changed, 26 insertions(+), 0 deletions(-)

diff --git a/fs/gfs2/ops_file.c b/fs/gfs2/ops_file.c
index 064df88..78ac4ac 100644
--- a/fs/gfs2/ops_file.c
+++ b/fs/gfs2/ops_file.c
@@ -489,6 +489,31 @@ static int gfs2_fsync(struct file *file, struct dentry *dentry, int datasync)
 }
 
 /**
+ * gfs2_set_lease - acquire/release a file lease
+ * @file: the file pointer
+ * @arg: lease type
+ * @fl: file lock
+ *
+ * Returns: errno
+ */
+
+static int gfs2_set_lease(struct file *file, long arg, struct file_lock **fl)
+{
+   struct gfs2_sbd *sdp = GFS2_SB(file->f_mapping->host);
+   int ret = EAGAIN;
+
+   if (sdp->sd_args.ar_localflocks) {
+   return setlease(file, arg, fl);
+   }
+
+   /* For now fail the delegation request. Cluster file system can not
+  allow any node in the cluster to get a local lease until it can
+  be managed centrally by the cluster file system.
+   */
+   return ret;
+}
+
+/**
  * gfs2_lock - acquire/release a posix lock on a file
  * @file: the file pointer
  * @cmd: either modify or retrieve lock state, possibly wait
@@ -639,6 +664,7 @@ const struct file_operations gfs2_file_fops = {
.flock  = gfs2_flock,
.splice_read= generic_file_splice_read,
.splice_write   = generic_file_splice_write,
+   .set_lease  = gfs2_set_lease,
 };
 
 const struct file_operations gfs2_dir_fops = {
-- 
1.5.2.rc3



[PATCH] locks: share more common lease code

2007-05-31 Thread J. Bruce Fields
From: J. Bruce Fields [EMAIL PROTECTED]

Share more code between setlease (used by nfsd) and fcntl.

Also some minor cleanup.

Signed-off-by: J. Bruce Fields [EMAIL PROTECTED]
---
 fs/locks.c |   14 +++---
 1 files changed, 3 insertions(+), 11 deletions(-)

diff --git a/fs/locks.c b/fs/locks.c
index 431a8b8..3f366e1 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -1469,14 +1469,6 @@ int fcntl_setlease(unsigned int fd, struct file *filp, long arg)
	struct inode *inode = dentry->d_inode;
	int error;
 
-	if ((current->fsuid != inode->i_uid) && !capable(CAP_LEASE))
-		return -EACCES;
-	if (!S_ISREG(inode->i_mode))
-		return -EINVAL;
-	error = security_file_lock(filp, arg);
-	if (error)
-		return error;
-
	locks_init_lock(&fl);
	error = lease_init(filp, arg, &fl);
	if (error)
@@ -1484,15 +1476,15 @@ int fcntl_setlease(unsigned int fd, struct file *filp, long arg)
 
	lock_kernel();
 
-	error = __setlease(filp, arg, &flp);
+	error = setlease(filp, arg, &flp);
	if (error || arg == F_UNLCK)
		goto out_unlock;
 
	error = fasync_helper(fd, filp, 1, &flp->fl_fasync);
	if (error < 0) {
-		/* remove lease just inserted by __setlease */
+		/* remove lease just inserted by setlease */
		flp->fl_type = F_UNLCK | F_INPROGRESS;
-		flp->fl_break_time = jiffies- 10;
+		flp->fl_break_time = jiffies - 10;
time_out_leases(inode);
goto out_unlock;
}
-- 
1.5.2.rc3



[PATCH] locks: provide a file lease method enabling cluster-coherent leases

2007-05-31 Thread J. Bruce Fields
From: J. Bruce Fields [EMAIL PROTECTED]

Currently leases are only kept locally, so there's no way for a distributed
filesystem to enforce them against multiple clients.  We're particularly
interested in the case of nfsd exporting a cluster filesystem, in which
case nfsd needs cluster-coherent leases in order to implement delegations
correctly.

Signed-off-by: J. Bruce Fields [EMAIL PROTECTED]
---
 fs/locks.c |5 -
 include/linux/fs.h |4 
 2 files changed, 8 insertions(+), 1 deletions(-)

diff --git a/fs/locks.c b/fs/locks.c
index 3f366e1..40a7f39 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -1444,7 +1444,10 @@ int setlease(struct file *filp, long arg, struct file_lock **lease)
	return error;
 
	lock_kernel();
-	error = __setlease(filp, arg, lease);
+	if (filp->f_op && filp->f_op->set_lease)
+		error = filp->f_op->set_lease(filp, arg, lease);
+	else
+		error = __setlease(filp, arg, lease);
unlock_kernel();
 
return error;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 7cf0c54..09aefb4 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1112,6 +1112,7 @@ struct file_operations {
int (*flock) (struct file *, int, struct file_lock *);
	ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int);
	ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int);
+   int (*set_lease)(struct file *, long, struct file_lock **);
 };
 
 struct inode_operations {
@@ -1137,6 +1138,7 @@ struct inode_operations {
ssize_t (*listxattr) (struct dentry *, char *, size_t);
int (*removexattr) (struct dentry *, const char *);
void (*truncate_range)(struct inode *, loff_t, loff_t);
+   int (*break_lease)(struct inode *, unsigned int);
 };
 
 struct seq_file;
@@ -1463,6 +1465,8 @@ static inline int locks_verify_truncate(struct inode *inode,
 
 static inline int break_lease(struct inode *inode, unsigned int mode)
 {
+	if (inode->i_op && inode->i_op->break_lease)
+		return inode->i_op->break_lease(inode, mode);
	if (inode->i_flock)
		return __break_lease(inode, mode);
return 0;
-- 
1.5.2.rc3
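The shape of this change (and of the `break_lease` hook alongside it) is the classic optional-method-with-generic-fallback dispatch pattern. A minimal Python sketch of the idea, with invented names and a hard-coded EAGAIN value for illustration:

```python
EAGAIN = 11

def generic_setlease(arg):
    return 0                        # grant a purely local lease

class FileOps:
    """Stand-in for struct file_operations; set_lease is optional."""
    def __init__(self, set_lease=None):
        self.set_lease = set_lease

def setlease(f_op, arg):
    # mirrors: if (filp->f_op && filp->f_op->set_lease) ... else __setlease()
    if f_op and f_op.set_lease:
        return f_op.set_lease(arg)  # the filesystem gets the final say
    return generic_setlease(arg)    # default: local-only behaviour

# A cluster filesystem refuses leases it cannot enforce cluster-wide:
gfs2_ops  = FileOps(set_lease=lambda arg: -EAGAIN)
local_ops = FileOps()

assert setlease(local_ops, 1) == 0
assert setlease(gfs2_ops, 1) == -EAGAIN
```

The point of the pattern is that filesystems with no opinion pay nothing: a NULL method falls straight through to the existing generic code.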



Re: [NFS] [PATCH] locks: provide a file lease method enabling cluster-coherent leases

2007-05-31 Thread Trond Myklebust
On Thu, 2007-05-31 at 17:40 -0400, J. Bruce Fields wrote:
 From: J. Bruce Fields [EMAIL PROTECTED]
 
 Currently leases are only kept locally, so there's no way for a distributed
 filesystem to enforce them against multiple clients.  We're particularly
 interested in the case of nfsd exporting a cluster filesystem, in which
 case nfsd needs cluster-coherent leases in order to implement delegations
 correctly.
 
 Signed-off-by: J. Bruce Fields [EMAIL PROTECTED]
 ---
  fs/locks.c |5 -
  include/linux/fs.h |4 
  2 files changed, 8 insertions(+), 1 deletions(-)
 
 diff --git a/fs/locks.c b/fs/locks.c
 index 3f366e1..40a7f39 100644
 --- a/fs/locks.c
 +++ b/fs/locks.c
 @@ -1444,7 +1444,10 @@ int setlease(struct file *filp, long arg, struct 
 file_lock **lease)
   return error;
  
   lock_kernel();
 - error = __setlease(filp, arg, lease);
 + if (filp->f_op && filp->f_op->set_lease)
 + error = filp->f_op->set_lease(filp, arg, lease);
 +else
 + error = __setlease(filp, arg, lease);
   unlock_kernel();
  
   return error;
 diff --git a/include/linux/fs.h b/include/linux/fs.h
 index 7cf0c54..09aefb4 100644
 --- a/include/linux/fs.h
 +++ b/include/linux/fs.h
 @@ -1112,6 +1112,7 @@ struct file_operations {
   int (*flock) (struct file *, int, struct file_lock *);
   ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t 
 *, size_t, unsigned int);
   ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info 
 *, size_t, unsigned int);
 + int (*set_lease)(struct file *, long, struct file_lock **);
  };
  
  struct inode_operations {
 @@ -1137,6 +1138,7 @@ struct inode_operations {
   ssize_t (*listxattr) (struct dentry *, char *, size_t);
   int (*removexattr) (struct dentry *, const char *);
   void (*truncate_range)(struct inode *, loff_t, loff_t);
 + int (*break_lease)(struct inode *, unsigned int);

Splitting the lease into a file_operation part and an inode_operation
part looks really ugly. It also means that you're calling twice down
into the filesystem for every call to may_open() (once for
vfs_permission() and once for break_lease()) and 3 times in
do_sys_truncate().

Would it perhaps make sense to package up the call to vfs_permission()
and break_lease() as a single 'may_open()' inode operation that could be
called by may_open(), do_sys_truncate() and nfsd?

Trond



Re: [RFC] obsoleting /etc/mtab

2007-05-31 Thread Karel Zak

 Hi Miklos,

On Thu, May 31, 2007 at 06:29:12PM +0200, Miklos Szeredi wrote:
 It's not just mount(8) that reads /etc/mtab, but various other
 utilities, for example df(1).  So the best solution would be if

 mount.nfs, mount.cifs, mount.ocfs, HAL, am-utils (amd)...

 and these utils also write to mtab, although I think many of them already
 check for a symlink.

 /etc/mtab were a symlink to /proc/mounts, and the kernel would be the
 authoritative source of information regarding mounts.

 Yes.

 (1) user mounts (user or users option in /etc/fstab) currently
 need /etc/mtab to keep track of the owner

 There are more things:

  loop=/dev/loopN

  ... umount(8) uses this option for loop device deinitialization,
  when the device was initialized by mount(8),

  encryption=, offset=, speed=

   ... but nothing uses these options

  uhelper=

   ... this one is my baby :-(

   (Not released by upstream yet.  ...according to Google this
   Fedora patch is already in Mandrake, PCLinuxOS, Pardus, and
   ??? )

   From man page:

   The uhelper (unprivileged umount request helper) is possibly
   used when a non-root user wants to umount a mountpoint
   which is not defined in the /etc/fstab file (e.g. devices
   mounted by HAL).

   GNOME people love it, because it's a way to use command-line
   utils (umount(8)) for devices that were mounted by desktop
   daemons.


 The umount.nfs also reads many options from mtab, but it seems all
 these options are also in /proc/mounts.

 I know almost nothing about the others [u]mount dialects (cifs, ...).

 (2) lots of filesystems only present a few mount options (or none) in
 /proc/mounts
 
 (1) can be solved with the new mount owner support in the unprivileged
 mounts patchset.  Mount(8) would still have to detect at boot time if
 this is available, and either create the symlink to /proc/mounts or if
 MS_SETOWNER is not available, fall back to using /etc/mtab.

 Sounds good, but there should be a way (an option) to disable this
 functionality (in case when mtab is required for an exotic reason).

 (2) needs work in the filesystems implicated.  I already have patches
 for ext2, ext3, tmpfs, devpts and hostfs, and it would be nice if the
 maintainers for others could help out.
 
 It wouldn't even be fatal if some mount options were missing from
 /proc/mounts.  Mount options in /etc/mtab have never been perfectly
 accurate, especially after a remount, when they could get totally out
 of sync with the options effective for the filesystem.

 The /etc/mtab is almost always useless with NFS (the kernel changes
 mount options according to NFS server settings, so it is possible
 that you have 'rw' in mtab and 'ro' in /proc/mounts :-)
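For comparison, reading the kernel's own table is trivial; here is a sketch (Linux-only, of course) of what a df(1)-style tool would do if /etc/mtab were simply a symlink to /proc/mounts:

```python
def read_mounts(path="/proc/self/mounts"):
    """Parse the kernel's authoritative mount table.
    Fields: device, mountpoint, fstype, options, dump, pass."""
    mounts = []
    with open(path) as f:
        for line in f:
            dev, mnt, fstype, opts, _dump, _passno = line.split()
            mounts.append((dev, mnt, fstype, opts.split(",")))
    return mounts

mounts = read_mounts()
root = next(m for m in mounts if m[1] == "/")
# ro/rw here is what the kernel enforces, not what mount(8) recorded.
assert "rw" in root[3] or "ro" in root[3]
```

Whitespace in mountpoints is octal-escaped (\040) by the kernel, so the naive split() is safe; but loop= and uhelper= would of course be lost, which is exactly the gap Karel describes.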

 Can someone think of any other problem with getting rid of /etc/mtab?

 Crazy idea: make kernel more promiscuous with mount options -- it
 means you can use an arbitrary foo= option for mount(2) when max
 length of all options is less than or equal to
 /proc/sys/fs/mntopt_max.  (well...  NACK :-)

 I agree that the /etc/mtab file is badly designed thing where we
 duplicate almost all from /proc/mounts, but loop= and uhelper= are
 nice examples that userspace utils are able to capitalize on this
 concept. Maybe we need something better than the mtab for userspace
 specific options.

 Somewhere at people.redhat.com/kzak/ I have a patch that adds LUKS
 support to mount(8), and this patch also adds new options to the
 mtab file. I can imagine more scenarios where userspace-specific
 options are a good thing.

 [1] http://lkml.org/lkml/2007/4/27/180

 The patches have been postponed by Andrew, right? Or is it already in
 -mm?

Karel

-- 
 Karel Zak  [EMAIL PROTECTED]


Re: [PATCH resend] introduce I_SYNC

2007-05-31 Thread Andrew Morton
On Thu, 31 May 2007 16:25:35 +0200
Jörn Engel [EMAIL PROTECTED] wrote:

 On Wed, 16 May 2007 10:15:35 -0700, Andrew Morton wrote:
 
  If we're going to do this then please let's get some exhaustive commentary
  in there so that others have a chance of understanding these flags without
  having to do the amount of reverse-engineering which you've been put 
  through.
 
 Done.  Found and fixed some bugs in the process.  By now I feel
 reasonably certain that the patch fixes more than it breaks.
 

 
 -- 
 Good warriors cause others to come to them and do not go to others.
 -- Sun Tzu
 
 Introduce I_SYNC.
 
 I_LOCK was used for several unrelated purposes, which caused deadlock
 situations in certain filesystems as a side effect.  One of the purposes
 now uses the new I_SYNC bit.

Do we know what those deadlocks were?  It's a bit of a mystery patch otherwise.

Put yourself in the position of random-distro-engineer wondering "should I
backport this?".

 Also document the various bits and change their order from historical to
 logical.

What a nice comment you added ;)


Re: [RFC] obsoleting /etc/mtab

2007-05-31 Thread Karel Zak
On Thu, May 31, 2007 at 09:40:49AM -0700, H. Peter Anvin wrote:
 Miklos Szeredi wrote:
  
  (2) needs work in the filesystems implicated.  I already have patches
  for ext2, ext3, tmpfs, devpts and hostfs, and it would be nice if the
  maintainers for others could help out.
  
 
 A lot of these could be fixed all at once by letting the filesystem tell
 the VFS to retain the string passed to the original mount.  That will

 Unfortunately, the original option string (from userspace) != real
 options (in kernel), see NFS. This bug should be fixed -- the kernel
 has to fully follow mount(2) or ends with EINVAL.

Karel

-- 
 Karel Zak  [EMAIL PROTECTED]


[PATCH] support larger cifs network reads

2007-05-31 Thread Steve French

With Samba 3.0.26pre it is now possible for a cifs client (one which
supports the newest Unix/Posix cifs extensions) to request up to
almost 8MB at a time on a cifs read request.

A patch for the cifs client to support larger reads follows.  In this
patch, using very large reads is not the default behavior, since it
would require larger buffer allocations for the large cifs request
buffers, but in the future when cifs can demultiplex reads to a page
list in the cifs_demultiplex_thread (without having to copy to a large
temporary buffer) this will be even more useful.
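The size policy in the patch below boils down to clamp-then-round arithmetic, sketched here in Python for clarity (kmalloc_max and hdr are assumed stand-ins for KMALLOC_MAX_SIZE and MAX_CIFS_HDR_SIZE, whose real values are kernel-dependent):

```python
PROTO_MAX = 8388608           # 8 MB: 23 of the 24 SMB length bits
MASK_512  = 0x7FFE00          # keeps bits 9..22 -> even 512-byte multiple

def clamp_bufsize(requested, kmalloc_max=4 * 1024 * 1024, hdr=500):
    if requested + hdr > kmalloc_max:          # must fit in one kmalloc
        requested = kmalloc_max - hdr
    if requested > PROTO_MAX:                  # protocol ceiling
        return PROTO_MAX
    return requested & MASK_512                # round down to 512 multiple

big = 16 * 1024 * 1024
assert clamp_bufsize(PROTO_MAX + 4096, kmalloc_max=big) == PROTO_MAX
assert clamp_bufsize(130000, kmalloc_max=big) == 130000 & MASK_512
assert clamp_bufsize(130000, kmalloc_max=big) % 512 == 0
```

Note that the PROTO_MAX case deliberately skips the mask: 8388608 is 0x800000, and masking it with 0x7FFE00 would zero it.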

diff --git a/fs/cifs/README b/fs/cifs/README
index 4d01697..6ad722b 100644
--- a/fs/cifs/README
+++ b/fs/cifs/README
@@ -301,8 +301,19 @@ A partial list of the supported mount op
during the local client kernel build will be used.
If server does not support Unicode, this parameter is
unused.
-  rsize		default read size (usually 16K)
-  wsize		default write size (usually 16K, 32K is often better over GigE)
+  rsize		default read size (usually 16K). The client currently
+		can not use rsize larger than CIFSMaxBufSize. CIFSMaxBufSize
+		defaults to 16K and may be changed (from 8K to the maximum
+		kmalloc size allowed by your kernel) at module install time
+		for cifs.ko. Setting CIFSMaxBufSize to a very large value
+		will cause cifs to use more memory and may reduce performance
+		in some cases.  To use rsize greater than 127K (the original
+		cifs protocol maximum) also requires that the server support
+		a new Unix Capability flag (for very large read) which some
+		newer servers (e.g. Samba 3.0.26 or later) do. rsize can be
+		set from a minimum of 2048 to a maximum of 8388608 (depending
+		on the value of CIFSMaxBufSize)
+  wsize		default write size (default 57344)
	maximum wsize currently allowed by CIFS is 57344 (14 4096 byte pages)
  rw		mount the network share read-write (note that the
diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
index d38c69b..8c4365d 100644
--- a/fs/cifs/cifsfs.c
+++ b/fs/cifs/cifsfs.c
@@ -732,10 +732,21 @@ cifs_init_request_bufs(void)
	/* Buffer size can not be smaller than 2 * PATH_MAX since maximum
	   Unicode path name has to fit in any SMB/CIFS path based frames */
		CIFSMaxBufSize = 8192;
-	} else if (CIFSMaxBufSize > 1024*127) {
-		CIFSMaxBufSize = 1024 * 127;
	} else {
-		CIFSMaxBufSize &= 0x1FE00; /* Round size to even 512 byte mult*/
+		if (CIFSMaxBufSize + MAX_CIFS_HDR_SIZE > KMALLOC_MAX_SIZE) {
+			CIFSMaxBufSize = KMALLOC_MAX_SIZE - MAX_CIFS_HDR_SIZE;
+			cERROR(1, ("CIFSMaxBufSize too large, resetting to %ld",
+				   KMALLOC_MAX_SIZE));
+		}
+
+		/* The length field is 3 bytes, but for time being we use only
+		 *  23 of the available 24 length bits */
+		if (CIFSMaxBufSize > 8388608) {
+			CIFSMaxBufSize = 8388608;
+			cERROR(1,
+				("CIFSMaxBufSize set to protocol max 8388608"));
+		} else	/* round buffer size to even 512 byte multiple */
+			CIFSMaxBufSize &= 0x7FFE00;
	}
	/*  cERROR(1, ("CIFSMaxBufSize %d 0x%x", CIFSMaxBufSize, CIFSMaxBufSize)); */
	cifs_req_cachep = kmem_cache_create("cifs_request",
diff --git a/fs/cifs/cifspdu.h b/fs/cifs/cifspdu.h
index d619ca7..6e6cda0 100644
--- a/fs/cifs/cifspdu.h
+++ b/fs/cifs/cifspdu.h
@@ -1885,15 +1885,19 @@ typedef struct {
#define CIFS_UNIX_POSIX_PATHNAMES_CAP   0x0010 /* Allow POSIX path chars  */
#define CIFS_UNIX_POSIX_PATH_OPS_CAP    0x0020 /* Allow new POSIX path based
						  calls including posix open
-						  and posix unlink */
+						  and posix unlink */
+#define CIFS_UNIX_LARGE_READ_CAP	0x0040 /* support reads >128K (up
+						  to 0xFFFF00) */
+
+#define CIFS_UNIX_LARGE_WRITE_CAP	0x0080
+
#ifdef CONFIG_CIFS_POSIX
/* Can not set pathnames cap yet until we send new posix create SMB since
   otherwise server can treat such handles opened with older ntcreatex
   (by a new client which knows how to send posix path ops)
   as non-posix handles (can affect write behavior with byte range locks.
   We can add back in POSIX_PATH_OPS cap when Posix Create/Mkdir finished */
-/* #define CIFS_UNIX_CAP_MASK  0x003b */
-#define CIFS_UNIX_CAP_MASK  0x001b
+/* #define CIFS_UNIX_CAP_MASK  

Re: [RFC] obsoleting /etc/mtab

2007-05-31 Thread Trond Myklebust
On Fri, 2007-06-01 at 01:04 +0200, Karel Zak wrote:
 On Thu, May 31, 2007 at 09:40:49AM -0700, H. Peter Anvin wrote:
  Miklos Szeredi wrote:
   
   (2) needs work in the filesystems implicated.  I already have patches
   for ext2, ext3, tmpfs, devpts and hostfs, and it would be nice if the
   maintainers for others could help out.
   
  
  A lot of these could be fixed all at once by letting the filesystem tell
  the VFS to retain the string passed to the original mount.  That will
 
  Unfortunately, the original option string (from userspace) != real
  options (in kernel), see NFS. This bug should be fixed -- the kernel
  has to fully follow mount(2) or ends with EINVAL.

Way ahead of you... See patches 6 and 7 on

  http://client.linux-nfs.org/Linux-2.6.x/2.6.22-rc3/

Trond



Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-31 Thread David Chinner
On Thu, May 31, 2007 at 02:31:21PM -0400, Phillip Susi wrote:
 David Chinner wrote:
 That sounds like a good idea - we can leave the existing
 WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED
 behaviour that only guarantees ordering. The filesystem can then
 choose which to use where appropriate
 
 So what if you want a synchronous write, but DON'T care about the order? 

submit_bio(WRITE_SYNC, bio);

Already there, already used by XFS, JFS and direct I/O.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: [RFC] obsoleting /etc/mtab

2007-05-31 Thread David Chinner
On Thu, May 31, 2007 at 09:40:49AM -0700, H. Peter Anvin wrote:
 Miklos Szeredi wrote:
  
  (2) needs work in the filesystems implicated.  I already have patches
  for ext2, ext3, tmpfs, devpts and hostfs, and it would be nice if the
  maintainers for others could help out.
  
 
 A lot of these could be fixed all at once by letting the filesystem tell
 the VFS to retain the string passed to the original mount.  That will
 solve *almost* all filesystems which take string options.

Except some filesystems mangle that string as it gets passed around
(i.e. XFS)

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: [RFC] obsoleting /etc/mtab

2007-05-31 Thread H. Peter Anvin
Trond Myklebust wrote:

 A lot of these could be fixed all at once by letting the filesystem tell
 the VFS to retain the string passed to the original mount.  That will
  Unfortunately, the original option string (from userspace) != real
  options (in kernel), see NFS. This bug should be fixed -- the kernel
  has to fully follow mount(2) or ends with EINVAL.
 
 Way ahead of you... See patches 6 and 7 on
 
   http://client.linux-nfs.org/Linux-2.6.x/2.6.22-rc3/
 

NFS takes a binary option block anyway.  However, that's the exception,
not the rule.

-hpa


Re: [patch 12/41] fs: introduce write_begin, write_end, and perform_write aops

2007-05-31 Thread Nick Piggin
On Thu, May 31, 2007 at 12:05:39AM -0700, Andrew Morton wrote:
 On Thu, 31 May 2007 07:15:39 +0200 Nick Piggin [EMAIL PROTECTED] wrote:
 
  If you can send that rollup, it would be good. I could try getting
  everything to compile and do some more testing on it too.
 
 
 Single patch against 2.6.22-rc3: http://userweb.kernel.org/~akpm/np.gz
 
 broken-out: 
 ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/mm/broken-out-2007-05-30-09-30.tar.gz

Thanks, I'll get onto it.


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-31 Thread Tejun Heo
Jens Axboe wrote:
 On Thu, May 31 2007, David Chinner wrote:
 On Thu, May 31, 2007 at 08:26:45AM +0200, Jens Axboe wrote:
 On Thu, May 31 2007, David Chinner wrote:
 IOWs, there are two parts to the problem:

1 - guaranteeing I/O ordering
2 - guaranteeing blocks are on persistent storage.

 Right now, a single barrier I/O is used to provide both of these
 guarantees. In most cases, all we really need to provide is 1); the
 need for 2) is a much rarer condition but still needs to be
 provided.

 if I am understanding it correctly, the big win for barriers is that you 
 do NOT have to stop and wait until the data is on persistant media before 
 you can continue.
 Yes, if we define a barrier to only guarantee 1), then yes this
 would be a big win (esp. for XFS). But that requires all filesystems
 to handle sync writes differently, and sync_blockdev() needs to
 call blkdev_issue_flush() as well

 So, what do we do here? Do we define a barrier I/O to only provide
 ordering, or do we define it to also provide persistent storage
 writeback? Whatever we decide, it needs to be documented
 The block layer already has a notion of the two types of barriers, with
 a very small amount of tweaking we could expose that. There's absolutely
 zero reason we can't easily support both types of barriers.
 That sounds like a good idea - we can leave the existing
 WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED
 behaviour that only guarantees ordering. The filesystem can then
 choose which to use where appropriate
 
 Precisely. The current definition of barriers are what Chris and I came
 up with many years ago, when solving the problem for reiserfs
 originally. It is by no means the only feasible approach.
 
 I'll add a WRITE_ORDERED command to the #barrier branch, it already
 contains the empty-bio barrier support I posted yesterday (well a
 slightly modified and cleaned up version).

Would that be very different from issuing a barrier and not waiting for
its completion?  For ATA and SCSI, we'll have to flush the write back cache
anyway, so I don't see how we can get a performance advantage by
implementing separate WRITE_ORDERED.  I think zero-length barrier
(haven't looked at the code yet, still recovering from jet lag :-) can
serve as genuine barrier without the extra write tho.

Thanks.

-- 
tejun


Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-31 Thread Tejun Heo
Stefan Bader wrote:
 2007/5/30, Phillip Susi [EMAIL PROTECTED]:
 Stefan Bader wrote:
 
  Since drive a supports barrier request we don't get -EOPNOTSUPP but
  the request with block y might get written before block x since the
  disk are independent. I guess the chances of this are quite low since
  at some point a barrier request will also hit drive b but for the time
  being it might be better to indicate -EOPNOTSUPP right from
  device-mapper.

 The device mapper needs to ensure that ALL underlying devices get a
 barrier request when one comes down from above, even if it has to
 construct zero length barriers to send to most of them.

 
 And somehow also make sure all of the barriers have been processed
 before returning the barrier that came in. Plus it would have to queue
 all mapping requests until the barrier is done (if strictly acting
 according to barrier.txt).
 
 But I am wondering a bit whether the requirements to barriers are
 really that tight as described in Tejun's document (barrier request is
 only started if everything before is safe, the barrier itself isn't
 returned until it is safe, too, and all requests after the barrier
 aren't started before the barrier is done). Is it really necessary to
 defer any further requests until the barrier has been written to save
 storage? Or would it be sufficient to guarantee that, if a barrier
 request returns, everything up to (and including) the barrier is on safe
 storage?

Well, what's described in barrier.txt is the current implemented
semantics and what filesystems expect, so we can't change it underneath
them but we definitely can introduce new more relaxed variants, but one
thing we should bear in mind is that harddisks don't have humongous
caches or very smart controller / instruction set.  No matter how
relaxed interface the block layer provides, in the end, it just has to
issue whole-sale FLUSH CACHE on the device to guarantee data ordering on
the media.

IMHO, we can do better by paying more attention to how we do things in
the request queue which can be deeper and more intelligent than the
device queue.

Thanks.

-- 
tejun


Re: [RFC] obsoleting /etc/mtab

2007-05-31 Thread Andreas Dilger
On May 31, 2007  17:11 -0700, H. Peter Anvin wrote:
 NFS takes a binary option block anyway.  However, that's the exception,
 not the rule.

There was recently a patch submitted to linux-fsdevel to change NFS to
use text option parsing.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.



Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-31 Thread david

On Fri, 1 Jun 2007, Tejun Heo wrote:


but one
thing we should bear in mind is that harddisks don't have humongous
caches or very smart controller / instruction set.  No matter how
relaxed interface the block layer provides, in the end, it just has to
issue whole-sale FLUSH CACHE on the device to guarantee data ordering on
the media.


if you are talking about individual drives you may be right for the moment 
(but 16M cache on drives is a _lot_ larger than people imagined would be 
there a few years ago)


but when you consider the self-contained disk arrays it's an entirely 
different story. you can easily have a few gig of cache and a complete OS 
pretending to be a single drive as far as you are concerned.


and the price of such devices is plummeting (in large part thanks to Linux 
moving into this space), you can now readily buy a 10TB array for $10k 
that looks like a single drive.


David Lang