[new patch 3/3] fs: fix cont vs deadlock patches

2006-12-01 Thread Nick Piggin
> I see. I guess you need to synchronise your writepage versus this
> extension in order to handle it properly then. I won't bother with
> that though: it won't be worse than it was before.
> 
> Thanks for review, do you agree with the other hunks?

Well, Andrew's got the rest of the patches in his tree, so I'll send
what we've got for now. Has had some testing on both reiserfs and
fat. Doesn't look like the other filesystems using cont_prepare_write
will have any problems...

Andrew, please apply this patch as a replacement for the fat-fix
patch in your rollup (this patch includes the same fix, and is a
more logical change unit I think).
--

Stop the "cont" buffer code from overloading a zero-length
prepare/commit_write as a request for a file extend operation.
Instead, have generic_cont_expand perform a zeroing operation on
the last page, while cont_prepare_write naturally zeroes any
previous pages as required.

Signed-off-by: OGAWA Hirofumi <[EMAIL PROTECTED]>

Reiserfs was trying to "extend" a file to something smaller than
it already is with generic_cont_expand. This broke the above fix.
Open code the prepare/commit pair instead... maybe the code would
be cleaner if reiserfs just did the operation explicitly (call
get_block or whatever helper is used to unpack the tail)?

Signed-off-by: Nick Piggin <[EMAIL PROTECTED]>


Index: linux-2.6/fs/buffer.c
===
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -2004,18 +2004,24 @@ int block_read_full_page(struct page *pa
return 0;
 }
 
-/* utility function for filesystems that need to do work on expanding
+/*
+ * utility function for filesystems that need to do work on expanding
  * truncates.  Uses prepare/commit_write to allow the filesystem to
  * deal with the hole.  
  */
-static int __generic_cont_expand(struct inode *inode, loff_t size,
-pgoff_t index, unsigned int offset)
+int generic_cont_expand(struct inode *inode, loff_t size)
 {
struct address_space *mapping = inode->i_mapping;
+   loff_t pos = inode->i_size;
struct page *page;
unsigned long limit;
+   pgoff_t index;
+   unsigned int from, to;
+   void *kaddr;
int err;
 
+   WARN_ON(pos >= size);
+
err = -EFBIG;
 limit = current->signal->rlim[RLIMIT_FSIZE].rlim_cur;
if (limit != RLIM_INFINITY && size > (loff_t)limit) {
@@ -2025,11 +2031,18 @@ static int __generic_cont_expand(struct 
if (size > inode->i_sb->s_maxbytes)
goto out;
 
+   index = (size - 1) >> PAGE_CACHE_SHIFT;
+   to = size - ((loff_t)index << PAGE_CACHE_SHIFT);
+   if (index != (pos >> PAGE_CACHE_SHIFT))
+   from = 0;
+   else
+   from = pos & (PAGE_CACHE_SIZE - 1);
+
err = -ENOMEM;
page = grab_cache_page(mapping, index);
if (!page)
goto out;
-   err = mapping->a_ops->prepare_write(NULL, page, offset, offset);
+   err = mapping->a_ops->prepare_write(NULL, page, from, to);
if (err) {
/*
 * ->prepare_write() may have instantiated a few blocks
@@ -2041,7 +2054,12 @@ static int __generic_cont_expand(struct 
goto out;
}
 
-   err = mapping->a_ops->commit_write(NULL, page, offset, offset);
+   kaddr = kmap_atomic(page, KM_USER0);
+   memset(kaddr + from, 0, to - from);
+   flush_dcache_page(page);
+   kunmap_atomic(kaddr, KM_USER0);
+
+   err = mapping->a_ops->commit_write(NULL, page, from, to);
 
unlock_page(page);
page_cache_release(page);
@@ -2051,36 +2069,6 @@ out:
return err;
 }
 
-int generic_cont_expand(struct inode *inode, loff_t size)
-{
-   pgoff_t index;
-   unsigned int offset;
-
-   offset = (size & (PAGE_CACHE_SIZE - 1)); /* Within page */
-
-   /* ugh.  in prepare/commit_write, if from==to==start of block, we
-   ** skip the prepare.  make sure we never send an offset for the start
-   ** of a block
-   */
-   if ((offset & (inode->i_sb->s_blocksize - 1)) == 0) {
-   /* caller must handle this extra byte. */
-   offset++;
-   }
-   index = size >> PAGE_CACHE_SHIFT;
-
-   return __generic_cont_expand(inode, size, index, offset);
-}
-
-int generic_cont_expand_simple(struct inode *inode, loff_t size)
-{
-   loff_t pos = size - 1;
-   pgoff_t index = pos >> PAGE_CACHE_SHIFT;
-   unsigned int offset = (pos & (PAGE_CACHE_SIZE - 1)) + 1;
-
-   /* prepare/commit_write can handle even if from==to==start of block. */
-   return __generic_cont_expand(inode, size, index, offset);
-}
-
 /*
  * For moronic filesystems that do not allow holes in file.
  * We may have to extend the file.
@@ -3015,7 +3003,6 @@ EXPORT_SYMBOL(fsync_bdev);
 EXPORT_SYMBOL(generic_block_bmap);
 EXPORT_SYMBOL(generic_commit_write);
 EXPORT_SYMBOL(generic_cont_expand);
-EXPORT_SYMBOL(gen

Re: NFSv4/pNFS possible POSIX I/O API standards

2006-12-01 Thread Latchesar Ionkov

Hi,

One general remark: I don't think it is feasible to add new system
calls every time somebody has a problem. Usually there are solutions
(maybe not that good) that don't require big changes and work well
enough. "Let's change the interface and make the life of many
filesystem developers miserable, because they have to worry about
3-4-5 more operations" is not the easiest solution in the long run.

On 12/1/06, Rob Ross <[EMAIL PROTECTED]> wrote:

Hi all,

The use model for openg() and openfh() (renamed sutoc()) is n processes
spread across a large cluster simultaneously opening a file. The
challenge is to avoid to the greatest extent possible incurring O(n) FS
interactions. To do that we need to allow actions of one process to be
reused by other processes on other OS instances.

The openg() call allows one process to perform name resolution, which is
often the most expensive part of this use model. Because permission


If the name resolution is the most expensive part, why not implement
just the name lookup part and call it "lookup" instead of "openg"? Or
even better, make NFS resolve multiple names with a single request.
If the NFS server caches the last few name lookups, the responses to
the other nodes will be fast, and you will get your file descriptor
with two requests instead of the proposed one. The performance could
be just good enough without introducing any new functions and file
handles.

Thanks,
   Lucho


Re: [patch 3/3] fs: fix cont vs deadlock patches

2006-12-01 Thread Nick Piggin

OGAWA Hirofumi wrote:

Nick Piggin <[EMAIL PROTECTED]> writes:



status = __block_prepare_write(inode, new_page, zerofrom,
PAGE_CACHE_SIZE, get_block);
if (status)
@@ -2110,7 +2111,7 @@
memset(kaddr+zerofrom, 0, PAGE_CACHE_SIZE-zerofrom);
flush_dcache_page(new_page);
kunmap_atomic(kaddr, KM_USER0);
-   generic_commit_write(NULL, new_page, zerofrom, PAGE_CACHE_SIZE);
+   __block_commit_write(inode, new_page, zerofrom, 
PAGE_CACHE_SIZE);


Whatever function this is doesn't need to update i_size?


Yes, it is the code in cont_prepare_write that is expanding a hole
at the end of file.

We can do this now because fat_commit_write is now changed to call
generic_commit_write in the case of a non-zero length.

I think it is an improvement because now the file will not get
arbitrarily extended in the case of a write failure somewhere down
the track.



Ah, unfortunately we can't do this. If we don't update ->i_size before
page_cache_release, pdflush will think these pages are outside ->i_size
and just clean the page without writing it.


I see. I guess you need to synchronise your writepage versus this
extension in order to handle it properly then. I won't bother with
that though: it won't be worse than it was before.

Thanks for review, do you agree with the other hunks?

--
SUSE Labs, Novell Inc.




Re: NFSv4/pNFS possible POSIX I/O API standards

2006-12-01 Thread Rob Ross

Hi all,

The use model for openg() and openfh() (renamed sutoc()) is n processes 
spread across a large cluster simultaneously opening a file. The 
challenge is to avoid to the greatest extent possible incurring O(n) FS 
interactions. To do that we need to allow actions of one process to be 
reused by other processes on other OS instances.


The openg() call allows one process to perform name resolution, which is 
often the most expensive part of this use model. Because permission 
checking is also performed as part of the openg(), some file systems do 
not require additional communication between OS and FS at openfh(). 
External communication channels are used to pass the handle resulting 
from the openg() call out to processes on other nodes (e.g. MPI_Bcast).
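
For concreteness, a rough sketch of that flow is below. The openg()/openfh()
prototypes, the handle buffer size, and the helper name are assumptions made
for illustration only; the proposed calls are not an existing API.

/*
 * Hypothetical sketch of the openg()/openfh() + MPI_Bcast use model
 * described above.  Prototypes and FH_MAX are assumptions.
 */
#include <mpi.h>
#include <fcntl.h>
#include <stddef.h>

#define FH_MAX 128

/* assumed: openg() resolves the name once, returning an opaque handle */
extern int openg(const char *path, int flags, void *handle, size_t *len);
/* assumed: openfh() turns a previously obtained handle into a local fd */
extern int openfh(const void *handle, size_t len);

int open_on_all_ranks(const char *path, int rank)
{
	char handle[FH_MAX];
	size_t len = sizeof(handle);

	if (rank == 0)
		openg(path, O_RDWR, handle, &len);	/* one lookup for n processes */

	/* one collective distributes the handle instead of n independent
	 * path resolutions hitting the file system servers */
	MPI_Bcast(&len, sizeof(len), MPI_BYTE, 0, MPI_COMM_WORLD);
	MPI_Bcast(handle, (int)len, MPI_BYTE, 0, MPI_COMM_WORLD);

	return openfh(handle, len);			/* local fd on every rank */
}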


dup(), openat(), and UNIX sockets are not viable options in this model, 
because there are many OS instances, not just one.


All the calls that are being discussed as part of the HEC extensions are 
being discussed in this context of multiple OS instances and cluster 
file systems.


Regarding the lifetime of the handle, there has been quite a bit of 
discussion about this. I believe that we most recently were thinking 
that there was an undefined lifetime for this, allowing servers to 
"forget" these values (as in the case where a server is restarted). 
Clients would need to perform the openg() again if they were to try to 
use an outdated handle, or simply fall back to a regular open(). This is 
not a problem in our use model.


I've attached a graph showing the time to use individual open() calls 
vs. the openg()/MPI_Bcast()/openfh() combination; it's a clear win for 
any significant number of processes. These results are from our 
colleagues at Sandia (Ruth Klundt et al.) with PVFS underneath, but I 
expect the trend to be similar for many cluster file systems.


Regarding trying to "force APIs using standardization" on you 
(Christoph's 11/29/2006 message), you've got us all wrong. The 
standardization process is going to take some time, so we're starting on 
it at the same time that we're working with prototypes, so that we don't 
have to wait any longer than necessary to have these things be part of 
POSIX. The whole reason we're presenting this on this list is to try to 
describe why we think these calls are important and get feedback on how 
we can make these calls work well in the context of Linux. I'm glad to 
see so many people taking interest.


I look forward to further constructive discussion. Thanks,

Rob
---
Rob Ross
Mathematics and Computer Science Division
Argonne National Laboratory

Christoph Hellwig wrote:

On Wed, Nov 29, 2006 at 05:23:13AM -0700, Matthew Wilcox wrote:

Is this for people who don't know about dup(), or do they need
independent file offsets?  If the latter, I think an xdup() would be
preferable (would there be a security issue for OSes with revoke()?)
Either that, or make the key be useful for something else.


Not sharing the file offset means we need a separate file struct, at
which point the only thing saved is doing a lookup at the time of
opening the file.  While a full pathname traversal can be quite costly
an open is not something you do all that often anyway.  And if you really
need to open/close files very often you can speed it up nicely by keeping
a file descriptor on the parent directory open and using openat().

Anyway, enough of talking here.  We really need a very good description
of the use case people want this for, and the specific performance problems
they see to find a solution.  And the solution definitely does not involve
a second half-assed file handle type with unspecified lifetime rules :-)


openg-compare.pdf
Description: Adobe PDF document


Re: [RFC][PATCH] ensure i_ino uniqueness in filesystems without permanent inode numbers (via idr hashing)

2006-12-01 Thread Jeff Layton
Thanks again, Randy. Here's an updated and tested patch and description. This
one also makes sure that the root inode for the mount gets a unique i_ino value
as well. Let me know what you think...

--[snip]---

This patch is a proof of concept. It works, but I'd like to get some buyin
on the approach before I start doing the legwork to convert all of
the existing filesystems to use it. First, the problems:

1) on filesystems w/o permanent inode numbers, i_ino values can be
larger than 32 bits, which can cause problems for some 32 bit userspace
programs on a 64 bit kernel.

2) many filesystems call new_inode and assume that the i_ino values they
are given are unique. They are not guaranteed to be so, since the static
counter can wrap.

3) after allocating a new inode, some filesystems call iunique to try to
get a unique i_ino value, but they don't actually add their inodes to
the hashtable, and so they're still not guaranteed to be unique.

One way to fix this would be to just make sure they all use iunique and hash
their inodes, but that might slow down lookups for filesystems whose inodes
are not pinned in memory.

This patch is one way to correct these problems. This adds 2 new
functions, an iunique_register and iunique_unregister. Filesystems can call
iunique_register at inode creation time, and then at deletion time, we'll
automatically unregister them.
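
As a quick sketch of the intended calling convention (modelled on the pipefs
conversion included at the end of this patch; the surrounding function is
illustrative only):

#include <linux/fs.h>

static struct inode *example_new_inode(struct super_block *sb)
{
	struct inode *inode = new_inode(sb);

	if (!inode)
		return NULL;

	/* reserve a unique, idr-backed i_ino; the second argument reserves
	 * a low i_ino range for the filesystem's own use (0 = none) */
	if (iunique_register(inode, 0)) {
		iput(inode);
		return NULL;
	}

	/* ... normal inode setup ... */

	/* no explicit iunique_unregister() on the success path: the VFS
	 * drops the idr entry when the inode is finally disposed of */
	return inode;
}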

This patch also adds a new s_generation counter to the superblock.
Because i_ino's can be reused so quickly, we don't want NFS getting
confused when it happens. When iunique_register is called, we'll assign
the s_generation value to the i_generation, and then increment it to
help ensure that we get different filehandles.

There are some things that need to be cleaned up, of course:

- better error handling for the iunique calls

- recheck all the possible places where the inode should be unhashed

- convert other filesystems

- remove the static counter from new_inode and (maybe) eliminate iunique

The patch also converts pipefs to use the new scheme as an example. Al
Viro had expressed some concern with an earlier patch that this might
slow down pipe creation. I've done some testing and I think the impact
will be minimal. Timing a small program that creates and closes 100
million pipes in a loop:

patched:
--------
real    8m8.623s
user    0m37.418s
sys     7m31.196s

unpatched:
----------
real    8m7.150s
user    0m40.943s
sys     7m26.204s

As the number of pipes grows on the system this time may grow somewhat,
but it doesn't seem like it would be terrible.

Comments and suggestions appreciated.

Signed-off-by: Jeff Layton <[EMAIL PROTECTED]>


diff --git a/fs/inode.c b/fs/inode.c
index 26cdb11..252192a 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -288,6 +288,8 @@ static void dispose_list(struct list_hea
list_del_init(&inode->i_sb_list);
spin_unlock(&inode_lock);

+   iunique_unregister(inode);
+
wake_up_inode(inode);
destroy_inode(inode);
nr_disposed++;
@@ -706,6 +708,32 @@ retry:

 EXPORT_SYMBOL(iunique);

+int iunique_register(struct inode *inode, int max_reserved)
+{
+   int rv;
+
+   rv = idr_pre_get(&inode->i_sb->s_inode_ids, GFP_KERNEL);
+   if (!rv)
+   return -ENOMEM;
+
+   spin_lock(&inode->i_sb->s_inode_ids_lock);
+   rv = idr_get_new_above(&inode->i_sb->s_inode_ids, inode,
+   max_reserved+1, (int *) &inode->i_ino);
+   inode->i_generation = inode->i_sb->s_generation++;
+   spin_unlock(&inode->i_sb->s_inode_ids_lock);
+   return rv;
+}
+EXPORT_SYMBOL_GPL(iunique_register);
+
+void iunique_unregister(struct inode *inode)
+{
+   spin_lock(&inode->i_sb->s_inode_ids_lock);
+   if (idr_find(&inode->i_sb->s_inode_ids, (int) inode->i_ino))
+   idr_remove(&inode->i_sb->s_inode_ids, (int) inode->i_ino);
+   spin_unlock(&inode->i_sb->s_inode_ids_lock);
+}
+EXPORT_SYMBOL_GPL(iunique_unregister);
+
 struct inode *igrab(struct inode *inode)
 {
spin_lock(&inode_lock);
@@ -1025,6 +1053,7 @@ void generic_delete_inode(struct inode *
spin_lock(&inode_lock);
hlist_del_init(&inode->i_hash);
spin_unlock(&inode_lock);
+   iunique_unregister(inode);
wake_up_inode(inode);
BUG_ON(inode->i_state != I_CLEAR);
destroy_inode(inode);
@@ -1057,6 +1086,7 @@ static void generic_forget_inode(struct
inode->i_state |= I_FREEING;
inodes_stat.nr_inodes--;
spin_unlock(&inode_lock);
+   iunique_unregister(inode);
if (inode->i_data.nrpages)
truncate_inode_pages(&inode->i_data, 0);
clear_inode(inode);
diff --git a/fs/pipe.c b/fs/pipe.c
index b1626f2..74dbbf0 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -845,6 +845,9 @@ static struct inode * get_pipe_inode(voi
if (!inode)
goto fail_inode;

+   if (iunique_register(inode, 0))
+

Re: [PATCH] prune_icache_sb

2006-12-01 Thread Andrew Morton
On Thu, 30 Nov 2006 11:05:32 -0500
Wendy Cheng <[EMAIL PROTECTED]> wrote:

> How about a simple and plain change with this uploaded patch 
> 
> The idea is, instead of unconditionally dropping every buffer associated 
> with the particular mount point (that defeats the purpose of page 
> caching), base kernel exports the "drop_pagecache_sb()" call that allows 
> page cache to be trimmed. More importantly, it is changed to offer the 
> choice of not randomly purging any buffer but the ones that seem to be 
> unused (i_state is NULL and i_count is zero). This will encourage 
> filesystem(s) to proactively respond to vm memory shortage if they 
> so choose.

argh.

In Linux a filesystem is a dumb layer which sits between the VFS and the
I/O layer and provides dumb services such as reading/writing inodes,
reading/writing directory entries, mapping pagecache offsets to disk
blocks, etc.  (This model is to varying degrees incorrect for every
post-ext2 filesystem, but that's the way it is).

We do not want to go "encouraging" filesystems to play games tuning and
trimming VFS caches and things like that.  If a patch doing that were to
turn up it would be heartily shouted at and the originator would be asked
to go off and implement the functionality in core VFS so that a) all
filesystems can immediately utilise it and b) other filesystems aren't
tempted to go off and privately implement something similar.

So please bear this philosophy in mind, and think about this feature from
that perspective.

One approach might be to add a per-superblock upper-bound on the number of
cached dentries and/or inodes.  Typically that would be controlled by a
(re)mount option.  Although we could perhaps discuss a sysfs representation
of this (and, presumably, other mount options).

But I'd expect such a proposal to have a hard time, because we'd need to
know why such a thing is needed: we prefer auto-tuning, and that's what we
have now, so what's gone wrong with it and how can we fix it, rather than
adding a manual override?

>  From our end (cluster locks are expensive - that's why we cache them), 
> one of our kernel daemons will invoke this newly exported call based on 
> a set of pre-defined tunables. It is then followed by a lock reclaim 
> logic to trim the locks by checking the page cache associated with the 
> inode (that this cluster lock is created for). If nothing is attached to 
> the inode (based on i_mapping->nrpages count), we know it is a good 
> candidate for trimming and will subsequently drop this lock (instead of 
> waiting until the end of vfs inode life cycle).

Again, I don't understand why you're tying the lifetime of these locks to
the VFS inode reclaim mechanisms.  Seems odd.

If you want to put an upper bound on the number of in-core locks, why not
string them on a list and throw away the old ones when the upper bound is
reached?
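
(A minimal sketch of that shape, with an assumed cluster_lock type and
release function purely for illustration:)

#include <linux/list.h>
#include <linux/spinlock.h>

struct cluster_lock {
	struct list_head lru;
	/* ... lock state ... */
};

extern void release_cluster_lock(struct cluster_lock *cl);	/* assumed */

static LIST_HEAD(lock_lru);
static DEFINE_SPINLOCK(lock_lru_lock);
static unsigned long nr_cached_locks;
static unsigned long max_cached_locks = 100000;			/* tunable */

static void lock_cache_add(struct cluster_lock *cl)
{
	struct cluster_lock *victim = NULL;

	spin_lock(&lock_lru_lock);
	list_add(&cl->lru, &lock_lru);
	if (++nr_cached_locks > max_cached_locks) {
		/* oldest entry sits at the tail; drop it */
		victim = list_entry(lock_lru.prev, struct cluster_lock, lru);
		list_del(&victim->lru);
		nr_cached_locks--;
	}
	spin_unlock(&lock_lru_lock);

	if (victim)
		release_cluster_lock(victim);
}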

Did you look at improving that lock-lookup algorithm, btw?  Core kernel has
no problem maintaining millions of cached VFS objects - is there any reason
why your lock lookup cannot be similarly efficient?

> Note that I could do invalidate_inode_pages() within our kernel modules 
> to accomplish what drop_pagecache_sb() does (without coming here to bug 
> people) but I don't have access to inode_lock as an external kernel 
> module. So either EXPORT_SYMBOL(inode_lock) or this patch ?
> 
> The end result (of this change) should encourage filesystems to actively 
> avoid depleting too much memory

That is _not_ a filesytem responsibility!  inode cache is owned and
maintained by the VFS.

> and we'll encourage our applications to 
> understand clustering locality issues.

?

> Haven't tested this out though - would appreciate some comments before 
> spending more efforts on this direction.
> 
> -- Wendy
> 
> 
> 


Re: NFSv4/pNFS possible POSIX I/O API standards

2006-12-01 Thread Sage Weil

On Fri, 1 Dec 2006, Trond Myklebust wrote:

Also, it's a tiring and trivial example, but even the 'ls -al' scenario
isn't ideally addressed by readdir()+statlite(), since statlite() might
return size/mtime from before 'ls -al' was executed by the user.


stat() will do the same.


It does with NFS, but only because NFS doesn't follow POSIX in that 
regard.  In general, stat() is supposed to return a value that's 
accurate at the time of the call.


(Although now I'm confused again.  If you're assuming stat() can return 
cached results, why do you think statlite() is useful?)



Currently, you will never get anything other than weak consistency with
NFS whether you are talking about stat(), access(), getacl(),
lseek(SEEK_END), or append(). Your 'permitting it' only in statlite() is
irrelevant to the facts on the ground: I am not changing the NFS client
caching model in any way that would affect existing applications.


Clearly, if you cache attributes on the client and provide only weak 
consistency, then readdirplus() doesn't change much.  But _other_ non-NFS 
filesystems may elect to provide POSIX semantics and strong consistency, 
even though NFS doesn't.  And the interface simply doesn't allow that to 
be done efficiently in distributed environments, because applications 
can't communicate their varying consistency needs.  Instead, systems like 
NFS weaken attribute consistency globally.  That works well enough for 
most people most of the time, but it's hardly ideal.


readdirplus() allows applications like 'ls -al' to distinguish themselves 
from applications that want individually accurate stat() results.  That in 
turn allows distributed filesystems that are both strongly consistent 
_and_ efficient at scale.  In most cases, it'll trivially turn into a 
readdir()+stat() in the VFS, but in some cases filesystems can exploit 
that information for (often enormous) performance gain, while still 
maintaining well-defined consistency semantics.  readdir() already leaks 
some inode information into its result (via d_type)... I'm not sure I 
understand the resistance to providing more.
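
For illustration only (readdirplus() and the entry struct are part of the
proposal, not an existing API, so the prototype and layout below are
assumptions):

#include <dirent.h>
#include <sys/stat.h>
#include <stddef.h>

struct dirent_plus {
	struct dirent	d_dirent;	/* the usual readdir() entry */
	struct stat	d_stat;		/* attributes valid as of some point
					 * after opendir(), per the proposal */
};

/* proposed: names and attributes come back together, so the filesystem
 * can batch the attribute fetch and knows the caller only needs
 * "sometime after opendir()" accuracy */
extern int readdirplus(DIR *dirp, struct dirent_plus *entries, size_t nentries);

/* today an 'ls -al'-style tool must loop readdir() plus one stat() per
 * entry, and each stat() is expected to be individually accurate:
 *
 *	while ((de = readdir(dirp)) != NULL)
 *		stat(de->d_name, &st);
 */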


sage


Re: NFSv4/pNFS possible POSIX I/O API standards

2006-12-01 Thread Trond Myklebust
On Fri, 2006-12-01 at 10:42 -0800, Sage Weil wrote:
> On Fri, 1 Dec 2006, Trond Myklebust wrote:
> > I'm quite happy with a proposal for a statlite(). I'm objecting to
> > readdirplus() because I can't see that it offers you anything useful.
> > You haven't provided an example of an application which would clearly
> > benefit from a readdirplus() interface instead of readdir()+statlite()
> > and possibly some tools for managing cache consistency.
> 
> Okay, now I think I understand where you're coming from.
> 
> The difference between readdirplus() and readdir()+statlite() is that 
> (depending on the mask you specify) statlite() either provides the "right" 
> answer (ala stat()), or anything that is vaguely "recent."  readdirplus() 
> would provide size/mtime from sometime _after_ the initial opendir() call, 
> establishing a useful ordering.  So without readdirplus(), you either get 
> readdir()+stat() and the performance problems I mentioned before, or 
> readdir()+statlite() where "recent" may not be good enough.
> 
> Instead of my previous example of process #1 waiting for process #2 to 
> finish and then checking the results with stat(), imagine instead that #1 
> is waiting for 100,000 other processes to finish, and then wants to check 
> the results (size/mtime) of all of them.  readdir()+statlite() won't 
> work, and readdir()+stat() may be pathologically slow.
> 
> Also, it's a tiring and trivial example, but even the 'ls -al' scenario 
> isn't ideally addressed by readdir()+statlite(), since statlite() might 
> return size/mtime from before 'ls -al' was executed by the user.

stat() will do the same.

> One can 
> easily imagine modifying a file on one host, then doing 'ls -al' on 
> another host and not seeing the effects.  If 'ls -al' can use 
> readdirplus(), its overall application semantics can be preserved without 
> hammering large directories in a distributed filesystem.

So readdirplus() would not even be cached? Yech!

> > I agree that an interface which allows a userland process to offer hints to
> > the kernel as to what kind of cache consistency it requires for file
> > metadata would be useful. We already have stuff like posix_fadvise() etc
> > for file data, and perhaps it might be worth looking into how you could
> > devise something similar for metadata.
> > If what you really want is for applications to be able to manage network
> > filesystem cache consistency, then why not provide those tools instead?
> 
> True, something to manage the attribute cache consistency for statlite() 
> results would also address the issue by letting an application declare how 
> weak its results are allowed to be.  That seems a bit more awkward, 
> though, and would only affect statlite()--the only call that allows weak 
> consistency in the first place.  In contrast, readdirplus maps nicely onto 
> what filesystems like NFS are already doing over the wire.

Currently, you will never get anything other than weak consistency with
NFS whether you are talking about stat(), access(), getacl(),
lseek(SEEK_END), or append(). Your 'permitting it' only in statlite() is
irrelevant to the facts on the ground: I am not changing the NFS client
caching model in any way that would affect existing applications.

Cheers
  Trond



Re: NFSv4/pNFS possible POSIX I/O API standards

2006-12-01 Thread Sage Weil

On Fri, 1 Dec 2006, Trond Myklebust wrote:

I'm quite happy with a proposal for a statlite(). I'm objecting to
readdirplus() because I can't see that it offers you anything useful.
You haven't provided an example of an application which would clearly
benefit from a readdirplus() interface instead of readdir()+statlite()
and possibly some tools for managing cache consistency.


Okay, now I think I understand where you're coming from.

The difference between readdirplus() and readdir()+statlite() is that 
(depending on the mask you specify) statlite() either provides the "right" 
answer (ala stat()), or anything that is vaguely "recent."  readdirplus() 
would provide size/mtime from sometime _after_ the initial opendir() call, 
establishing a useful ordering.  So without readdirplus(), you either get 
readdir()+stat() and the performance problems I mentioned before, or 
readdir()+statlite() where "recent" may not be good enough.
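
(A sketch of the statlite() idea for reference; the struct, mask bits, and
prototype are assumptions made for illustration and may differ from the
actual proposal:)

#include <sys/types.h>
#include <time.h>

struct stat_lite {
	unsigned long	sl_litemask;	/* which fields must be accurate */
	mode_t		sl_mode;	/* always accurate */
	off_t		sl_size;	/* accurate only if requested */
	time_t		sl_mtime;	/* accurate only if requested */
	/* ... remaining stat fields ... */
};

#define SLITE_SIZE	0x1		/* caller needs a current size */
#define SLITE_MTIME	0x2		/* caller needs a current mtime */

/* caller sets sl_litemask to say which fields need stat()-strength
 * accuracy; anything else may come from a possibly stale cache */
extern int statlite(const char *path, struct stat_lite *buf);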


Instead of my previous example of process #1 waiting for process #2 to 
finish and then checking the results with stat(), imagine instead that #1 
is waiting for 100,000 other processes to finish, and then wants to check 
the results (size/mtime) of all of them.  readdir()+statlite() won't 
work, and readdir()+stat() may be pathologically slow.


Also, it's a tiring and trivial example, but even the 'ls -al' scenario 
isn't ideally addressed by readdir()+statlite(), since statlite() might 
return size/mtime from before 'ls -al' was executed by the user.  One can 
easily imagine modifying a file on one host, then doing 'ls -al' on 
another host and not seeing the effects.  If 'ls -al' can use 
readdirplus(), its overall application semantics can be preserved without 
hammering large directories in a distributed filesystem.



I agree that an interface which allows a userland process to offer hints to
the kernel as to what kind of cache consistency it requires for file
metadata would be useful. We already have stuff like posix_fadvise() etc
for file data, and perhaps it might be worth looking into how you could
devise something similar for metadata.
If what you really want is for applications to be able to manage network
filesystem cache consistency, then why not provide those tools instead?


True, something to manage the attribute cache consistency for statlite() 
results would also address the issue by letting an application declare how 
weak its results are allowed to be.  That seems a bit more awkward, 
though, and would only affect statlite()--the only call that allows weak 
consistency in the first place.  In contrast, readdirplus maps nicely onto 
what filesystems like NFS are already doing over the wire.


sage


Re: NFSv4/pNFS possible POSIX I/O API standards

2006-12-01 Thread Trond Myklebust
On Fri, 2006-12-01 at 08:47 -0800, Sage Weil wrote:
> On Fri, 1 Dec 2006, Trond Myklebust wrote:
> > 'ls --color' and 'find' don't give a toss about most of the arguments 
> > from 'stat()'. They just want to know what kind of filesystem object 
> > they are dealing with. We already provide that information in the 
> > readdir() syscall via the 'd_type' field. Adding all the other stat() 
> > information is just going to add unnecessary synchronisation burdens.
> 
> 'ls -al' cares about the stat() results, but does not care about the 
> relative timing accuracy wrt the preceding readdir().  I'm not sure why 
> 'ls --color' still calls stat when it can get that from the readdir() 
> results, but either way it's asking more from the kernel/filesystem than 
> it needs.
> 
> >> Something like 'ls' certainly doesn't care, but in general applications do
> >> care that stat() results aren't cached.  They expect the stat results to
> >> reflect the file's state at a point in time _after_ they decide to call
> >> stat().  For example, for process A to see how much data a just-finished
> >> process B wrote to a file...
> >
> > AFAICS, it will not change any consistency semantics. The main
> > irritation it will introduce will be that the NFS client will suddenly
> > have to do things like synchronising readdirplus() and file write() in
> > order to provide the POSIX guarantees that you mentioned.
> >
> > i.e: if someone has written data to one of the files in the directory,
> > then an NFS client will now have to flush that data out before calling
> > readdir so that the server returns the correct m/ctime or file size.
> > Previously, it could delay that until the stat() call.
> 
> It sounds like you're talking about a single (asynchronous) client in a 
> directory.  In that case, the client need only flush if someone calls 
> readdirplus() instead of readdir(), and since readdirplus() is effectively 
> also a stat(), the situation isn't actually any different.
> 
> The more interesting case is multiple clients in the same directory.  In 
> order to provide strong consistency, both stat() and readdir() have to 
> talk to the server (or more complicated leasing mechanisms are needed). 

Why would that be interesting? What applications do you have that
require strong consistency in that scenario? I keep looking for uses for
strong cache consistency with no synchronisation, but I have yet to meet
someone who has an actual application that relies on it.

> In that scenario, readdirplus() is asking for _less_ 
> synchronization/consistency of results than readdir()+stat(), not more. 
> i.e. both the readdir() and stat() would require a server request in order 
> to achieve the standard POSIX semantics, while a readdirplus() would allow 
> a single request.  The NFS client already provides weak consistency of 
> stat() results for clients.  Extending the interface doesn't suddenly 
> require the NFS client to provide strong consistency, it just makes life 
> easier for the implementation if it (or some other filesystem) chooses to 
> do so.

I'm quite happy with a proposal for a statlite(). I'm objecting to
readdirplus() because I can't see that it offers you anything useful.
You haven't provided an example of an application which would clearly
benefit from a readdirplus() interface instead of readdir()+statlite()
and possibly some tools for managing cache consistency.

> Consider two use cases.  Process A is 'ls -al', who doesn't really care 
> about when the size/mtime are from (i.e. sometime after opendir()). 
> Process B waits for a process on another host to write to a file, and then 
> calls stat() locally to check the result.  In order for B to get the 
> correct result, stat() _must_ return a value for size/mtime from _after_ 
> the stat() initiated.  That makes 'ls -al' slow, because it probably has 
> to talk to the server to make sure files haven't been modified between the 
> readdir() and stat().  In reality, 'ls -al' doesn't care, but the 
> filesystem has no way to know that without the presense of readdirplus(). 
> Alternatively, an NFS (or other distributed filesystem) client can cache 
> file attributes to make 'ls -al' fast, and simply break process B (as NFS 
> currently does).  readdirplus() makes it clear what 'ls -al' doesn't need, 
> allowing the client (if it so chooses) to avoid breaking B in the general 
> case.  That simply isn't possible to explicitly communicate with the 
> existing interface.  How is that not a win?

Using readdir() to monitor size/mtime on individual files is hardly a
case we want to optimise for. There are better tools, including
inotify() for applications that care.

I agree that an interface which allows a userland process to offer hints to
the kernel as to what kind of cache consistency it requires for file
metadata would be useful. We already have stuff like posix_fadvise() etc
for file data, and perhaps it might be worth looking into how you could
devise something simil

Re: [RFC][PATCH] ensure i_ino uniqueness in filesystems without permanent inode numbers (via idr)

2006-12-01 Thread Randy Dunlap

Jeff Layton wrote:

On Fri, Dec 01, 2006 at 08:52:27AM -0800, Randy Dunlap wrote:

Thanks for having a look, Randy...


s/idr_/iunique_/


Doh! Can you tell I cut and pasted this email from earlier ones? :-)


- don't attempt to remove inodes with values <100

Please explain that one.  (May be obvious to some, but not to me.)


Actually, we probably don't need to do that now. My thought here was to add
a low range of i_ino numbers that could be used by the filesystem code without
needing to call iunique (in particular for things like the root inode in the
filesystem). It's probably best to not do this though and let the filesystem
handle it on its own.


Better to post patches inline (for review) rather than as attachments.


Here's an updated (but untested) patch based on your suggestions. I also went
ahead and made the exported symbols GPL-only since that seems like it would be
appropriate here. Any further thoughts on it?


Just needs a new/updated patch description and one "typo" fixed.


diff --git a/fs/inode.c b/fs/inode.c
index 26cdb11..e45cec9 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -706,6 +708,32 @@ retry:
 
 EXPORT_SYMBOL(iunique);
 
+int iunique_register(struct inode *inode, int max_reserved)

+{
+   int rv;
+
+   rv = idr_pre_get(&inode->i_sb->s_inode_ids, GFP_KERNEL);
+   if (! rv)


No space after !, just:
if (!rv)


+   return -ENOMEM;
+
+   spin_lock(&inode->i_sb->s_inode_ids_lock);
+   rv = idr_get_new_above(&inode->i_sb->s_inode_ids, inode,
+   max_reserved+1, (int *) &inode->i_ino);
+   inode->i_generation = inode->i_sb->s_generation++;
+   spin_unlock(&inode->i_sb->s_inode_ids_lock);
+   return rv;
+}


Thanks.
--
~Randy


Re: [RFC][PATCH] ensure i_ino uniqueness in filesystems without permanent inode numbers (via idr)

2006-12-01 Thread Jeff Layton
On Fri, Dec 01, 2006 at 08:52:27AM -0800, Randy Dunlap wrote:

Thanks for having a look, Randy...

> s/idr_/iunique_/

Doh! Can you tell I cut and pasted this email from earlier ones? :-)

> > - don't attempt to remove inodes with values <100
> 
> Please explain that one.  (May be obvious to some, but not to me.)

Actually, we probably don't need to do that now. My thought here was to add
a low range of i_ino numbers that could be used by the filesystem code without
needing to call iunique (in particular for things like the root inode in the
filesystem). It's probably best to not do this though and let the filesystem
handle it on its own.

> Better to post patches inline (for review) rather than as attachments.

Here's an updated (but untested) patch based on your suggestions. I also went
ahead and made the exported symbols GPL-only since that seems like it would be
appropriate here. Any further thoughts on it?

Signed-off-by: Jeff Layton <[EMAIL PROTECTED]>

diff --git a/fs/inode.c b/fs/inode.c
index 26cdb11..e45cec9 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -288,6 +288,8 @@ static void dispose_list(struct list_hea
list_del_init(&inode->i_sb_list);
spin_unlock(&inode_lock);
 
+   iunique_unregister(inode);
+
wake_up_inode(inode);
destroy_inode(inode);
nr_disposed++;
@@ -706,6 +708,32 @@ retry:
 
 EXPORT_SYMBOL(iunique);
 
+int iunique_register(struct inode *inode, int max_reserved)
+{
+   int rv;
+
+   rv = idr_pre_get(&inode->i_sb->s_inode_ids, GFP_KERNEL);
+   if (! rv)
+   return -ENOMEM;
+
+   spin_lock(&inode->i_sb->s_inode_ids_lock);
+   rv = idr_get_new_above(&inode->i_sb->s_inode_ids, inode,
+   max_reserved+1, (int *) &inode->i_ino);
+   inode->i_generation = inode->i_sb->s_generation++;
+   spin_unlock(&inode->i_sb->s_inode_ids_lock);
+   return rv;
+}
+EXPORT_SYMBOL_GPL(iunique_register);
+
+void iunique_unregister(struct inode *inode)
+{
+   spin_lock(&inode->i_sb->s_inode_ids_lock);
+   if (idr_find(&inode->i_sb->s_inode_ids, (int) inode->i_ino))
+   idr_remove(&inode->i_sb->s_inode_ids, (int) inode->i_ino);
+   spin_unlock(&inode->i_sb->s_inode_ids_lock);
+}
+EXPORT_SYMBOL_GPL(iunique_unregister);
+
 struct inode *igrab(struct inode *inode)
 {
spin_lock(&inode_lock);
@@ -1025,6 +1053,7 @@ void generic_delete_inode(struct inode *
spin_lock(&inode_lock);
hlist_del_init(&inode->i_hash);
spin_unlock(&inode_lock);
+   iunique_unregister(inode);
wake_up_inode(inode);
BUG_ON(inode->i_state != I_CLEAR);
destroy_inode(inode);
@@ -1057,6 +1086,7 @@ static void generic_forget_inode(struct 
inode->i_state |= I_FREEING;
inodes_stat.nr_inodes--;
spin_unlock(&inode_lock);
+   iunique_unregister(inode);
if (inode->i_data.nrpages)
truncate_inode_pages(&inode->i_data, 0);
clear_inode(inode);
diff --git a/fs/pipe.c b/fs/pipe.c
index b1626f2..d74ae65 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -845,6 +845,9 @@ static struct inode * get_pipe_inode(voi
if (!inode)
goto fail_inode;
 
+   if (iunique_register(inode, 0))
+   goto fail_iput;
+
pipe = alloc_pipe_info(inode);
if (!pipe)
goto fail_iput;
diff --git a/fs/super.c b/fs/super.c
index 47e554c..d2dbdec 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -93,6 +93,8 @@ static struct super_block *alloc_super(s
s->s_qcop = sb_quotactl_ops;
s->s_op = &default_op;
s->s_time_gran = 10;
+   idr_init(&s->s_inode_ids);
+   spin_lock_init(&s->s_inode_ids_lock);
}
 out:
return s;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 2fe6e3f..3afb4a2 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -278,6 +278,7 @@ #include 
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -961,6 +962,12 @@ #endif
/* Granularity of c/m/atime in ns.
   Cannot be worse than a second */
u32s_time_gran;
+
+   /* for fs's with dynamic i_ino values, track them with idr, and increment
+  the generation every time we register a new inode */
+   __u32   s_generation;
+   struct idr  s_inode_ids;
+   spinlock_t  s_inode_ids_lock;
 };
 
 extern struct timespec current_fs_time(struct super_block *sb);
@@ -1681,6 +1688,8 @@ extern void inode_init_once(struct inode
 extern void iput(struct inode *);
 extern struct inode * igrab(struct inode *);
 extern ino_t iunique(struct super_block *, ino_t);
+extern int iunique_register(struct inode *inode, int max_reserved);
+extern void iunique_unregister(struct inode *inode);
 extern int inode_needs_sync(struct inode *inode);
 extern void generic_delete_inode(struct inode *inode

Re: [RFC][PATCH] ensure i_ino uniqueness in filesystems without permanent inode numbers (via idr)

2006-12-01 Thread Randy Dunlap
On Fri, 01 Dec 2006 09:48:36 -0500 Jeff Layton wrote:

> This patch is a proof of concept. It works, but still needs a bit of
> polish before it's ready for submission. First, the problems:
> 
> 
> This patch is a first step at correcting these problems. This adds 2 new
> functions, an idr_register and idr_unregister. Filesystems can call
> idr_register at inode creation time, and then at deletion time, we'll
> automatically unregister them.

s/idr_/iunique_/

> This patch also adds a new s_generation counter to the superblock.
> Because i_ino's can be reused so quickly, we don't want NFS getting
> confused when it happens. When iunique_register is called, we'll assign
> the s_generation value to the i_generation, and then increment it to
> help ensure that we get different filehandles.
> 
> There are some things that need to be cleaned up, of course:
> 
> - error handling for the idr calls
> 
> - recheck all the possible places where the inode should be unhashed
> 
> - don't attempt to remove inodes with values <100

Please explain that one.  (May be obvious to some, but not to me.)

> - convert other filesystems
> 
> - remove the static counter from new_inode and (maybe) eliminate iunique
> 
> Comments and suggestions appreciated.


Better to post patches inline (for review) rather than as attachments.

Some (mostly style) comments on the patch:

+   rv = idr_pre_get(&inode->i_sb->s_inode_ids, GFP_KERNEL);
+   if (! rv)
+   return -ENOMEM;

if (!rv)

+   rv = idr_get_new_above(&inode->i_sb->s_inode_ids, inode,
+   max_reserved+1, (int *) &inode->i_ino);

max_reserved + 1,

+}
+
+EXPORT_SYMBOL(iunique_register);

No need for the extra blank line after the function closing
brace.  Just put the EXPORT_SYMBOL immediately on the next line.
(in multiple places)

@@ -1681,6 +1688,8 @@ extern void inode_init_once(struct inode
 extern void iput(struct inode *);
 extern struct inode * igrab(struct inode *);
 extern ino_t iunique(struct super_block *, ino_t);
+extern int iunique_register(struct inode *, int);
+extern void iunique_unregister(struct inode *);
 extern int inode_needs_sync(struct inode *inode);
 extern void generic_delete_inode(struct inode *inode);
 extern void generic_drop_inode(struct inode *inode);

Some of these have a parameter name, some don't.
Having a (meaningful) parameter name is strongly preferred.

---
~Randy


Re: NFSv4/pNFS possible POSIX I/O API standards

2006-12-01 Thread Sage Weil

On Fri, 1 Dec 2006, Trond Myklebust wrote:
'ls --color' and 'find' don't give a toss about most of the arguments 
from 'stat()'. They just want to know what kind of filesystem object 
they are dealing with. We already provide that information in the 
readdir() syscall via the 'd_type' field. Adding all the other stat() 
information is just going to add unnecessary synchronisation burdens.


'ls -al' cares about the stat() results, but does not care about the 
relative timing accuracy wrt the preceding readdir().  I'm not sure why 
'ls --color' still calls stat when it can get that from the readdir() 
results, but either way it's asking more from the kernel/filesystem than 
it needs.



Something like 'ls' certainly doesn't care, but in general applications do
care that stat() results aren't cached.  They expect the stat results to
reflect the file's state at a point in time _after_ they decide to call
stat().  For example, for process A to see how much data a just-finished
process B wrote to a file...


AFAICS, it will not change any consistency semantics. The main
irritation it will introduce will be that the NFS client will suddenly
have to do things like synchronising readdirplus() and file write() in
order to provide the POSIX guarantees that you mentioned.

i.e: if someone has written data to one of the files in the directory,
then an NFS client will now have to flush that data out before calling
readdir so that the server returns the correct m/ctime or file size.
Previously, it could delay that until the stat() call.


It sounds like you're talking about a single (asynchronous) client in a 
directory.  In that case, the client need only flush if someone calls 
readdirplus() instead of readdir(), and since readdirplus() is effectively 
also a stat(), the situation isn't actually any different.


The more interesting case is multiple clients in the same directory.  In 
order to provide strong consistency, both stat() and readdir() have to 
talk to the server (or more complicated leasing mechanisms are needed). 
In that scenario, readdirplus() is asking for _less_ 
synchronization/consistency of results than readdir()+stat(), not more. 
i.e. both the readdir() and stat() would require a server request in order 
to achieve the standard POSIX semantics, while a readdirplus() would allow 
a single request.  The NFS client already provides weak consistency of 
stat() results for clients.  Extending the interface doesn't suddenly 
require the NFS client to provide strong consistency, it just makes life 
easier for the implementation if it (or some other filesystem) chooses to 
do so.


Consider two use cases.  Process A is 'ls -al', who doesn't really care 
about when the size/mtime are from (i.e. sometime after opendir()). 
Process B waits for a process on another host to write to a file, and then 
calls stat() locally to check the result.  In order for B to get the 
correct result, stat() _must_ return a value for size/mtime from _after_ 
the stat() initiated.  That makes 'ls -al' slow, because it probably has 
to talk to the server to make sure files haven't been modified between the 
readdir() and stat().  In reality, 'ls -al' doesn't care, but the 
filesystem has no way to know that without the presence of readdirplus(). 
Alternatively, an NFS (or other distributed filesystem) client can cache 
file attributes to make 'ls -al' fast, and simply break process B (as NFS 
currently does).  readdirplus() makes it clear what 'ls -al' doesn't need, 
allowing the client (if it so chooses) to avoid breaking B in the general 
case.  That simply isn't possible to explicitly communicate with the 
existing interface.  How is that not a win?


I imagine that most of the time readdirplus() will hit something in the 
VFS that simply calls readdir() and stat().  But a smart NFS (or other 
network filesystem) client can opt to send a readdirplus over the wire 
for readdirplus() without sacrificing stat() consistency in the general 
case.


sage


Re: [patch 3/3] fs: fix cont vs deadlock patches

2006-12-01 Thread OGAWA Hirofumi
OGAWA Hirofumi <[EMAIL PROTECTED]> writes:

> Ah, unfortunately we can't do this. If we don't update ->i_size before
> page_cache_release, pdflush will think these pages are outside ->i_size
> and just clean the page without writing it.

Ugh, of course, s/page_cache_release/unlock_page/
-- 
OGAWA Hirofumi <[EMAIL PROTECTED]>


Re: NFSv4/pNFS possible POSIX I/O API standards

2006-12-01 Thread Ric Wheeler



Andreas Dilger wrote:

On Nov 29, 2006  09:04 +, Christoph Hellwig wrote:

 - readdirplus

This one is completely unneeded as a kernel API.  Doing readdir
plus calls on the wire makes a lot of sense and we already do
that for NFSv3+.  Doing this at the syscall layer just means
kernel bloat - syscalls are very cheap.


The question is how does the filesystem know that the application is
going to do readdir + stat every file?  It has to do this as a heuristic
implemented in the filesystem to determine if the ->getattr() calls match
the ->readdir() order.  If the application knows that it is going to be
doing this (e.g. ls, GNU rm, find, etc) then why not let the filesystem
take advantage of this information?  If combined with the statlite
interface, it can make a huge difference for clustered filesystems.




I think that this kind of heuristic would be a win for local file systems with a 
huge number of files as well...


ric


Re: [patch 3/3] fs: fix cont vs deadlock patches

2006-12-01 Thread OGAWA Hirofumi
Nick Piggin <[EMAIL PROTECTED]> writes:

>> >status = __block_prepare_write(inode, new_page, zerofrom,
>> >PAGE_CACHE_SIZE, get_block);
>> >if (status)
>> > @@ -2110,7 +2111,7 @@
>> >memset(kaddr+zerofrom, 0, PAGE_CACHE_SIZE-zerofrom);
>> >flush_dcache_page(new_page);
>> >kunmap_atomic(kaddr, KM_USER0);
>> > -  generic_commit_write(NULL, new_page, zerofrom, PAGE_CACHE_SIZE);
>> > +  __block_commit_write(inode, new_page, zerofrom, 
>> > PAGE_CACHE_SIZE);
>> 
>> Whatever function this is doesn't need to update i_size?
>
> Yes, it is the code in cont_prepare_write that is expanding a hole
> at the end of file.
>
> We can do this now because fat_commit_write is now changed to call
> generic_commit_write in the case of a non-zero length.
>
> I think it is an improvement because now the file will not get
> arbitrarily extended in the case of a write failure somewhere down
> the track.

Ah, unfortunately we can't do this. If we don't update ->i_size before
page_cache_release, pdflush will think these pages are outside ->i_size
and just clean the page without writing it.
-- 
OGAWA Hirofumi <[EMAIL PROTECTED]>


Re: NFSv4/pNFS possible POSIX I/O API standards

2006-12-01 Thread Trond Myklebust
On Thu, 2006-11-30 at 23:08 -0800, Sage Weil wrote:
> I mean atomic only in the sense that the stat result returned by 
> readdirplus() would reflect the file state at some point during the time 
> consumed by that system call.  In contrast, when you call stat() 
> separately, it's expected that the result you get back reflects the state 
> at some time during the stat() call, and not the readdir() that may 
> have preceeded it.  readdir() results may be weakly cached, but stat() 
> results normally aren't (ignoring the usual NFS behavior for the moment).
> 
> It's the stat() part of readdir() + stat() that makes life unnecessarily 
> difficult for a filesystem providing strong consistency.  How can the 
> filesystem know that 'ls' doesn't care if the stat() results are accurate 
> at the time of the readdir() and not the subsequent stat()?  Something 
> like readdirplus() allows that to be explicitly communicated, without 
> resorting to heuristics or weak metadata consistency (ala NFS attribute 
> caching).  For distributed or network filesystems that can be a big win. 
> (Admittedly, there's probably little benefit for local filesystems beyond 
> the possibility of better prefetching, if syscalls are as cheap as 
> Christoph says.)

'ls --color' and 'find' don't give a toss about most of the arguments
from 'stat()'. They just want to know what kind of filesystem object
they are dealing with. We already provide that information in the
readdir() syscall via the 'd_type' field.
Adding all the other stat() information is just going to add unnecessary
synchronisation burdens.

> > Besides, why would your application care about atomicity of the
> > attribute information unless you also have some form of locking to
> > guarantee that said information remains valid until you are done
> > processing it?
> 
> Something like 'ls' certainly doesn't care, but in general applications do 
> care that stat() results aren't cached.  They expect the stat results to 
> reflect the file's state at a point in time _after_ they decide to call 
> stat().  For example, for process A to see how much data a just-finished 
> process B wrote to a file...

AFAICS, it will not change any consistency semantics. The main
irritation it will introduce will be that the NFS client will suddenly
have to do things like synchronising readdirplus() and file write() in
order to provide the POSIX guarantees that you mentioned.
i.e: if someone has written data to one of the files in the directory,
then an NFS client will now have to flush that data out before calling
readdir so that the server returns the correct m/ctime or file size.
Previously, it could delay that until the stat() call.

Trond



[RFC][PATCH] ensure i_ino uniqueness in filesystems without permanent inode numbers (via idr)

2006-12-01 Thread Jeff Layton

This patch is a proof of concept. It works, but still needs a bit of
polish before it's ready for submission. First, the problems:

1) on filesystems w/o permanent inode numbers, i_ino values can be
larger than 32 bits, which can cause problems for some 32 bit userspace
programs on a 64 bit kernel.

2) many filesystems call new_inode and assume that the i_ino values they
are given are unique. They are not guaranteed to be so, since the static
counter can wrap.

3) after allocating a new inode, some filesystems call iunique to try to
get a unique i_ino value, but they don't actually add their inodes to
the hashtable, so they're still not guaranteed to be unique.

This patch is a first step at correcting these problems. This adds 2 new
functions, an idr_register and idr_unregister. Filesystems can call
idr_register at inode creation time, and then at deletion time, we'll
automatically unregister them.

This patch also adds a new s_generation counter to the superblock.
Because i_ino's can be reused so quickly, we don't want NFS getting
confused when it happens. When iunique_register is called, we'll assign
the s_generation value to the i_generation, and then increment it to
help ensure that we get different filehandles.

There are some things that need to be cleaned up, of course:

- error handling for the idr calls

- recheck all the possible places where the inode should be unhashed

- don't attempt to remove inodes with values <100

- convert other filesystems

- remove the static counter from new_inode and (maybe) eliminate iunique

The patch also converts pipefs to use the new scheme as an example. Al
Viro had expressed some concern with an earlier patch that this might
slow down pipe creation. I've done some testing and I think the impact
will be minimal. Timing a small program that creates and closes 100
million pipes in a loop:

patched:
--------
real    8m8.623s
user    0m37.418s
sys     7m31.196s

unpatched:
----------
real    8m7.150s
user    0m40.943s
sys     7m26.204s

As the number of pipes grows on the system, this time may grow somewhat
but it doesn't seem like it would be terrible.

Comments and suggestions appreciated.

Signed-off-by: Jeff Layton <[EMAIL PROTECTED]>


diff --git a/fs/inode.c b/fs/inode.c
index 26cdb11..841e2fc 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -288,6 +288,8 @@ static void dispose_list(struct list_hea
 		list_del_init(&inode->i_sb_list);
 		spin_unlock(&inode_lock);
 
+		iunique_unregister(inode);
+
 		wake_up_inode(inode);
 		destroy_inode(inode);
 		nr_disposed++;
@@ -706,6 +708,34 @@ retry:
 
 EXPORT_SYMBOL(iunique);
 
+int iunique_register(struct inode *inode, int max_reserved)
+{
+	int rv;
+
+	rv = idr_pre_get(&inode->i_sb->s_inode_ids, GFP_KERNEL);
+	if (! rv)
+		return -ENOMEM;
+
+	spin_lock(&inode->i_sb->s_inode_ids_lock);
+	rv = idr_get_new_above(&inode->i_sb->s_inode_ids, inode,
+		max_reserved+1, (int *) &inode->i_ino);
+	inode->i_generation = inode->i_sb->s_generation++;
+	spin_unlock(&inode->i_sb->s_inode_ids_lock);
+	return rv;
+}
+
+EXPORT_SYMBOL(iunique_register);
+
+void iunique_unregister(struct inode *inode)
+{
+	spin_lock(&inode->i_sb->s_inode_ids_lock);
+	if (idr_find(&inode->i_sb->s_inode_ids, (int) inode->i_ino))
+		idr_remove(&inode->i_sb->s_inode_ids, (int) inode->i_ino);
+	spin_unlock(&inode->i_sb->s_inode_ids_lock);
+}
+
+EXPORT_SYMBOL(iunique_unregister);
+
 struct inode *igrab(struct inode *inode)
 {
 	spin_lock(&inode_lock);
@@ -1025,6 +1055,7 @@ void generic_delete_inode(struct inode *
 	spin_lock(&inode_lock);
 	hlist_del_init(&inode->i_hash);
 	spin_unlock(&inode_lock);
+	iunique_unregister(inode);
 	wake_up_inode(inode);
 	BUG_ON(inode->i_state != I_CLEAR);
 	destroy_inode(inode);
@@ -1057,6 +1088,7 @@ static void generic_forget_inode(struct 
 	inode->i_state |= I_FREEING;
 	inodes_stat.nr_inodes--;
 	spin_unlock(&inode_lock);
+	iunique_unregister(inode);
 	if (inode->i_data.nrpages)
 		truncate_inode_pages(&inode->i_data, 0);
 	clear_inode(inode);
diff --git a/fs/pipe.c b/fs/pipe.c
index b1626f2..d74ae65 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -845,6 +845,9 @@ static struct inode * get_pipe_inode(voi
 	if (!inode)
 		goto fail_inode;
 
+	if (iunique_register(inode, 0))
+		goto fail_iput;
+
 	pipe = alloc_pipe_info(inode);
 	if (!pipe)
 		goto fail_iput;
diff --git a/fs/super.c b/fs/super.c
index 47e554c..d2dbdec 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -93,6 +93,8 @@ static struct super_block *alloc_super(s
 		s->s_qcop = sb_quotactl_ops;
 		s->s_op = &default_op;
 		s->s_time_gran = 10;
+		idr_init(&s->s_inode_ids);
+		spin_lock_init(&s->s_inode_ids_lock);
 	}
 out:
 	return s;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 2fe6e3f..3ad12a6 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -278,6 +278,7 @@ #include 
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -961,6 +962,12 @@ #endif
 	/* Granularity of c/m/atime in ns.
 	   Cannot be worse than a second */
 	

Re:[RFC][PATCH 0/3] Extent base online defrag

2006-12-01 Thread sho
Hi,

Thank you for your further testing.

>From the test the file seems to be pretty fragmented, still the defrag
>utility gives ENOSPC error.

I found a problem in the workaround code related to the journal in my
patches.  2048 blocks were always passed to ext3_journal_start so that
the journal blocks in a transaction didn't exceed j_max_transaction_buffers.
However, j_max_transaction_buffers can be adjusted according to the number
of filesystem blocks.  Your filesystem's j_max_transaction_buffers is
probably less than 2048, which would cause the error.

When the above problem occurs, the following message will be output
in syslog.  Could you check your syslog?

"JBD: e4defrag wants too many credits (2048 > )"

I am working on fixing the workaround code, but the work isn't
finished yet.  I will update my patches next week.
If you need to run my defrag soon, you can use the following
patch to avoid the above problem as a provisional solution.

After Alex's patches and my previous patches are applied,
you can apply the following patch.
---
diff -uprN -X linux-2.6.16.8-tnes-org/Documentation/dontdiff linux-2.6.16.8-tnes-org/fs/ext3/extents.c linux-2.6.16.8-work/fs/ext3/extents.c
--- linux-2.6.16.8-tnes-org/fs/ext3/extents.c   2006-12-01 10:37:28.0 +0900
+++ linux-2.6.16.8-work/fs/ext3/extents.c   2006-12-01 17:23:38.0 +0900
@@ -2738,6 +2738,7 @@ ext3_ext_replace_branches(struct ext3_ex
handle_t *handle = NULL;
unsigned jnum;
struct inode *inode;
+   journal_t   *journal;
 
 
from = from_page << (PAGE_CACHE_SHIFT - dest_tree->inode->i_blkbits);
@@ -2752,10 +2753,11 @@ ext3_ext_replace_branches(struct ext3_ex
 */
/* TODO:
 * Need to consider the way of calculating journal blocks
-* because j_max_transaction_buffer may exceed 2048
+* because the journal blocks may exceed j_max_transaction_buffer
 * if we have a deep depth.
 */
-   jnum = 2048;
+   journal = EXT3_JOURNAL(inode);
+   jnum = journal->j_max_transaction_buffers;
handle = ext3_journal_start(inode, jnum);
if (IS_ERR(handle)) {
err = PTR_ERR(handle);
@@ -3093,6 +3095,7 @@ ext3_ext_new_extent_tree(struct inode *t
unsigned jnum;
int ret = 0, err = 0, depth;
int last_extent = 0;
+   journal_t   *journal;
 
/* (blocks_per_page * count) * (extent_blocks + index_blocks)
 * + super_block + block_bitmap + group_descriptor
@@ -3100,10 +3103,11 @@ ext3_ext_new_extent_tree(struct inode *t
 */
/* TODO:
 * Need to consider the way of calculating journal blocks
-* because j_max_transaction_buffer may exceed 2048
+* because the journal blocks may exceed j_max_transaction_buffer
 * if we have a deep depth.
 */
-   jnum = 2048;
+   journal = EXT3_JOURNAL(tmp_inode);
+   jnum = journal->j_max_transaction_buffers; 
eh = EXT_ROOT_HDR(dest_tree);
eh->eh_depth = 0;
handle = ext3_journal_start(tmp_inode, jnum);

Cheers, Takashi