Re: [PATCH 6/6] NLM: Add reference counting to lockd

2008-01-08 Thread Wendy Cheng

Jeff Layton wrote:



The previous patch removes a kill_proc(... SIGKILL),  this one adds it
back.
That makes me wonder if the intermediate state is 'correct'.

But I also wonder what "correct" means.
Do we want all locks to be dropped when the last nfsd thread dies?
The answer is presumably either "yes" or "no".
If "yes", then we don't have that because if there are any NFS mounts
active, lockd will not be killed.
If "no", then we don't want this kill_proc here.

The comment in lockd() which currently reads:

/*
 * The main request loop. We don't terminate until the last
 * NFS mount or NFS daemon has gone away, and we've been sent a
 * signal, or else another process has taken over our job.
 */

suggests that someone once thought that lockd could hang around after
all nfsd threads and nfs mounts had gone, but I don't think it does.

We really should think this through and get it right, because if lockd
ever drops its locks, then we really need to make sure sm_notify gets
run.  So it needs to be a well defined event.

Thoughts?




This is the part I've been struggling with the most -- defining what
proper behavior should be when lockd is restarted. As you point out,
restarting lockd without doing a sm_notify could be bad news for data
integrity.

Then again, we'd like someone to be able to shut down the NFS "service"
and be able to unmount underlying filesystems without jumping through
special hoops.

Overall, I think I'd vote "yes". We need to drop locks when the last
nfsd goes down. If userspace brings down nfsd, then it's userspace's
responsibility to make sure that a sm_notify is sent when nfsd and lockd
are restarted.
  


I would vote for "no", at least for NFS v3. Shutting down lockd would 
require clients to reclaim their locks. Given the current state of the 
protocol, the design, and even the implementation itself, that is simply 
too disruptive. I understand the current logic (i.e. shutting down nfsd 
but leaving lockd alone) is awkward, but debugging multiple platforms 
(remember, clients may not be Linux boxes) is very non-trivial.


-- Wendy




Re: [Cluster-devel] [-mm patch] make gfs2_change_nlink_i() static

2007-01-16 Thread Wendy Cheng

Adrian Bunk wrote:

On Thu, Jan 11, 2007 at 10:26:27PM -0800, Andrew Morton wrote:
  

...
Changes since 2.6.20-rc3-mm1:
...
 git-gfs2-nmw.patch
...
 git trees
...




This patch makes the needlessly global gfs2_change_nlink_i() static.
  
We will probably need to call this routine from other files in our next 
round of code check-in.


-- Wendy

Signed-off-by: Adrian Bunk <[EMAIL PROTECTED]>

---

 fs/gfs2/inode.c |   88 
 fs/gfs2/inode.h |1 
 2 files changed, 44 insertions(+), 45 deletions(-)


--- linux-2.6.20-rc4-mm1/fs/gfs2/inode.h.old	2007-01-13 08:56:24.0 +0100
+++ linux-2.6.20-rc4-mm1/fs/gfs2/inode.h	2007-01-13 08:56:30.0 +0100
@@ -40,7 +40,6 @@
 
 int gfs2_dinode_dealloc(struct gfs2_inode *inode);

 int gfs2_change_nlink(struct gfs2_inode *ip, int diff);
-int gfs2_change_nlink_i(struct gfs2_inode *ip);
 struct inode *gfs2_lookupi(struct inode *dir, const struct qstr *name,
   int is_root, struct nameidata *nd);
 struct inode *gfs2_createi(struct gfs2_holder *ghs, const struct qstr *name,
--- linux-2.6.20-rc4-mm1/fs/gfs2/inode.c.old	2007-01-13 08:56:37.0 +0100
+++ linux-2.6.20-rc4-mm1/fs/gfs2/inode.c	2007-01-13 08:57:21.0 +0100
@@ -280,6 +280,50 @@
return error;
 }
 
+static int gfs2_change_nlink_i(struct gfs2_inode *ip)

+{
+   struct gfs2_sbd *sdp = ip->i_inode.i_sb->s_fs_info;
+   struct gfs2_inode *rindex = GFS2_I(sdp->sd_rindex);
+   struct gfs2_glock *ri_gl = rindex->i_gl;
+   struct gfs2_rgrpd *rgd;
+   struct gfs2_holder ri_gh, rg_gh;
+   int existing, error;
+
+   /* if we come from rename path, we could have the lock already */
+   existing = gfs2_glock_is_locked_by_me(ri_gl);
+   if (!existing) {
+   error = gfs2_rindex_hold(sdp, &ri_gh);
+   if (error)
+   goto out;
+   }
+
+   /* find the matching rgd */
+   error = -EIO;
+   rgd = gfs2_blk2rgrpd(sdp, ip->i_num.no_addr);
+   if (!rgd)
+   goto out_norgrp;
+
+   /*
+* Eventually we may want to move rgd(s) to a linked list
+* and piggyback the free logic into one of gfs2 daemons
+* to gain some performance.
+*/
+   if (!rgd->rd_gl || !gfs2_glock_is_locked_by_me(rgd->rd_gl)) {
+   error = gfs2_glock_nq_init(rgd->rd_gl, LM_ST_EXCLUSIVE, 0, &rg_gh);
+   if (error)
+   goto out_norgrp;
+
+   gfs2_unlink_di(&ip->i_inode); /* mark inode unlinked */
+   gfs2_glock_dq_uninit(&rg_gh);
+   }
+
+out_norgrp:
+   if (!existing)
+   gfs2_glock_dq_uninit(&ri_gh);
+out:
+   return error;
+}
+
 /**
  * gfs2_change_nlink - Change nlink count on inode
  * @ip: The GFS2 inode
@@ -326,50 +370,6 @@
return error;
 }
 
-int gfs2_change_nlink_i(struct gfs2_inode *ip)

-{
-   struct gfs2_sbd *sdp = ip->i_inode.i_sb->s_fs_info;
-   struct gfs2_inode *rindex = GFS2_I(sdp->sd_rindex);
-   struct gfs2_glock *ri_gl = rindex->i_gl;
-   struct gfs2_rgrpd *rgd;
-   struct gfs2_holder ri_gh, rg_gh;
-   int existing, error;
-
-   /* if we come from rename path, we could have the lock already */
-   existing = gfs2_glock_is_locked_by_me(ri_gl);
-   if (!existing) {
-   error = gfs2_rindex_hold(sdp, &ri_gh);
-   if (error)
-   goto out;
-   }
-
-   /* find the matching rgd */
-   error = -EIO;
-   rgd = gfs2_blk2rgrpd(sdp, ip->i_num.no_addr);
-   if (!rgd)
-   goto out_norgrp;
-
-   /*
-* Eventually we may want to move rgd(s) to a linked list
-* and piggyback the free logic into one of gfs2 daemons
-* to gain some performance.
-*/
-   if (!rgd->rd_gl || !gfs2_glock_is_locked_by_me(rgd->rd_gl)) {
-   error = gfs2_glock_nq_init(rgd->rd_gl, LM_ST_EXCLUSIVE, 0, &rg_gh);
-   if (error)
-   goto out_norgrp;
-
-   gfs2_unlink_di(&ip->i_inode); /* mark inode unlinked */
-   gfs2_glock_dq_uninit(&rg_gh);
-   }
-
-out_norgrp:
-   if (!existing)
-   gfs2_glock_dq_uninit(&ri_gh);
-out:
-   return error;
-}
-
 struct inode *gfs2_lookup_simple(struct inode *dip, const char *name)
 {
struct qstr qstr;

  





Re: [Cluster-devel] Re: [GFS2] Don't flush everything on fdatasync [70/70]

2006-12-07 Thread Wendy Cheng

Steven Whitehouse wrote:

Hi,

On Fri, 2006-12-01 at 11:09 -0800, Andrew Morton wrote:
  

I was taking my cue here from ext3 which does something similar. The
filemap_fdatawrite() is done by the VFS before this is called with a
filemap_fdatawait() afterwards. This was intended to flush the metadata
via (eventually) ->write_inode() although I guess I should be calling
write_inode_now() instead?
  

oh I see, you're just trying to write the inode itself, not the pages.

write_inode_now() will write the pages, which you seem to not want to do.
Whatever.  The APIs here are a bit awkward.



I've added a comment to explain what's going on here, and also the
following patch. I know it could be better, but it's still an improvement
on what was there before,


  

Steve,

I'm in the middle of something else and don't have upstream kernel 
source handy at this moment. But I read akpm's comment as saying 
"write_inode_now" would do writepage, and that is *not* what you want (?) 
(since the VFS has done that before this call is invoked). I vaguely 
recall I did try write_inode_now() on GFS1 once but had to replace it 
with "sync_inode" on RHEL4 (for a reason I can't remember at this 
moment). I suggest you keep "sync_inode" (at least for a while, until we 
can prove another call does better). This "sync_inode" has been well 
tested (with GFS1's fsync call).


There is another issue. It is a gray area. Note that you don't grab any 
glock here ... so if someone *has* written something on other nodes, 
this sync could miss it (?). This depends on how people expect 
fsync/fdatasync to behave in a cluster filesystem. GFS1 asks for a 
shared lock here, so it will force the other nodes to flush their data 
(I personally think this is the more correct behavior). Your call though.
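
For illustration, a rough sketch (the wrapper name is made up, and the
flag choice and error handling would need real review) of what taking a
shared glock across the sync could look like:

/* Sketch only: force other nodes to flush by holding a shared glock
 * across the sync, GFS1-style.  Nothing here has been compiled. */
static int gfs2_fsync_with_glock(struct file *file, struct dentry *dentry,
				 int datasync)
{
	struct inode *inode = dentry->d_inode;
	struct gfs2_inode *ip = GFS2_I(inode);
	struct gfs2_holder gh;
	int error;

	error = gfs2_glock_nq_init(ip->i_gl, LM_ST_SHARED, 0, &gh);
	if (error)
		return error;

	/* ... the existing sync_inode()/log-flush logic goes here ... */

	gfs2_glock_dq_uninit(&gh);
	return error;
}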


-- Wendy




From 34126f9f41901ca9d7d0031c2b11fc0d6c07b72d Mon Sep 17 00:00:00 2001
From: Steven Whitehouse <[EMAIL PROTECTED]>
Date: Thu, 7 Dec 2006 09:13:14 -0500
Subject: [PATCH] [GFS2] Change gfs2_fsync() to use write_inode_now()

This is a bit better than the previous version of gfs2_fsync()
although it would be better still if we were able to call a
function which only wrote the inode & metadata. It's no big deal
though that this will potentially write the data as well since
the VFS has already done that before calling gfs2_fsync(). I've
also added a comment to explain what's going on here.

Signed-off-by: Steven Whitehouse <[EMAIL PROTECTED]>
Cc: Andrew Morton <[EMAIL PROTECTED]>
---
 fs/gfs2/ops_file.c |   11 ++-
 1 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/fs/gfs2/ops_file.c b/fs/gfs2/ops_file.c
index 7bd971b..b3f1e03 100644
--- a/fs/gfs2/ops_file.c
+++ b/fs/gfs2/ops_file.c
@@ -510,6 +510,11 @@ static int gfs2_close(struct inode *inod
  * is set. For stuffed inodes we must flush the log in order to
  * ensure that all data is on disk.
  *
+ * The call to write_inode_now() is there to write back metadata and
+ * the inode itself. It does also try and write the data, but thats
+ * (hopefully) a no-op due to the VFS having already called filemap_fdatawrite()
+ * for us.
+ *
  * Returns: errno
  */
 
@@ -518,10 +523,6 @@ static int gfs2_fsync(struct file *file,

struct inode *inode = dentry->d_inode;
int sync_state = inode->i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC);
int ret = 0;
-   struct writeback_control wbc = {
-   .sync_mode = WB_SYNC_ALL,
-   .nr_to_write = 0,
-   };
 
 	if (gfs2_is_jdata(GFS2_I(inode))) {

gfs2_log_flush(GFS2_SB(inode), GFS2_I(inode)->i_gl);
@@ -530,7 +531,7 @@ static int gfs2_fsync(struct file *file,
 
 	if (sync_state != 0) {

if (!datasync)
-   ret = sync_inode(inode, &wbc);
+   ret = write_inode_now(inode, 0);
 
 		if (gfs2_is_stuffed(GFS2_I(inode)))

gfs2_log_flush(GFS2_SB(inode), GFS2_I(inode)->i_gl);
  





Re: [PATCH] prune_icache_sb

2006-12-04 Thread Wendy Cheng

Russell Cattelan wrote:

Wendy Cheng wrote:

Linux kernel, particularly the VFS layer, is starting to show signs 
of inadequacy as the software components built upon it keep growing. 
I have doubts that it can keep up and handle this complexity with a 
development policy like you just described (filesystem is a dumb 
layer ?). Aren't these DIO_xxx_LOCKING flags inside 
__blockdev_direct_IO() a perfect example why trying to do too many 
things inside vfs layer for so many filesystems is a bad idea ? By 
the way, since we're on this subject, could we discuss a little bit 
about vfs rename call (or I can start another new discussion thread) ?


Note that linux do_rename() starts with the usual lookup logic, 
followed by "lock_rename", then a final round of dentry lookup, and 
finally comes to filesystem's i_op->rename call. Since lock_rename() 
only calls for vfs layer locks that are local to this particular 
machine, for a cluster filesystem, there exists a huge window between 
the final lookup and filesystem's i_op->rename calls such that the 
file could get deleted from another node before fs can do anything 
about it. Is it possible that we could get a new function pointer 
(lock_rename) in inode_operations structure so a cluster filesystem 
can do proper locking ?


It looks like the ocfs2 guys have a similar problem?

http://ftp.kernel.org/pub/linux/kernel/people/mfasheh/ocfs2/ocfs2_git_patches/ocfs2-upstream-linus-20060924/0009-PATCH-Allow-file-systems-to-manually-d_move-inside-of-rename.txt 






Thanks for the pointer. Same as ocfs2, under the current VFS code both 
GFS1/2 also need the FS_ODD_RENAME flag for the rename problem - I have 
an ugly ~200-line draft patch ready for GFS1 (and am looking into GFS2 at 
this moment). The issue here is that, for GFS, if vfs lock_rename() could 
call us, this complication could be greatly reduced. Will start another 
thread to see whether the wish can be granted.


-- Wendy




Re: [PATCH] prune_icache_sb

2006-12-03 Thread Wendy Cheng

Andrew Morton wrote:


On Sun, 03 Dec 2006 12:49:42 -0500
Wendy Cheng <[EMAIL PROTECTED]> wrote:

 

I read this as "It is ok to give system admin(s) commands (that this 
"drop_pagecache_sb() call" is all about) to drop page cache. It is, 
however, not ok to give filesystem developer(s) this very same function 
to trim their own page cache if the filesystems choose to do so" ?
   



If you're referring to /proc/sys/vm/drop_pagecache then no, that isn't for
administrators - it's a convenience thing for developers, to get repeatable
benchmarks.  Attempts to make it a per-numa-node control for admin purposes have
been rejected.
 

Just saw you suggested that the "next door" LKML thread ("la la la la ... 
swappiness") use "-o sync"? Well, now I do see you're determined ... 
anyway, I think I got my stuff working without this kernel patch ... 
still under testing though.


The rename post will be done first thing tomorrow morning.


[snip] ...
   


hmm, I suppose that makes sense.

Are there dentries associated with these locks?
 

Yes, dentries are part of the logic (during lookup time), but the 
bookkeeping (reference count, reclaim, delete, etc.) is all done thru 
inode structures.


 


Did you look at improving that lock-lookup algorithm, btw?  Core kernel has
no problem maintaining millions of cached VFS objects - is there any reason
why your lock lookup cannot be similarly efficient?

 

Yes, I just found that the new DLM uses the "jhash" call 
(include/linux/jhash.h). I'm on an older version of DLM that uses the 
FNV hash algorithm (http://www.isthe.com/chongo/tech/comp/fnv/). Will do 
some performance test runs to compare the two methods.
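
For reference, the two hash functions being compared boil down to roughly
the sketch below (the FNV-1 constants are the 32-bit ones from the page
above, jhash() is the helper from include/linux/jhash.h, and the
name/length arguments are only illustrative):

#include <linux/types.h>
#include <linux/jhash.h>

/* 32-bit FNV-1: multiply by the FNV prime, then xor in the next byte. */
static inline u32 fnv_hash(const unsigned char *name, unsigned int len)
{
	u32 hash = 2166136261u;		/* FNV-1 offset basis */

	while (len--) {
		hash *= 16777619u;	/* FNV-1 32-bit prime */
		hash ^= *name++;
	}
	return hash;
}

/* The newer DLM feeds the same buffer to jhash() instead: */
static inline u32 name_jhash(const unsigned char *name, unsigned int len)
{
	return jhash(name, len, 0);	/* 0 = initval */
}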


A final note on this subject - I may not agree with you (about various 
things), but your comments and amazingly quick responses are very much 
appreciated!


-- Wendy



Re: [PATCH] prune_icache_sb

2006-12-03 Thread Wendy Cheng

Andrew Morton wrote:


On Thu, 30 Nov 2006 11:05:32 -0500
Wendy Cheng <[EMAIL PROTECTED]> wrote:

 



The idea is, instead of unconditionally dropping every buffer associated 
with the particular mount point (that defeats the purpose of page 
caching), base kernel exports the "drop_pagecache_sb()" call that allows 
page cache to be trimmed. More importantly, it is changed to offer the 
choice of not randomly purging any buffer but the ones that seem to be 
unused (i_state is NULL and i_count is zero). This will encourage 
filesystem(s) to pro actively response to vm memory shortage if they 
choose so.
   



argh.
 

I read this as "It is ok to give system admin(s) commands (that this 
"drop_pagecache_sb() call" is all about) to drop page cache. It is, 
however, not ok to give filesystem developer(s) this very same function 
to trim their own page cache if the filesystems choose to do so" ?



In Linux a filesystem is a dumb layer which sits between the VFS and the
I/O layer and provides dumb services such as reading/writing inodes,
reading/writing directory entries, mapping pagecache offsets to disk
blocks, etc.  (This model is to varying degrees incorrect for every
post-ext2 filesystem, but that's the way it is).
 

The Linux kernel, particularly the VFS layer, is starting to show signs 
of inadequacy as the software components built upon it keep growing. I 
have doubts that it can keep up and handle this complexity with a 
development policy like the one you just described (filesystem is a dumb 
layer?). Aren't these DIO_xxx_LOCKING flags inside __blockdev_direct_IO() 
a perfect example of why trying to do too many things inside the VFS 
layer for so many filesystems is a bad idea? By the way, since we're on 
this subject, could we discuss the vfs rename call a little bit (or I can 
start another new discussion thread)?


Note that Linux do_rename() starts with the usual lookup logic, followed 
by "lock_rename", then a final round of dentry lookup, and finally comes 
to the filesystem's i_op->rename call. Since lock_rename() only takes 
VFS-layer locks that are local to this particular machine, for a cluster 
filesystem there exists a huge window between the final lookup and the 
filesystem's i_op->rename call in which the file could get deleted from 
another node before the fs can do anything about it. Is it possible that 
we could get a new function pointer (lock_rename) in the inode_operations 
structure so a cluster filesystem can do proper locking?
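
Roughly, the wish would look something like the sketch below (names,
signatures, and the exact placement inside do_rename() are all up for
discussion - nothing here has been compiled or posted as a patch):

/* Two new (hypothetical) members in struct inode_operations: */
	int (*lock_rename)(struct inode *old_dir, struct inode *new_dir);
	void (*unlock_rename)(struct inode *old_dir, struct inode *new_dir);

/* do_rename() would call them after the VFS-local lock_rename(), so the
 * final dentry lookups and ->rename() run under the cluster-wide lock: */
	if (old_dir->i_op && old_dir->i_op->lock_rename) {
		error = old_dir->i_op->lock_rename(old_dir, new_dir);
		if (error)
			goto exit_unlock;	/* label name illustrative */
	}
	/* ... final lookups, ->rename() ... */
	if (old_dir->i_op && old_dir->i_op->unlock_rename)
		old_dir->i_op->unlock_rename(old_dir, new_dir);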


From our end (cluster locks are expensive - that's why we cache them), 
one of our kernel daemons will invoke this newly exported call based on 
a set of pre-defined tunables. It is then followed by a lock reclaim 
logic to trim the locks by checking the page cache associated with the 
inode (that this cluster lock is created for). If nothing is attached to 
the inode (based on i_mapping->nrpages count), we know it is a good 
candidate for trimming and will subsequently drop this lock (instead of 
waiting until the end of vfs inode life cycle).
   



Again, I don't understand why you're tying the lifetime of these locks to
the VFS inode reclaim mechanisms.  Seems odd.
 


Cluster locks are expensive because:

1. Every node in the cluster has to agree about it upon granting the 
request (communication overhead).
2. It involves disk flushing if bouncing between nodes. Say one node 
requests a read lock after another node's write... before the read lock 
can be granted, the write node needs to flush the data to the disk (disk 
io overhead).


For optimization purposes, we want to hold off the disk flush after 
writes and hope (and encourage) that the next requester of the lock is on 
the very same node (to take advantage of the OS write-back logic). That's 
why the locks are cached on that node; they do not get removed unless 
necessary. What could be better than building the lock caching on top of 
the existing inode cache logic - since these are the objects that the 
cluster locks are created for in the first place?



If you want to put an upper bound on the number of in-core locks, why not
string them on a list and throw away the old ones when the upper bound is
reached?
 

Don't get me wrong. DLM *has* a tunable to set the max lock count. We do 
drop locks, but to drop the right locks we need a little bit of help from 
the VFS layer. The latency requirement is difficult to manage.



Did you look at improving that lock-lookup algorithm, btw?  Core kernel has
no problem maintaining millions of cached VFS objects - is there any reason
why your lock lookup cannot be similarly efficient?
 

Don't be so confident. I did see some complaints from ext3 based mail 
servers in the past - when the storage size was large enough, people had 
to explicitly umount the filesystem from time to time to rescue their 
performance. I don't recall the details at this moment though.


For us with this particular customer, it is a 15TB storage.

-- Wendy


Re: [PATCH] prune_icache_sb

2006-11-30 Thread Wendy Cheng

How about a simple and plain change with this uploaded patch?

The idea is that, instead of unconditionally dropping every buffer 
associated with the particular mount point (which defeats the purpose of 
page caching), the base kernel exports the "drop_pagecache_sb()" call so 
that the page cache can be trimmed. More importantly, it is changed to 
offer the choice of not randomly purging any buffer, but only the ones 
that seem to be unused (i_state is 0 and i_count is zero). This will 
encourage filesystem(s) to proactively respond to VM memory shortage if 
they choose to do so.


From our end (cluster locks are expensive - that's why we cache them), 
one of our kernel daemons will invoke this newly exported call based on a 
set of pre-defined tunables. That is then followed by lock reclaim logic 
that trims the locks by checking the page cache associated with the inode 
(that the cluster lock was created for). If nothing is attached to the 
inode (based on the i_mapping->nrpages count), we know it is a good 
candidate for trimming and will subsequently drop the lock (instead of 
waiting until the end of the VFS inode life cycle).
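
In (simplified) code, the daemon side would be something like the sketch
below - "struct gfs_lock", the cached_locks list and drop_cluster_lock()
are made-up module-side names for illustration; the only call taken from
the patch that follows is drop_pagecache_sb():

struct gfs_lock {
	struct list_head l_list;
	struct inode *l_inode;	/* inode this cluster lock was created for */
	/* ... lock value block, lock state, etc. ... */
};

static LIST_HEAD(cached_locks);

static void reclaim_cached_locks(struct super_block *sb, int nr_goal)
{
	struct gfs_lock *lp, *next;

	/* 1. ask the VFS to trim page cache attached to unused inodes */
	drop_pagecache_sb(sb, nr_goal);

	/* 2. walk our cached cluster locks; an inode with no pages left
	 *    attached is a good candidate for dropping its lock early */
	list_for_each_entry_safe(lp, next, &cached_locks, l_list) {
		struct inode *inode = lp->l_inode;

		if (inode && inode->i_mapping->nrpages)
			continue;		/* still caching data - keep */
		drop_cluster_lock(lp);		/* made-up helper */
	}
}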


Note that I could do invalidate_inode_pages() within our kernel modules 
to accomplish what drop_pagecache_sb() does (without coming here to bug 
people) but I don't have access to inode_lock as an external kernel 
module. So either EXPORT_SYMBOL(inode_lock) or this patch ?


The end result (of this change) should encourage filesystems to actively 
avoid depleting too much memory, and we'll encourage our applications to 
understand clustering locality issues.


Haven't tested this out though - would appreciate some comments before 
spending more effort in this direction.


-- Wendy


--- linux-2.6.18/include/linux/fs.h	2006-09-19 23:42:06.0 -0400
+++ ups-kernel/include/linux/fs.h	2006-11-30 02:16:34.0 -0500
@@ -1625,7 +1625,8 @@ extern void remove_inode_hash(struct ino
 static inline void insert_inode_hash(struct inode *inode) {
 	__insert_inode_hash(inode, inode->i_ino);
 }
-
+extern void drop_pagecache_sb(struct super_block *sb, int nr_goal);
+ 
 extern struct file * get_empty_filp(void);
 extern void file_move(struct file *f, struct list_head *list);
 extern void file_kill(struct file *f);
--- linux-2.6.18/fs/drop_caches.c	2006-09-19 23:42:06.0 -0400
+++ ups-kernel/fs/drop_caches.c	2006-11-30 03:36:11.0 -0500
@@ -12,13 +12,20 @@
 /* A global variable is a bit ugly, but it keeps the code simple */
 int sysctl_drop_caches;
 
-static void drop_pagecache_sb(struct super_block *sb)
+void drop_pagecache_sb(struct super_block *sb, int nr_goal)
 {
 	struct inode *inode;
+	int nr_count=0;
 
 	spin_lock(&inode_lock);
 	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
-		if (inode->i_state & (I_FREEING|I_WILL_FREE))
+		if (nr_goal) {
+			if (nr_goal == nr_count)
+				break;
+			if (inode->i_state || atomic_read(&inode->i_count))
+				continue;
+			nr_count++;
+		} else if (inode->i_state & (I_FREEING|I_WILL_FREE))
 			continue;
 		invalidate_inode_pages(inode->i_mapping);
 	}
@@ -36,7 +43,7 @@ restart:
 		spin_unlock(&sb_lock);
 		down_read(&sb->s_umount);
 		if (sb->s_root)
-			drop_pagecache_sb(sb);
+			drop_pagecache_sb(sb, 0);
 		up_read(&sb->s_umount);
 		spin_lock(&sb_lock);
 		if (__put_super_and_need_restart(sb))
@@ -66,3 +73,6 @@ int drop_caches_sysctl_handler(ctl_table
 	}
 	return 0;
 }
+
+EXPORT_SYMBOL(drop_pagecache_sb);
+



Re: [PATCH] prune_icache_sb

2006-11-28 Thread Wendy Cheng

Andrew Morton wrote:


We shouldn't export this particular implementation to modules because it
has bad failure modes.  There might be a case for exposing an
i_sb_list-based API or, perhaps better, a max-unused-inodes mount option.


 

Ok, thanks for looking into this - it is appreciated. I'll try to figure 
out something else.


-- Wendy



Re: [PATCH] prune_icache_sb

2006-11-28 Thread Wendy Cheng

Andrew Morton wrote:

On Mon, 27 Nov 2006 18:52:58 -0500
Wendy Cheng <[EMAIL PROTECTED]> wrote:

  

Not sure about walking thru sb->s_inodes for several reasons

1. First, the changes made are mostly for file server setup with large 
fs size - the entry count in sb->s_inodes may not be shorter than the 
inode_unused list.



umm, that's the best-case.  We also care about worst-case.  Think:
1,000,000 inodes on inode_unused, of which a randomly-sprinkled 10,000 are
from the being-unmounted filesystem.  The code as-proposed will do 100x more
work than it needs to do.  All under a global spinlock.
  
By walking thru sb->s_inodes, we also need to take inode_lock and 
iprune_mutex (?), since we're purging the inodes from the system - or 
specifically, removing them from the inode_unused list. There is really 
not much difference from the current prune_icache() logic. What's been 
proposed here is simply *exporting* the prune_icache() kernel code to 
allow filesystems to trim (purge a small percentage of) their potentially 
unused per-mount inodes for *latency* considerations.


I made a mistake by using the "page dirty ratio" to explain the problem 
(sorry! I was not thinking well in the previous write-up) - that could 
mislead you into thinking this is a VM issue. This is not so much a 
low-on-free-pages (and/or memory fragmentation) issue (though 
fragmentation is normally part of the symptoms). What the (external) 
kernel module does is tie its cluster-wide file lock to the in-memory 
inode that is obtained during file look-up time. The lock is removed 
from the machine when:

1. the lock is granted to another (cluster) machine; or
2. the in-memory inode is purged from the system.

One of the clusters that has this latency issue is an IP/TV application 
where it "rsync"s with the main station server (over a long geographical 
distance) every 15 minutes. That subsequently (and constantly) leaves a 
large number of inodes (and locks) hanging around. When other nodes 
within the same cluster, serving as FTP servers, are serving the files, 
DLM has to wade through a huge number of lock entries to know whether the 
lock requests can be granted. That's where this latency issue shows up. 
Our profiling data shows that when the cluster performance drops into 
unacceptable ranges, DLM can hog 40% of the CPU cycles in its 
lock-searching logic. From the VM point of view, the system does not have 
a memory shortage, so it has no need to kick off the prune_icache() call.


This issue could also be fixed in several different ways - maybe by a 
better DLM hash function, maybe by asking IT people to umount the 
filesystem (where *all* per-mount inodes are unconditionally purged, but 
that defeats the purpose of caching inodes and, in our case, the locks) 
after each rsync, etc. But I do think the proposed patch is the most 
sensible way to fix this issue, and I believe it is one of those 
functions that, if you export it, people will find good use for. It also 
helps with memory fragmentation and/or shortage *before* they become a 
problem. I certainly understand and respect a maintainer's daunting job 
of deciding whether to take or reject a patch - let me know what you 
think so I can start to work on other solutions if required.


-- Wendy




Re: [PATCH] prune_icache_sb

2006-11-27 Thread Wendy Cheng

Andrew Morton wrote:

This search is potentially inefficient.  It would be better to walk
sb->s_inodes.

  

Not sure about walking thru sb->s_inodes, for several reasons:

1. First, the changes made are mostly for file-server setups with a large 
fs size - the entry count in sb->s_inodes may not be shorter than the 
inode_unused list.
2. Unlike calls such as drop_pagecache_sb() (which doesn't do list entry 
removal), we're walking thru the list to dispose of the entries. This 
implies we are walking thru one list (sb->s_inodes) to remove the other 
list's entries (inode_unused). This feels awkward.
3. The new code would be very similar to the current prune_icache(), with 
a few differences - e.g., we really don't want to list_move() within the 
sb->s_inodes list itself (the way prune_icache() moves the examined entry 
to the tail of the inode_unused list). We would have to either duplicate 
the code or clutter the current prune_icache() routine.


Pruning based on sb->s_inodes *does* have its advantages, but a simple and 
plain patch as shown in the previous post (which has been well tested in 
two large-scale production systems) could be equally effective. Make 
sense?
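
For comparison, the rough shape of the inode_unused-based pruning being
argued for here is something like the sketch below (this is *not* the
patch from this thread - it just leans on the existing fs/inode.c names
and skips the page-invalidation handling that the real prune_icache() does):

static void prune_icache_sb(struct super_block *sb, int nr_goal)
{
	LIST_HEAD(freeable);
	struct inode *inode, *next;

	mutex_lock(&iprune_mutex);
	spin_lock(&inode_lock);
	list_for_each_entry_safe(inode, next, &inode_unused, i_list) {
		if (!nr_goal)
			break;
		if (inode->i_sb != sb)
			continue;	/* someone else's inode - skip */
		if (inode->i_state || atomic_read(&inode->i_count))
			continue;	/* busy - skip */
		list_move(&inode->i_list, &freeable);
		inode->i_state |= I_FREEING;
		inodes_stat.nr_unused--;
		nr_goal--;
	}
	spin_unlock(&inode_lock);
	dispose_list(&freeable);	/* frees everything collected above */
	mutex_unlock(&iprune_mutex);
}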


-- Wendy




Re: [PATCH] Add ENOSYS into sys_io_cancel

2005-07-12 Thread Wendy Cheng

Benjamin LaHaise wrote:

Also, please cc [EMAIL PROTECTED] on future aio patches.  
 

Didn't realize the patch was sent to linux-kernel (which I don't 
subscribe to) instead of linux-aio - revised patch attached. Thanks for 
the help -- Wendy



On Mon, Jul 11, 2005 at 03:06:52PM -0400, Wendy Cheng wrote:

Note that, other than a few exceptions, most current filesystems and/or 
drivers do not have aio cancel specifically defined (the kiocb->ki_cancel 
field is mostly NULL). However, the sys_io_cancel system call universally 
sets the return code to -EAGAIN. This gives applications the wrong 
impression that the call is implemented but just never works. We have 
customer inquiries about this issue.


Upload a trivial patch to address this confusion.

 



Signed-off-by: S. Wendy Cheng <[EMAIL PROTECTED]>

--- linux-2.6.12/fs/aio.c   2005-06-17 15:48:29.0 -0400
+++ linux/fs/aio.c  2005-07-12 11:26:08.503256160 -0400
@@ -1660,7 +1660,7 @@ asmlinkage long sys_io_cancel(aio_contex
ret = -EFAULT;
}
} else
-   printk(KERN_DEBUG "iocb has no cancel operation\n");
+   ret = -ENOSYS;
 
put_ioctx(ctx);
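
For completeness, from the application side the difference can be seen
with a small libaio test along the lines of the sketch below (the file
path and sizes are arbitrary):

/* Sketch only: shows how an application can now tell "cancel not
 * supported" (-ENOSYS) apart from the old blanket -EAGAIN. */
#include <libaio.h>
#include <fcntl.h>
#include <stdio.h>
#include <errno.h>

int main(void)
{
	io_context_t ctx = 0;
	struct iocb cb, *cbs[1] = { &cb };
	struct io_event ev;
	char buf[4096];
	int fd, ret;

	if (io_setup(8, &ctx) < 0)
		return 1;
	fd = open("/tmp/testfile", O_RDONLY);
	if (fd < 0)
		return 1;

	io_prep_pread(&cb, fd, buf, sizeof(buf), 0);
	if (io_submit(ctx, 1, cbs) != 1)
		return 1;

	ret = io_cancel(ctx, &cb, &ev);		/* libaio returns -errno */
	if (ret == -ENOSYS)
		printf("cancel not implemented for this file/driver\n");
	else if (ret == -EAGAIN)
		printf("request could not be cancelled\n");
	else
		printf("io_cancel returned %d\n", ret);

	io_destroy(ctx);
	return 0;
}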
 



[PATCH] Add ENOSYS into sys_io_cancel

2005-07-11 Thread Wendy Cheng
Previously sent via private mail that doesn't seem to have gone through - 
resending via office mailer.


Note that, other than a few exceptions, most current filesystems and/or 
drivers do not have aio cancel specifically defined (the kiocb->ki_cancel 
field is mostly NULL). However, the sys_io_cancel system call universally 
sets the return code to -EAGAIN. This gives applications the wrong 
impression that the call is implemented but just never works. We have 
customer inquiries about this issue.


Upload a trivial patch to address this confusion.

-- Wendy


--- linux-2.6.12/fs/aio.c   2005-06-17 15:48:29.0 -0400
+++ linux/fs/aio.c  2005-07-10 12:48:14.0 -0400
@@ -1641,8 +1641,9 @@ asmlinkage long sys_io_cancel(aio_contex
cancel = kiocb->ki_cancel;
kiocb->ki_users ++;
kiocbSetCancelled(kiocb);
-   } else
+   } else 
cancel = NULL;
+
	spin_unlock_irq(&ctx->ctx_lock);
 
if (NULL != cancel) {
@@ -1659,8 +1660,10 @@ asmlinkage long sys_io_cancel(aio_contex
	if (copy_to_user(result, &tmp, sizeof(tmp)))
ret = -EFAULT;
}
-   } else
+   } else {
+   ret = -ENOSYS;
printk(KERN_DEBUG "iocb has no cancel operation\n");
+   } 
 
put_ioctx(ctx);
 

