subject:"Re\: \-mm merge plans for 2.6.24"

Re: per BDI dirty limit (was Re: -mm merge plans for 2.6.24)

2007-12-14 Thread Peter Zijlstra


On Fri, 2007-12-14 at 16:14 +0100, Miklos Szeredi wrote:
> > Neil suggested using device numbers which would work, however I think
> > those might not be human friendly. While its easy to find the device
> > number of a given path (eg.: stat -c %d /), its rather hard to find the
> > path belonging to a given device number.
> 
> Ram Pai had a patch which added the device number (among other things)
> to /proc/mounts:
> 
> 
> 
> Subject: [RFC2 PATCH 1/1] VFS: Augment /proc/mount with subroot and
>   shared-subtree
> From: Ram Pai <[EMAIL PROTECTED]>
> To:   "H. Peter Anvin" <[EMAIL PROTECTED]>
> Cc:   Al Viro <[EMAIL PROTECTED]>,
>   Linux Kernel Mailing List ,
>   [EMAIL PROTECTED], [EMAIL PROTECTED]
> In-Reply-To: <[EMAIL PROTECTED]>
> Content-Type: text/plain
> Date: Mon, 16 Jul 2007 11:46:48 -0700
> Content-Transfer-Encoding: 7bit
> Sender:   [EMAIL PROTECTED]
> X-Mailing-List:   [EMAIL PROTECTED]
> 
> /proc/mounts in its current state fail to disambiguate bind mounts, 
> especially 
> when the bind mount is subrooted. Also it does not capture propagation state 
> of
> the mounts(shared-subtree). The following patch addresses the problem.
> 
> The following additional fields to /proc/mounts are added.
> 
> propagation-type in the form of [:][,...]
>   note: 'shared' flag is followed by the mntid of its peer mount
> 'slave' flag is followed by the mntid of its master mount
> 'private' flag stands by itself
> 'unbindable' flag stands by itself
>
> mntid -- is a unique identifier of the mount
> major:minor -- is the major minor number of the device hosting the filesystem
> dir -- the subdir in the filesystem which forms the root of this mount
> parent -- the id of the parent mount
> 
> 
> Here is a sample cat /proc/mounts after execution the following commands:
> 
> mount --bind /mnt /mnt
> mount --make-shared /mnt
> mount --bind /mnt/1 /var
> mount --make-slave /var
> mount --make-shared /var
> mount --bind /var/abc /tmp
> mount --make-unbindable /proc
> 
> rootfs / rootfs rw 0 0 private 2 0:1 / 2 
> /dev/root / ext2 rw  0 0 private 16 98:0 / 2 
> /proc /proc proc rw 0 0 unbindable 17 0:3 / 16 
> devpts /dev/pts devpts rw 0 0 private 18 0:10 / 16 
> /dev/root /mnt ext2 rw  0 0 shared:19 19 98:0 /mnt 16 
> /dev/root /var ext2 rw  0 0 shared:21,slave:19 20 98:0 /mnt/1 16 
> /dev/root /tmp ext2 rw  0 0 shared:20,slave:19 21 98:0 /mnt/1/abc 16 
> 
> For example, the last line indicates that :
> 
> 1) The mount is a shared mount.
> 2) Its peer mount of mount with id 20
> 3) It is also a slave mount of the master-mount with the id  19
> 4) The filesystem on device with major/minor number 98:0 and subdirectory 
>   mnt/1/abc makes the root directory of this mount.
> 5) And finally the mount with id 16 is its parent.
> 
> 
> Testing: symlinked /etc/mtab to /proc/mounts and did some mount and df 
> commands. They worked normally.
> 
> 
> 
> Signed-off-by: Ram Pai <[EMAIL PROTECTED]>

OK, I guess that would work.

Ram Pai, what is the current status of that work?


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: per BDI dirty limit (was Re: -mm merge plans for 2.6.24)

2007-12-14 Thread Miklos Szeredi

> Neil suggested using device numbers which would work, however I think
> those might not be human friendly. While its easy to find the device
> number of a given path (eg.: stat -c %d /), its rather hard to find the
> path belonging to a given device number.

Ram Pai had a patch which added the device number (among other things)
to /proc/mounts:



Subject: [RFC2 PATCH 1/1] VFS: Augment /proc/mount with subroot and
shared-subtree
From:   Ram Pai <[EMAIL PROTECTED]>
To: "H. Peter Anvin" <[EMAIL PROTECTED]>
Cc: Al Viro <[EMAIL PROTECTED]>,
Linux Kernel Mailing List ,
[EMAIL PROTECTED], [EMAIL PROTECTED]
In-Reply-To: <[EMAIL PROTECTED]>
Content-Type: text/plain
Date:   Mon, 16 Jul 2007 11:46:48 -0700
Content-Transfer-Encoding: 7bit
Sender: [EMAIL PROTECTED]
X-Mailing-List: [EMAIL PROTECTED]

/proc/mounts in its current state fail to disambiguate bind mounts, especially 
when the bind mount is subrooted. Also it does not capture propagation state of
the mounts(shared-subtree). The following patch addresses the problem.

The following additional fields to /proc/mounts are added.

propagation-type in the form of [:][,...]
note: 'shared' flag is followed by the mntid of its peer mount
  'slave' flag is followed by the mntid of its master mount
  'private' flag stands by itself
  'unbindable' flag stands by itself
   
mntid -- is a unique identifier of the mount
major:minor -- is the major minor number of the device hosting the filesystem
dir -- the subdir in the filesystem which forms the root of this mount
parent -- the id of the parent mount


Here is a sample cat /proc/mounts after execution the following commands:

mount --bind /mnt /mnt
mount --make-shared /mnt
mount --bind /mnt/1 /var
mount --make-slave /var
mount --make-shared /var
mount --bind /var/abc /tmp
mount --make-unbindable /proc

rootfs / rootfs rw 0 0 private 2 0:1 / 2 
/dev/root / ext2 rw  0 0 private 16 98:0 / 2 
/proc /proc proc rw 0 0 unbindable 17 0:3 / 16 
devpts /dev/pts devpts rw 0 0 private 18 0:10 / 16 
/dev/root /mnt ext2 rw  0 0 shared:19 19 98:0 /mnt 16 
/dev/root /var ext2 rw  0 0 shared:21,slave:19 20 98:0 /mnt/1 16 
/dev/root /tmp ext2 rw  0 0 shared:20,slave:19 21 98:0 /mnt/1/abc 16 

For example, the last line indicates that :

1) The mount is a shared mount.
2) Its peer mount of mount with id 20
3) It is also a slave mount of the master-mount with the id  19
4) The filesystem on device with major/minor number 98:0 and subdirectory 
mnt/1/abc makes the root directory of this mount.
5) And finally the mount with id 16 is its parent.


Testing: symlinked /etc/mtab to /proc/mounts and did some mount and df 
commands. They worked normally.



Signed-off-by: Ram Pai <[EMAIL PROTECTED]>

---
 fs/dcache.c  |   53 +++
 fs/namespace.c   |   35 +++-
 fs/pnode.c   |   22 +
 fs/pnode.h   |2 +
 fs/seq_file.c|   79 ++-
 include/linux/dcache.h   |2 +
 include/linux/mount.h|1 
 include/linux/seq_file.h |1 
 8 files changed, 172 insertions(+), 23 deletions(-)

Index: linux-2.6.21.5/fs/dcache.c
===
--- linux-2.6.21.5.orig/fs/dcache.c
+++ linux-2.6.21.5/fs/dcache.c
@@ -1835,6 +1835,59 @@ char * d_path(struct dentry *dentry, str
return res;
 }
 
+static inline int prepend(char **buffer, int *buflen, const char *str,
+   int namelen)
+{
+   if ((*buflen -= namelen) < 0)
+   return 1;
+   *buffer -= namelen;
+   memcpy(*buffer, str, namelen);
+   return 0;
+}
+
+/*
+ * write full pathname into buffer and return start of pathname.
+ * If @vfsmnt is not specified return the path relative to the
+ * its filesystem's root.
+ */
+char * dentry_path(struct dentry *dentry, char *buf, int buflen)
+{
+   char * end = buf+buflen;
+   char * retval;
+
+   spin_lock(_lock);
+   prepend(, , "\0", 1);
+   if (!IS_ROOT(dentry) && d_unhashed(dentry)) {
+   if (prepend(, , "//deleted", 10))
+   goto Elong;
+   }
+   /* Get '/' right */
+   retval = end-1;
+   *retval = '/';
+
+   for (;;) {
+   struct dentry * parent;
+   if (IS_ROOT(dentry))
+   break;
+
+   parent = dentry->d_parent;
+   prefetch(parent);
+
+   if (prepend(, , dentry->d_name.name,
+   dentry->d_name.len) ||
+   prepend(, , "/", 1))
+   goto Elong;
+
+   retval = end;
+   dentry = parent;
+   }
+   spin_unlock(_lock);
+   return retval;
+Elong:
+   spin_unlock(_lock);
+   return ERR_PTR(-ENAMETOOLONG);
+}
+
 /*
  * NOTE! The user-level library version

Re: per BDI dirty limit (was Re: -mm merge plans for 2.6.24)

2007-12-14 Thread Peter Zijlstra


On Fri, 2007-10-26 at 12:37 -0400, Trond Myklebust wrote:
> On Fri, 2007-10-26 at 16:48 +0200, Peter Zijlstra wrote:
> 
> > Miklos, Trond: could you suggest a better fmt for the bdi_init_fmt() for 
> > your
> > respective filesystems?
> 
> > Index: linux-2.6-2/fs/nfs/client.c
> > ===
> > --- linux-2.6-2.orig/fs/nfs/client.c
> > +++ linux-2.6-2/fs/nfs/client.c
> > @@ -678,7 +678,7 @@ static int nfs_probe_fsinfo(struct nfs_s
> > goto out_error;
> >  
> > nfs_server_set_fsinfo(server, );
> > -   error = bdi_init(>backing_dev_info);
> > +   error = bdi_init_fmt(>backing_dev_info, "nfs-%s-%p", 
> > clp->cl_hostname, server);
> 
> In our debugging printks, we usually use the superblock->s_id, but that
> only gets initialised later.
> 
> I'd suggest passing the 'dev_name' from *_get_sb() into
> *_create_server().

I just realised that such a name would contain '/' and I'm quite sure
that makes filesystems unhappy.

Neil suggested using device numbers which would work, however I think
those might not be human friendly. While its easy to find the device
number of a given path (eg.: stat -c %d /), its rather hard to find the
path belonging to a given device number.



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: per BDI dirty limit (was Re: -mm merge plans for 2.6.24)

2007-12-14 Thread Peter Zijlstra


On Fri, 2007-10-26 at 12:37 -0400, Trond Myklebust wrote:
 On Fri, 2007-10-26 at 16:48 +0200, Peter Zijlstra wrote:
 
  Miklos, Trond: could you suggest a better fmt for the bdi_init_fmt() for 
  your
  respective filesystems?
 snip
  Index: linux-2.6-2/fs/nfs/client.c
  ===
  --- linux-2.6-2.orig/fs/nfs/client.c
  +++ linux-2.6-2/fs/nfs/client.c
  @@ -678,7 +678,7 @@ static int nfs_probe_fsinfo(struct nfs_s
  goto out_error;
   
  nfs_server_set_fsinfo(server, fsinfo);
  -   error = bdi_init(server-backing_dev_info);
  +   error = bdi_init_fmt(server-backing_dev_info, nfs-%s-%p, 
  clp-cl_hostname, server);
 
 In our debugging printks, we usually use the superblock-s_id, but that
 only gets initialised later.
 
 I'd suggest passing the 'dev_name' from *_get_sb() into
 *_create_server().

I just realised that such a name would contain '/' and I'm quite sure
that makes filesystems unhappy.

Neil suggested using device numbers which would work, however I think
those might not be human friendly. While its easy to find the device
number of a given path (eg.: stat -c %d /), its rather hard to find the
path belonging to a given device number.



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: per BDI dirty limit (was Re: -mm merge plans for 2.6.24)

2007-12-14 Thread Miklos Szeredi

 Neil suggested using device numbers which would work, however I think
 those might not be human friendly. While its easy to find the device
 number of a given path (eg.: stat -c %d /), its rather hard to find the
 path belonging to a given device number.

Ram Pai had a patch which added the device number (among other things)
to /proc/mounts:



Subject: [RFC2 PATCH 1/1] VFS: Augment /proc/mount with subroot and
shared-subtree
From:   Ram Pai [EMAIL PROTECTED]
To: H. Peter Anvin [EMAIL PROTECTED]
Cc: Al Viro [EMAIL PROTECTED],
Linux Kernel Mailing List linux-kernel@vger.kernel.org,
[EMAIL PROTECTED], [EMAIL PROTECTED]
In-Reply-To: [EMAIL PROTECTED]
Content-Type: text/plain
Date:   Mon, 16 Jul 2007 11:46:48 -0700
Content-Transfer-Encoding: 7bit
Sender: [EMAIL PROTECTED]
X-Mailing-List: [EMAIL PROTECTED]

/proc/mounts in its current state fail to disambiguate bind mounts, especially 
when the bind mount is subrooted. Also it does not capture propagation state of
the mounts(shared-subtree). The following patch addresses the problem.

The following additional fields to /proc/mounts are added.

propagation-type in the form of propagation_flag[:mntid][,...]
note: 'shared' flag is followed by the mntid of its peer mount
  'slave' flag is followed by the mntid of its master mount
  'private' flag stands by itself
  'unbindable' flag stands by itself
   
mntid -- is a unique identifier of the mount
major:minor -- is the major minor number of the device hosting the filesystem
dir -- the subdir in the filesystem which forms the root of this mount
parent -- the id of the parent mount


Here is a sample cat /proc/mounts after execution the following commands:

mount --bind /mnt /mnt
mount --make-shared /mnt
mount --bind /mnt/1 /var
mount --make-slave /var
mount --make-shared /var
mount --bind /var/abc /tmp
mount --make-unbindable /proc

rootfs / rootfs rw 0 0 private 2 0:1 / 2 
/dev/root / ext2 rw  0 0 private 16 98:0 / 2 
/proc /proc proc rw 0 0 unbindable 17 0:3 / 16 
devpts /dev/pts devpts rw 0 0 private 18 0:10 / 16 
/dev/root /mnt ext2 rw  0 0 shared:19 19 98:0 /mnt 16 
/dev/root /var ext2 rw  0 0 shared:21,slave:19 20 98:0 /mnt/1 16 
/dev/root /tmp ext2 rw  0 0 shared:20,slave:19 21 98:0 /mnt/1/abc 16 

For example, the last line indicates that :

1) The mount is a shared mount.
2) Its peer mount of mount with id 20
3) It is also a slave mount of the master-mount with the id  19
4) The filesystem on device with major/minor number 98:0 and subdirectory 
mnt/1/abc makes the root directory of this mount.
5) And finally the mount with id 16 is its parent.


Testing: symlinked /etc/mtab to /proc/mounts and did some mount and df 
commands. They worked normally.



Signed-off-by: Ram Pai [EMAIL PROTECTED]

---
 fs/dcache.c  |   53 +++
 fs/namespace.c   |   35 +++-
 fs/pnode.c   |   22 +
 fs/pnode.h   |2 +
 fs/seq_file.c|   79 ++-
 include/linux/dcache.h   |2 +
 include/linux/mount.h|1 
 include/linux/seq_file.h |1 
 8 files changed, 172 insertions(+), 23 deletions(-)

Index: linux-2.6.21.5/fs/dcache.c
===
--- linux-2.6.21.5.orig/fs/dcache.c
+++ linux-2.6.21.5/fs/dcache.c
@@ -1835,6 +1835,59 @@ char * d_path(struct dentry *dentry, str
return res;
 }
 
+static inline int prepend(char **buffer, int *buflen, const char *str,
+   int namelen)
+{
+   if ((*buflen -= namelen)  0)
+   return 1;
+   *buffer -= namelen;
+   memcpy(*buffer, str, namelen);
+   return 0;
+}
+
+/*
+ * write full pathname into buffer and return start of pathname.
+ * If @vfsmnt is not specified return the path relative to the
+ * its filesystem's root.
+ */
+char * dentry_path(struct dentry *dentry, char *buf, int buflen)
+{
+   char * end = buf+buflen;
+   char * retval;
+
+   spin_lock(dcache_lock);
+   prepend(end, buflen, \0, 1);
+   if (!IS_ROOT(dentry)  d_unhashed(dentry)) {
+   if (prepend(end, buflen, //deleted, 10))
+   goto Elong;
+   }
+   /* Get '/' right */
+   retval = end-1;
+   *retval = '/';
+
+   for (;;) {
+   struct dentry * parent;
+   if (IS_ROOT(dentry))
+   break;
+
+   parent = dentry-d_parent;
+   prefetch(parent);
+
+   if (prepend(end, buflen, dentry-d_name.name,
+   dentry-d_name.len) ||
+   prepend(end, buflen, /, 1))
+   goto Elong;
+
+   retval = end;
+   dentry = parent;
+   }
+   spin_unlock(dcache_lock);
+   return retval;
+Elong:
+   spin_unlock(dcache_lock);
+   return

Re: per BDI dirty limit (was Re: -mm merge plans for 2.6.24)

2007-12-14 Thread Peter Zijlstra


On Fri, 2007-12-14 at 16:14 +0100, Miklos Szeredi wrote:
  Neil suggested using device numbers which would work, however I think
  those might not be human friendly. While its easy to find the device
  number of a given path (eg.: stat -c %d /), its rather hard to find the
  path belonging to a given device number.
 
 Ram Pai had a patch which added the device number (among other things)
 to /proc/mounts:
 
 
 
 Subject: [RFC2 PATCH 1/1] VFS: Augment /proc/mount with subroot and
   shared-subtree
 From: Ram Pai [EMAIL PROTECTED]
 To:   H. Peter Anvin [EMAIL PROTECTED]
 Cc:   Al Viro [EMAIL PROTECTED],
   Linux Kernel Mailing List linux-kernel@vger.kernel.org,
   [EMAIL PROTECTED], [EMAIL PROTECTED]
 In-Reply-To: [EMAIL PROTECTED]
 Content-Type: text/plain
 Date: Mon, 16 Jul 2007 11:46:48 -0700
 Content-Transfer-Encoding: 7bit
 Sender:   [EMAIL PROTECTED]
 X-Mailing-List:   [EMAIL PROTECTED]
 
 /proc/mounts in its current state fail to disambiguate bind mounts, 
 especially 
 when the bind mount is subrooted. Also it does not capture propagation state 
 of
 the mounts(shared-subtree). The following patch addresses the problem.
 
 The following additional fields to /proc/mounts are added.
 
 propagation-type in the form of propagation_flag[:mntid][,...]
   note: 'shared' flag is followed by the mntid of its peer mount
 'slave' flag is followed by the mntid of its master mount
 'private' flag stands by itself
 'unbindable' flag stands by itself

 mntid -- is a unique identifier of the mount
 major:minor -- is the major minor number of the device hosting the filesystem
 dir -- the subdir in the filesystem which forms the root of this mount
 parent -- the id of the parent mount
 
 
 Here is a sample cat /proc/mounts after execution the following commands:
 
 mount --bind /mnt /mnt
 mount --make-shared /mnt
 mount --bind /mnt/1 /var
 mount --make-slave /var
 mount --make-shared /var
 mount --bind /var/abc /tmp
 mount --make-unbindable /proc
 
 rootfs / rootfs rw 0 0 private 2 0:1 / 2 
 /dev/root / ext2 rw  0 0 private 16 98:0 / 2 
 /proc /proc proc rw 0 0 unbindable 17 0:3 / 16 
 devpts /dev/pts devpts rw 0 0 private 18 0:10 / 16 
 /dev/root /mnt ext2 rw  0 0 shared:19 19 98:0 /mnt 16 
 /dev/root /var ext2 rw  0 0 shared:21,slave:19 20 98:0 /mnt/1 16 
 /dev/root /tmp ext2 rw  0 0 shared:20,slave:19 21 98:0 /mnt/1/abc 16 
 
 For example, the last line indicates that :
 
 1) The mount is a shared mount.
 2) Its peer mount of mount with id 20
 3) It is also a slave mount of the master-mount with the id  19
 4) The filesystem on device with major/minor number 98:0 and subdirectory 
   mnt/1/abc makes the root directory of this mount.
 5) And finally the mount with id 16 is its parent.
 
 
 Testing: symlinked /etc/mtab to /proc/mounts and did some mount and df 
 commands. They worked normally.
 
 
 
 Signed-off-by: Ram Pai [EMAIL PROTECTED]

OK, I guess that would work.

Ram Pai, what is the current status of that work?


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: per BDI dirty limit (was Re: -mm merge plans for 2.6.24)

2007-11-02 Thread Kay Sievers

On Fri, 2007-11-02 at 15:17 +0100, Peter Zijlstra wrote:
> One more question,
> 
> I currently prefix the names with "bdi-", is that needed?

Not really.

> That is, if I give the bdi object a parent, how will it look?
> Would a bdi device with name "sda" with a block device called "sda" as
> parent look like: /sys/block/sda/sda? Or would if be
> called /sys/block/sda/bdi:sda or just /sys/block/sda/bdi?

The class devices as childs get their own subdirectory at the parent. So
it would be /sys/block/sda/bdi/sda

It's currently only implemented for class devices which get a bus device
as a parent, but that will be for all parents, so that we prevent
clashing names if devices from multiple subsystems get the same parent.
See here for netdevs there is a "net" "glue directory":

  $ ls -l /sys/class/net/eth0
  /sys/class/net/eth0 -> 
../../devices/pci:00/:00:1c.0/:02:00.0/net/eth0

We needed this, because people complained that they can't name their
netif "irq", because there is already an attribute at the parent with
that name. :)

When we get the "block as devices" patch merged, we can do the proper
parent logic for bdi, and I'll add the same logic as we have for bus
device parents today.

Kay

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: per BDI dirty limit (was Re: -mm merge plans for 2.6.24)

2007-11-02 Thread Peter Zijlstra

One more question,

I currently prefix the names with "bdi-", is that needed?

That is, if I give the bdi object a parent, how will it look?
Would a bdi device with name "sda" with a block device called "sda" as
parent look like: /sys/block/sda/sda? Or would if be
called /sys/block/sda/bdi:sda or just /sys/block/sda/bdi?



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: per BDI dirty limit (was Re: -mm merge plans for 2.6.24)

2007-11-02 Thread Peter Zijlstra

On Fri, 2007-11-02 at 14:50 +0100, Kay Sievers wrote:
> On Nov 2, 2007 2:15 PM, Peter Zijlstra <[EMAIL PROTECTED]> wrote:
> > Thanks for the help so far, however we're still not quite there.
> >
> > The below patch still has the funny 20 character name limit. Is there a
> > good reason its a char array like this, and not just a char * to a kstr?
> > The code does kstrdup all over the place, I can't imagine why suddenly
> > limiting it to 20 chars seems like a good idea.
> 
> You are absolutely right, it doesn't make any sense. The 20 char limit
> is bad, but really,
> having the name duplicated in the device structure, while the name is
> already in the
> embedded kobject, is really bad.
> 
> Greg recently got rid of the 20 chars in the kobject, now we need to fix the
> devices to completely get rid of the static bus_id string array, and just set
> the kobject name directly.
> It's all long overdue to fix things like this in the driver core -
> it's such a mess. After the
> kset cleanup Greg and I are doing currently, we will remove that silly limit.

Ok, great!

Could I ask you to nudge me awake once those patches hit a git tree
somewhere?

> Hmm, regardless of the limit, isn't there a better device name than a memory
> address of a kernel structure. :)

Yes there is, Trond already suggested a proper replacement, however so
far I've been just trying to get it to work before trying to make it
pretty.

Will implement Trond's suggestion while you and Greg eradicate this 20
byte thing :-)

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: per BDI dirty limit (was Re: -mm merge plans for 2.6.24)

2007-11-02 Thread Kay Sievers

On Nov 2, 2007 2:15 PM, Peter Zijlstra <[EMAIL PROTECTED]> wrote:
> Thanks for the help so far, however we're still not quite there.
>
> The below patch still has the funny 20 character name limit. Is there a
> good reason its a char array like this, and not just a char * to a kstr?
> The code does kstrdup all over the place, I can't imagine why suddenly
> limiting it to 20 chars seems like a good idea.

You are absolutely right, it doesn't make any sense. The 20 char limit
is bad, but really,
having the name duplicated in the device structure, while the name is
already in the
embedded kobject, is really bad.

Greg recently got rid of the 20 chars in the kobject, now we need to fix the
devices to completely get rid of the static bus_id string array, and just set
the kobject name directly.
It's all long overdue to fix things like this in the driver core -
it's such a mess. After the
kset cleanup Greg and I are doing currently, we will remove that silly limit.

Hmm, regardless of the limit, isn't there a better device name than a memory
address of a kernel structure. :) If there is no better data,
shouldn't we get something
like an atomic counter somewhere in the nfs code, which just increases
with every
instance, and we could use that number as a connection id?

Kay
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: per BDI dirty limit (was Re: -mm merge plans for 2.6.24)

2007-11-02 Thread Peter Zijlstra

Hi,

Thanks for the help so far, however we're still not quite there.

The below patch still has the funny 20 character name limit. Is there a
good reason its a char array like this, and not just a char * to a kstr?
The code does kstrdup all over the place, I can't imagine why suddenly
limiting it to 20 chars seems like a good idea.


Mounting NFS filesystems:  mount: 192.168.0.1:/mnt/ying failed, reason given by 
server: Permission denied
mount: 192.168.0.1:/mnt/yang failed, reason given by server: Permission denied
[   52.705052] sysfs: duplicate filename 'bdi-nfs-192.168.0.1' can not be 
created
[   52.712517] WARNING: at /mnt/md0/src/linux-2.6-2/fs/sysfs/dir.c:424 
sysfs_add_one()
[   52.720243]
[   52.720244] Call Trace:
[   52.724311]  [] sysfs_add_one+0xac/0xe0
[   52.729708]  [] create_dir+0x63/0xc0
[   52.734832]  [] sysfs_create_dir+0x34/0x50
[   52.740489]  [] _spin_unlock+0x30/0x60
[   52.745792]  [] kobject_add+0xbe/0x200
[   52.751094]  [] device_add+0xc0/0x680
[   52.756305]  [] device_register+0x19/0x20
[   52.761877]  [] device_create+0xe7/0x120
[   52.767360]  [] vsnprintf+0x2bc/0x690
[   52.772585]  [] kvasprintf+0x70/0x90
[   52.24]  [] bdi_register+0x9b/0xe0
[   52.783037]  [] percpu_counter_init_irq+0x39/0x50
[   52.789299]  [] prop_local_init_percpu+0x3c/0x50
[   52.795462]  [] bdi_init+0x61/0xb0
[   52.800411]  [] nfs_probe_fsinfo+0x4d9/0x640
[   52.806226]  [] nfs_create_server+0x1ea/0x560
[   52.812123]  [] lock_release_holdtime+0x66/0x80
[   52.818217]  [] __d_lookup+0x106/0x1d0
[   52.823529]  [] __kmalloc_track_caller+0xa5/0x100
[   52.829789]  [] trace_hardirqs_on+0xd5/0x170
[   52.835618]  [] nfs_get_sb+0x2b2/0x740
[   52.840931]  [] nfs_get_sb+0x2f9/0x740
[   52.846245]  [] __put_mnt_ns+0x90/0xa0
[   52.851556]  [] vfs_kern_mount+0xbb/0x150
[   52.857127]  [] do_kern_mount+0x4e/0x100
[   52.862614]  [] do_mount+0x4dc/0x7a0
[   52.867751]  [] trace_hardirqs_on+0xd5/0x170
[   52.873580]  [] _spin_unlock_irqrestore+0x42/0x80
[   52.879841]  [] __up_read+0x45/0xb0
[   52.884894]  [] search_exception_tables+0x25/0x40
[   52.891155]  [] get_page_from_freelist+0x4a5/0x760
[   52.897487]  [] bad_range+0x1f/0x80
[   52.902540]  [] get_page_from_freelist+0x387/0x760
[   52.90]  [] __alloc_pages+0x6e/0x3b0
[   52.914374]  [] alloc_pages_current+0x5a/0x90
[   52.920290]  [] __get_free_pages+0x1b/0x40
[   52.925929]  [] copy_mount_options+0x52/0x170
[   52.931827]  [] sys_mount+0x94/0xe0
[   52.936881]  [] trace_hardirqs_on_thunk+0x35/0x3a
[   52.943141]  [] system_call+0x7e/0x83
[   52.948357]
[   52.949850] kobject_add failed for bdi-nfs-192.168.0.1 with -EEXIST, don't 
try to register things with the same name in the same directory.
[   52.962332]

(just in case it wasn't obvious, the -%p part that was supposed to make
it unique got truncated)

---
 block/genhd.c   |3 +
 fs/fuse/inode.c |3 -
 fs/nfs/client.c |3 -
 include/linux/backing-dev.h |   19 +++
 include/linux/writeback.h   |3 +
 lib/percpu_counter.c|1 
 mm/backing-dev.c|  107 
 mm/page-writeback.c |2 
 8 files changed, 138 insertions(+), 3 deletions(-)

Index: linux-2.6-2/block/genhd.c
===
--- linux-2.6-2.orig/block/genhd.c
+++ linux-2.6-2/block/genhd.c
@@ -182,6 +182,8 @@ void add_disk(struct gendisk *disk)
disk->minors, NULL, exact_match, exact_lock, disk);
register_disk(disk);
blk_register_queue(disk);
+   bdi_register(>queue->backing_dev_info, NULL,
+   "bdi-%s", disk->disk_name);
 }
 
 EXPORT_SYMBOL(add_disk);
@@ -190,6 +192,7 @@ EXPORT_SYMBOL(del_gendisk); /* in partit
 void unlink_gendisk(struct gendisk *disk)
 {
blk_unregister_queue(disk);
+   bdi_unregister(>queue->backing_dev_info);
blk_unregister_region(MKDEV(disk->major, disk->first_minor),
  disk->minors);
 }
Index: linux-2.6-2/fs/fuse/inode.c
===
--- linux-2.6-2.orig/fs/fuse/inode.c
+++ linux-2.6-2/fs/fuse/inode.c
@@ -467,7 +467,8 @@ static struct fuse_conn *new_conn(void)
atomic_set(>num_waiting, 0);
fc->bdi.ra_pages = (VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE;
fc->bdi.unplug_io_fn = default_unplug_io_fn;
-   err = bdi_init(>bdi);
+   err = bdi_init_fmt(>bdi, NULL,
+   "bdi-fuse-%llu", (unsigned long long)fc->id);
if (err) {
kfree(fc);
fc = NULL;
Index: linux-2.6-2/fs/nfs/client.c
===
--- linux-2.6-2.orig/fs/nfs/client.c
+++ linux-2.6-2/fs/nfs/client.c
@@ -678,7 +678,8 @@ static int nfs_probe_fsinfo(struct nfs_s
goto out_error;

Re: per BDI dirty limit (was Re: -mm merge plans for 2.6.24)

2007-11-02 Thread Peter Zijlstra

One more question,

I currently prefix the names with bdi-, is that needed?

That is, if I give the bdi object a parent, how will it look?
Would a bdi device with name sda with a block device called sda as
parent look like: /sys/block/sda/sda? Or would if be
called /sys/block/sda/bdi:sda or just /sys/block/sda/bdi?



-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: per BDI dirty limit (was Re: -mm merge plans for 2.6.24)

2007-11-02 Thread Kay Sievers

On Fri, 2007-11-02 at 15:17 +0100, Peter Zijlstra wrote:
 One more question,
 
 I currently prefix the names with bdi-, is that needed?

Not really.

 That is, if I give the bdi object a parent, how will it look?
 Would a bdi device with name sda with a block device called sda as
 parent look like: /sys/block/sda/sda? Or would if be
 called /sys/block/sda/bdi:sda or just /sys/block/sda/bdi?

The class devices as childs get their own subdirectory at the parent. So
it would be /sys/block/sda/bdi/sda

It's currently only implemented for class devices which get a bus device
as a parent, but that will be for all parents, so that we prevent
clashing names if devices from multiple subsystems get the same parent.
See here for netdevs there is a net glue directory:

  $ ls -l /sys/class/net/eth0
  /sys/class/net/eth0 - 
../../devices/pci:00/:00:1c.0/:02:00.0/net/eth0

We needed this, because people complained that they can't name their
netif irq, because there is already an attribute at the parent with
that name. :)

When we get the block as devices patch merged, we can do the proper
parent logic for bdi, and I'll add the same logic as we have for bus
device parents today.

Kay

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: per BDI dirty limit (was Re: -mm merge plans for 2.6.24)

2007-11-02 Thread Peter Zijlstra

On Fri, 2007-11-02 at 14:50 +0100, Kay Sievers wrote:
 On Nov 2, 2007 2:15 PM, Peter Zijlstra [EMAIL PROTECTED] wrote:
  Thanks for the help so far, however we're still not quite there.
 
  The below patch still has the funny 20 character name limit. Is there a
  good reason its a char array like this, and not just a char * to a kstr?
  The code does kstrdup all over the place, I can't imagine why suddenly
  limiting it to 20 chars seems like a good idea.
 
 You are absolutely right, it doesn't make any sense. The 20 char limit
 is bad, but really,
 having the name duplicated in the device structure, while the name is
 already in the
 embedded kobject, is really bad.
 
 Greg recently got rid of the 20 chars in the kobject, now we need to fix the
 devices to completely get rid of the static bus_id string array, and just set
 the kobject name directly.
 It's all long overdue to fix things like this in the driver core -
 it's such a mess. After the
 kset cleanup Greg and I are doing currently, we will remove that silly limit.

Ok, great!

Could I ask you to nudge me awake once those patches hit a git tree
somewhere?

 Hmm, regardless of the limit, isn't there a better device name than a memory
 address of a kernel structure. :)

Yes there is, Trond already suggested a proper replacement, however so
far I've been just trying to get it to work before trying to make it
pretty.

Will implement Trond's suggestion while you and Greg eradicate this 20
byte thing :-)

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: per BDI dirty limit (was Re: -mm merge plans for 2.6.24)

2007-11-02 Thread Kay Sievers

On Nov 2, 2007 2:15 PM, Peter Zijlstra [EMAIL PROTECTED] wrote:
 Thanks for the help so far, however we're still not quite there.

 The below patch still has the funny 20 character name limit. Is there a
 good reason its a char array like this, and not just a char * to a kstr?
 The code does kstrdup all over the place, I can't imagine why suddenly
 limiting it to 20 chars seems like a good idea.

You are absolutely right, it doesn't make any sense. The 20 char limit
is bad, but really,
having the name duplicated in the device structure, while the name is
already in the
embedded kobject, is really bad.

Greg recently got rid of the 20 chars in the kobject, now we need to fix the
devices to completely get rid of the static bus_id string array, and just set
the kobject name directly.
It's all long overdue to fix things like this in the driver core -
it's such a mess. After the
kset cleanup Greg and I are doing currently, we will remove that silly limit.

Hmm, regardless of the limit, isn't there a better device name than a memory
address of a kernel structure. :) If there is no better data,
shouldn't we get something
like an atomic counter somewhere in the nfs code, which just increases
with every
instance, and we could use that number as a connection id?

Kay
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: per BDI dirty limit (was Re: -mm merge plans for 2.6.24)

2007-11-02 Thread Peter Zijlstra

Hi,

Thanks for the help so far, however we're still not quite there.

The below patch still has the funny 20 character name limit. Is there a
good reason its a char array like this, and not just a char * to a kstr?
The code does kstrdup all over the place, I can't imagine why suddenly
limiting it to 20 chars seems like a good idea.


Mounting NFS filesystems:  mount: 192.168.0.1:/mnt/ying failed, reason given by 
server: Permission denied
mount: 192.168.0.1:/mnt/yang failed, reason given by server: Permission denied
[   52.705052] sysfs: duplicate filename 'bdi-nfs-192.168.0.1' can not be 
created
[   52.712517] WARNING: at /mnt/md0/src/linux-2.6-2/fs/sysfs/dir.c:424 
sysfs_add_one()
[   52.720243]
[   52.720244] Call Trace:
[   52.724311]  [80304f7c] sysfs_add_one+0xac/0xe0
[   52.729708]  [80305613] create_dir+0x63/0xc0
[   52.734832]  [803056a4] sysfs_create_dir+0x34/0x50
[   52.740489]  [804d5710] _spin_unlock+0x30/0x60
[   52.745792]  [8036958e] kobject_add+0xbe/0x200
[   52.751094]  [803dfc40] device_add+0xc0/0x680
[   52.756305]  [803e0219] device_register+0x19/0x20
[   52.761877]  [803e0697] device_create+0xe7/0x120
[   52.767360]  [8036e0bc] vsnprintf+0x2bc/0x690
[   52.772585]  [80370380] kvasprintf+0x70/0x90
[   52.24]  [80295bbb] bdi_register+0x9b/0xe0
[   52.783037]  [8037e039] percpu_counter_init_irq+0x39/0x50
[   52.789299]  [8036abcc] prop_local_init_percpu+0x3c/0x50
[   52.795462]  [80295a41] bdi_init+0x61/0xb0
[   52.800411]  [80308069] nfs_probe_fsinfo+0x4d9/0x640
[   52.806226]  [80308f9a] nfs_create_server+0x1ea/0x560
[   52.812123]  [80263526] lock_release_holdtime+0x66/0x80
[   52.818217]  [802ca936] __d_lookup+0x106/0x1d0
[   52.823529]  [802b0fd5] __kmalloc_track_caller+0xa5/0x100
[   52.829789]  [80266185] trace_hardirqs_on+0xd5/0x170
[   52.835618]  [80312412] nfs_get_sb+0x2b2/0x740
[   52.840931]  [80312459] nfs_get_sb+0x2f9/0x740
[   52.846245]  [802d] __put_mnt_ns+0x90/0xa0
[   52.851556]  [802b89db] vfs_kern_mount+0xbb/0x150
[   52.857127]  [802b8ade] do_kern_mount+0x4e/0x100
[   52.862614]  [802d122c] do_mount+0x4dc/0x7a0
[   52.867751]  [80266185] trace_hardirqs_on+0xd5/0x170
[   52.873580]  [804d5842] _spin_unlock_irqrestore+0x42/0x80
[   52.879841]  [8036c505] __up_read+0x45/0xb0
[   52.884894]  [80255aa5] search_exception_tables+0x25/0x40
[   52.891155]  [8028d1d5] get_page_from_freelist+0x4a5/0x760
[   52.897487]  [8028a58f] bad_range+0x1f/0x80
[   52.902540]  [8028d0b7] get_page_from_freelist+0x387/0x760
[   52.90]  [8028d54e] __alloc_pages+0x6e/0x3b0
[   52.914374]  [802a80ea] alloc_pages_current+0x5a/0x90
[   52.920290]  [8028c97b] __get_free_pages+0x1b/0x40
[   52.925929]  [802cf892] copy_mount_options+0x52/0x170
[   52.931827]  [802d1584] sys_mount+0x94/0xe0
[   52.936881]  [804d49cd] trace_hardirqs_on_thunk+0x35/0x3a
[   52.943141]  [8020c43e] system_call+0x7e/0x83
[   52.948357]
[   52.949850] kobject_add failed for bdi-nfs-192.168.0.1 with -EEXIST, don't 
try to register things with the same name in the same directory.
[   52.962332]

(just in case it wasn't obvious, the -%p part that was supposed to make
it unique got truncated)

---
 block/genhd.c   |3 +
 fs/fuse/inode.c |3 -
 fs/nfs/client.c |3 -
 include/linux/backing-dev.h |   19 +++
 include/linux/writeback.h   |3 +
 lib/percpu_counter.c|1 
 mm/backing-dev.c|  107 
 mm/page-writeback.c |2 
 8 files changed, 138 insertions(+), 3 deletions(-)

Index: linux-2.6-2/block/genhd.c
===
--- linux-2.6-2.orig/block/genhd.c
+++ linux-2.6-2/block/genhd.c
@@ -182,6 +182,8 @@ void add_disk(struct gendisk *disk)
disk-minors, NULL, exact_match, exact_lock, disk);
register_disk(disk);
blk_register_queue(disk);
+   bdi_register(disk-queue-backing_dev_info, NULL,
+   bdi-%s, disk-disk_name);
 }
 
 EXPORT_SYMBOL(add_disk);
@@ -190,6 +192,7 @@ EXPORT_SYMBOL(del_gendisk); /* in partit
 void unlink_gendisk(struct gendisk *disk)
 {
blk_unregister_queue(disk);
+   bdi_unregister(disk-queue-backing_dev_info);
blk_unregister_region(MKDEV(disk-major, disk-first_minor),
  disk-minors);
 }
Index: linux-2.6-2/fs/fuse/inode.c
===
--- linux-2.6-2.orig/fs/fuse/inode.c
+++ linux-2.6-2/fs/fuse/inode.c
@@ -467,7 +467,8 @@ static struct fuse_conn *new_conn(void)
atomic_set(fc-num_waiting, 0);
fc-bdi.ra_pages =

Re: per BDI dirty limit (was Re: -mm merge plans for 2.6.24)

2007-10-28 Thread Greg KH

On Sat, Oct 27, 2007 at 11:35:45PM +0200, Peter Zijlstra wrote:
> On Sat, 2007-10-27 at 23:08 +0200, Kay Sievers wrote:
> > All this open-coded attribute stuff should go away and be replaced by:
> >   bdi_class->dev_attrs = bdi_dev_attrs;
> > Otherwise at event time the attributes are not created and stuff hooking
> > into the events will not be able to set values. Also, the core will do
> > proper add/remove and error handling then.
> 
> ok, that's good to know. someone ought to write a book on how to use all
> this... really, even the functions are bare of documentation or
> comments.

Yes, I know, sorry :(

I'm working on cleaning up the apis a lot right now, so hopefully, in a
few months things will have settled down.  I'm trying to document things
as I go, but right now I'm stuck at a much lower level (ksets and
friends...)

thanks,

greg k-h
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: per BDI dirty limit (was Re: -mm merge plans for 2.6.24)

2007-10-28 Thread Greg KH

On Sat, Oct 27, 2007 at 11:35:45PM +0200, Peter Zijlstra wrote:
 On Sat, 2007-10-27 at 23:08 +0200, Kay Sievers wrote:
  All this open-coded attribute stuff should go away and be replaced by:
bdi_class-dev_attrs = bdi_dev_attrs;
  Otherwise at event time the attributes are not created and stuff hooking
  into the events will not be able to set values. Also, the core will do
  proper add/remove and error handling then.
 
 ok, that's good to know. someone ought to write a book on how to use all
 this... really, even the functions are bare of documentation or
 comments.

Yes, I know, sorry :(

I'm working on cleaning up the apis a lot right now, so hopefully, in a
few months things will have settled down.  I'm trying to document things
as I go, but right now I'm stuck at a much lower level (ksets and
friends...)

thanks,

greg k-h
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: per BDI dirty limit (was Re: -mm merge plans for 2.6.24)

2007-10-27 Thread Peter Zijlstra

On Sat, 2007-10-27 at 23:08 +0200, Kay Sievers wrote:
> On Sat, 2007-10-27 at 09:02 -0700, Greg KH wrote:

> > Ah, I see a few problems.  Here, try this version instead.  It's
> > compile-tested only, and should be a lot simpler.
> > 
> > Note, we still are not setting the parent to the new bdi structure
> > properly, so the devices will show up in /sys/devices/virtual/ instead
> > of in their proper location.  To do this, we need the parent of the
> > device, which I'm not so sure what it should be (block device?  block
> > device controller?)
> 
> Assigning a parent device will only work with the upcoming conversion of
> the raw kobjects in the block subsystem to "struct device".
> 
> A few comments to the patch:
> 
> > --- a/include/linux/string.h
> > +++ b/include/linux/string.h
> > @@ -8,6 +8,7 @@
> >  #include /* for inline */
> >  #include/* for size_t */
> >  #include   /* for NULL */
> > +#include 
> >  
> >  #ifdef __cplusplus
> >  extern "C" {
> > @@ -111,6 +112,9 @@ extern void *kmemdup(const void *src, si
> >  extern char **argv_split(gfp_t gfp, const char *str, int *argcp);
> >  extern void argv_free(char **argv);
> >  
> > +char *kvprintf(const char *fmt, va_list args);
> > +char *kprintf(const char *fmt, ...);
> 
> Why is that here? I don't think we need this when we use the existing:
>   kvasprintf(GFP_KERNEL, fmt, args)

Ignorance of the existance of said function. Thanks for pointing it out.
(kobject_set_name ought to use it too I guess)

> > --- a/mm/backing-dev.c
> > +++ b/mm/backing-dev.c
> 
> > +
> > +static struct device_attribute bdi_dev_attrs[] = {
> > +   __ATTR(readahead, 0644, readahead_show, readahead_store),
> > +   __ATTR_RO(reclaimable),
> > +   __ATTR_RO(writeback),
> > +   __ATTR_RO(dirty),
> > +   __ATTR_RO(bdi_dirty),
> > +};
> 
> Default attributes will need the NULL termination back (see below).
> 
> > +static __init int bdi_class_init(void)
> > +{
> > +   bdi_class = class_create(THIS_MODULE, "bdi");
> > +   return 0;
> > +}
> > +
> > +__initcall(bdi_class_init);
> > +
> > +int bdi_register(struct backing_dev_info *bdi, const char *fmt, ...)
> 
> This function should accept a: "struct device *parent" and all callers
> just pass NULL until the block layer conversion gets merged.

Yeah, you're right, but I wanted to just get something working before
bothering with the parent thing.

> > +{
> > +   char *name;
> > +   va_list args;
> > +   int ret = -ENOMEM;
> > +   int i;
> > +
> > +   va_start(args, fmt);
> > +   name = kvprintf(fmt, args);
> 
> kvasprintf(GFP_KERNEL, fmt, args);
> 
> > +   va_end(args);
> > +
> > +   if (!name)
> > +   return -ENOMEM;
> > +
> > +   bdi->dev = device_create(bdi_class, NULL, MKDEV(0,0), name);
> 
> The parent should be passed here.
> 
> > +   for (i = 0; i < ARRAY_SIZE(bdi_dev_attrs); i++) {
> > +   ret = device_create_file(bdi->dev, _dev_attrs[i]);
> > +   if (ret)
> > +   break;
> > +   }
> > +   if (ret) {
> > +   while (--i >= 0)
> > +   device_remove_file(bdi->dev, _dev_attrs[i]);
> > +   device_unregister(bdi->dev);
> > +   bdi->dev = NULL;
> > +   }
> 
> All this open-coded attribute stuff should go away and be replaced by:
>   bdi_class->dev_attrs = bdi_dev_attrs;
> Otherwise at event time the attributes are not created and stuff hooking
> into the events will not be able to set values. Also, the core will do
> proper add/remove and error handling then.

ok, that's good to know. someone ought to write a book on how to use all
this... really, even the functions are bare of documentation or
comments.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: per BDI dirty limit (was Re: -mm merge plans for 2.6.24)

2007-10-27 Thread Kay Sievers

On Sat, 2007-10-27 at 09:02 -0700, Greg KH wrote:
> On Sat, Oct 27, 2007 at 10:39:59AM +0200, Peter Zijlstra wrote:
> > On Fri, 2007-10-26 at 19:40 -0700, Greg KH wrote:
> > > On Sat, Oct 27, 2007 at 03:18:08AM +0200, Peter Zijlstra wrote:
> > > > On Fri, 2007-10-26 at 22:04 +0200, Peter Zijlstra wrote:
> > > > > This crashes and burns on bootup, but I'm too tired to figure out 
> > > > > what I
> > > > > did wrong... will give it another try tomorrow..
> > > > 
> > > > Ok, can't sleep.. took a look. I have several problems here.
> > > > 
> > > > The thing that makes it go *boom* is the __ATTR_NULL. Removing that
> > > > makes it boot. Albeit it then warns me of multiple duplicate sysfs
> > > > objects, all named "bdi".

> > > I'll look at this and see what I can come up with.  Would you just like
> > > a whole new patch, or one against this one?
> > 
> > Sorry for the grumpy note, I get that way at 3.30 am. Maybe I ought not
> > have mailed :-/
> > 
> > This is the code I had at that time.
> 
> Ah, I see a few problems.  Here, try this version instead.  It's
> compile-tested only, and should be a lot simpler.
> 
> Note, we still are not setting the parent to the new bdi structure
> properly, so the devices will show up in /sys/devices/virtual/ instead
> of in their proper location.  To do this, we need the parent of the
> device, which I'm not so sure what it should be (block device?  block
> device controller?)

Assigning a parent device will only work with the upcoming conversion of
the raw kobjects in the block subsystem to "struct device".

A few comments to the patch:

> --- a/include/linux/string.h
> +++ b/include/linux/string.h
> @@ -8,6 +8,7 @@
>  #include   /* for inline */
>  #include  /* for size_t */
>  #include /* for NULL */
> +#include 
>  
>  #ifdef __cplusplus
>  extern "C" {
> @@ -111,6 +112,9 @@ extern void *kmemdup(const void *src, si
>  extern char **argv_split(gfp_t gfp, const char *str, int *argcp);
>  extern void argv_free(char **argv);
>  
> +char *kvprintf(const char *fmt, va_list args);
> +char *kprintf(const char *fmt, ...);

Why is that here? I don't think we need this when we use the existing:
  kvasprintf(GFP_KERNEL, fmt, args)

> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c

> +
> +static struct device_attribute bdi_dev_attrs[] = {
> + __ATTR(readahead, 0644, readahead_show, readahead_store),
> + __ATTR_RO(reclaimable),
> + __ATTR_RO(writeback),
> + __ATTR_RO(dirty),
> + __ATTR_RO(bdi_dirty),
> +};

Default attributes will need the NULL termination back (see below).

> +static __init int bdi_class_init(void)
> +{
> + bdi_class = class_create(THIS_MODULE, "bdi");
> + return 0;
> +}
> +
> +__initcall(bdi_class_init);
> +
> +int bdi_register(struct backing_dev_info *bdi, const char *fmt, ...)

This function should accept a: "struct device *parent" and all callers
just pass NULL until the block layer conversion gets merged.

> +{
> + char *name;
> + va_list args;
> + int ret = -ENOMEM;
> + int i;
> +
> + va_start(args, fmt);
> + name = kvprintf(fmt, args);

kvasprintf(GFP_KERNEL, fmt, args);

> + va_end(args);
> +
> + if (!name)
> + return -ENOMEM;
> +
> + bdi->dev = device_create(bdi_class, NULL, MKDEV(0,0), name);

The parent should be passed here.

> + for (i = 0; i < ARRAY_SIZE(bdi_dev_attrs); i++) {
> + ret = device_create_file(bdi->dev, _dev_attrs[i]);
> + if (ret)
> + break;
> + }
> + if (ret) {
> + while (--i >= 0)
> + device_remove_file(bdi->dev, _dev_attrs[i]);
> + device_unregister(bdi->dev);
> + bdi->dev = NULL;
> + }

All this open-coded attribute stuff should go away and be replaced by:
  bdi_class->dev_attrs = bdi_dev_attrs;
Otherwise at event time the attributes are not created and stuff hooking
into the events will not be able to set values. Also, the core will do
proper add/remove and error handling then.

Thanks,
Kay

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: per BDI dirty limit (was Re: -mm merge plans for 2.6.24)

2007-10-27 Thread Peter Zijlstra


On Sat, 2007-10-27 at 09:02 -0700, Greg KH wrote:

> Ah, I see a few problems.  Here, try this version instead.  It's
> compile-tested only, and should be a lot simpler.
> 
> Note, we still are not setting the parent to the new bdi structure
> properly, so the devices will show up in /sys/devices/virtual/ instead
> of in their proper location.  To do this, we need the parent of the
> device, which I'm not so sure what it should be (block device?  block
> device controller?)

The problem is that not every bdi has a sysfs represented parent, hence
the class suggestion. For block devices it is indeed the block device
itself, but for example the NFS client's server descriptor does not have
a sysfs representation.

> Let me know if this works better, I'm off to a kids birthday party for
> the day, but will be around this evening...

Hehe, do enjoy! Thanks.


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: per BDI dirty limit (was Re: -mm merge plans for 2.6.24)

2007-10-27 Thread Greg KH

On Sat, Oct 27, 2007 at 10:39:59AM +0200, Peter Zijlstra wrote:
> On Fri, 2007-10-26 at 19:40 -0700, Greg KH wrote:
> > On Sat, Oct 27, 2007 at 03:18:08AM +0200, Peter Zijlstra wrote:
> > > 
> > > On Fri, 2007-10-26 at 22:04 +0200, Peter Zijlstra wrote:
> > > > This crashes and burns on bootup, but I'm too tired to figure out what I
> > > > did wrong... will give it another try tomorrow..
> > > 
> > > Ok, can't sleep.. took a look. I have several problems here.
> > > 
> > > The thing that makes it go *boom* is the __ATTR_NULL. Removing that
> > > makes it boot. Albeit it then warns me of multiple duplicate sysfs
> > > objects, all named "bdi".
> > > 
> > > For some obscure reason this device interface insists on using the
> > > bus_id as name (?!), and further reduces usability by limiting that to
> > > 20 odd characters.
> > > 
> > > This makes it quite useless. I tried fudging around that limit by using
> > > device_rename and kobject_rename, but to no avail.
> > > 
> > > Really, it should not be this hard to use, trying to expose a handfull
> > > of simple integers to userspace should not take 8h+ and still not work.
> > > 
> > > Peter, who thinks sysfs is contorted mess beyond his skill. I'll stick
> > > to VM and scheduler code, that actually makes sense.
> > 
> > Heh, that's funny :)
> > 
> > I'll look at this and see what I can come up with.  Would you just like
> > a whole new patch, or one against this one?
> 
> Sorry for the grumpy note, I get that way at 3.30 am. Maybe I ought not
> have mailed :-/
> 
> This is the code I had at that time.

Ah, I see a few problems.  Here, try this version instead.  It's
compile-tested only, and should be a lot simpler.

Note, we still are not setting the parent to the new bdi structure
properly, so the devices will show up in /sys/devices/virtual/ instead
of in their proper location.  To do this, we need the parent of the
device, which I'm not so sure what it should be (block device?  block
device controller?)

Let me know if this works better, I'm off to a kids birthday party for
the day, but will be around this evening...

thanks,

greg k-h

---
 block/genhd.c   |2 
 fs/fuse/inode.c |2 
 fs/nfs/client.c |2 
 include/linux/backing-dev.h |   19 +++
 include/linux/string.h  |4 +
 include/linux/writeback.h   |3 +
 mm/backing-dev.c|  110 
 mm/page-writeback.c |2 
 mm/util.c   |   42 
 9 files changed, 183 insertions(+), 3 deletions(-)

--- a/block/genhd.c
+++ b/block/genhd.c
@@ -182,6 +182,7 @@ void add_disk(struct gendisk *disk)
disk->minors, NULL, exact_match, exact_lock, disk);
register_disk(disk);
blk_register_queue(disk);
+   bdi_register(>queue->backing_dev_info, "bdi-%s", disk->disk_name);
 }
 
 EXPORT_SYMBOL(add_disk);
@@ -190,6 +191,7 @@ EXPORT_SYMBOL(del_gendisk); /* in partit
 void unlink_gendisk(struct gendisk *disk)
 {
blk_unregister_queue(disk);
+   bdi_unregister(>queue->backing_dev_info);
blk_unregister_region(MKDEV(disk->major, disk->first_minor),
  disk->minors);
 }
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -467,7 +467,7 @@ static struct fuse_conn *new_conn(void)
atomic_set(>num_waiting, 0);
fc->bdi.ra_pages = (VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE;
fc->bdi.unplug_io_fn = default_unplug_io_fn;
-   err = bdi_init(>bdi);
+   err = bdi_init_fmt(>bdi, "bdi-fuse-%llu", (unsigned long 
long)fc->id);
if (err) {
kfree(fc);
fc = NULL;
--- a/fs/nfs/client.c
+++ b/fs/nfs/client.c
@@ -678,7 +678,7 @@ static int nfs_probe_fsinfo(struct nfs_s
goto out_error;
 
nfs_server_set_fsinfo(server, );
-   error = bdi_init(>backing_dev_info);
+   error = bdi_init_fmt(>backing_dev_info, "bdi-nfs-%s-%p", 
clp->cl_hostname, server);
if (error)
goto out_error;
 
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -11,6 +11,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include 
 
 struct page;
@@ -48,11 +50,28 @@ struct backing_dev_info {
 
struct prop_local_percpu completions;
int dirty_exceeded;
+
+   struct device *dev;
 };
 
 int bdi_init(struct backing_dev_info *bdi);
 void bdi_destroy(struct backing_dev_info *bdi);
 
+int bdi_register(struct backing_dev_info *bdi, const char *fmt, ...);
+void bdi_unregister(struct backing_dev_info *bdi);
+
+#define bdi_init_fmt(bdi, fmt...)  \
+   ({  \
+   int ret;\
+   ret = bdi_init(bdi);\
+   if (!ret) {

Re: per BDI dirty limit (was Re: -mm merge plans for 2.6.24)

2007-10-27 Thread Peter Zijlstra

On Fri, 2007-10-26 at 19:40 -0700, Greg KH wrote:
> On Sat, Oct 27, 2007 at 03:18:08AM +0200, Peter Zijlstra wrote:
> > 
> > On Fri, 2007-10-26 at 22:04 +0200, Peter Zijlstra wrote:
> > > This crashes and burns on bootup, but I'm too tired to figure out what I
> > > did wrong... will give it another try tomorrow..
> > 
> > Ok, can't sleep.. took a look. I have several problems here.
> > 
> > The thing that makes it go *boom* is the __ATTR_NULL. Removing that
> > makes it boot. Albeit it then warns me of multiple duplicate sysfs
> > objects, all named "bdi".
> > 
> > For some obscure reason this device interface insists on using the
> > bus_id as name (?!), and further reduces usability by limiting that to
> > 20 odd characters.
> > 
> > This makes it quite useless. I tried fudging around that limit by using
> > device_rename and kobject_rename, but to no avail.
> > 
> > Really, it should not be this hard to use, trying to expose a handfull
> > of simple integers to userspace should not take 8h+ and still not work.
> > 
> > Peter, who thinks sysfs is contorted mess beyond his skill. I'll stick
> > to VM and scheduler code, that actually makes sense.
> 
> Heh, that's funny :)
> 
> I'll look at this and see what I can come up with.  Would you just like
> a whole new patch, or one against this one?

Sorry for the grumpy note, I get that way at 3.30 am. Maybe I ought not
have mailed :-/

This is the code I had at that time.

Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
---
 block/genhd.c   |2 
 fs/fuse/inode.c |2 
 fs/nfs/client.c |2 
 include/linux/backing-dev.h |   21 ++
 include/linux/string.h  |4 +
 include/linux/writeback.h   |3 
 mm/backing-dev.c|  144 
 mm/page-writeback.c |2 
 mm/util.c   |   42 
 9 files changed, 219 insertions(+), 3 deletions(-)

Index: linux-2.6-2/fs/fuse/inode.c
===
--- linux-2.6-2.orig/fs/fuse/inode.c
+++ linux-2.6-2/fs/fuse/inode.c
@@ -467,7 +467,7 @@ static struct fuse_conn *new_conn(void)
atomic_set(>num_waiting, 0);
fc->bdi.ra_pages = (VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE;
fc->bdi.unplug_io_fn = default_unplug_io_fn;
-   err = bdi_init(>bdi);
+   err = bdi_init_fmt(>bdi, "bdi-fuse-%llu", (unsigned long 
long)fc->id);
if (err) {
kfree(fc);
fc = NULL;
Index: linux-2.6-2/fs/nfs/client.c
===
--- linux-2.6-2.orig/fs/nfs/client.c
+++ linux-2.6-2/fs/nfs/client.c
@@ -678,7 +678,7 @@ static int nfs_probe_fsinfo(struct nfs_s
goto out_error;
 
nfs_server_set_fsinfo(server, );
-   error = bdi_init(>backing_dev_info);
+   error = bdi_init_fmt(>backing_dev_info, "bdi-nfs-%s-%p", 
clp->cl_hostname, server);
if (error)
goto out_error;
 
Index: linux-2.6-2/include/linux/backing-dev.h
===
--- linux-2.6-2.orig/include/linux/backing-dev.h
+++ linux-2.6-2/include/linux/backing-dev.h
@@ -11,6 +11,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include 
 
 struct page;
@@ -48,11 +50,30 @@ struct backing_dev_info {
 
struct prop_local_percpu completions;
int dirty_exceeded;
+
+#ifdef CONFIG_SYSFS
+   struct device kdev;
+#endif
 };
 
 int bdi_init(struct backing_dev_info *bdi);
 void bdi_destroy(struct backing_dev_info *bdi);
 
+int bdi_register(struct backing_dev_info *bdi, const char *fmt, ...);
+void bdi_unregister(struct backing_dev_info *bdi);
+
+#define bdi_init_fmt(bdi, fmt...)  \
+   ({  \
+   int ret;\
+   ret = bdi_init(bdi);\
+   if (!ret) { \
+   ret = bdi_register(bdi, ##fmt); \
+   if (ret)\
+   bdi_destroy(bdi);   \
+   }   \
+   ret;\
+})
+
 static inline void __add_bdi_stat(struct backing_dev_info *bdi,
enum bdi_stat_item item, s64 amount)
 {
Index: linux-2.6-2/include/linux/writeback.h
===
--- linux-2.6-2.orig/include/linux/writeback.h
+++ linux-2.6-2/include/linux/writeback.h
@@ -113,6 +113,9 @@ struct file;
 int dirty_writeback_centisecs_handler(struct ctl_table *, int, struct file *,
  void __user *, size_t *, loff_t *);
 
+void

Re: per BDI dirty limit (was Re: -mm merge plans for 2.6.24)

2007-10-27 Thread Peter Zijlstra

On Fri, 2007-10-26 at 19:40 -0700, Greg KH wrote:
 On Sat, Oct 27, 2007 at 03:18:08AM +0200, Peter Zijlstra wrote:
  
  On Fri, 2007-10-26 at 22:04 +0200, Peter Zijlstra wrote:
   This crashes and burns on bootup, but I'm too tired to figure out what I
   did wrong... will give it another try tomorrow..
  
  Ok, can't sleep.. took a look. I have several problems here.
  
  The thing that makes it go *boom* is the __ATTR_NULL. Removing that
  makes it boot. Albeit it then warns me of multiple duplicate sysfs
  objects, all named bdi.
  
  For some obscure reason this device interface insists on using the
  bus_id as name (?!), and further reduces usability by limiting that to
  20 odd characters.
  
  This makes it quite useless. I tried fudging around that limit by using
  device_rename and kobject_rename, but to no avail.
  
  Really, it should not be this hard to use, trying to expose a handfull
  of simple integers to userspace should not take 8h+ and still not work.
  
  Peter, who thinks sysfs is contorted mess beyond his skill. I'll stick
  to VM and scheduler code, that actually makes sense.
 
 Heh, that's funny :)
 
 I'll look at this and see what I can come up with.  Would you just like
 a whole new patch, or one against this one?

Sorry for the grumpy note, I get that way at 3.30 am. Maybe I ought not
have mailed :-/

This is the code I had at that time.

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 block/genhd.c   |2 
 fs/fuse/inode.c |2 
 fs/nfs/client.c |2 
 include/linux/backing-dev.h |   21 ++
 include/linux/string.h  |4 +
 include/linux/writeback.h   |3 
 mm/backing-dev.c|  144 
 mm/page-writeback.c |2 
 mm/util.c   |   42 
 9 files changed, 219 insertions(+), 3 deletions(-)

Index: linux-2.6-2/fs/fuse/inode.c
===
--- linux-2.6-2.orig/fs/fuse/inode.c
+++ linux-2.6-2/fs/fuse/inode.c
@@ -467,7 +467,7 @@ static struct fuse_conn *new_conn(void)
atomic_set(fc-num_waiting, 0);
fc-bdi.ra_pages = (VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE;
fc-bdi.unplug_io_fn = default_unplug_io_fn;
-   err = bdi_init(fc-bdi);
+   err = bdi_init_fmt(fc-bdi, bdi-fuse-%llu, (unsigned long 
long)fc-id);
if (err) {
kfree(fc);
fc = NULL;
Index: linux-2.6-2/fs/nfs/client.c
===
--- linux-2.6-2.orig/fs/nfs/client.c
+++ linux-2.6-2/fs/nfs/client.c
@@ -678,7 +678,7 @@ static int nfs_probe_fsinfo(struct nfs_s
goto out_error;
 
nfs_server_set_fsinfo(server, fsinfo);
-   error = bdi_init(server-backing_dev_info);
+   error = bdi_init_fmt(server-backing_dev_info, bdi-nfs-%s-%p, 
clp-cl_hostname, server);
if (error)
goto out_error;
 
Index: linux-2.6-2/include/linux/backing-dev.h
===
--- linux-2.6-2.orig/include/linux/backing-dev.h
+++ linux-2.6-2/include/linux/backing-dev.h
@@ -11,6 +11,8 @@
 #include linux/percpu_counter.h
 #include linux/log2.h
 #include linux/proportions.h
+#include linux/kernel.h
+#include linux/device.h
 #include asm/atomic.h
 
 struct page;
@@ -48,11 +50,30 @@ struct backing_dev_info {
 
struct prop_local_percpu completions;
int dirty_exceeded;
+
+#ifdef CONFIG_SYSFS
+   struct device kdev;
+#endif
 };
 
 int bdi_init(struct backing_dev_info *bdi);
 void bdi_destroy(struct backing_dev_info *bdi);
 
+int bdi_register(struct backing_dev_info *bdi, const char *fmt, ...);
+void bdi_unregister(struct backing_dev_info *bdi);
+
+#define bdi_init_fmt(bdi, fmt...)  \
+   ({  \
+   int ret;\
+   ret = bdi_init(bdi);\
+   if (!ret) { \
+   ret = bdi_register(bdi, ##fmt); \
+   if (ret)\
+   bdi_destroy(bdi);   \
+   }   \
+   ret;\
+})
+
 static inline void __add_bdi_stat(struct backing_dev_info *bdi,
enum bdi_stat_item item, s64 amount)
 {
Index: linux-2.6-2/include/linux/writeback.h
===
--- linux-2.6-2.orig/include/linux/writeback.h
+++ linux-2.6-2/include/linux/writeback.h
@@ -113,6 +113,9 @@ struct file;
 int dirty_writeback_centisecs_handler(struct ctl_table *, int, struct file *,

Re: per BDI dirty limit (was Re: -mm merge plans for 2.6.24)

2007-10-27 Thread Greg KH

On Sat, Oct 27, 2007 at 10:39:59AM +0200, Peter Zijlstra wrote:
 On Fri, 2007-10-26 at 19:40 -0700, Greg KH wrote:
  On Sat, Oct 27, 2007 at 03:18:08AM +0200, Peter Zijlstra wrote:
   
   On Fri, 2007-10-26 at 22:04 +0200, Peter Zijlstra wrote:
This crashes and burns on bootup, but I'm too tired to figure out what I
did wrong... will give it another try tomorrow..
   
   Ok, can't sleep.. took a look. I have several problems here.
   
   The thing that makes it go *boom* is the __ATTR_NULL. Removing that
   makes it boot. Albeit it then warns me of multiple duplicate sysfs
   objects, all named bdi.
   
   For some obscure reason this device interface insists on using the
   bus_id as name (?!), and further reduces usability by limiting that to
   20 odd characters.
   
   This makes it quite useless. I tried fudging around that limit by using
   device_rename and kobject_rename, but to no avail.
   
   Really, it should not be this hard to use, trying to expose a handfull
   of simple integers to userspace should not take 8h+ and still not work.
   
   Peter, who thinks sysfs is contorted mess beyond his skill. I'll stick
   to VM and scheduler code, that actually makes sense.
  
  Heh, that's funny :)
  
  I'll look at this and see what I can come up with.  Would you just like
  a whole new patch, or one against this one?
 
 Sorry for the grumpy note, I get that way at 3.30 am. Maybe I ought not
 have mailed :-/
 
 This is the code I had at that time.

Ah, I see a few problems.  Here, try this version instead.  It's
compile-tested only, and should be a lot simpler.

Note, we still are not setting the parent to the new bdi structure
properly, so the devices will show up in /sys/devices/virtual/ instead
of in their proper location.  To do this, we need the parent of the
device, which I'm not so sure what it should be (block device?  block
device controller?)

Let me know if this works better, I'm off to a kids birthday party for
the day, but will be around this evening...

thanks,

greg k-h

---
 block/genhd.c   |2 
 fs/fuse/inode.c |2 
 fs/nfs/client.c |2 
 include/linux/backing-dev.h |   19 +++
 include/linux/string.h  |4 +
 include/linux/writeback.h   |3 +
 mm/backing-dev.c|  110 
 mm/page-writeback.c |2 
 mm/util.c   |   42 
 9 files changed, 183 insertions(+), 3 deletions(-)

--- a/block/genhd.c
+++ b/block/genhd.c
@@ -182,6 +182,7 @@ void add_disk(struct gendisk *disk)
disk-minors, NULL, exact_match, exact_lock, disk);
register_disk(disk);
blk_register_queue(disk);
+   bdi_register(disk-queue-backing_dev_info, bdi-%s, disk-disk_name);
 }
 
 EXPORT_SYMBOL(add_disk);
@@ -190,6 +191,7 @@ EXPORT_SYMBOL(del_gendisk); /* in partit
 void unlink_gendisk(struct gendisk *disk)
 {
blk_unregister_queue(disk);
+   bdi_unregister(disk-queue-backing_dev_info);
blk_unregister_region(MKDEV(disk-major, disk-first_minor),
  disk-minors);
 }
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -467,7 +467,7 @@ static struct fuse_conn *new_conn(void)
atomic_set(fc-num_waiting, 0);
fc-bdi.ra_pages = (VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE;
fc-bdi.unplug_io_fn = default_unplug_io_fn;
-   err = bdi_init(fc-bdi);
+   err = bdi_init_fmt(fc-bdi, bdi-fuse-%llu, (unsigned long 
long)fc-id);
if (err) {
kfree(fc);
fc = NULL;
--- a/fs/nfs/client.c
+++ b/fs/nfs/client.c
@@ -678,7 +678,7 @@ static int nfs_probe_fsinfo(struct nfs_s
goto out_error;
 
nfs_server_set_fsinfo(server, fsinfo);
-   error = bdi_init(server-backing_dev_info);
+   error = bdi_init_fmt(server-backing_dev_info, bdi-nfs-%s-%p, 
clp-cl_hostname, server);
if (error)
goto out_error;
 
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -11,6 +11,8 @@
 #include linux/percpu_counter.h
 #include linux/log2.h
 #include linux/proportions.h
+#include linux/kernel.h
+#include linux/device.h
 #include asm/atomic.h
 
 struct page;
@@ -48,11 +50,28 @@ struct backing_dev_info {
 
struct prop_local_percpu completions;
int dirty_exceeded;
+
+   struct device *dev;
 };
 
 int bdi_init(struct backing_dev_info *bdi);
 void bdi_destroy(struct backing_dev_info *bdi);
 
+int bdi_register(struct backing_dev_info *bdi, const char *fmt, ...);
+void bdi_unregister(struct backing_dev_info *bdi);
+
+#define bdi_init_fmt(bdi, fmt...)  \
+   ({  \
+   int ret;\
+   ret = bdi_init(bdi);\
+   if (!ret) {

Re: per BDI dirty limit (was Re: -mm merge plans for 2.6.24)

2007-10-27 Thread Peter Zijlstra


On Sat, 2007-10-27 at 09:02 -0700, Greg KH wrote:

 Ah, I see a few problems.  Here, try this version instead.  It's
 compile-tested only, and should be a lot simpler.
 
 Note, we still are not setting the parent to the new bdi structure
 properly, so the devices will show up in /sys/devices/virtual/ instead
 of in their proper location.  To do this, we need the parent of the
 device, which I'm not so sure what it should be (block device?  block
 device controller?)

The problem is that not every bdi has a sysfs represented parent, hence
the class suggestion. For block devices it is indeed the block device
itself, but for example the NFS client's server descriptor does not have
a sysfs representation.

 Let me know if this works better, I'm off to a kids birthday party for
 the day, but will be around this evening...

Hehe, do enjoy! Thanks.


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: per BDI dirty limit (was Re: -mm merge plans for 2.6.24)

2007-10-27 Thread Kay Sievers

On Sat, 2007-10-27 at 09:02 -0700, Greg KH wrote:
 On Sat, Oct 27, 2007 at 10:39:59AM +0200, Peter Zijlstra wrote:
  On Fri, 2007-10-26 at 19:40 -0700, Greg KH wrote:
   On Sat, Oct 27, 2007 at 03:18:08AM +0200, Peter Zijlstra wrote:
On Fri, 2007-10-26 at 22:04 +0200, Peter Zijlstra wrote:
 This crashes and burns on bootup, but I'm too tired to figure out 
 what I
 did wrong... will give it another try tomorrow..

Ok, can't sleep.. took a look. I have several problems here.

The thing that makes it go *boom* is the __ATTR_NULL. Removing that
makes it boot. Albeit it then warns me of multiple duplicate sysfs
objects, all named bdi.

   I'll look at this and see what I can come up with.  Would you just like
   a whole new patch, or one against this one?
  
  Sorry for the grumpy note, I get that way at 3.30 am. Maybe I ought not
  have mailed :-/
  
  This is the code I had at that time.
 
 Ah, I see a few problems.  Here, try this version instead.  It's
 compile-tested only, and should be a lot simpler.
 
 Note, we still are not setting the parent to the new bdi structure
 properly, so the devices will show up in /sys/devices/virtual/ instead
 of in their proper location.  To do this, we need the parent of the
 device, which I'm not so sure what it should be (block device?  block
 device controller?)

Assigning a parent device will only work with the upcoming conversion of
the raw kobjects in the block subsystem to struct device.

A few comments to the patch:

 --- a/include/linux/string.h
 +++ b/include/linux/string.h
 @@ -8,6 +8,7 @@
  #include linux/compiler.h  /* for inline */
  #include linux/types.h /* for size_t */
  #include linux/stddef.h/* for NULL */
 +#include stdarg.h
  
  #ifdef __cplusplus
  extern C {
 @@ -111,6 +112,9 @@ extern void *kmemdup(const void *src, si
  extern char **argv_split(gfp_t gfp, const char *str, int *argcp);
  extern void argv_free(char **argv);
  
 +char *kvprintf(const char *fmt, va_list args);
 +char *kprintf(const char *fmt, ...);

Why is that here? I don't think we need this when we use the existing:
  kvasprintf(GFP_KERNEL, fmt, args)

 --- a/mm/backing-dev.c
 +++ b/mm/backing-dev.c

 +
 +static struct device_attribute bdi_dev_attrs[] = {
 + __ATTR(readahead, 0644, readahead_show, readahead_store),
 + __ATTR_RO(reclaimable),
 + __ATTR_RO(writeback),
 + __ATTR_RO(dirty),
 + __ATTR_RO(bdi_dirty),
 +};

Default attributes will need the NULL termination back (see below).

 +static __init int bdi_class_init(void)
 +{
 + bdi_class = class_create(THIS_MODULE, bdi);
 + return 0;
 +}
 +
 +__initcall(bdi_class_init);
 +
 +int bdi_register(struct backing_dev_info *bdi, const char *fmt, ...)

This function should accept a: struct device *parent and all callers
just pass NULL until the block layer conversion gets merged.

 +{
 + char *name;
 + va_list args;
 + int ret = -ENOMEM;
 + int i;
 +
 + va_start(args, fmt);
 + name = kvprintf(fmt, args);

kvasprintf(GFP_KERNEL, fmt, args);

 + va_end(args);
 +
 + if (!name)
 + return -ENOMEM;
 +
 + bdi-dev = device_create(bdi_class, NULL, MKDEV(0,0), name);

The parent should be passed here.

 + for (i = 0; i  ARRAY_SIZE(bdi_dev_attrs); i++) {
 + ret = device_create_file(bdi-dev, bdi_dev_attrs[i]);
 + if (ret)
 + break;
 + }
 + if (ret) {
 + while (--i = 0)
 + device_remove_file(bdi-dev, bdi_dev_attrs[i]);
 + device_unregister(bdi-dev);
 + bdi-dev = NULL;
 + }

All this open-coded attribute stuff should go away and be replaced by:
  bdi_class-dev_attrs = bdi_dev_attrs;
Otherwise at event time the attributes are not created and stuff hooking
into the events will not be able to set values. Also, the core will do
proper add/remove and error handling then.

Thanks,
Kay

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: per BDI dirty limit (was Re: -mm merge plans for 2.6.24)

2007-10-27 Thread Peter Zijlstra

On Sat, 2007-10-27 at 23:08 +0200, Kay Sievers wrote:
 On Sat, 2007-10-27 at 09:02 -0700, Greg KH wrote:

  Ah, I see a few problems.  Here, try this version instead.  It's
  compile-tested only, and should be a lot simpler.
  
  Note, we still are not setting the parent to the new bdi structure
  properly, so the devices will show up in /sys/devices/virtual/ instead
  of in their proper location.  To do this, we need the parent of the
  device, which I'm not so sure what it should be (block device?  block
  device controller?)
 
 Assigning a parent device will only work with the upcoming conversion of
 the raw kobjects in the block subsystem to struct device.
 
 A few comments to the patch:
 
  --- a/include/linux/string.h
  +++ b/include/linux/string.h
  @@ -8,6 +8,7 @@
   #include linux/compiler.h/* for inline */
   #include linux/types.h   /* for size_t */
   #include linux/stddef.h  /* for NULL */
  +#include stdarg.h
   
   #ifdef __cplusplus
   extern C {
  @@ -111,6 +112,9 @@ extern void *kmemdup(const void *src, si
   extern char **argv_split(gfp_t gfp, const char *str, int *argcp);
   extern void argv_free(char **argv);
   
  +char *kvprintf(const char *fmt, va_list args);
  +char *kprintf(const char *fmt, ...);
 
 Why is that here? I don't think we need this when we use the existing:
   kvasprintf(GFP_KERNEL, fmt, args)

Ignorance of the existance of said function. Thanks for pointing it out.
(kobject_set_name ought to use it too I guess)

  --- a/mm/backing-dev.c
  +++ b/mm/backing-dev.c
 
  +
  +static struct device_attribute bdi_dev_attrs[] = {
  +   __ATTR(readahead, 0644, readahead_show, readahead_store),
  +   __ATTR_RO(reclaimable),
  +   __ATTR_RO(writeback),
  +   __ATTR_RO(dirty),
  +   __ATTR_RO(bdi_dirty),
  +};
 
 Default attributes will need the NULL termination back (see below).
 
  +static __init int bdi_class_init(void)
  +{
  +   bdi_class = class_create(THIS_MODULE, bdi);
  +   return 0;
  +}
  +
  +__initcall(bdi_class_init);
  +
  +int bdi_register(struct backing_dev_info *bdi, const char *fmt, ...)
 
 This function should accept a: struct device *parent and all callers
 just pass NULL until the block layer conversion gets merged.

Yeah, you're right, but I wanted to just get something working before
bothering with the parent thing.

  +{
  +   char *name;
  +   va_list args;
  +   int ret = -ENOMEM;
  +   int i;
  +
  +   va_start(args, fmt);
  +   name = kvprintf(fmt, args);
 
 kvasprintf(GFP_KERNEL, fmt, args);
 
  +   va_end(args);
  +
  +   if (!name)
  +   return -ENOMEM;
  +
  +   bdi-dev = device_create(bdi_class, NULL, MKDEV(0,0), name);
 
 The parent should be passed here.
 
  +   for (i = 0; i  ARRAY_SIZE(bdi_dev_attrs); i++) {
  +   ret = device_create_file(bdi-dev, bdi_dev_attrs[i]);
  +   if (ret)
  +   break;
  +   }
  +   if (ret) {
  +   while (--i = 0)
  +   device_remove_file(bdi-dev, bdi_dev_attrs[i]);
  +   device_unregister(bdi-dev);
  +   bdi-dev = NULL;
  +   }
 
 All this open-coded attribute stuff should go away and be replaced by:
   bdi_class-dev_attrs = bdi_dev_attrs;
 Otherwise at event time the attributes are not created and stuff hooking
 into the events will not be able to set values. Also, the core will do
 proper add/remove and error handling then.

ok, that's good to know. someone ought to write a book on how to use all
this... really, even the functions are bare of documentation or
comments.

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: per BDI dirty limit (was Re: -mm merge plans for 2.6.24)

2007-10-26 Thread Greg KH

On Sat, Oct 27, 2007 at 03:18:08AM +0200, Peter Zijlstra wrote:
> 
> On Fri, 2007-10-26 at 22:04 +0200, Peter Zijlstra wrote:
> > This crashes and burns on bootup, but I'm too tired to figure out what I
> > did wrong... will give it another try tomorrow..
> 
> Ok, can't sleep.. took a look. I have several problems here.
> 
> The thing that makes it go *boom* is the __ATTR_NULL. Removing that
> makes it boot. Albeit it then warns me of multiple duplicate sysfs
> objects, all named "bdi".
> 
> For some obscure reason this device interface insists on using the
> bus_id as name (?!), and further reduces usability by limiting that to
> 20 odd characters.
> 
> This makes it quite useless. I tried fudging around that limit by using
> device_rename and kobject_rename, but to no avail.
> 
> Really, it should not be this hard to use, trying to expose a handfull
> of simple integers to userspace should not take 8h+ and still not work.
> 
> Peter, who thinks sysfs is contorted mess beyond his skill. I'll stick
> to VM and scheduler code, that actually makes sense.

Heh, that's funny :)

I'll look at this and see what I can come up with.  Would you just like
a whole new patch, or one against this one?

thanks,

greg k-h
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: per BDI dirty limit (was Re: -mm merge plans for 2.6.24)

2007-10-26 Thread Peter Zijlstra

On Fri, 2007-10-26 at 22:04 +0200, Peter Zijlstra wrote:
> This crashes and burns on bootup, but I'm too tired to figure out what I
> did wrong... will give it another try tomorrow..

Ok, can't sleep.. took a look. I have several problems here.

The thing that makes it go *boom* is the __ATTR_NULL. Removing that
makes it boot. Albeit it then warns me of multiple duplicate sysfs
objects, all named "bdi".

For some obscure reason this device interface insists on using the
bus_id as name (?!), and further reduces usability by limiting that to
20 odd characters.

This makes it quite useless. I tried fudging around that limit by using
device_rename and kobject_rename, but to no avail.

Really, it should not be this hard to use, trying to expose a handfull
of simple integers to userspace should not take 8h+ and still not work.

Peter, who thinks sysfs is contorted mess beyond his skill. I'll stick
to VM and scheduler code, that actually makes sense.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: per BDI dirty limit (was Re: -mm merge plans for 2.6.24)

2007-10-26 Thread Peter Zijlstra

This crashes and burns on bootup, but I'm too tired to figure out what I
did wrong... will give it another try tomorrow..


Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
---
 block/genhd.c   |2 
 fs/fuse/inode.c |2 
 fs/nfs/client.c |2 
 include/linux/backing-dev.h |   33 
 include/linux/writeback.h   |3 +
 mm/backing-dev.c|  121 
 mm/page-writeback.c |2 
 7 files changed, 162 insertions(+), 3 deletions(-)

Index: linux-2.6-2/fs/fuse/inode.c
===
--- linux-2.6-2.orig/fs/fuse/inode.c
+++ linux-2.6-2/fs/fuse/inode.c
@@ -467,7 +467,7 @@ static struct fuse_conn *new_conn(void)
atomic_set(>num_waiting, 0);
fc->bdi.ra_pages = (VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE;
fc->bdi.unplug_io_fn = default_unplug_io_fn;
-   err = bdi_init(>bdi);
+   err = bdi_init_fmt(>bdi, "fuse-%llu", (unsigned long 
long)fc->id);
if (err) {
kfree(fc);
fc = NULL;
Index: linux-2.6-2/fs/nfs/client.c
===
--- linux-2.6-2.orig/fs/nfs/client.c
+++ linux-2.6-2/fs/nfs/client.c
@@ -678,7 +678,7 @@ static int nfs_probe_fsinfo(struct nfs_s
goto out_error;
 
nfs_server_set_fsinfo(server, );
-   error = bdi_init(>backing_dev_info);
+   error = bdi_init_fmt(>backing_dev_info, "nfs-%s-%p", 
clp->cl_hostname, server);
if (error)
goto out_error;
 
Index: linux-2.6-2/include/linux/backing-dev.h
===
--- linux-2.6-2.orig/include/linux/backing-dev.h
+++ linux-2.6-2/include/linux/backing-dev.h
@@ -11,6 +11,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include 
 
 struct page;
@@ -48,11 +50,42 @@ struct backing_dev_info {
 
struct prop_local_percpu completions;
int dirty_exceeded;
+
+#ifdef CONFIG_SYSFS
+   struct device kdev;
+#endif
 };
 
 int bdi_init(struct backing_dev_info *bdi);
 void bdi_destroy(struct backing_dev_info *bdi);
 
+int __bdi_register(struct backing_dev_info *bdi);
+void bdi_unregister(struct backing_dev_info *bdi);
+
+#ifdef CONFIG_SYSFS
+#define bdi_init_fmt(bdi, fmt...)  \
+   ({  \
+   int ret;\
+   kobject_set_name(&(bdi)->kdev.kobj, ##fmt); \
+   ret = bdi_init(bdi);\
+   if (!ret) { \
+   ret = __bdi_register(bdi);  \
+   if (ret)\
+   bdi_destroy(bdi);   \
+   }   \
+   ret;\
+   })
+
+#define bdi_register(bdi, fmt...)  \
+   ({  \
+   kobject_set_name(&(bdi)->kdev.kobj, ##fmt); \
+   __bdi_register(bdi);\
+   })
+#else
+#define bdi_init_fmt(bdi, fmt...)  bdi_init(bdi)
+#define bdi_register(bdi, fmt...)  __bdi_register(bdi)
+#endif
+
 static inline void __add_bdi_stat(struct backing_dev_info *bdi,
enum bdi_stat_item item, s64 amount)
 {
Index: linux-2.6-2/include/linux/writeback.h
===
--- linux-2.6-2.orig/include/linux/writeback.h
+++ linux-2.6-2/include/linux/writeback.h
@@ -113,6 +113,9 @@ struct file;
 int dirty_writeback_centisecs_handler(struct ctl_table *, int, struct file *,
  void __user *, size_t *, loff_t *);
 
+void get_dirty_limits(long *pbackground, long *pdirty, long *pbdi_dirty,
+struct backing_dev_info *bdi);
+
 void page_writeback_init(void);
 void balance_dirty_pages_ratelimited_nr(struct address_space *mapping,
unsigned long nr_pages_dirtied);
Index: linux-2.6-2/mm/backing-dev.c
===
--- linux-2.6-2.orig/mm/backing-dev.c
+++ linux-2.6-2/mm/backing-dev.c
@@ -4,12 +4,130 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+
+#ifdef CONFIG_SYSFS
+
+static void bdi_release(struct device *dev)
+{
+}
+
+static int bdi_uevent(struct device *dev, struct kobj_uevent_env *env)
+{
+   return 0;
+}
+
+static struct class bdi_class = {
+   .name = "bdi",
+   .dev_release = bdi_release,
+   .dev_uevent = bdi_uevent,
+};
+
+static ssize_t readahead_store(struct device *dev,
+

Re: per BDI dirty limit (was Re: -mm merge plans for 2.6.24)

2007-10-26 Thread Trond Myklebust


On Fri, 2007-10-26 at 16:48 +0200, Peter Zijlstra wrote:

> Miklos, Trond: could you suggest a better fmt for the bdi_init_fmt() for your
> respective filesystems?

> Index: linux-2.6-2/fs/nfs/client.c
> ===
> --- linux-2.6-2.orig/fs/nfs/client.c
> +++ linux-2.6-2/fs/nfs/client.c
> @@ -678,7 +678,7 @@ static int nfs_probe_fsinfo(struct nfs_s
>   goto out_error;
>  
>   nfs_server_set_fsinfo(server, );
> - error = bdi_init(>backing_dev_info);
> + error = bdi_init_fmt(>backing_dev_info, "nfs-%s-%p", 
> clp->cl_hostname, server);

In our debugging printks, we usually use the superblock->s_id, but that
only gets initialised later.

I'd suggest passing the 'dev_name' from *_get_sb() into
*_create_server().

Cheers
  Trond

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: per BDI dirty limit (was Re: -mm merge plans for 2.6.24)

2007-10-26 Thread Kay Sievers


On Fri, 2007-10-26 at 17:33 +0200, Peter Zijlstra wrote:
> On Fri, 2007-10-26 at 17:33 +0200, Kay Sievers wrote:
> > On Fri, 2007-10-26 at 17:22 +0200, Peter Zijlstra wrote:
> > > On Fri, 2007-10-26 at 17:10 +0200, Kay Sievers wrote:
> > > > On Fri, 2007-10-26 at 16:48 +0200, Peter Zijlstra wrote:
> > > > 
> > > > > I appreciate the sysfs people their opinion that /sys/bdi/ might not 
> > > > > be the
> > > > > best from their POV, however I'm not seeing where to hook the BDI 
> > > > > object from
> > > > > so that it all makes sense, a few of the things are currently not 
> > > > > exposed in
> > > > > sysfs at all, like the NFS and FUSE things.
> > > > 
> > > > What happended to the idea to create a "bdi" class, and have the
> > > > existing devices as parents, and for stuff that is not (not now, or
> > > > never) in sysfs, no parent is set.
> > > 
> > > Must have forgotten about that, mainly because I'm not sure I fully
> > > understand it.
> > > 
> > > So we create a class,
> > 
> > Yes.
> > 
> > > create these objects,
> > 
> > Yes, "struct device" objects, assigned to the "bdi" class. (Don't use
> > class_device, that will be removed soon.)
> > 
> > > which are all called bdi
> > 
> > Probably not. You can name it how you want, you can inherit the name of
> > the parent, or prefix it with whatever fits, they just need to be
> > unique. Things like the "fuse-%llu" name would work just fine. I guess
> > you already solved that problem in the debugfs directory.
> > 
> > > and have children with these attributes in it.
> > 
> > The attributes would just be files in the device object.
> > 
> > > Now, I supposed there is a directory that lists all unparented thingies,
> > > how do I locate the one that matches my nfs mount?
> > 
> > You look for the name (prefix), try: "ls /sys/class/sound/", it's the
> > same model all over the place.
> 
> Ok, will try that. Is there a 'simple uncluttered' example I could look
> at to copy from?

drivers/firmware/dmi-id.c

It has only a single device created in the init routine, but it shows
what to do with the class.

Until the block subsystem is converted from using raw kobjects to
devices, you need to set the parent kobject of the bdi device to the
blockdev:
  bdidev->dev.kobj.parent = >kobj

Kay

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: per BDI dirty limit (was Re: -mm merge plans for 2.6.24)

2007-10-26 Thread Peter Zijlstra

On Fri, 2007-10-26 at 17:33 +0200, Kay Sievers wrote:
> On Fri, 2007-10-26 at 17:22 +0200, Peter Zijlstra wrote:
> > On Fri, 2007-10-26 at 17:10 +0200, Kay Sievers wrote:
> > > On Fri, 2007-10-26 at 16:48 +0200, Peter Zijlstra wrote:
> > > 
> > > > I appreciate the sysfs people their opinion that /sys/bdi/ might not be 
> > > > the
> > > > best from their POV, however I'm not seeing where to hook the BDI 
> > > > object from
> > > > so that it all makes sense, a few of the things are currently not 
> > > > exposed in
> > > > sysfs at all, like the NFS and FUSE things.
> > > 
> > > What happended to the idea to create a "bdi" class, and have the
> > > existing devices as parents, and for stuff that is not (not now, or
> > > never) in sysfs, no parent is set.
> > 
> > Must have forgotten about that, mainly because I'm not sure I fully
> > understand it.
> > 
> > So we create a class,
> 
> Yes.
> 
> > create these objects,
> 
> Yes, "struct device" objects, assigned to the "bdi" class. (Don't use
> class_device, that will be removed soon.)
> 
> > which are all called bdi
> 
> Probably not. You can name it how you want, you can inherit the name of
> the parent, or prefix it with whatever fits, they just need to be
> unique. Things like the "fuse-%llu" name would work just fine. I guess
> you already solved that problem in the debugfs directory.
> 
> > and have children with these attributes in it.
> 
> The attributes would just be files in the device object.
> 
> > Now, I supposed there is a directory that lists all unparented thingies,
> > how do I locate the one that matches my nfs mount?
> 
> You look for the name (prefix), try: "ls /sys/class/sound/", it's the
> same model all over the place.

Ok, will try that. Is there a 'simple uncluttered' example I could look
at to copy from?


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: per BDI dirty limit (was Re: -mm merge plans for 2.6.24)

2007-10-26 Thread Kay Sievers

On Fri, 2007-10-26 at 17:22 +0200, Peter Zijlstra wrote:
> On Fri, 2007-10-26 at 17:10 +0200, Kay Sievers wrote:
> > On Fri, 2007-10-26 at 16:48 +0200, Peter Zijlstra wrote:
> > 
> > > I appreciate the sysfs people their opinion that /sys/bdi/ might not be 
> > > the
> > > best from their POV, however I'm not seeing where to hook the BDI object 
> > > from
> > > so that it all makes sense, a few of the things are currently not exposed 
> > > in
> > > sysfs at all, like the NFS and FUSE things.
> > 
> > What happended to the idea to create a "bdi" class, and have the
> > existing devices as parents, and for stuff that is not (not now, or
> > never) in sysfs, no parent is set.
> 
> Must have forgotten about that, mainly because I'm not sure I fully
> understand it.
> 
> So we create a class,

Yes.

> create these objects,

Yes, "struct device" objects, assigned to the "bdi" class. (Don't use
class_device, that will be removed soon.)

> which are all called bdi

Probably not. You can name it how you want, you can inherit the name of
the parent, or prefix it with whatever fits, they just need to be
unique. Things like the "fuse-%llu" name would work just fine. I guess
you already solved that problem in the debugfs directory.

> and have children with these attributes in it.

The attributes would just be files in the device object.

> Now, I supposed there is a directory that lists all unparented thingies,
> how do I locate the one that matches my nfs mount?

You look for the name (prefix), try: "ls /sys/class/sound/", it's the
same model all over the place.

Kay

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: per BDI dirty limit (was Re: -mm merge plans for 2.6.24)

2007-10-26 Thread Peter Zijlstra

On Fri, 2007-10-26 at 17:10 +0200, Kay Sievers wrote:
> On Fri, 2007-10-26 at 16:48 +0200, Peter Zijlstra wrote:
> 
> > I appreciate the sysfs people their opinion that /sys/bdi/ might not be the
> > best from their POV, however I'm not seeing where to hook the BDI object 
> > from
> > so that it all makes sense, a few of the things are currently not exposed in
> > sysfs at all, like the NFS and FUSE things.
> 
> What happended to the idea to create a "bdi" class, and have the
> existing devices as parents, and for stuff that is not (not now, or
> never) in sysfs, no parent is set.

Must have forgotten about that, mainly because I'm not sure I fully
understand it.

So we create a class, create these objects, which are all called bdi and
have children with these attributes in it.

Now, I supposed there is a directory that lists all unparented thingies,
how do I locate the one that matches my nfs mount?



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: per BDI dirty limit (was Re: -mm merge plans for 2.6.24)

2007-10-26 Thread Kay Sievers

On Fri, 2007-10-26 at 16:48 +0200, Peter Zijlstra wrote:

> I appreciate the sysfs people their opinion that /sys/bdi/ might not be the
> best from their POV, however I'm not seeing where to hook the BDI object from
> so that it all makes sense, a few of the things are currently not exposed in
> sysfs at all, like the NFS and FUSE things.

What happended to the idea to create a "bdi" class, and have the
existing devices as parents, and for stuff that is not (not now, or
never) in sysfs, no parent is set.

Thanks,
Kay

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: per BDI dirty limit (was Re: -mm merge plans for 2.6.24)

2007-10-26 Thread Miklos Szeredi

> Subject: bdi: debugfs interface
> 
> Expose the BDI stats (and readahead window) in /debug/bdi/
> 
> I'm still thinking it should go into /sys somewhere, however I just noticed
> not all block devices that have a queue have a /queue directory. Noticeably
> those that use make_request_fn() as opposed to request_fn(). And then of
> course there are the non-block/non-queue BDIs.
> 
> A BDI is basically the object that represents the 'thing' you dirty pages
> against. For block devices that is related to the block device (and is
> typically embedded in the queue object), for NFS mounts its the remote server
> object of the client. For FUSE, yet again something else.
> 
> I appreciate the sysfs people their opinion that /sys/bdi/ might not be the
> best from their POV, however I'm not seeing where to hook the BDI object from
> so that it all makes sense, a few of the things are currently not exposed in
> sysfs at all, like the NFS and FUSE things.
> 
> So, for now, I've exposed the thing in debugfs. Please suggest a better
> alternative.
> 
> Miklos, Trond: could you suggest a better fmt for the bdi_init_fmt() for your
> respective filesystems?

For fuse:

err = bdi_init_fmt(>bdi, "fuse-%llu", (unsigned long long) fc->id);

This would match the connection ID in /sys/fs/fuse/connections/*

Miklos
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: per BDI dirty limit (was Re: -mm merge plans for 2.6.24)

2007-10-26 Thread Peter Zijlstra

On Wed, 2007-10-03 at 15:35 +0200, Kay Sievers wrote:
> On Wed, 2007-10-03 at 12:37 +0200, Peter Zijlstra wrote:
> > On Wed, 2007-10-03 at 12:15 +0200, Kay Sievers wrote:
> > > On Tue, 2007-10-02 at 22:05 +1000, Nick Piggin wrote:
> > > > On Tuesday 02 October 2007 21:40, Peter Zijlstra wrote:
> > > > > On Tue, 2007-10-02 at 13:21 +0200, Kay Sievers wrote:
> > > > 
> > > > > > How about adding this information to the tree then, instead of
> > > > > > creating a new top-level hack, just because something that you think
> > > > > > you need doesn't exist.
> > > > >
> > > > > So you suggest adding all the various network filesystems in there
> > > > > (where?), and adding the concept of a BDI, and ensuring all are 
> > > > > properly
> > > > > linked together - somehow. Feel free to do so.
> > > > 
> > > > Would something fit better under /sys/fs/? At least filesystems are
> > > > already an existing concept to userspace.
> > > 
> > > Sounds at least less messy than an new top-level directory.
> > > 
> > > But again, if it's "device" releated, like the name suggests, it should
> > > be reachable from the device tree.
> > > Which userspace tool is supposed to set these values, and at what time?
> > > An init-script, something at device discovery/setup? If that is is ever
> > > going to be used in a hotplug setup, you really don't want to go look
> > > for directories with magic device names in another disconnected tree.
> > 
> > Filesystems don't really map to BDIs either. One can have multiple FSs
> > per BDI.
> > 
> > 'Normally' a BDI relates to a block device, but networked (and other
> > non-block device) filesystems have to create a BDI too. So these need to
> > be represented some place as well.
> > 
> > The typical usage would indeed be init scripts. The typical example
> > would be setting the read-ahead window. Currently that cannot be done
> > for NFS mounts.
> 
> What kind of context for a non-block based fs will get the bdi controls
> added? Is there a generic place, or does every non-block based
> filesystem needs to be adapted individually to use it?

---
Subject: bdi: debugfs interface

Expose the BDI stats (and readahead window) in /debug/bdi/

I'm still thinking it should go into /sys somewhere, however I just noticed
not all block devices that have a queue have a /queue directory. Noticeably
those that use make_request_fn() as opposed to request_fn(). And then of
course there are the non-block/non-queue BDIs.

A BDI is basically the object that represents the 'thing' you dirty pages
against. For block devices that is related to the block device (and is
typically embedded in the queue object), for NFS mounts its the remote server
object of the client. For FUSE, yet again something else.

I appreciate the sysfs people their opinion that /sys/bdi/ might not be the
best from their POV, however I'm not seeing where to hook the BDI object from
so that it all makes sense, a few of the things are currently not exposed in
sysfs at all, like the NFS and FUSE things.

So, for now, I've exposed the thing in debugfs. Please suggest a better
alternative.

Miklos, Trond: could you suggest a better fmt for the bdi_init_fmt() for your
respective filesystems?

Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
CC: Miklos Szeredi <[EMAIL PROTECTED]>
CC: Trond Myklebust <[EMAIL PROTECTED]>
---
 block/genhd.c   |2 
 block/ll_rw_blk.c   |1 
 drivers/block/loop.c|7 ++
 drivers/md/dm.c |2 
 drivers/md/md.c |2 
 fs/fuse/inode.c |2 
 fs/nfs/client.c |2 
 include/linux/backing-dev.h |   15 
 include/linux/debugfs.h |   11 +++
 include/linux/writeback.h   |3 
 mm/backing-dev.c|  153 
 mm/page-writeback.c |2 
 12 files changed, 199 insertions(+), 3 deletions(-)

Index: linux-2.6-2/fs/fuse/inode.c
===
--- linux-2.6-2.orig/fs/fuse/inode.c
+++ linux-2.6-2/fs/fuse/inode.c
@@ -467,7 +467,7 @@ static struct fuse_conn *new_conn(void)
atomic_set(>num_waiting, 0);
fc->bdi.ra_pages = (VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE;
fc->bdi.unplug_io_fn = default_unplug_io_fn;
-   err = bdi_init(>bdi);
+   err = bdi_init_fmt(>bdi, "fuse-%p", fc);
if (err) {
kfree(fc);
fc = NULL;
Index: linux-2.6-2/fs/nfs/client.c
===
--- linux-2.6-2.orig/fs/nfs/client.c
+++ linux-2.6-2/fs/nfs/client.c
@@ -678,7 +678,7 @@ static int nfs_probe_fsinfo(struct nfs_s
goto out_error;
 
nfs_server_set_fsinfo(server, );
-   error = bdi_init(>backing_dev_info);
+   error = bdi_init_fmt(>backing_dev_info, "nfs-%s-%p", 
clp->cl_hostname, server);
if (error)
goto

Re: per BDI dirty limit (was Re: -mm merge plans for 2.6.24)

2007-10-26 Thread Peter Zijlstra

On Wed, 2007-10-03 at 15:35 +0200, Kay Sievers wrote:
 On Wed, 2007-10-03 at 12:37 +0200, Peter Zijlstra wrote:
  On Wed, 2007-10-03 at 12:15 +0200, Kay Sievers wrote:
   On Tue, 2007-10-02 at 22:05 +1000, Nick Piggin wrote:
On Tuesday 02 October 2007 21:40, Peter Zijlstra wrote:
 On Tue, 2007-10-02 at 13:21 +0200, Kay Sievers wrote:

  How about adding this information to the tree then, instead of
  creating a new top-level hack, just because something that you think
  you need doesn't exist.

 So you suggest adding all the various network filesystems in there
 (where?), and adding the concept of a BDI, and ensuring all are 
 properly
 linked together - somehow. Feel free to do so.

Would something fit better under /sys/fs/? At least filesystems are
already an existing concept to userspace.
   
   Sounds at least less messy than an new top-level directory.
   
   But again, if it's device releated, like the name suggests, it should
   be reachable from the device tree.
   Which userspace tool is supposed to set these values, and at what time?
   An init-script, something at device discovery/setup? If that is is ever
   going to be used in a hotplug setup, you really don't want to go look
   for directories with magic device names in another disconnected tree.
  
  Filesystems don't really map to BDIs either. One can have multiple FSs
  per BDI.
  
  'Normally' a BDI relates to a block device, but networked (and other
  non-block device) filesystems have to create a BDI too. So these need to
  be represented some place as well.
  
  The typical usage would indeed be init scripts. The typical example
  would be setting the read-ahead window. Currently that cannot be done
  for NFS mounts.
 
 What kind of context for a non-block based fs will get the bdi controls
 added? Is there a generic place, or does every non-block based
 filesystem needs to be adapted individually to use it?

---
Subject: bdi: debugfs interface

Expose the BDI stats (and readahead window) in /debug/bdi/

I'm still thinking it should go into /sys somewhere, however I just noticed
not all block devices that have a queue have a /queue directory. Noticeably
those that use make_request_fn() as opposed to request_fn(). And then of
course there are the non-block/non-queue BDIs.

A BDI is basically the object that represents the 'thing' you dirty pages
against. For block devices that is related to the block device (and is
typically embedded in the queue object), for NFS mounts its the remote server
object of the client. For FUSE, yet again something else.

I appreciate the sysfs people their opinion that /sys/bdi/ might not be the
best from their POV, however I'm not seeing where to hook the BDI object from
so that it all makes sense, a few of the things are currently not exposed in
sysfs at all, like the NFS and FUSE things.

So, for now, I've exposed the thing in debugfs. Please suggest a better
alternative.

Miklos, Trond: could you suggest a better fmt for the bdi_init_fmt() for your
respective filesystems?

Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
CC: Miklos Szeredi [EMAIL PROTECTED]
CC: Trond Myklebust [EMAIL PROTECTED]
---
 block/genhd.c   |2 
 block/ll_rw_blk.c   |1 
 drivers/block/loop.c|7 ++
 drivers/md/dm.c |2 
 drivers/md/md.c |2 
 fs/fuse/inode.c |2 
 fs/nfs/client.c |2 
 include/linux/backing-dev.h |   15 
 include/linux/debugfs.h |   11 +++
 include/linux/writeback.h   |3 
 mm/backing-dev.c|  153 
 mm/page-writeback.c |2 
 12 files changed, 199 insertions(+), 3 deletions(-)

Index: linux-2.6-2/fs/fuse/inode.c
===
--- linux-2.6-2.orig/fs/fuse/inode.c
+++ linux-2.6-2/fs/fuse/inode.c
@@ -467,7 +467,7 @@ static struct fuse_conn *new_conn(void)
atomic_set(fc-num_waiting, 0);
fc-bdi.ra_pages = (VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE;
fc-bdi.unplug_io_fn = default_unplug_io_fn;
-   err = bdi_init(fc-bdi);
+   err = bdi_init_fmt(fc-bdi, fuse-%p, fc);
if (err) {
kfree(fc);
fc = NULL;
Index: linux-2.6-2/fs/nfs/client.c
===
--- linux-2.6-2.orig/fs/nfs/client.c
+++ linux-2.6-2/fs/nfs/client.c
@@ -678,7 +678,7 @@ static int nfs_probe_fsinfo(struct nfs_s
goto out_error;
 
nfs_server_set_fsinfo(server, fsinfo);
-   error = bdi_init(server-backing_dev_info);
+   error = bdi_init_fmt(server-backing_dev_info, nfs-%s-%p, 
clp-cl_hostname, server);
if (error)
goto out_error;
 
Index: linux-2.6-2/include/linux/backing-dev.h

Re: per BDI dirty limit (was Re: -mm merge plans for 2.6.24)

2007-10-26 Thread Miklos Szeredi

 Subject: bdi: debugfs interface
 
 Expose the BDI stats (and readahead window) in /debug/bdi/
 
 I'm still thinking it should go into /sys somewhere, however I just noticed
 not all block devices that have a queue have a /queue directory. Noticeably
 those that use make_request_fn() as opposed to request_fn(). And then of
 course there are the non-block/non-queue BDIs.
 
 A BDI is basically the object that represents the 'thing' you dirty pages
 against. For block devices that is related to the block device (and is
 typically embedded in the queue object), for NFS mounts its the remote server
 object of the client. For FUSE, yet again something else.
 
 I appreciate the sysfs people their opinion that /sys/bdi/ might not be the
 best from their POV, however I'm not seeing where to hook the BDI object from
 so that it all makes sense, a few of the things are currently not exposed in
 sysfs at all, like the NFS and FUSE things.
 
 So, for now, I've exposed the thing in debugfs. Please suggest a better
 alternative.
 
 Miklos, Trond: could you suggest a better fmt for the bdi_init_fmt() for your
 respective filesystems?

For fuse:

err = bdi_init_fmt(fc-bdi, fuse-%llu, (unsigned long long) fc-id);

This would match the connection ID in /sys/fs/fuse/connections/*

Miklos
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: per BDI dirty limit (was Re: -mm merge plans for 2.6.24)

2007-10-26 Thread Kay Sievers

On Fri, 2007-10-26 at 16:48 +0200, Peter Zijlstra wrote:

 I appreciate the sysfs people their opinion that /sys/bdi/ might not be the
 best from their POV, however I'm not seeing where to hook the BDI object from
 so that it all makes sense, a few of the things are currently not exposed in
 sysfs at all, like the NFS and FUSE things.

What happended to the idea to create a bdi class, and have the
existing devices as parents, and for stuff that is not (not now, or
never) in sysfs, no parent is set.

Thanks,
Kay

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: per BDI dirty limit (was Re: -mm merge plans for 2.6.24)

2007-10-26 Thread Kay Sievers

On Fri, 2007-10-26 at 17:22 +0200, Peter Zijlstra wrote:
 On Fri, 2007-10-26 at 17:10 +0200, Kay Sievers wrote:
  On Fri, 2007-10-26 at 16:48 +0200, Peter Zijlstra wrote:
  
   I appreciate the sysfs people their opinion that /sys/bdi/ might not be 
   the
   best from their POV, however I'm not seeing where to hook the BDI object 
   from
   so that it all makes sense, a few of the things are currently not exposed 
   in
   sysfs at all, like the NFS and FUSE things.
  
  What happended to the idea to create a bdi class, and have the
  existing devices as parents, and for stuff that is not (not now, or
  never) in sysfs, no parent is set.
 
 Must have forgotten about that, mainly because I'm not sure I fully
 understand it.
 
 So we create a class,

Yes.

 create these objects,

Yes, struct device objects, assigned to the bdi class. (Don't use
class_device, that will be removed soon.)

 which are all called bdi

Probably not. You can name it how you want, you can inherit the name of
the parent, or prefix it with whatever fits, they just need to be
unique. Things like the fuse-%llu name would work just fine. I guess
you already solved that problem in the debugfs directory.

 and have children with these attributes in it.

The attributes would just be files in the device object.

 Now, I supposed there is a directory that lists all unparented thingies,
 how do I locate the one that matches my nfs mount?

You look for the name (prefix), try: ls /sys/class/sound/, it's the
same model all over the place.

Kay

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: per BDI dirty limit (was Re: -mm merge plans for 2.6.24)

2007-10-26 Thread Peter Zijlstra

On Fri, 2007-10-26 at 17:33 +0200, Kay Sievers wrote:
 On Fri, 2007-10-26 at 17:22 +0200, Peter Zijlstra wrote:
  On Fri, 2007-10-26 at 17:10 +0200, Kay Sievers wrote:
   On Fri, 2007-10-26 at 16:48 +0200, Peter Zijlstra wrote:
   
I appreciate the sysfs people their opinion that /sys/bdi/ might not be 
the
best from their POV, however I'm not seeing where to hook the BDI 
object from
so that it all makes sense, a few of the things are currently not 
exposed in
sysfs at all, like the NFS and FUSE things.
   
   What happended to the idea to create a bdi class, and have the
   existing devices as parents, and for stuff that is not (not now, or
   never) in sysfs, no parent is set.
  
  Must have forgotten about that, mainly because I'm not sure I fully
  understand it.
  
  So we create a class,
 
 Yes.
 
  create these objects,
 
 Yes, struct device objects, assigned to the bdi class. (Don't use
 class_device, that will be removed soon.)
 
  which are all called bdi
 
 Probably not. You can name it how you want, you can inherit the name of
 the parent, or prefix it with whatever fits, they just need to be
 unique. Things like the fuse-%llu name would work just fine. I guess
 you already solved that problem in the debugfs directory.
 
  and have children with these attributes in it.
 
 The attributes would just be files in the device object.
 
  Now, I supposed there is a directory that lists all unparented thingies,
  how do I locate the one that matches my nfs mount?
 
 You look for the name (prefix), try: ls /sys/class/sound/, it's the
 same model all over the place.

Ok, will try that. Is there a 'simple uncluttered' example I could look
at to copy from?


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: per BDI dirty limit (was Re: -mm merge plans for 2.6.24)

2007-10-26 Thread Kay Sievers


On Fri, 2007-10-26 at 17:33 +0200, Peter Zijlstra wrote:
 On Fri, 2007-10-26 at 17:33 +0200, Kay Sievers wrote:
  On Fri, 2007-10-26 at 17:22 +0200, Peter Zijlstra wrote:
   On Fri, 2007-10-26 at 17:10 +0200, Kay Sievers wrote:
On Fri, 2007-10-26 at 16:48 +0200, Peter Zijlstra wrote:

 I appreciate the sysfs people their opinion that /sys/bdi/ might not 
 be the
 best from their POV, however I'm not seeing where to hook the BDI 
 object from
 so that it all makes sense, a few of the things are currently not 
 exposed in
 sysfs at all, like the NFS and FUSE things.

What happended to the idea to create a bdi class, and have the
existing devices as parents, and for stuff that is not (not now, or
never) in sysfs, no parent is set.
   
   Must have forgotten about that, mainly because I'm not sure I fully
   understand it.
   
   So we create a class,
  
  Yes.
  
   create these objects,
  
  Yes, struct device objects, assigned to the bdi class. (Don't use
  class_device, that will be removed soon.)
  
   which are all called bdi
  
  Probably not. You can name it how you want, you can inherit the name of
  the parent, or prefix it with whatever fits, they just need to be
  unique. Things like the fuse-%llu name would work just fine. I guess
  you already solved that problem in the debugfs directory.
  
   and have children with these attributes in it.
  
  The attributes would just be files in the device object.
  
   Now, I supposed there is a directory that lists all unparented thingies,
   how do I locate the one that matches my nfs mount?
  
  You look for the name (prefix), try: ls /sys/class/sound/, it's the
  same model all over the place.
 
 Ok, will try that. Is there a 'simple uncluttered' example I could look
 at to copy from?

drivers/firmware/dmi-id.c

It has only a single device created in the init routine, but it shows
what to do with the class.

Until the block subsystem is converted from using raw kobjects to
devices, you need to set the parent kobject of the bdi device to the
blockdev:
  bdidev-dev.kobj.parent = disk-kobj

Kay

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: per BDI dirty limit (was Re: -mm merge plans for 2.6.24)

2007-10-26 Thread Peter Zijlstra

On Fri, 2007-10-26 at 17:10 +0200, Kay Sievers wrote:
 On Fri, 2007-10-26 at 16:48 +0200, Peter Zijlstra wrote:
 
  I appreciate the sysfs people their opinion that /sys/bdi/ might not be the
  best from their POV, however I'm not seeing where to hook the BDI object 
  from
  so that it all makes sense, a few of the things are currently not exposed in
  sysfs at all, like the NFS and FUSE things.
 
 What happended to the idea to create a bdi class, and have the
 existing devices as parents, and for stuff that is not (not now, or
 never) in sysfs, no parent is set.

Must have forgotten about that, mainly because I'm not sure I fully
understand it.

So we create a class, create these objects, which are all called bdi and
have children with these attributes in it.

Now, I supposed there is a directory that lists all unparented thingies,
how do I locate the one that matches my nfs mount?



-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: per BDI dirty limit (was Re: -mm merge plans for 2.6.24)

2007-10-26 Thread Trond Myklebust


On Fri, 2007-10-26 at 16:48 +0200, Peter Zijlstra wrote:

 Miklos, Trond: could you suggest a better fmt for the bdi_init_fmt() for your
 respective filesystems?
snip
 Index: linux-2.6-2/fs/nfs/client.c
 ===
 --- linux-2.6-2.orig/fs/nfs/client.c
 +++ linux-2.6-2/fs/nfs/client.c
 @@ -678,7 +678,7 @@ static int nfs_probe_fsinfo(struct nfs_s
   goto out_error;
  
   nfs_server_set_fsinfo(server, fsinfo);
 - error = bdi_init(server-backing_dev_info);
 + error = bdi_init_fmt(server-backing_dev_info, nfs-%s-%p, 
 clp-cl_hostname, server);

In our debugging printks, we usually use the superblock-s_id, but that
only gets initialised later.

I'd suggest passing the 'dev_name' from *_get_sb() into
*_create_server().

Cheers
  Trond

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: per BDI dirty limit (was Re: -mm merge plans for 2.6.24)

2007-10-26 Thread Peter Zijlstra

This crashes and burns on bootup, but I'm too tired to figure out what I
did wrong... will give it another try tomorrow..


Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
---
 block/genhd.c   |2 
 fs/fuse/inode.c |2 
 fs/nfs/client.c |2 
 include/linux/backing-dev.h |   33 
 include/linux/writeback.h   |3 +
 mm/backing-dev.c|  121 
 mm/page-writeback.c |2 
 7 files changed, 162 insertions(+), 3 deletions(-)

Index: linux-2.6-2/fs/fuse/inode.c
===
--- linux-2.6-2.orig/fs/fuse/inode.c
+++ linux-2.6-2/fs/fuse/inode.c
@@ -467,7 +467,7 @@ static struct fuse_conn *new_conn(void)
atomic_set(fc-num_waiting, 0);
fc-bdi.ra_pages = (VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE;
fc-bdi.unplug_io_fn = default_unplug_io_fn;
-   err = bdi_init(fc-bdi);
+   err = bdi_init_fmt(fc-bdi, fuse-%llu, (unsigned long 
long)fc-id);
if (err) {
kfree(fc);
fc = NULL;
Index: linux-2.6-2/fs/nfs/client.c
===
--- linux-2.6-2.orig/fs/nfs/client.c
+++ linux-2.6-2/fs/nfs/client.c
@@ -678,7 +678,7 @@ static int nfs_probe_fsinfo(struct nfs_s
goto out_error;
 
nfs_server_set_fsinfo(server, fsinfo);
-   error = bdi_init(server-backing_dev_info);
+   error = bdi_init_fmt(server-backing_dev_info, nfs-%s-%p, 
clp-cl_hostname, server);
if (error)
goto out_error;
 
Index: linux-2.6-2/include/linux/backing-dev.h
===
--- linux-2.6-2.orig/include/linux/backing-dev.h
+++ linux-2.6-2/include/linux/backing-dev.h
@@ -11,6 +11,8 @@
 #include linux/percpu_counter.h
 #include linux/log2.h
 #include linux/proportions.h
+#include linux/kernel.h
+#include linux/device.h
 #include asm/atomic.h
 
 struct page;
@@ -48,11 +50,42 @@ struct backing_dev_info {
 
struct prop_local_percpu completions;
int dirty_exceeded;
+
+#ifdef CONFIG_SYSFS
+   struct device kdev;
+#endif
 };
 
 int bdi_init(struct backing_dev_info *bdi);
 void bdi_destroy(struct backing_dev_info *bdi);
 
+int __bdi_register(struct backing_dev_info *bdi);
+void bdi_unregister(struct backing_dev_info *bdi);
+
+#ifdef CONFIG_SYSFS
+#define bdi_init_fmt(bdi, fmt...)  \
+   ({  \
+   int ret;\
+   kobject_set_name((bdi)-kdev.kobj, ##fmt); \
+   ret = bdi_init(bdi);\
+   if (!ret) { \
+   ret = __bdi_register(bdi);  \
+   if (ret)\
+   bdi_destroy(bdi);   \
+   }   \
+   ret;\
+   })
+
+#define bdi_register(bdi, fmt...)  \
+   ({  \
+   kobject_set_name((bdi)-kdev.kobj, ##fmt); \
+   __bdi_register(bdi);\
+   })
+#else
+#define bdi_init_fmt(bdi, fmt...)  bdi_init(bdi)
+#define bdi_register(bdi, fmt...)  __bdi_register(bdi)
+#endif
+
 static inline void __add_bdi_stat(struct backing_dev_info *bdi,
enum bdi_stat_item item, s64 amount)
 {
Index: linux-2.6-2/include/linux/writeback.h
===
--- linux-2.6-2.orig/include/linux/writeback.h
+++ linux-2.6-2/include/linux/writeback.h
@@ -113,6 +113,9 @@ struct file;
 int dirty_writeback_centisecs_handler(struct ctl_table *, int, struct file *,
  void __user *, size_t *, loff_t *);
 
+void get_dirty_limits(long *pbackground, long *pdirty, long *pbdi_dirty,
+struct backing_dev_info *bdi);
+
 void page_writeback_init(void);
 void balance_dirty_pages_ratelimited_nr(struct address_space *mapping,
unsigned long nr_pages_dirtied);
Index: linux-2.6-2/mm/backing-dev.c
===
--- linux-2.6-2.orig/mm/backing-dev.c
+++ linux-2.6-2/mm/backing-dev.c
@@ -4,12 +4,130 @@
 #include linux/fs.h
 #include linux/sched.h
 #include linux/module.h
+#include linux/writeback.h
+#include linux/device.h
+
+#ifdef CONFIG_SYSFS
+
+static void bdi_release(struct device *dev)
+{
+}
+
+static int bdi_uevent(struct device *dev, struct kobj_uevent_env *env)
+{
+   return 0;
+}
+
+static struct class bdi_class = {
+   .name

Re: per BDI dirty limit (was Re: -mm merge plans for 2.6.24)

2007-10-26 Thread Peter Zijlstra


On Fri, 2007-10-26 at 22:04 +0200, Peter Zijlstra wrote:
 This crashes and burns on bootup, but I'm too tired to figure out what I
 did wrong... will give it another try tomorrow..

Ok, can't sleep.. took a look. I have several problems here.

The thing that makes it go *boom* is the __ATTR_NULL. Removing that
makes it boot. Albeit it then warns me of multiple duplicate sysfs
objects, all named bdi.

For some obscure reason this device interface insists on using the
bus_id as name (?!), and further reduces usability by limiting that to
20 odd characters.

This makes it quite useless. I tried fudging around that limit by using
device_rename and kobject_rename, but to no avail.

Really, it should not be this hard to use, trying to expose a handfull
of simple integers to userspace should not take 8h+ and still not work.

Peter, who thinks sysfs is contorted mess beyond his skill. I'll stick
to VM and scheduler code, that actually makes sense.

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: per BDI dirty limit (was Re: -mm merge plans for 2.6.24)

2007-10-26 Thread Greg KH

On Sat, Oct 27, 2007 at 03:18:08AM +0200, Peter Zijlstra wrote:
 
 On Fri, 2007-10-26 at 22:04 +0200, Peter Zijlstra wrote:
  This crashes and burns on bootup, but I'm too tired to figure out what I
  did wrong... will give it another try tomorrow..
 
 Ok, can't sleep.. took a look. I have several problems here.
 
 The thing that makes it go *boom* is the __ATTR_NULL. Removing that
 makes it boot. Albeit it then warns me of multiple duplicate sysfs
 objects, all named bdi.
 
 For some obscure reason this device interface insists on using the
 bus_id as name (?!), and further reduces usability by limiting that to
 20 odd characters.
 
 This makes it quite useless. I tried fudging around that limit by using
 device_rename and kobject_rename, but to no avail.
 
 Really, it should not be this hard to use, trying to expose a handfull
 of simple integers to userspace should not take 8h+ and still not work.
 
 Peter, who thinks sysfs is contorted mess beyond his skill. I'll stick
 to VM and scheduler code, that actually makes sense.

Heh, that's funny :)

I'll look at this and see what I can come up with.  Would you just like
a whole new patch, or one against this one?

thanks,

greg k-h
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: -mm merge plans for 2.6.24

2007-10-13 Thread Borislav Petkov

On Sat, Oct 13, 2007 at 01:52:52AM -0700, Andrew Morton wrote:
> On Sat, 13 Oct 2007 10:44:55 +0200 Borislav Petkov <[EMAIL PROTECTED]> wrote:
> 
> > can you please add http://lkml.org/lkml/2007/7/30/98 also to the misc-queue 
> > for
> > the warning still persists and the patch is good to go as is (against 
> > current git
> > v2.6.23-2840-g752097c, for example) albeit with a little fuzziness.
> 
> I got completely fed up with maintaining that patch against ongoing churn
> frenzy in Greg's trees so I dropped it.
> 
> If/when things settle down in that area someone will need to redo the
> patch.

/me wondering: what if you pass it on upstream to Linus, since:
1. it applies cleanly now
2. is pretty trivial
and forget it about it forever :). It seems the place this
patch touches is the only place where kobject_* and sysfs_create_* etc. error
codes are not being handled in contrast to all those functions which have been 
declared
__must_check?

-- 
Regards/Gruß,
Boris.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: -mm merge plans for 2.6.24

2007-10-13 Thread Andrew Morton

On Sat, 13 Oct 2007 10:44:55 +0200 Borislav Petkov <[EMAIL PROTECTED]> wrote:

> can you please add http://lkml.org/lkml/2007/7/30/98 also to the misc-queue 
> for
> the warning still persists and the patch is good to go as is (against current 
> git
> v2.6.23-2840-g752097c, for example) albeit with a little fuzziness.

I got completely fed up with maintaining that patch against ongoing churn
frenzy in Greg's trees so I dropped it.

If/when things settle down in that area someone will need to redo the
patch.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: -mm merge plans for 2.6.24

2007-10-13 Thread Borislav Petkov

On Mon, Oct 01, 2007 at 02:22:22PM -0700, Andrew Morton wrote:

> #increase-at_vector_size-to-terminate-saved_auxv-properly.patch: Tony wanted 
> enhancements
> increase-at_vector_size-to-terminate-saved_auxv-properly.patch
> change-inotifyfs-magic-as-the-same-magic-is-used-for-futexfs-v2.patch
> delay-creation-of-khcvd-thread.patch
> hvc-console-is-also-used-by-iseries-so-add-that-to-hvc_driver-help.patch
> lockdep-give-each-filesystem-its-own-inode-lock-class.patch
> menuconfig-transform-nls-and-dlm-menus.patch
> menuconfig-transform-network-filesystems-menu.patch
> fs-udf-ballocc-mark-a-variable-as-uninitialized_var.patch
> jbd-config_jbd_debug-cannot-create-proc-entry.patch
> jbd-config_jbd_debug-cannot-create-proc-entry-fix.patch
> jbd-fix-commit-code-to-properly-abort-journal.patch
> jbd-fix-jbd-warnings-when-compiling-with-config_jbd_debug.patch
> dont-truncate-proc-pid-environ-at-4096-characters.patch
> fix-wrong-filename-reference-in-drivers-testingtxt.patch
> anon-inodes-use-open-coded-atomic_inc-for-the-shared-inode.patch
> ncr53c8xx-remove-deprecated-irq-flags-sa_.patch
> completely-remove-deprecated-irq-flags-sa_.patch
> compile-handle_percpu_irq-even-for-uniprocessor-kernels.patch
> fs-correct-sus-compliance-for-open-of-large-file-without.patch
> ext3-remove-ifdef-config_ext3_index.patch
> rename-signalfd_siginfo-fields.patch
> break-elf_platform-and-stack-pointer-randomization-dependency.patch
> spin_lock_unlocked-cleanups.patch
> binfmt_flat-minimum-support-for-the-blackfin-relocations.patch
> binfmt_flat-minimum-support-for-the-blackfin-relocations-checkpatch-fixes.patch
> 
>   The infamous misc.  Will re-review and will merge basically all of them.
can you please add http://lkml.org/lkml/2007/7/30/98 also to the misc-queue for
the warning still persists and the patch is good to go as is (against current 
git
v2.6.23-2840-g752097c, for example) albeit with a little fuzziness.
-- 
Regards/Gruß,
Boris.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: -mm merge plans for 2.6.24

2007-10-13 Thread Borislav Petkov

On Mon, Oct 01, 2007 at 02:22:22PM -0700, Andrew Morton wrote:
snip
 #increase-at_vector_size-to-terminate-saved_auxv-properly.patch: Tony wanted 
 enhancements
 increase-at_vector_size-to-terminate-saved_auxv-properly.patch
 change-inotifyfs-magic-as-the-same-magic-is-used-for-futexfs-v2.patch
 delay-creation-of-khcvd-thread.patch
 hvc-console-is-also-used-by-iseries-so-add-that-to-hvc_driver-help.patch
 lockdep-give-each-filesystem-its-own-inode-lock-class.patch
 menuconfig-transform-nls-and-dlm-menus.patch
 menuconfig-transform-network-filesystems-menu.patch
 fs-udf-ballocc-mark-a-variable-as-uninitialized_var.patch
 jbd-config_jbd_debug-cannot-create-proc-entry.patch
 jbd-config_jbd_debug-cannot-create-proc-entry-fix.patch
 jbd-fix-commit-code-to-properly-abort-journal.patch
 jbd-fix-jbd-warnings-when-compiling-with-config_jbd_debug.patch
 dont-truncate-proc-pid-environ-at-4096-characters.patch
 fix-wrong-filename-reference-in-drivers-testingtxt.patch
 anon-inodes-use-open-coded-atomic_inc-for-the-shared-inode.patch
 ncr53c8xx-remove-deprecated-irq-flags-sa_.patch
 completely-remove-deprecated-irq-flags-sa_.patch
 compile-handle_percpu_irq-even-for-uniprocessor-kernels.patch
 fs-correct-sus-compliance-for-open-of-large-file-without.patch
 ext3-remove-ifdef-config_ext3_index.patch
 rename-signalfd_siginfo-fields.patch
 break-elf_platform-and-stack-pointer-randomization-dependency.patch
 spin_lock_unlocked-cleanups.patch
 binfmt_flat-minimum-support-for-the-blackfin-relocations.patch
 binfmt_flat-minimum-support-for-the-blackfin-relocations-checkpatch-fixes.patch
 
   The infamous misc.  Will re-review and will merge basically all of them.
can you please add http://lkml.org/lkml/2007/7/30/98 also to the misc-queue for
the warning still persists and the patch is good to go as is (against current 
git
v2.6.23-2840-g752097c, for example) albeit with a little fuzziness.
-- 
Regards/Gruß,
Boris.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: -mm merge plans for 2.6.24

2007-10-13 Thread Andrew Morton

On Sat, 13 Oct 2007 10:44:55 +0200 Borislav Petkov [EMAIL PROTECTED] wrote:

 can you please add http://lkml.org/lkml/2007/7/30/98 also to the misc-queue 
 for
 the warning still persists and the patch is good to go as is (against current 
 git
 v2.6.23-2840-g752097c, for example) albeit with a little fuzziness.

I got completely fed up with maintaining that patch against ongoing churn
frenzy in Greg's trees so I dropped it.

If/when things settle down in that area someone will need to redo the
patch.

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: -mm merge plans for 2.6.24

2007-10-13 Thread Borislav Petkov

On Sat, Oct 13, 2007 at 01:52:52AM -0700, Andrew Morton wrote:
 On Sat, 13 Oct 2007 10:44:55 +0200 Borislav Petkov [EMAIL PROTECTED] wrote:
 
  can you please add http://lkml.org/lkml/2007/7/30/98 also to the misc-queue 
  for
  the warning still persists and the patch is good to go as is (against 
  current git
  v2.6.23-2840-g752097c, for example) albeit with a little fuzziness.
 
 I got completely fed up with maintaining that patch against ongoing churn
 frenzy in Greg's trees so I dropped it.
 
 If/when things settle down in that area someone will need to redo the
 patch.

/me wondering: what if you pass it on upstream to Linus, since:
1. it applies cleanly now
2. is pretty trivial
and forget it about it forever :). It seems the place this
patch touches is the only place where kobject_* and sysfs_create_* etc. error
codes are not being handled in contrast to all those functions which have been 
declared
__must_check?

-- 
Regards/Gruß,
Boris.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Memory controller merge (was Re: -mm merge plans for 2.6.24)

2007-10-11 Thread Balbir Singh

Rik van Riel wrote:
> On Tue, 02 Oct 2007 09:51:11 +0530
> Balbir Singh <[EMAIL PROTECTED]> wrote:
> 
>> I was hopeful of getting the bare minimal infrastructure for memory
>> control in mainline, so that review is easy and additional changes
>> can be well reviewed as well.
> 
> I am not yet convinced that the way the memory controller code and
> lumpy reclaim have been merged is correct.  I am combing through the
> code now and will send in a patch when I figure out if/what is wrong.
> 

Hi, Rik,

Do you mean the way the memory controller and lumpy reclaim work
together? The reclaim in memory controller (on hitting the limit) is not
lumpy. Would you like to see that change?

Please do share your findings in the form of comments or patches.

> I ran into this because I'm trying to merge the split VM code up to
> the latest -mm...
> 

Interesting, I'll see if I can find some spare test cycles to help test
this code.


-- 
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Memory controller merge (was Re: -mm merge plans for 2.6.24)

2007-10-11 Thread Balbir Singh

Rik van Riel wrote:
 On Tue, 02 Oct 2007 09:51:11 +0530
 Balbir Singh [EMAIL PROTECTED] wrote:
 
 I was hopeful of getting the bare minimal infrastructure for memory
 control in mainline, so that review is easy and additional changes
 can be well reviewed as well.
 
 I am not yet convinced that the way the memory controller code and
 lumpy reclaim have been merged is correct.  I am combing through the
 code now and will send in a patch when I figure out if/what is wrong.
 

Hi, Rik,

Do you mean the way the memory controller and lumpy reclaim work
together? The reclaim in memory controller (on hitting the limit) is not
lumpy. Would you like to see that change?

Please do share your findings in the form of comments or patches.

 I ran into this because I'm trying to merge the split VM code up to
 the latest -mm...
 

Interesting, I'll see if I can find some spare test cycles to help test
this code.


-- 
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Memory controller merge (was Re: -mm merge plans for 2.6.24)

2007-10-10 Thread Rik van Riel

On Tue, 02 Oct 2007 09:51:11 +0530
Balbir Singh <[EMAIL PROTECTED]> wrote:

> I was hopeful of getting the bare minimal infrastructure for memory
> control in mainline, so that review is easy and additional changes
> can be well reviewed as well.

I am not yet convinced that the way the memory controller code and
lumpy reclaim have been merged is correct.  I am combing through the
code now and will send in a patch when I figure out if/what is wrong.

I ran into this because I'm trying to merge the split VM code up to
the latest -mm...

-- 
All Rights Reversed
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: remove zero_page (was Re: -mm merge plans for 2.6.24)

2007-10-10 Thread Linus Torvalds

On Wed, 10 Oct 2007, Nick Piggin wrote:
> 
> It just seems like now might be a good time to just _try_ removing
> the zero page

Yes. Let's do your patch immediately after the x86 merge, and just see if 
anybody screams. 

It might take a while, because I certainly agree that whoever would be 
affected by it is likely to be unusual.

> OK, maybe this is where we are not on the same page.
> There are 2 issues really. Firstly, performance problem of
> refcounting the zero-page -- we've established that it causes
> this livelock and that we should stop refcounting it, right?

Yes, I do agree that refcounting is problematic. 

> Second issue is the performance difference between removing the
> zero page completely, and de-refcounting it (it's obviously
> incorrect to argue for zero page removal for performance reasons
> if the performance improvement is simply coming from avoiding
> the refcounting).

Well, even if it's a "when you don't get into the bad behaviour, 
performance difference is not measurable", and give a before-and-after 
number for some random but interesting load. Even if it's just a kernel 
compile..

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: remove zero_page (was Re: -mm merge plans for 2.6.24)

2007-10-10 Thread Nick Piggin

On Wednesday 10 October 2007 15:20, Linus Torvalds wrote:
> On Wed, 10 Oct 2007, Hugh Dickins wrote:
> > On Tue, 9 Oct 2007, Nick Piggin wrote:
> > > by it ;) To prove my point: the *first* approach I posted to fix this
> > > problem was exactly a patch to special-case the zero_page refcounting
> > > which was removed with my PageReserved patch. Neither Hugh nor yourself
> > > liked it one bit!
> >
> > True (speaking for me; I forget whether Linus ever got to see it).
>
> The problem is, those first "remove ref-counting" patches were ugly
> *regardless* of ZERO_PAGE.
>
> We (yes, largely I) fixed up the mess since. The whole vm_normal_page()
> and the magic PFN_REMAP thing got rid of a lot of the problems.
>
> And I bet that we could do something very similar wrt the zero page too.
>
> Basically, the ZERO page could act pretty much exactly like a PFN_REMAP
> page: the VM would not touch it. No rmap, no page refcounting, no nothing.
>
> This following patch is not meant to be even half-way correct (it's not
> even _remotely_ tested), but is just meant to be a rough "grep for
> ZERO_PAGE in the VM, and see what happens if you don't ref-count it".
>
> Would something like the below work? I dunno. But I suspect it would. I

Sure it will work. It's not completely trivial like your patch,
though. The VM has to know about ZERO_PAGE if you also want it
to do the "optimised" wp (what you have won't work because it
will break all other "not normal" pages which are non-zero I think).

And your follow_page_page path is not going to do the right thing
for ZERO_PAGE either I think.

> doubt anybody has the energy to actually try to actually follow through on
> it, which is why I'm not pushing on it any more, and why I'll accept

Sure they have.

http://marc.info/?l=linux-mm=117515508009729=2

OK, this patch was open coding the tests rather than putting them in
vm_normal_page, but vm_normal_page doesn't magically make it a whole
lot cleaner (a _little_ bit cleaner, I agree, but in my current patch
I still need a vm_normal_or_zero_page() function).

> Nick's patch to just remove ZERO_PAGE, but I really *am* very unhappy
> about this.

Well that's not very good...

> The "page refcounting cleanups" in the VM back when were really painful.
> And dammit, I felt like I was the one who had to clean them up after you
> guys. Which makes me really testy on this subject.

OK, but in this case we'll not have a big hard-to-revert set of
changes that fundamentally alter assumptions throughout the vm.
It will be more a case of "if somebody screams, put the zero page
back", won't it?

> Totally half-assed untested patch to follow, not meant for anything but a
> "I think this kind of approach should have worked too" comment.
>
> So I'm not pushing the patch below, I'm just fighting for people realizing
> that
>
>  - the kernel has *always* (since pretty much day 1) done that ZERO_PAGE
>thing. This means that I would not be at all surprised if some
>application basically depends on it. I've written test-programs that
>depends on it - maybe people have written other code that basically has
>been written for and tested with a kernel that has basically always
>made read-only zero pages extra cheap.
>
>So while it may be true that removing ZERO_PAGE won't affect anybody, I
>don't think it's a given, and I also don't think it's sane calling
>people "crazy" for depending on something that has always been true
>under Linux for the last 15+ years. There are few behaviors that have
>been around for that long.

That's the main question. Maybe my wording was a little strong, but
I simply personally couldn't think of sane uses of zero page. I'm
not prepared to argue that none could possibly exist.

It just seems like now might be a good time to just _try_ removing
the zero page, because of this peripheral problem caused by my
refcounting patch. If it doesn't work out, then at least we'll be
wiser for it, we can document why the zero page is needed, and add
it back with the refcounting exceptions.

>  - make sure the commit message is accurate as to need for this (ie not
>claim that the ZERO_PAGE itself was the problem, and give some actual
>performance numbers on what is going on)

OK, maybe this is where we are not on the same page.
There are 2 issues really. Firstly, performance problem of
refcounting the zero-page -- we've established that it causes
this livelock and that we should stop refcounting it, right?

Second issue is the performance difference between removing the
zero page completely, and de-refcounting it (it's obviously
incorrect to argue for zero page removal for performance reasons
if the performance improvement is simply coming from avoiding
the refcounting). The problem with that is I simply don't know
any tests that use the ZERO_PAGE significantly enough to measure
a difference. The 1000 COW faults vs < 1 unmap per second thing
was simply to show that, on the micro level,

Re: remove zero_page (was Re: -mm merge plans for 2.6.24)

2007-10-10 Thread Nick Piggin

On Wednesday 10 October 2007 15:20, Linus Torvalds wrote:
 On Wed, 10 Oct 2007, Hugh Dickins wrote:
  On Tue, 9 Oct 2007, Nick Piggin wrote:
   by it ;) To prove my point: the *first* approach I posted to fix this
   problem was exactly a patch to special-case the zero_page refcounting
   which was removed with my PageReserved patch. Neither Hugh nor yourself
   liked it one bit!
 
  True (speaking for me; I forget whether Linus ever got to see it).

 The problem is, those first remove ref-counting patches were ugly
 *regardless* of ZERO_PAGE.

 We (yes, largely I) fixed up the mess since. The whole vm_normal_page()
 and the magic PFN_REMAP thing got rid of a lot of the problems.

 And I bet that we could do something very similar wrt the zero page too.

 Basically, the ZERO page could act pretty much exactly like a PFN_REMAP
 page: the VM would not touch it. No rmap, no page refcounting, no nothing.

 This following patch is not meant to be even half-way correct (it's not
 even _remotely_ tested), but is just meant to be a rough grep for
 ZERO_PAGE in the VM, and see what happens if you don't ref-count it.

 Would something like the below work? I dunno. But I suspect it would. I

Sure it will work. It's not completely trivial like your patch,
though. The VM has to know about ZERO_PAGE if you also want it
to do the optimised wp (what you have won't work because it
will break all other not normal pages which are non-zero I think).

And your follow_page_page path is not going to do the right thing
for ZERO_PAGE either I think.



 doubt anybody has the energy to actually try to actually follow through on
 it, which is why I'm not pushing on it any more, and why I'll accept

Sure they have.

http://marc.info/?l=linux-mmm=117515508009729w=2

OK, this patch was open coding the tests rather than putting them in
vm_normal_page, but vm_normal_page doesn't magically make it a whole
lot cleaner (a _little_ bit cleaner, I agree, but in my current patch
I still need a vm_normal_or_zero_page() function).


 Nick's patch to just remove ZERO_PAGE, but I really *am* very unhappy
 about this.

Well that's not very good...


 The page refcounting cleanups in the VM back when were really painful.
 And dammit, I felt like I was the one who had to clean them up after you
 guys. Which makes me really testy on this subject.

OK, but in this case we'll not have a big hard-to-revert set of
changes that fundamentally alter assumptions throughout the vm.
It will be more a case of if somebody screams, put the zero page
back, won't it?


 Totally half-assed untested patch to follow, not meant for anything but a
 I think this kind of approach should have worked too comment.

 So I'm not pushing the patch below, I'm just fighting for people realizing
 that

  - the kernel has *always* (since pretty much day 1) done that ZERO_PAGE
thing. This means that I would not be at all surprised if some
application basically depends on it. I've written test-programs that
depends on it - maybe people have written other code that basically has
been written for and tested with a kernel that has basically always
made read-only zero pages extra cheap.

So while it may be true that removing ZERO_PAGE won't affect anybody, I
don't think it's a given, and I also don't think it's sane calling
people crazy for depending on something that has always been true
under Linux for the last 15+ years. There are few behaviors that have
been around for that long.

That's the main question. Maybe my wording was a little strong, but
I simply personally couldn't think of sane uses of zero page. I'm
not prepared to argue that none could possibly exist.

It just seems like now might be a good time to just _try_ removing
the zero page, because of this peripheral problem caused by my
refcounting patch. If it doesn't work out, then at least we'll be
wiser for it, we can document why the zero page is needed, and add
it back with the refcounting exceptions.


  - make sure the commit message is accurate as to need for this (ie not
claim that the ZERO_PAGE itself was the problem, and give some actual
performance numbers on what is going on)

OK, maybe this is where we are not on the same page.
There are 2 issues really. Firstly, performance problem of
refcounting the zero-page -- we've established that it causes
this livelock and that we should stop refcounting it, right?

Second issue is the performance difference between removing the
zero page completely, and de-refcounting it (it's obviously
incorrect to argue for zero page removal for performance reasons
if the performance improvement is simply coming from avoiding
the refcounting). The problem with that is I simply don't know
any tests that use the ZERO_PAGE significantly enough to measure
a difference. The 1000 COW faults vs  1 unmap per second thing
was simply to show that, on the micro level, performance won't
have regressed by removing the zero page.

So I'm not arguing

Re: remove zero_page (was Re: -mm merge plans for 2.6.24)

2007-10-10 Thread Linus Torvalds



On Wed, 10 Oct 2007, Nick Piggin wrote:
 
 It just seems like now might be a good time to just _try_ removing
 the zero page

Yes. Let's do your patch immediately after the x86 merge, and just see if 
anybody screams. 

It might take a while, because I certainly agree that whoever would be 
affected by it is likely to be unusual.

 OK, maybe this is where we are not on the same page.
 There are 2 issues really. Firstly, performance problem of
 refcounting the zero-page -- we've established that it causes
 this livelock and that we should stop refcounting it, right?

Yes, I do agree that refcounting is problematic. 

 Second issue is the performance difference between removing the
 zero page completely, and de-refcounting it (it's obviously
 incorrect to argue for zero page removal for performance reasons
 if the performance improvement is simply coming from avoiding
 the refcounting).

Well, even if it's a when you don't get into the bad behaviour, 
performance difference is not measurable, and give a before-and-after 
number for some random but interesting load. Even if it's just a kernel 
compile..

Linus
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Memory controller merge (was Re: -mm merge plans for 2.6.24)

2007-10-10 Thread Rik van Riel

On Tue, 02 Oct 2007 09:51:11 +0530
Balbir Singh [EMAIL PROTECTED] wrote:

 I was hopeful of getting the bare minimal infrastructure for memory
 control in mainline, so that review is easy and additional changes
 can be well reviewed as well.

I am not yet convinced that the way the memory controller code and
lumpy reclaim have been merged is correct.  I am combing through the
code now and will send in a patch when I figure out if/what is wrong.

I ran into this because I'm trying to merge the split VM code up to
the latest -mm...

-- 
All Rights Reversed
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: remove zero_page (was Re: -mm merge plans for 2.6.24)

2007-10-09 Thread Linus Torvalds

On Wed, 10 Oct 2007, Hugh Dickins wrote:

> On Tue, 9 Oct 2007, Nick Piggin wrote:
> > by it ;) To prove my point: the *first* approach I posted to fix this
> > problem was exactly a patch to special-case the zero_page refcounting
> > which was removed with my PageReserved patch. Neither Hugh nor yourself
> > liked it one bit!
> 
> True (speaking for me; I forget whether Linus ever got to see it).

The problem is, those first "remove ref-counting" patches were ugly 
*regardless* of ZERO_PAGE.

We (yes, largely I) fixed up the mess since. The whole vm_normal_page() 
and the magic PFN_REMAP thing got rid of a lot of the problems.

And I bet that we could do something very similar wrt the zero page too.

Basically, the ZERO page could act pretty much exactly like a PFN_REMAP 
page: the VM would not touch it. No rmap, no page refcounting, no nothing.

This following patch is not meant to be even half-way correct (it's not 
even _remotely_ tested), but is just meant to be a rough "grep for 
ZERO_PAGE in the VM, and see what happens if you don't ref-count it".

Would something like the below work? I dunno. But I suspect it would. I 
doubt anybody has the energy to actually try to actually follow through on 
it, which is why I'm not pushing on it any more, and why I'll accept 
Nick's patch to just remove ZERO_PAGE, but I really *am* very unhappy 
about this.

The "page refcounting cleanups" in the VM back when were really painful. 
And dammit, I felt like I was the one who had to clean them up after you 
guys. Which makes me really testy on this subject.

And yes, I also admit that the vm_normal_page() and the PFN_REMAP thing 
ended up really improving the VM, and we're pretty much certainly better 
off now than we were before - but I also think that ZERO_PAGE etc could 
easily be handled with the same model. After all, if we can make 
"mmap(/dev/mem)" work with COW and everything, I'd argue that ZERO_PAGE 
really is just a very very small special case of that!

Totally half-assed untested patch to follow, not meant for anything but a 
"I think this kind of approach should have worked too" comment.

So I'm not pushing the patch below, I'm just fighting for people realizing 
that

 - the kernel has *always* (since pretty much day 1) done that ZERO_PAGE 
   thing. This means that I would not be at all surprised if some 
   application basically depends on it. I've written test-programs that 
   depends on it - maybe people have written other code that basically has 
   been written for and tested with a kernel that has basically always 
   made read-only zero pages extra cheap.

   So while it may be true that removing ZERO_PAGE won't affect anybody, I 
   don't think it's a given, and I also don't think it's sane calling 
   people "crazy" for depending on something that has always been true 
   under Linux for the last 15+ years. There are few behaviors that have 
   been around for that long.

 - make sure the commit message is accurate as to need for this (ie not 
   claim that the ZERO_PAGE itself was the problem, and give some actual 
   performance numbers on what is going on)

that's all.

Linus

---
 mm/memory.c  |   17 -
 mm/migrate.c |2 +-
 2 files changed, 9 insertions(+), 10 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index f82b359..0a8cc88 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -386,6 +386,7 @@ static inline int is_cow_mapping(unsigned int flags)
 struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr, 
pte_t pte)
 {
unsigned long pfn = pte_pfn(pte);
+   struct page *page;

if (unlikely(vma->vm_flags & VM_PFNMAP)) {
unsigned long off = (addr - vma->vm_start) >> PAGE_SHIFT;
@@ -413,7 +414,11 @@ struct page *vm_normal_page(struct vm_area_struct *vma, 
unsigned long addr, pte_
 * The PAGE_ZERO() pages and various VDSO mappings can
 * cause them to exist.
 */
-   return pfn_to_page(pfn);
+   page = pfn_to_page(pfn);
+   if (PageReserved(page))
+   page = NULL;
+
+   return page;
 }

 /*
@@ -968,7 +973,7 @@ no_page_table:
if (flags & FOLL_ANON) {
page = ZERO_PAGE(address);
if (flags & FOLL_GET)
-   get_page(page);
+   page = alloc_page(GFP_KERNEL | GFP_ZERO);
BUG_ON(flags & FOLL_WRITE);
}
return page;
@@ -1131,9 +1136,6 @@ static int zeromap_pte_range(struct mm_struct *mm, pmd_t 
*pmd,
pte++;
break;
}
-   page_cache_get(page);
-   page_add_file_rmap(page);
-   inc_mm_counter(mm, file_rss);
set_pte_at(mm, addr, pte, zero_pte);
} while (pte++, addr += PAGE_SIZE, addr != end);
arch_leave_lazy_mmu_mode();
@@ -1717,7 +1719,7 @@ gotten:

if (unlikely(anon_vma_prepare(vma)))
goto oom;
-

Re: remove zero_page (was Re: -mm merge plans for 2.6.24)

2007-10-09 Thread Hugh Dickins

On Tue, 9 Oct 2007, Nick Piggin wrote:
> by it ;) To prove my point: the *first* approach I posted to fix this
> problem was exactly a patch to special-case the zero_page refcounting
> which was removed with my PageReserved patch. Neither Hugh nor yourself
> liked it one bit!

True (speaking for me; I forget whether Linus ever got to see it).

I apologize to you, Nick, for getting you into this position of
fighting for something which wasn't your choice in the first place.

If I thought we'd have a better kernel by dropping this patch and
going back to one that just avoids the refcounting, I'd say do it.
No, I still think it's worth trying this one first.

But best have your avoid-the-refcounting patch ready and reviewed
for emergency use if regression does show up somewhere.

Thanks,
Hugh

[My mails out are at present getting randomly delayed by six hours or
so, which makes it extra hard for me to engage usefully in any thread.]
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: remove zero_page (was Re: -mm merge plans for 2.6.24)

2007-10-09 Thread Linus Torvalds

On Tue, 9 Oct 2007, Nick Piggin wrote:
> 
> I gave 2 other numbers. After that, it really doesn't matter if I give
> you 2 numbers or 200, because it wouldn't change the fact that there
> are 3 programs using the ZERO_PAGE that we'll never know about.

You gave me no timings what-so-ever. Yes, you said "1000 page faults", but 
no, I have yet to see a *single* actual performance number.

Maybe I missed it? Or maybe you just never did them.

Was it really so non-obvious that I actually wanted *performance* numbers, 
not just some random numbers about how many page faults you have? Or did 
you post them somewhere else? I don't have any memory of having seen any 
performance numbers what-so-ever, but I admittedly get too much email.

Here's three numbers of my own: 8, 17 and 975.

So I gave you "numbers", but what do they _mean_? 

So let me try one more time:

 - I don't want any excuses about how bad PAGE_ZERO is. You made it bad, 
   it wasn't bad before.
 - I want numbers. I want the commit message to tell us *why* this is 
   done. The numbers I want is performance numbers, not handwave numbers. 
   Both for the bad case that it's supposed to fix, *and* for "normal 
   load".
 - I want you to just say that if it turns out that there are people who 
   use ZERO_PAGE, you stop calling them crazy, and promise to look at the 
   alternatives.

How much clearer can I be? I have said several times that I think this 
patch is kind of sad, and the reason I think it's sad is that you (and 
Hugh) convinced me to take the patch that made it sad in the first place. 
It didn't *use* to be bad. And I've use ZERO_PAGE myself for timing, I've 
had nice test-programs that knew that it could ignore cache effects and 
get pure TLB effects when it just allocated memory and didn't write to it.

That's why I don't like the lack of numbers. That's why I didn't like the 
original commit message that tried to blame the wrong part. That's why I 
didn't like this patch to begin with.

But I'm perfectly ready to take it, and see if anybody complains. 
Hopefully nobody ever will.

But by now I absolutely *detest* this patch because of its history, and 
how I *told* you guys what the reserved bit did, and how you totally 
ignored it, and then tried to blame ZERO_PAGE for that. So yes, I want the 
patch to be accompanied by an explanation, which includes the performance 
side of why it is wanted/needed in the first place.

If this patch didn't have that kind of history, I wouldn't give a flying f 
about it. As it is, this whole thing has a background.

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: remove zero_page (was Re: -mm merge plans for 2.6.24)

2007-10-09 Thread Nick Piggin

On Wednesday 10 October 2007 12:22, Linus Torvalds wrote:
> On Tue, 9 Oct 2007, Nick Piggin wrote:
> > Where do you suggest I go from here? Is there any way I can
> > convince you to try it? Make it a config option? (just kidding)
>
> No, I'll take the damn patch, but quite frankly, I think your arguments
> suck.
>
> I've told you so before, and asked for numbers, and all you do is

I gave 2 other numbers. After that, it really doesn't matter if I give
you 2 numbers or 200, because it wouldn't change the fact that there
are 3 programs using the ZERO_PAGE that we'll never know about.

> handwave. And this is like the *third*time*, and you don't even seem to
> admit that you're handwaving.

I think I've always admitted I'm handwaving in my assertion that programs
would not be using the zero page. My handwaving is an attempt to show that
I have some vaguely reasonable reasons to think it will be OK to remove it.
That's all.

> So let's do it, but dammit:
>  - make sure there aren't any invalid statements like this in the final
>commit message.

Was the last one OK?

>  - if somebody shows that you were wrong, and points to a real load,
>please never *ever* make excuses for this again, ok?
>
> Is that a deal? I hope we'll never need to hear about this again, but I
> really object to the way you've tried to "sell" this thing, by basically
> starting out dishonest about what the problem was,

The dishonesty in the changelog is more of an oversight than an attempt
to get it merged. It never even crossed my mind that you would be fooled
by it ;) To prove my point: the *first* approach I posted to fix this
problem was exactly a patch to special-case the zero_page refcounting
which was removed with my PageReserved patch. Neither Hugh nor yourself
liked it one bit!

So I have no particular bias against the zero page or problem admitting
I introduced the issue. I do just think this could be a nice opportunity
to try getting rid of the zero page and simplifiy things.

> and even now I've yet 
> to see a *single* performance number even though I've asked for them
> (except for the problem case, which was introduced by *you*)

Basically: I don't know what else to show you! I expect it would be
relatively difficult to find a measurable difference between no zero-page
and zero-page with no refcounting problem. Precisely because I can't
find anything that really makes use of it. Again: what numbers can I
get for you that would make you feel happier about it?

Anyway, before you change your mind: it's a deal! If somebody screams
then I'll have a patch for you to reintroduce the zero page minus
refcounting the next day.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: remove zero_page (was Re: -mm merge plans for 2.6.24)

2007-10-09 Thread Linus Torvalds

On Tue, 9 Oct 2007, Nick Piggin wrote:
> 
> Where do you suggest I go from here? Is there any way I can
> convince you to try it? Make it a config option? (just kidding)

No, I'll take the damn patch, but quite frankly, I think your arguments 
suck.

I've told you so before, and asked for numbers, and all you do is 
handwave. And this is like the *third*time*, and you don't even seem to 
admit that you're handwaving.

So let's do it, but dammit:
 - make sure there aren't any invalid statements like this in the final 
   commit message.
 - if somebody shows that you were wrong, and points to a real load, 
   please never *ever* make excuses for this again, ok? 

Is that a deal? I hope we'll never need to hear about this again, but I 
really object to the way you've tried to "sell" this thing, by basically 
starting out dishonest about what the problem was, and even now I've yet 
to see a *single* performance number even though I've asked for them 
(except for the problem case, which was introduced by *you*)

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: remove zero_page (was Re: -mm merge plans for 2.6.24)

2007-10-09 Thread Nick Piggin

On Wednesday 10 October 2007 00:52, Linus Torvalds wrote:
> On Tue, 9 Oct 2007, Nick Piggin wrote:
> > I have done some tests which indicate a couple of very basic common tools
> > don't do much zero-page activity (ie. kbuild). And also combined with
> > some logical arguments to say that a "sane" app wouldn't be using
> > zero_page much. (basically -- if the app cares about memory or cache
> > footprint and is using many pages of zeroes, then it should have a more
> > compressed representation of zeroes anyway).
>
> One of the things that zero-page has been used for is absolutely *huge*
> (but sparse) arrays in Fortan programs.
>
> At least in traditional fortran, it was very hard to do dynamic
> allocations, so people would allocate the *maximum* array statically, and
> then not necessarily use everything. I don't know if the pages ever even
> got paged in,

In which case, they would not be using the ZERO_PAGE?
If they were paging in (ie. reading) huge reams of zeroes,
then maybe their algorithms aren't so good anyway? (I don't
know).

> but this is the kind of usage which is *not* insane. 

Yeah, that's why I use the double quotes... I wonder how to
find out, though. I guess I could ask SGI if they could ask
around -- but that still comes back to the problem of not being
able to ever conclusively show that there are no real users of
the ZERO_PAGE.

Where do you suggest I go from here? Is there any way I can
convince you to try it? Make it a config option? (just kidding)
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: remove zero_page (was Re: -mm merge plans for 2.6.24)

2007-10-09 Thread Hugh Dickins

On Tue, 9 Oct 2007, Nick Piggin wrote:
> 
> The commit b5810039a54e5babf428e9a1e89fc1940fabff11 contains the note
> 
>   A last caveat: the ZERO_PAGE is now refcounted and managed with rmap
>   (and thus mapcounted and count towards shared rss).  These writes to
>   the struct page could cause excessive cacheline bouncing on big
>   systems.  There are a number of ways this could be addressed if it is
>   an issue.
> 
> And indeed this cacheline bouncing has shown up on large SGI systems.
> There was a situation where an Altix system was essentially livelocked
> tearing down ZERO_PAGE pagetables when an HPC app aborted during startup.
> This situation can be avoided in userspace, but it does highlight the
> potential scalability problem with refcounting ZERO_PAGE, and corner
> cases where it can really hurt (we don't want the system to livelock!).
> 
> There are several broad ways to fix this problem:
> 1. add back some special casing to avoid refcounting ZERO_PAGE
> 2. per-node or per-cpu ZERO_PAGES
> 3. remove the ZERO_PAGE completely
> 
> I will argue for 3. The others should also fix the problem, but they
> result in more complex code than does 3, with little or no real benefit
> that I can see. Why?

Sorry, I've no useful arguments to add (and my testing was too much
like yours to add any value), but I do want to go on record as still
a strong supporter of approach 3 and your patch.

Hugh
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: remove zero_page (was Re: -mm merge plans for 2.6.24)

2007-10-09 Thread Linus Torvalds

On Tue, 9 Oct 2007, Nick Piggin wrote:
> 
> I have done some tests which indicate a couple of very basic common tools
> don't do much zero-page activity (ie. kbuild). And also combined with some
> logical arguments to say that a "sane" app wouldn't be using zero_page much.
> (basically -- if the app cares about memory or cache footprint and is using
> many pages of zeroes, then it should have a more compressed representation
> of zeroes anyway).

One of the things that zero-page has been used for is absolutely *huge* 
(but sparse) arrays in Fortan programs.

At least in traditional fortran, it was very hard to do dynamic 
allocations, so people would allocate the *maximum* array statically, and 
then not necessarily use everything. I don't know if the pages ever even 
got paged in, but this is the kind of usage which is *not* insane.

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

r/o bind mounts, was Re: -mm merge plans for 2.6.24

2007-10-09 Thread Christoph Hellwig

> r-o-bind-mounts-filesystem-helpers-for-custom-struct-files.patch
> r-o-bind-mounts-rearrange-may_open-to-be-r-o-friendly.patch
> r-o-bind-mounts-give-permission-a-local-mnt-variable.patch
> r-o-bind-mounts-create-cleanup-helper-svc_msnfs.patch

<...>

>   Doesn't seem ready yet

I guess we need some more time for this patchset to mature, yes.  But
the four patches I've quoted about are just small preparator cleanups
that should go into 2.6.24.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: remove zero_page (was Re: -mm merge plans for 2.6.24)

2007-10-09 Thread Nick Piggin

On Thursday 04 October 2007 01:21, Linus Torvalds wrote:
> On Wed, 3 Oct 2007, Nick Piggin wrote:
> > I don't know if Linus actually disliked the patch itself, or disliked
> > my (maybe confusingly worded) rationale?
>
> Yes. I'd happily accept the patch, but I'd want it clarified and made
> obvious what the problem was - and it wasn't the zero page itself, it was
> a regression in the VM that made it less palatable.

OK, revised changelog at the end of this mail...

> I also thought that there were potentially better solutions, namely to
> simply avoid the VM regression, but I also acknowledge that they may not
> be worth it - I just want them to be on the table.
>
> In short: the real cost of the zero page was the reference counting on the
> page that we do these days. For example, I really do believe that the
> problem could fairly easily be fixed by simply not considering zero_page
> to be a "vm_normal_page()". We already *do* have support for pages not
> getting ref-counted (since we need it for other things), and I think that
> zero_page very naturally falls into exactly that situation.
>
> So in many ways, I would think that turning zero-page into a nonrefcounted
> page (the same way we really do have to do for other things anyway) would
> be the much more *direct* solution, and in many ways the obvious one.

That was my first approach. It isn't completely trivial, but
vm_normal_page() does play a part (but we end up needing a
vm_normal_page() variant -- IIRC vm_normal_or_zero_page()). But taken
as a whole, non-refcounted zero_page is obviously a lot more work than
no zero page at all :)

> HOWEVER - if people think that it's easier to remove zero_page, and want
> to do it for other reasons, *AND* can hopefully even back up the claim
> that it never matters with numbers (ie that the extra pagefaults just make
> the whole zero-page thing pointless), then I'd certainly accept the patch.

I have done some tests which indicate a couple of very basic common tools
don't do much zero-page activity (ie. kbuild). And also combined with some
logical arguments to say that a "sane" app wouldn't be using zero_page much.
(basically -- if the app cares about memory or cache footprint and is using
many pages of zeroes, then it should have a more compressed representation
of zeroes anyway).

However there is a window for some "insane" code to regress without the
zero_page. I'm not arguing that we don't care about those, however I have
no way to guarantee they don't exist. I hope we wouldn't get a potentially
useless complexity like this stuck in the VM forever just because we don't
_know_ whether it's useful to anybody...

How about something like this?

---
From: Nick Piggin <[EMAIL PROTECTED]>

The commit b5810039a54e5babf428e9a1e89fc1940fabff11 contains the note

  A last caveat: the ZERO_PAGE is now refcounted and managed with rmap
  (and thus mapcounted and count towards shared rss).  These writes to
  the struct page could cause excessive cacheline bouncing on big
  systems.  There are a number of ways this could be addressed if it is
  an issue.

And indeed this cacheline bouncing has shown up on large SGI systems.
There was a situation where an Altix system was essentially livelocked
tearing down ZERO_PAGE pagetables when an HPC app aborted during startup.
This situation can be avoided in userspace, but it does highlight the
potential scalability problem with refcounting ZERO_PAGE, and corner
cases where it can really hurt (we don't want the system to livelock!).

There are several broad ways to fix this problem:
1. add back some special casing to avoid refcounting ZERO_PAGE
2. per-node or per-cpu ZERO_PAGES
3. remove the ZERO_PAGE completely

I will argue for 3. The others should also fix the problem, but they
result in more complex code than does 3, with little or no real benefit
that I can see. Why?

Inserting a ZERO_PAGE for anonymous read faults appears to be a false
optimisation: if an application is performance critical, it would not be
doing many read faults of new memory, or at least it could be expected to
write to that memory soon afterwards. If cache or memory use is critical,
it should not be working with a significant number of ZERO_PAGEs anyway
(a more compact representation of zeroes should be used).

As a sanity check -- mesuring on my desktop system, there are never many
mappings to the ZERO_PAGE (eg. 2 or 3), thus memory usage here should not
increase much without it.

When running a make -j4 kernel compile on my dual core system, there are
about 1,000 mappings to the ZERO_PAGE created per second, but about 1,000
ZERO_PAGE COW faults per second (less than 1 ZERO_PAGE mapping per second
is torn down without being COWed). So removing ZERO_PAGE will save 1,000
page faults per second, and 2,000 bounces of the ZERO_PAGE struct page
cacheline per second when running kbuild, while saving less than 1 page
clearing operation per second (even 1 page clear is far cheaper than a
thousand cacheline

Re: remove zero_page (was Re: -mm merge plans for 2.6.24)

2007-10-09 Thread Nick Piggin

On Thursday 04 October 2007 01:21, Linus Torvalds wrote:
 On Wed, 3 Oct 2007, Nick Piggin wrote:
  I don't know if Linus actually disliked the patch itself, or disliked
  my (maybe confusingly worded) rationale?

 Yes. I'd happily accept the patch, but I'd want it clarified and made
 obvious what the problem was - and it wasn't the zero page itself, it was
 a regression in the VM that made it less palatable.

OK, revised changelog at the end of this mail...


 I also thought that there were potentially better solutions, namely to
 simply avoid the VM regression, but I also acknowledge that they may not
 be worth it - I just want them to be on the table.

 In short: the real cost of the zero page was the reference counting on the
 page that we do these days. For example, I really do believe that the
 problem could fairly easily be fixed by simply not considering zero_page
 to be a vm_normal_page(). We already *do* have support for pages not
 getting ref-counted (since we need it for other things), and I think that
 zero_page very naturally falls into exactly that situation.

 So in many ways, I would think that turning zero-page into a nonrefcounted
 page (the same way we really do have to do for other things anyway) would
 be the much more *direct* solution, and in many ways the obvious one.

That was my first approach. It isn't completely trivial, but
vm_normal_page() does play a part (but we end up needing a
vm_normal_page() variant -- IIRC vm_normal_or_zero_page()). But taken
as a whole, non-refcounted zero_page is obviously a lot more work than
no zero page at all :)


 HOWEVER - if people think that it's easier to remove zero_page, and want
 to do it for other reasons, *AND* can hopefully even back up the claim
 that it never matters with numbers (ie that the extra pagefaults just make
 the whole zero-page thing pointless), then I'd certainly accept the patch.

I have done some tests which indicate a couple of very basic common tools
don't do much zero-page activity (ie. kbuild). And also combined with some
logical arguments to say that a sane app wouldn't be using zero_page much.
(basically -- if the app cares about memory or cache footprint and is using
many pages of zeroes, then it should have a more compressed representation
of zeroes anyway).

However there is a window for some insane code to regress without the
zero_page. I'm not arguing that we don't care about those, however I have
no way to guarantee they don't exist. I hope we wouldn't get a potentially
useless complexity like this stuck in the VM forever just because we don't
_know_ whether it's useful to anybody...

How about something like this?

---
From: Nick Piggin [EMAIL PROTECTED]

The commit b5810039a54e5babf428e9a1e89fc1940fabff11 contains the note

  A last caveat: the ZERO_PAGE is now refcounted and managed with rmap
  (and thus mapcounted and count towards shared rss).  These writes to
  the struct page could cause excessive cacheline bouncing on big
  systems.  There are a number of ways this could be addressed if it is
  an issue.

And indeed this cacheline bouncing has shown up on large SGI systems.
There was a situation where an Altix system was essentially livelocked
tearing down ZERO_PAGE pagetables when an HPC app aborted during startup.
This situation can be avoided in userspace, but it does highlight the
potential scalability problem with refcounting ZERO_PAGE, and corner
cases where it can really hurt (we don't want the system to livelock!).

There are several broad ways to fix this problem:
1. add back some special casing to avoid refcounting ZERO_PAGE
2. per-node or per-cpu ZERO_PAGES
3. remove the ZERO_PAGE completely

I will argue for 3. The others should also fix the problem, but they
result in more complex code than does 3, with little or no real benefit
that I can see. Why?

Inserting a ZERO_PAGE for anonymous read faults appears to be a false
optimisation: if an application is performance critical, it would not be
doing many read faults of new memory, or at least it could be expected to
write to that memory soon afterwards. If cache or memory use is critical,
it should not be working with a significant number of ZERO_PAGEs anyway
(a more compact representation of zeroes should be used).

As a sanity check -- mesuring on my desktop system, there are never many
mappings to the ZERO_PAGE (eg. 2 or 3), thus memory usage here should not
increase much without it.

When running a make -j4 kernel compile on my dual core system, there are
about 1,000 mappings to the ZERO_PAGE created per second, but about 1,000
ZERO_PAGE COW faults per second (less than 1 ZERO_PAGE mapping per second
is torn down without being COWed). So removing ZERO_PAGE will save 1,000
page faults per second, and 2,000 bounces of the ZERO_PAGE struct page
cacheline per second when running kbuild, while saving less than 1 page
clearing operation per second (even 1 page clear is far cheaper than a
thousand cacheline bounces between CPUs).

Of course,

r/o bind mounts, was Re: -mm merge plans for 2.6.24

2007-10-09 Thread Christoph Hellwig

 r-o-bind-mounts-filesystem-helpers-for-custom-struct-files.patch
 r-o-bind-mounts-rearrange-may_open-to-be-r-o-friendly.patch
 r-o-bind-mounts-give-permission-a-local-mnt-variable.patch
 r-o-bind-mounts-create-cleanup-helper-svc_msnfs.patch

...

   Doesn't seem ready yet

I guess we need some more time for this patchset to mature, yes.  But
the four patches I've quoted about are just small preparator cleanups
that should go into 2.6.24.

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: remove zero_page (was Re: -mm merge plans for 2.6.24)

2007-10-09 Thread Linus Torvalds



On Tue, 9 Oct 2007, Nick Piggin wrote:
 
 I have done some tests which indicate a couple of very basic common tools
 don't do much zero-page activity (ie. kbuild). And also combined with some
 logical arguments to say that a sane app wouldn't be using zero_page much.
 (basically -- if the app cares about memory or cache footprint and is using
 many pages of zeroes, then it should have a more compressed representation
 of zeroes anyway).

One of the things that zero-page has been used for is absolutely *huge* 
(but sparse) arrays in Fortan programs.

At least in traditional fortran, it was very hard to do dynamic 
allocations, so people would allocate the *maximum* array statically, and 
then not necessarily use everything. I don't know if the pages ever even 
got paged in, but this is the kind of usage which is *not* insane.

Linus
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: remove zero_page (was Re: -mm merge plans for 2.6.24)

2007-10-09 Thread Hugh Dickins

On Tue, 9 Oct 2007, Nick Piggin wrote:
 
 The commit b5810039a54e5babf428e9a1e89fc1940fabff11 contains the note
 
   A last caveat: the ZERO_PAGE is now refcounted and managed with rmap
   (and thus mapcounted and count towards shared rss).  These writes to
   the struct page could cause excessive cacheline bouncing on big
   systems.  There are a number of ways this could be addressed if it is
   an issue.
 
 And indeed this cacheline bouncing has shown up on large SGI systems.
 There was a situation where an Altix system was essentially livelocked
 tearing down ZERO_PAGE pagetables when an HPC app aborted during startup.
 This situation can be avoided in userspace, but it does highlight the
 potential scalability problem with refcounting ZERO_PAGE, and corner
 cases where it can really hurt (we don't want the system to livelock!).
 
 There are several broad ways to fix this problem:
 1. add back some special casing to avoid refcounting ZERO_PAGE
 2. per-node or per-cpu ZERO_PAGES
 3. remove the ZERO_PAGE completely
 
 I will argue for 3. The others should also fix the problem, but they
 result in more complex code than does 3, with little or no real benefit
 that I can see. Why?

Sorry, I've no useful arguments to add (and my testing was too much
like yours to add any value), but I do want to go on record as still
a strong supporter of approach 3 and your patch.

Hugh
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: remove zero_page (was Re: -mm merge plans for 2.6.24)

2007-10-09 Thread Nick Piggin

On Wednesday 10 October 2007 00:52, Linus Torvalds wrote:
 On Tue, 9 Oct 2007, Nick Piggin wrote:
  I have done some tests which indicate a couple of very basic common tools
  don't do much zero-page activity (ie. kbuild). And also combined with
  some logical arguments to say that a sane app wouldn't be using
  zero_page much. (basically -- if the app cares about memory or cache
  footprint and is using many pages of zeroes, then it should have a more
  compressed representation of zeroes anyway).

 One of the things that zero-page has been used for is absolutely *huge*
 (but sparse) arrays in Fortan programs.

 At least in traditional fortran, it was very hard to do dynamic
 allocations, so people would allocate the *maximum* array statically, and
 then not necessarily use everything. I don't know if the pages ever even
 got paged in,

In which case, they would not be using the ZERO_PAGE?
If they were paging in (ie. reading) huge reams of zeroes,
then maybe their algorithms aren't so good anyway? (I don't
know).


 but this is the kind of usage which is *not* insane. 

Yeah, that's why I use the double quotes... I wonder how to
find out, though. I guess I could ask SGI if they could ask
around -- but that still comes back to the problem of not being
able to ever conclusively show that there are no real users of
the ZERO_PAGE.

Where do you suggest I go from here? Is there any way I can
convince you to try it? Make it a config option? (just kidding)
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: remove zero_page (was Re: -mm merge plans for 2.6.24)

2007-10-09 Thread Linus Torvalds



On Tue, 9 Oct 2007, Nick Piggin wrote:
 
 Where do you suggest I go from here? Is there any way I can
 convince you to try it? Make it a config option? (just kidding)

No, I'll take the damn patch, but quite frankly, I think your arguments 
suck.

I've told you so before, and asked for numbers, and all you do is 
handwave. And this is like the *third*time*, and you don't even seem to 
admit that you're handwaving.

So let's do it, but dammit:
 - make sure there aren't any invalid statements like this in the final 
   commit message.
 - if somebody shows that you were wrong, and points to a real load, 
   please never *ever* make excuses for this again, ok? 

Is that a deal? I hope we'll never need to hear about this again, but I 
really object to the way you've tried to sell this thing, by basically 
starting out dishonest about what the problem was, and even now I've yet 
to see a *single* performance number even though I've asked for them 
(except for the problem case, which was introduced by *you*)

Linus
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: remove zero_page (was Re: -mm merge plans for 2.6.24)

2007-10-09 Thread Nick Piggin

On Wednesday 10 October 2007 12:22, Linus Torvalds wrote:
 On Tue, 9 Oct 2007, Nick Piggin wrote:
  Where do you suggest I go from here? Is there any way I can
  convince you to try it? Make it a config option? (just kidding)

 No, I'll take the damn patch, but quite frankly, I think your arguments
 suck.

 I've told you so before, and asked for numbers, and all you do is

I gave 2 other numbers. After that, it really doesn't matter if I give
you 2 numbers or 200, because it wouldn't change the fact that there
are 3 programs using the ZERO_PAGE that we'll never know about.


 handwave. And this is like the *third*time*, and you don't even seem to
 admit that you're handwaving.

I think I've always admitted I'm handwaving in my assertion that programs
would not be using the zero page. My handwaving is an attempt to show that
I have some vaguely reasonable reasons to think it will be OK to remove it.
That's all.


 So let's do it, but dammit:
  - make sure there aren't any invalid statements like this in the final
commit message.

Was the last one OK?


  - if somebody shows that you were wrong, and points to a real load,
please never *ever* make excuses for this again, ok?

 Is that a deal? I hope we'll never need to hear about this again, but I
 really object to the way you've tried to sell this thing, by basically
 starting out dishonest about what the problem was,

The dishonesty in the changelog is more of an oversight than an attempt
to get it merged. It never even crossed my mind that you would be fooled
by it ;) To prove my point: the *first* approach I posted to fix this
problem was exactly a patch to special-case the zero_page refcounting
which was removed with my PageReserved patch. Neither Hugh nor yourself
liked it one bit!

So I have no particular bias against the zero page or problem admitting
I introduced the issue. I do just think this could be a nice opportunity
to try getting rid of the zero page and simplifiy things.


 and even now I've yet 
 to see a *single* performance number even though I've asked for them
 (except for the problem case, which was introduced by *you*)

Basically: I don't know what else to show you! I expect it would be
relatively difficult to find a measurable difference between no zero-page
and zero-page with no refcounting problem. Precisely because I can't
find anything that really makes use of it. Again: what numbers can I
get for you that would make you feel happier about it?

Anyway, before you change your mind: it's a deal! If somebody screams
then I'll have a patch for you to reintroduce the zero page minus
refcounting the next day.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: remove zero_page (was Re: -mm merge plans for 2.6.24)

2007-10-09 Thread Linus Torvalds



On Tue, 9 Oct 2007, Nick Piggin wrote:
 
 I gave 2 other numbers. After that, it really doesn't matter if I give
 you 2 numbers or 200, because it wouldn't change the fact that there
 are 3 programs using the ZERO_PAGE that we'll never know about.

You gave me no timings what-so-ever. Yes, you said 1000 page faults, but 
no, I have yet to see a *single* actual performance number.

Maybe I missed it? Or maybe you just never did them.

Was it really so non-obvious that I actually wanted *performance* numbers, 
not just some random numbers about how many page faults you have? Or did 
you post them somewhere else? I don't have any memory of having seen any 
performance numbers what-so-ever, but I admittedly get too much email.

Here's three numbers of my own: 8, 17 and 975.

So I gave you numbers, but what do they _mean_? 

So let me try one more time:

 - I don't want any excuses about how bad PAGE_ZERO is. You made it bad, 
   it wasn't bad before.
 - I want numbers. I want the commit message to tell us *why* this is 
   done. The numbers I want is performance numbers, not handwave numbers. 
   Both for the bad case that it's supposed to fix, *and* for normal 
   load.
 - I want you to just say that if it turns out that there are people who 
   use ZERO_PAGE, you stop calling them crazy, and promise to look at the 
   alternatives.

How much clearer can I be? I have said several times that I think this 
patch is kind of sad, and the reason I think it's sad is that you (and 
Hugh) convinced me to take the patch that made it sad in the first place. 
It didn't *use* to be bad. And I've use ZERO_PAGE myself for timing, I've 
had nice test-programs that knew that it could ignore cache effects and 
get pure TLB effects when it just allocated memory and didn't write to it.

That's why I don't like the lack of numbers. That's why I didn't like the 
original commit message that tried to blame the wrong part. That's why I 
didn't like this patch to begin with.

But I'm perfectly ready to take it, and see if anybody complains. 
Hopefully nobody ever will.

But by now I absolutely *detest* this patch because of its history, and 
how I *told* you guys what the reserved bit did, and how you totally 
ignored it, and then tried to blame ZERO_PAGE for that. So yes, I want the 
patch to be accompanied by an explanation, which includes the performance 
side of why it is wanted/needed in the first place.

If this patch didn't have that kind of history, I wouldn't give a flying f 
about it. As it is, this whole thing has a background.

Linus
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: remove zero_page (was Re: -mm merge plans for 2.6.24)

2007-10-09 Thread Hugh Dickins

On Tue, 9 Oct 2007, Nick Piggin wrote:
 by it ;) To prove my point: the *first* approach I posted to fix this
 problem was exactly a patch to special-case the zero_page refcounting
 which was removed with my PageReserved patch. Neither Hugh nor yourself
 liked it one bit!

True (speaking for me; I forget whether Linus ever got to see it).

I apologize to you, Nick, for getting you into this position of
fighting for something which wasn't your choice in the first place.

If I thought we'd have a better kernel by dropping this patch and
going back to one that just avoids the refcounting, I'd say do it.
No, I still think it's worth trying this one first.

But best have your avoid-the-refcounting patch ready and reviewed
for emergency use if regression does show up somewhere.

Thanks,
Hugh

[My mails out are at present getting randomly delayed by six hours or
so, which makes it extra hard for me to engage usefully in any thread.]
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: remove zero_page (was Re: -mm merge plans for 2.6.24)

2007-10-09 Thread Linus Torvalds



On Wed, 10 Oct 2007, Hugh Dickins wrote:

 On Tue, 9 Oct 2007, Nick Piggin wrote:
  by it ;) To prove my point: the *first* approach I posted to fix this
  problem was exactly a patch to special-case the zero_page refcounting
  which was removed with my PageReserved patch. Neither Hugh nor yourself
  liked it one bit!
 
 True (speaking for me; I forget whether Linus ever got to see it).

The problem is, those first remove ref-counting patches were ugly 
*regardless* of ZERO_PAGE.

We (yes, largely I) fixed up the mess since. The whole vm_normal_page() 
and the magic PFN_REMAP thing got rid of a lot of the problems.

And I bet that we could do something very similar wrt the zero page too.

Basically, the ZERO page could act pretty much exactly like a PFN_REMAP 
page: the VM would not touch it. No rmap, no page refcounting, no nothing.

This following patch is not meant to be even half-way correct (it's not 
even _remotely_ tested), but is just meant to be a rough grep for 
ZERO_PAGE in the VM, and see what happens if you don't ref-count it.

Would something like the below work? I dunno. But I suspect it would. I 
doubt anybody has the energy to actually try to actually follow through on 
it, which is why I'm not pushing on it any more, and why I'll accept 
Nick's patch to just remove ZERO_PAGE, but I really *am* very unhappy 
about this.

The page refcounting cleanups in the VM back when were really painful. 
And dammit, I felt like I was the one who had to clean them up after you 
guys. Which makes me really testy on this subject.

And yes, I also admit that the vm_normal_page() and the PFN_REMAP thing 
ended up really improving the VM, and we're pretty much certainly better 
off now than we were before - but I also think that ZERO_PAGE etc could 
easily be handled with the same model. After all, if we can make 
mmap(/dev/mem) work with COW and everything, I'd argue that ZERO_PAGE 
really is just a very very small special case of that!

Totally half-assed untested patch to follow, not meant for anything but a 
I think this kind of approach should have worked too comment.

So I'm not pushing the patch below, I'm just fighting for people realizing 
that

 - the kernel has *always* (since pretty much day 1) done that ZERO_PAGE 
   thing. This means that I would not be at all surprised if some 
   application basically depends on it. I've written test-programs that 
   depends on it - maybe people have written other code that basically has 
   been written for and tested with a kernel that has basically always 
   made read-only zero pages extra cheap.

   So while it may be true that removing ZERO_PAGE won't affect anybody, I 
   don't think it's a given, and I also don't think it's sane calling 
   people crazy for depending on something that has always been true 
   under Linux for the last 15+ years. There are few behaviors that have 
   been around for that long.

 - make sure the commit message is accurate as to need for this (ie not 
   claim that the ZERO_PAGE itself was the problem, and give some actual 
   performance numbers on what is going on)

that's all.

Linus

---
 mm/memory.c  |   17 -
 mm/migrate.c |2 +-
 2 files changed, 9 insertions(+), 10 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index f82b359..0a8cc88 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -386,6 +386,7 @@ static inline int is_cow_mapping(unsigned int flags)
 struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr, 
pte_t pte)
 {
unsigned long pfn = pte_pfn(pte);
+   struct page *page;
 
if (unlikely(vma-vm_flags  VM_PFNMAP)) {
unsigned long off = (addr - vma-vm_start)  PAGE_SHIFT;
@@ -413,7 +414,11 @@ struct page *vm_normal_page(struct vm_area_struct *vma, 
unsigned long addr, pte_
 * The PAGE_ZERO() pages and various VDSO mappings can
 * cause them to exist.
 */
-   return pfn_to_page(pfn);
+   page = pfn_to_page(pfn);
+   if (PageReserved(page))
+   page = NULL;
+
+   return page;
 }
 
 /*
@@ -968,7 +973,7 @@ no_page_table:
if (flags  FOLL_ANON) {
page = ZERO_PAGE(address);
if (flags  FOLL_GET)
-   get_page(page);
+   page = alloc_page(GFP_KERNEL | GFP_ZERO);
BUG_ON(flags  FOLL_WRITE);
}
return page;
@@ -1131,9 +1136,6 @@ static int zeromap_pte_range(struct mm_struct *mm, pmd_t 
*pmd,
pte++;
break;
}
-   page_cache_get(page);
-   page_add_file_rmap(page);
-   inc_mm_counter(mm, file_rss);
set_pte_at(mm, addr, pte, zero_pte);
} while (pte++, addr += PAGE_SIZE, addr != end);
arch_leave_lazy_mmu_mode();
@@ -1717,7 +1719,7 @@ gotten:
 
if (unlikely(anon_vma_prepare(vma)))
goto oom;
-   if (old_page ==

Re: Memory controller merge (was Re: -mm merge plans for 2.6.24)

2007-10-07 Thread Balbir Singh

Hugh Dickins wrote:
> On Fri, 5 Oct 2007, Balbir Singh wrote:
>> Hugh Dickins wrote:
>>> That's where it should happen, yes; but my point is that it very
>>> often does not.  Because the swap cache page (read in as part of
>>> the readaround cluster of some other cgroup, or in swapoff by some
>>> other cgroup) is already assigned to that other cgroup (by the
>>> mem_cgroup_cache_charge in __add_to_swap_cache), and so goes "The
>>> page_cgroup exists and the page has already been accounted" route
>>> when mem_cgroup_charge is called from do_swap_page.  Doesn't it?
>>>
>> You are right, at this point I am beginning to wonder if I should
>> account for the swap cache at all? We account for the pages in RSS
>> and when the page comes back into the page table(s) via do_swap_page.
>> If we believe that the swap cache is transitional and the current
>> expected working behaviour does not seem right or hard to fix,
>> it might be easy to ignore unuse_pte() and add/remove_from_swap_cache()
>> for accounting and control.
> 
> It would be wrong to ignore the unuse_pte() case: what it's intending
> to do is correct, it's just being prevented by the swapcache issue
> from doing what it intends at present.
> 

OK

> (Though I'm not thrilled with the idea of it causing an admin's
> swapoff to fail because of a cgroup reaching mem limit there, I do
> agree with your earlier argument that that's the right thing to happen,
> and it's up to the admin to fix things up - my original objection came
> from not realizing that normally the cgroup will reclaim from itself
> to free its mem. 

I'm glad we have that sorted out.

Hmm, would the charge fail or the mm get OOM'ed?)
> 

Right now, we OOM if charging and reclaim fails.


> Ignoring add_to/remove_from swap cache is what I've tried before,
> and again today.  It's not enough: if you trying run a memhog
> (something that allocates and touches more memory than the cgroup
> is allowed, relying on pushing out to swap to complete), then that
> works well with the present accounting in add_to/remove_from swap
> cache, but it OOMs once I remove the memcontrol mods from
> mm/swap_state.c.  I keep going back to investigate why, keep on
> thinking I understand it, then later realize I don't.  Please
> give it a try, I hope you've got better mental models than I have.
> 

I will try it. Another way to try it, is to set memory.control_type
to 1, that removes charging of cache pages (both swap cache
and page cache). I just did a quick small test on the memory
controller with swap cache changes disabled and it worked fine
for me on my UML image (without OOMing). I'll try the same test
on a bigger box. Disabling swap does usually cause an
OOM for workloads using anonymous pages if the cgroup goes
over it's limit (since the cgroup cannot pushout memory).

> And I don't think it will be enough to handle shmem/tmpfs either;
> but won't worry about that until we've properly understood why
> exempting swapcache leads to those OOMs, and fixed that up.
> 

Sure.


>>> Are we misunderstanding each other, because I'm assuming
>>> MEM_CGROUP_TYPE_ALL and you're assuming MEM_CGROUP_TYPE_MAPPED?
>>> though I can't see that _MAPPED and _CACHED are actually supported,
>>> there being no reference to them outside the enum that defines them.
>> I am also assuming MEM_CGROUP_TYPE_ALL for the purpose of our
>> discussion. The accounting is split into mem_cgroup_charge() and
>> mem_cgroup_cache_charge(). While charging the caches is when we
>> check for the control_type.
> 
> It checks MEM_CGROUP_TYPE_ALL there, yes; but I can't find anything
> checking for either MEM_CGROUP_TYPE_MAPPED or MEM_CGROUP_TYPE_CACHED.
> (Or is it hidden in one of those preprocesor ## things which frustrate
> both my greps and me!?)
> 

MEM_CGROUP_TYPE_ALL is defined to be (MEM_CGROUP_TYPE_CACHED |
MEM_CGROUP_TYPE_MAPPED). I'll make that more explicit with a patch.
When the type is not MEM_CGROUP_TYPE_ALL, cached pages are ignored.

>>> Or are you deceived by that ifdef NUMA code in swapin_readahead,
>>> which propagates the fantasy that swap allocation follows vma layout?
>>> That nonsense has been around too long, I'll soon be sending a patch
>>> to remove it.
>> The swapin readahead code under #ifdef NUMA is very confusing.
> 
> I sent a patch to linux-mm last night, to remove that confusion.
> 

Thanks, I saw that.

>> I also
>> noticed another confusing thing during my test, swap cache does not
>> drop to 0, even though I've disabled all swap using swapoff. May be
>> those are tmpfs pages. The other interesting thing I tried was running
>> swapoff after a cgroup went over it's limit, the swapoff succeeded,
>> but I see strange numbers for free swap. I'll start another thread
>> after investigating a bit more.
> 
> Those indeed are strange behaviours (if the swapoff really has
> succeeded, rather than lying), I not seen such and don't have an
> explanation.  tmpfs doesn't add any weirdness there: when there's
> no swap, there

Re: Memory controller merge (was Re: -mm merge plans for 2.6.24)

2007-10-07 Thread Hugh Dickins

On Fri, 5 Oct 2007, Balbir Singh wrote:
> Hugh Dickins wrote:
> > 
> > That's where it should happen, yes; but my point is that it very
> > often does not.  Because the swap cache page (read in as part of
> > the readaround cluster of some other cgroup, or in swapoff by some
> > other cgroup) is already assigned to that other cgroup (by the
> > mem_cgroup_cache_charge in __add_to_swap_cache), and so goes "The
> > page_cgroup exists and the page has already been accounted" route
> > when mem_cgroup_charge is called from do_swap_page.  Doesn't it?
> > 
> 
> You are right, at this point I am beginning to wonder if I should
> account for the swap cache at all? We account for the pages in RSS
> and when the page comes back into the page table(s) via do_swap_page.
> If we believe that the swap cache is transitional and the current
> expected working behaviour does not seem right or hard to fix,
> it might be easy to ignore unuse_pte() and add/remove_from_swap_cache()
> for accounting and control.

It would be wrong to ignore the unuse_pte() case: what it's intending
to do is correct, it's just being prevented by the swapcache issue
from doing what it intends at present.

(Though I'm not thrilled with the idea of it causing an admin's
swapoff to fail because of a cgroup reaching mem limit there, I do
agree with your earlier argument that that's the right thing to happen,
and it's up to the admin to fix things up - my original objection came
from not realizing that normally the cgroup will reclaim from itself
to free its mem.  Hmm, would the charge fail or the mm get OOM'ed?)

Ignoring add_to/remove_from swap cache is what I've tried before,
and again today.  It's not enough: if you trying run a memhog
(something that allocates and touches more memory than the cgroup
is allowed, relying on pushing out to swap to complete), then that
works well with the present accounting in add_to/remove_from swap
cache, but it OOMs once I remove the memcontrol mods from
mm/swap_state.c.  I keep going back to investigate why, keep on
thinking I understand it, then later realize I don't.  Please
give it a try, I hope you've got better mental models than I have.

And I don't think it will be enough to handle shmem/tmpfs either;
but won't worry about that until we've properly understood why
exempting swapcache leads to those OOMs, and fixed that up.

> > Are we misunderstanding each other, because I'm assuming
> > MEM_CGROUP_TYPE_ALL and you're assuming MEM_CGROUP_TYPE_MAPPED?
> > though I can't see that _MAPPED and _CACHED are actually supported,
> > there being no reference to them outside the enum that defines them.
> 
> I am also assuming MEM_CGROUP_TYPE_ALL for the purpose of our
> discussion. The accounting is split into mem_cgroup_charge() and
> mem_cgroup_cache_charge(). While charging the caches is when we
> check for the control_type.

It checks MEM_CGROUP_TYPE_ALL there, yes; but I can't find anything
checking for either MEM_CGROUP_TYPE_MAPPED or MEM_CGROUP_TYPE_CACHED.
(Or is it hidden in one of those preprocesor ## things which frustrate
both my greps and me!?)

> > Or are you deceived by that ifdef NUMA code in swapin_readahead,
> > which propagates the fantasy that swap allocation follows vma layout?
> > That nonsense has been around too long, I'll soon be sending a patch
> > to remove it.
> 
> The swapin readahead code under #ifdef NUMA is very confusing.

I sent a patch to linux-mm last night, to remove that confusion.

> I also
> noticed another confusing thing during my test, swap cache does not
> drop to 0, even though I've disabled all swap using swapoff. May be
> those are tmpfs pages. The other interesting thing I tried was running
> swapoff after a cgroup went over it's limit, the swapoff succeeded,
> but I see strange numbers for free swap. I'll start another thread
> after investigating a bit more.

Those indeed are strange behaviours (if the swapoff really has
succeeded, rather than lying), I not seen such and don't have an
explanation.  tmpfs doesn't add any weirdness there: when there's
no swap, there can be no swap cache.  Or is the swapoff still in
progress?  While it's busy, we keep /proc/meminfo looking sensible,
but m can show negative free swap (IIRC).

I'll be interested to hear what your investigation shows.

Hugh
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Memory controller merge (was Re: -mm merge plans for 2.6.24)

2007-10-07 Thread Hugh Dickins

On Fri, 5 Oct 2007, Balbir Singh wrote:
 Hugh Dickins wrote:
  
  That's where it should happen, yes; but my point is that it very
  often does not.  Because the swap cache page (read in as part of
  the readaround cluster of some other cgroup, or in swapoff by some
  other cgroup) is already assigned to that other cgroup (by the
  mem_cgroup_cache_charge in __add_to_swap_cache), and so goes The
  page_cgroup exists and the page has already been accounted route
  when mem_cgroup_charge is called from do_swap_page.  Doesn't it?
  
 
 You are right, at this point I am beginning to wonder if I should
 account for the swap cache at all? We account for the pages in RSS
 and when the page comes back into the page table(s) via do_swap_page.
 If we believe that the swap cache is transitional and the current
 expected working behaviour does not seem right or hard to fix,
 it might be easy to ignore unuse_pte() and add/remove_from_swap_cache()
 for accounting and control.

It would be wrong to ignore the unuse_pte() case: what it's intending
to do is correct, it's just being prevented by the swapcache issue
from doing what it intends at present.

(Though I'm not thrilled with the idea of it causing an admin's
swapoff to fail because of a cgroup reaching mem limit there, I do
agree with your earlier argument that that's the right thing to happen,
and it's up to the admin to fix things up - my original objection came
from not realizing that normally the cgroup will reclaim from itself
to free its mem.  Hmm, would the charge fail or the mm get OOM'ed?)

Ignoring add_to/remove_from swap cache is what I've tried before,
and again today.  It's not enough: if you trying run a memhog
(something that allocates and touches more memory than the cgroup
is allowed, relying on pushing out to swap to complete), then that
works well with the present accounting in add_to/remove_from swap
cache, but it OOMs once I remove the memcontrol mods from
mm/swap_state.c.  I keep going back to investigate why, keep on
thinking I understand it, then later realize I don't.  Please
give it a try, I hope you've got better mental models than I have.

And I don't think it will be enough to handle shmem/tmpfs either;
but won't worry about that until we've properly understood why
exempting swapcache leads to those OOMs, and fixed that up.

  Are we misunderstanding each other, because I'm assuming
  MEM_CGROUP_TYPE_ALL and you're assuming MEM_CGROUP_TYPE_MAPPED?
  though I can't see that _MAPPED and _CACHED are actually supported,
  there being no reference to them outside the enum that defines them.
 
 I am also assuming MEM_CGROUP_TYPE_ALL for the purpose of our
 discussion. The accounting is split into mem_cgroup_charge() and
 mem_cgroup_cache_charge(). While charging the caches is when we
 check for the control_type.

It checks MEM_CGROUP_TYPE_ALL there, yes; but I can't find anything
checking for either MEM_CGROUP_TYPE_MAPPED or MEM_CGROUP_TYPE_CACHED.
(Or is it hidden in one of those preprocesor ## things which frustrate
both my greps and me!?)

  Or are you deceived by that ifdef NUMA code in swapin_readahead,
  which propagates the fantasy that swap allocation follows vma layout?
  That nonsense has been around too long, I'll soon be sending a patch
  to remove it.
 
 The swapin readahead code under #ifdef NUMA is very confusing.

I sent a patch to linux-mm last night, to remove that confusion.

 I also
 noticed another confusing thing during my test, swap cache does not
 drop to 0, even though I've disabled all swap using swapoff. May be
 those are tmpfs pages. The other interesting thing I tried was running
 swapoff after a cgroup went over it's limit, the swapoff succeeded,
 but I see strange numbers for free swap. I'll start another thread
 after investigating a bit more.

Those indeed are strange behaviours (if the swapoff really has
succeeded, rather than lying), I not seen such and don't have an
explanation.  tmpfs doesn't add any weirdness there: when there's
no swap, there can be no swap cache.  Or is the swapoff still in
progress?  While it's busy, we keep /proc/meminfo looking sensible,
but AltSysRqm can show negative free swap (IIRC).

I'll be interested to hear what your investigation shows.

Hugh
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Memory controller merge (was Re: -mm merge plans for 2.6.24)

2007-10-07 Thread Balbir Singh

Hugh Dickins wrote:
 On Fri, 5 Oct 2007, Balbir Singh wrote:
 Hugh Dickins wrote:
 That's where it should happen, yes; but my point is that it very
 often does not.  Because the swap cache page (read in as part of
 the readaround cluster of some other cgroup, or in swapoff by some
 other cgroup) is already assigned to that other cgroup (by the
 mem_cgroup_cache_charge in __add_to_swap_cache), and so goes The
 page_cgroup exists and the page has already been accounted route
 when mem_cgroup_charge is called from do_swap_page.  Doesn't it?

 You are right, at this point I am beginning to wonder if I should
 account for the swap cache at all? We account for the pages in RSS
 and when the page comes back into the page table(s) via do_swap_page.
 If we believe that the swap cache is transitional and the current
 expected working behaviour does not seem right or hard to fix,
 it might be easy to ignore unuse_pte() and add/remove_from_swap_cache()
 for accounting and control.
 
 It would be wrong to ignore the unuse_pte() case: what it's intending
 to do is correct, it's just being prevented by the swapcache issue
 from doing what it intends at present.
 

OK

 (Though I'm not thrilled with the idea of it causing an admin's
 swapoff to fail because of a cgroup reaching mem limit there, I do
 agree with your earlier argument that that's the right thing to happen,
 and it's up to the admin to fix things up - my original objection came
 from not realizing that normally the cgroup will reclaim from itself
 to free its mem. 

I'm glad we have that sorted out.

Hmm, would the charge fail or the mm get OOM'ed?)
 

Right now, we OOM if charging and reclaim fails.


 Ignoring add_to/remove_from swap cache is what I've tried before,
 and again today.  It's not enough: if you trying run a memhog
 (something that allocates and touches more memory than the cgroup
 is allowed, relying on pushing out to swap to complete), then that
 works well with the present accounting in add_to/remove_from swap
 cache, but it OOMs once I remove the memcontrol mods from
 mm/swap_state.c.  I keep going back to investigate why, keep on
 thinking I understand it, then later realize I don't.  Please
 give it a try, I hope you've got better mental models than I have.
 

I will try it. Another way to try it, is to set memory.control_type
to 1, that removes charging of cache pages (both swap cache
and page cache). I just did a quick small test on the memory
controller with swap cache changes disabled and it worked fine
for me on my UML image (without OOMing). I'll try the same test
on a bigger box. Disabling swap does usually cause an
OOM for workloads using anonymous pages if the cgroup goes
over it's limit (since the cgroup cannot pushout memory).

 And I don't think it will be enough to handle shmem/tmpfs either;
 but won't worry about that until we've properly understood why
 exempting swapcache leads to those OOMs, and fixed that up.
 

Sure.


 Are we misunderstanding each other, because I'm assuming
 MEM_CGROUP_TYPE_ALL and you're assuming MEM_CGROUP_TYPE_MAPPED?
 though I can't see that _MAPPED and _CACHED are actually supported,
 there being no reference to them outside the enum that defines them.
 I am also assuming MEM_CGROUP_TYPE_ALL for the purpose of our
 discussion. The accounting is split into mem_cgroup_charge() and
 mem_cgroup_cache_charge(). While charging the caches is when we
 check for the control_type.
 
 It checks MEM_CGROUP_TYPE_ALL there, yes; but I can't find anything
 checking for either MEM_CGROUP_TYPE_MAPPED or MEM_CGROUP_TYPE_CACHED.
 (Or is it hidden in one of those preprocesor ## things which frustrate
 both my greps and me!?)
 

MEM_CGROUP_TYPE_ALL is defined to be (MEM_CGROUP_TYPE_CACHED |
MEM_CGROUP_TYPE_MAPPED). I'll make that more explicit with a patch.
When the type is not MEM_CGROUP_TYPE_ALL, cached pages are ignored.

 Or are you deceived by that ifdef NUMA code in swapin_readahead,
 which propagates the fantasy that swap allocation follows vma layout?
 That nonsense has been around too long, I'll soon be sending a patch
 to remove it.
 The swapin readahead code under #ifdef NUMA is very confusing.
 
 I sent a patch to linux-mm last night, to remove that confusion.
 

Thanks, I saw that.

 I also
 noticed another confusing thing during my test, swap cache does not
 drop to 0, even though I've disabled all swap using swapoff. May be
 those are tmpfs pages. The other interesting thing I tried was running
 swapoff after a cgroup went over it's limit, the swapoff succeeded,
 but I see strange numbers for free swap. I'll start another thread
 after investigating a bit more.
 
 Those indeed are strange behaviours (if the swapoff really has
 succeeded, rather than lying), I not seen such and don't have an
 explanation.  tmpfs doesn't add any weirdness there: when there's
 no swap, there can be no swap cache.  Or is the swapoff still in
 progress?  While it's busy, we keep /proc/meminfo looking sensible,
 but

Re: Memory controller merge (was Re: -mm merge plans for 2.6.24)

2007-10-04 Thread Balbir Singh

Hugh Dickins wrote:
> On Thu, 4 Oct 2007, Balbir Singh wrote:
>> Hugh Dickins wrote:
>>> Well, swap control is another subject.  I guess for that you'll need
>>> to track which cgroup each swap page belongs to (rather more expensive
>>> than the current swap_map of unsigned shorts).  And I doubt it'll be
>>> swap control as such that's required, but control of rss+swap.
>> I see what you mean now, other people have recommending a per cgroup
>> swap file/device.
> 
> Sounds too inflexible, and too many swap areas to me.  Perhaps the
> right answer will fall in between: assign clusters of swap pages to
> different cgroups as needed.  But worry about that some other time.
> 

Yes, depending on the number of cgroups, we'll need to share swap
areas between them. It requires more work and thought process.

>>> But here I'm just worrying about how the existence of swap makes
>>> something of a nonsense of your rss control.
>>>
>> Ideally, pages would not reside for too long in swap cache (unless
> 
> Thinking particularly of those brought in by swapoff or swap readahead:
> some will get attached to mms once accessed, others will simply get
> freed when tasks exit or munmap, others will hang around until they
> reach the bottom of the LRU and are reclaimed again by memory pressure.
> 
> But as your code stands, that'll be total memory pressure: in-cgroup
> memory pressure will tend to miss them, since typically they're
> assigned to the wrong cgroup; until then their presence is liable
> to cause other pages to be reclaimed which ideally should not be.
> 

in-cgroup pressure will not affect them, since they are in different
cgroups. If there is pressure in the cgroup to which they are wrongly
assigned, they would get reclaimed first.

>> I've misunderstood swap cache or there are special cases for tmpfs/
>> ramfs).
> 
> ramfs pages are always in RAM, never go out to swap, no need to
> worry about them in this regard.  But tmpfs pages can indeed go
> out to swap, so whatever we come up with needs to make sense
> with them too, yes.  I don't think its swapoff/readahead issues
> are any harder to handle than the anonymous mapped page case,
> but it will need its own code to handle them.
> 
>> Once pages have been swapped back in, they get assigned
>> back to their respective cgroup's in do_swap_page() (where we charge
>> them back to the cgroup).
>>
> 
> That's where it should happen, yes; but my point is that it very
> often does not.  Because the swap cache page (read in as part of
> the readaround cluster of some other cgroup, or in swapoff by some
> other cgroup) is already assigned to that other cgroup (by the
> mem_cgroup_cache_charge in __add_to_swap_cache), and so goes "The
> page_cgroup exists and the page has already been accounted" route
> when mem_cgroup_charge is called from do_swap_page.  Doesn't it?
> 

You are right, at this point I am beginning to wonder if I should
account for the swap cache at all? We account for the pages in RSS
and when the page comes back into the page table(s) via do_swap_page.
If we believe that the swap cache is transitional and the current
expected working behaviour does not seem right or hard to fix,
it might be easy to ignore unuse_pte() and add/remove_from_swap_cache()
for accounting and control.

The expected working behaviour of the memory controller is that
currently, as you point out several pages get accounted to the
cgroup that initiates swapin readahead or swapoff. On
cgroup pressure (the one that initiated swapin or swapoff), the
cgroup would discard these pages first. These pages are discarded
from the cgroup, but still live on the global LRU.

When the original cgroup is under pressure, these pages might not
be effected as they belong to a different cgroup, which might not
be under any sort of pressure.

> Are we misunderstanding each other, because I'm assuming
> MEM_CGROUP_TYPE_ALL and you're assuming MEM_CGROUP_TYPE_MAPPED?
> though I can't see that _MAPPED and _CACHED are actually supported,
> there being no reference to them outside the enum that defines them.
> 

I am also assuming MEM_CGROUP_TYPE_ALL for the purpose of our
discussion. The accounting is split into mem_cgroup_charge() and
mem_cgroup_cache_charge(). While charging the caches is when we
check for the control_type.

> Or are you deceived by that ifdef NUMA code in swapin_readahead,
> which propagates the fantasy that swap allocation follows vma layout?
> That nonsense has been around too long, I'll soon be sending a patch
> to remove it.
> 

The swapin readahead code under #ifdef NUMA is very confusing. I also
noticed another confusing thing during my test, swap cache does not
drop to 0, even though I've disabled all swap using swapoff. May be
those are tmpfs pages. The other interesting thing I tried was running
swapoff after a cgroup went over it's limit, the swapoff succeeded,
but I see strange numbers for free swap. I'll start another thread
after investigating a bit more.

>> The swap

Re: Memory controller merge (was Re: -mm merge plans for 2.6.24)

2007-10-04 Thread Paul Menage

On 10/2/07, Hugh Dickins <[EMAIL PROTECTED]> wrote:
>
> I accept that full swap control is something you're intending to add
> incrementally later; but the current state doesn't make sense to me.

One comment on swap - ideally it should be a separate subsystem from
the memory controller. That way people who are using cpusets to
provide memory isolation (rather than using the page-based memory
controller) can also get swap isolation.

Paul
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Memory controller merge (was Re: -mm merge plans for 2.6.24)

2007-10-04 Thread Hugh Dickins

On Thu, 4 Oct 2007, Balbir Singh wrote:
> Hugh Dickins wrote:
> > Well, swap control is another subject.  I guess for that you'll need
> > to track which cgroup each swap page belongs to (rather more expensive
> > than the current swap_map of unsigned shorts).  And I doubt it'll be
> > swap control as such that's required, but control of rss+swap.
> 
> I see what you mean now, other people have recommending a per cgroup
> swap file/device.

Sounds too inflexible, and too many swap areas to me.  Perhaps the
right answer will fall in between: assign clusters of swap pages to
different cgroups as needed.  But worry about that some other time.

> 
> > But here I'm just worrying about how the existence of swap makes
> > something of a nonsense of your rss control.
> > 
> 
> Ideally, pages would not reside for too long in swap cache (unless

Thinking particularly of those brought in by swapoff or swap readahead:
some will get attached to mms once accessed, others will simply get
freed when tasks exit or munmap, others will hang around until they
reach the bottom of the LRU and are reclaimed again by memory pressure.

But as your code stands, that'll be total memory pressure: in-cgroup
memory pressure will tend to miss them, since typically they're
assigned to the wrong cgroup; until then their presence is liable
to cause other pages to be reclaimed which ideally should not be.

> I've misunderstood swap cache or there are special cases for tmpfs/
> ramfs).

ramfs pages are always in RAM, never go out to swap, no need to
worry about them in this regard.  But tmpfs pages can indeed go
out to swap, so whatever we come up with needs to make sense
with them too, yes.  I don't think its swapoff/readahead issues
are any harder to handle than the anonymous mapped page case,
but it will need its own code to handle them.

> Once pages have been swapped back in, they get assigned
> back to their respective cgroup's in do_swap_page() (where we charge
> them back to the cgroup).
> 

That's where it should happen, yes; but my point is that it very
often does not.  Because the swap cache page (read in as part of
the readaround cluster of some other cgroup, or in swapoff by some
other cgroup) is already assigned to that other cgroup (by the
mem_cgroup_cache_charge in __add_to_swap_cache), and so goes "The
page_cgroup exists and the page has already been accounted" route
when mem_cgroup_charge is called from do_swap_page.  Doesn't it?

Are we misunderstanding each other, because I'm assuming
MEM_CGROUP_TYPE_ALL and you're assuming MEM_CGROUP_TYPE_MAPPED?
though I can't see that _MAPPED and _CACHED are actually supported,
there being no reference to them outside the enum that defines them.

Or are you deceived by that ifdef NUMA code in swapin_readahead,
which propagates the fantasy that swap allocation follows vma layout?
That nonsense has been around too long, I'll soon be sending a patch
to remove it.

> The swap cache pages will be the first ones to go, once the cgroup
> exceeds its limit.

No, because they're (in general) booked to the wrong cgroup.

> 
> There might be gaps in my understanding or I might be missing a use
> case scenario, where things work differently.
> 
> >>> I accept that full swap control is something you're intending to add
> >>> incrementally later; but the current state doesn't make sense to me.
> >>>
> >>> The problems are swapoff and swapin readahead.  These pull pages into
> >>> the swap cache, which are assigned to the cgroup (or the whatever-we-
> >>> call-the-remainder-outside-all-the-cgroups) which is running swapoff
> >  ^
> > I'd appreciate it if you'd teach me the right name for that!
> > 
> 
> In the past people have used names like default cgroup, we could use
> the root cgroup as the default cgroup.

Okay, thanks.

Hugh
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Memory controller merge (was Re: -mm merge plans for 2.6.24)

2007-10-04 Thread Hugh Dickins

On Thu, 4 Oct 2007, Balbir Singh wrote:
 Hugh Dickins wrote:
  Well, swap control is another subject.  I guess for that you'll need
  to track which cgroup each swap page belongs to (rather more expensive
  than the current swap_map of unsigned shorts).  And I doubt it'll be
  swap control as such that's required, but control of rss+swap.
 
 I see what you mean now, other people have recommending a per cgroup
 swap file/device.

Sounds too inflexible, and too many swap areas to me.  Perhaps the
right answer will fall in between: assign clusters of swap pages to
different cgroups as needed.  But worry about that some other time.

 
  But here I'm just worrying about how the existence of swap makes
  something of a nonsense of your rss control.
  
 
 Ideally, pages would not reside for too long in swap cache (unless

Thinking particularly of those brought in by swapoff or swap readahead:
some will get attached to mms once accessed, others will simply get
freed when tasks exit or munmap, others will hang around until they
reach the bottom of the LRU and are reclaimed again by memory pressure.

But as your code stands, that'll be total memory pressure: in-cgroup
memory pressure will tend to miss them, since typically they're
assigned to the wrong cgroup; until then their presence is liable
to cause other pages to be reclaimed which ideally should not be.

 I've misunderstood swap cache or there are special cases for tmpfs/
 ramfs).

ramfs pages are always in RAM, never go out to swap, no need to
worry about them in this regard.  But tmpfs pages can indeed go
out to swap, so whatever we come up with needs to make sense
with them too, yes.  I don't think its swapoff/readahead issues
are any harder to handle than the anonymous mapped page case,
but it will need its own code to handle them.

 Once pages have been swapped back in, they get assigned
 back to their respective cgroup's in do_swap_page() (where we charge
 them back to the cgroup).
 

That's where it should happen, yes; but my point is that it very
often does not.  Because the swap cache page (read in as part of
the readaround cluster of some other cgroup, or in swapoff by some
other cgroup) is already assigned to that other cgroup (by the
mem_cgroup_cache_charge in __add_to_swap_cache), and so goes The
page_cgroup exists and the page has already been accounted route
when mem_cgroup_charge is called from do_swap_page.  Doesn't it?

Are we misunderstanding each other, because I'm assuming
MEM_CGROUP_TYPE_ALL and you're assuming MEM_CGROUP_TYPE_MAPPED?
though I can't see that _MAPPED and _CACHED are actually supported,
there being no reference to them outside the enum that defines them.

Or are you deceived by that ifdef NUMA code in swapin_readahead,
which propagates the fantasy that swap allocation follows vma layout?
That nonsense has been around too long, I'll soon be sending a patch
to remove it.

 The swap cache pages will be the first ones to go, once the cgroup
 exceeds its limit.

No, because they're (in general) booked to the wrong cgroup.

 
 There might be gaps in my understanding or I might be missing a use
 case scenario, where things work differently.
 
  I accept that full swap control is something you're intending to add
  incrementally later; but the current state doesn't make sense to me.
 
  The problems are swapoff and swapin readahead.  These pull pages into
  the swap cache, which are assigned to the cgroup (or the whatever-we-
  call-the-remainder-outside-all-the-cgroups) which is running swapoff
   ^
  I'd appreciate it if you'd teach me the right name for that!
  
 
 In the past people have used names like default cgroup, we could use
 the root cgroup as the default cgroup.

Okay, thanks.

Hugh
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Memory controller merge (was Re: -mm merge plans for 2.6.24)

2007-10-04 Thread Paul Menage

On 10/2/07, Hugh Dickins [EMAIL PROTECTED] wrote:

 I accept that full swap control is something you're intending to add
 incrementally later; but the current state doesn't make sense to me.

One comment on swap - ideally it should be a separate subsystem from
the memory controller. That way people who are using cpusets to
provide memory isolation (rather than using the page-based memory
controller) can also get swap isolation.

Paul
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Memory controller merge (was Re: -mm merge plans for 2.6.24)

2007-10-04 Thread Balbir Singh

Hugh Dickins wrote:
 On Thu, 4 Oct 2007, Balbir Singh wrote:
 Hugh Dickins wrote:
 Well, swap control is another subject.  I guess for that you'll need
 to track which cgroup each swap page belongs to (rather more expensive
 than the current swap_map of unsigned shorts).  And I doubt it'll be
 swap control as such that's required, but control of rss+swap.
 I see what you mean now, other people have recommending a per cgroup
 swap file/device.
 
 Sounds too inflexible, and too many swap areas to me.  Perhaps the
 right answer will fall in between: assign clusters of swap pages to
 different cgroups as needed.  But worry about that some other time.
 

Yes, depending on the number of cgroups, we'll need to share swap
areas between them. It requires more work and thought process.

 But here I'm just worrying about how the existence of swap makes
 something of a nonsense of your rss control.

 Ideally, pages would not reside for too long in swap cache (unless
 
 Thinking particularly of those brought in by swapoff or swap readahead:
 some will get attached to mms once accessed, others will simply get
 freed when tasks exit or munmap, others will hang around until they
 reach the bottom of the LRU and are reclaimed again by memory pressure.
 
 But as your code stands, that'll be total memory pressure: in-cgroup
 memory pressure will tend to miss them, since typically they're
 assigned to the wrong cgroup; until then their presence is liable
 to cause other pages to be reclaimed which ideally should not be.
 

in-cgroup pressure will not affect them, since they are in different
cgroups. If there is pressure in the cgroup to which they are wrongly
assigned, they would get reclaimed first.

 I've misunderstood swap cache or there are special cases for tmpfs/
 ramfs).
 
 ramfs pages are always in RAM, never go out to swap, no need to
 worry about them in this regard.  But tmpfs pages can indeed go
 out to swap, so whatever we come up with needs to make sense
 with them too, yes.  I don't think its swapoff/readahead issues
 are any harder to handle than the anonymous mapped page case,
 but it will need its own code to handle them.
 
 Once pages have been swapped back in, they get assigned
 back to their respective cgroup's in do_swap_page() (where we charge
 them back to the cgroup).

 
 That's where it should happen, yes; but my point is that it very
 often does not.  Because the swap cache page (read in as part of
 the readaround cluster of some other cgroup, or in swapoff by some
 other cgroup) is already assigned to that other cgroup (by the
 mem_cgroup_cache_charge in __add_to_swap_cache), and so goes The
 page_cgroup exists and the page has already been accounted route
 when mem_cgroup_charge is called from do_swap_page.  Doesn't it?
 

You are right, at this point I am beginning to wonder if I should
account for the swap cache at all? We account for the pages in RSS
and when the page comes back into the page table(s) via do_swap_page.
If we believe that the swap cache is transitional and the current
expected working behaviour does not seem right or hard to fix,
it might be easy to ignore unuse_pte() and add/remove_from_swap_cache()
for accounting and control.

The expected working behaviour of the memory controller is that
currently, as you point out several pages get accounted to the
cgroup that initiates swapin readahead or swapoff. On
cgroup pressure (the one that initiated swapin or swapoff), the
cgroup would discard these pages first. These pages are discarded
from the cgroup, but still live on the global LRU.

When the original cgroup is under pressure, these pages might not
be effected as they belong to a different cgroup, which might not
be under any sort of pressure.

 Are we misunderstanding each other, because I'm assuming
 MEM_CGROUP_TYPE_ALL and you're assuming MEM_CGROUP_TYPE_MAPPED?
 though I can't see that _MAPPED and _CACHED are actually supported,
 there being no reference to them outside the enum that defines them.
 

I am also assuming MEM_CGROUP_TYPE_ALL for the purpose of our
discussion. The accounting is split into mem_cgroup_charge() and
mem_cgroup_cache_charge(). While charging the caches is when we
check for the control_type.

 Or are you deceived by that ifdef NUMA code in swapin_readahead,
 which propagates the fantasy that swap allocation follows vma layout?
 That nonsense has been around too long, I'll soon be sending a patch
 to remove it.
 

The swapin readahead code under #ifdef NUMA is very confusing. I also
noticed another confusing thing during my test, swap cache does not
drop to 0, even though I've disabled all swap using swapoff. May be
those are tmpfs pages. The other interesting thing I tried was running
swapoff after a cgroup went over it's limit, the swapoff succeeded,
but I see strange numbers for free swap. I'll start another thread
after investigating a bit more.

 The swap cache pages will be the first ones to go, once the cgroup
 exceeds its limit.
 
 No,

Re: Memory controller merge (was Re: -mm merge plans for 2.6.24)

2007-10-03 Thread Balbir Singh

Hugh Dickins wrote:
> On Wed, 3 Oct 2007, Balbir Singh wrote:
>> Hugh Dickins wrote:
>>> Sorry, Balbir, I've failed to get back to you, still attending to
>>> priorities.  Let me briefly summarize my issue with the mem controller:
>>> you've not yet given enough attention to swap.
>> I am open to suggestions and ways and means of making swap control
>> complete and more usable.
> 
> Well, swap control is another subject.  I guess for that you'll need
> to track which cgroup each swap page belongs to (rather more expensive
> than the current swap_map of unsigned shorts).  And I doubt it'll be
> swap control as such that's required, but control of rss+swap.
> 

I see what you mean now, other people have recommending a per cgroup
swap file/device.

> But here I'm just worrying about how the existence of swap makes
> something of a nonsense of your rss control.
> 

Ideally, pages would not reside for too long in swap cache (unless
I've misunderstood swap cache or there are special cases for tmpfs/
ramfs). Once pages have been swapped back in, they get assigned
back to their respective cgroup's in do_swap_page() (where we charge
them back to the cgroup).

The swap cache pages will be the first ones to go, once the cgroup
exceeds its limit.

There might be gaps in my understanding or I might be missing a use
case scenario, where things work differently.

>>> I accept that full swap control is something you're intending to add
>>> incrementally later; but the current state doesn't make sense to me.
>>>
>>> The problems are swapoff and swapin readahead.  These pull pages into
>>> the swap cache, which are assigned to the cgroup (or the whatever-we-
>>> call-the-remainder-outside-all-the-cgroups) which is running swapoff
>  ^
> I'd appreciate it if you'd teach me the right name for that!
> 

In the past people have used names like default cgroup, we could use
the root cgroup as the default cgroup.

>>> or faulting in its own page; yet they very clearly don't (in general)
>>> belong to that cgroup, but to other cgroups which will be discovered
>>> later.
>> I understand what your trying to say, but with several approaches that
>> we tried in the past, we found caches the hardest to most accurately
>> account. IIRC, with readahead, we don't even know if all the pages
>> readahead will be used, that's why we charge everything to the cgroup
>> that added the page to the cache.
> 
> Yes, readahead is anyway problematic.  My guess is that in the file
> cache case, you'll tend not to go too far wrong by charging to the
> one that added - though we're all aware that's fairly unsatisfactory.
> 
> My point is that in the swap cache case, it's badly wrong: there's
> no page more obviously owned by a cgroup than its anonymous pages
> (forgetting for a moment that minority shared between cgroups
> until copy-on-write), so it's very wrong for swapin readahead
> or swapoff to go charging those to another or to no cgroup.
> 
> Imagine a cgroup at its rss limit, with more out on swap.  Then
> another cgroup does some swap readahead, bringing pages private
> to the first into cache.  Or runs swapoff which actually plugs
> them into the rss of the first cgroup, so it goes over limit.
> 
> Those are pages we'd want to swap out when the first cgroup
> faults to go further over its limit; but they're now not even
> identified as belonging to the right cgroup, so won't be found.
> 

Won't the right cgroup assignment happen as discussed above?

>>> I did try removing the cgroup mods to mm/swap_state.c, so swap pages
>>> get assigned to a cgroup only once it's really known; but that's not
>>> enough by itself, because cgroup RSS reclaim doesn't touch those
>>> pages, so the cgroup can easily OOM much too soon.  I was thinking
>>> that you need a "limbo" cgroup for these pages, which can be attacked
>>> for reclaim along with any cgroup being reclaimed, but from which
>>> pages are readily migrated to their real cgroup once that's known.
>>>
>> Is migrating the charge to the real cgroup really required?
> 
> My answer is definitely yes.  I'm not suggesting that you need
> general migration between cgroups at this stage (something for
> later quite likely); but I am suggesting you need one pseudo-cgroup
> to hold these cases temporarily, and that you cannot properly track
> rss without it (if there is any swap).
> 

If what I understand and discussed earlier is, then we don't need
to go this route. But I think the idea of having a pseduo cgroup
is interesting (needs more thought).

>>> So in the current memory controller, that unuse_pte mem charge I was
>>> originally worried about failing (I hadn't at that point delved in
>>> to see how it tries to reclaim) actually never fails (and never
>>> does anything): the page is already assigned to some cgroup-or-
>>> whatever and is never charged to vma->vm_mm at that point.
>>>
>> Excellent!
> 
> Umm, please explain what's excellent about that.
>

A kernel Tracing interface (was Re: -mm merge plans for 2.6.24)

2007-10-03 Thread David Wilder



Andrew-

Could you please add the trace patches to the merge list?
These patches have been very well reviewed on lkml. I believe they are 
ready to be merged.  The patches can be found here:

http://lkml.org/lkml/2007/10/2/236
http://lkml.org/lkml/2007/10/2/237
http://lkml.org/lkml/2007/10/2/238
http://lkml.org/lkml/2007/10/2/239

Dave Wilder
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Memory controller merge (was Re: -mm merge plans for 2.6.24)

2007-10-03 Thread Hugh Dickins

On Wed, 3 Oct 2007, Balbir Singh wrote:
> Hugh Dickins wrote:
> > 
> > Sorry, Balbir, I've failed to get back to you, still attending to
> > priorities.  Let me briefly summarize my issue with the mem controller:
> > you've not yet given enough attention to swap.
> 
> I am open to suggestions and ways and means of making swap control
> complete and more usable.

Well, swap control is another subject.  I guess for that you'll need
to track which cgroup each swap page belongs to (rather more expensive
than the current swap_map of unsigned shorts).  And I doubt it'll be
swap control as such that's required, but control of rss+swap.

But here I'm just worrying about how the existence of swap makes
something of a nonsense of your rss control.

> > I accept that full swap control is something you're intending to add
> > incrementally later; but the current state doesn't make sense to me.
> > 
> > The problems are swapoff and swapin readahead.  These pull pages into
> > the swap cache, which are assigned to the cgroup (or the whatever-we-
> > call-the-remainder-outside-all-the-cgroups) which is running swapoff
 ^
I'd appreciate it if you'd teach me the right name for that!

> > or faulting in its own page; yet they very clearly don't (in general)
> > belong to that cgroup, but to other cgroups which will be discovered
> > later.
> 
> I understand what your trying to say, but with several approaches that
> we tried in the past, we found caches the hardest to most accurately
> account. IIRC, with readahead, we don't even know if all the pages
> readahead will be used, that's why we charge everything to the cgroup
> that added the page to the cache.

Yes, readahead is anyway problematic.  My guess is that in the file
cache case, you'll tend not to go too far wrong by charging to the
one that added - though we're all aware that's fairly unsatisfactory.

My point is that in the swap cache case, it's badly wrong: there's
no page more obviously owned by a cgroup than its anonymous pages
(forgetting for a moment that minority shared between cgroups
until copy-on-write), so it's very wrong for swapin readahead
or swapoff to go charging those to another or to no cgroup.

Imagine a cgroup at its rss limit, with more out on swap.  Then
another cgroup does some swap readahead, bringing pages private
to the first into cache.  Or runs swapoff which actually plugs
them into the rss of the first cgroup, so it goes over limit.

Those are pages we'd want to swap out when the first cgroup
faults to go further over its limit; but they're now not even
identified as belonging to the right cgroup, so won't be found.

> > I did try removing the cgroup mods to mm/swap_state.c, so swap pages
> > get assigned to a cgroup only once it's really known; but that's not
> > enough by itself, because cgroup RSS reclaim doesn't touch those
> > pages, so the cgroup can easily OOM much too soon.  I was thinking
> > that you need a "limbo" cgroup for these pages, which can be attacked
> > for reclaim along with any cgroup being reclaimed, but from which
> > pages are readily migrated to their real cgroup once that's known.
> > 
> 
> Is migrating the charge to the real cgroup really required?

My answer is definitely yes.  I'm not suggesting that you need
general migration between cgroups at this stage (something for
later quite likely); but I am suggesting you need one pseudo-cgroup
to hold these cases temporarily, and that you cannot properly track
rss without it (if there is any swap).

> > But I had to switch over to other work before trying that out:
> > perhaps the idea doesn't really fly at all.  And it might well
> > be no longer needed once full mem+swap control is there.
> > 
> > So in the current memory controller, that unuse_pte mem charge I was
> > originally worried about failing (I hadn't at that point delved in
> > to see how it tries to reclaim) actually never fails (and never
> > does anything): the page is already assigned to some cgroup-or-
> > whatever and is never charged to vma->vm_mm at that point.
> > 
> 
> Excellent!

Umm, please explain what's excellent about that.

Hugh
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: remove zero_page (was Re: -mm merge plans for 2.6.24)

2007-10-03 Thread Linus Torvalds

On Wed, 3 Oct 2007, Nick Piggin wrote:
> 
> I don't know if Linus actually disliked the patch itself, or disliked
> my (maybe confusingly worded) rationale?

Yes. I'd happily accept the patch, but I'd want it clarified and made 
obvious what the problem was - and it wasn't the zero page itself, it was 
a regression in the VM that made it less palatable.

I also thought that there were potentially better solutions, namely to 
simply avoid the VM regression, but I also acknowledge that they may not 
be worth it - I just want them to be on the table.

In short: the real cost of the zero page was the reference counting on the 
page that we do these days. For example, I really do believe that the 
problem could fairly easily be fixed by simply not considering zero_page
to be a "vm_normal_page()". We already *do* have support for pages not 
getting ref-counted (since we need it for other things), and I think that 
zero_page very naturally falls into exactly that situation.

So in many ways, I would think that turning zero-page into a nonrefcounted 
page (the same way we really do have to do for other things anyway) would 
be the much more *direct* solution, and in many ways the obvious one.

HOWEVER - if people think that it's easier to remove zero_page, and want 
to do it for other reasons, *AND* can hopefully even back up the claim 
that it never matters with numbers (ie that the extra pagefaults just make 
the whole zero-page thing pointless), then I'd certainly accept the patch. 

I'd just want the patch *description* to then also be correct, and blame 
the right situation, instead of blaming zero-page itself.

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: per BDI dirty limit (was Re: -mm merge plans for 2.6.24)

2007-10-03 Thread Peter Zijlstra


On Wed, 2007-10-03 at 15:35 +0200, Kay Sievers wrote:
> On Wed, 2007-10-03 at 12:37 +0200, Peter Zijlstra wrote:
> > On Wed, 2007-10-03 at 12:15 +0200, Kay Sievers wrote:
> > > On Tue, 2007-10-02 at 22:05 +1000, Nick Piggin wrote:
> > > > On Tuesday 02 October 2007 21:40, Peter Zijlstra wrote:
> > > > > On Tue, 2007-10-02 at 13:21 +0200, Kay Sievers wrote:
> > > > 
> > > > > > How about adding this information to the tree then, instead of
> > > > > > creating a new top-level hack, just because something that you think
> > > > > > you need doesn't exist.
> > > > >
> > > > > So you suggest adding all the various network filesystems in there
> > > > > (where?), and adding the concept of a BDI, and ensuring all are 
> > > > > properly
> > > > > linked together - somehow. Feel free to do so.
> > > > 
> > > > Would something fit better under /sys/fs/? At least filesystems are
> > > > already an existing concept to userspace.
> > > 
> > > Sounds at least less messy than an new top-level directory.
> > > 
> > > But again, if it's "device" releated, like the name suggests, it should
> > > be reachable from the device tree.
> > > Which userspace tool is supposed to set these values, and at what time?
> > > An init-script, something at device discovery/setup? If that is is ever
> > > going to be used in a hotplug setup, you really don't want to go look
> > > for directories with magic device names in another disconnected tree.
> > 
> > Filesystems don't really map to BDIs either. One can have multiple FSs
> > per BDI.
> > 
> > 'Normally' a BDI relates to a block device, but networked (and other
> > non-block device) filesystems have to create a BDI too. So these need to
> > be represented some place as well.
> > 
> > The typical usage would indeed be init scripts. The typical example
> > would be setting the read-ahead window. Currently that cannot be done
> > for NFS mounts.
> 
> What kind of context for a non-block based fs will get the bdi controls
> added? Is there a generic place, or does every non-block based
> filesystem needs to be adapted individually to use it?

Not sure what the other non-block FSs do, but NFS puts it in its
superblock. So that would roughly be per mount.


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: per BDI dirty limit (was Re: -mm merge plans for 2.6.24)

2007-10-03 Thread Kay Sievers

On Wed, 2007-10-03 at 12:37 +0200, Peter Zijlstra wrote:
> On Wed, 2007-10-03 at 12:15 +0200, Kay Sievers wrote:
> > On Tue, 2007-10-02 at 22:05 +1000, Nick Piggin wrote:
> > > On Tuesday 02 October 2007 21:40, Peter Zijlstra wrote:
> > > > On Tue, 2007-10-02 at 13:21 +0200, Kay Sievers wrote:
> > > 
> > > > > How about adding this information to the tree then, instead of
> > > > > creating a new top-level hack, just because something that you think
> > > > > you need doesn't exist.
> > > >
> > > > So you suggest adding all the various network filesystems in there
> > > > (where?), and adding the concept of a BDI, and ensuring all are properly
> > > > linked together - somehow. Feel free to do so.
> > > 
> > > Would something fit better under /sys/fs/? At least filesystems are
> > > already an existing concept to userspace.
> > 
> > Sounds at least less messy than an new top-level directory.
> > 
> > But again, if it's "device" releated, like the name suggests, it should
> > be reachable from the device tree.
> > Which userspace tool is supposed to set these values, and at what time?
> > An init-script, something at device discovery/setup? If that is is ever
> > going to be used in a hotplug setup, you really don't want to go look
> > for directories with magic device names in another disconnected tree.
> 
> Filesystems don't really map to BDIs either. One can have multiple FSs
> per BDI.
> 
> 'Normally' a BDI relates to a block device, but networked (and other
> non-block device) filesystems have to create a BDI too. So these need to
> be represented some place as well.
> 
> The typical usage would indeed be init scripts. The typical example
> would be setting the read-ahead window. Currently that cannot be done
> for NFS mounts.

What kind of context for a non-block based fs will get the bdi controls
added? Is there a generic place, or does every non-block based
filesystem needs to be adapted individually to use it?

Kay

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

1 2 3 >

1 - 100 of 200 matches

Mail list logo