qstr abuse in git-cifs

2007-11-05 Thread Andrew Morton

static int cifs_ci_compare(struct dentry *dentry, struct qstr *a,
			   struct qstr *b)
{
	struct nls_table *codepage = CIFS_SB(dentry->d_inode->i_sb)->local_nls;

	if ((a->len == b->len) &&
	    (nls_strnicmp(codepage, a->name, b->name, a->len) == 0)) {
		/*
		 * To preserve case, don't let an existing negative dentry's
		 * case take precedence.  If a is not a negative dentry, this
		 * should have no side effects
		 */
		memcpy(a->name, b->name, a->len);
		return 0;
	}
	return 1;
}

produces

fs/cifs/dir.c: In function 'cifs_ci_compare':
fs/cifs/dir.c:596: warning: passing argument 1 of '__constant_memcpy' discards qualifiers from pointer target type
fs/cifs/dir.c:596: warning: passing argument 1 of '__memcpy' discards qualifiers from pointer target type

I suspect that bad things are happening in there.

It's strange for a "comparison" function to go and alter one of the things
which it's comparing, too.
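
The warning arises because struct qstr declares name as a pointer to const, so the memcpy() writes through a pointer that was never meant to be written through. As a rough userspace sketch only (this is not the CIFS fix), a case-insensitive compare can take both names as const and leave any case-preserving copy to a caller that owns a writable buffer. The struct name_ref type and name_ci_compare() are hypothetical stand-ins, and tolower() only approximates what nls_strnicmp() does with a codepage.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical stand-in for struct qstr; only the fields used here. */
struct name_ref {
	const unsigned char *name;
	size_t len;
};

/*
 * Return 0 when the two names match ignoring ASCII case, 1 otherwise.
 * Both arguments are const, so the compiler rejects any attempt to
 * rewrite one name with the other inside the comparison.
 */
int name_ci_compare(const struct name_ref *a, const struct name_ref *b)
{
	size_t i;

	assert(a != NULL && b != NULL);
	if (a->len != b->len)
		return 1;
	for (i = 0; i < a->len; i++)
		if (tolower(a->name[i]) != tolower(b->name[i]))
			return 1;
	return 0;
}
```

With this shape the comparison has no side effects, which is the property the complaint above is asking for.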
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: writeout stalls in current -git

2007-11-05 Thread Torsten Kaiser
On 11/6/07, David Chinner <[EMAIL PROTECTED]> wrote:
> On Mon, Nov 05, 2007 at 07:27:16PM +0100, Torsten Kaiser wrote:
> > On 11/5/07, David Chinner <[EMAIL PROTECTED]> wrote:
> > > Ok, so it's probably a side effect of the writeback changes.
> > >
> > > Attached are two patches (two because one was in a separate patchset as
> > > a standalone change) that should prevent async writeback from blocking
> > > on locked inode cluster buffers. Apply the xfs-factor-inotobp patch first.
> > > Can you see if this fixes the problem?
> >
> > Now testing v2.6.24-rc1-650-gb55d1b1+ the fix for the misapplied raid5-patch
> > Applying your two patches on top of that does not fix the stalls.
>
> So you are having RAID5 problems as well?

The first 2.6.24-rc1-git kernel that I patched with your patches did
not boot for me. (Oops sent in one of my previous mails.) But given
that the stack trace was not xfs related and I had seen this patch on
the lkml, I tried to fix this Oops this way.
I did not have trouble with the RAID5 otherwise.

> I'm struggling to understand what could possibly have changed in XFS or
> writeback that would lead to stalls like this, esp. as you appear to be
> removing files when the stalls occur. Rather than vmstat, can you use
> something like iostat to show how busy your disks are?  i.e. are we seeing
> RMW cycles in the raid5 or some such issue.

Will do this this evening.

> OOC, what is the 'xfs_info ' output for your filesystem?

meta-data=/dev/mapper/root   isize=256    agcount=32, agsize=4731132 blks
         =                   sectsz=512   attr=1
data     =                   bsize=4096   blocks=151396224, imaxpct=25
         =                   sunit=0      swidth=0 blks, unwritten=1
naming   =version 2          bsize=4096
log      =internal           bsize=4096   blocks=32768, version=1
         =                   sectsz=512   sunit=0 blks, lazy-count=0
realtime =none               extsz=4096   blocks=0, rtextents=0


> > vmstat 10 output from unmerging (uninstalling) a kernel:
> >  1  0  0 3512188332 19264400   18512  368  735 10  3 85  1
> > -> emerge starts to remove the kernel source files
> >  3  0  0 3506624332 1928360015  9825 2458 8307  7 12 81  0
> >  0  0  0 3507212332 19283600 0   554  630 1233  0  1 99  0
> >  0  0  0 3507292332 19283600 0   537  580 1328  0  1 99  0
> >  0  0  0 3507168332 19283600 0   633  626 1380  0  1 99  0
> >  0  0  0 3507116332 19283600 0  1510  768 2030  1  2 97  0
> >  0  0  0 3507596332 19283600 0   524  540 1544  0  0 99  0
> > procs ---memory-- ---swap-- -io -system-- cpu
> >  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
> >  0  0  0 3507540332 19283600 0   489  551 1293  0  0 99  0
> >  0  0  0 3507528332 19283600 0   527  510 1432  1  1 99  0
> >  0  0  0 3508052332 19284000 0  2088  910 2964  2  3 95  0
> >  0  0  0 3507888332 19284000 0   442  565 1383  1  1 99  0
> >  0  0  0 3508704332 19284000 0   497  529 1479  0  0 99  0
> >  0  0  0 3508704332 19284000 0   594  595 1458  0  0 99  0
> >  0  0  0 3511492332 19284000 0  2381 1028 2941  2  3 95  0
> >  0  0  0 3510684332 19284000 0   699  600 1390  0  0 99  0
> >  0  0  0 3511636332 19284000 0   741  661 1641  0  0 100  0
> >  0  0  0 3524020332 19284000 0  2452 1080 3910  2  3 95  0
> >  0  0  0 3524040332 19284400 0   530  617 1297  0  0 99  0
> >  0  0  0 3524128332 19284400 0   812  674 1667  0  1 99  0
> >  0  0  0 3527000332 19367200   339   721  754 1681  3  2 93  1
> > -> emerge is finished, no dirty or writeback data in /proc/meminfo
>
> At this point, can you run a "sync" and see how long that takes to
> complete?

Already tried that: http://lkml.org/lkml/2007/11/2/178
See the logs from the second unmerge in the second half of the mail.

The sync did not stop this writeout, but returned immediately.

> The only thing I can think that would be written out after
> this point is inodes, but even then it seems to go on for a long,
> long time and it really doesn't seem like XFS is holding up the
> inode writes.

Yes, I completely agree that this is much too long. That's why I included
the after-emerge-finished parts of the logs. But I still partly
suspect xfs, because the xfssyncd shows up when I hit SysRq+W.

> Another option is to use blktrace/blkparse to determine which process is
> issuing this I/O.
>
> >  0  0  0 3583780332 19506000 0   494  555 1080  0  1 99  0
> >  0  0

Re: [PATCH] NFS: Stop sillyname renames and unmounts from racing

2007-11-05 Thread Alexander Viro
On Mon, Nov 05, 2007 at 09:06:36PM -0800, Andrew Morton wrote:
> > Any objections to exporting the inode_lock spin lock?
> > If so, how should modules _safely_ access the s_inode list?

> That's going to make hch unhappy.

That's going to make me just as unhappy, especially since it's pointless;
instead of the entire sorry mess we should just bump sb->s_active to pin
the superblock down (we know that it's active at that point, so it's just
an atomic_inc(); no games with locking, etc., are needed) and call
deactivate_super() on the way out.  And deactivate_super() is exported
already.
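
As a hedged userspace model of the pinning idea above (not kernel code): an already-active superblock is pinned with a plain atomic increment, and the matching drop on the way out tears things down only if it was the last active reference. The struct sb_model type and the sb_pin()/sb_unpin() helpers are hypothetical names; in the kernel the drop would be the existing deactivate_super().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace model of a superblock; not the kernel structure. */
struct sb_model {
	atomic_int active;	/* models sb->s_active */
	bool shut_down;		/* set when the last active reference is dropped */
};

/* Pin a superblock that is known to be active: just an increment. */
void sb_pin(struct sb_model *sb)
{
	assert(atomic_load(&sb->active) > 0);
	atomic_fetch_add(&sb->active, 1);
}

/* Models deactivate_super(): drop one reference, shut down on the last. */
void sb_unpin(struct sb_model *sb)
{
	/* atomic_fetch_sub() returns the value held before the subtraction. */
	if (atomic_fetch_sub(&sb->active, 1) == 1)
		sb->shut_down = true;
}
```

In this model an unmount that races with the pinned work merely drops its own reference; the shutdown is deferred until the pinned path calls sb_unpin(), so nothing needs to walk the inode list.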


Re: [PATCH] NFS: Stop sillyname renames and unmounts from racing

2007-11-05 Thread Andrew Morton
On Sat, 03 Nov 2007 07:09:25 -0400 Steve Dickson <[EMAIL PROTECTED]> wrote:

> The following patch stops NFS sillyname renames and umounts from racing.

(appropriate cc's added)

> I have a test script that does the following:
>   1) start nfs server
>   2) mount loopback
>   3) open file in background
>   4) remove file
>   5) stop nfs server
>   6) kill -9 process which has file open
>   7) restart nfs server
>   8) umount loopback mount.
> 
> After umount I got the "VFS: Busy inodes after unmount" message
> because the processing of the rename has not finished.
> 
> Below is a patch that uses the new silly_count mechanism to
> synchronize sillyname processing and umounts. The patch introduces a
> nfs_put_super() routine that waits until the nfsi->silly_count count
> is zero.
> 
> A side-effect of finding and waiting for all the inodes to
> finish the sillyname processing is that I need to traverse
> the sb->s_inodes list in the super block. To do that
> safely the inode_lock spin lock has to be held. So for
> modules to be able to "see" that lock I needed to
> EXPORT_SYMBOL_GPL() it.
> 
> Any objections to exporting the inode_lock spin lock?
> If so, how should modules _safely_ access the s_inode list?
> 
> steved.
> 
> 
> Author: Steve Dickson <[EMAIL PROTECTED]>
> Date:   Wed Oct 31 12:19:26 2007 -0400
> 
>  Close a unlink/sillyname rename and umount race by added a
>  nfs_put_super routine that will run through all the inode
>  currently on the super block, waiting for those that are
>  in the middle of a sillyname rename or removal.
> 
>  This patch stop the infamous "VFS: Busy inodes after unmount... "
>  warning during umounts.
> 
>  Signed-off-by: Steve Dickson <[EMAIL PROTECTED]>
> 
> diff --git a/fs/inode.c b/fs/inode.c
> index ed35383..da9034a 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -81,6 +81,7 @@ static struct hlist_head *inode_hashtable __read_mostly;
>* the i_state of an inode while it is in use..
>*/
>   DEFINE_SPINLOCK(inode_lock);
> +EXPORT_SYMBOL_GPL(inode_lock);

That's going to make hch unhappy.

Your email client is performing space-stuffing.
See http://mbligh.org/linuxdocs/Email/Clients/Thunderbird

>   static struct file_system_type nfs_fs_type = {
>   .owner  = THIS_MODULE,
> @@ -223,6 +225,7 @@ static const struct super_operations nfs_sops = {
>   .alloc_inode= nfs_alloc_inode,
>   .destroy_inode  = nfs_destroy_inode,
>   .write_inode= nfs_write_inode,
> + .put_super  = nfs_put_super,
>   .statfs = nfs_statfs,
>   .clear_inode= nfs_clear_inode,
>   .umount_begin   = nfs_umount_begin,
> @@ -1767,6 +1770,30 @@ static void nfs4_kill_super(struct super_block *sb)
>   nfs_free_server(server);
>   }
> 
> +void nfs_put_super(struct super_block *sb)

This was (correctly) declared to be static.  We should define it that way
too (I didn't know you could do this, actually).

> +{
> + struct inode *inode;
> + struct nfs_inode *nfsi;
> + /*
> +  * Make sure there are no outstanding renames
> +  */
> +relock:
> + spin_lock(&inode_lock);
> + list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> + nfsi = NFS_I(inode);
> + if (atomic_read(&nfsi->silly_count) > 0) {
> + /* Keep this inode around  during the wait */
> + atomic_inc(&inode->i_count);
> + spin_unlock(&inode_lock);
> + wait_event(nfsi->waitqueue,
> + atomic_read(&nfsi->silly_count) == 1);
> + iput(inode);
> + goto relock;
> + }
> + }
> + spin_unlock(&inode_lock);
> +}

That's an O(n^2) search.  If it is at all possible to hit a catastrophic
slowdown in here, you can bet that someone out there will indeed hit it in
real life.

I'm too lazy to look, but we might need to check things like I_FREEING
and I_CLEAR before taking a ref on this inode.
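
To make the O(n^2) concern concrete, here is a small userspace model of the "goto relock" pattern, offered as a sketch under stated assumptions rather than a measurement of the real patch. The array busy[] stands in for inodes whose silly_count is still raised, and restart_scan_visits() is a hypothetical helper that counts how many list entries get visited when every wait forces a restart from the head.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Model of the restart-from-head loop: busy[i] != 0 stands for an inode
 * that still has sillyname work outstanding. Returns the number of list
 * entries visited before one full pass finds nothing left to wait for.
 */
unsigned long restart_scan_visits(int *busy, size_t n)
{
	unsigned long visits = 0;
	size_t i;

	assert(busy != NULL || n == 0);
relock:
	for (i = 0; i < n; i++) {
		visits++;
		if (busy[i]) {
			busy[i] = 0;	/* models waiting until the inode is idle */
			goto relock;	/* the lock was dropped, so rescan from the head */
		}
	}
	return visits;
}
```

When all n entries are busy the model visits n(n+1)/2 + n entries in total, so the cost grows quadratically with the length of the list.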


Re: writeout stalls in current -git

2007-11-05 Thread David Chinner
On Mon, Nov 05, 2007 at 07:27:16PM +0100, Torsten Kaiser wrote:
> On 11/5/07, David Chinner <[EMAIL PROTECTED]> wrote:
> > Ok, so it's probably a side effect of the writeback changes.
> >
> > Attached are two patches (two because one was in a separate patchset as
> > a standalone change) that should prevent async writeback from blocking
> > on locked inode cluster buffers. Apply the xfs-factor-inotobp patch first.
> > Can you see if this fixes the problem?
> 
> Now testing v2.6.24-rc1-650-gb55d1b1+ the fix for the misapplied raid5-patch
> Applying your two patches on top of that does not fix the stalls.

So you are having RAID5 problems as well?

I'm struggling to understand what could possibly have changed in XFS or
writeback that would lead to stalls like this, esp. as you appear to be
removing files when the stalls occur. Rather than vmstat, can you use
something like iostat to show how busy your disks are?  i.e. are we seeing
RMW cycles in the raid5 or some such issue.

OOC, what is the 'xfs_info ' output for your filesystem? 

> vmstat 10 output from unmerging (uninstalling) a kernel:
>  1  0  0 3512188332 19264400   18512  368  735 10  3 85  1
> -> emerge starts to remove the kernel source files
>  3  0  0 3506624332 1928360015  9825 2458 8307  7 12 81  0
>  0  0  0 3507212332 19283600 0   554  630 1233  0  1 99  0
>  0  0  0 3507292332 19283600 0   537  580 1328  0  1 99  0
>  0  0  0 3507168332 19283600 0   633  626 1380  0  1 99  0
>  0  0  0 3507116332 19283600 0  1510  768 2030  1  2 97  0
>  0  0  0 3507596332 19283600 0   524  540 1544  0  0 99  0
> procs ---memory-- ---swap-- -io -system-- cpu
>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
>  0  0  0 3507540332 19283600 0   489  551 1293  0  0 99  0
>  0  0  0 3507528332 19283600 0   527  510 1432  1  1 99  0
>  0  0  0 3508052332 19284000 0  2088  910 2964  2  3 95  0
>  0  0  0 3507888332 19284000 0   442  565 1383  1  1 99  0
>  0  0  0 3508704332 19284000 0   497  529 1479  0  0 99  0
>  0  0  0 3508704332 19284000 0   594  595 1458  0  0 99  0
>  0  0  0 3511492332 19284000 0  2381 1028 2941  2  3 95  0
>  0  0  0 3510684332 19284000 0   699  600 1390  0  0 99  0
>  0  0  0 3511636332 19284000 0   741  661 1641  0  0 100  0
>  0  0  0 3524020332 19284000 0  2452 1080 3910  2  3 95  0
>  0  0  0 3524040332 19284400 0   530  617 1297  0  0 99  0
>  0  0  0 3524128332 19284400 0   812  674 1667  0  1 99  0
>  0  0  0 3527000332 19367200   339   721  754 1681  3  2 93  1
> -> emerge is finished, no dirty or writeback data in /proc/meminfo

At this point, can you run a "sync" and see how long that takes to
complete? The only thing I can think that would be written out after
this point is inodes, but even then it seems to go on for a long,
long time and it really doesn't seem like XFS is holding up the
inode writes.

Another option is to use blktrace/blkparse to determine which process is
issuing this I/O.

>  0  0  0 3583780332 19506000 0   494  555 1080  0  1 99  0
>  0  0  0 3584352332 19506000 099  347  559  0  0 99  0
>  0  0  0 3585232332 19506000 011  301  621  0  0 99  0
> -> disks go idle.
> 
> So these patches do not seem to be the source of these excessive disk 
> writes...

Well, the patches I posted should prevent blocking in the places that it
was seen, so if that does not stop the slowdowns then either the writeback
code is not feeding us inodes fast enough or the block device below is
having some kind of problem.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: Problem with accessing namespace_sem from LSM.

2007-11-05 Thread Arjan van de Ven
On Tue, 06 Nov 2007 13:00:41 +0900
Tetsuo Handa <[EMAIL PROTECTED]> wrote:

> Hello.
> 
> I found that accessing namespace_sem from security_inode_create()
> causes a lockdep warning when compiled with CONFIG_PROVE_LOCKING=y.
> 
> 

sounds like you have an AB-BA deadlock...

-- 
If you want to reach me at my work email, use [EMAIL PROTECTED]
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org


Problem with accessing namespace_sem from LSM.

2007-11-05 Thread Tetsuo Handa
Hello.

I found that accessing namespace_sem from security_inode_create()
causes a lockdep warning when compiled with CONFIG_PROVE_LOCKING=y.



===
[ INFO: possible circular locking dependency detected ]
---
klogd/1798 is trying to acquire lock:
 (&namespace_sem){}, at: [] _aa_perm_dentry+0x80/0x184 [apparmor]

but task is already holding lock:
 (&inode->i_mutex){--..}, at: [] mutex_lock+0x12/0x15

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> #1 (&inode->i_mutex){--..}:
   [] lock_acquire+0x4b/0x6a
   [] __mutex_lock_slowpath+0xb0/0x1f6
   [] mutex_lock+0x12/0x15
   [] graft_tree+0x5c/0xd4
   [] do_add_mount+0x84/0x100
   [] do_mount+0x602/0x659
   [] sys_mount+0x64/0x9b
   [] sysenter_past_esp+0x56/0x99

-> #0 (&namespace_sem){}:
   [] lock_acquire+0x4b/0x6a
   [] down_read+0x1e/0x31
   [] _aa_perm_dentry+0x80/0x184 [apparmor]
   [] aa_perm_dentry+0x62/0xa4 [apparmor]
   [] apparmor_inode_create+0x40/0x63 [apparmor]
   [] vfs_create+0x84/0x13e
   [] open_namei+0x169/0x635
   [] do_filp_open+0x20/0x36
   [] do_sys_open+0x40/0xbb
   [] sys_open+0x16/0x18
   [] sysenter_past_esp+0x56/0x99

other info that might help us debug this:

1 lock held by klogd/1798:
 #0:  (&inode->i_mutex){--..}, at: [] mutex_lock+0x12/0x15

stack backtrace:
 [] show_trace+0xd/0x10
 [] dump_stack+0x19/0x1b
 [] print_circular_bug_tail+0x59/0x64
 [] __lock_acquire+0x7ea/0x973
 [] lock_acquire+0x4b/0x6a
 [] down_read+0x1e/0x31
 [] _aa_perm_dentry+0x80/0x184 [apparmor]
 [] aa_perm_dentry+0x62/0xa4 [apparmor]
 [] apparmor_inode_create+0x40/0x63 [apparmor]
 [] vfs_create+0x84/0x13e
 [] open_namei+0x169/0x635
 [] do_filp_open+0x20/0x36
 [] do_sys_open+0x40/0xbb
 [] sys_open+0x16/0x18
 [] sysenter_past_esp+0x56/0x99



If this warning is true,
AppArmor shipped with OpenSuSE 10.1 and 10.2 is affected.

- Kernel 2.6.16.53-0.16 for OpenSuSE 10.1 -

do_add_mount() { /* in fs/namespace.c */
  down_write(&namespace_sem);
  graft_tree() {
    mutex_lock(&nd->dentry->d_inode->i_mutex);
    ...
    mutex_unlock(&nd->dentry->d_inode->i_mutex);
  }
  up_write(&namespace_sem);
}

open_namei() { /* in fs/namei.c */
  mutex_lock(&dir->d_inode->i_mutex);
  vfs_create() {
    security_inode_create() {
      subdomain_inode_create() { /* in security/apparmor/lsm.c */
        sd_perm_dentry() { /* in security/apparmor/main.c */
          _sd_perm_dentry() {
            sd_path_begin() { /* in security/apparmor/inline.h */
              sd_path_begin2() {
                down_read(&namespace_sem);
              }
            }
            ...
            sd_path_end() {
              up_read(&namespace_sem);
            }
          }
        }
      }
    }
  }
  mutex_unlock(&dir->d_inode->i_mutex);
}

- Kernel 2.6.18.8-0.7 for OpenSuSE 10.2 -

do_add_mount() { /* in fs/namespace.c */
  down_write(&namespace_sem);
  graft_tree() {
    mutex_lock(&nd->dentry->d_inode->i_mutex);
    ...
    mutex_unlock(&nd->dentry->d_inode->i_mutex);
  }
  up_write(&namespace_sem);
}

open_namei() { /* in fs/namei.c */
  mutex_lock(&dir->d_inode->i_mutex);
  vfs_create() {
    security_inode_create() {
      apparmor_inode_create() { /* in security/apparmor/lsm.c */
        aa_perm_dentry() { /* in security/apparmor/lsm.c */
          _aa_perm_dentry() {
            aa_path_begin() { /* in security/apparmor/inline.h */
              aa_path_begin2() {
                down_read(&namespace_sem);
              }
            }
            ...
            aa_path_end() {
              up_read(&namespace_sem);
            }
          }
        }
      }
    }
  }
  mutex_unlock(&dir->d_inode->i_mutex);
}
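
The two call chains above take the same pair of locks in opposite orders: the mount path holds namespace_sem and then takes i_mutex, while the open path holds i_mutex and then takes namespace_sem. A minimal userspace sketch of how such an inversion can be spotted is to record every "held, then wanted" pair and flag the moment the reverse pair is already on record. The names struct order_graph and record_nested_lock() are hypothetical, and this toy only catches direct two-lock inversions, whereas lockdep follows dependency chains of any length.

```c
#include <assert.h>
#include <stdbool.h>

enum { LOCK_NAMESPACE_SEM, LOCK_I_MUTEX, NR_LOCKS };

/* order[a][b] is true once some path has taken b while holding a. */
struct order_graph {
	bool order[NR_LOCKS][NR_LOCKS];
};

/*
 * Record that `wanted` is being taken while `held` is held. Returns true
 * if the opposite nesting was recorded earlier, i.e. an AB-BA inversion.
 */
bool record_nested_lock(struct order_graph *g, int held, int wanted)
{
	assert(held >= 0 && held < NR_LOCKS);
	assert(wanted >= 0 && wanted < NR_LOCKS);
	g->order[held][wanted] = true;
	return g->order[wanted][held];
}
```

Feeding it the mount path first and the open path second reports the inversion on the second call, which mirrors the order in which the dependency chain is printed in the warning.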

AppArmor shipped with OpenSuSE 10.3 and Ubuntu 7.10 will not be affected
since the kernel was modified to pass a vfsmount parameter
to VFS helper functions and LSM hooks.

TOMOYO Linux 2.x (which is implemented using LSM) is also affected
and I'm looking for a solution.
http://lkml.org/lkml/2007/11/5/55

A possible solution would be to pass a vfsmount parameter
to VFS helper functions and LSM hooks for all kernels.
I do hope that the "Pass struct vfsmount to ..." patches
are merged into the mainline kernel.

Regards.


Re: writeout stalls in current -git

2007-11-05 Thread Andrew Morton
On Fri, 2 Nov 2007 18:33:29 +0800
Fengguang Wu <[EMAIL PROTECTED]> wrote:

> On Fri, Nov 02, 2007 at 11:15:32AM +0100, Peter Zijlstra wrote:
> > On Fri, 2007-11-02 at 10:21 +0800, Fengguang Wu wrote:
> > 
> > > Interestingly, no background_writeout() appears, but only
> > > balance_dirty_pages() and wb_kupdate.  Obviously wb_kupdate won't
> > > block the process.
> > 
> > Yeah, the background threshold is not (yet) scaled. So it can happen
> > that the bdi_dirty limit is below the background limit.
> > 
> > I'm curious as to these stalls, though; I can't seem to think of
> > what goes wrong.. esp since most writeback seems to happen from pdflush.
> 
> Me confused too. The new debug patch will confirm whether emerge is
> waiting in balance_dirty_pages().
> 
> > (or I'm totally misreading it - quite possible as I'm still recovering
> > from a serious cold and not all the green stuff has yet figured out its
> > proper place wrt brain cells 'n stuff)
> 
> Do take care of yourself.
> 
> > 
> > I still have this patch floating around:
> 
> I think this patch is OK for 2.6.24 :-)
> 
> Reviewed-by: Fengguang Wu <[EMAIL PROTECTED]> 

I would prefer Tested-by: :(

> > 
> > ---
> > Subject: mm: speed up writeback ramp-up on clean systems
> > 
> > We allow violation of bdi limits if there is a lot of room on the
> > system. Once we hit half the total limit we start enforcing bdi limits
> > and bdi ramp-up should happen. Doing it this way avoids many small
> > writeouts on an otherwise idle system and should also speed up the
> > ramp-up.

Given the problems we're having in there I'm a bit reluctant to go tossing
hastily put together and inadequately tested stuff onto the fire.  And
that's what this patch looks like to me.

Wanna convince me otherwise?


Re: [ANN] Squashfs 3.3 released

2007-11-05 Thread Michael Tokarev
Phillip Lougher wrote:
> Hi,
> 
> I'm pleased to announce another release of Squashfs.  This is the 22nd
> release in just over five years.

Thanks Phillip.

A tiny bug[fix] I always forgot to send...  In fs/squashfs/inode.c,
constants TASK_UNINTERRUPTIBLE and TASK_INTERRUPTIBLE are used, but
sometimes they aren't defined (they are declared in linux/sched.h):

  CC [M]  fs/squashfs/inode.o
fs/squashfs/inode.c: In function 'squashfs_get_cached_block':
fs/squashfs/inode.c:367: error: 'TASK_UNINTERRUPTIBLE' undeclared (first use in this function)
fs/squashfs/inode.c:367: error: (Each undeclared identifier is reported only once
fs/squashfs/inode.c:367: error: for each function it appears in.)
fs/squashfs/inode.c:367: warning: implicit declaration of function 'schedule'
fs/squashfs/inode.c:404: error: 'TASK_INTERRUPTIBLE' undeclared (first use in this function)
fs/squashfs/inode.c: In function 'release_cached_fragment':
fs/squashfs/inode.c:499: error: 'TASK_UNINTERRUPTIBLE' undeclared (first use in this function)
fs/squashfs/inode.c:499: error: 'TASK_INTERRUPTIBLE' undeclared (first use in this function)
fs/squashfs/inode.c: In function 'get_cached_fragment':
fs/squashfs/inode.c:522: error: 'TASK_UNINTERRUPTIBLE' undeclared (first use in this function)
fs/squashfs/inode.c:559: error: 'TASK_INTERRUPTIBLE' undeclared (first use in this function)

I'm not exactly sure which config option is "at blame"
(this is an i486-based UP generic-hardware config), but
I'm not interested to know, either.  The following
trivial patch fixes it once and for all.

--- linux-2.6.22.orig/fs/squashfs/inode.c   2007-07-12 14:57:22.0 +0400
+++ linux-2.6.22/fs/squashfs/inode.c2007-07-12 14:57:53.0 +0400
@@ -31,6 +31,7 @@
 #include 
 #include 
+#include 
 #include 

 #include "squashfs.h"


It was needed for v3.2 too.

Thanks.

/mjt


Re: migratepage failures on reiserfs

2007-11-05 Thread Christoph Lameter
On Mon, 5 Nov 2007, Mel Gorman wrote:

> The grow_dev_page() pages should be reclaimable even though migration
> is not supported for those pages? They were marked movable as it was
> useful for lumpy reclaim taking back pages for hugepage allocations and
> the like. Would it make sense for memory unremove to attempt migration
> first and reclaim second?

Note that a page is still movable even if there is no file system method 
for migration available. In that case the page needs to be cleaned before 
it can be moved.



Re: + embed-a-struct-path-into-struct-nameidata-instead-of-nd-dentrymnt.patch added to -mm tree

2007-11-05 Thread Jörn Engel
On Mon, 5 November 2007 13:01:25 -0800, [EMAIL PROTECTED] wrote:
> 
> The patch titled
>  Embed a struct path into struct nameidata instead of nd->{dentry,mnt}
> has been added to the -mm tree.  Its filename is
>  embed-a-struct-path-into-struct-nameidata-instead-of-nd-dentrymnt.patch
> 
> *** Remember to use Documentation/SubmitChecklist when testing your code ***
> 
> See http://www.zip.com.au/~akpm/linux/patches/stuff/added-to-mm.txt to find
> out what to do about this
> 
> --
> Subject: Embed a struct path into struct nameidata instead of nd->{dentry,mnt}
> From: Jan Blunck <[EMAIL PROTECTED]>
> 
> Switch from nd->{dentry,mnt} to nd->path.{dentry,mnt} everywhere.
> 
> Signed-off-by: Jan Blunck <[EMAIL PROTECTED]>
> Signed-off-by: Andreas Gruenbacher <[EMAIL PROTECTED]>
> Acked-by: Christoph Hellwig <[EMAIL PROTECTED]>
> Cc: Al Viro <[EMAIL PROTECTED]>
> CC: 
> Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
Frowned-upon-by: Joern Engel <[EMAIL PROTECTED]>

This patch changes some 400 lines, most if not all of which get longer
and more complicated to read.  23 get sufficiently longer to require an
additional linebreak.  I can't remember complexity being invited into
the kernel without good reasoning, yet the patch description is
surprisingly low on reasoning:
> Switch from nd->{dentry,mnt} to nd->path.{dentry,mnt} everywhere.

The following two patches manage to remove 7 lines in total.  In total
23 were added, 7 removed, 400+ made longer and more complicated.  Is
there another more favorable metric?  Will this patchset prevent bugs?
Shrink the kernel size?  Anything?

If churn is the only effect of this, please consider it NAKed again.

Jörn

-- 
A surrounded army must be given a way out.
-- Sun Tzu


+ embed-a-struct-path-into-struct-nameidata-instead-of-nd-dentrymnt-unionfs.patch added to -mm tree

2007-11-05 Thread akpm

The patch titled
 Embed a struct path into struct nameidata instead of nd->{dentry,mnt}
has been added to the -mm tree.  Its filename is
 embed-a-struct-path-into-struct-nameidata-instead-of-nd-dentrymnt-unionfs.patch

*** Remember to use Documentation/SubmitChecklist when testing your code ***

See http://www.zip.com.au/~akpm/linux/patches/stuff/added-to-mm.txt to find
out what to do about this

--
Subject: Embed a struct path into struct nameidata instead of nd->{dentry,mnt}
From: Jan Blunck <[EMAIL PROTECTED]>

Switch from nd->{dentry,mnt} to nd->path.{dentry,mnt} everywhere.

Signed-off-by: Jan Blunck <[EMAIL PROTECTED]>
Signed-off-by: Andreas Gruenbacher <[EMAIL PROTECTED]>
Acked-by: Christoph Hellwig <[EMAIL PROTECTED]>
Cc: Al Viro <[EMAIL PROTECTED]>
CC: 
Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
---

 fs/unionfs/inode.c |   12 
 fs/unionfs/main.c  |9 -
 fs/unionfs/super.c |   17 -
 3 files changed, 16 insertions(+), 22 deletions(-)

diff -puN fs/unionfs/inode.c~embed-a-struct-path-into-struct-nameidata-instead-of-nd-dentrymnt-unionfs fs/unionfs/inode.c
--- a/fs/unionfs/inode.c~embed-a-struct-path-into-struct-nameidata-instead-of-nd-dentrymnt-unionfs
+++ a/fs/unionfs/inode.c
@@ -168,10 +168,8 @@ static struct dentry *unionfs_lookup(str
unionfs_read_lock(dentry->d_sb);
 
/* save the dentry & vfsmnt from namei */
-   if (nd) {
-   path_save.dentry = nd->dentry;
-   path_save.mnt = nd->mnt;
-   }
+   if (nd)
+   path_save = nd->path;
 
/*
 * unionfs_lookup_backend returns a locked dentry upon success,
@@ -180,10 +178,8 @@ static struct dentry *unionfs_lookup(str
ret = unionfs_lookup_backend(dentry, nd, INTERPOSE_LOOKUP);
 
/* restore the dentry & vfsmnt in namei */
-   if (nd) {
-   nd->dentry = path_save.dentry;
-   nd->mnt = path_save.mnt;
-   }
+   if (nd)
+   nd->path = path_save;
if (!IS_ERR(ret)) {
if (ret)
dentry = ret;
diff -puN fs/unionfs/main.c~embed-a-struct-path-into-struct-nameidata-instead-of-nd-dentrymnt-unionfs fs/unionfs/main.c
--- a/fs/unionfs/main.c~embed-a-struct-path-into-struct-nameidata-instead-of-nd-dentrymnt-unionfs
+++ a/fs/unionfs/main.c
@@ -227,11 +227,11 @@ void unionfs_reinterpose(struct dentry *
 int check_branch(struct nameidata *nd)
 {
/* XXX: remove in ODF code -- stacking unions allowed there */
-   if (!strcmp(nd->dentry->d_sb->s_type->name, UNIONFS_NAME))
+   if (!strcmp(nd->path.dentry->d_sb->s_type->name, UNIONFS_NAME))
return -EINVAL;
-   if (!nd->dentry->d_inode)
+   if (!nd->path.dentry->d_inode)
return -ENOENT;
-   if (!S_ISDIR(nd->dentry->d_inode->i_mode))
+   if (!S_ISDIR(nd->path.dentry->d_inode->i_mode))
return -ENOTDIR;
return 0;
 }
@@ -372,8 +372,7 @@ static int parse_dirs_option(struct supe
goto out;
}
 
-   lower_root_info->lower_paths[bindex].dentry = nd.dentry;
-   lower_root_info->lower_paths[bindex].mnt = nd.mnt;
+   lower_root_info->lower_paths[bindex] = nd.path;
 
set_branchperms(sb, bindex, perms);
set_branch_count(sb, bindex, 0);
diff -puN fs/unionfs/super.c~embed-a-struct-path-into-struct-nameidata-instead-of-nd-dentrymnt-unionfs fs/unionfs/super.c
--- a/fs/unionfs/super.c~embed-a-struct-path-into-struct-nameidata-instead-of-nd-dentrymnt-unionfs
+++ a/fs/unionfs/super.c
@@ -202,8 +202,8 @@ static noinline int do_remount_mode_opti
goto out;
}
for (idx = 0; idx < cur_branches; idx++)
-   if (nd.mnt == new_lower_paths[idx].mnt &&
-   nd.dentry == new_lower_paths[idx].dentry)
+   if (nd.path.mnt == new_lower_paths[idx].mnt &&
+   nd.path.dentry == new_lower_paths[idx].dentry)
break;
path_release(&nd);  /* no longer needed */
if (idx == cur_branches) {
@@ -245,8 +245,8 @@ static noinline int do_remount_del_optio
goto out;
}
for (idx = 0; idx < cur_branches; idx++)
-   if (nd.mnt == new_lower_paths[idx].mnt &&
-   nd.dentry == new_lower_paths[idx].dentry)
+   if (nd.path.mnt == new_lower_paths[idx].mnt &&
+   nd.path.dentry == new_lower_paths[idx].dentry)
break;
path_release(&nd);  /* no longer needed */
if (idx == cur_branches) {
@@ -329,8 +329,8 @@ static noinline int do_remount_add_optio
goto out;
}
for (idx = 0; idx < cur_branches; idx++)
-   if (nd.mnt == new_lower_paths[idx].mnt &&
-   nd.dentry == new_lower_paths[idx].dentry)
+ 

+ introduce-path_put.patch added to -mm tree

2007-11-05 Thread akpm

The patch titled
 Introduce path_put()
has been added to the -mm tree.  Its filename is
 introduce-path_put.patch

*** Remember to use Documentation/SubmitChecklist when testing your code ***

See http://www.zip.com.au/~akpm/linux/patches/stuff/added-to-mm.txt to find
out what to do about this

--
Subject: Introduce path_put()
From: Jan Blunck <[EMAIL PROTECTED]>

* Add path_put() functions for releasing a reference to the dentry and
  vfsmount of a struct path in the right order

* Switch from path_release(nd) to path_put(&nd->path)

* Rename dput_path() to path_put_conditional()

Signed-off-by: Jan Blunck <[EMAIL PROTECTED]>
Signed-off-by: Andreas Gruenbacher <[EMAIL PROTECTED]>
Acked-by: Christoph Hellwig <[EMAIL PROTECTED]>
Cc: 
Cc: Al Viro <[EMAIL PROTECTED]>
Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
---

 arch/alpha/kernel/osf_sys.c  |2 
 arch/mips/kernel/sysirix.c   |6 -
 arch/parisc/hpux/sys_hpux.c  |2 
 arch/powerpc/platforms/cell/spufs/syscalls.c |2 
 arch/sparc64/solaris/fs.c|4 -
 drivers/md/dm-table.c|2 
 drivers/mtd/mtdsuper.c   |4 -
 fs/afs/mntpt.c   |2 
 fs/autofs4/root.c|2 
 fs/block_dev.c   |2 
 fs/coda/pioctl.c |4 -
 fs/compat.c  |4 -
 fs/configfs/symlink.c|4 -
 fs/dquot.c   |2 
 fs/ecryptfs/main.c   |2 
 fs/exec.c|4 -
 fs/ext3/super.c  |4 -
 fs/ext4/super.c  |4 -
 fs/gfs2/ops_fstype.c |2 
 fs/inotify_user.c|4 -
 fs/namei.c   |   56 +
 fs/namespace.c   |   20 +++---
 fs/nfs/namespace.c   |2 
 fs/nfsctl.c  |2 
 fs/nfsd/export.c |   10 +--
 fs/nfsd/nfs4recover.c|2 
 fs/nfsd/nfs4state.c  |2 
 fs/open.c|   22 +++---
 fs/proc/base.c   |2 
 fs/reiserfs/super.c  |8 +-
 fs/revoke.c  |2 
 fs/stat.c|6 -
 fs/utimes.c  |2 
 fs/xattr.c   |   16 ++--
 fs/xfs/linux-2.6/xfs_ioctl.c |2 
 include/linux/namei.h|7 --
 include/linux/path.h |2 
 kernel/audit_tree.c  |   12 +--
 kernel/auditfilter.c |4 -
 net/sunrpc/rpc_pipe.c|2 
 net/unix/af_unix.c   |6 -
 41 files changed, 125 insertions(+), 124 deletions(-)

diff -puN arch/alpha/kernel/osf_sys.c~introduce-path_put arch/alpha/kernel/osf_sys.c
--- a/arch/alpha/kernel/osf_sys.c~introduce-path_put
+++ a/arch/alpha/kernel/osf_sys.c
@@ -261,7 +261,7 @@ osf_statfs(char __user *path, struct osf
retval = user_path_walk(path, &nd);
if (!retval) {
retval = do_osf_statfs(nd.path.dentry, buffer, bufsiz);
-   path_release(&nd);
+   path_put(&nd.path);
}
return retval;
 }
diff -puN arch/mips/kernel/sysirix.c~introduce-path_put arch/mips/kernel/sysirix.c
--- a/arch/mips/kernel/sysirix.c~introduce-path_put
+++ a/arch/mips/kernel/sysirix.c
@@ -711,7 +711,7 @@ asmlinkage int irix_statfs(const char __
}
 
 dput_and_out:
-   path_release(&nd);
+   path_put(&nd.path);
 out:
return error;
 }
@@ -1385,7 +1385,7 @@ asmlinkage int irix_statvfs(char __user 
error |= __put_user(0, &buf->f_fstr[i]);
 
 dput_and_out:
-   path_release(&nd);
+   path_put(&nd.path);
 out:
return error;
 }
@@ -1636,7 +1636,7 @@ asmlinkage int irix_statvfs64(char __use
error |= __put_user(0, &buf->f_fstr[i]);
 
 dput_and_out:
-   path_release(&nd);
+   path_put(&nd.path);
 out:
return error;
 }
diff -puN arch/parisc/hpux/sys_hpux.c~introduce-path_put arch/parisc/hpux/sys_hpux.c
--- a/arch/parisc/hpux/sys_hpux.c~introduce-path_put
+++ a/arch/parisc/hpux/sys_hpux.c
@@ -222,7 +222,7 @@ asmlinkage long hpux_statfs(const char _
error = vfs_statfs_hpux(nd.path.dentry, &tmp);
if (!error && copy_to_user(buf, &tmp, sizeof(tmp)))
error = -EFAULT;
-   path_release(&nd);
+   path_put(&nd.path);
}
return error;
 }
di

Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland

2007-11-05 Thread Hugh Dickins
On Mon, 5 Nov 2007, Dave Hansen wrote:
> 
> Actually, I think your s/while/if/ change is probably a decent fix.

Any resemblance to a decent fix is purely coincidental.

> Barring any other races, that loop should always have made progress on
> mnt->__mnt_writers the way it is written.  If we get to:
> 
> > lock_and_coalesce_cpu_mnt_writer_counts();
> ->HERE
> > mnt_unlock_cpus();
> 
> and don't have a positive mnt->__mnt_writers, we know something is going
> badly.  We WARN_ON() there, which should at least give an earlier
> warning that the system is not doing well.  But it doesn't fix the
> inevitable.  Could you try the attached patch and see if it at least
> warns you earlier?

Thanks, Dave, yes, that gives me a nice warning:

leak detected on mount(c25ebd80) writers count: -65537
WARNING: at fs/namespace.c:249 handle_write_count_underflow()
 [] show_trace_log_lvl+0x1b/0x2e
 [] show_trace+0x16/0x1b
 [] dump_stack+0x19/0x1e
 [] handle_write_count_underflow+0x4c/0x60
 [] mnt_drop_write+0x69/0x8e
 [] __fput+0xff/0x162
 [] fput+0x2e/0x33
 [] unionfs_file_release+0xc2/0x1c5
 [] __fput+0x8f/0x162
 [] fput+0x2e/0x33
 [] filp_close+0x50/0x5d
 [] sys_close+0x74/0xb4
 [] sysenter_past_esp+0x5f/0x85

and the test then goes quietly on its way instead of hanging.  Though
I imagine, with your patch or mine, that it's then making an unfortunate
frequency of calls to lock_and_coalesce_longer_name_than_I_care_to_type
thereafter.  But it's hardly your responsibility to optimize for bugs
elsewhere.

The 2.6.23-mm1 tree has MNT_USER at 0x200, so I adjusted your flag to
#define MNT_IMBALANCED_WRITE_COUNT  0x400 /* just for debugging */

> 
> I have a decent guess what the bug is, too.  In the unionfs code:

I'll let Erez take it from there...

Hugh
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[3/4] Distributed storage. Algorithms.

2007-11-05 Thread Evgeniy Polyakov
Mirror and linear data striping algorithms for DST.
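
The DST documentation says the mirror algorithm reads from the "nearest"
node, tracking where the previous operation on each node stopped. A minimal
userspace sketch of that selection, assuming "nearest" means the smallest
absolute distance between a node's last position and the start of the new
request. struct mirror_node, last_pos and mirror_pick_read_node() are
hypothetical names for illustration and do not appear in alg_mirror.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-node state: sector where the previous operation stopped. */
struct mirror_node {
	uint64_t last_pos;
};

/*
 * Pick the node whose last position is closest to the start of the new
 * request. Ties go to the node with the lowest index.
 */
static size_t mirror_pick_read_node(const struct mirror_node *nodes,
				    size_t count, uint64_t start)
{
	size_t best = 0;
	uint64_t best_dist = UINT64_MAX;

	assert(nodes != NULL && count > 0);

	for (size_t i = 0; i < count; i++) {
		uint64_t pos = nodes[i].last_pos;
		uint64_t dist = pos > start ? pos - start : start - pos;

		if (dist < best_dist) {
			best = i;
			best_dist = dist;
		}
	}
	return best;
}
```

Writes need no such choice, since they are always mirrored to every
underlying node.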

Signed-off-by: Evgeniy Polyakov <[EMAIL PROTECTED]>

diff --git a/drivers/block/dst/alg_linear.c b/drivers/block/dst/alg_linear.c
new file mode 100644
index 000..cb77b57
--- /dev/null
+++ b/drivers/block/dst/alg_linear.c
@@ -0,0 +1,104 @@
+/*
+ * 2007+ Copyright (c) Evgeniy Polyakov <[EMAIL PROTECTED]>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include 
+#include 
+#include 
+#include 
+
+static struct dst_alg *alg_linear;
+
+/*
+ * This callback is invoked when node is removed from storage.
+ */
+static void dst_linear_del_node(struct dst_node *n)
+{
+}
+
+/*
+ * This callback is invoked when node is added to storage.
+ */
+static int dst_linear_add_node(struct dst_node *n)
+{
+   struct dst_storage *st = n->st;
+
+   dprintk("%s: disk_size: %llu, node_size: %llu.\n",
+   __func__, st->disk_size, n->size);
+
+   mutex_lock(&st->tree_lock);
+   n->start = st->disk_size;
+   st->disk_size += n->size;
+   mutex_unlock(&st->tree_lock);
+
+   return 0;
+}
+
+static int dst_linear_remap(struct dst_request *req)
+{
+   int err;
+
+   if (req->node->bdev) {
+   generic_make_request(req->bio);
+   return 0;
+   }
+
+   err = kst_check_permissions(req->state, req->bio);
+   if (err)
+   return err;
+
+   return req->state->ops->push(req);
+}
+
+/*
+ * Failover callback - it is invoked each time error happens during
+ * request processing.
+ */
+static int dst_linear_error(struct kst_state *st, int err)
+{
+   if (err)
+   set_bit(DST_NODE_FROZEN, &st->node->flags);
+   else
+   clear_bit(DST_NODE_FROZEN, &st->node->flags);
+   return 0;
+}
+
+static struct dst_alg_ops alg_linear_ops = {
+   .remap  = dst_linear_remap,
+   .add_node   = dst_linear_add_node,
+   .del_node   = dst_linear_del_node,
+   .error  = dst_linear_error,
+   .owner  = THIS_MODULE,
+};
+
+static int __devinit alg_linear_init(void)
+{
+   alg_linear = dst_alloc_alg("alg_linear", &alg_linear_ops);
+   if (!alg_linear)
+   return -ENOMEM;
+
+   return 0;
+}
+
+static void __devexit alg_linear_exit(void)
+{
+   dst_remove_alg(alg_linear);
+}
+
+module_init(alg_linear_init);
+module_exit(alg_linear_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Evgeniy Polyakov <[EMAIL PROTECTED]>");
+MODULE_DESCRIPTION("Linear distributed algorithm.");
diff --git a/drivers/block/dst/alg_mirror.c b/drivers/block/dst/alg_mirror.c
new file mode 100644
index 000..1b55f4d
--- /dev/null
+++ b/drivers/block/dst/alg_mirror.c
@@ -0,0 +1,1113 @@
+/*
+ * 2007+ Copyright (c) Evgeniy Polyakov <[EMAIL PROTECTED]>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+
+struct dst_mirror_node_data
+{
+   u64 age;
+};
+
+struct dst_mirror_priv
+{
+   unsigned intchunk_num;
+
+   u64 last_start;
+
+   spinlock_t  backlog_lock;
+   struct list_headbacklog_list;
+
+   struct dst_mirror_node_data old_data, new_data;
+
+   unsigned long   *chunk;
+};
+
+static struct dst_alg *alg_mirror;
+static struct bio_set *dst_mirror_bio_set;
+
+static int dst_mirror_resync(struct dst_node *n, int ndp);
+
+static void dst_mirror_mark_sync(struct dst_node *n)
+{
+   if (test_bit(DST_NODE_NOTSYNC, &n->flags)) {
+   struct dst_mirror_priv *priv = n->priv;
+
+   clear_bit(DST_NODE_NOTSYNC, &n->flags);
+   dprintk("%s: node: %p, %llu:%llu synchronization "
+   "has been completed.\n",
+   __func__, n, n->start, n->size);
+   priv->old_data.age = 0;
+   }
+}
+
+static void dst_mirror_mark_notsync(struct dst_node *n)
+{
+   if (!test_bit(DST_NODE_NOTSYNC, &n->flags)) {
+

[4/4] Distributed storage. Core interfaces.

2007-11-05 Thread Evgeniy Polyakov
This one contains core interfaces of the distributed storage, storage
and node initialization and cleanup code, block layer callbacks and the
like. It also contains Kconfig and Makefile changes.

Signed-off-by: Evgeniy Polyakov <[EMAIL PROTECTED]>

diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index b4c8319..ca6592d 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -451,6 +451,8 @@ config ATA_OVER_ETH
This driver provides Support for ATA over Ethernet block
devices like the Coraid EtherDrive (R) Storage Blade.
 
+source "drivers/block/dst/Kconfig"
+
 source "drivers/s390/block/Kconfig"
 
 endmenu
diff --git a/drivers/block/Makefile b/drivers/block/Makefile
index dd88e33..fcf042d 100644
--- a/drivers/block/Makefile
+++ b/drivers/block/Makefile
@@ -29,3 +29,4 @@ obj-$(CONFIG_VIODASD) += viodasd.o
 obj-$(CONFIG_BLK_DEV_SX8)  += sx8.o
 obj-$(CONFIG_BLK_DEV_UB)   += ub.o
 
+obj-$(CONFIG_DST)  += dst/
diff --git a/drivers/block/dst/Kconfig b/drivers/block/dst/Kconfig
new file mode 100644
index 000..d35e0cc
--- /dev/null
+++ b/drivers/block/dst/Kconfig
@@ -0,0 +1,21 @@
+config DST
+   tristate "Distributed storage"
+   depends on NET
+   select CONNECTOR
+   select LIBCRC32C
+   ---help---
+   This driver allows to create a distributed storage.
+
+config DST_ALG_LINEAR
+   tristate "Linear distribution algorithm"
+   depends on DST
+   ---help---
+   This module allows to create linear mapping of the nodes
+   in the distributed storage.
+
+config DST_ALG_MIRROR
+   tristate "Mirror distribution algorithm"
+   depends on DST
+   ---help---
+   This module allows to create a mirror of the nodes in the
+   distributed storage.
diff --git a/drivers/block/dst/Makefile b/drivers/block/dst/Makefile
new file mode 100644
index 000..1400e94
--- /dev/null
+++ b/drivers/block/dst/Makefile
@@ -0,0 +1,6 @@
+obj-$(CONFIG_DST) += dst.o
+
+dst-y := dcore.o kst.o
+
+obj-$(CONFIG_DST_ALG_LINEAR) += alg_linear.o
+obj-$(CONFIG_DST_ALG_MIRROR) += alg_mirror.o

diff --git a/drivers/block/dst/dcore.c b/drivers/block/dst/dcore.c
new file mode 100644
index 000..2b3ef10
--- /dev/null
+++ b/drivers/block/dst/dcore.c
@@ -0,0 +1,1608 @@
+/*
+ * 2007+ Copyright (c) Evgeniy Polyakov <[EMAIL PROTECTED]>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+
+static LIST_HEAD(dst_storage_list);
+static LIST_HEAD(dst_alg_list);
+static DEFINE_MUTEX(dst_storage_lock);
+static DEFINE_MUTEX(dst_alg_lock);
+static int dst_major;
+static struct kst_worker *kst_main_worker;
+static struct cb_id cn_dst_id = { CN_DST_IDX, CN_DST_VAL };
+
+struct kmem_cache *dst_request_cache;
+
+static char dst_name[] = "Squizzed black-out of the dancing back-aching hippo";
+
+/*
+ * DST sysfs tree. For device called 'storage' which is formed
+ * on top of two nodes this looks like this:
+ *
+ * /sys/devices/storage/
+ * /sys/devices/storage/alg : alg_linear
+ * /sys/devices/storage/n-800/type : R: 192.168.4.80:1025
+ * /sys/devices/storage/n-800/size : 800
+ * /sys/devices/storage/n-800/start : 800
+ * /sys/devices/storage/n-800/clean
+ * /sys/devices/storage/n-800/dirty
+ * /sys/devices/storage/n-0/type : R: 192.168.4.81:1025
+ * /sys/devices/storage/n-0/size : 800
+ * /sys/devices/storage/n-0/start : 0
+ * /sys/devices/storage/n-0/clean
+ * /sys/devices/storage/n-0/dirty
+ * /sys/devices/storage/remove_all_nodes
+ * /sys/devices/storage/nodes : sectors (start [size]): 0 [800] | 800 [800]
+ * /sys/devices/storage/name : storage
+ */
+
+static int dst_dev_match(struct device *dev, struct device_driver *drv)
+{
+   return 1;
+}
+
+static void dst_dev_release(struct device *dev)
+{
+}
+
+static struct bus_type dst_dev_bus_type = {
+   .name   = "dst",
+   .match  = &dst_dev_match,
+};
+
+static struct device dst_dev = {
+   .bus= &dst_dev_bus_type,
+   .release= &dst_dev_release
+};
+
+static void dst_node_release(struct device *dev)
+{
+}
+
+static struct device dst_node_dev = {
+   .release= &dst_node_release
+};
+
+static void dst_free_alg(struct dst_alg *alg)
+{
+   kfree(alg);
+}
+
+/*
+ * Algorithm is never freed directly,
+ * since its module reference counter is increased
+ * by storage when it is created - just

[2/4] Distributed storage. Network processing.

2007-11-05 Thread Evgeniy Polyakov
This file contains all bits needed for async non-blocking network
processing of the block requests directed to DST.

Signed-off-by: Evgeniy Polyakov <[EMAIL PROTECTED]>

diff --git a/drivers/block/dst/kst.c b/drivers/block/dst/kst.c
new file mode 100644
index 000..ba5e5ef
--- /dev/null
+++ b/drivers/block/dst/kst.c
@@ -0,0 +1,1475 @@
+/*
+ * 2007+ Copyright (c) Evgeniy Polyakov <[EMAIL PROTECTED]>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+
+struct kst_poll_helper
+{
+   poll_table  pt;
+   struct kst_state*st;
+};
+
+static LIST_HEAD(kst_worker_list);
+static DEFINE_MUTEX(kst_worker_mutex);
+
+/*
+ * This function creates bound socket for local export node.
+ */
+static int kst_sock_create(struct kst_state *st, struct saddr *addr,
+   int type, int proto, int backlog)
+{
+   int err;
+
+   err = sock_create(addr->sa_family, type, proto, &st->socket);
+   if (err)
+   goto err_out_exit;
+
+   err = st->socket->ops->bind(st->socket, (struct sockaddr *)addr,
+   addr->sa_data_len);
+   if (err)
+   goto err_out_release;
+
+   err = st->socket->ops->listen(st->socket, backlog);
+   if (err)
+   goto err_out_release;
+
+   st->socket->sk->sk_allocation = GFP_NOIO;
+
+   return 0;
+
+err_out_release:
+   sock_release(st->socket);
+err_out_exit:
+   return err;
+}
+
+static void kst_sock_release(struct kst_state *st)
+{
+   if (st->socket) {
+   sock_release(st->socket);
+   st->socket = NULL;
+   }
+}
+
+void kst_wake(struct kst_state *st)
+{
+   if (st) {
+   struct kst_worker *w = st->node->w;
+   unsigned long flags;
+
+   spin_lock_irqsave(&w->ready_lock, flags);
+   if (list_empty(&st->ready_entry))
+   list_add_tail(&st->ready_entry, &w->ready_list);
+   spin_unlock_irqrestore(&w->ready_lock, flags);
+
+   wake_up(&w->wait);
+   }
+}
+EXPORT_SYMBOL_GPL(kst_wake);
+
+/*
+ * Polling machinery.
+ */
+static int kst_state_wake_callback(wait_queue_t *wait, unsigned mode,
+   int sync, void *key)
+{
+   struct kst_state *st = container_of(wait, struct kst_state, wait);
+   kst_wake(st);
+   return 1;
+}
+
+static void kst_queue_func(struct file *file, wait_queue_head_t *whead,
+poll_table *pt)
+{
+   struct kst_state *st = container_of(pt, struct kst_poll_helper, pt)->st;
+
+   st->whead = whead;
+   init_waitqueue_func_entry(&st->wait, kst_state_wake_callback);
+   add_wait_queue(whead, &st->wait);
+}
+
+static void kst_poll_exit(struct kst_state *st)
+{
+   if (st->whead) {
+   remove_wait_queue(st->whead, &st->wait);
+   st->whead = NULL;
+   }
+}
+
+/*
+ * This function removes request from state tree and ordering list.
+ */
+void kst_del_req(struct dst_request *req)
+{
+   list_del_init(&req->request_list_entry);
+}
+EXPORT_SYMBOL_GPL(kst_del_req);
+
+static struct dst_request *kst_req_first(struct kst_state *st)
+{
+   struct dst_request *req = NULL;
+
+   if (!list_empty(&st->request_list))
+   req = list_entry(st->request_list.next, struct dst_request,
+   request_list_entry);
+   return req;
+}
+
+/*
+ * This function dequeues first request from the queue and tree.
+ */
+static struct dst_request *kst_dequeue_req(struct kst_state *st)
+{
+   struct dst_request *req;
+
+   mutex_lock(&st->request_lock);
+   req = kst_req_first(st);
+   if (req)
+   kst_del_req(req);
+   mutex_unlock(&st->request_lock);
+   return req;
+}
+
+/*
+ * This function enqueues request into tree, indexed by start of the request,
+ * and also puts request into ordered queue.
+ */
+int kst_enqueue_req(struct kst_state *st, struct dst_request *req)
+{
+   if (unlikely(req->flags & DST_REQ_CHECK_QUEUE)) {
+   struct dst_request *r;
+
+   list_for_each_entry(r, &st->request_list, request_list_entry) {
+   if (bio_rw(r->bio) != bio_rw(req->bio))
+   continue;
+
+   if (r->start >= req->start + req->size)
+   continue;
+
+   if (r->start + r->size <= req->sta

[1/4] Distributed storage. Documentation.

2007-11-05 Thread Evgeniy Polyakov
DST documentation.

Signed-off-by: Evgeniy Polyakov <[EMAIL PROTECTED]>

diff --git a/Documentation/dst/algorithms.txt b/Documentation/dst/algorithms.txt
new file mode 100644
index 000..1437a6a
--- /dev/null
+++ b/Documentation/dst/algorithms.txt
@@ -0,0 +1,115 @@
+Each storage by itself is just a set of contiguous logical blocks, with
+allowed number of operations. Nodes, each of which has own start and size,
+are placed into storage by appropriate algorithm, which remaps
+logical sector number into real node's sector. One can create
+own algorithms, since DST has pluggable interface for that.
+Currently mirrored and linear algorithms are supported.
+
+Let's briefly describe how they work.
+
+Linear algorithm.
+Simple approach of concatenating storages into single device with
+increased size is used in this algorithm. Essentially new device
+has size equal to sum of sizes of underlying nodes and nodes are
+placed one after another.
+
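+  /---- Node 1 ----\                       /---- Node 3 ----\
+start              end                  start               end
+ |==================|=====================|==================|
+ |                start                  end                 |
+ |                   \----- Node 2 ------/                   |
+ |                                                           |
+start                                                       end
+ \----------------------- DST storage -----------------------/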
+
+   /\
+   ||
+   ||
+
+  IO operations
+
+   Figure 1. 
+ 3 nodes combined into single storage using linear algorithm.
+
+Mirror algorithm.
+In this algorithm nodes are placed under each other, so when
+operation comes to the first one, it can be mirrored to all
+underlying nodes. In case of reading, actual data is obtained from
+the nearest node - algorithm keeps track of previous operation
+and knows where it was stopped, so that subsequent seek to the 
+start of the new request will take the shortest time.
+Writing is always mirrored to all underlying nodes.
+
+  IO operations
+   ||
+   ||
+   \/
+
+| DST storage ---|
+|  prev position |
+|---| Node 1 |
+|  prev pos  |
+| Node 2 -|--|
+|prev pos|
+|---| Node 3 |
+
+   Figure 2.
+   3 nodes combined into single storage using mirror algorithm.
+
+Each algorithm must implement number of callbacks,
+which must be registered during initialization time.
+
+struct dst_alg_ops
+{
+   int (*add_node)(struct dst_node *n);
+   void(*del_node)(struct dst_node *n);
+   int (*remap)(struct dst_request *req);
+   int (*error)(struct kst_state *state, int err);
+   struct module   *owner;
+};
+
+@add_node
+This callback is invoked when new node is being added into the storage,
+but before node is actually added into the storage, so that it could
+be accessed from it. When it is called, all appropriate initialization
+of the underlying device is already completed (system has been connected
+to remote node or got a reference to the local block device). At this
+stage algorithm can add node into private map. 
+It must return zero on success or negative value otherwise.
+
+@del_node
+This callback is invoked when node is being deleted from the storage,
+i.e. when its reference counter hits zero. It is called before
+any cleaning is performed.
+This callback does not return a value.
+
+@remap
+This callback is invoked each time new bio hits the storage.
+Request structure contains BIO itself, pointer to the node, which originally
+stores the whole region under given IO request, and various parameters
+used by storage core to process this block request.
+It must return zero on success or negative value otherwise. It is up to
+this method to call all cleaning if remapping failed, for example it must
+call kst_bio_endio() for given callback in case of error, which in turn
+will call bio_endio(). Note that dst_request structure provided in this
+callback is allocated on stack, so if there is a need to use it outside
+of the given function, it must be cloned (it will happen automatically
+in state's push callback, but that copy will not be shared by any other
+user).
+
+@error
+This callback is invoked for each error, which happened while processing
+requests for remote nodes or when talking to remote side
+of the local export node (state contains data related to data
+transfers over the networ

[0/4] Distributed storage. Squizzed black-out of the dancing back-aching hippo.

2007-11-05 Thread Evgeniy Polyakov
Hi.

I'm pleased to announce the 7th and final release of the distributed
storage subsystem (DST). It allows to form a storage on top of local and
remote nodes and combine them in linear or mirroring setup, which in
turn can be exported to remote nodes.
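
To illustrate the linear setup, a minimal userspace sketch of how nodes
concatenate into one address space. linear_add_node() mirrors what
dst_linear_add_node() in patch 3/4 does (each new node starts where the
storage currently ends). linear_lookup() and the trimmed-down structs are
hypothetical helpers for illustration, not DST code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Trimmed-down models: sizes and offsets are in sectors. */
struct node {
	uint64_t start;
	uint64_t size;
};

struct storage {
	uint64_t disk_size;
};

/* Append a node: it starts at the current end of the storage. */
static void linear_add_node(struct storage *st, struct node *n)
{
	n->start = st->disk_size;
	st->disk_size += n->size;
}

/*
 * Hypothetical lookup: return the index of the node holding 'sector' and
 * store the node-local offset, or return -1 if the sector is out of range.
 */
static int linear_lookup(const struct node *nodes, size_t count,
			 uint64_t sector, uint64_t *offset)
{
	assert(offset != NULL);

	for (size_t i = 0; i < count; i++) {
		if (sector >= nodes[i].start &&
		    sector - nodes[i].start < nodes[i].size) {
			*offset = sector - nodes[i].start;
			return (int)i;
		}
	}
	return -1;
}
```

With two nodes of 800 sectors each, this gives starts of 0 and 800 and a
total size of 1600, matching the sysfs example in the dcore.c comment.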

Short changelog:
* added strong checksum support (Castagnoli crc)
* extended autoconfiguration (added ability to request if remote
side supports strong checksum and turn it on if needed)
* documentation addon - sysfs files
* added clean/dirty sysfs files which allow marking a
node as clean (in sync) or dirty (not in sync)
* fair number of bug fixes (including really tricky
bastards, which are unlikely to be found in real
setups, but which were still bugs)
* and the main one - added release name (it clearly shows my condition)
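
For reference, the Castagnoli CRC mentioned in the changelog is CRC-32C. A
minimal bitwise userspace sketch follows, assuming the conventional
parameters (reflected polynomial 0x82F63B78, initial value and final XOR of
0xFFFFFFFF). How DST seeds and finalizes its checksums is not shown in this
thread, so treat this only as an illustration of the polynomial.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32C (Castagnoli) over a buffer, least significant bit first. */
static uint32_t crc32c(const void *data, size_t len)
{
	const uint8_t *p = data;
	uint32_t crc = 0xFFFFFFFFu;

	assert(data != NULL || len == 0);

	while (len--) {
		crc ^= *p++;
		for (int bit = 0; bit < 8; bit++) {
			/* XOR in the polynomial only when the low bit is set. */
			uint32_t mask = 0u - (crc & 1u);

			crc = (crc >> 1) ^ (0x82F63B78u & mask);
		}
	}
	return crc ^ 0xFFFFFFFFu;
}
```

The standard check value for the ASCII string "123456789" is 0xE3069283.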

Overall list of features of the DST can be found on project's homepage:

http://tservice.net.ru/~s0mbre/old/?section=projects&item=dst

Thank you.

Signed-off-by: Evgeniy Polyakov <[EMAIL PROTECTED]>


-- 
Evgeniy Polyakov


Re: writeout stalls in current -git

2007-11-05 Thread Torsten Kaiser
On 11/5/07, David Chinner <[EMAIL PROTECTED]> wrote:
> Ok, so it's probably a side effect of the writeback changes.
>
> Attached are two patches (two because one was in a separate patchset as
> a standalone change) that should prevent async writeback from blocking
> on locked inode cluster buffers. Apply the xfs-factor-inotobp patch first.
> Can you see if this fixes the problem?

Now testing v2.6.24-rc1-650-gb55d1b1+ the fix for the misapplied raid5-patch
Applying your two patches on top of that does not fix the stalls.

vmstat 10 output from unmerging (uninstalling) a kernel:
 1  0  0 3512188332 19264400   18512  368  735 10  3 85  1
-> emerge starts to remove the kernel source files
 3  0  0 3506624332 1928360015  9825 2458 8307  7 12 81  0
 0  0  0 3507212332 19283600 0   554  630 1233  0  1 99  0
 0  0  0 3507292332 19283600 0   537  580 1328  0  1 99  0
 0  0  0 3507168332 19283600 0   633  626 1380  0  1 99  0
 0  0  0 3507116332 19283600 0  1510  768 2030  1  2 97  0
 0  0  0 3507596332 19283600 0   524  540 1544  0  0 99  0
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 0  0  0 3507540332 19283600 0   489  551 1293  0  0 99  0
 0  0  0 3507528332 19283600 0   527  510 1432  1  1 99  0
 0  0  0 3508052332 19284000 0  2088  910 2964  2  3 95  0
 0  0  0 3507888332 19284000 0   442  565 1383  1  1 99  0
 0  0  0 3508704332 19284000 0   497  529 1479  0  0 99  0
 0  0  0 3508704332 19284000 0   594  595 1458  0  0 99  0
 0  0  0 3511492332 19284000 0  2381 1028 2941  2  3 95  0
 0  0  0 3510684332 19284000 0   699  600 1390  0  0 99  0
 0  0  0 3511636332 19284000 0   741  661 1641  0  0 100  0
 0  0  0 3524020332 19284000 0  2452 1080 3910  2  3 95  0
 0  0  0 3524040332 19284400 0   530  617 1297  0  0 99  0
 0  0  0 3524128332 19284400 0   812  674 1667  0  1 99  0
 0  0  0 3527000332 19367200   339   721  754 1681  3  2 93  1
-> emerge is finished, no dirty or writeback data in /proc/meminfo
 0  0  0 3571056332 19476800   111   639  632 1344  0  1 99  0
 0  0  0 3571260332 19476800 0   757  688 1405  1  0 99  0
 0  0  0 3571156332 19476800 0   753  641 1361  0  0 99  0
 0  0  0 3571404332 19476800 0   766  653 1389  0  0 99  0
 1  0  0 3571136332 19476800 6   764  669 1488  0  0 99  0
 0  0  0 3571668332 19482400 0   764  657 1482  0  0 99  0
 0  0  0 3571848332 19482400 0   673  659 1406  0  0 99  0
 0  0  0 3571908332 1950520022   753  638 1500  0  1 99  0
 0  0  0 3573052332 19505200 0   765  631 1482  0  1 99  0
 0  0  0 3574144332 19505200 0   771  640 1497  0  0 99  0
 0  0  0 3573468332 19505200 0   458  485 1251  0  0 99  0
 0  0  0 3574184332 19505200 0   427  474 1192  0  0 100  0
 0  0  0 3575092332 19505200 0   461  482 1235  0  0 99  0
 0  0  0 3576368332 19505600 0   582  556 1310  0  0 99  0
 0  0  0 3579300332 19505600 0   695  571 1402  0  0 99  0
 0  0  0 3580376332 19505600 0   417  568  906  0  0 99  0
 0  0  0 3581212332 19505600 0   421  559  977  0  1 99  0
 0  0  0 3583780332 19506000 0   494  555 1080  0  1 99  0
 0  0  0 3584352332 19506000 099  347  559  0  0 99  0
 0  0  0 3585232332 19506000 011  301  621  0  0 99  0
-> disks go idle.

So these patches do not seem to be the source of these excessive disk writes...

Torsten


Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland

2007-11-05 Thread Dave Hansen
On Mon, 2007-11-05 at 15:40 +, Hugh Dickins wrote:
> The second problem was a hang: all cpus in
> handle_write_count_underflow
> doing lock_and_coalesce_cpu_mnt_writer_counts: new -mm stuff from Dave
> Hansen.  At first I thought that was a locking problem in Dave's code,
> but I now suspect it's that your unionfs reference counting is wrong
> somewhere, and the error accumulates until __mnt_writers drops below
> MNT_WRITER_UNDERFLOW_LIMIT, but the coalescence does nothing to help
> and we're stuck in that loop. 

I've never actually seen this happen in practice, but I do know exactly
what you're talking about.

> but I hope Dave can
> also make handle_write_count_underflow more robust, it's unfortunate
> if refcount errors elsewhere first show up as a hang there.

Actually, I think your s/while/if/ change is probably a decent fix.
Barring any other races, that loop should always have made progress on
mnt->__mnt_writers the way it is written.  If we get to:

> lock_and_coalesce_cpu_mnt_writer_counts();
->HERE
> mnt_unlock_cpus();

and don't have a positive mnt->__mnt_writers, we know something is going
badly.  We WARN_ON() there, which should at least give an earlier
warning that the system is not doing well.  But it doesn't fix the
inevitable.  Could you try the attached patch and see if it at least
warns you earlier?
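
The logic of the attached patch can be sketched in userspace as below. This
is only a model under stated assumptions: struct mnt_model, NR_CPUS, the
cpu_writers array and the warnings counter are hypothetical, locking is
omitted, and the limit of -(1 << 16) is an assumed value. It shows the
single coalescing pass replacing the loop, and the flag that keeps the
warning from repeating.

```c
#include <assert.h>
#include <stdio.h>

#define NR_CPUS				4		/* assumed CPU count */
#define MNT_WRITER_UNDERFLOW_LIMIT	(-(1 << 16))	/* assumed value */
#define MNT_IMBALANCED_WRITE_COUNT	0x200		/* debugging flag */

/* Hypothetical model of the writer accounting on one mount. */
struct mnt_model {
	int writers;			/* models mnt->__mnt_writers */
	int cpu_writers[NR_CPUS];	/* per-cpu counts not yet folded in */
	int flags;
	int warnings;			/* how many times we warned */
};

/* Fold every per-cpu count into the shared counter. */
static void coalesce_cpu_writer_counts(struct mnt_model *mnt)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		mnt->writers += mnt->cpu_writers[cpu];
		mnt->cpu_writers[cpu] = 0;
	}
}

/* One coalescing pass; warn once if the count is still negative. */
static void handle_write_count_underflow(struct mnt_model *mnt)
{
	assert(mnt != NULL);

	if (mnt->writers >= MNT_WRITER_UNDERFLOW_LIMIT)
		return;

	coalesce_cpu_writer_counts(mnt);

	if (mnt->writers < 0 && !(mnt->flags & MNT_IMBALANCED_WRITE_COUNT)) {
		fprintf(stderr, "leak detected, writers count: %d\n",
			mnt->writers);
		mnt->warnings++;
		/* use the flag to keep the log spam down */
		mnt->flags |= MNT_IMBALANCED_WRITE_COUNT;
	}
}
```

With balanced accounting the per-cpu counts bring the total back to zero or
above and nothing is printed. With a leaked drop the total stays negative,
and the old while loop would never exit.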

I have a decent guess what the bug is, too.  In the unionfs code:

> int init_lower_nd(struct nameidata *nd, unsigned int flags)
> {
> ...
> #ifdef ALLOC_LOWER_ND_FILE
> file = kzalloc(sizeof(struct file), GFP_KERNEL);
> if (unlikely(!file)) {
> err = -ENOMEM;
> break; /* exit switch statement and thus return */
> }
> nd->intent.open.file = file;
> #endif /* ALLOC_LOWER_ND_FILE */

The r/o bind mount code will mnt_drop_write() on that file's f_vfsmnt at
__fput() time.  Since that code never got a write on the mount, we'll
see an imbalance if the file was opened for a write.  I don't see this
file's mnt set anywhere, so I'm not completely sure that this is it.  In
any case, rolling your own 'struct file' without using alloc_file() and
friends is a no-no.

BTW, I have some "debugging" code in my latest set of patches that I
think should fix this kind of imbalance with the mnt->__mnt_writers().
It ensures that before we do that mnt_drop_write() at __fput() that we
absolutely did a mnt_want_write() at some point in the 'struct file's
life.  
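
A rough userspace sketch of that pairing idea, for illustration only. The
took_write marker, struct file_model, struct mount_model and both helpers
are hypothetical names; the actual debugging patches are not part of this
mail.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a mount's writer count. */
struct mount_model {
	int writers;
};

/* Hypothetical model of a file that remembers whether it took a write. */
struct file_model {
	struct mount_model *mnt;
	bool took_write;
};

/* Models taking write access on behalf of this file. */
static void file_want_write(struct file_model *file)
{
	assert(!file->took_write);
	file->mnt->writers++;
	file->took_write = true;
}

/*
 * Models the release path: drop the write only if this file took one.
 * Returns true when a write reference was actually dropped.
 */
static bool file_release(struct file_model *file)
{
	if (!file->took_write)
		return false;

	file->took_write = false;
	file->mnt->writers--;
	return true;
}
```

A hand-rolled file that never took a write then leaves the count untouched
at release time instead of driving it negative.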

-- Dave

 linux-2.6.git-dave/fs/namespace.c|   31 ++-
 linux-2.6.git-dave/include/linux/mount.h |1 +
 2 files changed, 23 insertions(+), 9 deletions(-)

diff -puN fs/namei.c~fix-naughty-loop fs/namei.c
diff -puN fs/namespace.c~fix-naughty-loop fs/namespace.c
--- linux-2.6.git/fs/namespace.c~fix-naughty-loop   2007-11-05 08:03:59.0 -0800
+++ linux-2.6.git-dave/fs/namespace.c   2007-11-05 08:35:06.0 -0800
@@ -225,16 +225,29 @@ static void lock_and_coalesce_cpu_mnt_wr
  */
 static void handle_write_count_underflow(struct vfsmount *mnt)
 {
-   while (atomic_read(&mnt->__mnt_writers) <
-   MNT_WRITER_UNDERFLOW_LIMIT) {
-   /*
-* It isn't necessary to hold all of the locks
-* at the same time, but doing it this way makes
-* us share a lot more code.
-*/
-   lock_and_coalesce_cpu_mnt_writer_counts();
-   mnt_unlock_cpus();
+   if (atomic_read(&mnt->__mnt_writers) >=
+   MNT_WRITER_UNDERFLOW_LIMIT)
+   return;
+   /*
+* It isn't necessary to hold all of the locks
+* at the same time, but doing it this way makes
+* us share a lot more code.
+*/
+   lock_and_coalesce_cpu_mnt_writer_counts();
+   /*
+* If coalescing the per-cpu writer counts did not
+* get us back to a positive writer count, we have
+* a bug.
+*/
+   if ((atomic_read(&mnt->__mnt_writers) < 0) &&
+   !(mnt->mnt_flags & MNT_IMBALANCED_WRITE_COUNT)) {
+   printk("leak detected on mount(%p) writers count: %d\n",
+   mnt, atomic_read(&mnt->__mnt_writers));
+   WARN_ON(1);
+   /* use the flag to keep the dmesg spam down */
+   mnt->mnt_flags |= MNT_IMBALANCED_WRITE_COUNT;
}
+   mnt_unlock_cpus();
 }
 
 /**
diff -puN include/linux/mount.h~fix-naughty-loop include/linux/mount.h
--- linux-2.6.git/include/linux/mount.h~fix-naughty-loop2007-11-05 08:22:21.0 -0800
+++ linux-2.6.git-dave/include/linux/mount.h2007-11-05 08:28:20.0 -0800
@@ -32,6 +32,7 @@ struct mnt_namespace;
 #define MNT_READONLY   0x40/* does the user want this to be r/o? */
 
 #define MNT_SHRINKABLE 0x100
+#define MNT_IMBALANCED_WRITE_COUNT 0x200 /* just for debugging */
 
 #define MN

Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland

2007-11-05 Thread Hugh Dickins
[Dave, I've Cc'ed you re handle_write_count_underflow, see below.]

On Wed, 31 Oct 2007, Erez Zadok wrote:
> 
> Hi Hugh, I've addressed all of your concerns and am happy to report that the
> newly revised unionfs_writepage works even better, including under my
> memory-pressure conditions.  To summarize my changes since the last time:
> 
> - I'm only masking __GFP_FS, not __GFP_IO
> - using find_or_create_page to avoid locking issues around mapping mask
> - handle for_reclaim case more efficiently
> - using copy_highpage so we handle KM_USER*
> - un/locking upper/lower page as/when needed
> - updated comments to clarify what/why
> - unionfs_sync_page: gone (yes, vfs.txt did confuse me, plus ecryptfs used
>   to have it)
> 
> Below is the newest version of unionfs_writepage.  Let me know what you
> think.
> 
> I have to say that with these changes, unionfs appears visibly faster under
> memory pressure.  I suspect the for_reclaim handling is probably the largest
> contributor to this speedup.

That's good news, and that unionfs_writepage looks good to me -
with three reservations I've not observed before.

One, I think you would be safer to do a set_page_dirty(lower_page)
before your clear_page_dirty_for_io(lower_page).  I know that sounds
silly, but see Linus' "Yes, Virginia" comment in clear_page_dirty_for_io:
there's a lot of subtlety hereabouts, and I think you'd be mimicking the
usual path closer if you set_page_dirty first - there's nothing else
doing it on that lower_page, is there?  I'm not certain that you need
to, but I think you'd do well to look into it and make up your own mind.

Two, I'm unsure of the way you're clearing or setting PageUptodate on
the upper page there.  The rules for PageUptodate are fairly obvious
when reading, but when a write fails, it's not so obvious.  Again, I'm
not saying what you've got is wrong (it may be unavoidable, to keep
synch between lower and upper), but it deserves a second thought.

Three, I believe you need to add a flush_dcache_page(lower_page)
after the copy_highpage(lower_page): some architectures will need
that to see the new data if they have lower_page mapped (though I
expect it's anyway shaky ground to be accessing through the lower
mount at the same time as modifying through the upper).

I've been trying this out on 2.6.23-mm1 with your 21 Oct 1-9/9
and your 2 Nov 1-8/8 patches applied (rejects being patches which
were already in 2.6.23-mm1).  I was hoping to reproduce the
BUG_ON(entry->val) that I fear from shmem_writepage(), before
fixing it; but not seen that at all yet - that might be good
news, but it's more likely I just haven't tried hard enough yet.

For now I'm doing repeated make -j20 kernel builds, pushing into
swap, in a unionfs mount of just a single dir on tmpfs.  This has
shown up several problems, two of which I've had to hack around to
get further.

The first: I very quickly hit "BUG: atomic counter underflow"
from -mm's i386 atomic_dec_and_test: from filp_close calling
unionfs_flush.  I did a little test fork()ing while holding a file
open on unionfs, and indeed it appears that your totalopens code is
broken, being unaware of how fork() bumps up a file count without
an open.  That's rather basic, I'm puzzled that this has remained
undiscovered until now - or perhaps it's just a recent addition.

It looked to me as if the totalopens count was about avoiding some
d_deleted processing in unionfs_flush, which actually should be left
until unionfs_release (and that your unionfs_flush ought to be calling
the lower level flush in all cases).  To get going, I've been running
with the quick hack patch below: but I've spent very little time
thinking about it, plus it's a long time since I've done VFS stuff;
so that patch may be nothing but an embarrassment that reflects
neither your intentions nor the VFS rules!  And it may itself be
responsible for the further problems I've seen.

The second problem was a hang: all cpus in handle_write_count_underflow
doing lock_and_coalesce_cpu_mnt_writer_counts: new -mm stuff from Dave
Hansen.  At first I thought that was a locking problem in Dave's code,
but I now suspect it's that your unionfs reference counting is wrong
somewhere, and the error accumulates until __mnt_writers drops below
MNT_WRITER_UNDERFLOW_LIMIT, but the coalescence does nothing to help
and we're stuck in that loop.  My even greater hack to solve that one
was to change Dave's "while" to "if"!  Then indeed tests can run for
some while.  As I say, my suspicion is that the actual error is within
unionfs (perhaps introduced by my first hack); but I hope Dave can
also make handle_write_count_underflow more robust, it's unfortunate
if refcount errors elsewhere first show up as a hang there.

I've had CONFIG_UNION_FS_DEBUG=y but will probably turn it off when
I come back to this, since it's rather noisy at present.  I've not
checked whether its reports are peculiar to having tmpfs below or not.
I get lots of "unionfs: new lower inode mtime" r

2008 Linux Storage and Filesystem Workshop

2007-11-05 Thread Chris Mason

Hello everyone,

The position statement submission system for the 2008 storage and
filesystem workshop is now online.  This is how you let us know you're
interested in attending and what topics are most important for
discussion.

For all the details, please see:

http://www.usenix.org/events/lsf08/

-chris


Re: migratepage failures on reiserfs

2007-11-05 Thread Chris Mason
On Mon, 5 Nov 2007 10:23:35 +
[EMAIL PROTECTED] (Mel Gorman) wrote:

> On (01/11/07 10:10), Badari Pulavarty didst pronounce:
>
> > > Hmpf, my first reply had a paragraph about the block device inode
> > > pages, I noticed the phrase file data pages and deleted it ;)
> > > 
> > > But, for the metadata buffers there's not much we can do.  They
> > > are included in a bunch of different lists and the patch would
> > > be non-trivial.
> > 
> > Unfortunately, these buffer pages are spread all around making
> > those sections of memory non-removable. Of course, one can use
> > ZONE_MOVABLE to make sure to guarantee the remove. But I am
> > hoping we could easily group all these allocations and minimize
> > spreading them around. Mel ?
> 
> The grow_dev_page() pages should be reclaimable even though migration
> is not supported for those pages? They were marked movable as it was
> useful for lumpy reclaim taking back pages for hugepage allocations
> and the like. Would it make sense for memory unremove to attempt
> migration first and reclaim second?
> 

In this case, reiserfs has the page pinned while it is doing journal
magic.  Not sure if ext3 has the same issues.

-chris


Re: [ANN] Squashfs 3.3 released

2007-11-05 Thread maximilian attems
On Mon, Nov 05, 2007 at 11:13:14AM +, Phillip Lougher wrote:
> I'm pleased to announce another release of Squashfs.  This is the 22nd
> release in just over five years.  Squashfs 3.3 has lots of nice improvements,
> both to the filesystem itself (bigger blocks and sparse files), but
> also to the Squashfs-tools Mksquashfs and Unsquashfs.
> 
> The next stage after this release is to fix the one remaining blocking issue
> (filesystem endianness), and then try to get Squashfs mainlined into the
> Linux kernel again.
> 

that would be very cool!
with my hat as debian kernel maintainer i'd be very relieved to see it
mainlined. i don't know of any major distro that doesn't ship it.

thanks

-- 
maks


[ANN] Squashfs 3.3 released

2007-11-05 Thread Phillip Lougher

Hi,

I'm pleased to announce another release of Squashfs.  This is the 22nd
release in just over five years.  Squashfs 3.3 has lots of nice improvements,
both to the filesystem itself (bigger blocks and sparse files), but
also to the Squashfs-tools Mksquashfs and Unsquashfs.

The next stage after this release is to fix the one remaining blocking issue
(filesystem endianness), and then try to get Squashfs mainlined into the
Linux kernel again.

The list of changes from the change-log are as follows:

1. Filesystem improvements:

   1.1. Maximum block size has been increased to 1 Mbyte, and the
        default block size has been increased to 128 Kbytes.
        This improves compression.

   1.2. Sparse files are now supported.  Sparse files are files
        which have large areas of unallocated data commonly called
        holes.  These files are now detected by Squashfs and stored
        more efficiently.  This improves compression and read
        performance for sparse files.

2. Mksquashfs improvements:

   2.1. Exclude files have been extended to use wildcard pattern
        matching and regular expressions.  Support has also been
        added for non-anchored excludes, which means it is
        now possible to specify excludes which match anywhere
        in the filesystem (i.e. leaf files), rather than always
        having to specify exclude files starting from the root
        directory (anchored excludes).

   2.2. Recovery files are now created when appending to existing
        Squashfs filesystems.  This allows the original filesystem
        to be recovered if Mksquashfs aborts unexpectedly
        (e.g. power failure).

3. Unsquashfs improvements:

   3.1. Multiple extract files can now be specified on the
        command line, and the files/directories to be extracted can
        now also be given in a file.

   3.2. Extract files have been extended to use wildcard pattern
        matching and regular expressions.

   3.3. Filename printing has been enhanced and Unsquashfs can
        now display filenames with file attributes
        ('ls -l' style output).

   3.4. A -stat option has been added which displays the filesystem
        superblock information.

   3.5. Unsquashfs now supports 1.x filesystems.

4. Miscellaneous improvements/bug fixes:

   4.1. Squashfs kernel code improved to use SetPageError in
        squashfs_readpage() if I/O error occurs.

   4.2. Fixed Squashfs kernel code bug preventing file
        seeking beyond 2GB.

   4.3. Mksquashfs now detects file size changes between
        first phase directory scan and second phase filesystem create.

Regards

Phillip Lougher


Is it illegal to refer namespace_sem while inode's mutex held?

2007-11-05 Thread Tetsuo Handa
Hello.

I'm running my LSM module on kernel 2.6.23 / Debian Sarge.
I encountered the following warning message.

It seems that calling down_read(&namespace_sem) is not permitted
while inode->i_mutex is held via mutex_lock(), but I'm not sure.
Is it illegal to take namespace_sem while an inode's mutex is held?



===
[ INFO: possible circular locking dependency detected ]
2.6.23-tomoyo2.1 #27
---
rcS/1093 is trying to acquire lock:
 (&namespace_sem){}, at: [] m_start+0x11/0x20

but task is already holding lock:
 (&inode->i_mutex){--..}, at: [] open_namei+0xf2/0x522

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> #1 (&inode->i_mutex){--..}:
   [] graft_tree+0x62/0xca
   [] check_prev_add+0xc4/0x1bc
   [] graft_tree+0x62/0xca
   [] check_prevs_add+0x56/0xcb
   [] validate_chain+0x2a2/0x31f
   [] __kernel_text_address+0x18/0x23
   [] dump_trace+0x6f/0x87
   [] __lock_acquire+0x6f2/0x762
   [] validate_chain+0x275/0x31f
   [] lock_acquire+0x79/0x93
   [] graft_tree+0x62/0xca
   [] __mutex_lock_slowpath+0xea/0x280
   [] graft_tree+0x62/0xca
   [] graft_tree+0x62/0xca
   [] do_add_mount+0x8a/0xe7
   [] do_mount+0x1a9/0x1c0
   [] __alloc_pages+0x64/0x2b6
   [] copy_mount_options+0x4d/0x97
   [] sys_mount+0x79/0xb5
   [] name_to_dev_t+0x4d/0x25d
   [] schedule_timeout+0x79/0x8d
   [] create_proc_entry+0x73/0x86
   [] process_timeout+0x0/0x5
   [] kernel_init+0x0/0xa3
   [] prepare_namespace+0x86/0x18e
   [] sys_access+0x1f/0x23
   [] kernel_init+0x99/0xa3
   [] kernel_thread_helper+0x7/0x10
   [] 0x

-> #0 (&namespace_sem){}:
   [] check_prev_add+0x27/0x1bc
   [] check_prevs_add+0x56/0xcb
   [] validate_chain+0x2a2/0x31f
   [] __lock_acquire+0x6f2/0x762
   [] __d_lookup+0xda/0xfa
   [] lock_acquire+0x79/0x93
   [] m_start+0x11/0x20
   [] down_read+0x3b/0x71
   [] m_start+0x11/0x20
   [] m_start+0x11/0x20
   [] tmy_do_single_write_perm+0x7e/0xda
   [] vfs_create+0x83/0x105
   [] open_namei_create+0x47/0x8a
   [] open_namei+0x15c/0x522
   [] do_filp_open+0x25/0x39
   [] _spin_unlock+0x14/0x1c
   [] get_unused_fd_flags+0xb0/0xba
   [] do_sys_open+0x44/0xc5
   [] sys_open+0x1a/0x1c
   [] syscall_call+0x7/0xb
   [] 0x

other info that might help us debug this:

1 lock held by rcS/1093:
 #0:  (&inode->i_mutex){--..}, at: [] open_namei+0xf2/0x522

stack backtrace:
 [] print_circular_bug_tail+0x5f/0x67
 [] check_prev_add+0x27/0x1bc
 [] check_prevs_add+0x56/0xcb
 [] validate_chain+0x2a2/0x31f
 [] __lock_acquire+0x6f2/0x762
 [] __d_lookup+0xda/0xfa
 [] lock_acquire+0x79/0x93
 [] m_start+0x11/0x20
 [] down_read+0x3b/0x71
 [] m_start+0x11/0x20
 [] m_start+0x11/0x20
 [] tmy_do_single_write_perm+0x7e/0xda
 [] vfs_create+0x83/0x105
 [] open_namei_create+0x47/0x8a
 [] open_namei+0x15c/0x522
 [] do_filp_open+0x25/0x39
 [] _spin_unlock+0x14/0x1c
 [] get_unused_fd_flags+0xb0/0xba
 [] do_sys_open+0x44/0xc5
 [] sys_open+0x1a/0x1c
 [] syscall_call+0x7/0xb
 ===



The location is tmy_do_single_write_perm()
(whose call trace is open_namei() -> open_namei_create() -> security_inode_create())
in the following file:
http://svn.sourceforge.jp/cgi-bin/viewcvs.cgi/trunk/2.1.x/tomoyo-lsm/patches/tomoyo-hooks.diff?rev=653&root=tomoyo&view=markup

Regards.



Re: migratepage failures on reiserfs

2007-11-05 Thread Mel Gorman
On (01/11/07 10:10), Badari Pulavarty didst pronounce:
> On Thu, 2007-11-01 at 11:51 -0400, Chris Mason wrote:
> > On Thu, 01 Nov 2007 08:38:57 -0800
> > Badari Pulavarty <[EMAIL PROTECTED]> wrote:
> > 
> > > On Wed, 2007-10-31 at 13:40 -0400, Chris Mason wrote:
> > > > On Wed, 31 Oct 2007 08:14:21 -0800
> > > > Badari Pulavarty <[EMAIL PROTECTED]> wrote:
> > > > > 
> > > > > I tried data=writeback mode and it didn't help :(
> > > > 
> > > > Ouch, so much for the easy way out.
> > > > 
> > > > > 
> > > > > unable to release the page 262070
> > > > > bh c000211b9408 flags 110029 count 1 private 0
> > > > > unable to release the page 262098
> > > > > bh c00020ec9198 flags 110029 count 1 private 0
> > > > > memory offlining 3f000 to 4 failed
> > > > > 
> > > > 
> > > > The only other special thing reiserfs does with the page cache is
> > > > file tails.  I don't suppose all of these pages are index zero in
> > > > files smaller than 4k?
> > > 
> > > Ah !! I am so blind :(
> > > 
> > > I have been suspecting reiserfs all along, since it's executing
> > > fallback_migrate_page(). Actually, these buffer heads are
> > > backing blockdev. I guess these are metadata buffers :( 
> > > I am not sure we can do much with these..
> > 
> > Hmpf, my first reply had a paragraph about the block device inode
> > pages, I noticed the phrase file data pages and deleted it ;)
> > 
> > But, for the metadata buffers there's not much we can do.  They are
> > included in a bunch of different lists and the patch would
> > be non-trivial.
> 
> Unfortunately, these buffer pages are spread all around making
> those sections of memory non-removable. Of course, one can use
> ZONE_MOVABLE to make sure to guarantee the remove. But I am
> hoping we could easily group all these allocations and minimize
> spreading them around. Mel ?

The grow_dev_page() pages should be reclaimable even though migration
is not supported for those pages? They were marked movable as it was
useful for lumpy reclaim taking back pages for hugepage allocations and
the like. Would it make sense for memory unremove to attempt migration
first and reclaim second?

-- 
Mel Gorman
Part-time Phd Student  Linux Technology Center
University of Limerick IBM Dublin Software Lab